SPICE: exploration and analysis of post-cytometric complex multivariate datasets.
Roederer, Mario; Nozzi, Joshua L; Nason, Martha C
2011-02-01
Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.
Application of multivariate statistical techniques in microbial ecology
Paliy, O.; Shankar, V.
2016-01-01
Recent advances in high-throughput methods of molecular analysis have led to an explosion of studies generating large-scale ecological datasets. The effect has been especially noticeable in the field of microbial ecology, where new experimental approaches have provided in-depth assessments of the composition, functions, and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces large amounts of data, powerful statistical techniques of multivariate analysis are well suited to analyzing and interpreting these datasets. Many different multivariate techniques are available, and it is often unclear which method should be applied to a particular dataset. In this review we describe and compare the most widely used multivariate statistical techniques, including exploratory, interpretive, and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and dataset structure. PMID:26786791
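To make the exploratory step concrete, here is a minimal sketch of a typical first analysis in this setting: a PCA ordination of Hellinger-transformed abundance data. The synthetic counts, sample sizes and transform choice are illustrative assumptions, not taken from the review.

```python
# Minimal exploratory ordination sketch: PCA on Hellinger-transformed
# abundance data, a common first step in microbial community analysis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(20, 50)).astype(float)  # 20 samples x 50 taxa (synthetic)

rel = counts / counts.sum(axis=1, keepdims=True)      # relative abundances
hellinger = np.sqrt(rel)                              # Hellinger transform

pca = PCA(n_components=2)
scores = pca.fit_transform(hellinger)
print("explained variance:", pca.explained_variance_ratio_)
```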
Multivariate spatiotemporal visualizations for mobile devices in Flyover Country
NASA Astrophysics Data System (ADS)
Loeffler, S.; Thorn, R.; Myrbo, A.; Roth, R.; Goring, S. J.; Williams, J.
2017-12-01
Visualizing and interacting with complex multivariate and spatiotemporal datasets on mobile devices is challenging due to their smaller screens, reduced processing power, and limited data connectivity. Pollen data require visualizing pollen assemblages spatially, temporally, and across multiple taxa to understand plant community dynamics through time. Drawing from cartography, information visualization, and paleoecology, we have created new mobile-first visualization techniques that represent multiple taxa across many sites and enable user interaction. Using pollen datasets from the Neotoma Paleoecology Database as a case study, the visualization techniques allow ecological patterns and trends to be quickly understood on a mobile device compared to traditional pollen diagrams and maps. This flexible visualization system can be used for datasets beyond pollen, with the only requirements being point-based localities and multiple variables changing through time or depth.
ERIC Educational Resources Information Center
Polanin, Joshua R.; Wilson, Sandra Jo
2014-01-01
The purpose of this project is to demonstrate the practical methods developed to utilize a dataset consisting of both multivariate and multilevel effect size data. The context for this project is a large-scale meta-analytic review of the predictors of academic achievement. This project is guided by three primary research questions: (1) How do we…
A global × global test for testing associations between two large sets of variables.
Chaturvedi, Nimisha; de Menezes, Renée X; Goeman, Jelle J
2017-01-01
In high-dimensional omics studies where multiple molecular profiles are obtained for each set of patients, there is often interest in identifying complex multivariate associations, for example, copy number regulated expression levels in a certain pathway or in a genomic region. To detect such associations, we present a novel approach to test for association between two sets of variables. Our approach generalizes the global test, which tests for association between a group of covariates and a single univariate response, to allow a high-dimensional multivariate response. We apply the method to several simulated datasets as well as two publicly available datasets, where we compare the performance of the multivariate global test (G2) with the univariate global test. The method is implemented in R and will be available as part of the globaltest package. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
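For intuition, the following is a hedged sketch of a permutation test for association between two variable blocks. It is in the spirit of, but not identical to, the G2 test described above: the statistic used here (the sum of squared cross-covariances) and the synthetic data are assumptions for illustration.

```python
# Permutation test for association between two high-dimensional blocks.
# Test statistic: sum of squared cross-covariances between the blocks.
import numpy as np

def cross_assoc(X, Y):
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    return np.sum((Xc.T @ Yc) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 100))                                        # e.g. copy number features
Y = X[:, :5] @ rng.normal(size=(5, 30)) + rng.normal(size=(40, 30))   # e.g. expression levels

obs = cross_assoc(X, Y)
perm = [cross_assoc(X, Y[rng.permutation(40)]) for _ in range(999)]
p = (1 + sum(s >= obs for s in perm)) / 1000
print("permutation p-value:", p)
```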
Exploratory analysis of TOF-SIMS data from biological surfaces
NASA Astrophysics Data System (ADS)
Vaidyanathan, Seetharaman; Fletcher, John S.; Henderson, Alex; Lockyer, Nicholas P.; Vickerman, John C.
2008-12-01
The application of multivariate analytical tools enables simplification of TOF-SIMS datasets so that useful information can be extracted from complex spectra and images, especially those that do not give readily interpretable results. There is, however, a challenge in understanding the outputs of such analyses. The problem is complicated when analysing images, given the additional dimensions in the dataset. Here we demonstrate how the application of simple pre-processing routines can enable the interpretation of TOF-SIMS spectra and images. For the spectral data, TOF-SIMS spectra used to discriminate bacterial isolates associated with urinary tract infection were studied. Using different criteria for picking peaks before carrying out PC-DFA enabled identification of the discriminatory information with greater certainty. For the image data, an air-dried, salt-stressed bacterial sample, discussed in another paper by us in this issue, was studied. Exploration of the image datasets with and without normalisation prior to multivariate analysis by PCA or MAF resulted in different regions of the image being highlighted by the techniques.
Farseer-NMR: automatic treatment, analysis and plotting of large, multi-variable NMR data.
Teixeira, João M C; Skinner, Simon P; Arbesú, Miguel; Breeze, Alexander L; Pons, Miquel
2018-05-11
We present Farseer-NMR ( https://git.io/vAueU ), a software package to treat, evaluate and combine NMR spectroscopic data from sets of protein-derived peaklists covering a range of experimental conditions. The combined advances in NMR and molecular biology enable the study of complex biomolecular systems such as flexible proteins or large multibody complexes, which display a strong and functionally relevant response to their environmental conditions, e.g. the presence of ligands, site-directed mutations, post-translational modifications, molecular crowders or the chemical composition of the solution. These advances have created a growing need to analyse those systems' responses to multiple variables. The combined analysis of NMR peaklists from large and multivariable datasets has become a new bottleneck in the NMR analysis pipeline, whereby information-rich NMR-derived parameters have to be manually generated, which can be tedious, repetitive and prone to human error, or even unfeasible for very large datasets. There is a persistent gap in the development and distribution of software focused on peaklist treatment, analysis and representation, and specifically able to handle large multivariable datasets, which are becoming more commonplace. In this regard, Farseer-NMR aims to close this longstanding gap in the automated NMR user pipeline and, altogether, reduce the time burden of analysis of large sets of peaklists from days/weeks to seconds/minutes. We have implemented some of the most common, as well as new, routines for calculation of NMR parameters and several publication-quality plotting templates to improve NMR data representation. Farseer-NMR has been written entirely in Python and its modular code base enables facile extension.
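As an example of one routine such a package automates, the sketch below computes combined 1H/15N chemical shift perturbations (CSPs) between two peaklists. The peak values are synthetic, and the 0.14 nitrogen scaling factor is a common literature convention assumed here, not a value taken from the Farseer-NMR paper.

```python
# Combined 1H/15N chemical shift perturbation (CSP) between a reference
# and a perturbed peaklist, residue by residue. Synthetic shift values.
import numpy as np

ref  = {"G10": (8.21, 119.4), "A11": (7.95, 123.1)}   # residue: (dH, dN) in ppm
pert = {"G10": (8.25, 119.9), "A11": (7.96, 123.2)}

def csp(dh, dn, alpha=0.14):                          # alpha: assumed 15N scaling
    return np.sqrt(dh**2 + (alpha * dn)**2)

for res in ref:
    dh = pert[res][0] - ref[res][0]
    dn = pert[res][1] - ref[res][1]
    print(res, round(csp(dh, dn), 3))
```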
DigOut: viewing differential expression genes as outliers.
Yu, Hui; Tu, Kang; Xie, Lu; Li, Yuan-Yuan
2010-12-01
With regard to well-replicated two-conditional microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-conditional microarray datasets with limited or no replication the same task has not been properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-conditional microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-conditional expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm such as DigOut is particularly useful for selecting DE genes from non-replicated multi-conditional gene expression datasets.
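The following is a minimal sketch of the general idea: multivariate outlier screening of genes across conditions via Mahalanobis distance with a chi-square cutoff. It illustrates the principle rather than the DigOut algorithm itself, and the data are synthetic.

```python
# Flag genes whose expression profile across conditions is a multivariate
# outlier (large Mahalanobis distance) as candidate DE genes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
expr = rng.normal(size=(1000, 6))          # 1000 genes x 6 conditions (synthetic)
expr[:10] += 4                             # spike in 10 "DE" genes

mu = expr.mean(0)
inv = np.linalg.inv(np.cov(expr, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', expr - mu, inv, expr - mu)   # squared Mahalanobis distance

cutoff = stats.chi2.ppf(0.999, df=expr.shape[1])
print("flagged genes:", np.where(d2 > cutoff)[0][:20])
```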
Three visualization approaches for communicating and exploring PIT tag data
Letcher, Benjamin; Walker, Jeffrey D.; O'Donnell, Matthew; Whiteley, Andrew R.; Nislow, Keith; Coombs, Jason
2018-01-01
As the number, size and complexity of ecological datasets has increased, narrative and interactive raw data visualizations have emerged as important tools for exploring and understanding these large datasets. As a demonstration, we developed three visualizations to communicate and explore passive integrated transponder tag data from two long-term field studies. We created three independent visualizations for the same dataset, allowing separate entry points for users with different goals and experience levels. The first visualization uses a narrative approach to introduce users to the study. The second visualization provides interactive cross-filters that allow users to explore multi-variate relationships in the dataset. The last visualization allows users to visualize the movement histories of individual fish within the stream network. This suite of visualization tools allows a progressive discovery of more detailed information and should make the data accessible to users with a wide variety of backgrounds and interests.
NASA Astrophysics Data System (ADS)
Gourdol, L.; Hissler, C.; Pfister, L.
2012-04-01
The Luxembourg Sandstone aquifer is of major relevance for the national supply of drinking water in Luxembourg. The city of Luxembourg (20% of the country's population) draws almost two thirds of its drinking water from this aquifer. As a consequence, the study of the groundwater hydrochemistry, as well as of its spatial and temporal variations, is considered of highest priority. Since 2005, a monitoring network has been operated by the Water Department of Luxembourg City with a view to a more sustainable management of this strategic water resource. The data collected to date form a large and complex dataset describing spatial and temporal variations of many hydrochemical parameters. Data treatment is a central issue for this kind of water monitoring programme and its complex databases. Standard multivariate statistical techniques, such as principal component analysis and hierarchical cluster analysis, have been widely used as unbiased methods for extracting meaningful information from groundwater quality data and are now classically applied in many hydrogeological studies, in particular to characterize temporal or spatial hydrochemical variations induced by natural and anthropogenic factors. These classical multivariate methods, however, deal with two-way matrices, usually parameters/sites or parameters/time, whereas the dataset resulting from a water quality monitoring programme should often be seen as a parameters/sites/time datacube. Such three-way structures are difficult to handle and analyse with classical multivariate statistical tools and should instead be treated with approaches designed for three-way data. One possible approach is partial triadic analysis (PTA). PTA has previously been used with success in many ecological studies, but never to date in hydrogeology. Applied to the dataset of the Luxembourg Sandstone aquifer, PTA appears to be a promising new statistical instrument for hydrogeologists, in particular for characterizing temporal and spatial hydrochemical variations induced by natural and anthropogenic factors. This new approach to groundwater management offers potential for (1) identifying a common multivariate spatial structure, (2) uncovering the different hydrochemical patterns and explaining their controlling factors, and (3) analysing the temporal variability of this structure and grasping hydrochemical changes.
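A schematic sketch of the PTA idea on such a datacube follows: weight the per-date tables by their mutual similarity (RV coefficients), build a compromise matrix, and inspect its principal structure. This is an illustrative reduction of the STATIS/PTA machinery, applied to synthetic data.

```python
# Partial triadic analysis sketch on a dates x sites x parameters cube:
# RV-coefficient similarity between time slices -> slice weights ->
# weighted "compromise" matrix -> PCA of the compromise.
import numpy as np

rng = np.random.default_rng(3)
cube = rng.normal(size=(12, 8, 20))        # 12 dates x 8 sites x 20 parameters

slices = [s - s.mean(0) for s in cube]     # center each date's sites x params table

def rv(a, b):
    wa, wb = a @ a.T, b @ b.T
    return np.trace(wa @ wb) / np.sqrt(np.trace(wa @ wa) * np.trace(wb @ wb))

S = np.array([[rv(a, b) for b in slices] for a in slices])
w = np.linalg.eigh(S)[1][:, -1]            # leading eigenvector -> slice weights
w = np.abs(w) / np.abs(w).sum()
compromise = sum(wi * s for wi, s in zip(w, slices))

vals = np.linalg.eigh(compromise.T @ compromise)[0]
print("compromise PCA eigenvalues:", vals[::-1][:3].round(2))
```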
Calypso: a user-friendly web-server for mining and visualizing microbiome-environment interactions.
Zakrzewski, Martha; Proietti, Carla; Ellis, Jonathan J; Hasan, Shihab; Brion, Marie-Jo; Berger, Bernard; Krause, Lutz
2017-03-01
Calypso is an easy-to-use online software suite that allows non-expert users to mine, interpret and compare taxonomic information from metagenomic or 16S rDNA datasets. Calypso has a focus on multivariate statistical approaches that can identify complex environment-microbiome associations. The software enables quantitative visualizations, statistical testing, multivariate analysis, supervised learning, factor analysis, multivariable regression, network analysis and diversity estimates. Comprehensive help pages, tutorials and videos are provided via a wiki page. The web-interface is accessible via http://cgenome.net/calypso/ . The software is programmed in Java, PERL and R and the source code is available from Zenodo ( https://zenodo.org/record/50931 ). The software is freely available for non-commercial users. l.krause@uq.edu.au. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Jaffa, Miran A; Gebregziabher, Mulugeta; Jaffa, Ayad A
2015-06-14
Renal transplant patients are mandated to have continuous assessment of their kidney function over time to monitor disease progression, determined by changes in blood urea nitrogen (BUN), serum creatinine (Cr), and estimated glomerular filtration rate (eGFR). Multivariate analysis of these outcomes that aims at identifying the differential factors affecting disease progression is of great clinical significance. Our study therefore aims at demonstrating the application of different joint modeling approaches with random coefficients on a cohort of renal transplant patients and presenting a comparison of their performance through a pseudo-simulation study. The objective of this comparison is to identify the model with the best performance and to determine whether accuracy compensates for complexity in the different multivariate joint models. We propose a novel application of multivariate Generalized Linear Mixed Models (mGLMM) to analyze multiple longitudinal kidney function outcomes collected over 3 years on a cohort of 110 renal transplantation patients. The correlated outcomes BUN, Cr, and eGFR and the effects of various covariates, such as patient gender, age and race, on these markers were assessed holistically using different mGLMMs. The performance of the various mGLMMs, which encompass shared random intercept (SHRI), shared random intercept and slope (SHRIS), separate random intercept (SPRI) and separate random intercept and slope (SPRIS) specifications, was assessed to identify the one with the best fit and most accurate estimates. A bootstrap pseudo-simulation study was conducted to gauge the tradeoff between the complexity and accuracy of the models. Accuracy was determined using two measures: the mean of the differences between the estimates of the bootstrapped datasets and the true beta obtained from the application of each model to the renal dataset, and the mean of the squares of these differences. The results showed that SPRI provided the most accurate estimates and did not exhibit any computational or convergence problems. Higher accuracy was demonstrated when the level of complexity increased from shared random coefficient models to the separate random coefficient alternatives, with SPRI showing the best fit and most accurate estimates.
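As a rough illustration of the shared-random-intercept (SHRI-like) formulation, the sketch below stacks the three outcomes in long format and fits a linear mixed model with a per-patient random intercept. The data are synthetic and statsmodels' MixedLM stands in for the mGLMM machinery actually used in the study.

```python
# Shared-random-intercept sketch: outcomes stacked long with an outcome
# indicator; one random intercept per patient is shared across outcomes.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(14)
rows = []
for pid in range(110):
    b = rng.normal(0, 1)                       # patient-level random intercept
    age = rng.uniform(25, 70)
    for t in range(6):                         # six visits over 3 years
        for outcome in ("BUN", "Cr", "eGFR"):  # here on a common synthetic scale
            rows.append({"patient": pid, "time": t, "age": age,
                         "outcome": outcome,
                         "value": b + 0.1 * t + 0.02 * age + rng.normal(0, 0.5)})
df = pd.DataFrame(rows)

model = smf.mixedlm("value ~ outcome + time + age", df, groups=df["patient"])
print(model.fit().summary())
```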
Kim, Sungduk; Chen, Ming-Hui; Ibrahim, Joseph G; Shah, Arvind K; Lin, Jianxin
2013-10-15
In this paper, we propose a class of Box-Cox transformation regression models with multidimensional random effects for analyzing multivariate responses for individual patient data in meta-analysis. Our modeling formulation uses a multivariate normal response meta-analysis model with multivariate random effects, in which each response is allowed to have its own Box-Cox transformation. Prior distributions are specified for the Box-Cox transformation parameters as well as the regression coefficients in this complex model, and the deviance information criterion is used to select the best transformation model. Because the model is quite complex, we develop a novel Monte Carlo Markov chain sampling scheme to sample from the joint posterior of the parameters. This model is motivated by a very rich dataset comprising 26 clinical trials involving cholesterol-lowering drugs where the goal is to jointly model the three-dimensional response consisting of low density lipoprotein cholesterol (LDL-C), high density lipoprotein cholesterol (HDL-C), and triglycerides (TG) (LDL-C, HDL-C, TG). Because the joint distribution of (LDL-C, HDL-C, TG) is not multivariate normal and in fact quite skewed, a Box-Cox transformation is needed to achieve normality. In the clinical literature, these three variables are usually analyzed univariately; however, a multivariate approach would be more appropriate because these variables are correlated with each other. We carry out a detailed analysis of these data by using the proposed methodology. Copyright © 2013 John Wiley & Sons, Ltd.
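For reference, the Box-Cox family applied per response is the standard one:

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[4pt]
\log y, & \lambda = 0,
\end{cases}
```

with each response (LDL-C, HDL-C, TG) allowed its own transformation parameter λ, selected via the priors and DIC-based model comparison described above.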
Sepehrband, Farshid; Lynch, Kirsten M; Cabeen, Ryan P; Gonzalez-Zacarias, Clio; Zhao, Lu; D'Arcy, Mike; Kesselman, Carl; Herting, Megan M; Dinov, Ivo D; Toga, Arthur W; Clark, Kristi A
2018-05-15
Exploring neuroanatomical sex differences using a multivariate statistical learning approach can yield insights that cannot be derived with univariate analysis. While gross differences in total brain volume are well-established, uncovering the more subtle, regional sex-related differences in neuroanatomy requires a multivariate approach that can accurately model spatial complexity as well as the interactions between neuroanatomical features. Here, we developed a multivariate statistical learning model using a support vector machine (SVM) classifier to predict sex from MRI-derived regional neuroanatomical features from a single-site study of 967 healthy youth from the Philadelphia Neurodevelopmental Cohort (PNC). Then, we validated the multivariate model on an independent dataset of 682 healthy youth from the multi-site Pediatric Imaging, Neurocognition and Genetics (PING) cohort study. The trained model exhibited an 83% cross-validated prediction accuracy, and correctly predicted the sex of 77% of the subjects from the independent multi-site dataset. Results showed that cortical thickness of the middle occipital lobes and the angular gyri are major predictors of sex. Results also demonstrated the inferential benefits of going beyond classical regression approaches to capture the interactions among brain features in order to better characterize sex differences in male and female youths. We also identified specific cortical morphological measures and parcellation techniques, such as cortical thickness as derived from the Destrieux atlas, that are better able to discriminate between males and females in comparison to other brain atlases (Desikan-Killiany, Brodmann and subcortical atlases). Copyright © 2018 Elsevier Inc. All rights reserved.
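A minimal sketch of this modeling setup, an SVM classifier with cross-validated accuracy on standardized features, is shown below; the data are synthetic stand-ins, not the PNC/PING features.

```python
# SVM classifier predicting a binary label (e.g. sex) from regional
# neuroanatomical features, with 5-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 60))             # 200 subjects x 60 regional features
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 0).astype(int)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```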
Baez-Cazull, S. E.; McGuire, J.T.; Cozzarelli, I.M.; Voytek, M.A.
2008-01-01
Determining the processes governing aqueous biogeochemistry in a wetland hydrologically linked to an underlying contaminated aquifer is challenging due to the complex exchange between the systems and their distinct responses to changes in precipitation, recharge, and biological activities. To evaluate temporal and spatial processes in the wetland-aquifer system, water samples were collected using cm-scale multichambered passive diffusion samplers (peepers) to span the wetland-aquifer interface over a period of 3 yr. Samples were analyzed for major cations and anions, methane, and a suite of organic acids resulting in a large dataset of over 8000 points, which was evaluated using multivariate statistics. Principal component analysis (PCA) was chosen with the purpose of exploring the sources of variation in the dataset to expose related variables and provide insight into the biogeochemical processes that control the water chemistry of the system. Factor scores computed from PCA were mapped by date and depth. Patterns observed suggest that (i) fermentation is the process controlling the greatest variability in the dataset and it peaks in May; (ii) iron and sulfate reduction were the dominant terminal electron-accepting processes in the system and were associated with fermentation but had more complex seasonal variability than fermentation; (iii) methanogenesis was also important and associated with bacterial utilization of minerals as a source of electron acceptors (e.g., barite BaSO4); and (iv) seasonal hydrological patterns (wet and dry periods) control the availability of electron acceptors through the reoxidation of reduced iron-sulfur species enhancing iron and sulfate reduction. Copyright © 2008 by the American Society of Agronomy, Crop Science Society of America, and Soil Science Society of America. All rights reserved.
Rosa, Maria J; Mehta, Mitul A; Pich, Emilio M; Risterucci, Celine; Zelaya, Fernando; Reinders, Antje A T S; Williams, Steve C R; Dazzan, Paola; Doyle, Orla M; Marquand, Andre F
2015-01-01
An increasing number of neuroimaging studies are based on either combining more than one data modality (inter-modal) or combining more than one measurement from the same modality (intra-modal). To date, most intra-modal studies using multivariate statistics have focused on differences between datasets, for instance relying on classifiers to differentiate between effects in the data. However, to fully characterize these effects, multivariate methods able to measure similarities between datasets are needed. One classical technique for estimating the relationship between two datasets is canonical correlation analysis (CCA). However, in the context of high-dimensional data the application of CCA is extremely challenging. A recent extension of CCA, sparse CCA (SCCA), overcomes this limitation, by regularizing the model parameters while yielding a sparse solution. In this work, we modify SCCA with the aim of facilitating its application to high-dimensional neuroimaging data and finding meaningful multivariate image-to-image correspondences in intra-modal studies. In particular, we show how the optimal subset of variables can be estimated independently and we look at the information encoded in more than one set of SCCA transformations. We illustrate our framework using Arterial Spin Labeling data to investigate multivariate similarities between the effects of two antipsychotic drugs on cerebral blood flow.
Liu, Ya-Juan; André, Silvère; Saint Cristau, Lydia; Lagresle, Sylvain; Hannas, Zahia; Calvosa, Éric; Devos, Olivier; Duponchel, Ludovic
2017-02-01
Multivariate statistical process control (MSPC) is increasingly popular, given the challenge posed by large multivariate datasets from analytical instruments such as Raman spectroscopy used to monitor complex cell cultures in the biopharmaceutical industry. However, Raman spectroscopy for in-line monitoring often produces unsynchronized datasets, resulting in time-varying batches. Moreover, unsynchronized datasets are common in cell culture monitoring because spectroscopic measurements are generally recorded in an alternating fashion, with more than one optical probe connected in parallel to the same spectrometer. Synchronized batches are a prerequisite for the application of multivariate analyses such as multi-way principal component analysis (MPCA) for MSPC monitoring. Correlation optimized warping (COW) is a popular method for data alignment with satisfactory performance; however, it had never before been applied to synchronize the acquisition times of spectroscopic datasets in an MSPC application. In this paper we propose, for the first time, to use COW to synchronize batches of varying duration analyzed with Raman spectroscopy. In a second step, we developed MPCA models at different time intervals based on the normal operation condition (NOC) batches synchronized by COW. New batches are finally projected onto the corresponding MPCA model. We monitored the evolution of the batches using two multivariate control charts based on Hotelling's T² and Q. As illustrated by the results, the MSPC model was able to identify abnormal operating conditions, including contaminated batches, which is of prime importance in cell culture monitoring. We showed that Raman-based MSPC monitoring can be used to diagnose batches deviating from the normal condition with higher efficacy than traditional diagnosis, which would save time and money in the biopharmaceutical industry. Copyright © 2016 Elsevier B.V. All rights reserved.
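The two control statistics at the heart of this scheme can be sketched compactly: Hotelling's T² in the PCA score space and the Q (squared prediction error) statistic in the residual space, each computed for new batches against a model of normal operating data. The sketch below omits the COW synchronization and multi-way unfolding steps and uses synthetic data.

```python
# Hotelling's T2 and Q (SPE) statistics for new observations against a
# PCA model fitted on normal-operation data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
noc = rng.normal(size=(100, 30))                       # normal-operation "spectra"
new = np.vstack([rng.normal(size=(5, 30)),
                 rng.normal(3, 1, size=(2, 30))])      # last two rows deviate

pca = PCA(n_components=3).fit(noc)
t = pca.transform(new)
t2 = (t**2 / pca.explained_variance_).sum(1)           # Hotelling's T2 in score space
resid = new - pca.inverse_transform(t)
q = (resid**2).sum(1)                                  # Q / squared prediction error
print("T2:", t2.round(2), "\nQ:", q.round(2))
```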
A Network-Based Algorithm for Clustering Multivariate Repeated Measures Data
NASA Technical Reports Server (NTRS)
Koslovsky, Matthew; Arellano, John; Schaefer, Caroline; Feiveson, Alan; Young, Millennia; Lee, Stuart
2017-01-01
The National Aeronautics and Space Administration (NASA) Astronaut Corps is a unique occupational cohort for which vast amounts of measures data have been collected repeatedly in research or operational studies pre-, in-, and post-flight, as well as during multiple clinical care visits. In exploratory analyses aimed at generating hypotheses regarding physiological changes associated with spaceflight exposure, such as impaired vision, it is of interest to identify anomalies and trends across these expansive datasets. Multivariate clustering algorithms for repeated measures data may help parse the data to identify homogeneous groups of astronauts that have higher risks for a particular physiological change. However, available clustering methods may not be able to accommodate the complex data structures found in NASA data, since the methods often rely on strict model assumptions, require equally-spaced and balanced assessment times, cannot accommodate missing data or differing time scales across variables, and cannot process continuous and discrete data simultaneously. To fill this gap, we propose a network-based, multivariate clustering algorithm for repeated measures data that can be tailored to fit various research settings. Using simulated data, we demonstrate how our method can be used to identify patterns in complex data structures found in practice.
Igloo-Plot: a tool for visualization of multidimensional datasets.
Kuntal, Bhusan K; Ghosh, Tarini Shankar; Mande, Sharmila S
2014-01-01
Advances in science and technology have resulted in an exponential growth of multivariate (or multi-dimensional) datasets which are being generated from various research areas especially in the domain of biological sciences. Visualization and analysis of such data (with the objective of uncovering the hidden patterns therein) is an important and challenging task. We present a tool, called Igloo-Plot, for efficient visualization of multidimensional datasets. The tool addresses some of the key limitations of contemporary multivariate visualization and analysis tools. The visualization layout, not only facilitates an easy identification of clusters of data-points having similar feature compositions, but also the 'marker features' specific to each of these clusters. The applicability of the various functionalities implemented herein is demonstrated using several well studied multi-dimensional datasets. Igloo-Plot is expected to be a valuable resource for researchers working in multivariate data mining studies. Igloo-Plot is available for download from: http://metagenomics.atc.tcs.com/IglooPlot/. Copyright © 2014 Elsevier Inc. All rights reserved.
Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions
Li, Haoran; Xiong, Li; Jiang, Xiaoqian
2014-01-01
Differential privacy has recently emerged in private statistical data release as one of the strongest privacy guarantees. Most of the existing techniques that generate differentially private histograms or synthetic data only work well for single dimensional or low-dimensional histograms. They become problematic for high dimensional and large domain data due to increased perturbation error and computation complexity. In this paper, we propose DPCopula, a differentially private data synthesization technique using Copula functions for multi-dimensional data. The core of our method is to compute a differentially private copula function from which we can sample synthetic data. Copula functions are used to describe the dependence between multivariate random vectors and allow us to build the multivariate joint distribution using one-dimensional marginal distributions. We present two methods for estimating the parameters of the copula functions with differential privacy: maximum likelihood estimation and Kendall’s τ estimation. We present formal proofs for the privacy guarantee as well as the convergence property of our methods. Extensive experiments using both real datasets and synthetic datasets demonstrate that DPCopula generates highly accurate synthetic multi-dimensional data with significantly better utility than state-of-the-art techniques. PMID:25405241
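A heavily hedged sketch of the Gaussian-copula portion of this idea follows: privatize Kendall's τ with Laplace noise, map it to a copula correlation, and sample synthetic data through the empirical marginals. The noise scale used here is an illustrative assumption, not the paper's sensitivity calibration.

```python
# Gaussian-copula synthesis sketch with illustrative Laplace noise on
# Kendall's tau (NOT the DPCopula paper's calibrated mechanism).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
data = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=500)

tau, _ = stats.kendalltau(data[:, 0], data[:, 1])
tau_dp = tau + rng.laplace(scale=1.0 / (0.5 * len(data)))   # assumed noise scale
rho = np.sin(np.pi * tau_dp / 2)                            # tau -> Gaussian copula corr.

z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=500)
u = stats.norm.cdf(z)                                       # copula samples in [0, 1]
synth = np.column_stack([np.quantile(data[:, j], u[:, j]) for j in range(2)])

tau_s, _ = stats.kendalltau(synth[:, 0], synth[:, 1])
print("original tau:", round(tau, 3), "synthetic tau:", round(tau_s, 3))
```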
NASA Astrophysics Data System (ADS)
Yu, H.; Gu, H.
2017-12-01
A novel multivariate seismic formation pressure prediction methodology is presented, which incorporates high-resolution seismic velocity data from prestack AVO inversion and petrophysical data (porosity and shale volume) derived from poststack seismic motion inversion. In contrast to traditional seismic pressure prediction methods, the proposed methodology is based on a multivariate pressure prediction model and utilizes a trace-by-trace multivariate regression analysis on seismic-derived petrophysical properties to calibrate model parameters, in order to make accurate predictions with higher resolution in both vertical and lateral directions. With the prestack time migration velocity as the initial velocity model, an AVO inversion was first applied to the prestack dataset to obtain high-resolution, higher-frequency seismic velocities to be used as the velocity input for seismic pressure prediction, along with the density dataset used to calculate an accurate overburden pressure (OBP). Seismic Motion Inversion (SMI) is an inversion technique based on Markov chain Monte Carlo simulation. Both structural variability and similarity of seismic waveform are used to incorporate well log data to characterize the variability of the property to be obtained. In this research, porosity and shale volume were first interpreted on well logs and then combined with poststack seismic data using SMI to build porosity and shale volume datasets for seismic pressure prediction. A multivariate effective stress model is used to convert the velocity, porosity and shale volume datasets to effective stress. After a thorough study of the regional stratigraphic and sedimentary characteristics, a regional normally compacted interval model is built, and the coefficients in the multivariate prediction model are then determined in a trace-by-trace multivariate regression analysis on the petrophysical data. The coefficients are used to convert the velocity, porosity and shale volume datasets to effective stress and then to calculate formation pressure from the OBP. Application of the proposed methodology to a research area in the East China Sea has shown that the method can bridge the gap between seismic and well-log pressure prediction and give predicted pressure values close to pressure measurements from well testing.
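The trace-by-trace calibration step might look like the following sketch: regress effective stress, taken as known within a normally compacted interval, on velocity, porosity and shale volume, then convert to pore pressure as overburden minus predicted effective stress. The linear form, units and coefficients are illustrative assumptions, not the paper's model.

```python
# Trace-by-trace multivariate calibration sketch (illustrative linear
# effective-stress model; coefficients and units are synthetic).
import numpy as np

rng = np.random.default_rng(7)
n = 200                                                   # samples along one trace
v   = 2000 + 10 * np.arange(n) + rng.normal(0, 50, n)     # velocity (m/s)
phi = 0.4 * np.exp(-np.arange(n) / 150)                   # porosity (fraction)
vsh = np.clip(rng.normal(0.3, 0.1, n), 0, 1)              # shale volume (fraction)
sigma_eff = 0.005 * v - 8 * phi - 2 * vsh + rng.normal(0, 0.2, n)  # "known" in NCT (MPa)

A = np.column_stack([v, phi, vsh, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, sigma_eff, rcond=None)      # calibrate model coefficients

obp = np.linspace(20, 60, n)                              # overburden pressure (MPa)
pore_pressure = obp - A @ coef                            # P_pore = OBP - sigma_eff
print("coefficients:", coef.round(4))
print("pore pressure (first 3 samples):", pore_pressure[:3].round(2))
```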
A multivariate model for predicting segmental body composition.
Tian, Simiao; Mioche, Laurence; Denis, Jean-Baptiste; Morio, Béatrice
2013-12-01
The aims of the present study were to propose a multivariate model for predicting simultaneously body, trunk and appendicular fat and lean masses from easily measured variables and to compare its predictive capacity with that of the available univariate models that predict body fat percentage (BF%). The dual-energy X-ray absorptiometry (DXA) dataset (52% men and 48% women) with White, Black and Hispanic ethnicities (1999-2004, National Health and Nutrition Examination Survey) was randomly divided into three sub-datasets: a training dataset (TRD), a test dataset (TED), and a validation dataset (VAD), comprising 3835, 1917 and 1917 subjects, respectively. For each sex, several multivariate prediction models were fitted from the TRD using age, weight, height and possibly waist circumference. The most accurate model was selected from the TED and then applied to the VAD and a French DXA dataset (French DB) (526 men and 529 women) to assess the prediction accuracy in comparison with that of five published univariate models, for which adjusted formulas were re-estimated using the TRD. Waist circumference was found to improve the prediction accuracy, especially in men. For BF%, the standard error of prediction (SEP) values were 3.26 (3.75)% for men and 3.47 (3.95)% for women in the VAD (French DB), as good as those of the adjusted univariate models. Moreover, the SEP values for the prediction of body and appendicular lean masses ranged from 1.39 to 2.75 kg for both sexes. The prediction accuracy was best for age < 65 years, BMI < 30 kg/m² and the Hispanic ethnicity. The application of our multivariate model to large populations could be useful to address various public health issues.
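A minimal sketch of such a multivariate (multi-output) predictor is below, regressing several body-composition outcomes jointly on age, weight, height and waist circumference; the synthetic data and plain least-squares fit are assumptions for illustration.

```python
# Multi-output linear regression: several body-composition outcomes
# predicted jointly from easily measured variables.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n = 500
X = np.column_stack([rng.uniform(20, 80, n),     # age (yr)
                     rng.normal(75, 12, n),      # weight (kg)
                     rng.normal(170, 9, n),      # height (cm)
                     rng.normal(90, 11, n)])     # waist circumference (cm)
B = rng.normal(size=(4, 3))
Y = X @ B + rng.normal(0, 2, size=(n, 3))        # e.g. fat, lean, trunk fat (synthetic)

Xtr, Xte, Ytr, Yte = train_test_split(X, Y, random_state=0)
model = LinearRegression().fit(Xtr, Ytr)         # handles multi-output natively
sep = np.sqrt(((model.predict(Xte) - Yte) ** 2).mean(0))
print("standard error of prediction per outcome:", sep.round(2))
```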
Descriptive Characteristics of Surface Water Quality in Hong Kong by a Self-Organising Map
An, Yan; Zou, Zhihong; Li, Ranran
2016-01-01
In this study, principal component analysis (PCA) and a self-organising map (SOM) were used to analyse a complex dataset obtained from the river water monitoring stations in the Tolo Harbor and Channel Water Control Zone (Hong Kong), covering the period of 2009–2011. PCA was initially applied to identify the principal components (PCs) among the nonlinear and complex surface water quality parameters. SOM followed PCA, and was implemented to analyze the complex relationships and behaviors of the parameters. The results reveal that PCA reduced the multidimensional parameters to four significant PCs which are combinations of the original ones. The positive and inverse relationships of the parameters were shown explicitly by pattern analysis in the component planes. It was found that PCA and SOM are efficient tools to capture and analyze the behavior of multivariable, complex, and nonlinear related surface water quality data. PMID:26761018
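The PCA-then-SOM workflow can be sketched as follows, assuming the third-party minisom package (pip install minisom) as a stand-in for the SOM implementation the authors used; the monitoring data are synthetic.

```python
# PCA to extract significant components, then a self-organising map on
# the component scores; each sample maps to a best-matching unit (BMU).
import numpy as np
from sklearn.decomposition import PCA
from minisom import MiniSom   # assumed third-party dependency

rng = np.random.default_rng(9)
wq = rng.normal(size=(150, 12))                 # monitoring samples x quality params

pcs = PCA(n_components=4).fit_transform(wq)     # reduce to 4 significant PCs
som = MiniSom(6, 6, pcs.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(pcs, 1000)

bmus = np.array([som.winner(x) for x in pcs])   # grid position of each sample
print("first five BMUs:", bmus[:5].tolist())
```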
Topic modeling for cluster analysis of large biological and medical datasets.
Zhao, Weizhong; Zou, Wen; Chen, James J
2014-01-01
The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
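The first of the three methods, highest probable topic assignment, reduces to fitting a topic model on count data and assigning each sample to its argmax topic; a minimal sketch with synthetic counts follows.

```python
# "Highest probable topic assignment" clustering: fit LDA on a count
# matrix, then assign each sample to its most probable topic.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(10)
counts = rng.poisson(2, size=(100, 40))          # 100 samples x 40 count features

lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topic = lda.fit_transform(counts)            # sample-by-topic probabilities
clusters = doc_topic.argmax(axis=1)              # highest probable topic per sample
print("cluster sizes:", np.bincount(clusters, minlength=5))
```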
Wen, Xiaotong; Rangarajan, Govindan; Ding, Mingzhou
2013-01-01
Granger causality is increasingly being applied to multi-electrode neurophysiological and functional imaging data to characterize directional interactions between neurons and brain regions. For a multivariate dataset, one might be interested in different subsets of the recorded neurons or brain regions. According to the current estimation framework, for each subset, one conducts a separate autoregressive model fitting process, introducing the potential for unwanted variability and uncertainty. In this paper, we propose a multivariate framework for estimating Granger causality. It is based on spectral density matrix factorization and offers the advantage that the estimation of such a matrix needs to be done only once for the entire multivariate dataset. For any subset of recorded data, Granger causality can be calculated through factorizing the appropriate submatrix of the overall spectral density matrix. PMID:23858479
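For reference, in the standard bivariate Geweke formulation that such spectral factorization supports, the frequency-domain Granger causality from y to x is

```latex
f_{y \to x}(\omega) \;=\; \ln
\frac{S_{xx}(\omega)}
     {S_{xx}(\omega) - \left(\Sigma_{yy} - \dfrac{\Sigma_{xy}^{2}}{\Sigma_{xx}}\right)\left|H_{xy}(\omega)\right|^{2}}
```

where the spectral density matrix factorizes as S(ω) = H(ω) Σ H*(ω), with H the transfer function and Σ the noise covariance obtained from the factorization.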
Goldrick, Stephen; Holmes, William; Bond, Nicholas J; Lewis, Gareth; Kuiper, Marcel; Turner, Richard; Farid, Suzanne S
2017-10-01
Product quality heterogeneities, such as a trisulfide bond (TSB) formation, can be influenced by multiple interacting process parameters. Identifying their root cause is a major challenge in biopharmaceutical production. To address this issue, this paper describes the novel application of advanced multivariate data analysis (MVDA) techniques to identify the process parameters influencing TSB formation in a novel recombinant antibody-peptide fusion expressed in mammalian cell culture. The screening dataset was generated with a high-throughput (HT) micro-bioreactor system (Ambr™ 15) using a design of experiments (DoE) approach. The complex dataset was firstly analyzed through the development of a multiple linear regression model focusing solely on the DoE inputs and identified the temperature, pH and initial nutrient feed day as important process parameters influencing this quality attribute. To further scrutinize the dataset, a partial least squares model was subsequently built incorporating both on-line and off-line process parameters and enabled accurate predictions of the TSB concentration at harvest. Process parameters identified by the models to promote and suppress TSB formation were implemented on five 7 L bioreactors and the resultant TSB concentrations were comparable to the model predictions. This study demonstrates the ability of MVDA to enable predictions of the key performance drivers influencing TSB formation that are valid also upon scale-up. Biotechnol. Bioeng. 2017;114: 2222-2234. © 2017 The Authors. Biotechnology and Bioengineering Published by Wiley Periodicals, Inc.
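A minimal sketch of the second modeling step, a PLS regression from process parameters to a quality attribute, is given below, with synthetic data standing in for the micro-bioreactor runs and TSB measurements.

```python
# PLS regression relating process parameters to a quality attribute
# (synthetic stand-in for the TSB concentration at harvest).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(13)
X = rng.normal(size=(48, 20))                 # 48 runs x 20 process parameters
y = X[:, [0, 3, 7]] @ np.array([1.5, -1.0, 0.8]) + rng.normal(0, 0.3, 48)

pls = PLSRegression(n_components=3).fit(X, y)
print("R2 on training runs:", round(pls.score(X, y), 3))
```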
Zhou, Yan; Wang, Pei; Wang, Xianlong; Zhu, Ji; Song, Peter X-K
2017-01-01
The multivariate regression model is a useful tool to explore complex associations between two kinds of molecular markers, which enables the understanding of the biological pathways underlying disease etiology. For a set of correlated response variables, accounting for such dependency can increase statistical power. Motivated by integrative genomic data analyses, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which correlations of the response variables are assumed to follow a factor analysis model with latent factors. This proposed method allows us not only to address the challenge that the number of association parameters is larger than the sample size, but also to adjust for unobserved genetic and/or nongenetic factors that potentially conceal the underlying response-predictor associations. The proposed smFARM is implemented by the EM algorithm and the blockwise coordinate descent algorithm. The proposed methodology is evaluated and compared to the existing methods through extensive simulation studies. Our results show that accounting for latent factors through the proposed smFARM can improve sensitivity of signal detection and accuracy of sparse association map estimation. We illustrate smFARM by two integrative genomics analysis examples, a breast cancer dataset and an ovarian cancer dataset, to assess the relationship between DNA copy numbers and gene expression arrays to understand genetic regulatory patterns relevant to the disease. We identify two trans-hub regions: one in cytoband 17q12 whose amplification influences the RNA expression levels of important breast cancer genes, and the other in cytoband 9q21.32-33, which is associated with chemoresistance in ovarian cancer. © 2016 WILEY PERIODICALS, INC.
Forcino, Frank L; Leighton, Lindsey R; Twerdy, Pamela; Cahill, James F
2015-01-01
Community ecologists commonly perform multivariate techniques (e.g., ordination, cluster analysis) to assess patterns and gradients of taxonomic variation. A critical requirement for a meaningful statistical analysis is accurate information on the taxa found within an ecological sample. However, oversampling (too many individuals counted per sample) also comes at a cost, particularly for ecological systems in which identification and quantification is substantially more resource consuming than the field expedition itself. In such systems, an increasingly larger sample size will eventually result in diminishing returns in improving any pattern or gradient revealed by the data, but will also lead to continually increasing costs. Here, we examine 396 datasets: 44 previously published and 352 created for this study. Using meta-analytic and simulation-based approaches, we seek (1) to determine the minimal sample sizes required to produce robust multivariate statistical results when conducting abundance-based community ecology research, and (2) to determine the dataset parameters (i.e., evenness, number of taxa, number of samples) that require larger sample sizes, regardless of resource availability. We found that in the 44 previously published and the 220 created datasets with randomly chosen abundances, a conservative estimate of a sample size of 58 produced the same multivariate results as all larger sample sizes. However, this minimal number varies as a function of evenness, where increased evenness resulted in increased minimal sample sizes. Sample sizes as small as 58 individuals are sufficient for a broad range of multivariate abundance-based research. In cases when resource availability is the limiting factor for conducting a project (e.g., small university, time to conduct the research project), statistically viable results can still be obtained with less of an investment.
Parastar, Hadi; Akvan, Nadia
2014-03-13
In the present contribution, a new combination of multivariate curve resolution-correlation optimized warping (MCR-COW) with trilinear parallel factor analysis (PARAFAC) is developed to exploit second-order advantage in complex chromatographic measurements. In MCR-COW, the complexity of the chromatographic data is reduced by arranging the data in a column-wise augmented matrix, analyzing using MCR bilinear model and aligning the resolved elution profiles using COW in a component-wise manner. The aligned chromatographic data is then decomposed using trilinear model of PARAFAC in order to exploit pure chromatographic and spectroscopic information. The performance of this strategy is evaluated using simulated and real high-performance liquid chromatography-diode array detection (HPLC-DAD) datasets. The obtained results showed that the MCR-COW can efficiently correct elution time shifts of target compounds that are completely overlapped by coeluted interferences in complex chromatographic data. In addition, the PARAFAC analysis of aligned chromatographic data has the advantage of unique decomposition of overlapped chromatographic peaks to identify and quantify the target compounds in the presence of interferences. Finally, to confirm the reliability of the proposed strategy, the performance of the MCR-COW-PARAFAC is compared with the frequently used methods of PARAFAC, COW-PARAFAC, multivariate curve resolution-alternating least squares (MCR-ALS), and MCR-COW-MCR. In general, in most of the cases the MCR-COW-PARAFAC showed an improvement in terms of lack of fit (LOF), relative error (RE) and spectral correlation coefficients in comparison to the PARAFAC, COW-PARAFAC, MCR-ALS and MCR-COW-MCR results. Copyright © 2014 Elsevier B.V. All rights reserved.
Understanding Information Flow Interaction along Separable Causal Paths in Environmental Signals
NASA Astrophysics Data System (ADS)
Jiang, P.; Kumar, P.
2017-12-01
Multivariate environmental signals reflect the outcome of complex inter-dependencies, such as those in ecohydrologic systems. Transfer entropy and information partitioning approaches have been used to characterize such dependencies. However, these approaches capture net information flow occurring through a multitude of pathways involved in the interaction and as a result mask our ability to discern the causal interaction within an interested subsystem through specific pathways. We build on recent developments of momentary information transfer along causal paths proposed by Runge [2015] to develop a framework for quantifying information decomposition along separable causal paths. Momentary information transfer along causal paths captures the amount of information flow between any two variables lagged at two specific points in time. Our approach expands this concept to characterize the causal interaction in terms of synergistic, unique and redundant information flow through separable causal paths. Multivariate analysis using this novel approach reveals precise understanding of causality and feedback. We illustrate our approach with synthetic and observed time series data. We believe the proposed framework helps better delineate the internal structure of complex systems in geoscience where huge amounts of observational datasets exist, and it will also help the modeling community by providing a new way to look at the complexity of real and modeled systems. Runge, Jakob. "Quantifying information transfer and mediation along causal pathways in complex systems." Physical Review E 92.6 (2015): 062829.
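The underlying quantity being partitioned here builds on transfer entropy, which in Schreiber's standard form is

```latex
T_{Y \to X} \;=\; \sum p\!\left(x_{t+1},\, x_{t}^{(k)},\, y_{t}^{(l)}\right)
\log \frac{p\!\left(x_{t+1} \,\middle|\, x_{t}^{(k)},\, y_{t}^{(l)}\right)}
          {p\!\left(x_{t+1} \,\middle|\, x_{t}^{(k)}\right)}
```

where x_t^(k) and y_t^(l) denote length-k and length-l history vectors of the two series. The momentary, path-specific measures discussed above refine this net quantity by conditioning on specific lagged variables along a causal path.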
Normalization methods in time series of platelet function assays
Van Poucke, Sven; Zhang, Zhongheng; Roest, Mark; Vukicevic, Milan; Beran, Maud; Lauwereins, Bart; Zheng, Ming-Hua; Henskens, Yvonne; Lancé, Marcus; Marcus, Abraham
2016-01-01
Platelet function can be quantitatively assessed by specific assays such as light-transmission aggregometry, multiple-electrode aggregometry measuring the response to adenosine diphosphate (ADP), arachidonic acid, collagen, and thrombin-receptor activating peptide, and by viscoelastic tests such as rotational thromboelastometry (ROTEM). The task of extracting meaningful statistical and clinical information from the high-dimensional data spaces of temporal multivariate clinical data, represented as multivariate time series, is complex. Building insightful visualizations for multivariate time series demands adequate usage of normalization techniques. In this article, various methods for data normalization (z-transformation, range transformation, proportion transformation, and interquartile range) are presented and visualized, and the most suitable approach for platelet function data series is discussed. Normalization was calculated per assay (test) for all time points and per time point for all tests. Interquartile range, range transformation, and z-transformation demonstrated the correlation as calculated by the Spearman correlation test when normalized per assay (test) for all time points. When normalizing per time point for all tests, no correlation could be abstracted from the charts, as was the case when using all data as one dataset for normalization. PMID:27428217
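The four transformations compared are each one-liners; a sketch applying them column-wise (per assay across all time points) to a synthetic panel follows. The assay names are placeholders.

```python
# Four normalizations applied per assay (column) across all time points.
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
df = pd.DataFrame(rng.normal(50, 10, size=(24, 3)),
                  columns=["LTA", "MEA", "ROTEM"])   # time points x assays

z      = (df - df.mean()) / df.std()                                  # z-transformation
rng_tr = (df - df.min()) / (df.max() - df.min())                      # range transformation
prop   = df / df.sum()                                                # proportion transformation
iqr_tr = (df - df.median()) / (df.quantile(.75) - df.quantile(.25))   # interquartile range

print(z.head(2).round(2))
```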
NASA Astrophysics Data System (ADS)
Hoseinzade, Zohre; Mokhtari, Ahmad Reza
2017-10-01
Large numbers of variables are measured to explain different phenomena, and factor analysis has been widely used to reduce the dimension of such datasets. Additionally, the technique has been employed to highlight underlying factors hidden in a complex system. As geochemical studies benefit from multivariate assays, application of this method is widespread in geochemistry. However, the conventional protocols for implementing factor analysis have some drawbacks in spite of their advantages. In the present study, a geochemical dataset comprising 804 soil samples, collected from a mining area in central Iran in the search for MVT-type Pb-Zn deposits, was considered in order to outline geochemical analysis through various factor analysis methods. Routine factor analysis, sequential factor analysis, and staged factor analysis were applied to the dataset, after opening the compositional data with the additive log-ratio (alr) transformation, to extract the mineralization factor in the dataset. A comparison between these methods indicated that sequential factor analysis revealed the MVT paragenesis elements in surface samples most clearly, with nearly 50% of the variation in F1. In addition, staged factor analysis gave acceptable results while being easy to apply; it detected mineralization-related elements and assigned them larger factor loadings, resulting in a clearer expression of the mineralization.
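A minimal sketch of the shared preprocessing-plus-factor-extraction workflow, using scikit-learn's FactorAnalysis as a generic stand-in for the routine/sequential/staged protocols compared in the study; the simulated composition, the alr reference part, and the component count are assumptions:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def alr(X):
    """Additive log-ratio transform; the last column is the reference part."""
    return np.log(X[:, :-1] / X[:, -1:])

rng = np.random.default_rng(1)
comp = rng.dirichlet(alpha=np.ones(8), size=804)   # 804 samples, 8 "elements"
scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(alr(comp))
print(scores.shape)   # (804, 3): factor scores, e.g. F1 as a mineralization factor
```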
Kilborn, Joshua P; Jones, David L; Peebles, Ernst B; Naar, David F
2017-04-01
Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
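A sketch of the UPGMA clustering step evaluated in the study, using SciPy; the DISPROF permutation test itself is not implemented here, and the Bray-Curtis resemblance measure and simulated abundance data are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# two simulated "communities": 30 sites each, 10 species, different mean abundances
X = np.vstack([rng.poisson(5, (30, 10)), rng.poisson(15, (30, 10))]).astype(float)

d = pdist(X, metric='braycurtis')    # ecological resemblance matrix
Z = linkage(d, method='average')     # UPGMA agglomeration
labels = fcluster(Z, t=2, criterion='maxclust')
print(np.bincount(labels)[1:])       # sizes of the two recovered groups
```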
Identifying Interacting Genetic Variations by Fish-Swarm Logic Regression
Yang, Aiyuan; Yan, Chunxia; Zhu, Feng; Zhao, Zhongmeng; Cao, Zhi
2013-01-01
Understanding associations between genotypes and complex traits is a fundamental problem in human genetics. A major open problem in mapping phenotypes is that of identifying a set of interacting genetic variants which might contribute to complex traits. Logic regression (LR) is a powerful multivariate association tool. Several LR-based approaches have been successfully applied to different datasets. However, these approaches fall short with regard to accuracy and efficiency. In this paper, we propose a new LR-based approach, called fish-swarm logic regression (FSLR), which improves the logic regression process by incorporating swarm optimization. In our approach, a school of fish agents is run in parallel. Each fish agent holds a regression model, while the school searches for better models through various preset behaviors. The swarm algorithm improves accuracy and efficiency by speeding up convergence and preventing the search from dropping into local optima. We apply our approach to a real screening dataset and a series of simulation scenarios. Compared to three existing LR-based approaches, our approach outperforms them by having lower type I and type II error rates, being able to identify more preset causal sites, and performing at faster speeds. PMID:23984382
Inouye, David I.; Ravikumar, Pradeep; Dhillon, Inderjit S.
2016-01-01
We develop Square Root Graphical Models (SQR), a novel class of parametric graphical models that provides multivariate generalizations of univariate exponential family distributions. Previous multivariate graphical models (Yang et al., 2015) did not allow positive dependencies for the exponential and Poisson generalizations. However, in many real-world datasets, variables clearly have positive dependencies. For example, the airport delay time in New York—modeled as an exponential distribution—is positively related to the delay time in Boston. With this motivation, we give an example of our model class derived from the univariate exponential distribution that allows for almost arbitrary positive and negative dependencies with only a mild condition on the parameter matrix—a condition akin to the positive definiteness of the Gaussian covariance matrix. Our Poisson generalization allows for both positive and negative dependencies without any constraints on the parameter values. We also develop parameter estimation methods using node-wise regressions with ℓ1 regularization and likelihood approximation methods using sampling. Finally, we demonstrate our exponential generalization on a synthetic dataset and a real-world dataset of airport delay times. PMID:27563373
NASA Astrophysics Data System (ADS)
Bizzi, S.; Surridge, B.; Lerner, D. N.
2009-04-01
River ecosystems represent complex networks of interacting biological, chemical and geomorphological processes. These processes generate spatial and temporal patterns in biological, chemical and geomorphological variables, and a growing number of these variables are now being used to characterise the status of rivers. However, integrated analyses of these biological-chemical-geomorphological networks have rarely been undertaken, and as a result our knowledge of the underlying processes and how they generate the resulting patterns remains weak. The apparent complexity of the networks involved, and the lack of coherent datasets, represent two key challenges to such analyses. In this paper we describe the application of a novel technique, Structural Equation Modelling (SEM), to the investigation of biological, chemical and geomorphological data collected from rivers across England and Wales. The SEM approach is a multivariate statistical technique enabling simultaneous examination of direct and indirect relationships across a network of variables. Further, SEM allows a priori conceptual or theoretical models to be tested against available data. This is a significant departure from the solely exploratory analyses that characterise other multivariate techniques. We took biological, chemical and river habitat survey data collected by the Environment Agency for 400 sites in rivers spread across England and Wales, and created a single, coherent dataset suitable for SEM analyses. Biological data cover benthic macroinvertebrates, chemical data relate to a range of standard parameters (e.g. BOD, dissolved oxygen and phosphate concentration), and geomorphological data cover factors such as river typology, substrate material and degree of physical modification. We developed a number of a priori conceptual models, reflecting current research questions or existing knowledge, and tested the ability of these conceptual models to explain the variance and covariance within the dataset. The conceptual models we developed were able to correctly explain the variance and covariance shown by the datasets, proving to be a relevant representation of the processes involved. The models explained 65% of the variance in indices describing benthic macroinvertebrate communities. Dissolved oxygen was of primary importance, but geomorphological factors, including river habitat type and degree of habitat degradation, also had significant explanatory power. The addition of spatial variables, such as latitude or longitude, did not provide additional explanatory power. This suggests that the variables already included in the models effectively represented the eco-regions across which our data were distributed. The models produced new insights into the relative importance of chemical and geomorphological factors for river macroinvertebrate communities. The SEM technique proved a powerful tool for exploring complex biological-chemical-geomorphological networks, being able, for example, to deal with the co-correlations that are common in rivers due to multiple feedback mechanisms.
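No SEM code is given in the abstract; as a minimal stand-in for the direct and indirect pathways such models estimate, the sketch below fits a toy mediation path (geomorphology -> dissolved oxygen -> macroinvertebrate index) with two least-squares regressions. All variable names and effect sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(12)
n = 400
geomorph = rng.normal(size=n)                       # habitat degradation score
oxygen = 0.6 * geomorph + rng.normal(0, 0.8, n)     # chemistry depends on geomorphology
macroinv = 0.8 * oxygen + 0.3 * geomorph + rng.normal(0, 0.5, n)

def ols(y, X):
    """Slope coefficients of an ordinary least-squares fit (intercept dropped)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]

a = ols(oxygen, geomorph)[0]                          # geomorphology -> oxygen
b, c = ols(macroinv, np.column_stack([oxygen, geomorph]))
print(f"direct effect: {c:.2f}, indirect via oxygen: {a * b:.2f}")
```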
Crosse, Michael J.; Di Liberto, Giovanni M.; Bednar, Adam; Lalor, Edmund C.
2016-01-01
Understanding how brains process sensory signals in natural environments is one of the key goals of twenty-first century neuroscience. While brain imaging and invasive electrophysiology will play key roles in this endeavor, there is also an important role to be played by noninvasive, macroscopic techniques with high temporal resolution such as electro- and magnetoencephalography. But challenges exist in determining how best to analyze such complex, time-varying neural responses to complex, time-varying and multivariate natural sensory stimuli. There has been a long history of applying system identification techniques to relate the firing activity of neurons to complex sensory stimuli and such techniques are now seeing increased application to EEG and MEG data. One particular example involves fitting a filter—often referred to as a temporal response function—that describes a mapping between some feature(s) of a sensory stimulus and the neural response. Here, we first briefly review the history of these system identification approaches and describe a specific technique for deriving temporal response functions known as regularized linear regression. We then introduce a new open-source toolbox for performing this analysis. We describe how it can be used to derive (multivariate) temporal response functions describing a mapping between stimulus and response in both directions. We also explain the importance of regularizing the analysis and how this regularization can be optimized for a particular dataset. We then outline specifically how the toolbox implements these analyses and provide several examples of the types of results that the toolbox can produce. Finally, we consider some of the limitations of the toolbox and opportunities for future development and application. PMID:27965557
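A minimal sketch of the core computation described (deriving a temporal response function by regularized, here ridge, linear regression on a time-lagged design matrix); this is not the toolbox's API, and the filter shape, lag range, and regularization strength are assumptions:

```python
import numpy as np

def lagged_design(stim, n_lags):
    """Design matrix whose column j holds the stimulus delayed by j samples."""
    X = np.zeros((len(stim), n_lags))
    for j in range(n_lags):
        X[j:, j] = stim[:len(stim) - j]
    return X

def trf_ridge(stim, resp, n_lags, lam=1.0):
    """Temporal response function by ridge (regularized linear) regression."""
    X = lagged_design(stim, n_lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(n_lags), X.T @ resp)

rng = np.random.default_rng(3)
stim = rng.normal(size=2000)
true_trf = np.hanning(10)                                    # assumed ground-truth filter
resp = np.convolve(stim, true_trf)[:2000] + rng.normal(0, 0.1, 2000)
w = trf_ridge(stim, resp, n_lags=10, lam=10.0)
print(np.round(w, 2))                                        # ~ the Hanning window shape
```

Sweeping `lam` over a grid and cross-validating, as the abstract emphasizes, is the natural next step.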
NASA Astrophysics Data System (ADS)
Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.
2017-06-01
The recent popularity of the use of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods for generating and analyzing such data, but these almost always rely on asymptotic distributions and are consequently not adequate for small-sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based on exact distributions, this procedure may be used even for small-sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with the results obtained from an adaptation of Reiter's combination rule to multiply imputed synthetic datasets, and an application to the 2000 Current Population Survey is discussed.
Iafrati, Jillian; Malvache, Arnaud; Gonzalez Campo, Cecilia; Orejarena, M. Juliana; Lassalle, Olivier; Bouamrane, Lamine; Chavis, Pascale
2016-01-01
The postnatal maturation of the prefrontal cortex (PFC) represents a period of increased vulnerability to risk factors and emergence of neuropsychiatric disorders. To disambiguate the pathophysiological mechanisms contributing to these disorders, we revisited the endophenotype approach from a developmental viewpoint. The extracellular matrix protein reelin which contributes to cellular and network plasticity, is a risk factor for several psychiatric diseases. We mapped the aggregate effect of the RELN risk allele on postnatal development of PFC functions by cross-sectional synaptic and behavioral analysis of reelin-haploinsufficient mice. Multivariate analysis of bootstrapped datasets revealed subgroups of phenotypic traits specific to each maturational epoch. The preeminence of synaptic AMPA/NMDA receptor content to pre-weaning and juvenile endophenotypes shifts to long-term potentiation and memory renewal during adolescence followed by NMDA-GluN2B synaptic content in adulthood. Strikingly, multivariate analysis shows that pharmacological rehabilitation of reelin haploinsufficient dysfunctions is mediated through induction of new endophenotypes rather than reversion to wild-type traits. By delineating previously unknown developmental endophenotypic sequences, we conceived a promising general strategy to disambiguate the molecular underpinnings of complex psychiatric disorders and for the rational design of pharmacotherapies in these disorders. PMID:27765946
Lepre, Jorge; Rice, J Jeremy; Tu, Yuhai; Stolovitzky, Gustavo
2004-05-01
Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissue types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existent patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. Our set of pattern discovery algorithms, including a user interface, is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).
Wilke, Marko
2018-02-01
This dataset contains the regression parameters derived by analyzing segmented brain MRI images (gray matter and white matter) from a large population of healthy subjects, using a multivariate adaptive regression splines approach. A total of 1919 MRI datasets ranging in age from 1-75 years from four publicly available datasets (NIH, C-MIND, fCONN, and IXI) were segmented using the CAT12 segmentation framework, writing out gray matter and white matter images normalized using an affine-only spatial normalization approach. These images were then subjected to a six-step DARTEL procedure, employing an iterative non-linear registration approach and yielding increasingly crisp intermediate images. The resulting six datasets per tissue class were then analyzed using multivariate adaptive regression splines, using the CerebroMatic toolbox. This approach allows for flexibly modelling smoothly varying trajectories while taking into account demographic (age, gender) as well as technical (field strength, data quality) predictors. The resulting regression parameters described here can be used to generate matched DARTEL or SHOOT templates for a given population under study, from infancy to old age. The dataset and the algorithm used to generate it are publicly available at https://irc.cchmc.org/software/cerebromatic.php.
De Francesco, Davide; Leech, Robert; Sabin, Caroline A.; Winston, Alan
2018-01-01
Objective The reported prevalence of cognitive impairment remains similar to that reported in the pre-antiretroviral therapy era. This may be partially artefactual due to the methods used to diagnose impairment. In this study, we evaluated the diagnostic performance of the HIV-associated neurocognitive disorder (Frascati criteria) and global deficit score (GDS) methods in comparison to a new, multivariate method of diagnosis. Methods Using a simulated ‘normative’ dataset informed by real-world cognitive data from the observational Pharmacokinetic and Clinical Observations in PeoPle Over fiftY (POPPY) cohort study, we evaluated the apparent prevalence of cognitive impairment using the Frascati and GDS definitions, as well as a novel multivariate method based on the Mahalanobis distance. We then quantified the diagnostic properties (including positive and negative predictive values and accuracy) of each method, using bootstrapping with 10,000 replicates, with a separate ‘test’ dataset to which a pre-defined proportion of ‘impaired’ individuals had been added. Results The simulated normative dataset demonstrated that up to ~26% of a normative control population would be diagnosed with cognitive impairment with the Frascati criteria and ~20% with the GDS. In contrast, the multivariate Mahalanobis distance method identified impairment in ~5%. Using the test dataset, diagnostic accuracy [95% confidence intervals] and positive predictive value (PPV) was best for the multivariate method vs. Frascati and GDS (accuracy: 92.8% [90.3–95.2%] vs. 76.1% [72.1–80.0%] and 80.6% [76.6–84.5%] respectively; PPV: 61.2% [48.3–72.2%] vs. 29.4% [22.2–36.8%] and 33.9% [25.6–42.3%] respectively). Increasing the a priori false positive rate for the multivariate Mahalanobis distance method from 5% to 15% resulted in an increase in sensitivity from 77.4% (64.5–89.4%) to 92.2% (83.3–100%) at a cost of specificity from 94.5% (92.8–95.2%) to 85.0% (81.2–88.5%). Conclusion Our simulations suggest that the commonly used diagnostic criteria of HIV-associated cognitive impairment label a significant proportion of a normative reference population as cognitively impaired, which will likely lead to a substantial over-estimate of the true proportion in a study population, due to their lower than expected specificity. These findings have important implications for clinical research regarding cognitive health in people living with HIV. More accurate methods of diagnosis should be implemented, with multivariate techniques offering a promising solution. PMID:29641619
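A sketch of the multivariate Mahalanobis-distance criterion with a 5% a priori false-positive rate, assuming multivariate-normal normative data so that squared distances follow a chi-square distribution; the five-test battery and its covariance are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
norms = rng.multivariate_normal(np.zeros(5), np.eye(5), size=1000)  # normative battery
mu, cov = norms.mean(axis=0), np.cov(norms, rowvar=False)
inv_cov = np.linalg.inv(cov)

def mahalanobis_sq(x):
    d = x - mu
    return d @ inv_cov @ d

# 5% a priori false-positive rate: threshold from the chi-square distribution
thresh = stats.chi2.ppf(0.95, df=5)
flags = np.array([mahalanobis_sq(x) > thresh for x in norms])
print(flags.mean())   # ~0.05 of the normative sample is labelled "impaired"
```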
A "Model" Multivariable Calculus Course.
ERIC Educational Resources Information Center
Beckmann, Charlene E.; Schlicker, Steven J.
1999-01-01
Describes a rich, investigative approach to multivariable calculus. Introduces a project in which students construct physical models of surfaces that represent real-life applications of their choice. The models, along with student-selected datasets, serve as vehicles to study most of the concepts of the course from both continuous and discrete…
NASA Astrophysics Data System (ADS)
Chiriaco, Marjolaine; Dupont, Jean-Charles; Bastin, Sophie; Badosa, Jordi; Lopez, Julio; Haeffelin, Martial; Chepfer, Helene; Guzman, Rodrigo
2018-05-01
A scientific approach is presented to aggregate and harmonize a set of 60 geophysical variables at hourly timescale over a decade, and to allow multiannual and multi-variable studies combining atmospheric dynamics and thermodynamics, radiation, clouds and aerosols from ground-based observations. Many datasets from ground-based observations are currently in use worldwide. They are very valuable because they contain complete and precise information due to their spatio-temporal co-localization over more than a decade. These datasets, in particular the synergy between different types of observations, are under-used because of their complexity and diversity due to calibration, quality control, treatment, format, temporal averaging, metadata, etc. Two main results are presented in this article: (1) a set of methods available for the community to robustly and reliably process ground-based data at an hourly timescale over a decade is described, and (2) a single netCDF file is provided based on the SIRTA supersite observations. This file contains approximately 60 geophysical variables (atmospheric and in-ground) hourly averaged over a decade (for the longest variables). The netCDF file is available and easy to use for the community. In this article, observations are re-analyzed. The prefix "re" refers to six main steps: calibration, quality control, treatment, hourly averaging, homogenization of the formats and associated metadata, as well as expertise on more than a decade of observations. In contrast, previous studies (i) took only some of these six steps into account for each variable, (ii) did not aggregate all variables together in a single file and (iii) did not offer an hourly resolution for about 60 variables over a decade (for the longest variables). The approach described in this article can be applied to different supersites and to additional variables. The main implication of this work is that complex atmospheric observations are made readily available for scientists who are non-experts in measurements. The dataset from SIRTA observations can be downloaded at http://sirta.ipsl.fr/reobs.html (last access: April 2017) (Downloads tab, no password required) under https://doi.org/10.14768/4F63BAD4-E6AF-4101-AD5A-61D4A34620DE.
Al-Aziz, Jameel; Christou, Nicolas; Dinov, Ivo D.
2011-01-01
The amount, complexity and provenance of data have dramatically increased in the past five years. Visualization of observed and simulated data is a critical component of any social, environmental, biomedical or scientific quest. Dynamic, exploratory and interactive visualization of multivariate data, without preprocessing by dimensionality reduction, remains a nearly insurmountable challenge. The Statistics Online Computational Resource (www.SOCR.ucla.edu) provides portable online aids for probability and statistics education, technology-based instruction and statistical computing. We have developed a new Java-based infrastructure, SOCR Motion Charts, for discovery-based exploratory analysis of multivariate data. This interactive data visualization tool enables the visualization of high-dimensional longitudinal data. SOCR Motion Charts allows mapping of ordinal, nominal and quantitative variables onto time, 2D axes, size, colors, glyphs and appearance characteristics, which facilitates the interactive display of multidimensional data. We validated this new visualization paradigm using several publicly available multivariate datasets including Ice-Thickness, Housing Prices, Consumer Price Index, and California Ozone Data. SOCR Motion Charts is designed using object-oriented programming, implemented as a Java Web-applet and is available to the entire community on the web at www.socr.ucla.edu/SOCR_MotionCharts. It can be used as an instructional tool for rendering and interrogating high-dimensional data in the classroom, as well as a research tool for exploratory data analysis. PMID:21479108
Yang, Jie; McArdle, Conor; Daniels, Stephen
2014-01-01
A new data dimension-reduction method, called Internal Information Redundancy Reduction (IIRR), is proposed for application to Optical Emission Spectroscopy (OES) datasets obtained from industrial plasma processes. For example in a semiconductor manufacturing environment, real-time spectral emission data is potentially very useful for inferring information about critical process parameters such as wafer etch rates, however, the relationship between the spectral sensor data gathered over the duration of an etching process step and the target process output parameters is complex. OES sensor data has high dimensionality (fine wavelength resolution is required in spectral emission measurements in order to capture data on all chemical species involved in plasma reactions) and full spectrum samples are taken at frequent time points, so that dynamic process changes can be captured. To maximise the utility of the gathered dataset, it is essential that information redundancy is minimised, but with the important requirement that the resulting reduced dataset remains in a form that is amenable to direct interpretation of the physical process. To meet this requirement and to achieve a high reduction in dimension with little information loss, the IIRR method proposed in this paper operates directly in the original variable space, identifying peak wavelength emissions and the correlative relationships between them. A new statistic, Mean Determination Ratio (MDR), is proposed to quantify the information loss after dimension reduction and the effectiveness of IIRR is demonstrated using an actual semiconductor manufacturing dataset. As an example of the application of IIRR in process monitoring/control, we also show how etch rates can be accurately predicted from IIRR dimension-reduced spectral data. PMID:24451453
Husain, Syed S; Kalinin, Alexandr; Truong, Anh; Dinov, Ivo D
Intuitive formulation of informative and computationally efficient queries on big and complex datasets presents a number of challenges. As data collection is increasingly streamlined and ubiquitous, data exploration, discovery and analytics get considerably harder. Exploratory querying of heterogeneous and multi-source information is both difficult and necessary to advance our knowledge about the world around us. We developed a mechanism to integrate dispersed multi-source data and serve the mashed-up information via human and machine interfaces in a secure, scalable manner. This process facilitates the exploration of subtle associations between variables, population strata, or clusters of data elements, which may be opaque to standard independent inspection of the individual sources. This new platform includes a device-agnostic tool (Dashboard webapp, http://socr.umich.edu/HTML5/Dashboard/) for graphically querying, navigating and exploring the multivariate associations in complex heterogeneous datasets. The paper illustrates this core functionality and service-oriented infrastructure using healthcare data (e.g., US data from the 2010 Census, Demographic and Economic surveys, Bureau of Labor Statistics, and Center for Medicare Services) as well as Parkinson's Disease neuroimaging data. Both the back-end data archive and the front-end dashboard interfaces are continuously expanded to include additional data elements and new ways to customize the human and machine interactions. A client-side data import utility allows for easy and intuitive integration of user-supplied datasets. This completely open-science framework may be used for exploratory analytics, confirmatory analyses, meta-analyses, and education and training purposes in a wide variety of fields.
NASA Astrophysics Data System (ADS)
Brandmeier, M.; Wörner, G.
2016-10-01
Multivariate statistical and geospatial analyses based on a compilation of 890 geochemical and 1200 geochronological data for 194 mapped ignimbrites from the Central Andes document the compositional and temporal patterns of large-volume ignimbrites (so-called "ignimbrite flare-ups") during Neogene times. Rapid advances in computational science during the past decade have led to a growing pool of algorithms for multivariate statistics on large datasets with many predictor variables. This study applies cluster analysis (CA) and linear discriminant analysis (LDA) to log-ratio transformed data with the aim of (1) testing a tool for ignimbrite correlation and (2) distinguishing compositional groups that reflect different processes and sources of ignimbrite magmatism during the geodynamic evolution of the Central Andes. CA on major and trace elements allows grouping of ignimbrites according to their geochemical characteristics into rhyolitic and dacitic "end-members" and differentiates characteristic trace element signatures with respect to the Eu anomaly, depletions in middle and heavy rare earth elements (REE), and variable enrichments in light REE. To highlight these distinct compositional signatures, we applied LDA to selected ignimbrites for which comprehensive datasets were available. Compared to traditional geochemical parameters, the advantage of multivariate statistics is its capability of dealing with large datasets and many variables (elements), taking advantage of this n-dimensional space to detect subtle compositional differences contained in the data. The most important predictors for discriminating ignimbrites are La, Yb, Eu, Al2O3, K2O, P2O5, MgO, FeOt, and TiO2. However, other REE such as Gd, Pr, Tm, Sm, Dy and Er also contribute to the discrimination functions. Significant compositional differences were found between (1) the older (> 13 Ma) large-volume plateau-forming ignimbrites in northernmost Chile and southern Peru and (2) the younger (< 10 Ma) Altiplano-Puna-Volcanic-Complex (APVC) ignimbrites that are of similar volumes. Older ignimbrites are less depleted in HREE and less radiogenic in Sr isotopes, indicating smaller crustal contributions during evolution in a thinner and thermally less evolved crust. These compositional variations indicate a relation to crustal thickening, with a "transition" from plagioclase to amphibole and garnet residual mineralogy between 13 and 9 Ma. Compositional and volumetric variations correlate with the N-S passage of the Juan Fernández Ridge, crustal shortening and thickening, and increased average crustal temperatures during the past 26 Ma.
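A sketch of the two-step workflow (cluster analysis on log-ratio transformed compositions, then LDA); the centred log-ratio transform, KMeans as the clustering algorithm, and the simulated compositions are assumptions rather than the paper's exact protocol:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
comp = rng.dirichlet(np.ones(9), size=200)      # 200 analyses, 9 oxides/elements
clr = np.log(comp) - np.log(comp).mean(axis=1, keepdims=True)  # centred log-ratio
X = clr[:, :-1]                                 # drop one part to remove the clr singularity

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
lda = LinearDiscriminantAnalysis().fit(X, groups)
print(np.round(lda.coef_, 2))                   # element weights driving the discrimination
```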
Accurate and fast multiple-testing correction in eQTL studies.
Sul, Jae Hoon; Raj, Towfique; de Jong, Simone; de Bakker, Paul I W; Raychaudhuri, Soumya; Ophoff, Roel A; Stranger, Barbara E; Eskin, Eleazar; Han, Buhm
2015-06-04
In studies of expression quantitative trait loci (eQTLs), it is of increasing interest to identify eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Detecting eGenes is important for follow-up analyses and prioritization because genes are the main entities in biological processes. To detect eGenes, one typically focuses on the genetic variant with the minimum p value among all variants in cis with a gene and corrects for multiple testing to obtain a gene-level p value. For performing multiple-testing correction, a permutation test is widely used. Because of the growing sample sizes of eQTL studies, however, the permutation test has become a computational bottleneck. In this paper, we propose an efficient approach for correcting for multiple testing and assessing eGene p values by utilizing a multivariate normal distribution. Our approach properly takes into account the linkage-disequilibrium structure among variants, and its time complexity is independent of sample size. By applying our small-sample correction techniques, our method achieves high accuracy in both small and large studies. We have shown that our method consistently produces extremely accurate p values (accuracy > 98%) for three human eQTL datasets with different sample sizes and SNP densities: the Genotype-Tissue Expression pilot dataset, the multi-region brain dataset, and the HapMap 3 dataset. Copyright © 2015 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
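The MVN idea can be illustrated with a brute-force Monte Carlo stand-in for the paper's more efficient procedure: the gene-level p value is the probability, under a multivariate normal null whose covariance is the LD matrix, that the minimum cis-variant p value is as small as the one observed. The toy LD matrix is invented:

```python
import numpy as np
from scipy import stats

def gene_level_p(p_min, ld, n_draws=100_000, seed=0):
    """Gene-level p value: chance that the minimum of correlated variant
    p values (MVN null with the LD matrix as covariance) is <= p_min."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(ld.shape[0]), ld, size=n_draws)
    null_min_p = 2 * stats.norm.sf(np.abs(z).max(axis=1))
    return (null_min_p <= p_min).mean()

ld = np.array([[1.0, 0.8, 0.6],    # toy LD (correlation) matrix, 3 cis variants
               [0.8, 1.0, 0.8],
               [0.6, 0.8, 1.0]])
print(gene_level_p(0.01, ld))      # < 0.03: Bonferroni would over-correct
```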
Domingo-Almenara, Xavier; Perera, Alexandre; Brezmes, Jesus
2016-11-25
Gas chromatography-mass spectrometry (GC-MS) produces large and complex datasets characterized by co-eluting compounds at trace levels and by distinct compound ion redundancy resulting from the high fragmentation caused by electron impact ionization. Compounds in GC-MS can be resolved by applying multivariate resolution methods that take advantage of the multivariate nature of GC-MS data. However, multivariate methods have to be applied in small regions of the chromatogram, and therefore chromatograms are segmented prior to the application of the algorithms. The automation of this segmentation process is a challenging task, as it implies separating informative data from noise in the chromatogram. This study demonstrates the capabilities of independent component analysis-orthogonal signal deconvolution (ICA-OSD) and multivariate curve resolution-alternating least squares (MCR-ALS) with an overlapping moving window implementation that avoids the typical hard chromatographic segmentation. Also, after being resolved, compounds are aligned across samples by an automated alignment algorithm. We evaluated the proposed methods through a quantitative analysis of GC-qTOF MS data from 25 serum samples. The quantitative performance of both the moving window ICA-OSD and MCR-ALS-based implementations was compared with the quantification of 33 compounds by the XCMS package. Results showed that most of the R2 coefficients of determination exhibited a high correlation (R2 > 0.90) in both the ICA-OSD and MCR-ALS moving window-based approaches. Copyright © 2016 Elsevier B.V. All rights reserved.
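A minimal MCR-ALS sketch with non-negativity constraints, showing the bilinear alternating step that the paper's moving-window implementation is built around; dimensions, peak shapes, and initialization are invented, and the ICA-OSD variant is not shown:

```python
import numpy as np

def mcr_als(D, C0, n_iter=50):
    """Minimal MCR-ALS: alternately solve D ~ C @ S.T under non-negativity."""
    C = C0.copy()
    for _ in range(n_iter):
        S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0, None)    # spectra
        C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0, None)  # elution
    return C, S

rng = np.random.default_rng(6)
t = np.linspace(0, 1, 100)[:, None]
C_true = np.hstack([np.exp(-(t - 0.45)**2 / 0.005),   # two co-eluting peaks
                    np.exp(-(t - 0.55)**2 / 0.005)])
S_true = np.abs(rng.normal(size=(80, 2)))             # spectra over 80 channels
D = C_true @ S_true.T + 0.01 * rng.normal(size=(100, 80))

C, S = mcr_als(D, C0=C_true + 0.1)   # initial guess near the truth
print(C.shape, S.shape)              # (100, 2) elution profiles, (80, 2) spectra
```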
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ruebel, Oliver
2009-11-20
Knowledge discovery from large and complex collections of today's scientific datasets is a challenging task. With the ability to measure and simulate more processes at increasingly finer spatial and temporal scales, the growing number of data dimensions and data objects presents tremendous challenges for data analysis and effective data exploration methods and tools. Researchers are overwhelmed with data, and standard tools are often insufficient to enable effective data analysis and knowledge discovery. The main objective of this thesis is to provide important new capabilities to accelerate scientific knowledge discovery from large, complex, and multivariate scientific data. The research covered in this thesis addresses these scientific challenges using a combination of scientific visualization, information visualization, automated data analysis, and other enabling technologies, such as efficient data management. The effectiveness of the proposed analysis methods is demonstrated via applications in two distinct scientific research fields, namely developmental biology and high-energy physics. Advances in microscopy, image analysis, and embryo registration enable for the first time measurement of gene expression at cellular resolution for entire organisms. Analysis of high-dimensional spatial gene expression datasets is a challenging task. By integrating data clustering and visualization, analysis of complex, time-varying, spatial gene expression patterns and their formation becomes possible. The analysis framework has been integrated with MATLAB and the visualization system, making advanced analysis tools accessible to biologists and enabling bioinformatics researchers to directly integrate their analyses with the visualization. Laser wakefield particle accelerators (LWFAs) promise to be a new compact source of high-energy particles and radiation, with wide applications ranging from medicine to physics. To gain insight into the complex physical processes of particle acceleration, physicists model LWFAs computationally. The datasets produced by LWFA simulations are (i) extremely large, (ii) of varying spatial and temporal resolution, (iii) heterogeneous, and (iv) high-dimensional, making analysis and knowledge discovery from complex LWFA simulation data a challenging task. To address these challenges this thesis describes the integration of the visualization system VisIt and the state-of-the-art index/query system FastBit, enabling interactive visual exploration of extremely large three-dimensional particle datasets. Researchers are especially interested in beams of high-energy particles formed during the course of a simulation. This thesis describes novel methods for automatic detection and analysis of particle beams, enabling a more accurate and efficient data analysis process. By integrating these automated analysis methods with visualization, this research enables more accurate, efficient, and effective analysis of LWFA simulation data than previously possible.
Motegi, Hiromi; Tsuboi, Yuuri; Saga, Ayako; Kagami, Tomoko; Inoue, Maki; Toki, Hideaki; Minowa, Osamu; Noda, Tetsuo; Kikuchi, Jun
2015-11-04
There is an increasing need to use multivariate statistical methods for understanding biological functions, identifying the mechanisms of diseases, and exploring biomarkers. In addition to classical analyses such as hierarchical cluster analysis, principal component analysis, and partial least squares discriminant analysis, various multivariate strategies, including independent component analysis, non-negative matrix factorization, and multivariate curve resolution, have recently been proposed. However, determining the number of components is problematic. Despite the proposal of several different methods, no satisfactory approach has yet been reported. To resolve this problem, we implemented a new idea: classifying a component as "reliable" or "unreliable" based on the reproducibility of its appearance, regardless of the number of components in the calculation. Using the clustering method for classification, we applied this idea to multivariate curve resolution-alternating least squares (MCR-ALS). Comparisons between conventional and modified methods applied to proton nuclear magnetic resonance (1H-NMR) spectral datasets derived from known standard mixtures and biological mixtures (urine and feces of mice) revealed that more plausible results are obtained by the modified method. In particular, clusters containing little information were detected with reliability. This strategy, named "cluster-aided MCR-ALS," will facilitate the attainment of more reliable results in metabolomics datasets.
Matos, Larissa A.; Bandyopadhyay, Dipankar; Castro, Luis M.; Lachos, Victor H.
2015-01-01
In biomedical studies on HIV RNA dynamics, viral loads generate repeated measures that are often subjected to upper and lower detection limits, and hence these responses are either left- or right-censored. Linear and non-linear mixed-effects censored (LMEC/NLMEC) models are routinely used to analyse these longitudinal data, with normality assumptions for the random effects and residual errors. However, the derived inference may not be robust when these underlying normality assumptions are questionable, especially in the presence of outliers and heavy tails. Motivated by this, Matos et al. (2013b) recently proposed an exact EM-type algorithm for LMEC/NLMEC models using a multivariate Student's t distribution, with closed-form expressions at the E-step. In this paper, we develop influence diagnostics for LMEC/NLMEC models using the multivariate Student's t density, based on the conditional expectation of the complete data log-likelihood. This partially eliminates the complexity associated with the approach of Cook (1977, 1986) for censored mixed-effects models. The new methodology is illustrated via an application to a longitudinal HIV dataset. In addition, a simulation study explores the accuracy of the proposed measures in detecting possible influential observations for heavy-tailed censored data under different perturbation and censoring schemes. PMID:26190871
Luck, Margaux; Bertho, Gildas; Bateson, Mathilde; Karras, Alexandre; Yartseva, Anastasia; Thervet, Eric
2016-01-01
1H Nuclear Magnetic Resonance (NMR)-based metabolic profiling is very promising for the diagnosis of the stages of chronic kidney disease (CKD). Because of the high dimension of NMR spectral datasets and the complex mixture of metabolites in biological samples, the identification of discriminant biomarkers of a disease is challenging. None of the widely used chemometric methods in NMR metabolomics performs a local exhaustive exploration of the data. We developed a descriptive and easily understandable approach that searches for discriminant local phenomena using an original exhaustive rule-mining algorithm in order to predict two groups of patients: 1) patients having low to mild CKD stages with no renal failure, and 2) patients having moderate to established CKD stages with renal failure. Our predictive algorithm explores the m-dimensional variable space to capture the local overdensities of the two groups of patients in the form of easily interpretable rules. Afterwards, an L2-penalized logistic regression on the discriminant rules was used to build predictive models of the CKD stages. We explored a complex multi-source dataset that included the clinical, demographic, clinical chemistry, renal pathology and urine metabolomic data of a cohort of 110 patients. Given this multi-source dataset and the complex nature of metabolomic data, we analyzed 1- and 2-dimensional rules in order to integrate the information carried by the interactions between the variables. The results indicated that our local algorithm is a valuable analytical method for the precise characterization of multivariate CKD stage profiles, and is as efficient as the classical global model using chi2 variable selection, with approximately 70% correct classification. The resulting predictive models predominantly identify urinary metabolites (such as 3-hydroxyisovalerate, carnitine, citrate, dimethylsulfone, creatinine and N-methylnicotinamide) as relevant variables, indicating that CKD significantly affects the urinary metabolome. In addition, knowledge of urinary metabolite concentrations alone correctly classifies the CKD stage of the patients. PMID:27861591
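A sketch of the rule-based pipeline described: candidate 1-D threshold rules are generated exhaustively over quartile cut-points, and an L2-penalized logistic regression is then fitted on the rule indicators. The simulated 110-patient matrix and the quartile grid are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 110
metabolites = rng.normal(size=(n, 6))              # stand-in for NMR buckets
y = (metabolites[:, 0] > 0.5).astype(int)          # "renal failure" tied to one region

# exhaustive 1-D rules: indicator "metabolite j above its q-th quartile"
cuts = np.percentile(metabolites, [25, 50, 75], axis=0)
rules = np.hstack([(metabolites > cuts[q]).astype(float) for q in range(3)])

clf = LogisticRegression(penalty='l2', C=1.0, max_iter=1000).fit(rules, y)
print(clf.score(rules, y))                         # training accuracy of the rule model
```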
2014-01-01
Background To improve quality of care and patient outcomes, health system decision-makers need to identify and implement effective interventions. An increasing number of systematic reviews document the effects of quality improvement programs to assist decision-makers in developing new initiatives. However, limitations in the reporting of primary studies and current meta-analysis methods (including approaches for exploring heterogeneity) reduce the utility of existing syntheses for health system decision-makers. This study will explore the role of innovative meta-analysis approaches and the added value of enriched and updated data for increasing the utility of systematic reviews of complex interventions. Methods/Design We will use the dataset from our recent systematic review of 142 randomized trials of diabetes quality improvement programs to evaluate novel approaches for exploring heterogeneity. These will include exploratory methods, such as multivariate meta-regression analyses and all-subsets combinatorial meta-analysis. We will then update our systematic review to include new trials and enrich the dataset by surveying the authors of all included trials. In doing so, we will explore the impact of variables not reported in previous publications, such as details of study context, on the effectiveness of the intervention. We will use innovative analytical methods on the enriched and updated dataset to identify key success factors in the implementation of quality improvement interventions for diabetes. Decision-makers will be involved throughout to help identify and prioritize variables to be explored and to aid in the interpretation and dissemination of results. Discussion This study will inform future systematic reviews of complex interventions and describe the value of enriching and updating data for exploring heterogeneity in meta-analysis. It will also result in an updated comprehensive systematic review of diabetes quality improvement interventions that will be useful to health system decision-makers in developing interventions to improve outcomes for people with diabetes. Systematic review registration PROSPERO registration no. CRD42013005165 PMID:25115289
Multivariate multiscale entropy of financial markets
NASA Astrophysics Data System (ADS)
Lu, Yunfan; Wang, Jun
2017-11-01
In quantifying the dynamical properties of complex phenomena in financial market systems, multivariate financial time series are of wide concern. In this work, considering the shortcomings and limitations of univariate multiscale entropy in analyzing multivariate time series, the multivariate multiscale sample entropy (MMSE), which can evaluate the complexity in multiple data channels over different timescales, is applied to quantify the complexity of financial markets. Its effectiveness and advantages are demonstrated in numerical simulations with two well-known synthetic noise signals. For the first time, the complexity of four generated trivariate return series for each stock trading hour in the Chinese stock markets is quantified through the interdisciplinary application of this method. We find that the complexity of the trivariate return series in each hour shows a significant decreasing trend as stock trading time progresses. Further, the shuffled multivariate return series and the absolute multivariate return series are also analyzed. As another new attempt, the complexity of global stock markets (Asia, Europe and America) is quantified by analyzing the multivariate returns from them. Finally, we utilize the multivariate multiscale entropy to assess the relative complexity of normalized multivariate return volatility series with different degrees.
Multivariate Statistical Analysis of Water Quality data in Indian River Lagoon, Florida
NASA Astrophysics Data System (ADS)
Sayemuzzaman, M.; Ye, M.
2015-12-01
The Indian River Lagoon, part of the longest barrier island complex in the United States, is a region of particular concern to environmental scientists because of the rapid rate of human development throughout the region and its geographical position between the colder temperate zone and the warmer sub-tropical zone; surface water quality analysis in this region therefore continually yields new information. In the present study, multivariate statistical procedures were applied to analyze the spatial and temporal water quality in the Indian River Lagoon over the period 1998-2013. Twelve parameters were analyzed at twelve key water monitoring stations in and beside the lagoon using monthly datasets (a total of 27,648 observations). The dataset was treated using cluster analysis (CA), principal component analysis (PCA), and non-parametric trend analysis. The CA was used to cluster the twelve monitoring stations into four groups, with stations with similar surrounding characteristics falling in the same group. The PCA was then applied to each group to find the important water quality parameters. The principal components (PCs) PC1 to PC5 were retained based on explained cumulative variances of 75% to 85% in each cluster group. Nutrient species (phosphorus and nitrogen), salinity, specific conductivity, and erosion factors (TSS, turbidity) were the major variables involved in the construction of the PCs. Statistically significant positive or negative trends and abrupt trend shifts were detected by applying the Mann-Kendall trend test and the Sequential Mann-Kendall (SQMK) test to each individual station for the important water quality parameters. Land use/land cover change patterns, local anthropogenic activities, and extreme climate events such as drought might be associated with these trends. This study presents a multivariate statistical assessment intended to provide better information about surface water quality, so that effective pollution control and management of the surface waters can be undertaken.
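A sketch of the non-parametric trend step, implementing the standard Mann-Kendall test (normal approximation, without tie correction); the drifting example series is invented, and the sequential (SQMK) variant is not shown:

```python
import numpy as np
from scipy import stats

def mann_kendall(x):
    """Mann-Kendall trend test: S statistic and two-sided p value
    (normal approximation, no tie correction)."""
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
    return s, 2 * stats.norm.sf(abs(z))

# drifting monthly series standing in for a water quality parameter
turbidity = np.cumsum(np.random.default_rng(9).normal(0.1, 1.0, 180))
print(mann_kendall(turbidity))   # small p value -> significant monotonic trend
```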
Cross-country transferability of multi-variable damage models
NASA Astrophysics Data System (ADS)
Wagenaar, Dennis; Lüdtke, Stefan; Kreibich, Heidi; Bouwer, Laurens
2017-04-01
Flood damage assessment is often done with simple damage curves based only on flood water depth. Additionally, damage models are often transferred in space and time, e.g. from region to region or from one flood event to another. Validation has shown that depth-damage curve estimates are associated with high uncertainties, particularly when applied in regions outside the area where the data for curve development were collected. Recently, progress has been made with multi-variable damage models created with data-mining techniques, namely Bayesian networks and random forests. However, it is still unknown to what extent and under which conditions model transfers are possible and reliable. Model validations in different countries will provide valuable insights into the transferability of multi-variable damage models. In this study we compare multi-variable models developed on the basis of flood damage datasets from Germany and from the Netherlands. Data from several German floods were collected using computer-aided telephone interviews. Data from the 1993 Meuse flood in the Netherlands are available, based on compensations paid by the government. The Bayesian network and random forest based models are applied and validated in both countries on the basis of the individual datasets. A major challenge was the harmonization of the variables between both datasets, due to factors such as differences in variable definitions and regional and temporal differences in flood hazard and exposure characteristics. Results of the model validations and comparisons in both countries are discussed, particularly with respect to the challenges encountered and possible solutions for improving model transferability.
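The abstract names Bayesian networks and random forests as the data-mining techniques; the sketch below fits the latter on invented flood variables to show the multi-variable (rather than depth-only) setup. All variables and coefficients are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(10)
n = 500
X = np.column_stack([
    rng.uniform(0, 3, n),        # water depth (m)
    rng.uniform(0, 72, n),       # inundation duration (h)
    rng.integers(0, 2, n),       # contamination (0/1)
    rng.uniform(1900, 2010, n),  # building year
])
damage = 0.2 * X[:, 0] + 0.002 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 0.05, n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, damage)
print(np.round(rf.feature_importances_, 2))  # depth dominates, as in depth-damage curves
```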
Finley, Andrew O.; Banerjee, Sudipto; Cook, Bruce D.; Bradford, John B.
2013-01-01
In this paper we detail a multivariate spatial regression model that couples LiDAR, hyperspectral and forest inventory data to predict forest outcome variables at a high spatial resolution. The proposed model is used to analyze forest inventory data collected on the US Forest Service Penobscot Experimental Forest (PEF), ME, USA. In addition to helping meet the regression model's assumptions, results from the PEF analysis suggest that the addition of multivariate spatial random effects improves model fit and predictive ability, compared with two commonly applied modeling approaches. This improvement results from explicitly modeling the covariation among forest outcome variables and spatial dependence among observations through the random effects. Direct application of such multivariate models to even moderately large datasets is often computationally infeasible because of cubic order matrix algorithms involved in estimation. We apply a spatial dimension reduction technique to help overcome this computational hurdle without sacrificing richness in modeling.
A Multivariate Model for the Meta-Analysis of Study Level Survival Data at Multiple Times
ERIC Educational Resources Information Center
Jackson, Dan; Rollins, Katie; Coughlin, Patrick
2014-01-01
Motivated by our meta-analytic dataset involving survival rates after treatment for critical leg ischemia, we develop and apply a new multivariate model for the meta-analysis of study level survival data at multiple times. Our data set involves 50 studies that provide mortality rates at up to seven time points, which we model simultaneously, and…
Keenan, Michael R; Smentkowski, Vincent S; Ulfig, Robert M; Oltman, Edward; Larson, David J; Kelly, Thomas F
2011-06-01
We demonstrate for the first time that multivariate statistical analysis techniques can be applied to atom probe tomography data to estimate the chemical composition of a sample at the full spatial resolution of the atom probe in three dimensions. Whereas the raw atom probe data provide the specific identity of an atom at a precise location, the multivariate results can be interpreted in terms of the probabilities that an atom representing a particular chemical phase is situated there. When aggregated to the size scale of a single atom (∼0.2 nm), atom probe spectral-image datasets are huge and extremely sparse. In fact, the average spectrum will have somewhat less than one total count per spectrum due to imperfect detection efficiency. These conditions, under which the variance in the data is completely dominated by counting noise, test the limits of multivariate analysis, and an extensive discussion of how to extract the chemical information is presented. Efficient numerical approaches to performing principal component analysis (PCA) on these datasets, which may number hundreds of millions of individual spectra, are put forward, and it is shown that PCA can be computed in a few seconds on a typical laptop computer.
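A sketch of PCA-like decomposition at the sparsity scale described, using truncated SVD on a sparse count matrix (mean-centering is skipped here to preserve sparsity, one of the practical compromises such data force); the matrix size and density are invented:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(11)
# 100,000 voxel spectra x 500 mass channels, well under one count per spectrum
counts = sparse.random(100_000, 500, density=0.001, random_state=0,
                       data_rvs=lambda k: rng.integers(1, 3, k)).tocsr()

svd = TruncatedSVD(n_components=5, random_state=0)
scores = svd.fit_transform(counts)            # per-voxel component scores
print(svd.explained_variance_ratio_.round(4))
```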
Lie, Octavian V; van Mierlo, Pieter
2017-01-01
The visual interpretation of intracranial EEG (iEEG) is the standard method used in complex epilepsy surgery cases to map the regions of seizure onset targeted for resection. Still, visual iEEG analysis is labor-intensive and biased due to interpreter dependency. Multivariate parametric functional connectivity measures using adaptive autoregressive (AR) modeling of the iEEG signals based on the Kalman filter algorithm have been used successfully to localize the electrographic seizure onsets. Due to their high computational cost, these methods have been applied to a limited number of iEEG time-series (<60). The aim of this study was to test two Kalman filter implementations, a well-known multivariate adaptive AR model (Arnold et al. 1998) and a simplified, computationally efficient derivation of it, for their potential application to connectivity analysis of high-dimensional (up to 192 channels) iEEG data. When used on simulated seizures together with a multivariate connectivity estimator, the partial directed coherence, the two AR models were compared for their ability to reconstitute the designed seizure signal connections from noisy data. Next, focal seizures from iEEG recordings (73-113 channels) in three patients rendered seizure-free after surgery were mapped with the outdegree, a graph-theory index of outward directed connectivity. Simulation results indicated high levels of mapping accuracy for the two models in the presence of low-to-moderate noise cross-correlation. Accordingly, both AR models correctly mapped the real seizure onset to the resection volume. This study supports the possibility of conducting fully data-driven multivariate connectivity estimations on high-dimensional iEEG datasets using the Kalman filter approach.
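For reference, the partial directed coherence used as the connectivity estimator above is conventionally defined (following Baccalá and Sameshima; the definition is supplied here, not quoted from the abstract) from the fitted AR coefficient matrices $A_1, \dots, A_p$ as

$$\bar{A}(f) = I - \sum_{r=1}^{p} A_r\, e^{-i 2\pi f r}, \qquad \pi_{ij}(f) = \frac{\lvert \bar{A}_{ij}(f) \rvert}{\sqrt{\sum_{k} \lvert \bar{A}_{kj}(f) \rvert^{2}}},$$

so that $\pi_{ij}(f)$ quantifies the directed influence of channel $j$ on channel $i$ at frequency $f$; the outdegree of channel $j$ then aggregates its significant outgoing connections.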
Enhancing e-waste estimates: Improving data quality by multivariate Input–Output Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Feng; Huisman, Jaco
2013-11-15
Highlights:
• A multivariate Input–Output Analysis method for e-waste estimates is proposed.
• Applying multivariate analysis to consolidate data can enhance e-waste estimates.
• We examine the influence of model selection and data quality on e-waste estimates.
• Datasets of all e-waste related variables in a Dutch case study have been provided.
• Accurate modeling of time-variant lifespan distributions is critical for estimation.
Abstract: Waste electrical and electronic equipment (or e-waste) is one of the fastest growing waste streams, encompassing a wide and increasing spectrum of products. Accurate estimation of e-waste generation is difficult, mainly due to a lack of high-quality data on market and socio-economic dynamics. This paper addresses how to enhance e-waste estimates by providing techniques to increase data quality. An advanced, flexible and multivariate Input–Output Analysis (IOA) method is proposed. It links all three pillars in IOA (product sales, stock and lifespan profiles) to construct mathematical relationships between various data points. By applying this method, the data consolidation steps can generate more accurate time-series datasets from the available data pool. This can consequently increase the reliability of e-waste estimates compared to approaches without data processing. A case study in the Netherlands is used to apply the advanced IOA model. As a result, for the first time, complete datasets of all three variables for estimating all types of e-waste have been obtained. The results also demonstrate significant disparity between various estimation models, arising from the use of data under different conditions. This shows the importance of applying a multivariate approach and multiple sources to improve data quality for modelling, specifically using appropriate time-varying lifespan parameters. Following the case study, a roadmap with a procedural guideline is provided to enhance e-waste estimation studies.
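The sales-stock-lifespan logic at the heart of such an IOA model can be sketched as a convolution of historical sales with a lifespan distribution; the Weibull parameters and sales series below are illustrative, and the time-varying lifespans the authors emphasize are not modeled.

```python
# Sketch: units sold in year s become waste in year t with probability
# given by an assumed (static) Weibull lifespan distribution.
import numpy as np
from scipy.stats import weibull_min

years = np.arange(1990, 2014)
sales = np.linspace(200_000, 1_200_000, years.size)   # units/year (assumed)
shape, scale = 2.0, 9.0                               # assumed lifespan params

def waste_generated(sales, shape, scale):
    n = sales.size
    cdf = weibull_min.cdf(np.arange(n + 1), shape, scale=scale)
    discard = np.diff(cdf)                  # P(unit is discarded at age a)
    waste = np.zeros(n)
    for s in range(n):                      # units sold in year s ...
        waste[s:] += sales[s] * discard[:n - s]   # ... fail in later years
    return waste

print(waste_generated(sales, shape, scale).round())
```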
Climate Model Diagnostic Analyzer
NASA Technical Reports Server (NTRS)
Lee, Seungwon; Pan, Lei; Zhai, Chengxing; Tang, Benyang; Kubar, Terry; Zhang, Zia; Wang, Wei
2015-01-01
The comprehensive and innovative evaluation of climate models with newly available global observations is critically needed to improve climate models' representation of the current state and predictability of future states. A climate model diagnostic evaluation process requires physics-based multi-variable analyses that typically involve large-volume and heterogeneous datasets, making them both computation- and data-intensive. Given the exploratory nature of climate data analyses and the explosive growth of datasets and service tools, scientists are struggling to keep track of their datasets, tools, and execution/study history, let alone share them with others. In response, we have developed a cloud-enabled, provenance-supported, web-service system called Climate Model Diagnostic Analyzer (CMDA). CMDA enables physics-based, multivariable model performance evaluations and diagnoses through the comprehensive and synergistic use of multiple observational data, reanalysis data, and model outputs. At the same time, CMDA provides a crowd-sourcing space where scientists can organize their work efficiently and share it with others. CMDA is built on current state-of-the-art software packages for web services, provenance, and semantic search.
NASA Astrophysics Data System (ADS)
Gaitan, S.; ten Veldhuis, J. A. E.
2015-06-01
Cities worldwide are challenged by increasing urban flood risks. Precise and realistic measures are required to reduce flooding impacts. However, currently implemented sewer and topographic models do not provide realistic predictions of local flooding occurrence during heavy rain events. Assessing other factors, such as spatially distributed rainfall, socioeconomic characteristics, and social sensing, may help to explain the probability and impacts of urban flooding. Several spatial datasets have recently been made available in the Netherlands, including rainfall-related incident reports made by citizens, spatially distributed rain depths, semidistributed socioeconomic information, and building age. The potential of these data to explain the occurrence of rainfall-related incidents has not yet been examined. Multivariate analysis tools for describing communities and environmental patterns have previously been developed and used in the field of ecology. The objective of this paper is to outline opportunities for these tools to explore urban flood risk patterns in these datasets. To that end, a cluster analysis is performed. Results indicate that the incidence of rainfall-related impacts is higher in areas characterized by older infrastructure and higher population density.
NASA Astrophysics Data System (ADS)
Shrestha, S. R.; Collow, T. W.; Rose, B.
2016-12-01
Scientific datasets are generated from various sources and platforms but are typically produced either by earth observation systems or by modelling systems. They are widely used for monitoring, simulating, or analyzing measurements associated with physical, chemical, and biological phenomena over the ocean, atmosphere, or land. A significant subset of scientific datasets stores values directly as rasters or in a form that can be rasterized, i.e., a value exists at every cell in a regular grid spanning the spatial extent of the dataset. Government agencies like NOAA, NASA, EPA, and USGS produce large volumes of near real-time, forecast, and historical data that drive climatological and meteorological studies, and underpin operations ranging from weather prediction to sea ice loss. Modern science is computationally intensive because of the availability of an enormous amount of scientific data, the adoption of data-driven analysis, and the need to share these datasets and research results with the public. ArcGIS as a platform is sophisticated and capable of handling such complex domains. We'll discuss constructs and capabilities applicable to multidimensional gridded data that can be conceptualized as a multivariate space-time cube. Building on the concept of a two-dimensional raster, a typical multidimensional raster dataset could contain several "slices" within the same spatial extent. We will share a case from the NOAA Climate Forecast System Reanalysis (CFSR) multidimensional data as an example of how large collections of rasters can be efficiently organized and managed through a geodatabase data model called the "Mosaic dataset" and dynamically transformed and analyzed using raster functions. A raster function is a lightweight, raster-valued transformation defined over a mixed set of raster and scalar inputs; just like any tool, you can provide a raster function with input parameters. It enables dynamic processing of only the data that is being displayed on the screen or requested by an application. We will present the dynamic processing and analysis of CFSR data using chains of raster functions and share it as a dynamic multidimensional image service. This workflow and these capabilities can easily be applied to any scientific data format that is supported in a mosaic dataset.
de Almeida, Valber Elias; de Araújo Gomes, Adriano; de Sousa Fernandes, David Douglas; Goicoechea, Héctor Casimiro; Galvão, Roberto Kawakami Harrop; Araújo, Mario Cesar Ugulino
2018-05-01
This paper proposes a new variable selection method for nonlinear multivariate calibration, combining the Successive Projections Algorithm for interval selection (iSPA) with the Kernel Partial Least Squares (Kernel-PLS) modelling technique. The proposed iSPA-Kernel-PLS algorithm is employed in a case study involving a Vis-NIR spectrometric dataset with complex nonlinear features. The analytical problem consists of determining Brix and sucrose content in samples from a sugar production system, on the basis of transflectance spectra. As compared to full-spectrum Kernel-PLS, the iSPA-Kernel-PLS models involve a smaller number of variables and display statistically significant superiority in terms of accuracy and/or bias in the predictions. Published by Elsevier B.V.
Some Recent Developments on Complex Multivariate Distributions
ERIC Educational Resources Information Center
Krishnaiah, P. R.
1976-01-01
In this paper, the author gives a review of the literature on complex multivariate distributions. Some new results on these distributions are also given. Finally, the author discusses the applications of the complex multivariate distributions in the area of the inference on multiple time series. (Author)
Parallel Planes Information Visualization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bush, Brian
2015-12-26
This software presents a user-provided multivariate dataset as an interactive three dimensional visualization so that the user can explore the correlation between variables in the observations and the distribution of observations among the variables.
Integrative Exploratory Analysis of Two or More Genomic Datasets.
Meng, Chen; Culhane, Aedin
2016-01-01
Exploratory analysis is an essential step in the analysis of high throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high-dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower-dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines.
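A minimal sketch of the CIA computation on two tables measured over the same samples is given below; full CIA applies row and column weights from the separate ordinations of each table, which this bare SVD of the cross-covariance omits, and the data are synthetic.

```python
# Co-inertia-style analysis of two omics tables on the same samples.
import numpy as np

rng = np.random.default_rng(1)
n = 60                                   # samples (e.g., cell lines)
X = rng.normal(size=(n, 500))            # e.g., mRNA expression
Y = rng.normal(size=(n, 300))            # e.g., protein abundance

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# Diagonalize the cross-covariance between the two tables.
U, s, Vt = np.linalg.svd(Xc.T @ Yc / (n - 1), full_matrices=False)
x_scores = Xc @ U[:, :2]                 # sample coordinates from table X
y_scores = Yc @ Vt[:2].T                 # sample coordinates from table Y
print(s[:2] / s.sum())                   # share of co-inertia per axis
```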
Armah, Frederick Ato; Paintsil, Arnold; Yawson, David Oscar; Adu, Michael Osei; Odoi, Justice O
2017-08-01
Chemometric techniques were applied to evaluate the spatial and temporal heterogeneities in groundwater quality data for approximately 740 goldmining and agriculture-intensive locations in Ghana. The strongest linear and monotonic relationships occurred between Mn and Fe. Sixty-nine per cent of total variance in the dataset was explained by four variance factors: physicochemical properties, bacteriological quality, natural geologic attributes and anthropogenic factors (artisanal goldmining). There was evidence of significant differences in means of all trace metals and physicochemical parameters (p < 0.001) between goldmining and non-goldmining locations. Arsenic and turbidity produced very high F values, demonstrating that 'physical properties and chalcophilic elements' was the function that most discriminated between non-goldmining and goldmining locations. Variations in Escherichia coli and total coliforms were observed between the dry and wet seasons. The overall predictive accuracy of the discriminant function showed that non-goldmining locations were classified with slightly better accuracy (89%) than goldmining areas (69.6%). There were significant differences between the underlying distributions of Cd, Mn and Pb in the wet and dry seasons. This study emphasizes the practicality of chemometrics in the assessment and elucidation of complex water quality datasets to promote effective management of groundwater resources for sustaining human health.
Orientation-Enhanced Parallel Coordinate Plots.
Raidou, Renata Georgia; Eisemann, Martin; Breeuwer, Marcel; Eisemann, Elmar; Vilanova, Anna
2016-01-01
Parallel Coordinate Plots (PCPs) are one of the most powerful techniques for the visualization of multivariate data. However, for large datasets, the representation suffers from clutter due to overplotting. In this case, discerning the underlying data information and selecting specific interesting patterns can become difficult. We propose a new and simple technique to improve the display of PCPs by emphasizing the underlying data structure. Our Orientation-enhanced Parallel Coordinate Plots (OPCPs) improve pattern and outlier discernibility by visually enhancing parts of each PCP polyline with respect to its slope. This enhancement also allows us to introduce a novel and efficient selection method, Orientation-enhanced Brushing (O-Brushing). Our solution is particularly useful when multiple patterns are present or when the view on certain patterns is obstructed by noise. We present the results of our approach with several synthetic and real-world datasets. Finally, we conducted a user evaluation, which verifies the advantages of the OPCPs in terms of discernibility of information in complex data. It also confirms that O-Brushing eases the selection of data patterns in PCPs and reduces the amount of necessary user interactions compared to state-of-the-art brushing techniques.
Learning multivariate distributions by competitive assembly of marginals.
Sánchez-Vega, Francisco; Younes, Laurent; Geman, Donald
2013-02-01
We present a new framework for learning high-dimensional multivariate probability distributions from estimated marginals. The approach is motivated by compositional models and Bayesian networks, and designed to adapt to small sample sizes. We start with a large, overlapping set of elementary statistical building blocks, or "primitives," which are low-dimensional marginal distributions learned from data. Each variable may appear in many primitives. Subsets of primitives are combined in a Lego-like fashion to construct a probabilistic graphical model; only a small fraction of the primitives will participate in any valid construction. Since primitives can be precomputed, parameter estimation and structure search are separated. Model complexity is controlled by strong biases; we adapt the primitives to the amount of training data and impose rules that restrict their merging into allowable compositions. The likelihood of the data decomposes into a sum of local gains, one for each primitive in the final structure. We focus on a specific subclass of networks which are binary forests. Structure optimization corresponds to an integer linear program and the maximizing composition can be computed for reasonably large numbers of variables. Performance is evaluated using both synthetic data and real datasets from natural language processing and computational biology.
Dissecting the space-time structure of tree-ring datasets using the partial triadic analysis.
Rossi, Jean-Pierre; Nardin, Maxime; Godefroid, Martin; Ruiz-Diaz, Manuela; Sergent, Anne-Sophie; Martinez-Meier, Alejandro; Pâques, Luc; Rozenberg, Philippe
2014-01-01
Tree-ring datasets are used in a variety of circumstances, including archeology, climatology, forest ecology, and wood technology. These data are based on microdensity profiles and consist of a set of tree-ring descriptors, such as ring width or early/latewood density, measured for a set of individual trees. Because successive rings correspond to successive years, the resulting dataset is a ring variables × trees × time datacube. Multivariate statistical analyses, such as principal component analysis, have been widely used for extracting worthwhile information from ring datasets, but they typically address two-way matrices, such as ring variables × trees or ring variables × time. Here, we explore the potential of the partial triadic analysis (PTA), a multivariate method dedicated to the analysis of three-way datasets, to apprehend the space-time structure of tree-ring datasets. We analyzed a set of 11 tree-ring descriptors measured in 149 georeferenced individuals of European larch (Larix decidua Miller) during the period of 1967-2007. The processing of densitometry profiles led to a set of ring descriptors for each tree and for each year from 1967-2007. The resulting three-way data table was subjected to two distinct analyses in order to explore i) the temporal evolution of spatial structures and ii) the spatial structure of temporal dynamics. We report the presence of a spatial structure common to the different years, highlighting the inter-individual variability of the ring descriptors at the stand scale. We found a temporal trajectory common to the trees that could be separated into a high and low frequency signal, corresponding to inter-annual variations possibly related to defoliation events and a long-term trend possibly related to climate change. We conclude that PTA is a powerful tool to unravel and hierarchize the different sources of variation within tree-ring datasets.
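A bare-bones sketch of the PTA workflow (interstructure weights, compromise table, PCA of the compromise) might look as follows; the data are synthetic, the dimensions are taken from the study, and the weighting scheme is simplified relative to a full PTA.

```python
# Simplified partial triadic analysis of a years x trees x variables cube.
import numpy as np

rng = np.random.default_rng(2)
n_years, n_trees, n_vars = 41, 149, 11      # 1967-2007, as in the study
cube = rng.normal(size=(n_years, n_trees, n_vars))

# Interstructure: similarities between flattened, centered yearly tables.
tables = [t - t.mean(axis=0) for t in cube]
flat = np.stack([t.ravel() / np.linalg.norm(t) for t in tables])
S = flat @ flat.T                           # (n_years, n_years)
w = np.linalg.eigh(S)[1][:, -1]             # leading eigenvector -> weights
w = np.abs(w) / np.abs(w).sum()             # sign-fixed, normalized

# Compromise: weighted average table, summarized by an ordinary PCA/SVD.
compromise = sum(wk * tk for wk, tk in zip(w, tables))
U, s, Vt = np.linalg.svd(compromise, full_matrices=False)
tree_scores = U[:, :2] * s[:2]              # common inter-tree structure
```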
A conditional Granger causality model approach for group analysis in functional MRI
Zhou, Zhenyu; Wang, Xunheng; Klahr, Nelson J.; Liu, Wei; Arias, Diana; Liu, Hongzhi; von Deneen, Karen M.; Wen, Ying; Lu, Zuhong; Xu, Dongrong; Liu, Yijun
2011-01-01
Granger causality modeling (GCM) derived from multivariate vector autoregressive models of data has been employed to identify effective connectivity in the human brain with functional MR imaging (fMRI) and to reveal complex temporal and spatial dynamics underlying a variety of cognitive processes. In the most recent fMRI effective connectivity measures, pairwise GCM has commonly been applied based on single voxel values or average values from specific brain areas at the group level. Although a few novel conditional GCM methods have been proposed to quantify the connections between brain areas, our study is the first to propose a viable standardized approach for group analysis of fMRI data with GCM. To compare the effectiveness of our approach with traditional pairwise GCM models, we applied a well-established conditional GCM to pre-selected time series of brain regions resulting from general linear model (GLM) analysis and group spatial kernel independent component analysis (ICA) of an fMRI dataset in the temporal domain. Datasets consisting of one task-related and one resting-state fMRI acquisition were used to investigate connections among brain areas with the conditional GCM method. Using the brain activation regions detected by GLM in the emotion-related cortex during the block-design paradigm, the conditional GCM method was applied to study the causality of habituation between the left amygdala and pregenual cingulate cortex during emotion processing. For the resting-state dataset, it is possible to calculate not only the effective connectivity between networks but also the heterogeneity within a single network. Our results have further shown a particular interacting pattern of the default mode network (DMN) that can be characterized as both afferent and efferent influences on the medial prefrontal cortex (mPFC) and posterior cingulate cortex (PCC). These results suggest that the conditional GCM approach based on a linear multivariate vector autoregressive (MVAR) model can achieve greater accuracy in detecting network connectivity than the widely used pairwise GCM, and that this group analysis methodology can be quite useful for extending the information obtainable from fMRI. PMID:21232892
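The core contrast between restricted and full autoregressive fits that underlies conditional Granger causality can be sketched as below; this static least-squares version, with invented series and lag order, is a stand-in for the fMRI pipeline described above.

```python
# Conditional Granger causality x -> y given z via residual-variance ratio.
import numpy as np

def lagmat(x, p):
    """Stack lags 1..p of the columns of x; returns (T-p, p*n_cols)."""
    T = x.shape[0]
    return np.hstack([x[p - k:T - k] for k in range(1, p + 1)])

def cond_granger(y, x, z, p=2):
    """GC index for x -> y conditional on z: ln(var_restricted/var_full)."""
    target = y[p:]
    def resid_var(*series):
        Z = lagmat(np.column_stack(series), p)
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        return np.var(target - Z @ beta)
    return np.log(resid_var(y, z) / resid_var(y, x, z))

rng = np.random.default_rng(1)
T = 2000
x, z = rng.normal(size=T), rng.normal(size=T)
y = 0.3 * rng.normal(size=T)
y[1:] += 0.6 * x[:-1]                      # x drives y at lag 1
print(cond_granger(y, x, z))               # clearly positive
print(cond_granger(x, y, z))               # near zero
```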
Prolonged instability prior to a regime shift
Spanbauer, Trisha; Allen, Craig R.; Angeler, David G.; Eason, Tarsha; Fritz, Sherilyn C.; Garmestani, Ahjond S.; Nash, Kirsty L.; Stone, Jeffery R.
2014-01-01
Regime shifts are generally defined as the point of ‘abrupt’ change in the state of a system. However, a seemingly abrupt transition can be the product of a system reorganization that has been ongoing much longer than is evident in statistical analysis of a single component of the system. Using both univariate and multivariate statistical methods, we tested a long-term high-resolution paleoecological dataset with a known change in species assemblage for a regime shift. Analysis of this dataset with Fisher Information and multivariate time series modeling showed that there was a ∼2000-year period of instability prior to the regime shift. This period of instability and the subsequent regime shift coincide with regional climate change, indicating that the system is undergoing extrinsic forcing. Paleoecological records offer a unique opportunity to test tools for the detection of thresholds and stable-states, and thus to examine the long-term stability of ecosystems over periods of multiple millennia.
I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chard, Kyle; D'Arcy, Mike; Heavner, Benjamin D.
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified. We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.
NASA Astrophysics Data System (ADS)
Golay, Jean; Kanevski, Mikhaïl
2013-04-01
The present research deals with the exploration and modeling of a complex dataset of 200 measurement points of sediment pollution by heavy metals in Lake Geneva. The fundamental idea was to use multivariate Artificial Neural Networks (ANN) along with geostatistical models and tools in order to improve the accuracy and the interpretability of data modeling. The results obtained with ANN were compared to those of traditional geostatistical algorithms like ordinary (co)kriging and (co)kriging with an external drift. Exploratory data analysis highlighted a great variety of relationships (i.e. linear, non-linear, independence) between the 11 variables of the dataset (i.e. Cadmium, Mercury, Zinc, Copper, Titanium, Chromium, Vanadium and Nickel as well as the spatial coordinates of the measurement points and their depth). Then, exploratory spatial data analysis (i.e. anisotropic variography, local spatial correlations and moving window statistics) was carried out. It was shown that the different phenomena to be modeled were characterized by high spatial anisotropies, complex spatial correlation structures and heteroscedasticity. A feature selection procedure based on General Regression Neural Networks (GRNN) was also applied to create subsets of variables enabling improved predictions during the modeling phase. The basic modeling was conducted using a Multilayer Perceptron (MLP), a workhorse of ANN. MLP models are robust and highly flexible tools which can incorporate, in a nonlinear manner, different kinds of high-dimensional information. In the present research, the input layer was made of either two (spatial coordinates) or three neurons (when depth as auxiliary information could possibly capture an underlying trend) and the output layer was composed of one (univariate MLP) to eight neurons corresponding to the heavy metals of the dataset (multivariate MLP). MLP models with three input neurons can be referred to as Artificial Neural Networks with EXternal drift (ANNEX). Moreover, the exact number of output neurons and the selection of the corresponding variables were based on the subsets created during the exploratory phase. Concerning hidden layers, no restrictions were made and multiple architectures were tested. For each MLP model, the quality of the modeling procedure was assessed by variograms: if the variogram of the residuals demonstrates a pure nugget effect and if the level of the nugget exactly corresponds to the nugget value of the theoretical variogram of the corresponding variable, all the structured information has been correctly extracted without overfitting. It is also worth mentioning that simple MLP models are not always able to remove all the spatial correlation structure from the data. In that case, Neural Network Residual Kriging (NNRK) can be carried out and risk assessment can be conducted with Neural Network Residual Simulations (NNRS). Finally, the results of the ANNEX models were compared to those of ordinary (co)kriging and (co)kriging with an external drift. It was shown that the ANNEX models performed better than traditional geostatistical algorithms when the relationship between the variable of interest and the auxiliary predictor was not linear. References: Kanevski, M. and Maignan, M. (2004). Analysis and Modelling of Spatial Environmental Data. Lausanne: EPFL Press.
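A toy version of the ANNEX idea, an MLP taking coordinates plus depth as an external drift, could look like the following; the data, coordinate ranges, and network size are assumptions, and only a single output (one metal) is shown rather than the multivariate case.

```python
# ANNEX-style MLP: spatial coordinates plus an auxiliary drift variable.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
n = 200                                    # measurement points, as in the study
coords = rng.uniform(0, 50, size=(n, 2))   # km, illustrative extent
depth = rng.uniform(40, 300, size=(n, 1))  # auxiliary drift variable
X = np.hstack([coords, depth])
# Synthetic "Cadmium" with a depth trend plus a spatial signal.
cd = 0.002 * depth[:, 0] + 0.05 * np.sin(coords[:, 0] / 7) \
     + rng.normal(0, 0.05, n)

annex = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=5000, random_state=0))
annex.fit(X, cd)
# The paper's model check is a pure-nugget residual variogram; here we
# only inspect the residual variance.
print(np.var(cd - annex.predict(X)))
```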
Schroeder, David; Keefe, Daniel F
2016-01-01
We present Visualization-by-Sketching, a direct-manipulation user interface for designing new data visualizations. The goals are twofold: First, make the process of creating real, animated, data-driven visualizations of complex information more accessible to artists, graphic designers, and other visual experts with traditional, non-technical training. Second, support and enhance the role of human creativity in visualization design, enabling visual experimentation and workflows similar to what is possible with traditional artistic media. The approach is to conceive of visualization design as a combination of processes that are already closely linked with visual creativity: sketching, digital painting, image editing, and reacting to exemplars. Rather than studying and tweaking low-level algorithms and their parameters, designers create new visualizations by painting directly on top of a digital data canvas, sketching data glyphs, and arranging and blending together multiple layers of animated 2D graphics. This requires new algorithms and techniques to interpret painterly user input relative to data "under" the canvas, balance artistic freedom with the need to produce accurate data visualizations, and interactively explore large (e.g., terabyte-sized) multivariate datasets. Results demonstrate a variety of multivariate data visualization techniques can be rapidly recreated using the interface. More importantly, results and feedback from artists support the potential for interfaces in this style to attract new, creative users to the challenging task of designing more effective data visualizations and to help these users stay "in the creative zone" as they work.
Peikert, Tobias; Duan, Fenghai; Rajagopalan, Srinivasan; Karwoski, Ronald A; Clay, Ryan; Robb, Richard A; Qin, Ziling; Sicks, JoRean; Bartholmai, Brian J; Maldonado, Fabien
2018-01-01
Optimization of the clinical management of screen-detected lung nodules is needed to avoid unnecessary diagnostic interventions. Herein we demonstrate the potential value of a novel radiomics-based approach for the classification of screen-detected indeterminate nodules. Independent quantitative variables assessing various radiologic nodule features such as sphericity, flatness, elongation, spiculation, lobulation and curvature were developed from the NLST dataset using 726 indeterminate nodules (all ≥ 7 mm; benign, n = 318 and malignant, n = 408). Multivariate analysis was performed using the least absolute shrinkage and selection operator (LASSO) method for variable selection and regularization in order to enhance the prediction accuracy and interpretability of the multivariate model. The bootstrapping method was then applied for internal validation, and the optimism-corrected AUC was reported for the final model. Eight of the originally considered 57 quantitative radiologic features were selected by LASSO multivariate modeling. These 8 features include variables capturing Location: vertical location (Offset carina centroid z); Size: volume estimate (Minimum enclosing brick); Shape: flatness; Density: texture analysis (Score Indicative of Lesion/Lung Aggression/Abnormality (SILA) texture); and surface characteristics: surface complexity (Maximum shape index and Average shape index) and estimates of surface curvature (Average positive mean curvature and Minimum mean curvature), all with P<0.01. The optimism-corrected AUC for these 8 features is 0.939. Our novel radiomic LDCT-based approach for indeterminate screen-detected nodule characterization appears extremely promising; however, independent external validation is needed.
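The modeling strategy (L1-penalized selection plus bootstrap optimism correction of the AUC) can be sketched as follows on synthetic stand-ins for the radiomic features; the penalty strength, effect sizes, and bootstrap count are assumptions.

```python
# LASSO-penalized logistic model with Harrell-style optimism correction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n, p = 726, 57                       # nodules x candidate radiomic features
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:8] = 0.7   # 8 truly informative features
y = (X @ beta + rng.logistic(size=n) > 0).astype(int)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)
apparent = roc_auc_score(y, lasso.decision_function(X))

# Bootstrap optimism: refit on resamples, compare boot AUC vs original AUC.
optimism = []
for _ in range(100):
    idx = rng.integers(0, n, n)
    m = LogisticRegression(penalty="l1", solver="liblinear",
                           C=0.1).fit(X[idx], y[idx])
    optimism.append(roc_auc_score(y[idx], m.decision_function(X[idx]))
                    - roc_auc_score(y, m.decision_function(X)))
print(apparent - np.mean(optimism))  # optimism-corrected AUC
```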
External Influences on Modeled and Observed Cloud Trends
NASA Technical Reports Server (NTRS)
Marvel, Kate; Zelinka, Mark; Klein, Stephen A.; Bonfils, Celine; Caldwell, Peter; Doutriaux, Charles; Santer, Benjamin D.; Taylor, Karl E.
2015-01-01
Understanding the cloud response to external forcing is a major challenge for climate science. This crucial goal is complicated by intermodel differences in simulating present and future cloud cover and by observational uncertainty. This is the first formal detection and attribution study of cloud changes over the satellite era. Presented herein are model-derived fingerprints of externally forced changes to three cloud properties: the latitudes at which the zonally averaged total cloud fraction (CLT) is maximized or minimized, the zonal average CLT at these latitudes, and the height of high clouds at these latitudes. By considering simultaneous changes in all three properties, the authors define a coherent multivariate fingerprint of cloud response to external forcing and use models from phase 5 of the Coupled Model Intercomparison Project (CMIP5) to calculate the average time to detect these changes. It is found that given perfect satellite cloud observations beginning in 1983, the models indicate that a detectable multivariate signal should have already emerged. A search is then made for signals of external forcing in two observational datasets: ISCCP (International Satellite Cloud Climatology Project) and PATMOS-x (Advanced Very High Resolution Radiometer (AVHRR) Pathfinder Atmospheres - Extended). Both datasets are found to show a poleward migration of the zonal CLT pattern that is incompatible with forced CMIP5 models. Nevertheless, a detectable multivariate signal is predicted by models over the PATMOS-x time period and is indeed present in the dataset. Despite persistent observational uncertainties, these results present a strong case for continued efforts to improve these existing satellite observations, in addition to planning for new missions.
Identifying Talent in Youth Sport: A Novel Methodology Using Higher-Dimensional Analysis.
Till, Kevin; Jones, Ben L; Cobley, Stephen; Morley, David; O'Hara, John; Chapman, Chris; Cooke, Carlton; Beggs, Clive B
2016-01-01
Prediction of adult performance from early age talent identification in sport remains difficult. Talent identification research has generally been performed using univariate analysis, which ignores multivariate relationships. To address this issue, this study used a novel higher-dimensional model to orthogonalize multivariate anthropometric and fitness data from junior rugby league players, with the aim of differentiating future career attainment. Anthropometric and fitness data from 257 Under-15 rugby league players were collected. Players were grouped retrospectively according to their future career attainment (i.e., amateur, academy, professional). Players were blindly and randomly divided into an exploratory (n = 165) and validation dataset (n = 92). The exploratory dataset was used to develop and optimize a novel higher-dimensional model, which combined singular value decomposition (SVD) with receiver operating characteristic analysis. Once optimized, the model was tested using the validation dataset. SVD analysis revealed 60 m sprint and agility 505 performance were the most influential characteristics in distinguishing future professional players from amateur and academy players. The exploratory dataset model was able to distinguish between future amateur and professional players with a high degree of accuracy (sensitivity = 85.7%, specificity = 71.1%; p<0.001), although it could not distinguish between future professional and academy players. The validation dataset model was able to distinguish future professionals from the rest with reasonable accuracy (sensitivity = 83.3%, specificity = 63.8%; p = 0.003). Through the use of SVD analysis it was possible to objectively identify criteria to distinguish future career attainment with a sensitivity over 80% using anthropometric and fitness data alone. As such, this suggests that SVD analysis may be a useful analysis tool for research and practice within talent identification.
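A compact sketch of the SVD-plus-ROC pipeline on synthetic data is shown below; the test battery, outcome model, and use of only the first singular component are invented for illustration.

```python
# Orthogonalize a fitness battery with SVD, then pick an ROC threshold.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(6)
n = 257                                    # players, as above
tests = rng.normal(size=(n, 12))           # anthropometric/fitness battery
professional = (tests[:, 0] - tests[:, 1]
                + rng.normal(0, 1.5, n) > 1).astype(int)

Z = (tests - tests.mean(axis=0)) / tests.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
score = U[:, 0] * s[0]                     # first singular-vector score

fpr, tpr, thr = roc_curve(professional, score)
best = np.argmax(tpr - fpr)                # Youden's J statistic
print("AUC:", roc_auc_score(professional, score),
      "cut-off:", thr[best], "sens:", tpr[best], "spec:", 1 - fpr[best])
```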
Using Fisher information to track stability in multivariate systems
With the current proliferation of data, the proficient use of statistical and mining techniques offers substantial benefits to capture useful information from any dataset. As numerous approaches make use of information theory concepts, here, we discuss how Fisher information (FI...
NASA Astrophysics Data System (ADS)
Ekenes, K.
2017-12-01
This presentation will outline the process of creating a web application for exploring large amounts of scientific geospatial data using modern automated cartographic techniques. Traditional cartographic methods, including data classification, may inadvertently hide geospatial and statistical patterns in the underlying data. This presentation demonstrates how to use smart web APIs that quickly analyze the data when it loads and provide suggestions for the most appropriate visualizations based on the statistics of the data. Since there are only a few ways to visualize any given dataset well, and many users never go beyond default values, it is imperative to provide smart default color schemes tailored to the dataset rather than static defaults. Multiple functions for automating visualizations are available in the smart APIs, along with UI elements allowing users to create more than one visualization for a dataset, since there isn't a single best way to visualize a given dataset. Since bivariate and multivariate visualizations are particularly difficult to create effectively, this automated approach takes the guesswork out of the process and provides a number of ways to generate multivariate visualizations for the same variables. This allows the user to choose which visualization is most appropriate for their presentation. The methods used in these APIs and the renderers generated by them are not available elsewhere. The presentation will show how statistics can be used as the basis for automating default visualizations of data along continuous ramps, creating more refined visualizations while revealing the spread and outliers of the data. Adding interactive components to instantaneously alter visualizations allows users to unearth previously unknown spatial patterns among one or more variables. These applications may focus on a single dataset that is frequently updated, or be configurable for a variety of datasets from multiple sources.
NCAR's Research Data Archive: OPeNDAP Access for Complex Datasets
NASA Astrophysics Data System (ADS)
Dattore, R.; Worley, S. J.
2014-12-01
Many datasets have complex structures including hundreds of parameters and numerous vertical levels, grid resolutions, and temporal products. Making these data accessible is a challenge for a data provider. OPeNDAP is a powerful protocol for delivering multi-file datasets in real time so they can be ingested by many analysis and visualization tools, but for these datasets there are too many choices about how to aggregate. Simple aggregation schemes can fail to support, or at least severely complicate, many potential studies based on complex datasets. We address this issue by using a rich file content metadata collection to create a real-time customized OPeNDAP service that matches the full suite of access possibilities for complex datasets. The Climate Forecast System Reanalysis (CFSR) and its extension, the Climate Forecast System Version 2 (CFSv2), produced by the National Centers for Environmental Prediction (NCEP) and hosted by the Research Data Archive (RDA) at the Computational and Information Systems Laboratory (CISL) at NCAR, are examples of complex datasets that are difficult to aggregate with existing data server software. CFSR and CFSv2 contain 141 distinct parameters on 152 vertical levels, six grid resolutions and 36 products (analyses, n-hour forecasts, multi-hour averages, etc.), where not all parameter/level combinations are available at all grid resolution/product combinations. These data are archived in the RDA with the data structure provided by the producer; no additional re-organization or aggregation has been applied. Since 2011, users have been able to request customized subsets (e.g., temporal, parameter, spatial) from the CFSR/CFSv2, which are processed in delayed mode and then downloaded to a user's system. Until now, the complexity has made it difficult to provide real-time OPeNDAP access to the data. We have developed a service that leverages the already-existing subsetting interface and allows users to create a virtual dataset with its own structure (das, dds). The user receives a URL to the customized dataset that can be used by existing tools to ingest, analyze, and visualize the data. This presentation will detail the metadata system and OPeNDAP server that enable user-customized real-time access and show an example of how a visualization tool can access the data.
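On the client side, such a virtual dataset behaves like any OPeNDAP endpoint; a hedged sketch using xarray follows, where the URL and variable name are hypothetical and an OPeNDAP-capable backend (e.g., netCDF4 or pydap) is assumed to be installed.

```python
# Lazy access to a (hypothetical) customized OPeNDAP dataset with xarray.
import xarray as xr

url = "https://rda.example.edu/opendap/my-custom-cfsr-subset"  # hypothetical
ds = xr.open_dataset(url)                 # lazy: only metadata is fetched
t2m = ds["TMP_2m"].sel(time="2010-01")    # assumed variable name
print(t2m.mean(dim=("lat", "lon")).values)  # data transferred on demand
```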
Geladi, Paul; Nelson, Andrew; Lindholm-Sethson, Britta
2007-07-09
Electrical impedance gives multivariate complex number data as results. Two examples of multivariate electrical impedance data measured on lipid monolayers in different solutions give rise to matrices (16x50 and 38x50) of complex numbers. Multivariate data analysis by principal component analysis (PCA) or singular value decomposition (SVD) can be used for complex data and the necessary equations are given. The scores and loadings obtained are vectors of complex numbers. It is shown that the complex number PCA and SVD are better at concentrating information in a few components than the naïve juxtaposition method and that Argand diagrams can replace score and loading plots. Different concentrations of Magainin and Gramicidin A give different responses and also the role of the electrolyte medium can be studied. An interaction of Gramicidin A in the solution with the monolayer over time can be observed.
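Because NumPy's SVD operates on complex matrices directly, the complex-number PCA described here is straightforward to sketch; the matrix below is random rather than measured impedance data, and its shape echoes the first example above.

```python
# PCA/SVD of a complex data matrix; scores and loadings are complex and
# can be displayed as Argand diagrams (real part vs imaginary part).
import numpy as np

rng = np.random.default_rng(7)
Z = rng.normal(size=(16, 50)) + 1j * rng.normal(size=(16, 50))
Zc = Z - Z.mean(axis=0)                 # column-center the complex matrix

U, s, Vh = np.linalg.svd(Zc, full_matrices=False)
scores = U[:, :2] * s[:2]               # complex scores, components 1-2
loadings = Vh[:2].conj().T              # complex loadings

# Argand-diagram coordinates for component 1 (first three samples).
print(np.column_stack([scores[:, 0].real, scores[:, 0].imag])[:3])
```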
Jia, Erik; Chen, Tianlu
2018-01-01
Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered missing not at random (MNAR). Improper data processing procedures for missing values will adversely impact subsequent statistical analyses. However, few imputation methods have been developed and applied to the situation of MNAR in the field of metabolomics. Thus, a practical left-censored missing value imputation method is urgently needed. We developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulation dataset using our imputation evaluation pipeline. The results show that GSimp outperforms the other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for dealing with large scale metabolomics datasets. The R code for GSimp, the evaluation pipeline, a tutorial, and the real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp. PMID:29385130
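A much-simplified sketch of left-censored imputation, drawing sub-LOD values from a truncated normal, is shown below; it conveys the flavor of GSimp but is not the published algorithm, which embeds such truncated draws in a Gibbs sampler with regression on the other metabolites.

```python
# Draw values below the limit of detection (LOD) from a normal fitted to
# the observed values and truncated above at the LOD.
import numpy as np
from scipy.stats import truncnorm

def impute_left_censored(x, lod, seed=0):
    """x: 1-D array with NaN marking values below the LOD."""
    obs = x[~np.isnan(x)]
    mu, sd = obs.mean(), obs.std(ddof=1)
    b = (lod - mu) / sd                    # standardized upper bound at LOD
    draws = truncnorm.rvs(-np.inf, b, loc=mu, scale=sd,
                          size=int(np.isnan(x).sum()), random_state=seed)
    out = x.copy()
    out[np.isnan(out)] = draws
    return out

x = np.array([5.2, 4.8, np.nan, 6.1, np.nan, 5.5])   # NaNs: below LOD
print(impute_left_censored(x, lod=4.5))
```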
Limitations of climatic data for inferring species boundaries: insights from speckled rattlesnakes.
Meik, Jesse M; Streicher, Jeffrey W; Lawing, A Michelle; Flores-Villela, Oscar; Fujita, Matthew K
2015-01-01
Phenotypes, DNA, and measures of ecological differences are widely used in species delimitation. Although rarely defined in such studies, ecological divergence is almost always approximated using multivariate climatic data associated with sets of specimens (i.e., the "climatic niche"); the justification for this approach is that species-specific climatic envelopes act as surrogates for physiological tolerances. Using identical statistical procedures, we evaluated the usefulness and validity of the climate-as-proxy assumption by comparing performance of genetic (nDNA SNPs and mitochondrial DNA), phenotypic, and climatic data for objective species delimitation in the speckled rattlesnake (Crotalus mitchellii) complex. Ordination and clustering patterns were largely congruent among intrinsic (heritable) traits (nDNA, mtDNA, phenotype), and discordance is explained by biological processes (e.g., ontogeny, hybridization). In contrast, climatic data did not produce biologically meaningful clusters that were congruent with any intrinsic dataset, but rather corresponded to regional differences in atmospheric circulation and climate, indicating an absence of inherent taxonomic signal in these data. Surrogating climate for physiological tolerances adds artificial weight to evidence of species boundaries, as these data are irrelevant for that purpose. Based on the evidence from congruent clustering of intrinsic datasets, we recommend that three subspecies of C. mitchellii be recognized as species: C. angelensis, C. mitchellii, and C. pyrrhus.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sreepathi, Sarat; Kumar, Jitendra; Mills, Richard T.
A proliferation of data from vast networks of remote sensing platforms (satellites, unmanned aircraft systems (UAS), airborne etc.), observational facilities (meteorological, eddy covariance etc.), state-of-the-art sensors, and simulation models offers unprecedented opportunities for scientific discovery. Unsupervised classification is a widely applied data mining approach to derive insights from such data. However, classification of very large data sets is a complex computational problem that requires efficient numerical algorithms and implementations on high performance computing (HPC) platforms. Additionally, increasing power, space, cooling and efficiency requirements have led to the deployment of hybrid supercomputing platforms with complex architectures and memory hierarchies like the Titan system at Oak Ridge National Laboratory. The advent of such accelerated computing architectures offers new challenges and opportunities for big data analytics in general and, specifically, for large scale cluster analysis in our case. Although there is an existing body of work on parallel cluster analysis, those approaches do not fully meet the needs imposed by the nature and size of our large data sets. Moreover, they had scaling limitations and were mostly limited to traditional distributed memory computing platforms. We present a parallel Multivariate Spatio-Temporal Clustering (MSTC) technique based on k-means cluster analysis that can target hybrid supercomputers like Titan. We developed a hybrid MPI, CUDA and OpenACC implementation that can utilize both CPU and GPU resources on computational nodes. We describe performance results on Titan that demonstrate the scalability and efficacy of our approach in processing large ecological data sets.
Analysis models for the estimation of oceanic fields
NASA Technical Reports Server (NTRS)
Carter, E. F.; Robinson, A. R.
1987-01-01
A general model for statistically optimal estimates is presented for dealing with scalar, vector and multivariate datasets. The method deals with anisotropic fields and treats space and time dependence equivalently. Problems addressed include the analysis, or the production of synoptic time series of regularly gridded fields from irregular and gappy datasets, and the estimation of fields by compositing observations from several different instruments and sampling schemes. Technical issues are discussed, including the convergence of statistical estimates, the choice of representation of the correlations, the influential domain of an observation, and the efficiency of numerical computations.
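The Gauss-Markov estimate underlying such analysis models can be sketched for an isotropic spatial covariance as follows; the covariance form, length scale, and noise level are assumptions, and the anisotropy and time dependence treated in the paper are omitted.

```python
# Optimal interpolation: irregular, gappy observations -> regular grid.
import numpy as np

def oi(grid_pts, obs_pts, obs, length=2.0, noise=0.1):
    def cov(a, b):  # assumed isotropic Gaussian covariance
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return np.exp(-(d / length) ** 2)
    C_go = cov(grid_pts, obs_pts)                 # grid-to-obs covariance
    C_oo = cov(obs_pts, obs_pts) + noise * np.eye(len(obs))
    return C_go @ np.linalg.solve(C_oo, obs)      # Gauss-Markov estimate

rng = np.random.default_rng(8)
obs_pts = rng.uniform(0, 10, size=(40, 2))        # irregular sampling
obs = np.sin(obs_pts[:, 0]) + 0.1 * rng.normal(size=40)
gx, gy = np.meshgrid(np.linspace(0, 10, 25), np.linspace(0, 10, 25))
grid = np.column_stack([gx.ravel(), gy.ravel()])
field = oi(grid, obs_pts, obs).reshape(25, 25)    # synoptic gridded map
```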
Vision-Based Real-Time Traversable Region Detection for Mobile Robot in the Outdoors.
Deng, Fucheng; Zhu, Xiaorui; He, Chao
2017-09-13
Environment perception is essential for autonomous mobile robots in human-robot coexisting outdoor environments. One of the important tasks for such intelligent robots is to autonomously detect the traversable region in an unstructured 3D real world. The main drawback of most existing methods is their high computational complexity. Hence, this paper proposes a binocular vision-based, real-time solution for detecting the traversable region in the outdoors. In the proposed method, an appearance model based on a multivariate Gaussian is quickly constructed from a sample region in the left image, adaptively determined by the vanishing point and dominant borders. Then, a fast, self-supervised segmentation scheme is proposed to classify the traversable and non-traversable regions. The proposed method is evaluated on public datasets as well as a real mobile robot. Implementation on the mobile robot has shown its ability in real-time navigation applications.
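A sketch of the appearance-model step, fitting a multivariate Gaussian to a sample region and thresholding Mahalanobis distances, is given below; the sample window and threshold are assumptions, since the paper derives the region adaptively from the vanishing point and dominant borders.

```python
# Multivariate Gaussian appearance model for traversable-region labeling.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
frame = rng.integers(0, 256, size=(240, 320, 3)).astype(float)  # RGB image
sample = frame[180:240, 100:220].reshape(-1, 3)  # assumed road sample region

mu = sample.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(sample.T) + 1e-6 * np.eye(3))

pixels = frame.reshape(-1, 3) - mu
d2 = np.einsum("ij,jk,ik->i", pixels, cov_inv, pixels)  # squared Mahalanobis
traversable = (d2 < chi2.ppf(0.95, df=3)).reshape(240, 320)  # assumed cutoff
```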
Clavel, Julien; Aristide, Leandro; Morlon, Hélène
2018-06-19
Working with high-dimensional phylogenetic comparative datasets is challenging because likelihood-based multivariate methods suffer from low statistical performance as the number of traits p approaches the number of species n, and because computational complications occur when p exceeds n. Alternative phylogenetic comparative methods have recently been proposed to deal with the large p, small n scenario, but their use and performance are limited. Here we develop a penalized likelihood framework to deal with high-dimensional comparative datasets. We propose various penalizations and methods for selecting the intensity of the penalties. We apply this general framework to the estimation of parameters (the evolutionary trait covariance matrix and parameters of the evolutionary model) and to model comparison for the high-dimensional multivariate Brownian (BM), Early-burst (EB), Ornstein-Uhlenbeck (OU) and Pagel's lambda models. We show using simulations that our penalized likelihood approach dramatically improves the estimation of evolutionary trait covariance matrices and model parameters when p approaches n, and allows for their accurate estimation when p equals or exceeds n. In addition, we show that penalized likelihood models can be efficiently compared using the Generalized Information Criterion (GIC). We implement these methods, as well as the related estimation of ancestral states and the computation of phylogenetic PCA, in the R packages RPANDA and mvMORPH. Finally, we illustrate the utility of the new proposed framework by evaluating evolutionary model fit, analyzing integration patterns, and reconstructing evolutionary trajectories for a high-dimensional 3-D dataset of brain shape in the New World monkeys. We find clear support for an Early-burst model, suggesting an early diversification of brain morphology during the ecological radiation of the clade. Penalized likelihood offers an efficient way to deal with high-dimensional multivariate comparative data.
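The numerical motivation is easy to demonstrate: the sample covariance is singular when p is at or above n, whereas a shrinkage (penalized) estimate is well conditioned. Ledoit-Wolf shrinkage is used below purely as a stand-in for the paper's phylogenetically informed penalties, on invented trait data.

```python
# Sample vs shrinkage covariance when traits outnumber species.
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(10)
n, p = 40, 120                        # species x traits, p >> n
traits = rng.normal(size=(n, p))

sample_cov = np.cov(traits.T)
print(np.linalg.matrix_rank(sample_cov))       # at most n-1: singular
lw = LedoitWolf().fit(traits)                  # shrinkage estimate
print(np.linalg.cond(lw.covariance_) < 1e6)    # invertible, likelihood-ready
```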
A neuromorphic network for generic multivariate data classification
Schmuker, Michael; Pfeil, Thomas; Nawrot, Martin Paul
2014-01-01
Computational neuroscience has uncovered a number of computational principles used by nervous systems. At the same time, neuromorphic hardware has matured to a state where fast silicon implementations of complex neural networks have become feasible. En route to future technical applications of neuromorphic computing the current challenge lies in the identification and implementation of functional brain algorithms. Taking inspiration from the olfactory system of insects, we constructed a spiking neural network for the classification of multivariate data, a common problem in signal and data analysis. In this model, real-valued multivariate data are converted into spike trains using “virtual receptors” (VRs). Their output is processed by lateral inhibition and drives a winner-take-all circuit that supports supervised learning. VRs are conveniently implemented in software, whereas the lateral inhibition and classification stages run on accelerated neuromorphic hardware. When trained and tested on real-world datasets, we find that the classification performance is on par with a naïve Bayes classifier. An analysis of the network dynamics shows that stable decisions in output neuron populations are reached within less than 100 ms of biological time, matching the time-to-decision reported for the insect nervous system. Through leveraging a population code, the network tolerates the variability of neuronal transfer functions and trial-to-trial variation that is inevitably present on the hardware system. Our work provides a proof of principle for the successful implementation of a functional spiking neural network on a configurable neuromorphic hardware system that can readily be applied to real-world computing problems. PMID:24469794
Artificial intelligence (AI) systems for interpreting complex medical datasets.
Altman, R B
2017-05-01
Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications to medical data face several technical challenges: complex and heterogeneous datasets, noisy medical data, and the need to explain their output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability. © 2017 ASCPT.
Bansal, Ravi; Hao, Xuejun; Liu, Jun; Peterson, Bradley S.
2014-01-01
Many investigators have tried to apply machine learning techniques to magnetic resonance images (MRIs) of the brain in order to diagnose neuropsychiatric disorders. Usually the number of brain imaging measures (such as measures of cortical thickness and measures of local surface morphology) derived from the MRIs (i.e., their dimensionality) has been large (e.g. >10) relative to the number of participants who provide the MRI data (<100). Sparse data in a high dimensional space increases the variability of the classification rules that machine learning algorithms generate, thereby limiting the validity, reproducibility, and generalizability of those classifiers. The accuracy and stability of the classifiers can improve significantly if the multivariate distributions of the imaging measures can be estimated accurately. To accurately estimate the multivariate distributions using sparse data, we propose to estimate first the univariate distributions of imaging data and then combine them using a Copula to generate more accurate estimates of their multivariate distributions. We then sample the estimated Copula distributions to generate dense sets of imaging measures and use those measures to train classifiers. We hypothesize that the dense sets of brain imaging measures will generate classifiers that are stable to variations in brain imaging measures, thereby improving the reproducibility, validity, and generalizability of diagnostic classification algorithms in imaging datasets from clinical populations. In our experiments, we used both computer-generated and real-world brain imaging datasets to assess the accuracy of multivariate Copula distributions in estimating the corresponding multivariate distributions of real-world imaging data. Our experiments showed that diagnostic classifiers generated using imaging measures sampled from the Copula were significantly more accurate and more reproducible than were the classifiers generated using either the real-world imaging measures or their multivariate Gaussian distributions. Thus, our findings demonstrate that estimated multivariate Copula distributions can generate dense sets of brain imaging measures that can in turn be used to train classifiers, and those classifiers are significantly more accurate and more reproducible than are those generated using real-world imaging measures alone. PMID:25093634
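A minimal sketch of the copula idea follows, assuming a Gaussian copula with empirical marginals; the authors' choices of copula family and marginal estimators may differ. The steps are the ones the abstract describes: estimate each univariate distribution, couple them through the correlation of their normal scores, then sample a dense synthetic set and map it back through the marginal quantiles.

import numpy as np
from scipy import stats

def fit_and_sample_gaussian_copula(X, n_samples, rng):
    n, p = X.shape
    u = (np.argsort(np.argsort(X, axis=0), axis=0) + 0.5) / n  # empirical CDF values per column
    z = stats.norm.ppf(u)                                      # normal scores
    corr = np.corrcoef(z, rowvar=False)                        # copula correlation
    z_new = rng.multivariate_normal(np.zeros(p), corr, size=n_samples)
    u_new = stats.norm.cdf(z_new)
    # Map back through each column's empirical quantiles
    return np.column_stack([np.quantile(X[:, j], u_new[:, j]) for j in range(p)])

rng = np.random.default_rng(1)
X = rng.gamma(2.0, size=(80, 5))                       # sparse, skewed stand-in for imaging measures
dense = fit_and_sample_gaussian_copula(X, 1000, rng)   # dense set for classifier training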
Galas, David J; Sakhanenko, Nikita A; Skupin, Alexander; Ignac, Tomasz
2014-02-01
Context dependence is central to the description of complexity. Keying on the pairwise definition of "set complexity," we use an information theory approach to formulate general measures of systems complexity. We examine the properties of multivariable dependency starting with the concept of interaction information. We then present a new measure for unbiased detection of multivariable dependency, "differential interaction information." This quantity for two variables reduces to the pairwise "set complexity" previously proposed as a context-dependent measure of information in biological systems. We generalize it here to an arbitrary number of variables. Critical limiting properties of the "differential interaction information" are key to the generalization. This measure extends previous ideas about biological information and provides a more sophisticated basis for the study of complexity. The properties of "differential interaction information" also suggest new approaches to data analysis. Given a data set of system measurements, differential interaction information can provide a measure of collective dependence, which can be represented in hypergraphs describing complex system interaction patterns. We investigate this kind of analysis using simulated data sets. The conjoining of a generalized set complexity measure, multivariable dependency analysis, and hypergraphs is our central result. While our focus is on complex biological systems, our results are applicable to any complex system.
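For three discrete variables, the interaction information the abstract starts from reduces to a combination of joint entropies, I(X;Y;Z) = I(X;Y) - I(X;Y|Z). The sketch below computes only this base quantity; the paper's "differential interaction information" is a generalization beyond it.

import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def interaction_info(joint):
    # I(X;Y;Z) = H(X)+H(Y)+H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z)
    H = lambda axes: entropy(joint.sum(axis=axes))
    return (H((1, 2)) + H((0, 2)) + H((0, 1))      # marginal entropies
            - H(2) - H(1) - H(0)                   # pairwise joint entropies
            + entropy(joint))                      # full joint entropy

joint = np.random.default_rng(2).random((2, 2, 2))  # toy joint counts for 3 binary variables
print(interaction_info(joint))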
He, Awen; Wang, Wenyu; Prakash, N Tejo; Tinkov, Alexey A; Skalny, Anatoly V; Wen, Yan; Hao, Jingcan; Guo, Xiong; Zhang, Feng
2018-03-01
Chemical elements are closely related to human health. Extensive genomic profile data of complex diseases offer us a good opportunity to systematically investigate the relationships between elements and complex diseases/traits. In this study, we applied the gene set enrichment analysis (GSEA) approach to detect associations between elements and complex diseases/traits through integrating element-gene interaction datasets and genome-wide association study (GWAS) data of complex diseases/traits. To illustrate the performance of GSEA, the element-gene interaction datasets of 24 elements were extracted from the comparative toxicogenomics database (CTD). GWAS summary datasets of 24 complex diseases or traits were downloaded from the dbGaP or GEFOS websites. We observed significant associations between 7 elements and 13 complex diseases or traits (all false discovery rate (FDR) < 0.05), including reported relationships, such as aluminum vs. Alzheimer's disease (FDR = 0.042), calcium vs. bone mineral density (FDR = 0.031), and magnesium vs. systemic lupus erythematosus (FDR = 0.012), as well as novel associations, such as nickel vs. hypertriglyceridemia (FDR = 0.002) and bipolar disorder (FDR = 0.027). Our results are consistent with previous biological studies, supporting the good performance of GSEA. Our results based on the GSEA framework provide novel clues for discovering causal relationships between elements and complex diseases. © 2017 WILEY PERIODICALS, INC.
Space-time patterns in ignimbrite compositions revealed by GIS and R based statistical analysis
NASA Astrophysics Data System (ADS)
Brandmeier, Melanie; Wörner, Gerhard
2017-04-01
GIS-based multivariate statistical and geospatial analysis of a compilation of 890 geochemical and ca. 1,200 geochronological data points for 194 mapped ignimbrites from the Central Andes documents the compositional and temporal pattern of large-volume ignimbrites (so-called "ignimbrite flare-ups") during Neogene times. Rapid advances in the computational sciences during the past decade have led to a growing pool of algorithms for multivariate statistics on big datasets with many predictor variables. This study uses the potential of R and ArcGIS, applying cluster analysis (CA) and linear discriminant analysis (LDA) to log-ratio transformed spatial data. CA on major and trace element data groups ignimbrites according to their geochemical characteristics into rhyolitic and dacitic "end-members" and differentiates characteristic trace element signatures with respect to the Eu anomaly, depletion in MREEs, and variable enrichment in LREEs. To highlight these distinct compositional signatures, we applied LDA to selected ignimbrites for which comprehensive datasets were available. The most important predictors for discriminating ignimbrites are La (LREE), Yb (HREE), Eu, Al2O3, K2O, P2O5, MgO, FeOt and TiO2. However, other REEs such as Gd, Pr, Tm, Sm and Er also contribute to the discriminant functions. Significant compositional differences were found between the older (>14 Ma) large-volume plateau-forming ignimbrites in northernmost Chile and southern Peru and the younger (<10 Ma) Altiplano-Puna Volcanic Complex ignimbrites of similar volume. Older ignimbrites are less depleted in HREEs and less radiogenic in Sr isotopes, indicating smaller crustal contributions during evolution in a thinner and thermally less evolved crust. These compositional variations indicate a relation to crustal thickening, with a "transition" from plagioclase to amphibole and garnet residual mineralogy between 13 and 9 Ma. We correlate compositional and volumetric variations with the N-S passage of the Juan Fernández Ridge and with crustal shortening and thickening during the past 26 Ma. The value of GIS and multivariate statistics, in comparison to traditional geochemical parameters, is highlighted when working with large datasets with many predictors in a spatial and temporal context. Algorithms implemented in R allow us to take advantage of an n-dimensional space and thus of subtle compositional differences contained in the data, while space-time patterns can be analyzed easily in GIS.
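A minimal sketch of the core analysis pattern, log-ratio transformation of compositional data followed by linear discriminant analysis, is given below with simulated compositions standing in for the ignimbrite database; the study's actual R/ArcGIS workflow, cluster analysis, and spatial handling are not reproduced.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def clr(X):
    # Centered log-ratio transform for closed (compositional) geochemical data
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

rng = np.random.default_rng(3)
X = rng.dirichlet(np.ones(8), size=120)   # 120 samples of an 8-part composition
y = rng.integers(0, 2, size=120)          # two hypothetical ignimbrite groups
lda = LinearDiscriminantAnalysis().fit(clr(X), y)
print(lda.scalings_[:, 0])                # loadings of the discriminating axis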
NASA Astrophysics Data System (ADS)
Pinales, J. C.; Graber, H. C.; Hargrove, J. T.; Caruso, M. J.
2016-02-01
Previous studies have demonstrated the ability to detect and classify marine hydrocarbon films with spaceborne synthetic aperture radar (SAR) imagery. The dampening effect of hydrocarbon discharges on small surface capillary-gravity waves renders the ocean surface "radar dark" compared with the surrounding wind-roughened ocean surface. Given the scope and impact of events like the Deepwater Horizon oil spill, the need for improved, automated and expedient monitoring of hydrocarbon-related marine anomalies has become a pressing and complex issue for governments and the extraction industry. The research presented here describes the development, training, and utilization of an algorithm that detects marine oil spills in an automated, semi-supervised manner, utilizing X-, C-, or L-band SAR data as the primary input. Ancillary datasets include related radar-borne variables (incidence angle, etc.), environmental data (wind speed, etc.) and textural descriptors. Shapefiles produced by an experienced human analyst served as targets (validation) during the training portion of the investigation. Training and testing datasets were chosen to develop and assess algorithm effectiveness and to identify optimal conditions for oil detection in SAR data. The algorithm detects oil spills by following a 3-step methodology: object detection, feature extraction, and classification. Previous oil spill detection and classification methodologies, such as machine learning algorithms, artificial neural networks (ANN), and multivariate classification methods like partial least squares-discriminant analysis (PLS-DA), are evaluated and compared. Statistical, transform, and model-based image texture techniques, commonly used for object mapping directly or as inputs for more complex methodologies, are explored to determine optimal textures for an oil spill detection system. The influence of the ancillary variables is explored, with a particular focus on the role of strong vs. weak wind forcing.
Pont, Laura; Sanz-Nebot, Victoria; Vilaseca, Marta; Jaumot, Joaquim; Tauler, Roma; Benavente, Fernando
2018-05-01
In this study, we describe a chemometric data analysis approach to assist in the interpretation of the complex datasets from the analysis of high-molecular mass oligomeric proteins by ion mobility mass spectrometry (IM-MS). The homotetrameric protein transthyretin (TTR) is involved in familial amyloidotic polyneuropathy type I (FAP-I). FAP-I is associated with a specific TTR mutant variant (TTR(Met30)) that can be easily detected analyzing the monomeric forms of the mutant protein. However, the mechanism of protein misfolding and aggregation onset, which could be triggered by structural changes in the native tetrameric protein, remains under investigation. Serum TTR from healthy controls and FAP-I patients was purified under non-denaturing conditions by conventional immunoprecipitation in solution and analyzed by IM-MS. IM-MS allowed separation and characterization of several tetrameric, trimeric and dimeric TTR gas ions due to their differential drift time. After an appropriate data pre-processing, multivariate curve resolution alternating least squares (MCR-ALS) was applied to the complex datasets. A group of seven independent components being characterized by their ion mobility profiles and mass spectra were resolved to explain the observed data variance in control and patient samples. Then, principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA) were considered for exploration and classification. Only four out of the seven resolved components were enough for an accurate differentiation. Furthermore, the specific TTR ions identified in the mass spectra of these components and the resolved ion mobility profiles provided a straightforward insight into the most relevant oligomeric TTR proteoforms for the disease. Copyright © 2018 Elsevier B.V. All rights reserved.
Quantifying the impact of between-study heterogeneity in multivariate meta-analyses
Jackson, Dan; White, Ian R; Riley, Richard D
2012-01-01
Measures that quantify the impact of heterogeneity in univariate meta-analysis, including the very popular I2 statistic, are now well established. Multivariate meta-analysis, where studies provide multiple outcomes that are pooled in a single analysis, is also becoming more commonly used. The question of how to quantify heterogeneity in the multivariate setting is therefore raised. It is the univariate R2 statistic, the ratio of the variance of the estimated treatment effect under the random and fixed effects models, that generalises most naturally, so this statistic provides our basis. This statistic is then used to derive a multivariate analogue of I2. We also provide a multivariate H2 statistic, the ratio of a generalisation of Cochran's heterogeneity statistic and its associated degrees of freedom, with an accompanying generalisation of the usual I2 statistic. Our proposed heterogeneity statistics can be used alongside all the usual estimates and inferential procedures used in multivariate meta-analysis. We apply our methods to some real datasets and show how our statistics are equally appropriate in the context of multivariate meta-regression, where study level covariate effects are included in the model. Our heterogeneity statistics may be used when applying any procedure for fitting the multivariate random effects model. Copyright © 2012 John Wiley & Sons, Ltd. PMID:22763950
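The univariate quantities being generalized can be sketched directly; the paper's contribution is their matrix-valued multivariate analogues, which are not reproduced here. The sketch assumes a DerSimonian-Laird estimate of the between-study variance, one of several possibilities, since the statistics can accompany any fitting procedure.

import numpy as np

def heterogeneity_stats(y, v):
    # y: study effect estimates; v: within-study variances
    w = 1.0 / v
    mu_fixed = (w * y).sum() / w.sum()
    Q = (w * (y - mu_fixed) ** 2).sum()   # Cochran's Q
    df = len(y) - 1
    H2 = Q / df
    I2 = max(0.0, (Q - df) / Q)
    tau2 = max(0.0, (Q - df) / (w.sum() - (w ** 2).sum() / w.sum()))  # DerSimonian-Laird
    var_fixed = 1.0 / w.sum()
    var_random = 1.0 / (1.0 / (v + tau2)).sum()
    R2 = var_random / var_fixed           # the variance ratio that generalizes
    return Q, H2, I2, R2

y = np.array([0.2, 0.5, -0.1, 0.4])
v = np.array([0.04, 0.02, 0.05, 0.03])
print(heterogeneity_stats(y, v))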
ENSO related variability in the Southern Hemisphere, 1948-2000
NASA Astrophysics Data System (ADS)
Ribera, Pedro; Mann, Michael E.
2003-01-01
The spatiotemporal evolution of Southern Hemisphere climate variability is diagnosed based on the NCEP reanalysis (1948-2000) dataset. Using the MTM-SVD analysis method, significant narrowband variability is isolated from the multivariate dataset. It is found that the ENSO signal exhibits statistically significant behavior at quasiquadrennial (3-6 yr) timescales over the full time period, while a significant quasibiennial (2-3 yr) timescale emerges only in the latter half of the period. Analyses of the spatial evolution of the two reconstructed signals shed additional light on linkages between low- and high-latitude Southern Hemisphere climate anomalies.
Microbial bebop: creating music from complex dynamics in microbial ecology.
Larsen, Peter; Gilbert, Jack
2013-01-01
In order for society to make effective policy decisions on complex and far-reaching subjects, such as appropriate responses to global climate change, scientists must effectively communicate complex results to the non-scientifically specialized public. However, there are few ways to transform highly complicated scientific data into formats that are engaging to the general community. Taking inspiration from patterns observed in nature and from some of the principles of jazz bebop improvisation, we have generated Microbial Bebop, a method by which microbial environmental data are transformed into music. Microbial Bebop uses meter, pitch, duration, and harmony to highlight the relationships between multiple data types in complex biological datasets. We use a comprehensive microbial ecology time-course dataset collected at the L4 marine monitoring station in the Western English Channel as an example of microbial ecological data that can be transformed into music. Four compositions were generated from L4 Station data using Microbial Bebop (www.bio.anl.gov/MicrobialBebop.htm). Each composition, though deriving from the same dataset, is created to highlight different relationships between environmental conditions and microbial community structure. The approach presented here can be applied to a wide variety of complex biological datasets.
Rios, Anthony; Kavuluru, Ramakanth
2013-09-01
Extracting diagnosis codes from medical records is a complex task carried out by trained coders who read all the documents associated with a patient's visit. With the popularity of electronic medical records (EMRs), computational approaches to code extraction have been proposed in recent years. Machine learning approaches to multi-label text classification provide an important methodology for this task, given that each EMR can be associated with multiple codes. In this paper, we study the role of feature selection, training data selection, and probabilistic threshold optimization in improving different multi-label classification approaches. We conduct experiments based on two different datasets: a recent gold standard dataset used for this task and a second larger and more complex EMR dataset we curated from the University of Kentucky Medical Center. While conventional approaches achieve results comparable to the state-of-the-art on the gold standard dataset, on our complex in-house dataset we show that feature selection, training data selection, and probabilistic thresholding provide significant gains in performance.
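Of the three ingredients studied, probabilistic threshold optimization is the simplest to isolate. The sketch below is a generic illustration rather than the paper's pipeline: it trains a one-vs-rest classifier on synthetic multi-label data and picks a per-label cutoff maximizing F1; in practice the cutoffs would be tuned on a held-out validation split, not on the evaluation data as this toy does.

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

X, Y = make_multilabel_classification(n_samples=400, n_classes=5, random_state=0)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X[:300], Y[:300])
proba = clf.predict_proba(X[300:])

thresholds = []
for j in range(Y.shape[1]):
    grid = np.linspace(0.05, 0.95, 19)
    f1s = [f1_score(Y[300:, j], proba[:, j] >= t, zero_division=0) for t in grid]
    thresholds.append(grid[int(np.argmax(f1s))])   # per-label cutoff instead of a fixed 0.5
print(thresholds)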
Equine grass sickness in Scotland: A case-control study of environmental geochemical risk factors.
Wylie, C E; Shaw, D J; Fordyce, F M; Lilly, A; Pirie, R S; McGorum, B C
2016-11-01
We hypothesised that the apparent geographical distribution of equine grass sickness (EGS) is partly attributable to suboptimal levels of soil macro- and trace elements in fields where EGS occurs. If proven, altering levels of particular elements could be used to reduce the risk of EGS. Our objective was to determine whether the geographical distribution of EGS cases in eastern Scotland is associated with the presence or absence of particular environmental chemical elements. We conducted a retrospective, time-matched case-control study using data for 455 geo-referenced EGS cases and 910 time-matched controls in eastern Scotland, together with geo-referenced environmental geochemical data from the British Geological Survey Geochemical Baseline Survey of the Environment stream sediment (G-BASE) and the James Hutton Institute National Soil Inventory of Scotland (NSIS) datasets. Multivariable statistical analyses identified clusters of three main elements associated with cases: from (i) the G-BASE dataset, higher environmental Ti and lower Zn, and (ii) the NSIS dataset, higher environmental Ti and lower Cr. There was also some evidence from univariable analyses for lower Al, Cd, Cu, Ni and Pb and higher Ca, K, Mo, Na and Se environmental concentrations being associated with a case. Results were complicated by a high degree of correlation between most geochemical elements. The work presented here would appear to reflect soil-, not horse-level risk factors for EGS, but due to the complexity of the correlations between elements, further work is required to determine whether these associations reflect causality, and consequently whether interventions to alter concentrations of particular elements in soil, or in grazing horses, could potentially reduce the risk of EGS. The effect of chemical elements on the growth of those soil microorganisms implicated in EGS aetiology also warrants further study. © 2015 The Authors. Equine Veterinary Journal © 2015 EVJ Ltd.
A hierarchical spatial model for well yield in complex aquifers
NASA Astrophysics Data System (ADS)
Montgomery, J.; O'sullivan, F.
2017-12-01
Efficiently siting and managing groundwater wells requires reliable estimates of the amount of water that can be produced, or the well yield. This can be challenging to predict in highly complex, heterogeneous fractured aquifers due to the uncertainty around local hydraulic properties. Promising statistical approaches have been advanced in recent years. For instance, kriging and multivariate regression analysis have been applied to well test data with limited but encouraging levels of prediction accuracy. Additionally, some analytical solutions to diffusion in homogeneous porous media have been used to infer "effective" properties consistent with observed flow rates or drawdown. However, this is an under-specified inverse problem with substantial and irreducible uncertainty. We describe a flexible machine learning approach capable of combining diverse datasets with constraining physical and geostatistical models for improved well yield prediction accuracy and uncertainty quantification. Our approach can be implemented within a hierarchical Bayesian framework using Markov Chain Monte Carlo, which allows for additional sources of information to be incorporated in priors to further constrain and improve predictions and reduce the model order. We demonstrate the usefulness of this approach using data from over 7,000 wells in a fractured bedrock aquifer.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Louie, Alexander V., E-mail: Dr.alexlouie@gmail.com; Department of Radiation Oncology, London Regional Cancer Program, University of Western Ontario, London, Ontario; Department of Epidemiology, Harvard School of Public Health, Harvard University, Boston, Massachusetts
Purpose: A prognostic model for 5-year overall survival (OS), consisting of recursive partitioning analysis (RPA) and a nomogram, was developed for patients with early-stage non-small cell lung cancer (ES-NSCLC) treated with stereotactic ablative radiation therapy (SABR). Methods and Materials: A primary dataset of 703 ES-NSCLC SABR patients was randomly divided into a training (67%) and an internal validation (33%) dataset. In the former group, 21 unique parameters consisting of patient, treatment, and tumor factors were entered into an RPA model to predict OS. Univariate and multivariate models were constructed for RPA-selected factors to evaluate their relationship with OS. A nomogram for OS was constructed based on factors significant in multivariate modeling and validated with calibration plots. Both the RPA and the nomogram were externally validated in independent surgical (n=193) and SABR (n=543) datasets. Results: RPA identified 2 distinct risk classes based on tumor diameter, age, World Health Organization performance status (PS) and Charlson comorbidity index. This RPA had moderate discrimination in SABR datasets (c-index range: 0.52-0.60) but was of limited value in the surgical validation cohort. The nomogram predicting OS included smoking history in addition to RPA-identified factors. In contrast to the RPA, the nomogram performed well in internal validation (r2=0.97) and in the external SABR (r2=0.79) and surgical (r2=0.91) cohorts. Conclusions: The Amsterdam prognostic model is the first externally validated prognostication tool for OS in ES-NSCLC treated with SABR available to individualize patient decision making. The nomogram retained strong performance across surgical and SABR external validation datasets. RPA performance was poor in surgical patients, suggesting that 2 distinct patient populations are being treated with these 2 effective modalities.
Effect of rich-club on diffusion in complex networks
NASA Astrophysics Data System (ADS)
Berahmand, Kamal; Samadi, Negin; Sheikholeslami, Seyed Mahmood
2018-05-01
One of the main issues in complex networks is the phenomenon of diffusion, in which the goal is to find the nodes with the highest diffusing power. In diffusion there is always a trade-off between accuracy and time complexity; therefore, most recent studies have focused on finding new centralities to solve this problem and have offered new ones, but our approach is different. Using one of the features of complex networks, namely the "rich-club", we analyze its effect on diffusion in complex networks and demonstrate that in datasets with a high rich-club it is better to use degree centrality for finding influential nodes, because it has linear time complexity and uses only local information; this rule, however, does not apply to datasets with a low rich-club. Real and artificial datasets with a high rich-club are then used to compare degree centrality against well-known centralities using the standard SIR model.
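The two quantities being compared, the rich-club coefficient and degree centrality, are both available in networkx; a minimal sketch on a synthetic graph follows (the SIR spreading simulations used for the paper's evaluation are omitted).

import networkx as nx

G = nx.barabasi_albert_graph(500, 4, seed=0)
rc = nx.rich_club_coefficient(G, normalized=False)   # phi(k) for each degree k
top_by_degree = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:10]
print(max(rc, key=rc.get))   # degree at which the club is densest
print(top_by_degree)         # candidate influential spreaders, cheap to obtain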
Yue, Chen; Chen, Shaojie; Sair, Haris I; Airan, Raag; Caffo, Brian S
2015-09-01
Data reproducibility is a critical issue in all scientific experiments. In this manuscript, the problem of quantifying the reproducibility of graphical measurements is considered. The image intra-class correlation coefficient (I2C2) is generalized and the graphical intra-class correlation coefficient (GICC) is proposed for this purpose. The concept for GICC is based on multivariate probit-linear mixed effect models. A Markov chain Monte Carlo EM (MCMC-EM) algorithm is used for estimating the GICC. Simulation results with varied settings are demonstrated and our method is applied to the KIRBY21 test-retest dataset.
Fitting Meta-Analytic Structural Equation Models with Complex Datasets
ERIC Educational Resources Information Center
Wilson, Sandra Jo; Polanin, Joshua R.; Lipsey, Mark W.
2016-01-01
A modification of the first stage of the standard procedure for two-stage meta-analytic structural equation modeling for use with large complex datasets is presented. This modification addresses two common problems that arise in such meta-analyses: (a) primary studies that provide multiple measures of the same construct and (b) the correlation…
Natural image sequences constrain dynamic receptive fields and imply a sparse code.
Häusler, Chris; Susemihl, Alex; Nawrot, Martin P
2013-11-06
In their natural environment, animals experience a complex and dynamic visual scenery. Under such natural stimulus conditions, neurons in the visual cortex employ a spatially and temporally sparse code. For the input scenario of natural still images, previous work demonstrated that unsupervised feature learning combined with the constraint of sparse coding can predict physiologically measured receptive fields of simple cells in the primary visual cortex. This convincingly indicated that the mammalian visual system is adapted to the natural spatial input statistics. Here, we extend this approach to the time domain in order to predict dynamic receptive fields that can account for both spatial and temporal sparse activation in biological neurons. We rely on temporal restricted Boltzmann machines and suggest a novel temporal autoencoding training procedure. When tested on a dynamic multi-variate benchmark dataset this method outperformed existing models of this class. Learning features on a large dataset of natural movies allowed us to model spatio-temporal receptive fields for single neurons. They resemble temporally smooth transformations of previously obtained static receptive fields and are thus consistent with existing theories. A neuronal spike response model demonstrates how the dynamic receptive field facilitates temporal and population sparseness. We discuss the potential mechanisms and benefits of a spatially and temporally sparse representation of natural visual input. Copyright © 2013 The Authors. Published by Elsevier B.V. All rights reserved.
Lee, Yune-Sang; Turkeltaub, Peter; Granger, Richard; Raizada, Rajeev D S
2012-03-14
Although much effort has been directed toward understanding the neural basis of speech processing, the neural processes involved in the categorical perception of speech have been relatively less studied, and many questions remain open. In this functional magnetic resonance imaging (fMRI) study, we probed the cortical regions mediating categorical speech perception using an advanced brain-mapping technique, whole-brain multivariate pattern-based analysis (MVPA). Normal healthy human subjects (native English speakers) were scanned while they listened to 10 consonant-vowel syllables along the /ba/-/da/ continuum. Outside of the scanner, individuals' own category boundaries were measured to divide the fMRI data into /ba/ and /da/ conditions per subject. The whole-brain MVPA revealed that Broca's area and the left pre-supplementary motor area evoked distinct neural activity patterns between the two perceptual categories (/ba/ vs /da/). Broca's area was also found when the same analysis was applied to another dataset (Raizada and Poldrack, 2007), which previously yielded the supramarginal gyrus using a univariate adaptation-fMRI paradigm. The consistent MVPA findings from two independent datasets strongly indicate that Broca's area participates in categorical speech perception, with a possible role of translating speech signals into articulatory codes. The difference in results between univariate and multivariate pattern-based analyses of the same data suggest that processes in different cortical areas along the dorsal speech perception stream are distributed on different spatial scales.
Validation of a Radiosensitivity Molecular Signature in Breast Cancer
Eschrich, Steven A.; Fulp, William J.; Pawitan, Yudi; Foekens, John A.; Smid, Marcel; Martens, John W. M.; Echevarria, Michelle; Kamath, Vidya; Lee, Ji-Hyun; Harris, Eleanor E.; Bergh, Jonas; Torres-Roca, Javier F.
2014-01-01
Purpose Previously, we developed a radiosensitivity molecular signature (RSI) that was clinically-validated in three independent datasets (rectal, esophageal, head and neck) in 118 patients. Here, we test RSI in radiotherapy (RT) treated breast cancer patients. Experimental Design RSI was tested in two previously published breast cancer datasets. Patients were treated at the Karolinska University Hospital (n=159) and Erasmus Medical Center (n=344). RSI was applied as previously described. Results We tested RSI in RT-treated patients (Karolinska). Patients predicted to be radiosensitive (RS) had an improved 5 yr relapse-free survival when compared with radioresistant (RR) patients (95% vs. 75%, p=0.0212) but there was no difference between RS/RR patients treated without RT (71% vs. 77%, p=0.6744), consistent with RSI being RT-specific (interaction term RSIxRT, p=0.05). Similarly, in the Erasmus dataset RT-treated RS patients had an improved 5-year distant-metastasis-free survival over RR patients (77% vs. 64%, p=0.0409) but no difference was observed in patients treated without RT (RS vs. RR, 80% vs. 81%, p=0.9425). Multivariable analysis showed RSI is the strongest variable in RT-treated patients (Karolinska, HR=5.53, p=0.0987, Erasmus, HR=1.64, p=0.0758) and in backward selection (removal alpha of 0.10) RSI was the only variable remaining in the final model. Finally, RSI is an independent predictor of outcome in RT-treated ER+ patients (Erasmus, multivariable analysis, HR=2.64, p=0.0085). Conclusions RSI is validated in two independent breast cancer datasets totaling 503 patients. Including prior data, RSI is validated in five independent cohorts (621 patients) and represents, to our knowledge, the most extensively validated molecular signature in radiation oncology. PMID:22832933
An assessment of differences in gridded precipitation datasets in complex terrain
NASA Astrophysics Data System (ADS)
Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.
2018-01-01
Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare the long-term average spatial patterns, and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr-1 on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.
Selecting minimum dataset soil variables using PLSR as a regressive multivariate method
NASA Astrophysics Data System (ADS)
Stellacci, Anna Maria; Armenise, Elena; Castellini, Mirko; Rossi, Roberta; Vitti, Carolina; Leogrande, Rita; De Benedetto, Daniela; Ferrara, Rossana M.; Vivaldi, Gaetano A.
2017-04-01
Long-term field experiments and science-based tools that characterize soil status (namely, soil quality indices, SQIs) assume a strategic role in assessing the effect of agronomic techniques and thus in improving soil management, especially in marginal environments. Selecting key soil variables able to best represent soil status is a critical step in the calculation of SQIs. Current studies show the effectiveness of statistical methods for variable selection in extracting relevant information from multivariate datasets. Principal component analysis (PCA) has mainly been used; however, supervised multivariate methods and regressive techniques are progressively being evaluated (Armenise et al., 2013; de Paul Obade et al., 2016; Pulido Moncada et al., 2014). The present study explores the effectiveness of partial least squares regression (PLSR) in selecting critical soil variables, using a dataset comparing conventional tillage and sod-seeding on durum wheat. The results were compared to those obtained using PCA and stepwise discriminant analysis (SDA). The soil data were derived from a long-term field experiment in Southern Italy. On samples collected in April 2015, the following set of variables was quantified: (i) chemical: total organic carbon and nitrogen (TOC and TN), alkali-extractable C (TEC and humic substances - HA-FA), water-extractable N and organic C (WEN and WEOC), Olsen extractable P, exchangeable cations, pH and EC; (ii) physical: texture, dry bulk density (BD), macroporosity (Pmac), air capacity (AC), and relative field capacity (RFC); (iii) biological: carbon of the microbial biomass quantified with the fumigation-extraction method. PCA and SDA had previously been applied to the multivariate dataset (Stellacci et al., 2016). PLSR was carried out on mean-centered and variance-scaled data of the predictors (soil variables) and response (wheat yield) using the PLS procedure of SAS/STAT. In addition, variable importance for projection (VIP) statistics were used to quantitatively assess the predictors most relevant for response variable estimation and hence for variable selection (Andersen and Bro, 2010). PCA and SDA returned TOC and RFC as influential variables, both when the chemical and physical data were analyzed separately and on the whole dataset (Stellacci et al., 2016). Highly weighted variables in PCA were also TEC, followed by K, and AC, followed by Pmac and BD, in the first PC (41.2% of total variance); Olsen P and HA-FA in the second PC (12.6%); and Ca in the third PC (10.6%). The variables enabling maximum discrimination among treatments in SDA were WEOC on the whole dataset, and humic substances, followed by Olsen P, EC and clay, in the separate data analyses. The highest PLS-VIP statistics were recorded for Olsen P and Pmac, followed by TOC, TEC, pH and Mg among the chemical variables, and by clay, RFC and AC among the physical variables. These results show that different methods may provide different rankings of the selected variables and that the presence of a response variable in regressive techniques may affect variable selection. Further investigation with different response variables and with multi-year datasets would help to better define the advantages and limits of single or combined approaches.
Acknowledgment: The work was supported by the projects "BIOTILLAGE, approcci innovative per il miglioramento delle performances ambientali e produttive dei sistemi cerealicoli no-tillage", financed by PSR-Basilicata 2007-2013, and "DESERT, Low-cost water desalination and sensor technology compact module", financed by ERANET-WATERWORKS 2014. References: Andersen C.M. and Bro R., 2010. Variable selection in regression - a tutorial. Journal of Chemometrics, 24:728-737. Armenise et al., 2013. Developing a soil quality index to compare soil fitness for agricultural use under different managements in the Mediterranean environment. Soil and Tillage Research, 130:91-98. de Paul Obade et al., 2016. A standardized soil quality index for diverse field conditions. Science of the Total Environment, 541:424-434. Pulido Moncada et al., 2014. Data-driven analysis of soil quality indicators using limited data. Geoderma, 235:271-278. Stellacci et al., 2016. Comparison of different multivariate methods to select key soil variables for soil quality indices computation. XLV Congress of the Italian Society of Agronomy (SIA), Sassari, 20-22 September 2016.
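A minimal sketch of the PLSR-plus-VIP selection step discussed in the abstract above, using scikit-learn in place of the SAS PLS procedure and simulated soil data; the VIP computation follows the standard formula, with variables scoring above 1 commonly retained.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls, X):
    # Standard VIP formula for a univariate response
    T = pls.transform(X)                        # latent scores (n x a)
    W, Q = pls.x_weights_, pls.y_loadings_      # weights (p x a), loadings (1 x a)
    ss = (Q.ravel() ** 2) * (T ** 2).sum(axis=0)   # y-variance explained per component
    w_norm = W / np.linalg.norm(W, axis=0, keepdims=True)
    p = W.shape[0]
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 12))                                  # 12 hypothetical soil variables
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=60)   # yield driven by variables 0 and 3
pls = PLSRegression(n_components=3).fit(X, y)
print(vip_scores(pls, X).round(2))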
A Review of Multivariate Distributions for Count Data Derived from the Poisson Distribution.
Inouye, David; Yang, Eunho; Allen, Genevera; Ravikumar, Pradeep
2017-01-01
The Poisson distribution has been widely studied and used for modeling univariate count-valued data. Multivariate generalizations of the Poisson distribution that permit dependencies, however, have been far less popular. Yet, real-world high-dimensional count-valued data found in word counts, genomics, and crime statistics, for example, exhibit rich dependencies, and motivate the need for multivariate distributions that can appropriately model this data. We review multivariate distributions derived from the univariate Poisson, categorizing these models into three main classes: 1) where the marginal distributions are Poisson, 2) where the joint distribution is a mixture of independent multivariate Poisson distributions, and 3) where the node-conditional distributions are derived from the Poisson. We discuss the development of multiple instances of these classes and compare the models in terms of interpretability and theory. Then, we empirically compare multiple models from each class on three real-world datasets that have varying data characteristics from different domains, namely traffic accident data, biological next generation sequencing data, and text data. These empirical experiments develop intuition about the comparative advantages and disadvantages of each class of multivariate distribution that was derived from the Poisson. Finally, we suggest new research directions as explored in the subsequent discussion section.
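The first class in the review's taxonomy (Poisson marginals with dependence) includes the classical common-shock construction, sketched below; its well-known limitation is that it can induce only non-negative correlation.

import numpy as np

def bivariate_poisson(lam1, lam2, lam0, size, rng):
    # X1 = Y1 + Y0, X2 = Y2 + Y0 with independent Poisson Y's gives
    # X1 ~ Poisson(lam1 + lam0), X2 ~ Poisson(lam2 + lam0), Cov(X1, X2) = lam0
    y0 = rng.poisson(lam0, size)
    return rng.poisson(lam1, size) + y0, rng.poisson(lam2, size) + y0

rng = np.random.default_rng(5)
x1, x2 = bivariate_poisson(3.0, 2.0, 1.5, 100_000, rng)
print(np.cov(x1, x2)[0, 1])   # approximately lam0 = 1.5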
Galván-Tejada, Carlos E.; Zanella-Calzada, Laura A.; Galván-Tejada, Jorge I.; Celaya-Padilla, José M.; Gamboa-Rosales, Hamurabi; Garza-Veloz, Idalia; Martinez-Fierro, Margarita L.
2017-01-01
Breast cancer is an important global health problem, and the most common type of cancer among women. Late diagnosis significantly decreases the survival rate of the patient; however, using mammography for early detection has been demonstrated to be a very important tool increasing the survival rate. The purpose of this paper is to obtain a multivariate model to classify benign and malignant tumor lesions using a computer-assisted diagnosis with a genetic algorithm in training and test datasets from mammography image features. A multivariate search was conducted to obtain predictive models with different approaches, in order to compare and validate results. The multivariate models were constructed using: Random Forest, Nearest centroid, and K-Nearest Neighbor (K-NN) strategies as cost function in a genetic algorithm applied to the features in the BCDR public databases. Results suggest that the two texture descriptor features obtained in the multivariate model have a similar or better prediction capability to classify the data outcome compared with the multivariate model composed of all the features, according to their fitness value. This model can help to reduce the workload of radiologists and present a second opinion in the classification of tumor lesions. PMID:28216571
Associations between Smoking and Extreme Dieting among Adolescents
ERIC Educational Resources Information Center
Seo, Dong-Chul; Jiang, Nan
2009-01-01
This study examined the association between cigarette smoking and dieting behaviors and trends in that association among US adolescents in grades 9-12 between 1999 and 2007. Youth Risk Behavior Survey datasets were analyzed using the multivariable logistic regression method. The sample size of each survey year ranged from 13,554 to 15,273 with…
Multivariate Welch t-test on distances.
Alekseyenko, Alexander V
2016-12-01
Permutational non-Euclidean analysis of variance, PERMANOVA, is routinely used in exploratory analysis of multivariate datasets to draw conclusions about the significance of patterns visualized through dimension reduction. This method recognizes that the pairwise distance matrix between observations is sufficient to compute within and between group sums of squares necessary to form the (pseudo) F statistic. Moreover, not only Euclidean, but arbitrary distances can be used. This method, however, suffers from loss of power and type I error inflation in the presence of heteroscedasticity and sample size imbalances. We develop a solution in the form of a distance-based Welch t-test, TW2, for two sample potentially unbalanced and heteroscedastic data. We demonstrate empirically the desirable type I error and power characteristics of the new test. We compare the performance of PERMANOVA and TW2 in reanalysis of two existing microbiome datasets, where the methodology has originated. The source code for methods and analysis of this article is available at https://github.com/alekseyenko/Tw2. Further guidance on application of these methods can be obtained from the author (alekseye@musc.edu). © The Author 2016. Published by Oxford University Press.
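The distance-based machinery shared by PERMANOVA and the proposed test is compact enough to sketch: within- and between-group sums of squares come straight from the pairwise distance matrix and yield the pseudo-F. The Welch-type TW2 statistic itself, which corrects this for unequal dispersions and group sizes, is not reproduced here.

import numpy as np

def pseudo_F(D, labels):
    # D: n x n pairwise distance matrix; labels: group membership
    n = len(labels)
    ss_total = (D ** 2).sum() / (2 * n)   # (1/n) * sum over i<j of d_ij^2
    ss_within = 0.0
    for g in np.unique(labels):
        idx = np.where(labels == g)[0]
        ss_within += (D[np.ix_(idx, idx)] ** 2).sum() / (2 * len(idx))
    ss_between = ss_total - ss_within
    k = len(np.unique(labels))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(0.8, 2, (30, 5))])  # unequal dispersions
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
labels = np.array([0] * 20 + [1] * 30)
print(pseudo_F(D, labels))   # compared against a permutation null in practice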
Mei, Jiangyuan; Liu, Meizhu; Wang, Yuan-Fang; Gao, Huijun
2016-06-01
Multivariate time series (MTS) datasets broadly exist in numerous fields, including health care, multimedia, finance, and biometrics. How to classify MTS accurately has become a hot research topic since it is an important element in many computer vision and pattern recognition applications. In this paper, we propose a Mahalanobis distance-based dynamic time warping (DTW) measure for MTS classification. The Mahalanobis distance builds an accurate relationship between each variable and its corresponding category. It is utilized to calculate the local distance between vectors in MTS. Then we use DTW to align those MTS which are out of synchronization or with different lengths. After that, how to learn an accurate Mahalanobis distance function becomes another key problem. This paper establishes a LogDet divergence-based metric learning with triplet constraint model which can learn Mahalanobis matrix with high precision and robustness. Furthermore, the proposed method is applied on nine MTS datasets selected from the University of California, Irvine machine learning repository and Robert T. Olszewski's homepage, and the results demonstrate the improved performance of the proposed approach.
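A minimal sketch of the proposed distance follows: DTW alignment with a Mahalanobis local distance between time points. The matrix M is supplied here (the identity recovers Euclidean DTW); the paper's key step, learning M by LogDet-divergence metric learning under triplet constraints, is not shown.

import numpy as np

def mahalanobis_dtw(A, B, M):
    # A: T1 x d, B: T2 x d multivariate time series; M: d x d PSD matrix
    T1, T2 = len(A), len(B)
    local = np.zeros((T1, T2))
    for i in range(T1):
        diff = B - A[i]   # (T2, d)
        local[i] = np.sqrt(np.einsum('td,de,te->t', diff, M, diff))
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):      # standard DTW dynamic program
        for j in range(1, T2 + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[T1, T2]

rng = np.random.default_rng(7)
A, B = rng.normal(size=(30, 3)), rng.normal(size=(40, 3))
print(mahalanobis_dtw(A, B, np.eye(3)))   # M = I reduces to Euclidean DTW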
A Regularized Linear Dynamical System Framework for Multivariate Time Series Analysis.
Liu, Zitao; Hauskrecht, Milos
2015-01-01
Linear Dynamical System (LDS) is an elegant mathematical framework for modeling and learning Multivariate Time Series (MTS). However, in general, it is difficult to set the dimension of an LDS's hidden state space. A small number of hidden states may not be able to model the complexities of an MTS, while a large number of hidden states can lead to overfitting. In this paper, we study learning methods that impose various regularization penalties on the transition matrix of the LDS model and propose a regularized LDS learning framework (rLDS) which aims to (1) automatically shut down LDSs' spurious and unnecessary dimensions, and consequently, address the problem of choosing the optimal number of hidden states; (2) prevent the overfitting problem given a small amount of MTS data; and (3) support accurate MTS forecasting. To learn the regularized LDS from data, we incorporate a second-order cone program and a generalized gradient descent method into the Maximum a Posteriori framework and use Expectation Maximization to obtain a low-rank transition matrix of the LDS model. We propose two priors for modeling the matrix, which lead to two instances of our rLDS. We show that our rLDS is able to recover well the intrinsic dimensionality of the time series dynamics and that it improves the predictive performance when compared to baselines on both synthetic and real-world MTS datasets.
Multivariate analysis: greater insights into complex systems
USDA-ARS?s Scientific Manuscript database
Many agronomic researchers measure and collect multiple response variables in an effort to understand the more complex nature of the system being studied. Multivariate (MV) statistical methods encompass the simultaneous analysis of all random variables (RV) measured on each experimental or sampling ...
Gene set analysis using variance component tests.
Huang, Yen-Tsung; Lin, Xihong
2013-06-28
Gene set analyses have become increasingly important in genomic research, as many complex diseases are contributed to jointly by alterations of numerous genes. Genes often coordinate together as a functional repertoire, e.g., a biological pathway/network, and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to exploit this important feature of a gene set to improve statistical power in gene set analyses. We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects obtained by assuming a common distribution for the regression coefficients in the multivariate linear regression model, and calculate the p-values using permutation and a scaled chi-square approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and that power improves as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulated data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA). In summary, we develop a gene set analysis method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and the global test, in both simulations and the diabetes microarray data.
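A simplified variance-component-style score statistic with a permutation null can be sketched as follows; this stand-in uses an identity working covariance and a binary exposure, so it illustrates the spirit of TEGS rather than the published test or its scaled chi-square approximation.

import numpy as np

def gene_set_permutation_test(X, y, n_perm, rng):
    # X: n samples x p genes in the set; y: exposure (0/1)
    def score(y_):
        yc = y_ - y_.mean()
        return ((X.T @ yc) ** 2).sum()   # quadratic score, large when the set tracks y
    obs = score(y)
    null = np.array([score(rng.permutation(y)) for _ in range(n_perm)])
    return (1 + (null >= obs).sum()) / (1 + n_perm)   # permutation p-value

rng = np.random.default_rng(9)
X = rng.normal(size=(40, 25))                 # 25-gene set, 40 samples
y = rng.integers(0, 2, size=40).astype(float)
print(gene_set_permutation_test(X, y, 999, rng))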
Searching for forcing signatures in decadal patterns of shoreline change
NASA Astrophysics Data System (ADS)
Burningham, H.; French, J.
2016-12-01
Analysis of shoreline position at spatial scales of the order 10 - 100 km and at a multi-decadal time-scale has the potential to reveal regional coherence (or lack of) in the primary controls on shoreline tendencies and trends. Such information is extremely valuable for the evaluation of climate forcing on coastal behaviour. Segmenting a coast into discrete behaviour units based on these types of analyses is often subjective, however, and in the context of pervasive human interventions and alongshore variability in ocean climate, determining the most important controls on shoreline dynamics can be challenging. Multivariate analyses provide one means to resolve common behaviours across shoreline position datasets, thereby underpinning a more objective evaluation of possible coupling between shorelines at different scales. In an analysis of the Suffolk coast (eastern England) we explore the use of multivariate statistics to understand and classify mesoscale coastal behaviour. Suffolk comprises a relatively linear shoreline that shifts from east-facing in the north to southeast-facing in the south. Although primarily formed of a beach foreshore backed by cliffs or shingle barrier, the shoreline is punctuated at 3 locations by narrow tidal inlets with offset entrances that imply a persistent north to south sediment transport direction. Tidal regime decreases south to north from mesotidal (3.6m STR) to microtidal (1.9m STR), and the bimodal wave climate (northeast and southwest modes) presents complex local-scale variability in nearshore conditions. Shorelines exhibit a range of decadal behaviours from rapid erosion (up to 4m/yr) to quasi-stability that cannot be directly explained by the spatial organisation of contemporary landforms or coastal defences. A multivariate statistical approach to shoreline change analysis helps to define the key modes of change and determine the most likely forcing factors.
Predicting trauma patient mortality: ICD [or ICD-10-AM] versus AIS based approaches.
Willis, Cameron D; Gabbe, Belinda J; Jolley, Damien; Harrison, James E; Cameron, Peter A
2010-11-01
The International Classification of Diseases Injury Severity Score (ICISS) has been proposed as an International Classification of Diseases (ICD)-10-based alternative to mortality prediction tools that use Abbreviated Injury Scale (AIS) data, including the Trauma and Injury Severity Score (TRISS). To date, studies have not examined the performance of ICISS using Australian trauma registry data. This study aimed to compare the performance of ICISS with other mortality prediction tools in an Australian trauma registry. This was a retrospective review of prospectively collected data from the Victorian State Trauma Registry. A training dataset was created for model development and a validation dataset for evaluation. The multiplicative ICISS model was compared with a worst injury ICISS approach, Victorian TRISS (V-TRISS, using local coefficients), maximum AIS severity and a multivariable model including ICD-10-AM codes as predictors. Models were investigated for discrimination (C-statistic) and calibration (Hosmer-Lemeshow statistic). The multivariable approach had the highest level of discrimination (C-statistic 0.90) and calibration (H-L 7.65, P= 0.468). Worst injury ICISS, V-TRISS and maximum AIS had similar performance. The multiplicative ICISS produced the lowest level of discrimination (C-statistic 0.80) and poorest calibration (H-L 50.23, P < 0.001). The performance of ICISS may be affected by the data used to develop estimates, the ICD version employed, the methods for deriving estimates and the inclusion of covariates. In this analysis, a multivariable approach using ICD-10-AM codes was the best-performing method. A multivariable ICISS approach may therefore be a useful alternative to AIS-based methods and may have comparable predictive performance to locally derived TRISS models. © 2010 The Authors. ANZ Journal of Surgery © 2010 Royal Australasian College of Surgeons.
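ICISS itself is straightforward once survival risk ratios (SRRs) are tabulated: a code's SRR is the survival proportion among training patients carrying that code, the multiplicative ICISS is the product of a patient's SRRs, and the worst-injury variant takes the minimum instead. A minimal sketch with toy codes and data (the registry's actual codes and derivation data differ):

import numpy as np
import pandas as pd

def fit_srrs(train):
    # SRR per ICD code = proportion surviving among patients with that code
    return train.explode('codes').groupby('codes')['survived'].mean()

def iciss(codes, srr, worst_injury=False):
    values = srr.reindex(codes).dropna()
    return values.min() if worst_injury else values.prod()

train = pd.DataFrame({'codes': [['S06', 'S27'], ['S06'], ['S27', 'S82'], ['S82']],
                      'survived': [0, 1, 1, 1]})
srr = fit_srrs(train)
print(iciss(['S06', 'S82'], srr))                     # multiplicative ICISS
print(iciss(['S06', 'S82'], srr, worst_injury=True))  # worst-injury ICISS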
Multivariate Non-Symmetric Stochastic Models for Spatial Dependence Models
NASA Astrophysics Data System (ADS)
Haslauer, C. P.; Bárdossy, A.
2017-12-01
A copula-based multivariate framework allows more flexibility in describing different kinds of dependence than models relying on the confining assumption of symmetric Gaussian dependence: different quantiles can be modelled with a different degree of dependence, and we demonstrate how this can be expected given process understanding. Maximum-likelihood-based multivariate quantitative parameter estimation yields stable and reliable results; not only are improved cross-validation-based measures of uncertainty obtained, but also a more realistic spatial structure of uncertainty compared to second-order models of dependence. As much information as is available is included in the parameter estimation: incorporating censored measurements (e.g., below the detection limit, or above the sensitive range of the measurement device) yields more realistic spatial models; the proportion of true zeros can be estimated jointly with, and distinguished from, censored measurements, which allows inferences about the age of a contaminant in the system; and secondary information (categorical and on the ratio scale) is used to improve the estimation of the primary variable. These copula-based multivariate statistical techniques are demonstrated on hydraulic conductivity observations at the Borden (Canada) site, the MADE site (USA), and a large regional groundwater-quality dataset in southwest Germany. Fields of spatially distributed K were simulated with identical marginal distributions and identical second-order spatial moments, yet showed substantially differing solute transport characteristics when numerical tracer tests were performed. A statistical methodology is shown that allows the delineation of a boundary layer separating homogeneous parts of a spatial dataset. The effects of this boundary layer (macro structure) and of the spatial dependence of K (micro structure) on solute transport behaviour are shown.
Schwartz, Rachel S; Mueller, Rachel L
2010-01-11
Estimates of divergence dates between species improve our understanding of processes ranging from nucleotide substitution to speciation. Such estimates are frequently based on molecular genetic differences between species; therefore, they rely on accurate estimates of the number of such differences (i.e. substitutions per site, measured as branch length on phylogenies). We used simulations to determine the effects of dataset size, branch length heterogeneity, branch depth, and analytical framework on branch length estimation across a range of branch lengths. We then reanalyzed an empirical dataset for plethodontid salamanders to determine how inaccurate branch length estimation can affect estimates of divergence dates. The accuracy of branch length estimation varied with branch length, dataset size (both number of taxa and sites), branch length heterogeneity, branch depth, dataset complexity, and analytical framework. For simple phylogenies analyzed in a Bayesian framework, branches were increasingly underestimated as branch length increased; in a maximum likelihood framework, longer branch lengths were somewhat overestimated. Longer datasets improved estimates in both frameworks; however, when the number of taxa was increased, estimation accuracy for deeper branches was less than for tip branches. Increasing the complexity of the dataset produced more misestimated branches in a Bayesian framework; however, in an ML framework, more branches were estimated more accurately. Using ML branch length estimates to re-estimate plethodontid salamander divergence dates generally resulted in an increase in the estimated age of older nodes and a decrease in the estimated age of younger nodes. Branch lengths are misestimated in both statistical frameworks for simulations of simple datasets. However, for complex datasets, length estimates are quite accurate in ML (even for short datasets), whereas few branches are estimated accurately in a Bayesian framework. Our reanalysis of empirical data demonstrates the magnitude of effects of Bayesian branch length misestimation on divergence date estimates. Because the length of branches for empirical datasets can be estimated most reliably in an ML framework when branches are <1 substitution/site and datasets are ≥1 kb, we suggest that divergence date estimates using datasets, branch lengths, and/or analytical techniques that fall outside of these parameters should be interpreted with caution.
Bustamante, Carlos D.; Valero-Cuevas, Francisco J.
2010-01-01
The field of complex biomechanical modeling has begun to rely on Monte Carlo techniques to investigate the effects of parameter variability and measurement uncertainty on model outputs, search for optimal parameter combinations, and define model limitations. However, advanced stochastic methods to perform data-driven explorations, such as Markov chain Monte Carlo (MCMC), become necessary as the number of model parameters increases. Here, we demonstrate the feasibility and, to our knowledge, the first use of an MCMC approach to improve the fitness of realistically large biomechanical models. We used a Metropolis–Hastings algorithm to search increasingly complex parameter landscapes (3, 8, 24, and 36 dimensions) to uncover underlying distributions of anatomical parameters of a “truth model” of the human thumb on the basis of simulated kinematic data (thumbnail location, orientation, and linear and angular velocities) polluted by zero-mean, uncorrelated multivariate Gaussian “measurement noise.” Driven by these data, ten Markov chains searched each model parameter space for the subspace that best fit the data (posterior distribution). As expected, convergence time increased, more local minima were found, and marginal distributions broadened as the parameter space complexity increased. In the 36-D scenario, some chains found local minima, but the majority of chains converged to the true posterior distribution (confirmed using a cross-validation dataset), thus demonstrating the feasibility and utility of these methods for realistically large biomechanical problems. PMID:19272906
Serial femtosecond crystallography datasets from G protein-coupled receptors
White, Thomas A.; Barty, Anton; Liu, Wei; Ishchenko, Andrii; Zhang, Haitao; Gati, Cornelius; Zatsepin, Nadia A.; Basu, Shibom; Oberthür, Dominik; Metz, Markus; Beyerlein, Kenneth R.; Yoon, Chun Hong; Yefanov, Oleksandr M.; James, Daniel; Wang, Dingjie; Messerschmidt, Marc; Koglin, Jason E.; Boutet, Sébastien; Weierstall, Uwe; Cherezov, Vadim
2016-01-01
We describe the deposition of four datasets consisting of X-ray diffraction images acquired using serial femtosecond crystallography experiments on microcrystals of human G protein-coupled receptors, grown and delivered in lipidic cubic phase, at the Linac Coherent Light Source. The receptors are: the human serotonin receptor 2B in complex with an agonist ergotamine, the human δ-opioid receptor in complex with a bi-functional peptide ligand DIPP-NH2, the human smoothened receptor in complex with an antagonist cyclopamine, and finally the human angiotensin II type 1 receptor in complex with the selective antagonist ZD7155. All four datasets have been deposited, with minimal processing, in an HDF5-based file format, which can be used directly for crystallographic processing with CrystFEL or other software. We have provided processing scripts and supporting files for recent versions of CrystFEL, which can be used to validate the data. PMID:27479354
Segmentation of Unstructured Datasets
NASA Technical Reports Server (NTRS)
Bhat, Smitha
1996-01-01
Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.
Goldstein, Markus; Uchida, Seiichi
2016-01-01
Anomaly detection is the process of identifying unexpected items or events in datasets that differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied to unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection, and the life science and medical domains. Dozens of algorithms have been proposed in this area, but the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, in which 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to serve as a new, well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, we outline computational effort, the impact of parameter settings, and global/local anomaly detection behavior. In conclusion, we offer advice on algorithm selection for typical real-world tasks.
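As an illustration of the kind of benchmarking described above, the following minimal Python sketch scores two off-the-shelf detectors on a synthetic labeled dataset; the detectors, data, and AUC metric are illustrative stand-ins for the 19 algorithms and 10 datasets of the study.

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    # Synthetic data: a dense normal cluster plus sparse uniform anomalies.
    normal = rng.normal(0, 1, size=(950, 5))
    anomalies = rng.uniform(-6, 6, size=(50, 5))
    X = np.vstack([normal, anomalies])
    y = np.r_[np.zeros(950), np.ones(50)]  # 1 = anomaly

    detectors = {
        "IsolationForest": IsolationForest(random_state=0),
        "LOF": LocalOutlierFactor(n_neighbors=20),
    }
    for name, det in detectors.items():
        if name == "LOF":
            det.fit(X)
            scores = -det.negative_outlier_factor_  # higher = more anomalous
        else:
            scores = -det.fit(X).score_samples(X)
        print(name, "AUC =", round(roc_auc_score(y, scores), 3))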
Bello, Alessandra; Bianchi, Federica; Careri, Maria; Giannetto, Marco; Mori, Giovanni; Musci, Marilena
2007-11-05
A new NIR method based on multivariate calibration for the determination of ethanol in industrially packed wholemeal bread was developed and validated. GC-FID was used as the reference method to determine the actual ethanol concentration of wholemeal bread samples with added ethanol contents ranging from 0 to 3.5% (w/w). Stepwise discriminant analysis was carried out on the NIR dataset in order to reduce the number of original variables by selecting those able to discriminate between samples of different ethanol concentrations. A multivariate calibration model was then obtained from the selected variables by multiple linear regression. The prediction power of the linear model was optimized by a new "leave one out" method, which further reduced the number of original variables.
Dietary characterization of terrestrial mammals.
Pineda-Munoz, Silvia; Alroy, John
2014-08-22
Understanding the feeding behaviour of the species that make up any ecosystem is essential for designing further research. Mammals have been studied intensively, but the criteria used for classifying their diets are far from being standardized. We built a database summarizing the dietary preferences of terrestrial mammals using published data regarding their stomach contents. We performed multivariate analyses in order to set up a standardized classification scheme. Ideally, food consumption percentages should be used instead of qualitative classifications. However, when highly detailed information is not available we propose classifying animals based on their main feeding resources. They should be classified as generalists when none of the feeding resources constitute over 50% of the diet. The term 'omnivore' should be avoided because it does not communicate all the complexity inherent to food choice. Moreover, the so-called omnivore diets actually involve several distinctive adaptations. Our dataset shows that terrestrial mammals are generally highly specialized and that some degree of food mixing may even be required for most species.
Fernández-Varela, R; Andrade, J M; Muniategui, S; Prada, D; Ramírez-Villalobos, F
2010-04-01
Identifying petroleum-related products released into the environment is a complex and difficult task, for which polycyclic aromatic hydrocarbons (PAHs) are nowadays of outstanding importance. Although traditional quantitative fingerprinting uses straightforward univariate statistical analyses to differentiate among oils and to assess their sources, a multivariate strategy based on Procrustes rotation (PR) was applied in this paper. The aim of PR is to select a reduced subset of PAHs still capable of satisfactorily identifying petroleum-related hydrocarbons. PR selected two subsets of three (C(2)-naphthalene, C(2)-dibenzothiophene and C(2)-phenanthrene) and five (C(1)-decahydronaphthalene, naphthalene, C(2)-phenanthrene, C(3)-phenanthrene and C(2)-fluoranthene) PAHs for the two datasets studied here. The classification abilities of each subset of PAHs were tested using principal components analysis, hierarchical cluster analysis and Kohonen neural networks, and it was demonstrated that they unraveled the same patterns as the overall set of PAHs. (c) 2009 Elsevier Ltd. All rights reserved.
Guerra, Jose Luis Lopez; Gomez, Daniel; Wei, Qingyi; Liu, Zhengshen; Wang, Li-E; Yuan, Xianglin; Zhuang, Yan; Komaki, Ritusko; Liao, Zhongxing
2012-12-01
We investigated the association between single nucleotide polymorphisms (SNPs) in the transforming growth factor β1 (TGFβ1) gene and the risk of radiation-induced esophageal toxicity (RE) in patients with non-small-cell lung cancer (NSCLC). Ninety-seven NSCLC patients with available genomic DNA samples and mostly treated with intensity modulated radio(chemo)therapy from 2003 to 2006 were used as a test dataset and 101 NSCLC patients treated with 3-dimensional conformal radio(chemo)therapy from 1998 to 2002 were used as a validation set. We genotyped three SNPs of the TGFβ1 gene (rs1800469:C-509T, rs1800471:G915C, and rs1982073:T869C) by the polymerase chain reaction restriction fragment length polymorphism method. In the test dataset, the CT/TT genotypes of TGFβ1 rs1800469:C-509T were associated with a statistically significant higher risk of RE grade ⩾3 in univariate (P=0.026) and multivariate analysis (P=0.045) when compared with the CC genotype. These results were again observed in both univariate (P=0.045) and multivariate (P=0.023) analysis in the validation dataset. We found and validated that the TGFβ1 rs1800469:C-509T genotype is associated with severe RE. This response marker may be used for guiding therapy intensity in an individual patient, which would further the goal of individualized therapy. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
[Theory, method and application of method R on estimation of (co)variance components].
Liu, Wen-Zhong
2004-07-01
The theory, methods, and applications of Method R for estimating (co)variance components were reviewed to promote appropriate use of the method. Estimation requires R values, which are regressions of predicted random effects calculated using the complete dataset on predicted random effects calculated using random subsets of the same data. By using a multivariate iteration algorithm based on a transformation matrix, combined with the preconditioned conjugate gradient method to solve the mixed model equations, the computational efficiency of Method R is much improved. Method R is computationally inexpensive, and the sampling errors and approximate credible intervals of estimates can be obtained. Disadvantages of Method R include a larger sampling variance than other methods for the same data, and biased estimates in small datasets. As an alternative method, Method R can be used for larger datasets. It is necessary to study its theoretical properties and broaden its application range further.
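A simplified numerical illustration of the R statistic is sketched below, ignoring the mixed-model machinery; the simulated prediction vectors are stand-ins for random-effect predictions from the full data and from a random subset, and at the true variance components E[R] = 1.

    import numpy as np

    def method_r_statistic(u_full, u_sub):
        """Regression of random-effect predictions from the complete
        dataset (u_full) on predictions from a random subset (u_sub)."""
        u_full = np.asarray(u_full, float)
        u_sub = np.asarray(u_sub, float)
        return float(u_sub @ u_full / (u_sub @ u_sub))

    # Toy illustration with simulated BLUP-like vectors.
    rng = np.random.default_rng(1)
    true_u = rng.normal(0, 1, 500)
    u_full = true_u + rng.normal(0, 0.2, 500)       # prediction from all data
    u_sub = 0.8 * true_u + rng.normal(0, 0.5, 500)  # noisier subset prediction
    print("R =", round(method_r_statistic(u_full, u_sub), 3))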
Linking multimetric and multivariate approaches to assess the ecological condition of streams.
Collier, Kevin J
2009-10-01
Few attempts have been made to combine multimetric and multivariate analyses, despite recognition that an integrated method could yield powerful tools for bioassessment. An approach is described that integrates eight macroinvertebrate community metrics into a Principal Components Analysis to develop a Multivariate Condition Score (MCS) from a calibration dataset of 511 samples. The MCS is compared to an Index of Biotic Integrity (IBI) derived using the same metrics based on the ratio to the reference site mean. The two approaches were highly correlated, although the MCS appeared to offer greater potential for discriminating a wider range of impaired conditions. Both the MCS and IBI displayed low temporal variability within reference sites and were able to distinguish between reference conditions and low levels of catchment modification and local habitat degradation, although neither discriminated among three levels of low impact. Pseudosamples developed to test the response of the metric aggregation approaches to organic enrichment, urban, mining, pastoral and logging stressor scenarios ranked pressures in the same order, but the MCS provided a lower score for the urban scenario and a higher score for the pastoral scenario. The MCS was calculated for an independent test dataset of urban and reference sites and yielded similar results to the IBI. Although both methods performed comparably, the MCS approach may have some advantages because it removes the subjectivity of assigning thresholds for scoring biological condition, and it appears to discriminate a wider range of degraded conditions.
Enhancing e-waste estimates: improving data quality by multivariate Input-Output Analysis.
Wang, Feng; Huisman, Jaco; Stevels, Ab; Baldé, Cornelis Peter
2013-11-01
Waste electrical and electronic equipment (or e-waste) is one of the fastest growing waste streams and encompasses a wide and increasing spectrum of products. Accurate estimation of e-waste generation is difficult, mainly due to a lack of high-quality data on market and socio-economic dynamics. This paper addresses how to enhance e-waste estimates by providing techniques to increase data quality. An advanced, flexible and multivariate Input-Output Analysis (IOA) method is proposed. It links all three pillars in IOA (product sales, stock and lifespan profiles) to construct mathematical relationships between various data points. By applying this method, the data consolidation steps can generate more accurate time-series datasets from the available data pool. This can consequently increase the reliability of e-waste estimates compared to approaches without data processing. A case study in the Netherlands is used to apply the advanced IOA model. As a result, for the first time, complete datasets of all three variables for estimating all types of e-waste have been obtained. The results of this study also demonstrate significant disparity between various estimation models, arising from the use of data under different conditions. This shows the importance of applying a multivariate approach and multiple sources to improve data quality for modelling, specifically using appropriate time-varying lifespan parameters. Following the case study, a roadmap with a procedural guideline is provided to enhance e-waste estimation studies. Copyright © 2013 Elsevier Ltd. All rights reserved.
A Review of Multivariate Distributions for Count Data Derived from the Poisson Distribution
Inouye, David; Yang, Eunho; Allen, Genevera; Ravikumar, Pradeep
2017-01-01
The Poisson distribution has been widely studied and used for modeling univariate count-valued data. Multivariate generalizations of the Poisson distribution that permit dependencies, however, have been far less popular. Yet, real-world high-dimensional count-valued data found in word counts, genomics, and crime statistics, for example, exhibit rich dependencies, and motivate the need for multivariate distributions that can appropriately model this data. We review multivariate distributions derived from the univariate Poisson, categorizing these models into three main classes: 1) where the marginal distributions are Poisson, 2) where the joint distribution is a mixture of independent multivariate Poisson distributions, and 3) where the node-conditional distributions are derived from the Poisson. We discuss the development of multiple instances of these classes and compare the models in terms of interpretability and theory. Then, we empirically compare multiple models from each class on three real-world datasets that have varying data characteristics from different domains, namely traffic accident data, biological next generation sequencing data, and text data. These empirical experiments develop intuition about the comparative advantages and disadvantages of each class of multivariate distribution that was derived from the Poisson. Finally, we suggest new research directions as explored in the subsequent discussion section. PMID:28983398
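For the first class reviewed above (Poisson marginals with dependence), the classical common-shock (trivariate reduction) construction gives a concrete minimal example; this is a standard textbook construction offered for illustration, not necessarily one of the specific models compared in the paper.

    import numpy as np

    def common_shock_bivariate_poisson(lam1, lam2, lam0, size, rng):
        """Trivariate-reduction construction: X1 and X2 have Poisson
        marginals with rates lam1+lam0 and lam2+lam0, and covariance
        lam0 induced by the shared component Y0."""
        y0 = rng.poisson(lam0, size)
        x1 = rng.poisson(lam1, size) + y0
        x2 = rng.poisson(lam2, size) + y0
        return x1, x2

    rng = np.random.default_rng(0)
    x1, x2 = common_shock_bivariate_poisson(2.0, 3.0, 1.5, 100_000, rng)
    print("means:", x1.mean(), x2.mean())       # ~3.5 and ~4.5
    print("covariance:", np.cov(x1, x2)[0, 1])  # ~1.5 (= lam0)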
Taylor, Sandra L; Ruhaak, L Renee; Weiss, Robert H; Kelly, Karen; Kim, Kyoungmi
2017-01-01
High-throughput mass spectrometry (MS) is now being used to profile small molecular compounds across multiple biological sample types from the same subjects with the goal of leveraging information across biospecimens. Multivariate statistical methods that combine information from all biospecimens could be more powerful than the usual univariate analyses. However, missing values are common in MS data and imputation can impact between-biospecimen correlation and multivariate analysis results. We propose two multivariate two-part statistics that accommodate missing values and combine data from all biospecimens to identify differentially regulated compounds. Statistical significance is determined using a multivariate permutation null distribution. Relative to univariate tests, the multivariate procedures detected more significant compounds in three biological datasets. In a simulation study, we showed that multi-biospecimen testing procedures were more powerful than single-biospecimen methods when compounds are differentially regulated in multiple biospecimens, but univariate methods can be more powerful if compounds are differentially regulated in only one biospecimen. We provide R functions to implement and illustrate our method as supplementary information. Contact: sltaylor@ucdavis.edu. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Viljoen, Nadia M; Joubert, Johan W
2018-02-01
This article presents the multilayered complex network formulation for three different supply chain network archetypes on an urban road grid and describes how 500 instances were randomly generated for each archetype. Both the supply chain network layer and the urban road network layer are directed unweighted networks. The shortest path set is calculated for each of the 1,500 experimental instances. The datasets are used to empirically explore the impact that the supply chain's dependence on the transport network has on its vulnerability in Viljoen and Joubert (2017) [1]. The datasets are publicly available on Mendeley (Joubert and Viljoen, 2017) [2].
ERIC Educational Resources Information Center
Grenville-Briggs, Laura J.; Stansfield, Ian
2011-01-01
This report describes a linked series of Masters-level computer practical workshops. They comprise an advanced functional genomics investigation, based upon analysis of a microarray dataset probing yeast DNA damage responses. The workshops require the students to analyse highly complex transcriptomics datasets, and were designed to stimulate…
Saeed, Mohammad
2017-05-01
Systemic lupus erythematosus (SLE) is a complex disorder. Genetic association studies of complex disorders suffer from the following three major issues: phenotypic heterogeneity, false positive (type I error), and false negative (type II error) results. Hence, genes with low to moderate effects are missed in standard analyses, especially after statistical corrections. OASIS is a novel linkage disequilibrium clustering algorithm that can potentially address false positives and negatives in genome-wide association studies (GWAS) of complex disorders such as SLE. OASIS was applied to two SLE dbGAP GWAS datasets (6077 subjects; ∼0.75 million single-nucleotide polymorphisms). OASIS identified three known SLE genes viz. IFIH1, TNIP1, and CD44, not previously reported using these GWAS datasets. In addition, 22 novel loci for SLE were identified and the 5 SLE genes previously reported using these datasets were verified. OASIS methodology was validated using single-variant replication and gene-based analysis with GATES. This led to the verification of 60% of OASIS loci. New SLE genes that OASIS identified and were further verified include TNFAIP6, DNAJB3, TTF1, GRIN2B, MON2, LATS2, SNX6, RBFOX1, NCOA3, and CHAF1B. This study presents the OASIS algorithm, software, and the meta-analyses of two publicly available SLE GWAS datasets along with the novel SLE genes. Hence, OASIS is a novel linkage disequilibrium clustering method that can be universally applied to existing GWAS datasets for the identification of new genes.
Predicting the Fine Particle Fraction of Dry Powder Inhalers Using Artificial Neural Networks.
Muddle, Joanna; Kirton, Stewart B; Parisini, Irene; Muddle, Andrew; Murnane, Darragh; Ali, Jogoth; Brown, Marc; Page, Clive; Forbes, Ben
2017-01-01
Dry powder inhalers are increasingly popular for delivering drugs to the lungs for the treatment of respiratory diseases, but they are complex products with multivariate performance determinants. Heuristic product development guided by in vitro aerosol performance testing is a costly and time-consuming process. This study investigated the feasibility of using artificial neural networks (ANNs) to predict fine particle fraction (FPF) based on formulation and device variables. Thirty-one ANN architectures were evaluated for their ability to predict experimentally determined FPF for a self-consistent dataset containing salmeterol xinafoate and salbutamol sulfate dry powder inhalers (237 experimental observations). Principal component analysis was used to identify inputs that significantly affected FPF. Orthogonal arrays (OAs) were used to design ANN architectures, optimized using the Taguchi method. The primary OA ANN r² values ranged between 0.46 and 0.90, and the secondary OA increased the r² values (0.53-0.93). The optimum ANN (9-4-1 architecture, average r² = 0.92 ± 0.02) included active pharmaceutical ingredient, formulation, and device inputs identified by principal component analysis, which reflected the recognized importance and interdependency of these factors for orally inhaled product performance. The Taguchi method was effective at identifying a successful architecture with the potential for development as a useful generic inhaler ANN model, although this would require much larger datasets and more variable inputs. Copyright © 2016 American Pharmacists Association®. Published by Elsevier Inc. All rights reserved.
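A minimal sketch of the 9-4-1 topology with scikit-learn on synthetic data follows; the nine inputs are placeholders for the (unspecified) formulation and device variables, and the target is a made-up linear response, so this only illustrates the architecture, not the study's trained model.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    # Placeholder design matrix: 9 synthetic formulation/device inputs.
    X = rng.uniform(0, 1, size=(237, 9))
    fpf = 0.4 * X[:, 0] - 0.3 * X[:, 3] + 0.1 * rng.normal(size=237) + 0.5

    X_tr, X_te, y_tr, y_te = train_test_split(X, fpf, random_state=0)
    # One hidden layer of 4 units mirrors the 9-4-1 topology in the abstract.
    net = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000, random_state=0)
    net.fit(X_tr, y_tr)
    print("r2 =", round(r2_score(y_te, net.predict(X_te)), 2))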
[Regression analysis to select native-like structures from decoys of antigen-antibody docking].
Chen, Zhengshan; Chi, Xiangyang; Fan, Pengfei; Zhang, Guanying; Wang, Meirong; Yu, Changming; Chen, Wei
2018-06-25
Given the increasing exploitation of antibodies in contexts such as molecular diagnostics and therapeutics, it would be beneficial to unravel the properties of antigen-antibody interactions by modeling computational protein-protein docking, especially in the absence of a cocrystal structure. However, obtaining a native-like antigen-antibody structure remains challenging, due in part to the failure of existing scoring functions to reliably discriminate accurate from inaccurate structures among the tens of thousands of decoys produced by computational docking. We hypothesized that some important physicochemical and energetic features could be used to describe antigen-antibody interfaces and to identify native-like antigen-antibody structures. We prepared a dataset, a subset of Protein-Protein Docking Benchmark Version 4.0 comprising 37 nonredundant 3D structures of antigen-antibody complexes, and used it to train and test a multivariate logistic regression equation that took several important physicochemical and energetic features of decoys as predictor variables. Our results indicate that the ability of our method to identify native-like structures is superior to the ZRANK and ZDOCK scores for this subset of antigen-antibody complexes. We then used our method in a workflow for predicting the epitope of the anti-Ebola glycoprotein monoclonal antibody 4G7 and identified three accurate residues in its epitope.
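A minimal sketch of the scoring idea: a logistic regression trained on per-decoy interface features to separate near-native from incorrect poses. The four features and the labels here are synthetic placeholders, not the physicochemical terms used by the authors.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    # Placeholder decoy features (synthetic): e.g. buried surface area,
    # electrostatic and van der Waals terms, hydrogen-bond count.
    X = rng.normal(size=(2000, 4))
    # Synthetic labels: 1 = near-native decoy, 0 = incorrect pose.
    logit = X @ np.array([1.2, -0.8, 0.5, 0.9]) - 0.3
    y = (rng.uniform(size=2000) < 1 / (1 + np.exp(-logit))).astype(int)

    clf = LogisticRegression(max_iter=1000)
    print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))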
Use of Multivariate Linkage Analysis for Dissection of a Complex Cognitive Trait
Marlow, Angela J.; Fisher, Simon E.; Francks, Clyde; MacPhie, I. Laurence; Cherny, Stacey S.; Richardson, Alex J.; Talcott, Joel B.; Stein, John F.; Monaco, Anthony P.; Cardon, Lon R.
2003-01-01
Replication of linkage results for complex traits has been exceedingly difficult, owing in part to the inability to measure the precise underlying phenotype, small sample sizes, genetic heterogeneity, and statistical methods employed in analysis. Often, in any particular study, multiple correlated traits have been collected, yet these have been analyzed independently or, at most, in bivariate analyses. Theoretical arguments suggest that full multivariate analysis of all available traits should offer more power to detect linkage; however, this has not yet been evaluated on a genomewide scale. Here, we conduct multivariate genomewide analyses of quantitative-trait loci that influence reading- and language-related measures in families affected with developmental dyslexia. The results of these analyses are substantially clearer than those of previous univariate analyses of the same data set, helping to resolve a number of key issues. These outcomes highlight the relevance of multivariate analysis for complex disorders for dissection of linkage results in correlated traits. The approach employed here may aid positional cloning of susceptibility genes in a wide spectrum of complex traits. PMID:12587094
Multi-scale pixel-based image fusion using multivariate empirical mode decomposition.
Rehman, Naveed ur; Ehsan, Shoaib; Abdullah, Syed Muhammad Umer; Akhtar, Muhammad Jehanzaib; Mandic, Danilo P; McDonald-Maier, Klaus D
2015-05-08
A novel scheme to perform the fusion of multiple images using the multivariate empirical mode decomposition (MEMD) algorithm is proposed. Standard multi-scale fusion techniques make a priori assumptions regarding input data, whereas standard univariate empirical mode decomposition (EMD)-based fusion techniques suffer from inherent mode mixing and mode misalignment issues, characterized respectively by either a single intrinsic mode function (IMF) containing multiple scales or the same indexed IMFs corresponding to multiple input images carrying different frequency information. We show that MEMD overcomes these problems by being fully data adaptive and by aligning common frequency scales from multiple channels, thus enabling their comparison at a pixel level and subsequent fusion at multiple data scales. We then demonstrate the potential of the proposed scheme on a large dataset of real-world multi-exposure and multi-focus images and compare the results against those obtained from standard fusion algorithms, including the principal component analysis (PCA), discrete wavelet transform (DWT) and non-subsampled contourlet transform (NCT). A variety of image fusion quality measures are employed for the objective evaluation of the proposed method. We also report the results of a hypothesis testing approach on our large image dataset to identify statistically-significant performance differences.
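MEMD itself is not available in numpy/scipy, so the sketch below takes a hypothetical memd() decomposition function as an argument (an assumption, not a real library call) and shows only the scale-aligned, pixel-wise fusion rule that the abstract describes.

    import numpy as np

    def fuse_with_memd(images, memd):
        """Sketch of MEMD-based fusion. `memd` is a hypothetical function
        (not part of numpy/scipy) that decomposes the stacked channels into
        a common set of scale-aligned IMFs with shape
        (n_imfs, n_images, height, width)."""
        stack = np.stack([img.astype(float) for img in images])
        imfs = memd(stack)
        fused_imfs = []
        for scale in imfs:
            # Pixel-wise rule: at each scale, keep the coefficient with
            # the largest absolute amplitude across the input images.
            idx = np.abs(scale).argmax(axis=0)
            fused_imfs.append(np.take_along_axis(scale, idx[None], 0)[0])
        return np.sum(fused_imfs, axis=0)  # reconstruct the fused image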
Recurrent Neural Networks for Multivariate Time Series with Missing Values.
Che, Zhengping; Purushotham, Sanjay; Cho, Kyunghyun; Sontag, David; Liu, Yan
2018-04-17
Multivariate time series data in practical applications, such as health care, geoscience, and biology, are characterized by a variety of missing values. In time series prediction and other related tasks, it has been noted that missing values and their missing patterns are often correlated with the target labels, a.k.a., informative missingness. There is very limited work on exploiting the missing patterns for effective imputation and improving prediction performance. In this paper, we develop novel deep learning models, namely GRU-D, as one of the early attempts. GRU-D is based on Gated Recurrent Unit (GRU), a state-of-the-art recurrent neural network. It takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates them into a deep model architecture so that it not only captures the long-term temporal dependencies in time series, but also utilizes the missing patterns to achieve better prediction results. Experiments of time series classification tasks on real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets demonstrate that our models achieve state-of-the-art performance and provide useful insights for better understanding and utilization of missing values in time series analysis.
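A minimal numpy sketch of the input-decay idea described above follows, simplified from the full GRU-D architecture; the decay weights and feature values are invented for illustration.

    import numpy as np

    def decay_impute(x, m, delta, x_mean, w_gamma, b_gamma):
        """GRU-D-style input decay (simplified, single feature vector).
        x: last observed values; m: 1 if observed now, else 0;
        delta: time since each variable was last observed."""
        gamma = np.exp(-np.maximum(0.0, w_gamma * delta + b_gamma))
        # Observed values pass through; missing ones decay from the
        # last observation toward the empirical mean.
        x_hat = gamma * x + (1.0 - gamma) * x_mean
        return m * x + (1.0 - m) * x_hat

    x = np.array([1.8, 0.0])      # last observed measurements
    m = np.array([1.0, 0.0])      # second variable currently missing
    delta = np.array([0.0, 5.0])  # hours since last observation
    print(decay_impute(x, m, delta, x_mean=np.array([1.0, 0.4]),
                       w_gamma=0.3, b_gamma=0.0))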
Gomez-Pilar, Javier; Poza, Jesús; Bachiller, Alejandro; Gómez, Carlos; Núñez, Pablo; Lubeiro, Alba; Molina, Vicente; Hornero, Roberto
2018-02-01
The aim of this study was to introduce a novel global measure of graph complexity: Shannon graph complexity (SGC). This measure was specifically developed for weighted graphs, but it can also be applied to binary graphs. The proposed complexity measure was designed to capture the interplay between two properties of a system: the 'information' (calculated by means of Shannon entropy) and the 'order' of the system (estimated by means of a disequilibrium measure). SGC is based on the notion that complex graphs should maintain an equilibrium between these two properties, which can be measured by means of the edge weight distribution. In this study, SGC was assessed using four synthetic graph datasets and a real dataset formed by electroencephalographic (EEG) recordings from controls and schizophrenia patients. SGC was compared with graph density (GD), a classical measure used to evaluate graph complexity. Our results showed that SGC is invariant with respect to GD and independent of node degree distribution. Furthermore, its variation with graph size [Formula: see text] is close to zero for [Formula: see text]. Results from the real dataset showed an increase in the weight distribution balance during cognitive processing for both controls and schizophrenia patients, although these changes were more pronounced for controls. Our findings revealed that SGC does not require comparison with null-hypothesis networks constructed by a surrogate process. In addition, the SGC results on the real dataset suggest that schizophrenia is associated with a deficit in the dynamic reorganization of the brain related to secondary pathways of the brain network.
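The authors' exact formula is not given in the abstract; the sketch below is one plausible reading, assuming SGC multiplies the normalized Shannon entropy of the edge-weight histogram by a disequilibrium term (distance from the uniform distribution).

    import numpy as np

    def shannon_graph_complexity(weights, bins=20):
        """Entropy-times-disequilibrium complexity of an edge-weight
        distribution (one plausible reading of SGC, not necessarily
        the authors' exact formula)."""
        p, _ = np.histogram(weights, bins=bins)
        p = p / p.sum()
        nz = p[p > 0]
        h = -(nz * np.log(nz)).sum() / np.log(bins)  # normalized entropy
        d = np.sum((p - 1.0 / bins) ** 2)            # disequilibrium
        return h * d

    rng = np.random.default_rng(0)
    print("uniform weights:", shannon_graph_complexity(rng.uniform(size=1000)))
    print("bimodal weights:",
          shannon_graph_complexity(np.r_[rng.normal(0.2, 0.02, 500),
                                         rng.normal(0.8, 0.02, 500)]))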
Progeny Clustering: A Method to Identify Biological Phenotypes
Hu, Chenyue W.; Kornblau, Steven M.; Slater, John H.; Qutub, Amina A.
2015-01-01
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient computationally, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown to be successful and robust when applied to two synthetic datasets (a two-dimensional dataset and a ten-dimensional dataset containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and the Rat CNS dataset), and two further biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
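A simplified stand-in for the stability idea is sketched below, using ordinary subsampling in place of the paper's Progeny Sampling: recluster resampled data, accumulate a co-occurrence probability matrix, and score how close co-clustering probabilities are to 0 or 1 (lower scores indicating more stable clusterings).

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def cooccurrence_stability(X, k, n_rounds=30, seed=0):
        """Average co-clustering inconsistency under resampling; a
        simplified stand-in for Progeny Sampling."""
        rng = np.random.default_rng(seed)
        n = len(X)
        co = np.zeros((n, n))
        counts = np.zeros((n, n))
        for _ in range(n_rounds):
            idx = rng.choice(n, size=int(0.8 * n), replace=False)
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=int(rng.integers(1 << 30))
                            ).fit_predict(X[idx])
            same = (labels[:, None] == labels[None, :]).astype(float)
            co[np.ix_(idx, idx)] += same
            counts[np.ix_(idx, idx)] += 1.0
        probs = co[counts > 0] / counts[counts > 0]
        # Stable clusterings push co-occurrence probabilities toward 0 or 1.
        return np.mean(np.minimum(probs, 1 - probs))

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    for k in (2, 3, 4, 5, 6):
        print(k, round(cooccurrence_stability(X, k), 3))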
An Effective Methodology for Processing and Analyzing Large, Complex Spacecraft Data Streams
ERIC Educational Resources Information Center
Teymourlouei, Haydar
2013-01-01
The emerging large datasets have made efficient data processing a much more difficult task for the traditional methodologies. Invariably, datasets continue to increase rapidly in size with time. The purpose of this research is to give an overview of some of the tools and techniques that can be utilized to manage and analyze large datasets. We…
CORUM: the comprehensive resource of mammalian protein complexes
Ruepp, Andreas; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Stransky, Michael; Waegele, Brigitte; Schmidt, Thorsten; Doudieu, Octave Noubibou; Stümpflen, Volker; Mewes, H. Werner
2008-01-01
Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The CORUM (http://mips.gsf.de/genre/proj/corum/index.html) database is a collection of experimentally verified mammalian protein complexes. Information is manually derived by critical reading of the scientific literature by expert annotators. Information about protein complexes includes protein complex names, subunits, literature references, and the functions of the complexes. For functional annotation, we use the FunCat catalogue, which makes it possible to organize the protein complex space into biologically meaningful subsets. The database contains more than 1750 protein complexes that are built from 2400 different genes, thus representing 12% of the protein-coding genes in human. A web-based system is available to query, view and download the data. CORUM provides a comprehensive dataset of protein complexes for discoveries in systems biology, analyses of protein networks, and protein complex-associated diseases. Comparable to the MIPS reference dataset of protein complexes from yeast, CORUM intends to serve as a reference for mammalian protein complexes. PMID:17965090
Psychocentricity and participant profiles: implications for lexical processing among multilinguals
Libben, Gary; Curtiss, Kaitlin; Weber, Silke
2014-01-01
Lexical processing among bilinguals is often affected by complex patterns of individual experience. In this paper we discuss the psychocentric perspective on language representation and processing, which highlights the centrality of individual experience in psycholinguistic experimentation. We discuss applications to the investigation of lexical processing among multilinguals and explore the advantages of using high-density experiments with multilinguals. High density experiments are designed to co-index measures of lexical perception and production, as well as participant profiles. We discuss the challenges associated with the characterization of participant profiles and present a new data visualization technique, that we term Facial Profiles. This technique is based on Chernoff faces developed over 40 years ago. The Facial Profile technique seeks to overcome some of the challenges associated with the use of Chernoff faces, while maintaining the core insight that recoding multivariate data as facial features can engage the human face recognition system and thus enhance our ability to detect and interpret patterns within multivariate datasets. We demonstrate that Facial Profiles can code participant characteristics in lexical processing studies by recoding variables such as reading ability, speaking ability, and listening ability into iconically-related relative sizes of eye, mouth, and ear, respectively. The balance of ability in bilinguals can be captured by creating composite facial profiles or Janus Facial Profiles. We demonstrate the use of Facial Profiles and Janus Facial Profiles in the characterization of participant effects in the study of lexical perception and production. PMID:25071614
Hirai, Toshinori; Itoh, Toshimasa; Kimura, Toshimi; Echizen, Hirotoshi
2018-06-06
Febuxostat is an active xanthine oxidase (XO) inhibitor that is widely used in the treatment of hyperuricemia. We aimed to evaluate the predictive performance of a pharmacokinetic-pharmacodynamic (PK-PD) model for the hypouricemic effects of febuxostat. Previously, we formulated a PK-PD model for predicting the hypouricemic effects of febuxostat as a function of baseline serum urate levels, body weight, renal function, and drug dose using datasets reported in preapproval studies (Hirai T et al., Biol Pharm Bull 2016; 39: 1013-21). Using an updated model with sensitivity analysis, we examined the predictive performance of the PK-PD model using datasets obtained from the medical records of patients who received febuxostat from March 2011 to December 2015 at Tokyo Women's Medical University Hospital. Multivariate regression analysis was performed to explore clinical variables that could improve the predictive performance of the model. A total of 1,199 serum urate measurements were retrieved from 168 patients (age: 60.5 ± 17.7 years, 71.4% males) who received febuxostat for hyperuricemia. There was a significant correlation (r=0.68, p<0.01) between observed serum urate levels and those predicted by the modified PK-PD model. The multivariate regression analysis revealed that the predictive performance of the model may be further improved by considering comorbidities such as diabetes mellitus, estimated glomerular filtration rate (eGFR), and co-administration of loop diuretics (r=0.77, p<0.01). The PK-PD model may be useful for predicting individualized maintenance doses of febuxostat in real-world patients. This article is protected by copyright. All rights reserved.
Ebrahimi, Milad; Gerber, Erin L; Rockaway, Thomas D
2017-05-15
For most water treatment plants, a large number of performance variables are recorded as time series. Due to the interconnectedness of the variables, it is often difficult to assess over-arching trends and quantify operational performance. The objective of this study was to establish simple and reliable predictive models to correlate target variables with specific measured parameters. This study presents a multivariate analysis of the physicochemical parameters of municipal wastewater. Fifteen quality and quantity parameters were analyzed using data recorded from 2010 to 2016. To determine the overall quality condition of raw and treated wastewater, a Wastewater Quality Index (WWQI) was developed. The index summarizes a large number of measured quality parameters into a single water quality term by considering pre-established quality limitation standards. To identify treatment process performance, the interdependencies between the variables were determined using Principal Component Analysis (PCA). The five components extracted from the 15 variables accounted for 75.25% of the total dataset information and adequately represented the organic, nutrient, oxygen-demanding, and ion activity loadings of the influent and effluent streams. The study also utilized the model to predict quality parameters such as Biological Oxygen Demand (BOD), Total Phosphorus (TP), and WWQI. High accuracies ranging from 71% to 97% were achieved when fitting the models with the training dataset, and relative prediction percentage errors of less than 9% were achieved for the testing dataset. The techniques and procedures presented in this paper provide an assessment framework for wastewater treatment monitoring programs. Copyright © 2017 Elsevier Ltd. All rights reserved.
Complex versus simple models: ion-channel cardiac toxicity prediction.
Mistry, Hitesh B
2018-01-01
There is growing interest in applying detailed mathematical models of the heart to ion-channel-related cardiac toxicity prediction. However, there is debate as to whether such complex models are required. Here, an assessment of predictive performance was conducted between two established large-scale biophysical cardiac models and a simple linear model, B net. Three ion-channel datasets were extracted from the literature. Each compound was assigned a cardiac risk category using two different classification schemes based on information within CredibleMeds. The predictive performance of each model within each dataset for each classification scheme was assessed via leave-one-out cross-validation. Overall, the B net model performed as well as the leading cardiac models in two of the datasets and outperformed both cardiac models on the most recent one. These results highlight the importance of benchmarking complex versus simple models, and also encourage the development of simple models.
An Easy Tool to Predict Survival in Patients Receiving Radiation Therapy for Painful Bone Metastases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Westhoff, Paulien G., E-mail: p.g.westhoff@umcutrecht.nl; Graeff, Alexander de; Monninkhof, Evelyn M.
2014-11-15
Purpose: Patients with bone metastases have a widely varying survival. A reliable estimation of survival is needed for appropriate treatment strategies. Our goal was to assess the value of simple prognostic factors, namely, patient and tumor characteristics, Karnofsky performance status (KPS), and patient-reported scores of pain and quality of life, to predict survival in patients with painful bone metastases. Methods and Materials: In the Dutch Bone Metastasis Study, 1157 patients were treated with radiation therapy for painful bone metastases. At randomization, physicians determined the KPS; patients rated general health on a visual analogue scale (VAS-gh), valuation of life on a verbal rating scale (VRS-vl) and pain intensity. To assess the predictive value of the variables, we used multivariate Cox proportional hazard analyses and C-statistics for discriminative value. Of the final model, calibration was assessed. External validation was performed on a dataset of 934 patients who were treated with radiation therapy for vertebral metastases. Results: Patients had mainly breast (39%), prostate (23%), or lung cancer (25%). After a maximum of 142 weeks' follow-up, 74% of patients had died. The best predictive model included sex, primary tumor, visceral metastases, KPS, VAS-gh, and VRS-vl (C-statistic = 0.72, 95% CI = 0.70-0.74). A reduced model, with only KPS and primary tumor, showed comparable discriminative capacity (C-statistic = 0.71, 95% CI = 0.69-0.72). External validation showed a C-statistic of 0.72 (95% CI = 0.70-0.73). Calibration of the derivation and the validation dataset showed underestimation of survival. Conclusion: In predicting survival in patients with painful bone metastases, KPS combined with primary tumor was comparable to a more complex model. Considering the amount of variables in complex models and the additional burden on patients, the simple model is preferred for daily use. In addition, a risk table for survival is provided.
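A minimal sketch of the modeling step follows, fitting a Cox proportional hazards model on two of the reported predictors and computing a C-statistic with the lifelines package; the patient data are synthetic and the coefficients are invented for illustration.

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter
    from lifelines.utils import concordance_index

    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "kps": rng.integers(40, 101, n).astype(float),  # Karnofsky score
        "breast_primary": rng.integers(0, 2, n).astype(float),
    })
    # Synthetic times: higher KPS and a breast primary mean longer survival.
    hazard = np.exp(-0.03 * df["kps"] - 0.5 * df["breast_primary"])
    df["weeks"] = rng.exponential(2.0 / hazard)
    df["died"] = (df["weeks"] < 142).astype(int)  # administrative censoring
    df.loc[df["died"] == 0, "weeks"] = 142.0

    cph = CoxPHFitter().fit(df, duration_col="weeks", event_col="died")
    # Higher partial hazard = shorter survival, so negate for concordance.
    c = concordance_index(df["weeks"], -cph.predict_partial_hazard(df),
                          df["died"])
    print("C-statistic:", round(c, 2))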
Kittel, T.G.F.; Rosenbloom, N.A.; Royle, J. Andrew; Daly, Christopher; Gibson, W.P.; Fisher, H.H.; Thornton, P.; Yates, D.N.; Aulenbach, S.; Kaufman, C.; McKeown, R.; Bachelet, D.; Schimel, D.S.; Neilson, R.; Lenihan, J.; Drapek, R.; Ojima, D.S.; Parton, W.J.; Melillo, J.M.; Kicklighter, D.W.; Tian, H.; McGuire, A.D.; Sykes, M.T.; Smith, B.; Cowling, S.; Hickler, T.; Prentice, I.C.; Running, S.; Hibbard, K.A.; Post, W.M.; King, A.W.; Smith, T.; Rizzo, B.; Woodward, F.I.
2004-01-01
Analysis and simulation of biospheric responses to historical forcing require surface climate data that capture those aspects of climate that control ecological processes, including key spatial gradients and modes of temporal variability. We developed a multivariate, gridded historical climate dataset for the conterminous USA as a common input database for the Vegetation/Ecosystem Modeling and Analysis Project (VEMAP), a biogeochemical and dynamic vegetation model intercomparison. The dataset covers the period 1895-1993 on a 0.5° latitude/longitude grid. Climate is represented at both monthly and daily timesteps. Variables are: precipitation, minimum and maximum temperature, total incident solar radiation, daylight-period irradiance, vapor pressure, and daylight-period relative humidity. The dataset was derived from US Historical Climate Network (HCN), cooperative network, and snowpack telemetry (SNOTEL) monthly precipitation and mean minimum and maximum temperature station data. We employed techniques that rely on geostatistical and physical relationships to create the temporally and spatially complete dataset. We developed a local kriging prediction model to infill discontinuous and limited-length station records based on the spatial autocorrelation structure of climate anomalies. A spatial interpolation model (PRISM) that accounts for physiographic controls was used to grid the infilled monthly station data. We implemented a stochastic weather generator (modified WGEN) to disaggregate the gridded monthly series to dailies. Radiation and humidity variables were estimated from the dailies using a physically-based empirical surface climate model (MTCLIM3). Derived datasets include a 100 yr model spin-up climate and a historical Palmer Drought Severity Index (PDSI) dataset. The VEMAP dataset exhibits statistically significant trends in temperature, precipitation, solar radiation, vapor pressure, and PDSI for US National Assessment regions. The historical climate and companion datasets are available online at data archive centers. © Inter-Research 2004.
NASA Astrophysics Data System (ADS)
Azami, Hamed; Escudero, Javier
2017-01-01
Multiscale entropy (MSE) is an appealing tool to characterize the complexity of time series over multiple temporal scales. Recent developments in the field have tried to extend the MSE technique in different ways. Building on these trends, we propose the so-called refined composite multivariate multiscale fuzzy entropy (RCmvMFE), whose coarse-graining step uses variance (RCmvMFEσ2) or mean (RCmvMFEμ). We investigate the behavior of these multivariate methods on multichannel white Gaussian and 1/f noise signals, and two publicly available biomedical recordings. Our simulations demonstrate that RCmvMFEσ2 and RCmvMFEμ lead to more stable results and are less sensitive to the signals' length in comparison with the other existing multivariate multiscale entropy-based methods. The classification results also show that using both the variance and mean in the coarse-graining step offers complexity profiles with complementary information for biomedical signal analysis. We also made freely available all the Matlab codes used in this paper.
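A minimal sketch of the two coarse-graining choices (mean vs. variance) follows; the refined-composite averaging over scale offsets, the multivariate extension, and the fuzzy-entropy step are omitted.

    import numpy as np

    def coarse_grain(x, scale, stat="mean"):
        """Coarse-grain a 1-D series at the given scale using the mean
        (traditional MSE) or the variance (the sigma^2 variant)."""
        n = (len(x) // scale) * scale
        blocks = np.asarray(x[:n], float).reshape(-1, scale)
        return blocks.mean(axis=1) if stat == "mean" else blocks.var(axis=1)

    rng = np.random.default_rng(0)
    white = rng.normal(size=3000)
    for s in (2, 5, 10):
        print(s, coarse_grain(white, s, "mean").std().round(3),
              coarse_grain(white, s, "var").mean().round(3))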
Smith, Joseph M.; Mather, Martha E.
2012-01-01
Ecological indicators are science-based tools used to assess how human activities have impacted environmental resources. For monitoring and environmental assessment, existing species assemblage data can be used to make these comparisons through time or across sites. An impediment to using assemblage data, however, is that these data are complex and need to be simplified in an ecologically meaningful way. Because multivariate statistics are mathematical relationships, statistical groupings may not make ecological sense and will not have utility as indicators. Our goal was to define a process to select defensible and ecologically interpretable statistical simplifications of assemblage data in which researchers and managers can have confidence. For this, we chose a suite of statistical methods, compared the groupings that resulted from these analyses, identified convergence among groupings, then we interpreted the groupings using species and ecological guilds. When we tested this approach using a statewide stream fish dataset, not all statistical methods worked equally well. For our dataset, logistic regression (Log), detrended correspondence analysis (DCA), cluster analysis (CL), and non-metric multidimensional scaling (NMDS) provided consistent, simplified output. Specifically, the Log, DCA, CL-1, and NMDS-1 groupings were ≥60% similar to each other, overlapped with the fluvial-specialist ecological guild, and contained a common subset of species. Groupings based on number of species (e.g., Log, DCA, CL and NMDS) outperformed groupings based on abundance [e.g., principal components analysis (PCA) and Poisson regression]. Although the specific methods that worked on our test dataset have generality, here we are advocating a process (e.g., identifying convergent groupings with redundant species composition that are ecologically interpretable) rather than the automatic use of any single statistical tool. We summarize this process in step-by-step guidance for the future use of these commonly available ecological and statistical methods in preparing assemblage data for use in ecological indicators.
Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples
White, James Robert; Nagarajan, Niranjan; Pop, Mihai
2009-01-01
Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries are computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/. PMID:19360128
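A simplified sketch of the two-branch testing scheme described above: Fisher's exact test for sparsely sampled features and a t-test on within-sample proportions otherwise, with Benjamini-Hochberg FDR control standing in for Metastats' permutation machinery; the counts are synthetic.

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    def metastats_like(a, b, sparse_cutoff=10):
        """a, b: feature-by-sample count matrices for two populations.
        Sparse features get Fisher's exact test on pooled counts; the
        rest get Welch's t-test on within-sample proportions."""
        prop_a = a / a.sum(axis=0)
        prop_b = b / b.sum(axis=0)
        pooled_a, pooled_b = a.sum(axis=1), b.sum(axis=1)
        pvals = []
        for i in range(a.shape[0]):
            if pooled_a[i] + pooled_b[i] < sparse_cutoff:
                table = [[pooled_a[i], pooled_a.sum() - pooled_a[i]],
                         [pooled_b[i], pooled_b.sum() - pooled_b[i]]]
                pvals.append(stats.fisher_exact(table)[1])
            else:
                pvals.append(stats.ttest_ind(prop_a[i], prop_b[i],
                                             equal_var=False).pvalue)
        # Benjamini-Hochberg control of the false discovery rate.
        return multipletests(pvals, method="fdr_bh")[0]

    rng = np.random.default_rng(0)
    a = rng.poisson(5, size=(50, 10)); a[0] *= 3  # feature 0 enriched
    b = rng.poisson(5, size=(50, 12))
    print("significant features:", np.where(metastats_like(a, b))[0])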
MANOVA for distinguishing experts' perceptions about entrepreneurship using NES data from GEM
NASA Astrophysics Data System (ADS)
Correia, Aldina; Costa e Silva, Eliana; Lopes, Isabel C.; Braga, Alexandra
2016-12-01
The Global Entrepreneurship Monitor is a large-scale database for internationally comparative entrepreneurship research that includes information about many aspects of entrepreneurship activities, perceptions, conditions, and national and regional policy, among others, for a large number of countries. This project has two main sources of primary data: the Adult Population Survey and the National Expert Survey. In this work, the 2011 and 2012 National Expert Survey datasets are studied. Our goal is to analyze the effects of the different types of entrepreneurship expert specialization on perceptions of the Entrepreneurial Framework Conditions. For this purpose, multivariate analysis of variance (MANOVA) is used. Some similarities between the results obtained for the 2011 and 2012 datasets were found; however, differences between experts still exist.
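A minimal sketch of such a MANOVA using statsmodels follows, with synthetic expert ratings standing in for the NES data; the group names and the two rated conditions are invented for illustration.

    import numpy as np
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    rng = np.random.default_rng(0)
    # Synthetic stand-in for NES data: three expert specializations each
    # rating two Entrepreneurial Framework Conditions on a 1-5 scale.
    groups = np.repeat(["academic", "policy", "business"], 40)
    df = pd.DataFrame({"group": groups})
    df["finance"] = df["group"].map(
        {"academic": 3.0, "policy": 3.4, "business": 2.8}
    ) + rng.normal(0, 0.5, len(df))
    df["education"] = df["group"].map(
        {"academic": 3.2, "policy": 2.9, "business": 3.1}
    ) + rng.normal(0, 0.5, len(df))

    # MANOVA of both ratings jointly on expert specialization.
    print(MANOVA.from_formula("finance + education ~ group", data=df).mv_test())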
Unsupervised learning on scientific ocean drilling datasets from the South China Sea
NASA Astrophysics Data System (ADS)
Tse, Kevin C.; Chiu, Hon-Chim; Tsang, Man-Yin; Li, Yiliang; Lam, Edmund Y.
2018-06-01
Unsupervised learning methods were applied to explore data patterns in multivariate geophysical datasets collected from ocean floor sediment core samples obtained by scientific ocean drilling in the South China Sea. In contrast to studies on similar datasets that used supervised learning methods, which are designed to make predictions from training data, unsupervised learning methods require no a priori information and focus only on the input data. In this study, popular unsupervised learning methods including K-means, self-organizing maps, hierarchical clustering and random forest were coupled with different distance metrics to form exploratory data clusters. The resulting data clusters were externally validated against the lithologic units and geologic time scales assigned to the datasets by conventional methods. Compact and connected data clusters displayed varying degrees of correspondence with the existing classification by lithologic units and geologic time scales. K-means and self-organizing maps were observed to perform better with lithologic units, while random forest corresponded best with geologic time scales. This study sets a pioneering example of how unsupervised machine learning methods can be used as an automatic processing tool for the increasingly high volume of scientific ocean drilling data.
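The cluster-then-validate workflow can be sketched with K-means plus an external agreement score against "known" lithologic-unit labels; the data below are synthetic placeholders for the core measurements.

```python
# Sketch of clustering followed by external validation. Synthetic data
# stand in for multivariate core measurements and lithologic labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in (0, 3, 6)])
lithologic_units = np.repeat([0, 1, 2], 50)   # "known" external labels

X_std = StandardScaler().fit_transform(X)     # distances are scale-sensitive
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

# External validation: agreement between data clusters and lithologic units
print("adjusted Rand index:", adjusted_rand_score(lithologic_units, clusters))
```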
Theofilatos, Konstantinos; Pavlopoulou, Niki; Papasavvas, Christoforos; Likothanassis, Spiros; Dimitrakopoulos, Christos; Georgopoulos, Efstratios; Moschopoulos, Charalampos; Mavroudi, Seferina
2015-03-01
Proteins are considered to be the most important individual components of biological systems and they combine to form physical protein complexes which are responsible for certain molecular functions. Despite the wide availability of protein-protein interaction (PPI) information, not much information is available about protein complexes. Experimental methods are limited in terms of time, efficiency, cost and performance constraints. Existing computational methods have provided encouraging preliminary results, but they face certain disadvantages: they require parameter tuning, some of them cannot handle weighted PPI data, and others do not allow a protein to participate in more than one protein complex. In the present paper, we propose a new fully unsupervised methodology for predicting protein complexes from weighted PPI graphs. The proposed methodology is called evolutionary enhanced Markov clustering (EE-MC) and it is a hybrid combination of an adaptive evolutionary algorithm and a state-of-the-art clustering algorithm named enhanced Markov clustering. EE-MC was compared with state-of-the-art methodologies when applied to datasets from human and from the yeast Saccharomyces cerevisiae. Using publicly available datasets, EE-MC outperformed existing methodologies (in some datasets the separation metric was increased by 10-20%). Moreover, when applied to new human datasets its performance was encouraging in the prediction of protein complexes which consist of proteins with high functional similarity. Specifically, 5737 protein complexes were predicted and 72.58% of them are enriched for at least one gene ontology (GO) function term. EE-MC is by design able to overcome intrinsic limitations of existing methodologies, such as their inability to handle weighted PPI networks, their constraint of assigning every protein to exactly one cluster, and the difficulties they face in parameter tuning. This was validated experimentally; moreover, new potentially true human protein complexes were suggested as candidates for further validation using experimental techniques. Copyright © 2015 Elsevier B.V. All rights reserved.
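A minimal sketch of the core (non-evolutionary) Markov clustering loop on a toy weighted adjacency matrix is shown below; EE-MC itself wraps an adaptive evolutionary algorithm around an enhanced variant of this procedure, which is not reproduced here.

```python
# Minimal sketch of plain Markov clustering (MCL) on a toy weighted
# PPI adjacency matrix. EE-MC additionally tunes parameters such as
# inflation with an evolutionary algorithm (not shown).
import numpy as np

def markov_clustering(adj, expansion=2, inflation=2.0, iters=50, tol=1e-6):
    M = adj + np.eye(len(adj))          # add self-loops
    M = M / M.sum(axis=0)               # column-stochastic transition matrix
    for _ in range(iters):
        last = M
        M = np.linalg.matrix_power(M, expansion)  # expansion: spread flow
        M = M ** inflation                        # inflation: favor strong flow
        M = M / M.sum(axis=0)
        if np.abs(M - last).max() < tol:
            break
    # Nonzero rows ("attractors") define the clusters
    clusters = {tuple(np.nonzero(row > 1e-8)[0])
                for row in M if row.max() > 1e-8}
    return sorted(clusters)

adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 0],
                [1, 1, 0, 0, 0],
                [0, 0, 0, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)   # toy weighted PPI graph
print(markov_clustering(adj))   # expect {0,1,2} and {3,4}
```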
POWERLIB: SAS/IML Software for Computing Power in Multivariate Linear Models
Johnson, Jacqueline L.; Muller, Keith E.; Slaughter, James C.; Gurka, Matthew J.; Gribbin, Matthew J.; Simpson, Sean L.
2014-01-01
The POWERLIB SAS/IML software provides convenient power calculations for a wide range of multivariate linear models with Gaussian errors. The software includes the Box, Geisser-Greenhouse, Huynh-Feldt, and uncorrected tests in the “univariate” approach to repeated measures (UNIREP), the Hotelling-Lawley Trace, Pillai-Bartlett Trace, and Wilks Lambda tests in the “multivariate” approach (MULTIREP), as well as a limited but useful range of mixed models. The familiar univariate linear model with Gaussian errors is an important special case. For estimated covariance, the software provides confidence limits for the resulting estimated power. All power and confidence limit values can be output to a SAS dataset, which can be used to easily produce plots and tables for manuscripts. PMID:25400516
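POWERLIB itself is SAS/IML; purely as a conceptual illustration of the univariate special case mentioned above, power for a linear-model F test can be computed from the noncentral F distribution. The degrees of freedom and noncentrality below are hypothetical.

```python
# Conceptual illustration only (not POWERLIB): power of a univariate
# linear-model F test via the noncentral F distribution.
from scipy.stats import f, ncf

alpha, df1, df2 = 0.05, 2, 27   # numerator / denominator degrees of freedom
noncentrality = 9.0             # hypothetical lambda implied by the effect size

f_crit = f.ppf(1 - alpha, df1, df2)          # critical value under H0
power = 1 - ncf.cdf(f_crit, df1, df2, noncentrality)
print(f"critical F = {f_crit:.3f}, power = {power:.3f}")
```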
Bertani, Francesca R; Mozetic, Pamela; Fioramonti, Marco; Iuliani, Michele; Ribelli, Giulia; Pantano, Francesco; Santini, Daniele; Tonini, Giuseppe; Trombetta, Marcella; Businaro, Luca; Selci, Stefano; Rainer, Alberto
2017-08-21
The possibility of detecting and classifying living cells in a label-free and non-invasive manner holds significant theranostic potential. In this work, Hyperspectral Imaging (HSI) has been successfully applied to the analysis of macrophagic polarization, given its central role in several pathological settings, including the regulation of the tumour microenvironment. Human monocyte-derived macrophages have been investigated using hyperspectral reflectance confocal microscopy, and hyperspectral datasets have been analysed in terms of M1 vs. M2 polarization by Principal Components Analysis (PCA). Following PCA, Linear Discriminant Analysis has been implemented for semi-automatic classification of macrophagic polarization from HSI data. Our results confirm the possibility to perform single-cell-level in vitro classification of M1 vs. M2 macrophages in a non-invasive and label-free manner with high accuracy (above 98% for cells deriving from the same donor), supporting the idea of applying the technique to the study of complex interacting cellular systems, such as in the case of tumour-immunity in vitro models.
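The PCA-then-LDA scheme can be sketched as a scikit-learn pipeline on made-up "spectra"; the real study used hyperspectral reflectance confocal microscopy data and per-donor validation, neither of which is reproduced here.

```python
# Sketch of PCA followed by LDA classification on synthetic stand-ins
# for hyperspectral pixel spectra labeled M1 / M2.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
spectra = np.vstack([rng.normal(0.0, 1.0, (60, 100)),   # "M1" spectra
                     rng.normal(0.5, 1.0, (60, 100))])  # "M2" spectra
labels = np.repeat(["M1", "M2"], 60)

clf = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
print("CV accuracy:", cross_val_score(clf, spectra, labels, cv=5).mean())
```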
Xu, Yun; Muhamadali, Howbeer; Sayqal, Ali; Dixon, Neil; Goodacre, Royston
2016-10-28
Partial least squares (PLS) is one of the most commonly used supervised modelling approaches for analysing multivariate metabolomics data. PLS is typically employed as either a regression model (PLS-R) or a classification model (PLS-DA). However, in metabolomics studies it is common to investigate multiple, potentially interacting, factors simultaneously following a specific experimental design. Such data often cannot be considered as a "pure" regression or a classification problem. Nevertheless, these data have often still been treated as a regression or classification problem and this could lead to ambiguous results. In this study, we investigated the feasibility of designing a hybrid target matrix Y that better reflects the experimental design than simple regression or binary class membership coding commonly used in PLS modelling. The new design of Y coding was based on the same principle used by structural modelling in machine learning techniques. Two real metabolomics datasets were used as examples to illustrate how the new Y coding can improve the interpretability of the PLS model compared to classic regression/classification coding.
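A minimal sketch of the idea, assuming synthetic data: fit PLS against a hybrid target matrix whose columns encode a two-factor experimental design (a dummy-coded group plus a continuous factor) rather than a single regression target or a binary class matrix. This illustrates the Y-coding concept only, not the authors' exact coding scheme.

```python
# Sketch: PLS with a hybrid Y encoding a two-factor design. Data are
# synthetic placeholders for metabolite intensity matrices.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 200))            # metabolite intensities
group = np.repeat([0, 1], 20)             # categorical factor (dummy coded)
dose = np.tile([1.0, 2.0, 4.0, 8.0], 10)  # continuous factor

# Hybrid target matrix: one column per factor of the experimental design
Y = np.column_stack([group, dose])

pls = PLSRegression(n_components=3).fit(X, Y)
print("X-scores shape:", pls.x_scores_.shape)
print("overall R^2 of the fit:", pls.score(X, Y))
```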
How does spatial extent of fMRI datasets affect independent component analysis decomposition?
Aragri, Adriana; Scarabino, Tommaso; Seifritz, Erich; Comani, Silvia; Cirillo, Sossio; Tedeschi, Gioacchino; Esposito, Fabrizio; Di Salle, Francesco
2006-09-01
Spatial independent component analysis (sICA) of functional magnetic resonance imaging (fMRI) time series can generate meaningful activation maps and associated descriptive signals, which are useful to evaluate datasets of the entire brain or selected portions of it. Besides computational implications, variations in the input dataset combined with the multivariate nature of ICA may lead to different spatial or temporal readouts of brain activation phenomena. By reducing and increasing a volume of interest (VOI), we applied sICA to different datasets from real activation experiments with multislice acquisition and single or multiple sensory-motor task-induced blood oxygenation level-dependent (BOLD) signal sources with different spatial and temporal structure. Using receiver operating characteristics (ROC) methodology for accuracy evaluation and multiple regression analysis as benchmark, we compared sICA decompositions of reduced and increased VOI fMRI time series containing auditory, motor and hemifield visual activation occurring separately or simultaneously in time. Both approaches yielded valid results; however, the results of the increased VOI approach were spatially more accurate than those of the reduced VOI approach. This is consistent with the capability of sICA to take advantage of extended samples of statistical observations and suggests that sICA is more powerful with extended rather than reduced VOI datasets to delineate brain activity. (c) 2006 Wiley-Liss, Inc.
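A conceptual sketch of spatial ICA, with synthetic data standing in for BOLD time series: rows are time points, columns are voxels, and decomposing the transposed matrix yields spatial maps plus associated time courses.

```python
# Conceptual sketch of spatial ICA on an fMRI-like matrix; synthetic
# sources stand in for real BOLD data.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
n_time, n_voxels = 120, 2000
maps = rng.laplace(size=(3, n_voxels))   # ground-truth spatial sources
ts = rng.normal(size=(n_time, 3))        # associated time courses
data = ts @ maps + 0.1 * rng.normal(size=(n_time, n_voxels))

# For *spatial* ICA, decompose the transposed matrix: voxels are the
# observations and time points are the mixed variables.
ica = FastICA(n_components=3, random_state=0)
spatial_components = ica.fit_transform(data.T)  # (n_voxels, 3) spatial maps
time_courses = ica.mixing_                      # (n_time, 3) descriptive signals
print(spatial_components.shape, time_courses.shape)
```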
R. L. Czaplewski
2009-01-01
The minimum variance multivariate composite estimator is a relatively simple sequential estimator for complex sampling designs (Czaplewski 2009). Such designs combine a probability sample of expensive field data with multiple censuses and/or samples of relatively inexpensive multi-sensor, multi-resolution remotely sensed data. Unfortunately, the multivariate composite...
Software ion scan functions in analysis of glycomic and lipidomic MS/MS datasets.
Haramija, Marko
2018-03-01
Hardware ion scan functions unique to the tandem mass spectrometry (MS/MS) mode of data acquisition, such as precursor ion scan (PIS) and neutral loss scan (NLS), are important for selective extraction of key structural data from complex MS/MS spectra. However, their software counterparts, software ion scan (SIS) functions, are still not regularly available. Software ion scan functions can easily be coded for additional functionalities, such as software multiple precursor ion scan, software no ion scan, and software variable ion scan functions. These are often necessary, since they allow more efficient analysis of the complex MS/MS datasets often encountered in glycomics and lipidomics. Software ion scan functions can be coded in modern scripting languages and can be independent of the instrument manufacturer. Here we demonstrate the utility of SIS functions on a medium-sized glycomic MS/MS dataset. Knowledge of sample properties, as well as of the diagnostic and conditional diagnostic ions crucial for data analysis, was needed. Based on tables constructed with the output data from the SIS functions performed, a detailed analysis of a complex glycomic MS/MS dataset could be carried out in a quick, accurate, and efficient manner. Glycomic research is progressing slowly and, with respect to MS experiments, one of the key obstacles to moving forward is the lack of appropriate bioinformatic tools necessary for fast analysis of glycomic MS/MS datasets. Adding novel SIS functionalities to the glycomic MS/MS toolbox has the potential to significantly speed up the glycomic data analysis process. Similar tools are useful for the analysis of lipidomic MS/MS datasets as well, as will be discussed briefly. Copyright © 2017 John Wiley & Sons, Ltd.
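A toy sketch of two SIS functions in plain Python follows, with a hypothetical spectrum data structure; m/z 204.087 (the HexNAc oxonium ion) and a 162.053 hexose neutral loss are used as example diagnostic values, and the spectra themselves are invented.

```python
# Toy software precursor ion scan (PIS) and neutral loss scan (NLS)
# over a hypothetical list-of-dicts spectrum representation.
spectra = [
    {"precursor_mz": 933.4, "fragments": [204.087, 366.140, 528.192]},
    {"precursor_mz": 771.3, "fragments": [138.055, 274.092, 609.247]},
]

def precursor_ion_scan(spectra, diagnostic_mz, tol=0.01):
    """Return precursors whose MS/MS spectra contain the diagnostic ion."""
    return [s["precursor_mz"] for s in spectra
            if any(abs(f - diagnostic_mz) <= tol for f in s["fragments"])]

def neutral_loss_scan(spectra, loss, tol=0.01):
    """Return precursors showing a fragment at precursor_mz - loss."""
    return [s["precursor_mz"] for s in spectra
            if any(abs((s["precursor_mz"] - f) - loss) <= tol
                   for f in s["fragments"])]

print(precursor_ion_scan(spectra, 204.087))   # HexNAc oxonium diagnostic ion
print(neutral_loss_scan(spectra, 162.053))    # hexose neutral loss
```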
On the Multi-Modal Object Tracking and Image Fusion Using Unsupervised Deep Learning Methodologies
NASA Astrophysics Data System (ADS)
LaHaye, N.; Ott, J.; Garay, M. J.; El-Askary, H. M.; Linstead, E.
2017-12-01
The number of different modalities of remote sensors has been on the rise, resulting in large datasets with different complexity levels. Such complex datasets can provide valuable information separately, yet there is greater value in a comprehensive view of them combined. As such, hidden information can be deduced by applying data mining techniques to the fused data. The curse of dimensionality of such fused data, due to the potentially vast dimension space, hinders our ability to understand them deeply. This is because each dataset requires a user to have instrument-specific and dataset-specific knowledge for optimum and meaningful usage. Once a user decides to use multiple datasets together, a deeper understanding of how to translate and combine these datasets in a correct and effective manner is needed. Although data-centric techniques exist, there are no generic automated methodologies that can solve this problem completely. Here we are developing a system that aims to gain a detailed understanding of different data modalities. Such a system will provide an analysis environment that gives the user useful feedback and can aid in research tasks. In our current work, we show the initial outputs of our system implementation, which leverages unsupervised deep learning techniques so as not to burden the user with the task of labeling input data, while still allowing for a detailed machine understanding of the data. Our goal is to be able to track objects, like cloud systems or aerosols, across different image-like data modalities. The proposed system is flexible, scalable and robust enough to understand complex likenesses within multi-modal data in a similar spatio-temporal range, and also to co-register and fuse these images when needed.
Spatiotemporal Permutation Entropy as a Measure for Complexity of Cardiac Arrhythmia
NASA Astrophysics Data System (ADS)
Schlemmer, Alexander; Berg, Sebastian; Lilienkamp, Thomas; Luther, Stefan; Parlitz, Ulrich
2018-05-01
Permutation entropy (PE) is a robust quantity for measuring the complexity of time series. In the cardiac community it is predominantly used in the context of electrocardiogram (ECG) signal analysis for diagnosis and prediction, with a major application found in heart rate variability parameters. In this article we combine spatial and temporal PE to form a spatiotemporal PE that captures both the complexity of spatial structures and temporal complexity at the same time. We demonstrate that the spatiotemporal PE (STPE) quantifies complexity using two datasets from simulated cardiac arrhythmia and compare it to phase singularity analysis and spatial PE (SPE). These datasets simulate ventricular fibrillation (VF) on a two-dimensional and a three-dimensional medium using the Fenton-Karma model. We show that SPE and STPE are robust against noise and demonstrate their usefulness for extracting complexity features at different spatial scales.
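A minimal sketch of ordinary (temporal) permutation entropy for a univariate series is given below; the spatiotemporal variant combines ordinal patterns along both space and time, which this toy omits.

```python
# Minimal permutation entropy: Shannon entropy of the distribution of
# ordinal patterns (rank orders) of length-`order` windows.
import numpy as np
from math import factorial

def permutation_entropy(x, order=3, delay=1):
    n_patterns = len(x) - (order - 1) * delay
    patterns = {}
    for i in range(n_patterns):
        window = x[i : i + order * delay : delay]
        key = tuple(np.argsort(window))        # ordinal pattern of the window
        patterns[key] = patterns.get(key, 0) + 1
    p = np.array(list(patterns.values()), dtype=float) / n_patterns
    # Normalize by log(order!) so the result lies in [0, 1]
    return -np.sum(p * np.log(p)) / np.log(factorial(order))

rng = np.random.default_rng(4)
print("noise:", permutation_entropy(rng.normal(size=5000)))          # near 1
print("sine :", permutation_entropy(np.sin(np.linspace(0, 60, 5000))))  # low
```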
Antibody-protein interactions: benchmark datasets and prediction tools evaluation
Ponomarenko, Julia V; Bourne, Philip E
2007-01-01
Background The ability to predict antibody binding sites (aka antigenic determinants or B-cell epitopes) for a given protein is a precursor to new vaccine design and diagnostics. Among the various methods of B-cell epitope identification, X-ray crystallography is one of the most reliable. Using these experimental data, computational methods for B-cell epitope prediction have been developed. As the number of structures of antibody-protein complexes grows, further interest in prediction methods using 3D structure is anticipated. This work aims to establish a benchmark for 3D structure-based epitope prediction methods. Results Two B-cell epitope benchmark datasets inferred from the 3D structures of antibody-protein complexes were defined. The first is a dataset of 62 representative 3D structures of protein antigens with inferred structural epitopes. The second is a dataset of 82 structures of antibody-protein complexes containing different structural epitopes. Using these datasets, eight web-servers developed for antibody and protein binding site prediction have been evaluated. No method achieved performance exceeding 40% precision and 46% recall. The values of the area under the receiver operating characteristic curve for the evaluated methods were about 0.6 for ConSurf, DiscoTope, and PPI-PRED methods and above 0.65 but not exceeding 0.70 for protein-protein docking methods when the best of the top ten models for the bound docking were considered; the remaining methods performed close to random. The benchmark datasets are included as a supplement to this paper. Conclusion It may be possible to improve epitope prediction methods through training on datasets which include only immune epitopes and through utilizing more features characterizing epitopes, for example, the evolutionary conservation score. Notwithstanding, the overall poor performance may reflect the generality of antigenicity and hence the inability to decipher B-cell epitopes as an intrinsic feature of the protein. It is an open question as to whether ultimately discriminatory features can be found. PMID:17910770
Evolution of niche preference in Sphagnum peat mosses.
Johnson, Matthew G; Granath, Gustaf; Tahvanainen, Teemu; Pouliot, Remy; Stenøien, Hans K; Rochefort, Line; Rydin, Håkan; Shaw, A Jonathan
2015-01-01
Peat mosses (Sphagnum) are ecosystem engineers: species in boreal peatlands simultaneously create and inhabit narrow habitat preferences along two microhabitat gradients, an ionic gradient and a hydrological hummock-hollow gradient. In this article, we demonstrate the connections between microhabitat preference and phylogeny in Sphagnum. Using a dataset of 39 species of Sphagnum, with an 18-locus DNA alignment and an ecological dataset encompassing three large published studies, we tested for phylogenetic signal and within-genus changes in evolutionary rate of eight niche descriptors and two multivariate niche gradients. We find little to no evidence for phylogenetic signal in most component descriptors of the ionic gradient, but interspecific variation along the hummock-hollow gradient shows considerable phylogenetic signal. We find support for a change in the rate of niche evolution within the genus: the hummock-forming subgenus Acutifolia has evolved along the multivariate hummock-hollow gradient faster than the hollow-inhabiting subgenus Cuspidata. Because peat mosses themselves create some of the ecological gradients constituting their own habitats, the classic microtopography of Sphagnum-dominated peatlands is maintained by evolutionary constraints and the biological properties of related Sphagnum species. The patterns of phylogenetic signal observed here will instruct future study on the role of functional traits in peatland growth and reconstruction. © 2014 The Author(s). Evolution © 2014 The Society for the Study of Evolution.
Multivariate statistical approach to estimate mixing proportions for unknown end members
Valder, Joshua F.; Long, Andrew J.; Davis, Arden D.; Kenner, Scott J.
2012-01-01
A multivariate statistical method is presented, which includes principal components analysis (PCA) and an end-member mixing model to estimate unknown end-member hydrochemical compositions and the relative mixing proportions of those end members in mixed waters. PCA, together with the Hotelling T2 statistic and a conceptual model of groundwater flow and mixing, was used in selecting samples that best approximate end members, which then were used as initial values in optimization of the end-member mixing model. This method was tested on controlled datasets (i.e., true values of estimates were known a priori) and found effective in estimating these end members and mixing proportions. The controlled datasets included synthetically generated hydrochemical data, synthetically generated mixing proportions, and laboratory analyses of sample mixtures, which were used in an evaluation of the effectiveness of this method for potential use in actual hydrological settings. For three different scenarios tested, correlation coefficients (R2) for linear regression between the estimated and known values ranged from 0.968 to 0.993 for mixing proportions and from 0.839 to 0.998 for end-member compositions. The method also was applied to field data from a study of end-member mixing in groundwater as a field example and partial method validation.
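The mixing-model step can be sketched, under invented end-member chemistry, as non-negative least squares with a sum-to-one constraint enforced by an appended weighted row; this illustrates the optimization target only, not the full PCA-plus-Hotelling-T2 selection procedure.

```python
# Sketch of estimating mixing proportions from known end members via
# non-negative least squares with a sum-to-one constraint. Chemistry
# values are invented for illustration.
import numpy as np
from scipy.optimize import nnls

# Columns = end members; rows = hydrochemical constituents (e.g., mg/L)
E = np.array([[120.0, 10.0, 45.0],
              [ 30.0, 80.0,  5.0],
              [  2.0,  1.5, 20.0]])
mixed = np.array([60.0, 45.0, 8.0])   # observed mixed-water sample

w = 100.0                              # weight enforcing sum-to-one
A = np.vstack([E, w * np.ones(3)])     # append weighted constraint row
b = np.append(mixed, w)
proportions, _ = nnls(A, b)
print("estimated mixing proportions:", proportions.round(3))
```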
Lauric, Alexandra; Baharoglu, Merih I; Malek, Adel M
2013-04-01
The variable definition of size ratio (SR) for sidewall (SW) vs bifurcation (BIF) aneurysms raises confusion for lesions harboring small branches, such as carotid ophthalmic or posterior communicating locations. These aneurysms are considered SW by many clinicians, but SR methodology classifies them as BIF. Our objectives were to evaluate the effect of ignoring small vessels, and of SW vs stringent BIF labeling, on SR rupture-detection performance in borderline aneurysms with small branches, and to reconcile SR-based labeling with clinical SW/BIF classification. Catheter rotational angiographic datasets of 134 consecutive aneurysms (60 ruptured) were automatically measured in three dimensions. Stringent BIF labeling was applied to clinically labeled aneurysms, with 21 aneurysms switching label from SW to BIF. Parent vessel size was evaluated both taking into account, and ignoring, small vessels. SR was defined accordingly as the ratio between aneurysm and parent vessel sizes. Univariate and multivariate statistics identified significant features. The square of the correlation coefficient (R²) was reported for bivariate analysis of alternative SR calculations. Regardless of SW/BIF labeling method, SR was equally significant in discriminating aneurysm rupture status (P < .001). Bivariate analysis of alternative SR had a high correlation of R² = 0.94 on the whole dataset, and R = 0.98 on the 21 borderline aneurysms. Ignoring small branches in the SR calculation maintains rupture-status detection performance, while reducing postprocessing complexity and removing labeling ambiguity. Aneurysms adjacent to these vessels can be considered SW for morphometric analysis. It is reasonable to use the clinical SW/BIF labeling when using SR for rupture risk evaluation.
Korsgaard, Inge Riis; Lund, Mogens Sandø; Sorensen, Daniel; Gianola, Daniel; Madsen, Per; Jensen, Just
2003-01-01
A fully Bayesian analysis using Gibbs sampling and data augmentation in a multivariate model of Gaussian, right censored, and grouped Gaussian traits is described. The grouped Gaussian traits are either ordered categorical traits (with more than two categories) or binary traits, where the grouping is determined via thresholds on the underlying Gaussian scale, the liability scale. Allowances are made for unequal models, unknown covariance matrices and missing data. Having outlined the theory, strategies for implementation are reviewed. These include joint sampling of location parameters; efficient sampling from the fully conditional posterior distribution of augmented data, a multivariate truncated normal distribution; and sampling from the conditional inverse Wishart distribution, the fully conditional posterior distribution of the residual covariance matrix. Finally, a simulated dataset was analysed to illustrate the methodology. This paper concentrates on a model where residuals associated with liabilities of the binary traits are assumed to be independent. A Bayesian analysis using Gibbs sampling is outlined for the model where this assumption is relaxed. PMID:12633531
Data imputation analysis for Cosmic Rays time series
NASA Astrophysics Data System (ADS)
Fernandes, R. C.; Lucio, P. S.; Fernandez, J. H.
2017-05-01
The occurrence of missing data in Galactic Cosmic Ray (GCR) time series is inevitable, since data are lost to mechanical and human failure or technical problems, and GCR stations operate over different periods. The aim of this study was to perform multiple imputation in order to reconstruct the observational dataset. The study used the monthly time series of GCR Climax (CLMX) and Roma (ROME) from 1960 to 2004 to simulate scenarios of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of missing data relative to the observed ROME series, with 50 replicates. The CLMX station was then used as a proxy for allocation of these scenarios. Three different methods for monthly dataset imputation were selected: Amelia II, which runs a bootstrap Expectation Maximization algorithm; MICE, which runs an algorithm for Multivariate Imputation by Chained Equations; and MTSDI, an Expectation Maximization-based method for imputation of missing values in multivariate normal time series. The synthetic time series were compared with the observed ROME series using several skill measures, such as RMSE, NRMSE, the Agreement Index, R, R2, the F-test and the t-test. The results showed that for CLMX and ROME, the R2 and R statistics were equal to 0.98 and 0.96, respectively. It was observed that increases in the number of gaps generate loss of quality in the time series. Data imputation was most efficient with the MTSDI method, with negligible errors and the best skill coefficients. The results suggest an upper limit of about 60% missing data for the imputation of monthly averages. It is noteworthy that the CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed 43 time series to be reconstructed.
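As a loose Python analogue of the MICE-style imputation compared above (the study itself used the R packages named), scikit-learn's IterativeImputer can fill artificial gaps in a two-station setup and be scored with RMSE, one of the skill measures listed; the series below are synthetic.

```python
# Chained-equation-style imputation of artificial gaps with
# scikit-learn's IterativeImputer; synthetic monthly series stand in
# for the CLMX / ROME data, and RMSE is the skill measure.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
t = np.arange(540)                                  # 45 years of months
clmx = 100 + 10 * np.sin(2 * np.pi * t / 132) + rng.normal(0, 1, t.size)
rome = 0.9 * clmx + rng.normal(0, 1, t.size)        # correlated proxy station

X = np.column_stack([clmx, rome])
X_missing = X.copy()
gaps = rng.random(t.size) < 0.30                    # 30% missing-data scenario
X_missing[gaps, 1] = np.nan                         # gaps only in "ROME"

X_imp = IterativeImputer(random_state=0).fit_transform(X_missing)
rmse = np.sqrt(np.mean((X_imp[gaps, 1] - X[gaps, 1]) ** 2))
print(f"RMSE on imputed values: {rmse:.2f}")
```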
Multi-criteria evaluation of CMIP5 GCMs for climate change impact analysis
NASA Astrophysics Data System (ADS)
Ahmadalipour, Ali; Rana, Arun; Moradkhani, Hamid; Sharma, Ashish
2017-04-01
Climate change is expected to have severe impacts on the global hydrological cycle and the food-water-energy nexus. Currently, many climate models are used to predict important climatic variables. Though there have been advances in the field, many problems remain to be resolved related to reliability, uncertainty, and computing needs, among others. In the present work, we have analyzed the performance of 20 different global climate models (GCMs) from the Climate Model Intercomparison Project Phase 5 (CMIP5) dataset over the Columbia River Basin (CRB) in the Pacific Northwest USA. We demonstrate a statistical multi-criteria approach, using univariate and multivariate techniques, for selecting suitable GCMs to be used for climate change impact analysis in the region. Univariate methods include the mean, standard deviation, coefficient of variation, relative change (variability), the Mann-Kendall test, and the Kolmogorov-Smirnov test (KS-test); the multivariate methods used were principal component analysis (PCA), singular value decomposition (SVD), canonical correlation analysis (CCA), and cluster analysis. The analysis is performed on raw GCM data, i.e., before bias correction, for the precipitation and temperature variables of all 20 models, to capture the reliability and nature of each model at the regional scale. The analysis is based on spatially averaged datasets of GCM output and observations for the period 1970 to 2000. Each GCM is ranked based on its performance evaluated against gridded observational data on various temporal scales (daily, monthly, and seasonal). The results provide insight into each of the methods and the statistical properties they address when ranking GCMs. Furthermore, the raw GCM simulations were also evaluated against different sets of gridded observational data in the area.
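One of the univariate criteria above, the KS-test, can be sketched as a ranking of models by the Kolmogorov-Smirnov distance between simulated and observed distributions; the "models" below are random stand-ins, not CMIP5 output.

```python
# Sketch of KS-based GCM ranking: smaller two-sample KS distance means
# the simulated distribution is closer to the observations. Data are
# random stand-ins for monthly precipitation.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(6)
observed = rng.gamma(shape=2.0, scale=3.0, size=360)
gcms = {f"GCM-{i}": rng.gamma(2.0, 3.0 + 0.3 * i, size=360) for i in range(5)}

scores = {name: ks_2samp(observed, sim).statistic for name, sim in gcms.items()}
for name, d in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: KS distance = {d:.3f}")
```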
Individual and socioeconomic factors associated with childhood immunization coverage in Nigeria
Oleribe, Obinna; Kumar, Vibha; Awosika-Olumo, Adebowale; Taylor-Robinson, Simon David
2017-01-01
Introduction Immunization is the world's most successful and cost-effective public health intervention, as it prevents over 2 million deaths annually. However, over 2 million deaths still occur yearly from vaccine-preventable diseases (VPDs), the majority of which occur in sub-Saharan Africa. Nigeria is a major contributor to global childhood deaths from VPDs. To date, Nigeria still has wild poliovirus in circulation. The objective of this study was to identify the individual and socioeconomic factors associated with immunization coverage in Nigeria through a secondary dataset analysis of the Nigeria Demographic and Health Survey (NDHS), 2013. Methods A quantitative analysis of the 2013 NDHS dataset was performed. Ethical approvals were obtained from the Walden University IRB and the National Health Research Ethics Committee of Nigeria. The dataset was downloaded, validated for completeness and analyzed using univariate, bivariate and multivariate statistics. Results Of 27,571 children aged 0 to 59 months, 22.1% had full vaccination, and 29% never received any vaccination. Immunization coverage was significantly associated with childbirth order, delivery place, child number, and presence or absence of a child health card. Maternal age, geographical location, education, religion, literacy, wealth index, marital status, and occupation were significantly associated with immunization coverage. Paternal education, occupation, and age were also significantly associated with coverage. The respondent's age, educational attainment and wealth index remained significantly related to immunization coverage at the 95% confidence level in multivariate analysis. Conclusion The study highlights child, parental and socioeconomic barriers to successful immunization programs in Nigeria. These findings need urgent attention, given the re-emergence of wild poliovirus in Nigeria. An effective, efficient, sustainable, accessible, and acceptable immunization program for children should be designed, developed and undertaken in Nigeria, with adequate strategies put in place to implement it. PMID:28690734
Crystal cryocooling distorts conformational heterogeneity in a model Michaelis complex of DHFR
Keedy, Daniel A.; van den Bedem, Henry; Sivak, David A.; Petsko, Gregory A.; Ringe, Dagmar; Wilson, Mark A.; Fraser, James S.
2014-01-01
Most macromolecular X-ray structures are determined from cryocooled crystals, but it is unclear whether cryocooling distorts functionally relevant flexibility. Here we compare independently acquired pairs of high-resolution datasets of a model Michaelis complex of dihydrofolate reductase (DHFR), collected by separate groups at both room and cryogenic temperatures. These datasets allow us to isolate the differences between experimental procedures and between temperatures. Our analyses of multiconformer models and time-averaged ensembles suggest that cryocooling suppresses and otherwise modifies sidechain and mainchain conformational heterogeneity, quenching dynamic contact networks. Despite some idiosyncratic differences, most changes from room temperature to cryogenic temperature are conserved, and likely reflect temperature-dependent solvent remodeling. Both cryogenic datasets point to additional conformations not evident in the corresponding room-temperature datasets, suggesting that cryocooling does not merely trap pre-existing conformational heterogeneity. Our results demonstrate that crystal cryocooling consistently distorts the energy landscape of DHFR, a paragon for understanding functional protein dynamics. PMID:24882744
Privacy-Preserving Integration of Medical Data: A Practical Multiparty Private Set Intersection.
Miyaji, Atsuko; Nakasho, Kazuhisa; Nishida, Shohei
2017-03-01
Medical data are often maintained by different organizations. However, detailed analyses sometimes require these datasets to be integrated without violating patient or commercial privacy. Multiparty Private Set Intersection (MPSI), an important privacy-preserving protocol, computes an intersection of multiple private datasets. This approach ensures that only designated parties can identify the intersection. In this paper, we propose a practical MPSI that satisfies the following requirements: the size of the dataset maintained by each party is independent of the others, and the computational complexity of the dataset held by each party is independent of the number of parties. Our MPSI is based on the use of an outsourcing provider, who has no knowledge of the data inputs or outputs. This reduces the computational complexity. The performance of the proposed MPSI is evaluated by implementing a prototype on a virtual private network to enable parallel computation in multiple threads. Our protocol is confirmed to be more efficient than comparable existing approaches.
A window-based time series feature extraction method.
Katircioglu-Öztürk, Deniz; Güvenir, H Altay; Ravens, Ursula; Baykal, Nazife
2017-10-01
This study proposes a robust similarity score-based time series feature extraction method termed Window-based Time series Feature ExtraCtion (WTC). Specifically, WTC generates domain-interpretable results and involves significantly low computational complexity, thereby rendering itself useful for densely sampled and populated time series datasets. In this study, WTC is applied to a proprietary action potential (AP) time series dataset on human cardiomyocytes and to three precordial leads from a publicly available electrocardiogram (ECG) dataset. WTC is then compared, in terms of predictive accuracy and computational complexity, with the shapelet transform and the fast shapelet transform (an accelerated variant of the shapelet transform). The results indicate that WTC achieves a slightly higher classification performance with significantly lower execution time when compared to its shapelet-based alternatives. With respect to its interpretable features, WTC has the potential to enable medical experts to explore definitive common trends in novel datasets. Copyright © 2017 Elsevier Ltd. All rights reserved.
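A generic sketch of window-based feature extraction (per-window mean, standard deviation, and slope) is shown below; the similarity-score machinery that defines WTC itself is not reproduced.

```python
# Generic window-based feature extraction: summarize each sliding
# window with a few interpretable statistics, yielding a feature matrix
# for a downstream classifier. Signal is a synthetic ECG-like trace.
import numpy as np

def window_features(x, width, step):
    feats = []
    for start in range(0, len(x) - width + 1, step):
        w = x[start : start + width]
        slope = np.polyfit(np.arange(width), w, 1)[0]   # linear trend
        feats.append((w.mean(), w.std(), slope))
    return np.array(feats)   # rows: windows; columns: features

rng = np.random.default_rng(7)
signal = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.05 * rng.normal(size=2000)
F = window_features(signal, width=100, step=50)
print(F.shape)   # (n_windows, 3)
```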
NASA Astrophysics Data System (ADS)
Poyatos, Rafael; Sus, Oliver; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
2018-05-01
The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2). We simulated gaps at different missingness levels (10-80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables allowed us to fill the gaps in the IEFC incomplete dataset (5495 plots) and quantify imputation uncertainty. Resulting spatial patterns of the studied traits in Catalan forests were broadly similar when using species means, regression kriging or the best-performing MICE application, but some important discrepancies were observed at the local level. Our results highlight the need to assess imputation quality beyond just imputation accuracy and show that including environmental information in statistical imputation approaches yields more plausible imputations in spatially explicit plant trait datasets.
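One of the compared imputation families, kNN, can be sketched with scikit-learn's KNNImputer on a toy trait matrix; the columns and values are invented stand-ins for traits such as wood density or maximum height.

```python
# Minimal kNN imputation sketch on a toy trait matrix with gaps;
# trait values are hypothetical.
import numpy as np
from sklearn.impute import KNNImputer

traits = np.array([[0.55, 21.0, 110.0],
                   [0.60, 24.0, np.nan],
                   [np.nan, 19.0, 95.0],
                   [0.48, np.nan, 80.0],
                   [0.52, 20.0, 100.0]])

imputer = KNNImputer(n_neighbors=2)   # average the 2 most similar plots
print(imputer.fit_transform(traits).round(2))
```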
Parsons, Helen M; Ludwig, Christian; Günther, Ulrich L; Viant, Mark R
2007-01-01
Background Classifying nuclear magnetic resonance (NMR) spectra is a crucial step in many metabolomics experiments. Since several multivariate classification techniques depend upon the variance of the data, it is important to first minimise any contribution from unwanted technical variance arising from sample preparation and analytical measurements, and thereby maximise any contribution from wanted biological variance between different classes. The generalised logarithm (glog) transform was developed to stabilise the variance in DNA microarray datasets, but has rarely been applied to metabolomics data. In particular, it has not been rigorously evaluated against other scaling techniques used in metabolomics, nor tested on all forms of NMR spectra including 1-dimensional (1D) 1H, projections of 2D 1H,1H J-resolved (pJRES), and intact 2D J-resolved (JRES). Results Here, the effects of the glog transform are compared against two commonly used variance stabilising techniques, autoscaling and Pareto scaling, as well as unscaled data. The four methods are evaluated in terms of the effects on the variance of NMR metabolomics data and on the classification accuracy following multivariate analysis, the latter achieved using principal component analysis followed by linear discriminant analysis. For two of three datasets analysed, classification accuracies were highest following glog transformation: 100% accuracy for discriminating 1D NMR spectra of hypoxic and normoxic invertebrate muscle, and 100% accuracy for discriminating 2D JRES spectra of fish livers sampled from two rivers. For the third dataset, pJRES spectra of urine from two breeds of dog, the glog transform and autoscaling achieved equal highest accuracies. Additionally we extended the glog algorithm to effectively suppress noise, which proved critical for the analysis of 2D JRES spectra. Conclusion We have demonstrated that the glog and extended glog transforms stabilise the technical variance in NMR metabolomics datasets. This significantly improves the discrimination between sample classes and has resulted in higher classification accuracies compared to unscaled, autoscaled or Pareto scaled data. Additionally we have confirmed the broad applicability of the glog approach using three disparate datasets from different biological samples using 1D NMR spectra, 1D projections of 2D JRES spectra, and intact 2D JRES spectra. PMID:17605789
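The glog transform has the standard closed form glog(y) = log(y + sqrt(y^2 + lambda)); the sketch below applies it to a mock spectral matrix, noting that in practice lambda must be optimised on replicate spectra, and the authors' noise-suppressing extension is not reproduced here.

```python
# Generalised logarithm (glog) variance stabilisation applied to a
# mock spectral matrix; lambda is fixed here but should be optimised
# on replicate spectra in a real analysis.
import numpy as np

def glog(y, lam):
    return np.log(y + np.sqrt(y ** 2 + lam))

rng = np.random.default_rng(8)
spectra = rng.gamma(shape=2.0, scale=50.0, size=(10, 500))  # mock intensities
transformed = glog(spectra, lam=1e2)

# Per-variable spread before and after transformation
print(spectra.std(axis=0).mean(), transformed.std(axis=0).mean())
```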
Falcaro, Milena; Pickles, Andrew
2007-02-10
We focus on the analysis of multivariate survival times with highly structured interdependency and subject to interval censoring. Such data are common in developmental genetics and genetic epidemiology. We propose a flexible mixed probit model that deals naturally with complex but uninformative censoring. The recorded ages of onset are treated as possibly censored ordinal outcomes with the interval censoring mechanism seen as arising from a coarsened measurement of a continuous variable observed as falling between subject-specific thresholds. This bypasses the requirement for the failure times to be observed as falling into non-overlapping intervals. The assumption of a normal age-of-onset distribution of the standard probit model is relaxed by embedding within it a multivariate Box-Cox transformation whose parameters are jointly estimated with the other parameters of the model. Complex decompositions of the underlying multivariate normal covariance matrix of the transformed ages of onset become possible. The new methodology is here applied to a multivariate study of the ages of first use of tobacco and first consumption of alcohol without parental permission in twins. The proposed model allows estimation of the genetic and environmental effects that are shared by both of these risk behaviours as well as those that are specific. 2006 John Wiley & Sons, Ltd.
Deformable Image Registration based on Similarity-Steered CNN Regression.
Cao, Xiaohuan; Yang, Jianhua; Zhang, Jun; Nie, Dong; Kim, Min-Jeong; Wang, Qian; Shen, Dinggang
2017-09-01
Existing deformable registration methods require exhaustive iterative optimization, along with careful parameter tuning, to estimate the deformation field between images. Although some learning-based methods have been proposed for initiating deformation estimation, they are often template-specific and not flexible in practical use. In this paper, we propose a convolutional neural network (CNN) based regression model to directly learn the complex mapping from the input image pair (i.e., a pair of template and subject) to their corresponding deformation field. Specifically, our CNN architecture is designed in a patch-based manner to learn the complex mapping from input patch pairs to their respective deformation fields. First, the equalized active-points guided sampling strategy is introduced to facilitate accurate CNN model learning upon a limited image dataset. Then, the similarity-steered CNN architecture is designed, where we propose to add an auxiliary contextual cue, i.e., the similarity between input patches, to more directly guide the learning process. Experiments on different brain image datasets demonstrate promising registration performance based on our CNN model. Furthermore, it is found that the trained CNN model from one dataset can be successfully transferred to another dataset, although brain appearances across datasets are quite variable.
Multivariate time series clustering on geophysical data recorded at Mt. Etna from 1996 to 2003
NASA Astrophysics Data System (ADS)
Di Salvo, Roberto; Montalto, Placido; Nunnari, Giuseppe; Neri, Marco; Puglisi, Giuseppe
2013-02-01
Time series clustering is an important task in data analysis, used to extract implicit, previously unknown, and potentially useful information from large collections of data. Finding useful similar trends in multivariate time series represents a challenge in several areas, including geophysical and environmental research. While traditional time series analysis methods deal only with univariate time series, multivariate time series analysis is a more suitable approach in fields of research where different kinds of data are available. Moreover, conventional time series clustering techniques do not provide the desired results for geophysical datasets, due to the huge amount of data and to sampling rates that differ according to the nature of each signal. In this paper, a novel approach to geophysical multivariate time series clustering is proposed, using dynamic time series segmentation and Self-Organizing Map techniques. This method allows finding couplings among trends of different geophysical data recorded by monitoring networks at Mt. Etna spanning from 1996 to 2003, when the transition from summit eruptions to flank eruptions occurred. This information can be used to carry out a more careful evaluation of the state of the volcano and to define potential hazard assessment at Mt. Etna.
Avalappampatty Sivasamy, Aneetha; Sundan, Bose
2015-01-01
The ever expanding communication requirements in today's world demand extensive and efficient network systems with equally efficient and reliable security features integrated for safe, confident, and secured communication and data transfer. Providing effective security protocols for any network environment therefore assumes paramount importance. Attempts are continuously being made to design more efficient and dynamic network intrusion detection models. In this work, an approach based on Hotelling's T2 method, a multivariate statistical analysis technique, has been employed for intrusion detection, especially in network environments. Components such as preprocessing, multivariate statistical analysis, and attack detection have been incorporated in developing the multivariate Hotelling's T2 statistical model, and the necessary profiles have been generated based on the T-square distance metrics. With a threshold range obtained using the central limit theorem, observed traffic profiles have been classified as either normal or attack types. The performance of the model, as evaluated through validation and testing using the KDD Cup'99 dataset, has shown very high detection rates for all classes with low false alarm rates. The accuracy of the model presented in this work has been found to be much better than that of existing models. PMID:26357668
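The core T2 decision rule can be sketched as follows; note that the threshold here uses a chi-square approximation for brevity, whereas the paper derives its threshold range via the central limit theorem, and the traffic features are synthetic.

```python
# Hotelling's T^2 anomaly rule: distance of a new traffic profile from
# the "normal" training distribution, with a chi-square control limit.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
normal_traffic = rng.normal(size=(500, 6))   # training profiles, 6 features
mu = normal_traffic.mean(axis=0)
S_inv = np.linalg.inv(np.cov(normal_traffic, rowvar=False))

def t_squared(x):
    d = x - mu
    return d @ S_inv @ d                     # Mahalanobis-type distance

threshold = chi2.ppf(0.999, df=6)            # approximate control limit
probe = np.array([4.0, 4.0, 0.0, 0.0, 0.0, 0.0])   # suspicious profile
label = "attack" if t_squared(probe) > threshold else "normal"
print(f"T^2 = {t_squared(probe):.1f}, threshold = {threshold:.1f}: {label}")
```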
Vitte, Joana; Ranque, Stéphane; Carsin, Ania; Gomez, Carine; Romain, Thomas; Cassagne, Carole; Gouitaa, Marion; Baravalle-Einaudi, Mélisande; Bel, Nathalie Stremler-Le; Reynaud-Gaubert, Martine; Dubus, Jean-Christophe; Mège, Jean-Louis; Gaudart, Jean
2017-01-01
Molecular-based allergy diagnosis yields multiple biomarker datasets. The classical diagnostic score for allergic bronchopulmonary aspergillosis (ABPA), a severe disease usually occurring in asthmatic patients and people with cystic fibrosis, comprises succinct immunological criteria formulated in 1977: total IgE, anti-Aspergillus fumigatus (Af) IgE, anti-Af "precipitins," and anti-Af IgG. Progress achieved over the last four decades led to multiple IgE and IgG(4) Af biomarkers available with quantitative, standardized, molecular-level reports. These newly available biomarkers have not been included in the current diagnostic criteria, either individually or in algorithms, despite persistent underdiagnosis of ABPA. Large numbers of individual biomarkers may hinder their use in clinical practice. Conversely, multivariate analysis using new tools may bring about a better chance of fewer diagnostic mistakes. We report here a proof-of-concept work consisting of a three-step multivariate analysis of Af IgE, IgG, and IgG4 biomarkers through a combination of principal component analysis, hierarchical ascendant classification, and classification and regression tree multivariate analysis. The resulting diagnostic algorithms might show the way for novel criteria and improved diagnostic efficiency in Af-sensitized patients at risk for ABPA.
Copula-based prediction of economic movements
NASA Astrophysics Data System (ADS)
García, J. E.; González-López, V. A.; Hirsh, I. D.
2016-06-01
In this paper we model the discretized returns of two paired time series, the BM&FBOVESPA Dividend Index and the BM&FBOVESPA Public Utilities Index, using multivariate Markov models. The discretization uses three categories: high losses, high profits, and the complementary periods of the series. In technical terms, the maximal memory that can be considered for a Markov model can be derived from the sizes of the alphabet and the dataset. The number of parameters needed to specify a discrete multivariate Markov chain grows exponentially with the order and dimension of the chain, and in this case the size of the database is not large enough for consistent estimation of such a model. We apply a strategy to estimate a multivariate process with an order greater than that achievable using standard procedures. The new strategy consists of obtaining a partition of the state space, constructed from a combination of the partitions corresponding to the two marginal processes and the partition corresponding to the multivariate Markov chain. In order to estimate the transition probabilities, all the partitions are linked using a copula. In our application this strategy provides a significant improvement in movement predictions.
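As a toy illustration of the estimation problem (before any partitioning or copula linkage), the sketch below estimates a first-order joint transition matrix for two discretized series; even this order-1, dimension-2 case already needs a 9x9 matrix, which is why the parameter count explodes with order. The state sequences are invented.

```python
# Estimate a first-order joint transition matrix for two discretized
# return series (states: L = high loss, N = neutral, P = high profit).
import itertools
import numpy as np

states = ["L", "N", "P"]
pairs = list(itertools.product(states, states))   # 9 joint states
idx = {s: i for i, s in enumerate(pairs)}

div = ["N", "P", "N", "L", "N", "N", "P", "N", "L", "N"]   # index 1, invented
uti = ["N", "N", "P", "L", "L", "N", "P", "P", "N", "N"]   # index 2, invented

counts = np.zeros((len(pairs), len(pairs)))
for t in range(len(div) - 1):
    counts[idx[(div[t], uti[t])], idx[(div[t + 1], uti[t + 1])]] += 1

row_sums = counts.sum(axis=1, keepdims=True)
P = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
print(P.round(2))   # 9x9 transition matrix; order k would need 9**k rows
```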
NASA Astrophysics Data System (ADS)
Marques, R.; Amaral, P.; Zêzere, J. L.; Queiroz, G.; Goulart, C.
2009-04-01
Slope instability research and susceptibility mapping are fundamental components of hazard assessment and are of extreme importance for risk mitigation, land-use management and emergency planning. Landslide susceptibility zonation has been actively pursued during the last two decades and several methodologies are still being improved. Among the methods presented in the literature, indirect quantitative probabilistic methods have been used extensively. In this work different linear probabilistic methods, both bivariate and multivariate (Informative Value, Fuzzy Logic, Weights of Evidence and Logistic Regression), were used to compute the spatial probability of landslide occurrence, using the pixel as the mapping unit. The methods are based on linear relationships between landslides and 9 conditioning factors (altimetry, slope angle, exposition, curvature, distance to streams, wetness index, contribution area, lithology and land use). It was assumed that future landslides will be conditioned by the same factors as past landslides in the study area. The work was developed for the Ribeira Quente Valley (S. Miguel Island, Azores), a study area of 9.5 km2 mainly composed of volcanic deposits (ash and pumice lapilli) produced by explosive eruptions of Furnas Volcano. These materials, combined with the steepness of the slopes (38.9% of the area has slope angles higher than 35°, reaching a maximum of 87.5°), make the area very prone to landslide activity. A total of 1,495 shallow landslides were mapped (at 1:5,000 scale) and included in a GIS database. The total affected area is 401,744 m2 (4.5% of the study area). Most slope movements are translational slides frequently evolving into debris flows. The landslides are elongated, with maximum length generally equivalent to the slope extent, and their width normally does not exceed 25 m. The failure depth rarely exceeds 1.5 m and the volume is usually smaller than 700 m3. For modelling purposes, the landslides were randomly divided into two sub-datasets: a modelling dataset with 748 events (2.2% of the study area) and a validation dataset with 747 events (2.3% of the study area). The susceptibility models obtained with the different probabilistic techniques were rated individually using success rate and prediction rate curves. The best model performance was obtained with logistic regression, although the results of the different methods do not show significant differences in either success or prediction rate curves. This evidence reveals that: (1) the modelling landslide dataset is representative of the characteristics of the entire landslide population; and (2) increasing the complexity and robustness of the probabilistic methodology did not produce a significant increase in success or prediction rates. Therefore, it was concluded that the resolution and quality of the input variables are much more important than the choice of probabilistic model for assessing landslide susceptibility. This work was developed within the VOLCSOILRISK project (Volcanic Soils Geotechnical Characterization for Landslide Risk Mitigation), supported by Direcção Regional da Ciência e Tecnologia - Governo Regional dos Açores.
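As a schematic illustration of one branch of this workflow (logistic regression with a modelling/validation split), the sketch below fits a susceptibility model on synthetic pixels and scores the validation pixels with an ROC AUC as a stand-in for the prediction-rate curve; the conditioning factors, weights, and sizes are all invented.

```python
# Logistic-regression susceptibility sketch on synthetic pixels with a
# modelling / validation split; AUC stands in for the prediction-rate
# curve used in the study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)
n = 5000
factors = rng.normal(size=(n, 9))               # 9 conditioning factors / pixel
true_w = np.array([1.5, 0.8, 0, 0, 0.5, 0, 0, 0.3, 0])   # invented weights
p = 1 / (1 + np.exp(-(factors @ true_w - 2.0)))
landslide = rng.random(n) < p                   # True = landslide pixel

half = n // 2                                   # modelling / validation split
model = LogisticRegression(max_iter=1000).fit(factors[:half], landslide[:half])
suscept = model.predict_proba(factors[half:])[:, 1]
print("validation AUC:", round(roc_auc_score(landslide[half:], suscept), 3))
```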
NASA Astrophysics Data System (ADS)
Lateh, Masitah Abdul; Kamilah Muda, Azah; Yusof, Zeratul Izzah Mohd; Azilah Muda, Noor; Sanusi Azmi, Mohd
2017-09-01
The emerging era of big data over the past few years has led to large and complex datasets that demand faster and better decision making. However, small-dataset problems still arise in certain areas, making analysis and decisions hard to reach. In order to build a prediction model, a large sample is required as the training sample for the model; a small dataset is insufficient to produce an accurate prediction model. This paper reviews an artificial data generation approach as one solution to the small dataset problem.
Nnane, Daniel Ekane
2011-11-15
Contamination of surface waters is a pervasive threat to human health; hence the need to better understand the sources and spatio-temporal variations of contaminants within river catchments. River catchment managers are required to sustainably monitor and manage the quality of surface waters. Catchment managers therefore need cost-effective, low-cost, long-term sustainable water quality monitoring and management designs to proactively protect public health and aquatic ecosystems. Multivariate and phage-lysis techniques were used to investigate spatio-temporal variations of water quality, the main polluting chemophysical and microbial parameters, and faecal micro-organism sources, and to establish 'sentry' sampling sites in the Ouse River catchment, southeast England, UK. A total of 350 river water samples were analysed for fourteen chemophysical and microbial water quality parameters in conjunction with the novel human-specific phages of Bacteroides GB-124. Annual, autumn, spring, summer, and winter principal components (PCs) explained approximately 54%, 75%, 62%, 48%, and 60%, respectively, of the total variance present in the datasets. Significant loadings of Escherichia coli, intestinal enterococci, turbidity, and human-specific Bacteroides GB-124 phages were observed in all datasets. Cluster analysis successfully grouped sampling sites into five clusters. Importantly, multivariate and phage-lysis techniques were useful in determining the sources and spatial extent of water contamination in the catchment. Though human faecal contamination was significant during dry periods, the main source of contamination was non-human. Bacteroides GB-124 could potentially be used for routine microbial water quality monitoring in catchments. For a cost-effective, low-cost, long-term sustainable water quality monitoring design, E. coli or intestinal enterococci, turbidity, and Bacteroides GB-124 should be monitored all year round in this river catchment. Copyright © 2011 Elsevier B.V. All rights reserved.
Unsupervised classification of multivariate geostatistical data: Two algorithms
NASA Astrophysics Data System (ADS)
Romary, Thomas; Ors, Fabien; Rivoirard, Jacques; Deraisme, Jacques
2015-12-01
With the increasing development of remote sensing platforms and the evolution of sampling facilities in the mining and oil industries, spatial datasets are becoming larger, contain a growing number of variables, and cover wider and wider areas. It is therefore often necessary to split the domain of study to account for radically different behaviors of the natural phenomenon over the domain and to simplify the subsequent modeling step. The definition of these areas can be seen as a problem of unsupervised classification, or clustering, where we try to divide the domain into zones that are homogeneous with respect to the values taken by the variables at hand. The application of classical clustering methods, designed for independent observations, does not ensure the spatial coherence of the resulting classes. Image segmentation methods, based on e.g. Markov random fields, are not adapted to irregularly sampled data. Other existing approaches, based on mixtures of Gaussian random functions estimated via the expectation-maximization algorithm, are limited to reasonable sample sizes and a small number of variables. In this work, we propose two algorithms based on adaptations of classical algorithms to multivariate geostatistical data. Both algorithms are model-free and can handle large volumes of multivariate, irregularly spaced data. The first proceeds by agglomerative hierarchical clustering, where spatial coherence is ensured by a proximity condition imposed for two clusters to merge. This proximity condition relies on a graph organizing the data in the coordinate space, so the hierarchical algorithm can be seen as a graph-partitioning algorithm. Following this interpretation, a spatial version of the spectral clustering algorithm is also proposed. The performance of both algorithms is assessed on toy examples and a mining dataset.
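A sketch of the first algorithm's core idea, using scikit-learn's connectivity-constrained agglomerative clustering as a stand-in: a k-nearest-neighbour graph built in coordinate space restricts which clusters may merge, which enforces the spatial coherence described above. Data and parameters are synthetic.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.neighbors import kneighbors_graph

    rng = np.random.default_rng(0)
    coords = rng.uniform(size=(2000, 2))     # irregular sample locations (x, y)
    values = rng.normal(size=(2000, 5))      # multivariate measurements per site

    # Graph over the coordinate space: two clusters may only merge if they are
    # neighbours in this graph -- the 'proximity condition' in the abstract.
    connectivity = kneighbors_graph(coords, n_neighbors=10, include_self=False)
    labels = AgglomerativeClustering(
        n_clusters=4, connectivity=connectivity, linkage="ward"
    ).fit_predict(values)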
Fast and Flexible Multivariate Time Series Subsequence Search
NASA Technical Reports Server (NTRS)
Bhaduri, Kanishka; Oza, Nikunj C.; Zhu, Qiang; Srivastava, Ashok N.
2010-01-01
Multivariate time series (MTS) are ubiquitous and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which often contain several gigabytes of data. Surprisingly, research on MTS search is very limited: most existing work supports only queries of a fixed length or on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two algorithms to solve this problem: (1) a List Based Search (LBS) algorithm, which uses sorted lists for indexing, and (2) an R*-tree Based Search (RBS), which uses Minimum Bounding Rectangles (MBRs) to organize the subsequences. Both algorithms guarantee that all matching patterns within the specified thresholds will be returned (no false dismissals); the very few false alarms can be removed by a post-processing step. Since our framework is also capable of univariate time series (UTS) subsequence search, we first demonstrate the efficiency of our algorithms on several UTS datasets previously used in the literature. We follow this with experiments on two large MTS databases from the aviation domain, each containing several million observations. Both tests show that our algorithms have very high prune rates (>99%), requiring disk access for less than 1% of the observations. To the best of our knowledge, MTS subsequence search has never been attempted on datasets of the size used in this paper.
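For the query semantics only (any variable subset, arbitrary per-variable delays), a brute-force reference is easy to state; the LBS/RBS indexing that makes the search fast is the paper's contribution and is not reproduced here. All names below are hypothetical.

    import numpy as np

    def naive_mts_search(db, queries, lags, tol):
        """Find start positions t where each queried variable v matches its
        query subsequence within `tol`, shifted by a per-variable delay
        lags[v].  db: (T, n_vars) array; queries: {var: 1-D query array}."""
        T = db.shape[0]
        max_end = max(lags[v] + len(q) for v, q in queries.items())
        hits = []
        for t in range(T - max_end + 1):
            ok = all(
                np.max(np.abs(db[t + lags[v]:t + lags[v] + len(q), v] - q)) <= tol
                for v, q in queries.items()
            )
            if ok:
                hits.append(t)
        return hits

    db = np.cumsum(np.random.randn(1000, 3), axis=0)
    q = {0: db[100:110, 0], 2: db[103:115, 2]}     # pattern taken from the data
    print(naive_mts_search(db, q, lags={0: 0, 2: 3}, tol=1e-9))  # -> [100]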
Finding the Maine Story in Huge Cumbersome National Monitoring Datasets
What’s a manager, analyst, or concerned citizen to do with the complex datasets generated by State and Federal monitoring efforts? Is it possible to use such information to address Maine’s environmental issues without having a degree in informatics and statistics? This presentati...
Levin-Schwartz, Yuri; Song, Yang; Schreier, Peter J.; Calhoun, Vince D.; Adalı, Tülay
2016-01-01
Due to their data-driven nature, multivariate methods such as canonical correlation analysis (CCA) have proven very useful for fusion of multimodal neurological data. However, being able to determine the degree of similarity between datasets and appropriate order selection are crucial to the success of such techniques. The standard methods for calculating the order of multimodal data focus only on sources with the greatest individual energy and ignore relations across datasets. Additionally, these techniques as well as the most widely-used methods for determining the degree of similarity between datasets assume sufficient sample support and are not effective in the sample-poor regime. In this paper, we propose to jointly estimate the degree of similarity between datasets and their order when few samples are present using principal component analysis and canonical correlation analysis (PCA-CCA). By considering these two problems simultaneously, we are able to minimize the assumptions placed on the data and achieve superior performance in the sample-poor regime compared to traditional techniques. We apply PCA-CCA to the pairwise combinations of functional magnetic resonance imaging (fMRI), structural magnetic resonance imaging (sMRI), and electroencephalogram (EEG) data drawn from patients with schizophrenia and healthy controls while performing an auditory oddball task. The PCA-CCA results indicate that the fMRI and sMRI datasets are the most similar, whereas the sMRI and EEG datasets share the least similarity. We also demonstrate that the degree of similarity obtained by PCA-CCA is highly predictive of the degree of significance found for components generated using CCA. PMID:27039696
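A minimal numpy sketch of the PCA-CCA idea: reduce each dataset to a few principal components, then read the canonical correlations off the singular values of the whitened cross-covariance. The dimensions and data are synthetic stand-ins, and the published method's joint order/similarity estimation is only gestured at here.

    import numpy as np

    def pca_reduce(X, k):
        """Return the first k principal-component scores of X."""
        X = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k] * s[:k]

    def canonical_correlations(X, Y):
        """Singular values of the whitened cross-covariance = canonical corrs."""
        Qx, _ = np.linalg.qr(X - X.mean(axis=0))
        Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
        return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

    rng = np.random.default_rng(0)
    shared = rng.normal(size=(40, 2))            # few samples, 2 shared sources
    X = shared @ rng.normal(size=(2, 50)) + 0.5 * rng.normal(size=(40, 50))
    Y = shared @ rng.normal(size=(2, 60)) + 0.5 * rng.normal(size=(40, 60))
    corrs = canonical_correlations(pca_reduce(X, 5), pca_reduce(Y, 5))
    # large leading canonical correlations indicate similar datasets; the PCA
    # step keeps the estimate stable when samples are scarce.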
Association between split selection instability and predictive error in survival trees.
Radespiel-Tröger, M; Gefeller, O; Rabenstein, T; Hothorn, T
2006-01-01
To evaluate split selection instability in six survival tree algorithms and its relationship with predictive error by means of a bootstrap study. We study the following algorithms: logrank statistic with multivariate p-value adjustment without pruning (LR), Kaplan-Meier distance of survival curves (KM), martingale residuals (MR), Poisson regression for censored data (PR), within-node impurity (WI), and exponential log-likelihood loss (XL). With the exception of LR, initial trees are pruned by using split-complexity, and final trees are selected by means of cross-validation. We employ a real dataset from a clinical study of patients with gallbladder stones. The predictive error is evaluated using the integrated Brier score for censored data. The relationship between split selection instability and predictive error is evaluated by means of box-percentile plots, covariate and cutpoint selection entropy, and cutpoint selection coefficients of variation, respectively, in the root node. We found a positive association between covariate selection instability and predictive error in the root node. LR yields the lowest predictive error, while KM and MR yield the highest predictive error. The predictive error of survival trees is related to split selection instability. Based on the low predictive error of LR, we recommend the use of this algorithm for the construction of survival trees. Unpruned survival trees with multivariate p-value adjustment can perform equally well compared to pruned trees. The analysis of split selection instability can be used to communicate the results of tree-based analyses to clinicians and to support the application of survival trees.
Multivariate analysis of flow cytometric data using decision trees.
Simon, Svenja; Guthke, Reinhard; Kamradt, Thomas; Frey, Oliver
2012-01-01
Characterization of the response of the host immune system is important in understanding the bidirectional interactions between the host and microbial pathogens. For research on the host side, flow cytometry has become one of the major tools in immunology. Advances in technology and reagents now allow the simultaneous assessment of multiple markers at the single-cell level, generating multidimensional datasets that require multivariate statistical analysis. We explored the explanatory power of the supervised machine learning method called "induction of decision trees" on flow cytometric data. In order to examine whether the production of a certain cytokine depends on other cytokines, datasets from intracellular staining for six cytokines with complex patterns of co-expression were analyzed by induction of decision trees. After weighting the data according to their class probabilities, we created a total of 13,392 different decision trees for each given cytokine with different parameter settings. For a more realistic estimation of the decision trees' quality, we used stratified fivefold cross-validation and chose the "best" tree according to a combination of different quality criteria. While some of the decision trees reflected previously known co-expression patterns, we found that the expression of some cytokines was dependent not only on the co-expression of others per se, but also on the intensity of their expression. Thus, for the first time we successfully used induction of decision trees for the analysis of high-dimensional flow cytometric data and demonstrated the feasibility of this method to reveal structural patterns in such datasets.
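A hedged sketch of the tree-induction setup described above, using scikit-learn in place of whatever tool the authors used: class-probability weighting via class_weight='balanced', a small parameter grid standing in for the 13,392 settings, and stratified fivefold cross-validation. Data are synthetic.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import StratifiedKFold, GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.random(size=(3000, 5))                 # five "other" cytokine intensities
    y = (X[:, 0] * X[:, 1] > 0.25).astype(int)     # target cytokine produced or not

    grid = GridSearchCV(
        DecisionTreeClassifier(class_weight="balanced", random_state=0),
        param_grid={"max_depth": [3, 5, 7, None], "min_samples_leaf": [5, 20, 50]},
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
        scoring="balanced_accuracy",
    ).fit(X, y)
    best_tree = grid.best_estimator_   # inspect with sklearn.tree.export_text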
Hamilton, C A; Miller, A; Casablanca, Y; Horowitz, N S; Rungruang, B; Krivak, T C; Richard, S D; Rodriguez, N; Birrer, M J; Backes, F J; Geller, M A; Quinn, M; Goodheart, M J; Mutch, D G; Kavanagh, J J; Maxwell, G L; Bookman, M A
2018-02-01
To identify clinicopathologic factors associated with 10-year overall survival in epithelial ovarian cancer (EOC) and primary peritoneal cancer (PPC), and to develop a predictive model identifying long-term survivors. Demographic, surgical, and clinicopathologic data were abstracted from GOG 182 records. The association between clinical variables and long-term survival (LTS) (>10 years) was assessed using multivariable regression analysis. Bootstrap methods were used to develop predictive models from known prognostic clinical factors, and predictive accuracy was quantified using the optimism-adjusted area under the receiver operating characteristic curve (AUC). The analysis dataset included 3010 evaluable patients, of whom 195 survived more than ten years. These patients were more likely to have better performance status, endometrioid histology, stage III (rather than stage IV) disease, absence of ascites, less extensive preoperative disease distribution, microscopic residual disease following cytoreduction (R0), and decreased complexity of surgery (p<0.01). Multivariable regression analysis revealed that lower CA-125 levels, absence of ascites, stage, and R0 were significant independent predictors of LTS. A predictive model created using these variables had an AUC=0.729, which outperformed any of the individual predictors. The absence of ascites, a low CA-125, stage, and R0 at the time of cytoreduction are factors associated with LTS when controlling for other confounders. An extensively annotated clinicopathologic prediction model for LTS fell short of clinical utility, suggesting that prognostic molecular profiles are needed to better predict which patients are likely to be long-term survivors. Published by Elsevier Inc.
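The bootstrap optimism correction for AUC can be sketched in a few lines. The logistic model and synthetic predictors below are stand-ins for the study's clinicopathologic variables; the function name and setup are illustrative only.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def optimism_adjusted_auc(X, y, n_boot=200, seed=0):
        """Harrell-style bootstrap: apparent AUC minus average optimism."""
        rng = np.random.default_rng(seed)
        fit = lambda A, b: LogisticRegression(max_iter=1000).fit(A, b)
        apparent = roc_auc_score(y, fit(X, y).predict_proba(X)[:, 1])
        optimism = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y), len(y))          # bootstrap resample
            m = fit(X[idx], y[idx])
            auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
            auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
            optimism.append(auc_boot - auc_orig)
        return apparent - np.mean(optimism)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 4))   # e.g. CA-125, ascites, stage, residual disease
    y = (X @ np.array([1.0, 0.8, 0.5, 0.3]) + rng.normal(size=500) > 1.5).astype(int)
    print(optimism_adjusted_auc(X, y, n_boot=50))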
Spatial and temporal synchrony in reptile population dynamics in variable environments.
Greenville, Aaron C; Wardle, Glenda M; Nguyen, Vuong; Dickman, Chris R
2016-10-01
Resources are seldom distributed equally across space, yet many species exhibit spatially synchronous population dynamics. Such synchrony suggests the operation of large-scale external drivers, such as rainfall or wildfire, or the influence of oasis sites that provide water, shelter, or other resources. However, testing the generality of these factors is not easy, especially in variable environments. Using a long-term dataset (13-22 years) from a large (8,000 km2) study region in arid Central Australia, we first tested for regional synchrony in annual rainfall and the dynamics of six reptile species across nine widely separated sites. For species that showed synchronous spatial dynamics, we then used multivariate auto-regressive state-space (MARSS) models to test the prediction that regional rainfall would be positively associated with their populations. For asynchronous species, we used MARSS models to explore four other possible population structures: (1) populations were asynchronous, (2) differed between oasis and non-oasis sites, (3) differed between burnt and unburnt sites, or (4) differed between three sub-regions with different rainfall gradients. Only one species showed evidence of spatial population synchrony, and our results provide little evidence that rainfall synchronizes reptile populations. The oasis or wildfire hypotheses were the best-fitting models for the other five species. Thus, our six study species appear generally to be structured in space into one or two populations across the study region. Our findings suggest that for arid-dwelling reptile populations, spatial and temporal dynamics are structured by abiotic events, but individual responses to covariates at smaller spatial scales are complex and poorly understood.
NASA Astrophysics Data System (ADS)
Xu, Xianjin; Yan, Chengfei; Zou, Xiaoqin
2017-08-01
The growing number of protein-ligand complex structures, particularly the structures of proteins co-bound with different ligands, in the Protein Data Bank helps us tackle two major challenges in molecular docking studies: protein flexibility and the scoring function. Here, we introduce a systematic strategy that uses the information embedded in known protein-ligand complex structures to improve both binding mode and binding affinity predictions. Specifically, a ligand similarity calculation method is employed to search for a receptor structure whose bound ligand shares high similarity with the query ligand, for use in docking. The strategy was applied to the two datasets (HSP90 and MAP4K4) in the recent D3R Grand Challenge 2015. In addition, for the HSP90 dataset, a system-specific scoring function (ITScore2_hsp90) was generated by recalibrating our statistical potential-based scoring function (ITScore2) using the known protein-ligand complex structures and the statistical mechanics-based iterative method. For the HSP90 dataset, better performances were achieved for both binding mode and binding affinity predictions compared with the original ITScore2 and with ensemble docking. For the MAP4K4 dataset, although there were only eight known protein-ligand complex structures, our docking strategy achieved a performance comparable to ensemble docking. Our method for receptor conformational selection and our iterative method for the development of system-specific statistical potential-based scoring functions can easily be applied to other protein targets that have a number of protein-ligand complex structures available, to improve predictions on binding.
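A sketch of the receptor-selection step under the assumption that a simple 2D fingerprint similarity is acceptable; the paper's exact ligand similarity measure may differ, and RDKit's Morgan fingerprints with Tanimoto similarity are used here purely as a stand-in.

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def pick_receptor(query_smiles, cocrystal_smiles):
        """Return the index of the co-crystallized ligand most similar to the
        query, i.e. whose receptor conformation to dock the query into."""
        fp = lambda s: AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(s), 2, nBits=2048)
        q = fp(query_smiles)
        sims = [DataStructs.TanimotoSimilarity(q, fp(s)) for s in cocrystal_smiles]
        return max(range(len(sims)), key=sims.__getitem__), sims

    # idx, sims = pick_receptor(query, library)  # query/library: hypothetical SMILES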
Yang, Guanxue; Wang, Lin; Wang, Xiaofan
2017-06-07
Reconstruction of the networks underlying complex systems is one of the most crucial problems in many areas of engineering and science. In this paper, rather than identifying parameters of complex systems governed by pre-defined models, or taking polynomial and rational functions as prior information for subsequent model selection, we put forward a general framework for nonlinear causal network reconstruction from time series with limited observations. Obtaining multi-source datasets through a data-fusion strategy, we propose a novel method to handle the nonlinearity and directionality of complex networked systems, namely group lasso nonlinear conditional Granger causality. Specifically, our method exploits different sets of radial basis functions to approximate the nonlinear interactions between each pair of nodes and integrates sparsity into grouped variable selection. The performance of our approach is first assessed with two types of simulated datasets from nonlinear vector autoregressive models and nonlinear dynamic models, and then verified on the benchmark datasets from DREAM3 Challenge 4. Effects of data size and noise intensity are also discussed. All of the results demonstrate that the proposed method performs better in terms of the area under the precision-recall curve.
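A compact sketch of the core computation, assuming a single target, one-step lags, and a plain proximal-gradient group-lasso solver; the published method is more general, and every parameter here is illustrative.

    import numpy as np

    def rbf_features(x, centers, width):
        """Radial basis expansion of a 1-D series: (T,) -> (T, n_centers)."""
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))

    def group_lasso_granger(target, drivers, lam, n_centers=5, iters=2000):
        """Regress target[t] on RBF expansions of each driver at t-1 with a
        group-lasso penalty (one group per driver); a group surviving the
        group soft-threshold marks that driver as a putative nonlinear cause."""
        Phi = np.hstack([
            rbf_features(x[:-1], np.linspace(x.min(), x.max(), n_centers),
                         width=x.std() + 1e-9)
            for x in drivers])
        y = target[1:]
        lr = 1.0 / (np.linalg.norm(Phi, 2) ** 2)   # step from Lipschitz constant
        w = np.zeros(Phi.shape[1])
        for _ in range(iters):                     # proximal gradient descent
            w -= lr * (Phi.T @ (Phi @ w - y))
            for g in range(len(drivers)):          # group soft-thresholding
                sl = slice(g * n_centers, (g + 1) * n_centers)
                nrm = np.linalg.norm(w[sl])
                w[sl] = 0.0 if nrm <= lr * lam else w[sl] * (1.0 - lr * lam / nrm)
        return [np.linalg.norm(w[g * n_centers:(g + 1) * n_centers]) > 0
                for g in range(len(drivers))]

    rng = np.random.default_rng(0)
    x1, x2 = rng.standard_normal(500), rng.standard_normal(500)
    target = np.tanh(np.r_[0, x1[:-1]]) + 0.1 * rng.standard_normal(500)
    print(group_lasso_granger(target, [x1, x2], lam=5.0))   # expected: [True, False]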
TATES: Efficient Multivariate Genotype-Phenotype Analysis for Genome-Wide Association Studies
van der Sluis, Sophie; Posthuma, Danielle; Dolan, Conor V.
2013-01-01
To date, the genome-wide association study (GWAS) is the primary tool to identify genetic variants that cause phenotypic variation. As GWAS analyses are generally univariate in nature, multivariate phenotypic information is usually reduced to a single composite score. This practice often results in loss of statistical power to detect causal variants. Multivariate genotype-phenotype methods do exist but attain maximal power only in special circumstances. Here, we present a new multivariate method that we refer to as TATES (Trait-based Association Test that uses Extended Simes procedure), inspired by the GATES procedure proposed by Li et al. (2011). For each component of a multivariate trait, TATES combines p-values obtained in standard univariate GWAS to acquire one trait-based p-value, while correcting for correlations between components. Extensive simulations, probing a wide variety of genotype-phenotype models, show that TATES's false positive rate is correct, and that TATES's statistical power to detect causal variants explaining 0.5% of the variance can be 2.5-9 times higher than the power of univariate tests based on composite scores and 1.5-2 times higher than the power of the standard MANOVA. Unlike other multivariate methods, TATES detects both genetic variants that are common to multiple phenotypes and genetic variants that are specific to a single phenotype, i.e., TATES provides a more complete view of the genetic architecture of complex traits. As the actual causal genotype-phenotype model is usually unknown and probably phenotypically and genetically complex, TATES, available as an open source program, constitutes a powerful new multivariate strategy that allows researchers to identify novel causal variants, while the complexity of traits is no longer a limiting factor. PMID:23359524
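The plain Simes procedure at the heart of TATES takes only a few lines; as the comment notes, TATES swaps the raw test counts for effective numbers of tests derived from the phenotypic correlation matrix, which this sketch omits.

    import numpy as np

    def simes(pvals):
        """Simes combination: min over j of (m * p_(j) / j) for sorted p-values."""
        p = np.sort(np.asarray(pvals))
        j = np.arange(1, p.size + 1)
        return (p.size * p / j).min()

    # univariate GWAS p-values for one SNP across 4 trait components:
    print(simes([0.002, 0.03, 0.20, 0.55]))   # combined trait-based p-value

    # TATES replaces the raw counts (p.size and j) with *effective* numbers of
    # tests obtained from the eigenvalues of the phenotypic correlation matrix,
    # which is the correction for correlated components described above.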
Evolving hard problems: Generating human genetics datasets with a complex etiology.
Himmelstein, Daniel S; Greene, Casey S; Moore, Jason H
2011-07-07
A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Here we develop and evaluate a model-free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model-free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight hundred Pareto fronts, one for each independent run of our algorithm. In each run, the predictiveness of single genetic variants and pairs of genetic variants was minimized, while the predictiveness of third-, fourth-, or fifth-order combinations was maximized. Two hundred runs of the algorithm were further dedicated to creating datasets with predictive fourth- or fifth-order interactions and minimized lower-level effects. This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make the entire Pareto-optimal front of datasets from each run freely available to the community so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
An interactive web application for the dissemination of human systems immunology data.
Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien
2015-06-19
Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State-of-the-art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets was loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples is displayed dynamically; if desired, the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page ( https://gxb.benaroyaresearch.org/dm3/landing.gsp )]. The source code is also available openly [Gene Expression Browser Source Code ( https://github.com/BenaroyaResearch/gxbrowser )]. We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.
Jia, Peilin; Wang, Lily; Fanous, Ayman H.; Pato, Carlos N.; Edwards, Todd L.; Zhao, Zhongming
2012-01-01
With the recent success of genome-wide association studies (GWAS), a wealth of association data has been generated for more than 200 complex diseases/traits, creating a strong demand for data integration and interpretation. Combined analysis of multiple GWAS datasets, or integrative analysis of GWAS data and other high-throughput data, has been particularly promising. In this study, we propose an integrative analysis framework for multiple GWAS datasets that overlays association signals onto the protein-protein interaction network, and we demonstrate it using schizophrenia datasets. Building on a dense module search algorithm, we first searched for significantly enriched subnetworks for schizophrenia in each single GWAS dataset and then implemented a discovery-evaluation strategy to identify module genes with consistent association signals. We validated the module genes in an independent dataset, and also examined them through meta-analysis of the related SNPs using multiple GWAS datasets. As a result, we identified 205 module genes with a joint effect significantly associated with schizophrenia; these module genes included a number of well-studied candidate genes such as DISC1, GNA12, GNA13, GNAI1, GPR17, and GRIN2B. Further functional analysis suggested these genes are involved in neuronal-related processes. Additionally, meta-analysis found that 18 SNPs in 9 module genes had Pmeta < 1×10−4, including the gene HLA-DQA1, located in the MHC region on chromosome 6, which was reported in previous studies using the largest cohort of schizophrenia patients to date. These results demonstrate that our bi-directional network-based strategy is efficient for identifying disease-associated genes with modest signals in GWAS datasets. This approach can be applied to any other complex disease/trait where multiple GWAS datasets are available. PMID:22792057
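A sketch of a greedy dense-module search of the kind referenced above, scoring a module by the normalized sum of gene z-scores; the actual dmGWAS-style procedure has additional details, and the graph, z-scores, and stopping rule here are simplified stand-ins.

    import networkx as nx
    import numpy as np

    def dense_module_search(G, z, seed, d=0.1):
        """Greedy expansion from a seed gene: repeatedly add the neighbour with
        the highest z-score if the module score Zm = sum(z) / sqrt(k) improves
        by more than a factor (1 + d).  z maps gene -> z-score derived from its
        best GWAS association p-value."""
        module, zm = {seed}, z[seed]
        while True:
            frontier = set().union(*(G.neighbors(n) for n in module)) - module
            if not frontier:
                return module
            best = max(frontier, key=lambda n: z[n])
            zm_new = sum(z[n] for n in module | {best}) / np.sqrt(len(module) + 1)
            if zm_new <= zm * (1 + d):
                return module
            module.add(best)
            zm = zm_new

    rng = np.random.default_rng(0)
    G = nx.karate_club_graph()                     # toy stand-in for a PPI network
    z = {n: float(abs(rng.normal())) for n in G}   # toy stand-in gene z-scores
    print(dense_module_search(G, z, seed=0))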
Testing for significance of phase synchronisation dynamics in the EEG.
Daly, Ian; Sweeney-Reed, Catherine M; Nasuto, Slawomir J
2013-06-01
A number of tests exist to check for statistical significance of phase synchronisation within the Electroencephalogram (EEG); however, the majority suffer from a lack of generality and applicability. They may also fail to account for temporal dynamics in the phase synchronisation, regarding synchronisation as a constant state instead of a dynamical process. Therefore, a novel test is developed for identifying the statistical significance of phase synchronisation based upon a combination of work characterising temporal dynamics of multivariate time-series and Markov modelling. We show how this method is better able to assess the significance of phase synchronisation than a range of commonly used significance tests. We also show how the method may be applied to identify and classify significantly different phase synchronisation dynamics in both univariate and multivariate datasets.
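Phase synchronisation itself is commonly quantified by the phase-locking value computed from Hilbert-transform phases. A windowed version, against which surrogate-based significance could be tested, might look as follows; the paper's Markov-model test itself is not reproduced here.

    import numpy as np
    from scipy.signal import hilbert

    def plv(x, y):
        """Phase-locking value: 1 = perfectly phase-locked, ~0 = no relation."""
        dphi = np.angle(hilbert(x)) - np.angle(hilbert(y))
        return np.abs(np.mean(np.exp(1j * dphi)))

    def sliding_plv(x, y, win, step):
        """PLV in sliding windows -- the dynamical view of synchronisation;
        phase-shuffled surrogates would supply the null distribution."""
        return np.array([plv(x[s:s + win], y[s:s + win])
                         for s in range(0, len(x) - win + 1, step)])

    fs = 256
    t = np.arange(0, 4, 1 / fs)
    x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
    y = np.sin(2 * np.pi * 10 * t + 0.3) + 0.5 * np.random.randn(t.size)
    print(sliding_plv(x, y, win=fs, step=fs // 2))   # 1-s windows, 50% overlap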
NASA Astrophysics Data System (ADS)
Darvishzadeh, R.; Skidmore, A. K.; Mirzaie, M.; Atzberger, C.; Schlerf, M.
2014-12-01
Accurate estimation of grassland biomass at peak productivity can provide crucial information about the functioning and productivity of rangelands. Hyperspectral remote sensing has proved valuable for estimating vegetation biophysical parameters such as biomass using different statistical techniques. However, in statistical analysis of hyperspectral data, multicollinearity is a common problem due to the large number of correlated hyperspectral reflectance measurements. The aim of this study was to examine the prospect of above-ground biomass estimation in a heterogeneous Mediterranean rangeland employing multivariate calibration methods. Canopy spectral measurements were made in the field using a GER 3700 spectroradiometer, along with concomitant in situ measurements of above-ground biomass for 170 sample plots. Multivariate calibrations including partial least squares regression (PLSR), principal component regression (PCR), and least-squares support vector machines (LS-SVM) were used to estimate above-ground biomass. The prediction accuracy of the multivariate calibration methods was assessed using cross-validated R2 and RMSE. The best model performance was obtained using LS-SVM and then PLSR, both calibrated with the first-derivative reflectance dataset (R2cv = 0.88 and 0.86, RMSEcv = 1.15 and 1.07, respectively). The weakest prediction accuracy appeared when PCR was used (R2cv = 0.31 and RMSEcv = 2.48). The obtained results highlight the importance of multivariate calibration methods for biomass estimation when hyperspectral data are used.
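A sketch of the PLSR branch of this comparison with scikit-learn, using first-derivative spectra and cross-validated R2/RMSE as above; the spectra and biomass values are synthetic stand-ins, and the component count is arbitrary.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    refl = rng.random(size=(170, 600))               # canopy spectra for 170 plots
    biomass = refl[:, 40:60].mean(axis=1) * 10 + rng.normal(0, 0.5, 170)

    X = np.gradient(refl, axis=1)                    # first-derivative reflectance
    pred = cross_val_predict(PLSRegression(n_components=8), X, biomass, cv=10).ravel()
    ss_res = np.sum((biomass - pred) ** 2)
    r2cv = 1 - ss_res / np.sum((biomass - biomass.mean()) ** 2)
    rmsecv = np.sqrt(ss_res / biomass.size)
    print(r2cv, rmsecv)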
BESST--efficient scaffolding of large fragmented assemblies.
Sahlin, Kristoffer; Vezzi, Francesco; Nystedt, Björn; Lundeberg, Joakim; Arvestad, Lars
2014-08-15
The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed, but many of them use the number of read pairs supporting a link between two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance. We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST compares favorably with the other tested scaffolders on the GAGE datasets and, moreover, outperforms the other methods when the library insert size distribution is wide. We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
Dimitrakopoulos, Christos; Theofilatos, Konstantinos; Pegkas, Andreas; Likothanassis, Spiros; Mavroudi, Seferina
2016-07-01
Proteins are vital biological molecules driving many fundamental cellular processes. They rarely act alone, but form interacting groups called protein complexes. The study of protein complexes is a key goal in systems biology. Recently, large protein-protein interaction (PPI) datasets have been published and a plethora of computational methods that provide new ideas for the prediction of protein complexes have been implemented. However, most of the methods suffer from two major limitations: First, they do not account for proteins participating in multiple functions and second, they are unable to handle weighted PPI graphs. Moreover, the problem remains open as existing algorithms and tools are insufficient in terms of predictive metrics. In the present paper, we propose gradually expanding neighborhoods with adjustment (GENA), a new algorithm that gradually expands neighborhoods in a graph starting from highly informative "seed" nodes. GENA considers proteins as multifunctional molecules allowing them to participate in more than one protein complex. In addition, GENA accepts weighted PPI graphs by using a weighted evaluation function for each cluster. In experiments with datasets from Saccharomyces cerevisiae and human, GENA outperformed Markov clustering, restricted neighborhood search and clustering with overlapping neighborhood expansion, three state-of-the-art methods for computationally predicting protein complexes. Seven PPI networks and seven evaluation datasets were used in total. GENA outperformed existing methods in 16 out of 18 experiments achieving an average improvement of 5.5% when the maximum matching ratio metric was used. Our method was able to discover functionally homogeneous protein clusters and uncover important network modules in a Parkinson expression dataset. When used on the human networks, around 47% of the detected clusters were enriched in gene ontology (GO) terms with depth higher than five in the GO hierarchy. In the present manuscript, we introduce a new method for the computational prediction of protein complexes by making the realistic assumption that proteins participate in multiple protein complexes and cellular functions. Our method can detect accurate and functionally homogeneous clusters. Copyright © 2016 Elsevier B.V. All rights reserved.
Cell nuclei and cytoplasm joint segmentation using the sliding band filter.
Quelhas, Pedro; Marcuzzo, Monica; Mendonça, Ana Maria; Campilho, Aurélio
2010-08-01
Microscopy cell image analysis is a fundamental tool for biological research. In particular, multivariate fluorescence microscopy is used to observe different aspects of cells in cultures. It is still common practice to perform analysis tasks by visual inspection of individual cells, which is time consuming, exhausting and prone to subjective bias. This makes automatic cell image analysis essential for large-scale, objective studies of cell cultures. Traditionally, automatic cell analysis is approached through image segmentation methods that extract cells' locations and shapes. Image segmentation, although fundamental, is neither an easy task in computer vision nor robust to changes in image quality, which leaves segmentation-based cell detection semi-automated, requiring frequent tuning of parameters. We introduce a new approach for cell detection and shape estimation in multivariate images based on the sliding band filter (SBF). This filter's design makes it well suited to detecting overall convex shapes, and as such it performs well for cell detection. Furthermore, the parameters involved are intuitive, as they are directly related to the expected cell size. Using the SBF filter, we detect the locations and shapes of cells' nuclei and cytoplasm. Based on the assumption that each cell has approximately the same shape center in both the nuclear and cytoplasmic fluorescence channels, we guide cytoplasm shape estimation by the nuclear detections, improving performance and reducing errors. We then validate cell detection by gathering evidence from the nuclei and cytoplasm channels. Additionally, we include overlap correction and shape regularization steps, which further improve the estimated cell shapes. The approach is evaluated using two datasets with different types of data: a 20-image benchmark set of simulated cell culture images containing 1000 simulated cells, and a 16-image Drosophila melanogaster Kc167 dataset containing 1255 cells, stained for DNA and actin. Both image datasets present a difficult problem due to the high variability of cell shapes and frequent cluster overlap between cells. On the Drosophila dataset our approach achieved precision/recall of 95%/69% and 82%/90% for nuclei and cytoplasm detection, respectively, and an overall accuracy of 76%.
An algorithm for direct causal learning of influences on patient outcomes.
Rathnam, Chandramouli; Lee, Sanghoon; Jiang, Xia
2017-01-01
This study aims at developing and introducing a new algorithm, called direct causal learner (DCL), for learning the direct causal influences of a single target. We applied it to both simulated and real clinical and genome-wide association study (GWAS) datasets and compared its performance to classic causal learning algorithms. The DCL algorithm learns the causes of a single target from passive data using Bayesian scoring instead of independence checks, together with a novel deletion algorithm. We generate 14,400 simulated datasets and measure the number of datasets for which DCL correctly and partially predicts the direct causes. We then compare its performance with the constraint-based path consistency (PC) and conservative PC (CPC) algorithms, the Bayesian-score based fast greedy search (FGS) algorithm, and the partial ancestral graph algorithm fast causal inference (FCI). In addition, we extend our comparison of all five algorithms to both a real GWAS dataset and real breast cancer datasets over various time points in order to observe how effective they are at predicting the causal influences of Alzheimer's disease and breast cancer survival. DCL consistently outperforms FGS, PC, CPC, and FCI in discovering the parents of the target for the datasets simulated using a simple network. Overall, DCL predicts significantly more datasets correctly (McNemar's test significance: p<0.0001) than any of the other algorithms for these network types. For example, when assessing overall performance (simple and complex network results combined), DCL correctly predicts approximately 1400 more datasets than the top FGS method, 1600 more datasets than the top CPC method, 4500 more datasets than the top PC method, and 5600 more datasets than the top FCI method. Although FGS did correctly predict more datasets than DCL for the complex networks, and DCL correctly predicted only a few more datasets than CPC for these networks, there is no significant difference in performance between these three algorithms for this network type. However, when we use a more continuous measure of accuracy, we find that all the DCL methods are able to partially predict more direct causes than FGS and CPC for the complex networks. In addition, DCL consistently had faster runtimes than the other algorithms. In the application to the real datasets, DCL identified rs6784615, located on the NISCH gene, and rs10824310, located on the PRKG1 gene, as direct causes of late-onset Alzheimer's disease (LOAD) development. In addition, DCL identified ER category as a direct predictor of breast cancer mortality within 5 years, and HER2 status as a direct predictor of 10-year breast cancer mortality. These predictors have been identified in previous studies to have a direct causal relationship with their respective phenotypes, supporting the predictive power of DCL. When the other algorithms discovered predictors from the real datasets, these predictors were either also found by DCL or could not be supported by previous studies. Our results show that DCL outperforms FGS, PC, CPC, and FCI in almost every case, demonstrating its potential to advance causal learning. Furthermore, our DCL algorithm effectively identifies direct causes in the LOAD and Metabric GWAS datasets, which indicates its potential for clinical applications. Copyright © 2016 Elsevier B.V. All rights reserved.
Dong, Yadong; Sun, Yongqi; Qin, Chao
2018-01-01
The existing protein complex detection methods can be broadly divided into two categories: unsupervised and supervised learning methods. Most of the unsupervised learning methods assume that protein complexes are in dense regions of protein-protein interaction (PPI) networks even though many true complexes are not dense subgraphs. Supervised learning methods utilize the informative properties of known complexes; they often extract features from existing complexes and then use the features to train a classification model. The trained model is used to guide the search process for new complexes. However, insufficient extracted features, noise in the PPI data and the incompleteness of complex data make the classification model imprecise. Consequently, the classification model is not sufficient for guiding the detection of complexes. Therefore, we propose a new robust score function that combines the classification model with local structural information. Based on the score function, we provide a search method that works both forwards and backwards. The results from experiments on six benchmark PPI datasets and three protein complex datasets show that our approach can achieve better performance compared with the state-of-the-art supervised, semi-supervised and unsupervised methods for protein complex detection, occasionally significantly outperforming such methods.
Aggregating today's data for tomorrow's science: a geological use case
NASA Astrophysics Data System (ADS)
Glaves, H.; Kingdon, A.; Nayembil, M.; Baker, G.
2016-12-01
Geoscience data are made up of diverse and complex smaller datasets that, when aggregated together, build towards what is recognised as 'big data'. The British Geological Survey (BGS), which acts as a repository for all subsurface data from the United Kingdom, has been collating these disparate small datasets, accumulated from the activities of a large number of geoscientists over many years. Recently this picture has been further complicated by the addition of new data sources such as near real-time sensor data, and industry or community data that is increasingly delivered via automatic donations. Many of these datasets have been aggregated in relational databases to form larger ones that are used to address a variety of issues ranging from development of national infrastructure to disaster response. These complex domain-specific SQL databases deliver effective data management using normalised subject-based database designs in a secure environment. However, the isolated subject-oriented design of these systems inhibits efficient cross-domain querying of the datasets. Additionally, the tools provided often do not enable effective data discovery, as they have problems resolving the complex underlying normalised structures. Recent requirements to understand sub-surface geology in three dimensions have led BGS to develop new data systems. One such solution is PropBase, which delivers a generic denormalised data structure within an RDBMS to store geological property data. PropBase facilitates rapid and standardised data discovery and access, incorporating 2D and 3D physical and chemical property data with associated metadata. It also provides a dedicated web interface to deliver complex multiple datasets from a single database in standardised common output formats (e.g. CSV, GIS shape files) without the need for complex data conditioning. PropBase facilitates new scientific research, previously considered impractical, by enabling property data searches across multiple databases. Using the PropBase exemplar, this presentation illustrates how BGS has developed systems for aggregating 'small' datasets to create the 'big data' necessary for the data analytics, mining, processing and visualisation needed for future geoscientific research.
Ramdani, Sofiane; Bonnet, Vincent; Tallon, Guillaume; Lagarde, Julien; Bernard, Pierre Louis; Blain, Hubert
2016-08-01
Entropy measures are often used to quantify the regularity of postural sway time series. Recent methodological developments have provided both multivariate and multiscale approaches allowing the extraction of complexity features from physiological signals; see "Dynamical complexity of human responses: A multivariate data-adaptive framework," Bulletin of the Polish Academy of Sciences: Technical Sciences, vol. 60, p. 433, 2012. The resulting entropy measures are good candidates for the analysis of bivariate postural sway signals exhibiting nonstationarity and multiscale properties. These methods depend on several input parameters, such as the embedding parameters. Using two datasets collected from institutionalized frail older adults, we numerically investigate the behavior of a recent multivariate and multiscale entropy estimator; see "Multivariate multiscale entropy: A tool for complexity analysis of multichannel data," Physical Review E, vol. 84, p. 061918, 2011. We propose criteria for the selection of the input parameters. Using these optimal parameters, we statistically compare the multivariate and multiscale entropy values of postural sway data of non-faller subjects to those of fallers. The two groups are discriminated by the resulting measures over multiple time scales. We also demonstrate that the typical parameter settings proposed in the literature lead to entropy measures that do not distinguish the two groups. This last result confirms the importance of selecting appropriate input parameters.
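A univariate, single-channel sketch of the entropy machinery involved: sample entropy plus coarse-graining across scales. The multivariate estimator used in the paper embeds all channels jointly, which this simplification omits, and the m/r defaults below are merely conventional.

    import numpy as np

    def sample_entropy(x, m=2, r=0.2):
        """SampEn(m, r): -log of the conditional probability that sequences
        matching for m points (within tolerance r*std) also match for m+1."""
        x = np.asarray(x, dtype=float)
        tol = r * x.std()
        def matches(mm):
            emb = np.lib.stride_tricks.sliding_window_view(x, mm)
            d = np.max(np.abs(emb[:, None] - emb[None, :]), axis=-1)
            return np.sum(d <= tol) - len(emb)     # exclude self-matches
        return -np.log(matches(m + 1) / matches(m))

    def multiscale(x, scales):
        """Coarse-grain (non-overlapping means), then SampEn at each scale."""
        return [sample_entropy(x[:len(x) // s * s].reshape(-1, s).mean(axis=1))
                for s in scales]

    x = np.sin(np.linspace(0, 40 * np.pi, 1000)) + 0.2 * np.random.randn(1000)
    print(multiscale(x, scales=[1, 2, 4, 8]))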
NASA Astrophysics Data System (ADS)
Nesbit, P. R.; Hugenholtz, C.; Durkin, P.; Hubbard, S. M.; Kucharczyk, M.; Barchyn, T.
2016-12-01
Remote sensing and digital mapping have started to revolutionize geologic mapping in recent years, given their potential to provide high-resolution 3D models of outcrops that assist interpretation and visualization and yield accurate measurements of inaccessible areas. However, in stratigraphic mapping applications in complex terrain, it is difficult to acquire sufficiently detailed information over wide areas with conventional techniques. We demonstrate the potential of a UAV and Structure from Motion (SfM) photogrammetric approach for improving 3D stratigraphic mapping within a complex badland topography. Our case study, in Dinosaur Provincial Park (Alberta, Canada), maps late Cretaceous fluvial meander belt deposits of the Dinosaur Park Formation amidst a succession of steeply sloping hills and abundant drainages, which creates a challenge for stratigraphic mapping. The UAV-SfM dataset (2 cm spatial resolution) is compared directly with a combined satellite and aerial LiDAR dataset (30 cm spatial resolution) to reveal the advantages and limitations of each before presenting a unique workflow that utilizes the dense point cloud from the UAV-SfM dataset for analysis. The UAV-SfM dense point cloud minimizes distortion, preserves 3D structure, and records an RGB attribute, adding potential value in future studies. The proposed UAV-SfM workflow allows for high-spatial-resolution remote sensing of stratigraphy in complex topographic environments. This extended capability can add value to field observations and has the potential to be integrated with subsurface petroleum models.
NASA Astrophysics Data System (ADS)
El Alem, A.; Chokmani, K.; Laurion, I.; El Adlouni, S.
2013-12-01
Occurrence and extent of harmful algal blooms (HABs) have increased in inland water bodies around the world. The appearance of these blooms reflects the advanced state of eutrophication of several aquatic systems caused by urban, agricultural, and industrial development. Algal blooms, especially those of cyanobacterial origin, can produce and release toxins, threatening human and animal health, the quality of drinking water, and recreational water bodies. Conventional monitoring networks, based on infrequent sampling at a few fixed monitoring stations, cannot provide the information needed, as HABs are spatially and temporally heterogeneous. Remote sensing represents an interesting alternative for providing the required spatial and temporal coverage. The usefulness of airborne and satellite remote sensing data to detect HABs was demonstrated three decades ago, and since then several empirical and semi-empirical models using satellite imagery have been developed to estimate chlorophyll-a concentration [Chl-a] as a proxy for detecting bloom proliferation. However, most of those models present a weakness generally linked to the range of [Chl-a] to be estimated: models originally calibrated for high [Chl-a] fail to estimate low concentrations and vice versa. In this study, an adaptive model to estimate [Chl-a] over a wide range of concentrations is developed for optically complex inland water bodies, based on a combination of water spectral response classification and three semi-empirical algorithms developed using multivariate regression. Three distinct water types (low, medium, and high [Chl-a]) are first identified by applying the Classification and Regression Tree (CART) method to remote sensing reflectance over a dataset of 44 [Chl-a] samples collected from lakes across the Quebec province. Based on this water classification, a specific multivariate model for each water type is developed using the same dataset and MODIS data at 250-m spatial resolution. By pre-clustering inland water bodies, the results were very good: the determination coefficients and the relative RMSE of the cross-validation were 0.99, 0.98 and 0.95 and 0.5%, 8% and 17% for high, medium, and low [Chl-a], respectively. The adaptive model also reached a global success rate of 92% on an independent, semi-qualitative [Chl-a] dataset collected over more than twenty inland water bodies across the Quebec province for the years 2009 and 2010.
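A toy sketch of the two-stage adaptive idea: a CART classifier assigns a water type from the spectral response, then a type-specific regression (linear here, multivariate in the study) maps reflectance to [Chl-a]. All data and thresholds below are synthetic.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    refl = rng.random(size=(300, 4))            # toy MODIS 250-m band reflectances
    chla = np.exp(3 * refl[:, 0]) + rng.normal(0, 0.1, 300)
    water_type = np.digitize(chla, np.quantile(chla, [0.33, 0.66]))  # low/med/high

    # Step 1: CART assigns each sample a water type from its spectral response.
    cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(refl, water_type)

    # Step 2: a separate regression maps reflectance to [Chl-a] per water type.
    regs = {t: LinearRegression().fit(refl[water_type == t], chla[water_type == t])
            for t in np.unique(water_type)}

    def predict(r):
        t = cart.predict(r)[0]
        return regs[t].predict(r)[0]

    print(predict(refl[:1]))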
Climatic Droughts and the Impacts on Crop Yields in Northern India during the Past Century
NASA Astrophysics Data System (ADS)
Ge, Y.; Cai, X.; Zhu, T.
2014-12-01
Drought has recently become an increasingly severe threat to water and food security. This study presents a novel method to calculate the return period of drought, treating drought as an event characterized by expected inter-arrival time, duration, severity and peak intensity. Copulas, which model multivariate probability distributions, have recently been used to handle strongly correlated variables when analyzing complex hydrologic phenomena. This study assesses drought conditions at 8 sites in Northern India over the past century using the Palmer Drought Severity Index (PDSI) from the two latest datasets, Dai (2011, 2013) and Sheffield et al. (2012), which reached conflicting conclusions about the global average drought trend. Our results include changes in the severity, intensity and duration of drought events during the past century and the impact of drought conditions on crop yields in the region. The drought variables are found to be highly correlated, so a copula joint distribution enables the estimation of multivariate return periods. Based on Dai's dataset from 1900 to 2012, for a fixed return period the severity and duration are lower in the period before 1955 at sites close to the Indus basin (site 1) or near the coast of the Indian Ocean (Bay of Bengal) (site 8), while they are higher in the period after 1955 at the other inland sites (sites 3-7) (e.g., severity in Fig. 1). Projections based on two models (IPCC AR4 and AR5) in Dai (2011, 2013) suggest lower severity and shorter duration for long-return-period droughts (e.g., the 100-year drought), but higher values for short-return-period droughts (e.g., the 2-year drought). Drought can bring nonlinear responses and unexpected losses in agricultural systems, so prediction and management are essential. Therefore, for years with extreme drought conditions, an impact assessment of drought on yields of corn, barley, wheat and sorghum will also be conducted by correlating crop yields with drought conditions during the corresponding growing seasons. References: Dai, A., J. Geophys. Res., 116, D12115 (2011); Dai, A., Nature Climate Change, 3, 52-58 (2013); Sheffield, J., Wood, E. F., Roderick, M. L., Nature, 491, 435-438 (2012). Fig. 1 (caption): Return period for severity from 1900 to 1954 (green), from 1955 to 2012 (red), and from 2013 to 2099 (black for AR4, blue for AR5), for each of the 8 sites.
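Assuming a Gumbel copula for illustration (the abstract does not name its copula family), the joint "AND" return period of duration and severity follows directly from the copula CDF; all numbers below are hypothetical.

    import numpy as np

    def gumbel_copula(u, v, theta):
        """Gumbel copula CDF C(u, v); theta >= 1 controls upper-tail dependence."""
        return np.exp(-(((-np.log(u)) ** theta
                         + (-np.log(v)) ** theta) ** (1.0 / theta)))

    def and_return_period(u, v, theta, mu):
        """Return period of 'duration AND severity both exceeded': u, v are the
        marginal CDF values of the event; mu is the mean drought inter-arrival
        time in years.  T = mu / P(D > d, S > s)."""
        return mu / (1 - u - v + gumbel_copula(u, v, theta))

    # an event at the 90th percentile of both duration and severity, theta = 2
    # (moderate dependence), droughts arriving every 2.5 years on average:
    print(and_return_period(0.9, 0.9, theta=2.0, mu=2.5))   # ~ 40 years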
Multiple imputation for handling missing outcome data when estimating the relative risk.
Sullivan, Thomas R; Lee, Katherine J; Ryan, Philip; Salter, Amy B
2017-09-06
Multiple imputation is a popular approach to handling missing data in medical research, yet little is known about its applicability for estimating the relative risk. Standard methods for imputing incomplete binary outcomes involve logistic regression or an assumption of multivariate normality, whereas relative risks are typically estimated using log binomial models. It is unclear whether misspecification of the imputation model in this setting could lead to biased parameter estimates. Using simulated data, we evaluated the performance of multiple imputation for handling missing data prior to estimating adjusted relative risks from a correctly specified multivariable log binomial model. We considered an arbitrary pattern of missing data in both outcome and exposure variables, with missing data induced under missing-at-random mechanisms. Focusing on standard model-based methods of multiple imputation, missing data were imputed using multivariate normal imputation or fully conditional specification with a logistic imputation model for the outcome. Multivariate normal imputation performed poorly in the simulation study, consistently producing estimates of the relative risk that were biased towards the null. Despite outperforming multivariate normal imputation, fully conditional specification also produced somewhat biased estimates, with greater bias observed for higher outcome prevalences and larger relative risks. Deleting imputed outcomes from the analysis datasets did not improve the performance of fully conditional specification. Both multivariate normal imputation and fully conditional specification produced biased estimates of the relative risk, presumably since both use a misspecified imputation model. Based on the simulation results, we recommend that researchers use fully conditional specification rather than multivariate normal imputation and retain imputed outcomes in the analysis when estimating relative risks. However, fully conditional specification is not without shortcomings, and further research is needed to identify optimal approaches for relative risk estimation within the multiple imputation framework.
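Two ingredients of the recommended workflow can be sketched directly: a fully-conditional-style stochastic imputation of a binary outcome from a logistic model, and Rubin's rules for pooling estimates (e.g., log relative risks) across imputed datasets. A full FCS implementation would also draw the model parameters and cycle over all incomplete variables; everything here is a simplified, hypothetical illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def impute_binary_outcome(X, y, miss, rng):
        """Draw imputations for a binary outcome from a logistic model fitted
        to the complete cases (the FCS-style imputation model recommended
        above); a proper FCS run would also perturb the model parameters."""
        m = LogisticRegression(max_iter=1000).fit(X[~miss], y[~miss])
        y = y.copy()
        y[miss] = rng.binomial(1, m.predict_proba(X[miss])[:, 1])
        return y

    def rubin_pool(estimates, variances):
        """Rubin's rules across m imputed datasets (apply to log relative
        risks, then exponentiate the pooled estimate)."""
        q = np.mean(estimates)
        b = np.var(estimates, ddof=1)                          # between-imputation
        t = np.mean(variances) + (1 + 1 / len(estimates)) * b  # total variance
        return q, np.sqrt(t)                                   # pooled est., SE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 3))
    y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
    miss = rng.random(400) < 0.3                               # 30% outcomes missing
    imputed = [impute_binary_outcome(X, y, miss, rng) for _ in range(20)]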
Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.
Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi
2015-09-01
Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world. It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the conduct of more complex research. Copyright © 2015 Elsevier Ltd and National Safety Council. All rights reserved.
Varekar, Vikas; Karmakar, Subhankar; Jha, Ramakar
2016-02-01
The design of surface water quality sampling locations is a crucial decision-making process for the rationalization of a monitoring network. The quantity, quality, and types of available data (watershed characteristics and water quality data) may affect the selection of an appropriate design methodology. The modified Sanders approach and multivariate statistical techniques [particularly factor analysis (FA)/principal component analysis (PCA)] are well-accepted and widely used techniques for the design of sampling locations. However, their performance may vary significantly with the quantity, quality, and types of available data. In this paper, an attempt has been made to evaluate the performance of these techniques while accounting for the effect of seasonal variation, under a situation of limited water quality data but extensive watershed characteristics information, as continuous and consistent river water quality data are usually difficult to obtain, whereas watershed information can be made available through the application of geospatial techniques. A case study of the Kali River, Western Uttar Pradesh, India, is selected for the analysis. Monitoring was carried out at 16 sampling locations. The discrete and diffuse pollution loads at different sampling sites were estimated and accounted for using the modified Sanders approach, whereas the monitored physical and chemical water quality parameters were utilized as inputs for FA/PCA. The optimum numbers of sampling locations designed for the monsoon and non-monsoon seasons by the modified Sanders approach are eight and seven, while those for FA/PCA are eleven and nine, respectively. Both techniques yielded little variation in the number and locations of the designed sampling sites, which indicates the stability of the results. A geospatial analysis has also been carried out to check the significance of the designed sampling locations with respect to the river basin characteristics and land use of the study area. Both methods are efficient; however, the modified Sanders approach outperforms FA/PCA when limited water quality but extensive watershed information is available: because these multivariate statistical approaches are data-driven, the limited water quality dataset prevents FA/PCA from identifying monitoring locations with higher variation. The priority/hierarchy and number of sampling sites designed by the modified Sanders approach are well justified by the land use practices and observed river basin characteristics of the study area.
Ratio test statistic for sphericity of complex multivariate normal distribution.
Fang, C.; Krishnaiah, P. R.; Nagarsenker, B. N.
1981-08-01
[Abstract fragmentary in source; the recoverable text indicates a technical report on likelihood ratio tests for sphericity of the complex multivariate normal distribution, with applications to time series, approximating the null distribution of a power of the likelihood ratio statistic, citing Krishnaiah (1976) and Krishnaiah, Lee and Chang (1976).]
Merging K-means with hierarchical clustering for identifying general-shaped groups.
Peterson, Anna D; Ghosh, Arka P; Maitra, Ranjan
2018-01-01
Clustering partitions a dataset such that observations placed together in a group are similar but different from those in other groups. Hierarchical and K-means clustering are two approaches but have different strengths and weaknesses. For instance, hierarchical clustering identifies groups in a tree-like structure but suffers from computational complexity in large datasets while K-means clustering is efficient but designed to identify homogeneous spherically-shaped clusters. We present a hybrid non-parametric clustering approach that amalgamates the two methods to identify general-shaped clusters and that can be applied to larger datasets. Specifically, we first partition the dataset into spherical groups using K-means. We next merge these groups using hierarchical methods with a data-driven distance measure as a stopping criterion. Our proposal has the potential to reveal groups with general shapes and structure in a dataset. We demonstrate good performance on several simulated and real datasets.
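A minimal Python sketch of the hybrid idea (assumed parameters, not the authors' code: the paper uses a data-driven merging criterion, whereas this toy fixes the final number of groups):

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)

    km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(X)  # spherical subgroups
    merge = AgglomerativeClustering(n_clusters=2, linkage="single")
    merge.fit(km.cluster_centers_)              # hierarchically merge the centers
    labels = merge.labels_[km.labels_]          # each point inherits its center's group

The relabeled points recover the two half-moons, a general-shaped structure that K-means alone with k = 2 cannot represent.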
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Kevin J.; Wright, Bob W.; Jarman, Kristin H.
2003-05-09
A rapid retention time alignment algorithm was developed as a preprocessing utility to be used prior to chemometric analysis of large datasets of diesel fuel gas chromatographic profiles. Retention time variation from chromatogram to chromatogram has been a significant impediment to the use of chemometric techniques in the analysis of chromatographic data, owing to the inability of current multivariate techniques to correctly model information that shifts from variable to variable within a dataset. The algorithm developed is shown to increase the efficacy of pattern recognition methods applied to a set of diesel fuel chromatograms by retaining chemical selectivity while reducing chromatogram-to-chromatogram retention time variation, and to do so on a time scale that makes the analysis of large sets of chromatographic data practical.
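One simple form of such a preprocessing step is shift alignment by cross-correlation against a reference chromatogram; the Python sketch below (whole-signal shifts only) is a hedged illustration, since the report's actual algorithm is more elaborate:

    import numpy as np

    def align_to_reference(chrom, ref, max_shift=50):
        """Shift `chrom` by the lag (in scans) that maximizes its
        cross-correlation with `ref`. np.roll wraps around, which is
        acceptable for small shifts on baseline-dominated signal edges."""
        lags = np.arange(-max_shift, max_shift + 1)
        scores = [np.dot(np.roll(chrom, int(k)), ref) for k in lags]
        return np.roll(chrom, int(lags[int(np.argmax(scores))]))

    t = np.linspace(0, 10, 1000)
    ref = np.exp(-(t - 4.0) ** 2 / 0.02)        # reference peak
    drifted = np.exp(-(t - 4.3) ** 2 / 0.02)    # same peak, drifted retention time
    aligned = align_to_reference(drifted, ref)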
A three-dimensional multivariate representation of atmospheric variability
NASA Astrophysics Data System (ADS)
Žagar, Nedjeljka; Jelić, Damjan; Blaauw, Marten; Jesenko, Blaž
2016-04-01
The recently developed MODES software has been applied to ECMWF analyses and forecasts and to several reanalysis datasets to describe the global variability of the balanced and inertio-gravity (IG) circulation across many scales, considering both the mass and wind fields and the whole model depth. In particular, the IG spectrum, which has only recently become observable in global datasets, can be studied simultaneously in the mass and wind fields throughout the model depth. MODES is open-access software that performs the normal-mode function decomposition of 3D global datasets. Its application to the ERA-Interim dataset reveals several aspects of the large-scale circulation after it has been partitioned into the linearly balanced and IG components. The global energy distribution is dominated by the balanced energy, while the IG modes contribute around 8% of the total wave energy. On subsynoptic scales, however, IG energy dominates, and it is associated with the main features of tropical variability on all scales. The presented energy distribution and features of the zonally averaged and equatorial circulation provide a reference for the intercomparison of several reanalysis datasets and for the validation of climate models. Features of the global IG circulation are compared in the ERA-Interim, MERRA and JRA reanalysis datasets and in several CMIP5 models. Since October 2014, the operational medium-range forecasts of the European Centre for Medium-Range Weather Forecasts (ECMWF) have been analyzed by MODES daily, and an online archive of all outputs is available at http://meteo.fmf.uni-lj.si/MODES. New outputs are made available daily, based on the 00 UTC run and the subsequent forecasts at 12-hour steps up to 240 hours. In addition to the energy spectra and the horizontal circulation on selected levels for the balanced and IG components, the equatorial Kelvin waves are presented in time and space as the most energetic tropical IG modes, propagating vertically and along the equator from their main generation regions in the upper troposphere over the Indian and Pacific Oceans. The validation of 10-day ECMWF forecasts against analyses in modal space suggests a lack of variability in the tropics in the medium range. References: Žagar, N. et al., 2015: Normal-mode function representation of global 3-D data sets: open-access software for the atmospheric research community. Geosci. Model Dev., 8, 1169-1195, doi:10.5194/gmd-8-1169-2015; Žagar, N., R. Buizza, and J. Tribbia, 2015: A three-dimensional multivariate modal analysis of atmospheric predictability with application to the ECMWF ensemble. J. Atmos. Sci., 72, 4423-4444. The MODES software is available from http://meteo.fmf.uni-lj.si/MODES.
Harnessing Multivariate Statistics for Ellipsoidal Data in Structural Geology
NASA Astrophysics Data System (ADS)
Roberts, N.; Davis, J. R.; Titus, S.; Tikoff, B.
2015-12-01
Most structural geology articles do not state significance levels, report confidence intervals, or perform regressions to find trends. This is, in part, because structural data tend to include directions, orientations, ellipsoids, and tensors, which are not treatable by elementary statistics. We describe a full procedural methodology for the statistical treatment of ellipsoidal data. We use a reconstructed dataset of deformed ooids in Maryland from Cloos (1947) to illustrate the process. Normalized ellipsoids have five degrees of freedom and can be represented by a second-order tensor. This tensor can be permuted into a five-dimensional vector that belongs to a vector space and can be treated with standard multivariate statistics. Cloos made several claims about the distribution of deformation in the South Mountain fold, Maryland, and we reexamine two particular claims using hypothesis testing: (1) octahedral shear strain increases towards the axial plane of the fold; (2) finite strain orientation varies systematically along the trend of the axial trace as it bends with the Appalachian orogen. We then test the null hypothesis that the southern segment of South Mountain is the same as the northern segment. This test illustrates the application of ellipsoidal statistics, which combine both orientation and shape. We report confidence intervals for each test and graphically display our results with novel plots. This poster illustrates the importance of statistics in structural geology, especially when working with noisy or small datasets.
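The embedding and test can be sketched in a few lines of Python; the 5-vector scaling below follows one common convention for traceless symmetric matrices and is an assumption, not necessarily the authors' exact parameterization:

    import numpy as np
    from scipy.stats import f

    def ellipsoid_to_vec5(E):
        """Map a normalized (det = 1) ellipsoid tensor to R^5 via the
        traceless matrix logarithm."""
        w, V = np.linalg.eigh(E)
        L = V @ np.diag(np.log(w)) @ V.T        # symmetric, trace ~ 0
        s = np.sqrt(2)
        return np.array([L[0, 0], L[1, 1], s * L[0, 1], s * L[0, 2], s * L[1, 2]])

    def hotelling_t2(A, B):
        """Two-sample Hotelling T^2 test on rows of A (n x p) and B (m x p)."""
        n, p = A.shape
        m = B.shape[0]
        d = A.mean(0) - B.mean(0)
        S = ((n - 1) * np.cov(A.T) + (m - 1) * np.cov(B.T)) / (n + m - 2)
        t2 = n * m / (n + m) * d @ np.linalg.solve(S, d)
        F = t2 * (n + m - p - 1) / (p * (n + m - 2))
        return t2, 1 - f.cdf(F, p, n + m - p - 1)

Applied to 5-vectors from, say, northern- and southern-segment ooid ellipsoids, hotelling_t2 returns the statistic and a p-value for the null hypothesis that the two segments share a mean ellipsoid.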
An open source multivariate framework for n-tissue segmentation with evaluation on public data.
Avants, Brian B; Tustison, Nicholas J; Wu, Jue; Cook, Philip A; Gee, James C
2011-12-01
We introduce Atropos, an ITK-based multivariate n-class open source segmentation algorithm distributed with ANTs ( http://www.picsl.upenn.edu/ANTs). The Bayesian formulation of the segmentation problem is solved using the Expectation Maximization (EM) algorithm with the modeling of the class intensities based on either parametric or non-parametric finite mixtures. Atropos is capable of incorporating spatial prior probability maps (sparse), prior label maps and/or Markov Random Field (MRF) modeling. Atropos has also been efficiently implemented to handle large quantities of possible labelings (in the experimental section, we use up to 69 classes) with a minimal memory footprint. This work describes the technical and implementation aspects of Atropos and evaluates its performance on two different ground-truth datasets. First, we use the BrainWeb dataset from Montreal Neurological Institute to evaluate three-tissue segmentation performance via (1) K-means segmentation without use of template data; (2) MRF segmentation with initialization by prior probability maps derived from a group template; (3) Prior-based segmentation with use of spatial prior probability maps derived from a group template. We also evaluate Atropos performance by using spatial priors to drive a 69-class EM segmentation problem derived from the Hammers atlas from University College London. These evaluation studies, combined with illustrative examples that exercise Atropos options, demonstrate both performance and wide applicability of this new platform-independent open source segmentation tool.
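The modeling core (K-means initialization followed by EM over Gaussian class intensities) can be illustrated with scikit-learn; this sketch omits the priors and MRF smoothing that Atropos adds, and the "image" is just simulated 1-D intensities:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    intens = np.concatenate([rng.normal(0.2, 0.05, 4000),   # CSF-like
                             rng.normal(0.5, 0.05, 4000),   # gray-matter-like
                             rng.normal(0.8, 0.05, 4000)])  # white-matter-like

    gmm = GaussianMixture(n_components=3, init_params="kmeans", random_state=0)
    labels = gmm.fit_predict(intens.reshape(-1, 1))  # hard labels from EM posteriors
    print(np.sort(gmm.means_.ravel()))               # recovered tissue means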
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhan, Xianyuan; Aziz, H. M. Abdul; Ukkusuri, Satish V.
2015-11-19
Our study investigates the multivariate Poisson-lognormal (MVPLN) model that jointly models crash frequency and severity, accounting for correlations. Ordinary univariate count models analyze crashes of each severity level separately, ignoring the correlations among severity levels. The MVPLN model can incorporate a general correlation structure and accounts for overdispersion in the data, which leads to a superior fit. However, the traditional estimation approach for the MVPLN model is computationally expensive, which often limits its use in practice. In this work, a parallel sampling scheme is introduced to improve the original Markov Chain Monte Carlo (MCMC) estimation approach of the MVPLN model, significantly reducing the model estimation time. Two MVPLN models are developed using pedestrian-vehicle crash data collected in New York City from 2002 to 2006 and highway-injury data from Washington State (five years of data, 1990 to 1994). The Deviance Information Criterion (DIC) is used to evaluate model fit. The estimation results show that the MVPLN models provide a superior fit over univariate Poisson-lognormal (PLN), univariate Poisson, and negative binomial models. Moreover, the correlations among the latent effects of different severity levels are found to be significant in both datasets, which justifies the importance of jointly modeling crash frequency and severity while accounting for correlations.
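The data-generating idea behind the MVPLN model is easy to simulate: correlated lognormal latent effects shared across severity levels drive Poisson counts, producing both overdispersion and cross-severity correlation. A numpy sketch with arbitrary illustrative parameters:

    import numpy as np

    rng = np.random.default_rng(2)
    mu = np.array([1.0, 0.3])             # log-mean crash rates: minor, severe
    Sigma = np.array([[0.5, 0.3],         # latent covariance couples severities
                      [0.3, 0.4]])
    eps = rng.multivariate_normal(np.zeros(2), Sigma, size=2000)
    lam = np.exp(mu + eps)                # lognormal rates -> overdispersion
    counts = rng.poisson(lam)             # joint minor/severe counts per site

    print(np.corrcoef(counts.T)[0, 1])    # induced correlation across severities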
MGAS: a powerful tool for multivariate gene-based genome-wide association analysis.
Van der Sluis, Sophie; Dolan, Conor V; Li, Jiang; Song, Youqiang; Sham, Pak; Posthuma, Danielle; Li, Miao-Xin
2015-04-01
Standard genome-wide association studies, testing the association between one phenotype and a large number of single nucleotide polymorphisms (SNPs), are limited in two ways: (i) traits are often multivariate, and analysis of composite scores entails loss in statistical power, and (ii) gene-based analyses may be preferred, e.g. to decrease the multiple testing problem. Here we present a new method, the multivariate gene-based association test by extended Simes procedure (MGAS), that allows gene-based testing of multivariate phenotypes in unrelated individuals. Through extensive simulation, we show that under most trait-generating genotype-phenotype models MGAS has superior statistical power to detect associated genes compared with gene-based analyses of univariate phenotypic composite scores (i.e. GATES, multiple regression) and multivariate analysis of variance (MANOVA). Re-analysis of metabolic data revealed 32 false-discovery-rate-controlled genome-wide significant genes and 12 regions harboring multiple genes; of these 44 regions, 30 were not reported in the original analysis. MGAS allows researchers to conduct their multivariate gene-based analyses efficiently, and without the loss of power that is often associated with an incorrectly specified genotype-phenotype model. MGAS is freely available in KGG v3.0 (http://statgenpro.psychiatry.hku.hk/limx/kgg/download.php). Access to the metabolic dataset can be requested at dbGaP (https://dbgap.ncbi.nlm.nih.gov/). The R simulation code is available from http://ctglab.nl/people/sophie_van_der_sluis. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
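At the heart of MGAS is a Simes-type combination of per-SNP p-values into a gene-level p-value. The Python sketch below implements only the plain Simes rule; MGAS additionally corrects for correlations among SNPs and phenotypes through effective numbers of tests, which is omitted here:

    import numpy as np

    def simes(pvals):
        """Plain Simes gene-level p-value: min over j of m * p_(j) / j."""
        p = np.sort(np.asarray(pvals, dtype=float))
        m = len(p)
        return float(np.min(m * p / np.arange(1, m + 1)))

    print(simes([0.002, 0.04, 0.3, 0.7]))   # -> 0.008 for this toy gene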
NASA Technical Reports Server (NTRS)
Kattan, Michael W.; Hess, Kenneth R.
1998-01-01
New computationally intensive tools for medical survival analyses include recursive partitioning (also called CART) and artificial neural networks. A remaining challenge is to better understand the behavior of these techniques, in an effort to know when they will be effective tools. Theoretically, they may overcome limitations of the traditional multivariable survival technique, the Cox proportional hazards regression model. Experiments were designed to test whether the new tools would, in practice, overcome these limitations. Two datasets in which theory suggests CART and the neural network should outperform the Cox model were selected. The first was a published leukemia dataset manipulated to have a strong interaction that CART should detect. The second was a published cirrhosis dataset with pronounced nonlinear effects that a neural network should fit. Repeated sampling of 50 training and testing subsets was applied to each technique. The concordance index C was calculated as a measure of predictive accuracy by each technique on the testing dataset. In the interaction dataset, CART outperformed Cox (P < 0.05), with a C improvement of 0.1 (95% CI, 0.08 to 0.12). In the nonlinear dataset, the neural network outperformed the Cox model (P < 0.05), but by a very slight amount (0.015). As predicted by theory, CART and the neural network were able to overcome limitations of the Cox model. Experiments like these are important to increase our understanding of when one of these new techniques will outperform the standard Cox model. Further research is necessary to predict which technique will do best a priori and to assess the magnitude of superiority.
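The concordance index C used as the accuracy measure can be computed directly; the following small Python version handles right-censoring in the usual way (only pairs where the earlier time is an observed event are usable) and ignores ties for brevity:

    import numpy as np

    def concordance_index(time, event, risk):
        """Fraction of usable pairs in which the subject with the shorter
        observed survival also has the higher predicted risk."""
        num = den = 0
        for i in range(len(time)):
            for j in range(len(time)):
                if event[i] == 1 and time[i] < time[j]:   # usable pair
                    den += 1
                    num += risk[i] > risk[j]
        return num / den

    t = np.array([2.0, 5.0, 7.0, 9.0])
    e = np.array([1, 1, 0, 1])                  # 0 = censored
    print(concordance_index(t, e, np.array([0.9, 0.6, 0.4, 0.2])))   # -> 1.0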
MEMD-enhanced multivariate fuzzy entropy for the evaluation of complexity in biomedical signals.
Azami, Hamed; Smith, Keith; Escudero, Javier
2016-08-01
Multivariate multiscale entropy (mvMSE) has been proposed as a combination of the coarse-graining process and multivariate sample entropy (mvSE) to quantify the irregularity of multivariate signals. However, both the coarse-graining process and mvSE may not be reliable for short signals. Although the coarse-graining process can be replaced with multivariate empirical mode decomposition (MEMD), the relative instability of mvSE for short signals remains a problem. Here, we address this issue by proposing the multivariate fuzzy entropy (mvFE) with a new fuzzy membership function. The results using white Gaussian noise show that the mvFE leads to more reliable and stable results, especially for short signals, in comparison with mvSE. Accordingly, we propose MEMD-enhanced mvFE to quantify the complexity of signals. The characteristics of brain regions influenced by partial epilepsy are investigated by focal and non-focal electroencephalogram (EEG) time series. In this sense, the proposed MEMD-enhanced mvFE and mvSE are employed to discriminate focal EEG signals from non-focal ones. The results demonstrate the MEMD-enhanced mvFE values have a smaller coefficient of variation in comparison with those obtained by the MEMD-enhanced mvSE, even for long signals. The results also show that the MEMD-enhanced mvFE has better performance to quantify focal and non-focal signals compared with multivariate multiscale permutation entropy.
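The flavor of the fuzzy-entropy idea can be conveyed with a univariate Python sketch (the paper's mvFE builds composite delay vectors across channels, which is omitted here; the exponential membership function and r = 0.15 are common but assumed choices):

    import numpy as np

    def fuzzy_entropy(x, m=2, r=0.15):
        """FuzzyEn: -log of the ratio of average fuzzy similarities of
        templates of length m+1 versus length m."""
        x = (x - x.mean()) / x.std()
        def phi(mm):
            templ = np.array([x[i:i + mm] for i in range(len(x) - mm)])
            templ -= templ.mean(axis=1, keepdims=True)  # remove local baseline
            d = np.max(np.abs(templ[:, None, :] - templ[None, :, :]), axis=2)
            sim = np.exp(-(d ** 2) / r)                 # fuzzy membership, not 0/1
            np.fill_diagonal(sim, 0.0)                  # exclude self-matches
            return sim.sum() / (len(templ) * (len(templ) - 1))
        return -np.log(phi(m + 1) / phi(m))

    rng = np.random.default_rng(3)
    print(fuzzy_entropy(rng.standard_normal(500)))  # higher for irregular signals

The soft membership is what stabilizes the estimate on short records: unlike the hard threshold in sample entropy, small changes in distances change the similarity counts smoothly.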
The Statistical Consulting Center for Astronomy (SCCA)
NASA Technical Reports Server (NTRS)
Akritas, Michael
2001-01-01
The process by which raw astronomical data acquisition is transformed into scientifically meaningful results and interpretation typically involves many statistical steps. Traditional astronomy limits itself to a narrow range of old and familiar statistical methods: means and standard deviations; least-squares methods like χ² minimization; and simple nonparametric procedures such as the Kolmogorov-Smirnov tests. These tools are often inadequate for the complex problems and datasets under investigation, and recent years have witnessed an increased usage of maximum-likelihood, survival analysis, multivariate analysis, wavelet and advanced time-series methods. The Statistical Consulting Center for Astronomy (SCCA) assisted astronomers with the use of sophisticated tools and with matching these tools to specific problems. The SCCA operated with two professors of statistics and a professor of astronomy working together. Questions were received by e-mail and discussed in detail with the questioner. Summaries of those questions and answers leading to new approaches were posted on the Web (www.state.psu.edu/mga/SCCA). In addition to serving individual astronomers, the SCCA established a Web site for general use that provides hypertext links to selected on-line public-domain statistical software and services. The StatCodes site (www.astro.psu.edu/statcodes) provides over 200 links in the areas of: Bayesian statistics; censored and truncated data; correlation and regression; density estimation and smoothing; general statistics packages and information; image analysis; interactive Web tools; multivariate analysis; multivariate clustering and classification; nonparametric analysis; software written by astronomers; spatial statistics; statistical distributions; time series analysis; and visualization tools. StatCodes has received a remarkably high and constant hit rate of 250 hits/week (over 10,000/year) since its inception in mid-1997. It is of interest to scientists both within and outside of astronomy. The most popular sections are multivariate techniques, image analysis, and time series analysis. Hundreds of copies of the ASURV, SLOPES and CENS-TAU codes developed by SCCA scientists were also downloaded from the StatCodes site. In addition to formal SCCA duties, SCCA scientists continued a variety of related activities in astrostatistics, including refereeing statistically oriented papers submitted to the Astrophysical Journal, giving talks at meetings, including Feigelson's talk to science journalists entitled "The reemergence of astrostatistics" at the American Association for the Advancement of Science meeting, and publishing papers of astrostatistical content.
Wolpert, Miranda; Rutter, Harry
2018-06-13
The use of routinely collected data that are flawed and limited to inform service development in healthcare systems needs to be considered, both theoretically and practically, given the reality that in many areas of healthcare only poor-quality data are available for use in complex adaptive systems. Data may be compromised in a range of ways. They may be flawed, due to missing or erroneously recorded entries; uncertain, due to differences in how data items are rated or conceptualised; proximate, in that data items are a proxy for key issues of concern; and sparse, in that a low volume of cases within key subgroups may limit the possibility of statistical inference. The term 'FUPS' is proposed to describe these flawed, uncertain, proximate and sparse datasets. Many of the systems that seek to use FUPS data may be characterised as dynamic and complex, involving a wide range of agents whose actions impact on each other in reverberating ways, leading to feedback and adaptation. The literature on the use of routinely collected data in healthcare is often implicitly premised on the availability of high-quality data used in complicated, but not necessarily complex, systems. This paper presents an example of the use of a FUPS dataset in the complex system of child mental healthcare. The dataset comprised routinely collected data from services that were part of a national service transformation initiative in child mental health from 2011 to 2015. The paper explores the use of this FUPS dataset to support meaningful dialogue between key stakeholders, including service providers, funders and users, in relation to service outcomes. There is a particular focus on the potential for service improvement and learning. The issues raised and the principles for practice suggested have relevance for other health communities that similarly face the dilemma of how to address the gap between the ideal of comprehensive, clear data used in complicated but not complex contexts, and the reality of FUPS data in the context of complexity.
NASA Astrophysics Data System (ADS)
Delgado, Juan A.; Altuve, Miguel; Nabhan Homsi, Masun
2015-12-01
This paper introduces a robust method based on the Support Vector Machine (SVM) algorithm to detect the presence of fetal QRS (fQRS) complexes in electrocardiogram (ECG) recordings provided by the PhysioNet/CinC Challenge 2013. ECG signals are first segmented into contiguous frames of 250 ms duration and then labeled in six classes. Fetal segments are tagged according to the position of the fQRS complex within each one. Next, segment feature extraction and dimensionality reduction are performed by applying principal component analysis to Haar-wavelet transform coefficients. After that, two sub-datasets are generated to separate representative segments from atypical ones. The imbalanced-class problem is dealt with by applying sampling without replacement on each sub-dataset. Finally, two SVMs are trained and cross-validated using the two balanced sub-datasets separately. Experimental results show that the proposed approach achieves high performance in fetal heartbeat detection, reaching up to 90.95% accuracy, 92.16% sensitivity, 88.51% specificity, 94.13% positive predictive value and 84.96% negative predictive value. A comparative study is also carried out to show the performance of two other machine learning algorithms for fQRS complex estimation, K-nearest neighbors and Bayesian networks.
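A skeleton of that processing chain in Python (placeholder signals and labels; pywt's Haar transform and scikit-learn stand in for the authors' implementations, and the six-class labeling is collapsed to binary for brevity):

    import numpy as np
    import pywt
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    fs = 1000                                  # assumed sampling rate, Hz
    frame_len = int(0.250 * fs)                # 250 ms frames

    rng = np.random.default_rng(4)
    frames = rng.standard_normal((600, frame_len))   # placeholder ECG segments
    labels = rng.integers(0, 2, 600)                 # 1 = frame contains an fQRS

    feats = np.array([np.concatenate(pywt.wavedec(f, "haar", level=4))
                      for f in frames])              # Haar coefficients per frame
    feats = PCA(n_components=20).fit_transform(feats)
    clf = SVC(kernel="rbf").fit(feats[:500], labels[:500])
    print(clf.score(feats[500:], labels[500:]))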
NASA Astrophysics Data System (ADS)
Sheykhizadeh, Saheleh; Naseri, Abdolhossein
2018-04-01
Variable selection plays a key role in classification and multivariate calibration. Variable selection methods aim to choose, from a large pool of available predictors, a set of variables relevant to estimating analyte concentrations or to achieving better classification results. Many variable selection techniques have been introduced; among them, those based on swarm intelligence optimization have attracted particular attention over the last few decades, since they are mainly inspired by nature. In this work, a simple new variable selection algorithm is proposed based on the invasive weed optimization (IWO) concept. IWO is a bio-inspired metaheuristic mimicking the ecological behavior of weeds in colonizing and finding an appropriate place for growth and reproduction; it has been shown to be very adaptive and robust to environmental changes. In this paper, the first application of IWO, as a very simple and powerful method, to variable selection is reported using different experimental datasets, including FTIR and NIR data, to undertake classification and multivariate calibration tasks. Accordingly, invasive weed optimization-linear discriminant analysis (IWO-LDA) and invasive weed optimization-partial least squares (IWO-PLS) are introduced for multivariate classification and calibration, respectively.
Franco-Pedroso, Javier; Ramos, Daniel; Gonzalez-Rodriguez, Joaquin
2016-01-01
In forensic science, trace evidence found at a crime scene and on a suspect has to be evaluated from the measurements performed on it, usually in the form of multivariate data (for example, several chemical compounds or physical characteristics). In order to assess the strength of that evidence, the likelihood ratio framework is being increasingly adopted. Several methods have been derived to obtain likelihood ratios directly from univariate or multivariate data by modelling both the variation appearing between observations (or features) coming from the same source (within-source variation) and that appearing between observations coming from different sources (between-source variation). In the widely used multivariate kernel likelihood ratio, the within-source distribution is assumed to be normally distributed and constant among different sources, and the between-source variation is modelled through a kernel density function (KDF). In order to better fit the observed distribution of the between-source variation, this paper presents a different approach in which a Gaussian mixture model (GMM) is used instead of a KDF. As will be shown, this approach provides better-calibrated likelihood ratios, as measured by the log-likelihood-ratio cost (Cllr), in experiments performed on freely available forensic datasets involving different types of trace evidence: inks, glass fragments and car paints. PMID:26901680
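The modeling difference can be seen on held-out source means: a toy Python comparison of the two between-source density models (simulated stand-ins for, e.g., glass-fragment features; note that scikit-learn's GaussianMixture.score is per-sample while KernelDensity.score is a total):

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(5)
    # Between-source population with two "families" of sources -> multimodal
    means = np.vstack([rng.normal(0, 1.0, (150, 3)),
                       rng.normal(4, 0.5, (150, 3))])
    train, test = means[:250], means[250:]

    gmm = GaussianMixture(n_components=2, random_state=0).fit(train)
    kde = KernelDensity(bandwidth=0.5).fit(train)
    print("GMM held-out log-density:", gmm.score(test))
    print("KDE held-out log-density:", kde.score(test) / len(test))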
Multivariate analysis and visualization of soil quality data for no-till systems.
Villamil, M B; Miguez, F E; Bollero, G A
2008-01-01
To reflect the multidimensionality of the soil quality concept, we propose the use of data visualization as a tool for exploratory data analysis, model building, and diagnostics. Our objective was to establish the best edaphic indicators for assessing soil quality in four no-till systems, with regard to functioning as a medium for crop production and nutrient cycling, across two Illinois locations. The compared situations were no-till corn-soybean rotations including either winter fallowing (C/S) or winter cover crops (WCC) of rye (Secale cereale; C-R/S-R), hairy vetch (Vicia villosa; C-R/S-V), or their mixture (C-R/S-VR). The dataset included the variables bulk density (BD), penetration resistance (PR), water aggregate stability (WAS), soil reaction (pH), and the contents of soil organic matter (SOM), total nitrogen (TN), soil nitrates (NO(3)-N), and available phosphorus (P). Interactive data visualization along with canonical discriminant analysis (CDA) allowed us to show that WAS, BD, and the contents of P, TN, and SOM have the greatest potential as soil quality indicators in no-till systems in Illinois. It was more difficult to discriminate among the WCC rotations than to separate them from C/S, considerably inflating the error rate associated with CDA. We predict that observations of no-till C/S will be classified correctly 51% of the time, while observations of no-till WCC rotations will be classified correctly 74% of the time. The high error rates in CDA underscore the complexity of no-till systems and the need for more long-term studies with larger datasets to increase accuracy to acceptable levels.
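The discrimination step can be sketched with scikit-learn's linear discriminant analysis, a close relative of CDA (the soil variables and rotation labels below are simulated placeholders, with only the winter-fallow group made separable):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(6)
    soil = rng.standard_normal((200, 8))     # BD, PR, WAS, pH, SOM, TN, NO3-N, P
    rotation = rng.integers(0, 4, 200)       # C/S, C-R/S-R, C-R/S-V, C-R/S-VR
    soil[rotation == 0] += 0.8               # make winter fallow separable

    lda = LinearDiscriminantAnalysis(n_components=2)
    print(cross_val_score(lda, soil, rotation, cv=5).mean())  # classification accuracy
    canon = lda.fit(soil, rotation).transform(soil)           # canonical variates to plot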
Lutomski, Jennifer E; Baars, Maria A E; Boter, Han; Buurman, Bianca M; den Elzen, Wendy P J; Jansen, Aaltje P D; Kempen, Gertrudis I J M; Steunenberg, Bas; Steyerberg, Ewout W; Olde Rikkert, Marcel G M; Melis, René J F
2014-01-01
To assess the independent and combined impact of frailty, multi-morbidity, and activities of daily living (ADL) limitations on self-reported quality of life and healthcare costs in elderly people. Cross-sectional, descriptive study. Data came from The Older Persons and Informal Caregivers Minimum DataSet (TOPICS-MDS), a pooled dataset with information from 41 projects across the Netherlands from the Dutch National Care for the Elderly Programme. Frailty, multi-morbidity and ADL limitations, and the interactions between these domains, were used as predictors in regression analyses with quality of life and healthcare costs as outcome measures. Analyses were stratified by living situation (independent or care home). The directionality and magnitude of associations were assessed using linear mixed models. A total of 11,093 elderly people were interviewed. A substantial proportion of elderly people living independently reported frailty, multi-morbidity, and/or ADL limitations (56.4%, 88.3% and 41.4%, respectively), as did elderly people living in a care home (88.7%, 89.2% and 77.3%, respectively). One-third of elderly people living at home (31.9%) reported all three conditions, compared with two-thirds of elderly people living in a care home (68.3%). In the multivariable analysis, frailty had a strong impact on outcomes independently of multi-morbidity and ADL limitations. Elderly people experiencing problems across all three domains reported the poorest quality-of-life scores and the highest healthcare costs, irrespective of their living situation. Frailty, multi-morbidity and ADL limitations are complementary measurements, which together provide a more holistic understanding of health status in elderly people. A multi-dimensional approach is important in mapping the complex relationships between these measurements on the one hand and quality of life and healthcare costs on the other.
Modeling Interdependent and Periodic Real-World Action Sequences
Kurashima, Takeshi; Althoff, Tim; Leskovec, Jure
2018-01-01
Mobile health applications, including those that track activities such as exercise, sleep, and diet, are becoming widely used. Accurately predicting human actions in the real world is essential for targeted recommendations that could improve our health and for personalization of these applications. However, making such predictions is extremely difficult due to the complexities of human behavior, which consists of a large number of potential actions that vary over time, depend on each other, and are periodic. Previous work has not jointly modeled these dynamics and has largely focused on item consumption patterns instead of broader types of behaviors such as eating, commuting or exercising. In this work, we develop a novel statistical model, called TIPAS, for Time-varying, Interdependent, and Periodic Action Sequences. Our approach is based on personalized, multivariate temporal point processes that model time-varying action propensities through a mixture of Gaussian intensities. Our model captures short-term and long-term periodic interdependencies between actions through Hawkes process-based self-excitations. We evaluate our approach on two activity logging datasets comprising 12 million real-world actions (e.g., eating, sleep, and exercise) taken by 20 thousand users over 17 months. We demonstrate that our approach allows us to make successful predictions of future user actions and their timing. Specifically, TIPAS improves predictions of actions, and their timing, over existing methods across multiple datasets by up to 156%, and up to 37%, respectively. Performance improvements are particularly large for relatively rare and periodic actions such as walking and biking, improving over baselines by up to 256%. This demonstrates that explicit modeling of dependencies and periodicities in real-world behavior enables successful predictions of future actions, with implications for modeling human behavior, app personalization, and targeting of health interventions. PMID:29780977
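The self-excitation component is a Hawkes conditional intensity; here is a small Python evaluation with an exponential kernel (parameter values are illustrative, and the full TIPAS model adds time-varying Gaussian-mixture base rates and cross-action dependencies):

    import numpy as np

    def hawkes_intensity(t, events, mu=0.1, alpha=0.5, beta=1.0):
        """lambda(t) = mu + alpha * sum over past events of beta*exp(-beta*(t - t_i))."""
        past = events[events < t]
        return mu + alpha * np.sum(beta * np.exp(-beta * (t - past)))

    events = np.array([1.0, 1.2, 5.0])       # times of past actions, e.g. exercise
    print(hawkes_intensity(1.3, events))     # elevated right after a burst
    print(hawkes_intensity(4.9, events))     # decayed back toward the base rate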
NASA Technical Reports Server (NTRS)
Anderson, R. B.; Morris, R. V.; Clegg, S. M.; Bell, J. F., III; Humphries, S. D.; Wiens, R. C.
2011-01-01
The ChemCam instrument selected for the Curiosity rover is capable of remote laser-induced breakdown spectroscopy (LIBS).[1] We used a remote LIBS instrument similar to ChemCam to analyze 197 geologic slab samples and 32 pressed-powder geostandards. The slab samples are well-characterized and have been used to validate the calibration of previous instruments on Mars missions, including CRISM [2], OMEGA [3], the MER Pancam [4], Mini-TES [5], and Moessbauer [6] instruments and the Phoenix SSI [7]. The resulting dataset was used to compare multivariate methods for quantitative LIBS and to determine the effect of grain size on calculations. Three multivariate methods - partial least squares (PLS), multilayer perceptron artificial neural networks (MLP ANNs) and cascade correlation (CC) ANNs - were used to generate models and extract the quantitative composition of unknown samples. PLS can be used to predict one element (PLS1) or multiple elements (PLS2) at a time, as can the neural network methods. Although MLP and CC ANNs were successful in some cases, PLS generally produced the most accurate and precise results.
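A minimal PLS2 calibration in the spirit of this comparison, using scikit-learn (the spectra and compositions are synthetic placeholders, not LIBS data):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(7)
    n, n_channels, n_oxides = 229, 2048, 4
    concs = rng.uniform(0, 1, (n, n_oxides))            # "compositions"
    basis = rng.standard_normal((n_oxides, n_channels))
    spectra = concs @ basis + 0.05 * rng.standard_normal((n, n_channels))

    Xtr, Xte, ytr, yte = train_test_split(spectra, concs, random_state=0)
    pls = PLSRegression(n_components=8).fit(Xtr, ytr)   # PLS2: all oxides at once
    print(pls.score(Xte, yte))                          # R^2 on held-out samples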
Dessì, Alessia; Pani, Danilo; Raffo, Luigi
2014-08-01
Non-invasive fetal electrocardiography is still an open research issue. The recent publication of an annotated dataset on PhysioNet providing four-channel non-invasive abdominal ECG traces promoted an international challenge on the topic. Starting from that dataset, an algorithm was developed for the identification of fetal QRS complexes from a reduced number of electrodes and without any a priori information about electrode positioning, entering the top ten best-performing open-source algorithms presented at the challenge. In this paper, an improved version of that algorithm is presented and evaluated using the same challenge metrics. It is mainly based on the subtraction of the maternal QRS complexes in every lead, obtained by synchronized averaging of morphologically similar complexes, the filtering of the maternal P and T waves, and the enhancement of the fetal QRS through independent component analysis (ICA) applied to the processed signals before a final fetal QRS detection stage. The RR time series of both the mother and the fetus are analyzed to exploit pseudoperiodicity, with the aim of correcting wrong annotations. The algorithm was designed and extensively evaluated on the open dataset A (N = 75), and finally evaluated on datasets B (N = 100) and C (N = 272) to obtain mean scores over data not used during algorithm development. Compared to the results achieved by the previous version of the algorithm, the current version would mark the 5th and 4th positions in the final rankings for events 1 and 2, reserved for the open-source challenge entries, taking into account both official and unofficial entrants. On dataset A, the algorithm achieves 0.982 median sensitivity and 0.976 median positive predictivity.
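The maternal-QRS cancellation step alone is compact enough to sketch in Python (maternal R-peak positions are assumed given; the full method adds P/T-wave filtering, ICA enhancement and RR-series post-processing):

    import numpy as np

    def subtract_maternal_qrs(sig, m_peaks, half=50):
        """Average beats around maternal R peaks and subtract the template;
        the residual retains the (non-synchronous) fetal ECG."""
        ok = [p for p in m_peaks if half <= p < len(sig) - half]
        template = np.mean([sig[p - half:p + half] for p in ok], axis=0)
        out = sig.copy()
        for p in ok:
            out[p - half:p + half] -= template
        return out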
Every factor helps: Rapid Ptychographic Reconstruction
NASA Astrophysics Data System (ADS)
Nashed, Youssef
2015-03-01
Recent advances in microscopy, specifically higher spatial resolution and data acquisition rates, require faster and more robust phase retrieval reconstruction methods. Ptychography is a phase retrieval technique for reconstructing the complex transmission function of a specimen from a sequence of diffraction patterns in visible light, X-ray, and electron microscopes. As technical advances allow larger fields to be imaged, computational challenges arise for reconstructing the correspondingly larger data volumes. Waiting to postprocess datasets offline results in missed opportunities. Here we present a parallel method for real-time ptychographic phase retrieval. It uses a hybrid parallel strategy to divide the computation between multiple graphics processing units (GPUs). A final specimen reconstruction is then achieved by different techniques to merge sub-dataset results into a single complex phase and amplitude image. Results are shown on a simulated specimen and real datasets from X-ray experiments conducted at a synchrotron light source.
Chattree, A; Barbour, J A; Thomas-Gibson, S; Bhandari, P; Saunders, B P; Veitch, A M; Anderson, J; Rembacken, B J; Loughrey, M B; Pullan, R; Garrett, W V; Lewis, G; Dolwani, S; Rutter, M D
2017-01-01
The management of large non-pedunculated colorectal polyps (LNPCPs) is complex, with widespread variation in management and outcome, even amongst experienced clinicians. Variations in the assessment and decision-making processes are likely to be a major factor in this variability. The creation of a standardized minimum dataset to aid decision-making may therefore result in improved clinical management. An official working group of 13 multidisciplinary specialists was appointed by the Association of Coloproctology of Great Britain and Ireland (ACPGBI) and the British Society of Gastroenterology (BSG) to develop a minimum dataset on LNPCPs. The literature review used to structure the ACPGBI/BSG guidelines for the management of LNPCPs was used by a steering subcommittee to identify various parameters pertaining to the decision-making processes in the assessment and management of LNPCPs. A modified Delphi consensus process was then used for voting on proposed parameters over multiple voting rounds with at least 80% agreement defined as consensus. The minimum dataset was used in a pilot process to ensure rigidity and usability. A 23-parameter minimum dataset with parameters relating to patient and lesion factors, including six parameters relating to image retrieval, was formulated over four rounds of voting with two pilot processes to test rigidity and usability. This paper describes the development of the first reported evidence-based and expert consensus minimum dataset for the management of LNPCPs. It is anticipated that this dataset will allow comprehensive and standardized lesion assessment to improve decision-making in the assessment and management of LNPCPs. Colorectal Disease © 2016 The Association of Coloproctology of Great Britain and Ireland.
A symmetric multivariate leakage correction for MEG connectomes
Colclough, G.L.; Brookes, M.J.; Smith, S.M.; Woolrich, M.W.
2015-01-01
Ambiguities in the source reconstruction of magnetoencephalographic (MEG) measurements can cause spurious correlations between estimated source time-courses. In this paper, we propose a symmetric orthogonalisation method to correct for these artificial correlations between a set of multiple regions of interest (ROIs). This process enables the straightforward application of network modelling methods, including partial correlation or multivariate autoregressive modelling, to infer connectomes, or functional networks, from the corrected ROIs. Here, we apply the correction to simulated MEG recordings of simple networks and to a resting-state dataset collected from eight subjects, before computing the partial correlations between power envelopes of the corrected ROI time-courses. We show accurate reconstruction of our simulated networks, and in the analysis of real MEG resting-state connectivity, we find dense bilateral connections within the motor and visual networks, together with longer-range direct fronto-parietal connections. PMID:25862259
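The closest-orthonormal-matrix core of the correction is a one-line SVD (the orthogonal Procrustes solution); the published method iterates this with per-ROI scaling, which this numpy sketch omits:

    import numpy as np

    def symmetric_orthogonalise(X):
        """X: time x ROI. Return the closest matrix with orthonormal columns
        (least-squares sense): U @ Vt from the SVD of X."""
        U, _, Vt = np.linalg.svd(X, full_matrices=False)
        return U @ Vt

    rng = np.random.default_rng(8)
    ts = rng.standard_normal((1000, 5))
    ts[:, 1] += 0.5 * ts[:, 0]                  # leakage-like shared signal
    corr = symmetric_orthogonalise(ts)
    print(np.max(np.abs(corr.T @ corr - np.eye(5))))   # ~0: columns orthonormal

Unlike pairwise regression-based leakage correction, this treatment is symmetric: no ROI is privileged as the regressor.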
Mortality Prediction Model of Septic Shock Patients Based on Routinely Recorded Data
Carrara, Marta; Baselli, Giuseppe; Ferrario, Manuela
2015-01-01
We studied the problem of mortality prediction in two datasets, the first composed of 23 septic shock patients and the second composed of 73 septic subjects selected from the public database MIMIC-II. For each patient we derived hemodynamic variables, laboratory results, and clinical information of the first 48 hours after shock onset and we performed univariate and multivariate analyses to predict mortality in the following 7 days. The results show interesting features that individually identify significant differences between survivors and nonsurvivors and features which gain importance only when considered together with the others in a multivariate regression model. This preliminary study on two small septic shock populations represents a novel contribution towards new personalized models for an integration of multiparameter patient information to improve critical care management of shock patients. PMID:26557154
Clustering of Multivariate Geostatistical Data
NASA Astrophysics Data System (ADS)
Fouedjio, Francky
2017-04-01
Multivariate data indexed by geographical coordinates have become omnipresent in the geosciences and pose substantial analysis challenges. One of them is the grouping of data locations into spatially contiguous clusters, so that data locations belonging to the same cluster have a certain degree of homogeneity while data locations in different clusters are as different as possible. However, groups of data locations created through classical clustering techniques often show poor spatial contiguity, a feature obviously inconvenient for many geoscience applications. In this work, we develop a clustering method that overcomes this problem by accounting for the spatial dependence structure of the data, thus reinforcing the spatial contiguity of the resulting clusters. The capability of the proposed clustering method to provide spatially contiguous and meaningful clusters of data locations is assessed using both synthetic and real datasets. Keywords: clustering, geostatistics, spatial contiguity, spatial dependence.
NASA Astrophysics Data System (ADS)
Sun, L. Qing; Feng, Feng X.
2014-11-01
In this study, we first built and compared two different climate datasets for the Wuling mountainous area in 2010: one that considered topographical effects during the ANUSPLIN interpolation, referred to as the terrain-based climate dataset, and one that did not, called the ordinary climate dataset. Then, we quantified the topographical effects of climatic inputs on NPP estimation by feeding the two climate datasets into the same ecosystem model, the Boreal Ecosystem Productivity Simulator (BEPS), to evaluate the importance of considering relief when estimating NPP. Finally, we identified the primary variables contributing to the topographical effects through a series of experiments, given an overall accuracy of the model output for NPP. The results showed that: (1) the terrain-based climate dataset presented more reliable topographic information and agreed more closely with the station dataset than the ordinary climate dataset over the continuous 365-day series of daily mean values; (2) on average, the ordinary climate dataset underestimated NPP by 12.5% compared with the terrain-based climate dataset over the whole study area; and (3) the primary climate variables contributing to the topographical effects for the Wuling mountainous area were temperatures, which suggests that it is necessary to correct temperature differences to estimate NPP accurately in such complex terrain.
NASA Astrophysics Data System (ADS)
Tsontos, V. M.; Arms, S. C.; Thompson, C. K.; Quach, N.; Lam, T.
2016-12-01
Earth science applications increasingly rely on the integration of multivariate data from diverse observational platforms. Whether for satellite mission cal/val, science, or decision support, the coupling of remote sensing and in-situ field data is integral to oceanographic workflows. This has prompted archives such as PO.DAAC, NASA's physical oceanography data archive, which historically has had a remote sensing focus, to adapt to better accommodate complex field campaign datasets. However, the inherent heterogeneity of in-situ datasets and their variable adherence to meta/data standards pose a significant impediment to interoperability, a problem originating early in the data lifecycle and significantly impacting the stewardship and usability of these data long-term. Here we introduce a new initiative underway at PO.DAAC that seeks to catalyze efforts to address these challenges. It involves the enhancement and integration of available high-TRL (Technology Readiness Level) components for improved interoperability and support of in-situ data, with a focus on a novel yet representative class of oceanographic field data: data from electronic tags deployed on a variety of marine species as biological sampling platforms in support of fisheries management and ocean observation efforts. This project seeks to demonstrate, deliver, and ultimately sustain operationally a reusable and accessible set of tools to: (1) mediate reconciliation of heterogeneous source data into a tractable number of standardized formats consistent with earth science data standards; (2) harmonize existing metadata models for satellite and field datasets; and (3) demonstrate the value added by integrated data access via a range of available tools and services hosted at PO.DAAC, including a web-based visualization tool for comprehensive mapping of satellite and in-situ data. An innovative part of our project plan involves partnering with the leading electronic tag manufacturer to promote the adoption of appropriate data standards in their processing software. The project thus adopts a model lifecycle approach, complemented by broadly applicable technologies, to address key data management and interoperability issues for in-situ data.
Mohammadi, Saeedeh; Parastar, Hadi
2018-05-15
In this work, a chemometrics-based strategy is developed for quantitative mass spectrometry imaging (MSI). In this regard, quantification of chlordecone, a carcinogenic organochlorine pesticide (C10Cl10O), in mouse liver using matrix-assisted laser desorption ionization MSI (MALDI-MSI) is used as a case study. The MSI datasets corresponded to 1, 5 and 10 days of mouse exposure to standard chlordecone in the quantity range of 0 to 450 μg g-1. A binning approach in the m/z direction is used to group high-resolution m/z values and to reduce the large data size. To assess the effect of bin size on the quality of the results, three different bin sizes of 0.25, 0.5 and 1.0 were chosen. Afterwards, three-way MSI data arrays (two spatial dimensions and one m/z dimension) for seven standards and four unknown samples were column-wise augmented with m/z values as the common mode. These datasets were then analyzed using multivariate curve resolution-alternating least squares (MCR-ALS) with proper constraints. The resolved mass spectra were used for the identification of chlordecone in the presence of a complex background and interference. Additionally, the augmented spatial profiles were post-processed, and 2D images for each component were obtained for the calibration and unknown samples. The sum of these profiles was utilized to construct the calibration curve and to obtain the analytical figures of merit (AFOMs). Inspection of the results showed that the smallest bin size (i.e., 0.25) provides the most accurate results. Finally, the results obtained by MCR for the three datasets were compared with those of gas chromatography-mass spectrometry (GC-MS) and MALDI-MSI. The results showed that the MCR-assisted method estimates a higher amount of chlordecone than MALDI-MSI and a lower amount than GC-MS. It is concluded that the combination of chemometric methods with MSI can be considered an alternative approach for MSI quantification.
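A bare-bones MCR-ALS iteration in Python illustrating the bilinear decomposition D ≈ C S.T used above (non-negativity imposed by clipping; real analyses add further constraints and convergence checks):

    import numpy as np

    def mcr_als(D, n_components, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        C = np.abs(rng.standard_normal((D.shape[0], n_components)))
        for _ in range(n_iter):
            S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0, None)    # spectra
            C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0, None)  # concentrations
        return C, S

    rng = np.random.default_rng(9)
    C_true = np.abs(rng.standard_normal((300, 2)))   # e.g. pixel concentrations
    S_true = np.abs(rng.standard_normal((80, 2)))    # e.g. component mass spectra
    D = C_true @ S_true.T + 0.01 * rng.standard_normal((300, 80))
    C, S = mcr_als(D, 2)
    print(np.linalg.norm(D - C @ S.T) / np.linalg.norm(D))   # small relative residual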
Transcriptomic correlates of neuron electrophysiological diversity
Li, Brenna; Crichlow, Cindy-Lee; Mancarci, B. Ogan; Pavlidis, Paul
2017-01-01
How neuronal diversity emerges from complex patterns of gene expression remains poorly understood. Here we present an approach to understand electrophysiological diversity through gene expression by integrating pooled- and single-cell transcriptomics with intracellular electrophysiology. Using neuroinformatics methods, we compiled a brain-wide dataset of 34 neuron types with paired gene expression and intrinsic electrophysiological features from publicly accessible sources, the largest such collection to date. We identified 420 genes whose expression levels significantly correlated with variability in one or more of 11 physiological parameters. We next trained statistical models to infer cellular features from multivariate gene expression patterns. Such models were predictive of gene-electrophysiological relationships in an independent collection of 12 visual cortex cell types from the Allen Institute, suggesting that these correlations might reflect general principles relating expression patterns to phenotypic diversity across very different cell types. Many associations reported here have the potential to provide new insights into how neurons generate functional diversity, and correlations of ion channel genes like Gabrd and Scn1a (Nav1.1) with resting potential and spiking frequency are consistent with known causal mechanisms. Our work highlights the promise and inherent challenges in using cell type-specific transcriptomics to understand the mechanistic origins of neuronal diversity. PMID:29069078
Using Fisher information to track stability in multivariate ...
With the current proliferation of data, the proficient use of statistical and mining techniques offers substantial benefits to capture useful information from any dataset. As numerous approaches make use of information theory concepts, here we discuss how Fisher information (FI) can be applied to sustainability science problems and used in data mining applications by analyzing patterns in data. FI was developed as a measure of information content in data, and it has been adapted to assess order in complex system behaviors. The main advantage of the approach is the ability to collapse multiple variables into an index that can be used to assess stability and track overall trends in a system, including its regimes and regime shifts. Here, we provide a brief overview of FI theory, followed by a simple step-by-step numerical example of how to compute FI. Furthermore, we introduce an open-source Python library that can be freely downloaded from GitHub, and we use it in a simple case study to evaluate the evolution of FI for the global-mean temperature from 1880 to 2015. Results indicate significant declines in FI starting in 1978, suggesting a possible regime shift. These results demonstrate Fisher information as a useful method for assessing patterns in big data.
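One common discrete estimator of FI in this literature tracks how sharply the probability density of binned system states changes (FI ≈ 4 Σ (Δ√p)²), evaluated in sliding windows; the binning and window choices in this Python sketch are illustrative assumptions:

    import numpy as np

    def fisher_information(window, n_bins=10):
        p, _ = np.histogram(window, bins=n_bins)
        p = p / p.sum()
        amp = np.sqrt(p)                      # "amplitude" of the state density
        return 4.0 * np.sum(np.diff(amp) ** 2)

    rng = np.random.default_rng(10)
    series = np.concatenate([rng.normal(0, 1.0, 500),
                             rng.normal(3, 0.3, 500)])      # regime shift halfway
    fi = [fisher_information(series[i:i + 100])
          for i in range(0, len(series) - 100, 25)]
    print(np.round(fi, 2))   # a sustained change in FI flags the new regime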
Careau, Vincent; Wolak, Matthew E.; Carter, Patrick A.; Garland, Theodore
2015-01-01
Given the pace at which human-induced environmental changes occur, a pressing challenge is to determine the speed with which selection can drive evolutionary change. A key determinant of adaptive response to multivariate phenotypic selection is the additive genetic variance–covariance matrix (G). Yet knowledge of G in a population experiencing new or altered selection is not sufficient to predict selection response because G itself evolves in ways that are poorly understood. We experimentally evaluated changes in G when closely related behavioural traits experience continuous directional selection. We applied the genetic covariance tensor approach to a large dataset (n = 17 328 individuals) from a replicated, 31-generation artificial selection experiment that bred mice for voluntary wheel running on days 5 and 6 of a 6-day test. Selection on this subset of G induced proportional changes across the matrix for all 6 days of running behaviour within the first four generations. The changes in G induced by selection resulted in a fourfold slower-than-predicted rate of response to selection. Thus, selection exacerbated constraints within G and limited future adaptive response, a phenomenon that could have profound consequences for populations facing rapid environmental change. PMID:26582016
O'Brien, M.A.; Costin, B.N.; Miles, M.F.
2014-01-01
Postgenomic studies of the function of genes and their role in disease have now become an area of intense study since efforts to define the raw sequence material of the genome have largely been completed. The use of whole-genome approaches such as microarray expression profiling and, more recently, RNA-sequence analysis of transcript abundance has allowed an unprecedented look at the workings of the genome. However, the accurate derivation of such high-throughput data and their analysis in terms of biological function has been critical to truly leveraging the postgenomic revolution. This chapter will describe an approach that focuses on the use of gene networks to both organize and interpret genomic expression data. Such networks, derived from statistical analysis of large genomic datasets and the application of multiple bioinformatics data resources, potentially allow the identification of key control elements for networks associated with human disease, and thus may lead to derivation of novel therapeutic approaches. However, as discussed in this chapter, the leveraging of such networks cannot occur without a thorough understanding of the technical and statistical factors influencing the derivation of genomic expression data. Thus, while the catch phrase may be “it's the network … stupid,” an understanding of factors extending from RNA isolation to genomic profiling technique, multivariate statistics, and bioinformatics is critical to defining fully useful gene networks for study of complex biology. PMID:23195313
Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G.; Treviño, Victor
2013-01-01
Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature along datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed in http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R. PMID:24066126
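SurvExpress performs the analysis server-side; for readers who want to replicate a comparable multivariate survival analysis locally, here is a minimal sketch using the lifelines package on a hypothetical expression matrix (gene names, effect sizes, and censoring rate are invented for illustration):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical cohort: expression of a 3-gene biomarker plus censored outcomes
rng = np.random.default_rng(8)
n = 200
df = pd.DataFrame({
    "GENE1": rng.normal(size=n),
    "GENE2": rng.normal(size=n),
    "GENE3": rng.normal(size=n),
})
risk = 0.8 * df["GENE1"] - 0.5 * df["GENE3"]            # planted effect
df["time"] = rng.exponential(scale=np.exp(-risk) * 60)  # survival times (months)
df["event"] = rng.uniform(size=n) < 0.7                 # ~30% censoring

# Multivariate Cox proportional hazards model on the biomarker gene list
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()   # hazard ratios and p-values per gene
```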
NASA Astrophysics Data System (ADS)
Tsontos, V. M.; Huang, T.; Holt, B.
2015-12-01
The earth science enterprise increasingly relies on the integration and synthesis of multivariate datasets from diverse observational platforms. NASA's ocean salinity missions, which include Aquarius/SAC-D and the SPURS (Salinity Processes in the Upper Ocean Regional Study) field campaign, illustrate the value of integrated observations in support of studies on ocean circulation, the water cycle, and climate. However, the inherent heterogeneity of the resulting data and the disparate, distributed systems that serve them complicate their effective utilization for both earth science research and applications. Key technical interoperability challenges include inconsistent adherence to metadata and data format standards, which is particularly acute for in-situ data, and the lack of a unified metadata model facilitating archival and integration of both satellite and oceanographic field datasets. Here we report on efforts at the PO.DAAC, NASA's physical oceanographic data center, to extend our data management and distribution support capabilities for field campaign datasets such as those from SPURS. We also discuss value-added services, based on the integration of satellite and in-situ datasets, which are under development, with a particular focus on DOMS. The distributed oceanographic matchup service (DOMS) implements a portable technical infrastructure and associated web services that will be broadly accessible via the PO.DAAC for the dynamic collocation of satellite and in-situ data, hosted by distributed data providers, in support of mission cal/val, science, and operational applications.
Understanding the Role of Conscientiousness in Healthy Aging: Where Does the Brain Come In?
Patrick, Christopher J.
2014-01-01
In reviewing this impressive series of articles, I was struck by two points in particular: (1) the fact that the empirically-oriented articles focused on analyses of data from very large samples, with the two papers by Friedman and colleagues highlighting an approach to merging existing datasets through use of “metric bridges” in order to address key questions not addressable through one dataset alone, and (2) the fact that the articles as a whole included limited mention of neuroscientific (i.e., brain research) concepts, methods, and findings. One likely reason for the lack of reference to brain-oriented work is the persisting gap between smaller-N lab-experimental and larger-N multivariate-correlational approaches to psychological research. As a strategy for addressing this gap and bringing a distinct neuroscientific component to the National Institute on Aging’s conscientiousness and health initiative, I suggest that the metric bridging approach highlighted by Friedman and colleagues could be used to connect existing large-scale datasets containing both neurophysiological variables and measures of individual difference constructs to other datasets containing richer arrays of non-physiological variables—including data from longitudinal or twin studies focusing on personality and health-related outcomes (e.g., Terman Life Cycle study and Hawaii longitudinal studies, as described in the article by Kern et al.). PMID:24773108
Methods to increase reproducibility in differential gene expression via meta-analysis
Sweeney, Timothy E.; Haynes, Winston A.; Vallania, Francesco; Ioannidis, John P.; Khatri, Purvesh
2017-01-01
Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a ‘silver standard’ of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds combined with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of the actual false positive rate by Benjamini–Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size. PMID:27634930
Once upon Multivariate Analyses: When They Tell Several Stories about Biological Evolution.
Renaud, Sabrina; Dufour, Anne-Béatrice; Hardouin, Emilie A; Ledevin, Ronan; Auffray, Jean-Christophe
2015-01-01
Geometric morphometrics aims to characterize the geometry of complex traits. It is therefore in essence multivariate. The most popular methods to investigate patterns of differentiation in this context are (1) the Principal Component Analysis (PCA), which is an eigenvalue decomposition of the total variance-covariance matrix among all specimens; (2) the Canonical Variate Analysis (CVA, a.k.a. linear discriminant analysis (LDA) for more than two groups), which aims at separating the groups by maximizing the between-group to within-group variance ratio; (3) the between-group PCA (bgPCA), which investigates patterns of between-group variation without standardizing by the within-group variance. Standardizing within-group variance, as performed in the CVA, distorts the relationships among groups, an effect that is particularly strong if the variance is similarly oriented in all groups. Such a shared direction of main morphological variance may occur and have a biological meaning, for instance corresponding to the most frequent standing genetic variation in a population. Here we undertake a case study of the evolution of house mouse molar shape across various islands, based on a real dataset and simulations. We investigated how patterns of main variance influence the depiction of among-group differentiation according to the interpretation of the PCA, bgPCA and CVA. Without arguing that one method performs 'better' than another, it rather emerges that working on the total or between-group variance (PCA and bgPCA) will tend to put the focus on the role of the direction of main variance as a line of least resistance to evolution. Standardizing by the within-group variance (CVA), by dampening the expression of this line of least resistance, has the potential to reveal other relevant patterns of differentiation that may otherwise be blurred.
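The three methods contrasted above are easy to compare side by side. The following scikit-learn sketch computes PCA scores (total covariance), CVA/LDA scores (within-group standardized), and bgPCA scores (PCA of group means with all specimens projected onto those axes); the data and group structure are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical shape data: 3 island groups x 30 specimens x 8 shape variables
X = rng.normal(size=(90, 8))
groups = np.repeat([0, 1, 2], 30)
X[:, 0] += groups * 1.5          # between-group separation along one axis

# (1) PCA: eigen-decomposition of the total variance-covariance matrix
pca_scores = PCA(n_components=2).fit_transform(X)

# (2) CVA/LDA: maximizes between- to within-group variance ratio
cva_scores = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, groups)

# (3) bgPCA: PCA of the group means, then project all specimens onto those axes
means = np.vstack([X[groups == g].mean(axis=0) for g in np.unique(groups)])
bgpca = PCA(n_components=2).fit(means)
bgpca_scores = bgpca.transform(X)
```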
Multivariate meta-analysis for non-linear and other multi-parameter associations
Gasparrini, A; Armstrong, B; Kenward, M G
2012-01-01
In this paper, we formalize the application of multivariate meta-analysis and meta-regression to synthesize estimates of multi-parameter associations obtained from different studies. This modelling approach extends the standard two-stage analysis used to combine results across different sub-groups or populations. The most straightforward application is for the meta-analysis of non-linear relationships, described for example by regression coefficients of splines or other functions, but the methodology easily generalizes to any setting where complex associations are described by multiple correlated parameters. The modelling framework of multivariate meta-analysis is implemented in the package mvmeta within the statistical environment R. As an illustrative example, we propose a two-stage analysis for investigating the non-linear exposure–response relationship between temperature and non-accidental mortality using time-series data from multiple cities. Multivariate meta-analysis represents a useful analytical tool for studying complex associations through a two-stage procedure. Copyright © 2012 John Wiley & Sons, Ltd. PMID:22807043
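For readers outside R, the core pooling step of the fixed-effect multivariate case is compact enough to sketch directly in numpy (the mvmeta package additionally estimates a between-study covariance matrix for random-effects models, which this illustration omits):

```python
import numpy as np

def mv_fixed_effect_pool(betas, covs):
    """Fixed-effect multivariate meta-analysis.

    betas: (k, p) array of p-dimensional coefficient vectors (e.g. spline
           coefficients of a non-linear exposure-response curve) from k studies.
    covs:  (k, p, p) array of their within-study covariance matrices.
    Returns the pooled vector and its covariance:
        beta_hat = (sum W_i)^-1 sum W_i b_i,  with W_i = S_i^-1.
    """
    W = np.array([np.linalg.inv(S) for S in covs])
    V = np.linalg.inv(W.sum(axis=0))   # covariance of the pooled estimate
    beta = V @ sum(Wi @ bi for Wi, bi in zip(W, betas))
    return beta, V

# Hypothetical second stage: 4 cities, 3 spline coefficients each
rng = np.random.default_rng(2)
betas = rng.normal(size=(4, 3))
covs = np.array([np.eye(3) * 0.05 for _ in range(4)])
pooled, pooled_cov = mv_fixed_effect_pool(betas, covs)
```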
NASA Astrophysics Data System (ADS)
Whitehall, K. D.; Jenkins, G. S.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Goodale, C. E.; Hart, A. F.; Ramirez, P.; Whittell, J.; Zimdars, P. A.
2012-12-01
Mesoscale convective complexes (MCCs) are large (2-3 × 10^5 km^2) nocturnal convectively-driven weather systems that are generally associated with high precipitation events of short duration (less than 12 hrs) in various locations throughout the tropics and midlatitudes (Maddox 1980). These systems are particularly important for climate in the West Sahel region, where the precipitation associated with them is a principal component of the rainfall season (Laing and Fritsch 1993). These systems occur on weather timescales and are historically identified from weather data analysis via manual and, more recently, automated processes (Miller and Fritsch 1991, Nesbett 2006, Balmey and Reason 2012). The Regional Climate Model Evaluation System (RCMES) is an open source tool designed for easy evaluation of climate and Earth system data through access to standardized datasets and intrinsic tools that perform common analysis and visualization tasks (Hart et al. 2011). The RCMES toolkit also provides the flexibility of user-defined subroutines for further metrics, visualization and even dataset manipulation. The purpose of this study is to present a methodology for identifying MCCs in observation datasets using the RCMES framework. TRMM 3-hourly datasets will be used to demonstrate the methodology for the 2005 boreal summer. This method promotes the use of open source software for scientific data systems to address a concern shared by multiple stakeholders in the earth sciences. A historical MCC dataset provides a platform for further studies of the variability in MCC frequency on various timescales, which is important to many, including climate scientists, meteorologists, water resource managers, and agriculturalists. The methodology of using RCMES for searching and clipping datasets will engender a new realm of studies, as users of the system will no longer be restricted to solely using the datasets as they reside in their own local systems; instead, they will be afforded rapid, effective, and transparent access, processing and visualization of the wealth of remote sensing datasets and climate model outputs available.
Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments.
Ionescu, Catalin; Papava, Dragos; Olaru, Vlad; Sminchisescu, Cristian
2014-07-01
We introduce a new dataset, Human3.6M, of 3.6 million accurate 3D human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement from leveraging higher-capacity, more complex models with our large dataset is substantially greater and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m.
Improving average ranking precision in user searches for biomedical research datasets
Gobeill, Julien; Gaudinat, Arnaud; Vachon, Thérèse; Ruch, Patrick
2017-01-01
Availability of research datasets is a keystone of health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participants' best submissions. Overall, it ranks in the top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed a positive impact on the system's performance, increasing our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have a significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL: https://biocaddie.org/benchmark-data PMID:29220475
Analysis Of Navy Hornet Squadron Mishap Costs With Regard To Previously Flown Flight Hours
2017-06-01
Mishaps occur more frequently in a squadron when flight hours are reduced. This thesis correlates F/A-18 Hornet and Super Hornet squadron previously flown flight hours with mishap costs, with costs correlated to the flight hours flown during the previous three and six months. A linear multivariate model was developed and used to analyze a dataset. It uses a macro…
A CLIPS expert system for clinical flow cytometry data analysis
NASA Technical Reports Server (NTRS)
Salzman, G. C.; Duque, R. E.; Braylan, R. C.; Stewart, C. C.
1990-01-01
An expert system is being developed using CLIPS to assist clinicians in the analysis of multivariate flow cytometry data from cancer patients. Cluster analysis is used to find subpopulations representing various cell types in multiple datasets each consisting of four to five measurements on each of 5000 cells. CLIPS facts are derived from results of the clustering. CLIPS rules are based on the expertise of Drs. Stewart, Duque, and Braylan. The rules incorporate certainty factors based on case histories.
A Multivariate Investigation of Employee Absenteeism.
1980-05-01
A Multivariate Investigation of Employee Absenteeism. James R. Terborg and Gregory A. Davis, Department of Psychology, University of Houston, Houston, Texas. Technical Report TR-80-5, May 1980; Office of Naval Research contract N00014-78-C-0756 (unclassified).
Bessonov, Kyrylo; Walkey, Christopher J.; Shelp, Barry J.; van Vuuren, Hennie J. J.; Chiu, David; van der Merwe, George
2013-01-01
Analyzing time-course expression data captured in microarray datasets is a complex undertaking as the vast and complex data space is represented by a relatively low number of samples as compared to thousands of available genes. Here, we developed the Interdependent Correlation Clustering (ICC) method to analyze relationships that exist among genes conditioned on the expression of a specific target gene in microarray data. Based on Correlation Clustering, the ICC method analyzes a large set of correlation values related to gene expression profiles extracted from given microarray datasets. ICC can be applied to any microarray dataset and any target gene. We applied this method to microarray data generated from wine fermentations and selected NSF1, which encodes a C2H2 zinc finger-type transcription factor, as the target gene. The validity of the method was verified by accurate identifications of the previously known functional roles of NSF1. In addition, we identified and verified potential new functions for this gene; specifically, NSF1 is a negative regulator for the expression of sulfur metabolism genes, the nuclear localization of Nsf1 protein (Nsf1p) is controlled in a sulfur-dependent manner, and the transcription of NSF1 is regulated by Met4p, an important transcriptional activator of sulfur metabolism genes. The inter-disciplinary approach adopted here highlighted the accuracy and relevancy of the ICC method in mining for novel gene functions using complex microarray datasets with a limited number of samples. PMID:24130853
Chung, Dongjun; Kim, Hang J; Zhao, Hongyu
2017-02-01
Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with hundreds of phenotypes and diseases, providing clinical and medical benefits to patients through novel biomarkers and therapeutic targets. However, identification of risk variants associated with complex diseases remains challenging as they are often affected by many genetic variants with small or moderate effects. There has been accumulating evidence suggesting that different complex traits share a common risk basis, namely pleiotropy. Recently, several statistical methods have been developed to improve statistical power to identify risk variants for complex traits through a joint analysis of multiple GWAS datasets by leveraging pleiotropy. While these methods were shown to improve statistical power for association mapping compared to separate analyses, they are still limited in the number of phenotypes that can be integrated. In order to address this challenge, in this paper, we propose a novel statistical framework, graph-GPA, to integrate a large number of GWAS datasets for multiple phenotypes using a hidden Markov random field approach. Application of graph-GPA to a joint analysis of GWAS datasets for 12 phenotypes shows that graph-GPA improves statistical power to identify risk variants compared to statistical methods based on a smaller number of GWAS datasets. In addition, graph-GPA also promotes better understanding of genetic mechanisms shared among phenotypes, which can potentially be useful for the development of improved diagnosis and therapeutics. The R implementation of graph-GPA is currently available at https://dongjunchung.github.io/GGPA/.
Pooled assembly of marine metagenomic datasets: enriching annotation through chimerism.
Magasin, Jonathan D; Gerloff, Dietlind L
2015-02-01
Despite advances in high-throughput sequencing, marine metagenomic samples remain largely opaque. A typical sample contains billions of microbial organisms from thousands of genomes and quadrillions of DNA base pairs. Its derived metagenomic dataset underrepresents this complexity by orders of magnitude because of the sparseness and shortness of sequencing reads. Read shortness and sequencing errors pose a major challenge to accurate species and functional annotation. This includes distinguishing known from novel species. Often the majority of reads cannot be annotated and thus cannot help our interpretation of the sample. Here, we demonstrate quantitatively how careful assembly of marine metagenomic reads within, but also across, datasets can alleviate this problem. For 10 simulated datasets, each with species complexity modeled on a real counterpart, chimerism remained within the same species for most contigs (97%). For 42 real pyrosequencing ('454') datasets, assembly increased the proportion of annotated reads, and even more so when datasets were pooled, by on average 1.6% (max 6.6%) for species, 9.0% (max 28.7%) for Pfam protein domains and 9.4% (max 22.9%) for PANTHER gene families. Our results outline exciting prospects for data sharing in the metagenomics community. While chimeric sequences should be avoided in other areas of metagenomics (e.g. biodiversity analyses), conservative pooled assembly is advantageous for annotation specificity and sensitivity. Intriguingly, our experiment also found potential prospects for (low-cost) discovery of new species in 'old' data. dgerloff@ffame.org Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Temporal abstraction for the analysis of intensive care information
NASA Astrophysics Data System (ADS)
Hadad, Alejandro J.; Evin, Diego A.; Drozdowicz, Bartolomé; Chiotti, Omar
2007-11-01
This paper proposes a scheme for the analysis of time-stamped series data from multiple monitoring devices in intensive care units, using Temporal Abstraction concepts. The scheme aims to obtain a description of the evolution of the patient state in an unsupervised way. The case study is based on a dataset clinically classified with Pulmonary Edema. For this dataset a trends-based Temporal Abstraction mechanism is proposed, by means of a Behaviours Base of time-stamped series, which is then used in a classification step. Combining this approach with the introduction of expert knowledge, using Fuzzy Logic, and with multivariate analysis by means of Self-Organizing Maps, a state characterization model is obtained. This model can feasibly be extended to different patient groups and states. The proposed scheme makes it possible to obtain descriptions of the intermediate states through which the patient is passing, which could be used to anticipate alert situations.
Generic Raman-based calibration models enabling real-time monitoring of cell culture bioreactors.
Mehdizadeh, Hamidreza; Lauri, David; Karry, Krizia M; Moshgbar, Mojgan; Procopio-Melino, Renee; Drapeau, Denis
2015-01-01
Raman-based multivariate calibration models have been developed for real-time in situ monitoring of multiple process parameters within cell culture bioreactors. Developed models are generic, in the sense that they are applicable to various products, media, and cell lines based on Chinese Hamster Ovarian (CHO) host cells, and are scalable to large pilot and manufacturing scales. Several batches using different CHO-based cell lines and corresponding proprietary media and process conditions have been used to generate calibration datasets, and models have been validated using independent datasets from separate batch runs. All models have been validated to be generic and capable of predicting process parameters with acceptable accuracy. The developed models allow monitoring multiple key bioprocess metabolic variables, and hence can be utilized as an important enabling tool for Quality by Design approaches which are strongly supported by the U.S. Food and Drug Administration. © 2015 American Institute of Chemical Engineers.
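The abstract does not name the regression algorithm behind these calibration models; assuming a partial least squares (PLS) model, which is typical for Raman calibration, a minimal scikit-learn sketch of building and cross-validating such a calibration might look like this (all data, names, and the planted band are hypothetical):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

# Hypothetical calibration set: Raman spectra (rows) with offline-measured
# glucose concentrations as reference values
rng = np.random.default_rng(3)
spectra = rng.normal(size=(60, 800))     # 60 spectra x 800 wavenumbers
glucose = rng.uniform(1.0, 6.0, 60)      # g/L, reference measurements
spectra[:, 100] += glucose               # plant a concentration-dependent band

pls = PLSRegression(n_components=5)
predicted = cross_val_predict(pls, spectra, glucose, cv=10).ravel()
rmsecv = np.sqrt(np.mean((predicted - glucose) ** 2))
print(f"RMSECV: {rmsecv:.3f} g/L")

# Fit on the full calibration set for in situ prediction on new batches
pls.fit(spectra, glucose)
```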
ERIC Educational Resources Information Center
Barthur, Ashrith
2016-01-01
There are two essential goals of this research. The first goal is to design and construct a computational environment that is used for studying large and complex datasets in the cybersecurity domain. The second goal is to analyse the Spamhaus blacklist query dataset which includes uncovering the properties of blacklisted hosts and understanding…
Bayesian Methods for Scalable Multivariate Value-Added Assessment
ERIC Educational Resources Information Center
Lockwood, J. R.; McCaffrey, Daniel F.; Mariano, Louis T.; Setodji, Claude
2007-01-01
There is increased interest in value-added models relying on longitudinal student-level test score data to isolate teachers' contributions to student achievement. The complex linkage of students to teachers as students progress through grades poses both substantive and computational challenges. This article introduces a multivariate Bayesian…
Relevant Feature Set Estimation with a Knock-out Strategy and Random Forests
Ganz, Melanie; Greve, Douglas N.; Fischl, Bruce; Konukoglu, Ender
2015-01-01
Group analysis of neuroimaging data is a vital tool for identifying anatomical and functional variations related to diseases as well as normal biological processes. The analyses are often performed on a large number of highly correlated measurements using a relatively smaller number of samples. Despite the correlation structure, the most widely used approach is to analyze the data using univariate methods followed by post-hoc corrections that try to account for the data’s multivariate nature. Although widely used, this approach may fail to recover from the adverse effects of the initial analysis when local effects are not strong. Multivariate pattern analysis (MVPA) is a powerful alternative to the univariate approach for identifying relevant variations. Jointly analyzing all the measures, MVPA techniques can detect global effects even when individual local effects are too weak to detect with univariate analysis. Current approaches are successful in identifying variations that yield highly predictive and compact models. However, they suffer from lessened sensitivity and instabilities in identification of relevant variations. Furthermore, current methods’ user-defined parameters are often unintuitive and difficult to determine. In this article, we propose a novel MVPA method for group analysis of high-dimensional data that overcomes the drawbacks of the current techniques. Our approach explicitly aims to identify all relevant variations using a “knock-out” strategy and the Random Forest algorithm. In evaluations with synthetic datasets the proposed method achieved substantially higher sensitivity and accuracy than the state-of-the-art MVPA methods, and outperformed the univariate approach when the effect size is low. In experiments with real datasets the proposed method identified regions beyond the univariate approach, while other MVPA methods failed to replicate the univariate results. More importantly, in a reproducibility study with the well-known ADNI dataset the proposed method yielded higher stability and power than the univariate approach. PMID:26272728
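As a rough illustration of the knock-out idea, and not the paper's exact procedure, the sketch below iteratively removes the most important feature of a Random Forest and retrains until cross-validated accuracy falls to chance, collecting every removed feature as "relevant":

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def knock_out_relevant_features(X, y, chance=0.5, tol=0.05, seed=0):
    """Knock out the top Random Forest feature until accuracy drops to
    chance; everything knocked out along the way is deemed relevant."""
    remaining = list(range(X.shape[1]))
    relevant = []
    while remaining:
        rf = RandomForestClassifier(n_estimators=200, random_state=seed)
        acc = cross_val_score(rf, X[:, remaining], y, cv=5).mean()
        if acc <= chance + tol:      # nothing informative left
            break
        rf.fit(X[:, remaining], y)
        top = remaining[int(np.argmax(rf.feature_importances_))]
        relevant.append(top)
        remaining.remove(top)
    return relevant

# Hypothetical group-analysis data: 100 subjects x 50 measures, 3 informative
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 50))
y = (X[:, :3].sum(axis=1) + rng.normal(0, 1, 100) > 0).astype(int)
print(knock_out_relevant_features(X, y))
```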
Hoenigl, Martin; Weibel, Nadir; Mehta, Sanjay R; Anderson, Christy M; Jenks, Jeffrey; Green, Nella; Gianella, Sara; Smith, Davey M; Little, Susan J
2015-08-01
Although men who have sex with men (MSM) represent a dominant risk group for human immunodeficiency virus (HIV), the risk of HIV infection within this population is not uniform. The objective of this study was to develop and validate a score to estimate incident HIV infection risk. Adult MSM who were tested for acute and early HIV (AEH) between 2008 and 2014 were retrospectively randomized 2:1 to a derivation and validation dataset, respectively. Using the derivation dataset, each predictor associated with an AEH outcome in the multivariate prediction model was assigned a point value that corresponded to its odds ratio. The score was validated on the validation dataset using C-statistics. Data collected at a single HIV testing encounter from 8326 unique MSM were analyzed, including 200 with AEH (2.4%). Four risk behavior variables were significantly associated with an AEH diagnosis (ie, incident infection) in multivariable analysis and were used to derive the San Diego Early Test (SDET) score: condomless receptive anal intercourse (CRAI) with an HIV-positive MSM (3 points), the combination of CRAI plus ≥5 male partners (3 points), ≥10 male partners (2 points), and diagnosis of bacterial sexually transmitted infection (2 points)-all as reported for the prior 12 months. The C-statistic for this risk score was >0.7 in both data sets. The SDET risk score may help to prioritize resources and target interventions, such as preexposure prophylaxis, to MSM at greatest risk of acquiring HIV infection. The SDET risk score is deployed as a freely available tool at http://sdet.ucsd.edu. © The Author 2015. Published by Oxford University Press on behalf of the Infectious Diseases Society of America. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
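The published point values make the SDET score simple to transcribe. A sketch, assuming (as the abstract implies) that the four point values are summed:

```python
def sdet_score(crai_with_hiv_pos_partner, crai, male_partners, bacterial_sti):
    """San Diego Early Test (SDET) score from the four predictors reported
    for the prior 12 months, using the point values given in the abstract."""
    score = 0
    if crai_with_hiv_pos_partner:     # CRAI with an HIV-positive MSM
        score += 3
    if crai and male_partners >= 5:   # CRAI plus >=5 male partners
        score += 3
    if male_partners >= 10:           # >=10 male partners
        score += 2
    if bacterial_sti:                 # bacterial STI diagnosis
        score += 2
    return score

# Example: CRAI with an HIV-positive partner and 12 partners in the past year
print(sdet_score(True, True, 12, False))   # -> 3 + 3 + 2 = 8
```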
Boiret, Mathieu; de Juan, Anna; Gorretta, Nathalie; Ginot, Yves-Michel; Roger, Jean-Michel
2015-01-25
In this work, Raman hyperspectral images and multivariate curve resolution-alternating least squares (MCR-ALS) are used to study the distribution of actives and excipients within a pharmaceutical drug product. This article is mainly focused on the distribution of a low dose constituent. Different approaches are compared, using initially filtered or non-filtered data, or using a column-wise augmented dataset before starting the MCR-ALS iterative process including appended information on the low dose component. In the studied formulation, magnesium stearate is used as a lubricant to improve powder flowability. With a theoretical concentration of 0.5% (w/w) in the drug product, the spectral variance contained in the data is weak. By using a principal component analysis (PCA) filtered dataset as a first step of the MCR-ALS approach, the lubricant information is lost in the non-explained variance and its associated distribution in the tablet cannot be highlighted. A sufficient number of components to generate the PCA noise-filtered matrix has to be used in order to keep the lubricant variability within the data set analyzed or, otherwise, one must work with the raw non-filtered data. Different models are built using an increasing number of components to perform the PCA reduction. It is shown that the magnesium stearate information can be extracted from a PCA model using a minimum of 20 components. In the last part, a column-wise augmented matrix, including a reference spectrum of the lubricant, is used before starting the MCR-ALS process. PCA reduction is performed on the augmented matrix, so the magnesium stearate contribution is included within the MCR-ALS calculations. By using an appropriate PCA reduction, with a sufficient number of components, or by using an augmented dataset including appended information on the low dose component, the distributions of the two actives, the two main excipients and the low dose lubricant are correctly recovered. Copyright © 2014 Elsevier B.V. All rights reserved.
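The authors' MCR-ALS implementation is not named; the open-source pymcr package can stand in to illustrate the column-wise augmentation strategy, appending the lubricant reference spectrum to the unfolded image before resolution (all data here are synthetic placeholders, and the pymcr API is assumed):

```python
import numpy as np
from pymcr.mcr import McrAR
from pymcr.constraints import ConstraintNonneg

# D_image: unfolded hyperspectral image (n_pixels x n_wavenumbers)
# ref_lubricant: reference Raman spectrum of the lubricant (1 x n_wavenumbers)
rng = np.random.default_rng(5)
n_pix, n_wn, n_comp = 500, 300, 5
D_image = np.abs(rng.normal(size=(n_pix, n_wn)))
ref_lubricant = np.abs(rng.normal(size=(1, n_wn)))

# Column-wise augmentation: append the reference spectrum as extra rows so the
# low dose component is represented in the dataset before resolution
D_aug = np.vstack([D_image, ref_lubricant])

# Initial spectral estimates (random here; in practice, pure-variable guesses)
ST_guess = np.abs(rng.normal(size=(n_comp, n_wn)))

mcr = McrAR(max_iter=50,
            c_constraints=[ConstraintNonneg()],
            st_constraints=[ConstraintNonneg()])
mcr.fit(D_aug, ST=ST_guess)

C_image = mcr.C_opt_[:n_pix]   # concentration maps for the image pixels
S_resolved = mcr.ST_opt_       # resolved component spectra
```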
Schoenfeld, Andrew J; Serrano, Jose A; Waterman, Brian R; Bader, Julia O; Belmont, Philip J
2013-11-01
Few studies have addressed the role of residents' participation in morbidity and mortality after orthopaedic surgery. The present study utilized the 2005-2010 National Surgical Quality Improvement Program (NSQIP) dataset to assess the risk of 30-day post-operative complications and mortality associated with resident participation in orthopaedic procedures. The NSQIP dataset was queried using codes for 12 common orthopaedic procedures. Patients identified as having received one of the procedures had their records abstracted to obtain demographic data, medical history, operative time, and resident involvement in their surgical care. Thirty-day post-operative outcomes, including complications and mortality, were assessed for all patients. A step-wise multivariate logistic regression model was constructed to evaluate the impact of resident participation on mortality- and complication-risk while controlling for other factors in the model. Primary analyses were performed comparing cases where the attending surgeon operated alone to all other case designations, while a subsequent sensitivity analysis limited inclusion to cases where resident participation was reported by post-graduate year. In the NSQIP dataset, 43,343 patients had received one of the 12 orthopaedic procedures queried. Thirty-five percent of cases were performed with resident participation. The overall mortality rate was 2.5 %, and 10 % of patients sustained one or more complications. Multivariate analysis demonstrated a significant association between resident participation and the risk of one or more complications [OR 1.3 (95 % CI 1.1, 1.4); p < 0.001] as well as major systemic complications [OR 1.6 (95 % CI 1.3, 2.0); p < 0.001] for primary joint arthroplasty procedures only. These findings persisted even after sensitivity testing. A mild to moderate risk for complications was noted following resident involvement in joint arthroplasty procedures. No significant risk of post-operative morbidity or mortality was appreciated for the other orthopaedic procedures studied. Level II (Prognostic).
Advanced Multivariate Inversion Techniques for High Resolution 3D Geophysical Modeling (Invited)
NASA Astrophysics Data System (ADS)
Maceira, M.; Zhang, H.; Rowe, C. A.
2009-12-01
We focus on the development and application of advanced multivariate inversion techniques to generate a realistic, comprehensive, and high-resolution 3D model of the seismic structure of the crust and upper mantle that satisfies several independent geophysical datasets. Building on previous efforts of joint inversion using surface wave dispersion measurements, gravity data, and receiver functions, we have added a fourth dataset, seismic body wave P and S travel times, to the simultaneous joint inversion method. We present a 3D seismic velocity model of the crust and upper mantle of northwest China resulting from the simultaneous, joint inversion of these four data types. Surface wave dispersion measurements are primarily sensitive to seismic shear-wave velocities, but at shallow depths it is difficult to obtain high-resolution velocities and to constrain the structure due to the depth-averaging of the more easily-modeled, longer-period surface waves. Gravity inversions have the greatest resolving power at shallow depths, and they provide constraints on rock density variations. Moreover, while surface wave dispersion measurements are primarily sensitive to vertical shear-wave velocity averages, body wave receiver functions are sensitive to shear-wave velocity contrasts and vertical travel-times. Addition of the fourth dataset, consisting of seismic travel-time data, helps to constrain the shear wave velocities both vertically and horizontally in the model cells crossed by the ray paths. Incorporation of both P and S body wave travel times allows us to invert for both P and S velocity structure, capitalizing on empirical relationships between both wave types' seismic velocities and rock densities, thus eliminating the need for ad hoc assumptions regarding the Poisson ratios. Our new tomography algorithm is a modification of the Maceira and Ammon joint inversion code, in combination with the Zhang and Thurber TomoDD (double-difference tomography) program.
A photogrammetric technique for generation of an accurate multispectral optical flow dataset
NASA Astrophysics Data System (ADS)
Kniaz, V. V.
2017-06-01
The presence of an accurate dataset is the key requirement for the successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets were developed in recent years and gave rise to many powerful algorithms. However, most of the datasets include only images captured in the visible spectrum. This paper is focused on the creation of a multispectral optical flow dataset with an accurate ground truth. The generation of an accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement has been developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling or laser scanning. Such techniques either work only with synthetic optical flow or provide only a sparse ground truth optical flow. In this paper a new photogrammetric method for the generation of an accurate ground truth optical flow is proposed. The method combines the accuracy and density of synthetic optical flow datasets with the flexibility of laser scanning based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.
Scalable and Interactive Segmentation and Visualization of Neural Processes in EM Datasets
Jeong, Won-Ki; Beyer, Johanna; Hadwiger, Markus; Vazquez, Amelio; Pfister, Hanspeter; Whitaker, Ross T.
2011-01-01
Recent advances in scanning technology provide high resolution EM (Electron Microscopy) datasets that allow neuroscientists to reconstruct complex neural connections in a nervous system. However, due to the enormous size and complexity of the resulting data, segmentation and visualization of neural processes in EM data is usually a difficult and very time-consuming task. In this paper, we present NeuroTrace, a novel EM volume segmentation and visualization system that consists of two parts: a semi-automatic multiphase level set segmentation with 3D tracking for reconstruction of neural processes, and a specialized volume rendering approach for visualization of EM volumes. It employs view-dependent on-demand filtering and evaluation of a local histogram edge metric, as well as on-the-fly interpolation and ray-casting of implicit surfaces for segmented neural structures. Both methods are implemented on the GPU for interactive performance. NeuroTrace is designed to be scalable to large datasets and data-parallel hardware architectures. A comparison of NeuroTrace with a commonly used manual EM segmentation tool shows that our interactive workflow is faster and easier to use for the reconstruction of complex neural processes. PMID:19834227
DNAism: exploring genomic datasets on the web with Horizon Charts.
Rio Deiros, David; Gibbs, Richard A; Rogers, Jeffrey
2016-01-27
Computational biologists daily face the need to explore massive amounts of genomic data. New visualization techniques can help researchers navigate and understand these big data. Horizon Charts are a relatively new visualization method that, under the right circumstances, maximizes data density without losing graphical perception. Horizon Charts have been successfully applied to understand multi-metric time series data. We have adapted an existing JavaScript library (Cubism) that implements Horizon Charts for the time series domain so that it works effectively with genomic datasets. We call this new library DNAism. Horizon Charts can be an effective visual tool to explore complex and large genomic datasets. Researchers can use our library to leverage these techniques to extract additional insights from their own datasets.
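The band-folding construction behind Horizon Charts is simple to reproduce outside JavaScript. Below is a minimal matplotlib sketch of the technique itself (not the DNAism API): the series is sliced into bands of equal height, negatives are folded upward in a second color, and the bands are overplotted with increasing intensity.

```python
import numpy as np
import matplotlib.pyplot as plt

def horizon_chart(ax, y, bands=3):
    """Minimal horizon chart: slice |y| into `bands` layers, fold negatives
    upward, and overplot layers with increasing color intensity."""
    blues = ["#c6dbef", "#6baed6", "#2171b5"]  # positive bands, light to dark
    reds = ["#fcbba1", "#fb6a4a", "#cb181d"]   # negative bands (folded up)
    x = np.arange(len(y))
    h = np.abs(y).max() / bands                # height of one band
    for i in range(bands):
        # portion of the signal falling in band i, clipped to [0, h]
        up = np.clip(np.clip(y, 0, None) - i * h, 0, h)
        dn = np.clip(np.clip(-y, 0, None) - i * h, 0, h)
        ax.fill_between(x, 0, up, color=blues[i], linewidth=0)
        ax.fill_between(x, 0, dn, color=reds[i], linewidth=0)
    ax.set_ylim(0, h)
    ax.set_yticks([])

# Hypothetical per-base signal along a genomic window
y = np.cumsum(np.random.default_rng(6).normal(0, 1, 500))
fig, ax = plt.subplots(figsize=(8, 1))
horizon_chart(ax, y)
plt.show()
```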
The MIND PALACE: A Multi-Spectral Imaging and Spectroscopy Database for Planetary Science
NASA Astrophysics Data System (ADS)
Eshelman, E.; Doloboff, I.; Hara, E. K.; Uckert, K.; Sapers, H. M.; Abbey, W.; Beegle, L. W.; Bhartia, R.
2017-12-01
The Multi-Instrument Database (MIND) is the web-based home to a well-characterized set of analytical data collected by a suite of deep-UV fluorescence/Raman instruments built at the Jet Propulsion Laboratory (JPL). Samples derive from a growing body of planetary surface analogs, mineral and microbial standards, meteorites, spacecraft materials, and other astrobiologically relevant materials. In addition to deep-UV spectroscopy, datasets stored in MIND are obtained from a variety of analytical techniques obtained over multiple spatial and spectral scales including electron microscopy, optical microscopy, infrared spectroscopy, X-ray fluorescence, and direct fluorescence imaging. Multivariate statistical analysis techniques, primarily Principal Component Analysis (PCA), are used to guide interpretation of these large multi-analytical spectral datasets. Spatial co-referencing of integrated spectral/visual maps is performed using QGIS (geographic information system software). Georeferencing techniques transform individual instrument data maps into a layered co-registered data cube for analysis across spectral and spatial scales. The body of data in MIND is intended to serve as a permanent, reliable, and expanding database of deep-UV spectroscopy datasets generated by this unique suite of JPL-based instruments on samples of broad planetary science interest.
Meneghetti, Natascia; Facco, Pierantonio; Bezzo, Fabrizio; Himawan, Chrismono; Zomer, Simeone; Barolo, Massimiliano
2016-05-30
In this proof-of-concept study, a methodology is proposed to systematically analyze large data historians of secondary pharmaceutical manufacturing systems using data mining techniques. The objective is to develop an approach that automatically retrieves operation-relevant information to assist management in the periodic review of a manufacturing system. The proposed methodology allows one to automatically perform three tasks: the identification of single batches within the entire data-sequence of the historical dataset, the identification of distinct operating phases within each batch, and the characterization of a batch with respect to an assigned multivariate set of operating characteristics. The approach is tested on a six-month dataset of a commercial-scale granulation/drying system, where several million data entries are recorded. The quality of the results and the generality of the approach indicate that there is a strong potential for extending the method to even larger historical datasets and to different operations, thus making it an advanced PAT tool that can assist the implementation of continual improvement paradigms within a quality-by-design framework. Copyright © 2016 Elsevier B.V. All rights reserved.
Reference datasets for 2-treatment, 2-sequence, 2-period bioequivalence studies.
Schütz, Helmut; Labes, Detlew; Fuglsang, Anders
2014-11-01
It is difficult to validate statistical software used to assess bioequivalence since very few datasets with known results are in the public domain, and the few that are published are of moderate size and balanced. The purpose of this paper is therefore to introduce reference datasets of varying complexity in terms of dataset size and characteristics (balance, range, outlier presence, residual error distribution) for 2-treatment, 2-period, 2-sequence bioequivalence studies and to report their point estimates and 90% confidence intervals which companies can use to validate their installations. The results for these datasets were calculated using the commercial packages EquivTest, Kinetica, SAS and WinNonlin, and the non-commercial package R. The results of three of these packages mostly agree, but imbalance between sequences seems to provoke questionable results with one package, which illustrates well the need for proper software validation.
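For validation purposes, the standard analysis of such datasets is an all-fixed-effects ANOVA on the log scale with a 90% confidence interval on the treatment effect. A minimal statsmodels sketch on hypothetical balanced data (the reference datasets themselves are provided in the paper; everything below is a stand-in):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical balanced 2x2x2 crossover: 12 subjects, sequences TR and RT,
# one log-transformed PK value (e.g. ln AUC) per subject and period.
df = pd.DataFrame({
    "subject":  list(range(1, 13)) * 2,
    "period":   [1] * 12 + [2] * 12,
    "sequence": (["TR"] * 6 + ["RT"] * 6) * 2,
})
df["treatment"] = np.where((df.sequence == "TR") == (df.period == 1), "T", "R")
rng = np.random.default_rng(7)
df["logauc"] = 5.0 + 0.05 * (df.treatment == "T") + rng.normal(0, 0.15, 24)

# Fixed-effects ANOVA on the log scale. With fixed subject effects the
# sequence term is absorbed by the subject dummies, so it is omitted here
# without changing the treatment estimate or its confidence interval.
fit = smf.ols("logauc ~ C(subject) + C(period) + C(treatment)", data=df).fit()
est = fit.params["C(treatment)[T.T]"]
lo, hi = fit.conf_int(alpha=0.10).loc["C(treatment)[T.T]"]
print(f"T/R point estimate: {np.exp(est):.4f}  "
      f"90% CI: ({np.exp(lo):.4f}, {np.exp(hi):.4f})")
```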
2010-01-01
Background The reconstruction of protein complexes from the physical interactome of organisms serves as a building block towards understanding the higher level organization of the cell. Over the past few years, several independent high-throughput experiments have helped to catalogue an enormous amount of physical protein interaction data from organisms such as yeast. However, these individual datasets show a lack of correlation with each other and also contain a substantial number of false positives (noise). Over the years, several affinity scoring schemes have also been devised to improve the quality of these datasets. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein interaction (PPI) networks derived by combining datasets from multiple sources and by making use of these affinity scoring schemes. In the attempt to tackle this challenge, the Markov Clustering algorithm (MCL) has proved to be a popular and reasonably successful method, mainly due to its scalability, robustness, and ability to work on scored (weighted) networks. However, MCL produces many noisy clusters, which either do not match known complexes or have additional proteins that reduce the accuracies of correctly predicted complexes. Results Inspired by recent experimental observations by Gavin and colleagues on the modularity structure in yeast complexes and the distinctive properties of "core" and "attachment" proteins, we develop a core-attachment based refinement method coupled to MCL for reconstruction of yeast complexes from scored (weighted) PPI networks. We combine physical interactions from two recent "pull-down" experiments to generate an unscored PPI network. We then score this network using available affinity scoring schemes to generate multiple scored PPI networks. The evaluation of our method (called MCL-CAw) on these networks shows that: (i) MCL-CAw derives a larger number of yeast complexes and with better accuracies than MCL, particularly in the presence of natural noise; (ii) affinity scoring can effectively reduce the impact of noise on MCL-CAw and thereby improve the quality (precision and recall) of its predicted complexes; (iii) MCL-CAw responds well to most available scoring schemes. We discuss several instances where MCL-CAw was successful in deriving meaningful complexes, and where it missed a few proteins or whole complexes due to affinity scoring of the networks. We compare MCL-CAw with several recent complex detection algorithms on unscored and scored networks, and assess the relative performance of the algorithms on these networks. Further, we study the impact of augmenting physical datasets with computationally inferred interactions for complex detection. Finally, we analyse the essentiality of proteins within predicted complexes to understand a possible correlation between protein essentiality and their ability to form complexes. Conclusions We demonstrate that core-attachment based refinement in MCL-CAw improves the predictions of MCL on yeast PPI networks. We show that affinity scoring improves the performance of MCL-CAw. PMID:20939868
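The basic MCL iteration underlying MCL-CAw alternates expansion (matrix squaring) with inflation (elementwise powering and renormalization) on a column-stochastic flow matrix. A from-scratch sketch on a toy scored PPI network (the core-attachment refinement itself is not shown):

```python
import numpy as np

def mcl(adj, inflation=2.0, iters=50):
    """Plain Markov Clustering on a weighted adjacency matrix: alternate
    expansion (M @ M) and inflation (elementwise power, column renormalize)
    until the flow matrix converges; nonzero rows define the clusters."""
    M = adj + np.eye(len(adj))     # self-loops stabilize the iteration
    M = M / M.sum(axis=0)          # column-normalize to a stochastic matrix
    for _ in range(iters):
        M = M @ M                  # expansion
        M = M ** inflation         # inflation
        M = M / M.sum(axis=0)
    clusters = {tuple(np.nonzero(row > 1e-6)[0]) for row in M
                if row.max() > 1e-6}
    return [list(c) for c in clusters]

# Toy scored PPI network: two 3-protein cliques joined by one weak edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.1            # weak inter-complex interaction
print(mcl(A))                      # two clusters: {0, 1, 2} and {3, 4, 5}
```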
Abrahamowicz, Michal; Esdaile, John M; Ramsey-Goldman, Rosalind; Simon, Lee S; Strand, Vibeke; Lipsky, Peter E
2018-04-12
Trials of new SLE treatments are hampered by the lack of effective outcome measures. To address this, we developed a new multivariable Lupus Outcome Score (LuMOS). The LuMOS formula was developed by analyzing raw data from two pivotal trials: BLISS-52 and BLISS-76, the basis for approval of belimumab (Bel). Using data from BLISS-76 as the learning dataset, we optimized discrimination between outcomes for patients treated with 10 mg/kg Bel versus placebo over the first 52 weeks of follow-up using multivariable logistic regression analyses. Performance of LuMOS was assessed using an independent validation dataset from the BLISS-52 trial. The LuMOS model incorporated a reduction in SELENA-SLEDAI ≥4 points, an increase in C4, a decrease in anti-dsDNA titer, and changes in BILAG organ system manifestations: no worsening in renal and improvements in mucocutaneous components. Decreases in prednisone doses and increases in C3 had very minor impacts on total LuMOS. In all analyses of the BLISS-76 and BLISS-52 RCTs, mean LuMOS was significantly higher (p < 0.0001) for the Bel 10 mg and Bel 1 mg treatment groups than for placebo. LuMOS also found significant differences between active treatment and placebo when SRI did not, as for Bel 1 mg in BLISS-76. Effect sizes were substantially higher with LuMOS than with the SLE Response Index (SRI-4). The evidence-based LuMOS, developed with data from BLISS-76 and validated with data from BLISS-52, exhibits superior capacity to discriminate responders from nonresponders compared to the SRI-4. Use of LuMOS may improve the efficiency and power of analyses in future lupus trials. This article is protected by copyright. All rights reserved.
Larson, Derek W.; Currie, Philip J.
2013-01-01
Isolated small theropod teeth are abundant in vertebrate microfossil assemblages, and are frequently used in studies of species diversity in ancient ecosystems. However, determining the taxonomic affinities of these teeth is problematic due to an absence of associated diagnostic skeletal material. Species such as Dromaeosaurus albertensis, Richardoestesia gilmorei, and Saurornitholestes langstoni are known from skeletal remains that have been recovered exclusively from the Dinosaur Park Formation (Campanian). It is therefore likely that teeth from different formations widely disparate in age or geographic position are not referable to these species. Tooth taxa without any associated skeletal material, such as Paronychodon lacustris and Richardoestesia isosceles, have also been identified from multiple localities of disparate ages throughout the Late Cretaceous. To address this problem, a dataset of measurements of 1183 small theropod teeth (the most specimen-rich theropod tooth dataset ever constructed) from North America ranging in age from Santonian through Maastrichtian were analyzed using multivariate statistical methods: canonical variate analysis, pairwise discriminant function analysis, and multivariate analysis of variance. The results indicate that teeth referred to the same taxon from different formations are often quantitatively distinct. In contrast, isolated teeth found in time equivalent formations are not quantitatively distinguishable from each other. These results support the hypothesis that small theropod taxa, like other dinosaurs in the Late Cretaceous, tend to be exclusive to discrete host formations. The methods outlined have great potential for future studies of isolated teeth worldwide, and may be the most useful non-destructive technique known for extracting the most data possible from isolated and fragmentary specimens. The ability to accurately assess species diversity and turnover through time based on isolated teeth will help illuminate patterns of evolution and extinction in these groups and potentially others in greater detail than has previously been thought possible without more complete skeletal material. PMID:23372708
Wilcox, Jared T; Satkunendrarajah, Kajana; Nasirzadeh, Yasmin; Laliberte, Alex M; Lip, Alyssa; Cadotte, David W; Foltz, Warren D; Fehlings, Michael G
2017-09-01
The majority of spinal cord injuries (SCI) occur at the cervical level, which results in significant impairment. Neurologic level and severity of injury are primary endpoints in clinical trials; however, how level-specific damage relates to behavioural performance in cervical injury is incompletely understood. We hypothesized that ascending level of injury leads to worsening forelimb performance, and correlates with loss of neural tissue and muscle-specific neuron pools. A direct comparison of multiple models was made with injury realized at the C5, C6, C7 and T7 vertebral levels using clip compression, with sham-operated controls. Animals were assessed for 10 weeks post-injury with numerous (40) outcome measures, including classic behavioural tests, CatWalk, non-invasive MRI, electrophysiology, histologic lesion morphometry, neuron counts, and motor compartment quantification, with multivariate statistics applied to the total dataset. Histologic staining and T1-weighted MR imaging revealed similar structural changes and distinct tissue loss with cystic cavitation across all injuries. Forelimb tests, including grip strength, F-WARP motor scale, Inclined Plane, and forelimb ladder walk, exhibited stratification between all groups and marked impairment with C5 and C6 injuries. Classic hindlimb tests, including BBB, hindlimb ladder walk, bladder recovery, and mortality, were not different between cervical and thoracic injuries. CatWalk multivariate gait analysis showed reciprocal and progressive changes in forelimb and hindlimb function with ascending level of injury. Electrophysiology revealed poor forelimb axonal conduction in the cervical C5 and C6 groups alone. The cervical enlargement (C5-T2) showed progressive ventral horn atrophy and loss of specific motor neuron populations with ascending injury. Multivariate statistics revealed a robust dataset and the rank-order contribution of outcomes, and allowed prediction of injury level with single-level discrimination using forelimb performance and neuron counts. Level-dependent models were generated using clip-compression SCI, with marked and reliable differences in forelimb performance and specific neuron pool loss. Copyright © 2017 Elsevier Inc. All rights reserved.
ERIC Educational Resources Information Center
Grundmann, Matthias
Following the assumptions of ecological socialization research, adequate analysis of socialization conditions must take into account the multilevel and multivariate structure of social factors that impact on human development. This statement implies that complex models of family configurations or of socialization factors are needed to explain the…
A Critical Review of Automated Photogrammetric Processing of Large Datasets
NASA Astrophysics Data System (ADS)
Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.
2017-08-01
The paper reports some comparisons between commercial software able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of images of complex environments, the metric quality of the results, replicability, and redundancy. Different datasets are employed, each one featuring a different number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of (photogrammetric) terms is also provided, in order to offer rigorous terms of reference for comparisons and critical analyses.
Multivariate longitudinal data analysis with mixed effects hidden Markov models.
Raffa, Jesse D; Dubin, Joel A
2015-09-01
Multiple longitudinal responses are often collected as a means to capture relevant features of the true outcome of interest, which is often hidden and not directly measurable. We outline an approach which models these multivariate longitudinal responses as generated from a hidden disease process. We propose a class of models which uses a hidden Markov model with separate but correlated random effects between multiple longitudinal responses. This approach was motivated by a smoking cessation clinical trial, where a bivariate longitudinal response involving both a continuous and a binomial response was collected for each participant to monitor smoking behavior. A Bayesian method using Markov chain Monte Carlo is used. Comparison of separate univariate response models to the bivariate response models was undertaken. Our methods are demonstrated on the smoking cessation clinical trial dataset, and properties of our approach are examined through extensive simulation studies. © 2015, The International Biometric Society.
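Off-the-shelf tools do not fit the paper's correlated-random-effects extension, but the underlying idea, a hidden disease process emitting multivariate longitudinal responses, can be sketched with a plain Gaussian HMM from the hmmlearn package (an assumed dependency); the random-effects structure and the binomial response component are omitted from this sketch.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Bivariate longitudinal responses stacked over participants; `lengths` gives the
# number of visits per participant so sequences are not mixed across subjects.
rng = np.random.default_rng(2)
n_subjects, n_visits = 50, 12
X = rng.normal(size=(n_subjects * n_visits, 2))   # e.g. two smoking-behavior measures
lengths = [n_visits] * n_subjects

# Two hidden states, e.g. "smoking" vs "abstinent" underlying process.
hmm = GaussianHMM(n_components=2, covariance_type="full", random_state=0)
hmm.fit(X, lengths)
states = hmm.predict(X, lengths)   # most likely hidden-state path per visit
print(hmm.means_)
```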
NASA Astrophysics Data System (ADS)
Yamasaki, Hideki; Morita, Shigeaki
2018-05-01
Multivariate curve resolution (MCR) was applied to a hetero-spectrally combined dataset consisting of mid-infrared (MIR) and near-infrared (NIR) spectra collected during the isothermal curing reaction of an epoxy resin. An epoxy monomer, bisphenol A diglycidyl ether (BADGE), and a hardening agent, 4,4'-diaminodiphenyl methane (DDM), were used for the reaction. The fundamental modes of the N-H and O-H stretches were highly overlapped in the MIR region, while their first overtones could be independently identified in the NIR region. The concentration profiles obtained by MCR using the hetero-spectral combination showed good agreement with the results of calculations based on the Beer-Lambert law and the mass balance. The band assignments and absorption sites estimated by the analysis also showed good agreement with the results using two-dimensional (2D) hetero-correlation spectroscopy.
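A hetero-spectral MCR decomposition of this kind can be sketched with the open-source pymcr package (an assumed dependency, not the authors' stated software): the MIR and NIR blocks are concatenated column-wise so that both regions share a single set of concentration profiles.

```python
import numpy as np
from pymcr.mcr import McrAR  # assumes the pymcr package is installed

# D: (times x channels) matrix; MIR and NIR columns are simply concatenated so
# both spectral regions are resolved against one set of concentration profiles.
rng = np.random.default_rng(3)
t, mir, nir = 60, 400, 200
C_true = np.abs(rng.normal(size=(t, 2)))          # two species: monomer and product
S_true = np.abs(rng.normal(size=(2, mir + nir)))  # hetero-spectral signatures
D = C_true @ S_true + 0.01 * rng.normal(size=(t, mir + nir))

# NNLS regressors enforce non-negativity of concentrations and spectra.
mcr = McrAR(max_iter=100, c_regr='NNLS', st_regr='NNLS')
mcr.fit(D, ST=S_true + 0.1 * rng.normal(size=S_true.shape))  # perturbed initial guess
print(mcr.C_opt_.shape)   # recovered concentration profiles (times x species)
```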
Multi-frequency complex network from time series for uncovering oil-water flow structure.
Gao, Zhong-Ke; Yang, Yu-Xuan; Fang, Peng-Cheng; Jin, Ning-De; Xia, Cheng-Yi; Hu, Li-Dan
2015-02-04
Uncovering complex oil-water flow structure represents a challenge in diverse scientific disciplines. This challenge stimulates us to develop a new distributed conductance sensor for measuring local flow signals at different positions and then propose a novel approach based on multi-frequency complex networks to uncover the flow structures from experimental multivariate measurements. In particular, based on the Fast Fourier transform, we demonstrate how to derive a multi-frequency complex network from multivariate time series. We construct complex networks at different frequencies and then detect community structures. Our results indicate that the community structures faithfully represent the structural features of oil-water flow patterns. Furthermore, we investigate the network statistics at different frequencies for each derived network and find that the frequency clustering coefficient makes it possible to uncover the evolution of flow patterns and yields deep insights into the formation of flow structures. The current results present a first step towards a network visualization of complex flow patterns from a community structure perspective.
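The construction can be illustrated end to end with numpy and networkx: band-limit each channel with an FFT mask, correlate channels within the band, and threshold the correlations into a network whose clustering coefficient can then be tracked across frequency bands. The band edges and threshold below are illustrative choices, not the paper's settings.

```python
import numpy as np
import networkx as nx

def band_network(signals, fs, band, rho=0.6):
    """Correlation network of multichannel signals restricted to one frequency band."""
    freqs = np.fft.rfftfreq(signals.shape[1], d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    spectra[:, (freqs < band[0]) | (freqs > band[1])] = 0.0   # keep only the band
    filtered = np.fft.irfft(spectra, n=signals.shape[1], axis=1)
    corr = np.corrcoef(filtered)
    g = nx.Graph()
    g.add_nodes_from(range(signals.shape[0]))
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[0]):
            if abs(corr[i, j]) >= rho:            # threshold is a tunable assumption
                g.add_edge(i, j)
    return g

x = np.random.default_rng(4).normal(size=(16, 2048))   # 16 conductance channels
g_low = band_network(x, fs=256.0, band=(0.5, 4.0))
print(nx.average_clustering(g_low))   # frequency clustering coefficient for this band
```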
NASA Astrophysics Data System (ADS)
Herman, Matthew R.; Nejadhashemi, A. Pouyan; Abouali, Mohammad; Hernandez-Suarez, Juan Sebastian; Daneshvar, Fariborz; Zhang, Zhen; Anderson, Martha C.; Sadeghi, Ali M.; Hain, Christopher R.; Sharifi, Amirreza
2018-01-01
As the global demand for freshwater resources continues to rise, it has become increasingly important to ensure the sustainability of these resources. This is accomplished through management strategies that often rely on monitoring and the use of hydrological models. However, monitoring at large scales is not feasible, and model applications therefore become challenging, especially when spatially distributed datasets, such as evapotranspiration, are needed to understand model performance. Due to these limitations, most hydrological models are calibrated only against data obtained from site/point observations, such as streamflow. Therefore, the main focus of this paper is to examine whether the incorporation of remotely sensed, spatially distributed datasets can improve the overall performance of the model. In this study, actual evapotranspiration (ETa) data were obtained from two different sets of satellite-based remote sensing data. One dataset estimates ETa based on the Simplified Surface Energy Balance (SSEBop) model, while the other estimates ETa based on the Atmosphere-Land Exchange Inverse (ALEXI) model. The hydrological model used in this study is the Soil and Water Assessment Tool (SWAT), which was calibrated against spatially distributed ETa and single-point streamflow records for the Honeyoey Creek-Pine Creek Watershed, located in Michigan, USA. Two different techniques, multi-variable and genetic algorithm, were used to calibrate the SWAT model. Using the aforementioned datasets, the performance of the hydrological model in estimating ETa was improved using both calibration techniques, achieving Nash-Sutcliffe efficiency (NSE) values >0.5 (0.73-0.85), percent bias (PBIAS) values within ±25% (±21.73%), and root mean squared error - observations standard deviation ratio (RSR) values <0.7 (0.39-0.52). However, the genetic algorithm technique, while more effective for the ETa calibration, significantly reduced the model performance for estimating streamflow (NSE: 0.32-0.52, PBIAS: ±32.73%, RSR: 0.63-0.82). Meanwhile, using the multi-variable technique, the model performance for estimating streamflow was maintained with a high level of accuracy (NSE: 0.59-0.61, PBIAS: ±13.70%, RSR: 0.63-0.64) while the evapotranspiration estimates were improved. Results from this assessment show that incorporating remotely sensed, spatially distributed data can improve hydrological model performance when coupled with the right calibration technique.
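The three skill scores quoted above have standard definitions (the PBIAS sign convention varies between papers; one common convention is used below), so the evaluation step can be sketched directly with hypothetical observed and simulated series:

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(obs, sim):
    """Percent bias; positive = underestimation under this sign convention."""
    return 100.0 * np.sum(obs - sim) / np.sum(obs)

def rsr(obs, sim):
    """RMSE divided by the standard deviation of the observations."""
    return np.sqrt(np.sum((obs - sim) ** 2)) / np.sqrt(np.sum((obs - obs.mean()) ** 2))

obs = np.array([3.1, 2.8, 4.0, 5.2, 4.4])   # e.g. observed ETa or streamflow
sim = np.array([2.9, 3.0, 3.8, 5.0, 4.9])   # corresponding model output
print(nse(obs, sim), pbias(obs, sim), rsr(obs, sim))
```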
Advanced functional network analysis in the geosciences: The pyunicorn package
NASA Astrophysics Data System (ADS)
Donges, Jonathan F.; Heitzig, Jobst; Runge, Jakob; Schultz, Hanna C. H.; Wiedermann, Marc; Zech, Alraune; Feldhoff, Jan; Rheinwalt, Aljoscha; Kutza, Hannes; Radebach, Alexander; Marwan, Norbert; Kurths, Jürgen
2013-04-01
Functional networks are a powerful tool for analyzing large geoscientific datasets such as global fields of climate time series originating from observations or model simulations. pyunicorn (pythonic unified complex network and recurrence analysis toolbox) is an open-source, fully object-oriented and easily parallelizable package written in the language Python. It allows for constructing functional networks (aka climate networks) representing the structure of statistical interrelationships in large datasets and, subsequently, investigating this structure using advanced methods of complex network theory such as measures for networks of interacting networks, node-weighted statistics or network surrogates. Additionally, pyunicorn makes it possible to study the complex dynamics of geoscientific systems, as recorded by time series, by means of recurrence networks and visibility graphs. The range of possible applications of the package is outlined, drawing on several examples from climatology.
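The following sketch is not pyunicorn's API; it only illustrates, with numpy and networkx, the functional-network construction the package automates: correlate grid-point time series and link the pairs whose correlation exceeds a chosen quantile.

```python
import numpy as np
import networkx as nx

# Toy "climate field": 100 grid points, 500 time steps (stand-in for observations).
rng = np.random.default_rng(5)
field = rng.normal(size=(100, 500))

corr = np.abs(np.corrcoef(field))       # statistical interrelationship matrix
np.fill_diagonal(corr, 0.0)
threshold = np.quantile(corr, 0.95)     # keep the strongest 5% of links

net = nx.from_numpy_array((corr >= threshold).astype(int))
print(nx.density(net), sorted(dict(net.degree()).values())[-5:])  # hubs = teleconnections
```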
NASA Astrophysics Data System (ADS)
Bhattacharjee, T.; Kumar, P.; Fillipe, L.
2018-02-01
Vibrational spectroscopy, especially FTIR and Raman, has shown enormous potential in disease diagnosis, especially in cancers. Their potential for detecting varied pathological conditions is regularly reported. However, to prove their applicability in clinics, large multi-center multi-national studies need to be undertaken, and these will result in enormous amounts of data. A parallel effort to develop analytical methods, including user-friendly software that can quickly pre-process data and subject them to the required multivariate analysis, is warranted in order to obtain results in real time. This study reports a MATLAB-based script that can automatically import data, preprocess spectra (interpolation, derivatives, normalization), and then carry out Principal Component Analysis (PCA) followed by Linear Discriminant Analysis (LDA) of the first 10 PCs, all with a single click. The software has been verified on data obtained from cell lines, animal models, and in vivo patient datasets, and gives results comparable to the Minitab 16 software. The software can import a variety of file extensions (.asc, .txt, .xls, and many others). Options to ignore noisy data, plot all possible graphs with PCA factors 1 to 5, and save loading factors, confusion matrices and other parameters are also present. The software can provide results for a dataset of 300 spectra within 0.01 s. We believe that the software will be vital not only in clinical trials using vibrational spectroscopic data, but also for obtaining rapid results when these tools are translated into clinics.
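The described workflow maps naturally onto a scikit-learn pipeline; a Python sketch follows (not the authors' MATLAB code), with standard scaling standing in for the script's normalization step and synthetic spectra as placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(6)
spectra = rng.normal(size=(300, 1200))    # 300 spectra x 1200 wavenumber points
labels = rng.integers(0, 2, 300)          # e.g. normal vs pathological

pca_lda = make_pipeline(
    StandardScaler(),                     # stand-in for the normalization step
    PCA(n_components=10),                 # first 10 principal components
    LinearDiscriminantAnalysis(),         # LDA on the PC scores
)
pred = cross_val_predict(pca_lda, spectra, labels, cv=5)
print(confusion_matrix(labels, pred))
```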
Yilmaz, Banu; Aras, Egemen; Nacar, Sinan; Kankal, Murat
2018-05-23
The functional life of a dam is often determined by the rate of sediment delivery to its reservoir. Therefore, an accurate estimate of the sediment load in rivers with dams is essential for designing and predicting a dam's useful lifespan. The most credible method is direct measurement of sediment input, but this can be very costly and cannot always be implemented at all gauging stations. In this study, we tested various regression models for estimating suspended sediment load (SSL) at two gauging stations on the Çoruh River in Turkey, including artificial bee colony (ABC), the teaching-learning-based optimization algorithm (TLBO), and multivariate adaptive regression splines (MARS). These models were also compared with one another and with classical regression analyses (CRA). Streamflow values and previously collected SSL data were used as model inputs, with predicted SSL data as output. Two different training and testing dataset configurations were used to reinforce model accuracy. For the MARS method, the root mean square error value ranged between 35% and 39% for the two gauging stations tested, lower than the errors of the other models. Error values were even lower (7% to 15%) using the other dataset configuration. Our results indicate that simultaneous measurements of streamflow with SSL provide the most effective parameter for obtaining accurate predictive models and that MARS is the most accurate model for predicting SSL. Copyright © 2017 Elsevier B.V. All rights reserved.
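A MARS fit of this kind can be sketched with the sklearn-contrib py-earth package (an assumed dependency; the authors' software is not named); the streamflow and lagged-SSL inputs below are synthetic placeholders.

```python
import numpy as np
from pyearth import Earth   # assumes the sklearn-contrib py-earth package is installed

rng = np.random.default_rng(7)
n = 500
flow = rng.gamma(2.0, 50.0, n)                  # streamflow at the gauging station
ssl_prev = rng.gamma(2.0, 30.0, n)              # previously observed sediment load
ssl = 0.4 * flow ** 1.2 + 0.3 * ssl_prev + rng.normal(0, 10, n)  # synthetic target

X = np.column_stack([flow, ssl_prev])
mars = Earth(max_degree=2)                      # allow pairwise hinge interactions
mars.fit(X, ssl)
print(mars.summary())                           # selected hinge basis functions
```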
Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data
Hallac, David; Vare, Sagar; Boyd, Stephen; Leskovec, Jure
2018-01-01
Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (e.g., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios. PMID:29770257
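As a simplified stand-in for the authors' implementation, the segment-then-cluster idea can be approximated by stacking short sliding windows and fitting a Gaussian mixture over them; note that TICC additionally imposes Toeplitz-structured sparse inverse covariances and a temporal switching penalty that this sketch omits.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def window_stack(ts, w):
    """Stack length-w subsequences of a (T x d) multivariate series into rows."""
    T, d = ts.shape
    return np.stack([ts[i:i + w].ravel() for i in range(T - w + 1)])

# Synthetic 3-channel series with a regime change halfway through.
rng = np.random.default_rng(8)
ts = np.concatenate([rng.normal(0, 1, (200, 3)), rng.normal(3, 1, (200, 3))])

X = window_stack(ts, w=5)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)          # one cluster label per subsequence
print(np.bincount(labels[:100]), np.bincount(labels[-100:]))  # regimes separate
```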
NASA Astrophysics Data System (ADS)
Zu, Theresah N. K.; Liu, Sanchao; Germane, Katherine L.; Servinsky, Matthew D.; Gerlach, Elliot S.; Mackie, David M.; Sund, Christian J.
2016-05-01
The coupling of optical fibers with Raman instrumentation has proven to be effective for real-time monitoring of chemical reactions and fermentations when combined with multivariate statistical data analysis. Raman spectroscopy is relatively fast, with little interference from the water peak present in fermentation media. Medical research has explored this technique for analysis of mammalian cultures for potential diagnosis of some cancers. Other organisms studied via this route include Escherichia coli, Saccharomyces cerevisiae, and some Bacillus sp., though very little work has been performed on Clostridium acetobutylicum cultures. C. acetobutylicum is a gram-positive anaerobic bacterium, which is highly sought after due to its ability to use a broad spectrum of substrates and produce useful byproducts through the well-known Acetone-Butanol-Ethanol (ABE) fermentation. In this work, real-time Raman data was acquired from C. acetobutylicum cultures grown on glucose. Samples were collected concurrently for comparative off-line product analysis. Partial-least squares (PLS) models were built both for agitated cultures and for static cultures from both datasets. Media components and metabolites monitored include glucose, butyric acid, acetic acid, and butanol. Models were cross-validated with independent datasets. Experiments with agitation were more favorable for modeling with goodness of fit (QY) values of 0.99 and goodness of prediction (Q2Y) values of 0.98. Static experiments did not model as well as agitated experiments. Raman results showed the static experiments were chaotic, especially during and shortly after manual sampling.
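A PLS calibration of this kind is a few lines with scikit-learn; the spectra and analyte concentrations below are synthetic placeholders for the Raman data and the off-line product measurements.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(9)
spectra = rng.normal(size=(120, 800))          # Raman spectra from the fermentation
conc = rng.uniform(0, 50, size=(120, 4))       # glucose, butyrate, acetate, butanol

X_tr, X_te, y_tr, y_te = train_test_split(spectra, conc, random_state=0)
pls = PLSRegression(n_components=8).fit(X_tr, y_tr)
# Per-analyte prediction quality on the held-out set (analogous to a Q2Y check).
print(r2_score(y_te, pls.predict(X_te), multioutput="raw_values"))
```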
Spelten, Evelien R; Martin, Linda; Gitsels, Janneke T; Pereboom, Monique T R; Hutton, Eileen K; van Dulmen, Sandra
2015-01-01
Video recording studies have been found to be complex; however, very few studies describe the actual introduction and enrolment of the study, the resulting dataset, and its interpretation. In this paper we describe the introduction and use of video recordings of health care provider (HCP)-client interactions in primary care midwifery for research purposes. We also report on the process of data management, data coding, and the resulting dataset. We describe our experience in undertaking a study using video recording to assess the interaction of the midwife and her client in the first antenatal consultation, in a real-life clinical practice setting in the Netherlands. Midwives from six practices across the Netherlands were recruited to videotape 15-20 intakes. The introduction, complexity, and intrusiveness of the study were discussed within the research group. The numbers of valid and missing recordings were measured; reasons not to participate, non-response analyses, and the inter-rater reliability of the coded videotapes were assessed. Video recordings were supplemented by questionnaires for midwives and clients. The Roter Interaction Analysis System (RIAS) was used for coding, as well as an obstetric topics scale. At the introduction of the study, more initial hesitation in co-operation was found among the midwives than among their clients. The intrusive nature of the recording on the interaction was perceived to be minimal. The complex nature of the study affected recruitment and data collection. Combining the dataset with the questionnaires and medical records proved to be a challenge. The final dataset included videotapes of 20 midwives (7-23 recordings per midwife). Of the 460 eligible clients, 324 gave informed consent. The study resulted in a significant dataset of first antenatal consultations, with recordings of 269 clients and 194 partners. Video recording of midwife-client interaction was both feasible and challenging, and resulted in a unique dataset of recordings of midwife-client interaction. Video recording studies will benefit from a tight design and vigilant monitoring during data collection to ensure effective data collection. We provide suggestions to promote the successful introduction of video recording for research purposes. Copyright © 2014 Elsevier Ltd. All rights reserved.
A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie.
Hanke, Michael; Baumgartner, Florian J; Ibe, Pierre; Kaule, Falko R; Pollmann, Stefan; Speck, Oliver; Zinke, Wolf; Stadler, Jörg
2014-01-01
Here we present a high-resolution functional magnetic resonance imaging (fMRI) dataset - 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film ("Forrest Gump"). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, and social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures - from stimulus creation to data analysis. In order to facilitate replicative and derived works, only free and open-source software was utilized.
A practical tool for maximal information coefficient analysis.
Albanese, Davide; Riccadonna, Samantha; Donati, Claudio; Franceschi, Pietro
2018-04-01
The ability to find complex associations in large omics datasets, to assess their significance, and to prioritize them according to their strength can be of great help in the data exploration phase. Mutual information-based measures of association are particularly promising, especially after the recent introduction of the TICe and MICe estimators, which combine computational efficiency with superior bias/variance properties. An open-source software implementation of these two measures, providing a complete procedure to test their significance, would be extremely useful. Here, we present MICtools, a comprehensive and effective pipeline that combines TICe and MICe into a multistep procedure allowing the identification of relationships of various degrees of complexity. MICtools calculates their strength, assessing statistical significance using a permutation-based strategy. The performance of the proposed approach is assessed by an extensive investigation on synthetic datasets, and an example of a potential application to a metagenomic dataset is also illustrated. We show that MICtools, combining TICe and MICe, is able to highlight associations that would not be captured by conventional strategies.
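MICtools builds on the minepy estimators, so the core computation can be sketched directly (assuming minepy >= 1.2, which exposes the MICe/TICe estimators; the data are synthetic).

```python
import numpy as np
from minepy import MINE   # assumes minepy >= 1.2 for the "mic_e" estimator

rng = np.random.default_rng(10)
x = rng.uniform(-1, 1, 1000)
y = x ** 2 + 0.1 * rng.normal(size=1000)   # a nonlinear association

mine = MINE(alpha=0.6, c=15, est="mic_e")
mine.compute_score(x, y)
print("MICe:", mine.mic(), "TICe:", mine.tic(norm=True))
```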
Westman, Eric; Aguilar, Carlos; Muehlboeck, J-Sebastian; Simmons, Andrew
2013-01-01
Automated structural magnetic resonance imaging (MRI) processing pipelines are gaining popularity for Alzheimer's disease (AD) research. They generate regional volumes, cortical thickness measures and other measures, which can be used as input for multivariate analysis. It is not clear which combination of measures and normalization approach is most useful for AD classification and for predicting mild cognitive impairment (MCI) conversion. The current study includes MRI scans from 699 subjects [AD, MCI and controls (CTL)] from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The Freesurfer pipeline was used to generate regional volume, cortical thickness, gray matter volume, surface area, mean curvature, gaussian curvature, folding index and curvature index measures. 259 variables were used for orthogonal partial least squares to latent structures (OPLS) multivariate analysis. Normalisation approaches were explored and the optimal combination of measures determined. Results indicate that cortical thickness measures should not be normalized, while volumes should probably be normalized by intracranial volume (ICV). Combining regional cortical thickness measures (not normalized) with cortical and subcortical volumes (normalized by ICV) using OPLS gave a prediction accuracy of 91.5% when distinguishing AD versus CTL. This model prospectively predicted future decline from MCI to AD, with 75.9% of converters correctly classified. Normalization strategy did not have a significant effect on the accuracies of multivariate models containing multiple MRI measures for this large dataset. The appropriate choice of input for multivariate analysis in AD and MCI is of great importance. The results support the use of un-normalised cortical thickness measures and volumes normalised by ICV.
An Improved Method to Control the Critical Parameters of a Multivariable Control System
NASA Astrophysics Data System (ADS)
Subha Hency Jims, P.; Dharmalingam, S.; Wessley, G. Jims John
2017-10-01
The role of control systems is to cope with process deficiencies and the undesirable effects of external disturbances. Most multivariable processes are highly interactive and complex in nature. Aircraft systems, modern power plants, refineries, and robotic systems are a few such complex systems that involve numerous critical parameters that need to be monitored and controlled. Control of these important parameters is not only tedious and cumbersome but also crucial from environmental, safety, and quality perspectives. In this paper, one such multivariable system, namely a utility boiler, has been considered. A modern power plant is a complex arrangement of pipework and machinery with numerous interacting control loops and support systems. In this paper, the calculation of controller parameters based on classical tuning concepts is presented. The controller parameters thus obtained were employed to control the critical parameters of a boiler during fuel-switching disturbances. The proposed method can be applied to control critical parameters such as the elevator, aileron, rudder, elevator trim, rudder and aileron trim, and flap control systems of aircraft.
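Classical tuning of the kind referenced reduces to simple arithmetic once a loop's ultimate gain and period have been measured; a sketch using the Ziegler-Nichols ultimate-cycle rules follows, with hypothetical relay-test values for a boiler loop (the paper's actual tuning rule is not specified here).

```python
def ziegler_nichols_pid(ku, tu):
    """Classic Ziegler-Nichols ultimate-cycle PID tuning rules."""
    kp = 0.6 * ku
    ti = 0.5 * tu            # integral time
    td = 0.125 * tu          # derivative time
    return kp, kp / ti, kp * td   # (Kp, Ki, Kd) in parallel form

# Hypothetical relay-test results for a drum-level loop of a utility boiler.
ku, tu = 4.8, 22.0           # ultimate gain and ultimate period [s]
kp, ki, kd = ziegler_nichols_pid(ku, tu)
print(f"Kp={kp:.2f}, Ki={ki:.3f}, Kd={kd:.2f}")
```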
Detection of timescales in evolving complex systems
Darst, Richard K.; Granell, Clara; Arenas, Alex; Gómez, Sergio; Saramäki, Jari; Fortunato, Santo
2016-01-01
Most complex systems are intrinsically dynamic in nature. The evolution of a dynamic complex system is typically represented as a sequence of snapshots, where each snapshot describes the configuration of the system at a particular instant of time. This is often done by using constant intervals, but a better approach would be to define dynamic intervals that match the evolution of the system’s configuration. To this end, we propose a method that aims at detecting evolutionary changes in the configuration of a complex system, and generates intervals accordingly. We show that evolutionary timescales can be identified by looking for peaks in the similarity between the sets of events on consecutive time intervals of data. Tests on simple toy models reveal that the technique is able to detect evolutionary timescales of time-varying data both when the evolution is smooth and when it changes sharply. This is further corroborated by analyses of several real datasets. Our method is scalable to extremely large datasets and is computationally efficient. This allows a quick, parameter-free detection of multiple timescales in the evolution of a complex system. PMID:28004820
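The similarity-peak test at the heart of the method can be sketched in a few lines: compute the Jaccard similarity between event sets in consecutive windows and scan candidate window widths for peaks (a simplified reading of the paper's procedure, on synthetic events).

```python
import numpy as np

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mean_consecutive_similarity(events, times, width):
    """Average Jaccard similarity between event sets in consecutive windows."""
    edges = np.arange(times.min(), times.max() + width, width)
    windows = [events[(times >= lo) & (times < hi)]
               for lo, hi in zip(edges, edges[1:])]
    sims = [jaccard(w1, w2) for w1, w2 in zip(windows, windows[1:])]
    return np.mean(sims) if sims else 0.0

rng = np.random.default_rng(11)
times = rng.uniform(0, 1000, 5000)
events = rng.integers(0, 40, 5000)          # event identities (e.g. node pairs)
for w in (10, 50, 100, 250):                # candidate timescales to scan
    print(w, round(mean_consecutive_similarity(events, times, w), 3))
```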
Genomics dataset of unidentified disclosed isolates.
Rekadwad, Bhagwan N
2016-09-01
Analysis of DNA sequences is necessary for higher hierarchical classification of organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset was compiled to find complexities in the unidentified DNA in the disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated, and analysis of the AT/GC content of the DNA sequences was carried out. QR codes are helpful for quick identification of isolates, and AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset of cleavage codes and enzyme codes from the restriction digestion study is reported, which is helpful for performing studies using short DNA sequences. The dataset disclosed here provides new data for the exploration, evaluation, identification, comparison and analysis of unique DNA sequences.
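Both reported computations are small; a sketch follows using plain string handling for the AT/GC content and the `qrcode` package for the quick-response codes (an assumed dependency; the authors' tooling is not specified, and the sequence is a placeholder).

```python
import qrcode   # assumes the `qrcode` package (with Pillow) is installed

seq = "ATGGCGTACGCTAGCTAGGCTTAAGCGCGATCG"   # placeholder unidentified sequence

gc = 100.0 * sum(seq.count(b) for b in "GC") / len(seq)
print(f"GC content: {gc:.1f}%  AT content: {100.0 - gc:.1f}%")

# Quick-response code for rapid identification of the isolate's sequence record.
img = qrcode.make(seq)
img.save("isolate_qr.png")
```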
Hernández-Ceballos, M A; Skjøth, C A; García-Mozo, H; Bolívar, J P; Galán, C
2014-12-01
Airborne pollen transport at micro-, meso-gamma and meso-beta scales must be studied by atmospheric models, having special relevance in complex terrain. In these cases, the accuracy of these models is mainly determined by the spatial resolution of the underlying meteorological dataset. This work examines how meteorological datasets determine the results obtained from atmospheric transport models used to describe pollen transport in the atmosphere. We investigate the effect of the spatial resolution when computing backward trajectories with the HYSPLIT model. We have used meteorological datasets from the WRF model with 27, 9 and 3 km resolutions and from the GDAS files with 1° resolution. This work allows characterizing the atmospheric transport of Olea pollen in a region with complex flows. The results show that the complex terrain affects the trajectories and that this effect varies with the different meteorological datasets. Overall, the change from GDAS to WRF-ARW inputs improves the analyses with the HYSPLIT model, thereby increasing the understanding of the pollen episode. The results indicate that a spatial resolution of at least 9 km is needed to simulate atmospheric flows that are considerably affected by the relief of the landscape. The results suggest that the appropriate meteorological files should be considered when atmospheric models are used to characterize the atmospheric transport of pollen on micro-, meso-gamma and meso-beta scales. Furthermore, at these scales, the results are believed to be generally applicable to related areas such as the description of atmospheric transport of radionuclides or the definition of nuclear-radioactivity emergency preparedness.
Fast Multivariate Search on Large Aviation Datasets
NASA Technical Reports Server (NTRS)
Bhaduri, Kanishka; Zhu, Qiang; Oza, Nikunj C.; Srivastava, Ashok N.
2010-01-01
Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases, which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem: (1) an R-tree Based Search (RBS), which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm, which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several million observations. Both these tests show that our algorithms have very high prune rates (>95%), thus needing actual
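The pruning logic that gives an MBR-based search its speed can be sketched without an R-tree library: precompute a minimum bounding rectangle per segment of subsequences and skip any segment whose MBR lower-bound distance to the query already exceeds the best match found so far (a simplified, flat-index reading of the approach, on synthetic data).

```python
import numpy as np

def mbr_lower_bound(q, lo, hi):
    """Distance from query q to an MBR [lo, hi]; zero where q lies inside."""
    gap = np.maximum(lo - q, 0) + np.maximum(q - hi, 0)
    return np.linalg.norm(gap)

rng = np.random.default_rng(12)
db = rng.normal(size=(100_000, 8))            # flattened MTS subsequences
seg = db.reshape(1000, 100, 8)                # 1000 segments of 100 subsequences
lo, hi = seg.min(axis=1), seg.max(axis=1)     # per-segment MBRs
q = rng.normal(size=8)                        # query subsequence

best, pruned = np.inf, 0
for s in range(1000):
    if mbr_lower_bound(q, lo[s], hi[s]) >= best:
        pruned += 1                            # whole segment skipped unscanned
        continue
    best = min(best, np.linalg.norm(seg[s] - q, axis=1).min())
print(round(best, 3), f"pruned {pruned / 10:.1f}% of segments")
```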
ERIC Educational Resources Information Center
Al-Aziz, Jameel; Christou, Nicolas; Dinov, Ivo D.
2010-01-01
The amount, complexity and provenance of data have dramatically increased in the past five years. Visualization of observed and simulated data is a critical component of any social, environmental, biomedical or scientific quest. Dynamic, exploratory and interactive visualization of multivariate data, without preprocessing by dimensionality…
A Simplified, General Approach to Simulating from Multivariate Copula Functions
Barry Goodwin
2012-01-01
Copulas have become an important analytic tool for characterizing multivariate distributions and dependence. One is often interested in simulating data from copula estimates. The process can be analytically and computationally complex and usually involves steps that are unique to a given parametric copula. We describe an alternative approach that uses probability…
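One widely used special case that avoids copula-specific steps is the Gaussian copula: simulate correlated normals, push them through the normal CDF to obtain uniform margins, then apply arbitrary marginal quantile functions. The sketch below shows this standard machinery; since the abstract is truncated, it is not necessarily the paper's proposed method.

```python
import numpy as np
from scipy import stats

# Target dependence (Gaussian copula correlation) and arbitrary marginals.
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])
rng = np.random.default_rng(13)

z = rng.multivariate_normal(mean=[0, 0], cov=R, size=10_000)
u = stats.norm.cdf(z)                          # uniform margins, Gaussian dependence
x1 = stats.gamma.ppf(u[:, 0], a=2.0, scale=3)  # e.g. a yield variable
x2 = stats.lognorm.ppf(u[:, 1], s=0.5)         # e.g. a price variable
print(np.corrcoef(x1, x2)[0, 1])               # induced Pearson correlation
```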
Akiyoshi, Takashi; Maeda, Hiromichi; Kashiwabara, Kosuke; Kanda, Mitsuro; Mayanagi, Shuhei; Aoyama, Toru; Hamada, Chikuma; Sadahiro, Sotaro; Fukunaga, Yosuke; Ueno, Masashi; Sakamoto, Junichi; Saji, Shigetoyo; Yoshikawa, Takaki
2017-01-01
Background: Few prediction models have so far been developed and assessed for the prognosis of patients who undergo curative resection for colorectal cancer (CRC). Materials and Methods: We prepared a clinical dataset including 5,530 patients who participated in three major randomized controlled trials as a training dataset, and 2,263 consecutive patients who were treated at a cancer-specialized hospital as a validation dataset. All subjects underwent radical resection for CRC histologically diagnosed as adenocarcinoma. The main outcomes predicted were overall survival (OS) and disease-free survival (DFS). The identification of the variables in this nomogram was based on a Cox regression analysis, and model performance was evaluated by Harrell's c-index. The calibration plot and its slope were also studied. For the external validation assessment, risk group stratification was employed. Results: The multivariate Cox model identified the following variables: sex, age, pathological T and N factors, tumor location, size, lymph node dissection, postoperative complications and adjuvant chemotherapy. The c-index was 0.72 (95% confidence interval [CI] 0.66-0.77) for OS and 0.74 (95% CI 0.69-0.78) for DFS. The proposed risk group stratification demonstrated a significant distinction between the Kaplan–Meier curves for OS and DFS in the external validation dataset. Conclusions: We established a clinically reliable nomogram to predict OS and DFS in patients with CRC using large-scale, reliable independent patient data from phase III randomized controlled trials. The external validity was also confirmed on the practical dataset. PMID:29228760
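A Cox model with Harrell's c-index of this kind can be sketched with the lifelines package (an assumed dependency); the columns below are hypothetical stand-ins for the study's variables.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter  # assumes the lifelines package is installed

rng = np.random.default_rng(14)
n = 500
df = pd.DataFrame({
    "age": rng.integers(30, 85, n),
    "pT": rng.integers(1, 5, n),               # pathological T factor
    "pN": rng.integers(0, 3, n),               # pathological N factor
    "adjuvant_chemo": rng.integers(0, 2, n),
    "os_months": rng.exponential(60, n),       # overall survival time
    "event": rng.integers(0, 2, n),            # 1 = death observed
})

cph = CoxPHFitter().fit(df, duration_col="os_months", event_col="event")
print(f"Harrell's c-index: {cph.concordance_index_:.2f}")
```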
Takada, M; Sugimoto, M; Ohno, S; Kuroi, K; Sato, N; Bando, H; Masuda, N; Iwata, H; Kondo, M; Sasano, H; Chow, L W C; Inamoto, T; Naito, Y; Tomita, M; Toi, M
2012-07-01
A nomogram, a standard technique that utilizes multiple characteristics to predict the efficacy of treatment and the likelihood of a specific status of an individual patient, has been used for prediction of response to neoadjuvant chemotherapy (NAC) in breast cancer patients. The aim of this study was to develop a novel computational technique to predict the pathological complete response (pCR) to NAC in primary breast cancer patients. A mathematical model using alternating decision trees, a variant of decision trees, was developed using 28 clinicopathological variables retrospectively collected from patients treated with NAC (n = 150), and validated using an independent dataset from a randomized controlled trial (n = 173). The model selected 15 variables to predict the pCR, yielding area under the receiver operating characteristic curve (AUC) values of 0.766 [95% confidence interval (CI) 0.671-0.861, P < 0.0001] in cross-validation using the training dataset and 0.787 (95% CI 0.716-0.858, P < 0.0001) in the validation dataset. Among the three subtypes of breast cancer, the luminal subgroup showed the best discrimination (AUC = 0.779, 95% CI 0.641-0.917, P = 0.0059). The developed model (AUC = 0.805, 95% CI 0.716-0.894, P < 0.0001) outperformed multivariate logistic regression (AUC = 0.754, 95% CI 0.651-0.858, P = 0.00019) on validation datasets without missing values (n = 127). Several analyses, e.g. bootstrap analysis, revealed that the developed model was insensitive to missing values and also tolerant of distribution bias among the datasets. Our model based on clinicopathological variables showed high predictive ability for pCR. This model might improve the prediction of the response to NAC in primary breast cancer patients.
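Alternating decision trees are not available in scikit-learn; a boosted-stumps AdaBoost classifier, a close relative, gives the flavor of the approach, with AUC evaluated on a held-out split of synthetic placeholder data (the `estimator=` keyword assumes scikit-learn >= 1.2).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(15)
X = rng.normal(size=(323, 15))        # 15 selected clinicopathological variables
y = rng.integers(0, 2, 323)           # 1 = pathological complete response (pCR)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
# Depth-1 trees (stumps) boosted additively; older sklearn uses base_estimator=.
adt = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("validation AUC:", roc_auc_score(y_te, adt.predict_proba(X_te)[:, 1]).round(3))
```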
Noble, Fergus; Curtis, Nathan; Harris, Scott; Kelly, Jamie J; Bailey, Ian S; Byrne, James P; Underwood, Timothy J
2012-06-01
Oesophagectomy is associated with significant morbidity and mortality. A simple score to define a patient's risk of developing major complications would be beneficial. Patients who underwent upper gastrointestinal resections with an oesophageal anastomosis between 2005 and 2010 were reviewed and formed the development dataset, with resections performed in 2011 forming a prospective validation dataset. The associations of post-operative C-reactive protein (CRP), white cell count (WCC) and albumin levels with anastomotic leak (AL) or major complication, including death, as graded by the Clavien-Dindo (CD) classification were analysed by receiver operating characteristic curves. After multivariate analysis of the development dataset, these factors were combined to create a novel score, which was subsequently tested on the validation dataset. Two hundred fifty-eight patients were assessed to develop the score. Sixty-three patients (25%) developed a major complication, and there were seven (2.7%) in-patient deaths. Twenty-six (10%) patients were diagnosed with AL at median post-operative day 7 (range: 5-15). CRP (p = 0.002), WCC (p < 0.0001) and albumin (p = 0.001) were predictors of AL. Combining these markers improved prediction of AL (NUn score > 10: sensitivity 95%, specificity 49%, diagnostic accuracy 0.801 (95% confidence interval: 0.692-0.909, p < 0.0001)). The validation dataset confirmed these findings (NUn score > 10: sensitivity 100%, specificity 57%, diagnostic accuracy 0.879 (95% CI 0.763-0.994, p = 0.014)) and a major complication or death (NUn > 10: sensitivity 89%, specificity 63%, diagnostic accuracy 0.856 (95% CI 0.709-1, p = 0.001)). Blood-borne markers of the systemic inflammatory response are predictors of AL and major complications after oesophageal resection. When combined, they may categorise a patient's risk of developing a serious complication with higher sensitivity and specificity.
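With the published coefficients not given in the abstract, the idea of combining the three blood markers into a score and reading off its sensitivity/specificity trade-off can be sketched as follows; the weights and data below are placeholders and are not the NUn formula.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(16)
n = 258
crp = rng.gamma(3, 40, n)            # mg/L, post-operative (hypothetical)
wcc = rng.normal(11, 3, n)           # x10^9/L
alb = rng.normal(32, 5, n)           # g/L
leak = rng.integers(0, 2, n)         # 1 = anastomotic leak

score = 0.005 * crp + 0.2 * wcc - 0.15 * alb   # placeholder weights, NOT the NUn score
fpr, tpr, thr = roc_curve(leak, score)
print("AUC:", roc_auc_score(leak, score).round(3))
print("threshold at ~95% sensitivity:", thr[np.searchsorted(tpr, 0.95)].round(2))
```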
Tomasini, Nicolás; Lauthier, Juan José; Ayala, Francisco José; Tibayrenc, Michel; Diosque, Patricio
2014-01-01
The model of predominant clonal evolution (PCE) proposed for micropathogens does not state that genetic exchange is totally absent, but rather, that it is too rare to break the prevalent PCE pattern. However, the actual impact of this “residual” genetic exchange should be evaluated. Multilocus Sequence Typing (MLST) is an excellent tool to explore the problem. Here, we compared online available MLST datasets for seven eukaryotic microbial pathogens: Trypanosoma cruzi, the Fusarium solani complex, Aspergillus fumigatus, Blastocystis subtype 3, the Leishmania donovani complex, Candida albicans and Candida glabrata. We first analyzed phylogenetic relationships among genotypes within each dataset. Then, we examined different measures of branch support and incongruence among loci as signs of genetic structure and levels of past recombination. The analyses allow us to identify three types of genetic structure. The first was characterized by trees with well-supported branches and low levels of incongruence suggesting well-structured populations and PCE. This was the case for the T. cruzi and F. solani datasets. The second genetic structure, represented by Blastocystis spp., A. fumigatus and the L. donovani complex datasets, showed trees with weakly-supported branches but low levels of incongruence among loci, whereby genetic structuration was not clearly defined by MLST. Finally, trees showing weakly-supported branches and high levels of incongruence among loci were observed for Candida species, suggesting that genetic exchange has a higher evolutionary impact in these mainly clonal yeast species. Furthermore, simulations showed that MLST may fail to show right clustering in population datasets even in the absence of genetic exchange. In conclusion, these results make it possible to infer variable impacts of genetic exchange in populations of predominantly clonal micro-pathogens. Moreover, our results reveal different problems of MLST to determine the genetic structure in these organisms that should be considered. PMID:25054834
Ajetunmobi, Omotomilola; Whyte, Bruce; Chalmers, James; Fleming, Michael; Stockton, Diane; Wood, Rachel
2014-01-01
Providing infants with the 'best possible start in life' is a priority for the Scottish Government. This is reflected in policy and health promotion strategies to increase breast feeding, which gives the best source of nutrients for healthy infant growth and development. However, the rate of breast feeding in Scotland remains one of the lowest in Europe. Information is needed to provide a better understanding of infant feeding and its impact on child health. This paper describes the development of a unique population-wide resource created to explore infant feeding and child health in Scotland. Descriptive and multivariate analyses of linked routine/administrative maternal and infant health records for 731,595 infants born in Scotland between 1997 and 2009 were performed. A linked dataset was created containing a wide range of background, parental, maternal, birth and health service characteristics for a representative sample of infants born in Scotland over the study period. There was high coverage and completeness of infant feeding and other demographic, maternal and infant records. The results confirmed the importance of an enabling environment (cultural, family, health service and other maternal and infant health-related factors) in increasing the likelihood of breast feeding. Using the linked dataset, it was possible to investigate the determinants of breast feeding for a representative sample of Scottish infants born between 1997 and 2009. The linked dataset is an important resource with potential uses in research, policy design and the targeting of intervention programmes.
Power-Hop: A Pervasive Observation for Real Complex Networks
2016-03-14
The code used in this study can be found at https://github.com/kpelechrinis/powerhop. The web graph used was obtained from Yahoo! under a signed NDA and hence cannot be made available. However, Yahoo! makes datasets available to eligible organizations/entities through application to Webscope; in particular, the dataset used in our study can be requested through the following URL: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g (G2 - Yahoo! AltaVista Web Page…
DCS-SVM: a novel semi-automated method for human brain MR image segmentation.
Ahmadvand, Ali; Daliri, Mohammad Reza; Hajiali, Mohammadtaghi
2017-11-27
In this paper, a novel method is proposed which appropriately segments magnetic resonance (MR) brain images into three main tissues. This paper proposes an extension of our previous work, in which we suggested a combination of multiple classifiers (CMC)-based method named dynamic classifier selection-dynamic local training local Tanimoto index (DCS-DLTLTI) for MR brain image segmentation into three main cerebral tissues. This idea is used here, and a novel method is developed that tries to use more complex and accurate classifiers, like support vector machines (SVM), in the ensemble. This work is challenging because CMC-based methods are time-consuming, especially on huge datasets like three-dimensional (3D) brain MR images. Moreover, SVM is a powerful method used for modeling datasets with complex feature spaces, but it also has a huge computational cost for big datasets, especially those with strong interclass variability and more than two classes, such as 3D brain images; therefore, SVM cannot be used directly in DCS-DLTLTI. We therefore propose a novel approach named "DCS-SVM" to use SVM in DCS-DLTLTI to improve the accuracy of the segmentation results. The proposed method is applied to the well-known datasets of the Internet Brain Segmentation Repository (IBSR), and promising results are obtained.
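The selection step of a dynamic-classifier-selection scheme can be sketched generically: for each test sample, pick from a pool of pre-trained classifiers the one most accurate on its k nearest training neighbours. This is a plain DCS-by-local-accuracy sketch on synthetic data, not the authors' DCS-SVM pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pool = [SVC().fit(X_tr, y_tr),
        DecisionTreeClassifier().fit(X_tr, y_tr),
        GaussianNB().fit(X_tr, y_tr)]
nn = NearestNeighbors(n_neighbors=15).fit(X_tr)

preds = []
for x in X_te:                                   # dynamic classifier selection
    idx = nn.kneighbors([x], return_distance=False)[0]
    local_acc = [clf.score(X_tr[idx], y_tr[idx]) for clf in pool]
    preds.append(pool[int(np.argmax(local_acc))].predict([x])[0])
print("DCS accuracy:", np.mean(np.array(preds) == y_te).round(3))
```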
Grenville-Briggs, Laura J; Stansfield, Ian
2011-01-01
This report describes a linked series of Masters-level computer practical workshops. They comprise an advanced functional genomics investigation, based upon analysis of a microarray dataset probing yeast DNA damage responses. The workshops require the students to analyse highly complex transcriptomics datasets, and were designed to stimulate active learning through experience of current research methods in bioinformatics and functional genomics. They seek to closely mimic a realistic research environment, and require the students first to propose research hypotheses, then test those hypotheses using specific sections of the microarray dataset. The complexity of the microarray data provides students with the freedom to propose their own unique hypotheses, tested using appropriate sections of the microarray data. This research latitude was highly regarded by students and is a strength of this practical. In addition, the focus on DNA damage by radiation and mutagenic chemicals allows them to place their results in a human medical context, and successfully sparks broad interest in the subject material. In evaluation, 79% of students scored the practical workshops on a five-point scale as 4 or 5 (totally effective) for student learning. More broadly, the general use of microarray data as a "student research playground" is also discussed. Copyright © 2011 Wiley Periodicals, Inc.
Chowdhury, Nilotpal; Sapru, Shantanu
2015-01-01
Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate - adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research.
Pantazatos, Spiro P.; Li, Jianrong; Pavlidis, Paul; Lussier, Yves A.
2009-01-01
An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n = 50), and precision of the semantic mapping between these terms across datasets was 98% (n = 100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets. PMID:20495688
Implementation Challenges for Multivariable Control: What You Did Not Learn in School
NASA Technical Reports Server (NTRS)
Garg, Sanjay
2008-01-01
Multivariable control allows controller designs that can provide decoupled command tracking and robust performance in the presence of modeling uncertainties. Although the last two decades have seen extensive development of multivariable control theory and example applications to complex systems in software/hardware simulations, there are no production flying systems, aircraft or spacecraft, that use multivariable control. This is because of the tremendous challenges associated with the implementation of such multivariable control designs. Unfortunately, the curriculum in schools does not provide sufficient time to give students an exposure to these implementation challenges. The objective of this paper is to share the lessons learned by a practitioner of multivariable control in the process of applying some of the modern control theory to the Integrated Flight Propulsion Control (IFPC) design for an advanced Short Take-Off Vertical Landing (STOVL) aircraft simulation.
Skipping the real world: Classification of PolSAR images without explicit feature extraction
NASA Astrophysics Data System (ADS)
Hänsch, Ronny; Hellwich, Olaf
2018-06-01
The typical processing chain for pixel-wise classification from PolSAR images starts with an optional preprocessing step (e.g. speckle reduction), continues with extracting features that project the complex-valued data into the real domain (e.g. by polarimetric decompositions), which are then used as input for a machine-learning-based classifier, and ends with an optional postprocessing (e.g. label smoothing). The extracted features are usually hand-crafted as well as preselected, and represent a somewhat arbitrary projection from the complex to the real domain designed to fit the requirements of standard machine-learning approaches such as Support Vector Machines or Artificial Neural Networks. This paper proposes to adapt the internal node tests of Random Forests to work directly on the complex-valued PolSAR data, which makes any explicit feature extraction obsolete. This approach leads to a classification framework with a significantly decreased computation time and memory footprint, since no image features have to be computed and stored beforehand. The experimental results on one fully-polarimetric and one dual-polarimetric dataset show that, despite the simpler approach, accuracy can be maintained (decreased by less than 2% for the fully-polarimetric dataset) or even improved (increased by roughly 9% for the dual-polarimetric dataset).
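The paper's key modification, node tests that act on complex-valued inputs directly, can be illustrated with a toy split function: project each sample onto a random complex direction and threshold the real part of the response. This is an illustrative stand-in, not necessarily the authors' exact family of node tests.

```python
import numpy as np

rng = np.random.default_rng(17)

def complex_node_test(X, w, theta):
    """Split complex-valued samples X (n x d) without mapping them to real features:
    project onto a complex direction w and threshold the real part of the response."""
    response = (X @ np.conj(w)).real
    return response < theta                      # boolean mask: left vs right child

# Toy PolSAR-like samples: d complex channels (e.g. scattering-vector entries).
X = rng.normal(size=(1000, 3)) + 1j * rng.normal(size=(1000, 3))
w = rng.normal(size=3) + 1j * rng.normal(size=3)
theta = np.median((X @ np.conj(w)).real)         # candidate threshold at the median

left = complex_node_test(X, w, theta)
print(left.sum(), (~left).sum())                 # samples routed to each child
```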
Rule-based topology system for spatial databases to validate complex geographic datasets
NASA Astrophysics Data System (ADS)
Martinez-Llario, J.; Coll, E.; Núñez-Andrés, M.; Femenia-Ribera, C.
2017-06-01
A rule-based topology software system providing a highly flexible and fast procedure to enforce integrity in spatial relationships among datasets is presented. This improved topology rule system is built over the spatial extension Jaspa. Both projects are open source, freely available software developed by the corresponding author of this paper. Currently, no spatial DBMS implements a rule-based topology engine (considering that the topology rules are designed and performed in the spatial backend). If the topology rules are applied in the frontend (as in many desktop GIS programs), ArcGIS is the most advanced solution. The system presented in this paper has several major advantages over the ArcGIS approach: it can be extended with new topology rules, it has a much wider set of rules, and it can mix feature attributes with topology rules as filters. In addition, the topology rule system can work with various DBMSs, including PostgreSQL, H2 or Oracle, and the logic is performed in the spatial backend. The proposed topology system allows users to check the complex spatial relationships among features (from one or several spatial layers) that complex cartographic datasets require, such as the data specifications proposed by INSPIRE in Europe and the Land Administration Domain Model (LADM) for cadastral data.
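The system described above evaluates rules in the spatial backend; as a frontend-style illustration of the rule idea only, the following hedged sketch checks a "must not overlap" rule filtered by a feature attribute using the Python shapely library. The parcel features and attribute names are hypothetical.

```python
from shapely.geometry import Polygon

# Hypothetical parcels: (attribute dict, geometry) pairs.
parcels = [
    ({"id": 1, "status": "approved"}, Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])),
    ({"id": 2, "status": "approved"}, Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])),
    ({"id": 3, "status": "draft"},    Polygon([(5, 5), (6, 5), (6, 6), (5, 6)])),
]

def must_not_overlap(features, attr_filter):
    """Topology rule: geometries passing the attribute filter must not overlap.
    Returns offending id pairs instead of raising, mimicking a rule report."""
    active = [(a["id"], g) for a, g in features if attr_filter(a)]
    errors = []
    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            if active[i][1].overlaps(active[j][1]):
                errors.append((active[i][0], active[j][0]))
    return errors

print(must_not_overlap(parcels, lambda a: a["status"] == "approved"))  # [(1, 2)]
```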
Smith, Stephen A; Moore, Michael J; Brown, Joseph W; Yang, Ya
2015-08-05
The use of transcriptomic and genomic datasets for phylogenetic reconstruction has become increasingly common as researchers attempt to resolve recalcitrant nodes with increasing amounts of data. The large size and complexity of these datasets introduce significant phylogenetic noise and conflict into subsequent analyses. The sources of conflict may include hybridization, incomplete lineage sorting, or horizontal gene transfer, and may vary across the phylogeny. For phylogenetic analysis, this noise and conflict have been accommodated in one of several ways: by binning gene regions into subsets to isolate consistent phylogenetic signal; by using gene-tree methods for reconstruction, where conflict is presumed to be explained by incomplete lineage sorting (ILS); or through concatenation, where noise is presumed to be the dominant source of conflict. The results provided herein emphasize that analysis of individual homologous gene regions can greatly improve our understanding of the underlying conflict within these datasets. Here we examined two published transcriptomic datasets, the angiosperm group Caryophyllales and the aculeate Hymenoptera, for the presence of conflict, concordance, and gene duplications in individual homologs across the phylogeny. We found significant conflict throughout the phylogeny in both datasets and in particular along the backbone. While some nodes in each phylogeny showed patterns of conflict similar to what might be expected with ILS alone, the backbone nodes also exhibited low levels of phylogenetic signal. In addition, certain nodes, especially in the Caryophyllales, had highly elevated levels of strongly supported conflict that cannot be explained by ILS alone. This study demonstrates that phylogenetic signal is highly variable in phylogenomic data sampled across related species and poses challenges when conducting species tree analyses on large genomic and transcriptomic datasets. Further insight into the conflict and processes underlying these complex datasets is necessary to improve and develop adequate models for sequence analysis and downstream applications. To aid this effort, we developed the open source software phyparts (https://bitbucket.org/blackrim/phyparts), which calculates unique, conflicting, and concordant bipartitions, maps gene duplications, and outputs summary statistics such as internode certainty (ICA) scores and node-specific counts of gene duplications.
Kang, Dongwan D.; Froula, Jeff; Egan, Rob; ...
2015-01-01
Grouping large genomic fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Because of the complex nature of these communities, existing metagenome binning methods often miss a large number of microbial species. In addition, most of the tools are not scalable to large datasets. Here we introduce automated software called MetaBAT that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. Lastly, it automatically forms hundreds of high-quality genome bins on a very large assembly consisting of millions of contigs in a matter of hours on a single node. MetaBAT is open source software and available at https://bitbucket.org/berkeleylab/metabat.
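A minimal sketch of the tetranucleotide frequency (TNF) signal mentioned above; MetaBAT itself collapses reverse-complement pairs into 136 distinct tetramers and combines TNF with abundance distances, whereas this sketch keeps all 256 tetramers for simplicity.

```python
from itertools import product

def tetranucleotide_frequencies(contig):
    """Normalized tetranucleotide frequency vector for one contig sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetramers
    counts = dict.fromkeys(kmers, 0)
    seq = contig.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:          # skips windows containing N, etc.
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in kmers]

tnf = tetranucleotide_frequencies("ACGTACGTAGGCTTACGATCGATCG")
print(len(tnf), round(sum(tnf), 6))  # 256 1.0
```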
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thomas, Mathew; Marshall, Matthew J.; Miller, Erin A.
2014-08-26
Understanding the interactions of structured communities known as "biofilms" and other complex matrices is possible through X-ray micro-tomography imaging of the biofilms. Feature detection and image processing for this type of data focus on efficiently identifying and segmenting biofilms and bacteria in the datasets. The datasets are very large and often require manual intervention due to low contrast between objects and high noise levels. Thus new software is required for the effective interpretation and analysis of the data. This work describes the development and application of tools to analyze and visualize high-resolution X-ray micro-tomography datasets.
Application of multivariable statistical techniques in plant-wide WWTP control strategies analysis.
Flores, X; Comas, J; Roda, I R; Jiménez, L; Gernaey, K V
2007-01-01
The main objective of this paper is to present the application of selected multivariable statistical techniques in plant-wide wastewater treatment plant (WWTP) control strategies analysis. In this study, cluster analysis (CA), principal component analysis/factor analysis (PCA/FA) and discriminant analysis (DA) are applied to the evaluation matrix data set obtained by simulation of several control strategies applied to the plant-wide IWA Benchmark Simulation Model No 2 (BSM2). These techniques make it possible i) to determine natural groups or clusters of control strategies with similar behaviour, ii) to find and interpret hidden, complex and causal relation features in the data set and iii) to identify important discriminant variables within the groups found by the cluster analysis. This study illustrates the usefulness of multivariable statistical techniques for both analysis and interpretation of complex multicriteria data sets and allows an improved use of information for effective evaluation of control strategies.
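A minimal sketch of this kind of CA/PCA/DA pipeline using scikit-learn; the evaluation matrix here is random stand-in data rather than BSM2 output, and KMeans stands in for whichever clustering variant the authors used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical evaluation matrix: rows = control strategies,
# columns = performance criteria (effluent quality, cost, violations, ...).
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))

Xs = StandardScaler().fit_transform(X)             # criteria on a common scale
scores = PCA(n_components=3).fit_transform(Xs)     # PCA: hidden structure
clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(Xs)  # CA
lda = LinearDiscriminantAnalysis().fit(Xs, clusters)  # DA on the found groups
# Largest absolute LDA coefficients point at the discriminant criteria.
print(np.argsort(-np.abs(lda.coef_), axis=1)[:, :3])
```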
Urbanowicz, Ryan J; Kiralis, Jeff; Sinnott-Armstrong, Nicholas A; Heberling, Tamra; Fisher, Jonathan M; Moore, Jason H
2012-10-01
Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. However, such methods are computationally expensive, difficult to adapt to multiple objectives, and unlikely to yield models with a precise form of epistasis which we refer to as pure and strict. Purely and strictly epistatic models constitute the worst-case in terms of detecting disease associations, since such associations may only be observed if all n-loci are included in the disease model. This makes them an attractive gold standard for simulation studies considering complex multi-locus effects. We introduce GAMETES, a user-friendly software package and algorithm which generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. GAMETES rapidly and precisely generates random, pure, strict n-locus models with specified genetic constraints. These constraints include heritability, minor allele frequencies of the SNPs, and population prevalence. GAMETES also includes a simple dataset simulation strategy which may be utilized to rapidly generate an archive of simulated datasets for given genetic models. We highlight the utility and limitations of GAMETES with an example simulation study using MDR, an algorithm designed to detect epistasis. GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures. While GAMETES has a limited ability to generate models with higher heritabilities, it is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms. In addition, the GAMETES modeling strategy may be flexibly combined with any dataset simulation strategy. Beyond dataset simulation, GAMETES could be employed to pursue theoretical characterization of genetic models and epistasis.
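GAMETES itself searches for penetrance tables that satisfy the requested constraints; the hedged sketch below only illustrates the downstream step of turning a given purely epistatic two-locus penetrance table into a simulated case-control dataset. The table values and allele frequencies are illustrative, not GAMETES output.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical XOR-like two-locus penetrance table: with both minor allele
# frequencies at 0.5, each locus has a flat marginal penetrance of 0.05,
# so the effect is only visible when both loci are modeled jointly.
penetrance = np.array([[0.0, 0.1, 0.0],
                       [0.1, 0.0, 0.1],
                       [0.0, 0.1, 0.0]])

def simulate(n, maf=(0.5, 0.5)):
    """Sample genotypes under Hardy-Weinberg, then assign status from the table."""
    g = np.stack([rng.binomial(2, m, size=n) for m in maf], axis=1)
    risk = penetrance[g[:, 0], g[:, 1]]
    status = rng.random(n) < risk
    return g, status.astype(int)

genotypes, status = simulate(1000)
print(genotypes[:3], status[:3])
```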
Bayesian correlated clustering to integrate multiple datasets
Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.
2012-01-01
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558
Castello, Lucía V; Galetto, Leonardo
2013-01-01
Tillandsia capillaris Ruiz & Pav., which belongs to the subgenus Diaphoranthema is distributed in Ecuador, Peru, Bolivia, northern and central Argentina, and Chile, and includes forms that are difficult to circumscribe, thus considered to form a complex. The entities of this complex are predominantly small-sized epiphytes, adapted to xeric environments. The most widely used classification defines 5 forms for this complex based on few morphological reproductive traits: Tillandsia capillaris Ruiz & Pav. f. capillaris, Tillandsia capillaris f. incana (Mez) L.B. Sm., Tillandsia capillaris f. cordobensis (Hieron.) L.B. Sm., Tillandsia capillaris f. hieronymi (Mez) L.B. Sm. and Tillandsia capillaris f. virescens (Ruiz & Pav.) L.B. Sm. In this study, 35 floral and vegetative characters were analyzed with a multivariate approach in order to assess and discuss different proposals for classification of the Tillandsia capillaris complex, which presents morphotypes that co-occur in central and northern Argentina. To accomplish this, data of quantitative and categorical morphological characters of flowers and leaves were collected from herbarium specimens and field collections and were analyzed with statistical multivariate techniques. The results suggest that the last classification for the complex seems more comprehensive and three taxa were delimited: Tillandsia capillaris (=Tillandsia capillaris f. incana-hieronymi), Tillandsia virescens s. str. (=Tillandsia capillaris f. cordobensis) and Tillandsia virescens s. l. (=Tillandsia capillaris f. virescens). While Tillandsia capillaris and Tillandsia virescens s. str. co-occur, Tillandsia virescens s. l. is restricted to altitudes above 2000 m in Argentina. Characters previously used for taxa delimitation showed continuous variation and therefore were not useful. New diagnostic characters are proposed and a key is provided for delimiting these three taxa within the complex.
A multivariate decision tree analysis of biophysical factors in tropical forest fire occurrence
Rey S. Ofren; Edward Harvey
2000-01-01
A multivariate decision tree model was used to quantify the relative importance of complex hierarchical relationships between biophysical variables and the occurrence of tropical forest fires. The study site is the Huai Kha Khaeng wildlife sanctuary, a World Heritage Site in northwestern Thailand where annual fires are common and particularly destructive. Thematic...
All-Possible-Subsets for MANOVA and Factorial MANOVAs: Less than a Weekend Project
ERIC Educational Resources Information Center
Nimon, Kim; Zientek, Linda Reichwein; Kraha, Amanda
2016-01-01
Multivariate techniques are increasingly popular as researchers attempt to accurately model a complex world. MANOVA is a multivariate technique used to investigate the dimensions along which groups differ, and how these dimensions may be used to predict group membership. A concern in a MANOVA analysis is to determine if a smaller subset of…
Tracking Problem Solving by Multivariate Pattern Analysis and Hidden Markov Model Algorithms
ERIC Educational Resources Information Center
Anderson, John R.
2012-01-01
Multivariate pattern analysis can be combined with Hidden Markov Model algorithms to track the second-by-second thinking as people solve complex problems. Two applications of this methodology are illustrated with a data set taken from children as they interacted with an intelligent tutoring system for algebra. The first "mind reading" application…
Discrete structural features among interface residue-level classes.
Sowmya, Gopichandran; Ranganathan, Shoba
2015-01-01
Protein-protein interaction (PPI) is essential for molecular functions in biological cells. Investigation on protein interfaces of known complexes is an important step towards deciphering the driving forces of PPIs. Each PPI complex is specific, sensitive and selective to binding. Therefore, we have estimated the relative difference in percentage of polar residues between the surface and the interface for each complex in a non-redundant heterodimer dataset of 278 complexes to understand the predominant forces driving binding. Our analysis showed ~60% of protein complexes with surface polarity greater than interface polarity (designated as class A). However, a considerable number of complexes (~40%) have interface polarity greater than surface polarity (designated as class B), differing significantly from class A (p = 1.66E-45). Comprehensive analyses of protein complexes show that interface features such as interface area, interface polarity abundance, solvation free energy gain upon interface formation, binding energy and the percentage of interface charged residue abundance distinguish among class A and class B complexes, while electrostatic visualization maps also help differentiate interface classes among complexes. Class A complexes are classical with abundant non-polar interactions at the interface; however class B complexes have abundant polar interactions at the interface, similar to protein surface characteristics. Five physicochemical interface features analyzed from the protein heterodimer dataset are discriminatory among the interface residue-level classes. These novel observations find application in developing residue-level models for protein-protein binding prediction, protein-protein docking studies and interface inhibitor design as drugs.
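A hedged sketch of the core surface-versus-interface polarity comparison; which residues count as polar varies by convention, and the set below is one common choice that may differ from the paper's definition.

```python
POLAR = set("STNQYCHKRDE")  # one common polar/charged residue convention (assumption)

def percent_polar(residues):
    """Percentage of polar residues in a one-letter residue string."""
    residues = residues.upper()
    return 100.0 * sum(r in POLAR for r in residues) / max(len(residues), 1)

def polarity_class(surface_residues, interface_residues):
    """Class A if the surface is more polar than the interface, else class B."""
    delta = percent_polar(surface_residues) - percent_polar(interface_residues)
    return ("A" if delta > 0 else "B"), delta

print(polarity_class("KRDESTAG", "AVLIFMGW"))  # ('A', 75.0)
```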
Kent, Peter; Jensen, Rikke K; Kongsted, Alice
2014-10-02
There are various methodological approaches to identifying clinically important subgroups and one method is to identify clusters of characteristics that differentiate people in cross-sectional and/or longitudinal data using Cluster Analysis (CA) or Latent Class Analysis (LCA). There is a scarcity of head-to-head comparisons that can inform the choice of which clustering method might be suitable for particular clinical datasets and research questions. Therefore, the aim of this study was to perform a head-to-head comparison of three commonly available methods (SPSS TwoStep CA, Latent Gold LCA and SNOB LCA). The performance of these three methods was compared: (i) quantitatively using the number of subgroups detected, the classification probability of individuals into subgroups, the reproducibility of results, and (ii) qualitatively using subjective judgments about each program's ease of use and interpretability of the presentation of results. We analysed five real datasets of varying complexity in a secondary analysis of data from other research projects. Three datasets contained only MRI findings (n = 2,060 to 20,810 vertebral disc levels), one dataset contained only pain intensity data collected for 52 weeks by text (SMS) messaging (n = 1,121 people), and the last dataset contained a range of clinical variables measured in low back pain patients (n = 543 people). Four artificial datasets (n = 1,000 each) containing subgroups of varying complexity were also analysed testing the ability of these clustering methods to detect subgroups and correctly classify individuals when subgroup membership was known. The results from the real clinical datasets indicated that the number of subgroups detected varied, the certainty of classifying individuals into those subgroups varied, the findings had perfect reproducibility, some programs were easier to use and the interpretability of the presentation of their findings also varied. The results from the artificial datasets indicated that all three clustering methods showed a near-perfect ability to detect known subgroups and correctly classify individuals into those subgroups. Our subjective judgement was that Latent Gold offered the best balance of sensitivity to subgroups, ease of use and presentation of results with these datasets but we recognise that different clustering methods may suit other types of data and clinical research questions.
Nouraei, S A R; Hudovsky, A; Virk, J S; Saleh, H A
2017-04-01
This study aimed to develop a multidisciplinary coded dataset standard for nasal surgery and to assess its impact on data accuracy. An audit of 528 patients undergoing septal and/or inferior turbinate surgery, rhinoplasty and/or septorhinoplasty, and nasal fracture surgery was undertaken. A total of 200 septoplasties, 109 septorhinoplasties, 57 complex septorhinoplasties and 116 nasal fractures were analysed. There were 76 (14.4 per cent) changes to the primary diagnosis. Septorhinoplasties were the most commonly amended procedures. The overall audit-related income change for nasal surgery was £8.78 per patient. Use of a multidisciplinary coded dataset standard revealed that nasal diagnoses were under-coded; a significant proportion of patients received more precise diagnoses following the audit. There was also significant under-coding of both morbidities and revision surgery. The multidisciplinary coded dataset standard approach can improve the accuracy of both data capture and information flow, and thus ultimately create a more reliable dataset for use in outcomes and health planning.
An extensive dataset of eye movements during viewing of complex images.
Wilming, Niklas; Onat, Selim; Ossandón, José P; Açık, Alper; Kietzmann, Tim C; Kaspar, Kai; Gameiro, Ricardo R; Vormberg, Alexandra; König, Peter
2017-01-31
We present a dataset of free-viewing eye-movement recordings that contains more than 2.7 million fixation locations from 949 observers on more than 1000 images from different categories. This dataset aggregates and harmonizes data from 23 different studies conducted at the Institute of Cognitive Science at Osnabrück University and the University Medical Center in Hamburg-Eppendorf. Trained personnel recorded all studies under standard conditions with homogeneous equipment and parameter settings. All studies allowed for free eye-movements, and differed in the age range of participants (~7-80 years), stimulus sizes, stimulus modifications (phase scrambled, spatial filtering, mirrored), and stimulus categories (natural and urban scenes, web sites, fractal, pink-noise, and ambiguous artistic figures). The size and variability of viewing behavior within this dataset present a strong opportunity for evaluating and comparing computational models of overt attention, and furthermore, for thoroughly quantifying strategies of viewing behavior. This also makes the dataset a good starting point for investigating whether viewing strategies change in patient groups.
Resilience and tipping points of an exploited fish population over six decades.
Vasilakopoulos, Paraskevas; Marshall, C Tara
2015-05-01
Complex natural systems with eroded resilience, such as populations, ecosystems and socio-ecological systems, respond to small perturbations with abrupt, discontinuous state shifts, or critical transitions. Theory of critical transitions suggests that such systems exhibit fold bifurcations featuring folded response curves, tipping points and alternate attractors. However, there is little empirical evidence of fold bifurcations occurring in actual complex natural systems impacted by multiple stressors. Moreover, resilience of complex systems to change currently lacks clear operational measures with generic application. Here, we provide empirical evidence for the occurrence of a fold bifurcation in an exploited fish population and introduce a generic measure of ecological resilience based on the observed fold bifurcation attributes. We analyse the multivariate development of Barents Sea cod (Gadus morhua), which is currently the world's largest cod stock, over six decades (1949-2009), and identify a population state shift in 1981. By plotting a multivariate population index against a multivariate stressor index, the shift mechanism was revealed suggesting that the observed population shift was a nonlinear response to the combined effects of overfishing and climate change. Annual resilience values were estimated based on the position of each year in relation to the fitted attractors and assumed tipping points of the fold bifurcation. By interpolating the annual resilience values, a folded stability landscape was fit, which was shaped as predicted by theory. The resilience assessment suggested that the population may be close to another tipping point. This study illustrates how a multivariate analysis, supported by theory of critical transitions and accompanied by a quantitative resilience assessment, can clarify shift mechanisms in data-rich complex natural systems. © 2014 John Wiley & Sons Ltd.
NASA Technical Reports Server (NTRS)
Soeder, J. F.
1983-01-01
As turbofan engines become more complex, the development of controls necessitates the use of multivariable control techniques. A control developed for the F100-PW-100(3) turbofan engine by using linear quadratic regulator theory and other modern multivariable control synthesis techniques is described. The assembly language implementation of this control on an SEL 810B minicomputer is described. This implementation was then evaluated by using a real-time hybrid simulation of the engine. The control software was modified to run with a real engine. These modifications, in the form of sensor and actuator failure checks and control executive sequencing, are discussed. Finally, recommendations for control software implementations are presented.
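As an illustration of the linear quadratic regulator machinery mentioned above, here is a minimal continuous-time LQR gain computation with SciPy on a toy two-state system; the matrices are placeholders, not a linearized engine model.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr(A, B, Q, R):
    """Continuous-time LQR gain K for u = -K x, via the algebraic Riccati equation."""
    P = solve_continuous_are(A, B, Q, R)
    K = np.linalg.solve(R, B.T @ P)   # K = R^{-1} B^T P
    return K

# Toy 2-state, 1-input system standing in for a linearized plant.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
K = lqr(A, B, Q=np.eye(2), R=np.eye(1))
print(K)  # state-feedback gain
```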
Malik, Amrita; Tauler, Roma
2015-06-01
This work focuses on understanding the behaviour and patterns of three atmospheric pollutants, namely nitric oxide (NO), nitrogen dioxide (NO2) and ozone (O3), along with their mutual interactions in the atmosphere of Barcelona, North Spain. Hourly samples were collected for NO, NO2 and O3 from the same city location for three consecutive years (2010-2012). The study explores the seasonal, annual and weekday-weekend variations in their diurnal profiles along with the possible identification of their sources and mutual interactions in the region. Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) was applied to the individual datasets of these pollutants, as well as to all of them simultaneously (augmented mode), to resolve the profiles related to their sources and variation patterns in the atmosphere. The analysis of the individual datasets confirmed the source-related pattern variations in each pollutant's profiles, and the profiles resolved for the augmented datasets simultaneously captured the mutual interactions of the pollutants along with their pattern variations. The study points to vehicular pollution as the major source of atmospheric nitrogen oxides and to the presence of a weekend ozone effect in the region. Copyright © 2015 Elsevier B.V. All rights reserved.
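A bare-bones sketch of the alternating least squares at the heart of MCR-ALS, with non-negativity imposed by simple clipping; real MCR-ALS implementations add further constraints, initialization from purest variables, and convergence criteria, none of which are reproduced here.

```python
import numpy as np

def mcr_als(D, n_components, n_iter=200, seed=0):
    """Minimal MCR-ALS: factor D (times x channels) into C @ S.T with
    non-negative contribution profiles C and component profiles S."""
    rng = np.random.default_rng(seed)
    C = rng.random((D.shape[0], n_components))
    for _ in range(n_iter):
        S = np.linalg.lstsq(C, D, rcond=None)[0].T      # solve for profiles
        S = np.clip(S, 0, None)                          # non-negativity
        C = np.linalg.lstsq(S, D.T, rcond=None)[0].T     # solve for contributions
        C = np.clip(C, 0, None)
    return C, S

# Fabricated hourly matrix: 24 time points x 3 channels ({NO, NO2, O3}-like).
D = np.abs(np.random.default_rng(1).standard_normal((24, 3)))
C, S = mcr_als(D, n_components=2)
print(np.linalg.norm(D - C @ S.T))  # residual of the bilinear model
```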
NASA Astrophysics Data System (ADS)
Zhan, Weiwei; Fan, Xuanmei; Huang, Runqiu; Pei, Xiangjun; Xu, Qiang; Li, Weile
2017-06-01
Rock avalanches are extremely rapid, massive flow-like movements of fragmented rock. The travel path of the rock avalanches may be confined by channels in some cases, which are referred to as channelized rock avalanches. Channelized rock avalanches are potentially dangerous due to their difficult-to-predict travel distance. In this study, we constructed a dataset with detailed characteristic parameters of 38 channelized rock avalanches triggered by the 2008 Wenchuan earthquake using the visual interpretation of remote sensing imagery, field investigation and literature review. Based on this dataset, we assessed the influence of different factors on the runout distance and developed prediction models of the channelized rock avalanches using the multivariate regression method. The results suggested that the movement of channelized rock avalanches was dominated by the landslide volume, total relief and channel gradient. The performance of both models was then tested with an independent validation dataset of eight rock avalanches that were induced by the 2008 Wenchuan earthquake, the Ms 7.0 Lushan earthquake and heavy rainfall in 2013, showing acceptably good prediction results. Therefore, the travel-distance prediction models for channelized rock avalanches constructed in this study are applicable and reliable for predicting the runout of similar rock avalanches in other regions.
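A hedged sketch of a multivariate runout regression in the log-linear form commonly used for travel-distance models; the predictors follow the factors listed above (volume, total relief, channel gradient), but all numbers are fabricated, not the Wenchuan dataset, and the functional form is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fabricated predictors: volume (10^6 m^3), total relief (m), channel gradient (deg).
rng = np.random.default_rng(3)
V = rng.uniform(0.5, 50, 38)
H = rng.uniform(200, 1500, 38)
G = rng.uniform(5, 30, 38)
# Synthetic runout under an assumed power-law form L = a * V^b * H^c * noise.
L = 2.0 * V**0.3 * H**0.4 * np.exp(rng.normal(0, 0.1, 38))

# Fitting in log space turns the power law into ordinary multivariate regression.
X = np.log(np.column_stack([V, H, G]))
model = LinearRegression().fit(X, np.log(L))
print(model.coef_, model.intercept_)  # recovered exponents and log(a)
```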
Bhanot, Gyan; Alexe, Gabriela; Levine, Arnold J; Stolovitzky, Gustavo
2005-01-01
A major challenge in cancer diagnosis from microarray data is the need for robust, accurate classification models which are independent of the analysis techniques used and can combine data from different laboratories. We propose such a classification scheme, originally developed for phenotype identification from mass spectrometry data. The method uses a robust multivariate gene selection procedure and combines the results of several machine learning tools trained on raw and pattern data to produce an accurate meta-classifier. We illustrate and validate our method by applying it to gene expression datasets: the oligonucleotide HuGeneFL microarray dataset of Shipp et al. (www.genome.wi.mit.du/MPR/lymphoma) and the Hu95Av2 Affymetrix dataset (Dalla-Favera's laboratory, Columbia University). Our pattern-based meta-classification technique achieves higher predictive accuracies than each of the individual classifiers, is robust against data perturbations and provides subsets of related predictive genes. Our techniques predict that combinations of some genes in the p53 pathway are highly predictive of phenotype. In particular, we find that in 80% of DLBCL cases the mRNA level of at least one of the three genes p53, PLK1 and CDK2 is elevated, while in 80% of FL cases, the mRNA level of at most one of them is elevated.
Wood, Bradley M; Jia, Guang; Carmichael, Owen; McKlveen, Kevin; Homberger, Dominique G
2018-05-12
3D imaging techniques enable the non-destructive analysis and modeling of complex structures. Among these, MRI exhibits good soft tissue contrast, but is currently less commonly used for non-clinical research than x-ray CT, even though the latter requires contrast-staining that shrinks and distorts soft tissues. When the objective is the creation of a realistic and complete 3D model of soft tissue structures, MRI data are more demanding to acquire and visualize and require extensive post-processing because they comprise non-cubic voxels with dimensions that represent a trade-off between tissue contrast and image resolution. Therefore, thin soft tissue structures with complex spatial configurations are not always visible in a single MRI dataset, so that standard segmentation techniques are not sufficient for their complete visualization. By using the example of the thin and spatially complex connective tissue myosepta in lampreys, we developed a workflow protocol for the selection of the appropriate parameters for the acquisition of MRI data and for the visualization and 3D modeling of soft tissue structures. This protocol includes a novel recursive segmentation technique for supplementing missing data in one dataset with data from another dataset to produce realistic and complete 3D models. Such 3D models are needed for the modeling of dynamic processes, such as the biomechanics of fish locomotion. However, our methodology is applicable to the visualization of any thin soft tissue structures with complex spatial configurations, such as fasciae, aponeuroses, and small blood vessels and nerves, for clinical research and the further exploration of tensegrity. This article is protected by copyright. All rights reserved. © 2018 Wiley Periodicals, Inc.
Spear, Timothy T; Nishimura, Michael I; Simms, Patricia E
2017-08-01
Advancement in flow cytometry reagents and instrumentation has allowed for simultaneous analysis of large numbers of lineage/functional immune cell markers. Highly complex datasets generated by polychromatic flow cytometry require proper analytical software to answer investigators' questions. A problem among many investigators and flow cytometry Shared Resource Laboratories (SRLs), including our own, is a lack of access to a flow cytometry-knowledgeable bioinformatics team, making it difficult to learn and choose appropriate analysis tool(s). Here, we comparatively assess various multidimensional flow cytometry software packages for their ability to answer a specific biologic question and provide graphical representation output suitable for publication, as well as their ease of use and cost. We assessed polyfunctional potential of TCR-transduced T cells, serving as a model evaluation, using multidimensional flow cytometry to analyze 6 intracellular cytokines and degranulation on a per-cell basis. Analysis of 7 parameters resulted in 128 possible combinations of positivity/negativity, far too complex for basic flow cytometry software to analyze fully. Various software packages were used, analysis methods used in each described, and representative output displayed. Of the tools investigated, automated classification of cellular expression by nonlinear stochastic embedding (ACCENSE) and coupled analysis in Pestle/simplified presentation of incredibly complex evaluations (SPICE) provided the most user-friendly manipulations and readable output, evaluating effects of altered antigen-specific stimulation on T cell polyfunctionality. This detailed approach may serve as a model for other investigators/SRLs in selecting the most appropriate software to analyze complex flow cytometry datasets. Further development and awareness of available tools will help guide proper data analysis to answer difficult biologic questions arising from incredibly complex datasets. © Society for Leukocyte Biology.
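The 128 combinations mentioned above arise as the 2^7 Boolean gates over seven parameters; a minimal sketch of that counting step follows, with illustrative marker names rather than the panel used in the study.

```python
from itertools import product

markers = ["IFNg", "IL2", "TNFa", "MIP1b", "IL4", "IL10", "CD107a"]  # 7 params

def combination_counts(cells):
    """Count cells in each of the 2^7 = 128 positivity/negativity combinations.
    `cells` is a list of dicts mapping marker -> bool (gated positive or not)."""
    counts = {combo: 0 for combo in product([False, True], repeat=len(markers))}
    for cell in cells:
        counts[tuple(cell[m] for m in markers)] += 1
    return counts

cells = [dict(zip(markers, [True, True, False, False, False, False, True])),
         dict(zip(markers, [False] * 7))]
counts = combination_counts(cells)
print(len(counts), counts[(False,) * 7])  # 128 1
```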
A Complex Systems Approach to Causal Discovery in Psychiatry.
Saxe, Glenn N; Statnikov, Alexander; Fenyo, David; Ren, Jiwen; Li, Zhiguo; Prasad, Meera; Wall, Dennis; Bergman, Nora; Briggs, Ernestine C; Aliferis, Constantin
2016-01-01
Conventional research methodologies and data analytic approaches in psychiatric research are unable to reliably infer causal relations without experimental designs, or to make inferences about the functional properties of the complex systems in which psychiatric disorders are embedded. This article describes a series of studies to validate a novel hybrid computational approach, the Complex Systems-Causal Network (CS-CN) method, designed to integrate causal discovery within a complex systems framework for psychiatric research. The CS-CN method was first applied to an existing dataset on psychopathology in 163 children hospitalized with injuries (validation study). Next, it was applied to a much larger dataset of traumatized children (replication study). Finally, the CS-CN method was applied in a controlled experiment using a 'gold standard' dataset for causal discovery and compared with other methods for accurately detecting causal variables (resimulation controlled experiment). The CS-CN method successfully detected a causal network of 111 variables and 167 bivariate relations in the initial validation study. This causal network had well-defined adaptive properties and a set of variables was found that disproportionally contributed to these properties. Modeling the removal of these variables resulted in significant loss of adaptive properties. The CS-CN method was successfully applied in the replication study and performed better than traditional statistical methods, and similarly to state-of-the-art causal discovery algorithms in the causal detection experiment. The CS-CN method was validated, replicated, and yielded both novel and previously validated findings related to risk factors and potential treatments of psychiatric disorders. The novel approach yields both fine-grain (micro) and high-level (macro) insights and thus represents a promising approach for complex systems-oriented research in psychiatry.
Web mapping system for complex processing and visualization of environmental geospatial datasets
NASA Astrophysics Data System (ADS)
Titov, Alexander; Gordov, Evgeny; Okladnikov, Igor
2016-04-01
Environmental geospatial datasets (meteorological observations, modeling and reanalysis results, etc.) are used in numerous research applications. For a number of objective reasons, such as the inherent heterogeneity of environmental datasets, large dataset volumes, the complexity of the data models used, and syntactic and semantic differences that complicate the creation and use of a unified terminology, the development of environmental geodata access, processing and visualization services, as well as client applications, turns out to be quite a sophisticated task. According to general INSPIRE requirements for data visualization, geoportal web applications have to provide such standard functionality as data overview, image navigation, scrolling, scaling and graphical overlay, and display of map legends and corresponding metadata. Modern web mapping systems, as integrated geoportal applications, are developed based on the SOA and can be considered complexes of interconnected software tools for working with geospatial data. In this report a complex web mapping system is presented, including a GIS web client and the corresponding OGC services for working with a geospatial (NetCDF, PostGIS) dataset archive. The GIS web client comprises three basic tiers:
1. A tier of geospatial metadata retrieved from a central MySQL repository and represented in JSON format.
2. A tier of JavaScript objects implementing methods for handling NetCDF metadata, the task XML object for configuring user calculations and input/output formats, and OGC WMS/WFS cartographical services.
3. A graphical user interface (GUI) tier of JavaScript objects realizing the web application business logic.
The metadata tier consists of a number of JSON objects containing technical information describing the geospatial datasets (such as spatio-temporal resolution, meteorological parameters, valid processing methods, etc.). The middleware tier of JavaScript objects, which implements methods for handling geospatial metadata, the task XML object, and the WMS/WFS cartographical services, interconnects the metadata and GUI tiers. The methods include such procedures as downloading and updating the JSON metadata, launching and tracking calculation tasks running on remote servers, and working with the WMS/WFS cartographical services: obtaining the list of available layers, visualizing layers on the map, and exporting layers in graphical (PNG, JPG, GeoTIFF), vector (KML, GML, Shape) and digital (NetCDF) formats. The graphical user interface tier is based on a bundle of JavaScript libraries (OpenLayers, GeoExt and ExtJS) and represents a set of software components implementing the web mapping application business logic (complex menus, toolbars, wizards, event handlers, etc.). The GUI provides two basic capabilities for the end user: configuring the task XML object and visualizing cartographical information. The web interface developed is similar to the interfaces of such popular desktop GIS applications as uDig, QuantumGIS, etc. The web mapping system developed has shown its effectiveness in solving real climate change research problems and disseminating investigation results in cartographical form. The work is supported by SB RAS Basic Program Projects VIII.80.2.1 and IV.38.1.7.
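As an illustration of the WMS interaction in the cartographical tier, a hedged sketch of a standard GetMap request from Python; the endpoint and layer name are hypothetical, while the parameter names follow the WMS 1.1.1 specification.

```python
import requests

WMS_URL = "https://example.org/geoserver/wms"   # hypothetical endpoint

params = {
    "service": "WMS", "version": "1.1.1", "request": "GetMap",
    "layers": "climate:air_temperature", "styles": "",
    "bbox": "60,50,120,80",          # lon/lat extent of the region of interest
    "srs": "EPSG:4326", "width": 800, "height": 400,
    "format": "image/png",
}
response = requests.get(WMS_URL, params=params, timeout=30)
with open("layer.png", "wb") as f:
    f.write(response.content)        # rendered map layer, ready for GUI overlay
```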
Compressed sensing based missing nodes prediction in temporal communication network
NASA Astrophysics Data System (ADS)
Cheng, Guangquan; Ma, Yang; Liu, Zhong; Xie, Fuli
2018-02-01
The reconstruction of complex network topology is of great theoretical and practical significance. Most research so far focuses on the prediction of missing links. There are many mature algorithms for link prediction which have achieved good results, but research on the prediction of missing nodes has just begun. In this paper, we propose an algorithm for missing node prediction in complex networks. We detect the position of missing nodes based on their neighbor nodes under the theory of compressed sensing, and extend the algorithm to the case of multiple missing nodes using spectral clustering. Experiments on real public network datasets and simulated datasets show that our algorithm can detect the locations of hidden nodes effectively with high precision.
Crabtree, Nathaniel M; Moore, Jason H; Bowyer, John F; George, Nysia I
2017-01-01
A computational evolution system (CES) is a knowledge discovery engine that can identify subtle, synergistic relationships in large datasets. Pareto optimization allows CESs to balance accuracy with model complexity when evolving classifiers. Using Pareto optimization, a CES is able to identify a very small number of features while maintaining high classification accuracy. A CES can be designed for various types of data, and the user can exploit expert knowledge about the classification problem in order to improve discrimination between classes. These characteristics give CES an advantage over other classification and feature selection algorithms, particularly when the goal is to identify a small number of highly relevant, non-redundant biomarkers. Previously, CESs have been developed only for binary class datasets. In this study, we developed a multi-class CES. The multi-class CES was compared to three common feature selection and classification algorithms: support vector machine (SVM), random k-nearest neighbor (RKNN), and random forest (RF). The algorithms were evaluated on three distinct multi-class RNA sequencing datasets. The comparison criteria were run-time, classification accuracy, number of selected features, and stability of selected feature set (as measured by the Tanimoto distance). The performance of each algorithm was data-dependent. CES performed best on the dataset with the smallest sample size, indicating that CES has a unique advantage since the accuracy of most classification methods suffer when sample size is small. The multi-class extension of CES increases the appeal of its application to complex, multi-class datasets in order to identify important biomarkers and features.
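The stability criterion above reduces to a set comparison; a minimal sketch of the Tanimoto (Jaccard) distance between two selected-feature sets, with hypothetical gene identifiers.

```python
def tanimoto_distance(features_a, features_b):
    """Tanimoto (Jaccard) distance between two selected-feature sets:
    0 means identical selections, 1 means no overlap."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Feature sets selected on two cross-validation folds (hypothetical gene IDs).
print(tanimoto_distance({"geneA", "geneB", "geneC"},
                        {"geneB", "geneC", "geneD"}))  # 0.5
```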
Data-driven probability concentration and sampling on manifold
DOE Office of Scientific and Technical Information (OSTI.GOV)
Soize, C., E-mail: christian.soize@univ-paris-est.fr; Ghanem, R., E-mail: ghanem@usc.edu
2016-09-15
A new methodology is proposed for generating realizations of a random vector with values in a finite-dimensional Euclidean space that are statistically consistent with a dataset of observations of this vector. The probability distribution of this random vector, while a priori not known, is presumed to be concentrated on an unknown subset of the Euclidean space. A random matrix is introduced whose columns are independent copies of the random vector and for which the number of columns is the number of data points in the dataset. The approach is based on the use of (i) the multidimensional kernel-density estimation method for estimating the probability distribution of the random matrix, (ii) a MCMC method for generating realizations for the random matrix, (iii) the diffusion-maps approach for discovering and characterizing the geometry and the structure of the dataset, and (iv) a reduced-order representation of the random matrix, which is constructed using the diffusion-maps vectors associated with the first eigenvalues of the transition matrix relative to the given dataset. The convergence aspects of the proposed methodology are analyzed and a numerical validation is explored through three applications of increasing complexity. The proposed method is found to be robust to noise levels and data complexity as well as to the intrinsic dimension of data and the size of experimental datasets. Both the methodology and the underlying mathematical framework presented in this paper contribute new capabilities and perspectives at the interface of uncertainty quantification, statistical data analysis, stochastic modeling and associated statistical inverse problems.
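A minimal sketch of step (iii) above, a diffusion-maps embedding of a dataset; real implementations add kernel-bandwidth selection and the specific normalizations used by the authors, none of which are reproduced here.

```python
import numpy as np

def diffusion_map(X, epsilon=None, n_coords=2):
    """Minimal diffusion-maps embedding: Gaussian kernel, row-normalized
    transition matrix, then the top non-trivial eigenvectors."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    if epsilon is None:
        epsilon = np.median(sq)          # a common heuristic bandwidth (assumption)
    K = np.exp(-sq / epsilon)
    P = K / K.sum(axis=1, keepdims=True)  # Markov transition matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    return vecs.real[:, order[1:n_coords + 1]]  # skip the trivial eigenvector

X = np.random.default_rng(2).standard_normal((50, 5))
coords = diffusion_map(X)
print(coords.shape)  # (50, 2)
```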
NASA Astrophysics Data System (ADS)
Laiti, Lavinia; Giovannini, Lorenzo; Zardi, Dino
2015-04-01
The accurate assessment of the solar radiation available at the Earth's surface is essential for a wide range of energy-related applications, such as the design of solar power plants, water heating systems and energy-efficient buildings, as well as in the fields of climatology, hydrology, ecology and agriculture. The characterization of solar radiation is particularly challenging in complex-orography areas, where topographic shadowing and altitude effects, together with local weather phenomena, greatly increase the spatial and temporal variability of such variable. At present, approaches ranging from the interpolation of surface measurements, to the orographic downscaling of satellite data, to numerical model simulations are adopted for mapping solar radiation. In this contribution a high-resolution (200 m) solar atlas for the Trentino region (Italy) is presented, which was recently developed on the basis of hourly observations of global radiation collected from the local radiometric stations during the period 2004-2012. Monthly and annual climatological irradiation maps were obtained by the combined use of a GIS-based clear-sky model (r.sun module of GRASS GIS) and geostatistical interpolation techniques (kriging). Moreover, satellite radiation data derived by the MeteoSwiss HelioMont algorithm (2 km resolution) were used for missing-data reconstruction and for the final mapping, thus integrating ground-based and remote-sensing information. The results are compared with existing solar resource datasets, such as the PVGIS dataset, produced by the Joint Research Center Institute for Energy and Transport, and the HelioMont dataset, in order to evaluate the accuracy of the different datasets available for the region of interest.
da Costa, Pedro Beschoren; Granada, Camille E.; Ambrosini, Adriana; Moreira, Fernanda; de Souza, Rocheli; dos Passos, João Frederico M.; Arruda, Letícia; Passaglia, Luciane M. P.
2014-01-01
Plant growth-promoting bacteria can greatly assist sustainable farming by improving plant health and biomass while reducing fertilizer use. The plant-microorganism-environment interaction is an open and complex system, and despite the active research in the area, patterns in root ecology are elusive. Here, we simultaneously analyzed the plant growth-promoting bacteria datasets from seven independent studies that shared a methodology for bioprospection and phenotype screening. The soil richness of each isolate's origin was classified by a Principal Component Analysis. A Categorical Principal Component Analysis was used to classify the soil richness according to the isolates' indolic compound production, siderophore production and phosphate solubilization abilities, and bacterial genera composition. Multiple patterns and relationships were found and verified with nonparametric hypothesis testing. Including niche colonization in the analysis, we proposed a model to explain the expression of bacterial plant growth-promoting traits according to the soil nutritional status. Our model shows that plants favor interaction with growth hormone producers under rich nutrient conditions but favor nutrient solubilizers under poor conditions. We also performed several comparisons among the different genera, highlighting interesting ecological interactions and limitations. Our model could be used to direct plant growth-promoting bacteria bioprospection and metagenomic sampling. PMID:25542031
Careau, Vincent; Wolak, Matthew E; Carter, Patrick A; Garland, Theodore
2015-11-22
Given the pace at which human-induced environmental changes occur, a pressing challenge is to determine the speed with which selection can drive evolutionary change. A key determinant of adaptive response to multivariate phenotypic selection is the additive genetic variance-covariance matrix (G). Yet knowledge of G in a population experiencing new or altered selection is not sufficient to predict selection response because G itself evolves in ways that are poorly understood. We experimentally evaluated changes in G when closely related behavioural traits experience continuous directional selection. We applied the genetic covariance tensor approach to a large dataset (n = 17 328 individuals) from a replicated, 31-generation artificial selection experiment that bred mice for voluntary wheel running on days 5 and 6 of a 6-day test. Selection on this subset of G induced proportional changes across the matrix for all 6 days of running behaviour within the first four generations. The changes in G induced by selection resulted in a fourfold slower-than-predicted rate of response to selection. Thus, selection exacerbated constraints within G and limited future adaptive response, a phenomenon that could have profound consequences for populations facing rapid environmental change. © 2015 The Author(s).
NASA Astrophysics Data System (ADS)
Bandaru, Sunith; Deb, Kalyanmoy
2011-09-01
In this article, a methodology is proposed for automatically extracting innovative design principles which make a system or process (subject to conflicting objectives) optimal using its Pareto-optimal dataset. Such 'higher knowledge' would not only help designers to execute the system better, but also enable them to predict how changes in one variable would affect other variables if the system has to retain its optimal behaviour. This in turn would help solve other similar systems with different parameter settings easily without the need to perform a fresh optimization task. The proposed methodology uses a clustering-based optimization technique and is capable of discovering hidden functional relationships between the variables, objective and constraint functions and any other function that the designer wishes to include as a 'basis function'. A number of engineering design problems are considered for which the mathematical structure of these explicit relationships exists and has been revealed by a previous study. A comparison with the multivariate adaptive regression splines (MARS) approach reveals the practicality of the proposed approach due to its ability to find meaningful design principles. The success of this procedure for automated innovization is highly encouraging and indicates its suitability for further development in tackling more complex design scenarios.
NASA Astrophysics Data System (ADS)
Tan, Chao; Chen, Hui; Wang, Chao; Zhu, Wanping; Wu, Tong; Diao, Yuanbo
2013-03-01
Near- and mid-infrared (NIR/MIR) spectroscopy techniques have gained great acceptance in industry due to their multiple applications and versatility. However, successful application often depends heavily on the construction of accurate and stable calibration models. For this purpose, a simple multi-model fusion strategy is proposed. It is the combination of a Kohonen self-organizing map (KSOM), mutual information (MI) and partial least squares (PLS), and is therefore named KMICPLS. It works as follows: first, the original training set is fed into a KSOM for unsupervised clustering of samples, from which a series of training subsets are constructed. Thereafter, on each of the training subsets, an MI spectrum is calculated and only the variables with MI values higher than the mean value are retained, based on which a candidate PLS model is constructed. Finally, a fixed number of PLS models are selected to produce a consensus model. Two NIR/MIR spectral datasets from the brewing industry are used for experiments. The results confirm its superior performance over two reference algorithms, i.e. the conventional PLS and genetic algorithm-PLS (GAPLS). It can build more accurate and stable calibration models without increasing the complexity, and can be generalized to other NIR/MIR applications.
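A hedged sketch of the fusion idea with scikit-learn, using KMeans as a stand-in for the Kohonen map; the subset sizes, component counts and the above-mean MI cutoff are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import mutual_info_regression
from sklearn.cross_decomposition import PLSRegression

def kmicpls_like(X, y, n_clusters=4, n_components=3, seed=0):
    """Cluster samples (KMeans standing in for the KSOM), keep variables with
    above-mean mutual information per subset, fit one PLS model per subset,
    and average the member predictions into a consensus."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    members = []
    for c in range(n_clusters):
        Xc, yc = X[labels == c], y[labels == c]
        if len(yc) <= n_components:
            continue                          # skip subsets too small to fit
        mi = mutual_info_regression(Xc, yc, random_state=seed)
        keep = np.flatnonzero(mi >= mi.mean())
        pls = PLSRegression(n_components=min(n_components, len(keep)))
        members.append((keep, pls.fit(Xc[:, keep], yc)))
    def predict(Xnew):
        preds = [p.predict(Xnew[:, keep]).ravel() for keep, p in members]
        return np.mean(preds, axis=0)         # consensus prediction
    return predict

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 50))
y = X[:, :5].sum(axis=1) + 0.1 * rng.standard_normal(120)
predict = kmicpls_like(X, y)
print(predict(X[:3]))
```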
Ciucci, Sara; Ge, Yan; Durán, Claudio; Palladini, Alessandra; Jiménez-Jiménez, Víctor; Martínez-Sánchez, Luisa María; Wang, Yuting; Sales, Susanne; Shevchenko, Andrej; Poser, Steven W.; Herbig, Maik; Otto, Oliver; Androutsellis-Theotokis, Andreas; Guck, Jochen; Gerl, Mathias J.; Cannistraci, Carlo Vittorio
2017-01-01
Omic science is rapidly growing and one of the most employed techniques to explore differential patterns in omic datasets is principal component analysis (PCA). However, a method to highlight the network of omic features that contribute most to the sample separation obtained by PCA has been missing. An alternative is to build correlation networks between univariately selected significant omic features, but this neglects the multivariate unsupervised feature compression responsible for the PCA sample segregation. Biologists and medical researchers often prefer effective methods that offer an immediate interpretation to complicated algorithms that in principle promise an improvement but in practice are difficult to apply and interpret. Here we present PC-corr: a simple algorithm that associates a discriminative network of features with any PCA segregation. Such a network can be inspected in search of functional modules useful for the definition of combinatorial and multiscale biomarkers from multifaceted omic data in systems and precision biomedicine. We offer proofs of PC-corr's efficacy on lipidomic, metagenomic, developmental genomic, population genetic, cancer promoteromic and cancer stem-cell mechanomic data. Finally, PC-corr is a general functional network inference approach that can be easily adopted for big data exploration in computer science and the analysis of complex systems in physics. PMID:28287094
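A simplified reading of the PC-corr idea (the published algorithm combines loadings and correlations differently): take the features loading most strongly on a chosen principal component, then keep the strong edges of their correlation network.

```python
import numpy as np

def pc_corr_sketch(X, pc=0, top_k=10, corr_threshold=0.5):
    """Features loading most strongly on one PC, linked by strong Pearson
    correlations: a minimal discriminative-network sketch, not PC-corr itself."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # PCA via SVD
    loadings = Vt[pc]
    top = np.argsort(-np.abs(loadings))[:top_k]         # most contributing features
    R = np.corrcoef(Xc[:, top], rowvar=False)
    return [(int(top[i]), int(top[j]), float(R[i, j]))
            for i in range(top_k) for j in range(i + 1, top_k)
            if abs(R[i, j]) >= corr_threshold]          # retained network edges

rng = np.random.default_rng(7)
X = rng.standard_normal((40, 100))
X[:, :10] += 2.0 * rng.standard_normal((40, 1))  # shared latent factor on 10 features
print(pc_corr_sketch(X))
```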
Venson, José Eduardo; Bevilacqua, Fernando; Berni, Jean; Onuki, Fabio; Maciel, Anderson
2018-05-01
Mobile devices and software are now available with sufficient computing power, speed and complexity to allow for real-time interpretation of radiology exams. In this paper, we perform a multivariable user study that investigates the concordance of image-based diagnoses provided on mobile devices with those provided on conventional workstations. We performed a between-subjects task analysis using CT, MRI and radiography datasets. Moreover, we investigated the adequacy of the screen size, image quality, usability and the availability of the tools necessary for the analysis. Radiologists, members of several teams, participated in the experiment under real work conditions. A total of 64 studies with 93 main diagnoses were analyzed. Our results showed that 56 cases were classified with complete concordance (87.69%), 5 cases with almost complete concordance (7.69%) and 1 case (1.56%) with partial concordance. Only 2 studies presented discordance between the reports (3.07%). The main cause of these disagreements was the lack of a multiplanar reconstruction tool in the mobile viewer. Screen size and image quality had no direct impact on the mobile diagnosis process. We concluded that for images from emergency modalities, a mobile interface provides accurate interpretation and swift response, which could benefit patients' healthcare. Copyright © 2018 Elsevier B.V. All rights reserved.
Association between latent toxoplasmosis and cognition in adults: a cross-sectional study.
Gale, S D; Brown, B L; Erickson, L D; Berrett, A; Hedges, D W
2015-04-01
Latent infection from Toxoplasma gondii (T. gondii) is widespread worldwide and has been associated with cognitive deficits in some but not all animal models and in humans. We tested the hypothesis that latent toxoplasmosis is associated with decreased cognitive function in a large cross-sectional dataset, the National Health and Nutrition Examination Survey (NHANES). There were 4178 participants aged 20-59 years, of whom 19.1% had IgG antibodies against T. gondii. Two ordinary least squares (OLS) regression models adjusted for the NHANES complex sampling design and weighted to represent the US population were estimated for simple reaction time, processing speed and short-term memory or attention. The first model included only main effects of latent toxoplasmosis and demographic control variables, and the second added interaction terms between latent toxoplasmosis and the poverty-to-income ratio (PIR), educational attainment and race-ethnicity. We also used multivariate models to assess all three cognitive outcomes in the same model. Although the models evaluating main effects only demonstrated no association between latent toxoplasmosis and the cognitive outcomes, significant interactions between latent toxoplasmosis and the PIR, between latent toxoplasmosis and educational attainment, and between latent toxoplasmosis and race-ethnicity indicated that latent toxoplasmosis may adversely affect cognitive function in certain groups.
Multi-Target Regression via Robust Low-Rank Learning.
Zhen, Xiantong; Yu, Mengyang; He, Xiaofei; Li, Shuo
2018-02-01
Multi-target regression has recently regained great popularity due to its capability of simultaneously learning multiple relevant regression tasks and its wide applications in data mining, computer vision and medical image analysis, while great challenges arise from jointly handling inter-target correlations and input-output relationships. In this paper, we propose Multi-layer Multi-target Regression (MMR) which enables simultaneously modeling intrinsic inter-target correlations and nonlinear input-output relationships in a general framework via robust low-rank learning. Specifically, the MMR can explicitly encode inter-target correlations in a structure matrix by matrix elastic nets (MEN); the MMR can work in conjunction with the kernel trick to effectively disentangle highly complex nonlinear input-output relationships; the MMR can be efficiently solved by a new alternating optimization algorithm with guaranteed convergence. The MMR leverages the strength of kernel methods for nonlinear feature learning and the structural advantage of multi-layer learning architectures for inter-target correlation modeling. More importantly, it offers a new multi-layer learning paradigm for multi-target regression which is endowed with high generality, flexibility and expressive ability. Extensive experimental evaluation on 18 diverse real-world datasets demonstrates that our MMR can achieve consistently high performance and outperforms representative state-of-the-art algorithms, which shows its great effectiveness and generality for multivariate prediction.
Catelan, Dolores; Biggeri, Annibale
2008-11-01
In environmental epidemiology, long lists of relative risk estimates from exposed populations are compared to a reference to scrutinize the dataset for extremes. Here, inference on disease profiles for given areas, or for fixed disease population signatures, is of interest, and summaries can be obtained by averaging over areas or diseases. We have developed a multivariate hierarchical Bayesian approach to estimate posterior rank distributions, and we show how to produce league tables of ranks with credibility intervals useful for addressing the above-mentioned inferential problems. Applying the procedure to a real dataset from the report "Environment and Health in Sardinia (Italy)", we selected 18 areas characterized by high environmental pressure from industrial, mining or military activities, investigated for 29 causes of death among male residents. Ranking diseases highlighted the increased burdens of neoplastic (cancerous) and non-neoplastic respiratory diseases in the heavily polluted area of Portoscuso. The averaged ranks by disease over areas showed lung cancer among the three highest positions.
A CCA+ICA based model for multi-task brain imaging data fusion and its application to schizophrenia.
Sui, Jing; Adali, Tülay; Pearlson, Godfrey; Yang, Honghui; Sponheim, Scott R; White, Tonya; Calhoun, Vince D
2010-05-15
Collection of multiple-task brain imaging data from the same subject has now become common practice in medical imaging studies. In this paper, we propose a simple yet effective model, "CCA+ICA", as a powerful tool for multi-task data fusion. This joint blind source separation (BSS) model takes advantage of two multivariate methods, canonical correlation analysis and independent component analysis, to achieve high estimation accuracy and to provide the correct connection between two datasets in which sources can have either common or distinct between-dataset correlation. In both simulated and real fMRI applications, we compare the proposed scheme with other joint BSS models and examine the different modeling assumptions. The contrast images of two tasks, sensorimotor (SM) and Sternberg working memory (SB), derived from a general linear model (GLM), were chosen to provide real multi-task fMRI data, both collected from 50 schizophrenia patients and 50 healthy controls. When examining the relationship with duration of illness, CCA+ICA revealed a significant negative correlation with temporal lobe activation. Furthermore, CCA+ICA located the sensorimotor cortex as a group-discriminative region for both tasks, and identified the superior temporal gyrus in SM and the prefrontal cortex in SB as task-specific group-discriminative brain networks. In summary, we compared the new approach to several competitive methods with different assumptions and found consistent results regarding each of their hypotheses on connecting the two tasks. Such an approach fills a gap in existing multivariate methods for identifying biomarkers from brain imaging data.
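A minimal sketch of chaining the two multivariate steps named in the abstract: CCA links the two datasets through correlated canonical variates, and ICA then unmixes them into independent sources. This is an illustrative pipeline on toy data, not the authors' exact estimation scheme.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
S = rng.laplace(size=(200, 3))             # shared non-Gaussian sources
X = S @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
Y = S @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(200, 12))

# Step 1: CCA finds maximally correlated variates across the two datasets.
cca = CCA(n_components=3).fit(X, Y)
U, V = cca.transform(X, Y)

# Step 2: ICA sharpens the variates into independent sources.
sources = FastICA(n_components=3, random_state=0).fit_transform(np.hstack([U, V]))
print(sources.shape)                       # (200, 3)
```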
Stekolnikov, Alexandr A; Klimov, Pavel B
2010-09-01
We revise chiggers belonging to the minuta-species group (genus Neotrombicula Hirst, 1925) from the Palaearctic using size-free multivariate morphometrics. This approach allowed us to resolve several diagnostic problems. We show that the widely distributed Neotrombicula scrupulosa Kudryashova, 1993 forms three spatially and ecologically isolated groups different from each other in size or shape (morphometric property) only: specimens from the Caucasus are distinct from those from Asia in shape, whereas the Asian specimens from plains and mountains are different from each other in size. We developed a multivariate classification model to separate three closely related species: N. scrupulosa, N. lubrica Kudryashova, 1993 and N. minuta Schluger, 1966. This model is based on five shape variables selected from an initial 17 variables by a best subset analysis using a custom size-correction subroutine. The variable selection procedure slightly improved the predictive power of the model, suggesting that it not only removed redundancy but also reduced 'noise' in the dataset. The overall classification accuracy of this model is 96.2, 96.2 and 95.5%, as estimated by internal validation, external validation and jackknife statistics, respectively. Our analyses resulted in one new synonymy: N. dimidiata Stekolnikov, 1995 is considered to be a synonym of N. lubrica. Both N. scrupulosa and N. lubrica are recorded from new localities. A key to species of the minuta-group incorporating results from our multivariate analyses is presented.
NASA Astrophysics Data System (ADS)
Das, Bappa; Sahoo, Rabi N.; Pargal, Sourabh; Krishna, Gopal; Verma, Rakesh; Chinnusamy, Viswanathan; Sehgal, Vinay K.; Gupta, Vinod K.; Dash, Sushanta K.; Swain, Padmini
2018-03-01
In the present investigation, the changes in sucrose, reducing and total sugar content due to water-deficit stress in rice leaves were modeled using visible, near-infrared (VNIR) and shortwave-infrared (SWIR) spectroscopy. The objectives of the study were to identify the best vegetation indices and the most suitable multivariate technique based on analysis of hyperspectral data (350 to 2500 nm) and of sucrose, reducing sugar and total sugar content measured at different stress levels in 16 rice genotypes. Spectral data analysis was carried out to identify suitable spectral indices and models for sucrose estimation. Novel spectral indices in the near-infrared (NIR) range, viz. the ratio spectral index (RSI) and normalised difference spectral index (NDSI), sensitive to sucrose, reducing sugar and total sugar content were identified and subsequently calibrated and validated. The RSI and NDSI models had R2 values of 0.65, 0.71 and 0.67, and RPD values of 1.68, 1.95 and 1.66 for sucrose, reducing sugar and total sugar, respectively, on the validation dataset. Different multivariate spectral models, including artificial neural network (ANN), multivariate adaptive regression splines (MARS), multiple linear regression (MLR), partial least squares regression (PLSR), random forest regression (RFR) and support vector machine regression (SVMR), were also evaluated. The best-performing multivariate models for sucrose, reducing sugars and total sugars were MARS, ANN and MARS, respectively, with RPD values of 2.08, 2.44 and 1.93. The results indicated that VNIR and SWIR spectroscopy combined with multivariate calibration can be used as a reliable alternative to conventional methods for the measurement of sucrose, reducing sugars and total sugars in rice under water-deficit stress, as this technique is fast, economical and noninvasive.
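A sketch of the two index forms named above, computed over all NIR band pairs and ranked by correlation with a measured sugar content. The reflectance array, band grid and synthetic sucrose values are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
wavelengths = np.arange(700, 1301, 10)                   # nm, assumed NIR grid
R = rng.uniform(0.1, 0.9, size=(50, wavelengths.size))   # 50 leaf spectra
sucrose = 10 + 5 * (R[:, 12] / R[:, 30]) + rng.normal(scale=0.2, size=50)

best = (None, 0.0)
for i in range(wavelengths.size):
    for j in range(i + 1, wavelengths.size):
        rsi = R[:, i] / R[:, j]                           # ratio index
        ndsi = (R[:, i] - R[:, j]) / (R[:, i] + R[:, j])  # normalised difference
        for name, idx in (("RSI", rsi), ("NDSI", ndsi)):
            r2 = np.corrcoef(idx, sucrose)[0, 1] ** 2
            if r2 > best[1]:
                best = ((name, wavelengths[i], wavelengths[j]), r2)

print(best)   # e.g. (('RSI', 820, 1000), 0.97)
```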
Mining Recent Temporal Patterns for Event Detection in Multivariate Time Series Data
Batal, Iyad; Fradkin, Dmitriy; Harrison, James; Moerchen, Fabian; Hauskrecht, Milos
2015-01-01
Improving the performance of classifiers using pattern mining techniques has been an active topic of data mining research. In this work we introduce the recent temporal pattern mining framework for finding predictive patterns for monitoring and event detection problems in complex multivariate time series data. This framework first converts time series into time-interval sequences of temporal abstractions. It then constructs more complex temporal patterns backwards in time using temporal operators. We apply our framework to health care data of 13,558 diabetic patients and show its benefits by efficiently finding useful patterns for detecting and diagnosing adverse medical conditions that are associated with diabetes. PMID:25937993
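A sketch of the framework's first step as described above: converting a numeric time series into a sequence of labelled time intervals (temporal abstractions). The thresholds, labels and glucose example are illustrative assumptions.

```python
from itertools import groupby

def abstract_series(times, values, low=4.0, high=7.0):
    """Map each value to a state, then merge consecutive equal states
    into (state, start_time, end_time) intervals."""
    states = ["LOW" if v < low else "HIGH" if v > high else "NORMAL"
              for v in values]
    intervals, i = [], 0
    for state, run in groupby(states):
        n = len(list(run))
        intervals.append((state, times[i], times[i + n - 1]))
        i += n
    return intervals

glucose = [5.1, 5.3, 8.2, 8.9, 7.8, 4.5, 3.2, 3.0]
hours = list(range(8))
print(abstract_series(hours, glucose))
# [('NORMAL', 0, 1), ('HIGH', 2, 4), ('NORMAL', 5, 5), ('LOW', 6, 7)]
```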
NASA Astrophysics Data System (ADS)
Yan, Ying; Zhang, Shen; Tang, Jinjun; Wang, Xiaofei
2017-07-01
Discovering dynamic characteristics in traffic flow is a significant step in designing effective traffic management and control strategies for relieving congestion in urban cities. A new method based on complex network theory is proposed to study multivariate traffic flow time series. The data were collected from loop detectors on a freeway over one year. To construct a complex network from the original traffic flow, a weighted Frobenius norm is adopted to estimate the similarity between multivariate time series, and principal component analysis is implemented to determine the weights. We discuss how to select the optimal critical threshold for networks at different hours in terms of the cumulative probability distribution of degree. Furthermore, two statistical properties of the networks, normalized network structure entropy and cumulative probability of degree, are utilized to explore hourly variation in traffic flow. The results demonstrate that these two statistical quantities express patterns similar to traffic flow parameters, with morning and evening peak hours. Accordingly, we detect three traffic states, trough, peak and transitional hours, according to the correlation between the two aforementioned properties. The classification of states can represent hourly fluctuation in traffic flow, as shown by analyzing annual average hourly values of traffic volume, occupancy and speed in the corresponding hours.
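A hedged sketch of the construction described above: similarity between multivariate traffic-flow windows via a weighted Frobenius norm (weights from PCA explained variance), thresholded into an unweighted network whose degree distribution yields a normalized structure entropy. The data, window layout and 20% threshold are synthetic assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 48 half-hour windows, each a (30 samples x 3 variables) block:
# volume, occupancy, speed.
windows = rng.normal(size=(48, 30, 3))

# Variable weights from PCA on the pooled observations.
w = PCA(n_components=3).fit(windows.reshape(-1, 3)).explained_variance_ratio_

n = len(windows)
D = np.zeros((n, n))
for a in range(n):
    for b in range(n):
        diff = windows[a] - windows[b]
        D[a, b] = np.sqrt((w * diff ** 2).sum())   # weighted Frobenius norm

A = (D < np.percentile(D, 20)) & ~np.eye(n, dtype=bool)  # adjacency matrix
deg = A.sum(axis=1)
p = deg / deg.sum()
entropy = -(p[p > 0] * np.log(p[p > 0])).sum() / np.log(n)  # normalised
print(f"mean degree {deg.mean():.1f}, structure entropy {entropy:.3f}")
```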
Exploratory visualization software for reporting environmental survey results.
Fisher, P; Arnot, C; Bastin, L; Dykes, J
2001-08-01
Environmental surveys yield three principal products: maps, a set of data tables, and a textual report. The relationships between these three elements, however, are often cumbersome to present, making full use of all the information in an integrated and systematic sense difficult. The published paper report is only a partial solution. Modern developments in computing, particularly in cartography, GIS, and hypertext, mean that it is increasingly possible to conceive of an easier and more interactive approach to the presentation of such survey results. Here, we present such an approach which links map and tabular datasets arising from a vegetation survey, allowing users ready access to a complex dataset using dynamic mapping techniques. Multimedia datasets equipped with software like this provide an exciting means of quick and easy visual data exploration and comparison. These techniques are gaining popularity across the sciences as scientists and decision-makers are presented with increasing amounts of diverse digital data. We believe that the software environment actively encourages users to make complex interrogations of the survey information, providing a new vehicle for the reader of an environmental survey report.
A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie
Hanke, Michael; Baumgartner, Florian J.; Ibe, Pierre; Kaule, Falko R.; Pollmann, Stefan; Speck, Oliver; Zinke, Wolf; Stadler, Jörg
2014-01-01
Here we present a high-resolution functional magnetic resonance (fMRI) dataset – 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film (“Forrest Gump”). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, and social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures – from stimulus creation to data analysis. In order to facilitate replicative and derived works, only free and open-source software was utilized. PMID:25977761
A note on a simplified and general approach to simulating from multivariate copula functions
Barry K. Goodwin
2013-01-01
Copulas have become an important analytic tool for characterizing multivariate distributions and dependence. One is often interested in simulating data from copula estimates. The process can be analytically and computationally complex and usually involves steps that are unique to a given parametric copula. We describe an alternative approach that uses "Probability-...
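The record above is truncated; as a hedged illustration of the general probability-integral-transform idea it gestures at, here is the textbook recipe for simulating from a Gaussian copula with arbitrary marginals (the specific marginals and correlation are assumptions).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
corr = np.array([[1.0, 0.7], [0.7, 1.0]])

# 1) Draw correlated standard normals.
z = rng.multivariate_normal(mean=[0, 0], cov=corr, size=10_000)
# 2) Probability integral transform to uniforms on [0, 1].
u = stats.norm.cdf(z)
# 3) Invert the desired marginal CDFs (here gamma and lognormal).
x1 = stats.gamma(a=2.0, scale=1.5).ppf(u[:, 0])
x2 = stats.lognorm(s=0.5).ppf(u[:, 1])

print(stats.spearmanr(x1, x2)[0])   # rank correlation is preserved
```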
GODIVA2: interactive visualization of environmental data on the Web.
Blower, J D; Haines, K; Santokhee, A; Liu, C L
2009-03-13
GODIVA2 is a dynamic website that provides visual access to several terabytes of physically distributed, four-dimensional environmental data. It allows users to explore large datasets interactively without the need to install new software or download and understand complex data. Through the use of open international standards, GODIVA2 maintains a high level of interoperability with third-party systems, allowing diverse datasets to be mutually compared. Scientists can use the system to search for features in large datasets and to diagnose the output from numerical simulations and data processing algorithms. Data providers around Europe have adopted GODIVA2 as an INSPIRE-compliant dynamic quick-view system for providing visual access to their data.
A geologic and mineral exploration spatial database for the Stillwater Complex, Montana
Zientek, Michael L.; Parks, Heather L.
2014-01-01
This report provides essential spatially referenced datasets based on geologic mapping and mineral exploration activities conducted from the 1920s to the 1990s. This information will facilitate research on the complex and provide background material needed to explore for mineral resources and to develop sound land-management policy.
NASA Astrophysics Data System (ADS)
Hampel, B.; Liu, B.; Nording, F.; Ostermann, J.; Struszewski, P.; Langfahl-Klabes, J.; Bieler, M.; Bosse, H.; Güttler, B.; Lemmens, P.; Schilling, M.; Tutsch, R.
2018-03-01
In many cases, the determination of the measurement uncertainty of complex nanosystems presents unexpected challenges. This is particularly true for complex systems with many degrees of freedom, i.e. nanosystems with multiparametric dependencies and multivariate output quantities. The aim of this paper is to address specific questions arising during the uncertainty calculation of such systems. This includes the division of the measurement system into subsystems and the distinction between systematic and statistical influences. We demonstrate that, even if the physical systems under investigation are very different, the corresponding uncertainty calculation can always be realized in a similar manner. This is shown in detail for two example experiments, namely magnetic nanosensors and ultrafast electro-optical sampling of complex time-domain signals. For these examples the approach for uncertainty calculation following the Guide to the Expression of Uncertainty in Measurement (GUM) is explained, in which correlations between multivariate output quantities are captured. To illustrate the versatility of the proposed approach, its application to other experiments, namely nanometrological instruments for terahertz microscopy, dimensional scanning probe microscopy, and measurement of the concentration of molecules using surface-enhanced Raman scattering, is briefly discussed in the appendix. We believe that the proposed approach provides a simple but comprehensive orientation for uncertainty calculation in the discussed measurement scenarios and can also be applied to similar or related situations.
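A minimal sketch of GUM-style propagation of uncertainty for a multivariate output y = f(x): U_y = J U_x J^T, with J the Jacobian of f. The two-output model function and input covariance below are illustrative assumptions.

```python
import numpy as np

def f(x):
    """Toy two-output measurement model."""
    return np.array([x[0] * x[1], x[0] / x[2]])

def jacobian(f, x, eps=1e-7):
    """Forward-difference Jacobian of f at x."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - y0) / eps
    return J

x = np.array([2.0, 3.0, 4.0])
Ux = np.diag([0.01, 0.02, 0.01]) ** 2      # input covariance (uncorrelated)
J = jacobian(f, x)
Uy = J @ Ux @ J.T                          # output covariance, incl. correlation
print(np.sqrt(np.diag(Uy)))                # standard uncertainties of y
print(Uy[0, 1] / np.sqrt(Uy[0, 0] * Uy[1, 1]))  # correlation between outputs
```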
On mining complex sequential data by means of FCA and pattern structures
NASA Astrophysics Data System (ADS)
Buzmakov, Aleksey; Egho, Elias; Jay, Nicolas; Kuznetsov, Sergei O.; Napoli, Amedeo; Raïssi, Chedy
2016-02-01
Nowadays, datasets are available in very complex and heterogeneous forms. Mining such data collections is essential to support many real-world applications, ranging from healthcare to marketing. In this work, we focus on the analysis of "complex" sequential data by means of interesting sequential patterns. We approach the problem using the elegant mathematical framework of formal concept analysis and its extension based on "pattern structures". Pattern structures are used for mining complex data (such as sequences or graphs) and are based on a subsumption operation, which in our case is defined with respect to the partial order on sequences. We show how pattern structures, along with projections (i.e., a data reduction of sequential structures), are able to enumerate more meaningful patterns and increase the computational efficiency of the approach. Finally, we show the applicability of the presented method for discovering and analysing interesting patient patterns from a French healthcare dataset on cancer. The quantitative and qualitative results (with annotations and analysis from a physician) are reported in this use case, which is the main motivation for this work.
Hypergraph Based Feature Selection Technique for Medical Diagnosis.
Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar
2016-11-01
The impact of the internet and information systems across various domains has resulted in the substantial generation of multidimensional datasets. The use of data mining and knowledge discovery techniques to extract the information contained in multidimensional datasets plays a significant role in exploiting the full benefit they provide. The presence of a large number of features in high-dimensional datasets incurs high computational cost in terms of computing power and time. Hence, feature selection techniques are commonly used to build robust machine learning models by selecting a subset of relevant features that projects the maximal information content of the original dataset. In this paper, a novel Rough Set based K-Helly feature selection technique (RSKHT), which hybridizes Rough Set Theory (RST) and the K-Helly property of hypergraph representation, was designed to identify the optimal feature subset or reduct for medical diagnostic applications. Experiments carried out using medical datasets from the UCI repository demonstrate the superiority of RSKHT over other feature selection techniques with respect to reduct size, classification accuracy and time complexity. The performance of RSKHT was validated using the WEKA tool, showing that RSKHT is computationally attractive and flexible on massive datasets.
NASA Astrophysics Data System (ADS)
Pariser, O.; Calef, F.; Manning, E. M.; Ardulov, V.
2017-12-01
We will present the implementation and study of several use cases of utilizing Virtual Reality (VR) for immersive display, interaction and analysis of large and complex 3D datasets. These datasets have been acquired by instruments across several Earth, planetary and solar space robotics missions. First, we will describe the architecture of the common application framework that was developed to input data and to interface with VR display devices and input controllers in various computing environments. Tethered and portable VR technologies will be contrasted and the advantages of each highlighted. We will then present experimental immersive-analytics visual constructs that enable the augmentation of 3D datasets with 2D ones, such as images and statistical or abstract data. We will conclude by presenting a comparative analysis with traditional visualization applications and sharing the feedback provided by our users: scientists and engineers.
Long-term dataset on aquatic responses to concurrent climate change and recovery from acidification
NASA Astrophysics Data System (ADS)
Leach, Taylor H.; Winslow, Luke A.; Acker, Frank W.; Bloomfield, Jay A.; Boylen, Charles W.; Bukaveckas, Paul A.; Charles, Donald F.; Daniels, Robert A.; Driscoll, Charles T.; Eichler, Lawrence W.; Farrell, Jeremy L.; Funk, Clara S.; Goodrich, Christine A.; Michelena, Toby M.; Nierzwicki-Bauer, Sandra A.; Roy, Karen M.; Shaw, William H.; Sutherland, James W.; Swinton, Mark W.; Winkler, David A.; Rose, Kevin C.
2018-04-01
Concurrent regional and global environmental changes are affecting freshwater ecosystems. Decadal-scale data on lake ecosystems that can describe processes affected by these changes are important as multiple stressors often interact to alter the trajectory of key ecological phenomena in complex ways. Due to the practical challenges associated with long-term data collections, the majority of existing long-term data sets focus on only a small number of lakes or few response variables. Here we present physical, chemical, and biological data from 28 lakes in the Adirondack Mountains of northern New York State. These data span the period from 1994-2012 and harmonize multiple open and as-yet unpublished data sources. The dataset creation is reproducible and transparent; R code and all original files used to create the dataset are provided in an appendix. This dataset will be useful for examining ecological change in lakes undergoing multiple stressors.
Multivariate prediction of odor from pig production based on in-situ measurement of odorants
NASA Astrophysics Data System (ADS)
Hansen, Michael J.; Jonassen, Kristoffer E. N.; Løkke, Mette Marie; Adamsen, Anders Peter S.; Feilberg, Anders
2016-06-01
The aim of the present study was to estimate a prediction model for odor from pig production facilities based on measurements of odorants by Proton-Transfer-Reaction Mass spectrometry (PTR-MS). Odor measurements were performed at four different pig production facilities with and without odor abatement technologies using a newly developed mobile odor laboratory equipped with a PTR-MS for measuring odorants and an olfactometer for measuring the odor concentration by human panelists. A total of 115 odor measurements were carried out in the mobile laboratory and simultaneously air samples were collected in Nalophan bags and analyzed at accredited laboratories after 24 h. The dataset was divided into a calibration dataset containing 94 samples and a validation dataset containing 21 samples. The prediction model based on the measurements in the mobile laboratory was able to explain 74% of the variation in the odor concentration based on odorants, whereas the prediction models based on odor measurements with bag samples explained only 46-57%. This study is the first application of direct field olfactometry to livestock odor and emphasizes the importance of avoiding any bias from sample storage in studies of odor-odorant relationships. Application of the model on the validation dataset gave a high correlation between predicted and measured odor concentration (R2 = 0.77). Significant odorants in the prediction models include phenols and indoles. In conclusion, measurements of odorants on-site in pig production facilities is an alternative to dynamic olfactometry that can be applied for measuring odor from pig houses and the effects of odor abatement technologies.
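A sketch of the calibration/validation split described above: a multivariate model predicting log odor concentration from odorant concentrations. The abstract does not name the regression method, so PLS is an assumption here, and the data are synthetic stand-ins for the PTR-MS measurements.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(6)
odorants = rng.lognormal(size=(115, 20))          # 115 samples, 20 odorants
log_odor = (0.8 * np.log(odorants[:, 3])          # e.g. a phenol
            + 0.5 * np.log(odorants[:, 7])        # e.g. an indole
            + rng.normal(scale=0.2, size=115))

cal, val = np.arange(94), np.arange(94, 115)      # 94/21 split as in the text
model = PLSRegression(n_components=4).fit(np.log(odorants[cal]), log_odor[cal])
pred = model.predict(np.log(odorants[val])).ravel()
print(f"validation R2 = {r2_score(log_odor[val], pred):.2f}")
```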
Yu, Kaixin; Wang, Xuetong; Li, Qiongling; Zhang, Xiaohui; Li, Xinwei; Li, Shuyu
2018-01-01
Morphological brain network plays a key role in investigating abnormalities in neurological diseases such as mild cognitive impairment (MCI) and Alzheimer's disease (AD). However, most of the morphological brain network construction methods only considered a single morphological feature. Each type of morphological feature has specific neurological and genetic underpinnings. A combination of morphological features has been proven to have better diagnostic performance compared with a single feature, which suggests that an individual morphological brain network based on multiple morphological features would be beneficial in disease diagnosis. Here, we proposed a novel method to construct individual morphological brain networks for two datasets by calculating the exponential function of multivariate Euclidean distance as the evaluation of similarity between two regions. The first dataset included 24 healthy subjects who were scanned twice within a 3-month period. The topological properties of these brain networks were analyzed and compared with previous studies that used different methods and modalities. Small world property was observed in all of the subjects, and the high reproducibility indicated the robustness of our method. The second dataset included 170 patients with MCI (86 stable MCI and 84 progressive MCI cases) and 169 normal controls (NC). The edge features extracted from the individual morphological brain networks were used to distinguish MCI from NC and separate MCI subgroups (progressive vs. stable) through the support vector machine in order to validate our method. The results showed that our method achieved an accuracy of 79.65% (MCI vs. NC) and 70.59% (stable MCI vs. progressive MCI) in a one-dimension situation. In a multiple-dimension situation, our method improved the classification performance with an accuracy of 80.53% (MCI vs. NC) and 77.06% (stable MCI vs. progressive MCI) compared with the method using a single feature. The results indicated that our method could effectively construct an individual morphological brain network based on multiple morphological features and could accurately discriminate MCI from NC and stable MCI from progressive MCI, and may provide a valuable tool for the investigation of individual morphological brain networks.
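A minimal sketch of the similarity rule described above: for each pair of brain regions, the edge weight is an exponential function of the multivariate Euclidean distance between their standardized morphological feature vectors. The form exp(-d), the region count and the feature set are assumptions; the paper's exact kernel may differ.

```python
import numpy as np

rng = np.random.default_rng(7)
n_regions, n_features = 90, 4        # e.g. thickness, area, volume, curvature
F = rng.normal(size=(n_regions, n_features))
F = (F - F.mean(0)) / F.std(0)       # z-score each morphological feature

diff = F[:, None, :] - F[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))  # multivariate Euclidean distance
W = np.exp(-dist)                    # similarity in (0, 1]
np.fill_diagonal(W, 0)               # no self-connections

print(W.shape)                       # one network per subject
```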
Ramnarayan, Padmanabhan; Dimitriades, Konstantinos; Freeburn, Lynsey; Kashyap, Aravind; Dixon, Michaela; Barry, Peter W; Claydon-Smith, Kathryn; Wardhaugh, Allan; Lamming, Caroline R; Draper, Elizabeth S
2018-06-01
International data on characteristics and outcomes of children transported from general hospitals to PICUs are scarce. We aimed to 1) describe the development of a common transport dataset in the United Kingdom and Ireland and 2) analyze transport data from a recent 2-year period. Retrospective analysis of prospectively collected data. Specialist pediatric critical care transport teams and PICUs in the United Kingdom and Ireland. Critically ill children less than 16 years old transported by pediatric critical care transport teams to PICUs in the United Kingdom and Ireland. None. A common transport dataset was developed as part of the Paediatric Intensive Care Audit Network, and standardized data were collected from all PICUs and pediatric critical care transport teams from 2012. Anonymized data on transports (and linked PICU admissions) from a 2-year period (2014-2015) were analyzed to describe patient and transport characteristics, and in uni- and multivariate analyses, to study the association between key transport factors and PICU mortality. A total of 8,167 records were analyzed. Transported children were severely ill (median predicted mortality risk 4.4%) with around half being infants (4,226/8,167; 51.7%) and nearly half presenting with respiratory illnesses (3,619/8,167; 44.3%). The majority of transports were led by physicians (78.4%; consultants: 3,059/8,167, fellows: 3,344/8,167). The median time for a pediatric critical care transport team to arrive at the patient's bedside from referral was 85 minutes (interquartile range, 58-135 min). Adverse events occurred in 369 transports (4.5%). There were considerable variations in how transports were organized and delivered across pediatric critical care transport teams. In multivariate analyses, consultant team leader and transport from an intensive care area were associated with PICU mortality (p = 0.006). Variations exist in United Kingdom and Ireland services for critically ill children needing interhospital transport. Future studies should assess the impact of these variations on long-term patient outcomes taking into account treatment provided prior to transport.
Johnstone, Daniel; Milward, Elizabeth A.; Berretta, Regina; Moscato, Pablo
2012-01-01
Background: Recent Alzheimer's disease (AD) research has focused on finding biomarkers to identify disease at the pre-clinical stage of mild cognitive impairment (MCI), allowing treatment to be initiated before irreversible damage occurs. Many studies have examined brain imaging or cerebrospinal fluid but there is also growing interest in blood biomarkers. The Alzheimer's Disease Neuroimaging Initiative (ADNI) has generated data on 190 plasma analytes in 566 individuals with MCI, AD or normal cognition. We conducted independent analyses of this dataset to identify plasma protein signatures predicting pre-clinical AD. Methods and Findings: We focused on identifying signatures that discriminate cognitively normal controls (n = 54) from individuals with MCI who subsequently progress to AD (n = 163). Based on p value, apolipoprotein E (APOE) showed the strongest difference between these groups (p = 2.3×10−13). We applied a multivariate approach based on combinatorial optimization ((α,β)-k Feature Set Selection), which retains information about individual participants and maintains the context of interrelationships between different analytes, to identify the optimal set of analytes (signature) to discriminate these two groups. We identified 11-analyte signatures achieving values of sensitivity and specificity between 65% and 86% for both MCI and AD groups, depending on whether APOE was included and other factors. Classification accuracy was improved by considering “meta-features,” representing the difference in relative abundance of two analytes, with an 8-meta-feature signature consistently achieving sensitivity and specificity both over 85%. Generating signatures based on longitudinal rather than cross-sectional data further improved classification accuracy, returning sensitivities and specificities of approximately 90%. Conclusions: Applying these novel analysis approaches to the powerful and well-characterized ADNI dataset has identified sets of plasma biomarkers for pre-clinical AD. While studies of independent test sets are required to validate the signatures, these analyses provide a starting point for developing a cost-effective and minimally invasive test capable of diagnosing AD in its pre-clinical stages. PMID:22485168
Multivariate analysis in thoracic research.
Mengual-Macenlle, Noemí; Marcos, Pedro J; Golpe, Rafael; González-Rivas, Diego
2015-03-01
Multivariate analysis is based on the observation and analysis of more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest. Multivariate methods emerged to analyze large databases and increasingly complex data. Since modeling is the best way to represent knowledge of reality, multivariate statistical methods should be used. Multivariate methods are designed to analyze data sets simultaneously, i.e., to analyze different variables for each person or object studied. It should be kept in mind at all times that all variables must be treated in a way that accurately reflects the reality of the problem addressed. There are different types of multivariate analysis, and each should be employed according to the type of variables to be analyzed: dependence, interdependence and structural methods. In conclusion, multivariate methods are ideal for the analysis of large data sets and for finding cause-and-effect relationships between variables; there is a wide range of analysis types that we can use.
Quantifying uncertainty in high-resolution coupled hydrodynamic-ecosystem models
NASA Astrophysics Data System (ADS)
Allen, J. I.; Somerfield, P. J.; Gilbert, F. J.
2007-01-01
Marine ecosystem models are becoming increasingly complex and sophisticated, and are being used to estimate the effects of future changes in the earth system with a view to informing important policy decisions. Despite their potential importance, far too little attention is generally paid to model errors and the extent to which model outputs actually relate to real-world processes. With the increasing complexity of the models themselves comes an increasing complexity among model results. If we are to develop useful modelling tools for the marine environment we need to be able to understand and quantify the uncertainties inherent in the simulations. Analysing errors within highly multivariate model outputs, and relating them to even more complex and multivariate observational data, are not trivial tasks. Here we describe the application of a series of techniques, including a 2-stage self-organising map (SOM), non-parametric multivariate analysis, and error statistics, to a complex spatio-temporal model run for the period 1988-1989 in the Southern North Sea, coinciding with the North Sea Project, which collected a wealth of observational data. We use model output, large spatio-temporally resolved data sets and a combination of methodologies (SOM, MDS, uncertainty metrics) to simplify the problem and to provide tractable information on model performance. The use of a SOM as a clustering tool allows us to simplify the dimensions of the problem, while the use of MDS on independent data grouped according to the SOM classification allows us to validate the SOM. The combination of classification and uncertainty metrics allows us to pinpoint the variables and associated processes which require attention in each region. We recommend the use of this combination of techniques for simplifying complex comparisons of model outputs with real data, and for the analysis of error distributions.
Jiang, Zhehan; Skorupski, William
2017-12-12
In many behavioral research areas, multivariate generalizability theory (mG theory) has typically been used to investigate the reliability of certain multidimensional assessments. However, traditional mG-theory estimation, namely frequentist approaches, has limits, leading researchers to fail to take full advantage of the information that mG theory can offer regarding the reliability of measurements. Alternatively, Bayesian methods provide more information than frequentist approaches can offer. This article presents instructional guidelines on how to implement mG-theory analyses in a Bayesian framework; in particular, BUGS code is presented to fit commonly seen designs from mG theory, including single-facet designs, two-facet crossed designs, and two-facet nested designs. In addition to concrete examples that are closely related to the selected designs and the corresponding BUGS code, a simulated dataset is provided to demonstrate the utility and advantages of the Bayesian approach. This article is intended to serve as a tutorial reference for applied researchers and methodologists conducting mG-theory studies.
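The article itself presents BUGS code, which is not reproduced here; as a hedged Python counterpart, this PyMC sketch fits the single-facet (persons x items) G-theory model y_pi = mu + person_p + item_i + error on simulated data, with the generalizability coefficient computed from the variance components as a by-product. Priors and data sizes are assumptions.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(8)
n_p, n_i = 50, 8
person = rng.normal(0, 1.0, n_p)[:, None]
item = rng.normal(0, 0.5, n_i)[None, :]
y = 5 + person + item + rng.normal(0, 0.8, (n_p, n_i))

with pm.Model():
    mu = pm.Normal("mu", 0, 10)
    s_p = pm.HalfNormal("s_p", 2)          # person variance component
    s_i = pm.HalfNormal("s_i", 2)          # item variance component
    s_e = pm.HalfNormal("s_e", 2)          # residual
    p_eff = pm.Normal("p_eff", 0, s_p, shape=(n_p, 1))
    i_eff = pm.Normal("i_eff", 0, s_i, shape=(1, n_i))
    pm.Normal("y", mu + p_eff + i_eff, s_e, observed=y)
    # G coefficient for a mean over n_i items.
    pm.Deterministic("G", s_p**2 / (s_p**2 + s_e**2 / n_i))
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(idata.posterior["G"].mean().item())
```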
Kao, Raymond; Priestap, Fran; Donner, Allan
2016-01-01
Intensive care unit (ICU) scoring systems or prediction models evolved to meet the desire of clinical and administrative leaders to assess the quality of care provided by their ICUs. The Critical Care Information System (CCIS) collects province-wide data for this purpose from all level 2 and level 3 ICUs in Ontario, Canada. With this dataset, we developed a multivariable logistic regression model predicting ICU mortality from data available during the first 24 h of ICU admission, using explanatory variables including two validated scores, the Multiple Organ Dysfunction Score (MODS) and the Nine Equivalents of Nursing Manpower Use Score (NEMS), together with age, sex, readmission to the ICU during the same hospital stay, admission diagnosis, source of admission, and the modified Charlson Comorbidity Index (CCI) collected through hospital health records. This study is a single-center retrospective cohort review of 8822 records from the Critical Care Trauma Centre (CCTC) and Medical-Surgical Intensive Care Unit (MSICU) of London Health Sciences Centre (LHSC), Ontario, Canada, between 1 Jan 2009 and 30 Nov 2012. The model was developed by multivariable logistic regression on a training dataset (n = 4321) and validated by bootstrapping on a testing dataset (n = 4501). Discrimination, calibration, and overall model performance were also assessed. The predictors significantly associated with ICU mortality included age (p < 0.001), source of admission (p < 0.0001), ICU admitting diagnosis (p < 0.0001), MODS (p < 0.0001), and NEMS (p < 0.0001). The variables sex and modified CCI were not significantly associated with ICU mortality. The developed model had good ability to discriminate between patients at high and low risk of mortality on the training dataset (c-statistic 0.787). The Hosmer-Lemeshow goodness-of-fit test showed strong agreement between observed and expected ICU mortality (χ2 = 5.48; p > 0.31). The overall optimism of the estimate between the training and testing datasets was ΔAUC = 0.003, indicating a stable prediction model. This study demonstrates that CCIS data available after the first 24 h of ICU admission at LHSC can be used to create a robust mortality prediction model with acceptable fit statistics and internal validity for valid benchmarking and monitoring of ICU performance.
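A sketch of the model-building recipe described above: multivariable logistic regression on a training set, with bootstrap estimation of the optimism in the c-statistic (AUC). The predictors are synthetic stand-ins for MODS, NEMS and age, and the coefficients are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(9)
n = 4321
X = np.column_stack([rng.normal(8, 4, n),        # MODS-like score
                     rng.normal(30, 8, n),       # NEMS-like score
                     rng.normal(60, 15, n)])     # age
logit = -6 + 0.25 * X[:, 0] + 0.05 * X[:, 1] + 0.02 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(max_iter=1000).fit(X, y)
auc_apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])

optimism = []
for _ in range(200):                              # bootstrap resamples
    idx = rng.integers(0, n, n)
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

print(f"apparent AUC {auc_apparent:.3f}, optimism-corrected "
      f"{auc_apparent - np.mean(optimism):.3f}")
```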
PyMVPA: A python toolbox for multivariate pattern analysis of fMRI data.
Hanke, Michael; Halchenko, Yaroslav O; Sederberg, Per B; Hanson, Stephen José; Haxby, James V; Pollmann, Stefan
2009-01-01
Decoding patterns of neural activity onto cognitive states is one of the central goals of functional brain imaging. Standard univariate fMRI analysis methods, which correlate cognitive and perceptual function with the blood oxygenation-level dependent (BOLD) signal, have proven successful in identifying anatomical regions based on signal increases during cognitive and perceptual tasks. Recently, researchers have begun to explore new multivariate techniques that have proven to be more flexible, more reliable, and more sensitive than standard univariate analysis. Drawing on the field of statistical learning theory, these new classifier-based analysis techniques possess explanatory power that could provide new insights into the functional properties of the brain. However, unlike the wealth of software packages for univariate analyses, there are few packages that facilitate multivariate pattern classification analyses of fMRI data. Here we introduce a Python-based, cross-platform, and open-source software toolbox, called PyMVPA, for the application of classifier-based analysis techniques to fMRI datasets. PyMVPA makes use of Python's ability to access libraries written in a large variety of programming languages and computing environments to interface with the wealth of existing machine learning packages. We present the framework in this paper and provide illustrative examples on its usage, features, and programmability.
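PyMVPA's own API is not reproduced here; as a hedged illustration of the classifier-based analysis style it supports, this scikit-learn sketch decodes a two-condition design from synthetic voxel patterns with leave-one-run-out cross-validation, mirroring the dataset/chunks structure common in such analyses.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(10)
n_runs, n_trials, n_voxels = 8, 20, 500
X = rng.normal(size=(n_runs * n_trials, n_voxels))
y = np.tile(np.repeat([0, 1], n_trials // 2), n_runs)   # two conditions
X[y == 1, :40] += 0.3                                   # weak informative voxels
runs = np.repeat(np.arange(n_runs), n_trials)           # chunk structure

clf = make_pipeline(StandardScaler(), LinearSVC())
acc = cross_val_score(clf, X, y, groups=runs, cv=LeaveOneGroupOut())
print(f"decoding accuracy: {acc.mean():.2f} +/- {acc.std():.2f}")
```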
Carlesi, Serena; Ricci, Marilena; Cucci, Costanza; La Nasa, Jacopo; Lofrumento, Cristiana; Picollo, Marcello; Becucci, Maurizio
2015-07-01
This work explores the application of chemometric techniques to the analysis of lipidic paint binders (i.e., drying oils) by means of Raman and near-infrared spectroscopy. These binders have been widely used by artists throughout history, both individually and in mixtures. We prepared various model samples of the pure binders (linseed, poppy seed, and walnut oils) obtained from different manufacturers. These model samples were left to dry and then characterized by Raman and reflectance near-infrared spectroscopy. Multivariate analysis was performed by applying principal component analysis (PCA) on the first derivative of the corresponding Raman spectra (1800-750 cm(-1)), near-infrared spectra (6000-3900 cm(-1)), and their combination to test whether spectral differences could enable samples to be distinguished on the basis of their composition. The vibrational bands we found most useful to discriminate between the different products we studied are the fundamental ν(C=C) stretching and methylenic stretching and bending combination bands. The results of the multivariate analysis demonstrated the potential of chemometric approaches for characterizing and identifying drying oils, and also for gaining a deeper insight into the aging process. Comparison with high-performance liquid chromatography data was conducted to check the PCA results.
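A sketch of the chemometric step described above: PCA on first-derivative spectra (via a Savitzky-Golay filter) to separate oil classes. The spectra are synthetic placeholders for the Raman/NIR measurements, and the class-specific band positions are invented for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
wn = np.linspace(1800, 750, 400)                 # Raman shift grid, cm^-1

def oil(peak):
    """Toy spectrum: one class-specific Gaussian band."""
    return np.exp(-((wn - peak) / 25.0) ** 2)

spectra, labels = [], []
for peak, name in [(1655, "linseed"), (1660, "poppy"), (1665, "walnut")]:
    for _ in range(10):
        spectra.append(oil(peak) + 0.01 * rng.normal(size=wn.size))
        labels.append(name)
S = np.array(spectra)

# First derivative, then PCA on the derivative spectra.
d1 = savgol_filter(S, window_length=11, polyorder=2, deriv=1, axis=1)
scores = PCA(n_components=2).fit_transform(d1)
for name in ("linseed", "poppy", "walnut"):
    m = scores[[l == name for l in labels]].mean(axis=0)
    print(name, m.round(3))                      # class-wise score centroids
```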
Lee, Tsair-Fwu; Liou, Ming-Hsiang; Huang, Yu-Jie; Chao, Pei-Ju; Ting, Hui-Min; Lee, Hsiao-Yi
2014-01-01
To predict the incidence of moderate-to-severe patient-reported xerostomia among head and neck squamous cell carcinoma (HNSCC) and nasopharyngeal carcinoma (NPC) patients treated with intensity-modulated radiotherapy (IMRT). Multivariable normal tissue complication probability (NTCP) models were developed using quality-of-life questionnaire datasets from 152 patients with HNSCC and 84 patients with NPC. The primary endpoint was defined as moderate-to-severe xerostomia after IMRT. The number of predictive factors for a multivariable logistic regression model was determined using the least absolute shrinkage and selection operator (LASSO) with a bootstrapping technique. LASSO yielded four predictive models with the smallest number of factors while preserving predictive value (higher AUC performance). For all models, the dosimetric factors for the mean dose given to the contralateral and ipsilateral parotid glands were selected as the most significant predictors. Different clinical and socio-economic factors, namely age, financial status, T stage, and education, were then selected for the different models. The predicted incidence of xerostomia for HNSCC and NPC patients can be improved by using multivariable logistic regression models with the LASSO technique. The predictive model developed in HNSCC cannot be generalized to the NPC cohort treated with IMRT without validation, and vice versa. PMID:25163814
Kao, Ming-Chih Jeffrey; Jarosz, Renata; Goldin, Michael; Patel, Amy; Smuck, Matthew
2014-10-01
To develop and implement methodologies for characterizing accelerometry-derived patterns of physical activity (PA) in the United States in relation to demographics, anthropometrics, behaviors, and comorbidities using the National Health and Nutrition Examination Survey (NHANES) dataset. Retrospective analysis of a nationally representative database. Computer-generated modeling in silico. A total of 6329 adults in the United States from the NHANES 2003-2004 database. To discover subtle multivariate signal in the dynamic and noisy accelerometry data, we developed a novel approach, termed discretized multiple adaptive regression, and implemented the algorithm in SAS 9.2 (SAS Institute, Cary, NC). Demographic, anthropometric, comorbidity, and behavioral variables. The intensity of PA decreased with both increased age and increased body mass index. Both greater education and greater income correlate with increased activity over short durations and reduced activity intensity over long durations. Numerous predictors demonstrated effects within activity ranges that may be masked by use of the standard activity intensity intervals. These include age, one of the most robust variables, where we discovered decreasing activity inside the moderate activity range. They also include gender, where women compared with men have increased proportions of active time up to the center of the light activity range, and income greater than $45,000, where a complex effect is seen with little correspondence to existing cut-points. The results presented in this study suggest that the method of multiple regression and heat map visualization can generate insights otherwise hidden in large datasets such as NHANES. A review of the provided heat maps reveals the trends discussed previously involving demographic, anthropometric, comorbidity, and behavioral variables. It also demonstrates the power of accelerometry to expose alterations in PA. Ultimately, this study provides a US population-based norm for use in future studies of PA. Copyright © 2014 American Academy of Physical Medicine and Rehabilitation. Published by Elsevier Inc. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balke, Nina; Kalinin, Sergei V.; Jesse, Stephen
Kelvin probe force microscopy (KPFM) has provided deep insights into the role local electronic, ionic and electrochemical processes play in the global functionality of materials and devices, even down to the atomic scale. Conventional KPFM utilizes heterodyne detection and bias feedback to measure the contact potential difference (CPD) between tip and sample. This measurement paradigm, however, permits only partial recovery of the information encoded in bias- and time-dependent electrostatic interactions between the tip and sample, and effectively down-samples the cantilever response to a single measurement of CPD per pixel. This level of detail is insufficient for electroactive materials, devices, or solid-liquid interfaces, where non-linear dielectrics are present or spurious electrostatic events are possible. Here, we simulate and experimentally validate a novel approach for spatially resolved KPFM capable of a full information transfer of the dynamic electric processes occurring between tip and sample. General acquisition mode, or G-Mode, adopts a big data approach utilising high speed detection, compression, and storage of the raw cantilever deflection signal in its entirety at high sampling rates (> 4 MHz), providing a permanent record of the tip trajectory. We develop a range of methodologies for analysing the resultant large multidimensional datasets involving classical, physics-based and information-based approaches. Physics-based analysis of G-Mode KPFM data recovers the parabolic bias dependence of the electrostatic force for each cycle of the excitation voltage, leading to a multidimensional dataset containing the spatial and temporal dependence of the CPD and capacitance channels. We use multivariate statistical methods to reduce data volume and separate the complex multidimensional data sets into statistically significant components that can then be mapped onto separate physical mechanisms. Overall, G-Mode KPFM offers a new paradigm to study dynamic electric phenomena at electroactive interfaces and offers a promising approach for extending KPFM to solid-liquid interfaces.
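A minimal sketch of the physics-based analysis step just described, on a simulated signal: the force response over one bias cycle is fitted with a parabola F = a(V - V_cpd)^2 + c, whose vertex gives the CPD and whose curvature tracks the capacitance gradient. The signal parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(12)
V = np.linspace(-5, 5, 400)                       # one bias cycle
true_cpd, true_a = 0.8, -0.12
response = true_a * (V - true_cpd) ** 2 + 0.02 * rng.normal(size=V.size)

# Quadratic fit: F = p2 V^2 + p1 V + p0  ->  V_cpd = -p1 / (2 p2).
p2, p1, p0 = np.polyfit(V, response, 2)
cpd = -p1 / (2 * p2)
print(f"recovered CPD = {cpd:.3f} V (true {true_cpd} V), curvature {p2:.3f}")
```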
NASA Astrophysics Data System (ADS)
Jennings, E.; Madigan, M.
2017-04-01
Given the complexity of modern cosmological parameter inference, where we are faced with non-Gaussian data and noise, correlated systematics and multi-probe correlated datasets, the Approximate Bayesian Computation (ABC) method is a promising alternative to traditional Markov chain Monte Carlo approaches in cases where the likelihood is intractable or unknown. The ABC method is called "likelihood free" as it avoids explicit evaluation of the likelihood by using a forward-model simulation of the data which can include systematics. We introduce astroABC, an open-source ABC Sequential Monte Carlo (SMC) sampler for parameter estimation. A key challenge in astrophysics is the efficient use of large multi-probe datasets to constrain high-dimensional, possibly correlated parameter spaces. With this in mind, astroABC allows for massive parallelization using MPI, a framework that handles the spawning of processes across multiple nodes. A key new feature of astroABC is the ability to create MPI groups with different communicators, one for the sampler and several others for the forward-model simulation, which speeds up sampling time considerably. For smaller jobs the Python multiprocessing option is also available. Other key features of this new sampler include: a Sequential Monte Carlo sampler; a method for iteratively adapting tolerance levels; local covariance estimation using scikit-learn's KDTree; modules for specifying an optimal covariance matrix for a component-wise or multivariate normal perturbation kernel and a weighted covariance metric; restart files output frequently so an interrupted sampling run can be resumed at any iteration; output and restart files backed up at every iteration; user-defined distance metrics and simulation methods; a module for specifying heterogeneous parameter priors, including non-standard prior PDFs; a module for specifying a constant, linear, log or exponential tolerance level; and well-documented examples and sample scripts. This code is hosted online at https://github.com/EliseJ/astroABC.
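Not astroABC's API; a minimal rejection-ABC sketch of the "likelihood free" idea it implements: draw parameters from the prior, forward-simulate data, and keep draws whose simulated summary lies within a tolerance of the observed one. (astroABC itself uses the more efficient SMC variant with adaptive tolerances.)

```python
import numpy as np

rng = np.random.default_rng(13)
observed = rng.normal(loc=1.5, scale=1.0, size=200)   # "data"
obs_summary = observed.mean()

def simulate(mu):
    """Forward model: could include systematics, selection effects, etc."""
    return rng.normal(loc=mu, scale=1.0, size=200).mean()

tol, posterior = 0.05, []
for _ in range(50_000):
    mu = rng.uniform(-5, 5)                           # prior draw
    if abs(simulate(mu) - obs_summary) < tol:         # distance metric
        posterior.append(mu)

print(len(posterior), np.mean(posterior), np.std(posterior))
```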
Berthod, L; Whitley, D C; Roberts, G; Sharpe, A; Greenwood, R; Mills, G A
2017-02-01
Understanding the sorption of pharmaceuticals to sewage sludge during wastewater treatment processes is important for understanding their environmental fate and in risk assessments. The degree of sorption is defined by the sludge/water partition coefficient (Kd). Experimental Kd values (n = 297) for active pharmaceutical ingredients (n = 148) in primary and activated sludge were collected from the literature. The compounds were classified by their charge at pH 7.4 (44 uncharged, 60 positively and 28 negatively charged, and 16 zwitterions). Univariate models relating log Kd to log Kow for each charge class showed weak correlations (maximum R2 = 0.51 for positively charged compounds), with no overall correlation for the combined dataset (R2 = 0.04). Weaker correlations were found when relating log Kd to log Dow. Three sets of molecular descriptors (Molecular Operating Environment, VolSurf and ParaSurf) encoding a range of physico-chemical properties were used to derive multivariate models using stepwise regression, partial least squares and Bayesian artificial neural networks (ANN). The best predictive performance was obtained with ANN, with R2 = 0.62-0.69 for these descriptors using the complete dataset. Use of the more complex VolSurf and ParaSurf descriptors showed little improvement over the Molecular Operating Environment descriptors. The most influential descriptors in the ANN models, identified by automatic relevance determination, highlighted the importance of hydrophobicity, charge and molecular shape effects in these sorbate-sorbent interactions. The heterogeneous nature of the different sewage sludges used to measure Kd limited the predictability of sorption from the physico-chemical properties of the pharmaceuticals alone. Standardization of test materials for the measurement of Kd would improve the comparability of data from different studies, in the long term leading to better quality environmental risk assessments. Copyright © 2016 British Geological Survey, NERC. Published by Elsevier B.V. All rights reserved.
Libiger, Ondrej; Schork, Nicholas J.
2015-01-01
It is now feasible to examine the composition and diversity of microbial communities (i.e., “microbiomes”) that populate different human organs and orifices using DNA sequencing and related technologies. To explore the potential links between changes in microbial communities and various diseases in the human body, it is essential to test associations involving different species within and across microbiomes, environmental settings and disease states. Although a number of statistical techniques exist for carrying out relevant analyses, it is unclear which of these techniques exhibits the greatest statistical power to detect associations given the complexity of most microbiome datasets. We compared the statistical power of principal component regression, partial least squares regression, regularized regression, distance-based regression, Hill's diversity measures, and a modified test implemented in the popular and widely used microbiome analysis methodology “Metastats” across a wide range of simulated scenarios involving changes in feature abundance between two sets of metagenomic samples. For this purpose, simulation studies were used to change the abundance of microbial species in a real dataset from a published study examining human hands. Each technique was applied to the same data, and its ability to detect the simulated change in abundance was assessed. We hypothesized that a small subset of methods would outperform the rest in terms of statistical power. Indeed, we found that the Metastats technique modified to accommodate multivariate analysis and partial least squares regression yielded high power under the models and data sets we studied. The statistical power of diversity measure-based tests, distance-based regression and regularized regression was significantly lower. Our results provide insight into powerful analysis strategies that utilize information on species counts from large microbiome data sets exhibiting skewed frequency distributions obtained on a small to moderate number of samples. PMID:26734061
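The core of such a power comparison can be sketched as a spike-in simulation; the Mann-Whitney test below is only a stand-in for the methods evaluated in the study, and all counts are synthetic:

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(2)

    def simulate_power(base_counts, effect=2.0, n_sim=200, alpha=0.05):
        # Spike the abundance of one species in half of the samples, apply a
        # test, and report the fraction of simulations detecting the change.
        hits = 0
        for _ in range(n_sim):
            a = rng.poisson(base_counts)             # group 1: unchanged abundances
            b = rng.poisson(base_counts * effect)    # group 2: spiked abundance
            _, p = mannwhitneyu(a, b)
            hits += p < alpha
        return hits / n_sim

    # Skewed baseline counts, as is typical of microbiome feature tables.
    print(simulate_power(base_counts=rng.lognormal(2.0, 1.0, size=30)))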
Balke, Nina; Kalinin, Sergei V.; Jesse, Stephen; ...
2016-08-12
Kelvin probe force microscopy (KPFM) has provided deep insights into the role local electronic, ionic and electrochemical processes play on the global functionality of materials and devices, even down to the atomic scale. Conventional KPFM utilizes heterodyne detection and bias feedback to measure the contact potential difference (CPD) between tip and sample. This measurement paradigm, however, permits only partial recovery of the information encoded in bias- and time-dependent electrostatic interactions between the tip and sample and effectively down-samples the cantilever response to a single measurement of CPD per pixel. This level of detail is insufficient for electroactive materials, devices, or solid-liquid interfaces, where non-linear dielectrics are present or spurious electrostatic events are possible. Here, we simulate and experimentally validate a novel approach for spatially resolved KPFM capable of a full information transfer of the dynamic electric processes occurring between tip and sample. General acquisition mode, or G-Mode, adopts a big data approach utilising high speed detection, compression, and storage of the raw cantilever deflection signal in its entirety at high sampling rates (> 4 MHz), providing a permanent record of the tip trajectory. We develop a range of methodologies for analysing the resultant large multidimensional datasets involving classical, physics-based and information-based approaches. Physics-based analysis of G-Mode KPFM data recovers the parabolic bias dependence of the electrostatic force for each cycle of the excitation voltage, leading to a multidimensional dataset containing spatial and temporal dependence of the CPD and capacitance channels. We use multivariate statistical methods to reduce data volume and separate the complex multidimensional data sets into statistically significant components that can then be mapped onto separate physical mechanisms. Overall, G-Mode KPFM offers a new paradigm to study dynamic electric phenomena in electroactive interfaces, as well as offering a promising approach to extend KPFM to solid-liquid interfaces.
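The per-cycle physics-based step, recovering CPD and a capacitance-related curvature from the parabolic bias dependence of the electrostatic force, can be sketched as a quadratic fit (the values and noise level below are illustrative, not instrument settings):

    import numpy as np

    def cpd_from_cycle(bias, force):
        # The electrostatic force is parabolic in tip bias:
        #   F(V) = a*V**2 + b*V + c,
        # with the vertex at the contact potential difference V_cpd = -b/(2a)
        # and the curvature a related to the capacitance gradient.
        a, b, c = np.polyfit(bias, force, deg=2)
        return -b / (2.0 * a), a

    # Toy excitation cycle with V_cpd = 0.3 V.
    v = np.linspace(-5, 5, 400)
    f = -0.8 * (v - 0.3) ** 2 + 0.05 * np.random.default_rng(3).standard_normal(v.size)
    v_cpd, curvature = cpd_from_cycle(v, f)
    print(v_cpd)   # ~0.3; repeating per cycle yields CPD and capacitance channels vs time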
NASA Astrophysics Data System (ADS)
Griffiths, Thomas; Habler, Gerlinde; Schantl, Philip; Abart, Rainer
2017-04-01
Crystallographic orientation relationships (CORs) between crystalline inclusions and their hosts are commonly used to support particular inclusion origins, but often interpretations are based on a small fraction of all inclusions in a system. The electron backscatter diffraction (EBSD) method allows collection of large COR datasets more quickly than other methods while maintaining high spatial resolution. Large datasets allow analysis of the relative frequencies of different CORs, and identification of 'statistical CORs', where certain limited degrees of freedom exist in the orientation relationship between two neighbour crystals (Griffiths et al. 2016). Statistical CORs exist in addition to completely fixed 'specific' CORs (previously the only type of COR considered). We present a comparison of three EBSD single point datasets (all N > 200 inclusions) of rutile inclusions in garnet hosts, covering three rock systems, each with a different geological history: 1) magmatic garnet in pegmatite from the Koralpe complex, Eastern Alps, formed at temperatures > 600°C and low pressures; 2) granulite facies garnet rims on ultra-high-pressure garnets from the Kimi complex, Rhodope Massif; and 3) a Moldanubian granulite from the southeastern Bohemian Massif, equilibrated at peak conditions of 1050°C and 1.6 GPa. The present study is unique because all datasets have been analysed using the same catalogue of potential CORs, therefore relative frequencies and other COR properties can be meaningfully compared. In every dataset > 94% of the inclusions analysed exhibit one of the CORs tested for. Certain CORs are consistently among the most common in all datasets. However, the relative abundances of these common CORs show large variations between datasets (varying from 8 to 42 % relative abundance in one case). Other CORs are consistently uncommon but nonetheless present in every dataset. Lastly, there are some CORs that are common in one of the datasets and rare in the remainder. These patterns suggest competing influences on relative COR frequencies. Certain CORs seem consistently favourable, perhaps pointing to very stable low energy configurations, whereas some CORs are favoured in only one system, perhaps due to particulars of the formation mechanism, kinetics or conditions. Variations in COR frequencies between datasets seem to correlate with the conditions of host-inclusion system evolution. The two datasets from granulite-facies metamorphic samples show more similarities to each other than to the pegmatite dataset, and the sample inferred to have experienced the highest temperatures (Moldanubian granulite) shows the lowest diversity of CORs, low frequencies of statistical CORs and the highest frequency of specific CORs. These results provide evidence that petrological information is being encoded in COR distributions. They make a strong case for further studies of the factors influencing COR development and for measurements of COR distributions in other systems and between different phases. Griffiths, T.A., Habler, G., Abart, R. (2016): Crystallographic orientation relationships in host-inclusion systems: New insights from large EBSD data sets. Amer. Miner., 101, 690-705.
Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K
2015-06-04
Collective analysis of the increasing numbers of emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data with the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets, as it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.
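The general consensus idea behind such methods can be sketched as follows; this is not the Bi-CoPaM/UNCLES implementation, only an illustration of fusing per-dataset clusterings into a tuneably binarised membership matrix (the relabelling strategy and threshold are assumptions):

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.cluster import KMeans

    def consensus_membership(datasets, k=4, threshold=0.9, seed=0):
        # Cluster each dataset separately (rows = the same genes everywhere),
        # relabel clusters to match the first dataset by overlap, average the
        # binary membership matrices, then binarise: a gene is kept in a
        # cluster only if it is consistently assigned across the datasets.
        ref = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(datasets[0])
        consensus = np.eye(k)[ref]
        for x in datasets[1:]:
            lab = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(x)
            overlap = np.array([[np.sum((ref == i) & (lab == j)) for j in range(k)]
                                for i in range(k)])
            _, col = linear_sum_assignment(-overlap)      # best label matching
            remap = np.empty(k, dtype=int)
            remap[col] = np.arange(k)
            consensus = consensus + np.eye(k)[remap[lab]]
        consensus /= len(datasets)
        return consensus >= threshold                      # tuneable binarisation

    # Toy use: three noisy versions of the same 300-gene expression matrix.
    rng = np.random.default_rng(4)
    base = rng.standard_normal((300, 10))
    data = [base + 0.3 * rng.standard_normal(base.shape) for _ in range(3)]
    print(consensus_membership(data).sum(axis=0))   # genes consistently in each cluster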
Zepeda-Mendoza, Marie Lisandra; Bohmann, Kristine; Carmona Baez, Aldo; Gilbert, M Thomas P
2016-05-03
DNA metabarcoding is an approach for identifying multiple taxa in an environmental sample using specific genetic loci and taxa-specific primers. When combined with high-throughput sequencing it enables the taxonomic characterization of large numbers of samples in a relatively time- and cost-efficient manner. One recent laboratory development is the addition of 5'-nucleotide tags to both primers, producing double-tagged amplicons, and the use of multiple PCR replicates to filter erroneous sequences. However, there is currently no available toolkit for the straightforward analysis of datasets produced in this way. We present DAMe, a toolkit for the processing of datasets generated by double-tagged amplicons from multiple PCR replicates derived from an unlimited number of samples. Specifically, DAMe can be used to (i) sort amplicons by tag combination, (ii) evaluate dissimilarity between PCR replicates, and (iii) filter sequences derived from sequencing/PCR errors, chimeras, and contamination. This is attained by calculating the following parameters: (i) sequence content similarity between the PCR replicates from each sample, (ii) reproducibility of each unique sequence across the PCR replicates, and (iii) copy number of the unique sequences in each PCR replicate. We showcase the insights that can be obtained using DAMe prior to taxonomic assignment by applying it to two real datasets that vary in their complexity regarding number of samples, sequencing libraries, PCR replicates, and used tag combinations. Finally, we use a third mock dataset to demonstrate the impact and importance of filtering the sequences with DAMe. DAMe allows the user-friendly manipulation of amplicons derived from multiple samples with PCR replicates, built into single or multiple sequencing libraries. It allows the user to: (i) collapse amplicons into unique sequences and sort them by tag combination while retaining the sample identifier and copy number information, (ii) identify sequences carrying unused tag combinations, (iii) evaluate the comparability of PCR replicates of the same sample, and (iv) filter tagged amplicons from a number of PCR replicates using parameters of minimum length, copy number, and reproducibility across the PCR replicates. This enables an efficient analysis of complex datasets, and ultimately increases the ease of handling datasets from large-scale studies.
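The copy-number and reproducibility filter can be sketched in a few lines; this is not DAMe's actual interface, and all sequences and thresholds below are placeholders:

    from collections import Counter

    def filter_amplicons(replicates, min_len, min_copies, min_reps):
        # replicates: one sample's PCR replicates, each a list of sequences.
        # Keep a unique sequence only if it is long enough, reaches the
        # copy-number floor within a replicate, and is reproducible in at
        # least min_reps replicates.
        per_rep = [Counter(seqs) for seqs in replicates]
        kept = {}
        for seq in set().union(*per_rep):
            copies = [c[seq] for c in per_rep if c[seq] >= min_copies]
            if len(seq) >= min_len and len(copies) >= min_reps:
                kept[seq] = sum(copies)          # retain total copy number
        return kept

    reps = [["ACGTACGT"] * 5 + ["ACGTTTTT"],
            ["ACGTACGT"] * 4,
            ["ACGTACGT"] * 6 + ["GGGG"]]
    print(filter_amplicons(reps, min_len=6, min_copies=2, min_reps=2))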
NASA Astrophysics Data System (ADS)
Henn, Brian; Clark, Martyn P.; Kavetski, Dmitri; Newman, Andrew J.; Hughes, Mimi; McGurk, Bruce; Lundquist, Jessica D.
2018-01-01
Given uncertainty in precipitation gauge-based gridded datasets over complex terrain, we use multiple streamflow observations as an additional source of information about precipitation, in order to identify spatial and temporal differences between a gridded precipitation dataset and precipitation inferred from streamflow. We test whether gridded datasets capture across-crest and regional spatial patterns of variability, as well as year-to-year variability and trends in precipitation, in comparison to precipitation inferred from streamflow. We use a Bayesian model calibration routine with multiple lumped hydrologic model structures to infer the most likely basin-mean, water-year total precipitation for 56 basins with long-term (>30 year) streamflow records in the Sierra Nevada mountain range of California. We compare basin-mean precipitation derived from this approach with basin-mean precipitation from a precipitation gauge-based, 1/16° gridded dataset that has been used to simulate and evaluate trends in Western United States streamflow and snowpack over the 20th century. We find that the long-term average spatial patterns differ: in particular, there is less precipitation in the gridded dataset in higher-elevation basins whose aspect faces prevailing cool-season winds, as compared to precipitation inferred from streamflow. In a few years and basins, there is less gridded precipitation than there is observed streamflow. Lower-elevation, southern, and east-of-crest basins show better agreement between gridded and inferred precipitation. Implied actual evapotranspiration (calculated as precipitation minus streamflow) then also varies between the streamflow-based estimates and the gridded dataset. Absolute uncertainty in precipitation inferred from streamflow is substantial, but the signal of basin-to-basin and year-to-year differences is likely more robust. The findings suggest that considering streamflow when spatially distributing precipitation in complex terrain may improve its representation, particularly for basins whose orientations (e.g., windward-facing) are favored for orographic precipitation enhancement.
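A drastically simplified sketch of the inference idea: a single basin-wide precipitation multiplier is sampled by Metropolis against streamflow simulated by a toy bucket model (the study used multiple full hydrologic model structures; the model, noise levels and priors here are assumptions):

    import numpy as np

    rng = np.random.default_rng(5)

    def bucket_model(precip, p_mult, k=0.1):
        # Minimal lumped model: scaled precipitation fills a storage that
        # drains linearly to streamflow (evaporative losses omitted).
        storage, flow = 0.0, []
        for p in precip:
            storage += p_mult * p
            q = k * storage
            storage -= q
            flow.append(q)
        return np.array(flow)

    def infer_p_mult(precip, q_obs, n_iter=5000, sigma=0.5):
        # Metropolis sampling of the precipitation multiplier; the most likely
        # basin-mean precipitation is p_mult times the gridded precipitation.
        p_mult, chain = 1.0, []
        def loglik(m):
            return -0.5 * np.sum((bucket_model(precip, m) - q_obs) ** 2) / sigma ** 2
        ll = loglik(p_mult)
        for _ in range(n_iter):
            prop = p_mult + 0.05 * rng.standard_normal()
            if prop > 0:
                ll_prop = loglik(prop)
                if np.log(rng.uniform()) < ll_prop - ll:
                    p_mult, ll = prop, ll_prop
            chain.append(p_mult)
        return np.array(chain)

    # Synthetic test: truth has 20% more precipitation than the "gridded" input.
    gridded_p = rng.gamma(2.0, 2.0, size=365)
    q_obs = bucket_model(gridded_p, 1.2) + rng.normal(0, 0.3, 365)
    chain = infer_p_mult(gridded_p, q_obs)
    print(chain[len(chain) // 2:].mean())   # ~1.2 after burn-in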
Kernel canonical-correlation Granger causality for multiple time series
NASA Astrophysics Data System (ADS)
Wu, Guorong; Duan, Xujun; Liao, Wei; Gao, Qing; Chen, Huafu
2011-04-01
Canonical-correlation analysis, as a multivariate statistical technique, has been applied to multivariate Granger causality analysis to infer information flow in complex systems. It shows unique appeal and clear superiority over the traditional vector autoregressive method, owing to its simplified procedure for detecting causal interactions between multiple time series and its avoidance of potential model estimation problems. However, it is limited to the linear case. Here, we extend the framework of canonical correlation to include the estimation of multivariate nonlinear Granger causality for drawing inference about directed interaction. Its feasibility and effectiveness are verified on simulated data.
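For intuition, a linear CCA-style Granger index can be sketched as below; the kernel extension replaces the lagged predictors with kernel feature maps, and this toy index is an illustration of the general idea, not the authors' estimator:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def lagged(s, p):
        # Matrix of p past values: row t holds s at lags 1..p.
        return np.column_stack([s[p - i - 1: len(s) - i - 1] for i in range(p)])

    def cca_gc_index(x, y, p=2):
        # How much does the leading canonical correlation between Y's future
        # and the past improve when X's past is added to Y's past?
        yf = y[p:].reshape(-1, 1)
        past_y = lagged(y, p)
        past_xy = np.hstack([past_y, lagged(x, p)])
        def rho(past):
            u, v = CCA(n_components=1).fit_transform(past, yf)
            return abs(np.corrcoef(u[:, 0], v[:, 0])[0, 1])
        r_full, r_restr = rho(past_xy), rho(past_y)
        return np.log((1 - r_restr ** 2) / (1 - r_full ** 2))

    rng = np.random.default_rng(6)
    x = rng.standard_normal(2000)
    y = np.zeros(2000)
    for t in range(1, 2000):                    # x drives y with one-step delay
        y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()
    print(cca_gc_index(x, y), cca_gc_index(y, x))   # first index >> second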
Genomic Analysis of Complex Microbial Communities in Wounds
2012-01-01
…thoroughly in the ecology literature. Permutation Multivariate Analysis of Variance (PerMANOVA). We used PerMANOVA to test the null hypothesis of no difference between the bacterial communities found within a single wound compared to those from different patients (α = 0.05). PerMANOVA is a permutation-based version of the multivariate analysis of variance (MANOVA). PerMANOVA uses the distances between samples to partition variance and…
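A hand-rolled sketch of the PerMANOVA pseudo-F and its permutation p-value computed directly from a distance matrix (maintained implementations exist, e.g. in scikit-bio; the data below are synthetic):

    import numpy as np

    def permanova(dist, groups, n_perm=999, seed=0):
        # Partition the total sum of squared distances into within- and
        # between-group parts, then permute labels to build the null.
        rng = np.random.default_rng(seed)
        n = len(groups)

        def pseudo_f(g):
            ss_total = np.sum(dist[np.triu_indices(n, 1)] ** 2) / n
            ss_within = 0.0
            for lvl in np.unique(g):
                idx = np.where(g == lvl)[0]
                sub = dist[np.ix_(idx, idx)]
                ss_within += np.sum(sub[np.triu_indices(len(idx), 1)] ** 2) / len(idx)
            a = len(np.unique(g))
            return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))

        f_obs = pseudo_f(np.asarray(groups))
        f_null = np.array([pseudo_f(rng.permutation(groups)) for _ in range(n_perm)])
        return f_obs, (1 + np.sum(f_null >= f_obs)) / (n_perm + 1)

    rng = np.random.default_rng(7)
    pts = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(1, 1, (10, 5))])
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)   # Euclidean distances
    print(permanova(d, np.repeat([0, 1], 10)))                  # small p expected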
3D Survey in Complex Archaeological Environments: An Approach by Terrestrial Laser Scanning
NASA Astrophysics Data System (ADS)
Ebolese, D.; Dardanelli, G.; Lo Brutto, M.; Sciortino, R.
2018-05-01
The survey of archaeological sites by appropriate geomatics technologies is an important research topic. In particular, 3D survey by terrestrial laser scanning has become common practice for 3D archaeological data collection. Although terrestrial laser scanning surveying is by now well established, the complexity of most archaeological contexts can raise many issues that make the survey more difficult. The aim of this work is to describe the methodology chosen for a terrestrial laser scanning survey in a complex archaeological environment, taking into account the issues related to the particular structure of the site. The developed approach was used for the terrestrial laser scanning survey and documentation of a part of the archaeological site of Elaiussa Sebaste in Turkey. The proposed technical solutions provided an accurate and detailed 3D dataset of the study area, from which further products useful for archaeological analysis were also derived.
Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets
NASA Astrophysics Data System (ADS)
Day-Lewis, F. D.; Slater, L. D.; Johnson, T.
2012-12-01
Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
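As a small example of the time-series toolbox described here, a normalised cross-correlation can estimate the delay between a hydrologic driver and a geophysical response (all series below are synthetic):

    import numpy as np
    from scipy import signal

    def lag_of_max_xcorr(geophys, hydrologic, dt=1.0):
        # Normalised cross-correlation between a geophysical time series
        # (e.g. bulk conductivity) and a hydrologic driver (e.g. river stage);
        # the lag of the peak indicates the exchange/transport delay.
        a = (geophys - geophys.mean()) / geophys.std()
        b = (hydrologic - hydrologic.mean()) / hydrologic.std()
        xc = signal.correlate(a, b, mode="full") / len(a)
        lags = signal.correlation_lags(len(a), len(b), mode="full")
        return lags[np.argmax(xc)] * dt

    # Synthetic check: conductivity follows stage with a 12-step delay.
    rng = np.random.default_rng(8)
    stage = np.convolve(rng.standard_normal(500), np.ones(20) / 20, mode="same")
    cond = np.roll(stage, 12) + 0.05 * rng.standard_normal(500)
    print(lag_of_max_xcorr(cond, stage))   # ~12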
A dataset of human decision-making in teamwork management.
Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang
2017-01-17
Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.
Shi, Yingzhong; Chung, Fu-Lai; Wang, Shitong
2015-09-01
Recently, a time-adaptive support vector machine (TA-SVM) was proposed for handling nonstationary datasets. While attractive performance has been reported and the new classifier is distinctive in simultaneously solving several SVM subclassifiers locally and globally by using an elegant SVM formulation in an alternative kernel space, the coupling of subclassifiers requires the computation of a matrix inverse, and TA-SVM thus suffers from a high computational burden in large nonstationary dataset applications. To overcome this shortcoming, an improved TA-SVM (ITA-SVM) is proposed using a common vector shared by all the SVM subclassifiers involved. ITA-SVM not only keeps an SVM formulation, but also avoids the computation of matrix inversion. Thus, we can realize its fast version, that is, the improved time-adaptive core vector machine (ITA-CVM), for large nonstationary datasets by using the CVM technique. ITA-CVM has the merit of asymptotically linear time complexity for large nonstationary datasets and inherits the advantages of TA-SVM. The effectiveness of the proposed classifiers ITA-SVM and ITA-CVM is also experimentally confirmed.
Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering.
Guo, Xuan; Meng, Yu; Yu, Ning; Pan, Yi
2014-04-10
Taking advantage of high-throughput single nucleotide polymorphism (SNP) genotyping technology, large genome-wide association studies (GWASs) have been considered to hold promise for unravelling complex relationships between genotype and phenotype. At present, traditional single-locus-based methods are insufficient to detect multi-locus interactions, which are widespread in complex traits. In addition, statistical tests for high-order epistatic interactions with more than 2 SNPs pose computational and analytical challenges because the computation increases exponentially as the cardinality of the SNP combinations grows. In this paper, we provide a simple, fast and powerful method using dynamic clustering and cloud computing to detect genome-wide multi-locus epistatic interactions. We have constructed systematic experiments to compare statistical power against some recently proposed algorithms, including TEAM, SNPRuler, EDCF and BOOST. Furthermore, we have applied our method to two real GWAS datasets, the age-related macular degeneration (AMD) and rheumatoid arthritis (RA) datasets, where we find some novel potential disease-related genetic factors which do not show up in detections of 2-locus epistatic interactions. Experimental results on simulated data demonstrate that our method is more powerful than some recently proposed methods on both two- and three-locus disease models. Our method has discovered many novel high-order associations that are significantly enriched in cases from two real GWAS datasets. Moreover, the running times of the cloud implementation of our method on the AMD and RA datasets are roughly 2 hours and 50 hours, respectively, on a cluster with forty small virtual machines for detecting two-locus interactions. Therefore, we believe that our method is suitable and effective for the full-scale analysis of multiple-locus epistatic interactions in GWAS.
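The brute-force core that such systems distribute across machines can be sketched as an exhaustive two-locus contingency scan (synthetic genotypes; the planted interaction is illustrative):

    import numpy as np
    from itertools import combinations
    from scipy.stats import chi2_contingency

    def two_locus_scan(genotypes, phenotype, top=5):
        # For every SNP pair, test association between the 9 joint genotype
        # cells (0/1/2 x 0/1/2) and case/control status. Real GWAS-scale scans
        # distribute these pairs across many machines.
        n_snps = genotypes.shape[1]
        results = []
        for i, j in combinations(range(n_snps), 2):
            cell = genotypes[:, i] * 3 + genotypes[:, j]      # joint genotype 0..8
            table = np.zeros((2, 9))
            for c in (0, 1):
                table[c] = np.bincount(cell[phenotype == c], minlength=9)
            table = table[:, table.sum(axis=0) > 0]           # drop empty cells
            _, p, _, _ = chi2_contingency(table)
            results.append((p, i, j))
        return sorted(results)[:top]

    rng = np.random.default_rng(9)
    g = rng.integers(0, 3, size=(400, 20))
    risk = (g[:, 3] > 0) & (g[:, 7] > 0)                      # planted interaction
    y = (rng.uniform(size=400) < np.where(risk, 0.7, 0.3)).astype(int)
    print(two_locus_scan(g, y))                               # pair (3, 7) should rank first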
Comprehensive decision tree models in bioinformatics.
Stiglic, Gregor; Kocbek, Simon; Pernek, Igor; Kokol, Peter
2012-01-01
Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of the reasoning behind the classification model are possible. This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models by a so-called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during the tuning of the model, which is constrained exclusively by the dimensions of the produced decision tree. The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase of accuracy in less complex visually tuned decision trees. In contrast to classical machine learning benchmarking datasets, we observe higher accuracy gains in bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree. The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility, but also very good classification performance that does not differ from the usually more complex models built using default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes, which are very common in bioinformatics.
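The key constraint, bounding the tree's dimensions rather than tuning on a performance measure, can be mimicked in scikit-learn (a rough analogue of visual tuning, not the paper's environment; the dataset and bounds are arbitrary choices):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Constrain only the dimensions of the tree (depth and leaf count),
    # not any accuracy-based hyperparameter search.
    x, y = load_breast_cancer(return_X_y=True)
    simple = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=8, random_state=0)
    default = DecisionTreeClassifier(random_state=0)
    print(cross_val_score(simple, x, y, cv=10).mean(),
          cross_val_score(default, x, y, cv=10).mean())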
Complex nature of SNP genotype effects on gene expression in primary human leucocytes.
Heap, Graham A; Trynka, Gosia; Jansen, Ritsert C; Bruinenberg, Marcel; Swertz, Morris A; Dinesen, Lotte C; Hunt, Karen A; Wijmenga, Cisca; Vanheel, David A; Franke, Lude
2009-01-07
Genome-wide association studies have been hugely successful in identifying disease risk variants, yet most variants do not lead to coding changes and how variants influence biological function is usually unknown. We correlated gene expression and genetic variation in untouched primary leucocytes (n = 110) from individuals with celiac disease - a common condition with multiple risk variants identified. We compared our observations with an EBV-transformed HapMap B cell line dataset (n = 90), and performed a meta-analysis to increase power to detect non-tissue-specific effects. In celiac peripheral blood, 2,315 SNP variants influenced gene expression at 765 different transcripts (< 250 kb from SNP, at FDR = 0.05, cis expression quantitative trait loci, eQTLs). 135 of the detected SNP-probe effects (reflecting 51 unique probes) were also detected in the published HapMap B cell line dataset, all with effects in the same allelic direction. Overall gene expression differences within the two datasets predominantly explain the limited overlap in observed cis-eQTLs. Celiac-associated risk variants from two regions, containing the genes IL18RAP and CCR3, showed significant cis genotype-expression correlations in the peripheral blood but not in the B cell line datasets. We identified 14 genes where a SNP affected the expression of different probes within the same gene, but in opposite allelic directions. By incorporating genetic variation in co-expression analyses, functional relationships between genes can be more significantly detected. In conclusion, the complex nature of genotypic effects in human populations makes the use of a relevant tissue, large datasets, and analysis of different exons essential to enable the identification of the function of many genetic risk variants in common diseases.
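A minimal sketch of the cis-eQTL test described here: regress expression on allele dosage for SNP-probe pairs within 250 kb (all data and genomic positions below are simulated):

    import numpy as np
    from scipy import stats

    def cis_eqtl(dosage, expression, snp_pos, probe_pos, window=250_000):
        # Test each SNP against each expression probe within the window:
        # regress expression on allele dosage (0/1/2), report slope and p.
        hits = []
        for i, spos in enumerate(snp_pos):
            for j, ppos in enumerate(probe_pos):
                if abs(spos - ppos) <= window:
                    res = stats.linregress(dosage[:, i], expression[:, j])
                    hits.append((res.pvalue, res.slope, i, j))
        return sorted(hits)   # feed the p-values into an FDR procedure afterwards

    rng = np.random.default_rng(10)
    dose = rng.integers(0, 3, size=(110, 50)).astype(float)
    expr = rng.standard_normal((110, 20))
    expr[:, 4] += 0.6 * dose[:, 12]                       # planted cis effect
    pos_snp = rng.integers(0, 2_000_000, 50)
    pos_probe = rng.integers(0, 2_000_000, 20)
    pos_snp[12], pos_probe[4] = 100_000, 120_000          # place them within 250 kb
    print(cis_eqtl(dose, expr, pos_snp, pos_probe)[0])    # SNP 12 / probe 4 on top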
Closing the data gap: Creating an open data environment
NASA Astrophysics Data System (ADS)
Hester, J. R.
2014-02-01
Poor data management brought on by increasing volumes of complex data undermines both the integrity of the scientific process and the usefulness of datasets. Researchers should endeavour both to make their data citeable and to cite data whenever possible. The reusability of datasets is improved by community adoption of comprehensive metadata standards and public availability of reversibly reduced data. Where standards are not yet defined, as much information as possible about the experiment and samples should be preserved in datafiles written in a standard format.
Dynamic fMRI networks predict success in a behavioral weight loss program among older adults.
Mokhtari, Fatemeh; Rejeski, W Jack; Zhu, Yingying; Wu, Guorong; Simpson, Sean L; Burdette, Jonathan H; Laurienti, Paul J
2018-06-01
More than one-third of adults in the United States are obese, with a higher prevalence among older adults. Obesity among older adults is a major cause of physical dysfunction, hypertension, diabetes, and coronary heart diseases. Many people who engage in lifestyle weight loss interventions fail to reach targeted goals for weight loss, and most will regain what was lost within 1-2 years following cessation of treatment. This variability in treatment efficacy suggests that there are important phenotypes predictive of success with intentional weight loss that could lead to tailored treatment regimen, an idea that is consistent with the concept of precision-based medicine. Although the identification of biochemical and metabolic phenotypes are one potential direction of research, neurobiological measures may prove useful as substantial behavioral change is necessary to achieve success in a lifestyle intervention. In the present study, we use dynamic brain networks from functional magnetic resonance imaging (fMRI) data to prospectively identify individuals most likely to succeed in a behavioral weight loss intervention. Brain imaging was performed in overweight or obese older adults (age: 65-79 years) who participated in an 18-month lifestyle weight loss intervention. Machine learning and functional brain networks were combined to produce multivariate prediction models. The prediction accuracy exceeded 95%, suggesting that there exists a consistent pattern of connectivity which correctly predicts success with weight loss at the individual level. Connectivity patterns that contributed to the prediction consisted of complex multivariate network components that substantially overlapped with known brain networks that are associated with behavior emergence, self-regulation, body awareness, and the sensory features of food. Future work on independent datasets and diverse populations is needed to corroborate our findings. Additionally, we believe that efforts can begin to examine whether these models have clinical utility in tailoring treatment. Copyright © 2018 Elsevier Inc. All rights reserved.
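One way to sketch such a pipeline, sliding-window connectivity features feeding a cross-validated classifier, is shown below; the window size, ROI count and linear SVM are assumptions for illustration, not the study's exact machinery, and all data are synthetic:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def dynamic_connectivity(ts, win=30, step=10):
        # Sliding-window correlation matrices from ROI time series (T x n_roi);
        # the upper triangles are concatenated into one feature vector.
        feats = []
        for start in range(0, ts.shape[0] - win + 1, step):
            c = np.corrcoef(ts[start:start + win].T)
            feats.append(c[np.triu_indices_from(c, k=1)])
        return np.concatenate(feats)

    # Hypothetical cohort: per-subject ROI time series and success/failure labels.
    rng = np.random.default_rng(11)
    X = np.array([dynamic_connectivity(rng.standard_normal((120, 15))) for _ in range(40)])
    y = rng.integers(0, 2, 40)
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    print(cross_val_score(clf, X, y, cv=5).mean())   # ~0.5 on noise; real data higher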
Correlation Between Bladder Pain Syndrome/Interstitial Cystitis and Pelvic Inflammatory Disease
Chung, Shiu-Dong; Chang, Chao-Hsiang; Hung, Peir-Haur; Chung, Chi-Jung; Muo, Chih-Hsin; Huang, Chao-Yuan
2015-01-01
Pelvic inflammatory disease (PID) has been investigated in Western countries and found to be associated with chronic pelvic pain and inflammation. Bladder pain syndrome/interstitial cystitis (BPS/IC) is a complex syndrome that is significantly more prevalent in women than in men. Chronic pelvic pain is a main symptom of BPS/IC, and chronic inflammation is a major etiology of BPS/IC. This study aimed to investigate the correlation between BPS/IC and PID using a population-based dataset. We constructed a case–control study from the Taiwan National Health Insurance program. The case cohort comprised 449 patients with BPS/IC, and 1796 randomly selected subjects (about 1:4 matching) were used as controls. A multivariate logistic regression model was constructed to estimate the association between BPS/IC and PID. Of the 2245 sampled subjects, a significant difference was observed in the prevalence of PID between BPS/IC cases and controls (41.7% vs 15.4%, P < 0.001). Multivariate logistic regression analysis revealed that the odds ratio (OR) for PID among cases was 3.69 (95% confidence interval [CI]: 2.89–4.71). Furthermore, the ORs for PID among BPS/IC cases were 4.52 (95% CI: 2.55–8.01), 4.31 (95% CI: 2.91–6.38), 3.00 (95% CI: 1.82–4.94), and 5.35 (95% CI: 1.88–15.20) in the <35, 35–49, 50–64, and >65 years age groups, respectively, after adjusting for geographic region, irritable bowel syndrome, and hypertension. A joint effect was also noted, specifically when patients had both PID and irritable bowel syndrome, with an OR of 10.5 (95% CI: 4.88–22.50). This study demonstrated a correlation between PID and BPS/IC. Clinicians treating women with PID should be alert to BPS/IC-related symptoms in this population. PMID:26579800
Risk model of prolonged intensive care unit stay in Chinese patients undergoing heart valve surgery.
Wang, Chong; Zhang, Guan-xin; Zhang, Hao; Lu, Fang-lin; Li, Bai-ling; Xu, Ji-bin; Han, Lin; Xu, Zhi-yun
2012-11-01
The aim of this study was to develop a preoperative risk prediction model and a scorecard for prolonged intensive care unit length of stay (PrlICULOS) in adult patients undergoing heart valve surgery. This is a retrospective observational study of collected data on 3925 consecutive patients older than 18 years who had undergone heart valve surgery between January 2000 and December 2010. Data were randomly split into a development dataset (n=2401) and a validation dataset (n=1524). A multivariate logistic regression analysis was undertaken using the development dataset to identify independent risk factors for PrlICULOS. Performance of the model was then assessed by observed and expected rates of PrlICULOS on the development and validation datasets. Model calibration and discriminatory ability were analysed by the Hosmer-Lemeshow goodness-of-fit statistic and the area under the receiver operating characteristic (ROC) curve, respectively. There were 491 patients that required PrlICULOS (12.5%). Preoperative independent predictors of PrlICULOS are shown with odds ratios as follows: (1) age, 1.4; (2) chronic obstructive pulmonary disease (COPD), 1.8; (3) atrial fibrillation, 1.4; (4) left bundle branch block, 2.7; (5) ejection fraction, 1.4; (6) left ventricle weight, 1.5; (7) New York Heart Association class III-IV, 1.8; (8) critical preoperative state, 2.0; (9) perivalvular leakage, 6.4; (10) tricuspid valve replacement, 3.8; (11) concurrent CABG, 2.8; and (12) concurrent other cardiac surgery, 1.8. The Hosmer-Lemeshow goodness-of-fit statistic was not statistically significant in either the development or the validation dataset (P=0.365 vs P=0.310). The area under the ROC curve for the prediction of PrlICULOS in the development and validation datasets was 0.717 and 0.700, respectively. We developed and validated a local risk prediction model for PrlICULOS after adult heart valve surgery. This model can be used to calculate patient-specific risk with an equivalent predicted risk at our centre in future clinical practice. Copyright © 2012 Australian and New Zealand Society of Cardiac and Thoracic Surgeons (ANZSCTS) and the Cardiac Society of Australia and New Zealand (CSANZ). Published by Elsevier B.V. All rights reserved.
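The development/validation workflow, fit on one split, then check calibration (Hosmer-Lemeshow) and discrimination (area under the ROC curve) on the other, can be sketched as follows (all patient data below are simulated, not the study's registry):

    import numpy as np
    from scipy.stats import chi2
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def hosmer_lemeshow(y_true, y_prob, n_bins=10):
        # Group patients into deciles of predicted risk and compare observed vs
        # expected event counts; a large p-value indicates good calibration.
        order = np.argsort(y_prob)
        h = 0.0
        for b in np.array_split(order, n_bins):
            obs, exp, n = y_true[b].sum(), y_prob[b].sum(), len(b)
            h += (obs - exp) ** 2 / (exp * (1 - exp / n) + 1e-12)
        return h, chi2.sf(h, n_bins - 2)

    rng = np.random.default_rng(12)
    x = rng.standard_normal((3925, 12))                 # 12 preoperative predictors
    logit = x @ rng.normal(0, 0.4, 12) - 2.0
    y = (rng.uniform(size=3925) < 1 / (1 + np.exp(-logit))).astype(int)
    dev, val = slice(0, 2401), slice(2401, None)        # development / validation split
    model = LogisticRegression(max_iter=1000).fit(x[dev], y[dev])
    p = model.predict_proba(x[val])[:, 1]
    print(hosmer_lemeshow(y[val], p), roc_auc_score(y[val], p))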
Evaluation and comparison of bioinformatic tools for the enrichment analysis of metabolomics data.
Marco-Ramell, Anna; Palau-Rodriguez, Magali; Alay, Ania; Tulipani, Sara; Urpi-Sarda, Mireia; Sanchez-Pla, Alex; Andres-Lacueva, Cristina
2018-01-02
Bioinformatic tools for the enrichment of 'omics' datasets facilitate interpretation and understanding of data. To date, few are suitable for metabolomics datasets. The main objective of this work is to give a critical overview, for the first time, of the performance of these tools. To that aim, datasets from metabolomic repositories were selected and enriched data were created. Both types of data were analysed with these tools and outputs were thoroughly examined. An exploratory multivariate analysis of the most used tools for the enrichment of metabolite sets, based on non-metric multidimensional scaling (NMDS) of Jaccard's distances, was performed and mirrored their diversity. Codes (identifiers) of the metabolites of the datasets were searched in different metabolite databases (HMDB, KEGG, PubChem, ChEBI, BioCyc/HumanCyc, LipidMAPS, ChemSpider, METLIN and Recon2). The databases that presented the most identifiers of the metabolites of the dataset were PubChem, followed by METLIN and ChEBI. However, these databases had duplicated entries and might present false positives. The performance of over-representation analysis (ORA) tools, including BioCyc/HumanCyc, ConsensusPathDB, IMPaLA, MBRole, MetaboAnalyst, Metabox, MetExplore, MPEA, PathVisio and Reactome, and the mapping tool KEGGREST, was examined. Results were mostly consistent among tools and between real and enriched data despite the variability of the tools. Nevertheless, a few controversial results, such as differences in the total number of metabolites, were also found. Disease-based enrichment analyses were also assessed, but they were not found to be accurate, probably because metabolite disease sets are not up to date and because predicting diseases from a list of metabolites is difficult. We have extensively reviewed the state of the art of the available range of tools for metabolomic datasets, the completeness of metabolite databases, the performance of ORA methods and disease-based analyses. Despite the variability of the tools, they provided consistent results independent of their analytic approach. However, more work on the completeness of metabolite and pathway databases is required, which strongly affects the accuracy of enrichment analyses. Improvements will be translated into more accurate and global insights into the metabolome.
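The statistical core of ORA is a hypergeometric tail test on the overlap between a selected metabolite list and a pathway set; a minimal sketch (all numbers are illustrative):

    from scipy.stats import hypergeom

    def ora_p(n_background, n_in_set, n_selected, n_overlap):
        # Probability of observing at least n_overlap pathway metabolites
        # among the selected metabolites, under hypergeometric sampling
        # from the annotated background.
        return hypergeom.sf(n_overlap - 1, n_background, n_in_set, n_selected)

    # 20 of 3000 background metabolites belong to the pathway; 8 of the 100
    # significant metabolites hit it.
    print(ora_p(3000, 20, 100, 8))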
Estimating the number of pure chemical components in a mixture by X-ray absorption spectroscopy.
Manceau, Alain; Marcus, Matthew; Lenoir, Thomas
2014-09-01
Principal component analysis (PCA) is a multivariate data analysis approach commonly used in X-ray absorption spectroscopy to estimate the number of pure compounds in multicomponent mixtures. This approach seeks to describe a large number of multicomponent spectra as weighted sums of a smaller number of component spectra. These component spectra are in turn considered to be linear combinations of the spectra from the actual species present in the system from which the experimental spectra were taken. The dimension of the experimental dataset is given by the number of meaningful abstract components, as estimated by the cascade or variance of the eigenvalues (EVs), the factor indicator function (IND), or the F-test on reduced EVs. It is shown on synthetic and real spectral mixtures that the performance of the IND and F-test critically depends on the amount of noise in the data, and may result in considerable underestimation or overestimation of the number of components even for a signal-to-noise (s/n) ratio of the order of 80 (σ = 20) in a XANES dataset. For a given s/n ratio, the accuracy of the component recovery from a random mixture depends on the size of the dataset and the number of components, which is not known in advance, and deteriorates for larger datasets because the analysis picks up more noise components. The scree plot of the EVs for the components yields one or two values close to the significant number of components, but the result can be ambiguous and its uncertainty is unknown. A new estimator, NSS-stat, which incorporates the experimental error into XANES data analysis, is introduced and tested. It is shown that NSS-stat produces superior results compared with the three traditional forms of PCA-based component-number estimation. A user-friendly graphical interface for the calculation of EVs, IND, F-test and NSS-stat from a XANES dataset has been developed under LabVIEW for Windows and is supplied in the supporting information. Its possible application to EXAFS data is discussed, and several XANES and EXAFS datasets are also included for download.
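Malinowski's indicator function, one of the traditional estimators discussed here, can be sketched from the eigenvalues of a synthetic mixture dataset (NSS-stat itself is not reproduced; the noise level and mixture sizes are arbitrary):

    import numpy as np

    def indicator_function(data):
        # Malinowski's factor indicator function: eigenvalues beyond n
        # components are treated as error; the minimum of
        # IND(n) = RE(n) / (c - n)^2 estimates the number of pure components.
        r, c = data.shape                     # r energy points x c spectra
        ev = np.linalg.svd(data, compute_uv=False) ** 2
        ind = []
        for n in range(1, c):
            re = np.sqrt(ev[n:].sum() / (r * (c - n)))
            ind.append(re / (c - n) ** 2)
        return np.array(ind)                  # argmin + 1 = estimated components

    # Synthetic mixtures of 3 "pure spectra" plus noise.
    rng = np.random.default_rng(13)
    pure = rng.random((200, 3))               # 3 components, 200 energy points
    weights = rng.dirichlet(np.ones(3), size=25)
    spectra = pure @ weights.T + 0.01 * rng.standard_normal((200, 25))
    print(np.argmin(indicator_function(spectra)) + 1)   # ~3; the noise level drives the estimate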
Data reuse and the open data citation advantage
Vision, Todd J.
2013-01-01
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003. PMID:24109559
Ligh, Cassandra A; Nelson, Jonas A; Wink, Jason D; Gerety, Patrick A; Fischer, John P; Wu, Liza C; Kanchwala, Suhail K
2016-01-01
There are limited population-based studies that examine perioperative factors that influence postoperative surgical take-backs to the OR following free flap (FF) reconstruction for head/neck cancer extirpation. The purpose of this study was to critically analyse head/neck free flaps (HNFF) captured in the ACS-NSQIP dataset, with a specific focus on postoperative complications and the incidence of factors associated with re-operation. The 2005-2012 ACS-NSQIP datasets were accessed to identify patients undergoing FF reconstruction after a diagnosis of head/neck cancer. Patient demographics, comorbidities, and perioperative risk factors were examined as covariates, and the primary outcome was return to the OR within 30 days of surgery. A multivariate regression was performed to determine independent preoperative factors associated with this complication. In total, 855 patients underwent FF for head/neck reconstruction, most commonly for the tongue (24.7%) and mouth/floor/cavity (25.0%). Of these, 153 patients (17.9%) returned to the OR within 30 days of surgery. Patients in this cohort had higher rates of wound infections and dehiscence (p < 0.01). Medical complications were significantly higher and included pneumonia (12.4% vs 5.0%, p < 0.01), prolonged ventilation (16.3% vs 4.8%, p < 0.01), myocardial infarction (2.6% vs 0.6%, p = 0.017), and sepsis (7.2% vs 3.4%, p = 0.033). Regression analysis demonstrated that visceral flaps (OR = 9.7, p = 0.012) and hypoalbuminemia (OR = 2.4, p = 0.009) were significant predictors of a return to the OR. Based on data from the nationwide NSQIP dataset, up to 17% of HNFF return to the OR within 30 days. Although this dataset has some significant limitations, these results can cautiously help to improve preoperative patient optimisation and surgical decision-making.
Lae, Marick; Moarii, Matahi; Sadacca, Benjamin; Pinheiro, Alice; Galliot, Marion; Abecassis, Judith; Laurent, Cecile; Reyal, Fabien
2016-01-01
Introduction HER2-positive breast cancer (BC) is a heterogeneous group of aggressive breast cancers, the prognosis of which has greatly improved since the introduction of treatments targeting HER2. However, these tumors may display intrinsic or acquired resistance to treatment, and classifiers of HER2-positive tumors are required to improve the prediction of prognosis and to develop novel therapeutic interventions. Methods We analyzed 2893 primary human breast cancer samples from 21 publicly available datasets and developed a six-metagene signature on a training set of 448 HER2-positive BC. We then used external public datasets to assess the ability of these metagenes to predict the response to chemotherapy (Ignatiadis dataset), and prognosis (METABRIC dataset). Results We identified a six-metagene signature (138 genes) containing metagenes enriched in different gene ontologies. The gene clusters were named as follows: Immunity, Tumor suppressors/proliferation, Interferon, Signal transduction, Hormone/survival and Matrix clusters. In all datasets, the Immunity metagene was less strongly expressed in ER-positive than in ER-negative tumors, and was inversely correlated with the Hormone/survival metagene. Within the signature, multivariate analyses showed that strong expression of the “Immunity” metagene was associated with higher pCR rates after NAC (OR = 3.71 [1.28–11.91], p = 0.019) than weak expression, and with a better prognosis in HER2-positive/ER-negative breast cancers (HR = 0.58 [0.36–0.94], p = 0.026). Immunity metagene expression was associated with the presence of tumor-infiltrating lymphocytes (TILs). Conclusion The identification of a predictive and prognostic immune module in HER2-positive BC confirms the need for clinical testing of immune checkpoint modulators and vaccines for this specific subtype. The inverse correlation between Immunity and hormone pathways opens research perspectives and deserves further investigation. PMID:28005906
Zhang, Yajun; Chai, Tianyou; Wang, Hong; Wang, Dianhui; Chen, Xinkai
2018-06-01
Complex industrial processes are multivariable and generally exhibit strong coupling among their control loops together with a heavily nonlinear nature. This makes it very difficult to obtain an accurate model. As a result, conventional and data-driven control methods are difficult to apply. Using a twin-tank level control system as an example, a novel multivariable decoupling control algorithm with adaptive neural-fuzzy inference system (ANFIS)-based unmodeled dynamics (UD) compensation is proposed in this paper for a class of complex industrial processes. First, a nonlinear multivariable decoupling controller with UD compensation is introduced. Different from the existing methods, a decomposition estimation algorithm using ANFIS is employed to estimate the UD, and the desired estimation and decoupling control effects are achieved. Second, the proposed method does not require the complicated switching mechanism that has been commonly used in the literature. This significantly simplifies the obtained decoupling algorithm and its realization. Third, based on some new lemmas and theorems, the conditions on the stability and convergence of the closed-loop system are analyzed to show the uniform boundedness of all the variables. This is then followed by a summary of experimental tests on a heavily coupled nonlinear twin-tank system that demonstrate the effectiveness and practicability of the proposed method.
Multivariate Meta-Analysis of Genetic Association Studies: A Simulation Study
Neupane, Binod; Beyene, Joseph
2015-01-01
In a meta-analysis with multiple end points of interest that are correlated between or within studies, a multivariate approach to meta-analysis has the potential to produce more precise estimates of effects by exploiting the correlation structure between end points. However, under the random-effects assumption the multivariate estimation is more complex (as it involves estimating more parameters simultaneously) than univariate estimation, and sometimes can produce unrealistic parameter estimates. The usefulness of the multivariate approach to meta-analysis of the effects of a genetic variant on two or more correlated traits is not well understood in the area of genetic association studies. In such studies, genetic variants are expected to roughly maintain Hardy-Weinberg equilibrium within studies, and their effects on complex traits are generally very small to modest and could be heterogeneous across studies for genuine reasons. We carried out extensive simulation to explore the comparative performance of the multivariate approach with the most commonly used univariate inverse-variance weighted approach under the random-effects assumption in various realistic meta-analytic scenarios of genetic association studies of correlated end points. We evaluated the performance with respect to relative mean bias percentage, root mean square error (RMSE) of the estimate, and coverage probability of the corresponding 95% confidence interval of the effect for each end point. Our simulation results suggest that the multivariate approach performs similarly to or better than the univariate method when correlations between end points within or between studies are at least moderate and between-study variation is similar to or larger than the average within-study variation for meta-analyses of 10 or more genetic studies. The multivariate approach produces estimates with smaller bias and RMSE, especially for an end point that has randomly or informatively missing summary data in some individual studies, when the missing data in that end point are imputed with null effects and quite large variance. PMID:26196398
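The fixed-effect multivariate pooling step can be sketched by generalised least squares; the random-effects version discussed here additionally estimates a between-study covariance, which is omitted below (all data are simulated):

    import numpy as np

    def multivariate_fe_meta(effects, covs):
        # Fixed-effect multivariate pooling by generalised least squares:
        #   theta = (sum S_i^-1)^-1 * sum S_i^-1 y_i,
        # with the within-study covariance S_i carrying the correlation
        # between the two end points.
        w = [np.linalg.inv(s) for s in covs]
        v = np.linalg.inv(np.sum(w, axis=0))          # covariance of pooled estimate
        theta = v @ np.sum([wi @ yi for wi, yi in zip(w, effects)], axis=0)
        return theta, np.sqrt(np.diag(v))

    # Ten studies, two correlated end points (within-study correlation 0.6).
    rng = np.random.default_rng(14)
    s = np.array([[0.04, 0.6 * 0.2 * 0.3], [0.6 * 0.2 * 0.3, 0.09]])
    ys = rng.multivariate_normal([0.25, 0.10], s, size=10)
    theta, se = multivariate_fe_meta(ys, [s] * 10)
    print(theta, se)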
An R package for the integrated analysis of metabolomics and spectral data.
Costa, Christopher; Maraschin, Marcelo; Rocha, Miguel
2016-06-01
Recently, there has been a growing interest in the field of metabolomics, materialized by a remarkable growth in experimental techniques, available data and related biological applications. Indeed, techniques such as nuclear magnetic resonance, gas or liquid chromatography, mass spectrometry, and infrared and UV-visible spectroscopies have provided extensive datasets that can help in tasks such as biological and biomedical discovery, biotechnology and drug development. However, as happens with other omics data, the analysis of metabolomics datasets poses multiple challenges, both in terms of methodologies and in the development of appropriate computational tools. Indeed, of the available software tools, none addresses the multiplicity of existing techniques and data analysis tasks. In this work, we make available a novel R package, named specmine, which provides a set of methods for metabolomics data analysis, including data loading in different formats, pre-processing, metabolite identification, univariate and multivariate data analysis, machine learning, and feature selection. Importantly, the implemented methods provide adequate support for the analysis of data from diverse experimental techniques, integrating a large set of functions from several R packages in a powerful, yet simple-to-use environment. The package, already available in CRAN, is accompanied by a web site where users can deposit datasets, scripts and analysis reports to be shared with the community, promoting the efficient sharing of metabolomics data analysis pipelines. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Using Time Series Analysis to Predict Cardiac Arrest in a PICU.
Kennedy, Curtis E; Aoki, Noriaki; Mariscalco, Michele; Turley, James P
2015-11-01
To build and test cardiac arrest prediction models in a PICU, using time series analysis as input, and to measure changes in prediction accuracy attributable to different classes of time series data. Retrospective cohort study. Thirty-one-bed academic PICU that provides care for medical and general surgical (not congenital heart surgery) patients. Patients experiencing a cardiac arrest in the PICU and requiring external cardiac massage for at least 2 minutes. None. One hundred three cases of cardiac arrest and 109 control cases were used to prepare a baseline dataset that consisted of 1,025 variables in four data classes: multivariate, raw time series, clinical calculations, and time series trend analysis. We trained 20 arrest prediction models using a matrix of five feature sets (combinations of data classes) with four modeling algorithms: linear regression, decision tree, neural network, and support vector machine. The reference model (multivariate data with regression algorithm) had an accuracy of 78% and an area under the receiver operating characteristic curve of 87%. The best model (multivariate + trend analysis data with support vector machine algorithm) had an accuracy of 94% and an area under the receiver operating characteristic curve of 98%. Cardiac arrest predictions based on a traditional model built with multivariate data and a regression algorithm misclassified cases 3.7 times more frequently than predictions that included time series trend analysis and were built with a support vector machine algorithm. Although the final model lacks the specificity necessary for clinical application, we have demonstrated how information from time series data can be used to increase the accuracy of clinical prediction models.
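As a rough illustration of the feature-set-by-algorithm matrix described above, one could compare a regression baseline on static multivariate data against a support vector machine that also sees trend features. A sketch with scikit-learn on synthetic stand-in data (the feature contents are hypothetical, not the study's actual variables):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_static = rng.normal(size=(212, 40))   # stand-in for snapshot vital-sign features
X_trend = rng.normal(size=(212, 25))    # stand-in for time series trend features
y = rng.integers(0, 2, size=212)        # arrest vs. control labels (synthetic)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Feature-set-by-algorithm comparison, scored by cross-validated accuracy
print("regression, static features: ",
      cross_val_score(baseline, X_static, y, cv=5).mean())
print("SVM, static + trend features:",
      cross_val_score(svm, np.hstack([X_static, X_trend]), y, cv=5).mean())
```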
Encoding and Decoding Models in Cognitive Electrophysiology
Holdgraf, Christopher R.; Rieger, Jochem W.; Micheli, Cristiano; Martin, Stephanie; Knight, Robert T.; Theunissen, Frederic E.
2017-01-01
Cognitive neuroscience has seen rapid growth in the size and complexity of data recorded from the human brain as well as in the computational tools available to analyze this data. This data explosion has resulted in an increased use of multivariate, model-based methods for asking neuroscience questions, allowing scientists to investigate multiple hypotheses with a single dataset, to use complex, time-varying stimuli, and to study the human brain under more naturalistic conditions. These tools come in the form of “Encoding” models, in which stimulus features are used to model brain activity, and “Decoding” models, in which neural features are used to generate a stimulus output. Here we review the current state of encoding and decoding models in cognitive electrophysiology and provide a practical guide toward conducting experiments and analyses in this emerging field. Our examples focus on using linear models in the study of human language and audition. We show how to calculate auditory receptive fields from natural sounds as well as how to decode neural recordings to predict speech. The paper aims to be a useful tutorial to these approaches, and a practical introduction to using machine learning and applied statistics to build models of neural activity. The data analytic approaches we discuss may also be applied to other sensory modalities, motor systems, and cognitive systems, and we cover some examples in these areas. In addition, a collection of Jupyter notebooks is publicly available as a complement to the material covered in this paper, providing code examples and tutorials for predictive modeling in python. The aim is to provide a practical understanding of predictive modeling of human brain data and to propose best-practices in conducting these analyses. PMID:29018336
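The linear encoding model the review centers on (stimulus features regressed onto brain activity) reduces, in its simplest form, to ridge regression on a time-lagged design matrix. A minimal sketch with synthetic data, not taken from the paper's Jupyter notebooks:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_times, n_features, n_lags = 2000, 16, 10

stim = rng.normal(size=(n_times, n_features))   # e.g. spectrogram features
# Build a lagged design matrix so each sample sees the recent stimulus history
lagged = np.hstack([np.roll(stim, lag, axis=0) for lag in range(n_lags)])
true_w = rng.normal(size=lagged.shape[1])
neural = lagged @ true_w + rng.normal(scale=5.0, size=n_times)  # one channel

# Drop the first n_lags rows, which wrapped around during np.roll
model = Ridge(alpha=10.0).fit(lagged[n_lags:], neural[n_lags:])
rf = model.coef_.reshape(n_lags, n_features)    # spectrotemporal receptive field
print("receptive-field shape:", rf.shape)
```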
Mr-Moose: An advanced SED-fitting tool for heterogeneous multi-wavelength datasets
NASA Astrophysics Data System (ADS)
Drouart, G.; Falkendal, T.
2018-04-01
We present the public release of Mr-Moose, a fitting procedure that is able to perform multi-wavelength and multi-object spectral energy distribution (SED) fitting in a Bayesian framework. This procedure is able to handle a large variety of cases, from an isolated source to blended multi-component sources from a heterogeneous dataset (i.e. a range of observation sensitivities and spectral/spatial resolutions). Furthermore, Mr-Moose handles upper limits during the fitting process in a continuous way, allowing models to become gradually less probable as upper limits are approached. The aim is to propose a simple-to-use, yet highly-versatile fitting tool for handling increasing source complexity when combining multi-wavelength datasets with fully customisable filter/model databases. Complete user control is one advantage, which avoids the traditional problems related to the "black box" effect, where parameter or model tunings are impossible and can lead to overfitting and/or over-interpretation of the results. Also, while a basic knowledge of Python and statistics is required, the code aims to be sufficiently user-friendly for non-experts. We demonstrate the procedure on three cases: two artificially-generated datasets and a previous result from the literature. In particular, the most complex case (inspired by a real source, combining Herschel, ALMA and VLA data) in the context of extragalactic SED fitting makes Mr-Moose a particularly-attractive SED fitting tool when dealing with partially blended sources, without the need for data deconvolution.
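One common way to treat upper limits "continuously", as described above, is to replace the Gaussian chi-square term for a non-detection with the integral of the Gaussian up to the quoted limit, which smoothly penalizes models as they approach the limit. A sketch of such a likelihood in Python (an assumption about the general technique, not Mr-Moose's actual implementation):

```python
import numpy as np
from scipy.special import erf

def log_likelihood(model_flux, obs_flux, obs_err, is_upper_limit):
    """Gaussian log-likelihood for detections; integrated (erf) term for upper limits."""
    ll = 0.0
    for m, f, s, ul in zip(model_flux, obs_flux, obs_err, is_upper_limit):
        if ul:
            # Probability that the true flux lies below the quoted limit f
            ll += np.log(0.5 * (1.0 + erf((f - m) / (np.sqrt(2.0) * s))) + 1e-300)
        else:
            ll += -0.5 * ((f - m) / s) ** 2 - np.log(s * np.sqrt(2.0 * np.pi))
    return ll

# Toy photometry: two detections and one upper limit (made-up values)
print(log_likelihood(model_flux=[1.0, 2.0, 0.5],
                     obs_flux=[1.1, 1.8, 0.6],
                     obs_err=[0.1, 0.2, 0.2],
                     is_upper_limit=[False, False, True]))
```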
Exploiting PubChem for Virtual Screening
Xie, Xiang-Qun
2011-01-01
Importance of the field: PubChem is a public molecular information repository, a scientific showcase of the NIH Roadmap Initiative. The PubChem database holds over 27 million records of unique chemical structures of compounds (CID) derived from nearly 70 million substance depositions (SID), and contains more than 449,000 bioassay records, comprising thousands of established in vitro biochemical and cell-based screening bioassays that target more than 7,000 proteins and genes and link to over 1.8 million substances. Areas covered in this review: This review builds on recent PubChem-related computational chemistry research reported by other authors while providing readers with an overview of the PubChem database, focusing on its increasing role in cheminformatics, virtual screening and toxicity prediction modeling. What the reader will gain: These publicly available datasets in PubChem provide great opportunities for scientists to perform cheminformatics and virtual screening research for computer-aided drug design. However, the high volume and complexity of the datasets, in particular the bioassay-associated false positives/negatives and highly imbalanced datasets in PubChem, also create major challenges. Several approaches regarding the modeling of PubChem datasets and the development of virtual screening models for bioactivity and toxicity predictions are also reviewed. Take home message: Novel data-mining cheminformatics tools and virtual screening algorithms are being developed and used to retrieve, annotate and analyze the large-scale and highly complex PubChem biological screening data for drug design. PMID:21691435
Appropriate use of the increment entropy for electrophysiological time series.
Liu, Xiaofeng; Wang, Xue; Zhou, Xu; Jiang, Aimin
2018-04-01
The increment entropy (IncrEn) is a new measure for quantifying the complexity of a time series. There are three critical parameters in the IncrEn calculation: N (length of the time series), m (dimensionality), and q (quantifying precision). However, the question of how to choose the most appropriate combination of IncrEn parameters for short datasets has not been extensively explored. The purpose of this research was to provide guidance on choosing suitable IncrEn parameters for short datasets by exploring the effects of varying the parameter values. We used simulated data, epileptic EEG data and cardiac interbeat (RR) data to investigate the effects of the parameters on the calculated IncrEn values. The results reveal that IncrEn is sensitive to changes in m, q and N for short datasets (N≤500). However, IncrEn reaches stability at a data length of N=1000 with m=2 and q=2, and for short datasets (N=100), it shows better relative consistency with 2≤m≤6 and 2≤q≤8. We suggest that the value of N should be no less than 100. To enable a clear distinction between different classes based on IncrEn, we recommend that m and q take values between 2 and 4. With appropriate parameters, IncrEn enables the effective detection of complexity variations in physiological time series, suggesting that IncrEn should be useful for the analysis of physiological time series in clinical applications. Copyright © 2018 Elsevier Ltd. All rights reserved.
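For orientation, IncrEn encodes each increment of the series by its sign and a magnitude quantized into q levels, collects words of length m, and takes the entropy of the word distribution. A minimal sketch consistent with that description (the exact quantization and normalization rules here are assumptions and may differ from the authors' implementation):

```python
import numpy as np
from collections import Counter

def increment_entropy(x, m=2, q=2):
    """IncrEn sketch: encode each increment by its sign and a magnitude
    quantized into q levels, then measure entropy of length-m words."""
    v = np.diff(np.asarray(x, dtype=float))
    sd = np.std(v) or 1.0                     # guard against a constant series
    signs = np.sign(v).astype(int)
    mags = np.minimum(q - 1, np.floor(np.abs(v) * q / sd)).astype(int)
    symbols = list(zip(signs, mags))
    words = [tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1)]
    p = np.array([c / len(words) for c in Counter(words).values()])
    return -np.sum(p * np.log2(p)) / (m - 1)  # normalization assumed, m >= 2

rng = np.random.default_rng(2)
print(increment_entropy(rng.normal(size=1000), m=2, q=2))
```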
Simultaneous alignment and clustering of peptide data using a Gibbs sampling approach.
Andreatta, Massimo; Lund, Ole; Nielsen, Morten
2013-01-01
Proteins recognizing short peptide fragments play a central role in cellular signaling. As a result of high-throughput technologies, peptide-binding protein specificities can be studied using large peptide libraries at dramatically lower cost and time. Interpretation of such large peptide datasets, however, is a complex task, especially when the data contain multiple receptor binding motifs, and/or the motifs are found at different locations within distinct peptides. The algorithm presented in this article, based on Gibbs sampling, identifies multiple specificities in peptide data by performing two essential tasks simultaneously: alignment and clustering of peptide data. We apply the method to de-convolute binding motifs in a panel of peptide datasets with different degrees of complexity spanning from the simplest case of pre-aligned fixed-length peptides to cases of unaligned peptide datasets of variable length. Example applications described in this article include mixtures of binders to different MHC class I and class II alleles, distinct classes of ligands for SH3 domains and sub-specificities of the HLA-A*02:01 molecule. The Gibbs clustering method is available online as a web server at http://www.cbs.dtu.dk/services/GibbsCluster.
Automatic Beam Path Analysis of Laser Wakefield Particle Acceleration Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rubel, Oliver; Geddes, Cameron G.R.; Cormier-Michel, Estelle
2009-10-19
Numerical simulations of laser wakefield particle accelerators play a key role in the understanding of the complex acceleration process and in the design of expensive experimental facilities. As the size and complexity of simulation output grows, an increasingly acute challenge is the practical need for computational techniques that aid in scientific knowledge discovery. To that end, we present a set of data-understanding algorithms that work in concert in a pipeline fashion to automatically locate and analyze high energy particle bunches undergoing acceleration in very large simulation datasets. These techniques work cooperatively by first identifying features of interest in individual timesteps, then integrating features across timesteps, and based on the information derived perform analysis of temporally dynamic features. This combination of techniques supports accurate detection of particle beams enabling a deeper level of scientific understanding of physical phenomena than has been possible before. By combining efficient data analysis algorithms and state-of-the-art data management we enable high-performance analysis of extremely large particle datasets in 3D. We demonstrate the usefulness of our methods for a variety of 2D and 3D datasets and discuss the performance of our analysis pipeline.
Multi-view L2-SVM and its multi-view core vector machine.
Huang, Chengquan; Chung, Fu-lai; Wang, Shitong
2016-03-01
In this paper, a novel L2-SVM based classifier, Multi-view L2-SVM, is proposed to address multi-view classification tasks. The proposed Multi-view L2-SVM classifier does not have any bias in its objective function and hence has flexibility like μ-SVC, in the sense that the number of yielded support vectors can be controlled by a pre-specified parameter. The proposed Multi-view L2-SVM classifier can make full use of the coherence and the difference of different views by imposing consensus among multiple views to improve the overall classification performance. Besides, based on the generalized core vector machine (GCVM), the proposed Multi-view L2-SVM classifier is extended into its GCVM version, MvCVM, which can realize fast training on large scale multi-view datasets, with asymptotic time complexity linear in the sample size and space complexity independent of the sample size. Our experimental results demonstrated the effectiveness of the proposed Multi-view L2-SVM classifier for small scale multi-view datasets and of the proposed MvCVM classifier for large scale multi-view datasets. Copyright © 2015 Elsevier Ltd. All rights reserved.
Kidney Transplant Outcomes in the Super Obese: A National Study From the UNOS Dataset.
Kanthawar, Pooja; Mei, Xiaonan; Daily, Michael F; Chandarana, Jyotin; Shah, Malay; Berger, Jonathan; Castellanos, Ana Lia; Marti, Francesc; Gedaly, Roberto
2016-11-01
We evaluated outcomes of super-obese patients (BMI > 50) undergoing kidney transplantation in the US. We performed a review of 190 super-obese patients undergoing kidney transplantation from 1988 through 2013 using the UNOS dataset. Super-obese patients had a mean age of 45.7 years (21-75 years) and 111 (58.4%) were female. The mean BMI of the super-obese group was 56 (range 50.0-74.2). A subgroup analysis demonstrated that patients with BMI > 50 had worse survival compared to any other BMI class. The 30-day perioperative mortality and length of stay were 3.7% and 10.09 days, compared to 0.8% and 7.34 days in the non-super-obese group. On multivariable analysis, BMI > 50 was an independent predictor of 30-day mortality, with a 4.6-fold increased risk of perioperative death. BMI > 50 doubled the risk of delayed graft function and the length of stay. The multivariable analysis of survival showed a 78% increased risk of death in this group. Overall patient survival for super-obese transplant recipients at 1, 3, and 5 years was 88, 82, and 76%, compared to 96, 91, and 86% in patients transplanted with BMI < 50. A propensity score adjusted analysis further demonstrated significantly worse survival rates in super-obese patients undergoing kidney transplantation. Super-obese patients had prolonged LOS and worse DGF rates. Perioperative mortality was increased 4.6-fold compared to patients with BMI < 50. In a subgroup analysis, super-obese patients who underwent kidney transplantation had significantly worse graft and patient survival compared to underweight, normal weight, and obesity class I, II, and III (BMI 40-50) patients.
Golkarian, Ali; Naghibi, Seyed Amir; Kalantar, Bahareh; Pradhan, Biswajeet
2018-02-17
Ever increasing demand for water resources for different purposes makes it essential to have a better understanding and knowledge of water resources. Groundwater is one of the main water resources, especially in countries with arid climates. Thus, this study seeks to provide groundwater potential maps (GPMs) employing new algorithms. Accordingly, this study aims to validate the performance of C5.0, random forest (RF), and multivariate adaptive regression splines (MARS) algorithms for generating GPMs in the eastern part of Mashhad Plain, Iran. For this purpose, a dataset was produced consisting of spring locations as the indicator and groundwater-conditioning factors (GCFs) as input. In this research, 13 GCFs were selected, including altitude, slope aspect, slope angle, plan curvature, profile curvature, topographic wetness index (TWI), slope length, distance from rivers and faults, river and fault density, land use, and lithology. The dataset was divided into training and validation classes with 70 and 30% of the springs, respectively. Then, C5.0, RF, and MARS algorithms were employed using R statistical software, and the final values were transformed into GPMs. Finally, two evaluation criteria, Kappa and area under the receiver operating characteristics curve (AUC-ROC), were calculated. According to the findings of this research, MARS had the best performance with an AUC-ROC of 84.2%, followed by the RF and C5.0 algorithms with AUC-ROC values of 79.7 and 77.3%, respectively. The results indicated that AUC-ROC values for the employed models are above 70%, which shows their acceptable performance. In conclusion, the proposed methodology could be used in other geographical areas. GPMs could be used by water resource managers and related organizations to accelerate and facilitate water resource exploitation.
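A sketch of the random-forest leg of such a workflow, using scikit-learn in place of the R packages named above and a synthetic conditioning-factor matrix (the 70/30 split mirrors the design described; everything else is made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 13))   # 13 groundwater-conditioning factors (synthetic)
y = (X[:, 0] + 0.5 * X[:, 5] + rng.normal(size=500) > 0).astype(int)  # spring presence

# 70/30 split mirroring the training/validation design described above
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

prob = rf.predict_proba(X_va)[:, 1]   # groundwater potential scores
print("AUC-ROC:", roc_auc_score(y_va, prob))
print("Kappa:  ", cohen_kappa_score(y_va, rf.predict(X_va)))
```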
Alladio, E; Giacomelli, L; Biosa, G; Corcia, D Di; Gerace, E; Salomone, A; Vincenti, M
2018-01-01
The chronic intake of an excessive amount of alcohol is currently ascertained by determining the concentration of direct alcohol metabolites in the hair samples of the alleged abusers, including ethyl glucuronide (EtG) and, less frequently, fatty acid ethyl esters (FAEEs). Indirect blood biomarkers of alcohol abuse are still determined to support hair EtG results and diagnose consequent liver impairment. In the present study, the supporting role of hair FAEEs is compared with indirect blood biomarkers in contexts where hair EtG interpretation is uncertain. Receiver Operating Characteristics (ROC) curves and multivariate Principal Component Analysis (PCA) demonstrated much stronger correlation of EtG results with FAEEs than with any single indirect biomarker or their combinations. Partial Least Squares Discriminant Analysis (PLS-DA) models based on hair EtG and FAEEs were developed to maximize the biomarkers' information content on a multivariate background. The final PLS-DA model yielded 100% correct classification on a training/evaluation dataset of 155 subjects, including both chronic alcohol abusers and social drinkers. Then, the PLS-DA model was validated on an external dataset of 81 individuals, providing optimal discrimination between chronic alcohol abusers and social drinkers in terms of specificity and sensitivity. The PLS-DA score obtained for each subject, with respect to the PLS-DA model threshold that separates the probabilistic distributions of the two classes, furnished a likelihood ratio value, which in turn conveys the strength of the experimental data's support for the classification decision within a Bayesian logic. Typical borderline real cases from daily casework are also discussed. Copyright © 2017 Elsevier B.V. All rights reserved.
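PLS-DA of the kind described is often implemented as PLS regression against a coded class variable, with a score threshold separating the two classes. A minimal sketch with scikit-learn (synthetic biomarker concentrations, not the study's data):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
# Synthetic biomarker panel: columns stand in for hair EtG plus FAEE levels
X = np.vstack([rng.normal(0.0, 1.0, size=(80, 5)),    # social drinkers
               rng.normal(1.5, 1.0, size=(75, 5))])   # chronic abusers
y = np.array([0] * 80 + [1] * 75)                     # coded class labels

pls = PLSRegression(n_components=2).fit(X, y)
scores = pls.predict(X).ravel()      # continuous PLS-DA scores per subject
threshold = 0.5                      # midpoint between the coded classes
pred = (scores > threshold).astype(int)
print("training accuracy:", (pred == y).mean())
```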
Tibi, Rigobert; Koper, Keith D.; Pankow, Kristine L.; ...
2018-03-20
Most of the commonly used seismic discrimination approaches are designed for regional data. Relatively little attention has focused on discriminants for local distances (< 200 km), the range at which the smallest events are recorded. We take advantage of the variety of seismic sources and the existence of a dense regional seismic network in the Utah region to evaluate amplitude ratio seismic discrimination at local distances. First, we explored phase-amplitude Pg-to-Sg ratios for multiple frequency bands to classify events in a dataset that comprises populations of single-shot surface explosions, shallow and deep ripple-fired mining blasts, mining-induced events, and tectonic earthquakes. We achieved limited success. Then, for the same dataset, we combined the Pg-to-Sg phase-amplitude ratios with Sg-to-Rg spectral amplitude ratios in a multivariate quadratic discriminant function (QDF) approach. For two-category, pairwise classification, seven out of ten population pairs show misclassification rates of about 20% or less, with five pairs showing rates of about 10% or less. The approach performs best for the pair involving the populations of single-shot explosions and mining-induced events. By combining both Pg-to-Sg and Rg-to-Sg ratios in the multivariate QDFs, we are able to achieve an average improvement of about 4–14% in misclassification rates compared to Pg-to-Sg ratios alone. When all five event populations are considered simultaneously, as expected, the potential for misclassification increases and our QDF approach using both Pg-to-Sg and Rg-to-Sg ratios achieves an average success rate of about 74%, compared to the rate of about 86% for two-category, pairwise classification.
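The multivariate quadratic discriminant function step can be illustrated with scikit-learn's quadratic discriminant analysis on amplitude-ratio features; the two synthetic populations below are stand-ins, not the Utah dataset:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
# Synthetic features: log Pg/Sg ratios in several bands plus a log Rg/Sg ratio
X_blast = rng.normal(loc=0.5, scale=0.4, size=(120, 5))   # explosion-like events
X_quake = rng.normal(loc=-0.2, scale=0.5, size=(140, 5))  # earthquake-like events
X = np.vstack([X_blast, X_quake])
y = np.array([0] * 120 + [1] * 140)

# Quadratic discriminant: per-class covariances give curved decision boundaries
qdf = QuadraticDiscriminantAnalysis()
print("pairwise classification accuracy: %.2f" % cross_val_score(qdf, X, y, cv=5).mean())
```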
Use of multivariate statistics to identify unreliable data obtained using CASA.
Martínez, Luis Becerril; Crispín, Rubén Huerta; Mendoza, Maximino Méndez; Gallegos, Oswaldo Hernández; Martínez, Andrés Aragón
2013-06-01
In order to identify unreliable data in a dataset of motility parameters obtained from a pilot study acquired by a veterinarian with experience in boar semen handling, but without experience in the operation of a computer assisted sperm analysis (CASA) system, a multivariate graphical and statistical analysis was performed. Sixteen boar semen samples were aliquoted and then incubated with varying concentrations of progesterone from 0 to 3.33 µg/ml and analyzed in a CASA system. After standardization of the data, Chernoff faces were drawn for each measurement, and principal component analysis (PCA) was used to reduce the dimensionality and pre-process the data before hierarchical clustering. The first twelve individual measurements showed abnormal features when Chernoff faces were drawn. PCA revealed that principal components 1 and 2 explained 63.08% of the variance in the dataset. Values of the principal components for each individual measurement of semen samples were mapped to identify differences among treatments or among boars. Twelve individual measurements presented low values of principal component 1. Confidence ellipses on the map of principal components showed no statistically significant effects of treatment or boar. Hierarchical clustering performed on the first two principal components produced three clusters. Cluster 1 contained evaluations of the first two samples in each treatment, each one from a different boar. With the exception of one individual measurement, all other measurements in cluster 1 were the same as observed in abnormal Chernoff faces. Unreliable data in cluster 1 are probably related to the operator's inexperience with a CASA system. These findings could be used to objectively evaluate the skill level of an operator of a CASA system. This may be particularly useful in the quality control of semen analysis using CASA systems.
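The standardize-then-PCA-then-cluster pipeline described here is straightforward to sketch in Python (synthetic motility matrix; the parameter names in the comment are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
motility = rng.normal(size=(64, 10))   # e.g. VCL, VSL, ALH, ... per measurement

Z = StandardScaler().fit_transform(motility)   # standardize before PCA
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)                      # map of PC1 vs PC2
print("variance explained:", pca.explained_variance_ratio_.sum())

# Hierarchical clustering on the first two principal components
labels = fcluster(linkage(scores, method="ward"), t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```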
Potyrailo, Radislav A
2017-08-29
For detection of gases and vapors in complex backgrounds, "classic" analytical instruments are an unavoidable alternative to existing sensors. Recently a new generation of sensors, known as multivariable sensors, has emerged with a fundamentally different perspective on sensing that eliminates limitations of existing sensors. In multivariable sensors, a sensing material is designed to have diverse responses to different gases and vapors and is coupled to a multivariable transducer that provides independent outputs to recognize these diverse responses. Data analytics tools provide rejection of interferences and multi-analyte quantitation. This review critically analyses advances in multivariable sensors based on ligand-functionalized metal nanoparticles, also known as monolayer-protected nanoparticles (MPNs). These MPN sensing materials stand out distinctly from other sensing materials for multivariable sensors due to the diversity of gas- and vapor-response mechanisms provided by organic and biological ligands, the applicability of these sensing materials to broad classes of gas-phase compounds such as condensable vapors and non-condensable gases, and their compatibility with several principles of signal transduction in multivariable sensors, resulting in non-resonant and resonant electrical sensors as well as material- and structure-based photonic sensors. Such features should allow MPN multivariable sensors to be an attractive, high-value addition to existing analytical instrumentation.
Chang, Pao-Erh Paul; Yang, Jen-Chih Rena; Den, Walter; Wu, Chang-Fu
2014-09-01
Emissions of volatile organic compounds (VOCs) are among the most frequent environmental nuisance complaints in urban areas, especially near industrial districts. Unfortunately, identifying the emission sources responsible for VOCs is an inherently difficult task. In this study, we proposed a dynamic approach to gradually confine the location of potential VOC emission sources in an industrial complex by combining multi-path open-path Fourier transform infrared spectrometry (OP-FTIR) measurement with the statistical method of principal component analysis (PCA). Closed-cell FTIR was further used to verify the VOC emission sources by measuring emitted VOCs from selected exhaust stacks at factories in the confined areas. Multiple open-path monitoring lines were deployed during a 3-month monitoring campaign in a complex industrial district. The emission patterns were identified and the locations of emissions were confined using wind data collected simultaneously. N,N-Dimethylformamide (DMF), 2-butanone, toluene, and ethyl acetate, with mean concentrations of 80.0 ± 1.8, 34.5 ± 0.8, 103.7 ± 2.8, and 26.6 ± 0.7 ppbv, respectively, were identified as the major VOC mixture at all times of the day around the receptor site. Concentrations of DMF, a toxic air pollutant, were found to exceed the ambient standard in air samples despite the path-averaging effect of OP-FTIR on concentration levels. The PCA identified three major emission sources: PU coating, chemical packaging, and lithographic printing industries. Applying instrumental measurement and statistical modeling, this study has established a systematic approach for locating emission sources. Statistical modeling (PCA) plays an important role in reducing the dimensionality of a large measured dataset and identifying underlying emission sources. Instrumental measurement, however, helps verify the outcomes of the statistical modeling. The field study has demonstrated the feasibility of multi-path OP-FTIR measurement. Wind data incorporated with statistical modeling (PCA) may successfully identify the major emission sources in a complex industrial district.
D'Amico, E J; Neilands, T B; Zambarano, R
2001-11-01
Although power analysis is an important component in the planning and implementation of research designs, it is often ignored. Computer programs for performing power analysis are available, but most have limitations, particularly for complex multivariate designs. An SPSS procedure is presented that can be used for calculating power for univariate, multivariate, and repeated measures models with and without time-varying and time-constant covariates. Three examples provide a framework for calculating power via this method: an ANCOVA, a MANOVA, and a repeated measures ANOVA with two or more groups. The benefits and limitations of this procedure are discussed.
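The abstract describes an SPSS procedure; for readers outside SPSS, an equivalent fixed-effects ANOVA power calculation can be sketched in Python with statsmodels (this illustrates the calculation, not the SPSS syntax itself):

```python
from statsmodels.stats.power import FTestAnovaPower

# Required total sample size for a one-way ANOVA with 3 groups,
# a medium effect (Cohen's f = 0.25), alpha = .05, and power = .80
n = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                  power=0.80, k_groups=3)
print("total N:", round(n))

# Solving for power instead: fix nobs and leave power=None
p = FTestAnovaPower().solve_power(effect_size=0.25, nobs=120,
                                  alpha=0.05, k_groups=3)
print("achieved power with N=120:", round(p, 3))
```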
Soh, Jung; Turinsky, Andrei L; Trinh, Quang M; Chang, Jasmine; Sabhaney, Ajay; Dong, Xiaoli; Gordon, Paul Mk; Janzen, Ryan Pw; Hau, David; Xia, Jianguo; Wishart, David S; Sensen, Christoph W
2009-01-01
We have developed a computational framework for spatiotemporal integration of molecular and anatomical datasets in a virtual reality environment. Using two case studies involving gene expression data and pharmacokinetic data, respectively, we demonstrate how existing knowledge bases for molecular data can be semantically mapped onto a standardized anatomical context of the human body. Our data mapping methodology uses ontological representations of heterogeneous biomedical datasets and an ontology reasoner to create complex semantic descriptions of biomedical processes. This framework provides a means to systematically combine an increasing amount of biomedical imaging and numerical data into spatiotemporally coherent graphical representations. Our work enables medical researchers with different expertise to simulate complex phenomena visually and to develop insights through the use of shared data, thus paving the way for pathological inference, developmental pattern discovery and biomedical hypothesis testing.
Prospects and challenges for fungal metatranscriptomes of complex communities
Kuske, Cheryl Rae; Hesse, Cedar Nelson; Challacombe, Jean Faust; ...
2015-01-22
We report that the ability to extract and purify messenger RNA directly from plants, decomposing organic matter and soil, followed by high-throughput sequencing of the pool of expressed genes, has spawned the emerging research area of metatranscriptomics. Each metatranscriptome provides a snapshot of the composition and relative abundance of actively transcribed genes, and thus provides an assessment of the interactions between soil microorganisms and plants, and of collective microbial metabolic processes in many environments. We highlight current approaches for the analysis of fungal transcriptome and metatranscriptome datasets across a gradient of community complexity, and note benefits and pitfalls associated with those approaches. Finally, we discuss knowledge gaps that limit our current ability to interpret metatranscriptome datasets and suggest future research directions that will require concerted efforts within the scientific community.
MFIB: a repository of protein complexes with mutual folding induced by binding.
Fichó, Erzsébet; Reményi, István; Simon, István; Mészáros, Bálint
2017-11-15
It is commonplace that intrinsically disordered proteins (IDPs) are involved in crucial interactions in the living cell. However, the study of protein complexes formed exclusively by IDPs is hindered by the lack of data and such analyses remain sporadic. Systematic studies benefited other types of protein-protein interactions paving a way from basic science to therapeutics; yet these efforts require reliable datasets that are currently lacking for synergistically folding complexes of IDPs. Here we present the Mutual Folding Induced by Binding (MFIB) database, the first systematic collection of complexes formed exclusively by IDPs. MFIB contains an order of magnitude more data than any dataset used in corresponding studies and offers a wide coverage of known IDP complexes in terms of flexibility, oligomeric composition and protein function from all domains of life. The included complexes are grouped using a hierarchical classification and are complemented with structural and functional annotations. MFIB is backed by a firm development team and infrastructure, and together with possible future community collaboration it will provide the cornerstone for structural and functional studies of IDP complexes. MFIB is freely accessible at http://mfib.enzim.ttk.mta.hu/. The MFIB application is hosted by Apache web server and was implemented in PHP. To enrich querying features and to enhance backend performance a MySQL database was also created. simon.istvan@ttk.mta.hu, meszaros.balint@ttk.mta.hu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
NASA Astrophysics Data System (ADS)
Wyborn, Lesley; Car, Nicholas; Evans, Benjamin; Klump, Jens
2016-04-01
Persistent identifiers in the form of a Digital Object Identifier (DOI) are becoming more mainstream, assigned at both the collection and dataset level. For static datasets, this is a relatively straightforward matter. However, many new data collections are dynamic, with new data being appended, models and derivative products being revised with new data, or the data itself revised as processing methods are improved. Further, because data collections are becoming accessible as services, researchers can log in and dynamically create user-defined subsets for specific research projects; they can also easily mix and match data from multiple collections, each of which can have a complex history. Inevitably, extracts from such dynamic datasets underpin scholarly publications, and this presents new challenges. The National Computational Infrastructure (NCI) has been experiencing and making progress towards addressing these issues. The NCI is a large node of the Research Data Services initiative (RDS) of the Australian Government's research infrastructure, which currently makes available over 10 PBytes of priority research collections, ranging from geosciences, geophysics, environment, and climate, through to astronomy, bioinformatics, and social sciences. Data are replicated to, or are produced at, NCI and then processed there to higher-level data products or directly analysed. Individual datasets range from multi-petabyte computational models and large volume raster arrays, down to gigabyte size, ultra-high resolution datasets. To facilitate access, maximise reuse and enable integration across the disciplines, datasets have been organized on a platform called the National Environmental Research Data Interoperability Platform (NERDIP). Combined, the NERDIP data collections form a rich and diverse asset for researchers: their co-location and standardization optimises the value of existing data, and forms a new resource to underpin data-intensive science. New publication procedures require that a persistent identifier (DOI) be provided for the dataset that underpins the publication. Producing these for data extracts from the NCI data node using only DOIs is proving difficult: preserving a copy of each data extract is not possible due to data scale. One proposal is for researchers to use workflows that capture the provenance of each data extraction, including metadata (e.g., the version of the dataset used, the query and the time of extraction). In parallel, NCI is now working with the NERDIP dataset providers to ensure that the provenance of data publication is also captured in provenance systems, including references to previous versions and a history of data appended or modified. This proposed solution would require an enhancement to scholarly publication procedures whereby the reference to the dataset underlying a scholarly publication would be the persistent identifier of the provenance workflow that created the data extract. In turn, the provenance workflow would itself link to a series of persistent identifiers that, at a minimum, provide complete dataset production transparency and, if required, would facilitate reconstruction of the dataset. Such a solution will require strict adherence to design patterns for provenance representation to ensure that the provenance representation of the workflow does indeed contain the information required to deliver dataset generation transparency and a pathway to reconstruction.
2014-01-01
BACKGROUND Average real variability (ARV) is a recently proposed index for short-term blood pressure (BP) variability. We aimed to determine the minimum number of BP readings required to compute ARV without loss of prognostic information. METHODS ARV was calculated from a discovery dataset that included 24-hour ambulatory BP measurements for 1,254 residents (mean age = 56.6 years; 43.5% women) of Copenhagen, Denmark. Concordance between ARV from full (≥80 BP readings) and randomly reduced 24-hour BP recordings was examined, as was prognostic accuracy. A test dataset that included 5,353 subjects (mean age = 54.0 years; 45.6% women) with at least 48 BP measurements from 11 randomly recruited population cohorts was used to validate the results. RESULTS In the discovery dataset, a minimum of 48 BP readings allowed an accurate assessment of the association between cardiovascular risk and ARV. In the test dataset, over 10.2 years (median), 806 participants died (335 cardiovascular deaths, 206 cardiac deaths) and 696 experienced a major fatal or nonfatal cardiovascular event. Standardized multivariable-adjusted hazard ratios (HRs) were computed for associations between outcome and BP variability. Higher diastolic ARV in 24-hour ambulatory BP recordings predicted (P < 0.01) total (HR = 1.12), cardiovascular (HR = 1.19), and cardiac (HR = 1.19) mortality and fatal combined with nonfatal cerebrovascular events (HR = 1.16). Higher systolic ARV in 24-hour ambulatory BP recordings predicted (P < 0.01) total (HR = 1.12), cardiovascular (HR = 1.17), and cardiac (HR = 1.24) mortality. CONCLUSIONS Forty-eight BP readings over 24 hours were observed to be adequate to compute ARV without meaningful loss of prognostic information. PMID:23955605
Who shares? Who doesn't? Factors associated with openly archiving raw research data.
Piwowar, Heather A
2011-01-01
Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.
Adali, Tülay; Levin-Schwartz, Yuri; Calhoun, Vince D.
2015-01-01
Fusion of information from multiple sets of data in order to extract a set of features that are most useful and relevant for the given task is inherent to many problems we deal with today. Since, usually, very little is known about the actual interaction among the datasets, it is highly desirable to minimize the underlying assumptions. This has been the main reason for the growing importance of data-driven methods, and in particular of independent component analysis (ICA), as it provides useful decompositions with a simple generative model and using only the assumption of statistical independence. A recent extension of ICA, independent vector analysis (IVA), generalizes ICA to multiple datasets by exploiting the statistical dependence across the datasets, and hence, as we discuss in this paper, provides an attractive solution to fusion of data from multiple datasets along with ICA. In this paper, we focus on two multivariate solutions for multi-modal data fusion that let multiple modalities fully interact for the estimation of underlying features that jointly report on all modalities. One solution is the Joint ICA model that has found wide application in medical imaging, and the second is the Transposed IVA model introduced here as a generalization of an approach based on multi-set canonical correlation analysis. In the discussion, we emphasize the role of diversity in the decompositions achieved by these two models, and present their properties and implementation details to enable the user to make informed decisions on the selection of a model along with its associated parameters. Discussions are supported by simulation results to help highlight the main issues in the implementation of these methods. PMID:26525830
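The Joint ICA idea discussed above can be sketched by concatenating the two modalities feature-wise per subject and running a single ICA, so each component carries a joint loading profile spanning both modalities. A minimal illustration with scikit-learn's FastICA (synthetic data; not the authors' implementation):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(8)
n_subjects = 50
modality_a = rng.normal(size=(n_subjects, 200))   # e.g. fMRI contrast features
modality_b = rng.normal(size=(n_subjects, 120))   # e.g. EEG or SNP features

# Joint ICA sketch: concatenate modalities feature-wise, then decompose so
# each component has a loading profile spanning both modalities
fused = np.hstack([modality_a, modality_b])
ica = FastICA(n_components=5, random_state=0)
mixing = ica.fit_transform(fused)                 # subject expression of components
components = ica.components_                      # joint feature maps
maps_a, maps_b = components[:, :200], components[:, 200:]
print(mixing.shape, maps_a.shape, maps_b.shape)
```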
Chanel, Guillaume; Pichon, Swann; Conty, Laurence; Berthoz, Sylvie; Chevallier, Coralie; Grèzes, Julie
2015-01-01
Multivariate pattern analysis (MVPA) has been applied successfully to task-based and resting-based fMRI recordings to investigate which neural markers distinguish individuals with autistic spectrum disorders (ASD) from controls. While most studies have focused on brain connectivity during resting state episodes and region of interest (ROI) approaches, a wealth of task-based fMRI datasets have been acquired in these populations in the last decade. This calls for techniques that can leverage information not only from a single dataset, but from several existing datasets that might share some common features and biomarkers. We propose a fully data-driven (voxel-based) approach that we apply to two different fMRI experiments with social stimuli (faces and bodies). The method, based on Support Vector Machines (SVMs) and Recursive Feature Elimination (RFE), is first trained for each experiment independently, and the outputs are then combined to obtain a final classification output. Second, this RFE output is used to determine which voxels are most often selected for classification, to generate maps of significant discriminative activity. Finally, to further explore the clinical validity of the approach, we correlate phenotypic information with the obtained classifier scores. The results reveal good classification accuracy (between 69% and 92.3%). Moreover, we were able to identify discriminative activity patterns pertaining to the social brain without relying on a priori ROI definitions. Finally, social motivation was the only dimension that correlated with classifier scores, suggesting that it is the main dimension captured by the classifiers. Altogether, we believe that the present RFE method proves to be efficient and may help identify relevant biomarkers by taking advantage of acquired task-based fMRI datasets in psychiatric populations. PMID:26793434
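A stripped-down version of the SVM-RFE step can be sketched with scikit-learn, where a linear SVM is refit while the lowest-weight voxels are discarded each round (synthetic data; the dimensions are arbitrary stand-ins):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.normal(size=(60, 5000))   # voxel activations per participant (synthetic)
y = rng.integers(0, 2, size=60)   # ASD vs. control labels (synthetic)

# Recursive feature elimination with a linear SVM: drop the 10% of voxels
# with the smallest absolute weights at each iteration
selector = RFE(SVC(kernel="linear"), n_features_to_select=200, step=0.1)
selector.fit(X, y)
discriminative_voxels = np.where(selector.support_)[0]
print("voxels kept:", discriminative_voxels.size)
```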
Global relationships in river hydromorphology
NASA Astrophysics Data System (ADS)
Pavelsky, T.; Lion, C.; Allen, G. H.; Durand, M. T.; Schumann, G.; Beighley, E.; Yang, X.
2017-12-01
Since the widespread adoption of digital elevation models (DEMs) in the 1980s, most global and continental-scale analysis of river flow characteristics has been focused on measurements derived from DEMs such as drainage area, elevation, and slope. These variables (especially drainage area) have been related to other quantities of interest such as river width, depth, and velocity via empirical relationships that often take the form of power laws. More recently, a number of groups have developed more direct measurements of river location and some aspects of planform geometry from optical satellite imagery on regional, continental, and global scales. However, these satellite-derived datasets often lack many of the qualities that make DEM-derived datasets attractive, including robust network topology. Here, we present analysis of a dataset that combines the Global River Widths from Landsat (GRWL) database of river location, width, and braiding index with a river database extracted from the Shuttle Radar Topography Mission DEM and the HydroSHEDS dataset. Using these combined tools, we present a dataset that includes measurements of river width, slope, braiding index, upstream drainage area, and other variables. The dataset is available everywhere that both datasets are available, which includes all continental areas south of 60N with rivers sufficiently large to be observed with Landsat imagery. We use the dataset to examine patterns and frequencies of river form across continental and global scales as well as global relationships among variables including width, slope, and drainage area. The results demonstrate the complex relationships among different dimensions of river hydromorphology at the global scale.
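The power-law relationships mentioned above (e.g., width as a function of drainage area) are typically fit by linear regression in log-log space. A quick sketch (all values synthetic; the exponent is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(10)
drainage_area = 10 ** rng.uniform(2, 6, size=5000)   # km^2, synthetic
# Generate widths from an assumed power law W = 0.5 * A^0.4 with scatter
width = 0.5 * drainage_area ** 0.4 * rng.lognormal(sigma=0.3, size=5000)

# Fit W = a * A^b by least squares in log-log space
b, log_a = np.polyfit(np.log10(drainage_area), np.log10(width), deg=1)
print("exponent b = %.3f, coefficient a = %.3f" % (b, 10 ** log_a))
```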
Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae
Reguly, Teresa; Breitkreutz, Ashton; Boucher, Lorrie; Breitkreutz, Bobby-Joe; Hon, Gary C; Myers, Chad L; Parsons, Ainslie; Friesen, Helena; Oughtred, Rose; Tong, Amy; Stark, Chris; Ho, Yuen; Botstein, David; Andrews, Brenda; Boone, Charles; Troyanskya, Olga G; Ideker, Trey; Dolinski, Kara; Batada, Nizar N; Tyers, Mike
2006-01-01
Background The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference. Results We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID () and SGD () databases. Conclusion Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks. PMID:16762047
ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.
Teodoro, Douglas; Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio
2018-01-01
The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating insert throughput and query latency performance of several NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.
Khosrow-Khavar, Farzad; Tavakolian, Kouhyar; Blaber, Andrew; Menon, Carlo
2016-10-12
The purpose of this research was to design a delineation algorithm that could detect specific fiducial points of the seismocardiogram (SCG) signal with or without using the electrocardiogram (ECG) R-wave as the reference point. The detected fiducial points were used to estimate cardiac time intervals. Due to the complexity and sensitivity of the SCG signal, the algorithm was designed to robustly discard low-quality cardiac cycles, which are the ones that contain unrecognizable fiducial points. The algorithm was trained on a dataset containing 48,318 manually annotated cardiac cycles. It was then applied to three test datasets: 65 young healthy individuals (dataset 1), 15 individuals above 44 years old (dataset 2), and 25 patients with previous heart conditions (dataset 3). The algorithm accomplished high prediction accuracy, with a root-mean-square error of less than 5 ms for all the test datasets. The algorithm's overall mean detection rates per individual recording (DRI) were 74, 68, and 42 percent for the three test datasets when concurrent ECG and SCG were used. For the standalone SCG case, the mean DRI was 32, 14 and 21 percent. When the proposed algorithm was applied to concurrent ECG and SCG signals, the desired fiducial points of the SCG signal were successfully estimated with a high detection rate. For the standalone case, however, the algorithm achieved high prediction accuracy and detection rate only for the young-individual dataset.
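To illustrate the ECG-referenced mode described above, a simplified stand-in gates the SCG to a short window after each R peak and keeps the most prominent peak, discarding cycles with no clear candidate. This is a toy sketch with scipy, not the authors' trained delineation algorithm:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_scg_fiducials(scg, r_peaks, fs, window_s=0.25):
    """For each ECG R peak, search a short following window of the SCG and
    keep the most prominent peak as a candidate fiducial point; cycles with
    no sufficiently prominent peak are discarded as low quality."""
    fiducials = []
    win = int(window_s * fs)
    for r in r_peaks:
        segment = scg[r:r + win]
        peaks, props = find_peaks(segment, prominence=0.5 * np.std(segment))
        if peaks.size:                      # otherwise: low-quality cycle, skip
            best = peaks[np.argmax(props["prominences"])]
            fiducials.append(r + best)
    return np.array(fiducials)

# Synthetic demo: 10 s of noise with R peaks every second (fs = 500 Hz)
fs = 500
scg = np.random.default_rng(11).normal(size=10 * fs)
r_peaks = np.arange(fs // 2, 10 * fs, fs)
print(detect_scg_fiducials(scg, r_peaks, fs)[:5])
```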
LEAP: biomarker inference through learning and evaluating association patterns.
Jiang, Xia; Neapolitan, Richard E
2015-03-01
Single nucleotide polymorphism (SNP) high-dimensional datasets are available from Genome Wide Association Studies (GWAS). Such data provide researchers opportunities to investigate the complex genetic basis of diseases. Much of genetic risk might be due to undiscovered epistatic interactions, in which combinations of several genes affect disease. Research aimed at discovering interacting SNPs from GWAS datasets has proceeded in two directions. First, tools were developed to evaluate candidate interactions. Second, algorithms were developed to search over the space of candidate interactions. Another problem when learning interacting SNPs, which has not received much attention, is evaluating how likely it is that the learned SNPs are associated with the disease. A complete system should provide this information as well. We develop such a system. Our system, called LEAP, includes a new heuristic search algorithm for learning interacting SNPs, and a Bayesian network based algorithm for computing the probability of their association. We evaluated the performance of LEAP using 100 1,000-SNP simulated datasets, each of which contains 15 SNPs involved in interactions. When learning interacting SNPs from these datasets, LEAP outperformed seven other methods. Furthermore, only SNPs involved in interactions were found to be probable. We also used LEAP to analyze real Alzheimer's disease and breast cancer GWAS datasets. We obtained interesting and new results from the Alzheimer's dataset, but limited results from the breast cancer dataset. We conclude that our results support that LEAP is a useful tool for extracting candidate interacting SNPs from high-dimensional datasets and determining their probability. © 2015 The Authors. Genetic Epidemiology published by Wiley Periodicals, Inc.
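For flavor only, here is a brute-force evaluation of one candidate SNP pair via a chi-square test on the joint-genotype contingency table; LEAP's actual heuristic search and Bayesian network scoring are not reproduced here.

```python
import numpy as np
from scipy.stats import chi2_contingency

def pair_p_value(g1, g2, disease):
    """g1, g2: genotypes coded 0/1/2; disease: 0/1 case-control labels."""
    joint = g1 * 3 + g2                      # 9 joint-genotype cells
    table = np.array([[np.sum((joint == c) & (disease == d)) for d in (0, 1)]
                      for c in range(9)])
    table = table[table.sum(axis=1) > 0]     # drop unobserved genotype combinations
    _, p, _, _ = chi2_contingency(table)
    return p
```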
RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system
Jensen, Tue V.; Pinson, Pierre
2017-01-01
Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation. PMID:29182600
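As a sketch of the scaling mentioned above, one can rescale a renewable generation signal so that it covers a target share of total demand energy; the CSV layout and column structure below are assumptions, not the dataset's documented format.

```python
import pandas as pd

# Hypothetical extracts from the dataset: nodal load (MW) and per-unit
# renewable capacity-factor signals with matching timestamps.
load = pd.read_csv("load.csv", index_col=0, parse_dates=True)
wind = pd.read_csv("wind_signal.csv", index_col=0, parse_dates=True)

target_penetration = 0.5   # wind should cover 50% of total demand energy
scale = target_penetration * load.values.sum() / wind.values.sum()
wind_scaled = wind * scale  # scaled generation time series, MW
```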
A dataset on human navigation strategies in foreign networked systems.
Kőrösi, Attila; Csoma, Attila; Rétvári, Gábor; Heszberger, Zalán; Bíró, József; Tapolcai, János; Pelle, István; Klajbár, Dávid; Novák, Márton; Halasi, Valentina; Gulyás, András
2018-03-13
Humans are involved in various real-life networked systems. The most obvious examples are social and collaboration networks, but the language and the related mental lexicon they use, or the physical map of their territory, can also be interpreted as networks. How do they find paths between endpoints in these networks? How do they obtain information about a foreign networked world they find themselves in, how do they build a mental model of it, and how well do they succeed in using it? Large, open datasets allowing the exploration of such questions are hard to find. Here we report a dataset collected by a smartphone application, in which players navigate between fixed-length source and destination English words step by step by changing only one letter at a time. The paths reflect how the players master their navigation skills in such a foreign networked world. The dataset can be used in the study of human mental models for the world around us, or in a broader scope to investigate navigation strategies in complex networked systems.
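The game's underlying network is easy to reconstruct: words of equal length are nodes, and an edge connects words differing in exactly one letter. A small sketch with a toy word list, using breadth-first search to get the shortest path that human routes can be compared against:

```python
from collections import deque
import string

words = {"cold", "cord", "card", "ward", "warm", "wood", "word"}  # toy lexicon

def neighbors(word):
    """All in-vocabulary words one letter-change away."""
    for i in range(len(word)):
        for c in string.ascii_lowercase:
            cand = word[:i] + c + word[i + 1:]
            if cand != word and cand in words:
                yield cand

def shortest_path(src, dst):
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])

print(shortest_path("cold", "warm"))  # ['cold', 'cord', 'card', 'ward', 'warm']
```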
Joint Blind Source Separation by Multi-set Canonical Correlation Analysis
Li, Yi-Ou; Adalı, Tülay; Wang, Wei; Calhoun, Vince D
2009-01-01
In this work, we introduce a simple and effective scheme to achieve joint blind source separation (BSS) of multiple datasets using multi-set canonical correlation analysis (M-CCA) [1]. We first propose a generative model of joint BSS based on the correlation of latent sources within and between datasets. We specify source separability conditions, and show that, when the conditions are satisfied, the group of corresponding sources from each dataset can be jointly extracted by M-CCA through maximization of correlation among the extracted sources. We compare the source separation performance of the M-CCA scheme with that of other joint BSS methods and demonstrate the superior performance of the M-CCA scheme in achieving joint BSS for a large number of datasets, groups of corresponding sources with heterogeneous correlation values, and complex-valued sources with circular and non-circular distributions. We apply M-CCA to the analysis of functional magnetic resonance imaging (fMRI) data from multiple subjects and show its utility in estimating meaningful brain activations from a visuomotor task. PMID:20221319
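To fix ideas, a two-dataset special case with scikit-learn's CCA on synthetic data sharing latent sources; M-CCA extends this by maximizing correlation jointly across many datasets, which the two-set routine below does not do.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 2))                    # shared latent sources
X = latent @ rng.standard_normal((2, 10)) + 0.1 * rng.standard_normal((500, 10))
Y = latent @ rng.standard_normal((2, 8)) + 0.1 * rng.standard_normal((500, 8))

cca = CCA(n_components=2).fit(X, Y)
U, V = cca.transform(X, Y)
print([round(np.corrcoef(U[:, k], V[:, k])[0, 1], 3) for k in range(2)])  # near 1
```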
FlexAID: Revisiting Docking on Non-Native-Complex Structures.
Gaudreault, Francis; Najmanovich, Rafael J
2015-07-27
Small-molecule protein docking is an essential tool in drug design and in understanding molecular recognition. In the present work we introduce FlexAID, a small-molecule docking algorithm that accounts for target side-chain flexibility and utilizes a soft scoring function, i.e. one that is not highly dependent on specific geometric criteria, based on surface complementarity. The pairwise energy parameters were derived from a large dataset of true positive poses and negative decoys from the PDBbind database through an iterative process using Monte Carlo simulations. The prediction of binding poses is tested using the widely used Astex dataset as well as the HAP2 dataset, while performance in virtual screening is evaluated using a subset of the DUD dataset. We compare FlexAID to AutoDock Vina, FlexX, and rDock in an extensive number of scenarios to understand the strengths and limitations of the different programs, as well as to reported results for Glide, GOLD, and DOCK6 where applicable. The most relevant among these scenarios is that of docking on flexible non-native-complex structures where, as is the case in reality, the target conformation in the bound form is not known a priori. We demonstrate that FlexAID, unlike other programs, is robust against increasing structural variability. FlexAID obtains sampling success equivalent to that of GOLD and performs better than AutoDock Vina or FlexX in all scenarios against non-native-complex structures. FlexAID is better than rDock when there is at least one critical side-chain movement required upon ligand binding. In virtual screening, FlexAID results are lower on average than those of AutoDock Vina and rDock. The higher accuracy in flexible targets where critical movements are required, the intuitive PyMOL-integrated graphical user interface, and free source code as well as precompiled executables for Windows, Linux, and Mac OS make FlexAID a welcome addition to the arsenal of existing small-molecule protein docking methods.
Syfert, Mindy M; Smith, Matthew J; Coomes, David A
2013-01-01
Species distribution models (SDMs) trained on presence-only data are frequently used in ecological research and conservation planning. However, users of SDM software are faced with a variety of options, and it is not always obvious how selecting one option over another will affect model performance. Working with MaxEnt software and with tree fern presence data from New Zealand, we assessed whether (a) choosing to correct for geographical sampling bias and (b) using complex environmental response curves have strong effects on goodness of fit. SDMs were trained on tree fern data, obtained from an online biodiversity data portal, with two sources that differed in size and geographical sampling bias: a small, widely-distributed set of herbarium specimens and a large, spatially clustered set of ecological survey records. We attempted to correct for geographical sampling bias by incorporating sampling bias grids in the SDMs, created from all georeferenced vascular plants in the datasets, and explored model complexity issues by fitting a wide variety of environmental response curves (known as "feature types" in MaxEnt). In each case, goodness of fit was assessed by comparing predicted range maps with tree fern presences and absences in an independent national dataset used to validate the SDMs. We found that correcting for geographical sampling bias led to major improvements in goodness of fit, but did not entirely resolve the problem: predictions made with clustered ecological data were inferior to those made with the herbarium dataset, even after sampling bias correction. We also found that the choice of feature type had negligible effects on predictive performance, indicating that simple feature types may be sufficient once sampling bias is accounted for. Our study emphasizes the importance of reducing geographical sampling bias, where possible, in datasets used to train SDMs, and the effectiveness, and indeed necessity, of sampling bias correction within MaxEnt.
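A sketch of one common way to build such a sampling-bias grid (a target-group background surface): histogram all georeferenced vascular-plant records onto the model grid and normalize. Extent and resolution below are placeholders, not the study's actual grid.

```python
import numpy as np

def bias_grid(lons, lats, extent=(165.0, 179.0, -48.0, -34.0), shape=(140, 140)):
    """Relative sampling effort per cell from record coordinates."""
    H, _, _ = np.histogram2d(lons, lats, bins=shape,
                             range=[extent[:2], extent[2:]])
    H = H + 1.0          # floor so no cell gets zero weight
    return H / H.sum()   # normalized bias surface for MaxEnt
```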
Identifying the controls of wildfire activity in Namibia using multivariate statistics
NASA Astrophysics Data System (ADS)
Mayr, Manuel; Le Roux, Johan; Samimi, Cyrus
2015-04-01
Despite large areas of Namibia being unaffected by fires due to aridity, substantial burning in the northern and north-eastern parts of the country is observed every year. Within the fire-affected regions, a strong spatial and inter-annual variability characterizes the dry-season fire situation. In order to understand these patterns, it appears critical to identify the causative factors behind fire occurrence and to examine their interactions in detail. Furthermore, most studies dealing with causative factor examination focus either on the local or the regional scale. However, these scales seem to be inappropriate from a management perspective, as fire-related strategic action plans are most often set up nationwide. Here, we will present an examination of the fire regimes of Namibia based on a dataset compiled by Le Roux (2011). A decade-spanning fire record (1994-2003) derived from NOAA's Advanced Very High Resolution Radiometer (AVHRR) imagery was used to generate four fire regime metrics (Burned Area, Fire Season Length, Month of Peak Fire Season, and Fire Return Period) and quantitative information on vegetation and phenology derived from Normalized Difference Vegetation Index (NDVI) time series. Further variables contained in this dataset are related to climate, biodiversity, and human activities. Le Roux (2011) analyzed the correlations between the fire metrics mentioned above and the predictor variables. We hypothesize that linear correlations (as estimated by correlation coefficients) oversimplify the interactions between response and predictor variables. For instance, moderate population densities could induce the highest number of fires, whereas the complete absence of humans removes one major source of ignition. Around highly populated areas, by contrast, fuels are usually reduced and space is more fragmented; thus, the initiation and spread of a potential fire could just as well be inhibited. From a total of over 40 explanatory variables, we will initially use data mining techniques to select a plausible subset of variables based on their explanatory value and to remove redundancy. We will then apply two multivariate statistical methods suitable for a large variety of data types and frequently used for (non-linear) causative factor identification: Non-metric Multidimensional Scaling (NMDS) and Regression Trees. The anticipated value of these analyses is i) to determine the most important predictor variables of fire activity in Namibia, ii) to decipher their complex interactions in driving fire variability in Namibia, and iii) to compare the performance of two state-of-the-art statistical methods. References: Le Roux, J. (2011): The effect of land use practices on the spatial and temporal characteristics of savanna fires in Namibia. Doctoral thesis at the University of Erlangen-Nuremberg/Germany - 155 pages.
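A sketch pairing the two named methods on a hypothetical table of fire metrics and predictors (all column and file names are placeholders for the Le Roux dataset):

```python
import pandas as pd
from sklearn.manifold import MDS
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("namibia_fire_dataset.csv")   # hypothetical file
predictors = df[["rainfall", "ndvi_mean", "pop_density", "grazing"]]  # placeholders

# Regression tree: non-linear, interaction-aware predictor effects.
tree = DecisionTreeRegressor(max_depth=4).fit(predictors, df["burned_area"])

# Non-metric MDS: rank-based ordination of the samples.
coords = MDS(n_components=2, metric=False).fit_transform(predictors)
```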
On multivariate trace inequalities of Sutter, Berta, and Tomamichel
NASA Astrophysics Data System (ADS)
Lemm, Marius
2018-01-01
We consider a family of multivariate trace inequalities recently derived by Sutter, Berta, and Tomamichel. These inequalities generalize the Golden-Thompson inequality and Lieb's triple matrix inequality to an arbitrary number of matrices in a way that features complex matrix powers (i.e., certain unitaries). We show that their inequalities can be rewritten as an n-matrix generalization of Lieb's original triple matrix inequality. The complex matrix powers are replaced by resolvents and appropriate maximally entangled states. We expect that the technically advantageous properties of resolvents, in particular for perturbation theory, can be of use in applications of the n-matrix inequalities, e.g., for analyzing the performance of the rotated Petz recovery map in quantum information theory and for removing the unitaries altogether.
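For reference, the two classical inequalities this family generalizes can be stated for Hermitian matrices H1, H2, H3 as follows; note the resolvents (e^{-H3} + t)^{-1} in Lieb's inequality, which foreshadow the resolvent form discussed above.

```latex
% Golden--Thompson inequality:
\operatorname{tr} e^{H_1 + H_2} \le \operatorname{tr}\bigl(e^{H_1} e^{H_2}\bigr)

% Lieb's triple matrix inequality:
\operatorname{tr} e^{H_1 + H_2 + H_3}
  \le \int_0^\infty \operatorname{tr}\Bigl(e^{H_1}\,(e^{-H_3} + t)^{-1}\,
      e^{H_2}\,(e^{-H_3} + t)^{-1}\Bigr)\, dt
```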
Classifying Structures in the ISM with Machine Learning Techniques
NASA Astrophysics Data System (ADS)
Beaumont, Christopher; Goodman, A. A.; Williams, J. P.
2011-01-01
The processes which govern molecular cloud evolution and star formation often sculpt structures in the ISM: filaments, pillars, shells, outflows, etc. Because of their morphological complexity, these objects are often identified manually. Manual classification has several disadvantages; the process is subjective, not easily reproducible, and does not scale well to handle increasingly large datasets. We have explored to what extent machine learning algorithms can be trained to autonomously identify specific morphological features in molecular cloud datasets. We show that the Support Vector Machine algorithm can successfully locate filaments and outflows blended with other emission structures. When the objects of interest are morphologically distinct from the surrounding emission, this autonomous classification achieves >90% accuracy. We have developed a set of IDL-based tools to apply this technique to other datasets.
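A schematic of that supervised setup, with synthetic per-pixel morphology features standing in for the real ones:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 8))                 # per-pixel feature vectors
y = (X[:, 0] + 0.5 * X[:, 3] > 0.8).astype(int)    # synthetic "filament" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)     # train on labeled examples
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```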
Kokel, David; Rennekamp, Andrew J; Shah, Asmi H; Liebel, Urban; Peterson, Randall T
2012-08-01
For decades, studying the behavioral effects of individual drugs and genetic mutations has been at the heart of efforts to understand and treat nervous system disorders. High-throughput technologies adapted from other disciplines (e.g., high-throughput chemical screening, genomics) are changing the scale of data acquisition in behavioral neuroscience. Massive behavioral datasets are beginning to emerge, particularly from zebrafish labs, where behavioral assays can be performed rapidly and reproducibly in 96-well, high-throughput format. Mining these datasets and making comparisons across different assays are major challenges for the field. Here, we review behavioral barcoding, a process by which complex behavioral assays are reduced to a string of numeric features, facilitating analysis and comparison within and across datasets. Copyright © 2012 Elsevier Ltd. All rights reserved.
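A toy version of the barcoding idea, with invented features: summarize each activity trace by binned means, then z-score against control wells so barcodes are comparable across assays.

```python
import numpy as np

def barcode(trace, n_bins=10):
    """Reduce a behavioral time series to a short numeric feature string."""
    bins = np.array_split(np.asarray(trace, dtype=float), n_bins)
    return np.array([b.mean() for b in bins])

def normalize(barcodes, control_barcodes):
    """Z-score barcodes against the control distribution, feature-wise."""
    mu = control_barcodes.mean(axis=0)
    sd = control_barcodes.std(axis=0) + 1e-9   # avoid division by zero
    return (barcodes - mu) / sd
```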
Cai, Jia; Tang, Yi
2018-02-01
Canonical correlation analysis (CCA) is a powerful statistical tool for detecting the linear relationship between two sets of multivariate variables. Its kernel generalization, kernel CCA, was proposed to describe nonlinear relationships between two variables. Although kernel CCA can achieve dimensionality reduction for high-dimensional feature selection problems, it is also prone to over-fitting. In this paper, we consider a new kernel CCA algorithm via the randomized Kaczmarz method. The main contributions of the paper are: (1) a new kernel CCA algorithm is developed, (2) theoretical convergence of the proposed algorithm is addressed by means of the scaled condition number, (3) a lower bound on the required number of iterations is presented. We test on both a synthetic dataset and several real-world datasets in cross-language document retrieval and content-based image retrieval to demonstrate the effectiveness of the proposed algorithm. Numerical results imply the performance and efficiency of the new algorithm, which is competitive with several state-of-the-art kernel CCA methods. Copyright © 2017 Elsevier Ltd. All rights reserved.
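For readers unfamiliar with the solver, here is the plain randomized Kaczmarz iteration for a linear system Ax = b (rows sampled proportionally to their squared norms, iterate projected onto the sampled row's hyperplane); its adaptation to kernel CCA in the paper is not reproduced here.

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    probs = (A ** 2).sum(axis=1)
    probs = probs / probs.sum()                # row-norm sampling distribution
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        x += (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]   # project onto row i
    return x
```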
Endoscopic third ventriculostomy in the treatment of childhood hydrocephalus.
Kulkarni, Abhaya V; Drake, James M; Mallucci, Conor L; Sgouros, Spyros; Roth, Jonathan; Constantini, Shlomi
2009-08-01
To develop a model to predict the probability of endoscopic third ventriculostomy (ETV) success in the treatment for hydrocephalus on the basis of a child's individual characteristics. We analyzed 618 ETVs performed consecutively on children at 12 international institutions to identify predictors of ETV success at 6 months. A multivariable logistic regression model was developed on 70% of the dataset (training set) and validated on 30% of the dataset (validation set). In the training set, 305/455 ETVs (67.0%) were successful. The regression model (containing patient age, cause of hydrocephalus, and previous cerebrospinal fluid shunt) demonstrated good fit (Hosmer-Lemeshow, P = .78) and discrimination (C statistic = 0.70). In the validation set, 105/163 ETVs (64.4%) were successful and the model maintained good fit (Hosmer-Lemeshow, P = .45), discrimination (C statistic = 0.68), and calibration (calibration slope = 0.88). A simplified ETV Success Score was devised that closely approximates the predicted probability of ETV success. Children most likely to succeed with ETV can now be accurately identified and spared the long-term complications of CSF shunting.
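A sketch of the modeling recipe described above, with a hypothetical data layout: a 70/30 split, a logistic model on age group, etiology, and shunt history, and discrimination summarized by the C statistic (ROC AUC).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("etv_cohort.csv")   # hypothetical file and column names
X = pd.get_dummies(df[["age_group", "etiology", "prior_shunt"]])
y = df["etv_success"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"validation C statistic: {auc:.2f}")
```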
Prognostic value of DNA repair based stratification of hepatocellular carcinoma
Lin, Zhuo; Xu, Shi-Hao; Wang, Hai-Qing; Cai, Yi-Jing; Ying, Li; Song, Mei; Wang, Yu-Qun; Du, Shan-Jie; Shi, Ke-Qing; Zhou, Meng-Tao
2016-01-01
Aberrant activation of DNA repair is frequently associated with tumor progression and response to therapy in hepatocellular carcinoma (HCC). Bioinformatics analyses of HCC data in the Cancer Genome Atlas (TCGA) were performed to define a DNA repair based molecular classification that could predict the prognosis of patients with HCC. Furthermore, we tested its predictive performance in 120 independent cases. Four molecular subgroups were identified on the basis of a coordinate DNA repair cluster (CDRC) comprising 15 genes in the TCGA dataset. Increasing expression of CDRC genes was significantly associated with TP53 mutation. High CDRC was significantly correlated with advanced tumor grades, advanced pathological stage and increased vascular invasion rate. Multivariate Cox regression analysis indicated that the molecular subgrouping was an independent prognostic parameter for both overall survival (p = 0.004, hazard ratio (HR): 2.989) and tumor-free survival (p = 0.049, HR: 3.366) in the TCGA dataset. Similar results were also obtained by analyzing the independent cohort. These data suggest that molecular classes based on distinct dysregulation of DNA repair constituents in HCC would be useful for predicting prognosis and designing clinical trials for targeted therapy. PMID:27174663
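A sketch of the survival step using the lifelines package, with invented column names: a multivariate Cox model testing whether high CDRC predicts overall survival after adjusting for stage.

```python
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("hcc_cohort.csv")   # hypothetical: one row per patient
cph = CoxPHFitter()
cph.fit(df[["os_months", "death", "cdrc_high", "stage"]],
        duration_col="os_months", event_col="death")
cph.print_summary()                  # hazard ratios and p-values per covariate
```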
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mitchell, Hugh D.; Eisfeld, Amie J.; Sims, Amy
Respiratory infections stemming from influenza viruses and the Severe Acute Respiratory Syndrome corona virus (SARS-CoV) represent a serious public health threat as emerging pandemics. Despite efforts to identify the critical interactions of these viruses with host machinery, the key regulatory events that lead to disease pathology remain poorly targeted with therapeutics. Here we implement an integrated network interrogation approach, in which proteome and transcriptome datasets from infection of both viruses in human lung epithelial cells are utilized to predict regulatory genes involved in the host response. We take advantage of a novel "crowd-based" approach to identify and combine ranking metrics that isolate genes/proteins likely related to the pathogenicity of SARS-CoV and influenza virus. Subsequently, a multivariate regression model is used to compare predicted lung epithelial regulatory influences with data derived from other respiratory virus infection models. We predicted a small set of regulatory factors with conserved behavior for consideration as important components of viral pathogenesis that might also serve as therapeutic targets for intervention. Our results demonstrate the utility of integrating diverse 'omic datasets to predict and prioritize regulatory features conserved across multiple pathogen infection models.
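The metric-combination step might be approximated, very loosely, by Borda-style rank aggregation; the paper's crowd-based selection and weighting of metrics is richer than this sketch.

```python
import pandas as pd

def consensus_rank(scores: pd.DataFrame) -> pd.Series:
    """scores: rows = genes, columns = ranking metrics (higher = more relevant).
    Returns genes ordered by mean rank across metrics (small = strong consensus)."""
    ranks = scores.rank(ascending=False)
    return ranks.mean(axis=1).sort_values()
```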
Zhang, Wei; Liu, Yuanyuan; Warren, Alan; Xu, Henglong
2014-12-15
The aim of this study is to determine the feasibility of using a small species pool from a raw dataset of biofilm-dwelling ciliates for bioassessment based on taxonomic diversity. Samples were collected monthly at four stations within a gradient of environmental stress in coastal waters of the Yellow Sea, northern China, from August 2011 to July 2012. A 33-species subset was identified from the raw 137-species dataset using a multivariate method. The spatial patterns of this subset were significantly correlated with changes in nutrients and chemical oxygen demand. The taxonomic diversity indices were significantly correlated with nutrients. The pair-wise indices of average taxonomic distinctness (Δ+) and variation in taxonomic distinctness (Λ+) showed a clear departure from the expected taxonomic pattern. These findings suggest that this small ciliate assemblage might be used as an adequate species pool for discriminating water quality status based on taxonomic distinctness in marine ecosystems. Copyright © 2014 Elsevier Ltd. All rights reserved.
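For reference, average taxonomic distinctness (Δ+) is the mean taxonomic path length over all pairs of species present; a minimal sketch given a precomputed pairwise path-length matrix:

```python
import numpy as np

def avg_taxonomic_distinctness(present, dist):
    """present: indices of species observed in a sample;
    dist: symmetric species-by-species taxonomic path-length matrix."""
    idx = np.asarray(present)
    sub = dist[np.ix_(idx, idx)]
    return sub[np.triu_indices(len(idx), k=1)].mean()  # mean over s(s-1)/2 pairs
```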
A Copula-Based Conditional Probabilistic Forecast Model for Wind Power Ramps
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hodge, Brian S; Krishnan, Venkat K; Zhang, Jie
Efficient management of wind ramping characteristics can significantly reduce wind integration costs for balancing authorities. By considering the stochastic dependence of wind power ramp (WPR) features, this paper develops a conditional probabilistic wind power ramp forecast (cp-WPRF) model based on Copula theory. The WPR dataset is constructed by extracting ramps from a large dataset of historical wind power. Each WPR feature (e.g., rate, magnitude, duration, and start-time) is separately forecasted by considering the coupling effects among different ramp features. To accurately model the marginal distributions within a copula, a Gaussian mixture model (GMM) is adopted to characterize the WPR uncertainty and features. The Canonical Maximum Likelihood (CML) method is used to estimate parameters of the multivariate copula. The optimal copula model is chosen based on the Bayesian information criterion (BIC) from each copula family. Finally, the best conditions-based cp-WPRF model is determined by predictive interval (PI) based evaluation metrics. Numerical simulations on publicly available wind power data show that the developed copula-based cp-WPRF model can predict WPRs with a high level of reliability and sharpness.
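A sketch of the two fitting steps named above, GMM marginals plus a Gaussian copula for the dependence between ramp features; the ramp data here are synthetic stand-ins, and the paper's CML estimation and copula-family selection are not reproduced.

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def gmm_cdf(x, gmm):
    """Marginal CDF of a 1-D Gaussian mixture, evaluated at each sample."""
    w = gmm.weights_
    mu = gmm.means_.ravel()
    sd = np.sqrt(gmm.covariances_).ravel()
    return sum(wk * stats.norm.cdf(x, mk, sk) for wk, mk, sk in zip(w, mu, sd))

ramps = np.random.default_rng(0).gamma(2.0, 1.0, (1000, 2))  # stand-in features
gmms = [GaussianMixture(3, random_state=0).fit(ramps[:, [j]]) for j in range(2)]
U = np.column_stack([gmm_cdf(ramps[:, j], gmms[j]) for j in range(2)])
Z = stats.norm.ppf(np.clip(U, 1e-6, 1 - 1e-6))   # probability integral transform
print(np.corrcoef(Z.T))                          # Gaussian-copula dependence
```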
Luan, Xiaoli; Chen, Qiang; Liu, Fei
2014-09-01
This article presents a new scheme to design a full-matrix controller for high-dimensional multivariable processes based on equivalent transfer functions (ETFs). Differing from existing ETF methods, the proposed ETF is derived directly by exploiting the relationship between the equivalent closed-loop transfer function and the inverse of the open-loop transfer function. Based on the obtained ETF, the full-matrix controller is designed using existing PI tuning rules. The newly proposed ETF model can represent the original process more accurately. Furthermore, the full-matrix centralized controller design method proposed in this paper is applicable to high-dimensional multivariable systems with satisfactory performance. Comparison with other multivariable controllers shows that the designed ETF-based controller is superior with respect to design complexity and achieved performance. Copyright © 2014 ISA. Published by Elsevier Ltd. All rights reserved.
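One common inverse-based ETF approximation, shown here at steady state only and not necessarily the paper's exact derivation: the equivalent gain seen by loop (i, j) with the other loops closed is taken as the reciprocal of the (j, i) entry of the inverse gain matrix.

```python
import numpy as np

# Steady-state gain matrix of a classic 2x2 benchmark (Wood-Berry column).
G0 = np.array([[12.8, -18.9],
               [ 6.6, -19.4]])

ETF0 = 1.0 / np.linalg.inv(G0).T   # elementwise: etf_ij = 1 / [inv(G0)]_ji
print(ETF0)                        # equivalent gains fed into PI tuning rules
```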
Yang, James J; Li, Jia; Williams, L Keoki; Buu, Anne
2016-01-05
In genome-wide association studies (GWAS) for complex diseases, the association between a SNP and each phenotype is usually weak. Combining multiple related phenotypic traits can increase the power of gene search and thus is a practically important area that requires methodology work. This study provides a comprehensive review of existing methods for conducting GWAS on complex diseases with multiple phenotypes, including the multivariate analysis of variance (MANOVA), the principal component analysis (PCA), the generalized estimating equations (GEE), the trait-based association test involving the extended Simes procedure (TATES), and the classical Fisher combination test. We propose a new method that relaxes the unrealistic independence assumption of the classical Fisher combination test and is computationally efficient. To demonstrate applications of the proposed method, we also present the results of statistical analysis on the Study of Addiction: Genetics and Environment (SAGE) data. Our simulation study shows that the proposed method has higher power than existing methods while controlling the type I error rate. The GEE and the classical Fisher combination test, on the other hand, do not control the type I error rate and thus are not recommended. In general, the power of the competing methods decreases as the correlation between phenotypes increases. All the methods tend to have lower power when the multivariate phenotypes come from long-tailed distributions. The real data analysis also demonstrates that the proposed method allows us to compare the marginal results with the multivariate results and specify which SNPs are specific to a particular phenotype or contribute to the common construct. The proposed method outperforms existing methods in most settings and also has great applications in GWAS on complex diseases with multiple phenotypes such as substance abuse disorders.
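For contrast, the classical Fisher combination test that the new method relaxes: with k independent p-values, -2 times the sum of their logs is chi-square distributed with 2k degrees of freedom.

```python
import numpy as np
from scipy import stats

p_values = np.array([0.04, 0.20, 0.01])    # per-phenotype p-values at one SNP
fisher_stat = -2.0 * np.log(p_values).sum()
p_combined = stats.chi2.sf(fisher_stat, df=2 * len(p_values))
print(p_combined)
# equivalently: stats.combine_pvalues(p_values, method="fisher")
```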
Cohen, Mitchell J; Grossman, Adam D; Morabito, Diane; Knudson, M Margaret; Butte, Atul J; Manley, Geoffrey T
2010-01-01
Advances in technology have made extensive monitoring of patient physiology the standard of care in intensive care units (ICUs). While many systems exist to compile these data, there has been no systematic multivariate analysis and categorization across patient physiological data. The sheer volume and complexity of these data make pattern recognition or identification of patient state difficult. Hierarchical cluster analysis allows visualization of high dimensional data and enables pattern recognition and identification of physiologic patient states. We hypothesized that processing of multivariate data using hierarchical clustering techniques would allow identification of otherwise hidden patient physiologic patterns that would be predictive of outcome. Multivariate physiologic and ventilator data were collected continuously using a multimodal bioinformatics system in the surgical ICU at San Francisco General Hospital. These data were incorporated with non-continuous data and stored on a server in the ICU. A hierarchical clustering algorithm grouped each minute of data into 1 of 10 clusters. Clusters were correlated with outcome measures including incidence of infection, multiple organ failure (MOF), and mortality. We identified 10 clusters, which we defined as distinct patient states. While patients transitioned between states, they spent significant amounts of time in each. Clusters were enriched for our outcome measures: 2 of the 10 states were enriched for infection, 6 of 10 were enriched for MOF, and 3 of 10 were enriched for death. Further analysis of correlations between pairs of variables within each cluster reveals significant differences in physiology between clusters. Here we show for the first time the feasibility of clustering physiological measurements to identify clinically relevant patient states after trauma. These results demonstrate that hierarchical clustering techniques can be useful for visualizing complex multivariate data and may provide new insights for the care of critically injured patients.
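A sketch of the clustering step on synthetic minute-level data: z-score the physiologic variables, build a Ward linkage, and cut the tree into 10 states as in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

rng = np.random.default_rng(0)
minutes = rng.standard_normal((5000, 12))   # 5000 minutes x 12 variables (synthetic)

Z = linkage(zscore(minutes, axis=0), method="ward")
states = fcluster(Z, t=10, criterion="maxclust")  # each minute -> 1 of 10 states
print(np.bincount(states)[1:])                    # time spent in each state
```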