Microbial bebop: creating music from complex dynamics in microbial ecology.
Larsen, Peter; Gilbert, Jack
2013-01-01
In order for society to make effective policy decisions on complex and far-reaching subjects, such as appropriate responses to global climate change, scientists must effectively communicate complex results to the non-scientifically specialized public. However, there are few ways however to transform highly complicated scientific data into formats that are engaging to the general community. Taking inspiration from patterns observed in nature and from some of the principles of jazz bebop improvisation, we have generated Microbial Bebop, a method by which microbial environmental data are transformed into music. Microbial Bebop uses meter, pitch, duration, and harmony to highlight the relationships between multiple data types in complex biological datasets. We use a comprehensive microbial ecology, time course dataset collected at the L4 marine monitoring station in the Western English Channel as an example of microbial ecological data that can be transformed into music. Four compositions were generated (www.bio.anl.gov/MicrobialBebop.htm.) from L4 Station data using Microbial Bebop. Each composition, though deriving from the same dataset, is created to highlight different relationships between environmental conditions and microbial community structure. The approach presented here can be applied to a wide variety of complex biological datasets.
NASA Astrophysics Data System (ADS)
Lateh, Masitah Abdul; Kamilah Muda, Azah; Yusof, Zeratul Izzah Mohd; Azilah Muda, Noor; Sanusi Azmi, Mohd
2017-09-01
The emerging era of big data for past few years has led to large and complex data which needed faster and better decision making. However, the small dataset problems still arise in a certain area which causes analysis and decision are hard to make. In order to build a prediction model, a large sample is required as a training sample of the model. Small dataset is insufficient to produce an accurate prediction model. This paper will review an artificial data generation approach as one of the solution to solve the small dataset problem.
Urbanowicz, Ryan J; Kiralis, Jeff; Sinnott-Armstrong, Nicholas A; Heberling, Tamra; Fisher, Jonathan M; Moore, Jason H
2012-10-01
Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. However, such methods are computationally expensive, difficult to adapt to multiple objectives, and unlikely to yield models with a precise form of epistasis which we refer to as pure and strict. Purely and strictly epistatic models constitute the worst-case in terms of detecting disease associations, since such associations may only be observed if all n-loci are included in the disease model. This makes them an attractive gold standard for simulation studies considering complex multi-locus effects. We introduce GAMETES, a user-friendly software package and algorithm which generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. GAMETES rapidly and precisely generates random, pure, strict n-locus models with specified genetic constraints. These constraints include heritability, minor allele frequencies of the SNPs, and population prevalence. GAMETES also includes a simple dataset simulation strategy which may be utilized to rapidly generate an archive of simulated datasets for given genetic models. We highlight the utility and limitations of GAMETES with an example simulation study using MDR, an algorithm designed to detect epistasis. GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures. While GAMETES has a limited ability to generate models with higher heritabilities, it is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms. In addition, the GAMETES modeling strategy may be flexibly combined with any dataset simulation strategy. Beyond dataset simulation, GAMETES could be employed to pursue theoretical characterization of genetic models and epistasis.
Evolving hard problems: Generating human genetics datasets with a complex etiology.
Himmelstein, Daniel S; Greene, Casey S; Moore, Jason H
2011-07-07
A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Here we develop and evaluate a model free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight-hundred Pareto fronts; one for each independent run of our algorithm. In each run the predictiveness of single genetic variation and pairs of genetic variants have been minimized, while the predictiveness of third, fourth, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive four or five order interactions and minimized lower-level effects. This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
A photogrammetric technique for generation of an accurate multispectral optical flow dataset
NASA Astrophysics Data System (ADS)
Kniaz, V. V.
2017-06-01
A presence of an accurate dataset is the key requirement for a successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets were developed in recent years and gave rise for many powerful algorithms. However most of the datasets include only images captured in the visible spectrum. This paper is focused on the creation of a multispectral optical flow dataset with an accurate ground truth. The generation of an accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement was developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling or laser scanning. Such techniques are either work only with a synthetic optical flow or provide a sparse ground truth optical flow. In this paper a new photogrammetric method for generation of an accurate ground truth optical flow is proposed. The method combines the benefits of the accuracy and density of a synthetic optical flow datasets with the flexibility of laser scanning based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.
Segmentation of Unstructured Datasets
NASA Technical Reports Server (NTRS)
Bhat, Smitha
1996-01-01
Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.
RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system
Jensen, Tue V.; Pinson, Pierre
2017-01-01
Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset, open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation. PMID:29182600
RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system.
Jensen, Tue V; Pinson, Pierre
2017-11-28
Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset, open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.
RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system
NASA Astrophysics Data System (ADS)
Jensen, Tue V.; Pinson, Pierre
2017-11-01
Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model, as well as information for generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset, open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.
Finding the Maine Story in Hugh Cumbersome National Monitoring Datasets
What’s a manager, analyst, or concerned citizen to do with the complex datasets generated by State and Federal monitoring efforts? Is it possible to use such information to address Maine’s environmental issues without having a degree in informatics and statistics? This presentati...
An interactive web application for the dissemination of human systems immunology data.
Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien
2015-06-19
Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State of the art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets were loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples are displayed dynamically; if desired the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page ( https://gxb.benaroyaresearch.org/dm3/landing.gsp )]. The source code is also available openly [Gene Expression Browser Source Code ( https://github.com/BenaroyaResearch/gxbrowser )]. We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.
Carmen Legaz-García, María Del; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás
2016-06-03
Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources, which makes difficult the integrated exploitation of such data. The Semantic Web paradigm offers a natural technological space for data integration and exploitation by generating content readable by machines. Linked Open Data is a Semantic Web initiative that promotes the publication and sharing of data in machine readable semantic formats. We present an approach for the transformation and integration of heterogeneous biomedical data with the objective of generating open biomedical datasets in Semantic Web formats. The transformation of the data is based on the mappings between the entities of the data schema and the ontological infrastructure that provides the meaning to the content. Our approach permits different types of mappings and includes the possibility of defining complex transformation patterns. Once the mappings are defined, they can be automatically applied to datasets to generate logically consistent content and the mappings can be reused in further transformation processes. The results of our research are (1) a common transformation and integration process for heterogeneous biomedical data; (2) the application of Linked Open Data principles to generate interoperable, open, biomedical datasets; (3) a software tool, called SWIT, that implements the approach. In this paper we also describe how we have applied SWIT in different biomedical scenarios and some lessons learned. We have presented an approach that is able to generate open biomedical repositories in Semantic Web formats. SWIT is able to apply the Linked Open Data principles in the generation of the datasets, so allowing for linking their content to external repositories and creating linked open datasets. SWIT datasets may contain data from multiple sources and schemas, thus becoming integrated datasets.
Viljoen, Nadia M; Joubert, Johan W
2018-02-01
This article presents the multilayered complex network formulation for three different supply chain network archetypes on an urban road grid and describes how 500 instances were randomly generated for each archetype. Both the supply chain network layer and the urban road network layer are directed unweighted networks. The shortest path set is calculated for each of the 1 500 experimental instances. The datasets are used to empirically explore the impact that the supply chain's dependence on the transport network has on its vulnerability in Viljoen and Joubert (2017) [1]. The datasets are publicly available on Mendeley (Joubert and Viljoen, 2017) [2].
2013-01-01
Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies. PMID:24564333
A window-based time series feature extraction method.
Katircioglu-Öztürk, Deniz; Güvenir, H Altay; Ravens, Ursula; Baykal, Nazife
2017-10-01
This study proposes a robust similarity score-based time series feature extraction method that is termed as Window-based Time series Feature ExtraCtion (WTC). Specifically, WTC generates domain-interpretable results and involves significantly low computational complexity thereby rendering itself useful for densely sampled and populated time series datasets. In this study, WTC is applied to a proprietary action potential (AP) time series dataset on human cardiomyocytes and three precordial leads from a publicly available electrocardiogram (ECG) dataset. This is followed by comparing WTC in terms of predictive accuracy and computational complexity with shapelet transform and fast shapelet transform (which constitutes an accelerated variant of the shapelet transform). The results indicate that WTC achieves a slightly higher classification performance with significantly lower execution time when compared to its shapelet-based alternatives. With respect to its interpretable features, WTC has a potential to enable medical experts to explore definitive common trends in novel datasets. Copyright © 2017 Elsevier Ltd. All rights reserved.
Next-Generation Machine Learning for Biological Networks.
Camacho, Diogo M; Collins, Katherine M; Powers, Rani K; Costello, James C; Collins, James J
2018-06-14
Machine learning, a collection of data-analytical techniques aimed at building predictive models from multi-dimensional datasets, is becoming integral to modern biological research. By enabling one to generate models that learn from large datasets and make predictions on likely outcomes, machine learning can be used to study complex cellular systems such as biological networks. Here, we provide a primer on machine learning for life scientists, including an introduction to deep learning. We discuss opportunities and challenges at the intersection of machine learning and network biology, which could impact disease biology, drug discovery, microbiome research, and synthetic biology. Copyright © 2018 Elsevier Inc. All rights reserved.
Data-driven probability concentration and sampling on manifold
DOE Office of Scientific and Technical Information (OSTI.GOV)
Soize, C., E-mail: christian.soize@univ-paris-est.fr; Ghanem, R., E-mail: ghanem@usc.edu
2016-09-15
A new methodology is proposed for generating realizations of a random vector with values in a finite-dimensional Euclidean space that are statistically consistent with a dataset of observations of this vector. The probability distribution of this random vector, while a priori not known, is presumed to be concentrated on an unknown subset of the Euclidean space. A random matrix is introduced whose columns are independent copies of the random vector and for which the number of columns is the number of data points in the dataset. The approach is based on the use of (i) the multidimensional kernel-density estimation methodmore » for estimating the probability distribution of the random matrix, (ii) a MCMC method for generating realizations for the random matrix, (iii) the diffusion-maps approach for discovering and characterizing the geometry and the structure of the dataset, and (iv) a reduced-order representation of the random matrix, which is constructed using the diffusion-maps vectors associated with the first eigenvalues of the transition matrix relative to the given dataset. The convergence aspects of the proposed methodology are analyzed and a numerical validation is explored through three applications of increasing complexity. The proposed method is found to be robust to noise levels and data complexity as well as to the intrinsic dimension of data and the size of experimental datasets. Both the methodology and the underlying mathematical framework presented in this paper contribute new capabilities and perspectives at the interface of uncertainty quantification, statistical data analysis, stochastic modeling and associated statistical inverse problems.« less
Genomics dataset of unidentified disclosed isolates.
Rekadwad, Bhagwan N
2016-09-01
Analysis of DNA sequences is necessary for higher hierarchical classification of the organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset is chosen to find complexities in the unidentified DNA in the disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. The quick response codes were generated. AT/GC content of the DNA sequences analysis was carried out. The QR is helpful for quick identification of isolates. AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset on cleavage code and enzyme code studied under the restriction digestion study, which helpful for performing studies using short DNA sequences was reported. The dataset disclosed here is the new revelatory data for exploration of unique DNA sequences for evaluation, identification, comparison and analysis.
Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples
White, James Robert; Nagarajan, Niranjan; Pop, Mihai
2009-01-01
Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries are computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/. PMID:19360128
NASA Astrophysics Data System (ADS)
Delgado, Juan A.; Altuve, Miguel; Nabhan Homsi, Masun
2015-12-01
This paper introduces a robust method based on the Support Vector Machine (SVM) algorithm to detect the presence of Fetal QRS (fQRS) complexes in electrocardiogram (ECG) recordings provided by the PhysioNet/CinC challenge 2013. ECG signals are first segmented into contiguous frames of 250 ms duration and then labeled in six classes. Fetal segments are tagged according to the position of fQRS complex within each one. Next, segment features extraction and dimensionality reduction are obtained by applying principal component analysis on Haar-wavelet transform. After that, two sub-datasets are generated to separate representative segments from atypical ones. Imbalanced class problem is dealt by applying sampling without replacement on each sub-dataset. Finally, two SVMs are trained and cross-validated using the two balanced sub-datasets separately. Experimental results show that the proposed approach achieves high performance rates in fetal heartbeats detection that reach up to 90.95% of accuracy, 92.16% of sensitivity, 88.51% of specificity, 94.13% of positive predictive value and 84.96% of negative predictive value. A comparative study is also carried out to show the performance of other two machine learning algorithms for fQRS complex estimation, which are K-nearest neighborhood and Bayesian network.
NASA Astrophysics Data System (ADS)
Xu, Xianjin; Yan, Chengfei; Zou, Xiaoqin
2017-08-01
The growing number of protein-ligand complex structures, particularly the structures of proteins co-bound with different ligands, in the Protein Data Bank helps us tackle two major challenges in molecular docking studies: the protein flexibility and the scoring function. Here, we introduced a systematic strategy by using the information embedded in the known protein-ligand complex structures to improve both binding mode and binding affinity predictions. Specifically, a ligand similarity calculation method was employed to search a receptor structure with a bound ligand sharing high similarity with the query ligand for the docking use. The strategy was applied to the two datasets (HSP90 and MAP4K4) in recent D3R Grand Challenge 2015. In addition, for the HSP90 dataset, a system-specific scoring function (ITScore2_hsp90) was generated by recalibrating our statistical potential-based scoring function (ITScore2) using the known protein-ligand complex structures and the statistical mechanics-based iterative method. For the HSP90 dataset, better performances were achieved for both binding mode and binding affinity predictions comparing with the original ITScore2 and with ensemble docking. For the MAP4K4 dataset, although there were only eight known protein-ligand complex structures, our docking strategy achieved a comparable performance with ensemble docking. Our method for receptor conformational selection and iterative method for the development of system-specific statistical potential-based scoring functions can be easily applied to other protein targets that have a number of protein-ligand complex structures available to improve predictions on binding.
Detection of timescales in evolving complex systems
Darst, Richard K.; Granell, Clara; Arenas, Alex; Gómez, Sergio; Saramäki, Jari; Fortunato, Santo
2016-01-01
Most complex systems are intrinsically dynamic in nature. The evolution of a dynamic complex system is typically represented as a sequence of snapshots, where each snapshot describes the configuration of the system at a particular instant of time. This is often done by using constant intervals but a better approach would be to define dynamic intervals that match the evolution of the system’s configuration. To this end, we propose a method that aims at detecting evolutionary changes in the configuration of a complex system, and generates intervals accordingly. We show that evolutionary timescales can be identified by looking for peaks in the similarity between the sets of events on consecutive time intervals of data. Tests on simple toy models reveal that the technique is able to detect evolutionary timescales of time-varying data both when the evolution is smooth as well as when it changes sharply. This is further corroborated by analyses of several real datasets. Our method is scalable to extremely large datasets and is computationally efficient. This allows a quick, parameter-free detection of multiple timescales in the evolution of a complex system. PMID:28004820
2010-01-01
Background The reconstruction of protein complexes from the physical interactome of organisms serves as a building block towards understanding the higher level organization of the cell. Over the past few years, several independent high-throughput experiments have helped to catalogue enormous amount of physical protein interaction data from organisms such as yeast. However, these individual datasets show lack of correlation with each other and also contain substantial number of false positives (noise). Over these years, several affinity scoring schemes have also been devised to improve the qualities of these datasets. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein interaction (PPI) networks derived by combining datasets from multiple sources and by making use of these affinity scoring schemes. In the attempt towards tackling this challenge, the Markov Clustering algorithm (MCL) has proved to be a popular and reasonably successful method, mainly due to its scalability, robustness, and ability to work on scored (weighted) networks. However, MCL produces many noisy clusters, which either do not match known complexes or have additional proteins that reduce the accuracies of correctly predicted complexes. Results Inspired by recent experimental observations by Gavin and colleagues on the modularity structure in yeast complexes and the distinctive properties of "core" and "attachment" proteins, we develop a core-attachment based refinement method coupled to MCL for reconstruction of yeast complexes from scored (weighted) PPI networks. We combine physical interactions from two recent "pull-down" experiments to generate an unscored PPI network. We then score this network using available affinity scoring schemes to generate multiple scored PPI networks. The evaluation of our method (called MCL-CAw) on these networks shows that: (i) MCL-CAw derives larger number of yeast complexes and with better accuracies than MCL, particularly in the presence of natural noise; (ii) Affinity scoring can effectively reduce the impact of noise on MCL-CAw and thereby improve the quality (precision and recall) of its predicted complexes; (iii) MCL-CAw responds well to most available scoring schemes. We discuss several instances where MCL-CAw was successful in deriving meaningful complexes, and where it missed a few proteins or whole complexes due to affinity scoring of the networks. We compare MCL-CAw with several recent complex detection algorithms on unscored and scored networks, and assess the relative performance of the algorithms on these networks. Further, we study the impact of augmenting physical datasets with computationally inferred interactions for complex detection. Finally, we analyse the essentiality of proteins within predicted complexes to understand a possible correlation between protein essentiality and their ability to form complexes. Conclusions We demonstrate that core-attachment based refinement in MCL-CAw improves the predictions of MCL on yeast PPI networks. We show that affinity scoring improves the performance of MCL-CAw. PMID:20939868
KNIME4NGS: a comprehensive toolbox for next generation sequencing analysis.
Hastreiter, Maximilian; Jeske, Tim; Hoser, Jonathan; Kluge, Michael; Ahomaa, Kaarin; Friedl, Marie-Sophie; Kopetzky, Sebastian J; Quell, Jan-Dominik; Mewes, H Werner; Küffner, Robert
2017-05-15
Analysis of Next Generation Sequencing (NGS) data requires the processing of large datasets by chaining various tools with complex input and output formats. In order to automate data analysis, we propose to standardize NGS tasks into modular workflows. This simplifies reliable handling and processing of NGS data, and corresponding solutions become substantially more reproducible and easier to maintain. Here, we present a documented, linux-based, toolbox of 42 processing modules that are combined to construct workflows facilitating a variety of tasks such as DNAseq and RNAseq analysis. We also describe important technical extensions. The high throughput executor (HTE) helps to increase the reliability and to reduce manual interventions when processing complex datasets. We also provide a dedicated binary manager that assists users in obtaining the modules' executables and keeping them up to date. As basis for this actively developed toolbox we use the workflow management software KNIME. See http://ibisngs.github.io/knime4ngs for nodes and user manual (GPLv3 license). robert.kueffner@helmholtz-muenchen.de. Supplementary data are available at Bioinformatics online.
A genetic-algorithm approach for assessing the liquefaction potential of sandy soils
NASA Astrophysics Data System (ADS)
Sen, G.; Akyol, E.
2010-04-01
The determination of liquefaction potential is required to take into account a large number of parameters, which creates a complex nonlinear structure of the liquefaction phenomenon. The conventional methods rely on simple statistical and empirical relations or charts. However, they cannot characterise these complexities. Genetic algorithms are suited to solve these types of problems. A genetic algorithm-based model has been developed to determine the liquefaction potential by confirming Cone Penetration Test datasets derived from case studies of sandy soils. Software has been developed that uses genetic algorithms for the parameter selection and assessment of liquefaction potential. Then several estimation functions for the assessment of a Liquefaction Index have been generated from the dataset. The generated Liquefaction Index estimation functions were evaluated by assessing the training and test data. The suggested formulation estimates the liquefaction occurrence with significant accuracy. Besides, the parametric study on the liquefaction index curves shows a good relation with the physical behaviour. The total number of misestimated cases was only 7.8% for the proposed method, which is quite low when compared to another commonly used method.
Shao, Wenguang; Pedrioli, Patrick G A; Wolski, Witold; Scurtescu, Cristian; Schmid, Emanuel; Courcelles, Mathieu; Schuster, Heiko; Kowalewski, Daniel; Marino, Fabio; Arlehamn, Cecilia S L; Vaughan, Kerrie; Peters, Bjoern; Sette, Alessandro; Ottenhoff, Tom H M; Meijgaarden, Krista E; Nieuwenhuizen, Natalie; Kaufmann, Stefan H E; Schlapbach, Ralph; Castle, John C; Nesvizhskii, Alexey I; Nielsen, Morten; Deutsch, Eric W; Campbell, David S; Moritz, Robert L; Zubarev, Roman A; Ytterberg, Anders Jimmy; Purcell, Anthony W; Marcilla, Miguel; Paradela, Alberto; Wang, Qi; Costello, Catherine E; Ternette, Nicola; van Veelen, Peter A; van Els, Cécile A C M; de Souza, Gustavo A; Sollid, Ludvig M; Admon, Arie; Stevanovic, Stefan; Rammensee, Hans-Georg; Thibault, Pierre; Perreault, Claude; Bassani-Sternberg, Michal
2018-01-01
Abstract Mass spectrometry (MS)-based immunopeptidomics investigates the repertoire of peptides presented at the cell surface by major histocompatibility complex (MHC) molecules. The broad clinical relevance of MHC-associated peptides, e.g. in precision medicine, provides a strong rationale for the large-scale generation of immunopeptidomic datasets and recent developments in MS-based peptide analysis technologies now support the generation of the required data. Importantly, the availability of diverse immunopeptidomic datasets has resulted in an increasing need to standardize, store and exchange this type of data to enable better collaborations among researchers, to advance the field more efficiently and to establish quality measures required for the meaningful comparison of datasets. Here we present the SysteMHC Atlas (https://systemhcatlas.org), a public database that aims at collecting, organizing, sharing, visualizing and exploring immunopeptidomic data generated by MS. The Atlas includes raw mass spectrometer output files collected from several laboratories around the globe, a catalog of context-specific datasets of MHC class I and class II peptides, standardized MHC allele-specific peptide spectral libraries consisting of consensus spectra calculated from repeat measurements of the same peptide sequence, and links to other proteomics and immunology databases. The SysteMHC Atlas project was created and will be further expanded using a uniform and open computational pipeline that controls the quality of peptide identifications and peptide annotations. Thus, the SysteMHC Atlas disseminates quality controlled immunopeptidomic information to the public domain and serves as a community resource toward the generation of a high-quality comprehensive map of the human immunopeptidome and the support of consistent measurement of immunopeptidomic sample cohorts. PMID:28985418
Fast, Automated, Scalable Generation of Textured 3D Models of Indoor Environments
2014-12-18
expensive travel and on-site visits. Different applications require models of different complexities, both with and without furniture geometry. The...environment and to localize the system in the environment over time. The datasets shown in this paper were generated by a backpack -mounted system that uses 2D...voxel is found to intersect the line segment from a scanner to a corresponding scan point. If a laser passes through a voxel, that voxel is considered
Bessonov, Kyrylo; Walkey, Christopher J.; Shelp, Barry J.; van Vuuren, Hennie J. J.; Chiu, David; van der Merwe, George
2013-01-01
Analyzing time-course expression data captured in microarray datasets is a complex undertaking as the vast and complex data space is represented by a relatively low number of samples as compared to thousands of available genes. Here, we developed the Interdependent Correlation Clustering (ICC) method to analyze relationships that exist among genes conditioned on the expression of a specific target gene in microarray data. Based on Correlation Clustering, the ICC method analyzes a large set of correlation values related to gene expression profiles extracted from given microarray datasets. ICC can be applied to any microarray dataset and any target gene. We applied this method to microarray data generated from wine fermentations and selected NSF1, which encodes a C2H2 zinc finger-type transcription factor, as the target gene. The validity of the method was verified by accurate identifications of the previously known functional roles of NSF1. In addition, we identified and verified potential new functions for this gene; specifically, NSF1 is a negative regulator for the expression of sulfur metabolism genes, the nuclear localization of Nsf1 protein (Nsf1p) is controlled in a sulfur-dependent manner, and the transcription of NSF1 is regulated by Met4p, an important transcriptional activator of sulfur metabolism genes. The inter-disciplinary approach adopted here highlighted the accuracy and relevancy of the ICC method in mining for novel gene functions using complex microarray datasets with a limited number of samples. PMID:24130853
Glycan array data management at Consortium for Functional Glycomics.
Venkataraman, Maha; Sasisekharan, Ram; Raman, Rahul
2015-01-01
Glycomics or the study of structure-function relationships of complex glycans has reshaped post-genomics biology. Glycans mediate fundamental biological functions via their specific interactions with a variety of proteins. Recognizing the importance of glycomics, large-scale research initiatives such as the Consortium for Functional Glycomics (CFG) were established to address these challenges. Over the past decade, the Consortium for Functional Glycomics (CFG) has generated novel reagents and technologies for glycomics analyses, which in turn have led to generation of diverse datasets. These datasets have contributed to understanding glycan diversity and structure-function relationships at molecular (glycan-protein interactions), cellular (gene expression and glycan analysis), and whole organism (mouse phenotyping) levels. Among these analyses and datasets, screening of glycan-protein interactions on glycan array platforms has gained much prominence and has contributed to cross-disciplinary realization of the importance of glycomics in areas such as immunology, infectious diseases, cancer biomarkers, etc. This manuscript outlines methodologies for capturing data from glycan array experiments and online tools to access and visualize glycan array data implemented at the CFG.
Topological data analysis of Escherichia coli O157:H7 and non-O157 survival in soils
USDA-ARS?s Scientific Manuscript database
Shiga toxin-producing E. coli O157:H7 and non-O157 have been implicated in many foodborne illnesses caused by the consumption of contaminated fresh produce. However, data on their persistence in major fresh produce-growing soils are limited due to the complexity in datasets generated from different ...
State-of-the-Art Fusion-Finder Algorithms Sensitivity and Specificity
Carrara, Matteo; Beccuti, Marco; Lazzarato, Fulvio; Cavallo, Federica; Cordero, Francesca; Donatelli, Susanna; Calogero, Raffaele A.
2013-01-01
Background. Gene fusions arising from chromosomal translocations have been implicated in cancer. RNA-seq has the potential to discover such rearrangements generating functional proteins (chimera/fusion). Recently, many methods for chimeras detection have been published. However, specificity and sensitivity of those tools were not extensively investigated in a comparative way. Results. We tested eight fusion-detection tools (FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse, Bellerophontes, ChimeraScan, and TopHat-fusion) to detect fusion events using synthetic and real datasets encompassing chimeras. The comparison analysis run only on synthetic data could generate misleading results since we found no counterpart on real dataset. Furthermore, most tools report a very high number of false positive chimeras. In particular, the most sensitive tool, ChimeraScan, reports a large number of false positives that we were able to significantly reduce by devising and applying two filters to remove fusions not supported by fusion junction-spanning reads or encompassing large intronic regions. Conclusions. The discordant results obtained using synthetic and real datasets suggest that synthetic datasets encompassing fusion events may not fully catch the complexity of RNA-seq experiment. Moreover, fusion detection tools are still limited in sensitivity or specificity; thus, there is space for further improvement in the fusion-finder algorithms. PMID:23555082
Özgür, Arzucan; Hur, Junguk; He, Yongqun
2016-01-01
The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination. This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords co-existing in one sentence to represent specific INO interaction classes. Such keyword combinations and related INO interaction type information could be automatically obtained via SPARQL queries, formatted in Excel format, and used in an INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used benchmark Learning Logic in Language (LLL) dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type. The INO ontology currently has 575 terms including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations: 'has literature mining keywords' and 'has keyword dependency pattern'. The keyword dependency patterns were generated via running the Stanford Parser to obtain dependency relation types. Out of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 were identified by using the direct dependency relations. The LLL dataset contained 34 gene regulation interaction types, each of which associated with multiple keywords. A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. The phenomenon of having multi-keyword interaction types was also frequently observed in the vaccine dataset. By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords.
Web-based Toolkit for Dynamic Generation of Data Processors
NASA Astrophysics Data System (ADS)
Patel, J.; Dascalu, S.; Harris, F. C.; Benedict, K. K.; Gollberg, G.; Sheneman, L.
2011-12-01
All computation-intensive scientific research uses structured datasets, including hydrology and all other types of climate-related research. When it comes to testing their hypotheses, researchers might use the same dataset differently, and modify, transform, or convert it to meet their research needs. Currently, many researchers spend a good amount of time performing data processing and building tools to speed up this process. They might routinely repeat the same process activities for new research projects, spending precious time that otherwise could be dedicated to analyzing and interpreting the data. Numerous tools are available to run tests on prepared datasets and many of them work with datasets in different formats. However, there is still a significant need for applications that can comprehensively handle data transformation and conversion activities and help prepare the various processed datasets required by the researchers. We propose a web-based application (a software toolkit) that dynamically generates data processors capable of performing data conversions, transformations, and customizations based on user-defined mappings and selections. As a first step, the proposed solution allows the users to define various data structures and, in the next step, can select various file formats and data conversions for their datasets of interest. In a simple scenario, the core of the proposed web-based toolkit allows the users to define direct mappings between input and output data structures. The toolkit will also support defining complex mappings involving the use of pre-defined sets of mathematical, statistical, date/time, and text manipulation functions. Furthermore, the users will be allowed to define logical cases for input data filtering and sampling. At the end of the process, the toolkit is designed to generate reusable source code and executable binary files for download and use by the scientists. The application is also designed to store all data structures and mappings defined by a user (an author), and allow the original author to modify them using standard authoring techniques. The users can change or define new mappings to create new data processors for download and use. In essence, when executed, the generated data processor binary file can take an input data file in a given format and output this data, possibly transformed, in a different file format. If they so desire, the users will be able modify directly the source code in order to define more complex mappings and transformations that are not currently supported by the toolkit. Initially aimed at supporting research in hydrology, the toolkit's functions and features can be either directly used or easily extended to other areas of climate-related research. The proposed web-based data processing toolkit will be able to generate various custom software processors for data conversion and transformation in a matter of seconds or minutes, saving a significant amount of researchers' time and allowing them to focus on core research issues.
Hypergraph Based Feature Selection Technique for Medical Diagnosis.
Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar
2016-11-01
The impact of internet and information systems across various domains have resulted in substantial generation of multidimensional datasets. The use of data mining and knowledge discovery techniques to extract the original information contained in the multidimensional datasets play a significant role in the exploitation of complete benefit provided by them. The presence of large number of features in the high dimensional datasets incurs high computational cost in terms of computing power and time. Hence, feature selection technique has been commonly used to build robust machine learning models to select a subset of relevant features which projects the maximal information content of the original dataset. In this paper, a novel Rough Set based K - Helly feature selection technique (RSKHT) which hybridize Rough Set Theory (RST) and K - Helly property of hypergraph representation had been designed to identify the optimal feature subset or reduct for medical diagnostic applications. Experiments carried out using the medical datasets from the UCI repository proves the dominance of the RSKHT over other feature selection techniques with respect to the reduct size, classification accuracy and time complexity. The performance of the RSKHT had been validated using WEKA tool, which shows that RSKHT had been computationally attractive and flexible over massive datasets.
Shao, Wenguang; Pedrioli, Patrick G A; Wolski, Witold; Scurtescu, Cristian; Schmid, Emanuel; Vizcaíno, Juan A; Courcelles, Mathieu; Schuster, Heiko; Kowalewski, Daniel; Marino, Fabio; Arlehamn, Cecilia S L; Vaughan, Kerrie; Peters, Bjoern; Sette, Alessandro; Ottenhoff, Tom H M; Meijgaarden, Krista E; Nieuwenhuizen, Natalie; Kaufmann, Stefan H E; Schlapbach, Ralph; Castle, John C; Nesvizhskii, Alexey I; Nielsen, Morten; Deutsch, Eric W; Campbell, David S; Moritz, Robert L; Zubarev, Roman A; Ytterberg, Anders Jimmy; Purcell, Anthony W; Marcilla, Miguel; Paradela, Alberto; Wang, Qi; Costello, Catherine E; Ternette, Nicola; van Veelen, Peter A; van Els, Cécile A C M; Heck, Albert J R; de Souza, Gustavo A; Sollid, Ludvig M; Admon, Arie; Stevanovic, Stefan; Rammensee, Hans-Georg; Thibault, Pierre; Perreault, Claude; Bassani-Sternberg, Michal; Aebersold, Ruedi; Caron, Etienne
2018-01-04
Mass spectrometry (MS)-based immunopeptidomics investigates the repertoire of peptides presented at the cell surface by major histocompatibility complex (MHC) molecules. The broad clinical relevance of MHC-associated peptides, e.g. in precision medicine, provides a strong rationale for the large-scale generation of immunopeptidomic datasets and recent developments in MS-based peptide analysis technologies now support the generation of the required data. Importantly, the availability of diverse immunopeptidomic datasets has resulted in an increasing need to standardize, store and exchange this type of data to enable better collaborations among researchers, to advance the field more efficiently and to establish quality measures required for the meaningful comparison of datasets. Here we present the SysteMHC Atlas (https://systemhcatlas.org), a public database that aims at collecting, organizing, sharing, visualizing and exploring immunopeptidomic data generated by MS. The Atlas includes raw mass spectrometer output files collected from several laboratories around the globe, a catalog of context-specific datasets of MHC class I and class II peptides, standardized MHC allele-specific peptide spectral libraries consisting of consensus spectra calculated from repeat measurements of the same peptide sequence, and links to other proteomics and immunology databases. The SysteMHC Atlas project was created and will be further expanded using a uniform and open computational pipeline that controls the quality of peptide identifications and peptide annotations. Thus, the SysteMHC Atlas disseminates quality controlled immunopeptidomic information to the public domain and serves as a community resource toward the generation of a high-quality comprehensive map of the human immunopeptidome and the support of consistent measurement of immunopeptidomic sample cohorts. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Ray Meta: scalable de novo metagenome assembly and profiling
2012-01-01
Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights for specific environments. Ray Meta is open source and available at http://denovoassembler.sf.net. PMID:23259615
Wang, Jian; Anania, Veronica G.; Knott, Jeff; Rush, John; Lill, Jennie R.; Bourne, Philip E.; Bandeira, Nuno
2014-01-01
The combination of chemical cross-linking and mass spectrometry has recently been shown to constitute a powerful tool for studying protein–protein interactions and elucidating the structure of large protein complexes. However, computational methods for interpreting the complex MS/MS spectra from linked peptides are still in their infancy, making the high-throughput application of this approach largely impractical. Because of the lack of large annotated datasets, most current approaches do not capture the specific fragmentation patterns of linked peptides and therefore are not optimal for the identification of cross-linked peptides. Here we propose a generic approach to address this problem and demonstrate it using disulfide-bridged peptide libraries to (i) efficiently generate large mass spectral reference data for linked peptides at a low cost and (ii) automatically train an algorithm that can efficiently and accurately identify linked peptides from MS/MS spectra. We show that using this approach we were able to identify thousands of MS/MS spectra from disulfide-bridged peptides through comparison with proteome-scale sequence databases and significantly improve the sensitivity of cross-linked peptide identification. This allowed us to identify 60% more direct pairwise interactions between the protein subunits in the 20S proteasome complex than existing tools on cross-linking studies of the proteasome complexes. The basic framework of this approach and the MS/MS reference dataset generated should be valuable resources for the future development of new tools for the identification of linked peptides. PMID:24493012
Application of multivariate statistical techniques in microbial ecology
Paliy, O.; Shankar, V.
2016-01-01
Recent advances in high-throughput methods of molecular analyses have led to an explosion of studies generating large scale ecological datasets. Especially noticeable effect has been attained in the field of microbial ecology, where new experimental approaches provided in-depth assessments of the composition, functions, and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces large amounts of data, powerful statistical techniques of multivariate analysis are well suited to analyze and interpret these datasets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular dataset. In this review we describe and compare the most widely used multivariate statistical techniques including exploratory, interpretive, and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and dataset structure. PMID:26786791
Joint Blind Source Separation by Multi-set Canonical Correlation Analysis
Li, Yi-Ou; Adalı, Tülay; Wang, Wei; Calhoun, Vince D
2009-01-01
In this work, we introduce a simple and effective scheme to achieve joint blind source separation (BSS) of multiple datasets using multi-set canonical correlation analysis (M-CCA) [1]. We first propose a generative model of joint BSS based on the correlation of latent sources within and between datasets. We specify source separability conditions, and show that, when the conditions are satisfied, the group of corresponding sources from each dataset can be jointly extracted by M-CCA through maximization of correlation among the extracted sources. We compare source separation performance of the M-CCA scheme with other joint BSS methods and demonstrate the superior performance of the M-CCA scheme in achieving joint BSS for a large number of datasets, group of corresponding sources with heterogeneous correlation values, and complex-valued sources with circular and non-circular distributions. We apply M-CCA to analysis of functional magnetic resonance imaging (fMRI) data from multiple subjects and show its utility in estimating meaningful brain activations from a visuomotor task. PMID:20221319
Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments.
Ionescu, Catalin; Papava, Dragos; Olaru, Vlad; Sminchisescu, Cristian
2014-07-01
We introduce a new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset, is substantially vaster and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m.
Exploratory Climate Data Visualization and Analysis Using DV3D and UVCDAT
NASA Technical Reports Server (NTRS)
Maxwell, Thomas
2012-01-01
Earth system scientists are being inundated by an explosion of data generated by ever-increasing resolution in both global models and remote sensors. Advanced tools for accessing, analyzing, and visualizing very large and complex climate data are required to maintain rapid progress in Earth system research. To meet this need, NASA, in collaboration with the Ultra-scale Visualization Climate Data Analysis Tools (UVCOAT) consortium, is developing exploratory climate data analysis and visualization tools which provide data analysis capabilities for the Earth System Grid (ESG). This paper describes DV3D, a UV-COAT package that enables exploratory analysis of climate simulation and observation datasets. OV3D provides user-friendly interfaces for visualization and analysis of climate data at a level appropriate for scientists. It features workflow inte rfaces, interactive 40 data exploration, hyperwall and stereo visualization, automated provenance generation, and parallel task execution. DV30's integration with CDAT's climate data management system (COMS) and other climate data analysis tools provides a wide range of high performance climate data analysis operations. DV3D expands the scientists' toolbox by incorporating a suite of rich new exploratory visualization and analysis methods for addressing the complexity of climate datasets.
Dreyer, Florian S; Cantone, Martina; Eberhardt, Martin; Jaitly, Tanushree; Walter, Lisa; Wittmann, Jürgen; Gupta, Shailendra K; Khan, Faiz M; Wolkenhauer, Olaf; Pützer, Brigitte M; Jäck, Hans-Martin; Heinzerling, Lucie; Vera, Julio
2018-06-01
Cellular phenotypes are established and controlled by complex and precisely orchestrated molecular networks. In cancer, mutations and dysregulations of multiple molecular factors perturb the regulation of these networks and lead to malignant transformation. High-throughput technologies are a valuable source of information to establish the complex molecular relationships behind the emergence of malignancy, but full exploitation of this massive amount of data requires bioinformatics tools that rely on network-based analyses. In this report we present the Virtual Melanoma Cell, an online tool developed to facilitate the mining and interpretation of high-throughput data on melanoma by biomedical researches. The platform is based on a comprehensive, manually generated and expert-validated regulatory map composed of signaling pathways important in malignant melanoma. The Virtual Melanoma Cell is a tool designed to accept, visualize and analyze user-generated datasets. It is available at: https://www.vcells.net/melanoma. To illustrate the utilization of the web platform and the regulatory map, we have analyzed a large publicly available dataset accounting for anti-PD1 immunotherapy treatment of malignant melanoma patients. Copyright © 2018 Elsevier B.V. All rights reserved.
User's Guide for the Agricultural Non-Point Source (AGNPS) Pollution Model Data Generator
Finn, Michael P.; Scheidt, Douglas J.; Jaromack, Gregory M.
2003-01-01
BACKGROUND Throughout this user guide, we refer to datasets that we used in conjunction with developing of this software for supporting cartographic research and producing the datasets to conduct research. However, this software can be used with these datasets or with more 'generic' versions of data of the appropriate type. For example, throughout the guide, we refer to national land cover data (NLCD) and digital elevation model (DEM) data from the U.S. Geological Survey (USGS) at a 30-m resolution, but any digital terrain model or land cover data at any appropriate resolution will produce results. Another key point to keep in mind is to use a consistent data resolution for all the datasets per model run. The U.S. Department of Agriculture (USDA) developed the Agricultural Nonpoint Source (AGNPS) pollution model of watershed hydrology in response to the complex problem of managing nonpoint sources of pollution. AGNPS simulates the behavior of runoff, sediment, and nutrient transport from watersheds that have agriculture as their prime use. The model operates on a cell basis and is a distributed parameter, event-based model. The model requires 22 input parameters. Output parameters are grouped primarily by hydrology, sediment, and chemical output (Young and others, 1995.) Elevation, land cover, and soil are the base data from which to extract the 22 input parameters required by the AGNPS. For automatic parameter extraction, follow the general process described in this guide of extraction from the geospatial data through the AGNPS Data Generator to generate input parameters required by the pollution model (Finn and others, 2002.)
Ciric, Milica; Moon, Christina D; Leahy, Sinead C; Creevey, Christopher J; Altermann, Eric; Attwood, Graeme T; Rakonjac, Jasna; Gagic, Dragana
2014-05-12
In silico, secretome proteins can be predicted from completely sequenced genomes using various available algorithms that identify membrane-targeting sequences. For metasecretome (collection of surface, secreted and transmembrane proteins from environmental microbial communities) this approach is impractical, considering that the metasecretome open reading frames (ORFs) comprise only 10% to 30% of total metagenome, and are poorly represented in the dataset due to overall low coverage of metagenomic gene pool, even in large-scale projects. By combining secretome-selective phage display and next-generation sequencing, we focused the sequence analysis of complex rumen microbial community on the metasecretome component of the metagenome. This approach achieved high enrichment (29 fold) of secreted fibrolytic enzymes from the plant-adherent microbial community of the bovine rumen. In particular, we identified hundreds of heretofore rare modules belonging to cellulosomes, cell-surface complexes specialised for recognition and degradation of the plant fibre. As a method, metasecretome phage display combined with next-generation sequencing has a power to sample the diversity of low-abundance surface and secreted proteins that would otherwise require exceptionally large metagenomic sequencing projects. As a resource, metasecretome display library backed by the dataset obtained by next-generation sequencing is ready for i) affinity selection by standard phage display methodology and ii) easy purification of displayed proteins as part of the virion for individual functional analysis.
Legaz-García, María del Carmen; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás
2015-01-01
Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources. Such heterogeneity makes difficult not only the generation of research-oriented dataset but also its exploitation. In recent years, the Open Data paradigm has proposed new ways for making data available in ways that sharing and integration are facilitated. Open Data approaches may pursue the generation of content readable only by humans and by both humans and machines, which are the ones of interest in our work. The Semantic Web provides a natural technological space for data integration and exploitation and offers a range of technologies for generating not only Open Datasets but also Linked Datasets, that is, open datasets linked to other open datasets. According to the Berners-Lee's classification, each open dataset can be given a rating between one and five stars attending to can be given to each dataset. In the last years, we have developed and applied our SWIT tool, which automates the generation of semantic datasets from heterogeneous data sources. SWIT produces four stars datasets, given that fifth one can be obtained by being the dataset linked from external ones. In this paper, we describe how we have applied the tool in two projects related to health care records and orthology data, as well as the major lessons learned from such efforts.
Spinelli, Lionel; Carpentier, Sabrina; Montañana Sanchis, Frédéric; Dalod, Marc; Vu Manh, Thien-Phong
2015-10-19
Recent advances in the analysis of high-throughput expression data have led to the development of tools that scaled-up their focus from single-gene to gene set level. For example, the popular Gene Set Enrichment Analysis (GSEA) algorithm can detect moderate but coordinated expression changes of groups of presumably related genes between pairs of experimental conditions. This considerably improves extraction of information from high-throughput gene expression data. However, although many gene sets covering a large panel of biological fields are available in public databases, the ability to generate home-made gene sets relevant to one's biological question is crucial but remains a substantial challenge to most biologists lacking statistic or bioinformatic expertise. This is all the more the case when attempting to define a gene set specific of one condition compared to many other ones. Thus, there is a crucial need for an easy-to-use software for generation of relevant home-made gene sets from complex datasets, their use in GSEA, and the correction of the results when applied to multiple comparisons of many experimental conditions. We developed BubbleGUM (GSEA Unlimited Map), a tool that allows to automatically extract molecular signatures from transcriptomic data and perform exhaustive GSEA with multiple testing correction. One original feature of BubbleGUM notably resides in its capacity to integrate and compare numerous GSEA results into an easy-to-grasp graphical representation. We applied our method to generate transcriptomic fingerprints for murine cell types and to assess their enrichments in human cell types. This analysis allowed us to confirm homologies between mouse and human immunocytes. BubbleGUM is an open-source software that allows to automatically generate molecular signatures out of complex expression datasets and to assess directly their enrichment by GSEA on independent datasets. Enrichments are displayed in a graphical output that helps interpreting the results. This innovative methodology has recently been used to answer important questions in functional genomics, such as the degree of similarities between microarray datasets from different laboratories or with different experimental models or clinical cohorts. BubbleGUM is executable through an intuitive interface so that both bioinformaticians and biologists can use it. It is available at http://www.ciml.univ-mrs.fr/applications/BubbleGUM/index.html .
Residential load and rooftop PV generation: an Australian distribution network dataset
NASA Astrophysics Data System (ADS)
Ratnam, Elizabeth L.; Weller, Steven R.; Kellett, Christopher M.; Murray, Alan T.
2017-09-01
Despite the rapid uptake of small-scale solar photovoltaic (PV) systems in recent years, public availability of generation and load data at the household level remains very limited. Moreover, such data are typically measured using bi-directional meters recording only PV generation in excess of residential load rather than recording generation and load separately. In this paper, we report a publicly available dataset consisting of load and rooftop PV generation for 300 de-identified residential customers in an Australian distribution network, with load centres covering metropolitan Sydney and surrounding regional areas. The dataset spans a 3-year period, with separately reported measurements of load and PV generation at 30-min intervals. Following a detailed description of the dataset, we identify several means by which anomalous records (e.g. due to inverter failure) are identified and excised. With the resulting 'clean' dataset, we identify key customer-specific and aggregated characteristics of rooftop PV generation and residential load.
Pathway-based personalized analysis of cancer
Drier, Yotam; Sheffer, Michal; Domany, Eytan
2013-01-01
We introduce Pathifier, an algorithm that infers pathway deregulation scores for each tumor sample on the basis of expression data. This score is determined, in a context-specific manner, for every particular dataset and type of cancer that is being investigated. The algorithm transforms gene-level information into pathway-level information, generating a compact and biologically relevant representation of each sample. We demonstrate the algorithm’s performance on three colorectal cancer datasets and two glioblastoma multiforme datasets and show that our multipathway-based representation is reproducible, preserves much of the original information, and allows inference of complex biologically significant information. We discovered several pathways that were significantly associated with survival of glioblastoma patients and two whose scores are predictive of survival in colorectal cancer: CXCR3-mediated signaling and oxidative phosphorylation. We also identified a subclass of proneural and neural glioblastoma with significantly better survival, and an EGF receptor-deregulated subclass of colon cancers. PMID:23547110
Bayesian correlated clustering to integrate multiple datasets
Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.
2012-01-01
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558
Method of generating features optimal to a dataset and classifier
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bruillard, Paul J.; Gosink, Luke J.; Jarman, Kenneth D.
A method of generating features optimal to a particular dataset and classifier is disclosed. A dataset of messages is inputted and a classifier is selected. An algebra of features is encoded. Computable features that are capable of describing the dataset from the algebra of features are selected. Irredundant features that are optimal for the classifier and the dataset are selected.
GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.
Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung
2015-07-02
A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of data, it is difficult to predict outcomes from it. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a "data modeler" tool. The proposed tool implements user-centric priority based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces 94.1% time efforts of the experts and knowledge engineer while creating unified datasets.
GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare
Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung
2015-01-01
A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of data, it is difficult to predict outcomes from it. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a “data modeler” tool. The proposed tool implements user-centric priority based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces 94.1% time efforts of the experts and knowledge engineer while creating unified datasets. PMID:26147731
Spear, Timothy T; Nishimura, Michael I; Simms, Patricia E
2017-08-01
Advancement in flow cytometry reagents and instrumentation has allowed for simultaneous analysis of large numbers of lineage/functional immune cell markers. Highly complex datasets generated by polychromatic flow cytometry require proper analytical software to answer investigators' questions. A problem among many investigators and flow cytometry Shared Resource Laboratories (SRLs), including our own, is a lack of access to a flow cytometry-knowledgeable bioinformatics team, making it difficult to learn and choose appropriate analysis tool(s). Here, we comparatively assess various multidimensional flow cytometry software packages for their ability to answer a specific biologic question and provide graphical representation output suitable for publication, as well as their ease of use and cost. We assessed polyfunctional potential of TCR-transduced T cells, serving as a model evaluation, using multidimensional flow cytometry to analyze 6 intracellular cytokines and degranulation on a per-cell basis. Analysis of 7 parameters resulted in 128 possible combinations of positivity/negativity, far too complex for basic flow cytometry software to analyze fully. Various software packages were used, analysis methods used in each described, and representative output displayed. Of the tools investigated, automated classification of cellular expression by nonlinear stochastic embedding (ACCENSE) and coupled analysis in Pestle/simplified presentation of incredibly complex evaluations (SPICE) provided the most user-friendly manipulations and readable output, evaluating effects of altered antigen-specific stimulation on T cell polyfunctionality. This detailed approach may serve as a model for other investigators/SRLs in selecting the most appropriate software to analyze complex flow cytometry datasets. Further development and awareness of available tools will help guide proper data analysis to answer difficult biologic questions arising from incredibly complex datasets. © Society for Leukocyte Biology.
Hierarchical storage of large volume of multidector CT data using distributed servers
NASA Astrophysics Data System (ADS)
Ratib, Osman; Rosset, Antoine; Heuberger, Joris; Bandon, David
2006-03-01
Multidector scanners and hybrid multimodality scanners have the ability to generate large number of high-resolution images resulting in very large data sets. In most cases, these datasets are generated for the sole purpose of generating secondary processed images and 3D rendered images as well as oblique and curved multiplanar reformatted images. It is therefore not essential to archive the original images after they have been processed. We have developed an architecture of distributed archive servers for temporary storage of large image datasets for 3D rendering and image processing without the need for long term storage in PACS archive. With the relatively low cost of storage devices it is possible to configure these servers to hold several months or even years of data, long enough for allowing subsequent re-processing if required by specific clinical situations. We tested the latest generation of RAID servers provided by Apple computers with a capacity of 5 TBytes. We implemented a peer-to-peer data access software based on our Open-Source image management software called OsiriX, allowing remote workstations to directly access DICOM image files located on the server through a new technology called "bonjour". This architecture offers a seamless integration of multiple servers and workstations without the need for central database or complex workflow management tools. It allows efficient access to image data from multiple workstation for image analysis and visualization without the need for image data transfer. It provides a convenient alternative to centralized PACS architecture while avoiding complex and time-consuming data transfer and storage.
Svoboda, David; Ulman, Vladimir
2017-01-01
The proper analysis of biological microscopy images is an important and complex task. Therefore, it requires verification of all steps involved in the process, including image segmentation and tracking algorithms. It is generally better to verify algorithms with computer-generated ground truth datasets, which, compared to manually annotated data, nowadays have reached high quality and can be produced in large quantities even for 3D time-lapse image sequences. Here, we propose a novel framework, called MitoGen, which is capable of generating ground truth datasets with fully 3D time-lapse sequences of synthetic fluorescence-stained cell populations. MitoGen shows biologically justified cell motility, shape and texture changes as well as cell divisions. Standard fluorescence microscopy phenomena such as photobleaching, blur with real point spread function (PSF), and several types of noise, are simulated to obtain realistic images. The MitoGen framework is scalable in both space and time. MitoGen generates visually plausible data that shows good agreement with real data in terms of image descriptors and mean square displacement (MSD) trajectory analysis. Additionally, it is also shown in this paper that four publicly available segmentation and tracking algorithms exhibit similar performance on both real and MitoGen-generated data. The implementation of MitoGen is freely available.
A Python Geospatial Language Toolkit
NASA Astrophysics Data System (ADS)
Fillmore, D.; Pletzer, A.; Galloy, M.
2012-12-01
The volume and scope of geospatial data archives, such as collections of satellite remote sensing or climate model products, has been rapidly increasing and will continue to do so in the near future. The recently launched (October 2011) Suomi National Polar-orbiting Partnership satellite (NPP) for instance, is the first of a new generation of Earth observation platforms that will monitor the atmosphere, oceans, and ecosystems, and its suite of instruments will generate several terabytes each day in the form of multi-spectral images and derived datasets. Full exploitation of such data for scientific analysis and decision support applications has become a major computational challenge. Geophysical data exploration and knowledge discovery could benefit, in particular, from intelligent mechanisms for extracting and manipulating subsets of data relevant to the problem of interest. Potential developments include enhanced support for natural language queries and directives to geospatial datasets. The translation of natural language (that is, human spoken or written phrases) into complex but unambiguous objects and actions can be based on a context, or knowledge domain, that represents the underlying geospatial concepts. This poster describes a prototype Python module that maps English phrases onto basic geospatial objects and operations. This module, along with the associated computational geometry methods, enables the resolution of natural language directives that include geographic regions of arbitrary shape and complexity.
A Virtual Reality Visualization Tool for Neuron Tracing
Usher, Will; Klacansky, Pavol; Federer, Frederick; Bremer, Peer-Timo; Knoll, Aaron; Angelucci, Alessandra; Pascucci, Valerio
2017-01-01
Tracing neurons in large-scale microscopy data is crucial to establishing a wiring diagram of the brain, which is needed to understand how neural circuits in the brain process information and generate behavior. Automatic techniques often fail for large and complex datasets, and connectomics researchers may spend weeks or months manually tracing neurons using 2D image stacks. We present a design study of a new virtual reality (VR) system, developed in collaboration with trained neuroanatomists, to trace neurons in microscope scans of the visual cortex of primates. We hypothesize that using consumer-grade VR technology to interact with neurons directly in 3D will help neuroscientists better resolve complex cases and enable them to trace neurons faster and with less physical and mental strain. We discuss both the design process and technical challenges in developing an interactive system to navigate and manipulate terabyte-sized image volumes in VR. Using a number of different datasets, we demonstrate that, compared to widely used commercial software, consumer-grade VR presents a promising alternative for scientists. PMID:28866520
A Virtual Reality Visualization Tool for Neuron Tracing.
Usher, Will; Klacansky, Pavol; Federer, Frederick; Bremer, Peer-Timo; Knoll, Aaron; Yarch, Jeff; Angelucci, Alessandra; Pascucci, Valerio
2018-01-01
Tracing neurons in large-scale microscopy data is crucial to establishing a wiring diagram of the brain, which is needed to understand how neural circuits in the brain process information and generate behavior. Automatic techniques often fail for large and complex datasets, and connectomics researchers may spend weeks or months manually tracing neurons using 2D image stacks. We present a design study of a new virtual reality (VR) system, developed in collaboration with trained neuroanatomists, to trace neurons in microscope scans of the visual cortex of primates. We hypothesize that using consumer-grade VR technology to interact with neurons directly in 3D will help neuroscientists better resolve complex cases and enable them to trace neurons faster and with less physical and mental strain. We discuss both the design process and technical challenges in developing an interactive system to navigate and manipulate terabyte-sized image volumes in VR. Using a number of different datasets, we demonstrate that, compared to widely used commercial software, consumer-grade VR presents a promising alternative for scientists.
Picotti, Paola; Clement-Ziza, Mathieu; Lam, Henry; Campbell, David S.; Schmidt, Alexander; Deutsch, Eric W.; Röst, Hannes; Sun, Zhi; Rinner, Oliver; Reiter, Lukas; Shen, Qin; Michaelson, Jacob J.; Frei, Andreas; Alberti, Simon; Kusebauch, Ulrike; Wollscheid, Bernd; Moritz, Robert; Beyer, Andreas; Aebersold, Ruedi
2013-01-01
Complete reference maps or datasets, like the genomic map of an organism, are highly beneficial tools for biological and biomedical research. Attempts to generate such reference datasets for a proteome so far failed to reach complete proteome coverage, with saturation apparent at approximately two thirds of the proteomes tested, even for the most thoroughly characterized proteomes. Here, we used a strategy based on high-throughput peptide synthesis and mass spectrometry to generate a close to complete reference map (97% of the genome-predicted proteins) of the S. cerevisiae proteome. We generated two versions of this mass spectrometric map one supporting discovery- (shotgun) and the other hypothesis-driven (targeted) proteomic measurements. The two versions of the map, therefore, constitute a complete set of proteomic assays to support most studies performed with contemporary proteomic technologies. The reference libraries can be browsed via a web-based repository and associated navigation tools. To demonstrate the utility of the reference libraries we applied them to a protein quantitative trait locus (pQTL) analysis, which requires measurement of the same peptides over a large number of samples with high precision. Protein measurements over a set of 78 S. cerevisiae strains revealed a complex relationship between independent genetic loci, impacting on the levels of related proteins. Our results suggest that selective pressure favors the acquisition of sets of polymorphisms that maintain the stoichiometry of protein complexes and pathways. PMID:23334424
A Distributed Fuzzy Associative Classifier for Big Data.
Segatori, Armando; Bechini, Alessio; Ducange, Pietro; Marcelloni, Francesco
2017-09-19
Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in different real domain applications. The main reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes. Then, a set of candidate fuzzy association rules is generated by employing a distributed fuzzy extension of the well-known FP-Growth algorithm. Finally, this set is pruned by using three purposely adapted types of pruning. We implemented our approach on the popular Hadoop framework. Hadoop allows distributing storage and processing of very large data sets on computer clusters built from commodity hardware. We have performed an extensive experimentation and a detailed analysis of the results using six very large datasets with up to 11,000,000 instances. We have also experimented different types of reasoning methods. Focusing on accuracy, model complexity, computation time, and scalability, we compare the results achieved by our approach with those obtained by two distributed nonfuzzy ACs recently proposed in the literature. We highlight that, although the accuracies result to be comparable, the complexity, evaluated in terms of number of rules, of the classifiers generated by the fuzzy distributed approach is lower than the one of the nonfuzzy classifiers.
Mr-Moose: An advanced SED-fitting tool for heterogeneous multi-wavelength datasets
NASA Astrophysics Data System (ADS)
Drouart, G.; Falkendal, T.
2018-04-01
We present the public release of Mr-Moose, a fitting procedure that is able to perform multi-wavelength and multi-object spectral energy distribution (SED) fitting in a Bayesian framework. This procedure is able to handle a large variety of cases, from an isolated source to blended multi-component sources from an heterogeneous dataset (i.e. a range of observation sensitivities and spectral/spatial resolutions). Furthermore, Mr-Moose handles upper-limits during the fitting process in a continuous way allowing models to be gradually less probable as upper limits are approached. The aim is to propose a simple-to-use, yet highly-versatile fitting tool fro handling increasing source complexity when combining multi-wavelength datasets with fully customisable filter/model databases. The complete control of the user is one advantage, which avoids the traditional problems related to the "black box" effect, where parameter or model tunings are impossible and can lead to overfitting and/or over-interpretation of the results. Also, while a basic knowledge of Python and statistics is required, the code aims to be sufficiently user-friendly for non-experts. We demonstrate the procedure on three cases: two artificially-generated datasets and a previous result from the literature. In particular, the most complex case (inspired by a real source, combining Herschel, ALMA and VLA data) in the context of extragalactic SED fitting, makes Mr-Moose a particularly-attractive SED fitting tool when dealing with partially blended sources, without the need for data deconvolution.
An algorithm for direct causal learning of influences on patient outcomes.
Rathnam, Chandramouli; Lee, Sanghoon; Jiang, Xia
2017-01-01
This study aims at developing and introducing a new algorithm, called direct causal learner (DCL), for learning the direct causal influences of a single target. We applied it to both simulated and real clinical and genome wide association study (GWAS) datasets and compared its performance to classic causal learning algorithms. The DCL algorithm learns the causes of a single target from passive data using Bayesian-scoring, instead of using independence checks, and a novel deletion algorithm. We generate 14,400 simulated datasets and measure the number of datasets for which DCL correctly and partially predicts the direct causes. We then compare its performance with the constraint-based path consistency (PC) and conservative PC (CPC) algorithms, the Bayesian-score based fast greedy search (FGS) algorithm, and the partial ancestral graphs algorithm fast causal inference (FCI). In addition, we extend our comparison of all five algorithms to both a real GWAS dataset and real breast cancer datasets over various time-points in order to observe how effective they are at predicting the causal influences of Alzheimer's disease and breast cancer survival. DCL consistently outperforms FGS, PC, CPC, and FCI in discovering the parents of the target for the datasets simulated using a simple network. Overall, DCL predicts significantly more datasets correctly (McNemar's test significance: p<0.0001) than any of the other algorithms for these network types. For example, when assessing overall performance (simple and complex network results combined), DCL correctly predicts approximately 1400 more datasets than the top FGS method, 1600 more datasets than the top CPC method, 4500 more datasets than the top PC method, and 5600 more datasets than the top FCI method. Although FGS did correctly predict more datasets than DCL for the complex networks, and DCL correctly predicted only a few more datasets than CPC for these networks, there is no significant difference in performance between these three algorithms for this network type. However, when we use a more continuous measure of accuracy, we find that all the DCL methods are able to better partially predict more direct causes than FGS and CPC for the complex networks. In addition, DCL consistently had faster runtimes than the other algorithms. In the application to the real datasets, DCL identified rs6784615, located on the NISCH gene, and rs10824310, located on the PRKG1 gene, as direct causes of late onset Alzheimer's disease (LOAD) development. In addition, DCL identified ER category as a direct predictor of breast cancer mortality within 5 years, and HER2 status as a direct predictor of 10-year breast cancer mortality. These predictors have been identified in previous studies to have a direct causal relationship with their respective phenotypes, supporting the predictive power of DCL. When the other algorithms discovered predictors from the real datasets, these predictors were either also found by DCL or could not be supported by previous studies. Our results show that DCL outperforms FGS, PC, CPC, and FCI in almost every case, demonstrating its potential to advance causal learning. Furthermore, our DCL algorithm effectively identifies direct causes in the LOAD and Metabric GWAS datasets, which indicates its potential for clinical applications. Copyright © 2016 Elsevier B.V. All rights reserved.
Liu, Yijin; Meirer, Florian; Williams, Phillip A.; Wang, Junyue; Andrews, Joy C.; Pianetta, Piero
2012-01-01
Transmission X-ray microscopy (TXM) has been well recognized as a powerful tool for non-destructive investigation of the three-dimensional inner structure of a sample with spatial resolution down to a few tens of nanometers, especially when combined with synchrotron radiation sources. Recent developments of this technique have presented a need for new tools for both system control and data analysis. Here a software package developed in MATLAB for script command generation and analysis of TXM data is presented. The first toolkit, the script generator, allows automating complex experimental tasks which involve up to several thousand motor movements. The second package was designed to accomplish computationally intense tasks such as data processing of mosaic and mosaic tomography datasets; dual-energy contrast imaging, where data are recorded above and below a specific X-ray absorption edge; and TXM X-ray absorption near-edge structure imaging datasets. Furthermore, analytical and iterative tomography reconstruction algorithms were implemented. The compiled software package is freely available. PMID:22338691
Stylized facts in social networks: Community-based static modeling
NASA Astrophysics Data System (ADS)
Jo, Hang-Hyun; Murase, Yohsuke; Török, János; Kertész, János; Kaski, Kimmo
2018-06-01
The past analyses of datasets of social networks have enabled us to make empirical findings of a number of aspects of human society, which are commonly featured as stylized facts of social networks, such as broad distributions of network quantities, existence of communities, assortative mixing, and intensity-topology correlations. Since the understanding of the structure of these complex social networks is far from complete, for deeper insight into human society more comprehensive datasets and modeling of the stylized facts are needed. Although the existing dynamical and static models can generate some stylized facts, here we take an alternative approach by devising a community-based static model with heterogeneous community sizes and larger communities having smaller link density and weight. With these few assumptions we are able to generate realistic social networks that show most stylized facts for a wide range of parameters, as demonstrated numerically and analytically. Since our community-based static model is simple to implement and easily scalable, it can be used as a reference system, benchmark, or testbed for further applications.
Machine learning action parameters in lattice quantum chromodynamics
NASA Astrophysics Data System (ADS)
Shanahan, Phiala E.; Trewartha, Daniel; Detmold, William
2018-05-01
Numerical lattice quantum chromodynamics studies of the strong interaction are important in many aspects of particle and nuclear physics. Such studies require significant computing resources to undertake. A number of proposed methods promise improved efficiency of lattice calculations, and access to regions of parameter space that are currently computationally intractable, via multi-scale action-matching approaches that necessitate parametric regression of generated lattice datasets. The applicability of machine learning to this regression task is investigated, with deep neural networks found to provide an efficient solution even in cases where approaches such as principal component analysis fail. The high information content and complex symmetries inherent in lattice QCD datasets require custom neural network layers to be introduced and present opportunities for further development.
Automated synthetic scene generation
NASA Astrophysics Data System (ADS)
Givens, Ryan N.
Physics-based simulations generate synthetic imagery to help organizations anticipate system performance of proposed remote sensing systems. However, manually constructing synthetic scenes which are sophisticated enough to capture the complexity of real-world sites can take days to months depending on the size of the site and desired fidelity of the scene. This research, sponsored by the Air Force Research Laboratory's Sensors Directorate, successfully developed an automated approach to fuse high-resolution RGB imagery, lidar data, and hyperspectral imagery and then extract the necessary scene components. The method greatly reduces the time and money required to generate realistic synthetic scenes and developed new approaches to improve material identification using information from all three of the input datasets.
Topic modeling for cluster analysis of large biological and medical datasets
2014-01-01
Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. PMID:25350106
Topic modeling for cluster analysis of large biological and medical datasets.
Zhao, Weizhong; Zou, Wen; Chen, James J
2014-01-01
The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
An ontological system for interoperable spatial generalisation in biodiversity monitoring
NASA Astrophysics Data System (ADS)
Nieland, Simon; Moran, Niklas; Kleinschmit, Birgit; Förster, Michael
2015-11-01
Semantic heterogeneity remains a barrier to data comparability and standardisation of results in different fields of spatial research. Because of its thematic complexity, differing acquisition methods and national nomenclatures, interoperability of biodiversity monitoring information is especially difficult. Since data collection methods and interpretation manuals broadly vary there is a need for automatised, objective methodologies for the generation of comparable data-sets. Ontology-based applications offer vast opportunities in data management and standardisation. This study examines two data-sets of protected heathlands in Germany and Belgium which are based on remote sensing image classification and semantically formalised in an OWL2 ontology. The proposed methodology uses semantic relations of the two data-sets, which are (semi-)automatically derived from remote sensing imagery, to generate objective and comparable information about the status of protected areas by utilising kernel-based spatial reclassification. This automatised method suggests a generalisation approach, which is able to generate delineation of Special Areas of Conservation (SAC) of the European biodiversity Natura 2000 network. Furthermore, it is able to transfer generalisation rules between areas surveyed with varying acquisition methods in different countries by taking into account automated inference of the underlying semantics. The generalisation results were compared with the manual delineation of terrestrial monitoring. For the different habitats in the two sites an accuracy of above 70% was detected. However, it has to be highlighted that the delineation of the ground-truth data inherits a high degree of uncertainty, which is discussed in this study.
Thermal Texture Generation and 3d Model Reconstruction Using SFM and Gan
NASA Astrophysics Data System (ADS)
Kniaz, V. V.; Mizginov, V. A.
2018-05-01
Realistic 3D models with textures representing thermal emission of the object are widely used in such fields as dynamic scene analysis, autonomous driving, and video surveillance. Structure from Motion (SfM) methods provide a robust approach for the generation of textured 3D models in the visible range. Still, automatic generation of 3D models from the infrared imagery is challenging due to an absence of the feature points and low sensor resolution. Recent advances in Generative Adversarial Networks (GAN) have proved that they can perform complex image-to-image transformations such as a transformation of day to night and generation of imagery in a different spectral range. In this paper, we propose a novel method for generation of realistic 3D models with thermal textures using the SfM pipeline and GAN. The proposed method uses visible range images as an input. The images are processed in two ways. Firstly, they are used for point matching and dense point cloud generation. Secondly, the images are fed into a GAN that performs the transformation from the visible range to the thermal range. We evaluate the proposed method using real infrared imagery captured with a FLIR ONE PRO camera. We generated a dataset with 2000 pairs of real images captured in thermal and visible range. The dataset is used to train the GAN network and to generate 3D models using SfM. The evaluation of the generated 3D models and infrared textures proved that they are similar to the ground truth model in both thermal emissivity and geometrical shape.
Andersen, J C; Gwiazdowski, R A; Gdanetz, K; Gruwell, M E
2015-02-01
Armored scale insects and their primary bacterial endosymbionts show nearly identical patterns of co-diversification when viewed at the family level, though the persistence of these patterns at the species level has not been explored in this group. Therefore we investigated genealogical patterns of co-diversification near the species level between the primary endosymbiont Uzinura diaspidicola and its hosts in the Chionaspis pinifoliae-Chionaspis heterophyllae species complex. To do this we generated DNA sequence data from three endosymbiont loci (rspB, GroEL, and 16S) and analyzed each locus independently using statistical parsimony network analyses and as a concatenated dataset using Bayesian phylogenetic reconstructions. We found that for two endosymbiont loci, 16S and GroEL, sequences from U. diaspidicola were broadly associated with host species designations, while for rspB this pattern was less clear as C. heterophyllae (species S1) shared haplotypes with several other Chionaspis species. We then compared the topological congruence of the phylogenetic reconstructions generated from a concatenated dataset of endosymbiont loci (including all three loci, above) to that from a concatenated dataset of armored scale hosts, using published data from two nuclear loci (28S and EF1α) and one mitochondrial locus (COI-COII) from the armored scale hosts. We calculated whether the two topologies were congruent using the Shimodaira-Hasegawa test. We found no significant differences (P = 0.4892) between the topologies suggesting that, at least at this level of resolution, co-diversification of U. diaspidicola with its armored scale hosts also occurs near the species level. This is the first such study of co-speciation at the species level between U. diaspidicola and a group of armored scale insects.
Infrasound as a Depth Discriminant
2010-09-01
INFRASOUND AS A DEPTH DISCRIMINANT Stephen J. Arrowsmith1, Hans E. Hartse1, Steven R. Taylor2, Richard J. Stead1, and Rod W. Whitaker1 Los Alamos...LA09-Depth-NDD02 ABSTRACT The identification of a signature relating depth to a remotely recorded infrasound signal requires a dataset of...can generate infrasound via a variety of processes, which have occasionally been confused in past studies due to the complexity of the process; (2
OpenSHS: Open Smart Home Simulator.
Alshammari, Nasser; Alshammari, Talal; Sedky, Mohamed; Champion, Justin; Bauer, Carolin
2017-05-02
This paper develops a new hybrid, open-source, cross-platform 3D smart home simulator, OpenSHS, for dataset generation. OpenSHS offers an opportunity for researchers in the field of the Internet of Things (IoT) and machine learning to test and evaluate their models. Following a hybrid approach, OpenSHS combines advantages from both interactive and model-based approaches. This approach reduces the time and efforts required to generate simulated smart home datasets. We have designed a replication algorithm for extending and expanding a dataset. A small sample dataset produced, by OpenSHS, can be extended without affecting the logical order of the events. The replication provides a solution for generating large representative smart home datasets. We have built an extensible library of smart devices that facilitates the simulation of current and future smart home environments. Our tool divides the dataset generation process into three distinct phases: first design: the researcher designs the initial virtual environment by building the home, importing smart devices and creating contexts; second, simulation: the participant simulates his/her context-specific events; and third, aggregation: the researcher applies the replication algorithm to generate the final dataset. We conducted a study to assess the ease of use of our tool on the System Usability Scale (SUS).
OpenSHS: Open Smart Home Simulator
Alshammari, Nasser; Alshammari, Talal; Sedky, Mohamed; Champion, Justin; Bauer, Carolin
2017-01-01
This paper develops a new hybrid, open-source, cross-platform 3D smart home simulator, OpenSHS, for dataset generation. OpenSHS offers an opportunity for researchers in the field of the Internet of Things (IoT) and machine learning to test and evaluate their models. Following a hybrid approach, OpenSHS combines advantages from both interactive and model-based approaches. This approach reduces the time and efforts required to generate simulated smart home datasets. We have designed a replication algorithm for extending and expanding a dataset. A small sample dataset produced, by OpenSHS, can be extended without affecting the logical order of the events. The replication provides a solution for generating large representative smart home datasets. We have built an extensible library of smart devices that facilitates the simulation of current and future smart home environments. Our tool divides the dataset generation process into three distinct phases: first design: the researcher designs the initial virtual environment by building the home, importing smart devices and creating contexts; second, simulation: the participant simulates his/her context-specific events; and third, aggregation: the researcher applies the replication algorithm to generate the final dataset. We conducted a study to assess the ease of use of our tool on the System Usability Scale (SUS). PMID:28468330
ProDaMa: an open source Python library to generate protein structure datasets.
Armano, Giuliano; Manconi, Andrea
2009-10-02
The huge difference between the number of known sequences and known tertiary structures has justified the use of automated methods for protein analysis. Although a general methodology to solve these problems has not been yet devised, researchers are engaged in developing more accurate techniques and algorithms whose training plays a relevant role in determining their performance. From this perspective, particular importance is given to the training data used in experiments, and researchers are often engaged in the generation of specialized datasets that meet their requirements. To facilitate the task of generating specialized datasets we devised and implemented ProDaMa, an open source Python library than provides classes for retrieving, organizing, updating, analyzing, and filtering protein data. ProDaMa has been used to generate specialized datasets useful for secondary structure prediction and to develop a collaborative web application aimed at generating and sharing protein structure datasets. The library, the related database, and the documentation are freely available at the URL http://iasc.diee.unica.it/prodama.
I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chard, Kyle; D'Arcy, Mike; Heavner, Benjamin D.
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified. We address these issues by proposing simple methods and toolsmore » for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.« less
Zepeda-Mendoza, Marie Lisandra; Bohmann, Kristine; Carmona Baez, Aldo; Gilbert, M Thomas P
2016-05-03
DNA metabarcoding is an approach for identifying multiple taxa in an environmental sample using specific genetic loci and taxa-specific primers. When combined with high-throughput sequencing it enables the taxonomic characterization of large numbers of samples in a relatively time- and cost-efficient manner. One recent laboratory development is the addition of 5'-nucleotide tags to both primers producing double-tagged amplicons and the use of multiple PCR replicates to filter erroneous sequences. However, there is currently no available toolkit for the straightforward analysis of datasets produced in this way. We present DAMe, a toolkit for the processing of datasets generated by double-tagged amplicons from multiple PCR replicates derived from an unlimited number of samples. Specifically, DAMe can be used to (i) sort amplicons by tag combination, (ii) evaluate PCR replicates dissimilarity, and (iii) filter sequences derived from sequencing/PCR errors, chimeras, and contamination. This is attained by calculating the following parameters: (i) sequence content similarity between the PCR replicates from each sample, (ii) reproducibility of each unique sequence across the PCR replicates, and (iii) copy number of the unique sequences in each PCR replicate. We showcase the insights that can be obtained using DAMe prior to taxonomic assignment, by applying it to two real datasets that vary in their complexity regarding number of samples, sequencing libraries, PCR replicates, and used tag combinations. Finally, we use a third mock dataset to demonstrate the impact and importance of filtering the sequences with DAMe. DAMe allows the user-friendly manipulation of amplicons derived from multiple samples with PCR replicates built in a single or multiple sequencing libraries. It allows the user to: (i) collapse amplicons into unique sequences and sort them by tag combination while retaining the sample identifier and copy number information, (ii) identify sequences carrying unused tag combinations, (iii) evaluate the comparability of PCR replicates of the same sample, and (iv) filter tagged amplicons from a number of PCR replicates using parameters of minimum length, copy number, and reproducibility across the PCR replicates. This enables an efficient analysis of complex datasets, and ultimately increases the ease of handling datasets from large-scale studies.
Kernel-based discriminant feature extraction using a representative dataset
NASA Astrophysics Data System (ADS)
Li, Honglin; Sancho Gomez, Jose-Luis; Ahalt, Stanley C.
2002-07-01
Discriminant Feature Extraction (DFE) is widely recognized as an important pre-processing step in classification applications. Most DFE algorithms are linear and thus can only explore the linear discriminant information among the different classes. Recently, there has been several promising attempts to develop nonlinear DFE algorithms, among which is Kernel-based Feature Extraction (KFE). The efficacy of KFE has been experimentally verified by both synthetic data and real problems. However, KFE has some known limitations. First, KFE does not work well for strongly overlapped data. Second, KFE employs all of the training set samples during the feature extraction phase, which can result in significant computation when applied to very large datasets. Finally, KFE can result in overfitting. In this paper, we propose a substantial improvement to KFE that overcomes the above limitations by using a representative dataset, which consists of critical points that are generated from data-editing techniques and centroid points that are determined by using the Frequency Sensitive Competitive Learning (FSCL) algorithm. Experiments show that this new KFE algorithm performs well on significantly overlapped datasets, and it also reduces computational complexity. Further, by controlling the number of centroids, the overfitting problem can be effectively alleviated.
Distributed memory parallel Markov random fields using graph partitioning
DOE Office of Scientific and Technical Information (OSTI.GOV)
Heinemann, C.; Perciano, T.; Ushizima, D.
Markov random fields (MRF) based algorithms have attracted a large amount of interest in image analysis due to their ability to exploit contextual information about data. Image data generated by experimental facilities, though, continues to grow larger and more complex, making it more difficult to analyze in a reasonable amount of time. Applying image processing algorithms to large datasets requires alternative approaches to circumvent performance problems. Aiming to provide scientists with a new tool to recover valuable information from such datasets, we developed a general purpose distributed memory parallel MRF-based image analysis framework (MPI-PMRF). MPI-PMRF overcomes performance and memory limitationsmore » by distributing data and computations across processors. The proposed approach was successfully tested with synthetic and experimental datasets. Additionally, the performance of the MPI-PMRF framework is analyzed through a detailed scalability study. We show that a performance increase is obtained while maintaining an accuracy of the segmentation results higher than 98%. The contributions of this paper are: (a) development of a distributed memory MRF framework; (b) measurement of the performance increase of the proposed approach; (c) verification of segmentation accuracy in both synthetic and experimental, real-world datasets« less
Machine learning action parameters in lattice quantum chromodynamics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shanahan, Phiala; Trewartha, Daneil; Detmold, William
Numerical lattice quantum chromodynamics studies of the strong interaction underpin theoretical understanding of many aspects of particle and nuclear physics. Such studies require significant computing resources to undertake. A number of proposed methods promise improved efficiency of lattice calculations, and access to regions of parameter space that are currently computationally intractable, via multi-scale action-matching approaches that necessitate parametric regression of generated lattice datasets. The applicability of machine learning to this regression task is investigated, with deep neural networks found to provide an efficient solution even in cases where approaches such as principal component analysis fail. Finally, the high information contentmore » and complex symmetries inherent in lattice QCD datasets require custom neural network layers to be introduced and present opportunities for further development.« less
Scalable Prediction of Energy Consumption using Incremental Time Series Clustering
DOE Office of Scientific and Technical Information (OSTI.GOV)
Simmhan, Yogesh; Noor, Muhammad Usman
2013-10-09
Time series datasets are a canonical form of high velocity Big Data, and often generated by pervasive sensors, such as found in smart infrastructure. Performing predictive analytics on time series data can be computationally complex, and requires approximation techniques. In this paper, we motivate this problem using a real application from the smart grid domain. We propose an incremental clustering technique, along with a novel affinity score for determining cluster similarity, which help reduce the prediction error for cumulative time series within a cluster. We evaluate this technique, along with optimizations, using real datasets from smart meters, totaling ~700,000 datamore » points, and show the efficacy of our techniques in improving the prediction error of time series data within polynomial time.« less
Machine learning action parameters in lattice quantum chromodynamics
Shanahan, Phiala; Trewartha, Daneil; Detmold, William
2018-05-16
Numerical lattice quantum chromodynamics studies of the strong interaction underpin theoretical understanding of many aspects of particle and nuclear physics. Such studies require significant computing resources to undertake. A number of proposed methods promise improved efficiency of lattice calculations, and access to regions of parameter space that are currently computationally intractable, via multi-scale action-matching approaches that necessitate parametric regression of generated lattice datasets. The applicability of machine learning to this regression task is investigated, with deep neural networks found to provide an efficient solution even in cases where approaches such as principal component analysis fail. Finally, the high information contentmore » and complex symmetries inherent in lattice QCD datasets require custom neural network layers to be introduced and present opportunities for further development.« less
Mohammadhassanzadeh, Hossein; Van Woensel, William; Abidi, Samina Raza; Abidi, Syed Sibte Raza
2017-01-01
Capturing complete medical knowledge is challenging-often due to incomplete patient Electronic Health Records (EHR), but also because of valuable, tacit medical knowledge hidden away in physicians' experiences. To extend the coverage of incomplete medical knowledge-based systems beyond their deductive closure, and thus enhance their decision-support capabilities, we argue that innovative, multi-strategy reasoning approaches should be applied. In particular, plausible reasoning mechanisms apply patterns from human thought processes, such as generalization, similarity and interpolation, based on attributional, hierarchical, and relational knowledge. Plausible reasoning mechanisms include inductive reasoning , which generalizes the commonalities among the data to induce new rules, and analogical reasoning , which is guided by data similarities to infer new facts. By further leveraging rich, biomedical Semantic Web ontologies to represent medical knowledge, both known and tentative, we increase the accuracy and expressivity of plausible reasoning, and cope with issues such as data heterogeneity, inconsistency and interoperability. In this paper, we present a Semantic Web-based, multi-strategy reasoning approach, which integrates deductive and plausible reasoning and exploits Semantic Web technology to solve complex clinical decision support queries. We evaluated our system using a real-world medical dataset of patients with hepatitis, from which we randomly removed different percentages of data (5%, 10%, 15%, and 20%) to reflect scenarios with increasing amounts of incomplete medical knowledge. To increase the reliability of the results, we generated 5 independent datasets for each percentage of missing values, which resulted in 20 experimental datasets (in addition to the original dataset). The results show that plausibly inferred knowledge extends the coverage of the knowledge base by, on average, 2%, 7%, 12%, and 16% for datasets with, respectively, 5%, 10%, 15%, and 20% of missing values. This expansion in the KB coverage allowed solving complex disease diagnostic queries that were previously unresolvable, without losing the correctness of the answers. However, compared to deductive reasoning, data-intensive plausible reasoning mechanisms yield a significant performance overhead. We observed that plausible reasoning approaches, by generating tentative inferences and leveraging domain knowledge of experts, allow us to extend the coverage of medical knowledge bases, resulting in improved clinical decision support. Second, by leveraging OWL ontological knowledge, we are able to increase the expressivity and accuracy of plausible reasoning methods. Third, our approach is applicable to clinical decision support systems for a range of chronic diseases.
Leung, Preston; Eltahla, Auda A; Lloyd, Andrew R; Bull, Rowena A; Luciani, Fabio
2017-07-15
With the advent of affordable deep sequencing technologies, detection of low frequency variants within genetically diverse viral populations can now be achieved with unprecedented depth and efficiency. The high-resolution data provided by next generation sequencing technologies is currently recognised as the gold standard in estimation of viral diversity. In the analysis of rapidly mutating viruses, longitudinal deep sequencing datasets from viral genomes during individual infection episodes, as well as at the epidemiological level during outbreaks, now allow for more sophisticated analyses such as statistical estimates of the impact of complex mutation patterns on the evolution of the viral populations both within and between hosts. These analyses are revealing more accurate descriptions of the evolutionary dynamics that underpin the rapid adaptation of these viruses to the host response, and to drug therapies. This review assesses recent developments in methods and provide informative research examples using deep sequencing data generated from rapidly mutating viruses infecting humans, particularly hepatitis C virus (HCV), human immunodeficiency virus (HIV), Ebola virus and influenza virus, to understand the evolution of viral genomes and to explore the relationship between viral mutations and the host adaptive immune response. Finally, we discuss limitations in current technologies, and future directions that take advantage of publically available large deep sequencing datasets. Copyright © 2016 Elsevier B.V. All rights reserved.
A next generation multiscale view of inborn errors of metabolism
Argmann, Carmen A.; Houten, Sander M.; Zhu, Jun; Schadt, Eric E.
2015-01-01
Inborn errors of metabolism (IEM) are not unlike common diseases. They often present as a spectrum of disease phenotypes that correlates poorly with the severity of the disease-causing mutations. This greatly impacts patient care and reveals fundamental gaps in our knowledge of disease modifying biology. Systems biology approaches that integrate multi-omics data into molecular networks have significantly improved our understanding of complex diseases. Similar approaches to study IEM are rare despite their complex nature. We highlight that existing common disease-derived datasets and networks can be repurposed to generate novel mechanistic insight in IEM and potentially identify candidate modifiers. While understanding disease pathophysiology will advance the IEM field, the ultimate goal should be to understand per individual how their phenotype emerges given their primary mutation on the background of their whole genome, not unlike personalized medicine. We foresee that panomics and network strategies combined with recent experimental innovations will facilitate this. PMID:26712461
Pan, Yuliang; Wang, Zixiang; Zhan, Weihua; Deng, Lei
2018-05-01
Identifying RNA-binding residues, especially energetically favored hot spots, can provide valuable clues for understanding the mechanisms and functional importance of protein-RNA interactions. Yet, limited availability of experimentally recognized energy hot spots in protein-RNA crystal structures leads to the difficulties in developing empirical identification approaches. Computational prediction of RNA-binding hot spot residues is still in its infant stage. Here, we describe a computational method, PrabHot (Prediction of protein-RNA binding hot spots), that can effectively detect hot spot residues on protein-RNA binding interfaces using an ensemble of conceptually different machine learning classifiers. Residue interaction network features and new solvent exposure characteristics are combined together and selected for classification with the Boruta algorithm. In particular, two new reference datasets (benchmark and independent) have been generated containing 107 hot spots from 47 known protein-RNA complex structures. In 10-fold cross-validation on the training dataset, PrabHot achieves promising performances with an AUC score of 0.86 and a sensitivity of 0.78, which are significantly better than that of the pioneer RNA-binding hot spot prediction method HotSPRing. We also demonstrate the capability of our proposed method on the independent test dataset and gain a competitive advantage as a result. The PrabHot webserver is freely available at http://denglab.org/PrabHot/. leideng@csu.edu.cn. Supplementary data are available at Bioinformatics online.
The PREP pipeline: standardized preprocessing for large-scale EEG analysis.
Bigdely-Shamlo, Nima; Mullen, Tim; Kothe, Christian; Su, Kyung-Min; Robbins, Kay A
2015-01-01
The technology to collect brain imaging and physiological measures has become portable and ubiquitous, opening the possibility of large-scale analysis of real-world human imaging. By its nature, such data is large and complex, making automated processing essential. This paper shows how lack of attention to the very early stages of an EEG preprocessing pipeline can reduce the signal-to-noise ratio and introduce unwanted artifacts into the data, particularly for computations done in single precision. We demonstrate that ordinary average referencing improves the signal-to-noise ratio, but that noisy channels can contaminate the results. We also show that identification of noisy channels depends on the reference and examine the complex interaction of filtering, noisy channel identification, and referencing. We introduce a multi-stage robust referencing scheme to deal with the noisy channel-reference interaction. We propose a standardized early-stage EEG processing pipeline (PREP) and discuss the application of the pipeline to more than 600 EEG datasets. The pipeline includes an automatically generated report for each dataset processed. Users can download the PREP pipeline as a freely available MATLAB library from http://eegstudy.org/prepcode.
Toward edge minability for role mining in bipartite networks
NASA Astrophysics Data System (ADS)
Dong, Lijun; Wang, Yi; Liu, Ran; Pi, Benjie; Wu, Liuyi
2016-11-01
Bipartite network models have been extensively used in information security to automatically generate role-based access control (RBAC) from dataset. This process is called role mining. However, not all the topologies of bipartite networks are suitable for role mining; some edges may even reduce the quality of role mining. This causes unnecessary time consumption as role mining is NP-hard. Therefore, to promote the quality of role mining results, the capability that an edge composes roles with other edges, called the minability of edge, needs to be identified. We tackle the problem from an angle of edge importance in complex networks; that is an edge easily covered by roles is considered to be more important. Based on this idea, the k-shell decomposition of complex networks is extended to reveal the different minability of edges. By this way, a bipartite network can be quickly purified by excluding the low-minability edges from role mining, and thus the quality of role mining can be effectively improved. Extensive experiments via the real-world datasets are conducted to confirm the above claims.
Applications of the LBA-ECO Metadata Warehouse
NASA Astrophysics Data System (ADS)
Wilcox, L.; Morrell, A.; Griffith, P. C.
2006-05-01
The LBA-ECO Project Office has developed a system to harvest and warehouse metadata resulting from the Large-Scale Biosphere Atmosphere Experiment in Amazonia. The harvested metadata is used to create dynamically generated reports, available at www.lbaeco.org, which facilitate access to LBA-ECO datasets. The reports are generated for specific controlled vocabulary terms (such as an investigation team or a geospatial region), and are cross-linked with one another via these terms. This approach creates a rich contextual framework enabling researchers to find datasets relevant to their research. It maximizes data discovery by association and provides a greater understanding of the scientific and social context of each dataset. For example, our website provides a profile (e.g. participants, abstract(s), study sites, and publications) for each LBA-ECO investigation. Linked from each profile is a list of associated registered dataset titles, each of which link to a dataset profile that describes the metadata in a user-friendly way. The dataset profiles are generated from the harvested metadata, and are cross-linked with associated reports via controlled vocabulary terms such as geospatial region. The region name appears on the dataset profile as a hyperlinked term. When researchers click on this link, they find a list of reports relevant to that region, including a list of dataset titles associated with that region. Each dataset title in this list is hyperlinked to its corresponding dataset profile. Moreover, each dataset profile contains hyperlinks to each associated data file at its home data repository and to publications that have used the dataset. We also use the harvested metadata in administrative applications to assist quality assurance efforts. These include processes to check for broken hyperlinks to data files, automated emails that inform our administrators when critical metadata fields are updated, dynamically generated reports of metadata records that link to datasets with questionable file formats, and dynamically generated region/site coordinate quality assurance reports. These applications are as important as those that facilitate access to information because they help ensure a high standard of quality for the information. This presentation will discuss reports currently in use, provide a technical overview of the system, and discuss plans to extend this system to harvest metadata resulting from the North American Carbon Program by drawing on datasets in many different formats, residing in many thematic data centers and also distributed among hundreds of investigators.
Generation of openEHR Test Datasets for Benchmarking.
El Helou, Samar; Karvonen, Tuukka; Yamamoto, Goshiro; Kume, Naoto; Kobayashi, Shinji; Kondo, Eiji; Hiragi, Shusuke; Okamoto, Kazuya; Tamura, Hiroshi; Kuroda, Tomohiro
2017-01-01
openEHR is a widely used EHR specification. Given its technology-independent nature, different approaches for implementing openEHR data repositories exist. Public openEHR datasets are needed to conduct benchmark analyses over different implementations. To address their current unavailability, we propose a method for generating openEHR test datasets that can be publicly shared and used.
Gonzalez, Michael A; Lebrigio, Rafael F Acosta; Van Booven, Derek; Ulloa, Rick H; Powell, Eric; Speziani, Fiorella; Tekin, Mustafa; Schüle, Rebecca; Züchner, Stephan
2013-06-01
Novel genes are now identified at a rapid pace for many Mendelian disorders, and increasingly, for genetically complex phenotypes. However, new challenges have also become evident: (1) effectively managing larger exome and/or genome datasets, especially for smaller labs; (2) direct hands-on analysis and contextual interpretation of variant data in large genomic datasets; and (3) many small and medium-sized clinical and research-based investigative teams around the world are generating data that, if combined and shared, will significantly increase the opportunities for the entire community to identify new genes. To address these challenges, we have developed GEnomes Management Application (GEM.app), a software tool to annotate, manage, visualize, and analyze large genomic datasets (https://genomics.med.miami.edu/). GEM.app currently contains ∼1,600 whole exomes from 50 different phenotypes studied by 40 principal investigators from 15 different countries. The focus of GEM.app is on user-friendly analysis for nonbioinformaticians to make next-generation sequencing data directly accessible. Yet, GEM.app provides powerful and flexible filter options, including single family filtering, across family/phenotype queries, nested filtering, and evaluation of segregation in families. In addition, the system is fast, obtaining results within 4 sec across ∼1,200 exomes. We believe that this system will further enhance identification of genetic causes of human disease. © 2013 Wiley Periodicals, Inc.
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie
2018-01-01
Abstract Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. PMID:29106630
NCAR's Research Data Archive: OPeNDAP Access for Complex Datasets
NASA Astrophysics Data System (ADS)
Dattore, R.; Worley, S. J.
2014-12-01
Many datasets have complex structures including hundreds of parameters and numerous vertical levels, grid resolutions, and temporal products. Making these data accessible is a challenge for a data provider. OPeNDAP is powerful protocol for delivering in real-time multi-file datasets that can be ingested by many analysis and visualization tools, but for these datasets there are too many choices about how to aggregate. Simple aggregation schemes can fail to support, or at least make it very challenging, for many potential studies based on complex datasets. We address this issue by using a rich file content metadata collection to create a real-time customized OPeNDAP service to match the full suite of access possibilities for complex datasets. The Climate Forecast System Reanalysis (CFSR) and it's extension, the Climate Forecast System Version 2 (CFSv2) datasets produced by the National Centers for Environmental Prediction (NCEP) and hosted by the Research Data Archive (RDA) at the Computational and Information Systems Laboratory (CISL) at NCAR are examples of complex datasets that are difficult to aggregate with existing data server software. CFSR and CFSv2 contain 141 distinct parameters on 152 vertical levels, six grid resolutions and 36 products (analyses, n-hour forecasts, multi-hour averages, etc.) where not all parameter/level combinations are available at all grid resolution/product combinations. These data are archived in the RDA with the data structure provided by the producer; no additional re-organization or aggregation have been applied. Since 2011, users have been able to request customized subsets (e.g. - temporal, parameter, spatial) from the CFSR/CFSv2, which are processed in delayed-mode and then downloaded to a user's system. Until now, the complexity has made it difficult to provide real-time OPeNDAP access to the data. We have developed a service that leverages the already-existing subsetting interface and allows users to create a virtual dataset with its own structure (das, dds). The user receives a URL to the customized dataset that can be used by existing tools to ingest, analyze, and visualize the data. This presentation will detail the metadata system and OPeNDAP server that enable user-customized real-time access and show an example of how a visualization tool can access the data.
Could a neuroscientist understand a microprocessor?
Jonas, Eric; Kording, Konrad Paul; Diedrichsen, Jorn
2017-01-12
There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods frommore » neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Furthermore, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.« less
Patel, Sejal; Roncaglia, Paola; Lovering, Ruth C
2015-06-06
People with an autistic spectrum disorder (ASD) display a variety of characteristic behavioral traits, including impaired social interaction, communication difficulties and repetitive behavior. This complex neurodevelopment disorder is known to be associated with a combination of genetic and environmental factors. Neurexins and neuroligins play a key role in synaptogenesis and neurexin-neuroligin adhesion is one of several processes that have been implicated in autism spectrum disorders. In this report we describe the manual annotation of a selection of gene products known to be associated with autism and/or the neurexin-neuroligin-SHANK complex and demonstrate how a focused annotation approach leads to the creation of more descriptive Gene Ontology (GO) terms, as well as an increase in both the number of gene product annotations and their granularity, thus improving the data available in the GO database. The manual annotations we describe will impact on the functional analysis of a variety of future autism-relevant datasets. Comprehensive gene annotation is an essential aspect of genomic and proteomic studies, as the quality of gene annotations incorporated into statistical analysis tools affects the effective interpretation of data obtained through genome wide association studies, next generation sequencing, proteomic and transcriptomic datasets.
Could a Neuroscientist Understand a Microprocessor?
Kording, Konrad Paul
2017-01-01
There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods from neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods. PMID:28081141
Could a Neuroscientist Understand a Microprocessor?
Jonas, Eric; Kording, Konrad Paul
2017-01-01
There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods from neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.
Could a neuroscientist understand a microprocessor?
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jonas, Eric; Kording, Konrad Paul; Diedrichsen, Jorn
There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods frommore » neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Furthermore, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.« less
An automated graphics tool for comparative genomics: the Coulson plot generator
2013-01-01
Background Comparative analysis is an essential component to biology. When applied to genomics for example, analysis may require comparisons between the predicted presence and absence of genes in a group of genomes under consideration. Frequently, genes can be grouped into small categories based on functional criteria, for example membership of a multimeric complex, participation in a metabolic or signaling pathway or shared sequence features and/or paralogy. These patterns of retention and loss are highly informative for the prediction of function, and hence possible biological context, and can provide great insights into the evolutionary history of cellular functions. However, representation of such information in a standard spreadsheet is a poor visual means from which to extract patterns within a dataset. Results We devised the Coulson Plot, a new graphical representation that exploits a matrix of pie charts to display comparative genomics data. Each pie is used to describe a complex or process from a separate taxon, and is divided into sectors corresponding to the number of proteins (subunits) in a complex/process. The predicted presence or absence of proteins in each complex are delineated by occupancy of a given sector; this format is visually highly accessible and makes pattern recognition rapid and reliable. A key to the identity of each subunit, plus hierarchical naming of taxa and coloring are included. A java-based application, the Coulson plot generator (CPG) automates graphic production, with a tab or comma-delineated text file as input and generating an editable portable document format or svg file. Conclusions CPG software may be used to rapidly convert spreadsheet data to a graphical matrix pie chart format. The representation essentially retains all of the information from the spreadsheet but presents a graphically rich format making comparisons and identification of patterns significantly clearer. While the Coulson plot format is highly useful in comparative genomics, its original purpose, the software can be used to visualize any dataset where entity occupancy is compared between different classes. Availability CPG software is available at sourceforge http://sourceforge.net/projects/coulson and http://dl.dropbox.com/u/6701906/Web/Sites/Labsite/CPG.html PMID:23621955
Samanipour, Saer; Baz-Lomba, Jose A; Alygizakis, Nikiforos A; Reid, Malcolm J; Thomaidis, Nikolaos S; Thomas, Kevin V
2017-06-09
LC-HR-QTOF-MS recently has become a commonly used approach for the analysis of complex samples. However, identification of small organic molecules in complex samples with the highest level of confidence is a challenging task. Here we report on the implementation of a two stage algorithm for LC-HR-QTOF-MS datasets. We compared the performances of the two stage algorithm, implemented via NIVA_MZ_Analyzer™, with two commonly used approaches (i.e. feature detection and XIC peak picking, implemented via UNIFI by Waters and TASQ by Bruker, respectively) for the suspect analysis of four influent wastewater samples. We first evaluated the cross platform compatibility of LC-HR-QTOF-MS datasets generated via instruments from two different manufacturers (i.e. Waters and Bruker). Our data showed that with an appropriate spectral weighting function the spectra recorded by the two tested instruments are comparable for our analytes. As a consequence, we were able to perform full spectral comparison between the data generated via the two studied instruments. Four extracts of wastewater influent were analyzed for 89 analytes, thus 356 detection cases. The analytes were divided into 158 detection cases of artificial suspect analytes (i.e. verified by target analysis) and 198 true suspects. The two stage algorithm resulted in a zero rate of false positive detection, based on the artificial suspect analytes while producing a rate of false negative detection of 0.12. For the conventional approaches, the rates of false positive detection varied between 0.06 for UNIFI and 0.15 for TASQ. The rates of false negative detection for these methods ranged between 0.07 for TASQ and 0.09 for UNIFI. The effect of background signal complexity on the two stage algorithm was evaluated through the generation of a synthetic signal. We further discuss the boundaries of applicability of the two stage algorithm. The importance of background knowledge and experience in evaluating the reliability of results during the suspect screening was evaluated. Copyright © 2017 Elsevier B.V. All rights reserved.
Combining users' activity survey and simulators to evaluate human activity recognition systems.
Azkune, Gorka; Almeida, Aitor; López-de-Ipiña, Diego; Chen, Liming
2015-04-08
Evaluating human activity recognition systems usually implies following expensive and time-consuming methodologies, where experiments with humans are run with the consequent ethical and legal issues. We propose a novel evaluation methodology to overcome the enumerated problems, which is based on surveys for users and a synthetic dataset generator tool. Surveys allow capturing how different users perform activities of daily living, while the synthetic dataset generator is used to create properly labelled activity datasets modelled with the information extracted from surveys. Important aspects, such as sensor noise, varying time lapses and user erratic behaviour, can also be simulated using the tool. The proposed methodology is shown to have very important advantages that allow researchers to carry out their work more efficiently. To evaluate the approach, a synthetic dataset generated following the proposed methodology is compared to a real dataset computing the similarity between sensor occurrence frequencies. It is concluded that the similarity between both datasets is more than significant.
Variant Review with the Integrative Genomics Viewer.
Robinson, James T; Thorvaldsdóttir, Helga; Wenger, Aaron M; Zehir, Ahmet; Mesirov, Jill P
2017-11-01
Manual review of aligned reads for confirmation and interpretation of variant calls is an important step in many variant calling pipelines for next-generation sequencing (NGS) data. Visual inspection can greatly increase the confidence in calls, reduce the risk of false positives, and help characterize complex events. The Integrative Genomics Viewer (IGV) was one of the first tools to provide NGS data visualization, and it currently provides a rich set of tools for inspection, validation, and interpretation of NGS datasets, as well as other types of genomic data. Here, we present a short overview of IGV's variant review features for both single-nucleotide variants and structural variants, with examples from both cancer and germline datasets. IGV is freely available at https://www.igv.org Cancer Res; 77(21); e31-34. ©2017 AACR . ©2017 American Association for Cancer Research.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hsu, Christina M. L.; Palmeri, Mark L.; Department of Anesthesiology, Duke University Medical Center, Durham, North Carolina 27710
2013-04-15
Purpose: The authors previously reported on a three-dimensional computer-generated breast phantom, based on empirical human image data, including a realistic finite-element based compression model that was capable of simulating multimodality imaging data. The computerized breast phantoms are a hybrid of two phantom generation techniques, combining empirical breast CT (bCT) data with flexible computer graphics techniques. However, to date, these phantoms have been based on single human subjects. In this paper, the authors report on a new method to generate multiple phantoms, simulating additional subjects from the limited set of original dedicated breast CT data. The authors developed an image morphingmore » technique to construct new phantoms by gradually transitioning between two human subject datasets, with the potential to generate hundreds of additional pseudoindependent phantoms from the limited bCT cases. The authors conducted a preliminary subjective assessment with a limited number of observers (n= 4) to illustrate how realistic the simulated images generated with the pseudoindependent phantoms appeared. Methods: Several mesh-based geometric transformations were developed to generate distorted breast datasets from the original human subject data. Segmented bCT data from two different human subjects were used as the 'base' and 'target' for morphing. Several combinations of transformations were applied to morph between the 'base' and 'target' datasets such as changing the breast shape, rotating the glandular data, and changing the distribution of the glandular tissue. Following the morphing, regions of skin and fat were assigned to the morphed dataset in order to appropriately assign mechanical properties during the compression simulation. The resulting morphed breast was compressed using a finite element algorithm and simulated mammograms were generated using techniques described previously. Sixty-two simulated mammograms, generated from morphing three human subject datasets, were used in a preliminary observer evaluation where four board certified breast radiologists with varying amounts of experience ranked the level of realism (from 1 ='fake' to 10 ='real') of the simulated images. Results: The morphing technique was able to successfully generate new and unique morphed datasets from the original human subject data. The radiologists evaluated the realism of simulated mammograms generated from the morphed and unmorphed human subject datasets and scored the realism with an average ranking of 5.87 {+-} 1.99, confirming that overall the phantom image datasets appeared more 'real' than 'fake.' Moreover, there was not a significant difference (p > 0.1) between the realism of the unmorphed datasets (6.0 {+-} 1.95) compared to the morphed datasets (5.86 {+-} 1.99). Three of the four observers had overall average rankings of 6.89 {+-} 0.89, 6.9 {+-} 1.24, 6.76 {+-} 1.22, whereas the fourth observer ranked them noticeably lower at 2.94 {+-} 0.7. Conclusions: This work presents a technique that can be used to generate a suite of realistic computerized breast phantoms from a limited number of human subjects. This suite of flexible breast phantoms can be used for multimodality imaging research to provide a known truth while concurrently producing realistic simulated imaging data.« less
Highly scalable and robust rule learner: performance evaluation and comparison.
Kurgan, Lukasz A; Cios, Krzysztof J; Dick, Scott
2006-02-01
Business intelligence and bioinformatics applications increasingly require the mining of datasets consisting of millions of data points, or crafting real-time enterprise-level decision support systems for large corporations and drug companies. In all cases, there needs to be an underlying data mining system, and this mining system must be highly scalable. To this end, we describe a new rule learner called DataSqueezer. The learner belongs to the family of inductive supervised rule extraction algorithms. DataSqueezer is a simple, greedy, rule builder that generates a set of production rules from labeled input data. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact, comprehensible, and have accuracy comparable to rules generated by other state-of-the-art rule extraction algorithms. The main advantages of DataSqueezer are very high efficiency, and missing data resistance. DataSqueezer exhibits log-linear asymptotic complexity with the number of training examples, and it is faster than other state-of-the-art rule learners. The learner is also robust to large quantities of missing data, as verified by extensive experimental comparison with the other learners. DataSqueezer is thus well suited to modern data mining and business intelligence tasks, which commonly involve huge datasets with a large fraction of missing data.
Shi, Yi; Fernandez-Martinez, Javier; Tjioe, Elina; Pellarin, Riccardo; Kim, Seung Joong; Williams, Rosemary; Schneidman-Duhovny, Dina; Sali, Andrej; Rout, Michael P.; Chait, Brian T.
2014-01-01
Most cellular processes are orchestrated by macromolecular complexes. However, structural elucidation of these endogenous complexes can be challenging because they frequently contain large numbers of proteins, are compositionally and morphologically heterogeneous, can be dynamic, and are often of low abundance in the cell. Here, we present a strategy for the structural characterization of such complexes that has at its center chemical cross-linking with mass spectrometric readout. In this strategy, we isolate the endogenous complexes using a highly optimized sample preparation protocol and generate a comprehensive, high-quality cross-linking dataset using two complementary cross-linking reagents. We then determine the structure of the complex using a refined integrative method that combines the cross-linking data with information generated from other sources, including electron microscopy, X-ray crystallography, and comparative protein structure modeling. We applied this integrative strategy to determine the structure of the native Nup84 complex, a stable hetero-heptameric assembly (∼600 kDa), 16 copies of which form the outer rings of the 50-MDa nuclear pore complex (NPC) in budding yeast. The unprecedented detail of the Nup84 complex structure reveals previously unseen features in its pentameric structural hub and provides information on the conformational flexibility of the assembly. These additional details further support and augment the protocoatomer hypothesis, which proposes an evolutionary relationship between vesicle coating complexes and the NPC, and indicates a conserved mechanism by which the NPC is anchored in the nuclear envelope. PMID:25161197
Li, Xin; Liu, Shaomin; Xiao, Qin; Ma, Mingguo; Jin, Rui; Che, Tao; Wang, Weizhen; Hu, Xiaoli; Xu, Ziwei; Wen, Jianguang; Wang, Liangxu
2017-01-01
We introduce a multiscale dataset obtained from Heihe Watershed Allied Telemetry Experimental Research (HiWATER) in an oasis-desert area in 2012. Upscaling of eco-hydrological processes on a heterogeneous surface is a grand challenge. Progress in this field is hindered by the poor availability of multiscale observations. HiWATER is an experiment designed to address this challenge through instrumentation on hierarchically nested scales to obtain multiscale and multidisciplinary data. The HiWATER observation system consists of a flux observation matrix of eddy covariance towers, large aperture scintillometers, and automatic meteorological stations; an eco-hydrological sensor network of soil moisture and leaf area index; hyper-resolution airborne remote sensing using LiDAR, imaging spectrometer, multi-angle thermal imager, and L-band microwave radiometer; and synchronical ground measurements of vegetation dynamics, and photosynthesis processes. All observational data were carefully quality controlled throughout sensor calibration, data collection, data processing, and datasets generation. The data are freely available at figshare and the Cold and Arid Regions Science Data Centre. The data should be useful for elucidating multiscale eco-hydrological processes and developing upscaling methods. PMID:28654086
Sczyrba, Alexander; Hofmann, Peter; Belmann, Peter; Koslicki, David; Janssen, Stefan; Dröge, Johannes; Gregor, Ivan; Majda, Stephan; Fiedler, Jessika; Dahms, Eik; Bremges, Andreas; Fritz, Adrian; Garrido-Oter, Ruben; Jørgensen, Tue Sparholt; Shapiro, Nicole; Blood, Philip D.; Gurevich, Alexey; Bai, Yang; Turaev, Dmitrij; DeMaere, Matthew Z.; Chikhi, Rayan; Nagarajan, Niranjan; Quince, Christopher; Meyer, Fernando; Balvočiūtė, Monika; Hansen, Lars Hestbjerg; Sørensen, Søren J.; Chia, Burton K. H.; Denis, Bertrand; Froula, Jeff L.; Wang, Zhong; Egan, Robert; Kang, Dongwan Don; Cook, Jeffrey J.; Deltel, Charles; Beckstette, Michael; Lemaitre, Claire; Peterlongo, Pierre; Rizk, Guillaume; Lavenier, Dominique; Wu, Yu-Wei; Singer, Steven W.; Jain, Chirag; Strous, Marc; Klingenberg, Heiner; Meinicke, Peter; Barton, Michael; Lingner, Thomas; Lin, Hsin-Hung; Liao, Yu-Chieh; Silva, Genivaldo Gueiros Z.; Cuevas, Daniel A.; Edwards, Robert A.; Saha, Surya; Piro, Vitor C.; Renard, Bernhard Y.; Pop, Mihai; Klenk, Hans-Peter; Göker, Markus; Kyrpides, Nikos C.; Woyke, Tanja; Vorholt, Julia A.; Schulze-Lefert, Paul; Rubin, Edward M.; Darling, Aaron E.; Rattei, Thomas; McHardy, Alice C.
2018-01-01
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions. PMID:28967888
Context-Aware Generative Adversarial Privacy
NASA Astrophysics Data System (ADS)
Huang, Chong; Kairouz, Peter; Chen, Xiao; Sankar, Lalitha; Rajagopal, Ram
2017-12-01
Preserving the utility of published datasets while simultaneously providing provable privacy guarantees is a well-known challenge. On the one hand, context-free privacy solutions, such as differential privacy, provide strong privacy guarantees, but often lead to a significant reduction in utility. On the other hand, context-aware privacy solutions, such as information theoretic privacy, achieve an improved privacy-utility tradeoff, but assume that the data holder has access to dataset statistics. We circumvent these limitations by introducing a novel context-aware privacy framework called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to allow the data holder to learn privatization schemes from the dataset itself. Under GAP, learning the privacy mechanism is formulated as a constrained minimax game between two players: a privatizer that sanitizes the dataset in a way that limits the risk of inference attacks on the individuals' private variables, and an adversary that tries to infer the private variables from the sanitized dataset. To evaluate GAP's performance, we investigate two simple (yet canonical) statistical dataset models: (a) the binary data model, and (b) the binary Gaussian mixture model. For both models, we derive game-theoretically optimal minimax privacy mechanisms, and show that the privacy mechanisms learned from data (in a generative adversarial fashion) match the theoretically optimal ones. This demonstrates that our framework can be easily applied in practice, even in the absence of dataset statistics.
PWHATSHAP: efficient haplotyping for future generation sequencing.
Bracciali, Andrea; Aldinucci, Marco; Patterson, Murray; Marschall, Tobias; Pisanti, Nadia; Merelli, Ivan; Torquati, Massimo
2016-09-22
Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions have typically an exponential computational complexity. WHATSHAP is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest when considering sequencing technology's current trends that are producing longer fragments. Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered PWHATSHAP, a parallel, high-performance version of WHATSHAP. PWHATSHAP is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WHATSHAP, PWHATSHAP exhibits the same complexity exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WHATSHAP, which increases with coverage. Due to its structure and management of the large datasets, the parallelisation of WHATSHAP posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework. The result, PWHATSHAP, is a freely available toolkit that improves the efficiency of the analysis of genomics information.
Cheng, Ji-Hong; Liu, Wen-Chun; Chang, Ting-Tsung; Hsieh, Sun-Yuan; Tseng, Vincent S
2017-10-01
Many studies have suggested that deletions of Hepatitis B Viral (HBV) are associated with the development of progressive liver diseases, even ultimately resulting in hepatocellular carcinoma (HCC). Among the methods for detecting deletions from next-generation sequencing (NGS) data, few methods considered the characteristics of virus, such as high evolution rates and high divergence among the different HBV genomes. Sequencing high divergence HBV genome sequences using the NGS technology outputs millions of reads. Thus, detecting exact breakpoints of deletions from these big and complex data incurs very high computational cost. We proposed a novel analytical method named VirDelect (Virus Deletion Detect), which uses split read alignment base to detect exact breakpoint and diversity variable to consider high divergence in single-end reads data, such that the computational cost can be reduced without losing accuracy. We use four simulated reads datasets and two real pair-end reads datasets of HBV genome sequence to verify VirDelect accuracy by score functions. The experimental results show that VirDelect outperforms the state-of-the-art method Pindel in terms of accuracy score for all simulated datasets and VirDelect had only two base errors even in real datasets. VirDelect is also shown to deliver high accuracy in analyzing the single-end read data as well as pair-end data. VirDelect can serve as an effective and efficient bioinformatics tool for physiologists with high accuracy and efficient performance and applicable to further analysis with characteristics similar to HBV on genome length and high divergence. The software program of VirDelect can be downloaded at https://sourceforge.net/projects/virdelect/. Copyright © 2017. Published by Elsevier Inc.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rawle, Rachel A.; Hamerly, Timothy; Tripet, Brian P.
Studies of interspecies interactions are inherently difficult due to the complex mechanisms which enable these relationships. A model system for studying interspecies interactions is the marine hyperthermophiles Ignicoccus hospitalis and Nanoarchaeum equitans. Recent independently-conducted ‘omics’ analyses have generated insights into the molecular factors modulating this association. However, significant questions remain about the nature of the interactions between these archaea. We jointly analyzed multiple levels of omics datasets obtained from published, independent transcriptomics, proteomics, and metabolomics analyses. DAVID identified functionally-related groups enriched when I. hospitalis is grown alone or in co-culture with N. equitans. Enriched molecular pathways were subsequently visualized usingmore » interaction maps generated using STRING. Key findings of our multi-level omics analysis indicated that I. hospitalis provides precursors to N. equitans for energy metabolism. Analysis indicated an overall reduction in diversity of metabolic precursors in the I. hospitalis–N. equitans co-culture, which has been connected to the differential use of ribosomal subunits and was previously unnoticed. We also identified differences in precursors linked to amino acid metabolism, NADH metabolism, and carbon fixation, providing new insights into the metabolic adaptions of I. hospitalis enabling the growth of N. equitans. In conclusion, this multi-omics analysis builds upon previously identified cellular patterns while offering new insights into mechanisms that enable the I. hospitalis–N. equitans association. This study applies statistical and visualization techniques to a mixed-source omics dataset to yield a more global insight into a complex system, that was not readily discernable from separate omics studies.« less
Artificial intelligence (AI) systems for interpreting complex medical datasets.
Altman, R B
2017-05-01
Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications in medical data have several technical challenges: complex and heterogeneous datasets, noisy medical datasets, and explaining their output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability. © 2017 ASCPT.
Hufnagel, P.; Glandorf, J.; Körting, G.; Jabs, W.; Schweiger-Hufnagel, U.; Hahner, S.; Lubeck, M.; Suckau, D.
2007-01-01
Analysis of complex proteomes often results in long protein lists, but falls short in measuring the validity of identification and quantification results on a greater number of proteins. Biological and technical replicates are mandatory, as is the combination of the MS data from various workflows (gels, 1D-LC, 2D-LC), instruments (TOF/TOF, trap, qTOF or FTMS), and search engines. We describe a database-driven study that combines two workflows, two mass spectrometers, and four search engines with protein identification following a decoy database strategy. The sample was a tryptically digested lysate (10,000 cells) of a human colorectal cancer cell line. Data from two LC-MALDI-TOF/TOF runs and a 2D-LC-ESI-trap run using capillary and nano-LC columns were submitted to the proteomics software platform ProteinScape. The combined MALDI data and the ESI data were searched using Mascot (Matrix Science), Phenyx (GeneBio), ProteinSolver (Bruker and Protagen), and Sequest (Thermo) against a decoy database generated from IPI-human in order to obtain one protein list across all workflows and search engines at a defined maximum false-positive rate of 5%. ProteinScape combined the data to one LC-MALDI and one LC-ESI dataset. The initial separate searches from the two combined datasets generated eight independent peptide lists. These were compiled into an integrated protein list using the ProteinExtractor algorithm. An initial evaluation of the generated data led to the identification of approximately 1200 proteins. Result integration on a peptide level allowed discrimination of protein isoforms that would not have been possible with a mere combination of protein lists.
Guhlin, Joseph; Silverstein, Kevin A T; Zhou, Peng; Tiffin, Peter; Young, Nevin D
2017-08-10
Rapid generation of omics data in recent years have resulted in vast amounts of disconnected datasets without systemic integration and knowledge building, while individual groups have made customized, annotated datasets available on the web with few ways to link them to in-lab datasets. With so many research groups generating their own data, the ability to relate it to the larger genomic and comparative genomic context is becoming increasingly crucial to make full use of the data. The Omics Database Generator (ODG) allows users to create customized databases that utilize published genomics data integrated with experimental data which can be queried using a flexible graph database. When provided with omics and experimental data, ODG will create a comparative, multi-dimensional graph database. ODG can import definitions and annotations from other sources such as InterProScan, the Gene Ontology, ENZYME, UniPathway, and others. This annotation data can be especially useful for studying new or understudied species for which transcripts have only been predicted, and rapidly give additional layers of annotation to predicted genes. In better studied species, ODG can perform syntenic annotation translations or rapidly identify characteristics of a set of genes or nucleotide locations, such as hits from an association study. ODG provides a web-based user-interface for configuring the data import and for querying the database. Queries can also be run from the command-line and the database can be queried directly through programming language hooks available for most languages. ODG supports most common genomic formats as well as generic, easy to use tab-separated value format for user-provided annotations. ODG is a user-friendly database generation and query tool that adapts to the supplied data to produce a comparative genomic database or multi-layered annotation database. ODG provides rapid comparative genomic annotation and is therefore particularly useful for non-model or understudied species. For species for which more data are available, ODG can be used to conduct complex multi-omics, pattern-matching queries.
He, Awen; Wang, Wenyu; Prakash, N Tejo; Tinkov, Alexey A; Skalny, Anatoly V; Wen, Yan; Hao, Jingcan; Guo, Xiong; Zhang, Feng
2018-03-01
Chemical elements are closely related to human health. Extensive genomic profile data of complex diseases offer us a good opportunity to systemically investigate the relationships between elements and complex diseases/traits. In this study, we applied gene set enrichment analysis (GSEA) approach to detect the associations between elements and complex diseases/traits though integrating element-gene interaction datasets and genome-wide association study (GWAS) data of complex diseases/traits. To illustrate the performance of GSEA, the element-gene interaction datasets of 24 elements were extracted from the comparative toxicogenomics database (CTD). GWAS summary datasets of 24 complex diseases or traits were downloaded from the dbGaP or GEFOS websites. We observed significant associations between 7 elements and 13 complex diseases or traits (all false discovery rate (FDR) < 0.05), including reported relationships such as aluminum vs. Alzheimer's disease (FDR = 0.042), calcium vs. bone mineral density (FDR = 0.031), magnesium vs. systemic lupus erythematosus (FDR = 0.012) as well as novel associations, such as nickel vs. hypertriglyceridemia (FDR = 0.002) and bipolar disorder (FDR = 0.027). Our study results are consistent with previous biological studies, supporting the good performance of GSEA. Our analyzing results based on GSEA framework provide novel clues for discovering causal relationships between elements and complex diseases. © 2017 WILEY PERIODICALS, INC.
Lagorce, David; Pencheva, Tania; Villoutreix, Bruno O; Miteva, Maria A
2009-11-13
Discovery of new bioactive molecules that could enter drug discovery programs or that could serve as chemical probes is a very complex and costly endeavor. Structure-based and ligand-based in silico screening approaches are nowadays extensively used to complement experimental screening approaches in order to increase the effectiveness of the process and facilitating the screening of thousands or millions of small molecules against a biomolecular target. Both in silico screening methods require as input a suitable chemical compound collection and most often the 3D structure of the small molecules has to be generated since compounds are usually delivered in 1D SMILES, CANSMILES or in 2D SDF formats. Here, we describe the new open source program DG-AMMOS which allows the generation of the 3D conformation of small molecules using Distance Geometry and their energy minimization via Automated Molecular Mechanics Optimization. The program is validated on the Astex dataset, the ChemBridge Diversity database and on a number of small molecules with known crystal structures extracted from the Cambridge Structural Database. A comparison with the free program Balloon and the well-known commercial program Omega generating the 3D of small molecules is carried out. The results show that the new free program DG-AMMOS is a very efficient 3D structure generator engine. DG-AMMOS provides fast, automated and reliable access to the generation of 3D conformation of small molecules and facilitates the preparation of a compound collection prior to high-throughput virtual screening computations. The validation of DG-AMMOS on several different datasets proves that generated structures are generally of equal quality or sometimes better than structures obtained by other tested methods.
NASA Astrophysics Data System (ADS)
Hullo, J.-F.; Thibault, G.; Boucheny, C.
2015-02-01
In a context of increased maintenance operations and workers generational renewal, a nuclear owner and operator like Electricité de France (EDF) is interested in the scaling up of tools and methods of "as-built virtual reality" for larger buildings and wider audiences. However, acquisition and sharing of as-built data on a large scale (large and complex multi-floored buildings) challenge current scientific and technical capacities. In this paper, we first present a state of the art of scanning tools and methods for industrial plants with very complex architecture. Then, we introduce the inner characteristics of the multi-sensor scanning and visualization of the interior of the most complex building of a power plant: a nuclear reactor building. We introduce several developments that made possible a first complete survey of such a large building, from acquisition, processing and fusion of multiple data sources (3D laser scans, total-station survey, RGB panoramic, 2D floor plans, 3D CAD as-built models). In addition, we present the concepts of a smart application developed for the painless exploration of the whole dataset. The goal of this application is to help professionals, unfamiliar with the manipulation of such datasets, to take into account spatial constraints induced by the building complexity while preparing maintenance operations. Finally, we discuss the main feedbacks of this large experiment, the remaining issues for the generalization of such large scale surveys and the future technical and scientific challenges in the field of industrial "virtual reality".
Fast and Accurate Support Vector Machines on Large Scale Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vishnu, Abhinav; Narasimhan, Jayenthi; Holder, Larry
Support Vector Machines (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm, which has become ubiquitous largely due to its high accuracy and obliviousness to dimensionality. The objective of SVM is to find an optimal boundary --- also known as hyperplane --- which separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples contribute to the definition of the boundary. However, existing parallel algorithms use the entire dataset for finding the boundary, which is sub-optimal for performance reasons. In this paper, we propose a novel distributed memory algorithm to eliminatemore » the samples which do not contribute to the boundary definition in SVM. We propose several heuristics, which range from early (aggressive) to late (conservative) elimination of the samples, such that the overall time for generating the boundary is reduced considerably. In a few cases, a sample may be eliminated (shrunk) pre-emptively --- potentially resulting in an incorrect boundary. We propose a scalable approach to synchronize the necessary data structures such that the proposed algorithm maintains its accuracy. We consider the necessary trade-offs of single/multiple synchronization using in-depth time-space complexity analysis. We implement the proposed algorithm using MPI and compare it with libsvm--- de facto sequential SVM software --- which we enhance with OpenMP for multi-core/many-core parallelism. Our proposed approach shows excellent efficiency using up to 4096 processes on several large datasets such as UCI HIGGS Boson dataset and Offending URL dataset.« less
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong
2018-01-04
Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.
Bicer, Tekin; Gürsoy, Doğa; Andrade, Vincent De; Kettimuthu, Rajkumar; Scullin, William; Carlo, Francesco De; Foster, Ian T
2017-01-01
Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
Shekarchi, Sayedali; Hallam, John; Christensen-Dalsgaard, Jakob
2013-11-01
Head-related transfer functions (HRTFs) are generally large datasets, which can be an important constraint for embedded real-time applications. A method is proposed here to reduce redundancy and compress the datasets. In this method, HRTFs are first compressed by conversion into autoregressive-moving-average (ARMA) filters whose coefficients are calculated using Prony's method. Such filters are specified by a few coefficients which can generate the full head-related impulse responses (HRIRs). Next, Legendre polynomials (LPs) are used to compress the ARMA filter coefficients. LPs are derived on the sphere and form an orthonormal basis set for spherical functions. Higher-order LPs capture increasingly fine spatial details. The number of LPs needed to represent an HRTF, therefore, is indicative of its spatial complexity. The results indicate that compression ratios can exceed 98% while maintaining a spectral error of less than 4 dB in the recovered HRTFs.
Making the Most of Omics for Symbiosis Research
Chaston, J.; Douglas, A.E.
2012-01-01
Omics, including genomics, proteomics and metabolomics, enable us to explain symbioses in terms of the underlying molecules and their interactions. The central task is to transform molecular catalogs of genes, metabolites etc. into a dynamic understanding of symbiosis function. We review four exemplars of omics studies that achieve this goal, through defined biological questions relating to metabolic integration and regulation of animal-microbial symbioses, the genetic autonomy of bacterial symbionts, and symbiotic protection of animal hosts from pathogens. As omic datasets become increasingly complex, computationally-sophisticated downstream analyses are essential to reveal interactions not evident to visual inspection of the data. We discuss two approaches, phylogenomics and transcriptional clustering, that can divide the primary output of omics studies – long lists of factors – into manageable subsets, and we describe how they have been applied to analyze large datasets and generate testable hypotheses. PMID:22983030
NASA Astrophysics Data System (ADS)
Baudin, François; Martinez, Philippe; Dennielou, Bernard; Charlier, Karine; Marsset, Tania; Droz, Laurence; Rabouille, Christophe
2017-08-01
Geochemical data (total organic carbon-TOC content, δ13Corg, C:N, Rock-Eval analyses) were obtained on 150 core tops from the Angola basin, with a special focus on the Congo deep-sea fan. Combined with the previously published data, the resulting dataset (322 stations) shows a good spatial and bathymetric representativeness. TOC content and δ13Corg maps of the Angola basin were generated using this enhanced dataset. The main difference in our map with previously published ones is the high terrestrial organic matter content observed downslope along the active turbidite channel of the Congo deep-sea fan till the distal lobe complex near 5000 m of water-depth. Interpretation of downslope trends in TOC content and organic matter composition indicates that lateral particle transport by turbidity currents is the primary mechanism controlling supply and burial of organic matter in the bathypelagic depths.
This dataset is generated to both qualitatively and quantitatively examine the interactions between nano-TiO2 and natural organic matter (NOM). This integrated dataset assemble all data generated in this project through a series of experiments. This dataset is associated with the following publication:Li , S., H. Ma, L. Wallis, M. Etterson , B. Riley , D. Hoff , and S. Diamond. Impact of natural organic matter on particle behavior and phototoxicity of titanium dioxide nanoparticles. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 542: 324-333, (2016).
The PREP pipeline: standardized preprocessing for large-scale EEG analysis
Bigdely-Shamlo, Nima; Mullen, Tim; Kothe, Christian; Su, Kyung-Min; Robbins, Kay A.
2015-01-01
The technology to collect brain imaging and physiological measures has become portable and ubiquitous, opening the possibility of large-scale analysis of real-world human imaging. By its nature, such data is large and complex, making automated processing essential. This paper shows how lack of attention to the very early stages of an EEG preprocessing pipeline can reduce the signal-to-noise ratio and introduce unwanted artifacts into the data, particularly for computations done in single precision. We demonstrate that ordinary average referencing improves the signal-to-noise ratio, but that noisy channels can contaminate the results. We also show that identification of noisy channels depends on the reference and examine the complex interaction of filtering, noisy channel identification, and referencing. We introduce a multi-stage robust referencing scheme to deal with the noisy channel-reference interaction. We propose a standardized early-stage EEG processing pipeline (PREP) and discuss the application of the pipeline to more than 600 EEG datasets. The pipeline includes an automatically generated report for each dataset processed. Users can download the PREP pipeline as a freely available MATLAB library from http://eegstudy.org/prepcode. PMID:26150785
Mohamed Salleh, Faridah Hani; Arif, Shereena Mohd; Zainudin, Suhaila; Firdaus-Raih, Mohd
2015-12-01
A gene regulatory network (GRN) is a large and complex network consisting of interacting elements that, over time, affect each other's state. The dynamics of complex gene regulatory processes are difficult to understand using intuitive approaches alone. To overcome this problem, we propose an algorithm for inferring the regulatory interactions from knock-out data using a Gaussian model combines with Pearson Correlation Coefficient (PCC). There are several problems relating to GRN construction that have been outlined in this paper. We demonstrated the ability of our proposed method to (1) predict the presence of regulatory interactions between genes, (2) their directionality and (3) their states (activation or suppression). The algorithm was applied to network sizes of 10 and 50 genes from DREAM3 datasets and network sizes of 10 from DREAM4 datasets. The predicted networks were evaluated based on AUROC and AUPR. We discovered that high false positive values were generated by our GRN prediction methods because the indirect regulations have been wrongly predicted as true relationships. We achieved satisfactory results as the majority of sub-networks achieved AUROC values above 0.5. Copyright © 2015 Elsevier Ltd. All rights reserved.
Automatic Generation of Issue Maps: Structured, Interactive Outputs for Complex Information Needs
2012-09-01
much can result in behaviour similar to the shortest-path chains. 24 Ronald Goldman Neil Lewis Judge Lance Ito 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 jury...Connecting the Dots has also been explored in non-textual domains. The authors of [ Heath et al., 2010] propose building graphs, called Image Webs, to...could imagine a metro map summarizing a dataset of medical records. 2. Images: In [ Heath et al., 2010], Heath et al build graphs called Image Webs to rep
A statistical shape model of the human second cervical vertebra.
Clogenson, Marine; Duff, John M; Luethi, Marcel; Levivier, Marc; Meuli, Reto; Baur, Charles; Henein, Simon
2015-07-01
Statistical shape and appearance models play an important role in reducing the segmentation processing time of a vertebra and in improving results for 3D model development. Here, we describe the different steps in generating a statistical shape model (SSM) of the second cervical vertebra (C2) and provide the shape model for general use by the scientific community. The main difficulties in its construction are the morphological complexity of the C2 and its variability in the population. The input dataset is composed of manually segmented anonymized patient computerized tomography (CT) scans. The alignment of the different datasets is done with the procrustes alignment on surface models, and then, the registration is cast as a model-fitting problem using a Gaussian process. A principal component analysis (PCA)-based model is generated which includes the variability of the C2. The SSM was generated using 92 CT scans. The resulting SSM was evaluated for specificity, compactness and generalization ability. The SSM of the C2 is freely available to the scientific community in Slicer (an open source software for image analysis and scientific visualization) with a module created to visualize the SSM using Statismo, a framework for statistical shape modeling. The SSM of the vertebra allows the shape variability of the C2 to be represented. Moreover, the SSM will enable semi-automatic segmentation and 3D model generation of the vertebra, which would greatly benefit surgery planning.
Yin, Weiwei; Garimalla, Swetha; Moreno, Alberto; Galinski, Mary R; Styczynski, Mark P
2015-08-28
There are increasing efforts to bring high-throughput systems biology techniques to bear on complex animal model systems, often with a goal of learning about underlying regulatory network structures (e.g., gene regulatory networks). However, complex animal model systems typically have significant limitations on cohort sizes, number of samples, and the ability to perform follow-up and validation experiments. These constraints are particularly problematic for many current network learning approaches, which require large numbers of samples and may predict many more regulatory relationships than actually exist. Here, we test the idea that by leveraging the accuracy and efficiency of classifiers, we can construct high-quality networks that capture important interactions between variables in datasets with few samples. We start from a previously-developed tree-like Bayesian classifier and generalize its network learning approach to allow for arbitrary depth and complexity of tree-like networks. Using four diverse sample networks, we demonstrate that this approach performs consistently better at low sample sizes than the Sparse Candidate Algorithm, a representative approach for comparison because it is known to generate Bayesian networks with high positive predictive value. We develop and demonstrate a resampling-based approach to enable the identification of a viable root for the learned tree-like network, important for cases where the root of a network is not known a priori. We also develop and demonstrate an integrated resampling-based approach to the reduction of variable space for the learning of the network. Finally, we demonstrate the utility of this approach via the analysis of a transcriptional dataset of a malaria challenge in a non-human primate model system, Macaca mulatta, suggesting the potential to capture indicators of the earliest stages of cellular differentiation during leukopoiesis. We demonstrate that by starting from effective and efficient approaches for creating classifiers, we can identify interesting tree-like network structures with significant ability to capture the relationships in the training data. This approach represents a promising strategy for inferring networks with high positive predictive value under the constraint of small numbers of samples, meeting a need that will only continue to grow as more high-throughput studies are applied to complex model systems.
Integrating Epigenomics into the Understanding of Biomedical Insight.
Han, Yixing; He, Ximiao
2016-01-01
Epigenetics is one of the most rapidly expanding fields in biomedical research, and the popularity of the high-throughput next-generation sequencing (NGS) highlights the accelerating speed of epigenomics discovery over the past decade. Epigenetics studies the heritable phenotypes resulting from chromatin changes but without alteration on DNA sequence. Epigenetic factors and their interactive network regulate almost all of the fundamental biological procedures, and incorrect epigenetic information may lead to complex diseases. A comprehensive understanding of epigenetic mechanisms, their interactions, and alterations in health and diseases genome widely has become a priority in biological research. Bioinformatics is expected to make a remarkable contribution for this purpose, especially in processing and interpreting the large-scale NGS datasets. In this review, we introduce the epigenetics pioneering achievements in health status and complex diseases; next, we give a systematic review of the epigenomics data generation, summarize public resources and integrative analysis approaches, and finally outline the challenges and future directions in computational epigenomics.
Integrating Epigenomics into the Understanding of Biomedical Insight
Han, Yixing; He, Ximiao
2016-01-01
Epigenetics is one of the most rapidly expanding fields in biomedical research, and the popularity of the high-throughput next-generation sequencing (NGS) highlights the accelerating speed of epigenomics discovery over the past decade. Epigenetics studies the heritable phenotypes resulting from chromatin changes but without alteration on DNA sequence. Epigenetic factors and their interactive network regulate almost all of the fundamental biological procedures, and incorrect epigenetic information may lead to complex diseases. A comprehensive understanding of epigenetic mechanisms, their interactions, and alterations in health and diseases genome widely has become a priority in biological research. Bioinformatics is expected to make a remarkable contribution for this purpose, especially in processing and interpreting the large-scale NGS datasets. In this review, we introduce the epigenetics pioneering achievements in health status and complex diseases; next, we give a systematic review of the epigenomics data generation, summarize public resources and integrative analysis approaches, and finally outline the challenges and future directions in computational epigenomics. PMID:27980397
Rios, Anthony; Kavuluru, Ramakanth
2013-09-01
Extracting diagnosis codes from medical records is a complex task carried out by trained coders by reading all the documents associated with a patient's visit. With the popularity of electronic medical records (EMRs), computational approaches to code extraction have been proposed in the recent years. Machine learning approaches to multi-label text classification provide an important methodology in this task given each EMR can be associated with multiple codes. In this paper, we study the the role of feature selection, training data selection, and probabilistic threshold optimization in improving different multi-label classification approaches. We conduct experiments based on two different datasets: a recent gold standard dataset used for this task and a second larger and more complex EMR dataset we curated from the University of Kentucky Medical Center. While conventional approaches achieve results comparable to the state-of-the-art on the gold standard dataset, on our complex in-house dataset, we show that feature selection, training data selection, and probabilistic thresholding provide significant gains in performance.
PRIDE: new developments and new datasets.
Jones, Philip; Côté, Richard G; Cho, Sang Yun; Klie, Sebastian; Martens, Lennart; Quinn, Antony F; Thorneycroft, David; Hermjakob, Henning
2008-01-01
The PRIDE (http://www.ebi.ac.uk/pride) database of protein and peptide identifications was previously described in the NAR Database Special Edition in 2006. Since this publication, the volume of public data in the PRIDE relational database has increased by more than an order of magnitude. Several significant public datasets have been added, including identifications and processed mass spectra generated by the HUPO Brain Proteome Project and the HUPO Liver Proteome Project. The PRIDE software development team has made several significant changes and additions to the user interface and tool set associated with PRIDE. The focus of these changes has been to facilitate the submission process and to improve the mechanisms by which PRIDE can be queried. The PRIDE team has developed a Microsoft Excel workbook that allows the required data to be collated in a series of relatively simple spreadsheets, with automatic generation of PRIDE XML at the end of the process. The ability to query PRIDE has been augmented by the addition of a BioMart interface allowing complex queries to be constructed. Collaboration with groups outside the EBI has been fruitful in extending PRIDE, including an approach to encode iTRAQ quantitative data in PRIDE XML.
Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions
Li, Haoran; Xiong, Li; Jiang, Xiaoqian
2014-01-01
Differential privacy has recently emerged in private statistical data release as one of the strongest privacy guarantees. Most of the existing techniques that generate differentially private histograms or synthetic data only work well for single dimensional or low-dimensional histograms. They become problematic for high dimensional and large domain data due to increased perturbation error and computation complexity. In this paper, we propose DPCopula, a differentially private data synthesization technique using Copula functions for multi-dimensional data. The core of our method is to compute a differentially private copula function from which we can sample synthetic data. Copula functions are used to describe the dependence between multivariate random vectors and allow us to build the multivariate joint distribution using one-dimensional marginal distributions. We present two methods for estimating the parameters of the copula functions with differential privacy: maximum likelihood estimation and Kendall’s τ estimation. We present formal proofs for the privacy guarantee as well as the convergence property of our methods. Extensive experiments using both real datasets and synthetic datasets demonstrate that DPCopula generates highly accurate synthetic multi-dimensional data with significantly better utility than state-of-the-art techniques. PMID:25405241
Developing Chemistry and Kinetic Modeling Tools for Low-Temperature Plasma Simulations
NASA Astrophysics Data System (ADS)
Jenkins, Thomas; Beckwith, Kris; Davidson, Bradley; Kruger, Scott; Pankin, Alexei; Roark, Christine; Stoltz, Peter
2015-09-01
We discuss the use of proper orthogonal decomposition (POD) methods in VSim, a FDTD plasma simulation code capable of both PIC/MCC and fluid modeling. POD methods efficiently generate smooth representations of noisy self-consistent or test-particle PIC data, and are thus advantageous in computing macroscopic fluid quantities from large PIC datasets (e.g. for particle-based closure computations) and in constructing optimal visual representations of the underlying physics. They may also confer performance advantages for massively parallel simulations, due to the significant reduction in dataset sizes conferred by truncated singular-value decompositions of the PIC data. We also demonstrate how complex LTP chemistry scenarios can be modeled in VSim via an interface with MUNCHKIN, a developing standalone python/C++/SQL code that identifies reaction paths for given input species, solves 1D rate equations for the time-dependent chemical evolution of the system, and generates corresponding VSim input blocks with appropriate cross-sections/reaction rates. MUNCHKIN also computes reaction rates from user-specified distribution functions, and conducts principal path analyses to reduce the number of simulated chemical reactions. Supported by U.S. Department of Energy SBIR program, Award DE-SC0009501.
Shi, Yi; Fernandez-Martinez, Javier; Tjioe, Elina; Pellarin, Riccardo; Kim, Seung Joong; Williams, Rosemary; Schneidman-Duhovny, Dina; Sali, Andrej; Rout, Michael P; Chait, Brian T
2014-11-01
Most cellular processes are orchestrated by macromolecular complexes. However, structural elucidation of these endogenous complexes can be challenging because they frequently contain large numbers of proteins, are compositionally and morphologically heterogeneous, can be dynamic, and are often of low abundance in the cell. Here, we present a strategy for the structural characterization of such complexes that has at its center chemical cross-linking with mass spectrometric readout. In this strategy, we isolate the endogenous complexes using a highly optimized sample preparation protocol and generate a comprehensive, high-quality cross-linking dataset using two complementary cross-linking reagents. We then determine the structure of the complex using a refined integrative method that combines the cross-linking data with information generated from other sources, including electron microscopy, X-ray crystallography, and comparative protein structure modeling. We applied this integrative strategy to determine the structure of the native Nup84 complex, a stable hetero-heptameric assembly (∼ 600 kDa), 16 copies of which form the outer rings of the 50-MDa nuclear pore complex (NPC) in budding yeast. The unprecedented detail of the Nup84 complex structure reveals previously unseen features in its pentameric structural hub and provides information on the conformational flexibility of the assembly. These additional details further support and augment the protocoatomer hypothesis, which proposes an evolutionary relationship between vesicle coating complexes and the NPC, and indicates a conserved mechanism by which the NPC is anchored in the nuclear envelope. © 2014 by The American Society for Biochemistry and Molecular Biology, Inc.
Effect of rich-club on diffusion in complex networks
NASA Astrophysics Data System (ADS)
Berahmand, Kamal; Samadi, Negin; Sheikholeslami, Seyed Mahmood
2018-05-01
One of the main issues in complex networks is the phenomenon of diffusion in which the goal is to find the nodes with the highest diffusing power. In diffusion, there is always a conflict between accuracy and efficiency time complexity; therefore, most of the recent studies have focused on finding new centralities to solve this problem and have offered new ones, but our approach is different. Using one of the complex networks’ features, namely the “rich-club”, its effect on diffusion in complex networks has been analyzed and it is demonstrated that in datasets which have a high rich-club, it is better to use the degree centrality for finding influential nodes because it has a linear time complexity and uses the local information; however, this rule does not apply to datasets which have a low rich-club. Next, real and artificial datasets with the high rich-club have been used in which degree centrality has been compared to famous centrality using the SIR standard.
Fitting Meta-Analytic Structural Equation Models with Complex Datasets
ERIC Educational Resources Information Center
Wilson, Sandra Jo; Polanin, Joshua R.; Lipsey, Mark W.
2016-01-01
A modification of the first stage of the standard procedure for two-stage meta-analytic structural equation modeling for use with large complex datasets is presented. This modification addresses two common problems that arise in such meta-analyses: (a) primary studies that provide multiple measures of the same construct and (b) the correlation…
Statistical Analysis of Large Simulated Yield Datasets for Studying Climate Effects
NASA Technical Reports Server (NTRS)
Makowski, David; Asseng, Senthold; Ewert, Frank; Bassu, Simona; Durand, Jean-Louis; Martre, Pierre; Adam, Myriam; Aggarwal, Pramod K.; Angulo, Carlos; Baron, Chritian;
2015-01-01
Many studies have been carried out during the last decade to study the effect of climate change on crop yields and other key crop characteristics. In these studies, one or several crop models were used to simulate crop growth and development for different climate scenarios that correspond to different projections of atmospheric CO2 concentration, temperature, and rainfall changes (Semenov et al., 1996; Tubiello and Ewert, 2002; White et al., 2011). The Agricultural Model Intercomparison and Improvement Project (AgMIP; Rosenzweig et al., 2013) builds on these studies with the goal of using an ensemble of multiple crop models in order to assess effects of climate change scenarios for several crops in contrasting environments. These studies generate large datasets, including thousands of simulated crop yield data. They include series of yield values obtained by combining several crop models with different climate scenarios that are defined by several climatic variables (temperature, CO2, rainfall, etc.). Such datasets potentially provide useful information on the possible effects of different climate change scenarios on crop yields. However, it is sometimes difficult to analyze these datasets and to summarize them in a useful way due to their structural complexity; simulated yield data can differ among contrasting climate scenarios, sites, and crop models. Another issue is that it is not straightforward to extrapolate the results obtained for the scenarios to alternative climate change scenarios not initially included in the simulation protocols. Additional dynamic crop model simulations for new climate change scenarios are an option but this approach is costly, especially when a large number of crop models are used to generate the simulated data, as in AgMIP. Statistical models have been used to analyze responses of measured yield data to climate variables in past studies (Lobell et al., 2011), but the use of a statistical model to analyze yields simulated by complex process-based crop models is a rather new idea. We demonstrate herewith that statistical methods can play an important role in analyzing simulated yield data sets obtained from the ensembles of process-based crop models. Formal statistical analysis is helpful to estimate the effects of different climatic variables on yield, and to describe the between-model variability of these effects.
A hybrid approach for efficient anomaly detection using metaheuristic methods
Ghanem, Tamer F.; Elkilani, Wail S.; Abdul-kader, Hatem M.
2014-01-01
Network intrusion detection based on anomaly detection techniques has a significant role in protecting networks and systems against harmful activities. Different metaheuristic techniques have been used for anomaly detector generation. Yet, reported literature has not studied the use of the multi-start metaheuristic method for detector generation. This paper proposes a hybrid approach for anomaly detection in large scale datasets using detectors generated based on multi-start metaheuristic method and genetic algorithms. The proposed approach has taken some inspiration of negative selection-based detector generation. The evaluation of this approach is performed using NSL-KDD dataset which is a modified version of the widely used KDD CUP 99 dataset. The results show its effectiveness in generating a suitable number of detectors with an accuracy of 96.1% compared to other competitors of machine learning algorithms. PMID:26199752
A hybrid approach for efficient anomaly detection using metaheuristic methods.
Ghanem, Tamer F; Elkilani, Wail S; Abdul-Kader, Hatem M
2015-07-01
Network intrusion detection based on anomaly detection techniques has a significant role in protecting networks and systems against harmful activities. Different metaheuristic techniques have been used for anomaly detector generation. Yet, reported literature has not studied the use of the multi-start metaheuristic method for detector generation. This paper proposes a hybrid approach for anomaly detection in large scale datasets using detectors generated based on multi-start metaheuristic method and genetic algorithms. The proposed approach has taken some inspiration of negative selection-based detector generation. The evaluation of this approach is performed using NSL-KDD dataset which is a modified version of the widely used KDD CUP 99 dataset. The results show its effectiveness in generating a suitable number of detectors with an accuracy of 96.1% compared to other competitors of machine learning algorithms.
Online Tools for Bioinformatics Analyses in Nutrition Sciences12
Malkaram, Sridhar A.; Hassan, Yousef I.; Zempleni, Janos
2012-01-01
Recent advances in “omics” research have resulted in the creation of large datasets that were generated by consortiums and centers, small datasets that were generated by individual investigators, and bioinformatics tools for mining these datasets. It is important for nutrition laboratories to take full advantage of the analysis tools to interrogate datasets for information relevant to genomics, epigenomics, transcriptomics, proteomics, and metabolomics. This review provides guidance regarding bioinformatics resources that are currently available in the public domain, with the intent to provide a starting point for investigators who want to take advantage of the opportunities provided by the bioinformatics field. PMID:22983844
Rogers, L J; Douglas, R R
1984-02-01
In this paper (the second in a series), we consider a (generic) pair of datasets, which have been analyzed by the techniques of the previous paper. Thus, their "stable subspaces" have been established by comparative factor analysis. The pair of datasets must satisfy two confirmable conditions. The first is the "Inclusion Condition," which requires that the stable subspace of one of the datasets is nearly identical to a subspace of the other dataset's stable subspace. On the basis of that, we have assumed the pair to have similar generating signals, with stochastically independent generators. The second verifiable condition is that the (presumed same) generating signals have distinct ratios of variances for the two datasets. Under these conditions a small elaboration of some elementary linear algebra reduces the rotation problem to several eigenvalue-eigenvector problems. Finally, we emphasize that an analysis of each dataset by the method of Douglas and Rogers (1983) is an essential prerequisite for the useful application of the techniques in this paper. Nonempirical methods of estimating the number of factors simply will not suffice, as confirmed by simulations reported in the previous paper.
An assessment of differences in gridded precipitation datasets in complex terrain
NASA Astrophysics Data System (ADS)
Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.
2018-01-01
Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare the long-term average spatial patterns, and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr-1 on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.
Picking ChIP-seq peak detectors for analyzing chromatin modification experiments
Micsinai, Mariann; Parisi, Fabio; Strino, Francesco; Asp, Patrik; Dynlacht, Brian D.; Kluger, Yuval
2012-01-01
Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads to identify enriched regions of any length. To objectively assess its performance relative to other 14 ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that typically optimal parameters in one dataset do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development. PMID:22307239
Picking ChIP-seq peak detectors for analyzing chromatin modification experiments.
Micsinai, Mariann; Parisi, Fabio; Strino, Francesco; Asp, Patrik; Dynlacht, Brian D; Kluger, Yuval
2012-05-01
Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads to identify enriched regions of any length. To objectively assess its performance relative to other 14 ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that typically optimal parameters in one dataset do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development.
Scalable learning method for feedforward neural networks using minimal-enclosing-ball approximation.
Wang, Jun; Deng, Zhaohong; Luo, Xiaoqing; Jiang, Yizhang; Wang, Shitong
2016-06-01
Training feedforward neural networks (FNNs) is one of the most critical issues in FNNs studies. However, most FNNs training methods cannot be directly applied for very large datasets because they have high computational and space complexity. In order to tackle this problem, the CCMEB (Center-Constrained Minimum Enclosing Ball) problem in hidden feature space of FNN is discussed and a novel learning algorithm called HFSR-GCVM (hidden-feature-space regression using generalized core vector machine) is developed accordingly. In HFSR-GCVM, a novel learning criterion using L2-norm penalty-based ε-insensitive function is formulated and the parameters in the hidden nodes are generated randomly independent of the training sets. Moreover, the learning of parameters in its output layer is proved equivalent to a special CCMEB problem in FNN hidden feature space. As most CCMEB approximation based machine learning algorithms, the proposed HFSR-GCVM training algorithm has the following merits: The maximal training time of the HFSR-GCVM training is linear with the size of training datasets and the maximal space consumption is independent of the size of training datasets. The experiments on regression tasks confirm the above conclusions. Copyright © 2016 Elsevier Ltd. All rights reserved.
ViA: a perceptual visualization assistant
NASA Astrophysics Data System (ADS)
Healey, Chris G.; St. Amant, Robert; Elhaddad, Mahmoud S.
2000-05-01
This paper describes an automated visualized assistant called ViA. ViA is designed to help users construct perceptually optical visualizations to represent, explore, and analyze large, complex, multidimensional datasets. We have approached this problem by studying what is known about the control of human visual attention. By harnessing the low-level human visual system, we can support our dual goals of rapid and accurate visualization. Perceptual guidelines that we have built using psychophysical experiments form the basis for ViA. ViA uses modified mixed-initiative planning algorithms from artificial intelligence to search of perceptually optical data attribute to visual feature mappings. Our perceptual guidelines are integrated into evaluation engines that provide evaluation weights for a given data-feature mapping, and hints on how that mapping might be improved. ViA begins by asking users a set of simple questions about their dataset and the analysis tasks they want to perform. Answers to these questions are used in combination with the evaluation engines to identify and intelligently pursue promising data-feature mappings. The result is an automatically-generated set of mappings that are perceptually salient, but that also respect the context of the dataset and users' preferences about how they want to visualize their data.
NASA Astrophysics Data System (ADS)
Wyborn, Lesley; Car, Nicholas; Evans, Benjamin; Klump, Jens
2016-04-01
Persistent identifiers in the form of a Digital Object Identifier (DOI) are becoming more mainstream, assigned at both the collection and dataset level. For static datasets, this is a relatively straight-forward matter. However, many new data collections are dynamic, with new data being appended, models and derivative products being revised with new data, or the data itself revised as processing methods are improved. Further, because data collections are becoming accessible as services, researchers can log in and dynamically create user-defined subsets for specific research projects: they also can easily mix and match data from multiple collections, each of which can have a complex history. Inevitably extracts from such dynamic data sets underpin scholarly publications, and this presents new challenges. The National Computational Infrastructure (NCI) has been experiencing and making progress towards addressing these issues. The NCI is large node of the Research Data Services initiative (RDS) of the Australian Government's research infrastructure, which currently makes available over 10 PBytes of priority research collections, ranging from geosciences, geophysics, environment, and climate, through to astronomy, bioinformatics, and social sciences. Data are replicated to, or are produced at, NCI and then processed there to higher-level data products or directly analysed. Individual datasets range from multi-petabyte computational models and large volume raster arrays, down to gigabyte size, ultra-high resolution datasets. To facilitate access, maximise reuse and enable integration across the disciplines, datasets have been organized on a platform called the National Environmental Research Data Interoperability Platform (NERDIP). Combined, the NERDIP data collections form a rich and diverse asset for researchers: their co-location and standardization optimises the value of existing data, and forms a new resource to underpin data-intensive Science. New publication procedures require that a persistent identifier (DOI) be provided for the dataset that underpins the publication. Being able to produce these for data extracts from the NCI data node using only DOIs is proving difficult: preserving a copy of each data extract is not possible due to data scale. A proposal is for researchers to use workflows that capture the provenance of each data extraction, including metadata (e.g., version of the dataset used, the query and time of extraction). In parallel, NCI is now working with the NERDIP dataset providers to ensure that the provenance of data publication is also captured in provenance systems including references to previous versions and a history of data appended or modified. This proposed solution would require an enhancement to new scholarly publication procedures whereby the reference to underlying dataset to a scholarly publication would be the persistent identifier of the provenance workflow that created the data extract. In turn, the provenance workflow would itself link to a series of persistent identifiers that, at a minimum, provide complete dataset production transparency and, if required, would facilitate reconstruction of the dataset. Such a solution will require strict adherence to design patterns for provenance representation to ensure that the provenance representation of the workflow does indeed contain information required to deliver dataset generation transparency and a pathway to reconstruction.
Rawle, Rachel A.; Hamerly, Timothy; Tripet, Brian P.; ...
2017-06-04
Studies of interspecies interactions are inherently difficult due to the complex mechanisms which enable these relationships. A model system for studying interspecies interactions is the marine hyperthermophiles Ignicoccus hospitalis and Nanoarchaeum equitans. Recent independently-conducted ‘omics’ analyses have generated insights into the molecular factors modulating this association. However, significant questions remain about the nature of the interactions between these archaea. We jointly analyzed multiple levels of omics datasets obtained from published, independent transcriptomics, proteomics, and metabolomics analyses. DAVID identified functionally-related groups enriched when I. hospitalis is grown alone or in co-culture with N. equitans. Enriched molecular pathways were subsequently visualized usingmore » interaction maps generated using STRING. Key findings of our multi-level omics analysis indicated that I. hospitalis provides precursors to N. equitans for energy metabolism. Analysis indicated an overall reduction in diversity of metabolic precursors in the I. hospitalis–N. equitans co-culture, which has been connected to the differential use of ribosomal subunits and was previously unnoticed. We also identified differences in precursors linked to amino acid metabolism, NADH metabolism, and carbon fixation, providing new insights into the metabolic adaptions of I. hospitalis enabling the growth of N. equitans. In conclusion, this multi-omics analysis builds upon previously identified cellular patterns while offering new insights into mechanisms that enable the I. hospitalis–N. equitans association. This study applies statistical and visualization techniques to a mixed-source omics dataset to yield a more global insight into a complex system, that was not readily discernable from separate omics studies.« less
Rawle, Rachel A; Hamerly, Timothy; Tripet, Brian P; Giannone, Richard J; Wurch, Louie; Hettich, Robert L; Podar, Mircea; Copié, Valerie; Bothner, Brian
2017-09-01
Studies of interspecies interactions are inherently difficult due to the complex mechanisms which enable these relationships. A model system for studying interspecies interactions is the marine hyperthermophiles Ignicoccus hospitalis and Nanoarchaeum equitans. Recent independently-conducted 'omics' analyses have generated insights into the molecular factors modulating this association. However, significant questions remain about the nature of the interactions between these archaea. We jointly analyzed multiple levels of omics datasets obtained from published, independent transcriptomics, proteomics, and metabolomics analyses. DAVID identified functionally-related groups enriched when I. hospitalis is grown alone or in co-culture with N. equitans. Enriched molecular pathways were subsequently visualized using interaction maps generated using STRING. Key findings of our multi-level omics analysis indicated that I. hospitalis provides precursors to N. equitans for energy metabolism. Analysis indicated an overall reduction in diversity of metabolic precursors in the I. hospitalis-N. equitans co-culture, which has been connected to the differential use of ribosomal subunits and was previously unnoticed. We also identified differences in precursors linked to amino acid metabolism, NADH metabolism, and carbon fixation, providing new insights into the metabolic adaptions of I. hospitalis enabling the growth of N. equitans. This multi-omics analysis builds upon previously identified cellular patterns while offering new insights into mechanisms that enable the I. hospitalis-N. equitans association. Our study applies statistical and visualization techniques to a mixed-source omics dataset to yield a more global insight into a complex system, that was not readily discernable from separate omics studies. Copyright © 2017 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Bizzi, S.; Surridge, B.; Lerner, D. N.:
2009-04-01
River ecosystems represent complex networks of interacting biological, chemical and geomorphological processes. These processes generate spatial and temporal patterns in biological, chemical and geomorphological variables, and a growing number of these variables are now being used to characterise the status of rivers. However, integrated analyses of these biological-chemical-geomorphological networks have rarely been undertaken, and as a result our knowledge of the underlying processes and how they generate the resulting patterns remains weak. The apparent complexity of the networks involved, and the lack of coherent datasets, represent two key challenges to such analyses. In this paper we describe the application of a novel technique, Structural Equation Modelling (SEM), to the investigation of biological, chemical and geomorphological data collected from rivers across England and Wales. The SEM approach is a multivariate statistical technique enabling simultaneous examination of direct and indirect relationships across a network of variables. Further, SEM allows a-priori conceptual or theoretical models to be tested against available data. This is a significant departure from the solely exploratory analyses which characterise other multivariate techniques. We took biological, chemical and river habitat survey data collected by the Environment Agency for 400 sites in rivers spread across England and Wales, and created a single, coherent dataset suitable for SEM analyses. Biological data cover benthic macroinvertebrates, chemical data relate to a range of standard parameters (e.g. BOD, dissolved oxygen and phosphate concentration), and geomorphological data cover factors such as river typology, substrate material and degree of physical modification. We developed a number of a-priori conceptual models, reflecting current research questions or existing knowledge, and tested the ability of these conceptual models to explain the variance and covariance within the dataset. The conceptual models we developed were able to explain correctly the variance and covariance shown by the datasets, proving to be a relevant representation of the processes involved. The models explained 65% of the variance in indices describing benthic macroinvertebrate communities. Dissolved oxygen was of primary importance, but geomorphological factors, including river habitat type and degree of habitat degradation, also had significant explanatory power. The addition of spatial variables, such as latitude or longitude, did not provide additional explanatory power. This suggests that the variables already included in the models effectively represented the eco-regions across which our data were distributed. The models produced new insights into the relative importance of chemical and geomorphological factors for river macroinvertebrate communities. The SEM technique proved a powerful tool for exploring complex biological-chemical-geomorphological networks, for example able to deal with the co-correlations that are common in rivers due to multiple feedback mechanisms.
A Manual Segmentation Tool for Three-Dimensional Neuron Datasets.
Magliaro, Chiara; Callara, Alejandro L; Vanello, Nicola; Ahluwalia, Arti
2017-01-01
To date, automated or semi-automated software and algorithms for segmentation of neurons from three-dimensional imaging datasets have had limited success. The gold standard for neural segmentation is considered to be the manual isolation performed by an expert. To facilitate the manual isolation of complex objects from image stacks, such as neurons in their native arrangement within the brain, a new Manual Segmentation Tool (ManSegTool) has been developed. ManSegTool allows user to load an image stack, scroll down the images and to manually draw the structures of interest stack-by-stack. Users can eliminate unwanted regions or split structures (i.e., branches from different neurons that are too close each other, but, to the experienced eye, clearly belong to a unique cell), to view the object in 3D and save the results obtained. The tool can be used for testing the performance of a single-neuron segmentation algorithm or to extract complex objects, where the available automated methods still fail. Here we describe the software's main features and then show an example of how ManSegTool can be used to segment neuron images acquired using a confocal microscope. In particular, expert neuroscientists were asked to segment different neurons from which morphometric variables were subsequently extracted as a benchmark for precision. In addition, a literature-defined index for evaluating the goodness of segmentation was used as a benchmark for accuracy. Neocortical layer axons from a DIADEM challenge dataset were also segmented with ManSegTool and compared with the manual "gold-standard" generated for the competition.
Farseer-NMR: automatic treatment, analysis and plotting of large, multi-variable NMR data.
Teixeira, João M C; Skinner, Simon P; Arbesú, Miguel; Breeze, Alexander L; Pons, Miquel
2018-05-11
We present Farseer-NMR ( https://git.io/vAueU ), a software package to treat, evaluate and combine NMR spectroscopic data from sets of protein-derived peaklists covering a range of experimental conditions. The combined advances in NMR and molecular biology enable the study of complex biomolecular systems such as flexible proteins or large multibody complexes, which display a strong and functionally relevant response to their environmental conditions, e.g. the presence of ligands, site-directed mutations, post translational modifications, molecular crowders or the chemical composition of the solution. These advances have created a growing need to analyse those systems' responses to multiple variables. The combined analysis of NMR peaklists from large and multivariable datasets has become a new bottleneck in the NMR analysis pipeline, whereby information-rich NMR-derived parameters have to be manually generated, which can be tedious, repetitive and prone to human error, or even unfeasible for very large datasets. There is a persistent gap in the development and distribution of software focused on peaklist treatment, analysis and representation, and specifically able to handle large multivariable datasets, which are becoming more commonplace. In this regard, Farseer-NMR aims to close this longstanding gap in the automated NMR user pipeline and, altogether, reduce the time burden of analysis of large sets of peaklists from days/weeks to seconds/minutes. We have implemented some of the most common, as well as new, routines for calculation of NMR parameters and several publication-quality plotting templates to improve NMR data representation. Farseer-NMR has been written entirely in Python and its modular code base enables facile extension.
Schwartz, Rachel S; Mueller, Rachel L
2010-01-11
Estimates of divergence dates between species improve our understanding of processes ranging from nucleotide substitution to speciation. Such estimates are frequently based on molecular genetic differences between species; therefore, they rely on accurate estimates of the number of such differences (i.e. substitutions per site, measured as branch length on phylogenies). We used simulations to determine the effects of dataset size, branch length heterogeneity, branch depth, and analytical framework on branch length estimation across a range of branch lengths. We then reanalyzed an empirical dataset for plethodontid salamanders to determine how inaccurate branch length estimation can affect estimates of divergence dates. The accuracy of branch length estimation varied with branch length, dataset size (both number of taxa and sites), branch length heterogeneity, branch depth, dataset complexity, and analytical framework. For simple phylogenies analyzed in a Bayesian framework, branches were increasingly underestimated as branch length increased; in a maximum likelihood framework, longer branch lengths were somewhat overestimated. Longer datasets improved estimates in both frameworks; however, when the number of taxa was increased, estimation accuracy for deeper branches was less than for tip branches. Increasing the complexity of the dataset produced more misestimated branches in a Bayesian framework; however, in an ML framework, more branches were estimated more accurately. Using ML branch length estimates to re-estimate plethodontid salamander divergence dates generally resulted in an increase in the estimated age of older nodes and a decrease in the estimated age of younger nodes. Branch lengths are misestimated in both statistical frameworks for simulations of simple datasets. However, for complex datasets, length estimates are quite accurate in ML (even for short datasets), whereas few branches are estimated accurately in a Bayesian framework. Our reanalysis of empirical data demonstrates the magnitude of effects of Bayesian branch length misestimation on divergence date estimates. Because the length of branches for empirical datasets can be estimated most reliably in an ML framework when branches are <1 substitution/site and datasets are > or =1 kb, we suggest that divergence date estimates using datasets, branch lengths, and/or analytical techniques that fall outside of these parameters should be interpreted with caution.
SPANG: a SPARQL client supporting generation and reuse of queries for distributed RDF databases.
Chiba, Hirokazu; Uchiyama, Ikuo
2017-02-08
Toward improved interoperability of distributed biological databases, an increasing number of datasets have been published in the standardized Resource Description Framework (RDF). Although the powerful SPARQL Protocol and RDF Query Language (SPARQL) provides a basis for exploiting RDF databases, writing SPARQL code is burdensome for users including bioinformaticians. Thus, an easy-to-use interface is necessary. We developed SPANG, a SPARQL client that has unique features for querying RDF datasets. SPANG dynamically generates typical SPARQL queries according to specified arguments. It can also call SPARQL template libraries constructed in a local system or published on the Web. Further, it enables combinatorial execution of multiple queries, each with a distinct target database. These features facilitate easy and effective access to RDF datasets and integrative analysis of distributed data. SPANG helps users to exploit RDF datasets by generation and reuse of SPARQL queries through a simple interface. This client will enhance integrative exploitation of biological RDF datasets distributed across the Web. This software package is freely available at http://purl.org/net/spang .
Serial femtosecond crystallography datasets from G protein-coupled receptors
White, Thomas A.; Barty, Anton; Liu, Wei; Ishchenko, Andrii; Zhang, Haitao; Gati, Cornelius; Zatsepin, Nadia A.; Basu, Shibom; Oberthür, Dominik; Metz, Markus; Beyerlein, Kenneth R.; Yoon, Chun Hong; Yefanov, Oleksandr M.; James, Daniel; Wang, Dingjie; Messerschmidt, Marc; Koglin, Jason E.; Boutet, Sébastien; Weierstall, Uwe; Cherezov, Vadim
2016-01-01
We describe the deposition of four datasets consisting of X-ray diffraction images acquired using serial femtosecond crystallography experiments on microcrystals of human G protein-coupled receptors, grown and delivered in lipidic cubic phase, at the Linac Coherent Light Source. The receptors are: the human serotonin receptor 2B in complex with an agonist ergotamine, the human δ-opioid receptor in complex with a bi-functional peptide ligand DIPP-NH2, the human smoothened receptor in complex with an antagonist cyclopamine, and finally the human angiotensin II type 1 receptor in complex with the selective antagonist ZD7155. All four datasets have been deposited, with minimal processing, in an HDF5-based file format, which can be used directly for crystallographic processing with CrystFEL or other software. We have provided processing scripts and supporting files for recent versions of CrystFEL, which can be used to validate the data. PMID:27479354
Serial femtosecond crystallography datasets from G protein-coupled receptors.
White, Thomas A; Barty, Anton; Liu, Wei; Ishchenko, Andrii; Zhang, Haitao; Gati, Cornelius; Zatsepin, Nadia A; Basu, Shibom; Oberthür, Dominik; Metz, Markus; Beyerlein, Kenneth R; Yoon, Chun Hong; Yefanov, Oleksandr M; James, Daniel; Wang, Dingjie; Messerschmidt, Marc; Koglin, Jason E; Boutet, Sébastien; Weierstall, Uwe; Cherezov, Vadim
2016-08-01
We describe the deposition of four datasets consisting of X-ray diffraction images acquired using serial femtosecond crystallography experiments on microcrystals of human G protein-coupled receptors, grown and delivered in lipidic cubic phase, at the Linac Coherent Light Source. The receptors are: the human serotonin receptor 2B in complex with an agonist ergotamine, the human δ-opioid receptor in complex with a bi-functional peptide ligand DIPP-NH2, the human smoothened receptor in complex with an antagonist cyclopamine, and finally the human angiotensin II type 1 receptor in complex with the selective antagonist ZD7155. All four datasets have been deposited, with minimal processing, in an HDF5-based file format, which can be used directly for crystallographic processing with CrystFEL or other software. We have provided processing scripts and supporting files for recent versions of CrystFEL, which can be used to validate the data.
SPICE: exploration and analysis of post-cytometric complex multivariate datasets.
Roederer, Mario; Nozzi, Joshua L; Nason, Martha C
2011-02-01
Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.
Microreact: visualizing and sharing data for genomic epidemiology and phylogeography
Argimón, Silvia; Abudahab, Khalil; Goater, Richard J. E.; Fedosejev, Artemij; Bhai, Jyothish; Glasner, Corinna; Feil, Edward J.; Holden, Matthew T. G.; Yeats, Corin A.; Grundmann, Hajo; Spratt, Brian G.
2016-01-01
Visualization is frequently used to aid our interpretation of complex datasets. Within microbial genomics, visualizing the relationships between multiple genomes as a tree provides a framework onto which associated data (geographical, temporal, phenotypic and epidemiological) are added to generate hypotheses and to explore the dynamics of the system under investigation. Selected static images are then used within publications to highlight the key findings to a wider audience. However, these images are a very inadequate way of exploring and interpreting the richness of the data. There is, therefore, a need for flexible, interactive software that presents the population genomic outputs and associated data in a user-friendly manner for a wide range of end users, from trained bioinformaticians to front-line epidemiologists and health workers. Here, we present Microreact, a web application for the easy visualization of datasets consisting of any combination of trees, geographical, temporal and associated metadata. Data files can be uploaded to Microreact directly via the web browser or by linking to their location (e.g. from Google Drive/Dropbox or via API), and an integrated visualization via trees, maps, timelines and tables provides interactive querying of the data. The visualization can be shared as a permanent web link among collaborators, or embedded within publications to enable readers to explore and download the data. Microreact can act as an end point for any tool or bioinformatic pipeline that ultimately generates a tree, and provides a simple, yet powerful, visualization method that will aid research and discovery and the open sharing of datasets. PMID:28348833
USDA-ARS?s Scientific Manuscript database
The methods of Sharkey and Gu for estimating the eight parameters of the Farquhar-von Caemmerer-Berry (FvBC) model were examined using generated photosynthesis versus intercellular carbon dioxide concentration (A/Ci) datasets. The generated datasets included data with (A) high accuracy, (B) normal ...
Teshima, Tara Lynn; Patel, Vaibhav; Mainprize, James G; Edwards, Glenn; Antonyshyn, Oleh M
2015-07-01
The utilization of three-dimensional modeling technology in craniomaxillofacial surgery has grown exponentially during the last decade. Future development, however, is hindered by the lack of a normative three-dimensional anatomic dataset and a statistical mean three-dimensional virtual model. The purpose of this study is to develop and validate a protocol to generate a statistical three-dimensional virtual model based on a normative dataset of adult skulls. Two hundred adult skull CT images were reviewed. The average three-dimensional skull was computed by processing each CT image in the series using thin-plate spline geometric morphometric protocol. Our statistical average three-dimensional skull was validated by reconstructing patient-specific topography in cranial defects. The experiment was repeated 4 times. In each case, computer-generated cranioplasties were compared directly to the original intact skull. The errors describing the difference between the prediction and the original were calculated. A normative database of 33 adult human skulls was collected. Using 21 anthropometric landmark points, a protocol for three-dimensional skull landmarking and data reduction was developed and a statistical average three-dimensional skull was generated. Our results show the root mean square error (RMSE) for restoration of a known defect using the native best match skull, our statistical average skull, and worst match skull was 0.58, 0.74, and 4.4 mm, respectively. The ability to statistically average craniofacial surface topography will be a valuable instrument for deriving missing anatomy in complex craniofacial defects and deficiencies as well as in evaluating morphologic results of surgery.
Artificial Neural Networks: an overview and their use in the analysis of the AMPHORA-3 dataset.
Buscema, Paolo Massimo; Massini, Giulia; Maurelli, Guido
2014-10-01
The Artificial Adaptive Systems (AAS) are theories with which generative algebras are able to create artificial models simulating natural phenomenon. Artificial Neural Networks (ANNs) are the more diffused and best-known learning system models in the AAS. This article describes an overview of ANNs, noting its advantages and limitations for analyzing dynamic, complex, non-linear, multidimensional processes. An example of a specific ANN application to alcohol consumption in Spain, as part of the EU AMPHORA-3 project, during 1961-2006 is presented. Study's limitations are noted and future needed research using ANN methodologies are suggested.
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E.
2011-12-01
Under several NASA grants, we are generating multi-sensor merged atmospheric datasets to enable the detection of instrument biases and studies of climate trends over decades of data. For example, under a NASA MEASURES grant we are producing a water vapor climatology from the A-Train instruments, stratified by the Cloudsat cloud classification for each geophysical scene. The generation and proper use of such multi-sensor climate data records (CDR's) requires a high level of openness, transparency, and traceability. To make the datasets self-documenting and provide access to full metadata and traceability, we have implemented a set of capabilities and services using known, interoperable protocols. These protocols include OpenSearch, OPeNDAP, Open Provenance Model, service & data casting technologies using Atom feeds, and REST-callable analysis workflows implemented as SciFlo (XML) documents. We advocate that our approach can serve as a blueprint for how to openly "document and serve" complex, multi-sensor CDR's with full traceability. The capabilities and services provided include: - Discovery of the collections by keyword search, exposed using OpenSearch protocol; - Space/time query across the CDR's granules and all of the input datasets via OpenSearch; - User-level configuration of the production workflows so that scientists can select additional physical variables from the A-Train to add to the next iteration of the merged datasets; - Efficient data merging using on-the-fly OPeNDAP variable slicing & spatial subsetting of data out of input netCDF and HDF files (without moving the entire files); - Self-documenting CDR's published in a highly usable netCDF4 format with groups used to organize the variables, CF-style attributes for each variable, numeric array compression, & links to OPM provenance; - Recording of processing provenance and data lineage into a query-able provenance trail in Open Provenance Model (OPM) format, auto-captured by the workflow engine; - Open Publishing of all of the workflows used to generate products as machine-callable REST web services, using the capabilities of the SciFlo workflow engine; - Advertising of the metadata (e.g. physical variables provided, space/time bounding box, etc.) for our prepared datasets as "datacasts" using the Atom feed format; - Publishing of all datasets via our "DataDrop" service, which exploits the WebDAV protocol to enable scientists to access remote data directories as local files on their laptops; - Rich "web browse" of the CDR's with full metadata and the provenance trail one click away; - Advertising of all services as Google-discoverable "service casts" using the Atom format. The presentation will describe our use of the interoperable protocols and demonstrate the capabilities and service GUI's.
Pascolutti, Mauro; Campitelli, Marc; Nguyen, Bao; Pham, Ngoc; Gorse, Alain-Dominique; Quinn, Ronald J.
2015-01-01
Natural products are universally recognized to contribute valuable chemical diversity to the design of molecular screening libraries. The analysis undertaken in this work, provides a foundation for the generation of fragment screening libraries that capture the diverse range of molecular recognition building blocks embedded within natural products. Physicochemical properties were used to select fragment-sized natural products from a database of known natural products (Dictionary of Natural Products). PCA analysis was used to illustrate the positioning of the fragment subset within the property space of the non-fragment sized natural products in the dataset. Structural diversity was analysed by three distinct methods: atom function analysis, using pharmacophore fingerprints, atom type analysis, using radial fingerprints, and scaffold analysis. Small pharmacophore triplets, representing the range of chemical features present in natural products that are capable of engaging in molecular interactions with small, contiguous areas of protein binding surfaces, were analysed. We demonstrate that fragment-sized natural products capture more than half of the small pharmacophore triplet diversity observed in non fragment-sized natural product datasets. Atom type analysis using radial fingerprints was represented by a self-organizing map. We examined the structural diversity of non-flat fragment-sized natural product scaffolds, rich in sp3 configured centres. From these results we demonstrate that 2-ring fragment-sized natural products effectively balance the opposing characteristics of minimal complexity and broad structural diversity when compared to the larger, more complex fragment-like natural products. These naturally-derived fragments could be used as the starting point for the generation of a highly diverse library with the scope for further medicinal chemistry elaboration due to their minimal structural complexity. This study highlights the possibility to capture a high proportion of the individual molecular interaction motifs embedded within natural products using a fragment screening library spanning 422 structural clusters and comprised of approximately 2800 natural products. PMID:25902039
Zhao, Min; Li, XiaoMo; Qu, Hong
2013-12-01
Eating disorder is a group of physiological and psychological disorders affecting approximately 1% of the female population worldwide. Although the genetic epidemiology of eating disorder is becoming increasingly clear with accumulated studies, the underlying molecular mechanisms are still unclear. Recently, integration of various high-throughput data expanded the range of candidate genes and started to generate hypotheses for understanding potential pathogenesis in complex diseases. This article presents EDdb (Eating Disorder database), the first evidence-based gene resource for eating disorder. Fifty-nine experimentally validated genes from the literature in relation to eating disorder were collected as the core dataset. Another four datasets with 2824 candidate genes across 601 genome regions were expanded based on the core dataset using different criteria (e.g., protein-protein interactions, shared cytobands, and related complex diseases). Based on human protein-protein interaction data, we reconstructed a potential molecular sub-network related to eating disorder. Furthermore, with an integrative pathway enrichment analysis of genes in EDdb, we identified an extended adipocytokine signaling pathway in eating disorder. Three genes in EDdb (ADIPO (adiponectin), TNF (tumor necrosis factor) and NR3C1 (nuclear receptor subfamily 3, group C, member 1)) link the KEGG (Kyoto Encyclopedia of Genes and Genomes) "adipocytokine signaling pathway" with the BioCarta "visceral fat deposits and the metabolic syndrome" pathway to form a joint pathway. In total, the joint pathway contains 43 genes, among which 39 genes are related to eating disorder. As the first comprehensive gene resource for eating disorder, EDdb ( http://eddb.cbi.pku.edu.cn ) enables the exploration of gene-disease relationships and cross-talk mechanisms between related disorders. Through pathway statistical studies, we revealed that abnormal body weight caused by eating disorder and obesity may both be related to dysregulation of the novel joint pathway of adipocytokine signaling. In addition, this joint pathway may be the common pathway for body weight regulation in complex human diseases related to unhealthy lifestyle.
Unsupervised data mining in nanoscale x-ray spectro-microscopic study of NdFeB magnet
DOE Office of Scientific and Technical Information (OSTI.GOV)
Duan, Xiaoyue; Yang, Feifei; Antono, Erin
Novel developments in X-ray based spectro-microscopic characterization techniques have increased the rate of acquisition of spatially resolved spectroscopic data by several orders of magnitude over what was possible a few years ago. This accelerated data acquisition, with high spatial resolution at nanoscale and sensitivity to subtle differences in chemistry and atomic structure, provides a unique opportunity to investigate hierarchically complex and structurally heterogeneous systems found in functional devices and materials systems. However, handling and analyzing the large volume data generated poses significant challenges. Here we apply an unsupervised data-mining algorithm known as DBSCAN to study a rare-earth element based permanentmore » magnet material, Nd 2Fe 14B. We are able to reduce a large spectro-microscopic dataset of over 300,000 spectra to 3, preserving much of the underlying information. Scientists can easily and quickly analyze in detail three characteristic spectra. Our approach can rapidly provide a concise representation of a large and complex dataset to materials scientists and chemists. For instance, it shows that the surface of common Nd 2Fe 14B magnet is chemically and structurally very different from the bulk, suggesting a possible surface alteration effect possibly due to the corrosion, which could affect the material’s overall properties.« less
Unsupervised data mining in nanoscale x-ray spectro-microscopic study of NdFeB magnet
Duan, Xiaoyue; Yang, Feifei; Antono, Erin; ...
2016-09-29
Novel developments in X-ray based spectro-microscopic characterization techniques have increased the rate of acquisition of spatially resolved spectroscopic data by several orders of magnitude over what was possible a few years ago. This accelerated data acquisition, with high spatial resolution at nanoscale and sensitivity to subtle differences in chemistry and atomic structure, provides a unique opportunity to investigate hierarchically complex and structurally heterogeneous systems found in functional devices and materials systems. However, handling and analyzing the large volume data generated poses significant challenges. Here we apply an unsupervised data-mining algorithm known as DBSCAN to study a rare-earth element based permanentmore » magnet material, Nd 2Fe 14B. We are able to reduce a large spectro-microscopic dataset of over 300,000 spectra to 3, preserving much of the underlying information. Scientists can easily and quickly analyze in detail three characteristic spectra. Our approach can rapidly provide a concise representation of a large and complex dataset to materials scientists and chemists. For instance, it shows that the surface of common Nd 2Fe 14B magnet is chemically and structurally very different from the bulk, suggesting a possible surface alteration effect possibly due to the corrosion, which could affect the material’s overall properties.« less
Susceptibility-based functional brain mapping by 3D deconvolution of an MR-phase activation map.
Chen, Zikuan; Liu, Jingyu; Calhoun, Vince D
2013-05-30
The underlying source of T2*-weighted magnetic resonance imaging (T2*MRI) for brain imaging is magnetic susceptibility (denoted by χ). T2*MRI outputs a complex-valued MR image consisting of magnitude and phase information. Recent research has shown that both the magnitude and the phase images are morphologically different from the source χ, primarily due to 3D convolution, and that the source χ can be reconstructed from complex MR images by computed inverse MRI (CIMRI). Thus, we can obtain a 4D χ dataset from a complex 4D MR dataset acquired from a brain functional MRI study by repeating CIMRI to reconstruct 3D χ volumes at each timepoint. Because the reconstructed χ is a more direct representation of neuronal activity than the MR image, we propose a method for χ-based functional brain mapping, which is numerically characterised by a temporal correlation map of χ responses to a stimulant task. Under the linear imaging conditions used for T2*MRI, we show that the χ activation map can be calculated from the MR phase map by CIMRI. We validate our approach using numerical simulations and Gd-phantom experiments. We also analyse real data from a finger-tapping visuomotor experiment and show that the χ-based functional mapping provides additional activation details (in the form of positive and negative correlation patterns) beyond those generated by conventional MR-magnitude-based mapping. Copyright © 2013 Elsevier B.V. All rights reserved.
Simulation of Smart Home Activity Datasets
Synnott, Jonathan; Nugent, Chris; Jeffers, Paul
2015-01-01
A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendation for future work in intelligent environment simulation. PMID:26087371
Simulation of Smart Home Activity Datasets.
Synnott, Jonathan; Nugent, Chris; Jeffers, Paul
2015-06-16
A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendation for future work in intelligent environment simulation.
Paiton, Dylan M.; Kenyon, Garrett T.; Brumby, Steven P.; Schultz, Peter F.; George, John S.
2015-07-28
An approach to detecting objects in an image dataset may combine texture/color detection, shape/contour detection, and/or motion detection using sparse, generative, hierarchical models with lateral and top-down connections. A first independent representation of objects in an image dataset may be produced using a color/texture detection algorithm. A second independent representation of objects in the image dataset may be produced using a shape/contour detection algorithm. A third independent representation of objects in the image dataset may be produced using a motion detection algorithm. The first, second, and third independent representations may then be combined into a single coherent output using a combinatorial algorithm.
Zinkgraf, Matthew; Liu, Lijun; Groover, Andrew; Filkov, Vladimir
2017-06-01
Trees modify wood formation through integration of environmental and developmental signals in complex but poorly defined transcriptional networks, allowing trees to produce woody tissues appropriate to diverse environmental conditions. In order to identify relationships among genes expressed during wood formation, we integrated data from new and publically available datasets in Populus. These datasets were generated from woody tissue and include transcriptome profiling, transcription factor binding, DNA accessibility and genome-wide association mapping experiments. Coexpression modules were calculated, each of which contains genes showing similar expression patterns across experimental conditions, genotypes and treatments. Conserved gene coexpression modules (four modules totaling 8398 genes) were identified that were highly preserved across diverse environmental conditions and genetic backgrounds. Functional annotations as well as correlations with specific experimental treatments associated individual conserved modules with distinct biological processes underlying wood formation, such as cell-wall biosynthesis, meristem development and epigenetic pathways. Module genes were also enriched for DNase I hypersensitivity footprints and binding from four transcription factors associated with wood formation. The conserved modules are excellent candidates for modeling core developmental pathways common to wood formation in diverse environments and genotypes, and serve as testbeds for hypothesis generation and testing for future studies. No claim to original US government works. New Phytologist © 2017 New Phytologist Trust.
GTM-Based QSAR Models and Their Applicability Domains.
Gaspar, H A; Baskin, I I; Marcou, G; Horvath, D; Varnek, A
2015-06-01
In this paper we demonstrate that Generative Topographic Mapping (GTM), a machine learning method traditionally used for data visualisation, can be efficiently applied to QSAR modelling using probability distribution functions (PDF) computed in the latent 2-dimensional space. Several different scenarios of the activity assessment were considered: (i) the "activity landscape" approach based on direct use of PDF, (ii) QSAR models involving GTM-generated on descriptors derived from PDF, and, (iii) the k-Nearest Neighbours approach in 2D latent space. Benchmarking calculations were performed on five different datasets: stability constants of metal cations Ca(2+) , Gd(3+) and Lu(3+) complexes with organic ligands in water, aqueous solubility and activity of thrombin inhibitors. It has been shown that the performance of GTM-based regression models is similar to that obtained with some popular machine-learning methods (random forest, k-NN, M5P regression tree and PLS) and ISIDA fragment descriptors. By comparing GTM activity landscapes built both on predicted and experimental activities, we may visually assess the model's performance and identify the areas in the chemical space corresponding to reliable predictions. The applicability domain used in this work is based on data likelihood. Its application has significantly improved the model performances for 4 out of 5 datasets. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
openBIS: a flexible framework for managing and analyzing complex data in biology research
2011-01-01
Background Modern data generation techniques used in distributed systems biology research projects often create datasets of enormous size and diversity. We argue that in order to overcome the challenge of managing those large quantitative datasets and maximise the biological information extracted from them, a sound information system is required. Ease of integration with data analysis pipelines and other computational tools is a key requirement for it. Results We have developed openBIS, an open source software framework for constructing user-friendly, scalable and powerful information systems for data and metadata acquired in biological experiments. openBIS enables users to collect, integrate, share, publish data and to connect to data processing pipelines. This framework can be extended and has been customized for different data types acquired by a range of technologies. Conclusions openBIS is currently being used by several SystemsX.ch and EU projects applying mass spectrometric measurements of metabolites and proteins, High Content Screening, or Next Generation Sequencing technologies. The attributes that make it interesting to a large research community involved in systems biology projects include versatility, simplicity in deployment, scalability to very large data, flexibility to handle any biological data type and extensibility to the needs of any research domain. PMID:22151573
Nesvizhskii, Alexey I.
2013-01-01
Analysis of protein interaction networks and protein complexes using affinity purification and mass spectrometry (AP/MS) is among most commonly used and successful applications of proteomics technologies. One of the foremost challenges of AP/MS data is a large number of false positive protein interactions present in unfiltered datasets. Here we review computational and informatics strategies for detecting specific protein interaction partners in AP/MS experiments, with a focus on incomplete (as opposite to genome-wide) interactome mapping studies. These strategies range from standard statistical approaches, to empirical scoring schemes optimized for a particular type of data, to advanced computational frameworks. The common denominator among these methods is the use of label-free quantitative information such as spectral counts or integrated peptide intensities that can be extracted from AP/MS data. We also discuss related issues such as combining multiple biological or technical replicates, and dealing with data generated using different tagging strategies. Computational approaches for benchmarking of scoring methods are discussed, and the need for generation of reference AP/MS datasets is highlighted. Finally, we discuss the possibility of more extended modeling of experimental AP/MS data, including integration with external information such as protein interaction predictions based on functional genomics data. PMID:22611043
Investigating Bacterial-Animal Symbioses with Light Sheet Microscopy
Taormina, Michael J.; Jemielita, Matthew; Stephens, W. Zac; Burns, Adam R.; Troll, Joshua V.; Parthasarathy, Raghuveer; Guillemin, Karen
2014-01-01
SUMMARY Microbial colonization of the digestive tract is a crucial event in vertebrate development, required for maturation of host immunity and establishment of normal digestive physiology. Advances in genomic, proteomic, and metabolomic technologies are providing a more detailed picture of the constituents of the intestinal habitat, but these approaches lack the spatial and temporal resolution needed to characterize the assembly and dynamics of microbial communities in this complex environment. We report the use of light sheet microscopy to provide high resolution imaging of bacterial colonization of the zebrafish intestine. The methodology allows us to characterize bacterial population dynamics across the entire organ and the behaviors of individual bacterial and host cells throughout the colonization process. The large four-dimensional datasets generated by these imaging approaches require new strategies for image analysis. When integrated with other “omics” datasets, information about the spatial and temporal dynamics of microbial cells within the vertebrate intestine will provide new mechanistic insights into how microbial communities assemble and function within hosts. PMID:22983029
ASSISTments Dataset from Multiple Randomized Controlled Experiments
ERIC Educational Resources Information Center
Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil
2016-01-01
In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…
Comparative analysis of existing models for power-grid synchronization
NASA Astrophysics Data System (ADS)
Nishikawa, Takashi; Motter, Adilson E.
2015-01-01
The dynamics of power-grid networks is becoming an increasingly active area of research within the physics and network science communities. The results from such studies are typically insightful and illustrative, but are often based on simplifying assumptions that can be either difficult to assess or not fully justified for realistic applications. Here we perform a comprehensive comparative analysis of three leading models recently used to study synchronization dynamics in power-grid networks—a fundamental problem of practical significance given that frequency synchronization of all power generators in the same interconnection is a necessary condition for a power grid to operate. We show that each of these models can be derived from first principles within a common framework based on the classical model of a generator, thereby clarifying all assumptions involved. This framework allows us to view power grids as complex networks of coupled second-order phase oscillators with both forcing and damping terms. Using simple illustrative examples, test systems, and real power-grid datasets, we study the inherent frequencies of the oscillators as well as their coupling structure, comparing across the different models. We demonstrate, in particular, that if the network structure is not homogeneous, generators with identical parameters need to be modeled as non-identical oscillators in general. We also discuss an approach to estimate the required (dynamical) system parameters that are unavailable in typical power-grid datasets, their use for computing the constants of each of the three models, and an open-source MATLAB toolbox that we provide for these computations.
3D shape recovery from image focus using gray level co-occurrence matrix
NASA Astrophysics Data System (ADS)
Mahmood, Fahad; Munir, Umair; Mehmood, Fahad; Iqbal, Javaid
2018-04-01
Recovering a precise and accurate 3-D shape of the target object utilizing robust 3-D shape recovery algorithm is an ultimate objective of computer vision community. Focus measure algorithm plays an important role in this architecture which convert the color values of each pixel of the acquired 2-D image dataset into corresponding focus values. After convolving the focus measure filter with the input 2-D image dataset, a 3-D shape recovery approach is applied which will recover the depth map. In this document, we are concerned with proposing Gray Level Co-occurrence Matrix along with its statistical features for computing the focus information of the image dataset. The Gray Level Co-occurrence Matrix quantifies the texture present in the image using statistical features and then applies joint probability distributive function of the gray level pairs of the input image. Finally, we quantify the focus value of the input image using Gaussian Mixture Model. Due to its little computational complexity, sharp focus measure curve, robust to random noise sources and accuracy, it is considered as superior alternative to most of recently proposed 3-D shape recovery approaches. This algorithm is deeply investigated on real image sequences and synthetic image dataset. The efficiency of the proposed scheme is also compared with the state of art 3-D shape recovery approaches. Finally, by means of two global statistical measures, root mean square error and correlation, we claim that this approach -in spite of simplicity generates accurate results.
ERIC Educational Resources Information Center
Grenville-Briggs, Laura J.; Stansfield, Ian
2011-01-01
This report describes a linked series of Masters-level computer practical workshops. They comprise an advanced functional genomics investigation, based upon analysis of a microarray dataset probing yeast DNA damage responses. The workshops require the students to analyse highly complex transcriptomics datasets, and were designed to stimulate…
Saeed, Mohammad
2017-05-01
Systemic lupus erythematosus (SLE) is a complex disorder. Genetic association studies of complex disorders suffer from the following three major issues: phenotypic heterogeneity, false positive (type I error), and false negative (type II error) results. Hence, genes with low to moderate effects are missed in standard analyses, especially after statistical corrections. OASIS is a novel linkage disequilibrium clustering algorithm that can potentially address false positives and negatives in genome-wide association studies (GWAS) of complex disorders such as SLE. OASIS was applied to two SLE dbGAP GWAS datasets (6077 subjects; ∼0.75 million single-nucleotide polymorphisms). OASIS identified three known SLE genes viz. IFIH1, TNIP1, and CD44, not previously reported using these GWAS datasets. In addition, 22 novel loci for SLE were identified and the 5 SLE genes previously reported using these datasets were verified. OASIS methodology was validated using single-variant replication and gene-based analysis with GATES. This led to the verification of 60% of OASIS loci. New SLE genes that OASIS identified and were further verified include TNFAIP6, DNAJB3, TTF1, GRIN2B, MON2, LATS2, SNX6, RBFOX1, NCOA3, and CHAF1B. This study presents the OASIS algorithm, software, and the meta-analyses of two publicly available SLE GWAS datasets along with the novel SLE genes. Hence, OASIS is a novel linkage disequilibrium clustering method that can be universally applied to existing GWAS datasets for the identification of new genes.
CIFAR10-DVS: An Event-Stream Dataset for Object Classification
Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping
2017-01-01
Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task, which needs to be recorded using the neuromorphic cameras. Currently, there are limited event-stream datasets available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named as “CIFAR10-DVS.” The conversion of event-stream dataset was implemented by a repeated closed-loop smooth (RCLS) movement of frame-based images. Unlike the conversion of frame-based images by moving the camera, the image movement is more realistic in respect of its practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm developments in even-driven pattern recognition and object classification. PMID:28611582
CIFAR10-DVS: An Event-Stream Dataset for Object Classification.
Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping
2017-01-01
Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task, which needs to be recorded using the neuromorphic cameras. Currently, there are limited event-stream datasets available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named as "CIFAR10-DVS." The conversion of event-stream dataset was implemented by a repeated closed-loop smooth (RCLS) movement of frame-based images. Unlike the conversion of frame-based images by moving the camera, the image movement is more realistic in respect of its practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm developments in even-driven pattern recognition and object classification.
Classification of time series patterns from complex dynamic systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schryver, J.C.; Rao, N.
1998-07-01
An increasing availability of high-performance computing and data storage media at decreasing cost is making possible the proliferation of large-scale numerical databases and data warehouses. Numeric warehousing enterprises on the order of hundreds of gigabytes to terabytes are a reality in many fields such as finance, retail sales, process systems monitoring, biomedical monitoring, surveillance and transportation. Large-scale databases are becoming more accessible to larger user communities through the internet, web-based applications and database connectivity. Consequently, most researchers now have access to a variety of massive datasets. This trend will probably only continue to grow over the next several years. Unfortunately,more » the availability of integrated tools to explore, analyze and understand the data warehoused in these archives is lagging far behind the ability to gain access to the same data. In particular, locating and identifying patterns of interest in numerical time series data is an increasingly important problem for which there are few available techniques. Temporal pattern recognition poses many interesting problems in classification, segmentation, prediction, diagnosis and anomaly detection. This research focuses on the problem of classification or characterization of numerical time series data. Highway vehicles and their drivers are examples of complex dynamic systems (CDS) which are being used by transportation agencies for field testing to generate large-scale time series datasets. Tools for effective analysis of numerical time series in databases generated by highway vehicle systems are not yet available, or have not been adapted to the target problem domain. However, analysis tools from similar domains may be adapted to the problem of classification of numerical time series data.« less
Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study
2015-01-01
Objective This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are “invisible” or not deposited in a known repository. Methods We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article. Results About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects. Conclusion In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a “dataset,” determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets. PMID:26207759
Exploring Genetic Divergence in a Species-Rich Insect Genus Using 2790 DNA Barcodes
Lin, Xiaolong; Stur, Elisabeth; Ekrem, Torbjørn
2015-01-01
DNA barcoding using a fragment of the mitochondrial cytochrome c oxidase subunit 1 gene (COI) has proven to be successful for species-level identification in many animal groups. However, most studies have been focused on relatively small datasets or on large datasets of taxonomically high-ranked groups. We explore the quality of DNA barcodes to delimit species in the diverse chironomid genus Tanytarsus (Diptera: Chironomidae) by using different analytical tools. The genus Tanytarsus is the most species-rich taxon of tribe Tanytarsini (Diptera: Chironomidae) with more than 400 species worldwide, some of which can be notoriously difficult to identify to species-level using morphology. Our dataset, based on sequences generated from own material and publicly available data in BOLD, consist of 2790 DNA barcodes with a fragment length of at least 500 base pairs. A neighbor joining tree of this dataset comprises 131 well separated clusters representing 121 morphological species of Tanytarsus: 77 named, 16 unnamed and 28 unidentified theoretical species. For our geographically widespread dataset, DNA barcodes unambiguously discriminate 94.6% of the Tanytarsus species recognized through prior morphological study. Deep intraspecific divergences exist in some species complexes, and need further taxonomic studies using appropriate nuclear markers as well as morphological and ecological data to be resolved. The DNA barcodes cluster into 120–242 molecular operational taxonomic units (OTUs) depending on whether Objective Clustering, Automatic Barcode Gap Discovery (ABGD), Generalized Mixed Yule Coalescent model (GMYC), Poisson Tree Process (PTP), subjective evaluation of the neighbor joining tree or Barcode Index Numbers (BINs) are used. We suggest that a 4–5% threshold is appropriate to delineate species of Tanytarsus non-biting midges. PMID:26406595
Realistic computer network simulation for network intrusion detection dataset generation
NASA Astrophysics Data System (ADS)
Payer, Garrett
2015-05-01
The KDD-99 Cup dataset is dead. While it can continue to be used as a toy example, the age of this dataset makes it all but useless for intrusion detection research and data mining. Many of the attacks used within the dataset are obsolete and do not reflect the features important for intrusion detection in today's networks. Creating a new dataset encompassing a large cross section of the attacks found on the Internet today could be useful, but would eventually fall to the same problem as the KDD-99 Cup; its usefulness would diminish after a period of time. To continue research into intrusion detection, the generation of new datasets needs to be as dynamic and as quick as the attacker. Simply examining existing network traffic and using domain experts such as intrusion analysts to label traffic is inefficient, expensive, and not scalable. The only viable methodology is simulation using technologies including virtualization, attack-toolsets such as Metasploit and Armitage, and sophisticated emulation of threat and user behavior. Simulating actual user behavior and network intrusion events dynamically not only allows researchers to vary scenarios quickly, but enables online testing of intrusion detection mechanisms by interacting with data as it is generated. As new threat behaviors are identified, they can be added to the simulation to make quicker determinations as to the effectiveness of existing and ongoing network intrusion technology, methodology and models.
Loftus, Stacie K
2018-05-01
The number of melanocyte- and melanoma-derived next generation sequence genome-scale datasets have rapidly expanded over the past several years. This resource guide provides a summary of publicly available sources of melanocyte cell derived whole genome, exome, mRNA and miRNA transcriptome, chromatin accessibility and epigenetic datasets. Also highlighted are bioinformatic resources and tools for visualization and data queries which allow researchers a genome-scale view of the melanocyte. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.
NASA Astrophysics Data System (ADS)
Norros, Veera; Laine, Marko; Lignell, Risto; Thingstad, Frede
2017-10-01
Methods for extracting empirically and theoretically sound parameter values are urgently needed in aquatic ecosystem modelling to describe key flows and their variation in the system. Here, we compare three Bayesian formulations for mechanistic model parameterization that differ in their assumptions about the variation in parameter values between various datasets: 1) global analysis - no variation, 2) separate analysis - independent variation and 3) hierarchical analysis - variation arising from a shared distribution defined by hyperparameters. We tested these methods, using computer-generated and empirical data, coupled with simplified and reasonably realistic plankton food web models, respectively. While all methods were adequate, the simulated example demonstrated that a well-designed hierarchical analysis can result in the most accurate and precise parameter estimates and predictions, due to its ability to combine information across datasets. However, our results also highlighted sensitivity to hyperparameter prior distributions as an important caveat of hierarchical analysis. In the more complex empirical example, hierarchical analysis was able to combine precise identification of parameter values with reasonably good predictive performance, although the ranking of the methods was less straightforward. We conclude that hierarchical Bayesian analysis is a promising tool for identifying key ecosystem-functioning parameters and their variation from empirical datasets.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets
Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De; ...
2017-01-28
Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition technique at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Tracemore » provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.« less
Analysis of 3d Building Models Accuracy Based on the Airborne Laser Scanning Point Clouds
NASA Astrophysics Data System (ADS)
Ostrowski, W.; Pilarska, M.; Charyton, J.; Bakuła, K.
2018-05-01
Creating 3D building models in large scale is becoming more popular and finds many applications. Nowadays, a wide term "3D building models" can be applied to several types of products: well-known CityGML solid models (available on few Levels of Detail), which are mainly generated from Airborne Laser Scanning (ALS) data, as well as 3D mesh models that can be created from both nadir and oblique aerial images. City authorities and national mapping agencies are interested in obtaining the 3D building models. Apart from the completeness of the models, the accuracy aspect is also important. Final accuracy of a building model depends on various factors (accuracy of the source data, complexity of the roof shapes, etc.). In this paper the methodology of inspection of dataset containing 3D models is presented. The proposed approach check all building in dataset with comparison to ALS point clouds testing both: accuracy and level of details. Using analysis of statistical parameters for normal heights for reference point cloud and tested planes and segmentation of point cloud provides the tool that can indicate which building and which roof plane in do not fulfill requirement of model accuracy and detail correctness. Proposed method was tested on two datasets: solid and mesh model.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De
Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition technique at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Tracemore » provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.« less
Predicting radiotherapy outcomes using statistical learning techniques
NASA Astrophysics Data System (ADS)
El Naqa, Issam; Bradley, Jeffrey D.; Lindsay, Patricia E.; Hope, Andrew J.; Deasy, Joseph O.
2009-09-01
Radiotherapy outcomes are determined by complex interactions between treatment, anatomical and patient-related variables. A common obstacle to building maximally predictive outcome models for clinical practice is the failure to capture potential complexity of heterogeneous variable interactions and applicability beyond institutional data. We describe a statistical learning methodology that can automatically screen for nonlinear relations among prognostic variables and generalize to unseen data before. In this work, several types of linear and nonlinear kernels to generate interaction terms and approximate the treatment-response function are evaluated. Examples of institutional datasets of esophagitis, pneumonitis and xerostomia endpoints were used. Furthermore, an independent RTOG dataset was used for 'generalizabilty' validation. We formulated the discrimination between risk groups as a supervised learning problem. The distribution of patient groups was initially analyzed using principle components analysis (PCA) to uncover potential nonlinear behavior. The performance of the different methods was evaluated using bivariate correlations and actuarial analysis. Over-fitting was controlled via cross-validation resampling. Our results suggest that a modified support vector machine (SVM) kernel method provided superior performance on leave-one-out testing compared to logistic regression and neural networks in cases where the data exhibited nonlinear behavior on PCA. For instance, in prediction of esophagitis and pneumonitis endpoints, which exhibited nonlinear behavior on PCA, the method provided 21% and 60% improvements, respectively. Furthermore, evaluation on the independent pneumonitis RTOG dataset demonstrated good generalizabilty beyond institutional data in contrast with other models. This indicates that the prediction of treatment response can be improved by utilizing nonlinear kernel methods for discovering important nonlinear interactions among model variables. These models have the capacity to predict on unseen data. Part of this work was first presented at the Seventh International Conference on Machine Learning and Applications, San Diego, CA, USA, 11-13 December 2008.
Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems
NASA Astrophysics Data System (ADS)
Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille
2016-04-01
Accurate analysis of meteorological and pyranometric data for long-term analysis is the basis of decision-making for banks and investors, regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Years (TMY) datasets. The most used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978) considering a specific weighted combination of different meteorological variables with notably global, diffuse horizontal and direct normal irradiances, air temperature, wind speed, relative humidity. In 2012, a new approach was proposed in the framework of the European project FP7 ENDORSE. It introduced the concept of "driver" that is defined by the user as an explicit function of the pyranometric and meteorological relevant variables to improve the representativeness of the TMY datasets with respect the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time-series of high quality meteorological and pyranometric ground measurements, three types of TMY datasets generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not suitable for CPV systems. For these systems, the TMY datasets obtained using dedicated drivers (DNI only or more precise one) are more representative to derive TMY datasets from limited long-term meteorological dataset.
Understanding Editing Behaviors in Multilingual Wikipedia.
Kim, Suin; Park, Sungjoon; Hale, Scott A; Kim, Sooyoung; Byun, Jeongmin; Oh, Alice H
2016-01-01
Multilingualism is common offline, but we have a more limited understanding of the ways multilingualism is displayed online and the roles that multilinguals play in the spread of content between speakers of different languages. We take a computational approach to studying multilingualism using one of the largest user-generated content platforms, Wikipedia. We study multilingualism by collecting and analyzing a large dataset of the content written by multilingual editors of the English, German, and Spanish editions of Wikipedia. This dataset contains over two million paragraphs edited by over 15,000 multilingual users from July 8 to August 9, 2013. We analyze these multilingual editors in terms of their engagement, interests, and language proficiency in their primary and non-primary (secondary) languages and find that the English edition of Wikipedia displays different dynamics from the Spanish and German editions. Users primarily editing the Spanish and German editions make more complex edits than users who edit these editions as a second language. In contrast, users editing the English edition as a second language make edits that are just as complex as the edits by users who primarily edit the English edition. In this way, English serves a special role bringing together content written by multilinguals from many language editions. Nonetheless, language remains a formidable hurdle to the spread of content: we find evidence for a complexity barrier whereby editors are less likely to edit complex content in a second language. In addition, we find that multilinguals are less engaged and show lower levels of language proficiency in their second languages. We also examine the topical interests of multilingual editors and find that there is no significant difference between primary and non-primary editors in each language.
Understanding Editing Behaviors in Multilingual Wikipedia
Hale, Scott A.; Kim, Sooyoung; Byun, Jeongmin; Oh, Alice H.
2016-01-01
Multilingualism is common offline, but we have a more limited understanding of the ways multilingualism is displayed online and the roles that multilinguals play in the spread of content between speakers of different languages. We take a computational approach to studying multilingualism using one of the largest user-generated content platforms, Wikipedia. We study multilingualism by collecting and analyzing a large dataset of the content written by multilingual editors of the English, German, and Spanish editions of Wikipedia. This dataset contains over two million paragraphs edited by over 15,000 multilingual users from July 8 to August 9, 2013. We analyze these multilingual editors in terms of their engagement, interests, and language proficiency in their primary and non-primary (secondary) languages and find that the English edition of Wikipedia displays different dynamics from the Spanish and German editions. Users primarily editing the Spanish and German editions make more complex edits than users who edit these editions as a second language. In contrast, users editing the English edition as a second language make edits that are just as complex as the edits by users who primarily edit the English edition. In this way, English serves a special role bringing together content written by multilinguals from many language editions. Nonetheless, language remains a formidable hurdle to the spread of content: we find evidence for a complexity barrier whereby editors are less likely to edit complex content in a second language. In addition, we find that multilinguals are less engaged and show lower levels of language proficiency in their second languages. We also examine the topical interests of multilingual editors and find that there is no significant difference between primary and non-primary editors in each language. PMID:27171158
DOE Office of Scientific and Technical Information (OSTI.GOV)
Paiton, Dylan M.; Kenyon, Garrett T.; Brumby, Steven P.
An approach to detecting objects in an image dataset may combine texture/color detection, shape/contour detection, and/or motion detection using sparse, generative, hierarchical models with lateral and top-down connections. A first independent representation of objects in an image dataset may be produced using a color/texture detection algorithm. A second independent representation of objects in the image dataset may be produced using a shape/contour detection algorithm. A third independent representation of objects in the image dataset may be produced using a motion detection algorithm. The first, second, and third independent representations may then be combined into a single coherent output using amore » combinatorial algorithm.« less
Validation project. This report describes the procedure used to generate the noise models output dataset , and then it compares that dataset to the...benchmark, the Engineer Research and Development Centers Long-Range Sound Propagation dataset . It was found that the models consistently underpredict the
Wildenhain, Jan; Spitzer, Michaela; Dolma, Sonam; Jarvik, Nick; White, Rachel; Roy, Marcia; Griffiths, Emma; Bellows, David S.; Wright, Gerard D.; Tyers, Mike
2016-01-01
The network structure of biological systems suggests that effective therapeutic intervention may require combinations of agents that act synergistically. However, a dearth of systematic chemical combination datasets have limited the development of predictive algorithms for chemical synergism. Here, we report two large datasets of linked chemical-genetic and chemical-chemical interactions in the budding yeast Saccharomyces cerevisiae. We screened 5,518 unique compounds against 242 diverse yeast gene deletion strains to generate an extended chemical-genetic matrix (CGM) of 492,126 chemical-gene interaction measurements. This CGM dataset contained 1,434 genotype-specific inhibitors, termed cryptagens. We selected 128 structurally diverse cryptagens and tested all pairwise combinations to generate a benchmark dataset of 8,128 pairwise chemical-chemical interaction tests for synergy prediction, termed the cryptagen matrix (CM). An accompanying database resource called ChemGRID was developed to enable analysis, visualisation and downloads of all data. The CGM and CM datasets will facilitate the benchmarking of computational approaches for synergy prediction, as well as chemical structure-activity relationship models for anti-fungal drug discovery. PMID:27874849
Gomez-Pilar, Javier; Poza, Jesús; Bachiller, Alejandro; Gómez, Carlos; Núñez, Pablo; Lubeiro, Alba; Molina, Vicente; Hornero, Roberto
2018-02-01
The aim of this study was to introduce a novel global measure of graph complexity: Shannon graph complexity (SGC). This measure was specifically developed for weighted graphs, but it can also be applied to binary graphs. The proposed complexity measure was designed to capture the interplay between two properties of a system: the 'information' (calculated by means of Shannon entropy) and the 'order' of the system (estimated by means of a disequilibrium measure). SGC is based on the concept that complex graphs should maintain an equilibrium between the aforementioned two properties, which can be measured by means of the edge weight distribution. In this study, SGC was assessed using four synthetic graph datasets and a real dataset, formed by electroencephalographic (EEG) recordings from controls and schizophrenia patients. SGC was compared with graph density (GD), a classical measure used to evaluate graph complexity. Our results showed that SGC is invariant with respect to GD and independent of node degree distribution. Furthermore, its variation with graph size [Formula: see text] is close to zero for [Formula: see text]. Results from the real dataset showed an increment in the weight distribution balance during the cognitive processing for both controls and schizophrenia patients, although these changes are more relevant for controls. Our findings revealed that SGC does not need a comparison with null-hypothesis networks constructed by a surrogate process. In addition, SGC results on the real dataset suggest that schizophrenia is associated with a deficit in the brain dynamic reorganization related to secondary pathways of the brain network.
The Next Generation of Planetary Atmospheric Probes
NASA Technical Reports Server (NTRS)
Houben, Howard
2005-01-01
Entry probes provide useful insights into the structures of planetary atmospheres, but give only one-dimensional pictures of complex four-dimensional systems that vary on all temporal and spatial scales. This makes the interpretation of the results quite challenging, especially as regards atmospheric dynamics. Here is a planetary meteorologist's vision of what the next generation of atmospheric entry probe missions should be: Dedicated sounding instruments get most of the required data from orbit. Relatively simple and inexpensive entry probes are released from the orbiter, with low entry velocities, to establish ground truth, to clarify the vertical structure, and for adaptive observations to enhance the dataset in preparation for sensitive operations. The data are assimilated onboard in real time. The products, being immediately available, are of immense benefit for scientific and operational purposes (aerobraking, aerocapture, accurate payload delivery via glider, ballooning missions, weather forecasts, etc.).
Progeny Clustering: A Method to Identify Biological Phenotypes
Hu, Chenyue W.; Kornblau, Steven M.; Slater, John H.; Qutub, Amina A.
2015-01-01
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient in computing, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess the clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown successful and robust when applied to two synthetic datasets (datasets of two-dimensions and ten-dimensions containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and Rat CNS dataset) and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
An Effective Methodology for Processing and Analyzing Large, Complex Spacecraft Data Streams
ERIC Educational Resources Information Center
Teymourlouei, Haydar
2013-01-01
The emerging large datasets have made efficient data processing a much more difficult task for the traditional methodologies. Invariably, datasets continue to increase rapidly in size with time. The purpose of this research is to give an overview of some of the tools and techniques that can be utilized to manage and analyze large datasets. We…
CORUM: the comprehensive resource of mammalian protein complexes
Ruepp, Andreas; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Stransky, Michael; Waegele, Brigitte; Schmidt, Thorsten; Doudieu, Octave Noubibou; Stümpflen, Volker; Mewes, H. Werner
2008-01-01
Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The CORUM (http://mips.gsf.de/genre/proj/corum/index.html) database is a collection of experimentally verified mammalian protein complexes. Information is manually derived by critical reading of the scientific literature from expert annotators. Information about protein complexes includes protein complex names, subunits, literature references as well as the function of the complexes. For functional annotation, we use the FunCat catalogue that enables to organize the protein complex space into biologically meaningful subsets. The database contains more than 1750 protein complexes that are built from 2400 different genes, thus representing 12% of the protein-coding genes in human. A web-based system is available to query, view and download the data. CORUM provides a comprehensive dataset of protein complexes for discoveries in systems biology, analyses of protein networks and protein complex-associated diseases. Comparable to the MIPS reference dataset of protein complexes from yeast, CORUM intends to serve as a reference for mammalian protein complexes. PMID:17965090
A global distributed basin morphometric dataset
NASA Astrophysics Data System (ADS)
Shen, Xinyi; Anagnostou, Emmanouil N.; Mei, Yiwen; Hong, Yang
2017-01-01
Basin morphometry is vital information for relating storms to hydrologic hazards, such as landslides and floods. In this paper we present the first comprehensive global dataset of distributed basin morphometry at 30 arc seconds resolution. The dataset includes nine prime morphometric variables; in addition we present formulas for generating twenty-one additional morphometric variables based on combination of the prime variables. The dataset can aid different applications including studies of land-atmosphere interaction, and modelling of floods and droughts for sustainable water management. The validity of the dataset has been consolidated by successfully repeating the Hack's law.
Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong
2017-01-01
Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.
Wu, Chenxue; Liu, Zhao; Zhu, Yunhong
2017-01-01
Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules. PMID:28767687
Complex versus simple models: ion-channel cardiac toxicity prediction.
Mistry, Hitesh B
2018-01-01
There is growing interest in applying detailed mathematical models of the heart for ion-channel related cardiac toxicity prediction. However, a debate as to whether such complex models are required exists. Here an assessment in the predictive performance between two established large-scale biophysical cardiac models and a simple linear model B net was conducted. Three ion-channel data-sets were extracted from literature. Each compound was designated a cardiac risk category using two different classification schemes based on information within CredibleMeds. The predictive performance of each model within each data-set for each classification scheme was assessed via a leave-one-out cross validation. Overall the B net model performed equally as well as the leading cardiac models in two of the data-sets and outperformed both cardiac models on the latest. These results highlight the importance of benchmarking complex versus simple models but also encourage the development of simple models.
Wang, Lu-Yong; Fasulo, D
2006-01-01
Genome-wide association study for complex diseases will generate massive amount of single nucleotide polymorphisms (SNPs) data. Univariate statistical test (i.e. Fisher exact test) was used to single out non-associated SNPs. However, the disease-susceptible SNPs may have little marginal effects in population and are unlikely to retain after the univariate tests. Also, model-based methods are impractical for large-scale dataset. Moreover, genetic heterogeneity makes the traditional methods harder to identify the genetic causes of diseases. A more recent random forest method provides a more robust method for screening the SNPs in thousands scale. However, for more large-scale data, i.e., Affymetrix Human Mapping 100K GeneChip data, a faster screening method is required to screening SNPs in whole-genome large scale association analysis with genetic heterogeneity. We propose a boosting-based method for rapid screening in large-scale analysis of complex traits in the presence of genetic heterogeneity. It provides a relatively fast and fairly good tool for screening and limiting the candidate SNPs for further more complex computational modeling task.
Multi-Fidelity Framework for Modeling Combustion Instability
2016-07-27
generated from the reduced-domain dataset. Evaluations of the framework are performed based on simplified test problems for a model rocket combustor showing...generated from the reduced-domain dataset. Evaluations of the framework are performed based on simplified test problems for a model rocket combustor...of Aeronautics and Astronautics and Associate Fellow AIAA. ‡ Professor Emeritus. § Senior Scientist, Rocket Propulsion Division and Senior Member
Generating Contextual Descriptions of Virtual Reality (VR) Spaces
NASA Astrophysics Data System (ADS)
Olson, D. M.; Zaman, C. H.; Sutherland, A.
2017-12-01
Virtual reality holds great potential for science communication, education, and research. However, interfaces for manipulating data and environments in virtual worlds are limited and idiosyncratic. Furthermore, speech and vision are the primary modalities by which humans collect information about the world, but the linking of visual and natural language domains is a relatively new pursuit in computer vision. Machine learning techniques have been shown to be effective at image and speech classification, as well as at describing images with language (Karpathy 2016), but have not yet been used to describe potential actions. We propose a technique for creating a library of possible context-specific actions associated with 3D objects in immersive virtual worlds based on a novel dataset generated natively in virtual reality containing speech, image, gaze, and acceleration data. We will discuss the design and execution of a user study in virtual reality that enabled the collection and the development of this dataset. We will also discuss the development of a hybrid machine learning algorithm linking vision data with environmental affordances in natural language. Our findings demonstrate that it is possible to develop a model which can generate interpretable verbal descriptions of possible actions associated with recognized 3D objects within immersive VR environments. This suggests promising applications for more intuitive user interfaces through voice interaction within 3D environments. It also demonstrates the potential to apply vast bodies of embodied and semantic knowledge to enrich user interaction within VR environments. This technology would allow for applications such as expert knowledge annotation of 3D environments, complex verbal data querying and object manipulation in virtual spaces, and computer-generated, dynamic 3D object affordances and functionality during simulations.
Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models
Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.
2016-01-01
An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777
Inaugural Genomics Automation Congress and the coming deluge of sequencing data.
Creighton, Chad J
2010-10-01
Presentations at Select Biosciences's first 'Genomics Automation Congress' (Boston, MA, USA) in 2010 focused on next-generation sequencing and the platforms and methodology around them. The meeting provided an overview of sequencing technologies, both new and emerging. Speakers shared their recent work on applying sequencing to profile cells for various levels of biomolecular complexity, including DNA sequences, DNA copy, DNA methylation, mRNA and microRNA. With sequencing time and costs continuing to drop dramatically, a virtual explosion of very large sequencing datasets is at hand, which will probably present challenges and opportunities for high-level data analysis and interpretation, as well as for information technology infrastructure.
Ellis, Matthew J; Gillette, Michael; Carr, Steven A; Paulovich, Amanda G; Smith, Richard D; Rodland, Karin K; Townsend, R Reid; Kinsinger, Christopher; Mesri, Mehdi; Rodriguez, Henry; Liebler, Daniel C
2013-10-01
The National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium is applying the latest generation of proteomic technologies to genomically annotated tumors from The Cancer Genome Atlas (TCGA) program, a joint initiative of the NCI and the National Human Genome Research Institute. By providing a fully integrated accounting of DNA, RNA, and protein abnormalities in individual tumors, these datasets will illuminate the complex relationship between genomic abnormalities and cancer phenotypes, thus producing biologic insights as well as a wave of novel candidate biomarkers and therapeutic targets amenable to verification using targeted mass spectrometry methods. ©2013 AACR.
Herrinton, Lisa J; Liu, Liyan; Altschuler, Andrea; Dell, Richard; Rabrenovich, Violeta; Compton-Phillips, Amy L
2015-01-01
The cost to build and to maintain traditional registries for many dire, complex, low-frequency conditions is prohibitive. The authors used accessible technology to develop a platform that would generate miniregistries (small, routinely updated datasets) for surveillance, to identify patients who were missing elected utilization and to influence clinicians to change practices to improve care. The platform, tested in 5 medical specialty departments, enabled the specialists to rapidly and effectively communicate clinical questions, knowledge of disease, clinical workflows, and improve opportunities. Each miniregistry required 1 to 2 hours of collaboration by a specialist. Turnaround was 1 to 14 days.
CrossCheck: an open-source web tool for high-throughput screen data analysis.
Najafov, Jamil; Najafov, Ayaz
2017-07-19
Modern high-throughput screening methods allow researchers to generate large datasets that potentially contain important biological information. However, oftentimes, picking relevant hits from such screens and generating testable hypotheses requires training in bioinformatics and the skills to efficiently perform database mining. There are currently no tools available to general public that allow users to cross-reference their screen datasets with published screen datasets. To this end, we developed CrossCheck, an online platform for high-throughput screen data analysis. CrossCheck is a centralized database that allows effortless comparison of the user-entered list of gene symbols with 16,231 published datasets. These datasets include published data from genome-wide RNAi and CRISPR screens, interactome proteomics and phosphoproteomics screens, cancer mutation databases, low-throughput studies of major cell signaling mediators, such as kinases, E3 ubiquitin ligases and phosphatases, and gene ontological information. Moreover, CrossCheck includes a novel database of predicted protein kinase substrates, which was developed using proteome-wide consensus motif searches. CrossCheck dramatically simplifies high-throughput screen data analysis and enables researchers to dig deep into the published literature and streamline data-driven hypothesis generation. CrossCheck is freely accessible as a web-based application at http://proteinguru.com/crosscheck.
KernelADASYN: Kernel Based Adaptive Synthetic Data Generation for Imbalanced Learning
2015-08-17
eases [35], Indian liver patient dataset (ILPD) [36], Parkinsons dataset [37], Vertebral Column dataset [38], breast cancer dataset [39], breast tissue...Both the data set of Breast Cancer and Breast Tissue aim to predict the patient is normal or abnormal according to the measurements. The data set SPECT...9, pp. 1263–1284, 2009. [3] M. Elter, R. Schulz-Wendtland, and T. Wittenberg, “The prediction of breast cancer biopsy outcomes using two cad
KANTS: a stigmergic ant algorithm for cluster analysis and swarm art.
Fernandes, Carlos M; Mora, Antonio M; Merelo, Juan J; Rosa, Agostinho C
2014-06-01
KANTS is a swarm intelligence clustering algorithm inspired by the behavior of social insects. It uses stigmergy as a strategy for clustering large datasets and, as a result, displays a typical behavior of complex systems: self-organization and global patterns emerging from the local interaction of simple units. This paper introduces a simplified version of KANTS and describes recent experiments with the algorithm in the context of a contemporary artistic and scientific trend called swarm art, a type of generative art in which swarm intelligence systems are used to create artwork or ornamental objects. KANTS is used here for generating color drawings from the input data that represent real-world phenomena, such as electroencephalogram sleep data. However, the main proposal of this paper is an art project based on well-known abstract paintings, from which the chromatic values are extracted and used as input. Colors and shapes are therefore reorganized by KANTS, which generates its own interpretation of the original artworks. The project won the 2012 Evolutionary Art, Design, and Creativity Competition.
Human neutral genetic variation and forensic STR data.
Silva, Nuno M; Pereira, Luísa; Poloni, Estella S; Currat, Mathias
2012-01-01
The forensic genetics field is generating extensive population data on polymorphism of short tandem repeats (STR) markers in globally distributed samples. In this study we explored and quantified the informative power of these datasets to address issues related to human evolution and diversity, by using two online resources: an allele frequency dataset representing 141 populations summing up to almost 26 thousand individuals; a genotype dataset consisting of 42 populations and more than 11 thousand individuals. We show that the genetic relationships between populations based on forensic STRs are best explained by geography, as observed when analysing other worldwide datasets generated specifically to study human diversity. However, the global level of genetic differentiation between populations (as measured by a fixation index) is about half the value estimated with those other datasets, which contain a much higher number of markers but much less individuals. We suggest that the main factor explaining this difference is an ascertainment bias in forensics data resulting from the choice of markers for individual identification. We show that this choice results in average low variance of heterozygosity across world regions, and hence in low differentiation among populations. Thus, the forensic genetic markers currently produced for the purpose of individual assignment and identification allow the detection of the patterns of neutral genetic structure that characterize the human population but they do underestimate the levels of this genetic structure compared to the datasets of STRs (or other kinds of markers) generated specifically to study the diversity of human populations.
NASA Astrophysics Data System (ADS)
Abul Ehsan Bhuiyan, Md; Nikolopoulos, Efthymios I.; Anagnostou, Emmanouil N.; Quintana-Seguí, Pere; Barella-Ortiz, Anaïs
2018-02-01
This study investigates the use of a nonparametric, tree-based model, quantile regression forests (QRF), for combining multiple global precipitation datasets and characterizing the uncertainty of the combined product. We used the Iberian Peninsula as the study area, with a study period spanning 11 years (2000-2010). Inputs to the QRF model included three satellite precipitation products, CMORPH, PERSIANN, and 3B42 (V7); an atmospheric reanalysis precipitation and air temperature dataset; satellite-derived near-surface daily soil moisture data; and a terrain elevation dataset. We calibrated the QRF model for two seasons and two terrain elevation categories and used it to generate ensemble for these conditions. Evaluation of the combined product was based on a high-resolution, ground-reference precipitation dataset (SAFRAN) available at 5 km 1 h-1 resolution. Furthermore, to evaluate relative improvements and the overall impact of the combined product in hydrological response, we used the generated ensemble to force a distributed hydrological model (the SURFEX land surface model and the RAPID river routing scheme) and compared its streamflow simulation results with the corresponding simulations from the individual global precipitation and reference datasets. We concluded that the proposed technique could generate realizations that successfully encapsulate the reference precipitation and provide significant improvement in streamflow simulations, with reduction in systematic and random error on the order of 20-99 and 44-88 %, respectively, when considering the ensemble mean.
Gururaj, Anupama E.; Chen, Xiaoling; Pournejati, Saeid; Alter, George; Hersh, William R.; Demner-Fushman, Dina; Ohno-Machado, Lucila
2017-01-01
Abstract The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval. Database URL: https://biocaddie.org/benchmark-data PMID:29220453
NASA Astrophysics Data System (ADS)
Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.
2017-06-01
The recent popularity of the use of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods of generating and analyzing such data, but almost always relying in asymptotic distributions and in consequence being not adequate for small sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based in exact distributions this procedure may even be used in small sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with the results obtained from an adaptation of Reiters combination rule to multiply imputed synthetic datasets and an application to the 2000 Current Population Survey is discussed.
NASA Astrophysics Data System (ADS)
Shrestha, S. R.; Collow, T. W.; Rose, B.
2016-12-01
Scientific datasets are generated from various sources and platforms but they are typically produced either by earth observation systems or by modelling systems. These are widely used for monitoring, simulating, or analyzing measurements that are associated with physical, chemical, and biological phenomena over the ocean, atmosphere, or land. A significant subset of scientific datasets stores values directly as rasters or in a form that can be rasterized. This is where a value exists at every cell in a regular grid spanning the spatial extent of the dataset. Government agencies like NOAA, NASA, EPA, USGS produces large volumes of near real-time, forecast, and historical data that drives climatological and meteorological studies, and underpins operations ranging from weather prediction to sea ice loss. Modern science is computationally intensive because of the availability of an enormous amount of scientific data, the adoption of data-driven analysis, and the need to share these dataset and research results with the public. ArcGIS as a platform is sophisticated and capable of handling such complex domain. We'll discuss constructs and capabilities applicable to multidimensional gridded data that can be conceptualized as a multivariate space-time cube. Building on the concept of a two-dimensional raster, a typical multidimensional raster dataset could contain several "slices" within the same spatial extent. We will share a case from the NOAA Climate Forecast Systems Reanalysis (CFSR) multidimensional data as an example of how large collections of rasters can be efficiently organized and managed through a data model within a geodatabase called "Mosaic dataset" and dynamically transformed and analyzed using raster functions. A raster function is a lightweight, raster-valued transformation defined over a mixed set of raster and scalar input. That means, just like any tool, you can provide a raster function with input parameters. It enables dynamic processing of only the data that's being displayed on the screen or requested by an application. We will present the dynamic processing and analysis of CFSR data using the chains of raster function and share it as dynamic multidimensional image service. This workflow and capabilities can be easily applied to any scientific data formats that are supported in mosaic dataset.
Lijun Liu; V. Missirian; Matthew S. Zinkgraf; Andrew Groover; V. Filkov
2014-01-01
Background: One of the great advantages of next generation sequencing is the ability to generate large genomic datasets for virtually all species, including non-model organisms. It should be possible, in turn, to apply advanced computational approaches to these datasets to develop models of biological processes. In a practical sense, working with non-model organisms...
Petri Nets with Fuzzy Logic (PNFL): Reverse Engineering and Parametrization
Küffner, Robert; Petri, Tobias; Windhager, Lukas; Zimmer, Ralf
2010-01-01
Background The recent DREAM4 blind assessment provided a particularly realistic and challenging setting for network reverse engineering methods. The in silico part of DREAM4 solicited the inference of cycle-rich gene regulatory networks from heterogeneous, noisy expression data including time courses as well as knockout, knockdown and multifactorial perturbations. Methodology and Principal Findings We inferred and parametrized simulation models based on Petri Nets with Fuzzy Logic (PNFL). This completely automated approach correctly reconstructed networks with cycles as well as oscillating network motifs. PNFL was evaluated as the best performer on DREAM4 in silico networks of size 10 with an area under the precision-recall curve (AUPR) of 81%. Besides topology, we inferred a range of additional mechanistic details with good reliability, e.g. distinguishing activation from inhibition as well as dependent from independent regulation. Our models also performed well on new experimental conditions such as double knockout mutations that were not included in the provided datasets. Conclusions The inference of biological networks substantially benefits from methods that are expressive enough to deal with diverse datasets in a unified way. At the same time, overly complex approaches could generate multiple different models that explain the data equally well. PNFL appears to strike the balance between expressive power and complexity. This also applies to the intuitive representation of PNFL models combining a straightforward graphical notation with colloquial fuzzy parameters. PMID:20862218
Level 2 Ancillary Products and Datasets Algorithm Theoretical Basis
NASA Technical Reports Server (NTRS)
Diner, D.; Abdou, W.; Gordon, H.; Kahn, R.; Knyazikhin, Y.; Martonchik, J.; McDonald, D.; McMuldroch, S.; Myneni, R.; West, R.
1999-01-01
This Algorithm Theoretical Basis (ATB) document describes the algorithms used to generate the parameters of certain ancillary products and datasets used during Level 2 processing of Multi-angle Imaging SpectroRadiometer (MIST) data.
Theofilatos, Konstantinos; Pavlopoulou, Niki; Papasavvas, Christoforos; Likothanassis, Spiros; Dimitrakopoulos, Christos; Georgopoulos, Efstratios; Moschopoulos, Charalampos; Mavroudi, Seferina
2015-03-01
Proteins are considered to be the most important individual components of biological systems and they combine to form physical protein complexes which are responsible for certain molecular functions. Despite the large availability of protein-protein interaction (PPI) information, not much information is available about protein complexes. Experimental methods are limited in terms of time, efficiency, cost and performance constraints. Existing computational methods have provided encouraging preliminary results, but they phase certain disadvantages as they require parameter tuning, some of them cannot handle weighted PPI data and others do not allow a protein to participate in more than one protein complex. In the present paper, we propose a new fully unsupervised methodology for predicting protein complexes from weighted PPI graphs. The proposed methodology is called evolutionary enhanced Markov clustering (EE-MC) and it is a hybrid combination of an adaptive evolutionary algorithm and a state-of-the-art clustering algorithm named enhanced Markov clustering. EE-MC was compared with state-of-the-art methodologies when applied to datasets from the human and the yeast Saccharomyces cerevisiae organisms. Using public available datasets, EE-MC outperformed existing methodologies (in some datasets the separation metric was increased by 10-20%). Moreover, when applied to new human datasets its performance was encouraging in the prediction of protein complexes which consist of proteins with high functional similarity. In specific, 5737 protein complexes were predicted and 72.58% of them are enriched for at least one gene ontology (GO) function term. EE-MC is by design able to overcome intrinsic limitations of existing methodologies such as their inability to handle weighted PPI networks, their constraint to assign every protein in exactly one cluster and the difficulties they face concerning the parameter tuning. This fact was experimentally validated and moreover, new potentially true human protein complexes were suggested as candidates for further validation using experimental techniques. Copyright © 2015 Elsevier B.V. All rights reserved.
QuIN: A Web Server for Querying and Visualizing Chromatin Interaction Networks.
Thibodeau, Asa; Márquez, Eladio J; Luo, Oscar; Ruan, Yijun; Menghi, Francesca; Shin, Dong-Guk; Stitzel, Michael L; Vera-Licona, Paola; Ucar, Duygu
2016-06-01
Recent studies of the human genome have indicated that regulatory elements (e.g. promoters and enhancers) at distal genomic locations can interact with each other via chromatin folding and affect gene expression levels. Genomic technologies for mapping interactions between DNA regions, e.g., ChIA-PET and HiC, can generate genome-wide maps of interactions between regulatory elements. These interaction datasets are important resources to infer distal gene targets of non-coding regulatory elements and to facilitate prioritization of critical loci for important cellular functions. With the increasing diversity and complexity of genomic information and public ontologies, making sense of these datasets demands integrative and easy-to-use software tools. Moreover, network representation of chromatin interaction maps enables effective data visualization, integration, and mining. Currently, there is no software that can take full advantage of network theory approaches for the analysis of chromatin interaction datasets. To fill this gap, we developed a web-based application, QuIN, which enables: 1) building and visualizing chromatin interaction networks, 2) annotating networks with user-provided private and publicly available functional genomics and interaction datasets, 3) querying network components based on gene name or chromosome location, and 4) utilizing network based measures to identify and prioritize critical regulatory targets and their direct and indirect interactions. QuIN's web server is available at http://quin.jax.org QuIN is developed in Java and JavaScript, utilizing an Apache Tomcat web server and MySQL database and the source code is available under the GPLV3 license available on GitHub: https://github.com/UcarLab/QuIN/.
NASA Astrophysics Data System (ADS)
Jawak, Shridhar D.; Luis, Alvarinho J.
2016-05-01
Digital elevation model (DEM) is indispensable for analysis such as topographic feature extraction, ice sheet melting, slope stability analysis, landscape analysis and so on. Such analysis requires a highly accurate DEM. Available DEMs of Antarctic region compiled by using radar altimetry and the Antarctic digital database indicate elevation variations of up to hundreds of meters, which necessitates the generation of local improved DEM. An improved DEM of the Schirmacher Oasis, East Antarctica has been generated by synergistically fusing satellite-derived laser altimetry data from Geoscience Laser Altimetry System (GLAS), Radarsat Antarctic Mapping Project (RAMP) elevation data and Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) global elevation data (GDEM). This is a characteristic attempt to generate a DEM of any part of Antarctica by fusing multiple elevation datasets, which is essential to model the ice elevation change and address the ice mass balance. We analyzed a suite of interpolation techniques for constructing a DEM from GLAS, RAMP and ASTER DEM-based point elevation datasets, in order to determine the level of confidence with which the interpolation techniques can generate a better interpolated continuous surface, and eventually improve the elevation accuracy of DEM from synergistically fused RAMP, GLAS and ASTER point elevation datasets. The DEM presented in this work has a vertical accuracy (≈ 23 m) better than RAMP DEM (≈ 57 m) and ASTER DEM (≈ 64 m) individually. The RAMP DEM and ASTER DEM elevations were corrected using differential GPS elevations as ground reference data, and the accuracy obtained after fusing multitemporal datasets is found to be improved than that of existing DEMs constructed by using RAMP or ASTER alone. This is our second attempt of fusing multitemporal, multisensory and multisource elevation data to generate a DEM of Antarctica, in order to address the ice elevation change and address the ice mass balance. Our approach focuses on the strengths of each elevation data source to produce an accurate elevation model.
MicroRNA array normalization: an evaluation using a randomized dataset as the benchmark.
Qin, Li-Xuan; Zhou, Qin
2014-01-01
MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays.
MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark
Qin, Li-Xuan; Zhou, Qin
2014-01-01
MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays. PMID:24905456
Non-steady wind turbine response to daytime atmospheric turbulence.
Nandi, Tarak N; Herrig, Andreas; Brasseur, James G
2017-04-13
Relevant to drivetrain bearing fatigue failures, we analyse non-steady wind turbine responses from interactions between energy-dominant daytime atmospheric turbulence eddies and the rotating blades of a GE 1.5 MW wind turbine using a unique dataset from a GE field experiment and computer simulation. Time-resolved local velocity data were collected at the leading and trailing edges of an instrumented blade together with generator power, revolutions per minute, pitch and yaw. Wind velocity and temperature were measured upwind on a meteorological tower. The stability state and other atmospheric conditions during the field experiment were replicated with a large-eddy simulation in which was embedded a GE 1.5 MW wind turbine rotor modelled with an advanced actuator line method. Both datasets identify three important response time scales: advective passage of energy-dominant eddies (≈25-50 s), blade rotation (once per revolution (1P), ≈3 s) and sub-1P scale (<1 s) response to internal eddy structure. Large-amplitude short-time ramp-like and oscillatory load fluctuations result in response to temporal changes in velocity vector inclination in the aerofoil plane, modulated by eddy passage at longer time scales. Generator power responds strongly to large-eddy wind modulations. We show that internal dynamics of the blade boundary layer near the trailing edge is temporally modulated by the non-steady external flow that was measured at the leading edge, as well as blade-generated turbulence motions.This article is part of the themed issue 'Wind energy in complex terrains'. © 2017 The Author(s).
Non-steady wind turbine response to daytime atmospheric turbulence
Nandi, Tarak N.; Herrig, Andreas
2017-01-01
Relevant to drivetrain bearing fatigue failures, we analyse non-steady wind turbine responses from interactions between energy-dominant daytime atmospheric turbulence eddies and the rotating blades of a GE 1.5 MW wind turbine using a unique dataset from a GE field experiment and computer simulation. Time-resolved local velocity data were collected at the leading and trailing edges of an instrumented blade together with generator power, revolutions per minute, pitch and yaw. Wind velocity and temperature were measured upwind on a meteorological tower. The stability state and other atmospheric conditions during the field experiment were replicated with a large-eddy simulation in which was embedded a GE 1.5 MW wind turbine rotor modelled with an advanced actuator line method. Both datasets identify three important response time scales: advective passage of energy-dominant eddies (≈25–50 s), blade rotation (once per revolution (1P), ≈3 s) and sub-1P scale (<1 s) response to internal eddy structure. Large-amplitude short-time ramp-like and oscillatory load fluctuations result in response to temporal changes in velocity vector inclination in the aerofoil plane, modulated by eddy passage at longer time scales. Generator power responds strongly to large-eddy wind modulations. We show that internal dynamics of the blade boundary layer near the trailing edge is temporally modulated by the non-steady external flow that was measured at the leading edge, as well as blade-generated turbulence motions. This article is part of the themed issue ‘Wind energy in complex terrains’. PMID:28265026
Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ)
Mascher, Martin; Muehlbauer, Gary J; Rokhsar, Daniel S; Chapman, Jarrod; Schmutz, Jeremy; Barry, Kerrie; Muñoz-Amatriaín, María; Close, Timothy J; Wise, Roger P; Schulman, Alan H; Himmelbach, Axel; Mayer, Klaus FX; Scholz, Uwe; Poland, Jesse A; Stein, Nils; Waugh, Robbie
2013-01-01
Next-generation whole-genome shotgun assemblies of complex genomes are highly useful, but fail to link nearby sequence contigs with each other or provide a linear order of contigs along individual chromosomes. Here, we introduce a strategy based on sequencing progeny of a segregating population that allows de novo production of a genetically anchored linear assembly of the gene space of an organism. We demonstrate the power of the approach by reconstructing the chromosomal organization of the gene space of barley, a large, complex and highly repetitive 5.1 Gb genome. We evaluate the robustness of the new assembly by comparison to a recently released physical and genetic framework of the barley genome, and to various genetically ordered sequence-based genotypic datasets. The method is independent of the need for any prior sequence resources, and will enable rapid and cost-efficient establishment of powerful genomic information for many species. PMID:23998490
Systems Proteomics for Translational Network Medicine
Arrell, D. Kent; Terzic, Andre
2012-01-01
Universal principles underlying network science, and their ever-increasing applications in biomedicine, underscore the unprecedented capacity of systems biology based strategies to synthesize and resolve massive high throughput generated datasets. Enabling previously unattainable comprehension of biological complexity, systems approaches have accelerated progress in elucidating disease prediction, progression, and outcome. Applied to the spectrum of states spanning health and disease, network proteomics establishes a collation, integration, and prioritization algorithm to guide mapping and decoding of proteome landscapes from large-scale raw data. Providing unparalleled deconvolution of protein lists into global interactomes, integrative systems proteomics enables objective, multi-modal interpretation at molecular, pathway, and network scales, merging individual molecular components, their plurality of interactions, and functional contributions for systems comprehension. As such, network systems approaches are increasingly exploited for objective interpretation of cardiovascular proteomics studies. Here, we highlight network systems proteomic analysis pipelines for integration and biological interpretation through protein cartography, ontological categorization, pathway and functional enrichment and complex network analysis. PMID:22896016
Liang, Jennifer J; Tsou, Ching-Huei; Devarakonda, Murthy V
2017-01-01
Natural language processing (NLP) holds the promise of effectively analyzing patient record data to reduce cognitive load on physicians and clinicians in patient care, clinical research, and hospital operations management. A critical need in developing such methods is the "ground truth" dataset needed for training and testing the algorithms. Beyond localizable, relatively simple tasks, ground truth creation is a significant challenge because medical experts, just as physicians in patient care, have to assimilate vast amounts of data in EHR systems. To mitigate potential inaccuracies of the cognitive challenges, we present an iterative vetting approach for creating the ground truth for complex NLP tasks. In this paper, we present the methodology, and report on its use for an automated problem list generation task, its effect on the ground truth quality and system accuracy, and lessons learned from the effort.
BioStar models of clinical and genomic data for biomedical data warehouse design
Wang, Liangjiang; Ramanathan, Murali
2008-01-01
Biomedical research is now generating large amounts of data, ranging from clinical test results to microarray gene expression profiles. The scale and complexity of these datasets give rise to substantial challenges in data management and analysis. It is highly desirable that data warehousing and online analytical processing technologies can be applied to biomedical data integration and mining. The major difficulty probably lies in the task of capturing and modelling diverse biological objects and their complex relationships. This paper describes multidimensional data modelling for biomedical data warehouse design. Since the conventional models such as star schema appear to be insufficient for modelling clinical and genomic data, we develop a new model called BioStar schema. The new model can capture the rich semantics of biomedical data and provide greater extensibility for the fast evolution of biological research methodologies. PMID:18048122
Kuharev, Jörg; Navarro, Pedro; Distler, Ute; Jahn, Olaf; Tenzer, Stefan
2015-09-01
Label-free quantification (LFQ) based on data-independent acquisition workflows currently experiences increasing popularity. Several software tools have been recently published or are commercially available. The present study focuses on the evaluation of three different software packages (Progenesis, synapter, and ISOQuant) supporting ion mobility enhanced data-independent acquisition data. In order to benchmark the LFQ performance of the different tools, we generated two hybrid proteome samples of defined quantitative composition containing tryptically digested proteomes of three different species (mouse, yeast, Escherichia coli). This model dataset simulates complex biological samples containing large numbers of both unregulated (background) proteins as well as up- and downregulated proteins with exactly known ratios between samples. We determined the number and dynamic range of quantifiable proteins and analyzed the influence of applied algorithms (retention time alignment, clustering, normalization, etc.) on quantification results. Analysis of technical reproducibility revealed median coefficients of variation of reported protein abundances below 5% for MS(E) data for Progenesis and ISOQuant. Regarding accuracy of LFQ, evaluation with synapter and ISOQuant yielded superior results compared to Progenesis. In addition, we discuss reporting formats and user friendliness of the software packages. The data generated in this study have been deposited to the ProteomeXchange Consortium with identifier PXD001240 (http://proteomecentral.proteomexchange.org/dataset/PXD001240). © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
SPICE for ESA Planetary Missions
NASA Astrophysics Data System (ADS)
Costa, M.
2018-04-01
The ESA SPICE Service leads the SPICE operations for ESA missions and is responsible for the generation of the SPICE Kernel Dataset for ESA missions. This contribution will describe the status of these datasets and outline the future developments.
Boosting association rule mining in large datasets via Gibbs sampling.
Qian, Guoqi; Rao, Calyampudi Radhakrishna; Sun, Xiaoying; Wu, Yuehua
2016-05-03
Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. Also a general rule importance measure is proposed to direct the stochastic search so that, as a result of the randomly generated association rules constituting an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In the simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.
Finite element analysis of human joints
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bossart, P.L.; Hollerbach, K.
1996-09-01
Our work focuses on the development of finite element models (FEMs) that describe the biomechanics of human joints. Finite element modeling is becoming a standard tool in industrial applications. In highly complex problems such as those found in biomechanics research, however, the full potential of FEMs is just beginning to be explored, due to the absence of precise, high resolution medical data and the difficulties encountered in converting these enormous datasets into a form that is usable in FEMs. With increasing computing speed and memory available, it is now feasible to address these challenges. We address the first by acquiringmore » data with a high resolution C-ray CT scanner and the latter by developing semi-automated method for generating the volumetric meshes used in the FEM. Issues related to tomographic reconstruction, volume segmentation, the use of extracted surfaces to generate volumetric hexahedral meshes, and applications of the FEM are described.« less
CometQuest: A Rosetta Adventure
NASA Technical Reports Server (NTRS)
Leon, Nancy J.; Fisher, Diane K.; Novati, Alexander; Chmielewski, Artur B.; Fitzpatrick, Austin J.; Angrum, Andrea
2012-01-01
This software is a higher-performance implementation of tiled WMS, with integral support for KML and time-varying data. This software is compliant with the Open Geospatial WMS standard, and supports KML natively as a WMS return type, including support for the time attribute. Regionated KML wrappers are generated that match the existing tiled WMS dataset. Ping and JPG formats are supported, and the software is implemented as an Apache 2.0 module that supports a threading execution model that is capable of supporting very high request rates. The module intercepts and responds to WMS requests that match certain patterns and returns the existing tiles. If a KML format that matches an existing pyramid and tile dataset is requested, regionated KML is generated and returned to the requesting application. In addition, KML requests that do not match the existing tile datasets generate a KML response that includes the corresponding JPG WMS request, effectively adding KML support to a backing WMS server.
NASA Technical Reports Server (NTRS)
Plesea, Lucian
2012-01-01
This software is a higher-performance implementation of tiled WMS, with integral support for KML and time-varying data. This software is compliant with the Open Geospatial WMS standard, and supports KML natively as a WMS return type, including support for the time attribute. Regionated KML wrappers are generated that match the existing tiled WMS dataset. Ping and JPG formats are supported, and the software is implemented as an Apache 2.0 module that supports a threading execution model that is capable of supporting very high request rates. The module intercepts and responds to WMS requests that match certain patterns and returns the existing tiles. If a KML format that matches an existing pyramid and tile dataset is requested, regionated KML is generated and returned to the requesting application. In addition, KML requests that do not match the existing tile datasets generate a KML response that includes the corresponding JPG WMS request, effectively adding KML support to a backing WMS server.
Software ion scan functions in analysis of glycomic and lipidomic MS/MS datasets.
Haramija, Marko
2018-03-01
Hardware ion scan functions unique to tandem mass spectrometry (MS/MS) mode of data acquisition, such as precursor ion scan (PIS) and neutral loss scan (NLS), are important for selective extraction of key structural data from complex MS/MS spectra. However, their software counterparts, software ion scan (SIS) functions, are still not regularly available. Software ion scan functions can be easily coded for additional functionalities, such as software multiple precursor ion scan, software no ion scan, and software variable ion scan functions. These are often necessary, since they allow more efficient analysis of complex MS/MS datasets, often encountered in glycomics and lipidomics. Software ion scan functions can be easily coded by using modern script languages and can be independent of instrument manufacturer. Here we demonstrate the utility of SIS functions on a medium-size glycomic MS/MS dataset. Knowledge of sample properties, as well as of diagnostic and conditional diagnostic ions crucial for data analysis, was needed. Based on the tables constructed with the output data from the SIS functions performed, a detailed analysis of a complex MS/MS glycomic dataset could be carried out in a quick, accurate, and efficient manner. Glycomic research is progressing slowly, and with respect to the MS experiments, one of the key obstacles for moving forward is the lack of appropriate bioinformatic tools necessary for fast analysis of glycomic MS/MS datasets. Adding novel SIS functionalities to the glycomic MS/MS toolbox has a potential to significantly speed up the glycomic data analysis process. Similar tools are useful for analysis of lipidomic MS/MS datasets as well, as will be discussed briefly. Copyright © 2017 John Wiley & Sons, Ltd.
On the Multi-Modal Object Tracking and Image Fusion Using Unsupervised Deep Learning Methodologies
NASA Astrophysics Data System (ADS)
LaHaye, N.; Ott, J.; Garay, M. J.; El-Askary, H. M.; Linstead, E.
2017-12-01
The number of different modalities of remote-sensors has been on the rise, resulting in large datasets with different complexity levels. Such complex datasets can provide valuable information separately, yet there is a bigger value in having a comprehensive view of them combined. As such, hidden information can be deduced through applying data mining techniques on the fused data. The curse of dimensionality of such fused data, due to the potentially vast dimension space, hinders our ability to have deep understanding of them. This is because each dataset requires a user to have instrument-specific and dataset-specific knowledge for optimum and meaningful usage. Once a user decides to use multiple datasets together, deeper understanding of translating and combining these datasets in a correct and effective manner is needed. Although there exists data centric techniques, generic automated methodologies that can potentially solve this problem completely don't exist. Here we are developing a system that aims to gain a detailed understanding of different data modalities. Such system will provide an analysis environment that gives the user useful feedback and can aid in research tasks. In our current work, we show the initial outputs our system implementation that leverages unsupervised deep learning techniques so not to burden the user with the task of labeling input data, while still allowing for a detailed machine understanding of the data. Our goal is to be able to track objects, like cloud systems or aerosols, across different image-like data-modalities. The proposed system is flexible, scalable and robust to understand complex likenesses within multi-modal data in a similar spatio-temporal range, and also to be able to co-register and fuse these images when needed.
Spatiotemporal Permutation Entropy as a Measure for Complexity of Cardiac Arrhythmia
NASA Astrophysics Data System (ADS)
Schlemmer, Alexander; Berg, Sebastian; Lilienkamp, Thomas; Luther, Stefan; Parlitz, Ulrich
2018-05-01
Permutation entropy (PE) is a robust quantity for measuring the complexity of time series. In the cardiac community it is predominantly used in the context of electrocardiogram (ECG) signal analysis for diagnoses and predictions with a major application found in heart rate variability parameters. In this article we are combining spatial and temporal PE to form a spatiotemporal PE that captures both, complexity of spatial structures and temporal complexity at the same time. We demonstrate that the spatiotemporal PE (STPE) quantifies complexity using two datasets from simulated cardiac arrhythmia and compare it to phase singularity analysis and spatial PE (SPE). These datasets simulate ventricular fibrillation (VF) on a two-dimensional and a three-dimensional medium using the Fenton-Karma model. We show that SPE and STPE are robust against noise and demonstrate its usefulness for extracting complexity features at different spatial scales.
Principal component analysis of the cytokine and chemokine response to human traumatic brain injury.
Helmy, Adel; Antoniades, Chrystalina A; Guilfoyle, Mathew R; Carpenter, Keri L H; Hutchinson, Peter J
2012-01-01
There is a growing realisation that neuro-inflammation plays a fundamental role in the pathology of Traumatic Brain Injury (TBI). This has led to the search for biomarkers that reflect these underlying inflammatory processes using techniques such as cerebral microdialysis. The interpretation of such biomarker data has been limited by the statistical methods used. When analysing data of this sort the multiple putative interactions between mediators need to be considered as well as the timing of production and high degree of statistical co-variance in levels of these mediators. Here we present a cytokine and chemokine dataset from human brain following human traumatic brain injury and use principal component analysis and partial least squares discriminant analysis to demonstrate the pattern of production following TBI, distinct phases of the humoral inflammatory response and the differing patterns of response in brain and in peripheral blood. This technique has the added advantage of making no assumptions about the Relative Recovery (RR) of microdialysis derived parameters. Taken together these techniques can be used in complex microdialysis datasets to summarise the data succinctly and generate hypotheses for future study.
Robust Pedestrian Classification Based on Hierarchical Kernel Sparse Representation.
Sun, Rui; Zhang, Guanghai; Yan, Xiaoxing; Gao, Jun
2016-08-16
Vision-based pedestrian detection has become an active topic in computer vision and autonomous vehicles. It aims at detecting pedestrians appearing ahead of the vehicle using a camera so that autonomous vehicles can assess the danger and take action. Due to varied illumination and appearance, complex background and occlusion pedestrian detection in outdoor environments is a difficult problem. In this paper, we propose a novel hierarchical feature extraction and weighted kernel sparse representation model for pedestrian classification. Initially, hierarchical feature extraction based on a CENTRIST descriptor is used to capture discriminative structures. A max pooling operation is used to enhance the invariance of varying appearance. Then, a kernel sparse representation model is proposed to fully exploit the discrimination information embedded in the hierarchical local features, and a Gaussian weight function as the measure to effectively handle the occlusion in pedestrian images. Extensive experiments are conducted on benchmark databases, including INRIA, Daimler, an artificially generated dataset and a real occluded dataset, demonstrating the more robust performance of the proposed method compared to state-of-the-art pedestrian classification methods.
Welham, Nathan V.; Ling, Changying; Dawson, John A.; Kendziorski, Christina; Thibeault, Susan L.; Yamashita, Masaru
2015-01-01
The vocal fold (VF) mucosa confers elegant biomechanical function for voice production but is susceptible to scar formation following injury. Current understanding of VF wound healing is hindered by a paucity of data and is therefore often generalized from research conducted in skin and other mucosal systems. Here, using a previously validated rat injury model, expression microarray technology and an empirical Bayes analysis approach, we generated a VF-specific transcriptome dataset to better capture the system-level complexity of wound healing in this specialized tissue. We measured differential gene expression at 3, 14 and 60 days post-injury compared to experimentally naïve controls, pursued functional enrichment analyses to refine and add greater biological definition to the previously proposed temporal phases of VF wound healing, and validated the expression and localization of a subset of previously unidentified repair- and regeneration-related genes at the protein level. Our microarray dataset is a resource for the wider research community and has the potential to stimulate new hypotheses and avenues of investigation, improve biological and mechanistic insight, and accelerate the identification of novel therapeutic targets. PMID:25592437
Robust Pedestrian Classification Based on Hierarchical Kernel Sparse Representation
Sun, Rui; Zhang, Guanghai; Yan, Xiaoxing; Gao, Jun
2016-01-01
Vision-based pedestrian detection has become an active topic in computer vision and autonomous vehicles. It aims at detecting pedestrians appearing ahead of the vehicle using a camera so that autonomous vehicles can assess the danger and take action. Due to varied illumination and appearance, complex background and occlusion pedestrian detection in outdoor environments is a difficult problem. In this paper, we propose a novel hierarchical feature extraction and weighted kernel sparse representation model for pedestrian classification. Initially, hierarchical feature extraction based on a CENTRIST descriptor is used to capture discriminative structures. A max pooling operation is used to enhance the invariance of varying appearance. Then, a kernel sparse representation model is proposed to fully exploit the discrimination information embedded in the hierarchical local features, and a Gaussian weight function as the measure to effectively handle the occlusion in pedestrian images. Extensive experiments are conducted on benchmark databases, including INRIA, Daimler, an artificially generated dataset and a real occluded dataset, demonstrating the more robust performance of the proposed method compared to state-of-the-art pedestrian classification methods. PMID:27537888
Goossens, Dirk; Moens, Lotte N; Nelis, Eva; Lenaerts, An-Sofie; Glassee, Wim; Kalbe, Andreas; Frey, Bruno; Kopal, Guido; De Jonghe, Peter; De Rijk, Peter; Del-Favero, Jurgen
2009-03-01
We evaluated multiplex PCR amplification as a front-end for high-throughput sequencing, to widen the applicability of massive parallel sequencers for the detailed analysis of complex genomes. Using multiplex PCR reactions, we sequenced the complete coding regions of seven genes implicated in peripheral neuropathies in 40 individuals on a GS-FLX genome sequencer (Roche). The resulting dataset showed highly specific and uniform amplification. Comparison of the GS-FLX sequencing data with the dataset generated by Sanger sequencing confirmed the detection of all variants present and proved the sensitivity of the method for mutation detection. In addition, we showed that we could exploit the multiplexed PCR amplicons to determine individual copy number variation (CNV), increasing the spectrum of detected variations to both genetic and genomic variants. We conclude that our straightforward procedure substantially expands the applicability of the massive parallel sequencers for sequencing projects of a moderate number of amplicons (50-500) with typical applications in resequencing exons in positional or functional candidate regions and molecular genetic diagnostics. 2008 Wiley-Liss, Inc.
PolyCheck: Dynamic Verification of Iteration Space Transformations on Affine Programs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bao, Wenlei; Krishnamoorthy, Sriram; Pouchet, Louis-noel
2016-01-11
High-level compiler transformations, especially loop transformations, are widely recognized as critical optimizations to restructure programs to improve data locality and expose parallelism. Guaranteeing the correctness of program transformations is essential, and to date three main approaches have been developed: proof of equivalence of affine programs, matching the execution traces of programs, and checking bit-by-bit equivalence of the outputs of the programs. Each technique suffers from limitations in either the kind of transformations supported, space complexity, or the sensitivity to the testing dataset. In this paper, we take a novel approach addressing all three limitations to provide an automatic bug checkermore » to verify any iteration reordering transformations on affine programs, including non-affine transformations, with space consumption proportional to the original program data, and robust to arbitrary datasets of a given size. We achieve this by exploiting the structure of affine program control- and data-flow to generate at compile-time lightweight checker code to be executed within the transformed program. Experimental results assess the correctness and effectiveness of our method, and its increased coverage over previous approaches.« less
Assessing the accuracy and stability of variable selection ...
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used, or stepwise procedures are employed which iteratively add/remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating dataset consists of the good/poor condition of n=1365 stream survey sites from the 2008/2009 National Rivers and Stream Assessment, and a large set (p=212) of landscape features from the StreamCat dataset. Two types of RF models are compared: a full variable set model with all 212 predictors, and a reduced variable set model selected using a backwards elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors, and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substanti
Lo, Benjamin W Y; Fukuda, Hitoshi; Angle, Mark; Teitelbaum, Jeanne; Macdonald, R Loch; Farrokhyar, Forough; Thabane, Lehana; Levine, Mitchell A H
2016-01-01
Classification and regression tree analysis involves the creation of a decision tree by recursive partitioning of a dataset into more homogeneous subgroups. Thus far, there is scarce literature on using this technique to create clinical prediction tools for aneurysmal subarachnoid hemorrhage (SAH). The classification and regression tree analysis technique was applied to the multicenter Tirilazad database (3551 patients) in order to create the decision-making algorithm. In order to elucidate prognostic subgroups in aneurysmal SAH, neurologic, systemic, and demographic factors were taken into account. The dependent variable used for analysis was the dichotomized Glasgow Outcome Score at 3 months. Classification and regression tree analysis revealed seven prognostic subgroups. Neurological grade, occurrence of post-admission stroke, occurrence of post-admission fever, and age represented the explanatory nodes of this decision tree. Split sample validation revealed classification accuracy of 79% for the training dataset and 77% for the testing dataset. In addition, the occurrence of fever at 1-week post-aneurysmal SAH is associated with increased odds of post-admission stroke (odds ratio: 1.83, 95% confidence interval: 1.56-2.45, P < 0.01). A clinically useful classification tree was generated, which serves as a prediction tool to guide bedside prognostication and clinical treatment decision making. This prognostic decision-making algorithm also shed light on the complex interactions between a number of risk factors in determining outcome after aneurysmal SAH.
Tsareva, Daria A; Osolodkin, Dmitry I; Shulga, Dmitry A; Oliferenko, Alexander A; Pisarev, Sergey A; Palyulin, Vladimir A; Zefirov, Nikolay S
2011-03-14
Two fast empirical charge models, Kirchhoff Charge Model (KCM) and Dynamic Electronegativity Relaxation (DENR), had been developed in our laboratory previously for widespread use in drug design research. Both models are based on the electronegativity relaxation principle (Adv. Quantum Chem. 2006, 51, 139-156) and parameterized against ab initio dipole/quadrupole moments and molecular electrostatic potentials, respectively. As 3D QSAR studies comprise one of the most important fields of applied molecular modeling, they naturally have become the first topic to test our charges and thus, indirectly, the assumptions laid down to the charge model theories in a case study. Here these charge models are used in CoMFA and CoMSIA methods and tested on five glycogen synthase kinase 3 (GSK-3) inhibitor datasets, relevant to our current studies, and one steroid dataset. For comparison, eight other different charge models, ab initio through semiempirical and empirical, were tested on the same datasets. The complex analysis including correlation and cross-validation, charges robustness and predictability, as well as visual interpretability of 3D contour maps generated was carried out. As a result, our new electronegativity relaxation-based models both have shown stable results, which in conjunction with other benefits discussed render them suitable for building reliable 3D QSAR models. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Antibody-protein interactions: benchmark datasets and prediction tools evaluation
Ponomarenko, Julia V; Bourne, Philip E
2007-01-01
Background The ability to predict antibody binding sites (aka antigenic determinants or B-cell epitopes) for a given protein is a precursor to new vaccine design and diagnostics. Among the various methods of B-cell epitope identification X-ray crystallography is one of the most reliable methods. Using these experimental data computational methods exist for B-cell epitope prediction. As the number of structures of antibody-protein complexes grows, further interest in prediction methods using 3D structure is anticipated. This work aims to establish a benchmark for 3D structure-based epitope prediction methods. Results Two B-cell epitope benchmark datasets inferred from the 3D structures of antibody-protein complexes were defined. The first is a dataset of 62 representative 3D structures of protein antigens with inferred structural epitopes. The second is a dataset of 82 structures of antibody-protein complexes containing different structural epitopes. Using these datasets, eight web-servers developed for antibody and protein binding sites prediction have been evaluated. In no method did performance exceed a 40% precision and 46% recall. The values of the area under the receiver operating characteristic curve for the evaluated methods were about 0.6 for ConSurf, DiscoTope, and PPI-PRED methods and above 0.65 but not exceeding 0.70 for protein-protein docking methods when the best of the top ten models for the bound docking were considered; the remaining methods performed close to random. The benchmark datasets are included as a supplement to this paper. Conclusion It may be possible to improve epitope prediction methods through training on datasets which include only immune epitopes and through utilizing more features characterizing epitopes, for example, the evolutionary conservation score. Notwithstanding, overall poor performance may reflect the generality of antigenicity and hence the inability to decipher B-cell epitopes as an intrinsic feature of the protein. It is an open question as to whether ultimately discriminatory features can be found. PMID:17910770
Annotations of Mexican bullfighting videos for semantic index
NASA Astrophysics Data System (ADS)
Montoya Obeso, Abraham; Oropesa Morales, Lester Arturo; Fernando Vázquez, Luis; Cocolán Almeda, Sara Ivonne; Stoian, Andrei; García Vázquez, Mireya Saraí; Zamudio Fuentes, Luis Miguel; Montiel Perez, Jesús Yalja; de la O Torres, Saul; Ramírez Acosta, Alejandro Alvaro
2015-09-01
The video annotation is important for web indexing and browsing systems. Indeed, in order to evaluate the performance of video query and mining techniques, databases with concept annotations are required. Therefore, it is necessary generate a database with a semantic indexing that represents the digital content of the Mexican bullfighting atmosphere. This paper proposes a scheme to make complex annotations in a video in the frame of multimedia search engine project. Each video is partitioned using our segmentation algorithm that creates shots of different length and different number of frames. In order to make complex annotations about the video, we use ELAN software. The annotations are done in two steps: First, we take note about the whole content in each shot. Second, we describe the actions as parameters of the camera like direction, position and deepness. As a consequence, we obtain a more complete descriptor of every action. In both cases we use the concepts of the TRECVid 2014 dataset. We also propose new concepts. This methodology allows to generate a database with the necessary information to create descriptors and algorithms capable to detect actions to automatically index and classify new bullfighting multimedia content.
Greenberg, David A; Zhang, Junying; Shmulewitz, Dvora; Strug, Lisa J; Zimmerman, Regina; Singh, Veena; Marathe, Sudhir
2005-12-30
The Genetic Analysis Workshop 14 simulated dataset was designed 1) To test the ability to find genes related to a complex disease (such as alcoholism). Such a disease may be given a variety of definitions by different investigators, have associated endophenotypes that are common in the general population, and is likely to be not one disease but a heterogeneous collection of clinically similar, but genetically distinct, entities. 2) To observe the effect on genetic analysis and gene discovery of a complex set of gene x gene interactions. 3) To allow comparison of microsatellite vs. large-scale single-nucleotide polymorphism (SNP) data. 4) To allow testing of association to identify the disease gene and the effect of moderate marker x marker linkage disequilibrium. 5) To observe the effect of different ascertainment/disease definition schemes on the analysis. Data was distributed in two forms. Data distributed to participants contained about 1,000 SNPs and 400 microsatellite markers. Internet-obtainable data consisted of a finer 10,000 SNP map, which also contained data on controls. While disease characteristics and parameters were constant, four "studies" used varying ascertainment schemes based on differing beliefs about disease characteristics. One of the studies contained multiplex two- and three-generation pedigrees with at least four affected members. The simulated disease was a psychiatric condition with many associated behaviors (endophenotypes), almost all of which were genetic in origin. The underlying disease model contained four major genes and two modifier genes. The four major genes interacted with each other to produce three different phenotypes, which were themselves heterogeneous. The population parameters were calibrated so that the major genes could be discovered by linkage analysis in most datasets. The association evidence was more difficult to calibrate but was designed to find statistically significant association in 50% of datasets. We also simulated some marker x marker linkage disequilibrium around some of the genes and also in areas without disease genes. We tried two different methods to simulate the linkage disequilibrium.
Rinchai, Darawan; Boughorbel, Sabri; Presnell, Scott; Quinn, Charlie; Chaussabel, Damien
2016-01-01
Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp. PMID:27158452
Consolidating drug data on a global scale using Linked Data.
Jovanovik, Milos; Trajanov, Dimitar
2017-01-21
Drug product data is available on the Web in a distributed fashion. The reasons lie within the regulatory domains, which exist on a national level. As a consequence, the drug data available on the Web are independently curated by national institutions from each country, leaving the data in varying languages, with a varying structure, granularity level and format, on different locations on the Web. Therefore, one of the main challenges in the realm of drug data is the consolidation and integration of large amounts of heterogeneous data into a comprehensive dataspace, for the purpose of developing data-driven applications. In recent years, the adoption of the Linked Data principles has enabled data publishers to provide structured data on the Web and contextually interlink them with other public datasets, effectively de-siloing them. Defining methodological guidelines and specialized tools for generating Linked Data in the drug domain, applicable on a global scale, is a crucial step to achieving the necessary levels of data consolidation and alignment needed for the development of a global dataset of drug product data. This dataset would then enable a myriad of new usage scenarios, which can, for instance, provide insight into the global availability of different drug categories in different parts of the world. We developed a methodology and a set of tools which support the process of generating Linked Data in the drug domain. Using them, we generated the LinkedDrugs dataset by seamlessly transforming, consolidating and publishing high-quality, 5-star Linked Drug Data from twenty-three countries, containing over 248,000 drug products, over 99,000,000 RDF triples and over 278,000 links to generic drugs from the LOD Cloud. Using the linked nature of the dataset, we demonstrate its ability to support advanced usage scenarios in the drug domain. The process of generating the LinkedDrugs dataset demonstrates the applicability of the methodological guidelines and the supporting tools in transforming drug product data from various, independent and distributed sources, into a comprehensive Linked Drug Data dataset. The presented user-centric and analytical usage scenarios over the dataset show the advantages of having a de-siloed, consolidated and comprehensive dataspace of drug data available via the existing infrastructure of the Web.
Zhou, Hanzhi; Elliott, Michael R; Raghunathan, Trivellore E
2016-06-01
Multistage sampling is often employed in survey samples for cost and convenience. However, accounting for clustering features when generating datasets for multiple imputation is a nontrivial task, particularly when, as is often the case, cluster sampling is accompanied by unequal probabilities of selection, necessitating case weights. Thus, multiple imputation often ignores complex sample designs and assumes simple random sampling when generating imputations, even though failing to account for complex sample design features is known to yield biased estimates and confidence intervals that have incorrect nominal coverage. In this article, we extend a recently developed, weighted, finite-population Bayesian bootstrap procedure to generate synthetic populations conditional on complex sample design data that can be treated as simple random samples at the imputation stage, obviating the need to directly model design features for imputation. We develop two forms of this method: one where the probabilities of selection are known at the first and second stages of the design, and the other, more common in public use files, where only the final weight based on the product of the two probabilities is known. We show that this method has advantages in terms of bias, mean square error, and coverage properties over methods where sample designs are ignored, with little loss in efficiency, even when compared with correct fully parametric models. An application is made using the National Automotive Sampling System Crashworthiness Data System, a multistage, unequal probability sample of U.S. passenger vehicle crashes, which suffers from a substantial amount of missing data in "Delta-V," a key crash severity measure.
Zhou, Hanzhi; Elliott, Michael R.; Raghunathan, Trivellore E.
2017-01-01
Multistage sampling is often employed in survey samples for cost and convenience. However, accounting for clustering features when generating datasets for multiple imputation is a nontrivial task, particularly when, as is often the case, cluster sampling is accompanied by unequal probabilities of selection, necessitating case weights. Thus, multiple imputation often ignores complex sample designs and assumes simple random sampling when generating imputations, even though failing to account for complex sample design features is known to yield biased estimates and confidence intervals that have incorrect nominal coverage. In this article, we extend a recently developed, weighted, finite-population Bayesian bootstrap procedure to generate synthetic populations conditional on complex sample design data that can be treated as simple random samples at the imputation stage, obviating the need to directly model design features for imputation. We develop two forms of this method: one where the probabilities of selection are known at the first and second stages of the design, and the other, more common in public use files, where only the final weight based on the product of the two probabilities is known. We show that this method has advantages in terms of bias, mean square error, and coverage properties over methods where sample designs are ignored, with little loss in efficiency, even when compared with correct fully parametric models. An application is made using the National Automotive Sampling System Crashworthiness Data System, a multistage, unequal probability sample of U.S. passenger vehicle crashes, which suffers from a substantial amount of missing data in “Delta-V,” a key crash severity measure. PMID:29226161
Crystal cryocooling distorts conformational heterogeneity in a model Michaelis complex of DHFR
Keedy, Daniel A.; van den Bedem, Henry; Sivak, David A.; Petsko, Gregory A.; Ringe, Dagmar; Wilson, Mark A.; Fraser, James S.
2014-01-01
Summary Most macromolecular X-ray structures are determined from cryocooled crystals, but it is unclear whether cryocooling distorts functionally relevant flexibility. Here we compare independently acquired pairs of high-resolution datasets of a model Michaelis complex of dihydrofolate reductase (DHFR), collected by separate groups at both room and cryogenic temperatures. These datasets allow us to isolate the differences between experimental procedures and between temperatures. Our analyses of multiconformer models and time-averaged ensembles suggest that cryocooling suppresses and otherwise modifies sidechain and mainchain conformational heterogeneity, quenching dynamic contact networks. Despite some idiosyncratic differences, most changes from room temperature to cryogenic temperature are conserved, and likely reflect temperature-dependent solvent remodeling. Both cryogenic datasets point to additional conformations not evident in the corresponding room-temperature datasets, suggesting that cryocooling does not merely trap pre-existing conformational heterogeneity. Our results demonstrate that crystal cryocooling consistently distorts the energy landscape of DHFR, a paragon for understanding functional protein dynamics. PMID:24882744
Privacy-Preserving Integration of Medical Data : A Practical Multiparty Private Set Intersection.
Miyaji, Atsuko; Nakasho, Kazuhisa; Nishida, Shohei
2017-03-01
Medical data are often maintained by different organizations. However, detailed analyses sometimes require these datasets to be integrated without violating patient or commercial privacy. Multiparty Private Set Intersection (MPSI), which is an important privacy-preserving protocol, computes an intersection of multiple private datasets. This approach ensures that only designated parties can identify the intersection. In this paper, we propose a practical MPSI that satisfies the following requirements: The size of the datasets maintained by the different parties is independent of the others, and the computational complexity of the dataset held by each party is independent of the number of parties. Our MPSI is based on the use of an outsourcing provider, who has no knowledge of the data inputs or outputs. This reduces the computational complexity. The performance of the proposed MPSI is evaluated by implementing a prototype on a virtual private network to enable parallel computation in multiple threads. Our protocol is confirmed to be more efficient than comparable existing approaches.
Design of an Evolutionary Approach for Intrusion Detection
2013-01-01
A novel evolutionary approach is proposed for effective intrusion detection based on benchmark datasets. The proposed approach can generate a pool of noninferior individual solutions and ensemble solutions thereof. The generated ensembles can be used to detect the intrusions accurately. For intrusion detection problem, the proposed approach could consider conflicting objectives simultaneously like detection rate of each attack class, error rate, accuracy, diversity, and so forth. The proposed approach can generate a pool of noninferior solutions and ensembles thereof having optimized trade-offs values of multiple conflicting objectives. In this paper, a three-phase, approach is proposed to generate solutions to a simple chromosome design in the first phase. In the first phase, a Pareto front of noninferior individual solutions is approximated. In the second phase of the proposed approach, the entire solution set is further refined to determine effective ensemble solutions considering solution interaction. In this phase, another improved Pareto front of ensemble solutions over that of individual solutions is approximated. The ensemble solutions in improved Pareto front reported improved detection results based on benchmark datasets for intrusion detection. In the third phase, a combination method like majority voting method is used to fuse the predictions of individual solutions for determining prediction of ensemble solution. Benchmark datasets, namely, KDD cup 1999 and ISCX 2012 dataset, are used to demonstrate and validate the performance of the proposed approach for intrusion detection. The proposed approach can discover individual solutions and ensemble solutions thereof with a good support and a detection rate from benchmark datasets (in comparison with well-known ensemble methods like bagging and boosting). In addition, the proposed approach is a generalized classification approach that is applicable to the problem of any field having multiple conflicting objectives, and a dataset can be represented in the form of labelled instances in terms of its features. PMID:24376390
A multi-strategy approach to informative gene identification from gene expression data.
Liu, Ziying; Phan, Sieu; Famili, Fazel; Pan, Youlian; Lenferink, Anne E G; Cantin, Christiane; Collins, Catherine; O'Connor-McCourt, Maureen D
2010-02-01
An unsupervised multi-strategy approach has been developed to identify informative genes from high throughput genomic data. Several statistical methods have been used in the field to identify differentially expressed genes. Since different methods generate different lists of genes, it is very challenging to determine the most reliable gene list and the appropriate method. This paper presents a multi-strategy method, in which a combination of several data analysis techniques are applied to a given dataset and a confidence measure is established to select genes from the gene lists generated by these techniques to form the core of our final selection. The remainder of the genes that form the peripheral region are subject to exclusion or inclusion into the final selection. This paper demonstrates this methodology through its application to an in-house cancer genomics dataset and a public dataset. The results indicate that our method provides more reliable list of genes, which are validated using biological knowledge, biological experiments, and literature search. We further evaluated our multi-strategy method by consolidating two pairs of independent datasets, each pair is for the same disease, but generated by different labs using different platforms. The results showed that our method has produced far better results.
Integrative sparse principal component analysis of gene expression data.
Liu, Mengque; Fan, Xinyan; Fang, Kuangnan; Zhang, Qingzhao; Ma, Shuangge
2017-12-01
In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance. © 2017 WILEY PERIODICALS, INC.
Liluashvili, Vaja; Kalayci, Selim; Fluder, Eugene; Wilson, Manda; Gabow, Aaron
2017-01-01
Abstract Visualizations of biomolecular networks assist in systems-level data exploration in many cellular processes. Data generated from high-throughput experiments increasingly inform these networks, yet current tools do not adequately scale with concomitant increase in their size and complexity. We present an open source software platform, interactome-CAVE (iCAVE), for visualizing large and complex biomolecular interaction networks in 3D. Users can explore networks (i) in 3D using a desktop, (ii) in stereoscopic 3D using 3D-vision glasses and a desktop, or (iii) in immersive 3D within a CAVE environment. iCAVE introduces 3D extensions of known 2D network layout, clustering, and edge-bundling algorithms, as well as new 3D network layout algorithms. Furthermore, users can simultaneously query several built-in databases within iCAVE for network generation or visualize their own networks (e.g., disease, drug, protein, metabolite). iCAVE has modular structure that allows rapid development by addition of algorithms, datasets, or features without affecting other parts of the code. Overall, iCAVE is the first freely available open source tool that enables 3D (optionally stereoscopic or immersive) visualizations of complex, dense, or multi-layered biomolecular networks. While primarily designed for researchers utilizing biomolecular networks, iCAVE can assist researchers in any field. PMID:28814063
Liluashvili, Vaja; Kalayci, Selim; Fluder, Eugene; Wilson, Manda; Gabow, Aaron; Gümüs, Zeynep H
2017-08-01
Visualizations of biomolecular networks assist in systems-level data exploration in many cellular processes. Data generated from high-throughput experiments increasingly inform these networks, yet current tools do not adequately scale with concomitant increase in their size and complexity. We present an open source software platform, interactome-CAVE (iCAVE), for visualizing large and complex biomolecular interaction networks in 3D. Users can explore networks (i) in 3D using a desktop, (ii) in stereoscopic 3D using 3D-vision glasses and a desktop, or (iii) in immersive 3D within a CAVE environment. iCAVE introduces 3D extensions of known 2D network layout, clustering, and edge-bundling algorithms, as well as new 3D network layout algorithms. Furthermore, users can simultaneously query several built-in databases within iCAVE for network generation or visualize their own networks (e.g., disease, drug, protein, metabolite). iCAVE has modular structure that allows rapid development by addition of algorithms, datasets, or features without affecting other parts of the code. Overall, iCAVE is the first freely available open source tool that enables 3D (optionally stereoscopic or immersive) visualizations of complex, dense, or multi-layered biomolecular networks. While primarily designed for researchers utilizing biomolecular networks, iCAVE can assist researchers in any field. © The Authors 2017. Published by Oxford University Press.
Deformable Image Registration based on Similarity-Steered CNN Regression.
Cao, Xiaohuan; Yang, Jianhua; Zhang, Jun; Nie, Dong; Kim, Min-Jeong; Wang, Qian; Shen, Dinggang
2017-09-01
Existing deformable registration methods require exhaustively iterative optimization, along with careful parameter tuning, to estimate the deformation field between images. Although some learning-based methods have been proposed for initiating deformation estimation, they are often template-specific and not flexible in practical use. In this paper, we propose a convolutional neural network (CNN) based regression model to directly learn the complex mapping from the input image pair (i.e., a pair of template and subject) to their corresponding deformation field. Specifically, our CNN architecture is designed in a patch-based manner to learn the complex mapping from the input patch pairs to their respective deformation field. First, the equalized active-points guided sampling strategy is introduced to facilitate accurate CNN model learning upon a limited image dataset. Then, the similarity-steered CNN architecture is designed, where we propose to add the auxiliary contextual cue, i.e., the similarity between input patches, to more directly guide the learning process. Experiments on different brain image datasets demonstrate promising registration performance based on our CNN model. Furthermore, it is found that the trained CNN model from one dataset can be successfully transferred to another dataset, although brain appearances across datasets are quite variable.
Tam, Angela; Dansereau, Christian; Badhwar, AmanPreet; Orban, Pierre; Belleville, Sylvie; Chertkow, Howard; Dagher, Alain; Hanganu, Alexandru; Monchi, Oury; Rosa-Neto, Pedro; Shmuel, Amir; Breitner, John; Bellec, Pierre
2016-12-01
We present group eight resolutions of brain parcellations for clusters generated from resting-state functional magnetic resonance images for 99 cognitively normal elderly persons and 129 patients with mild cognitive impairment, pooled from four independent datasets. This dataset was generated as part of the following study: Common Effects of Amnestic Mild Cognitive Impairment on Resting-State Connectivity Across Four Independent Studies (Tam et al., 2015) [1]. The brain parcellations have been registered to both symmetric and asymmetric MNI brain templates and generated using a method called bootstrap analysis of stable clusters (BASC) (Bellec et al., 2010) [2]. We present two variants of these parcellations. One variant contains bihemisphereic parcels (4, 6, 12, 22, 33, 65, 111, and 208 total parcels across eight resolutions). The second variant contains spatially connected regions of interest (ROIs) that span only one hemisphere (10, 17, 30, 51, 77, 199, and 322 total ROIs across eight resolutions). We also present maps illustrating functional connectivity differences between patients and controls for four regions of interest (striatum, dorsal prefrontal cortex, middle temporal lobe, and medial frontal cortex). The brain parcels and associated statistical maps have been publicly released as 3D volumes, available in .mnc and .nii file formats on figshare and on Neurovault. Finally, the code used to generate this dataset is available on Github.
Development and application of GIS-based PRISM integration through a plugin approach
NASA Astrophysics Data System (ADS)
Lee, Woo-Seop; Chun, Jong Ahn; Kang, Kwangmin
2014-05-01
A PRISM (Parameter-elevation Regressions on Independent Slopes Model) QGIS-plugin was developed on Quantum GIS platform in this study. This Quantum GIS plugin system provides user-friendly graphic user interfaces (GUIs) so that users can obtain gridded meteorological data of high resolutions (1 km × 1 km). Also, this software is designed to run on a personal computer so that it does not require an internet access or a sophisticated computer system. This module is a user-friendly system that a user can generate PRISM data with ease. The proposed PRISM QGIS-plugin is a hybrid statistical-geographic model system that uses coarse resolution datasets (APHRODITE datasets in this study) with digital elevation data to generate the fine-resolution gridded precipitation. To validate the performance of the software, Prek Thnot River Basin in Kandal, Cambodia is selected for application. Overall statistical analysis shows promising outputs generated by the proposed plugin. Error measures such as RMSE (Root Mean Square Error) and MAPE (Mean Absolute Percentage Error) were used to evaluate the performance of the developed PRISM QGIS-plugin. Evaluation results using RMSE and MAPE were 2.76 mm and 4.2%, respectively. This study suggested that the plugin can be used to generate high resolution precipitation datasets for hydrological and climatological studies at a watershed where observed weather datasets are limited.
Liu, Yuanchao; Liu, Ming; Wang, Xin
2015-01-01
The objective of text clustering is to divide document collections into clusters based on the similarity between documents. In this paper, an extension-based feature modeling approach towards semantically sensitive text clustering is proposed along with the corresponding feature space construction and similarity computation method. By combining the similarity in traditional feature space and that in extension space, the adverse effects of the complexity and diversity of natural language can be addressed and clustering semantic sensitivity can be improved correspondingly. The generated clusters can be organized using different granularities. The experimental evaluations on well-known clustering algorithms and datasets have verified the effectiveness of our approach.
Liu, Yuanchao; Liu, Ming; Wang, Xin
2015-01-01
The objective of text clustering is to divide document collections into clusters based on the similarity between documents. In this paper, an extension-based feature modeling approach towards semantically sensitive text clustering is proposed along with the corresponding feature space construction and similarity computation method. By combining the similarity in traditional feature space and that in extension space, the adverse effects of the complexity and diversity of natural language can be addressed and clustering semantic sensitivity can be improved correspondingly. The generated clusters can be organized using different granularities. The experimental evaluations on well-known clustering algorithms and datasets have verified the effectiveness of our approach. PMID:25794172
A natural product based DOS library of hybrid systems.
Prabhu, Ganesh; Agarwal, Shalini; Sharma, Vijeta; Madurkar, Sanjay M; Munshi, Parthapratim; Singh, Shailja; Sen, Subhabrata
2015-05-05
Here we described a natural product inspired modular DOS strategy for the synthesis of a library of hybrid systems that are structurally and stereochemically disparate. The main scaffold is a pyrroloisoquinoline motif, that is synthesized from tandem Pictet-Spengler lactamization. The structural diversity is generated via "privileged scaffolds" that are attached at the appropriate site of the motif. Screening of the library compounds for their antiplasmodial activity against chloroquine sensitive 3D7 cells indicated few compounds with moderate activity (20-50 μM). A systematic comparison of structural intricacy between the library members and a natural product dataset obtained from ZINC(®) revealed comparable complexity. Copyright © 2015 Elsevier Masson SAS. All rights reserved.
Grainger, Matthew James; Aramyan, Lusine; Piras, Simone; Quested, Thomas Edward; Righi, Simone; Setti, Marco; Vittuari, Matteo; Stewart, Gavin Bruce
2018-01-01
Food waste from households contributes the greatest proportion to total food waste in developed countries. Therefore, food waste reduction requires an understanding of the socio-economic (contextual and behavioural) factors that lead to its generation within the household. Addressing such a complex subject calls for sound methodological approaches that until now have been conditioned by the large number of factors involved in waste generation, by the lack of a recognised definition, and by limited available data. This work contributes to food waste generation literature by using one of the largest available datasets that includes data on the objective amount of avoidable household food waste, along with information on a series of socio-economic factors. In order to address one aspect of the complexity of the problem, machine learning algorithms (random forests and boruta) for variable selection integrated with linear modelling, model selection and averaging are implemented. Model selection addresses model structural uncertainty, which is not routinely considered in assessments of food waste in literature. The main drivers of food waste in the home selected in the most parsimonious models include household size, the presence of fussy eaters, employment status, home ownership status, and the local authority. Results, regardless of which variable set the models are run on, point toward large households as being a key target element for food waste reduction interventions.
Aramyan, Lusine; Piras, Simone; Quested, Thomas Edward; Righi, Simone; Setti, Marco; Vittuari, Matteo; Stewart, Gavin Bruce
2018-01-01
Food waste from households contributes the greatest proportion to total food waste in developed countries. Therefore, food waste reduction requires an understanding of the socio-economic (contextual and behavioural) factors that lead to its generation within the household. Addressing such a complex subject calls for sound methodological approaches that until now have been conditioned by the large number of factors involved in waste generation, by the lack of a recognised definition, and by limited available data. This work contributes to food waste generation literature by using one of the largest available datasets that includes data on the objective amount of avoidable household food waste, along with information on a series of socio-economic factors. In order to address one aspect of the complexity of the problem, machine learning algorithms (random forests and boruta) for variable selection integrated with linear modelling, model selection and averaging are implemented. Model selection addresses model structural uncertainty, which is not routinely considered in assessments of food waste in literature. The main drivers of food waste in the home selected in the most parsimonious models include household size, the presence of fussy eaters, employment status, home ownership status, and the local authority. Results, regardless of which variable set the models are run on, point toward large households as being a key target element for food waste reduction interventions. PMID:29389949
Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses
Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi
2018-01-01
Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools, index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625
Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses.
Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V; Ma'ayan, Avi
2018-02-27
Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools, index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated 'canned' analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools.
Ruane, Sara; Raxworthy, Christopher J; Lemmon, Alan R; Lemmon, Emily Moriarty; Burbrink, Frank T
2015-10-12
Using molecular data generated by high throughput next generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86 % of Madagascar's serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species tree approaches. We also examined the individual gene trees in comparison to the 377-locus species tree using the program MetaTree. Using the full anchored dataset under a variety of methods gave us the same, well-supported phylogeny for pseudoxyrhophiines. The African pseudoxyrhophiine Duberria is the sister taxon to the Malagasy pseudoxyrhophiines genera, providing evidence for a monophyletic radiation in Madagascar. In addition, within Madagascar, the two major clades inferred correspond largely to the aglyphous and opisthoglyphous genera, suggesting that feeding specializations associated with tooth venom delivery may have played a major role in the early diversification of this radiation. The comparison of tree topologies from the concatenated and species-tree methods using different datasets indicated the 5-locus dataset cannot beused to infer a correct phylogeny for the pseudoxyrhophiines under any method tested here and that summary statistics methods require 50 or more loci to consistently recover the species-tree inferred using the complete anchored dataset. However, as few as 15 loci may infer the correct topology when using the full coalescent species tree method *BEAST. MetaTree analyses of each gene tree from the Sanger and anchored datasets found that none of the individual gene trees matched the 377-locus species tree, and that no gene trees were identical with respect to topology. Our results suggest that ≥50 loci may be necessary to confidently infer phylogenies when using summaryspecies-tree methods, but that the coalescent-based method *BEAST consistently recovers the same topology using only 15 loci. These results reinforce that datasets with small numbers of markers may result in misleading topologies, and further, that the method of inference used to generate a phylogeny also has a major influence on the number of loci necessary to infer robust species trees.
Scheuch, Matthias; Höper, Dirk; Beer, Martin
2015-03-03
Fuelled by the advent and subsequent development of next generation sequencing technologies, metagenomics became a powerful tool for the analysis of microbial communities both scientifically and diagnostically. The biggest challenge is the extraction of relevant information from the huge sequence datasets generated for metagenomics studies. Although a plethora of tools are available, data analysis is still a bottleneck. To overcome the bottleneck of data analysis, we developed an automated computational workflow called RIEMS - Reliable Information Extraction from Metagenomic Sequence datasets. RIEMS assigns every individual read sequence within a dataset taxonomically by cascading different sequence analyses with decreasing stringency of the assignments using various software applications. After completion of the analyses, the results are summarised in a clearly structured result protocol organised taxonomically. The high accuracy and performance of RIEMS analyses were proven in comparison with other tools for metagenomics data analysis using simulated sequencing read datasets. RIEMS has the potential to fill the gap that still exists with regard to data analysis for metagenomics studies. The usefulness and power of RIEMS for the analysis of genuine sequencing datasets was demonstrated with an early version of RIEMS in 2011 when it was used to detect the orthobunyavirus sequences leading to the discovery of Schmallenberg virus.
Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila
2018-01-01
Abstract The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374
NASA Astrophysics Data System (ADS)
Taylor, M. B.
2009-09-01
The new plotting functionality in version 2.0 of STILTS is described. STILTS is a mature and powerful package for all kinds of table manipulation, and this version adds facilities for generating plots from one or more tables to its existing wide range of non-graphical capabilities. 2- and 3-dimensional scatter plots and 1-dimensional histograms may be generated using highly configurable style parameters. Features include multiple dataset overplotting, variable transparency, 1-, 2- or 3-dimensional symmetric or asymmetric error bars, higher-dimensional visualization using color, and textual point labeling. Vector and bitmapped output formats are supported. The plotting options provide enough flexibility to perform meaningful visualization on datasets from a few points up to tens of millions. Arbitrarily large datasets can be plotted without heavy memory usage.
Carrea, Laura; Embury, Owen; Merchant, Christopher J
2015-11-01
Datasets containing information to locate and identify water bodies have been generated from data locating static-water-bodies with resolution of about 300 m (1/360 ∘ ) recently released by the Land Cover Climate Change Initiative (LC CCI) of the European Space Agency. The LC CCI water-bodies dataset has been obtained from multi-temporal metrics based on time series of the backscattered intensity recorded by ASAR on Envisat between 2005 and 2010. The new derived datasets provide coherently: distance to land, distance to water, water-body identifiers and lake-centre locations. The water-body identifier dataset locates the water bodies assigning the identifiers of the Global Lakes and Wetlands Database (GLWD), and lake centres are defined for in-land waters for which GLWD IDs were determined. The new datasets therefore link recent lake/reservoir/wetlands extent to the GLWD, together with a set of coordinates which locates unambiguously the water bodies in the database. Information on distance-to-land for each water cell and the distance-to-water for each land cell has many potential applications in remote sensing, where the applicability of geophysical retrieval algorithms may be affected by the presence of water or land within a satellite field of view (image pixel). During the generation and validation of the datasets some limitations of the GLWD database and of the LC CCI water-bodies mask have been found. Some examples of the inaccuracies/limitations are presented and discussed. Temporal change in water-body extent is common. Future versions of the LC CCI dataset are planned to represent temporal variation, and this will permit these derived datasets to be updated.
Yang, Guanxue; Wang, Lin; Wang, Xiaofan
2017-06-07
Reconstruction of networks underlying complex systems is one of the most crucial problems in many areas of engineering and science. In this paper, rather than identifying parameters of complex systems governed by pre-defined models or taking some polynomial and rational functions as a prior information for subsequent model selection, we put forward a general framework for nonlinear causal network reconstruction from time-series with limited observations. With obtaining multi-source datasets based on the data-fusion strategy, we propose a novel method to handle nonlinearity and directionality of complex networked systems, namely group lasso nonlinear conditional granger causality. Specially, our method can exploit different sets of radial basis functions to approximate the nonlinear interactions between each pair of nodes and integrate sparsity into grouped variables selection. The performance characteristic of our approach is firstly assessed with two types of simulated datasets from nonlinear vector autoregressive model and nonlinear dynamic models, and then verified based on the benchmark datasets from DREAM3 Challenge4. Effects of data size and noise intensity are also discussed. All of the results demonstrate that the proposed method performs better in terms of higher area under precision-recall curve.
SU-E-T-333: Dosimetric Impact of Rotational Error On the Target Coverage in IMPT Lung Cancer Plans
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rana, S; Zheng, Y
2015-06-15
Purpose: The main purpose of this study was to investigate the impact of rotational (yaw, roll, and pitch) error on the planning target volume (PTV) coverage in lung cancer plans generated by intensity modulated proton therapy (IMPT). Methods: In this retrospective study, computed tomography (CT) dataset of previously treated lung case was used. IMPT plan were generated on the original CT dataset using left-lateral (LL) and posterior-anterior (PA) beams for a total dose of 74 Gy[RBE] with 2 Gy[RBE] per fraction. In order to investigate the dosimetric impact of rotational error, 12 new CT datasets were generated by re-sampling themore » original CT dataset for rotational (roll, yaw, and pitch) angles ranged from −5° to +5°, with an increment of 2.5°. A total of 12 new IMPT plans were generated based on the re-sampled CT datasets using beam parameters identical to the ones in the original IMPT plan. All treatment plans were generated in XiO treatment planning system. The PTV coverage (i.e., dose received by 95% of the PTV volume, D95) in new IMPT plans were then compared with the PTV coverage in the original IMPT plan. Results: Rotational errors caused the reduction in the PTV coverage in all 12 new IMPT plans when compared to the original IMPT lung plan. Specifically, the PTV coverage was reduced by 4.94% to 50.51% for yaw, by 4.04% to 23.74% for roll, and by 5.21% to 46.88% for pitch errors. Conclusion: Unacceptable dosimetric results were observed in new IMPT plans as the PTV coverage was reduced by up to 26.87% and 50.51% for rotational error of 2.5° and 5°, respectively. Further investigation is underway in evaluating the PTV coverage loss in the IMPT lung cancer plans for smaller rotational angle change.« less
The NSF ITR Project: Framework for the National Virtual Observatory
NASA Astrophysics Data System (ADS)
Szalay, A. S.; Williams, R. D.; NVO Collaboration
2002-05-01
Technological advances in telescope and instrument design during the last ten years, coupled with the exponential increase in computer and communications capability, have caused a dramatic and irreversible change in the character of astronomical research. Large-scale surveys of the sky from space and ground are being initiated at wavelengths from radio to x-ray, thereby generating vast amounts of high quality irreplaceable data. The potential for scientific discovery afforded by these new surveys is enormous. Entirely new and unexpected scientific results of major significance will emerge from the combined use of the resulting datasets, science that would not be possible from such sets used singly. However, their large size and complexity require tools and structures to discover the complex phenomena encoded within them. We plan to build the NVO framework both through coordinating diverse efforts already in existence and providing a focus for the development of capabilities that do not yet exist. The NVO we envisage will act as an enabling and coordinating entity to foster the development of further tools, protocols, and collaborations necessary to realize the full scientific potential of large astronomical datasets in the coming decade. The NVO must be able to change and respond to the rapidly evolving world of IT technology. In spite of its underlying complex software, the NVO should be no harder to use for the average astronomer, than today's brick-and-mortar observatories and telescopes. Development of these capabilities will require close interaction and collaboration with the information technology community and other disciplines facing similar challenges. We need to ensure that the tools that we need exist or are built, but we do not duplicate efforts, and rely on relevant experience of others.
Accounting For Uncertainty in The Application Of High Throughput Datasets
The use of high throughput screening (HTS) datasets will need to adequately account for uncertainties in the data generation process and propagate these uncertainties through to ultimate use. Uncertainty arises at multiple levels in the construction of predictors using in vitro ...
Smith, Megan L; Noonan, Brice P; Colston, Timothy J
2017-08-01
Ethiopia is a world biodiversity hotspot and harbours levels of biotic endemism unmatched in the Horn of Africa, largely due to topographic-and thus habitat-complexity, which results from a very active geological and climatic history. Among Ethiopian vertebrate fauna, amphibians harbour the highest levels of endemism, making amphibians a compelling system for the exploration of the impacts of Ethiopia's complex abiotic history on biotic diversification. Grass frogs of the genus Ptychadena are notably diverse in Ethiopia, where they have undergone an evolutionary radiation. We used molecular data and expanded taxon sampling to test for cryptic diversity and to explore diversification patterns in both the highland radiation and two widespread lowland Ptychadena . Species delimitation results support the presence of nine highland species and four lowland species in our dataset, and divergence dating suggests that both geologic events and climatic fluctuations played a complex and confounded role in the diversification of Ptychadena in Ethiopia. We rectify the taxonomy of the endemic P. neumanni species complex, elevating one formally synonymized name and describing three novel taxa. Finally, we describe two novel lowland Ptychadena species that occur in Ethiopia and may be more broadly distributed.
Jia, Peilin; Wang, Lily; Fanous, Ayman H.; Pato, Carlos N.; Edwards, Todd L.; Zhao, Zhongming
2012-01-01
With the recent success of genome-wide association studies (GWAS), a wealth of association data has been accomplished for more than 200 complex diseases/traits, proposing a strong demand for data integration and interpretation. A combinatory analysis of multiple GWAS datasets, or an integrative analysis of GWAS data and other high-throughput data, has been particularly promising. In this study, we proposed an integrative analysis framework of multiple GWAS datasets by overlaying association signals onto the protein-protein interaction network, and demonstrated it using schizophrenia datasets. Building on a dense module search algorithm, we first searched for significantly enriched subnetworks for schizophrenia in each single GWAS dataset and then implemented a discovery-evaluation strategy to identify module genes with consistent association signals. We validated the module genes in an independent dataset, and also examined them through meta-analysis of the related SNPs using multiple GWAS datasets. As a result, we identified 205 module genes with a joint effect significantly associated with schizophrenia; these module genes included a number of well-studied candidate genes such as DISC1, GNA12, GNA13, GNAI1, GPR17, and GRIN2B. Further functional analysis suggested these genes are involved in neuronal related processes. Additionally, meta-analysis found that 18 SNPs in 9 module genes had P meta<1×10−4, including the gene HLA-DQA1 located in the MHC region on chromosome 6, which was reported in previous studies using the largest cohort of schizophrenia patients to date. These results demonstrated our bi-directional network-based strategy is efficient for identifying disease-associated genes with modest signals in GWAS datasets. This approach can be applied to any other complex diseases/traits where multiple GWAS datasets are available. PMID:22792057
de Hoogt, Ronald; Estrada, Marta F; Vidic, Suzana; Davies, Emma J; Osswald, Annika; Barbier, Michael; Santo, Vítor E; Gjerde, Kjersti; van Zoggel, Hanneke J A A; Blom, Sami; Dong, Meng; Närhi, Katja; Boghaert, Erwin; Brito, Catarina; Chong, Yolanda; Sommergruber, Wolfgang; van der Kuip, Heiko; van Weerden, Wytske M; Verschuren, Emmy W; Hickman, John; Graeser, Ralph
2017-11-21
Two-dimensional (2D) culture of cancer cells in vitro does not recapitulate the three-dimensional (3D) architecture, heterogeneity and complexity of human tumors. More representative models are required that better reflect key aspects of tumor biology. These are essential studies of cancer biology and immunology as well as for target validation and drug discovery. The Innovative Medicines Initiative (IMI) consortium PREDECT (www.predect.eu) characterized in vitro models of three solid tumor types with the goal to capture elements of tumor complexity and heterogeneity. 2D culture and 3D mono- and stromal co-cultures of increasing complexity, and precision-cut tumor slice models were established. Robust protocols for the generation of these platforms are described. Tissue microarrays were prepared from all the models, permitting immunohistochemical analysis of individual cells, capturing heterogeneity. 3D cultures were also characterized using image analysis. Detailed step-by-step protocols, exemplary datasets from the 2D, 3D, and slice models, and refined analytical methods were established and are presented.
de Hoogt, Ronald; Estrada, Marta F.; Vidic, Suzana; Davies, Emma J.; Osswald, Annika; Barbier, Michael; Santo, Vítor E.; Gjerde, Kjersti; van Zoggel, Hanneke J. A. A.; Blom, Sami; Dong, Meng; Närhi, Katja; Boghaert, Erwin; Brito, Catarina; Chong, Yolanda; Sommergruber, Wolfgang; van der Kuip, Heiko; van Weerden, Wytske M.; Verschuren, Emmy W.; Hickman, John; Graeser, Ralph
2017-01-01
Two-dimensional (2D) culture of cancer cells in vitro does not recapitulate the three-dimensional (3D) architecture, heterogeneity and complexity of human tumors. More representative models are required that better reflect key aspects of tumor biology. These are essential studies of cancer biology and immunology as well as for target validation and drug discovery. The Innovative Medicines Initiative (IMI) consortium PREDECT (www.predect.eu) characterized in vitro models of three solid tumor types with the goal to capture elements of tumor complexity and heterogeneity. 2D culture and 3D mono- and stromal co-cultures of increasing complexity, and precision-cut tumor slice models were established. Robust protocols for the generation of these platforms are described. Tissue microarrays were prepared from all the models, permitting immunohistochemical analysis of individual cells, capturing heterogeneity. 3D cultures were also characterized using image analysis. Detailed step-by-step protocols, exemplary datasets from the 2D, 3D, and slice models, and refined analytical methods were established and are presented. PMID:29160867
Paraskevopoulou, Sivylla E; Barsakcioglu, Deren Y; Saberi, Mohammed R; Eftekhar, Amir; Constandinou, Timothy G
2013-04-30
Next generation neural interfaces aspire to achieve real-time multi-channel systems by integrating spike sorting on chip to overcome limitations in communication channel capacity. The feasibility of this approach relies on developing highly efficient algorithms for feature extraction and clustering with the potential of low-power hardware implementation. We are proposing a feature extraction method, not requiring any calibration, based on first and second derivative features of the spike waveform. The accuracy and computational complexity of the proposed method are quantified and compared against commonly used feature extraction methods, through simulation across four datasets (with different single units) at multiple noise levels (ranging from 5 to 20% of the signal amplitude). The average classification error is shown to be below 7% with a computational complexity of 2N-3, where N is the number of sample points of each spike. Overall, this method presents a good trade-off between accuracy and computational complexity and is thus particularly well-suited for hardware-efficient implementation. Copyright © 2013 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Cruden, A. R.; Vollgger, S.
2016-12-01
The emerging capability of UAV photogrammetry combines a simple and cost-effective method to acquire digital aerial images with advanced computer vision algorithms that compute spatial datasets from a sequence of overlapping digital photographs from various viewpoints. Depending on flight altitude and camera setup, sub-centimeter spatial resolution orthophotographs and textured dense point clouds can be achieved. Orientation data can be collected for detailed structural analysis by digitally mapping such high-resolution spatial datasets in a fraction of time and with higher fidelity compared to traditional mapping techniques. Here we describe a photogrammetric workflow applied to a structural study of folds and fractures within alternating layers of sandstone and mudstone at a coastal outcrop in SE Australia. We surveyed this location using a downward looking digital camera mounted on commercially available multi-rotor UAV that autonomously followed waypoints at a set altitude and speed to ensure sufficient image overlap, minimum motion blur and an appropriate resolution. The use of surveyed ground control points allowed us to produce a geo-referenced 3D point cloud and an orthophotograph from hundreds of digital images at a spatial resolution < 10 mm per pixel, and cm-scale location accuracy. Orientation data of brittle and ductile structures were semi-automatically extracted from these high-resolution datasets using open-source software. This resulted in an extensive and statistically relevant orientation dataset that was used to 1) interpret the progressive development of folds and faults in the region, and 2) to generate a 3D structural model that underlines the complex internal structure of the outcrop and quantifies spatial variations in fold geometries. Overall, our work highlights how UAV photogrammetry can contribute to new insights in structural analysis.
Grummer, Jared A; Morando, Mariana M; Avila, Luciano J; Sites, Jack W; Leaché, Adam D
2018-08-01
Rapid evolutionary radiations are difficult to resolve because divergence events are nearly synchronous and gene flow among nascent species can be high, resulting in a phylogenetic "bush". Large datasets composed of sequence loci from across the genome can potentially help resolve some of these difficult phylogenetic problems. A suitable test case is the Liolaemus fitzingerii species group of lizards, which includes twelve species that are broadly distributed in Argentinean Patagonia. The species in the group have had a complex evolutionary history that has led to high morphological variation and unstable taxonomy. We generated a sequence capture dataset for 28 ingroup individuals of 580 nuclear loci, alongside a mitogenomic dataset, to infer phylogenetic relationships among species in this group. Relationships among species were generally weakly supported with the nuclear data, and along with an inferred age of ∼2.6 million years old, indicate either rapid evolution, hybridization, incomplete lineage sorting, non-informative data, or a combination thereof. We inferred a signal of mito-nuclear discordance, indicating potential hybridization between L. melanops and L. martorii, and phylogenetic network analyses provided support for 5 reticulation events among species. Phasing the nuclear loci did not provide additional insight into relationships or suspected patterns of hybridization. Only one clade, composed of L. camarones, L. fitzingerii, and L. xanthoviridis was recovered across all analyses. Genomic datasets provide molecular systematists with new opportunities to resolve difficult phylogenetic problems, yet the lack of phylogenetic resolution in Patagonian Liolaemus is biologically meaningful and indicative of a recent and rapid evolutionary radiation. The phylogenetic relationships of the Liolaemus fitzingerii group may be best modeled as a reticulated network instead of a bifurcating phylogeny. Copyright © 2018 Elsevier Inc. All rights reserved.
QuIN: A Web Server for Querying and Visualizing Chromatin Interaction Networks
Thibodeau, Asa; Márquez, Eladio J.; Luo, Oscar; Ruan, Yijun; Shin, Dong-Guk; Stitzel, Michael L.; Ucar, Duygu
2016-01-01
Recent studies of the human genome have indicated that regulatory elements (e.g. promoters and enhancers) at distal genomic locations can interact with each other via chromatin folding and affect gene expression levels. Genomic technologies for mapping interactions between DNA regions, e.g., ChIA-PET and HiC, can generate genome-wide maps of interactions between regulatory elements. These interaction datasets are important resources to infer distal gene targets of non-coding regulatory elements and to facilitate prioritization of critical loci for important cellular functions. With the increasing diversity and complexity of genomic information and public ontologies, making sense of these datasets demands integrative and easy-to-use software tools. Moreover, network representation of chromatin interaction maps enables effective data visualization, integration, and mining. Currently, there is no software that can take full advantage of network theory approaches for the analysis of chromatin interaction datasets. To fill this gap, we developed a web-based application, QuIN, which enables: 1) building and visualizing chromatin interaction networks, 2) annotating networks with user-provided private and publicly available functional genomics and interaction datasets, 3) querying network components based on gene name or chromosome location, and 4) utilizing network based measures to identify and prioritize critical regulatory targets and their direct and indirect interactions. AVAILABILITY: QuIN’s web server is available at http://quin.jax.org QuIN is developed in Java and JavaScript, utilizing an Apache Tomcat web server and MySQL database and the source code is available under the GPLV3 license available on GitHub: https://github.com/UcarLab/QuIN/. PMID:27336171
Nandi, Sutanu; Subramanian, Abhishek; Sarkar, Ram Rup
2017-07-25
Prediction of essential genes helps to identify a minimal set of genes that are absolutely required for the appropriate functioning and survival of a cell. The available machine learning techniques for essential gene prediction have inherent problems, like imbalanced provision of training datasets, biased choice of the best model for a given balanced dataset, choice of a complex machine learning algorithm, and data-based automated selection of biologically relevant features for classification. Here, we propose a simple support vector machine-based learning strategy for the prediction of essential genes in Escherichia coli K-12 MG1655 metabolism that integrates a non-conventional combination of an appropriate sample balanced training set, a unique organism-specific genotype, phenotype attributes that characterize essential genes, and optimal parameters of the learning algorithm to generate the best machine learning model (the model with the highest accuracy among all the models trained for different sample training sets). For the first time, we also introduce flux-coupled metabolic subnetwork-based features for enhancing the classification performance. Our strategy proves to be superior as compared to previous SVM-based strategies in obtaining a biologically relevant classification of genes with high sensitivity and specificity. This methodology was also trained with datasets of other recent supervised classification techniques for essential gene classification and tested using reported test datasets. The testing accuracy was always high as compared to the known techniques, proving that our method outperforms known methods. Observations from our study indicate that essential genes are conserved among homologous bacterial species, demonstrate high codon usage bias, GC content and gene expression, and predominantly possess a tendency to form physiological flux modules in metabolism.
Philipp, E E R; Kraemer, L; Mountfort, D; Schilhabel, M; Schreiber, S; Rosenstiel, P
2012-03-15
Next generation sequencing (NGS) technologies allow a rapid and cost-effective compilation of large RNA sequence datasets in model and non-model organisms. However, the storage and analysis of transcriptome information from different NGS platforms is still a significant bottleneck, leading to a delay in data dissemination and subsequent biological understanding. Especially database interfaces with transcriptome analysis modules going beyond mere read counts are missing. Here, we present the Transcriptome Analysis and Comparison Explorer (T-ACE), a tool designed for the organization and analysis of large sequence datasets, and especially suited for transcriptome projects of non-model organisms with little or no a priori sequence information. T-ACE offers a TCL-based interface, which accesses a PostgreSQL database via a php-script. Within T-ACE, information belonging to single sequences or contigs, such as annotation or read coverage, is linked to the respective sequence and immediately accessible. Sequences and assigned information can be searched via keyword- or BLAST-search. Additionally, T-ACE provides within and between transcriptome analysis modules on the level of expression, GO terms, KEGG pathways and protein domains. Results are visualized and can be easily exported for external analysis. We developed T-ACE for laboratory environments, which have only a limited amount of bioinformatics support, and for collaborative projects in which different partners work on the same dataset from different locations or platforms (Windows/Linux/MacOS). For laboratories with some experience in bioinformatics and programming, the low complexity of the database structure and open-source code provides a framework that can be customized according to the different needs of the user and transcriptome project.
Blockmodels for connectome analysis
NASA Astrophysics Data System (ADS)
Moyer, Daniel; Gutman, Boris; Prasad, Gautam; Faskowitz, Joshua; Ver Steeg, Greg; Thompson, Paul
2015-12-01
In the present work we study a family of generative network model and its applications for modeling the human connectome. We introduce a minor but novel variant of the Mixed Membership Stochastic Blockmodel and apply it and two other related model to two human connectome datasets (ADNI and a Bipolar Disorder dataset) with both control and diseased subjects. We further provide a simple generative classifier that, alongside more discriminating methods, provides evidence that blockmodels accurately summarize tractography count networks with respect to a disease classification task.
Improving the discoverability, accessibility, and citability of omics datasets: a case report.
Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J
2017-03-01
Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Hamdan, Sadeque; Cheaitou, Ali
2017-08-01
This data article provides detailed optimization input and output datasets and optimization code for the published research work titled "Dynamic green supplier selection and order allocation with quantity discounts and varying supplier availability" (Hamdan and Cheaitou, 2017, In press) [1]. Researchers may use these datasets as a baseline for future comparison and extensive analysis of the green supplier selection and order allocation problem with all-unit quantity discount and varying number of suppliers. More particularly, the datasets presented in this article allow researchers to generate the exact optimization outputs obtained by the authors of Hamdan and Cheaitou (2017, In press) [1] using the provided optimization code and then to use them for comparison with the outputs of other techniques or methodologies such as heuristic approaches. Moreover, this article includes the randomly generated optimization input data and the related outputs that are used as input data for the statistical analysis presented in Hamdan and Cheaitou (2017 In press) [1] in which two different approaches for ranking potential suppliers are compared. This article also provides the time analysis data used in (Hamdan and Cheaitou (2017, In press) [1] to study the effect of the problem size on the computation time as well as an additional time analysis dataset. The input data for the time study are generated randomly, in which the problem size is changed, and then are used by the optimization problem to obtain the corresponding optimal outputs as well as the corresponding computation time.
BESST--efficient scaffolding of large fragmented assemblies.
Sahlin, Kristoffer; Vezzi, Francesco; Nystedt, Björn; Lundeberg, Joakim; Arvestad, Lars
2014-08-15
The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features.We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance. We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST fares favorably to the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when library insert size distribution is wide. We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
Dimitrakopoulos, Christos; Theofilatos, Konstantinos; Pegkas, Andreas; Likothanassis, Spiros; Mavroudi, Seferina
2016-07-01
Proteins are vital biological molecules driving many fundamental cellular processes. They rarely act alone, but form interacting groups called protein complexes. The study of protein complexes is a key goal in systems biology. Recently, large protein-protein interaction (PPI) datasets have been published and a plethora of computational methods that provide new ideas for the prediction of protein complexes have been implemented. However, most of the methods suffer from two major limitations: First, they do not account for proteins participating in multiple functions and second, they are unable to handle weighted PPI graphs. Moreover, the problem remains open as existing algorithms and tools are insufficient in terms of predictive metrics. In the present paper, we propose gradually expanding neighborhoods with adjustment (GENA), a new algorithm that gradually expands neighborhoods in a graph starting from highly informative "seed" nodes. GENA considers proteins as multifunctional molecules allowing them to participate in more than one protein complex. In addition, GENA accepts weighted PPI graphs by using a weighted evaluation function for each cluster. In experiments with datasets from Saccharomyces cerevisiae and human, GENA outperformed Markov clustering, restricted neighborhood search and clustering with overlapping neighborhood expansion, three state-of-the-art methods for computationally predicting protein complexes. Seven PPI networks and seven evaluation datasets were used in total. GENA outperformed existing methods in 16 out of 18 experiments achieving an average improvement of 5.5% when the maximum matching ratio metric was used. Our method was able to discover functionally homogeneous protein clusters and uncover important network modules in a Parkinson expression dataset. When used on the human networks, around 47% of the detected clusters were enriched in gene ontology (GO) terms with depth higher than five in the GO hierarchy. In the present manuscript, we introduce a new method for the computational prediction of protein complexes by making the realistic assumption that proteins participate in multiple protein complexes and cellular functions. Our method can detect accurate and functionally homogeneous clusters. Copyright © 2016 Elsevier B.V. All rights reserved.
Plant databases and data analysis tools
USDA-ARS?s Scientific Manuscript database
It is anticipated that the coming years will see the generation of large datasets including diagnostic markers in several plant species with emphasis on crop plants. To use these datasets effectively in any plant breeding program, it is essential to have the information available via public database...
A conceptual prototype for the next-generation national elevation dataset
Stoker, Jason M.; Heidemann, Hans Karl; Evans, Gayla A.; Greenlee, Susan K.
2013-01-01
In 2012 the U.S. Geological Survey's (USGS) National Geospatial Program (NGP) funded a study to develop a conceptual prototype for a new National Elevation Dataset (NED) design with expanded capabilities to generate and deliver a suite of bare earth and above ground feature information over the United States. This report details the research on identifying operational requirements based on prior research, evaluation of what is needed for the USGS to meet these requirements, and development of a possible conceptual framework that could potentially deliver the kinds of information that are needed to support NGP's partners and constituents. This report provides an initial proof-of-concept demonstration using an existing dataset, and recommendations for the future, to inform NGP's ongoing and future elevation program planning and management decisions. The demonstration shows that this type of functional process can robustly create derivatives from lidar point cloud data; however, more research needs to be done to see how well it extends to multiple datasets.
Bennie, Marion; Malcolm, William; Marwick, Charis A; Kavanagh, Kimberley; Sneddon, Jean; Nathwani, Dilip
2017-10-01
The better use of new and emerging data streams to understand the epidemiology of infectious disease and to inform and evaluate antimicrobial stewardship improvement programmes is paramount in the global fight against antimicrobial resistance. To create a national informatics platform that synergizes the wealth of disjointed, infection-related health data, building an intelligence capability that allows rapid enquiry, generation of new knowledge and feedback to clinicians and policy makers. A multi-stakeholder community, led by the Scottish Antimicrobial Prescribing Group, secured government funding to deliver a national programme of work centred on three key aspects: (i) technical platform development with record linkage capability across multiple datasets; (ii) a proportionate governance approach to enhance responsiveness; and (iii) generation of new evidence to guide clinical practice. The National Health Service Scotland Infection Intelligence Platform (IIP) is now hosted within the national health data repository to assure resilience and sustainability. New technical solutions include simplified 'data views' of complex, linked datasets and embedded statistical programs to enhance capability. These developments have enabled responsiveness, flexibility and robustness in conducting population-based studies including a focus on intended and unintended effects of antimicrobial stewardship interventions and quantification of infection risk factors and clinical outcomes. We have completed the build and test phase of IIP, overcoming the technical and governance challenges, and produced new capability in infection informatics, generating new evidence for improved clinical practice. This provides a foundation for expansion and opportunity for global collaborations. © The Author 2017. Published by Oxford University Press on behalf of the British Society for Antimicrobial Chemotherapy. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Valentine, D.; Richard, S. M.; Gupta, A.; Meier, O.; Peucker-Ehrenbrink, B.; Hudman, G.; Stocks, K. I.; Hsu, L.; Whitenack, T.; Grethe, J. S.; Ozyurt, I. B.
2017-12-01
EarthCube Data Discovery Hub (DDH) is an EarthCube Building Block project using technologies developed in CINERGI (Community Inventory of EarthCube Resources for Geoscience Interoperability) to enable geoscience users to explore a growing portfolio of EarthCube-created and other geoscience-related resources. Over 1 million metadata records are available for discovery through the project portal (cinergi.sdsc.edu). These records are retrieved from data facilities, including federal, state and academic sources, or contributed by geoscientists through workshops, surveys, or other channels. CINERGI metadata augmentation pipeline components 1) provide semantic enhancement based on a large ontology of geoscience terms, using text analytics to generate keywords with references to ontology classes, 2) add spatial extents based on place names found in the metadata record, and 3) add organization identifiers to the metadata. The records are indexed and can be searched via a web portal and standard search APIs. The added metadata content improves discoverability and interoperability of the registered resources. Specifically, the addition of ontology-anchored keywords enables faceted browsing and lets users navigate to datasets related by variables measured, equipment used, science domain, processes described, geospatial features studied, and other dataset characteristics that are generated by the pipeline. DDH also lets data curators access and edit the automatically generated metadata records using the CINERGI metadata editor, accept or reject the enhanced metadata content, and consider it in updating their metadata descriptions. We consider several complex data discovery workflows, in environmental seismology (quantifying sediment and water fluxes using seismic data), marine biology (determining available temperature, location, weather and bleaching characteristics of coral reefs related to measurements in a given coral reef survey), and river geochemistry (discovering observations relevant to geochemical measurements outside the tidal zone, given specific discharge conditions).
Proteomics and Systems Biology: Current and Future Applications in the Nutritional Sciences1
Moore, J. Bernadette; Weeks, Mark E.
2011-01-01
In the last decade, advances in genomics, proteomics, and metabolomics have yielded large-scale datasets that have driven an interest in global analyses, with the objective of understanding biological systems as a whole. Systems biology integrates computational modeling and experimental biology to predict and characterize the dynamic properties of biological systems, which are viewed as complex signaling networks. Whereas the systems analysis of disease-perturbed networks holds promise for identification of drug targets for therapy, equally the identified critical network nodes may be targeted through nutritional intervention in either a preventative or therapeutic fashion. As such, in the context of the nutritional sciences, it is envisioned that systems analysis of normal and nutrient-perturbed signaling networks in combination with knowledge of underlying genetic polymorphisms will lead to a future in which the health of individuals will be improved through predictive and preventative nutrition. Although high-throughput transcriptomic microarray data were initially most readily available and amenable to systems analysis, recent technological and methodological advances in MS have contributed to a linear increase in proteomic investigations. It is now commonplace for combined proteomic technologies to generate complex, multi-faceted datasets, and these will be the keystone of future systems biology research. This review will define systems biology, outline current proteomic methodologies, highlight successful applications of proteomics in nutrition research, and discuss the challenges for future applications of systems biology approaches in the nutritional sciences. PMID:22332076
Autonomous localisation of rovers for future planetary exploration
NASA Astrophysics Data System (ADS)
Bajpai, Abhinav
Future Mars exploration missions will have increasingly ambitious goals compared to current rover and lander missions. There will be a need for extremely long distance traverses over shorter periods of time. This will allow more varied and complex scientific tasks to be performed and increase the overall value of the missions. The missions may also include a sample return component, where items collected on the surface will be returned to a cache in order to be returned to Earth, for further study. In order to make these missions feasible, future rover platforms will require increased levels of autonomy, allowing them to operate without heavy reliance on a terrestrial ground station. Being able to autonomously localise the rover is an important element in increasing the rover's capability to independently explore. This thesis develops a Planetary Monocular Simultaneous Localisation And Mapping (PM-SLAM) system aimed specifically at a planetary exploration context. The system uses a novel modular feature detection and tracking algorithm called hybrid-saliency in order to achieve robust tracking, while maintaining low computational complexity in the SLAM filter. The hybrid saliency technique uses a combination of cognitive inspired saliency features with point-based feature descriptors as input to the SLAM filter. The system was tested on simulated datasets generated using the Planetary, Asteroid and Natural scene Generation Utility (PANGU) as well as two real world datasets which closely approximated images from a planetary environment. The system was shown to provide a higher accuracy of localisation estimate than a state-of-the-art VO system tested on the same data set. In order to be able to localise the rover absolutely, further techniques are investigated which attempt to determine the rover's position in orbital maps. Orbiter Mask Matching uses point-based features detected by the rover to associate descriptors with large features extracted from orbital imagery and stored in the rover memory prior the mission launch. A proof of concept is evaluated using a PANGU simulated boulder field.
Mathematical Modeling of Intestinal Iron Absorption Using Genetic Programming
Colins, Andrea; Gerdtzen, Ziomara P.; Nuñez, Marco T.; Salgado, J. Cristian
2017-01-01
Iron is a trace metal, key for the development of living organisms. Its absorption process is complex and highly regulated at the transcriptional, translational and systemic levels. Recently, the internalization of the DMT1 transporter has been proposed as an additional regulatory mechanism at the intestinal level, associated to the mucosal block phenomenon. The short-term effect of iron exposure in apical uptake and initial absorption rates was studied in Caco-2 cells at different apical iron concentrations, using both an experimental approach and a mathematical modeling framework. This is the first report of short-term studies for this system. A non-linear behavior in the apical uptake dynamics was observed, which does not follow the classic saturation dynamics of traditional biochemical models. We propose a method for developing mathematical models for complex systems, based on a genetic programming algorithm. The algorithm is aimed at obtaining models with a high predictive capacity, and considers an additional parameter fitting stage and an additional Jackknife stage for estimating the generalization error. We developed a model for the iron uptake system with a higher predictive capacity than classic biochemical models. This was observed both with the apical uptake dataset used for generating the model and with an independent initial rates dataset used to test the predictive capacity of the model. The model obtained is a function of time and the initial apical iron concentration, with a linear component that captures the global tendency of the system, and a non-linear component that can be associated to the movement of DMT1 transporters. The model presented in this paper allows the detailed analysis, interpretation of experimental data, and identification of key relevant components for this complex biological process. This general method holds great potential for application to the elucidation of biological mechanisms and their key components in other complex systems. PMID:28072870
Dong, Yadong; Sun, Yongqi; Qin, Chao
2018-01-01
The existing protein complex detection methods can be broadly divided into two categories: unsupervised and supervised learning methods. Most of the unsupervised learning methods assume that protein complexes are in dense regions of protein-protein interaction (PPI) networks even though many true complexes are not dense subgraphs. Supervised learning methods utilize the informative properties of known complexes; they often extract features from existing complexes and then use the features to train a classification model. The trained model is used to guide the search process for new complexes. However, insufficient extracted features, noise in the PPI data and the incompleteness of complex data make the classification model imprecise. Consequently, the classification model is not sufficient for guiding the detection of complexes. Therefore, we propose a new robust score function that combines the classification model with local structural information. Based on the score function, we provide a search method that works both forwards and backwards. The results from experiments on six benchmark PPI datasets and three protein complex datasets show that our approach can achieve better performance compared with the state-of-the-art supervised, semi-supervised and unsupervised methods for protein complex detection, occasionally significantly outperforming such methods.
NASA Astrophysics Data System (ADS)
Fehm, Thomas Felix; Deán-Ben, Xosé Luís; Razansky, Daniel
2014-10-01
Ultrasonography and optoacoustic imaging share powerful advantages related to the natural aptitude for real-time image rendering with high resolution, the hand-held operation, and lack of ionizing radiation. The two methods also possess very different yet highly complementary advantages of the mechanical and optical contrast in living tissues. Nonetheless, efficient integration of these modalities remains challenging owing to the fundamental differences in the underlying physical contrast, optimal signal acquisition, and image reconstruction approaches. We report on a method for hybrid acquisition and reconstruction of three-dimensional pulse-echo ultrasound and optoacoustic images in real time based on passive ultrasound generation with an optical absorber, thus avoiding the hardware complexity of active ultrasound generation. In this way, complete hybrid datasets are generated with a single laser interrogation pulse, resulting in simultaneous rendering of ultrasound and optoacoustic images at an unprecedented rate of 10 volumetric frames per second. Performance is subsequently showcased in phantom experiments and in-vivo measurements from a healthy human volunteer, confirming general clinical applicability of the method.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fehm, Thomas Felix; Razansky, Daniel, E-mail: dr@tum.de; Faculty of Medicine, Technische Universität München, Munich
2014-10-27
Ultrasonography and optoacoustic imaging share powerful advantages related to the natural aptitude for real-time image rendering with high resolution, the hand-held operation, and lack of ionizing radiation. The two methods also possess very different yet highly complementary advantages of the mechanical and optical contrast in living tissues. Nonetheless, efficient integration of these modalities remains challenging owing to the fundamental differences in the underlying physical contrast, optimal signal acquisition, and image reconstruction approaches. We report on a method for hybrid acquisition and reconstruction of three-dimensional pulse-echo ultrasound and optoacoustic images in real time based on passive ultrasound generation with an opticalmore » absorber, thus avoiding the hardware complexity of active ultrasound generation. In this way, complete hybrid datasets are generated with a single laser interrogation pulse, resulting in simultaneous rendering of ultrasound and optoacoustic images at an unprecedented rate of 10 volumetric frames per second. Performance is subsequently showcased in phantom experiments and in-vivo measurements from a healthy human volunteer, confirming general clinical applicability of the method.« less
Wilke, Marko
2018-02-01
This dataset contains the regression parameters derived by analyzing segmented brain MRI images (gray matter and white matter) from a large population of healthy subjects, using a multivariate adaptive regression splines approach. A total of 1919 MRI datasets ranging in age from 1-75 years from four publicly available datasets (NIH, C-MIND, fCONN, and IXI) were segmented using the CAT12 segmentation framework, writing out gray matter and white matter images normalized using an affine-only spatial normalization approach. These images were then subjected to a six-step DARTEL procedure, employing an iterative non-linear registration approach and yielding increasingly crisp intermediate images. The resulting six datasets per tissue class were then analyzed using multivariate adaptive regression splines, using the CerebroMatic toolbox. This approach allows for flexibly modelling smoothly varying trajectories while taking into account demographic (age, gender) as well as technical (field strength, data quality) predictors. The resulting regression parameters described here can be used to generate matched DARTEL or SHOOT templates for a given population under study, from infancy to old age. The dataset and the algorithm used to generate it are publicly available at https://irc.cchmc.org/software/cerebromatic.php.
Chest x-ray generation and data augmentation for cardiovascular abnormality classification
NASA Astrophysics Data System (ADS)
Madani, Ali; Moradi, Mehdi; Karargyris, Alexandros; Syeda-Mahmood, Tanveer
2018-03-01
Medical imaging datasets are limited in size due to privacy issues and the high cost of obtaining annotations. Augmentation is a widely used practice in deep learning to enrich the data in data-limited scenarios and to avoid overfitting. However, standard augmentation methods that produce new examples of data by varying lighting, field of view, and spatial rigid transformations do not capture the biological variance of medical imaging data and could result in unrealistic images. Generative adversarial networks (GANs) provide an avenue to understand the underlying structure of image data which can then be utilized to generate new realistic samples. In this work, we investigate the use of GANs for producing chest X-ray images to augment a dataset. This dataset is then used to train a convolutional neural network to classify images for cardiovascular abnormalities. We compare our augmentation strategy with traditional data augmentation and show higher accuracy for normal vs abnormal classification in chest X-rays.
Aggregating todays data for tomorrows science: a geological use case
NASA Astrophysics Data System (ADS)
Glaves, H.; Kingdon, A.; Nayembil, M.; Baker, G.
2016-12-01
Geoscience data is made up of diverse and complex smaller datasets that, when aggregated together, build towards what is recognised as `big data'. The British Geological Survey (BGS), which acts as a repository for all subsurface data from the United Kingdom, has been collating these disparate small datasets that have been accumulated from the activities of a large number of geoscientists over many years. Recently this picture has been further complicated by the addition of new data sources such as near real-time sensor data, and industry or community data that is increasingly delivered via automatic donations. Many of these datasets have been aggregated in relational databases to form larger ones that are used to address a variety of issues ranging from development of national infrastructure to disaster response. These complex domain-specific SQL databases deliver effective data management using normalised subject-based database designs in a secure environment. However, the isolated subject-oriented design of these systems inhibits efficient cross-domain querying of the datasets. Additionally, the tools provided often do not enable effective data discovery as they have problems resolving the complex underlying normalised structures. Recent requirements to understand sub-surface geology in three dimensions have led BGS to develop new data systems. One such solution is PropBase which delivers a generic denormalised data structure within an RDBMS to store geological property data. Propbase facilitates rapid and standardised data discovery and access, incorporating 2D and 3D physical and chemical property data, including associated metadata. It also provides a dedicated web interface to deliver complex multiple data sets from a single database in standardised common output formats (e.g. CSV, GIS shape files) without the need for complex data conditioning. PropBase facilitates new scientific research, previously considered impractical, by enabling property data searches across multiple databases. Using the Propbase exemplar this presentation will seek to illustrate how BGS has developed systems for aggregating `small datasets' to create the `big data' necessary for the data analytics, mining, processing and visualisation needed for future geoscientific research.
Modeling the topography of shallow braided rivers using Structure-from-Motion photogrammetry
NASA Astrophysics Data System (ADS)
Javernick, L.; Brasington, J.; Caruso, B.
2014-05-01
Recent advances in computer vision and image analysis have led to the development of a novel, fully automated photogrammetric method to generate dense 3d point cloud data. This approach, termed Structure-from-Motion or SfM, requires only limited ground-control and is ideally suited to imagery obtained from low-cost, non-metric cameras acquired either at close-range or using aerial platforms. Terrain models generated using SfM have begun to emerge recently and with a growing spectrum of software now available, there is an urgent need to provide a robust quality assessment of the data products generated using standard field and computational workflows. To address this demand, we present a detailed error analysis of sub-meter resolution terrain models of two contiguous reaches (1.6 and 1.7 km long) of the braided Ahuriri River, New Zealand, generated using SfM. A six stage methodology is described, involving: i) hand-held image acquisition from an aerial platform, ii) 3d point cloud extraction modeling using Agisoft PhotoScan, iii) georeferencing on a redundant network of GPS-surveyed ground-control points, iv) point cloud filtering to reduce computational demand as well as reduce vegetation noise, v) optical bathymetric modeling of inundated areas; and vi) data fusion and surface modeling to generate sub-meter raster terrain models. Bootstrapped geo-registration as well as extensive distributed GPS and sonar-based bathymetric check-data were used to quantify the quality of the models generated after each processing step. The results obtained provide the first quantified analysis of SfM applied to model the complex terrain of a braided river. Results indicate that geo-registration errors of 0.04 m (planar) and 0.10 m (elevation) and vertical surface errors of 0.10 m in non-vegetation areas can be achieved from a dataset of photographs taken at 600 m and 800 m above the ground level. These encouraging results suggest that this low-cost, logistically simple method can deliver high quality terrain datasets competitive with those obtained with significantly more expensive laser scanning, and suitable for geomorphic change detection and hydrodynamic modeling.
NASA Astrophysics Data System (ADS)
Nesbit, P. R.; Hugenholtz, C.; Durkin, P.; Hubbard, S. M.; Kucharczyk, M.; Barchyn, T.
2016-12-01
Remote sensing and digital mapping have started to revolutionize geologic mapping in recent years as a result of their realized potential to provide high resolution 3D models of outcrops to assist with interpretation, visualization, and obtaining accurate measurements of inaccessible areas. However, in stratigraphic mapping applications in complex terrain, it is difficult to acquire information with sufficient detail at a wide spatial coverage with conventional techniques. We demonstrate the potential of a UAV and Structure from Motion (SfM) photogrammetric approach for improving 3D stratigraphic mapping applications within a complex badland topography. Our case study is performed in Dinosaur Provincial Park (Alberta, Canada), mapping late Cretaceous fluvial meander belt deposits of the Dinosaur Park formation amidst a succession of steeply sloping hills and abundant drainages - creating a challenge for stratigraphic mapping. The UAV-SfM dataset (2 cm spatial resolution) is compared directly with a combined satellite and aerial LiDAR dataset (30 cm spatial resolution) to reveal advantages and limitations of each dataset before presenting a unique workflow that utilizes the dense point cloud from the UAV-SfM dataset for analysis. The UAV-SfM dense point cloud minimizes distortion, preserves 3D structure, and records an RGB attribute - adding potential value in future studies. The proposed UAV-SfM workflow allows for high spatial resolution remote sensing of stratigraphy in complex topographic environments. This extended capability can add value to field observations and has the potential to be integrated with subsurface petroleum models.
A comprehensive evaluation of assembly scaffolding tools
2014-01-01
Background Genome assembly is typically a two-stage process: contig assembly followed by the use of paired sequencing reads to join contigs into scaffolds. Scaffolds are usually the focus of reported assembly statistics; longer scaffolds greatly facilitate the use of genome sequences in downstream analyses, and it is appealing to present larger numbers as metrics of assembly performance. However, scaffolds are highly prone to errors, especially when generated using short reads, which can directly result in inflated assembly statistics. Results Here we provide the first independent evaluation of scaffolding tools for second-generation sequencing data. We find large variations in the quality of results depending on the tool and dataset used. Even extremely simple test cases of perfect input, constructed to elucidate the behaviour of each algorithm, produced some surprising results. We further dissect the performance of the scaffolders using real and simulated sequencing data derived from the genomes of Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and Homo sapiens. The results from simulated data are of high quality, with several of the tools producing perfect output. However, at least 10% of joins remains unidentified when using real data. Conclusions The scaffolders vary in their usability, speed and number of correct and missed joins made between contigs. Results from real data highlight opportunities for further improvements of the tools. Overall, SGA, SOPRA and SSPACE generally outperform the other tools on our datasets. However, the quality of the results is highly dependent on the read mapper and genome complexity. PMID:24581555
Querying Large Biological Network Datasets
ERIC Educational Resources Information Center
Gulsoy, Gunhan
2013-01-01
New experimental methods has resulted in increasing amount of genetic interaction data to be generated every day. Biological networks are used to store genetic interaction data gathered. Increasing amount of data available requires fast large scale analysis methods. Therefore, we address the problem of querying large biological network datasets.…
Statistical analysis of large simulated yield datasets for studying climate effects
USDA-ARS?s Scientific Manuscript database
Ensembles of process-based crop models are now commonly used to simulate crop growth and development for climate scenarios of temperature and/or precipitation changes corresponding to different projections of atmospheric CO2 concentrations. This approach generates large datasets with thousands of de...
Rare itemsets mining algorithm based on RP-Tree and spark framework
NASA Astrophysics Data System (ADS)
Liu, Sainan; Pan, Haoan
2018-05-01
For the issues of the rare itemsets mining in big data, this paper proposed a rare itemsets mining algorithm based on RP-Tree and Spark framework. Firstly, it arranged the data vertically according to the transaction identifier, in order to solve the defects of scan the entire data set, the vertical datasets are divided into frequent vertical datasets and rare vertical datasets. Then, it adopted the RP-Tree algorithm to construct the frequent pattern tree that contains rare items and generate rare 1-itemsets. After that, it calculated the support of the itemsets by scanning the two vertical data sets, finally, it used the iterative process to generate rare itemsets. The experimental show that the algorithm can effectively excavate rare itemsets and have great superiority in execution time.
Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.
Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi
2015-09-01
Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world. It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the conduct of more complex research. Copyright © 2015 Elsevier Ltd and National Safety Council. All rights reserved.
Basin Characteristics for Selected Streamflow-Gaging Stations In and Near West Virginia
Paybins, Katherine S.
2008-01-01
Basin characteristics have long been used to develop equations describing streamflow. In the past, flow equations used in West Virginia were based on a few hand-calculated basin characteristics. More recently, the use of a Geographic Information System (GIS) to generate basin characteristics from existing datasets has refined the process for developing equations to describe flow values in the Mountain State. These basin characteristics are described in this document for streamflow-gaging stations in and near West Virginia. The GIS program developed in ArcGIS Workstation by Environmental Systems Research Institute (ESRI?) used data that included National Elevation Dataset (NED) at 1:24,000 scale, climate data from the National Oceanic and Atmospheric Agency (NOAA), streamlines from the National Hydrologic Dataset (NHD), and LandSat-based land-cover data (NLCD) for the period 1999-2003. Full automation of data generation was not achieved due to some inaccuracies in the elevation dataset, as well as inaccuracies in the streamflow-gage locations retrieved from the National Water Information System (NWIS). A Pearson?s correlation examination of the data indicates that several of the basin characteristics are correlated with drainage area. However, the GIS-generated data provide a consistent and documented set of basin characteristics for resource managers and researchers to use.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dhou, S; Cai, W; Hurwitz, M
2015-06-15
Purpose: Respiratory-correlated cone-beam CT (4DCBCT) images acquired immediately prior to treatment have the potential to represent patient motion patterns and anatomy during treatment, including both intra- and inter-fractional changes. We develop a method to generate patient-specific motion models based on 4DCBCT images acquired with existing clinical equipment and used to generate time varying volumetric images (3D fluoroscopic images) representing motion during treatment delivery. Methods: Motion models are derived by deformably registering each 4DCBCT phase to a reference phase, and performing principal component analysis (PCA) on the resulting displacement vector fields. 3D fluoroscopic images are estimated by optimizing the resulting PCAmore » coefficients iteratively through comparison of the cone-beam projections simulating kV treatment imaging and digitally reconstructed radiographs generated from the motion model. Patient and physical phantom datasets are used to evaluate the method in terms of tumor localization error compared to manually defined ground truth positions. Results: 4DCBCT-based motion models were derived and used to generate 3D fluoroscopic images at treatment time. For the patient datasets, the average tumor localization error and the 95th percentile were 1.57 and 3.13 respectively in subsets of four patient datasets. For the physical phantom datasets, the average tumor localization error and the 95th percentile were 1.14 and 2.78 respectively in two datasets. 4DCBCT motion models are shown to perform well in the context of generating 3D fluoroscopic images due to their ability to reproduce anatomical changes at treatment time. Conclusion: This study showed the feasibility of deriving 4DCBCT-based motion models and using them to generate 3D fluoroscopic images at treatment time in real clinical settings. 4DCBCT-based motion models were found to account for the 3D non-rigid motion of the patient anatomy during treatment and have the potential to localize tumor and other patient anatomical structures at treatment time even when inter-fractional changes occur. This project was supported, in part, through a Master Research Agreement with Varian Medical Systems, Inc., Palo Alto, CA. The project was also supported, in part, by Award Number R21CA156068 from the National Cancer Institute.« less
Merging K-means with hierarchical clustering for identifying general-shaped groups.
Peterson, Anna D; Ghosh, Arka P; Maitra, Ranjan
2018-01-01
Clustering partitions a dataset such that observations placed together in a group are similar but different from those in other groups. Hierarchical and K -means clustering are two approaches but have different strengths and weaknesses. For instance, hierarchical clustering identifies groups in a tree-like structure but suffers from computational complexity in large datasets while K -means clustering is efficient but designed to identify homogeneous spherically-shaped clusters. We present a hybrid non-parametric clustering approach that amalgamates the two methods to identify general-shaped clusters and that can be applied to larger datasets. Specifically, we first partition the dataset into spherical groups using K -means. We next merge these groups using hierarchical methods with a data-driven distance measure as a stopping criterion. Our proposal has the potential to reveal groups with general shapes and structure in a dataset. We demonstrate good performance on several simulated and real datasets.
Stanislawski, Jerzy; Kotulska, Malgorzata; Unold, Olgierd
2013-01-17
Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, like Alzheimer disease. The number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of aminoacids, which transform the structure when exposed. A few hundreds of such peptides have been experimentally found. Experimental testing of all possible aminoacid combinations is currently not feasible. Instead, they can be predicted by computational methods. 3D profile is a physicochemical-based method that has generated the most numerous dataset - ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods. We generated a new dataset of hexapeptides, using more economical 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences with their binary classification, based on the energy of the segment, was applied for training machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both methods obtained area under ROC curve of 0.96, accuracy 91%, true positive rate ca. 78%, and true negative rate 95%. A few other machine learning methods also achieved a good performance. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning). We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing the computational efficiency. Our new dataset proved representative enough to use simple statistical methods for testing the amylogenicity based only on six letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy based classifier, with advantage of very significantly reduced computational time and simplicity to perform the analysis. Additionally, a decision tree provides a set of very easily interpretable rules.
A longitudinal dataset of five years of public activity in the Scratch online community.
Hill, Benjamin Mako; Monroy-Hernández, Andrés
2017-01-31
Scratch is a programming environment and an online community where young people can create, share, learn, and communicate. In collaboration with the Scratch Team at MIT, we created a longitudinal dataset of public activity in the Scratch online community during its first five years (2007-2012). The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more. To help researchers understand this dataset, and to establish the validity of the data, we also include the source code of every version of the software that operated the website, as well as the software used to generate this dataset. We believe this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.
Wide-Open: Accelerating public data release by automating detection of overdue datasets
Poon, Hoifung; Howe, Bill
2017-01-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
Wide-Open: Accelerating public data release by automating detection of overdue datasets.
Grechkin, Maxim; Poon, Hoifung; Howe, Bill
2017-06-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
Jones, Darryl R; Thomas, Dallas; Alger, Nicholas; Ghavidel, Ata; Inglis, G Douglas; Abbott, D Wade
2018-01-01
Deposition of new genetic sequences in online databases is expanding at an unprecedented rate. As a result, sequence identification continues to outpace functional characterization of carbohydrate active enzymes (CAZymes). In this paradigm, the discovery of enzymes with novel functions is often hindered by high volumes of uncharacterized sequences particularly when the enzyme sequence belongs to a family that exhibits diverse functional specificities (i.e., polyspecificity). Therefore, to direct sequence-based discovery and characterization of new enzyme activities we have developed an automated in silico pipeline entitled: Sequence Analysis and Clustering of CarboHydrate Active enzymes for Rapid Informed prediction of Specificity (SACCHARIS). This pipeline streamlines the selection of uncharacterized sequences for discovery of new CAZyme or CBM specificity from families currently maintained on the CAZy website or within user-defined datasets. SACCHARIS was used to generate a phylogenetic tree of a GH43, a CAZyme family with defined subfamily designations. This analysis confirmed that large datasets can be organized into sequence clusters of manageable sizes that possess related functions. Seeding this tree with a GH43 sequence from Bacteroides dorei DSM 17855 (BdGH43b, revealed it partitioned as a single sequence within the tree. This pattern was consistent with it possessing a unique enzyme activity for GH43 as BdGH43b is the first described α-glucanase described for this family. The capacity of SACCHARIS to extract and cluster characterized carbohydrate binding module sequences was demonstrated using family 6 CBMs (i.e., CBM6s). This CBM family displays a polyspecific ligand binding profile and contains many structurally determined members. Using SACCHARIS to identify a cluster of divergent sequences, a CBM6 sequence from a unique clade was demonstrated to bind yeast mannan, which represents the first description of an α-mannan binding CBM. Additionally, we have performed a CAZome analysis of an in-house sequenced bacterial genome and a comparative analysis of B. thetaiotaomicron VPI-5482 and B. thetaiotaomicron 7330, to demonstrate that SACCHARIS can generate "CAZome fingerprints", which differentiate between the saccharolytic potential of two related strains in silico. Establishing sequence-function and sequence-structure relationships in polyspecific CAZyme families are promising approaches for streamlining enzyme discovery. SACCHARIS facilitates this process by embedding CAZyme and CBM family trees generated from biochemically to structurally characterized sequences, with protein sequences that have unknown functions. In addition, these trees can be integrated with user-defined datasets (e.g., genomics, metagenomics, and transcriptomics) to inform experimental characterization of new CAZymes or CBMs not currently curated, and for researchers to compare differential sequence patterns between entire CAZomes. In this light, SACCHARIS provides an in silico tool that can be tailored for enzyme bioprospecting in datasets of increasing complexity and for diverse applications in glycobiotechnology.
Prediction of cell penetrating peptides by support vector machines.
Sanders, William S; Johnston, C Ian; Bridges, Susan M; Burgess, Shane C; Willeford, Kenneth O
2011-07-01
Cell penetrating peptides (CPPs) are those peptides that can transverse cell membranes to enter cells. Once inside the cell, different CPPs can localize to different cellular components and perform different roles. Some generate pore-forming complexes resulting in the destruction of cells while others localize to various organelles. Use of machine learning methods to predict potential new CPPs will enable more rapid screening for applications such as drug delivery. We have investigated the influence of the composition of training datasets on the ability to classify peptides as cell penetrating using support vector machines (SVMs). We identified 111 known CPPs and 34 known non-penetrating peptides from the literature and commercial vendors and used several approaches to build training data sets for the classifiers. Features were calculated from the datasets using a set of basic biochemical properties combined with features from the literature determined to be relevant in the prediction of CPPs. Our results using different training datasets confirm the importance of a balanced training set with approximately equal number of positive and negative examples. The SVM based classifiers have greater classification accuracy than previously reported methods for the prediction of CPPs, and because they use primary biochemical properties of the peptides as features, these classifiers provide insight into the properties needed for cell-penetration. To confirm our SVM classifications, a subset of peptides classified as either penetrating or non-penetrating was selected for synthesis and experimental validation. Of the synthesized peptides predicted to be CPPs, 100% of these peptides were shown to be penetrating.
Spherical: an iterative workflow for assembling metagenomic datasets.
Hitch, Thomas C A; Creevey, Christopher J
2018-01-24
The consensus emerging from the study of microbiomes is that they are far more complex than previously thought, requiring better assemblies and increasingly deeper sequencing. However, current metagenomic assembly techniques regularly fail to incorporate all, or even the majority in some cases, of the sequence information generated for many microbiomes, negating this effort. This can especially bias the information gathered and the perceived importance of the minor taxa in a microbiome. We propose a simple but effective approach, implemented in Python, to address this problem. Based on an iterative methodology, our workflow (called Spherical) carries out successive rounds of assemblies with the sequencing reads not yet utilised. This approach also allows the user to reduce the resources required for very large datasets, by assembling random subsets of the whole in a "divide and conquer" manner. We demonstrate the accuracy of Spherical using simulated data based on completely sequenced genomes and the effectiveness of the workflow at retrieving lost information for taxa in three published metagenomics studies of varying sizes. Our results show that Spherical increased the amount of reads utilized in the assembly by up to 109% compared to the base assembly. The additional contigs assembled by the Spherical workflow resulted in a significant (P < 0.05) changes in the predicted taxonomic profile of all datasets analysed. Spherical is implemented in Python 2.7 and freely available for use under the MIT license. Source code and documentation is hosted publically at: https://github.com/thh32/Spherical .
Goldrick, Stephen; Holmes, William; Bond, Nicholas J.; Lewis, Gareth; Kuiper, Marcel; Turner, Richard
2017-01-01
ABSTRACT Product quality heterogeneities, such as a trisulfide bond (TSB) formation, can be influenced by multiple interacting process parameters. Identifying their root cause is a major challenge in biopharmaceutical production. To address this issue, this paper describes the novel application of advanced multivariate data analysis (MVDA) techniques to identify the process parameters influencing TSB formation in a novel recombinant antibody–peptide fusion expressed in mammalian cell culture. The screening dataset was generated with a high‐throughput (HT) micro‐bioreactor system (AmbrTM 15) using a design of experiments (DoE) approach. The complex dataset was firstly analyzed through the development of a multiple linear regression model focusing solely on the DoE inputs and identified the temperature, pH and initial nutrient feed day as important process parameters influencing this quality attribute. To further scrutinize the dataset, a partial least squares model was subsequently built incorporating both on‐line and off‐line process parameters and enabled accurate predictions of the TSB concentration at harvest. Process parameters identified by the models to promote and suppress TSB formation were implemented on five 7 L bioreactors and the resultant TSB concentrations were comparable to the model predictions. This study demonstrates the ability of MVDA to enable predictions of the key performance drivers influencing TSB formation that are valid also upon scale‐up. Biotechnol. Bioeng. 2017;114: 2222–2234. © 2017 The Authors. Biotechnology and Bioengineering Published by Wiley Periodicals, Inc. PMID:28500668
Using iMCFA to Perform the CFA, Multilevel CFA, and Maximum Model for Analyzing Complex Survey Data.
Wu, Jiun-Yu; Lee, Yuan-Hsuan; Lin, John J H
2018-01-01
To construct CFA, MCFA, and maximum MCFA with LISREL v.8 and below, we provide iMCFA (integrated Multilevel Confirmatory Analysis) to examine the potential multilevel factorial structure in the complex survey data. Modeling multilevel structure for complex survey data is complicated because building a multilevel model is not an infallible statistical strategy unless the hypothesized model is close to the real data structure. Methodologists have suggested using different modeling techniques to investigate potential multilevel structure of survey data. Using iMCFA, researchers can visually set the between- and within-level factorial structure to fit MCFA, CFA and/or MAX MCFA models for complex survey data. iMCFA can then yield between- and within-level variance-covariance matrices, calculate intraclass correlations, perform the analyses and generate the outputs for respective models. The summary of the analytical outputs from LISREL is gathered and tabulated for further model comparison and interpretation. iMCFA also provides LISREL syntax of different models for researchers' future use. An empirical and a simulated multilevel dataset with complex and simple structures in the within or between level was used to illustrate the usability and the effectiveness of the iMCFA procedure on analyzing complex survey data. The analytic results of iMCFA using Muthen's limited information estimator were compared with those of Mplus using Full Information Maximum Likelihood regarding the effectiveness of different estimation methods.
Major soybean maturity gene haplotypes revealed by SNPViz analysis of 72 sequenced soybean genomes
USDA-ARS?s Scientific Manuscript database
In this Genomics Era, vast amounts of next generation sequencing data have become publicly-available for multiple genomes across hundreds of species. Analysis of these large-scale datasets can become cumbersome, especially when comparing nucleotide polymorphisms across many samples within a dataset...
Fast and robust curve skeletonization for real-world elongated objects
USDA-ARS?s Scientific Manuscript database
These datasets were generated for calibrating robot-camera systems. In an extension, we also considered the problem of calibrating robots with more than one camera. These datasets are provided as a companion to the paper, "Solving the Robot-World Hand-Eye(s) Calibration Problem with Iterative Meth...
Wolpert, Miranda; Rutter, Harry
2018-06-13
The use of routinely collected data that are flawed and limited to inform service development in healthcare systems needs to be considered, both theoretically and practically, given the reality in many areas of healthcare that only poor-quality data are available for use in complex adaptive systems. Data may be compromised in a range of ways. They may be flawed, due to missing or erroneously recorded entries; uncertain, due to differences in how data items are rated or conceptualised; proximate, in that data items are a proxy for key issues of concern; and sparse, in that a low volume of cases within key subgroups may limit the possibility of statistical inference. The term 'FUPS' is proposed to describe these flawed, uncertain, proximate and sparse datasets. Many of the systems that seek to use FUPS data may be characterised as dynamic and complex, involving a wide range of agents whose actions impact on each other in reverberating ways, leading to feedback and adaptation. The literature on the use of routinely collected data in healthcare is often implicitly premised on the availability of high-quality data to be used in complicated but not necessarily complex systems. This paper presents an example of the use of a FUPS dataset in the complex system of child mental healthcare. The dataset comprised routinely collected data from services that were part of a national service transformation initiative in child mental health from 2011 to 2015. The paper explores the use of this FUPS dataset to support meaningful dialogue between key stakeholders, including service providers, funders and users, in relation to outcomes of services. There is a particular focus on the potential for service improvement and learning. The issues raised and principles for practice suggested have relevance for other health communities that similarly face the dilemma of how to address the gap between the ideal of comprehensive clear data used in complicated, but not complex, contexts, and the reality of FUPS data in the context of complexity.
Biological intuition in alignment-free methods: response to Posada.
Ragan, Mark A; Chan, Cheong Xin
2013-08-01
A recent editorial in Journal of Molecular Evolution highlights opportunities and challenges facing molecular evolution in the era of next-generation sequencing. Abundant sequence data should allow more-complex models to be fit at higher confidence, making phylogenetic inference more reliable and improving our understanding of evolution at the molecular level. However, concern that approaches based on multiple sequence alignment may be computationally infeasible for large datasets is driving the development of so-called alignment-free methods for sequence comparison and phylogenetic inference. The recent editorial characterized these approaches as model-free, not based on the concept of homology, and lacking in biological intuition. We argue here that alignment-free methods have not abandoned models or homology, and can be biologically intuitive.
Comparing the reliability of related populations with the probability of agreement
Stevens, Nathaniel T.; Anderson-Cook, Christine M.
2016-07-26
Combining information from different populations to improve precision, simplify future predictions, or improve underlying understanding of relationships can be advantageous when considering the reliability of several related sets of systems. Using the probability of agreement to help quantify the similarities of populations can help to give a realistic assessment of whether the systems have reliability that are sufficiently similar for practical purposes to be treated as a homogeneous population. In addition, the new method is described and illustrated with an example involving two generations of a complex system where the reliability is modeled using either a logistic or probit regressionmore » model. Note that supplementary materials including code, datasets, and added discussion are available online.« less
Comparing the reliability of related populations with the probability of agreement
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stevens, Nathaniel T.; Anderson-Cook, Christine M.
Combining information from different populations to improve precision, simplify future predictions, or improve underlying understanding of relationships can be advantageous when considering the reliability of several related sets of systems. Using the probability of agreement to help quantify the similarities of populations can help to give a realistic assessment of whether the systems have reliability that are sufficiently similar for practical purposes to be treated as a homogeneous population. In addition, the new method is described and illustrated with an example involving two generations of a complex system where the reliability is modeled using either a logistic or probit regressionmore » model. Note that supplementary materials including code, datasets, and added discussion are available online.« less
Novel scanning procedure enabling the vectorization of entire rhizotron-grown root systems
2013-01-01
This paper presents an original spit-and-combine imaging procedure that enables the complete vectorization of complex root systems grown in rhizotrons. The general principle of the method is to (1) separate the root system into a small number of large pieces to reduce root overlap, (2) scan these pieces one by one, (3) analyze separate images with a root tracing software and (4) combine all tracings into a single vectorized root system. This method generates a rich dataset containing morphological, topological and geometrical information of entire root systems grown in rhizotrons. The utility of the method is illustrated with a detailed architectural analysis of a 20-day old maize root system, coupled with a spatial analysis of water uptake patterns. PMID:23286457
Novel scanning procedure enabling the vectorization of entire rhizotron-grown root systems.
Lobet, Guillaume; Draye, Xavier
2013-01-04
: This paper presents an original spit-and-combine imaging procedure that enables the complete vectorization of complex root systems grown in rhizotrons. The general principle of the method is to (1) separate the root system into a small number of large pieces to reduce root overlap, (2) scan these pieces one by one, (3) analyze separate images with a root tracing software and (4) combine all tracings into a single vectorized root system. This method generates a rich dataset containing morphological, topological and geometrical information of entire root systems grown in rhizotrons. The utility of the method is illustrated with a detailed architectural analysis of a 20-day old maize root system, coupled with a spatial analysis of water uptake patterns.
Cortical Cartography and Caret Software
Van Essen, David C.
2011-01-01
Caret software is widely used for analyzing and visualizing many types of fMRI data, often in conjunction with experimental data from other modalities. This article places Caret’s development in a historical context that spans three decades of brain mapping – from the early days of manually generated flat maps to the nascent field of human connectomics. It also highlights some of Caret’s distinctive capabilities. This includes the ease of visualizing data on surfaces and/or volumes and on atlases as well as individual subjects. Caret can display many types of experimental data using various combinations of overlays (e.g., fMRI activation maps, cortical parcellations, areal boundaries), and it has other features that facilitate the analysis and visualization of complex neuroimaging datasets. PMID:22062192
A parallel data management system for large-scale NASA datasets
NASA Technical Reports Server (NTRS)
Srivastava, Jaideep
1993-01-01
The past decade has experienced a phenomenal growth in the amount of data and resultant information generated by NASA's operations and research projects. A key application is the reprocessing problem which has been identified to require data management capabilities beyond those available today (PRAT93). The Intelligent Information Fusion (IIF) system (ROEL91) is an ongoing NASA project which has similar requirements. Deriving our understanding of NASA's future data management needs based on the above, this paper describes an approach to using parallel computer systems (processor and I/O architectures) to develop an efficient parallel database management system to address the needs. Specifically, we propose to investigate issues in low-level record organizations and management, complex query processing, and query compilation and scheduling.
Exploring heterogeneity in clinical trials with latent class analysis
Abarda, Abdallah; Contractor, Ateka A.; Wang, Juan; Dayton, C. Mitchell
2018-01-01
Case-mix is common in clinical trials and treatment effect can vary across different subgroups. Conventionally, a subgroup analysis is performed by dividing the overall study population by one or two grouping variables. It is usually impossible to explore complex high-order intersections among confounding variables. Latent class analysis (LCA) provides a framework to identify latent classes by observed manifest variables. Distal clinical outcomes and treatment effect can be different across these classes. This paper provides a step-by-step tutorial on how to perform LCA with R. A simulated dataset is generated to illustrate the process. In the example, the classify-analyze approach is employed to explore the differential treatment effects on distal outcomes across latent classes. PMID:29955579
The vast datasets generated by next generation gene sequencing and expression profiling have transformed biological and translational research. However, technologies to produce large-scale functional genomics datasets, such as high-throughput detection of protein-protein interactions (PPIs), are still in early development. While a number of powerful technologies have been employed to detect PPIs, a singular PPI biosensor platform featured with both high sensitivity and robustness in a mammalian cell environment remains to be established.
Dessì, Alessia; Pani, Danilo; Raffo, Luigi
2014-08-01
Non-invasive fetal electrocardiography is still an open research issue. The recent publication of an annotated dataset on Physionet providing four-channel non-invasive abdominal ECG traces promoted an international challenge on the topic. Starting from that dataset, an algorithm for the identification of the fetal QRS complexes from a reduced number of electrodes and without any a priori information about the electrode positioning has been developed, entering into the top ten best-performing open-source algorithms presented at the challenge.In this paper, an improved version of that algorithm is presented and evaluated exploiting the same challenge metrics. It is mainly based on the subtraction of the maternal QRS complexes in every lead, obtained by synchronized averaging of morphologically similar complexes, the filtering of the maternal P and T waves and the enhancement of the fetal QRS through independent component analysis (ICA) applied on the processed signals before a final fetal QRS detection stage. The RR time series of both the mother and the fetus are analyzed to enhance pseudoperiodicity with the aim of correcting wrong annotations. The algorithm has been designed and extensively evaluated on the open dataset A (N = 75), and finally evaluated on datasets B (N = 100) and C (N = 272) to have the mean scores over data not used during the algorithm development. Compared to the results achieved by the previous version of the algorithm, the current version would mark the 5th and 4th position in the final ranking related to the events 1 and 2, reserved to the open-source challenge entries, taking into account both official and unofficial entrants. On dataset A, the algorithm achieves 0.982 median sensitivity and 0.976 median positive predictivity.
A multi-service data management platform for scientific oceanographic products
NASA Astrophysics Data System (ADS)
D'Anca, Alessandro; Conte, Laura; Nassisi, Paola; Palazzo, Cosimo; Lecci, Rita; Cretì, Sergio; Mancini, Marco; Nuzzo, Alessandra; Mirto, Maria; Mannarini, Gianandrea; Coppini, Giovanni; Fiore, Sandro; Aloisio, Giovanni
2017-02-01
An efficient, secure and interoperable data platform solution has been developed in the TESSA project to provide fast navigation and access to the data stored in the data archive, as well as a standard-based metadata management support. The platform mainly targets scientific users and the situational sea awareness high-level services such as the decision support systems (DSS). These datasets are accessible through the following three main components: the Data Access Service (DAS), the Metadata Service and the Complex Data Analysis Module (CDAM). The DAS allows access to data stored in the archive by providing interfaces for different protocols and services for downloading, variables selection, data subsetting or map generation. Metadata Service is the heart of the information system of the TESSA products and completes the overall infrastructure for data and metadata management. This component enables data search and discovery and addresses interoperability by exploiting widely adopted standards for geospatial data. Finally, the CDAM represents the back-end of the TESSA DSS by performing on-demand complex data analysis tasks.
Cellular diversity in the Drosophila midbrain revealed by single-cell transcriptomics
2018-01-01
To understand the brain, molecular details need to be overlaid onto neural wiring diagrams so that synaptic mode, neuromodulation and critical signaling operations can be considered. Single-cell transcriptomics provide a unique opportunity to collect this information. Here we present an initial analysis of thousands of individual cells from Drosophila midbrain, that were acquired using Drop-Seq. A number of approaches permitted the assignment of transcriptional profiles to several major brain regions and cell-types. Expression of biosynthetic enzymes and reuptake mechanisms allows all the neurons to be typed according to the neurotransmitter or neuromodulator that they produce and presumably release. Some neuropeptides are preferentially co-expressed in neurons using a particular fast-acting transmitter, or monoamine. Neuromodulatory and neurotransmitter receptor subunit expression illustrates the potential of these molecules in generating complexity in neural circuit function. This cell atlas dataset provides an important resource to link molecular operations to brain regions and complex neural processes. PMID:29671739
Wang, Zhuo; Jin, Shuilin; Liu, Guiyou; Zhang, Xiurui; Wang, Nan; Wu, Deliang; Hu, Yang; Zhang, Chiping; Jiang, Qinghua; Xu, Li; Wang, Yadong
2017-05-23
The development of single-cell RNA sequencing has enabled profound discoveries in biology, ranging from the dissection of the composition of complex tissues to the identification of novel cell types and dynamics in some specialized cellular environments. However, the large-scale generation of single-cell RNA-seq (scRNA-seq) data collected at multiple time points remains a challenge to effective measurement gene expression patterns in transcriptome analysis. We present an algorithm based on the Dynamic Time Warping score (DTWscore) combined with time-series data, that enables the detection of gene expression changes across scRNA-seq samples and recovery of potential cell types from complex mixtures of multiple cell types. The DTWscore successfully classify cells of different types with the most highly variable genes from time-series scRNA-seq data. The study was confined to methods that are implemented and available within the R framework. Sample datasets and R packages are available at https://github.com/xiaoxiaoxier/DTWscore .
A 3D virtual reality simulator for training of minimally invasive surgery.
Mi, Shao-Hua; Hou, Zeng-Gunag; Yang, Fan; Xie, Xiao-Liang; Bian, Gui-Bin
2014-01-01
For the last decade, remarkable progress has been made in the field of cardiovascular disease treatment. However, these complex medical procedures require a combination of rich experience and technical skills. In this paper, a 3D virtual reality simulator for core skills training in minimally invasive surgery is presented. The system can generate realistic 3D vascular models segmented from patient datasets, including a beating heart, and provide a real-time computation of force and force feedback module for surgical simulation. Instruments, such as a catheter or guide wire, are represented by a multi-body mass-spring model. In addition, a realistic user interface with multiple windows and real-time 3D views are developed. Moreover, the simulator is also provided with a human-machine interaction module that gives doctors the sense of touch during the surgery training, enables them to control the motion of a virtual catheter/guide wire inside a complex vascular model. Experimental results show that the simulator is suitable for minimally invasive surgery training.
Every factor helps: Rapid Ptychographic Reconstruction
NASA Astrophysics Data System (ADS)
Nashed, Youssef
2015-03-01
Recent advances in microscopy, specifically higher spatial resolution and data acquisition rates, require faster and more robust phase retrieval reconstruction methods. Ptychography is a phase retrieval technique for reconstructing the complex transmission function of a specimen from a sequence of diffraction patterns in visible light, X-ray, and electron microscopes. As technical advances allow larger fields to be imaged, computational challenges arise for reconstructing the correspondingly larger data volumes. Waiting to postprocess datasets offline results in missed opportunities. Here we present a parallel method for real-time ptychographic phase retrieval. It uses a hybrid parallel strategy to divide the computation between multiple graphics processing units (GPUs). A final specimen reconstruction is then achieved by different techniques to merge sub-dataset results into a single complex phase and amplitude image. Results are shown on a simulated specimen and real datasets from X-ray experiments conducted at a synchrotron light source.
DNApod: DNA polymorphism annotation database from next-generation sequence read archives.
Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu
2017-01-01
With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.
DNApod: DNA polymorphism annotation database from next-generation sequence read archives
Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu
2017-01-01
With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924
Maron, Bradley A; Leopold, Jane A
2016-09-30
Reductionist theory proposes that analyzing complex systems according to their most fundamental components is required for problem resolution, and has served as the cornerstone of scientific methodology for more than four centuries. However, technological gains in the current scientific era now allow for the generation of large datasets that profile the proteomic, genomic, and metabolomic signatures of biological systems across a range of conditions. The accessibility of data on such a vast scale has, in turn, highlighted the limitations of reductionism, which is not conducive to analyses that consider multiple and contemporaneous interactions between intermediates within a pathway or across constructs. Systems biology has emerged as an alternative approach to analyze complex biological systems. This methodology is based on the generation of scale-free networks and, thus, provides a quantitative assessment of relationships between multiple intermediates, such as protein-protein interactions, within and between pathways of interest. In this way, systems biology is well positioned to identify novel targets implicated in the pathogenesis or treatment of diseases. In this review, the historical root and fundamental basis of systems biology, as well as the potential applications of this methodology are discussed with particular emphasis on integration of these concepts to further understanding of cardiovascular disorders such as coronary artery disease and pulmonary hypertension.
Remote Earth Sciences data collection using ACTS
NASA Technical Reports Server (NTRS)
Evans, Robert H.
1992-01-01
Given the focus on global change and the attendant scope of such research, we anticipate significant growth of requirements for investigator interaction, processing system capabilities, and availability of data sets. The increased complexity of global processes requires interdisciplinary teams to address them; the investigators will need to interact on a regular basis; however, it is unlikely that a single institution will house sufficient investigators with the required breadth of skills. The complexity of the computations may also require resources beyond those located within a single institution; this lack of sufficient computational resources leads to a distributed system located at geographically dispersed institutions. Finally the combination of long term data sets like the Pathfinder datasets and the data to be gathered by new generations of satellites such as SeaWiFS and MODIS-N yield extra-ordinarily large amounts of data. All of these factors combine to increase demands on the communications facilities available; the demands are generating requirements for highly flexible, high capacity networks. We have been examining the applicability of the Advanced Communications Technology Satellite (ACTS) to address the scientific, computational, and, primarily, communications questions resulting from global change research. As part of this effort three scenarios for oceanographic use of ACTS have been developed; a full discussion of this is contained in Appendix B.
Atomic analysis of protein-protein interfaces with known inhibitors: the 2P2I database.
Bourgeas, Raphaël; Basse, Marie-Jeanne; Morelli, Xavier; Roche, Philippe
2010-03-09
In the last decade, the inhibition of protein-protein interactions (PPIs) has emerged from both academic and private research as a new way to modulate the activity of proteins. Inhibitors of these original interactions are certainly the next generation of highly innovative drugs that will reach the market in the next decade. However, in silico design of such compounds still remains challenging. Here we describe this particular PPI chemical space through the presentation of 2P2I(DB), a hand-curated database dedicated to the structure of PPIs with known inhibitors. We have analyzed protein/protein and protein/inhibitor interfaces in terms of geometrical parameters, atom and residue properties, buried accessible surface area and other biophysical parameters. The interfaces found in 2P2I(DB) were then compared to those of representative datasets of heterodimeric complexes. We propose a new classification of PPIs with known inhibitors into two classes depending on the number of segments present at the interface and corresponding to either a single secondary structure element or to a more globular interacting domain. 2P2I(DB) complexes share global shape properties with standard transient heterodimer complexes, but their accessible surface areas are significantly smaller. No major conformational changes are seen between the different states of the proteins. The interfaces are more hydrophobic than general PPI's interfaces, with less charged residues and more non-polar atoms. Finally, fifty percent of the complexes in the 2P2I(DB) dataset possess more hydrogen bonds than typical protein-protein complexes. Potential areas of study for the future are proposed, which include a new classification system consisting of specific families and the identification of PPI targets with high druggability potential based on key descriptors of the interaction. 2P2I database stores structural information about PPIs with known inhibitors and provides a useful tool for biologists to assess the potential druggability of their interfaces. The database can be accessed at http://2p2idb.cnrs-mrs.fr.
Technique for fast and efficient hierarchical clustering
Stork, Christopher
2013-10-08
A fast and efficient technique for hierarchical clustering of samples in a dataset includes compressing the dataset to reduce a number of variables within each of the samples of the dataset. A nearest neighbor matrix is generated to identify nearest neighbor pairs between the samples based on differences between the variables of the samples. The samples are arranged into a hierarchy that groups the samples based on the nearest neighbor matrix. The hierarchy is rendered to a display to graphically illustrate similarities or differences between the samples.
Bourke, Peter M; van Geest, Geert; Voorrips, Roeland E; Jansen, Johannes; Kranenburg, Twan; Shahin, Arwa; Visser, Richard G F; Arens, Paul; Smulders, Marinus J M; Maliepaard, Chris
2018-05-02
Polyploid species carry more than two copies of each chromosome, a condition found in many of the world's most important crops. Genetic mapping in polyploids is more complex than in diploid species, resulting in a lack of available software tools. These are needed if we are to realise all the opportunities offered by modern genotyping platforms for genetic research and breeding in polyploid crops. polymapR is an R package for genetic linkage analysis and integrated genetic map construction from bi-parental populations of outcrossing autopolyploids. It can currently analyse triploid, tetraploid and hexaploid marker datasets and is applicable to various crops including potato, leek, alfalfa, blueberry, chrysanthemum, sweet potato or kiwifruit. It can detect, estimate and correct for preferential chromosome pairing, and has been tested on high-density marker datasets from potato, rose and chrysanthemum, generating high-density integrated linkage maps in all of these crops. polymapR is freely available under the general public license from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/package=polymapR. Chris Maliepaard chris.maliepaard@wur.nl or Roeland E. Voorrips roeland.voorrips@wur.nl. Supplementary data are available at Bioinformatics online.
Identifying Corresponding Patches in SAR and Optical Images With a Pseudo-Siamese CNN
NASA Astrophysics Data System (ADS)
Hughes, Lloyd H.; Schmitt, Michael; Mou, Lichao; Wang, Yuanyuan; Zhu, Xiao Xiang
2018-05-01
In this letter, we propose a pseudo-siamese convolutional neural network (CNN) architecture that enables to solve the task of identifying corresponding patches in very-high-resolution (VHR) optical and synthetic aperture radar (SAR) remote sensing imagery. Using eight convolutional layers each in two parallel network streams, a fully connected layer for the fusion of the features learned in each stream, and a loss function based on binary cross-entropy, we achieve a one-hot indication if two patches correspond or not. The network is trained and tested on an automatically generated dataset that is based on a deterministic alignment of SAR and optical imagery via previously reconstructed and subsequently co-registered 3D point clouds. The satellite images, from which the patches comprising our dataset are extracted, show a complex urban scene containing many elevated objects (i.e. buildings), thus providing one of the most difficult experimental environments. The achieved results show that the network is able to predict corresponding patches with high accuracy, thus indicating great potential for further development towards a generalized multi-sensor key-point matching procedure. Index Terms-synthetic aperture radar (SAR), optical imagery, data fusion, deep learning, convolutional neural networks (CNN), image matching, deep matching
Rotation-invariant features for multi-oriented text detection in natural images.
Yao, Cong; Zhang, Xin; Bai, Xiang; Liu, Wenyu; Ma, Yi; Tu, Zhuowen
2013-01-01
Texts in natural scenes carry rich semantic information, which can be used to assist a wide range of applications, such as object recognition, image/video retrieval, mapping/navigation, and human computer interaction. However, most existing systems are designed to detect and recognize horizontal (or near-horizontal) texts. Due to the increasing popularity of mobile-computing devices and applications, detecting texts of varying orientations from natural images under less controlled conditions has become an important but challenging task. In this paper, we propose a new algorithm to detect texts of varying orientations. Our algorithm is based on a two-level classification scheme and two sets of features specially designed for capturing the intrinsic characteristics of texts. To better evaluate the proposed method and compare it with the competing algorithms, we generate a comprehensive dataset with various types of texts in diverse real-world scenes. We also propose a new evaluation protocol, which is more suitable for benchmarking algorithms for detecting texts in varying orientations. Experiments on benchmark datasets demonstrate that our system compares favorably with the state-of-the-art algorithms when handling horizontal texts and achieves significantly enhanced performance on variant texts in complex natural scenes.
Storytelling and story testing in domestication.
Gerbault, Pascale; Allaby, Robin G; Boivin, Nicole; Rudzinski, Anna; Grimaldi, Ilaria M; Pires, J Chris; Climer Vigueira, Cynthia; Dobney, Keith; Gremillion, Kristen J; Barton, Loukas; Arroyo-Kalin, Manuel; Purugganan, Michael D; Rubio de Casas, Rafael; Bollongino, Ruth; Burger, Joachim; Fuller, Dorian Q; Bradley, Daniel G; Balding, David J; Richerson, Peter J; Gilbert, M Thomas P; Larson, Greger; Thomas, Mark G
2014-04-29
The domestication of plants and animals marks one of the most significant transitions in human, and indeed global, history. Traditionally, study of the domestication process was the exclusive domain of archaeologists and agricultural scientists; today it is an increasingly multidisciplinary enterprise that has come to involve the skills of evolutionary biologists and geneticists. Although the application of new information sources and methodologies has dramatically transformed our ability to study and understand domestication, it has also generated increasingly large and complex datasets, the interpretation of which is not straightforward. In particular, challenges of equifinality, evolutionary variance, and emergence of unexpected or counter-intuitive patterns all face researchers attempting to infer past processes directly from patterns in data. We argue that explicit modeling approaches, drawing upon emerging methodologies in statistics and population genetics, provide a powerful means of addressing these limitations. Modeling also offers an approach to analyzing datasets that avoids conclusions steered by implicit biases, and makes possible the formal integration of different data types. Here we outline some of the modeling approaches most relevant to current problems in domestication research, and demonstrate the ways in which simulation modeling is beginning to reshape our understanding of the domestication process.
3D shape recovery from image focus using Gabor features
NASA Astrophysics Data System (ADS)
Mahmood, Fahad; Mahmood, Jawad; Zeb, Ayesha; Iqbal, Javaid
2018-04-01
Recovering an accurate and precise depth map from a set of acquired 2-D image dataset of the target object each having different focus information is an ultimate goal of 3-D shape recovery. Focus measure algorithm plays an important role in this architecture as it converts the corresponding color value information into focus information which will be then utilized for recovering depth map. This article introduces Gabor features as focus measure approach for recovering depth map from a set of 2-D images. Frequency and orientation representation of Gabor filter features is similar to human visual system and normally applied for texture representation. Due to its little computational complexity, sharp focus measure curve, robust to random noise sources and accuracy, it is considered as superior alternative to most of recently proposed 3-D shape recovery approaches. This algorithm is deeply investigated on real image sequences and synthetic image dataset. The efficiency of the proposed scheme is also compared with the state of art 3-D shape recovery approaches. Finally, by means of two global statistical measures, root mean square error and correlation, we claim that this approach, in spite of simplicity, generates accurate results.
Storytelling and story testing in domestication
Gerbault, Pascale; Allaby, Robin G.; Boivin, Nicole; Rudzinski, Anna; Grimaldi, Ilaria M.; Pires, J. Chris; Climer Vigueira, Cynthia; Dobney, Keith; Gremillion, Kristen J.; Barton, Loukas; Arroyo-Kalin, Manuel; Purugganan, Michael D.; Rubio de Casas, Rafael; Bollongino, Ruth; Burger, Joachim; Fuller, Dorian Q.; Bradley, Daniel G.; Balding, David J.; Richerson, Peter J.; Gilbert, M. Thomas P.; Larson, Greger; Thomas, Mark G.
2014-01-01
The domestication of plants and animals marks one of the most significant transitions in human, and indeed global, history. Traditionally, study of the domestication process was the exclusive domain of archaeologists and agricultural scientists; today it is an increasingly multidisciplinary enterprise that has come to involve the skills of evolutionary biologists and geneticists. Although the application of new information sources and methodologies has dramatically transformed our ability to study and understand domestication, it has also generated increasingly large and complex datasets, the interpretation of which is not straightforward. In particular, challenges of equifinality, evolutionary variance, and emergence of unexpected or counter-intuitive patterns all face researchers attempting to infer past processes directly from patterns in data. We argue that explicit modeling approaches, drawing upon emerging methodologies in statistics and population genetics, provide a powerful means of addressing these limitations. Modeling also offers an approach to analyzing datasets that avoids conclusions steered by implicit biases, and makes possible the formal integration of different data types. Here we outline some of the modeling approaches most relevant to current problems in domestication research, and demonstrate the ways in which simulation modeling is beginning to reshape our understanding of the domestication process. PMID:24753572
Chattree, A; Barbour, J A; Thomas-Gibson, S; Bhandari, P; Saunders, B P; Veitch, A M; Anderson, J; Rembacken, B J; Loughrey, M B; Pullan, R; Garrett, W V; Lewis, G; Dolwani, S; Rutter, M D
2017-01-01
The management of large non-pedunculated colorectal polyps (LNPCPs) is complex, with widespread variation in management and outcome, even amongst experienced clinicians. Variations in the assessment and decision-making processes are likely to be a major factor in this variability. The creation of a standardized minimum dataset to aid decision-making may therefore result in improved clinical management. An official working group of 13 multidisciplinary specialists was appointed by the Association of Coloproctology of Great Britain and Ireland (ACPGBI) and the British Society of Gastroenterology (BSG) to develop a minimum dataset on LNPCPs. The literature review used to structure the ACPGBI/BSG guidelines for the management of LNPCPs was used by a steering subcommittee to identify various parameters pertaining to the decision-making processes in the assessment and management of LNPCPs. A modified Delphi consensus process was then used for voting on proposed parameters over multiple voting rounds with at least 80% agreement defined as consensus. The minimum dataset was used in a pilot process to ensure rigidity and usability. A 23-parameter minimum dataset with parameters relating to patient and lesion factors, including six parameters relating to image retrieval, was formulated over four rounds of voting with two pilot processes to test rigidity and usability. This paper describes the development of the first reported evidence-based and expert consensus minimum dataset for the management of LNPCPs. It is anticipated that this dataset will allow comprehensive and standardized lesion assessment to improve decision-making in the assessment and management of LNPCPs. Colorectal Disease © 2016 The Association of Coloproctology of Great Britain and Ireland.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Qi
Organic aerosols (OA) are an important but poorly characterized component of the earth’s climate system. Enormous complexities commonly associated with OA composition and life cycle processes have significantly complicated the simulation and quantification of aerosol effects. To unravel these complexities and improve understanding of the properties, sources, formation, evolution processes, and radiative properties of atmospheric OA, we propose to perform advanced and integrated analyses of multiple DOE aerosol mass spectrometry datasets, including two high-resolution time-of-flight aerosol mass spectrometer (HR-AMS) datasets from intensive field campaigns on the aerosol life cycle and the Aerosol Chemical Speciation Monitor (ACSM) datasets from long-term routinemore » measurement programs at ACRF sites. In this project, we will focus on 1) characterizing the chemical (i.e., composition, organic elemental ratios), physical (i.e., size distribution and volatility), and radiative (i.e., sub- and super-saturated growth) properties of organic aerosols, 2) examining the correlations of these properties with different source and process regimes (e.g., primary, secondary, urban, biogenic, biomass burning, marine, or mixtures), 3) quantifying the evolutions of these properties as a function of photochemical processing, 4) identifying and characterizing special cases for important processes such as SOA formation and new particle formation and growth, and 5) correlating size-resolved aerosol chemistry with measurements of radiative properties of aerosols to determine the climatically relevant properties of OA and characterize the relationship between these properties and processes of atmospheric aerosol organics. Our primary goal is to improve a process-level understanding of the life cycle of organic aerosols in the Earth’s atmosphere. We will also aim at bridging between observations and models via synthesizing and translating the results and insights generated from this research into data products and formulations that may be directly used to inform, improve, and evaluate regional and global models. In addition, we will continue our current very active collaborations with several modeling groups to enhance the use and interpretation of our data products. Overall, this research will contribute new data to improve quantification of the aerosol’s effects on climate and thus the achievement of ASR’s science goal of – “improving the fidelity and predictive capability of global climate models”.« less
Kurkcuoglu, Zeynep; Doruker, Pemra
2016-01-01
Incorporating receptor flexibility in small ligand-protein docking still poses a challenge for proteins undergoing large conformational changes. In the absence of bound structures, sampling conformers that are accessible by apo state may facilitate docking and drug design studies. For this aim, we developed an unbiased conformational search algorithm, by integrating global modes from elastic network model, clustering and energy minimization with implicit solvation. Our dataset consists of five diverse proteins with apo to complex RMSDs 4.7–15 Å. Applying this iterative algorithm on apo structures, conformers close to the bound-state (RMSD 1.4–3.8 Å), as well as the intermediate states were generated. Dockings to a sequence of conformers consisting of a closed structure and its “parents” up to the apo were performed to compare binding poses on different states of the receptor. For two periplasmic binding proteins and biotin carboxylase that exhibit hinge-type closure of two dynamics domains, the best pose was obtained for the conformer closest to the bound structure (ligand RMSDs 1.5–2 Å). In contrast, the best pose for adenylate kinase corresponded to an intermediate state with partially closed LID domain and open NMP domain, in line with recent studies (ligand RMSD 2.9 Å). The docking of a helical peptide to calmodulin was the most challenging case due to the complexity of its 15 Å transition, for which a two-stage procedure was necessary. The technique was first applied on the extended calmodulin to generate intermediate conformers; then peptide docking and a second generation stage on the complex were performed, which in turn yielded a final peptide RMSD of 2.9 Å. Our algorithm is effective in producing conformational states based on the apo state. This study underlines the importance of such intermediate states for ligand docking to proteins undergoing large transitions. PMID:27348230
IndeCut evaluates performance of network motif discovery algorithms.
Ansariola, Mitra; Megraw, Molly; Koslicki, David
2018-05-01
Genomic networks represent a complex map of molecular interactions which are descriptive of the biological processes occurring in living cells. Identifying the small over-represented circuitry patterns in these networks helps generate hypotheses about the functional basis of such complex processes. Network motif discovery is a systematic way of achieving this goal. However, a reliable network motif discovery outcome requires generating random background networks which are the result of a uniform and independent graph sampling method. To date, there has been no method to numerically evaluate whether any network motif discovery algorithm performs as intended on realistically sized datasets-thus it was not possible to assess the validity of resulting network motifs. In this work, we present IndeCut, the first method to date that characterizes network motif finding algorithm performance in terms of uniform sampling on realistically sized networks. We demonstrate that it is critical to use IndeCut prior to running any network motif finder for two reasons. First, IndeCut indicates the number of samples needed for a tool to produce an outcome that is both reproducible and accurate. Second, IndeCut allows users to choose the tool that generates samples in the most independent fashion for their network of interest among many available options. The open source software package is available at https://github.com/megrawlab/IndeCut. megrawm@science.oregonstate.edu or david.koslicki@math.oregonstate.edu. Supplementary data are available at Bioinformatics online.
NASA Astrophysics Data System (ADS)
Sun, L. Qing; Feng, Feng X.
2014-11-01
In this study, we first built and compared two different climate datasets for Wuling mountainous area in 2010, one of which considered topographical effects during the ANUSPLIN interpolation was referred as terrain-based climate dataset, while the other one did not was called ordinary climate dataset. Then, we quantified the topographical effects of climatic inputs on NPP estimation by inputting two different climate datasets to the same ecosystem model, the Boreal Ecosystem Productivity Simulator (BEPS), to evaluate the importance of considering relief when estimating NPP. Finally, we found the primary contributing variables to the topographical effects through a series of experiments given an overall accuracy of the model output for NPP. The results showed that: (1) The terrain-based climate dataset presented more reliable topographic information and had closer agreements with the station dataset than the ordinary climate dataset at successive time series of 365 days in terms of the daily mean values. (2) On average, ordinary climate dataset underestimated NPP by 12.5% compared with terrain-based climate dataset over the whole study area. (3) The primary climate variables contributing to the topographical effects of climatic inputs for Wuling mountainous area were temperatures, which suggest that it is necessary to correct temperature differences for estimating NPP accurately in such a complex terrain.
Behavior Knowledge Space-Based Fusion for Copy-Move Forgery Detection.
Ferreira, Anselmo; Felipussi, Siovani C; Alfaro, Carlos; Fonseca, Pablo; Vargas-Munoz, John E; Dos Santos, Jefersson A; Rocha, Anderson
2016-07-20
The detection of copy-move image tampering is of paramount importance nowadays, mainly due to its potential use for misleading the opinion forming process of the general public. In this paper, we go beyond traditional forgery detectors and aim at combining different properties of copy-move detection approaches by modeling the problem on a multiscale behavior knowledge space, which encodes the output combinations of different techniques as a priori probabilities considering multiple scales of the training data. Afterwards, the conditional probabilities missing entries are properly estimated through generative models applied on the existing training data. Finally, we propose different techniques that exploit the multi-directionality of the data to generate the final outcome detection map in a machine learning decision-making fashion. Experimental results on complex datasets, comparing the proposed techniques with a gamut of copy-move detection approaches and other fusion methodologies in the literature show the effectiveness of the proposed method and its suitability for real-world applications.
GANViz: A Visual Analytics Approach to Understand the Adversarial Game.
Wang, Junpeng; Gou, Liang; Yang, Hao; Shen, Han-Wei
2018-06-01
Generative models bear promising implications to learn data representations in an unsupervised fashion with deep learning. Generative Adversarial Nets (GAN) is one of the most popular frameworks in this arena. Despite the promising results from different types of GANs, in-depth understanding on the adversarial training process of the models remains a challenge to domain experts. The complexity and the potential long-time training process of the models make it hard to evaluate, interpret, and optimize them. In this work, guided by practical needs from domain experts, we design and develop a visual analytics system, GANViz, aiming to help experts understand the adversarial process of GANs in-depth. Specifically, GANViz evaluates the model performance of two subnetworks of GANs, provides evidence and interpretations of the models' performance, and empowers comparative analysis with the evidence. Through our case studies with two real-world datasets, we demonstrate that GANViz can provide useful insight into helping domain experts understand, interpret, evaluate, and potentially improve GAN models.
Squish: Near-Optimal Compression for Archival of Relational Datasets
Gao, Yihan; Parameswaran, Aditya
2017-01-01
Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: Squish achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets. PMID:28180028
Parallel Index and Query for Large Scale Data Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chou, Jerry; Wu, Kesheng; Ruebel, Oliver
2011-07-18
Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for process- ing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that address these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process mas- sive datasets on modern supercomputing platforms. We apply FastQuery to processing ofmore » a massive 50TB dataset generated by a large scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for inter- esting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.« less
Marin, Daniele; Ramirez-Giraldo, Juan Carlos; Gupta, Sonia; Fu, Wanyi; Stinnett, Sandra S; Mileto, Achille; Bellini, Davide; Patel, Bhavik; Samei, Ehsan; Nelson, Rendon C
2016-06-01
The purpose of this study is to investigate whether the reduction in noise using a second-generation monoenergetic algorithm can improve the conspicuity of hypervascular liver tumors on dual-energy CT (DECT) images of the liver. An anthropomorphic liver phantom in three body sizes and iodine-containing inserts simulating hypervascular lesions was imaged with DECT and single-energy CT at various energy levels (80-140 kV). In addition, a retrospective clinical study was performed in 31 patients with 66 hypervascular liver tumors who underwent DECT during the late hepatic arterial phase. Datasets at energy levels ranging from 40 to 80 keV were reconstructed using first- and second-generation monoenergetic algorithms. Noise, tumor-to-liver contrast-to-noise ratio (CNR), and CNR with a noise constraint (CNRNC) set with a maximum noise increase of 50% were calculated and compared among the different reconstructed datasets. The maximum CNR for the second-generation monoenergetic algorithm, which was attained at 40 keV in both phantom and clinical datasets, was statistically significantly higher than the maximum CNR for the first-generation monoenergetic algorithm (p < 0.001) or single-energy CT acquisitions across a wide range of kilovoltage values. With the second-generation monoenergetic algorithm, the optimal CNRNC occurred at 55 keV, corresponding to lower energy levels compared with first-generation algorithm (predominantly at 70 keV). Patient body size did not substantially affect the selection of the optimal energy level to attain maximal CNR and CNRNC using the second-generation monoenergetic algorithm. A noise-optimized second-generation monoenergetic algorithm significantly improves the conspicuity of hypervascular liver tumors.
Open Data as Open Educational Resources: Towards Transversal Skills and Global Citizenship
ERIC Educational Resources Information Center
Atenas, Javiera; Havemann, Leo; Priego, Ernesto
2015-01-01
Open Data is the name given to datasets which have been generated by international organisations, governments, NGOs and academic researchers, and made freely available online and openly-licensed. These datasets can be used by educators as Open Educational Resources (OER) to support different teaching and learning activities, allowing students to…
This draft Geographic Information System (GIS) tool can be used to generate scenarios of housing-density changes and calculate impervious surface cover for the conterminous United States. A draft User’s Guide accompanies the tool. This product distributes the population project...
Viking Seismometer PDS Archive Dataset
NASA Astrophysics Data System (ADS)
Lorenz, R. D.
2016-12-01
The Viking Lander 2 seismometer operated successfully for over 500 Sols on the Martian surface, recording at least one likely candidate Marsquake. The Viking mission, in an era when data handling hardware (both on board and on the ground) was limited in capability, predated modern planetary data archiving, and ad-hoc repositories of the data, and the very low-level record at NSSDC, were neither convenient to process nor well-known. In an effort supported by the NASA Mars Data Analysis Program, we have converted the bulk of the Viking dataset (namely the 49,000 and 270,000 records made in High- and Event- modes at 20 and 1 Hz respectively) into a simple ASCII table format. Additionally, since wind-generated lander motion is a major component of the signal, contemporaneous meteorological data are included in summary records to facilitate correlation. These datasets are being archived at the PDS Geosciences Node. In addition to brief instrument and dataset descriptions, the archive includes code snippets in the freely-available language 'R' to demonstrate plotting and analysis. Further, we present examples of lander-generated noise, associated with the sampler arm, instrument dumps and other mechanical operations.
Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset.
Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert
2018-02-13
We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for 'target' versus 'non-target' (dataset A) and symbol 'O' versus symbol 'X' (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques.
Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset
Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert
2018-01-01
We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for ‘target’ versus ‘non-target’ (dataset A) and symbol ‘O’ versus symbol ‘X’ (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques. PMID:29437166
NASA Astrophysics Data System (ADS)
Whitehall, K. D.; Jenkins, G. S.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Goodale, C. E.; Hart, A. F.; Ramirez, P.; Whittell, J.; Zimdars, P. A.
2012-12-01
Mesoscale convective complexes (MCCs) are large (2 - 3 x 105 km2) nocturnal convectively-driven weather systems that are generally associated with high precipitation events in short durations (less than 12hrs) in various locations through out the tropics and midlatitudes (Maddox 1980). These systems are particularly important for climate in the West Sahel region, where the precipitation associated with them is a principal component of the rainfall season (Laing and Fritsch 1993). These systems occur on weather timescales and are historically identified from weather data analysis via manual and more recently automated processes (Miller and Fritsch 1991, Nesbett 2006, Balmey and Reason 2012). The Regional Climate Model Evaluation System (RCMES) is an open source tool designed for easy evaluation of climate and Earth system data through access to standardized datasets, and intrinsic tools that perform common analysis and visualization tasks (Hart et al. 2011). The RCMES toolkit also provides the flexibility of user-defined subroutines for further metrics, visualization and even dataset manipulation. The purpose of this study is to present a methodology for identifying MCCs in observation datasets using the RCMES framework. TRMM 3 hourly datasets will be used to demonstrate the methodology for 2005 boreal summer. This method promotes the use of open source software for scientific data systems to address a concern to multiple stakeholders in the earth sciences. A historical MCC dataset provides a platform with regards to further studies of the variability of frequency on various timescales of MCCs that is important for many including climate scientists, meteorologists, water resource managers, and agriculturalists. The methodology of using RCMES for searching and clipping datasets will engender a new realm of studies as users of the system will no longer be restricted to solely using the datasets as they reside in their own local systems; instead will be afforded rapid, effective, and transparent access, processing and visualization of the wealth of remote sensing datasets and climate model outputs available.
Improving average ranking precision in user searches for biomedical research datasets
Gobeill, Julien; Gaudinat, Arnaud; Vachon, Thérèse; Ruch, Patrick
2017-01-01
Abstract Availability of research datasets is keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participant’s best submissions. Overall, it is ranked at top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed positive impact on the system’s performance increasing our baseline up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL: https://biocaddie.org/benchmark-data PMID:29220475
Uvf - Unified Volume Format: A General System for Efficient Handling of Large Volumetric Datasets.
Krüger, Jens; Potter, Kristin; Macleod, Rob S; Johnson, Christopher
2008-01-01
With the continual increase in computing power, volumetric datasets with sizes ranging from only a few megabytes to petascale are generated thousands of times per day. Such data may come from an ordinary source such as simple everyday medical imaging procedures, while larger datasets may be generated from cluster-based scientific simulations or measurements of large scale experiments. In computer science an incredible amount of work worldwide is put into the efficient visualization of these datasets. As researchers in the field of scientific visualization, we often have to face the task of handling very large data from various sources. This data usually comes in many different data formats. In medical imaging, the DICOM standard is well established, however, most research labs use their own data formats to store and process data. To simplify the task of reading the many different formats used with all of the different visualization programs, we present a system for the efficient handling of many types of large scientific datasets (see Figure 1 for just a few examples). While primarily targeted at structured volumetric data, UVF can store just about any type of structured and unstructured data. The system is composed of a file format specification with a reference implementation of a reader. It is not only a common, easy to implement format but also allows for efficient rendering of most datasets without the need to convert the data in memory.
Yang, Y X; Teo, S-K; Van Reeth, E; Tan, C H; Tham, I W K; Poh, C L
2015-08-01
Accurate visualization of lung motion is important in many clinical applications, such as radiotherapy of lung cancer. Advancement in imaging modalities [e.g., computed tomography (CT) and MRI] has allowed dynamic imaging of lung and lung tumor motion. However, each imaging modality has its advantages and disadvantages. The study presented in this paper aims at generating synthetic 4D-CT dataset for lung cancer patients by combining both continuous three-dimensional (3D) motion captured by 4D-MRI and the high spatial resolution captured by CT using the authors' proposed approach. A novel hybrid approach based on deformable image registration (DIR) and finite element method simulation was developed to fuse a static 3D-CT volume (acquired under breath-hold) and the 3D motion information extracted from 4D-MRI dataset, creating a synthetic 4D-CT dataset. The study focuses on imaging of lung and lung tumor. Comparing the synthetic 4D-CT dataset with the acquired 4D-CT dataset of six lung cancer patients based on 420 landmarks, accurate results (average error <2 mm) were achieved using the authors' proposed approach. Their hybrid approach achieved a 40% error reduction (based on landmarks assessment) over using only DIR techniques. The synthetic 4D-CT dataset generated has high spatial resolution, has excellent lung details, and is able to show movement of lung and lung tumor over multiple breathing cycles.
NASA Astrophysics Data System (ADS)
Guha, Rajarshi; Schürer, Stephan C.
2008-06-01
Computational toxicology is emerging as an encouraging alternative to experimental testing. The Molecular Libraries Screening Center Network (MLSCN) as part of the NIH Molecular Libraries Roadmap has recently started generating large and diverse screening datasets, which are publicly available in PubChem. In this report, we investigate various aspects of developing computational models to predict cell toxicity based on cell proliferation screening data generated in the MLSCN. By capturing feature-based information in those datasets, such predictive models would be useful in evaluating cell-based screening results in general (for example from reporter assays) and could be used as an aid to identify and eliminate potentially undesired compounds. Specifically we present the results of random forest ensemble models developed using different cell proliferation datasets and highlight protocols to take into account their extremely imbalanced nature. Depending on the nature of the datasets and the descriptors employed we were able to achieve percentage correct classification rates between 70% and 85% on the prediction set, though the accuracy rate dropped significantly when the models were applied to in vivo data. In this context we also compare the MLSCN cell proliferation results with animal acute toxicity data to investigate to what extent animal toxicity can be correlated and potentially predicted by proliferation results. Finally, we present a visualization technique that allows one to compare a new dataset to the training set of the models to decide whether the new dataset may be reliably predicted.
Physical environment virtualization for human activities recognition
NASA Astrophysics Data System (ADS)
Poshtkar, Azin; Elangovan, Vinayak; Shirkhodaie, Amir; Chan, Alex; Hu, Shuowen
2015-05-01
Human activity recognition research relies heavily on extensive datasets to verify and validate performance of activity recognition algorithms. However, obtaining real datasets are expensive and highly time consuming. A physics-based virtual simulation can accelerate the development of context based human activity recognition algorithms and techniques by generating relevant training and testing videos simulating diverse operational scenarios. In this paper, we discuss in detail the requisite capabilities of a virtual environment to aid as a test bed for evaluating and enhancing activity recognition algorithms. To demonstrate the numerous advantages of virtual environment development, a newly developed virtual environment simulation modeling (VESM) environment is presented here to generate calibrated multisource imagery datasets suitable for development and testing of recognition algorithms for context-based human activities. The VESM environment serves as a versatile test bed to generate a vast amount of realistic data for training and testing of sensor processing algorithms. To demonstrate the effectiveness of VESM environment, we present various simulated scenarios and processed results to infer proper semantic annotations from the high fidelity imagery data for human-vehicle activity recognition under different operational contexts.
Creating a standardized watersheds database for the Lower Rio Grande/Río Bravo, Texas
Brown, J.R.; Ulery, Randy L.; Parcher, Jean W.
2000-01-01
This report describes the creation of a large-scale watershed database for the lower Rio Grande/Río Bravo Basin in Texas. The watershed database includes watersheds delineated to all 1:24,000-scale mapped stream confluences and other hydrologically significant points, selected watershed characteristics, and hydrologic derivative datasets.Computer technology allows generation of preliminary watershed boundaries in a fraction of the time needed for manual methods. This automated process reduces development time and results in quality improvements in watershed boundaries and characteristics. These data can then be compiled in a permanent database, eliminating the time-consuming step of data creation at the beginning of a project and providing a stable base dataset that can give users greater confidence when further subdividing watersheds.A standardized dataset of watershed characteristics is a valuable contribution to the understanding and management of natural resources. Vertical integration of the input datasets used to automatically generate watershed boundaries is crucial to the success of such an effort. The optimum situation would be to use the digital orthophoto quadrangles as the source of all the input datasets. While the hydrographic data from the digital line graphs can be revised to match the digital orthophoto quadrangles, hypsography data cannot be revised to match the digital orthophoto quadrangles. Revised hydrography from the digital orthophoto quadrangle should be used to create an updated digital elevation model that incorporates the stream channels as revised from the digital orthophoto quadrangle. Computer-generated, standardized watersheds that are vertically integrated with existing digital line graph hydrographic data will continue to be difficult to create until revisions can be made to existing source datasets. Until such time, manual editing will be necessary to make adjustments for man-made features and changes in the natural landscape that are not reflected in the digital elevation model data.
Creating a standardized watersheds database for the lower Rio Grande/Rio Bravo, Texas
Brown, Julie R.; Ulery, Randy L.; Parcher, Jean W.
2000-01-01
This report describes the creation of a large-scale watershed database for the lower Rio Grande/Rio Bravo Basin in Texas. The watershed database includes watersheds delineated to all 1:24,000-scale mapped stream confluences and other hydrologically significant points, selected watershed characteristics, and hydrologic derivative datasets. Computer technology allows generation of preliminary watershed boundaries in a fraction of the time needed for manual methods. This automated process reduces development time and results in quality improvements in watershed boundaries and characteristics. These data can then be compiled in a permanent database, eliminating the time-consuming step of data creation at the beginning of a project and providing a stable base dataset that can give users greater confidence when further subdividing watersheds. A standardized dataset of watershed characteristics is a valuable contribution to the understanding and management of natural resources. Vertical integration of the input datasets used to automatically generate watershed boundaries is crucial to the success of such an effort. The optimum situation would be to use the digital orthophoto quadrangles as the source of all the input datasets. While the hydrographic data from the digital line graphs can be revised to match the digital orthophoto quadrangles, hypsography data cannot be revised to match the digital orthophoto quadrangles. Revised hydrography from the digital orthophoto quadrangle should be used to create an updated digital elevation model that incorporates the stream channels as revised from the digital orthophoto quadrangle. Computer-generated, standardized watersheds that are vertically integrated with existing digital line graph hydrographic data will continue to be difficult to create until revisions can be made to existing source datasets. Until such time, manual editing will be necessary to make adjustments for man-made features and changes in the natural landscape that are not reflected in the digital elevation model data.
Community Detection in Complex Networks via Clique Conductance.
Lu, Zhenqi; Wahlström, Johan; Nehorai, Arye
2018-04-13
Network science plays a central role in understanding and modeling complex systems in many areas including physics, sociology, biology, computer science, economics, politics, and neuroscience. One of the most important features of networks is community structure, i.e., clustering of nodes that are locally densely interconnected. Communities reveal the hierarchical organization of nodes, and detecting communities is of great importance in the study of complex systems. Most existing community-detection methods consider low-order connection patterns at the level of individual links. But high-order connection patterns, at the level of small subnetworks, are generally not considered. In this paper, we develop a novel community-detection method based on cliques, i.e., local complete subnetworks. The proposed method overcomes the deficiencies of previous similar community-detection methods by considering the mathematical properties of cliques. We apply the proposed method to computer-generated graphs and real-world network datasets. When applied to networks with known community structure, the proposed method detects the structure with high fidelity and sensitivity. When applied to networks with no a priori information regarding community structure, the proposed method yields insightful results revealing the organization of these complex networks. We also show that the proposed method is guaranteed to detect near-optimal clusters in the bipartition case.
NASA Astrophysics Data System (ADS)
Reyers, Mark; Moemken, Julia; Pinto, Joaquim; Feldmann, Hendrik; Kottmeier, Christoph; MiKlip Module-C Team
2017-04-01
Decadal climate predictions can provide a useful basis for decision making support systems for the public and private sectors. Several generations of decadal hindcasts and predictions have been generated throughout the German research program MiKlip. Together with the global climate predictions computed with MPI-ESM, the regional climate model (RCM) COSMO-CLM is used for regional downscaling by MiKlip Module-C. The RCMs provide climate information on spatial and temporal scales closer to the needs of potential users. In this study, two downscaled hindcast generations are analysed (named b0 and b1). The respective global generations are both initialized by nudging them towards different reanalysis anomaly fields. An ensemble of five starting years (1961, 1971, 1981, 1991, and 2001), each comprising ten ensemble members, is used for both generations in order to quantify the regional decadal prediction skill for precipitation and near-surface temperature and wind speed over Europe. All datasets (including hindcasts, observations, reanalysis, and historical MPI-ESM runs) are pre-processed in an analogue manner by (i) removing the long-term trend and (ii) re-gridding to a common grid. Our analysis shows that there is potential for skillful decadal predictions over Europe in the regional MiKlip ensemble, but the skill is not systematic and depends on the PRUDENCE region and the variable. Further, the differences between the two hindcast generations are mostly small. As we used detrended time series, the predictive skill found in our study can probably attributed to reasonable predictions of anomalies which are associated with the natural climate variability. In a sensitivity study, it is shown that the results may strongly change when the long-term trend is kept in the datasets, as here the skill of predicting the long-term trend (e.g. for temperature) also plays a major role. The regionalization of the global ensemble provides an added value for decadal predictions for some complex regions like the Mediterranean and Iberian Peninsula, while for other regions no systematic improvement is found. A clear dependence of the performance of the regional MiKlip system on the ensemble size is detected. For all variables in both hindcast generations, the skill increases when the ensemble is enlarged. The results indicate that a number of ten members is an appropriate ensemble size for decadal predictions over Europe.
NASA Astrophysics Data System (ADS)
Evans, Ben; Wyborn, Lesley; Druken, Kelsey; Richards, Clare; Trenham, Claire; Wang, Jingbo; Rozas Larraondo, Pablo; Steer, Adam; Smillie, Jon
2017-04-01
The National Computational Infrastructure (NCI) facility hosts one of Australia's largest repositories (10+ PBytes) of research data collections spanning datasets from climate, coasts, oceans, and geophysics through to astronomy, bioinformatics, and the social sciences domains. The data are obtained from national and international sources, spanning a wide range of gridded and ungridded (i.e., line surveys, point clouds) data, and raster imagery, as well as diverse coordinate reference projections and resolutions. Rather than managing these data assets as a digital library, whereby users can discover and download files to personal servers (similar to borrowing 'books' from a 'library'), NCI has built an extensive and well-integrated research data platform, the National Environmental Research Data Interoperability Platform (NERDIP, http://nci.org.au/data-collections/nerdip/). The NERDIP architecture enables programmatic access to data via standards-compliant services for high performance data analysis, and provides a flexible cloud-based environment to facilitate the next generation of transdisciplinary scientific research across all data domains. To improve use of modern scalable data infrastructures that are focused on efficient data analysis, the data organisation needs to be carefully managed including performance evaluations of projections and coordinate systems, data encoding standards and formats. A complication is that we have often found multiple domain vocabularies and ontologies are associated with equivalent datasets. It is not practical for individual dataset managers to determine which standards are best to apply to their dataset as this could impact accessibility and interoperability. Instead, they need to work with data custodians across interrelated communities and, in partnership with the data repository, the international scientific community to determine the most useful approach. For the data repository, this approach is essential to enable different disciplines and research communities to invoke new forms of analysis and discovery in an increasingly complex data-rich environment. Driven by the heterogeneity of Earth and environmental datasets, NCI developed a Data Quality/Data Assurance Strategy to ensure consistency is maintained within and across all datasets, as well as functionality testing to ensure smooth interoperability between products, tools, and services. This is particularly so for collections that contain data generated from multiple data acquisition campaigns, often using instruments and models that have evolved over time. By implementing the NCI Data Quality Strategy we have seen progressive improvement in the integration and quality of the datasets across the different subject domains, and through this, the ease by which the users can access data from this major data infrastructure. By both adhering to international standards and also contributing to extensions of these standards, data from the NCI NERDIP platform can be federated with data from other globally distributed data repositories and infrastructures. The NCI approach builds on our experience working with the astronomy and climate science communities, which have been internationally coordinating such interoperability standards within their disciplines for some years. The results of our work so far demonstrate more could be done in the Earth science, solid earth and environmental communities, particularly through establishing better linkages between international/national community efforts such as EPOS, ENVRIplus, EarthCube, AuScope and the Research Data Alliance.
Chung, Dongjun; Kim, Hang J; Zhao, Hongyu
2017-02-01
Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with hundreds of phenotypes and diseases, which have provided clinical and medical benefits to patients with novel biomarkers and therapeutic targets. However, identification of risk variants associated with complex diseases remains challenging as they are often affected by many genetic variants with small or moderate effects. There has been accumulating evidence suggesting that different complex traits share common risk basis, namely pleiotropy. Recently, several statistical methods have been developed to improve statistical power to identify risk variants for complex traits through a joint analysis of multiple GWAS datasets by leveraging pleiotropy. While these methods were shown to improve statistical power for association mapping compared to separate analyses, they are still limited in the number of phenotypes that can be integrated. In order to address this challenge, in this paper, we propose a novel statistical framework, graph-GPA, to integrate a large number of GWAS datasets for multiple phenotypes using a hidden Markov random field approach. Application of graph-GPA to a joint analysis of GWAS datasets for 12 phenotypes shows that graph-GPA improves statistical power to identify risk variants compared to statistical methods based on smaller number of GWAS datasets. In addition, graph-GPA also promotes better understanding of genetic mechanisms shared among phenotypes, which can potentially be useful for the development of improved diagnosis and therapeutics. The R implementation of graph-GPA is currently available at https://dongjunchung.github.io/GGPA/.
A case for user-generated sensor metadata
NASA Astrophysics Data System (ADS)
Nüst, Daniel
2015-04-01
Cheap and easy to use sensing technology and new developments in ICT towards a global network of sensors and actuators promise previously unthought of changes for our understanding of the environment. Large professional as well as amateur sensor networks exist, and they are used for specific yet diverse applications across domains such as hydrology, meteorology or early warning systems. However the impact this "abundance of sensors" had so far is somewhat disappointing. There is a gap between (community-driven) sensor networks that could provide very useful data and the users of the data. In our presentation, we argue this is due to a lack of metadata which allows determining the fitness of use of a dataset. Syntactic or semantic interoperability for sensor webs have made great progress and continue to be an active field of research, yet they often are quite complex, which is of course due to the complexity of the problem at hand. But still, we see the most generic information to determine fitness for use is a dataset's provenance, because it allows users to make up their own minds independently from existing classification schemes for data quality. In this work we will make the case how curated user-contributed metadata has the potential to improve this situation. This especially applies for scenarios in which an observed property is applicable in different domains, and for set-ups where the understanding about metadata concepts and (meta-)data quality differs between data provider and user. On the one hand a citizen does not understand the ISO provenance metadata. On the other hand a researcher might find issues in publicly accessible time series published by citizens, which the latter might not be aware of or care about. Because users will have to determine fitness for use for each application on their own anyway, we suggest an online collaboration platform for user-generated metadata based on an extremely simplified data model. In the most basic fashion, metadata generated by users can be boiled down to a basic property of the world wide web: many information items, such as news or blog posts, allow users to create comments and rate the content. Therefore we argue to focus a core data model on one text field for a textual comment, one optional numerical field for a rating, and a resolvable identifier for the dataset that is commented on. We present a conceptual framework that integrates user comments in existing standards and relevant applications of online sensor networks and discuss possible approaches, such as linked data, brokering, or standalone metadata portals. We relate this framework to existing work in user generated content, such as proprietary rating systems on commercial websites, microformats, the GeoViQua User Quality Model, the CHARMe annotations, or W3C Open Annotation. These systems are also explored for commonalities and based on their very useful concepts and ideas; we present an outline for future extensions of the minimal model. Building on this framework we present a concept how a simplistic comment-rating-system can be extended to capture provenance information for spatio-temporal observations in the sensor web, and how this framework can be evaluated.
Pooled assembly of marine metagenomic datasets: enriching annotation through chimerism.
Magasin, Jonathan D; Gerloff, Dietlind L
2015-02-01
Despite advances in high-throughput sequencing, marine metagenomic samples remain largely opaque. A typical sample contains billions of microbial organisms from thousands of genomes and quadrillions of DNA base pairs. Its derived metagenomic dataset underrepresents this complexity by orders of magnitude because of the sparseness and shortness of sequencing reads. Read shortness and sequencing errors pose a major challenge to accurate species and functional annotation. This includes distinguishing known from novel species. Often the majority of reads cannot be annotated and thus cannot help our interpretation of the sample. Here, we demonstrate quantitatively how careful assembly of marine metagenomic reads within, but also across, datasets can alleviate this problem. For 10 simulated datasets, each with species complexity modeled on a real counterpart, chimerism remained within the same species for most contigs (97%). For 42 real pyrosequencing ('454') datasets, assembly increased the proportion of annotated reads, and even more so when datasets were pooled, by on average 1.6% (max 6.6%) for species, 9.0% (max 28.7%) for Pfam protein domains and 9.4% (max 22.9%) for PANTHER gene families. Our results outline exciting prospects for data sharing in the metagenomics community. While chimeric sequences should be avoided in other areas of metagenomics (e.g. biodiversity analyses), conservative pooled assembly is advantageous for annotation specificity and sensitivity. Intriguingly, our experiment also found potential prospects for (low-cost) discovery of new species in 'old' data. dgerloff@ffame.org Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
ERIC Educational Resources Information Center
Barthur, Ashrith
2016-01-01
There are two essential goals of this research. The first goal is to design and construct a computational environment that is used for studying large and complex datasets in the cybersecurity domain. The second goal is to analyse the Spamhaus blacklist query dataset which includes uncovering the properties of blacklisted hosts and understanding…
Wiegand, Sandra; Dietrich, Sascha; Hertel, Robert; Bongaerts, Johannes; Evers, Stefan; Volland, Sonja; Daniel, Rolf; Liesegang, Heiko
2013-10-01
The production of enzymes by an industrial strain requires a complex adaption of the bacterial metabolism to the conditions within the fermenter. Regulatory events within the process result in a dynamic change of the transcriptional activity of the genome. This complex network of genes is orchestrated by proteins as well as regulatory RNA elements. Here we present an RNA-Seq based study considering selected phases of an industry-oriented fermentation of Bacillus licheniformis. A detailed analysis of 20 strand-specific RNA-Seq datasets revealed a multitude of transcriptionally active genomic regions. 3314 RNA features encoded by such active loci have been identified and sorted into ten functional classes. The identified sequences include the expected RNA features like housekeeping sRNAs, metabolic riboswitches and RNA switches well known from studies on Bacillus subtilis as well as a multitude of completely new candidates for regulatory RNAs. An unexpectedly high number of 855 RNA features are encoded antisense to annotated protein and RNA genes, in addition to 461 independently transcribed small RNAs. These antisense transcripts contain molecules with a remarkable size range variation from 38 to 6348 base pairs in length. The genome of the type strain B. licheniformis DSM13 was completely reannotated using data obtained from RNA-Seq analyses and from public databases. The hereby generated data-sets represent a solid amount of knowledge on the dynamic transcriptional activities during the investigated fermentation stages. The identified regulatory elements enable research on the understanding and the optimization of crucial metabolic activities during a productive fermentation of Bacillus licheniformis strains.
Scalable and Interactive Segmentation and Visualization of Neural Processes in EM Datasets
Jeong, Won-Ki; Beyer, Johanna; Hadwiger, Markus; Vazquez, Amelio; Pfister, Hanspeter; Whitaker, Ross T.
2011-01-01
Recent advances in scanning technology provide high resolution EM (Electron Microscopy) datasets that allow neuroscientists to reconstruct complex neural connections in a nervous system. However, due to the enormous size and complexity of the resulting data, segmentation and visualization of neural processes in EM data is usually a difficult and very time-consuming task. In this paper, we present NeuroTrace, a novel EM volume segmentation and visualization system that consists of two parts: a semi-automatic multiphase level set segmentation with 3D tracking for reconstruction of neural processes, and a specialized volume rendering approach for visualization of EM volumes. It employs view-dependent on-demand filtering and evaluation of a local histogram edge metric, as well as on-the-fly interpolation and ray-casting of implicit surfaces for segmented neural structures. Both methods are implemented on the GPU for interactive performance. NeuroTrace is designed to be scalable to large datasets and data-parallel hardware architectures. A comparison of NeuroTrace with a commonly used manual EM segmentation tool shows that our interactive workflow is faster and easier to use for the reconstruction of complex neural processes. PMID:19834227
Villarreal A, Juan Carlos; Crandall-Stotler, Barbara J; Hart, Michelle L; Long, David G; Forrest, Laura L
2016-03-01
We present a complete generic-level phylogeny of the complex thalloid liverworts, a lineage that includes the model system Marchantia polymorpha. The complex thalloids are remarkable for their slow rate of molecular evolution and for being the only extant plant lineage to differentiate gas exchange tissues in the gametophyte generation. We estimated the divergence times and analyzed the evolutionary trends of morphological traits, including air chambers, rhizoids and specialized reproductive structures. A multilocus dataset was analyzed using maximum likelihood and Bayesian approaches. Relative rates were estimated using local clocks. Our phylogeny cements the early branching in complex thalloids. Marchantia is supported in one of the earliest divergent lineages. The rate of evolution in organellar loci is slower than for other liverwort lineages, except for two annual lineages. Most genera diverged in the Cretaceous. Marchantia polymorpha diversified in the Late Miocene, giving a minimum age estimate for the evolution of its sex chromosomes. The complex thalloid ancestor, excluding Blasiales, is reconstructed as a plant with a carpocephalum, with filament-less air chambers opening via compound pores, and without pegged rhizoids. Our comprehensive study of the group provides a temporal framework for the analysis of the evolution of critical traits essential for plants during land colonization. © 2015 Royal Botanic Garden Edinburgh. New Phytologist © 2015 New Phytologist Trust.
Zhou, Li-Wei; Cao, Yun; Wu, Sheng-Hua; Vlasák, Josef; Li, De-Wei; Li, Meng-Jie; Dai, Yu-Cheng
2015-06-01
Species of the Ganoderma lucidum complex are used in many types of health products. However, the taxonomy of this complex has long been chaotic, thus limiting its uses. In the present study, 32 collections of the complex from Asia, Europe and North America were analyzed from both morphological and molecular phylogenetic perspectives. The combined dataset, including an outgroup, comprised 33 ITS, 24 tef1α, 24 rpb1 and 21 rpb2 sequences, of which 19 ITS, 20 tef1α, 20 rpb1 and 17 rpb2 sequences were newly generated. A total of 13 species of the complex were recovered in the multilocus phylogeny. These 13 species were not strongly supported as a single monophyletic lineage, and were further grouped into three lineages that cannot be defined by their geographic distributions. Clade A comprised Ganoderma curtisii, Ganoderma flexipes, Ganoderma lingzhi, Ganoderma multipileum, Ganoderma resinaceum, Ganoderma sessile, Ganoderma sichuanense and Ganoderma tropicum, Clade B comprised G. lucidum, Ganoderma oregonense and Ganoderma tsugae, and Clade C comprised Ganoderma boninense and Ganoderma zonatum. A dichotomous key to the 13 species is provided, and their key morphological characters from context, pores, cuticle cells and basidiospores are presented in a table. The taxonomic positions of these species are briefly discussed. Noteworthy, the epitypification of G. sichuanense is rejected. Copyright © 2014 Elsevier Ltd. All rights reserved.
MatchingLand, geospatial data testbed for the assessment of matching methods.
Xavier, Emerson M A; Ariza-López, Francisco J; Ureña-Cámara, Manuel A
2017-12-05
This article presents datasets prepared with the aim of helping the evaluation of geospatial matching methods for vector data. These datasets were built up from mapping data produced by official Spanish mapping agencies. The testbed supplied encompasses the three geometry types: point, line and area. Initial datasets were submitted to geometric transformations in order to generate synthetic datasets. These transformations represent factors that might influence the performance of geospatial matching methods, like the morphology of linear or areal features, systematic transformations, and random disturbance over initial data. We call our 11 GiB benchmark data 'MatchingLand' and we hope it can be useful for the geographic information science research community.
Hurley, Daniel; Araki, Hiromitsu; Tamada, Yoshinori; Dunmore, Ben; Sanders, Deborah; Humphreys, Sally; Affara, Muna; Imoto, Seiya; Yasuda, Kaori; Tomiyasu, Yuki; Tashiro, Kosuke; Savoie, Christopher; Cho, Vicky; Smith, Stephen; Kuhara, Satoru; Miyano, Satoru; Charnock-Jones, D. Stephen; Crampin, Edmund J.; Print, Cristin G.
2012-01-01
Gene regulatory networks inferred from RNA abundance data have generated significant interest, but despite this, gene network approaches are used infrequently and often require input from bioinformaticians. We have assembled a suite of tools for analysing regulatory networks, and we illustrate their use with microarray datasets generated in human endothelial cells. We infer a range of regulatory networks, and based on this analysis discuss the strengths and limitations of network inference from RNA abundance data. We welcome contact from researchers interested in using our inference and visualization tools to answer biological questions. PMID:22121215
Non-negative Tensor Factorization for Robust Exploratory Big-Data Analytics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Alexandrov, Boian; Vesselinov, Velimir Valentinov; Djidjev, Hristo Nikolov
Currently, large multidimensional datasets are being accumulated in almost every field. Data are: (1) collected by distributed sensor networks in real-time all over the globe, (2) produced by large-scale experimental measurements or engineering activities, (3) generated by high-performance simulations, and (4) gathered by electronic communications and socialnetwork activities, etc. Simultaneous analysis of these ultra-large heterogeneous multidimensional datasets is often critical for scientific discoveries, decision-making, emergency response, and national and global security. The importance of such analyses mandates the development of the next-generation of robust machine learning (ML) methods and tools for bigdata exploratory analysis.
Analyzing How We Do Analysis and Consume Data, Results from the SciDAC-Data Project
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ding, P.; Aliaga, L.; Mubarak, M.
One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data deliverymore » is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption« less
Analyzing how we do Analysis and Consume Data, Results from the SciDAC-Data Project
NASA Astrophysics Data System (ADS)
Ding, P.; Aliaga, L.; Mubarak, M.; Tsaris, A.; Norman, A.; Lyon, A.; Ross, R.
2017-10-01
One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.
DNAism: exploring genomic datasets on the web with Horizon Charts.
Rio Deiros, David; Gibbs, Richard A; Rogers, Jeffrey
2016-01-27
Computational biologists daily face the need to explore massive amounts of genomic data. New visualization techniques can help researchers navigate and understand these big data. Horizon Charts are a relatively new visualization method that, under the right circumstances, maximizes data density without losing graphical perception. Horizon Charts have been successfully applied to understand multi-metric time series data. We have adapted an existing JavaScript library (Cubism) that implements Horizon Charts for the time series domain so that it works effectively with genomic datasets. We call this new library DNAism. Horizon Charts can be an effective visual tool to explore complex and large genomic datasets. Researchers can use our library to leverage these techniques to extract additional insights from their own datasets.
Quantitative workflow based on NN for weighting criteria in landfill suitability mapping
NASA Astrophysics Data System (ADS)
Abujayyab, Sohaib K. M.; Ahamad, Mohd Sanusi S.; Yahya, Ahmad Shukri; Ahmad, Siti Zubaidah; Alkhasawneh, Mutasem Sh.; Aziz, Hamidi Abdul
2017-10-01
Our study aims to introduce a new quantitative workflow that integrates neural networks (NNs) and multi criteria decision analysis (MCDA). Existing MCDA workflows reveal a number of drawbacks, because of the reliance on human knowledge in the weighting stage. Thus, new workflow presented to form suitability maps at the regional scale for solid waste planning based on NNs. A feed-forward neural network employed in the workflow. A total of 34 criteria were pre-processed to establish the input dataset for NN modelling. The final learned network used to acquire the weights of the criteria. Accuracies of 95.2% and 93.2% achieved for the training dataset and testing dataset, respectively. The workflow was found to be capable of reducing human interference to generate highly reliable maps. The proposed workflow reveals the applicability of NN in generating landfill suitability maps and the feasibility of integrating them with existing MCDA workflows.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Compo, Gilbert P
As an important step toward a coupled data assimilation system for generating reanalysis fields needed to assess climate model projections, the Ocean Atmosphere Coupled Reanalysis for Climate Applications (OARCA) project assesses and improves the longest reanalyses currently available of the atmosphere and ocean: the 20th Century Reanalysis Project (20CR) and the Simple Ocean Data Assimilation with sparse observational input (SODAsi) system, respectively. In this project, we make off-line but coordinated improvements in the 20CR and SODAsi datasets, with improvements in one feeding into improvements of the other through an iterative generation of new versions. These datasets now span from themore » 19th to 21st centuries. We then study the extreme weather and variability from days to decades of the resulting datasets. A total of 24 publications have been produced in this project.« less
Self-organizing maps: a versatile tool for the automatic analysis of untargeted imaging datasets.
Franceschi, Pietro; Wehrens, Ron
2014-04-01
MS-based imaging approaches allow for location-specific identification of chemical components in biological samples, opening up possibilities of much more detailed understanding of biological processes and mechanisms. Data analysis, however, is challenging, mainly because of the sheer size of such datasets. This article presents a novel approach based on self-organizing maps, extending previous work in order to be able to handle the large number of variables present in high-resolution mass spectra. The key idea is to generate prototype images, representing spatial distributions of ions, rather than prototypical mass spectra. This allows for a two-stage approach, first generating typical spatial distributions and associated m/z bins, and later analyzing the interesting bins in more detail using accurate masses. The possibilities and advantages of the new approach are illustrated on an in-house dataset of apple slices. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Monroe, J Grey; Allen, Zachariah A; Tanger, Paul; Mullen, Jack L; Lovell, John T; Moyers, Brook T; Whitley, Darrell; McKay, John K
2017-01-01
Recent advances in nucleic acid sequencing technologies have led to a dramatic increase in the number of markers available to generate genetic linkage maps. This increased marker density can be used to improve genome assemblies as well as add much needed resolution for loci controlling variation in ecologically and agriculturally important traits. However, traditional genetic map construction methods from these large marker datasets can be computationally prohibitive and highly error prone. We present TSPmap , a method which implements both approximate and exact Traveling Salesperson Problem solvers to generate linkage maps. We demonstrate that for datasets with large numbers of genomic markers (e.g. 10,000) and in multiple population types generated from inbred parents, TSPmap can rapidly produce high quality linkage maps with low sensitivity to missing and erroneous genotyping data compared to two other benchmark methods, JoinMap and MSTmap . TSPmap is open source and freely available as an R package. With the advancement of low cost sequencing technologies, the number of markers used in the generation of genetic maps is expected to continue to rise. TSPmap will be a useful tool to handle such large datasets into the future, quickly producing high quality maps using a large number of genomic markers.
The influence of climate change on Tanzania's hydropower sustainability
NASA Astrophysics Data System (ADS)
Sperna Weiland, Frederiek; Boehlert, Brent; Meijer, Karen; Schellekens, Jaap; Magnell, Jan-Petter; Helbrink, Jakob; Kassana, Leonard; Liden, Rikard
2015-04-01
Economic costs induced by current climate variability are large for Tanzania and may further increase due to future climate change. The Tanzanian National Climate Change Strategy addressed the need for stabilization of hydropower generation and strengthening of water resources management. Increased hydropower generation can contribute to sustainable use of energy resources and stabilization of the national electricity grid. To support Tanzania the World Bank financed this study in which the impact of climate change on the water resources and related hydropower generation capacity of Tanzania is assessed. To this end an ensemble of 78 GCM projections from both the CMIP3 and CMIP5 datasets was bias-corrected and down-scaled to 0.5 degrees resolution following the BCSD technique using the Princeton Global Meteorological Forcing Dataset as a reference. To quantify the hydrological impacts of climate change by 2035 the global hydrological model PCR-GLOBWB was set-up for Tanzania at a resolution of 3 minutes and run with all 78 GCM datasets. From the full set of projections a probable (median) and worst case scenario (95th percentile) were selected based upon (1) the country average Climate Moisture Index and (2) discharge statistics of relevance to hydropower generation. Although precipitation from the Princeton dataset shows deviations from local station measurements and the global hydrological model does not perfectly reproduce local scale hydrographs, the main discharge characteristics and precipitation patterns are represented well. The modeled natural river flows were adjusted for water demand and irrigation within the water resources model RIBASIM (both historical values and future scenarios). Potential hydropower capacity was assessed with the power market simulation model PoMo-C that considers both reservoir inflows obtained from RIBASIM and overall electricity generation costs. Results of the study show that climate change is unlikely to negatively affect the average potential of future hydropower production; it will likely make hydropower more profitable. Yet, the uncertainty in climate change projections remains large and risks are significant, adaptation strategies should ideally consider a worst case scenario to ensure robust power generation. Overall a diversified power generation portfolio, anchored in hydropower and supported by other renewables and fossil fuel-based energy sources, is the best solution for Tanzania
The Harvard organic photovoltaic dataset
Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; ...
2016-09-27
Presented in this work is the Harvard Organic Photovoltaic Dataset (HOPV15), a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.
The Harvard organic photovoltaic dataset
Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán
2016-01-01
The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312
The Harvard organic photovoltaic dataset.
Lopez, Steven A; Pyzer-Knapp, Edward O; Simm, Gregor N; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R; Hachmann, Johannes; Aspuru-Guzik, Alán
2016-09-27
The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.
NASA Astrophysics Data System (ADS)
Oyler, J.; Anderson, R.; Running, S. W.
2010-12-01
In topographically complex landscapes, there is often a mismatch in scale between global climate model projections and more local climate-forcing factors and related ecological/hydrological processes. To overcome this limitation, the objective of this study was to downscale climate projections to the rugged Crown of the Continent Ecosystem (CCE) within the U.S. Northern Rockies and assess future impacts on water balances, vegetation dynamics, and carbon fluxes. A 40-year (1970-2009) spatial historical climate dataset (800m resolution, daily timestep) was generated for the CCE and modified for terrain influences. Regional climate projections were downscaled by applying them to the fine-scale historical dataset using a modified delta downscaling method and stochastic weather generator. The downscaled projections were used to drive the Biome-BGC ecosystem model. Overall CCE impacts included decreases in April 1 snow water equivalent, less days with snow on the ground, increased vegetation water stress, and increased growing degree days. The relaxing of temperature constraints increased annual net primary productivity (NPP) throughout most of the CCE landscape. However, an increase in water stress seems to have limited the growth in NPP and, in some areas, NPP actually decreased. Thus, CCE vegetation productivity trends under increasing temperatures will likely be determined by local changes in hydrologic function. Given the greater uncertainty in precipitation projections, future work should concentrate on determining thresholds in water constraints that greatly modify the magnitude and direction of carbon accumulation within the CCE under a warming climate.
NASA Astrophysics Data System (ADS)
Das, I.; Oberai, K.; Sarathi Roy, P.
2012-07-01
Landslides exhibit themselves in different mass movement processes and are considered among the most complex natural hazards occurring on the earth surface. Making landslide database available online via WWW (World Wide Web) promotes the spreading and reaching out of the landslide information to all the stakeholders. The aim of this research is to present a comprehensive database for generating landslide hazard scenario with the help of available historic records of landslides and geo-environmental factors and make them available over the Web using geospatial Free & Open Source Software (FOSS). FOSS reduces the cost of the project drastically as proprietary software's are very costly. Landslide data generated for the period 1982 to 2009 were compiled along the national highway road corridor in Indian Himalayas. All the geo-environmental datasets along with the landslide susceptibility map were served through WEBGIS client interface. Open source University of Minnesota (UMN) mapserver was used as GIS server software for developing web enabled landslide geospatial database. PHP/Mapscript server-side application serve as a front-end application and PostgreSQL with PostGIS extension serve as a backend application for the web enabled landslide spatio-temporal databases. This dynamic virtual visualization process through a web platform brings an insight into the understanding of the landslides and the resulting damage closer to the affected people and user community. The landslide susceptibility dataset is also made available as an Open Geospatial Consortium (OGC) Web Feature Service (WFS) which can be accessed through any OGC compliant open source or proprietary GIS Software.
Reference datasets for 2-treatment, 2-sequence, 2-period bioequivalence studies.
Schütz, Helmut; Labes, Detlew; Fuglsang, Anders
2014-11-01
It is difficult to validate statistical software used to assess bioequivalence since very few datasets with known results are in the public domain, and the few that are published are of moderate size and balanced. The purpose of this paper is therefore to introduce reference datasets of varying complexity in terms of dataset size and characteristics (balance, range, outlier presence, residual error distribution) for 2-treatment, 2-period, 2-sequence bioequivalence studies and to report their point estimates and 90% confidence intervals which companies can use to validate their installations. The results for these datasets were calculated using the commercial packages EquivTest, Kinetica, SAS and WinNonlin, and the non-commercial package R. The results of three of these packages mostly agree, but imbalance between sequences seems to provoke questionable results with one package, which illustrates well the need for proper software validation.
Nawrotzki, Raphael J.; Jiang, Leiwen
2015-01-01
Although data for the total number of international migrant flows is now available, no global dataset concerning demographic characteristics, such as the age and gender composition of migrant flows exists. This paper reports on the methods used to generate the CDM-IM dataset of age and gender specific profiles of bilateral net (not gross) migrant flows. We employ raw data from the United Nations Global Migration Database and estimate net migrant flows by age and gender between two time points around the year 2000, accounting for various demographic processes (fertility, mortality). The dataset contains information on 3,713 net migrant flows. Validation analyses against existing data sets and the historical, geopolitical context demonstrate that the CDM-IM dataset is of reasonably high quality. PMID:26692590
Detection of Wildfires with Artificial Neural Networks
NASA Astrophysics Data System (ADS)
Umphlett, B.; Leeman, J.; Morrissey, M. L.
2011-12-01
Currently fire detection for the National Oceanic and Atmospheric Administration (NOAA) using satellite data is accomplished with algorithms and error checking human analysts. Artificial neural networks (ANNs) have been shown to be more accurate than algorithms or statistical methods for applications dealing with multiple datasets of complex observed data in the natural sciences. ANNs also deal well with multiple data sources that are not all equally reliable or equally informative to the problem. An ANN was tested to evaluate its accuracy in detecting wildfires utilizing polar orbiter numerical data from the Advanced Very High Resolution Radiometer (AVHRR). Datasets containing locations of known fires were gathered from the NOAA's polar orbiting satellites via the Comprehensive Large Array-data Stewardship System (CLASS). The data was then calibrated and navigation corrected using the Environment for Visualizing Images (ENVI). Fires were located with the aid of shapefiles generated via ArcGIS. Afterwards, several smaller ten pixel by ten pixel datasets were created for each fire (using the ENVI corrected data). Several datasets were created for each fire in order to vary fire position and avoid training the ANN to look only at fires in the center of an image. Datasets containing no fires were also created. A basic pattern recognition neural network was established with the MATLAB neural network toolbox. The datasets were then randomly separated into categories used to train, validate, and test the ANN. To prevent over fitting of the data, the mean squared error (MSE) of the network was monitored and training was stopped when the MSE began to rise. Networks were tested using each channel of the AVHRR data independently, channels 3a and 3b combined, and all six channels. The number of hidden neurons for each input set was also varied between 5-350 in steps of 5 neurons. Each configuration was run 10 times, totaling about 4,200 individual network evaluations. Thirty network parameters were recorded to characterize performance. These parameters were plotted with various data display techniques to determine which network configuration was not only most accurate in fire classification, but also the most computationally efficient. The most accurate fire classification network used all six channels of AVHRR data to achieve an accuracy ranging from 73-90%.
a Critical Review of Automated Photogrammetric Processing of Large Datasets
NASA Astrophysics Data System (ADS)
Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.
2017-08-01
The paper reports some comparisons between commercial software able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of image of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a diverse number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of (photogrammetric) terms is also provided, in order to provide rigorous terms of reference for comparisons and critical analyses.
A formal concept analysis approach to consensus clustering of multi-experiment expression data
2014-01-01
Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from multi-experiment study examining the global cell-cycle control of fission yeast. The FCA results derived from both methods demonstrate that, although both algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals. Conclusions The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce good quality clustering solution that is representative for the whole set of expression matrices. PMID:24885407
DOE Office of Scientific and Technical Information (OSTI.GOV)
Omenn, Gilbert; States, David J.; Adamski, Marcin
2005-08-13
HUPO initiated the Plasma Proteome Project (PPP) in 2002. Its pilot phase has (1) evaluated advantages and limitations of many depletion, fractionation, and MS technology platforms; (2) compared PPP reference specimens of human serum and EDTA, heparin, and citrate-anticoagulated plasma; and (3) created a publicly-available knowledge base (www.bioinformatics. med.umich.edu/hupo/ppp; www.ebi.ac.uk/pride). Thirty-five participating laboratories in 13 countries submitted datasets. Working groups addressed (a) specimen stability and protein concentrations; (b) protein identifications from 18 MS/MS datasets; (c) independent analyses from raw MS-MS spectra; (d) search engine performance, subproteome analyses, and biological insights; (e) antibody arrays; and (f) direct MS/SELDI analyses. MS-MS datasetsmore » had 15 710 different International Protein Index (IPI) protein IDs; our integration algorithm applied to multiple matches of peptide sequences yielded 9504 IPI proteins identified with one or more peptides and 3020 proteins identified with two or more peptides (the Core Dataset). These proteins have been characterized with Gene Ontology, InterPro, Novartis Atlas, OMIM, and immunoassay based concentration determinations. The database permits examination of many other subsets, such as 1274 proteins identified with three or more peptides. Reverse protein to DNA matching identified proteins for 118 previously unidentified ORFs. We recommend use of plasma instead of serum, with EDTA (or citrate) for anticoagulation. To improve resolution, sensitivity and reproducibility of peptide identifications and protein matches, we recommend combinations of depletion, fractionation, and MS/MS technologies, with explicit criteria for evaluation of spectra, use of search algorithms, and integration of homologous protein matches. This Special Issue of PROTEOMICS presents papers integral to the collaborative analysis plus many reports of supplementary work on various aspects of the PPP workplan. These PPP results on complexity, dynamic range, incomplete sampling, false-positive matches, and integration of diverse datasets for plasma and serum proteins lay a foundation for development and validation of circulating protein biomarkers in health and disease.« less
NASA Astrophysics Data System (ADS)
Hess, A.; Davis, J. K.; Wimberly, M. C.
2017-12-01
Human West Nile virus (WNV) first arrived in the USA in 1999 and has since then spread across the country. Today, the highest incidence rates are found in the state of South Dakota. The disease occurrence depends on the complex interaction between the mosquito vector, the bird host and the dead-end human host. Understanding the spatial domain of this interaction and being able to identify disease transmission hotspots is crucial for effective disease prevention and mosquito control. In this study we use geospatial environmental information to understand what drives the spatial distribution of cases of human West Nile virus in South Dakota and to map relative infection risk across the state. To map the risk of human West Nile virus in South Dakota, we used geocoded human case data from the years 2004-2016. Satellite data from the Landsat ETM+ and MODIS for the years 2003 to 2016 were used to characterize environmental patterns. From these datasets we calculated indices, such as the normalized differenced vegetation index (NDVI) and the normalized differenced water index (NDWI). In addition, datasets such as the National Land Data Assimilation System (NLDAS), National Land Cover Dataset (NLCD), National Wetland inventory (NWI), National Elevation Dataset (NED) and Soil Survey Geographic Database (SSURGO) were utilized. Environmental variables were summarized for a buffer zone around the case and control points. We used a boosted regression tree model to identify the most important variables describing the risk of WNV infection. We generated a risk map by applying this model across the entire state. We found that the highest relative risk is present in the James River valley in northeastern South Dakota. Factors that were identified as influencing the transmission risk include inter-annual variability of vegetation cover, water availability and temperature. Land covers such as grasslands, low developed areas and wetlands were also found to be good predictors for human West Nile Virus risk. We suggest that the combination of satellite data derived land cover indices with other geospatial environmental datasets can help to create improved disease transmission risk analyses.
ConnectViz: Accelerated Approach for Brain Structural Connectivity Using Delaunay Triangulation.
Adeshina, A M; Hashim, R
2016-03-01
Stroke is a cardiovascular disease with high mortality and long-term disability in the world. Normal functioning of the brain is dependent on the adequate supply of oxygen and nutrients to the brain complex network through the blood vessels. Stroke, occasionally a hemorrhagic stroke, ischemia or other blood vessel dysfunctions can affect patients during a cerebrovascular incident. Structurally, the left and the right carotid arteries, and the right and the left vertebral arteries are responsible for supplying blood to the brain, scalp and the face. However, a number of impairment in the function of the frontal lobes may occur as a result of any decrease in the flow of the blood through one of the internal carotid arteries. Such impairment commonly results in numbness, weakness or paralysis. Recently, the concepts of brain's wiring representation, the connectome, was introduced. However, construction and visualization of such brain network requires tremendous computation. Consequently, previously proposed approaches have been identified with common problems of high memory consumption and slow execution. Furthermore, interactivity in the previously proposed frameworks for brain network is also an outstanding issue. This study proposes an accelerated approach for brain connectomic visualization based on graph theory paradigm using compute unified device architecture, extending the previously proposed SurLens Visualization and computer aided hepatocellular carcinoma frameworks. The accelerated brain structural connectivity framework was evaluated with stripped brain datasets from the Department of Surgery, University of North Carolina, Chapel Hill, USA. Significantly, our proposed framework is able to generate and extract points and edges of datasets, displays nodes and edges in the datasets in form of a network and clearly maps data volume to the corresponding brain surface. Moreover, with the framework, surfaces of the dataset were simultaneously displayed with the nodes and the edges. The framework is very efficient in providing greater interactivity as a way of representing the nodes and the edges intuitively, all achieved at a considerably interactive speed for instantaneous mapping of the datasets' features. Uniquely, the connectomic algorithm performed remarkably fast with normal hardware requirement specifications.
ConnectViz: Accelerated approach for brain structural connectivity using Delaunay triangulation.
Adeshina, A M; Hashim, R
2015-02-06
Stroke is a cardiovascular disease with high mortality and long-term disability in the world. Normal functioning of the brain is dependent on the adequate supply of oxygen and nutrients to the brain complex network through the blood vessels. Stroke, occasionally a hemorrhagic stroke, ischemia or other blood vessel dysfunctions can affect patients during a cerebrovascular incident. Structurally, the left and the right carotid arteries, and the right and the left vertebral arteries are responsible for supplying blood to the brain, scalp and the face. However, a number of impairment in the function of the frontal lobes may occur as a result of any decrease in the flow of the blood through one of the internal carotid arteries. Such impairment commonly results in numbness, weakness or paralysis. Recently, the concepts of brain's wiring representation, the connectome, was introduced. However, construction and visualization of such brain network requires tremendous computation. Consequently, previously proposed approaches have been identified with common problems of high memory consumption and slow execution. Furthermore, interactivity in the previously proposed frameworks for brain network is also an outstanding issue. This study proposes an accelerated approach for brain connectomic visualization based on graph theory paradigm using Compute Unified Device Architecture (CUDA), extending the previously proposed SurLens Visualization and Computer Aided Hepatocellular Carcinoma (CAHECA) frameworks. The accelerated brain structural connectivity framework was evaluated with stripped brain datasets from the Department of Surgery, University of North Carolina, Chapel Hill, United States. Significantly, our proposed framework is able to generates and extracts points and edges of datasets, displays nodes and edges in the datasets in form of a network and clearly maps data volume to the corresponding brain surface. Moreover, with the framework, surfaces of the dataset were simultaneously displayed with the nodes and the edges. The framework is very efficient in providing greater interactivity as a way of representing the nodes and the edges intuitively, all achieved at a considerably interactive speed for instantaneous mapping of the datasets' features. Uniquely, the connectomic algorithm performed remarkably fast with normal hardware requirement specifications.
de la Fuente, Gabriel; Belanche, Alejandro; Girwood, Susan E.; Pinloche, Eric; Wilkinson, Toby; Newbold, C. Jamie
2014-01-01
The development of next generation sequencing has challenged the use of other molecular fingerprinting methods used to study microbial diversity. We analysed the bacterial diversity in the rumen of defaunated sheep following the introduction of different protozoal populations, using both next generation sequencing (NGS: Ion Torrent PGM) and terminal restriction fragment length polymorphism (T-RFLP). Although absolute number differed, there was a high correlation between NGS and T-RFLP in terms of richness and diversity with R values of 0.836 and 0.781 for richness and Shannon-Wiener index, respectively. Dendrograms for both datasets were also highly correlated (Mantel test = 0.742). Eighteen OTUs and ten genera were significantly impacted by the addition of rumen protozoa, with an increase in the relative abundance of Prevotella, Bacteroides and Ruminobacter, related to an increase in free ammonia levels in the rumen. Our findings suggest that classic fingerprinting methods are still valuable tools to study microbial diversity and structure in complex environments but that NGS techniques now provide cost effect alternatives that provide a far greater level of information on the individual members of the microbial population. PMID:25051490
Network approaches for expert decisions in sports.
Glöckner, Andreas; Heinen, Thomas; Johnson, Joseph G; Raab, Markus
2012-04-01
This paper focuses on a model comparison to explain choices based on gaze behavior via simulation procedures. We tested two classes of models, a parallel constraint satisfaction (PCS) artificial neuronal network model and an accumulator model in a handball decision-making task from a lab experiment. Both models predict action in an option-generation task in which options can be chosen from the perspective of a playmaker in handball (i.e., passing to another player or shooting at the goal). Model simulations are based on a dataset of generated options together with gaze behavior measurements from 74 expert handball players for 22 pieces of video footage. We implemented both classes of models as deterministic vs. probabilistic models including and excluding fitted parameters. Results indicated that both classes of models can fit and predict participants' initially generated options based on gaze behavior data, and that overall, the classes of models performed about equally well. Early fixations were thereby particularly predictive for choices. We conclude that the analyses of complex environments via network approaches can be successfully applied to the field of experts' decision making in sports and provide perspectives for further theoretical developments. Copyright © 2011 Elsevier B.V. All rights reserved.
Wolff, Alexander; Bayerlová, Michaela; Gaedcke, Jochen; Kube, Dieter; Beißbarth, Tim
2018-01-01
Pipeline comparisons for gene expression data are highly valuable for applied real data analyses, as they enable the selection of suitable analysis strategies for the dataset at hand. Such pipelines for RNA-Seq data should include mapping of reads, counting and differential gene expression analysis or preprocessing, normalization and differential gene expression in case of microarray analysis, in order to give a global insight into pipeline performances. Four commonly used RNA-Seq pipelines (STAR/HTSeq-Count/edgeR, STAR/RSEM/edgeR, Sailfish/edgeR, TopHat2/Cufflinks/CuffDiff)) were investigated on multiple levels (alignment and counting) and cross-compared with the microarray counterpart on the level of gene expression and gene ontology enrichment. For these comparisons we generated two matched microarray and RNA-Seq datasets: Burkitt Lymphoma cell line data and rectal cancer patient data. The overall mapping rate of STAR was 98.98% for the cell line dataset and 98.49% for the patient dataset. Tophat's overall mapping rate was 97.02% and 96.73%, respectively, while Sailfish had only an overall mapping rate of 84.81% and 54.44%. The correlation of gene expression in microarray and RNA-Seq data was moderately worse for the patient dataset (ρ = 0.67-0.69) than for the cell line dataset (ρ = 0.87-0.88). An exception were the correlation results of Cufflinks, which were substantially lower (ρ = 0.21-0.29 and 0.34-0.53). For both datasets we identified very low numbers of differentially expressed genes using the microarray platform. For RNA-Seq we checked the agreement of differentially expressed genes identified in the different pipelines and of GO-term enrichment results. In conclusion the combination of STAR aligner with HTSeq-Count followed by STAR aligner with RSEM and Sailfish generated differentially expressed genes best suited for the dataset at hand and in agreement with most of the other transcriptomics pipelines.
Advanced functional network analysis in the geosciences: The pyunicorn package
NASA Astrophysics Data System (ADS)
Donges, Jonathan F.; Heitzig, Jobst; Runge, Jakob; Schultz, Hanna C. H.; Wiedermann, Marc; Zech, Alraune; Feldhoff, Jan; Rheinwalt, Aljoscha; Kutza, Hannes; Radebach, Alexander; Marwan, Norbert; Kurths, Jürgen
2013-04-01
Functional networks are a powerful tool for analyzing large geoscientific datasets such as global fields of climate time series originating from observations or model simulations. pyunicorn (pythonic unified complex network and recurrence analysis toolbox) is an open-source, fully object-oriented and easily parallelizable package written in the language Python. It allows for constructing functional networks (aka climate networks) representing the structure of statistical interrelationships in large datasets and, subsequently, investigating this structure using advanced methods of complex network theory such as measures for networks of interacting networks, node-weighted statistics or network surrogates. Additionally, pyunicorn allows to study the complex dynamics of geoscientific systems as recorded by time series by means of recurrence networks and visibility graphs. The range of possible applications of the package is outlined drawing on several examples from climatology.
2013-01-01
Background Hypodontus macropi is a common intestinal nematode of a range of kangaroos and wallabies (macropodid marsupials). Based on previous multilocus enzyme electrophoresis (MEE) and nuclear ribosomal DNA sequence data sets, H. macropi has been proposed to be complex of species. To test this proposal using independent molecular data, we sequenced the whole mitochondrial (mt) genomes of individuals of H. macropi from three different species of hosts (Macropus robustus robustus, Thylogale billardierii and Macropus [Wallabia] bicolor) as well as that of Macropicola ocydromi (a related nematode), and undertook a comparative analysis of the amino acid sequence datasets derived from these genomes. Results The mt genomes sequenced by next-generation (454) technology from H. macropi from the three host species varied from 13,634 bp to 13,699 bp in size. Pairwise comparisons of the amino acid sequences predicted from these three mt genomes revealed differences of 5.8% to 18%. Phylogenetic analysis of the amino acid sequence data sets using Bayesian Inference (BI) showed that H. macropi from the three different host species formed distinct, well-supported clades. In addition, sliding window analysis of the mt genomes defined variable regions for future population genetic studies of H. macropi in different macropodid hosts and geographical regions around Australia. Conclusions The present analyses of inferred mt protein sequence datasets clearly supported the hypothesis that H. macropi from M. robustus robustus, M. bicolor and T. billardierii represent distinct species. PMID:24261823
Genetic Characterization of Dog Personality Traits.
Ilska, Joanna; Haskell, Marie J; Blott, Sarah C; Sánchez-Molano, Enrique; Polgar, Zita; Lofgren, Sarah E; Clements, Dylan N; Wiener, Pamela
2017-06-01
The genetic architecture of behavioral traits in dogs is of great interest to owners, breeders, and professionals involved in animal welfare, as well as to scientists studying the genetics of animal (including human) behavior. The genetic component of dog behavior is supported by between-breed differences and some evidence of within-breed variation. However, it is a challenge to gather sufficiently large datasets to dissect the genetic basis of complex traits such as behavior, which are both time-consuming and logistically difficult to measure, and known to be influenced by nongenetic factors. In this study, we exploited the knowledge that owners have of their dogs to generate a large dataset of personality traits in Labrador Retrievers. While accounting for key environmental factors, we demonstrate that genetic variance can be detected for dog personality traits assessed using questionnaire data. We identified substantial genetic variance for several traits, including fetching tendency and fear of loud noises, while other traits revealed negligibly small heritabilities. Genetic correlations were also estimated between traits; however, due to fairly large SEs, only a handful of trait pairs yielded statistically significant estimates. Genomic analyses indicated that these traits are mainly polygenic, such that individual genomic regions have small effects, and suggested chromosomal associations for six of the traits. The polygenic nature of these traits is consistent with previous behavioral genetics studies in other species, for example in mouse, and confirms that large datasets are required to quantify the genetic variance and to identify the individual genes that influence behavioral traits. Copyright © 2017 by the Genetics Society of America.
NASA Astrophysics Data System (ADS)
Bonner, J.
2006-05-01
Differences in energy partitioning of seismic phases from earthquakes and explosions provide the opportunity for event identification. In this talk, I will briefly review teleseismic Ms:mb and P/S ratio techniques that help identify events based on differences in compressional, shear, and surface wave energy generation from explosions and earthquakes. With the push to identify smaller yield explosions, the identification process has become increasingly complex as varied types of explosions, including chemical, mining, and nuclear, must be identified at regional distances. Thus, I will highlight some of the current views and problems associated with the energy partitioning of seismic phases from single- and delay-fired chemical explosions. One problem yet to have a universally accepted answer is whether the explosion and earthquake populations, based on the Ms:mb discriminants, should be separated at smaller magnitudes. I will briefly describe the datasets and theory that support either converging or parallel behavior of these populations. Also, I will discuss improvement to the currently used methods that will better constrain this problem in the future. I will also discuss the role of regional P/S ratios in identifying explosions. In particular, recent datasets from South Africa, Scandinavia, and the Western United States collected from earthquakes, single-fired chemical explosions, and/or delay-fired mining explosions have provide new insight into regional P, S, Lg, and Rg energy partitioning. Data from co-located mining and chemical explosions suggest that some mining explosions may be used for limited calibration of regional discriminants in regions where no historic explosion data is available.
Doubal, Fergus N; Ali, Myzoon; Batty, G David; Charidimou, Andreas; Eriksdotter, Maria; Hofmann-Apitius, Martin; Kim, Yun-Hee; Levine, Deborah A; Mead, Gillian; Mucke, Hermann A M; Ritchie, Craig W; Roberts, Charlotte J; Russ, Tom C; Stewart, Robert; Whiteley, William; Quinn, Terence J
2017-04-17
Traditional approaches to clinical research have, as yet, failed to provide effective treatments for vascular dementia (VaD). Novel approaches to collation and synthesis of data may allow for time and cost efficient hypothesis generating and testing. These approaches may have particular utility in helping us understand and treat a complex condition such as VaD. We present an overview of new uses for existing data to progress VaD research. The overview is the result of consultation with various stakeholders, focused literature review and learning from the group's experience of successful approaches to data repurposing. In particular, we benefitted from the expert discussion and input of delegates at the 9 th International Congress on Vascular Dementia (Ljubljana, 16-18 th October 2015). We agreed on key areas that could be of relevance to VaD research: systematic review of existing studies; individual patient level analyses of existing trials and cohorts and linking electronic health record data to other datasets. We illustrated each theme with a case-study of an existing project that has utilised this approach. There are many opportunities for the VaD research community to make better use of existing data. The volume of potentially available data is increasing and the opportunities for using these resources to progress the VaD research agenda are exciting. Of course, these approaches come with inherent limitations and biases, as bigger datasets are not necessarily better datasets and maintaining rigour and critical analysis will be key to optimising data use.
NASA Astrophysics Data System (ADS)
Gens, R.
2017-12-01
With increasing number of experimental and operational satellites in orbit, remote sensing based mapping and monitoring of the dynamic Earth has entered into the realm of `big data'. Just the Landsat series of satellites provide a near continuous archive of 45 years of data. The availability of such spatio-temporal datasets has created opportunities for long-term monitoring diverse features and processes operating on the Earth's terrestrial and aquatic systems. Processes such as erosion, deposition, subsidence, uplift, evapotranspiration, urbanization, land-cover regime shifts can not only be monitored and change can be quantified using time-series data analysis. This unique opportunity comes with new challenges in management, analysis, and visualization of spatio-temporal datasets. Data need to be stored in a user-friendly format, and relevant metadata needs to be recorded, to allow maximum flexibility for data exchange and use. Specific data processing workflows need to be defined to support time-series analysis for specific applications. Value-added data products need to be generated keeping in mind the needs of the end-users, and using best practices in complex data visualization. This presentation systematically highlights the various steps for preparing spatio-temporal remote sensing data for time series analysis. It showcases a prototype workflow for remote sensing based change detection that can be generically applied while preserving the application-specific fidelity of the datasets. The prototype includes strategies for visualizing change over time. This has been exemplified using a time-series of optical and SAR images for visualizing the changing glacial, coastal, and wetland landscapes in parts of Alaska.
Goldrick, Stephen; Holmes, William; Bond, Nicholas J; Lewis, Gareth; Kuiper, Marcel; Turner, Richard; Farid, Suzanne S
2017-10-01
Product quality heterogeneities, such as a trisulfide bond (TSB) formation, can be influenced by multiple interacting process parameters. Identifying their root cause is a major challenge in biopharmaceutical production. To address this issue, this paper describes the novel application of advanced multivariate data analysis (MVDA) techniques to identify the process parameters influencing TSB formation in a novel recombinant antibody-peptide fusion expressed in mammalian cell culture. The screening dataset was generated with a high-throughput (HT) micro-bioreactor system (Ambr TM 15) using a design of experiments (DoE) approach. The complex dataset was firstly analyzed through the development of a multiple linear regression model focusing solely on the DoE inputs and identified the temperature, pH and initial nutrient feed day as important process parameters influencing this quality attribute. To further scrutinize the dataset, a partial least squares model was subsequently built incorporating both on-line and off-line process parameters and enabled accurate predictions of the TSB concentration at harvest. Process parameters identified by the models to promote and suppress TSB formation were implemented on five 7 L bioreactors and the resultant TSB concentrations were comparable to the model predictions. This study demonstrates the ability of MVDA to enable predictions of the key performance drivers influencing TSB formation that are valid also upon scale-up. Biotechnol. Bioeng. 2017;114: 2222-2234. © 2017 The Authors. Biotechnology and Bioengineering Published by Wiley Periodicals, Inc. © 2017 The Authors. Biotechnology and Bioengineering Published by Wiley Periodicals, Inc.
Subbotin, Sergei A; Ragsdale, Erik J; Mullens, Teresa; Roberts, Philip A; Mundo-Ocampo, Manuel; Baldwin, James G
2008-08-01
The root lesion nematodes of the genus Pratylenchus Filipjev, 1936 are migratory endoparasites of plant roots, considered among the most widespread and important nematode parasites in a variety of crops. We obtained gene sequences from the D2 and D3 expansion segments of 28S rRNA partial and 18S rRNA from 31 populations belonging to 11 valid and two unidentified species of root lesion nematodes and five outgroup taxa. These datasets were analyzed using maximum parsimony and Bayesian inference. The alignments were generated using the secondary structure models for these molecules and analyzed with Bayesian inference under the standard models and the complex model, considering helices under the doublet model and loops and bulges under the general time reversible model. The phylogenetic informativeness of morphological characters is tested by reconstruction of their histories on rRNA based trees using parallel parsimony and Bayesian approaches. Phylogenetic and sequence analyses of the 28S D2-D3 dataset with 145 accessions for 28 species and 18S dataset with 68 accessions for 15 species confirmed among large numbers of geographical diverse isolates that most classical morphospecies are monophyletic. Phylogenetic analyses revealed at least six distinct major clades of examined Pratylenchus species and these clades are generally congruent with those defined by characters derived from lip patterns, numbers of lip annules, and spermatheca shape. Morphological results suggest the need for sophisticated character discovery and analysis for morphology based phylogenetics in nematodes.
Hierarchical Naive Bayes for genetic association studies.
Malovini, Alberto; Barbarini, Nicola; Bellazzi, Riccardo; de Michelis, Francesca
2012-01-01
Genome Wide Association Studies represent powerful approaches that aim at disentangling the genetic and molecular mechanisms underlying complex traits. The usual "one-SNP-at-the-time" testing strategy cannot capture the multi-factorial nature of this kind of disorders. We propose a Hierarchical Naïve Bayes classification model for taking into account associations in SNPs data characterized by Linkage Disequilibrium. Validation shows that our model reaches classification performances superior to those obtained by the standard Naïve Bayes classifier for simulated and real datasets. In the Hierarchical Naïve Bayes implemented, the SNPs mapping to the same region of Linkage Disequilibrium are considered as "details" or "replicates" of the locus, each contributing to the overall effect of the region on the phenotype. A latent variable for each block, which models the "population" of correlated SNPs, can be then used to summarize the available information. The classification is thus performed relying on the latent variables conditional probability distributions and on the SNPs data available. The developed methodology has been tested on simulated datasets, each composed by 300 cases, 300 controls and a variable number of SNPs. Our approach has been also applied to two real datasets on the genetic bases of Type 1 Diabetes and Type 2 Diabetes generated by the Wellcome Trust Case Control Consortium. The approach proposed in this paper, called Hierarchical Naïve Bayes, allows dealing with classification of examples for which genetic information of structurally correlated SNPs are available. It improves the Naïve Bayes performances by properly handling the within-loci variability.
Automatic extraction of building boundaries using aerial LiDAR data
NASA Astrophysics Data System (ADS)
Wang, Ruisheng; Hu, Yong; Wu, Huayi; Wang, Jian
2016-01-01
Building extraction is one of the main research topics of the photogrammetry community. This paper presents automatic algorithms for building boundary extractions from aerial LiDAR data. First, segmenting height information generated from LiDAR data, the outer boundaries of aboveground objects are expressed as closed chains of oriented edge pixels. Then, building boundaries are distinguished from nonbuilding ones by evaluating their shapes. The candidate building boundaries are reconstructed as rectangles or regular polygons by applying new algorithms, following the hypothesis verification paradigm. These algorithms include constrained searching in Hough space, enhanced Hough transformation, and the sequential linking technique. The experimental results show that the proposed algorithms successfully extract building boundaries at rates of 97%, 85%, and 92% for three LiDAR datasets with varying scene complexities.
XGI: a graphical interface for XQuery creation.
Li, Xiang; Gennari, John H; Brinkley, James F
2007-10-11
XML has become the default standard for data exchange among heterogeneous data sources, and in January 2007 XQuery (XML Query language) was recommended by the World Wide Web Consortium as the query language for XML. However, XQuery is a complex language that is difficult for non-programmers to learn. We have therefore developed XGI (XQuery Graphical Interface), a visual interface for graphically generating XQuery. In this paper we demonstrate the functionality of XGI through its application to a biomedical XML dataset. We describe the system architecture and the features of XGI in relation to several existing querying systems, we demonstrate the system's usability through a sample query construction, and we discuss a preliminary evaluation of XGI. Finally, we describe some limitations of the system, and our plans for future improvements.
Spelten, Evelien R; Martin, Linda; Gitsels, Janneke T; Pereboom, Monique T R; Hutton, Eileen K; van Dulmen, Sandra
2015-01-01
video recording studies have been found to be complex; however very few studies describe the actual introduction and enrolment of the study, the resulting dataset and its interpretation. In this paper we describe the introduction and the use of video recordings of health care provider (HCP)-client interactions in primary care midwifery for research purposes. We also report on the process of data management, data coding and the resulting data set. we describe our experience in undertaking a study using video recording to assess the interaction of the midwife and her client in the first antenatal consultation, in a real life clinical practice setting in the Netherlands. Midwives from six practices across the Netherlands were recruited to videotape 15-20 intakes. The introduction, complexity of the study and intrusiveness of the study were discussed within the research group. The number of valid recordings and missing recordings was measured; reasons not to participate, non-response analyses, and the inter-rater reliability of the coded videotapes were assessed. Video recordings were supplemented by questionnaires for midwives and clients. The Roter Interaction Analysis System (RIAS) was used for coding as well as an obstetric topics scale. at the introduction of the study, more initial hesitation in co-operation was found among the midwives than among their clients. The intrusive nature of the recording on the interaction was perceived to be minimal. The complex nature of the study affected recruitment and data collection. Combining the dataset with the questionnaires and medical records proved to be a challenge. The final dataset included videotapes of 20 midwives (7-23 recordings per midwife). Of the 460 eligible clients, 324 gave informed consent. The study resulted in a significant dataset of first antenatal consultations involving recording 269 clients and 194 partners. video recording of midwife-client interaction was both feasible and challenging and resulted in a unique dataset of recordings of midwife-client interaction. Video recording studies will benefit from a tight design, and vigilant monitoring during the data collection to ensure effective data collection. We provide suggestions to promote successful introduction of video recording for research purposes. Copyright © 2014 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Ota, Junko; Umehara, Kensuke; Ishimaru, Naoki; Ohno, Shunsuke; Okamoto, Kentaro; Suzuki, Takanori; Shirai, Naoki; Ishida, Takayuki
2017-02-01
As the capability of high-resolution displays grows, high-resolution images are often required in Computed Tomography (CT). However, acquiring high-resolution images takes a higher radiation dose and a longer scanning time. In this study, we applied the Sparse-coding-based Super-Resolution (ScSR) method to generate high-resolution images without increasing the radiation dose. We prepared the over-complete dictionary learned the mapping between low- and highresolution patches and seek a sparse representation of each patch of the low-resolution input. These coefficients were used to generate the high-resolution output. For evaluation, 44 CT cases were used as the test dataset. We up-sampled images up to 2 or 4 times and compared the image quality of the ScSR scheme and bilinear and bicubic interpolations, which are the traditional interpolation schemes. We also compared the image quality of three learning datasets. A total of 45 CT images, 91 non-medical images, and 93 chest radiographs were used for dictionary preparation respectively. The image quality was evaluated by measuring peak signal-to-noise ratio (PSNR) and structure similarity (SSIM). The differences of PSNRs and SSIMs between the ScSR method and interpolation methods were statistically significant. Visual assessment confirmed that the ScSR method generated a high-resolution image with sharpness, whereas conventional interpolation methods generated over-smoothed images. To compare three different training datasets, there were no significance between the CT, the CXR and non-medical datasets. These results suggest that the ScSR provides a robust approach for application of up-sampling CT images and yields substantial high image quality of extended images in CT.
Bruni, Renato; Cesarone, Francesco; Scozzari, Andrea; Tardella, Fabio
2016-09-01
A large number of portfolio selection models have appeared in the literature since the pioneering work of Markowitz. However, even when computational and empirical results are described, they are often hard to replicate and compare due to the unavailability of the datasets used in the experiments. We provide here several datasets for portfolio selection generated using real-world price values from several major stock markets. The datasets contain weekly return values, adjusted for dividends and for stock splits, which are cleaned from errors as much as possible. The datasets are available in different formats, and can be used as benchmarks for testing the performances of portfolio selection models and for comparing the efficiency of the algorithms used to solve them. We also provide, for these datasets, the portfolios obtained by several selection strategies based on Stochastic Dominance models (see "On Exact and Approximate Stochastic Dominance Strategies for Portfolio Selection" (Bruni et al. [2])). We believe that testing portfolio models on publicly available datasets greatly simplifies the comparison of the different portfolio selection strategies.
Generation of High Resolution Global DSM from ALOS PRISM
NASA Astrophysics Data System (ADS)
Takaku, J.; Tadono, T.; Tsutsui, K.
2014-04-01
Panchromatic Remote-sensing Instrument for Stereo Mapping (PRISM), one of onboard sensors carried on the Advanced Land Observing Satellite (ALOS), was designed to generate worldwide topographic data with its optical stereoscopic observation. The sensor consists of three independent panchromatic radiometers for viewing forward, nadir, and backward in 2.5 m ground resolution producing a triplet stereoscopic image along its track. The sensor had observed huge amount of stereo images all over the world during the mission life of the satellite from 2006 through 2011. We have semi-automatically processed Digital Surface Model (DSM) data with the image archives in some limited areas. The height accuracy of the dataset was estimated at less than 5 m (rms) from the evaluation with ground control points (GCPs) or reference DSMs derived from the Light Detection and Ranging (LiDAR). Then, we decided to process the global DSM datasets from all available archives of PRISM stereo images by the end of March 2016. This paper briefly reports on the latest processing algorithms for the global DSM datasets as well as their preliminary results on some test sites. The accuracies and error characteristics of datasets are analyzed and discussed on various fields by the comparison with existing global datasets such as Ice, Cloud, and land Elevation Satellite (ICESat) data and Shuttle Radar Topography Mission (SRTM) data, as well as the GCPs and the reference airborne LiDAR/DSM.
A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie.
Hanke, Michael; Baumgartner, Florian J; Ibe, Pierre; Kaule, Falko R; Pollmann, Stefan; Speck, Oliver; Zinke, Wolf; Stadler, Jörg
2014-01-01
Here we present a high-resolution functional magnetic resonance (fMRI) dataset - 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film ("Forrest Gump"). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, and social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures - from stimulus creation to data analysis. In order to facilitate replicative and derived works, only free and open-source software was utilized.
A practical tool for maximal information coefficient analysis.
Albanese, Davide; Riccadonna, Samantha; Donati, Claudio; Franceschi, Pietro
2018-04-01
The ability of finding complex associations in large omics datasets, assessing their significance, and prioritizing them according to their strength can be of great help in the data exploration phase. Mutual information-based measures of association are particularly promising, in particular after the recent introduction of the TICe and MICe estimators, which combine computational efficiency with superior bias/variance properties. An open-source software implementation of these two measures providing a complete procedure to test their significance would be extremely useful. Here, we present MICtools, a comprehensive and effective pipeline that combines TICe and MICe into a multistep procedure that allows the identification of relationships of various degrees of complexity. MICtools calculates their strength assessing statistical significance using a permutation-based strategy. The performances of the proposed approach are assessed by an extensive investigation in synthetic datasets and an example of a potential application on a metagenomic dataset is also illustrated. We show that MICtools, combining TICe and MICe, is able to highlight associations that would not be captured by conventional strategies.
Sequence Data for Clostridium autoethanogenum using Three Generations of Sequencing Technologies
Utturkar, Sagar M.; Klingeman, Dawn Marie; Bruno-Barcena, José M.; ...
2015-04-14
During the past decade, DNA sequencing output has been mostly dominated by the second generation sequencing platforms which are characterized by low cost, high throughput and shorter read lengths for example, Illumina. The emergence and development of so called third generation sequencing platforms such as PacBio has permitted exceptionally long reads (over 20 kb) to be generated. Due to read length increases, algorithm improvements and hybrid assembly approaches, the concept of one chromosome, one contig and automated finishing of microbial genomes is now a realistic and achievable task for many microbial laboratories. In this paper, we describe high quality sequencemore » datasets which span three generations of sequencing technologies, containing six types of data from four NGS platforms and originating from a single microorganism, Clostridium autoethanogenum. The dataset reported here will be useful for the scientific community to evaluate upcoming NGS platforms, enabling comparison of existing and novel bioinformatics approaches and will encourage interest in the development of innovative experimental and computational methods for NGS data.« less
García-Jacas, César R; Contreras-Torres, Ernesto; Marrero-Ponce, Yovani; Pupo-Meriño, Mario; Barigye, Stephen J; Cabrera-Leyva, Lisset
2016-01-01
Recently, novel 3D alignment-free molecular descriptors (also known as QuBiLS-MIDAS) based on two-linear, three-linear and four-linear algebraic forms have been introduced. These descriptors codify chemical information for relations between two, three and four atoms by using several (dis-)similarity metrics and multi-metrics. Several studies aimed at assessing the quality of these novel descriptors have been performed. However, a deeper analysis of their performance is necessary. Therefore, in the present manuscript an assessment and statistical validation of the performance of these novel descriptors in QSAR studies is performed. To this end, eight molecular datasets (angiotensin converting enzyme, acetylcholinesterase inhibitors, benzodiazepine receptor, cyclooxygenase-2 inhibitors, dihydrofolate reductase inhibitors, glycogen phosphorylase b, thermolysin inhibitors, thrombin inhibitors) widely used as benchmarks in the evaluation of several procedures are utilized. Three to nine variable QSAR models based on Multiple Linear Regression are built for each chemical dataset according to the original division into training/test sets. Comparisons with respect to leave-one-out cross-validation correlation coefficients[Formula: see text] reveal that the models based on QuBiLS-MIDAS indices possess superior predictive ability in 7 of the 8 datasets analyzed, outperforming methodologies based on similar or more complex techniques such as: Partial Least Square, Neural Networks, Support Vector Machine and others. On the other hand, superior external correlation coefficients[Formula: see text] are attained in 6 of the 8 test sets considered, confirming the good predictive power of the obtained models. For the [Formula: see text] values non-parametric statistic tests were performed, which demonstrated that the models based on QuBiLS-MIDAS indices have the best global performance and yield significantly better predictions in 11 of the 12 QSAR procedures used in the comparison. Lastly, a study concerning to the performance of the indices according to several conformer generation methods was performed. This demonstrated that the quality of predictions of the QSAR models based on QuBiLS-MIDAS indices depend on 3D structure generation method considered, although in this preliminary study the results achieved do not present significant statistical differences among them. As conclusions it can be stated that the QuBiLS-MIDAS indices are suitable for extracting structural information of the molecules and thus, constitute a promissory alternative to build models that contribute to the prediction of pharmacokinetic, pharmacodynamics and toxicological properties on novel compounds.Graphical abstractComparative graphical representation of the performance of the novel QuBiLS-MIDAS 3D-MDs with respect to other methodologies in QSAR modeling of eight chemical datasets.
NASA Astrophysics Data System (ADS)
Fillmore, D. W.; Galloy, M. D.; Kindig, D.
2013-12-01
SatelliteDL is an IDL toolkit for the analysis of satellite Earth observations from a diverse set of platforms and sensors. The design features an abstraction layer that allows for easy inclusion of new datasets in a modular way. The core function of the toolkit is the spatial and temporal alignment of satellite swath and geostationary data. IDL has a powerful suite of statistical and visualization tools that can be used in conjunction with SatelliteDL. Our overarching objective is to create utilities that automate the mundane aspects of satellite data analysis, are extensible and maintainable, and do not place limitations on the analysis itself. Toward this end we have constructed SatelliteDL to include (1) HTML and LaTeX API document generation, (2) a unit test framework, (3) automatic message and error logs, (4) HTML and LaTeX plot and table generation, and (5) several real world examples with bundled datasets available for download. For ease of use, datasets, variables and optional workflows may be specified in a flexible format configuration file. Configuration statements may specify, for example, a region and date range, and the creation of images, plots and statistical summary tables for a long list of variables. SatelliteDL enforces data provenance; all data should be traceable and reproducible. The output NetCDF file metadata holds a complete history of the original datasets and their transformations, and a method exists to reconstruct a configuration file from this information. Release 0.1.0 of SatelliteDL is anticipated for the 2013 Fall AGU conference. It will distribute with ingest methods for GOES, MODIS, VIIRS and CERES radiance data (L1) as well as select 2D atmosphere products (L2) such as aerosol and cloud (MODIS and VIIRS) and radiant flux (CERES). Future releases will provide ingest methods for ocean and land surface products, gridded and time averaged datasets (L3 Daily, Monthly and Yearly), and support for 3D products such as temperature and water vapor profiles. Emphasis will be on NPP Sensor, Environmental and Climate Data Records as they become available. To obtain SatelliteDL (from 2013 December onward) please visit the project website at the indicated URL. Our poster exhibits three regional weather examples of SatelliteDL in action: (1) a mesoscale convective complex over the Great Plains (GOES, MODIS, VIIRS and CERES), (2) a dust storm over Arabia (MODIS, VIIRS and CERES) and (3) a volcanic ash plume over Patagonia and the South Atlantic (GOES, MODIS and CERES). In these examples the GOES radiances are cross-calibrated with MODIS. Cloud products are shown in examples (1) and (3) and aerosol products in examples (2) and (3).
A reference human genome dataset of the BGISEQ-500 sequencer.
Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian
2017-05-01
BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%) better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). But for insertions and deletions (indels), we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50 respectively, sensitivity = 88.52% and 70.93%) than the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as the reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Y. X.; Van Reeth, E.; Poh, C. L., E-mail: clpoh@ntu.edu.sg
2015-08-15
Purpose: Accurate visualization of lung motion is important in many clinical applications, such as radiotherapy of lung cancer. Advancement in imaging modalities [e.g., computed tomography (CT) and MRI] has allowed dynamic imaging of lung and lung tumor motion. However, each imaging modality has its advantages and disadvantages. The study presented in this paper aims at generating synthetic 4D-CT dataset for lung cancer patients by combining both continuous three-dimensional (3D) motion captured by 4D-MRI and the high spatial resolution captured by CT using the authors’ proposed approach. Methods: A novel hybrid approach based on deformable image registration (DIR) and finite elementmore » method simulation was developed to fuse a static 3D-CT volume (acquired under breath-hold) and the 3D motion information extracted from 4D-MRI dataset, creating a synthetic 4D-CT dataset. Results: The study focuses on imaging of lung and lung tumor. Comparing the synthetic 4D-CT dataset with the acquired 4D-CT dataset of six lung cancer patients based on 420 landmarks, accurate results (average error <2 mm) were achieved using the authors’ proposed approach. Their hybrid approach achieved a 40% error reduction (based on landmarks assessment) over using only DIR techniques. Conclusions: The synthetic 4D-CT dataset generated has high spatial resolution, has excellent lung details, and is able to show movement of lung and lung tumor over multiple breathing cycles.« less
Lee, Jasper; Zhang, Jianguo; Park, Ryan; Dagliyan, Grant; Liu, Brent; Huang, H K
2012-07-01
A Molecular Imaging Data Grid (MIDG) was developed to address current informatics challenges in archival, sharing, search, and distribution of preclinical imaging studies between animal imaging facilities and investigator sites. This manuscript presents a 2nd generation MIDG replacing the Globus Toolkit with a new system architecture that implements the IHE XDS-i integration profile. Implementation and evaluation were conducted using a 3-site interdisciplinary test-bed at the University of Southern California. The 2nd generation MIDG design architecture replaces the initial design's Globus Toolkit with dedicated web services and XML-based messaging for dedicated management and delivery of multi-modality DICOM imaging datasets. The Cross-enterprise Document Sharing for Imaging (XDS-i) integration profile from the field of enterprise radiology informatics was adopted into the MIDG design because streamlined image registration, management, and distribution dataflow are likewise needed in preclinical imaging informatics systems as in enterprise PACS application. Implementation of the MIDG is demonstrated at the University of Southern California Molecular Imaging Center (MIC) and two other sites with specified hardware, software, and network bandwidth. Evaluation of the MIDG involves data upload, download, and fault-tolerance testing scenarios using multi-modality animal imaging datasets collected at the USC Molecular Imaging Center. The upload, download, and fault-tolerance tests of the MIDG were performed multiple times using 12 collected animal study datasets. Upload and download times demonstrated reproducibility and improved real-world performance. Fault-tolerance tests showed that automated failover between Grid Node Servers has minimal impact on normal download times. Building upon the 1st generation concepts and experiences, the 2nd generation MIDG system improves accessibility of disparate animal-model molecular imaging datasets to users outside a molecular imaging facility's LAN using a new architecture, dataflow, and dedicated DICOM-based management web services. Productivity and efficiency of preclinical research for translational sciences investigators has been further streamlined for multi-center study data registration, management, and distribution.
Fan, Jiawei; Wang, Jiazhou; Zhang, Zhen; Hu, Weigang
2017-06-01
To develop a new automated treatment planning solution for breast and rectal cancer radiotherapy. The automated treatment planning solution developed in this study includes selection of the iterative optimized training dataset, dose volume histogram (DVH) prediction for the organs at risk (OARs), and automatic generation of clinically acceptable treatment plans. The iterative optimized training dataset is selected by an iterative optimization from 40 treatment plans for left-breast and rectal cancer patients who received radiation therapy. A two-dimensional kernel density estimation algorithm (noted as two parameters KDE) which incorporated two predictive features was implemented to produce the predicted DVHs. Finally, 10 additional new left-breast treatment plans are re-planned using the Pinnacle 3 Auto-Planning (AP) module (version 9.10, Philips Medical Systems) with the objective functions derived from the predicted DVH curves. Automatically generated re-optimized treatment plans are compared with the original manually optimized plans. By combining the iterative optimized training dataset methodology and two parameters KDE prediction algorithm, our proposed automated planning strategy improves the accuracy of the DVH prediction. The automatically generated treatment plans using the dose derived from the predicted DVHs can achieve better dose sparing for some OARs without compromising other metrics of plan quality. The proposed new automated treatment planning solution can be used to efficiently evaluate and improve the quality and consistency of the treatment plans for intensity-modulated breast and rectal cancer radiation therapy. © 2017 American Association of Physicists in Medicine.
Remote visualization and scale analysis of large turbulence datatsets
NASA Astrophysics Data System (ADS)
Livescu, D.; Pulido, J.; Burns, R.; Canada, C.; Ahrens, J.; Hamann, B.
2015-12-01
Accurate simulations of turbulent flows require solving all the dynamically relevant scales of motions. This technique, called Direct Numerical Simulation, has been successfully applied to a variety of simple flows; however, the large-scale flows encountered in Geophysical Fluid Dynamics (GFD) would require meshes outside the range of the most powerful supercomputers for the foreseeable future. Nevertheless, the current generation of petascale computers has enabled unprecedented simulations of many types of turbulent flows which focus on various GFD aspects, from the idealized configurations extensively studied in the past to more complex flows closer to the practical applications. The pace at which such simulations are performed only continues to increase; however, the simulations themselves are restricted to a small number of groups with access to large computational platforms. Yet the petabytes of turbulence data offer almost limitless information on many different aspects of the flow, from the hierarchy of turbulence moments, spectra and correlations, to structure-functions, geometrical properties, etc. The ability to share such datasets with other groups can significantly reduce the time to analyze the data, help the creative process and increase the pace of discovery. Using the largest DOE supercomputing platforms, we have performed some of the biggest turbulence simulations to date, in various configurations, addressing specific aspects of turbulence production and mixing mechanisms. Until recently, the visualization and analysis of such datasets was restricted by access to large supercomputers. The public Johns Hopkins Turbulence database simplifies the access to multi-Terabyte turbulence datasets and facilitates turbulence analysis through the use of commodity hardware. First, one of our datasets, which is part of the database, will be described and then a framework that adds high-speed visualization and wavelet support for multi-resolution analysis of turbulence will be highlighted. The addition of wavelet support reduces the latency and bandwidth requirements for visualization, allowing for many concurrent users, and enables new types of analyses, including scale decomposition and coherent feature extraction.
NASA Astrophysics Data System (ADS)
McGibbney, L. J.; Armstrong, E. M.
2016-12-01
Figuratively speaking, Scientific Datasets (SD) are shared by data producers in a multitude of shapes, sizes and flavors. Primarily however they exist as machine-independent manifestations supporting the creation, access, and sharing of array-oriented SD that can on occasion be spread across multiple files. Within the Earth Sciences, the most notable general examples include the HDF family, NetCDF, etc. with other formats such as GRIB being used pervasively within specific domains such as the Oceanographic, Atmospheric and Meteorological sciences. Such file formats contain Coverage Data e.g. a digital representation of some spatio-temporal phenomenon. A challenge for large data producers such as NASA and NOAA as well as consumers of coverage datasets (particularly surrounding visualization and interactive use within web clients) is that this is still not a straight-forward issue due to size, serialization and inherent complexity. Additionally existing data formats are either unsuitable for the Web (like netCDF files) or hard to interpret independently due to missing standard structures and metadata (e.g. the OPeNDAP protocol). Therefore alternative, Web friendly manifestations of such datasets are required.CoverageJSON is an emerging data format for publishing coverage data to the web in a web-friendly, way which fits in with the linked data publication paradigm hence lowering the barrier for interpretation by consumers via mobile devices and client applications, etc. as well as data producers who can build next generation Web friendly Web services around datasets. This work will detail how CoverageJSON is being evaluated at NASA JPL's PO.DAAC as an enabling data representation format for publishing SD as Linked Open Data embedded within SD landing pages as well as via semantic data repositories. We are currently evaluating how utilization of CoverageJSON within SD landing pages addresses the long-standing acknowledgement that SD producers are not currently addressing content-based optimization within their SD landing pages for better crawlability by commercial search engines.
Wijetunge, Chalini D; Saeed, Isaam; Boughton, Berin A; Spraggins, Jeffrey M; Caprioli, Richard M; Bacic, Antony; Roessner, Ute; Halgamuge, Saman K
2015-10-01
Matrix Assisted Laser Desorption Ionization-Imaging Mass Spectrometry (MALDI-IMS) in 'omics' data acquisition generates detailed information about the spatial distribution of molecules in a given biological sample. Various data processing methods have been developed for exploring the resultant high volume data. However, most of these methods process data in the spectral domain and do not make the most of the important spatial information available through this technology. Therefore, we propose a novel streamlined data analysis pipeline specifically developed for MALDI-IMS data utilizing significant spatial information for identifying hidden significant molecular distribution patterns in these complex datasets. The proposed unsupervised algorithm uses Sliding Window Normalization (SWN) and a new spatial distribution based peak picking method developed based on Gray level Co-Occurrence (GCO) matrices followed by clustering of biomolecules. We also use gist descriptors and an improved version of GCO matrices to extract features from molecular images and minimum medoid distance to automatically estimate the number of possible groups. We evaluated our algorithm using a new MALDI-IMS metabolomics dataset of a plant (Eucalypt) leaf. The algorithm revealed hidden significant molecular distribution patterns in the dataset, which the current Component Analysis and Segmentation Map based approaches failed to extract. We further demonstrate the performance of our peak picking method over other traditional approaches by using a publicly available MALDI-IMS proteomics dataset of a rat brain. Although SWN did not show any significant improvement as compared with using no normalization, the visual assessment showed an improvement as compared to using the median normalization. The source code and sample data are freely available at http://exims.sourceforge.net/. awgcdw@student.unimelb.edu.au or chalini_w@live.com Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Functional networks inference from rule-based machine learning models.
Lazzarini, Nicola; Widera, Paweł; Williamson, Stuart; Heer, Rakesh; Krasnogor, Natalio; Bacardit, Jaume
2016-01-01
Functional networks play an important role in the analysis of biological processes and systems. The inference of these networks from high-throughput (-omics) data is an area of intense research. So far, the similarity-based inference paradigm (e.g. gene co-expression) has been the most popular approach. It assumes a functional relationship between genes which are expressed at similar levels across different samples. An alternative to this paradigm is the inference of relationships from the structure of machine learning models. These models are able to capture complex relationships between variables, that often are different/complementary to the similarity-based methods. We propose a protocol to infer functional networks from machine learning models, called FuNeL. It assumes, that genes used together within a rule-based machine learning model to classify the samples, might also be functionally related at a biological level. The protocol is first tested on synthetic datasets and then evaluated on a test suite of 8 real-world datasets related to human cancer. The networks inferred from the real-world data are compared against gene co-expression networks of equal size, generated with 3 different methods. The comparison is performed from two different points of view. We analyse the enriched biological terms in the set of network nodes and the relationships between known disease-associated genes in a context of the network topology. The comparison confirms both the biological relevance and the complementary character of the knowledge captured by the FuNeL networks in relation to similarity-based methods and demonstrates its potential to identify known disease associations as core elements of the network. Finally, using a prostate cancer dataset as a case study, we confirm that the biological knowledge captured by our method is relevant to the disease and consistent with the specialised literature and with an independent dataset not used in the inference process. The implementation of our network inference protocol is available at: http://ico2s.org/software/funel.html.
NASA Astrophysics Data System (ADS)
Ozturk, D.; Chaudhary, A.; Votava, P.; Kotfila, C.
2016-12-01
Jointly developed by Kitware and NASA Ames, GeoNotebook is an open source tool designed to give the maximum amount of flexibility to analysts, while dramatically simplifying the process of exploring geospatially indexed datasets. Packages like Fiona (backed by GDAL), Shapely, Descartes, Geopandas, and PySAL provide a stack of technologies for reading, transforming, and analyzing geospatial data. Combined with the Jupyter notebook and libraries like matplotlib/Basemap it is possible to generate detailed geospatial visualizations. Unfortunately, visualizations generated is either static or does not perform well for very large datasets. Also, this setup requires a great deal of boilerplate code to create and maintain. Other extensions exist to remedy these problems, but they provide a separate map for each input cell and do not support map interactions that feed back into the python environment. To support interactive data exploration and visualization on large datasets we have developed an extension to the Jupyter notebook that provides a single dynamic map that can be managed from the Python environment, and that can communicate back with a server which can perform operations like data subsetting on a cloud-based cluster.
Thermalnet: a Deep Convolutional Network for Synthetic Thermal Image Generation
NASA Astrophysics Data System (ADS)
Kniaz, V. V.; Gorbatsevich, V. S.; Mizginov, V. A.
2017-05-01
Deep convolutional neural networks have dramatically changed the landscape of the modern computer vision. Nowadays methods based on deep neural networks show the best performance among image recognition and object detection algorithms. While polishing of network architectures received a lot of scholar attention, from the practical point of view the preparation of a large image dataset for a successful training of a neural network became one of major challenges. This challenge is particularly profound for image recognition in wavelengths lying outside the visible spectrum. For example no infrared or radar image datasets large enough for successful training of a deep neural network are available to date in public domain. Recent advances of deep neural networks prove that they are also capable to do arbitrary image transformations such as super-resolution image generation, grayscale image colorisation and imitation of style of a given artist. Thus a natural question arise: how could be deep neural networks used for augmentation of existing large image datasets? This paper is focused on the development of the Thermalnet deep convolutional neural network for augmentation of existing large visible image datasets with synthetic thermal images. The Thermalnet network architecture is inspired by colorisation deep neural networks.
Hernández-Ceballos, M A; Skjøth, C A; García-Mozo, H; Bolívar, J P; Galán, C
2014-12-01
Airborne pollen transport at micro-, meso-gamma and meso-beta scales must be studied by atmospheric models, having special relevance in complex terrain. In these cases, the accuracy of these models is mainly determined by the spatial resolution of the underlying meteorological dataset. This work examines how meteorological datasets determine the results obtained from atmospheric transport models used to describe pollen transport in the atmosphere. We investigate the effect of the spatial resolution when computing backward trajectories with the HYSPLIT model. We have used meteorological datasets from the WRF model with 27, 9 and 3 km resolutions and from the GDAS files with 1° resolution. This work allows characterizing atmospheric transport of Olea pollen in a region with complex flows. The results show that the complex terrain affects the trajectories and this effect varies with the different meteorological datasets. Overall, the change from GDAS to WRF-ARW inputs improves the analyses with the HYSPLIT model, thereby increasing the understanding the pollen episode. The results indicate that a spatial resolution of at least 9 km is needed to simulate atmospheric flows that are considerable affected by the relief of the landscape. The results suggest that the appropriate meteorological files should be considered when atmospheric models are used to characterize the atmospheric transport of pollen on micro-, meso-gamma and meso-beta scales. Furthermore, at these scales, the results are believed to be generally applicable for related areas such as the description of atmospheric transport of radionuclides or in the definition of nuclear-radioactivity emergency preparedness.
NASA Astrophysics Data System (ADS)
Hernández-Ceballos, M. A.; Skjøth, C. A.; García-Mozo, H.; Bolívar, J. P.; Galán, C.
2014-12-01
Airborne pollen transport at micro-, meso-gamma and meso-beta scales must be studied by atmospheric models, having special relevance in complex terrain. In these cases, the accuracy of these models is mainly determined by the spatial resolution of the underlying meteorological dataset. This work examines how meteorological datasets determine the results obtained from atmospheric transport models used to describe pollen transport in the atmosphere. We investigate the effect of the spatial resolution when computing backward trajectories with the HYSPLIT model. We have used meteorological datasets from the WRF model with 27, 9 and 3 km resolutions and from the GDAS files with 1 ° resolution. This work allows characterizing atmospheric transport of Olea pollen in a region with complex flows. The results show that the complex terrain affects the trajectories and this effect varies with the different meteorological datasets. Overall, the change from GDAS to WRF-ARW inputs improves the analyses with the HYSPLIT model, thereby increasing the understanding the pollen episode. The results indicate that a spatial resolution of at least 9 km is needed to simulate atmospheric flows that are considerable affected by the relief of the landscape. The results suggest that the appropriate meteorological files should be considered when atmospheric models are used to characterize the atmospheric transport of pollen on micro-, meso-gamma and meso-beta scales. Furthermore, at these scales, the results are believed to be generally applicable for related areas such as the description of atmospheric transport of radionuclides or in the definition of nuclear-radioactivity emergency preparedness.
Tomasini, Nicolás; Lauthier, Juan José; Ayala, Francisco José; Tibayrenc, Michel; Diosque, Patricio
2014-01-01
The model of predominant clonal evolution (PCE) proposed for micropathogens does not state that genetic exchange is totally absent, but rather, that it is too rare to break the prevalent PCE pattern. However, the actual impact of this “residual” genetic exchange should be evaluated. Multilocus Sequence Typing (MLST) is an excellent tool to explore the problem. Here, we compared online available MLST datasets for seven eukaryotic microbial pathogens: Trypanosoma cruzi, the Fusarium solani complex, Aspergillus fumigatus, Blastocystis subtype 3, the Leishmania donovani complex, Candida albicans and Candida glabrata. We first analyzed phylogenetic relationships among genotypes within each dataset. Then, we examined different measures of branch support and incongruence among loci as signs of genetic structure and levels of past recombination. The analyses allow us to identify three types of genetic structure. The first was characterized by trees with well-supported branches and low levels of incongruence suggesting well-structured populations and PCE. This was the case for the T. cruzi and F. solani datasets. The second genetic structure, represented by Blastocystis spp., A. fumigatus and the L. donovani complex datasets, showed trees with weakly-supported branches but low levels of incongruence among loci, whereby genetic structuration was not clearly defined by MLST. Finally, trees showing weakly-supported branches and high levels of incongruence among loci were observed for Candida species, suggesting that genetic exchange has a higher evolutionary impact in these mainly clonal yeast species. Furthermore, simulations showed that MLST may fail to show right clustering in population datasets even in the absence of genetic exchange. In conclusion, these results make it possible to infer variable impacts of genetic exchange in populations of predominantly clonal micro-pathogens. Moreover, our results reveal different problems of MLST to determine the genetic structure in these organisms that should be considered. PMID:25054834
Green, Christopher T.; Zhang, Yong; Jurgens, Bryant C.; Starn, J. Jeffrey; Landon, Matthew K.
2014-01-01
Analytical models of the travel time distribution (TTD) from a source area to a sample location are often used to estimate groundwater ages and solute concentration trends. The accuracies of these models are not well known for geologically complex aquifers. In this study, synthetic datasets were used to quantify the accuracy of four analytical TTD models as affected by TTD complexity, observation errors, model selection, and tracer selection. Synthetic TTDs and tracer data were generated from existing numerical models with complex hydrofacies distributions for one public-supply well and 14 monitoring wells in the Central Valley, California. Analytical TTD models were calibrated to synthetic tracer data, and prediction errors were determined for estimates of TTDs and conservative tracer (NO3−) concentrations. Analytical models included a new, scale-dependent dispersivity model (SDM) for two-dimensional transport from the watertable to a well, and three other established analytical models. The relative influence of the error sources (TTD complexity, observation error, model selection, and tracer selection) depended on the type of prediction. Geological complexity gave rise to complex TTDs in monitoring wells that strongly affected errors of the estimated TTDs. However, prediction errors for NO3− and median age depended more on tracer concentration errors. The SDM tended to give the most accurate estimates of the vertical velocity and other predictions, although TTD model selection had minor effects overall. Adding tracers improved predictions if the new tracers had different input histories. Studies using TTD models should focus on the factors that most strongly affect the desired predictions.
ERIC Educational Resources Information Center
Polanin, Joshua R.; Wilson, Sandra Jo
2014-01-01
The purpose of this project is to demonstrate the practical methods developed to utilize a dataset consisting of both multivariate and multilevel effect size data. The context for this project is a large-scale meta-analytic review of the predictors of academic achievement. This project is guided by three primary research questions: (1) How do we…
Power-Hop: A Pervasive Observation for Real Complex Networks
2016-03-14
found at: https:// github.com/kpelechrinis/powerhop. The web graph used was obtained from Yahoo ! under signing NDA and hence, cannot be made available...However, Yahoo ! is making available datasets to eligible organizations/entities through application to Webscope. In particular, the dataset used in...our study can be requested through the following URL: http:// webscope.sandbox.yahoo.com/catalog.php? datatype=g (G2 - Yahoo ! AltaVista Web Page
MVIRI/SEVIRI TOA Radiation Datasets within the Climate Monitoring SAF
NASA Astrophysics Data System (ADS)
Urbain, Manon; Clerbaux, Nicolas; Ipe, Alessandro; Baudrez, Edward; Velazquez Blazquez, Almudena; Moreels, Johan
2016-04-01
Within CM SAF, Interim Climate Data Records (ICDR) of Top-Of-Atmosphere (TOA) radiation products from the Geostationary Earth Radiation Budget (GERB) instruments on the Meteosat Second Generation (MSG) satellites have been released in 2013. These datasets (referred to as CM-113 and CM-115, resp. for shortwave (SW) and longwave (LW) radiation) are based on the instantaneous TOA fluxes from the GERB Edition-1 dataset. They cover the time period 2004-2011. Extending these datasets backward in the past is not possible as no GERB instruments were available on the Meteosat First Generation (MFG) satellites. As an alternative, it is proposed to rely on the Meteosat Visible and InfraRed Imager (MVIRI - from 1982 until 2004) and the Spinning Enhanced Visible and Infrared Imager (SEVIRI - from 2004 onward) to generate a long Thematic Climate Data Record (TCDR) from Meteosat instruments. Combining MVIRI and SEVIRI allows an unprecedented temporal (30 minutes / 15 minutes) and spatial (2.5 km / 3 km) resolution compared to the Clouds and the Earth's Radiant Energy System (CERES) products. This is a step forward as it helps to increase the knowledge of the diurnal cycle and the small-scale spatial variations of radiation. The MVIRI/SEVIRI datasets (referred to as CM-23311 and CM-23341, resp. for SW and LW radiation) will provide daily and monthly averaged TOA Reflected Solar (TRS) and Emitted Thermal (TET) radiation in "all-sky" conditions (no clear-sky conditions for this first version of the datasets), as well as monthly averaged of the hourly integrated values. The SEVIRI Solar Channels Calibration (SSCC) and the operational calibration have been used resp. for the SW and LW channels. For MFG, it is foreseen to replace the latter by the EUMETSAT/GSICS recalibration of MVIRI using HIRS. The CERES TRMM angular dependency models have been used to compute TRS fluxes while theoretical models have been used for TET fluxes. The CM-23311 and CM-23341 datasets will cover a 32 years time period, from 1st February 1982 to 31st January 2014. TRS and TET fluxes will be provided on a regular latitude-longitude grid at a spatial resolution of 0.05° (i.e. about 5.5 km) to ensure consistency with other CM SAF products. Validation will be performed at lower resolution (e.g. 1° x 1°) by intercomparison with several other datasets (CERES EBAF, CERES SYN 1deg-day, HIRS OLR, ISCCP-FD, NCDC daily OLR, etc.).
Monsen, Karen A; Kelechi, Teresa J; McRae, Marion E; Mathiason, Michelle A; Martin, Karen S
The growth and diversification of nursing theory, nursing terminology, and nursing data enable a convergence of theory- and data-driven discovery in the era of big data research. Existing datasets can be viewed through theoretical and terminology perspectives using visualization techniques in order to reveal new patterns and generate hypotheses. The Omaha System is a standardized terminology and metamodel that makes explicit the theoretical perspective of the nursing discipline and enables terminology-theory testing research. The purpose of this paper is to illustrate the approach by exploring a large research dataset consisting of 95 variables (demographics, temperature measures, anthropometrics, and standardized instruments measuring quality of life and self-efficacy) from a theory-based perspective using the Omaha System. Aims were to (a) examine the Omaha System dataset to understand the sample at baseline relative to Omaha System problem terms and outcome measures, (b) examine relationships within the normalized Omaha System dataset at baseline in predicting adherence, and (c) examine relationships within the normalized Omaha System dataset at baseline in predicting incident venous ulcer. Variables from a randomized clinical trial of a cryotherapy intervention for the prevention of venous ulcers were mapped onto Omaha System terms and measures to derive a theoretical framework for the terminology-theory testing study. The original dataset was recoded using the mapping to create an Omaha System dataset, which was then examined using visualization to generate hypotheses. The hypotheses were tested using standard inferential statistics. Logistic regression was used to predict adherence and incident venous ulcer. Findings revealed novel patterns in the psychosocial characteristics of the sample that were discovered to be drivers of both adherence (Mental health Behavior: OR = 1.28, 95% CI [1.02, 1.60]; AUC = .56) and incident venous ulcer (Mental health Behavior: OR = 0.65, 95% CI [0.45, 0.93]; Neuro-musculo-skeletal function Status: OR = 0.69, 95% CI [0.47, 1.00]; male: OR = 3.08, 95% CI [1.15, 8.24]; not married: OR = 2.70, 95% CI [1.00, 7.26]; AUC = .76). The Omaha System was employed as ontology, nursing theory, and terminology to bridge data and theory and may be considered a data-driven theorizing methodology. Novel findings suggest a relationship between psychosocial factors and incident venous ulcer outcomes. There is potential to employ this method in further research, which is needed to generate and test hypotheses from other datasets to extend scientific investigations from existing data.
DCS-SVM: a novel semi-automated method for human brain MR image segmentation.
Ahmadvand, Ali; Daliri, Mohammad Reza; Hajiali, Mohammadtaghi
2017-11-27
In this paper, a novel method is proposed which appropriately segments magnetic resonance (MR) brain images into three main tissues. This paper proposes an extension of our previous work in which we suggested a combination of multiple classifiers (CMC)-based methods named dynamic classifier selection-dynamic local training local Tanimoto index (DCS-DLTLTI) for MR brain image segmentation into three main cerebral tissues. This idea is used here and a novel method is developed that tries to use more complex and accurate classifiers like support vector machine (SVM) in the ensemble. This work is challenging because the CMC-based methods are time consuming, especially on huge datasets like three-dimensional (3D) brain MR images. Moreover, SVM is a powerful method that is used for modeling datasets with complex feature space, but it also has huge computational cost for big datasets, especially those with strong interclass variability problems and with more than two classes such as 3D brain images; therefore, we cannot use SVM in DCS-DLTLTI. Therefore, we propose a novel approach named "DCS-SVM" to use SVM in DCS-DLTLTI to improve the accuracy of segmentation results. The proposed method is applied on well-known datasets of the Internet Brain Segmentation Repository (IBSR) and promising results are obtained.
Grenville-Briggs, Laura J; Stansfield, Ian
2011-01-01
This report describes a linked series of Masters-level computer practical workshops. They comprise an advanced functional genomics investigation, based upon analysis of a microarray dataset probing yeast DNA damage responses. The workshops require the students to analyse highly complex transcriptomics datasets, and were designed to stimulate active learning through experience of current research methods in bioinformatics and functional genomics. They seek to closely mimic a realistic research environment, and require the students first to propose research hypotheses, then test those hypotheses using specific sections of the microarray dataset. The complexity of the microarray data provides students with the freedom to propose their own unique hypotheses, tested using appropriate sections of the microarray data. This research latitude was highly regarded by students and is a strength of this practical. In addition, the focus on DNA damage by radiation and mutagenic chemicals allows them to place their results in a human medical context, and successfully sparks broad interest in the subject material. In evaluation, 79% of students scored the practical workshops on a five-point scale as 4 or 5 (totally effective) for student learning. More broadly, the general use of microarray data as a "student research playground" is also discussed. Copyright © 2011 Wiley Periodicals, Inc.
Pantazatos, Spiro P.; Li, Jianrong; Pavlidis, Paul; Lussier, Yves A.
2009-01-01
An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n = 50), and precision of the semantic mapping between these terms across datasets was 98% (n = 100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets. PMID:20495688
Harvesting geographic features from heterogeneous raster maps
NASA Astrophysics Data System (ADS)
Chiang, Yao-Yi
2010-11-01
Raster maps offer a great deal of geospatial information and are easily accessible compared to other geospatial data. However, harvesting geographic features locked in heterogeneous raster maps to obtain the geospatial information is challenging. This is because of the varying image quality of raster maps (e.g., scanned maps with poor image quality and computer-generated maps with good image quality), the overlapping geographic features in maps, and the typical lack of metadata (e.g., map geocoordinates, map source, and original vector data). Previous work on map processing is typically limited to a specific type of map and often relies on intensive manual work. In contrast, this thesis investigates a general approach that does not rely on any prior knowledge and requires minimal user effort to process heterogeneous raster maps. This approach includes automatic and supervised techniques to process raster maps for separating individual layers of geographic features from the maps and recognizing geographic features in the separated layers (i.e., detecting road intersections, generating and vectorizing road geometry, and recognizing text labels). The automatic technique eliminates user intervention by exploiting common map properties of how road lines and text labels are drawn in raster maps. For example, the road lines are elongated linear objects and the characters are small connected-objects. The supervised technique utilizes labels of road and text areas to handle complex raster maps, or maps with poor image quality, and can process a variety of raster maps with minimal user input. The results show that the general approach can handle raster maps with varying map complexity, color usage, and image quality. By matching extracted road intersections to another geospatial dataset, we can identify the geocoordinates of a raster map and further align the raster map, separated feature layers from the map, and recognized features from the layers with the geospatial dataset. The road vectorization and text recognition results outperform state-of-art commercial products, and with considerably less user input. The approach in this thesis allows us to make use of the geospatial information of heterogeneous maps locked in raster format.
Cerveau, Nicolas; Jackson, Daniel J
2016-12-09
Next-generation sequencing (NGS) technologies are arguably the most revolutionary technical development to join the list of tools available to molecular biologists since PCR. For researchers working with nonconventional model organisms one major problem with the currently dominant NGS platform (Illumina) stems from the obligatory fragmentation of nucleic acid material that occurs prior to sequencing during library preparation. This step creates a significant bioinformatic challenge for accurate de novo assembly of novel transcriptome data. This challenge becomes apparent when a variety of modern assembly tools (of which there is no shortage) are applied to the same raw NGS dataset. With the same assembly parameters these tools can generate markedly different assembly outputs. In this study we present an approach that generates an optimized consensus de novo assembly of eukaryotic coding transcriptomes. This approach does not represent a new assembler, rather it combines the outputs of a variety of established assembly packages, and removes redundancy via a series of clustering steps. We test and validate our approach using Illumina datasets from six phylogenetically diverse eukaryotes (three metazoans, two plants and a yeast) and two simulated datasets derived from metazoan reference genome annotations. All of these datasets were assembled using three currently popular assembly packages (CLC, Trinity and IDBA-tran). In addition, we experimentally demonstrate that transcripts unique to one particular assembly package are likely to be bioinformatic artefacts. For all eight datasets our pipeline generates more concise transcriptomes that in fact possess more unique annotatable protein domains than any of the three individual assemblers we employed. Another measure of assembly completeness (using the purpose built BUSCO databases) also confirmed that our approach yields more information. Our approach yields coding transcriptome assemblies that are more likely to be closer to biological reality than any of the three individual assembly packages we investigated. This approach (freely available as a simple perl script) will be of use to researchers working with species for which there is little or no reference data against which the assembly of a transcriptome can be performed.
Quantitative proteomics in Giardia duodenalis-Achievements and challenges.
Emery, Samantha J; Lacey, Ernest; Haynes, Paul A
2016-08-01
Giardia duodenalis (syn. G. lamblia and G. intestinalis) is a protozoan parasite of vertebrates and a major contributor to the global burden of diarrheal diseases and gastroenteritis. The publication of multiple genome sequences in the G. duodenalis species complex has provided important insights into parasite biology, and made post-genomic technologies, including proteomics, significantly more accessible. The aims of proteomics are to identify and quantify proteins present in a cell, and assign functions to them within the context of dynamic biological systems. In Giardia, proteomics in the post-genomic era has transitioned from reliance on gel-based systems to utilisation of a diverse array of techniques based on bottom-up LC-MS/MS technologies. Together, these have generated crucial foundations for subcellular proteomes, elucidated intra- and inter-assemblage isolate variation, and identified pathways and markers in differentiation, host-parasite interactions and drug resistance. However, in Giardia, proteomics remains an emerging field, with considerable shortcomings evident from the published research. These include a bias towards assemblage A, a lack of emphasis on quantitative analytical techniques, and limited information on post-translational protein modifications. Additionally, there are multiple areas of research for which proteomic data is not available to add value to published transcriptomic data. The challenge of amalgamating data in the systems biology paradigm necessitates the further generation of large, high-quality quantitative datasets to accurately model parasite biology. This review surveys the current proteomic research available for Giardia and evaluates their technical and quantitative approaches, while contextualising their biological insights into parasite pathology, isolate variation and eukaryotic evolution. Finally, we propose areas of priority for the generation of future proteomic data to explore fundamental questions in Giardia, including the analysis of post-translational modifications, and the design of MS-based assays for validation of differentially expressed proteins in large datasets. Copyright © 2016 Elsevier B.V. All rights reserved.
BrEPS 2.0: Optimization of sequence pattern prediction for enzyme annotation.
Dudek, Christian-Alexander; Dannheim, Henning; Schomburg, Dietmar
2017-01-01
The prediction of gene functions is crucial for a large number of different life science areas. Faster high throughput sequencing techniques generate more and larger datasets. The manual annotation by classical wet-lab experiments is not suitable for these large amounts of data. We showed earlier that the automatic sequence pattern-based BrEPS protocol, based on manually curated sequences, can be used for the prediction of enzymatic functions of genes. The growing sequence databases provide the opportunity for more reliable patterns, but are also a challenge for the implementation of automatic protocols. We reimplemented and optimized the BrEPS pattern generation to be applicable for larger datasets in an acceptable timescale. Primary improvement of the new BrEPS protocol is the enhanced data selection step. Manually curated annotations from Swiss-Prot are used as reliable source for function prediction of enzymes observed on protein level. The pool of sequences is extended by highly similar sequences from TrEMBL and SwissProt. This allows us to restrict the selection of Swiss-Prot entries, without losing the diversity of sequences needed to generate significant patterns. Additionally, a supporting pattern type was introduced by extending the patterns at semi-conserved positions with highly similar amino acids. Extended patterns have an increased complexity, increasing the chance to match more sequences, without losing the essential structural information of the pattern. To enhance the usability of the database, we introduced enzyme function prediction based on consensus EC numbers and IUBMB enzyme nomenclature. BrEPS is part of the Braunschweig Enzyme Database (BRENDA) and is available on a completely redesigned website and as download. The database can be downloaded and used with the BrEPScmd command line tool for large scale sequence analysis. The BrEPS website and downloads for the database creation tool, command line tool and database are freely accessible at http://breps.tu-bs.de.
BrEPS 2.0: Optimization of sequence pattern prediction for enzyme annotation
Schomburg, Dietmar
2017-01-01
The prediction of gene functions is crucial for a large number of different life science areas. Faster high throughput sequencing techniques generate more and larger datasets. The manual annotation by classical wet-lab experiments is not suitable for these large amounts of data. We showed earlier that the automatic sequence pattern-based BrEPS protocol, based on manually curated sequences, can be used for the prediction of enzymatic functions of genes. The growing sequence databases provide the opportunity for more reliable patterns, but are also a challenge for the implementation of automatic protocols. We reimplemented and optimized the BrEPS pattern generation to be applicable for larger datasets in an acceptable timescale. Primary improvement of the new BrEPS protocol is the enhanced data selection step. Manually curated annotations from Swiss-Prot are used as reliable source for function prediction of enzymes observed on protein level. The pool of sequences is extended by highly similar sequences from TrEMBL and SwissProt. This allows us to restrict the selection of Swiss-Prot entries, without losing the diversity of sequences needed to generate significant patterns. Additionally, a supporting pattern type was introduced by extending the patterns at semi-conserved positions with highly similar amino acids. Extended patterns have an increased complexity, increasing the chance to match more sequences, without losing the essential structural information of the pattern. To enhance the usability of the database, we introduced enzyme function prediction based on consensus EC numbers and IUBMB enzyme nomenclature. BrEPS is part of the Braunschweig Enzyme Database (BRENDA) and is available on a completely redesigned website and as download. The database can be downloaded and used with the BrEPScmd command line tool for large scale sequence analysis. The BrEPS website and downloads for the database creation tool, command line tool and database are freely accessible at http://breps.tu-bs.de. PMID:28750104
Model-based clustering for RNA-seq data.
Si, Yaqing; Liu, Peng; Li, Pinghua; Brutnell, Thomas P
2014-01-15
RNA-seq technology has been widely adopted as an attractive alternative to microarray-based methods to study global gene expression. However, robust statistical tools to analyze these complex datasets are still lacking. By grouping genes with similar expression profiles across treatments, cluster analysis provides insight into gene functions and networks, and hence is an important technique for RNA-seq data analysis. In this manuscript, we derive clustering algorithms based on appropriate probability models for RNA-seq data. An expectation-maximization algorithm and another two stochastic versions of expectation-maximization algorithms are described. In addition, a strategy for initialization based on likelihood is proposed to improve the clustering algorithms. Moreover, we present a model-based hybrid-hierarchical clustering method to generate a tree structure that allows visualization of relationships among clusters as well as flexibility of choosing the number of clusters. Results from both simulation studies and analysis of a maize RNA-seq dataset show that our proposed methods provide better clustering results than alternative methods such as the K-means algorithm and hierarchical clustering methods that are not based on probability models. An R package, MBCluster.Seq, has been developed to implement our proposed algorithms. This R package provides fast computation and is publicly available at http://www.r-project.org
Giabbanelli, Philippe J; Crutzen, Rik
2017-01-01
Most adults are overweight or obese in many western countries. Several population-level interventions on the physical, economical, political, or sociocultural environment have thus attempted to achieve a healthier weight. These interventions have involved different weight-related behaviours, such as food behaviours. Agent-based models (ABMs) have the potential to help policymakers evaluate food behaviour interventions from a systems perspective. However, fully realizing this potential involves a complex procedure starting with obtaining and analyzing data to populate the model and eventually identifying more efficient cross-sectoral policies. Current procedures for ABMs of food behaviours are mostly rooted in one technique, often ignore the food environment beyond home and work, and underutilize rich datasets. In this paper, we address some of these limitations to better support policymakers through two contributions. First, via a scoping review, we highlight readily available datasets and techniques to deal with these limitations independently. Second, we propose a three steps' process to tackle all limitations together and discuss its use to develop future models for food behaviours. We acknowledge that this integrated process is a leap forward in ABMs. However, this long-term objective is well-worth addressing as it can generate robust findings to effectively inform the design of food behaviour interventions.
NASA Astrophysics Data System (ADS)
Pérez-Luque, A. J.; Pérez-Pérez, R.; Bonet-García, F. J.; Magaña, P. J.
2015-05-01
The implementation of the Natura 2000 network requires methods to assess the conservation status of habitats. This paper shows a methodological approach that combines the use of (satellite) Earth observation with ontologies to monitor Natura 2000 habitats and assess their functioning. We have created an ontological system called Savia that can describe both the ecosystem functioning and the behaviour of abiotic factors in a Natura 2000 habitat. This system is able to automatically download images from MODIS products, create indicators and compute temporal trends for them. We have developed an ontology that takes into account the different concepts and relations about indicators and temporal trends, and the spatio-temporal components of the datasets. All the information generated from datasets and MODIS images, is stored into a knowledge base according to the ontology. Users can formulate complex questions using a SPARQL end-point. This system has been tested and validated in a case study that uses Quercus pyrenaica Willd. forests as a target habitat in Sierra Nevada (Spain), a Natura 2000 site. We assess ecosystem functioning using NDVI. The selected abiotic factor is snow cover. Savia provides useful data regarding these two variables and reflects relationships between them.
A Novel Multi-Class Ensemble Model for Classifying Imbalanced Biomedical Datasets
NASA Astrophysics Data System (ADS)
Bikku, Thulasi; Sambasiva Rao, N., Dr; Rao, Akepogu Ananda, Dr
2017-08-01
This paper mainly focuseson developing aHadoop based framework for feature selection and classification models to classify high dimensionality data in heterogeneous biomedical databases. Wide research has been performing in the fields of Machine learning, Big data and Data mining for identifying patterns. The main challenge is extracting useful features generated from diverse biological systems. The proposed model can be used for predicting diseases in various applications and identifying the features relevant to particular diseases. There is an exponential growth of biomedical repositories such as PubMed and Medline, an accurate predictive model is essential for knowledge discovery in Hadoop environment. Extracting key features from unstructured documents often lead to uncertain results due to outliers and missing values. In this paper, we proposed a two phase map-reduce framework with text preprocessor and classification model. In the first phase, mapper based preprocessing method was designed to eliminate irrelevant features, missing values and outliers from the biomedical data. In the second phase, a Map-Reduce based multi-class ensemble decision tree model was designed and implemented in the preprocessed mapper data to improve the true positive rate and computational time. The experimental results on the complex biomedical datasets show that the performance of our proposed Hadoop based multi-class ensemble model significantly outperforms state-of-the-art baselines.
Automated Signal Processing Applied to Volatile-Based Inspection of Greenhouse Crops
Jansen, Roel; Hofstee, Jan Willem; Bouwmeester, Harro; van Henten, Eldert
2010-01-01
Gas chromatograph–mass spectrometers (GC-MS) have been used and shown utility for volatile-based inspection of greenhouse crops. However, a widely recognized difficulty associated with GC-MS application is the large and complex data generated by this instrument. As a consequence, experienced analysts are often required to process this data in order to determine the concentrations of the volatile organic compounds (VOCs) of interest. Manual processing is time-consuming, labour intensive and may be subject to errors due to fatigue. The objective of this study was to assess whether or not GC-MS data can also be automatically processed in order to determine the concentrations of crop health associated VOCs in a greenhouse. An experimental dataset that consisted of twelve data files was processed both manually and automatically to address this question. Manual processing was based on simple peak integration while the automatic processing relied on the algorithms implemented in the MetAlign™ software package. The results of automatic processing of the experimental dataset resulted in concentrations similar to that after manual processing. These results demonstrate that GC-MS data can be automatically processed in order to accurately determine the concentrations of crop health associated VOCs in a greenhouse. When processing GC-MS data automatically, noise reduction, alignment, baseline correction and normalisation are required. PMID:22163594
Pan- and core- network analysis of co-expression genes in a model plant
He, Fei; Maslov, Sergei
2016-12-16
Genome-wide gene expression experiments have been performed using the model plant Arabidopsis during the last decade. Some studies involved construction of coexpression networks, a popular technique used to identify groups of co-regulated genes, to infer unknown gene functions. One approach is to construct a single coexpression network by combining multiple expression datasets generated in different labs. We advocate a complementary approach in which we construct a large collection of 134 coexpression networks based on expression datasets reported in individual publications. To this end we reanalyzed public expression data. To describe this collection of networks we introduced concepts of ‘pan-network’ andmore » ‘core-network’ representing union and intersection between a sizeable fractions of individual networks, respectively. Here, we showed that these two types of networks are different both in terms of their topology and biological function of interacting genes. For example, the modules of the pan-network are enriched in regulatory and signaling functions, while the modules of the core-network tend to include components of large macromolecular complexes such as ribosomes and photosynthetic machinery. Our analysis is aimed to help the plant research community to better explore the information contained within the existing vast collection of gene expression data in Arabidopsis.« less
Pan- and core- network analysis of co-expression genes in a model plant
DOE Office of Scientific and Technical Information (OSTI.GOV)
He, Fei; Maslov, Sergei
Genome-wide gene expression experiments have been performed using the model plant Arabidopsis during the last decade. Some studies involved construction of coexpression networks, a popular technique used to identify groups of co-regulated genes, to infer unknown gene functions. One approach is to construct a single coexpression network by combining multiple expression datasets generated in different labs. We advocate a complementary approach in which we construct a large collection of 134 coexpression networks based on expression datasets reported in individual publications. To this end we reanalyzed public expression data. To describe this collection of networks we introduced concepts of ‘pan-network’ andmore » ‘core-network’ representing union and intersection between a sizeable fractions of individual networks, respectively. Here, we showed that these two types of networks are different both in terms of their topology and biological function of interacting genes. For example, the modules of the pan-network are enriched in regulatory and signaling functions, while the modules of the core-network tend to include components of large macromolecular complexes such as ribosomes and photosynthetic machinery. Our analysis is aimed to help the plant research community to better explore the information contained within the existing vast collection of gene expression data in Arabidopsis.« less
Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.
The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. Results: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a frameworkmore » based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Availability: Phylo-VISTA is available at http://www-gsd.lbl. gov/phylovista. It requires an Internet browser with Java Plugin 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu« less
Maljovec, D.; Liu, S.; Wang, B.; ...
2015-07-14
Here, dynamic probabilistic risk assessment (DPRA) methodologies couple system simulator codes (e.g., RELAP and MELCOR) with simulation controller codes (e.g., RAVEN and ADAPT). Whereas system simulator codes model system dynamics deterministically, simulation controller codes introduce both deterministic (e.g., system control logic and operating procedures) and stochastic (e.g., component failures and parameter uncertainties) elements into the simulation. Typically, a DPRA is performed by sampling values of a set of parameters and simulating the system behavior for that specific set of parameter values. For complex systems, a major challenge in using DPRA methodologies is to analyze the large number of scenarios generated,more » where clustering techniques are typically employed to better organize and interpret the data. In this paper, we focus on the analysis of two nuclear simulation datasets that are part of the risk-informed safety margin characterization (RISMC) boiling water reactor (BWR) station blackout (SBO) case study. We provide the domain experts a software tool that encodes traditional and topological clustering techniques within an interactive analysis and visualization environment, for understanding the structures of such high-dimensional nuclear simulation datasets. We demonstrate through our case study that both types of clustering techniques complement each other for enhanced structural understanding of the data.« less
Races of the Celery Pathogen Fusarium oxysporum f. sp. apii Are Polyphyletic.
Epstein, Lynn; Kaur, Sukhwinder; Chang, Peter L; Carrasquilla-Garcia, Noelia; Lyu, Guiyun; Cook, Douglas R; Subbarao, Krishna V; O'Donnell, Kerry
2017-04-01
Fusarium oxysporum species complex (FOSC) isolates were obtained from celery with symptoms of Fusarium yellows between 1993 and 2013 primarily in California. Virulence tests and a two-gene dataset from 174 isolates indicated that virulent isolates collected before 2013 were a highly clonal population of F. oxysporum f. sp. apii race 2. In 2013, new highly virulent clonal isolates, designated race 4, were discovered in production fields in Camarillo, California. Long-read Illumina data were used to analyze 16 isolates: six race 2, one of each from races 1, 3, and 4, and seven genetically diverse FOSC that were isolated from symptomatic celery but are nonpathogenic on this host. Analyses of a 10-gene dataset comprising 38 kb indicated that F. oxysporum f. sp. apii is polyphyletic; race 2 is nested within clade 3, whereas the evolutionary origins of races 1, 3, and 4 are within clade 2. Based on 6,898 single nucleotide polymorphisms from the core FOSC genome, race 3 and the new highly virulent race 4 are highly similar with Nei's Da = 0.0019, suggesting that F. oxysporum f. sp. apii race 4 evolved from race 3. Next generation sequences were used to develop PCR primers that allow rapid diagnosis of races 2 and 4 in planta.
Skipping the real world: Classification of PolSAR images without explicit feature extraction
NASA Astrophysics Data System (ADS)
Hänsch, Ronny; Hellwich, Olaf
2018-06-01
The typical processing chain for pixel-wise classification from PolSAR images starts with an optional preprocessing step (e.g. speckle reduction), continues with extracting features projecting the complex-valued data into the real domain (e.g. by polarimetric decompositions) which are then used as input for a machine-learning based classifier, and ends in an optional postprocessing (e.g. label smoothing). The extracted features are usually hand-crafted as well as preselected and represent (a somewhat arbitrary) projection from the complex to the real domain in order to fit the requirements of standard machine-learning approaches such as Support Vector Machines or Artificial Neural Networks. This paper proposes to adapt the internal node tests of Random Forests to work directly on the complex-valued PolSAR data, which makes any explicit feature extraction obsolete. This approach leads to a classification framework with a significantly decreased computation time and memory footprint since no image features have to be computed and stored beforehand. The experimental results on one fully-polarimetric and one dual-polarimetric dataset show that, despite the simpler approach, accuracy can be maintained (decreased by only less than 2 % for the fully-polarimetric dataset) or even improved (increased by roughly 9 % for the dual-polarimetric dataset).
Rule-based topology system for spatial databases to validate complex geographic datasets
NASA Astrophysics Data System (ADS)
Martinez-Llario, J.; Coll, E.; Núñez-Andrés, M.; Femenia-Ribera, C.
2017-06-01
A rule-based topology software system providing a highly flexible and fast procedure to enforce integrity in spatial relationships among datasets is presented. This improved topology rule system is built over the spatial extension Jaspa. Both projects are open source, freely available software developed by the corresponding author of this paper. Currently, there is no spatial DBMS that implements a rule-based topology engine (considering that the topology rules are designed and performed in the spatial backend). If the topology rules are applied in the frontend (as in many GIS desktop programs), ArcGIS is the most advanced solution. The system presented in this paper has several major advantages over the ArcGIS approach: it can be extended with new topology rules, it has a much wider set of rules, and it can mix feature attributes with topology rules as filters. In addition, the topology rule system can work with various DBMSs, including PostgreSQL, H2 or Oracle, and the logic is performed in the spatial backend. The proposed topology system allows users to check the complex spatial relationships among features (from one or several spatial layers) that require some complex cartographic datasets, such as the data specifications proposed by INSPIRE in Europe and the Land Administration Domain Model (LADM) for Cadastral data.
Smith, Stephen A; Moore, Michael J; Brown, Joseph W; Yang, Ya
2015-08-05
The use of transcriptomic and genomic datasets for phylogenetic reconstruction has become increasingly common as researchers attempt to resolve recalcitrant nodes with increasing amounts of data. The large size and complexity of these datasets introduce significant phylogenetic noise and conflict into subsequent analyses. The sources of conflict may include hybridization, incomplete lineage sorting, or horizontal gene transfer, and may vary across the phylogeny. For phylogenetic analysis, this noise and conflict has been accommodated in one of several ways: by binning gene regions into subsets to isolate consistent phylogenetic signal; by using gene-tree methods for reconstruction, where conflict is presumed to be explained by incomplete lineage sorting (ILS); or through concatenation, where noise is presumed to be the dominant source of conflict. The results provided herein emphasize that analysis of individual homologous gene regions can greatly improve our understanding of the underlying conflict within these datasets. Here we examined two published transcriptomic datasets, the angiosperm group Caryophyllales and the aculeate Hymenoptera, for the presence of conflict, concordance, and gene duplications in individual homologs across the phylogeny. We found significant conflict throughout the phylogeny in both datasets and in particular along the backbone. While some nodes in each phylogeny showed patterns of conflict similar to what might be expected with ILS alone, the backbone nodes also exhibited low levels of phylogenetic signal. In addition, certain nodes, especially in the Caryophyllales, had highly elevated levels of strongly supported conflict that cannot be explained by ILS alone. This study demonstrates that phylogenetic signal is highly variable in phylogenomic data sampled across related species and poses challenges when conducting species tree analyses on large genomic and transcriptomic datasets. Further insight into the conflict and processes underlying these complex datasets is necessary to improve and develop adequate models for sequence analysis and downstream applications. To aid this effort, we developed the open source software phyparts ( https://bitbucket.org/blackrim/phyparts ), which calculates unique, conflicting, and concordant bipartitions, maps gene duplications, and outputs summary statistics such as internode certainy (ICA) scores and node-specific counts of gene duplications.
Kang, Dongwan D.; Froula, Jeff; Egan, Rob; ...
2015-01-01
Grouping large genomic fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Because of the complex nature of these communities, existing metagenome binning methods often miss a large number of microbial species. In addition, most of the tools are not scalable to large datasets. Here we introduce automated software called MetaBAT that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. Lastly, it automatically formsmore » hundreds of high quality genome bins on a very large assembly consisting millions of contigs in a matter of hours on a single node. MetaBAT is open source software and available at https://bitbucket.org/berkeleylab/metabat.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thomas, Mathew; Marshall, Matthew J.; Miller, Erin A.
2014-08-26
Understanding the interactions of structured communities known as “biofilms” and other complex matrixes is possible through the X-ray micro tomography imaging of the biofilms. Feature detection and image processing for this type of data focuses on efficiently identifying and segmenting biofilms and bacteria in the datasets. The datasets are very large and often require manual interventions due to low contrast between objects and high noise levels. Thus new software is required for the effectual interpretation and analysis of the data. This work specifies the evolution and application of the ability to analyze and visualize high resolution X-ray micro tomography datasets.
NASA Astrophysics Data System (ADS)
Xu, Y.; Sun, Z.; Boerner, R.; Koch, T.; Hoegner, L.; Stilla, U.
2018-04-01
In this work, we report a novel way of generating ground truth dataset for analyzing point cloud from different sensors and the validation of algorithms. Instead of directly labeling large amount of 3D points requiring time consuming manual work, a multi-resolution 3D voxel grid for the testing site is generated. Then, with the help of a set of basic labeled points from the reference dataset, we can generate a 3D labeled space of the entire testing site with different resolutions. Specifically, an octree-based voxel structure is applied to voxelize the annotated reference point cloud, by which all the points are organized by 3D grids of multi-resolutions. When automatically annotating the new testing point clouds, a voting based approach is adopted to the labeled points within multiple resolution voxels, in order to assign a semantic label to the 3D space represented by the voxel. Lastly, robust line- and plane-based fast registration methods are developed for aligning point clouds obtained via various sensors. Benefiting from the labeled 3D spatial information, we can easily create new annotated 3D point clouds of different sensors of the same scene directly by considering the corresponding labels of 3D space the points located, which would be convenient for the validation and evaluation of algorithms related to point cloud interpretation and semantic segmentation.
Application of constrained k-means clustering in ground motion simulation validation
NASA Astrophysics Data System (ADS)
Khoshnevis, N.; Taborda, R.
2017-12-01
The validation of ground motion synthetics has received increased attention over the last few years due to the advances in physics-based deterministic and hybrid simulation methods. Unlike for low frequency simulations (f ≤ 0.5 Hz), for which it has become reasonable to expect a good match between synthetics and data, in the case of high-frequency simulations (f ≥ 1 Hz) it is not possible to match results on a wiggle-by-wiggle basis. This is mostly due to the various complexities and uncertainties involved in earthquake ground motion modeling. Therefore, in order to compare synthetics with data we turn to different time series metrics, which are used as a means to characterize how the synthetics match the data on qualitative and statistical sense. In general, these metrics provide GOF scores that measure the level of similarity in the time and frequency domains. It is common for these scores to be scaled from 0 to 10, with 10 representing a perfect match. Although using individual metrics for particular applications is considered more adequate, there is no consensus or a unified method to classify the comparison between a set of synthetic and recorded seismograms when the various metrics offer different scores. We study the relationship among these metrics through a constrained k-means clustering approach. We define 4 hypothetical stations with scores 3, 5, 7, and 9 for all metrics. We put these stations in the category of cannot-link constraints. We generate the dataset through the validation of the results from a deterministic (physics-based) ground motion simulation for a moderate magnitude earthquake in the greater Los Angeles basin using three velocity models. The maximum frequency of the simulation is 4 Hz. The dataset involves over 300 stations and 11 metrics, or features, as they are understood in the clustering process, where the metrics form a multi-dimensional space. We address the high-dimensional feature effects with a subspace-clustering analysis, generate a final labeled dataset of stations, and discuss the within-class statistical characteristics of each metric. Labeling these stations is the first step towards developing a unified metric to evaluate ground motion simulations in an application-independent manner.
Global evaluation of ammonia bidirectional exchange and livestock diurnal variation schemes
There is no EPA generated dataset in this study.This dataset is associated with the following publication:Zhu, L., D. Henze, J. Bash , G. Jeong, K. Cady-Pereira, M. Shephard, M. Luo, F. Poulot, and S. Capps. Global evaluation of ammonia bidirectional exchange and livestock diurnal variation schemes. Atmospheric Chemistry and Physics. Copernicus Publications, Katlenburg-Lindau, GERMANY, 15: 12823-12843, (2015).
Scalable and cost-effective NGS genotyping in the cloud.
Souilmi, Yassine; Lancaster, Alex K; Jung, Jae-Yoon; Rizzo, Ettore; Hawkins, Jared B; Powles, Ryan; Amzazi, Saaïd; Ghazal, Hassan; Tonellato, Peter J; Wall, Dennis P
2015-10-15
While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the 10's of dollars. We take a step towards addressing this challenge, by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Service (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Our systematic benchmarking reveals important new insights and considerations to produce clinical turn-around of whole genome analysis optimization and workflow management including strategic batching of individual genomes and efficient cluster resource configuration.
Jurrus, Elizabeth; Watanabe, Shigeki; Giuly, Richard J.; Paiva, Antonio R. C.; Ellisman, Mark H.; Jorgensen, Erik M.; Tasdizen, Tolga
2013-01-01
Neuroscientists are developing new imaging techniques and generating large volumes of data in an effort to understand the complex structure of the nervous system. The complexity and size of this data makes human interpretation a labor-intensive task. To aid in the analysis, new segmentation techniques for identifying neurons in these feature rich datasets are required. This paper presents a method for neuron boundary detection and nonbranching process segmentation in electron microscopy images and visualizing them in three dimensions. It combines both automated segmentation techniques with a graphical user interface for correction of mistakes in the automated process. The automated process first uses machine learning and image processing techniques to identify neuron membranes that deliniate the cells in each two-dimensional section. To segment nonbranching processes, the cell regions in each two-dimensional section are connected in 3D using correlation of regions between sections. The combination of this method with a graphical user interface specially designed for this purpose, enables users to quickly segment cellular processes in large volumes. PMID:22644867
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jurrus, Elizabeth R.; Watanabe, Shigeki; Giuly, Richard J.
2013-01-01
Neuroscientists are developing new imaging techniques and generating large volumes of data in an effort to understand the complex structure of the nervous system. The complexity and size of this data makes human interpretation a labor-intensive task. To aid in the analysis, new segmentation techniques for identifying neurons in these feature rich datasets are required. This paper presents a method for neuron boundary detection and nonbranching process segmentation in electron microscopy images and visualizing them in three dimensions. It combines both automated segmentation techniques with a graphical user interface for correction of mistakes in the automated process. The automated processmore » first uses machine learning and image processing techniques to identify neuron membranes that deliniate the cells in each two-dimensional section. To segment nonbranching processes, the cell regions in each two-dimensional section are connected in 3D using correlation of regions between sections. The combination of this method with a graphical user interface specially designed for this purpose, enables users to quickly segment cellular processes in large volumes.« less
HI and Low Metal Ions at the Intersection of Galaxies and the CGM
NASA Astrophysics Data System (ADS)
Oppenheimer, Benjamin
2017-08-01
Over 1000 COS orbits have revealed a surprisingly complex picture of circumgalactic gas flows surrounding the diversity of galaxies in the evolved Universe. Cosmological hydrodynamic simulations have only begun to confront the vast amount of galaxy formation physics, chemistry, and dynamics revealed in the multi-ion CGM datasets. We propose the next generation of EAGLE zoom simulations, called EAGLE Cosmic Origins, to model HI and low metal ions (C II, Mg II, & Si II) throughout not just the CGM but also within the galaxies themselves. We will employ a novel, new chemistry solver, CHIMES, to follow time-dependent ionization, chemistry, and cooling of 157 ionic and molecular species, and include multiple ionization sources from the extra-galactic background, episodic AGN, and star formation. Our aim is to understand the complete baryon cycle of inflows, outflows, and gas recycling traced over 10 decades of HI column densities as well as the complex kinematic information encoded low ion absorption spectroscopy. This simulation project represents a pilot program for a larger suite of zoom simulations, which will be publicly released and lead to additional publications.
Maglione, Anton G; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio
2017-01-01
In this study, the cortical activity correlated with the perception and appreciation of different set of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistic educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style dataset, respectively). Results suggest that the verbal appreciation of Orig dataset when compared to Color and Style ones was mainly correlated to the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Always in the first 10 s of observation: (1) Orig dataset induced more emotion and is perceived with more appreciation than the other two Color and Style datasets; (2) Style dataset is perceived with more attentional effort than the other investigated datasets. During the whole period of observation of 30 s: (1) emotion induced by Color and Style datasets increased across the time while that induced of the Orig dataset remain stable; (2) Color and Style dataset were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is coherent from the notion that active perception of the images with sustained cognitive attention in parietal and central areas caused the generation of the judgment about their aesthetic appreciation in frontal areas.
Maglione, Anton G.; Brizi, Ambra; Vecchiato, Giovanni; Rossi, Dario; Trettel, Arianna; Modica, Enrica; Babiloni, Fabio
2017-01-01
In this study, the cortical activity correlated with the perception and appreciation of different set of pictures was estimated by using neuroelectric brain activity and graph theory methodologies in a group of artistic educated persons. The pictures shown to the subjects consisted of original pictures of Titian's and a contemporary artist's paintings (Orig dataset) plus two sets of additional pictures. These additional datasets were obtained from the previous paintings by removing all but the colors or the shapes employed (Color and Style dataset, respectively). Results suggest that the verbal appreciation of Orig dataset when compared to Color and Style ones was mainly correlated to the neuroelectric indexes estimated during the first 10 s of observation of the pictures. Always in the first 10 s of observation: (1) Orig dataset induced more emotion and is perceived with more appreciation than the other two Color and Style datasets; (2) Style dataset is perceived with more attentional effort than the other investigated datasets. During the whole period of observation of 30 s: (1) emotion induced by Color and Style datasets increased across the time while that induced of the Orig dataset remain stable; (2) Color and Style dataset were perceived with more attentional effort than the Orig dataset. During the entire experience, there is evidence of a cortical flow of activity from the parietal and central areas toward the prefrontal and frontal areas during the observation of the images of all the datasets. This is coherent from the notion that active perception of the images with sustained cognitive attention in parietal and central areas caused the generation of the judgment about their aesthetic appreciation in frontal areas. PMID:28790907
NASA Astrophysics Data System (ADS)
Camera, Corrado; Bruggeman, Adriana; Hadjinicolaou, Panos; Pashiardis, Stelios; Lange, Manfred
2014-05-01
High-resolution gridded daily datasets are essential for natural resource management and the analysis of climate changes and their effects. This study aimed to create gridded datasets of daily precipitation and daily minimum and maximum temperature, for the future (2020-2050). The horizontal resolution of the developed datasets is 1 x 1 km2, covering the area under control of the Republic of Cyprus (5.760 km2). The study is divided into two parts. The first consists of the evaluation of the performance of different interpolation techniques for daily rainfall and temperature data (1980-2010) for the creation of the gridded datasets. Rainfall data recorded at 145 stations and temperature data from 34 stations were used. For precipitation, inverse distance weighting (IDW) performs best for local events, while a combination of step-wise geographically weighted regression and IDW proves to be the best method for large scale events. For minimum and maximum temperature, a combination of step-wise linear multiple regression and thin plate splines is recognized as the best method. Six Regional Climate Models (RCMs) for the A1B SRES emission scenario from the EU ENSEMBLE project database were selected as sources for future climate projections. The RCMs were evaluated for their capacity to simulate Cyprus climatology for the period 1980-2010. Data for the period 2020-2050 from the three best performing RCMs were downscaled, using the change factors approach, at the location of observational stations. Daily time series were created with a stochastic rainfall and temperature generator. The RainSim V3 software (Burton et al., 2008) was used to generate spatial-temporal coherent rainfall fields. The temperature generator was developed in R and modeled temperature as a weakly stationary process with the daily mean and standard deviation conditioned on the wet and dry state of the day (Richardson, 1981). Finally gridded datasets depicting projected future climate conditions were created with the identified best interpolation methods. The difference between the input and simulated mean daily rainfall, averaged over all the stations, was 0.03 mm (2.2%), while the error related to the number of dry days was 2 (0.6%). For mean daily minimum temperature the error was 0.005 ºC (0.04%), while for maximum temperature it was 0.01 ºC (0.04%). Overall, the weather generators were found to be reliable instruments for the downscaling of precipitation and temperature. The resulting datasets indicate a decrease of the mean annual rainfall over the study area between 5 and 70 mm (1-15%) for 2020-2050, relative to 1980-2010. Average annual minimum and maximum temperature over the Republic of Cyprus are projected to increase between 1.2 and 1.5 ºC. The dataset is currently used to compute agricultural production and water use indicators, as part of the AGWATER project (AEIFORIA/GEORGO/0311(BIE)/06), co-financed by the European Regional Development Fund and the Republic of Cyprus through the Research Promotion Foundation. Burton, A., Kilsby, C.G., Fowler, H.J., Cowpertwait, P.S.P., and O'Connell, P.E.: RainSim: A spatial-temporal stochastic rainfall modelling system. Environ. Model. Software 23, 1356-1369, 2008 Richardson, C.W.: Stochastic simulation of daily precipitation, temperature, and solar radiation. Water Resour. Res. 17, 182-190, 1981.
Discrete structural features among interface residue-level classes.
Sowmya, Gopichandran; Ranganathan, Shoba
2015-01-01
Protein-protein interaction (PPI) is essential for molecular functions in biological cells. Investigation on protein interfaces of known complexes is an important step towards deciphering the driving forces of PPIs. Each PPI complex is specific, sensitive and selective to binding. Therefore, we have estimated the relative difference in percentage of polar residues between surface and the interface for each complex in a non-redundant heterodimer dataset of 278 complexes to understand the predominant forces driving binding. Our analysis showed ~60% of protein complexes with surface polarity greater than interface polarity (designated as class A). However, a considerable number of complexes (~40%) have interface polarity greater than surface polarity, (designated as class B), with a significantly different p-value of 1.66E-45 from class A. Comprehensive analyses of protein complexes show that interface features such as interface area, interface polarity abundance, solvation free energy gain upon interface formation, binding energy and the percentage of interface charged residue abundance distinguish among class A and class B complexes, while electrostatic visualization maps also help differentiate interface classes among complexes. Class A complexes are classical with abundant non-polar interactions at the interface; however class B complexes have abundant polar interactions at the interface, similar to protein surface characteristics. Five physicochemical interface features analyzed from the protein heterodimer dataset are discriminatory among the interface residue-level classes. These novel observations find application in developing residue-level models for protein-protein binding prediction, protein-protein docking studies and interface inhibitor design as drugs.