Sample records for integrate multiple datasets

  1. Bayesian correlated clustering to integrate multiple datasets

    PubMed Central

    Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.

    2012-01-01

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558

  2. Integrative Analysis of “-Omics” Data Using Penalty Functions

    PubMed Central

    Zhao, Qing; Shi, Xingjie; Huang, Jian; Liu, Jin; Li, Yang; Ma, Shuangge

    2014-01-01

    In the analysis of omics data, integrative analysis provides an effective way of pooling information across multiple datasets or multiple correlated responses, and can be more effective than single-dataset (response) analysis. Multiple families of integrative analysis methods have been proposed in the literature. The current review focuses on the penalization methods. Special attention is paid to sparse meta-analysis methods that pool summary statistics across datasets, and integrative analysis methods that pool raw data across datasets. We discuss their formulation and rationale. Beyond “standard” penalized selection, we also review contrasted penalization and Laplacian penalization which accommodate finer data structures. The computational aspects, including computational algorithms and tuning parameter selection, are examined. This review concludes with possible limitations and extensions. PMID:25691921

  3. Integrative prescreening in analysis of multiple cancer genomic studies

    PubMed Central

    2012-01-01

    Background In high throughput cancer genomic studies, results from the analysis of single datasets often suffer from a lack of reproducibility because of small sample sizes. Integrative analysis can effectively pool and analyze multiple datasets and provides a cost effective way to improve reproducibility. In integrative analysis, simultaneously analyzing all genes profiled may incur high computational cost. A computationally affordable remedy is prescreening, which fits marginal models, can be conducted in a parallel manner, and has low computational cost. Results An integrative prescreening approach is developed for the analysis of multiple cancer genomic datasets. Simulation shows that the proposed integrative prescreening has better performance than alternatives, particularly including prescreening with individual datasets, an intensity approach and meta-analysis. We also analyze multiple microarray gene profiling studies on liver and pancreatic cancers using the proposed approach. Conclusions The proposed integrative prescreening provides an effective way to reduce the dimensionality in cancer genomic studies. It can be coupled with existing analysis methods to identify cancer markers. PMID:22799431

  4. iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets

    PubMed Central

    2012-01-01

    Background ChIP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, detecting allelic imbalance from a single ChIP-seq dataset often has low statistical power since only sequence reads mapped to heterozygote SNPs are informative for discriminating two alleles. Results We develop a new method iASeq to address this issue by jointly analyzing multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve detection of allelic imbalance. Application of iASeq to 77 ChIP-seq samples from 40 ENCODE datasets and 1 genomic DNA sample in GM12878 cells reveals that allele-specificity of multiple proteins is highly correlated, and demonstrates the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately. Conclusions iASeq illustrates the value of integrating multiple datasets in the allele-specificity inference and offers a new tool to better analyze ASB. PMID:23194258

  5. Unsupervised multiple kernel learning for heterogeneous data integration.

    PubMed

    Mariette, Jérôme; Villa-Vialaneix, Nathalie

    2018-03-15

    Recent high-throughput sequencing advances have expanded the breadth of available omics datasets, and the integrated analysis of multiple datasets obtained on the same samples has yielded important insights in a wide range of applications. However, the integration of various sources of information remains a challenge for systems biology, since the datasets produced are often of heterogeneous types and call for generic methods that take their different specificities into account. We propose a multiple kernel framework that integrates multiple datasets of various types into a single exploratory analysis. Several solutions are provided to learn either a consensus meta-kernel or a meta-kernel that preserves the original topology of the datasets. We applied our framework to analyse two public multi-omics datasets. First, the multiple metagenomic datasets collected during the TARA Oceans expedition were explored to demonstrate that our method is able to retrieve previous findings in a single kernel PCA as well as to provide a new image of the sample structures when a larger number of datasets are included in the analysis. To perform this analysis, a generic procedure is also proposed to improve the interpretability of the kernel PCA with regard to the original data. Second, the multi-omics breast cancer datasets provided by The Cancer Genome Atlas are analysed using a kernel Self-Organizing Map with both single and multi-omics strategies. The comparison of these two approaches demonstrates the benefit of our integration method to improve the representation of the studied biological system. Proposed methods are available in the R package mixKernel, released on CRAN. It is fully compatible with the mixOmics package and a tutorial describing the approach can be found on the mixOmics web site http://mixomics.org/mixkernel/. Contact: jerome.mariette@inra.fr or nathalie.villa-vialaneix@inra.fr. Supplementary information: Supplementary data are available at Bioinformatics online.
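
    The idea of a consensus meta-kernel can be illustrated with a minimal sketch. It assumes the simplest possible combination, a uniform average of cosine-normalized linear kernels computed on the same samples; this is not the mixKernel implementation, and the function names below are hypothetical.

```python
import math

def linear_kernel(X):
    """Gram matrix of inner products between the rows (samples) of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def cosine_normalize(K):
    """Scale K so every diagonal entry is 1, making kernels comparable.
    Assumes no sample has a zero self-similarity (no all-zero rows)."""
    d = [math.sqrt(K[i][i]) for i in range(len(K))]
    n = len(K)
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]

def average_kernels(kernels):
    """Uniform average of kernels defined on the same samples
    (an illustrative stand-in for a learned consensus meta-kernel)."""
    n, m = len(kernels[0]), len(kernels)
    return [[sum(K[i][j] for K in kernels) / m for j in range(n)]
            for i in range(n)]

# Two toy datasets measured on the same three samples.
dataset_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dataset_b = [[2.0], [2.0], [4.0]]
meta = average_kernels([cosine_normalize(linear_kernel(dataset_a)),
                        cosine_normalize(linear_kernel(dataset_b))])
```

    The resulting matrix could then be passed to any kernel method, such as kernel PCA; learning non-uniform weights, as the abstract describes, would replace the plain average.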

  6. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    PubMed Central

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    SUMMARY In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. Most existing integrative analyses assume the homogeneity model, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restrictive. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. A simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach. PMID:23938111
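
    For reference, the standard single-coefficient form of the minimax concave penalty named above can be written as follows. The symbols λ (tuning parameter) and γ (concavity parameter) are the usual notation and are not taken from the abstract; how the paper composes this penalty across groups and datasets is not reproduced here.

```latex
\rho(t;\lambda,\gamma)
  = \lambda \int_0^{|t|} \Bigl(1 - \frac{x}{\gamma\lambda}\Bigr)_+ \, dx
  = \begin{cases}
      \lambda |t| - \dfrac{t^2}{2\gamma}, & |t| \le \gamma\lambda, \\[6pt]
      \dfrac{\gamma\lambda^2}{2},         & |t| > \gamma\lambda,
    \end{cases}
  \qquad \lambda \ge 0,\ \gamma > 1.
```

    The penalty grows like the LASSO penalty λ|t| near zero and flattens to a constant beyond γλ, which is why large coefficients are not shrunk further.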

  7. Integrative Exploratory Analysis of Two or More Genomic Datasets.

    PubMed

    Meng, Chen; Culhane, Aedin

    2016-01-01

    Exploratory analysis is an essential step in the analysis of high throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines.

  8. Network-Assisted Investigation of Combined Causal Signals from Genome-Wide Association Studies in Schizophrenia

    PubMed Central

    Jia, Peilin; Wang, Lily; Fanous, Ayman H.; Pato, Carlos N.; Edwards, Todd L.; Zhao, Zhongming

    2012-01-01

    With the recent success of genome-wide association studies (GWAS), a wealth of association data has been accumulated for more than 200 complex diseases/traits, creating a strong demand for data integration and interpretation. A combinatory analysis of multiple GWAS datasets, or an integrative analysis of GWAS data and other high-throughput data, has been particularly promising. In this study, we proposed an integrative analysis framework of multiple GWAS datasets by overlaying association signals onto the protein-protein interaction network, and demonstrated it using schizophrenia datasets. Building on a dense module search algorithm, we first searched for significantly enriched subnetworks for schizophrenia in each single GWAS dataset and then implemented a discovery-evaluation strategy to identify module genes with consistent association signals. We validated the module genes in an independent dataset, and also examined them through meta-analysis of the related SNPs using multiple GWAS datasets. As a result, we identified 205 module genes with a joint effect significantly associated with schizophrenia; these module genes included a number of well-studied candidate genes such as DISC1, GNA12, GNA13, GNAI1, GPR17, and GRIN2B. Further functional analysis suggested these genes are involved in neuronal related processes. Additionally, meta-analysis found that 18 SNPs in 9 module genes had P_meta < 1×10^-4, including the gene HLA-DQA1 located in the MHC region on chromosome 6, which was reported in previous studies using the largest cohort of schizophrenia patients to date. These results demonstrated that our bi-directional network-based strategy is efficient for identifying disease-associated genes with modest signals in GWAS datasets. This approach can be applied to any other complex diseases/traits where multiple GWAS datasets are available. PMID:22792057

  9. Generation of open biomedical datasets through ontology-driven transformation and integration processes.

    PubMed

    Carmen Legaz-García, María Del; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2016-06-03

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources, which makes the integrated exploitation of such data difficult. The Semantic Web paradigm offers a natural technological space for data integration and exploitation by generating content readable by machines. Linked Open Data is a Semantic Web initiative that promotes the publication and sharing of data in machine readable semantic formats. We present an approach for the transformation and integration of heterogeneous biomedical data with the objective of generating open biomedical datasets in Semantic Web formats. The transformation of the data is based on the mappings between the entities of the data schema and the ontological infrastructure that provides the meaning to the content. Our approach permits different types of mappings and includes the possibility of defining complex transformation patterns. Once the mappings are defined, they can be automatically applied to datasets to generate logically consistent content and the mappings can be reused in further transformation processes. The results of our research are (1) a common transformation and integration process for heterogeneous biomedical data; (2) the application of Linked Open Data principles to generate interoperable, open, biomedical datasets; (3) a software tool, called SWIT, that implements the approach. In this paper we also describe how we have applied SWIT in different biomedical scenarios and some lessons learned. We have presented an approach that is able to generate open biomedical repositories in Semantic Web formats. SWIT is able to apply the Linked Open Data principles in the generation of the datasets, thus allowing their content to be linked to external repositories and linked open datasets to be created. SWIT datasets may contain data from multiple sources and schemas, thus becoming integrated datasets.

  10. Comparative Microbial Modules Resource: Generation and Visualization of Multi-species Biclusters

    PubMed Central

    Bate, Ashley; Eichenberger, Patrick; Bonneau, Richard

    2011-01-01

    The increasing abundance of large-scale, high-throughput datasets for many closely related organisms provides opportunities for comparative analysis via the simultaneous biclustering of datasets from multiple species. These analyses require a reformulation of how to organize multi-species datasets and visualize comparative genomics data analyses results. Recently, we developed a method, multi-species cMonkey, which integrates heterogeneous high-throughput datatypes from multiple species to identify conserved regulatory modules. Here we present an integrated data visualization system, built upon the Gaggle, enabling exploration of our method's results (available at http://meatwad.bio.nyu.edu/cmmr.html). The system can also be used to explore other comparative genomics datasets and outputs from other data analysis procedures – results from other multiple-species clustering programs or from independent clustering of different single-species datasets. We provide an example use of our system for two bacteria, Escherichia coli and Salmonella Typhimurium. We illustrate the use of our system by exploring conserved biclusters involved in nitrogen metabolism, uncovering a putative function for yjjI, a currently uncharacterized gene that we predict to be involved in nitrogen assimilation. PMID:22144874

  11. Comparative microbial modules resource: generation and visualization of multi-species biclusters.

    PubMed

    Kacmarczyk, Thadeous; Waltman, Peter; Bate, Ashley; Eichenberger, Patrick; Bonneau, Richard

    2011-12-01

    The increasing abundance of large-scale, high-throughput datasets for many closely related organisms provides opportunities for comparative analysis via the simultaneous biclustering of datasets from multiple species. These analyses require a reformulation of how to organize multi-species datasets and visualize comparative genomics data analyses results. Recently, we developed a method, multi-species cMonkey, which integrates heterogeneous high-throughput datatypes from multiple species to identify conserved regulatory modules. Here we present an integrated data visualization system, built upon the Gaggle, enabling exploration of our method's results (available at http://meatwad.bio.nyu.edu/cmmr.html). The system can also be used to explore other comparative genomics datasets and outputs from other data analysis procedures - results from other multiple-species clustering programs or from independent clustering of different single-species datasets. We provide an example use of our system for two bacteria, Escherichia coli and Salmonella Typhimurium. We illustrate the use of our system by exploring conserved biclusters involved in nitrogen metabolism, uncovering a putative function for yjjI, a currently uncharacterized gene that we predict to be involved in nitrogen assimilation. © 2011 Kacmarczyk et al.

  12. The Graduate Outcome Project: Using Data from the Integrated Data Infrastructure Project

    ERIC Educational Resources Information Center

    Rees, Malcolm

    2014-01-01

    This paper reports on progress to date with a project underway in New Zealand involving the extraction of data from multiple government agencies that is then combined into one comprehensive longitudinal integrated dataset and made available to trial participants in a way never previously thought possible. The dataset includes school leaver…

  13. Efficient algorithms for fast integration on large data sets from multiple sources.

    PubMed

    Mi, Tian; Rajasekaran, Sanguthevar; Aseltine, Robert

    2012-06-28

    Recent large scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms have been proposed in the literature that are adept at integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular more than two) datasets efficiently, using hierarchical clustering based solutions. Edit distance is used as the basic distance calculation, while distance calculation of common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD) that ignores the level above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED) that predicts the distance with the threshold by upper bounds on edit distance; and 4) A pre-processing blocking phase that limits dynamic computation within each block. We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach. In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. Accuracies of 97.7% and 98.1% were achieved for the constant and proportional threshold, respectively, in a real dataset of 1,083,878 records.
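
    The combination of blocking, a thresholded edit distance and clustering described above can be sketched roughly as follows. This is an illustrative approximation, not the authors' PCD, IDS or FCED procedures: it assumes blocking on the first letter, an early-exit Levenshtein distance, and single-linkage grouping via union-find, and all function names are hypothetical.

```python
from collections import defaultdict

def bounded_edit_distance(a, b, threshold):
    """Levenshtein distance, or threshold + 1 as soon as it must exceed threshold."""
    if abs(len(a) - len(b)) > threshold:
        return threshold + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        if min(curr) > threshold:  # row minima never decrease, so stop early
            return threshold + 1
        prev = curr
    return min(prev[-1], threshold + 1)

def link_records(records, threshold=1):
    """Group records whose keys lie within `threshold` edits (single linkage)."""
    blocks = defaultdict(list)  # blocking key: first letter (an assumption)
    for idx, key in enumerate(records):
        blocks[key[:1].lower()].append(idx)
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in blocks.values():  # compare only within a block
        for p in range(len(members)):
            for q in range(p + 1, len(members)):
                i, j = members[p], members[q]
                d = bounded_edit_distance(records[i].lower(),
                                          records[j].lower(), threshold)
                if d <= threshold:
                    parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for idx, key in enumerate(records):
        clusters[find(idx)].append(key)
    return sorted(clusters.values())

groups = link_records(["smith", "smyth", "jones", "jonas", "brown"], threshold=1)
```

    Blocking on the first letter keeps the pairwise comparisons small but would miss records whose first character is mistyped; a real system would choose the blocking key with that trade-off in mind.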

  14. A group LASSO-based method for robustly inferring gene regulatory networks from multiple time-course datasets.

    PubMed

    Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun

    2014-01-01

    As an abstract mapping of the gene regulations in the cell, a gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets while also being robust to large errors or outliers. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulation datasets and real experimental datasets. The results show that Huber group LASSO outperforms the group LASSO in terms of both areas under receiver operating characteristic curves and areas under the precision-recall curves. The convergence analysis of the algorithm theoretically shows that the sequence generated from the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
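
    The two ingredients named in the method, a Huber loss and a group LASSO penalty, can be illustrated in isolation. The sketch below gives the textbook Huber loss and the group soft-thresholding operator that a group LASSO penalty induces; it is not the authors' block-descent algorithm, and treating one edge's coefficients across datasets as a group is an assumed reading of the abstract.

```python
import math

def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear beyond delta (limits outlier influence)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)

def group_soft_threshold(group, lam):
    """Proximal operator of lam * ||group||_2: shrinks the whole group toward
    zero and sets it exactly to zero when its norm is at most lam."""
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= lam:
        return [0.0 for _ in group]
    scale = 1.0 - lam / norm
    return [scale * v for v in group]

# One hypothetical edge with a coefficient in each of two datasets.
kept = group_soft_threshold([3.0, 4.0], lam=1.0)     # norm 5, scaled by 0.8
dropped = group_soft_threshold([0.3, 0.4], lam=1.0)  # norm 0.5, removed entirely
```

    Because the whole group is kept or dropped together, an edge is either present in every dataset's network or absent from all of them, which matches the abstract's goal of inferring a single shared topology.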

  15. A New Combinatorial Optimization Approach for Integrated Feature Selection Using Different Datasets: A Prostate Cancer Transcriptomic Study

    PubMed Central

    Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo

    2015-01-01

    Background The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics. Methods We propose a new combinatorial optimization problem that addresses the core issue of biomarker detection in integrated datasets. Optimal solutions for this model deliver a feature selection from a panel of prospective biomarkers. The model we propose is a generalised version of the (α,β)-k-Feature Set problem. We illustrate the performance of this new methodology via a challenging meta-analysis task involving six prostate cancer microarray datasets. The results are then compared to the popular RankProd meta-analysis tool and to what can be obtained by analysing the individual datasets by statistical and combinatorial methods alone. Results Application of the integrated method resulted in a more informative signature than the rank-based meta-analysis or individual dataset results, and overcomes problems arising from real world datasets. The set of genes identified is highly significant in the context of prostate cancer. The method used does not rely on homogenisation or transformation of values to a common scale, and at the same time is able to capture markers associated with subgroups of the disease. PMID:26106884

  16. Privacy-Preserving Integration of Medical Data : A Practical Multiparty Private Set Intersection.

    PubMed

    Miyaji, Atsuko; Nakasho, Kazuhisa; Nishida, Shohei

    2017-03-01

    Medical data are often maintained by different organizations. However, detailed analyses sometimes require these datasets to be integrated without violating patient or commercial privacy. Multiparty Private Set Intersection (MPSI), which is an important privacy-preserving protocol, computes an intersection of multiple private datasets. This approach ensures that only designated parties can identify the intersection. In this paper, we propose a practical MPSI that satisfies the following requirements: The size of the datasets maintained by the different parties is independent of the others, and the computational complexity of the dataset held by each party is independent of the number of parties. Our MPSI is based on the use of an outsourcing provider, who has no knowledge of the data inputs or outputs. This reduces the computational complexity. The performance of the proposed MPSI is evaluated by implementing a prototype on a virtual private network to enable parallel computation in multiple threads. Our protocol is confirmed to be more efficient than comparable existing approaches.

  17. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

    PubMed Central

    Wernisch, Lorenz

    2017-01-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190

  18. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

    PubMed

    Gabasova, Evelina; Reid, John; Wernisch, Lorenz

    2017-10-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.

  19. A method and software framework for enriching private biomedical sources with data from public online repositories.

    PubMed

    Anguita, Alberto; García-Remesal, Miguel; Graf, Norbert; Maojo, Victor

    2016-04-01

    Modern biomedical research relies on the semantic integration of heterogeneous data sources to find data correlations. Researchers access multiple datasets of disparate origin, and identify elements (e.g. genes, compounds, pathways) that lead to interesting correlations. Normally, they must refer to additional public databases in order to enrich the information about the identified entities (e.g. scientific literature, published clinical trial results, etc.). While semantic integration techniques have traditionally focused on providing homogeneous access to private datasets, thus helping automate the first part of the research, and there exist different solutions for browsing public data, there is still a need for tools that facilitate merging public repositories with private datasets. This paper presents a framework that automatically locates public data of interest to the researcher and semantically integrates it with existing private datasets. The framework has been designed as an extension of traditional data integration systems, and has been validated with an existing data integration platform from a European research project by integrating a private biological dataset with data from the National Center for Biotechnology Information (NCBI). Copyright © 2016 Elsevier Inc. All rights reserved.

  20. Sparse models for correlative and integrative analysis of imaging and genetic data

    PubMed Central

    Lin, Dongdong; Cao, Hongbao; Calhoun, Vince D.

    2014-01-01

    The development of advanced medical imaging technologies and high-throughput genomic measurements has enhanced our ability to understand their interplay as well as their relationship with human behavior by integrating these two types of datasets. However, the high dimensionality and heterogeneity of these datasets presents a challenge to conventional statistical methods; there is a high demand for the development of both correlative and integrative analysis approaches. Here, we review our recent work on developing sparse representation based approaches to address this challenge. We show how sparse models are applied to the correlation and integration of imaging and genetic data for biomarker identification. We present examples on how these approaches are used for the detection of risk genes and classification of complex diseases such as schizophrenia. Finally, we discuss future directions on the integration of multiple imaging and genomic datasets including their interactions such as epistasis. PMID:25218561

  1. Integrating multiple immunogenetic data sources for feature extraction and mining somatic hypermutation patterns: the case of "towards analysis" in chronic lymphocytic leukaemia.

    PubMed

    Kavakiotis, Ioannis; Xochelli, Aliki; Agathangelidis, Andreas; Tsoumakas, Grigorios; Maglaveras, Nicos; Stamatopoulos, Kostas; Hadzidimitriou, Anastasia; Vlahavas, Ioannis; Chouvarda, Ioanna

    2016-06-06

    Somatic Hypermutation (SHM) refers to the introduction of mutations within rearranged V(D)J genes, a process that increases the diversity of Immunoglobulins (IGs). The analysis of SHM has offered critical insight into the physiology and pathology of B cells, leading to strong prognostication markers for clinical outcome in chronic lymphocytic leukaemia (CLL), the most frequent adult B-cell malignancy. In this paper we present a methodology for integrating multiple immunogenetic and clinicobiological data sources in order to extract features and create high quality datasets for SHM analysis in IG receptors of CLL patients. This dataset is used as the basis for a higher level integration procedure, inspired by social choice theory. This is applied in the Towards Analysis, our attempt to investigate the potential ontogenetic transformation of genes belonging to specific stereotyped CLL subsets towards other genes or gene families, through SHM. The data integration process, followed by feature extraction, resulted in the generation of a dataset containing information about mutations occurring through SHM. The Towards Analysis performed on the integrated dataset applying voting techniques revealed the distinct behaviour of subset #201 compared to other subsets, as regards SHM related movements among gene clans, both in allele-conserved and non-conserved gene areas. With respect to movement between genes, a high percentage of movement towards pseudogenes was found in all CLL subsets. This data integration and feature extraction process can set the basis for exploratory analysis or a fully automated computational data mining approach on many as yet unanswered, clinically relevant biological questions.

  2. Software for the Integration of Multiomics Experiments in Bioconductor.

    PubMed

    Ramos, Marcel; Schiffer, Lucas; Re, Angela; Azhar, Rimsha; Basunia, Azfar; Rodriguez, Carmen; Chan, Tiffany; Chapman, Phil; Davis, Sean R; Gomez-Cabrero, David; Culhane, Aedin C; Haibe-Kains, Benjamin; Hansen, Kasper D; Kodali, Hanish; Louis, Marie S; Mer, Arvind S; Riester, Markus; Morgan, Martin; Carey, Vince; Waldron, Levi

    2017-11-01

    Multiomics experiments are increasingly commonplace in biomedical research and add layers of complexity to experimental design, data integration, and analysis. R and Bioconductor provide a generic framework for statistical analysis and visualization, as well as specialized data classes for a variety of high-throughput data types, but methods are lacking for integrative analysis of multiomics experiments. The MultiAssayExperiment software package, implemented in R and leveraging Bioconductor software and design principles, provides for the coordinated representation of, storage of, and operation on multiple diverse genomics data. We provide the unrestricted multiple 'omics data for each cancer tissue in The Cancer Genome Atlas as ready-to-analyze MultiAssayExperiment objects and demonstrate in these and other datasets how the software simplifies data representation, statistical analysis, and visualization. The MultiAssayExperiment Bioconductor package reduces major obstacles to efficient, scalable, and reproducible statistical analysis of multiomics data and enhances data science applications of multiple omics datasets. Cancer Res; 77(21); e39-42. ©2017 American Association for Cancer Research.

  3. Spatio-Temporal Data Model for Integrating Evolving Nation-Level Datasets

    NASA Astrophysics Data System (ADS)

    Sorokine, A.; Stewart, R. N.

    2017-10-01

    The ability to easily combine data from diverse sources in a single analytical workflow is one of the greatest promises of Big Data technologies. However, such integration is often challenging because datasets originate from different vendors, governments, and research communities, which results in multiple incompatibilities in data representations, formats, and semantics. Semantic differences are the hardest to handle: different communities often use different attribute definitions and associate the records with different sets of evolving geographic entities. Analysis of global socioeconomic variables across multiple datasets over prolonged time is often complicated by differences in how boundaries and histories of countries or other geographic entities are represented. Here we propose an event-based data model for depicting and tracking histories of evolving geographic units (countries, provinces, etc.) and their representations in disparate data. The model addresses the semantic challenge of preserving identity of geographic entities over time by defining criteria for the entity existence, a set of events that may affect its existence, and rules for mapping between different representations (datasets). The proposed model is used for maintaining an evolving compound database of global socioeconomic and environmental data harvested from multiple sources. Practical implementation of our model is demonstrated using the PostgreSQL object-relational database with the use of temporal, geospatial, and NoSQL database extensions.
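
    As a rough illustration of an event-based model of this kind, the sketch below tracks entity existence through dated events and resolves a dataset-specific code to the entities that existed on a given day. The class names, the two-word event vocabulary, and the code "CS" are assumptions made for the example; the abstract describes a PostgreSQL implementation, which this in-memory sketch does not reproduce.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Event:
    when: date
    kind: str  # hypothetical vocabulary: "created" or "dissolved"


@dataclass
class Entity:
    name: str
    events: list = field(default_factory=list)

    def exists_on(self, day):
        """Replay events in time order to decide whether the entity exists on `day`."""
        alive = False
        for event in sorted(self.events, key=lambda e: e.when):
            if event.when > day:
                break
            alive = event.kind == "created"
        return alive


def resolve(code, day, code_map):
    """Map a dataset-specific code to the entities that existed on `day`."""
    return [e.name for e in code_map.get(code, []) if e.exists_on(day)]


czechoslovakia = Entity("Czechoslovakia", [
    Event(date(1918, 10, 28), "created"),
    Event(date(1993, 1, 1), "dissolved"),
])
czechia = Entity("Czech Republic", [Event(date(1993, 1, 1), "created")])
slovakia = Entity("Slovakia", [Event(date(1993, 1, 1), "created")])

# Hypothetical source that keeps using one code across the split.
code_map = {"CS": [czechoslovakia, czechia, slovakia]}

before = resolve("CS", date(1990, 6, 1), code_map)
after = resolve("CS", date(2000, 6, 1), code_map)
```

    A record dated 1990 resolves to the single predecessor state, while a record dated 2000 under the same code resolves to both successor states, so the mapping rule depends on the record's time stamp rather than on the code alone.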

  4. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome.

    PubMed

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, multiple sample groupings were made in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp.

  5. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome

    PubMed Central

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, multiple sample groupings were made in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616

  6. Decibel: The Relational Dataset Branching System

    PubMed Central

    Maddox, Michael; Goehring, David; Elmore, Aaron J.; Madden, Samuel; Parameswaran, Aditya; Deshpande, Amol

    2017-01-01

    As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. PMID:28149668

  7. Designing for Global Data Sharing, Designing for Educational Transformation

    ERIC Educational Resources Information Center

    Adams, Robin S.; Radcliffe, David; Fosmire, Michael

    2016-01-01

    This paper provides an example of a global data sharing project with an educational transformation agenda. This agenda shaped both the design of the shared dataset and the experience of sharing the common dataset to support multiple perspective inquiry and enable integrative and critically reflexive research-to-practice dialogue. The shared…

  8. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework.

    PubMed

    Voillet, Valentin; Besse, Philippe; Liaubet, Laurence; San Cristobal, Magali; González, Ignacio

    2016-10-03

    In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose a multiple imputation (MI) approach in a multivariate framework. In this study, we focus on multiple factor analysis (MFA) as a tool to compare and integrate multiple layers of information. MI involves filling the missing rows with plausible values, resulting in M completed datasets. MFA is then applied to each completed dataset to produce M different configurations (the matrices of coordinates of individuals). Finally, the M configurations are combined to yield a single consensus solution. We assessed the performance of our method, named MI-MFA, on two real omics datasets. Incomplete artificial datasets with different patterns of missingness were created from these data. The MI-MFA results were compared with two other approaches i.e., regularized iterative MFA (RI-MFA) and mean variable imputation (MVI-MFA). For each configuration resulting from these three strategies, the suitability of the solution was determined against the true MFA configuration obtained from the original data and a comprehensive graphical comparison showing how the MI-, RI- or MVI-MFA configurations diverge from the true configuration was produced. Two approaches i.e., confidence ellipses and convex hulls, to visualize and assess the uncertainty due to missing values were also described. We showed how the areas of ellipses and convex hulls increased with the number of missing individuals. A free and easy-to-use code was proposed to implement the MI-MFA method in the R statistical environment. We believe that MI-MFA provides a useful and attractive method for estimating the coordinates of individuals on the first MFA components despite missing rows. 
    MI-MFA configurations were close to the true configuration even when many individuals were missing in several data tables. This method takes into account the uncertainty of MI-MFA configurations induced by the missing rows, thereby allowing the reliability of the results to be evaluated.
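
    The impute, analyse, and pool workflow outlined in this abstract can be sketched as follows. This is a simplified illustration under strong assumptions, not MI-MFA itself: missing rows are filled by copying a random observed row (a hot-deck stand-in for the paper's imputation), the analysis step is plain column centring rather than MFA, and the M configurations are assumed to be already aligned so that an element-wise mean can serve as the consensus. All function names are hypothetical.

```python
import random


def impute_missing_rows(table, rng):
    """Fill rows that are None by copying a randomly chosen observed row."""
    observed = [row for row in table if row is not None]
    return [list(row) if row is not None else list(rng.choice(observed))
            for row in table]


def centre_columns(table):
    """Stand-in for the analysis step (NOT MFA): subtract each column mean."""
    n = len(table)
    means = [sum(col) / n for col in zip(*table)]
    return [[x - m for x, m in zip(row, means)] for row in table]


def consensus(configurations):
    """Element-wise mean of M configurations, assumed to be already aligned."""
    m = len(configurations)
    return [[sum(values) / m for values in zip(*rows)]
            for rows in zip(*configurations)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
table = [[1.0, 2.0], None, [3.0, 6.0], [5.0, 4.0]]  # second individual is missing
M = 20
configs = [centre_columns(impute_missing_rows(table, rng)) for _ in range(M)]
pooled = consensus(configs)
```

    The spread of the M coordinates obtained for the missing individual gives a rough sense of the uncertainty that the abstract visualizes with confidence ellipses and convex hulls.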

  9. Integrating genome-wide association studies and gene expression data highlights dysregulated multiple sclerosis risk pathways.

    PubMed

    Liu, Guiyou; Zhang, Fang; Jiang, Yongshuai; Hu, Yang; Gong, Zhongying; Liu, Shoufeng; Chen, Xiuju; Jiang, Qinghua; Hao, Junwei

    2017-02-01

    Much effort has been expended on identifying the genetic determinants of multiple sclerosis (MS). Existing large-scale genome-wide association study (GWAS) datasets provide strong support for using pathway and network-based analysis methods to investigate the mechanisms underlying MS. However, no shared genetic pathways have been identified to date. We hypothesize that shared genetic pathways may indeed exist in different MS-GWAS datasets. Here, we report results from a three-stage analysis of GWAS and expression datasets. In stage 1, we conducted multiple pathway analyses of two MS-GWAS datasets. In stage 2, we performed a candidate pathway analysis of the large-scale MS-GWAS dataset. In stage 3, we performed a pathway analysis using the dysregulated MS gene list from seven human MS case-control expression datasets. In stage 1, we identified 15 shared pathways. In stage 2, we successfully replicated 14 of these 15 significant pathways. In stage 3, we found that dysregulated MS genes were significantly enriched in 10 of 15 MS risk pathways identified in stages 1 and 2. We report shared genetic pathways in different MS-GWAS datasets and highlight some new MS risk pathways. Our findings provide new insights on the genetic determinants of MS.
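
    The abstract does not say which statistical test was used to call a pathway enriched. A common choice for this kind of gene-list analysis is the one-sided hypergeometric test, sketched below purely as an assumption; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 4 in the pathway, 3 dysregulated, 2 of them in the pathway.
p_value = hypergeom_enrichment_p(population=10, pathway_size=4, selected=3, overlap=2)
```

    In a real analysis the resulting p-values would also need correction for the number of pathways tested.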

  10. PICKLE 2.0: A human protein-protein interaction meta-database employing data integration via genetic information ontology

    PubMed Central

    Gioutlakis, Aris; Klapa, Maria I.

    2017-01-01

    It has been acknowledged that source databases recording experimentally supported human protein-protein interactions (PPIs) exhibit limited overlap. Thus, the reconstruction of a comprehensive PPI network requires appropriate integration of multiple heterogeneous primary datasets, presenting the PPIs at various genetic reference levels. Existing PPI meta-databases perform integration via normalization; namely, PPIs are merged after converted to a certain target level. Hence, the node set of the integrated network depends each time on the number and type of the combined datasets. Moreover, the irreversible a priori normalization process hinders the identification of normalization artifacts in the integrated network, which originate from the nonlinearity characterizing the genetic information flow. PICKLE (Protein InteraCtion KnowLedgebasE) 2.0 implements a new architecture for this recently introduced human PPI meta-database. Its main novel feature over the existing meta-databases is its approach to primary PPI dataset integration via genetic information ontology. Building upon the PICKLE principles of using the reviewed human complete proteome (RHCP) of UniProtKB/Swiss-Prot as the reference protein interactor set, and filtering out protein interactions with low probability of being direct based on the available evidence, PICKLE 2.0 first assembles the RHCP genetic information ontology network by connecting the corresponding genes, nucleotide sequences (mRNAs) and proteins (UniProt entries) and then integrates PPI datasets by superimposing them on the ontology network without any a priori transformations. Importantly, this process allows the resulting heterogeneous integrated network to be reversibly normalized to any level of genetic reference without loss of the original information, the latter being used for identification of normalization biases, and enables the appraisal of potential false positive interactions through PPI source database cross-checking. 
    The PICKLE web-based interface (www.pickle.gr) allows for the simultaneous query of multiple entities and provides integrated human PPI networks at either the protein (UniProt) or the gene level, at three PPI filtering modes. PMID:29023571
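
    The idea of normalizing an integrated network to another reference level without discarding the original records can be illustrated with a minimal sketch. It is not the PICKLE implementation: it assumes a simple protein-to-gene mapping (hypothetical identifiers P1, G1, and so on) and projects protein-level interactions to gene-level edges while keeping, for each edge, the protein pairs that support it.

```python
def project_to_genes(ppis, protein_to_genes):
    """Project protein-level interactions to gene level, keeping the originals.

    Returns a dict: frozenset of the gene pair -> set of supporting protein pairs.
    """
    projected = {}
    for a, b in ppis:
        for gene_a in protein_to_genes.get(a, ()):
            for gene_b in protein_to_genes.get(b, ()):
                key = frozenset((gene_a, gene_b))
                projected.setdefault(key, set()).add((a, b))
    return projected


# Hypothetical identifiers: proteins P2 and P3 both map to gene G2.
protein_to_genes = {"P1": ["G1"], "P2": ["G2"], "P3": ["G2"]}
ppis = [("P1", "P2"), ("P1", "P3")]
gene_network = project_to_genes(ppis, protein_to_genes)
```

    Two protein-level interactions collapse into one gene-level edge, but because the supporting pairs are retained, the projection can be inspected or undone, which is what makes normalization artifacts identifiable.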

  11. Integration of heterogeneous features for remote sensing scene classification

    NASA Astrophysics Data System (ADS)

    Wang, Xin; Xiong, Xingnan; Ning, Chen; Shi, Aiye; Lv, Guofang

    2018-01-01

    Scene classification is one of the most important issues in remote sensing (RS) image processing. We find that features from different channels (shape, spectral, texture, etc.), levels (low-level and middle-level), or perspectives (local and global) could provide various properties for RS images, and then propose a heterogeneous feature framework to extract and integrate heterogeneous features with different types for RS scene classification. The proposed method is composed of three modules (1) heterogeneous features extraction, where three heterogeneous feature types, called DS-SURF-LLC, mean-Std-LLC, and MS-CLBP, are calculated, (2) heterogeneous features fusion, where the multiple kernel learning (MKL) is utilized to integrate the heterogeneous features, and (3) an MKL support vector machine classifier for RS scene classification. The proposed method is extensively evaluated on three challenging benchmark datasets (a 6-class dataset, a 12-class dataset, and a 21-class dataset), and the experimental results show that the proposed method leads to good classification performance. It produces good informative features to describe the RS image scenes. Moreover, the integration of heterogeneous features outperforms some state-of-the-art features on RS scene classification tasks.
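
    The fusion step can be pictured as combining one kernel matrix per feature type into a single kernel. The sketch below uses fixed, hand-picked weights and Gaussian kernels on toy vectors; actual multiple kernel learning would learn the weights together with the classifier, and the feature names and values here are hypothetical.

```python
import math


def rbf_kernel(samples, gamma):
    """Gaussian (RBF) kernel matrix for a list of feature vectors."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in samples]
        for x in samples
    ]


def combine_kernels(kernels, weights):
    """Convex combination of kernel matrices; the weights are fixed here, not learned."""
    if len(kernels) != len(weights) or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("need one non-negative weight per kernel, not all zero")
    total = sum(weights)
    n = len(kernels[0])
    return [
        [sum(w / total * k[i][j] for k, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical two-dimensional "shape" and "texture" descriptors for three images.
shape_features = [[0.1, 0.2], [0.1, 0.3], [0.9, 0.8]]
texture_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]

combined = combine_kernels(
    [rbf_kernel(shape_features, gamma=1.0), rbf_kernel(texture_features, gamma=0.5)],
    weights=[2.0, 1.0],
)
```

    The combined matrix could then be passed to any kernel classifier that accepts a precomputed kernel.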

  12. A Web-Based Interactive Platform for Co-Clustering Spatio-Temporal Data

    NASA Astrophysics Data System (ADS)

    Wu, X.; Poorthuis, A.; Zurita-Milla, R.; Kraak, M.-J.

    2017-09-01

    Since current studies on clustering analysis mainly focus on exploring spatial or temporal patterns separately, a co-clustering algorithm is utilized in this study to enable the concurrent analysis of spatio-temporal patterns. To allow users to adopt and adapt the algorithm for their own analysis, it is integrated within the server side of an interactive web-based platform. The client side of the platform, running within any modern browser, is a graphical user interface (GUI) with multiple linked visualizations that facilitates the understanding, exploration and interpretation of the raw dataset and co-clustering results. Users can also upload their own datasets and adjust clustering parameters within the platform. To illustrate the use of this platform, an annual temperature dataset from 28 weather stations over 20 years in the Netherlands is used. After the dataset is loaded, it is visualized in a set of linked visualizations: a geographical map, a timeline and a heatmap. This aids the user in understanding the nature of their dataset and the appropriate selection of co-clustering parameters. Once the dataset is processed by the co-clustering algorithm, the results are visualized in the small multiples, a heatmap and a timeline to provide various views for better understanding and also further interpretation. Since the visualization and analysis are integrated in a seamless platform, the user can explore different sets of co-clustering parameters and instantly view the results in order to do iterative, exploratory data analysis. As such, this interactive web-based platform allows users to analyze spatio-temporal data using the co-clustering method and also helps the understanding of the results using multiple linked visualizations.

  13. Integrative missing value estimation for microarray data.

    PubMed

    Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine

    2006-10-12

    Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets into consideration. To determine whether the given reference data sets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm, by up to 15% in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
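
    A much simplified sketch of neighbor-based imputation with reference datasets is given below. It is neither iMISS nor LLS: it assumes that neighbor genes are ranked by distance to the target gene within each dataset, that ranks are averaged across the primary and reference datasets to obtain a consistent neighbor list, and that the missing value is estimated as the plain mean of the neighbors' values. Gene names and expression values are hypothetical.

```python
def distance(u, v):
    """Root mean squared difference over positions observed in both profiles."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return (sum((a - b) ** 2 for a, b in pairs) / len(pairs)) ** 0.5


def neighbour_ranks(dataset, target):
    """Rank every other gene by its distance to the target gene (0 = closest)."""
    others = [g for g in dataset if g != target]
    ordered = sorted(others, key=lambda g: distance(dataset[target], dataset[g]))
    return {g: rank for rank, g in enumerate(ordered)}


def consistent_neighbours(datasets, target, k):
    """Genes with the lowest mean rank across all datasets."""
    ranks = [neighbour_ranks(d, target) for d in datasets]
    genes = set(ranks[0]).intersection(*ranks[1:])
    mean_rank = {g: sum(r[g] for r in ranks) / len(ranks) for g in genes}
    return sorted(genes, key=lambda g: (mean_rank[g], g))[:k]


def impute(primary, references, target, column, k=2):
    """Estimate primary[target][column] as the mean over consistent neighbours."""
    neighbours = consistent_neighbours([primary] + references, target, k)
    values = [primary[g][column] for g in neighbours if primary[g][column] is not None]
    return sum(values) / len(values)


# Hypothetical expression profiles; g1 has a missing value in column 1.
primary = {"g1": [1.0, None, 3.0], "g2": [1.1, 2.0, 3.1],
           "g3": [5.0, 9.0, 0.0], "g4": [0.9, 2.2, 2.9]}
reference = {"g1": [2.0, 4.0], "g2": [2.1, 4.1], "g3": [9.0, 0.0], "g4": [1.9, 3.8]}

estimate = impute(primary, [reference], target="g1", column=1, k=2)
```

    Averaging ranks rather than raw distances keeps datasets with different scales or sample counts comparable, which is one plausible reason for an order-statistics-based design.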

  14. Merging Disparate Data Sources Into a Paleoanthropological Geodatabase for Research, Education, and Conservation in the Greater Hadar Region (Afar, Ethiopia)

    NASA Astrophysics Data System (ADS)

    Campisano, C. J.; Dimaggio, E. N.; Arrowsmith, J. R.; Kimbel, W. H.; Reed, K. E.; Robinson, S. E.; Schoville, B. J.

    2008-12-01

    Understanding the geographic, temporal, and environmental contexts of human evolution requires the ability to compare wide-ranging datasets collected from multiple research disciplines. Paleoanthropological field-research projects are notoriously independent administratively even in regions of high transdisciplinary importance. As a result, valuable opportunities for the integration of new and archival datasets spanning diverse archaeological assemblages, paleontological localities, and stratigraphic sequences are often neglected, which limits the range of research questions that can be addressed. Using geoinformatic tools we integrate spatial, temporal, and semantically disparate paleoanthropological and geological datasets from the Hadar sedimentary basin of the Afar Rift, Ethiopia. Applying newly integrated data to investigations of fossil-rich sediments will provide the geospatial framework critical for addressing fundamental questions concerning hominins and their paleoenvironmental context. We present a preliminary cyberinfrastructure for data management that will allow scientists, students, and interested citizens to interact with, integrate, and visualize data from the Afar region. Examples of our initial integration efforts include generating a regional high-resolution satellite imagery base layer for georeferencing, standardizing and compiling multiple project datasets and digitizing paper maps. We also demonstrate how the robust datasets generated from our work are being incorporated into a new, digital module for Arizona State University's Hadar Paleoanthropology Field School - modernizing field data collection methods, on-the-fly data visualization and query, and subsequent analysis and interpretation. Armed with a fully fused database tethered to high-resolution satellite imagery, we can more accurately reconstruct spatial and temporal paleoenvironmental conditions and efficiently address key scientific questions, such as those regarding the relative importance of internal and external ecological, climatological, and tectonic forcings on evolutionary change in the fossil record. In close association with colleagues working in neighboring project areas, this work advances multidisciplinary and collaborative research, training, and long-range antiquities conservation in the Hadar region.

  15. InSilico DB genomic datasets hub: an efficient starting point for analyzing genome-wide studies in GenePattern, Integrative Genomics Viewer, and R/Bioconductor.

    PubMed

    Coletta, Alain; Molter, Colin; Duqué, Robin; Steenhoff, David; Taminau, Jonatan; de Schaetzen, Virginie; Meganck, Stijn; Lazar, Cosmin; Venet, David; Detours, Vincent; Nowé, Ann; Bersini, Hugues; Weiss Solís, David Y

    2012-11-18

    Genomics datasets are increasingly useful for gaining biomedical insights, with adoption in the clinic underway. However, multiple hurdles related to data management stand in the way of their efficient large-scale utilization. The solution proposed is a web-based data storage hub. Having clear focus, flexibility and adaptability, InSilico DB seamlessly connects genomics dataset repositories to state-of-the-art and free GUI and command-line data analysis tools. The InSilico DB platform is a powerful collaborative environment, with advanced capabilities for biocuration, dataset sharing, and dataset subsetting and combination. InSilico DB is available from https://insilicodb.org.

  16. Integrative sparse principal component analysis of gene expression data.

    PubMed

    Liu, Mengque; Fan, Xinyan; Fang, Kuangnan; Zhang, Qingzhao; Ma, Shuangge

    2017-12-01

    In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance. © 2017 WILEY PERIODICALS, INC.
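
    The abstract states that sparse loadings are selected with a group penalty but does not give its form. One standard building block of group-penalized estimators is the group soft-thresholding operator, shown here as an assumption about what such a step could look like: the loadings of one gene across all datasets form a group that is shrunk, or set to zero, as a whole.

```python
import math


def group_soft_threshold(loadings, lam):
    """Proximal operator of lam * ||v||_2 for one gene's loadings across datasets."""
    norm = math.sqrt(sum(v * v for v in loadings))
    if norm <= lam:
        return [0.0] * len(loadings)  # the whole group is dropped
    scale = 1.0 - lam / norm
    return [scale * v for v in loadings]


# Hypothetical loadings of two genes in two datasets.
strong_gene = group_soft_threshold([3.0, 4.0], lam=1.0)  # norm 5, kept and shrunk
weak_gene = group_soft_threshold([0.3, 0.4], lam=1.0)    # norm 0.5, removed everywhere
```

    Because a gene is kept or removed in all datasets at once, the selected loadings share the same sparsity pattern across datasets; the contrasted penalties mentioned in the abstract are a separate ingredient and are not sketched here.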

  17. Lessons learned in the generation of biomedical research datasets using Semantic Open Data technologies.

    PubMed

    Legaz-García, María del Carmen; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2015-01-01

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources. Such heterogeneity complicates not only the generation of research-oriented datasets but also their exploitation. In recent years, the Open Data paradigm has proposed new ways of making data available so that sharing and integration are facilitated. Open Data approaches may pursue the generation of content readable only by humans or by both humans and machines; the latter are the ones of interest in our work. The Semantic Web provides a natural technological space for data integration and exploitation and offers a range of technologies for generating not only Open Datasets but also Linked Datasets, that is, open datasets linked to other open datasets. According to Berners-Lee's classification, each open dataset can be given a rating between one and five stars. In recent years, we have developed and applied our SWIT tool, which automates the generation of semantic datasets from heterogeneous data sources. SWIT produces four-star datasets, given that the fifth star is obtained when the dataset is linked from external ones. In this paper, we describe how we have applied the tool in two projects related to health care records and orthology data, as well as the major lessons learned from such efforts.

  18. The Problem with the Delta Cost Project Database

    ERIC Educational Resources Information Center

    Jaquette, Ozan; Parra, Edna

    2016-01-01

    The Integrated Postsecondary Education Data System (IPEDS) collects data on Title IV institutions. The Delta Cost Project (DCP) integrated data from multiple IPEDS survey components into a public-use longitudinal dataset. The DCP Database was the basis for dozens of journal articles and a series of influential policy reports. Unfortunately, a flaw in…

  19. Integrative Analysis of High-throughput Cancer Studies with Contrasted Penalization

    PubMed Central

    Shi, Xingjie; Liu, Jin; Huang, Jian; Zhou, Yong; Shia, BenChang; Ma, Shuangge

    2015-01-01

    In cancer studies with high-throughput genetic and genomic measurements, integrative analysis provides a way to effectively pool and analyze heterogeneous raw data from multiple independent studies and outperforms “classic” meta-analysis and single-dataset analysis. When marker selection is of interest, the genetic basis of multiple datasets can be described using the homogeneity model or the heterogeneity model. In this study, we consider marker selection under the heterogeneity model, which includes the homogeneity model as a special case and can be more flexible. Penalization methods have been developed in the literature for marker selection. This study advances from the published ones by introducing the contrast penalties, which can accommodate the within- and across-dataset structures of covariates/regression coefficients and, by doing so, further improve marker selection performance. Specifically, we develop a penalization method that accommodates the across-dataset structures by smoothing over regression coefficients. An effective iterative algorithm, which calls an inner coordinate descent iteration, is developed. Simulation shows that the proposed method outperforms the benchmark with more accurate marker identification. The analysis of breast cancer and lung cancer prognosis studies with gene expression measurements shows that the proposed method identifies genes different from those using the benchmark and has better prediction performance. PMID:24395534

  20. EnRICH: Extraction and Ranking using Integration and Criteria Heuristics.

    PubMed

    Zhang, Xia; Greenlee, M Heather West; Serb, Jeanne M

    2013-01-15

    High throughput screening technologies enable biologists to generate candidate genes at a rate that, due to time and cost constraints, cannot be studied by experimental approaches in the laboratory. Thus, it has become increasingly important to prioritize candidate genes for experiments. To accomplish this, researchers need to apply selection requirements based on their knowledge, which necessitates qualitative integration of heterogeneous data sources and filtration using multiple criteria. A similar approach can also be applied to putative candidate gene relationships. While automation can assist in this routine and imperative procedure, flexibility of data sources and criteria must not be sacrificed. A tool that can optimize the trade-off between automation and flexibility to simultaneously filter and qualitatively integrate data is needed to prioritize candidate genes and generate composite networks from heterogeneous data sources. We developed the Java application, EnRICH (Extraction and Ranking using Integration and Criteria Heuristics), in order to alleviate this need. Here we present a case study in which we used EnRICH to integrate and filter multiple candidate gene lists in order to identify potential retinal disease genes. As a result of this procedure, a candidate pool of several hundred genes was narrowed down to five candidate genes, of which four are confirmed retinal disease genes and one is associated with a retinal disease state. We developed a platform-independent tool that is able to qualitatively integrate multiple heterogeneous datasets and use different selection criteria to filter each of them, provided the datasets are tables that have distinct identifiers (required) and attributes (optional). With the flexibility to specify data sources and filtering criteria, EnRICH automatically prioritizes candidate genes or gene relationships for biologists based on their specific requirements. Here, we also demonstrate that this tool can be effectively and easily used to apply highly specific user-defined criteria and can efficiently identify high quality candidate genes from relatively sparse datasets.

  1. A multiscale Bayesian data integration approach for mapping air dose rates around the Fukushima Daiichi Nuclear Power Plant.

    PubMed

    Wainwright, Haruko M; Seki, Akiyuki; Chen, Jinsong; Saito, Kimiaki

    2017-02-01

    This paper presents a multiscale data integration method to estimate the spatial distribution of air dose rates in the regional scale around the Fukushima Daiichi Nuclear Power Plant. We integrate various types of datasets, such as ground-based walk and car surveys, and airborne surveys, all of which have different scales, resolutions, spatial coverage, and accuracy. This method is based on geostatistics to represent spatial heterogeneous structures, and also on Bayesian hierarchical models to integrate multiscale, multi-type datasets in a consistent manner. The Bayesian method allows us to quantify the uncertainty in the estimates, and to provide the confidence intervals that are critical for robust decision-making. Although this approach is primarily data-driven, it has great flexibility to include mechanistic models for representing radiation transport or other complex correlations. We demonstrate our approach using three types of datasets collected at the same time over Fukushima City in Japan: (1) coarse-resolution airborne surveys covering the entire area, (2) car surveys along major roads, and (3) walk surveys in multiple neighborhoods. Results show that the method can successfully integrate three types of datasets and create an integrated map (including the confidence intervals) of air dose rates over the domain in high resolution. Moreover, this study provides us with various insights into the characteristics of each dataset, as well as radiocaesium distribution. In particular, the urban areas show high heterogeneity in the contaminant distribution due to human activities as well as large discrepancy among different surveys due to such heterogeneity. Copyright © 2016 Elsevier Ltd. All rights reserved.

  2. SPANG: a SPARQL client supporting generation and reuse of queries for distributed RDF databases.

    PubMed

    Chiba, Hirokazu; Uchiyama, Ikuo

    2017-02-08

    Toward improved interoperability of distributed biological databases, an increasing number of datasets have been published in the standardized Resource Description Framework (RDF). Although the powerful SPARQL Protocol and RDF Query Language (SPARQL) provides a basis for exploiting RDF databases, writing SPARQL code is burdensome for users including bioinformaticians. Thus, an easy-to-use interface is necessary. We developed SPANG, a SPARQL client that has unique features for querying RDF datasets. SPANG dynamically generates typical SPARQL queries according to specified arguments. It can also call SPARQL template libraries constructed in a local system or published on the Web. Further, it enables combinatorial execution of multiple queries, each with a distinct target database. These features facilitate easy and effective access to RDF datasets and integrative analysis of distributed data. SPANG helps users to exploit RDF datasets by generation and reuse of SPARQL queries through a simple interface. This client will enhance integrative exploitation of biological RDF datasets distributed across the Web. This software package is freely available at http://purl.org/net/spang .
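
    The idea of generating a typical SPARQL query from a few arguments, and of pairing queries with distinct target databases, can be sketched as follows. This is a minimal illustration only and does not use or reproduce SPANG's actual interface: the template text, the function names `build_query` and `queries_for_targets`, and the example.org endpoints are all assumptions made for the example.

```python
from string import Template

# Hypothetical query template; the selected variables and the rdfs:label
# pattern are illustrative and are not taken from SPANG's template libraries.
QUERY_TEMPLATE = Template("""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?label
WHERE {
  ?item a <$class_iri> ;
        rdfs:label ?label .
}
LIMIT $limit
""")


def build_query(class_iri: str, limit: int = 10) -> str:
    """Fill the template from arguments, rejecting obviously unsafe input."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Characters that may not appear inside an IRI reference in SPARQL.
    if any(ch in '<>"{}|\\^`' or ch.isspace() for ch in class_iri):
        raise ValueError("class_iri contains characters not allowed in an IRI")
    return QUERY_TEMPLATE.substitute(class_iri=class_iri, limit=limit)


def queries_for_targets(targets: dict, limit: int = 10) -> dict:
    """Build one query per endpoint; `targets` maps endpoint URL -> class IRI."""
    return {endpoint: build_query(iri, limit) for endpoint, iri in targets.items()}


if __name__ == "__main__":
    # Both endpoints and class IRIs below are placeholders.
    plan = queries_for_targets({
        "https://example.org/sparql/genes": "http://example.org/Gene",
        "https://example.org/sparql/proteins": "http://example.org/Protein",
    }, limit=5)
    for endpoint, query in plan.items():
        print(endpoint)
        print(query)
```

    The sketch only builds query strings; sending them to an endpoint is left out on purpose, since that step depends on the client library in use.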

  3. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    NASA Astrophysics Data System (ADS)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

    In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information were linked (e.g. location, date, and instrument). Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.

  4. A Semantic Sensor Web for Environmental Decision Support Applications

    PubMed Central

    Gray, Alasdair J. G.; Sadler, Jason; Kit, Oles; Kyzirakos, Kostis; Karpathiotakis, Manos; Calbimonte, Jean-Paul; Page, Kevin; García-Castro, Raúl; Frazer, Alex; Galpin, Ixent; Fernandes, Alvaro A. A.; Paton, Norman W.; Corcho, Oscar; Koubarakis, Manolis; De Roure, David; Martinez, Kirk; Gómez-Pérez, Asunción

    2011-01-01

    Sensing devices are increasingly being deployed to monitor the physical world around us. One class of application for which sensor data is pertinent is environmental decision support systems, e.g., flood emergency response. For these applications, the sensor readings need to be put in context by integrating them with other sources of data about the surrounding environment. Traditional systems for predicting and detecting floods rely on methods that need significant human resources. In this paper we describe a semantic sensor web architecture for integrating multiple heterogeneous datasets, including live and historic sensor data, databases, and map layers. The architecture provides mechanisms for discovering datasets, defining integrated views over them, continuously receiving data in real-time, and visualising on screen and interacting with the data. Our approach makes extensive use of web service standards for querying and accessing data, and semantic technologies to discover and integrate datasets. We demonstrate the use of our semantic sensor web architecture in the context of a flood response planning web application that uses data from sensor networks monitoring the sea-state around the coast of England. PMID:22164110

  5. Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization

    PubMed Central

    Liu, Jin; Huang, Jian; Ma, Shuangge

    2013-01-01

    Summary In cancer diagnosis studies, high-throughput gene profiling has been extensively conducted, searching for genes whose expressions may serve as markers. Data generated from such studies have the “large d, small n” feature, with the number of genes profiled much larger than the sample size. Penalization has been extensively adopted for simultaneous estimation and marker selection. Because of small sample sizes, markers identified from the analysis of single datasets can be unsatisfactory. A cost-effective remedy is to conduct integrative analysis of multiple heterogeneous datasets. In this article, we investigate composite penalization methods for estimation and marker selection in integrative analysis. The proposed methods use the minimax concave penalty (MCP) as the outer penalty. Under the homogeneity model, the ridge penalty is adopted as the inner penalty. Under the heterogeneity model, the Lasso penalty and MCP are adopted as the inner penalty. Effective computational algorithms based on coordinate descent are developed. Numerical studies, including simulation and analysis of practical cancer datasets, show satisfactory performance of the proposed methods. PMID:24578589
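
    For readers unfamiliar with the minimax concave penalty, the sketch below evaluates the standard MCP and then applies it as an outer penalty to an inner penalty computed across datasets. The MCP formula is the usual textbook definition; the way the outer and inner penalties are combined here (MCP applied to the L1 norm or L2 norm of one gene's coefficients across datasets) is an assumed reading for illustration, not the exact formulation of the article, and the function names are hypothetical.

```python
import math


def mcp(t: float, lam: float, gamma: float) -> float:
    """Minimax concave penalty with regularization lam > 0 and concavity gamma > 1.

    rho(t) = lam*|t| - t^2 / (2*gamma)   if |t| <= gamma*lam
           = gamma * lam^2 / 2           otherwise
    """
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def composite_penalty(betas, lam: float, gamma: float, inner: str = "l1") -> float:
    """Outer MCP applied to an inner norm of one gene's coefficients.

    `betas` holds the coefficients of a single gene across the M datasets.
    inner="l1" sums absolute values (Lasso-type inner penalty, assumed);
    inner="l2" uses the Euclidean norm (ridge-type inner penalty, assumed).
    """
    if inner == "l1":
        inner_value = sum(abs(b) for b in betas)
    elif inner == "l2":
        inner_value = math.sqrt(sum(b * b for b in betas))
    else:
        raise ValueError("inner must be 'l1' or 'l2'")
    return mcp(inner_value, lam, gamma)


if __name__ == "__main__":
    # One gene measured in two datasets, with illustrative tuning values.
    print(composite_penalty([0.5, -0.5], lam=1.0, gamma=3.0, inner="l1"))
    print(composite_penalty([3.0, 4.0], lam=1.0, gamma=3.0, inner="l2"))
```

    The penalty grows like the Lasso near zero and levels off at gamma*lam^2/2, which is why large coefficients are penalized no more heavily than moderately large ones.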

  6. Sequencing Data Discovery and Integration for Earth System Science with MetaSeek

    NASA Astrophysics Data System (ADS)

    Hoarfrost, A.; Brown, N.; Arnosti, C.

    2017-12-01

    Microbial communities play a central role in biogeochemical cycles. Sequencing data resources from environmental sources have grown exponentially in recent years, and represent a singular opportunity to investigate microbial interactions with Earth system processes. Carrying out such meta-analyses depends on our ability to discover and curate sequencing data into large-scale integrated datasets. However, such integration efforts are currently challenging and time-consuming, with sequencing data scattered across multiple repositories and metadata that is not easily or comprehensively searchable. MetaSeek is a sequencing data discovery tool that integrates sequencing metadata from all the major data repositories, allowing the user to search and filter on datasets in a lightweight application with an intuitive, easy-to-use web-based interface. Users can save and share curated datasets, while other users can browse these data integrations or use them as a jumping off point for their own curation. Missing and/or erroneous metadata are inferred automatically where possible, and where not possible, users are prompted to contribute to the improvement of the sequencing metadata pool by correcting and amending metadata errors. Once an integrated dataset has been curated, users can follow simple instructions to download their raw data and quickly begin their investigations. In addition to the online interface, the MetaSeek database is easily queryable via an open API, further enabling users and facilitating integrations of MetaSeek with other data curation tools. This tool lowers the barriers to curation and integration of environmental sequencing data, clearing the path forward to illuminating the ecosystem-scale interactions between biological and abiotic processes.

  7. Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets.

    PubMed

    Salem, Saeed; Ozcaglar, Cagri

    2014-01-01

    Advances in genomic technologies have enabled the accumulation of vast amounts of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression dataset, which suffers from spurious coexpression. We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evidenced by their enrichment with biological process GO terms and KEGG pathways.
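
    The edge weight described above, a linear combination of a topological similarity and a co-appearance similarity between two coexpression links, can be illustrated with a small sketch. The two similarity definitions used here are placeholders chosen for simplicity (shared-gene indicator and Jaccard overlap of the datasets in which each link appears); the abstract does not specify the measures, and the mixing parameter `alpha` and all function names are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_weight(link_a, link_b, datasets_a: set, datasets_b: set,
                  alpha: float = 0.5) -> float:
    """Weight of the edge between two coexpression links in the hybrid graph.

    link_a, link_b : pairs of gene identifiers, e.g. ("g1", "g2")
    datasets_a/_b  : indices of the datasets in which each link is observed
    alpha          : hypothetical mixing weight in [0, 1]
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Placeholder topological similarity: 1 if the links share a gene.
    topological = 1.0 if set(link_a) & set(link_b) else 0.0
    # Placeholder co-appearance similarity: overlap of supporting datasets.
    co_appearance = jaccard(datasets_a, datasets_b)
    return alpha * topological + (1.0 - alpha) * co_appearance


if __name__ == "__main__":
    w = hybrid_weight(("g1", "g2"), ("g2", "g3"), {1, 2, 3}, {2, 3, 4}, alpha=0.5)
    print(w)
```

    With the example inputs the two links share gene g2 and co-occur in two of four datasets, so the weight is 0.5 * 1.0 + 0.5 * 0.5.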

  8. Data integration: Combined imaging and electrophysiology data in the cloud.

    PubMed

    Kini, Lohith G; Davis, Kathryn A; Wagenaar, Joost B

    2016-01-01

    There has been an increasing effort to correlate electrophysiology data with imaging in patients with refractory epilepsy over recent years. IEEG.org provides a free-access, rapidly growing archive of imaging data combined with electrophysiology data and patient metadata. It currently contains over 1200 human and animal datasets, with multiple data modalities associated with each dataset (neuroimaging, EEG, EKG, de-identified clinical and experimental data, etc.). The platform is developed around the concept that scientific data sharing requires a flexible platform that allows sharing of data from multiple file formats. IEEG.org provides high- and low-level access to the data in addition to providing an environment in which domain experts can find, visualize, and analyze data in an intuitive manner. Here, we present a summary of the current infrastructure of the platform, available datasets and goals for the near future. Copyright © 2015 Elsevier Inc. All rights reserved.

  9. Data integration: Combined Imaging and Electrophysiology data in the cloud

    PubMed Central

    Kini, Lohith G.; Davis, Kathryn A.; Wagenaar, Joost B.

    2015-01-01

    There has been an increasing effort to correlate electrophysiology data with imaging in patients with refractory epilepsy over recent years. IEEG.org provides a free-access, rapidly growing archive of imaging data combined with electrophysiology data and patient metadata. It currently contains over 1200 human and animal datasets, with multiple data modalities associated with each dataset (neuroimaging, EEG, EKG, de-identified clinical and experimental data, etc.). The platform is developed around the concept that scientific data sharing requires a flexible platform that allows sharing of data from multiple file-formats. IEEG.org provides high and low-level access to the data in addition to providing an environment in which domain experts can find, visualize, and analyze data in an intuitive manner. Here, we present a summary of the current infrastructure of the platform, available datasets and goals for the near future. PMID:26044858

  10. Improving the discoverability, accessibility, and citability of omics datasets: a case report.

    PubMed

    Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J

    2017-03-01

    Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  11. Multivendor Spectral-Domain Optical Coherence Tomography Dataset, Observer Annotation Performance Evaluation, and Standardized Evaluation Framework for Intraretinal Cystoid Fluid Segmentation.

    PubMed

    Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S; Langs, Georg; Simader, Christian; Waldstein, Sebastian M; Schmidt-Erfurth, Ursula M

    2016-01-01

    Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited due to the availability of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies making them difficult to compare. Thus we present and evaluate a multiple expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprised of 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows a good correlation, thus making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge.

  12. Multivendor Spectral-Domain Optical Coherence Tomography Dataset, Observer Annotation Performance Evaluation, and Standardized Evaluation Framework for Intraretinal Cystoid Fluid Segmentation

    PubMed Central

    Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S.; Langs, Georg; Simader, Christian

    2016-01-01

    Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited due to the availability of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies making them difficult to compare. Thus we present and evaluate a multiple expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprised of 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows a good correlation, thus making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge. PMID:27579177

  13. A formal concept analysis approach to consensus clustering of multi-experiment expression data

    PubMed Central

    2014-01-01

    Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from multi-experiment study examining the global cell-cycle control of fission yeast. 
The FCA results derived from both methods demonstrate that, although both algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals. Conclusions The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce a good-quality clustering solution that is representative of the whole set of expression matrices. PMID:24885407

  14. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics.

    PubMed

    Giambartolomei, Claudia; Vukcevic, Damjan; Schadt, Eric E; Franke, Lude; Hingorani, Aroon D; Wallace, Chris; Plagnol, Vincent

    2014-05-01

    Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). 
Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.

  15. Overview of the HUPO Plasma Proteome Project: Results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Omenn, Gilbert; States, David J.; Adamski, Marcin

    2005-08-13

    HUPO initiated the Plasma Proteome Project (PPP) in 2002. Its pilot phase has (1) evaluated advantages and limitations of many depletion, fractionation, and MS technology platforms; (2) compared PPP reference specimens of human serum and EDTA, heparin, and citrate-anticoagulated plasma; and (3) created a publicly-available knowledge base (www.bioinformatics.med.umich.edu/hupo/ppp; www.ebi.ac.uk/pride). Thirty-five participating laboratories in 13 countries submitted datasets. Working groups addressed (a) specimen stability and protein concentrations; (b) protein identifications from 18 MS/MS datasets; (c) independent analyses from raw MS-MS spectra; (d) search engine performance, subproteome analyses, and biological insights; (e) antibody arrays; and (f) direct MS/SELDI analyses. MS-MS datasets had 15 710 different International Protein Index (IPI) protein IDs; our integration algorithm applied to multiple matches of peptide sequences yielded 9504 IPI proteins identified with one or more peptides and 3020 proteins identified with two or more peptides (the Core Dataset). These proteins have been characterized with Gene Ontology, InterPro, Novartis Atlas, OMIM, and immunoassay based concentration determinations. The database permits examination of many other subsets, such as 1274 proteins identified with three or more peptides. Reverse protein to DNA matching identified proteins for 118 previously unidentified ORFs. We recommend use of plasma instead of serum, with EDTA (or citrate) for anticoagulation. To improve resolution, sensitivity and reproducibility of peptide identifications and protein matches, we recommend combinations of depletion, fractionation, and MS/MS technologies, with explicit criteria for evaluation of spectra, use of search algorithms, and integration of homologous protein matches. This Special Issue of PROTEOMICS presents papers integral to the collaborative analysis plus many reports of supplementary work on various aspects of the PPP workplan. These PPP results on complexity, dynamic range, incomplete sampling, false-positive matches, and integration of diverse datasets for plasma and serum proteins lay a foundation for development and validation of circulating protein biomarkers in health and disease.

  16. Sequential Bayesian Geostatistical Inversion and Evaluation of Combined Data Worth for Aquifer Characterization at the Hanford 300 Area

    NASA Astrophysics Data System (ADS)

    Murakami, H.; Chen, X.; Hahn, M. S.; Over, M. W.; Rockhold, M. L.; Vermeul, V.; Hammond, G. E.; Zachara, J. M.; Rubin, Y.

    2010-12-01

    Subsurface characterization for predicting groundwater flow and contaminant transport requires us to integrate large and diverse datasets in a consistent manner, and quantify the associated uncertainty. In this study, we sequentially assimilated multiple types of datasets for characterizing a three-dimensional heterogeneous hydraulic conductivity field at the Hanford 300 Area. The datasets included constant-rate injection tests, electromagnetic borehole flowmeter tests, lithology profile and tracer tests. We used the method of anchored distributions (MAD), which is a modular-structured Bayesian geostatistical inversion method. MAD has two major advantages over the other inversion methods. First, it can directly infer a joint distribution of parameters, which can be used as an input in stochastic simulations for prediction. In MAD, in addition to typical geostatistical structural parameters, the parameter vector includes multiple point values of the heterogeneous field, called anchors, which capture local trends and reduce uncertainty in the prediction. Second, MAD allows us to integrate the datasets sequentially in a Bayesian framework such that it updates the posterior distribution as each new dataset is included. The sequential assimilation can decrease computational burden significantly. We applied MAD to assimilate different combinations of the datasets, and then compared the inversion results. For the injection and tracer test assimilation, we calculated temporal moments of pressure build-up and breakthrough curves, respectively, to reduce the data dimension. A massively parallel flow and transport code, PFLOTRAN, is used for simulating the tracer test. For comparison, we used different metrics based on the breakthrough curves not used in the inversion, such as mean arrival time, peak concentration and early arrival time. This comparison intends to yield the combined data worth, i.e., which combination of the datasets is the most effective for a certain metric, which will be useful for guiding the further characterization effort at the site and also the future characterization projects at the other sites.
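
    Reducing a breakthrough curve to temporal moments, as mentioned above, can be sketched in a few lines. This is a generic illustration using trapezoidal integration and is not taken from the MAD or PFLOTRAN code; the function names and the sample curve are assumptions, and the mean arrival time is computed as the ratio of the first to the zeroth temporal moment.

```python
def temporal_moment(times, conc, order: int) -> float:
    """k-th temporal moment M_k = integral of t^k * c(t) dt (trapezoidal rule)."""
    if len(times) != len(conc) or len(times) < 2:
        raise ValueError("need two or more matching time and concentration samples")
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        f_prev = (times[i - 1] ** order) * conc[i - 1]
        f_curr = (times[i] ** order) * conc[i]
        total += 0.5 * dt * (f_prev + f_curr)
    return total


def mean_arrival_time(times, conc) -> float:
    """Mean arrival time as the normalized first moment M_1 / M_0."""
    m0 = temporal_moment(times, conc, 0)
    if m0 == 0.0:
        raise ValueError("zeroth moment is zero; no tracer mass recorded")
    return temporal_moment(times, conc, 1) / m0


def peak_concentration(conc) -> float:
    """Largest observed concentration on the curve."""
    return max(conc)


if __name__ == "__main__":
    # Illustrative symmetric breakthrough curve (arbitrary units).
    t = [0.0, 1.0, 2.0, 3.0, 4.0]
    c = [0.0, 1.0, 2.0, 1.0, 0.0]
    print(mean_arrival_time(t, c), peak_concentration(c))
```

    A whole curve is thereby summarized by a handful of numbers, which is the sense in which temporal moments reduce the data dimension.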

  17. Passing messages between biological networks to refine predicted interactions.

    PubMed

    Glass, Kimberly; Huttenhower, Curtis; Quackenbush, John; Yuan, Guo-Cheng

    2013-01-01

    Regulatory network reconstruction is a fundamental problem in computational biology. There are significant limitations to such reconstruction using individual datasets, and increasingly people attempt to construct networks using multiple, independent datasets obtained from complementary sources, but methods for this integration are lacking. We developed PANDA (Passing Attributes between Networks for Data Assimilation), a message-passing model using multiple sources of information to predict regulatory relationships, and used it to integrate protein-protein interaction, gene expression, and sequence motif data to reconstruct genome-wide, condition-specific regulatory networks in yeast as a model. The resulting networks were not only more accurate than those produced using individual data sets and other existing methods, but they also captured information regarding specific biological mechanisms and pathways that were missed using other methodologies. PANDA is scalable to higher eukaryotes, applicable to specific tissue or cell type data and conceptually generalizable to include a variety of regulatory, interaction, expression, and other genome-scale data. An implementation of the PANDA algorithm is available at www.sourceforge.net/projects/panda-net.

  18. CHiCP: a web-based tool for the integrative and interactive visualization of promoter capture Hi-C datasets.

    PubMed

    Schofield, E C; Carver, T; Achuthan, P; Freire-Pritchett, P; Spivakov, M; Todd, J A; Burren, O S

    2016-08-15

    Promoter capture Hi-C (PCHi-C) allows the genome-wide interrogation of physical interactions between distal DNA regulatory elements and gene promoters in multiple tissue contexts. Visual integration of the resultant chromosome interaction maps with other sources of genomic annotations can provide insight into underlying regulatory mechanisms. We have developed Capture HiC Plotter (CHiCP), a web-based tool that allows interactive exploration of PCHi-C interaction maps and integration with both public and user-defined genomic datasets. CHiCP is freely accessible from www.chicp.org and supports most major HTML5 compliant web browsers. Full source code and installation instructions are available from http://github.com/D-I-L/django-chicp. Contact: ob219@cam.ac.uk. © The Author 2016. Published by Oxford University Press. All rights reserved.

  19. Systems Medicine: The Future of Medical Genomics, Healthcare, and Wellness.

    PubMed

    Saqi, Mansoor; Pellet, Johann; Roznovat, Irina; Mazein, Alexander; Ballereau, Stéphane; De Meulder, Bertrand; Auffray, Charles

    2016-01-01

    Recent advances in genomics have led to the rapid and relatively inexpensive collection of patient molecular data including multiple types of omics data. The integration of these data with clinical measurements has the potential to impact on our understanding of the molecular basis of disease and on disease management. Systems medicine is an approach to understanding disease through an integration of large patient datasets. It offers the possibility for personalized strategies for healthcare through the development of a new taxonomy of disease. Advanced computing will be an important component in effectively implementing systems medicine. In this chapter we describe three computational challenges associated with systems medicine: disease subtype discovery using integrated datasets, obtaining a mechanistic understanding of disease, and the development of an informatics platform for the mining, analysis, and visualization of data emerging from translational medicine studies.

  20. HPC AND GRID COMPUTING FOR INTEGRATIVE BIOMEDICAL RESEARCH

    PubMed Central

    Kurc, Tahsin; Hastings, Shannon; Kumar, Vijay; Langella, Stephen; Sharma, Ashish; Pan, Tony; Oster, Scott; Ervin, David; Permar, Justin; Narayanan, Sivaramakrishnan; Gil, Yolanda; Deelman, Ewa; Hall, Mary; Saltz, Joel

    2010-01-01

    Integrative biomedical research projects query, analyze, and integrate many different data types and make use of datasets obtained from measurements or simulations of structure and function at multiple biological scales. With the increasing availability of high-throughput and high-resolution instruments, the integrative biomedical research imposes many challenging requirements on software middleware systems. In this paper, we look at some of these requirements using example research pattern templates. We then discuss how middleware systems, which incorporate Grid and high-performance computing, could be employed to address the requirements. PMID:20107625

  1. graph-GPA: A graphical model for prioritizing GWAS results and investigating pleiotropic architecture.

    PubMed

    Chung, Dongjun; Kim, Hang J; Zhao, Hongyu

    2017-02-01

    Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with hundreds of phenotypes and diseases, which have provided clinical and medical benefits to patients with novel biomarkers and therapeutic targets. However, identification of risk variants associated with complex diseases remains challenging as they are often affected by many genetic variants with small or moderate effects. There has been accumulating evidence suggesting that different complex traits share common risk basis, namely pleiotropy. Recently, several statistical methods have been developed to improve statistical power to identify risk variants for complex traits through a joint analysis of multiple GWAS datasets by leveraging pleiotropy. While these methods were shown to improve statistical power for association mapping compared to separate analyses, they are still limited in the number of phenotypes that can be integrated. In order to address this challenge, in this paper, we propose a novel statistical framework, graph-GPA, to integrate a large number of GWAS datasets for multiple phenotypes using a hidden Markov random field approach. Application of graph-GPA to a joint analysis of GWAS datasets for 12 phenotypes shows that graph-GPA improves statistical power to identify risk variants compared to statistical methods based on smaller number of GWAS datasets. In addition, graph-GPA also promotes better understanding of genetic mechanisms shared among phenotypes, which can potentially be useful for the development of improved diagnosis and therapeutics. The R implementation of graph-GPA is currently available at https://dongjunchung.github.io/GGPA/.

  2. Causes and Consequences of Genetic Background Effects Illuminated by Integrative Genomic Analysis

    PubMed Central

    Chandler, Christopher H.; Chari, Sudarshan; Dworkin, Ian

    2014-01-01

    The phenotypic consequences of individual mutations are modulated by the wild-type genetic background in which they occur. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist or about the mechanisms underlying it. We also lack knowledge on how mutations interact with genetic background to influence gene expression and how this in turn mediates mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scallopedE3 allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets from a mapping-by-introgression experiment and a tagged RNA gene expression dataset. In addition we used whole genome resequencing of the parental lines—two commonly used laboratory strains—to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutation screen. By searching for genes showing a congruent signal across multiple datasets, we were able to identify a robust set of candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers previously reported are caused by higher-order epistasis, not quantitative noncomplementation. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well. PMID:24504186

  3. A Novel Methodology for Improving Plant Pest Surveillance in Vineyards and Crops Using UAV-Based Hyperspectral and Spatial Data.

    PubMed

    Vanegas, Fernando; Bratanov, Dmitry; Powell, Kevin; Weiss, John; Gonzalez, Felipe

    2018-01-17

    Recent advances in remote sensed imagery and geospatial image processing using unmanned aerial vehicles (UAVs) have enabled the rapid and ongoing development of monitoring tools for crop management and the detection/surveillance of insect pests. This paper describes a (UAV) remote sensing-based methodology to increase the efficiency of existing surveillance practices (human inspectors and insect traps) for detecting pest infestations (e.g., grape phylloxera in vineyards). The methodology uses a UAV integrated with advanced digital hyperspectral, multispectral, and RGB sensors. We implemented the methodology for the development of a predictive model for phylloxera detection. In this method, we explore the combination of airborne RGB, multispectral, and hyperspectral imagery with ground-based data at two separate time periods and under different levels of phylloxera infestation. We describe the technology used (the sensors, the UAV, and the flight operations), the processing workflow of the datasets from each imagery type, and the methods for combining multiple airborne with ground-based datasets. Finally, we present relevant results of correlation between the different processed datasets. The objective of this research is to develop a novel methodology for collecting, processing, analysing and integrating multispectral, hyperspectral, ground and spatial data to remote sense different variables in different applications, such as, in this case, plant pest surveillance. The development of such methodology would provide researchers, agronomists, and UAV practitioners reliable data collection protocols and methods to achieve faster processing techniques and integrate multiple sources of data in diverse remote sensing applications.

  4. A sampling-based method for ranking protein structural models by integrating multiple scores and features.

    PubMed

    Shi, Xiaohu; Zhang, Jingfen; He, Zhiquan; Shang, Yi; Xu, Dong

    2011-09-01

    One of the major challenges in protein tertiary structure prediction is structure quality assessment. In many cases, protein structure prediction tools generate good structural models, but fail to select the best models from a huge number of candidates as the final output. In this study, we developed a sampling-based machine-learning method to rank protein structural models by integrating multiple scores and features. First, features such as predicted secondary structure, solvent accessibility and residue-residue contact information are integrated by two Radial Basis Function (RBF) models trained from different datasets. Then, the two RBF scores and five selected scoring functions developed by others, i.e., Opus-CA, Opus-PSP, DFIRE, RAPDF, and Cheng Score, are synthesized by a sampling method. Finally, another integrated RBF model ranks the structural models according to the features of the sampling distribution. We tested the proposed method using two different datasets, including the CASP server prediction models of all CASP8 targets and a set of models generated by our in-house software MUFOLD. The test results show that our method outperforms any individual scoring function in both best model selection and overall correlation between the predicted ranking and the actual ranking of structural quality.

  5. Integrative Analysis of Transcription Factor Combinatorial Interactions Using a Bayesian Tensor Factorization Approach

    PubMed Central

    Ye, Yusen; Gao, Lin; Zhang, Shihua

    2017-01-01

    Transcription factors play a key role in transcriptional regulation of genes and determination of cellular identity through combinatorial interactions. However, current studies of combinatorial regulation are limited by the lack of experimental data from the same cellular environment and by pervasive data noise. Here, we adopt a Bayesian CANDECOMP/PARAFAC (CP) factorization approach (BCPF) to integrate multiple datasets in a network paradigm for determining precise TF interaction landscapes. In our first application, we apply BCPF to integrate three networks, each built from diverse datasets of multiple cell lines from ENCODE, to predict a global and precise TF interaction network. This network gives 38 novel TF interactions with distinct biological functions. In our second application, we apply BCPF to seven types of cell type TF regulatory networks and predict seven cell lineage TF interaction networks, respectively. By further exploring their dynamics and modularity, we find that cell lineage-specific hub TFs participate in cell type or lineage-specific regulation by interacting with non-specific TFs. Furthermore, we illustrate the biological function of hub TFs by taking those of cancer lineage and blood lineage as examples. Taken together, our integrative analysis can reveal a more precise and extensive description of human TF combinatorial interactions. PMID:29033978

  6. Integrative Analysis of Transcription Factor Combinatorial Interactions Using a Bayesian Tensor Factorization Approach.

    PubMed

    Ye, Yusen; Gao, Lin; Zhang, Shihua

    2017-01-01

    Transcription factors play a key role in transcriptional regulation of genes and determination of cellular identity through combinatorial interactions. However, current studies of combinatorial regulation are limited by the lack of experimental data from the same cellular environment and by pervasive data noise. Here, we adopt a Bayesian CANDECOMP/PARAFAC (CP) factorization approach (BCPF) to integrate multiple datasets in a network paradigm for determining precise TF interaction landscapes. In our first application, we apply BCPF to integrate three networks, each built from diverse datasets of multiple cell lines from ENCODE, to predict a global and precise TF interaction network. This network gives 38 novel TF interactions with distinct biological functions. In our second application, we apply BCPF to seven types of cell type TF regulatory networks and predict seven cell lineage TF interaction networks, respectively. By further exploring their dynamics and modularity, we find that cell lineage-specific hub TFs participate in cell type or lineage-specific regulation by interacting with non-specific TFs. Furthermore, we illustrate the biological function of hub TFs by taking those of cancer lineage and blood lineage as examples. Taken together, our integrative analysis can reveal a more precise and extensive description of human TF combinatorial interactions.

  7. Seeking unique and common biological themes in multiple gene lists or datasets: pathway pattern extraction pipeline for pathway-level comparative analysis.

    PubMed

    Yi, Ming; Mudunuri, Uma; Che, Anney; Stephens, Robert M

    2009-06-29

    One of the challenges in the analysis of microarray data is to integrate and compare the selected (e.g., differential) gene lists from multiple experiments for common or unique underlying biological themes. A common way to approach this problem is to extract common genes from these gene lists and then subject these genes to enrichment analysis to reveal the underlying biology. However, the capacity of this approach is largely restricted by the limited number of common genes shared by datasets from multiple experiments, which could be caused by the complexity of the biological system itself. We now introduce a new Pathway Pattern Extraction Pipeline (PPEP), which extends the existing WPS application by providing a new pathway-level comparative analysis scheme. To facilitate comparing and correlating results from different studies and sources, PPEP contains new interfaces that allow evaluation of the pathway-level enrichment patterns across multiple gene lists. As an exploratory tool, this analysis pipeline may help reveal the underlying biological themes at both the pathway and gene levels. The analysis scheme provided by PPEP begins with multiple gene lists, which may be derived from different studies in terms of the biological contexts, applied technologies, or methodologies. These lists are then subjected to pathway-level comparative analysis for extraction of pathway-level patterns. This analysis pipeline helps to explore the commonality or uniqueness of these lists at the level of pathways or biological processes from different but relevant biological systems using a combination of statistical enrichment measurements, pathway-level pattern extraction, and graphical display of the relationships of genes and their associated pathways as Gene-Term Association Networks (GTANs) within the WPS platform. As a proof of concept, we have used the new method to analyze many datasets from our collaborators as well as some public microarray datasets. 
This tool provides a new pathway-level analysis scheme for integrative and comparative analysis of data derived from different but relevant systems. The tool is freely available as a Pathway Pattern Extraction Pipeline implemented in our existing software package WPS, which can be obtained at http://www.abcc.ncifcrf.gov/wps/wps_index.php.
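
    The idea of comparing gene lists at the pathway level rather than by shared genes can be sketched as follows. This is not the PPEP or WPS code: the hypergeometric test stands in for whatever enrichment statistic the tool uses, and the gene names, pathway definitions and function names are hypothetical. The toy data are built so that the two lists share no genes yet show the same enrichment for one pathway.

```python
from math import comb

def enrichment_p(universe_size, pathway_size, list_size, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap)."""
    upper = min(pathway_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def pathway_pattern(gene_lists, pathways, universe):
    """Return {pathway: {list_name: p-value}} across all gene lists."""
    pattern = {}
    for pw_name, pw_genes in pathways.items():
        pw = set(pw_genes) & universe
        row = {}
        for list_name, genes in gene_lists.items():
            gl = set(genes) & universe
            row[list_name] = enrichment_p(len(universe), len(pw), len(gl), len(pw & gl))
        pattern[pw_name] = row
    return pattern

# Hypothetical universe of 20 genes and two pathways.
universe = {f"g{i}" for i in range(20)}
pathways = {
    "pathway_A": [f"g{i}" for i in range(0, 6)],
    "pathway_B": [f"g{i}" for i in range(10, 15)],
}
# Two gene lists with no gene in common.
gene_lists = {
    "study_1": ["g0", "g1", "g2", "g15"],
    "study_2": ["g3", "g4", "g5", "g16"],
}

pattern = pathway_pattern(gene_lists, pathways, universe)
for pw_name, row in pattern.items():
    print(pw_name, row)
```

    A gene-level intersection of the two lists would be empty here, while the pathway-level pattern still exposes what they have in common.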

  8. Large-scale atlas of microarray data reveals biological landscape of gene expression in Arabidopsis

    USDA-ARS's Scientific Manuscript database

    Transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metad...

  9. Functional Connectivity in Multiple Cortical Networks Is Associated with Performance Across Cognitive Domains in Older Adults.

    PubMed

    Shaw, Emily E; Schultz, Aaron P; Sperling, Reisa A; Hedden, Trey

    2015-10-01

    Intrinsic functional connectivity MRI has become a widely used tool for measuring integrity in large-scale cortical networks. This study examined multiple cortical networks using Template-Based Rotation (TBR), a method that applies a priori network and nuisance component templates defined from an independent dataset to test datasets of interest. A priori templates were applied to a test dataset of 276 older adults (ages 65-90) from the Harvard Aging Brain Study to examine the relationship between multiple large-scale cortical networks and cognition. Factor scores derived from neuropsychological tests represented processing speed, executive function, and episodic memory. Resting-state BOLD data were acquired in two 6-min acquisitions on a 3-Tesla scanner and processed with TBR to extract individual-level metrics of network connectivity in multiple cortical networks. All results controlled for data quality metrics, including motion. Connectivity in multiple large-scale cortical networks was positively related to all cognitive domains, with a composite measure of general connectivity positively associated with general cognitive performance. Controlling for the correlations between networks, the frontoparietal control network (FPCN) and executive function demonstrated the only significant association, suggesting specificity in this relationship. Further analyses found that the FPCN mediated the relationships of the other networks with cognition, suggesting that this network may play a central role in understanding individual variation in cognition during aging.

  10. Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets

    PubMed Central

    2014-01-01

    Background: Advances in genomic technologies have enabled the accumulation of vast amounts of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression dataset, which suffers from spurious coexpression. Results: We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evidenced by their enrichment with biological process GO terms and KEGG pathways. PMID:25221624
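
    The edge weight described in this abstract (a linear combination of two similarities between coexpression links) can be sketched as below. The abstract does not define either similarity, so both are stand-ins chosen for illustration: topological similarity is taken as the Jaccard index of the links' gene endpoints, and co-appearance similarity as the Jaccard index of the sets of datasets in which each link occurs. The mixing parameter alpha, the gene names and the function names are likewise hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hybrid_weight(link_u, link_v, appearances, alpha=0.5):
    """alpha * topological similarity + (1 - alpha) * co-appearance similarity."""
    topo = jaccard(link_u, link_v)                              # shared gene endpoints
    coapp = jaccard(appearances[link_u], appearances[link_v])   # shared datasets
    return alpha * topo + (1 - alpha) * coapp

def build_hybrid_graph(appearances, alpha=0.5):
    """Nodes are coexpression links; keep only edges with positive weight."""
    graph = {}
    for u, v in combinations(sorted(appearances), 2):
        w = hybrid_weight(u, v, appearances, alpha)
        if w > 0:
            graph[(u, v)] = w
    return graph

# Hypothetical coexpression links (gene pairs) and the dataset indices
# in which each link was observed.
appearances = {
    ("A", "B"): {0, 1, 2},
    ("B", "C"): {1, 2, 3},
    ("D", "E"): {0},
}

graph = build_hybrid_graph(appearances)
for edge, weight in graph.items():
    print(edge, round(weight, 4))
```

    Any graph clustering method could then be applied to this weighted graph to group links into candidate modules.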

  11. The NCBI BioSystems database.

    PubMed

    Geer, Lewis Y; Marchler-Bauer, Aron; Geer, Renata C; Han, Lianyi; He, Jane; He, Siqian; Liu, Chunlei; Shi, Wenyao; Bryant, Stephen H

    2010-01-01

    The NCBI BioSystems database, found at http://www.ncbi.nlm.nih.gov/biosystems/, centralizes and cross-links existing biological systems databases, increasing their utility and target audience by integrating their pathways and systems into NCBI resources. This integration allows users of NCBI's Entrez databases to quickly categorize proteins, genes and small molecules by metabolic pathway, disease state or other BioSystem type, without requiring time-consuming inference of biological relationships from the literature or multiple experimental datasets.

  12. A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa

    PubMed Central

    Petegrosso, Raphael; Tolar, Jakub

    2018-01-01

    Single-cell RNA sequencing (scRNA-seq) has been widely applied to discover new cell types by detecting sub-populations in a heterogeneous group of cells. Since scRNA-seq experiments have lower read coverage/tag counts and introduce more technical biases compared to bulk RNA-seq experiments, the limited number of sampled cells combined with the experimental biases and other dataset specific variations presents a challenge to cross-dataset analysis and discovery of relevant biological variations across multiple cell populations. In this paper, we introduce a method of variance-driven multitask clustering of single-cell RNA-seq data (scVDMC) that utilizes multiple single-cell populations from biological replicates or different samples. scVDMC clusters single cells in multiple scRNA-seq experiments of similar cell types and markers but varying expression patterns such that the scRNA-seq data are better integrated than typical pooled analyses which only increase the sample size. By controlling the variance among the cell clusters within each dataset and across all the datasets, scVDMC detects cell sub-populations in each individual experiment with shared cell-type markers but varying cluster centers among all the experiments. Applied to two real scRNA-seq datasets with several replicates and one large-scale droplet-based dataset on three patient samples, scVDMC more accurately detected cell populations and known cell markers than pooled clustering and other recently proposed scRNA-seq clustering methods. In the case study applied to in-house Recessive Dystrophic Epidermolysis Bullosa (RDEB) scRNA-seq data, scVDMC revealed several new cell types and unknown markers validated by flow cytometry. MATLAB/Octave code available at https://github.com/kuanglab/scVDMC. PMID:29630593

  13. PASTA for Proteins.

    PubMed

    Collins, Kodi; Warnow, Tandy

    2018-06-19

    PASTA is a multiple sequence alignment method that uses divide-and-conquer plus iteration to enable base alignment methods to scale with high accuracy to large sequence datasets. By default, PASTA included MAFFT L-INS-i; our new extension of PASTA enables the use of MAFFT G-INS-i, MAFFT Homologs, CONTRAlign, and ProbCons. We analyzed the performance of each base method and PASTA using these base methods on 224 datasets from BAliBASE 4 with at least 50 sequences. We show that PASTA enables the most accurate base methods to scale to larger datasets at reduced computational effort, and generally improves alignment and tree accuracy on the largest BAliBASE datasets. PASTA is available at https://github.com/kodicollins/pasta and has also been integrated into the original PASTA repository at https://github.com/smirarab/pasta. Supplementary data are available at Bioinformatics online.

  14. MetaMeta: integrating metagenome analysis tools to improve taxonomic profiling.

    PubMed

    Piro, Vitor C; Matschkowski, Marcel; Renard, Bernhard Y

    2017-08-14

    Many metagenome analysis tools are presently available to classify sequences and profile environmental samples. In particular, taxonomic profiling and binning methods are commonly used for such tasks. Tools in these two categories make use of several techniques, e.g., read mapping, k-mer alignment, and composition analysis. Variations on the construction of the corresponding reference sequence databases are also common. In addition, different tools provide good results in different datasets and configurations. All this variation makes it difficult for researchers to decide which methods to use. Installation, configuration and execution can also be difficult, especially when dealing with multiple datasets and tools. We propose MetaMeta: a pipeline to execute and integrate results from metagenome analysis tools. MetaMeta provides an easy workflow to run multiple tools with multiple samples, producing a single enhanced output profile for each sample. MetaMeta includes database generation, pre-processing, execution, and integration steps, allowing easy execution and parallelization. The integration relies on the co-occurrence of organisms from different methods as the main feature to improve community profiling while accounting for differences in their databases. In a controlled case with simulated and real data, we show that the integrated profiles of MetaMeta outperform the best single profile. Using the same input data, it provides more sensitive and reliable results, with the presence of each organism being supported by several methods. MetaMeta uses Snakemake and has six pre-configured tools, all available on the BioConda channel for easy installation (conda install -c bioconda metameta). The MetaMeta pipeline is open-source and can be downloaded at: https://gitlab.com/rki_bioinformatics.
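
    The co-occurrence idea in this abstract, where an organism is retained when several methods report it, can be sketched with a simple support count. This is a simplification for illustration and not MetaMeta's actual scoring, which also accounts for differences between the tools' databases. The tool names, taxa, abundances, the min_support threshold and the function name are hypothetical.

```python
from collections import Counter

def integrate_profiles(profiles, min_support=2):
    """Keep taxa reported by at least min_support tools.

    Abundance is the mean over the tools that reported the taxon,
    renormalised so the integrated profile sums to 1.
    """
    support = Counter()
    for taxa in profiles.values():
        support.update(set(taxa))

    kept = {}
    for taxon, n_tools in support.items():
        if n_tools >= min_support:
            values = [p[taxon] for p in profiles.values() if taxon in p]
            kept[taxon] = sum(values) / len(values)

    total = sum(kept.values())
    return {taxon: value / total for taxon, value in kept.items()} if total else {}

# Hypothetical relative-abundance profiles from three tools for one sample.
profiles = {
    "tool_a": {"E. coli": 0.6, "B. subtilis": 0.4},
    "tool_b": {"E. coli": 0.5, "B. subtilis": 0.3, "S. aureus": 0.2},
    "tool_c": {"E. coli": 0.7, "P. putida": 0.3},
}

integrated = integrate_profiles(profiles, min_support=2)
for taxon, abundance in sorted(integrated.items()):
    print(taxon, round(abundance, 4))
```

    Taxa reported by a single tool are dropped, which trades some sensitivity for fewer false positives; lowering min_support reverses that trade.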

  15. A Novel Methodology for Improving Plant Pest Surveillance in Vineyards and Crops Using UAV-Based Hyperspectral and Spatial Data

    PubMed Central

    Vanegas, Fernando; Weiss, John; Gonzalez, Felipe

    2018-01-01

    Recent advances in remote sensed imagery and geospatial image processing using unmanned aerial vehicles (UAVs) have enabled the rapid and ongoing development of monitoring tools for crop management and the detection/surveillance of insect pests. This paper describes a (UAV) remote sensing-based methodology to increase the efficiency of existing surveillance practices (human inspectors and insect traps) for detecting pest infestations (e.g., grape phylloxera in vineyards). The methodology uses a UAV integrated with advanced digital hyperspectral, multispectral, and RGB sensors. We implemented the methodology for the development of a predictive model for phylloxera detection. In this method, we explore the combination of airborne RGB, multispectral, and hyperspectral imagery with ground-based data at two separate time periods and under different levels of phylloxera infestation. We describe the technology used—the sensors, the UAV, and the flight operations—the processing workflow of the datasets from each imagery type, and the methods for combining multiple airborne with ground-based datasets. Finally, we present relevant results of correlation between the different processed datasets. The objective of this research is to develop a novel methodology for collecting, processing, analysing and integrating multispectral, hyperspectral, ground and spatial data to remote sense different variables in different applications, such as, in this case, plant pest surveillance. The development of such methodology would provide researchers, agronomists, and UAV practitioners reliable data collection protocols and methods to achieve faster processing techniques and integrate multiple sources of data in diverse remote sensing applications. PMID:29342101

  16. Precise Network Modeling of Systems Genetics Data Using the Bayesian Network Webserver.

    PubMed

    Ziebarth, Jesse D; Cui, Yan

    2017-01-01

    The Bayesian Network Webserver (BNW, http://compbio.uthsc.edu/BNW) is an integrated platform for Bayesian network modeling of biological datasets. It provides a web-based network modeling environment that seamlessly integrates advanced algorithms for probabilistic causal modeling and reasoning with Bayesian networks. BNW is designed for precise modeling of relatively small networks that contain fewer than 20 nodes. The structure learning algorithms used by BNW guarantee the discovery of the best (most probable) network structure given the data. To facilitate network modeling across multiple biological levels, BNW provides a very flexible interface that allows users to assign network nodes into different tiers and define the relationships between and within the tiers. This function is particularly useful for modeling systems genetics datasets that often consist of multiscalar heterogeneous genotype-to-phenotype data. BNW enables users to, within seconds or minutes, go from having a simply formatted input file containing a dataset to using a network model to make predictions about the interactions between variables and the potential effects of experimental interventions. In this chapter, we will introduce the functions of BNW and show how to model systems genetics datasets with BNW.

  17. Integrating multiple analytical datasets to compare metabolite profiles of mouse colonic-cecal contents and feces

    USDA-ARS's Scientific Manuscript database

    The pattern of metabolites produced by the gut microbiome comprises a phenotype indicative of the means by which that microbiome affects the gut. We characterized that phenotype in mice by conducting metabolomic analyses of the colonic-cecal contents, comparing that to the metabolite patterns of fec...

  18. JOINT AND INDIVIDUAL VARIATION EXPLAINED (JIVE) FOR INTEGRATED ANALYSIS OF MULTIPLE DATA TYPES.

    PubMed

    Lock, Eric F; Hoadley, Katherine A; Marron, J S; Nobel, Andrew B

    2013-03-01

    Research in several fields now requires the analysis of datasets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data, and provides new directions for the visual exploration of joint and individual structure. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types.
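
    The three-term decomposition described in this abstract can be written compactly as follows. The notation is an assumption made for illustration (the abstract gives no symbols): X_i is the data matrix for data type i, with rows as variables and columns as the common set of objects.

```latex
% Notation assumed for illustration; k data types measured on n common objects.
% X_i is p_i x n: p_i variables for data type i.
\[
  X_i \;=\; J_i \;+\; A_i \;+\; E_i , \qquad i = 1, \dots, k ,
\]
\[
  \operatorname{rank}\!\begin{pmatrix} J_1 \\ \vdots \\ J_k \end{pmatrix} = r ,
  \qquad
  \operatorname{rank}(A_i) = r_i .
\]
% J_i : joint structure, low rank across the stacked data types
% A_i : structured variation individual to data type i
% E_i : residual noise
```

    With a single data type and no individual term, the decomposition reduces to a rank-r approximation of X_1, which is the sense in which the method extends Principal Component Analysis.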

  19. Passing Messages between Biological Networks to Refine Predicted Interactions

    PubMed Central

    Glass, Kimberly; Huttenhower, Curtis; Quackenbush, John; Yuan, Guo-Cheng

    2013-01-01

    Regulatory network reconstruction is a fundamental problem in computational biology. There are significant limitations to such reconstruction using individual datasets, and increasingly people attempt to construct networks using multiple, independent datasets obtained from complementary sources, but methods for this integration are lacking. We developed PANDA (Passing Attributes between Networks for Data Assimilation), a message-passing model using multiple sources of information to predict regulatory relationships, and used it to integrate protein-protein interaction, gene expression, and sequence motif data to reconstruct genome-wide, condition-specific regulatory networks in yeast as a model. The resulting networks were not only more accurate than those produced using individual data sets and other existing methods, but they also captured information regarding specific biological mechanisms and pathways that were missed using other methodologies. PANDA is scalable to higher eukaryotes, applicable to specific tissue or cell type data and conceptually generalizable to include a variety of regulatory, interaction, expression, and other genome-scale data. An implementation of the PANDA algorithm is available at www.sourceforge.net/projects/panda-net. PMID:23741402

  20. An Intelligent Polar Cyberinfrastructure to Support Spatiotemporal Decision Making

    NASA Astrophysics Data System (ADS)

    Song, M.; Li, W.; Zhou, X.

    2014-12-01

    In the era of big data, the polar sciences face an urgent demand for intelligent approaches to support precise and effective spatiotemporal decision-making. Service-oriented cyberinfrastructure has the advantages of seamlessly integrating distributed computing resources and aggregating a variety of geospatial data derived from Earth observation networks. This paper focuses on building a smart service-oriented cyberinfrastructure to support intelligent question answering related to polar datasets. The innovation of this polar cyberinfrastructure includes: (1) a problem-solving environment that parses geospatial questions in natural language, builds geoprocessing rules, composes atomic processing services and executes the entire workflow; (2) a self-adaptive spatiotemporal filter that is capable of refining query constraints through semantic analysis; (3) a dynamic visualization strategy to support results animation and statistics in multiple spatial reference systems; and (4) a user-friendly online portal to support collaborative decision-making. By means of this polar cyberinfrastructure, we intend to facilitate integration of distributed and heterogeneous Arctic datasets and comprehensive analysis of multiple environmental elements (e.g. snow, ice, permafrost) to provide a better understanding of the environmental variation in circumpolar regions.

  1. The NCBI BioSystems database

    PubMed Central

    Geer, Lewis Y.; Marchler-Bauer, Aron; Geer, Renata C.; Han, Lianyi; He, Jane; He, Siqian; Liu, Chunlei; Shi, Wenyao; Bryant, Stephen H.

    2010-01-01

    The NCBI BioSystems database, found at http://www.ncbi.nlm.nih.gov/biosystems/, centralizes and cross-links existing biological systems databases, increasing their utility and target audience by integrating their pathways and systems into NCBI resources. This integration allows users of NCBI’s Entrez databases to quickly categorize proteins, genes and small molecules by metabolic pathway, disease state or other BioSystem type, without requiring time-consuming inference of biological relationships from the literature or multiple experimental datasets. PMID:19854944

  2. Integrated Strategy Improves the Prediction Accuracy of miRNA in Large Dataset

    PubMed Central

    Lipps, David; Devineni, Sree

    2016-01-01

    MiRNAs are short non-coding RNAs of about 22 nucleotides, which play critical roles in gene expression regulation. The biogenesis of miRNAs is largely determined by the sequence and structural features of their parental RNA molecules. Based on these features, multiple computational tools have been developed to predict whether RNA transcripts contain miRNAs. Although very successful, these predictors have started to face multiple challenges in recent years. Many predictors were optimized using datasets of hundreds of miRNA samples. The sizes of these datasets are much smaller than the number of known miRNAs. Consequently, the prediction accuracy of these predictors on large datasets is unknown and needs to be re-tested. In addition, many predictors were optimized for either high sensitivity or high specificity. These optimization strategies may bring serious limitations in applications. Moreover, to meet continuously rising expectations of these computational tools, improving the prediction accuracy becomes extremely important. In this study, a meta-predictor, mirMeta, was developed by integrating a set of non-linear transformations with a meta-strategy. More specifically, the outputs of five individual predictors were first preprocessed using non-linear transformations, and then fed into an artificial neural network to make the meta-prediction. The prediction accuracy of the meta-predictor was validated using both multi-fold cross-validation and an independent dataset. The final accuracy of the meta-predictor on a newly designed large dataset is improved by 7% to 93%. The meta-predictor is also shown to be less dependent on datasets and to have a refined balance between sensitivity and specificity. This study is important in two respects. First, it shows that the combination of non-linear transformations and artificial neural networks improves the prediction accuracy of individual predictors. Second, a new miRNA predictor with significantly improved prediction accuracy is developed for the community for identifying novel miRNAs and the complete set of miRNAs. Source code is available at: https://github.com/xueLab/mirMeta PMID:28002428

  3. An R package for the integrated analysis of metabolomics and spectral data.

    PubMed

    Costa, Christopher; Maraschin, Marcelo; Rocha, Miguel

    2016-06-01

    Recently, there has been a growing interest in the field of metabolomics, materialized by a remarkable growth in experimental techniques, available data and related biological applications. Indeed, techniques such as nuclear magnetic resonance, gas or liquid chromatography, mass spectrometry, infrared and UV-visible spectroscopies have provided extensive datasets that can help in tasks such as biological and biomedical discovery, biotechnology and drug development. However, as happens with other omics data, the analysis of metabolomics datasets poses multiple challenges, both in terms of methodologies and in the development of appropriate computational tools. Indeed, of the available software tools, none addresses the multiplicity of existing techniques and data analysis tasks. In this work, we make available a novel R package, named specmine, which provides a set of methods for metabolomics data analysis, including data loading in different formats, pre-processing, metabolite identification, univariate and multivariate data analysis, machine learning, and feature selection. Importantly, the implemented methods provide adequate support for the analysis of data from diverse experimental techniques, integrating a large set of functions from several R packages in a powerful, yet simple to use environment. The package, already available in CRAN, is accompanied by a web site where users can deposit datasets, scripts and analysis reports to be shared with the community, promoting the efficient sharing of metabolomics data analysis pipelines. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  4. metaseq: a Python package for integrative genome-wide analysis reveals relationships between chromatin insulators and associated nuclear mRNA.

    PubMed

    Dale, Ryan K; Matzat, Leah H; Lei, Elissa P

    2014-08-01

    Here we introduce metaseq, a software library written in Python, which enables loading multiple genomic data formats into standard Python data structures and allows flexible, customized manipulation and visualization of data from high-throughput sequencing studies. We demonstrate its practical use by analyzing multiple datasets related to chromatin insulators, which are DNA-protein complexes proposed to organize the genome into distinct transcriptional domains. Recent studies in Drosophila and mammals have implicated RNA in the regulation of chromatin insulator activities. Moreover, the Drosophila RNA-binding protein Shep has been shown to antagonize gypsy insulator activity in a tissue-specific manner, but the precise role of RNA in this process remains unclear. Better understanding of chromatin insulator regulation requires integration of multiple datasets, including those from chromatin-binding, RNA-binding, and gene expression experiments. We use metaseq to integrate RIP- and ChIP-seq data for Shep and the core gypsy insulator protein Su(Hw) in two different cell types, along with publicly available ChIP-chip and RNA-seq data. Based on the metaseq-enabled analysis presented here, we propose a model where Shep associates with chromatin cotranscriptionally, then is recruited to insulator complexes in trans where it plays a negative role in insulator activity. Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  5. Harnessing Diversity towards the Reconstructing of Large Scale Gene Regulatory Networks

    PubMed Central

    Yamanaka, Ryota; Kitano, Hiroaki

    2013-01-01

    Elucidating gene regulatory networks (GRNs) from large scale experimental data remains a central challenge in systems biology. Recently, numerous techniques, particularly consensus driven approaches combining different algorithms, have become a potentially promising strategy to infer accurate GRNs. Here, we develop a novel consensus inference algorithm, TopkNet, that can integrate multiple algorithms to infer GRNs. Comprehensive performance benchmarking on a cloud computing framework demonstrated that (i) a simple strategy to combine many algorithms does not always lead to performance improvement compared to the cost of consensus and (ii) TopkNet integrating only high-performance algorithms provides significant performance improvement compared to the best individual algorithms and community prediction. These results suggest that a priori determination of high-performance algorithms is a key to reconstructing an unknown regulatory network. Similarity among gene-expression datasets can be useful to determine potential optimal algorithms for reconstruction of unknown regulatory networks, i.e., if expression data associated with a known regulatory network are similar to those associated with an unknown regulatory network, optimal algorithms determined for the known regulatory network can be repurposed to infer the unknown regulatory network. Based on this observation, we developed a quantitative measure of similarity among gene-expression datasets and demonstrated that, if similarity between the two expression datasets is high, TopkNet integrating algorithms that are optimal for the known dataset performs well on the unknown dataset. The consensus framework, TopkNet, together with the similarity measure proposed in this study provides a powerful strategy towards harnessing the wisdom of the crowds in reconstruction of unknown regulatory networks. PMID:24278007

  6. Remote visual analysis of large turbulence databases at multiple scales

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  7. Remote visual analysis of large turbulence databases at multiple scales

    DOE PAGES

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...

    2018-06-15

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  8. DNApod: DNA polymorphism annotation database from next-generation sequence read archives.

    PubMed

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.

  9. DNApod: DNA polymorphism annotation database from next-generation sequence read archives

    PubMed Central

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924

  10. Wind Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov Websites

    The Wind Integration National Dataset (WIND) Toolkit is an update and expansion of the Eastern Wind Integration Data Set and ... The WIND Toolkit includes meteorological conditions and turbine power for more than ...

  11. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    NASA Astrophysics Data System (ADS)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

    Here we present the first national scale flood risk analyses, using high resolution Facebook Connectivity Lab population data and data from a hyper resolution flood hazard model. In recent years the field of large scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodeled territories and from continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that for robust flood risk analysis to be undertaken both hazard and exposure data should sufficiently resolve local scale features. Global flood frameworks are enabling flood hazard data to be produced at 90m resolution, resulting in a mismatch with available population datasets, which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5m resolution, representing a resolution increase over previous countrywide data sets of multiple orders of magnitude. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
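
    The integration step this abstract describes, combining modelled water depths with a gridded exposure dataset, can be illustrated with a minimal sketch. It assumes both grids have already been resampled to the same cells; the function name, the depth threshold and the toy values are hypothetical and are not taken from the study.

```python
def exposed_population(depth_m, population, threshold_m=0.0):
    """Sum the population in cells whose modelled water depth exceeds threshold_m.

    depth_m and population are row-major grids (lists of rows) that are
    assumed to be co-registered, i.e. cell [i][j] covers the same ground
    area in both. Hypothetical helper for illustration only.
    """
    same_shape = len(depth_m) == len(population) and all(
        len(d_row) == len(p_row) for d_row, p_row in zip(depth_m, population)
    )
    if not same_shape:
        raise ValueError("hazard and exposure grids must share the same shape")
    return sum(
        pop
        for d_row, p_row in zip(depth_m, population)
        for depth, pop in zip(d_row, p_row)
        if depth > threshold_m
    )


# Toy 2 x 2 grids: water depth in metres and people per cell.
depth_m = [[0.0, 0.4], [1.2, 0.0]]
population = [[10, 25], [5, 40]]

print(exposed_population(depth_m, population))        # people in any wet cell
print(exposed_population(depth_m, population, 0.5))   # people in cells deeper than 0.5 m
```

    The shape check reflects the resolution mismatch the abstract points out: when hazard and exposure grids differ in cell size, one of them has to be resampled before a cell-by-cell overlay like this is meaningful.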

  12. Network Analysis of Rodent Transcriptomes in Spaceflight

    NASA Technical Reports Server (NTRS)

    Ramachandran, Maya; Fogle, Homer; Costes, Sylvain

    2017-01-01

    Network analysis methods leverage prior knowledge of cellular systems and the statistical and conceptual relationships between analyte measurements to determine gene connectivity. Correlation and conditional metrics are used to infer a network topology and provide a systems-level context for cellular responses. Integration across multiple experimental conditions and omics domains can reveal the regulatory mechanisms that underlie gene expression. GeneLab has assembled rich multi-omic (transcriptomics, proteomics, epigenomics, and epitranscriptomics) datasets for multiple murine tissues from the Rodent Research 1 (RR-1) experiment. RR-1 assesses the impact of 37 days of spaceflight on gene expression across a variety of tissue types, such as adrenal glands, quadriceps, gastrocnemius, tibialis anterior, extensor digitorum longus, soleus, eye, and kidney. Network analysis is particularly useful for RR-1 -omics datasets because it reinforces subtle relationships that may be overlooked in isolated analyses and subdues confounding factors. Our objective is to use network analysis to determine potential target nodes for therapeutic intervention and identify similarities with existing disease models. Multiple network algorithms are used for a higher confidence consensus.

  13. Bioinformatics resource manager v2.3: an integrated software environment for systems biology with microRNA and cross-species analysis tools

    PubMed Central

    2012-01-01

    Background: MicroRNAs (miRNAs) are noncoding RNAs that direct post-transcriptional regulation of protein coding genes. Recent studies have shown miRNAs are important for controlling many biological processes, including nervous system development, and are highly conserved across species. Given their importance, computational tools are necessary for analysis, interpretation and integration of high-throughput (HTP) miRNA data in an increasing number of model species. The Bioinformatics Resource Manager (BRM) v2.3 is a software environment for data management, mining, integration and functional annotation of HTP biological data. In this study, we report recent updates to BRM for miRNA data analysis and cross-species comparisons across datasets. Results: BRM v2.3 has the capability to query predicted miRNA targets from multiple databases, retrieve potential regulatory miRNAs for known genes, integrate experimentally derived miRNA and mRNA datasets, perform ortholog mapping across species, and retrieve annotation and cross-reference identifiers for an expanded number of species. Here we use BRM to show that developmental exposure of zebrafish to 30 µM nicotine from 6–48 hours post fertilization (hpf) results in behavioral hyperactivity in larval zebrafish and alteration of putative miRNA gene targets in whole embryos at developmental stages that encompass early neurogenesis. We show typical workflows for using BRM to integrate experimental zebrafish miRNA and mRNA microarray datasets with example retrievals for zebrafish, including pathway annotation and mapping to human orthologs. Functional analysis of differentially regulated (p<0.05) gene targets in BRM indicates that nicotine exposure disrupts genes involved in neurogenesis, possibly through misregulation of nicotine-sensitive miRNAs. Conclusions: BRM provides the ability to mine complex data for identification of candidate miRNAs or pathways that drive phenotypic outcome and, therefore, is a useful hypothesis generation tool for systems biology. The miRNA workflow in BRM allows for efficient processing of multiple miRNA and mRNA datasets in a single software environment with the added capability to interact with public data sources and visual analytic tools for HTP data analysis at a systems level. BRM is developed using Java™ and other open-source technologies for free distribution (http://www.sysbio.org/dataresources/brm.stm). PMID:23174015

  14. Cross-platform method for identifying candidate network biomarkers for prostate cancer.

    PubMed

    Jin, G; Zhou, X; Cui, K; Zhang, X-S; Chen, L; Wong, S T C

    2009-11-01

    Discovering biomarkers using mass spectrometry (MS) and microarray expression profiles is a promising strategy in molecular diagnosis. Here, the authors propose a new pipeline for biomarker discovery that integrates disease information for proteins and genes, expression profiles at both the genomic and proteomic levels, and protein-protein interactions (PPIs) to discover high confidence network biomarkers. Using this pipeline, a total of 474 molecules (genes and proteins) related to prostate cancer were identified and a prostate-cancer-related network (PCRN) was derived from the integrative information. Thus, a set of candidate network biomarkers was identified from multiple expression profiles composed of eight microarray datasets and one proteomics dataset. The network biomarkers with PPIs can accurately distinguish prostate cancer patients from normal ones, and potentially provide more reliable biomarker candidates than conventional biomarker discovery methods.

  15. A novel bi-level meta-analysis approach: applied to biological pathway analysis.

    PubMed

    Nguyen, Tin; Tagett, Rebecca; Donato, Michele; Mitrea, Cristina; Draghici, Sorin

    2016-02-01

    The accumulation of high-throughput data in public repositories creates a pressing need for integrative analysis of multiple datasets from independent experiments. However, study heterogeneity, study bias, outliers and the lack of power of available methods present real challenges in integrating genomic data. One practical drawback of many P-value-based meta-analysis methods, including Fisher's, Stouffer's, minP and maxP, is that they are sensitive to outliers. Another drawback is that, because they perform just one statistical test for each individual experiment, they may not fully exploit the potentially large number of samples within each study. We propose a novel bi-level meta-analysis approach that employs the additive method and the Central Limit Theorem within each individual experiment and also across multiple experiments. We prove that the bi-level framework is robust against bias, less sensitive to outliers than other methods, and more sensitive to small changes in signal. For comparative analysis, we demonstrate that the intra-experiment analysis has more power than the equivalent statistical test performed on a single large experiment. For pathway analysis, we compare the proposed framework with classical meta-analysis approaches (Fisher's, Stouffer's and the additive method) as well as with a dedicated pathway meta-analysis package (MetaPath), using 1252 samples from 21 datasets related to three human diseases, acute myeloid leukemia (9 datasets), type II diabetes (5 datasets) and Alzheimer's disease (7 datasets). Our framework outperforms its competitors in correctly identifying pathways relevant to the phenotypes. The framework is sufficiently general to be applied to any type of statistical meta-analysis. The R scripts are available on demand from the authors. Contact: sorin@wayne.edu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved.
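
    The classical P-value combination rules named in this abstract are short enough to sketch. The code below is an illustration using only the Python standard library, not the authors' R scripts and not their bi-level framework: it implements Fisher's method, Stouffer's method, and an additive rule in which the sum of P-values is compared against a normal approximation from the Central Limit Theorem. All function names are hypothetical, and the normal approximation is crude when only a few P-values are combined.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def fisher(pvalues):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k d.o.f. under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # the test statistic divided by 2
    # Closed-form chi-square survival function, valid for even degrees of freedom.
    return math.exp(-half_stat) * sum(
        half_stat ** j / math.factorial(j) for j in range(k)
    )


def stouffer(pvalues):
    """Stouffer's method: average the z-scores, rescaled to unit variance."""
    k = len(pvalues)
    z = sum(_STD_NORMAL.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(k)
    return 1.0 - _STD_NORMAL.cdf(z)


def additive_clt(pvalues):
    """Additive rule with a CLT approximation.

    Under the null each p is Uniform(0, 1), so the sum of k of them has
    mean k / 2 and variance k / 12; a small sum gives a small combined P-value.
    """
    k = len(pvalues)
    z = (sum(pvalues) - k / 2.0) / math.sqrt(k / 12.0)
    return _STD_NORMAL.cdf(z)


# One extreme P-value among otherwise unremarkable ones (made-up numbers).
with_outlier = [1e-12, 0.6, 0.7, 0.8, 0.9]
for rule in (fisher, stouffer, additive_clt):
    print(rule.__name__, rule(with_outlier))
```

    On the made-up input, Fisher's combined P-value is driven almost entirely by the single extreme entry, while the additive rule stays unremarkable; this is the outlier sensitivity the abstract refers to. Inputs must lie strictly between 0 and 1.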

  16. Map Matching and Real World Integrated Sensor Data Warehousing (Presentation)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Burton, E.

    2014-02-01

    The inclusion of interlinked temporal and spatial elements within integrated sensor data enables a tremendous degree of flexibility when analyzing multi-component datasets. The presentation illustrates how to warehouse, process, and analyze high-resolution integrated sensor datasets to support complex system analysis at the entity and system levels. The example cases presented utilize in-vehicle sensor system data to assess vehicle performance, while integrating a map matching algorithm to link vehicle data to roads to demonstrate the enhanced analysis possible via interlinking data elements. Furthermore, in addition to the flexibility provided, the examples presented illustrate concepts of maintaining proprietary operational information (Fleet DNA) and privacy of study participants (Transportation Secure Data Center) while producing widely distributed data products. Should real-time operational data be logged at high resolution across multiple infrastructure types, map matched to their associated infrastructure, and distributed employing a similar approach, dependencies between urban environment infrastructure components could be better understood. This understanding is especially crucial for the cities of the future where transportation will rely more on grid infrastructure to support its energy demands.

  17. Integrated genome browser: visual analytics platform for genomics.

    PubMed

    Freese, Nowlan H; Norris, David C; Loraine, Ann E

    2016-07-15

    Genome browsers that support fast navigation through vast datasets and provide interactive visual analytics functions can help scientists achieve deeper insight into biological systems. Toward this end, we developed Integrated Genome Browser (IGB), a highly configurable, interactive and fast open source desktop genome browser. Here we describe multiple updates to IGB, including all-new capabilities to display and interact with data from high-throughput sequencing experiments. To demonstrate, we describe example visualizations and analyses of datasets from RNA-Seq, ChIP-Seq and bisulfite sequencing experiments. Understanding results from genome-scale experiments requires viewing the data in the context of reference genome annotations and other related datasets. To facilitate this, we enhanced IGB's ability to consume data from diverse sources, including Galaxy, Distributed Annotation and IGB-specific Quickload servers. To support future visualization needs as new genome-scale assays enter wide use, we transformed the IGB codebase into a modular, extensible platform for developers to create and deploy all-new visualizations of genomic data. IGB is open source and is freely available from http://bioviz.org/igb. Contact: aloraine@uncc.edu. © The Author 2016. Published by Oxford University Press.

  18. methylPipe and compEpiTools: a suite of R packages for the integrative analysis of epigenomics data.

    PubMed

    Kishore, Kamal; de Pretis, Stefano; Lister, Ryan; Morelli, Marco J; Bianchi, Valerio; Amati, Bruno; Ecker, Joseph R; Pelizzola, Mattia

    2015-09-29

    Numerous methods are available to profile several epigenetic marks, providing data with different genome coverage and resolution. Large epigenomic datasets are then generated, and often combined with other high-throughput data, including RNA-seq, ChIP-seq for transcription factor (TF) binding and DNase-seq experiments. Despite the numerous computational tools covering specific steps in the analysis of large-scale epigenomics data, comprehensive software solutions for their integrative analysis are still missing. Multiple tools must be identified and combined to jointly analyze histone marks, TF binding and other -omics data together with DNA methylation data, complicating the analysis of these data and their integration with publicly available datasets. To overcome the burden of integrating various data types with multiple tools, we developed two companion R/Bioconductor packages. The former, methylPipe, is tailored to the analysis of high- or low-resolution DNA methylomes in several species, accommodating (hydroxy-)methyl-cytosines in both CpG and non-CpG sequence context. The analysis of multiple whole-genome bisulfite sequencing experiments is supported, while maintaining the ability of integrating targeted genomic data. The latter, compEpiTools, seamlessly incorporates the results obtained with methylPipe and supports their integration with other epigenomics data. It provides a number of methods to score these data in regions of interest, leading to the identification of enhancers, lncRNAs, and RNAPII stalling/elongation dynamics. Moreover, it allows a fast and comprehensive annotation of the resulting genomic regions, and the association of the corresponding genes with non-redundant GeneOntology terms. Finally, the package includes a flexible method based on heatmaps for the integration of various data types, combining annotation tracks with continuous or categorical data tracks. methylPipe and compEpiTools provide a comprehensive Bioconductor-compliant solution for the integrative analysis of heterogeneous epigenomics data. These packages are instrumental in providing biologists with minimal R skills a complete toolkit facilitating the analysis of their own data, or in accelerating the analyses performed by more experienced bioinformaticians.

  19. Insights and Challenges to Integrating Data from Diverse Ecological Networks

    NASA Astrophysics Data System (ADS)

    Peters, D. P. C.

    2014-12-01

    Many of the most dramatic and surprising effects of global change occur across large spatial extents, from regions to continents, that impact multiple ecosystem types across a range of interacting spatial and temporal scales. The ability of ecologists and inter-disciplinary scientists to understand and predict these dynamics depends, in large part, on existing site-based research infrastructures that developed in response to historic events. Integrating these diverse sources of data is critical to addressing these broad-scale questions. A conceptual approach is presented to synthesize and integrate diverse sources and types of data from different networks of research sites. This approach focuses on developing derived data products through spatial and temporal aggregation that allow datasets collected with different methods to be compared. The approach is illustrated through the integration, analysis, and comparison of hundreds of long-term datasets from 50 ecological sites in the US that represent ecosystem types commonly found globally. New insights were found by comparing multiple sites using common derived data. In addition to "bringing to light" many dark data in a standardized, open access, easy-to-use format, a suite of lessons was learned that can be applied to up-and-coming research networks in the US and internationally. These lessons will be described along with the challenges, including cyber-infrastructure, cultural, and behavioral constraints associated with the use of big and little data, that may keep ecologists and inter-disciplinary scientists from taking full advantage of the vast amounts of existing and yet-to-be exposed data.

  20. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  1. The Human Thalamus Is an Integrative Hub for Functional Brain Networks

    PubMed Central

    Bertolero, Maxwell A.

    2017-01-01

    The thalamus is globally connected with distributed cortical regions, yet the functional significance of this extensive thalamocortical connectivity remains largely unknown. By performing graph-theoretic analyses on thalamocortical functional connectivity data collected from human participants, we found that most thalamic subdivisions display network properties that are capable of integrating multimodal information across diverse cortical functional networks. From a meta-analysis of a large dataset of functional brain-imaging experiments, we further found that the thalamus is involved in multiple cognitive functions. Finally, we found that focal thalamic lesions in humans have widespread distal effects, disrupting the modular organization of cortical functional networks. This converging evidence suggests that the human thalamus is a critical hub region that could integrate diverse information being processed throughout the cerebral cortex as well as maintain the modular structure of cortical functional networks. Significance statement: The thalamus is traditionally viewed as a passive relay station of information from sensory organs or subcortical structures to the cortex. However, the thalamus has extensive connections with the entire cerebral cortex, which can also serve to integrate information processing between cortical regions. In this study, we demonstrate that multiple thalamic subdivisions display network properties that are capable of integrating information across multiple functional brain networks. Moreover, the thalamus is engaged by tasks requiring multiple cognitive functions. These findings support the idea that the thalamus is involved in integrating information across cortical networks. PMID:28450543

  2. A Bayesian trans-dimensional approach for the fusion of multiple geophysical datasets

    NASA Astrophysics Data System (ADS)

    JafarGandomi, Arash; Binley, Andrew

    2013-09-01

    We propose a Bayesian fusion approach to integrate multiple geophysical datasets with different coverage and sensitivity. The fusion strategy is based on the capability of various geophysical methods to provide enough resolution to identify either subsurface material parameters or subsurface structure, or both. We focus on electrical resistivity as the target material parameter and electrical resistivity tomography (ERT), electromagnetic induction (EMI), and ground penetrating radar (GPR) as the set of geophysical methods. However, extending the approach to different sets of geophysical parameters and methods is straightforward. Different geophysical datasets are entered into a trans-dimensional Markov chain Monte Carlo (McMC) search-based joint inversion algorithm. The trans-dimensional property of the McMC algorithm allows dynamic parameterisation of the model space, which in turn helps to avoid bias of the post-inversion results towards a particular model. Given that we are attempting to develop an approach that has practical potential, we discretize the subsurface into an array of one-dimensional earth-models. Accordingly, the ERT data that are collected using a two-dimensional acquisition geometry are recast as a set of equivalent vertical electric soundings. Different data are inverted either individually or jointly to estimate one-dimensional subsurface models at discrete locations. We use Shannon's information measure to quantify the information obtained from the inversion of different combinations of geophysical datasets. Information from multiple methods is brought together by introducing a joint likelihood function and/or constraining the prior information. A Bayesian maximum entropy approach is used for spatial fusion of spatially dispersed estimated one-dimensional models and mapping of the target parameter. We illustrate the approach with a synthetic dataset and then apply it to a field dataset.
We show that the proposed fusion strategy is successful not only in enhancing the subsurface information but also as a survey design tool to identify the appropriate combination of the geophysical tools and show whether application of an individual method for further investigation of a specific site is beneficial.

  3. Tissue-Specific Enrichment of Lymphoma Risk Loci in Regulatory Elements

    PubMed Central

    Hayes, James E.; Trynka, Gosia; Vijai, Joseph; Offit, Kenneth; Raychaudhuri, Soumya; Klein, Robert J.

    2015-01-01

    Though numerous polymorphisms have been associated with risk of developing lymphoma, how these variants function to promote tumorigenesis is poorly understood. Here, we report that lymphoma risk SNPs, especially in the non-Hodgkin’s lymphoma subtype chronic lymphocytic leukemia, are significantly enriched for co-localization with epigenetic marks of active gene regulation. These enrichments were seen in a lymphoid-specific manner for numerous ENCODE datasets, including DNase-hypersensitivity as well as multiple segmentation-defined enhancer regions. Furthermore, we identify putatively functional SNPs that are both in regulatory elements in lymphocytes and are associated with gene expression changes in blood. We developed an algorithm, UES, that uses a Monte Carlo simulation approach to calculate the enrichment of previously identified risk SNPs in various functional elements. This multiscale approach integrating multiple datasets helps disentangle the underlying biology of lymphoma, and more broadly, is generally applicable to GWAS results from other diseases as well. PMID:26422229
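
    The abstract above does not describe how UES builds its null distribution, so the following is only a minimal sketch of a generic Monte Carlo enrichment test, not the authors' algorithm. It assumes a single linear genome, sorted non-overlapping half-open intervals, and a null model of uniformly random positions; the names `count_overlaps` and `monte_carlo_enrichment` are hypothetical.

```python
import bisect
import random


def count_overlaps(positions, intervals):
    """Count positions that fall inside any interval.

    `intervals` must be sorted by start, non-overlapping, and half-open: [start, end).
    """
    starts = [start for start, _ in intervals]
    hits = 0
    for pos in positions:
        # Index of the last interval whose start is <= pos (or -1 if none).
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits


def monte_carlo_enrichment(snps, intervals, genome_length, n_sim=2000, seed=0):
    """Return (observed overlap count, empirical one-sided p-value).

    Assumption: the null model places the same number of SNPs uniformly at
    random on the genome. A real analysis would usually match null SNPs on
    properties such as allele frequency or local gene density.
    """
    rng = random.Random(seed)
    observed = count_overlaps(snps, intervals)
    at_least_as_extreme = 0
    for _ in range(n_sim):
        null_snps = [rng.randrange(genome_length) for _ in snps]
        if count_overlaps(null_snps, intervals) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the empirical p-value strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_sim + 1)
    return observed, p_value
```

    The add-one correction means the smallest reportable p-value is 1 / (n_sim + 1), so the number of simulations bounds the achievable significance.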

  4. The FLIGHT Drosophila RNAi database

    PubMed Central

    Bursteinas, Borisas; Jain, Ekta; Gao, Qiong; Baum, Buzz; Zvelebil, Marketa

    2010-01-01

    FLIGHT (http://flight.icr.ac.uk/) is an online resource compiling data from high-throughput Drosophila in vivo and in vitro RNAi screens. FLIGHT includes details of RNAi reagents and their predicted off-target effects, alongside RNAi screen hits, scores and phenotypes, including images from high-content screens. The latest release of FLIGHT is designed to enable users to upload, analyze, integrate and share their own RNAi screens. Users can perform multiple normalizations, view quality control plots, detect and assign screen hits and compare hits from multiple screens using a variety of methods including hierarchical clustering. FLIGHT integrates RNAi screen data with microarray gene expression as well as genomic annotations and genetic/physical interaction datasets to provide a single interface for RNAi screen analysis and datamining in Drosophila. PMID:20855970

  5. The Role of Data Archives in Synoptic Solar Physics

    NASA Astrophysics Data System (ADS)

    Reardon, Kevin

    The detailed study of solar cycle variations requires analysis of recorded datasets spanning many years of observations, that is, a data archive. The use of digital data, combined with powerful database server software, gives such archives new capabilities to provide, quickly and flexibly, selected pieces of information to scientists. Use of standardized protocols will allow multiple databases, independently maintained, to be seamlessly joined, allowing complex searches spanning multiple archives. These data archives also benefit from being developed in parallel with the telescope itself, which helps to assure data integrity and to provide close integration between the telescope and archive. Development of archives that can guarantee long-term data availability and strong compatibility with other projects makes solar-cycle studies easier to plan and realize.

  6. CAMBerVis: visualization software to support comparative analysis of multiple bacterial strains.

    PubMed

    Woźniak, Michał; Wong, Limsoon; Tiuryn, Jerzy

    2011-12-01

    A number of inconsistencies in genome annotations are documented among bacterial strains. Visualization of the differences may help biologists to make correct decisions in spurious cases. We have developed a visualization tool, CAMBerVis, to support comparative analysis of multiple bacterial strains. The software manages simultaneous visualization of multiple bacterial genomes, enabling visual analysis focused on genome structure annotations. The CAMBerVis software is freely available at the project website: http://bioputer.mimuw.edu.pl/camber. Input datasets for Mycobacterium tuberculosis and Staphylococcus aureus are integrated with the software as examples. Contact: m.wozniak@mimuw.edu.pl Supplementary information: Supplementary data are available at Bioinformatics online.

  7. Efficient sequential and parallel algorithms for record linkage.

    PubMed

    Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

    2014-01-01

    Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either long running times or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Our sequential and parallel algorithms have been tested on a real dataset of 1,083,878 records and synthetic datasets ranging in size from 50,000 to 9,000,000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
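
    The two ideas named in the abstract (drop identical records first, then link similar records in a graph and take connected components) can be illustrated with a small sketch. This is not the authors' implementation: it swaps their radix sort for Python's built-in `set` and `sorted`, compares all pairs naively instead of using hierarchical clustering, and treats each record as a single string. The function names and the `max_dist` threshold are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def link_records(records, max_dist=1):
    """Group records whose strings are connected by edits of at most `max_dist`."""
    # Step 1: eliminate identical records before any pairwise work.
    unique = sorted(set(records))

    # Step 2: union-find over the graph whose edges join similar records.
    parent = list(range(len(unique)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if edit_distance(unique[i], unique[j]) <= max_dist:
                parent[find(i)] = find(j)

    # Step 3: each connected component is one linked entity.
    groups = {}
    for i, record in enumerate(unique):
        groups.setdefault(find(i), []).append(record)
    return sorted(groups.values())
```

    Removing exact duplicates first matters because the pairwise comparison is quadratic in the number of distinct records; the all-pairs loop here would not scale to the dataset sizes reported in the abstract.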

  8. Robust continuous clustering

    PubMed Central

    Shah, Sohil Atul

    2017-01-01

    Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838

  9. Argumentation Based Joint Learning: A Novel Ensemble Learning Approach

    PubMed Central

    Xu, Junyi; Yao, Li; Li, Le

    2015-01-01

    Recently, ensemble learning methods have been widely used to improve classification performance in machine learning. In this paper, we present a novel ensemble learning method: argumentation based multi-agent joint learning (AMAJL), which integrates ideas from multi-agent argumentation, ensemble learning, and association rule mining. In AMAJL, argumentation technology is introduced as an ensemble strategy to integrate multiple base classifiers and generate a high-performance ensemble classifier. We design an argumentation framework named Arena as a communication platform for knowledge integration. Through argumentation based joint learning, high-quality individual knowledge can be extracted, and thus a refined global knowledge base can be generated and used independently for classification. We perform numerous experiments on multiple public datasets using AMAJL and other benchmark methods. The results demonstrate that our method can effectively extract high-quality knowledge for the ensemble classifier and improve the performance of classification. PMID:25966359

  10. Integrative analysis of gene expression and DNA methylation using unsupervised feature extraction for detecting candidate cancer biomarkers.

    PubMed

    Moon, Myungjin; Nakai, Kenta

    2018-04-01

    Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment of cancer. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extractions to identify candidate biomarkers of cancer using renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets by unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance as compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method. Furthermore, we expect that the proposed method can be expanded for applications involving various types of multi-omics datasets.
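
    The Box-Cox normalization named in the abstract has a standard closed form, sketched below for a fixed exponent. The abstract does not say how the exponent was chosen, so passing `lam` by hand is an assumption (it is commonly estimated by maximum likelihood), and the function name `box_cox` is hypothetical. The subsequent unsupervised feature extraction step is not shown.

```python
import math


def box_cox(values, lam):
    """Apply the one-parameter Box-Cox transform to strictly positive values.

    y = (x**lam - 1) / lam   for lam != 0
    y = log(x)               for lam == 0
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]
```

    Expression or methylation values that contain zeros would need a positive offset before this transform can be applied.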

  11. Multimodal integration of micro-Doppler sonar and auditory signals for behavior classification with convolutional networks.

    PubMed

    Dura-Bernal, Salvador; Garreau, Guillaume; Georgiou, Julius; Andreou, Andreas G; Denham, Susan L; Wennekers, Thomas

    2013-10-01

    The ability to recognize the behavior of individuals is of great interest in the general field of safety (e.g. building security, crowd control, transport analysis, independent living for the elderly). Here we report a new real-time acoustic system for human action and behavior recognition that integrates passive audio and active micro-Doppler sonar signatures over multiple time scales. The system architecture is based on a six-layer convolutional neural network, trained and evaluated using a dataset of 10 subjects performing seven different behaviors. Probabilistic combination of system output through time for each modality separately yields 94% (passive audio) and 91% (micro-Doppler sonar) correct behavior classification; probabilistic multimodal integration increases classification performance to 98%. This study supports the efficacy of micro-Doppler sonar systems in characterizing human actions, which can then be efficiently classified using ConvNets. It also demonstrates that the integration of multiple sources of acoustic information can significantly improve the system's performance.
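
    The abstract reports that combining outputs through time, and then across modalities, improves accuracy, but it does not give the combination rule. The sketch below shows one common choice, offered only as an assumption: multiply per-class probabilities (summing logs) under a conditional-independence assumption with equal class priors, then renormalise. The function name `combine` and the example numbers are illustrative, not taken from the paper.

```python
import math


def combine(distributions, floor=1e-12):
    """Fuse several per-class probability vectors into one.

    Each element of `distributions` is a list of class probabilities, for
    example one network output per time frame or one posterior per modality.
    Probabilities are floored to avoid log(0).
    """
    n_classes = len(distributions[0])
    log_scores = [
        sum(math.log(max(dist[c], floor)) for dist in distributions)
        for c in range(n_classes)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative use: combine frames within each modality, then fuse modalities.
audio_frames = [[0.6, 0.4], [0.7, 0.3]]
sonar_frames = [[0.55, 0.45], [0.4, 0.6]]
audio = combine(audio_frames)
sonar = combine(sonar_frames)
fused = combine([audio, sonar])
```

    In this toy input the sonar stream alone favours the second class, while the fused result follows the more confident audio stream; that is the kind of complementary behaviour a multimodal combination is meant to exploit.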

  12. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

    Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except generalised least squares multilevel modelling (ML GH 'xtlogit' in Stata) gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples) there appears to be less need to adjust for clustering.

  13. Methods to increase reproducibility in differential gene expression via meta-analysis

    PubMed Central

    Sweeney, Timothy E.; Haynes, Winston A.; Vallania, Francesco; Ioannidis, John P.; Khatri, Purvesh

    2017-01-01

    Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a ‘silver standard’ of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of actual false positive rate by Benjamini–Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size. PMID:27634930
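
    For readers unfamiliar with the Benjamini-Hochberg correction mentioned in the abstract, the standard step-up procedure is sketched below. This shows only the textbook procedure, not the authors' meta-analysis pipeline; the function name and the default `alpha` are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k (1-based, ascending p-values) with
    p_(k) <= alpha * k / m, then reject every hypothesis ranked k or lower.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, 1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, 1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected
```

    Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value passes its threshold.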

  14. Efficient and self-adaptive in-situ learning in multilayer memristor neural networks.

    PubMed

    Li, Can; Belkin, Daniel; Li, Yunning; Yan, Peng; Hu, Miao; Ge, Ning; Jiang, Hao; Montgomery, Eric; Lin, Peng; Wang, Zhongrui; Song, Wenhao; Strachan, John Paul; Barnell, Mark; Wu, Qing; Williams, R Stanley; Yang, J Joshua; Xia, Qiangfei

    2018-06-19

    Memristors with tunable resistance states are emerging building blocks of artificial neural networks. However, in situ learning on a large-scale multiple-layer memristor network has yet to be demonstrated because of challenges in device property engineering and circuit integration. Here we monolithically integrate hafnium oxide-based memristors with a foundry-made transistor array into a multiple-layer neural network. We experimentally demonstrate in situ learning capability and achieve competitive classification accuracy on a standard machine learning dataset, which further confirms that the training algorithm allows the network to adapt to hardware imperfections. Our simulation using the experimental parameters suggests that a larger network would further increase the classification accuracy. The memristor neural network is a promising hardware platform for artificial intelligence with high speed-energy efficiency.

  15. In Silico Investigations of the Anti-Catabolic Effects of Pamidronate and Denosumab on Multiple Myeloma-Induced Bone Disease

    PubMed Central

    Wang, Yan; Lin, Bo

    2012-01-01

    It is unclear whether the new anti-catabolic agent denosumab represents a viable alternative to the widely used anti-catabolic agent pamidronate in the treatment of Multiple Myeloma (MM)-induced bone disease. This lack of clarity primarily stems from the lack of sufficient clinical investigations, which are costly and time consuming. However, in silico investigations require less time and expense, suggesting that they may be a useful complement to traditional clinical investigations. In this paper, we aim to (i) develop integrated computational models that are suitable for investigating the effects of pamidronate and denosumab on MM-induced bone disease and (ii) evaluate the responses to pamidronate and denosumab treatments using these integrated models. To achieve these goals, pharmacokinetic models of pamidronate and denosumab are first developed and then calibrated and validated using different clinical datasets. Next, the integrated computational models are developed by incorporating the simulated transient concentrations of pamidronate and denosumab and simulations of their actions on the MM-bone compartment into the previously proposed MM-bone model. These integrated models are further calibrated and validated by different clinical datasets so that they are suitable to be applied to investigate the responses to the pamidronate and denosumab treatments. Finally, these responses are evaluated by quantifying the bone volume, bone turnover, and MM-cell density. This evaluation identifies four denosumab regimes that potentially produce an overall improved bone-related response compared with the recommended pamidronate regime. This in silico investigation supports the idea that denosumab represents an appropriate alternative to pamidronate in the treatment of MM-induced bone disease. PMID:23028650

  16. Integrating genome-wide association study summaries and element-gene interaction datasets identified multiple associations between elements and complex diseases.

    PubMed

    He, Awen; Wang, Wenyu; Prakash, N Tejo; Tinkov, Alexey A; Skalny, Anatoly V; Wen, Yan; Hao, Jingcan; Guo, Xiong; Zhang, Feng

    2018-03-01

    Chemical elements are closely related to human health. Extensive genomic profile data of complex diseases offer us a good opportunity to systematically investigate the relationships between elements and complex diseases/traits. In this study, we applied a gene set enrichment analysis (GSEA) approach to detect the associations between elements and complex diseases/traits through integrating element-gene interaction datasets and genome-wide association study (GWAS) data of complex diseases/traits. To illustrate the performance of GSEA, the element-gene interaction datasets of 24 elements were extracted from the Comparative Toxicogenomics Database (CTD). GWAS summary datasets of 24 complex diseases or traits were downloaded from the dbGaP or GEFOS websites. We observed significant associations between 7 elements and 13 complex diseases or traits (all false discovery rate (FDR) < 0.05), including reported relationships such as aluminum vs. Alzheimer's disease (FDR = 0.042), calcium vs. bone mineral density (FDR = 0.031), magnesium vs. systemic lupus erythematosus (FDR = 0.012) as well as novel associations, such as nickel vs. hypertriglyceridemia (FDR = 0.002) and bipolar disorder (FDR = 0.027). Our study results are consistent with previous biological studies, supporting the good performance of GSEA. Our analysis results based on the GSEA framework provide novel clues for discovering causal relationships between elements and complex diseases. © 2017 Wiley Periodicals, Inc.

  17. In vitro downregulated hypoxia transcriptome is associated with poor prognosis in breast cancer.

    PubMed

    Abu-Jamous, Basel; Buffa, Francesca M; Harris, Adrian L; Nandi, Asoke K

    2017-06-15

    Hypoxia is a characteristic of breast tumours indicating poor prognosis. Based on the assumption that those genes which are up-regulated under hypoxia in cell lines are expected to be predictors of poor prognosis in clinical data, many signatures of poor prognosis were identified. However, it was observed that cell line data do not always concur with clinical data, and therefore conclusions from cell line analysis should be considered with caution. As many transcriptomic cell-line datasets from hypoxia-related contexts are available, integrative approaches which investigate these datasets collectively, while not ignoring clinical data, are required. We analyse sixteen heterogeneous breast cancer cell-line transcriptomic datasets in hypoxia-related conditions collectively by employing the unique capabilities of the method, UNCLES, which integrates clustering results from multiple datasets and can address questions that cannot be answered by existing methods. This has been demonstrated by comparison with the state-of-the-art iCluster method. From this collection of genome-wide datasets, which include 15,588 genes, UNCLES identified a relatively high number of genes (>1000 overall) which are consistently co-regulated over all of the datasets, and some of which are still poorly understood and represent new potential HIF targets, such as RSBN1 and KIAA0195. Two main, anti-correlated, clusters were identified; the first is enriched with MYC targets participating in growth and proliferation, while the other is enriched with HIF targets directly participating in the hypoxia response. Surprisingly, in six clinical datasets, some sub-clusters of growth genes are found consistently positively correlated with hypoxia response genes, unlike the observation in cell lines.
Moreover, the ability to predict bad prognosis by a combined signature of one sub-cluster of growth genes and one sub-cluster of hypoxia-induced genes appears to be comparable to, and perhaps greater than, that of known hypoxia signatures. We present a clustering approach suitable to integrate data from diverse experimental set-ups. Its application to breast cancer cell line datasets reveals new hypoxia-regulated signatures of genes which behave differently when in vitro (cell-line) data is compared with in vivo (clinical) data, and are of a prognostic value comparable to or exceeding that of the state-of-the-art hypoxia signatures.

  18. The Nanomaterial Data Curation Initiative: A collaborative approach to assessing, evaluating, and advancing the state of the field

    PubMed Central

    Powers, Christina M; Hoover, Mark D; Harper, Stacey L

    2015-01-01

    Summary: The Nanomaterial Data Curation Initiative (NDCI), a project of the National Cancer Informatics Program Nanotechnology Working Group (NCIP NanoWG), explores the critical aspect of data curation within the development of informatics approaches to understanding nanomaterial behavior. Data repositories and tools for integrating and interrogating complex nanomaterial datasets are gaining widespread interest, with multiple projects now appearing in the US and the EU. Even in these early stages of development, a single common aspect shared across all nanoinformatics resources is that data must be curated into them. Through exploration of sub-topics related to all activities necessary to enable, execute, and improve the curation process, the NDCI will provide a substantive analysis of nanomaterial data curation itself, as well as a platform for multiple other important discussions to advance the field of nanoinformatics. This article outlines the NDCI project and lays the foundation for a series of papers on nanomaterial data curation. The NDCI purpose is to: 1) present and evaluate the current state of nanomaterial data curation across the field on multiple specific data curation topics, 2) propose ways to leverage and advance progress for both individual efforts and the nanomaterial data community as a whole, and 3) provide opportunities for similar publication series on the details of the interactive needs and workflows of data customers, data creators, and data analysts. Initial responses from stakeholder liaisons throughout the nanoinformatics community reveal a shared view that it will be critical to focus on integration of datasets with specific orientation toward the purposes for which the individual resources were created, as well as the purpose for integrating multiple resources.
Early acknowledgement and undertaking of complex topics such as uncertainty, reproducibility, and interoperability is proposed as an important path to addressing key challenges within the nanomaterial community, such as reducing collateral negative impacts and decreasing the time from development to market for this new class of technologies. PMID:26425427

  19. Registration and Fusion of Multiple Source Remotely Sensed Image Data

    NASA Technical Reports Server (NTRS)

    LeMoigne, Jacqueline

    2004-01-01

    Earth and Space Science often involve the comparison, fusion, and integration of multiple types of remotely sensed data at various temporal, radiometric, and spatial resolutions. Results of this integration may be utilized for global change analysis, global coverage of an area at multiple resolutions, map updating or validation of new instruments, as well as integration of data provided by multiple instruments carried on multiple platforms, e.g. in spacecraft constellations or fleets of planetary rovers. Our focus is on developing methods to perform fast, accurate and automatic image registration and fusion. General methods for automatic image registration are being reviewed and evaluated. Various choices for feature extraction, feature matching and similarity measurements are being compared, including wavelet-based algorithms, mutual information and statistically robust techniques. Our work also involves studies related to image fusion and investigates dimension reduction and co-kriging for application-dependent fusion. All methods are being tested using several multi-sensor datasets, acquired at EOS Core Sites, and including multiple sensors such as IKONOS, Landsat-7/ETM+, EO1/ALI and Hyperion, MODIS, and SeaWIFS instruments. Issues related to the coregistration of data from the same platform (i.e., AIRS and MODIS from Aqua) or from several platforms of the A-train (i.e., MLS, HIRDLS, OMI from Aura with AIRS and MODIS from Terra and Aqua) will also be considered.

  20. Efficient sequential and parallel algorithms for record linkage

    PubMed Central

    Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

    2014-01-01

    Background and objective Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Methods Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Results Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). Conclusions We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm. PMID:24154837
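
    The two ideas named in this abstract (removing identical records first, then linking similar records in a graph and taking connected components) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: a dictionary stands in for radix sorting, all pairs are compared (quadratic time), and the function names and the `max_distance` threshold are hypothetical.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def link_records(records, max_distance=1):
    """Group record indices that refer to the same entity (illustrative only)."""
    # Step 1: collapse identical records so each distinct string is compared once.
    # Assumption: a dict replaces the radix sort mentioned in the abstract.
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec].append(idx)
    unique = list(groups)

    # Step 2: union-find over distinct records; an edge joins two records
    # whose edit distance is at most max_distance (hypothetical threshold).
    parent = list(range(len(unique)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if edit_distance(unique[i], unique[j]) <= max_distance:
                parent[find(i)] = find(j)

    # Step 3: each connected component becomes one cluster of original indices.
    clusters = defaultdict(list)
    for i, rec in enumerate(unique):
        clusters[find(i)].extend(groups[rec])
    return sorted(sorted(c) for c in clusters.values())
```

    Calling `link_records(["john smith", "jon smith", "john smith", "mary jones", "mary jonas", "alice wu"])` groups the three Smith records, the two Jones-like records, and leaves the last record alone.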

  1. Profiling physicochemical and planktonic features from discretely/continuously sampled surface water.

    PubMed

    Oita, Azusa; Tsuboi, Yuuri; Date, Yasuhiro; Oshima, Takahiro; Sakata, Kenji; Yokoyama, Akiko; Moriya, Shigeharu; Kikuchi, Jun

    2018-04-24

    There is an increasing need for assessing aquatic ecosystems that are globally endangered. Since aquatic ecosystems are complex, integrated consideration of multiple factors utilizing omics technologies can help us better understand aquatic ecosystems. An integrated strategy linking three analytical (machine learning, factor mapping, and forecast-error-variance decomposition) approaches for extracting the features of surface water from datasets comprising ions, metabolites, and microorganisms is proposed herein. The three developed approaches can be employed for diverse datasets of sample sizes and experimentally analyzed factors. The three approaches are applied to explore the features of bay water surrounding Odaiba, Tokyo, Japan, as a case study. Firstly, the machine learning approach separated 681 surface water samples within Japan into three clusters, categorizing Odaiba water into seawater with relatively low inorganic ions, including Mg, Ba, and B. Secondly, the factor mapping approach illustrated Odaiba water samples from the summer as rich in multiple amino acids and some other metabolites and poor in inorganic ions relative to other seasons based on their seasonal dynamics. Finally, forecast-error-variance decomposition using vector autoregressive models indicated that a type of microalgae (Raphidophyceae) grows in close correlation with alanine, succinic acid, and valine on filters and with isobutyric acid and 4-hydroxybenzoic acid in filtrate, Ba, and average wind speed. Our integrated strategy can be used to examine many biological, chemical, and environmental physical factors to analyze surface water. Copyright © 2018. Published by Elsevier B.V.

  2. Server-based Approach to Web Visualization of Integrated Three-dimensional Brain Imaging Data

    PubMed Central

    Poliakov, Andrew V.; Albright, Evan; Hinshaw, Kevin P.; Corina, David P.; Ojemann, George; Martin, Richard F.; Brinkley, James F.

    2005-01-01

    The authors describe a client-server approach to three-dimensional (3-D) visualization of neuroimaging data, which enables researchers to visualize, manipulate, and analyze large brain imaging datasets over the Internet. All computationally intensive tasks are done by a graphics server that loads and processes image volumes and 3-D models, renders 3-D scenes, and sends the renderings back to the client. The authors discuss the system architecture and implementation and give several examples of client applications that allow visualization and analysis of integrated language map data from single and multiple patients. PMID:15561787

  3. THE ENGINE AND THE REAPER: INDUSTRIALIZATION AND MORTALITY IN LATE NINETEENTH CENTURY JAPAN.

    PubMed

    Tang, John P

    2017-12-01

    Economic development improves long-run health outcomes through access to medical treatment, sanitation, and higher income. Short-run impacts, however, may be ambiguous given disease exposure from market integration. Using a panel dataset of Japanese vital statistics and multiple estimation methods, I find that railroad network expansion is associated with a six percent increase in gross mortality rates among newly integrated regions. Communicable diseases accounted for most of the rail-associated mortality, which indicates that railways behaved as transmission vectors. At the same time, market integration facilitated by railways corresponded with an eighteen percent increase in total capital investment nationwide over ten years. Copyright © 2017 Elsevier B.V. All rights reserved.

  4. Diagram-based Analysis of Causal Systems (DACS): elucidating inter-relationships between determinants of acute lower respiratory infections among children in sub-Saharan Africa.

    PubMed

    Rehfuess, Eva A; Best, Nicky; Briggs, David J; Joffe, Mike

    2013-12-06

    Effective interventions require evidence on how individual causal pathways jointly determine disease. Based on the concept of systems epidemiology, this paper develops Diagram-based Analysis of Causal Systems (DACS) as an approach to analyze complex systems, and applies it by examining the contributions of proximal and distal determinants of childhood acute lower respiratory infections (ALRI) in sub-Saharan Africa. Diagram-based Analysis of Causal Systems combines the use of causal diagrams with multiple routinely available data sources, using a variety of statistical techniques. In a step-by-step process, the causal diagram evolves from conceptual based on a priori knowledge and assumptions, through operational informed by data availability which then undergoes empirical testing, to integrated which synthesizes information from multiple datasets. In our application, we apply different regression techniques to Demographic and Health Survey (DHS) datasets for Benin, Ethiopia, Kenya and Namibia and a pooled World Health Survey (WHS) dataset for sixteen African countries. Explicit strategies are employed to make decisions transparent about the inclusion/omission of arrows, the sign and strength of the relationships and homogeneity/heterogeneity across settings. Findings about the current state of evidence on the complex web of socio-economic, environmental, behavioral and healthcare factors influencing childhood ALRI, based on DHS and WHS data, are summarized in an integrated causal diagram. Notably, solid fuel use is structured by socio-economic factors and increases the risk of childhood ALRI mortality. Diagram-based Analysis of Causal Systems is a means of organizing the current state of knowledge about a specific area of research, and a framework for integrating statistical analyses across a whole system. 
This partly a priori approach is explicit about causal assumptions guiding the analysis and about researcher judgment, and wrong assumptions can be reversed following empirical testing. This approach is well-suited to dealing with complex systems, in particular where data are scarce.

  5. Diagram-based Analysis of Causal Systems (DACS): elucidating inter-relationships between determinants of acute lower respiratory infections among children in sub-Saharan Africa

    PubMed Central

    2013-01-01

    Background Effective interventions require evidence on how individual causal pathways jointly determine disease. Based on the concept of systems epidemiology, this paper develops Diagram-based Analysis of Causal Systems (DACS) as an approach to analyze complex systems, and applies it by examining the contributions of proximal and distal determinants of childhood acute lower respiratory infections (ALRI) in sub-Saharan Africa. Results Diagram-based Analysis of Causal Systems combines the use of causal diagrams with multiple routinely available data sources, using a variety of statistical techniques. In a step-by-step process, the causal diagram evolves from conceptual based on a priori knowledge and assumptions, through operational informed by data availability which then undergoes empirical testing, to integrated which synthesizes information from multiple datasets. In our application, we apply different regression techniques to Demographic and Health Survey (DHS) datasets for Benin, Ethiopia, Kenya and Namibia and a pooled World Health Survey (WHS) dataset for sixteen African countries. Explicit strategies are employed to make decisions transparent about the inclusion/omission of arrows, the sign and strength of the relationships and homogeneity/heterogeneity across settings. Findings about the current state of evidence on the complex web of socio-economic, environmental, behavioral and healthcare factors influencing childhood ALRI, based on DHS and WHS data, are summarized in an integrated causal diagram. Notably, solid fuel use is structured by socio-economic factors and increases the risk of childhood ALRI mortality. Conclusions Diagram-based Analysis of Causal Systems is a means of organizing the current state of knowledge about a specific area of research, and a framework for integrating statistical analyses across a whole system. 
This partly a priori approach is explicit about causal assumptions guiding the analysis and about researcher judgment, and wrong assumptions can be reversed following empirical testing. This approach is well-suited to dealing with complex systems, in particular where data are scarce. PMID:24314302

  6. A reproducible approach to high-throughput biological data acquisition and integration

    PubMed Central

    Rahnavard, Gholamali; Waldron, Levi; McIver, Lauren; Shafquat, Afrah; Franzosa, Eric A.; Miropolsky, Larissa; Sweeney, Christopher

    2015-01-01

    Modern biological research requires rapid, complex, and reproducible integration of multiple experimental results generated both internally and externally (e.g., from public repositories). Although large systematic meta-analyses are among the most effective approaches both for clinical biomarker discovery and for computational inference of biomolecular mechanisms, identifying, acquiring, and integrating relevant experimental results from multiple sources for a given study can be time-consuming and error-prone. To enable efficient and reproducible integration of diverse experimental results, we developed a novel approach for standardized acquisition and analysis of high-throughput and heterogeneous biological data. This allowed, first, novel biomolecular network reconstruction in human prostate cancer, which correctly recovered and extended the NFκB signaling pathway. Next, we investigated host-microbiome interactions. In less than an hour of analysis time, the system retrieved data and integrated six germ-free murine intestinal gene expression datasets to identify the genes most influenced by the gut microbiota, which comprised a set of immune-response and carbohydrate metabolism processes. Finally, we constructed integrated functional interaction networks to compare connectivity of peptide secretion pathways in the model organisms Escherichia coli, Bacillus subtilis, and Pseudomonas aeruginosa. PMID:26157642

  7. Improvements in the spatial representation of lakes and reservoirs in the contiguous United States for the National Water Model

    NASA Astrophysics Data System (ADS)

    Khan, S.; Salas, F.; Sampson, K. M.; Read, L. K.; Cosgrove, B.; Li, Z.; Gochis, D. J.

    2017-12-01

    The representation of inland surface water bodies in distributed hydrologic models at the continental scale is a challenge. The National Water Model (NWM) utilizes the National Hydrography Dataset Plus Version 2 (NHDPlusV2) "waterbody" dataset to represent lakes and reservoirs. The "waterbody" layer is a comprehensive dataset that represents surface water bodies using common features like lakes, ponds, reservoirs, estuaries, playas and swamps/marshes. However, a major issue that remains unresolved even in the latest revision of NHDPlus Version 2 is the inconsistency in waterbody digitization and delineation errors. Manually correcting the water body polygons becomes tedious and quickly impossible for continental-scale hydrologic models such as the NWM. In this study, we improved spatial representation of 6,802 lakes and reservoirs by analyzing 379,110 waterbodies in the contiguous United States (excluding the Laurentian Great Lakes). We performed a step-by-step process that integrates a set of geospatial analyses to identify, track, and correct the extent of lake and reservoir features that are larger than 0.75 km². The following assumptions were applied while developing the new dataset: a) lakes and reservoirs cannot directly feed into each other; b) each waterbody must have one outlet; and c) a single lake or reservoir feature cannot have multiple parts. The majority of the NHDPlusV2 waterbody features in the original dataset are delineated correctly. However, approximately 3% of the lake and reservoir polygons were found to be incorrect with topological errors and were corrected accordingly. It is important to fix these digitizing errors because the waterbody features are closely linked to the river topology. This new waterbody dataset will ensure that model-simulated water is directed into and through the lakes and reservoirs in a manner that supports the NWM code base and assumptions. 
The improved dataset will facilitate more effective integration of lakes and reservoirs with correct spatial features into the updated NWM.

  8. Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.

    The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. Results: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a framework based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Availability: Phylo-VISTA is available at http://www-gsd.lbl.gov/phylovista. It requires an Internet browser with Java Plugin 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu.

  9. Integrated data analysis for genome-wide research.

    PubMed

    Steinfath, Matthias; Repsilber, Dirk; Scholz, Matthias; Walther, Dirk; Selbig, Joachim

    2007-01-01

    Integrated data analysis is introduced as the intermediate level of a systems biology approach to analyse different 'omics' datasets, i.e., genome-wide measurements of transcripts, protein levels or protein-protein interactions, and metabolite levels aiming at generating a coherent understanding of biological function. In this chapter we focus on different methods of correlation analyses ranging from simple pairwise correlation to kernel canonical correlation which were recently applied in molecular biology. Several examples are presented to illustrate their application. The input data for this analysis frequently originate from different experimental platforms. Therefore, preprocessing steps such as data normalisation and missing value estimation are inherent to this approach. The corresponding procedures, potential pitfalls and biases, and available software solutions are reviewed. The multiplicity of observations obtained in omics-profiling experiments necessitates the application of multiple testing correction techniques.
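
    The closing point about multiple testing can be made concrete with a short sketch. The abstract does not name a specific correction; the Benjamini-Hochberg step-up procedure is used here purely as one common choice, and the function name and `alpha` default are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if the hypothesis is rejected.

    Benjamini-Hochberg step-up procedure: sort the p-values, find the largest
    rank k with p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes

    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

    In an omics setting each p-value would come from one pairwise correlation test (for example, one transcript against one metabolite); the procedure controls the expected proportion of false discoveries among the rejected tests rather than the chance of any single false positive.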

  10. ProbFold: a probabilistic method for integration of probing data in RNA secondary structure prediction.

    PubMed

    Sahoo, Sudhakar; Świtnicki, Michał P; Pedersen, Jakob Skou

    2016-09-01

    Recently, new RNA secondary structure probing techniques have been developed, including Next Generation Sequencing based methods capable of probing transcriptome-wide. These techniques hold great promise for improving structure prediction accuracy. However, each new data type comes with its own signal properties and biases, which may even be experiment specific. There is therefore a growing need for RNA structure prediction methods that can be automatically trained on new data types and readily extended to integrate and fully exploit multiple types of data. Here, we develop and explore a modular probabilistic approach for integrating probing data in RNA structure prediction. It can be automatically trained given a set of known structures with probing data. The approach is demonstrated on SHAPE datasets, where we evaluate and selectively model specific correlations. The approach often makes superior use of the probing data signal compared to other methods. We illustrate the use of ProbFold on multiple data types using both simulations and a small set of structures with SHAPE, DMS and CMCT data. Technically, the approach combines stochastic context-free grammars (SCFGs) with probabilistic graphical models. This approach allows rapid adaptation and integration of new probing data types. ProbFold is implemented in C++. Models are specified using simple textual formats. Data reformatting is done using separate C++ programs. Source code, statically compiled binaries for x86 Linux machines, C++ programs, example datasets and a tutorial are available from http://moma.ki.au.dk/prj/probfold/. Contact: jakob.skou@clin.au.dk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  11. Mergeomics: a web server for identifying pathological pathways, networks, and key regulators via multidimensional data integration.

    PubMed

    Arneson, Douglas; Bhattacharya, Anindya; Shu, Le; Mäkinen, Ville-Petteri; Yang, Xia

    2016-09-09

    Human diseases are commonly the result of multidimensional changes at molecular, cellular, and systemic levels. Recent advances in genomic technologies have enabled an outpour of omics datasets that capture these changes. However, separate analyses of these various data only provide fragmented understanding and do not capture the holistic view of disease mechanisms. To meet the urgent needs for tools that effectively integrate multiple types of omics data to derive biological insights, we have developed Mergeomics, a computational pipeline that integrates multidimensional disease association data with functional genomics and molecular networks to retrieve biological pathways, gene networks, and central regulators critical for disease development. To make the Mergeomics pipeline available to a wider research community, we have implemented an online, user-friendly web server (http://mergeomics.idre.ucla.edu/). The web server features a modular implementation of the Mergeomics pipeline with detailed tutorials. Additionally, it provides curated genomic resources including tissue-specific expression quantitative trait loci, ENCODE functional annotations, biological pathways, and molecular networks, and offers interactive visualization of analytical results. Multiple computational tools including Marker Dependency Filtering (MDF), Marker Set Enrichment Analysis (MSEA), Meta-MSEA, and Weighted Key Driver Analysis (wKDA) can be used separately or in flexible combinations. User-defined summary-level genomic association datasets (e.g., genetic, transcriptomic, epigenomic) related to a particular disease or phenotype can be uploaded and computed in real time to yield biologically interpretable results, which can be viewed online and downloaded for later use. 
Our Mergeomics web server offers researchers flexible and user-friendly tools to facilitate integration of multidimensional data into holistic views of disease mechanisms in the form of tissue-specific key regulators, biological pathways, and gene networks.

  12. Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset.

    PubMed

    Seashore-Ludlow, Brinton; Rees, Matthew G; Cheah, Jaime H; Cokol, Murat; Price, Edmund V; Coletti, Matthew E; Jones, Victor; Bodycombe, Nicole E; Soule, Christian K; Gould, Joshua; Alexander, Benjamin; Li, Ava; Montgomery, Philip; Wawer, Mathias J; Kuru, Nurdan; Kotz, Joanne D; Hon, C Suk-Yee; Munoz, Benito; Liefeld, Ted; Dančík, Vlado; Bittker, Joshua A; Palmer, Michelle; Bradner, James E; Shamji, Alykhan F; Clemons, Paul A; Schreiber, Stuart L

    2015-11-01

    Identifying genetic alterations that prime a cancer cell to respond to a particular therapeutic agent can facilitate the development of precision cancer medicines. Cancer cell-line (CCL) profiling of small-molecule sensitivity has emerged as an unbiased method to assess the relationships between genetic or cellular features of CCLs and small-molecule response. Here, we developed annotated cluster multidimensional enrichment analysis to explore the associations between groups of small molecules and groups of CCLs in a new, quantitative sensitivity dataset. This analysis reveals insights into small-molecule mechanisms of action, and genomic features that associate with CCL response to small-molecule treatment. We are able to recapitulate known relationships between FDA-approved therapies and cancer dependencies and to uncover new relationships, including for KRAS-mutant cancers and neuroblastoma. To enable the cancer community to explore these data, and to generate novel hypotheses, we created an updated version of the Cancer Therapeutic Response Portal (CTRP v2). We present the largest CCL sensitivity dataset yet available, and an analysis method integrating information from multiple CCLs and multiple small molecules to identify CCL response predictors robustly. We updated the CTRP to enable the cancer research community to leverage these data and analyses. ©2015 American Association for Cancer Research.

  13. Video Analytics Evaluation: Survey of Datasets, Performance Metrics and Approaches

    DTIC Science & Technology

    2014-09-01

    training phase and a fusion of the detector outputs. 6.3.1 Training Techniques 1. Bagging: The basic idea of Bagging is to train multiple classifiers...can reduce more noise interesting points. Person detection and background subtraction methods were used to create hot regions. The hot regions were...detection algorithms are incorporated with MHT to construct one integrated detector/tracker. 6.8 IRDS-CASIA team IRDS-CASIA proposed a method to solve a

  14. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data.

    PubMed

    Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M

    2018-06-01

    A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.
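
    The evaluation described above (randomly mask known results, impute them, compare predictions to the measured values) can be sketched in a few lines. This is only an illustration of the masking protocol, not 3D-MICE itself: mean imputation stands in for the imputation method, and dividing the RMSE by the standard deviation of the true values is one common normalisation that the abstract does not confirm. All function names are hypothetical.

```python
import math
import random


def mask_values(series, fraction, rng):
    """Hide a random fraction of the observed entries; return (masked copy, hidden indices)."""
    observed = [i for i, v in enumerate(series) if v is not None]
    k = max(1, round(fraction * len(observed)))
    hidden = sorted(rng.sample(observed, k))
    masked = list(series)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def impute_mean(series):
    """Placeholder imputer (assumption): fill every gap with the mean of the present values."""
    present = [v for v in series if v is not None]
    fill = sum(present) / len(present)
    return [fill if v is None else v for v in series]


def nrmse(truth, predicted):
    """RMSE divided by the population standard deviation of the true values.

    Undefined (division by zero) when all true values are identical.
    """
    n = len(truth)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, predicted)) / n)
    mean = sum(truth) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in truth) / n)
    return rmse / sd


def evaluate(series, fraction=0.2, seed=0):
    """Mask, impute, and score only the entries that were deliberately hidden."""
    masked, hidden = mask_values(series, fraction, random.Random(seed))
    imputed = impute_mean(masked)
    return nrmse([series[i] for i in hidden], [imputed[i] for i in hidden])
```

    Scoring only the deliberately hidden entries matters: results that were missing in the original data have no measured value to compare against, so they are imputed but excluded from the error.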

  15. Hurricane Harvey Riverine Flooding: Part 2: Integration of Heterogeneous Earth Observation Data for Comparative Analysis with High-Resolution Inundation Boundaries Reconstructed from Flood2D-GPU Model

    NASA Astrophysics Data System (ADS)

    Jackson, C.; Sava, E.; Cervone, G.

    2017-12-01

    Hurricane Harvey has been noted as the wettest cyclone on record for the US as well as the most destructive (so far) for the 2017 hurricane season. An entire year's worth of rainfall occurred over the course of a few days. The city of Houston was greatly impacted as the storm lingered over the city for five days, causing a record-breaking 50+ inches of rain as well as severe damage from flooding. Flood model simulations were performed to reconstruct the event in order to better understand, assess, and predict flooding dynamics for the future. Additionally, a number of remote sensing platforms and on-ground instruments that provide near real-time data have also been used for flood identification, monitoring, and damage assessment. Although both flood models and remote sensing techniques are able to identify inundated areas, rapid and accurate flood prediction at a high spatio-temporal resolution remains a challenge. Thus, a methodological approach that fuses the two techniques can help to better validate what is being modeled and observed. Recent advancements in data fusion techniques of remote sensing with near real-time heterogeneous datasets have allowed emergency responders to more efficiently extract increasingly precise and relevant knowledge from the available information. In this work, the use of multiple sources of contributed data, coupled with remotely sensed and open source geospatial datasets, is demonstrated to generate an understanding of potential damage assessment for the floods after Hurricane Harvey in Harris County, Texas. The feasibility of integrating multiple sources at different temporal and spatial resolutions into hydrodynamic models for flood inundation simulations is assessed. Furthermore, the contributed datasets are compared against a reconstructed flood extent generated from the Flood2D-GPU model.

  16. The multiple imputation method: a case study involving secondary data analysis.

    PubMed

    Walani, Salimah R; Cleland, Charles M

    2015-05-01

    To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
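
    When a regression is run on each of several imputed datasets (five in this study), the per-dataset results must be combined into one estimate. The abstract does not state how this was done; the sketch below uses Rubin's rules, the standard pooling formulas for multiple imputation, as an assumption. The function name is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimated on each imputed dataset
    variances: the squared standard error of that coefficient on each dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from at least 2 datasets")

    pooled = mean(estimates)       # average of the point estimates
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # spread of the estimates (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

    The between-imputation term is what separates multiple imputation from single imputation: it carries the uncertainty about the missing values into the final standard error, which is the square root of the returned total variance.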

  17. An assessment of the cultivated cropland class of NLCD 2006 using a multi-source and multi-criteria approach

    USGS Publications Warehouse

    Danielson, Patrick; Yang, Limin; Jin, Suming; Homer, Collin G.; Napton, Darrell

    2016-01-01

    We developed a method that analyzes the quality of the cultivated cropland class mapped in the USA National Land Cover Database (NLCD) 2006. The method integrates multiple geospatial datasets and a Multi Index Integrated Change Analysis (MIICA) change detection method that captures spectral changes to identify the spatial distribution and magnitude of potential commission and omission errors for the cultivated cropland class in NLCD 2006. The majority of the commission and omission errors in NLCD 2006 are in areas where cultivated cropland is not the most dominant land cover type. The errors are primarily attributed to the less accurate training dataset derived from the National Agricultural Statistics Service Cropland Data Layer dataset. In contrast, error rates are low in areas where cultivated cropland is the dominant land cover. Agreement between model-identified commission errors and independently interpreted reference data was high (79%). Agreement was low (40%) for omission error comparison. The majority of the commission errors in the NLCD 2006 cultivated crops were confused with low-intensity developed classes, while the majority of omission errors were from herbaceous and shrub classes. Some errors were caused by inaccurate land cover change from misclassification in NLCD 2001 and the subsequent land cover post-classification process.

  18. ASSISTments Dataset from Multiple Randomized Controlled Experiments

    ERIC Educational Resources Information Center

    Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil

    2016-01-01

    In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…

  19. Wanted dead or alive: A state-space mark-recapture-recovery model incorporating multiple recovery types and state uncertainty

    USGS Publications Warehouse

    Hostetter, Nathan; Gardner, Beth; Evans, Allen F.; Cramer, Bradley M.; Payton, Quinn; Collis, Ken; Roby, Daniel D.

    2017-01-01

    We developed a state-space mark-recapture-recovery model that incorporates multiple recovery types and state uncertainty to estimate survival of an anadromous fish species. We apply the model to a dataset of out-migrating juvenile steelhead trout (Oncorhynchus mykiss) tagged with passive integrated transponders, recaptured during outmigration, and recovered on bird colonies in the Columbia River basin (2008-2014). Recoveries on bird colonies are often ignored in survival studies because the river reach of mortality is often unknown, which we model as a form of state uncertainty. Median outmigration survival from release to the lower river (river kilometer 729 to 75) ranged from 0.27 to 0.35, depending on year. Recovery probabilities were frequently >0.20 in the first river reach following tagging, indicating that one out of five fish that died in that reach was recovered on a bird colony. Integrating dead recovery data provided increased parameter precision, estimation of where birds consumed fish, and survival estimates across larger spatial scales. More generally, these modeling approaches provide a flexible framework to integrate multiple sources of tag recovery data into mark-recapture studies.

  20. From data to knowledge: The future of multi-omics data analysis for the rhizosphere

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Allen White, Richard; Borkum, Mark I.; Rivas-Ubach, Albert

    The rhizosphere is the interface between a plant's roots and its surrounding soil. The rhizosphere microbiome, a complex microbial ecosystem, nourishes the terrestrial biosphere. Integrated multi-omics is a modern approach to systems biology that analyzes and interprets the datasets of multiple -omes of both individual organisms and multi-organism communities and consortia. The successful usage and application of integrated multi-omics to rhizospheric science is predicated upon the availability of rhizosphere-specific data, metadata and software. This review analyzes the availability of multi-omics data, metadata and software for rhizospheric science, identifying potential issues, challenges and opportunities.

  1. An integrated approach for identifying wrongly labelled samples when performing classification in microarray data.

    PubMed

    Leung, Yuk Yee; Chang, Chun Qi; Hung, Yeung Sam

    2012-01-01

    Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than those from performing the two tasks independently. Yet, for some microarray datasets, both classification accuracy and stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own. We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) based on the same synthetic datasets. MFMW-outlier gave better average precision and recall values on three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the 'wrong' (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.

  2. SchizConnect: Mediating Neuroimaging Databases on Schizophrenia and Related Disorders for Large-Scale Integration

    PubMed Central

    Wang, Lei; Alpert, Kathryn I.; Calhoun, Vince D.; Cobia, Derin J.; Keator, David B.; King, Margaret D.; Kogan, Alexandr; Landis, Drew; Tallis, Marcelo; Turner, Matthew D.; Potkin, Steven G.; Turner, Jessica A.; Ambite, Jose Luis

    2015-01-01

    SchizConnect (www.schizconnect.org) is built to address the issues of multiple data repositories in schizophrenia neuroimaging studies. It includes a level of mediation (translating across data sources) so that the user can place one query, e.g. for diffusion images from male individuals with schizophrenia, and find out from across participating data sources how many datasets there are, as well as download the imaging and related data. The current version handles the Data Usage Agreements across different studies, as well as interpreting database-specific terminologies into a common framework. New data repositories can also be mediated to bring immediate access to existing datasets. Compared with centralized, upload data sharing models, SchizConnect is a unique, virtual database with a focus on schizophrenia and related disorders that can mediate live data as information is being updated at each data source. It is our hope that SchizConnect can facilitate testing new hypotheses through aggregated datasets, promoting discovery related to the mechanisms underlying schizophrenic dysfunction. PMID:26142271

  3. PIVOT: platform for interactive analysis and visualization of transcriptomics data.

    PubMed

    Zhu, Qin; Fisher, Stephen A; Dueck, Hannah; Middleton, Sarah; Khaladkar, Mugdha; Kim, Junhyong

    2018-01-05

    Many R packages have been developed for transcriptome analysis but their use often requires familiarity with R and integrating results of different packages requires scripts to wrangle the datatypes. Furthermore, exploratory data analyses often generate multiple derived datasets such as data subsets or data transformations, which can be difficult to track. Here we present PIVOT, an R-based platform that wraps open source transcriptome analysis packages with a uniform user interface and graphical data management that allows non-programmers to interactively explore transcriptomics data. PIVOT supports more than 40 popular open source packages for transcriptome analysis and provides an extensive set of tools for statistical data manipulations. A graph-based visual interface is used to represent the links between derived datasets, allowing easy tracking of data versions. PIVOT further supports automatic report generation, publication-quality plots, and program/data state saving, such that all analysis can be saved, shared and reproduced. PIVOT will allow researchers with broad background to easily access sophisticated transcriptome analysis tools and interactively explore transcriptome datasets.

  4. Identifying candidate drivers of drug response in heterogeneous cancer by mining high throughput genomics data.

    PubMed

    Nabavi, Sheida

    2016-08-15

    With advances in technologies, huge amounts of multiple types of high-throughput genomics data are available. These data have tremendous potential to identify new and clinically valuable biomarkers to guide the diagnosis, assessment of prognosis, and treatment of complex diseases, such as cancer. Integrating, analyzing, and interpreting big and noisy genomics data to obtain biologically meaningful results, however, remains highly challenging. Mining genomics datasets by utilizing advanced computational methods can help to address these issues. To facilitate the identification of a short list of biologically meaningful genes as candidate drivers of anti-cancer drug resistance from an enormous amount of heterogeneous data, we employed statistical machine-learning techniques and integrated genomics datasets. We developed a computational method that integrates gene expression, somatic mutation, and copy number aberration data of sensitive and resistant tumors. In this method, an integrative approach based on module network analysis is applied to identify potential driver genes. This is followed by cross-validation and a comparison of the results of the sensitive and resistant groups to obtain the final list of candidate biomarkers. We applied this method to the ovarian cancer data from The Cancer Genome Atlas. The final result contains biologically relevant genes, such as COL11A1, which has been reported as a cis-platinum resistance biomarker for epithelial ovarian carcinoma in several recent studies. The described method yields a short list of aberrant genes that also control the expression of their co-regulated genes. The results suggest that the unbiased data-driven computational method can identify biologically relevant candidate biomarkers. It can be utilized in a wide range of applications that compare two conditions with highly heterogeneous datasets.

  5. De-identification of health records using Anonym: effectiveness and robustness across datasets.

    PubMed

    Zuccon, Guido; Kotzur, Daniel; Nguyen, Anthony; Bergheim, Anton

    2014-07-01

    Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness. The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in type of health records, source of data, and their quality, with one of the datasets containing optical character recognition errors. Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training. Findings show that Anonym compares to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations of training size, data type and quality in presence of sufficient training data. Crown Copyright © 2014. Published by Elsevier B.V. All rights reserved.

  6. integIRTy: a method to identify genes altered in cancer by accounting for multiple mechanisms of regulation using item response theory.

    PubMed

    Tong, Pan; Coombes, Kevin R

    2012-11-15

    Identifying genes altered in cancer plays a crucial role in both understanding the mechanism of carcinogenesis and developing novel therapeutics. It is known that there are various mechanisms of regulation that can lead to gene dysfunction, including copy number change, methylation, abnormal expression, mutation and so on. Nowadays, all these types of alterations can be simultaneously interrogated by different types of assays. Although many methods have been proposed to identify altered genes from a single assay, there is no method that can deal with multiple assays accounting for different alteration types systematically. In this article, we propose a novel method, integration using item response theory (integIRTy), to identify altered genes by using item response theory that allows integrated analysis of multiple high-throughput assays. When applied to a single assay, the proposed method is more robust and reliable than conventional methods such as Student's t-test or the Wilcoxon rank-sum test. When used to integrate multiple assays, integIRTy can identify novel altered genes that cannot be found by looking at individual assays separately. We applied integIRTy to three public cancer datasets (ovarian carcinoma, breast cancer, glioblastoma) for cross-assay type integration, all of which show encouraging results. The R package integIRTy is available at the web site http://bioinformatics.mdanderson.org/main/OOMPA:Overview. Contact: kcoombes@mdanderson.org. Supplementary data are available at Bioinformatics online.

  7. The I4 Online Query Tool for Earth Observations Data

    NASA Technical Reports Server (NTRS)

    Stefanov, William L.; Vanderbloemen, Lisa A.; Lawrence, Samuel J.

    2015-01-01

    The NASA Earth Observation System Data and Information System (EOSDIS) delivers an average of 22 terabytes per day of data collected by orbital and airborne sensor systems to end users through an integrated online search environment (the Reverb/ECHO system). Earth observations data collected by sensors on the International Space Station (ISS) are not currently included in the EOSDIS system, and are only accessible through various individual online locations. This increases the effort required by end users to query multiple datasets, and limits the opportunity for data discovery and innovations in analysis. The Earth Science and Remote Sensing Unit of the Exploration Integration and Science Directorate at NASA Johnson Space Center has collaborated with the School of Earth and Space Exploration at Arizona State University (ASU) to develop the ISS Instrument Integration Implementation (I4) data query tool to provide end users a clean, simple online interface for querying both current and historical ISS Earth Observations data. The I4 interface is based on the Lunaserv and Lunaserv Global Explorer (LGE) open-source software packages developed at ASU for query of lunar datasets. In order to avoid mirroring existing databases - and the need to continually sync/update those mirrors - our design philosophy is for the I4 tool to be a pure query engine only. Once an end user identifies a specific scene or scenes of interest, I4 transparently takes the user to the appropriate online location to download the data. The tool consists of two public-facing web interfaces. The Map Tool provides a graphic geobrowser environment where the end user can navigate to an area of interest and select single or multiple datasets to query. The Map Tool displays active image footprints for the selected datasets (Figure 1). Selecting a footprint will open a pop-up window that includes a browse image and a link to available image metadata, along with a link to the online location to order or download the actual data. Search results are either delivered in the form of browse images linked to the appropriate online database, similar to the Map Tool, or they may be transferred within the I4 environment for display as footprints in the Map Tool. Datasets searchable through I4 (http://eol.jsc.nasa.gov/I4_tool) currently include: Crew Earth Observations (CEO) cataloged and uncataloged handheld astronaut photography; Sally Ride EarthKAM; Hyperspectral Imager for the Coastal Ocean (HICO); and the ISS SERVIR Environmental Research and Visualization System (ISERV). The ISS is a unique platform in that it will have multiple users over its lifetime, and that no single remote sensing system has a permanent internal or external berth. The open source I4 tool is designed to enable straightforward addition of new datasets as they become available such as ISS-RapidSCAT, Cloud Aerosol Transport System (CATS), and the High Definition Earth Viewing (HDEV) system. Data from other sensor systems, such as those operated by the ISS International Partners or under the auspices of the US National Laboratory program, can also be added to I4 provided sufficient access to enable searching of data or metadata is available. Commercial providers of remotely sensed data from the ISS may be particularly interested in I4 as an additional means of directing potential customers and clients to their products.

  8. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    NASA Astrophysics Data System (ADS)

    Lary, D. J.

    2013-12-01

    A BigData case study is described where multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster using an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data and machine learning. To greatly reduce the development time and enhance the functionality, a high-level language capable of parallel processing (Matlab) has been used. Key considerations for the system are high-speed access to the large data volumes, persistence of those volumes, and a precise process time scheduling capability.

  9. Enabling Data-as-a-Service (DaaS) - Biggest Challenge of Geoscience Australia

    NASA Astrophysics Data System (ADS)

    Bastrakova, I.; Kemp, C.; Car, N. J.

    2016-12-01

    Geoscience Australia (GA) is recognised and respected as the national repository and steward of multiple nationally significant data collections that provides geoscience information, services and capability to the Australian Government, industry and stakeholders. Provision of Data-as-a-Service is both GA's key responsibility and core business. Through the Science First Transformation Program, GA is undergoing a significant rethinking of its data architecture, curation and access to support the Digital Science capability, for which DaaS is both a dependency and the basis of its implementation. DaaS, being a service, means we can deliver its outputs in multiple ways, thus providing users with data on demand in ready-for-consumption forms. We can then reuse prebuilt data constructions to allow self-serviced integration of data underpinned by dynamic query tools. In GA's context, examples of DaaS are the Australian Geoscience Data Cube, the Foundation Spatial Data Framework and data served through several Virtual Laboratories. We have implemented a three-layered architecture for DaaS in order to store and manage the data while honouring the semantics of Scientific Data Models defined by subject matter experts and GA's Enterprise Data Architecture, as well as to retain that delivery flexibility. The foundation layer of DaaS is Canonical Datasets, which are optimised for long-term data stewardship and curation. Data is well structured, standardised, described and audited. All data creation and editing happen within this layer. The middle Data Transformation layer assists with transformation of data from Canonical Datasets to the Data Integration layer. It provides mechanisms for multi-format and multi-technology data transformation. The top Data Integration layer is optimised for data access. Data can be easily reused and repurposed; data formats made available are optimised for scientific computing and adjusted for access by multiple applications, tools and libraries. Moving to DaaS enables GA to increase data alertness, generate new capabilities and be prepared for emerging technological challenges.

  10. Use of Persistent Identifiers to link Heterogeneous Data Systems in the Integrated Earth Data Applications (IEDA) Facility

    NASA Astrophysics Data System (ADS)

    Hsu, L.; Lehnert, K. A.; Carbotte, S. M.; Arko, R. A.; Ferrini, V.; O'hara, S. H.; Walker, J. D.

    2012-12-01

    The Integrated Earth Data Applications (IEDA) facility maintains multiple data systems with a wide range of solid earth data types from the marine, terrestrial, and polar environments. Examples of the different data types include syntheses of ultra-high resolution seafloor bathymetry collected on large collaborative cruises and analytical geochemistry measurements collected by single investigators in small, unique projects. These different data types have historically been channeled into separate, discipline-specific databases with search and retrieval tailored for the specific data type. However, a current major goal is to integrate data from different systems to allow interdisciplinary data discovery and scientific analysis. To increase discovery and access across these heterogeneous systems, IEDA employs several unique IDs, including sample IDs (International Geo Sample Number, IGSN), person IDs (GeoPass ID), funding award IDs (NSF Award Number), cruise IDs (from the Marine Geoscience Data System Expedition Metadata Catalog), dataset IDs (DOIs), and publication IDs (DOIs). These IDs allow linking of a sample registry (System for Earth SAmple Registration), data libraries and repositories (e.g. Geochemical Research Library, Marine Geoscience Data System), integrated synthesis databases (e.g. EarthChem Portal, PetDB), and investigator services (IEDA Data Compliance Tool). The linked systems allow efficient discovery of related data across different levels of granularity. In addition, IEDA data systems maintain links with several external data systems, including digital journal publishers. Links have been established between the EarthChem Portal and ScienceDirect through publication DOIs, returning sample-level objects and geochemical analyses for a particular publication. Linking IEDA-hosted data to digital publications with IGSNs at the sample level and with IEDA-allocated dataset DOIs is under development. As an example, an individual investigator could sign up for a GeoPass account ID, write a proposal to NSF and create a data plan using the IEDA Data Management Plan Tool. Having received the grant, the investigator then collects rock samples on a scientific cruise from dredges and registers the samples with IGSNs. The investigator then performs analytical geochemistry on the samples, and submits the full dataset to the Geochemical Resource Library for a dataset DOI. Finally, the investigator writes an article that is published in ScienceDirect. Knowing any of the following IDs: Investigator GeoPass ID, NSF Award Number, Cruise ID, Sample IGSNs, dataset DOI, or publication DOI, a user would be able to navigate to all samples, datasets, and publications in IEDA and external systems. Use of persistent identifiers to link heterogeneous data systems in IEDA thus increases access, discovery, and proper citation of hard-earned investigator datasets.
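
    The navigation described in the abstract above (start from any one identifier and reach every linked sample, dataset, and publication) can be pictured as a walk over a graph of identifier pairs. The sketch below is only an illustration under that assumption: it is not IEDA's implementation, and every identifier, prefix, and function name in it is invented.

```python
from collections import deque

# Hypothetical identifier pairs, invented for illustration only.
LINKS = [
    ("geopass:investigator-1", "nsf-award:0000001"),
    ("nsf-award:0000001", "cruise:EXAMPLE01"),
    ("cruise:EXAMPLE01", "igsn:SAMPLE0001"),
    ("cruise:EXAMPLE01", "igsn:SAMPLE0002"),
    ("igsn:SAMPLE0001", "dataset-doi:10.0000/example-dataset"),
    ("igsn:SAMPLE0002", "dataset-doi:10.0000/example-dataset"),
    ("dataset-doi:10.0000/example-dataset", "publication-doi:10.0000/example-article"),
]


def build_index(links):
    """Turn identifier pairs into an undirected adjacency map."""
    index = {}
    for a, b in links:
        index.setdefault(a, set()).add(b)
        index.setdefault(b, set()).add(a)
    return index


def related(index, start):
    """Return every identifier reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in index.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return sorted(seen)


index = build_index(LINKS)
for identifier in related(index, "igsn:SAMPLE0001"):
    print(identifier)
```

    Because the links are treated as undirected, starting from a sample, an award number, or a publication DOI reaches the same connected set of records.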

  11. RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection.

    PubMed

    Wu, Ke; Zhang, Kun; Fan, Wei; Edwards, Andrea; Yu, Philip S

    Anomaly detection in streaming data is of high interest in numerous application domains. In this paper, we propose a novel one-class semi-supervised algorithm to detect anomalies in streaming data. Underlying the algorithm is a fast and accurate density estimator implemented by multiple fully randomized space trees (RS-Trees), named RS-Forest. The piecewise constant density estimate of each RS-tree is defined on the tree node into which an instance falls. Each incoming instance in a data stream is scored by the density estimates averaged over all trees in the forest. Two strategies, statistical attribute range estimation of high probability guarantee and dual node profiles for rapid model update, are seamlessly integrated into RS-Forest to systematically address the ever-evolving nature of data streams. We derive the theoretical upper bound for the proposed algorithm and analyze its asymptotic properties via bias-variance decomposition. Empirical comparisons to the state-of-the-art methods on multiple benchmark datasets demonstrate that the proposed method features high detection rate, fast response, and insensitivity to most of the parameter settings. Algorithm implementations and datasets are available upon request.

  12. RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection

    PubMed Central

    Wu, Ke; Zhang, Kun; Fan, Wei; Edwards, Andrea; Yu, Philip S.

    2015-01-01

    Anomaly detection in streaming data is of high interest in numerous application domains. In this paper, we propose a novel one-class semi-supervised algorithm to detect anomalies in streaming data. Underlying the algorithm is a fast and accurate density estimator implemented by multiple fully randomized space trees (RS-Trees), named RS-Forest. The piecewise constant density estimate of each RS-tree is defined on the tree node into which an instance falls. Each incoming instance in a data stream is scored by the density estimates averaged over all trees in the forest. Two strategies, statistical attribute range estimation of high probability guarantee and dual node profiles for rapid model update, are seamlessly integrated into RS-Forest to systematically address the ever-evolving nature of data streams. We derive the theoretical upper bound for the proposed algorithm and analyze its asymptotic properties via bias-variance decomposition. Empirical comparisons to the state-of-the-art methods on multiple benchmark datasets demonstrate that the proposed method features high detection rate, fast response, and insensitivity to most of the parameter settings. Algorithm implementations and datasets are available upon request. PMID:25685112

  13. Geospatial datasets for watershed delineation and characterization used in the Hawaii StreamStats web application

    USGS Publications Warehouse

    Rea, Alan; Skinner, Kenneth D.

    2012-01-01

    The U.S. Geological Survey Hawaii StreamStats application uses an integrated suite of raster and vector geospatial datasets to delineate and characterize watersheds. The geospatial datasets used to delineate and characterize watersheds on the StreamStats website, and the methods used to develop the datasets are described in this report. The datasets for Hawaii were derived primarily from 10 meter resolution National Elevation Dataset (NED) elevation models, and the National Hydrography Dataset (NHD), using a set of procedures designed to enforce the drainage pattern from the NHD into the NED, resulting in an integrated suite of elevation-derived datasets. Additional sources of data used for computing basin characteristics include precipitation, land cover, soil permeability, and elevation-derivative datasets. The report also includes links for metadata and downloads of the geospatial datasets.

  14. On the comparison of the strength of morphological integration across morphometric datasets.

    PubMed

    Adams, Dean C; Collyer, Michael L

    2016-11-01

    Evolutionary morphologists frequently wish to understand the extent to which organisms are integrated, and whether the strength of morphological integration among subsets of phenotypic variables differs among taxa or other groups. However, comparisons of the strength of integration across datasets are difficult, in part because the summary measures that characterize these patterns (RV coefficient and rPLS) are dependent both on sample size and on the number of variables. As a solution to this issue, we propose a standardized test statistic (a z-score) for measuring the degree of morphological integration between sets of variables. The approach is based on a partial least squares analysis of trait covariation, and its permutation-based sampling distribution. Under the null hypothesis of a random association of variables, the method displays a constant expected value and confidence intervals for datasets of differing sample sizes and variable number, thereby providing a consistent measure of integration suitable for comparisons across datasets. A two-sample test is also proposed to statistically determine whether levels of integration differ between datasets, and an empirical example examining cranial shape integration in Mediterranean wall lizards illustrates its use. Some extensions of the procedure are also discussed. © 2016 The Author(s). Evolution © 2016 The Society for the Study of Evolution.
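
    The idea of standardizing an association statistic against its permutation distribution can be sketched as follows. This is not the authors' procedure: as a simplifying assumption, the absolute Pearson correlation between two single variables stands in for the partial least squares statistic, and the function names, permutation count, and data are all hypothetical.

```python
import random
from statistics import mean, stdev


def abs_pearson(x, y):
    """Absolute Pearson correlation; a stand-in for a PLS-based statistic."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_z(x, y, n_perm=999, seed=0):
    """Standardize the observed statistic against its permutation distribution."""
    rng = random.Random(seed)
    observed = abs_pearson(x, y)
    shuffled = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        null.append(abs_pearson(x, shuffled))
    return (observed - mean(null)) / stdev(null)


# Hypothetical data: y is an exact linear function of x.
x = [float(v) for v in range(20)]
y = [2.0 * v + 1.0 for v in x]
print(f"z = {permutation_z(x, y):.2f}")
```

    Dividing by the spread of the permuted values is what makes the score comparable between datasets whose raw statistics sit on different scales because of sample size or number of variables.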

  15. The Spectral Image Processing System (SIPS): Software for integrated analysis of AVIRIS data

    NASA Technical Reports Server (NTRS)

    Kruse, F. A.; Lefkoff, A. B.; Boardman, J. W.; Heidebrecht, K. B.; Shapiro, A. T.; Barloon, P. J.; Goetz, A. F. H.

    1992-01-01

    The Spectral Image Processing System (SIPS) is a software package developed by the Center for the Study of Earth from Space (CSES) at the University of Colorado, Boulder, in response to a perceived need to provide integrated tools for analysis of imaging spectrometer data both spectrally and spatially. SIPS was specifically designed to deal with data from the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) and the High Resolution Imaging Spectrometer (HIRIS), but was tested with other datasets including the Geophysical and Environmental Research Imaging Spectrometer (GERIS), GEOSCAN images, and Landsat TM. SIPS was developed using the 'Interactive Data Language' (IDL). It takes advantage of high speed disk access and fast processors running under the UNIX operating system to provide rapid analysis of entire imaging spectrometer datasets. SIPS allows analysis of single or multiple imaging spectrometer data segments at full spatial and spectral resolution. It also allows visualization and interactive analysis of image cubes derived from quantitative analysis procedures such as absorption band characterization and spectral unmixing. SIPS consists of three modules: SIPS Utilities, SIPS_View, and SIPS Analysis. SIPS version 1.1 is described below.

  16. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mitchell, Hugh D.; Eisfeld, Amie J.; Sims, Amy

    Respiratory infections stemming from influenza viruses and the Severe Acute Respiratory Syndrome corona virus (SARS-CoV) represent a serious public health threat as emerging pandemics. Despite efforts to identify the critical interactions of these viruses with host machinery, the key regulatory events that lead to disease pathology remain poorly targeted with therapeutics. Here we implement an integrated network interrogation approach, in which proteome and transcriptome datasets from infection of both viruses in human lung epithelial cells are utilized to predict regulatory genes involved in the host response. We take advantage of a novel “crowd-based” approach to identify and combine ranking metrics that isolate genes/proteins likely related to the pathogenicity of SARS-CoV and influenza virus. Subsequently, a multivariate regression model is used to compare predicted lung epithelial regulatory influences with data derived from other respiratory virus infection models. We predicted a small set of regulatory factors with conserved behavior for consideration as important components of viral pathogenesis that might also serve as therapeutic targets for intervention. Our results demonstrate the utility of integrating diverse ‘omic datasets to predict and prioritize regulatory features conserved across multiple pathogen infection models.

  17. DoctorEye: A clinically driven multifunctional platform, for accurate processing of tumors in medical images.

    PubMed

    Skounakis, Emmanouil; Farmaki, Christina; Sakkalis, Vangelis; Roniotis, Alexandros; Banitsas, Konstantinos; Graf, Norbert; Marias, Konstantinos

    2010-01-01

    This paper presents a novel, open access interactive platform for 3D medical image analysis, simulation and visualization, focusing on oncology images. The platform was developed through constant interaction and feedback from expert clinicians, integrating a thorough analysis of their requirements, with the ultimate goal of assisting in accurately delineating tumors. It allows clinicians not only to work with a large number of 3D tomographic datasets but also to efficiently annotate multiple regions of interest in the same session. Manual and semi-automatic segmentation techniques combined with integrated correction tools assist in the quick and refined delineation of tumors, while different users can add different components related to oncology, such as tumor growth and simulation algorithms for improving therapy planning. The platform has been tested by different users and over a large number of heterogeneous tomographic datasets to ensure stability, usability, extensibility and robustness, with promising results. The platform, a manual and tutorial videos are available at: http://biomodeling.ics.forth.gr. It is free to use under the GNU General Public License.

  18. Identifying Martian Hydrothermal Sites: Geological Investigation Utilizing Multiple Datasets

    NASA Technical Reports Server (NTRS)

    Dohm, J. M.; Baker, V. R.; Anderson, R. C.; Scott, D. H.; Rice, J. W., Jr.; Hare, T. M.

    2000-01-01

    Comprehensive geological investigations of martian landscapes that may have been modified by magmatic-driven hydrothermal activity, utilizing multiple datasets, will yield prime target sites for future hydrological, mineralogical, and biological investigations.

  19. BIAS: Bioinformatics Integrated Application Software.

    PubMed

    Finak, G; Godin, N; Hallett, M; Pepin, F; Rajabi, Z; Srivastava, V; Tang, Z

    2005-04-15

    We introduce a development platform especially tailored to Bioinformatics research and software development. BIAS (Bioinformatics Integrated Application Software) provides the tools necessary for carrying out integrative Bioinformatics research requiring multiple datasets and analysis tools. It follows an object-relational strategy for providing persistent objects, allows third-party tools to be easily incorporated within the system and supports standards and data-exchange protocols common to Bioinformatics. BIAS is an OpenSource project and is freely available to all interested users at http://www.mcb.mcgill.ca/~bias/. This website also contains a paper containing a more detailed description of BIAS and a sample implementation of a Bayesian network approach for the simultaneous prediction of gene regulation events and of mRNA expression from combinations of gene regulation events. hallett@mcb.mcgill.ca.

  20. Fuzzy neural network technique for system state forecasting.

    PubMed

    Li, Dezhi; Wang, Wilson; Ismail, Fathy

    2013-10-01

    In many system state forecasting applications, the prediction is performed based on multiple datasets, each corresponding to a distinct system condition. The traditional methods dealing with multiple datasets (e.g., vector autoregressive moving average models and neural networks) have some shortcomings, such as limited modeling capability and opaque reasoning operations. To tackle these problems, a novel fuzzy neural network (FNN) is proposed in this paper to effectively extract information from multiple datasets, so as to improve forecasting accuracy. The proposed predictor combines autoregressive (AR) node modeling and nonlinear node modeling; AR models/nodes are used to capture the linear correlation of the datasets, and the nonlinear correlation of the datasets is modeled with nonlinear neuron nodes. A novel particle swarm technique [i.e., Laplace particle swarm (LPS) method] is proposed to facilitate parameter estimation of the predictor and improve modeling accuracy. The effectiveness of the developed FNN predictor and the associated LPS method is verified by a series of tests related to Mackey-Glass data forecast, exchange rate data prediction, and gear system prognosis. Test results show that the developed FNN predictor and the LPS method can capture the dynamics of multiple datasets effectively and track system characteristics accurately.

  1. WISARD: workbench for integrated superfast association studies for related datasets.

    PubMed

    Lee, Sungyoung; Choi, Sungkyoung; Qiao, Dandi; Cho, Michael; Silverman, Edwin K; Park, Taesung; Won, Sungho

    2018-04-20

    Mendelian transmission produces phenotypic and genetic relatedness between family members, giving family-based analytical methods an important role in genetic epidemiological studies, from heritability estimations to genetic association analyses. With the advance in genotyping technologies, whole-genome sequence data can be utilized for genetic epidemiological studies, and family-based samples may become more useful for detecting de novo mutations. However, genetic analyses employing family-based samples usually suffer from the complexity of the computational/statistical algorithms, and certain types of family designs, such as incorporating data from extended families, have rarely been used. We present a Workbench for Integrated Superfast Association studies for Related Data (WISARD) programmed in C/C++. WISARD enables fast and comprehensive analysis of SNP-chip and next-generation sequencing data on extended families, with applications from designing genetic studies to summarizing analysis results. In addition, WISARD can automatically be run in a fully multithreaded manner, and the integration of R software for visualization makes it more accessible to non-experts. Comparison with existing toolsets showed that WISARD is computationally suitable for integrated analysis of related subjects, and demonstrated that WISARD outperforms existing toolsets. WISARD has also been successfully utilized to analyze a large-scale sequencing dataset of chronic obstructive pulmonary disease (COPD), and we identified multiple genes associated with COPD, which demonstrates its practical value.

  2. Interactive Visualization and Analysis of Geospatial Data Sets - TrikeND-iGlobe

    NASA Astrophysics Data System (ADS)

    Rosebrock, Uwe; Hogan, Patrick; Chandola, Varun

    2013-04-01

    The visualization of scientific datasets is becoming an ever-increasing challenge as advances in computing technologies have enabled scientists to build high resolution climate models that have produced petabytes of climate data. To interrogate and analyze these large datasets in real-time is a task that pushes the boundaries of computing hardware and software. But integration of climate datasets with geospatial data requires a considerable amount of effort and close familiarity with various data formats and projection systems, which has prevented widespread utilization outside of the climate community. TrikeND-iGlobe is a sophisticated software tool that bridges this gap, allows easy integration of climate datasets with geospatial datasets and provides sophisticated visualization and analysis capabilities. The objective for TrikeND-iGlobe is the continued building of an open source 4D virtual globe application using NASA World Wind technology that integrates analysis of climate model outputs with remote sensing observations as well as demographic and environmental data sets. This will facilitate a better understanding of global and regional phenomena, and the impact analysis of climate extreme events. The critical aim is real-time interactive interrogation. At the data centric level the primary aim is to enable the user to interact with the data in real-time for the purpose of analysis, locally or remotely. TrikeND-iGlobe provides the basis for the incorporation of modular tools that provide extended interactions with the data, including sub-setting, aggregation, re-shaping, time series analysis methods and animation to produce publication-quality imagery. TrikeND-iGlobe may be run locally or can be accessed via a web interface supported by high-performance visualization compute nodes placed close to the data. It supports visualizing heterogeneous data formats: traditional geospatial datasets along with scientific data sets with geographic coordinates (NetCDF, HDF, etc.). It also supports multiple data access mechanisms, including HTTP, FTP, WMS, WCS, and Thredds Data Server (for NetCDF data). For scientific data, TrikeND-iGlobe supports various visualization capabilities, including animations, vector field visualization, etc. TrikeND-iGlobe is a collaborative open-source project; contributors include NASA (ARC-PX), ORNL (Oak Ridge National Laboratory), Unidata, Kansas University, CSIRO CMAR Australia and Geoscience Australia.

  3. An integrated framework for detecting suspicious behaviors in video surveillance

    NASA Astrophysics Data System (ADS)

    Zin, Thi Thi; Tin, Pyke; Hama, Hiromitsu; Toriu, Takashi

    2014-03-01

    In this paper, we propose an integrated framework for detecting suspicious behaviors in video surveillance systems installed in public places such as railway stations, airports, and shopping malls. In particular, people loitering suspiciously, unattended objects left behind, and the exchange of suspicious objects between persons are common security concerns in airports and other transit scenarios. These involve understanding the scene or event, analyzing human movements, recognizing controllable objects, and observing the effect of the human movement on those objects. In the proposed framework, a multiple background modeling technique, a high-level motion feature extraction method, and embedded Markov chain models are integrated for detecting suspicious behaviors in real-time video surveillance systems. Specifically, the proposed framework employs a probability-based multiple background modeling technique to detect moving objects. Then the velocity and distance measures are computed as the high-level motion features of interest. By using an integration of the computed features and the first passage time probabilities of the embedded Markov chain, the suspicious behaviors in video surveillance are analyzed for detecting loitering persons, objects left behind and human interactions such as fighting. The proposed framework has been tested by using standard public datasets and our own video surveillance scenarios.

  4. A DICOM-based 2nd generation Molecular Imaging Data Grid implementing the IHE XDS-i integration profile.

    PubMed

    Lee, Jasper; Zhang, Jianguo; Park, Ryan; Dagliyan, Grant; Liu, Brent; Huang, H K

    2012-07-01

    A Molecular Imaging Data Grid (MIDG) was developed to address current informatics challenges in archival, sharing, search, and distribution of preclinical imaging studies between animal imaging facilities and investigator sites. This manuscript presents a 2nd generation MIDG replacing the Globus Toolkit with a new system architecture that implements the IHE XDS-i integration profile. Implementation and evaluation were conducted using a 3-site interdisciplinary test-bed at the University of Southern California. The 2nd generation MIDG design architecture replaces the initial design's Globus Toolkit with dedicated web services and XML-based messaging for dedicated management and delivery of multi-modality DICOM imaging datasets. The Cross-enterprise Document Sharing for Imaging (XDS-i) integration profile from the field of enterprise radiology informatics was adopted into the MIDG design because streamlined image registration, management, and distribution dataflow are likewise needed in preclinical imaging informatics systems as in enterprise PACS application. Implementation of the MIDG is demonstrated at the University of Southern California Molecular Imaging Center (MIC) and two other sites with specified hardware, software, and network bandwidth. Evaluation of the MIDG involves data upload, download, and fault-tolerance testing scenarios using multi-modality animal imaging datasets collected at the USC Molecular Imaging Center. The upload, download, and fault-tolerance tests of the MIDG were performed multiple times using 12 collected animal study datasets. Upload and download times demonstrated reproducibility and improved real-world performance. Fault-tolerance tests showed that automated failover between Grid Node Servers has minimal impact on normal download times. 
Building upon the 1st generation concepts and experiences, the 2nd generation MIDG system improves accessibility of disparate animal-model molecular imaging datasets to users outside a molecular imaging facility's LAN using a new architecture, dataflow, and dedicated DICOM-based management web services. Productivity and efficiency of preclinical research for translational sciences investigators have been further streamlined for multi-center study data registration, management, and distribution.

  5. Elastic K-means using posterior probability.

    PubMed

    Zheng, Aihua; Jiang, Bo; Li, Yan; Zhang, Xuehan; Ding, Chris

    2017-01-01

    The widely used K-means clustering is a hard clustering algorithm. Here we propose an Elastic K-means clustering model (EKM) using posterior probability with soft capability, where each data point can belong to multiple clusters fractionally, and show the benefit of the proposed Elastic K-means. Furthermore, in many applications, besides vector attribute information, pairwise relations (graph information) are also available. Thus we integrate EKM with Normalized Cut graph clustering into a single clustering formulation. Finally, we provide several matrix inequalities which are useful for matrix formulations of learning models. Based on these results, we prove the correctness and the convergence of EKM algorithms. Experimental results on six benchmark datasets demonstrate the effectiveness of the proposed EKM and its integrated model.
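
    Fractional cluster membership of the kind described above can be illustrated with a generic soft K-means step. This is a minimal sketch, not the EKM formulation from the paper: the softmax-over-squared-distances posterior, the sharpness parameter `beta`, and the function names are all assumptions made for illustration.

```python
import math


def soft_assignments(points, centers, beta=1.0):
    """Return, for each point, a list of membership weights that sum to 1.

    Assumption: membership is a softmax over negative squared Euclidean
    distances, scaled by a hypothetical sharpness parameter beta.
    """
    result = []
    for p in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        d_min = min(d2)  # subtract the minimum for numerical stability
        w = [math.exp(-beta * (d - d_min)) for d in d2]
        total = sum(w)
        result.append([v / total for v in w])
    return result


def update_centers(points, memberships):
    """Recompute each center as the membership-weighted mean of the points."""
    k = len(memberships[0])
    dim = len(points[0])
    centers = []
    for j in range(k):
        weight = sum(m[j] for m in memberships)
        centers.append([
            sum(m[j] * p[i] for m, p in zip(memberships, points)) / weight
            for i in range(dim)
        ])
    return centers


if __name__ == "__main__":
    pts = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
    ctr = [[0.0, 0.0], [10.0, 0.0]]
    memb = soft_assignments(pts, ctr)
    print(update_centers(pts, memb))
```

    As `beta` grows, the weights approach the 0/1 assignments of ordinary (hard) K-means; smaller values let a point contribute to several clusters at once.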

  6. Epithelial-mesenchymal transition is associated with a distinct tumor microenvironment including elevation of inflammatory signals and multiple immune checkpoints in lung adenocarcinoma

    PubMed Central

    Lou, Yanyan; Diao, Lixia; Cuentas, Edwin Roger Parra; Denning, Warren L.; Chen, Limo; Fan, Youhong; Byers, Lauren A.; Wang, Jing; Papadimitrakopoulou, Vassiliki; Behrens, Carmen; Rodriguez, Jaime Canales; Hwu, Patrick; Wistuba, Ignacio I.; Heymach, John V.; Gibbons, Don L.

    2016-01-01

    Purpose Promising results in the treatment of NSCLC have been seen with agents targeting immune checkpoints, such as PD-1 or PD-L1. However, only a select group of patients respond to these interventions. The identification of biomarkers that predict clinical benefit to immune checkpoint blockade is critical to successful clinical translation of these agents. Methods We conducted an integrated analysis of three independent large datasets, including The Cancer Genome Atlas (TCGA) of lung adenocarcinoma and two datasets from MD Anderson Cancer Center, Profiling of Resistance patterns and Oncogenic Signaling Pathways in Evaluation of Cancers of the Thorax (named PROSPECT) and Biomarker-integrated Approaches of Targeted Therapy for Lung Cancer Elimination (named BATTLE-1). Comprehensive analysis of mRNA gene expression, reverse phase protein array (RPPA), immunohistochemistry and correlation with clinical data were performed. Results Epithelial-mesenchymal transition (EMT) is highly associated with an inflammatory tumor microenvironment in lung adenocarcinoma, independent of tumor mutational burden. We found immune activation co-existent with elevation of multiple targetable immune checkpoint molecules, including PD-L1, PD-L2, PD-1, TIM-3, B7-H3, BTLA and CTLA-4, along with increases in tumor infiltration by CD4+Foxp3+ regulatory T cells in lung adenocarcinomas that displayed an EMT phenotype. Furthermore, we identify B7-H3 as a prognostic marker for NSCLC. Conclusions The strong association between EMT status and an inflammatory tumor microenvironment with elevation of multiple targetable immune checkpoint molecules warrants further investigation of using EMT as a predictive biomarker for immune checkpoint blockade agents and other immunotherapies in NSCLC and possibly a broad range of other cancers. PMID:26851185

  7. Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying.

    PubMed

    Masseroli, Marco; Kaitoua, Abdulrahman; Pinoli, Pietro; Ceri, Stefano

    2016-12-01

    While a huge amount of (epi)genomic data of multiple types is becoming available by using Next Generation Sequencing (NGS) technologies, the most important emerging problem is the so-called tertiary analysis, concerned with sense making, e.g., discovering how different (epi)genomic regions and their products interact and cooperate with each other. We propose a paradigm shift in tertiary analysis, based on the use of the Genomic Data Model (GDM), a simple data model which links genomic feature data to their associated experimental, biological and clinical metadata. GDM encompasses all the data formats which have been produced for feature extraction from (epi)genomic datasets. We specifically describe the mapping to GDM of SAM (Sequence Alignment/Map), VCF (Variant Call Format), NARROWPEAK (for called peaks produced by NGS ChIP-seq or DNase-seq methods), and BED (Browser Extensible Data) formats, but GDM also supports all the formats describing experimental datasets (e.g., including copy number variations, DNA somatic mutations, or gene expressions) and annotations (e.g., regarding transcription start sites, genes, enhancers or CpG islands). We downloaded and integrated samples of all the above-mentioned data types and formats from multiple sources. The GDM is able to homogeneously describe semantically heterogeneous data and lays the groundwork for data interoperability, e.g., achieved through the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data. The combined use of the data model and the query language allows comprehensive processing of multiple heterogeneous data, and supports the development of domain-specific data-driven computations and bio-molecular knowledge discovery. Copyright © 2016 Elsevier Inc. All rights reserved.

  8. Integration of multi-omics data for integrative gene regulatory network inference.

    PubMed

    Zarayeneh, Neda; Ko, Euiseong; Oh, Jung Hun; Suh, Sang; Liu, Chunyu; Gao, Jean; Kim, Donghyun; Kang, Mingon

    2017-01-01

    Gene regulatory networks provide comprehensive insights and in-depth understanding of complex biological processes. In most research, the molecular interactions of gene regulatory networks are inferred from a single type of genomic data, e.g., gene expression data. However, gene expression is a product of sequential interactions of multiple biological processes, such as DNA sequence variations, copy number variations, histone modifications, transcription factors, and DNA methylations. The recent rapid advances of high-throughput omics technologies enable one to measure multiple types of omics data, called 'multi-omics data', that represent the various biological processes. In this paper, we propose an Integrative Gene Regulatory Network inference method (iGRN) that incorporates multi-omics data and their interactions in gene regulatory networks. In addition to gene expressions, copy number variations and DNA methylations were considered for multi-omics data in this paper. Intensive experiments were carried out with simulation data, in which iGRN's ability to infer the integrative gene regulatory network was assessed. In these experiments, iGRN shows better performance on model representation and interpretation than other integrative methods in gene regulatory network inference. iGRN was also applied to a human brain dataset of psychiatric disorders, and the biological network of psychiatric disorders was analysed.

  9. Network Analysis of Epidermal Growth Factor Signaling Using Integrated Genomic, Proteomic and Phosphorylation Data

    PubMed Central

    Waters, Katrina M.; Liu, Tao; Quesenberry, Ryan D.; Willse, Alan R.; Bandyopadhyay, Somnath; Kathmann, Loel E.; Weber, Thomas J.; Smith, Richard D.; Wiley, H. Steven; Thrall, Brian D.

    2012-01-01

    To understand how integration of multiple data types can help decipher cellular responses at the systems level, we analyzed the mitogenic response of human mammary epithelial cells to epidermal growth factor (EGF) using whole genome microarrays, mass spectrometry-based proteomics and large-scale western blots with over 1000 antibodies. A time course analysis revealed significant differences in the expression of 3172 genes and 596 proteins, including protein phosphorylation changes measured by western blot. Integration of these disparate data types showed that each contributed qualitatively different components to the observed cell response to EGF and that varying degrees of concordance in gene expression and protein abundance measurements could be linked to specific biological processes. Networks inferred from individual data types were relatively limited, whereas networks derived from the integrated data recapitulated the known major cellular responses to EGF and exhibited more highly connected signaling nodes than networks derived from any individual dataset. While cell cycle regulatory pathways were altered as anticipated, we found the most robust response to mitogenic concentrations of EGF was induction of matrix metalloprotease cascades, highlighting the importance of the EGFR system as a regulator of the extracellular environment. These results demonstrate the value of integrating multiple levels of biological information to more accurately reconstruct networks of cellular response. PMID:22479638

  10. Integration of multi-omics data for integrative gene regulatory network inference

    PubMed Central

    Zarayeneh, Neda; Ko, Euiseong; Oh, Jung Hun; Suh, Sang; Liu, Chunyu; Gao, Jean; Kim, Donghyun

    2017-01-01

    Gene regulatory networks provide comprehensive insights and in-depth understanding of complex biological processes. In most research, the molecular interactions of gene regulatory networks are inferred from a single type of genomic data, e.g., gene expression data. However, gene expression is a product of sequential interactions of multiple biological processes, such as DNA sequence variations, copy number variations, histone modifications, transcription factors, and DNA methylations. The recent rapid advances of high-throughput omics technologies enable one to measure multiple types of omics data, called ‘multi-omics data’, that represent the various biological processes. In this paper, we propose an Integrative Gene Regulatory Network inference method (iGRN) that incorporates multi-omics data and their interactions in gene regulatory networks. In addition to gene expressions, copy number variations and DNA methylations were considered for multi-omics data in this paper. Intensive experiments were carried out with simulation data, in which iGRN’s ability to infer the integrative gene regulatory network was assessed. In these experiments, iGRN shows better performance on model representation and interpretation than other integrative methods in gene regulatory network inference. iGRN was also applied to a human brain dataset of psychiatric disorders, and the biological network of psychiatric disorders was analysed. PMID:29354189

  11. Integrating heterogeneous earth observation data for assessment of high-resolution inundation boundaries generated during flood emergencies.

    NASA Astrophysics Data System (ADS)

    Sava, E.; Cervone, G.; Kalyanapu, A. J.; Sampson, K. M.

    2017-12-01

    The increasing trend in flooding events, paired with rapid urbanization and an aging infrastructure, is projected to enhance the risk of catastrophic losses and increase the frequency of both flash and large area floods. During such events, it is critical for decision makers and emergency responders to have access to timely actionable knowledge regarding preparedness, emergency response, and recovery before, during and after a disaster. Large volumes of data derived from sophisticated sensors, mobile phones, and social media feeds are increasingly being used to improve citizen services and provide clues to the best way to respond to emergencies through the use of visualization and GIS mapping. Such data, coupled with recent advancements in techniques for fusing remote sensing with near real time heterogeneous datasets, have allowed decision makers to more efficiently extract precise and relevant knowledge and better understand how damage caused by disasters has real time effects on urban populations. This research assesses the feasibility of integrating multiple sources of contributed data into hydrodynamic models for flood inundation simulation and damage assessment. It integrates multiple sources of high-resolution physiographic data such as satellite remote sensing imagery coupled with non-authoritative data such as Civil Air Patrol (CAP) and 'during-event' social media observations of flood inundation in order to improve flood mapping. The goal is to augment remote sensing imagery with new open-source datasets to generate flood extent maps at higher temporal and spatial resolution. The proposed methodology is applied to two test cases, the 2013 Boulder, Colorado flood and the 2015 floods in Texas.

  12. A prior-based integrative framework for functional transcriptional regulatory network inference

    PubMed Central

    Siahpirani, Alireza F.

    2017-01-01

    Transcriptional regulatory networks specify regulatory proteins controlling the context-specific expression levels of genes. Inference of genome-wide regulatory networks is central to understanding gene regulation, but remains an open challenge. Expression-based network inference is among the most popular methods to infer regulatory networks; however, networks inferred from such methods have low overlap with experimentally derived (e.g. ChIP-chip and transcription factor (TF) knockouts) networks. Currently we have a limited understanding of this discrepancy. To address this gap, we first develop a regulatory network inference algorithm, based on probabilistic graphical models, to integrate expression with auxiliary datasets supporting a regulatory edge. Second, we comprehensively analyze our and other state-of-the-art methods on different expression perturbation datasets. Networks inferred by integrating sequence-specific motifs with expression have substantially greater agreement with experimentally derived networks, while remaining more predictive of expression than motif-based networks. Our analysis suggests natural genetic variation as the most informative perturbation for network inference, and identifies core TFs whose targets are predictable from expression. Multiple reasons make the identification of targets of other TFs difficult, including network architecture and insufficient variation of TF mRNA level. Finally, we demonstrate the utility of our inference algorithm to infer stress-specific regulatory networks and for regulator prioritization. PMID:27794550

  13. Image-Based Multi-Target Tracking through Multi-Bernoulli Filtering with Interactive Likelihoods.

    PubMed

    Hoak, Anthony; Medeiros, Henry; Povinelli, Richard J

    2017-03-03

    We develop an interactive likelihood (ILH) for sequential Monte Carlo (SMC) methods for image-based multiple target tracking applications. The purpose of the ILH is to improve tracking accuracy by reducing the need for data association. In addition, we integrate a recently developed deep neural network for pedestrian detection along with the ILH with a multi-Bernoulli filter. We evaluate the performance of the multi-Bernoulli filter with the ILH and the pedestrian detector in a number of publicly available datasets (2003 PETS INMOVE, Australian Rules Football League (AFL) and TUD-Stadtmitte) using standard, well-known multi-target tracking metrics (optimal sub-pattern assignment (OSPA) and classification of events, activities and relationships for multi-object trackers (CLEAR MOT)). In all datasets, the ILH term increases the tracking accuracy of the multi-Bernoulli filter.
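
    For readers unfamiliar with the OSPA metric named above, the following sketch computes it for small point sets by brute-force search over assignments. It follows the commonly used definition with a cut-off `c` and order `p`; the function name and the default parameter values are assumptions for illustration, and a practical implementation would use an assignment solver instead of enumerating permutations.

```python
import math
from itertools import permutations


def ospa(estimates, truths, c=1.0, p=2):
    """Optimal sub-pattern assignment distance between two point sets.

    Brute-force version, suitable only for a handful of targets.
    c is the cut-off distance and p the order of the metric.
    """
    small, large = estimates, truths
    if len(small) > len(large):
        small, large = large, small
    m, n = len(small), len(large)
    if n == 0:
        return 0.0  # both sets are empty
    # Best assignment of the smaller set to distinct members of the larger set.
    best = min(
        sum(min(c, math.dist(x, large[j])) ** p for x, j in zip(small, perm))
        for perm in permutations(range(n), m)
    )
    # Each unmatched point contributes the cut-off penalty c**p.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


if __name__ == "__main__":
    tracks = [(0.0, 0.0), (1.0, 0.0)]
    truth = [(1.0, 0.0), (0.0, 0.0), (5.0, 5.0)]
    print(ospa(tracks, truth, c=2.0, p=1))
```

    The single score penalizes both localization error (the matched distances) and cardinality error (missed or spurious targets), which is why it is convenient for comparing trackers that may report different numbers of targets.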

  14. Image-Based Multi-Target Tracking through Multi-Bernoulli Filtering with Interactive Likelihoods

    PubMed Central

    Hoak, Anthony; Medeiros, Henry; Povinelli, Richard J.

    2017-01-01

    We develop an interactive likelihood (ILH) for sequential Monte Carlo (SMC) methods for image-based multiple target tracking applications. The purpose of the ILH is to improve tracking accuracy by reducing the need for data association. In addition, we integrate a recently developed deep neural network for pedestrian detection along with the ILH with a multi-Bernoulli filter. We evaluate the performance of the multi-Bernoulli filter with the ILH and the pedestrian detector in a number of publicly available datasets (2003 PETS INMOVE, Australian Rules Football League (AFL) and TUD-Stadtmitte) using standard, well-known multi-target tracking metrics (optimal sub-pattern assignment (OSPA) and classification of events, activities and relationships for multi-object trackers (CLEAR MOT)). In all datasets, the ILH term increases the tracking accuracy of the multi-Bernoulli filter. PMID:28273796

  15. A large dataset of protein dynamics in the mammalian heart proteome.

    PubMed

    Lau, Edward; Cao, Quan; Ng, Dominic C M; Bleakley, Brian J; Dincer, T Umut; Bot, Brian M; Wang, Ding; Liem, David A; Lam, Maggie P Y; Ge, Junbo; Ping, Peipei

    2016-03-15

    Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding of the mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegeneration. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-life of 3,228 and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems.

  16. Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning.

    PubMed

    Adhikari, Badri; Hou, Jie; Cheng, Jianlin

    2018-03-01

    In this study, we report the evaluation of the residue-residue contacts predicted by our three different methods in the CASP12 experiment, focusing on studying the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignment to derive coevolution-based features, which are integrated by a neural network method to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions. The comparison of the three methods shows that the quality and effective depth of multiple sequence alignments, coevolution-based features, and machine learning integration of coevolution-based features and traditional features drive the quality of predicted protein contacts. On the full CASP12 dataset, the coevolution-based features alone can improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises the precision to 56.3%, when top L/5 predicted long-range contacts are evaluated. And the correlation between the precision of contact prediction and the logarithm of the number of effective sequences in alignments is 0.66. © 2017 Wiley Periodicals, Inc.
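
    The "top L/5 long-range" precision quoted above can be illustrated with a short sketch. It assumes the common CASP convention that a long-range pair is separated by at least 24 residues in sequence; the abstract does not state the threshold, and the function name and inputs below are hypothetical, not taken from the MULTICOM software.

```python
def top_l5_long_range_precision(pred, native, min_sep=24):
    """Precision of the top L/5 predicted long-range contacts.

    pred   -- L x L nested list of predicted contact scores (hypothetical input)
    native -- L x L nested list of booleans, True where residues are in contact
    min_sep -- minimum sequence separation for "long-range" (assumed to be 24)
    """
    L = len(pred)
    # Collect every residue pair (i, j) with j - i >= min_sep.
    pairs = [(pred[i][j], i, j) for i in range(L) for j in range(i + min_sep, L)]
    if not pairs:
        return float("nan")  # sequence too short to have long-range pairs
    # Rank pairs by predicted score, highest first, and keep the top L/5.
    pairs.sort(key=lambda t: -t[0])
    top = pairs[: max(1, L // 5)]
    hits = sum(1 for _, i, j in top if native[i][j])
    return hits / len(top)


# Toy example: a 30-residue chain, so the top L/5 = 6 pairs are evaluated.
L = 30
pred = [[0.0] * L for _ in range(L)]
native = [[False] * L for _ in range(L)]
for offset, score in enumerate([0.9, 0.8, 0.7, 0.6, 0.5, 0.4]):
    pred[0][24 + offset] = score
for j in (24, 25, 26):
    native[0][j] = True
precision = top_l5_long_range_precision(pred, native)
```

    In this toy case three of the six top-ranked long-range pairs are native contacts, so the precision is 0.5.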

  17. GFVO: the Genomic Feature and Variation Ontology.

    PubMed

    Baran, Joachim; Durgahee, Bibi Sehnaaz Begum; Eilbeck, Karen; Antezana, Erick; Hoehndorf, Robert; Dumontier, Michel

    2015-01-01

    Falling costs in genomic laboratory experiments have led to a steady increase of genomic feature and variation data. Multiple genomic data formats exist for sharing these data, and whilst they are similar, they address slightly different data viewpoints and are consequently not fully compatible with each other. The fragmentation of data format specifications makes it hard to integrate and interpret data for further analysis with information from multiple data providers. As a solution, a new ontology is presented here for annotating and representing genomic feature and variation dataset contents. The Genomic Feature and Variation Ontology (GFVO) specifically addresses genomic data as it is regularly shared using the GFF3 (incl. FASTA), GTF, GVF and VCF file formats. GFVO simplifies data integration and enables linking of genomic annotations across datasets through common semantics of genomic types and relations. Availability and implementation. The latest stable release of the ontology is available via its base URI; previous and development versions are available at the ontology's GitHub repository: https://github.com/BioInterchange/Ontologies; versions of the ontology are indexed through BioPortal (without external class-/property-equivalences due to BioPortal release 4.10 limitations); examples and reference documentation are provided on a separate web page: http://www.biointerchange.org/ontologies.html. GFVO version 1.0.2 is licensed under the CC0 1.0 Universal license (https://creativecommons.org/publicdomain/zero/1.0) and therefore de facto within the public domain; the ontology can be appropriated without attribution for commercial and non-commercial use.

  18. Semi-Supervised Multi-View Learning for Gene Network Reconstruction

    PubMed Central

    Ceci, Michelangelo; Pio, Gianvito; Kuzmanovski, Vladimir; Džeroski, Sašo

    2015-01-01

    The task of gene regulatory network reconstruction from high-throughput data is receiving increasing attention in recent years. As a consequence, many inference methods for solving this task have been proposed in the literature. It has been recently observed, however, that no single inference method performs optimally across all datasets. It has also been shown that the integration of predictions from multiple inference methods is more robust and shows high performance across diverse datasets. Inspired by this research, in this paper, we propose a machine learning solution which learns to combine predictions from multiple inference methods. While this approach adds additional complexity to the inference process, we expect it would also carry substantial benefits. These would come from the automatic adaptation to patterns on the outputs of individual inference methods, so that it is possible to identify regulatory interactions more reliably when these patterns occur. This article demonstrates the benefits (in terms of accuracy of the reconstructed networks) of the proposed method, which exploits an iterative, semi-supervised ensemble-based algorithm. The algorithm learns to combine the interactions predicted by many different inference methods in the multi-view learning setting. The empirical evaluation of the proposed algorithm on a prokaryotic model organism (E. coli) and on a eukaryotic model organism (S. cerevisiae) clearly shows improved performance over the state of the art methods. The results indicate that gene regulatory network reconstruction for the real datasets is more difficult for S. cerevisiae than for E. coli. The software, all the datasets used in the experiments and all the results are available for download at the following link: http://figshare.com/articles/Semi_supervised_Multi_View_Learning_for_Gene_Network_Reconstruction/1604827. PMID:26641091

  19. Creating a FIESTA (Framework for Integrated Earth Science and Technology Applications) with MagIC

    NASA Astrophysics Data System (ADS)

    Minnett, R.; Koppers, A. A. P.; Jarboe, N.; Tauxe, L.; Constable, C.

    2017-12-01

    The Magnetics Information Consortium (https://earthref.org/MagIC) has recently developed a containerized web application to considerably reduce the friction in contributing, exploring and combining valuable and complex datasets for the paleo-, geo- and rock magnetic scientific community. The data produced in this scientific domain are inherently hierarchical and the community's evolving approaches to this scientific workflow, from sampling to taking measurements to multiple levels of interpretations, require a large and flexible data model to adequately annotate the results and ensure reproducibility. Historically, contributing such detail in a consistent format has been prohibitively time-consuming and often resulted in only publishing the highly derived interpretations. The new open-source (https://github.com/earthref/MagIC) application provides a flexible upload tool integrated with the data model to easily create a validated contribution and a powerful search interface for discovering datasets and combining them to enable transformative science. MagIC is hosted at EarthRef.org along with several interdisciplinary geoscience databases. A FIESTA (Framework for Integrated Earth Science and Technology Applications) is being created by generalizing MagIC's web application for reuse in other domains. The application relies on a single configuration document that describes the routing, data model, component settings and external services integrations. The container hosts an isomorphic Meteor JavaScript application, MongoDB database and ElasticSearch search engine. Multiple containers can be configured as microservices to serve portions of the application or rely on externally hosted MongoDB, ElasticSearch, or third-party services to efficiently scale computational demands.
FIESTA is particularly well suited for many Earth Science disciplines with its flexible data model, mapping, account management, upload tool to private workspaces, reference metadata, image galleries, full text searches and detailed filters. EarthRef's Seamount Catalog of bathymetry and morphology data, EarthRef's Geochemical Earth Reference Model (GERM) databases, and Oregon State University's Marine and Geology Repository (http://osu-mgr.org) will benefit from custom adaptations of FIESTA.

  20. Quantifying Spatially Integrated Floodplain and Wetland Systems for the Conterminous US

    NASA Astrophysics Data System (ADS)

    Lane, C.; D'Amico, E.; Wing, O.; Bates, P. D.

    2017-12-01

    Wetlands interact with other waters across a variable connectivity continuum, from permanent to transient, from fast to slow, and from primarily surface water to exclusively groundwater flows. Floodplain wetlands typically experience fast and frequent surface and near-surface groundwater interactions with their river networks, leading to an increasing effort to tailor management strategies for these wetlands. Management of floodplain wetlands is contingent on accurate floodplain delineation, and though this has proven challenging, multiple efforts are being made to alleviate this data gap at the conterminous scale using spatial, physical, and hydrological floodplain proxies. In this study, we derived and contrasted floodplain extents using the following nationally available approaches: 1) a geospatial-buffer floodplain proxy (Lane and D'Amico 2016, JAWRA 52(3):705-722), 2) a regionalized flood frequency analysis coupled to a 30m resolution continental-scale hydraulic model (RFFA; Smith et al. 2015, WRR 51:539-553), and 3) a soils-based floodplain analysis (Sangwan and Merwade 2015, JAWRA 51(5):1286-1304). The geospatial approach uses National Wetlands Inventory and buffered National Hydrography Datasets. RFFA estimates extreme flows based on catchment size, regional climatology and upstream annual rainfall and routes these flows through a hydraulic model built with data from USGS HydroSHEDS, NOAA, and the National Elevation Dataset. Soil-based analyses define floodplains based on attributes within the USDA soil-survey data (SSURGO). Nearly 30% (by count) of U.S. freshwater wetlands are located within floodplains with geospatial analyses, contrasted with 37% (soils-based), and 53% (RFFA-based). The dichotomies between approaches are mainly a function of input data-layer resolution, accuracy, coverage, and extent, further discussed in this presentation.
Ultimately, these spatial analyses and findings will improve floodplain and integrated wetland system extent assessment. This will lead to better management of the physically, chemically, and biologically integrated floodplain wetlands affecting the integrity of downstream waterbodies at multiple scales.

  1. Integration of heterogeneous molecular networks to unravel gene-regulation in Mycobacterium tuberculosis.

    PubMed

    van Dam, Jesse C J; Schaap, Peter J; Martins dos Santos, Vitor A P; Suárez-Diez, María

    2014-09-26

    Different methods have been developed to infer regulatory networks from heterogeneous omics datasets and to construct co-expression networks. Each algorithm produces different networks and efforts have been devoted to automatically integrate them into consensus sets. However, each separate set has an intrinsic value that is diluted and partly lost when building a consensus network. Here we present a methodology to generate co-expression networks and, instead of a consensus network, we propose an integration framework where the different networks are kept and analysed with additional tools to efficiently combine the information extracted from each network. We developed a workflow to efficiently analyse information generated by different inference and prediction methods. Our methodology relies on providing the user the means to simultaneously visualise and analyse the coexisting networks generated by different algorithms, heterogeneous datasets, and a suite of analysis tools. As a showcase, we have analysed the gene co-expression networks of Mycobacterium tuberculosis generated using over 600 expression experiments. Regarding DNA damage repair, we identified SigC as a key control element, 12 new targets for LexA, an updated LexA binding motif, and a potential mismatch repair system. We expanded the DevR regulon with 27 genes while identifying 9 targets wrongly assigned to this regulon. We discovered 10 new genes linked to zinc uptake and a new regulatory mechanism for ZuR. The use of co-expression networks to perform system level analysis allows the development of custom-made methodologies. As showcases we implemented a pipeline to integrate ChIP-seq data and another method to uncover multiple regulatory layers.
Our workflow is based on representing the multiple types of information as network representations and presenting these networks in a synchronous framework that allows their simultaneous visualization while keeping specific associations from the different networks. By simultaneously exploring these networks and metadata, we gained insights into regulatory mechanisms in M. tuberculosis that could not be obtained through the separate analysis of each data type.

  2. SchizConnect: Mediating neuroimaging databases on schizophrenia and related disorders for large-scale integration.

    PubMed

    Wang, Lei; Alpert, Kathryn I; Calhoun, Vince D; Cobia, Derin J; Keator, David B; King, Margaret D; Kogan, Alexandr; Landis, Drew; Tallis, Marcelo; Turner, Matthew D; Potkin, Steven G; Turner, Jessica A; Ambite, Jose Luis

    2016-01-01

    SchizConnect (www.schizconnect.org) is built to address the issues of multiple data repositories in schizophrenia neuroimaging studies. It includes a level of mediation--translating across data sources--so that the user can place one query, e.g. for diffusion images from male individuals with schizophrenia, and find out from across participating data sources how many datasets there are, as well as downloading the imaging and related data. The current version handles the Data Usage Agreements across different studies, as well as interpreting database-specific terminologies into a common framework. New data repositories can also be mediated to bring immediate access to existing datasets. Compared with centralized, upload data sharing models, SchizConnect is a unique, virtual database with a focus on schizophrenia and related disorders that can mediate live data as information is being updated at each data source. It is our hope that SchizConnect can facilitate testing new hypotheses through aggregated datasets, promoting discovery related to the mechanisms underlying schizophrenic dysfunction. Copyright © 2015 Elsevier Inc. All rights reserved.

  3. Intellicount: High-Throughput Quantification of Fluorescent Synaptic Protein Puncta by Machine Learning

    PubMed Central

    Fantuzzo, J. A.; Mirabella, V. R.; Zahn, J. D.

    2017-01-01

    Synapse formation analyses can be performed by imaging and quantifying fluorescent signals of synaptic markers. Traditionally, these analyses are done using simple or multiple thresholding and segmentation approaches or by labor-intensive manual analysis by a human observer. Here, we describe Intellicount, a high-throughput, fully-automated synapse quantification program which applies a novel machine learning (ML)-based image processing algorithm to systematically improve region of interest (ROI) identification over simple thresholding techniques. Through processing large datasets from both human and mouse neurons, we demonstrate that this approach allows image processing to proceed independently of carefully set thresholds, thus reducing the need for human intervention. As a result, this method can efficiently and accurately process large image datasets with minimal interaction by the experimenter, making it less prone to bias and less liable to human error. Furthermore, Intellicount is integrated into an intuitive graphical user interface (GUI) that provides a set of valuable features, including automated and multifunctional figure generation, routine statistical analyses, and the ability to run full datasets through nested folders, greatly expediting the data analysis process. PMID:29218324

  4. Data Sharing Reveals Complexity in the Westward Spread of Domestic Animals across Neolithic Turkey

    PubMed Central

    Arbuckle, Benjamin S.; Kansa, Sarah Whitcher; Kansa, Eric; Orton, David; Çakırlar, Canan; Gourichon, Lionel; Atici, Levent; Galik, Alfred; Marciniak, Arkadiusz; Mulville, Jacqui; Buitenhuis, Hijlke; Carruthers, Denise; De Cupere, Bea; Demirergi, Arzu; Frame, Sheelagh; Helmer, Daniel; Martin, Louise; Peters, Joris; Pöllath, Nadja; Pawłowska, Kamilla; Russell, Nerissa; Twiss, Katheryn; Würtenberger, Doris

    2014-01-01

    This study presents the results of a major data integration project bringing together primary archaeozoological data for over 200,000 faunal specimens excavated from seventeen sites in Turkey spanning the Epipaleolithic through Chalcolithic periods, c. 18,000-4,000 cal BC, in order to document the initial westward spread of domestic livestock across Neolithic central and western Turkey. From these shared datasets we demonstrate that the westward expansion of Neolithic subsistence technologies combined multiple routes and pulses but did not involve a set ‘package’ comprising all four livestock species including sheep, goat, cattle and pig. Instead, Neolithic animal economies in the study regions are shown to be more diverse than deduced previously using quantitatively more limited datasets. Moreover, during the transition to agro-pastoral economies interactions between domestic stock and local wild fauna continued. Through publication of datasets with Open Context (opencontext.org), this project emphasizes the benefits of data sharing and web-based dissemination of large primary data sets for exploring major questions in archaeology (Alternative Language Abstract S1). PMID:24927173

  5. Fish and fishery historical data since the 19th century in the Adriatic Sea, Mediterranean

    NASA Astrophysics Data System (ADS)

    Fortibuoni, Tomaso; Libralato, Simone; Arneri, Enrico; Giovanardi, Otello; Solidoro, Cosimo; Raicevich, Saša

    2017-09-01

    Historic data on biodiversity provide the context for present observations and allow studying long-term changes in marine populations. Here we present multiple datasets on fish and fisheries of the Adriatic Sea covering the last two centuries, ranging from qualitative observations to standardised scientific monitoring. The datasets consist of three groups: (1) early naturalists' descriptions of fish fauna, including information (e.g., presence, perceived abundance, size) on 255 fish species for the period 1818-1936; (2) historical landings from major Northern Adriatic fish markets (Venice, Trieste, Rijeka) for the period 1902-1968, Italian official landings for the Northern and Central Adriatic (1953-2012) and landings from the Lagoon of Venice (1945-2001); (3) trawl-survey data from seven surveys spanning the period 1948-1991 and including Catch per Unit of Effort data (kg h^-1 and/or n h^-1) for 956 hauls performed at 301 stations. The integration of these datasets has already proved useful for analysing historical marine community changes over time, and its availability through an open-source data portal will facilitate analyses in the framework of marine historical ecology.

  6. Fish and fishery historical data since the 19th century in the Adriatic Sea, Mediterranean.

    PubMed

    Fortibuoni, Tomaso; Libralato, Simone; Arneri, Enrico; Giovanardi, Otello; Solidoro, Cosimo; Raicevich, Saša

    2017-09-12

    Historic data on biodiversity provide the context for present observations and allow studying long-term changes in marine populations. Here we present multiple datasets on fish and fisheries of the Adriatic Sea covering the last two centuries, ranging from qualitative observations to standardised scientific monitoring. The datasets consist of three groups: (1) early naturalists' descriptions of fish fauna, including information (e.g., presence, perceived abundance, size) on 255 fish species for the period 1818-1936; (2) historical landings from major Northern Adriatic fish markets (Venice, Trieste, Rijeka) for the period 1902-1968, Italian official landings for the Northern and Central Adriatic (1953-2012) and landings from the Lagoon of Venice (1945-2001); (3) trawl-survey data from seven surveys spanning the period 1948-1991 and including Catch per Unit of Effort data (kg h^-1 and/or n h^-1) for 956 hauls performed at 301 stations. The integration of these datasets has already proved useful for analysing historical marine community changes over time, and its availability through an open-source data portal will facilitate analyses in the framework of marine historical ecology.

  7. A multiple-feature and multiple-kernel scene segmentation algorithm for humanoid robot.

    PubMed

    Liu, Zhi; Xu, Shuqiong; Zhang, Yun; Chen, Chun Lung Philip

    2014-11-01

    This technical correspondence presents a multiple-feature and multiple-kernel support vector machine (MFMK-SVM) methodology to achieve a more reliable and robust segmentation performance for humanoid robots. The pixel-wise intensity, gradient, and C1 SMF features are extracted via the local homogeneity model and Gabor filter, and are used as inputs to the MFMK-SVM model. It may provide multiple features of the samples for easier implementation and efficient computation of the MFMK-SVM model. A new clustering method, called the feature validity-interval type-2 fuzzy C-means (FV-IT2FCM) clustering algorithm, is proposed by integrating a type-2 fuzzy criterion in the clustering optimization process to improve the robustness and reliability of clustering results by the iterative optimization. Furthermore, the clustering validity is employed to select the training samples for the learning of the MFMK-SVM model. The MFMK-SVM scene segmentation method is able to fully take advantage of the multiple features of the scene image and the ability of multiple kernels. Experiments on the BSDS dataset and real natural scene images demonstrate the superior performance of our proposed method.

  8. An integrated pan-tropical biomass map using multiple reference datasets.

    PubMed

    Avitabile, Valerio; Herold, Martin; Heuvelink, Gerard B M; Lewis, Simon L; Phillips, Oliver L; Asner, Gregory P; Armston, John; Ashton, Peter S; Banin, Lindsay; Bayol, Nicolas; Berry, Nicholas J; Boeckx, Pascal; de Jong, Bernardus H J; DeVries, Ben; Girardin, Cecile A J; Kearsley, Elizabeth; Lindsell, Jeremy A; Lopez-Gonzalez, Gabriela; Lucas, Richard; Malhi, Yadvinder; Morel, Alexandra; Mitchard, Edward T A; Nagy, Laszlo; Qie, Lan; Quinones, Marcela J; Ryan, Casey M; Ferry, Slik J W; Sunderland, Terry; Laurin, Gaia Vaglio; Gatti, Roberto Cazzolla; Valentini, Riccardo; Verbeeck, Hans; Wijaya, Arief; Willcock, Simon

    2016-04-01

    We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of field observations and locally calibrated high-resolution biomass maps, harmonized and upscaled to 14 477 1-km AGB estimates. Our data fusion approach uses bias removal and weighted linear averaging that incorporates and spatializes the biomass patterns indicated by the reference data. The method was applied independently in areas (strata) with homogeneous error patterns of the input (Saatchi and Baccini) maps, which were estimated from the reference data and additional covariates. Based on the fused map, we estimated AGB stock for the tropics (23.4 N-23.4 S) of 375 Pg dry mass, 9-18% lower than the Saatchi and Baccini estimates. The fused map also showed differing spatial patterns of AGB over large areas, with higher AGB density in the dense forest areas in the Congo basin, Eastern Amazon and South-East Asia, and lower values in Central America and in most dry vegetation areas of Africa than either of the input maps. The validation exercise, based on 2118 estimates from the reference dataset not used in the fusion process, showed that the fused map had a RMSE 15-21% lower than that of the input maps and, most importantly, nearly unbiased estimates (mean bias 5 Mg dry mass ha^-1 vs. 21 and 28 Mg ha^-1 for the input maps). The fusion method can be applied at any scale including the policy-relevant national level, where it can provide improved biomass estimates by integrating existing regional biomass maps as input maps and additional, country-specific reference datasets. © 2015 John Wiley & Sons Ltd.
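
    The idea of bias removal followed by weighted linear averaging within one stratum might be sketched as below. This is only an illustration under stated assumptions: the abstract does not say how the weights were derived, so the sketch assumes weights inversely proportional to each map's residual variance against the reference data, and all function and variable names are hypothetical.

```python
from statistics import mean, pvariance


def fit_stratum(map_a, map_b, ref):
    """Estimate per-map bias and the weight of map A from reference pixels.

    map_a, map_b -- AGB values of the two input maps at the reference pixels
    ref          -- reference AGB values at the same pixels
    Returns (bias_a, bias_b, w_a); the weight of map B is 1 - w_a.
    """
    # Bias: mean difference between each input map and the reference.
    bias_a = mean(a - r for a, r in zip(map_a, ref))
    bias_b = mean(b - r for b, r in zip(map_b, ref))
    # Residual variance after bias removal (assumed basis for the weights).
    var_a = pvariance([a - bias_a - r for a, r in zip(map_a, ref)])
    var_b = pvariance([b - bias_b - r for b, r in zip(map_b, ref)])
    # Inverse-variance weight, written so that zero variances do not divide by zero.
    total = var_a + var_b
    w_a = 0.5 if total == 0 else var_b / total
    return bias_a, bias_b, w_a


def fuse(a, b, bias_a, bias_b, w_a):
    """Fused estimate for one pixel: weighted average of bias-corrected inputs."""
    return w_a * (a - bias_a) + (1 - w_a) * (b - bias_b)


# Toy stratum with four reference pixels (values in Mg/ha, invented for illustration).
ref = [100, 200, 300, 400]
map_a = [130, 210, 330, 410]   # overestimates by 20 on average, noisier
map_b = [95, 185, 285, 395]    # underestimates by 10 on average, less noisy
bias_a, bias_b, w_a = fit_stratum(map_a, map_b, ref)
fused = fuse(150, 120, bias_a, bias_b, w_a)
```

    With these invented numbers the noisier map receives weight 0.2 and the fused value for a pixel with inputs 150 and 120 is 130, since both bias-corrected inputs agree.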

  9. An Integrated Bioinformatics Approach Identifies Elevated Cyclin E2 Expression and E2F Activity as Distinct Features of Tamoxifen Resistant Breast Tumors

    PubMed Central

    Huang, Lei; Zhao, Shuangping; Frasor, Jonna M.; Dai, Yang

    2011-01-01

    Approximately half of estrogen receptor (ER) positive breast tumors will fail to respond to endocrine therapy. Here we used an integrative bioinformatics approach to analyze three gene expression profiling data sets from breast tumors in an attempt to uncover underlying mechanisms contributing to the development of resistance and potential therapeutic strategies to counteract these mechanisms. Genes that are differentially expressed in tamoxifen resistant vs. sensitive breast tumors were identified from three different publicly available microarray datasets. These differentially expressed (DE) genes were analyzed using gene function and gene set enrichment and examined in intrinsic subtypes of breast tumors. The Connectivity Map analysis was utilized to link gene expression profiles of tamoxifen resistant tumors to small molecules and validation studies were carried out in a tamoxifen resistant cell line. Despite little overlap in genes that are differentially expressed in tamoxifen resistant vs. sensitive tumors, a high degree of functional similarity was observed among the three datasets. Tamoxifen resistant tumors displayed enriched expression of genes related to cell cycle and proliferation, as well as elevated activity of E2F transcription factors, and were highly correlated with a Luminal intrinsic subtype. A number of small molecules, including phenothiazines, were found that induced a gene signature in breast cancer cell lines opposite to that found in tamoxifen resistant vs. sensitive tumors and the ability of phenothiazines to down-regulate cyclin E2 and inhibit proliferation of tamoxifen resistant breast cancer cells was validated. Our findings demonstrate that an integrated bioinformatics approach to analyze gene expression profiles from multiple breast tumor datasets can identify important biological pathways and potentially novel therapeutic options for tamoxifen-resistant breast cancers. PMID:21789246

  10. National Geospatial Data Asset Lifecycle Baseline Maturity Assessment for the Federal Geographic Data Committee

    NASA Astrophysics Data System (ADS)

    Peltz-Lewis, L. A.; Blake-Coleman, W.; Johnston, J.; DeLoatch, I. B.

    2014-12-01

    The Federal Geographic Data Committee (FGDC) is designing a portfolio management process for 193 geospatial datasets contained within the 16 topical National Spatial Data Infrastructure themes managed under OMB Circular A-16 "Coordination of Geographic Information and Related Spatial Data Activities." The 193 datasets are designated as National Geospatial Data Assets (NGDA) because of their significance in implementing the missions of multiple levels of government, partners and stakeholders. As a starting point, the data managers of these NGDAs will conduct a baseline maturity assessment of the dataset(s) for which they are responsible. The maturity is measured against benchmarks related to each of the seven stages of the data lifecycle management framework promulgated within the OMB Circular A-16 Supplemental Guidance issued by OMB in November 2010. This framework was developed by the interagency Lifecycle Management Work Group (LMWG), consisting of 16 Federal agencies, under the 2004 Presidential Initiative, the Geospatial Line of Business, using OMB Circular A-130 "Management of Federal Information Resources" as guidance. The seven lifecycle stages are: Define, Inventory/Evaluate, Obtain, Access, Maintain, Use/Evaluate, and Archive. This paper will focus on the Lifecycle Baseline Maturity Assessment, and efforts to integrate the FGDC approach with other data maturity assessments.

  11. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation

    PubMed Central

    Pujar, Shashikant; O’Leary, Nuala A; Farrell, Catherine M; Mudge, Jonathan M; Wallin, Craig; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bult, Carol J; Frankish, Adam; Pruitt, Kim D

    2018-01-01

    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. PMID:29126148

  12. New GES DISC Services Shortening the Path in Science Data Discovery

    NASA Technical Reports Server (NTRS)

    Li, Angela; Shie, Chung-Lin; Petrenko, Maksym; Hegde, Mahabaleshwa; Teng, William; Liu, Zhong; Bryant, Keith; Shen, Suhung; Hearty, Thomas; Wei, Jennifer

    2017-01-01

    The currently available GES DISC services only allow users to select variables from a single dataset at a time, and because too many variables from a dataset are displayed, choosing among them is hard. At the American Geophysical Union (AGU) 2016 Fall Meeting, the Goddard Earth Sciences Data and Information Services Center (GES DISC) unveiled a new service: Datalist. A Datalist is a collection of predefined or user-defined data variables from one or more archived datasets. Our science support team curated predefined Datalists to provide value to the user community. Imagine a novice user who wants to study hurricanes and types "hurricane" in the search box. The first item in the search results is the GES DISC-provided Hurricane Datalist. It contains scientist-recommended variables from multiple datasets such as TRMM, GPM, MERRA, etc. Datalist uses the same architecture as our new website, which also provides one-stop shopping for data, metadata, citation, documentation, visualization and other available services. We implemented Datalist with the new GES DISC web architecture, a single web page that unifies all user interfaces. From that web page, users can find data either by typing in keywords or by browsing by category. It also provides users with a sophisticated integrated data and services package, including metadata, citation, documentation, visualization, and data-specific services, all available in one place.

  13. Next Generation Search Interfaces

    NASA Astrophysics Data System (ADS)

    Roby, W.; Wu, X.; Ly, L.; Goldina, T.

    2015-09-01

    Astronomers are constantly looking for easier ways to access multiple data sets. While much effort is spent on VO, little thought is given to the types of User Interfaces we need to effectively search this sort of data. For instance, an astronomer might need to search Spitzer, WISE, and 2MASS catalogs and images, then see the results presented together in one UI. Moving seamlessly between data sets is key to presenting integrated results. Results need to be viewed using first-class, web-based, integrated FITS viewers, XY Plots, and advanced table display tools. These components should be able to handle very large datasets. To make a powerful web-based UI that can manage and present multiple searches to the user requires taking advantage of many HTML5 features. AJAX is used to start searches and present results. Push notifications (Server Sent Events) monitor background jobs. Canvas is required for advanced result displays. Lesser-known CSS3 technologies make it all flow seamlessly together. At IPAC, we have been developing our Firefly toolkit for several years. We are now using it to solve this multiple data set, multiple queries, and integrated presentation problem to create a powerful research experience. Firefly was created in IRSA, the NASA/IPAC Infrared Science Archive (http://irsa.ipac.caltech.edu). Firefly is the core for applications serving many project archives, including Spitzer, Planck, WISE, PTF, LSST and others. It is also used in IRSA's new Finder Chart and catalog and image displays.

  14. The MetabolomeExpress Project: enabling web-based processing, analysis and transparent dissemination of GC/MS metabolomics datasets.

    PubMed

    Carroll, Adam J; Badger, Murray R; Harvey Millar, A

    2010-07-14

    Standardization of analytical approaches and reporting methods via community-wide collaboration can work synergistically with web-tool development to result in rapid community-driven expansion of online data repositories suitable for data mining and meta-analysis. In metabolomics, the inter-laboratory reproducibility of gas-chromatography/mass-spectrometry (GC/MS) makes it an obvious target for such development. While a number of web-tools offer access to datasets and/or tools for raw data processing and statistical analysis, none of these systems are currently set up to act as a public repository by easily accepting, processing and presenting publicly submitted GC/MS metabolomics datasets for public re-analysis. Here, we present MetabolomeExpress, a new File Transfer Protocol (FTP) server and web-tool for the online storage, processing, visualisation and statistical re-analysis of publicly submitted GC/MS metabolomics datasets. Users may search a quality-controlled database of metabolite response statistics from publicly submitted datasets by a number of parameters (e.g. metabolite, species, organ/biofluid etc.). Users may also perform meta-analysis comparisons of multiple independent experiments or re-analyse public primary datasets via user-friendly tools for t-test, principal components analysis, hierarchical cluster analysis and correlation analysis. They may interact with chromatograms, mass spectra and peak detection results via an integrated raw data viewer. Researchers who register for a free account may upload (via FTP) their own data to the server for online processing via a novel raw data processing pipeline. MetabolomeExpress (https://www.metabolome-express.org) provides a new opportunity for the general metabolomics community to transparently present online the raw and processed GC/MS data underlying their metabolomics publications. Transparent sharing of these data will allow researchers to assess data quality and draw their own insights from published metabolomics datasets.

  15. Elastic K-means using posterior probability

    PubMed Central

    Zheng, Aihua; Jiang, Bo; Li, Yan; Zhang, Xuehan; Ding, Chris

    2017-01-01

    The widely used K-means clustering is a hard clustering algorithm. Here we propose an Elastic K-means clustering model (EKM) that uses posterior probability to provide soft clustering, in which each data point can belong to multiple clusters fractionally, and we show the benefit of the proposed Elastic K-means. Furthermore, in many applications, pairwise relations (graph information) are available in addition to vector attribute information. We therefore integrate EKM with Normalized Cut graph clustering into a single clustering formulation. Finally, we provide several matrix inequalities that are useful for matrix formulations of learning models. Based on these results, we prove the correctness and the convergence of the EKM algorithms. Experimental results on six benchmark datasets demonstrate the effectiveness of the proposed EKM and its integrated model. PMID:29240756
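
    The idea of fractional cluster membership can be illustrated with a generic soft K-means loop. This is only a sketch and not the EKM formulation from the paper, which the abstract does not spell out: it assumes memberships are a softmax over negative squared distances (the posterior under equal-weight isotropic Gaussians), and the names `soft_assignments`, `update_centroids` and `beta`, as well as the sample points, are hypothetical.

```python
import math


def soft_assignments(points, centroids, beta=1.0):
    """Return one list of membership weights (summing to 1) per point.

    Assumption: weights are a softmax over negative squared Euclidean
    distances; `beta` controls how close the result is to hard K-means.
    """
    result = []
    for p in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        nearest = min(d2)  # subtract the minimum for numerical stability
        w = [math.exp(-beta * (d - nearest)) for d in d2]
        total = sum(w)
        result.append([x / total for x in w])
    return result


def update_centroids(points, weights):
    """Recompute each centroid as the membership-weighted mean of all points."""
    k = len(weights[0])
    dim = len(points[0])
    centroids = []
    for j in range(k):
        mass = sum(w[j] for w in weights)
        centroids.append(
            [sum(w[j] * p[i] for p, w in zip(points, weights)) / mass
             for i in range(dim)]
        )
    return centroids


# Hypothetical 2-D points: two loose groups plus one point in between.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0), (2.0, 2.0)]
centroids = [[0.0, 0.0], [5.0, 5.0]]
for _ in range(20):
    weights = soft_assignments(points, centroids)
    centroids = update_centroids(points, weights)

for p, w in zip(points, weights):
    print(p, [round(x, 3) for x in w])
```

    Lowering `beta` spreads each point's membership more evenly across clusters, while a large `beta` approaches the hard assignment of standard K-means.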

  16. Data layer integration for the national map of the united states

    USGS Publications Warehouse

    Usery, E.L.; Finn, M.P.; Starbuck, M.

    2009-01-01

    The integration of geographic data layers in multiple raster and vector formats, from many different organizations and at a variety of resolutions and scales, is a significant problem for The National Map of the United States being developed by the U.S. Geological Survey. Our research has examined data integration from a layer-based approach for five of The National Map data layers: digital orthoimages, elevation, land cover, hydrography, and transportation. An empirical approach has included visual assessment by a set of respondents with statistical analysis to establish the meaning of various types of integration. A separate theoretical approach with established hypotheses tested against actual data sets has resulted in an automated procedure for integration of specific layers and is being tested. The empirical analysis has established resolution bounds on meanings of integration with raster datasets and distance bounds for vector data. The theoretical approach has used a combination of theories on cartographic transformation and generalization, such as Töpfer's radical law, and additional research concerning optimum viewing scales for digital images to establish a set of guiding principles for integrating data of different resolutions.
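
    For readers unfamiliar with the radical law mentioned above, its basic form estimates how many features to retain when a map is generalized to a smaller scale: n_target = n_source * sqrt(M_source / M_target), where M is the scale denominator. The sketch below shows only this basic form; the abstract does not say which variant the authors applied, and the function name and the feature count are hypothetical.

```python
import math


def radical_law(n_source, source_scale_denominator, target_scale_denominator):
    """Basic radical law: number of features to keep at the target scale.

    n_target = n_source * sqrt(M_source / M_target), where M is the scale
    denominator (e.g. 25000 for a 1:25,000 map).
    """
    return n_source * math.sqrt(source_scale_denominator / target_scale_denominator)


# Hypothetical example: 400 features on a 1:25,000 layer generalized to 1:100,000.
print(radical_law(400, 25_000, 100_000))  # sqrt(0.25) = 0.5, so 200.0
```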

  17. Analysing and correcting the differences between multi-source and multi-scale spatial remote sensing observations.

    PubMed

    Dong, Yingying; Luo, Ruisen; Feng, Haikuan; Wang, Jihua; Zhao, Jinling; Zhu, Yining; Yang, Guijun

    2014-01-01

    Analysis results for agriculture monitoring and crop production differ when they are based on remote sensing observations obtained at different spatial scales from multiple remote sensors, even when the observations cover the same time period and are processed by the same algorithms, models or methods. These differences can be described quantitatively from three main aspects: multiple remote sensing observations, crop parameter estimation models, and spatial scale effects of surface parameters. Our research proposed a new method to analyse and correct the differences between multi-source and multi-scale spatial remote sensing surface reflectance datasets, aiming to provide a reference for further studies of agricultural applications that use multiple remotely sensed observations from different sources. The new method was constructed on the basis of the physical and mathematical properties of multi-source and multi-scale reflectance datasets. Statistical theory was used to extract statistical characteristics of multiple surface reflectance datasets and to analyse quantitatively the spatial variations of these characteristics at multiple spatial scales. Then, taking the surface reflectance at the small spatial scale as the baseline data, Gaussian distribution theory was used to correct the multiple surface reflectance datasets, based on the physical characteristics and mathematical distribution properties obtained above and on their spatial variations. The proposed method was verified with two sets of multiple satellite images, obtained in two experimental fields located in Inner Mongolia and Beijing, China, with different degrees of homogeneity of the underlying surfaces. Experimental results indicate that differences between surface reflectance datasets at multiple spatial scales could be effectively corrected over non-homogeneous underlying surfaces, which provides a data basis for further multi-source and multi-scale crop growth monitoring and yield prediction, and for the corresponding consistency analysis and evaluation.

  18. Analysing and Correcting the Differences between Multi-Source and Multi-Scale Spatial Remote Sensing Observations

    PubMed Central

    Dong, Yingying; Luo, Ruisen; Feng, Haikuan; Wang, Jihua; Zhao, Jinling; Zhu, Yining; Yang, Guijun

    2014-01-01

    Analysis results for agriculture monitoring and crop production differ when they are based on remote sensing observations obtained at different spatial scales from multiple remote sensors, even when the observations cover the same time period and are processed by the same algorithms, models or methods. These differences can be described quantitatively from three main aspects: multiple remote sensing observations, crop parameter estimation models, and spatial scale effects of surface parameters. Our research proposed a new method to analyse and correct the differences between multi-source and multi-scale spatial remote sensing surface reflectance datasets, aiming to provide a reference for further studies of agricultural applications that use multiple remotely sensed observations from different sources. The new method was constructed on the basis of the physical and mathematical properties of multi-source and multi-scale reflectance datasets. Statistical theory was used to extract statistical characteristics of multiple surface reflectance datasets and to analyse quantitatively the spatial variations of these characteristics at multiple spatial scales. Then, taking the surface reflectance at the small spatial scale as the baseline data, Gaussian distribution theory was used to correct the multiple surface reflectance datasets, based on the physical characteristics and mathematical distribution properties obtained above and on their spatial variations. The proposed method was verified with two sets of multiple satellite images, obtained in two experimental fields located in Inner Mongolia and Beijing, China, with different degrees of homogeneity of the underlying surfaces. Experimental results indicate that differences between surface reflectance datasets at multiple spatial scales could be effectively corrected over non-homogeneous underlying surfaces, which provides a data basis for further multi-source and multi-scale crop growth monitoring and yield prediction, and for the corresponding consistency analysis and evaluation. PMID:25405760

  19. Multiple Myeloma and Glyphosate Use: A Re-Analysis of US Agricultural Health Study (AHS) Data

    PubMed Central

    Sorahan, Tom

    2015-01-01

    A previous publication of 57,311 pesticide applicators enrolled in the US Agricultural Health Study (AHS) produced disparate findings in relation to multiple myeloma risks in the period 1993–2001 and ever-use of glyphosate (32 cases of multiple myeloma in the full dataset of 54,315 applicators without adjustment for other variables: rate ratio (RR) 1.1, 95% confidence interval (CI) 0.5 to 2.4; 22 cases of multiple myeloma in restricted dataset of 40,719 applicators with adjustment for other variables: RR 2.6, 95% CI 0.7 to 9.4). It seemed important to determine which result should be preferred. RRs for exposed and non-exposed subjects were calculated using Poisson regression; subjects with missing data were not excluded from the main analyses. Using the full dataset adjusted for age and gender, the analysis produced a RR of 1.12 (95% CI 0.50 to 2.49) for ever-use of glyphosate. Additional adjustment for lifestyle factors and use of ten other pesticides had little effect (RR 1.24, 95% CI 0.52 to 2.94). There were no statistically significant trends for multiple myeloma risks in relation to reported cumulative days (or intensity weighted days) of glyphosate use. The doubling of risk reported previously arose from the use of an unrepresentative restricted dataset, and analyses of the full dataset provide no convincing evidence in the AHS for a link between multiple myeloma risk and glyphosate use. PMID:25635915
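
    The rate ratios and intervals quoted above can be sanity-checked with a little arithmetic. A Wald-type interval from a Poisson regression is symmetric on the log scale, so the geometric mean of the two limits should reproduce the point estimate, and the width of the interval gives the standard error of log(RR). The sketch below assumes the reported intervals are of that type, which the abstract does not state; the function names are hypothetical, and the numbers are those quoted in the abstract.

```python
import math


def geometric_midpoint(lower, upper):
    """Point estimate implied by a CI that is symmetric on the log scale."""
    return math.sqrt(lower * upper)


def log_standard_error(lower, upper, z=1.96):
    """Standard error of log(RR) implied by a 95% Wald-type interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# (label, reported RR, lower limit, upper limit), all taken from the abstract.
reported = [
    ("full dataset, unadjusted (previous publication)", 1.1, 0.5, 2.4),
    ("restricted dataset, adjusted (previous publication)", 2.6, 0.7, 9.4),
    ("full dataset, age and gender adjusted", 1.12, 0.50, 2.49),
    ("full dataset, additionally adjusted", 1.24, 0.52, 2.94),
]

for label, rr, lower, upper in reported:
    print(f"{label}: reported RR {rr}, "
          f"implied RR {geometric_midpoint(lower, upper):.2f}, "
          f"SE of log(RR) {log_standard_error(lower, upper):.2f}")
```

    The much larger implied standard error for the restricted dataset reflects its wider interval, which is consistent with the abstract's preference for the full-dataset analyses.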

  20. Multiple myeloma and glyphosate use: a re-analysis of US Agricultural Health Study (AHS) data.

    PubMed

    Sorahan, Tom

    2015-01-28

    A previous publication of 57,311 pesticide applicators enrolled in the US Agricultural Health Study (AHS) produced disparate findings in relation to multiple myeloma risks in the period 1993-2001 and ever-use of glyphosate (32 cases of multiple myeloma in the full dataset of 54,315 applicators without adjustment for other variables: rate ratio (RR) 1.1, 95% confidence interval (CI) 0.5 to 2.4; 22 cases of multiple myeloma in restricted dataset of 40,719 applicators with adjustment for other variables: RR 2.6, 95% CI 0.7 to 9.4). It seemed important to determine which result should be preferred. RRs for exposed and non-exposed subjects were calculated using Poisson regression; subjects with missing data were not excluded from the main analyses. Using the full dataset adjusted for age and gender, the analysis produced a RR of 1.12 (95% CI 0.50 to 2.49) for ever-use of glyphosate. Additional adjustment for lifestyle factors and use of ten other pesticides had little effect (RR 1.24, 95% CI 0.52 to 2.94). There were no statistically significant trends for multiple myeloma risks in relation to reported cumulative days (or intensity weighted days) of glyphosate use. The doubling of risk reported previously arose from the use of an unrepresentative restricted dataset, and analyses of the full dataset provide no convincing evidence in the AHS for a link between multiple myeloma risk and glyphosate use.

  1. Wisdom of crowds for robust gene network inference

    PubMed Central

    Marbach, Daniel; Costello, James C.; Küffner, Robert; Vega, Nicci; Prill, Robert J.; Camacho, Diogo M.; Allison, Kyle R.; Kellis, Manolis; Collins, James J.; Stolovitzky, Gustavo

    2012-01-01

    Reconstructing gene regulatory networks from high-throughput data is a long-standing problem. Through the DREAM project (Dialogue on Reverse Engineering Assessment and Methods), we performed a comprehensive blind assessment of over thirty network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae, and in silico microarray data. We characterize performance, data requirements, and inherent biases of different inference approaches offering guidelines for both algorithm application and development. We observe that no single inference method performs optimally across all datasets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse datasets. Thereby, we construct high-confidence networks for E. coli and S. aureus, each comprising ~1700 transcriptional interactions at an estimated precision of 50%. We experimentally test 53 novel interactions in E. coli, of which 23 were supported (43%). Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks. PMID:22796662
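
    The abstract does not say how predictions from the individual methods were combined. One simple way to build a community prediction, shown here purely as an illustration, is to average the rank that each method gives to every candidate edge and re-sort by the mean rank. The function name, the rule for edges a method did not report, and the placeholder edge labels are all assumptions.

```python
def average_rank_consensus(rankings):
    """Combine several ranked edge lists into one consensus ranking.

    Each ranking lists edges from most to least confident. An edge missing
    from a ranking is given a rank one worse than the longest list
    (an assumption made for this sketch).
    """
    edges = set().union(*rankings)
    worst = max(len(r) for r in rankings) + 1
    positions = [{edge: i + 1 for i, edge in enumerate(r)} for r in rankings]
    mean_rank = {
        edge: sum(p.get(edge, worst) for p in positions) / len(rankings)
        for edge in edges
    }
    # Sort by mean rank; ties are broken by edge label so the result is deterministic.
    return sorted(edges, key=lambda edge: (mean_rank[edge], edge))


# Placeholder (regulator, target) pairs ranked by three hypothetical methods.
method_a = [("tf1", "g1"), ("tf2", "g2"), ("tf3", "g3")]
method_b = [("tf2", "g2"), ("tf1", "g1"), ("tf4", "g4")]
method_c = [("tf1", "g1"), ("tf3", "g3"), ("tf2", "g2")]

consensus = average_rank_consensus([method_a, method_b, method_c])
print(consensus)
```

    Edges that several methods place near the top rise in the consensus, while an edge reported by a single method is pushed down, which is the intuition behind the robustness described in the abstract.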

  2. A large dataset of protein dynamics in the mammalian heart proteome

    PubMed Central

    Lau, Edward; Cao, Quan; Ng, Dominic C.M.; Bleakley, Brian J.; Dincer, T. Umut; Bot, Brian M.; Wang, Ding; Liem, David A.; Lam, Maggie P.Y.; Ge, Junbo; Ping, Peipei

    2016-01-01

    Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding on mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegenerations. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-life of 3,228 and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems. PMID:26977904

  3. Program Implementers' Evaluation of the Project P.A.T.H.S.: Findings Based on Different Datasets over Time

    PubMed Central

    Shek, Daniel T. L.; Ma, Cecilia M. S.

    2012-01-01

    This paper integrates the evaluation findings based on program implementers in nine datasets collected from 2005 to 2009 (244 schools and 7,926 implementers). Using consolidated data with schools as the unit of analysis, results showed that program implementers generally had positive perceptions of the program, themselves, and benefits of the program, with more than four-fifths of the implementers regarding the program as beneficial to the program participants. The subjective outcome evaluation instrument was found to be internally consistent. Multiple regression analyses revealed that perceived qualities of the program and program implementers predicted perceived effectiveness of the program. In conjunction with evaluation findings based on other sources, the present study provides support for the effectiveness of the Tier 1 Program of the Project P.A.T.H.S. (Positive Adolescent Training through Holistic Social Programmes) in Hong Kong. PMID:22629224

  4. A Toolkit for ARB to Integrate Custom Databases and Externally Built Phylogenies

    DOE PAGES

    Essinger, Steven D.; Reichenberger, Erin; Morrison, Calvin; ...

    2015-01-21

    Researchers are perpetually amassing biological sequence data. The computational approaches employed by ecologists for organizing this data (e.g. alignment, phylogeny, etc.) typically scale nonlinearly in execution time with the size of the dataset. This often serves as a bottleneck for processing experimental data since many molecular studies are characterized by massive datasets. To keep up with experimental data demands, ecologists are forced to choose between continually upgrading expensive in-house computer hardware or outsourcing the most demanding computations to the cloud. Outsourcing is attractive since it is the least expensive option, but does not necessarily allow direct user interaction with the data for exploratory analysis. Desktop analytical tools such as ARB are indispensable for this purpose, but they do not necessarily offer a convenient solution for the coordination and integration of datasets between local and outsourced destinations. Therefore, researchers are currently left with an undesirable tradeoff between computational throughput and analytical capability. To mitigate this tradeoff we introduce a software package to leverage the utility of the interactive exploratory tools offered by ARB with the computational throughput of cloud-based resources. Our pipeline serves as middleware between the desktop and the cloud allowing researchers to form local custom databases containing sequences and metadata from multiple resources and a method for linking data outsourced for computation back to the local database. Furthermore, a tutorial implementation of the toolkit is provided in the supporting information, S1 Tutorial.

  5. Here the data: the new FLUXNET collection and the future for model-data integration

    NASA Astrophysics Data System (ADS)

    Papale, D.; Pastorello, G.; Trotta, C.; Chu, H.; Canfora, E.; Agarwal, D.; Baldocchi, D. D.; Torn, M. S.

    2016-12-01

    Seven years after the release of the LaThuile FLUXNET database, widely used in synthesis activities and model-data fusion exercises, a new FLUXNET collection has been released (FLUXNET 2015 - http://fluxnet.fluxdata.org) with the aim of increasing the quality of the measurements and providing high-quality standardized data obtained by a new processing pipeline. The new FLUXNET collection also includes sites with time series of 20 years of continuous carbon and energy fluxes, opening new opportunities for their use in model parameterization and validation. The main characteristics of the FLUXNET 2015 dataset are the uncertainty quantification, the multiple products (e.g. partitioning into photosynthesis and ecosystem respiration) that allow consistency analysis for each site, and new long-term downscaled meteorological data provided with the data. Feedback from new users, in particular from the modelling communities, is crucial to further improve the quality of the products and move in the direction of a coherent integration across multi-disciplinary communities. In this presentation, the new FLUXNET2015 dataset will be explained and explored, with particular focus on the meaning of the different products and variables, their potential and also their limitations. The future development of the dataset will be discussed, along with the role of the regional networks and the ongoing efforts to provide new and advanced services such as near-real-time data provision and a completely open access policy to high-quality standardized measurements of GHG exchanges and additional ecological quantities.

  6. A Toolkit for ARB to Integrate Custom Databases and Externally Built Phylogenies

    PubMed Central

    Essinger, Steven D.; Reichenberger, Erin; Morrison, Calvin; Blackwood, Christopher B.; Rosen, Gail L.

    2015-01-01

    Researchers are perpetually amassing biological sequence data. The computational approaches employed by ecologists for organizing this data (e.g. alignment, phylogeny, etc.) typically scale nonlinearly in execution time with the size of the dataset. This often serves as a bottleneck for processing experimental data since many molecular studies are characterized by massive datasets. To keep up with experimental data demands, ecologists are forced to choose between continually upgrading expensive in-house computer hardware or outsourcing the most demanding computations to the cloud. Outsourcing is attractive since it is the least expensive option, but does not necessarily allow direct user interaction with the data for exploratory analysis. Desktop analytical tools such as ARB are indispensable for this purpose, but they do not necessarily offer a convenient solution for the coordination and integration of datasets between local and outsourced destinations. Therefore, researchers are currently left with an undesirable tradeoff between computational throughput and analytical capability. To mitigate this tradeoff we introduce a software package to leverage the utility of the interactive exploratory tools offered by ARB with the computational throughput of cloud-based resources. Our pipeline serves as middleware between the desktop and the cloud allowing researchers to form local custom databases containing sequences and metadata from multiple resources and a method for linking data outsourced for computation back to the local database. A tutorial implementation of the toolkit is provided in the supporting information, S1 Tutorial. Availability: http://www.ece.drexel.edu/gailr/EESI/tutorial.php. PMID:25607539

  7. Ontology for Transforming Geo-Spatial Data for Discovery and Integration of Scientific Data

    NASA Astrophysics Data System (ADS)

    Nguyen, L.; Chee, T.; Minnis, P.

    2013-12-01

    Discovery of and access to geo-spatial scientific data across heterogeneous repositories and multi-discipline datasets can present challenges for scientists. We propose to build a workflow for transforming geo-spatial datasets into a semantic environment by using relationships to describe the resource using the OWL Web Ontology Language, RDF, and a proposed geo-spatial vocabulary. We will present methods for transforming traditional scientific datasets, the use of a semantic repository, and querying using SPARQL to integrate and access datasets. This unique repository will enable discovery of scientific data by geospatial bound or other criteria.

  8. Management and assimilation of diverse, distributed watershed datasets

    NASA Astrophysics Data System (ADS)

    Varadharajan, C.; Faybishenko, B.; Versteeg, R.; Agarwal, D.; Hubbard, S. S.; Hendrix, V.

    2016-12-01

    The U.S. Department of Energy's (DOE) Watershed Function Scientific Focus Area (SFA) seeks to determine how perturbations to mountainous watersheds (e.g., floods, drought, early snowmelt) impact the downstream delivery of water, nutrients, carbon, and metals over seasonal to decadal timescales. We are building a software platform that enables integration of diverse and disparate field, laboratory, and simulation datasets, of various types including hydrological, geological, meteorological, geophysical, geochemical, ecological and genomic datasets across a range of spatial and temporal scales within the Rifle floodplain and the East River watershed, Colorado. We are using agile data management and assimilation approaches to enable web-based integration of heterogeneous, multi-scale data. Sensor-based observations of water-level, vadose zone and groundwater temperature, water quality, meteorology as well as biogeochemical analyses of soil and groundwater samples have been curated and archived in federated databases. Quality Assurance and Quality Control (QA/QC) are performed on priority datasets needed for on-going scientific analyses, and hydrological and geochemical modeling. Automated QA/QC methods are used to identify and flag issues in the datasets. Data integration is achieved via a brokering service that dynamically integrates data from distributed databases via web services, based on user queries. The integrated results are presented to users in a portal that enables intuitive search, interactive visualization and download of integrated datasets. The concepts, approaches and codes being used are shared across various data science components of various large DOE-funded projects such as the Watershed Function SFA, Next Generation Ecosystem Experiment (NGEE) Tropics, Ameriflux/FLUXNET, and Advanced Simulation Capability for Environmental Management (ASCEM), and together contribute towards DOE's cyberinfrastructure for data management and model-data integration.

  9. TISSUES 2.0: an integrative web resource on mammalian tissue expression

    PubMed Central

    Palasca, Oana; Santos, Alberto; Stolte, Christian; Gorodkin, Jan; Jensen, Lars Juhl

    2018-01-01

    Physiological and molecular similarities between organisms make it possible to translate findings from simpler experimental systems—model organisms—into more complex ones, such as human. This translation facilitates the understanding of biological processes under normal or disease conditions. Researchers aiming to identify the similarities and differences between organisms at the molecular level need resources collecting multi-organism tissue expression data. We have developed a database of gene–tissue associations in human, mouse, rat and pig by integrating multiple sources of evidence: transcriptomics covering all four species and proteomics (human only), manually curated and mined from the scientific literature. Through a scoring scheme, these associations are made comparable across all sources of evidence and across organisms. Furthermore, the scoring produces a confidence score assigned to each of the associations. The TISSUES database (version 2.0) is publicly accessible through a user-friendly web interface and as part of the STRING app for Cytoscape. In addition, we analyzed the agreement between datasets, across and within organisms, and identified that the agreement is mainly affected by the quality of the datasets rather than by the technologies used or organisms compared. Database URL: http://tissues.jensenlab.org/ PMID:29617745

  10. Wyoming Landscape Conservation Initiative data management and integration

    USGS Publications Warehouse

    Latysh, Natalie; Bristol, R. Sky

    2011-01-01

    Six Federal agencies, two State agencies, and two local entities formally support the Wyoming Landscape Conservation Initiative (WLCI) and work together on a landscape scale to manage fragile habitats and wildlife resources amidst growing energy development in southwest Wyoming. The U.S. Geological Survey (USGS) was tasked with implementing targeted research and providing scientific information about southwest Wyoming to inform the development of WLCI habitat enhancement and restoration projects conducted by land management agencies. Many WLCI researchers and decisionmakers representing the Bureau of Land Management, U.S. Fish and Wildlife Service, the State of Wyoming, and others have overwhelmingly expressed the need for a stable, robust infrastructure to promote sharing of data resources produced by multiple entities, including metadata adequately describing the datasets. Descriptive metadata facilitates use of the datasets by users unfamiliar with the data. Agency representatives advocate development of common data handling and distribution practices among WLCI partners to enhance availability of comprehensive and diverse data resources for use in scientific analyses and resource management. The USGS Core Science Informatics (CSI) team is developing and promoting data integration tools and techniques across USGS and partner entity endeavors, including a data management infrastructure to aid WLCI researchers and decisionmakers.

  11. Monitoring glacier change: advances in cross-disciplinary research and data sharing methods

    NASA Astrophysics Data System (ADS)

    Arendt, A. A.; O'Neel, S.; Cogley, G.; Hill, D. F.; Hood, E. W.

    2016-12-01

    Recent studies have emphasized the importance of understanding interactions between glacier change and downstream ecosystems, ocean dynamics and human infrastructure. Despite the need for integrated assessments, few in-situ and remote sensing glacier monitoring studies also collect concurrent data on surrounding systems affected by glacier change. In addition, the sharing of glacier datasets across disciplines has often been hampered by limitations in data sharing technologies and a lack of data standardization. Here we provide an overview of recent efforts to facilitate distribution of glacier inventory/change datasets under the framework provided by the Global Terrestrial Network for Glaciers (GTN-G). New, web accessible data products include glacier thickness data and updated glacier extents from the Randolph Glacier Inventory. We also highlight a 2016 data collection effort led by the US Geological Survey on the Wolverine Glacier watershed, Alaska, USA. A large international team collected glaciological, water quality, snow cover, firn composition, vegetation and freshwater ecology data, using remote sensing/in-situ data and model simulations. We summarize preliminary results and outline our use of cloud-computing technologies to coordinate the integration of complex data types across multiple research teams.

  12. Charged-particle distributions in pp interactions at √s=8TeV measured with the ATLAS detector

    DOE PAGES

    Aad, G.; Abbott, B.; Abdallah, J.; ...

    2016-07-15

    This study presents measurements of distributions of charged particles which are produced in proton–proton collisions at a centre-of-mass energy of √s = 8 TeV and recorded by the ATLAS detector at the LHC. A special dataset recorded in 2012 with a small number of interactions per beam crossing (below 0.004) and corresponding to an integrated luminosity of 160 μb⁻¹ was used. A minimum-bias trigger was utilised to select a data sample of more than 9 million collision events. The multiplicity, pseudorapidity, and transverse momentum distributions of charged particles are shown in different regions of kinematics and charged-particle multiplicity, including measurements of final states at high multiplicity. Finally, the results are corrected for detector effects and are compared to the predictions of various Monte Carlo event generator models which simulate the full hadronic final state.

  13. Charged-particle distributions in pp interactions at √s = 8 TeV measured with the ATLAS detector

    NASA Astrophysics Data System (ADS)

    Aad, G.; Abbott, B.; Abdallah, J.; Abdinov, O.; Abeloos, B.; Aben, R.; Abolins, M.; AbouZeid, O. S.; Abraham, N. L.; Abramowicz, H.; Abreu, H.; Abreu, R.; Abulaiti, Y.; Acharya, B. S.; Adamczyk, L.; Adams, D. L.; Adelman, J.; Adomeit, S.; Adye, T.; Affolder, A. A.; Agatonovic-Jovin, T.; Agricola, J.; Aguilar-Saavedra, J. A.; Ahlen, S. P.; Ahmadov, F.; Aielli, G.; Akerstedt, H.; Åkesson, T. P. A.; Akimov, A. V.; Alberghi, G. L.; Albert, J.; Albrand, S.; Alconada Verzini, M. J.; Aleksa, M.; Aleksandrov, I. N.; Alexa, C.; Alexander, G.; Alexopoulos, T.; Alhroob, M.; Aliev, M.; Alimonti, G.; Alison, J.; Alkire, S. P.; Allbrooke, B. M. M.; Allen, B. W.; Allport, P. P.; Aloisio, A.; Alonso, A.; Alonso, F.; Alpigiani, C.; Alvarez Gonzalez, B.; Álvarez Piqueras, D.; Alviggi, M. G.; Amadio, B. T.; Amako, K.; Amaral Coutinho, Y.; Amelung, C.; Amidei, D.; Amor Dos Santos, S. P.; Amorim, A.; Amoroso, S.; Amram, N.; Amundsen, G.; Anastopoulos, C.; Ancu, L. S.; Andari, N.; Andeen, T.; Anders, C. F.; Anders, G.; Anders, J. K.; Anderson, K. J.; Andreazza, A.; Andrei, V.; Angelidakis, S.; Angelozzi, I.; Anger, P.; Angerami, A.; Anghinolfi, F.; Anisenkov, A. V.; Anjos, N.; Annovi, A.; Antonelli, M.; Antonov, A.; Antos, J.; Anulli, F.; Aoki, M.; Aperio Bella, L.; Arabidze, G.; Arai, Y.; Araque, J. P.; Arce, A. T. H.; Arduh, F. A.; Arguin, J.-F.; Argyropoulos, S.; Arik, M.; Armbruster, A. J.; Armitage, L. J.; Arnaez, O.; Arnold, H.; Arratia, M.; Arslan, O.; Artamonov, A.; Artoni, G.; Artz, S.; Asai, S.; Asbah, N.; Ashkenazi, A.; Åsman, B.; Asquith, L.; Assamagan, K.; Astalos, R.; Atkinson, M.; Atlay, N. B.; Augsten, K.; Avolio, G.; Axen, B.; Ayoub, M. K.; Azuelos, G.; Baak, M. A.; Baas, A. E.; Baca, M. J.; Bachacou, H.; Bachas, K.; Backes, M.; Backhaus, M.; Bagiacchi, P.; Bagnaia, P.; Bai, Y.; Baines, J. T.; Baker, O. K.; Baldin, E. M.; Balek, P.; Balestri, T.; Balli, F.; Balunas, W. K.; Banas, E.; Banerjee, Sw.; Bannoura, A. A. E.; Barak, L.; Barberio, E. 
L.; Barberis, D.; Barbero, M.; Barillari, T.; Barisonzi, M.; Barklow, T.; Barlow, N.; Barnes, S. L.; Barnett, B. M.; Barnett, R. M.; Barnovska, Z.; Baroncelli, A.; Barone, G.; Barr, A. J.; Barranco Navarro, L.; Barreiro, F.; Barreiro Guimarães da Costa, J.; Bartoldus, R.; Barton, A. E.; Bartos, P.; Basalaev, A.; Bassalat, A.; Basye, A.; Bates, R. L.; Batista, S. J.; Batley, J. R.; Battaglia, M.; Bauce, M.; Bauer, F.; Bawa, H. S.; Beacham, J. B.; Beattie, M. D.; Beau, T.; Beauchemin, P. H.; Bechtle, P.; Beck, H. P.; Becker, K.; Becker, M.; Beckingham, M.; Becot, C.; Beddall, A. J.; Beddall, A.; Bednyakov, V. A.; Bedognetti, M.; Bee, C. P.; Beemster, L. J.; Beermann, T. A.; Begel, M.; Behr, J. K.; Belanger-Champagne, C.; Bell, A. S.; Bell, W. H.; Bella, G.; Bellagamba, L.; Bellerive, A.; Bellomo, M.; Belotskiy, K.; Beltramello, O.; Belyaev, N. L.; Benary, O.; Benchekroun, D.; Bender, M.; Bendtz, K.; Benekos, N.; Benhammou, Y.; Benhar Noccioli, E.; Benitez, J.; Benitez Garcia, J. A.; Benjamin, D. P.; Bensinger, J. R.; Bentvelsen, S.; Beresford, L.; Beretta, M.; Berge, D.; Bergeaas Kuutmann, E.; Berger, N.; Berghaus, F.; Beringer, J.; Berlendis, S.; Bernard, N. R.; Bernius, C.; Bernlochner, F. U.; Berry, T.; Berta, P.; Bertella, C.; Bertoli, G.; Bertolucci, F.; Bertram, I. A.; Bertsche, C.; Bertsche, D.; Besjes, G. J.; Bessidskaia Bylund, O.; Bessner, M.; Besson, N.; Betancourt, C.; Bethke, S.; Bevan, A. J.; Bhimji, W.; Bianchi, R. M.; Bianchini, L.; Bianco, M.; Biebel, O.; Biedermann, D.; Bielski, R.; Biesuz, N. V.; Biglietti, M.; Bilbao De Mendizabal, J.; Bilokon, H.; Bindi, M.; Binet, S.; Bingul, A.; Bini, C.; Biondi, S.; Bjergaard, D. M.; Black, C. W.; Black, J. E.; Black, K. M.; Blackburn, D.; Blair, R. E.; Blanchard, J.-B.; Blanco, J. E.; Blazek, T.; Bloch, I.; Blocker, C.; Blum, W.; Blumenschein, U.; Blunier, S.; Bobbink, G. J.; Bobrovnikov, V. S.; Bocchetta, S. S.; Bocci, A.; Bock, C.; Boehler, M.; Boerner, D.; Bogaerts, J. A.; Bogavac, D.; Bogdanchikov, A. 
G.; Bohm, C.; Boisvert, V.; Bold, T.; Boldea, V.; Boldyrev, A. S.; Bomben, M.; Bona, M.; Boonekamp, M.; Borisov, A.; Borissov, G.; Bortfeldt, J.; Bortoletto, D.; Bortolotto, V.; Bos, K.; Boscherini, D.; Bosman, M.; Bossio Sola, J. D.; Boudreau, J.; Bouffard, J.; Bouhova-Thacker, E. V.; Boumediene, D.; Bourdarios, C.; Boutle, S. K.; Boveia, A.; Boyd, J.; Boyko, I. R.; Bracinik, J.; Brandt, A.; Brandt, G.; Brandt, O.; Bratzler, U.; Brau, B.; Brau, J. E.; Braun, H. M.; Breaden Madden, W. D.; Brendlinger, K.; Brennan, A. J.; Brenner, L.; Brenner, R.; Bressler, S.; Bristow, T. M.; Britton, D.; Britzger, D.; Brochu, F. M.; Brock, I.; Brock, R.; Brooijmans, G.; Brooks, T.; Brooks, W. K.; Brosamer, J.; Brost, E.; Broughton, J. H.; Bruckman de Renstrom, P. A.; Bruncko, D.; Bruneliere, R.; Bruni, A.; Bruni, G.; Brunt, BH; Bruschi, M.; Bruscino, N.; Bryant, P.; Bryngemark, L.; Buanes, T.; Buat, Q.; Buchholz, P.; Buckley, A. G.; Budagov, I. A.; Buehrer, F.; Bugge, M. K.; Bulekov, O.; Bullock, D.; Burckhart, H.; Burdin, S.; Burgard, C. D.; Burghgrave, B.; Burka, K.; Burke, S.; Burmeister, I.; Busato, E.; Büscher, D.; Büscher, V.; Bussey, P.; Butler, J. M.; Butt, A. I.; Buttar, C. M.; Butterworth, J. M.; Butti, P.; Buttinger, W.; Buzatu, A.; Buzykaev, A. R.; Cabrera Urbán, S.; Caforio, D.; Cairo, V. M.; Cakir, O.; Calace, N.; Calafiura, P.; Calandri, A.; Calderini, G.; Calfayan, P.; Caloba, L. P.; Calvet, D.; Calvet, S.; Calvet, T. P.; Camacho Toro, R.; Camarda, S.; Camarri, P.; Cameron, D.; Caminal Armadans, R.; Camincher, C.; Campana, S.; Campanelli, M.; Campoverde, A.; Canale, V.; Canepa, A.; Cano Bret, M.; Cantero, J.; Cantrill, R.; Cao, T.; Capeans Garrido, M. D. M.; Caprini, I.; Caprini, M.; Capua, M.; Caputo, R.; Carbone, R. M.; Cardarelli, R.; Cardillo, F.; Carli, T.; Carlino, G.; Carminati, L.; Caron, S.; Carquin, E.; Carrillo-Montoya, G. D.; Carter, J. R.; Carvalho, J.; Casadei, D.; Casado, M. P.; Casolino, M.; Casper, D. 
W.; Castaneda-Miranda, E.; Castelli, A.; Castillo Gimenez, V.; Castro, N. F.; Catinaccio, A.; Catmore, J. R.; Cattai, A.; Caudron, J.; Cavaliere, V.; Cavallaro, E.; Cavalli, D.; Cavalli-Sforza, M.; Cavasinni, V.; Ceradini, F.; Cerda Alberich, L.; Cerio, B. C.; Cerqueira, A. S.; Cerri, A.; Cerrito, L.; Cerutti, F.; Cerv, M.; Cervelli, A.; Cetin, S. A.; Chafaq, A.; Chakraborty, D.; Chalupkova, I.; Chan, S. K.; Chan, Y. L.; Chang, P.; Chapman, J. D.; Charlton, D. G.; Chatterjee, A.; Chau, C. C.; Chavez Barajas, C. A.; Che, S.; Cheatham, S.; Chegwidden, A.; Chekanov, S.; Chekulaev, S. V.; Chelkov, G. A.; Chelstowska, M. A.; Chen, C.; Chen, H.; Chen, K.; Chen, S.; Chen, S.; Chen, X.; Chen, Y.; Cheng, H. C.; Cheng, H. J.; Cheng, Y.; Cheplakov, A.; Cheremushkina, E.; Cherkaoui El Moursli, R.; Chernyatin, V.; Cheu, E.; Chevalier, L.; Chiarella, V.; Chiarelli, G.; Chiodini, G.; Chisholm, A. S.; Chitan, A.; Chizhov, M. V.; Choi, K.; Chomont, A. R.; Chouridou, S.; Chow, B. K. B.; Christodoulou, V.; Chromek-Burckhart, D.; Chudoba, J.; Chuinard, A. J.; Chwastowski, J. J.; Chytka, L.; Ciapetti, G.; Ciftci, A. K.; Cinca, D.; Cindro, V.; Cioara, I. A.; Ciocio, A.; Cirotto, F.; Citron, Z. H.; Ciubancan, M.; Clark, A.; Clark, B. L.; Clark, M. R.; Clark, P. J.; Clarke, R. N.; Clement, C.; Coadou, Y.; Cobal, M.; Coccaro, A.; Cochran, J.; Coffey, L.; Colasurdo, L.; Cole, B.; Cole, S.; Colijn, A. P.; Collot, J.; Colombo, T.; Compostella, G.; Conde Muiño, P.; Coniavitis, E.; Connell, S. H.; Connelly, I. A.; Consorti, V.; Constantinescu, S.; Conta, C.; Conti, G.; Conventi, F.; Cooke, M.; Cooper, B. D.; Cooper-Sarkar, A. M.; Cornelissen, T.; Corradi, M.; Corriveau, F.; Corso-Radu, A.; Cortes-Gonzalez, A.; Cortiana, G.; Costa, G.; Costa, M. J.; Costanzo, D.; Cottin, G.; Cowan, G.; Cox, B. E.; Cranmer, K.; Crawley, S. J.; Cree, G.; Crépé-Renaudin, S.; Crescioli, F.; Cribbs, W. 
A.; Crispin Ortuzar, M.; Cristinziani, M.; Croft, V.; Crosetti, G.; Cuhadar Donszelmann, T.; Cummings, J.; Curatolo, M.; Cúth, J.; Cuthbert, C.; Czirr, H.; Czodrowski, P.; D'Auria, S.; D'Onofrio, M.; Da Cunha Sargedas De Sousa, M. J.; Da Via, C.; Dabrowski, W.; Dai, T.; Dale, O.; Dallaire, F.; Dallapiccola, C.; Dam, M.; Dandoy, J. R.; Dang, N. P.; Daniells, A. C.; Dann, N. S.; Danninger, M.; Dano Hoffmann, M.; Dao, V.; Darbo, G.; Darmora, S.; Dassoulas, J.; Dattagupta, A.; Davey, W.; David, C.; Davidek, T.; Davies, M.; Davison, P.; Davygora, Y.; Dawe, E.; Dawson, I.; Daya-Ishmukhametova, R. K.; De, K.; de Asmundis, R.; De Benedetti, A.; De Castro, S.; De Cecco, S.; De Groot, N.; de Jong, P.; De la Torre, H.; De Lorenzi, F.; De Pedis, D.; De Salvo, A.; De Sanctis, U.; De Santo, A.; De Vivie De Regie, J. B.; Dearnaley, W. J.; Debbe, R.; Debenedetti, C.; Dedovich, D. V.; Deigaard, I.; Del Peso, J.; Del Prete, T.; Delgove, D.; Deliot, F.; Delitzsch, C. M.; Deliyergiyev, M.; Dell'Acqua, A.; Dell'Asta, L.; Dell'Orso, M.; Della Pietra, M.; della Volpe, D.; Delmastro, M.; Delsart, P. A.; Deluca, C.; DeMarco, D. A.; Demers, S.; Demichev, M.; Demilly, A.; Denisov, S. P.; Denysiuk, D.; Derendarz, D.; Derkaoui, J. E.; Derue, F.; Dervan, P.; Desch, K.; Deterre, C.; Dette, K.; Deviveiros, P. O.; Dewhurst, A.; Dhaliwal, S.; Di Ciaccio, A.; Di Ciaccio, L.; Di Clemente, W. K.; Di Domenico, A.; Di Donato, C.; Di Girolamo, A.; Di Girolamo, B.; Di Mattia, A.; Di Micco, B.; Di Nardo, R.; Di Simone, A.; Di Sipio, R.; Di Valentino, D.; Diaconu, C.; Diamond, M.; Dias, F. A.; Diaz, M. A.; Diehl, E. B.; Dietrich, J.; Diglio, S.; Dimitrievska, A.; Dingfelder, J.; Dita, P.; Dita, S.; Dittus, F.; Djama, F.; Djobava, T.; Djuvsland, J. I.; do Vale, M. A. B.; Dobos, D.; Dobre, M.; Doglioni, C.; Dohmae, T.; Dolejsi, J.; Dolezal, Z.; Dolgoshein, B. A.; Donadelli, M.; Donati, S.; Dondero, P.; Donini, J.; Dopke, J.; Doria, A.; Dova, M. T.; Doyle, A. 
T.; Drechsler, E.; Dris, M.; Du, Y.; Duarte-Campderros, J.; Duchovni, E.; Duckeck, G.; Ducu, O. A.; Duda, D.; Dudarev, A.; Duflot, L.; Duguid, L.; Dührssen, M.; Dunford, M.; Duran Yildiz, H.; Düren, M.; Durglishvili, A.; Duschinger, D.; Dutta, B.; Dyndal, M.; Eckardt, C.; Ecker, K. M.; Edgar, R. C.; Edson, W.; Edwards, N. C.; Eifert, T.; Eigen, G.; Einsweiler, K.; Ekelof, T.; El Kacimi, M.; Ellajosyula, V.; Ellert, M.; Elles, S.; Ellinghaus, F.; Elliot, A. A.; Ellis, N.; Elmsheuser, J.; Elsing, M.; Emeliyanov, D.; Enari, Y.; Endner, O. C.; Endo, M.; Ennis, J. S.; Erdmann, J.; Ereditato, A.; Ernis, G.; Ernst, J.; Ernst, M.; Errede, S.; Ertel, E.; Escalier, M.; Esch, H.; Escobar, C.; Esposito, B.; Etienvre, A. I.; Etzion, E.; Evans, H.; Ezhilov, A.; Fabbri, F.; Fabbri, L.; Facini, G.; Fakhrutdinov, R. M.; Falciano, S.; Falla, R. J.; Faltova, J.; Fang, Y.; Fanti, M.; Farbin, A.; Farilla, A.; Farina, C.; Farooque, T.; Farrell, S.; Farrington, S. M.; Farthouat, P.; Fassi, F.; Fassnacht, P.; Fassouliotis, D.; Faucci Giannelli, M.; Favareto, A.; Fawcett, W. J.; Fayard, L.; Fedin, O. L.; Fedorko, W.; Feigl, S.; Feligioni, L.; Feng, C.; Feng, E. J.; Feng, H.; Fenyuk, A. B.; Feremenga, L.; Fernandez Martinez, P.; Fernandez Perez, S.; Ferrando, J.; Ferrari, A.; Ferrari, P.; Ferrari, R.; Ferreira de Lima, D. E.; Ferrer, A.; Ferrere, D.; Ferretti, C.; Ferretto Parodi, A.; Fiedler, F.; Filipčič, A.; Filipuzzi, M.; Filthaut, F.; Fincke-Keeler, M.; Finelli, K. D.; Fiolhais, M. C. N.; Fiorini, L.; Firan, A.; Fischer, A.; Fischer, C.; Fischer, J.; Fisher, W. C.; Flaschel, N.; Fleck, I.; Fleischmann, P.; Fletcher, G. T.; Fletcher, G.; Fletcher, R. R. M.; Flick, T.; Floderus, A.; Flores Castillo, L. R.; Flowerdew, M. J.; Forcolin, G. T.; Formica, A.; Forti, A.; Foster, A. G.; Fournier, D.; Fox, H.; Fracchia, S.; Francavilla, P.; Franchini, M.; Francis, D.; Franconi, L.; Franklin, M.; Frate, M.; Fraternali, M.; Freeborn, D.; Fressard-Batraneanu, S. 
M.; Friedrich, F.; Froidevaux, D.; Frost, J. A.; Fukunaga, C.; Fullana Torregrosa, E.; Fusayasu, T.; Fuster, J.; Gabaldon, C.; Gabizon, O.; Gabrielli, A.; Gabrielli, A.; Gach, G. P.; Gadatsch, S.; Gadomski, S.; Gagliardi, G.; Gagnon, L. G.; Gagnon, P.; Galea, C.; Galhardo, B.; Gallas, E. J.; Gallop, B. J.; Gallus, P.; Galster, G.; Gan, K. K.; Gao, J.; Gao, Y.; Gao, Y. S.; Garay Walls, F. M.; García, C.; García Navarro, J. E.; Garcia-Sciveres, M.; Gardner, R. W.; Garelli, N.; Garonne, V.; Gascon Bravo, A.; Gatti, C.; Gaudiello, A.; Gaudio, G.; Gaur, B.; Gauthier, L.; Gavrilenko, I. L.; Gay, C.; Gaycken, G.; Gazis, E. N.; Gecse, Z.; Gee, C. N. P.; Geich-Gimbel, Ch.; Geisler, M. P.; Gemme, C.; Genest, M. H.; Geng, C.; Gentile, S.; George, S.; Gerbaudo, D.; Gershon, A.; Ghasemi, S.; Ghazlane, H.; Ghneimat, M.; Giacobbe, B.; Giagu, S.; Giannetti, P.; Gibbard, B.; Gibson, S. M.; Gignac, M.; Gilchriese, M.; Gillam, T. P. S.; Gillberg, D.; Gilles, G.; Gingrich, D. M.; Giokaris, N.; Giordani, M. P.; Giorgi, F. M.; Giorgi, F. M.; Giraud, P. F.; Giromini, P.; Giugni, D.; Giuli, F.; Giuliani, C.; Giulini, M.; Gjelsten, B. K.; Gkaitatzis, S.; Gkialas, I.; Gkougkousis, E. L.; Gladilin, L. K.; Glasman, C.; Glatzer, J.; Glaysher, P. C. F.; Glazov, A.; Goblirsch-Kolb, M.; Godlewski, J.; Goldfarb, S.; Golling, T.; Golubkov, D.; Gomes, A.; Gonçalo, R.; Goncalves Pinto Firmino Da Costa, J.; Gonella, L.; Gongadze, A.; González de la Hoz, S.; Gonzalez Parra, G.; Gonzalez-Sevilla, S.; Goossens, L.; Gorbounov, P. A.; Gordon, H. A.; Gorelov, I.; Gorini, B.; Gorini, E.; Gorišek, A.; Gornicki, E.; Goshaw, A. T.; Gössling, C.; Gostkin, M. I.; Goudet, C. R.; Goujdami, D.; Goussiou, A. G.; Govender, N.; Gozani, E.; Graber, L.; Grabowska-Bold, I.; Gradin, P. O. J.; Grafström, P.; Gramling, J.; Gramstad, E.; Grancagnolo, S.; Gratchev, V.; Gray, H. M.; Graziani, E.; Greenwood, Z. D.; Grefe, C.; Gregersen, K.; Gregor, I. M.; Grenier, P.; Grevtsov, K.; Griffiths, J.; Grillo, A. 
A.; Grimm, K.; Grinstein, S.; Gris, Ph.; Grivaz, J.-F.; Groh, S.; Grohs, J. P.; Gross, E.; Grosse-Knetter, J.; Grossi, G. C.; Grout, Z. J.; Guan, L.; Guan, W.; Guenther, J.; Guescini, F.; Guest, D.; Gueta, O.; Guido, E.; Guillemin, T.; Guindon, S.; Gul, U.; Gumpert, C.; Guo, J.; Guo, Y.; Gupta, S.; Gustavino, G.; Gutierrez, P.; Gutierrez Ortiz, N. G.; Gutschow, C.; Guyot, C.; Gwenlan, C.; Gwilliam, C. B.; Haas, A.; Haber, C.; Hadavand, H. K.; Haddad, N.; Hadef, A.; Haefner, P.; Hageböck, S.; Hajduk, Z.; Hakobyan, H.; Haleem, M.; Haley, J.; Hall, D.; Halladjian, G.; Hallewell, G. D.; Hamacher, K.; Hamal, P.; Hamano, K.; Hamilton, A.; Hamity, G. N.; Hamnett, P. G.; Han, L.; Hanagaki, K.; Hanawa, K.; Hance, M.; Haney, B.; Hanke, P.; Hanna, R.; Hansen, J. B.; Hansen, J. D.; Hansen, M. C.; Hansen, P. H.; Hara, K.; Hard, A. S.; Harenberg, T.; Hariri, F.; Harkusha, S.; Harrington, R. D.; Harrison, P. F.; Hartjes, F.; Hasegawa, M.; Hasegawa, Y.; Hasib, A.; Hassani, S.; Haug, S.; Hauser, R.; Hauswald, L.; Havranek, M.; Hawkes, C. M.; Hawkings, R. J.; Hawkins, A. D.; Hayden, D.; Hays, C. P.; Hays, J. M.; Hayward, H. S.; Haywood, S. J.; Head, S. J.; Heck, T.; Hedberg, V.; Heelan, L.; Heim, S.; Heim, T.; Heinemann, B.; Heinrich, J. J.; Heinrich, L.; Heinz, C.; Hejbal, J.; Helary, L.; Hellman, S.; Helsens, C.; Henderson, J.; Henderson, R. C. W.; Heng, Y.; Henkelmann, S.; Henriques Correia, A. M.; Henrot-Versille, S.; Herbert, G. H.; Hernández Jiménez, Y.; Herten, G.; Hertenberger, R.; Hervas, L.; Hesketh, G. G.; Hessey, N. P.; Hetherly, J. W.; Hickling, R.; Higón-Rodriguez, E.; Hill, E.; Hill, J. C.; Hiller, K. H.; Hillier, S. J.; Hinchliffe, I.; Hines, E.; Hinman, R. R.; Hirose, M.; Hirschbuehl, D.; Hobbs, J.; Hod, N.; Hodgkinson, M. C.; Hodgson, P.; Hoecker, A.; Hoeferkamp, M. R.; Hoenig, F.; Hohlfeld, M.; Hohn, D.; Holmes, T. R.; Homann, M.; Hong, T. M.; Hooberman, B. H.; Hopkins, W. H.; Horii, Y.; Horton, A. 
J.; Hostachy, J.-Y.; Hou, S.; Hoummada, A.; Howard, J.; Howarth, J.; Hrabovsky, M.; Hristova, I.; Hrivnac, J.; Hryn'ova, T.; Hrynevich, A.; Hsu, C.; Hsu, P. J.; Hsu, S.-C.; Hu, D.; Hu, Q.; Huang, Y.; Hubacek, Z.; Hubaut, F.; Huegging, F.; Huffman, T. B.; Hughes, E. W.; Hughes, G.; Huhtinen, M.; Hülsing, T. A.; Huseynov, N.; Huston, J.; Huth, J.; Iacobucci, G.; Iakovidis, G.; Ibragimov, I.; Iconomidou-Fayard, L.; Ideal, E.; Idrissi, Z.; Iengo, P.; Igonkina, O.; Iizawa, T.; Ikegami, Y.; Ikeno, M.; Ilchenko, Y.; Iliadis, D.; Ilic, N.; Ince, T.; Introzzi, G.; Ioannou, P.; Iodice, M.; Iordanidou, K.; Ippolito, V.; Irles Quiles, A.; Isaksson, C.; Ishino, M.; Ishitsuka, M.; Ishmukhametov, R.; Issever, C.; Istin, S.; Ito, F.; Iturbe Ponce, J. M.; Iuppa, R.; Ivarsson, J.; Iwanski, W.; Iwasaki, H.; Izen, J. M.; Izzo, V.; Jabbar, S.; Jackson, B.; Jackson, M.; Jackson, P.; Jain, V.; Jakobi, K. B.; Jakobs, K.; Jakobsen, S.; Jakoubek, T.; Jamin, D. O.; Jana, D. K.; Jansen, E.; Jansky, R.; Janssen, J.; Janus, M.; Jarlskog, G.; Javadov, N.; Javůrek, T.; Jeanneau, F.; Jeanty, L.; Jejelava, J.; Jeng, G.-Y.; Jennens, D.; Jenni, P.; Jentzsch, J.; Jeske, C.; Jézéquel, S.; Ji, H.; Jia, J.; Jiang, H.; Jiang, Y.; Jiggins, S.; Jimenez Pena, J.; Jin, S.; Jinaru, A.; Jinnouchi, O.; Johansson, P.; Johns, K. A.; Johnson, W. J.; Jon-And, K.; Jones, G.; Jones, R. W. L.; Jones, S.; Jones, T. J.; Jongmanns, J.; Jorge, P. M.; Jovicevic, J.; Ju, X.; Juste Rozas, A.; Köhler, M. K.; Kaczmarska, A.; Kado, M.; Kagan, H.; Kagan, M.; Kahn, S. J.; Kajomovitz, E.; Kalderon, C. W.; Kaluza, A.; Kama, S.; Kamenshchikov, A.; Kanaya, N.; Kaneti, S.; Kanjir, L.; Kantserov, V. A.; Kanzaki, J.; Kaplan, B.; Kaplan, L. S.; Kapliy, A.; Kar, D.; Karakostas, K.; Karamaoun, A.; Karastathis, N.; Kareem, M. J.; Karentzos, E.; Karnevskiy, M.; Karpov, S. N.; Karpova, Z. M.; Karthik, K.; Kartvelishvili, V.; Karyukhin, A. N.; Kasahara, K.; Kashif, L.; Kass, R. 
D.; Kastanas, A.; Kataoka, Y.; Kato, C.; Katre, A.; Katzy, J.; Kawade, K.; Kawagoe, K.; Kawamoto, T.; Kawamura, G.; Kazama, S.; Kazanin, V. F.; Keeler, R.; Kehoe, R.; Keller, J. S.; Kempster, J. J.; Keoshkerian, H.; Kepka, O.; Kerševan, B. P.; Kersten, S.; Keyes, R. A.; Khalil-zada, F.; Khandanyan, H.; Khanov, A.; Kharlamov, A. G.; Khoo, T. J.; Khovanskiy, V.; Khramov, E.; Khubua, J.; Kido, S.; Kim, H. Y.; Kim, S. H.; Kim, Y. K.; Kimura, N.; Kind, O. M.; King, B. T.; King, M.; King, S. B.; Kirk, J.; Kiryunin, A. E.; Kishimoto, T.; Kisielewska, D.; Kiss, F.; Kiuchi, K.; Kivernyk, O.; Kladiva, E.; Klein, M. H.; Klein, M.; Klein, U.; Kleinknecht, K.; Klimek, P.; Klimentov, A.; Klingenberg, R.; Klinger, J. A.; Klioutchnikova, T.; Kluge, E.-E.; Kluit, P.; Kluth, S.; Knapik, J.; Kneringer, E.; Knoops, E. B. F. G.; Knue, A.; Kobayashi, A.; Kobayashi, D.; Kobayashi, T.; Kobel, M.; Kocian, M.; Kodys, P.; Koffas, T.; Koffeman, E.; Kogan, L. A.; Kohriki, T.; Koi, T.; Kolanoski, H.; Kolb, M.; Koletsou, I.; Komar, A. A.; Komori, Y.; Kondo, T.; Kondrashova, N.; Köneke, K.; König, A. C.; Kono, T.; Konoplich, R.; Konstantinidis, N.; Kopeliansky, R.; Koperny, S.; Köpke, L.; Kopp, A. K.; Korcyl, K.; Kordas, K.; Korn, A.; Korol, A. A.; Korolkov, I.; Korolkova, E. V.; Kortner, O.; Kortner, S.; Kosek, T.; Kostyukhin, V. V.; Kotov, V. M.; Kotwal, A.; Kourkoumeli-Charalampidi, A.; Kourkoumelis, C.; Kouskoura, V.; Koutsman, A.; Kowalewska, A. B.; Kowalewski, R.; Kowalski, T. Z.; Kozanecki, W.; Kozhin, A. S.; Kramarenko, V. A.; Kramberger, G.; Krasnopevtsev, D.; Krasny, M. W.; Krasznahorkay, A.; Kraus, J. K.; Kravchenko, A.; Kretz, M.; Kretzschmar, J.; Kreutzfeldt, K.; Krieger, P.; Krizka, K.; Kroeninger, K.; Kroha, H.; Kroll, J.; Kroseberg, J.; Krstic, J.; Kruchonak, U.; Krüger, H.; Krumnack, N.; Kruse, A.; Kruse, M. C.; Kruskal, M.; Kubota, T.; Kucuk, H.; Kuday, S.; Kuechler, J. 
T.; Kuehn, S.; Kugel, A.; Kuger, F.; Kuhl, A.; Kuhl, T.; Kukhtin, V.; Kukla, R.; Kulchitsky, Y.; Kuleshov, S.; Kuna, M.; Kunigo, T.; Kupco, A.; Kurashige, H.; Kurochkin, Y. A.; Kus, V.; Kuwertz, E. S.; Kuze, M.; Kvita, J.; Kwan, T.; Kyriazopoulos, D.; La Rosa, A.; La Rosa Navarro, J. L.; La Rotonda, L.; Lacasta, C.; Lacava, F.; Lacey, J.; Lacker, H.; Lacour, D.; Lacuesta, V. R.; Ladygin, E.; Lafaye, R.; Laforge, B.; Lagouri, T.; Lai, S.; Lammers, S.; Lampl, W.; Lançon, E.; Landgraf, U.; Landon, M. P. J.; Lang, V. S.; Lange, J. C.; Lankford, A. J.; Lanni, F.; Lantzsch, K.; Lanza, A.; Laplace, S.; Lapoire, C.; Laporte, J. F.; Lari, T.; Lasagni Manghi, F.; Lassnig, M.; Laurelli, P.; Lavrijsen, W.; Law, A. T.; Laycock, P.; Lazovich, T.; Lazzaroni, M.; Le Dortz, O.; Le Guirriec, E.; Le Menedeu, E.; Le Quilleuc, E. P.; LeBlanc, M.; LeCompte, T.; Ledroit-Guillon, F.; Lee, C. A.; Lee, S. C.; Lee, L.; Lefebvre, G.; Lefebvre, M.; Legger, F.; Leggett, C.; Lehan, A.; Lehmann Miotto, G.; Lei, X.; Leight, W. A.; Leisos, A.; Leister, A. G.; Leite, M. A. L.; Leitner, R.; Lellouch, D.; Lemmer, B.; Leney, K. J. C.; Lenz, T.; Lenzi, B.; Leone, R.; Leone, S.; Leonidopoulos, C.; Leontsinis, S.; Lerner, G.; Leroy, C.; Lesage, A. A. J.; Lester, C. G.; Levchenko, M.; Levêque, J.; Levin, D.; Levinson, L. J.; Levy, M.; Leyko, A. M.; Leyton, M.; Li, B.; Li, H.; Li, H. L.; Li, L.; Li, L.; Li, Q.; Li, S.; Li, X.; Li, Y.; Liang, Z.; Liao, H.; Liberti, B.; Liblong, A.; Lichard, P.; Lie, K.; Liebal, J.; Liebig, W.; Limbach, C.; Limosani, A.; Lin, S. C.; Lin, T. H.; Lindquist, B. E.; Lipeles, E.; Lipniacka, A.; Lisovyi, M.; Liss, T. M.; Lissauer, D.; Lister, A.; Litke, A. M.; Liu, B.; Liu, D.; Liu, H.; Liu, H.; Liu, J.; Liu, J. B.; Liu, K.; Liu, L.; Liu, M.; Liu, M.; Liu, Y. L.; Liu, Y.; Livan, M.; Lleres, A.; Llorente Merino, J.; Lloyd, S. L.; Lo Sterzo, F.; Lobodzinska, E.; Loch, P.; Lockman, W. S.; Loebinger, F. K.; Loevschall-Jensen, A. E.; Loew, K. 
M.; Loginov, A.; Lohse, T.; Lohwasser, K.; Lokajicek, M.; Long, B. A.; Long, J. D.; Long, R. E.; Longo, L.; Looper, K. A.; Lopes, L.; Lopez Mateos, D.; Lopez Paredes, B.; Lopez Paz, I.; Lopez Solis, A.; Lorenz, J.; Lorenzo Martinez, N.; Losada, M.; Lösel, P. J.; Lou, X.; Lounis, A.; Love, J.; Love, P. A.; Lu, H.; Lu, N.; Lubatti, H. J.; Luci, C.; Lucotte, A.; Luedtke, C.; Luehring, F.; Lukas, W.; Luminari, L.; Lundberg, O.; Lund-Jensen, B.; Lynn, D.; Lysak, R.; Lytken, E.; Lyubushkin, V.; Ma, H.; Ma, L. L.; Ma, Y.; Maccarrone, G.; Macchiolo, A.; Macdonald, C. M.; Maček, B.; Machado Miguens, J.; Madaffari, D.; Madar, R.; Maddocks, H. J.; Mader, W. F.; Madsen, A.; Maeda, J.; Maeland, S.; Maeno, T.; Maevskiy, A.; Magradze, E.; Mahlstedt, J.; Maiani, C.; Maidantchik, C.; Maier, A. A.; Maier, T.; Maio, A.; Majewski, S.; Makida, Y.; Makovec, N.; Malaescu, B.; Malecki, Pa.; Maleev, V. P.; Malek, F.; Mallik, U.; Malon, D.; Malone, C.; Maltezos, S.; Malyshev, V. M.; Malyukov, S.; Mamuzic, J.; Mancini, G.; Mandelli, B.; Mandelli, L.; Mandić, I.; Maneira, J.; Manhaes de Andrade Filho, L.; Manjarres Ramos, J.; Mann, A.; Mansoulie, B.; Mantifel, R.; Mantoani, M.; Manzoni, S.; Mapelli, L.; Marceca, G.; March, L.; Marchiori, G.; Marcisovsky, M.; Marjanovic, M.; Marley, D. E.; Marroquim, F.; Marsden, S. P.; Marshall, Z.; Marti, L. F.; Marti-Garcia, S.; Martin, B.; Martin, T. A.; Martin, V. J.; Martin dit Latour, B.; Martinez, M.; Martin-Haugh, S.; Martoiu, V. S.; Martyniuk, A. C.; Marx, M.; Marzano, F.; Marzin, A.; Masetti, L.; Mashimo, T.; Mashinistov, R.; Masik, J.; Maslennikov, A. L.; Massa, I.; Massa, L.; Mastrandrea, P.; Mastroberardino, A.; Masubuchi, T.; Mättig, P.; Mattmann, J.; Maurer, J.; Maxfield, S. J.; Maximov, D. A.; Mazini, R.; Mazza, S. M.; Mc Fadden, N. C.; Mc Goldrick, G.; Mc Kee, S. P.; McCarn, A.; McCarthy, R. L.; McCarthy, T. G.; McClymont, L. I.; McFarlane, K. W.; Mcfayden, J. A.; Mchedlidze, G.; McMahon, S. J.; McPherson, R. 
A.; Medici, M.; Medinnis, M.; Meehan, S.; Mehlhase, S.; Mehta, A.; Meier, K.; Meineck, C.; Meirose, B.; Mellado Garcia, B. R.; Meloni, F.; Mengarelli, A.; Menke, S.; Meoni, E.; Mercurio, K. M.; Mergelmeyer, S.; Mermod, P.; Merola, L.; Meroni, C.; Merritt, F. S.; Messina, A.; Metcalfe, J.; Mete, A. S.; Meyer, C.; Meyer, C.; Meyer, J.-P.; Meyer, J.; Meyer Zu Theenhausen, H.; Middleton, R. P.; Miglioranzi, S.; Mijović, L.; Mikenberg, G.; Mikestikova, M.; Mikuž, M.; Milesi, M.; Milic, A.; Miller, D. W.; Mills, C.; Milov, A.; Milstead, D. A.; Minaenko, A. A.; Minami, Y.; Minashvili, I. A.; Mincer, A. I.; Mindur, B.; Mineev, M.; Ming, Y.; Mir, L. M.; Mistry, K. P.; Mitani, T.; Mitrevski, J.; Mitsou, V. A.; Miucci, A.; Miyagawa, P. S.; Mjörnmark, J. U.; Moa, T.; Mochizuki, K.; Mohapatra, S.; Mohr, W.; Molander, S.; Moles-Valls, R.; Monden, R.; Mondragon, M. C.; Mönig, K.; Monk, J.; Monnier, E.; Montalbano, A.; Montejo Berlingen, J.; Monticelli, F.; Monzani, S.; Moore, R. W.; Morange, N.; Moreno, D.; Moreno Llácer, M.; Morettini, P.; Mori, D.; Mori, T.; Morii, M.; Morinaga, M.; Morisbak, V.; Moritz, S.; Morley, A. K.; Mornacchi, G.; Morris, J. D.; Mortensen, S. S.; Morvaj, L.; Mosidze, M.; Moss, J.; Motohashi, K.; Mount, R.; Mountricha, E.; Mouraviev, S. V.; Moyse, E. J. W.; Muanza, S.; Mudd, R. D.; Mueller, F.; Mueller, J.; Mueller, R. S. P.; Mueller, T.; Muenstermann, D.; Mullen, P.; Mullier, G. A.; Munoz Sanchez, F. J.; Murillo Quijada, J. A.; Murray, W. J.; Musheghyan, H.; Muskinja, M.; Myagkov, A. G.; Myska, M.; Nachman, B. P.; Nackenhorst, O.; Nadal, J.; Nagai, K.; Nagai, R.; Nagano, K.; Nagasaka, Y.; Nagata, K.; Nagel, M.; Nagy, E.; Nairz, A. M.; Nakahama, Y.; Nakamura, K.; Nakamura, T.; Nakano, I.; Namasivayam, H.; Naranjo Garcia, R. F.; Narayan, R.; Narrias Villar, D. I.; Naryshkin, I.; Naumann, T.; Navarro, G.; Nayyar, R.; Neal, H. A.; Nechaeva, P. Yu.; Neep, T. J.; Nef, P. 
D.; Negri, A.; Negrini, M.; Nektarijevic, S.; Nellist, C.; Nelson, A.; Nemecek, S.; Nemethy, P.; Nepomuceno, A. A.; Nessi, M.; Neubauer, M. S.; Neumann, M.; Neves, R. M.; Nevski, P.; Newman, P. R.; Nguyen, D. H.; Nickerson, R. B.; Nicolaidou, R.; Nicquevert, B.; Nielsen, J.; Nikiforov, A.; Nikolaenko, V.; Nikolic-Audit, I.; Nikolopoulos, K.; Nilsen, J. K.; Nilsson, P.; Ninomiya, Y.; Nisati, A.; Nisius, R.; Nobe, T.; Nodulman, L.; Nomachi, M.; Nomidis, I.; Nooney, T.; Norberg, S.; Nordberg, M.; Norjoharuddeen, N.; Novgorodova, O.; Nowak, S.; Nozaki, M.; Nozka, L.; Ntekas, K.; Nurse, E.; Nuti, F.; O'grady, F.; O'Neil, D. C.; O'Rourke, A. A.; O'Shea, V.; Oakham, F. G.; Oberlack, H.; Obermann, T.; Ocariz, J.; Ochi, A.; Ochoa, I.; Ochoa-Ricoux, J. P.; Oda, S.; Odaka, S.; Ogren, H.; Oh, A.; Oh, S. H.; Ohm, C. C.; Ohman, H.; Oide, H.; Okawa, H.; Okumura, Y.; Okuyama, T.; Olariu, A.; Oleiro Seabra, L. F.; Olivares Pino, S. A.; Oliveira Damazio, D.; Olszewski, A.; Olszowska, J.; Onofre, A.; Onogi, K.; Onyisi, P. U. E.; Oram, C. J.; Oreglia, M. J.; Oren, Y.; Orestano, D.; Orlando, N.; Orr, R. S.; Osculati, B.; Ospanov, R.; Otero y Garzon, G.; Otono, H.; Ouchrif, M.; Ould-Saada, F.; Ouraou, A.; Oussoren, K. P.; Ouyang, Q.; Owen, M.; Owen, R. E.; Ozcan, V. E.; Ozturk, N.; Pachal, K.; Pacheco Pages, A.; Padilla Aranda, C.; Pagáčová, M.; Pagan Griso, S.; Paige, F.; Pais, P.; Pajchel, K.; Palacino, G.; Palestini, S.; Palka, M.; Pallin, D.; Palma, A.; Panagiotopoulou, E. St.; Pandini, C. E.; Panduro Vazquez, J. G.; Pani, P.; Panitkin, S.; Pantea, D.; Paolozzi, L.; Papadopoulou, Th. D.; Papageorgiou, K.; Paramonov, A.; Paredes Hernandez, D.; Parker, A. J.; Parker, M. A.; Parker, K. A.; Parodi, F.; Parsons, J. A.; Parzefall, U.; Pascuzzi, V. R.; Pasqualucci, E.; Passaggio, S.; Pastore, F.; Pastore, Fr.; Pásztor, G.; Pataraia, S.; Patel, N. D.; Pater, J. R.; Pauly, T.; Pearce, J.; Pearson, B.; Pedersen, L. E.; Pedersen, M.; Pedraza Lopez, S.; Pedro, R.; Peleganchuk, S. 
V.; Pelikan, D.; Penc, O.; Peng, C.; Peng, H.; Penwell, J.; Peralva, B. S.; Perego, M. M.; Perepelitsa, D. V.; Perez Codina, E.; Perini, L.; Pernegger, H.; Perrella, S.; Peschke, R.; Peshekhonov, V. D.; Peters, K.; Peters, R. F. Y.; Petersen, B. A.; Petersen, T. C.; Petit, E.; Petridis, A.; Petridou, C.; Petroff, P.; Petrolo, E.; Petrov, M.; Petrucci, F.; Pettersson, N. E.; Peyaud, A.; Pezoa, R.; Phillips, P. W.; Piacquadio, G.; Pianori, E.; Picazio, A.; Piccaro, E.; Piccinini, M.; Pickering, M. A.; Piegaia, R.; Pilcher, J. E.; Pilkington, A. D.; Pin, A. W. J.; Pina, J.; Pinamonti, M.; Pinfold, J. L.; Pingel, A.; Pires, S.; Pirumov, H.; Pitt, M.; Plazak, L.; Pleier, M.-A.; Pleskot, V.; Plotnikova, E.; Plucinski, P.; Pluth, D.; Poettgen, R.; Poggioli, L.; Pohl, D.; Polesello, G.; Poley, A.; Policicchio, A.; Polifka, R.; Polini, A.; Pollard, C. S.; Polychronakos, V.; Pommès, K.; Pontecorvo, L.; Pope, B. G.; Popeneciu, G. A.; Popovic, D. S.; Poppleton, A.; Pospisil, S.; Potamianos, K.; Potrap, I. N.; Potter, C. J.; Potter, C. T.; Poulard, G.; Poveda, J.; Pozdnyakov, V.; Pozo Astigarraga, M. E.; Pralavorio, P.; Pranko, A.; Prell, S.; Price, D.; Price, L. E.; Primavera, M.; Prince, S.; Proissl, M.; Prokofiev, K.; Prokoshin, F.; Protopopescu, S.; Proudfoot, J.; Przybycien, M.; Puddu, D.; Puldon, D.; Purohit, M.; Puzo, P.; Qian, J.; Qin, G.; Qin, Y.; Quadt, A.; Quayle, W. B.; Queitsch-Maitland, M.; Quilty, D.; Raddum, S.; Radeka, V.; Radescu, V.; Radhakrishnan, S. K.; Radloff, P.; Rados, P.; Ragusa, F.; Rahal, G.; Raine, J. A.; Rajagopalan, S.; Rammensee, M.; Rangel-Smith, C.; Ratti, M. G.; Rauscher, F.; Rave, S.; Ravenscroft, T.; Raymond, M.; Read, A. L.; Readioff, N. P.; Rebuzzi, D. M.; Redelbach, A.; Redlinger, G.; Reece, R.; Reeves, K.; Rehnisch, L.; Reichert, J.; Reisin, H.; Rembser, C.; Ren, H.; Rescigno, M.; Resconi, S.; Rezanova, O. L.; Reznicek, P.; Rezvani, R.; Richter, R.; Richter, S.; Richter-Was, E.; Ricken, O.; Ridel, M.; Rieck, P.; Riegel, C. 
J.; Rieger, J.; Rifki, O.; Rijssenbeek, M.; Rimoldi, A.; Rinaldi, L.; Ristić, B.; Ritsch, E.; Riu, I.; Rizatdinova, F.; Rizvi, E.; Rizzi, C.; Robertson, S. H.; Robichaud-Veronneau, A.; Robinson, D.; Robinson, J. E. M.; Robson, A.; Roda, C.; Rodina, Y.; Rodriguez Perez, A.; Rodriguez Rodriguez, D.; Roe, S.; Rogan, C. S.; Røhne, O.; Romaniouk, A.; Romano, M.; Romano Saez, S. M.; Romero Adam, E.; Rompotis, N.; Ronzani, M.; Roos, L.; Ros, E.; Rosati, S.; Rosbach, K.; Rose, P.; Rosenthal, O.; Rossetti, V.; Rossi, E.; Rossi, L. P.; Rosten, J. H. N.; Rosten, R.; Rotaru, M.; Roth, I.; Rothberg, J.; Rousseau, D.; Royon, C. R.; Rozanov, A.; Rozen, Y.; Ruan, X.; Rubbo, F.; Rubinskiy, I.; Rud, V. I.; Rudolph, M. S.; Rühr, F.; Ruiz-Martinez, A.; Rurikova, Z.; Rusakovich, N. A.; Ruschke, A.; Russell, H. L.; Rutherfoord, J. P.; Ruthmann, N.; Ryabov, Y. F.; Rybar, M.; Rybkin, G.; Ryu, S.; Ryzhov, A.; Saavedra, A. F.; Sabato, G.; Sacerdoti, S.; Sadrozinski, H. F.-W.; Sadykov, R.; Safai Tehrani, F.; Saha, P.; Sahinsoy, M.; Saimpert, M.; Saito, T.; Sakamoto, H.; Sakurai, Y.; Salamanna, G.; Salamon, A.; Salazar Loyola, J. E.; Salek, D.; Sales De Bruin, P. H.; Salihagic, D.; Salnikov, A.; Salt, J.; Salvatore, D.; Salvatore, F.; Salvucci, A.; Salzburger, A.; Sammel, D.; Sampsonidis, D.; Sanchez, A.; Sánchez, J.; Sanchez Martinez, V.; Sandaker, H.; Sandbach, R. L.; Sander, H. G.; Sanders, M. P.; Sandhoff, M.; Sandoval, C.; Sandstroem, R.; Sankey, D. P. C.; Sannino, M.; Sansoni, A.; Santoni, C.; Santonico, R.; Santos, H.; Santoyo Castillo, I.; Sapp, K.; Sapronov, A.; Saraiva, J. G.; Sarrazin, B.; Sasaki, O.; Sasaki, Y.; Sato, K.; Sauvage, G.; Sauvan, E.; Savage, G.; Savard, P.; Sawyer, C.; Sawyer, L.; Saxon, J.; Sbarra, C.; Sbrizzi, A.; Scanlon, T.; Scannicchio, D. A.; Scarcella, M.; Scarfone, V.; Schaarschmidt, J.; Schacht, P.; Schaefer, D.; Schaefer, R.; Schaeffer, J.; Schaepe, S.; Schaetzel, S.; Schäfer, U.; Schaffer, A. C.; Schaile, D.; Schamberger, R. D.; Scharf, V.; Schegelsky, V. 
A.; Scheirich, D.; Schernau, M.; Schiavi, C.; Schillo, C.; Schioppa, M.; Schlenker, S.; Schmieden, K.; Schmitt, C.; Schmitt, S.; Schmitz, S.; Schneider, B.; Schnellbach, Y. J.; Schnoor, U.; Schoeffel, L.; Schoening, A.; Schoenrock, B. D.; Schopf, E.; Schorlemmer, A. L. S.; Schott, M.; Schovancova, J.; Schramm, S.; Schreyer, M.; Schuh, N.; Schultens, M. J.; Schultz-Coulon, H.-C.; Schulz, H.; Schumacher, M.; Schumm, B. A.; Schune, Ph.; Schwanenberger, C.; Schwartzman, A.; Schwarz, T. A.; Schwegler, Ph.; Schweiger, H.; Schwemling, Ph.; Schwienhorst, R.; Schwindling, J.; Schwindt, T.; Sciolla, G.; Scuri, F.; Scutti, F.; Searcy, J.; Seema, P.; Seidel, S. C.; Seiden, A.; Seifert, F.; Seixas, J. M.; Sekhniaidze, G.; Sekhon, K.; Sekula, S. J.; Seliverstov, D. M.; Semprini-Cesari, N.; Serfon, C.; Serin, L.; Serkin, L.; Sessa, M.; Seuster, R.; Severini, H.; Sfiligoj, T.; Sforza, F.; Sfyrla, A.; Shabalina, E.; Shaikh, N. W.; Shan, L. Y.; Shang, R.; Shank, J. T.; Shapiro, M.; Shatalov, P. B.; Shaw, K.; Shaw, S. M.; Shcherbakova, A.; Shehu, C. Y.; Sherwood, P.; Shi, L.; Shimizu, S.; Shimmin, C. O.; Shimojima, M.; Shiyakova, M.; Shmeleva, A.; Shoaleh Saadi, D.; Shochet, M. J.; Shojaii, S.; Shrestha, S.; Shulga, E.; Shupe, M. A.; Sicho, P.; Sidebo, P. E.; Sidiropoulou, O.; Sidorov, D.; Sidoti, A.; Siegert, F.; Sijacki, Dj.; Silva, J.; Silverstein, S. B.; Simak, V.; Simard, O.; Simic, Lj.; Simion, S.; Simioni, E.; Simmons, B.; Simon, D.; Simon, M.; Sinervo, P.; Sinev, N. B.; Sioli, M.; Siragusa, G.; Sivoklokov, S. Yu.; Sjölin, J.; Sjursen, T. B.; Skinner, M. B.; Skottowe, H. P.; Skubic, P.; Slater, M.; Slavicek, T.; Slawinska, M.; Sliwa, K.; Slovak, R.; Smakhtin, V.; Smart, B. H.; Smestad, L.; Smirnov, S. Yu.; Smirnov, Y.; Smirnova, L. N.; Smirnova, O.; Smith, M. N. K.; Smith, R. W.; Smizanska, M.; Smolek, K.; Snesarev, A. A.; Snidero, G.; Snyder, S.; Sobie, R.; Socher, F.; Soffer, A.; Soh, D. A.; Sokhrannyi, G.; Solans Sanchez, C. A.; Solar, M.; Soldatov, E. 
Yu.; Soldevila, U.; Solodkov, A. A.; Soloshenko, A.; Solovyanov, O. V.; Solovyev, V.; Sommer, P.; Son, H.; Song, H. Y.; Sood, A.; Sopczak, A.; Sopko, V.; Sorin, V.; Sosa, D.; Sotiropoulou, C. L.; Soualah, R.; Soukharev, A. M.; South, D.; Sowden, B. C.; Spagnolo, S.; Spalla, M.; Spangenberg, M.; Spanò, F.; Sperlich, D.; Spettel, F.; Spighi, R.; Spigo, G.; Spiller, L. A.; Spousta, M.; Denis, R. D. St.; Stabile, A.; Staerz, S.; Stahlman, J.; Stamen, R.; Stamm, S.; Stanecka, E.; Stanek, R. W.; Stanescu, C.; Stanescu-Bellu, M.; Stanitzki, M. M.; Stapnes, S.; Starchenko, E. A.; Stark, G. H.; Stark, J.; Staroba, P.; Starovoitov, P.; Staszewski, R.; Steinberg, P.; Stelzer, B.; Stelzer, H. J.; Stelzer-Chilton, O.; Stenzel, H.; Stewart, G. A.; Stillings, J. A.; Stockton, M. C.; Stoebe, M.; Stoicea, G.; Stolte, P.; Stonjek, S.; Stradling, A. R.; Straessner, A.; Stramaglia, M. E.; Strandberg, J.; Strandberg, S.; Strandlie, A.; Strauss, M.; Strizenec, P.; Ströhmer, R.; Strom, D. M.; Stroynowski, R.; Strubig, A.; Stucci, S. A.; Stugu, B.; Styles, N. A.; Su, D.; Su, J.; Subramaniam, R.; Suchek, S.; Sugaya, Y.; Suk, M.; Sulin, V. V.; Sultansoy, S.; Sumida, T.; Sun, S.; Sun, X.; Sundermann, J. E.; Suruliz, K.; Susinno, G.; Sutton, M. R.; Suzuki, S.; Svatos, M.; Swiatlowski, M.; Sykora, I.; Sykora, T.; Ta, D.; Taccini, C.; Tackmann, K.; Taenzer, J.; Taffard, A.; Tafirout, R.; Taiblum, N.; Takai, H.; Takashima, R.; Takeda, H.; Takeshita, T.; Takubo, Y.; Talby, M.; Talyshev, A. A.; Tam, J. Y. C.; Tan, K. G.; Tanaka, J.; Tanaka, R.; Tanaka, S.; Tannenwald, B. B.; Tapia Araya, S.; Tapprogge, S.; Tarem, S.; Tartarelli, G. F.; Tas, P.; Tasevsky, M.; Tashiro, T.; Tassi, E.; Tavares Delgado, A.; Tayalati, Y.; Taylor, A. C.; Taylor, G. N.; Taylor, P. T. E.; Taylor, W.; Teischinger, F. A.; Teixeira-Dias, P.; Temming, K. K.; Temple, D.; Ten Kate, H.; Teng, P. K.; Teoh, J. J.; Tepel, F.; Terada, S.; Terashi, K.; Terron, J.; Terzo, S.; Testa, M.; Teuscher, R. 
J.; Theveneaux-Pelzer, T.; Thomas, J. P.; Thomas-Wilsker, J.; Thompson, E. N.; Thompson, P. D.; Thompson, R. J.; Thompson, A. S.; Thomsen, L. A.; Thomson, E.; Thomson, M.; Tibbetts, M. J.; Ticse Torres, R. E.; Tikhomirov, V. O.; Tikhonov, Yu. A.; Timoshenko, S.; Tipton, P.; Tisserant, S.; Todome, K.; Todorov, T.; Todorova-Nova, S.; Tojo, J.; Tokár, S.; Tokushuku, K.; Tolley, E.; Tomlinson, L.; Tomoto, M.; Tompkins, L.; Toms, K.; Tong, B.; Torrence, E.; Torres, H.; Torró Pastor, E.; Toth, J.; Touchard, F.; Tovey, D. R.; Trefzger, T.; Tremblet, L.; Tricoli, A.; Trigger, I. M.; Trincaz-Duvoid, S.; Tripiana, M. F.; Trischuk, W.; Trocmé, B.; Trofymov, A.; Troncon, C.; Trottier-McDonald, M.; Trovatelli, M.; Truong, L.; Trzebinski, M.; Trzupek, A.; Tseng, J. C.-L.; Tsiareshka, P. V.; Tsipolitis, G.; Tsirintanis, N.; Tsiskaridze, S.; Tsiskaridze, V.; Tskhadadze, E. G.; Tsui, K. M.; Tsukerman, I. I.; Tsulaia, V.; Tsuno, S.; Tsybychev, D.; Tudorache, A.; Tudorache, V.; Tuna, A. N.; Tupputi, S. A.; Turchikhin, S.; Turecek, D.; Turgeman, D.; Turra, R.; Turvey, A. J.; Tuts, P. M.; Tyndel, M.; Ucchielli, G.; Ueda, I.; Ueno, R.; Ughetto, M.; Ukegawa, F.; Unal, G.; Undrus, A.; Unel, G.; Ungaro, F. C.; Unno, Y.; Unverdorben, C.; Urban, J.; Urquijo, P.; Urrejola, P.; Usai, G.; Usanova, A.; Vacavant, L.; Vacek, V.; Vachon, B.; Valderanis, C.; Valdes Santurio, E.; Valencic, N.; Valentinetti, S.; Valero, A.; Valery, L.; Valkar, S.; Vallecorsa, S.; Valls Ferrer, J. A.; Van Den Wollenberg, W.; Van Der Deijl, P. C.; van der Geer, R.; van der Graaf, H.; van Eldik, N.; van Gemmeren, P.; Van Nieuwkoop, J.; van Vulpen, I.; van Woerden, M. C.; Vanadia, M.; Vandelli, W.; Vanguri, R.; Vaniachine, A.; Vankov, P.; Vardanyan, G.; Vari, R.; Varnes, E. W.; Varol, T.; Varouchas, D.; Vartapetian, A.; Varvell, K. E.; Vasquez, J. G.; Vazeille, F.; Vazquez Schroeder, T.; Veatch, J.; Veloce, L. 
M.; Veloso, F.; Veneziano, S.; Ventura, A.; Venturi, M.; Venturi, N.; Venturini, A.; Vercesi, V.; Verducci, M.; Verkerke, W.; Vermeulen, J. C.; Vest, A.; Vetterli, M. C.; Viazlo, O.; Vichou, I.; Vickey, T.; Vickey Boeriu, O. E.; Viehhauser, G. H. A.; Viel, S.; Vigani, L.; Vigne, R.; Villa, M.; Villaplana Perez, M.; Vilucchi, E.; Vincter, M. G.; Vinogradov, V. B.; Vittori, C.; Vivarelli, I.; Vlachos, S.; Vlasak, M.; Vogel, M.; Vokac, P.; Volpi, G.; Volpi, M.; von der Schmitt, H.; von Toerne, E.; Vorobel, V.; Vorobev, K.; Vos, M.; Voss, R.; Vossebeld, J. H.; Vranjes, N.; Vranjes Milosavljevic, M.; Vrba, V.; Vreeswijk, M.; Vuillermet, R.; Vukotic, I.; Vykydal, Z.; Wagner, P.; Wagner, W.; Wahlberg, H.; Wahrmund, S.; Wakabayashi, J.; Walder, J.; Walker, R.; Walkowiak, W.; Wallangen, V.; Wang, C.; Wang, C.; Wang, F.; Wang, H.; Wang, H.; Wang, J.; Wang, J.; Wang, K.; Wang, R.; Wang, S. M.; Wang, T.; Wang, T.; Wang, X.; Wanotayaroj, C.; Warburton, A.; Ward, C. P.; Wardrope, D. R.; Washbrook, A.; Watkins, P. M.; Watson, A. T.; Watson, I. J.; Watson, M. F.; Watts, G.; Watts, S.; Waugh, B. M.; Webb, S.; Weber, M. S.; Weber, S. W.; Webster, J. S.; Weidberg, A. R.; Weinert, B.; Weingarten, J.; Weiser, C.; Weits, H.; Wells, P. S.; Wenaus, T.; Wengler, T.; Wenig, S.; Wermes, N.; Werner, M.; Werner, P.; Wessels, M.; Wetter, J.; Whalen, K.; Whallon, N. L.; Wharton, A. M.; White, A.; White, M. J.; White, R.; White, S.; Whiteson, D.; Wickens, F. J.; Wiedenmann, W.; Wielers, M.; Wienemann, P.; Wiglesworth, C.; Wiik-Fuchs, L. A. M.; Wildauer, A.; Wilk, F.; Wilkens, H. G.; Williams, H. H.; Williams, S.; Willis, C.; Willocq, S.; Wilson, J. A.; Wingerter-Seez, I.; Winklmeier, F.; Winston, O. J.; Winter, B. T.; Wittgen, M.; Wittkowski, J.; Wollstadt, S. J.; Wolter, M. W.; Wolters, H.; Wosiek, B. K.; Wotschack, J.; Woudstra, M. J.; Wozniak, K. W.; Wu, M.; Wu, M.; Wu, S. L.; Wu, X.; Wu, Y.; Wyatt, T. R.; Wynne, B. 
M.; Xella, S.; Xu, D.; Xu, L.; Yabsley, B.; Yacoob, S.; Yakabe, R.; Yamaguchi, D.; Yamaguchi, Y.; Yamamoto, A.; Yamamoto, S.; Yamanaka, T.; Yamauchi, K.; Yamazaki, Y.; Yan, Z.; Yang, H.; Yang, H.; Yang, Y.; Yang, Z.; Yao, W.-M.; Yap, Y. C.; Yasu, Y.; Yatsenko, E.; Yau Wong, K. H.; Ye, J.; Ye, S.; Yeletskikh, I.; Yen, A. L.; Yildirim, E.; Yorita, K.; Yoshida, R.; Yoshihara, K.; Young, C.; Young, C. J. S.; Youssef, S.; Yu, D. R.; Yu, J.; Yu, J. M.; Yu, J.; Yuan, L.; Yuen, S. P. Y.; Yusuff, I.; Zabinski, B.; Zaidan, R.; Zaitsev, A. M.; Zakharchuk, N.; Zalieckas, J.; Zaman, A.; Zambito, S.; Zanello, L.; Zanzi, D.; Zeitnitz, C.; Zeman, M.; Zemla, A.; Zeng, J. C.; Zeng, Q.; Zengel, K.; Zenin, O.; Ženiš, T.; Zerwas, D.; Zhang, D.; Zhang, F.; Zhang, G.; Zhang, H.; Zhang, J.; Zhang, L.; Zhang, R.; Zhang, R.; Zhang, X.; Zhang, Z.; Zhao, X.; Zhao, Y.; Zhao, Z.; Zhemchugov, A.; Zhong, J.; Zhou, B.; Zhou, C.; Zhou, L.; Zhou, L.; Zhou, M.; Zhou, N.; Zhu, C. G.; Zhu, H.; Zhu, J.; Zhu, Y.; Zhuang, X.; Zhukov, K.; Zibell, A.; Zieminska, D.; Zimine, N. I.; Zimmermann, C.; Zimmermann, S.; Zinonos, Z.; Zinser, M.; Ziolkowski, M.; Živković, L.; Zobernig, G.; Zoccoli, A.; zur Nedden, M.; Zurzolo, G.; Zwalinski, L.

    2016-07-01

    This paper presents measurements of distributions of charged particles which are produced in proton-proton collisions at a centre-of-mass energy of √s = 8 TeV and recorded by the ATLAS detector at the LHC. A special dataset recorded in 2012 with a small number of interactions per beam crossing (below 0.004) and corresponding to an integrated luminosity of 160 μb⁻¹ was used. A minimum-bias trigger was utilised to select a data sample of more than 9 million collision events. The multiplicity, pseudorapidity, and transverse momentum distributions of charged particles are shown in different regions of kinematics and charged-particle multiplicity, including measurements of final states at high multiplicity. The results are corrected for detector effects and are compared to the predictions of various Monte Carlo event generator models which simulate the full hadronic final state.

  14. Cadastral Database Positional Accuracy Improvement

    NASA Astrophysics Data System (ADS)

    Hashim, N. M.; Omar, A. H.; Ramli, S. N. M.; Omar, K. M.; Din, N.

    2017-10-01

    Positional Accuracy Improvement (PAI) is the process of refining the geometry of features in a geospatial dataset to bring them closer to their actual positions. The actual position relates both to the absolute position in a specific coordinate system and to the relation to neighbouring features. With the growth of spatial technologies, especially Geographical Information Systems (GIS) and Global Navigation Satellite Systems (GNSS), a PAI campaign is inevitable, especially for legacy cadastral databases. Integrating a legacy dataset with a higher-accuracy dataset, such as GNSS observations, is a potential solution for improving the legacy dataset. However, merely integrating both datasets will distort the relative geometry. The improved dataset should be further treated to minimize inherent errors and to fit the new, accurate dataset. The main focus of this study is to describe a method of angular-based Least Square Adjustment (LSA) for the PAI process of a legacy dataset. The existing high-accuracy dataset known as the National Digital Cadastral Database (NDCDB) is then used as a benchmark to validate the results. It was found that the proposed technique is highly feasible for positional accuracy improvement of legacy spatial datasets.
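    As a rough illustration of fitting a legacy dataset to higher-accuracy control points by least squares, the sketch below estimates a 2D similarity (Helmert) transformation in closed form. This is a deliberately simpler stand-in and not the angular-based LSA described in the abstract; the function names `fit_helmert_2d` and `apply_helmert_2d` and the sample coordinates are hypothetical.

```python
def fit_helmert_2d(legacy, control):
    """Least-squares fit of X = a*x - b*y + tx, Y = b*x + a*y + ty.

    legacy, control: equal-length lists of (x, y) tuples for the same points.
    Illustrative only: a plain similarity fit, not the angular-based
    adjustment used in the study.
    """
    n = len(legacy)
    # Centroids of both point sets
    mx = sum(p[0] for p in legacy) / n
    my = sum(p[1] for p in legacy) / n
    mX = sum(p[0] for p in control) / n
    mY = sum(p[1] for p in control) / n

    # Accumulate the normal-equation sums on centred coordinates
    denom = num_a = num_b = 0.0
    for (x, y), (X, Y) in zip(legacy, control):
        dx, dy, dX, dY = x - mx, y - my, X - mX, Y - mY
        denom += dx * dx + dy * dy
        num_a += dx * dX + dy * dY
        num_b += dx * dY - dy * dX

    a = num_a / denom  # scale * cos(rotation)
    b = num_b / denom  # scale * sin(rotation)
    tx = mX - (a * mx - b * my)
    ty = mY - (b * mx + a * my)
    return a, b, tx, ty


def apply_helmert_2d(params, point):
    """Transform one legacy point with fitted parameters."""
    a, b, tx, ty = params
    x, y = point
    return (a * x - b * y + tx, b * x + a * y + ty)


# Hypothetical legacy parcel corners and their surveyed control positions
legacy_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
control_pts = [(10.0, 5.0), (10.0, 7.0), (8.0, 5.0), (8.0, 7.0)]
params = fit_helmert_2d(legacy_pts, control_pts)
```

    Because a similarity transformation preserves angles, it moves the legacy points onto the control frame without distorting their relative geometry, which is the concern the abstract raises about naive merging. Residuals between the transformed and control points would then indicate how well the legacy data fit.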

  15. The Balance-Scale Task Revisited: A Comparison of Statistical Models for Rule-Based and Information-Integration Theories of Proportional Reasoning

    PubMed Central

    Hofman, Abe D.; Visser, Ingmar; Jansen, Brenda R. J.; van der Maas, Han L. J.

    2015-01-01

    We propose and test three statistical models for the analysis of children’s responses to the balance scale task, a seminal task to study proportional reasoning. We use a latent class modelling approach to formulate a rule-based latent class model (RB LCM) following from a rule-based perspective on proportional reasoning and a new statistical model, the Weighted Sum Model, following from an information-integration approach. Moreover, a hybrid LCM using item covariates is proposed, combining aspects of both a rule-based and information-integration perspective. These models are applied to two different datasets, a standard paper-and-pencil test dataset (N = 779), and a dataset collected within an online learning environment that included direct feedback, time-pressure, and a reward system (N = 808). For the paper-and-pencil dataset the RB LCM resulted in the best fit, whereas for the online dataset the hybrid LCM provided the best fit. The standard paper-and-pencil dataset yielded more evidence for distinct solution rules than the online data set in which quantitative item characteristics are more prominent in determining responses. These results shed new light on the discussion on sequential rule-based and information-integration perspectives of cognitive development. PMID:26505905

  16. Multiple-input multiple-output causal strategies for gene selection.

    PubMed

    Bontempi, Gianluca; Haibe-Kains, Benjamin; Desmedt, Christine; Sotiriou, Christos; Quackenbush, John

    2011-11-25

    Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. Although these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes. This is essentially because high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting. We show in a synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition we show, in a meta-analysis study of six publicly available breast cancer microarray datasets, that the improvement also occurs in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection. Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.

  17. Facing the Challenges of Accessing, Managing, and Integrating Large Observational Datasets in Ecology: Enabling and Enriching the Use of NEON's Observational Data

    NASA Astrophysics Data System (ADS)

    Thibault, K. M.

    2013-12-01

    As the construction of NEON and its transition to operations progress, more and more data will become available to the scientific community, both from NEON directly and from the concomitant growth of existing data repositories. Many of these datasets include ecological observations of a diversity of taxa in both aquatic and terrestrial environments. Although observational data have been collected and used throughout the history of organismal biology, the field has not yet fully developed a culture of data management, documentation, standardization, sharing and discoverability to facilitate the integration and synthesis of datasets. Moreover, the tools required to accomplish these goals, namely database design, implementation, and management, and automation and parallelization of analytical tasks through computational techniques, have not historically been included in biology curricula, at either the undergraduate or graduate levels. To ensure the success of data-generating projects like NEON in advancing organismal ecology and to increase transparency and reproducibility of scientific analyses, an acceleration of the cultural shift to open science practices, the development and adoption of data standards, such as the DarwinCore standard for taxonomic data, and increased training in computational approaches for biologists need to be realized. Here I highlight several initiatives that are intended to increase access to and discoverability of publicly available datasets and equip biologists and other scientists with the skills that are needed to manage, integrate, and analyze data from multiple large-scale projects. The EcoData Retriever (ecodataretriever.org) is a tool that downloads publicly available datasets, re-formats the data into an efficient relational database structure, and then automatically imports the data tables onto a user's local drive into the database tool of the user's choice.
The automation of these tasks results in nearly instantaneous execution of tasks that previously required hours to days of each data user's time, with decreased error rates and increased usability of the data. The Ecological Data wiki (ecologicaldata.org) provides a forum for users of ecological datasets to share relevant metadata and tips and tricks for using the data, in order to flatten learning curves, as well as minimize redundancy of efforts among users of the same datasets. Finally, Software Carpentry (software-carpentry.org) has developed curricula for scientific computing and provides both online training and low-cost short courses that can be tailored to the specific needs of the students. Demand for these courses has been increasing exponentially in recent years, and they represent a significant educational resource for biologists. I will conclude by linking these initiatives to the challenges facing ecologists related to the effective and efficient exploitation of NEON's diverse data streams.

  18. Novel promoters and coding first exons in DLG2 linked to developmental disorders and intellectual disability.

    PubMed

    Reggiani, Claudio; Coppens, Sandra; Sekhara, Tayeb; Dimov, Ivan; Pichon, Bruno; Lufin, Nicolas; Addor, Marie-Claude; Belligni, Elga Fabia; Digilio, Maria Cristina; Faletra, Flavio; Ferrero, Giovanni Battista; Gerard, Marion; Isidor, Bertrand; Joss, Shelagh; Niel-Bütschi, Florence; Perrone, Maria Dolores; Petit, Florence; Renieri, Alessandra; Romana, Serge; Topa, Alexandra; Vermeesch, Joris Robert; Lenaerts, Tom; Casimir, Georges; Abramowicz, Marc; Bontempi, Gianluca; Vilain, Catheline; Deconinck, Nicolas; Smits, Guillaume

    2017-07-19

    Tissue-specific integrative omics has the potential to reveal new genic elements important for developmental disorders. Two pediatric patients with global developmental delay and intellectual disability phenotype underwent array-CGH genetic testing, both showing a partial deletion of the DLG2 gene. From independent human and murine omics datasets, we combined copy number variations, histone modifications, developmental tissue-specific regulation, and protein data to explore the molecular mechanism at play. Integrating genomics, transcriptomics, and epigenomics data, we describe two novel DLG2 promoters and coding first exons expressed in human fetal brain. Their murine conservation and protein-level evidence allowed us to produce new DLG2 gene models for human and mouse. These new genic elements are deleted in 90% of 29 patients (public and in-house) showing partial deletion of the DLG2 gene. The patients' clinical characteristics expand the neurodevelopmental phenotypic spectrum linked to DLG2 gene disruption to cognitive and behavioral categories. While protein-coding genes are regarded as well known, our work shows that integration of multiple omics datasets can unveil novel coding elements. From a clinical perspective, our work demonstrates that two new DLG2 promoters and exons are crucial for the neurodevelopmental phenotypes associated with this gene. In addition, our work brings evidence for the lack of cross-annotation in human versus mouse reference genomes and nucleotide versus protein databases.

  19. Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse.

    PubMed

    Soranno, Patricia A; Bissell, Edward G; Cheruvelil, Kendra S; Christel, Samuel T; Collins, Sarah M; Fergus, C Emi; Filstrup, Christopher T; Lapierre, Jean-Francois; Lottig, Noah R; Oliver, Samantha K; Scott, Caren E; Smith, Nicole J; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A; Gries, Corinna; Henry, Emily N; Skaff, Nick K; Stanley, Emily H; Stow, Craig A; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E

    2015-01-01

    Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database.
Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

  20. Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science through data reuse

    USGS Publications Warehouse

    Soranno, Patricia A.; Bissell, E.G.; Cheruvelil, Kendra S.; Christel, Samuel T.; Collins, Sarah M.; Fergus, C. Emi; Filstrup, Christopher T.; Lapierre, Jean-Francois; Lottig, Noah R.; Oliver, Samantha K.; Scott, Caren E.; Smith, Nicole J.; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A.; Gries, Corinna; Henry, Emily N.; Skaff, Nick K.; Stanley, Emily H.; Stow, Craig A.; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E.

    2015-01-01

    Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database.
Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

  1. Artificial neural network for suppression of banding artifacts in balanced steady-state free precession MRI.

    PubMed

    Kim, Ki Hwan; Park, Sung-Hong

    2017-04-01

    The balanced steady-state free precession (bSSFP) MR sequence is frequently used in clinics, but is sensitive to off-resonance effects, which can cause banding artifacts. Often multiple bSSFP datasets are acquired at different phase cycling (PC) angles and then combined in a special way for banding artifact suppression. Many strategies of combining the datasets have been suggested for banding artifact suppression, but there are still limitations in their performance, especially when the number of phase-cycled bSSFP datasets is small. The purpose of this study is to develop a learning-based model to combine the multiple phase-cycled bSSFP datasets for better banding artifact suppression. A multilayer perceptron (MLP) is a feedforward artificial neural network consisting of three layers: input, hidden, and output. MLP models were trained on input bSSFP datasets acquired from human brain and knee at 3T, with training performed separately for two and four PC angles. Banding-free bSSFP images were generated by maximum-intensity projection (MIP) of 8 or 12 phase-cycled datasets and were used as targets for training the output layer. The trained MLP models were applied to other brain and knee datasets acquired with different scan parameters and also to multiple phase-cycled bSSFP functional MRI datasets acquired on rat brain at 9.4T, in comparison with the conventional MIP method. Simulations were also performed to validate the MLP approach. Both the simulations and human experiments demonstrated that MLP suppressed banding artifacts significantly, superior to MIP in both banding artifact suppression and SNR efficiency. MLP demonstrated superior performance over MIP for the 9.4T fMRI data as well, which was not used for training the models, while visually preserving the fMRI maps very well. Artificial neural networks are a promising technique for combining multiple phase-cycled bSSFP datasets for banding artifact suppression. Copyright © 2016 Elsevier Inc. All rights reserved.
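
    The conventional maximum-intensity projection (MIP) combination mentioned in the abstract can be sketched in a few lines: for each pixel, take the largest magnitude across the phase-cycled images. This is a minimal sketch assuming the images are same-sized 2D grids of real or complex pixel values; the function name and the toy pixel values are hypothetical, and the learned MLP combination itself is not reproduced here.

```python
def maximum_intensity_projection(images):
    """Combine phase-cycled images by taking the pixel-wise maximum magnitude.

    images: list of 2D lists (all the same shape) of real or complex values,
    one per phase-cycling angle. Returns a 2D list of magnitudes.
    """
    rows = len(images[0])
    cols = len(images[0][0])
    return [
        [max(abs(img[r][c]) for img in images) for c in range(cols)]
        for r in range(rows)
    ]


# Toy 2x2 images: each phase cycle has its dark band in a different place
pc_0 = [[1.0, 0.1], [0.9, 1.0]]
pc_180 = [[0.2, 1.0], [1.0, 0.05]]
combined = maximum_intensity_projection([pc_0, pc_180])
```

    Since the banding nulls shift with the phase-cycling angle, a pixel that is dark in one acquisition tends to be bright in another, so the pixel-wise maximum fills in the bands. With only two or four acquisitions some residual banding can remain, which is the regime the abstract targets with the learned combination.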

  2. Integration of Neuroimaging and Microarray Datasets through Mapping and Model-Theoretic Semantic Decomposition of Unstructured Phenotypes

    PubMed Central

    Pantazatos, Spiro P.; Li, Jianrong; Pavlidis, Paul; Lussier, Yves A.

    2009-01-01

    An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n = 50), and precision of the semantic mapping between these terms across datasets was 98% (n = 100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets. PMID:20495688

  3. ISRNA: an integrative online toolkit for short reads from high-throughput sequencing data.

    PubMed

    Luo, Guan-Zheng; Yang, Wei; Ma, Ying-Ke; Wang, Xiu-Jie

    2014-02-01

    Integrative Short Reads NAvigator (ISRNA) is an online toolkit for analyzing high-throughput small RNA sequencing data. Besides the high-speed genome mapping function, ISRNA provides statistics for genomic location, length distribution and nucleotide composition bias analysis of sequence reads. It also reports the number of reads mapped to known microRNAs and other classes of short non-coding RNAs, the coverage of short reads on genes, and the expression abundance of sequence reads, along with some other analysis functions. The versatile search functions enable users to select sequence reads according to their sub-sequences, expression abundance, genomic location, relationship to genes, etc. A specialized genome browser is integrated to visualize the genomic distribution of short reads. ISRNA also supports management and comparison among multiple datasets. ISRNA is implemented in Java/C++/Perl/MySQL and can be freely accessed at http://omicslab.genetics.ac.cn/ISRNA/.

  4. Harnessing glycomics technologies: integrating structure with function for glycan characterization

    PubMed Central

    Robinson, Luke N.; Artpradit, Charlermchai; Raman, Rahul; Shriver, Zachary H.; Ruchirawat, Mathuros; Sasisekharan, Ram

    2013-01-01

    Glycans, or complex carbohydrates, are a ubiquitous class of biological molecules which impinge on a variety of physiological processes ranging from signal transduction to tissue development and microbial pathogenesis. In comparison to DNA and proteins, glycans present unique challenges to the study of their structure and function owing to their complex and heterogeneous structures and the dominant role played by multivalency in their sequence-specific biological interactions. Arising from these challenges, there is a need to integrate information from multiple complementary methods to decode structure-function relationships. Focusing on acidic glycans, we describe here key glycomics technologies for characterizing their structural attributes, including linkage, modifications, and topology, as well as for elucidating their role in biological processes. Two case studies, one involving sialylated branched glycans and the other sulfated glycosaminoglycans, are used to highlight how integration of orthogonal information from diverse datasets enables rapid convergence of glycan characterization for development of robust structure-function relationships. PMID:22522536

  5. CisSERS: Customizable in silico sequence evaluation for restriction sites

    DOE PAGES

    Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus; ...

    2016-04-12

    High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzymes. Predicted agarose gel visualization of the custom analyses results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real time agarose gel visualization. Using a simple fasta-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a java based graphical user interface built around a perl backbone. Several of the applications of CisSERS including CAPS molecular marker development were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.
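
    To make the idea of scanning FASTA-formatted sequences for restriction sites concrete, here is a minimal sketch. It is not CisSERS code and does not query REBASE; the small `SITES` table (holding the well-known EcoRI, BamHI, and HindIII recognition sequences), the function names, and the demo sequence are all assumptions for illustration.

```python
import re

# Hard-coded stand-in for an enzyme database lookup (illustrative only)
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def parse_fasta(text):
    """Parse FASTA text into {record_name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Record name is the first token after '>'
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def find_sites(sequence, site):
    """Return 0-based start positions of every (possibly overlapping) match."""
    pattern = "(?=" + re.escape(site) + ")"  # lookahead keeps overlapping hits
    return [m.start() for m in re.finditer(pattern, sequence)]


demo_fasta = ">seq1 demo\nAAGAATTCGG\nGGATCCGAATTC\n"
sequences = parse_fasta(demo_fasta)
hits = {enzyme: find_sites(sequences["seq1"], site) for enzyme, site in SITES.items()}
# hits → {"EcoRI": [2, 16], "BamHI": [10], "HindIII": []}
```

    Differences between consecutive cut positions give the predicted fragment lengths, which is the information a simulated agarose gel view would display. Comparing the hit lists for two samples shows where a polymorphism creates or removes a site, the basis of a CAPS marker.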

  6. GLobal Integrated Design Environment

    NASA Technical Reports Server (NTRS)

    Kunkel, Matthew; McGuire, Melissa; Smith, David A.; Gefert, Leon P.

    2011-01-01

    The GLobal Integrated Design Environment (GLIDE) is a collaborative engineering application built to resolve the design session issues of real-time passing of data between multiple discipline experts in a collaborative environment. Utilizing Web protocols and multiple programming languages, GLIDE allows engineers to use the applications to which they are accustomed (in this case, Excel) to send and receive datasets via the Internet to a database-driven Web server. Traditionally, a collaborative design session consists of one or more engineers representing each discipline meeting together in a single location. The discipline leads exchange parameters and iterate through their respective processes to converge on an acceptable dataset. In cases in which the engineers are unable to meet, their parameters are passed via e-mail, telephone, facsimile, or even postal mail. This slow process of data exchange could stretch a design session to weeks or even months. While the iterative process remains in place, software can now exchange parameters securely and efficiently, while at the same time allowing for much more information about a design session to be made available. GLIDE is written in a combination of several programming languages, including REALbasic, PHP, and Microsoft Visual Basic. GLIDE client installers are available to download for both Microsoft Windows and Macintosh systems. The GLIDE client software is compatible with Microsoft Excel 2000 or later on Windows systems, and with Microsoft Excel X or later on Macintosh systems. GLIDE follows the client-server paradigm, transferring encrypted and compressed data via standard Web protocols. Currently, the engineers use Excel as a front end to the GLIDE client, as many of their custom tools run in Excel.

  7. Integrative analysis for identification of shared markers from various functional cells/tissues for rheumatoid arthritis.

    PubMed

    Xia, Wei; Wu, Jian; Deng, Fei-Yan; Wu, Long-Fei; Zhang, Yong-Hong; Guo, Yu-Fan; Lei, Shu-Feng

    2017-02-01

    Rheumatoid arthritis (RA) is a systemic autoimmune disease. So far, it is unclear whether common RA-related genes are shared across different tissues/cells. In this study, we conducted an integrative analysis on multiple datasets to identify potential shared genes that are significant in multiple tissues/cells for RA. Seven microarray gene expression datasets representing various RA-related tissues/cells were downloaded from the Gene Expression Omnibus (GEO). Statistical analyses, testing both marginal and joint effects, were conducted to identify significant genes shared in various samples. Follow-up analyses comprised functional annotation clustering analysis, protein-protein interaction (PPI) analysis, gene-based association analysis, and ELISA validation analysis in in-house samples. We identified 18 shared significant genes, which were mainly involved in the immune response and chemokine signaling pathway. Among the 18 genes, eight genes (PPBP, PF4, HLA-F, S100A8, RNASEH2A, P2RY6, JAG2, and PCBP1) interact with known RA genes. Two genes (HLA-F and PCBP1) are significant in gene-based association analysis (P = 1.03E-31 and P = 1.30E-2, respectively). Additionally, PCBP1 also showed differential protein expression levels in in-house case-control plasma samples (P = 2.60E-2). This study represents the first effort to identify shared RA markers from different functional cells or tissues. The results suggest that one of the shared genes, PCBP1, is a promising biomarker for RA.

  8. CisSERS: Customizable in silico sequence evaluation for restriction sites

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus

    High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzyme sites. Predicted agarose gel visualization of the custom analyses results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real-time agarose gel visualization. Using a simple FASTA-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS, including CAPS molecular marker development, were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.
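
    The kind of in silico restriction analysis described in this abstract can be illustrated with a minimal Python sketch. It is not CisSERS code: the `ENZYMES` table is a hypothetical two-entry stand-in for a REBASE lookup (the EcoRI and BamHI recognition sequences and cut positions are the standard ones), the function names are invented for illustration, and the molecule is assumed to be linear. The fragment lengths returned by `digest` are what a predicted gel view would be drawn from.

```python
import re

# Hypothetical stand-in for a REBASE lookup: enzyme -> (recognition site,
# number of bases into the site after which the top strand is cut).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "BamHI": ("GGATCC", 1),  # G^GATCC
}


def find_sites(seq, enzyme):
    """Return 0-based start positions of every recognition site in seq."""
    site, _ = ENZYMES[enzyme]
    # A lookahead pattern also reports overlapping occurrences.
    return [m.start() for m in re.finditer(f"(?={site})", seq.upper())]


def digest(seq, enzyme):
    """Return fragment lengths, in order, for a linear molecule cut by enzyme."""
    _, offset = ENZYMES[enzyme]
    cuts = [pos + offset for pos in find_sites(seq, enzyme)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    demo = "AAGAATTCTTTTGGATCCAAGAATTCGG"
    for name in ENZYMES:
        print(name, find_sites(demo, name), digest(demo, name))
```

    A sequence variant that creates or destroys a site changes the fragment list, which is the property a CAPS marker relies on.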

  9. Long-term dataset on aquatic responses to concurrent climate change and recovery from acidification

    NASA Astrophysics Data System (ADS)

    Leach, Taylor H.; Winslow, Luke A.; Acker, Frank W.; Bloomfield, Jay A.; Boylen, Charles W.; Bukaveckas, Paul A.; Charles, Donald F.; Daniels, Robert A.; Driscoll, Charles T.; Eichler, Lawrence W.; Farrell, Jeremy L.; Funk, Clara S.; Goodrich, Christine A.; Michelena, Toby M.; Nierzwicki-Bauer, Sandra A.; Roy, Karen M.; Shaw, William H.; Sutherland, James W.; Swinton, Mark W.; Winkler, David A.; Rose, Kevin C.

    2018-04-01

    Concurrent regional and global environmental changes are affecting freshwater ecosystems. Decadal-scale data on lake ecosystems that can describe processes affected by these changes are important as multiple stressors often interact to alter the trajectory of key ecological phenomena in complex ways. Due to the practical challenges associated with long-term data collections, the majority of existing long-term data sets focus on only a small number of lakes or few response variables. Here we present physical, chemical, and biological data from 28 lakes in the Adirondack Mountains of northern New York State. These data span the period from 1994-2012 and harmonize multiple open and as-yet unpublished data sources. The dataset creation is reproducible and transparent; R code and all original files used to create the dataset are provided in an appendix. This dataset will be useful for examining ecological change in lakes undergoing multiple stressors.

  10. UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets.

    PubMed

    Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K

    2015-06-04

    Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets, where it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.

  11. Integrated dataset of impact of dissolved organic matter on particle behavior and phototoxicity of titanium dioxide nanoparticles

    EPA Pesticide Factsheets

    This dataset was generated to examine, both qualitatively and quantitatively, the interactions between nano-TiO2 and natural organic matter (NOM). This integrated dataset assembles all data generated in this project through a series of experiments. This dataset is associated with the following publication: Li, S., H. Ma, L. Wallis, M. Etterson, B. Riley, D. Hoff, and S. Diamond. Impact of natural organic matter on particle behavior and phototoxicity of titanium dioxide nanoparticles. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 542: 324-333, (2016).

  12. A case study of data integration for aquatic resources using semantic web technologies

    USGS Publications Warehouse

    Gordon, Janice M.; Chkhenkeli, Nina; Govoni, David L.; Lightsom, Frances L.; Ostroff, Andrea C.; Schweitzer, Peter N.; Thongsavanh, Phethala; Varanka, Dalia E.; Zednik, Stephan

    2015-01-01

    Use cases, information modeling, and linked data techniques are Semantic Web technologies used to develop a prototype system that integrates scientific observations from four independent USGS and cooperator data systems. The techniques were tested with a use case goal of creating a data set for use in exploring potential relationships among freshwater fish populations and environmental factors. The resulting prototype extracts data from the BioData Retrieval System, the Multistate Aquatic Resource Information System, the National Geochemical Survey, and the National Hydrography Dataset. A prototype user interface allows a scientist to select observations from these data systems and combine them into a single data set in RDF format that includes explicitly defined relationships and data definitions. The project was funded by the USGS Community for Data Integration and undertaken by the Community for Data Integration Semantic Web Working Group in order to demonstrate use of Semantic Web technologies by scientists. This allows scientists to simultaneously explore data that are available in multiple, disparate systems beyond those they traditionally have used.

  13. Study of the Integration of LIDAR and Photogrammetric Datasets by in Situ Camera Calibration and Integrated Sensor Orientation

    NASA Astrophysics Data System (ADS)

    Mitishita, E.; Costa, F.; Martins, M.

    2017-05-01

    Photogrammetric and Lidar datasets should be in the same mapping or geodetic frame to be used simultaneously in an engineering project. Nowadays direct sensor orientation is a common procedure in simultaneous photogrammetric and Lidar surveys. Although direct sensor orientation provides a high degree of automation thanks to GNSS/INS technologies, the accuracies of the results obtained from the photogrammetric and Lidar surveys depend on the quality of a group of parameters that accurately model the conditions of the system at the moment the job is performed. This paper presents a study performed to verify the importance of in situ camera calibration and Integrated Sensor Orientation without control points for increasing the accuracy of photogrammetric and Lidar dataset integration. The horizontal and vertical accuracies of photogrammetric and Lidar dataset integration by photogrammetric procedure improved significantly when the Integrated Sensor Orientation (ISO) approach was performed using Interior Orientation Parameter (IOP) values estimated from the in situ camera calibration. The horizontal and vertical accuracies, estimated by the Root Mean Square Error (RMSE) of the 3D discrepancies from the Lidar check points, improved by around 37% and 198%, respectively.
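
    For readers unfamiliar with the accuracy measure used above, the following Python sketch shows one common way to compute horizontal and vertical RMSE from check-point discrepancies. The exact formulation used by the authors is not given in the abstract, so this is an assumption: horizontal RMSE is taken over the planimetric distance (dx, dy) and vertical RMSE over dz, and the function name `rmse_components` is invented for illustration.

```python
import math


def rmse_components(discrepancies):
    """Return (horizontal_rmse, vertical_rmse) for a list of (dx, dy, dz).

    Each tuple is the difference between a photogrammetrically determined
    coordinate and the corresponding Lidar check-point coordinate, in the
    same units (e.g. metres).
    """
    n = len(discrepancies)
    if n == 0:
        raise ValueError("at least one check point is required")
    # Horizontal error combines the X and Y discrepancies of each point.
    horizontal = math.sqrt(sum(dx * dx + dy * dy for dx, dy, _ in discrepancies) / n)
    # Vertical error uses the Z discrepancy alone.
    vertical = math.sqrt(sum(dz * dz for _, _, dz in discrepancies) / n)
    return horizontal, vertical


if __name__ == "__main__":
    points = [(3.0, 4.0, 1.0), (0.0, 0.0, 1.0)]
    print(rmse_components(points))
```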

  14. Progress on big data publication and documentation for machine-to-machine discovery, access, and processing

    NASA Astrophysics Data System (ADS)

    Walker, J. I.; Blodgett, D. L.; Suftin, I.; Kunicki, T.

    2013-12-01

    High-resolution data for use in environmental modeling is increasingly becoming available at broad spatial and temporal scales. Downscaled climate projections, remotely sensed landscape parameters, and land-use/land-cover projections are examples of datasets that may exceed an individual investigation's data management and analysis capacity. To allow projects on limited budgets to work with many of these data sets, the burden of working with them must be reduced. The approach being pursued at the U.S. Geological Survey Center for Integrated Data Analytics uses standard self-describing web services that allow machine to machine data access and manipulation. These techniques have been implemented and deployed in production level server-based Web Processing Services that can be accessed from a web application or scripted workflow. Data publication techniques that allow machine-interpretation of large collections of data have also been implemented for numerous datasets at U.S. Geological Survey data centers as well as partner agencies and academic institutions. Discovery of data services is accomplished using a method in which a machine-generated metadata record holds content--derived from the data's source web service--that is intended for human interpretation as well as machine interpretation. A distributed search application has been developed that demonstrates the utility of a decentralized search of data-owner metadata catalogs from multiple agencies. The integrated but decentralized system of metadata, data, and server-based processing capabilities will be presented. The design, utility, and value of these solutions will be illustrated with applied science examples and success stories. Datasets such as the EPA's Integrated Climate and Land Use Scenarios, USGS/NASA MODIS derived land cover attributes, and downscaled climate projections from several sources are examples of data this system includes. 
    These and other datasets have been published as standard, self-describing web services that provide the ability to inspect and subset the data. This presentation will demonstrate this file-to-web-service concept and how it can be used from script-based workflows or web applications.

  15. Reconstruction of the experimentally supported human protein interactome: what can we learn?

    PubMed

    Klapa, Maria I; Tsafou, Kalliopi; Theodoridis, Evangelos; Tsakalidis, Athanasios; Moschonas, Nicholas K

    2013-10-02

    Understanding the topology and dynamics of the human protein-protein interaction (PPI) network will significantly contribute to biomedical research, therefore its systematic reconstruction is required. Several meta-databases integrate source PPI datasets, but the protein node sets of their networks vary depending on the PPI data combined. Due to this inherent heterogeneity, the way in which the human PPI network expands via multiple dataset integration has not been comprehensively analyzed. We aim at assembling the human interactome in a global structured way and exploring it to gain insights of biological relevance. First, we defined the UniProtKB manually reviewed human "complete" proteome as the reference protein-node set and then we mined five major source PPI datasets for direct PPIs exclusively between the reference proteins. We updated the protein and publication identifiers and normalized all PPIs to the UniProt identifier level. The reconstructed interactome covers approximately 60% of the human proteome and has a scale-free structure. No apparent differentiating gene functional classification characteristics were identified for the unrepresented proteins. The source dataset integration augments the network mainly in PPIs. Polyubiquitin emerged as the highest-degree node, but the inclusion of most of its identified PPIs may be reconsidered. The high number (>300) of connections of the subsequent fifteen proteins correlates well with their essential biological role. According to the power-law network structure, the unrepresented proteins should mainly have up to four connections with equally poorly-connected interactors. Reconstructing the human interactome based on the a priori definition of the protein nodes enabled us to identify the currently included part of the human "complete" proteome, and discuss the role of the proteins within the network topology with respect to their function. 
    As the network expansion has to comply with the scale-free theory, we suggest that the core of the human interactome has essentially emerged. Thus, it could be employed in systems biology and biomedical research, despite the considerable number of currently unrepresented proteins. The latter are probably involved in specialized physiological conditions, justifying the scarcity of related PPI information, and their identification can assist in designing relevant functional experiments and targeted text mining algorithms.

  16. A journey to Semantic Web query federation in the life sciences.

    PubMed

    Cheung, Kei-Hoi; Frost, H Robert; Marshall, M Scott; Prud'hommeaux, Eric; Samwald, Matthias; Zhao, Jun; Paschke, Adrian

    2009-10-01

    As interest in adopting the Semantic Web in the biomedical domain continues to grow, Semantic Web technology has been evolving and maturing. A variety of technological approaches including triplestore technologies, SPARQL endpoints, Linked Data, and Vocabulary of Interlinked Datasets have emerged in recent years. In addition to the data warehouse construction, these technological approaches can be used to support dynamic query federation. As a community effort, the BioRDF task force, within the Semantic Web for Health Care and Life Sciences Interest Group, is exploring how these emerging approaches can be utilized to execute distributed queries across different neuroscience data sources. We have created two health care and life science knowledge bases. We have explored a variety of Semantic Web approaches to describe, map, and dynamically query multiple datasets. We have demonstrated several federation approaches that integrate diverse types of information about neurons and receptors that play an important role in basic, clinical, and translational neuroscience research. Particularly, we have created a prototype receptor explorer which uses OWL mappings to provide an integrated list of receptors and executes individual queries against different SPARQL endpoints. We have also employed the AIDA Toolkit, which is directed at groups of knowledge workers who cooperatively search, annotate, interpret, and enrich large collections of heterogeneous documents from diverse locations. We have explored a tool called "FeDeRate", which enables a global SPARQL query to be decomposed into subqueries against the remote databases offering either SPARQL or SQL query interfaces. Finally, we have explored how to use the vocabulary of interlinked Datasets (voiD) to create metadata for describing datasets exposed as Linked Data URIs or SPARQL endpoints. 
    We have demonstrated the use of a set of novel and state-of-the-art Semantic Web technologies in support of a neuroscience query federation scenario. We have identified both the strengths and weaknesses of these technologies. While the Semantic Web offers a global data model including the use of Uniform Resource Identifiers (URIs), the proliferation of semantically equivalent URIs hinders large-scale data integration. Our work helps direct research and tool development, which will be of benefit to this community.

  17. A journey to Semantic Web query federation in the life sciences

    PubMed Central

    Cheung, Kei-Hoi; Frost, H Robert; Marshall, M Scott; Prud'hommeaux, Eric; Samwald, Matthias; Zhao, Jun; Paschke, Adrian

    2009-01-01

    Background: As interest in adopting the Semantic Web in the biomedical domain continues to grow, Semantic Web technology has been evolving and maturing. A variety of technological approaches including triplestore technologies, SPARQL endpoints, Linked Data, and Vocabulary of Interlinked Datasets have emerged in recent years. In addition to the data warehouse construction, these technological approaches can be used to support dynamic query federation. As a community effort, the BioRDF task force, within the Semantic Web for Health Care and Life Sciences Interest Group, is exploring how these emerging approaches can be utilized to execute distributed queries across different neuroscience data sources. Methods and results: We have created two health care and life science knowledge bases. We have explored a variety of Semantic Web approaches to describe, map, and dynamically query multiple datasets. We have demonstrated several federation approaches that integrate diverse types of information about neurons and receptors that play an important role in basic, clinical, and translational neuroscience research. Particularly, we have created a prototype receptor explorer which uses OWL mappings to provide an integrated list of receptors and executes individual queries against different SPARQL endpoints. We have also employed the AIDA Toolkit, which is directed at groups of knowledge workers who cooperatively search, annotate, interpret, and enrich large collections of heterogeneous documents from diverse locations. We have explored a tool called "FeDeRate", which enables a global SPARQL query to be decomposed into subqueries against the remote databases offering either SPARQL or SQL query interfaces. Finally, we have explored how to use the Vocabulary of Interlinked Datasets (voiD) to create metadata for describing datasets exposed as Linked Data URIs or SPARQL endpoints. Conclusion: We have demonstrated the use of a set of novel and state-of-the-art Semantic Web technologies in support of a neuroscience query federation scenario. We have identified both the strengths and weaknesses of these technologies. While the Semantic Web offers a global data model including the use of Uniform Resource Identifiers (URIs), the proliferation of semantically equivalent URIs hinders large-scale data integration. Our work helps direct research and tool development, which will be of benefit to this community. PMID:19796394

  18. Longitudinal Data on the Effectiveness of Mathematics Mini-Games in Primary Education

    ERIC Educational Resources Information Center

    Bakker, Marjoke; Van den Heuvel-Panhuizen, Marja; Robitzsch, Alexander

    2015-01-01

    This paper describes a dataset consisting of longitudinal data gathered in the BRXXX project. The aim of the project was to investigate the effectiveness of online mathematics mini-games in enhancing primary school students' multiplicative reasoning ability (multiplication and division). The dataset includes data of 719 students from 35 primary…

  19. The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping.

    PubMed

    Ashish, Naveen; Dewan, Peehoo; Toga, Arthur W

    2015-01-01

    This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal.
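
    The idea of suggesting element mappings from data-dictionary text can be illustrated with a small Python sketch. This is not the GEM system and does not reproduce its text mining or classifiers: it swaps in a plain bag-of-words cosine similarity over element descriptions, and the element names, descriptions, and function names below are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case a description and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(count * vb[tok] for tok, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def suggest_mappings(source, target):
    """Map each source element to the target element with the closest description.

    Both arguments are dicts of element name -> free-text description,
    as might be read from a data dictionary.
    """
    return {
        name: max(target, key=lambda t: cosine(desc, target[t]))
        for name, desc in source.items()
    }


if __name__ == "__main__":
    source = {
        "PTAGE": "age of participant at baseline visit",
        "MMSCORE": "mini mental state exam total score",
    }
    target = {
        "age": "participant age in years at baseline",
        "mmse": "mini mental state examination score",
        "sex": "participant sex",
    }
    print(suggest_mappings(source, target))
```

    In an active-learning setting, the lowest-scoring suggestions would be the natural candidates to send to a human reviewer first.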

  20. The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping

    PubMed Central

    Ashish, Naveen; Dewan, Peehoo; Toga, Arthur W.

    2016-01-01

    This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal. PMID:26793094

  1. The MIND PALACE: A Multi-Spectral Imaging and Spectroscopy Database for Planetary Science

    NASA Astrophysics Data System (ADS)

    Eshelman, E.; Doloboff, I.; Hara, E. K.; Uckert, K.; Sapers, H. M.; Abbey, W.; Beegle, L. W.; Bhartia, R.

    2017-12-01

    The Multi-Instrument Database (MIND) is the web-based home to a well-characterized set of analytical data collected by a suite of deep-UV fluorescence/Raman instruments built at the Jet Propulsion Laboratory (JPL). Samples derive from a growing body of planetary surface analogs, mineral and microbial standards, meteorites, spacecraft materials, and other astrobiologically relevant materials. In addition to deep-UV spectroscopy, datasets stored in MIND are obtained from a variety of analytical techniques obtained over multiple spatial and spectral scales including electron microscopy, optical microscopy, infrared spectroscopy, X-ray fluorescence, and direct fluorescence imaging. Multivariate statistical analysis techniques, primarily Principal Component Analysis (PCA), are used to guide interpretation of these large multi-analytical spectral datasets. Spatial co-referencing of integrated spectral/visual maps is performed using QGIS (geographic information system software). Georeferencing techniques transform individual instrument data maps into a layered co-registered data cube for analysis across spectral and spatial scales. The body of data in MIND is intended to serve as a permanent, reliable, and expanding database of deep-UV spectroscopy datasets generated by this unique suite of JPL-based instruments on samples of broad planetary science interest.

  2. PharmacoGx: an R package for analysis of large pharmacogenomic datasets.

    PubMed

    Smirnov, Petr; Safikhani, Zhaleh; El-Hachem, Nehme; Wang, Dong; She, Adrian; Olsen, Catharina; Freeman, Mark; Selby, Heather; Gendoo, Deena M A; Grossmann, Patrick; Beck, Andrew H; Aerts, Hugo J W L; Lupien, Mathieu; Goldenberg, Anna; Haibe-Kains, Benjamin

    2016-04-15

    Pharmacogenomics holds great promise for the development of biomarkers of drug response and the design of new therapeutic options, which are key challenges in precision medicine. However, such data are scattered and lack standards for efficient access and analysis, consequently preventing the realization of the full potential of pharmacogenomics. To address these issues, we implemented PharmacoGx, an easy-to-use, open source package for integrative analysis of multiple pharmacogenomic datasets. We demonstrate the utility of our package in comparing large drug sensitivity datasets, such as the Genomics of Drug Sensitivity in Cancer and the Cancer Cell Line Encyclopedia. Moreover, we show how to use our package to easily perform Connectivity Map analysis. With increasing availability of drug-related data, our package will open new avenues of research for meta-analysis of pharmacogenomic data. PharmacoGx is implemented in R and can be easily installed on any system. The package is available from CRAN and its source code is available from GitHub. Contact: bhaibeka@uhnresearch.ca or benjamin.haibe.kains@utoronto.ca. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  3. Fish and fishery historical data since the 19th century in the Adriatic Sea, Mediterranean

    PubMed Central

    Fortibuoni, Tomaso; Libralato, Simone; Arneri, Enrico; Giovanardi, Otello; Solidoro, Cosimo; Raicevich, Saša

    2017-01-01

    Historic data on biodiversity provide the context for present observations and allow studying long-term changes in marine populations. Here we present multiple datasets on fish and fisheries of the Adriatic Sea covering the last two centuries, ranging from qualitative observations to standardised scientific monitoring. The datasets consist of three groups: (1) early naturalists’ descriptions of fish fauna, including information (e.g., presence, perceived abundance, size) on 255 fish species for the period 1818–1936; (2) historical landings from major Northern Adriatic fish markets (Venice, Trieste, Rijeka) for the period 1902–1968, Italian official landings for the Northern and Central Adriatic (1953–2012) and landings from the Lagoon of Venice (1945–2001); (3) trawl-survey data from seven surveys spanning the period 1948–1991 and including Catch per Unit of Effort data (kg h⁻¹ and/or n h⁻¹) for 956 hauls performed at 301 stations. The integration of these datasets has already proved useful for analysing historical marine community changes over time, and their availability through an open-source data portal will facilitate analyses in the framework of marine historical ecology. PMID:28895949

  4. Hybrid 3D printing: a game-changer in personalized cardiac medicine?

    PubMed

    Kurup, Harikrishnan K N; Samuel, Bennett P; Vettukattil, Joseph J

    2015-12-01

    Three-dimensional (3D) printing in congenital heart disease has the potential to increase procedural efficiency and patient safety by improving interventional and surgical planning and reducing radiation exposure. Cardiac magnetic resonance imaging and computed tomography are usually the source datasets to derive 3D printing. More recently, 3D echocardiography has been demonstrated to derive 3D-printed models. The integration of multiple imaging modalities for hybrid 3D printing has also been shown to create accurate printed heart models, which may prove to be beneficial for interventional cardiologists, cardiothoracic surgeons, and as an educational tool. Further advancements in the integration of different imaging modalities into a single platform for hybrid 3D printing and virtual 3D models will drive the future of personalized cardiac medicine.

  5. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.

    PubMed

    Pujar, Shashikant; O'Leary, Nuala A; Farrell, Catherine M; Loveland, Jane E; Mudge, Jonathan M; Wallin, Craig; Girón, Carlos G; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; Martin, Fergal J; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Suner, Marie-Marthe; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bruford, Elspeth A; Bult, Carol J; Frankish, Adam; Murphy, Terence; Pruitt, Kim D

    2018-01-04

    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.

  6. Integrated analysis for population estimation, management impact evaluation, and decision-making for a declining species

    USGS Publications Warehouse

    Crawford, Brian A.; Moore, Clinton; Norton, Terry M.; Maerz, John C.

    2018-01-01

    A challenge for making conservation decisions is predicting how wildlife populations respond to multiple, concurrent threats and potential management strategies, usually under substantial uncertainty. Integrated modeling approaches can improve estimation of demographic rates necessary for making predictions, even for rare or cryptic species with sparse data, but their use in management applications is limited. We developed integrated models for a population of diamondback terrapins (Malaclemys terrapin) impacted by road-associated threats to (i) jointly estimate demographic rates from two mark-recapture datasets, while directly estimating road mortality and the impact of management actions deployed during the study; and (ii) project the population using population viability analysis under simulated management strategies to inform decision-making. Without management, population extirpation was nearly certain due to demographic impacts of road mortality, predators, and vegetation. Installation of novel flashing signage increased survival of terrapins that crossed roads by 30%. Signage, along with small roadside barriers installed during the study, increased population persistence probability, but the population was still predicted to decline. Management strategies that included actions targeting multiple threats and demographic rates resulted in the highest persistence probability, and roadside barriers, which increased adult survival, were predicted to increase persistence more than other actions. Our results support earlier findings showing mitigation of multiple threats is likely required to increase the viability of declining populations. Our approach illustrates how integrated models may be adapted to use limited data efficiently, represent system complexity, evaluate impacts of threats and management actions, and provide decision-relevant information for conservation of at-risk populations.

  7. MassImager: A software for interactive and in-depth analysis of mass spectrometry imaging data.

    PubMed

    He, Jiuming; Huang, Luojiao; Tian, Runtao; Li, Tiegang; Sun, Chenglong; Song, Xiaowei; Lv, Yiwei; Luo, Zhigang; Li, Xin; Abliz, Zeper

    2018-07-26

    Mass spectrometry imaging (MSI) has become a powerful tool to probe molecular events in biological tissue. However, it is widely held that one of the biggest challenges is the lack of easy-to-use data processing software for discovering the underlying biological information in large and complicated MSI datasets. Here, a user-friendly and full-featured MSI software package named MassImager, comprising three subsystems (Solution, Visualization and Intelligence), is developed with a focus on interactive visualization, in-situ biomarker discovery and artificial intelligent pathological diagnosis. Simplified data preprocessing, high-throughput MSI data exchange and serialization jointly enable quick reconstruction of ion images and rapid analysis of datasets of dozens of gigabytes. It also offers diverse self-defined operations for visual processing, including multiple ion visualization, multiple channel superposition, image normalization, visual resolution enhancement and image filtering. Regions-of-interest analysis can be performed precisely through the interactive visualization between the ion images and mass spectra, guided by the overlaid optical image, to directly find the region-specific biomarkers. Moreover, automatic pattern recognition can be achieved immediately through supervised or unsupervised multivariate statistical modeling. Clear discrimination between cancer tissue and adjacent tissue within an MSI dataset can be seen in the generated pattern image, which shows great potential for visual in-situ biomarker discovery and artificial intelligent pathological diagnosis of cancer. All the features are integrated in MassImager to provide a deep MSI processing solution at the in-situ metabolomics level for biomarker discovery and future clinical pathological diagnosis. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.

  8. Wind Integration National Dataset (WIND) Toolkit; NREL (National Renewable Energy Laboratory)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Draxl, Caroline; Hodge, Bri-Mathias

    A webinar about the Wind Integration National Dataset (WIND) Toolkit was presented by Bri-Mathias Hodge and Caroline Draxl on July 14, 2015. It was hosted by the Southern Alliance for Clean Energy. The toolkit is a grid integration data set that contains meteorological and power data at a 5-minute resolution across the continental United States for 7 years and hourly power forecasts.

  9. Joint principal trend analysis for longitudinal high-dimensional data.

    PubMed

    Zhang, Yuping; Ouyang, Zhengqing

    2018-06-01

    We consider a research scenario motivated by integrating multiple sources of information for better knowledge discovery in diverse dynamic biological processes. Given two longitudinal high-dimensional datasets for a group of subjects, we want to extract shared latent trends and identify relevant features. To solve this problem, we present a new statistical method named as joint principal trend analysis (JPTA). We demonstrate the utility of JPTA through simulations and applications to gene expression data of the mammalian cell cycle and longitudinal transcriptional profiling data in response to influenza viral infections. © 2017, The International Biometric Society.

  10. Identifying novel glioma associated pathways based on systems biology level meta-analysis.

    PubMed

    Hu, Yangfan; Li, Jinquan; Yan, Wenying; Chen, Jiajia; Li, Yin; Hu, Guang; Shen, Bairong

    2013-01-01

    Recent advances in microarray and other "-omics" technologies, including genomics, proteomics, and metabolomics, make integrating these data for the analysis of complex disease a great challenge. Glioma is an extremely aggressive and lethal form of brain tumor, and thus the study of the molecular mechanism underlying glioma remains very important. To date, most studies have focused on detecting the differentially expressed genes in glioma. However, meta-analysis for pathway analysis based on multiple microarray datasets has not been systematically pursued. In this study, we therefore developed a systems biology based approach that integrates three types of omics data to identify common pathways in glioma. First, a meta-analysis was performed to study the overlap of signatures at different levels based on the microarray gene expression data of glioma. Across these gene expression datasets, 12 pathways in the GeneGO database were found to be shared by four stages. Then, microRNA expression profiles and ChIP-seq data were integrated for further pathway enrichment analysis. As a result, we suggest that 5 of these pathways could serve as putative pathways in glioma. Among them, the pathway of TGF-beta-dependent induction of EMT via SMAD is of particular importance. Our results demonstrate that meta-analysis at the systems biology level provides a more useful approach to study the molecular mechanism of complex disease. The integration of different types of omics data, including gene expression microarrays, microRNA and ChIP-seq data, suggests some common pathways correlated with glioma. These findings will offer useful potential candidates for targeted therapeutic intervention of glioma.

  11. Dynamic Mobility Applications Policy Analysis: Policy and Institutional Issues for Integrated Dynamic Transit Operations (IDTO). [supporting datasets]

    DOT National Transportation Integrated Search

    2015-01-27

    The datasets in this zip file are in support of Intelligent Transportation Systems Joint Program Office (ITS JPO) report FHWA-JPO-14-134, "Dynamic Mobility Applications Policy Analysis: Policy and Institutional Issues for Integrated Dynamic Transit O...

  12. Improving global data infrastructures for more effective and scalable analysis of Earth and environmental data: the Australian NCI NERDIP Approach

    NASA Astrophysics Data System (ADS)

    Evans, Ben; Wyborn, Lesley; Druken, Kelsey; Richards, Clare; Trenham, Claire; Wang, Jingbo; Rozas Larraondo, Pablo; Steer, Adam; Smillie, Jon

    2017-04-01

    The National Computational Infrastructure (NCI) facility hosts one of Australia's largest repositories (10+ PBytes) of research data collections spanning datasets from climate, coasts, oceans, and geophysics through to astronomy, bioinformatics, and the social sciences domains. The data are obtained from national and international sources, spanning a wide range of gridded and ungridded (i.e., line surveys, point clouds) data, and raster imagery, as well as diverse coordinate reference projections and resolutions. Rather than managing these data assets as a digital library, whereby users can discover and download files to personal servers (similar to borrowing 'books' from a 'library'), NCI has built an extensive and well-integrated research data platform, the National Environmental Research Data Interoperability Platform (NERDIP, http://nci.org.au/data-collections/nerdip/). The NERDIP architecture enables programmatic access to data via standards-compliant services for high performance data analysis, and provides a flexible cloud-based environment to facilitate the next generation of transdisciplinary scientific research across all data domains. To improve use of modern scalable data infrastructures that are focused on efficient data analysis, the data organisation needs to be carefully managed including performance evaluations of projections and coordinate systems, data encoding standards and formats. A complication is that we have often found multiple domain vocabularies and ontologies are associated with equivalent datasets. It is not practical for individual dataset managers to determine which standards are best to apply to their dataset as this could impact accessibility and interoperability. Instead, they need to work with data custodians across interrelated communities and, in partnership with the data repository, the international scientific community to determine the most useful approach. 
For the data repository, this approach is essential to enable different disciplines and research communities to invoke new forms of analysis and discovery in an increasingly complex data-rich environment. Driven by the heterogeneity of Earth and environmental datasets, NCI developed a Data Quality/Data Assurance Strategy to ensure consistency is maintained within and across all datasets, as well as functionality testing to ensure smooth interoperability between products, tools, and services. This is particularly so for collections that contain data generated from multiple data acquisition campaigns, often using instruments and models that have evolved over time. By implementing the NCI Data Quality Strategy we have seen progressive improvement in the integration and quality of the datasets across the different subject domains, and through this, the ease by which the users can access data from this major data infrastructure. By both adhering to international standards and also contributing to extensions of these standards, data from the NCI NERDIP platform can be federated with data from other globally distributed data repositories and infrastructures. The NCI approach builds on our experience working with the astronomy and climate science communities, which have been internationally coordinating such interoperability standards within their disciplines for some years. The results of our work so far demonstrate more could be done in the Earth science, solid earth and environmental communities, particularly through establishing better linkages between international/national community efforts such as EPOS, ENVRIplus, EarthCube, AuScope and the Research Data Alliance.

  13. A Sample Data Publication: Interactive Access, Analysis and Display of Remotely Stored Datasets From Hurricane Charley

    NASA Astrophysics Data System (ADS)

    Weber, J.; Domenico, B.

    2004-12-01

    This paper is an example of what we call data interactive publications. With a properly configured workstation, the readers can click on "hotspots" in the document that launches an interactive analysis tool called the Unidata Integrated Data Viewer (IDV). The IDV will enable the readers to access, analyze and display datasets on remote servers as well as documents describing them. Beyond the parameters and datasets initially configured into the paper, the analysis tool will have access to all the other dataset parameters as well as to a host of other datasets on remote servers. These data interactive publications are built on top of several data delivery, access, discovery, and visualization tools developed by Unidata and its partner organizations. For purposes of illustrating this integrative technology, we will use data from the event of Hurricane Charley over Florida from August 13-15, 2004. This event illustrates how components of this process fit together. The Local Data Manager (LDM), Open-source Project for a Network Data Access Protocol (OPeNDAP) and Abstract Data Distribution Environment (ADDE) services, Thematic Realtime Environmental Distributed Data Service (THREDDS) cataloging services, and the IDV are highlighted in this example of a publication with embedded pointers for accessing and interacting with remote datasets. An important objective of this paper is to illustrate how these integrated technologies foster the creation of documents that allow the reader to learn the scientific concepts by direct interaction with illustrative datasets, and help build a framework for integrated Earth System science.

  14. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies

    PubMed Central

    Howie, Bryan N.; Donnelly, Peter; Marchini, Jonathan

    2009-01-01

    Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions. PMID:19543373

  15. Testing for clustering at many ranges inflates family-wise error rate (FWE).

    PubMed

    Loop, Matthew Shane; McClure, Leslie A

    2015-01-15

    Testing for clustering at multiple ranges within a single dataset is a common practice in spatial epidemiology. It is not documented whether this approach has an impact on the type 1 error rate. We estimated the family-wise error rate (FWE) for the difference in Ripley's K functions test, when testing at an increasing number of ranges at an alpha-level of 0.05. Case and control locations were generated from a Cox process on a square area the size of the continental US (≈3,000,000 mi²). Two thousand Monte Carlo replicates were used to estimate the FWE with 95% confidence intervals when testing for clustering at one range, as well as 10, 50, and 100 equidistant ranges. The estimated FWE and 95% confidence intervals when testing 10, 50, and 100 ranges were 0.22 (0.20 - 0.24), 0.34 (0.31 - 0.36), and 0.36 (0.34 - 0.38), respectively. Testing for clustering at multiple ranges within a single dataset inflated the FWE above the nominal level of 0.05. Investigators should construct simultaneous critical envelopes (available in spatstat package in R), or use a test statistic that integrates the test statistics from each range, as suggested by the creators of the difference in Ripley's K functions test.
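
    To put the reported FWE values in context, the following minimal Python sketch compares them with the benchmark for fully independent tests, 1 - (1 - alpha)^k, and runs a small Monte Carlo with correlated test statistics. It is an illustration only and does not reproduce the Ripley's K simulation in the abstract: the equicorrelated z-statistics, the function names, and the correlation value `rho` are all assumptions introduced here. One plausible reading, not stated in the abstract, is that tests at neighbouring ranges are positively correlated, which would keep the FWE below the independence benchmark while still well above 0.05.

```python
import random


def independent_fwe(alpha, k):
    """FWE for k independent tests, each at level alpha: 1 - (1 - alpha)^k."""
    return 1.0 - (1.0 - alpha) ** k


def simulated_fwe(k, rho, z_crit=1.959964, replicates=2000, seed=0):
    """Monte Carlo FWE for k two-sided z-tests under the null hypothesis.

    Assumption for illustration: the k statistics are standard normal with a
    common pairwise correlation rho (0 <= rho <= 1), built from one shared
    component plus an independent component per test.
    """
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    hits = 0
    for _ in range(replicates):
        shared = rng.gauss(0.0, 1.0)
        # A replicate counts as a family-wise error if any test rejects.
        if any(abs(a * shared + b * rng.gauss(0.0, 1.0)) > z_crit
               for _ in range(k)):
            hits += 1
    return hits / replicates


if __name__ == "__main__":
    for k in (1, 10, 50, 100):
        # Independence benchmark: about 0.05, 0.40, 0.92 and 0.99.
        print(k, round(independent_fwe(0.05, k), 3),
              simulated_fwe(k, rho=0.9))
```

    In this toy setting, raising `rho` lowers the simulated FWE toward the single-test level, while `rho = 0` recovers the independence benchmark up to Monte Carlo error.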

  16. Energize New Mexico - Integration of Diverse Energy-Related Research Data into an Interoperable Geospatial Infrastructure and National Data Repositories

    NASA Astrophysics Data System (ADS)

    Hudspeth, W. B.; Barrett, H.; Diller, S.; Valentin, G.

    2016-12-01

    Energize is New Mexico's Experimental Program to Stimulate Competitive Research (NM EPSCoR), funded by the NSF with a focus on building capacity to conduct scientific research. Energize New Mexico leverages the work of faculty and students from NM universities and colleges to provide the tools necessary for a quantitative, science-driven discussion of the state's water policy options and to realize New Mexico's potential for sustainable energy development. This presentation discusses the architectural details of NM EPSCoR's collaborative data management system, GSToRE, and how New Mexico researchers use it to share and analyze diverse research data, with the goal of attaining sustainable energy development in the state. The Earth Data Analysis Center (EDAC) at The University of New Mexico leads the development of computational interoperability capacity that allows the wide use and sharing of energy-related data among NM EPSCoR researchers. Data from a variety of research disciplines are stored and maintained in EDAC's Geographic Storage, Transformation and Retrieval Engine (GSToRE), a distributed platform for large-scale vector and raster data discovery, subsetting, and delivery via Web services that are based on Open Geospatial Consortium (OGC) and REST Web-service standards. Researchers upload and register scientific datasets using a front-end client that collects the critical metadata. In addition, researchers have the option to register their datasets with DataONE, a national, community-driven project that provides access to data across multiple member repositories. The GSToRE platform maintains a searchable, core collection of metadata elements that can be used to deliver metadata in multiple formats, including ISO 19115-2/19139 and FGDC CSDGM. Stored metadata elements also permit the platform to automate the registration of Energize datasets into DataONE, once the datasets are approved for release to the public.

  17. Correction of elevation offsets in multiple co-located lidar datasets

    USGS Publications Warehouse

    Thompson, David M.; Dalyander, P. Soupy; Long, Joseph W.; Plant, Nathaniel G.

    2017-04-07

    Introduction: Topographic elevation data collected with airborne light detection and ranging (lidar) can be used to analyze short- and long-term changes to beach and dune systems. Analysis of multiple lidar datasets at Dauphin Island, Alabama, revealed systematic, island-wide elevation differences on the order of 10s of centimeters (cm) that were not attributable to real-world change and, therefore, were likely to represent systematic sampling offsets. These offsets vary between the datasets, but appear spatially consistent within a given survey. This report describes a method that was developed to identify and correct offsets between lidar datasets collected over the same site at different times so that true elevation changes over time, associated with sediment accumulation or erosion, can be analyzed.
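
    The abstract does not say how the offsets are identified, so the following Python sketch only illustrates the general idea under stated assumptions: the two surveys are already gridded to the same cells, some cells are known to be stable between surveys (the hypothetical `stable_mask`), and the offset is a single constant per survey, estimated as the median elevation difference over the stable cells. The function names and the use of the median are choices made here, not the report's method.

```python
from statistics import median


def estimate_offset(reference, survey, stable_mask):
    """Estimate a constant vertical offset of `survey` relative to `reference`.

    Only cells flagged as stable (assumed not to have really changed) are
    used, and the median keeps a few bad cells from skewing the estimate.
    """
    diffs = [s - r for r, s, ok in zip(reference, survey, stable_mask) if ok]
    if not diffs:
        raise ValueError("no stable cells available to estimate the offset")
    return median(diffs)


def correct_survey(survey, offset):
    """Remove the estimated offset from every elevation in the survey."""
    return [z - offset for z in survey]


if __name__ == "__main__":
    # Hypothetical elevations in metres: the later survey reads about 0.2 m
    # high everywhere, and the last cell also gained 0.8 m of real elevation.
    reference = [1.0, 2.0, 3.0, 4.0]
    survey = [1.2, 2.2, 3.2, 5.0]
    stable = [True, True, True, False]

    offset = estimate_offset(reference, survey, stable)
    corrected = correct_survey(survey, offset)
    print(round(offset, 3), [round(z, 3) for z in corrected])
```

    After correction, differences between the surveys over the stable cells are close to zero, so the remaining difference in the last cell can be read as real elevation change.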

  18. A regulation probability model-based meta-analysis of multiple transcriptomics data sets for cancer biomarker identification.

    PubMed

    Xie, Xin-Ping; Xie, Yu-Feng; Wang, Hong-Qiang

    2017-08-23

    Large-scale accumulation of omics data poses a pressing challenge of integrative analysis of multiple data sets in bioinformatics. An open question of such integrative analysis is how to pinpoint consistent but subtle gene activity patterns across studies. Study heterogeneity needs to be addressed carefully for this goal. This paper proposes a regulation probability model-based meta-analysis, jGRP, for identifying differentially expressed genes (DEGs). The method integrates multiple transcriptomics data sets in a gene regulatory space instead of in a gene expression space, which makes it easy to capture and manage data heterogeneity across studies from different laboratories or platforms. Specifically, we transform gene expression profiles into a united gene regulation profile across studies by mathematically defining two gene regulation events between two conditions and estimating their occurring probabilities in a sample. Finally, a novel differential expression statistic is established based on the gene regulation profiles, realizing accurate and flexible identification of DEGs in gene regulation space. We evaluated the proposed method on simulation data and real-world cancer datasets and showed the effectiveness and efficiency of jGRP in identifying DEGs in the context of meta-analysis. Data heterogeneity largely influences the performance of meta-analysis for DEG identification. Existing meta-analysis methods were shown to exhibit very different degrees of sensitivity to study heterogeneity. The proposed method, jGRP, can be a standalone tool due to its united framework and controllable way to deal with study heterogeneity.

  19. Collaborative development of predictive toxicology applications

    PubMed Central

    2010-01-01

    OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals. The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation. 
Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way. PMID:20807436

  20. Collaborative development of predictive toxicology applications.

    PubMed

    Hardy, Barry; Douglas, Nicki; Helma, Christoph; Rautenberg, Micha; Jeliazkova, Nina; Jeliazkov, Vedrin; Nikolova, Ivelina; Benigni, Romualdo; Tcheremenskaia, Olga; Kramer, Stefan; Girschick, Tobias; Buchwald, Fabian; Wicker, Joerg; Karwath, Andreas; Gütlein, Martin; Maunz, Andreas; Sarimveis, Haralambos; Melagraki, Georgia; Afantitis, Antreas; Sopasakis, Pantelis; Gallagher, David; Poroikov, Vladimir; Filimonov, Dmitry; Zakharov, Alexey; Lagunin, Alexey; Gloriozova, Tatyana; Novikov, Sergey; Skvortsova, Natalia; Druzhilovsky, Dmitry; Chawla, Sunil; Ghosh, Indira; Ray, Surajit; Patel, Hitesh; Escher, Sylvia

    2010-08-31

    OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals. The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. 
The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation. Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.

  1. Computational and informatics strategies for identification of specific protein interaction partners in affinity purification mass spectrometry experiments

    PubMed Central

    Nesvizhskii, Alexey I.

    2013-01-01

    Analysis of protein interaction networks and protein complexes using affinity purification and mass spectrometry (AP/MS) is among the most commonly used and successful applications of proteomics technologies. One of the foremost challenges of AP/MS data is the large number of false positive protein interactions present in unfiltered datasets. Here we review computational and informatics strategies for detecting specific protein interaction partners in AP/MS experiments, with a focus on incomplete (as opposed to genome-wide) interactome mapping studies. These strategies range from standard statistical approaches, to empirical scoring schemes optimized for a particular type of data, to advanced computational frameworks. The common denominator among these methods is the use of label-free quantitative information such as spectral counts or integrated peptide intensities that can be extracted from AP/MS data. We also discuss related issues such as combining multiple biological or technical replicates, and dealing with data generated using different tagging strategies. Computational approaches for benchmarking of scoring methods are discussed, and the need for generation of reference AP/MS datasets is highlighted. Finally, we discuss the possibility of more extended modeling of experimental AP/MS data, including integration with external information such as protein interaction predictions based on functional genomics data. PMID:22611043

  2. Data mining to predict climate hotspots: an experiment in aligning federal climate enterprises in the Northwest

    NASA Astrophysics Data System (ADS)

    Mote, P.; Foster, J. G.; Daley-Laursen, S. B.

    2014-12-01

    The Northwest has the nation's strongest geographic, institutional, and scientific alignment between NOAA RISA, DOI Climate Science Center, USDA Climate Hub, and participating universities. Considering each of those institutions' distinct mission, funding structures, governance, stakeholder engagement, methods of priority-setting, and deliverables, it is a challenge to find areas of common interest and ways for these institutions to work together. In view of the rich history of stakeholder engagement and the deep base of previous research on climate change in the region, these institutions are cooperating in developing a regional capacity to mine the vast available data in ways that are mutually beneficial, synergistic, and regionally relevant. Fundamentally, data mining means exploring connections across and within multiple datasets using advanced statistical techniques, development of multidimensional indices, machine learning, and more. The challenge is not just what we do with big datasets, but how we integrate the wide variety and types of data coming out of scenario analyses to create knowledge and inform decision-making. Federal agencies and their partners need to learn to integrate big data on climate change and develop useful tools for important stakeholders to assist them in anticipating the main stresses of climate change to their own resources and preparing to abate those stresses.

  3. GEOGLE: context mining tool for the correlation between gene expression and the phenotypic distinction.

    PubMed

    Yu, Yao; Tu, Kang; Zheng, Siyuan; Li, Yun; Ding, Guohui; Ping, Jie; Hao, Pei; Li, Yixue

    2009-08-25

    In the post-genomic era, the development of high-throughput gene expression detection technology provides huge amounts of experimental data, which challenges the traditional pipelines for data processing and analyzing in scientific researches. In our work, we integrated gene expression information from Gene Expression Omnibus (GEO), biomedical ontology from Medical Subject Headings (MeSH) and signaling pathway knowledge from sigPathway entries to develop a context mining tool for gene expression analysis - GEOGLE. GEOGLE offers a rapid and convenient way for searching relevant experimental datasets, pathways and biological terms according to multiple types of queries: including biomedical vocabularies, GDS IDs, gene IDs, pathway names and signature list. Moreover, GEOGLE summarizes the signature genes from a subset of GDSes and estimates the correlation between gene expression and the phenotypic distinction with an integrated p value. This approach performing global searching of expression data may expand the traditional way of collecting heterogeneous gene expression experiment data. GEOGLE is a novel tool that provides researchers a quantitative way to understand the correlation between gene expression and phenotypic distinction through meta-analysis of gene expression datasets from different experiments, as well as the biological meaning behind. The web site and user guide of GEOGLE are available at: http://omics.biosino.org:14000/kweb/workflow.jsp?id=00020.

  4. Matching Alternative Addresses: a Semantic Web Approach

    NASA Astrophysics Data System (ADS)

    Ariannamazi, S.; Karimipour, F.; Hakimpour, F.

    2015-12-01

    Rapid development of crowd-sourcing or volunteered geographic information (VGI) provides opportunities for authorities that deal with geospatial information. Heterogeneity of multiple data sources and inconsistency of data types are key characteristics of VGI datasets. The expansion of cities has resulted in a growing number of POIs in OpenStreetMap, a well-known VGI source, which causes the datasets to become outdated within short periods of time. These changes made to spatial and aspatial attributes of features such as names and addresses might cause confusion or ambiguity in processes that require a feature's literal information, like addressing and geocoding. VGI sources will neither conform to specific vocabularies nor remain in a specific schema for a long period of time. As a result, the integration of VGI sources is crucial and inevitable in order to avoid duplication and the waste of resources. Information integration can be used to match features and qualify different annotation alternatives for disambiguation. This study enhances the search capabilities of geospatial tools with applications able to understand user terminology, in pursuit of an efficient way of finding desired results. The Semantic Web is a capable tool for developing technologies that deal with lexical and numerical calculations and estimations. There is a vast amount of literal-spatial data representing the capability of linguistic information in knowledge modeling, but these resources need to be harmonized based on Semantic Web standards. The process of making addresses homogeneous generates a helpful tool based on spatial data integration and lexical annotation matching and disambiguation.

  5. Large-scale atlas of microarray data reveals the distinct expression landscape of different tissues in Arabidopsis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    He, Fei; Maslov, Sergei; Yoo, Shinjae

    Here, transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metadata or differences in annotation styles by different labs. In this study, we carefully selected and integrated 6,057 Arabidopsis microarray expression samples from 304 experiments deposited to NCBI GEO. Metadata such as tissue type, growth condition, and developmental stage were manually curated for each sample. We then studied global expression landscape of the integrated dataset and found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome compared to aerial tissues, but the transcriptome of cultured root is more similar to those of aerial tissues as the former samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating re-use of plant transcriptome data. As a proof of principle we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified accuracy of our predictions with samples’ metadata provided by authors.

  6. Large-scale atlas of microarray data reveals the distinct expression landscape of different tissues in Arabidopsis

    DOE PAGES

    He, Fei; Maslov, Sergei; Yoo, Shinjae; ...

    2016-05-25

    Here, transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metadata or differences in annotation styles by different labs. In this study, we carefully selected and integrated 6,057 Arabidopsis microarray expression samples from 304 experiments deposited to NCBI GEO. Metadata such as tissue type, growth condition, and developmental stage were manually curated for each sample. We then studied global expression landscape of the integrated dataset and found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome compared to aerial tissues, but the transcriptome of cultured root is more similar to those of aerial tissues as the former samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating re-use of plant transcriptome data. As a proof of principle we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified accuracy of our predictions with samples’ metadata provided by authors.

  7. WholePathwayScope: a comprehensive pathway-based analysis tool for high-throughput data

    PubMed Central

    Yi, Ming; Horton, Jay D; Cohen, Jonathan C; Hobbs, Helen H; Stephens, Robert M

    2006-01-01

    Background Analysis of High Throughput (HTP) Data such as microarray and proteomics data has provided a powerful methodology to study patterns of gene regulation at genome scale. A major unresolved problem in the post-genomic era is to assemble the large amounts of data generated into a meaningful biological context. We have developed a comprehensive software tool, WholePathwayScope (WPS), for deriving biological insights from analysis of HTP data. Result WPS extracts gene lists with shared biological themes through color cue templates. WPS statistically evaluates global functional category enrichment of gene lists and pathway-level pattern enrichment of data. WPS incorporates well-known biological pathways from KEGG (Kyoto Encyclopedia of Genes and Genomes) and Biocarta, GO (Gene Ontology) terms as well as user-defined pathways or relevant gene clusters or groups, and explores gene-term relationships within the derived gene-term association networks (GTANs). WPS simultaneously compares multiple datasets within biological contexts either as pathways or as association networks. WPS also integrates Genetic Association Database and Partial MedGene Database for disease-association information. We have used this program to analyze and compare microarray and proteomics datasets derived from a variety of biological systems. Application examples demonstrated the capacity of WPS to significantly facilitate the analysis of HTP data for integrative discovery. Conclusion This tool represents a pathway-based platform for discovery integration to maximize analysis power. The tool is freely available at . PMID:16423281

  8. Integrating genome-wide association study and expression quantitative trait loci data identifies multiple genes and gene set associated with neuroticism.

    PubMed

    Fan, Qianrui; Wang, Wenyu; Hao, Jingcan; He, Awen; Wen, Yan; Guo, Xiong; Wu, Cuiyan; Ning, Yujie; Wang, Xi; Wang, Sen; Zhang, Feng

    2017-08-01

    Neuroticism is a fundamental personality trait with a significant genetic determinant. To identify novel susceptibility genes for neuroticism, we conducted an integrative analysis of genomic and transcriptomic data from a genome-wide association study (GWAS) and an expression quantitative trait locus (eQTL) study. GWAS summary data were derived from published studies of neuroticism, involving a total of 170,906 subjects. An eQTL dataset containing 927,753 eQTLs was obtained from an eQTL meta-analysis of 5311 samples. Integrative analysis of GWAS and eQTL data was conducted with summary data-based Mendelian randomization (SMR) analysis software. To identify neuroticism-associated gene sets, the SMR analysis results were further subjected to gene set enrichment analysis (GSEA). The gene set annotation dataset (containing 13,311 annotated gene sets) of the GSEA Molecular Signatures Database was used. SMR single gene analysis identified 6 significant genes for neuroticism, including MSRA (p value=2.27×10⁻¹⁰), MGC57346 (p value=6.92×10⁻⁷), BLK (p value=1.01×10⁻⁶), XKR6 (p value=1.11×10⁻⁶), C17ORF69 (p value=1.12×10⁻⁶) and KIAA1267 (p value=4.00×10⁻⁶). Gene set enrichment analysis observed a significant association for the Chr8p23 gene set (false discovery rate=0.033). Our results provide novel clues for studies of the genetic mechanism of neuroticism. Copyright © 2017. Published by Elsevier Inc.

  9. Identifying Drug-Target Interactions with Decision Templates.

    PubMed

    Yan, Xiao-Ying; Zhang, Shao-Wu

    2018-01-01

    During the development process of new drugs, identification of drug-target interactions is a primary concern. However, chemical or biological experiments are limited in coverage and carry a huge cost in both time and money. Based on drug similarity and target similarity, chemogenomic methods can predict potential drug-target interactions (DTIs) on a large scale and do not require target structures or ligand entries. In order to reflect the cases that the drugs having variant structures interact with common targets and the targets having dissimilar sequences interact with same drugs. In addition, though several other similarity metrics have been developed to predict DTIs, the combination of multiple similarity metrics (especially heterogeneous similarities) is too naïve to sufficiently explore the multiple similarities. In this paper, based on Gene Ontology and pathway annotation, we introduce two novel target similarity metrics to address the above issues. More importantly, we propose a more effective strategy via decision templates to integrate multiple classifiers designed with multiple similarity metrics. In the scenarios that predict existing targets for new drugs and predict approved drugs for new protein targets, the results on the DTI benchmark datasets show that our target similarity metrics are able to enhance the predictive accuracy in both scenarios. Moreover, the elaborate fusion strategy of multiple classifiers has better predictive power than the naïve combination of multiple similarity metrics. Compared with two other state-of-the-art approaches on the four popular benchmark datasets of binary drug-target interactions, our method achieves the best results in terms of AUC and AUPR for predicting available targets for new drugs (S2), and predicting approved drugs for new protein targets (S3). These results demonstrate that our method can effectively predict drug-target interactions.
The software package is freely available at https://github.com/NwpuSY/DT_all.git for academic users. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  10. Action recognition via cumulative histogram of multiple features

    NASA Astrophysics Data System (ADS)

    Yan, Xunshi; Luo, Yupin

    2011-01-01

    Spatial-temporal interest points (STIPs) are popular in human action recognition. However, they suffer from difficulties in determining the size of the codebook and from losing much information when forming histograms. In this paper, spatial-temporal interest regions (STIRs) are proposed, which are based on STIPs and are capable of marking the locations of the most "shining" human body parts. In order to represent human actions, the proposed approach takes advantage of multiple features, including STIRs, pyramid histogram of oriented gradients and pyramid histogram of oriented optical flows. To achieve this, a cumulative histogram is used to integrate dynamic information in sequences and to form feature vectors. Furthermore, the widely used nearest neighbor and AdaBoost methods are employed as classification algorithms. Experiments on the public datasets KTH, Weizmann and UCF sports show that the proposed approach achieves effective and robust results.

  11. Integrated analysis of microRNAs, transcription factors and target genes expression discloses a specific molecular architecture of hyperdiploid multiple myeloma

    PubMed Central

    Caracciolo, Daniele; Agnelli, Luca; Neri, Antonino; Walker, Brian A.; Morgan, Gareth J.; Cannataro, Mario

    2015-01-01

    Multiple Myeloma (MM) is a malignancy characterized by the hyperdiploid (HD-MM) and the non-hyperdiploid (nHD-MM) subtypes. To shed light on the molecular architecture of these subtypes, we used a novel integromics approach. Using annotated MM patient mRNA/microRNA (miRNA) datasets, we investigated mRNA and miRNA profiles in relation to changes in the expression of transcriptional regulators. We found that HD-MM displays specific gene and miRNA expression profiles, involving the Signal Transducer and Activator of Transcription (STAT)3 pathway as well as the Transforming Growth Factor–beta (TGFβ) and the transcription regulator Nuclear Protein-1 (NUPR1). Our data define specific molecular features of HD-MM that may translate into the identification of novel relevant druggable targets. PMID:26056083

  12. Colorado Late Cenozoic Fault and Fold Database and Internet Map Server: User-friendly technology for complex information

    USGS Publications Warehouse

    Morgan, K.S.; Pattyn, G.J.; Morgan, M.L.

    2005-01-01

    Internet mapping applications for geologic data allow simultaneous data delivery and collection, enabling quick data modification while efficiently supplying the end user with information. Utilizing Web-based technologies, the Colorado Geological Survey's Colorado Late Cenozoic Fault and Fold Database was transformed from a monothematic, nonspatial Microsoft Access database into a complex information set incorporating multiple data sources. The resulting user-friendly format supports easy analysis and browsing. The core of the application is the Microsoft Access database, which contains information compiled from available literature about faults and folds that are known or suspected to have moved during the late Cenozoic. The database contains nonspatial fields such as structure type, age, and rate of movement. Geographic locations of the fault and fold traces were compiled from previous studies at 1:250,000 scale to form a spatial database containing information such as length and strike. Integration of the two databases allowed both spatial and nonspatial information to be presented on the Internet as a single dataset (http://geosurvey.state.co.us/pubs/ceno/). The user-friendly interface enables users to view and query the data in an integrated manner, thus providing multiple ways to locate desired information. Retaining the digital data format also allows continuous data updating and quick delivery of newly acquired information. This dataset is a valuable resource to anyone interested in earthquake hazards and the activity of faults and folds in Colorado. Additional geologic hazard layers and imagery may aid in decision support and hazard evaluation. The up-to-date and customizable maps are invaluable tools for researchers or the public.

  13. A trust-based recommendation method using network diffusion processes

    NASA Astrophysics Data System (ADS)

    Chen, Ling-Jiao; Gao, Jian

    2018-09-01

    A variety of rating-based recommendation methods have been extensively studied, including the well-known collaborative filtering approaches and some network diffusion-based methods; however, social trust relations are not sufficiently considered when making recommendations. In this paper, we contribute to the literature by proposing a trust-based recommendation method, named CosRA+T, after integrating the information of trust relations into the resource-redistribution process. Specifically, a tunable parameter is used to scale the resources received by trusted users before the redistribution back to the objects. Interestingly, we find an optimal scaling parameter for the proposed CosRA+T method to achieve its best recommendation accuracy, and the optimal value seems to be universal under several evaluation metrics across different datasets. Moreover, results of extensive experiments on the two real-world rating datasets with trust relations, Epinions and FriendFeed, suggest that CosRA+T has a remarkable improvement in overall accuracy, diversity and novelty. Our work takes a step towards designing better recommendation algorithms by employing multiple resources of social network information.

  14. CORUM: the comprehensive resource of mammalian protein complexes

    PubMed Central

    Ruepp, Andreas; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Stransky, Michael; Waegele, Brigitte; Schmidt, Thorsten; Doudieu, Octave Noubibou; Stümpflen, Volker; Mewes, H. Werner

    2008-01-01

    Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The CORUM (http://mips.gsf.de/genre/proj/corum/index.html) database is a collection of experimentally verified mammalian protein complexes. Information is manually derived by expert annotators through critical reading of the scientific literature. Information about protein complexes includes protein complex names, subunits, literature references as well as the function of the complexes. For functional annotation, we use the FunCat catalogue, which makes it possible to organize the protein complex space into biologically meaningful subsets. The database contains more than 1750 protein complexes that are built from 2400 different genes, thus representing 12% of the protein-coding genes in human. A web-based system is available to query, view and download the data. CORUM provides a comprehensive dataset of protein complexes for discoveries in systems biology, analyses of protein networks and protein complex-associated diseases. Comparable to the MIPS reference dataset of protein complexes from yeast, CORUM intends to serve as a reference for mammalian protein complexes. PMID:17965090

  15. Assessing reliability of protein-protein interactions by integrative analysis of data in model organisms.

    PubMed

    Lin, Xiaotong; Liu, Mei; Chen, Xue-wen

    2009-04-29

    Protein-protein interactions play vital roles in nearly all cellular processes and are involved in the construction of biological pathways such as metabolic and signal transduction pathways. Although large-scale experiments have enabled the discovery of thousands of previously unknown linkages among proteins in many organisms, the high-throughput interaction data is often associated with high error rates. Since protein interaction networks have been utilized in numerous biological inferences, the inclusive experimental errors inevitably affect the quality of such prediction. Thus, it is essential to assess the quality of the protein interaction data. In this paper, a novel Bayesian network-based integrative framework is proposed to assess the reliability of protein-protein interactions. We develop a cross-species in silico model that assigns likelihood scores to individual protein pairs based on the information entirely extracted from model organisms. Our proposed approach integrates multiple microarray datasets and novel features derived from gene ontology. Furthermore, the confidence scores for cross-species protein mappings are explicitly incorporated into our model. Applying our model to predict protein interactions in the human genome, we are able to achieve 80% in sensitivity and 70% in specificity. Finally, we assess the overall quality of the experimentally determined yeast protein-protein interaction dataset. We observe that the more high-throughput experiments confirming an interaction, the higher the likelihood score, which confirms the effectiveness of our approach. This study demonstrates that model organisms certainly provide important information for protein-protein interaction inference and assessment. The proposed method is able to assess not only the overall quality of an interaction dataset, but also the quality of individual protein-protein interactions. 
We expect the method to continually improve as more high quality interaction data from more model organisms becomes available and is readily scalable to a genome-wide application.

  16. Multivariate spatiotemporal visualizations for mobile devices in Flyover Country

    NASA Astrophysics Data System (ADS)

    Loeffler, S.; Thorn, R.; Myrbo, A.; Roth, R.; Goring, S. J.; Williams, J.

    2017-12-01

    Visualizing and interacting with complex multivariate and spatiotemporal datasets on mobile devices is challenging due to their smaller screens, reduced processing power, and limited data connectivity. Pollen data require visualizing pollen assemblages spatially, temporally, and across multiple taxa to understand plant community dynamics through time. Drawing from cartography, information visualization, and paleoecology, we have created new mobile-first visualization techniques that represent multiple taxa across many sites and enable user interaction. Using pollen datasets from the Neotoma Paleoecology Database as a case study, the visualization techniques allow ecological patterns and trends to be quickly understood on a mobile device compared to traditional pollen diagrams and maps. This flexible visualization system can be used for datasets beyond pollen, with the only requirements being point-based localities and multiple variables changing through time or depth.

  17. Multi-Party Privacy-Preserving Set Intersection with Quasi-Linear Complexity

    NASA Astrophysics Data System (ADS)

    Cheon, Jung Hee; Jarecki, Stanislaw; Seo, Jae Hong

    Secure computation of the set intersection functionality allows n parties to find the intersection between their datasets without revealing anything else about them. An efficient protocol for such a task could have multiple potential applications in commerce, health care, and security. However, all currently known secure set intersection protocols for n>2 parties have computational costs that are quadratic in the (maximum) number of entries in the dataset contributed by each party, making secure computation of the set intersection only practical for small datasets. In this paper, we describe the first multi-party protocol for securely computing the set intersection functionality with both the communication and the computation costs that are quasi-linear in the size of the datasets. For a fixed security parameter, our protocols require O(n²k) bits of communication and Õ(n²k) group multiplications per player in the malicious adversary setting, where k is the size of each dataset. Our protocol follows the basic idea of the protocol proposed by Kissner and Song, but we gain efficiency by using different representations of the polynomials associated with users' datasets and careful employment of algorithms that interpolate or evaluate polynomials on multiple points more efficiently. Moreover, the proposed protocol is robust. This means that the protocol outputs the desired result even if some corrupted players leave during the execution of the protocol.
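
    The polynomial representation mentioned in this abstract can be illustrated with a small, non-private sketch: each set is encoded as the polynomial whose roots are its elements, each polynomial is multiplied by a random mask, and the masked polynomials are summed, so that an element of the intersection is a root of the sum while other elements are roots only with negligible probability. This is only the underlying algebra, not the protocol from the paper: there is no encryption, the arithmetic is the naive quadratic kind rather than the quasi-linear algorithms the authors describe, and the modulus `P` and all function names are assumptions made for illustration.

```python
import random

P = 2_147_483_647  # prime modulus (2**31 - 1); an illustrative choice, not from the paper


def poly_from_set(elements):
    """Coefficients (lowest degree first) of prod(x - e) mod P."""
    coeffs = [1]
    for e in elements:
        new = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            new[i] = (new[i] - e * c) % P      # contribution of -e * c * x^i
            new[i + 1] = (new[i + 1] + c) % P  # contribution of c * x^(i+1)
        coeffs = new
    return coeffs


def poly_mul(a, b):
    """Naive O(len(a) * len(b)) product mod P."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % P
    return out


def poly_add(a, b):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [(x + y) % P for x, y in zip(a, b)]


def poly_eval(coeffs, x):
    """Horner evaluation mod P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc


def intersection(sets, rng=None):
    """Elements of sets[0] that are roots of the masked sum of all set polynomials."""
    rng = rng or random.Random(0)
    total = [0]
    for s in sets:
        f = poly_from_set(s)
        mask = [rng.randrange(P) for _ in range(len(f))]  # random masking polynomial
        total = poly_add(total, poly_mul(f, mask))
    return {e for e in sets[0] if poly_eval(total, e) == 0}


result = intersection([{1, 2, 3, 4}, {2, 4, 6}, {9, 4, 2}])
print(sorted(result))
```

    In a real protocol the coefficients would be encrypted and the masks contributed by different players; the sketch keeps everything in the clear so that the root-based membership test is easy to follow.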

  18. DAMe: a toolkit for the initial processing of datasets with PCR replicates of double-tagged amplicons for DNA metabarcoding analyses.

    PubMed

    Zepeda-Mendoza, Marie Lisandra; Bohmann, Kristine; Carmona Baez, Aldo; Gilbert, M Thomas P

    2016-05-03

    DNA metabarcoding is an approach for identifying multiple taxa in an environmental sample using specific genetic loci and taxa-specific primers. When combined with high-throughput sequencing it enables the taxonomic characterization of large numbers of samples in a relatively time- and cost-efficient manner. One recent laboratory development is the addition of 5'-nucleotide tags to both primers producing double-tagged amplicons and the use of multiple PCR replicates to filter erroneous sequences. However, there is currently no available toolkit for the straightforward analysis of datasets produced in this way. We present DAMe, a toolkit for the processing of datasets generated by double-tagged amplicons from multiple PCR replicates derived from an unlimited number of samples. Specifically, DAMe can be used to (i) sort amplicons by tag combination, (ii) evaluate PCR replicates dissimilarity, and (iii) filter sequences derived from sequencing/PCR errors, chimeras, and contamination. This is attained by calculating the following parameters: (i) sequence content similarity between the PCR replicates from each sample, (ii) reproducibility of each unique sequence across the PCR replicates, and (iii) copy number of the unique sequences in each PCR replicate. We showcase the insights that can be obtained using DAMe prior to taxonomic assignment, by applying it to two real datasets that vary in their complexity regarding number of samples, sequencing libraries, PCR replicates, and used tag combinations. Finally, we use a third mock dataset to demonstrate the impact and importance of filtering the sequences with DAMe. DAMe allows the user-friendly manipulation of amplicons derived from multiple samples with PCR replicates built in a single or multiple sequencing libraries. 
It allows the user to: (i) collapse amplicons into unique sequences and sort them by tag combination while retaining the sample identifier and copy number information, (ii) identify sequences carrying unused tag combinations, (iii) evaluate the comparability of PCR replicates of the same sample, and (iv) filter tagged amplicons from a number of PCR replicates using parameters of minimum length, copy number, and reproducibility across the PCR replicates. This enables an efficient analysis of complex datasets, and ultimately increases the ease of handling datasets from large-scale studies.
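
    The filtering step in (iv) can be pictured with a short sketch. It is not DAMe's code or command-line interface; the function name, parameter names and default values below are hypothetical, and the rule shown (a sequence must meet a minimum length and a minimum copy number within a replicate, and do so in a minimum number of replicates) is one plausible reading of the criteria listed in the abstract.

```python
from collections import Counter


def filter_sequences(replicates, min_length=1, min_copies=1, min_replicates=2):
    """Keep sequences that pass length and copy-number thresholds in enough replicates.

    replicates: list of Counter objects, one per PCR replicate of a sample,
                mapping a unique sequence to its copy number.
    Returns a dict mapping each retained sequence to its summed copy number
    over the replicates in which it passed the thresholds.
    """
    support = Counter()  # number of replicates in which the sequence passed
    totals = Counter()   # summed copy number over those replicates
    for rep in replicates:
        for seq, copies in rep.items():
            if len(seq) >= min_length and copies >= min_copies:
                support[seq] += 1
                totals[seq] += copies
    return {seq: totals[seq] for seq, n in support.items() if n >= min_replicates}


# Toy data: three PCR replicates of one sample.
reps = [
    Counter({"ACGT": 5, "AC": 9, "GGGG": 1}),
    Counter({"ACGT": 3, "GGGG": 4}),
    Counter({"ACGT": 2, "TTTT": 7}),
]
kept = filter_sequences(reps, min_length=3, min_copies=2, min_replicates=2)
print(kept)
```

    With these toy thresholds only the sequence seen with enough copies in at least two replicates survives, which is the intuition behind using PCR replicates to screen out sequencing and PCR errors.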

  19. xMSanalyzer: automated pipeline for improved feature detection and downstream analysis of large-scale, non-targeted metabolomics data.

    PubMed

    Uppal, Karan; Soltow, Quinlyn A; Strobel, Frederick H; Pittard, W Stephen; Gernert, Kim M; Yu, Tianwei; Jones, Dean P

    2013-01-16

    Detection of low abundance metabolites is important for de novo mapping of metabolic pathways related to diet, microbiome or environmental exposures. Multiple algorithms are available to extract m/z features from liquid chromatography-mass spectral data in a conservative manner, which tends to preclude detection of low abundance chemicals and chemicals found in small subsets of samples. The present study provides software to enhance such algorithms for feature detection, quality assessment, and annotation. xMSanalyzer is a set of utilities for automated processing of metabolomics data. The utilities can be classified into four main modules to: 1) improve feature detection for replicate analyses by systematic re-extraction with multiple parameter settings and data merger to optimize the balance between sensitivity and reliability, 2) evaluate sample quality and feature consistency, 3) detect feature overlap between datasets, and 4) characterize high-resolution m/z matches to small molecule metabolites and biological pathways using multiple chemical databases. The package was tested with plasma samples and shown to more than double the number of features extracted while improving quantitative reliability of detection. MS/MS analysis of a random subset of peaks that were exclusively detected using xMSanalyzer confirmed that the optimization scheme improves detection of real metabolites. xMSanalyzer is a package of utilities for data extraction, quality control assessment, detection of overlapping and unique metabolites in multiple datasets, and batch annotation of metabolites. The program was designed to integrate with existing packages such as apLCMS and XCMS, but the framework can also be used to enhance data extraction for other LC/MS data software.
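
    Module 3 (feature overlap between datasets) can be sketched as a tolerance-based match. The abstract does not state the matching rule, so the version below assumes that two features overlap when their m/z values agree within a parts-per-million tolerance and their retention times agree within a fixed window; the function name, the tolerances and the tuple layout are all hypothetical and are not taken from the xMSanalyzer package.

```python
def overlapping_features(features_a, features_b, ppm=10.0, rt_tol=30.0):
    """Pair features from two datasets that agree in m/z and retention time.

    features_a, features_b: lists of (mz, retention_time_seconds) tuples.
    ppm:    maximum m/z difference in parts per million (assumed default).
    rt_tol: maximum retention-time difference in seconds (assumed default).
    """
    matches = []
    for mz_a, rt_a in features_a:
        for mz_b, rt_b in features_b:
            ppm_diff = abs(mz_a - mz_b) / mz_a * 1e6
            if ppm_diff <= ppm and abs(rt_a - rt_b) <= rt_tol:
                matches.append(((mz_a, rt_a), (mz_b, rt_b)))
    return matches


a = [(200.0, 100.0), (300.0, 50.0)]
b = [(200.001, 110.0), (300.1, 50.0)]
print(overlapping_features(a, b))
```

    The nested loop is quadratic and only meant to show the criterion; for large feature tables one would typically sort by m/z and scan a sliding window instead.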

  20. Advancements in Wind Integration Study Data Modeling: The Wind Integration National Dataset (WIND) Toolkit; Preprint

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Draxl, C.; Hodge, B. M.; Orwig, K.

    2013-10-01

    Regional wind integration studies in the United States require detailed wind power output data at many locations to perform simulations of how the power system will operate under high-penetration scenarios. The wind data sets that serve as inputs into the study must realistically reflect the ramping characteristics, spatial and temporal correlations, and capacity factors of the simulated wind plants, as well as be time synchronized with available load profiles. The Wind Integration National Dataset (WIND) Toolkit described in this paper fulfills these requirements. A wind resource dataset, wind power production time series, and simulated forecasts from a numerical weather prediction model run on a nationwide 2-km grid at 5-min resolution will be made publicly available for more than 110,000 onshore and offshore wind power production sites.

  1. Extraction of Urban Trees from Integrated Airborne Based Digital Image and LIDAR Point Cloud Datasets - Initial Results

    NASA Astrophysics Data System (ADS)

    Dogon-yaro, M. A.; Kumar, P.; Rahman, A. Abdul; Buyuksalih, G.

    2016-10-01

    Timely and accurate acquisition of information on the condition and structural changes of urban trees serves as a tool for decision makers to better appreciate urban ecosystems and their numerous values, which are critical to building strategies for sustainable development. The conventional techniques used for extracting tree features include ground surveying and interpretation of aerial photography. However, these techniques are associated with some constraints, such as labour-intensive field work, high financial requirements, and the influence of weather conditions and topographical cover, which can be overcome by means of integrated airborne LiDAR and very high resolution digital image datasets. This study presents a semi-automated approach for extracting urban trees from integrated airborne LiDAR and multispectral digital image datasets over the city of Istanbul, Turkey. The scheme includes detection and extraction of shadow-free vegetation features based on spectral properties of digital images using shadow index and NDVI techniques, and automated extraction of 3D information about vegetation features from the integrated processing of the shadow-free vegetation image and LiDAR point cloud datasets. The developed algorithms show promising results as an automated and cost-effective approach to estimating and delineating 3D information of urban trees. The research also showed that integrated datasets are a suitable technology and a viable source of information for city managers to use in urban tree management.
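
    As a rough illustration of the NDVI step mentioned in this abstract, the sketch below computes the standard index, (NIR - Red) / (NIR + Red), per pixel and thresholds it to obtain a vegetation mask. The function names, the sample reflectance values, and the 0.3 threshold are assumptions made for illustration; the abstract does not give the authors' threshold, and their shadow index is not specified, so it is not reproduced here.

```python
def ndvi(nir, red):
    """Standard NDVI for one pixel: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    # Guard against division by zero for pixels with no signal.
    return 0.0 if total == 0 else (nir - red) / total


def vegetation_mask(nir_band, red_band, threshold=0.3):
    """Flag pixels whose NDVI exceeds a (hypothetical) threshold."""
    return [
        [ndvi(n, r) > threshold for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Made-up 2x2 reflectance values, for illustration only.
nir_band = [[0.60, 0.10], [0.50, 0.00]]
red_band = [[0.10, 0.10], [0.25, 0.00]]
mask = vegetation_mask(nir_band, red_band)
```

    In a pipeline like the one described, a mask of this kind would be combined with a shadow mask and then used to select LiDAR points for 3D extraction; those later stages are outside this sketch.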

  2. The Wind Integration National Dataset (WIND) toolkit (Presentation)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Caroline Draxl: NREL

    2014-01-01

    Regional wind integration studies require detailed wind power output data at many locations to perform simulations of how the power system will operate under high penetration scenarios. The wind datasets that serve as inputs into the study must realistically reflect the ramping characteristics, spatial and temporal correlations, and capacity factors of the simulated wind plants, as well as being time synchronized with available load profiles. As described in this presentation, the WIND Toolkit fulfills these requirements by providing a state-of-the-art national (US) wind resource, power production and forecast dataset.

  3. Modal Analysis Using the Singular Value Decomposition and Rational Fraction Polynomials

    DTIC Science & Technology

    2017-04-06

    ...results. The programs are designed for experimental datasets with multiple drive and response points and have proven effective even for systems with numerous closely-spaced

  4. Substituting values for censored data from Texas, USA, reservoirs inflated and obscured trends in analyses commonly used for water quality target development.

    PubMed

    Grantz, Erin; Haggard, Brian; Scott, J Thad

    2018-06-12

    We calculated four median datasets (chlorophyll a, Chl a; total phosphorus, TP; and transparency) using multiple approaches to handling censored observations, including substituting fractions of the quantification limit (QL; dataset 1 = 1QL, dataset 2 = 0.5QL) and statistical methods for censored datasets (datasets 3-4) for approximately 100 Texas, USA reservoirs. Trend analyses of differences between dataset 1 and 3 medians indicated percent difference increased linearly above thresholds in percent censored data (%Cen). This relationship was extrapolated to estimate medians for site-parameter combinations with %Cen > 80%, which were combined with dataset 3 as dataset 4. Changepoint analysis of Chl a- and transparency-TP relationships indicated threshold differences up to 50% between datasets. Recursive analysis identified secondary thresholds in dataset 4. Threshold differences show that information introduced via substitution or missing due to limitations of statistical methods biased values, underestimated error, and inflated the strength of TP thresholds identified in datasets 1-3. Analysis of covariance identified differences in linear regression models relating transparency-TP between datasets 1, 2, and the more statistically robust datasets 3-4. Study findings identify high-risk scenarios for biased analytical outcomes when using substitution. These include high probability of median overestimation when %Cen > 50-60% for a single QL, or when %Cen is as low as 16% for multiple QLs. Changepoint analysis was uniquely vulnerable to substitution effects when using medians from sites with %Cen > 50%. Linear regression analysis was less sensitive to substitution and missing data effects, but differences in model parameters for transparency cannot be discounted and could be magnified by log-transformation of the variables.
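
    The substitution approach described in this abstract can be sketched as follows: censored observations (below the QL) are replaced by a fraction of the QL before the median is taken. The function names and the sample values are hypothetical; the example only illustrates why the median becomes an artifact of the chosen fraction once more than half of the observations are censored, and it does not reproduce the study's statistical methods for datasets 3-4.

```python
from statistics import median


def percent_censored(values):
    """Share of observations reported as below the quantification limit."""
    return 100 * sum(v is None for v in values) / len(values)


def substituted_median(values, ql, fraction):
    """Median after replacing censored values (None) with fraction * QL."""
    filled = [ql * fraction if v is None else v for v in values]
    return median(filled)


# Made-up observations: None marks a value below a single QL of 2.0.
observations = [None, None, None, 4.0, 6.0]
cen = percent_censored(observations)                        # 60% censored
median_1ql = substituted_median(observations, 2.0, 1.0)     # substitute 1 * QL
median_half = substituted_median(observations, 2.0, 0.5)    # substitute 0.5 * QL
```

    With 60% of the values censored, the median equals whichever substituted value was chosen (2.0 with 1QL, 1.0 with 0.5QL), so it carries no information beyond the substitution rule itself.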

  5. EnviroAtlas -- Green Bay, Wisconsin -- One Meter Resolution Urban Land Cover Data (2010)

    EPA Pesticide Factsheets

    The Green Bay, WI one meter-scale urban land cover (LC) dataset comprises 936 km2 around the city of Green Bay, surrounding towns, tribal lands and rural areas in Brown and Outagamie Counties. These leaf-on LC data and maps were derived from 1-m pixel, four-band (red, green, blue, and near-infrared) aerial photography acquired from the United States Department of Agriculture (USDA) National Agriculture Imagery Program (NAIP) on three dates in 2010: July 3, July 25, and August 5. LiDAR data collected on November 18, 2010 was integrated for the Brown County portion. Eight land cover classes were mapped: water, impervious surfaces, soil and barren land, trees and forest, grass and herbaceous non-woody vegetation, agriculture, and wetlands (woody and emergent). Wetlands were copied from the best available existing wetlands data. Analysis of a random sampling of 566 photo-interpreted land cover reference points yielded an overall accuracy of 91.3%. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can b

  6. Research on Zheng Classification Fusing Pulse Parameters in Coronary Heart Disease

    PubMed Central

    Guo, Rui; Wang, Yi-Qin; Xu, Jin; Yan, Hai-Xia; Yan, Jian-Jun; Li, Fu-Feng; Xu, Zhao-Xia; Xu, Wen-Jie

    2013-01-01

    This study was conducted to illustrate that nonlinear dynamic variables of Traditional Chinese Medicine (TCM) pulse can improve the performances of TCM Zheng classification models. Pulse recordings of 334 coronary heart disease (CHD) patients and 117 normal subjects were collected in this study. Recurrence quantification analysis (RQA) was employed to acquire nonlinear dynamic variables of pulse. TCM Zheng models in CHD were constructed, and predictions using a novel multilabel learning algorithm based on different datasets were carried out. Datasets were designed as follows: dataset1, TCM inquiry information including inspection information; dataset2, time-domain variables of pulse and dataset1; dataset3, RQA variables of pulse and dataset1; and dataset4, major principal components of RQA variables and dataset1. The performances of the different models for Zheng differentiation were compared. The model for Zheng differentiation based on RQA variables integrated with inquiry information had the best performance, whereas that based only on inquiry had the worst performance. Meanwhile, the model based on time-domain variables of pulse integrated with inquiry fell between the above two. This result showed that RQA variables of pulse can be used to construct models of TCM Zheng and improve the performance of Zheng differentiation models. PMID:23737839

  7. BubbleGUM: automatic extraction of phenotype molecular signatures and comprehensive visualization of multiple Gene Set Enrichment Analyses.

    PubMed

    Spinelli, Lionel; Carpentier, Sabrina; Montañana Sanchis, Frédéric; Dalod, Marc; Vu Manh, Thien-Phong

    2015-10-19

    Recent advances in the analysis of high-throughput expression data have led to the development of tools that scaled up their focus from single-gene to gene set level. For example, the popular Gene Set Enrichment Analysis (GSEA) algorithm can detect moderate but coordinated expression changes of groups of presumably related genes between pairs of experimental conditions. This considerably improves extraction of information from high-throughput gene expression data. However, although many gene sets covering a large panel of biological fields are available in public databases, the ability to generate home-made gene sets relevant to one's biological question is crucial but remains a substantial challenge to most biologists lacking statistical or bioinformatic expertise. This is all the more the case when attempting to define a gene set specific to one condition compared to many other ones. Thus, there is a crucial need for easy-to-use software for generating relevant home-made gene sets from complex datasets, using them in GSEA, and correcting the results when applied to multiple comparisons of many experimental conditions. We developed BubbleGUM (GSEA Unlimited Map), a tool that automatically extracts molecular signatures from transcriptomic data and performs exhaustive GSEA with multiple testing correction. One original feature of BubbleGUM notably resides in its capacity to integrate and compare numerous GSEA results into an easy-to-grasp graphical representation. We applied our method to generate transcriptomic fingerprints for murine cell types and to assess their enrichments in human cell types. This analysis allowed us to confirm homologies between mouse and human immunocytes. BubbleGUM is open-source software that automatically generates molecular signatures out of complex expression datasets and directly assesses their enrichment by GSEA on independent datasets. Enrichments are displayed in a graphical output that helps in interpreting the results. This innovative methodology has recently been used to answer important questions in functional genomics, such as the degree of similarities between microarray datasets from different laboratories or with different experimental models or clinical cohorts. BubbleGUM is executable through an intuitive interface so that both bioinformaticians and biologists can use it. It is available at http://www.ciml.univ-mrs.fr/applications/BubbleGUM/index.html .

  8. Feasibility of a GNSS-Probe for Creating Digital Maps of High Accuracy and Integrity

    NASA Astrophysics Data System (ADS)

    Vartziotis, Dimitris; Poulis, Alkis; Minogiannis, Alexandros; Siozos, Panayiotis; Goudas, Iraklis; Samson, Jaron; Tossaint, Michel

    The “ROADSCANNER” project addresses the need for increased accuracy and integrity Digital Maps (DM) utilizing the latest developments in GNSS, in order to provide the required datasets for novel applications, such as navigation based Safety Applications, Advanced Driver Assistance Systems (ADAS) and Digital Automotive Simulations. The activity covered in the current paper is the feasibility study, preliminary tests, initial product design and development plan for an EGNOS enabled vehicle probe. The vehicle probe will be used for generating high accuracy, high integrity and ADAS compatible digital maps of roads, employing a multiple passes methodology supported by sophisticated refinement algorithms. Furthermore, the vehicle probe will be equipped with pavement scanning and other data fusion equipment, in order to produce 3D road surface models compatible with standards of road-tire simulation applications. The project was assigned to NIKI Ltd under the 1st Call for Ideas in the frame of the ESA - Greece Task Force.

  9. Identification of Differentially Expressed Genes through Integrated Study of Alzheimer's Disease Affected Brain Regions.

    PubMed

    Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo

    2016-01-01

    Alzheimer's disease (AD) is the most common form of dementia in older adults that damages the brain and results in impaired memory, thinking and behaviour. The identification of differentially expressed genes and related pathways among affected brain regions can provide more information on the mechanisms of AD. In the past decade, several studies have reported many genes that are associated with AD. This wealth of information has become difficult to follow and interpret as most of the results are conflicting. In that case, it is worth doing an integrated study of multiple datasets that helps to increase the total number of samples and the statistical power in detecting biomarkers. In this study, we present an integrated analysis of five different brain region datasets and introduce new genes that warrant further investigation. The aim of our study is to apply a novel combinatorial optimisation based meta-analysis approach to identify differentially expressed genes that are associated with AD across brain regions. In this study, microarray gene expression data from 161 samples (74 non-demented controls, 87 AD) from the Entorhinal Cortex (EC), Hippocampus (HIP), Middle temporal gyrus (MTG), Posterior cingulate cortex (PC), Superior frontal gyrus (SFG) and visual cortex (VCX) brain regions were integrated and analysed using our method. The results are then compared to two popular meta-analysis methods, RankProd and GeneMeta, and to what can be obtained by analysing the individual datasets. We find genes related to AD that are consistent with existing studies, and new candidate genes not previously related to AD. Our study confirms the up-regulation of INFAR2 and PTMA along with the down-regulation of GPHN, RAB2A, PSMD14 and FGF. Novel genes PSMB2, WNK1, RPL15, SEMA4C, RWDD2A and LARGE are found to be differentially expressed across all brain regions. Further investigation on these genes may provide new insights into the development of AD. In addition, we identified the presence of 23 non-coding features, including four miRNA precursors (miR-7, miR570, miR-1229 and miR-6821), dysregulated across the brain regions. Furthermore, we compared our results with two popular meta-analysis methods RankProd and GeneMeta to validate our findings and performed a sensitivity analysis by removing one dataset at a time to assess the robustness of our results. These new findings may provide new insights into the disease mechanisms and thus make a significant contribution in the near future towards understanding, prevention and cure of AD.

  10. Evaporation induced 18O and 13C enrichment in lake systems: A global perspective on hydrologic balance effects

    NASA Astrophysics Data System (ADS)

    Horton, Travis W.; Defliese, William F.; Tripati, Aradhna K.; Oze, Christopher

    2016-01-01

    Growing pressure on sustainable water resource allocation in the context of global development and rapid environmental change demands rigorous knowledge of how regional water cycles change through time. One of the most attractive and widely utilized approaches for gaining this knowledge is the analysis of lake carbonate stable isotopic compositions. However, endogenic carbonate archives are sensitive to a variety of natural processes and conditions leaving isotopic datasets largely underdetermined. As a consequence, isotopic researchers are often required to assume values for multiple parameters, including temperature of carbonate formation or lake water δ18O, in order to interpret changes in hydrologic conditions. Here, we review and analyze a global compilation of 57 lacustrine dual carbon and oxygen stable isotope records with a topical focus on the effects of shifting hydrologic balance on endogenic carbonate isotopic compositions. Through integration of multiple large datasets we show that lake carbonate δ18O values and the lake waters from which they are derived are often shifted by >+10‰ relative to source waters discharging into the lake. The global pattern of δ18O and δ13C covariation observed in >70% of the records studied and in several evaporation experiments demonstrates that isotopic fractionations associated with lake water evaporation cause the heavy carbon and oxygen isotope enrichments observed in most lakes and lake carbonate records. Modeled endogenic calcite compositions in isotopic equilibrium with lake source waters further demonstrate that evaporation effects can be extreme even in lake records where δ18O and δ13C covariation is absent. Aridisol pedogenic carbonates show similar isotopic responses to evaporation, and the relevance of evaporative modification to paleoclimatic and paleotopographic research using endogenic carbonate proxies is discussed. Recent advances in stable isotope research techniques present unprecedented opportunities to overcome the underdetermined nature of stable isotopic data through integration of multiple isotopic proxies, including dual element 13C-excess values and clumped isotope temperature estimates. We demonstrate the utility of applying these multi-proxy approaches to the interpretation of paleohydroclimatic conditions in ancient lake systems. Understanding past, present, and future hydroclimatic systems is a global imperative. Significant progress should be expected as these modern research techniques become more widely applied and integrated with traditional stable isotopic proxies.

  11. A novel virtual hub approach for multisource downstream service integration

    NASA Astrophysics Data System (ADS)

    Previtali, Mattia; Cuca, Branka; Barazzetti, Luigi

    2016-08-01

    A large development of downstream services is expected to be stimulated starting from earth observation (EO) datasets acquired by Copernicus satellites. An important challenge connected with the availability of downstream services is the possibility of integrating them in order to create innovative applications with added value for users of different categories. At the moment, the world of geo-information (GI) is extremely heterogeneous in terms of standards and formats used, thus preventing facilitated access to and integration of downstream services. Indeed, different users and data providers also have different requirements in terms of communication protocols and technology advancement. In recent years, many important programs and initiatives have tried to address this issue even at trans-regional and international level (e.g. INSPIRE Directive, GEOSS, Eye on Earth and SEIS). However, a lack of interoperability between systems and services still exists. In order to facilitate the interaction between different downstream services, a new architectural approach (developed within the European project ENERGIC OD) is proposed in this paper. The brokering-oriented architecture introduces a new mediation layer (the Virtual Hub) which works as an intermediary to bridge the gaps linked to interoperability issues. This intermediation layer de-couples the server and the client, allowing facilitated access to multiple downstream services and also to Open Data provided by national and local SDIs. In particular, in this paper an application is presented integrating four services on the topic of agriculture: (i) the service given by Space4Agri (providing services based on MODIS and Landsat data); (ii) Gicarus Lab (providing sample services based on Landsat datasets); (iii) FRESHMON (providing sample services for water quality); and (iv) services from several regional SDIs.

  12. Terra Populus and DataNet Collaboration

    NASA Astrophysics Data System (ADS)

    Kugler, T.; Ruggles, S.; Fitch, C. A.; Clark, P. D.; Sobek, M.; Van Riper, D.

    2012-12-01

    Terra Populus, part of NSF's new DataNet initiative, is developing organizational and technical infrastructure to integrate, preserve, and disseminate data describing changes in the human population and environment over time. Terra Populus will incorporate large microdata and aggregate census datasets from the United States and around the world, as well as land use, land cover, climate and other environmental datasets. These data are widely dispersed, exist in a variety of data structures, have incompatible or inadequate metadata, and have incompatible geographic identifiers. Terra Populus is developing methods of integrating data from different domains and translating across data structures based on spatio-temporal linkages among data contents. The new infrastructure will enable researchers to identify and merge data from heterogeneous sources to study the relationships between human behavior and the natural world. Terra Populus will partner with data archives, data producers, and data users to create a sustainable international organization that will guarantee preservation and access over multiple decades. Terra Populus is also collaborating with the other projects in the DataNet initiative - DataONE, the DataNet Federation Consortium (DFC) and Sustainable Environment-Actionable Data (SEAD). Taken together, the four projects address aspects of the entire data lifecycle, including planning, collection, documentation, discovery, integration, curation, preservation, and collaboration; and encompass a wide range of disciplines including earth sciences, ecology, social sciences, hydrology, oceanography, and engineering. The four projects are pursuing activities to share data, tools, and expertise between pairs of projects as well as collaborating across the DataNet program on issues of cyberinfrastructure and community engagement. Topics to be addressed through program-wide collaboration include technical, organizational, and financial sustainability; semantic integration; data management training and education; and cross-disciplinary awareness of data resources.

  13. Cross-Cultural Validation of the Patient Perception of Integrated Care Survey.

    PubMed

    Tietschert, Maike V; Angeli, Federica; van Raak, Arno J A; Ruwaard, Dirk; Singer, Sara J

    2017-07-20

    To test the cross-cultural validity of the U.S. Patient Perception of Integrated Care (PPIC) Survey in a Dutch sample using a standardized procedure. Primary data collected from patients of five primary care centers in the south of the Netherlands, through survey research from 2014 to 2015. Cross-sectional data collected from patients who saw multiple health care providers during 6 months preceding data collection. The PPIC survey includes 59 questions that measure patient perceived care integration across providers, settings, and time. Data analysis followed a standardized procedure guiding data preparation, psychometric analysis, and included invariance testing with the U.S. dataset. Latent scale structures of the Dutch and U.S. survey were highly comparable. Factor "Integration with specialist" had lower reliability scores and noninvariance. For the remaining factors, internal consistency and invariance estimates were strong. The standardized cross-cultural validation procedure produced strong support for comparable psychometric characteristics of the Dutch and U.S. surveys. Future research should examine the usability of the proposed procedure for contexts with greater cultural differences. © Health Research and Educational Trust.

  14. Multi -omics and metabolic modelling pipelines: challenges and tools for systems microbiology.

    PubMed

    Fondi, Marco; Liò, Pietro

    2015-02-01

    Integrated -omics approaches are quickly spreading across microbiology research labs, leading to (i) the possibility of detecting previously hidden features of microbial cells like multi-scale spatial organization and (ii) tracing molecular components across multiple cellular functional states. This promises to reduce the knowledge gap between genotype and phenotype and poses new challenges for computational microbiologists. We underline how the capability to unravel the complexity of microbial life will strongly depend on the integration of the huge and diverse amount of information that can be derived today from -omics experiments. In this work, we present opportunities and challenges of multi -omics data integration in current systems biology pipelines. We here discuss which layers of biological information are important for biotechnological and clinical purposes, with a special focus on bacterial metabolism and modelling procedures. A general review of the most recent computational tools for performing large-scale datasets integration is also presented, together with a possible framework to guide the design of systems biology experiments by microbiologists. Copyright © 2015. Published by Elsevier GmbH.

  15. Quality Controlling CMIP datasets at GFDL

    NASA Astrophysics Data System (ADS)

    Horowitz, L. W.; Radhakrishnan, A.; Balaji, V.; Adcroft, A.; Krasting, J. P.; Nikonov, S.; Mason, E. E.; Schweitzer, R.; Nadeau, D.

    2017-12-01

    As GFDL makes the switch from model development to production in light of the Coupled Model Intercomparison Project (CMIP), GFDL's efforts are shifted to testing and, more importantly, establishing guidelines and protocols for quality control and semi-automated data publishing. Every CMIP cycle introduces key challenges and the upcoming CMIP6 is no exception. The new CMIP experimental design comprises multiple MIPs facilitating research in different focus areas. This paradigm has implications not only for the groups that develop the models and conduct the runs, but also for the groups that monitor, analyze and quality control the datasets before data publishing, before their knowledge makes its way into reports like the IPCC (Intergovernmental Panel on Climate Change) Assessment Reports. In this talk, we discuss some of the paths taken at GFDL to quality control the CMIP-ready datasets, including Jupyter notebooks, PrePARE, and a LAMP (Linux, Apache, MySQL, PHP/Python/Perl) technology-driven tracker system to monitor the status of experiments qualitatively and quantitatively and to provide additional metadata and analysis services, along with some built-in controlled-vocabulary validations in the workflow. In addition to this, we also discuss the integration of community-based model evaluation software (ESMValTool, PCMDI Metrics Package, and ILAMB) as part of our CMIP6 workflow.

  16. Integrated Analyses of Cuticular Hydrocarbons, Chromosome and mtDNA in the Neotropical Social Wasp Mischocyttarus consimilis Zikán (Hymenoptera, Vespidae).

    PubMed

    Cunha, D A S; Menezes, R S T; Costa, M A; Lima, S M; Andrade, L H C; Antonialli, W F

    2017-12-01

    In the present work, we explored multiple data from different biological levels such as cuticular hydrocarbons, chromosomal features, and mtDNA sequences in the Neotropical social wasp Mischocyttarus consimilis (J.F. Zikán). Particularly, we explored the genetic and chemical differentiation level within and between populations of this insect. Our dataset revealed shallow intraspecific differentiation in M. consimilis. The similarity among the analyzed samples can probably be due to the geographical proximity where the colonies were sampled, and we argue that Paraná River did not contribute effectively as a historical barrier to this wasp.

  17. Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies.

    PubMed

    Yang, Tsun-Po; Beazley, Claude; Montgomery, Stephen B; Dimas, Antigone S; Gutierrez-Arcelus, Maria; Stranger, Barbara E; Deloukas, Panos; Dermitzakis, Emmanouil T

    2010-10-01

    Genevar (GENe Expression VARiation) is a database and Java tool designed to integrate multiple datasets, and provides analysis and visualization of associations between sequence variation and gene expression. Genevar allows researchers to investigate expression quantitative trait loci (eQTL) associations within a gene locus of interest in real time. The database and application can be installed on a standard computer in database mode and, in addition, on a server to share discoveries among affiliations or the broader community over the Internet via web services protocols. http://www.sanger.ac.uk/resources/software/genevar.

  18. LinkWinds: An Approach to Visual Data Analysis

    NASA Technical Reports Server (NTRS)

    Jacobson, Allan S.

    1992-01-01

    The Linked Windows Interactive Data System (LinkWinds) is a prototype visual data exploration and analysis system resulting from a NASA/JPL program of research into graphical methods for rapidly accessing, displaying and analyzing large multivariate multidisciplinary datasets. It is an integrated multi-application execution environment allowing the dynamic interconnection of multiple windows containing visual displays and/or controls through a data-linking paradigm. This paradigm, which results in a system much like a graphical spreadsheet, is not only a powerful method for organizing large amounts of data for analysis, but provides a highly intuitive, easy to learn user interface on top of the traditional graphical user interface.

  19. An Introduction to MAMA (Meta-Analysis of MicroArray data) System.

    PubMed

    Zhang, Zhe; Fenstermacher, David

    2005-01-01

    Analyzing microarray data across multiple experiments has been proven advantageous. To support this kind of analysis, we are developing a software system called MAMA (Meta-Analysis of MicroArray data). MAMA utilizes a client-server architecture with a relational database on the server-side for the storage of microarray datasets collected from various resources. The client-side is an application running on the end user's computer that allows the user to manipulate microarray data and analytical results locally. MAMA implementation will integrate several analytical methods, including meta-analysis within an open-source framework offering other developers the flexibility to plug in additional statistical algorithms.

  20. Booly: a new data integration platform.

    PubMed

    Do, Long H; Esteves, Francisco F; Karten, Harvey J; Bier, Ethan

    2010-10-13

    Data integration is an escalating problem in bioinformatics. We have developed a web tool and warehousing system, Booly, that features a simple yet flexible data model coupled with the ability to perform powerful comparative analysis, including the use of Boolean logic to merge datasets together, and an integrated aliasing system to decipher differing names of the same gene or protein. Furthermore, Booly features a collaborative sharing system and a public repository so that users can retrieve new datasets while contributors can easily disseminate new content. We illustrate the uses of Booly with several examples including: the versatile creation of homebrew datasets, the integration of heterogeneous data to identify genes useful for comparing avian and mammalian brain architecture, and generation of a list of Food and Drug Administration (FDA) approved drugs with possible alternative disease targets. The Booly paradigm for data storage and analysis should facilitate integration between disparate biological and medical fields and result in novel discoveries that can then be validated experimentally. Booly can be accessed at http://booly.ucsd.edu.
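
    The idea of merging datasets with Boolean logic after resolving aliases can be illustrated with a minimal sketch. This is not Booly's implementation or API: the alias table, function names, and gene lists below are assumptions chosen for illustration, using Python sets as the data model.

```python
# Hypothetical alias table mapping alternative names to one canonical symbol.
ALIASES = {"p53": "TP53", "HER2": "ERBB2"}


def normalize(names):
    """Map each name to its canonical symbol so aliases compare as equal."""
    return {ALIASES.get(name, name) for name in names}


def combine(a, b, op):
    """Merge two identifier sets with a Boolean operator."""
    a, b = normalize(a), normalize(b)
    if op == "AND":
        return a & b   # present in both datasets
    if op == "OR":
        return a | b   # present in either dataset
    if op == "NOT":
        return a - b   # present in the first but not the second
    raise ValueError(f"unsupported operator: {op}")


# Made-up example datasets that name the same genes differently.
dataset_a = {"TP53", "HER2", "BRCA1"}
dataset_b = {"p53", "ERBB2", "MYC"}

shared = combine(dataset_a, dataset_b, "AND")
only_a = combine(dataset_a, dataset_b, "NOT")
either = combine(dataset_a, dataset_b, "OR")
```

    Without the normalization step the intersection of these two sets would be empty, which is the kind of mismatch an aliasing system is meant to prevent.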

  1. Booly: a new data integration platform

    PubMed Central

    2010-01-01

    Background Data integration is an escalating problem in bioinformatics. We have developed a web tool and warehousing system, Booly, that features a simple yet flexible data model coupled with the ability to perform powerful comparative analysis, including the use of Boolean logic to merge datasets together, and an integrated aliasing system to decipher differing names of the same gene or protein. Furthermore, Booly features a collaborative sharing system and a public repository so that users can retrieve new datasets while contributors can easily disseminate new content. Results We illustrate the uses of Booly with several examples including: the versatile creation of homebrew datasets, the integration of heterogeneous data to identify genes useful for comparing avian and mammalian brain architecture, and generation of a list of Food and Drug Administration (FDA) approved drugs with possible alternative disease targets. Conclusions The Booly paradigm for data storage and analysis should facilitate integration between disparate biological and medical fields and result in novel discoveries that can then be validated experimentally. Booly can be accessed at http://booly.ucsd.edu. PMID:20942966

  2. Combining results of multiple search engines in proteomics.

    PubMed

    Shteynberg, David; Nesvizhskii, Alexey I; Moritz, Robert L; Deutsch, Eric W

    2013-09-01

    A crucial component of the analysis of shotgun proteomics datasets is the search engine, an algorithm that attempts to identify the peptide sequence from the parent molecular ion that produced each fragment ion spectrum in the dataset. There are many different search engines, both commercial and open source, each employing a somewhat different technique for spectrum identification. The set of high-scoring peptide-spectrum matches for a defined set of input spectra differs markedly among the various search engine results; individual engines each provide unique correct identifications among a core set of correlative identifications. This has led to the approach of combining the results from multiple search engines to achieve improved analysis of each dataset. Here we review the techniques and available software for combining the results of multiple search engines and briefly compare the relative performance of these techniques.

  3. Combining Results of Multiple Search Engines in Proteomics*

    PubMed Central

    Shteynberg, David; Nesvizhskii, Alexey I.; Moritz, Robert L.; Deutsch, Eric W.

    2013-01-01

    A crucial component of the analysis of shotgun proteomics datasets is the search engine, an algorithm that attempts to identify the peptide sequence from the parent molecular ion that produced each fragment ion spectrum in the dataset. There are many different search engines, both commercial and open source, each employing a somewhat different technique for spectrum identification. The set of high-scoring peptide-spectrum matches for a defined set of input spectra differs markedly among the various search engine results; individual engines each provide unique correct identifications among a core set of correlative identifications. This has led to the approach of combining the results from multiple search engines to achieve improved analysis of each dataset. Here we review the techniques and available software for combining the results of multiple search engines and briefly compare the relative performance of these techniques. PMID:23720762

  4. Extending TOPS: Ontology-driven Anomaly Detection and Analysis System

    NASA Astrophysics Data System (ADS)

    Votava, P.; Nemani, R. R.; Michaelis, A.

    2010-12-01

    Terrestrial Observation and Prediction System (TOPS) is a flexible modeling software system that integrates ecosystem models with frequent satellite and surface weather observations to produce ecosystem nowcasts (assessments of current conditions) and forecasts useful in natural resources management, public health and disaster management. We have been extending the Terrestrial Observation and Prediction System (TOPS) to include a capability for automated anomaly detection and analysis of both on-line (streaming) and off-line data. In order to best capture the knowledge about data hierarchies, Earth science models and implied dependencies between anomalies and occurrences of observable events such as urbanization, deforestation, or fires, we have developed an ontology to serve as a knowledge base. We can query the knowledge base and answer questions about dataset compatibilities, similarities and dependencies so that we can, for example, automatically analyze similar datasets in order to verify a given anomaly occurrence in multiple data sources. We are further extending the system to go beyond anomaly detection towards reasoning about possible causes of anomalies that are also encoded in the knowledge base as either learned or implied knowledge. This enables us to scale up the analysis by eliminating a large number of anomalies early on during the processing by either failure to verify them from other sources, or matching them directly with other observable events without having to perform an extensive and time-consuming exploration and analysis. The knowledge is captured using OWL ontology language, where connections are defined in a schema that is later extended by including specific instances of datasets and models. The information is stored using Sesame server and is accessible through both Java API and web services using SeRQL and SPARQL query languages. Inference is provided using OWLIM component integrated with Sesame.

  5. Brain-CODE: A Secure Neuroinformatics Platform for Management, Federation, Sharing and Analysis of Multi-Dimensional Neuroscience Data.

    PubMed

    Vaccarino, Anthony L; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M; Stuss, Donald T; Theriault, Elizabeth; Evans, Kenneth R

    2018-01-01

    Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute's "Brain-CODE" is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care.

  6. Brain-CODE: A Secure Neuroinformatics Platform for Management, Federation, Sharing and Analysis of Multi-Dimensional Neuroscience Data

    PubMed Central

    Vaccarino, Anthony L.; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R.; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G.; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F. Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M.; Stuss, Donald T.; Theriault, Elizabeth; Evans, Kenneth R.

    2018-01-01

    Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute’s “Brain-CODE” is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care. PMID:29875648

  7. MIDG-Emerging grid technologies for multi-site preclinical molecular imaging research communities.

    PubMed

    Lee, Jasper; Documet, Jorge; Liu, Brent; Park, Ryan; Tank, Archana; Huang, H K

    2011-03-01

    Molecular imaging is the visualization and identification of specific molecules in anatomy for insight into metabolic pathways, tissue consistency, and tracing of solute transport mechanisms. This paper presents the Molecular Imaging Data Grid (MIDG) which utilizes emerging grid technologies in preclinical molecular imaging to facilitate data sharing and discovery between preclinical molecular imaging facilities and their collaborating investigator institutions to expedite translational sciences research. Grid-enabled archiving, management, and distribution of animal-model imaging datasets help preclinical investigators to monitor, access and share their imaging data remotely, and promote preclinical imaging facilities to share published imaging datasets as resources for new investigators. The system architecture of the Molecular Imaging Data Grid is described in a four layer diagram. A data model for preclinical molecular imaging datasets is also presented based on imaging modalities currently used in a molecular imaging center. The MIDG system components and connectivity are presented. Finally, the workflow steps for grid-based archiving, management, and retrieval of preclinical molecular imaging data are described. Initial performance tests of the Molecular Imaging Data Grid system have been conducted at the USC IPILab using dedicated VMware servers. System connectivity, evaluated datasets, and preliminary results are presented. The results show the system's feasibility, limitations, and direction of future research. Translational and interdisciplinary research in medicine is increasingly interested in cellular and molecular biology activity at the preclinical levels, utilizing molecular imaging methods on animal models. The task of integrated archiving, management, and distribution of these preclinical molecular imaging datasets at preclinical molecular imaging facilities is challenging due to disparate imaging systems and multiple off-site investigators. A Molecular Imaging Data Grid design, implementation, and initial evaluation is presented to demonstrate the secure and novel data grid solution for sharing preclinical molecular imaging data across the wide-area-network (WAN).

  8. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Gray, A.

    2014-04-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors. This is likely of particular interest to the radio astronomy community given, for example, that survey projects contain groups dedicated to this topic. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.

  9. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Canadian Astronomy Data Centre

    2014-01-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors, and the local outlier factor. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.
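
    The nearest-neighbor outlier search mentioned in this abstract can be illustrated with a small sketch. It is not the Skytree Server implementation: it uses a brute-force quadratic search rather than the linearly scaling algorithms the abstract describes, and the function name `knn_outlier_scores`, the choice `k=2`, and the five sample points are assumptions made for illustration.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.

    Larger scores mean the point sits farther from its neighbors.
    Brute force, O(n^2): fine for a demo, not for a full survey catalog.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Made-up 2-D sample: four clustered points and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(sample, k=2)
most_isolated = scores.index(max(scores))
print(sample[most_isolated])  # (10, 10)
```

    On catalog-scale data the per-point neighbor search would normally go through a spatial index such as a k-d tree rather than a full scan.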

  10. An efficient and scalable graph modeling approach for capturing information at different levels in next generation sequencing reads

    PubMed Central

    2013-01-01

    Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies. PMID:24564333

  11. Knowledge Discovery from Biomedical Ontologies in Cross Domains.

    PubMed

    Shen, Feichen; Lee, Yugyung

    2016-01-01

    In recent years, there is an increasing demand for sharing and integration of medical data in biomedical research. In order to improve a health care system, it is required to support the integration of data by facilitating semantic interoperability systems and practices. Semantic interoperability is difficult to achieve in these systems as the conceptual models underlying datasets are not fully exploited. In this paper, we propose a semantic framework, called Medical Knowledge Discovery and Data Mining (MedKDD), that aims to build a topic hierarchy and serve the semantic interoperability between different ontologies. For the purpose, we fully focus on the discovery of semantic patterns about the association of relations in the heterogeneous information network representing different types of objects and relationships in multiple biological ontologies and the creation of a topic hierarchy through the analysis of the discovered patterns. These patterns are used to cluster heterogeneous information networks into a set of smaller topic graphs in a hierarchical manner and then to conduct cross domain knowledge discovery from the multiple biological ontologies. Thus, patterns made a greater contribution in the knowledge discovery across multiple ontologies. We have demonstrated the cross domain knowledge discovery in the MedKDD framework using a case study with 9 primary biological ontologies from Bio2RDF and compared it with the cross domain query processing approach, namely SLAP. We have confirmed the effectiveness of the MedKDD framework in knowledge discovery from multiple medical ontologies.

  12. Knowledge Discovery from Biomedical Ontologies in Cross Domains

    PubMed Central

    Shen, Feichen; Lee, Yugyung

    2016-01-01

    In recent years, there is an increasing demand for sharing and integration of medical data in biomedical research. In order to improve a health care system, it is required to support the integration of data by facilitating semantic interoperability systems and practices. Semantic interoperability is difficult to achieve in these systems as the conceptual models underlying datasets are not fully exploited. In this paper, we propose a semantic framework, called Medical Knowledge Discovery and Data Mining (MedKDD), that aims to build a topic hierarchy and serve the semantic interoperability between different ontologies. For the purpose, we fully focus on the discovery of semantic patterns about the association of relations in the heterogeneous information network representing different types of objects and relationships in multiple biological ontologies and the creation of a topic hierarchy through the analysis of the discovered patterns. These patterns are used to cluster heterogeneous information networks into a set of smaller topic graphs in a hierarchical manner and then to conduct cross domain knowledge discovery from the multiple biological ontologies. Thus, patterns made a greater contribution in the knowledge discovery across multiple ontologies. We have demonstrated the cross domain knowledge discovery in the MedKDD framework using a case study with 9 primary biological ontologies from Bio2RDF and compared it with the cross domain query processing approach, namely SLAP. We have confirmed the effectiveness of the MedKDD framework in knowledge discovery from multiple medical ontologies. PMID:27548262

  13. Effect of denoising on supervised lung parenchymal clusters

    NASA Astrophysics Data System (ADS)

    Jayamani, Padmapriya; Raghunath, Sushravya; Rajagopalan, Srinivasan; Karwoski, Ronald A.; Bartholmai, Brian J.; Robb, Richard A.

    2012-03-01

    Denoising is a critical preconditioning step for quantitative analysis of medical images. Despite promises for more consistent diagnosis, denoising techniques are seldom explored in clinical settings. While this may be attributed to the esoteric nature of the parameter-sensitive algorithms, lack of quantitative measures on their efficacy to enhance the clinical decision making is a primary cause of physician apathy. This paper addresses this issue by exploring the effect of denoising on the integrity of supervised lung parenchymal clusters. Multiple Volumes of Interests (VOIs) were selected across multiple high resolution CT scans to represent samples of different patterns (normal, emphysema, ground glass, honey combing and reticular). The VOIs were labeled through consensus of four radiologists. The original datasets were filtered by multiple denoising techniques (median filtering, anisotropic diffusion, bilateral filtering and non-local means) and the corresponding filtered VOIs were extracted. Plurality of cluster indices based on multiple histogram-based pair-wise similarity measures were used to assess the quality of supervised clusters in the original and filtered space. The resultant rank orders were analyzed using the Borda criteria to find the denoising-similarity measure combination that has the best cluster quality. Our exhaustive analysis reveals (a) for a number of similarity measures, the cluster quality is inferior in the filtered space; and (b) for measures that benefit from denoising, a simple median filtering outperforms non-local means and bilateral filtering. Our study suggests the need to judiciously choose, if required, a denoising technique that does not deteriorate the integrity of supervised clusters.
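
    The Borda-style aggregation of rank orders mentioned in this abstract can be sketched as follows. This is a standard Borda count, not necessarily the exact variant the authors used, and the three rankings are made-up illustrative data, not results from the study.

```python
from collections import defaultdict


def borda(rankings):
    """Aggregate several best-to-worst rankings with a Borda count.

    In a ranking of n candidates, the first place earns n - 1 points and
    the last place earns 0. Returns (candidate, score) pairs, best first,
    with ties broken alphabetically.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings, one per cluster-quality index (illustrative only).
rankings = [
    ["median", "none", "bilateral", "nlm"],
    ["none", "median", "nlm", "bilateral"],
    ["median", "bilateral", "none", "nlm"],
]

for candidate, score in borda(rankings):
    print(candidate, score)
# median 8
# none 6
# bilateral 3
# nlm 1
```

    A Borda count rewards candidates that rank consistently well across all indices, so one index with an unusual ordering does not decide the outcome on its own.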

  14. Reconstruction of the experimentally supported human protein interactome: what can we learn?

    PubMed Central

    2013-01-01

    Background Understanding the topology and dynamics of the human protein-protein interaction (PPI) network will significantly contribute to biomedical research, therefore its systematic reconstruction is required. Several meta-databases integrate source PPI datasets, but the protein node sets of their networks vary depending on the PPI data combined. Due to this inherent heterogeneity, the way in which the human PPI network expands via multiple dataset integration has not been comprehensively analyzed. We aim at assembling the human interactome in a global structured way and exploring it to gain insights of biological relevance. Results First, we defined the UniProtKB manually reviewed human “complete” proteome as the reference protein-node set and then we mined five major source PPI datasets for direct PPIs exclusively between the reference proteins. We updated the protein and publication identifiers and normalized all PPIs to the UniProt identifier level. The reconstructed interactome covers approximately 60% of the human proteome and has a scale-free structure. No apparent differentiating gene functional classification characteristics were identified for the unrepresented proteins. The source dataset integration augments the network mainly in PPIs. Polyubiquitin emerged as the highest-degree node, but the inclusion of most of its identified PPIs may be reconsidered. The high number (>300) of connections of the subsequent fifteen proteins correlates well with their essential biological role. According to the power-law network structure, the unrepresented proteins should mainly have up to four connections with equally poorly-connected interactors. Conclusions Reconstructing the human interactome based on the a priori definition of the protein nodes enabled us to identify the currently included part of the human “complete” proteome, and discuss the role of the proteins within the network topology with respect to their function. As the network expansion has to comply with the scale-free theory, we suggest that the core of the human interactome has essentially emerged. Thus, it could be employed in systems biology and biomedical research, despite the considerable number of currently unrepresented proteins. The latter are probably involved in specialized physiological conditions, justifying the scarcity of related PPI information, and their identification can assist in designing relevant functional experiments and targeted text mining algorithms. PMID:24088582
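
    The node-degree analysis described in this abstract (merging source PPI datasets and identifying the highest-degree protein) can be sketched in a few lines. This is not the authors' pipeline: the identifiers `P1` to `P5`, the two source lists, and the function name `node_degrees` are hypothetical placeholders.

```python
from collections import Counter


def node_degrees(interactions):
    """Count distinct interaction partners per protein.

    Interactions are undirected pairs of identifiers. Duplicate pairs,
    in either orientation, count once, and self-interactions are ignored.
    """
    unique_pairs = {frozenset(pair) for pair in interactions if pair[0] != pair[1]}
    degrees = Counter()
    for pair in unique_pairs:
        for protein in pair:
            degrees[protein] += 1
    return degrees


# Two made-up source datasets that partly overlap.
source_a = [("P1", "P2"), ("P1", "P3"), ("P5", "P5")]
source_b = [("P2", "P1"), ("P1", "P4"), ("P3", "P4")]

degrees = node_degrees(source_a + source_b)
print(degrees.most_common(1))  # [('P1', 3)]
```

    Storing each pair as a `frozenset` is what makes the merge orientation-independent, so an interaction reported by two sources is counted only once.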

  15. Search for new phenomena with large jet multiplicities and missing transverse momentum using large-radius jets and flavour-tagging at ATLAS in 13 TeV pp collisions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aaboud, M.; Aad, G.; Abbott, B.

    A search is presented for particles that decay producing a large jet multiplicity and invisible particles. The event selection then applies a veto on the presence of isolated electrons or muons and additional requirements on the number of b-tagged jets and the scalar sum of masses of large-radius jets. In having explored the full ATLAS 2015-2016 dataset of LHC proton-proton collisions at √s=13 TeV, which corresponds to 36.1 fb⁻¹ of integrated luminosity, no evidence is found for physics beyond the Standard Model. The results are interpreted in the context of simplified models inspired by R-parity-conserving and R-parity-violating supersymmetry, where gluinos are pair-produced. More generic models within the phenomenological minimal supersymmetric Standard Model are also considered.

  16. Search for new phenomena with large jet multiplicities and missing transverse momentum using large-radius jets and flavour-tagging at ATLAS in 13 TeV pp collisions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aaboud, M.; Aad, G.; Abbott, B.

    A search is presented for particles that decay producing a large jet multiplicity and invisible particles. The event selection applies a veto on the presence of isolated electrons or muons and additional requirements on the number of b-tagged jets and the scalar sum of masses of large-radius jets. Having explored the full ATLAS 2015-2016 dataset of LHC proton-proton collisions at √s=13 TeV, which corresponds to 36.1 fb⁻¹ of integrated luminosity, no evidence is found for physics beyond the Standard Model. The results are interpreted in the context of simplified models inspired by R-parity-conserving and R-parity-violating supersymmetry, where gluinos are pair-produced. More generic models within the phenomenological minimal supersymmetric Standard Model are also considered.

  17. Search for new phenomena with large jet multiplicities and missing transverse momentum using large-radius jets and flavour-tagging at ATLAS in 13 TeV pp collisions

    DOE PAGES

    Aaboud, M.; Aad, G.; Abbott, B.; ...

    2017-12-06

    A search is presented for particles that decay producing a large jet multiplicity and invisible particles. The event selection applies a veto on the presence of isolated electrons or muons and additional requirements on the number of b-tagged jets and the scalar sum of masses of large-radius jets. Having explored the full ATLAS 2015-2016 dataset of LHC proton-proton collisions at √s=13 TeV, which corresponds to 36.1 fb⁻¹ of integrated luminosity, no evidence is found for physics beyond the Standard Model. The results are interpreted in the context of simplified models inspired by R-parity-conserving and R-parity-violating supersymmetry, where gluinos are pair-produced. More generic models within the phenomenological minimal supersymmetric Standard Model are also considered.

  18. Search for new phenomena with large jet multiplicities and missing transverse momentum using large-radius jets and flavour-tagging at ATLAS in 13 TeV pp collisions

    DOE PAGES

    Aaboud, M.; Aad, G.; Abbott, B.; ...

    2017-12-06

    A search is presented for particles that decay producing a large jet multiplicity and invisible particles. The event selection then applies a veto on the presence of isolated electrons or muons and additional requirements on the number of b-tagged jets and the scalar sum of masses of large-radius jets. In having explored the full ATLAS 2015-2016 dataset of LHC proton-proton collisions at √s=13 TeV, which corresponds to 36.1 fb⁻¹ of integrated luminosity, no evidence is found for physics beyond the Standard Model. The results are interpreted in the context of simplified models inspired by R-parity-conserving and R-parity-violating supersymmetry, where gluinos are pair-produced. More generic models within the phenomenological minimal supersymmetric Standard Model are also considered.

  19. Search for new phenomena with large jet multiplicities and missing transverse momentum using large-radius jets and flavour-tagging at ATLAS in 13 TeV pp collisions

    NASA Astrophysics Data System (ADS)

    Aaboud, M.; Aad, G.; Abbott, B.; Abdinov, O.; Abeloos, B.; Abidi, S. H.; AbouZeid, O. S.; Abraham, N. L.; Abramowicz, H.; Abreu, H.; Abreu, R.; Abulaiti, Y.; Acharya, B. S.; Adachi, S.; Adamczyk, L.; Adelman, J.; Adersberger, M.; Adye, T.; Affolder, A. A.; Afik, Y.; Agatonovic-Jovin, T.; Agheorghiesei, C.; Aguilar-Saavedra, J. A.; Ahlen, S. P.; Ahmadov, F.; Aielli, G.; Akatsuka, S.; Akerstedt, H.; Åkesson, T. P. A.; Akilli, E.; Akimov, A. V.; Alberghi, G. L.; Albert, J.; Albicocco, P.; Alconada Verzini, M. J.; Alderweireldt, S. C.; Aleksa, M.; Aleksandrov, I. N.; Alexa, C.; Alexander, G.; Alexopoulos, T.; Alhroob, M.; Ali, B.; Aliev, M.; Alimonti, G.; Alison, J.; Alkire, S. P.; Allbrooke, B. M. M.; Allen, B. W.; Allport, P. P.; Aloisio, A.; Alonso, A.; Alonso, F.; Alpigiani, C.; Alshehri, A. A.; Alstaty, M. I.; Alvarez Gonzalez, B.; Álvarez Piqueras, D.; Alviggi, M. G.; Amadio, B. T.; Amaral Coutinho, Y.; Amelung, C.; Amidei, D.; Amor Dos Santos, S. P.; Amoroso, S.; Amundsen, G.; Anastopoulos, C.; Ancu, L. S.; Andari, N.; Andeen, T.; Anders, C. F.; Anders, J. K.; Anderson, K. J.; Andreazza, A.; Andrei, V.; Angelidakis, S.; Angelozzi, I.; Angerami, A.; Anisenkov, A. V.; Anjos, N.; Annovi, A.; Antel, C.; Antonelli, M.; Antonov, A.; Antrim, D. J.; Anulli, F.; Aoki, M.; Aperio Bella, L.; Arabidze, G.; Arai, Y.; Araque, J. P.; Araujo Ferraz, V.; Arce, A. T. H.; Ardell, R. E.; Arduh, F. A.; Arguin, J.-F.; Argyropoulos, S.; Arik, M.; Armbruster, A. J.; Armitage, L. J.; Arnaez, O.; Arnold, H.; Arratia, M.; Arslan, O.; Artamonov, A.; Artoni, G.; Artz, S.; Asai, S.; Asbah, N.; Ashkenazi, A.; Asquith, L.; Assamagan, K.; Astalos, R.; Atkinson, M.; Atlay, N. B.; Augsten, K.; Avolio, G.; Axen, B.; Ayoub, M. K.; Azuelos, G.; Baas, A. E.; Baca, M. J.; Bachacou, H.; Bachas, K.; Backes, M.; Bagnaia, P.; Bahmani, M.; Bahrasemani, H.; Baines, J. T.; Bajic, M.; Baker, O. K.; Baldin, E. M.; Balek, P.; Balli, F.; Balunas, W. 
K.; Banas, E.; Bandyopadhyay, A.; Banerjee, Sw.; Bannoura, A. A. E.; Barak, L.; Barberio, E. L.; Barberis, D.; Barbero, M.; Barillari, T.; Barisits, M.-S.; Barkeloo, J. T.; Barklow, T.; Barlow, N.; Barnes, S. L.; Barnett, B. M.; Barnett, R. M.; Barnovska-Blenessy, Z.; Baroncelli, A.; Barone, G.; Barr, A. J.; Barranco Navarro, L.; Barreiro, F.; Barreiro Guimarães da Costa, J.; Bartoldus, R.; Barton, A. E.; Bartos, P.; Basalaev, A.; Bassalat, A.; Bates, R. L.; Batista, S. J.; Batley, J. R.; Battaglia, M.; Bauce, M.; Bauer, F.; Bawa, H. S.; Beacham, J. B.; Beattie, M. D.; Beau, T.; Beauchemin, P. H.; Bechtle, P.; Beck, H. P.; Beck, H. C.; Becker, K.; Becker, M.; Becot, C.; Beddall, A. J.; Beddall, A.; Bednyakov, V. A.; Bedognetti, M.; Bee, C. P.; Beermann, T. A.; Begalli, M.; Begel, M.; Behr, J. K.; Bell, A. S.; Bella, G.; Bellagamba, L.; Bellerive, A.; Bellomo, M.; Belotskiy, K.; Beltramello, O.; Belyaev, N. L.; Benary, O.; Benchekroun, D.; Bender, M.; Bendtz, K.; Benekos, N.; Benhammou, Y.; Benhar Noccioli, E.; Benitez, J.; Benjamin, D. P.; Benoit, M.; Bensinger, J. R.; Bentvelsen, S.; Beresford, L.; Beretta, M.; Berge, D.; Bergeaas Kuutmann, E.; Berger, N.; Beringer, J.; Berlendis, S.; Bernard, N. R.; Bernardi, G.; Bernius, C.; Bernlochner, F. U.; Berry, T.; Berta, P.; Bertella, C.; Bertoli, G.; Bertolucci, F.; Bertram, I. A.; Bertsche, C.; Bertsche, D.; Besjes, G. J.; Bessidskaia Bylund, O.; Bessner, M.; Besson, N.; Bethani, A.; Bethke, S.; Bevan, A. J.; Beyer, J.; Bianchi, R. M.; Biebel, O.; Biedermann, D.; Bielski, R.; Bierwagen, K.; Biesuz, N. V.; Biglietti, M.; Billoud, T. R. V.; Bilokon, H.; Bindi, M.; Bingul, A.; Bini, C.; Biondi, S.; Bisanz, T.; Bittrich, C.; Bjergaard, D. M.; Black, J. E.; Black, K. M.; Blair, R. E.; Blazek, T.; Bloch, I.; Blocker, C.; Blue, A.; Blum, W.; Blumenschein, U.; Blunier, S.; Bobbink, G. J.; Bobrovnikov, V. S.; Bocchetta, S. S.; Bocci, A.; Bock, C.; Boehler, M.; Boerner, D.; Bogavac, D.; Bogdanchikov, A. 
G.; Bohm, C.; Boisvert, V.; Bokan, P.; Bold, T.; Boldyrev, A. S.; Bolz, A. E.; Bomben, M.; Bona, M.; Boonekamp, M.; Borisov, A.; Borissov, G.; Bortfeldt, J.; Bortoletto, D.; Bortolotto, V.; Boscherini, D.; Bosman, M.; Bossio Sola, J. D.; Boudreau, J.; Bouffard, J.; Bouhova-Thacker, E. V.; Boumediene, D.; Bourdarios, C.; Boutle, S. K.; Boveia, A.; Boyd, J.; Boyko, I. R.; Bracinik, J.; Brandt, A.; Brandt, G.; Brandt, O.; Bratzler, U.; Brau, B.; Brau, J. E.; Breaden Madden, W. D.; Brendlinger, K.; Brennan, A. J.; Brenner, L.; Brenner, R.; Bressler, S.; Briglin, D. L.; Bristow, T. M.; Britton, D.; Britzger, D.; Brochu, F. M.; Brock, I.; Brock, R.; Brooijmans, G.; Brooks, T.; Brooks, W. K.; Brosamer, J.; Brost, E.; Broughton, J. H.; Bruckman de Renstrom, P. A.; Bruncko, D.; Bruni, A.; Bruni, G.; Bruni, L. S.; Bruno, S.; Brunt, BH; Bruschi, M.; Bruscino, N.; Bryant, P.; Bryngemark, L.; Buanes, T.; Buat, Q.; Buchholz, P.; Buckley, A. G.; Budagov, I. A.; Buehrer, F.; Bugge, M. K.; Bulekov, O.; Bullock, D.; Burch, T. J.; Burdin, S.; Burgard, C. D.; Burger, A. M.; Burghgrave, B.; Burka, K.; Burke, S.; Burmeister, I.; Burr, J. T. P.; Busato, E.; Büscher, D.; Büscher, V.; Bussey, P.; Butler, J. M.; Buttar, C. M.; Butterworth, J. M.; Butti, P.; Buttinger, W.; Buzatu, A.; Buzykaev, A. R.; Cabrera Urbán, S.; Caforio, D.; Cairo, V. M.; Cakir, O.; Calace, N.; Calafiura, P.; Calandri, A.; Calderini, G.; Calfayan, P.; Callea, G.; Caloba, L. P.; Calvente Lopez, S.; Calvet, D.; Calvet, S.; Calvet, T. P.; Camacho Toro, R.; Camarda, S.; Camarri, P.; Cameron, D.; Caminal Armadans, R.; Camincher, C.; Campana, S.; Campanelli, M.; Camplani, A.; Campoverde, A.; Canale, V.; Cano Bret, M.; Cantero, J.; Cao, T.; Capeans Garrido, M. D. M.; Caprini, I.; Caprini, M.; Capua, M.; Carbone, R. M.; Cardarelli, R.; Cardillo, F.; Carli, I.; Carli, T.; Carlino, G.; Carlson, B. T.; Carminati, L.; Carney, R. M. D.; Caron, S.; Carquin, E.; Carrá, S.; Carrillo-Montoya, G. D.; Casadei, D.; Casado, M. 
P.; Casolino, M.; Casper, D. W.; Castelijn, R.; Castillo Gimenez, V.; Castro, N. F.; Catinaccio, A.; Catmore, J. R.; Cattai, A.; Caudron, J.; Cavaliere, V.; Cavallaro, E.; Cavalli, D.; Cavalli-Sforza, M.; Cavasinni, V.; Celebi, E.; Ceradini, F.; Cerda Alberich, L.; Cerqueira, A. S.; Cerri, A.; Cerrito, L.; Cerutti, F.; Cervelli, A.; Cetin, S. A.; Chafaq, A.; Chakraborty, D.; Chan, S. K.; Chan, W. S.; Chan, Y. L.; Chang, P.; Chapman, J. D.; Charlton, D. G.; Chau, C. C.; Chavez Barajas, C. A.; Che, S.; Cheatham, S.; Chegwidden, A.; Chekanov, S.; Chekulaev, S. V.; Chelkov, G. A.; Chelstowska, M. A.; Chen, C.; Chen, C.; Chen, H.; Chen, J.; Chen, S.; Chen, S.; Chen, X.; Chen, Y.; Cheng, H. C.; Cheng, H. J.; Cheplakov, A.; Cheremushkina, E.; Cherkaoui El Moursli, R.; Cheu, E.; Cheung, K.; Chevalier, L.; Chiarella, V.; Chiarelli, G.; Chiodini, G.; Chisholm, A. S.; Chitan, A.; Chiu, Y. H.; Chizhov, M. V.; Choi, K.; Chomont, A. R.; Chouridou, S.; Chow, Y. S.; Christodoulou, V.; Chu, M. C.; Chudoba, J.; Chuinard, A. J.; Chwastowski, J. J.; Chytka, L.; Ciftci, A. K.; Cinca, D.; Cindro, V.; Cioara, I. A.; Ciocca, C.; Ciocio, A.; Cirotto, F.; Citron, Z. H.; Citterio, M.; Ciubancan, M.; Clark, A.; Clark, B. L.; Clark, M. R.; Clark, P. J.; Clarke, R. N.; Clement, C.; Coadou, Y.; Cobal, M.; Coccaro, A.; Cochran, J.; Colasurdo, L.; Cole, B.; Colijn, A. P.; Collot, J.; Colombo, T.; Conde Muiño, P.; Coniavitis, E.; Connell, S. H.; Connelly, I. A.; Constantinescu, S.; Conti, G.; Conventi, F.; Cooke, M.; Cooper-Sarkar, A. M.; Cormier, F.; Cormier, K. J. R.; Corradi, M.; Corriveau, F.; Cortes-Gonzalez, A.; Cortiana, G.; Costa, G.; Costa, M. J.; Costanzo, D.; Cottin, G.; Cowan, G.; Cox, B. E.; Cranmer, K.; Crawley, S. J.; Creager, R. A.; Cree, G.; Crépé-Renaudin, S.; Crescioli, F.; Cribbs, W. A.; Cristinziani, M.; Croft, V.; Crosetti, G.; Cueto, A.; Cuhadar Donszelmann, T.; Cukierman, A. 
R.; Cummings, J.; Curatolo, M.; Cúth, J.; Czekierda, S.; Czodrowski, P.; D'amen, G.; D'Auria, S.; D'eramo, L.; D'Onofrio, M.; Da Cunha Sargedas De Sousa, M. J.; Da Via, C.; Dabrowski, W.; Dado, T.; Dai, T.; Dale, O.; Dallaire, F.; Dallapiccola, C.; Dam, M.; Dandoy, J. R.; Daneri, M. F.; Dang, N. P.; Daniells, A. C.; Dann, N. S.; Danninger, M.; Dano Hoffmann, M.; Dao, V.; Darbo, G.; Darmora, S.; Dassoulas, J.; Dattagupta, A.; Daubney, T.; Davey, W.; David, C.; Davidek, T.; Davis, D. R.; Davison, P.; Dawe, E.; Dawson, I.; De, K.; de Asmundis, R.; De Benedetti, A.; De Castro, S.; De Cecco, S.; De Groot, N.; de Jong, P.; De la Torre, H.; De Lorenzi, F.; De Maria, A.; De Pedis, D.; De Salvo, A.; De Sanctis, U.; De Santo, A.; De Vasconcelos Corga, K.; De Vivie De Regie, J. B.; Debbe, R.; Debenedetti, C.; Dedovich, D. V.; Dehghanian, N.; Deigaard, I.; Del Gaudio, M.; Del Peso, J.; Delgove, D.; Deliot, F.; Delitzsch, C. M.; Dell'Acqua, A.; Dell'Asta, L.; Dell'Orso, M.; Della Pietra, M.; della Volpe, D.; Delmastro, M.; Delporte, C.; Delsart, P. A.; DeMarco, D. A.; Demers, S.; Demichev, M.; Demilly, A.; Denisov, S. P.; Denysiuk, D.; Derendarz, D.; Derkaoui, J. E.; Derue, F.; Dervan, P.; Desch, K.; Deterre, C.; Dette, K.; Devesa, M. R.; Deviveiros, P. O.; Dewhurst, A.; Dhaliwal, S.; Di Bello, F. A.; Di Ciaccio, A.; Di Ciaccio, L.; Di Clemente, W. K.; Di Donato, C.; Di Girolamo, A.; Di Girolamo, B.; Di Micco, B.; Di Nardo, R.; Di Petrillo, K. F.; Di Simone, A.; Di Sipio, R.; Di Valentino, D.; Diaconu, C.; Diamond, M.; Dias, F. A.; Diaz, M. A.; Diehl, E. B.; Dietrich, J.; Díez Cornell, S.; Dimitrievska, A.; Dingfelder, J.; Dita, P.; Dita, S.; Dittus, F.; Djama, F.; Djobava, T.; Djuvsland, J. I.; do Vale, M. A. B.; Dobos, D.; Dobre, M.; Doglioni, C.; Dolejsi, J.; Dolezal, Z.; Donadelli, M.; Donati, S.; Dondero, P.; Donini, J.; Dopke, J.; Doria, A.; Dova, M. T.; Doyle, A. 
T.; Drechsler, E.; Dris, M.; Du, Y.; Duarte-Campderros, J.; Dubreuil, A.; Duchovni, E.; Duckeck, G.; Ducourthial, A.; Ducu, O. A.; Duda, D.; Dudarev, A.; Dudder, A. Chr.; Duffield, E. M.; Duflot, L.; Dührssen, M.; Dumancic, M.; Dumitriu, A. E.; Duncan, A. K.; Dunford, M.; Duran Yildiz, H.; Düren, M.; Durglishvili, A.; Duschinger, D.; Dutta, B.; Duvnjak, D.; Dyndal, M.; Dziedzic, B. S.; Eckardt, C.; Ecker, K. M.; Edgar, R. C.; Eifert, T.; Eigen, G.; Einsweiler, K.; Ekelof, T.; El Kacimi, M.; El Kosseifi, R.; Ellajosyula, V.; Ellert, M.; Elles, S.; Ellinghaus, F.; Elliot, A. A.; Ellis, N.; Elmsheuser, J.; Elsing, M.; Emeliyanov, D.; Enari, Y.; Endner, O. C.; Ennis, J. S.; Erdmann, J.; Ereditato, A.; Ernst, M.; Errede, S.; Escalier, M.; Escobar, C.; Esposito, B.; Estrada Pastor, O.; Etienvre, A. I.; Etzion, E.; Evans, H.; Ezhilov, A.; Ezzi, M.; Fabbri, F.; Fabbri, L.; Fabiani, V.; Facini, G.; Fakhrutdinov, R. M.; Falciano, S.; Falla, R. J.; Faltova, J.; Fang, Y.; Fanti, M.; Farbin, A.; Farilla, A.; Farina, C.; Farina, E. M.; Farooque, T.; Farrell, S.; Farrington, S. M.; Farthouat, P.; Fassi, F.; Fassnacht, P.; Fassouliotis, D.; Faucci Giannelli, M.; Favareto, A.; Fawcett, W. J.; Fayard, L.; Fedin, O. L.; Fedorko, W.; Feigl, S.; Feligioni, L.; Feng, C.; Feng, E. J.; Feng, H.; Fenton, M. J.; Fenyuk, A. B.; Feremenga, L.; Fernandez Martinez, P.; Fernandez Perez, S.; Ferrando, J.; Ferrari, A.; Ferrari, P.; Ferrari, R.; Ferreira de Lima, D. E.; Ferrer, A.; Ferrere, D.; Ferretti, C.; Fiedler, F.; Filipčič, A.; Filipuzzi, M.; Filthaut, F.; Fincke-Keeler, M.; Finelli, K. D.; Fiolhais, M. C. N.; Fiorini, L.; Fischer, A.; Fischer, C.; Fischer, J.; Fisher, W. C.; Flaschel, N.; Fleck, I.; Fleischmann, P.; Fletcher, R. R. M.; Flick, T.; Flierl, B. M.; Flores Castillo, L. R.; Flowerdew, M. J.; Forcolin, G. T.; Formica, A.; Förster, F. A.; Forti, A.; Foster, A. 
G.; Fournier, D.; Fox, H.; Fracchia, S.; Francavilla, P.; Franchini, M.; Franchino, S.; Francis, D.; Franconi, L.; Franklin, M.; Frate, M.; Fraternali, M.; Freeborn, D.; Fressard-Batraneanu, S. M.; Freund, B.; Froidevaux, D.; Frost, J. A.; Fukunaga, C.; Fusayasu, T.; Fuster, J.; Gabaldon, C.; Gabizon, O.; Gabrielli, A.; Gabrielli, A.; Gach, G. P.; Gadatsch, S.; Gadomski, S.; Gagliardi, G.; Gagnon, L. G.; Galea, C.; Galhardo, B.; Gallas, E. J.; Gallop, B. J.; Gallus, P.; Galster, G.; Gan, K. K.; Ganguly, S.; Gao, Y.; Gao, Y. S.; Garay Walls, F. M.; García, C.; García Navarro, J. E.; García Pascual, J. A.; Garcia-Sciveres, M.; Gardner, R. W.; Garelli, N.; Garonne, V.; Gascon Bravo, A.; Gasnikova, K.; Gatti, C.; Gaudiello, A.; Gaudio, G.; Gavrilenko, I. L.; Gay, C.; Gaycken, G.; Gazis, E. N.; Gee, C. N. P.; Geisen, J.; Geisen, M.; Geisler, M. P.; Gellerstedt, K.; Gemme, C.; Genest, M. H.; Geng, C.; Gentile, S.; Gentsos, C.; George, S.; Gerbaudo, D.; Gershon, A.; Geßner, G.; Ghasemi, S.; Ghneimat, M.; Giacobbe, B.; Giagu, S.; Giangiacomi, N.; Giannetti, P.; Gibson, S. M.; Gignac, M.; Gilchriese, M.; Gillberg, D.; Gilles, G.; Gingrich, D. M.; Giordani, M. P.; Giorgi, F. M.; Giraud, P. F.; Giromini, P.; Giugliarelli, G.; Giugni, D.; Giuli, F.; Giuliani, C.; Giulini, M.; Gjelsten, B. K.; Gkaitatzis, S.; Gkialas, I.; Gkougkousis, E. L.; Gkountoumis, P.; Gladilin, L. K.; Glasman, C.; Glatzer, J.; Glaysher, P. C. F.; Glazov, A.; Goblirsch-Kolb, M.; Godlewski, J.; Goldfarb, S.; Golling, T.; Golubkov, D.; Gomes, A.; Gonçalo, R.; Goncalves Gama, R.; Goncalves Pinto Firmino Da Costa, J.; Gonella, G.; Gonella, L.; Gongadze, A.; González de la Hoz, S.; Gonzalez-Sevilla, S.; Goossens, L.; Gorbounov, P. A.; Gordon, H. A.; Gorelov, I.; Gorini, B.; Gorini, E.; Gorišek, A.; Goshaw, A. T.; Gössling, C.; Gostkin, M. I.; Gottardo, C. A.; Goudet, C. R.; Goujdami, D.; Goussiou, A. G.; Govender, N.; Gozani, E.; Graber, L.; Grabowska-Bold, I.; Gradin, P. O. 
J.; Gramling, J.; Gramstad, E.; Grancagnolo, S.; Gratchev, V.; Gravila, P. M.; Gray, C.; Gray, H. M.; Greenwood, Z. D.; Grefe, C.; Gregersen, K.; Gregor, I. M.; Grenier, P.; Grevtsov, K.; Griffiths, J.; Grillo, A. A.; Grimm, K.; Grinstein, S.; Gris, Ph.; Grivaz, J.-F.; Groh, S.; Gross, E.; Grosse-Knetter, J.; Grossi, G. C.; Grout, Z. J.; Grummer, A.; Guan, L.; Guan, W.; Guenther, J.; Guescini, F.; Guest, D.; Gueta, O.; Gui, B.; Guido, E.; Guillemin, T.; Guindon, S.; Gul, U.; Gumpert, C.; Guo, J.; Guo, W.; Guo, Y.; Gupta, R.; Gupta, S.; Gustavino, G.; Gutelman, B. J.; Gutierrez, P.; Gutierrez Ortiz, N. G.; Gutschow, C.; Guyot, C.; Guzik, M. P.; Gwenlan, C.; Gwilliam, C. B.; Haas, A.; Haber, C.; Hadavand, H. K.; Haddad, N.; Hadef, A.; Hageböck, S.; Hagihara, M.; Hakobyan, H.; Haleem, M.; Haley, J.; Halladjian, G.; Hallewell, G. D.; Hamacher, K.; Hamal, P.; Hamano, K.; Hamilton, A.; Hamity, G. N.; Hamnett, P. G.; Han, L.; Han, S.; Hanagaki, K.; Hanawa, K.; Hance, M.; Haney, B.; Hanke, P.; Hansen, J. B.; Hansen, J. D.; Hansen, M. C.; Hansen, P. H.; Hara, K.; Hard, A. S.; Harenberg, T.; Hariri, F.; Harkusha, S.; Harrison, P. F.; Hartmann, N. M.; Hasegawa, Y.; Hasib, A.; Hassani, S.; Haug, S.; Hauser, R.; Hauswald, L.; Havener, L. B.; Havranek, M.; Hawkes, C. M.; Hawkings, R. J.; Hayakawa, D.; Hayden, D.; Hays, C. P.; Hays, J. M.; Hayward, H. S.; Haywood, S. J.; Head, S. J.; Heck, T.; Hedberg, V.; Heelan, L.; Heer, S.; Heidegger, K. K.; Heim, S.; Heim, T.; Heinemann, B.; Heinrich, J. J.; Heinrich, L.; Heinz, C.; Hejbal, J.; Helary, L.; Held, A.; Hellman, S.; Helsens, C.; Henderson, R. C. W.; Heng, Y.; Henkelmann, S.; Henriques Correia, A. M.; Henrot-Versille, S.; Herbert, G. H.; Herde, H.; Herget, V.; Hernández Jiménez, Y.; Herr, H.; Herten, G.; Hertenberger, R.; Hervas, L.; Herwig, T. C.; Hesketh, G. G.; Hessey, N. P.; Hetherly, J. W.; Higashino, S.; Higón-Rodriguez, E.; Hildebrand, K.; Hill, E.; Hill, J. C.; Hiller, K. H.; Hillier, S. 
J.; Hils, M.; Hinchliffe, I.; Hirose, M.; Hirschbuehl, D.; Hiti, B.; Hladik, O.; Hoad, X.; Hobbs, J.; Hod, N.; Hodgkinson, M. C.; Hodgson, P.; Hoecker, A.; Hoeferkamp, M. R.; Hoenig, F.; Hohn, D.; Holmes, T. R.; Homann, M.; Honda, S.; Honda, T.; Hong, T. M.; Hooberman, B. H.; Hopkins, W. H.; Horii, Y.; Horton, A. J.; Hostachy, J.-Y.; Hou, S.; Hoummada, A.; Howarth, J.; Hoya, J.; Hrabovsky, M.; Hrdinka, J.; Hristova, I.; Hrivnac, J.; Hryn'ova, T.; Hrynevich, A.; Hsu, P. J.; Hsu, S.-C.; Hu, Q.; Hu, S.; Huang, Y.; Hubacek, Z.; Hubaut, F.; Huegging, F.; Huffman, T. B.; Hughes, E. W.; Hughes, G.; Huhtinen, M.; Huo, P.; Huseynov, N.; Huston, J.; Huth, J.; Iacobucci, G.; Iakovidis, G.; Ibragimov, I.; Iconomidou-Fayard, L.; Idrissi, Z.; Iengo, P.; Igonkina, O.; Iizawa, T.; Ikegami, Y.; Ikeno, M.; Ilchenko, Y.; Iliadis, D.; Ilic, N.; Introzzi, G.; Ioannou, P.; Iodice, M.; Iordanidou, K.; Ippolito, V.; Isacson, M. F.; Ishijima, N.; Ishino, M.; Ishitsuka, M.; Issever, C.; Istin, S.; Ito, F.; Iturbe Ponce, J. M.; Iuppa, R.; Iwasaki, H.; Izen, J. M.; Izzo, V.; Jabbar, S.; Jackson, P.; Jacobs, R. M.; Jain, V.; Jakobi, K. B.; Jakobs, K.; Jakobsen, S.; Jakoubek, T.; Jamin, D. O.; Jana, D. K.; Jansky, R.; Janssen, J.; Janus, M.; Janus, P. A.; Jarlskog, G.; Javadov, N.; Javůrek, T.; Javurkova, M.; Jeanneau, F.; Jeanty, L.; Jejelava, J.; Jelinskas, A.; Jenni, P.; Jeske, C.; Jézéquel, S.; Ji, H.; Jia, J.; Jiang, H.; Jiang, Y.; Jiang, Z.; Jiggins, S.; Jimenez Pena, J.; Jin, S.; Jinaru, A.; Jinnouchi, O.; Jivan, H.; Johansson, P.; Johns, K. A.; Johnson, C. A.; Johnson, W. J.; Jon-And, K.; Jones, R. W. L.; Jones, S. D.; Jones, S.; Jones, T. J.; Jongmanns, J.; Jorge, P. M.; Jovicevic, J.; Ju, X.; Juste Rozas, A.; Köhler, M. K.; Kaczmarska, A.; Kado, M.; Kagan, H.; Kagan, M.; Kahn, S. J.; Kaji, T.; Kajomovitz, E.; Kalderon, C. W.; Kaluza, A.; Kama, S.; Kamenshchikov, A.; Kanaya, N.; Kanjir, L.; Kantserov, V. A.; Kanzaki, J.; Kaplan, B.; Kaplan, L. 
S.; Kar, D.; Karakostas, K.; Karastathis, N.; Kareem, M. J.; Karentzos, E.; Karpov, S. N.; Karpova, Z. M.; Karthik, K.; Kartvelishvili, V.; Karyukhin, A. N.; Kasahara, K.; Kashif, L.; Kass, R. D.; Kastanas, A.; Kataoka, Y.; Kato, C.; Katre, A.; Katzy, J.; Kawade, K.; Kawagoe, K.; Kawamoto, T.; Kawamura, G.; Kay, E. F.; Kazanin, V. F.; Keeler, R.; Kehoe, R.; Keller, J. S.; Kellermann, E.; Kempster, J. J.; Kendrick, J.; Keoshkerian, H.; Kepka, O.; Kerševan, B. P.; Kersten, S.; Keyes, R. A.; Khader, M.; Khalil-zada, F.; Khanov, A.; Kharlamov, A. G.; Kharlamova, T.; Khodinov, A.; Khoo, T. J.; Khovanskiy, V.; Khramov, E.; Khubua, J.; Kido, S.; Kilby, C. R.; Kim, H. Y.; Kim, S. H.; Kim, Y. K.; Kimura, N.; Kind, O. M.; King, B. T.; Kirchmeier, D.; Kirk, J.; Kiryunin, A. E.; Kishimoto, T.; Kisielewska, D.; Kitali, V.; Kivernyk, O.; Kladiva, E.; Klapdor-Kleingrothaus, T.; Klein, M. H.; Klein, M.; Klein, U.; Kleinknecht, K.; Klimek, P.; Klimentov, A.; Klingenberg, R.; Klingl, T.; Klioutchnikova, T.; Kluge, E.-E.; Kluit, P.; Kluth, S.; Kneringer, E.; Knoops, E. B. F. G.; Knue, A.; Kobayashi, A.; Kobayashi, D.; Kobayashi, T.; Kobel, M.; Kocian, M.; Kodys, P.; Koffas, T.; Koffeman, E.; Köhler, N. M.; Koi, T.; Kolb, M.; Koletsou, I.; Komar, A. A.; Kondo, T.; Kondrashova, N.; Köneke, K.; König, A. C.; Kono, T.; Konoplich, R.; Konstantinidis, N.; Kopeliansky, R.; Koperny, S.; Kopp, A. K.; Korcyl, K.; Kordas, K.; Korn, A.; Korol, A. A.; Korolkov, I.; Korolkova, E. V.; Kortner, O.; Kortner, S.; Kosek, T.; Kostyukhin, V. V.; Kotwal, A.; Koulouris, A.; Kourkoumeli-Charalampidi, A.; Kourkoumelis, C.; Kourlitis, E.; Kouskoura, V.; Kowalewska, A. B.; Kowalewski, R.; Kowalski, T. Z.; Kozakai, C.; Kozanecki, W.; Kozhin, A. S.; Kramarenko, V. A.; Kramberger, G.; Krasnopevtsev, D.; Krasny, M. W.; Krasznahorkay, A.; Krauss, D.; Kremer, J. 
A.; Kretzschmar, J.; Kreutzfeldt, K.; Krieger, P.; Krizka, K.; Kroeninger, K.; Kroha, H.; Kroll, J.; Kroll, J.; Kroseberg, J.; Krstic, J.; Kruchonak, U.; Krüger, H.; Krumnack, N.; Kruse, M. C.; Kubota, T.; Kucuk, H.; Kuday, S.; Kuechler, J. T.; Kuehn, S.; Kugel, A.; Kuger, F.; Kuhl, T.; Kukhtin, V.; Kukla, R.; Kulchitsky, Y.; Kuleshov, S.; Kulinich, Y. P.; Kuna, M.; Kunigo, T.; Kupco, A.; Kupfer, T.; Kuprash, O.; Kurashige, H.; Kurchaninov, L. L.; Kurochkin, Y. A.; Kurth, M. G.; Kus, V.; Kuwertz, E. S.; Kuze, M.; Kvita, J.; Kwan, T.; Kyriazopoulos, D.; La Rosa, A.; La Rosa Navarro, J. L.; La Rotonda, L.; La Ruffa, F.; Lacasta, C.; Lacava, F.; Lacey, J.; Lack, D. P. J.; Lacker, H.; Lacour, D.; Ladygin, E.; Lafaye, R.; Laforge, B.; Lagouri, T.; Lai, S.; Lammers, S.; Lampl, W.; Lançon, E.; Landgraf, U.; Landon, M. P. J.; Lanfermann, M. C.; Lang, V. S.; Lange, J. C.; Langenberg, R. J.; Lankford, A. J.; Lanni, F.; Lantzsch, K.; Lanza, A.; Lapertosa, A.; Laplace, S.; Laporte, J. F.; Lari, T.; Lasagni Manghi, F.; Lassnig, M.; Lau, T. S.; Laurelli, P.; Lavrijsen, W.; Law, A. T.; Laycock, P.; Lazovich, T.; Lazzaroni, M.; Le, B.; Le Dortz, O.; Le Guirriec, E.; Le Quilleuc, E. P.; LeBlanc, M.; LeCompte, T.; Ledroit-Guillon, F.; Lee, C. A.; Lee, G. R.; Lee, S. C.; Lee, L.; Lefebvre, B.; Lefebvre, G.; Lefebvre, M.; Legger, F.; Leggett, C.; Lehmann Miotto, G.; Lei, X.; Leight, W. A.; Leite, M. A. L.; Leitner, R.; Lellouch, D.; Lemmer, B.; Leney, K. J. C.; Lenz, T.; Lenzi, B.; Leone, R.; Leone, S.; Leonidopoulos, C.; Lerner, G.; Leroy, C.; Lesage, A. A. J.; Lester, C. G.; Levchenko, M.; Levêque, J.; Levin, D.; Levinson, L. J.; Levy, M.; Lewis, D.; Li, B.; Li, Changqiao; Li, H.; Li, L.; Li, Q.; Li, Q.; Li, S.; Li, X.; Li, Y.; Liang, Z.; Liberti, B.; Liblong, A.; Lie, K.; Liebal, J.; Liebig, W.; Limosani, A.; Lin, S. C.; Lin, T. H.; Linck, R. A.; Lindquist, B. E.; Lionti, A. E.; Lipeles, E.; Lipniacka, A.; Lisovyi, M.; Liss, T. M.; Lister, A.; Litke, A. 
M.; Liu, B.; Liu, H.; Liu, H.; Liu, J. K. K.; Liu, J.; Liu, J. B.; Liu, K.; Liu, L.; Liu, M.; Liu, Y. L.; Liu, Y.; Livan, M.; Lleres, A.; Llorente Merino, J.; Lloyd, S. L.; Lo, C. Y.; Lo Sterzo, F.; Lobodzinska, E. M.; Loch, P.; Loebinger, F. K.; Loesle, A.; Loew, K. M.; Loginov, A.; Lohse, T.; Lohwasser, K.; Lokajicek, M.; Long, B. A.; Long, J. D.; Long, R. E.; Longo, L.; Looper, K. A.; Lopez, J. A.; Lopez Mateos, D.; Lopez Paz, I.; Lopez Solis, A.; Lorenz, J.; Lorenzo Martinez, N.; Losada, M.; Lösel, P. J.; Lou, X.; Lounis, A.; Love, J.; Love, P. A.; Lu, H.; Lu, N.; Lu, Y. J.; Lubatti, H. J.; Luci, C.; Lucotte, A.; Luedtke, C.; Luehring, F.; Lukas, W.; Luminari, L.; Lundberg, O.; Lund-Jensen, B.; Lutz, M. S.; Luzi, P. M.; Lynn, D.; Lysak, R.; Lytken, E.; Lyu, F.; Lyubushkin, V.; Ma, H.; Ma, L. L.; Ma, Y.; Maccarrone, G.; Macchiolo, A.; Macdonald, C. M.; Maček, B.; Machado Miguens, J.; Madaffari, D.; Madar, R.; Mader, W. F.; Madsen, A.; Maeda, J.; Maeland, S.; Maeno, T.; Maevskiy, A. S.; Magerl, V.; Mahlstedt, J.; Maiani, C.; Maidantchik, C.; Maier, A. A.; Maier, T.; Maio, A.; Majersky, O.; Majewski, S.; Makida, Y.; Makovec, N.; Malaescu, B.; Malecki, Pa.; Maleev, V. P.; Malek, F.; Mallik, U.; Malon, D.; Malone, C.; Maltezos, S.; Malyukov, S.; Mamuzic, J.; Mancini, G.; Mandić, I.; Maneira, J.; Manhaes de Andrade Filho, L.; Manjarres Ramos, J.; Mankinen, K. H.; Mann, A.; Manousos, A.; Mansoulie, B.; Mansour, J. D.; Mantifel, R.; Mantoani, M.; Manzoni, S.; Mapelli, L.; Marceca, G.; March, L.; Marchese, L.; Marchiori, G.; Marcisovsky, M.; Marin Tobon, C. A.; Marjanovic, M.; Marley, D. E.; Marroquim, F.; Marsden, S. P.; Marshall, Z.; Martensson, M. U. F.; Marti-Garcia, S.; Martin, C. B.; Martin, T. A.; Martin, V. J.; Martin dit Latour, B.; Martinez, M.; Martinez Outschoorn, V. I.; Martin-Haugh, S.; Martoiu, V. S.; Martyniuk, A. C.; Marzin, A.; Masetti, L.; Mashimo, T.; Mashinistov, R.; Masik, J.; Maslennikov, A. 
L.; Massa, L.; Mastrandrea, P.; Mastroberardino, A.; Masubuchi, T.; Mättig, P.; Maurer, J.; Maxfield, S. J.; Maximov, D. A.; Mazini, R.; Maznas, I.; Mazza, S. M.; Mc Fadden, N. C.; Mc Goldrick, G.; Mc Kee, S. P.; McCarn, A.; McCarthy, R. L.; McCarthy, T. G.; McClymont, L. I.; McDonald, E. F.; Mcfayden, J. A.; Mchedlidze, G.; McMahon, S. J.; McNamara, P. C.; McNicol, C. J.; McPherson, R. A.; Meehan, S.; Megy, T. J.; Mehlhase, S.; Mehta, A.; Meideck, T.; Meier, K.; Meirose, B.; Melini, D.; Mellado Garcia, B. R.; Mellenthin, J. D.; Melo, M.; Meloni, F.; Melzer, A.; Menary, S. B.; Meng, L.; Meng, X. T.; Mengarelli, A.; Menke, S.; Meoni, E.; Mergelmeyer, S.; Merlassino, C.; Mermod, P.; Merola, L.; Meroni, C.; Merritt, F. S.; Messina, A.; Metcalfe, J.; Mete, A. S.; Meyer, C.; Meyer, J.-P.; Meyer, J.; Meyer Zu Theenhausen, H.; Miano, F.; Middleton, R. P.; Miglioranzi, S.; Mijović, L.; Mikenberg, G.; Mikestikova, M.; Mikuž, M.; Milesi, M.; Milic, A.; Millar, D. A.; Miller, D. W.; Mills, C.; Milov, A.; Milstead, D. A.; Minaenko, A. A.; Minami, Y.; Minashvili, I. A.; Mincer, A. I.; Mindur, B.; Mineev, M.; Minegishi, Y.; Ming, Y.; Mir, L. M.; Mistry, K. P.; Mitani, T.; Mitrevski, J.; Mitsou, V. A.; Miucci, A.; Miyagawa, P. S.; Mizukami, A.; Mjörnmark, J. U.; Mkrtchyan, T.; Mlynarikova, M.; Moa, T.; Mochizuki, K.; Mogg, P.; Mohapatra, S.; Molander, S.; Moles-Valls, R.; Mondragon, M. C.; Mönig, K.; Monk, J.; Monnier, E.; Montalbano, A.; Montejo Berlingen, J.; Monticelli, F.; Monzani, S.; Moore, R. W.; Morange, N.; Moreno, D.; Moreno Llácer, M.; Morettini, P.; Morgenstern, S.; Mori, D.; Mori, T.; Morii, M.; Morinaga, M.; Morisbak, V.; Morley, A. K.; Mornacchi, G.; Morris, J. D.; Morvaj, L.; Moschovakos, P.; Mosidze, M.; Moss, H. J.; Moss, J.; Motohashi, K.; Mount, R.; Mountricha, E.; Moyse, E. J. W.; Muanza, S.; Mueller, F.; Mueller, J.; Mueller, R. S. P.; Muenstermann, D.; Mullen, P.; Mullier, G. A.; Munoz Sanchez, F. J.; Murray, W. J.; Musheghyan, H.; Muškinja, M.; Myagkov, A. 
G.; Myska, M.; Nachman, B. P.; Nackenhorst, O.; Nagai, K.; Nagai, R.; Nagano, K.; Nagasaka, Y.; Nagata, K.; Nagel, M.; Nagy, E.; Nairz, A. M.; Nakahama, Y.; Nakamura, K.; Nakamura, T.; Nakano, I.; Naranjo Garcia, R. F.; Narayan, R.; Narrias Villar, D. I.; Naryshkin, I.; Naumann, T.; Navarro, G.; Nayyar, R.; Neal, H. A.; Nechaeva, P. Yu.; Neep, T. J.; Negri, A.; Negrini, M.; Nektarijevic, S.; Nellist, C.; Nelson, A.; Nelson, M. E.; Nemecek, S.; Nemethy, P.; Nessi, M.; Neubauer, M. S.; Neumann, M.; Newman, P. R.; Ng, T. Y.; Nguyen Manh, T.; Nickerson, R. B.; Nicolaidou, R.; Nielsen, J.; Nikolaenko, V.; Nikolic-Audit, I.; Nikolopoulos, K.; Nilsen, J. K.; Nilsson, P.; Ninomiya, Y.; Nisati, A.; Nishu, N.; Nisius, R.; Nitsche, I.; Nitta, T.; Nobe, T.; Noguchi, Y.; Nomachi, M.; Nomidis, I.; Nomura, M. A.; Nooney, T.; Nordberg, M.; Norjoharuddeen, N.; Novgorodova, O.; Nozaki, M.; Nozka, L.; Ntekas, K.; Nurse, E.; Nuti, F.; O'connor, K.; O'Neil, D. C.; O'Rourke, A. A.; O'Shea, V.; Oakham, F. G.; Oberlack, H.; Obermann, T.; Ocariz, J.; Ochi, A.; Ochoa, I.; Ochoa-Ricoux, J. P.; Oda, S.; Odaka, S.; Oh, A.; Oh, S. H.; Ohm, C. C.; Ohman, H.; Oide, H.; Okawa, H.; Okumura, Y.; Okuyama, T.; Olariu, A.; Oleiro Seabra, L. F.; Olivares Pino, S. A.; Oliveira Damazio, D.; Olszewski, A.; Olszowska, J.; Onofre, A.; Onogi, K.; Onyisi, P. U. E.; Oppen, H.; Oreglia, M. J.; Oren, Y.; Orestano, D.; Orlando, N.; Orr, R. S.; Osculati, B.; Ospanov, R.; Otero y Garzon, G.; Otono, H.; Ouchrif, M.; Ould-Saada, F.; Ouraou, A.; Oussoren, K. P.; Ouyang, Q.; Owen, M.; Owen, R. E.; Ozcan, V. E.; Ozturk, N.; Pachal, K.; Pacheco Pages, A.; Pacheco Rodriguez, L.; Padilla Aranda, C.; Pagan Griso, S.; Paganini, M.; Paige, F.; Palacino, G.; Palazzo, S.; Palestini, S.; Palka, M.; Pallin, D.; Panagiotopoulou, E. St.; Panagoulias, I.; Pandini, C. E.; Panduro Vazquez, J. G.; Pani, P.; Panitkin, S.; Pantea, D.; Paolozzi, L.; Papadopoulou, Th. D.; Papageorgiou, K.; Paramonov, A.; Paredes Hernandez, D.; Parker, A. 
J.; Parker, M. A.; Parker, K. A.; Parodi, F.; Parsons, J. A.; Parzefall, U.; Pascuzzi, V. R.; Pasner, J. M.; Pasqualucci, E.; Passaggio, S.; Pastore, Fr.; Pataraia, S.; Pater, J. R.; Pauly, T.; Pearson, B.; Pedraza Lopez, S.; Pedro, R.; Peleganchuk, S. V.; Penc, O.; Peng, C.; Peng, H.; Penwell, J.; Peralva, B. S.; Perego, M. M.; Perepelitsa, D. V.; Peri, F.; Perini, L.; Pernegger, H.; Perrella, S.; Peschke, R.; Peshekhonov, V. D.; Peters, K.; Peters, R. F. Y.; Petersen, B. A.; Petersen, T. C.; Petit, E.; Petridis, A.; Petridou, C.; Petroff, P.; Petrolo, E.; Petrov, M.; Petrucci, F.; Pettersson, N. E.; Peyaud, A.; Pezoa, R.; Phillips, F. H.; Phillips, P. W.; Piacquadio, G.; Pianori, E.; Picazio, A.; Piccaro, E.; Pickering, M. A.; Piegaia, R.; Pilcher, J. E.; Pilkington, A. D.; Pin, A. W. J.; Pinamonti, M.; Pinfold, J. L.; Pirumov, H.; Pitt, M.; Plazak, L.; Pleier, M.-A.; Pleskot, V.; Plotnikova, E.; Pluth, D.; Podberezko, P.; Poettgen, R.; Poggi, R.; Poggioli, L.; Pogrebnyak, I.; Pohl, D.; Polesello, G.; Poley, A.; Policicchio, A.; Polifka, R.; Polini, A.; Pollard, C. S.; Polychronakos, V.; Pommès, K.; Ponomarenko, D.; Pontecorvo, L.; Popeneciu, G. A.; Pospisil, S.; Potamianos, K.; Potrap, I. N.; Potter, C. J.; Potti, H.; Poulsen, T.; Poveda, J.; Pozo Astigarraga, M. E.; Pralavorio, P.; Pranko, A.; Prell, S.; Price, D.; Primavera, M.; Prince, S.; Proklova, N.; Prokofiev, K.; Prokoshin, F.; Protopopescu, S.; Proudfoot, J.; Przybycien, M.; Puri, A.; Puzo, P.; Qian, J.; Qin, G.; Qin, Y.; Quadt, A.; Queitsch-Maitland, M.; Quilty, D.; Raddum, S.; Radeka, V.; Radescu, V.; Radhakrishnan, S. K.; Radloff, P.; Rados, P.; Ragusa, F.; Rahal, G.; Raine, J. A.; Rajagopalan, S.; Rangel-Smith, C.; Rashid, T.; Raspopov, S.; Ratti, M. G.; Rauch, D. M.; Rauscher, F.; Rave, S.; Ravinovich, I.; Rawling, J. H.; Raymond, M.; Read, A. L.; Readioff, N. P.; Reale, M.; Rebuzzi, D. M.; Redelbach, A.; Redlinger, G.; Reece, R.; Reed, R. 
G.; Reeves, K.; Rehnisch, L.; Reichert, J.; Reiss, A.; Rembser, C.; Ren, H.; Rescigno, M.; Resconi, S.; Resseguie, E. D.; Rettie, S.; Reynolds, E.; Rezanova, O. L.; Reznicek, P.; Rezvani, R.; Richter, R.; Richter, S.; Richter-Was, E.; Ricken, O.; Ridel, M.; Rieck, P.; Riegel, C. J.; Rieger, J.; Rifki, O.; Rijssenbeek, M.; Rimoldi, A.; Rimoldi, M.; Rinaldi, L.; Ripellino, G.; Ristić, B.; Ritsch, E.; Riu, I.; Rizatdinova, F.; Rizvi, E.; Rizzi, C.; Roberts, R. T.; Robertson, S. H.; Robichaud-Veronneau, A.; Robinson, D.; Robinson, J. E. M.; Robson, A.; Rocco, E.; Roda, C.; Rodina, Y.; Rodriguez Bosca, S.; Rodriguez Perez, A.; Rodriguez Rodriguez, D.; Roe, S.; Rogan, C. S.; Røhne, O.; Roloff, J.; Romaniouk, A.; Romano, M.; Romano Saez, S. M.; Romero Adam, E.; Rompotis, N.; Ronzani, M.; Roos, L.; Rosati, S.; Rosbach, K.; Rose, P.; Rosien, N.-A.; Rossi, E.; Rossi, L. P.; Rosten, J. H. N.; Rosten, R.; Rotaru, M.; Rothberg, J.; Rousseau, D.; Rozanov, A.; Rozen, Y.; Ruan, X.; Rubbo, F.; Rühr, F.; Ruiz-Martinez, A.; Rurikova, Z.; Rusakovich, N. A.; Russell, H. L.; Rutherfoord, J. P.; Ruthmann, N.; Ryabov, Y. F.; Rybar, M.; Rybkin, G.; Ryu, S.; Ryzhov, A.; Rzehorz, G. F.; Saavedra, A. F.; Sabato, G.; Sacerdoti, S.; Sadrozinski, H. F.-W.; Sadykov, R.; Safai Tehrani, F.; Saha, P.; Sahinsoy, M.; Saimpert, M.; Saito, M.; Saito, T.; Sakamoto, H.; Sakurai, Y.; Salamanna, G.; Salazar Loyola, J. E.; Salek, D.; Sales De Bruin, P. H.; Salihagic, D.; Salnikov, A.; Salt, J.; Salvatore, D.; Salvatore, F.; Salvucci, A.; Salzburger, A.; Sammel, D.; Sampsonidis, D.; Sampsonidou, D.; Sánchez, J.; Sanchez Martinez, V.; Sanchez Pineda, A.; Sandaker, H.; Sandbach, R. L.; Sander, C. O.; Sandhoff, M.; Sandoval, C.; Sankey, D. P. C.; Sannino, M.; Sano, Y.; Sansoni, A.; Santoni, C.; Santos, H.; Santoyo Castillo, I.; Sapronov, A.; Saraiva, J. 
G.; Sarrazin, B.; Sasaki, O.; Sato, K.; Sauvan, E.; Savage, G.; Savard, P.; Savic, N.; Sawyer, C.; Sawyer, L.; Saxon, J.; Sbarra, C.; Sbrizzi, A.; Scanlon, T.; Scannicchio, D. A.; Schaarschmidt, J.; Schacht, P.; Schachtner, B. M.; Schaefer, D.; Schaefer, L.; Schaefer, R.; Schaeffer, J.; Schaepe, S.; Schaetzel, S.; Schäfer, U.; Schaffer, A. C.; Schaile, D.; Schamberger, R. D.; Schegelsky, V. A.; Scheirich, D.; Schernau, M.; Schiavi, C.; Schier, S.; Schildgen, L. K.; Schillo, C.; Schioppa, M.; Schlenker, S.; Schmidt-Sommerfeld, K. R.; Schmieden, K.; Schmitt, C.; Schmitt, S.; Schmitz, S.; Schnoor, U.; Schoeffel, L.; Schoening, A.; Schoenrock, B. D.; Schopf, E.; Schott, M.; Schouwenberg, J. F. P.; Schovancova, J.; Schramm, S.; Schuh, N.; Schulte, A.; Schultens, M. J.; Schultz-Coulon, H.-C.; Schulz, H.; Schumacher, M.; Schumm, B. A.; Schune, Ph.; Schwartzman, A.; Schwarz, T. A.; Schweiger, H.; Schwemling, Ph.; Schwienhorst, R.; Schwindling, J.; Sciandra, A.; Sciolla, G.; Scornajenghi, M.; Scuri, F.; Scutti, F.; Searcy, J.; Seema, P.; Seidel, S. C.; Seiden, A.; Seixas, J. M.; Sekhniaidze, G.; Sekhon, K.; Sekula, S. J.; Semprini-Cesari, N.; Senkin, S.; Serfon, C.; Serin, L.; Serkin, L.; Sessa, M.; Seuster, R.; Severini, H.; Sfiligoj, T.; Sforza, F.; Sfyrla, A.; Shabalina, E.; Shaikh, N. W.; Shan, L. Y.; Shang, R.; Shank, J. T.; Shapiro, M.; Shatalov, P. B.; Shaw, K.; Shaw, S. M.; Shcherbakova, A.; Shehu, C. Y.; Shen, Y.; Sherafati, N.; Sherwood, P.; Shi, L.; Shimizu, S.; Shimmin, C. O.; Shimojima, M.; Shipsey, I. P. J.; Shirabe, S.; Shiyakova, M.; Shlomi, J.; Shmeleva, A.; Shoaleh Saadi, D.; Shochet, M. J.; Shojaii, S.; Shope, D. R.; Shrestha, S.; Shulga, E.; Shupe, M. A.; Sicho, P.; Sickles, A. M.; Sidebo, P. E.; Sideras Haddad, E.; Sidiropoulou, O.; Sidoti, A.; Siegert, F.; Sijacki, Dj.; Silva, J.; Silverstein, S. B.; Simak, V.; Simic, Lj.; Simion, S.; Simioni, E.; Simmons, B.; Simon, M.; Sinervo, P.; Sinev, N. B.; Sioli, M.; Siragusa, G.; Siral, I.; Sivoklokov, S. 
Yu.; Sjölin, J.; Skinner, M. B.; Skubic, P.; Slater, M.; Slavicek, T.; Slawinska, M.; Sliwa, K.; Slovak, R.; Smakhtin, V.; Smart, B. H.; Smiesko, J.; Smirnov, N.; Smirnov, S. Yu.; Smirnov, Y.; Smirnova, L. N.; Smirnova, O.; Smith, J. W.; Smith, M. N. K.; Smith, R. W.; Smizanska, M.; Smolek, K.; Snesarev, A. A.; Snyder, I. M.; Snyder, S.; Sobie, R.; Socher, F.; Soffer, A.; Søgaard, A.; Soh, D. A.; Sokhrannyi, G.; Solans Sanchez, C. A.; Solar, M.; Soldatov, E. Yu.; Soldevila, U.; Solodkov, A. A.; Soloshenko, A.; Solovyanov, O. V.; Solovyev, V.; Sommer, P.; Son, H.; Sopczak, A.; Sosa, D.; Sotiropoulou, C. L.; Soualah, R.; Soukharev, A. M.; South, D.; Sowden, B. C.; Spagnolo, S.; Spalla, M.; Spangenberg, M.; Spanò, F.; Sperlich, D.; Spettel, F.; Spieker, T. M.; Spighi, R.; Spigo, G.; Spiller, L. A.; Spousta, M.; Denis, R. D. St.; Stabile, A.; Stamen, R.; Stamm, S.; Stanecka, E.; Stanek, R. W.; Stanescu, C.; Stanitzki, M. M.; Stapf, B. S.; Stapnes, S.; Starchenko, E. A.; Stark, G. H.; Stark, J.; Stark, S. H.; Staroba, P.; Starovoitov, P.; Stärz, S.; Staszewski, R.; Stegler, M.; Steinberg, P.; Stelzer, B.; Stelzer, H. J.; Stelzer-Chilton, O.; Stenzel, H.; Stewart, G. A.; Stockton, M. C.; Stoebe, M.; Stoicea, G.; Stolte, P.; Stonjek, S.; Stradling, A. R.; Straessner, A.; Stramaglia, M. E.; Strandberg, J.; Strandberg, S.; Strauss, M.; Strizenec, P.; Ströhmer, R.; Strom, D. M.; Stroynowski, R.; Strubig, A.; Stucci, S. A.; Stugu, B.; Styles, N. A.; Su, D.; Su, J.; Suchek, S.; Sugaya, Y.; Suk, M.; Sulin, V. V.; Sultan, DMS; Sultansoy, S.; Sumida, T.; Sun, S.; Sun, X.; Suruliz, K.; Suster, C. J. E.; Sutton, M. R.; Suzuki, S.; Svatos, M.; Swiatlowski, M.; Swift, S. P.; Sykora, I.; Sykora, T.; Ta, D.; Tackmann, K.; Taenzer, J.; Taffard, A.; Tafirout, R.; Tahirovic, E.; Taiblum, N.; Takai, H.; Takashima, R.; Takasugi, E. H.; Takeshita, T.; Takubo, Y.; Talby, M.; Talyshev, A. A.; Tanaka, J.; Tanaka, M.; Tanaka, R.; Tanaka, S.; Tanioka, R.; Tannenwald, B. 
B.; Tapia Araya, S.; Tapprogge, S.; Tarem, S.; Tartarelli, G. F.; Tas, P.; Tasevsky, M.; Tashiro, T.; Tassi, E.; Tavares Delgado, A.; Tayalati, Y.; Taylor, A. C.; Taylor, A. J.; Taylor, G. N.; Taylor, P. T. E.; Taylor, W.; Teixeira-Dias, P.; Temple, D.; Ten Kate, H.; Teng, P. K.; Teoh, J. J.; Tepel, F.; Terada, S.; Terashi, K.; Terron, J.; Terzo, S.; Testa, M.; Teuscher, R. J.; Theveneaux-Pelzer, T.; Thiele, F.; Thomas, J. P.; Thomas-Wilsker, J.; Thompson, P. D.; Thompson, A. S.; Thomsen, L. A.; Thomson, E.; Tibbetts, M. J.; Ticse Torres, R. E.; Tikhomirov, V. O.; Tikhonov, Yu. A.; Timoshenko, S.; Tipton, P.; Tisserant, S.; Todome, K.; Todorova-Nova, S.; Todt, S.; Tojo, J.; Tokár, S.; Tokushuku, K.; Tolley, E.; Tomlinson, L.; Tomoto, M.; Tompkins, L.; Toms, K.; Tong, B.; Tornambe, P.; Torrence, E.; Torres, H.; Torró Pastor, E.; Toth, J.; Touchard, F.; Tovey, D. R.; Treado, C. J.; Trefzger, T.; Tresoldi, F.; Tricoli, A.; Trigger, I. M.; Trincaz-Duvoid, S.; Tripiana, M. F.; Trischuk, W.; Trocmé, B.; Trofymov, A.; Troncon, C.; Trottier-McDonald, M.; Trovatelli, M.; Truong, L.; Trzebinski, M.; Trzupek, A.; Tsang, K. W.; Tseng, J. C.-L.; Tsiareshka, P. V.; Tsipolitis, G.; Tsirintanis, N.; Tsiskaridze, S.; Tsiskaridze, V.; Tskhadadze, E. G.; Tsui, K. M.; Tsukerman, I. I.; Tsulaia, V.; Tsuno, S.; Tsybychev, D.; Tu, Y.; Tudorache, A.; Tudorache, V.; Tulbure, T. T.; Tuna, A. N.; Tupputi, S. A.; Turchikhin, S.; Turgeman, D.; Turk Cakir, I.; Turra, R.; Tuts, P. M.; Ucchielli, G.; Ueda, I.; Ughetto, M.; Ukegawa, F.; Unal, G.; Undrus, A.; Unel, G.; Ungaro, F. C.; Unno, Y.; Unverdorben, C.; Urban, J.; Urquijo, P.; Urrejola, P.; Usai, G.; Usui, J.; Vacavant, L.; Vacek, V.; Vachon, B.; Vadla, K. O. H.; Vaidya, A.; Valderanis, C.; Valdes Santurio, E.; Valente, M.; Valentinetti, S.; Valero, A.; Valéry, L.; Valkar, S.; Vallier, A.; Valls Ferrer, J. A.; Van Den Wollenberg, W.; van der Graaf, H.; van Gemmeren, P.; Van Nieuwkoop, J.; van Vulpen, I.; van Woerden, M. 
C.; Vanadia, M.; Vandelli, W.; Vaniachine, A.; Vankov, P.; Vardanyan, G.; Vari, R.; Varnes, E. W.; Varni, C.; Varol, T.; Varouchas, D.; Vartapetian, A.; Varvell, K. E.; Vasquez, J. G.; Vasquez, G. A.; Vazeille, F.; Vazquez Furelos, D.; Vazquez Schroeder, T.; Veatch, J.; Veeraraghavan, V.; Veloce, L. M.; Veloso, F.; Veneziano, S.; Ventura, A.; Venturi, M.; Venturi, N.; Venturini, A.; Vercesi, V.; Verducci, M.; Verkerke, W.; Vermeulen, A. T.; Vermeulen, J. C.; Vetterli, M. C.; Viaux Maira, N.; Viazlo, O.; Vichou, I.; Vickey, T.; Vickey Boeriu, O. E.; Viehhauser, G. H. A.; Viel, S.; Vigani, L.; Villa, M.; Villaplana Perez, M.; Vilucchi, E.; Vincter, M. G.; Vinogradov, V. B.; Vishwakarma, A.; Vittori, C.; Vivarelli, I.; Vlachos, S.; Vogel, M.; Vokac, P.; Volpi, G.; von der Schmitt, H.; von Toerne, E.; Vorobel, V.; Vorobev, K.; Vos, M.; Voss, R.; Vossebeld, J. H.; Vranjes, N.; Vranjes Milosavljevic, M.; Vrba, V.; Vreeswijk, M.; Vuillermet, R.; Vukotic, I.; Wagner, P.; Wagner, W.; Wagner-Kuhr, J.; Wahlberg, H.; Wahrmund, S.; Walder, J.; Walker, R.; Walkowiak, W.; Wallangen, V.; Wang, C.; Wang, C.; Wang, F.; Wang, H.; Wang, H.; Wang, J.; Wang, J.; Wang, Q.; Wang, R.; Wang, S. M.; Wang, T.; Wang, W.; Wang, W.; Wang, Z.; Wanotayaroj, C.; Warburton, A.; Ward, C. P.; Wardrope, D. R.; Washbrook, A.; Watkins, P. M.; Watson, A. T.; Watson, M. F.; Watts, G.; Watts, S.; Waugh, B. M.; Webb, A. F.; Webb, S.; Weber, M. S.; Weber, S. W.; Weber, S. A.; Webster, J. S.; Weidberg, A. R.; Weinert, B.; Weingarten, J.; Weirich, M.; Weiser, C.; Weits, H.; Wells, P. S.; Wenaus, T.; Wengler, T.; Wenig, S.; Wermes, N.; Werner, M. D.; Werner, P.; Wessels, M.; Weston, T. D.; Whalen, K.; Whallon, N. L.; Wharton, A. M.; White, A. S.; White, A.; White, M. J.; White, R.; Whiteson, D.; Whitmore, B. W.; Wickens, F. J.; Wiedenmann, W.; Wielers, M.; Wiglesworth, C.; Wiik-Fuchs, L. A. M.; Wildauer, A.; Wilk, F.; Wilkens, H. G.; Williams, H. H.; Williams, S.; Willis, C.; Willocq, S.; Wilson, J. 
A.; Wingerter-Seez, I.; Winkels, E.; Winklmeier, F.; Winston, O. J.; Winter, B. T.; Wittgen, M.; Wobisch, M.; Wolf, T. M. H.; Wolff, R.; Wolter, M. W.; Wolters, H.; Wong, V. W. S.; Worm, S. D.; Wosiek, B. K.; Wotschack, J.; Wozniak, K. W.; Wu, M.; Wu, S. L.; Wu, X.; Wu, Y.; Wyatt, T. R.; Wynne, B. M.; Xella, S.; Xi, Z.; Xia, L.; Xu, D.; Xu, L.; Xu, T.; Yabsley, B.; Yacoob, S.; Yamaguchi, D.; Yamaguchi, Y.; Yamamoto, A.; Yamamoto, S.; Yamanaka, T.; Yamane, F.; Yamatani, M.; Yamazaki, Y.; Yan, Z.; Yang, H.; Yang, H.; Yang, Y.; Yang, Z.; Yao, W.-M.; Yap, Y. C.; Yasu, Y.; Yatsenko, E.; Yau Wong, K. H.; Ye, J.; Ye, S.; Yeletskikh, I.; Yigitbasi, E.; Yildirim, E.; Yorita, K.; Yoshihara, K.; Young, C.; Young, C. J. S.; Yu, J.; Yu, J.; Yuen, S. P. Y.; Yusuff, I.; Zabinski, B.; Zacharis, G.; Zaidan, R.; Zaitsev, A. M.; Zakharchuk, N.; Zalieckas, J.; Zaman, A.; Zambito, S.; Zanzi, D.; Zeitnitz, C.; Zemaityte, G.; Zemla, A.; Zeng, J. C.; Zeng, Q.; Zenin, O.; Ženiš, T.; Zerwas, D.; Zhang, D.; Zhang, F.; Zhang, G.; Zhang, H.; Zhang, J.; Zhang, L.; Zhang, L.; Zhang, M.; Zhang, P.; Zhang, R.; Zhang, R.; Zhang, X.; Zhang, Y.; Zhang, Z.; Zhao, X.; Zhao, Y.; Zhao, Z.; Zhemchugov, A.; Zhou, B.; Zhou, C.; Zhou, L.; Zhou, M.; Zhou, M.; Zhou, N.; Zhu, C. G.; Zhu, H.; Zhu, J.; Zhu, Y.; Zhuang, X.; Zhukov, K.; Zibell, A.; Zieminska, D.; Zimine, N. I.; Zimmermann, C.; Zimmermann, S.; Zinonos, Z.; Zinser, M.; Ziolkowski, M.; Živković, L.; Zobernig, G.; Zoccoli, A.; Zou, R.; zur Nedden, M.; Zwalinski, L.

    2017-12-01

    A search is presented for particles that decay producing a large jet multiplicity and invisible particles. The event selection applies a veto on the presence of isolated electrons or muons and additional requirements on the number of b-tagged jets and the scalar sum of masses of large-radius jets. Having explored the full ATLAS 2015-2016 dataset of LHC proton-proton collisions at √s = 13 TeV, which corresponds to 36.1 fb⁻¹ of integrated luminosity, no evidence is found for physics beyond the Standard Model. The results are interpreted in the context of simplified models inspired by R-parity-conserving and R-parity-violating supersymmetry, where gluinos are pair-produced. More generic models within the phenomenological minimal supersymmetric Standard Model are also considered.

  20. Five year global dataset: NMC operational analyses (1978 to 1982)

    NASA Technical Reports Server (NTRS)

    Straus, David; Ardizzone, Joseph

    1987-01-01

    This document describes procedures used in assembling a five year dataset (1978 to 1982) using NMC Operational Analysis data. These procedures entailed replacing missing and unacceptable data in order to arrive at a complete dataset that is continuous in time. In addition, a subjective assessment on the integrity of all data (both preliminary and final) is presented. Documentation on tapes comprising the Five Year Global Dataset is also included.

  1. WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora.

    PubMed

    Callón, Miguel; Fdez-Glez, Jorge; Ruano-Ordás, David; Laza, Rosalía; Pavón, Reyes; Fdez-Riverola, Florentino; Méndez, Jose Ramón

    2017-12-22

    In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool aimed at building scientific datasets to facilitate experimentation in web spam research. The developed application allows the user to specify multiple criteria that change the way in which new corpora are generated whilst reducing the number of repetitive and error-prone tasks related to existing corpus maintenance. To this end, WARCProcessor supports up to six commonly used data sources for web spam research and can store the output corpus in standard WARC format together with complementary metadata files. Additionally, the application facilitates the automatic and concurrent download of web sites from the Internet, with configurable depth of the links to be followed and configurable behaviour when redirected URLs appear. WARCProcessor supports both an interactive GUI and a command line utility for background execution.

  2. Modeling of Receptor Tyrosine Kinase Signaling: Computational and Experimental Protocols.

    PubMed

    Fey, Dirk; Aksamitiene, Edita; Kiyatkin, Anatoly; Kholodenko, Boris N

    2017-01-01

    The advent of systems biology has convincingly demonstrated that the integration of experiments and dynamic modelling is a powerful approach to understanding cellular network biology. Here we present experimental and computational protocols that are necessary for applying this integrative approach to the quantitative studies of receptor tyrosine kinase (RTK) signaling networks. Signaling by RTKs controls multiple cellular processes, including the regulation of cell survival, motility, proliferation, differentiation, glucose metabolism, and apoptosis. We describe methods of model building and training on experimentally obtained quantitative datasets, as well as experimental methods of obtaining quantitative dose-response and temporal dependencies of protein phosphorylation and activities. The presented methods make possible (1) both the fine-grained modeling of complex signaling dynamics and identification of salient, coarse-grained network structures (such as feedback loops) that bring about intricate dynamics, and (2) experimental validation of dynamic models.

  3. rCAD: A Novel Database Schema for the Comparative Analysis of RNA.

    PubMed

    Ozer, Stuart; Doshi, Kishore J; Xu, Weijia; Gutell, Robin R

    2011-12-31

    Beyond its direct involvement in protein synthesis with mRNA, tRNA, and rRNA, RNA is now being appreciated for its significance in the overall metabolism and regulation of the cell. Comparative analysis has been very effective in the identification and characterization of RNA molecules, including the accurate prediction of their secondary structure. We are developing an integrative scalable data management and analysis system, the RNA Comparative Analysis Database (rCAD), implemented with SQL Server to support RNA comparative analysis. The platform-agnostic database schema of rCAD captures the essential relationships between the different dimensions of information for RNA comparative analysis datasets. The rCAD implementation enables a variety of comparative analysis manipulations with multiple integrated data dimensions for advanced RNA comparative analysis workflows. In this paper, we describe details of the rCAD schema design and illustrate its usefulness with two usage scenarios.

  4. rCAD: A Novel Database Schema for the Comparative Analysis of RNA

    PubMed Central

    Ozer, Stuart; Doshi, Kishore J.; Xu, Weijia; Gutell, Robin R.

    2013-01-01

    Beyond its direct involvement in protein synthesis with mRNA, tRNA, and rRNA, RNA is now being appreciated for its significance in the overall metabolism and regulation of the cell. Comparative analysis has been very effective in the identification and characterization of RNA molecules, including the accurate prediction of their secondary structure. We are developing an integrative scalable data management and analysis system, the RNA Comparative Analysis Database (rCAD), implemented with SQL Server to support RNA comparative analysis. The platform-agnostic database schema of rCAD captures the essential relationships between the different dimensions of information for RNA comparative analysis datasets. The rCAD implementation enables a variety of comparative analysis manipulations with multiple integrated data dimensions for advanced RNA comparative analysis workflows. In this paper, we describe details of the rCAD schema design and illustrate its usefulness with two usage scenarios. PMID:24772454

  5. Integrating CFD, CAA, and Experiments Towards Benchmark Datasets for Airframe Noise Problems

    NASA Technical Reports Server (NTRS)

    Choudhari, Meelan M.; Yamamoto, Kazuomi

    2012-01-01

    Airframe noise corresponds to the acoustic radiation due to turbulent flow in the vicinity of airframe components such as high-lift devices and landing gears. The combination of geometric complexity, high Reynolds number turbulence, multiple regions of separation, and a strong coupling with adjacent physical components makes the problem of airframe noise highly challenging. Since 2010, the American Institute of Aeronautics and Astronautics has organized an ongoing series of workshops devoted to Benchmark Problems for Airframe Noise Computations (BANC). The BANC workshops are aimed at enabling systematic progress in the understanding and high-fidelity prediction of airframe noise via collaborative investigations that integrate state-of-the-art computational fluid dynamics, computational aeroacoustics, and in-depth, holistic, multifacility measurements targeting a selected set of canonical yet realistic configurations. This paper provides a brief summary of the BANC effort, including its technical objectives, strategy, and selected outcomes thus far.

  6. WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora

    PubMed Central

    Callón, Miguel; Fdez-Glez, Jorge; Ruano-Ordás, David; Laza, Rosalía; Pavón, Reyes; Méndez, Jose Ramón

    2017-01-01

    In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool aimed at building scientific datasets to facilitate experimentation in web spam research. The developed application allows the user to specify multiple criteria that change the way in which new corpora are generated whilst reducing the number of repetitive and error-prone tasks related to existing corpus maintenance. To this end, WARCProcessor supports up to six commonly used data sources for web spam research and can store the output corpus in standard WARC format together with complementary metadata files. Additionally, the application facilitates the automatic and concurrent download of web sites from the Internet, with configurable depth of the links to be followed and configurable behaviour when redirected URLs appear. WARCProcessor supports both an interactive GUI and a command line utility for background execution. PMID:29271913

  7. Approach to Managing MEaSUREs Data at the GSFC Earth Science Data and Information Services Center (GES DISC)

    NASA Technical Reports Server (NTRS)

    Vollmer, Bruce; Kempler, Steven J.; Ramapriyan, Hampapuram K.

    2009-01-01

    A major need stated by the NASA Earth science research strategy is to develop long-term, consistent, and calibrated data and products that are valid across multiple missions and satellite sensors. (NASA Solicitation for Making Earth System data records for Use in Research Environments (MEaSUREs) 2006-2010) Selected projects create long term records of a given parameter, called Earth Science Data Records (ESDRs), based on mature algorithms that bring together continuous multi-sensor data. ESDRs, associated algorithms, vetted by the appropriate community, are archived at a NASA affiliated data center for archive, stewardship, and distribution. See http://measures-projects.gsfc.nasa.gov/ for more details. This presentation describes the NASA GSFC Earth Science Data and Information Services Center (GES DISC) approach to managing the MEaSUREs ESDR datasets assigned to GES DISC. (Energy/water cycle related and atmospheric composition ESDRs) GES DISC will utilize its experience to integrate existing and proven reusable data management components to accommodate the new ESDRs. Components include a data archive system (S4PA), a data discovery and access system (Mirador), and various web services for data access. In addition, if determined to be useful to the user community, the Giovanni data exploration tool will be made available to ESDRs. The GES DISC data integration methodology to be used for the MEaSUREs datasets is presented. The goals of this presentation are to share an approach to ESDR integration, and initiate discussions amongst the data centers, data managers and data providers for the purpose of gaining efficiencies in data management for MEaSUREs projects.

  8. Multi-modal gesture recognition using integrated model of motion, audio and video

    NASA Astrophysics Data System (ADS)

    Goutsu, Yusuke; Kobayashi, Takaki; Obara, Junya; Kusajima, Ikuo; Takeichi, Kazunari; Takano, Wataru; Nakamura, Yoshihiko

    2015-07-01

    Gesture recognition is used in many practical applications such as human-robot interaction, medical rehabilitation and sign language. With increasing motion sensor development, multiple data sources have become available, which has led to the rise of multi-modal gesture recognition. Because our previous approach to gesture recognition depended on a unimodal system, it had difficulty classifying similar motion patterns. To solve this problem, a novel approach that integrates motion, audio and video models is proposed, using a dataset captured by Kinect. The proposed system recognizes observed gestures using three models. The recognition results of the three models are integrated by the proposed framework, and the output becomes the final result. The motion and audio models are learned using Hidden Markov Models. A Random Forest, which serves as the video classifier, is used to learn the video model. In the experiments testing the performance of the proposed system, the motion and audio models most suitable for gesture recognition are chosen by varying feature vectors and learning methods. Additionally, the unimodal and multi-modal models are compared with respect to recognition accuracy. All the experiments are conducted on the dataset provided by the organizers of MMGRC, a workshop for the Multi-Modal Gesture Recognition Challenge. The comparison results show that the multi-modal model composed of the three models achieves the highest recognition rate. This improvement in recognition accuracy indicates that the complementary relationship among the three models improves the accuracy of gesture recognition. The proposed system provides application technology for understanding human actions in daily life more precisely.

  9. SAR image dataset of military ground targets with multiple poses for ATR

    NASA Astrophysics Data System (ADS)

    Belloni, Carole; Balleri, Alessio; Aouf, Nabil; Merlet, Thomas; Le Caillec, Jean-Marc

    2017-10-01

    Automatic Target Recognition (ATR) is the task of automatically detecting and classifying targets. Recognition using Synthetic Aperture Radar (SAR) images is interesting because SAR images can be acquired at night and under any weather conditions, whereas optical sensors operating in the visible band do not have this capability. Existing SAR ATR algorithms have mostly been evaluated using the MSTAR dataset. The problem with the MSTAR is that some of the proposed ATR methods have shown good classification performance even when targets were hidden, suggesting the presence of a bias in the dataset. Evaluations of SAR ATR techniques are currently challenging due to the lack of publicly available data in the SAR domain. In this paper, we present a high resolution SAR dataset consisting of images of a set of ground military target models taken at various aspect angles. The dataset can be used for a fair evaluation and comparison of SAR ATR algorithms. We applied the Inverse Synthetic Aperture Radar (ISAR) technique to echoes from targets rotating on a turntable and illuminated with a stepped frequency waveform. The targets in the database consist of four variants of two 1.7m-long models of T-64 and T-72 tanks. The gun, the turret position and the depression angle are varied to form 26 different sequences of images. The emitted signal spanned the frequency range from 13 GHz to 18 GHz to achieve a bandwidth of 5 GHz sampled with 4001 frequency points. The resolution obtained with respect to the size of the model targets is comparable to typical values obtained using SAR airborne systems. Single polarized images (Horizontal-Horizontal) are generated using the backprojection algorithm. A total of 1480 images are produced using a 20° integration angle. The images in the dataset are organized in a suggested training and testing set to facilitate a standard evaluation of SAR ATR algorithms.
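
The waveform figures quoted in this abstract (13 GHz to 18 GHz, 4001 frequency points, 20° integration angle) can be turned into rough resolution estimates. The sketch below is illustrative only: it assumes the standard textbook relations for a stepped-frequency waveform (range resolution c/2B, unambiguous range c/2Δf) and the small-angle cross-range approximation λ/2Δθ, none of which are stated in the abstract. The function and key names are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def stepped_frequency_summary(f_start_hz, f_stop_hz, n_points, integration_angle_deg):
    """Rough ISAR figures from textbook formulas (assumed, not taken from the paper)."""
    bandwidth = f_stop_hz - f_start_hz
    # n_points samples span the band, so there are n_points - 1 steps.
    frequency_step = bandwidth / (n_points - 1)
    # Slant-range resolution of a waveform with total bandwidth B.
    range_resolution = C / (2.0 * bandwidth)
    # Range extent that can be imaged without aliasing for this frequency step.
    unambiguous_range = C / (2.0 * frequency_step)
    # Cross-range resolution at the centre frequency (small-angle approximation).
    wavelength = C / ((f_start_hz + f_stop_hz) / 2.0)
    cross_range_resolution = wavelength / (2.0 * math.radians(integration_angle_deg))
    return {
        "frequency_step_hz": frequency_step,
        "range_resolution_m": range_resolution,
        "unambiguous_range_m": unambiguous_range,
        "cross_range_resolution_m": cross_range_resolution,
    }


# Values quoted in the abstract: 13-18 GHz, 4001 points, 20 degree integration angle.
summary = stepped_frequency_summary(13e9, 18e9, 4001, 20.0)
for name, value in summary.items():
    print(f"{name}: {value:.6g}")
```

Under these assumptions the 5 GHz bandwidth gives a range resolution of about 3 cm, which is small relative to the 1.7 m models. The cross-range figure is only indicative, because the small-angle approximation is loose at 20°.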

  10. Generating a focused view of disease ontology cancer terms for pan-cancer data integration and analysis

    PubMed Central

    Wu, Tsung-Jung; Schriml, Lynn M.; Chen, Qing-Rong; Colbert, Maureen; Crichton, Daniel J.; Finney, Richard; Hu, Ying; Kibbe, Warren A.; Kincaid, Heather; Meerzaman, Daoud; Mitraka, Elvira; Pan, Yang; Smith, Krista M.; Srivastava, Sudhir; Ward, Sari; Yan, Cheng; Mazumder, Raja

    2015-01-01

    Bio-ontologies provide terminologies for the scientific community to describe biomedical entities in a standardized manner. There are multiple initiatives that are developing biomedical terminologies for the purpose of providing better annotation, data integration and mining capabilities. Terminology resources devised for multiple purposes inherently diverge in content and structure. A major issue of biomedical data integration is the development of overlapping terms, ambiguous classifications and inconsistencies represented across databases and publications. The disease ontology (DO) was developed over the past decade to address data integration, standardization and annotation issues for human disease data. We have established a DO cancer project to be a focused view of cancer terms within the DO. The DO cancer project mapped 386 cancer terms from the Catalogue of Somatic Mutations in Cancer (COSMIC), The Cancer Genome Atlas (TCGA), International Cancer Genome Consortium, Therapeutically Applicable Research to Generate Effective Treatments, Integrative Oncogenomics and the Early Detection Research Network into a cohesive set of 187 DO terms represented by 63 top-level DO cancer terms. For example, the COSMIC term ‘kidney, NS, carcinoma, clear_cell_renal_cell_carcinoma’ and TCGA term ‘Kidney renal clear cell carcinoma’ were both grouped to the term ‘Disease Ontology Identification (DOID):4467 / renal clear cell carcinoma’ which was mapped to the TopNodes_DOcancerslim term ‘DOID:263 / kidney cancer’. Mapping of diverse cancer terms to DO and the use of top level terms (DO slims) will enable pan-cancer analysis across datasets generated from any of the cancer term sources where pan-cancer means including or relating to all or multiple types of cancer. The terms can be browsed from the DO web site (http://www.disease-ontology.org) and downloaded from the DO’s Apache Subversion or GitHub repositories. Database URL: http://www.disease-ontology.org PMID:25841438

  11. Integrating multiple fitting regression and Bayes decision for cancer diagnosis with transcriptomic data from tumor-educated blood platelets.

    PubMed

    Huang, Guangzao; Yuan, Mingshun; Chen, Moliang; Li, Lei; You, Wenjie; Li, Hanjie; Cai, James J; Ji, Guoli

    2017-10-07

    The application of machine learning in cancer diagnostics has shown great promise and is of importance in clinical settings. Here we consider applying machine learning methods to transcriptomic data derived from tumor-educated platelets (TEPs) from individuals with different types of cancer. We aim to define a reliability measure for diagnostic purposes to increase the potential for facilitating personalized treatments. To this end, we present a novel classification method called MFRB (for Multiple Fitting Regression and Bayes decision), which integrates the process of multiple fitting regression (MFR) with Bayes decision theory. MFR is first used to map multidimensional features of the transcriptomic data into a one-dimensional feature. The probability density function of each class in the mapped space is then adjusted using the Gaussian probability density function. Finally, the Bayes decision theory is used to build a probabilistic classifier with the estimated probability density functions. The output of MFRB can be used to determine which class a sample belongs to, as well as to assign a reliability measure for a given class. The classical support vector machine (SVM) and probabilistic SVM (PSVM) are used to evaluate the performance of the proposed method with simulated and real TEP datasets. Our results indicate that the proposed MFRB method achieves the best performance compared to SVM and PSVM, mainly due to its strong generalization ability for limited, imbalanced, and noisy data.
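
The last two stages described in this abstract (a Gaussian density per class on a one-dimensional feature, followed by a Bayes decision) can be sketched in a few lines. This is a minimal illustration, not the authors' MFRB implementation. It assumes the one-dimensional feature has already been produced by some mapping (the MFR step is not specified in the abstract and is not reproduced here), it reads "adjusted using the Gaussian probability density function" as fitting a Gaussian per class, and it uses made-up toy values. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_gaussians(values, labels):
    """Estimate (prior, mean, std) per class from 1-D feature values."""
    params = {}
    for cls in sorted(set(labels)):
        xs = [v for v, lab in zip(values, labels) if lab == cls]
        # pstdev must be non-zero; a class with identical values would need smoothing.
        params[cls] = (len(xs) / len(values), mean(xs), pstdev(xs))
    return params


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior(x, params):
    """Bayes rule: posterior is proportional to prior times class-conditional density."""
    joint = {cls: prior * gaussian_pdf(x, mu, sigma)
             for cls, (prior, mu, sigma) in params.items()}
    total = sum(joint.values())
    return {cls: p / total for cls, p in joint.items()}


def classify(x, params):
    """Return the most probable class and its posterior as a reliability score."""
    post = posterior(x, params)
    best = max(post, key=post.get)
    return best, post[best]


# Toy 1-D features (hypothetical values standing in for the mapped feature).
values = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
labels = ["healthy"] * 3 + ["cancer"] * 3
params = fit_gaussians(values, labels)
print(classify(0.15, params))
```

The posterior of the winning class plays the role of the reliability measure mentioned in the abstract. A value near 0.5 in a two-class problem would flag a sample that the classifier cannot assign confidently.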

  12. Modeling and Databases for Teaching Petrology

    NASA Astrophysics Data System (ADS)

    Asher, P.; Dutrow, B.

    2003-12-01

    With the widespread availability of high-speed computers with massive storage and ready transport capability of large amounts of data, computational and petrologic modeling and the use of databases provide new tools with which to teach petrology. Modeling can be used to gain insights into a system, predict system behavior, describe a system's processes, compare with a natural system or simply to be illustrative. These aspects result from data driven or empirical, analytical or numerical models or the concurrent examination of multiple lines of evidence. At the same time, use of models can enhance core foundations of the geosciences by improving critical thinking skills and by reinforcing prior knowledge gained. However, the use of modeling to teach petrology is dictated by the level of expectation we have for students and their facility with modeling approaches. For example, do we expect students to push buttons and navigate a program, understand the conceptual model and/or evaluate the results of a model? Whatever the desired level of sophistication, specific elements of design should be incorporated into a modeling exercise for effective teaching. These include, but are not limited to: use of the scientific method, use of prior knowledge, a clear statement of purpose and goals, attainable goals, a connection to the natural/actual system, a demonstration that complex heterogeneous natural systems are amenable to analyses by these techniques and, ideally, connections to other disciplines and the larger earth system. Databases offer another avenue with which to explore petrology. Large datasets are available that allow integration of multiple lines of evidence to attack a petrologic problem or understand a petrologic process. These are collected into a database that offers a tool for exploring, organizing and analyzing the data. For example, datasets may be geochemical, mineralogic, experimental and/or visual in nature, covering global, regional to local scales. These datasets provide students with access to large amounts of related data through space and time. Goals of the database working group include educating earth scientists about information systems in general, about the importance of metadata, about ways of using databases and datasets as educational tools, and about the availability of existing datasets and databases. The modeling and databases groups hope to create additional petrologic teaching tools using these aspects and invite the community to contribute to the effort.

  13. Integrative multi-platform meta-analysis of gene expression profiles in pancreatic ductal adenocarcinoma patients for identifying novel diagnostic biomarkers.

    PubMed

    Irigoyen, Antonio; Jimenez-Luna, Cristina; Benavides, Manuel; Caba, Octavio; Gallego, Javier; Ortuño, Francisco Manuel; Guillen-Ponce, Carmen; Rojas, Ignacio; Aranda, Enrique; Torres, Carolina; Prados, Jose

    2018-01-01

    Applying differentially expressed genes (DEGs) to identify feasible biomarkers in diseases can be a hard task when working with heterogeneous datasets. Expression data are strongly influenced by technology, sample preparation processes, and/or labeling methods. The proliferation of different microarray platforms for measuring gene expression increases the need to develop models able to compare their results, especially when different technologies can lead to signal values that vary greatly. Integrative meta-analysis can significantly improve the reliability and robustness of DEG detection. The objective of this work was to develop an integrative approach for identifying potential cancer biomarkers by integrating gene expression data from two different platforms. Pancreatic ductal adenocarcinoma (PDAC), where there is an urgent need to find new biomarkers due to its late diagnosis, is an ideal candidate for testing this technology. Expression data from two different datasets, namely Affymetrix and Illumina (18 and 36 PDAC patients, respectively), as well as from 18 healthy controls, were used for this study. A meta-analysis based on an empirical Bayesian methodology (ComBat) was then proposed to integrate these datasets. DEGs were finally identified from the integrated data by using the statistical programming language R. After our integrative meta-analysis, 5 genes were commonly identified within the individual analyses of the independent datasets. In addition, 28 novel genes that were not reported by the individual analyses ('gained' genes) were discovered. Several of these gained genes have already been related to other gastroenterological tumors. The proposed integrative meta-analysis has revealed novel DEGs that may play an important role in PDAC and could be potential biomarkers for diagnosing the disease.

  14. An Analysis of the Relationship Between Atmospheric Heat Transport and the Position of the ITCZ in NASA NEWS products, CMIP5 GCMs, and Multiple Reanalyses

    NASA Astrophysics Data System (ADS)

    Stanfield, R.; Dong, X.; Su, H.; Xi, B.; Jiang, J. H.

    2016-12-01

    In the past few years, studies have found a strong connection between atmospheric heat transport across the equator (AHTEQ) and the position of the ITCZ. This study investigates the seasonal, annual-mean and interannual variability of the ITCZ position and explores the relationships between the ITCZ position and inter-hemispheric energy transport in NASA NEWS products, multiple reanalysis datasets, and CMIP5 simulations. We find that large discrepancies exist in the ITCZ-AHTEQ relationships in these datasets and model simulations. The components of energy fluxes are examined to identify the primary sources of the discrepancies among the datasets and model results.

  15. BioSig3D: High Content Screening of Three-Dimensional Cell Culture Models

    PubMed Central

    Bilgin, Cemal Cagatay; Fontenay, Gerald; Cheng, Qingsu; Chang, Hang; Han, Ju; Parvin, Bahram

    2016-01-01

    BioSig3D is a computational platform for high-content screening of three-dimensional (3D) cell culture models that are imaged in full 3D volume. It provides an end-to-end solution for designing high content screening assays, based on colony organization that is derived from segmentation of nuclei in each colony. BioSig3D also enables visualization of raw and processed 3D volumetric data for quality control, and integrates advanced bioinformatics analysis. The system consists of multiple computational and annotation modules that are coupled together with a strong use of controlled vocabularies to reduce ambiguities between different users. It is a web-based system that allows users to: design an experiment by defining experimental variables, upload a large set of volumetric images into the system, analyze and visualize the dataset, and either display computed indices as a heatmap, or phenotypic subtypes for heterogeneity analysis, or download computed indices for statistical analysis or integrative biology. BioSig3D has been used to profile baseline colony formations with two experiments: (i) morphogenesis of a panel of human mammary epithelial cell lines (HMEC), and (ii) heterogeneity in colony formation using an immortalized non-transformed cell line. These experiments reveal intrinsic growth properties of well-characterized cell lines that are routinely used for biological studies. BioSig3D is being released with seed datasets and video-based documentation. PMID:26978075

  16. From gene networks to drugs: systems pharmacology approaches for AUD.

    PubMed

    Ferguson, Laura B; Harris, R Adron; Mayfield, Roy Dayne

    2018-06-01

    The alcohol research field has amassed an impressive number of gene expression datasets spanning key brain areas for addiction, species (humans as well as multiple animal models), and stages in the addiction cycle (binge/intoxication, withdrawal/negative affect, and preoccupation/anticipation). These data have improved our understanding of the molecular adaptations that eventually lead to dysregulation of brain function and the chronic, relapsing disorder of addiction. Identification of new medications to treat alcohol use disorder (AUD) will likely benefit from the integration of genetic, genomic, and behavioral information included in these important datasets. Systems pharmacology considers drug effects as the outcome of the complex network of interactions a drug has rather than a single drug-molecule interaction. Computational strategies based on this principle that integrate gene expression signatures of pharmaceuticals and disease states have shown promise for identifying treatments that ameliorate disease symptoms (called in silico gene mapping or connectivity mapping). In this review, we suggest that gene expression profiling for in silico mapping is critical to improve drug repurposing and discovery for AUD and other psychiatric illnesses. We highlight studies that successfully apply gene mapping computational approaches to identify or repurpose pharmaceutical treatments for psychiatric illnesses. Furthermore, we address important challenges that must be overcome to maximize the potential of these strategies to translate to the clinic and improve healthcare outcomes.

  17. Acoustic Seabed Characterization of the Porcupine Bank, Irish Margin

    NASA Astrophysics Data System (ADS)

    O'Toole, Ronan; Monteys, Xavier

    2010-05-01

    The Porcupine Bank represents a large section of continental shelf situated west of the Irish landmass, located in water depths ranging between 150 and 500 m. Under the Irish National Seabed Survey (INSS 1999-2006) this area was comprehensively mapped, generating multiple acoustic datasets including high resolution multibeam echosounder data. The unique nature of the area's datasets in terms of data density, consistency and geographic extent has allowed the development of a large-scale integrated physical characterization of the Porcupine Bank for multidisciplinary applications. Integrated analysis of backscatter and bathymetry data has resulted in a baseline delineation of sediment distribution, seabed geology and geomorphological features on the bank, along with an inclusive set of related database information. The methodology used incorporates a variety of statistical techniques which are necessary in isolating sonar system artefacts and addressing sonar geometry related issues. A number of acoustic backscatter parameters at several angles of incidence have been analysed in order to complement the characterization for both surface and subsurface sediments. Acoustic sub-bottom records have also been incorporated in order to investigate the physical characteristics of certain features on the Porcupine Bank. Where available, groundtruthing information in terms of sediment samples, video footage and cores has been applied to add physical descriptors and validation to the characterization. Extensive mapping of different rock outcrops, sediment drifts, seabed features and other geological classes has been achieved using this methodology.

  18. Evaluation of hierarchical models for integrative genomic analyses.

    PubMed

    Denis, Marie; Tadesse, Mahlet G

    2016-03-01

    Advances in high-throughput technologies have led to the acquisition of various types of -omic data on the same biological samples. Each data type gives independent and complementary information that can explain the biological mechanisms of interest. While several studies performing independent analyses of each dataset have led to significant results, a better understanding of complex biological mechanisms requires an integrative analysis of different sources of data. Flexible modeling approaches, based on penalized likelihood methods and expectation-maximization (EM) algorithms, are studied and tested under various biological relationship scenarios between the different molecular features and their effects on a clinical outcome. The models are applied to genomic datasets from two cancer types in the Cancer Genome Atlas project: glioblastoma multiforme and ovarian serous cystadenocarcinoma. The integrative models lead to improved model fit and predictive performance. They also provide a better understanding of the biological mechanisms underlying patients' survival. Availability: Source code implementing the integrative models is freely available at https://github.com/mgt000/IntegrativeAnalysis along with example datasets and sample R script applying the models to these data. The TCGA datasets used for analysis are publicly available at https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp Contact: marie.denis@cirad.fr or mgt26@georgetown.edu Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  19. The World Spatiotemporal Analytics and Mapping Project (WSTAMP): Discovering, Exploring, and Mapping Spatiotemporal Patterns Across Heterogenous Space-Time Data

    NASA Astrophysics Data System (ADS)

    Morton, A.; Stewart, R.; Held, E.; Piburn, J.; Allen, M. R.; McManamay, R.; Sanyal, J.; Sorokine, A.; Bhaduri, B. L.

    2017-12-01

    Spatiotemporal (ST) analytics applied to major spatio-temporal data sources from major vendors such as USGS, NOAA, World Bank and World Health Organization have tremendous value in shedding light on the evolution of physical, cultural, and geopolitical landscapes on a local and global level. Especially powerful is the integration of these physical and cultural datasets across multiple and disparate formats, facilitating new interdisciplinary analytics and insights. Realizing this potential first requires an ST data model that addresses challenges in properly merging data from multiple authors, with evolving ontological perspectives, semantical differences, changing attributes, and content that is textual, numeric, categorical, and hierarchical. Equally challenging is the development of analytical and visualization approaches that provide a serious exploration of this integrated data while remaining accessible to practitioners with varied backgrounds. The WSTAMP project at the Oak Ridge National Laboratory has yielded two major results in addressing these challenges: 1) development of the WSTAMP database, a significant advance in ST data modeling that integrates 16000+ attributes covering 200+ countries for over 50 years from over 30 major sources and 2) a novel online ST exploratory and analysis tool providing an array of modern statistical and visualization techniques for analyzing these data temporally, spatially, and spatiotemporally under a standard analytic workflow. We report on these advances, provide an illustrative case study, and inform how others may freely access the tool.

  20. The Interaction Network Ontology-supported modeling and mining of complex interactions represented with multiple keywords in biomedical literature.

    PubMed

    Özgür, Arzucan; Hur, Junguk; He, Yongqun

    2016-01-01

    The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination. This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords co-existing in one sentence to represent specific INO interaction classes. Such keyword combinations and related INO interaction type information could be automatically obtained via SPARQL queries, formatted in Excel format, and used in an INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used benchmark Learning Logic in Language (LLL) dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type. The INO ontology currently has 575 terms including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations: 'has literature mining keywords' and 'has keyword dependency pattern'. The keyword dependency patterns were generated via running the Stanford Parser to obtain dependency relation types. Out of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 were identified by using the direct dependency relations. 
    The LLL dataset contained 34 gene regulation interaction types, each of which was associated with multiple keywords. A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. The phenomenon of having multi-keyword interaction types was also frequently observed in the vaccine dataset. By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords.

  1. Multiple-rule bias in the comparison of classification rules

    PubMed Central

    Yousefi, Mohammadmahdi R.; Hua, Jianping; Dougherty, Edward R.

    2011-01-01

    Motivation: There is growing discussion in the bioinformatics community concerning overoptimism of reported results. Two approaches contributing to overoptimism in classification are (i) the reporting of results on datasets for which a proposed classification rule performs well and (ii) the comparison of multiple classification rules on a single dataset that purports to show the advantage of a certain rule. Results: This article provides a careful probabilistic analysis of the second issue and the ‘multiple-rule bias’, resulting from choosing a classification rule having minimum estimated error on the dataset. It quantifies this bias corresponding to estimating the expected true error of the classification rule possessing minimum estimated error and it characterizes the bias from estimating the true comparative advantage of the chosen classification rule relative to the others by the estimated comparative advantage on the dataset. The analysis is applied to both synthetic and real data using a number of classification rules and error estimators. Availability: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routines and error estimation methods. The code for multiple-rule analysis is implemented in MATLAB. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi11a/. Supplementary simulation results are also included. Contact: edward@ece.tamu.edu Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:21546390
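
    The selection effect behind the 'multiple-rule bias' described above can be illustrated with a small Monte Carlo sketch. It is not the authors' analysis (their code is in C and MATLAB): it assumes, for simplicity, that every rule has the same true error and that the error estimates of different rules are independent, which would not hold for rules evaluated on one shared dataset. The function names and default parameters are hypothetical.

```python
import random


def min_estimated_error(num_rules, test_size, true_error, rng):
    """Return the smallest test-set error estimate among num_rules rules.

    Every rule is assumed to have the same true error, so any gap between
    the returned value and true_error is purely a selection effect.
    """
    estimates = []
    for _ in range(num_rules):
        # Each test point is misclassified with probability true_error.
        mistakes = sum(rng.random() < true_error for _ in range(test_size))
        estimates.append(mistakes / test_size)
    return min(estimates)


def multiple_rule_bias(num_rules, test_size=50, true_error=0.3,
                       trials=2000, seed=0):
    """Monte Carlo estimate of E[min estimated error] - true error."""
    rng = random.Random(seed)
    total = sum(min_estimated_error(num_rules, test_size, true_error, rng)
                for _ in range(trials))
    return total / trials - true_error
```

    Calling `multiple_rule_bias(1)` should give a value close to zero, while larger values of `num_rules` give increasingly negative values: reporting the best of many equally good rules understates its error, even though no rule is actually better than the others.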

  2. Dataset Lifecycle Policy

    NASA Technical Reports Server (NTRS)

    Armstrong, Edward; Tauer, Eric

    2013-01-01

    The presentation focused on describing a new dataset lifecycle policy that the NASA Physical Oceanography DAAC (PO.DAAC) has implemented for its new and current datasets to foster improved stewardship and consistency across its archive. The overarching goal is to implement this dataset lifecycle policy for all new GHRSST GDS2 datasets and bridge the mission statements from the GHRSST Project Office and PO.DAAC to provide the best quality SST data in a cost-effective, efficient manner, preserving its integrity so that it will be available and usable to a wide audience.

  3. Integrative Analysis of Many RNA-Seq Datasets to Study Alternative Splicing

    PubMed Central

    Li, Wenyuan; Dai, Chao; Kang, Shuli; Zhou, Xianghong Jasmine

    2014-01-01

    Alternative splicing is an important gene regulatory mechanism that dramatically increases the complexity of the proteome. However, how alternative splicing is regulated and how transcription and splicing are coordinated are still poorly understood, and functions of transcript isoforms have been studied only in a few limited cases. Nowadays, RNA-seq technology provides an exceptional opportunity to study alternative splicing on genome-wide scales and in an unbiased manner. With the rapid accumulation of data in public repositories, new challenges arise from the urgent need to effectively integrate many different RNA-seq datasets to study alternative splicing. This paper discusses a set of advanced computational methods that can integrate and analyze many RNA-seq datasets to systematically identify splicing modules, unravel the coupling of transcription and splicing, and predict the functions of splicing isoforms on a genome-wide scale. PMID:24583115

  4. Configuration of multiple human stressors and their impacts on fish assemblages in Alpine river basins of Austria.

    PubMed

    Schinegger, Rafaela; Pucher, Matthias; Aschauer, Christiane; Schmutz, Stefan

    2018-03-01

    This work addresses multiple human stressors and their impacts on fish assemblages of the Drava and Mura rivers in southern Austria. The impacts of single and multiple human stressors on riverine fish assemblages in these basins were disentangled, based on an extensive dataset. Stressor configuration, i.e. various metrics of multiple stressors belonging to the stressor groups hydrology, morphology, connectivity and water quality, was investigated for the first time at river basin scale in Austria. As biological response variables, the Fish Index Austria (FIA) and its related single metrics, as well as the WFD biological and total state, were investigated. Stressor-response analysis shows divergent results, but a general trend of decreasing ecological integrity with increasing number of stressors and maximum stressor is observed. Fish metrics based on age structure, fish region index and biological status responded best to single stressors and/or their combinations. The knowledge gained in this work provides a basis for advanced investigations in Alpine river basins and beyond, supports WFD implementation and helps prioritize further actions towards multi-stressor restoration and management. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.

  5. Efficient exploration of pan-cancer networks by generalized covariance selection and interactive web content

    PubMed Central

    Kling, Teresia; Johansson, Patrik; Sanchez, José; Marinescu, Voichita D.; Jörnsten, Rebecka; Nelander, Sven

    2015-01-01

    Statistical network modeling techniques are increasingly important tools to analyze cancer genomics data. However, current tools and resources are not designed to work across multiple diagnoses and technical platforms, thus limiting their applicability to comprehensive pan-cancer datasets such as The Cancer Genome Atlas (TCGA). To address this, we describe a new data driven modeling method, based on generalized Sparse Inverse Covariance Selection (SICS). The method integrates genetic, epigenetic and transcriptional data from multiple cancers, to define links that are present in multiple cancers, a subset of cancers, or a single cancer. It is shown to be statistically robust and effective at detecting direct pathway links in data from TCGA. To facilitate interpretation of the results, we introduce a publicly accessible tool (cancerlandscapes.org), in which the derived networks are explored as interactive web content, linked to several pathway and pharmacological databases. To evaluate the performance of the method, we constructed a model for eight TCGA cancers, using data from 3900 patients. The model rediscovered known mechanisms and contained interesting predictions. Possible applications include prediction of regulatory relationships, comparison of network modules across multiple forms of cancer and identification of drug targets. PMID:25953855

  6. High performance geospatial and climate data visualization using GeoJS

    NASA Astrophysics Data System (ADS)

    Chaudhary, A.; Beezley, J. D.

    2015-12-01

    GeoJS (https://github.com/OpenGeoscience/geojs) is an open-source library developed to support interactive scientific and geospatial visualization of climate and earth science datasets in a web environment. GeoJS has a convenient application programming interface (API) that enables users to harness the fast performance of WebGL and Canvas 2D APIs with sophisticated Scalable Vector Graphics (SVG) features in a consistent and convenient manner. We started the project in response to the need for an open-source JavaScript library that can combine traditional geographic information systems (GIS) and scientific visualization on the web. Many libraries, some of which are open source, support mapping or other GIS capabilities, but lack the features required to visualize scientific and other geospatial datasets. For instance, such libraries are not capable of rendering climate plots from NetCDF files, and some libraries are limited in regards to geoinformatics (infovis in a geospatial environment). While libraries such as d3.js are extremely powerful for these kinds of plots, in order to integrate them into other GIS libraries, the construction of geoinformatics visualizations must be completed manually and separately, or the code must somehow be mixed in an unintuitive way.
    We developed GeoJS with the following motivations:
    • To create an open-source geovisualization and GIS library that combines scientific visualization with GIS and informatics
    • To develop an extensible library that can combine data from multiple sources and render them using multiple backends
    • To build a library that works well with existing scientific visualization tools such as VTK
    We have successfully deployed GeoJS-based applications for multiple domains across various projects. The ClimatePipes project funded by the Department of Energy, for example, used GeoJS to visualize NetCDF datasets from climate data archives. Other projects built visualizations using GeoJS for interactively exploring data and analysis regarding 1) the human trafficking domain, 2) New York City taxi drop-offs and pick-ups, and 3) the Ebola outbreak. GeoJS supports advanced visualization features such as picking and selecting, as well as clustering. It also supports 2D contour plots, vector plots, heat maps, and geospatial graphs.

  7. STOP using just GO: a multi-ontology hypothesis generation tool for high throughput experimentation

    PubMed Central

    2013-01-01

    Background: Gene Ontology (GO) enrichment analysis remains one of the most common methods for hypothesis generation from high throughput datasets. However, we believe that researchers strive to test other hypotheses that fall outside of GO. Here, we developed and evaluated a tool for hypothesis generation from gene or protein lists using ontological concepts present in manually curated text that describes those genes and proteins. Results: As a consequence we have developed the method Statistical Tracking of Ontological Phrases (STOP) that expands the realm of testable hypotheses in gene set enrichment analyses by integrating automated annotations of genes to terms from over 200 biomedical ontologies. While not as precise as manually curated terms, we find that the additional enriched concepts have value when coupled with traditional enrichment analyses using curated terms. Conclusion: Multiple ontologies have been developed for gene and protein annotation; by using a dataset of both manually curated GO terms and automatically recognized concepts from curated text we can expand the realm of hypotheses that can be discovered. The web application STOP is available at http://mooneygroup.org/stop/. PMID:23409969

  8. Contribution of tropical cyclones to abnormal sea surface temperature warming in the Yellow Sea in December 2004

    NASA Astrophysics Data System (ADS)

    Kim, Taekyun; Choo, Sung-Ho; Moon, Jae-Hong; Chang, Pil-Hun

    2017-12-01

    Unusual sea surface temperature (SST) warming occurred over the Yellow Sea (YS) in December 2004. To identify the causes of the abnormal SST warming, we conducted an analysis on atmospheric circulation anomalies induced by tropical cyclones (TCs) and their impacts on upper ocean characteristics using multiple datasets. With the analysis of various datasets, we explored a new aspect of the relationship between TC activity and SST. The results show that there is a significant link between TC activity over the Northwest Pacific (NWP) and SST in the YS. The integrated effect of consecutive TC activity induces a large-scale atmospheric cyclonic circulation anomaly over the NWP and consequently anomalous easterly winds over the YS and East China Sea. The mechanism of the unusually warm SST in the YS can be explained by considering TCs acting as an important source of Ekman heat transport that results in substantial intrusion of relatively warm surface water into the YS interior. Furthermore, TC-related circulation anomalies contribute to the retention of the resulting warm SST anomalies in the entire YS.

  9. Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies

    PubMed Central

    Yang, Tsun-Po; Beazley, Claude; Montgomery, Stephen B.; Dimas, Antigone S.; Gutierrez-Arcelus, Maria; Stranger, Barbara E.; Deloukas, Panos; Dermitzakis, Emmanouil T.

    2010-01-01

    Summary: Genevar (GENe Expression VARiation) is a database and Java tool designed to integrate multiple datasets, and provides analysis and visualization of associations between sequence variation and gene expression. Genevar allows researchers to investigate expression quantitative trait loci (eQTL) associations within a gene locus of interest in real time. The database and application can be installed on a standard computer in database mode and, in addition, on a server to share discoveries among affiliations or the broader community over the Internet via web services protocols. Availability: http://www.sanger.ac.uk/resources/software/genevar Contact: emmanouil.dermitzakis@unige.ch PMID:20702402

  10. Improving semi-automated segmentation by integrating learning with active sampling

    NASA Astrophysics Data System (ADS)

    Huo, Jing; Okada, Kazunori; Brown, Matthew

    2012-02-01

    Interactive segmentation algorithms such as GrowCut usually require quite a few user interactions to perform well, and have poor repeatability. In this study, we developed a novel technique to boost the performance of the interactive segmentation method GrowCut involving: 1) a novel "focused sampling" approach for supervised learning, as opposed to conventional random sampling; 2) boosting GrowCut using the machine learned results. We applied the proposed technique to the glioblastoma multiforme (GBM) brain tumor segmentation, and evaluated on a dataset of ten cases from a multiple center pharmaceutical drug trial. The results showed that the proposed system has the potential to reduce user interaction while maintaining similar segmentation accuracy.

  11. Estimating weak ratiometric signals in imaging data. II. Meta-analysis with multiple, dual-channel datasets.

    PubMed

    Sornborger, Andrew; Broder, Josef; Majumder, Anirban; Srinivasamoorthy, Ganesh; Porter, Erika; Reagin, Sean S; Keith, Charles; Lauderdale, James D

    2008-09-01

    Ratiometric fluorescent indicators are used for making quantitative measurements of a variety of physiological variables. Their utility is often limited by noise. This is the second in a series of papers describing statistical methods for denoising ratiometric data with the aim of obtaining improved quantitative estimates of variables of interest. Here, we outline a statistical optimization method that is designed for the analysis of ratiometric imaging data in which multiple measurements have been taken of systems responding to the same stimulation protocol. This method takes advantage of correlated information across multiple datasets for objectively detecting and estimating ratiometric signals. We demonstrate our method by showing results of its application on multiple, ratiometric calcium imaging experiments.

  12. Atlas Toolkit: Fast registration of 3D morphological datasets in the absence of landmarks

    PubMed Central

    Grocott, Timothy; Thomas, Paul; Münsterberg, Andrea E.

    2016-01-01

    Image registration is a gateway technology for Developmental Systems Biology, enabling computational analysis of related datasets within a shared coordinate system. Many registration tools rely on landmarks to ensure that datasets are correctly aligned; yet suitable landmarks are not present in many datasets. Atlas Toolkit is a Fiji/ImageJ plugin collection offering elastic group-wise registration of 3D morphological datasets, guided by segmentation of the interesting morphology. We demonstrate the method by combinatorial mapping of cell signalling events in the developing eyes of chick embryos, and use the integrated datasets to predictively enumerate Gene Regulatory Network states. PMID:26864723

  13. Atlas Toolkit: Fast registration of 3D morphological datasets in the absence of landmarks.

    PubMed

    Grocott, Timothy; Thomas, Paul; Münsterberg, Andrea E

    2016-02-11

    Image registration is a gateway technology for Developmental Systems Biology, enabling computational analysis of related datasets within a shared coordinate system. Many registration tools rely on landmarks to ensure that datasets are correctly aligned; yet suitable landmarks are not present in many datasets. Atlas Toolkit is a Fiji/ImageJ plugin collection offering elastic group-wise registration of 3D morphological datasets, guided by segmentation of the interesting morphology. We demonstrate the method by combinatorial mapping of cell signalling events in the developing eyes of chick embryos, and use the integrated datasets to predictively enumerate Gene Regulatory Network states.

  14. NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets.

    PubMed

    Nielsen, Morten; Andreatta, Massimo

    2016-03-30

    Binding of peptides to MHC class I molecules (MHC-I) is essential for antigen presentation to cytotoxic T-cells. Here, we demonstrate how a simple alignment step allowing insertions and deletions in a pan-specific MHC-I binding machine-learning model enables combining information across both multiple MHC molecules and peptide lengths. This pan-allele/pan-length algorithm significantly outperforms state-of-the-art methods, and captures differences in the length profile of binders to different MHC molecules leading to increased accuracy for ligand identification. Using this model, we demonstrate that percentile ranks in contrast to affinity-based thresholds are optimal for ligand identification due to uniform sampling of the MHC space. We have developed a neural network-based machine-learning algorithm leveraging information across multiple receptor specificities and ligand length scales, and demonstrated how this approach significantly improves the accuracy for prediction of peptide binding and identification of MHC ligands. The method is available at www.cbs.dtu.dk/services/NetMHCpan-3.0.
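
    The contrast between percentile ranks and fixed affinity thresholds mentioned above can be sketched as follows. This is an illustrative assumption about how a rank could be computed from a set of background scores for one MHC molecule; it is not taken from the NetMHCpan-3.0 implementation, and the function name and example scores are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(score, background_scores):
    """Percentage of background scores strictly greater than `score`.

    A lower value means a stronger predicted binder relative to the
    background for this molecule (0.0 means the score beats every
    background score).
    """
    if not background_scores:
        raise ValueError("background_scores must not be empty")
    ordered = sorted(background_scores)
    # bisect_right returns the number of background scores <= score.
    not_greater = bisect_right(ordered, score)
    return 100.0 * (len(ordered) - not_greater) / len(ordered)
```

    Because the rank is computed per molecule, the same raw score can be unremarkable for a molecule whose background scores run high and exceptional for one whose scores run low, which is the motivation for ranking rather than applying one threshold to all molecules.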

  15. Information Requirements for Integrating Spatially Discrete, Feature-Based Earth Observations

    NASA Astrophysics Data System (ADS)

    Horsburgh, J. S.; Aufdenkampe, A. K.; Lehnert, K. A.; Mayorga, E.; Hsu, L.; Song, L.; Zaslavsky, I.; Valentine, D. L.

    2014-12-01

    Several cyberinfrastructures have emerged for sharing observational data collected at densely sampled and/or highly instrumented field sites. These include the CUAHSI Hydrologic Information System (HIS), the Critical Zone Observatory Integrated Data Management System (CZOData), the Integrated Earth Data Applications (IEDA) and EarthChem system, and the Integrated Ocean Observing System (IOOS). These systems rely on standard data encodings and, in some cases, standard semantics for classes of geoscience data. Their focus is on sharing data on the Internet via web services in domain specific encodings or markup languages. While they have made progress in making data available, it still takes investigators significant effort to discover and access datasets from multiple repositories because of inconsistencies in the way domain systems describe, encode, and share data. Yet, there are many scenarios that require efficient integration of these data types across different domains. For example, understanding a soil profile's geochemical response to extreme weather events requires integration of hydrologic and atmospheric time series with geochemical data from soil samples collected over various depth intervals from soil cores or pits at different positions on a landscape. Integrated access to and analysis of data for such studies are hindered because common characteristics of data, including time, location, provenance, methods, and units are described differently within different systems. Integration requires syntactic and semantic translations that can be manual, error-prone, and lossy. We report information requirements identified as part of our work to define an information model for a broad class of earth science data - i.e., spatially-discrete, feature-based earth observations resulting from in-situ sensors and environmental samples. 
We sought to answer the question: "What information must accompany observational data for them to be archivable and discoverable within a publication system as well as interpretable once retrieved from such a system for analysis and (re)use?" We also describe development of multiple functional schemas (i.e., physical implementations for data storage, transfer, and archival) for the information model that capture the requirements reported here.

  16. NASA's High Mountain Asia Team (HiMAT): collaborative research to study changes of the High Asia region

    NASA Astrophysics Data System (ADS)

    Arendt, A. A.; Houser, P.; Kapnick, S. B.; Kargel, J. S.; Kirschbaum, D.; Kumar, S.; Margulis, S. A.; McDonald, K. C.; Osmanoglu, B.; Painter, T. H.; Raup, B. H.; Rupper, S.; Tsay, S. C.; Velicogna, I.

    2017-12-01

    The High Mountain Asia Team (HiMAT) is an assembly of 13 research groups funded by NASA to improve understanding of cryospheric and hydrological changes in High Mountain Asia (HMA). Our project goals are to quantify historical and future variability in weather and climate over the HMA, partition the components of the water budget across HMA watersheds, explore physical processes driving changes, and predict couplings and feedbacks between physical and human systems through assessment of hazards and downstream impacts. These objectives are being addressed through analysis of remote sensing datasets combined with modeling and assimilation methods to enable data integration across multiple spatial and temporal scales. Our work to date has focused on developing improved high resolution precipitation, snow cover and snow water equivalence products through a variety of statistical uncertainty analysis, dynamical downscaling and assimilation techniques. These and other high resolution climate products are being used as input and validation for an assembly of land surface and General Circulation Models. To quantify glacier change in the region we have calculated multidecadal mass balances of a subset of HMA glaciers by comparing commercial satellite imagery with earlier elevation datasets. HiMAT is using these tools and datasets to explore the impact of atmospheric aerosols and surface impurities on surface energy exchanges, to determine drivers of glacier and snowpack melt rates, and to improve our capacity to predict future hydrological variability. Outputs from the climate and land surface assessments are being combined with landslide and glacier lake inventories to refine our ability to predict hazards in the region. Economic valuation models are also being used to assess impacts on water resources and hydropower. 
Field data of atmospheric aerosol, radiative flux and glacier lake conditions are being collected to provide ground validation for models and remote sensing products. In this presentation we will discuss initial results and outline plans for a scheduled release of our datasets and findings to the broader community. We will also describe our methods for cross-team collaboration through the adoption of cloud computing and data integration tools.

  17. Parasol: An Architecture for Cross-Cloud Federated Graph Querying

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lieberman, Michael; Choudhury, Sutanay; Hughes, Marisa

    2014-06-22

    Large scale data fusion of multiple datasets can often provide insights that examining datasets individually cannot. However, when these datasets reside in different data centers and cannot be collocated due to technical, administrative, or policy barriers, a unique set of problems arises that hamper querying and data fusion. To address these problems, a system and architecture named Parasol is presented that enables federated queries over graph databases residing in multiple clouds. Parasol’s design is flexible and requires only minimal assumptions for participant clouds. Query optimization techniques are also described that are compatible with Parasol’s lightweight architecture. Experiments on a prototype implementation of Parasol indicate its suitability for cross-cloud federated graph queries.

  18. Multiple resource use efficiency (mRUE): A new concept for ecosystem production

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Han, Juanjuan; Chen, Jiquan; Miao, Yuan

    The resource-driven concept, which is an important school for investigating ecosystem production, has been applied for decades. However, the regulatory mechanisms of production by multiple resources remain unclear. We formulated a new algorithm model that integrates multiple resource uses to study ecosystem production and tested its applications on a water-availability gradient in semi-arid grassland. The result of our experiment showed that changes in water availability significantly affected the resources of light and nitrogen, and altered the relationships among multiple resource absorption rate (ε), multiple resource use efficiency (mRUE), and available resource (R avail). The increased water availability suppressed ecosystem mRUE (i.e., “declining marginal returns”); the changes in mRUE had a negative effect on ε (i.e., “inverse feedback”). These two processes jointly regulated that the stimulated single resource availability would promote ecosystem production rather than suppress it, even when mRUE was reduced. This study illustrated the use of the mRUE model in exploring the coherent relationships among the key parameters on regulating the ecosystem production for future modeling, and evaluated the sensitivity of this conceptual model under different dataset properties. Furthermore, this model needs extensive validation by the ecological community before it can extrapolate this method to other ecosystems in the future.

  19. Multiple resource use efficiency (mRUE): A new concept for ecosystem production

    DOE PAGES

    Han, Juanjuan; Chen, Jiquan; Miao, Yuan; ...

    2016-11-21

    The resource-driven concept, which is an important school for investigating ecosystem production, has been applied for decades. However, the regulatory mechanisms of production by multiple resources remain unclear. We formulated a new algorithm model that integrates multiple resource uses to study ecosystem production and tested its applications on a water-availability gradient in semi-arid grassland. The result of our experiment showed that changes in water availability significantly affected the resources of light and nitrogen, and altered the relationships among multiple resource absorption rate (ε), multiple resource use efficiency (mRUE), and available resource (R avail). The increased water availability suppressed ecosystem mRUE (i.e., “declining marginal returns”); the changes in mRUE had a negative effect on ε (i.e., “inverse feedback”). These two processes jointly regulated that the stimulated single resource availability would promote ecosystem production rather than suppress it, even when mRUE was reduced. This study illustrated the use of the mRUE model in exploring the coherent relationships among the key parameters on regulating the ecosystem production for future modeling, and evaluated the sensitivity of this conceptual model under different dataset properties. Furthermore, this model needs extensive validation by the ecological community before it can extrapolate this method to other ecosystems in the future.

  20. Multiple Resource Use Efficiency (mRUE): A New Concept for Ecosystem Production.

    PubMed

    Han, Juanjuan; Chen, Jiquan; Miao, Yuan; Wan, Shiqiang

    2016-11-21

    The resource-driven concept, which is an important school for investigating ecosystem production, has been applied for decades. However, the regulatory mechanisms of production by multiple resources remain unclear. We formulated a new algorithm model that integrates multiple resource uses to study ecosystem production and tested its applications on a water-availability gradient in semi-arid grassland. The result of our experiment showed that changes in water availability significantly affected the resources of light and nitrogen, and altered the relationships among multiple resource absorption rate (ε), multiple resource use efficiency (mRUE), and available resource (R avail). The increased water availability suppressed ecosystem mRUE (i.e., "declining marginal returns"); the changes in mRUE had a negative effect on ε (i.e., "inverse feedback"). These two processes jointly regulated that the stimulated single resource availability would promote ecosystem production rather than suppress it, even when mRUE was reduced. This study illustrated the use of the mRUE model in exploring the coherent relationships among the key parameters on regulating the ecosystem production for future modeling, and evaluated the sensitivity of this conceptual model under different dataset properties. However, this model needs extensive validation by the ecological community before it can extrapolate this method to other ecosystems in the future.

  1. Reading Profiles in Multi-Site Data With Missingness.

    PubMed

    Eckert, Mark A; Vaden, Kenneth I; Gebregziabher, Mulugeta

    2018-01-01

    Children with reading disability exhibit varied deficits in reading and cognitive abilities that contribute to their reading comprehension problems. Some children exhibit primary deficits in phonological processing, while others can exhibit deficits in oral language and executive functions that affect comprehension. This behavioral heterogeneity is problematic when missing data prevent the characterization of different reading profiles, which often occurs in retrospective data sharing initiatives without coordinated data collection. Here we show that reading profiles can be reliably identified based on Random Forest classification of incomplete behavioral datasets, after the missForest method is used to multiply impute missing values. Results from simulation analyses showed that reading profiles could be accurately classified across degrees of missingness (e.g., ∼5% classification error for 30% missingness across the sample). The application of missForest to a real multi-site dataset with missingness (n = 924) showed that reading disability profiles significantly and consistently differed in reading and cognitive abilities for cases with and without missing data. The results of validation analyses indicated that the reading profiles (cases with and without missing data) exhibited significant differences for an independent set of behavioral variables that were not used to classify reading profiles. Together, the results show how multiple imputation can be applied to the classification of cases with missing data and can increase the integrity of results from multi-site open access datasets.

  2. DataPflex: a MATLAB-based tool for the manipulation and visualization of multidimensional datasets.

    PubMed

    Hendriks, Bart S; Espelin, Christopher W

    2010-02-01

    DataPflex is a MATLAB-based application that facilitates the manipulation and visualization of multidimensional datasets. The strength of DataPflex lies in the intuitive graphical user interface for the efficient incorporation, manipulation and visualization of high-dimensional data that can be generated by multiplexed protein measurement platforms including, but not limited to Luminex or Meso-Scale Discovery. Such data can generally be represented in the form of multidimensional datasets [for example (time x stimulation x inhibitor x inhibitor concentration x cell type x measurement)]. For cases where measurements are made in a combinational fashion across multiple dimensions, there is a need for a tool to efficiently manipulate and reorganize such data for visualization. DataPflex accepts data consisting of up to five arbitrary dimensions in addition to a measurement dimension. Data are imported from a simple .xls format and can be exported to MATLAB or .xls. Data dimensions can be reordered, subdivided, merged, normalized and visualized in the form of collections of line graphs, bar graphs, surface plots, heatmaps, IC50's and other custom plots. Open source implementation in MATLAB enables easy extension for custom plotting routines and integration with more sophisticated analysis tools. DataPflex is distributed under the GPL license (http://www.gnu.org/licenses/) together with documentation, source code and sample data files at: http://code.google.com/p/datapflex. Supplementary data available at Bioinformatics online.

  3. Common pitfalls in statistical analysis: The perils of multiple testing

    PubMed Central

    Ranganathan, Priya; Pramesh, C. S.; Buyse, Marc

    2016-01-01

    Multiple testing refers to situations where a dataset is subjected to statistical testing multiple times - either at multiple time-points or through multiple subgroups or for multiple end-points. This amplifies the probability of a false-positive finding. In this article, we look at the consequences of multiple testing and explore various methods to deal with this issue. PMID:27141478
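
    Illustration (not part of the cited article): the inflation of the false-positive probability can be made concrete with a minimal sketch. It assumes the tests are independent, which real subgroup or end-point analyses often are not, and it uses the Bonferroni correction as one well-known remedy; the abstract does not say which correction methods the article covers. The function names are hypothetical.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across `num_tests`
    independent tests, each run at significance level `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or
    below `alpha` (Bonferroni correction; an assumed choice of method)."""
    return alpha / num_tests
```

    Under these assumptions, 20 independent tests at alpha = 0.05 give a probability of about 0.64 of at least one false-positive finding, whereas testing each at 0.05 / 20 = 0.0025 keeps that probability just below 0.05.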

  4. Integration of contributed data with HEC-RAS hydrodynamic model for flood inundation and damage assessment: 2015 Dallas Texas Case Study

    NASA Astrophysics Data System (ADS)

    Sava, E.; Thornton, J. C.; Kalyanapu, A. J.; Cervone, G.

    2016-12-01

    Transportation infrastructure networks in urban areas are highly sensitive to natural disasters, yet are critical to the success of rescue, recovery, and renovation operations. Therefore, prompt restoration of such networks is of high importance for disaster relief services. Satellite and aerial images provide data with high spatial and temporal resolution and are a powerful tool for monitoring the environment and mapping the spatio-temporal variability of the Earth's surface. They provide a synoptic overview and give useful environmental information for a wide range of scales, from entire continents to urban areas, with spatial pixel resolutions ranging from kilometers to centimeters. However, sensor limitations are often a serious drawback since no single sensor offers the optimal spectral, spatial, and temporal resolution at the same time. Specific data may not be collected in the time and space most urgently required and/or may contain gaps as a result of the satellite revisit time, atmospheric opacity, or other obstructions. In this study, the feasibility of integrating multiple sources of contributed data, including remotely sensed datasets and open-source geospatial datasets, into hydrodynamic models for flood inundation simulations is assessed. The 2015 Dallas floods, which caused up to $61 million in damage, were selected for this study. A Hydraulic Engineering Center - River Analysis System (HEC-RAS) model was developed for the study area, using reservoir surcharge releases and geometry provided by the U.S. Army Corps of Engineers Fort Worth District. The simulated flood inundation is compared with the "contributed data" for the location (such as Civil Air Patrol data and the WorldView 3 dataset), which indicated that the model did not represent lateral inflows near the upstream section.
    An Artificial Neural Network (ANN) model was developed that uses local precipitation and discharge values in the vicinity to estimate the lateral inflows. The addition of estimated lateral inflows is expected to bring the model performance closer to the observed flows. Future work will focus on extending this preliminary work to assess the model performance after integrating these additional data sources.

  5. Identification of Differentially Expressed Genes through Integrated Study of Alzheimer’s Disease Affected Brain Regions

    PubMed Central

    Berretta, Regina; Moscato, Pablo

    2016-01-01

    Background Alzheimer’s disease (AD) is the most common form of dementia in older adults that damages the brain and results in impaired memory, thinking and behaviour. The identification of differentially expressed genes and related pathways among affected brain regions can provide more information on the mechanisms of AD. In the past decade, several studies have reported many genes that are associated with AD. This wealth of information has become difficult to follow and interpret as most of the results are conflicting. In that case, it is worth doing an integrated study of multiple datasets that helps to increase the total number of samples and the statistical power in detecting biomarkers. In this study, we present an integrated analysis of five different brain region datasets and introduce new genes that warrant further investigation. Methods The aim of our study is to apply a novel combinatorial optimisation based meta-analysis approach to identify differentially expressed genes that are associated with AD across brain regions. In this study, microarray gene expression data from 161 samples (74 non-demented controls, 87 AD) from the Entorhinal Cortex (EC), Hippocampus (HIP), Middle temporal gyrus (MTG), Posterior cingulate cortex (PC), Superior frontal gyrus (SFG) and visual cortex (VCX) brain regions were integrated and analysed using our method. The results are then compared to two popular meta-analysis methods, RankProd and GeneMeta, and to what can be obtained by analysing the individual datasets. Results We find genes related to AD that are consistent with existing studies, and new candidate genes not previously related to AD. Our study confirms the up-regulation of INFAR2 and PTMA along with the down-regulation of GPHN, RAB2A, PSMD14 and FGF. Novel genes PSMB2, WNK1, RPL15, SEMA4C, RWDD2A and LARGE are found to be differentially expressed across all brain regions.
Further investigation on these genes may provide new insights into the development of AD. In addition, we identified the presence of 23 non-coding features, including four miRNA precursors (miR-7, miR570, miR-1229 and miR-6821), dysregulated across the brain regions. Furthermore, we compared our results with two popular meta-analysis methods RankProd and GeneMeta to validate our findings and performed a sensitivity analysis by removing one dataset at a time to assess the robustness of our results. These new findings may provide new insights into the disease mechanisms and thus make a significant contribution in the near future towards understanding, prevention and cure of AD. PMID:27050411

  6. Data Integration Framework Data Management Plan Remote Sensing Dataset

    DTIC Science & Technology

    2016-07-01

    performed by the Coastal Observations and Analysis Branch (CEERD-HFA) of the Flood and Storm Protection Division (CEERD-HF), U.S. Army Engineer Research... Protection Division, Coastal Observations and Analysis Branch CESAM U.S. Army Corps of Engineers, Mobile District CESAM-OP-J U.S. Army Corps of Engineers...ERDC/CHL SR-16-2 Coastal Ocean Data Systems Program Data Integration Framework Data Management Plan Remote Sensing Dataset Co

  7. Secure Multiparty Computation for Cooperative Cyber Risk Assessment

    DTIC Science & Technology

    2016-11-01

    the scope of data available; the more attacks that are represented in the dataset the easier it will be to determine which vulnerabilities are most...assessments by pooling their data, as a dataset that covers the infrastructure of multiple institutions would allow each of them to account for...attacks that others had experienced [4]. Sharing information to produce a broad dataset would greatly improve the ability of each organization involved to

  8. Accounting For Uncertainty in The Application Of High Throughput Datasets

    EPA Science Inventory

    The use of high throughput screening (HTS) datasets will need to adequately account for uncertainties in the data generation process and propagate these uncertainties through to ultimate use. Uncertainty arises at multiple levels in the construction of predictors using in vitro ...

  9. Toxics Release Inventory Chemical Hazard Information Profiles (TRI-CHIP) Dataset

    EPA Pesticide Factsheets

    The Toxics Release Inventory (TRI) Chemical Hazard Information Profiles (TRI-CHIP) dataset contains hazard information about the chemicals reported in TRI. Users can use this XML-format dataset to create their own databases and hazard analyses of TRI chemicals. The hazard information is compiled from a series of authoritative sources including the Integrated Risk Information System (IRIS). The dataset is provided as a downloadable .zip file that when extracted provides XML files and schemas for the hazard information tables.

  10. Full Life Cycle of Data Analysis with Climate Model Diagnostic Analyzer (CMDA)

    NASA Astrophysics Data System (ADS)

    Lee, S.; Zhai, C.; Pan, L.; Tang, B.; Zhang, J.; Bao, Q.; Malarout, N.

    2017-12-01

    We have developed a system that supports the full life cycle of a data analysis process, from data discovery, to data customization, to analysis, to reanalysis, to publication, and to reproduction. The system called Climate Model Diagnostic Analyzer (CMDA) is designed to demonstrate that the full life cycle of data analysis can be supported within one integrated system for climate model diagnostic evaluation with global observational and reanalysis datasets. CMDA has four subsystems that are highly integrated to support the analysis life cycle. Data System manages datasets used by CMDA analysis tools, Analysis System manages CMDA analysis tools which are all web services, Provenance System manages the meta data of CMDA datasets and the provenance of CMDA analysis history, and Recommendation System extracts knowledge from CMDA usage history and recommends datasets/analysis tools to users. These four subsystems are not only highly integrated but also easily expandable. New datasets can be easily added to Data System and scanned to be visible to the other subsystems. New analysis tools can be easily registered to be available in the Analysis System and Provenance System. With CMDA, a user can start a data analysis process by discovering datasets of relevance to their research topic using the Recommendation System. Next, the user can customize the discovered datasets for their scientific use (e.g. anomaly calculation, regridding, etc) with tools in the Analysis System. Next, the user can do their analysis with the tools (e.g. conditional sampling, time averaging, spatial averaging) in the Analysis System. Next, the user can reanalyze the datasets based on the previously stored analysis provenance in the Provenance System. Further, they can publish their analysis process and result to the Provenance System to share with other users. Finally, any user can reproduce the published analysis process and results. 
    By supporting the full life cycle of climate data analysis, CMDA improves the research productivity and collaboration level of its users.

  11. A Semantic Web Management Model for Integrative Biomedical Informatics

    PubMed Central

    Deus, Helena F.; Stanislaus, Romesh; Veiga, Diogo F.; Behrens, Carmen; Wistuba, Ignacio I.; Minna, John D.; Garner, Harold R.; Swisher, Stephen G.; Roth, Jack A.; Correa, Arlene M.; Broom, Bradley; Coombes, Kevin; Chang, Allen; Vogel, Lynn H.; Almeida, Jonas S.

    2008-01-01

    Background Data, data everywhere. The diversity and magnitude of the data generated in the Life Sciences defies automated articulation among complementary efforts. The additional need in this field for managing property and access permissions compounds the difficulty very significantly. This is particularly the case when the integration involves multiple domains and disciplines, even more so when it includes clinical and high throughput molecular data. Methodology/Principal Findings The emergence of Semantic Web technologies brings the promise of meaningful interoperation between data and analysis resources. In this report we identify a core model for biomedical Knowledge Engineering applications and demonstrate how this new technology can be used to weave a management model where multiple intertwined data structures can be hosted and managed by multiple authorities in a distributed management infrastructure. Specifically, the demonstration is performed by linking data sources associated with the Lung Cancer SPORE awarded to The University of Texas MD Anderson Cancer Center at Houston and the Southwestern Medical Center at Dallas. A software prototype, available with open source at www.s3db.org, was developed and its proposed design has been made publicly available as an open source instrument for shared, distributed data management. Conclusions/Significance The Semantic Web technologies have the potential to address the need for distributed and evolvable representations that are critical for systems Biology and translational biomedical research. As this technology is incorporated into application development we can expect that both general purpose productivity software and domain specific software installed on our personal computers will become increasingly integrated with the relevant remote resources. In this scenario, the acquisition of a new dataset should automatically trigger the delegation of its analysis. PMID:18698353

  12. Desiderata for an authoritative Representation of MeSH in RDF.

    PubMed

    Winnenburg, Rainer; Bodenreider, Olivier

    2014-01-01

    The Semantic Web provides a framework for the integration of resources on the web, which facilitates information integration and interoperability. RDF is the main representation format for Linked Open Data (LOD). However, datasets are not always made available in RDF by their producers and the Semantic Web community has had to convert some of these datasets to RDF in order for these datasets to participate in the LOD cloud. As a result, the LOD cloud sometimes contains outdated, partial and even inaccurate RDF datasets. We review the LOD landscape for one of these resources, MeSH, and analyze the characteristics of six existing representations in order to identify desirable features for an authoritative version, for which we create a prototype. We illustrate the suitability of this prototype on three common use cases. NLM intends to release an authoritative representation of MeSH in RDF (beta version) in the Fall of 2014.

  13. Desiderata for an authoritative Representation of MeSH in RDF

    PubMed Central

    Winnenburg, Rainer; Bodenreider, Olivier

    2014-01-01

    The Semantic Web provides a framework for the integration of resources on the web, which facilitates information integration and interoperability. RDF is the main representation format for Linked Open Data (LOD). However, datasets are not always made available in RDF by their producers and the Semantic Web community has had to convert some of these datasets to RDF in order for these datasets to participate in the LOD cloud. As a result, the LOD cloud sometimes contains outdated, partial and even inaccurate RDF datasets. We review the LOD landscape for one of these resources, MeSH, and analyze the characteristics of six existing representations in order to identify desirable features for an authoritative version, for which we create a prototype. We illustrate the suitability of this prototype on three common use cases. NLM intends to release an authoritative representation of MeSH in RDF (beta version) in the Fall of 2014. PMID:25954433

  14. Topographic Science

    USGS Publications Warehouse

    Poppenga, Sandra K.; Evans, Gayla; Gesch, Dean; Stoker, Jason M.; Queija, Vivian R.; Worstell, Bruce; Tyler, Dean J.; Danielson, Jeff; Bliss, Norman; Greenlee, Susan

    2010-01-01

    The mission of U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center Topographic Science is to establish partnerships and conduct research and applications that facilitate the development and use of integrated national and global topographic datasets. Topographic Science includes a wide range of research and applications that result in improved seamless topographic datasets, advanced elevation technology, data integration and terrain visualization, new and improved elevation derivatives, and development of Web-based tools. In cooperation with our partners, Topographic Science is developing integrated-science applications for mapping, national natural resource initiatives, hazards, and global change science. http://topotools.cr.usgs.gov/.

  15. Modeling Rabbit Responses to Single and Multiple Aerosol ...

    EPA Pesticide Factsheets

    Journal Article Survival models are developed here to predict response and time-to-response for mortality in rabbits following exposures to single or multiple aerosol doses of Bacillus anthracis spores. Hazard function models were developed for a multiple dose dataset to predict the probability of death through specifying dose-response functions and the time between exposure and the time-to-death (TTD). Among the models developed, the best-fitting survival model (baseline model) has an exponential dose-response model with a Weibull TTD distribution. Alternative models assessed employ different underlying dose-response functions and use the assumption that, in a multiple dose scenario, earlier doses affect the hazard functions of each subsequent dose. In addition, published mechanistic models are analyzed and compared with models developed in this paper. None of the alternative models that were assessed provided a statistically significant improvement in fit over the baseline model. The general approach utilizes simple empirical data analysis to develop parsimonious models with limited reliance on mechanistic assumptions. The baseline model predicts TTDs consistent with reported results from three independent high-dose rabbit datasets. More accurate survival models depend upon future development of dose-response datasets specifically designed to assess potential multiple dose effects on response and time-to-response. The process used in this paper to dev

  16. A Research Graph dataset for connecting research data repositories using RD-Switchboard.

    PubMed

    Aryani, Amir; Poblet, Marta; Unsworth, Kathryn; Wang, Jingbo; Evans, Ben; Devaraju, Anusuriya; Hausstein, Brigitte; Klas, Claus-Peter; Zapilko, Benjamin; Kaplun, Samuele

    2018-05-29

    This paper describes the open access graph dataset that shows the connections between Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim to discover and connect the related research datasets based on publication co-authorship or jointly funded grants. The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation. Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.

  17. Publishing NASA Metadata as Linked Open Data for Semantic Mashups

    NASA Astrophysics Data System (ADS)

    Wilson, Brian; Manipon, Gerald; Hua, Hook

    2014-05-01

    Data providers are now publishing more metadata in more interoperable forms, e.g. Atom or RSS 'casts', as Linked Open Data (LOD), or as ISO Metadata records. A major effort on the part of NASA's Earth Science Data and Information System (ESDIS) project is the aggregation of metadata that enables greater data interoperability among scientific data sets regardless of source or application. Both the Earth Observing System (EOS) ClearingHOuse (ECHO) and the Global Change Master Directory (GCMD) repositories contain metadata records for NASA (and other) datasets and provided services. These records contain typical fields for each dataset (or software service) such as the source, creation date, cognizant institution, related access URL's, and domain and variable keywords to enable discovery. Under a NASA ACCESS grant, we demonstrated how to publish the ECHO and GCMD dataset and services metadata as LOD in the RDF format. Both sets of metadata are now queryable at SPARQL endpoints and available for integration into "semantic mashups" in the browser. It is straightforward to reformat sets of XML metadata, including ISO, into simple RDF and then later refine and improve the RDF predicates by reusing known namespaces such as Dublin Core, georss, etc. All scientific metadata should be part of the LOD world. In addition, we developed an "instant" drill-down and browse interface that provides faceted navigation so that the user can discover and explore the 25,000 datasets and 3000 services. The available facets and the free-text search box appear in the left panel, and the instantly updated results for the dataset search appear in the right panel. The user can constrain the value of a metadata facet simply by clicking on a word (or phrase) in the "word cloud" of values for each facet.
    The display section for each dataset includes the important metadata fields, a full description of the dataset, potentially some related URLs, and a "search" button that points to an OpenSearch GUI that is pre-configured to search for granules within the dataset. We will present our experiences with converting NASA metadata into LOD, discuss the challenges, illustrate some of the enabled mashups, and demonstrate the latest version of the "instant browse" interface for navigating multiple metadata collections.
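
    The XML-to-RDF reformatting step described in this abstract can be illustrated with a minimal sketch. The record layout, the base URI, and the mapping of fields to Dublin Core terms below are assumptions made for illustration; they are not the actual ECHO or GCMD schemas or the authors' conversion code.

```python
import xml.etree.ElementTree as ET

DCT = "http://purl.org/dc/terms/"  # Dublin Core terms namespace

# Hypothetical metadata record; real ECHO/GCMD records are far richer.
SAMPLE_XML = """
<dataset id="example-001">
  <title>Example sea surface temperature product</title>
  <created>2014-05-01</created>
  <keyword>ocean</keyword>
  <keyword>temperature</keyword>
</dataset>
"""

# Illustrative mapping from XML child tags to RDF predicates (an assumption).
PREDICATES = {
    "title": DCT + "title",
    "created": DCT + "created",
    "keyword": DCT + "subject",
}


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def xml_to_ntriples(xml_text, base_uri="http://example.org/dataset/"):
    """Turn one flat XML record into a list of N-Triples lines."""
    root = ET.fromstring(xml_text)
    subject = f"<{base_uri}{root.attrib['id']}>"
    triples = []
    for child in root:
        predicate = PREDICATES.get(child.tag)
        if predicate is None or child.text is None:
            continue  # skip fields with no mapping or no value
        literal = escape_literal(child.text.strip())
        triples.append(f'{subject} <{predicate}> "{literal}" .')
    return triples


for line in xml_to_ntriples(SAMPLE_XML):
    print(line)
```

    Starting from plain string literals and refining the predicates later mirrors the "simple RDF first, better predicates afterwards" workflow the abstract mentions.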

  18. The NASA Reanalysis Ensemble Service - Advanced Capabilities for Integrated Reanalysis Access and Intercomparison

    NASA Astrophysics Data System (ADS)

    Tamkin, G.; Schnase, J. L.; Duffy, D.; Li, J.; Strong, S.; Thompson, J. H.

    2017-12-01

    NASA's efforts to advance climate analytics-as-a-service are making new capabilities available to the research community: (1) A full-featured Reanalysis Ensemble Service (RES) comprising monthly means data from multiple reanalysis data sets, accessible through an enhanced set of extraction, analytic, arithmetic, and intercomparison operations. The operations are made accessible through NASA's climate data analytics Web services and our client-side Climate Data Services Python library, CDSlib; (2) A cloud-based, high-performance Virtual Real-Time Analytics Testbed supporting a select set of climate variables. This near real-time capability enables advanced technologies like Spark and Hadoop-based MapReduce analytics over native NetCDF files; and (3) A WPS-compliant Web service interface to our climate data analytics service that will enable greater interoperability with next-generation systems such as ESGF. The Reanalysis Ensemble Service includes the following:

    - New API that supports full temporal, spatial, and grid-based resolution services with sample queries
    - A Docker-ready RES application to deploy across platforms
    - Extended capabilities that enable single- and multiple reanalysis area average, vertical average, re-gridding, standard deviation, and ensemble averages
    - Convenient, one-stop shopping for commonly used data products from multiple reanalyses including basic sub-setting and arithmetic operations (e.g., avg, sum, max, min, var, count, anomaly)
    - Full support for the MERRA-2 reanalysis dataset in addition to ECMWF ERA-Interim, NCEP CFSR, JMA JRA-55 and NOAA/ESRL 20CR…
    - A Jupyter notebook-based distribution mechanism designed for client use cases that combines CDSlib documentation with interactive scenarios and personalized project management
    - Supporting analytic services for NASA GMAO Forward Processing datasets
    - Basic uncertainty quantification services that combine heterogeneous ensemble products with comparative observational products (e.g., reanalysis, observational, visualization)
    - The ability to compute and visualize multiple reanalyses for ease of inter-comparisons
    - Automated tools to retrieve and prepare data collections for analytic processing
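
    To make two of the listed operations concrete, the sketch below computes an ensemble average and an anomaly over monthly means. It is plain Python and does not use CDSlib or the RES API, whose calls are not described in the abstract; the reanalysis names and values are invented for illustration.

```python
from statistics import mean

# Hypothetical area-averaged monthly means (kelvin) from three reanalyses.
monthly_means = {
    "reanalysis_a": [280.0, 281.0, 283.0, 284.0],
    "reanalysis_b": [279.0, 282.0, 282.0, 285.0],
    "reanalysis_c": [281.0, 280.0, 284.0, 283.0],
}


def ensemble_average(series_by_name):
    """Average the member series month by month."""
    return [mean(column) for column in zip(*series_by_name.values())]


def anomaly(series):
    """Subtract the series mean from every value."""
    baseline = mean(series)
    return [value - baseline for value in series]


ensemble = ensemble_average(monthly_means)
print(ensemble)
print(anomaly(ensemble))
```

    A real service would apply the same arithmetic per grid cell and would have to re-grid the members to a common resolution first, which this sketch leaves out.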

  19. A novel statistical method for quantitative comparison of multiple ChIP-seq datasets.

    PubMed

    Chen, Li; Wang, Chi; Qin, Zhaohui S; Wu, Hao

    2015-06-15

    ChIP-seq is a powerful technology to measure the protein binding or histone modification strength in the whole genome scale. Although there are a number of methods available for single ChIP-seq data analysis (e.g. 'peak detection'), rigorous statistical method for quantitative comparison of multiple ChIP-seq datasets with the considerations of data from control experiment, signal to noise ratios, biological variations and multiple-factor experimental designs is under-developed. In this work, we develop a statistical method to perform quantitative comparison of multiple ChIP-seq datasets and detect genomic regions showing differential protein binding or histone modification. We first detect peaks from all datasets and then union them to form a single set of candidate regions. The read counts from IP experiment at the candidate regions are assumed to follow Poisson distribution. The underlying Poisson rates are modeled as an experiment-specific function of artifacts and biological signals. We then obtain the estimated biological signals and compare them through the hypothesis testing procedure in a linear model framework. Simulations and real data analyses demonstrate that the proposed method provides more accurate and robust results compared with existing ones. An R software package ChIPComp is freely available at http://web1.sph.emory.edu/users/hwu30/software/ChIPComp.html. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
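
    The step "detect peaks from all datasets and then union them to form a single set of candidate regions" can be sketched as an interval merge. This is an illustrative Python reading of that sentence, not code from the ChIPComp R package; the tuple layout `(chromosome, start, end)` is an assumption.

```python
def union_peaks(peak_sets):
    """Merge overlapping peaks from several datasets into candidate regions.

    Each peak is a (chromosome, start, end) tuple; layout is an assumption.
    """
    intervals = sorted(peak for peaks in peak_sets for peak in peaks)
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous region on the same chromosome: extend it.
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


dataset_a = [("chr1", 100, 200), ("chr1", 500, 600)]
dataset_b = [("chr1", 150, 250), ("chr2", 100, 200)]
print(union_peaks([dataset_a, dataset_b]))
```

    Read counts would then be tallied per merged region in every dataset before the Poisson modelling step the abstract describes.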

  20. Accurate and fast multiple-testing correction in eQTL studies.

    PubMed

    Sul, Jae Hoon; Raj, Towfique; de Jong, Simone; de Bakker, Paul I W; Raychaudhuri, Soumya; Ophoff, Roel A; Stranger, Barbara E; Eskin, Eleazar; Han, Buhm

    2015-06-04

    In studies of expression quantitative trait loci (eQTLs), it is of increasing interest to identify eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Detecting eGenes is important for follow-up analyses and prioritization because genes are the main entities in biological processes. To detect eGenes, one typically focuses on the genetic variant with the minimum p value among all variants in cis with a gene and corrects for multiple testing to obtain a gene-level p value. For performing multiple-testing correction, a permutation test is widely used. Because of growing sample sizes of eQTL studies, however, the permutation test has become a computational bottleneck in eQTL studies. In this paper, we propose an efficient approach for correcting for multiple testing and assess eGene p values by utilizing a multivariate normal distribution. Our approach properly takes into account the linkage-disequilibrium structure among variants, and its time complexity is independent of sample size. By applying our small-sample correction techniques, our method achieves high accuracy in both small and large studies. We have shown that our method consistently produces extremely accurate p values (accuracy > 98%) for three human eQTL datasets with different sample sizes and SNP densities: the Genotype-Tissue Expression pilot dataset, the multi-region brain dataset, and the HapMap 3 dataset. Copyright © 2015 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  1. The application of high-resolution mass spectrometry-based data-mining tools in tandem to metabolite profiling of a triple drug combination in humans.

    PubMed

    Xing, Jie; Zang, Meitong; Zhang, Haiying; Zhu, Mingshe

    2015-10-15

    Patients are usually exposed to multiple drugs, and metabolite profiling of each drug in complex biological matrices is a big challenge. This study presented a new application of an improved high resolution mass spectrometry (HRMS)-based data-mining tools in tandem to fast and comprehensive metabolite identification of combination drugs in human. The model drug combination was metronidazole-pantoprazole-clarithromycin (MET-PAN-CLAR), which is widely used in clinic to treat ulcers caused by Helicobacter pylori. First, mass defect filter (MDF), as a targeted data processing tool, was able to recover all relevant metabolites of MET-PAN-CLAR in human plasma and urine from the full-scan MS dataset when appropriate MDF templates for each drug were defined. Second, the accurate mass-based background subtraction (BS), as an untargeted data-mining tool, worked effectively except for several trace metabolites, which were buried in the remaining background signals. Third, an integrated strategy, i.e., untargeted BS followed by improved MDF, was effective for metabolite identification of MET-PAN-CLAR. Most metabolites except for trace ones were found in the first step of BS-processed datasets, and the results led to the setup of appropriate metabolite MDF template for the subsequent MDF data processing. Trace metabolites were further recovered by MDF, which used both common MDF templates and the novel metabolite-based MDF templates. As a result, a total of 44 metabolites or related components were found for MET-PAN-CLAR in human plasma and urine using the integrated strategy. New metabolic pathways such as N-glucuronidation of PAN and dehydrogenation of CLAR were found. This study demonstrated that the combination of accurate mass-based multiple data-mining techniques in tandem, i.e., untargeted background subtraction followed by targeted mass defect filtering, can be a valuable tool for rapid metabolite profiling of combination drugs in vivo. 
Copyright © 2015 Elsevier B.V. All rights reserved.
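
    As a rough illustration of the mass defect filter idea, the sketch below keeps ions whose mass defect lies close to that of a template. The m/z values, the window widths, and the use of rounding to obtain the nominal mass are assumptions for illustration only; they are not the templates or settings used in the study.

```python
def mass_defect(mz):
    """Difference between the measured m/z and the nearest integer mass.

    Rounding to the nearest integer is a simplification of nominal mass.
    """
    return mz - round(mz)


def mass_defect_filter(ions, template_mz, defect_window=0.050, mass_window=50.0):
    """Keep ions whose mass defect and mass are near the template's."""
    target = mass_defect(template_mz)
    return [
        mz
        for mz in ions
        if abs(mass_defect(mz) - target) <= defect_window
        and abs(mz - template_mz) <= mass_window
    ]


# Hypothetical full-scan ions and a hypothetical parent-drug template.
ions = [316.0950, 298.0850, 310.3500, 450.1100]
print(mass_defect_filter(ions, template_mz=300.1000))
```

    Defining one template per drug, as the abstract describes, amounts to running the filter once per template and pooling the surviving ions.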

  2. Epiviz: a view inside the design of an integrated visual analysis software for genomics

    PubMed Central

    2015-01-01

    Background Computational and visual data analysis for genomics has traditionally involved a combination of tools and resources, of which the most ubiquitous consist of genome browsers, focused mainly on integrative visualization of large numbers of big datasets, and computational environments, focused on data modeling of a small number of moderately sized datasets. Workflows that involve the integration and exploration of multiple heterogeneous data sources, small and large, public and user specific have been poorly addressed by these tools. In our previous work, we introduced Epiviz, which bridges the gap between the two types of tools, simplifying these workflows. Results In this paper we expand on the design decisions behind Epiviz, and introduce a series of new advanced features that further support the type of interactive exploratory workflow we have targeted. We discuss three ways in which Epiviz advances the field of genomic data analysis: 1) it brings code to interactive visualizations at various different levels; 2) takes the first steps in the direction of collaborative data analysis by incorporating user plugins from source control providers, as well as by allowing analysis states to be shared among the scientific community; 3) combines established analysis features that have never before been available simultaneously in a genome browser. In our discussion section, we present security implications of the current design, as well as a series of limitations and future research steps. Conclusions Since many of the design choices of Epiviz are novel in genomics data analysis, this paper serves both as a document of our own approaches with lessons learned, as well as a start point for future efforts in the same direction for the genomics community. PMID:26328750

  3. Prediction of rat protein subcellular localization with pseudo amino acid composition based on multiple sequential features.

    PubMed

    Shi, Ruijia; Xu, Cunshuan

    2011-06-01

    The study of rat proteins is an indispensable task in experimental medicine and drug development. The function of a rat protein is closely related to its subcellular location. Based on this concept, we construct a benchmark rat protein dataset and develop a combined approach for predicting the subcellular localization of rat proteins. From the protein primary sequence, multiple sequential features are obtained using discrete Fourier analysis, a position conservation scoring function and increment of diversity, and these sequential features are used as input parameters of the support vector machine. In the jackknife test, the overall success rate of prediction is 95.6% on the rat protein dataset. When our method is applied to the apoptosis protein dataset and the Gram-negative bacterial protein dataset with the jackknife test, the overall success rates are 89.9% and 96.4%, respectively. These results indicate that our proposed method is quite promising and may play a complementary role to the existing predictors in this area.
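
    The jackknife test mentioned here is a leave-one-out evaluation: each protein is held out in turn, the classifier is trained on the rest, and the held-out protein is predicted. The sketch below shows the procedure in Python with a nearest-centroid classifier standing in for the support vector machine, and with made-up two-dimensional feature vectors; neither is taken from the paper.

```python
def nearest_centroid_predict(train, query):
    """Predict the label whose class centroid is closest to the query.

    A simple stand-in for the SVM used in the paper.
    """
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - q) ** 2 for s, q in zip(sums[label], query))

    return min(sums, key=squared_distance)


def jackknife_accuracy(samples):
    """Hold out each sample once, train on the rest, and score the prediction."""
    hits = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        hits += nearest_centroid_predict(train, features) == label
    return hits / len(samples)


# Hypothetical feature vectors with subcellular location labels.
samples = [
    ([0.0, 0.1], "nucleus"), ([0.1, 0.0], "nucleus"), ([0.2, 0.1], "nucleus"),
    ([1.0, 0.9], "cytoplasm"), ([0.9, 1.0], "cytoplasm"), ([1.1, 1.0], "cytoplasm"),
]
print(jackknife_accuracy(samples))
```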

  4. Joint Blind Source Separation by Multi-set Canonical Correlation Analysis

    PubMed Central

    Li, Yi-Ou; Adalı, Tülay; Wang, Wei; Calhoun, Vince D

    2009-01-01

    In this work, we introduce a simple and effective scheme to achieve joint blind source separation (BSS) of multiple datasets using multi-set canonical correlation analysis (M-CCA) [1]. We first propose a generative model of joint BSS based on the correlation of latent sources within and between datasets. We specify source separability conditions, and show that, when the conditions are satisfied, the group of corresponding sources from each dataset can be jointly extracted by M-CCA through maximization of correlation among the extracted sources. We compare source separation performance of the M-CCA scheme with other joint BSS methods and demonstrate the superior performance of the M-CCA scheme in achieving joint BSS for a large number of datasets, group of corresponding sources with heterogeneous correlation values, and complex-valued sources with circular and non-circular distributions. We apply M-CCA to analysis of functional magnetic resonance imaging (fMRI) data from multiple subjects and show its utility in estimating meaningful brain activations from a visuomotor task. PMID:20221319

  5. SPICE: exploration and analysis of post-cytometric complex multivariate datasets.

    PubMed

    Roederer, Mario; Nozzi, Joshua L; Nason, Martha C

    2011-02-01

    Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.

  6. DASMiner: discovering and integrating data from DAS sources

    PubMed Central

    2009-01-01

    Background DAS is a widely adopted protocol for providing syntactic interoperability among biological databases. The popularity of DAS is due to a simplified and elegant mechanism for data exchange that consists of sources exposing their RESTful interfaces for data access. As a growing number of DAS services are available for molecular biology resources, there is an incentive to explore this protocol in order to advance data discovery and integration among these resources. Results We developed DASMiner, a Matlab toolkit for querying DAS data sources that enables creation of integrated biological models using the information available in DAS-compliant repositories. DASMiner is composed by a browser application and an API that work together to facilitate gathering of data from different DAS sources, which can be used for creating enriched datasets from multiple sources. The browser is used to formulate queries and navigate data contained in DAS sources. Users can execute queries against these sources in an intuitive fashion, without the need of knowing the specific DAS syntax for the particular source. Using the source's metadata provided by the DAS Registry, the browser's layout adapts to expose only the set of commands and coordinate systems supported by the specific source. For this reason, the browser can interrogate any DAS source, independently of the type of data being served. The API component of DASMiner may be used for programmatic access of DAS sources by programs in Matlab. Once the desired data is found during navigation, the query is exported in the format of an API call to be used within any Matlab application. We illustrate the use of DASMiner by creating integrative models of histone modification maps and protein-protein interaction networks. These enriched datasets were built by retrieving and integrating distributed genomic and proteomic DAS sources using the API. 
    Conclusion Support for the DAS protocol allows hundreds of molecular biology databases to be treated as a federated, online collection of resources. DASMiner enables full exploration of these resources, and can be used to deploy applications and create integrated views of biological systems using the information deposited in DAS repositories. PMID:19919683

  7. VideoWeb Dataset for Multi-camera Activities and Non-verbal Communication

    NASA Astrophysics Data System (ADS)

    Denina, Giovanni; Bhanu, Bir; Nguyen, Hoang Thanh; Ding, Chong; Kamal, Ahmed; Ravishankar, Chinya; Roy-Chowdhury, Amit; Ivers, Allen; Varda, Brenda

    Human-activity recognition is one of the most challenging problems in computer vision. Researchers from around the world have tried to solve this problem and have come a long way in recognizing simple motions and atomic activities. As the computer vision community heads toward fully recognizing human activities, a challenging and labeled dataset is needed. To respond to that need, we collected a dataset of realistic scenarios in a multi-camera network environment (VideoWeb) involving multiple persons performing dozens of different repetitive and non-repetitive activities. This chapter describes the details of the dataset. We believe that this VideoWeb Activities dataset is unique and it is one of the most challenging datasets available today. The dataset is publicly available online at http://vwdata.ee.ucr.edu/ along with the data annotation.

  8. Climate Change, Disaster and Sentiment Analysis over Social Media Mining

    NASA Astrophysics Data System (ADS)

    Lee, J.; McCusker, J. P.; McGuinness, D. L.

    2012-12-01

    Accelerated climate change causes disasters and disrupts people living all over the globe. Disruptive climate events are often reflected in expressed sentiments of the people affected. Monitoring changes in these sentiments during and after disasters can reveal relationships between climate change and mental health. We developed a semantic web tool that uses linked data principles and semantic web technologies to integrate data from multiple sources and analyze them together. We are converting statistical data on climate change and disaster records obtained from the World Bank data catalog and the International Disaster Database into a Resource Description Framework (RDF) representation that was annotated with the RDF Data Cube vocabulary. We compare these data with a dataset of tweets that mention terms from the Emotion Ontology to get a sense of how disasters can impact the affected populations. This dataset is being gathered using an infrastructure we developed that extracts term uses in Twitter with controlled vocabularies. This data was also converted to RDF structure so that statistical data on the climate change and disasters is analyzed together with sentiment data. To visualize and explore relationship of the multiple data across the dimensions of time and location, we use the qb.js framework. We are using this approach to investigate the social and emotional impact of climate change. We hope that this will demonstrate the use of social media data as a valuable source of understanding on global climate change.

  9. Automatic multi-label annotation of abdominal CT images using CBIR

    NASA Astrophysics Data System (ADS)

    Xue, Zhiyun; Antani, Sameer; Long, L. Rodney; Thoma, George R.

    2017-03-01

    We present a technique to annotate multiple organs shown in 2-D abdominal/pelvic CT images using CBIR. This annotation task is motivated by our research interests in visual question-answering (VQA). We aim to apply results from this effort in Open-i, a multimodal biomedical search engine developed by the National Library of Medicine (NLM). Understanding visual content of biomedical images is a necessary step for VQA. Though sufficient annotational information about an image may be available in related textual metadata, not all may be useful as descriptive tags, particularly for anatomy on the image. In this paper, we develop and evaluate a multi-label image annotation method using CBIR. We evaluate our method on two 2-D CT image datasets we generated from 3-D volumetric data obtained from a multi-organ segmentation challenge hosted in MICCAI 2015. Shape and spatial layout information is used to encode visual characteristics of the anatomy. We adapt a weighted voting scheme to assign multiple labels to the query image by combining the labels of the images identified as similar by the method. Key parameters that may affect the annotation performance, such as the number of images used in the label voting and the threshold for excluding labels that have low weights, are studied. The method proposes a coarse-to-fine retrieval strategy which integrates the classification with the nearest-neighbor search. Results from our evaluation (using the MICCAI CT image datasets as well as figures from Open-i) are presented.
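
    The label-voting step and its two key parameters (number of retrieved images used, and the weight threshold for dropping labels) can be sketched as follows. Weighting each vote by normalized similarity is an assumption; the abstract does not give the exact weighting, and the similarity scores and organ labels are invented.

```python
from collections import defaultdict


def vote_labels(neighbors, top_k=3, min_weight=0.3):
    """Assign labels to a query image from its most similar retrieved images.

    neighbors: list of (similarity, labels) pairs for retrieved images.
    Votes are weighted by similarity, normalized over the top_k images
    (an assumed weighting scheme).
    """
    ranked = sorted(neighbors, key=lambda n: n[0], reverse=True)[:top_k]
    total = sum(similarity for similarity, _ in ranked)
    scores = defaultdict(float)
    for similarity, labels in ranked:
        for label in labels:
            scores[label] += similarity / total
    # Drop labels whose accumulated weight falls below the threshold.
    return sorted(label for label, weight in scores.items() if weight >= min_weight)


# Hypothetical retrieval results for one query CT slice.
retrieved = [
    (0.9, {"liver", "kidney"}),
    (0.8, {"liver"}),
    (0.3, {"spleen"}),
    (0.1, {"bladder"}),
]
print(vote_labels(retrieved))
```

    Raising `top_k` admits more, less similar images into the vote, while raising `min_weight` trades recall for precision, which is the kind of trade-off the abstract says was studied.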

  10. Cross-scale integration of knowledge for predicting species ranges: a metamodeling framework

    PubMed Central

    Talluto, Matthew V.; Boulangeat, Isabelle; Ameztegui, Aitor; Aubin, Isabelle; Berteaux, Dominique; Butler, Alyssa; Doyon, Frédérik; Drever, C. Ronnie; Fortin, Marie-Josée; Franceschini, Tony; Liénard, Jean; McKenney, Dan; Solarik, Kevin A.; Strigul, Nikolay; Thuiller, Wilfried; Gravel, Dominique

    2016-01-01

    Aim Current interest in forecasting changes to species ranges have resulted in a multitude of approaches to species distribution models (SDMs). However, most approaches include only a small subset of the available information, and many ignore smaller-scale processes such as growth, fecundity, and dispersal. Furthermore, different approaches often produce divergent predictions with no simple method to reconcile them. Here, we present a flexible framework for integrating models at multiple scales using hierarchical Bayesian methods. Location Eastern North America (as an example). Methods Our framework builds a metamodel that is constrained by the results of multiple sub-models and provides probabilistic estimates of species presence. We applied our approach to a simulated dataset to demonstrate the integration of a correlative SDM with a theoretical model. In a second example, we built an integrated model combining the results of a physiological model with presence-absence data for sugar maple (Acer saccharum), an abundant tree native to eastern North America. Results For both examples, the integrated models successfully included information from all data sources and substantially improved the characterization of uncertainty. For the second example, the integrated model outperformed the source models with respect to uncertainty when modelling the present range of the species. When projecting into the future, the model provided a consensus view of two models that differed substantially in their predictions. Uncertainty was reduced where the models agreed and was greater where they diverged, providing a more realistic view of the state of knowledge than either source model. Main conclusions We conclude by discussing the potential applications of our method and its accessibility to applied ecologists. In ideal cases, our framework can be easily implemented using off-the-shelf software. 
The framework has wide potential for use in species distribution modelling and can drive better integration of multi-source and multi-scale data into ecological decision-making. PMID:27499698

  11. Cross-scale integration of knowledge for predicting species ranges: a metamodeling framework.

    PubMed

    Talluto, Matthew V; Boulangeat, Isabelle; Ameztegui, Aitor; Aubin, Isabelle; Berteaux, Dominique; Butler, Alyssa; Doyon, Frédérik; Drever, C Ronnie; Fortin, Marie-Josée; Franceschini, Tony; Liénard, Jean; McKenney, Dan; Solarik, Kevin A; Strigul, Nikolay; Thuiller, Wilfried; Gravel, Dominique

    2016-02-01

    Current interest in forecasting changes to species ranges have resulted in a multitude of approaches to species distribution models (SDMs). However, most approaches include only a small subset of the available information, and many ignore smaller-scale processes such as growth, fecundity, and dispersal. Furthermore, different approaches often produce divergent predictions with no simple method to reconcile them. Here, we present a flexible framework for integrating models at multiple scales using hierarchical Bayesian methods. Eastern North America (as an example). Our framework builds a metamodel that is constrained by the results of multiple sub-models and provides probabilistic estimates of species presence. We applied our approach to a simulated dataset to demonstrate the integration of a correlative SDM with a theoretical model. In a second example, we built an integrated model combining the results of a physiological model with presence-absence data for sugar maple ( Acer saccharum ), an abundant tree native to eastern North America. For both examples, the integrated models successfully included information from all data sources and substantially improved the characterization of uncertainty. For the second example, the integrated model outperformed the source models with respect to uncertainty when modelling the present range of the species. When projecting into the future, the model provided a consensus view of two models that differed substantially in their predictions. Uncertainty was reduced where the models agreed and was greater where they diverged, providing a more realistic view of the state of knowledge than either source model. We conclude by discussing the potential applications of our method and its accessibility to applied ecologists. In ideal cases, our framework can be easily implemented using off-the-shelf software. 
The framework has wide potential for use in species distribution modelling and can drive better integration of multi-source and multi-scale data into ecological decision-making.

  12. Integrative analysis of GWAS, eQTLs and meQTLs data suggests that multiple gene sets are associated with bone mineral density.

    PubMed

    Wang, W; Huang, S; Hou, W; Liu, Y; Fan, Q; He, A; Wen, Y; Hao, J; Guo, X; Zhang, F

    2017-10-01

    Several genome-wide association studies (GWAS) of bone mineral density (BMD) have successfully identified multiple susceptibility genes, yet isolated susceptibility genes are often difficult to interpret biologically. The aim of this study was to unravel the genetic background of BMD at pathway level, by integrating BMD GWAS data with genome-wide expression quantitative trait loci (eQTLs) and methylation quantitative trait loci (meQTLs) data. Method: We employed the GWAS datasets of BMD from the Genetic Factors for Osteoporosis Consortium (GEFOS), analysing patients' BMD. The areas studied included 32 735 femoral necks, 28 498 lumbar spines, and 8143 forearms. Genome-wide eQTLs (containing 923 021 eQTLs) and meQTLs (containing 683 152 unique methylation sites with local meQTLs) data sets were collected from recently published studies. Gene scores were first calculated by summary data-based Mendelian randomisation (SMR) software and meQTL-aligned GWAS results. Gene set enrichment analysis (GSEA) was then applied to identify BMD-associated gene sets with a predefined significance level of 0.05. We identified multiple gene sets associated with BMD in one or more regions, including relevant known biological gene sets such as the Reactome Circadian Clock (GSEA p-value = 1.0 × 10⁻⁴ for LS and 2.7 × 10⁻² for femoral necks BMD in eQTLs-based GSEA) and insulin-like growth factor receptor binding (GSEA p-value = 5.0 × 10⁻⁴ for femoral necks and 2.6 × 10⁻² for lumbar spines BMD in meQTLs-based GSEA). Our results provided novel clues for subsequent functional analysis of bone metabolism, and illustrated the benefit of integrating eQTLs and meQTLs data into pathway association analysis for genetic studies of complex human diseases. Cite this article: W. Wang, S. Huang, W. Hou, Y. Liu, Q. Fan, A. He, Y. Wen, J. Hao, X. Guo, F. Zhang. Integrative analysis of GWAS, eQTLs and meQTLs data suggests that multiple gene sets are associated with bone mineral density. 
Bone Joint Res 2017;6:572-576. © 2017 Wang et al.

  13. geneLAB: Expanding the Impact of NASA's Biological Research in Space

    NASA Technical Reports Server (NTRS)

    Rayl, Nicole; Smith, Jeffrey D.

    2014-01-01

    The geneLAB project is designed to leverage the value of large 'omics' datasets from molecular biology projects conducted on the ISS by making these datasets available, citable, discoverable, interpretable, reusable, and reproducible. geneLAB will create a collaboration space with an integrated set of tools for depositing, accessing, analyzing, and modeling these diverse datasets from spaceflight and related terrestrial studies.

  14. Physiologic Waveform Analysis for Early Detection of Hemorrhage during Transport and Higher Echelon Medical Care of Combat Casualties

    DTIC Science & Technology

    2014-03-01

    waveforms that are easier to measure than ABP (e.g., pulse oximeter waveforms); (3) a NIH SBIR Phase I proposal with Retia Medical to develop automated...the training dataset. Integrating the technique with non-invasive pulse transit time (PTT) was most effective. The integrated technique specifically...the peripheral ABP waveforms in the training dataset. These techniques included the rudimentary mean ABP technique, the classic pulse pressure times

  15. Site Features

    EPA Pesticide Factsheets

    This dataset consists of various site features from multiple Superfund sites in U.S. EPA Region 8. These data were acquired from multiple sources at different times and were combined into one region-wide layer.

  16. Towards Improved Satellite-In Situ Oceanographic Data Interoperability and Associated Value Added Services at the Podaac

    NASA Astrophysics Data System (ADS)

    Tsontos, V. M.; Huang, T.; Holt, B.

    2015-12-01

    The earth science enterprise increasingly relies on the integration and synthesis of multivariate datasets from diverse observational platforms. NASA's ocean salinity missions, that include Aquarius/SAC-D and the SPURS (Salinity Processes in the Upper Ocean Regional Study) field campaign, illustrate the value of integrated observations in support of studies on ocean circulation, the water cycle, and climate. However, the inherent heterogeneity of resulting data and the disparate, distributed systems that serve them complicates their effective utilization for both earth science research and applications. Key technical interoperability challenges include adherence to metadata and data format standards that are particularly acute for in-situ data and the lack of a unified metadata model facilitating archival and integration of both satellite and oceanographic field datasets. Here we report on efforts at the PO.DAAC, NASA's physical oceanographic data center, to extend our data management and distribution support capabilities for field campaign datasets such as those from SPURS. We also discuss value-added services, based on the integration of satellite and in-situ datasets, which are under development with a particular focus on DOMS. The distributed oceanographic matchup service (DOMS) implements a portable technical infrastructure and associated web services that will be broadly accessible via the PO.DAAC for the dynamic collocation of satellite and in-situ data, hosted by distributed data providers, in support of mission cal/val, science and operational applications.
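
    The "dynamic collocation of satellite and in-situ data" can be pictured as pairing observations that fall within a distance and time tolerance of each other. The Python sketch below is a brute-force illustration under assumed tolerances and an assumed record layout; it is not the DOMS implementation or its web service API.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a spherical Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def match_up(satellite, in_situ, max_km=25.0, max_hours=6.0):
    """Pair satellite and in-situ records that are close in space and time.

    Records are dicts with id, lat, lon and t (hours since an arbitrary
    epoch); both the layout and the tolerances are assumptions.
    """
    pairs = []
    for sat in satellite:
        for obs in in_situ:
            close_in_time = abs(sat["t"] - obs["t"]) <= max_hours
            close_in_space = haversine_km(sat["lat"], sat["lon"], obs["lat"], obs["lon"]) <= max_km
            if close_in_time and close_in_space:
                pairs.append((sat["id"], obs["id"]))
    return pairs


satellite = [{"id": "sat1", "lat": 25.0, "lon": -38.0, "t": 0.0}]
in_situ = [
    {"id": "buoy1", "lat": 25.1, "lon": -38.0, "t": 3.0},    # near in space and time
    {"id": "buoy2", "lat": 26.0, "lon": -38.0, "t": 1.0},    # too far away
    {"id": "buoy3", "lat": 25.0, "lon": -38.05, "t": 12.0},  # too late
]
print(match_up(satellite, in_situ))
```

    A production service would replace the nested loop with a spatial and temporal index, since the data are hosted by distributed providers and are far too large for pairwise comparison.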

  17. Tree-based approach for exploring marine spatial patterns with raster datasets.

    PubMed

    Liao, Xiaohan; Xue, Cunjin; Su, Fenzhen

    2017-01-01

    From multiple raster datasets to spatial association patterns, the data-mining technique is divided into three subtasks, i.e., raster dataset pretreatment, mining algorithm design, and spatial pattern exploration from the mining results. Comparison with the former two subtasks reveals that the latter remains unresolved. Confronted with the interrelated marine environmental parameters, we propose a Tree-based Approach for eXploring Marine Spatial Patterns with multiple raster datasets called TAXMarSP, which includes two models. One is the Tree-based Cascading Organization Model (TCOM), and the other is the Spatial Neighborhood-based CAlculation Model (SNCAM). TCOM designs the "Spatial node→Pattern node" from top to bottom layers to store the table-formatted frequent patterns. Together with TCOM, SNCAM considers the spatial neighborhood contributions to calculate the pattern-matching degree between the specified marine parameters and the table-formatted frequent patterns and then explores the marine spatial patterns. Using the prevalent quantification Apriori algorithm and a real remote sensing dataset from January 1998 to December 2014, a successful application of TAXMarSP to marine spatial patterns in the Pacific Ocean is described, and the obtained marine spatial patterns present not only the well-known but also new patterns to Earth scientists.

  18. Transductive multi-view zero-shot learning.

    PubMed

    Fu, Yanwei; Hospedales, Timothy M; Xiang, Tao; Gong, Shaogang

    2015-11-01

    Most existing zero-shot learning approaches exploit transfer learning via an intermediate semantic representation shared between an annotated auxiliary dataset and a target dataset with different classes and no annotation. A projection from a low-level feature space to the semantic representation space is learned from the auxiliary dataset and applied without adaptation to the target dataset. In this paper we identify two inherent limitations with these approaches. First, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. The second limitation is the prototype sparsity problem which refers to the fact that for each target class, only a single prototype is available for zero-shot learning given a semantic representation. To overcome this problem, a novel heterogeneous multi-view hypergraph label propagation method is formulated for zero-shot learning in the transductive embedding space. It effectively exploits the complementary information offered by different semantic representations and takes advantage of the manifold structures of multiple representation spaces in a coherent manner. We demonstrate through extensive experiments that the proposed approach (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) significantly outperforms existing methods for both zero-shot and N-shot recognition on three image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.

  19. ASSESSING THE ACCURACY OF NATIONAL LAND COVER DATASET AREA ESTIMATES AT MULTIPLE SPATIAL EXTENTS

    EPA Science Inventory

    Site specific accuracy assessments provide fine-scale evaluation of the thematic accuracy of land use/land cover (LULC) datasets; however, they provide little insight into LULC accuracy across varying spatial extents. Additionally, LULC data are typically used to describe lands...

  20. iGC-an integrated analysis package of gene expression and copy number alteration.

    PubMed

    Lai, Yi-Pin; Wang, Liang-Bo; Wang, Wei-An; Lai, Liang-Chuan; Tsai, Mong-Hsun; Lu, Tzu-Pin; Chuang, Eric Y

    2017-01-14

    With the advancement in high-throughput technologies, researchers can simultaneously investigate gene expression and copy number alteration (CNA) data from individual patients at a lower cost. Traditional analysis methods analyze each type of data individually and integrate their results using Venn diagrams. Challenges arise, however, when the results are irreproducible and inconsistent across multiple platforms. To address these issues, one possible approach is to concurrently analyze both gene expression profiling and CNAs in the same individual. We have developed an open-source R/Bioconductor package (iGC). Multiple input formats are supported and users can define their own criteria for identifying differentially expressed genes driven by CNAs. The analysis of two real microarray datasets demonstrated that the CNA-driven genes identified by the iGC package showed significantly higher Pearson correlation coefficients with their gene expression levels and copy numbers than those genes located in a genomic region with CNA. Compared with the Venn diagram approach, the iGC package showed better performance. The iGC package is effective and useful for identifying CNA-driven genes. By simultaneously considering both comparative genomic and transcriptomic data, it can provide better understanding of biological and medical questions. The iGC package's source code and manual are freely available at https://www.bioconductor.org/packages/release/bioc/html/iGC.html .
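
    The per-gene correlation between copy number and expression mentioned above can be sketched as follows. iGC itself is an R/Bioconductor package, so this Python fragment is only an illustration and does not reproduce its functions or selection criteria; the function names, the gene labels used when calling it, and the `min_r` threshold are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series; treat as 0 here
    return sxy / math.sqrt(sxx * syy)


def cna_driven_candidates(copy_number, expression, min_r=0.7):
    """Return {gene: r} for genes whose copy number and expression correlate across samples.

    Both inputs map a gene name to a list of per-sample values in the same sample order.
    min_r is a hypothetical user-defined cut-off.
    """
    hits = {}
    for gene, cn in copy_number.items():
        if gene in expression:
            r = pearson(cn, expression[gene])
            if r >= min_r:
                hits[gene] = r
    return hits
```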

  1. Recent Achievements in Characterizing the Histone Code and Approaches to Integrating Epigenomics and Systems Biology.

    PubMed

    Janssen, K A; Sidoli, S; Garcia, B A

    2017-01-01

    Functional epigenetic regulation occurs by dynamic modification of chromatin, including genetic material (i.e., DNA methylation), histone proteins, and other nuclear proteins. Due to the highly complex nature of the histone code, mass spectrometry (MS) has become the leading technique in identification of single and combinatorial histone modifications. MS has now overcome antibody-based strategies due to its automation, high resolution, and accurate quantitation. Moreover, multiple approaches to analysis have been developed for global quantitation of posttranslational modifications (PTMs), including large-scale characterization of modification coexistence (middle-down and top-down proteomics), which is not currently possible with any other biochemical strategy. Recently, our group and others have simplified and increased the effectiveness of analyzing histone PTMs by improving multiple MS methods and data analysis tools. This review provides an overview of the major achievements in the analysis of histone PTMs using MS with a focus on the most recent improvements. We speculate that the workflow for histone analysis at its state of the art is highly reliable in terms of identification and quantitation accuracy, and it has the potential to become a routine method for systems biology thanks to the possibility of integrating histone MS results with genomics and proteomics datasets. © 2017 Elsevier Inc. All rights reserved.

  2. Inference of Low and High-Grade Glioma Gene Regulatory Networks Delineates the Role of Rnd3 in Establishing Multiple Hallmarks of Cancer

    PubMed Central

    Turan, Nil; Soulet, Fabienne; Mohd Zahari, Maihafizah; Ryan, Katie R.; Durant, Sarah; He, Shan; Herbert, John; Ankers, John; Heath, John K.; Bjerkvig, Rolf; Bicknell, Roy; Hotchin, Neil A.; Bikfalvi, Andreas; Falciani, Francesco

    2015-01-01

    Gliomas are a highly heterogeneous group of brain tumours that are refractory to treatment, highly invasive and pro-angiogenic. Glioblastoma patients have an average survival time of less than 15 months. Understanding the molecular basis of different grades of glioma, from well differentiated, low-grade tumours to high-grade tumours, is a key step in defining new therapeutic targets. Here we use a data-driven approach to learn the structure of gene regulatory networks from observational data and use the resulting models to formulate hypotheses on the molecular determinants of glioma stage. Remarkably, integration of available knowledge with functional genomics datasets representing clinical and pre-clinical studies reveals important properties within the regulatory circuits controlling low and high-grade glioma. Our analyses first show that low and high-grade gliomas are characterised by a switch in activity of two subsets of Rho GTPases. The first one is involved in maintaining normal glial cell function, while the second is linked to the establishment of multiple hallmarks of cancer. Next, the development and application of a novel data integration methodology reveals novel functions of RND3 in controlling glioma cell migration, invasion, proliferation, angiogenesis and clinical outcome. PMID:26132659

  3. Network hydraulics inclusion in water quality event detection using multiple sensor stations data.

    PubMed

    Oliker, Nurit; Ostfeld, Avi

    2015-09-01

    Event detection is one of the most challenging current topics in water distribution systems analysis: how regular on-line hydraulic (e.g., pressure, flow) and water quality (e.g., pH, residual chlorine, turbidity) measurements at different network locations can be efficiently utilized to detect water quality contamination events. This study describes an integrated event detection model which combines data from multiple sensor stations with network hydraulics. To date, event detection modelling is likely limited to a single sensor station location and dataset. Single sensor station models are detached from network hydraulics insights and as a result might be significantly exposed to false positive alarms. This work is aimed at decreasing this limitation through integrating local and spatial hydraulic data understanding into an event detection model. The spatial analysis complements the local event detection effort through discovering events with lower signatures by exploring the sensors' mutual hydraulic influences. The unique contribution of this study is in incorporating hydraulic simulation information into the overall event detection process of spatially distributed sensors. The methodology is demonstrated on two example applications using base runs and sensitivity analyses. Results show a clear advantage of the suggested model over single-sensor event detection schemes. Copyright © 2015 Elsevier Ltd. All rights reserved.

  4. Multi-GNSS PPP-RTK: From Large- to Small-Scale Networks

    PubMed Central

    Nadarajah, Nandakumaran; Wang, Kan; Choudhury, Mazher

    2018-01-01

    Precise point positioning (PPP) and its integer ambiguity resolution-enabled variant, PPP-RTK (real-time kinematic), can benefit enormously from the integration of multiple global navigation satellite systems (GNSS). In such a multi-GNSS landscape, the positioning convergence time is expected to be reduced considerably as compared to the one obtained by a single-GNSS setup. It is therefore the goal of the present contribution to provide numerical insights into the role taken by the multi-GNSS integration in delivering fast and high-precision positioning solutions (sub-decimeter and centimeter levels) using PPP-RTK. To that end, we employ the Curtin PPP-RTK platform and process data-sets of GPS, BeiDou Navigation Satellite System (BDS) and Galileo in stand-alone and combined forms. The data-sets are collected by various receiver types, ranging from high-end multi-frequency geodetic receivers to low-cost single-frequency mass-market receivers. The corresponding stations form a large-scale (Australia-wide) network as well as a small-scale network with inter-station distances less than 30 km. In case of the Australia-wide GPS-only ambiguity-float setup, 90% of the horizontal positioning errors (kinematic mode) are shown to become less than five centimeters after 103 min. The stated required time is reduced to 66 min for the corresponding GPS + BDS + Galileo setup. The time is further reduced to 15 min by applying single-receiver ambiguity resolution. The outcomes are supported by the positioning results of the small-scale network. PMID:29614040

  5. Multi-GNSS PPP-RTK: From Large- to Small-Scale Networks.

    PubMed

    Nadarajah, Nandakumaran; Khodabandeh, Amir; Wang, Kan; Choudhury, Mazher; Teunissen, Peter J G

    2018-04-03

    Precise point positioning (PPP) and its integer ambiguity resolution-enabled variant, PPP-RTK (real-time kinematic), can benefit enormously from the integration of multiple global navigation satellite systems (GNSS). In such a multi-GNSS landscape, the positioning convergence time is expected to be reduced considerably as compared to the one obtained by a single-GNSS setup. It is therefore the goal of the present contribution to provide numerical insights into the role taken by the multi-GNSS integration in delivering fast and high-precision positioning solutions (sub-decimeter and centimeter levels) using PPP-RTK. To that end, we employ the Curtin PPP-RTK platform and process data-sets of GPS, BeiDou Navigation Satellite System (BDS) and Galileo in stand-alone and combined forms. The data-sets are collected by various receiver types, ranging from high-end multi-frequency geodetic receivers to low-cost single-frequency mass-market receivers. The corresponding stations form a large-scale (Australia-wide) network as well as a small-scale network with inter-station distances less than 30 km. In case of the Australia-wide GPS-only ambiguity-float setup, 90% of the horizontal positioning errors (kinematic mode) are shown to become less than five centimeters after 103 min. The stated required time is reduced to 66 min for the corresponding GPS + BDS + Galileo setup. The time is further reduced to 15 min by applying single-receiver ambiguity resolution. The outcomes are supported by the positioning results of the small-scale network.

  6. Integrative radiogenomic analysis for multicentric radiophenotype in glioblastoma

    PubMed Central

    Kong, Doo-Sik; Kim, Jinkuk; Lee, In-Hee; Kim, Sung Tae; Seol, Ho Jun; Lee, Jung-Il; Park, Woong-Yang; Ryu, Gyuha; Wang, Zichen; Ma'ayan, Avi; Nam, Do-Hyun

    2016-01-01

    We postulated that multicentric glioblastoma (GBM) represents a more invasive form than solitary GBM and has its own genomic characteristics. From May 2004 to June 2010 we retrospectively identified 51 treatment-naïve GBM patients with available clinical information from the Samsung Medical Center data registry. Multicentricity of the tumor was defined as the presence of multiple foci on the T1 contrast enhancement of MR images or having high signal for multiple lesions without contiguity with each other on the FLAIR image. Kaplan-Meier survival analysis demonstrated that multicentric GBM had worse prognosis than solitary GBM (median, 16.03 vs. 20.57 months, p < 0.05). Copy number variation (CNV) analysis revealed there was an increase in 11 regions, and a decrease in 17 regions, in the multicentric GBM. Gene expression profiling identified 738 genes to be increased and 623 genes to be decreased in the multicentric radiophenotype (p < 0.001). Integration of the CNV and expression datasets identified twelve representative genes: CPM, LANCL2, LAMP1, GAS6, DCUN1D2, CDK4, AGAP2, TSPAN33, PDLIM1, CLDN12, and GTPBP10 having high correlation across CNV, gene expression and patient outcome. Network and enrichment analyses showed that the multicentric tumor had elevated fibrotic signaling pathways compared with a more proliferative and mitogenic signal in the solitary tumors. Noninvasive radiological imaging together with integrative radiogenomic analysis can provide an important tool in helping to advance personalized therapy for the more clinically aggressive subset of GBM. PMID:26863628

  7. Benchmarking homogenization algorithms for monthly data

    NASA Astrophysics Data System (ADS)

    Venema, V. K. C.; Mestre, O.; Aguilar, E.; Auer, I.; Guijarro, J. A.; Domonkos, P.; Vertacnik, G.; Szentimrey, T.; Stepanek, P.; Zahradnicek, P.; Viarre, J.; Müller-Westermeier, G.; Lakatos, M.; Williams, C. N.; Menne, M. J.; Lindau, R.; Rasol, D.; Rustemeier, E.; Kolokythas, K.; Marinova, T.; Andresen, L.; Acquaotta, F.; Fratianni, S.; Cheval, S.; Klancar, M.; Brunetti, M.; Gruber, C.; Prohom Duran, M.; Likso, T.; Esteban, P.; Brandsma, T.

    2012-01-01

    The COST (European Cooperation in Science and Technology) Action ES0601: advances in homogenization methods of climate series: an integrated approach (HOME) has executed a blind intercomparison and validation study for monthly homogenization algorithms. Time series of monthly temperature and precipitation were evaluated because of their importance for climate studies and because they represent two important types of statistics (additive and multiplicative). The algorithms were validated against a realistic benchmark dataset. The benchmark contains real inhomogeneous data as well as simulated data with inserted inhomogeneities. Random independent break-type inhomogeneities with normally distributed breakpoint sizes were added to the simulated datasets. To approximate real world conditions, breaks were introduced that occur simultaneously in multiple station series within a simulated network of station data. The simulated time series also contained outliers, missing data periods and local station trends. Further, a stochastic nonlinear global (network-wide) trend was added. Participants provided 25 separate homogenized contributions as part of the blind study. After the deadline at which details of the imposed inhomogeneities were revealed, 22 additional solutions were submitted. These homogenized datasets were assessed by a number of performance metrics including (i) the centered root mean square error relative to the true homogeneous value at various averaging scales, (ii) the error in linear trend estimates and (iii) traditional contingency skill scores. The metrics were computed both using the individual station series as well as the network average regional series. The performance of the contributions depends significantly on the error metric considered. Contingency scores by themselves are not very informative. Although relative homogenization algorithms typically improve the homogeneity of temperature data, only the best ones improve precipitation data. 
    Training the users on homogenization software was found to be very important. Moreover, state-of-the-art relative homogenization algorithms developed to work with an inhomogeneous reference are shown to perform best. The study showed that automatic algorithms can perform as well as manual ones.

  8. Benchmarking monthly homogenization algorithms

    NASA Astrophysics Data System (ADS)

    Venema, V. K. C.; Mestre, O.; Aguilar, E.; Auer, I.; Guijarro, J. A.; Domonkos, P.; Vertacnik, G.; Szentimrey, T.; Stepanek, P.; Zahradnicek, P.; Viarre, J.; Müller-Westermeier, G.; Lakatos, M.; Williams, C. N.; Menne, M.; Lindau, R.; Rasol, D.; Rustemeier, E.; Kolokythas, K.; Marinova, T.; Andresen, L.; Acquaotta, F.; Fratianni, S.; Cheval, S.; Klancar, M.; Brunetti, M.; Gruber, C.; Prohom Duran, M.; Likso, T.; Esteban, P.; Brandsma, T.

    2011-08-01

    The COST (European Cooperation in Science and Technology) Action ES0601: Advances in homogenization methods of climate series: an integrated approach (HOME) has executed a blind intercomparison and validation study for monthly homogenization algorithms. Time series of monthly temperature and precipitation were evaluated because of their importance for climate studies and because they represent two important types of statistics (additive and multiplicative). The algorithms were validated against a realistic benchmark dataset. The benchmark contains real inhomogeneous data as well as simulated data with inserted inhomogeneities. Random break-type inhomogeneities were added to the simulated datasets modeled as a Poisson process with normally distributed breakpoint sizes. To approximate real world conditions, breaks were introduced that occur simultaneously in multiple station series within a simulated network of station data. The simulated time series also contained outliers, missing data periods and local station trends. Further, a stochastic nonlinear global (network-wide) trend was added. Participants provided 25 separate homogenized contributions as part of the blind study as well as 22 additional solutions submitted after the details of the imposed inhomogeneities were revealed. These homogenized datasets were assessed by a number of performance metrics including (i) the centered root mean square error relative to the true homogeneous value at various averaging scales, (ii) the error in linear trend estimates and (iii) traditional contingency skill scores. The metrics were computed both using the individual station series as well as the network average regional series. The performance of the contributions depends significantly on the error metric considered. Contingency scores by themselves are not very informative. Although relative homogenization algorithms typically improve the homogeneity of temperature data, only the best ones improve precipitation data. 
    Training was found to be very important. Moreover, state-of-the-art relative homogenization algorithms developed to work with an inhomogeneous reference are shown to perform best. The study showed that automatic algorithms can currently perform as well as manual ones.
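
    The way break-type inhomogeneities were inserted into the simulated data (break times from a Poisson process, break sizes normally distributed) can be sketched roughly as below. The rate, the standard deviation, and the function name are assumptions, not values from the study, and the sketch covers only the additive case described for temperature.

```python
import random


def insert_breaks(series, rate_per_step=0.05, size_sd=0.8, seed=0):
    """Add step-like breaks to a copy of `series`.

    Break positions follow a Poisson process (exponential waiting times between
    breaks); each break shifts all later values by a normally distributed amount.
    Returns (perturbed_series, [(index, size), ...]).
    """
    rng = random.Random(seed)
    out = list(series)
    breaks = []
    t = rng.expovariate(rate_per_step)  # waiting time until the first break
    while t < len(series):
        idx = int(t)
        size = rng.gauss(0.0, size_sd)
        breaks.append((idx, size))
        for i in range(idx, len(out)):
            out[i] += size  # additive shift from the break onward
        t += rng.expovariate(rate_per_step)
    return out, breaks
```

    A multiplicative variable such as precipitation would scale the later values by a factor instead of adding a shift.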

  9. Analysis of Multiple Precipitation Products and Preliminary Assessment of Their Impact on Global Land Data Assimilation System (GLDAS) Land Surface States

    NASA Technical Reports Server (NTRS)

    Gottschalck, Jon; Meng, Jesse; Rodel, Matt; Houser, Paul

    2005-01-01

    Land surface models (LSMs) are computer programs, similar to weather and climate prediction models, which simulate the stocks and fluxes of water (including soil moisture, snow, evaporation, and runoff) and energy (including the temperature of and sensible heat released from the soil) after they arrive on the land surface as precipitation and sunlight. It is not currently possible to measure all of the variables of interest everywhere on Earth with sufficient accuracy and space-time resolution. Hence LSMs have been developed to integrate the available observations with our understanding of the physical processes involved, using powerful computers, in order to map these stocks and fluxes as they change in time. The maps are used to improve weather forecasts, support water resources and agricultural applications, and study the Earth's water cycle and climate variability. NASA's Global Land Data Assimilation System (GLDAS) project facilitates testing of several different LSMs with a variety of input datasets (e.g., precipitation, plant type). Precipitation is arguably the most important input to LSMs. Many precipitation datasets have been produced using satellite and rain gauge observations and weather forecast models. In this study, seven different global precipitation datasets were evaluated over the United States, where dense rain gauge networks contribute to reliable precipitation maps. We then used the seven datasets as inputs to GLDAS simulations, so that we could diagnose their impacts on output stocks and fluxes of water. In terms of totals, the Climate Prediction Center (CPC) Merged Analysis of Precipitation (CMAP) had the closest agreement with the US rain gauge dataset for all seasons except winter. The CMAP precipitation was also the most closely correlated in time with the rain gauge data during spring, fall, and winter, while the satellite-based estimates performed best in summer.
    The GLDAS simulations revealed that modeled soil moisture is highly sensitive to precipitation, with differences in spring and summer as large as 45% depending on the choice of precipitation input.

  10. Quantitative comparison of microarray experiments with published leukemia related gene expression signatures.

    PubMed

    Klein, Hans-Ulrich; Ruckert, Christian; Kohlmann, Alexander; Bullinger, Lars; Thiede, Christian; Haferlach, Torsten; Dugas, Martin

    2009-12-15

    Multiple gene expression signatures derived from microarray experiments have been published in the field of leukemia research. A comparison of these signatures with results from new experiments is useful for verification as well as for interpretation of the results obtained. Currently, the percentage of overlapping genes is frequently used to compare published gene signatures against a signature derived from a new experiment. However, it has been shown that the percentage of overlapping genes is of limited use for comparing two experiments due to the variability of gene signatures caused by different array platforms or assay-specific influencing parameters. Here, we present a robust approach for a systematic and quantitative comparison of published gene expression signatures with an exemplary query dataset. A database storing 138 leukemia-related published gene signatures was designed. Each gene signature was manually annotated with terms according to a leukemia-specific taxonomy. Two analysis steps are implemented to compare a new microarray dataset with the results from previous experiments stored and curated in the database. First, the global test method is applied to assess gene signatures and to constitute a ranking among them. In a subsequent analysis step, the focus is shifted from single gene signatures to chromosomal aberrations or molecular mutations as modeled in the taxonomy. Potentially interesting disease characteristics are detected based on the ranking of gene signatures associated with these aberrations stored in the database. Two example analyses are presented. An implementation of the approach is freely available as web-based application. The presented approach helps researchers to systematically integrate the knowledge derived from numerous microarray experiments into the analysis of a new dataset. 
    By means of example leukemia datasets we demonstrate that this approach detects related experiments as well as related molecular mutations and may help to interpret new microarray data.

  11. Model-data integration to improve the LPJmL dynamic global vegetation model

    NASA Astrophysics Data System (ADS)

    Forkel, Matthias; Thonicke, Kirsten; Schaphoff, Sibyll; Thurner, Martin; von Bloh, Werner; Dorigo, Wouter; Carvalhais, Nuno

    2017-04-01

    Dynamic global vegetation models show large uncertainties regarding the development of the land carbon balance under future climate change conditions. This uncertainty is partly caused by differences in how vegetation carbon turnover is represented in global vegetation models. Model-data integration approaches might help to systematically assess and improve model performances and thus to potentially reduce the uncertainty in terrestrial vegetation responses under future climate change. Here we present several applications of model-data integration with the LPJmL (Lund-Potsdam-Jena managed Lands) dynamic global vegetation model to systematically improve the representation of processes or to estimate model parameters. In a first application, we used global satellite-derived datasets of FAPAR (fraction of absorbed photosynthetically active radiation), albedo and gross primary production to estimate phenology- and productivity-related model parameters using a genetic optimization algorithm. Thereby we identified major limitations of the phenology module and implemented an alternative empirical phenology model. The new phenology module and optimized model parameters resulted in a better performance of LPJmL in representing global spatial patterns of biomass, tree cover, and the temporal dynamic of atmospheric CO2. In a second application, we therefore additionally used global datasets of biomass and land cover to estimate model parameters that control vegetation establishment and mortality. The results demonstrate the ability to improve simulations of vegetation dynamics but also highlight the need to improve the representation of mortality processes in dynamic global vegetation models. In a third application, we used multiple site-level observations of ecosystem carbon and water exchange, biomass and soil organic carbon to jointly estimate various model parameters that control ecosystem dynamics.
    This exercise demonstrates the strong role of individual data streams on the simulated ecosystem dynamics which consequently changed the development of ecosystem carbon stocks and fluxes under future climate and CO2 change. In summary, our results demonstrate challenges and the potential of using model-data integration approaches to improve a dynamic global vegetation model.

  12. A novel strategy of integrated microarray analysis identifies CENPA, CDK1 and CDC20 as a cluster of diagnostic biomarkers in lung adenocarcinoma.

    PubMed

    Liu, Wan-Ting; Wang, Yang; Zhang, Jing; Ye, Fei; Huang, Xiao-Hui; Li, Bin; He, Qing-Yu

    2018-07-01

    Lung adenocarcinoma (LAC) is the most lethal cancer and the leading cause of cancer-related death worldwide. The identification of meaningful clusters of co-expressed genes or representative biomarkers may help improve the accuracy of LAC diagnoses. Public databases, such as the Gene Expression Omnibus (GEO), provide rich resources of valuable information for clinics; however, the integration of multiple microarray datasets from various platforms and institutes remains a challenge. To determine potential indicators of LAC, we performed genome-wide relative significance (GWRS), genome-wide global significance (GWGS) and support vector machine (SVM) analyses progressively to identify robust gene biomarker signatures from 5 different microarray datasets that included 330 samples. The top 200 genes with robust signatures were selected for integrative analysis according to "guilt-by-association" methods, including protein-protein interaction (PPI) analysis and gene co-expression analysis. Of these 200 genes, only 10 genes showed both an intensive PPI network and a high gene co-expression correlation (r > 0.8). IPA analysis of these regulatory networks suggested that the cell cycle process is a crucial determinant of LAC. CENPA, as well as two linked hub genes CDK1 and CDC20, are determined to be potential indicators of LAC. Immunohistochemical staining showed that CENPA, CDK1 and CDC20 were highly expressed in LAC cancer tissue with co-expression patterns. A Cox regression model indicated that LAC patients with CENPA+/CDK1+ and CENPA+/CDC20+ were high-risk groups in terms of overall survival. In conclusion, our integrated microarray analysis demonstrated that CENPA, CDK1 and CDC20 might serve as a novel cluster of prognostic biomarkers for LAC, and the cooperative unit of three genes provides a technically simple approach for identification of LAC patients. Copyright © 2018 Elsevier B.V. All rights reserved.

  13. A deep convolutional neural network-based automatic delineation strategy for multiple brain metastases stereotactic radiosurgery.

    PubMed

    Liu, Yan; Stojadinovic, Strahinja; Hrycushko, Brian; Wardak, Zabi; Lau, Steven; Lu, Weiguo; Yan, Yulong; Jiang, Steve B; Zhen, Xin; Timmerman, Robert; Nedzi, Lucien; Gu, Xuejun

    2017-01-01

    Accurate and automatic brain metastases target delineation is a key step for efficient and effective stereotactic radiosurgery (SRS) treatment planning. In this work, we developed a deep learning convolutional neural network (CNN) algorithm for segmenting brain metastases on contrast-enhanced T1-weighted magnetic resonance imaging (MRI) datasets. We integrated the CNN-based algorithm into an automatic brain metastases segmentation workflow and validated it on both Multimodal Brain Tumor Image Segmentation challenge (BRATS) data and clinical patients' data. Validation on BRATS data yielded average DICE coefficients (DCs) of 0.75±0.07 in the tumor core and 0.81±0.04 in the enhancing tumor, which outperformed most techniques in the 2015 BRATS challenge. Segmentation results of patient cases showed an average DC of 0.67±0.03 and achieved an area under the receiver operating characteristic curve of 0.98±0.01. The developed automatic segmentation strategy surpasses current benchmark levels and offers a promising tool for SRS treatment planning for multiple brain metastases.
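
    The DICE coefficient used to score these segmentations is the standard overlap measure 2|A∩B| / (|A| + |B|). The sketch below is a generic illustration, not the authors' evaluation code; representing each mask as a set of voxel coordinates, and scoring two empty masks as 1.0, are assumptions made to keep the example small.

```python
def dice_coefficient(predicted, reference):
    """Dice overlap between two segmentations given as collections of voxel coordinates."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    overlap = len(predicted & reference)
    return 2.0 * overlap / (len(predicted) + len(reference))
```

    A value of 1 means the masks coincide and 0 means they share no voxels.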

  14. A Higher-Order Neural Network Design for Improving Segmentation Performance in Medical Image Series

    NASA Astrophysics Data System (ADS)

    Selvi, Eşref; Selver, M. Alper; Güzeliş, Cüneyt; Dicle, Oǧuz

    2014-03-01

    Segmentation of anatomical structures from medical image series is an ongoing field of research. Although organs of interest are three-dimensional in nature, slice-by-slice approaches are widely used in clinical applications because of their ease of integration with the current manual segmentation scheme. To use slice-by-slice techniques effectively, adjacent slice information, which represents the likelihood of a region being the structure of interest, plays a critical role. Recent studies focus on using the distance transform directly as a feature or to increase the feature values in the vicinity of the search area. This study presents a novel approach by constructing a higher-order neural network, the input layer of which receives features together with their multiplications with the distance transform. This allows higher-order interactions between features through the non-linearity introduced by the multiplication. The application of the proposed method to 9 CT datasets for segmentation of the liver shows higher performance than well-known higher-order classification neural networks.
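
    The input construction described above (features together with their products with the distance transform) can be sketched as follows. This shows only the feature-augmentation step under the assumption of one scalar distance-transform value per pixel. The function name `augment_with_distance` and the numbers are hypothetical, and the network itself is not reproduced.

```python
def augment_with_distance(features, distance):
    """Return the original features followed by each feature times the distance value."""
    products = [f * distance for f in features]
    return list(features) + products


# Hypothetical per-pixel features and distance-transform value.
pixel_features = [0.5, 2.0]
distance_value = 3.0
network_input = augment_with_distance(pixel_features, distance_value)
print(network_input)
```

    The augmented vector has twice the original length, so a network fed with it sees both the raw features and their interaction with spatial proximity.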

  15. MEGA-CC: computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis.

    PubMed

    Kumar, Sudhir; Stecher, Glen; Peterson, Daniel; Tamura, Koichiro

    2012-10-15

    There is a growing need in the research community to apply the molecular evolutionary genetics analysis (MEGA) software tool for batch processing a large number of datasets and to integrate it into analysis workflows. Therefore, we now make available the computing core of the MEGA software as a stand-alone executable (MEGA-CC), along with an analysis prototyper (MEGA-Proto). MEGA-CC provides users with access to all the computational analyses available through MEGA's graphical user interface version. This includes methods for multiple sequence alignment, substitution model selection, evolutionary distance estimation, phylogeny inference, substitution rate and pattern estimation, tests of natural selection and ancestral sequence inference. Additionally, we have upgraded the source code for phylogenetic analysis using the maximum likelihood methods for parallel execution on multiple processors and cores. Here, we describe MEGA-CC and outline the steps for using MEGA-CC in tandem with MEGA-Proto for iterative and automated data analysis. http://www.megasoftware.net/.

  16. A hadoop-based method to predict potential effective drug combination.

    PubMed

    Sun, Yifan; Xiong, Yi; Xu, Qian; Wei, Dongqing

    2014-01-01

    Combination drugs that impact multiple targets simultaneously are promising candidates for combating complex diseases due to their improved efficacy and reduced side effects. However, exhaustive screening of all possible drug combinations is extremely time-consuming and impractical. Here, we present a novel Hadoop-based approach to predict drug combinations by taking advantage of the MapReduce programming model, which improves the scalability of the prediction algorithm. By integrating the gene expression data of multiple drugs, we implemented data preprocessing together with support vector machine and naïve Bayesian classifiers on Hadoop for prediction of drug combinations. The experimental results suggest that our Hadoop-based model achieves much higher efficiency in the big data processing steps with satisfactory performance. We believe that our proposed approach can help accelerate the prediction of potentially effective drug combinations as the number of possible combinations grows exponentially in the future. The source code and datasets are available upon request.

  17. A Hadoop-Based Method to Predict Potential Effective Drug Combination

    PubMed Central

    Xiong, Yi; Xu, Qian; Wei, Dongqing

    2014-01-01

    Combination drugs that impact multiple targets simultaneously are promising candidates for combating complex diseases due to their improved efficacy and reduced side effects. However, exhaustive screening of all possible drug combinations is extremely time-consuming and impractical. Here, we present a novel Hadoop-based approach to predict drug combinations by taking advantage of the MapReduce programming model, which improves the scalability of the prediction algorithm. By integrating the gene expression data of multiple drugs, we implemented data preprocessing together with support vector machine and naïve Bayesian classifiers on Hadoop for prediction of drug combinations. The experimental results suggest that our Hadoop-based model achieves much higher efficiency in the big data processing steps with satisfactory performance. We believe that our proposed approach can help accelerate the prediction of potentially effective drug combinations as the number of possible combinations grows exponentially in the future. The source code and datasets are available upon request. PMID:25147789

  18. Neural CMOS-integrated circuit and its application to data classification.

    PubMed

    Göknar, Izzet Cem; Yildiz, Merih; Minaei, Shahram; Deniz, Engin

    2012-05-01

    Implementation and new applications of a tunable complementary metal-oxide-semiconductor-integrated circuit (CMOS-IC) of a recently proposed classifier core-cell (CC) are presented and tested with two different datasets. Two algorithms, one based on Fisher's linear discriminant analysis and the other on perceptron learning, are used to obtain the CCs' tunable parameters, and the Haberman and Iris datasets are classified. The parameters so obtained are used for hard-classification of datasets with a neural network structured circuit. Classification performance and coefficient calculation times for both algorithms are given. The CC has 6-ns response time and 1.8-mW power consumption. The fabrication parameters used for the IC are taken from CMOS AMS 0.35-μm technology.

  19. Major soybean maturity gene haplotypes revealed by SNPViz analysis of 72 sequenced soybean genomes

    USDA-ARS's Scientific Manuscript database

    In this Genomics Era, vast amounts of next generation sequencing data have become publicly-available for multiple genomes across hundreds of species. Analysis of these large-scale datasets can become cumbersome, especially when comparing nucleotide polymorphisms across many samples within a dataset...

  20. ACCURACY OF THE 1992 NATIONAL LAND COVER DATASET AREA ESTIMATES: AN ANALYSIS AT MULTIPLE SPATIAL EXTENTS

    EPA Science Inventory

    Abstract for poster presentation:

    Site-specific accuracy assessments evaluate fine-scale accuracy of land-use/land-cover (LULC) datasets but provide little insight into accuracy of area estimates of LULC classes derived from sampling units of varying size. Additiona...

  1. Enabling Data Fusion via a Common Data Model and Programming Interface

    NASA Astrophysics Data System (ADS)

    Lindholm, D. M.; Wilson, A.

    2011-12-01

    Much progress has been made in scientific data interoperability, especially in the areas of metadata and discovery. However, while a data user may have improved techniques for finding data, there is often a large chasm to span when it comes to acquiring the desired subsets of various datasets and integrating them into a data processing environment. Some tools such as OPeNDAP servers and the Unidata Common Data Model (CDM) have introduced improved abstractions for accessing data via a common interface, but they alone do not go far enough to enable fusion of data from multidisciplinary sources. Although data from various scientific disciplines may represent semantically similar concepts (e.g. time series), the user may face widely varying structural representations of the data (e.g. row versus column oriented), not to mention radically different storage formats. It is not enough to convert data to a common format. The key to fusing scientific data is to represent each dataset with consistent sampling. This can best be done by using a data model that expresses the functional relationship that each dataset represents. The domain of those functions determines how the data can be combined. The Visualization for Algorithm Development (VisAD) Java API has provided a sophisticated data model for representing the functional nature of scientific datasets for well over a decade. Because VisAD is largely designed for its visualization capabilities, the data model can be cumbersome to use for numerical computation, especially for those not comfortable with Java. Although both VisAD and the implementation of the CDM are written in Java, neither defines a pure Java interface that others could implement and program to, further limiting potential for interoperability. 
In this talk, we will present a solution for data integration based on a simple discipline-agnostic scientific data model and programming interface that enables a dataset to be defined in terms of three variable types: Scalar (a), Tuple (a,b), and Function (a -> b). These basic building blocks can be combined and nested to represent any arbitrarily complex dataset. For example, a time series of surface temperature and pressure could be represented as: time -> ((lon,lat) -> (T,P)). Our data model is expressed in UML and can be implemented in numerous programming languages. We will demonstrate an implementation of our data model and interface using the Scala programming language. Given its functional programming constructs, sophisticated type system, and other language features, Scala enables us to construct complex data structures that can be manipulated using natural mathematical expressions while taking advantage of the language's ability to operate on collections in parallel. This API will be applied to the problem of assimilating various measurements of the solar spectrum and other proxies from multiple sources to construct a composite Lyman-alpha irradiance dataset.
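
    The three variable types described above can be sketched in a few lines. The authors state that they demonstrate their model in Scala; the Python below is only an illustrative stand-in, and the class layout, field names, and string rendering are assumptions made for this sketch rather than the authors' interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scalar:
    """A single named variable, e.g. time or temperature."""
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class Tuple:
    """An ordered group of variables, e.g. (lon, lat)."""
    elements: tuple

    def __str__(self):
        return "(" + ",".join(str(e) for e in self.elements) + ")"


@dataclass(frozen=True)
class Function:
    """A mapping from a domain variable to a codomain variable."""
    domain: object
    codomain: object

    def __str__(self):
        def wrap(v):
            # Parenthesize nested functions so the structure stays unambiguous.
            return f"({v})" if isinstance(v, Function) else str(v)
        return f"{wrap(self.domain)} -> {wrap(self.codomain)}"


# The abstract's example: a time series of surface temperature and pressure.
surface = Function(
    Tuple((Scalar("lon"), Scalar("lat"))),
    Tuple((Scalar("T"), Scalar("P"))),
)
dataset = Function(Scalar("time"), surface)
print(dataset)
```

    Nesting a Function inside another Function's codomain is what lets the same three building blocks describe both simple time series and gridded fields that vary in time.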

  2. Final Report on the Creation of the Wind Integration National Dataset (WIND) Toolkit and API: October 1, 2013 - September 30, 2015

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hodge, Bri-Mathias

    2016-04-08

    The primary objective of this work was to create a state-of-the-art national wind resource data set and to provide detailed wind plant output data for specific sites based on that data set. Corresponding retrospective wind forecasts were also included at all selected locations. The combined information from these activities was used to create the Wind Integration National Dataset (WIND), and an extraction tool was developed to allow web-based data access.

  3. Overview and Meteorological Validation of the Wind Integration National Dataset toolkit

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Draxl, C.; Hodge, B. M.; Clifton, A.

    2015-04-13

    The Wind Integration National Dataset (WIND) Toolkit described in this report fulfills these requirements, and constitutes a state-of-the-art national wind resource data set covering the contiguous United States from 2007 to 2013 for use in a variety of next-generation wind integration analyses and wind power planning. The toolkit is a wind resource data set, wind forecast data set, and wind power production and forecast data set derived from the Weather Research and Forecasting (WRF) numerical weather prediction model. WIND Toolkit data are available online for over 116,000 land-based and 10,000 offshore sites representing existing and potential wind facilities.

  4. IMG/M: integrated genome and metagenome comparative data analysis system

    DOE PAGES

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; ...

    2016-10-13

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.

  5. IMG/M: integrated genome and metagenome comparative data analysis system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.

  6. IMG/M: integrated genome and metagenome comparative data analysis system

    PubMed Central

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; Palaniappan, Krishna; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Andersen, Evan; Huntemann, Marcel; Varghese, Neha; Hadjithomas, Michalis; Tennessen, Kristin; Nielsen, Torben; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2017-01-01

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system. PMID:27738135

  7. Data-Driven Synthesis for Investigating Food Systems Resilience to Climate Change

    NASA Astrophysics Data System (ADS)

    Magliocca, N. R.; Hart, D.; Hondula, K. L.; Munoz, I.; Shelley, M.; Smorul, M.

    2014-12-01

    The production, supply, and distribution of our food involves a complex set of interactions between farmers, rural communities, governments, and global commodity markets that link important issues such as environmental quality, agricultural science and technology, health and nutrition, rural livelihoods, and social institutions and equality - all of which will be affected by climate change. The production of actionable science is thus urgently needed to inform and prepare the public for the consequences of climate change for local and global food systems. Access to data that spans multiple sectors/domains and spatial and temporal scales is key to beginning to tackle such complex issues. As part of the White House's Climate Data Initiative, the USDA and the National Socio-Environmental Synthesis Center (SESYNC) are launching a new collaboration to catalyze data-driven research to enhance food systems resilience to climate change. To support this collaboration, SESYNC is developing a new "Data to Motivate Synthesis" program designed to engage early career scholars in a highly interactive and dynamic process of real-time data discovery, analysis, and visualization to catalyze new research questions and analyses that would not have otherwise been possible and/or apparent. This program will be supported by an integrated, spatially-enabled cyberinfrastructure that enables the management, intersection, and analysis of large heterogeneous datasets relevant to food systems resilience to climate change. Our approach is to create a series of geospatial abstraction data structures and visualization services that can be used to accelerate analysis and visualization across various socio-economic and environmental datasets (e.g., reconcile census data with remote sensing raster datasets). We describe the application of this approach with a pilot workshop of socio-environmental scholars that will lay the groundwork for the larger SESYNC-USDA collaboration. 
We discuss the particular challenges of supporting an integrated, repeatable workflow for socio-environmental data synthesis, and the advantages and limitations to using data as a launching point for interdisciplinary research projects.

  8. The HIAPER Pole-to-Pole Observations (HIPPO) Public Data Archive at CDIAC: Carbon Cycle and Greenhouse Gas Data

    NASA Astrophysics Data System (ADS)

    Christensen, S. W.; Hook, L. A.

    2011-12-01

    The HIAPER Pole-to-Pole Observations (HIPPO) project is investigating the carbon cycle and greenhouse gases throughout various altitudes in the atmosphere over the Pacific Basin through the annual cycle (Wofsy and the HIPPO Science Team 2011, this session). Aircraft-based data collection occurred during 2009-2011. Data analyses, comparisons, and integration are ongoing. A permanent public archive of HIPPO data has been established at the U.S. DOE Carbon Dioxide Information Analysis Center (CDIAC). Datasets are provided primarily by the Lead Principal Investigator (PI), who draws on a comprehensive set of aircraft navigation information, meteorological measurements, and research instrument and sampling system results from multiple co-investigators to compile integrated datasets and generate value-added products. A website/ftp site has been developed for HIPPO data and metadata (http://hippo.ornl.gov), in coordination with the UCAR website that presents field catalogs and other detailed information about HIPPO missions (http://www.eol.ucar.edu/projects/hippo/dm/). A data policy was adopted that balances the needs of the project investigators with the interests of the scientific user community. A data dictionary was developed to capture the basic characteristics of the hundreds of measurements. Instrument descriptions were compiled. A user's guide is presented for each dataset that also contains data file information enabling users to know when data have been updated. Data are received and provided as space-delimited ASCII files. Metadata records are compiled into a searchable CDIAC index and will be submitted to climate change research data clearinghouses. Each dataset is given a persistent identifier (DOI) to facilitate attribution. We expect that data will continue to be added to the archive for the next year or more. In the future we anticipate creating a database for HIPPO data, with a web interface to facilitate searching and customized data extraction.

  9. VIPER: a visualisation tool for exploring inheritance inconsistencies in genotyped pedigrees

    PubMed Central

    2012-01-01

    Background: Pedigree genotype datasets are used for analysing genetic inheritance and to map genetic markers and traits. Such datasets consist of hundreds of related animals genotyped for thousands of genetic markers and invariably contain multiple errors in both the pedigree structure and in the associated individual genotype data. These errors manifest as apparent inheritance inconsistencies in the pedigree, and invalidate analyses of marker inheritance patterns across the dataset. Cleaning raw datasets of bad data points (incorrect pedigree relationships, unreliable marker assays, suspect samples, bad genotype results etc.) requires expert exploration of the patterns of exposed inconsistencies in the context of the inheritance pedigree. In order to assist this process we are developing VIPER (Visual Pedigree Explorer), a software tool that integrates an inheritance-checking algorithm with a novel space-efficient pedigree visualisation, so that reported inheritance inconsistencies are overlaid on an interactive, navigable representation of the pedigree structure. Methods and results: This paper describes an evaluation of how VIPER displays the different scales and types of dataset that occur experimentally, with a description of how VIPER's display interface and functionality meet the challenges presented by such data. We examine a range of possible error types found in real and simulated pedigree genotype datasets, demonstrating how these errors are exposed and explored using the VIPER interface and we evaluate the utility and usability of the interface to the domain expert. Evaluation was performed as a two stage process with the assistance of domain experts (geneticists). The initial evaluation drove the iterative implementation of further features in the software prototype, as required by the users, prior to a final functional evaluation of the pedigree display for exploring the various error types, data scales and structures. 
Conclusions: The VIPER display was shown to effectively expose the range of errors found in experimental genotyped pedigrees, allowing users to explore the underlying causes of reported inheritance inconsistencies. This interface will provide the basis for a full data cleaning tool that will allow the user to remove isolated bad data points, and reversibly test the effect of removing suspect genotypes and pedigree relationships. PMID:22607476

  10. A critical enquiry into the psychometric properties of the professional quality of life scale (ProQol-5) instrument.

    PubMed

    Hemsworth, David; Baregheh, Anahita; Aoun, Samar; Kazanjian, Arminee

    2018-02-01

    This study conducted a comprehensive analysis of the psychometric properties of Proqol 5, the professional quality of life instrument, among nurses and palliative care-workers on the basis of three independent datasets. The goal is to assess the general applicability of this instrument across multiple populations. Although the Proqol scale has been widely adopted, few attempts have been made to thoroughly analyze this instrument across multiple datasets using multiple populations. A questionnaire was developed and distributed to palliative care-workers in Canada and nurses at two hospitals in Australia and Canada, resulting in 273 datasets from the Australian nurses, 303 datasets from the Canadian nurses, and 503 datasets from the Canadian palliative care-workers. A comprehensive psychometric property analysis was conducted including inter-item correlations, tests of reliability, and both convergent and discriminant validity as well as construct validity analyses. In addition, to test for the reverse coding artifacts in the BO scale, exploratory factor analysis was adopted. The psychometric property analysis of Proqol 5 was satisfactory for the compassion satisfaction construct. However, there are concerns with respect to the burnout and secondary trauma stress scales, and recommendations are made regarding the coding and specific items which should improve the reliability and validity of these scales. This research establishes the strengths and weaknesses of the Proqol instrument and demonstrates how it can be improved. Through specific recommendations, the academic community is invited to revise the burnout and secondary traumatic stress scales in an effort to improve Proqol 5 measures. Copyright © 2017. Published by Elsevier Inc.

  11. Ecosystem functioning is enveloped by hydrometeorological variability.

    PubMed

    Pappas, Christoforos; Mahecha, Miguel D; Frank, David C; Babst, Flurin; Koutsoyiannis, Demetris

    2017-09-01

    Terrestrial ecosystem processes, and the associated vegetation carbon dynamics, respond differently to hydrometeorological variability across timescales, and so does our scientific understanding of the underlying mechanisms. Long-term variability of the terrestrial carbon cycle is not yet well constrained and the resulting climate-biosphere feedbacks are highly uncertain. Here we present a comprehensive overview of hydrometeorological and ecosystem variability from hourly to decadal timescales integrating multiple in situ and remote-sensing datasets characterizing extra-tropical forest sites. We find that ecosystem variability at all sites is confined within a hydrometeorological envelope across sites and timescales. Furthermore, ecosystem variability demonstrates long-term persistence, highlighting ecological memory and slow ecosystem recovery rates after disturbances. However, simulation results with state-of-the-art process-based models do not reflect this long-term persistent behaviour in ecosystem functioning. Accordingly, we develop a cross-time-scale stochastic framework that captures hydrometeorological and ecosystem variability. Our analysis offers a perspective for terrestrial ecosystem modelling and paves the way for new model-data integration opportunities in Earth system sciences.

  12. Tier-2 Optimisation for Computational Density/Diversity and Big Data

    NASA Astrophysics Data System (ADS)

    Fay, R. B.; Bland, J.

    2014-06-01

    As the number of cores on chip continues to trend upwards and new CPU architectures emerge, increasing CPU density and diversity presents multiple challenges to site administrators. These include scheduling for massively multi-core systems (potentially including Graphical Processing Units (GPU), both integrated and dedicated, and Many Integrated Core (MIC)) to ensure a balanced throughput of jobs while preserving overall cluster throughput, as well as the increasing complexity of developing for these heterogeneous platforms, and the challenge in managing this more complex mix of resources. In addition, meeting data demands as both dataset sizes increase and as the rate of demand scales with increased computational power requires additional performance from the associated storage elements. In this report, we evaluate one emerging technology, Solid State Drive (SSD) caching for RAID controllers, with consideration to its potential to assist in meeting evolving demand. We also briefly consider the broader developing trends outlined above in order to identify issues that may develop and assess what actions should be taken in the immediate term to address those.

  13. A Benchmark Dataset and Saliency-guided Stacked Autoencoders for Video-based Salient Object Detection.

    PubMed

    Li, Jia; Xia, Changqun; Chen, Xiaowu

    2017-10-12

    Image-based salient object detection (SOD) has been extensively studied in past decades. However, video-based SOD is much less explored due to the lack of large-scale video datasets within which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos. In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in a video can be defined as objects that consistently pop-out throughout the video, and objects with such attributes can be unambiguously annotated by combining manually annotated object/region masks with eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD by using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at the pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner that automatically infers a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. In experiments, the proposed unsupervised approach is compared with 31 state-of-the-art models on the proposed dataset and outperforms 30 of them, including 19 image-based classic (unsupervised or non-deep learning) models, six image-based deep learning models, and five video-based unsupervised models. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.

  14. Multi-site genetic analysis of diffusion images and voxelwise heritability analysis: A pilot project of the ENIGMA–DTI working group

    PubMed Central

    Jahanshad, Neda; Kochunov, Peter; Sprooten, Emma; Mandl, René C.; Nichols, Thomas E.; Almassy, Laura; Blangero, John; Brouwer, Rachel M.; Curran, Joanne E.; de Zubicaray, Greig I.; Duggirala, Ravi; Fox, Peter T.; Hong, L. Elliot; Landman, Bennett A.; Martin, Nicholas G.; McMahon, Katie L.; Medland, Sarah E.; Mitchell, Braxton D.; Olvera, Rene L.; Peterson, Charles P.; Starr, John M.; Sussmann, Jessika E.; Toga, Arthur W.; Wardlaw, Joanna M.; Wright, Margaret J.; Hulshoff Pol, Hilleke E.; Bastin, Mark E.; McIntosh, Andrew M.; Deary, Ian J.; Thompson, Paul M.; Glahn, David C.

    2013-01-01

    The ENIGMA (Enhancing NeuroImaging Genetics through Meta-Analysis) Consortium was set up to analyze brain measures and genotypes from multiple sites across the world to improve the power to detect genetic variants that influence the brain. Diffusion tensor imaging (DTI) yields quantitative measures sensitive to brain development and degeneration, and some common genetic variants may be associated with white matter integrity or connectivity. DTI measures, such as the fractional anisotropy (FA) of water diffusion, may be useful for identifying genetic variants that influence brain microstructure. However, genome-wide association studies (GWAS) require large populations to obtain sufficient power to detect and replicate significant effects, motivating a multi-site consortium effort. As part of an ENIGMA–DTI working group, we analyzed high-resolution FA images from multiple imaging sites across North America, Australia, and Europe, to address the challenge of harmonizing imaging data collected at multiple sites. Four hundred images of healthy adults aged 18–85 from four sites were used to create a template and corresponding skeletonized FA image as a common reference space. Using twin and pedigree samples of different ethnicities, we used our common template to evaluate the heritability of tract-derived FA measures. We show that our template is reliable for integrating multiple datasets by combining results through meta-analysis and unifying the data through exploratory mega-analyses. Our results may help prioritize regions of the FA map that are consistently influenced by additive genetic factors for future genetic discovery studies. Protocols and templates are publicly available at (http://enigma.loni.ucla.edu/ongoing/dti-working-group/). PMID:23629049

  15. GenomeGraphs: integrated genomic data visualization with R.

    PubMed

    Durinck, Steffen; Bullard, James; Spellman, Paul T; Dudoit, Sandrine

    2009-01-06

    Biological studies involve a growing number of distinct high-throughput experiments to characterize samples of interest. There is a lack of methods to visualize these different genomic datasets in a versatile manner. In addition, genomic data analysis requires integrated visualization of experimental data along with constantly changing genomic annotation and statistical analyses. We developed GenomeGraphs, as an add-on software package for the statistical programming environment R, to facilitate integrated visualization of genomic datasets. GenomeGraphs uses the biomaRt package to perform on-line annotation queries to Ensembl and translates these to gene/transcript structures in viewports of the grid graphics package. This allows genomic annotation to be plotted together with experimental data. GenomeGraphs can also be used to plot custom annotation tracks in combination with different experimental data types together in one plot using the same genomic coordinate system. GenomeGraphs is a flexible and extensible software package which can be used to visualize a multitude of genomic datasets within the statistical programming environment R.

  16. Integrated web system of geospatial data services for climate research

    NASA Astrophysics Data System (ADS)

    Okladnikov, Igor; Gordov, Evgeny; Titov, Alexander

    2016-04-01

    Georeferenced datasets are now actively used for modeling, interpreting and forecasting climatic and ecosystem changes on different spatial and temporal scales. Because environmental datasets are inherently heterogeneous and very large (up to tens of terabytes for a single dataset), dedicated software is required to support studies of climate and environmental change. An approach for integrated analysis of georeferenced climatological datasets, based on a combination of web and GIS technologies within the spatial data infrastructure paradigm, is presented. Following this approach, a dedicated data-processing web system for integrated analysis of heterogeneous georeferenced climatological and meteorological data is being developed. It is based on Open Geospatial Consortium (OGC) standards and involves many modern solutions such as an object-oriented programming model, modular composition, and JavaScript libraries based on the GeoExt library, the ExtJS framework and OpenLayers software. This work is supported by the Ministry of Education and Science of the Russian Federation, Agreement #14.613.21.0037.

  17. ToxMiner Software Interface for Visualizing and Analyzing ToxCast Data

    EPA Science Inventory

    The ToxCast dataset represents a collection of assays and endpoints that will require both standard statistical approaches and customized data analysis workflows. To analyze this unique dataset, we have developed an integrated database with a Java-based interface called ToxMi...

  18. Drilling informatics: data-driven challenges of scientific drilling

    NASA Astrophysics Data System (ADS)

    Yamada, Yasuhiro; Kyaw, Moe; Saito, Sanny

    2017-04-01

    The primary aim of scientific drilling is to precisely understand the dynamic nature of the Earth. This is why we investigate the subsurface materials (rock and fluid, including the microbial community) that exist under particular environmental conditions. This requires sample collection, analytical data production from the samples, and in-situ data measurement at boreholes. Currently available data come from cores, cuttings, mud logging, geophysical logging, and exploration geophysics, but these datasets are difficult to integrate because they differ in kind and scale. We are now producing more useful datasets to fill the gap between the existing data, extracting more information from such datasets, and finally integrating the information. In particular, drilling parameters are very useful datasets for geomechanical properties. We believe such an approach, 'drilling informatics', would be the most appropriate way to obtain a comprehensive and dynamic picture of our scientific targets, such as the seismogenic fault zone and the Moho discontinuity surface. This presentation introduces our initiative and the current achievements of drilling informatics.

  19. The U.S. Geological Survey Climate Geo Data Portal: an integrated broker for climate and geospatial data

    USGS Publications Warehouse

    Blodgett, David L.

    2013-01-01

    The increasing availability of downscaled climate projections and other data products that summarize or predict climate conditions, is making climate data use more common in research and management. Scientists and decisionmakers often need to construct ensembles and compare climate hindcasts and future projections for particular spatial areas. These tasks generally require an investigator to procure all datasets of interest en masse, integrate the various data formats and representations into commonly accessible and comparable formats, and then extract the subsets of the datasets that are actually of interest. This process can be challenging and time intensive due to data-transfer, -storage, and(or) -processing limits, or unfamiliarity with methods of accessing climate data. Data management for modeling and assessing the impacts of future climate conditions is also becoming increasingly expensive due to the size of the datasets. The Climate Geo Data Portal (http://cida.usgs.gov/climate/gdp/) addresses these limitations, making access to numerous climate datasets for particular areas of interest a simple and efficient task.

  20. SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction.

    PubMed

    Hagopian, Raffi; Davidson, John R; Datta, Ruchira S; Samad, Bushra; Jarvis, Glen R; Sjölander, Kimmen

    2010-07-01

    We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on a benchmark dataset of 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS webserver is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/.

  1. MRI-based intelligence quotient (IQ) estimation with sparse learning.

    PubMed

    Wang, Liye; Wee, Chong-Yaw; Suk, Heung-Il; Tang, Xiaoying; Shen, Dinggang

    2015-01-01

    In this paper, we propose a novel framework for IQ estimation using Magnetic Resonance Imaging (MRI) data. In particular, we devise a new feature selection method based on an extended dirty model that jointly considers both element-wise sparsity and group-wise sparsity. Because no large dataset with consistent scanning protocols is available for IQ estimation, we integrate multiple datasets scanned at different sites with different scanning parameters and protocols. As a result, there is large variability across these datasets. To address this issue, we design a two-step procedure that 1) first identifies the possible scanning site for each testing subject and 2) then estimates the testing subject's IQ using a specific estimator designed for that scanning site. We perform two experiments to test the performance of our method using MRI data collected from 164 typically developing children between 6 and 15 years old. In the first experiment, we use a multi-kernel Support Vector Regression (SVR) for estimating IQ values, and obtain an average correlation coefficient of 0.718 and an average root mean square error of 8.695 between the true IQs and the estimated ones. In the second experiment, we use a single-kernel SVR for IQ estimation, and achieve an average correlation coefficient of 0.684 and an average root mean square error of 9.166. All these results show the effectiveness of using imaging data for IQ prediction, which, to our knowledge, is rarely done in the field.
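
    The two evaluation metrics quoted in the abstract above, the correlation coefficient and the root mean square error between true and estimated IQs, can be sketched as follows. This is a minimal illustration that assumes the Pearson correlation is meant (the abstract does not say which coefficient was used); the function names and the sample scores are hypothetical and unrelated to the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rmse(true_values, estimates):
    """Root mean square error between true and estimated values."""
    if len(true_values) != len(estimates) or not true_values:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(t - e) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical true and estimated IQ scores.
    true_iq = [95, 102, 110, 120]
    est_iq = [98, 100, 113, 115]
    print(pearson_r(true_iq, est_iq), rmse(true_iq, est_iq))
```

    The two numbers answer different questions: the correlation measures how well the estimates track the ordering and spread of the true scores, while the RMSE measures the typical size of the error in IQ points.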

  2. Field Research Facility Data Integration Framework Data Management Plan: Survey Lines Dataset

    DTIC Science & Technology

    2016-08-01

    CHL and its District partners. The beach morphology surveys on which this report focuses provide quantitative measures of the dynamic nature of...topography • volume change 1.4 Data description The morphology surveys are conducted over a series of 26 shore- perpendicular profile lines spaced 50...dataset input data and products. Table 1. FRF survey lines dataset input data and products. Input Data FDIF Product Description ASCII LARC survey text

  3. National Hydropower Plant Dataset, Version 1 (Update FY18Q2)

    DOE Data Explorer

    Samu, Nicole; Kao, Shih-Chieh; O'Connor, Patrick; Johnson, Megan; Uria-Martinez, Rocio; McManamay, Ryan

    2016-09-30

    The National Hydropower Plant Dataset, Version 1, Update FY18Q2, includes geospatial point-level locations and key characteristics of existing hydropower plants in the United States that are currently online. These data are a subset extracted from NHAAP’s Existing Hydropower Assets (EHA) dataset, which is a cornerstone of NHAAP’s EHA effort that has supported multiple U.S. hydropower R&D research initiatives related to market acceleration, environmental impact reduction, technology-to-market activities, and climate change impact assessment.

  4. Integrative Data Analysis of Multi-Platform Cancer Data with a Multimodal Deep Learning Approach.

    PubMed

    Liang, Muxuan; Li, Zhizhong; Chen, Ting; Zeng, Jianyang

    2015-01-01

    Identification of cancer subtypes plays an important role in revealing useful insights into disease pathogenesis and advancing personalized therapy. The recent development of high-throughput sequencing technologies has enabled the rapid collection of multi-platform genomic data (e.g., gene expression, miRNA expression, and DNA methylation) for the same set of tumor samples. Although numerous integrative clustering approaches have been developed to analyze cancer data, few of them are particularly designed to exploit both deep intrinsic statistical properties of each input modality and complex cross-modality correlations among multi-platform input data. In this paper, we propose a new machine learning model, called multimodal deep belief network (DBN), to cluster cancer patients from multi-platform observation data. In our integrative clustering framework, relationships among inherent features of each single modality are first encoded into multiple layers of hidden variables, and then a joint latent model is employed to fuse common features derived from multiple input modalities. A practical learning algorithm, called contrastive divergence (CD), is applied to infer the parameters of our multimodal DBN model in an unsupervised manner. Tests on two available cancer datasets show that our integrative data analysis approach can effectively extract a unified representation of latent features to capture both intra- and cross-modality correlations, and identify meaningful disease subtypes from multi-platform cancer data. In addition, our approach can identify key genes and miRNAs that may play distinct roles in the pathogenesis of different cancer subtypes. Among those key miRNAs, we found that the expression level of miR-29a is highly correlated with survival time in ovarian cancer patients. These results indicate that our multimodal DBN based data analysis approach may have practical applications in cancer pathogenesis studies and provide useful guidelines for personalized cancer therapy.

  5. cGRNB: a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets.

    PubMed

    Xu, Huayong; Yu, Hui; Tu, Kang; Shi, Qianqian; Wei, Chaochun; Li, Yuan-Yuan; Li, Yi-Xue

    2013-01-01

    We are witnessing rapid progress in the development of methodologies for building the combinatorial gene regulatory networks involving both TFs (Transcription Factors) and miRNAs (microRNAs). There are a few tools available to do these jobs but most of them are not easy to use and not accessible online. A web server is especially needed in order to allow users to upload experimental expression datasets and build combinatorial regulatory networks corresponding to their particular contexts. In this work, we compiled putative TF-gene, miRNA-gene and TF-miRNA regulatory relationships from forward-engineering pipelines and curated them as built-in data libraries. We streamlined the R code of our two separate forward-and-reverse engineering algorithms for combinatorial gene regulatory network construction and formalized them as two major functional modules. As a result, we released the cGRNB (combinatorial Gene Regulatory Networks Builder): a web server for constructing combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. The cGRNB provides two major network-building modules, one for MPGE (miRNA-perturbed gene expression) datasets and the other for parallel miRNA/mRNA expression datasets. A miRNA-centered two-layer combinatorial regulatory cascade is the output of the first module and a comprehensive genome-wide network involving all three types of combinatorial regulations (TF-gene, TF-miRNA, and miRNA-gene) is the output of the second module. In this article we propose cGRNB, a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. Since parallel miRNA/mRNA expression datasets are rapidly accumulating with the advance of next-generation sequencing techniques, cGRNB will be a very useful tool for researchers to build combinatorial gene regulatory networks based on expression datasets. The cGRNB web server is free and available online at http://www.scbit.org/cgrnb.

  6. KAnalyze: a fast versatile pipelined K-mer toolkit

    PubMed Central

    Audano, Peter; Vannberg, Fredrik

    2014-01-01

    Motivation: Converting nucleotide sequences into short overlapping fragments of uniform length, k-mers, is a common step in many bioinformatics applications. While existing software packages count k-mers, few are optimized for speed, offer an application programming interface (API), a graphical interface or contain features that make it extensible and maintainable. We designed KAnalyze to compete with the fastest k-mer counters, to produce reliable output and to support future development efforts through well-architected, documented and testable code. Currently, KAnalyze can output k-mer counts in a sorted tab-delimited file or stream k-mers as they are read. KAnalyze can process large datasets with 2 GB of memory. This project is implemented in Java 7, and the command line interface (CLI) is designed to integrate into pipelines written in any language. Results: As a k-mer counter, KAnalyze outperforms Jellyfish, DSK and a pipeline built on Perl and Linux utilities. Through extensive unit and system testing, we have verified that KAnalyze produces the correct k-mer counts over multiple datasets and k-mer sizes. Availability and implementation: KAnalyze is available on SourceForge: https://sourceforge.net/projects/kanalyze/ Contact: fredrik.vannberg@biology.gatech.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24642064
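
    The basic operation described in the abstract above, splitting a nucleotide sequence into overlapping fragments of uniform length k and counting them, can be sketched in a few lines. This is an illustrative toy only, not KAnalyze's Java implementation or its API; the function names and the choice to skip windows containing ambiguous bases are assumptions made for the example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count overlapping k-mers in a nucleotide sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    counts = Counter()
    # Slide a window of width k one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with non-ACGT symbols (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def format_sorted_tsv(counts):
    """Render counts as sorted, tab-delimited 'kmer<TAB>count' lines."""
    return "\n".join(f"{kmer}\t{n}" for kmer, n in sorted(counts.items()))


if __name__ == "__main__":
    print(format_sorted_tsv(count_kmers("ACGTACGTNA", 3)))
```

    A sequence of length n yields n - k + 1 windows, so memory use is driven by the number of distinct k-mers rather than by the input length; this in-memory counter would not scale to the large datasets the abstract mentions.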

  7. KAnalyze: a fast versatile pipelined k-mer toolkit.

    PubMed

    Audano, Peter; Vannberg, Fredrik

    2014-07-15

    Converting nucleotide sequences into short overlapping fragments of uniform length, k-mers, is a common step in many bioinformatics applications. While existing software packages count k-mers, few are optimized for speed, offer an application programming interface (API), a graphical interface or contain features that make it extensible and maintainable. We designed KAnalyze to compete with the fastest k-mer counters, to produce reliable output and to support future development efforts through well-architected, documented and testable code. Currently, KAnalyze can output k-mer counts in a sorted tab-delimited file or stream k-mers as they are read. KAnalyze can process large datasets with 2 GB of memory. This project is implemented in Java 7, and the command line interface (CLI) is designed to integrate into pipelines written in any language. As a k-mer counter, KAnalyze outperforms Jellyfish, DSK and a pipeline built on Perl and Linux utilities. Through extensive unit and system testing, we have verified that KAnalyze produces the correct k-mer counts over multiple datasets and k-mer sizes. KAnalyze is available on SourceForge: https://sourceforge.net/projects/kanalyze/. © The Author 2014. Published by Oxford University Press.

  8. cDREM: inferring dynamic combinatorial gene regulation.

    PubMed

    Wise, Aaron; Bar-Joseph, Ziv

    2015-04-01

    Genes are often combinatorially regulated by multiple transcription factors (TFs). Such combinatorial regulation plays an important role in development and facilitates the ability of cells to respond to different stresses. While a number of approaches have utilized sequence and ChIP-based datasets to study combinational regulation, these have often ignored the combinational logic and the dynamics associated with such regulation. Here we present cDREM, a new method for reconstructing dynamic models of combinatorial regulation. cDREM integrates time series gene expression data with (static) protein interaction data. The method is based on a hidden Markov model and utilizes the sparse group Lasso to identify small subsets of combinatorially active TFs, their time of activation, and the logical function they implement. We tested cDREM on yeast and human data sets. Using yeast we show that the predicted combinatorial sets agree with other high throughput genomic datasets and improve upon prior methods developed to infer combinatorial regulation. Applying cDREM to study human response to flu, we were able to identify several combinatorial TF sets, some of which were known to regulate immune response while others represent novel combinations of important TFs.

  9. Wildlife tracking data management: a new vision.

    PubMed

    Urbano, Ferdinando; Cagnacci, Francesca; Calenge, Clément; Dettki, Holger; Cameron, Alison; Neteler, Markus

    2010-07-27

    To date, the processing of wildlife location data has relied on a diversity of software and file formats. Data management and the following spatial and statistical analyses were undertaken in multiple steps, involving many time-consuming importing/exporting phases. Recent technological advancements in tracking systems have made large, continuous, high-frequency datasets of wildlife behavioural data available, such as those derived from the global positioning system (GPS) and other animal-attached sensor devices. These data can be further complemented by a wide range of other information about the animals' environment. Management of these large and diverse datasets for modelling animal behaviour and ecology can prove challenging, slowing down analysis and increasing the probability of mistakes in data handling. We address these issues by critically evaluating the requirements for good management of GPS data for wildlife biology. We highlight that dedicated data management tools and expertise are needed. We explore current research in wildlife data management. We suggest a general direction of development, based on a modular software architecture with a spatial database at its core, where interoperability, data model design and integration with remote-sensing data sources play an important role in successful GPS data handling.

  10. Wildlife tracking data management: a new vision

    PubMed Central

    Urbano, Ferdinando; Cagnacci, Francesca; Calenge, Clément; Dettki, Holger; Cameron, Alison; Neteler, Markus

    2010-01-01

    To date, the processing of wildlife location data has relied on a diversity of software and file formats. Data management and the following spatial and statistical analyses were undertaken in multiple steps, involving many time-consuming importing/exporting phases. Recent technological advancements in tracking systems have made large, continuous, high-frequency datasets of wildlife behavioural data available, such as those derived from the global positioning system (GPS) and other animal-attached sensor devices. These data can be further complemented by a wide range of other information about the animals' environment. Management of these large and diverse datasets for modelling animal behaviour and ecology can prove challenging, slowing down analysis and increasing the probability of mistakes in data handling. We address these issues by critically evaluating the requirements for good management of GPS data for wildlife biology. We highlight that dedicated data management tools and expertise are needed. We explore current research in wildlife data management. We suggest a general direction of development, based on a modular software architecture with a spatial database at its core, where interoperability, data model design and integration with remote-sensing data sources play an important role in successful GPS data handling. PMID:20566495

  11. AMUC: Associated Motion capture User Categories.

    PubMed

    Norman, Sally Jane; Lawson, Sian E M; Olivier, Patrick; Watson, Paul; Chan, Anita M-A; Dade-Robertson, Martyn; Dunphy, Paul; Green, Dave; Hiden, Hugo; Hook, Jonathan; Jackson, Daniel G

    2009-07-13

    The AMUC (Associated Motion capture User Categories) project consisted of building a prototype sketch retrieval client for exploring motion capture archives. High-dimensional datasets reflect the dynamic process of motion capture and comprise high-rate sampled data of a performer's joint angles; in response to multiple query criteria, these data can potentially yield different kinds of information. The AMUC prototype harnesses graphic input via an electronic tablet as a query mechanism, time and position signals obtained from the sketch being mapped to the properties of data streams stored in the motion capture repository. As well as proposing a pragmatic solution for exploring motion capture datasets, the project demonstrates the conceptual value of iterative prototyping in innovative interdisciplinary design. The AMUC team was composed of live performance practitioners and theorists conversant with a variety of movement techniques, bioengineers who recorded and processed motion data for integration into the retrieval tool, and computer scientists who designed and implemented the retrieval system and server architecture, scoped for Grid-based applications. Creative input on information system design and navigation, and digital image processing, underpinned implementation of the prototype, which has undergone preliminary trials with diverse users, allowing identification of rich potential development areas.

  12. Target of obstructive sleep apnea syndrome merge lung cancer: based on big data platform.

    PubMed

    Li, Lifeng; Lu, Jingli; Xue, Wenhua; Wang, Liping; Zhai, Yunkai; Fan, Zhirui; Wu, Ge; Fan, Feifei; Li, Jieyao; Zhang, Chaoqi; Zhang, Yi; Zhao, Jie

    2017-03-28

    Based on our hospital database, the incidence of lung cancer diagnoses was similar in patients with obstructive sleep apnea syndrome (OSAS) and in the general hospital population; among individuals with a diagnosis of lung cancer, the presence of OSAS was associated with an increased risk of mortality. At the level of gene expression and network information, we revealed significant alterations of molecules related to HIF1 and metabolic pathways in hypoxia-conditioned lung cancer cells. We also observed that GBE1 and HK2, which are downstream of the HIF1 pathway, are important in hypoxia-conditioned lung cancer cells. Furthermore, we used publicly available datasets to validate that late-stage lung adenocarcinoma patients showed higher expression of HK2 and GBE1 than early-stage ones. In terms of prognostic features, a survival analysis revealed that the high GBE1 and HK2 expression group exhibited poorer survival among lung adenocarcinoma patients. By analyzing and integrating multiple datasets, we identify molecular convergence between hypoxia and lung cancer that reflects their clinical profiles and reveals molecular pathways involved in hypoxia-induced lung cancer progression. In conclusion, we show that OSAS severity appears to increase the risk of lung cancer mortality.

  13. A Survey of Computational Intelligence Techniques in Protein Function Prediction

    PubMed Central

    Tiwari, Arvind Kumar; Srivastava, Rajeev

    2014-01-01

    There has been massive growth in knowledge of unknown proteins with the advancement of high-throughput microarray technologies. Protein function prediction is the most challenging problem in bioinformatics. In the past, homology-based approaches were used to predict protein function, but they failed when a new protein was different from the previous ones. Therefore, to alleviate the problems associated with traditional homology-based approaches, numerous computational intelligence techniques have been proposed in the recent past. This paper presents a comprehensive state-of-the-art review of various computational intelligence techniques for protein function prediction using sequence, structure, protein-protein interaction network, and gene expression data, used in wide areas of application such as prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear/G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. This paper also summarizes the results obtained by many researchers who have addressed these problems using computational intelligence techniques with appropriate datasets to improve prediction performance. The summary shows that ensemble classifiers and integration of multiple heterogeneous data are useful for protein function prediction. PMID:25574395

  14. Integrative Genomics Viewer (IGV) | Informatics Technology for Cancer Research (ITCR)

    Cancer.gov

    The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

  15. Inferring Ice Thickness from a Glacier Dynamics Model and Multiple Surface Datasets.

    NASA Astrophysics Data System (ADS)

    Guan, Y.; Haran, M.; Pollard, D.

    2017-12-01

    The future behavior of the West Antarctic Ice Sheet (WAIS) may have a major impact on future climate. For instance, ice sheet melt may contribute significantly to global sea level rise. Understanding the current state of WAIS is therefore of great interest. WAIS is drained by fast-flowing glaciers which are major contributors to ice loss. Hence, understanding the stability and dynamics of glaciers is critical for predicting the future of the ice sheet. Glacier dynamics are driven by the interplay between the topography, temperature and basal conditions beneath the ice. A glacier dynamics model describes the interactions between these processes. We develop a hierarchical Bayesian model that integrates multiple ice sheet surface data sets with a glacier dynamics model. Our approach allows us to (1) infer important parameters describing the glacier dynamics, (2) learn about ice sheet thickness, and (3) account for errors in the observations and the model. Because we have relatively dense and accurate ice thickness data from the Thwaites Glacier in West Antarctica, we use these data to validate the proposed approach. The long-term goal of this work is to have a general model that may be used to study multiple glaciers in the Antarctic.

  16. Fitting Meta-Analytic Structural Equation Models with Complex Datasets

    ERIC Educational Resources Information Center

    Wilson, Sandra Jo; Polanin, Joshua R.; Lipsey, Mark W.

    2016-01-01

    A modification of the first stage of the standard procedure for two-stage meta-analytic structural equation modeling for use with large complex datasets is presented. This modification addresses two common problems that arise in such meta-analyses: (a) primary studies that provide multiple measures of the same construct and (b) the correlation…

  17. Continuous Toxicological Dose-Response Relationships Are Pretty Homogeneous (Society for Risk Analysis Annual Meeting)

    EPA Science Inventory

    Dose-response relationships for a wide range of in vivo and in vitro continuous datasets are well-described by a four-parameter exponential or Hill model, based on a recent analysis of multiple historical dose-response datasets, mostly with more than five dose groups (Slob and Se...

  18. Similarity of markers identified from cancer gene expression studies: observations from GEO.

    PubMed

    Shi, Xingjie; Shen, Shihao; Liu, Jin; Huang, Jian; Zhou, Yong; Ma, Shuangge

    2014-09-01

    Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  19. Multiple local feature representations and their fusion based on an SVR model for iris recognition using optimized Gabor filters

    NASA Astrophysics Data System (ADS)

    He, Fei; Liu, Yuanning; Zhu, Xiaodong; Huang, Chun; Han, Ye; Dong, Hongxing

    2014-12-01

    Gabor descriptors have been widely used in iris texture representations. However, fixed basic Gabor functions cannot match the changing nature of diverse iris datasets. Furthermore, a single form of iris feature cannot overcome difficulties in iris recognition, such as illumination variations, environmental conditions, and device variations. This paper provides multiple local feature representations and their fusion scheme based on a support vector regression (SVR) model for iris recognition using optimized Gabor filters. In our iris system, a particle swarm optimization (PSO)- and a Boolean particle swarm optimization (BPSO)-based algorithm is proposed to provide suitable Gabor filters for each involved test dataset without predefinition or manual modulation. Several comparative experiments on JLUBR-IRIS, CASIA-I, and CASIA-V4-Interval iris datasets are conducted, and the results show that our work can generate improved local Gabor features by using optimized Gabor filters for each dataset. In addition, our SVR fusion strategy may make full use of their discriminative ability to improve accuracy and reliability. Other comparative experiments show that our approach may outperform other popular iris systems.

  20. Description of the U.S. Geological Survey Geo Data Portal data integration framework

    USGS Publications Warehouse

    Blodgett, David L.; Booth, Nathaniel L.; Kunicki, Thomas C.; Walker, Jordan I.; Lucido, Jessica M.

    2012-01-01

    The U.S. Geological Survey has developed an open-standard data integration framework for working efficiently and effectively with large collections of climate and other geoscience data. A web interface accesses catalog datasets to find data services. Data resources can then be rendered for mapping and dataset metadata are derived directly from these web services. Algorithm configuration and information needed to retrieve data for processing are passed to a server where all large-volume data access and manipulation takes place. The data integration strategy described here was implemented by leveraging existing free and open source software. Details of the software used are omitted; rather, emphasis is placed on how open-standard web services and data encodings can be used in an architecture that integrates common geographic and atmospheric data.

  1. Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies.

    PubMed

    Jiang, Wei; Yu, Weichuan

    2017-02-15

    In genome-wide association studies (GWASs) of common diseases/traits, we often analyze multiple GWASs with the same phenotype together to discover associated genetic variants with higher power. Since it is difficult to access data with detailed individual measurements, summary-statistics-based meta-analysis methods have become popular to jointly analyze datasets from multiple GWASs. In this paper, we propose a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). We prove that our method is the most powerful summary-statistics-based joint analysis method when controlling the false discovery rate at a certain level. In particular, the Jlfdr-based method achieves higher power than commonly used meta-analysis methods when analyzing heterogeneous datasets from multiple GWASs. Simulation experiments demonstrate the superior power of our method over meta-analysis methods. Also, our method discovers more associations than meta-analysis methods from empirical datasets of four phenotypes. The R-package is available at http://bioinformatics.ust.hk/Jlfdr.html. Contact: eeyu@ust.hk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  2. Integration of Digital Dental Casts in Cone-Beam Computed Tomography Scans

    PubMed Central

    Rangel, Frits A.; Maal, Thomas J. J.; Bergé, Stefaan J.; Kuijpers-Jagtman, Anne Marie

    2012-01-01

    Cone-beam computed tomography (CBCT) is widely used in maxillofacial surgery. The CBCT image of the dental arches, however, is of insufficient quality to use in digital planning of orthognathic surgery. Several authors have described methods to integrate digital dental casts into CBCT scans, but all reported methods have drawbacks. The aim of this feasibility study is to present a new simplified method to integrate digital dental casts into CBCT scans. In a patient scheduled for orthognathic surgery, titanium markers were glued to the gingiva. Next, a CBCT scan and dental impressions were made. During the impression-taking procedure, the titanium markers were transferred to the impression. The impressions were scanned, and all CBCT datasets were exported in DICOM format. The two datasets were matched, and the dentition derived from the scanned impressions was transferred to the CBCT of the patient. After matching the two datasets, the average distance between the corresponding markers was 0.1 mm. This novel method allows for the integration of digital dental casts into CBCT scans, overcoming problems such as unwanted extra radiation exposure, distortion of soft tissues due to the use of bite jigs, and time-consuming digital data handling. PMID:23050159

  3. Evaluation of data analytic approaches to generating cross-domain mappings of controlled science vocabularies

    NASA Astrophysics Data System (ADS)

    Zednik, S.

    2015-12-01

    Recent data publication practices have made increasing amounts of diverse datasets available online for the general research community to explore and integrate. Even with the abundance of data online, relevant data discovery and successful integration are still highly dependent upon the data being published with well-formed and understandable metadata. Tagging a dataset with well-known or controlled community terms is a common mechanism to indicate the intended purpose, subject matter, or other relevant facts of a dataset; however, controlled domain terminology can be difficult for cross-domain researchers to interpret and leverage. It is also a challenge for integration portals to successfully provide cross-domain search capabilities over data holdings described using many different controlled vocabularies. Mappings between controlled vocabularies can be challenging because communities frequently develop specialized terminologies and have highly specific and contextual usages of common words. Despite this specificity, it is highly desirable to produce cross-domain mappings to support data integration. In this contribution we evaluate the applicability of several data analytic techniques for the purpose of generating mappings between hierarchies of controlled science terms. We hope our efforts initiate more discussion on the topic and encourage future mapping efforts.

  4. An XMM-Newton Science Archive for next decade, and its integration into ESASky

    NASA Astrophysics Data System (ADS)

    Loiseau, N.; Baines, D.; Rodriguez, P.; Salgado, J.; Sarmiento, M.; Colomo, E.; Merin, B.; Giordano, F.; Racero, E.; Migliari, S.

    2016-06-01

    We will present a roadmap for improvements to the XMM-Newton Science Archive (XSA) over the next decade, planned to give ever faster and more user-friendly access to all XMM-Newton data. This plan includes the integration of the Upper Limit server, an interactive visualization of EPIC and RGS spectra, and on-the-fly data analysis, among other advanced features. Within this philosophy XSA is also being integrated into ESASky, the science-driven discovery portal for all the ESA Astronomy Missions. A first public beta of the ESASky service was released at the end of 2015. It currently features an interface for exploration of the multi-wavelength sky and for single and/or multiple target searches of science-ready data. The system offers progressive multi-resolution all-sky projections of full mission datasets using a new generation of HEALPix projections called HiPS, developed at the CDS; detailed geometrical footprints to connect the all-sky mosaics to individual observations; and direct access to science-ready data at the underlying mission-specific science archives. New XMM-Newton EPIC and OM all-sky HiPS maps, catalogues and links to the observations are available through ESASky, together with INTEGRAL, HST, Herschel, Planck and other future data.

  5. Analysis of Genome-Wide Association Studies with Multiple Outcomes Using Penalization

    PubMed Central

    Liu, Jin; Huang, Jian; Ma, Shuangge

    2012-01-01

    Genome-wide association studies have been extensively conducted, searching for markers for biologically meaningful outcomes and phenotypes. Penalization methods have been adopted in the analysis of the joint effects of a large number of SNPs (single nucleotide polymorphisms) and marker identification. This study is partly motivated by the analysis of a heterogeneous stock mice dataset, in which multiple correlated phenotypes and a large number of SNPs are available. Existing penalization methods designed to analyze a single response variable cannot accommodate the correlation among multiple response variables. With multiple response variables sharing the same set of markers, joint modeling is first employed to accommodate the correlation. The group Lasso approach is adopted to select markers associated with all the outcome variables. An efficient computational algorithm is developed. A simulation study and analysis of the heterogeneous stock mice dataset show that the proposed method can outperform existing penalization methods. PMID:23272092
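
    The group Lasso selection described in this abstract can be illustrated with its block-wise shrinkage step. The sketch below is not the authors' algorithm; it only shows the standard group soft-thresholding operator, assuming each group holds one marker's coefficients across all outcome variables. The function name and the penalty value `lam` are hypothetical.

```python
import math


def group_soft_threshold(block, lam):
    """Shrink one coefficient group toward zero as a unit.

    `block` holds one marker's coefficients across all outcomes
    (an assumption made for illustration); `lam` is the penalty level.
    """
    norm = math.sqrt(sum(b * b for b in block))
    if norm <= lam:
        # The whole group is removed, so the marker is dropped for every outcome.
        return [0.0 for _ in block]
    scale = 1.0 - lam / norm
    return [scale * b for b in block]


# A strong marker is shrunk but kept; a weak one is zeroed for all outcomes.
kept = group_soft_threshold([3.0, 4.0], lam=1.0)
dropped = group_soft_threshold([0.3, 0.4], lam=1.0)
```

    Because the threshold acts on the norm of the whole group, a marker is either retained for all outcomes or discarded for all of them, which matches the stated aim of selecting markers associated with all the outcome variables.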

  6. Data Integration for Heterogenous Datasets

    PubMed Central

    2014-01-01

    Abstract More and more, the needs of data analysts are requiring the use of data outside the control of their own organizations. The increasing amount of data available on the Web, the new technologies for linking data across datasets, and the increasing need to integrate structured and unstructured data are all driving this trend. In this article, we provide a technical overview of the emerging “broad data” area, in which the variety of heterogeneous data being used, rather than the scale of the data being analyzed, is the limiting factor in data analysis efforts. The article explores some of the emerging themes in data discovery, data integration, linked data, and the combination of structured and unstructured data. PMID:25553272

  7. Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach.

    PubMed

    Chowdhury, Nilotpal; Sapru, Shantanu

    2015-01-01

    Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate - adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research.

  8. Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach

    PubMed Central

    Chowdhury, Nilotpal; Sapru, Shantanu

    2015-01-01

    Introduction: Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. Aim: The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Methods: Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Results: Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. Conclusion: To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research. PMID:26080057

  9. TH-E-BRF-05: Comparison of Survival-Time Prediction Models After Radiotherapy for High-Grade Glioma Patients Based On Clinical and DVH Features

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Magome, T; Haga, A; Igaki, H

    Purpose: Although many outcome prediction models based on dose-volume information have been proposed, it is well known that the prognosis may also be affected by multiple clinical factors. The purpose of this study is to predict the survival time after radiotherapy for high-grade glioma patients based on features including clinical and dose-volume histogram (DVH) information. Methods: A total of 35 patients with high-grade glioma (oligodendroglioma: 2, anaplastic astrocytoma: 3, glioblastoma: 30) were selected in this study. All patients were treated with a prescribed dose of 30–80 Gy after surgical resection or biopsy from 2006 to 2013 at The University of Tokyo Hospital. All cases were randomly separated into a training dataset (30 cases) and a test dataset (5 cases). The survival time after radiotherapy was predicted based on a multiple linear regression analysis and an artificial neural network (ANN) by using 204 candidate features. The candidate features included 12 clinical features (tumor location, extent of surgical resection, treatment duration of radiotherapy, etc.) and 192 DVH features (maximum dose, minimum dose, D95, V60, etc.). The effective features for the prediction were selected according to a step-wise method by using the 30 training cases. The prediction accuracy was evaluated by a coefficient of determination (R²) between the predicted and actual survival time for the training and test datasets. Results: In the multiple regression analysis, the value of R² between the predicted and actual survival time was 0.460 for the training dataset and 0.375 for the test dataset. On the other hand, in the ANN analysis, the value of R² was 0.806 for the training dataset and 0.811 for the test dataset. Conclusion: Although a large number of patients would be needed for more accurate and robust prediction, our preliminary result showed the potential to predict the outcome in patients with high-grade glioma. This work was partly supported by the JSPS Core-to-Core Program (No. 23003) and Grant-in-aid from the JSPS Fellows.
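
    The accuracy measure reported in this abstract, a coefficient of determination between predicted and actual survival time, can be sketched as follows. The abstract does not say which definition was used; this sketch assumes the common form 1 - SS_res / SS_tot, and the function name and the toy survival times are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination, assuming the 1 - SS_res / SS_tot definition."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy survival times in months (made-up values, not data from the study).
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

    Under this definition a model that always predicts the mean scores 0 and a perfect model scores 1. With only five test cases, as in the study, a single poorly predicted patient can change the value substantially.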

  10. Utilizing novel diversity estimators to quantify multiple dimensions of microbial biodiversity across domains

    PubMed Central

    2013-01-01

    Background: Microbial ecologists often employ methods from classical community ecology to analyze microbial community diversity. However, these methods have limitations because microbial communities differ from macro-organismal communities in key ways. This study sought to quantify microbial diversity using methods that are better suited for data spanning multiple domains of life and dimensions of diversity. Diversity profiles are one novel, promising way to analyze microbial datasets. Diversity profiles encompass many other indices, provide effective numbers of diversity (mathematical generalizations of previous indices that better convey the magnitude of differences in diversity), and can incorporate taxa similarity information. To explore whether these profiles change interpretations of microbial datasets, diversity profiles were calculated for four microbial datasets from different environments spanning all domains of life as well as viruses. Both similarity-based profiles that incorporated phylogenetic relatedness and naïve (not similarity-based) profiles were calculated. Simulated datasets were used to examine the robustness of diversity profiles to varying phylogenetic topology and community composition. Results: Diversity profiles provided insights into microbial datasets that were not detectable with classical univariate diversity metrics. For all datasets analyzed, there were key distinctions between calculations that incorporated phylogenetic diversity as a measure of taxa similarity and naïve calculations. The profiles also provided information about the effects of rare species on diversity calculations. Additionally, diversity profiles were used to examine thousands of simulated microbial communities, showing that similarity-based and naïve diversity profiles only agreed approximately 50% of the time in their classification of which sample was most diverse. This is a strong argument for incorporating similarity information and calculating diversity with a range of emphases on rare and abundant species when quantifying microbial community diversity. Conclusions: For many datasets, diversity profiles provided a different view of microbial community diversity compared to analyses that did not take into account taxa similarity information, effective diversity, or multiple diversity metrics. These findings are a valuable contribution to data analysis methodology in microbial ecology. PMID:24238386
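
    A naïve (not similarity-based) diversity profile of the kind described in this abstract can be sketched with effective numbers of species. The sketch assumes the standard Hill-number form, in which the order q sets how much weight rare species receive; the study's exact formulation and its similarity-based variant are not reproduced. Function names and abundance counts are hypothetical.

```python
import math


def hill_number(abundances, q):
    """Effective number of species of order q (naive, no taxa similarity)."""
    total = sum(abundances)
    p = [a / total for a in abundances if a > 0]
    if abs(q - 1.0) < 1e-12:
        # The q -> 1 limit is the exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


def diversity_profile(abundances, orders=(0, 1, 2)):
    """Diversity evaluated over a range of emphases on rare vs. abundant species."""
    return [hill_number(abundances, q) for q in orders]


even = diversity_profile([5, 5, 5, 5])       # four equally common taxa
uneven = diversity_profile([97, 1, 1, 1])    # one dominant taxon, three rare
```

    For the even community every order gives an effective number of 4. For the uneven one the profile falls from 4 at q = 0 toward 1 as q grows, which shows how rare species drive low-order diversity and why a single index can rank samples differently from a full profile.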

  11. Secondary Analysis and Integration of Existing Data to Elucidate the Genetic Architecture of Cancer Risk and Related Outcomes, R21 | Informatics Technology for Cancer Research (ITCR)

    Cancer.gov

    This funding opportunity announcement (FOA) encourages applications that propose to conduct secondary data analysis and integration of existing datasets and database resources, with the ultimate aim to elucidate the genetic architecture of cancer risk and related outcomes. The goal of this initiative is to address key scientific questions relevant to cancer epidemiology by supporting the analysis of existing genetic or genomic datasets, possibly in combination with environmental, outcomes, behavioral, lifestyle, and molecular profiles data.

  12. Secondary Analysis and Integration of Existing Data to Elucidate the Genetic Architecture of Cancer Risk and Related Outcomes, R01 | Informatics Technology for Cancer Research (ITCR)

    Cancer.gov

    This funding opportunity announcement (FOA) encourages applications that propose to conduct secondary data analysis and integration of existing datasets and database resources, with the ultimate aim to elucidate the genetic architecture of cancer risk and related outcomes. The goal of this initiative is to address key scientific questions relevant to cancer epidemiology by supporting the analysis of existing genetic or genomic datasets, possibly in combination with environmental, outcomes, behavioral, lifestyle, and molecular profiles data.

  13. Workshop on New Views of the Moon 2: Understanding the Moon Through the Integration of Diverse Datasets

    NASA Technical Reports Server (NTRS)

    1999-01-01

    This volume contains abstracts that have been accepted for presentation at the Workshop on New Views of the Moon II: Understanding the Moon Through the Integration of Diverse Datasets, September 22-24, 1999, in Flagstaff, Arizona. The workshop conveners are Lisa Gaddis (U.S. Geological Survey, Flagstaff) and Charles K. Shearer (University of New Mexico). Color versions of some of the images contained in this volume are available on the meeting Web site (http://cass.jsc.nasa.gov/meetings/moon99/pdf/program.pdf).

  14. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.

    PubMed

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-07-02

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces the time and effort required of the experts and knowledge engineer by 94.1% when creating unified datasets.

  15. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare

    PubMed Central

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-01-01

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a “data modeler” tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces the time and effort required of the experts and knowledge engineer by 94.1% when creating unified datasets. PMID:26147731

  16. A new integrated and homogenized global monthly land surface air temperature dataset for the period since 1900

    NASA Astrophysics Data System (ADS)

    Xu, Wenhui; Li, Qingxiang; Jones, Phil; Wang, Xiaolan L.; Trewin, Blair; Yang, Su; Zhu, Chen; Zhai, Panmao; Wang, Jinfeng; Vincent, Lucie; Dai, Aiguo; Gao, Yun; Ding, Yihui

    2018-04-01

    A new dataset of integrated and homogenized monthly surface air temperature over global land for the period since 1900 [China Meteorological Administration global Land Surface Air Temperature (CMA-LSAT)] is developed. In total, 14 sources have been collected and integrated into the newly developed dataset, including three global (CRUTEM4, GHCN, and BEST), three regional and eight national sources. Duplicate stations are identified, and those with the higher priority are chosen or spliced. Then, a consistency test and a climate outlier test are conducted to ensure that each station series is quality controlled. Next, two steps are adopted to assure the homogeneity of the station series: (1) homogenized station series in existing national datasets (by National Meteorological Services) are directly integrated into the dataset without any changes (50% of all stations), and (2) the inhomogeneities are detected and adjusted for in the remaining data series using a penalized maximal t test (50% of all stations). Based on the dataset, we re-assess the temperature changes in global and regional areas compared with GHCN-V3 and CRUTEM4, as well as the temperature changes during the three periods of 1900-2014, 1979-2014 and 1998-2014. The best estimates of warming trends and their 95% confidence ranges for 1900-2014 are approximately 0.102 ± 0.006 °C/decade for the whole year, and 0.104 ± 0.009, 0.112 ± 0.007, 0.090 ± 0.006, and 0.092 ± 0.007 °C/decade for the DJF (December, January, February), MAM, JJA, and SON seasons, respectively. MAM saw the most significant warming trend in both 1900-2014 and 1979-2014. For an even shorter and more recent period (1998-2014), MAM, JJA and SON show similar warming trends, while DJF shows opposite trends. The results show that the ability of CMA-LSAT to describe global temperature changes is similar to that of other existing products, while there are some differences when describing regional temperature changes.
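
    Trend figures of the form "0.102 ± 0.006 °C/decade" can be read as a fitted slope with a confidence half-width. The abstract does not state how the trends or their ranges were estimated; the sketch below assumes an ordinary least-squares fit to annual anomalies and a normal-approximation 95% interval (1.96 standard errors), with no correction for autocorrelation. The function name and the synthetic series are hypothetical.

```python
import math


def trend_per_decade(years, temps):
    """Least-squares slope in degrees per decade, with an approximate 95% half-width."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(temps) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, temps))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance around the fitted line gives the slope's standard error.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, temps))
    std_err = math.sqrt(ss_res / (n - 2) / sxx)
    return 10.0 * slope, 10.0 * 1.96 * std_err


# Synthetic, noise-free series rising 0.0102 degrees per year (not real data).
years = list(range(1900, 2015))
temps = [0.0102 * (y - 1900) for y in years]
trend, half_width = trend_per_decade(years, temps)
```

    On this noise-free series the fit recovers 0.102 °C/decade with a half-width near zero. Real station averages are serially correlated, so an interval computed this way would tend to be too narrow.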

  17. The impact of integrating WorldView-2 sensor and environmental variables in estimating plantation forest species aboveground biomass and carbon stocks in uMgeni Catchment, South Africa

    NASA Astrophysics Data System (ADS)

    Dube, Timothy; Mutanga, Onisimo

    2016-09-01

    Reliable and accurate mapping and extraction of key forest indicators of ecosystem development and health, such as aboveground biomass (AGB) and aboveground carbon stocks (AGCS), is critical in understanding forests' contribution to the local, regional and global carbon cycle. This information is critical in assessing forest contribution towards ecosystem functioning and services, as well as their conservation status. This work aimed at assessing the applicability of the high resolution 8-band WorldView-2 multispectral dataset together with environmental variables in quantifying AGB and aboveground carbon stocks for three forest plantation species, i.e. Eucalyptus dunii (ED), Eucalyptus grandis (EG) and Pinus taeda (PT), in uMgeni Catchment, South Africa. Specifically, the strength of the WorldView-2 sensor in terms of its improved imaging agilities is examined as an independent dataset and in conjunction with selected environmental variables. The results have demonstrated that the integration of high resolution 8-band WorldView-2 multispectral data with environmental variables provides improved AGB and AGCS estimates, when compared to the use of spectral data as an independent dataset. The use of integrated datasets yielded a high R² value of 0.88 and RMSEs of 10.05 t ha⁻¹ and 5.03 t C ha⁻¹ for E. dunii AGB and carbon stocks, whereas the use of spectral data as an independent dataset yielded slightly weaker results, producing an R² value of 0.73 and RMSEs of 18.57 t ha⁻¹ and 9.29 t C ha⁻¹. Similarly, highly accurate results (R² value of 0.73 and RMSE values of 27.30 t ha⁻¹ and 13.65 t C ha⁻¹) were observed from the estimation of inter-species AGB and carbon stocks. Overall, the findings of this work have shown that the integration of new generation multispectral datasets with environmental variables provides a robust toolset required for the accurate and reliable retrieval of forest aboveground biomass and carbon stocks in densely forested terrestrial ecosystems.

  18. QSAR studies of the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors by multiple linear regression (MLR) and support vector machine (SVM).

    PubMed

    Qin, Zijian; Wang, Maolin; Yan, Aixia

    2017-07-01

    In this study, quantitative structure-activity relationship (QSAR) models using various descriptor sets and training/test set selection methods were explored to predict the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors by using a multiple linear regression (MLR) and a support vector machine (SVM) method. 512 HCV NS3/4A protease inhibitors and their IC50 values, which were determined by the same FRET assay, were collected from the reported literature to build a dataset. All the inhibitors were represented with nine selected global and 12 2D property-weighted autocorrelation descriptors calculated from the program CORINA Symphony. The dataset was divided into a training set and a test set by a random and a Kohonen's self-organizing map (SOM) method. The correlation coefficients (r²) of training sets and test sets were 0.75 and 0.72 for the best MLR model, and 0.87 and 0.85 for the best SVM model, respectively. In addition, a series of sub-dataset models were also developed. The performances of all the best sub-dataset models were better than those of the whole dataset models. We believe that the combination of the best sub- and whole dataset SVM models can be used as reliable lead designing tools for new NS3/4A protease inhibitor scaffolds in a drug discovery pipeline. Copyright © 2017 Elsevier Ltd. All rights reserved.

  19. Fast Construction of Near Parsimonious Hybridization Networks for Multiple Phylogenetic Trees.

    PubMed

    Mirzaei, Sajad; Wu, Yufeng

    2016-01-01

    Hybridization networks represent plausible evolutionary histories of species that are affected by reticulate evolutionary processes. An established computational problem on hybridization networks is constructing the most parsimonious hybridization network such that each of the given phylogenetic trees (called gene trees) is "displayed" in the network. There have been several previous approaches, including an exact method and several heuristics, for this NP-hard problem. However, the exact method is only applicable to a limited range of data, and heuristic methods can be less accurate and also slow sometimes. In this paper, we develop a new algorithm for constructing near parsimonious networks for multiple binary gene trees. This method is more efficient for large numbers of gene trees than previous heuristics. This new method also produces more parsimonious results on many simulated datasets as well as a real biological dataset than a previous method. We also show that our method produces topologically more accurate networks for many datasets.

  20. Strategic Integration of Multiple Bioinformatics Resources for System Level Analysis of Biological Networks.

    PubMed

    D'Souza, Mark; Sulakhe, Dinanath; Wang, Sheng; Xie, Bing; Hashemifar, Somaye; Taylor, Andrew; Dubchak, Inna; Conrad Gilliam, T; Maltsev, Natalia

    2017-01-01

    Recent technological advances in genomics allow the production of biological data at unprecedented tera- and petabyte scales. Efficient mining of these vast and complex datasets for the needs of biomedical research critically depends on a seamless integration of the clinical, genomic, and experimental information with prior knowledge about genotype-phenotype relationships. Such experimental data accumulated in publicly available databases should be accessible to a variety of algorithms and analytical pipelines that drive computational analysis and data mining. We present an integrated computational platform Lynx (Sulakhe et al., Nucleic Acids Res 44:D882-D887, 2016) ( http://lynx.cri.uchicago.edu ), a web-based database and knowledge extraction engine. It provides advanced search capabilities and a variety of algorithms for enrichment analysis and network-based gene prioritization. It gives public access to the Lynx integrated knowledge base (LynxKB) and its analytical tools via user-friendly web services and interfaces. The Lynx service-oriented architecture supports annotation and analysis of high-throughput experimental data. Lynx tools assist the user in extracting meaningful knowledge from LynxKB and experimental data, and in the generation of weighted hypotheses regarding the genes and molecular mechanisms contributing to human phenotypes or conditions of interest. The goal of this integrated platform is to support the end-to-end analytical needs of various translational projects.

  1. Integrated analysis identifies microRNA-195 as a suppressor of Hippo-YAP pathway in colorectal cancer.

    PubMed

    Sun, Min; Song, Haibin; Wang, Shuyi; Zhang, Chunxiao; Zheng, Liang; Chen, Fangfang; Shi, Dongdong; Chen, Yuanyuan; Yang, Chaogang; Xiang, Zhenxian; Liu, Qing; Wei, Chen; Xiong, Bin

    2017-03-29

    With persistent inconsistencies in colorectal cancer (CRC) miRNAs expression data, it is crucial to shift toward inclusion of a "pre-laboratory" integrated analysis to expedite effective precision medicine and translational research. Aberrant expression of hsa-miRNA-195 (miR-195) which is distinguished as a clinically noteworthy miRNA has previously been observed in multiple cancers, yet its role in CRC remains unclear. In this study, we performed an integrated analysis of seven CRC miRNAs expression datasets. The expression of miR-195 was validated in The Cancer Genome Atlas (TCGA) datasets, and an independent validation sample cohort. Colon cancer cells were transfected with miR-195 mimic and inhibitor, after which cell proliferation, colony formation, migration, invasion, and dual luciferase reporter were assayed. Xenograft mouse models were used to determine the role of miR-195 in CRC tumorigenicity in vivo. Four downregulated miRNAs (hsa-let-7a, hsa-miR-125b, hsa-miR-145, and hsa-miR-195) were demonstrated to be potentially useful diagnostic markers in the clinical setting. CRC patients with a decreased level of miR-195-5p in tumor tissues had significantly shortened survival as revealed by the TCGA colon adenocarcinoma (COAD) dataset and our CRC cohort. Overexpression of miR-195-5p in DLD1 and HCT116 cells repressed cell growth, colony formation, invasion, and migration. Inhibition of miR-195-5p function contributed to aberrant cell proliferation, migration, invasion, and epithelial mesenchymal transition (EMT). We identified miR-195-5p binding sites within the 3'-untranslated region (3'-UTR) of the human yes-associated protein (YAP) mRNA. YAP1 expression was downregulated after miR-195-5p treatment by qRT-PCR analysis and western blot. Four downregulated miRNAs were shown to be prime candidates for a panel of biomarkers with sufficient diagnostic accuracy for CRC in a clinical setting. 
Our integrated microRNA profiling approach identified miR-195-5p independently associated with prognosis in CRC. Our results demonstrated that miR-195-5p was a potent suppressor of YAP1, and miR-195-5p-mediated downregulation of YAP1 significantly reduced tumor development in a mouse CRC xenograft model. In the clinic, miR-195-5p can serve as a prognostic marker to predict the outcome of the CRC patients.

  2. A comparative analysis reveals weak relationships between ecological factors and beta diversity of stream insect metacommunities at two spatial levels.

    PubMed

    Heino, Jani; Melo, Adriano S; Bini, Luis Mauricio; Altermatt, Florian; Al-Shami, Salman A; Angeler, David G; Bonada, Núria; Brand, Cecilia; Callisto, Marcos; Cottenie, Karl; Dangles, Olivier; Dudgeon, David; Encalada, Andrea; Göthe, Emma; Grönroos, Mira; Hamada, Neusa; Jacobsen, Dean; Landeiro, Victor L; Ligeiro, Raphael; Martins, Renato T; Miserendino, María Laura; Md Rawi, Che Salmah; Rodrigues, Marciel E; Roque, Fabio de Oliveira; Sandin, Leonard; Schmera, Denes; Sgarbi, Luciano F; Simaika, John P; Siqueira, Tadeu; Thompson, Ross M; Townsend, Colin R

    2015-03-01

    The hypotheses that beta diversity should increase with decreasing latitude and increase with spatial extent of a region have rarely been tested based on a comparative analysis of multiple datasets, and no such study has focused on stream insects. We first assessed how well variability in beta diversity of stream insect metacommunities is predicted by insect group, latitude, spatial extent, altitudinal range, and dataset properties across multiple drainage basins throughout the world. Second, we assessed the relative roles of environmental and spatial factors in driving variation in assemblage composition within each drainage basin. Our analyses were based on a dataset of 95 stream insect metacommunities from 31 drainage basins distributed around the world. We used dissimilarity-based indices to quantify beta diversity for each metacommunity and, subsequently, regressed beta diversity on insect group, latitude, spatial extent, altitudinal range, and dataset properties (e.g., number of sites and percentage of presences). Within each metacommunity, we used a combination of spatial eigenfunction analyses and partial redundancy analysis to partition variation in assemblage structure into environmental, shared, spatial, and unexplained fractions. We found that dataset properties were more important predictors of beta diversity than ecological and geographical factors across multiple drainage basins. In the within-basin analyses, environmental and spatial variables were generally poor predictors of variation in assemblage composition. Our results revealed deviation from general biodiversity patterns because beta diversity did not show the expected decreasing trend with latitude. Our results also call for reconsideration of just how predictable stream assemblages are along ecological gradients, with implications for environmental assessment and conservation decisions. Our findings may also be applicable to other dynamic systems where predictability is low.
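
    The dissimilarity-based step described above can be illustrated with a minimal sketch. The abstract does not say which index was used, so the Sørensen dissimilarity here is an assumption, and the site lists and function names are hypothetical. Beta diversity of a metacommunity is taken as the mean pairwise dissimilarity among its sites, computed from presence-absence data.

```python
from itertools import combinations


def sorensen_dissimilarity(site_a, site_b):
    """Sorensen dissimilarity between two taxon sets: 1 - 2|A&B| / (|A| + |B|)."""
    a, b = set(site_a), set(site_b)
    if not a and not b:
        return 0.0  # two empty sites are treated as identical
    shared = len(a & b)
    return 1.0 - 2.0 * shared / (len(a) + len(b))


def mean_pairwise_beta(sites):
    """Mean dissimilarity over all unordered pairs of sites."""
    pairs = list(combinations(sites, 2))
    return sum(sorensen_dissimilarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical presence-absence data for three sites in one drainage basin.
sites = [
    {"Baetis", "Simulium", "Hydropsyche"},
    {"Baetis", "Simulium"},
    {"Chironomus"},
]
beta = mean_pairwise_beta(sites)
print(round(beta, 3))
```

    A value per metacommunity computed this way could then serve as the response variable in a regression on latitude, spatial extent, and dataset properties such as the number of sites.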

  3. [MicroRNA Target Prediction Based on Support Vector Machine Ensemble Classification Algorithm of Under-sampling Technique].

    PubMed

    Chen, Zhiru; Hong, Wenxue

    2016-02-01

    Considering the low accuracy of prediction in the positive samples and the poor overall classification caused by unbalanced sample data of microRNA (miRNA) targets, we propose in this paper a support vector machine (SVM)-integration of under-sampling and weight (IUSM) algorithm, an ensemble learning algorithm based on under-sampling. The algorithm adopts SVM as the learning algorithm and AdaBoost as the integration framework, and embeds clustering-based under-sampling into the iterative process, aiming at reducing the degree of unbalanced distribution of positive and negative samples. Meanwhile, in the process of adaptive weight adjustment of the samples, the SVM-IUSM algorithm eliminates the abnormal negative samples with a robust sample-weight smoothing mechanism so as to avoid over-learning. Finally, the integrated classifier for miRNA target prediction is obtained by combining multiple weak classifiers through a voting mechanism. The experiment revealed that SVM-IUSM, compared with other algorithms on a collection of unbalanced datasets, could not only improve the accuracy on positive targets and the overall classification performance, but also enhance the generalization ability of the miRNA target classifier.
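
    The general idea of combining under-sampling with a voting ensemble can be sketched as follows. This is a deliberately simplified illustration and not the SVM-IUSM algorithm: a one-dimensional threshold classifier stands in for the SVM, random under-sampling stands in for the clustering-based variant, and the AdaBoost weight updates are omitted. All names and data are hypothetical.

```python
import random


def fit_stump(pos, neg):
    """Weak learner: threshold halfway between the two class means."""
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def fit_ensemble(pos, neg, rounds=5, seed=0):
    """Train one weak learner per round on a class-balanced subsample."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(rounds):
        # Under-sample the majority (negative) class to the minority size.
        neg_sample = rng.sample(neg, len(pos))
        thresholds.append(fit_stump(pos, neg_sample))
    return thresholds


def predict(thresholds, x):
    """Majority vote: positive if more than half the members say so."""
    votes = sum(1 for t in thresholds if x > t)
    return votes * 2 > len(thresholds)


# Hypothetical unbalanced data: 3 positives, 12 negatives.
pos = [0.9, 1.0, 1.1]
neg = [0.0, 0.1, 0.2, 0.3, 0.1, 0.2, 0.0, 0.3, 0.15, 0.25, 0.05, 0.1]
ensemble = fit_ensemble(pos, neg)
print(predict(ensemble, 1.0), predict(ensemble, 0.2))
```

    Because every member sees the same number of positives and negatives, no single member is dominated by the majority class, which is the motivation the abstract gives for embedding under-sampling in the iterations.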

  4. Salient object detection based on discriminative boundary and multiple cues integration

    NASA Astrophysics Data System (ADS)

    Jiang, Qingzhu; Wu, Zemin; Tian, Chang; Liu, Tao; Zeng, Mingyong; Hu, Lei

    2016-01-01

    In recent years, many saliency models have achieved good performance by taking the image boundary as the background prior. However, if all boundaries of an image are equally and artificially selected as background, misjudgment may happen when the object touches the boundary. We propose an algorithm called weighted contrast optimization based on discriminative boundary (wCODB). First, a background estimation model is reliably constructed through discriminating each boundary via Hausdorff distance. Second, the background-only weighted contrast is improved by fore-background weighted contrast, which is optimized through a weight-adjustable optimization framework. Then, to objectively estimate the quality of a saliency map, a simple but effective metric called spatial distribution of saliency map and mean saliency in covered window ratio (MSR) is designed. Finally, in order to further improve the detection result using MSR as the weight, we propose a saliency fusion framework to integrate three other cues (uniqueness, distribution, and coherence) from three representative methods into our wCODB model. Extensive experiments on six public datasets demonstrate that our wCODB performs favorably against most of the methods based on boundary, and the integrated result outperforms all state-of-the-art methods.

  5. LNDriver: identifying driver genes by integrating mutation and expression data based on gene-gene interaction network.

    PubMed

    Wei, Pi-Jing; Zhang, Di; Xia, Junfeng; Zheng, Chun-Hou

    2016-12-23

    Cancer is a complex disease which is characterized by the accumulation of genetic alterations during the patient's lifetime. With the development of next-generation sequencing technology, multiple omics data, such as cancer genomic, epigenomic and transcriptomic data, can be measured from each individual. Correspondingly, one of the key challenges is to pinpoint functional driver mutations or pathways, which contribute to tumorigenesis, from millions of functionally neutral passenger mutations. In this paper, in order to identify driver genes effectively, we applied a generalized additive model to mutation profiles to filter genes with long length and constructed a new gene-gene interaction network. Then we integrated the mutation data and expression data into the gene-gene interaction network. Lastly, a greedy algorithm was used to prioritize candidate driver genes from the integrated data. We named the proposed method Length-Net-Driver (LNDriver). Experiments on three TCGA datasets, i.e., head and neck squamous cell carcinoma, kidney renal clear cell carcinoma and thyroid carcinoma, demonstrated that the proposed method was effective. Also, it can identify not only frequently mutated drivers, but also rare candidate driver genes.

  6. linkedISA: semantic representation of ISA-Tab experimental metadata.

    PubMed

    González-Beltrán, Alejandra; Maguire, Eamonn; Sansone, Susanna-Assunta; Rocca-Serra, Philippe

    2014-01-01

    Reporting and sharing experimental metadata (such as the experimental design, characteristics of the samples, and procedures applied), along with the analysis results, in a standardised manner ensures that datasets are comprehensible and, in principle, reproducible, comparable and reusable. Furthermore, sharing datasets in formats designed for consumption by humans and machines will also maximize their use. The Investigation/Study/Assay (ISA) open source metadata tracking framework facilitates standards-compliant collection, curation, visualization, storage and sharing of datasets, leveraging on other platforms to enable analysis and publication. The ISA software suite includes several components used in an increasingly diverse set of life science and biomedical domains; it is underpinned by a general-purpose format, ISA-Tab, and conversions exist into formats required by public repositories. While ISA-Tab works well mainly as a human readable format, we have also implemented a linked data approach to semantically define the ISA-Tab syntax. We present a semantic web representation of the ISA-Tab syntax that complements ISA-Tab's syntactic interoperability with semantic interoperability. We introduce the linkedISA conversion tool from ISA-Tab to the Resource Description Framework (RDF), supporting mappings from the ISA syntax to multiple community-defined, open ontologies and capitalising on user-provided ontology annotations in the experimental metadata. We describe insights of the implementation and how annotations can be expanded driven by the metadata. We applied the conversion tool as part of Bio-GraphIIn, a web-based application supporting integration of the semantically-rich experimental descriptions.
Designed in a user-friendly manner, the Bio-GraphIIn interface hides most of the complexities from the users, exposing a familiar tabular view of the experimental description to allow seamless interaction with the RDF representation, and visualising descriptors to drive the query over the semantic representation of the experimental design. In addition, we defined queries over the linkedISA RDF representation and demonstrated its use over the linkedISA conversion of datasets from Nature's Scientific Data online publication. Our linked data approach has allowed us to: 1) make the ISA-Tab semantics explicit and machine-processable, 2) exploit the existing ontology-based annotations in the ISA-Tab experimental descriptions, 3) augment the ISA-Tab syntax with new descriptive elements, 4) visualise and query elements related to the experimental design. Reasoning over ISA-Tab metadata and associated data will facilitate data integration and knowledge discovery.

  7. Characterizing Organic Aerosol Processes and Climatically Relevant Properties via Advanced and Integrated Analyses of Aerosol Mass Spectrometry Datasets from DOE Campaigns and ACRF Measurements. Final report for DE-SC0007178

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Qi

    Organic aerosols (OA) are an important but poorly characterized component of the earth’s climate system. Enormous complexities commonly associated with OA composition and life cycle processes have significantly complicated the simulation and quantification of aerosol effects. To unravel these complexities and improve understanding of the properties, sources, formation, evolution processes, and radiative properties of atmospheric OA, we propose to perform advanced and integrated analyses of multiple DOE aerosol mass spectrometry datasets, including two high-resolution time-of-flight aerosol mass spectrometer (HR-AMS) datasets from intensive field campaigns on the aerosol life cycle and the Aerosol Chemical Speciation Monitor (ACSM) datasets from long-term routine measurement programs at ACRF sites. In this project, we will focus on 1) characterizing the chemical (i.e., composition, organic elemental ratios), physical (i.e., size distribution and volatility), and radiative (i.e., sub- and super-saturated growth) properties of organic aerosols, 2) examining the correlations of these properties with different source and process regimes (e.g., primary, secondary, urban, biogenic, biomass burning, marine, or mixtures), 3) quantifying the evolutions of these properties as a function of photochemical processing, 4) identifying and characterizing special cases for important processes such as SOA formation and new particle formation and growth, and 5) correlating size-resolved aerosol chemistry with measurements of radiative properties of aerosols to determine the climatically relevant properties of OA and characterize the relationship between these properties and processes of atmospheric aerosol organics. Our primary goal is to improve a process-level understanding of the life cycle of organic aerosols in the Earth’s atmosphere.
We will also aim at bridging between observations and models via synthesizing and translating the results and insights generated from this research into data products and formulations that may be directly used to inform, improve, and evaluate regional and global models. In addition, we will continue our current very active collaborations with several modeling groups to enhance the use and interpretation of our data products. Overall, this research will contribute new data to improve quantification of the aerosol’s effects on climate and thus the achievement of ASR’s science goal of – “improving the fidelity and predictive capability of global climate models”.

  8. BEAT: Bioinformatics Exon Array Tool to store, analyze and visualize Affymetrix GeneChip Human Exon Array data from disease experiments

    PubMed Central

    2012-01-01

    Background It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. Last generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile providing information on the AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows performing analyses at both gene and exon level. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon arrays datasets. It combines a data warehouse approach with some rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at http://beat.ba.itb.cnr.it. Results BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza. To highlight all possible AS events, alternative names, accession Ids, Gene Ontology terms and biochemical pathways annotations are integrated with exon and gene level expression plots. 
The user can customize the results choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis. Conclusions Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics. PMID:22536968

  9. Hierarchical storage of large volume of multidetector CT data using distributed servers

    NASA Astrophysics Data System (ADS)

    Ratib, Osman; Rosset, Antoine; Heuberger, Joris; Bandon, David

    2006-03-01

    Multidetector scanners and hybrid multimodality scanners have the ability to generate a large number of high-resolution images, resulting in very large data sets. In most cases, these datasets are generated for the sole purpose of generating secondary processed images and 3D rendered images as well as oblique and curved multiplanar reformatted images. It is therefore not essential to archive the original images after they have been processed. We have developed an architecture of distributed archive servers for temporary storage of large image datasets for 3D rendering and image processing without the need for long term storage in a PACS archive. With the relatively low cost of storage devices it is possible to configure these servers to hold several months or even years of data, long enough to allow subsequent re-processing if required by specific clinical situations. We tested the latest generation of RAID servers provided by Apple computers with a capacity of 5 TBytes. We implemented a peer-to-peer data access software based on our Open-Source image management software called OsiriX, allowing remote workstations to directly access DICOM image files located on the server through a new technology called "bonjour". This architecture offers a seamless integration of multiple servers and workstations without the need for a central database or complex workflow management tools. It allows efficient access to image data from multiple workstations for image analysis and visualization without the need for image data transfer. It provides a convenient alternative to centralized PACS architecture while avoiding complex and time-consuming data transfer and storage.

  10. Squish: Near-Optimal Compression for Archival of Relational Datasets

    PubMed Central

    Gao, Yihan; Parameswaran, Aditya

    2017-01-01

    Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: Squish achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets. PMID:28180028
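
    Why modelling dependencies between attributes helps compression can be seen from a small entropy calculation. The sketch below is not Squish itself; it only computes empirical entropies on a hypothetical two-column table. An entropy coder that models `country` on its own needs about H(country) bits per row for that column, whereas one that conditions on `city` needs about H(country | city), which is never larger.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy in bits per value."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(rows, given, target):
    """Empirical H(target | given) in bits per row."""
    groups = {}
    for row in rows:
        groups.setdefault(row[given], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())


# Hypothetical table in which country is fully determined by city.
rows = [
    {"city": "Paris", "country": "FR"},
    {"city": "Lyon", "country": "FR"},
    {"city": "Berlin", "country": "DE"},
    {"city": "Munich", "country": "DE"},
]
h_country = entropy([r["country"] for r in rows])
h_country_given_city = conditional_entropy(rows, "city", "country")
print(h_country, h_country_given_city)
```

    In this toy table the conditional entropy is zero, so a coder that knows the dependency would spend essentially no bits on the second column. General-purpose compressors such as gzip do not see this column-level structure.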

  11. Visualization of conserved structures by fusing highly variable datasets.

    PubMed

    Silverstein, Jonathan C; Chhadia, Ankur; Dech, Fred

    2002-01-01

    Skill, effort, and time are required to identify and visualize anatomic structures in three-dimensions from radiological data. Fundamentally, automating these processes requires a technique that uses symbolic information not in the dynamic range of the voxel data. We were developing such a technique based on mutual information for automatic multi-modality image fusion (MIAMI Fuse, University of Michigan). This system previously demonstrated facility at fusing one voxel dataset with integrated symbolic structure information to a CT dataset (different scale and resolution) from the same person. The next step of development of our technique was aimed at accommodating the variability of anatomy from patient to patient by using warping to fuse our standard dataset to arbitrary patient CT datasets. A standard symbolic information dataset was created from the full color Visible Human Female by segmenting the liver parenchyma, portal veins, and hepatic veins and overwriting each set of voxels with a fixed color. Two arbitrarily selected patient CT scans of the abdomen were used for reference datasets. We used the warping functions in MIAMI Fuse to align the standard structure data to each patient scan. The key to successful fusion was the focused use of multiple warping control points that place themselves around the structure of interest automatically. The user assigns only a few initial control points to align the scans. Fusion 1 and 2 transformed the atlas with 27 points around the liver to CT1 and CT2 respectively. Fusion 3 transformed the atlas with 45 control points around the liver to CT1 and Fusion 4 transformed the atlas with 5 control points around the portal vein. The CT dataset is augmented with the transformed standard structure dataset, such that the warped structure masks are visualized in combination with the original patient dataset. 
This combined volume visualization is then rendered interactively in stereo on the ImmersaDesk in an immersive Virtual Reality (VR) environment. The accuracy of the fusions was determined qualitatively by comparing the transformed atlas overlaid on the appropriate CT. It was examined for where the transformed structure atlas was incorrectly overlaid (false positive) and where it was incorrectly not overlaid (false negative). According to this method, fusions 1 and 2 were correct roughly 50-75% of the time, while fusions 3 and 4 were correct roughly 75-100%. The CT dataset augmented with transformed dataset was viewed arbitrarily in user-centered perspective stereo taking advantage of features such as scaling, windowing and volumetric region of interest selection. This process of auto-coloring conserved structures in variable datasets is a step toward the goal of a broader, standardized automatic structure visualization method for radiological data. If successful it would permit identification, visualization or deletion of structures in radiological data by semi-automatically applying canonical structure information to the radiological data (not just processing and visualization of the data's intrinsic dynamic range). More sophisticated selection of control points and patterns of warping may allow for more accurate transforms, and thus advances in visualization, simulation, education, diagnostics, and treatment planning.

  12. Evaluation of a Traffic Sign Detector by Synthetic Image Data for Advanced Driver Assistance Systems

    NASA Astrophysics Data System (ADS)

    Hanel, A.; Kreuzpaintner, D.; Stilla, U.

    2018-05-01

    Recently, several synthetic image datasets of street scenes have been published. These datasets contain various traffic signs and can therefore be used to train and test machine learning-based traffic sign detectors. In this contribution, selected datasets are compared regarding their applicability for traffic sign detection. The comparison covers the process to produce the synthetic images and addresses the virtual worlds needed to produce the synthetic images and their environmental conditions. The comparison also covers variations in the appearance of traffic signs and the labeling strategies used for the datasets. A deep learning traffic sign detector is trained with multiple training datasets with different ratios between synthetic and real training samples to evaluate the synthetic SYNTHIA dataset. A test of the detector on real samples only has shown that an overall accuracy and ROC AUC of more than 95 % can be achieved for both a small rate of synthetic samples and a large rate of synthetic samples in the training dataset.

  13. A multiple hypotheses uncertainty analysis in hydrological modelling: about model structure, landscape parameterization, and numerical integration

    NASA Astrophysics Data System (ADS)

    Pilz, Tobias; Francke, Till; Bronstert, Axel

    2016-04-01

    To date, a large number of competing computer models have been developed to understand hydrological processes and to simulate and predict streamflow dynamics of rivers. This is primarily the result of a lack of a unified theory in catchment hydrology due to insufficient process understanding and uncertainties related to model development and application. Therefore, the goal of this study is to analyze the uncertainty structure of a process-based hydrological catchment model employing a multiple hypotheses approach. The study focuses on three major problems that have received only little attention in previous investigations. First, to estimate the impact of model structural uncertainty by employing several alternative representations for each simulated process. Second, to explore the influence of landscape discretization and parameterization from multiple datasets and user decisions. Third, to employ several numerical solvers for the integration of the governing ordinary differential equations to study the effect on simulation results. The generated ensemble of model hypotheses is then analyzed and the three sources of uncertainty compared against each other. To ensure consistency and comparability, all model structures and numerical solvers are implemented within a single simulation environment. First results suggest that the selection of a sophisticated numerical solver for the differential equations positively affects simulation outcomes. However, some simple and easy-to-implement explicit methods already perform surprisingly well and need less computational effort than more advanced but time-consuming implicit techniques. There is general evidence that ambiguous and subjective user decisions form a major source of uncertainty and can greatly influence model development and application at all stages.
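
    The contrast between explicit and implicit solvers can be illustrated on a toy problem. The sketch below assumes a single linear reservoir, dS/dt = -k S, which is not the catchment model of the study; the parameter values are hypothetical. For this linear equation the implicit Euler update has a closed form, whereas a nonlinear model would need an iterative solve at every step, which is where the extra computational cost of implicit methods comes from.

```python
import math


def explicit_euler(s0, k, dt, steps):
    """Forward Euler: the slope is evaluated at the start of each step."""
    s = s0
    for _ in range(steps):
        s = s + dt * (-k * s)
    return s


def implicit_euler(s0, k, dt, steps):
    """Backward Euler: solves s_new = s + dt * (-k * s_new) for s_new."""
    s = s0
    for _ in range(steps):
        s = s / (1.0 + k * dt)
    return s


s0, k, t_end = 100.0, 0.5, 4.0  # storage, recession constant, duration
exact = s0 * math.exp(-k * t_end)
for steps in (4, 40, 400):
    dt = t_end / steps
    print(steps, explicit_euler(s0, k, dt, steps),
          implicit_euler(s0, k, dt, steps), exact)
```

    On this decay problem the explicit scheme underestimates and the implicit scheme overestimates the exact storage, and both converge as the step size shrinks.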

  14. A Hybrid Neuro-Fuzzy Model For Integrating Large Earth-Science Datasets

    NASA Astrophysics Data System (ADS)

    Porwal, A.; Carranza, J.; Hale, M.

    2004-12-01

    A GIS-based hybrid neuro-fuzzy approach to integration of large earth-science datasets for mineral prospectivity mapping is described. It implements a Takagi-Sugeno type fuzzy inference system in the framework of a four-layered feed-forward adaptive neural network. Each unique combination of the datasets is considered a feature vector whose components are derived by knowledge-based ordinal encoding of the constituent datasets. A subset of feature vectors with a known output target vector (i.e., unique conditions known to be associated with either a mineralized or a barren location) is used for the training of an adaptive neuro-fuzzy inference system. Training involves iterative adjustment of parameters of the adaptive neuro-fuzzy inference system using a hybrid learning procedure for mapping each training vector to its output target vector with minimum sum of squared error. The trained adaptive neuro-fuzzy inference system is used to process all feature vectors. The output for each feature vector is a value that indicates the extent to which a feature vector belongs to the mineralized class or the barren class. These values are used to generate a prospectivity map. The procedure is demonstrated by an application to regional-scale base metal prospectivity mapping in a study area located in the Aravalli metallogenic province (western India). A comparison of the hybrid neuro-fuzzy approach with pure knowledge-driven fuzzy and pure data-driven neural network approaches indicates that the former offers a superior method for integrating large earth-science datasets for predictive spatial mathematical modelling.
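
    A Takagi-Sugeno inference step of the kind described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: Gaussian membership functions, product firing strengths, first-order (linear) consequents, and two hand-picked rules with hypothetical parameters. In the adaptive neuro-fuzzy system of the abstract such parameters would be learned from training vectors rather than fixed by hand.

```python
import math


def gaussian(x, center, width):
    """Gaussian membership value of x in a fuzzy set."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)


# Hypothetical rules: (center, width) per input, plus a linear consequent.
RULES = [
    {"mf": [(0.8, 0.2), (0.7, 0.2)], "coef": [0.1, 0.1], "bias": 0.8},
    {"mf": [(0.2, 0.2), (0.3, 0.2)], "coef": [0.1, 0.1], "bias": 0.0},
]


def ts_infer(x, rules=RULES):
    """Firing-strength-weighted average of the rule consequents."""
    weighted_sum, total = 0.0, 0.0
    for rule in rules:
        strength = 1.0
        for xi, (center, width) in zip(x, rule["mf"]):
            strength *= gaussian(xi, center, width)
        output = rule["bias"] + sum(a * xi for a, xi in zip(rule["coef"], x))
        weighted_sum += strength * output
        total += strength
    if total == 0.0:
        return 0.0  # no rule fires (numerical underflow far from all centers)
    return weighted_sum / total


# Two hypothetical encoded feature vectors.
high = ts_infer([0.8, 0.7])
low = ts_infer([0.2, 0.3])
print(round(high, 3), round(low, 3))
```

    Applying such a function to every feature vector of a study area would yield the per-location scores from which a prospectivity map is drawn.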

  15. Multi-modal data fusion using source separation: Two effective models based on ICA and IVA and their properties

    PubMed Central

    Adali, Tülay; Levin-Schwartz, Yuri; Calhoun, Vince D.

    2015-01-01

    Fusion of information from multiple sets of data in order to extract a set of features that are most useful and relevant for the given task is inherent to many problems we deal with today. Since, usually, very little is known about the actual interaction among the datasets, it is highly desirable to minimize the underlying assumptions. This has been the main reason for the growing importance of data-driven methods, and in particular of independent component analysis (ICA) as it provides useful decompositions with a simple generative model and using only the assumption of statistical independence. A recent extension of ICA, independent vector analysis (IVA), generalizes ICA to multiple datasets by exploiting the statistical dependence across the datasets, and hence, as we discuss in this paper, provides an attractive solution to fusion of data from multiple datasets along with ICA. In this paper, we focus on two multivariate solutions for multi-modal data fusion that let multiple modalities fully interact for the estimation of underlying features that jointly report on all modalities. One solution is the Joint ICA model that has found wide application in medical imaging, and the second one is the Transposed IVA model introduced here as a generalization of an approach based on multi-set canonical correlation analysis. In the discussion, we emphasize the role of diversity in the decompositions achieved by these two models, present their properties and implementation details to enable the user to make informed decisions on the selection of a model along with its associated parameters. Discussions are supported by simulation results to help highlight the main issues in the implementation of these methods. PMID:26525830

  16. A Unified Approach to Functional Principal Component Analysis and Functional Multiple-Set Canonical Correlation.

    PubMed

    Choi, Ji Yeh; Hwang, Heungsun; Yamamoto, Michio; Jung, Kwanghee; Woodward, Todd S

    2017-06-01

    Functional principal component analysis (FPCA) and functional multiple-set canonical correlation analysis (FMCCA) are data reduction techniques for functional data that are collected in the form of smooth curves or functions over a continuum such as time or space. In FPCA, low-dimensional components are extracted from a single functional dataset such that they explain the most variance of the dataset, whereas in FMCCA, low-dimensional components are obtained from each of multiple functional datasets in such a way that the associations among the components are maximized across the different sets. In this paper, we propose a unified approach to FPCA and FMCCA. The proposed approach subsumes both techniques as special cases. Furthermore, it permits a compromise between the techniques, such that components are obtained from each set of functional data to maximize their associations across different datasets, while accounting for the variance of the data well. We propose a single optimization criterion for the proposed approach, and develop an alternating regularized least squares algorithm to minimize the criterion in combination with basis function approximations to functions. We conduct a simulation study to investigate the performance of the proposed approach based on synthetic data. We also apply the approach for the analysis of multiple-subject functional magnetic resonance imaging data to obtain low-dimensional components of blood-oxygen level-dependent signal changes of the brain over time, which are highly correlated across the subjects as well as representative of the data. The extracted components are used to identify networks of neural activity that are commonly activated across the subjects while carrying out a working memory task.

  17. Cross-species multiple environmental stress responses: An integrated approach to identify candidate genes for multiple stress tolerance in sorghum (Sorghum bicolor (L.) Moench) and related model species

    PubMed Central

    Modise, David M.; Gemeildien, Junaid; Ndimba, Bongani K.; Christoffels, Alan

    2018-01-01

    Background Crop response to the changing climate and unpredictable effects of global warming with adverse conditions such as drought stress has brought concerns about food security to the fore; crop yield loss is a major cause of concern in this regard. Identification of genes with multiple responses across environmental stresses is the genetic foundation that leads to crop adaptation to environmental perturbations. Methods In this paper, we introduce an integrated approach to assess candidate genes for multiple stress responses across species. The approach combines ontology-based semantic data integration with expression profiling, comparative genomics, phylogenomics, functional gene enrichment and gene enrichment network analysis to identify genes associated with plant stress phenotypes. Five different ontologies, viz., Gene Ontology (GO), Trait Ontology (TO), Plant Ontology (PO), Growth Ontology (GRO) and Environment Ontology (EO) were used to semantically integrate drought-related information. Results Target genes linked to Quantitative Trait Loci (QTLs) controlling yield and stress tolerance in sorghum (Sorghum bicolor (L.) Moench) and closely related species were identified. Based on the enriched GO terms of the biological processes, 1116 sorghum genes with potential responses to 5 different stresses, such as drought (18%), salt (32%), cold (20%), heat (8%) and oxidative stress (25%) were identified to be over-expressed. Out of 169 sorghum drought-responsive QTL-associated genes that were identified based on expression datasets, 56% were shown to have multiple stress responses. On the other hand, out of 168 additional genes that have been evaluated for orthologous pairs, 90% were conserved across species for drought tolerance. Over 50% of identified maize and rice genes were responsive to drought and salt stresses and were co-located within multifunctional QTLs. Among the total identified multi-stress responsive genes, 272 targets were shown to be co-localized within QTLs associated with different traits that are responsive to multiple stresses. Ontology mapping was used to validate the identified genes, while reconstruction of the phylogenetic tree was instrumental to infer the evolutionary relationship of the sorghum orthologs. The results also show specific genes responsible for various interrelated components of drought response mechanism such as drought tolerance, drought avoidance and drought escape. Conclusions We submit that this approach is novel and, to our knowledge, has not been used previously in any other research; it enables us to perform cross-species queries for genes that are likely to be associated with multiple stress tolerance, as a means to identify novel targets for engineering stress resistance in sorghum and possibly in other crop species. PMID:29590108

  18. Structural Analysis of 1995-2005 School Crime Supplement Datasets: Factors Influencing Students' Fear, Anxiety, and Avoidant Behaviors

    ERIC Educational Resources Information Center

    Mayer, Matthew J.

    2010-01-01

    The 1995-2005 School Crime Supplement datasets were analyzed using structural equation modeling. Converging evidence across multiple analyses suggests that secure school building policies may not be systematically linked to school disorder and may be more a reactive measure in response to other concerns. Most importantly, measures of incivility…

  19. GSuite HyperBrowser: integrative analysis of dataset collections across the genome and epigenome.

    PubMed

    Simovski, Boris; Vodák, Daniel; Gundersen, Sveinung; Domanska, Diana; Azab, Abdulrahman; Holden, Lars; Holden, Marit; Grytten, Ivar; Rand, Knut; Drabløs, Finn; Johansen, Morten; Mora, Antonio; Lund-Andersen, Christin; Fromm, Bastian; Eskeland, Ragnhild; Gabrielsen, Odd Stokke; Ferkingstad, Egil; Nakken, Sigve; Bengtsen, Mads; Nederbragt, Alexander Johan; Thorarensen, Hildur Sif; Akse, Johannes Andreas; Glad, Ingrid; Hovig, Eivind; Sandve, Geir Kjetil

    2017-07-01

    Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. The software is available at: https://hyperbrowser.uio.no. © The Author 2017. Published by Oxford University Press.

  20. A normalization method for combination of laboratory test results from different electronic healthcare databases in a distributed research network.

    PubMed

    Yoon, Dukyong; Schuemie, Martijn J; Kim, Ju Han; Kim, Dong Ki; Park, Man Young; Ahn, Eun Kyoung; Jung, Eun-Young; Park, Dong Kyun; Cho, Soo Yeon; Shin, Dahye; Hwang, Yeonsoo; Park, Rae Woong

    2016-03-01

    Distributed research networks (DRNs) afford statistical power by integrating observational data from multiple partners for retrospective studies. However, laboratory test results across care sites are derived using different assays from varying patient populations, making it difficult to simply combine data for analysis. Additionally, existing normalization methods are not suitable for retrospective studies. We normalized laboratory results from different data sources by adjusting for heterogeneous clinico-epidemiologic characteristics of the data and called this the subgroup-adjusted normalization (SAN) method. Subgroup-adjusted normalization renders the means and standard deviations of distributions identical under population structure-adjusted conditions. To evaluate its performance, we compared SAN with existing methods for simulated and real datasets consisting of blood urea nitrogen, serum creatinine, hematocrit, hemoglobin, serum potassium, and total bilirubin. Various clinico-epidemiologic characteristics can be applied together in SAN. For simplicity of comparison, age and gender were used to adjust population heterogeneity in this study. In simulations, SAN had the lowest standardized difference in means (SDM) and Kolmogorov-Smirnov values for all tests (p < 0.05). In a real dataset, SAN had the lowest SDM and Kolmogorov-Smirnov values for blood urea nitrogen, hematocrit, hemoglobin, and serum potassium, and the lowest SDM for serum creatinine (p < 0.05). Subgroup-adjusted normalization performed better than normalization using other methods. The SAN method is applicable in a DRN environment and should facilitate analysis of data integrated across DRN partners for retrospective observational studies. Copyright © 2015 John Wiley & Sons, Ltd.
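
    The idea of making means and standard deviations identical within population subgroups can be sketched as follows. This is a simplified illustration under stated assumptions, not the published SAN algorithm: it rescales each source's values onto a chosen reference source separately within each subgroup (for example an age band and gender pair). The function name, the record layout, and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def subgroup_rescale(records, reference_source):
    """Rescale lab values onto the reference source, subgroup by subgroup.

    records: iterable of (source, subgroup, value) tuples.
    Within every subgroup, each source's values are standardised and then
    mapped onto the mean and standard deviation that the reference source
    has in that same subgroup.
    Assumes the reference source contains every subgroup (else KeyError).
    """
    records = list(records)

    # Collect the values of every (source, subgroup) cell.
    cells = defaultdict(list)
    for source, subgroup, value in records:
        cells[(source, subgroup)].append(value)

    # Mean and population standard deviation per cell.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in cells.items()}

    rescaled = []
    for source, subgroup, value in records:
        mu, sd = stats[(source, subgroup)]
        ref_mu, ref_sd = stats[(reference_source, subgroup)]
        z = (value - mu) / sd if sd > 0 else 0.0  # constant cell -> centre
        rescaled.append((source, subgroup, ref_mu + z * ref_sd))
    return rescaled


if __name__ == "__main__":
    # Made-up values for two sites and one subgroup, for illustration only.
    sample = [
        ("site_A", "F<50", 10.0), ("site_A", "F<50", 12.0), ("site_A", "F<50", 14.0),
        ("site_B", "F<50", 20.0), ("site_B", "F<50", 24.0), ("site_B", "F<50", 28.0),
    ]
    for row in subgroup_rescale(sample, reference_source="site_A"):
        print(row)
```

    Rescaling within subgroups rather than over each source as a whole is what keeps differences in patient mix (here, age and gender) from being mistaken for assay differences.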
