Sample records for databases models datasets

  1. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    NASA Astrophysics Data System (ADS)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

    In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structure Query Language (SQL) server which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information were linked (e.g. location, date, and instrument). Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.

  2. Toward Computational Cumulative Biology by Combining Models of Biological Datasets

    PubMed Central

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176

  3. Toward computational cumulative biology by combining models of biological datasets.

    PubMed

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations-for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.

  4. MODBASE, a database of annotated comparative protein structure models

    PubMed Central

    Pieper, Ursula; Eswar, Narayanan; Stuart, Ashley C.; Ilyin, Valentin A.; Sali, Andrej

    2002-01-01

    MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on PSI-BLAST, IMPALA and MODELLER. MODBASE uses the MySQL relational database management system for flexible and efficient querying, and the MODVIEW Netscape plugin for viewing and manipulating multiple sequences and structures. It is updated regularly to reflect the growth of the protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different datasets. The largest dataset contains models for domains in 304 517 out of 539 171 unique protein sequences in the complete TrEMBL database (23 March 2001); only models based on significant alignments (PSI-BLAST E-value < 10–4) and models assessed to have the correct fold are included. Other datasets include models for target selection and structure-based annotation by the New York Structural Genomics Research Consortium, models for prediction of genes in the Drosophila melanogaster genome, models for structure determination of several ribosomal particles and models calculated by the MODWEB comparative modeling web server. PMID:11752309

  5. Modeling and Databases for Teaching Petrology

    NASA Astrophysics Data System (ADS)

    Asher, P.; Dutrow, B.

    2003-12-01

    With the widespread availability of high-speed computers with massive storage and ready transport capability of large amounts of data, computational and petrologic modeling and the use of databases provide new tools with which to teach petrology. Modeling can be used to gain insights into a system, predict system behavior, describe a system's processes, compare with a natural system or simply to be illustrative. These aspects result from data driven or empirical, analytical or numerical models or the concurrent examination of multiple lines of evidence. At the same time, use of models can enhance core foundations of the geosciences by improving critical thinking skills and by reinforcing prior knowledge gained. However, the use of modeling to teach petrology is dictated by the level of expectation we have for students and their facility with modeling approaches. For example, do we expect students to push buttons and navigate a program, understand the conceptual model and/or evaluate the results of a model. Whatever the desired level of sophistication, specific elements of design should be incorporated into a modeling exercise for effective teaching. These include, but are not limited to; use of the scientific method, use of prior knowledge, a clear statement of purpose and goals, attainable goals, a connection to the natural/actual system, a demonstration that complex heterogeneous natural systems are amenable to analyses by these techniques and, ideally, connections to other disciplines and the larger earth system. Databases offer another avenue with which to explore petrology. Large datasets are available that allow integration of multiple lines of evidence to attack a petrologic problem or understand a petrologic process. These are collected into a database that offers a tool for exploring, organizing and analyzing the data. For example, datasets may be geochemical, mineralogic, experimental and/or visual in nature, covering global, regional to local scales. These datasets provide students with access to large amount of related data through space and time. Goals of the database working group include educating earth scientists about information systems in general, about the importance of metadata about ways of using databases and datasets as educational tools and about the availability of existing datasets and databases. The modeling and databases groups hope to create additional petrologic teaching tools using these aspects and invite the community to contribute to the effort.

  6. The HyMeX database

    NASA Astrophysics Data System (ADS)

    Brissebrat, Guillaume; Mastrorillo, Laurence; Ramage, Karim; Boichard, Jean-Luc; Cloché, Sophie; Fleury, Laurence; Klenov, Ludmila; Labatut, Laurent; Mière, Arnaud

    2013-04-01

    The international HyMeX (HYdrological cycle in the Mediterranean EXperiment) project aims at a better understanding and quantification of the hydrological cycle and related processes in the Mediterranean, with emphasis on high-impact weather events, inter-annual to decadal variability of the Mediterranean coupled system, and associated trends in the context of global change. The project includes long term monitoring of environmental parameters, intensive field campaigns, use of satellite data, modelling studies, as well as post event field surveys and value-added products processing. Therefore HyMeX database incorporates various dataset types from different disciplines, either operational or research. The database relies on a strong collaboration between OMP and IPSL data centres. Field data, which are 1D time series, maps or pictures, are managed by OMP team while gridded data (satellite products, model outputs, radar data...) are managed by IPSL team. At present, the HyMeX database contains about 150 datasets, including 80 hydrological, meteorological, ocean and soil in situ datasets, 30 radar datasets, 15 satellite products, 15 atmosphere, ocean and land surface model outputs from operational (re-)analysis or forecasts and from research simulations, and 5 post event survey datasets. The data catalogue complies with international standards (ISO 19115; INSPIRE; Directory Interchange Format; Global Change Master Directory Thesaurus). It includes all the datasets stored in the HyMeX database, as well as external datasets relevant for the project. All the data, whatever the type is, are accessible through a single gateway. The database website http://mistrals.sedoo.fr/HyMeX offers different tools: - A registration procedure which enables any scientist to accept the data policy and apply for a user database account. - A search tool to browse the catalogue using thematic, geographic and/or temporal criteria. - Sorted lists of the datasets by thematic keywords, by measured parameters, by instruments or by platform type. - Forms to document observations or products that will be provided to the database. - A shopping-cart web interface to order in situ data files. - Ftp facilities to access gridded data. The website will soon propose new facilities. Many in situ datasets have been homogenized and inserted in a relational database yet, in order to enable more accurate data selection and download of different datasets in a shared format. Interoperability between the two data centres will be enhanced by the OpenDAP communication protocol associated with the Thredds catalogue software, which may also be implemented in other data centres that manage data of interest for the HyMeX project. In order to meet the operational needs for the HyMeX 2012 campaigns, a day-to-day quick look and report display website has been developed too: http://sop.hymex.org. It offers a convenient way to browse meteorological conditions and data during the campaign periods.

  7. Background qualitative analysis of the European Reference Life Cycle Database (ELCD) energy datasets - part I: fuel datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this study is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) fuel datasets. The revision is based on the data quality indicators described by the ILCD Handbook, applied on sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and the appropriateness in terms of completeness, precision and methodology. Results show that ELCD fuel datasets have a very good quality in general terms, nevertheless some findings and recommendations in order to improve the quality of Life-Cycle Inventories have been derived. Moreover, these results ensure the quality of the fuel-related datasets to any LCA practitioner, and provide insights related to the limitations and assumptions underlying in the datasets modelling. Giving this information, the LCA practitioner will be able to decide whether the use of the ELCD fuel datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would be also useful for dataset developers and reviewers, in order to improve the overall DQR of databases.

  8. Comparing the Performance of NoSQL Approaches for Managing Archetype-Based Electronic Health Record Data

    PubMed Central

    Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick

    2016-01-01

    This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest. PMID:26958859

  9. Comparing the Performance of NoSQL Approaches for Managing Archetype-Based Electronic Health Record Data.

    PubMed

    Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick

    2016-01-01

    This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest.

  10. Improving phylogenetic analyses by incorporating additional information from genetic sequence databases.

    PubMed

    Liang, Li-Jung; Weiss, Robert E; Redelings, Benjamin; Suchard, Marc A

    2009-10-01

    Statistical analyses of phylogenetic data culminate in uncertain estimates of underlying model parameters. Lack of additional data hinders the ability to reduce this uncertainty, as the original phylogenetic dataset is often complete, containing the entire gene or genome information available for the given set of taxa. Informative priors in a Bayesian analysis can reduce posterior uncertainty; however, publicly available phylogenetic software specifies vague priors for model parameters by default. We build objective and informative priors using hierarchical random effect models that combine additional datasets whose parameters are not of direct interest but are similar to the analysis of interest. We propose principled statistical methods that permit more precise parameter estimates in phylogenetic analyses by creating informative priors for parameters of interest. Using additional sequence datasets from our lab or public databases, we construct a fully Bayesian semiparametric hierarchical model to combine datasets. A dynamic iteratively reweighted Markov chain Monte Carlo algorithm conveniently recycles posterior samples from the individual analyses. We demonstrate the value of our approach by examining the insertion-deletion (indel) process in the enolase gene across the Tree of Life using the phylogenetic software BALI-PHY; we incorporate prior information about indels from 82 curated alignments downloaded from the BAliBASE database.

  11. Integration of Neuroimaging and Microarray Datasets through Mapping and Model-Theoretic Semantic Decomposition of Unstructured Phenotypes

    PubMed Central

    Pantazatos, Spiro P.; Li, Jianrong; Pavlidis, Paul; Lussier, Yves A.

    2009-01-01

    An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n = 50), and precision of the semantic mapping between these terms across datasets was 98% (n = 100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets. PMID:20495688

  12. Background qualitative analysis of the European reference life cycle database (ELCD) energy datasets - part II: electricity datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

    The aim of this paper is to identify areas of potential improvement of the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and the appropriateness in terms of completeness, precision and methodology. Results show that ELCD electricity datasets have a very good quality in general terms, nevertheless some findings and recommendations in order to improve the quality of Life-Cycle Inventories have been derived. Moreover, these results ensure the quality of the electricity-related datasets to any LCA practitioner, and provide insights related to the limitations and assumptions underlying in the datasets modelling. Giving this information, the LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate based on the goal and scope of the analysis to be conducted. The methodological approach would be also useful for dataset developers and reviewers, in order to improve the overall Data Quality Requirements of databases.

  13. Consensus Modeling of Oral Rat Acute Toxicity

    EPA Science Inventory

    An acute toxicity dataset (oral rat LD50) with about 7400 compounds was compiled from the ChemIDplus database. This dataset was divided into a modeling set and a prediction set. The compounds in the prediction set were selected so that they were present in the modeling set used...

  14. Spatio-Temporal Data Model for Integrating Evolving Nation-Level Datasets

    NASA Astrophysics Data System (ADS)

    Sorokine, A.; Stewart, R. N.

    2017-10-01

    Ability to easily combine the data from diverse sources in a single analytical workflow is one of the greatest promises of the Big Data technologies. However, such integration is often challenging as datasets originate from different vendors, governments, and research communities that results in multiple incompatibilities including data representations, formats, and semantics. Semantics differences are hardest to handle: different communities often use different attribute definitions and associate the records with different sets of evolving geographic entities. Analysis of global socioeconomic variables across multiple datasets over prolonged time is often complicated by the difference in how boundaries and histories of countries or other geographic entities are represented. Here we propose an event-based data model for depicting and tracking histories of evolving geographic units (countries, provinces, etc.) and their representations in disparate data. The model addresses the semantic challenge of preserving identity of geographic entities over time by defining criteria for the entity existence, a set of events that may affect its existence, and rules for mapping between different representations (datasets). Proposed model is used for maintaining an evolving compound database of global socioeconomic and environmental data harvested from multiple sources. Practical implementation of our model is demonstrated using PostgreSQL object-relational database with the use of temporal, geospatial, and NoSQL database extensions.

  15. VEMAP Phase 2 bioclimatic database. I. Gridded historical (20th century) climate for modeling ecosystem dynamics across the conterminous USA

    USGS Publications Warehouse

    Kittel, T.G.F.; Rosenbloom, N.A.; Royle, J. Andrew; Daly, Christopher; Gibson, W.P.; Fisher, H.H.; Thornton, P.; Yates, D.N.; Aulenbach, S.; Kaufman, C.; McKeown, R.; Bachelet, D.; Schimel, D.S.; Neilson, R.; Lenihan, J.; Drapek, R.; Ojima, D.S.; Parton, W.J.; Melillo, J.M.; Kicklighter, D.W.; Tian, H.; McGuire, A.D.; Sykes, M.T.; Smith, B.; Cowling, S.; Hickler, T.; Prentice, I.C.; Running, S.; Hibbard, K.A.; Post, W.M.; King, A.W.; Smith, T.; Rizzo, B.; Woodward, F.I.

    2004-01-01

    Analysis and simulation of biospheric responses to historical forcing require surface climate data that capture those aspects of climate that control ecological processes, including key spatial gradients and modes of temporal variability. We developed a multivariate, gridded historical climate dataset for the conterminous USA as a common input database for the Vegetation/Ecosystem Modeling and Analysis Project (VEMAP), a biogeochemical and dynamic vegetation model intercomparison. The dataset covers the period 1895-1993 on a 0.5?? latitude/longitude grid. Climate is represented at both monthly and daily timesteps. Variables are: precipitation, mininimum and maximum temperature, total incident solar radiation, daylight-period irradiance, vapor pressure, and daylight-period relative humidity. The dataset was derived from US Historical Climate Network (HCN), cooperative network, and snowpack telemetry (SNOTEL) monthly precipitation and mean minimum and maximum temperature station data. We employed techniques that rely on geostatistical and physical relationships to create the temporally and spatially complete dataset. We developed a local kriging prediction model to infill discontinuous and limited-length station records based on spatial autocorrelation structure of climate anomalies. A spatial interpolation model (PRISM) that accounts for physiographic controls was used to grid the infilled monthly station data. We implemented a stochastic weather generator (modified WGEN) to disaggregate the gridded monthly series to dailies. Radiation and humidity variables were estimated from the dailies using a physically-based empirical surface climate model (MTCLIM3). Derived datasets include a 100 yr model spin-up climate and a historical Palmer Drought Severity Index (PDSI) dataset. The VEMAP dataset exhibits statistically significant trends in temperature, precipitation, solar radiation, vapor pressure, and PDSI for US National Assessment regions. The historical climate and companion datasets are available online at data archive centers. ?? Inter-Research 2004.

  16. Developing High-resolution Soil Database for Regional Crop Modeling in East Africa

    NASA Astrophysics Data System (ADS)

    Han, E.; Ines, A. V. M.

    2014-12-01

    The most readily available soil data for regional crop modeling in Africa is the World Inventory of Soil Emission potentials (WISE) dataset, which has 1125 soil profiles for the world, but does not extensively cover countries Ethiopia, Kenya, Uganda and Tanzania in East Africa. Another dataset available is the HC27 (Harvest Choice by IFPRI) in a gridded format (10km) but composed of generic soil profiles based on only three criteria (texture, rooting depth, and organic carbon content). In this paper, we present a development and application of a high-resolution (1km), gridded soil database for regional crop modeling in East Africa. Basic soil information is extracted from Africa Soil Information Service (AfSIS), which provides essential soil properties (bulk density, soil organic carbon, soil PH and percentages of sand, silt and clay) for 6 different standardized soil layers (5, 15, 30, 60, 100 and 200 cm) in 1km resolution. Soil hydraulic properties (e.g., field capacity and wilting point) are derived from the AfSIS soil dataset using well-proven pedo-transfer functions and are customized for DSSAT-CSM soil data requirements. The crop model is used to evaluate crop yield forecasts using the new high resolution soil database and compared with WISE and HC27. In this paper we will present also the results of DSSAT loosely coupled with a hydrologic model (VIC) to assimilate root-zone soil moisture. Creating a grid-based soil database, which provides a consistent soil input for two different models (DSSAT and VIC) is a critical part of this work. The created soil database is expected to contribute to future applications of DSSAT crop simulation in East Africa where food security is highly vulnerable.

  17. CyanOmics: an integrated database of omics for the model cyanobacterium Synechococcus sp. PCC 7002.

    PubMed

    Yang, Yaohua; Feng, Jie; Li, Tao; Ge, Feng; Zhao, Jindong

    2015-01-01

    Cyanobacteria are an important group of organisms that carry out oxygenic photosynthesis and play vital roles in both the carbon and nitrogen cycles of the Earth. The annotated genome of Synechococcus sp. PCC 7002, as an ideal model cyanobacterium, is available. A series of transcriptomic and proteomic studies of Synechococcus sp. PCC 7002 cells grown under different conditions have been reported. However, no database of such integrated omics studies has been constructed. Here we present CyanOmics, a database based on the results of Synechococcus sp. PCC 7002 omics studies. CyanOmics comprises one genomic dataset, 29 transcriptomic datasets and one proteomic dataset and should prove useful for systematic and comprehensive analysis of all those data. Powerful browsing and searching tools are integrated to help users directly access information of interest with enhanced visualization of the analytical results. Furthermore, Blast is included for sequence-based similarity searching and Cluster 3.0, as well as the R hclust function is provided for cluster analyses, to increase CyanOmics's usefulness. To the best of our knowledge, it is the first integrated omics analysis database for cyanobacteria. This database should further understanding of the transcriptional patterns, and proteomic profiling of Synechococcus sp. PCC 7002 and other cyanobacteria. Additionally, the entire database framework is applicable to any sequenced prokaryotic genome and could be applied to other integrated omics analysis projects. Database URL: http://lag.ihb.ac.cn/cyanomics. © The Author(s) 2015. Published by Oxford University Press.

  18. The BioGRID interaction database: 2017 update

    PubMed Central

    Chatr-aryamontri, Andrew; Oughtred, Rose; Boucher, Lorrie; Rust, Jennifer; Chang, Christie; Kolas, Nadine K.; O'Donnell, Lara; Oster, Sara; Theesfeld, Chandra; Sellam, Adnane; Stark, Chris; Breitkreutz, Bobby-Joe; Dolinski, Kara; Tyers, Mike

    2017-01-01

    The Biological General Repository for Interaction Datasets (BioGRID: https://thebiogrid.org) is an open access database dedicated to the annotation and archival of protein, genetic and chemical interactions for all major model organism species and humans. As of September 2016 (build 3.4.140), the BioGRID contains 1 072 173 genetic and protein interactions, and 38 559 post-translational modifications, as manually annotated from 48 114 publications. This dataset represents interaction records for 66 model organisms and represents a 30% increase compared to the previous 2015 BioGRID update. BioGRID curates the biomedical literature for major model organism species, including humans, with a recent emphasis on central biological processes and specific human diseases. To facilitate network-based approaches to drug discovery, BioGRID now incorporates 27 501 chemical–protein interactions for human drug targets, as drawn from the DrugBank database. A new dynamic interaction network viewer allows the easy navigation and filtering of all genetic and protein interaction data, as well as for bioactive compounds and their established targets. BioGRID data are directly downloadable without restriction in a variety of standardized formats and are freely distributed through partner model organism databases and meta-databases. PMID:27980099

  19. Description of the National Hydrologic Model for use with the Precipitation-Runoff Modeling System (PRMS)

    USGS Publications Warehouse

    Regan, R. Steven; Markstrom, Steven L.; Hay, Lauren E.; Viger, Roland J.; Norton, Parker A.; Driscoll, Jessica M.; LaFontaine, Jacob H.

    2018-01-08

    This report documents several components of the U.S. Geological Survey National Hydrologic Model of the conterminous United States for use with the Precipitation-Runoff Modeling System (PRMS). It provides descriptions of the (1) National Hydrologic Model, (2) Geospatial Fabric for National Hydrologic Modeling, (3) PRMS hydrologic simulation code, (4) parameters and estimation methods used to compute spatially and temporally distributed default values as required by PRMS, (5) National Hydrologic Model Parameter Database, and (6) model extraction tool named Bandit. The National Hydrologic Model Parameter Database contains values for all PRMS parameters used in the National Hydrologic Model. The methods and national datasets used to estimate all the PRMS parameters are described. Some parameter values are derived from characteristics of topography, land cover, soils, geology, and hydrography using traditional Geographic Information System methods. Other parameters are set to long-established default values and computation of initial values. Additionally, methods (statistical, sensitivity, calibration, and algebraic) were developed to compute parameter values on the basis of a variety of nationally-consistent datasets. Values in the National Hydrologic Model Parameter Database can periodically be updated on the basis of new parameter estimation methods and as additional national datasets become available. A companion ScienceBase resource provides a set of static parameter values as well as images of spatially-distributed parameters associated with PRMS states and fluxes for each Hydrologic Response Unit across the conterminuous United States.

  20. Planform: an application and database of graph-encoded planarian regenerative experiments.

    PubMed

    Lobo, Daniel; Malone, Taylor J; Levin, Michael

    2013-04-15

    Understanding the mechanisms governing the regeneration capabilities of many organisms is a fundamental interest in biology and medicine. An ever-increasing number of manipulation and molecular experiments are attempting to discover a comprehensive model for regeneration, with the planarian flatworm being one of the most important model species. Despite much effort, no comprehensive, constructive, mechanistic models exist yet, and it is now clear that computational tools are needed to mine this huge dataset. However, until now, there is no database of regenerative experiments, and the current genotype-phenotype ontologies and databases are based on textual descriptions, which are not understandable by computers. To overcome these difficulties, we present here Planform (Planarian formalization), a manually curated database and software tool for planarian regenerative experiments, based on a mathematical graph formalism. The database contains more than a thousand experiments from the main publications in the planarian literature. The software tool provides the user with a graphical interface to easily interact with and mine the database. The presented system is a valuable resource for the regeneration community and, more importantly, will pave the way for the application of novel artificial intelligence tools to extract knowledge from this dataset. The database and software tool are freely available at http://planform.daniel-lobo.com.

  1. Clinical Predictive Modeling Development and Deployment through FHIR Web Services.

    PubMed

    Khalilia, Mohammed; Choi, Myung; Henderson, Amelia; Iyengar, Sneha; Braunstein, Mark; Sun, Jimeng

    2015-01-01

    Clinical predictive modeling involves two challenging tasks: model development and model deployment. In this paper we demonstrate a software architecture for developing and deploying clinical predictive models using web services via the Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) standard. The services enable model development using electronic health records (EHRs) stored in OMOP CDM databases and model deployment for scoring individual patients through FHIR resources. The MIMIC2 ICU dataset and a synthetic outpatient dataset were transformed into OMOP CDM databases for predictive model development. The resulting predictive models are deployed as FHIR resources, which receive requests of patient information, perform prediction against the deployed predictive model and respond with prediction scores. To assess the practicality of this approach we evaluated the response and prediction time of the FHIR modeling web services. We found the system to be reasonably fast with one second total response time per patient prediction.

  2. Clinical Predictive Modeling Development and Deployment through FHIR Web Services

    PubMed Central

    Khalilia, Mohammed; Choi, Myung; Henderson, Amelia; Iyengar, Sneha; Braunstein, Mark; Sun, Jimeng

    2015-01-01

    Clinical predictive modeling involves two challenging tasks: model development and model deployment. In this paper we demonstrate a software architecture for developing and deploying clinical predictive models using web services via the Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) standard. The services enable model development using electronic health records (EHRs) stored in OMOP CDM databases and model deployment for scoring individual patients through FHIR resources. The MIMIC2 ICU dataset and a synthetic outpatient dataset were transformed into OMOP CDM databases for predictive model development. The resulting predictive models are deployed as FHIR resources, which receive requests of patient information, perform prediction against the deployed predictive model and respond with prediction scores. To assess the practicality of this approach we evaluated the response and prediction time of the FHIR modeling web services. We found the system to be reasonably fast with one second total response time per patient prediction. PMID:26958207

  3. The ChArMEx database

    NASA Astrophysics Data System (ADS)

    Ferré, Hélène; Belmahfoud, Nizar; Boichard, Jean-Luc; Brissebrat, Guillaume; Cloché, Sophie; Descloitres, Jacques; Fleury, Laurence; Focsa, Loredana; Henriot, Nicolas; Mière, Arnaud; Ramage, Karim; Vermeulen, Anne; Boulanger, Damien

    2015-04-01

    The Chemistry-Aerosol Mediterranean Experiment (ChArMEx, http://charmex.lsce.ipsl.fr/) aims at a scientific assessment of the present and future state of the atmospheric environment in the Mediterranean Basin, and of its impacts on the regional climate, air quality, and marine biogeochemistry. The project includes long term monitoring of environmental parameters , intensive field campaigns, use of satellite data and modelling studies. Therefore ChARMEx scientists produce and need to access a wide diversity of data. In this context, the objective of the database task is to organize data management, distribution system and services, such as facilitating the exchange of information and stimulating the collaboration between researchers within the ChArMEx community, and beyond. The database relies on a strong collaboration between ICARE, IPSL and OMP data centers and has been set up in the framework of the Mediterranean Integrated Studies at Regional And Locals Scales (MISTRALS) program data portal. ChArMEx data, either produced or used by the project, are documented and accessible through the database website: http://mistrals.sedoo.fr/ChArMEx. The website offers the usual but user-friendly functionalities: data catalog, user registration procedure, search tool to select and access data... The metadata (data description) are standardized, and comply with international standards (ISO 19115-19139; INSPIRE European Directive; Global Change Master Directory Thesaurus). A Digital Object Identifier (DOI) assignement procedure allows to automatically register the datasets, in order to make them easier to access, cite, reuse and verify. At present, the ChArMEx database contains about 120 datasets, including more than 80 in situ datasets (2012, 2013 and 2014 summer campaigns, background monitoring station of Ersa...), 25 model output sets (dust model intercomparison, MEDCORDEX scenarios...), a high resolution emission inventory over the Mediterranean... Many in situ datasets have been inserted in a relational database, in order to enable more accurate selection and download of different datasets in a shared format. Many dedicated satellite products (SEVIRI, TRIMM, PARASOL...) are processed and will soon be accessible through the database website. In order to meet the operational needs of the airborne and ground based observational teams during the ChArMEx campaigns, a day-to-day chart display website has been developed and operated: http://choc.sedoo.org. It offers a convenient way to browse weather conditions and chemical composition during the campaign periods. Every scientist is invited to visit the ChArMEx websites, to register and to request data. Feel free to contact charmex-database@sedoo.fr for any question.

  4. Improving the discoverability, accessibility, and citability of omics datasets: a case report.

    PubMed

    Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J

    2017-03-01

    Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  5. Large scale validation of the M5L lung CAD on heterogeneous CT datasets.

    PubMed

    Torres, E Lopez; Fiorina, E; Pennazio, F; Peroni, C; Saletta, M; Camarlinghi, N; Fantacci, M E; Cerello, P

    2015-04-01

    M5L, a fully automated computer-aided detection (CAD) system for the detection and segmentation of lung nodules in thoracic computed tomography (CT), is presented and validated on several image datasets. M5L is the combination of two independent subsystems, based on the Channeler Ant Model as a segmentation tool [lung channeler ant model (lungCAM)] and on the voxel-based neural approach. The lungCAM was upgraded with a scan equalization module and a new procedure to recover the nodules connected to other lung structures; its classification module, which makes use of a feed-forward neural network, is based of a small number of features (13), so as to minimize the risk of lacking generalization, which could be possible given the large difference between the size of the training and testing datasets, which contain 94 and 1019 CTs, respectively. The lungCAM (standalone) and M5L (combined) performance was extensively tested on 1043 CT scans from three independent datasets, including a detailed analysis of the full Lung Image Database Consortium/Image Database Resource Initiative database, which is not yet found in literature. The lungCAM and M5L performance is consistent across the databases, with a sensitivity of about 70% and 80%, respectively, at eight false positive findings per scan, despite the variable annotation criteria and acquisition and reconstruction conditions. A reduced sensitivity is found for subtle nodules and ground glass opacities (GGO) structures. A comparison with other CAD systems is also presented. The M5L performance on a large and heterogeneous dataset is stable and satisfactory, although the development of a dedicated module for GGOs detection could further improve it, as well as an iterative optimization of the training procedure. The main aim of the present study was accomplished: M5L results do not deteriorate when increasing the dataset size, making it a candidate for supporting radiologists on large scale screenings and clinical programs.

  6. Creating a standardized watersheds database for the Lower Rio Grande/Río Bravo, Texas

    USGS Publications Warehouse

    Brown, J.R.; Ulery, Randy L.; Parcher, Jean W.

    2000-01-01

    This report describes the creation of a large-scale watershed database for the lower Rio Grande/Río Bravo Basin in Texas. The watershed database includes watersheds delineated to all 1:24,000-scale mapped stream confluences and other hydrologically significant points, selected watershed characteristics, and hydrologic derivative datasets.Computer technology allows generation of preliminary watershed boundaries in a fraction of the time needed for manual methods. This automated process reduces development time and results in quality improvements in watershed boundaries and characteristics. These data can then be compiled in a permanent database, eliminating the time-consuming step of data creation at the beginning of a project and providing a stable base dataset that can give users greater confidence when further subdividing watersheds.A standardized dataset of watershed characteristics is a valuable contribution to the understanding and management of natural resources. Vertical integration of the input datasets used to automatically generate watershed boundaries is crucial to the success of such an effort. The optimum situation would be to use the digital orthophoto quadrangles as the source of all the input datasets. While the hydrographic data from the digital line graphs can be revised to match the digital orthophoto quadrangles, hypsography data cannot be revised to match the digital orthophoto quadrangles. Revised hydrography from the digital orthophoto quadrangle should be used to create an updated digital elevation model that incorporates the stream channels as revised from the digital orthophoto quadrangle. Computer-generated, standardized watersheds that are vertically integrated with existing digital line graph hydrographic data will continue to be difficult to create until revisions can be made to existing source datasets. Until such time, manual editing will be necessary to make adjustments for man-made features and changes in the natural landscape that are not reflected in the digital elevation model data.

  7. Creating a standardized watersheds database for the lower Rio Grande/Rio Bravo, Texas

    USGS Publications Warehouse

    Brown, Julie R.; Ulery, Randy L.; Parcher, Jean W.

    2000-01-01

    This report describes the creation of a large-scale watershed database for the lower Rio Grande/Rio Bravo Basin in Texas. The watershed database includes watersheds delineated to all 1:24,000-scale mapped stream confluences and other hydrologically significant points, selected watershed characteristics, and hydrologic derivative datasets. Computer technology allows generation of preliminary watershed boundaries in a fraction of the time needed for manual methods. This automated process reduces development time and results in quality improvements in watershed boundaries and characteristics. These data can then be compiled in a permanent database, eliminating the time-consuming step of data creation at the beginning of a project and providing a stable base dataset that can give users greater confidence when further subdividing watersheds. A standardized dataset of watershed characteristics is a valuable contribution to the understanding and management of natural resources. Vertical integration of the input datasets used to automatically generate watershed boundaries is crucial to the success of such an effort. The optimum situation would be to use the digital orthophoto quadrangles as the source of all the input datasets. While the hydrographic data from the digital line graphs can be revised to match the digital orthophoto quadrangles, hypsography data cannot be revised to match the digital orthophoto quadrangles. Revised hydrography from the digital orthophoto quadrangle should be used to create an updated digital elevation model that incorporates the stream channels as revised from the digital orthophoto quadrangle. Computer-generated, standardized watersheds that are vertically integrated with existing digital line graph hydrographic data will continue to be difficult to create until revisions can be made to existing source datasets. Until such time, manual editing will be necessary to make adjustments for man-made features and changes in the natural landscape that are not reflected in the digital elevation model data.

  8. Hydrologic Derivatives for Modeling and Analysis—A new global high-resolution database

    USGS Publications Warehouse

    Verdin, Kristine L.

    2017-07-17

    The U.S. Geological Survey has developed a new global high-resolution hydrologic derivative database. Loosely modeled on the HYDRO1k database, this new database, entitled Hydrologic Derivatives for Modeling and Analysis, provides comprehensive and consistent global coverage of topographically derived raster layers (digital elevation model data, flow direction, flow accumulation, slope, and compound topographic index) and vector layers (streams and catchment boundaries). The coverage of the data is global, and the underlying digital elevation model is a hybrid of three datasets: HydroSHEDS (Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales), GMTED2010 (Global Multi-resolution Terrain Elevation Data 2010), and the SRTM (Shuttle Radar Topography Mission). For most of the globe south of 60°N., the raster resolution of the data is 3 arc-seconds, corresponding to the resolution of the SRTM. For the areas north of 60°N., the resolution is 7.5 arc-seconds (the highest resolution of the GMTED2010 dataset) except for Greenland, where the resolution is 30 arc-seconds. The streams and catchments are attributed with Pfafstetter codes, based on a hierarchical numbering system, that carry important topological information. This database is appropriate for use in continental-scale modeling efforts. The work described in this report was conducted by the U.S. Geological Survey in cooperation with the National Aeronautics and Space Administration Goddard Space Flight Center.

  9. The ChArMEx database

    NASA Astrophysics Data System (ADS)

    Ferré, Helene; Belmahfoud, Nizar; Boichard, Jean-Luc; Brissebrat, Guillaume; Descloitres, Jacques; Fleury, Laurence; Focsa, Loredana; Henriot, Nicolas; Mastrorillo, Laurence; Mière, Arnaud; Vermeulen, Anne

    2014-05-01

    The Chemistry-Aerosol Mediterranean Experiment (ChArMEx, http://charmex.lsce.ipsl.fr/) aims at a scientific assessment of the present and future state of the atmospheric environment in the Mediterranean Basin, and of its impacts on the regional climate, air quality, and marine biogeochemistry. The project includes long term monitoring of environmental parameters, intensive field campaigns, use of satellite data and modelling studies. Therefore ChARMEx scientists produce and need to access a wide diversity of data. In this context, the objective of the database task is to organize data management, distribution system and services, such as facilitating the exchange of information and stimulating the collaboration between researchers within the ChArMEx community, and beyond. The database relies on a strong collaboration between OMP and ICARE data centres and has been set up in the framework of the Mediterranean Integrated Studies at Regional And Locals Scales (MISTRALS) program data portal. All the data produced by or of interest for the ChArMEx community will be documented in the data catalogue and accessible through the database website: http://mistrals.sedoo.fr/ChArMEx. At present, the ChArMEx database contains about 75 datasets, including 50 in situ datasets (2012 and 2013 campaigns, Ersa background monitoring station), 25 model outputs (dust model intercomparison, MEDCORDEX scenarios), and a high resolution emission inventory over the Mediterranean. Many in situ datasets have been inserted in a relational database, in order to enable more accurate data selection and download of different datasets in a shared format. The database website offers different tools: - A registration procedure which enables any scientist to accept the data policy and apply for a user database account. - A data catalogue that complies with metadata international standards (ISO 19115-19139; INSPIRE European Directive; Global Change Master Directory Thesaurus). - Metadata forms to document observations or products that will be provided to the database. - A search tool to browse the catalogue using thematic, geographic and/or temporal criteria. - A shopping-cart web interface to order in situ data files. - A web interface to select and access to homogenized datasets. Interoperability between the two data centres is being set up using the OPEnDAP protocol. The data portal will soon propose a user-friendly access to satellite products managed by the ICARE data centre (SEVIRI, TRIMM, PARASOL...). In order to meet the operational needs of the airborne and ground based observational teams during the ChArMEx 2012 and 2013 campaigns, a day-to-day chart and report display website has been developed too: http://choc.sedoo.org. It offers a convenient way to browse weather conditions and chemical composition during the campaign periods.

  10. A global gas flaring black carbon emission rate dataset from 1994 to 2012

    PubMed Central

    Huang, Kan; Fu, Joshua S.

    2016-01-01

    Global flaring of associated petroleum gas is a potential emission source of particulate matters (PM) and could be notable in some specific regions that are in urgent need of mitigation. PM emitted from gas flaring is mainly in the form of black carbon (BC), which is a strong short-lived climate forcer. However, BC from gas flaring has been neglected in most global/regional emission inventories and is rarely considered in climate modeling. Here we present a global gas flaring BC emission rate dataset for the period 1994–2012 in a machine-readable format. We develop a region-dependent gas flaring BC emission factor database based on the chemical compositions of associated petroleum gas at various oil fields. Gas flaring BC emission rates are estimated using this emission factor database and flaring volumes retrieved from satellite imagery. Evaluation using a chemical transport model suggests that consideration of gas flaring emissions can improve model performance. This dataset will benefit and inform a broad range of research topics, e.g., carbon budget, air quality/climate modeling, and environmental/human exposure. PMID:27874852

  11. A global gas flaring black carbon emission rate dataset from 1994 to 2012

    NASA Astrophysics Data System (ADS)

    Huang, Kan; Fu, Joshua S.

    2016-11-01

    Global flaring of associated petroleum gas is a potential emission source of particulate matters (PM) and could be notable in some specific regions that are in urgent need of mitigation. PM emitted from gas flaring is mainly in the form of black carbon (BC), which is a strong short-lived climate forcer. However, BC from gas flaring has been neglected in most global/regional emission inventories and is rarely considered in climate modeling. Here we present a global gas flaring BC emission rate dataset for the period 1994-2012 in a machine-readable format. We develop a region-dependent gas flaring BC emission factor database based on the chemical compositions of associated petroleum gas at various oil fields. Gas flaring BC emission rates are estimated using this emission factor database and flaring volumes retrieved from satellite imagery. Evaluation using a chemical transport model suggests that consideration of gas flaring emissions can improve model performance. This dataset will benefit and inform a broad range of research topics, e.g., carbon budget, air quality/climate modeling, and environmental/human exposure.

  12. Benchmarking protein classification algorithms via supervised cross-validation.

    PubMed

    Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor

    2008-04-24

    Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.

  13. Developing a Global Database of Historic Flood Events to Support Machine Learning Flood Prediction in Google Earth Engine

    NASA Astrophysics Data System (ADS)

    Tellman, B.; Sullivan, J.; Kettner, A.; Brakenridge, G. R.; Slayback, D. A.; Kuhn, C.; Doyle, C.

    2016-12-01

    There is an increasing need to understand flood vulnerability as the societal and economic effects of flooding increases. Risk models from insurance companies and flood models from hydrologists must be calibrated based on flood observations in order to make future predictions that can improve planning and help societies reduce future disasters. Specifically, to improve these models both traditional methods of flood prediction from physically based models as well as data-driven techniques, such as machine learning, require spatial flood observation to validate model outputs and quantify uncertainty. A key dataset that is missing for flood model validation is a global historical geo-database of flood event extents. Currently, the most advanced database of historical flood extent is hosted and maintained at the Dartmouth Flood Observatory (DFO) that has catalogued 4320 floods (1985-2015) but has only mapped 5% of these floods. We are addressing this data gap by mapping the inventory of floods in the DFO database to create a first-of- its-kind, comprehensive, global and historical geospatial database of flood events. To do so, we combine water detection algorithms on MODIS and Landsat 5,7 and 8 imagery in Google Earth Engine to map discrete flood events. The created database will be available in the Earth Engine Catalogue for download by country, region, or time period. This dataset can be leveraged for new data-driven hydrologic modeling using machine learning algorithms in Earth Engine's highly parallelized computing environment, and we will show examples for New York and Senegal.

  14. Illuminating the Depths of the MagIC (Magnetics Information Consortium) Database

    NASA Astrophysics Data System (ADS)

    Koppers, A. A. P.; Minnett, R.; Jarboe, N.; Jonestrask, L.; Tauxe, L.; Constable, C.

    2015-12-01

    The Magnetics Information Consortium (http://earthref.org/MagIC/) is a grass-roots cyberinfrastructure effort envisioned by the paleo-, geo-, and rock magnetic scientific community. Its mission is to archive their wealth of peer-reviewed raw data and interpretations from magnetics studies on natural and synthetic samples. Many of these valuable data are legacy datasets that were never published in their entirety, some resided in other databases that are no longer maintained, and others were never digitized from the field notebooks and lab work. Due to the volume of data collected, most studies, modern and legacy, only publish the interpreted results and, occasionally, a subset of the raw data. MagIC is making an extraordinary effort to archive these data in a single data model, including the raw instrument measurements if possible. This facilitates the reproducibility of the interpretations, the re-interpretation of the raw data as the community introduces new techniques, and the compilation of heterogeneous datasets that are otherwise distributed across multiple formats and physical locations. MagIC has developed tools to assist the scientific community in many stages of their workflow. Contributors easily share studies (in a private mode if so desired) in the MagIC Database with colleagues and reviewers prior to publication, publish the data online after the study is peer reviewed, and visualize their data in the context of the rest of the contributions to the MagIC Database. From organizing their data in the MagIC Data Model with an online editable spreadsheet, to validating the integrity of the dataset with automated plots and statistics, MagIC is continually lowering the barriers to transforming dark data into transparent and reproducible datasets. Additionally, this web application generalizes to other databases in MagIC's umbrella website (EarthRef.org) so that the Geochemical Earth Reference Model (http://earthref.org/GERM/) portal, Seamount Biogeosciences Network (http://earthref.org/SBN/), EarthRef Digital Archive (http://earthref.org/ERDA/) and EarthRef Reference Database (http://earthref.org/ERR/) benefit from its development.

  15. dbMDEGA: a database for meta-analysis of differentially expressed genes in autism spectrum disorder.

    PubMed

    Zhang, Shuyun; Deng, Libin; Jia, Qiyue; Huang, Shaoting; Gu, Junwang; Zhou, Fankun; Gao, Meng; Sun, Xinyi; Feng, Chang; Fan, Guangqin

    2017-11-16

    Autism spectrum disorders (ASD) are hereditary, heterogeneous and biologically complex neurodevelopmental disorders. Individual studies on gene expression in ASD cannot provide clear consensus conclusions. Therefore, a systematic review to synthesize the current findings from brain tissues and a search tool to share the meta-analysis results are urgently needed. Here, we conducted a meta-analysis of brain gene expression profiles in the current reported human ASD expression datasets (with 84 frozen male cortex samples, 17 female cortex samples, 32 cerebellum samples and 4 formalin fixed samples) and knock-out mouse ASD model expression datasets (with 80 collective brain samples). Then, we applied R language software and developed an interactive shared and updated database (dbMDEGA) displaying the results of meta-analysis of data from ASD studies regarding differentially expressed genes (DEGs) in the brain. This database, dbMDEGA ( https://dbmdega.shinyapps.io/dbMDEGA/ ), is a publicly available web-portal for manual annotation and visualization of DEGs in the brain from data from ASD studies. This database uniquely presents meta-analysis values and homologous forest plots of DEGs in brain tissues. Gene entries are annotated with meta-values, statistical values and forest plots of DEGs in brain samples. This database aims to provide searchable meta-analysis results based on the current reported brain gene expression datasets of ASD to help detect candidate genes underlying this disorder. This new analytical tool may provide valuable assistance in the discovery of DEGs and the elucidation of the molecular pathogenicity of ASD. This database model may be replicated to study other disorders.

  16. THE ART OF DATA MINING THE MINEFIELDS OF TOXICITY ...

    EPA Pesticide Factsheets

    Toxicity databases have a special role in predictive toxicology, providing ready access to historical information throughout the workflow of discovery, development, and product safety processes in drug development as well as in review by regulatory agencies. To provide accurate information within a hypothesesbuilding environment, the content of the databases needs to be rigorously modeled using standards and controlled vocabulary. The utilitarian purposes of databases widely vary, ranging from a source for (Q)SAR datasets for modelers to a basis for

  17. Sensitivity of a numerical wave model on wind re-analysis datasets

    NASA Astrophysics Data System (ADS)

    Lavidas, George; Venugopal, Vengatesan; Friedrich, Daniel

    2017-03-01

    Wind is the dominant process for wave generation. Detailed evaluation of metocean conditions strengthens our understanding of issues concerning potential offshore applications. However, the scarcity of buoys and high cost of monitoring systems pose a barrier to properly defining offshore conditions. Through use of numerical wave models, metocean conditions can be hindcasted and forecasted providing reliable characterisations. This study reports the sensitivity of wind inputs on a numerical wave model for the Scottish region. Two re-analysis wind datasets with different spatio-temporal characteristics are used, the ERA-Interim Re-Analysis and the CFSR-NCEP Re-Analysis dataset. Different wind products alter results, affecting the accuracy obtained. The scope of this study is to assess different available wind databases and provide information concerning the most appropriate wind dataset for the specific region, based on temporal, spatial and geographic terms for wave modelling and offshore applications. Both wind input datasets delivered results from the numerical wave model with good correlation. Wave results by the 1-h dataset have higher peaks and lower biases, in expense of a high scatter index. On the other hand, the 6-h dataset has lower scatter but higher biases. The study shows how wind dataset affects the numerical wave modelling performance, and that depending on location and study needs, different wind inputs should be considered.

  18. SchizConnect: Mediating Neuroimaging Databases on Schizophrenia and Related Disorders for Large-Scale Integration

    PubMed Central

    Wang, Lei; Alpert, Kathryn I.; Calhoun, Vince D.; Cobia, Derin J.; Keator, David B.; King, Margaret D.; Kogan, Alexandr; Landis, Drew; Tallis, Marcelo; Turner, Matthew D.; Potkin, Steven G.; Turner, Jessica A.; Ambite, Jose Luis

    2015-01-01

    SchizConnect (www.schizconnect.org) is built to address the issues of multiple data repositories in schizophrenia neuroimaging studies. It includes a level of mediation—translating across data sources—so that the user can place one query, e.g. for diffusion images from male individuals with schizophrenia, and find out from across participating data sources how many datasets there are, as well as downloading the imaging and related data. The current version handles the Data Usage Agreements across different studies, as well as interpreting database-specific terminologies into a common framework. New data repositories can also be mediated to bring immediate access to existing datasets. Compared with centralized, upload data sharing models, SchizConnect is a unique, virtual database with a focus on schizophrenia and related disorders that can mediate live data as information are being updated at each data source. It is our hope that SchizConnect can facilitate testing new hypotheses through aggregated datasets, promoting discovery related to the mechanisms underlying schizophrenic dysfunction. PMID:26142271

  19. The Transcriptome Analysis and Comparison Explorer--T-ACE: a platform-independent, graphical tool to process large RNAseq datasets of non-model organisms.

    PubMed

    Philipp, E E R; Kraemer, L; Mountfort, D; Schilhabel, M; Schreiber, S; Rosenstiel, P

    2012-03-15

    Next generation sequencing (NGS) technologies allow a rapid and cost-effective compilation of large RNA sequence datasets in model and non-model organisms. However, the storage and analysis of transcriptome information from different NGS platforms is still a significant bottleneck, leading to a delay in data dissemination and subsequent biological understanding. Especially database interfaces with transcriptome analysis modules going beyond mere read counts are missing. Here, we present the Transcriptome Analysis and Comparison Explorer (T-ACE), a tool designed for the organization and analysis of large sequence datasets, and especially suited for transcriptome projects of non-model organisms with little or no a priori sequence information. T-ACE offers a TCL-based interface, which accesses a PostgreSQL database via a php-script. Within T-ACE, information belonging to single sequences or contigs, such as annotation or read coverage, is linked to the respective sequence and immediately accessible. Sequences and assigned information can be searched via keyword- or BLAST-search. Additionally, T-ACE provides within and between transcriptome analysis modules on the level of expression, GO terms, KEGG pathways and protein domains. Results are visualized and can be easily exported for external analysis. We developed T-ACE for laboratory environments, which have only a limited amount of bioinformatics support, and for collaborative projects in which different partners work on the same dataset from different locations or platforms (Windows/Linux/MacOS). For laboratories with some experience in bioinformatics and programming, the low complexity of the database structure and open-source code provides a framework that can be customized according to the different needs of the user and transcriptome project.

  20. A modern plant-climate research dataset for modelling eastern North American plant taxa.

    NASA Astrophysics Data System (ADS)

    Gonzales, L. M.; Grimm, E. C.; Williams, J. W.; Nordheim, E. V.

    2008-12-01

    Continental-scale modern pollen-climate data repositories are a primary data source for paleoclimate reconstructions. However, these repositories can contain artifacts, such as records from different depositional environment and replicate records, that can influence the observed pollen-climate relationships as well as the paleoclimate reconstructions derived from these relationships. In this paper, we address the issues related to these artifacts as we define the methods used to create a research dataset from the North American Modern Pollen Database (Whitmore et al., 2005). Additionally, we define the methods used to select the environmental variables that are best for modeling regional pollen-climate relationships from the research dataset. Because the depositional environment determines the relative strengths of the local and regional pollen signals, combining data from different depositional environments results in pollen abundances that can be influenced by the local pollen signal. Replicate records in pollen-climate datasets can skew pollen-climate relationships by causing an over- or under- representation of pollen abundances in climate space. When these two artifacts are combined, the errors introduced into pollen-climate relationship modeling are compounded. The research dataset we present consists of 2,613 records in eastern North America, of which 70.9% are lacustrine sites. We demonstrate that this new research database improves upon the modeling of regional pollen-climate relationships for eastern North American taxa. The research dataset encompasses the majority of the temperature and mean summer precipitation ranges of the NAMPD's climatic range and 40% of its mean winter precipitation range. NAMPD sites with higher winter precipitation are located along the northwestern coast of North America where a rainshadow effect produces abundant winter precipitation. We present our analysis of the research dataset for use in paleoclimate reconstructions, and recommend that mean winter and summer temperature and precipitation variables be used for pollen-climate relationship modeling.

  1. A database of marine phytoplankton abundance, biomass and species composition in Australian waters

    NASA Astrophysics Data System (ADS)

    Davies, Claire H.; Coughlan, Alex; Hallegraeff, Gustaaf; Ajani, Penelope; Armbrecht, Linda; Atkins, Natalia; Bonham, Prudence; Brett, Steve; Brinkman, Richard; Burford, Michele; Clementson, Lesley; Coad, Peter; Coman, Frank; Davies, Diana; Dela-Cruz, Jocelyn; Devlin, Michelle; Edgar, Steven; Eriksen, Ruth; Furnas, Miles; Hassler, Christel; Hill, David; Holmes, Michael; Ingleton, Tim; Jameson, Ian; Leterme, Sophie C.; Lønborg, Christian; McLaughlin, James; McEnnulty, Felicity; McKinnon, A. David; Miller, Margaret; Murray, Shauna; Nayar, Sasi; Patten, Renee; Pritchard, Tim; Proctor, Roger; Purcell-Meyerink, Diane; Raes, Eric; Rissik, David; Ruszczyk, Jason; Slotwinski, Anita; Swadling, Kerrie M.; Tattersall, Katherine; Thompson, Peter; Thomson, Paul; Tonks, Mark; Trull, Thomas W.; Uribe-Palomino, Julian; Waite, Anya M.; Yauwenas, Rouna; Zammit, Anthony; Richardson, Anthony J.

    2016-06-01

    There have been many individual phytoplankton datasets collected across Australia since the mid 1900s, but most are unavailable to the research community. We have searched archives, contacted researchers, and scanned the primary and grey literature to collate 3,621,847 records of marine phytoplankton species from Australian waters from 1844 to the present. Many of these are small datasets collected for local questions, but combined they provide over 170 years of data on phytoplankton communities in Australian waters. Units and taxonomy have been standardised, obviously erroneous data removed, and all metadata included. We have lodged this dataset with the Australian Ocean Data Network (http://portal.aodn.org.au/) allowing public access. The Australian Phytoplankton Database will be invaluable for global change studies, as it allows analysis of ecological indicators of climate change and eutrophication (e.g., changes in distribution; diatom:dinoflagellate ratios). In addition, the standardised conversion of abundance records to biomass provides modellers with quantifiable data to initialise and validate ecosystem models of lower marine trophic levels.

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lopez Torres, E., E-mail: Ernesto.Lopez.Torres@cern.ch, E-mail: cerello@to.infn.it; Fiorina, E.; Pennazio, F.

    Purpose: M5L, a fully automated computer-aided detection (CAD) system for the detection and segmentation of lung nodules in thoracic computed tomography (CT), is presented and validated on several image datasets. Methods: M5L is the combination of two independent subsystems, based on the Channeler Ant Model as a segmentation tool [lung channeler ant model (lungCAM)] and on the voxel-based neural approach. The lungCAM was upgraded with a scan equalization module and a new procedure to recover the nodules connected to other lung structures; its classification module, which makes use of a feed-forward neural network, is based of a small number ofmore » features (13), so as to minimize the risk of lacking generalization, which could be possible given the large difference between the size of the training and testing datasets, which contain 94 and 1019 CTs, respectively. The lungCAM (standalone) and M5L (combined) performance was extensively tested on 1043 CT scans from three independent datasets, including a detailed analysis of the full Lung Image Database Consortium/Image Database Resource Initiative database, which is not yet found in literature. Results: The lungCAM and M5L performance is consistent across the databases, with a sensitivity of about 70% and 80%, respectively, at eight false positive findings per scan, despite the variable annotation criteria and acquisition and reconstruction conditions. A reduced sensitivity is found for subtle nodules and ground glass opacities (GGO) structures. A comparison with other CAD systems is also presented. Conclusions: The M5L performance on a large and heterogeneous dataset is stable and satisfactory, although the development of a dedicated module for GGOs detection could further improve it, as well as an iterative optimization of the training procedure. The main aim of the present study was accomplished: M5L results do not deteriorate when increasing the dataset size, making it a candidate for supporting radiologists on large scale screenings and clinical programs.« less

  3. Spatial aspects of building and population exposure data and their implications for global earthquake exposure modeling

    USGS Publications Warehouse

    Dell’Acqua, F.; Gamba, P.; Jaiswal, K.

    2012-01-01

    This paper discusses spatial aspects of the global exposure dataset and mapping needs for earthquake risk assessment. We discuss this in the context of development of a Global Exposure Database for the Global Earthquake Model (GED4GEM), which requires compilation of a multi-scale inventory of assets at risk, for example, buildings, populations, and economic exposure. After defining the relevant spatial and geographic scales of interest, different procedures are proposed to disaggregate coarse-resolution data, to map them, and if necessary to infer missing data by using proxies. We discuss the advantages and limitations of these methodologies and detail the potentials of utilizing remote-sensing data. The latter is used especially to homogenize an existing coarser dataset and, where possible, replace it with detailed information extracted from remote sensing using the built-up indicators for different environments. Present research shows that the spatial aspects of earthquake risk computation are tightly connected with the availability of datasets of the resolution necessary for producing sufficiently detailed exposure. The global exposure database designed by the GED4GEM project is able to manage datasets and queries of multiple spatial scales.

  4. Human Ageing Genomic Resources: new and updated databases

    PubMed Central

    Tacutu, Robi; Thornton, Daniel; Johnson, Emily; Budovsky, Arie; Barardo, Diogo; Craig, Thomas; Diana, Eugene; Lehmann, Gilad; Toren, Dmitri; Wang, Jingwei; Fraifeld, Vadim E

    2018-01-01

    Abstract In spite of a growing body of research and data, human ageing remains a poorly understood process. Over 10 years ago we developed the Human Ageing Genomic Resources (HAGR), a collection of databases and tools for studying the biology and genetics of ageing. Here, we present HAGR’s main functionalities, highlighting new additions and improvements. HAGR consists of six core databases: (i) the GenAge database of ageing-related genes, in turn composed of a dataset of >300 human ageing-related genes and a dataset with >2000 genes associated with ageing or longevity in model organisms; (ii) the AnAge database of animal ageing and longevity, featuring >4000 species; (iii) the GenDR database with >200 genes associated with the life-extending effects of dietary restriction; (iv) the LongevityMap database of human genetic association studies of longevity with >500 entries; (v) the DrugAge database with >400 ageing or longevity-associated drugs or compounds; (vi) the CellAge database with >200 genes associated with cell senescence. All our databases are manually curated by experts and regularly updated to ensure a high quality data. Cross-links across our databases and to external resources help researchers locate and integrate relevant information. HAGR is freely available online (http://genomics.senescence.info/). PMID:29121237

  5. Use of Patient Registries and Administrative Datasets for the Study of Pediatric Cancer

    PubMed Central

    Rice, Henry E.; Englum, Brian R.; Gulack, Brian C.; Adibe, Obinna O.; Tracy, Elizabeth T.; Kreissman, Susan G.; Routh, Jonathan C.

    2015-01-01

    Analysis of data from large administrative databases and patient registries is increasingly being used to study childhood cancer care, although the value of these data sources remains unclear to many clinicians. Interpretation of large databases requires a thorough understanding of how the dataset was designed, how data were collected, and how to assess data quality. This review will detail the role of administrative databases and registry databases for the study of childhood cancer, tools to maximize information from these datasets, and recommendations to improve the use of these databases for the study of pediatric oncology. PMID:25807938

  6. Global Precipitation Measurement: Methods, Datasets and Applications

    NASA Technical Reports Server (NTRS)

    Tapiador, Francisco; Turk, Francis J.; Petersen, Walt; Hou, Arthur Y.; Garcia-Ortega, Eduardo; Machado, Luiz, A. T.; Angelis, Carlos F.; Salio, Paola; Kidd, Chris; Huffman, George J.; hide

    2011-01-01

    This paper reviews the many aspects of precipitation measurement that are relevant to providing an accurate global assessment of this important environmental parameter. Methods discussed include ground data, satellite estimates and numerical models. First, the methods for measuring, estimating, and modeling precipitation are discussed. Then, the most relevant datasets gathering precipitation information from those three sources are presented. The third part of the paper illustrates a number of the many applications of those measurements and databases. The aim of the paper is to organize the many links and feedbacks between precipitation measurement, estimation and modeling, indicating the uncertainties and limitations of each technique in order to identify areas requiring further attention, and to show the limits within which datasets can be used.

  7. Generative Topographic Mapping (GTM): Universal Tool for Data Visualization, Structure-Activity Modeling and Dataset Comparison.

    PubMed

    Kireeva, N; Baskin, I I; Gaspar, H A; Horvath, D; Marcou, G; Varnek, A

    2012-04-01

    Here, the utility of Generative Topographic Maps (GTM) for data visualization, structure-activity modeling and database comparison is evaluated, on hand of subsets of the Database of Useful Decoys (DUD). Unlike other popular dimensionality reduction approaches like Principal Component Analysis, Sammon Mapping or Self-Organizing Maps, the great advantage of GTMs is providing data probability distribution functions (PDF), both in the high-dimensional space defined by molecular descriptors and in 2D latent space. PDFs for the molecules of different activity classes were successfully used to build classification models in the framework of the Bayesian approach. Because PDFs are represented by a mixture of Gaussian functions, the Bhattacharyya kernel has been proposed as a measure of the overlap of datasets, which leads to an elegant method of global comparison of chemical libraries. Copyright © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  8. The Regional Hydrologic Extremes Assessment System: A software framework for hydrologic modeling and data assimilation

    PubMed Central

    Das, Narendra; Stampoulis, Dimitrios; Ines, Amor; Fisher, Joshua B.; Granger, Stephanie; Kawata, Jessie; Han, Eunjin; Behrangi, Ali

    2017-01-01

    The Regional Hydrologic Extremes Assessment System (RHEAS) is a prototype software framework for hydrologic modeling and data assimilation that automates the deployment of water resources nowcasting and forecasting applications. A spatially-enabled database is a key component of the software that can ingest a suite of satellite and model datasets while facilitating the interfacing with Geographic Information System (GIS) applications. The datasets ingested are obtained from numerous space-borne sensors and represent multiple components of the water cycle. The object-oriented design of the software allows for modularity and extensibility, showcased here with the coupling of the core hydrologic model with a crop growth model. RHEAS can exploit multi-threading to scale with increasing number of processors, while the database allows delivery of data products and associated uncertainty through a variety of GIS platforms. A set of three example implementations of RHEAS in the United States and Kenya are described to demonstrate the different features of the system in real-world applications. PMID:28545077

  9. The Regional Hydrologic Extremes Assessment System: A software framework for hydrologic modeling and data assimilation.

    PubMed

    Andreadis, Konstantinos M; Das, Narendra; Stampoulis, Dimitrios; Ines, Amor; Fisher, Joshua B; Granger, Stephanie; Kawata, Jessie; Han, Eunjin; Behrangi, Ali

    2017-01-01

    The Regional Hydrologic Extremes Assessment System (RHEAS) is a prototype software framework for hydrologic modeling and data assimilation that automates the deployment of water resources nowcasting and forecasting applications. A spatially-enabled database is a key component of the software that can ingest a suite of satellite and model datasets while facilitating the interfacing with Geographic Information System (GIS) applications. The datasets ingested are obtained from numerous space-borne sensors and represent multiple components of the water cycle. The object-oriented design of the software allows for modularity and extensibility, showcased here with the coupling of the core hydrologic model with a crop growth model. RHEAS can exploit multi-threading to scale with increasing number of processors, while the database allows delivery of data products and associated uncertainty through a variety of GIS platforms. A set of three example implementations of RHEAS in the United States and Kenya are described to demonstrate the different features of the system in real-world applications.

  10. LIPS database with LIPService: a microscopic image database of intracellular structures in Arabidopsis guard cells.

    PubMed

    Higaki, Takumi; Kutsuna, Natsumaro; Hasezawa, Seiichiro

    2013-05-16

    Intracellular configuration is an important feature of cell status. Recent advances in microscopic imaging techniques allow us to easily obtain a large number of microscopic images of intracellular structures. In this circumstance, automated microscopic image recognition techniques are of extreme importance to future phenomics/visible screening approaches. However, there was no benchmark microscopic image dataset for intracellular organelles in a specified plant cell type. We previously established the Live Images of Plant Stomata (LIPS) database, a publicly available collection of optical-section images of various intracellular structures of plant guard cells, as a model system of environmental signal perception and transduction. Here we report recent updates to the LIPS database and the establishment of a database table, LIPService. We updated the LIPS dataset and established a new interface named LIPService to promote efficient inspection of intracellular structure configurations. Cell nuclei, microtubules, actin microfilaments, mitochondria, chloroplasts, endoplasmic reticulum, peroxisomes, endosomes, Golgi bodies, and vacuoles can be filtered using probe names or morphometric parameters such as stomatal aperture. In addition to the serial optical sectional images of the original LIPS database, new volume-rendering data for easy web browsing of three-dimensional intracellular structures have been released to allow easy inspection of their configurations or relationships with cell status/morphology. We also demonstrated the utility of the new LIPS image database for automated organelle recognition of images from another plant cell image database with image clustering analyses. The updated LIPS database provides a benchmark image dataset for representative intracellular structures in Arabidopsis guard cells. The newly released LIPService allows users to inspect the relationship between organellar three-dimensional configurations and morphometrical parameters.

  11. SAADA: Astronomical Databases Made Easier

    NASA Astrophysics Data System (ADS)

    Michel, L.; Nguyen, H. N.; Motch, C.

    2005-12-01

    Many astronomers wish to share datasets with their community but have not enough manpower to develop databases having the functionalities required for high-level scientific applications. The SAADA project aims at automatizing the creation and deployment process of such databases. A generic but scientifically relevant data model has been designed which allows one to build databases by providing only a limited number of product mapping rules. Databases created by SAADA rely on a relational database supporting JDBC and covered by a Java layer including a lot of generated code. Such databases can simultaneously host spectra, images, source lists and plots. Data are grouped in user defined collections whose content can be seen as one unique set per data type even if their formats differ. Datasets can be correlated one with each other using qualified links. These links help, for example, to handle the nature of a cross-identification (e.g., a distance or a likelihood) or to describe their scientific content (e.g., by associating a spectrum to a catalog entry). The SAADA query engine is based on a language well suited to the data model which can handle constraints on linked data, in addition to classical astronomical queries. These constraints can be applied on the linked objects (number, class and attributes) and/or on the link qualifier values. Databases created by SAADA are accessed through a rich WEB interface or a Java API. We are currently developing an inter-operability module implanting VO protocols.

  12. Advanced Neuropsychological Diagnostics Infrastructure (ANDI): A Normative Database Created from Control Datasets

    PubMed Central

    de Vent, Nathalie R.; Agelink van Rentergem, Joost A.; Schmand, Ben A.; Murre, Jaap M. J.; Huizenga, Hilde M.

    2016-01-01

    In the Advanced Neuropsychological Diagnostics Infrastructure (ANDI), datasets of several research groups are combined into a single database, containing scores on neuropsychological tests from healthy participants. For most popular neuropsychological tests the quantity, and range of these data surpasses that of traditional normative data, thereby enabling more accurate neuropsychological assessment. Because of the unique structure of the database, it facilitates normative comparison methods that were not feasible before, in particular those in which entire profiles of scores are evaluated. In this article, we describe the steps that were necessary to combine the separate datasets into a single database. These steps involve matching variables from multiple datasets, removing outlying values, determining the influence of demographic variables, and finding appropriate transformations to normality. Also, a brief description of the current contents of the ANDI database is given. PMID:27812340

  13. Advanced Neuropsychological Diagnostics Infrastructure (ANDI): A Normative Database Created from Control Datasets.

    PubMed

    de Vent, Nathalie R; Agelink van Rentergem, Joost A; Schmand, Ben A; Murre, Jaap M J; Huizenga, Hilde M

    2016-01-01

    In the Advanced Neuropsychological Diagnostics Infrastructure (ANDI), datasets of several research groups are combined into a single database, containing scores on neuropsychological tests from healthy participants. For most popular neuropsychological tests the quantity, and range of these data surpasses that of traditional normative data, thereby enabling more accurate neuropsychological assessment. Because of the unique structure of the database, it facilitates normative comparison methods that were not feasible before, in particular those in which entire profiles of scores are evaluated. In this article, we describe the steps that were necessary to combine the separate datasets into a single database. These steps involve matching variables from multiple datasets, removing outlying values, determining the influence of demographic variables, and finding appropriate transformations to normality. Also, a brief description of the current contents of the ANDI database is given.

  14. In-situ databases and comparison of ESA Ocean Colour Climate Change Initiative (OC-CCI) products with precursor data, towards an integrated approach for ocean colour validation and climate studies

    NASA Astrophysics Data System (ADS)

    Brotas, Vanda; Valente, André; Couto, André B.; Grant, Mike; Chuprin, Andrei; Jackson, Thomas; Groom, Steve; Sathyendranath, Shubha

    2014-05-01

    Ocean colour (OC) is an Oceanic Essential Climate Variable, which is used by climate modellers and researchers. The European Space Agency (ESA) Climate Change Initiative project, is the ESA response for the need of climate-quality satellite data, with the goal of providing stable, long-term, satellite-based ECV data products. The ESA Ocean Colour CCI focuses on the production of Ocean Colour ECV uses remote sensing reflectances to derive inherent optical properties and chlorophyll a concentration from ESA's MERIS (2002-2012) and NASA's SeaWiFS (1997 - 2010) and MODIS (2002-2012) sensor archives. This work presents an integrated approach by setting up a global database of in situ measurements and by inter-comparing OC-CCI products with pre-cursor datasets. The availability of in situ databases is fundamental for the validation of satellite derived ocean colour products. A global distribution in situ database was assembled, from several pre-existing datasets, with data spanning between 1997 and 2012. It includes in-situ measurements of remote sensing reflectances, concentration of chlorophyll-a, inherent optical properties and diffuse attenuation coefficient. The database is composed from observations of the following datasets: NOMAD, SeaBASS, MERMAID, AERONET-OC, BOUSSOLE and HOTS. The result was a merged dataset tuned for the validation of satellite-derived ocean colour products. This was an attempt to gather, homogenize and merge, a large high-quality bio-optical marine in situ data, as using all datasets in a single validation exercise increases the number of matchups and enhances the representativeness of different marine regimes. An inter-comparison analysis between OC-CCI chlorophyll-a product and satellite pre-cursor datasets was done with single missions and merged single mission products. Single mission datasets considered were SeaWiFS, MODIS-Aqua and MERIS; merged mission datasets were obtained from the GlobColour (GC) as well as the Making Earth Science Data Records for Use in Research Environments (MEaSUREs). OC-CCI product was found to be most similar to SeaWiFS record, and generally, the OC-CCI record was most similar to records derived from single mission than merged mission initiatives. Results suggest that CCI product is a more consistent dataset than other available merged mission initiatives. In conclusion, climate related science, requires long term data records to provide robust results, OC-CCI product proves to be a worthy data record for climate research, as it combines multi-sensor OC observations to provide a >15-year global error-characterized record.

  15. SchizConnect: Mediating neuroimaging databases on schizophrenia and related disorders for large-scale integration.

    PubMed

    Wang, Lei; Alpert, Kathryn I; Calhoun, Vince D; Cobia, Derin J; Keator, David B; King, Margaret D; Kogan, Alexandr; Landis, Drew; Tallis, Marcelo; Turner, Matthew D; Potkin, Steven G; Turner, Jessica A; Ambite, Jose Luis

    2016-01-01

    SchizConnect (www.schizconnect.org) is built to address the issues of multiple data repositories in schizophrenia neuroimaging studies. It includes a level of mediation--translating across data sources--so that the user can place one query, e.g. for diffusion images from male individuals with schizophrenia, and find out from across participating data sources how many datasets there are, as well as downloading the imaging and related data. The current version handles the Data Usage Agreements across different studies, as well as interpreting database-specific terminologies into a common framework. New data repositories can also be mediated to bring immediate access to existing datasets. Compared with centralized, upload data sharing models, SchizConnect is a unique, virtual database with a focus on schizophrenia and related disorders that can mediate live data as information is being updated at each data source. It is our hope that SchizConnect can facilitate testing new hypotheses through aggregated datasets, promoting discovery related to the mechanisms underlying schizophrenic dysfunction. Copyright © 2015 Elsevier Inc. All rights reserved.

  16. Cadastral Database Positional Accuracy Improvement

    NASA Astrophysics Data System (ADS)

    Hashim, N. M.; Omar, A. H.; Ramli, S. N. M.; Omar, K. M.; Din, N.

    2017-10-01

    Positional Accuracy Improvement (PAI) is the refining process of the geometry feature in a geospatial dataset to improve its actual position. This actual position relates to the absolute position in specific coordinate system and the relation to the neighborhood features. With the growth of spatial based technology especially Geographical Information System (GIS) and Global Navigation Satellite System (GNSS), the PAI campaign is inevitable especially to the legacy cadastral database. Integration of legacy dataset and higher accuracy dataset like GNSS observation is a potential solution for improving the legacy dataset. However, by merely integrating both datasets will lead to a distortion of the relative geometry. The improved dataset should be further treated to minimize inherent errors and fitting to the new accurate dataset. The main focus of this study is to describe a method of angular based Least Square Adjustment (LSA) for PAI process of legacy dataset. The existing high accuracy dataset known as National Digital Cadastral Database (NDCDB) is then used as bench mark to validate the results. It was found that the propose technique is highly possible for positional accuracy improvement of legacy spatial datasets.

  17. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments.

    PubMed

    Keuleers, Emmanuel; Balota, David A

    2015-01-01

    This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.

  18. An Evaluation of Database Solutions to Spatial Object Association

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kumar, V S; Kurc, T; Saltz, J

    2008-06-24

    Object association is a common problem encountered in many applications. Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two datasets based on their positions in a common spatial coordinate system--one of the datasets may correspond to a catalog of objects observed over time in a multi-dimensional domain; the other dataset may consist of objects observed in a snapshot of the domain at a time point. The use of database management systems to the solve the object association problem provides portability across different platforms and also greater flexibility. Increasingmore » dataset sizes in today's applications, however, have made object association a data/compute-intensive problem that requires targeted optimizations for efficient execution. In this work, we investigate how database-based crossmatch algorithms can be deployed on different database system architectures and evaluate the deployments to understand the impact of architectural choices on crossmatch performance and associated trade-offs. We investigate the execution of two crossmatch algorithms on (1) a parallel database system with active disk style processing capabilities, (2) a high-throughput network database (MySQL Cluster), and (3) shared-nothing databases with replication. We have conducted our study in the context of a large-scale astronomy application with real use-case scenarios.« less

  19. CyanoBase: the cyanobacteria genome database update 2010.

    PubMed

    Nakao, Mitsuteru; Okamoto, Shinobu; Kohara, Mitsuyo; Fujishiro, Tsunakazu; Fujisawa, Takatomo; Sato, Shusei; Tabata, Satoshi; Kaneko, Takakazu; Nakamura, Yasukazu

    2010-01-01

    CyanoBase (http://genome.kazusa.or.jp/cyanobase) is the genome database for cyanobacteria, which are model organisms for photosynthesis. The database houses cyanobacteria species information, complete genome sequences, genome-scale experiment data, gene information, gene annotations and mutant information. In this version, we updated these datasets and improved the navigation and the visual display of the data views. In addition, a web service API now enables users to retrieve the data in various formats with other tools, seamlessly.

  20. A database of marine phytoplankton abundance, biomass and species composition in Australian waters

    PubMed Central

    Davies, Claire H.; Coughlan, Alex; Hallegraeff, Gustaaf; Ajani, Penelope; Armbrecht, Linda; Atkins, Natalia; Bonham, Prudence; Brett, Steve; Brinkman, Richard; Burford, Michele; Clementson, Lesley; Coad, Peter; Coman, Frank; Davies, Diana; Dela-Cruz, Jocelyn; Devlin, Michelle; Edgar, Steven; Eriksen, Ruth; Furnas, Miles; Hassler, Christel; Hill, David; Holmes, Michael; Ingleton, Tim; Jameson, Ian; Leterme, Sophie C.; Lønborg, Christian; McLaughlin, James; McEnnulty, Felicity; McKinnon, A. David; Miller, Margaret; Murray, Shauna; Nayar, Sasi; Patten, Renee; Pritchard, Tim; Proctor, Roger; Purcell-Meyerink, Diane; Raes, Eric; Rissik, David; Ruszczyk, Jason; Slotwinski, Anita; Swadling, Kerrie M.; Tattersall, Katherine; Thompson, Peter; Thomson, Paul; Tonks, Mark; Trull, Thomas W.; Uribe-Palomino, Julian; Waite, Anya M.; Yauwenas, Rouna; Zammit, Anthony; Richardson, Anthony J.

    2016-01-01

    There have been many individual phytoplankton datasets collected across Australia since the mid 1900s, but most are unavailable to the research community. We have searched archives, contacted researchers, and scanned the primary and grey literature to collate 3,621,847 records of marine phytoplankton species from Australian waters from 1844 to the present. Many of these are small datasets collected for local questions, but combined they provide over 170 years of data on phytoplankton communities in Australian waters. Units and taxonomy have been standardised, obviously erroneous data removed, and all metadata included. We have lodged this dataset with the Australian Ocean Data Network (http://portal.aodn.org.au/) allowing public access. The Australian Phytoplankton Database will be invaluable for global change studies, as it allows analysis of ecological indicators of climate change and eutrophication (e.g., changes in distribution; diatom:dinoflagellate ratios). In addition, the standardised conversion of abundance records to biomass provides modellers with quantifiable data to initialise and validate ecosystem models of lower marine trophic levels. PMID:27328409

  1. BASINS Framework and Features

    EPA Pesticide Factsheets

    BASINS enables users to efficiently access nationwide environmental databases and local user-specified datasets, apply assessment and planning tools, and run a variety of proven nonpoint loading and water quality models within a single GIS format.

  2. VEMAP phase 2 bioclimatic database. I. Gridded historical (20th century) climate for modeling ecosystem dynamics across the conterminous USA

    Treesearch

    Timothy G.F. Kittel; Nan. A. Rosenbloom; J.A. Royle; C. Daly; W.P. Gibson; H.H. Fisher; P. Thornton; D.N. Yates; S. Aulenbach; C. Kaufman; R. McKeown; Dominque Bachelet; David S. Schimel

    2004-01-01

    Analysis and simulation of biospheric responses to historical forcing require surface climate data that capture those aspects of climate that control ecological processes, including key spatial gradients and modes of temporal variability. We developed a multivariate, gridded historical climate dataset for the conterminous USA as a common input database for the...

  3. Exploring Antarctic Land Surface Temperature Extremes Using Condensed Anomaly Databases

    NASA Astrophysics Data System (ADS)

    Grant, Glenn Edwin

    Satellite observations have revolutionized the Earth Sciences and climate studies. However, data and imagery continue to accumulate at an accelerating rate, and efficient tools for data discovery, analysis, and quality checking lag behind. In particular, studies of long-term, continental-scale processes at high spatiotemporal resolutions are especially problematic. The traditional technique of downloading an entire dataset and using customized analysis code is often impractical or consumes too many resources. The Condensate Database Project was envisioned as an alternative method for data exploration and quality checking. The project's premise was that much of the data in any satellite dataset is unneeded and can be eliminated, compacting massive datasets into more manageable sizes. Dataset sizes are further reduced by retaining only anomalous data of high interest. Hosting the resulting "condensed" datasets in high-speed databases enables immediate availability for queries and exploration. Proof of the project's success relied on demonstrating that the anomaly database methods can enhance and accelerate scientific investigations. The hypothesis of this dissertation is that the condensed datasets are effective tools for exploring many scientific questions, spurring further investigations and revealing important information that might otherwise remain undetected. This dissertation uses condensed databases containing 17 years of Antarctic land surface temperature anomalies as its primary data. The study demonstrates the utility of the condensate database methods by discovering new information. In particular, the process revealed critical quality problems in the source satellite data. The results are used as the starting point for four case studies, investigating Antarctic temperature extremes, cloud detection errors, and the teleconnections between Antarctic temperature anomalies and climate indices. The results confirm the hypothesis that the condensate databases are a highly useful tool for Earth Science analyses. Moreover, the quality checking capabilities provide an important method for independent evaluation of dataset veracity.

  4. Exploring human disease using the Rat Genome Database.

    PubMed

    Shimoyama, Mary; Laulederkind, Stanley J F; De Pons, Jeff; Nigam, Rajni; Smith, Jennifer R; Tutaj, Marek; Petri, Victoria; Hayman, G Thomas; Wang, Shur-Jen; Ghiasvand, Omid; Thota, Jyothi; Dwinell, Melinda R

    2016-10-01

    Rattus norvegicus, the laboratory rat, has been a crucial model for studies of the environmental and genetic factors associated with human diseases for over 150 years. It is the primary model organism for toxicology and pharmacology studies, and has features that make it the model of choice in many complex-disease studies. Since 1999, the Rat Genome Database (RGD; http://rgd.mcw.edu) has been the premier resource for genomic, genetic, phenotype and strain data for the laboratory rat. The primary role of RGD is to curate rat data and validate orthologous relationships with human and mouse genes, and make these data available for incorporation into other major databases such as NCBI, Ensembl and UniProt. RGD also provides official nomenclature for rat genes, quantitative trait loci, strains and genetic markers, as well as unique identifiers. The RGD team adds enormous value to these basic data elements through functional and disease annotations, the analysis and visual presentation of pathways, and the integration of phenotype measurement data for strains used as disease models. Because much of the rat research community focuses on understanding human diseases, RGD provides a number of datasets and software tools that allow users to easily explore and make disease-related connections among these datasets. RGD also provides comprehensive human and mouse data for comparative purposes, illustrating the value of the rat in translational research. This article introduces RGD and its suite of tools and datasets to researchers - within and beyond the rat community - who are particularly interested in leveraging rat-based insights to understand human diseases. © 2016. Published by The Company of Biologists Ltd.

  5. Creation of clinical research databases in the 21st century: a practical algorithm for HIPAA Compliance.

    PubMed

    Schell, Scott R

    2006-02-01

    Enforcement of the Health Insurance Portability and Accountability Act (HIPAA) began in April, 2003. Designed as a law mandating health insurance availability when coverage was lost, HIPAA imposed sweeping and broad-reaching protections of patient privacy. These changes dramatically altered clinical research by placing sizeable regulatory burdens upon investigators with threat of severe and costly federal and civil penalties. This report describes development of an algorithmic approach to clinical research database design based upon a central key-shared data (CK-SD) model allowing researchers to easily analyze, distribute, and publish clinical research without disclosure of HIPAA Protected Health Information (PHI). Three clinical database formats (small clinical trial, operating room performance, and genetic microchip array datasets) were modeled using standard structured query language (SQL)-compliant databases. The CK database was created to contain PHI data, whereas a shareable SD database was generated in real-time containing relevant clinical outcome information while protecting PHI items. Small (< 100 records), medium (< 50,000 records), and large (> 10(8) records) model databases were created, and the resultant data models were evaluated in consultation with an HIPAA compliance officer. The SD database models complied fully with HIPAA regulations, and resulting "shared" data could be distributed freely. Unique patient identifiers were not required for treatment or outcome analysis. Age data were resolved to single-integer years, grouping patients aged > 89 years. Admission, discharge, treatment, and follow-up dates were replaced with enrollment year, and follow-up/outcome intervals calculated eliminating original data. Two additional data fields identified as PHI (treating physician and facility) were replaced with integer values, and the original data corresponding to these values were stored in the CK database. Use of the algorithm at the time of database design did not increase cost or design effort. The CK-SD model for clinical database design provides an algorithm for investigators to create, maintain, and share clinical research data compliant with HIPAA regulations. This model is applicable to new projects and large institutional datasets, and should decrease regulatory efforts required for conduct of clinical research. Application of the design algorithm early in the clinical research enterprise does not increase cost or the effort of data collection.

  6. Crowdsourcing-Assisted Radio Environment Database for V2V Communication.

    PubMed

    Katagiri, Keita; Sato, Koya; Fujii, Takeo

    2018-04-12

    In order to realize reliable Vehicle-to-Vehicle (V2V) communication systems for autonomous driving, the recognition of radio propagation becomes an important technology. However, in the current wireless distributed network systems, it is difficult to accurately estimate the radio propagation characteristics because of the locality of the radio propagation caused by surrounding buildings and geographical features. In this paper, we propose a measurement-based radio environment database for improving the accuracy of the radio environment estimation in the V2V communication systems. The database first gathers measurement datasets of the received signal strength indicator (RSSI) related to the transmission/reception locations from V2V systems. By using the datasets, the average received power maps linked with transmitter and receiver locations are generated. We have performed measurement campaigns of V2V communications in the real environment to observe RSSI for the database construction. Our results show that the proposed method has higher accuracy of the radio propagation estimation than the conventional path loss model-based estimation.

  7. The Brainomics/Localizer database.

    PubMed

    Papadopoulos Orfanos, Dimitri; Michel, Vincent; Schwartz, Yannick; Pinel, Philippe; Moreno, Antonio; Le Bihan, Denis; Frouin, Vincent

    2017-01-01

    The Brainomics/Localizer database exposes part of the data collected by the in-house Localizer project, which planned to acquire four types of data from volunteer research subjects: anatomical MRI scans, functional MRI data, behavioral and demographic data, and DNA sampling. Over the years, this local project has been collecting such data from hundreds of subjects. We had selected 94 of these subjects for their complete datasets, including all four types of data, as the basis for a prior publication; the Brainomics/Localizer database publishes the data associated with these 94 subjects. Since regulatory rules prevent us from making genetic data available for download, the database serves only anatomical MRI scans, functional MRI data, behavioral and demographic data. To publish this set of heterogeneous data, we use dedicated software based on the open-source CubicWeb semantic web framework. Through genericity in the data model and flexibility in the display of data (web pages, CSV, JSON, XML), CubicWeb helps us expose these complex datasets in original and efficient ways. Copyright © 2015 Elsevier Inc. All rights reserved.

  8. Crowdsourcing-Assisted Radio Environment Database for V2V Communication †

    PubMed Central

    Katagiri, Keita; Fujii, Takeo

    2018-01-01

    In order to realize reliable Vehicle-to-Vehicle (V2V) communication systems for autonomous driving, the recognition of radio propagation becomes an important technology. However, in the current wireless distributed network systems, it is difficult to accurately estimate the radio propagation characteristics because of the locality of the radio propagation caused by surrounding buildings and geographical features. In this paper, we propose a measurement-based radio environment database for improving the accuracy of the radio environment estimation in the V2V communication systems. The database first gathers measurement datasets of the received signal strength indicator (RSSI) related to the transmission/reception locations from V2V systems. By using the datasets, the average received power maps linked with transmitter and receiver locations are generated. We have performed measurement campaigns of V2V communications in the real environment to observe RSSI for the database construction. Our results show that the proposed method has higher accuracy of the radio propagation estimation than the conventional path loss model-based estimation. PMID:29649174

  9. National Transportation Atlas Databases : 2013

    DOT National Transportation Integrated Search

    2013-01-01

    The National Transportation Atlas Databases 2013 (NTAD2013) is a set of nationwide geographic datasets of transportation facilities, transportation networks, associated infrastructure, and other political and administrative entities. These datasets i...

  10. Supporting the operational use of process based hydrological models and NASA Earth Observations for use in land management and post-fire remediation through a Rapid Response Erosion Database (RRED).

    NASA Astrophysics Data System (ADS)

    Miller, M. E.; Elliot, W.; Billmire, M.; Robichaud, P. R.; Banach, D. M.

    2017-12-01

    We have built a Rapid Response Erosion Database (RRED, http://rred.mtri.org/rred/) for the continental United States to allow land managers to access properly formatted spatial model inputs for the Water Erosion Prediction Project (WEPP). Spatially-explicit process-based models like WEPP require spatial inputs that include digital elevation models (DEMs), soil, climate and land cover. The online database delivers either a 10m or 30m USGS DEM, land cover derived from the Landfire project, and soil data derived from SSURGO and STATSGO datasets. The spatial layers are projected into UTM coordinates and pre-registered for modeling. WEPP soil parameter files are also created along with linkage files to match both spatial land cover and soils data with the appropriate WEPP parameter files. Our goal is to make process-based models more accessible by preparing spatial inputs ahead of time allowing modelers to focus on addressing scenarios of concern. The database provides comprehensive support for post-fire hydrological modeling by allowing users to upload spatial soil burn severity maps, and within moments returns spatial model inputs. Rapid response is critical following natural disasters. After moderate and high severity wildfires, flooding, erosion, and debris flows are a major threat to life, property and municipal water supplies. Mitigation measures must be rapidly implemented if they are to be effective, but they are expensive and cannot be applied everywhere. Fire, runoff, and erosion risks also are highly heterogeneous in space, creating an urgent need for rapid, spatially-explicit assessment. The database has been used to help assess and plan remediation on over a dozen wildfires in the Western US. Future plans include expanding spatial coverage, improving model input data and supporting additional models. Our goal is to facilitate the use of the best possible datasets and models to support the conservation of soil and water.

  11. CyanoBase: the cyanobacteria genome database update 2010

    PubMed Central

    Nakao, Mitsuteru; Okamoto, Shinobu; Kohara, Mitsuyo; Fujishiro, Tsunakazu; Fujisawa, Takatomo; Sato, Shusei; Tabata, Satoshi; Kaneko, Takakazu; Nakamura, Yasukazu

    2010-01-01

    CyanoBase (http://genome.kazusa.or.jp/cyanobase) is the genome database for cyanobacteria, which are model organisms for photosynthesis. The database houses cyanobacteria species information, complete genome sequences, genome-scale experiment data, gene information, gene annotations and mutant information. In this version, we updated these datasets and improved the navigation and the visual display of the data views. In addition, a web service API now enables users to retrieve the data in various formats with other tools, seamlessly. PMID:19880388

  12. A brief overview of the Chemistry-Aerosol Mediterranean Experiment (ChArMEx) database and campaign operation centre (ChOC)

    NASA Astrophysics Data System (ADS)

    Ferré, Hélène; Dulac, François; Belmahfoud, Nizar; Brissebrat, Guillaume; Cloché, Sophie; Descloitres, Jacques; Fleury, Laurence; Focsa, Loredana; Henriot, Nicolas; Ramage, Karim; Vermeulen, Anne

    2016-04-01

    Initiated in 2010 in the framework of the multidisciplinary research programme MISTRALS (Mediterranean Integrated Studies at Regional and Local Scales; http:www.mistrals-home.org), the Chemistry-Aerosol Mediterranean Experiment (ChArMEx, http://charmex.lsce.ipsl.fr/) aims at federating the scientific community for an updated assessment of the present and future state of the atmospheric environment in the Mediterranean Basin, and of its impacts on the regional climate, air quality, and marine biogeochemistry. The project combines mid- and long-term monitoring, intensive field campaigns, use of satellite data, and modelling studies. In this presentation we provide an overview of the campaign operation centre (http://choc.sedoo.fr/) and project database (http://mistrals.sedoo.fr/ChArMEx), at the end of the first experimental phase of the project that included a series of large campaigns based on airborne means (including balloons and various aircraft) and a network of surface stations. Those campaigns were performed mainly in the western Mediterranean basin in the summer of 2012, 2013 and 2014 with the help of the ChArMEx Operation Centre (ChOC), an open web site that has the objective to gather and display daily quick-looks from model forecasts and near-real time in situ and remote sensing observations of physical and chemical weather conditions relevant for the everyday campaign operation decisions. The ChOC is also useful for post campaign analyses and can be completed with a number of quick-looks of campaign results obtained later in order to offer an easy access to, and comprehensive view of all available data during the campaign period. The items included are selected according to the objectives and location of the given campaigns. The second experimental phase of ChArMEx from 2015 on is more focused on the eastern basin. In addition, the project operation centre is planned to be adapted for a joint MERMEX-ChArMEx oceanographic cruise (PEACETIME) for a study at the air-sea interface focused on the biogeochemical impact of atmospheric deposition. The database includes a wide diversity of data and parameters relevant to atmospheric chemistry. The objective of the database task team is to organize data management, distribution system and services, such as facilitating the exchange of information and stimulating the collaboration between researchers within the ChArMEx community, and beyond. The database relies on a strong collaboration between ICARE, IPSL and OMP data centers and has been set up in the framework of the MISTRALS programme data portal. ChArMEx data, either produced or used by the project, are documented and made easily accessible through the database website, which offers expected user-friendly functionalities: data catalog, user registration procedure, search tool to select and access data based on parameters, instruments, countries, platform or project, information of dataset PIs about downloadings... The metadata (data description) are standardized, and comply with international standards (ISO 19115-19139; INSPIRE European Directive; Global Change Master Directory Thesaurus). A Digital Object Identifier (DOI) assignement procedure allows to automatically register the datasets, in order to make them easier to access, cite, reuse and verify. At present, the ChArMEx database contains about 160 datasets, including more than 120 in situ datasets (from a total of 7 campaigns and various monitoring stations including the background atmospheric station of Ersa (June 2012-July 2014), 30 model output sets (dust model intercomparison, MEDCORDEX scenarios...), a high resolution emission inventory over the Mediterranean made available as part of the ECCAD database (http://eccad.sedoo.fr/eccad_extract_interface/JSF/page_charmex.jsf), etc. Some in situ datasets have been inserted in a relational database in order to enable more accurate selection and download of different datasets in a shared format. Many dedicated satellite products (SEVIRI, TRIMM, PARASOL...) are processed and will soon be accessible through the database website. Every scientist is welcome to visit the ChArMEx websites, to register and request data, and to contact charmex-database@sedoo.fr for any question.

  13. Method applied to the background analysis of energy data to be considered for the European Reference Life Cycle Database (ELCD).

    PubMed

    Fazio, Simone; Garraín, Daniel; Mathieux, Fabrice; De la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda

    2015-01-01

    Under the framework of the European Platform on Life Cycle Assessment, the European Reference Life-Cycle Database (ELCD - developed by the Joint Research Centre of the European Commission), provides core Life Cycle Inventory (LCI) data from front-running EU-level business associations and other sources. The ELCD contains energy-related data on power and fuels. This study describes the methods to be used for the quality analysis of energy data for European markets (available in third-party LC databases and from authoritative sources) that are, or could be, used in the context of the ELCD. The methodology was developed and tested on the energy datasets most relevant for the EU context, derived from GaBi (the reference database used to derive datasets for the ELCD), Ecoinvent, E3 and Gemis. The criteria for the database selection were based on the availability of EU-related data, the inclusion of comprehensive datasets on energy products and services, and the general approval of the LCA community. The proposed approach was based on the quality indicators developed within the International Reference Life Cycle Data System (ILCD) Handbook, further refined to facilitate their use in the analysis of energy systems. The overall Data Quality Rating (DQR) of the energy datasets can be calculated by summing up the quality rating (ranging from 1 to 5, where 1 represents very good, and 5 very poor quality) of each of the quality criteria indicators, divided by the total number of indicators considered. The quality of each dataset can be estimated for each indicator, and then compared with the different databases/sources. The results can be used to highlight the weaknesses of each dataset and can be used to guide further improvements to enhance the data quality with regard to the established criteria. This paper describes the application of the methodology to two exemplary datasets, in order to show the potential of the methodological approach. The analysis helps LCA practitioners to evaluate the usefulness of the ELCD datasets for their purposes, and dataset developers and reviewers to derive information that will help improve the overall DQR of databases.

  14. Deep Convolutional Neural Networks for breast cancer screening.

    PubMed

    Chougrad, Hiba; Zouaki, Hamid; Alheyane, Omar

    2018-04-01

    Radiologists often have a hard time classifying mammography mass lesions which leads to unnecessary breast biopsies to remove suspicions and this ends up adding exorbitant expenses to an already burdened patient and health care system. In this paper we developed a Computer-aided Diagnosis (CAD) system based on deep Convolutional Neural Networks (CNN) that aims to help the radiologist classify mammography mass lesions. Deep learning usually requires large datasets to train networks of a certain depth from scratch. Transfer learning is an effective method to deal with relatively small datasets as in the case of medical images, although it can be tricky as we can easily start overfitting. In this work, we explore the importance of transfer learning and we experimentally determine the best fine-tuning strategy to adopt when training a CNN model. We were able to successfully fine-tune some of the recent, most powerful CNNs and achieved better results compared to other state-of-the-art methods which classified the same public datasets. For instance we achieved 97.35% accuracy and 0.98 AUC on the DDSM database, 95.50% accuracy and 0.97 AUC on the INbreast database and 96.67% accuracy and 0.96 AUC on the BCDR database. Furthermore, after pre-processing and normalizing all the extracted Regions of Interest (ROIs) from the full mammograms, we merged all the datasets to build one large set of images and used it to fine-tune our CNNs. The CNN model which achieved the best results, a 98.94% accuracy, was used as a baseline to build the Breast Cancer Screening Framework. To evaluate the proposed CAD system and its efficiency to classify new images, we tested it on an independent database (MIAS) and got 98.23% accuracy and 0.99 AUC. The results obtained demonstrate that the proposed framework is performant and can indeed be used to predict if the mass lesions are benign or malignant. Copyright © 2018 Elsevier B.V. All rights reserved.

  15. Implementing a Community-Driven Cyberinfrastructure Platform for the Paleo- and Rock Magnetic Scientific Fields that Generalizes to Other Geoscience Disciplines

    NASA Astrophysics Data System (ADS)

    Minnett, R.; Jarboe, N.; Koppers, A. A.; Tauxe, L.; Constable, C.

    2013-12-01

    EarthRef.org is a geoscience umbrella website for several databases and data and model repository portals. These portals, unified in the mandate to preserve their respective data and promote scientific collaboration in their fields, are also disparate in their schemata. The Magnetics Information Consortium (http://earthref.org/MagIC/) is a grass-roots cyberinfrastructure effort envisioned by the paleo- and rock magnetic scientific community to archive their wealth of peer-reviewed raw data and interpretations from studies on natural and synthetic samples and relies on a partially strict subsumptive hierarchical data model. The Geochemical Earth Reference Model (http://earthref.org/GERM/) portal focuses on the chemical characterization of the Earth and relies on two data schemata: a repository of peer-reviewed reservoir geochemistry, and a database of partition coefficients for rocks, minerals, and elements. The Seamount Biogeosciences Network (http://earthref.org/SBN/) encourages the collaboration between the diverse disciplines involved in seamount research and includes the Seamount Catalog (http://earthref.org/SC/) of bathymetry and morphology. All of these portals also depend on the EarthRef Reference Database (http://earthref.org/ERR/) for publication reference metadata and the EarthRef Digital Archive (http://earthref.org/ERDA/), a generic repository of data objects and their metadata. The development of the new MagIC Search Interface (http://earthref.org/MagIC/search/) centers on a reusable platform designed to be flexible enough for largely heterogeneous datasets and to scale up to datasets with tens of millions of records. The HTML5 web application and Oracle 11g database residing at the San Diego Supercomputer Center (SDSC) support the online contribution and editing of complex datasets in a spreadsheet environment and the browsing and filtering of these contributions in the context of thousands of other datasets. EarthRef.org is in the process of implementing this platform across all of its data portals in spite of the wide variety of data schemata and is dedicated to serving the geoscience community with as little effort from the end-users as possible.

  16. Exploiting PubChem for Virtual Screening

    PubMed Central

    Xie, Xiang-Qun

    2011-01-01

    Importance of the field PubChem is a public molecular information repository, a scientific showcase of the NIH Roadmap Initiative. The PubChem database holds over 27 million records of unique chemical structures of compounds (CID) derived from nearly 70 million substance depositions (SID), and contains more than 449,000 bioassay records with over thousands of in vitro biochemical and cell-based screening bioassays established, with targeting more than 7000 proteins and genes linking to over 1.8 million of substances. Areas covered in this review This review builds on recent PubChem-related computational chemistry research reported by other authors while providing readers with an overview of the PubChem database, focusing on its increasing role in cheminformatics, virtual screening and toxicity prediction modeling. What the reader will gain These publicly available datasets in PubChem provide great opportunities for scientists to perform cheminformatics and virtual screening research for computer-aided drug design. However, the high volume and complexity of the datasets, in particular the bioassay-associated false positives/negatives and highly imbalanced datasets in PubChem, also creates major challenges. Several approaches regarding the modeling of PubChem datasets and development of virtual screening models for bioactivity and toxicity predictions are also reviewed. Take home message Novel data-mining cheminformatics tools and virtual screening algorithms are being developed and used to retrieve, annotate and analyze the large-scale and highly complex PubChem biological screening data for drug design. PMID:21691435

  17. Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse.

    PubMed

    Soranno, Patricia A; Bissell, Edward G; Cheruvelil, Kendra S; Christel, Samuel T; Collins, Sarah M; Fergus, C Emi; Filstrup, Christopher T; Lapierre, Jean-Francois; Lottig, Noah R; Oliver, Samantha K; Scott, Caren E; Smith, Nicole J; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A; Gries, Corinna; Henry, Emily N; Skaff, Nick K; Stanley, Emily H; Stow, Craig A; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E

    2015-01-01

    Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km(2)). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

  18. Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science through data reuse

    USGS Publications Warehouse

    Soranno, Patricia A.; Bissell, E.G.; Cheruvelil, Kendra S.; Christel, Samuel T.; Collins, Sarah M.; Fergus, C. Emi; Filstrup, Christopher T.; Lapierre, Jean-Francois; Lotting, Noah R.; Oliver, Samantha K.; Scott, Caren E.; Smith, Nicole J.; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A.; Gries, Corinna; Henry, Emily N.; Skaff, Nick K.; Stanley, Emily H.; Stow, Craig A.; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E.

    2015-01-01

    Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km2). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

  19. The use of karst geomorphology for planning, hazard avoidance and development in Great Britain

    NASA Astrophysics Data System (ADS)

    Cooper, Anthony H.; Farrant, Andrew R.; Price, Simon J.

    2011-11-01

    Within Great Britain five main types of karstic rocks - dolomite, limestone, chalk, gypsum and salt - are present. Each presents a different type and severity of karstic geohazard which are related to the rock solubility and geological setting. Typical karstic features associated with these rocks have been databased by the British Geological Survey (BGS) with records of sinkholes, cave entrances, stream sinks, resurgences and building damage; data for more than half of the country has been gathered. BGS has manipulated digital map data, for bedrock and superficial deposits, with digital elevation slope models, superficial deposit thickness models, the karst data and expertly interpreted areas, to generate a derived dataset assessing the likelihood of subsidence due to karst collapse. This dataset is informed and verified by the karst database and marketed as part of the BGS GeoSure suite. It is currently used by environmental regulators, the insurance and construction industries, and the BGS semi-automated enquiry system. The database and derived datasets can be further combined and manipulated using GIS to provide other datasets that deal with specific problems. Sustainable drainage systems, some of which use soak-aways into the ground, are being encouraged in Great Britain, but in karst areas they can cause ground stability problems. Similarly, open loop ground source heat or cooling pump systems may induce subsidence if installed in certain types of karstic environments such as in chalk with overlying sand deposits. Groundwater abstraction also has the potential to trigger subsidence in karst areas. GIS manipulation of the karst information is allowing Great Britain to be zoned into areas suitable, or unsuitable, for such uses; it has the potential to become part of a suite of planning management tools for local and National Government to assess the long term sustainable use of the ground.

  20. GIEMS-D3: A new long-term, dynamical, high-spatial resolution inundation extent dataset at global scale

    NASA Astrophysics Data System (ADS)

    Aires, Filipe; Miolane, Léo; Prigent, Catherine; Pham Duc, Binh; Papa, Fabrice; Fluet-Chouinard, Etienne; Lehner, Bernhard

    2017-04-01

    The Global Inundation Extent from Multi-Satellites (GIEMS) provides multi-year monthly variations of the global surface water extent at 25kmx25km resolution. It is derived from multiple satellite observations. Its spatial resolution is usually compatible with climate model outputs and with global land surface model grids but is clearly not adequate for local applications that require the characterization of small individual water bodies. There is today a strong demand for high-resolution inundation extent datasets, for a large variety of applications such as water management, regional hydrological modeling, or for the analysis of mosquitos-related diseases. A new procedure is introduced to downscale the GIEMS low spatial resolution inundations to a 3 arc second (90 m) dataset. The methodology is based on topography and hydrography information from the HydroSHEDS database. A new floodability index is adopted and an innovative smoothing procedure is developed to ensure the smooth transition, in the high-resolution maps, between the low-resolution boxes from GIEMS. Topography information is relevant for natural hydrology environments controlled by elevation, but is more limited in human-modified basins. However, the proposed downscaling approach is compatible with forthcoming fusion with other more pertinent satellite information in these difficult regions. The resulting GIEMS-D3 database is the only high spatial resolution inundation database available globally at the monthly time scale over the 1993-2007 period. GIEMS-D3 is assessed by analyzing its spatial and temporal variability, and evaluated by comparisons to other independent satellite observations from visible (Google Earth and Landsat), infrared (MODIS) and active microwave (SAR).

  1. Management and assimilation of diverse, distributed watershed datasets

    NASA Astrophysics Data System (ADS)

    Varadharajan, C.; Faybishenko, B.; Versteeg, R.; Agarwal, D.; Hubbard, S. S.; Hendrix, V.

    2016-12-01

    The U.S. Department of Energy's (DOE) Watershed Function Scientific Focus Area (SFA) seeks to determine how perturbations to mountainous watersheds (e.g., floods, drought, early snowmelt) impact the downstream delivery of water, nutrients, carbon, and metals over seasonal to decadal timescales. We are building a software platform that enables integration of diverse and disparate field, laboratory, and simulation datasets, of various types including hydrological, geological, meteorological, geophysical, geochemical, ecological and genomic datasets across a range of spatial and temporal scales within the Rifle floodplain and the East River watershed, Colorado. We are using agile data management and assimilation approaches, to enable web-based integration of heterogeneous, multi-scale dataSensor-based observations of water-level, vadose zone and groundwater temperature, water quality, meteorology as well as biogeochemical analyses of soil and groundwater samples have been curated and archived in federated databases. Quality Assurance and Quality Control (QA/QC) are performed on priority datasets needed for on-going scientific analyses, and hydrological and geochemical modeling. Automated QA/QC methods are used to identify and flag issues in the datasets. Data integration is achieved via a brokering service that dynamically integrates data from distributed databases via web services, based on user queries. The integrated results are presented to users in a portal that enables intuitive search, interactive visualization and download of integrated datasets. The concepts, approaches and codes being used are shared across various data science components of various large DOE-funded projects such as the Watershed Function SFA, Next Generation Ecosystem Experiment (NGEE) Tropics, Ameriflux/FLUXNET, and Advanced Simulation Capability for Environmental Management (ASCEM), and together contribute towards DOE's cyberinfrastructure for data management and model-data integration.

  2. Global relationships in river hydromorphology

    NASA Astrophysics Data System (ADS)

    Pavelsky, T.; Lion, C.; Allen, G. H.; Durand, M. T.; Schumann, G.; Beighley, E.; Yang, X.

    2017-12-01

    Since the widespread adoption of digital elevation models (DEMs) in the 1980s, most global and continental-scale analysis of river flow characteristics has been focused on measurements derived from DEMs such as drainage area, elevation, and slope. These variables (especially drainage area) have been related to other quantities of interest such as river width, depth, and velocity via empirical relationships that often take the form of power laws. More recently, a number of groups have developed more direct measurements of river location and some aspects of planform geometry from optical satellite imagery on regional, continental, and global scales. However, these satellite-derived datasets often lack many of the qualities that make DEM=derived datasets attractive, including robust network topology. Here, we present analysis of a dataset that combines the Global River Widths from Landsat (GRWL) database of river location, width, and braiding index with a river database extracted from the Shuttle Radar Topography Mission DEM and the HydroSHEDS dataset. Using these combined tools, we present a dataset that includes measurements of river width, slope, braiding index, upstream drainage area, and other variables. The dataset is available everywhere that both datasets are available, which includes all continental areas south of 60N with rivers sufficiently large to be observed with Landsat imagery. We use the dataset to examine patterns and frequencies of river form across continental and global scales as well as global relationships among variables including width, slope, and drainage area. The results demonstrate the complex relationships among different dimensions of river hydromorphology at the global scale.

  3. Improving data management and dissemination in web based information systems by semantic enrichment of descriptive data aspects

    NASA Astrophysics Data System (ADS)

    Gebhardt, Steffen; Wehrmann, Thilo; Klinger, Verena; Schettler, Ingo; Huth, Juliane; Künzer, Claudia; Dech, Stefan

    2010-10-01

    The German-Vietnamese water-related information system for the Mekong Delta (WISDOM) project supports business processes in Integrated Water Resources Management in Vietnam. Multiple disciplines bring together earth and ground based observation themes, such as environmental monitoring, water management, demographics, economy, information technology, and infrastructural systems. This paper introduces the components of the web-based WISDOM system including data, logic and presentation tier. It focuses on the data models upon which the database management system is built, including techniques for tagging or linking metadata with the stored information. The model also uses ordered groupings of spatial, thematic and temporal reference objects to semantically tag datasets to enable fast data retrieval, such as finding all data in a specific administrative unit belonging to a specific theme. A spatial database extension is employed by the PostgreSQL database. This object-oriented database was chosen over a relational database to tag spatial objects to tabular data, improving the retrieval of census and observational data at regional, provincial, and local areas. While the spatial database hinders processing raster data, a "work-around" was built into WISDOM to permit efficient management of both raster and vector data. The data model also incorporates styling aspects of the spatial datasets through styled layer descriptions (SLD) and web mapping service (WMS) layer specifications, allowing retrieval of rendered maps. Metadata elements of the spatial data are based on the ISO19115 standard. XML structured information of the SLD and metadata are stored in an XML database. The data models and the data management system are robust for managing the large quantity of spatial objects, sensor observations, census and document data. The operational WISDOM information system prototype contains modules for data management, automatic data integration, and web services for data retrieval, analysis, and distribution. The graphical user interfaces facilitate metadata cataloguing, data warehousing, web sensor data analysis and thematic mapping.

  4. ChemNet: A Transferable and Generalizable Deep Neural Network for Small-Molecule Property Prediction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Goh, Garrett B.; Siegel, Charles M.; Vishnu, Abhinav

    With access to large datasets, deep neural networks through representation learning have been able to identify patterns from raw data, achieving human-level accuracy in image and speech recognition tasks. However, in chemistry, availability of large standardized and labelled datasets is scarce, and with a multitude of chemical properties of interest, chemical data is inherently small and fragmented. In this work, we explore transfer learning techniques in conjunction with the existing Chemception CNN model, to create a transferable and generalizable deep neural network for small-molecule property prediction. Our latest model, ChemNet learns in a semi-supervised manner from inexpensive labels computed frommore » the ChEMBL database. When fine-tuned to the Tox21, HIV and FreeSolv dataset, which are 3 separate chemical tasks that ChemNet was not originally trained on, we demonstrate that ChemNet exceeds the performance of existing Chemception models, contemporary MLP models that trains on molecular fingerprints, and it matches the performance of the ConvGraph algorithm, the current state-of-the-art. Furthermore, as ChemNet has been pre-trained on a large diverse chemical database, it can be used as a universal “plug-and-play” deep neural network, which accelerates the deployment of deep neural networks for the prediction of novel small-molecule chemical properties.« less

  5. Making the MagIC (Magnetics Information Consortium) Web Application Accessible to New Users and Useful to Experts

    NASA Astrophysics Data System (ADS)

    Minnett, R.; Koppers, A.; Jarboe, N.; Tauxe, L.; Constable, C.; Jonestrask, L.

    2017-12-01

    Challenges are faced by both new and experienced users interested in contributing their data to community repositories, in data discovery, or engaged in potentially transformative science. The Magnetics Information Consortium (https://earthref.org/MagIC) has recently simplified its data model and developed a new containerized web application to reduce the friction in contributing, exploring, and combining valuable and complex datasets for the paleo-, geo-, and rock magnetic scientific community. The new data model more closely reflects the hierarchical workflow in paleomagnetic experiments to enable adequate annotation of scientific results and ensure reproducibility. The new open-source (https://github.com/earthref/MagIC) application includes an upload tool that is integrated with the data model to provide early data validation feedback and ease the friction of contributing and updating datasets. The search interface provides a powerful full text search of contributions indexed by ElasticSearch and a wide array of filters, including specific geographic and geological timescale filtering, to support both novice users exploring the database and experts interested in compiling new datasets with specific criteria across thousands of studies and millions of measurements. The datasets are not large, but they are complex, with many results from evolving experimental and analytical approaches. These data are also extremely valuable due to the cost in collecting or creating physical samples and the, often, destructive nature of the experiments. MagIC is heavily invested in encouraging young scientists as well as established labs to cultivate workflows that facilitate contributing their data in a consistent format. This eLightning presentation includes a live demonstration of the MagIC web application, developed as a configurable container hosting an isomorphic Meteor JavaScript application, MongoDB database, and ElasticSearch search engine. Visitors can explore the MagIC Database through maps and image or plot galleries or search and filter the raw measurements and their derived hierarchy of analytical interpretations.

  6. Comparing NetCDF and SciDB on managing and querying 5D hydrologic dataset

    NASA Astrophysics Data System (ADS)

    Liu, Haicheng; Xiao, Xiao

    2016-11-01

    Efficiently extracting information from high dimensional hydro-meteorological modelling datasets requires smart solutions. Traditional methods are mostly based on files, which can be edited and accessed handily. But they have problems of efficiency due to contiguous storage structure. Others propose databases as an alternative for advantages such as native functionalities for manipulating multidimensional (MD) arrays, smart caching strategy and scalability. In this research, NetCDF file based solutions and the multidimensional array database management system (DBMS) SciDB applying chunked storage structure are benchmarked to determine the best solution for storing and querying 5D large hydrologic modelling dataset. The effect of data storage configurations including chunk size, dimension order and compression on query performance is explored. Results indicate that dimension order to organize storage of 5D data has significant influence on query performance if chunk size is very large. But the effect becomes insignificant when chunk size is properly set. Compression of SciDB mostly has negative influence on query performance. Caching is an advantage but may be influenced by execution of different query processes. On the whole, NetCDF solution without compression is in general more efficient than the SciDB DBMS.

  7. Comparison of GFED3, QFED2 and FEER1 Biomass Burning Emissions Datasets in a Global Model

    NASA Technical Reports Server (NTRS)

    Pan, Xiaohua; Ichoku, Charles; Bian, Huisheng; Chin, Mian; Ellison, Luke; da Silva, Arlindo; Darmenov, Anton

    2015-01-01

    Biomass burning contributes about 40% of the global loading of carbonaceous aerosols, significantly affecting air quality and the climate system by modulating solar radiation and cloud properties. However, fire emissions are poorly constrained in models on global and regional levels. In this study, we investigate 3 global biomass burning emission datasets in NASA GEOS5, namely: (1) GFEDv3.1 (Global Fire Emissions Database version 3.1); (2) QFEDv2.4 (Quick Fire Emissions Dataset version 2.4); (3) FEERv1 (Fire Energetics and Emissions Research version 1.0). The simulated aerosol optical depth (AOD), absorption AOD (AAOD), angstrom exponent and surface concentrations of aerosol plumes dominated by fire emissions are evaluated and compared to MODIS, OMI, AERONET, and IMPROVE data over different regions. In general, the spatial patterns of biomass burning emissions from these inventories are similar, although the strength of the emissions can be noticeably different. The emissions estimates from QFED are generally larger than those of FEER, which are in turn larger than those of GFED. AOD simulated with all these 3 databases are lower than the corresponding observations in Southern Africa and South America, two of the major biomass burning regions in the world.

  8. USA National Phenology Network's volunteer-contributed observations yield predictive models of phenological transitions.

    PubMed

    Crimmins, Theresa M; Crimmins, Michael A; Gerst, Katharine L; Rosemartin, Alyssa H; Weltzin, Jake F

    2017-01-01

    In support of science and society, the USA National Phenology Network (USA-NPN) maintains a rapidly growing, continental-scale, species-rich dataset of plant and animal phenology observations that with over 10 million records is the largest such database in the United States. The aim of this study was to explore the potential that exists in the broad and rich volunteer-collected dataset maintained by the USA-NPN for constructing models predicting the timing of phenological transition across species' ranges within the continental United States. Contributed voluntarily by professional and citizen scientists, these opportunistically collected observations are characterized by spatial clustering, inconsistent spatial and temporal sampling, and short temporal depth (2009-present). Whether data exhibiting such limitations can be used to develop predictive models appropriate for use across large geographic regions has not yet been explored. We constructed predictive models for phenophases that are the most abundant in the database and also relevant to management applications for all species with available data, regardless of plant growth habit, location, geographic extent, or temporal depth of the observations. We implemented a very basic model formulation-thermal time models with a fixed start date. Sufficient data were available to construct 107 individual species × phenophase models. Remarkably, given the limited temporal depth of this dataset and the simple modeling approach used, fifteen of these models (14%) met our criteria for model fit and error. The majority of these models represented the "breaking leaf buds" and "leaves" phenophases and represented shrub or tree growth forms. Accumulated growing degree day (GDD) thresholds that emerged ranged from 454 GDDs (Amelanchier canadensis-breaking leaf buds) to 1,300 GDDs (Prunus serotina-open flowers). Such candidate thermal time thresholds can be used to produce real-time and short-term forecast maps of the timing of these phenophase transition. In addition, many of the candidate models that emerged were suitable for use across the majority of the species' geographic ranges. Real-time and forecast maps of phenophase transitions could support a wide range of natural resource management applications, including invasive plant management, issuing asthma and allergy alerts, and anticipating frost damage for crops in vulnerable states. Our finding that several viable thermal time threshold models that work across the majority of the species ranges could be constructed from the USA-NPN database provides clear evidence that great potential exists this dataset to develop more enhanced predictive models for additional species and phenophases. Further, the candidate models that emerged have immediate utility for supporting a wide range of management applications.

  9. Final Report - Enhanced LAW Glass Property - Composition Models - Phase 1 VSL-13R2940-1, Rev. 0, dated 9/27/2013

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kruger, Albert A.; Muller, I.; Gilbo, K.

    2013-11-13

    The objectives of this work are aimed at the development of enhanced LAW propertycomposition models that expand the composition region covered by the models. The models of interest include PCT, VHT, viscosity and electrical conductivity. This is planned as a multi-year effort that will be performed in phases with the objectives listed below for the current phase.  Incorporate property- composition data from the new glasses into the database.  Assess the database and identify composition spaces in the database that need augmentation.  Develop statistically-designed composition matrices to cover the composition regions identified in the above analysis.  Preparemore » crucible melts of glass compositions from the statistically-designed composition matrix and measure the properties of interest.  Incorporate the above property-composition data into the database.  Assess existing models against the complete dataset and, as necessary, start development of new models.« less

  10. DNApod: DNA polymorphism annotation database from next-generation sequence read archives.

    PubMed

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.

  11. DNApod: DNA polymorphism annotation database from next-generation sequence read archives

    PubMed Central

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924

  12. SPANG: a SPARQL client supporting generation and reuse of queries for distributed RDF databases.

    PubMed

    Chiba, Hirokazu; Uchiyama, Ikuo

    2017-02-08

    Toward improved interoperability of distributed biological databases, an increasing number of datasets have been published in the standardized Resource Description Framework (RDF). Although the powerful SPARQL Protocol and RDF Query Language (SPARQL) provides a basis for exploiting RDF databases, writing SPARQL code is burdensome for users including bioinformaticians. Thus, an easy-to-use interface is necessary. We developed SPANG, a SPARQL client that has unique features for querying RDF datasets. SPANG dynamically generates typical SPARQL queries according to specified arguments. It can also call SPARQL template libraries constructed in a local system or published on the Web. Further, it enables combinatorial execution of multiple queries, each with a distinct target database. These features facilitate easy and effective access to RDF datasets and integrative analysis of distributed data. SPANG helps users to exploit RDF datasets by generation and reuse of SPARQL queries through a simple interface. This client will enhance integrative exploitation of biological RDF datasets distributed across the Web. This software package is freely available at http://purl.org/net/spang .

  13. Experimental Database with Baseline CFD Solutions: 2-D and Axisymmetric Hypersonic Shock-Wave/Turbulent-Boundary-Layer Interactions

    NASA Technical Reports Server (NTRS)

    Marvin, Joseph G.; Brown, James L.; Gnoffo, Peter A.

    2013-01-01

    A database compilation of hypersonic shock-wave/turbulent boundary layer experiments is provided. The experiments selected for the database are either 2D or axisymmetric, and include both compression corner and impinging type SWTBL interactions. The strength of the interactions range from attached to incipient separation to fully separated flows. The experiments were chosen based on criterion to ensure quality of the datasets, to be relevant to NASA's missions and to be useful for validation and uncertainty assessment of CFD Navier-Stokes predictive methods, both now and in the future. An emphasis on datasets selected was on surface pressures and surface heating throughout the interaction, but include some wall shear stress distributions and flowfield profiles. Included, for selected cases, are example CFD grids and setup information, along with surface pressure and wall heating results from simulations using current NASA real-gas Navier-Stokes codes by which future CFD investigators can compare and evaluate physics modeling improvements and validation and uncertainty assessments of future CFD code developments. The experimental database is presented tabulated in the Appendices describing each experiment. The database is also provided in computer-readable ASCII files located on a companion DVD.

  14. A kinetics database and scripts for PHREEQC

    NASA Astrophysics Data System (ADS)

    Hu, B.; Zhang, Y.; Teng, Y.; Zhu, C.

    2017-12-01

    Kinetics of geochemical reactions has been increasingly used in numerical models to simulate coupled flow, mass transport, and chemical reactions. However, the kinetic data are scattered in the literature. To assemble a kinetic dataset for a modeling project is an intimidating task for most. In order to facilitate the application of kinetics in geochemical modeling, we assembled kinetics parameters into a database for the geochemical simulation program, PHREEQC (version 3.0). Kinetics data were collected from the literature. Our database includes kinetic data for over 70 minerals. The rate equations are also programmed into scripts with the Basic language. Using the new kinetic database, we simulated reaction path during the albite dissolution process using various rate equations in the literature. The simulation results with three different rate equations gave difference reaction paths at different time scale. Another application involves a coupled reactive transport model simulating the advancement of an acid plume in an acid mine drainage site associated with Bear Creek Uranium tailings pond. Geochemical reactions including calcite, gypsum, and illite were simulated with PHREEQC using the new kinetic database. The simulation results successfully demonstrated the utility of new kinetic database.

  15. Indirectly Estimating International Net Migration Flows by Age and Gender: The Community Demographic Model International Migration (CDM-IM) Dataset

    PubMed Central

    Nawrotzki, Raphael J.; Jiang, Leiwen

    2015-01-01

    Although data for the total number of international migrant flows is now available, no global dataset concerning demographic characteristics, such as the age and gender composition of migrant flows exists. This paper reports on the methods used to generate the CDM-IM dataset of age and gender specific profiles of bilateral net (not gross) migrant flows. We employ raw data from the United Nations Global Migration Database and estimate net migrant flows by age and gender between two time points around the year 2000, accounting for various demographic processes (fertility, mortality). The dataset contains information on 3,713 net migrant flows. Validation analyses against existing data sets and the historical, geopolitical context demonstrate that the CDM-IM dataset is of reasonably high quality. PMID:26692590

  16. ODG: Omics database generator - a tool for generating, querying, and analyzing multi-omics comparative databases to facilitate biological understanding.

    PubMed

    Guhlin, Joseph; Silverstein, Kevin A T; Zhou, Peng; Tiffin, Peter; Young, Nevin D

    2017-08-10

    Rapid generation of omics data in recent years have resulted in vast amounts of disconnected datasets without systemic integration and knowledge building, while individual groups have made customized, annotated datasets available on the web with few ways to link them to in-lab datasets. With so many research groups generating their own data, the ability to relate it to the larger genomic and comparative genomic context is becoming increasingly crucial to make full use of the data. The Omics Database Generator (ODG) allows users to create customized databases that utilize published genomics data integrated with experimental data which can be queried using a flexible graph database. When provided with omics and experimental data, ODG will create a comparative, multi-dimensional graph database. ODG can import definitions and annotations from other sources such as InterProScan, the Gene Ontology, ENZYME, UniPathway, and others. This annotation data can be especially useful for studying new or understudied species for which transcripts have only been predicted, and rapidly give additional layers of annotation to predicted genes. In better studied species, ODG can perform syntenic annotation translations or rapidly identify characteristics of a set of genes or nucleotide locations, such as hits from an association study. ODG provides a web-based user-interface for configuring the data import and for querying the database. Queries can also be run from the command-line and the database can be queried directly through programming language hooks available for most languages. ODG supports most common genomic formats as well as generic, easy to use tab-separated value format for user-provided annotations. ODG is a user-friendly database generation and query tool that adapts to the supplied data to produce a comparative genomic database or multi-layered annotation database. ODG provides rapid comparative genomic annotation and is therefore particularly useful for non-model or understudied species. For species for which more data are available, ODG can be used to conduct complex multi-omics, pattern-matching queries.

  17. Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care: euroCAT.

    PubMed

    Deist, Timo M; Jochems, A; van Soest, Johan; Nalbantov, Georgi; Oberije, Cary; Walsh, Seán; Eble, Michael; Bulens, Paul; Coucke, Philippe; Dries, Wim; Dekker, Andre; Lambin, Philippe

    2017-06-01

    Machine learning applications for personalized medicine are highly dependent on access to sufficient data. For personalized radiation oncology, datasets representing the variation in the entire cancer patient population need to be acquired and used to learn prediction models. Ethical and legal boundaries to ensure data privacy hamper collaboration between research institutes. We hypothesize that data sharing is possible without identifiable patient data leaving the radiation clinics and that building machine learning applications on distributed datasets is feasible. We developed and implemented an IT infrastructure in five radiation clinics across three countries (Belgium, Germany, and The Netherlands). We present here a proof-of-principle for future 'big data' infrastructures and distributed learning studies. Lung cancer patient data was collected in all five locations and stored in local databases. Exemplary support vector machine (SVM) models were learned using the Alternating Direction Method of Multipliers (ADMM) from the distributed databases to predict post-radiotherapy dyspnea grade [Formula: see text]. The discriminative performance was assessed by the area under the curve (AUC) in a five-fold cross-validation (learning on four sites and validating on the fifth). The performance of the distributed learning algorithm was compared to centralized learning where datasets of all institutes are jointly analyzed. The euroCAT infrastructure has been successfully implemented in five radiation clinics across three countries. SVM models can be learned on data distributed over all five clinics. Furthermore, the infrastructure provides a general framework to execute learning algorithms on distributed data. The ongoing expansion of the euroCAT network will facilitate machine learning in radiation oncology. The resulting access to larger datasets with sufficient variation will pave the way for generalizable prediction models and personalized medicine.

  18. Applying Knowledge Discovery in Databases in Public Health Data Set: Challenges and Concerns

    PubMed Central

    Volrathongchia, Kanittha

    2003-01-01

    In attempting to apply Knowledge Discovery in Databases (KDD) to generate a predictive model from a health care dataset that is currently available to the public, the first step is to pre-process the data to overcome the challenges of missing data, redundant observations, and records containing inaccurate data. This study will demonstrate how to use simple pre-processing methods to improve the quality of input data. PMID:14728545

  19. Novel Prostate Cancer Pathway Modeling using Boolean Implication

    DTIC Science & Technology

    2012-09-01

    cause of cancer deaths in men. Diagnosis and pathogenesis of this disease is poorly understood. Prostate specific antigen (PSA) test is still the... specific database, I combined 14 different datasets (global prostate cancer database, total n=891) that are in Affymetrix U133A (n=456), U133A 2.0...immunohistochemistry and flow cytometry are limited by the availability of antigen - specific monoclonal antibodies and by the small number of parallel

  20. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks.

    PubMed

    Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.

  1. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks

    PubMed Central

    Wu, Chenxue; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules. PMID:28767687

  2. Joint Sparse Representation for Robust Multimodal Biometrics Recognition

    DTIC Science & Technology

    2012-01-01

    described in III. Experimental evaluations on a comprehensive multimodal dataset and a face database have been described in section V. Finally, in...WVU Multimodal Dataset The WVU multimodal dataset is a comprehensive collection of different biometric modalities such as fingerprint, iris, palmprint ...Martnez and R. Benavente, “The AR face database ,” CVC Technical Report, June 1998. [29] U. Park and A. Jain, “Face matching and retrieval using soft

  3. Rapid development of entity-based data models for bioinformatics with persistence object-oriented design and structured interfaces.

    PubMed

    Ezra Tsur, Elishai

    2017-01-01

    Databases are imperative for research in bioinformatics and computational biology. Current challenges in database design include data heterogeneity and context-dependent interconnections between data entities. These challenges drove the development of unified data interfaces and specialized databases. The curation of specialized databases is an ever-growing challenge due to the introduction of new data sources and the emergence of new relational connections between established datasets. Here, an open-source framework for the curation of specialized databases is proposed. The framework supports user-designed models of data encapsulation, objects persistency and structured interfaces to local and external data sources such as MalaCards, Biomodels and the National Centre for Biotechnology Information (NCBI) databases. The proposed framework was implemented using Java as the development environment, EclipseLink as the data persistency agent and Apache Derby as the database manager. Syntactic analysis was based on J3D, jsoup, Apache Commons and w3c.dom open libraries. Finally, a construction of a specialized database for aneurysms associated vascular diseases is demonstrated. This database contains 3-dimensional geometries of aneurysms, patient's clinical information, articles, biological models, related diseases and our recently published model of aneurysms' risk of rapture. Framework is available in: http://nbel-lab.com.

  4. Geothermal Case Studies

    DOE Data Explorer

    Young, Katherine

    2014-09-30

    database.) In fiscal year 2015, NREL is working with universities to populate additional case studies on OpenEI. The goal is to provide a large enough dataset to start conducting analyses of exploration programs to identify correlations between successful exploration plans for areas with similar geologic occurrence models.

  5. Methods to achieve accurate projection of regional and global raster databases

    USGS Publications Warehouse

    Usery, E. Lynn; Seong, Jeong Chang; Steinwand, Dan

    2002-01-01

    Modeling regional and global activities of climatic and human-induced change requires accurate geographic data from which we can develop mathematical and statistical tabulations of attributes and properties of the environment. Many of these models depend on data formatted as raster cells or matrices of pixel values. Recently, it has been demonstrated that regional and global raster datasets are subject to significant error from mathematical projection and that these errors are of such magnitude that model results may be jeopardized (Steinwand, et al., 1995; Yang, et al., 1996; Usery and Seong, 2001; Seong and Usery, 2001). There is a need to develop methods of projection that maintain the accuracy of these datasets to support regional and global analyses and modeling

  6. Evaluation, Calibration and Comparison of the Precipitation-Runoff Modeling System (PRMS) National Hydrologic Model (NHM) Using Moderate Resolution Imaging Spectroradiometer (MODIS) and Snow Data Assimilation System (SNODAS) Gridded Datasets

    NASA Astrophysics Data System (ADS)

    Norton, P. A., II; Haj, A. E., Jr.

    2014-12-01

    The United States Geological Survey is currently developing a National Hydrologic Model (NHM) to support and facilitate coordinated and consistent hydrologic modeling efforts at the scale of the continental United States. As part of this effort, the Geospatial Fabric (GF) for the NHM was created. The GF is a database that contains parameters derived from datasets that characterize the physical features of watersheds. The GF was used to aggregate catchments and flowlines defined in the National Hydrography Dataset Plus dataset for more than 100,000 hydrologic response units (HRUs), and to establish initial parameter values for input to the Precipitation-Runoff Modeling System (PRMS). Many parameter values are adjusted in PRMS using an automated calibration process. Using these adjusted parameter values, the PRMS model estimated variables such as evapotranspiration (ET), potential evapotranspiration (PET), snow-covered area (SCA), and snow water equivalent (SWE). In order to evaluate the effectiveness of parameter calibration, and model performance in general, several satellite-based Moderate Resolution Imaging Spectroradiometer (MODIS) and Snow Data Assimilation System (SNODAS) gridded datasets including ET, PET, SCA, and SWE were compared to PRMS-simulated values. The MODIS and SNODAS data were spatially averaged for each HRU, and compared to PRMS-simulated ET, PET, SCA, and SWE values for each HRU in the Upper Missouri River watershed. Default initial GF parameter values and PRMS calibration ranges were evaluated. Evaluation results, and the use of MODIS and SNODAS datasets to update GF parameter values and PRMS calibration ranges, are presented and discussed.

  7. Plant databases and data analysis tools

    USDA-ARS?s Scientific Manuscript database

    It is anticipated that the coming years will see the generation of large datasets including diagnostic markers in several plant species with emphasis on crop plants. To use these datasets effectively in any plant breeding program, it is essential to have the information available via public database...

  8. A robust dataset-agnostic heart disease classifier from Phonocardiogram.

    PubMed

    Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M

    2017-07-01

    Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demography is a challenge. In this paper we have analysed a wide list of Phonocardiogram (PCG) features in time and frequency domain along with morphological and statistical features to construct a robust and discriminative feature set for dataset-agnostic classification of normal and cardiac patients. The large and open access database, made available in Physionet 2016 challenge was used for feature selection, internal validation and creation of training models. A second dataset of 41 PCG segments, collected using our in-house smart phone based digital stethoscope from an Indian hospital was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75 respectively on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior art approaches, when applied on the same dataset.

  9. A Comparative Study of Frequent and Maximal Periodic Pattern Mining Algorithms in Spatiotemporal Databases

    NASA Astrophysics Data System (ADS)

    Obulesu, O.; Rama Mohan Reddy, A., Dr; Mahendra, M.

    2017-08-01

    Detecting regular and efficient cyclic models is the demanding activity for data analysts due to unstructured, vigorous and enormous raw information produced from web. Many existing approaches generate large candidate patterns in the occurrence of huge and complex databases. In this work, two novel algorithms are proposed and a comparative examination is performed by considering scalability and performance parameters. The first algorithm is, EFPMA (Extended Regular Model Detection Algorithm) used to find frequent sequential patterns from the spatiotemporal dataset and the second one is, ETMA (Enhanced Tree-based Mining Algorithm) for detecting effective cyclic models with symbolic database representation. EFPMA is an algorithm grows models from both ends (prefixes and suffixes) of detected patterns, which results in faster pattern growth because of less levels of database projection compared to existing approaches such as Prefixspan and SPADE. ETMA uses distinct notions to store and manage transactions data horizontally such as segment, sequence and individual symbols. ETMA exploits a partition-and-conquer method to find maximal patterns by using symbolic notations. Using this algorithm, we can mine cyclic models in full-series sequential patterns including subsection series also. ETMA reduces the memory consumption and makes use of the efficient symbolic operation. Furthermore, ETMA only records time-series instances dynamically, in terms of character, series and section approaches respectively. The extent of the pattern and proving efficiency of the reducing and retrieval techniques from synthetic and actual datasets is a really open & challenging mining problem. These techniques are useful in data streams, traffic risk analysis, medical diagnosis, DNA sequence Mining, Earthquake prediction applications. Extensive investigational outcomes illustrates that the algorithms outperforms well towards efficiency and scalability than ECLAT, STNR and MAFIA approaches.

  10. OperomeDB: A Database of Condition-Specific Transcription Units in Prokaryotic Genomes.

    PubMed

    Chetal, Kashish; Janga, Sarath Chandra

    2015-01-01

    Background. In prokaryotic organisms, a substantial fraction of adjacent genes are organized into operons-codirectionally organized genes in prokaryotic genomes with the presence of a common promoter and terminator. Although several available operon databases provide information with varying levels of reliability, very few resources provide experimentally supported results. Therefore, we believe that the biological community could benefit from having a new operon prediction database with operons predicted using next-generation RNA-seq datasets. Description. We present operomeDB, a database which provides an ensemble of all the predicted operons for bacterial genomes using available RNA-sequencing datasets across a wide range of experimental conditions. Although several studies have recently confirmed that prokaryotic operon structure is dynamic with significant alterations across environmental and experimental conditions, there are no comprehensive databases for studying such variations across prokaryotic transcriptomes. Currently our database contains nine bacterial organisms and 168 transcriptomes for which we predicted operons. User interface is simple and easy to use, in terms of visualization, downloading, and querying of data. In addition, because of its ability to load custom datasets, users can also compare their datasets with publicly available transcriptomic data of an organism. Conclusion. OperomeDB as a database should not only aid experimental groups working on transcriptome analysis of specific organisms but also enable studies related to computational and comparative operomics.

  11. The GEM Global Active Faults Database: The growth and synthesis of a worldwide database of active structures for PSHA, research, and education

    NASA Astrophysics Data System (ADS)

    Styron, R. H.; Garcia, J.; Pagani, M.

    2017-12-01

    A global catalog of active faults is a resource of value to a wide swath of the geoscience, earthquake engineering, and hazards risk communities. Though construction of such a dataset has been attempted now and again through the past few decades, success has been elusive. The Global Earthquake Model (GEM) Foundation has been working on this problem, as a fundamental step in its goal of making a global seismic hazard model. Progress on the assembly of the database is rapid, with the concatenation of many national—, orogen—, and continental—scale datasets produced by different research groups throughout the years. However, substantial data gaps exist throughout much of the deforming world, requiring new mapping based on existing publications as well as consideration of seismicity, geodesy and remote sensing data. Thus far, new fault datasets have been created for the Caribbean and Central America, North Africa, and northeastern Asia, with Madagascar, Canada and a few other regions in the queue. The second major task, as formidable as the initial data concatenation, is the 'harmonization' of data. This entails the removal or recombination of duplicated structures, reconciliation of contrastinginterpretations in areas of overlap, and the synthesis of many different types of attributes or metadata into a consistent whole. In a project of this scale, the methods used in the database construction are as critical to project success as the data themselves. After some experimentation, we have settled on an iterative methodology that involves rapid accumulation of data followed by successive episodes of data revision, and a computer-scripted data assembly using GIS file formats that is flexible, reproducible, and as able as possible to cope with updates to the constituent datasets. We find that this approach of initially maximizing coverage and then increasing resolution is the most robust to regional data problems and the most amenable to continued updates and refinement. Combined with the public, open-source nature of this project, GEM is producing a resource that can continue to evolve with the changing knowledge and needs of the community.

  12. Using the gini coefficient to measure the chemical diversity of small-molecule libraries.

    PubMed

    Weidlich, Iwona E; Filippov, Igor V

    2016-08-15

    Modern databases of small organic molecules contain tens of millions of structures. The size of theoretically available chemistry is even larger. However, despite the large amount of chemical information, the "big data" moment for chemistry has not yet provided the corresponding payoff of cheaper computer-predicted medicine or robust machine-learning models for the determination of efficacy and toxicity. Here, we present a study of the diversity of chemical datasets using a measure that is commonly used in socioeconomic studies. We demonstrate the use of this diversity measure on several datasets that were constructed to contain various congeneric subsets of molecules as well as randomly selected molecules. We also apply our method to a number of well-known databases that are frequently used for structure-activity relationship modeling. Our results show the poor diversity of the common sources of potential lead compounds compared to actual known drugs. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.

  13. The Stream-Catchment (StreamCat) Dataset: A database of watershed metrics for the conterminous USA

    EPA Science Inventory

    We developed an extensive database of landscape metrics for ~2.65 million streams, and their associated catchments, within the conterminous USA: The Stream-Catchment (StreamCat) Dataset. These data are publically available and greatly reduce the specialized geospatial expertise n...

  14. Geologic datasets for weights-of-evidence analysis in northeast Washington: 2. Mineral databases

    USGS Publications Warehouse

    Boleneus, D.E.

    1999-01-01

    Digital mineral databases are necessary to carry out weights-of-evidence modeling of mineral resources for epithermal gold and carbonate-hosted lead-zinc deposits in northeast Washington. This report describes spreadsheet tables consisting of: 1) training sites for epithermal gold, 2) placer gold sites, 3) training sites for carbonate-hosted lead-zinc, and 4) small lead-zinc mines and prospects. A fifth table provides location data about sites in the four tables.

  15. Remote visual analysis of large turbulence databases at multiple scales

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methodsmore » supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.« less

  16. Remote visual analysis of large turbulence databases at multiple scales

    DOE PAGES

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...

    2018-06-15

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methodsmore » supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.« less

  17. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

    PubMed Central

    Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila

    2018-01-01

    Abstract The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374

  18. New models and online calculator for predicting non-sentinel lymph node status in sentinel lymph node positive breast cancer patients.

    PubMed

    Kohrt, Holbrook E; Olshen, Richard A; Bermas, Honnie R; Goodson, William H; Wood, Douglas J; Henry, Solomon; Rouse, Robert V; Bailey, Lisa; Philben, Vicki J; Dirbas, Frederick M; Dunn, Jocelyn J; Johnson, Denise L; Wapnir, Irene L; Carlson, Robert W; Stockdale, Frank E; Hansen, Nora M; Jeffrey, Stefanie S

    2008-03-04

    Current practice is to perform a completion axillary lymph node dissection (ALND) for breast cancer patients with tumor-involved sentinel lymph nodes (SLNs), although fewer than half will have non-sentinel node (NSLN) metastasis. Our goal was to develop new models to quantify the risk of NSLN metastasis in SLN-positive patients and to compare predictive capabilities to another widely used model. We constructed three models to predict NSLN status: recursive partitioning with receiver operating characteristic curves (RP-ROC), boosted Classification and Regression Trees (CART), and multivariate logistic regression (MLR) informed by CART. Data were compiled from a multicenter Northern California and Oregon database of 784 patients who prospectively underwent SLN biopsy and completion ALND. We compared the predictive abilities of our best model and the Memorial Sloan-Kettering Breast Cancer Nomogram (Nomogram) in our dataset and an independent dataset from Northwestern University. 285 patients had positive SLNs, of which 213 had known angiolymphatic invasion status and 171 had complete pathologic data including hormone receptor status. 264 (93%) patients had limited SLN disease (micrometastasis, 70%, or isolated tumor cells, 23%). 101 (35%) of all SLN-positive patients had tumor-involved NSLNs. Three variables (tumor size, angiolymphatic invasion, and SLN metastasis size) predicted risk in all our models. RP-ROC and boosted CART stratified patients into four risk levels. MLR informed by CART was most accurate. Using two composite predictors calculated from three variables, MLR informed by CART was more accurate than the Nomogram computed using eight predictors. In our dataset, area under ROC curve (AUC) was 0.83/0.85 for MLR (n = 213/n = 171) and 0.77 for Nomogram (n = 171). When applied to an independent dataset (n = 77), AUC was 0.74 for our model and 0.62 for Nomogram. The composite predictors in our model were the product of angiolymphatic invasion and size of SLN metastasis, and the product of tumor size and square of SLN metastasis size. We present a new model developed from a community-based SLN database that uses only three rather than eight variables to achieve higher accuracy than the Nomogram for predicting NSLN status in two different datasets.

  19. Taste CREp: the Cosmic-Ray Exposure program

    NASA Astrophysics Data System (ADS)

    Martin, Léo; Blard, Pierre-Henri; Balco, Greg; Lavé, Jérôme; Delunel, Romain; Lifton, Nathaniel

    2017-04-01

    We present here the CREp program and the ICE-D production rate database, an online system to compute Cosmic Ray Exposure (CRE) ages with cosmogenic 3He and 10Be (crep.crpg.cnrs-nancy.fr). The CREp calculator is designed to automatically reflect the current state of the global calibration database production rate stored in ICE-D (http://calibration.ice-d.org). ICE-D will be regularly updated in order to incorporate new calibration data and reflect the current state of the available literature. The CREp program permits to calculate ages in a flexible way: 1) Two scaling models are available, i.e. i) the empirical Lal-Stone time-dependent model (Balco et al., 2008; Lal, 1991; Stone, 2000) with the muon parameters of Braucher et al. (2011), and ii) the Lifton-Sato-Dunai (LSD) theoretical model (Lifton et al., 2014). 2) Users may also test the impact of the atmosphere model, using either i) the ERA-40 database (Uppala et al., 2005), or ii) the standard atmosphere (N.O.A.A., 1976). 3) For the time-dependent correction, users or choose among the three proposed geomagnetic datasets (Lifton, 2016; Lifton et al., 2014; Muscheler et al., 2005) or import their own database. 4) For the important choice of the production rate, CREp is linked to a database of production rate calibration data, ICE-D. This database includes published empirical calibration rate studies that are publicly available at present, including those of the CRONUS-Earth and CRONUS-EU projects, as well as studies from other projects. Users may select the production rates either: i) using a worldwide mean value, ii) a regionally averaged value (not available in regions with no data), iii) a local unique value, which can be chosen among the existing dataset or imported by the user, or iv) any combination of single or multiple calibration data. We tested the efficacy of the different scaling models by looking at the statistical dispersion of the computed Sea Level High Latitude (SLHL) calibrated production rates. Lal/Stone and LSD models have comparable efficacies, and the impact of the tested atmospheric model and the geomagnetic database is also limited. If a global mean is chosen, the 1σ uncertainty arising from the production rate is about 5% for 10Be and 10% for 3He. If a regional production rate is picked, these uncertainties are potentially lower.

  20. The GED4GEM project: development of a Global Exposure Database for the Global Earthquake Model initiative

    USGS Publications Warehouse

    Gamba, P.; Cavalca, D.; Jaiswal, K.S.; Huyck, C.; Crowley, H.

    2012-01-01

    In order to quantify earthquake risk of any selected region or a country of the world within the Global Earthquake Model (GEM) framework (www.globalquakemodel.org/), a systematic compilation of building inventory and population exposure is indispensable. Through the consortium of leading institutions and by engaging the domain-experts from multiple countries, the GED4GEM project has been working towards the development of a first comprehensive publicly available Global Exposure Database (GED). This geospatial exposure database will eventually facilitate global earthquake risk and loss estimation through GEM’s OpenQuake platform. This paper provides an overview of the GED concepts, aims, datasets, and inference methodology, as well as the current implementation scheme, status and way forward.

  1. The IRHUM (Isotopic Reconstruction of Human Migration) database - bioavailable strontium isotope ratios for geochemical fingerprinting in France

    NASA Astrophysics Data System (ADS)

    Willmes, M.; McMorrow, L.; Kinsley, L.; Armstrong, R.; Aubert, M.; Eggins, S.; Falguères, C.; Maureille, B.; Moffat, I.; Grün, R.

    2013-11-01

    Strontium isotope ratios (87Sr/86Sr) are a key geochemical tracer used in a wide range of fields including archaeology, ecology, food and forensic sciences. These applications are based on the principle that the Sr isotopic ratios of natural materials reflect the sources of strontium available during their formation. A major constraint for current studies is the lack of robust reference maps to evaluate the source of strontium isotope ratios measured in the samples. Here we provide a new dataset of bioavailable Sr isotope ratios for the major geologic units of France, based on plant and soil samples (Pangaea data repository doi:10.1594/PANGAEA.819142). The IRHUM (Isotopic Reconstruction of Human Migration) database is a web platform to access, explore and map our dataset. The database provides the spatial context and metadata for each sample, allowing the user to evaluate the suitability of the sample for their specific study. In addition, it allows users to upload and share their own datasets and data products, which will enhance collaboration across the different research fields. This article describes the sampling and analytical methods used to generate the dataset and how to use and access of the dataset through the IRHUM database. Any interpretation of the isotope dataset is outside the scope of this publication.

  2. Water and carbon stable isotope records from natural archives: a new database and interactive online platform for data browsing, visualizing and downloading

    NASA Astrophysics Data System (ADS)

    Bolliet, Timothé; Brockmann, Patrick; Masson-Delmotte, Valérie; Bassinot, Franck; Daux, Valérie; Genty, Dominique; Landais, Amaelle; Lavrieux, Marlène; Michel, Elisabeth; Ortega, Pablo; Risi, Camille; Roche, Didier M.; Vimeux, Françoise; Waelbroeck, Claire

    2016-08-01

    Past climate is an important benchmark to assess the ability of climate models to simulate key processes and feedbacks. Numerous proxy records exist for stable isotopes of water and/or carbon, which are also implemented inside the components of a growing number of Earth system model. Model-data comparisons can help to constrain the uncertainties associated with transfer functions. This motivates the need of producing a comprehensive compilation of different proxy sources. We have put together a global database of proxy records of oxygen (δ18O), hydrogen (δD) and carbon (δ13C) stable isotopes from different archives: ocean and lake sediments, corals, ice cores, speleothems and tree-ring cellulose. Source records were obtained from the georeferenced open access PANGAEA and NOAA libraries, complemented by additional data obtained from a literature survey. About 3000 source records were screened for chronological information and temporal resolution of proxy records. Altogether, this database consists of hundreds of dated δ18O, δ13C and δD records in a standardized simple text format, complemented with a metadata Excel catalog. A quality control flag was implemented to describe age markers and inform on chronological uncertainty. This compilation effort highlights the need to homogenize and structure the format of datasets and chronological information as well as enhance the distribution of published datasets that are currently highly fragmented and scattered. We also provide an online portal based on the records included in this database with an intuitive and interactive platform (http://climateproxiesfinder.ipsl.fr/), allowing one to easily select, visualize and download subsets of the homogeneously formatted records that constitute this database, following a choice of search criteria, and to upload new datasets. In the last part, we illustrate the type of application allowed by our database by comparing several key periods highly investigated by the paleoclimate community. For coherency with the Paleoclimate Modelling Intercomparison Project (PMIP), we focus on records spanning the past 200 years, the mid-Holocene (MH, 5.5-6.5 ka; calendar kiloyears before 1950), the Last Glacial Maximum (LGM, 19-23 ka), and those spanning the last interglacial period (LIG, 115-130 ka). Basic statistics have been applied to characterize anomalies between these different periods. Most changes from the MH to present day and from LIG to MH appear statistically insignificant. Significant global differences are reported from LGM to MH with regional discrepancies in signals from different archives and complex patterns.

  3. ESTuber db: an online database for Tuber borchii EST sequences.

    PubMed

    Lazzari, Barbara; Caprera, Andrea; Cosentino, Cristian; Stella, Alessandra; Milanesi, Luciano; Viotti, Angelo

    2007-03-08

    The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.

  4. USA National Phenology Network’s volunteer-contributed observations yield predictive models of phenological transitions

    PubMed Central

    Crimmins, Michael A.; Gerst, Katharine L.; Rosemartin, Alyssa H.; Weltzin, Jake F.

    2017-01-01

    Purpose In support of science and society, the USA National Phenology Network (USA-NPN) maintains a rapidly growing, continental-scale, species-rich dataset of plant and animal phenology observations that with over 10 million records is the largest such database in the United States. The aim of this study was to explore the potential that exists in the broad and rich volunteer-collected dataset maintained by the USA-NPN for constructing models predicting the timing of phenological transition across species’ ranges within the continental United States. Contributed voluntarily by professional and citizen scientists, these opportunistically collected observations are characterized by spatial clustering, inconsistent spatial and temporal sampling, and short temporal depth (2009-present). Whether data exhibiting such limitations can be used to develop predictive models appropriate for use across large geographic regions has not yet been explored. Methods We constructed predictive models for phenophases that are the most abundant in the database and also relevant to management applications for all species with available data, regardless of plant growth habit, location, geographic extent, or temporal depth of the observations. We implemented a very basic model formulation—thermal time models with a fixed start date. Results Sufficient data were available to construct 107 individual species × phenophase models. Remarkably, given the limited temporal depth of this dataset and the simple modeling approach used, fifteen of these models (14%) met our criteria for model fit and error. The majority of these models represented the “breaking leaf buds” and “leaves” phenophases and represented shrub or tree growth forms. Accumulated growing degree day (GDD) thresholds that emerged ranged from 454 GDDs (Amelanchier canadensis-breaking leaf buds) to 1,300 GDDs (Prunus serotina-open flowers). Such candidate thermal time thresholds can be used to produce real-time and short-term forecast maps of the timing of these phenophase transition. In addition, many of the candidate models that emerged were suitable for use across the majority of the species’ geographic ranges. Real-time and forecast maps of phenophase transitions could support a wide range of natural resource management applications, including invasive plant management, issuing asthma and allergy alerts, and anticipating frost damage for crops in vulnerable states. Implications Our finding that several viable thermal time threshold models that work across the majority of the species ranges could be constructed from the USA-NPN database provides clear evidence that great potential exists this dataset to develop more enhanced predictive models for additional species and phenophases. Further, the candidate models that emerged have immediate utility for supporting a wide range of management applications. PMID:28829783

  5. Estimation of average annual streamflows and power potentials for Alaska and Hawaii

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Verdin, Kristine L.

    2004-05-01

    This paper describes the work done to develop average annual streamflow estimates and power potential for the states of Alaska and Hawaii. The Elevation Derivatives for National Applications (EDNA) database was used, along with climatic datasets, to develop flow and power estimates for every stream reach in the EDNA database. Estimates of average annual streamflows were derived using state-specific regression equations, which were functions of average annual precipitation, precipitation intensity, drainage area, and other elevation-derived parameters. Power potential was calculated through the use of the average annual streamflow and the hydraulic head of each reach, which is calculated from themore » EDNA digital elevation model. In all, estimates of streamflow and power potential were calculated for over 170,000 stream segments in the Alaskan and Hawaiian datasets.« less

  6. Toward An Unstructured Mesh Database

    NASA Astrophysics Data System (ADS)

    Rezaei Mahdiraji, Alireza; Baumann, Peter Peter

    2014-05-01

    Unstructured meshes are used in several application domains such as earth sciences (e.g., seismology), medicine, oceanography, cli- mate modeling, GIS as approximate representations of physical objects. Meshes subdivide a domain into smaller geometric elements (called cells) which are glued together by incidence relationships. The subdivision of a domain allows computational manipulation of complicated physical structures. For instance, seismologists model earthquakes using elastic wave propagation solvers on hexahedral meshes. The hexahedral con- tains several hundred millions of grid points and millions of hexahedral cells. Each vertex node in the hexahedrals stores a multitude of data fields. To run simulation on such meshes, one needs to iterate over all the cells, iterate over incident cells to a given cell, retrieve coordinates of cells, assign data values to cells, etc. Although meshes are used in many application domains, to the best of our knowledge there is no database vendor that support unstructured mesh features. Currently, the main tool for querying and manipulating unstructured meshes are mesh libraries, e.g., CGAL and GRAL. Mesh li- braries are dedicated libraries which includes mesh algorithms and can be run on mesh representations. The libraries do not scale with dataset size, do not have declarative query language, and need deep C++ knowledge for query implementations. Furthermore, due to high coupling between the implementations and input file structure, the implementations are less reusable and costly to maintain. A dedicated mesh database offers the following advantages: 1) declarative querying, 2) ease of maintenance, 3) hiding mesh storage structure from applications, and 4) transparent query optimization. To design a mesh database, the first challenge is to define a suitable generic data model for unstructured meshes. We proposed ImG-Complexes data model as a generic topological mesh data model which extends incidence graph model to multi-incidence relationships. We instrument ImG model with sets of optional and application-specific constraints which can be used to check validity of meshes for a specific class of object such as manifold, pseudo-manifold, and simplicial manifold. We conducted experiments to measure the performance of the graph database solution in processing mesh queries and compare it with GrAL mesh library and PostgreSQL database on synthetic and real mesh datasets. The experiments show that each system perform well on specific types of mesh queries, e.g., graph databases perform well on global path-intensive queries. In the future, we investigate database operations for the ImG model and design a mesh query language.

  7. Preparing Nursing Home Data from Multiple Sites for Clinical Research – A Case Study Using Observational Health Data Sciences and Informatics

    PubMed Central

    Boyce, Richard D.; Handler, Steven M.; Karp, Jordan F.; Perera, Subashan; Reynolds, Charles F.

    2016-01-01

    Introduction: A potential barrier to nursing home research is the limited availability of research quality data in electronic form. We describe a case study of converting electronic health data from five skilled nursing facilities to a research quality longitudinal dataset by means of open-source tools produced by the Observational Health Data Sciences and Informatics (OHDSI) collaborative. Methods: The Long-Term Care Minimum Data Set (MDS), drug dispensing, and fall incident data from five SNFs were extracted, translated, and loaded into version 4 of the OHDSI common data model. Quality assurance involved identifying errors using the Achilles data characterization tool and comparing both quality measures and drug exposures in the new database for concordance with externally available sources. Findings: Records for a total 4,519 patients (95.1%) made it into the final database. Achilles identified 10 different types of errors that were addressed in the final dataset. Drug exposures based on dispensing were generally accurate when compared with medication administration data from the pharmacy services provider. Quality measures were generally concordant between the new database and Nursing Home Compare for measures with a prevalence ≥ 10%. Fall data recorded in MDS was found to be more complete than data from fall incident reports. Conclusions: The new dataset is ready to support observational research on topics of clinical importance in the nursing home including patient-level prediction of falls. The extraction, translation, and loading process enabled the use of OHDSI data characterization tools that improved the quality of the final dataset. PMID:27891528

  8. funRiceGenes dataset for comprehensive understanding and application of rice functional genes.

    PubMed

    Yao, Wen; Li, Guangwei; Yu, Yiming; Ouyang, Yidan

    2018-01-01

    As a main staple food, rice is also a model plant for functional genomic studies of monocots. Decoding of every DNA element of the rice genome is essential for genetic improvement to address increasing food demands. The past 15 years have witnessed extraordinary advances in rice functional genomics. Systematic characterization and proper deposition of every rice gene are vital for both functional studies and crop genetic improvement. We built a comprehensive and accurate dataset of ∼2800 functionally characterized rice genes and ∼5000 members of different gene families by integrating data from available databases and reviewing every publication on rice functional genomic studies. The dataset accounts for 19.2% of the 39 045 annotated protein-coding rice genes, which provides the most exhaustive archive for investigating the functions of rice genes. We also constructed 214 gene interaction networks based on 1841 connections between 1310 genes. The largest network with 762 genes indicated that pleiotropic genes linked different biological pathways. Increasing degree of conservation of the flowering pathway was observed among more closely related plants, implying substantial value of rice genes for future dissection of flowering regulation in other crops. All data are deposited in the funRiceGenes database (https://funricegenes.github.io/). Functionality for advanced search and continuous updating of the database are provided by a Shiny application (http://funricegenes.ncpgr.cn/). The funRiceGenes dataset would enable further exploring of the crosslink between gene functions and natural variations in rice, which can also facilitate breeding design to improve target agronomic traits of rice. © The Authors 2017. Published by Oxford University Press.

  9. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.

    PubMed

    Teodoro, Douglas; Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio

    2018-01-01

    The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.

  10. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers

    PubMed Central

    Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio

    2018-01-01

    The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms. PMID:29293556

  11. A study on identification of bacteria in environmental samples using single-cell Raman spectroscopy: feasibility and reference libraries.

    PubMed

    Baritaux, Jean-Charles; Simon, Anne-Catherine; Schultz, Emmanuelle; Emain, C; Laurent, P; Dinten, Jean-Marc

    2016-05-01

    We report on our recent efforts towards identifying bacteria in environmental samples by means of Raman spectroscopy. We established a database of Raman spectra from bacteria submitted to various environmental conditions. This dataset was used to verify that Raman typing is possible from measurements performed in non-ideal conditions. Starting from the same dataset, we then varied the phenotype and matrix diversity content included in the reference library used to train the statistical model. The results show that it is possible to obtain models with an extended coverage of spectral variabilities, compared to environment-specific models trained on spectra from a restricted set of conditions. Broad coverage models are desirable for environmental samples since the exact conditions of the bacteria cannot be controlled.

  12. A digital library for medical imaging activities

    NASA Astrophysics Data System (ADS)

    dos Santos, Marcelo; Furuie, Sérgio S.

    2007-03-01

    This work presents the development of an electronic infrastructure to make available a free, online, multipurpose and multimodality medical image database. The proposed infrastructure implements a distributed architecture for medical image database, authoring tools, and a repository for multimedia documents. Also it includes a peer-reviewed model that assures quality of dataset. This public repository provides a single point of access for medical images and related information to facilitate retrieval tasks. The proposed approach has been used as an electronic teaching system in Radiology as well.

  13. Generating Modeling Data From Repeat-Dose Toxicity Reports

    PubMed Central

    López-Massaguer, Oriol; Pinto-Gil, Kevin; Sanz, Ferran; Amberg, Alexander; Anger, Lennart T; Stolte, Manuela; Ravagli, Carlo

    2018-01-01

    Abstract Over the past decades, pharmaceutical companies have conducted a large number of high-quality in vivo repeat-dose toxicity (RDT) studies for regulatory purposes. As part of the eTOX project, a high number of these studies have been compiled and integrated into a database. This valuable resource can be queried directly, but it can be further exploited to build predictive models. As the studies were originally conducted to investigate the properties of individual compounds, the experimental conditions across the studies are highly heterogeneous. Consequently, the original data required normalization/standardization, filtering, categorization and integration to make possible any data analysis (such as building predictive models). Additionally, the primary objectives of the RDT studies were to identify toxicological findings, most of which do not directly translate to in vivo endpoints. This article describes a method to extract datasets containing comparable toxicological properties for a series of compounds amenable for building predictive models. The proposed strategy starts with the normalization of the terms used within the original reports. Then, comparable datasets are extracted from the database by applying filters based on the experimental conditions. Finally, carefully selected profiles of toxicological findings are mapped to endpoints of interest, generating QSAR-like tables. In this work, we describe in detail the strategy and tools used for carrying out these transformations and illustrate its application in a data sample extracted from the eTOX database. The suitability of the resulting tables for developing hazard-predicting models was investigated by building proof-of-concept models for in vivo liver endpoints. PMID:29155963

  14. CrossCheck: an open-source web tool for high-throughput screen data analysis.

    PubMed

    Najafov, Jamil; Najafov, Ayaz

    2017-07-19

    Modern high-throughput screening methods allow researchers to generate large datasets that potentially contain important biological information. However, oftentimes, picking relevant hits from such screens and generating testable hypotheses requires training in bioinformatics and the skills to efficiently perform database mining. There are currently no tools available to general public that allow users to cross-reference their screen datasets with published screen datasets. To this end, we developed CrossCheck, an online platform for high-throughput screen data analysis. CrossCheck is a centralized database that allows effortless comparison of the user-entered list of gene symbols with 16,231 published datasets. These datasets include published data from genome-wide RNAi and CRISPR screens, interactome proteomics and phosphoproteomics screens, cancer mutation databases, low-throughput studies of major cell signaling mediators, such as kinases, E3 ubiquitin ligases and phosphatases, and gene ontological information. Moreover, CrossCheck includes a novel database of predicted protein kinase substrates, which was developed using proteome-wide consensus motif searches. CrossCheck dramatically simplifies high-throughput screen data analysis and enables researchers to dig deep into the published literature and streamline data-driven hypothesis generation. CrossCheck is freely accessible as a web-based application at http://proteinguru.com/crosscheck.

  15. A Graphics Design Framework to Visualize Multi-Dimensional Economic Datasets

    ERIC Educational Resources Information Center

    Chandramouli, Magesh; Narayanan, Badri; Bertoline, Gary R.

    2013-01-01

    This study implements a prototype graphics visualization framework to visualize multidimensional data. This graphics design framework serves as a "visual analytical database" for visualization and simulation of economic models. One of the primary goals of any kind of visualization is to extract useful information from colossal volumes of…

  16. Large Dataset of Acute Oral Toxicity Data Created for Testing ...

    EPA Pesticide Factsheets

    Acute toxicity data is a common requirement for substance registration in the US. Currently only data derived from animal tests are accepted by regulatory agencies, and the standard in vivo tests use lethality as the endpoint. Non-animal alternatives such as in silico models are being developed due to animal welfare and resource considerations. We compiled a large dataset of oral rat LD50 values to assess the predictive performance currently available in silico models. Our dataset combines LD50 values from five different sources: literature data provided by The Dow Chemical Company, REACH data from eChemportal, HSDB (Hazardous Substances Data Bank), RTECS data from Leadscope, and the training set underpinning TEST (Toxicity Estimation Software Tool). Combined these data sources yield 33848 chemical-LD50 pairs (data points), with 23475 unique data points covering 16439 compounds. The entire dataset was loaded into a chemical properties database. All of the compounds were registered in DSSTox and 59.5% have publically available structures. Compounds without a structure in DSSTox are currently having their structures registered. The structural data will be used to evaluate the predictive performance and applicable chemical domains of three QSAR models (TIMES, PROTOX, and TEST). Future work will combine the dataset with information from ToxCast assays, and using random forest modeling, assess whether ToxCast assays are useful in predicting acute oral toxicity. Pre

  17. Study on a pattern classification method of soil quality based on simplified learning sample dataset

    USGS Publications Warehouse

    Zhang, Jiahua; Liu, S.; Hu, Y.; Tian, Y.

    2011-01-01

    Based on the massive soil information in current soil quality grade evaluation, this paper constructed an intelligent classification approach of soil quality grade depending on classical sampling techniques and disordered multiclassification Logistic regression model. As a case study to determine the learning sample capacity under certain confidence level and estimation accuracy, and use c-means algorithm to automatically extract the simplified learning sample dataset from the cultivated soil quality grade evaluation database for the study area, Long chuan county in Guangdong province, a disordered Logistic classifier model was then built and the calculation analysis steps of soil quality grade intelligent classification were given. The result indicated that the soil quality grade can be effectively learned and predicted by the extracted simplified dataset through this method, which changed the traditional method for soil quality grade evaluation. ?? 2011 IEEE.

  18. The ChArMEx database

    NASA Astrophysics Data System (ADS)

    Ferré, Hélène; Descloitres, Jacques; Fleury, Laurence; Boichard, Jean-Luc; Brissebrat, Guillaume; Focsa, Loredana; Henriot, Nicolas; Mastrorillo, Laurence; Mière, Arnaud; Vermeulen, Anne

    2013-04-01

    The Chemistry-Aerosol Mediterranean Experiment (ChArMEx, http://charmex.lsce.ipsl.fr/) aims at a scientific assessment of the present and future state of the atmospheric environment in the Mediterranean Basin, and of its impacts on the regional climate, air quality, and marine biogeochemistry. The project includes long term monitoring of environmental parameters, intensive field campaigns, use of satellite data and modelling studies. Therefore ChARMEx scientists produce and need to access a wide diversity of data. In this context, the objective of the database task is to organize data management, distribution system and services such as facilitating the exchange of information and stimulating the collaboration between researchers within the ChArMEx community, and beyond. The database relies on a strong collaboration between OMP and ICARE data centres and falls within the scope of the Mediterranean Integrated Studies at Regional And Locals Scales (MISTRALS) program data portal. All the data produced by or of interest for the ChArMEx community will be documented in the data catalogue and accessible through the database website: http://mistrals.sedoo.fr/ChArMEx. The database website offers different tools: - A registration procedure which enables any scientist to accept the data policy and apply for a user database account. - Forms to document observations or products that will be provided to the database in compliance with metadata international standards (ISO 19115-19139; INSPIRE; Global Change Master Directory Thesaurus). - A search tool to browse the catalogue using thematic, geographic and/or temporal criteria. - Sorted lists of the datasets by thematic keywords, by measured parameters, by instruments or by platform type. - A shopping-cart web interface to order in situ data files. At present datasets from the background monitoring station of Ersa, Cape Corsica and from the 2012 ChArMEx pre-campaign are available. - A user-friendly access to satellite products (SEVIRI, TRIMM, PARASOL...) stored in the ICARE data archive using OpeNDAP protocole The website will soon propose new facilities. In particular, many in situ datasets will be homogenized and inserted in a relational database, in order to enable more accurate data selection and download of different datasets in a shared format. In order to meet the operational needs of the airborne and ground based observational teams during the ChArMEx 2012 pre-campaign and the 2013 experiment, a day-to-day quick look and report display website has been developed too: http://choc.sedoo.org. It offers a convenient way to browse weather conditions and chemical composition during the campaign periods.

  19. High resolution population distribution maps for Southeast Asia in 2010 and 2015.

    PubMed

    Gaughan, Andrea E; Stevens, Forrest R; Linard, Catherine; Jia, Peng; Tatem, Andrew J

    2013-01-01

    Spatially accurate, contemporary data on human population distributions are vitally important to many applied and theoretical researchers. The Southeast Asia region has undergone rapid urbanization and population growth over the past decade, yet existing spatial population distribution datasets covering the region are based principally on population count data from censuses circa 2000, with often insufficient spatial resolution or input data to map settlements precisely. Here we outline approaches to construct a database of GIS-linked circa 2010 census data and methods used to construct fine-scale (∼100 meters spatial resolution) population distribution datasets for each country in the Southeast Asia region. Landsat-derived settlement maps and land cover information were combined with ancillary datasets on infrastructure to model population distributions for 2010 and 2015. These products were compared with those from two other methods used to construct commonly used global population datasets. Results indicate mapping accuracies are consistently higher when incorporating land cover and settlement information into the AsiaPop modelling process. Using existing data, it is possible to produce detailed, contemporary and easily updatable population distribution datasets for Southeast Asia. The 2010 and 2015 datasets produced are freely available as a product of the AsiaPop Project and can be downloaded from: www.asiapop.org.

  20. High Resolution Population Distribution Maps for Southeast Asia in 2010 and 2015

    PubMed Central

    Gaughan, Andrea E.; Stevens, Forrest R.; Linard, Catherine; Jia, Peng; Tatem, Andrew J.

    2013-01-01

    Spatially accurate, contemporary data on human population distributions are vitally important to many applied and theoretical researchers. The Southeast Asia region has undergone rapid urbanization and population growth over the past decade, yet existing spatial population distribution datasets covering the region are based principally on population count data from censuses circa 2000, with often insufficient spatial resolution or input data to map settlements precisely. Here we outline approaches to construct a database of GIS-linked circa 2010 census data and methods used to construct fine-scale (∼100 meters spatial resolution) population distribution datasets for each country in the Southeast Asia region. Landsat-derived settlement maps and land cover information were combined with ancillary datasets on infrastructure to model population distributions for 2010 and 2015. These products were compared with those from two other methods used to construct commonly used global population datasets. Results indicate mapping accuracies are consistently higher when incorporating land cover and settlement information into the AsiaPop modelling process. Using existing data, it is possible to produce detailed, contemporary and easily updatable population distribution datasets for Southeast Asia. The 2010 and 2015 datasets produced are freely available as a product of the AsiaPop Project and can be downloaded from: www.asiapop.org. PMID:23418469

  1. ANISEED 2017: extending the integrated ascidian database to the exploration and evolutionary comparison of genome-scale datasets

    PubMed Central

    Brozovic, Matija; Dantec, Christelle; Dardaillon, Justine; Dauga, Delphine; Faure, Emmanuel; Gineste, Mathieu; Louis, Alexandra; Naville, Magali; Nitta, Kazuhiro R; Piette, Jacques; Reeves, Wendy; Scornavacca, Céline; Simion, Paul; Vincentelli, Renaud; Bellec, Maelle; Aicha, Sameh Ben; Fagotto, Marie; Guéroult-Bellone, Marion; Haeussler, Maximilian; Jacox, Edwin; Lowe, Elijah K; Mendez, Mickael; Roberge, Alexis; Stolfi, Alberto; Yokomori, Rui; Cambillau, Christian; Christiaen, Lionel; Delsuc, Frédéric; Douzery, Emmanuel; Dumollard, Rémi; Kusakabe, Takehiro; Nakai, Kenta; Nishida, Hiroki; Satou, Yutaka; Swalla, Billie; Veeman, Michael; Volff, Jean-Nicolas

    2018-01-01

    Abstract ANISEED (www.aniseed.cnrs.fr) is the main model organism database for tunicates, the sister-group of vertebrates. This release gives access to annotated genomes, gene expression patterns, and anatomical descriptions for nine ascidian species. It provides increased integration with external molecular and taxonomy databases, better support for epigenomics datasets, in particular RNA-seq, ChIP-seq and SELEX-seq, and features novel interactive interfaces for existing and novel datatypes. In particular, the cross-species navigation and comparison is enhanced through a novel taxonomy section describing each represented species and through the implementation of interactive phylogenetic gene trees for 60% of tunicate genes. The gene expression section displays the results of RNA-seq experiments for the three major model species of solitary ascidians. Gene expression is controlled by the binding of transcription factors to cis-regulatory sequences. A high-resolution description of the DNA-binding specificity for 131 Ciona robusta (formerly C. intestinalis type A) transcription factors by SELEX-seq is provided and used to map candidate binding sites across the Ciona robusta and Phallusia mammillata genomes. Finally, use of a WashU Epigenome browser enhances genome navigation, while a Genomicus server was set up to explore microsynteny relationships within tunicates and with vertebrates, Amphioxus, echinoderms and hemichordates. PMID:29149270

  2. Validated environmental and physiological data from the CELSS Breadboard Projects Biomass Production Chamber. BWT931 (Wheat cv. Yecora Rojo)

    NASA Technical Reports Server (NTRS)

    Stutte, G. W.; Mackowiak, C. L.; Markwell, G. A.; Wheeler, R. M.; Sager, J. C.

    1993-01-01

    This KSC database is being made available to the scientific research community to facilitate the development of crop development models, to test monitoring and control strategies, and to identify environmental limitations in crop production systems. The KSC validated dataset consists of 17 parameters necessary to maintain bioregenerative life support functions: water purification, CO2 removal, O2 production, and biomass production. The data are available on disk as either a DATABASE SUBSET (one week of 5-minute data) or DATABASE SUMMARY (daily averages of parameters). Online access to the VALIDATED DATABASE will be made available to institutions with specific programmatic requirements. Availability and access to the KSC validated database are subject to approval and limitations implicit in KSC computer security policies.

  3. New clues on carcinogenicity-related substructures derived from mining two large datasets of chemical compounds.

    PubMed

    Golbamaki, Azadi; Benfenati, Emilio; Golbamaki, Nazanin; Manganaro, Alberto; Merdivan, Erinc; Roncaglioni, Alessandra; Gini, Giuseppina

    2016-04-02

    In this study, new molecular fragments associated with genotoxic and nongenotoxic carcinogens are introduced to estimate the carcinogenic potential of compounds. Two rule-based carcinogenesis models were developed with the aid of SARpy: model R (from rodents' experimental data) and model E (from human carcinogenicity data). Structural alert extraction method of SARpy uses a completely automated and unbiased manner with statistical significance. The carcinogenicity models developed in this study are collections of carcinogenic potential fragments that were extracted from two carcinogenicity databases: the ANTARES carcinogenicity dataset with information from bioassay on rats and the combination of ISSCAN and CGX datasets, which take into accounts human-based assessment. The performance of these two models was evaluated in terms of cross-validation and external validation using a 258 compound case study dataset. Combining R and H predictions and scoring a positive or negative result when both models are concordant on a prediction, increased accuracy to 72% and specificity to 79% on the external test set. The carcinogenic fragments present in the two models were compared and analyzed from the point of view of chemical class. The results of this study show that the developed rule sets will be a useful tool to identify some new structural alerts of carcinogenicity and provide effective information on the molecular structures of carcinogenic chemicals.

  4. The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts

    PubMed Central

    Hudson, Lawrence N; Newbold, Tim; Contu, Sara; Hill, Samantha L L; Lysenko, Igor; De Palma, Adriana; Phillips, Helen R P; Senior, Rebecca A; Bennett, Dominic J; Booth, Hollie; Choimes, Argyrios; Correia, David L P; Day, Julie; Echeverría-Londoño, Susy; Garon, Morgan; Harrison, Michelle L K; Ingram, Daniel J; Jung, Martin; Kemp, Victoria; Kirkpatrick, Lucinda; Martin, Callum D; Pan, Yuan; White, Hannah J; Aben, Job; Abrahamczyk, Stefan; Adum, Gilbert B; Aguilar-Barquero, Virginia; Aizen, Marcelo A; Ancrenaz, Marc; Arbeláez-Cortés, Enrique; Armbrecht, Inge; Azhar, Badrul; Azpiroz, Adrián B; Baeten, Lander; Báldi, András; Banks, John E; Barlow, Jos; Batáry, Péter; Bates, Adam J; Bayne, Erin M; Beja, Pedro; Berg, Åke; Berry, Nicholas J; Bicknell, Jake E; Bihn, Jochen H; Böhning-Gaese, Katrin; Boekhout, Teun; Boutin, Céline; Bouyer, Jérémy; Brearley, Francis Q; Brito, Isabel; Brunet, Jörg; Buczkowski, Grzegorz; Buscardo, Erika; Cabra-García, Jimmy; Calviño-Cancela, María; Cameron, Sydney A; Cancello, Eliana M; Carrijo, Tiago F; Carvalho, Anelena L; Castro, Helena; Castro-Luna, Alejandro A; Cerda, Rolando; Cerezo, Alexis; Chauvat, Matthieu; Clarke, Frank M; Cleary, Daniel F R; Connop, Stuart P; D'Aniello, Biagio; da Silva, Pedro Giovâni; Darvill, Ben; Dauber, Jens; Dejean, Alain; Diekötter, Tim; Dominguez-Haydar, Yamileth; Dormann, Carsten F; Dumont, Bertrand; Dures, Simon G; Dynesius, Mats; Edenius, Lars; Elek, Zoltán; Entling, Martin H; Farwig, Nina; Fayle, Tom M; Felicioli, Antonio; Felton, Annika M; Ficetola, Gentile F; Filgueiras, Bruno K C; Fonte, Steven J; Fraser, Lauchlan H; Fukuda, Daisuke; Furlani, Dario; Ganzhorn, Jörg U; Garden, Jenni G; Gheler-Costa, Carla; Giordani, Paolo; Giordano, Simonetta; Gottschalk, Marco S; Goulson, Dave; Gove, Aaron D; Grogan, James; Hanley, Mick E; Hanson, Thor; Hashim, Nor R; Hawes, Joseph E; Hébert, Christian; Helden, Alvin J; Henden, John-André; Hernández, Lionel; Herzog, Felix; Higuera-Diaz, Diego; Hilje, Branko; Horgan, Finbarr G; Horváth, Roland; Hylander, Kristoffer; Isaacs-Cubides, Paola; Ishitani, Masahiro; Jacobs, Carmen T; Jaramillo, Víctor J; Jauker, Birgit; Jonsell, Mats; Jung, Thomas S; Kapoor, Vena; Kati, Vassiliki; Katovai, Eric; Kessler, Michael; Knop, Eva; Kolb, Annette; Kőrösi, Ádám; Lachat, Thibault; Lantschner, Victoria; Le Féon, Violette; LeBuhn, Gretchen; Légaré, Jean-Philippe; Letcher, Susan G; Littlewood, Nick A; López-Quintero, Carlos A; Louhaichi, Mounir; Lövei, Gabor L; Lucas-Borja, Manuel Esteban; Luja, Victor H; Maeto, Kaoru; Magura, Tibor; Mallari, Neil Aldrin; Marin-Spiotta, Erika; Marshall, E J P; Martínez, Eliana; Mayfield, Margaret M; Mikusinski, Grzegorz; Milder, Jeffrey C; Miller, James R; Morales, Carolina L; Muchane, Mary N; Muchane, Muchai; Naidoo, Robin; Nakamura, Akihiro; Naoe, Shoji; Nates-Parra, Guiomar; Navarrete Gutierrez, Dario A; Neuschulz, Eike L; Noreika, Norbertas; Norfolk, Olivia; Noriega, Jorge Ari; Nöske, Nicole M; O'Dea, Niall; Oduro, William; Ofori-Boateng, Caleb; Oke, Chris O; Osgathorpe, Lynne M; Paritsis, Juan; Parra-H, Alejandro; Pelegrin, Nicolás; Peres, Carlos A; Persson, Anna S; Petanidou, Theodora; Phalan, Ben; Philips, T Keith; Poveda, Katja; Power, Eileen F; Presley, Steven J; Proença, Vânia; Quaranta, Marino; Quintero, Carolina; Redpath-Downing, Nicola A; Reid, J Leighton; Reis, Yana T; Ribeiro, Danilo B; Richardson, Barbara A; Richardson, Michael J; Robles, Carolina A; Römbke, Jörg; Romero-Duque, Luz Piedad; Rosselli, Loreta; Rossiter, Stephen J; Roulston, T'ai H; Rousseau, Laurent; Sadler, Jonathan P; Sáfián, Szabolcs; Saldaña-Vázquez, Romeo A; Samnegård, Ulrika; Schüepp, Christof; Schweiger, Oliver; Sedlock, Jodi L; Shahabuddin, Ghazala; Sheil, Douglas; Silva, Fernando A B; Slade, Eleanor M; Smith-Pardo, Allan H; Sodhi, Navjot S; Somarriba, Eduardo J; Sosa, Ramón A; Stout, Jane C; Struebig, Matthew J; Sung, Yik-Hei; Threlfall, Caragh G; Tonietto, Rebecca; Tóthmérész, Béla; Tscharntke, Teja; Turner, Edgar C; Tylianakis, Jason M; Vanbergen, Adam J; Vassilev, Kiril; Verboven, Hans A F; Vergara, Carlos H; Vergara, Pablo M; Verhulst, Jort; Walker, Tony R; Wang, Yanping; Watling, James I; Wells, Konstans; Williams, Christopher D; Willig, Michael R; Woinarski, John C Z; Wolf, Jan H D; Woodcock, Ben A; Yu, Douglas W; Zaitsev, Andrey S; Collen, Ben; Ewers, Rob M; Mace, Georgina M; Purves, Drew W; Scharlemann, Jörn P W; Purvis, Andy

    2014-01-01

    Biodiversity continues to decline in the face of increasing anthropogenic pressures such as habitat destruction, exploitation, pollution and introduction of alien species. Existing global databases of species’ threat status or population time series are dominated by charismatic species. The collation of datasets with broad taxonomic and biogeographic extents, and that support computation of a range of biodiversity indicators, is necessary to enable better understanding of historical declines and to project – and avert – future declines. We describe and assess a new database of more than 1.6 million samples from 78 countries representing over 28,000 species, collated from existing spatial comparisons of local-scale biodiversity exposed to different intensities and types of anthropogenic pressures, from terrestrial sites around the world. The database contains measurements taken in 208 (of 814) ecoregions, 13 (of 14) biomes, 25 (of 35) biodiversity hotspots and 16 (of 17) megadiverse countries. The database contains more than 1% of the total number of all species described, and more than 1% of the described species within many taxonomic groups – including flowering plants, gymnosperms, birds, mammals, reptiles, amphibians, beetles, lepidopterans and hymenopterans. The dataset, which is still being added to, is therefore already considerably larger and more representative than those used by previous quantitative models of biodiversity trends and responses. The database is being assembled as part of the PREDICTS project (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems – http://www.predicts.org.uk). We make site-level summary data available alongside this article. The full database will be publicly available in 2015. PMID:25558364

  5. The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts.

    PubMed

    Hudson, Lawrence N; Newbold, Tim; Contu, Sara; Hill, Samantha L L; Lysenko, Igor; De Palma, Adriana; Phillips, Helen R P; Senior, Rebecca A; Bennett, Dominic J; Booth, Hollie; Choimes, Argyrios; Correia, David L P; Day, Julie; Echeverría-Londoño, Susy; Garon, Morgan; Harrison, Michelle L K; Ingram, Daniel J; Jung, Martin; Kemp, Victoria; Kirkpatrick, Lucinda; Martin, Callum D; Pan, Yuan; White, Hannah J; Aben, Job; Abrahamczyk, Stefan; Adum, Gilbert B; Aguilar-Barquero, Virginia; Aizen, Marcelo A; Ancrenaz, Marc; Arbeláez-Cortés, Enrique; Armbrecht, Inge; Azhar, Badrul; Azpiroz, Adrián B; Baeten, Lander; Báldi, András; Banks, John E; Barlow, Jos; Batáry, Péter; Bates, Adam J; Bayne, Erin M; Beja, Pedro; Berg, Åke; Berry, Nicholas J; Bicknell, Jake E; Bihn, Jochen H; Böhning-Gaese, Katrin; Boekhout, Teun; Boutin, Céline; Bouyer, Jérémy; Brearley, Francis Q; Brito, Isabel; Brunet, Jörg; Buczkowski, Grzegorz; Buscardo, Erika; Cabra-García, Jimmy; Calviño-Cancela, María; Cameron, Sydney A; Cancello, Eliana M; Carrijo, Tiago F; Carvalho, Anelena L; Castro, Helena; Castro-Luna, Alejandro A; Cerda, Rolando; Cerezo, Alexis; Chauvat, Matthieu; Clarke, Frank M; Cleary, Daniel F R; Connop, Stuart P; D'Aniello, Biagio; da Silva, Pedro Giovâni; Darvill, Ben; Dauber, Jens; Dejean, Alain; Diekötter, Tim; Dominguez-Haydar, Yamileth; Dormann, Carsten F; Dumont, Bertrand; Dures, Simon G; Dynesius, Mats; Edenius, Lars; Elek, Zoltán; Entling, Martin H; Farwig, Nina; Fayle, Tom M; Felicioli, Antonio; Felton, Annika M; Ficetola, Gentile F; Filgueiras, Bruno K C; Fonte, Steven J; Fraser, Lauchlan H; Fukuda, Daisuke; Furlani, Dario; Ganzhorn, Jörg U; Garden, Jenni G; Gheler-Costa, Carla; Giordani, Paolo; Giordano, Simonetta; Gottschalk, Marco S; Goulson, Dave; Gove, Aaron D; Grogan, James; Hanley, Mick E; Hanson, Thor; Hashim, Nor R; Hawes, Joseph E; Hébert, Christian; Helden, Alvin J; Henden, John-André; Hernández, Lionel; Herzog, Felix; Higuera-Diaz, Diego; Hilje, Branko; Horgan, Finbarr G; Horváth, Roland; Hylander, Kristoffer; Isaacs-Cubides, Paola; Ishitani, Masahiro; Jacobs, Carmen T; Jaramillo, Víctor J; Jauker, Birgit; Jonsell, Mats; Jung, Thomas S; Kapoor, Vena; Kati, Vassiliki; Katovai, Eric; Kessler, Michael; Knop, Eva; Kolb, Annette; Kőrösi, Ádám; Lachat, Thibault; Lantschner, Victoria; Le Féon, Violette; LeBuhn, Gretchen; Légaré, Jean-Philippe; Letcher, Susan G; Littlewood, Nick A; López-Quintero, Carlos A; Louhaichi, Mounir; Lövei, Gabor L; Lucas-Borja, Manuel Esteban; Luja, Victor H; Maeto, Kaoru; Magura, Tibor; Mallari, Neil Aldrin; Marin-Spiotta, Erika; Marshall, E J P; Martínez, Eliana; Mayfield, Margaret M; Mikusinski, Grzegorz; Milder, Jeffrey C; Miller, James R; Morales, Carolina L; Muchane, Mary N; Muchane, Muchai; Naidoo, Robin; Nakamura, Akihiro; Naoe, Shoji; Nates-Parra, Guiomar; Navarrete Gutierrez, Dario A; Neuschulz, Eike L; Noreika, Norbertas; Norfolk, Olivia; Noriega, Jorge Ari; Nöske, Nicole M; O'Dea, Niall; Oduro, William; Ofori-Boateng, Caleb; Oke, Chris O; Osgathorpe, Lynne M; Paritsis, Juan; Parra-H, Alejandro; Pelegrin, Nicolás; Peres, Carlos A; Persson, Anna S; Petanidou, Theodora; Phalan, Ben; Philips, T Keith; Poveda, Katja; Power, Eileen F; Presley, Steven J; Proença, Vânia; Quaranta, Marino; Quintero, Carolina; Redpath-Downing, Nicola A; Reid, J Leighton; Reis, Yana T; Ribeiro, Danilo B; Richardson, Barbara A; Richardson, Michael J; Robles, Carolina A; Römbke, Jörg; Romero-Duque, Luz Piedad; Rosselli, Loreta; Rossiter, Stephen J; Roulston, T'ai H; Rousseau, Laurent; Sadler, Jonathan P; Sáfián, Szabolcs; Saldaña-Vázquez, Romeo A; Samnegård, Ulrika; Schüepp, Christof; Schweiger, Oliver; Sedlock, Jodi L; Shahabuddin, Ghazala; Sheil, Douglas; Silva, Fernando A B; Slade, Eleanor M; Smith-Pardo, Allan H; Sodhi, Navjot S; Somarriba, Eduardo J; Sosa, Ramón A; Stout, Jane C; Struebig, Matthew J; Sung, Yik-Hei; Threlfall, Caragh G; Tonietto, Rebecca; Tóthmérész, Béla; Tscharntke, Teja; Turner, Edgar C; Tylianakis, Jason M; Vanbergen, Adam J; Vassilev, Kiril; Verboven, Hans A F; Vergara, Carlos H; Vergara, Pablo M; Verhulst, Jort; Walker, Tony R; Wang, Yanping; Watling, James I; Wells, Konstans; Williams, Christopher D; Willig, Michael R; Woinarski, John C Z; Wolf, Jan H D; Woodcock, Ben A; Yu, Douglas W; Zaitsev, Andrey S; Collen, Ben; Ewers, Rob M; Mace, Georgina M; Purves, Drew W; Scharlemann, Jörn P W; Purvis, Andy

    2014-12-01

    Biodiversity continues to decline in the face of increasing anthropogenic pressures such as habitat destruction, exploitation, pollution and introduction of alien species. Existing global databases of species' threat status or population time series are dominated by charismatic species. The collation of datasets with broad taxonomic and biogeographic extents, and that support computation of a range of biodiversity indicators, is necessary to enable better understanding of historical declines and to project - and avert - future declines. We describe and assess a new database of more than 1.6 million samples from 78 countries representing over 28,000 species, collated from existing spatial comparisons of local-scale biodiversity exposed to different intensities and types of anthropogenic pressures, from terrestrial sites around the world. The database contains measurements taken in 208 (of 814) ecoregions, 13 (of 14) biomes, 25 (of 35) biodiversity hotspots and 16 (of 17) megadiverse countries. The database contains more than 1% of the total number of all species described, and more than 1% of the described species within many taxonomic groups - including flowering plants, gymnosperms, birds, mammals, reptiles, amphibians, beetles, lepidopterans and hymenopterans. The dataset, which is still being added to, is therefore already considerably larger and more representative than those used by previous quantitative models of biodiversity trends and responses. The database is being assembled as part of the PREDICTS project (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems - http://www.predicts.org.uk). We make site-level summary data available alongside this article. The full database will be publicly available in 2015.

  6. Hierarchical Bayesian modelling of mobility metrics for hazard model input calibration

    NASA Astrophysics Data System (ADS)

    Calder, Eliza; Ogburn, Sarah; Spiller, Elaine; Rutarindwa, Regis; Berger, Jim

    2015-04-01

    In this work we present a method to constrain flow mobility input parameters for pyroclastic flow models using hierarchical Bayes modeling of standard mobility metrics such as H/L and flow volume etc. The advantage of hierarchical modeling is that it can leverage the information in global dataset for a particular mobility metric in order to reduce the uncertainty in modeling of an individual volcano, especially important where individual volcanoes have only sparse datasets. We use compiled pyroclastic flow runout data from Colima, Merapi, Soufriere Hills, Unzen and Semeru volcanoes, presented in an open-source database FlowDat (https://vhub.org/groups/massflowdatabase). While the exact relationship between flow volume and friction varies somewhat between volcanoes, dome collapse flows originating from the same volcano exhibit similar mobility relationships. Instead of fitting separate regression models for each volcano dataset, we use a variation of the hierarchical linear model (Kass and Steffey, 1989). The model presents a hierarchical structure with two levels; all dome collapse flows and dome collapse flows at specific volcanoes. The hierarchical model allows us to assume that the flows at specific volcanoes share a common distribution of regression slopes, then solves for that distribution. We present comparisons of the 95% confidence intervals on the individual regression lines for the data set from each volcano as well as those obtained from the hierarchical model. The results clearly demonstrate the advantage of considering global datasets using this technique. The technique developed is demonstrated here for mobility metrics, but can be applied to many other global datasets of volcanic parameters. In particular, such methods can provide a means to better contain parameters for volcanoes for which we only have sparse data, a ubiquitous problem in volcanology.

  7. Watershed Data Management (WDM) Database for Salt Creek Streamflow Simulation, DuPage County, Illinois

    USGS Publications Warehouse

    Murphy, Elizabeth A.; Ishii, Audrey L.

    2006-01-01

    The U.S. Geological Survey (USGS), in cooperation with DuPage County Department of Engineering, Stormwater Management Division, maintains a database of hourly meteorologic and hydrologic data for use in a near real-time streamflow simulation system, which assists in the management and operation of reservoirs and other flood-control structures in the Salt Creek watershed in DuPage County, Illinois. The majority of the precipitation data are collected from a tipping-bucket rain-gage network located in and near DuPage County. The other meteorologic data (wind speed, solar radiation, air temperature, and dewpoint temperature) are collected at Argonne National Laboratory in Argonne, Illinois. Potential evapotranspiration is computed from the meteorologic data. The hydrologic data (discharge and stage) are collected at USGS streamflow-gaging stations in DuPage County. These data are stored in a Watershed Data Management (WDM) database. This report describes a version of the WDM database that was quality-assured and quality-controlled annually to ensure the datasets were complete and accurate. This version of the WDM database contains data from January 1, 1997, through September 30, 2004, and is named SEP04.WDM. This report provides a record of time periods of poor data for each precipitation dataset and describes methods used to estimate the data for the periods when data were missing, flawed, or snowfall-affected. The precipitation dataset data-filling process was changed in 2001, and both processes are described. The other meteorologic and hydrologic datasets in the database are fully described in the annual U.S. Geological Survey Water Data Report for Illinois and, therefore, are described in less detail than the precipitation datasets in this report.

  8. Echinoids of the Kerguelen Plateau - occurrence data and environmental setting for past, present, and future species distribution modelling.

    PubMed

    Guillaumot, Charlène; Martin, Alexis; Fabri-Ruiz, Salomé; Eléaume, Marc; Saucède, Thomas

    2016-01-01

    The present dataset provides a case study for species distribution modelling (SDM) and for model testing in a poorly documented marine region. The dataset includes spatially-explicit data for echinoid (Echinodermata: Echinoidea) distribution. Echinoids were collected during oceanographic campaigns led around the Kerguelen Plateau (+63°/+81°E; -46°/-56°S) since 1872. In addition to the identification of collection specimens from historical cruises, original data from the recent campaigns POKER II (2010) and PROTEKER 2 to 4 (2013-2015) are also provided. In total, five families, ten genera, and 12 echinoid species are recorded in the region of the Kerguelen Plateau. The dataset is complemented with environmental descriptors available and relevant for echinoid ecology and SDM. The environmental data was compiled from different sources and was modified to suit the geographic extent of the Kerguelen Plateau, using scripts developed with the R language (R Core Team 2015). Spatial resolution was set at a common 0.1° pixel resolution. Mean seafloor and sea surface temperatures, salinity and their amplitudes, all derived from the World Ocean Database (Boyer et al. 2013) are made available for the six following decades: 1955-1964, 1965-1974, 1975-1984, 1985-1994, 1995-2004, 2005-2012. Future projections are provided for several parameters: they were modified from the Bio-ORACLE database (Tyberghein et al. 2012). They are based on three IPCC scenarii (B1, AIB, A2) for years 2100 and 2200 (IPCC, 4 th report).

  9. VTCdb: a gene co-expression database for the crop species Vitis vinifera (grapevine).

    PubMed

    Wong, Darren C J; Sweetman, Crystal; Drew, Damian P; Ford, Christopher M

    2013-12-16

    Gene expression datasets in model plants such as Arabidopsis have contributed to our understanding of gene function and how a single underlying biological process can be governed by a diverse network of genes. The accumulation of publicly available microarray data encompassing a wide range of biological and environmental conditions has enabled the development of additional capabilities including gene co-expression analysis (GCA). GCA is based on the understanding that genes encoding proteins involved in similar and/or related biological processes may exhibit comparable expression patterns over a range of experimental conditions, developmental stages and tissues. We present an open access database for the investigation of gene co-expression networks within the cultivated grapevine, Vitis vinifera. The new gene co-expression database, VTCdb (http://vtcdb.adelaide.edu.au/Home.aspx), offers an online platform for transcriptional regulatory inference in the cultivated grapevine. Using condition-independent and condition-dependent approaches, grapevine co-expression networks were constructed using the latest publicly available microarray datasets from diverse experimental series, utilising the Affymetrix Vitis vinifera GeneChip (16 K) and the NimbleGen Grape Whole-genome microarray chip (29 K), thus making it possible to profile approximately 29,000 genes (95% of the predicted grapevine transcriptome). Applications available with the online platform include the use of gene names, probesets, modules or biological processes to query the co-expression networks, with the option to choose between Affymetrix or Nimblegen datasets and between multiple co-expression measures. Alternatively, the user can browse existing network modules using interactive network visualisation and analysis via CytoscapeWeb. To demonstrate the utility of the database, we present examples from three fundamental biological processes (berry development, photosynthesis and flavonoid biosynthesis) whereby the recovered sub-networks reconfirm established plant gene functions and also identify novel associations. Together, we present valuable insights into grapevine transcriptional regulation by developing network models applicable to researchers in their prioritisation of gene candidates, for on-going study of biological processes related to grapevine development, metabolism and stress responses.

  10. Comparing tree foliage biomass models fitted to a multispecies, felled-tree biomass dataset for the United States

    Treesearch

    Brian J. Clough; Matthew B. Russell; Grant M. Domke; Christopher W. Woodall; Philip J. Radtke

    2016-01-01

    tEstimation of live tree biomass is an important task for both forest carbon accounting and studies of nutri-ent dynamics in forest ecosystems. In this study, we took advantage of an extensive felled-tree database(with 2885 foliage biomass observations) to compare different models and grouping schemes based onphylogenetic and geographic variation for predicting foliage...

  11. USDA National Nutrient Database for Standard Reference Dataset for What We Eat in America, NHANES (Survey-SR) 2013-2014

    USDA-ARS?s Scientific Manuscript database

    USDA National Nutrient Database for Standard Reference Dataset for What We Eat In America, NHANES (Survey-SR) provides the nutrient data for assessing dietary intakes from the national survey What We Eat In America, National Health and Nutrition Examination Survey (WWEIA, NHANES). The current versi...

  12. Arabidopsis Gene Family Profiler (aGFP)--user-oriented transcriptomic database with easy-to-use graphic interface.

    PubMed

    Dupl'áková, Nikoleta; Renák, David; Hovanec, Patrik; Honysová, Barbora; Twell, David; Honys, David

    2007-07-23

    Microarray technologies now belong to the standard functional genomics toolbox and have undergone massive development leading to increased genome coverage, accuracy and reliability. The number of experiments exploiting microarray technology has markedly increased in recent years. In parallel with the rapid accumulation of transcriptomic data, on-line analysis tools are being introduced to simplify their use. Global statistical data analysis methods contribute to the development of overall concepts about gene expression patterns and to query and compose working hypotheses. More recently, these applications are being supplemented with more specialized products offering visualization and specific data mining tools. We present a curated gene family-oriented gene expression database, Arabidopsis Gene Family Profiler (aGFP; http://agfp.ueb.cas.cz), which gives the user access to a large collection of normalised Affymetrix ATH1 microarray datasets. The database currently contains NASC Array and AtGenExpress transcriptomic datasets for various tissues at different developmental stages of wild type plants gathered from nearly 350 gene chips. The Arabidopsis GFP database has been designed as an easy-to-use tool for users needing an easily accessible resource for expression data of single genes, pre-defined gene families or custom gene sets, with the further possibility of keyword search. Arabidopsis Gene Family Profiler presents a user-friendly web interface using both graphic and text output. Data are stored at the MySQL server and individual queries are created in PHP script. The most distinguishable features of Arabidopsis Gene Family Profiler database are: 1) the presentation of normalized datasets (Affymetrix MAS algorithm and calculation of model-based gene-expression values based on the Perfect Match-only model); 2) the choice between two different normalization algorithms (Affymetrix MAS4 or MAS5 algorithms); 3) an intuitive interface; 4) an interactive "virtual plant" visualizing the spatial and developmental expression profiles of both gene families and individual genes. Arabidopsis GFP gives users the possibility to analyze current Arabidopsis developmental transcriptomic data starting with simple global queries that can be expanded and further refined to visualize comparative and highly selective gene expression profiles.

  13. Comparative testing of dark matter models with 15 HSB and 15 LSB galaxies

    NASA Astrophysics Data System (ADS)

    Kun, E.; Keresztes, Z.; Simkó, A.; Szűcs, G.; Gergely, L. Á.

    2017-12-01

    Context. We assemble a database of 15 high surface brightness (HSB) and 15 low surface brightness (LSB) galaxies, for which surface brightness density and spectroscopic rotation curve data are both available and representative for various morphologies. We use this dataset to test the Navarro-Frenk-White, the Einasto, and the pseudo-isothermal sphere dark matter models. Aims: We investigate the compatibility of the pure baryonic model and baryonic plus one of the three dark matter models with observations on the assembled galaxy database. When a dark matter component improves the fit with the spectroscopic rotational curve, we rank the models according to the goodness of fit to the datasets. Methods: We constructed the spatial luminosity density of the baryonic component based on the surface brightness profile of the galaxies. We estimated the mass-to-light (M/L) ratio of the stellar component through a previously proposed color-mass-to-light ratio relation (CMLR), which yields stellar masses independent of the photometric band. We assumed an axissymetric baryonic mass model with variable axis ratios together with one of the three dark matter models to provide the theoretical rotational velocity curves, and we compared them with the dataset. In a second attempt, we addressed the question whether the dark component could be replaced by a pure baryonic model with fitted M/L ratios, varied over ranges consistent with CMLR relations derived from the available stellar population models. We employed the Akaike information criterion to establish the performance of the best-fit models. Results: For 7 galaxies (2 HSB and 5 LSB), neither model fits the dataset within the 1σ confidence level. For the other 23 cases, one of the models with dark matter explains the rotation curve data best. According to the Akaike information criterion, the pseudo-isothermal sphere emerges as most favored in 14 cases, followed by the Navarro-Frenk-White (6 cases) and the Einasto (3 cases) dark matter models. We find that the pure baryonic model with fitted M/L ratios falls within the 1σ confidence level for 10 HSB and 2 LSB galaxies, at the price of growing the M/Ls on average by a factor of two, but the fits are inferior compared to the best-fitting dark matter model.

  14. Interdisciplinary Collaboration amongst Colleagues and between Initiatives with the Magnetics Information Consortium (MagIC) Database

    NASA Astrophysics Data System (ADS)

    Minnett, R.; Koppers, A. A. P.; Jarboe, N.; Tauxe, L.; Constable, C.; Jonestrask, L.; Shaar, R.

    2014-12-01

    Earth science grand challenges often require interdisciplinary and geographically distributed scientific collaboration to make significant progress. However, this organic collaboration between researchers, educators, and students only flourishes with the reduction or elimination of technological barriers. The Magnetics Information Consortium (http://earthref.org/MagIC/) is a grass-roots cyberinfrastructure effort envisioned by the geo-, paleo-, and rock magnetic scientific community to archive their wealth of peer-reviewed raw data and interpretations from studies on natural and synthetic samples. MagIC is dedicated to facilitating scientific progress towards several highly multidisciplinary grand challenges and the MagIC Database team is currently beta testing a new MagIC Search Interface and API designed to be flexible enough for the incorporation of large heterogeneous datasets and for horizontal scalability to tens of millions of records and hundreds of requests per second. In an effort to reduce the barriers to effective collaboration, the search interface includes a simplified data model and upload procedure, support for online editing of datasets amongst team members, commenting by reviewers and colleagues, and automated contribution workflows and data retrieval through the API. This web application has been designed to generalize to other databases in MagIC's umbrella website (EarthRef.org) so the Geochemical Earth Reference Model (http://earthref.org/GERM/) portal, Seamount Biogeosciences Network (http://earthref.org/SBN/), EarthRef Digital Archive (http://earthref.org/ERDA/) and EarthRef Reference Database (http://earthref.org/ERR/) will benefit from its development.

  15. User Guidelines for the Brassica Database: BRAD.

    PubMed

    Wang, Xiaobo; Cheng, Feng; Wang, Xiaowu

    2016-01-01

    The genome sequence of Brassica rapa was first released in 2011. Since then, further Brassica genomes have been sequenced or are undergoing sequencing. It is therefore necessary to develop tools that help users to mine information from genomic data efficiently. This will greatly aid scientific exploration and breeding application, especially for those with low levels of bioinformatic training. Therefore, the Brassica database (BRAD) was built to collect, integrate, illustrate, and visualize Brassica genomic datasets. BRAD provides useful searching and data mining tools, and facilitates the search of gene annotation datasets, syntenic or non-syntenic orthologs, and flanking regions of functional genomic elements. It also includes genome-analysis tools such as BLAST and GBrowse. One of the important aims of BRAD is to build a bridge between Brassica crop genomes with the genome of the model species Arabidopsis thaliana, thus transferring the bulk of A. thaliana gene study information for use with newly sequenced Brassica crops.

  16. CELL5M: A geospatial database of agricultural indicators for Africa South of the Sahara.

    PubMed

    Koo, Jawoo; Cox, Cindy M; Bacou, Melanie; Azzarri, Carlo; Guo, Zhe; Wood-Sichra, Ulrike; Gong, Queenie; You, Liangzhi

    2016-01-01

    Recent progress in large-scale georeferenced data collection is widening opportunities for combining multi-disciplinary datasets from biophysical to socioeconomic domains, advancing our analytical and modeling capacity. Granular spatial datasets provide critical information necessary for decision makers to identify target areas, assess baseline conditions, prioritize investment options, set goals and targets and monitor impacts. However, key challenges in reconciling data across themes, scales and borders restrict our capacity to produce global and regional maps and time series. This paper provides overview, structure and coverage of CELL5M-an open-access database of geospatial indicators at 5 arc-minute grid resolution-and introduces a range of analytical applications and case-uses. CELL5M covers a wide set of agriculture-relevant domains for all countries in Africa South of the Sahara and supports our understanding of multi-dimensional spatial variability inherent in farming landscapes throughout the region.

  17. LOD 1 VS. LOD 2 - Preliminary Investigations Into Differences in Mobile Rendering Performance

    NASA Astrophysics Data System (ADS)

    Ellul, C.; Altenbuchner, J.

    2013-09-01

    The increasing availability, size and detail of 3D City Model datasets has led to a challenge when rendering such data on mobile devices. Understanding the limitations to the usability of such models on these devices is particularly important given the broadening range of applications - such as pollution or noise modelling, tourism, planning, solar potential - for which these datasets and resulting visualisations can be utilized. Much 3D City Model data is created by extrusion of 2D topographic datasets, resulting in what is known as Level of Detail (LoD) 1 buildings - with flat roofs. However, in the UK the National Mapping Agency (the Ordnance Survey, OS) is now releasing test datasets to Level of Detail (LoD) 2 - i.e. including roof structures. These datasets are designed to integrate with the LoD 1 datasets provided by the OS, and provide additional detail in particular on larger buildings and in town centres. The availability of such integrated datasets at two different Levels of Detail permits investigation into the impact of the additional roof structures (and hence the display of a more realistic 3D City Model) on rendering performance on a mobile device. This paper describes preliminary work carried out to investigate this issue, for the test area of the city of Sheffield (in the UK Midlands). The data is stored in a 3D spatial database as triangles and then extracted and served as a web-based data stream which is queried by an App developed on the mobile device (using the Android environment, Java and OpenGL for graphics). Initial tests have been carried out on two dataset sizes, for the city centre and a larger area, rendering the data onto a tablet to compare results. Results of 52 seconds for rendering LoD 1 data, and 72 seconds for LoD 1 mixed with LoD 2 data, show that the impact of LoD 2 is significant.

  18. Space Launch System Ascent Static Aerodynamic Database Development

    NASA Technical Reports Server (NTRS)

    Pinier, Jeremy T.; Bennett, David W.; Blevins, John A.; Erickson, Gary E.; Favaregh, Noah M.; Houlden, Heather P.; Tomek, William G.

    2014-01-01

    This paper describes the wind tunnel testing work and data analysis required to characterize the static aerodynamic environment of NASA's Space Launch System (SLS) ascent portion of flight. Scaled models of the SLS have been tested in transonic and supersonic wind tunnels to gather the high fidelity data that is used to build aerodynamic databases. A detailed description of the wind tunnel test that was conducted to produce the latest version of the database is presented, and a representative set of aerodynamic data is shown. The wind tunnel data quality remains very high, however some concerns with wall interference effects through transonic Mach numbers are also discussed. Post-processing and analysis of the wind tunnel dataset are crucial for the development of a formal ascent aerodynamics database.

  19. Implementation of the CUAHSI information system for regional hydrological research and workflow

    NASA Astrophysics Data System (ADS)

    Bugaets, Andrey; Gartsman, Boris; Bugaets, Nadezhda; Krasnopeyev, Sergey; Krasnopeyeva, Tatyana; Sokolov, Oleg; Gonchukov, Leonid

    2013-04-01

    Environmental research and education have become increasingly data-intensive as a result of the proliferation of digital technologies, instrumentation, and pervasive networks through which data are collected, generated, shared, and analyzed. Over the next decade, it is likely that science and engineering research will produce more scientific data than has been created over the whole of human history (Cox et al., 2006). Successful using these data to achieve new scientific breakthroughs depends on the ability to access, organize, integrate, and analyze these large datasets. The new project of PGI FEB RAS (http://tig.dvo.ru), FERHRI (www.ferhri.org) and Primgidromet (www.primgidromet.ru) is focused on creation of an open unified hydrological information system according to the international standards to support hydrological investigation, water management and forecasts systems. Within the hydrologic science community, the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (http://his.cuahsi.org) has been developing a distributed network of data sources and functions that are integrated using web services and that provide access to data, tools, and models that enable synthesis, visualization, and evaluation of hydrologic system behavior. Based on the top of CUAHSI technologies two first template databases were developed for primary datasets of special observations on experimental basins in the Far East Region of Russia. The first database contains data of special observation performed on the former (1957-1994) Primorskaya Water-Balance Station (1500 km2). Measurements were carried out on 20 hydrological and 40 rain gauging station and were published as special series but only as hardcopy books. Database provides raw data from loggers with hourly and daily time support. The second database called «FarEastHydro» provides published standard daily measurement performed at Roshydromet observation network (200 hydrological and meteorological stations) for the period beginning 1930 through 1990. Both of the data resources are maintained in a test mode at the project site http://gis.dvo.ru:81/, which is permanently updated. After first success, the decision was made to use the CUAHSI technology as a basis for development of hydrological information system to support data publishing and workflow of Primgidromet, the regional office of Federal State Hydrometeorological Agency. At the moment, Primgidromet observation network is equipped with 34 automatic SEBA hydrological pressure sensor pneumatic gauges PS-Light-2 and 36 automatic SEBA weather stations. Large datasets generated by sensor networks are organized and stored within a central ODM database which allows to unambiguously interpret the data with sufficient metadata and provides traceable heritage from raw measurements to useable information. Organization of the data within a central CUAHSI ODM database was the most critical step, with several important implications. This technology is widespread and well documented, and it ensures that all datasets are publicly available and readily used by other investigators and developers to support additional analyses and hydrological modeling. Implementation of ODM within a Relational Database Management System eliminates the potential data manipulation errors and intermediate the data processing steps. Wrapping CUAHSI WaterOneFlow web-service into OpenMI 2.0 linkable component (www.openmi.org) allows a seamless integration with well-known hydrological modeling systems.

  20. Interoperable Solar Data and Metadata via LISIRD 3

    NASA Astrophysics Data System (ADS)

    Wilson, A.; Lindholm, D. M.; Pankratz, C. K.; Snow, M. A.; Woods, T. N.

    2015-12-01

    LISIRD 3 is a major upgrade of the LASP Interactive Solar Irradiance Data Center (LISIRD), which serves several dozen space based solar irradiance and related data products to the public. Through interactive plots, LISIRD 3 provides data browsing supported by data subsetting and aggregation. Incorporating a semantically enabled metadata repository, LISIRD 3 users see current, vetted, consistent information about the datasets offered. Users can now also search for datasets based on metadata fields such as dataset type and/or spectral or temporal range. This semantic database enables metadata browsing, so users can discover the relationships between datasets, instruments, spacecraft, mission and PI. The database also enables creation and publication of metadata records in a variety of formats, such as SPASE or ISO, making these datasets more discoverable. The database also enables the possibility of a public SPARQL endpoint, making the metadata browsable in an automated fashion. LISIRD 3's data access middleware, LaTiS, provides dynamic, on demand reformatting of data and timestamps, subsetting and aggregation, and other server side functionality via a RESTful OPeNDAP compliant API, enabling interoperability between LASP datasets and many common tools. LISIRD 3's templated front end design, coupled with the uniform data interface offered by LaTiS, allows easy integration of new datasets. Consequently the number and variety of datasets offered by LISIRD has grown to encompass several dozen, with many more to come. This poster will discuss design and implementation of LISIRD 3, including tools used, capabilities enabled, and issues encountered.

  1. Using Third Party Data to Update a Reference Dataset in a Quality Evaluation Service

    NASA Astrophysics Data System (ADS)

    Xavier, E. M. A.; Ariza-López, F. J.; Ureña-Cámara, M. A.

    2016-06-01

    Nowadays it is easy to find many data sources for various regions around the globe. In this 'data overload' scenario there are few, if any, information available about the quality of these data sources. In order to easily provide these data quality information we presented the architecture of a web service for the automation of quality control of spatial datasets running over a Web Processing Service (WPS). For quality procedures that require an external reference dataset, like positional accuracy or completeness, the architecture permits using a reference dataset. However, this reference dataset is not ageless, since it suffers the natural time degradation inherent to geospatial features. In order to mitigate this problem we propose the Time Degradation & Updating Module which intends to apply assessed data as a tool to maintain the reference database updated. The main idea is to utilize datasets sent to the quality evaluation service as a source of 'candidate data elements' for the updating of the reference database. After the evaluation, if some elements of a candidate dataset reach a determined quality level, they can be used as input data to improve the current reference database. In this work we present the first design of the Time Degradation & Updating Module. We believe that the outcomes can be applied in the search of a full-automatic on-line quality evaluation platform.

  2. Thresholds of Toxicological Concern for cosmetics-related substances: New database, thresholds, and enrichment of chemical space.

    PubMed

    Yang, Chihae; Barlow, Susan M; Muldoon Jacobs, Kristi L; Vitcheva, Vessela; Boobis, Alan R; Felter, Susan P; Arvidson, Kirk B; Keller, Detlef; Cronin, Mark T D; Enoch, Steven; Worth, Andrew; Hollnagel, Heli M

    2017-11-01

    A new dataset of cosmetics-related chemicals for the Threshold of Toxicological Concern (TTC) approach has been compiled, comprising 552 chemicals with 219, 40, and 293 chemicals in Cramer Classes I, II, and III, respectively. Data were integrated and curated to create a database of No-/Lowest-Observed-Adverse-Effect Level (NOAEL/LOAEL) values, from which the final COSMOS TTC dataset was developed. Criteria for study inclusion and NOAEL decisions were defined, and rigorous quality control was performed for study details and assignment of Cramer classes. From the final COSMOS TTC dataset, human exposure thresholds of 42 and 7.9 μg/kg-bw/day were derived for Cramer Classes I and III, respectively. The size of Cramer Class II was insufficient for derivation of a TTC value. The COSMOS TTC dataset was then federated with the dataset of Munro and colleagues, previously published in 1996, after updating the latter using the quality control processes for this project. This federated dataset expands the chemical space and provides more robust thresholds. The 966 substances in the federated database comprise 245, 49 and 672 chemicals in Cramer Classes I, II and III, respectively. The corresponding TTC values of 46, 6.2 and 2.3 μg/kg-bw/day are broadly similar to those of the original Munro dataset. Copyright © 2017 The Authors. Published by Elsevier Ltd.. All rights reserved.

  3. The African Crane Database (1978-2014): Records of three threatened crane species (Family: Gruidae) from southern and eastern Africa.

    PubMed

    Smith, Tanya; Page-Nicholson, Samantha; Morrison, Kerryn; Gibbons, Bradley; Jones, M Genevieve W; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin; Roxburgh, Lizanne

    2016-01-01

    The International Crane Foundation (ICF) / Endangered Wildlife Trust's (EWT) African Crane Conservation Programme has recorded 26 403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus , Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus . The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. This is the single most comprehensive dataset published on African Crane species that adds new information about the distribution of these three threatened species. We hope this will further aid conservation authorities to monitor and protect these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal.

  4. A high resolution spatial population database of Somalia for disease risk mapping.

    PubMed

    Linard, Catherine; Alegana, Victor A; Noor, Abdisalan M; Snow, Robert W; Tatem, Andrew J

    2010-09-14

    Millions of Somali have been deprived of basic health services due to the unstable political situation of their country. Attempts are being made to reconstruct the health sector, in particular to estimate the extent of infectious disease burden. However, any approach that requires the use of modelled disease rates requires reasonable information on population distribution. In a low-income country such as Somalia, population data are lacking, are of poor quality, or become outdated rapidly. Modelling methods are therefore needed for the production of contemporary and spatially detailed population data. Here land cover information derived from satellite imagery and existing settlement point datasets were used for the spatial reallocation of populations within census units. We used simple and semi-automated methods that can be implemented with free image processing software to produce an easily updatable gridded population dataset at 100 × 100 meters spatial resolution. The 2010 population dataset was matched to administrative population totals projected by the UN. Comparison tests between the new dataset and existing population datasets revealed important differences in population size distributions, and in population at risk of malaria estimates. These differences are particularly important in more densely populated areas and strongly depend on the settlement data used in the modelling approach. The results show that it is possible to produce detailed, contemporary and easily updatable settlement and population distribution datasets of Somalia using existing data. The 2010 population dataset produced is freely available as a product of the AfriPop Project and can be downloaded from: http://www.afripop.org.

  5. A high resolution spatial population database of Somalia for disease risk mapping

    PubMed Central

    2010-01-01

    Background Millions of Somali have been deprived of basic health services due to the unstable political situation of their country. Attempts are being made to reconstruct the health sector, in particular to estimate the extent of infectious disease burden. However, any approach that requires the use of modelled disease rates requires reasonable information on population distribution. In a low-income country such as Somalia, population data are lacking, are of poor quality, or become outdated rapidly. Modelling methods are therefore needed for the production of contemporary and spatially detailed population data. Results Here land cover information derived from satellite imagery and existing settlement point datasets were used for the spatial reallocation of populations within census units. We used simple and semi-automated methods that can be implemented with free image processing software to produce an easily updatable gridded population dataset at 100 × 100 meters spatial resolution. The 2010 population dataset was matched to administrative population totals projected by the UN. Comparison tests between the new dataset and existing population datasets revealed important differences in population size distributions, and in population at risk of malaria estimates. These differences are particularly important in more densely populated areas and strongly depend on the settlement data used in the modelling approach. Conclusions The results show that it is possible to produce detailed, contemporary and easily updatable settlement and population distribution datasets of Somalia using existing data. The 2010 population dataset produced is freely available as a product of the AfriPop Project and can be downloaded from: http://www.afripop.org. PMID:20840751

  6. Validation of a common data model for active safety surveillance research

    PubMed Central

    Ryan, Patrick B; Reich, Christian G; Hartzema, Abraham G; Stang, Paul E

    2011-01-01

    Objective Systematic analysis of observational medical databases for active safety surveillance is hindered by the variation in data models and coding systems. Data analysts often find robust clinical data models difficult to understand and ill suited to support their analytic approaches. Further, some models do not facilitate the computations required for systematic analysis across many interventions and outcomes for large datasets. Translating the data from these idiosyncratic data models to a common data model (CDM) could facilitate both the analysts' understanding and the suitability for large-scale systematic analysis. In addition to facilitating analysis, a suitable CDM has to faithfully represent the source observational database. Before beginning to use the Observational Medical Outcomes Partnership (OMOP) CDM and a related dictionary of standardized terminologies for a study of large-scale systematic active safety surveillance, the authors validated the model's suitability for this use by example. Validation by example To validate the OMOP CDM, the model was instantiated into a relational database, data from 10 different observational healthcare databases were loaded into separate instances, a comprehensive array of analytic methods that operate on the data model was created, and these methods were executed against the databases to measure performance. Conclusion There was acceptable representation of the data from 10 observational databases in the OMOP CDM using the standardized terminologies selected, and a range of analytic methods was developed and executed with sufficient performance to be useful for active safety surveillance. PMID:22037893

  7. National Transportation Atlas Databases : 2002

    DOT National Transportation Integrated Search

    2002-01-01

    The National Transportation Atlas Databases 2002 (NTAD2002) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  8. National Transportation Atlas Databases : 2010

    DOT National Transportation Integrated Search

    2010-01-01

    The National Transportation Atlas Databases 2010 (NTAD2010) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  9. National Transportation Atlas Databases : 2006

    DOT National Transportation Integrated Search

    2006-01-01

    The National Transportation Atlas Databases 2006 (NTAD2006) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  10. National Transportation Atlas Databases : 2005

    DOT National Transportation Integrated Search

    2005-01-01

    The National Transportation Atlas Databases 2005 (NTAD2005) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  11. National Transportation Atlas Databases : 2008

    DOT National Transportation Integrated Search

    2008-01-01

    The National Transportation Atlas Databases 2008 (NTAD2008) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  12. National Transportation Atlas Databases : 2003

    DOT National Transportation Integrated Search

    2003-01-01

    The National Transportation Atlas Databases 2003 (NTAD2003) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  13. National Transportation Atlas Databases : 2004

    DOT National Transportation Integrated Search

    2004-01-01

    The National Transportation Atlas Databases 2004 (NTAD2004) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  14. National Transportation Atlas Databases : 2009

    DOT National Transportation Integrated Search

    2009-01-01

    The National Transportation Atlas Databases 2009 (NTAD2009) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  15. National Transportation Atlas Databases : 2007

    DOT National Transportation Integrated Search

    2007-01-01

    The National Transportation Atlas Databases 2007 (NTAD2007) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  16. National Transportation Atlas Databases : 2012

    DOT National Transportation Integrated Search

    2012-01-01

    The National Transportation Atlas Databases 2012 (NTAD2012) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  17. National Transportation Atlas Databases : 2011

    DOT National Transportation Integrated Search

    2011-01-01

    The National Transportation Atlas Databases 2011 (NTAD2011) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...

  18. Global QSAR modeling of logP values of phenethylamines acting as adrenergic alpha-1 receptor agonists.

    PubMed

    Yadav, Mukesh; Joshi, Shobha; Nayarisseri, Anuraj; Jain, Anuja; Hussain, Aabid; Dubey, Tushar

    2013-06-01

    Global QSAR models predict biological response of molecular structures which are generic in particular class. A global QSAR dataset admits structural features derived from larger chemical space, intricate to model but more applicable in medicinal chemistry. The present work is global in either sense of structural diversity in QSAR dataset or large number of descriptor input. Forty phenethylamine structure derivatives were selected from a large pool (904) of similar phenethylamines available in Pubchem database. LogP values of selected candidates were collected from physical properties database (PHYSPROP) determined in identical set of conditions. Attempts to model logP value have produced significant QSAR models. MLR aided linear one-variable and two-variable QSAR models with their respective R(2) (0.866, 0.937), R(2)A (0.862, 0.932), F-stat (181.936, 199.812) and Standard Error (0.365, 0.255) are statistically fit and found predictive after internal validation and external validation. The descriptors chosen after improvisation and optimization reveal mechanistic part of work in terms of Verhaar model of Fish base-line toxicity from MLOGP, i.e. (BLTF96) and 3D-MoRSE -signal 15 /unweighted molecular descriptor calculated by summing atom weights viewed by a different angular scattering function (Mor15u) are crucial in regulation of logP values of phenethylamines.

  19. High-resolution spatial databases of monthly climate variables (1961-2010) over a complex terrain region in southwestern China

    NASA Astrophysics Data System (ADS)

    Wu, Wei; Xu, An-Ding; Liu, Hong-Bin

    2015-01-01

    Climate data in gridded format are critical for understanding climate change and its impact on eco-environment. The aim of the current study is to develop spatial databases for three climate variables (maximum, minimum temperatures, and relative humidity) over a large region with complex topography in southwestern China. Five widely used approaches including inverse distance weighting, ordinary kriging, universal kriging, co-kriging, and thin-plate smoothing spline were tested. Root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) showed that thin-plate smoothing spline with latitude, longitude, and elevation outperformed other models. Average RMSE, MAE, and MAPE of the best models were 1.16 °C, 0.74 °C, and 7.38 % for maximum temperature; 0.826 °C, 0.58 °C, and 6.41 % for minimum temperature; and 3.44, 2.28, and 3.21 % for relative humidity, respectively. Spatial datasets of annual and monthly climate variables with 1-km resolution covering the period 1961-2010 were then obtained using the best performance methods. Comparative study showed that the current outcomes were in well agreement with public datasets. Based on the gridded datasets, changes in temperature variables were investigated across the study area. Future study might be needed to capture the uncertainty induced by environmental conditions through remote sensing and knowledge-based methods.

  20. Systematic chemical-genetic and chemical-chemical interaction datasets for prediction of compound synergism

    PubMed Central

    Wildenhain, Jan; Spitzer, Michaela; Dolma, Sonam; Jarvik, Nick; White, Rachel; Roy, Marcia; Griffiths, Emma; Bellows, David S.; Wright, Gerard D.; Tyers, Mike

    2016-01-01

    The network structure of biological systems suggests that effective therapeutic intervention may require combinations of agents that act synergistically. However, a dearth of systematic chemical combination datasets have limited the development of predictive algorithms for chemical synergism. Here, we report two large datasets of linked chemical-genetic and chemical-chemical interactions in the budding yeast Saccharomyces cerevisiae. We screened 5,518 unique compounds against 242 diverse yeast gene deletion strains to generate an extended chemical-genetic matrix (CGM) of 492,126 chemical-gene interaction measurements. This CGM dataset contained 1,434 genotype-specific inhibitors, termed cryptagens. We selected 128 structurally diverse cryptagens and tested all pairwise combinations to generate a benchmark dataset of 8,128 pairwise chemical-chemical interaction tests for synergy prediction, termed the cryptagen matrix (CM). An accompanying database resource called ChemGRID was developed to enable analysis, visualisation and downloads of all data. The CGM and CM datasets will facilitate the benchmarking of computational approaches for synergy prediction, as well as chemical structure-activity relationship models for anti-fungal drug discovery. PMID:27874849

  1. Using the ToxMiner Database for Identifying Disease-Gene Associations in the ToxCast Dataset

    EPA Science Inventory

    The US EPA ToxCast program is using in vitro, high-throughput screening (HTS) to profile and model the bioactivity of environmental chemicals. The main goal of the ToxCast program is to generate predictive signatures of toxicity that ultimately provide rapid and cost-effective me...

  2. Informatics approach using metabolic reactivity classifiers to link in vitro to in vivo data in application to the ToxCast Phase I dataset

    EPA Science Inventory

    Strategic combinations and tiered application of alternative testing methods to replace or minimize the use of animal models is attracting much attention. With the advancement of high throughput screening (HTS) assays and legacy databases providing in vivo testing results, suffic...

  3. A Web Server and Mobile App for Computing Hemolytic Potency of Peptides.

    PubMed

    Chaudhary, Kumardeep; Kumar, Ritesh; Singh, Sandeep; Tuknait, Abhishek; Gautam, Ankur; Mathur, Deepika; Anand, Priya; Varshney, Grish C; Raghava, Gajendra P S

    2016-03-08

    Numerous therapeutic peptides do not enter the clinical trials just because of their high hemolytic activity. Recently, we developed a database, Hemolytik, for maintaining experimentally validated hemolytic and non-hemolytic peptides. The present study describes a web server and mobile app developed for predicting, and screening of peptides having hemolytic potency. Firstly, we generated a dataset HemoPI-1 that contains 552 hemolytic peptides extracted from Hemolytik database and 552 random non-hemolytic peptides (from Swiss-Prot). The sequence analysis of these peptides revealed that certain residues (e.g., L, K, F, W) and motifs (e.g., "FKK", "LKL", "KKLL", "KWK", "VLK", "CYCR", "CRR", "RFC", "RRR", "LKKL") are more abundant in hemolytic peptides. Therefore, we developed models for discriminating hemolytic and non-hemolytic peptides using various machine learning techniques and achieved more than 95% accuracy. We also developed models for discriminating peptides having high and low hemolytic potential on different datasets called HemoPI-2 and HemoPI-3. In order to serve the scientific community, we developed a web server, mobile app and JAVA-based standalone software (http://crdd.osdd.net/raghava/hemopi/).

  4. VIEWCACHE: An incremental pointer-based access method for autonomous interoperable databases

    NASA Technical Reports Server (NTRS)

    Roussopoulos, N.; Sellis, Timos

    1993-01-01

    One of the biggest problems facing NASA today is to provide scientists efficient access to a large number of distributed databases. Our pointer-based incremental data base access method, VIEWCACHE, provides such an interface for accessing distributed datasets and directories. VIEWCACHE allows database browsing and search performing inter-database cross-referencing with no actual data movement between database sites. This organization and processing is especially suitable for managing Astrophysics databases which are physically distributed all over the world. Once the search is complete, the set of collected pointers pointing to the desired data are cached. VIEWCACHE includes spatial access methods for accessing image datasets, which provide much easier query formulation by referring directly to the image and very efficient search for objects contained within a two-dimensional window. We will develop and optimize a VIEWCACHE External Gateway Access to database management systems to facilitate database search.

  5. Quantitative comparison of microarray experiments with published leukemia related gene expression signatures.

    PubMed

    Klein, Hans-Ulrich; Ruckert, Christian; Kohlmann, Alexander; Bullinger, Lars; Thiede, Christian; Haferlach, Torsten; Dugas, Martin

    2009-12-15

    Multiple gene expression signatures derived from microarray experiments have been published in the field of leukemia research. A comparison of these signatures with results from new experiments is useful for verification as well as for interpretation of the results obtained. Currently, the percentage of overlapping genes is frequently used to compare published gene signatures against a signature derived from a new experiment. However, it has been shown that the percentage of overlapping genes is of limited use for comparing two experiments due to the variability of gene signatures caused by different array platforms or assay-specific influencing parameters. Here, we present a robust approach for a systematic and quantitative comparison of published gene expression signatures with an exemplary query dataset. A database storing 138 leukemia-related published gene signatures was designed. Each gene signature was manually annotated with terms according to a leukemia-specific taxonomy. Two analysis steps are implemented to compare a new microarray dataset with the results from previous experiments stored and curated in the database. First, the global test method is applied to assess gene signatures and to constitute a ranking among them. In a subsequent analysis step, the focus is shifted from single gene signatures to chromosomal aberrations or molecular mutations as modeled in the taxonomy. Potentially interesting disease characteristics are detected based on the ranking of gene signatures associated with these aberrations stored in the database. Two example analyses are presented. An implementation of the approach is freely available as web-based application. The presented approach helps researchers to systematically integrate the knowledge derived from numerous microarray experiments into the analysis of a new dataset. By means of example leukemia datasets we demonstrate that this approach detects related experiments as well as related molecular mutations and may help to interpret new microarray data.

  6. The AMMA information system

    NASA Astrophysics Data System (ADS)

    Brissebrat, Guillaume; Fleury, Laurence; Boichard, Jean-Luc; Cloché, Sophie; Eymard, Laurence; Mastrorillo, Laurence; Moulaye, Oumarou; Ramage, Karim; Asencio, Nicole; Favot, Florence; Roussot, Odile

    2013-04-01

    The AMMA information system aims at expediting data and scientific results communication inside the AMMA community and beyond. It has already been adopted as the data management system by several projects and is meant to become a reference information system about West Africa area for the whole scientific community. The AMMA database and the associated on line tools have been developed and are managed by two French teams (IPSL Database Centre, Palaiseau and OMP Data Service, Toulouse). The complete system has been fully duplicated and is operated by AGRHYMET Regional Centre in Niamey, Niger. The AMMA database contains a wide variety of datasets: - about 250 local observation datasets, that cover geophysical components (atmosphere, ocean, soil, vegetation) and human activities (agronomy, health...) They come from either operational networks or scientific experiments, and include historical data in West Africa from 1850; - 1350 outputs of a socio-economics questionnaire; - 60 operational satellite products and several research products; - 10 output sets of meteorological and ocean operational models and 15 of research simulations. Database users can access all the data using either the portal http://database.amma-international.org or http://amma.agrhymet.ne/amma-data. Different modules are available. The complete catalogue enables to access metadata (i.e. information about the datasets) that are compliant with the international standards (ISO19115, INSPIRE...). Registration pages enable to read and sign the data and publication policy, and to apply for a user database account. The data access interface enables to easily build a data extraction request by selecting various criteria like location, time, parameters... At present, the AMMA database counts more than 740 registered users and process about 80 data requests every month In order to monitor day-to-day meteorological and environment information over West Africa, some quick look and report display websites have been developed. They met the operational needs for the observational teams during the AMMA 2006 (http://aoc.amma-international.org) and FENNEC 2011 (http://fenoc.sedoo.fr) campaigns. But they also enable scientific teams to share physical indices along the monsoon season (http://misva.sedoo.fr from 2011). A collaborative WIKINDX tool has been set on line in order to manage scientific publications and communications of interest to AMMA (http://biblio.amma-international.org). Now the bibliographic database counts about 1200 references. It is the most exhaustive document collection about African Monsoon available for all. Every scientist is invited to make use of the different AMMA on line tools and data. Scientists or project leaders who have data management needs for existing or future datasets over West Africa are welcome to use the AMMA database framework and to contact ammaAdmin@sedoo.fr .

  7. Database resources of the National Center for Biotechnology Information

    PubMed Central

    Wheeler, David L.; Barrett, Tanya; Benson, Dennis A.; Bryant, Stephen H.; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M.; DiCuccio, Michael; Edgar, Ron; Federhen, Scott; Geer, Lewis Y.; Helmberg, Wolfgang; Kapustin, Yuri; Kenton, David L.; Khovayko, Oleg; Lipman, David J.; Madden, Thomas L.; Maglott, Donna R.; Ostell, James; Pruitt, Kim D.; Schuler, Gregory D.; Schriml, Lynn M.; Sequeira, Edwin; Sherry, Stephen T.; Sirotkin, Karl; Souvorov, Alexandre; Starchenko, Grigory; Suzek, Tugba O.; Tatusov, Roman; Tatusova, Tatiana A.; Wagner, Lukas; Yaschenko, Eugene

    2006-01-01

    In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups, Retroviral Genotyping Tools, HIV-1, Human Protein Interaction Database, SAGEmap, Gene Expression Omnibus, Entrez Probe, GENSAT, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized datasets. All of the resources can be accessed through the NCBI home page at: . PMID:16381840

  8. SCPortalen: human and mouse single-cell centric database

    PubMed Central

    Noguchi, Shuhei; Böttcher, Michael; Hasegawa, Akira; Kouno, Tsukasa; Kato, Sachi; Tada, Yuhki; Ura, Hiroki; Abe, Kuniya; Shin, Jay W; Plessy, Charles; Carninci, Piero

    2018-01-01

    Abstract Published single-cell datasets are rich resources for investigators who want to address questions not originally asked by the creators of the datasets. The single-cell datasets might be obtained by different protocols and diverse analysis strategies. The main challenge in utilizing such single-cell data is how we can make the various large-scale datasets to be comparable and reusable in a different context. To challenge this issue, we developed the single-cell centric database ‘SCPortalen’ (http://single-cell.clst.riken.jp/). The current version of the database covers human and mouse single-cell transcriptomics datasets that are publicly available from the INSDC sites. The original metadata was manually curated and single-cell samples were annotated with standard ontology terms. Following that, common quality assessment procedures were conducted to check the quality of the raw sequence. Furthermore, primary data processing of the raw data followed by advanced analyses and interpretation have been performed from scratch using our pipeline. In addition to the transcriptomics data, SCPortalen provides access to single-cell image files whenever available. The target users of SCPortalen are all researchers interested in specific cell types or population heterogeneity. Through the web interface of SCPortalen users are easily able to search, explore and download the single-cell datasets of their interests. PMID:29045713

  9. De novo transcriptome assembly databases for the butterfly orchid Phalaenopsis equestris

    PubMed Central

    Niu, Shan-Ce; Xu, Qing; Zhang, Guo-Qiang; Zhang, Yong-Qiang; Tsai, Wen-Chieh; Hsu, Jui-Ling; Liang, Chieh-Kai; Luo, Yi-Bo; Liu, Zhong-Jian

    2016-01-01

    Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the analysed tissue characteristics and to enrich the available data for P. equestris. Here, we present three databases. The first dataset is the RNA-Seq raw reads, which can be used to execute new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third dataset consists of the annotation results of the aligned unigenes versus the Nonredundant (Nr) protein database, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases with low e-values, enabling a name-based search. PMID:27673730

  10. A data model and database for high-resolution pathology analytical image informatics.

    PubMed

    Wang, Fusheng; Kong, Jun; Cooper, Lee; Pan, Tony; Kurc, Tahsin; Chen, Wenjin; Sharma, Ashish; Niedermayr, Cristobal; Oh, Tae W; Brat, Daniel; Farris, Alton B; Foran, David J; Saltz, Joel

    2011-01-01

    The systematic analysis of imaged pathology specimens often results in a vast amount of morphological information at both the cellular and sub-cellular scales. While microscopy scanners and computerized analysis are capable of capturing and analyzing data rapidly, microscopy image data remain underutilized in research and clinical settings. One major obstacle which tends to reduce wider adoption of these new technologies throughout the clinical and scientific communities is the challenge of managing, querying, and integrating the vast amounts of data resulting from the analysis of large digital pathology datasets. This paper presents a data model, which addresses these challenges, and demonstrates its implementation in a relational database system. This paper describes a data model, referred to as Pathology Analytic Imaging Standards (PAIS), and a database implementation, which are designed to support the data management and query requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines on whole-slide images and tissue microarrays (TMAs). (1) Development of a data model capable of efficiently representing and storing virtual slide related image, annotation, markup, and feature information. (2) Development of a database, based on the data model, capable of supporting queries for data retrieval based on analysis and image metadata, queries for comparison of results from different analyses, and spatial queries on segmented regions, features, and classified objects. The work described in this paper is motivated by the challenges associated with characterization of micro-scale features for comparative and correlative analyses involving whole-slides tissue images and TMAs. Technologies for digitizing tissues have advanced significantly in the past decade. Slide scanners are capable of producing high-magnification, high-resolution images from whole slides and TMAs within several minutes. Hence, it is becoming increasingly feasible for basic, clinical, and translational research studies to produce thousands of whole-slide images. Systematic analysis of these large datasets requires efficient data management support for representing and indexing results from hundreds of interrelated analyses generating very large volumes of quantifications such as shape and texture and of classifications of the quantified features. We have designed a data model and a database to address the data management requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines. The data model represents virtual slide related image, annotation, markup and feature information. The database supports a wide range of metadata and spatial queries on images, annotations, markups, and features. We currently have three databases running on a Dell PowerEdge T410 server with CentOS 5.5 Linux operating system. The database server is IBM DB2 Enterprise Edition 9.7.2. The set of databases consists of 1) a TMA database containing image analysis results from 4740 cases of breast cancer, with 641 MB storage size; 2) an algorithm validation database, which stores markups and annotations from two segmentation algorithms and two parameter sets on 18 selected slides, with 66 GB storage size; and 3) an in silico brain tumor study database comprising results from 307 TCGA slides, with 365 GB storage size. The latter two databases also contain human-generated annotations and markups for regions and nuclei. Modeling and managing pathology image analysis results in a database provide immediate benefits on the value and usability of data in a research study. The database provides powerful query capabilities, which are otherwise difficult or cumbersome to support by other approaches such as programming languages. Standardized, semantic annotated data representation and interfaces also make it possible to more efficiently share image data and analysis results.

  11. Progress connecting multi-disciplinary geoscience communities through the VIVO semantic web application

    NASA Astrophysics Data System (ADS)

    Gross, M. B.; Mayernik, M. S.; Rowan, L. R.; Khan, H.; Boler, F. M.; Maull, K. E.; Stott, D.; Williams, S.; Corson-Rikert, J.; Johns, E. M.; Daniels, M. D.; Krafft, D. B.

    2015-12-01

    UNAVCO, UCAR, and Cornell University are working together to leverage semantic web technologies to enable discovery of people, datasets, publications and other research products, as well as the connections between them. The EarthCollab project, an EarthCube Building Block, is enhancing an existing open-source semantic web application, VIVO, to address connectivity gaps across distributed networks of researchers and resources related to the following two geoscience-based communities: (1) the Bering Sea Project, an interdisciplinary field program whose data archive is hosted by NCAR's Earth Observing Laboratory (EOL), and (2) UNAVCO, a geodetic facility and consortium that supports diverse research projects informed by geodesy. People, publications, datasets and grant information have been mapped to an extended version of the VIVO-ISF ontology and ingested into VIVO's database. Data is ingested using a custom set of scripts that include the ability to perform basic automated and curated disambiguation. VIVO can display a page for every object ingested, including connections to other objects in the VIVO database. A dataset page, for example, includes the dataset type, time interval, DOI, related publications, and authors. The dataset type field provides a connection to all other datasets of the same type. The author's page will show, among other information, related datasets and co-authors. Information previously spread across several unconnected databases is now stored in a single location. In addition to VIVO's default display, the new database can also be queried using SPARQL, a query language for semantic data. EarthCollab will also extend the VIVO web application. One such extension is the ability to cross-link separate VIVO instances across institutions, allowing local display of externally curated information. For example, Cornell's VIVO faculty pages will display UNAVCO's dataset information and UNAVCO's VIVO will display Cornell faculty member contact and position information. Additional extensions, including enhanced geospatial capabilities, will be developed following task-centered usability testing.

  12. A database for the analysis of immunity genes in Drosophila: PADMA database.

    PubMed

    Lee, Mark J; Mondal, Ariful; Small, Chiyedza; Paddibhatla, Indira; Kawaguchi, Akira; Govind, Shubha

    2011-01-01

    While microarray experiments generate voluminous data, discerning trends that support an existing or alternative paradigm is challenging. To synergize hypothesis building and testing, we designed the Pathogen Associated Drosophila MicroArray (PADMA) database for easy retrieval and comparison of microarray results from immunity-related experiments (www.padmadatabase.org). PADMA also allows biologists to upload their microarray-results and compare it with datasets housed within PADMA. We tested PADMA using a preliminary dataset from Ganaspis xanthopoda-infected fly larvae, and uncovered unexpected trends in gene expression, reshaping our hypothesis. Thus, the PADMA database will be a useful resource to fly researchers to evaluate, revise, and refine hypotheses.

  13. A comparison of database systems for XML-type data.

    PubMed

    Risse, Judith E; Leunissen, Jack A M

    2010-01-01

    In the field of bioinformatics interchangeable data formats based on XML are widely used. XML-type data is also at the core of most web services. With the increasing amount of data stored in XML comes the need for storing and accessing the data. In this paper we analyse the suitability of different database systems for storing and querying large datasets in general and Medline in particular. All reviewed database systems perform well when tested with small to medium sized datasets, however when the full Medline dataset is queried a large variation in query times is observed. There is not one system that is vastly superior to the others in this comparison and, depending on the database size and the query requirements, different systems are most suitable. The best all-round solution is the Oracle 11~g database system using the new binary storage option. Alias-i's Lingpipe is a more lightweight, customizable and sufficiently fast solution. It does however require more initial configuration steps. For data with a changing XML structure Sedna and BaseX as native XML database systems or MySQL with an XML-type column are suitable.

  14. Documentation of the U.S. Geological Survey Stress and Sediment Mobility Database

    USGS Publications Warehouse

    Dalyander, P. Soupy; Butman, Bradford; Sherwood, Christopher R.; Signell, Richard P.

    2012-01-01

    The U.S. Geological Survey Sea Floor Stress and Sediment Mobility Database contains estimates of bottom stress and sediment mobility for the U.S. continental shelf. This U.S. Geological Survey database provides information that is needed to characterize sea floor ecosystems and evaluate areas for human use. The estimates contained in the database are designed to spatially and seasonally resolve the general characteristics of bottom stress over the U.S. continental shelf and to estimate sea floor mobility by comparing critical stress thresholds based on observed sediment texture data to the modeled stress. This report describes the methods used to make the bottom stress and mobility estimates, statistics used to characterize stress and mobility, data validation procedures, and the metadata for each dataset and provides information on how to access the database online.

  15. A novel approach for incremental uncertainty rule generation from databases with missing values handling: application to dynamic medical databases.

    PubMed

    Konias, Sokratis; Chouvarda, Ioanna; Vlahavas, Ioannis; Maglaveras, Nicos

    2005-09-01

    Current approaches for mining association rules usually assume that the mining is performed in a static database, where the problem of missing attribute values does not practically exist. However, these assumptions are not preserved in some medical databases, like in a home care system. In this paper, a novel uncertainty rule algorithm is illustrated, namely URG-2 (Uncertainty Rule Generator), which addresses the problem of mining dynamic databases containing missing values. This algorithm requires only one pass from the initial dataset in order to generate the item set, while new metrics corresponding to the notion of Support and Confidence are used. URG-2 was evaluated over two medical databases, introducing randomly multiple missing values for each record's attribute (rate: 5-20% by 5% increments) in the initial dataset. Compared with the classical approach (records with missing values are ignored), the proposed algorithm was more robust in mining rules from datasets containing missing values. In all cases, the difference in preserving the initial rules ranged between 30% and 60% in favour of URG-2. Moreover, due to its incremental nature, URG-2 saved over 90% of the time required for thorough re-mining. Thus, the proposed algorithm can offer a preferable solution for mining in dynamic relational databases.

  16. A computational model for predicting integrase catalytic domain of retrovirus.

    PubMed

    Wu, Sijia; Han, Jiuqiang; Zhang, Xinman; Zhong, Dexing; Liu, Ruiling

    2017-06-21

    Integrase catalytic domain (ICD) is an essential part in the retrovirus for integration reaction, which enables its newly synthesized DNA to be incorporated into the DNA of infected cells. Owing to the crucial role of ICD for the retroviral replication and the absence of an equivalent of integrase in host cells, it is comprehensible that ICD is a promising drug target for therapeutic intervention. However, annotated ICDs in UniProtKB database have still been insufficient for a good understanding of their statistical characteristics so far. Accordingly, it is of great importance to put forward a computational ICD model in this work to annotate these domains in the retroviruses. The proposed model then discovered 11,660 new putative ICDs after scanning sequences without ICD annotations. Subsequently in order to provide much confidence in ICD prediction, it was tested under different cross-validation methods, compared with other database search tools, and verified on independent datasets. Furthermore, an evolutionary analysis performed on the annotated ICDs of retroviruses revealed a tight connection between ICD and retroviral classification. All the datasets involved in this paper and the application software tool of this model can be available for free download at https://sourceforge.net/projects/icdtool/files/?source=navbar. Copyright © 2017 Elsevier Ltd. All rights reserved.

  17. Creating a Coastal National Elevation Database (CoNED) for science and conservation applications

    USGS Publications Warehouse

    Thatcher, Cindy A.; Brock, John C.; Danielson, Jeffrey J.; Poppenga, Sandra K.; Gesch, Dean B.; Palaseanu-Lovejoy, Monica; Barras, John; Evans, Gayla A.; Gibbs, Ann

    2016-01-01

    The U.S. Geological Survey is creating the Coastal National Elevation Database, an expanding set of topobathymetric elevation models that extend seamlessly across coastal regions of high societal or ecological significance in the United States that are undergoing rapid change or are threatened by inundation hazards. Topobathymetric elevation models are raster datasets useful for inundation prediction and other earth science applications, such as the development of sediment-transport and storm surge models. These topobathymetric elevation models are being constructed by the broad regional assimilation of numerous topographic and bathymetric datasets, and are intended to fulfill the pressing needs of decision makers establishing policies for hazard mitigation and emergency preparedness, coastal managers tasked with coastal planning compatible with predictions of inundation due to sea-level rise, and scientists investigating processes of coastal geomorphic change. A key priority of this coastal elevation mapping effort is to foster collaborative lidar acquisitions that meet the standards of the USGS National Geospatial Program's 3D Elevation Program, a nationwide initiative to systematically collect high-quality elevation data. The focus regions are located in highly dynamic environments, for example in areas subject to shoreline change, rapid wetland loss, hurricane impacts such as overwash and wave scouring, and/or human-induced changes to coastal topography.

  18. Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

    PubMed

    Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin

    2016-01-01

    First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All Data are available as described in the supplementary material.

  19. Reporting to Improve Reproducibility and Facilitate Validity Assessment for Healthcare Database Studies V1.0.

    PubMed

    Wang, Shirley V; Schneeweiss, Sebastian; Berger, Marc L; Brown, Jeffrey; de Vries, Frank; Douglas, Ian; Gagne, Joshua J; Gini, Rosa; Klungel, Olaf; Mullins, C Daniel; Nguyen, Michael D; Rassen, Jeremy A; Smeeth, Liam; Sturkenboom, Miriam

    2017-09-01

    Defining a study population and creating an analytic dataset from longitudinal healthcare databases involves many decisions. Our objective was to catalogue scientific decisions underpinning study execution that should be reported to facilitate replication and enable assessment of validity of studies conducted in large healthcare databases. We reviewed key investigator decisions required to operate a sample of macros and software tools designed to create and analyze analytic cohorts from longitudinal streams of healthcare data. A panel of academic, regulatory, and industry experts in healthcare database analytics discussed and added to this list. Evidence generated from large healthcare encounter and reimbursement databases is increasingly being sought by decision-makers. Varied terminology is used around the world for the same concepts. Agreeing on terminology and which parameters from a large catalogue are the most essential to report for replicable research would improve transparency and facilitate assessment of validity. At a minimum, reporting for a database study should provide clarity regarding operational definitions for key temporal anchors and their relation to each other when creating the analytic dataset, accompanied by an attrition table and a design diagram. A substantial improvement in reproducibility, rigor and confidence in real world evidence generated from healthcare databases could be achieved with greater transparency about operational study parameters used to create analytic datasets from longitudinal healthcare databases. © 2017 The Authors. Pharmacoepidemiology & Drug Safety Published by John Wiley & Sons Ltd.

  20. Isfahan MISP Dataset.

    PubMed

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database managing. The website was entitled "biosigdata.com." It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while maintaining their privacies (citation and fee). Commenting was also available for all datasets, and automatic sitemap and semi-automatic SEO indexing have been set for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).

  1. Datasets related to in-land water for limnology and remote sensing applications: distance-to-land, distance-to-water, water-body identifier and lake-centre co-ordinates.

    PubMed

    Carrea, Laura; Embury, Owen; Merchant, Christopher J

    2015-11-01

    Datasets containing information to locate and identify water bodies have been generated from data locating static-water-bodies with resolution of about 300 m (1/360 ∘ ) recently released by the Land Cover Climate Change Initiative (LC CCI) of the European Space Agency. The LC CCI water-bodies dataset has been obtained from multi-temporal metrics based on time series of the backscattered intensity recorded by ASAR on Envisat between 2005 and 2010. The new derived datasets provide coherently: distance to land, distance to water, water-body identifiers and lake-centre locations. The water-body identifier dataset locates the water bodies assigning the identifiers of the Global Lakes and Wetlands Database (GLWD), and lake centres are defined for in-land waters for which GLWD IDs were determined. The new datasets therefore link recent lake/reservoir/wetlands extent to the GLWD, together with a set of coordinates which locates unambiguously the water bodies in the database. Information on distance-to-land for each water cell and the distance-to-water for each land cell has many potential applications in remote sensing, where the applicability of geophysical retrieval algorithms may be affected by the presence of water or land within a satellite field of view (image pixel). During the generation and validation of the datasets some limitations of the GLWD database and of the LC CCI water-bodies mask have been found. Some examples of the inaccuracies/limitations are presented and discussed. Temporal change in water-body extent is common. Future versions of the LC CCI dataset are planned to represent temporal variation, and this will permit these derived datasets to be updated.

  2. The Northern Circumpolar Soil Carbon Database: spatially distributed datasets of soil coverage and soil carbon storage in the northern permafrost regions

    NASA Astrophysics Data System (ADS)

    Hugelius, G.; Tarnocai, C.; Broll, G.; Canadell, J. G.; Kuhry, P.; Swanson, D. K.

    2012-08-01

    High latitude terrestrial ecosystems are key components in the global carbon (C) cycle. Estimates of global soil organic carbon (SOC), however, do not include updated estimates of SOC storage in permafrost-affected soils or representation of the unique pedogenic processes that affect these soils. The Northern Circumpolar Soil Carbon Database (NCSCD) was developed to quantify the SOC stocks in the circumpolar permafrost region (18.7 × 106 km2). The NCSCD is a polygon-based digital database compiled from harmonized regional soil classification maps in which data on soil order coverage has been linked to pedon data (n = 1647) from the northern permafrost regions to calculate SOC content and mass. In addition, new gridded datasets at different spatial resolutions have been generated to facilitate research applications using the NCSCD (standard raster formats for use in Geographic Information Systems and Network Common Data Form files common for applications in numerical models). This paper describes the compilation of the NCSCD spatial framework, the soil sampling and soil analyses procedures used to derive SOC content in pedons from North America and Eurasia and the formatting of the digital files that are available online. The potential applications and limitations of the NCSCD in spatial analyses are also discussed. The database has the doi:10.5879/ecds/00000001. An open access data-portal with all the described GIS-datasets is available online at: http://dev1.geo.su.se/bbcc/dev/ncscd/.

  3. The Northern Circumpolar Soil Carbon Database: spatially distributed datasets of soil coverage and soil carbon storage in the northern permafrost regions

    NASA Astrophysics Data System (ADS)

    Hugelius, G.; Tarnocai, C.; Broll, G.; Canadell, J. G.; Kuhry, P.; Swanson, D. K.

    2013-01-01

    High-latitude terrestrial ecosystems are key components in the global carbon (C) cycle. Estimates of global soil organic carbon (SOC), however, do not include updated estimates of SOC storage in permafrost-affected soils or representation of the unique pedogenic processes that affect these soils. The Northern Circumpolar Soil Carbon Database (NCSCD) was developed to quantify the SOC stocks in the circumpolar permafrost region (18.7 × 106 km2). The NCSCD is a polygon-based digital database compiled from harmonized regional soil classification maps in which data on soil order coverage have been linked to pedon data (n = 1778) from the northern permafrost regions to calculate SOC content and mass. In addition, new gridded datasets at different spatial resolutions have been generated to facilitate research applications using the NCSCD (standard raster formats for use in geographic information systems and Network Common Data Form files common for applications in numerical models). This paper describes the compilation of the NCSCD spatial framework, the soil sampling and soil analytical procedures used to derive SOC content in pedons from North America and Eurasia and the formatting of the digital files that are available online. The potential applications and limitations of the NCSCD in spatial analyses are also discussed. The database has the doi:10.5879/ecds/00000001. An open access data portal with all the described GIS-datasets is available online at: http://www.bbcc.su.se/data/ncscd/.

  4. A database on the distribution of butterflies (Lepidoptera) in northern Belgium (Flanders and the Brussels Capital Region)

    PubMed Central

    Maes, Dirk; Vanreusel, Wouter; Herremans, Marc; Vantieghem, Pieter; Brosens, Dimitri; Gielen, Karin; Beck, Olivier; Van Dyck, Hans; Desmet, Peter; Natuurpunt, Vlinderwerkgroep

    2016-01-01

    Abstract In this data paper, we describe two datasets derived from two sources, which collectively represent the most complete overview of butterflies in Flanders and the Brussels Capital Region (northern Belgium). The first dataset (further referred to as the INBO dataset – http://doi.org/10.15468/njgbmh) contains 761,660 records of 70 species and is compiled by the Research Institute for Nature and Forest (INBO) in cooperation with the Butterfly working group of Natuurpunt (Vlinderwerkgroep). It is derived from the database Vlinderdatabank at the INBO, which consists of (historical) collection and literature data (1830-2001), for which all butterfly specimens in institutional and available personal collections were digitized and all entomological and other relevant publications were checked for butterfly distribution data. It also contains observations and monitoring data for the period 1991-2014. The latter type were collected by a (small) butterfly monitoring network where butterflies were recorded using a standardized protocol. The second dataset (further referred to as the Natuurpunt dataset – http://doi.org/10.15468/ezfbee) contains 612,934 records of 63 species and is derived from the database http://waarnemingen.be, hosted at the nature conservation NGO Natuurpunt in collaboration with Stichting Natuurinformatie. This dataset contains butterfly observations by volunteers (citizen scientists), mainly since 2008. Together, these datasets currently contain a total of 1,374,594 records, which are georeferenced using the centroid of their respective 5 × 5 km² Universal Transverse Mercator (UTM) grid cell. Both datasets are published as open data and are available through the Global Biodiversity Information Facility (GBIF). PMID:27199606

  5. A database on the distribution of butterflies (Lepidoptera) in northern Belgium (Flanders and the Brussels Capital Region).

    PubMed

    Maes, Dirk; Vanreusel, Wouter; Herremans, Marc; Vantieghem, Pieter; Brosens, Dimitri; Gielen, Karin; Beck, Olivier; Van Dyck, Hans; Desmet, Peter; Natuurpunt, Vlinderwerkgroep

    2016-01-01

    In this data paper, we describe two datasets derived from two sources, which collectively represent the most complete overview of butterflies in Flanders and the Brussels Capital Region (northern Belgium). The first dataset (further referred to as the INBO dataset - http://doi.org/10.15468/njgbmh) contains 761,660 records of 70 species and is compiled by the Research Institute for Nature and Forest (INBO) in cooperation with the Butterfly working group of Natuurpunt (Vlinderwerkgroep). It is derived from the database Vlinderdatabank at the INBO, which consists of (historical) collection and literature data (1830-2001), for which all butterfly specimens in institutional and available personal collections were digitized and all entomological and other relevant publications were checked for butterfly distribution data. It also contains observations and monitoring data for the period 1991-2014. The latter type were collected by a (small) butterfly monitoring network where butterflies were recorded using a standardized protocol. The second dataset (further referred to as the Natuurpunt dataset - http://doi.org/10.15468/ezfbee) contains 612,934 records of 63 species and is derived from the database http://waarnemingen.be, hosted at the nature conservation NGO Natuurpunt in collaboration with Stichting Natuurinformatie. This dataset contains butterfly observations by volunteers (citizen scientists), mainly since 2008. Together, these datasets currently contain a total of 1,374,594 records, which are georeferenced using the centroid of their respective 5 × 5 km² Universal Transverse Mercator (UTM) grid cell. Both datasets are published as open data and are available through the Global Biodiversity Information Facility (GBIF).

  6. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

    PubMed Central

    Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene

    2018-01-01

    Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie PMID:29688379

  7. Antarctic and Sub-Antarctic Asteroidea database.

    PubMed

    Moreau, Camille; Mah, Christopher; Agüera, Antonio; Améziane, Nadia; David Barnes; Crokaert, Guillaume; Eléaume, Marc; Griffiths, Huw; Charlène Guillaumot; Hemery, Lenaïg G; Jażdżewska, Anna; Quentin Jossart; Vladimir Laptikhovsky; Linse, Katrin; Neill, Kate; Sands, Chester; Thomas Saucède; Schiaparelli, Stefano; Siciński, Jacek; Vasset, Noémie; Bruno Danis

    2018-01-01

    The present dataset is a compilation of georeferenced occurrences of asteroids (Echinodermata: Asteroidea) in the Southern Ocean. Occurrence data south of 45°S latitude were mined from various sources together with information regarding the taxonomy, the sampling source and sampling sites when available. Records from 1872 to 2016 were thoroughly checked to ensure the quality of a dataset that reaches a total of 13,840 occurrences from 4,580 unique sampling events. Information regarding the reproductive strategy (brooders vs. broadcasters) of 63 species is also made available. This dataset represents the most exhaustive occurrence database on Antarctic and Sub-Antarctic asteroids.

  8. Trust, but verify: On the importance of chemical structure curation in cheminformatics and QSAR modeling research

    PubMed Central

    Fourches, Denis; Muratov, Eugene; Tropsha, Alexander

    2010-01-01

    Molecular modelers and cheminformaticians typically analyze experimental data generated by other scientists. Consequently, when it comes to data accuracy, cheminformaticians are always at the mercy of data providers who may inadvertently publish (partially) erroneous data. Thus, dataset curation is crucial for any cheminformatics analysis such as similarity searching, clustering, QSAR modeling, virtual screening, etc., especially nowadays when the availability of chemical datasets in public domain has skyrocketed in recent years. Despite the obvious importance of this preliminary step in the computational analysis of any dataset, there appears to be no commonly accepted guidance or set of procedures for chemical data curation. The main objective of this paper is to emphasize the need for a standardized chemical data curation strategy that should be followed at the onset of any molecular modeling investigation. Herein, we discuss several simple but important steps for cleaning chemical records in a database including the removal of a fraction of the data that cannot be appropriately handled by conventional cheminformatics techniques. Such steps include the removal of inorganic and organometallic compounds, counterions, salts and mixtures; structure validation; ring aromatization; normalization of specific chemotypes; curation of tautomeric forms; and the deletion of duplicates. To emphasize the importance of data curation as a mandatory step in data analysis, we discuss several case studies where chemical curation of the original “raw” database enabled the successful modeling study (specifically, QSAR analysis) or resulted in a significant improvement of model's prediction accuracy. We also demonstrate that in some cases rigorously developed QSAR models could be even used to correct erroneous biological data associated with chemical compounds. We believe that good practices for curation of chemical records outlined in this paper will be of value to all scientists working in the fields of molecular modeling, cheminformatics, and QSAR studies. PMID:20572635

  9. The Reach Address Database (RAD)

    EPA Pesticide Factsheets

    The Reach Address Database (RAD) stores reach address information for each Water Program feature that has been linked to the underlying surface water features (streams, lakes, etc) in the National Hydrology Database (NHD) Plus dataset.

  10. Patterns, biases and prospects in the distribution and diversity of Neotropical snakes.

    PubMed

    Guedes, Thaís B; Sawaya, Ricardo J; Zizka, Alexander; Laffan, Shawn; Faurby, Søren; Pyron, R Alexander; Bérnils, Renato S; Jansen, Martin; Passos, Paulo; Prudente, Ana L C; Cisneros-Heredia, Diego F; Braz, Henrique B; Nogueira, Cristiano de C; Antonelli, Alexandre; Meiri, Shai

    2018-01-01

    We generated a novel database of Neotropical snakes (one of the world's richest herpetofauna) combining the most comprehensive, manually compiled distribution dataset with publicly available data. We assess, for the first time, the diversity patterns for all Neotropical snakes as well as sampling density and sampling biases. We compiled three databases of species occurrences: a dataset downloaded from the Global Biodiversity Information Facility (GBIF), a verified dataset built through taxonomic work and specialized literature, and a combined dataset comprising a cleaned version of the GBIF dataset merged with the verified dataset. Neotropics, Behrmann projection equivalent to 1° × 1°. Specimens housed in museums during the last 150 years. Squamata: Serpentes. Geographical information system (GIS). The combined dataset provides the most comprehensive distribution database for Neotropical snakes to date. It contains 147,515 records for 886 species across 12 families, representing 74% of all species of snakes, spanning 27 countries in the Americas. Species richness and phylogenetic diversity show overall similar patterns. Amazonia is the least sampled Neotropical region, whereas most well-sampled sites are located near large universities and scientific collections. We provide a list and updated maps of geographical distribution of all snake species surveyed. The biodiversity metrics of Neotropical snakes reflect patterns previously documented for other vertebrates, suggesting that similar factors may determine the diversity of both ectothermic and endothermic animals. We suggest conservation strategies for high-diversity areas and sampling efforts be directed towards Amazonia and poorly known species.

  11. The African Crane Database (1978-2014): Records of three threatened crane species (Family: Gruidae) from southern and eastern Africa

    PubMed Central

    Smith, Tanya; Page-Nicholson, Samantha; Gibbons, Bradley; Jones, M. Genevieve W.; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin

    2016-01-01

    Abstract Background The International Crane Foundation (ICF) / Endangered Wildlife Trust’s (EWT) African Crane Conservation Programme has recorded 26 403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus, Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus. The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. New information This is the single most comprehensive dataset published on African Crane species that adds new information about the distribution of these three threatened species. We hope this will further aid conservation authorities to monitor and protect these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal. PMID:27956850

  12. Using kittens to unlock photo-sharing website datasets for environmental applications

    NASA Astrophysics Data System (ADS)

    Gascoin, Simon

    2016-04-01

    Mining photo-sharing websites is a promising approach to complement in situ and satellite observations of the environment, however a challenge is to deal with the large degree of noise inherent to online social datasets. Here I explored the value of the Flickr image hosting website database to monitor the snow cover in the Pyrenees. Using the Flickr application programming interface (API) I queried all the public images metadata tagged at least with one of the following words: "snow", "neige", "nieve", "neu" (snow in French, Spanish and Catalan languages). The search was limited to the geo-tagged pictures taken in the Pyrenees area. However, the number of public pictures available in the Flickr database for a given time interval depends on several factors, including the Flickr website popularity and the development of digital photography. Thus, I also searched for all Flickr images tagged with "chat", "gat" or "gato" (cat in French, Spanish and Catalan languages). The tag "cat" was not considered in order to exclude the results from North America where Flickr got popular earlier than in Europe. The number of "cat" images per month was used to fit a model of the number of images uploaded in Flickr with time. This model was used to remove this trend in the numbers of snow-tagged photographs. The resulting time series was compared to a time series of the snow cover area derived from the MODIS satellite over the same region. Both datasets are well correlated; in particular they exhibit the same seasonal evolution, although the inter-annual variabilities are less similar. I will also discuss which other factors may explain the main discrepancies in order to further decrease the noise in the Flickr dataset.

  13. Database Objects vs Files: Evaluation of alternative strategies for managing large remote sensing data

    NASA Astrophysics Data System (ADS)

    Baru, Chaitan; Nandigam, Viswanath; Krishnan, Sriram

    2010-05-01

    Increasingly, the geoscience user community expects modern IT capabilities to be available in service of their research and education activities, including the ability to easily access and process large remote sensing datasets via online portals such as GEON (www.geongrid.org) and OpenTopography (opentopography.org). However, serving such datasets via online data portals presents a number of challenges. In this talk, we will evaluate the pros and cons of alternative storage strategies for management and processing of such datasets using binary large object implementations (BLOBs) in database systems versus implementation in Hadoop files using the Hadoop Distributed File System (HDFS). The storage and I/O requirements for providing online access to large datasets dictate the need for declustering data across multiple disks, for capacity as well as bandwidth and response time performance. This requires partitioning larger files into a set of smaller files, and is accompanied by the concomitant requirement for managing large numbers of file. Storing these sub-files as blobs in a shared-nothing database implemented across a cluster provides the advantage that all the distributed storage management is done by the DBMS. Furthermore, subsetting and processing routines can be implemented as user-defined functions (UDFs) on these blobs and would run in parallel across the set of nodes in the cluster. On the other hand, there are both storage overheads and constraints, and software licensing dependencies created by such an implementation. Another approach is to store the files in an external filesystem with pointers to them from within database tables. The filesystem may be a regular UNIX filesystem, a parallel filesystem, or HDFS. In the HDFS case, HDFS would provide the file management capability, while the subsetting and processing routines would be implemented as Hadoop programs using the MapReduce model. Hadoop and its related software libraries are freely available. Another consideration is the strategy used for partitioning large data collections, and large datasets within collections, using round-robin vs hash partitioning vs range partitioning methods. Each has different characteristics in terms of spatial locality of data and resultant degree of declustering of the computations on the data. Furthermore, we have observed that, in practice, there can be large variations in the frequency of access to different parts of a large data collection and/or dataset, thereby creating "hotspots" in the data. We will evaluate the ability of different approaches for dealing effectively with such hotspots and alternative strategies for dealing with hotspots.

  14. A Hybrid Semi-supervised Classification Scheme for Mining Multisource Geospatial Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vatsavai, Raju; Bhaduri, Budhendra L

    2011-01-01

    Supervised learning methods such as Maximum Likelihood (ML) are often used in land cover (thematic) classification of remote sensing imagery. ML classifier relies exclusively on spectral characteristics of thematic classes whose statistical distributions (class conditional probability densities) are often overlapping. The spectral response distributions of thematic classes are dependent on many factors including elevation, soil types, and ecological zones. A second problem with statistical classifiers is the requirement of large number of accurate training samples (10 to 30 |dimensions|), which are often costly and time consuming to acquire over large geographic regions. With the increasing availability of geospatial databases, itmore » is possible to exploit the knowledge derived from these ancillary datasets to improve classification accuracies even when the class distributions are highly overlapping. Likewise newer semi-supervised techniques can be adopted to improve the parameter estimates of statistical model by utilizing a large number of easily available unlabeled training samples. Unfortunately there is no convenient multivariate statistical model that can be employed for mulitsource geospatial databases. In this paper we present a hybrid semi-supervised learning algorithm that effectively exploits freely available unlabeled training samples from multispectral remote sensing images and also incorporates ancillary geospatial databases. We have conducted several experiments on real datasets, and our new hybrid approach shows over 25 to 35% improvement in overall classification accuracy over conventional classification schemes.« less

  15. ANISEED 2017: extending the integrated ascidian database to the exploration and evolutionary comparison of genome-scale datasets.

    PubMed

    Brozovic, Matija; Dantec, Christelle; Dardaillon, Justine; Dauga, Delphine; Faure, Emmanuel; Gineste, Mathieu; Louis, Alexandra; Naville, Magali; Nitta, Kazuhiro R; Piette, Jacques; Reeves, Wendy; Scornavacca, Céline; Simion, Paul; Vincentelli, Renaud; Bellec, Maelle; Aicha, Sameh Ben; Fagotto, Marie; Guéroult-Bellone, Marion; Haeussler, Maximilian; Jacox, Edwin; Lowe, Elijah K; Mendez, Mickael; Roberge, Alexis; Stolfi, Alberto; Yokomori, Rui; Brown, C Titus; Cambillau, Christian; Christiaen, Lionel; Delsuc, Frédéric; Douzery, Emmanuel; Dumollard, Rémi; Kusakabe, Takehiro; Nakai, Kenta; Nishida, Hiroki; Satou, Yutaka; Swalla, Billie; Veeman, Michael; Volff, Jean-Nicolas; Lemaire, Patrick

    2018-01-04

    ANISEED (www.aniseed.cnrs.fr) is the main model organism database for tunicates, the sister-group of vertebrates. This release gives access to annotated genomes, gene expression patterns, and anatomical descriptions for nine ascidian species. It provides increased integration with external molecular and taxonomy databases, better support for epigenomics datasets, in particular RNA-seq, ChIP-seq and SELEX-seq, and features novel interactive interfaces for existing and novel datatypes. In particular, the cross-species navigation and comparison is enhanced through a novel taxonomy section describing each represented species and through the implementation of interactive phylogenetic gene trees for 60% of tunicate genes. The gene expression section displays the results of RNA-seq experiments for the three major model species of solitary ascidians. Gene expression is controlled by the binding of transcription factors to cis-regulatory sequences. A high-resolution description of the DNA-binding specificity for 131 Ciona robusta (formerly C. intestinalis type A) transcription factors by SELEX-seq is provided and used to map candidate binding sites across the Ciona robusta and Phallusia mammillata genomes. Finally, use of a WashU Epigenome browser enhances genome navigation, while a Genomicus server was set up to explore microsynteny relationships within tunicates and with vertebrates, Amphioxus, echinoderms and hemichordates. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  16. ScrubChem: Cleaning of PubChem Bioassay Data to Create Diverse and Massive Bioactivity Datasets for Use in Modeling Applications (SOT)

    EPA Science Inventory

    The PubChem Bioassay database is a non-curated public repository with bioactivity data from 64 sources, including: ChEMBL, BindingDb, DrugBank, Tox21, NIH Molecular Libraries Screening Program, and various academic, government, and industrial contributors. However, this data is d...

  17. interPopula: a Python API to access the HapMap Project dataset

    PubMed Central

    2010-01-01

    Background The HapMap project is a publicly available catalogue of common genetic variants that occur in humans, currently including several million SNPs across 1115 individuals spanning 11 different populations. This important database does not provide any programmatic access to the dataset, furthermore no standard relational database interface is provided. Results interPopula is a Python API to access the HapMap dataset. interPopula provides integration facilities with both the Python ecology of software (e.g. Biopython and matplotlib) and other relevant human population datasets (e.g. Ensembl gene annotation and UCSC Known Genes). A set of guidelines and code examples to address possible inconsistencies across heterogeneous data sources is also provided. Conclusions interPopula is a straightforward and flexible Python API that facilitates the construction of scripts and applications that require access to the HapMap dataset. PMID:21210977

  18. Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets.

    PubMed

    Edge, Michael D; Algee-Hewitt, Bridget F B; Pemberton, Trevor J; Li, Jun Z; Rosenberg, Noah A

    2017-05-30

    Combining genotypes across datasets is central in facilitating advances in genetics. Data aggregation efforts often face the challenge of record matching-the identification of dataset entries that represent the same individual. We show that records can be matched across genotype datasets that have no shared markers based on linkage disequilibrium between loci appearing in different datasets. Using two datasets for the same 872 people-one with 642,563 genome-wide SNPs and the other with 13 short tandem repeats (STRs) used in forensic applications-we find that 90-98% of forensic STR records can be connected to corresponding SNP records and vice versa. Accuracy increases to 99-100% when ∼30 STRs are used. Our method expands the potential of data aggregation, but it also suggests privacy risks intrinsic in maintenance of databases containing even small numbers of markers-including databases of forensic significance.

  19. Neuro-fuzzy model for estimating race and gender from geometric distances of human face across pose

    NASA Astrophysics Data System (ADS)

    Nanaa, K.; Rahman, M. N. A.; Rizon, M.; Mohamad, F. S.; Mamat, M.

    2018-03-01

    Classifying human face based on race and gender is a vital process in face recognition. It contributes to an index database and eases 3D synthesis of the human face. Identifying race and gender based on intrinsic factor is problematic, which is more fitting to utilizing nonlinear model for estimating process. In this paper, we aim to estimate race and gender in varied head pose. For this purpose, we collect dataset from PICS and CAS-PEAL databases, detect the landmarks and rotate them to the frontal pose. After geometric distances are calculated, all of distance values will be normalized. Implementation is carried out by using Neural Network Model and Fuzzy Logic Model. These models are combined by using Adaptive Neuro-Fuzzy Model. The experimental results showed that the optimization of address fuzzy membership. Model gives a better assessment rate and found that estimating race contributing to a more accurate gender assessment.

  20. Object instance recognition using motion cues and instance specific appearance models

    NASA Astrophysics Data System (ADS)

    Schumann, Arne

    2014-03-01

    In this paper we present an object instance retrieval approach. The baseline approach consists of a pool of image features which are computed on the bounding boxes of a query object track and compared to a database of tracks in order to find additional appearances of the same object instance. We improve over this simple baseline approach in multiple ways: 1) we include motion cues to achieve improved robustness to viewpoint and rotation changes, 2) we include operator feedback to iteratively re-rank the resulting retrieval lists and 3) we use operator feedback and location constraints to train classifiers and learn an instance specific appearance model. We use these classifiers to further improve the retrieval results. The approach is evaluated on two popular public datasets for two different applications. We evaluate person re-identification on the CAVIAR shopping mall surveillance dataset and vehicle instance recognition on the VIVID aerial dataset and achieve significant improvements over our baseline results.

  1. Watershed Data Management (WDM) database for West Branch DuPage River streamflow simulation, DuPage County, Illinois, January 1, 2007, through September 30, 2013

    USGS Publications Warehouse

    Bera, Maitreyee

    2017-10-16

    The U.S. Geological Survey (USGS), in cooperation with the DuPage County Stormwater Management Department, maintains a database of hourly meteorological and hydrologic data for use in a near real-time streamflow simulation system. This system is used in the management and operation of reservoirs and other flood-control structures in the West Branch DuPage River watershed in DuPage County, Illinois. The majority of the precipitation data are collected from a tipping-bucket rain-gage network located in and near DuPage County. The other meteorological data (air temperature, dewpoint temperature, wind speed, and solar radiation) are collected at Argonne National Laboratory in Argonne, Ill. Potential evapotranspiration is computed from the meteorological data using the computer program LXPET (Lamoreux Potential Evapotranspiration). The hydrologic data (water-surface elevation [stage] and discharge) are collected at U.S.Geological Survey streamflow-gaging stations in and around DuPage County. These data are stored in a Watershed Data Management (WDM) database.This report describes a version of the WDM database that is quality-assured and quality-controlled annually to ensure datasets are complete and accurate. This database is named WBDR13.WDM. It contains data from January 1, 2007, through September 30, 2013. Each precipitation dataset may have time periods of inaccurate data. This report describes the methods used to estimate the data for the periods of missing, erroneous, or snowfall-affected data and thereby improve the accuracy of these data. The other meteorological datasets are described in detail in Over and others (2010), and the hydrologic datasets in the database are fully described in the online USGS annual water data reports for Illinois (U.S. Geological Survey, 2016) and, therefore, are described in less detail than the precipitation datasets in this report.

  2. MOBBED: a computational data infrastructure for handling large collections of event-rich time series datasets in MATLAB

    PubMed Central

    Cockfield, Jeremy; Su, Kyungmin; Robbins, Kay A.

    2013-01-01

    Experiments to monitor human brain activity during active behavior record a variety of modalities (e.g., EEG, eye tracking, motion capture, respiration monitoring) and capture a complex environmental context leading to large, event-rich time series datasets. The considerable variability of responses within and among subjects in more realistic behavioral scenarios requires experiments to assess many more subjects over longer periods of time. This explosion of data requires better computational infrastructure to more systematically explore and process these collections. MOBBED is a lightweight, easy-to-use, extensible toolkit that allows users to incorporate a computational database into their normal MATLAB workflow. Although capable of storing quite general types of annotated data, MOBBED is particularly oriented to multichannel time series such as EEG that have event streams overlaid with sensor data. MOBBED directly supports access to individual events, data frames, and time-stamped feature vectors, allowing users to ask questions such as what types of events or features co-occur under various experimental conditions. A database provides several advantages not available to users who process one dataset at a time from the local file system. In addition to archiving primary data in a central place to save space and avoid inconsistencies, such a database allows users to manage, search, and retrieve events across multiple datasets without reading the entire dataset. The database also provides infrastructure for handling more complex event patterns that include environmental and contextual conditions. The database can also be used as a cache for expensive intermediate results that are reused in such activities as cross-validation of machine learning algorithms. MOBBED is implemented over PostgreSQL, a widely used open source database, and is freely available under the GNU general public license at http://visual.cs.utsa.edu/mobbed. Source and issue reports for MOBBED are maintained at http://vislab.github.com/MobbedMatlab/ PMID:24124417

  3. MOBBED: a computational data infrastructure for handling large collections of event-rich time series datasets in MATLAB.

    PubMed

    Cockfield, Jeremy; Su, Kyungmin; Robbins, Kay A

    2013-01-01

    Experiments to monitor human brain activity during active behavior record a variety of modalities (e.g., EEG, eye tracking, motion capture, respiration monitoring) and capture a complex environmental context leading to large, event-rich time series datasets. The considerable variability of responses within and among subjects in more realistic behavioral scenarios requires experiments to assess many more subjects over longer periods of time. This explosion of data requires better computational infrastructure to more systematically explore and process these collections. MOBBED is a lightweight, easy-to-use, extensible toolkit that allows users to incorporate a computational database into their normal MATLAB workflow. Although capable of storing quite general types of annotated data, MOBBED is particularly oriented to multichannel time series such as EEG that have event streams overlaid with sensor data. MOBBED directly supports access to individual events, data frames, and time-stamped feature vectors, allowing users to ask questions such as what types of events or features co-occur under various experimental conditions. A database provides several advantages not available to users who process one dataset at a time from the local file system. In addition to archiving primary data in a central place to save space and avoid inconsistencies, such a database allows users to manage, search, and retrieve events across multiple datasets without reading the entire dataset. The database also provides infrastructure for handling more complex event patterns that include environmental and contextual conditions. The database can also be used as a cache for expensive intermediate results that are reused in such activities as cross-validation of machine learning algorithms. MOBBED is implemented over PostgreSQL, a widely used open source database, and is freely available under the GNU general public license at http://visual.cs.utsa.edu/mobbed. Source and issue reports for MOBBED are maintained at http://vislab.github.com/MobbedMatlab/

  4. Rapid Response Tools and Datasets for Post-fire Hydrological Modeling

    NASA Astrophysics Data System (ADS)

    Miller, Mary Ellen; MacDonald, Lee H.; Billmire, Michael; Elliot, William J.; Robichaud, Pete R.

    2016-04-01

    Rapid response is critical following natural disasters. Flooding, erosion, and debris flows are a major threat to life, property and municipal water supplies after moderate and high severity wildfires. The problem is that mitigation measures must be rapidly implemented if they are to be effective, but they are expensive and cannot be applied everywhere. Fires, runoff, and erosion risks also are highly heterogeneous in space, so there is an urgent need for a rapid, spatially-explicit assessment. Past post-fire modeling efforts have usually relied on lumped, conceptual models because of the lack of readily available, spatially-explicit data layers on the key controls of topography, vegetation type, climate, and soil characteristics. The purpose of this project is to develop a set of spatially-explicit data layers for use in process-based models such as WEPP, and to make these data layers freely available. The resulting interactive online modeling database (http://geodjango.mtri.org/geowepp/) is now operational and publically available for 17 western states in the USA. After a fire, users only need to upload a soil burn severity map, and this is combined with the pre-existing data layers to generate the model inputs needed for spatially explicit models such as GeoWEPP (Renschler, 2003). The development of this online database has allowed us to predict post-fire erosion and various remediation scenarios in just 1-7 days for six fires ranging in size from 4-540 km2. These initial successes have stimulated efforts to further improve the spatial extent and amount of data, and add functionality to support the USGS debris flow model, batch processing for Disturbed WEPP (Elliot et al., 2004) and ERMiT (Robichaud et al., 2007), and to support erosion modeling for other land uses, such as agriculture or mining. The design and techniques used to create the database and the modeling interface are readily repeatable for any area or country that has the necessary topography, climate, soil, and land cover datasets.

  5. Initial spatio-temporal domain expansion of the Modelfest database

    NASA Astrophysics Data System (ADS)

    Carney, Thom; Mozaffari, Sahar; Sun, Sean; Johnson, Ryan; Shirvastava, Sharona; Shen, Priscilla; Ly, Emma

    2013-03-01

    The first Modelfest group publication appeared in the SPIE Human Vision and Electronic Imaging conference proceedings in 1999. "One of the group's goals is to develop a public database of test images with threshold data from multiple laboratories for designing and testing HVS (Human Vision Models)." After extended discussions the group selected a set of 45 static images thought to best meet that goal and collected psychophysical detection data which is available on the WEB and presented in the 2000 SPIE conference proceedings. Several groups have used these datasets to test spatial modeling ideas. Further discussions led to the preliminary stimulus specification for extending the database into the temporal domain which was published in the 2002 conference proceeding. After a hiatus of 12 years, some of us have collected spatio-temporal thresholds on an expanded stimulus set of 41 video clips; the original specification included 35 clips. The principal change involved adding one additional spatial pattern beyond the three originally specified. The stimuli consisted of 4 spatial patterns, Gaussian Blob, 4 c/d Gabor patch, 11.3 c/d Gabor patch and a 2D white noise patch. Across conditions the patterns were temporally modulated over a range of approximately 0-25 Hz as well as temporal edge and pulse modulation conditions. The display and data collection specifications were as specified by the Modelfest groups in the 2002 conference proceedings. To date seven subjects have participated in this phase of the data collection effort, one of which also participated in the first phase of Modelfest. Three of the spatio-temporal stimuli were identical to conditions in the original static dataset. Small differences in the thresholds were evident and may point to a stimulus limitation. The temporal CSF peaked between 4 and 8 Hz for the 0 c/d (Gaussian blob) and 4 c/d patterns. The 4 c/d and 11.3 c/d Gabor temporal CSF was low pass while the 0 c/d pattern was band pass. This preliminary expansion of the Modelfest dataset needs the participation of additional laboratories to evaluate the impact of different methods on threshold estimates and increase the subject base. We eagerly await the addition of new data from interested researchers. It remains to be seen how accurately general HVS models will predict thresholds across both Modelfest datasets.

  6. [Scientific significance and prospective application of digitized virtual human].

    PubMed

    Zhong, Shi-zhen

    2003-03-01

    As a cutting-edge research project, digitization of human anatomical information combines conventional medicine with information technology, computer technology, and virtual reality technology. Recent years have seen the establishment of, or the ongoing effort to establish various virtual human models in many countries, on the basis of continuous sections of human body that are digitized by means of computational medicine incorporating information technology to quantitatively simulate human physiological and pathological conditions, and to provide wide prospective applications in the fields of medicine and other disciplines. This article addresses 4 issues concerning the progress in virtual human model researches as the following: (1) Worldwide survey of sectioning and modeling of visible human. American visible human database was completed in 1994, which contains both a male and a female datasets, and has found wide application internationally. South Korea also finished the data collection for a male visible Korean human dataset in 2000. (2) Application of the dataset of Visible Human Project (VHP). This dataset has yielded plentiful fruits in medical education and clinical research, and further plans are proposed and practiced to construct a Physical Human and Physiological Human . (3) Scientific significance and prospect of virtual human studies. Digitized human dataset may eventually contribute to the development of many new high-tech industries. (4) Progress of virtual Chinese human project. The 174th session of Xiangshang Science Conferences held in 2001 marked the initiation of digitized virtual human project in China, and some key techniques have been explored. By now the data-collection process for 4 Chinese virtual human datasets have been successfully completed.

  7. TISSUES 2.0: an integrative web resource on mammalian tissue expression

    PubMed Central

    Palasca, Oana; Santos, Alberto; Stolte, Christian; Gorodkin, Jan; Jensen, Lars Juhl

    2018-01-01

    Abstract Physiological and molecular similarities between organisms make it possible to translate findings from simpler experimental systems—model organisms—into more complex ones, such as human. This translation facilitates the understanding of biological processes under normal or disease conditions. Researchers aiming to identify the similarities and differences between organisms at the molecular level need resources collecting multi-organism tissue expression data. We have developed a database of gene–tissue associations in human, mouse, rat and pig by integrating multiple sources of evidence: transcriptomics covering all four species and proteomics (human only), manually curated and mined from the scientific literature. Through a scoring scheme, these associations are made comparable across all sources of evidence and across organisms. Furthermore, the scoring produces a confidence score assigned to each of the associations. The TISSUES database (version 2.0) is publicly accessible through a user-friendly web interface and as part of the STRING app for Cytoscape. In addition, we analyzed the agreement between datasets, across and within organisms, and identified that the agreement is mainly affected by the quality of the datasets rather than by the technologies used or organisms compared. Database URL: http://tissues.jensenlab.org/ PMID:29617745

  8. Condensing Massive Satellite Datasets For Rapid Interactive Analysis

    NASA Astrophysics Data System (ADS)

    Grant, G.; Gallaher, D. W.; Lv, Q.; Campbell, G. G.; Fowler, C.; LIU, Q.; Chen, C.; Klucik, R.; McAllister, R. A.

    2015-12-01

    Our goal is to enable users to interactively analyze massive satellite datasets, identifying anomalous data or values that fall outside of thresholds. To achieve this, the project seeks to create a derived database containing only the most relevant information, accelerating the analysis process. The database is designed to be an ancillary tool for the researcher, not an archival database to replace the original data. This approach is aimed at improving performance by reducing the overall size by way of condensing the data. The primary challenges of the project include: - The nature of the research question(s) may not be known ahead of time. - The thresholds for determining anomalies may be uncertain. - Problems associated with processing cloudy, missing, or noisy satellite imagery. - The contents and method of creation of the condensed dataset must be easily explainable to users. The architecture of the database will reorganize spatially-oriented satellite imagery into temporally-oriented columns of data (a.k.a., "data rods") to facilitate time-series analysis. The database itself is an open-source parallel database, designed to make full use of clustered server technologies. A demonstration of the system capabilities will be shown. Applications for this technology include quick-look views of the data, as well as the potential for on-board satellite processing of essential information, with the goal of reducing data latency.

  9. Isfahan MISP Dataset

    PubMed Central

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database managing. The website was entitled “biosigdata.com.” It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while maintaining their privacies (citation and fee). Commenting was also available for all datasets, and automatic sitemap and semi-automatic SEO indexing have been set for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf). PMID:28487832

  10. PropBase Query Layer: a single portal to UK subsurface physical property databases

    NASA Astrophysics Data System (ADS)

    Kingdon, Andrew; Nayembil, Martin L.; Richardson, Anne E.; Smith, A. Graham

    2013-04-01

    Until recently, the delivery of geological information for industry and public was achieved by geological mapping. Now pervasively available computers mean that 3D geological models can deliver realistic representations of the geometric location of geological units, represented as shells or volumes. The next phase of this process is to populate these with physical properties data that describe subsurface heterogeneity and its associated uncertainty. Achieving this requires capture and serving of physical, hydrological and other property information from diverse sources to populate these models. The British Geological Survey (BGS) holds large volumes of subsurface property data, derived both from their own research data collection and also other, often commercially derived data sources. This can be voxelated to incorporate this data into the models to demonstrate property variation within the subsurface geometry. All property data held by BGS has for many years been stored in relational databases to ensure their long-term continuity. However these have, by necessity, complex structures; each database contains positional reference data and model information, and also metadata such as sample identification information and attributes that define the source and processing. Whilst this is critical to assessing these analyses, it also hugely complicates the understanding of variability of the property under assessment and requires multiple queries to study related datasets making extracting physical properties from these databases difficult. Therefore the PropBase Query Layer has been created to allow simplified aggregation and extraction of all related data and its presentation of complex data in simple, mostly denormalized, tables which combine information from multiple databases into a single system. The structure from each relational database is denormalized in a generalised structure, so that each dataset can be viewed together in a common format using a simple interface. Data are re-engineered to facilitate easy loading. The query layer structure comprises tables, procedures, functions, triggers, views and materialised views. The structure contains a main table PRB_DATA which contains all of the data with the following attribution: • a unique identifier • the data source • the unique identifier from the parent database for traceability • the 3D location • the property type • the property value • the units • necessary qualifiers • precision information and an audit trail Data sources, property type and units are constrained by dictionaries, a key component of the structure which defines what properties and inheritance hierarchies are to be coded and also guides the process as to what and how these are extracted from the structure. Data types served by the Query Layer include site investigation derived geotechnical data, hydrogeology datasets, regional geochemistry, geophysical logs as well as lithological and borehole metadata. The size and complexity of the data sets with multiple parent structures requires a technically robust approach to keep the layer synchronised. This is achieved through Oracle procedures written in PL/SQL containing the logic required to carry out the data manipulation (inserts, updates, deletes) to keep the layer synchronised with the underlying databases either as regular scheduled jobs (weekly, monthly etc) or invoked on demand. The PropBase Query Layer's implementation has enabled rapid data discovery, visualisation and interpretation of geological data with greater ease, simplifying the parametrisation of 3D model volumes and facilitating the study of intra-unit heterogeneity.

  11. [Developing forensic reference database by 18 autosomal STR for DNA identification in Republic of Belarus].

    PubMed

    Tsybovskii, I S; Veremeichik, V M; Kotova, S A; Kritskaya, S V; Evmenenko, S A; Udina, I G

    2017-02-01

    For the Republic of Belarus, development of a forensic reference database on the basis of 18 autosomal microsatellites (STR) using a population dataset (N = 1040), “familial” genotypic dataset (N = 2550) obtained from expertise performance of paternity testing, and a dataset of genotypes from a criminal registration database (N = 8756) is described. Population samples studied consist of 80% ethnic Belarusians and 20% individuals of other nationality or of mixed origin (by questionnaire data). Genotypes of 12346 inhabitants of the Republic of Belarus from 118 regional samples studied by 18 autosomal microsatellites are included in the sample: 16 tetranucleotide STR (D2S1338, TPOX, D3S1358, CSF1PO, D5S818, D8S1179, D7S820, THO1, vWA, D13S317, D16S539, D18S51, D19S433, D21S11, F13B, and FGA) and two pentanucleotide STR (Penta D and Penta E). The samples studied are in Hardy–Weinberg equilibrium according to distribution of genotypes by 18 STR. Significant differences were not detected between discrete populations or between samples from various historical ethnographic regions of the Republic of Belarus (Western and Eastern Polesie, Podneprovye, Ponemanye, Poozerye, and Center), which indicates the absence of prominent genetic differentiation. Statistically significant differences between the studied genotypic datasets also were not detected, which made it possible to combine the datasets and consider the total sample as a unified forensic reference database for 18 “criminalistic” STR loci. Differences between reference database of the Republic of Belarus and Russians and Ukrainians by the distribution of the range of autosomal STR also were not detected, corresponding to a close genetic relationship of the three Eastern Slavic nations mediated by common origin and intense mutual migrations. Significant differences by separate STR loci between the reference database of Republic of Belarus and populations of Southern and Western Slavs were observed. The necessity of using original reference database for support of forensic expertise practice in the Republic of Belarus was demonstrated.

  12. NLCD - MODIS albedo data

    EPA Pesticide Factsheets

    The NLCD-MODIS land cover-albedo database integrates high-quality MODIS albedo observations with areas of homogeneous land cover from NLCD. The spatial resolution (pixel size) of the database is 480m-x-480m aligned to the standardized UGSG Albers Equal-Area projection. The spatial extent of the database is the continental United States. This dataset is associated with the following publication:Wickham , J., C.A. Barnes, and T. Wade. Combining NLCD and MODIS to Create a Land Cover-Albedo Dataset for the Continental United States. REMOTE SENSING OF ENVIRONMENT. Elsevier Science Ltd, New York, NY, USA, 170(0): 143-153, (2015).

  13. A Regional Climate Model Evaluation System based on contemporary Satellite and other Observations for Assessing Regional Climate Model Fidelity

    NASA Astrophysics Data System (ADS)

    Waliser, D. E.; Kim, J.; Mattman, C.; Goodale, C.; Hart, A.; Zimdars, P.; Lean, P.

    2011-12-01

    Evaluation of climate models against observations is an essential part of assessing the impact of climate variations and change on regionally important sectors and improving climate models. Regional climate models (RCMs) are of a particular concern. RCMs provide fine-scale climate needed by the assessment community via downscaling global climate model projections such as those contributing to the Coupled Model Intercomparison Project (CMIP) that form one aspect of the quantitative basis of the IPCC Assessment Reports. The lack of reliable fine-resolution observational data and formal tools and metrics has represented a challenge in evaluating RCMs. Recent satellite observations are particularly useful as they provide a wealth of information and constraints on many different processes within the climate system. Due to their large volume and the difficulties associated with accessing and using contemporary observations, however, these datasets have been generally underutilized in model evaluation studies. Recognizing this problem, NASA JPL and UCLA have developed the Regional Climate Model Evaluation System (RCMES) to help make satellite observations, in conjunction with in-situ and reanalysis datasets, more readily accessible to the regional modeling community. The system includes a central database (Regional Climate Model Evaluation Database: RCMED) to store multiple datasets in a common format and codes for calculating and plotting statistical metrics to assess model performance (Regional Climate Model Evaluation Tool: RCMET). This allows the time taken to compare model data with satellite observations to be reduced from weeks to days. RCMES is a component of the recent ExArch project, an international effort for facilitating the archive and access of massive amounts data for users using cloud-based infrastructure, in this case as applied to the study of climate and climate change. This presentation will describe RCMES and demonstrate its utility using examples from RCMs applied to the southwest US as well as to Africa based on output from the CORDEX activity. Application of RCMES to the evaluation of multi-RCM hindcast for CORDEX-Africa will be presented in a companion paper in A41.

  14. Image-based query-by-example for big databases of galaxy images

    NASA Astrophysics Data System (ADS)

    Shamir, Lior; Kuminski, Evan

    2017-01-01

    Very large astronomical databases containing millions or even billions of galaxy images have been becoming increasingly important tools in astronomy research. However, in many cases the very large size makes it more difficult to analyze these data manually, reinforcing the need for computer algorithms that can automate the data analysis process. An example of such task is the identification of galaxies of a certain morphology of interest. For instance, if a rare galaxy is identified it is reasonable to expect that more galaxies of similar morphology exist in the database, but it is virtually impossible to manually search these databases to identify such galaxies. Here we describe computer vision and pattern recognition methodology that receives a galaxy image as an input, and searches automatically a large dataset of galaxies to return a list of galaxies that are visually similar to the query galaxy. The returned list is not necessarily complete or clean, but it provides a substantial reduction of the original database into a smaller dataset, in which the frequency of objects visually similar to the query galaxy is much higher. Experimental results show that the algorithm can identify rare galaxies such as ring galaxies among datasets of 10,000 astronomical objects.

  15. Allele Frequencies Net Database: Improvements for storage of individual genotypes and analysis of existing data.

    PubMed

    Santos, Eduardo Jose Melos Dos; McCabe, Antony; Gonzalez-Galarza, Faviel F; Jones, Andrew R; Middleton, Derek

    2016-03-01

    The Allele Frequencies Net Database (AFND) is a freely accessible database which stores population frequencies for alleles or genes of the immune system in worldwide populations. Herein we introduce two new tools. We have defined new classifications of data (gold, silver and bronze) to assist users in identifying the most suitable populations for their tasks. The gold standard datasets are defined by allele frequencies summing to 1, sample sizes >50 and high resolution genotyping, while silver standard datasets do not meet gold standard genotyping resolution and/or sample size criteria. The bronze standard datasets are those that could not be classified under the silver or gold standards. The gold standard includes >500 datasets covering over 3 million individuals from >100 countries at one or more of the following loci: HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1 and -DRB1 - with all loci except DPA1 present in more than 220 datasets. Three out of 12 geographic regions have low representation (the majority of their countries having less than five datasets) and the Central Asia region has no representation. There are 18 countries that are not represented by any gold standard datasets but are represented by at least one dataset that is either silver or bronze standard. We also briefly summarize the data held by AFND for KIR genes, alleles and their ligands. Our second new component is a data submission tool to assist users in the collection of the genotypes of the individuals (raw data), facilitating submission of short population reports to Human Immunology, as well as simplifying the submission of population demographics and frequency data. Copyright © 2015 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.

  16. Tomato functional genomics database (TFGD): a comprehensive collection and analysis package for tomato functional genomics

    USDA-ARS?s Scientific Manuscript database

    Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...

  17. HTM Spatial Pooler With Memristor Crossbar Circuits for Sparse Biometric Recognition.

    PubMed

    James, Alex Pappachen; Fedorova, Irina; Ibrayev, Timur; Kudithipudi, Dhireesha

    2017-06-01

    Hierarchical Temporal Memory (HTM) is an online machine learning algorithm that emulates the neo-cortex. The development of a scalable on-chip HTM architecture is an open research area. The two core substructures of HTM are spatial pooler and temporal memory. In this work, we propose a new Spatial Pooler circuit design with parallel memristive crossbar arrays for the 2D columns. The proposed design was validated on two different benchmark datasets, face recognition, and speech recognition. The circuits are simulated and analyzed using a practical memristor device model and 0.18 μm IBM CMOS technology model. The databases AR, YALE, ORL, and UFI, are used to test the performance of the design in face recognition. TIMIT dataset is used for the speech recognition.

  18. EPA GHG Certification of Medium- and Heavy-Duty Vehicles: Development of Road Grade Profiles Representative of US Controlled Access Highways

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wood, Eric; Duran, Adam; Burton, Evan

    This report includes a detailed comparison of the TomTom national road grade database relative to a local road grade dataset generated by Southwest Research Institute and a national elevation dataset publically available from the U.S. Geological Survey. This analysis concluded that the TomTom national road grade database was a suitable source of road grade data for purposes of this study.

  19. A Web Server and Mobile App for Computing Hemolytic Potency of Peptides

    NASA Astrophysics Data System (ADS)

    Chaudhary, Kumardeep; Kumar, Ritesh; Singh, Sandeep; Tuknait, Abhishek; Gautam, Ankur; Mathur, Deepika; Anand, Priya; Varshney, Grish C.; Raghava, Gajendra P. S.

    2016-03-01

    Numerous therapeutic peptides do not enter the clinical trials just because of their high hemolytic activity. Recently, we developed a database, Hemolytik, for maintaining experimentally validated hemolytic and non-hemolytic peptides. The present study describes a web server and mobile app developed for predicting, and screening of peptides having hemolytic potency. Firstly, we generated a dataset HemoPI-1 that contains 552 hemolytic peptides extracted from Hemolytik database and 552 random non-hemolytic peptides (from Swiss-Prot). The sequence analysis of these peptides revealed that certain residues (e.g., L, K, F, W) and motifs (e.g., “FKK”, “LKL”, “KKLL”, “KWK”, “VLK”, “CYCR”, “CRR”, “RFC”, “RRR”, “LKKL”) are more abundant in hemolytic peptides. Therefore, we developed models for discriminating hemolytic and non-hemolytic peptides using various machine learning techniques and achieved more than 95% accuracy. We also developed models for discriminating peptides having high and low hemolytic potential on different datasets called HemoPI-2 and HemoPI-3. In order to serve the scientific community, we developed a web server, mobile app and JAVA-based standalone software (http://crdd.osdd.net/raghava/hemopi/).

  20. USA National Phenology Network’s volunteer-contributed observations yield predictive models of phenological transitions

    USGS Publications Warehouse

    Crimmins, Theresa M.; Crimmins, Michael A.; Gerst, Katherine L.; Rosemartin, Alyssa H.; Weltzin, Jake F.

    2017-01-01

    In support of science and society, the USA National Phenology Network (USA-NPN) maintains a rapidly growing, continental-scale, species-rich dataset of plant and animal phenology observations that with over 10 million records is the largest such database in the United States. Contributed voluntarily by professional and citizen scientists, these opportunistically collected observations are characterized by spatial clustering, inconsistent spatial and temporal sampling, and short temporal depth. We explore the potential for developing models of phenophase transitions suitable for use at the continental scale, which could be applied to a wide range of resource management contexts. We constructed predictive models of the onset of breaking leaf buds, leaves, open flowers, and ripe fruits – phenophases that are the most abundant in the database and also relevant to management applications – for all species with available data, regardless of plant growth habit, location, geographic extent, or temporal depth of the observations. We implemented a very basic model formulation - thermal time models with a fixed start date. Sufficient data were available to construct 107 individual species × phenophase models. Of these, fifteen models (14%) met our criteria for model fit and error and were suitable for use across the majority of the species’ geographic ranges. These findings indicate that the USA-NPN dataset holds promise for further and more refined modeling efforts. Further, the candidate models that emerged could be used to produce real-time and short-term forecast maps of the timing of such transitions to directly support natural resource management.

  1. In silico modeling to predict drug-induced phospholipidosis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Choi, Sydney S.; Kim, Jae S.; Valerio, Luis G., E-mail: luis.valerio@fda.hhs.gov

    2013-06-01

    Drug-induced phospholipidosis (DIPL) is a preclinical finding during pharmaceutical drug development that has implications on the course of drug development and regulatory safety review. A principal characteristic of drugs inducing DIPL is known to be a cationic amphiphilic structure. This provides evidence for a structure-based explanation and opportunity to analyze properties and structures of drugs with the histopathologic findings for DIPL. In previous work from the FDA, in silico quantitative structure–activity relationship (QSAR) modeling using machine learning approaches has shown promise with a large dataset of drugs but included unconfirmed data as well. In this study, we report the constructionmore » and validation of a battery of complementary in silico QSAR models using the FDA's updated database on phospholipidosis, new algorithms and predictive technologies, and in particular, we address high performance with a high-confidence dataset. The results of our modeling for DIPL include rigorous external validation tests showing 80–81% concordance. Furthermore, the predictive performance characteristics include models with high sensitivity and specificity, in most cases above ≥ 80% leading to desired high negative and positive predictivity. These models are intended to be utilized for regulatory toxicology applied science needs in screening new drugs for DIPL. - Highlights: • New in silico models for predicting drug-induced phospholipidosis (DIPL) are described. • The training set data in the models is derived from the FDA's phospholipidosis database. • We find excellent predictivity values of the models based on external validation. • The models can support drug screening and regulatory decision-making on DIPL.« less

  2. Strategies to improve reference databases for soil microbiomes

    DOE PAGES

    Choi, Jinlyung; Yang, Fan; Stepanauskas, Ramunas; ...

    2016-12-09

    A database of curated genomes is needed to better assess soil microbial communities and their processes associated with differing land management and environmental impacts. Interpreting soil metagenomic datasets with existing sequence databases is challenging because these datasets are biased towards medical and biotechnology research and can result in misleading annotations. We have curated a database of 928 genomes of soil-associated organisms (888 bacteria, 34 archaea, and 6 fungi). Using this database as a representation of the current state of knowledge of soil microbes that are well-characterized, we evaluated its composition and compared it to broader microbial databases, specifically NCBI’s RefSeq,more » as well as 3,035 publicly available soil amplicon datasets. These comparisons identified phyla and functions that are enriched in soils as well as those that may be underrepresented in RefSoil. For example, RefSoil was observed to have increased representation of Firmicutes despite its low abundance in soil environments and also lacked representation of Acidobacteria and Verrucomicrobia, which are abundant in soils. Our comparison of RefSoil to soil amplicon datasets allowed us to identify targets that if cultured or sequenced would significantly increase the biodiversity represented within RefSoil. To demonstrate the opportunities to access these underrepresented targets, we employed single cell genomics in a pilot experiment to recover 14 genomes from the "most wanted" list, which improved RefSoil's representation of EMP sequences by 7% by abundance. This effort demonstrates the value of RefSoil in the guidance of future research efforts and the capability of single cell genomics as a practical means to fill the existing genomic data gaps.« less

  3. Strategies to improve reference databases for soil microbiomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Choi, Jinlyung; Yang, Fan; Stepanauskas, Ramunas

    A database of curated genomes is needed to better assess soil microbial communities and their processes associated with differing land management and environmental impacts. Interpreting soil metagenomic datasets with existing sequence databases is challenging because these datasets are biased towards medical and biotechnology research and can result in misleading annotations. We have curated a database of 928 genomes of soil-associated organisms (888 bacteria, 34 archaea, and 6 fungi). Using this database as a representation of the current state of knowledge of soil microbes that are well-characterized, we evaluated its composition and compared it to broader microbial databases, specifically NCBI’s RefSeq,more » as well as 3,035 publicly available soil amplicon datasets. These comparisons identified phyla and functions that are enriched in soils as well as those that may be underrepresented in RefSoil. For example, RefSoil was observed to have increased representation of Firmicutes despite its low abundance in soil environments and also lacked representation of Acidobacteria and Verrucomicrobia, which are abundant in soils. Our comparison of RefSoil to soil amplicon datasets allowed us to identify targets that if cultured or sequenced would significantly increase the biodiversity represented within RefSoil. To demonstrate the opportunities to access these underrepresented targets, we employed single cell genomics in a pilot experiment to recover 14 genomes from the "most wanted" list, which improved RefSoil's representation of EMP sequences by 7% by abundance. This effort demonstrates the value of RefSoil in the guidance of future research efforts and the capability of single cell genomics as a practical means to fill the existing genomic data gaps.« less

  4. Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation

    NASA Astrophysics Data System (ADS)

    Tobon-Gomez, Catalina; Sukno, Federico M.; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F.

    2012-07-01

    Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be imaging modality and protocol specific. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from MRI simulated datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18% LV MASS errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. Both for real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.

  5. Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation.

    PubMed

    Tobon-Gomez, Catalina; Sukno, Federico M; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F

    2012-07-07

    Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be imaging modality and protocol specific. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from MRI simulated datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18%; LV MASS errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. Both for real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.

  6. The U.S. Geological Survey Monthly Water Balance Model Futures Portal

    USGS Publications Warehouse

    Bock, Andrew R.; Hay, Lauren E.; Markstrom, Steven L.; Emmerich, Christopher; Talbert, Marian

    2017-05-03

    The U.S. Geological Survey Monthly Water Balance Model Futures Portal (https://my.usgs.gov/mows/) is a user-friendly interface that summarizes monthly historical and simulated future conditions for seven hydrologic and meteorological variables (actual evapotranspiration, potential evapotranspiration, precipitation, runoff, snow water equivalent, atmospheric temperature, and streamflow) at locations across the conterminous United States (CONUS).The estimates of these hydrologic and meteorological variables were derived using a Monthly Water Balance Model (MWBM), a modular system that simulates monthly estimates of components of the hydrologic cycle using monthly precipitation and atmospheric temperature inputs. Precipitation and atmospheric temperature from 222 climate datasets spanning historical conditions (1952 through 2005) and simulated future conditions (2020 through 2099) were summarized for hydrographic features and used to drive the MWBM for the CONUS. The MWBM input and output variables were organized into an open-access database. An Open Geospatial Consortium, Inc., Web Feature Service allows the querying and identification of hydrographic features across the CONUS. To connect the Web Feature Service to the open-access database, a user interface—the Monthly Water Balance Model Futures Portal—was developed to allow the dynamic generation of summary files and plots  based on plot type, geographic location, specific climate datasets, period of record, MWBM variable, and other options. Both the plots and the data files are made available to the user for download 

  7. Highly diverse, massive organic data as explored by a composite QSPR strategy: an advanced study of boiling point.

    PubMed

    Ivanova, A A; Ivanov, A A; Oliferenko, A A; Palyulin, V A; Zefirov, N S

    2005-06-01

    An improved strategy of quantitative structure-property relationship (QSPR) studies of diverse and inhomogeneous organic datasets has been proposed. A molecular connectivity term was successively corrected for different structural features encoded in fragmental descriptors. The so-called solvation index 1chis (a weighted Randic index) was used as a "leading" variable and standardized molecular fragments were employed as "corrective" class-specific variables. Performance of the new approach was illustrated by modelling a dataset of experimental normal boiling points of 833 organic compounds belonging to 20 structural classes. Firstly, separate QSPR models were derived for each class and for eight groups of structurally similar classes. Finally, a general model formed by combining all the classes together was derived (r2=0.957, s=12.9degreesC). The strategy outlined can find application in QSPR analyses of massive, highly diverse databases of organic compounds.

  8. Geospatial database of estimates of groundwater discharge to streams in the Upper Colorado River Basin

    USGS Publications Warehouse

    Garcia, Adriana; Masbruch, Melissa D.; Susong, David D.

    2014-01-01

    The U.S. Geological Survey, as part of the Department of the Interior’s WaterSMART (Sustain and Manage America’s Resources for Tomorrow) initiative, compiled published estimates of groundwater discharge to streams in the Upper Colorado River Basin as a geospatial database. For the purpose of this report, groundwater discharge to streams is the baseflow portion of streamflow that includes contributions of groundwater from various flow paths. Reported estimates of groundwater discharge were assigned as attributes to stream reaches derived from the high-resolution National Hydrography Dataset. A total of 235 estimates of groundwater discharge to streams were compiled and included in the dataset. Feature class attributes of the geospatial database include groundwater discharge (acre-feet per year), method of estimation, citation abbreviation, defined reach, and 8-digit hydrologic unit code(s). Baseflow index (BFI) estimates of groundwater discharge were calculated using an existing streamflow characteristics dataset and were included as an attribute in the geospatial database. A comparison of the BFI estimates to the compiled estimates of groundwater discharge found that the BFI estimates were greater than the reported groundwater discharge estimates.

  9. A comparative cellular and molecular biology of longevity database.

    PubMed

    Stuart, Jeffrey A; Liang, Ping; Luo, Xuemei; Page, Melissa M; Gallagher, Emily J; Christoff, Casey A; Robb, Ellen L

    2013-10-01

    Discovering key cellular and molecular traits that promote longevity is a major goal of aging and longevity research. One experimental strategy is to determine which traits have been selected during the evolution of longevity in naturally long-lived animal species. This comparative approach has been applied to lifespan research for nearly four decades, yielding hundreds of datasets describing aspects of cell and molecular biology hypothesized to relate to animal longevity. Here, we introduce a Comparative Cellular and Molecular Biology of Longevity Database, available at ( http://genomics.brocku.ca/ccmbl/ ), as a compendium of comparative cell and molecular data presented in the context of longevity. This open access database will facilitate the meta-analysis of amalgamated datasets using standardized maximum lifespan (MLSP) data (from AnAge). The first edition contains over 800 data records describing experimental measurements of cellular stress resistance, reactive oxygen species metabolism, membrane composition, protein homeostasis, and genome homeostasis as they relate to vertebrate species MLSP. The purpose of this review is to introduce the database and briefly demonstrate its use in the meta-analysis of combined datasets.

  10. U.S. Datasets

    Cancer.gov

    Datasets for U.S. mortality, U.S. populations, standard populations, county attributes, and expected survival. Plus SEER-linked databases (SEER-Medicare, SEER-Medicare Health Outcomes Survey [SEER-MHOS], SEER-Consumer Assessment of Healthcare Providers and Systems [SEER-CAHPS]).

  11. Potential for using regional and global datasets for national scale ecosystem service modelling

    NASA Astrophysics Data System (ADS)

    Maxwell, Deborah; Jackson, Bethanna

    2016-04-01

    Ecosystem service models are increasingly being used by planners and policy makers to inform policy development and decisions about national-level resource management. Such models allow ecosystem services to be mapped and quantified, and subsequent changes to these services to be identified and monitored. In some cases, the impact of small scale changes can be modelled at a national scale, providing more detailed information to decision makers about where to best focus investment and management interventions that could address these issues, while moving toward national goals and/or targets. National scale modelling often uses national (or local) data (for example, soils, landcover and topographical information) as input. However, there are some places where fine resolution and/or high quality national datasets cannot be easily obtained, or do not even exist. In the absence of such detailed information, regional or global datasets could be used as input to such models. There are questions, however, about the usefulness of these coarser resolution datasets and the extent to which inaccuracies in this data may degrade predictions of existing and potential ecosystem service provision and subsequent decision making. Using LUCI (the Land Utilisation and Capability Indicator) as an example predictive model, we examine how the reliability of predictions change when national datasets of soil, landcover and topography are substituted with coarser scale regional and global datasets. We specifically look at how LUCI's predictions of where water services, such as flood risk, flood mitigation, erosion and water quality, change when national data inputs are replaced by regional and global datasets. Using the Conwy catchment, Wales, as a case study, the land cover products compared are the UK's Land Cover Map (2007), the European CORINE land cover map and the ESA global land cover map. Soils products include the National Soil Map of England and Wales (NatMap) and the European Soils Database. NEXTmap elevation data, which covers the UK and parts of continental Europe, are compared to global AsterDEM and SRTM30 topographical products. While the regional and global datasets can be used to fill gaps in data requirements, the coarser resolution of these datasets means that there is greater aggregation of information over larger areas. This loss of detail impacts on the reliability of model output, particularly where significant discrepancies between datasets exist. The implications of this loss of detail in terms of spatial planning and decision making is discussed. Finally, in the context of broader development the need for better nationally and globally available data to allow LUCI and other ecosystem models to become more globally applicable is highlighted.

  12. SEER Linked Databases - SEER Datasets

    Cancer.gov

    SEER-Medicare database of elderly persons with cancer is useful for epidemiologic and health services research. SEER-MHOS has health-related quality of life information about elderly persons with cancer. SEER-CAHPS database has clinical, survey, and health services information on people with cancer.

  13. Onshore industrial wind turbine locations for the United States

    USGS Publications Warehouse

    Diffendorfer, Jay E.; Compton, Roger; Kramer, Louisa; Ancona, Zach; Norton, Donna

    2017-01-01

    This dataset provides industrial-scale onshore wind turbine locations in the United States, corresponding facility information, and turbine technical specifications. The database has wind turbine records that have been collected, digitized, locationally verified, and internally quality controlled. Turbines from the Federal Aviation Administration Digital Obstacles File, through product release date July 22, 2013, were used as the primary source of turbine data points. The dataset was subsequently revised and reposted as described in the revision histories for the report. Verification of the turbine positions was done by visual interpretation using high-resolution aerial imagery in Environmental Systems Research Institute (Esri) ArcGIS Desktop. Turbines without Federal Aviation Administration Obstacles Repository System numbers were visually identified and point locations were added to the collection. We estimated a locational error of plus or minus 10 meters for turbine locations. Wind farm facility names were identified from publicly available facility datasets. Facility names were then used in a Web search of additional industry publications and press releases to attribute additional turbine information (such as manufacturer, model, and technical specifications of wind turbines). Wind farm facility location data from various wind and energy industry sources were used to search for and digitize turbines not in existing databases. Technical specifications for turbines were assigned based on the wind turbine make and model as described in literature, specifications listed in the Federal Aviation Administration Digital Obstacles File, and information on the turbine manufacturer’s Web site. Some facility and turbine information on make and model did not exist or was difficult to obtain. Thus, uncertainty may exist for certain turbine specifications. That uncertainty was rated and a confidence was recorded for both location and attribution data quality.

  14. An improved Pearson's correlation proximity-based hierarchical clustering for mining biological association between genes.

    PubMed

    Booma, P M; Prabhakaran, S; Dhanalakshmi, R

    2014-01-01

    Microarray gene expression datasets has concerned great awareness among molecular biologist, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A search made with heuristic for standard biological process measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor higher rate of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify association between the clusters. Moreover, a GL-PCPHC applies pattern growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for GL-PCPHC model with standard benchmark gene expression datasets extracted from UCI repository and GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality.

  15. An Improved Pearson's Correlation Proximity-Based Hierarchical Clustering for Mining Biological Association between Genes

    PubMed Central

    Booma, P. M.; Prabhakaran, S.; Dhanalakshmi, R.

    2014-01-01

    Microarray gene expression datasets has concerned great awareness among molecular biologist, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A search made with heuristic for standard biological process measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor higher rate of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify association between the clusters. Moreover, a GL-PCPHC applies pattern growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for GL-PCPHC model with standard benchmark gene expression datasets extracted from UCI repository and GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality. PMID:25136661

  16. Spatiotemporal historical datasets at micro-level for geocoded individuals in five Swedish parishes, 1813–1914

    PubMed Central

    Hedefalk, Finn; Svensson, Patrick; Harrie, Lars

    2017-01-01

    This paper presents datasets that enable historical longitudinal studies of micro-level geographic factors in a rural setting. These types of datasets are new, as historical demography studies have generally failed to properly include the micro-level geographic factors. Our datasets describe the geography over five Swedish rural parishes, and by linking them to a longitudinal demographic database, we obtain a geocoded population (at the property unit level) for this area for the period 1813–1914. The population is a subset of the Scanian Economic Demographic Database (SEDD). The geographic information includes the following feature types: property units, wetlands, buildings, roads and railroads. The property units and wetlands are stored in object-lifeline time representations (information about creation, changes and ends of objects are recorded in time), whereas the other feature types are stored as snapshots in time. Thus, the datasets present one of the first opportunities to study historical spatio-temporal patterns at the micro-level. PMID:28398288

  17. Big Data in Organ Transplantation: Registries and Administrative Claims

    PubMed Central

    Massie, Allan B.; Kucirka, Lauren; Segev, Dorry L.

    2015-01-01

    The field of organ transplantation benefits from large, comprehensive, transplant-specific national datasets available to researchers. In addition to the widely-used OPTN-based registries (the UNOS and SRTR datasets) and USRDS datasets, there are other publicly available national datasets, not specific to transplantation, which have historically been underutilized in the field of transplantation. Of particular interest are the Nationwide Inpatient Sample (NIS) and State Inpatient Databases (SID), produced by the Agency for Healthcare Research and Quality (AHRQ). The United States Renal Data System (USRDS) database provides extensive data relevant to studies of kidney transplantation. Linkage of publicly available datasets to external data sources such as private claims or pharmacy data provides further resources for registry-based research. Although these resources can transcend some limitations of OPTN-based registry data, they come with their own limitations, which must be understood to avoid biased inference. This review discusses different registry-based data sources available in the United States, as well as the proper design and conduct of registry-based research. PMID:25040084

  18. A high-resolution European dataset for hydrologic modeling

    NASA Astrophysics Data System (ADS)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large scale hydrological models not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large scale models need to be calibrated and verified against large amounts of observations in order to judge their capabilities to predict the future. However, the creation of large scale datasets is challenging for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim to drive a large scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains calculated radiation, which is calculated by using a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, and evapotranspiration (potential evapotranspiration, bare soil and open water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated throughout the last years. The dataset variables are used as inputs to the hydrological calibration and validation of EFAS as well as for establishing long-term discharge "proxy" climatologies which can then in turn be used for statistical analysis to derive return periods or other time series derivatives. In addition, this dataset will be used to assess climatological trends in Europe. Unfortunately, to date no baseline dataset at the European scale exists to test the quality of the herein presented data. Hence, a comparison against other existing datasets can therefore only be an indication of data quality. Due to availability, a comparison was made for precipitation and temperature only, arguably the most important meteorological drivers for hydrologic models. A variety of analyses was undertaken at country scale against data reported to EUROSTAT and E-OBS datasets. The comparison revealed that while the datasets showed overall similar temporal and spatial patterns, there were some differences in magnitudes especially for precipitation. It is not straightforward to define the specific cause for these differences. However, in most cases the comparatively low observation station density appears to be the principal reason for the differences in magnitude.

  19. Evaluation and comparison of bioinformatic tools for the enrichment analysis of metabolomics data.

    PubMed

    Marco-Ramell, Anna; Palau-Rodriguez, Magali; Alay, Ania; Tulipani, Sara; Urpi-Sarda, Mireia; Sanchez-Pla, Alex; Andres-Lacueva, Cristina

    2018-01-02

    Bioinformatic tools for the enrichment of 'omics' datasets facilitate interpretation and understanding of data. To date few are suitable for metabolomics datasets. The main objective of this work is to give a critical overview, for the first time, of the performance of these tools. To that aim, datasets from metabolomic repositories were selected and enriched data were created. Both types of data were analysed with these tools and outputs were thoroughly examined. An exploratory multivariate analysis of the most used tools for the enrichment of metabolite sets, based on a non-metric multidimensional scaling (NMDS) of Jaccard's distances, was performed and mirrored their diversity. Codes (identifiers) of the metabolites of the datasets were searched in different metabolite databases (HMDB, KEGG, PubChem, ChEBI, BioCyc/HumanCyc, LipidMAPS, ChemSpider, METLIN and Recon2). The databases that presented more identifiers of the metabolites of the dataset were PubChem, followed by METLIN and ChEBI. However, these databases had duplicated entries and might present false positives. The performance of over-representation analysis (ORA) tools, including BioCyc/HumanCyc, ConsensusPathDB, IMPaLA, MBRole, MetaboAnalyst, Metabox, MetExplore, MPEA, PathVisio and Reactome and the mapping tool KEGGREST, was examined. Results were mostly consistent among tools and between real and enriched data despite the variability of the tools. Nevertheless, a few controversial results such as differences in the total number of metabolites were also found. Disease-based enrichment analyses were also assessed, but they were not found to be accurate probably due to the fact that metabolite disease sets are not up-to-date and the difficulty of predicting diseases from a list of metabolites. We have extensively reviewed the state-of-the-art of the available range of tools for metabolomic datasets, the completeness of metabolite databases, the performance of ORA methods and disease-based analyses. Despite the variability of the tools, they provided consistent results independent of their analytic approach. However, more work on the completeness of metabolite and pathway databases is required, which strongly affects the accuracy of enrichment analyses. Improvements will be translated into more accurate and global insights of the metabolome.

  20. Fast and Sensitive Alignment of Microbial Whole Genome Sequencing Reads to Large Sequence Datasets on a Desktop PC: Application to Metagenomic Datasets and Pathogen Identification

    PubMed Central

    2014-01-01

    Next generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community have to balance between speed and sensitivity and as a result, species or strain level identification is often inaccurate and low abundance pathogens can sometimes be missed. We have developed Taxoner, an open source, taxon assignment pipeline that includes a fast aligner (e.g. Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than BLAST, but requires two orders of magnitude less running time meaning that it can be run on desktop or laptop computers. Taxoner is slower than the approaches that use small marker databases but is more sensitive due the comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner. PMID:25077800

  1. SurvExpress: an online biomarker validation tool and database for cancer gene expression data using survival analysis.

    PubMed

    Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G; Treviño, Victor

    2013-01-01

    Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature along datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed in http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R.

  2. SurvExpress: An Online Biomarker Validation Tool and Database for Cancer Gene Expression Data Using Survival Analysis

    PubMed Central

    Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G.; Treviño, Victor

    2013-01-01

    Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature along datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed in http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R. PMID:24066126

  3. Fast and sensitive alignment of microbial whole genome sequencing reads to large sequence datasets on a desktop PC: application to metagenomic datasets and pathogen identification.

    PubMed

    Pongor, Lőrinc S; Vera, Roberto; Ligeti, Balázs

    2014-01-01

    Next generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community have to balance between speed and sensitivity and as a result, species or strain level identification is often inaccurate and low abundance pathogens can sometimes be missed. We have developed Taxoner, an open source, taxon assignment pipeline that includes a fast aligner (e.g. Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than BLAST, but requires two orders of magnitude less running time meaning that it can be run on desktop or laptop computers. Taxoner is slower than the approaches that use small marker databases but is more sensitive due the comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner.

  4. EDDIX--a database of ionisation double differential cross sections.

    PubMed

    MacGibbon, J H; Emerson, S; Liamsuwan, T; Nikjoo, H

    2011-02-01

    The use of Monte Carlo track structure is a choice method in biophysical modelling and calculations. To precisely model 3D and 4D tracks, the cross section for the ionisation by an incoming ion, double differential in the outgoing electron energy and angle, is required. However, the double differential cross section cannot be theoretically modelled over the full range of parameters. To address this issue, a database of all available experimental data has been constructed. Currently, the database of Experimental Double Differential Ionisation Cross sections (EDDIX) contains over 1200 digitalised experimentally measured datasets from the 1960s to present date, covering all available ion species (hydrogen to uranium) and all available target species. Double differential cross sections are also presented with the aid of an eight parameter functions fitted to the cross sections. The parameters include projectile species and charge, target nuclear charge and atomic mass, projectile atomic mass and energy, electron energy and deflection angle. It is planned to freely distribute EDDIX and make it available to the radiation research community for use in the analytical and numerical modelling of track structure.

  5. A Global Geospatial Database of 5000+ Historic Flood Event Extents

    NASA Astrophysics Data System (ADS)

    Tellman, B.; Sullivan, J.; Doyle, C.; Kettner, A.; Brakenridge, G. R.; Erickson, T.; Slayback, D. A.

    2017-12-01

    A key dataset that is missing for global flood model validation and understanding historic spatial flood vulnerability is a global historical geo-database of flood event extents. Decades of earth observing satellites and cloud computing now make it possible to not only detect floods in near real time, but to run these water detection algorithms back in time to capture the spatial extent of large numbers of specific events. This talk will show results from the largest global historical flood database developed to date. We use the Dartmouth Flood Observatory flood catalogue to map over 5000 floods (from 1985-2017) using MODIS, Landsat, and Sentinel-1 Satellites. All events are available for public download via the Earth Engine Catalogue and via a website that allows the user to query floods by area or date, assess population exposure trends over time, and download flood extents in geospatial format.In this talk, we will highlight major trends in global flood exposure per continent, land use type, and eco-region. We will also make suggestions how to use this dataset in conjunction with other global sets to i) validate global flood models, ii) assess the potential role of climatic change in flood exposure iii) understand how urbanization and other land change processes may influence spatial flood exposure iv) assess how innovative flood interventions (e.g. wetland restoration) influence flood patterns v) control for event magnitude to assess the role of social vulnerability and damage assessment vi) aid in rapid probabilistic risk assessment to enable microinsurance markets. Authors on this paper are already using the database for the later three applications and will show examples of wetland intervention analysis in Argentina, social vulnerability analysis in the USA, and micro insurance in India.

  6. The Neotoma Paleoecology Database: An International Community-Curated Resource for Paleoecological and Paleoenvironmental Data

    NASA Astrophysics Data System (ADS)

    Williams, J. W.; Grimm, E. C.; Ashworth, A. C.; Blois, J.; Charles, D. F.; Crawford, S.; Davis, E.; Goring, S. J.; Graham, R. W.; Miller, D. A.; Smith, A. J.; Stryker, M.; Uhen, M. D.

    2017-12-01

    The Neotoma Paleoecology Database supports global change research at the intersection of geology and ecology by providing a high-quality, community-curated data repository for paleoecological data. These data are widely used to study biological responses and feedbacks to past environmental change at local to global scales. The Neotoma data model is flexible and can store multiple kinds of fossil, biogeochemical, or physical variables measured from sedimentary archives. Data additions to Neotoma are growing and include >3.5 million observations, >16,000 datasets, and >8,500 sites. Dataset types include fossil pollen, vertebrates, diatoms, ostracodes, macroinvertebrates, plant macrofossils, insects, testate amoebae, geochronological data, and the recently added organic biomarkers, stable isotopes, and specimen-level data. Neotoma data can be found and retrieved in multiple ways, including the Explorer map-based interface, a RESTful Application Programming Interface, the neotoma R package, and digital object identifiers. Neotoma has partnered with the Paleobiology Database to produce a common data portal for paleobiological data, called the Earth Life Consortium. A new embargo management is designed to allow investigators to put their data into Neotoma and then make use of Neotoma's value-added services. Neotoma's distributed scientific governance model is flexible and scalable, with many open pathways for welcoming new members, data contributors, stewards, and research communities. As the volume and variety of scientific data grow, community-curated data resources such as Neotoma have become foundational infrastructure for big data science.

  7. Addition of a breeding database in the Genome Database for Rosaceae

    PubMed Central

    Evans, Kate; Jung, Sook; Lee, Taein; Brutcher, Lisa; Cho, Ilhyung; Peace, Cameron; Main, Dorrie

    2013-01-01

    Breeding programs produce large datasets that require efficient management systems to keep track of performance, pedigree, geographical and image-based data. With the development of DNA-based screening technologies, more breeding programs perform genotyping in addition to phenotyping for performance evaluation. The integration of breeding data with other genomic and genetic data is instrumental for the refinement of marker-assisted breeding tools, enhances genetic understanding of important crop traits and maximizes access and utility by crop breeders and allied scientists. Development of new infrastructure in the Genome Database for Rosaceae (GDR) was designed and implemented to enable secure and efficient storage, management and analysis of large datasets from the Washington State University apple breeding program and subsequently expanded to fit datasets from other Rosaceae breeders. The infrastructure was built using the software Chado and Drupal, making use of the Natural Diversity module to accommodate large-scale phenotypic and genotypic data. Breeders can search accessions within the GDR to identify individuals with specific trait combinations. Results from Search by Parentage lists individuals with parents in common and results from Individual Variety pages link to all data available on each chosen individual including pedigree, phenotypic and genotypic information. Genotypic data are searchable by markers and alleles; results are linked to other pages in the GDR to enable the user to access tools such as GBrowse and CMap. This breeding database provides users with the opportunity to search datasets in a fully targeted manner and retrieve and compare performance data from multiple selections, years and sites, and to output the data needed for variety release publications and patent applications. The breeding database facilitates efficient program management. Storing publicly available breeding data in a database together with genomic and genetic data will further accelerate the cross-utilization of diverse data types by researchers from various disciplines. Database URL: http://www.rosaceae.org/breeders_toolbox PMID:24247530

  8. Addition of a breeding database in the Genome Database for Rosaceae.

    PubMed

    Evans, Kate; Jung, Sook; Lee, Taein; Brutcher, Lisa; Cho, Ilhyung; Peace, Cameron; Main, Dorrie

    2013-01-01

    Breeding programs produce large datasets that require efficient management systems to keep track of performance, pedigree, geographical and image-based data. With the development of DNA-based screening technologies, more breeding programs perform genotyping in addition to phenotyping for performance evaluation. The integration of breeding data with other genomic and genetic data is instrumental for the refinement of marker-assisted breeding tools, enhances genetic understanding of important crop traits and maximizes access and utility by crop breeders and allied scientists. Development of new infrastructure in the Genome Database for Rosaceae (GDR) was designed and implemented to enable secure and efficient storage, management and analysis of large datasets from the Washington State University apple breeding program and subsequently expanded to fit datasets from other Rosaceae breeders. The infrastructure was built using the software Chado and Drupal, making use of the Natural Diversity module to accommodate large-scale phenotypic and genotypic data. Breeders can search accessions within the GDR to identify individuals with specific trait combinations. Results from Search by Parentage lists individuals with parents in common and results from Individual Variety pages link to all data available on each chosen individual including pedigree, phenotypic and genotypic information. Genotypic data are searchable by markers and alleles; results are linked to other pages in the GDR to enable the user to access tools such as GBrowse and CMap. This breeding database provides users with the opportunity to search datasets in a fully targeted manner and retrieve and compare performance data from multiple selections, years and sites, and to output the data needed for variety release publications and patent applications. The breeding database facilitates efficient program management. Storing publicly available breeding data in a database together with genomic and genetic data will further accelerate the cross-utilization of diverse data types by researchers from various disciplines. Database URL: http://www.rosaceae.org/breeders_toolbox.

  9. National Transportation Atlas Databases : 2014

    DOT National Transportation Integrated Search

    2014-01-01

    The National Transportation Atlas Databases 2014 : (NTAD2014) is a set of nationwide geographic datasets of : transportation facilities, transportation networks, associated : infrastructure, and other political and administrative entities. : These da...

  10. National Transportation Atlas Databases : 2015

    DOT National Transportation Integrated Search

    2015-01-01

    The National Transportation Atlas Databases 2015 : (NTAD2015) is a set of nationwide geographic datasets of : transportation facilities, transportation networks, associated : infrastructure, and other political and administrative entities. : These da...

  11. Development of Web tools to predict axillary lymph node metastasis and pathological response to neoadjuvant chemotherapy in breast cancer patients.

    PubMed

    Sugimoto, Masahiro; Takada, Masahiro; Toi, Masakazu

    2014-12-09

    Nomograms are a standard computational tool to predict the likelihood of an outcome using multiple available patient features. We have developed a more powerful data mining methodology, to predict axillary lymph node (AxLN) metastasis and response to neoadjuvant chemotherapy (NAC) in primary breast cancer patients. We developed websites to use these tools. The tools calculate the probability of AxLN metastasis (AxLN model) and pathological complete response to NAC (NAC model). As a calculation algorithm, we employed a decision tree-based prediction model known as the alternative decision tree (ADTree), which is an analog development of if-then type decision trees. An ensemble technique was used to combine multiple ADTree predictions, resulting in higher generalization abilities and robustness against missing values. The AxLN model was developed with training datasets (n=148) and test datasets (n=143), and validated using an independent cohort (n=174), yielding an area under the receiver operating characteristic curve (AUC) of 0.768. The NAC model was developed and validated with n=150 and n=173 datasets from a randomized controlled trial, yielding an AUC of 0.787. AxLN and NAC models require users to input up to 17 and 16 variables, respectively. These include pathological features, including human epidermal growth factor receptor 2 (HER2) status and imaging findings. Each input variable has an option of "unknown," to facilitate prediction for cases with missing values. The websites developed facilitate the use of these tools, and serve as a database for accumulating new datasets.

  12. Sparse modeling of spatial environmental variables associated with asthma

    PubMed Central

    Chang, Timothy S.; Gangnon, Ronald E.; Page, C. David; Buckingham, William R.; Tandias, Aman; Cowan, Kelly J.; Tomasallo, Carrie D.; Arndt, Brian G.; Hanrahan, Lawrence P.; Guilbert, Theresa W.

    2014-01-01

    Geographically distributed environmental factors influence the burden of diseases such as asthma. Our objective was to identify sparse environmental variables associated with asthma diagnosis gathered from a large electronic health record (EHR) dataset while controlling for spatial variation. An EHR dataset from the University of Wisconsin’s Family Medicine, Internal Medicine and Pediatrics Departments was obtained for 199,220 patients aged 5–50 years over a three-year period. Each patient’s home address was geocoded to one of 3,456 geographic census block groups. Over one thousand block group variables were obtained from a commercial database. We developed a Sparse Spatial Environmental Analysis (SASEA). Using this method, the environmental variables were first dimensionally reduced with sparse principal component analysis. Logistic thin plate regression spline modeling was then used to identify block group variables associated with asthma from sparse principal components. The addresses of patients from the EHR dataset were distributed throughout the majority of Wisconsin’s geography. Logistic thin plate regression spline modeling captured spatial variation of asthma. Four sparse principal components identified via model selection consisted of food at home, dog ownership, household size, and disposable income variables. In rural areas, dog ownership and renter occupied housing units from significant sparse principal components were associated with asthma. Our main contribution is the incorporation of sparsity in spatial modeling. SASEA sequentially added sparse principal components to Logistic thin plate regression spline modeling. This method allowed association of geographically distributed environmental factors with asthma using EHR and environmental datasets. SASEA can be applied to other diseases with environmental risk factors. PMID:25533437

  13. Sparse modeling of spatial environmental variables associated with asthma.

    PubMed

    Chang, Timothy S; Gangnon, Ronald E; David Page, C; Buckingham, William R; Tandias, Aman; Cowan, Kelly J; Tomasallo, Carrie D; Arndt, Brian G; Hanrahan, Lawrence P; Guilbert, Theresa W

    2015-02-01

    Geographically distributed environmental factors influence the burden of diseases such as asthma. Our objective was to identify sparse environmental variables associated with asthma diagnosis gathered from a large electronic health record (EHR) dataset while controlling for spatial variation. An EHR dataset from the University of Wisconsin's Family Medicine, Internal Medicine and Pediatrics Departments was obtained for 199,220 patients aged 5-50years over a three-year period. Each patient's home address was geocoded to one of 3456 geographic census block groups. Over one thousand block group variables were obtained from a commercial database. We developed a Sparse Spatial Environmental Analysis (SASEA). Using this method, the environmental variables were first dimensionally reduced with sparse principal component analysis. Logistic thin plate regression spline modeling was then used to identify block group variables associated with asthma from sparse principal components. The addresses of patients from the EHR dataset were distributed throughout the majority of Wisconsin's geography. Logistic thin plate regression spline modeling captured spatial variation of asthma. Four sparse principal components identified via model selection consisted of food at home, dog ownership, household size, and disposable income variables. In rural areas, dog ownership and renter occupied housing units from significant sparse principal components were associated with asthma. Our main contribution is the incorporation of sparsity in spatial modeling. SASEA sequentially added sparse principal components to Logistic thin plate regression spline modeling. This method allowed association of geographically distributed environmental factors with asthma using EHR and environmental datasets. SASEA can be applied to other diseases with environmental risk factors. Copyright © 2014 Elsevier Inc. All rights reserved.

  14. A Semantic Transformation Methodology for the Secondary Use of Observational Healthcare Data in Postmarketing Safety Studies.

    PubMed

    Pacaci, Anil; Gonul, Suat; Sinaci, A Anil; Yuksel, Mustafa; Laleci Erturkmen, Gokce B

    2018-01-01

    Background: Utilization of the available observational healthcare datasets is key to complement and strengthen the postmarketing safety studies. Use of common data models (CDM) is the predominant approach in order to enable large scale systematic analyses on disparate data models and vocabularies. Current CDM transformation practices depend on proprietarily developed Extract-Transform-Load (ETL) procedures, which require knowledge both on the semantics and technical characteristics of the source datasets and target CDM. Purpose: In this study, our aim is to develop a modular but coordinated transformation approach in order to separate semantic and technical steps of transformation processes, which do not have a strict separation in traditional ETL approaches. Such an approach would discretize the operations to extract data from source electronic health record systems, alignment of the source, and target models on the semantic level and the operations to populate target common data repositories. Approach: In order to separate the activities that are required to transform heterogeneous data sources to a target CDM, we introduce a semantic transformation approach composed of three steps: (1) transformation of source datasets to Resource Description Framework (RDF) format, (2) application of semantic conversion rules to get the data as instances of ontological model of the target CDM, and (3) population of repositories, which comply with the specifications of the CDM, by processing the RDF instances from step 2. The proposed approach has been implemented on real healthcare settings where Observational Medical Outcomes Partnership (OMOP) CDM has been chosen as the common data model and a comprehensive comparative analysis between the native and transformed data has been conducted. Results: Health records of ~1 million patients have been successfully transformed to an OMOP CDM based database from the source database. Descriptive statistics obtained from the source and target databases present analogous and consistent results. Discussion and Conclusion: Our method goes beyond the traditional ETL approaches by being more declarative and rigorous. Declarative because the use of RDF based mapping rules makes each mapping more transparent and understandable to humans while retaining logic-based computability. Rigorous because the mappings would be based on computer readable semantics which are amenable to validation through logic-based inference methods.

  15. The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping.

    PubMed

    Ashish, Naveen; Dewan, Peehoo; Toga, Arthur W

    2015-01-01

    This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal.

  16. The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping

    PubMed Central

    Ashish, Naveen; Dewan, Peehoo; Toga, Arthur W.

    2016-01-01

    This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN1. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal. PMID:26793094

  17. Assessment of imputation methods using varying ecological information to fill the gaps in a tree functional trait database

    NASA Astrophysics Data System (ADS)

    Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi

    2016-04-01

    Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is an inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have been recently tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31900 km2). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation with Chained Equations (MICE) at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix and in selected bivariate trait relationships. MICE yielded imputations which better preserved the variability and covariance structure of the data and provided an estimate of between-imputation uncertainty. We found that adding species identity as a predictor in MICE and kNN improved imputation for all traits, but adding climate did not lead to any appreciable improvement. However, forest structure and spatial structure did reduce imputation errors in maximum height and in leaf biomass to sapwood area ratios, respectively. Although species mean imputations showed the lowest error for 3 out the 5 studied traits, dataset-averaged errors were lowest for MICE imputations with all additional predictors, when missing data levels were 50% or lower. Species mean imputations always resulted in larger errors in the correlation matrix and appreciably altered the studied bivariate trait relationships. In conclusion, MICE imputations using species identity, climate, forest structure and spatial structure as predictors emerged as the most suitable method of the ones tested here, but it was also evident that imputation performance deteriorates at high levels of missing data (80%).

  18. Release of a 10-m-resolution DEM for the whole Italian territory: a new, freely available resource for research purposes

    NASA Astrophysics Data System (ADS)

    Tarquini, S.; Nannipieri, L.; Favalli, M.; Fornaciai, A.; Vinci, S.; Doumaz, F.

    2012-04-01

    Digital elevation models (DEMs) are fundamental in any kind of environmental or morphological study. DEMs are obtained from a variety of sources and generated in several ways. Nowadays, a few global-coverage elevation datasets are available for free (e.g., SRTM, http://www.jpl.nasa.gov/srtm; ASTER, http://asterweb.jpl.nasa.gov/). When the matrix of a DEM is used also for computational purposes, the choice of the elevation dataset which better suits the target of the study is crucial. Recently, the increasing use of DEM-based numerical simulation tools (e.g. for gravity driven mass flows), would largely benefit from the use of a higher resolution/higher accuracy topography than those available at planetary scale. Similar elevation datasets are neither easily nor freely available for all countries worldwide. Here we introduce a new web resource which made available for free (for research purposes only) a 10 m-resolution DEM for the whole Italian territory. The creation of this elevation dataset was presented by Tarquini et al. (2007). This DEM was obtained in triangular irregular network (TIN) format starting from heterogeneous vector datasets, mostly consisting in elevation contour lines and elevation points derived from several sources. The input vector database was carefully cleaned up to obtain an improved seamless TIN refined by using the DEST algorithm, thus improving the Delaunay tessellation. The whole TINITALY/01 DEM was converted in grid format (10-m cell size) according to a tiled structure composed of 193, 50-km side square elements. The grid database consists of more than 3 billions of cells and occupies almost 12 GB of disk memory. A web-GIS has been created (http://tinitaly.pi.ingv.it/ ) where a seamless layer of images in full resolution (10 m) obtained from the whole DEM (both in color-shaded and anaglyph mode) is open for browsing. Accredited navigators are allowed to download the elevation dataset.

  19. A geo-spatial data management system for potentially active volcanoes—GEOWARN project

    NASA Astrophysics Data System (ADS)

    Gogu, Radu C.; Dietrich, Volker J.; Jenny, Bernhard; Schwandner, Florian M.; Hurni, Lorenz

    2006-02-01

    Integrated studies of active volcanic systems for the purpose of long-term monitoring and forecast and short-term eruption prediction require large numbers of data-sets from various disciplines. A modern database concept has been developed for managing and analyzing multi-disciplinary volcanological data-sets. The GEOWARN project (choosing the "Kos-Yali-Nisyros-Tilos volcanic field, Greece" and the "Campi Flegrei, Italy" as test sites) is oriented toward potentially active volcanoes situated in regions of high geodynamic unrest. This article describes the volcanological database of the spatial and temporal data acquired within the GEOWARN project. As a first step, a spatial database embedded in a Geographic Information System (GIS) environment was created. Digital data of different spatial resolution, and time-series data collected at different intervals or periods, were unified in a common, four-dimensional representation of space and time. The database scheme comprises various information layers containing geographic data (e.g. seafloor and land digital elevation model, satellite imagery, anthropogenic structures, land-use), geophysical data (e.g. from active and passive seismicity, gravity, tomography, SAR interferometry, thermal imagery, differential GPS), geological data (e.g. lithology, structural geology, oceanography), and geochemical data (e.g. from hydrothermal fluid chemistry and diffuse degassing features). As a second step based on the presented database, spatial data analysis has been performed using custom-programmed interfaces that execute query scripts resulting in a graphical visualization of data. These query tools were designed and compiled following scenarios of known "behavior" patterns of dormant volcanoes and first candidate signs of potential unrest. The spatial database and query approach is intended to facilitate scientific research on volcanic processes and phenomena, and volcanic surveillance.

  20. Patterns, biases and prospects in the distribution and diversity of Neotropical snakes

    PubMed Central

    Sawaya, Ricardo J.; Zizka, Alexander; Laffan, Shawn; Faurby, Søren; Pyron, R. Alexander; Bérnils, Renato S.; Jansen, Martin; Passos, Paulo; Prudente, Ana L. C.; Cisneros‐Heredia, Diego F.; Braz, Henrique B.; Nogueira, Cristiano de C.; Antonelli, Alexandre; Meiri, Shai

    2017-01-01

    Abstract Motivation We generated a novel database of Neotropical snakes (one of the world's richest herpetofauna) combining the most comprehensive, manually compiled distribution dataset with publicly available data. We assess, for the first time, the diversity patterns for all Neotropical snakes as well as sampling density and sampling biases. Main types of variables contained We compiled three databases of species occurrences: a dataset downloaded from the Global Biodiversity Information Facility (GBIF), a verified dataset built through taxonomic work and specialized literature, and a combined dataset comprising a cleaned version of the GBIF dataset merged with the verified dataset. Spatial location and grain Neotropics, Behrmann projection equivalent to 1° × 1°. Time period Specimens housed in museums during the last 150 years. Major taxa studied Squamata: Serpentes. Software format Geographical information system (GIS). Results The combined dataset provides the most comprehensive distribution database for Neotropical snakes to date. It contains 147,515 records for 886 species across 12 families, representing 74% of all species of snakes, spanning 27 countries in the Americas. Species richness and phylogenetic diversity show overall similar patterns. Amazonia is the least sampled Neotropical region, whereas most well‐sampled sites are located near large universities and scientific collections. We provide a list and updated maps of geographical distribution of all snake species surveyed. Main conclusions The biodiversity metrics of Neotropical snakes reflect patterns previously documented for other vertebrates, suggesting that similar factors may determine the diversity of both ectothermic and endothermic animals. We suggest conservation strategies for high‐diversity areas and sampling efforts be directed towards Amazonia and poorly known species. PMID:29398972

  1. Anatomy and evolution of database search engines-a central component of mass spectrometry based proteomic workflows.

    PubMed

    Verheggen, Kenneth; Raeder, Helge; Berven, Frode S; Martens, Lennart; Barsnes, Harald; Vaudel, Marc

    2017-09-13

    Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines. © 2017 Wiley Periodicals, Inc.

  2. Continuous-time interval model identification of blood glucose dynamics for type 1 diabetes

    NASA Astrophysics Data System (ADS)

    Kirchsteiger, Harald; Johansson, Rolf; Renard, Eric; del Re, Luigi

    2014-07-01

    While good physiological models of the glucose metabolism in type 1 diabetic patients are well known, their parameterisation is difficult. The high intra-patient variability observed is a further major obstacle. This holds for data-based models too, so that no good patient-specific models are available. Against this background, this paper proposes the use of interval models to cover the different metabolic conditions. The control-oriented models contain a carbohydrate and insulin sensitivity factor to be used for insulin bolus calculators directly. Available clinical measurements were sampled on an irregular schedule which prompts the use of continuous-time identification, also for the direct estimation of the clinically interpretable factors mentioned above. An identification method is derived and applied to real data from 28 diabetic patients. Model estimation was done on a clinical data-set, whereas validation results shown were done on an out-of-clinic, everyday life data-set. The results show that the interval model approach allows a much more regular estimation of the parameters and avoids physiologically incompatible parameter estimates.

  3. Diviner lunar radiometer gridded brightness temperatures from geodesic binning of modeled fields of view

    NASA Astrophysics Data System (ADS)

    Sefton-Nash, E.; Williams, J.-P.; Greenhagen, B. T.; Aye, K.-M.; Paige, D. A.

    2017-12-01

    An approach is presented to efficiently produce high quality gridded data records from the large, global point-based dataset returned by the Diviner Lunar Radiometer Experiment aboard NASA's Lunar Reconnaissance Orbiter. The need to minimize data volume and processing time in production of science-ready map products is increasingly important with the growth in data volume of planetary datasets. Diviner makes on average >1400 observations per second of radiance that is reflected and emitted from the lunar surface, using 189 detectors divided into 9 spectral channels. Data management and processing bottlenecks are amplified by modeling every observation as a probability distribution function over the field of view, which can increase the required processing time by 2-3 orders of magnitude. Geometric corrections, such as projection of data points onto a digital elevation model, are numerically intensive and therefore it is desirable to perform them only once. Our approach reduces bottlenecks through parallel binning and efficient storage of a pre-processed database of observations. Database construction is via subdivision of a geodesic icosahedral grid, with a spatial resolution that can be tailored to suit the field of view of the observing instrument. Global geodesic grids with high spatial resolution are normally impractically memory intensive. We therefore demonstrate a minimum storage and highly parallel method to bin very large numbers of data points onto such a grid. A database of the pre-processed and binned points is then used for production of mapped data products that is significantly faster than if unprocessed points were used. We explore quality controls in the production of gridded data records by conditional interpolation, allowed only where data density is sufficient. The resultant effects on the spatial continuity and uncertainty in maps of lunar brightness temperatures is illustrated. We identify four binning regimes based on trades between the spatial resolution of the grid, the size of the FOV and the on-target spacing of observations. Our approach may be applicable and beneficial for many existing and future point-based planetary datasets.

  4. Predicting discharge mortality after acute ischemic stroke using balanced data.

    PubMed

    Ho, King Chung; Speier, William; El-Saden, Suzie; Liebeskind, David S; Saver, Jeffery L; Bui, Alex A T; Arnold, Corey W

    2014-01-01

    Several models have been developed to predict stroke outcomes (e.g., stroke mortality, patient dependence, etc.) in recent decades. However, there is little discussion regarding the problem of between-class imbalance in stroke datasets, which leads to prediction bias and decreased performance. In this paper, we demonstrate the use of the Synthetic Minority Over-sampling Technique to overcome such problems. We also compare state of the art machine learning methods and construct a six-variable support vector machine (SVM) model to predict stroke mortality at discharge. Finally, we discuss how the identification of a reduced feature set allowed us to identify additional cases in our research database for validation testing. Our classifier achieved a c-statistic of 0.865 on the cross-validated dataset, demonstrating good classification performance using a reduced set of variables.

  5. Slab2 - Updated Subduction Zone Geometries and Modeling Tools

    NASA Astrophysics Data System (ADS)

    Moore, G.; Hayes, G. P.; Portner, D. E.; Furtney, M.; Flamme, H. E.; Hearne, M. G.

    2017-12-01

    The U.S. Geological Survey database of global subduction zone geometries (Slab1.0), is a highly utilized dataset that has been applied to a wide range of geophysical problems. In 2017, these models have been improved and expanded upon as part of the Slab2 modeling effort. With a new data driven approach that can be applied to a broader range of tectonic settings and geophysical data sets, we have generated a model set that will serve as a more comprehensive, reliable, and reproducible resource for three-dimensional slab geometries at all of the world's convergent margins. The newly developed framework of Slab2 is guided by: (1) a large integrated dataset, consisting of a variety of geophysical sources (e.g., earthquake hypocenters, moment tensors, active-source seismic survey images of the shallow slab, tomography models, receiver functions, bathymetry, trench ages, and sediment thickness information); (2) a dynamic filtering scheme aimed at constraining incorporated seismicity to only slab related events; (3) a 3-D data interpolation approach which captures both high resolution shallow geometries and instances of slab rollback and overlap at depth; and (4) an algorithm which incorporates uncertainties of contributing datasets to identify the most probable surface depth over the extent of each subduction zone. Further layers will also be added to the base geometry dataset, such as historic moment release, earthquake tectonic providence, and interface coupling. Along with access to several queryable data formats, all components have been wrapped into an open source library in Python, such that suites of updated models can be released as further data becomes available. This presentation will discuss the extent of Slab2 development, as well as the current availability of the model and modeling tools.

  6. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Heredia-Langner, Alejandro; Amidan, Brett G.; Matzner, Shari

    We present results from the optimization of a re-identification process using two sets of biometric data obtained from the Civilian American and European Surface Anthropometry Resource Project (CAESAR) database. The datasets contain real measurements of features for 2378 individuals in a standing (43 features) and seated (16 features) position. A genetic algorithm (GA) was used to search a large combinatorial space where different features are available between the probe (seated) and gallery (standing) datasets. Results show that optimized model predictions obtained using less than half of the 43 gallery features and data from roughly 16% of the individuals available producemore » better re-identification rates than two other approaches that use all the information available.« less

  7. TranslatomeDB: a comprehensive database and cloud-based analysis platform for translatome sequencing data

    PubMed Central

    Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie

    2018-01-01

    Abstract Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. PMID:29106630

  8. Introduction to the DISRUPT postprandial database: subjects, studies and methodologies.

    PubMed

    Jackson, Kim G; Clarke, Dave T; Murray, Peter; Lovegrove, Julie A; O'Malley, Brendan; Minihane, Anne M; Williams, Christine M

    2010-03-01

    Dysregulation of lipid and glucose metabolism in the postprandial state are recognised as important risk factors for the development of cardiovascular disease and type 2 diabetes. Our objective was to create a comprehensive, standardised database of postprandial studies to provide insights into the physiological factors that influence postprandial lipid and glucose responses. Data were collated from subjects (n = 467) taking part in single and sequential meal postprandial studies conducted by researchers at the University of Reading, to form the DISRUPT (DIetary Studies: Reading Unilever Postprandial Trials) database. Subject attributes including age, gender, genotype, menopausal status, body mass index, blood pressure and a fasting biochemical profile, together with postprandial measurements of triacylglycerol (TAG), non-esterified fatty acids, glucose, insulin and TAG-rich lipoprotein composition are recorded. A particular strength of the studies is the frequency of blood sampling, with on average 10-13 blood samples taken during each postprandial assessment, and the fact that identical test meal protocols were used in a number of studies, allowing pooling of data to increase statistical power. The DISRUPT database is the most comprehensive postprandial metabolism database that exists worldwide and preliminary analysis of the pooled sequential meal postprandial dataset has revealed both confirmatory and novel observations with respect to the impact of gender and age on the postprandial TAG response. Further analysis of the dataset using conventional statistical techniques along with integrated mathematical models and clustering analysis will provide a unique opportunity to greatly expand current knowledge of the aetiology of inter-individual variability in postprandial lipid and glucose responses.

  9. Database Organisation in a Web-Enabled Free and Open-Source Software (foss) Environment for Spatio-Temporal Landslide Modelling

    NASA Astrophysics Data System (ADS)

    Das, I.; Oberai, K.; Sarathi Roy, P.

    2012-07-01

    Landslides exhibit themselves in different mass movement processes and are considered among the most complex natural hazards occurring on the earth surface. Making landslide database available online via WWW (World Wide Web) promotes the spreading and reaching out of the landslide information to all the stakeholders. The aim of this research is to present a comprehensive database for generating landslide hazard scenario with the help of available historic records of landslides and geo-environmental factors and make them available over the Web using geospatial Free & Open Source Software (FOSS). FOSS reduces the cost of the project drastically as proprietary software's are very costly. Landslide data generated for the period 1982 to 2009 were compiled along the national highway road corridor in Indian Himalayas. All the geo-environmental datasets along with the landslide susceptibility map were served through WEBGIS client interface. Open source University of Minnesota (UMN) mapserver was used as GIS server software for developing web enabled landslide geospatial database. PHP/Mapscript server-side application serve as a front-end application and PostgreSQL with PostGIS extension serve as a backend application for the web enabled landslide spatio-temporal databases. This dynamic virtual visualization process through a web platform brings an insight into the understanding of the landslides and the resulting damage closer to the affected people and user community. The landslide susceptibility dataset is also made available as an Open Geospatial Consortium (OGC) Web Feature Service (WFS) which can be accessed through any OGC compliant open source or proprietary GIS Software.

  10. Development of an Open Global Oil and Gas Infrastructure Inventory and Geodatabase

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rose, Kelly

    This submission contains a technical report describing the development process and visual graphics for the Global Oil and Gas Infrastructure database. Access the GOGI database using the following link: https://edx.netl.doe.gov/dataset/global-oil-gas-features-database

  11. D Partition-Based Clustering for Supply Chain Data Management

    NASA Astrophysics Data System (ADS)

    Suhaibah, A.; Uznir, U.; Anton, F.; Mioc, D.; Rahman, A. A.

    2015-10-01

    Supply Chain Management (SCM) is the management of the products and goods flow from its origin point to point of consumption. During the process of SCM, information and dataset gathered for this application is massive and complex. This is due to its several processes such as procurement, product development and commercialization, physical distribution, outsourcing and partnerships. For a practical application, SCM datasets need to be managed and maintained to serve a better service to its three main categories; distributor, customer and supplier. To manage these datasets, a structure of data constellation is used to accommodate the data into the spatial database. However, the situation in geospatial database creates few problems, for example the performance of the database deteriorate especially during the query operation. We strongly believe that a more practical hierarchical tree structure is required for efficient process of SCM. Besides that, three-dimensional approach is required for the management of SCM datasets since it involve with the multi-level location such as shop lots and residential apartments. 3D R-Tree has been increasingly used for 3D geospatial database management due to its simplicity and extendibility. However, it suffers from serious overlaps between nodes. In this paper, we proposed a partition-based clustering for the construction of a hierarchical tree structure. Several datasets are tested using the proposed method and the percentage of the overlapping nodes and volume coverage are computed and compared with the original 3D R-Tree and other practical approaches. The experiments demonstrated in this paper substantiated that the hierarchical structure of the proposed partitionbased clustering is capable of preserving minimal overlap and coverage. The query performance was tested using 300,000 points of a SCM dataset and the results are presented in this paper. This paper also discusses the outlook of the structure for future reference.

  12. Prediction of 222Rn in Danish dwellings using geology and house construction information from central databases.

    PubMed

    Andersen, Claus E; Raaschou-Nielsen, Ole; Andersen, Helle Primdal; Lind, Morten; Gravesen, Peter; Thomsen, Birthe L; Ulbak, Kaare

    2007-01-01

    A linear regression model has been developed for the prediction of indoor (222)Rn in Danish houses. The model provides proxy radon concentrations for about 21,000 houses in a Danish case-control study on the possible association between residential radon and childhood cancer (primarily leukaemia). The model was calibrated against radon measurements in 3116 houses. An independent dataset with 788 house measurements was used for model performance assessment. The model includes nine explanatory variables, of which the most important ones are house type and geology. All explanatory variables are available from central databases. The model was fitted to log-transformed radon concentrations and it has an R(2) of 40%. The uncertainty associated with individual predictions of (untransformed) radon concentrations is about a factor of 2.0 (one standard deviation). The comparison with the independent test data shows that the model makes sound predictions and that errors of radon predictions are only weakly correlated with the estimates themselves (R(2) = 10%).

  13. Querying clinical data in HL7 RIM based relational model with morph-RDB.

    PubMed

    Priyatna, Freddy; Alonso-Calvo, Raul; Paraiso-Medina, Sergio; Corcho, Oscar

    2017-10-05

    Semantic interoperability is essential when carrying out post-genomic clinical trials where several institutions collaborate, since researchers and developers need to have an integrated view and access to heterogeneous data sources. One possible approach to accommodate this need is to use RDB2RDF systems that provide RDF datasets as the unified view. These RDF datasets may be materialized and stored in a triple store, or transformed into RDF in real time, as virtual RDF data sources. Our previous efforts involved materialized RDF datasets, hence losing data freshness. In this paper we present a solution that uses an ontology based on the HL7 v3 Reference Information Model and a set of R2RML mappings that relate this ontology to an underlying relational database implementation, and where morph-RDB is used to expose a virtual, non-materialized SPARQL endpoint over the data. By applying a set of optimization techniques on the SPARQL-to-SQL query translation algorithm, we can now issue SPARQL queries to the underlying relational data with generally acceptable performance.

  14. Plant Reactome: a resource for plant pathways and comparative analysis

    PubMed Central

    Naithani, Sushma; Preece, Justin; D'Eustachio, Peter; Gupta, Parul; Amarasinghe, Vindhya; Dharmawardhana, Palitha D.; Wu, Guanming; Fabregat, Antonio; Elser, Justin L.; Weiser, Joel; Keays, Maria; Fuentes, Alfonso Munoz-Pomer; Petryszak, Robert; Stein, Lincoln D.; Ware, Doreen; Jaiswal, Pankaj

    2017-01-01

    Plant Reactome (http://plantreactome.gramene.org/) is a free, open-source, curated plant pathway database portal, provided as part of the Gramene project. The database provides intuitive bioinformatics tools for the visualization, analysis and interpretation of pathway knowledge to support genome annotation, genome analysis, modeling, systems biology, basic research and education. Plant Reactome employs the structural framework of a plant cell to show metabolic, transport, genetic, developmental and signaling pathways. We manually curate molecular details of pathways in these domains for reference species Oryza sativa (rice) supported by published literature and annotation of well-characterized genes. Two hundred twenty-two rice pathways, 1025 reactions associated with 1173 proteins, 907 small molecules and 256 literature references have been curated to date. These reference annotations were used to project pathways for 62 model, crop and evolutionarily significant plant species based on gene homology. Database users can search and browse various components of the database, visualize curated baseline expression of pathway-associated genes provided by the Expression Atlas and upload and analyze their Omics datasets. The database also offers data access via Application Programming Interfaces (APIs) and in various standardized pathway formats, such as SBML and BioPAX. PMID:27799469

  15. The TERRA-PNW Dataset: A New Source for Standardized Plant Trait, Forest Carbon Cycling, and Soil Properties Measurements from the Pacific Northwest US, 2000-2014.

    NASA Astrophysics Data System (ADS)

    Berner, L. T.; Law, B. E.

    2015-12-01

    Plant traits include physiological, morphological, and biogeochemical characteristics that in combination determine a species sensitivity to environmental conditions. Standardized, co-located, and geo-referenced species- and plot-level measurements are needed to address variation in species sensitivity to climate change impacts and for ecosystem process model development, parameterization and testing. We present a new database of plant trait, forest carbon cycling, and soil property measurements derived from multiple TERRA-PNW projects in the Pacific Northwest US, spanning 2000-2014. The database includes measurements from over 200 forest plots across Oregon and northern California, where the data were explicitly collected for scaling and modeling regional terrestrial carbon processes with models such as Biome-BGC and the Community Land Model. Some of the data are co-located at AmeriFlux sites in the region. The database currently contains leaf trait measurements (specific leaf area, leaf longevity, leaf carbon and nitrogen) from over 1,200 branch samples and 30 species, as well as plot-level biomass and productivity components, and soil carbon and nitrogen. Standardized protocols were used across projects, as summarized in an FAO protocols document. The database continues to expand and will include agricultural crops. The database will be hosted by the Oak Ridge National Laboratory (ORLN) Distributed Active Archive Center (DAAC). We hope that other regional databases will become publicly available to help enable Earth System Modeling to simulate species-level sensitivity to climate at regional to global scales.

  16. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data

    PubMed Central

    Köhler, Sebastian; Doelken, Sandra C.; Mungall, Christopher J.; Bauer, Sebastian; Firth, Helen V.; Bailleul-Forestier, Isabelle; Black, Graeme C. M.; Brown, Danielle L.; Brudno, Michael; Campbell, Jennifer; FitzPatrick, David R.; Eppig, Janan T.; Jackson, Andrew P.; Freson, Kathleen; Girdea, Marta; Helbig, Ingo; Hurst, Jane A.; Jähn, Johanna; Jackson, Laird G.; Kelly, Anne M.; Ledbetter, David H.; Mansour, Sahar; Martin, Christa L.; Moss, Celia; Mumford, Andrew; Ouwehand, Willem H.; Park, Soo-Mi; Riggs, Erin Rooney; Scott, Richard H.; Sisodiya, Sanjay; Vooren, Steven Van; Wapner, Ronald J.; Wilkie, Andrew O. M.; Wright, Caroline F.; Vulto-van Silfhout, Anneke T.; de Leeuw, Nicole; de Vries, Bert B. A.; Washingthon, Nicole L.; Smith, Cynthia L.; Westerfield, Monte; Schofield, Paul; Ruef, Barbara J.; Gkoutos, Georgios V.; Haendel, Melissa; Smedley, Damian; Lewis, Suzanna E.; Robinson, Peter N.

    2014-01-01

    The Human Phenotype Ontology (HPO) project, available at http://www.human-phenotype-ontology.org, provides a structured, comprehensive and well-defined set of 10,088 classes (terms) describing human phenotypic abnormalities and 13,326 subclass relations between the HPO classes. In addition we have developed logical definitions for 46% of all HPO classes using terms from ontologies for anatomy, cell types, function, embryology, pathology and other domains. This allows interoperability with several resources, especially those containing phenotype information on model organisms such as mouse and zebrafish. Here we describe the updated HPO database, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO. Various meta-attributes such as frequency, references and negations are associated with each annotation. Several large-scale projects worldwide utilize the HPO for describing phenotype information in their datasets. We have therefore generated equivalence mappings to other phenotype vocabularies such as LDDB, Orphanet, MedDRA, UMLS and phenoDB, allowing integration of existing datasets and interoperability with multiple biomedical resources. We have created various ways to access the HPO database content using flat files, a MySQL database, and Web-based tools. All data and documentation on the HPO project can be found online. PMID:24217912

  17. Solar forcing for CMIP6 (v3.2)

    NASA Astrophysics Data System (ADS)

    Matthes, Katja; Funke, Bernd; Andersson, Monika E.; Barnard, Luke; Beer, Jürg; Charbonneau, Paul; Clilverd, Mark A.; Dudok de Wit, Thierry; Haberreiter, Margit; Hendry, Aaron; Jackman, Charles H.; Kretzschmar, Matthieu; Kruschke, Tim; Kunze, Markus; Langematz, Ulrike; Marsh, Daniel R.; Maycock, Amanda C.; Misios, Stergios; Rodger, Craig J.; Scaife, Adam A.; Seppälä, Annika; Shangguan, Ming; Sinnhuber, Miriam; Tourpali, Kleareti; Usoskin, Ilya; van de Kamp, Max; Verronen, Pekka T.; Versick, Stefan

    2017-06-01

    This paper describes the recommended solar forcing dataset for CMIP6 and highlights changes with respect to CMIP5. The solar forcing is provided for radiative properties, namely total solar irradiance (TSI), solar spectral irradiance (SSI), and the F10.7 index as well as particle forcing, including geomagnetic indices Ap and Kp, and ionization rates to account for effects of solar protons, electrons, and galactic cosmic rays. This is the first time that a recommendation for solar-driven particle forcing has been provided for a CMIP exercise. The solar forcing datasets are provided at daily and monthly resolution separately for the CMIP6 preindustrial control, historical (1850-2014), and future (2015-2300) simulations. For the preindustrial control simulation, both constant and time-varying solar forcing components are provided, with the latter including variability on 11-year and shorter timescales but no long-term changes. For the future, we provide a realistic scenario of what solar behavior could be, as well as an additional extreme Maunder-minimum-like sensitivity scenario. This paper describes the forcing datasets and also provides detailed recommendations as to their implementation in current climate models.For the historical simulations, the TSI and SSI time series are defined as the average of two solar irradiance models that are adapted to CMIP6 needs: an empirical one (NRLTSI2-NRLSSI2) and a semi-empirical one (SATIRE). A new and lower TSI value is recommended: the contemporary solar-cycle average is now 1361.0 W m-2. The slight negative trend in TSI over the three most recent solar cycles in the CMIP6 dataset leads to only a small global radiative forcing of -0.04 W m-2. In the 200-400 nm wavelength range, which is important for ozone photochemistry, the CMIP6 solar forcing dataset shows a larger solar-cycle variability contribution to TSI than in CMIP5 (50 % compared to 35 %).We compare the climatic effects of the CMIP6 solar forcing dataset to its CMIP5 predecessor by using time-slice experiments of two chemistry-climate models and a reference radiative transfer model. The differences in the long-term mean SSI in the CMIP6 dataset, compared to CMIP5, impact on climatological stratospheric conditions (lower shortwave heating rates of -0.35 K day-1 at the stratopause), cooler stratospheric temperatures (-1.5 K in the upper stratosphere), lower ozone abundances in the lower stratosphere (-3 %), and higher ozone abundances (+1.5 % in the upper stratosphere and lower mesosphere). Between the maximum and minimum phases of the 11-year solar cycle, there is an increase in shortwave heating rates (+0.2 K day-1 at the stratopause), temperatures ( ˜ 1 K at the stratopause), and ozone (+2.5 % in the upper stratosphere) in the tropical upper stratosphere using the CMIP6 forcing dataset. This solar-cycle response is slightly larger, but not statistically significantly different from that for the CMIP5 forcing dataset.CMIP6 models with a well-resolved shortwave radiation scheme are encouraged to prescribe SSI changes and include solar-induced stratospheric ozone variations, in order to better represent solar climate variability compared to models that only prescribe TSI and/or exclude the solar-ozone response. We show that monthly-mean solar-induced ozone variations are implicitly included in the SPARC/CCMI CMIP6 Ozone Database for historical simulations, which is derived from transient chemistry-climate model simulations and has been developed for climate models that do not calculate ozone interactively. CMIP6 models without chemistry that perform a preindustrial control simulation with time-varying solar forcing will need to use a modified version of the SPARC/CCMI Ozone Database that includes solar variability. CMIP6 models with interactive chemistry are also encouraged to use the particle forcing datasets, which will allow the potential long-term effects of particles to be addressed for the first time. The consideration of particle forcing has been shown to significantly improve the representation of reactive nitrogen and ozone variability in the polar middle atmosphere, eventually resulting in further improvements in the representation of solar climate variability in global models.

  18. Caliver: An R package for CALIbration and VERification of forest fire gridded model outputs.

    PubMed

    Vitolo, Claudia; Di Giuseppe, Francesca; D'Andrea, Mirko

    2018-01-01

    The name caliver stands for CALIbration and VERification of forest fire gridded model outputs. This is a package developed for the R programming language and available under an APACHE-2 license from a public repository. In this paper we describe the functionalities of the package and give examples using publicly available datasets. Fire danger model outputs are taken from the modeling components of the European Forest Fire Information System (EFFIS) and observed burned areas from the Global Fire Emission Database (GFED). Complete documentation, including a vignette, is also available within the package.

  19. Caliver: An R package for CALIbration and VERification of forest fire gridded model outputs

    PubMed Central

    Di Giuseppe, Francesca; D’Andrea, Mirko

    2018-01-01

    The name caliver stands for CALIbration and VERification of forest fire gridded model outputs. This is a package developed for the R programming language and available under an APACHE-2 license from a public repository. In this paper we describe the functionalities of the package and give examples using publicly available datasets. Fire danger model outputs are taken from the modeling components of the European Forest Fire Information System (EFFIS) and observed burned areas from the Global Fire Emission Database (GFED). Complete documentation, including a vignette, is also available within the package. PMID:29293536

  20. PARRoT- a homology-based strategy to quantify and compare RNA-sequencing from non-model organisms.

    PubMed

    Gan, Ruei-Chi; Chen, Ting-Wen; Wu, Timothy H; Huang, Po-Jung; Lee, Chi-Ching; Yeh, Yuan-Ming; Chiu, Cheng-Hsun; Huang, Hsien-Da; Tang, Petrus

    2016-12-22

    Next-generation sequencing promises the de novo genomic and transcriptomic analysis of samples of interests. However, there are only a few organisms having reference genomic sequences and even fewer having well-defined or curated annotations. For transcriptome studies focusing on organisms lacking proper reference genomes, the common strategy is de novo assembly followed by functional annotation. However, things become even more complicated when multiple transcriptomes are compared. Here, we propose a new analysis strategy and quantification methods for quantifying expression level which not only generate a virtual reference from sequencing data, but also provide comparisons between transcriptomes. First, all reads from the transcriptome datasets are pooled together for de novo assembly. The assembled contigs are searched against NCBI NR databases to find potential homolog sequences. Based on the searched result, a set of virtual transcripts are generated and served as a reference transcriptome. By using the same reference, normalized quantification values including RC (read counts), eRPKM (estimated RPKM) and eTPM (estimated TPM) can be obtained that are comparable across transcriptome datasets. In order to demonstrate the feasibility of our strategy, we implement it in the web service PARRoT. PARRoT stands for Pipeline for Analyzing RNA Reads of Transcriptomes. It analyzes gene expression profiles for two transcriptome sequencing datasets. For better understanding of the biological meaning from the comparison among transcriptomes, PARRoT further provides linkage between these virtual transcripts and their potential function through showing best hits in SwissProt, NR database, assigning GO terms. Our demo datasets showed that PARRoT can analyze two paired-end transcriptomic datasets of approximately 100 million reads within just three hours. In this study, we proposed and implemented a strategy to analyze transcriptomes from non-reference organisms which offers the opportunity to quantify and compare transcriptome profiles through a homolog based virtual transcriptome reference. By using the homolog based reference, our strategy effectively avoids the problems that may cause from inconsistencies among transcriptomes. This strategy will shed lights on the field of comparative genomics for non-model organism. We have implemented PARRoT as a web service which is freely available at http://parrot.cgu.edu.tw .

  1. Querying phenotype-genotype relationships on patient datasets using semantic web technology: the example of Cerebrotendinous xanthomatosis.

    PubMed

    Taboada, María; Martínez, Diego; Pilo, Belén; Jiménez-Escrig, Adriano; Robinson, Peter N; Sobrido, María J

    2012-07-31

    Semantic Web technology can considerably catalyze translational genetics and genomics research in medicine, where the interchange of information between basic research and clinical levels becomes crucial. This exchange involves mapping abstract phenotype descriptions from research resources, such as knowledge databases and catalogs, to unstructured datasets produced through experimental methods and clinical practice. This is especially true for the construction of mutation databases. This paper presents a way of harmonizing abstract phenotype descriptions with patient data from clinical practice, and querying this dataset about relationships between phenotypes and genetic variants, at different levels of abstraction. Due to the current availability of ontological and terminological resources that have already reached some consensus in biomedicine, a reuse-based ontology engineering approach was followed. The proposed approach uses the Ontology Web Language (OWL) to represent the phenotype ontology and the patient model, the Semantic Web Rule Language (SWRL) to bridge the gap between phenotype descriptions and clinical data, and the Semantic Query Web Rule Language (SQWRL) to query relevant phenotype-genotype bidirectional relationships. The work tests the use of semantic web technology in the biomedical research domain named cerebrotendinous xanthomatosis (CTX), using a real dataset and ontologies. A framework to query relevant phenotype-genotype bidirectional relationships is provided. Phenotype descriptions and patient data were harmonized by defining 28 Horn-like rules in terms of the OWL concepts. In total, 24 patterns of SWQRL queries were designed following the initial list of competency questions. As the approach is based on OWL, the semantic of the framework adapts the standard logical model of an open world assumption. This work demonstrates how semantic web technologies can be used to support flexible representation and computational inference mechanisms required to query patient datasets at different levels of abstraction. The open world assumption is especially good for describing only partially known phenotype-genotype relationships, in a way that is easily extensible. In future, this type of approach could offer researchers a valuable resource to infer new data from patient data for statistical analysis in translational research. In conclusion, phenotype description formalization and mapping to clinical data are two key elements for interchanging knowledge between basic and clinical research.

  2. A dynamic appearance descriptor approach to facial actions temporal modeling.

    PubMed

    Jiang, Bihan; Valstar, Michel; Martinez, Brais; Pantic, Maja

    2014-02-01

    Both the configuration and the dynamics of facial expressions are crucial for the interpretation of human facial behavior. Yet to date, the vast majority of reported efforts in the field either do not take the dynamics of facial expressions into account, or focus only on prototypic facial expressions of six basic emotions. Facial dynamics can be explicitly analyzed by detecting the constituent temporal segments in Facial Action Coding System (FACS) Action Units (AUs)-onset, apex, and offset. In this paper, we present a novel approach to explicit analysis of temporal dynamics of facial actions using the dynamic appearance descriptor Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP). Temporal segments are detected by combining a discriminative classifier for detecting the temporal segments on a frame-by-frame basis with Markov Models that enforce temporal consistency over the whole episode. The system is evaluated in detail over the MMI facial expression database, the UNBC-McMaster pain database, the SAL database, the GEMEP-FERA dataset in database-dependent experiments, in cross-database experiments using the Cohn-Kanade, and the SEMAINE databases. The comparison with other state-of-the-art methods shows that the proposed LPQ-TOP method outperforms the other approaches for the problem of AU temporal segment detection, and that overall AU activation detection benefits from dynamic appearance information.

  3. Database of small research watersheds for the territory of former Soviet Union as a source of data for improving hydrological models and their parameterizations in different geographical conditions

    NASA Astrophysics Data System (ADS)

    Lebedeva, Liudmila; Semenova, Olga

    2013-04-01

    One of widely claimed problems in modern modelling hydrology is lack of available information to investigate hydrological processes and improve their representation in the models. In spite of this, one hardly might confidently say that existing "traditional" data sources have been already fully analyzed and made use of. There existed the network of research watersheds in USSR called water-balance stations where comprehensive and extensive hydrometeorological measurements were conducted according to more or less single program during the last 40-60 years. The program (where not ceased) includes observations of discharges in several, often nested and homogeneous, small watersheds, meteorological elements, evaporation, soil temperature and moisture, snow depths, etc. The network covered different climatic and landscape zones and was established in the middle of the last century with the aim of investigation of the runoff formation in different conditions. Until recently the long-term observational data accompanied by descriptions and maps had existed only in hard copies. It partly explains why these datasets are not enough exploited yet and very rarely or even never were used for the purposes of hydrological modelling although they seem to be much more promising than implementation of the completely new measuring techniques not detracting from its importance. The goal of the presented work is development of a database of observational data and supportive materials from small research watersheds across the territory of the former Soviet Union. The first version of the database will include the following information for 12 water-balance stations across Russia, Ukraine, Kazahstan and Turkmenistan: daily values of discharges (one or several watersheds), air temperature, humidity, precipitation (one or several gauges), soil and snow state variables, soil and snow evaporation. The stations will cover desert and semi desert, steppe and forest steppe, forest, permafrost and mountainous zones. Supportive material will include maps of watershed boundaries and location of observational sites. Text descriptions of the data, measuring techniques and hydrometeorological conditions related to each of the water-balance station will accompany the datasets. The database is supposed to be expanded with time in number of the stations (by 20) and available data series for each of them. It will be uploaded to the internet with open access to everyone interested in. Such a database allows one to test hydrological models and separate modules for their adequacy and workability in different conditions and can serve as a base for models comparison and evaluation. Special profit of the database will gain models that don't rely on calibration but on the adequate process representation and use of the observable parameters. One of such models, process-based Hydrograph model, will be tested against the data from every watershed from the developed database. The aim of the Hydrograph model application to the as many as possible number of research data-rich watersheds in different climatic zones is both amending the algorithms and creation and adjustment of the model parameters that allow using the model across the geographic spectrum.

  4. Benford's Law for Quality Assurance of Manner of Death Counts in Small and Large Databases.

    PubMed

    Daniels, Jeremy; Caetano, Samantha-Jo; Huyer, Dirk; Stephen, Andrew; Fernandes, John; Lytwyn, Alice; Hoppe, Fred M

    2017-09-01

    To assess if Benford's law, a mathematical law used for quality assurance in accounting, can be applied as a quality assurance measure for the manner of death determination. We examined a regional forensic pathology service's monthly manner of death counts (N = 2352) from 2011 to 2013, and provincial monthly and weekly death counts from 2009 to 2013 (N = 81,831). We tested whether each dataset's leading digit followed Benford's law via the chi-square test. For each database, we assessed whether number 1 was the most common leading digit. The manner of death counts first digit followed Benford's law in all the three datasets. Two of the three datasets had 1 as the most frequent leading digit. The manner of death data in this study showed qualities consistent with Benford's law. The law has potential as a quality assurance metric in the manner of death determination for both small and large databases. © 2017 American Academy of Forensic Sciences.

  5. Enhancing Geoscience Research Discovery Through the Semantic Web

    NASA Astrophysics Data System (ADS)

    Rowan, Linda R.; Gross, M. Benjamin; Mayernik, Matthew; Khan, Huda; Boler, Frances; Maull, Keith; Stott, Don; Williams, Steve; Corson-Rikert, Jon; Johns, Erica M.; Daniels, Michael; Krafft, Dean B.; Meertens, Charles

    2016-04-01

    UNAVCO, UCAR, and Cornell University are working together to leverage semantic web technologies to enable discovery of people, datasets, publications and other research products, as well as the connections between them. The EarthCollab project, a U.S. National Science Foundation EarthCube Building Block, is enhancing an existing open-source semantic web application, VIVO, to enhance connectivity across distributed networks of researchers and resources related to the following two geoscience-based communities: (1) the Bering Sea Project, an interdisciplinary field program whose data archive is hosted by NCAR's Earth Observing Laboratory (EOL), and (2) UNAVCO, a geodetic facility and consortium that supports diverse research projects informed by geodesy. People, publications, datasets and grant information have been mapped to an extended version of the VIVO-ISF ontology and ingested into VIVO's database. Much of the VIVO ontology was built for the life sciences, so we have added some components of existing geoscience-based ontologies and a few terms from a local ontology that we created. The UNAVCO VIVO instance, connect.unavco.org, utilizes persistent identifiers whenever possible; for example using ORCIDs for people, publication DOIs, data DOIs and unique NSF grant numbers. Data is ingested using a custom set of scripts that include the ability to perform basic automated and curated disambiguation. VIVO can display a page for every object ingested, including connections to other objects in the VIVO database. A dataset page, for example, includes the dataset type, time interval, DOI, related publications, and authors. The dataset type field provides a connection to all other datasets of the same type. The author's page shows, among other information, related datasets and co-authors. Information previously spread across several unconnected databases is now stored in a single location. In addition to VIVO's default display, the new database can be queried using SPARQL, a query language for semantic data. EarthCollab is extending the VIVO web application. One such extension is the ability to cross-link separate VIVO instances across institutions, allowing local display of externally curated information. For example, Cornell's VIVO faculty pages will display UNAVCO's dataset information and UNAVCO's VIVO will display Cornell faculty member contact and position information. About half of UNAVCO's membership is international and we hope to connect our data to institutions in other countries with a similar approach. Additional extensions, including enhanced geospatial capabilities, will be developed based on task-centered usability testing.

  6. [Naïve Bayes classification for classifying injury-cause groups from Emergency Room data in the Friuli Venezia Giulia region (Northern Italy)].

    PubMed

    Valent, Francesca; Clagnan, Elena; Zanier, Loris

    2014-01-01

    to assess whether Naïve Bayes Classification could be used to classify injury causes from the Emergency Room (ER) database, because in the Friuli Venezia Giulia Region (Northern Italy) the electronic ER data have never been used to study the epidemiology of injuries, because the proportion of generic "accidental" causes is much higher than that of injuries with a specific cause. application of the Naïve Bayes Classification method to the regional ER database. sensitivity, specificity, positive and negative predictive values, agreement, and the kappa statistic were calculated for the train dataset and the distribution of causes of injury for the test dataset. on 22.248 records with known cause, the classifications assigned by the model agreed moderately (kappa =0.53) with those assigned by ER personnel. The model was then used on 76.660 unclassified cases. Although sensitivity and positive predictive value of the method were generally poor, mainly due to limitations in the ER data, it allowed to estimate for the first time the frequency of specific injury causes in the Region. the model was useful to provide the "big picture" of non-fatal injuries in the Region. To improve the collection of injury data at the ER, the options available for injury classification in the ER software are being revised to make categories exhaustive and mutually exclusive.

  7. A method to derive vegetation distribution maps for pollen dispersion models using birch as an example

    NASA Astrophysics Data System (ADS)

    Pauling, A.; Rotach, M. W.; Gehrig, R.; Clot, B.

    2012-09-01

    Detailed knowledge of the spatial distribution of sources is a crucial prerequisite for the application of pollen dispersion models such as, for example, COSMO-ART (COnsortium for Small-scale MOdeling - Aerosols and Reactive Trace gases). However, this input is not available for the allergy-relevant species such as hazel, alder, birch, grass or ragweed. Hence, plant distribution datasets need to be derived from suitable sources. We present an approach to produce such a dataset from existing sources using birch as an example. The basic idea is to construct a birch dataset using a region with good data coverage for calibration and then to extrapolate this relationship to a larger area by using land use classes. We use the Swiss forest inventory (1 km resolution) in combination with a 74-category land use dataset that covers the non-forested areas of Switzerland as well (resolution 100 m). Then we assign birch density categories of 0%, 0.1%, 0.5% and 2.5% to each of the 74 land use categories. The combination of this derived dataset with the birch distribution from the forest inventory yields a fairly accurate birch distribution encompassing entire Switzerland. The land use categories of the Global Land Cover 2000 (GLC2000; Global Land Cover 2000 database, 2003, European Commission, Joint Research Centre; resolution 1 km) are then calibrated with the Swiss dataset in order to derive a Europe-wide birch distribution dataset and aggregated onto the 7 km COSMO-ART grid. This procedure thus assumes that a certain GLC2000 land use category has the same birch density wherever it may occur in Europe. In order to reduce the strict application of this crucial assumption, the birch density distribution as obtained from the previous steps is weighted using the mean Seasonal Pollen Index (SPI; yearly sums of daily pollen concentrations). For future improvement, region-specific birch densities for the GLC2000 categories could be integrated into the mapping procedure.

  8. A comparison of U.S. geological survey seamless elevation models with shuttle radar topography mission data

    USGS Publications Warehouse

    Gesch, D.; Williams, J.; Miller, W.

    2001-01-01

    Elevation models produced from Shuttle Radar Topography Mission (SRTM) data will be the most comprehensive, consistently processed, highest resolution topographic dataset ever produced for the Earth's land surface. Many applications that currently use elevation data will benefit from the increased availability of data with higher accuracy, quality, and resolution, especially in poorly mapped areas of the globe. SRTM data will be produced as seamless data, thereby avoiding many of the problems inherent in existing multi-source topographic databases. Serving as precursors to SRTM datasets, the U.S. Geological Survey (USGS) has produced and is distributing seamless elevation datasets that facilitate scientific use of elevation data over large areas. GTOPO30 is a global elevation model with a 30 arc-second resolution (approximately 1-kilometer). The National Elevation Dataset (NED) covers the United States at a resolution of 1 arc-second (approximately 30-meters). Due to their seamless format and broad area coverage, both GTOPO30 and NED represent an advance in the usability of elevation data, but each still includes artifacts from the highly variable source data used to produce them. The consistent source data and processing approach for SRTM data will result in elevation products that will be a significant addition to the current availability of seamless datasets, specifically for many areas outside the U.S. One application that demonstrates some advantages that may be realized with SRTM data is delineation of land surface drainage features (watersheds and stream channels). Seamless distribution of elevation data in which a user interactively specifies the area of interest and order parameters via a map server is already being successfully demonstrated with existing USGS datasets. Such an approach for distributing SRTM data is ideal for a dataset that undoubtedly will be of very high interest to the spatial data user community.

  9. NHD Event Data Dictionary

    EPA Pesticide Factsheets

    The Reach Address Database (RAD) stores reach address information for each Water Program feature that has been linked to the underlying surface water features (streams, lakes, etc) in the National Hydrology Database (NHD) Plus dataset.

  10. GTN-G, WGI, RGI, DCW, GLIMS, WGMS, GCOS - What's all this about? (Invited)

    NASA Astrophysics Data System (ADS)

    Paul, F.; Raup, B. H.; Zemp, M.

    2013-12-01

    In a large collaborative effort, the glaciological community has compiled a new and spa-tially complete global dataset of glacier outlines, the so-called Randolph Glacier Inventory or RGI. Despite its regional shortcomings in quality (e.g. in regard to geolocation, gener-alization, and interpretation), this dataset was heavily used for global-scale modelling ap-plications (e.g. determination of total glacier volume and glacier contribution to sea-level rise) in support of the forthcoming 5th Assessment Report (AR5) of Working Group I of the IPCC. The RGI is a merged dataset that is largely based on the GLIMS database and several new datasets provided by the community (both are mostly derived from satellite data), as well as the Digital Chart of the World (DCW) and glacier attribute information (location, size) from the World Glacier Inventory (WGI). There are now two key tasks to be performed, (1) improving the quality of the RGI in all regions where the outlines do not met the quality required for local scale applications, and (2) integrating the RGI in the GLIMS glacier database to improve its spatial completeness. While (1) requires again a huge effort but is already ongoing, (2) is mainly a technical issue that is nearly solved. Apart from this technical dimension, there is also a more political or structural one. While GLIMS is responsible for the remote sensing and glacier inventory part (Tier 5) of the Global Terrestrial Network for Glaciers (GTN-G) within the Global Climate Observing System (GCOS), the World Glacier Monitoring Service (WGMS) is collecting and dis-seminating the field observations. Along with new global products derived from satellite data (e.g. elevation changes and velocity fields) and the community wish to keep a snap-shot dataset such as the RGI available, how to make all these datasets available to the community without duplicating efforts and making best use of the very limited financial resources available must now be discussed. This overview presentation describes the cur-rently available datasets, clarifying the terminology and the international framework, and suggesting a way forward to serve the community at best.

  11. AnthWest, occurrence records for wool carder bees of the genus Anthidium (Hymenoptera, Megachilidae, Anthidiini) in the Western Hemisphere.

    PubMed

    Griswold, Terry; Gonzalez, Victor H; Ikerd, Harold

    2014-01-01

    This paper describes AnthWest, a large dataset that represents one of the outcomes of a comprehensive, broadly comparative study on the diversity, biology, biogeography, and evolution of Anthidium Fabricius in the Western Hemisphere. In this dataset a total of 22,648 adult occurrence records comprising 9657 unique events are documented for 92 species of Anthidium, including the invasive range of two introduced species from Eurasia, A. oblongatum (Illiger) and A. manicatum (Linnaeus). The geospatial coverage of the dataset extends from northern Canada and Alaska to southern Argentina, and from below sea level in Death Valley, California, USA, to 4700 m a.s.l. in Tucumán, Argentina. The majority of records in the dataset correspond to information recorded from individual specimens examined by the authors during this project and deposited in 60 biodiversity collections located in Africa, Europe, North and South America. A fraction (4.8%) of the occurrence records were taken from the literature, largely California records from a taxonomic treatment with some additional records for the two introduced species. The temporal scale of the dataset represents collection events recorded between 1886 and 2012. The dataset was developed employing SQL server 2008 r2. For each specimen, the following information is generally provided: scientific name including identification qualifier when species status is uncertain (e.g. "Questionable Determination" for 0.4% of the specimens), sex, temporal and geospatial details, coordinates, data collector, host plants, associated organisms, name of identifier, historic identification, historic identifier, taxonomic value (i.e., type specimen, voucher, etc.), and repository. For a small portion of the database records, bees associated with threatened or endangered plants (~ 0.08% of total records) as well as specimens collected as part of unpublished biological inventories (~17%), georeferencing is presented only to nearest degree and the information on floral host, locality, elevation, month, and day has been withheld. This database can potentially be used in species distribution and niche modeling studies, as well as in assessments of pollinator status and pollination services. For native pollinators, this large dataset of occurrence records is the first to be simultaneously developed during a species-level systematic study.

  12. ChemMaps: Towards an approach for visualizing the chemical space based on adaptive satellite compounds

    PubMed Central

    Naveja, J. Jesús; Medina-Franco, José L.

    2017-01-01

    We present a novel approach called ChemMaps for visualizing chemical space based on the similarity matrix of compound datasets generated with molecular fingerprints’ similarity. The method uses a ‘satellites’ approach, where satellites are, in principle, molecules whose similarity to the rest of the molecules in the database provides sufficient information for generating a visualization of the chemical space. Such an approach could help make chemical space visualizations more efficient. We hereby describe a proof-of-principle application of the method to various databases that have different diversity measures. Unsurprisingly, we found the method works better with databases that have low 2D diversity. 3D diversity played a secondary role, although it seems to be more relevant as 2D diversity increases. For less diverse datasets, taking as few as 25% satellites seems to be sufficient for a fair depiction of the chemical space. We propose to iteratively increase the satellites number by a factor of 5% relative to the whole database, and stop when the new and the prior chemical space correlate highly. This Research Note represents a first exploratory step, prior to the full application of this method for several datasets. PMID:28794856

  13. ChemMaps: Towards an approach for visualizing the chemical space based on adaptive satellite compounds.

    PubMed

    Naveja, J Jesús; Medina-Franco, José L

    2017-01-01

    We present a novel approach called ChemMaps for visualizing chemical space based on the similarity matrix of compound datasets generated with molecular fingerprints' similarity. The method uses a 'satellites' approach, where satellites are, in principle, molecules whose similarity to the rest of the molecules in the database provides sufficient information for generating a visualization of the chemical space. Such an approach could help make chemical space visualizations more efficient. We hereby describe a proof-of-principle application of the method to various databases that have different diversity measures. Unsurprisingly, we found the method works better with databases that have low 2D diversity. 3D diversity played a secondary role, although it seems to be more relevant as 2D diversity increases. For less diverse datasets, taking as few as 25% satellites seems to be sufficient for a fair depiction of the chemical space. We propose to iteratively increase the satellites number by a factor of 5% relative to the whole database, and stop when the new and the prior chemical space correlate highly. This Research Note represents a first exploratory step, prior to the full application of this method for several datasets.

  14. WATERS Terms of Use and Disclaimer

    EPA Pesticide Factsheets

    The Reach Address Database (RAD) stores reach address information for each Water Program feature that has been linked to the underlying surface water features (streams, lakes, etc) in the National Hydrology Database (NHD) Plus dataset.

  15. Architectural Implications for Spatial Object Association Algorithms*

    PubMed Central

    Kumar, Vijay S.; Kurc, Tahsin; Saltz, Joel; Abdulla, Ghaleb; Kohn, Scott R.; Matarazzo, Celeste

    2013-01-01

    Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server®, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST). PMID:25692244

  16. Transport Statistics - Transport - UNECE

    Science.gov Websites

    Statistics and Data Online Infocards Database SDG Papers E-Road Census Traffic Census Map Traffic Census 2015 available. Two new datasets have been added to the transport statistics database: bus and coach statistics Database Evaluations Follow UNECE Facebook Rss Twitter You tube Contact us Instagram Flickr Google+ Â

  17. Global Oil & Gas Features Database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kelly Rose; Jennifer Bauer; Vic Baker

    This submission contains a zip file with the developed Global Oil & Gas Features Database (as an ArcGIS geodatabase). Access the technical report describing how this database was produced using the following link: https://edx.netl.doe.gov/dataset/development-of-an-open-global-oil-and-gas-infrastructure-inventory-and-geodatabase

  18. Creating a FIESTA (Framework for Integrated Earth Science and Technology Applications) with MagIC

    NASA Astrophysics Data System (ADS)

    Minnett, R.; Koppers, A. A. P.; Jarboe, N.; Tauxe, L.; Constable, C.

    2017-12-01

    The Magnetics Information Consortium (https://earthref.org/MagIC) has recently developed a containerized web application to considerably reduce the friction in contributing, exploring and combining valuable and complex datasets for the paleo-, geo- and rock magnetic scientific community. The data produced in this scientific domain are inherently hierarchical and the communities evolving approaches to this scientific workflow, from sampling to taking measurements to multiple levels of interpretations, require a large and flexible data model to adequately annotate the results and ensure reproducibility. Historically, contributing such detail in a consistent format has been prohibitively time consuming and often resulted in only publishing the highly derived interpretations. The new open-source (https://github.com/earthref/MagIC) application provides a flexible upload tool integrated with the data model to easily create a validated contribution and a powerful search interface for discovering datasets and combining them to enable transformative science. MagIC is hosted at EarthRef.org along with several interdisciplinary geoscience databases. A FIESTA (Framework for Integrated Earth Science and Technology Applications) is being created by generalizing MagIC's web application for reuse in other domains. The application relies on a single configuration document that describes the routing, data model, component settings and external services integrations. The container hosts an isomorphic Meteor JavaScript application, MongoDB database and ElasticSearch search engine. Multiple containers can be configured as microservices to serve portions of the application or rely on externally hosted MongoDB, ElasticSearch, or third-party services to efficiently scale computational demands. FIESTA is particularly well suited for many Earth Science disciplines with its flexible data model, mapping, account management, upload tool to private workspaces, reference metadata, image galleries, full text searches and detailed filters. EarthRef's Seamount Catalog of bathymetry and morphology data, EarthRef's Geochemical Earth Reference Model (GERM) databases, and Oregon State University's Marine and Geology Repository (http://osu-mgr.org) will benefit from custom adaptations of FIESTA.

  19. Unleashing spatially distributed ecohydrology modeling using Big Data tools

    NASA Astrophysics Data System (ADS)

    Miles, B.; Idaszak, R.

    2015-12-01

    Physically based spatially distributed ecohydrology models are useful for answering science and management questions related to the hydrology and biogeochemistry of prairie, savanna, forested, as well as urbanized ecosystems. However, these models can produce hundreds of gigabytes of spatial output for a single model run over decadal time scales when run at regional spatial scales and moderate spatial resolutions (~100-km2+ at 30-m spatial resolution) or when run for small watersheds at high spatial resolutions (~1-km2 at 3-m spatial resolution). Numerical data formats such as HDF5 can store arbitrarily large datasets. However even in HPC environments, there are practical limits on the size of single files that can be stored and reliably backed up. Even when such large datasets can be stored, querying and analyzing these data can suffer from poor performance due to memory limitations and I/O bottlenecks, for example on single workstations where memory and bandwidth are limited, or in HPC environments where data are stored separately from computational nodes. The difficulty of storing and analyzing spatial data from ecohydrology models limits our ability to harness these powerful tools. Big Data tools such as distributed databases have the potential to surmount the data storage and analysis challenges inherent to large spatial datasets. Distributed databases solve these problems by storing data close to computational nodes while enabling horizontal scalability and fault tolerance. Here we present the architecture of and preliminary results from PatchDB, a distributed datastore for managing spatial output from the Regional Hydro-Ecological Simulation System (RHESSys). The initial version of PatchDB uses message queueing to asynchronously write RHESSys model output to an Apache Cassandra cluster. Once stored in the cluster, these data can be efficiently queried to quickly produce both spatial visualizations for a particular variable (e.g. maps and animations), as well as point time series of arbitrary variables at arbitrary points in space within a watershed or river basin. By treating ecohydrology modeling as a Big Data problem, we hope to provide a platform for answering transformative science and management questions related to water quantity and quality in a world of non-stationary climate.

  20. Global distribution of urban parameters derived from high-resolution global datasets for weather modelling

    NASA Astrophysics Data System (ADS)

    Kawano, N.; Varquez, A. C. G.; Dong, Y.; Kanda, M.

    2016-12-01

    Numerical model such as Weather Research and Forecasting model coupled with single-layer Urban Canopy Model (WRF-UCM) is one of the powerful tools to investigate urban heat island. Urban parameters such as average building height (Have), plain area index (λp) and frontal area index (λf), are necessary inputs for the model. In general, these parameters are uniformly assumed in WRF-UCM but this leads to unrealistic urban representation. Distributed urban parameters can also be incorporated into WRF-UCM to consider a detail urban effect. The problem is that distributed building information is not readily available for most megacities especially in developing countries. Furthermore, acquiring real building parameters often require huge amount of time and money. In this study, we investigated the potential of using globally available satellite-captured datasets for the estimation of the parameters, Have, λp, and λf. Global datasets comprised of high spatial resolution population dataset (LandScan by Oak Ridge National Laboratory), nighttime lights (NOAA), and vegetation fraction (NASA). True samples of Have, λp, and λf were acquired from actual building footprints from satellite images and 3D building database of Tokyo, New York, Paris, Melbourne, Istanbul, Jakarta and so on. Regression equations were then derived from the block-averaging of spatial pairs of real parameters and global datasets. Results show that two regression curves to estimate Have and λf from the combination of population and nightlight are necessary depending on the city's level of development. An index which can be used to decide which equation to use for a city is the Gross Domestic Product (GDP). On the other hand, λphas less dependence on GDP but indicated a negative relationship to vegetation fraction. Finally, a simplified but precise approximation of urban parameters through readily-available, high-resolution global datasets and our derived regressions can be utilized to estimate a global distribution of urban parameters for later incorporation into a weather model, thus allowing us to acquire a global understanding of urban climate (Global Urban Climatology). Acknowledgment: This research was supported by the Environment Research and Technology Development Fund (S-14) of the Ministry of the Environment, Japan.

  1. The CREp program, a fully parameterizable program to compute exposure ages (3He, 10Be)

    NASA Astrophysics Data System (ADS)

    Martin, L.; Blard, P. H.; Lave, J.; Delunel, R.; Balco, G.

    2015-12-01

    Over the last decades, cosmogenic exposure dating permitted major advances in Earth surface sciences, and particularly in paleoclimatology. Yet, exposure age calculation is a dense procedure. It requires numerous choices of parameterization and the use of an appropriate production rate. Nowadays, Earth surface scientists may either calculate exposure ages on their own or use the available programs. However, these programs do not offer the possibility to include all the most recent advances in Cosmic Ray Exposure (CRE) dating. Notably, they do not propose the most recent production rate datasets and they only offer few possibilities to test the impact of the atmosphere model and the geomagnetic model on the computed ages. We present the CREp program, a Matlab © code that computes CRE ages for 3He and 10Be over the last 2 million years. The CREp program includes the scaling models of Lal-Stone in the "Lal modified" version (Balco et al., 2008; Lal, 1991; Stone, 2000) and the LSD model (Lifton et al., 2014). For any of these models, CREP allows choosing between the ERA-40 atmosphere model (Uppala et al., 2005) and the standard atmosphere (National Oceanic and Atmospheric Administration, 1976). Regarding the geomagnetic database, users can opt for one of the three proposed datasets: Muscheler et al. 2005, GLOPIS-75 (Laj et al. 2004) and the geomagnetic framework proposed in the LSD model (Lifton et al., 2014). They may also import their own geomagnetic database. Importantly, the reference production rate can be chosen among a large variety of possibilities. We made an effort to propose a wide and homogenous calibration database in order to promote the use of local calibration rates: CREp includes all the calibration data published until July 2015 and will be able to access an updated online database including all the newly published production rates. This is crucial for improving the ages accuracy. Users may also choose a global production rate or use their own data to either calibrate a production rate or directly input a Sea Level High Latitude value. The program is fast to calculate a large number of ages and to export the final density probability function associated with each age into an Excel © spreadsheet format.

  2. Database resources of the National Center for Biotechnology Information

    PubMed Central

    2016-01-01

    The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. Additional NCBI resources focus on literature (PubMed Central (PMC), Bookshelf and PubReader), health (ClinVar, dbGaP, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen), genomes (BioProject, Assembly, Genome, BioSample, dbSNP, dbVar, Epigenomics, the Map Viewer, Nucleotide, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser and the Trace Archive), genes (Gene, Gene Expression Omnibus (GEO), HomoloGene, PopSet and UniGene), proteins (Protein, the Conserved Domain Database (CDD), COBALT, Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB) and Protein Clusters) and chemicals (Biosystems and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for most of these databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized datasets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. PMID:26615191

  3. PICKLE 2.0: A human protein-protein interaction meta-database employing data integration via genetic information ontology

    PubMed Central

    Gioutlakis, Aris; Klapa, Maria I.

    2017-01-01

    It has been acknowledged that source databases recording experimentally supported human protein-protein interactions (PPIs) exhibit limited overlap. Thus, the reconstruction of a comprehensive PPI network requires appropriate integration of multiple heterogeneous primary datasets, presenting the PPIs at various genetic reference levels. Existing PPI meta-databases perform integration via normalization; namely, PPIs are merged after converted to a certain target level. Hence, the node set of the integrated network depends each time on the number and type of the combined datasets. Moreover, the irreversible a priori normalization process hinders the identification of normalization artifacts in the integrated network, which originate from the nonlinearity characterizing the genetic information flow. PICKLE (Protein InteraCtion KnowLedgebasE) 2.0 implements a new architecture for this recently introduced human PPI meta-database. Its main novel feature over the existing meta-databases is its approach to primary PPI dataset integration via genetic information ontology. Building upon the PICKLE principles of using the reviewed human complete proteome (RHCP) of UniProtKB/Swiss-Prot as the reference protein interactor set, and filtering out protein interactions with low probability of being direct based on the available evidence, PICKLE 2.0 first assembles the RHCP genetic information ontology network by connecting the corresponding genes, nucleotide sequences (mRNAs) and proteins (UniProt entries) and then integrates PPI datasets by superimposing them on the ontology network without any a priori transformations. Importantly, this process allows the resulting heterogeneous integrated network to be reversibly normalized to any level of genetic reference without loss of the original information, the latter being used for identification of normalization biases, and enables the appraisal of potential false positive interactions through PPI source database cross-checking. The PICKLE web-based interface (www.pickle.gr) allows for the simultaneous query of multiple entities and provides integrated human PPI networks at either the protein (UniProt) or the gene level, at three PPI filtering modes. PMID:29023571

  4. Building a database for long-term monitoring of benthic macrofauna in the Pertuis-Charentais (2004-2014).

    PubMed

    Philippe, Anne S; Plumejeaud-Perreau, Christine; Jourde, Jérôme; Pineau, Philippe; Lachaussée, Nicolas; Joyeux, Emmanuel; Corre, Frédéric; Delaporte, Philippe; Bocher, Pierrick

    2017-01-01

    Long-term benthic monitoring is rewarding in terms of science, but labour-intensive, whether in the field, the laboratory, or behind the computer. Building and managing databases require multiple skills, including consistency over time as well as organisation via a systematic approach. Here, we introduce and share our spatially explicit benthic database, comprising 11 years of benthic data. It is the result of intensive benthic sampling that has been conducted on a regular grid (259 stations) covering the intertidal mudflats of the Pertuis-Charentais (Marennes-Oléron Bay and Aiguillon Bay). Samples were taken by foot or by boats during winter depending on tidal height, from December 2003 to February 2014. The present dataset includes abundances and biomass densities of all mollusc species of the study regions and principal polychaetes as well as their length, accessibility to shorebirds, energy content and shell mass when appropriate and available. This database has supported many studies dealing with the spatial distribution of benthic invertebrates and temporal variations in food resources for shorebird species as well as latitudinal comparisons with other databases. In this paper, we introduce our benthos monitoring, share our data, and present a "guide of good practices" for building, cleaning and using it efficiently, providing examples of results with associated R code. The dataset has been formatted into a geo-referenced relational database, using PostgreSQL open-source DBMS. We provide density information, measurements, energy content and accessibility of thirteen bivalve, nine gastropod and two polychaete taxa (a total of 66,620 individuals)​ for 11 consecutive winters. Figures and maps are provided to describe how the dataset was built, cleaned, and how it can be used. This dataset can again support studies concerning spatial and temporal variations in species abundance, interspecific interactions as well as evaluations of the availability of food resources for small- and medium size shorebirds and, potentially, conservation and impact assessment studies.

  5. EpiK: A Knowledge Base for Epidemiological Modeling and Analytics of Infectious Diseases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hasan, S. M. Shamimul; Fox, Edward A.; Bisset, Keith

    Computational epidemiology seeks to develop computational methods to study the distribution and determinants of health-related states or events (including disease), and the application of this study to the control of diseases and other health problems. Recent advances in computing and data sciences have led to the development of innovative modeling environments to support this important goal. The datasets used to drive the dynamic models as well as the data produced by these models presents unique challenges owing to their size, heterogeneity and diversity. These datasets form the basis of effective and easy to use decision support and analytical environments. Asmore » a result, it is important to develop scalable data management systems to store, manage and integrate these datasets. In this paper, we develop EpiK—a knowledge base that facilitates the development of decision support and analytical environments to support epidemic science. An important goal is to develop a framework that links the input as well as output datasets to facilitate effective spatio-temporal and social reasoning that is critical in planning and intervention analysis before and during an epidemic. The data management framework links modeling workflow data and its metadata using a controlled vocabulary. The metadata captures information about storage, the mapping between the linked model and the physical layout, and relationships to support services. EpiK is designed to support agent-based modeling and analytics frameworks—aggregate models can be seen as special cases and are thus supported. We use semantic web technologies to create a representation of the datasets that encapsulates both the location and the schema heterogeneity. The choice of RDF as a representation language is motivated by the diversity and growth of the datasets that need to be integrated. A query bank is developed—the queries capture a broad range of questions that can be posed and answered during a typical case study pertaining to disease outbreaks. The queries are constructed using SPARQL Protocol and RDF Query Language (SPARQL) over the EpiK. EpiK can hide schema and location heterogeneity while efficiently supporting queries that span the computational epidemiology modeling pipeline: from model construction to simulation output. As a result, we show that the performance of benchmark queries varies significantly with respect to the choice of hardware underlying the database and resource description framework (RDF) engine.« less

  6. EpiK: A Knowledge Base for Epidemiological Modeling and Analytics of Infectious Diseases

    DOE PAGES

    Hasan, S. M. Shamimul; Fox, Edward A.; Bisset, Keith; ...

    2017-11-06

    Computational epidemiology seeks to develop computational methods to study the distribution and determinants of health-related states or events (including disease), and the application of this study to the control of diseases and other health problems. Recent advances in computing and data sciences have led to the development of innovative modeling environments to support this important goal. The datasets used to drive the dynamic models as well as the data produced by these models presents unique challenges owing to their size, heterogeneity and diversity. These datasets form the basis of effective and easy to use decision support and analytical environments. Asmore » a result, it is important to develop scalable data management systems to store, manage and integrate these datasets. In this paper, we develop EpiK—a knowledge base that facilitates the development of decision support and analytical environments to support epidemic science. An important goal is to develop a framework that links the input as well as output datasets to facilitate effective spatio-temporal and social reasoning that is critical in planning and intervention analysis before and during an epidemic. The data management framework links modeling workflow data and its metadata using a controlled vocabulary. The metadata captures information about storage, the mapping between the linked model and the physical layout, and relationships to support services. EpiK is designed to support agent-based modeling and analytics frameworks—aggregate models can be seen as special cases and are thus supported. We use semantic web technologies to create a representation of the datasets that encapsulates both the location and the schema heterogeneity. The choice of RDF as a representation language is motivated by the diversity and growth of the datasets that need to be integrated. A query bank is developed—the queries capture a broad range of questions that can be posed and answered during a typical case study pertaining to disease outbreaks. The queries are constructed using SPARQL Protocol and RDF Query Language (SPARQL) over the EpiK. EpiK can hide schema and location heterogeneity while efficiently supporting queries that span the computational epidemiology modeling pipeline: from model construction to simulation output. As a result, we show that the performance of benchmark queries varies significantly with respect to the choice of hardware underlying the database and resource description framework (RDF) engine.« less

  7. The Orthology Ontology: development and applications.

    PubMed

    Fernández-Breis, Jesualdo Tomás; Chiba, Hirokazu; Legaz-García, María Del Carmen; Uchiyama, Ikuo

    2016-06-04

    Computational comparative analysis of multiple genomes provides valuable opportunities to biomedical research. In particular, orthology analysis can play a central role in comparative genomics; it guides establishing evolutionary relations among genes of organisms and allows functional inference of gene products. However, the wide variations in current orthology databases necessitate the research toward the shareability of the content that is generated by different tools and stored in different structures. Exchanging the content with other research communities requires making the meaning of the content explicit. The need for a common ontology has led to the creation of the Orthology Ontology (ORTH) following the best practices in ontology construction. Here, we describe our model and major entities of the ontology that is implemented in the Web Ontology Language (OWL), followed by the assessment of the quality of the ontology and the application of the ORTH to existing orthology datasets. This shareable ontology enables the possibility to develop Linked Orthology Datasets and a meta-predictor of orthology through standardization for the representation of orthology databases. The ORTH is freely available in OWL format to all users at http://purl.org/net/orth . The Orthology Ontology can serve as a framework for the semantic standardization of orthology content and it will contribute to a better exploitation of orthology resources in biomedical research. The results demonstrate the feasibility of developing shareable datasets using this ontology. Further applications will maximize the usefulness of this ontology.

  8. Applied learning-based color tone mapping for face recognition in video surveillance system

    NASA Astrophysics Data System (ADS)

    Yew, Chuu Tian; Suandi, Shahrel Azmin

    2012-04-01

    In this paper, we present an applied learning-based color tone mapping technique for video surveillance system. This technique can be applied onto both color and grayscale surveillance images. The basic idea is to learn the color or intensity statistics from a training dataset of photorealistic images of the candidates appeared in the surveillance images, and remap the color or intensity of the input image so that the color or intensity statistics match those in the training dataset. It is well known that the difference in commercial surveillance cameras models, and signal processing chipsets used by different manufacturers will cause the color and intensity of the images to differ from one another, thus creating additional challenges for face recognition in video surveillance system. Using Multi-Class Support Vector Machines as the classifier on a publicly available video surveillance camera database, namely SCface database, this approach is validated and compared to the results of using holistic approach on grayscale images. The results show that this technique is suitable to improve the color or intensity quality of video surveillance system for face recognition.

  9. Toxics Release Inventory Chemical Hazard Information Profiles (TRI-CHIP) Dataset

    EPA Pesticide Factsheets

    The Toxics Release Inventory (TRI) Chemical Hazard Information Profiles (TRI-CHIP) dataset contains hazard information about the chemicals reported in TRI. Users can use this XML-format dataset to create their own databases and hazard analyses of TRI chemicals. The hazard information is compiled from a series of authoritative sources including the Integrated Risk Information System (IRIS). The dataset is provided as a downloadable .zip file that when extracted provides XML files and schemas for the hazard information tables.

  10. EFEHR - the European Facilities for Earthquake Hazard and Risk: beyond the web-platform

    NASA Astrophysics Data System (ADS)

    Danciu, Laurentiu; Wiemer, Stefan; Haslinger, Florian; Kastli, Philipp; Giardini, Domenico

    2017-04-01

    European Facilities for Earthquake Hazard and Risk (EEFEHR) represents the sustainable community resource for seismic hazard and risk in Europe. The EFEHR web platform is the main gateway to access data, models and tools as well as provide expertise relevant for assessment of seismic hazard and risk. The main services (databases and web-platform) are hosted at ETH Zurich and operated by the Swiss Seismological Service (Schweizerischer Erdbebendienst SED). EFEHR web-portal (www.efehr.org) collects and displays (i) harmonized datasets necessary for hazard and risk modeling, e.g. seismic catalogues, fault compilations, site amplifications, vulnerabilities, inventories; (ii) extensive seismic hazard products, namely hazard curves, uniform hazard spectra and maps for national and regional assessments. (ii) standardized configuration files for re-computing the regional seismic hazard models; (iv) relevant documentation of harmonized datasets, models and web-services. Today, EFEHR distributes full output of the 2013 European Seismic Hazard Model, ESHM13, as developed within the SHARE project (http://www.share-eu.org/); the latest results of the 2014 Earthquake Model of the Middle East (EMME14), derived within the EMME Project (www.emme-gem.org); the 2001 Global Seismic Hazard Assessment Project (GSHAP) results and the 2015 updates of the Swiss Seismic Hazard. New datasets related to either seismic hazard or risk will be incorporated as they become available. We present the currents status of the EFEHR platform, with focus on the challenges, summaries of the up-to-date datasets, user experience and feedback, as well as the roadmap to future technological innovation beyond the web-platform development. We also show the new services foreseen to fully integrate with the seismological core services of European Plate Observing System (EPOS).

  11. Protein-coding genes combined with long noncoding RNA as a novel transcriptome molecular staging model to predict the survival of patients with esophageal squamous cell carcinoma.

    PubMed

    Guo, Jin-Cheng; Wu, Yang; Chen, Yang; Pan, Feng; Wu, Zhi-Yong; Zhang, Jia-Sheng; Wu, Jian-Yi; Xu, Xiu-E; Zhao, Jian-Mei; Li, En-Min; Zhao, Yi; Xu, Li-Yan

    2018-04-09

    Esophageal squamous cell carcinoma (ESCC) is the predominant subtype of esophageal carcinoma in China. This study was to develop a staging model to predict outcomes of patients with ESCC. Using Cox regression analysis, principal component analysis (PCA), partitioning clustering, Kaplan-Meier analysis, receiver operating characteristic (ROC) curve analysis, and classification and regression tree (CART) analysis, we mined the Gene Expression Omnibus database to determine the expression profiles of genes in 179 patients with ESCC from GSE63624 and GSE63622 dataset. Univariate cox regression analysis of the GSE63624 dataset revealed that 2404 protein-coding genes (PCGs) and 635 long non-coding RNAs (lncRNAs) were associated with the survival of patients with ESCC. PCA categorized these PCGs and lncRNAs into three principal components (PCs), which were used to cluster the patients into three groups. ROC analysis demonstrated that the predictive ability of PCG-lncRNA PCs when applied to new patients was better than that of the tumor-node-metastasis staging (area under ROC curve [AUC]: 0.69 vs. 0.65, P < 0.05). Accordingly, we constructed a molecular disaggregated model comprising one lncRNA and two PCGs, which we designated as the LSB staging model using CART analysis in the GSE63624 dataset. This LSB staging model classified the GSE63622 dataset of patients into three different groups, and its effectiveness was validated by analysis of another cohort of 105 patients. The LSB staging model has clinical significance for the prognosis prediction of patients with ESCC and may serve as a three-gene staging microarray.

  12. The National NeuroAIDS Tissue Consortium (NNTC) Database: an integrated database for HIV-related studies

    PubMed Central

    Cserhati, Matyas F.; Pandey, Sanjit; Beaudoin, James J.; Baccaglini, Lorena; Guda, Chittibabu; Fox, Howard S.

    2015-01-01

    We herein present the National NeuroAIDS Tissue Consortium-Data Coordinating Center (NNTC-DCC) database, which is the only available database for neuroAIDS studies that contains data in an integrated, standardized form. This database has been created in conjunction with the NNTC, which provides human tissue and biofluid samples to individual researchers to conduct studies focused on neuroAIDS. The database contains experimental datasets from 1206 subjects for the following categories (which are further broken down into subcategories): gene expression, genotype, proteins, endo-exo-chemicals, morphometrics and other (miscellaneous) data. The database also contains a wide variety of downloadable data and metadata for 95 HIV-related studies covering 170 assays from 61 principal investigators. The data represent 76 tissue types, 25 measurement types, and 38 technology types, and reaches a total of 33 017 407 data points. We used the ISA platform to create the database and develop a searchable web interface for querying the data. A gene search tool is also available, which searches for NCBI GEO datasets associated with selected genes. The database is manually curated with many user-friendly features, and is cross-linked to the NCBI, HUGO and PubMed databases. A free registration is required for qualified users to access the database. Database URL: http://nntc-dcc.unmc.edu PMID:26228431

  13. A Regional Climate Model Evaluation System based on Satellite and other Observations

    NASA Astrophysics Data System (ADS)

    Lean, P.; Kim, J.; Waliser, D. E.; Hall, A. D.; Mattmann, C. A.; Granger, S. L.; Case, K.; Goodale, C.; Hart, A.; Zimdars, P.; Guan, B.; Molotch, N. P.; Kaki, S.

    2010-12-01

    Regional climate models are a fundamental tool needed for downscaling global climate simulations and projections, such as those contributing to the Coupled Model Intercomparison Projects (CMIPs) that form the basis of the IPCC Assessment Reports. The regional modeling process provides the means to accommodate higher resolution and a greater complexity of Earth System processes. Evaluation of both the global and regional climate models against observations is essential to identify model weaknesses and to direct future model development efforts focused on reducing the uncertainty associated with climate projections. However, the lack of reliable observational data and the lack of formal tools are among the serious limitations to addressing these objectives. Recent satellite observations are particularly useful as they provide a wealth of information on many different aspects of the climate system, but due to their large volume and the difficulties associated with accessing and using the data, these datasets have been generally underutilized in model evaluation studies. Recognizing this problem, NASA JPL / UCLA is developing a model evaluation system to help make satellite observations, in conjunction with in-situ, assimilated, and reanalysis datasets, more readily accessible to the modeling community. The system includes a central database to store multiple datasets in a common format and codes for calculating predefined statistical metrics to assess model performance. This allows the time taken to compare model simulations with satellite observations to be reduced from weeks to days. Early results from the use this new model evaluation system for evaluating regional climate simulations over California/western US regions will be presented.

  14. International LCA

    EPA Science Inventory

    To provide global guidance on the establishment and maintenance of LCA databases, as the basis for improved dataset exchangeability and interlinkages of databases worldwide. Increase the credibility of existing LCA data, the generation of more data and their overall accessibilit...

  15. Working covariance model selection for generalized estimating equations.

    PubMed

    Carey, Vincent J; Wang, You-Gan

    2011-11-20

    We investigate methods for data-based selection of working covariance models in the analysis of correlated data with generalized estimating equations. We study two selection criteria: Gaussian pseudolikelihood and a geodesic distance based on discrepancy between model-sensitive and model-robust regression parameter covariance estimators. The Gaussian pseudolikelihood is found in simulation to be reasonably sensitive for several response distributions and noncanonical mean-variance relations for longitudinal data. Application is also made to a clinical dataset. Assessment of adequacy of both correlation and variance models for longitudinal data should be routine in applications, and we describe open-source software supporting this practice. Copyright © 2011 John Wiley & Sons, Ltd.

  16. TranslatomeDB: a comprehensive database and cloud-based analysis platform for translatome sequencing data.

    PubMed

    Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong

    2018-01-04

    Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  17. Effects of Soil Data and Simulation Unit Resolution on Quantifying Changes of Soil Organic Carbon at Regional Scale with a Biogeochemical Process Model

    PubMed Central

    Zhang, Liming; Yu, Dongsheng; Shi, Xuezheng; Xu, Shengxiang; Xing, Shihe; Zhao, Yongcong

    2014-01-01

    Soil organic carbon (SOC) models were often applied to regions with high heterogeneity, but limited spatially differentiated soil information and simulation unit resolution. This study, carried out in the Tai-Lake region of China, defined the uncertainty derived from application of the DeNitrification-DeComposition (DNDC) biogeochemical model in an area with heterogeneous soil properties and different simulation units. Three different resolution soil attribute databases, a polygonal capture of mapping units at 1∶50,000 (P5), a county-based database of 1∶50,000 (C5) and county-based database of 1∶14,000,000 (C14), were used as inputs for regional DNDC simulation. The P5 and C5 databases were combined with the 1∶50,000 digital soil map, which is the most detailed soil database for the Tai-Lake region. The C14 database was combined with 1∶14,000,000 digital soil map, which is a coarse database and is often used for modeling at a national or regional scale in China. The soil polygons of P5 database and county boundaries of C5 and C14 databases were used as basic simulation units. Results project that from 1982 to 2000, total SOC change in the top layer (0–30 cm) of the 2.3 M ha of paddy soil in the Tai-Lake region was +1.48 Tg C, −3.99 Tg C and −15.38 Tg C based on P5, C5 and C14 databases, respectively. With the total SOC change as modeled with P5 inputs as the baseline, which is the advantages of using detailed, polygon-based soil dataset, the relative deviation of C5 and C14 were 368% and 1126%, respectively. The comparison illustrates that DNDC simulation is strongly influenced by choice of fundamental geographic resolution as well as input soil attribute detail. The results also indicate that improving the framework of DNDC is essential in creating accurate models of the soil carbon cycle. PMID:24523922

  18. A curated database of cyanobacterial strains relevant for modern taxonomy and phylogenetic studies.

    PubMed

    Ramos, Vitor; Morais, João; Vasconcelos, Vitor M

    2017-04-25

    The dataset herein described lays the groundwork for an online database of relevant cyanobacterial strains, named CyanoType (http://lege.ciimar.up.pt/cyanotype). It is a database that includes categorized cyanobacterial strains useful for taxonomic, phylogenetic or genomic purposes, with associated information obtained by means of a literature-based curation. The dataset lists 371 strains and represents the first version of the database (CyanoType v.1). Information for each strain includes strain synonymy and/or co-identity, strain categorization, habitat, accession numbers for molecular data, taxonomy and nomenclature notes according to three different classification schemes, hierarchical automatic classification, phylogenetic placement according to a selection of relevant studies (including this), and important bibliographic references. The database will be updated periodically, namely by adding new strains meeting the criteria for inclusion and by revising and adding up-to-date metadata for strains already listed. A global 16S rDNA-based phylogeny is provided in order to assist users when choosing the appropriate strains for their studies.

  19. A curated database of cyanobacterial strains relevant for modern taxonomy and phylogenetic studies

    PubMed Central

    Ramos, Vitor; Morais, João; Vasconcelos, Vitor M.

    2017-01-01

    The dataset herein described lays the groundwork for an online database of relevant cyanobacterial strains, named CyanoType (http://lege.ciimar.up.pt/cyanotype). It is a database that includes categorized cyanobacterial strains useful for taxonomic, phylogenetic or genomic purposes, with associated information obtained by means of a literature-based curation. The dataset lists 371 strains and represents the first version of the database (CyanoType v.1). Information for each strain includes strain synonymy and/or co-identity, strain categorization, habitat, accession numbers for molecular data, taxonomy and nomenclature notes according to three different classification schemes, hierarchical automatic classification, phylogenetic placement according to a selection of relevant studies (including this), and important bibliographic references. The database will be updated periodically, namely by adding new strains meeting the criteria for inclusion and by revising and adding up-to-date metadata for strains already listed. A global 16S rDNA-based phylogeny is provided in order to assist users when choosing the appropriate strains for their studies. PMID:28440791

  20. Applying manifold learning techniques to the CAESAR database

    NASA Astrophysics Data System (ADS)

    Mendoza-Schrock, Olga; Patrick, James; Arnold, Gregory; Ferrara, Matthew

    2010-04-01

    Understanding and organizing data is the first step toward exploiting sensor phenomenology for dismount tracking. What image features are good for distinguishing people and what measurements, or combination of measurements, can be used to classify the dataset by demographics including gender, age, and race? A particular technique, Diffusion Maps, has demonstrated the potential to extract features that intuitively make sense [1]. We want to develop an understanding of this tool by validating existing results on the Civilian American and European Surface Anthropometry Resource (CAESAR) database. This database, provided by the Air Force Research Laboratory (AFRL) Human Effectiveness Directorate and SAE International, is a rich dataset which includes 40 traditional, anthropometric measurements of 4400 human subjects. If we could specifically measure the defining features for classification, from this database, then the future question will then be to determine a subset of these features that can be measured from imagery. This paper briefly describes the Diffusion Map technique, shows potential for dimension reduction of the CAESAR database, and describes interesting problems to be further explored.

  1. Mr-Moose: An advanced SED-fitting tool for heterogeneous multi-wavelength datasets

    NASA Astrophysics Data System (ADS)

    Drouart, G.; Falkendal, T.

    2018-04-01

    We present the public release of Mr-Moose, a fitting procedure that is able to perform multi-wavelength and multi-object spectral energy distribution (SED) fitting in a Bayesian framework. This procedure is able to handle a large variety of cases, from an isolated source to blended multi-component sources from an heterogeneous dataset (i.e. a range of observation sensitivities and spectral/spatial resolutions). Furthermore, Mr-Moose handles upper-limits during the fitting process in a continuous way allowing models to be gradually less probable as upper limits are approached. The aim is to propose a simple-to-use, yet highly-versatile fitting tool fro handling increasing source complexity when combining multi-wavelength datasets with fully customisable filter/model databases. The complete control of the user is one advantage, which avoids the traditional problems related to the "black box" effect, where parameter or model tunings are impossible and can lead to overfitting and/or over-interpretation of the results. Also, while a basic knowledge of Python and statistics is required, the code aims to be sufficiently user-friendly for non-experts. We demonstrate the procedure on three cases: two artificially-generated datasets and a previous result from the literature. In particular, the most complex case (inspired by a real source, combining Herschel, ALMA and VLA data) in the context of extragalactic SED fitting, makes Mr-Moose a particularly-attractive SED fitting tool when dealing with partially blended sources, without the need for data deconvolution.

  2. Development of climate data storage and processing model

    NASA Astrophysics Data System (ADS)

    Okladnikov, I. G.; Gordov, E. P.; Titov, A. G.

    2016-11-01

    We present a storage and processing model for climate datasets elaborated in the framework of a virtual research environment (VRE) for climate and environmental monitoring and analysis of the impact of climate change on the socio-economic processes on local and regional scales. The model is based on a «shared nothings» distributed computing architecture and assumes using a computing network where each computing node is independent and selfsufficient. Each node holds a dedicated software for the processing and visualization of geospatial data providing programming interfaces to communicate with the other nodes. The nodes are interconnected by a local network or the Internet and exchange data and control instructions via SSH connections and web services. Geospatial data is represented by collections of netCDF files stored in a hierarchy of directories in the framework of a file system. To speed up data reading and processing, three approaches are proposed: a precalculation of intermediate products, a distribution of data across multiple storage systems (with or without redundancy), and caching and reuse of the previously obtained products. For a fast search and retrieval of the required data, according to the data storage and processing model, a metadata database is developed. It contains descriptions of the space-time features of the datasets available for processing, their locations, as well as descriptions and run options of the software components for data analysis and visualization. The model and the metadata database together will provide a reliable technological basis for development of a high- performance virtual research environment for climatic and environmental monitoring.

  3. External validation of a publicly available computer assisted diagnostic tool for mammographic mass lesions with two high prevalence research datasets.

    PubMed

    Benndorf, Matthias; Burnside, Elizabeth S; Herda, Christoph; Langer, Mathias; Kotter, Elmar

    2015-08-01

    Lesions detected at mammography are described with a highly standardized terminology: the breast imaging-reporting and data system (BI-RADS) lexicon. Up to now, no validated semantic computer assisted classification algorithm exists to interactively link combinations of morphological descriptors from the lexicon to a probabilistic risk estimate of malignancy. The authors therefore aim at the external validation of the mammographic mass diagnosis (MMassDx) algorithm. A classification algorithm like MMassDx must perform well in a variety of clinical circumstances and in datasets that were not used to generate the algorithm in order to ultimately become accepted in clinical routine. The MMassDx algorithm uses a naïve Bayes network and calculates post-test probabilities of malignancy based on two distinct sets of variables, (a) BI-RADS descriptors and age ("descriptor model") and (b) BI-RADS descriptors, age, and BI-RADS assessment categories ("inclusive model"). The authors evaluate both the MMassDx (descriptor) and MMassDx (inclusive) models using two large publicly available datasets of mammographic mass lesions: the digital database for screening mammography (DDSM) dataset, which contains two subsets from the same examinations-a medio-lateral oblique (MLO) view and cranio-caudal (CC) view dataset-and the mammographic mass (MM) dataset. The DDSM contains 1220 mass lesions and the MM dataset contains 961 mass lesions. The authors evaluate discriminative performance using area under the receiver-operating-characteristic curve (AUC) and compare this to the BI-RADS assessment categories alone (i.e., the clinical performance) using the DeLong method. The authors also evaluate whether assigned probabilistic risk estimates reflect the lesions' true risk of malignancy using calibration curves. The authors demonstrate that the MMassDx algorithms show good discriminatory performance. AUC for the MMassDx (descriptor) model in the DDSM data is 0.876/0.895 (MLO/CC view) and AUC for the MMassDx (inclusive) model in the DDSM data is 0.891/0.900 (MLO/CC view). AUC for the MMassDx (descriptor) model in the MM data is 0.862 and AUC for the MMassDx (inclusive) model in the MM data is 0.900. In all scenarios, MMassDx performs significantly better than clinical performance, P < 0.05 each. The authors furthermore demonstrate that the MMassDx algorithm systematically underestimates the risk of malignancy in the DDSM and MM datasets, especially when low probabilities of malignancy are assigned. The authors' results reveal that the MMassDx algorithms have good discriminatory performance but less accurate calibration when tested on two independent validation datasets. Improvement in calibration and testing in a prospective clinical population will be important steps in the pursuit of translation of these algorithms to the clinic.

  4. A compatible exon-exon junction database for the identification of exon skipping events using tandem mass spectrum data.

    PubMed

    Mo, Fan; Hong, Xu; Gao, Feng; Du, Lin; Wang, Jun; Omenn, Gilbert S; Lin, Biaoyang

    2008-12-16

    Alternative splicing is an important gene regulation mechanism. It is estimated that about 74% of multi-exon human genes have alternative splicing. High throughput tandem (MS/MS) mass spectrometry provides valuable information for rapidly identifying potentially novel alternatively-spliced protein products from experimental datasets. However, the ability to identify alternative splicing events through tandem mass spectrometry depends on the database against which the spectra are searched. We wrote scripts in perl, Bioperl, mysql and Ensembl API and built a theoretical exon-exon junction protein database to account for all possible combinations of exons for a gene while keeping the frame of translation (i.e., keeping only in-phase exon-exon combinations) from the Ensembl Core Database. Using our liver cancer MS/MS dataset, we identified a total of 488 non-redundant peptides that represent putative exon skipping events. Our exon-exon junction database provides the scientific community with an efficient means to identify novel alternatively spliced (exon skipping) protein isoforms using mass spectrometry data. This database will be useful in annotating genome structures using rapidly accumulating proteomics data.

  5. The National NeuroAIDS Tissue Consortium (NNTC) Database: an integrated database for HIV-related studies.

    PubMed

    Cserhati, Matyas F; Pandey, Sanjit; Beaudoin, James J; Baccaglini, Lorena; Guda, Chittibabu; Fox, Howard S

    2015-01-01

    We herein present the National NeuroAIDS Tissue Consortium-Data Coordinating Center (NNTC-DCC) database, which is the only available database for neuroAIDS studies that contains data in an integrated, standardized form. This database has been created in conjunction with the NNTC, which provides human tissue and biofluid samples to individual researchers to conduct studies focused on neuroAIDS. The database contains experimental datasets from 1206 subjects for the following categories (which are further broken down into subcategories): gene expression, genotype, proteins, endo-exo-chemicals, morphometrics and other (miscellaneous) data. The database also contains a wide variety of downloadable data and metadata for 95 HIV-related studies covering 170 assays from 61 principal investigators. The data represent 76 tissue types, 25 measurement types, and 38 technology types, and reaches a total of 33,017,407 data points. We used the ISA platform to create the database and develop a searchable web interface for querying the data. A gene search tool is also available, which searches for NCBI GEO datasets associated with selected genes. The database is manually curated with many user-friendly features, and is cross-linked to the NCBI, HUGO and PubMed databases. A free registration is required for qualified users to access the database. © The Author(s) 2015. Published by Oxford University Press.

  6. Architectural Implications for Spatial Object Association Algorithms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kumar, V S; Kurc, T; Saltz, J

    2009-01-29

    Spatial object association, also referred to as cross-match of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server R, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation providesmore » insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST).« less

  7. Karst database development in Minnesota: Design and data assembly

    USGS Publications Warehouse

    Gao, Y.; Alexander, E.C.; Tipping, R.G.

    2005-01-01

    The Karst Feature Database (KFD) of Minnesota is a relational GIS-based Database Management System (DBMS). Previous karst feature datasets used inconsistent attributes to describe karst features in different areas of Minnesota. Existing metadata were modified and standardized to represent a comprehensive metadata for all the karst features in Minnesota. Microsoft Access 2000 and ArcView 3.2 were used to develop this working database. Existing county and sub-county karst feature datasets have been assembled into the KFD, which is capable of visualizing and analyzing the entire data set. By November 17 2002, 11,682 karst features were stored in the KFD of Minnesota. Data tables are stored in a Microsoft Access 2000 DBMS and linked to corresponding ArcView applications. The current KFD of Minnesota has been moved from a Windows NT server to a Windows 2000 Citrix server accessible to researchers and planners through networked interfaces. ?? Springer-Verlag 2005.

  8. Atlas of Iberian water beetles (ESACIB database).

    PubMed

    Sánchez-Fernández, David; Millán, Andrés; Abellán, Pedro; Picazo, Félix; Carbonell, José A; Ribera, Ignacio

    2015-01-01

    The ESACIB ('EScarabajos ACuáticos IBéricos') database is provided, including all available distributional data of Iberian and Balearic water beetles from the literature up to 2013, as well as from museum and private collections, PhD theses, and other unpublished sources. The database contains 62,015 records with associated geographic data (10×10 km UTM squares) for 488 species and subspecies of water beetles, 120 of them endemic to the Iberian Peninsula and eight to the Balearic Islands. This database was used for the elaboration of the "Atlas de los Coleópteros Acuáticos de España Peninsular". In this dataset data of 15 additional species has been added: 11 that occur in the Balearic Islands or mainland Portugal but not in peninsular Spain and an other four with mainly terrestrial habits within the genus Helophorus (for taxonomic coherence). The complete dataset is provided in Darwin Core Archive format.

  9. Atlas of Iberian water beetles (ESACIB database)

    PubMed Central

    Sánchez-Fernández, David; Millán, Andrés; Abellán, Pedro; Picazo, Félix; Carbonell, José A.; Ribera, Ignacio

    2015-01-01

    Abstract The ESACIB (‘EScarabajos ACuáticos IBéricos’) database is provided, including all available distributional data of Iberian and Balearic water beetles from the literature up to 2013, as well as from museum and private collections, PhD theses, and other unpublished sources. The database contains 62,015 records with associated geographic data (10×10 km UTM squares) for 488 species and subspecies of water beetles, 120 of them endemic to the Iberian Peninsula and eight to the Balearic Islands. This database was used for the elaboration of the “Atlas de los Coleópteros Acuáticos de España Peninsular”. In this dataset data of 15 additional species has been added: 11 that occur in the Balearic Islands or mainland Portugal but not in peninsular Spain and an other four with mainly terrestrial habits within the genus Helophorus (for taxonomic coherence). The complete dataset is provided in Darwin Core Archive format. PMID:26448717

  10. Integration of SRTM and TRMM date into the GIS-based hydrological model for the purpose of flood modelling

    NASA Astrophysics Data System (ADS)

    Akbari, A.; Abu Samah, A.; Othman, F.

    2012-04-01

    Due to land use and climate changes, more severe and frequent floods occur worldwide. Flood simulation as the first step in flood risk management can be robustly conducted with integration of GIS, RS and flood modeling tools. The primary goal of this research is to examine the practical use of public domain satellite data and GIS-based hydrologic model. Firstly, database development process is described. GIS tools and techniques were used in the light of relevant literature to achieve the appropriate database. Watershed delineation and parameterizations were carried out using cartographic DEM derived from digital topography at a scale of 1:25 000 with 30 m cell size and SRTM elevation data at 30 m cell size. The SRTM elevation dataset is evaluated and compared with cartographic DEM. With the assistance of statistical measures such as Correlation coefficient (r), Nash-Sutcliffe efficiency (NSE), Percent Bias (PBias) or Percent of Error (PE). According to NSE index, SRTM-DEM can be used for watershed delineation and parameterization with 87% similarity with Topo-DEM in a complex and underdeveloped terrains. Primary TRMM (V6) data was used as satellite based hytograph for rainfall-runoff simulation. The SCS-CN approach was used for losses and kinematic routing method employed for hydrograph transformation through the reaches. It is concluded that TRMM estimates do not give adequate information about the storms as it can be drawn from the rain gauges. Event-based flood modeling using HEC-HMS proved that SRTM elevation dataset has the ability to obviate the lack of terrain data for hydrologic modeling where appropriate data for terrain modeling and simulation of hydrological processes is unavailable. However, TRMM precipitation estimates failed to explain the behavior of rainfall events and its resultant peak discharge and time of peak.

  11. Machine learning for medical images analysis.

    PubMed

    Criminisi, A

    2016-10-01

    This article discusses the application of machine learning for the analysis of medical images. Specifically: (i) We show how a special type of learning models can be thought of as automatically optimized, hierarchically-structured, rule-based algorithms, and (ii) We discuss how the issue of collecting large labelled datasets applies to both conventional algorithms as well as machine learning techniques. The size of the training database is a function of model complexity rather than a characteristic of machine learning methods. Crown Copyright © 2016. Published by Elsevier B.V. All rights reserved.

  12. The Cambridge MRI database for animal models of Huntington disease.

    PubMed

    Sawiak, Stephen J; Morton, A Jennifer

    2016-01-01

    We describe the Cambridge animal brain magnetic resonance imaging repository comprising 400 datasets to date from mouse models of Huntington disease. The data include raw images as well as segmented grey and white matter images with maps of cortical thickness. All images and phenotypic data for each subject are freely-available without restriction from (http://www.dspace.cam.ac.uk/handle/1810/243361/). Software and anatomical population templates optimised for animal brain analysis with MRI are also available from this site. Copyright © 2015. Published by Elsevier Inc.

  13. ToxMiner Software Interface for Visualizing and Analyzing ToxCast Data

    EPA Science Inventory

    The ToxCast dataset represents a collection of assays and endpoints that will require both standard statistical approaches as well as customized data analysis workflows. To analyze this unique dataset, we have developed an integrated database with Javabased interface called ToxMi...

  14. Aggregating todays data for tomorrows science: a geological use case

    NASA Astrophysics Data System (ADS)

    Glaves, H.; Kingdon, A.; Nayembil, M.; Baker, G.

    2016-12-01

    Geoscience data is made up of diverse and complex smaller datasets that, when aggregated together, build towards what is recognised as `big data'. The British Geological Survey (BGS), which acts as a repository for all subsurface data from the United Kingdom, has been collating these disparate small datasets that have been accumulated from the activities of a large number of geoscientists over many years. Recently this picture has been further complicated by the addition of new data sources such as near real-time sensor data, and industry or community data that is increasingly delivered via automatic donations. Many of these datasets have been aggregated in relational databases to form larger ones that are used to address a variety of issues ranging from development of national infrastructure to disaster response. These complex domain-specific SQL databases deliver effective data management using normalised subject-based database designs in a secure environment. However, the isolated subject-oriented design of these systems inhibits efficient cross-domain querying of the datasets. Additionally, the tools provided often do not enable effective data discovery as they have problems resolving the complex underlying normalised structures. Recent requirements to understand sub-surface geology in three dimensions have led BGS to develop new data systems. One such solution is PropBase which delivers a generic denormalised data structure within an RDBMS to store geological property data. Propbase facilitates rapid and standardised data discovery and access, incorporating 2D and 3D physical and chemical property data, including associated metadata. It also provides a dedicated web interface to deliver complex multiple data sets from a single database in standardised common output formats (e.g. CSV, GIS shape files) without the need for complex data conditioning. PropBase facilitates new scientific research, previously considered impractical, by enabling property data searches across multiple databases. Using the Propbase exemplar this presentation will seek to illustrate how BGS has developed systems for aggregating `small datasets' to create the `big data' necessary for the data analytics, mining, processing and visualisation needed for future geoscientific research.

  15. Verification of road databases using multiple road models

    NASA Astrophysics Data System (ADS)

    Ziems, Marcel; Rottensteiner, Franz; Heipke, Christian

    2017-08-01

    In this paper a new approach for automatic road database verification based on remote sensing images is presented. In contrast to existing methods, the applicability of the new approach is not restricted to specific road types, context areas or geographic regions. This is achieved by combining several state-of-the-art road detection and road verification approaches that work well under different circumstances. Each one serves as an independent module representing a unique road model and a specific processing strategy. All modules provide independent solutions for the verification problem of each road object stored in the database in form of two probability distributions, the first one for the state of a database object (correct or incorrect), and a second one for the state of the underlying road model (applicable or not applicable). In accordance with the Dempster-Shafer Theory, both distributions are mapped to a new state space comprising the classes correct, incorrect and unknown. Statistical reasoning is applied to obtain the optimal state of a road object. A comparison with state-of-the-art road detection approaches using benchmark datasets shows that in general the proposed approach provides results with larger completeness. Additional experiments reveal that based on the proposed method a highly reliable semi-automatic approach for road data base verification can be designed.

  16. The Global Earthquake Model - Past, Present, Future

    NASA Astrophysics Data System (ADS)

    Smolka, Anselm; Schneider, John; Stein, Ross

    2014-05-01

    The Global Earthquake Model (GEM) is a unique collaborative effort that aims to provide organizations and individuals with tools and resources for transparent assessment of earthquake risk anywhere in the world. By pooling data, knowledge and people, GEM acts as an international forum for collaboration and exchange. Sharing of data and risk information, best practices, and approaches across the globe are key to assessing risk more effectively. Through consortium driven global projects, open-source IT development and collaborations with more than 10 regions, leading experts are developing unique global datasets, best practice, open tools and models for seismic hazard and risk assessment. The year 2013 has seen the completion of ten global data sets or components addressing various aspects of earthquake hazard and risk, as well as two GEM-related, but independently managed regional projects SHARE and EMME. Notably, the International Seismological Centre (ISC) led the development of a new ISC-GEM global instrumental earthquake catalogue, which was made publicly available in early 2013. It has set a new standard for global earthquake catalogues and has found widespread acceptance and application in the global earthquake community. By the end of 2014, GEM's OpenQuake computational platform will provide the OpenQuake hazard/risk assessment software and integrate all GEM data and information products. The public release of OpenQuake is planned for the end of this 2014, and will comprise the following datasets and models: • ISC-GEM Instrumental Earthquake Catalogue (released January 2013) • Global Earthquake History Catalogue [1000-1903] • Global Geodetic Strain Rate Database and Model • Global Active Fault Database • Tectonic Regionalisation Model • Global Exposure Database • Buildings and Population Database • Earthquake Consequences Database • Physical Vulnerabilities Database • Socio-Economic Vulnerability and Resilience Indicators • Seismic Source Models • Ground Motion (Attenuation) Models • Physical Exposure Models • Physical Vulnerability Models • Composite Index Models (social vulnerability, resilience, indirect loss) • Repository of national hazard models • Uniform global hazard model Armed with these tools and databases, stakeholders worldwide will then be able to calculate, visualise and investigate earthquake risk, capture new data and to share their findings for joint learning. Earthquake hazard information will be able to be combined with data on exposure (buildings, population) and data on their vulnerability, for risk assessment around the globe. Furthermore, for a truly integrated view of seismic risk, users will be able to add social vulnerability and resilience indices and estimate the costs and benefits of different risk management measures. Having finished its first five-year Work Program at the end of 2013, GEM has entered into its second five-year Work Program 2014-2018. Beyond maintaining and enhancing the products developed in Work Program 1, the second phase will have a stronger focus on regional hazard and risk activities, and on seeing GEM products used for risk assessment and risk management practice at regional, national and local scales. Furthermore GEM intends to partner with similar initiatives underway for other natural perils, which together are needed to meet the need for advanced risk assessment methods, tools and data to underpin global disaster risk reduction efforts under the Hyogo Framework for Action #2 to be launched in Sendai/Japan in spring 2015

  17. Developing a Resource for Implementing ArcSWAT Using Global Datasets

    NASA Astrophysics Data System (ADS)

    Taggart, M.; Caraballo Álvarez, I. O.; Mueller, C.; Palacios, S. L.; Schmidt, C.; Milesi, C.; Palmer-Moloney, L. J.

    2015-12-01

    This project developed a comprehensive user manual outlining methods for adapting and implementing global datasets for use within ArcSWAT for international and worldwide applications. The Soil and Water Assessment Tool (SWAT) is a hydrologic model that looks at a number of hydrologic variables including runoff and the chemical makeup of water at a given location on the Earth's surface using Digital Elevation Models (DEM), land cover, soil, and weather data. However, the application of ArcSWAT for projects outside of the United States is challenging as there is no standard framework for inputting global datasets into ArcSWAT. This project aims to remove this obstacle by outlining methods for adapting and implementing these global datasets via the user manual. The manual takes the user through the processes of data conditioning while providing solutions and suggestions for common errors. The efficacy of the manual was explored using examples from watersheds located in Puerto Rico, Mexico and Western Africa. Each run explored the various options for setting up a ArcSWAT project as well as a range of satellite data products and soil databases. Future work will incorporate in-situ data for validation and calibration of the model and outline additional resources to assist future users in efficiently implementing the model for worldwide applications. The capacity to manage and monitor freshwater availability is of critical importance in both developed and developing countries. As populations grow and climate changes, both the quality and quantity of freshwater are affected resulting in negative impacts on the health of the surrounding population. The use of hydrologic models such as ArcSWAT can help stakeholders and decision makers understand the future impacts of these changes enabling informed and substantiated decisions.

  18. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  19. The Cardiac Atlas Project--an imaging database for computational modeling and statistical atlases of the heart.

    PubMed

    Fonseca, Carissa G; Backhaus, Michael; Bluemke, David A; Britten, Randall D; Chung, Jae Do; Cowan, Brett R; Dinov, Ivo D; Finn, J Paul; Hunter, Peter J; Kadish, Alan H; Lee, Daniel C; Lima, Joao A C; Medrano-Gracia, Pau; Shivkumar, Kalyanam; Suinesiaputra, Avan; Tao, Wenchao; Young, Alistair A

    2011-08-15

    Integrative mathematical and statistical models of cardiac anatomy and physiology can play a vital role in understanding cardiac disease phenotype and planning therapeutic strategies. However, the accuracy and predictive power of such models is dependent upon the breadth and depth of noninvasive imaging datasets. The Cardiac Atlas Project (CAP) has established a large-scale database of cardiac imaging examinations and associated clinical data in order to develop a shareable, web-accessible, structural and functional atlas of the normal and pathological heart for clinical, research and educational purposes. A goal of CAP is to facilitate collaborative statistical analysis of regional heart shape and wall motion and characterize cardiac function among and within population groups. Three main open-source software components were developed: (i) a database with web-interface; (ii) a modeling client for 3D + time visualization and parametric description of shape and motion; and (iii) open data formats for semantic characterization of models and annotations. The database was implemented using a three-tier architecture utilizing MySQL, JBoss and Dcm4chee, in compliance with the DICOM standard to provide compatibility with existing clinical networks and devices. Parts of Dcm4chee were extended to access image specific attributes as search parameters. To date, approximately 3000 de-identified cardiac imaging examinations are available in the database. All software components developed by the CAP are open source and are freely available under the Mozilla Public License Version 1.1 (http://www.mozilla.org/MPL/MPL-1.1.txt). http://www.cardiacatlas.org a.young@auckland.ac.nz Supplementary data are available at Bioinformatics online.

  20. Synchronising data sources and filling gaps by global hydrological modelling

    NASA Astrophysics Data System (ADS)

    Pimentel, Rafael; Crochemore, Louise; Hasan, Abdulghani; Pineda, Luis; Isberg, Kristina; Arheimer, Berit

    2017-04-01

    The advances in remote sensing in the last decades combined with the creation of different open hydrological databases have generated a very large amount of useful information for global hydrological modelling. Working with this huge number of datasets to set up a global hydrological model can constitute challenges such as multiple data formats and big heterogeneity on spatial and temporal resolutions. Different initiatives have made effort to homogenize some of these data sources, i.e. GRDC (Global Runoff Data Center), HYDROSHEDS (SHuttle Elevation Derivatives at multiple Scales), GLWD (Global Lake and Wetland Database) for runoff, watershed delineation and water bodies respectively. However, not all the related issues are covered or homogenously solved at the global scale and new information is continuously available to complete the current ones. This work presents synchronising efforts to make use of different global data sources needed to set up the semi-distributed hydrological model HYPE (Hydrological Predictions for the Environment) at the global scale. These data sources included: topography for watershed delineation, gauging stations of river flow, and extention of lakes, flood plains and land cover classes. A new database with approximately 100 000 subbasins, with an average area of 1000 km2, was created. Subbasin delineation was done combining Global Width Database for Large River (GWD-LR), SRTM high-resolution elevation data and a number of forced points of interest (gauging station of river flow, lakes, reservoirs, urban areas, nuclear plants and areas with high risk of flooding). Regarding flow data, the locations of GRDC stations were checked or placed along the river network when necessary, and completed with available information from national water services in data-sparse regions. A screening of doublet stations and associated time series was necessary to efficiently combine the two types of data sources. A total number about 21 000 stations were considered as forced point. In the case of lakes, some updating relating with location and area, of GLWD was done using esa (European Space Agency) gridded water bodies dataset. Many of the original lakes were shifted in relation with topography and some of them change their extension since the creation of the database. Moreover, the location of the outlet of all these lakes was also calculated. A new definition of global floodplain areas was also included. The land covers provided by ESA and some elevation criteria were used to define elevation land classes (ELC) using for the definition of the properties of each one of the proposed subbasin. All these new features: a) the inclusion of river width in the delineation of the subbasin, going further in the consideration of river shape; b) the merging of several data bases of gauging stations of river flow into an extended global dataset; c) coherent location of the lakes, river networks and floodplains; and d) a new definition of hydrological response units also considering elevation of the subbasins, will contribute to a better implementation of global hydrological models. The first results of world-wide HYPE will be shown but the model will yet not be fully calibrated using multi-sources of observed data and information. The ambition is to receive a global scale model which can also be useful at local scales. Starting with the global picture and then going into the details.

  1. The Disease Portals, disease-gene annotation and the RGD disease ontology at the Rat Genome Database.

    PubMed

    Hayman, G Thomas; Laulederkind, Stanley J F; Smith, Jennifer R; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD;http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene-disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL:http://rgd.mcw.edu. © The Author(s) 2016. Published by Oxford University Press.

  2. Scrubchem: Building Bioactivity Datasets from Pubchem ...

    EPA Pesticide Factsheets

    The PubChem Bioassay database is a non-curated public repository with data from 64 sources, including: ChEMBL, BindingDb, DrugBank, EPA Tox21, NIH Molecular Libraries Screening Program, and various other academic, government, and industrial contributors. Methods for extracting this public data into quality datasets, useable for analytical research, presents several big-data challenges for which we have designed manageable solutions. According to our preliminary work, there are approximately 549 million bioactivity values and related meta-data within PubChem that can be mapped to over 10,000 biological targets. However, this data is not ready for use in data-driven research, mainly due to lack of structured annotations.We used a pragmatic approach that provides increasing access to bioactivity values in the PubChem Bioassay database. This included restructuring of individual PubChem Bioassay files into a relational database (ScrubChem). ScrubChem contains all primary PubChem Bioassay data that was: reparsed; error-corrected (when applicable); enriched with additional data links from other NCBI databases; and improved by adding key biological and assay annotations derived from logic-based language processing rules. The utility of ScrubChem and the curation process were illustrated using an example bioactivity dataset for the androgen receptor protein. This initial work serves as a trial ground for establishing the technical framework for accessing, integrating, cu

  3. Producing a Climate-Quality Database of Global Upper Ocean Profile Temperatures - The IQuOD (International Quality-controlled Ocean Database) Project.

    NASA Astrophysics Data System (ADS)

    Sprintall, J.; Cowley, R.; Palmer, M. D.; Domingues, C. M.; Suzuki, T.; Ishii, M.; Boyer, T.; Goni, G. J.; Gouretski, V. V.; Macdonald, A. M.; Thresher, A.; Good, S. A.; Diggs, S. C.

    2016-02-01

    Historical ocean temperature profile observations provide a critical element for a host of ocean and climate research activities. These include providing initial conditions for seasonal-to-decadal prediction systems, evaluating past variations in sea level and Earth's energy imbalance, ocean state estimation for studying variability and change, and climate model evaluation and development. The International Quality controlled Ocean Database (IQuOD) initiative represents a community effort to create the most globally complete temperature profile dataset, with (intelligent) metadata and assigned uncertainties. With an internationally coordinated effort organized by oceanographers, with data and ocean instrumentation expertise, and in close consultation with end users (e.g., climate modelers), the IQuOD initiative will assess and maximize the potential of an irreplaceable collection of ocean temperature observations (tens of millions of profiles collected at a cost of tens of billions of dollars, since 1772) to fulfil the demand for a climate-quality global database that can be used with greater confidence in a vast range of climate change related research and services of societal benefit. Progress towards version 1 of the IQuOD database, ongoing and future work will be presented. More information on IQuOD is available at www.iquod.org.

  4. Plant Reactome: a resource for plant pathways and comparative analysis.

    PubMed

    Naithani, Sushma; Preece, Justin; D'Eustachio, Peter; Gupta, Parul; Amarasinghe, Vindhya; Dharmawardhana, Palitha D; Wu, Guanming; Fabregat, Antonio; Elser, Justin L; Weiser, Joel; Keays, Maria; Fuentes, Alfonso Munoz-Pomer; Petryszak, Robert; Stein, Lincoln D; Ware, Doreen; Jaiswal, Pankaj

    2017-01-04

    Plant Reactome (http://plantreactome.gramene.org/) is a free, open-source, curated plant pathway database portal, provided as part of the Gramene project. The database provides intuitive bioinformatics tools for the visualization, analysis and interpretation of pathway knowledge to support genome annotation, genome analysis, modeling, systems biology, basic research and education. Plant Reactome employs the structural framework of a plant cell to show metabolic, transport, genetic, developmental and signaling pathways. We manually curate molecular details of pathways in these domains for reference species Oryza sativa (rice) supported by published literature and annotation of well-characterized genes. Two hundred twenty-two rice pathways, 1025 reactions associated with 1173 proteins, 907 small molecules and 256 literature references have been curated to date. These reference annotations were used to project pathways for 62 model, crop and evolutionarily significant plant species based on gene homology. Database users can search and browse various components of the database, visualize curated baseline expression of pathway-associated genes provided by the Expression Atlas and upload and analyze their Omics datasets. The database also offers data access via Application Programming Interfaces (APIs) and in various standardized pathway formats, such as SBML and BioPAX. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  5. Integrating forensic information in a crime intelligence database.

    PubMed

    Rossy, Quentin; Ioset, Sylvain; Dessimoz, Damien; Ribaux, Olivier

    2013-07-10

    Since 2008, intelligence units of six states of the western part of Switzerland have been sharing a common database for the analysis of high volume crimes. On a daily basis, events reported to the police are analysed, filtered and classified to detect crime repetitions and interpret the crime environment. Several forensic outcomes are integrated in the system such as matches of traces with persons, and links between scenes detected by the comparison of forensic case data. Systematic procedures have been settled to integrate links assumed mainly through DNA profiles, shoemarks patterns and images. A statistical outlook on a retrospective dataset of series from 2009 to 2011 of the database informs for instance on the number of repetition detected or confirmed and increased by forensic case data. Time needed to obtain forensic intelligence in regard with the type of marks treated, is seen as a critical issue. Furthermore, the underlying integration process of forensic intelligence into the crime intelligence database raised several difficulties in regards of the acquisition of data and the models used in the forensic databases. Solutions found and adopted operational procedures are described and discussed. This process form the basis to many other researches aimed at developing forensic intelligence models. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.

  6. Accuracy of Probabilistic Linkage Using the Enhanced Matching System for Public Health and Epidemiological Studies.

    PubMed

    Aldridge, Robert W; Shaji, Kunju; Hayward, Andrew C; Abubakar, Ibrahim

    2015-01-01

    The Enhanced Matching System (EMS) is a probabilistic record linkage program developed by the tuberculosis section at Public Health England to match data for individuals across two datasets. This paper outlines how EMS works and investigates its accuracy for linkage across public health datasets. EMS is a configurable Microsoft SQL Server database program. To examine the accuracy of EMS, two public health databases were matched using National Health Service (NHS) numbers as a gold standard unique identifier. Probabilistic linkage was then performed on the same two datasets without inclusion of NHS number. Sensitivity analyses were carried out to examine the effect of varying matching process parameters. Exact matching using NHS number between two datasets (containing 5931 and 1759 records) identified 1071 matched pairs. EMS probabilistic linkage identified 1068 record pairs. The sensitivity of probabilistic linkage was calculated as 99.5% (95%CI: 98.9, 99.8), specificity 100.0% (95%CI: 99.9, 100.0), positive predictive value 99.8% (95%CI: 99.3, 100.0), and negative predictive value 99.9% (95%CI: 99.8, 100.0). Probabilistic matching was most accurate when including address variables and using the automatically generated threshold for determining links with manual review. With the establishment of national electronic datasets across health and social care, EMS enables previously unanswerable research questions to be tackled with confidence in the accuracy of the linkage process. In scenarios where a small sample is being matched into a very large database (such as national records of hospital attendance) then, compared to results presented in this analysis, the positive predictive value or sensitivity may drop according to the prevalence of matches between databases. Despite this possible limitation, probabilistic linkage has great potential to be used where exact matching using a common identifier is not possible, including in low-income settings, and for vulnerable groups such as homeless populations, where the absence of unique identifiers and lower data quality has historically hindered the ability to identify individuals across datasets.

  7. Development of large scale riverine terrain-bathymetry dataset by integrating NHDPlus HR with NED,CoNED and HAND data

    NASA Astrophysics Data System (ADS)

    Li, Z.; Clark, E. P.

    2017-12-01

    Large scale and fine resolution riverine bathymetry data is critical for flood inundation modelingbut not available over the continental United States (CONUS). Previously we implementedbankfull hydraulic geometry based approaches to simulate bathymetry for individual riversusing NHDPlus v2.1 data and 10 m National Elevation Dataset (NED). USGS has recentlydeveloped High Resolution NHD data (NHDPlus HR Beta) (USGS, 2017), and thisenhanced dataset has a significant improvement on its spatial correspondence with 10 m DEM.In this study, we used this high resolution data, specifically NHDFlowline and NHDArea,to create bathymetry/terrain for CONUS river channels and floodplains. A software packageNHDPlus Inundation Modeler v5.0 Beta was developed for this project as an Esri ArcGIShydrological analysis extension. With the updated tools, raw 10 m DEM was first hydrologicallytreated to remove artificial blockages (e.g., overpasses, bridges and eve roadways, etc.) usinglow pass moving window filters. Cross sections were then automatically constructed along eachflowline to extract elevation from the hydrologically treated DEM. In this study, river channelshapes were approximated using quadratic curves to reduce uncertainties from commonly usedtrapezoids. We calculated underneath water channel elevation at each cross section samplingpoint using bankfull channel dimensions that were estimated from physiographicprovince/division based regression equations (Bieger et al. 2015). These elevation points werethen interpolated to generate bathymetry raster. The simulated bathymetry raster wasintegrated with USGS NED and Coastal National Elevation Database (CoNED) (whereveravailable) to make seamless terrain-bathymetry dataset. Channel bathymetry was alsointegrated to the HAND (Height above Nearest Drainage) dataset to improve large scaleinundation modeling. The generated terrain-bathymetry was processed at WatershedBoundary Dataset Hydrologic Unit 4 (WBDHU4) level.

  8. Hg concentrations in fish from coastal waters of California and Western North America

    USGS Publications Warehouse

    Davis, Jay; Ross, John; Bezalel, Shira; Sim, Lawrence; Bonnema, Autumn; Ichikawa, Gary; Heim, Wes; Schiff, Kenneth C; Eagles-Smith, Collin A.; Ackerman, Joshua T.

    2016-01-01

    The State of California conducted an extensive and systematic survey of mercury (Hg) in fish from the California coast in 2009 and 2010. The California survey sampled 3483 fish representing 46 species at 68 locations, and demonstrated that methylHg in fish presents a widespread exposure risk to fish consumers. Most of the locations sampled (37 of 68) had a species with an average concentration above 0.3 μg/g wet weight (ww), and 10 locations an average above 1.0 μg/g ww. The recent and robust dataset from California provided a basis for a broader examination of spatial and temporal patterns in fish Hg in coastal waters of Western North America. There is a striking lack of data in publicly accessible databases on Hg and other contaminants in coastal fish. An assessment of the raw data from these databases suggested the presence of relatively high concentrations along the California coast and in Puget Sound, and relatively low concentrations along the coasts of Alaska and Oregon, and the outer coast of Washington. The dataset suggests that Hg concentrations of public health concern can be observed at any location on the coast of Western North America where long-lived predator species are sampled. Output from a linear mixed-effects model resembled the spatial pattern observed for the raw data and suggested, based on the limited dataset, a lack of trend in fish Hg over the nearly 30-year period covered by the dataset. Expanded and continued monitoring, accompanied by rigorous data management procedures, would be of great value in characterizing methylHg exposure, and tracking changes in contamination of coastal fish in response to possible increases in atmospheric Hg emissions in Asia, climate change, and terrestrial Hg control efforts in coastal watersheds.

  9. It's all in the timing: calibrating temporal penalties for biomedical data sharing.

    PubMed

    Xia, Weiyi; Wan, Zhiyu; Yin, Zhijun; Gaupp, James; Liu, Yongtai; Clayton, Ellen Wright; Kantarcioglu, Murat; Vorobeychik, Yevgeniy; Malin, Bradley A

    2018-01-01

    Biomedical science is driven by datasets that are being accumulated at an unprecedented rate, with ever-growing volume and richness. There are various initiatives to make these datasets more widely available to recipients who sign Data Use Certificate agreements, whereby penalties are levied for violations. A particularly popular penalty is the temporary revocation, often for several months, of the recipient's data usage rights. This policy is based on the assumption that the value of biomedical research data depreciates significantly over time; however, no studies have been performed to substantiate this belief. This study investigates whether this assumption holds true and the data science policy implications. This study tests the hypothesis that the value of data for scientific investigators, in terms of the impact of the publications based on the data, decreases over time. The hypothesis is tested formally through a mixed linear effects model using approximately 1200 publications between 2007 and 2013 that used datasets from the Database of Genotypes and Phenotypes, a data-sharing initiative of the National Institutes of Health. The analysis shows that the impact factors for publications based on Database of Genotypes and Phenotypes datasets depreciate in a statistically significant manner. However, we further discover that the depreciation rate is slow, only ∼10% per year, on average. The enduring value of data for subsequent studies implies that revoking usage for short periods of time may not sufficiently deter those who would violate Data Use Certificate agreements and that alternative penalty mechanisms may need to be invoked. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  10. Prediction of drug indications based on chemical interactions and chemical similarities.

    PubMed

    Huang, Guohua; Lu, Yin; Lu, Changhong; Zheng, Mingyue; Cai, Yu-Dong

    2015-01-01

    Discovering potential indications of novel or approved drugs is a key step in drug development. Previous computational approaches could be categorized into disease-centric and drug-centric based on the starting point of the issues or small-scaled application and large-scale application according to the diversity of the datasets. Here, a classifier has been constructed to predict the indications of a drug based on the assumption that interactive/associated drugs or drugs with similar structures are more likely to target the same diseases using a large drug indication dataset. To examine the classifier, it was conducted on a dataset with 1,573 drugs retrieved from Comprehensive Medicinal Chemistry database for five times, evaluated by 5-fold cross-validation, yielding five 1st order prediction accuracies that were all approximately 51.48%. Meanwhile, the model yielded an accuracy rate of 50.00% for the 1st order prediction by independent test on a dataset with 32 other drugs in which drug repositioning has been confirmed. Interestingly, some clinically repurposed drug indications that were not included in the datasets are successfully identified by our method. These results suggest that our method may become a useful tool to associate novel molecules with new indications or alternative indications with existing drugs.

  11. Prediction of Drug Indications Based on Chemical Interactions and Chemical Similarities

    PubMed Central

    Huang, Guohua; Lu, Yin; Lu, Changhong; Cai, Yu-Dong

    2015-01-01

    Discovering potential indications of novel or approved drugs is a key step in drug development. Previous computational approaches could be categorized into disease-centric and drug-centric based on the starting point of the issues or small-scaled application and large-scale application according to the diversity of the datasets. Here, a classifier has been constructed to predict the indications of a drug based on the assumption that interactive/associated drugs or drugs with similar structures are more likely to target the same diseases using a large drug indication dataset. To examine the classifier, it was conducted on a dataset with 1,573 drugs retrieved from Comprehensive Medicinal Chemistry database for five times, evaluated by 5-fold cross-validation, yielding five 1st order prediction accuracies that were all approximately 51.48%. Meanwhile, the model yielded an accuracy rate of 50.00% for the 1st order prediction by independent test on a dataset with 32 other drugs in which drug repositioning has been confirmed. Interestingly, some clinically repurposed drug indications that were not included in the datasets are successfully identified by our method. These results suggest that our method may become a useful tool to associate novel molecules with new indications or alternative indications with existing drugs. PMID:25821813

  12. Mineral and Geochemical Classification From Spectroscopy/Diffraction Through Neural Networks

    NASA Astrophysics Data System (ADS)

    Ferralis, N.; Grossman, J.; Summons, R. E.

    2017-12-01

    Spectroscopy and diffraction techniques are essential for understanding structural, chemical and functional properties of geological materials for Earth and Planetary Sciences. Beyond data collection, quantitative insight relies on experimentally assembled, or computationally derived spectra. Inference on the geochemical or geophysical properties (such as crystallographic order, chemical functionality, elemental composition, etc.) of a particular geological material (mineral, organic matter, etc.) is based on fitting unknown spectra and comparing the fit with consolidated databases. The complexity of fitting highly convoluted spectra, often limits the ability to infer geochemical characteristics, and limits the throughput for extensive datasets. With the emergence of heuristic approaches to pattern recognitions though machine learning, in this work we investigate the possibility and potential of using supervised neural networks trained on available public spectroscopic database to directly infer geochemical parameters from unknown spectra. Using Raman, infrared spectroscopy and powder x-ray diffraction from the publicly available RRUFF database, we train neural network models to classify mineral and organic compounds (pure or mixtures) based on crystallographic structure from diffraction, chemical functionality, elemental composition and bonding from spectroscopy. As expected, the accuracy of the inference is strongly dependent on the quality and extent of the training data. We will identify a series of requirements and guidelines for the training dataset needed to achieve consistent high accuracy inference, along with methods to compensate for limited of data.

  13. Real time reconstruction of quasiperiodic multi parameter physiological signals

    NASA Astrophysics Data System (ADS)

    Ganeshapillai, Gartheeban; Guttag, John

    2012-12-01

    A modern intensive care unit (ICU) has automated analysis systems that depend on continuous uninterrupted real time monitoring of physiological signals such as electrocardiogram (ECG), arterial blood pressure (ABP), and photo-plethysmogram (PPG). These signals are often corrupted by noise, artifacts, and missing data. We present an automated learning framework for real time reconstruction of corrupted multi-parameter nonstationary quasiperiodic physiological signals. The key idea is to learn a patient-specific model of the relationships between signals, and then reconstruct corrupted segments using the information available in correlated signals. We evaluated our method on MIT-BIH arrhythmia data, a two-channel ECG dataset with many clinically significant arrhythmias, and on the CinC challenge 2010 data, a multi-parameter dataset containing ECG, ABP, and PPG. For each, we evaluated both the residual distance between the original signals and the reconstructed signals, and the performance of a heartbeat classifier on a reconstructed ECG signal. At an SNR of 0 dB, the average residual distance on the CinC data was roughly 3% of the energy in the signal, and on the arrhythmia database it was roughly 16%. The difference is attributable to the large amount of diversity in the arrhythmia database. Remarkably, despite the relatively high residual difference, the classification accuracy on the arrhythmia database was still 98%, indicating that our method restored the physiologically important aspects of the signal.

  14. Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform.

    PubMed

    Cao, Jianfang; Chen, Lichao; Wang, Min; Tian, Yun

    2018-01-01

    The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach system speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance.

  15. CT-based manual segmentation and evaluation of paranasal sinuses.

    PubMed

    Pirner, S; Tingelhoff, K; Wagner, I; Westphal, R; Rilk, M; Wahl, F M; Bootz, F; Eichhorn, Klaus W G

    2009-04-01

    Manual segmentation of computed tomography (CT) datasets was performed for robot-assisted endoscope movement during functional endoscopic sinus surgery (FESS). Segmented 3D models are needed for the robots' workspace definition. A total of 50 preselected CT datasets were each segmented in 150-200 coronal slices with 24 landmarks being set. Three different colors for segmentation represent diverse risk areas. Extension and volumetric measurements were performed. Three-dimensional reconstruction was generated after segmentation. Manual segmentation took 8-10 h for each CT dataset. The mean volumes were: right maxillary sinus 17.4 cm(3), left side 17.9 cm(3), right frontal sinus 4.2 cm(3), left side 4.0 cm(3), total frontal sinuses 7.9 cm(3), sphenoid sinus right side 5.3 cm(3), left side 5.5 cm(3), total sphenoid sinus volume 11.2 cm(3). Our manually segmented 3D-models present the patient's individual anatomy with a special focus on structures in danger according to the diverse colored risk areas. For safe robot assistance, the high-accuracy models represent an average of the population for anatomical variations, extension and volumetric measurements. They can be used as a database for automatic model-based segmentation. None of the segmentation methods so far described provide risk segmentation. The robot's maximum distance to the segmented border can be adjusted according to the differently colored areas.

  16. SAR image dataset of military ground targets with multiple poses for ATR

    NASA Astrophysics Data System (ADS)

    Belloni, Carole; Balleri, Alessio; Aouf, Nabil; Merlet, Thomas; Le Caillec, Jean-Marc

    2017-10-01

    Automatic Target Recognition (ATR) is the task of automatically detecting and classifying targets. Recognition using Synthetic Aperture Radar (SAR) images is interesting because SAR images can be acquired at night and under any weather conditions, whereas optical sensors operating in the visible band do not have this capability. Existing SAR ATR algorithms have mostly been evaluated using the MSTAR dataset.1 The problem with the MSTAR is that some of the proposed ATR methods have shown good classification performance even when targets were hidden,2 suggesting the presence of a bias in the dataset. Evaluations of SAR ATR techniques are currently challenging due to the lack of publicly available data in the SAR domain. In this paper, we present a high resolution SAR dataset consisting of images of a set of ground military target models taken at various aspect angles, The dataset can be used for a fair evaluation and comparison of SAR ATR algorithms. We applied the Inverse Synthetic Aperture Radar (ISAR) technique to echoes from targets rotating on a turntable and illuminated with a stepped frequency waveform. The targets in the database consist of four variants of two 1.7m-long models of T-64 and T-72 tanks. The gun, the turret position and the depression angle are varied to form 26 different sequences of images. The emitted signal spanned the frequency range from 13 GHz to 18 GHz to achieve a bandwidth of 5 GHz sampled with 4001 frequency points. The resolution obtained with respect to the size of the model targets is comparable to typical values obtained using SAR airborne systems. Single polarized images (Horizontal-Horizontal) are generated using the backprojection algorithm.3 A total of 1480 images are produced using a 20° integration angle. The images in the dataset are organized in a suggested training and testing set to facilitate a standard evaluation of SAR ATR algorithms.

  17. Ocean data management in OMP Data Service

    NASA Astrophysics Data System (ADS)

    Fleury, Laurence; André, François; Belmahfoud, Nizar; Boichard, Jean-Luc; Brissebrat, Guillaume; Ferré, Hélène; Mière, Arnaud

    2014-05-01

    The Observatoire Midi-Pyrénées Data Service (SEDOO) is a development team, dedicated to environmental data management and dissemination application set up, in the framework of intensive field campaigns and long term observation networks. SEDOO developped some applications dealing with ocean data only, but also generic databases that enable to store and distribute multidisciplinary datasets. SEDOO is in charge of the in situ data management and the data portal for international and multidisciplinary programmes as large as African Monsoon Multidisciplinary Analyses (AMMA) and Mediterranean Integrated STudies at Regional And Local Scales (MISTRALS). The AMMA and MISTRALS databases are distributed and the data portals enable to access datasets managed by other data centres (IPSL, CORIOLIS...) through interoperability protocols (OPeNDAP, xml requests...). AMMA and MISTRALS metadata (data description) are standardized and comply with international standards (ISO 19115-19139; INSPIRE European Directive; Global Change Master Directory Thesaurus). Most of the AMMA and MISTRALS in situ ocean data sets are homogenized and inserted in a relational database, in order to enable accurate data selection and download of different data sets in a shared format. Data selection criteria are location, period, physical property name, physical property range... The data extraction procedure include format output selection among CSV, NetCDF, Nasa Ames... The AMMA database - http://database.amma-international.org/ - contains field campaign observations in the Guinea Gulf (EGEE 2005-2007) and Atlantic Tropical Ocean (AEROSE-II 2006...), as well as long term monitoring data (PIRATA, ARGO...). Operational analysis (MERCATOR) and satellite products (TMI, SSMI...) are managed by IPSL data centre and can be accessed too. They have been projected over regular latitude-longitude grids and converted into the NetCDF format. The MISTRALS data portal - http://mistrals.sedoo.fr/ - enables to access ocean datasets produced by the contributing programmes: Hydrological cycle in the Mediterranean eXperiment (HyMeX), Chemistry-Aerosol Mediterranean eXperiment (ChArMEx), Marine Mediterranean eXperiment (MERMeX)... The programmes include many field campaigns from 2011 to 2015, collecting general and specific properties. Long term monitoring networks, like Mediterranean Ocean Observing System on Environment (MOOSE) or Mediterranean Eurocentre for Underwater Sciences and Technologies (MEUST-SE), contribute to the MISTRALS data portal as well. Relevant model outputs and satellite products managed by external data centres (IPSL, ENEA...) can be accessed too. SEDOO manages the SSS (Sea Surface Salinity) national observation service data: http://sss.sedoo.fr/. SSS aims at collecting, validating, archiving and distributing in situ SSS measurements derived from Voluntary Observing Ship programs. The SSS data user interface enables to built multicriteria data request and download relevant datasets. SEDOO contributes to the SOLWARA project that aims at understanding the oceanic circulation in the Coral Sea and the Solomon Sea and their role in both the climate system and the oceanic chemistry. The research programme include in situ measurements, numerical modelling and compiled analyses of past data. The website http://thredds.sedoo.fr/solwara/ enables to access, visualize and download Solwara gridded data and model simulations, using Thredds associated services (OPEnDAP, NCSS and WMS). In order to improve the application user-friendliness, SSS and SOLWARA web interfaces are JEE applications build with GWT Framework, and share many modules.

  18. Argonne Geothermal Geochemical Database v2.0

    DOE Data Explorer

    Harto, Christopher

    2013-05-22

    A database of geochemical data from potential geothermal sources aggregated from multiple sources as of March 2010. The database contains fields for the location, depth, temperature, pH, total dissolved solids concentration, chemical composition, and date of sampling. A separate tab contains data on non-condensible gas compositions. The database contains records for over 50,000 wells, although many entries are incomplete. Current versions of source documentation are listed in the dataset.

  19. Identifying the missing proteins in human proteome by biological language model.

    PubMed

    Dong, Qiwen; Wang, Kai; Liu, Xuan

    2016-12-23

    With the rapid development of high-throughput sequencing technology, the proteomics research becomes a trendy field in the post genomics era. It is necessary to identify all the native-encoding protein sequences for further function and pathway analysis. Toward that end, the Human Proteome Organization lunched the Human Protein Project in 2011. However many proteins are hard to be detected by experiment methods, which becomes one of the bottleneck in Human Proteome Project. In consideration of the complicatedness of detecting these missing proteins by using wet-experiment approach, here we use bioinformatics method to pre-filter the missing proteins. Since there are analogy between the biological sequences and natural language, the n-gram models from Natural Language Processing field has been used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the "uncertain" category of the neXtProt database. There are 102 proteins deduced by the n-gram model, which have high probability to be native human proteins. We perform a detail analysis on the predicted structure and function of these missing proteins and also compare the high probability proteins with other mass spectrum datasets. The evaluation shows that the results reported here are in good agreement with those obtained by other well-established databases. The analysis shows that 102 proteins may be native gene-coding proteins and some of the missing proteins are membrane or natively disordered proteins which are hard to be detected by experiment methods.

  20. A TEX86 surface sediment database and extended Bayesian calibration

    NASA Astrophysics Data System (ADS)

    Tierney, Jessica E.; Tingley, Martin P.

    2015-06-01

    Quantitative estimates of past temperature changes are a cornerstone of paleoclimatology. For a number of marine sediment-based proxies, the accuracy and precision of past temperature reconstructions depends on a spatial calibration of modern surface sediment measurements to overlying water temperatures. Here, we present a database of 1095 surface sediment measurements of TEX86, a temperature proxy based on the relative cyclization of marine archaeal glycerol dialkyl glycerol tetraether (GDGT) lipids. The dataset is archived in a machine-readable format with geospatial information, fractional abundances of lipids (if available), and metadata. We use this new database to update surface and subsurface temperature calibration models for TEX86 and demonstrate the applicability of the TEX86 proxy to past temperature prediction. The TEX86 database confirms that surface sediment GDGT distribution has a strong relationship to temperature, which accounts for over 70% of the variance in the data. Future efforts, made possible by the data presented here, will seek to identify variables with secondary relationships to GDGT distributions, such as archaeal community composition.

  1. Relax with CouchDB - Into the non-relational DBMS era of Bioinformatics

    PubMed Central

    Manyam, Ganiraju; Payton, Michelle A.; Roth, Jack A.; Abruzzo, Lynne V.; Coombes, Kevin R.

    2012-01-01

    With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services. PMID:22609849

  2. USGS Mineral Resources Program; national maps and datasets for research and land planning

    USGS Publications Warehouse

    Nicholson, S.W.; Stoeser, D.B.; Ludington, S.D.; Wilson, Frederic H.

    2001-01-01

    The U.S. Geological Survey, the Nation’s leader in producing and maintaining earth science data, serves as an advisor to Congress, the Department of the Interior, and many other Federal and State agencies. Nationwide datasets that are easily available and of high quality are critical for addressing a wide range of land-planning, resource, and environmental issues. Four types of digital databases (geological, geophysical, geochemical, and mineral occurrence) are being compiled and upgraded by the Mineral Resources Program on regional and national scales to meet these needs. Where existing data are incomplete, new data are being collected to ensure national coverage. Maps and analyses produced from these databases provide basic information essential for mineral resource assessments and environmental studies, as well as fundamental information for regional and national land-use studies. Maps and analyses produced from the databases are instrumental to ongoing basic research, such as the identification of mineral deposit origins, determination of regional background values of chemical elements with known environmental impact, and study of the relationships between toxic elements or mining practices to human health. As datasets are completed or revised, the information is made available through a variety of media, including the Internet. Much of the available information is the result of cooperative activities with State and other Federal agencies. The upgraded Mineral Resources Program datasets make geologic, geophysical, geochemical, and mineral occurrence information at the state, regional, and national scales available to members of Congress, State and Federal government agencies, researchers in academia, and the general public. The status of the Mineral Resources Program datasets is outlined below.

  3. 1-km Global Anthropogenic Heat Flux Database for Urban Climate Studies

    NASA Astrophysics Data System (ADS)

    Dong, Y.; Varquez, A. C. G.; Kanda, M.

    2016-12-01

    Among various factors contributing to warming in cities, anthropogenic heat emission (AHE), defined by heat fluxes arising from human consumption of energy, has the most obvious influence. Despite this, estimation of the AHE distribution is challenging and assumed almost uniform in investigations of the regional atmospheric environment. In this study, we introduce a top-down method for estimating a global distribution of AHE (see attachment), with a high spatial resolution of 30 arc-seconds and temporal resolution of 1 hour. Annual average AHE was derived from human metabolic heating and primary energy consumption, which was further divided into three components based on consumer sector: heat loss, heat emissions from industrial-related sectors and heat emissions from commercial, residential and transport sectors (CRT). The first and second components were equally distributed throughout the country and populated areas, respectively. Bulk AHE from the CRT was proportionally distributed using a global population dataset with a nighttime lights adjustment. An empirical function to estimate monthly fluctuations of AHE based on monthly temperatures was derived from various city measurements. Finally, a global AHE database was constructed for the year 2013. Comparisons between our proposed AHE and other existing datasets revealed that a problem of AHE underestimation at central urban areas existing in previous top-down models was significantly mitigated by the nighttime lights adjustment. A strong agreement in the monthly profiles of AHE between our database and other bottom-up datasets further proved the validity of our current methodology. Investigations of AHE in the 29 largest urban agglomerations globally highlighted that the share of heat emissions from CRT sectors to the total AHE at the city level was 40-95%, whereas the share of metabolic heating varied closely depending on the level of economic development in the city. Incorporation of our proposed AHE data into climate models will provide a more realistic representation of urban atmospheric environment, leading to a deeper understanding of urban climate change. Acknowledgment: This research was supported by the Environment Research and Technology Development Fund (S-14) of the Ministry of the Environment, Japan

  4. NASADEM Global Elevation Model of Earth: Methods for the Refinement and Merger of SRTM and ASTER GDEM

    NASA Astrophysics Data System (ADS)

    Crippen, R. E.; Buckley, S.; Agram, P. S.; Belz, J. E.; Gurrola, E. M.; Hensley, S.; Kobrick, M.; Lavalle, M.; Martin, J. M.; Neumann, M.; Nguyen, Q.; Rosen, P. A.; Shimada, J.; Simard, M.; Tung, W.

    2016-12-01

    NASADEM is a near-global elevation model that is being produced primarily by completely reprocessing the Shuttle Radar Topography Mission (SRTM) radar data and then merging it with refined ASTER GDEM elevations. The new and improved SRTM elevations in NASADEM result from better vertical control of each SRTM data swath via reference to ICESat elevations and from SRTM void reductions using advanced interferometric unwrapping algorithms. Errors in SRTM (due to incorrect interferometric unwrapping) are rare but can be found and removed via a detector that relies upon pattern analysis within synergistic comparisons of SRTM and GDEM. Remnant voids in SRTM are filled primarily by GDEM3, but with removal of GDEM glitches that are mostly related to clouds. GDEM glitch removal uses a measure of curvature and then spatial filtering to detect, isolate, and delete anomalous spikes and pits that are uncharacteristic of natural topography. Water masking uses the original SRTM Water Body Dataset (SWBD), but with errors corrected via a new ASTER Water Body Database. The improved SRTM, GDEM, and water body databases will be made available individually in addition to our merged product, which is particularly important for the SRTM dataset, which stands as a February 2000 baseline for many topographic change studies. New and forthcoming freely available elevation data (at reduced resolutions) from the ALOS PRISM World 3D and TanDEM-X projects will contribute to the critical but not yet reached goal of a complete, high-quality elevation model of Earth, and they are expected to provide additional validation for NASADEM. Indeed, cross validation among all of these datasets is a vital part of reaching that goal. The value of elevation data is difficult to overstate. These data are used in nearly all types of geophysical study conducted at or near Earth's surface.

  5. Site-conditions map for Portugal based on VS measurements: methodology and final model

    NASA Astrophysics Data System (ADS)

    Vilanova, Susana; Narciso, João; Carvalho, João; Lopes, Isabel; Quinta Ferreira, Mario; Moura, Rui; Borges, José; Nemser, Eliza; Pinto, carlos

    2017-04-01

    In this paper we present a statistically significant site-condition model for Portugal based on shear-wave velocity (VS) data and surface geology. We also evaluate the performance of commonly used Vs30 proxies based on exogenous data and analyze the implications of using those proxies for calculating site amplification in seismic hazard assessment. The dataset contains 161 Vs profiles acquired in Portugal in the context of research projects, technical reports, academic thesis and academic papers. The methodologies involved in characterizing the Vs structure at the sites in the database include seismic refraction, multichannel analysis of seismic waves and refraction microtremor. Invasive measurements were performed in selected locations in order to compare the Vs profiles obtained from both invasive and non-invasive techniques. In general there was good agreement in the subsurface structure of Vs30 obtained from the different methodologies. The database flat-file includes information on Vs30, surface geology at 1:50.000 and 1:500.000 scales, elevation and topographic slope and based on SRTM30 topographic dataset. The procedure used to develop the site-conditions map is based on a three-step process that includes defining a preliminary set of geological units based on the literature, performing statistical tests to assess whether or not the differences in the distributions of Vs30 are statistically significant, and merging of the geological units accordingly. The dataset was, to some extent, affected by clustering and/or preferential sampling and therefore a declustering algorithm was applied. The final model includes three geological units: 1) Igneous, metamorphic and old (Paleogene and Mesozoic) sedimentary rocks; 2) Neogene and Pleistocene formations, and 3) Holocene formations. The evaluation of proxies indicates that although geological analogues and topographic slope are in general unbiased, the latter shows significant bias for particular geological units and subsequently for some geographical regions.

  6. The effects of spatial population dataset choice on estimates of population at risk of disease

    PubMed Central

    2011-01-01

    Background The spatial modeling of infectious disease distributions and dynamics is increasingly being undertaken for health services planning and disease control monitoring, implementation, and evaluation. Where risks are heterogeneous in space or dependent on person-to-person transmission, spatial data on human population distributions are required to estimate infectious disease risks, burdens, and dynamics. Several different modeled human population distribution datasets are available and widely used, but the disparities among them and the implications for enumerating disease burdens and populations at risk have not been considered systematically. Here, we quantify some of these effects using global estimates of populations at risk (PAR) of P. falciparum malaria as an example. Methods The recent construction of a global map of P. falciparum malaria endemicity enabled the testing of different gridded population datasets for providing estimates of PAR by endemicity class. The estimated population numbers within each class were calculated for each country using four different global gridded human population datasets: GRUMP (~1 km spatial resolution), LandScan (~1 km), UNEP Global Population Databases (~5 km), and GPW3 (~5 km). More detailed assessments of PAR variation and accuracy were conducted for three African countries where census data were available at a higher administrative-unit level than used by any of the four gridded population datasets. Results The estimates of PAR based on the datasets varied by more than 10 million people for some countries, even accounting for the fact that estimates of population totals made by different agencies are used to correct national totals in these datasets and can vary by more than 5% for many low-income countries. In many cases, these variations in PAR estimates comprised more than 10% of the total national population. The detailed country-level assessments suggested that none of the datasets was consistently more accurate than the others in estimating PAR. The sizes of such differences among modeled human populations were related to variations in the methods, input resolution, and date of the census data underlying each dataset. Data quality varied from country to country within the spatial population datasets. Conclusions Detailed, highly spatially resolved human population data are an essential resource for planning health service delivery for disease control, for the spatial modeling of epidemics, and for decision-making processes related to public health. However, our results highlight that for the low-income regions of the world where disease burden is greatest, existing datasets display substantial variations in estimated population distributions, resulting in uncertainty in disease assessments that utilize them. Increased efforts are required to gather contemporary and spatially detailed demographic data to reduce this uncertainty, particularly in Africa, and to develop population distribution modeling methods that match the rigor, sophistication, and ability to handle uncertainty of contemporary disease mapping and spread modeling. In the meantime, studies that utilize a particular spatial population dataset need to acknowledge the uncertainties inherent within them and consider how the methods and data that comprise each will affect conclusions. PMID:21299885

  7. Comprehensive analysis of a multidimensional liquid chromatography mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight mass spectrometer: I. How much of the data is theoretically interpretable by search engines?

    PubMed

    Chalkley, Robert J; Baker, Peter R; Hansen, Kirk C; Medzihradszky, Katalin F; Allen, Nadia P; Rexach, Michael; Burlingame, Alma L

    2005-08-01

    An in-depth analysis of a multidimensional chromatography-mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight (QqTOF) geometry instrument was carried out. A total of 3269 CID spectra were acquired. Through manual verification of database search results and de novo interpretation of spectra 2368 spectra could be confidently determined as predicted tryptic peptides. A detailed analysis of the non-matching spectra was also carried out, highlighting what the non-matching spectra in a database search typically are composed of. The results of this comprehensive dataset study demonstrate that QqTOF instruments produce information-rich data of which a high percentage of the data is readily interpretable.

  8. Percentage of Protected Area Amounts within each Watershed Boundary for the Conterminous US

    EPA Science Inventory

    Abstract: This dataset uses spatial information from the Watershed Boundary Dataset (WBD, March 2011) and the Protected Areas Database of the United States (PAD-US Version 1.0). The resulting data layer, with percentages of protected areas by category, was created using the ATtI...

  9. Database for the geologic map of Upper Geyser Basin, Yellowstone National Park, Wyoming

    USGS Publications Warehouse

    Abendini, Atosa A.; Robinson, Joel E.; Muffler, L. J. Patrick; White, D. E.; Beeson, Melvin H.; Truesdell, A. H.

    2015-01-01

    This dataset contains contacts, geologic units, and map boundaries from Miscellaneous Investigations Series Map I-1371, "The Geologic map of upper Geyser Basin, Yellowstone, National Park, Wyoming". This dataset was constructed to produce a digital geologic map as a basis for ongoing studies of hydrothermal processes.

  10. Mass spectrometry-based protein identification by integrating de novo sequencing with database searching.

    PubMed

    Wang, Penghao; Wilson, Susan R

    2013-01-01

    Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. The integrative approach firstly infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then puts these sequences into a database search to see if a close peptide match can be found. However the current implementation of this integrative approach has several limitations. Firstly, simplistic de novo sequencing is applied and only very short sequence tags are used. Secondly, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Thirdly, by applying these methods the integrated de novo sequencing makes a limited contribution to the scoring model which is still largely based on database searching. We have developed a new integrative protein identification method which can integrate de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.

  11. eMelanoBase: an online locus-specific variant database for familial melanoma.

    PubMed

    Fung, David C Y; Holland, Elizabeth A; Becker, Therese M; Hayward, Nicholas K; Bressac-de Paillerets, Brigitte; Mann, Graham J

    2003-01-01

    A proportion of melanoma-prone individuals in both familial and non-familial contexts has been shown to carry inactivating mutations in either CDKN2A or, rarely, CDK4. CDKN2A is a complex locus that encodes two unrelated proteins from alternately spliced transcripts that are read in different frames. The alpha transcript (exons 1alpha, 2, and 3) produces the p16INK4A cyclin-dependent kinase inhibitor, while the beta transcript (exons 1beta and 2) is translated as p14ARF, a stabilizing factor of p53 levels through binding to MDM2. Mutations in exon 2 can impair both polypeptides and insertions and deletions in exons 1alpha, 1beta, and 2, which can theoretically generate p16INK4A-p14ARF fusion proteins. No online database currently takes into account all the consequences of these genotypes, a situation compounded by some problematic previous annotations of CDKN2A-related sequences and descriptions of their mutations. As an initiative of the international Melanoma Genetics Consortium, we have therefore established a database of germline variants observed in all loci implicated in familial melanoma susceptibility. Such a comprehensive, publicly accessible database is an essential foundation for research on melanoma susceptibility and its clinical application. Our database serves two types of data as defined by HUGO. The core dataset includes the nucleotide variants on the genomic and transcript levels, amino acid variants, and citation. The ancillary dataset includes keyword description of events at the transcription and translation levels and epidemiological data. The application that handles users' queries was designed in the model-view-controller architecture and was implemented in Java. The object-relational database schema was deduced using functional dependency analysis. We hereby present our first functional prototype of eMelanoBase. The service is accessible via the URL www.wmi.usyd.edu.au:8080/melanoma.html. Copyright 2002 Wiley-Liss, Inc.

  12. Predicting chemically-induced skin reactions. Part I: QSAR models of skin sensitization and their application to identify potentially hazardous compounds

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alves, Vinicius M.; Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599; Muratov, Eugene

    Repetitive exposure to a chemical agent can induce an immune reaction in inherently susceptible individuals that leads to skin sensitization. Although many chemicals have been reported as skin sensitizers, there have been very few rigorously validated QSAR models with defined applicability domains (AD) that were developed using a large group of chemically diverse compounds. In this study, we have aimed to compile, curate, and integrate the largest publicly available dataset related to chemically-induced skin sensitization, use this data to generate rigorously validated and QSAR models for skin sensitization, and employ these models as a virtual screening tool for identifying putativemore » sensitizers among environmental chemicals. We followed best practices for model building and validation implemented with our predictive QSAR workflow using Random Forest modeling technique in combination with SiRMS and Dragon descriptors. The Correct Classification Rate (CCR) for QSAR models discriminating sensitizers from non-sensitizers was 71–88% when evaluated on several external validation sets, within a broad AD, with positive (for sensitizers) and negative (for non-sensitizers) predicted rates of 85% and 79% respectively. When compared to the skin sensitization module included in the OECD QSAR Toolbox as well as to the skin sensitization model in publicly available VEGA software, our models showed a significantly higher prediction accuracy for the same sets of external compounds as evaluated by Positive Predicted Rate, Negative Predicted Rate, and CCR. These models were applied to identify putative chemical hazards in the Scorecard database of possible skin or sense organ toxicants as primary candidates for experimental validation. - Highlights: • It was compiled the largest publicly-available skin sensitization dataset. • Predictive QSAR models were developed for skin sensitization. • Developed models have higher prediction accuracy than OECD QSAR Toolbox. • Putative chemical hazards in the Scorecard database were found using our models.« less

  13. Information footprint of different ecohydrological data sources: using multi-objective calibration of a physically-based model as hypothesis testing

    NASA Astrophysics Data System (ADS)

    Kuppel, S.; Soulsby, C.; Maneta, M. P.; Tetzlaff, D.

    2017-12-01

    The utility of field measurements to help constrain the model solution space and identify feasible model configurations has been an increasingly central issue in hydrological model calibration. Sufficiently informative observations are necessary to ensure that the goodness of model-data fit attained effectively translates into more physically-sound information for the internal model parameters, as a basis for model structure evaluation. Here we assess to which extent the diversity of information content can inform on the suitability of a complex, process-based ecohydrological model to simulate key water flux and storage dynamics at a long-term research catchment in the Scottish Highlands. We use the fully-distributed ecohydrological model EcH2O, calibrated against long-term datasets that encompass hydrologic and energy exchanges and ecological measurements: stream discharge, soil moisture, net radiation above canopy, and pine stand transpiration. Diverse combinations of these constraints were applied using a multi-objective cost function specifically designed to avoid compensatory effects between model-data metrics. Results revealed that calibration against virtually all datasets enabled the model to reproduce streamflow reasonably well. However, parameterizing the model to adequately capture local flux and storage dynamics, such as soil moisture or transpiration, required calibration with specific observations. This indicates that the footprint of the information contained in observations varies for each type of dataset, and that a diverse database informing about the different compartments of the domain, is critical to test hypotheses of catchment function and identify a consistent model parameterization. The results foster confidence in using EcH2O to help understanding current and future ecohydrological couplings in Northern catchments.

  14. The Chinchilla Research Resource Database: resource for an otolaryngology disease model

    PubMed Central

    Shimoyama, Mary; Smith, Jennifer R.; De Pons, Jeff; Tutaj, Marek; Khampang, Pawjai; Hong, Wenzhou; Erbe, Christy B.; Ehrlich, Garth D.; Bakaletz, Lauren O.; Kerschner, Joseph E.

    2016-01-01

    The long-tailed chinchilla (Chinchilla lanigera) is an established animal model for diseases of the inner and middle ear, among others. In particular, chinchilla is commonly used to study diseases involving viral and bacterial pathogens and polymicrobial infections of the upper respiratory tract and the ear, such as otitis media. The value of the chinchilla as a model for human diseases prompted the sequencing of its genome in 2012 and the more recent development of the Chinchilla Research Resource Database (http://crrd.mcw.edu) to provide investigators with easy access to relevant datasets and software tools to enhance their research. The Chinchilla Research Resource Database contains a complete catalog of genes for chinchilla and, for comparative purposes, human. Chinchilla genes can be viewed in the context of their genomic scaffold positions using the JBrowse genome browser. In contrast to the corresponding records at NCBI, individual gene reports at CRRD include functional annotations for Disease, Gene Ontology (GO) Biological Process, GO Molecular Function, GO Cellular Component and Pathway assigned to chinchilla genes based on annotations from the corresponding human orthologs. Data can be retrieved via keyword and gene-specific searches. Lists of genes with similar functional attributes can be assembled by leveraging the hierarchical structure of the Disease, GO and Pathway vocabularies through the Ontology Search and Browser tool. Such lists can then be further analyzed for commonalities using the Gene Annotator (GA) Tool. All data in the Chinchilla Research Resource Database is freely accessible and downloadable via the CRRD FTP site or using the download functions available in the search and analysis tools. The Chinchilla Research Resource Database is a rich resource for researchers using, or considering the use of, chinchilla as a model for human disease. Database URL: http://crrd.mcw.edu PMID:27173523

  15. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae

    PubMed Central

    Reguly, Teresa; Breitkreutz, Ashton; Boucher, Lorrie; Breitkreutz, Bobby-Joe; Hon, Gary C; Myers, Chad L; Parsons, Ainslie; Friesen, Helena; Oughtred, Rose; Tong, Amy; Stark, Chris; Ho, Yuen; Botstein, David; Andrews, Brenda; Boone, Charles; Troyanskya, Olga G; Ideker, Trey; Dolinski, Kara; Batada, Nizar N; Tyers, Mike

    2006-01-01

    Background The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference. Results We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID () and SGD () databases. Conclusion Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks. PMID:16762047

  16. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research

    PubMed Central

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R

    2018-01-01

    Objectives This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. Methods A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. Results A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Conclusions Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. PMID:29084729

  17. Geocoding and stereo display of tropical forest multisensor datasets

    NASA Technical Reports Server (NTRS)

    Welch, R.; Jordan, T. R.; Luvall, J. C.

    1990-01-01

    Concern about the future of tropical forests has led to a demand for geocoded multisensor databases that can be used to assess forest structure, deforestation, thermal response, evapotranspiration, and other parameters linked to climate change. In response to studies being conducted at the Braulino Carrillo National Park, Costa Rica, digital satellite and aircraft images recorded by Landsat TM, SPOT HRV, Thermal Infrared Multispectral Scanner, and Calibrated Airborne Multispectral Scanner sensors were placed in register using the Landsat TM image as the reference map. Despite problems caused by relief, multitemporal datasets, and geometric distortions in the aircraft images, registration was accomplished to within + or - 20 m (+ or - 1 data pixel). A digital elevation model constructed from a multisensor Landsat TM/SPOT stereopair proved useful for generating perspective views of the rugged, forested terrain.

  18. Data mining in newt-omics, the repository for omics data from the newt.

    PubMed

    Looso, Mario; Braun, Thomas

    2015-01-01

    Salamanders are an excellent model organism to study regenerative processes due to their unique ability to regenerate lost appendages or organs. Straightforward bioinformatics tools to analyze and take advantage of the growing number of "omics" studies performed in salamanders were lacking so far. To overcome this limitation, we have generated a comprehensive data repository for the red-spotted newt Notophthalmus viridescens, named newt-omics, merging omics style datasets on the transcriptome and proteome level including expression values and annotations. The resource is freely available via a user-friendly Web-based graphical user interface ( http://newt-omics.mpi-bn.mpg.de) that allows access and queries to the database without prior bioinformatical expertise. The repository is updated regularly, incorporating new published datasets from omics technologies.

  19. Assessment of Global Cloud Datasets from Satellites: Project and Database Initiated by the GEWEX Radiation Panel

    NASA Technical Reports Server (NTRS)

    Stubenrauch, C. J.; Rossow, W. B.; Kinne, S.; Ackerman, S.; Cesana, G.; Chepfer, H.; Getzewich, B.; Di Girolamo, L.; Guignard, A.; Heidinger, A.; hide

    2012-01-01

    Clouds cover about 70% of the Earth's surface and play a dominant role in the energy and water cycle of our planet. Only satellite observations provide a continuous survey of the state of the atmosphere over the whole globe and across the wide range of spatial and temporal scales that comprise weather and climate variability. Satellite cloud data records now exceed more than 25 years in length. However, climatologies compiled from different satellite datasets can exhibit systematic biases. Questions therefore arise as to the accuracy and limitations of the various sensors. The Global Energy and Water cycle Experiment (GEWEX) Cloud Assessment, initiated in 2005 by the GEWEX Radiation Panel, provided the first coordinated intercomparison of publically available, standard global cloud products (gridded, monthly statistics) retrieved from measurements of multi-spectral imagers (some with multiangle view and polarization capabilities), IR sounders and lidar. Cloud properties under study include cloud amount, cloud height (in terms of pressure, temperature or altitude), cloud radiative properties (optical depth or emissivity), cloud thermodynamic phase and bulk microphysical properties (effective particle size and water path). Differences in average cloud properties, especially in the amount of high-level clouds, are mostly explained by the inherent instrument measurement capability for detecting and/or identifying optically thin cirrus, especially when overlying low-level clouds. The study of long-term variations with these datasets requires consideration of many factors. A monthly, gridded database, in common format, facilitates further assessments, climate studies and the evaluation of climate models.

  20. An Intercomparison of Large-Extent Tree Canopy Cover Geospatial Datasets

    NASA Astrophysics Data System (ADS)

    Bender, S.; Liknes, G.; Ruefenacht, B.; Reynolds, J.; Miller, W. P.

    2017-12-01

    As a member of the Multi-Resolution Land Characteristics Consortium (MRLC), the U.S. Forest Service (USFS) is responsible for producing and maintaining the tree canopy cover (TCC) component of the National Land Cover Database (NLCD). The NLCD-TCC data are available for the conterminous United States (CONUS), coastal Alaska, Hawai'i, Puerto Rico, and the U.S. Virgin Islands. The most recent official version of the NLCD-TCC data is based primarily on reference data from 2010-2011 and is part of the multi-component 2011 version of the NLCD. NLCD data are updated on a five-year cycle. The USFS is currently producing the next official version (2016) of the NLCD-TCC data for the United States, and it will be made publicly-available in early 2018. In this presentation, we describe the model inputs, modeling methods, and tools used to produce the 30-m NLCD-TCC data. Several tree cover datasets at 30-m, as well as datasets at finer resolution, have become available in recent years due to advancements in earth observation data and their availability, computing, and sensors. We compare multiple tree cover datasets that have similar resolution to the NLCD-TCC data. We also aggregate the tree class from fine-resolution land cover datasets to a percent canopy value on a 30-m pixel, in order to compare the fine-resolution datasets to the datasets created directly from 30-m Landsat data. The extent of the tree canopy cover datasets included in the study ranges from global and national to the state level. Preliminary investigation of multiple tree cover datasets over the CONUS indicates a high amount of spatial variability. For example, in a comparison of the NLCD-TCC and the Global Land Cover Facility's Landsat Tree Cover Continuous Fields (2010) data by MRLC mapping zones, the zone-level root mean-square deviation ranges from 2% to 39% (mean=17%, median=15%). The analysis outcomes are expected to inform USFS decisions with regard to the next cycle (2021) of NLCD-TCC production.

  1. Model-data integration for developing the Cropland Carbon Monitoring System (CCMS)

    NASA Astrophysics Data System (ADS)

    Jones, C. D.; Bandaru, V.; Pnvr, K.; Jin, H.; Reddy, A.; Sahajpal, R.; Sedano, F.; Skakun, S.; Wagle, P.; Gowda, P. H.; Hurtt, G. C.; Izaurralde, R. C.

    2017-12-01

    The Cropland Carbon Monitoring System (CCMS) has been initiated to improve regional estimates of carbon fluxes from croplands in the conterminous United States through integration of terrestrial ecosystem modeling, use of remote-sensing products and publically available datasets, and development of improved landscape and management databases. In order to develop these improved carbon flux estimates, experimental datasets are essential for evaluating the skill of estimates, characterizing the uncertainty of these estimates, characterizing parameter sensitivities, and calibrating specific modeling components. Experiments were sought that included flux tower measurement of CO2 fluxes under production of major agronomic crops. Currently data has been collected from 17 experiments comprising 117 site-years from 12 unique locations. Calibration of terrestrial ecosystem model parameters using available crop productivity and net ecosystem exchange (NEE) measurements resulted in improvements in RMSE of NEE predictions of between 3.78% to 7.67%, while improvements in RMSE for yield ranged from -1.85% to 14.79%. Model sensitivities were dominated by parameters related to leaf area index (LAI) and spring growth, demonstrating considerable capacity for model improvement through development and integration of remote-sensing products. Subsequent analyses will assess the impact of such integrated approaches on skill of cropland carbon flux estimates.

  2. ECOALIM: A Dataset of Environmental Impacts of Feed Ingredients Used in French Animal Production.

    PubMed

    Wilfart, Aurélie; Espagnol, Sandrine; Dauguet, Sylvie; Tailleur, Aurélie; Gac, Armelle; Garcia-Launay, Florence

    2016-01-01

    Feeds contribute highly to environmental impacts of livestock products. Therefore, formulating low-impact feeds requires data on environmental impacts of feed ingredients with consistent perimeters and methodology for life cycle assessment (LCA). We created the ECOALIM dataset of life cycle inventories (LCIs) and associated impacts of feed ingredients used in animal production in France. It provides several perimeters for LCIs (field gate, storage agency gate, plant gate and harbour gate) with homogeneously collected data from French R&D institutes covering the 2005-2012 period. The dataset of environmental impacts is available as a Microsoft® Excel spreadsheet on the ECOALIM website and provides climate change, acidification, eutrophication, non-renewable and total cumulative energy demand, phosphorus demand, and land occupation. LCIs in the ECOALIM dataset are available in the AGRIBALYSE® database in SimaPro® software. The typology performed on the dataset classified the 149 average feed ingredients into categories of low impact (co-products of plant origin and minerals), high impact (feed-use amino acids, fats and vitamins) and intermediate impact (cereals, oilseeds, oil meals and protein crops). Therefore, the ECOALIM dataset can be used by feed manufacturers and LCA practitioners to investigate formulation of low-impact feeds. It also provides data for environmental evaluation of feeds and animal production systems. Included in AGRIBALYSE® database and SimaPro®, the ECOALIM dataset will benefit from their procedures for maintenance and regular updating. Future use can also include environmental labelling of commercial products from livestock production.

  3. ECOALIM: A Dataset of Environmental Impacts of Feed Ingredients Used in French Animal Production

    PubMed Central

    Espagnol, Sandrine; Dauguet, Sylvie; Tailleur, Aurélie; Gac, Armelle; Garcia-Launay, Florence

    2016-01-01

    Feeds contribute highly to environmental impacts of livestock products. Therefore, formulating low-impact feeds requires data on environmental impacts of feed ingredients with consistent perimeters and methodology for life cycle assessment (LCA). We created the ECOALIM dataset of life cycle inventories (LCIs) and associated impacts of feed ingredients used in animal production in France. It provides several perimeters for LCIs (field gate, storage agency gate, plant gate and harbour gate) with homogeneously collected data from French R&D institutes covering the 2005–2012 period. The dataset of environmental impacts is available as a Microsoft® Excel spreadsheet on the ECOALIM website and provides climate change, acidification, eutrophication, non-renewable and total cumulative energy demand, phosphorus demand, and land occupation. LCIs in the ECOALIM dataset are available in the AGRIBALYSE® database in SimaPro® software. The typology performed on the dataset classified the 149 average feed ingredients into categories of low impact (co-products of plant origin and minerals), high impact (feed-use amino acids, fats and vitamins) and intermediate impact (cereals, oilseeds, oil meals and protein crops). Therefore, the ECOALIM dataset can be used by feed manufacturers and LCA practitioners to investigate formulation of low-impact feeds. It also provides data for environmental evaluation of feeds and animal production systems. Included in AGRIBALYSE® database and SimaPro®, the ECOALIM dataset will benefit from their procedures for maintenance and regular updating. Future use can also include environmental labelling of commercial products from livestock production. PMID:27930682

  4. AnthWest, occurrence records for wool carder bees of the genus Anthidium (Hymenoptera, Megachilidae, Anthidiini) in the Western Hemisphere

    PubMed Central

    Griswold, Terry; Gonzalez, Victor H.; Ikerd, Harold

    2014-01-01

    Abstract This paper describes AnthWest, a large dataset that represents one of the outcomes of a comprehensive, broadly comparative study on the diversity, biology, biogeography, and evolution of Anthidium Fabricius in the Western Hemisphere. In this dataset a total of 22,648 adult occurrence records comprising 9657 unique events are documented for 92 species of Anthidium, including the invasive range of two introduced species from Eurasia, A. oblongatum (Illiger) and A. manicatum (Linnaeus). The geospatial coverage of the dataset extends from northern Canada and Alaska to southern Argentina, and from below sea level in Death Valley, California, USA, to 4700 m a.s.l. in Tucumán, Argentina. The majority of records in the dataset correspond to information recorded from individual specimens examined by the authors during this project and deposited in 60 biodiversity collections located in Africa, Europe, North and South America. A fraction (4.8%) of the occurrence records were taken from the literature, largely California records from a taxonomic treatment with some additional records for the two introduced species. The temporal scale of the dataset represents collection events recorded between 1886 and 2012. The dataset was developed employing SQL server 2008 r2. For each specimen, the following information is generally provided: scientific name including identification qualifier when species status is uncertain (e.g. “Questionable Determination” for 0.4% of the specimens), sex, temporal and geospatial details, coordinates, data collector, host plants, associated organisms, name of identifier, historic identification, historic identifier, taxonomic value (i.e., type specimen, voucher, etc.), and repository. For a small portion of the database records, bees associated with threatened or endangered plants (~ 0.08% of total records) as well as specimens collected as part of unpublished biological inventories (~17%), georeferencing is presented only to nearest degree and the information on floral host, locality, elevation, month, and day has been withheld. This database can potentially be used in species distribution and niche modeling studies, as well as in assessments of pollinator status and pollination services. For native pollinators, this large dataset of occurrence records is the first to be simultaneously developed during a species-level systematic study. PMID:24899835

  5. The NCBI BioSystems database.

    PubMed

    Geer, Lewis Y; Marchler-Bauer, Aron; Geer, Renata C; Han, Lianyi; He, Jane; He, Siqian; Liu, Chunlei; Shi, Wenyao; Bryant, Stephen H

    2010-01-01

    The NCBI BioSystems database, found at http://www.ncbi.nlm.nih.gov/biosystems/, centralizes and cross-links existing biological systems databases, increasing their utility and target audience by integrating their pathways and systems into NCBI resources. This integration allows users of NCBI's Entrez databases to quickly categorize proteins, genes and small molecules by metabolic pathway, disease state or other BioSystem type, without requiring time-consuming inference of biological relationships from the literature or multiple experimental datasets.

  6. Keeping Research Data from the Continental Deep Drilling Programme (KTB) Accessible and Taking First Steps Towards Digital Preservation

    NASA Astrophysics Data System (ADS)

    Klump, J. F.; Ulbricht, D.; Conze, R.

    2014-12-01

    The Continental Deep Drilling Programme (KTB) was a scientific drilling project from 1987 to 1995 near Windischeschenbach, Bavaria. The main super-deep borehole reached a depth of 9,101 meters into the Earth's continental crust. The project used the most current equipment for data capture and processing. After the end of the project key data were disseminated through the web portal of the International Continental Scientific Drilling Program (ICDP). The scientific reports were published as printed volumes. As similar projects have also experienced, it becomes increasingly difficult to maintain a data portal over a long time. Changes in software and underlying hardware make a migration of the entire system inevitable. Around 2009 the data presented on the ICDP web portal were migrated to the Scientific Drilling Database (SDDB) and published through DataCite using Digital Object Identifiers (DOI) as persistent identifiers. The SDDB portal used a relational database with a complex data model to store data and metadata. A PHP-based Content Management System with custom modifications made it possible to navigate and browse datasets using the metadata and then download datasets. The data repository software eSciDoc allows storing self-contained packages consistent with the OAIS reference model. Each package consists of binary data files and XML-metadata. Using a REST-API the packages can be stored in the eSciDoc repository and can be searched using the XML-metadata. During the last maintenance cycle of the SDDB the data and metadata were migrated into the eSciDoc repository. Discovery metadata was generated following the GCMD-DIF, ISO19115 and DataCite schemas. The eSciDoc repository allows to store an arbitrary number of XML-metadata records with each data object. In addition to descriptive metadata each data object may contain pointers to related materials, such as IGSN-metadata to link datasets to physical specimens, or identifiers of literature interpreting the data. Datasets are presented by XSLT-stylesheet transformation using the stored metadata. The presentation shows several migration cycles of data and metadata, which were driven by aging software systems. Currently the datasets reside as self-contained entities in a repository system that is ready for digital preservation.

  7. Querying phenotype-genotype relationships on patient datasets using semantic web technology: the example of cerebrotendinous xanthomatosis

    PubMed Central

    2012-01-01

    Background Semantic Web technology can considerably catalyze translational genetics and genomics research in medicine, where the interchange of information between basic research and clinical levels becomes crucial. This exchange involves mapping abstract phenotype descriptions from research resources, such as knowledge databases and catalogs, to unstructured datasets produced through experimental methods and clinical practice. This is especially true for the construction of mutation databases. This paper presents a way of harmonizing abstract phenotype descriptions with patient data from clinical practice, and querying this dataset about relationships between phenotypes and genetic variants, at different levels of abstraction. Methods Due to the current availability of ontological and terminological resources that have already reached some consensus in biomedicine, a reuse-based ontology engineering approach was followed. The proposed approach uses the Ontology Web Language (OWL) to represent the phenotype ontology and the patient model, the Semantic Web Rule Language (SWRL) to bridge the gap between phenotype descriptions and clinical data, and the Semantic Query Web Rule Language (SQWRL) to query relevant phenotype-genotype bidirectional relationships. The work tests the use of semantic web technology in the biomedical research domain named cerebrotendinous xanthomatosis (CTX), using a real dataset and ontologies. Results A framework to query relevant phenotype-genotype bidirectional relationships is provided. Phenotype descriptions and patient data were harmonized by defining 28 Horn-like rules in terms of the OWL concepts. In total, 24 patterns of SWQRL queries were designed following the initial list of competency questions. As the approach is based on OWL, the semantic of the framework adapts the standard logical model of an open world assumption. Conclusions This work demonstrates how semantic web technologies can be used to support flexible representation and computational inference mechanisms required to query patient datasets at different levels of abstraction. The open world assumption is especially good for describing only partially known phenotype-genotype relationships, in a way that is easily extensible. In future, this type of approach could offer researchers a valuable resource to infer new data from patient data for statistical analysis in translational research. In conclusion, phenotype description formalization and mapping to clinical data are two key elements for interchanging knowledge between basic and clinical research. PMID:22849591

  8. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases

    PubMed Central

    Makadia, Rupa; Matcho, Amy; Ma, Qianli; Knoll, Chris; Schuemie, Martijn; DeFalco, Frank J; Londhe, Ajit; Zhu, Vivienne; Ryan, Patrick B

    2015-01-01

    Objectives To evaluate the utility of applying the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) across multiple observational databases within an organization and to apply standardized analytics tools for conducting observational research. Materials and methods Six deidentified patient-level datasets were transformed to the OMOP CDM. We evaluated the extent of information loss that occurred through the standardization process. We developed a standardized analytic tool to replicate the cohort construction process from a published epidemiology protocol and applied the analysis to all 6 databases to assess time-to-execution and comparability of results. Results Transformation to the CDM resulted in minimal information loss across all 6 databases. Patients and observations excluded were due to identified data quality issues in the source system, 96% to 99% of condition records and 90% to 99% of drug records were successfully mapped into the CDM using the standard vocabulary. The full cohort replication and descriptive baseline summary was executed for 2 cohorts in 6 databases in less than 1 hour. Discussion The standardization process improved data quality, increased efficiency, and facilitated cross-database comparisons to support a more systematic approach to observational research. Comparisons across data sources showed consistency in the impact of inclusion criteria, using the protocol and identified differences in patient characteristics and coding practices across databases. Conclusion Standardizing data structure (through a CDM), content (through a standard vocabulary with source code mappings), and analytics can enable an institution to apply a network-based approach to observational research across multiple, disparate observational health databases. PMID:25670757

  9. An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features.

    PubMed

    Nandi, Sutanu; Subramanian, Abhishek; Sarkar, Ram Rup

    2017-07-25

    Prediction of essential genes helps to identify a minimal set of genes that are absolutely required for the appropriate functioning and survival of a cell. The available machine learning techniques for essential gene prediction have inherent problems, like imbalanced provision of training datasets, biased choice of the best model for a given balanced dataset, choice of a complex machine learning algorithm, and data-based automated selection of biologically relevant features for classification. Here, we propose a simple support vector machine-based learning strategy for the prediction of essential genes in Escherichia coli K-12 MG1655 metabolism that integrates a non-conventional combination of an appropriate sample balanced training set, a unique organism-specific genotype, phenotype attributes that characterize essential genes, and optimal parameters of the learning algorithm to generate the best machine learning model (the model with the highest accuracy among all the models trained for different sample training sets). For the first time, we also introduce flux-coupled metabolic subnetwork-based features for enhancing the classification performance. Our strategy proves to be superior as compared to previous SVM-based strategies in obtaining a biologically relevant classification of genes with high sensitivity and specificity. This methodology was also trained with datasets of other recent supervised classification techniques for essential gene classification and tested using reported test datasets. The testing accuracy was always high as compared to the known techniques, proving that our method outperforms known methods. Observations from our study indicate that essential genes are conserved among homologous bacterial species, demonstrate high codon usage bias, GC content and gene expression, and predominantly possess a tendency to form physiological flux modules in metabolism.

  10. The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts

    Treesearch

    L.N. Hudson; T. Newbold; S. Contu

    2014-01-01

    Biodiversity continues to decline in the face of increasing anthropogenic pressures such as habitat destruction, exploitation, pollution and introduction of alien species. Existing global databases of species’ threat status or population time series are dominated by charismatic species. The collation of datasets with broad taxonomic and biogeographic extents, and that...

  11. Planetary Data Archiving Plan at JAXA

    NASA Astrophysics Data System (ADS)

    Shinohara, Iku; Kasaba, Yasumasa; Yamamoto, Yukio; Abe, Masanao; Okada, Tatsuaki; Imamura, Takeshi; Sobue, Shinichi; Takashima, Takeshi; Terazono, Jun-Ya

    After the successful rendezvous of Hayabusa with the small-body planet Itokawa, and the successful launch of Kaguya to the moon, Japanese planetary community has gotten their own and full-scale data. However, at this moment, these datasets are only available from the data sites managed by each mission team. The databases are individually constructed in the different formats, and the user interface of these data sites is not compatible with foreign databases. To improve the usability of the planetary archives at JAXA and to enable the international data exchange smooth, we are investigating to make a new planetary database. Within a coming decade, Japan will have fruitful datasets in the planetary science field, Venus (Planet-C), Mercury (BepiColombo), and several missions in planning phase (small-bodies). In order to strongly assist the international scientific collaboration using these mission archive data, the planned planetary data archive at JAXA should be managed in an unified manner and the database should be constructed in the international planetary database standard style. In this presentation, we will show the current status and future plans of the planetary data archiving at JAXA.

  12. MiRNA-TF-gene network analysis through ranking of biomolecules for multi-informative uterine leiomyoma dataset.

    PubMed

    Mallik, Saurav; Maulik, Ujjwal

    2015-10-01

    Gene ranking is an important problem in bioinformatics. Here, we propose a new framework for ranking biomolecules (viz., miRNAs, transcription-factors/TFs and genes) in a multi-informative uterine leiomyoma dataset having both gene expression and methylation data using (statistical) eigenvector centrality based approach. At first, genes that are both differentially expressed and methylated, are identified using Limma statistical test. A network, comprising these genes, corresponding TFs from TRANSFAC and ITFP databases, and targeter miRNAs from miRWalk database, is then built. The biomolecules are then ranked based on eigenvector centrality. Our proposed method provides better average accuracy in hub gene and non-hub gene classifications than other methods. Furthermore, pre-ranked Gene set enrichment analysis is applied on the pathway database as well as GO-term databases of Molecular Signatures Database with providing a pre-ranked gene-list based on different centrality values for comparing among the ranking methods. Finally, top novel potential gene-markers for the uterine leiomyoma are provided. Copyright © 2015 Elsevier Inc. All rights reserved.

  13. Improved Method for Linear B-Cell Epitope Prediction Using Antigen’s Primary Sequence

    PubMed Central

    Raghava, Gajendra P. S.

    2013-01-01

    One of the major challenges in designing a peptide-based vaccine is the identification of antigenic regions in an antigen that can stimulate B-cell’s response, also called B-cell epitopes. In the past, several methods have been developed for the prediction of conformational and linear (or continuous) B-cell epitopes. However, the existing methods for predicting linear B-cell epitopes are far from perfection. In this study, an attempt has been made to develop an improved method for predicting linear B-cell epitopes. We have retrieved experimentally validated B-cell epitopes as well as non B-cell epitopes from Immune Epitope Database and derived two types of datasets called Lbtope_Variable and Lbtope_Fixed length datasets. The Lbtope_Variable dataset contains 14876 B-cell epitope and 23321 non-epitopes of variable length where as Lbtope_Fixed length dataset contains 12063 B-cell epitopes and 20589 non-epitopes of fixed length. We also evaluated the performance of models on above datasets after removing highly identical peptides from the datasets. In addition, we have derived third dataset Lbtope_Confirm having 1042 epitopes and 1795 non-epitopes where each epitope or non-epitope has been experimentally validated in at least two studies. A number of models have been developed to discriminate epitopes and non-epitopes using different machine-learning techniques like Support Vector Machine, and K-Nearest Neighbor. We achieved accuracy from ∼54% to 86% using diverse s features like binary profile, dipeptide composition, AAP (amino acid pair) profile. In this study, for the first time experimentally validated non B-cell epitopes have been used for developing method for predicting linear B-cell epitopes. In previous studies, random peptides have been used as non B-cell epitopes. In order to provide service to scientific community, a web server LBtope has been developed for predicting and designing B-cell epitopes (http://crdd.osdd.net/raghava/lbtope/). PMID:23667458

  14. Filling a missing link between biogeochemical, climate and ecosystem studies: a global database of atmospheric water-soluble organic nitrogen

    NASA Astrophysics Data System (ADS)

    Cornell, Sarah

    2015-04-01

    It is time to collate a global community database of atmospheric water-soluble organic nitrogen deposition. Organic nitrogen (ON) has long been known to be globally ubiquitous in atmospheric aerosol and precipitation, with implications for air and water quality, climate, biogeochemical cycles, ecosystems and human health. The number of studies of atmospheric ON deposition has increased steadily in recent years, but to date there is no accessible global dataset, for either bulk ON or its major components. Improved qualitative and quantitative understanding of the organic nitrogen component is needed to complement the well-established knowledge base pertaining to other components of atmospheric deposition (cf. Vet et al 2014). Without this basic information, we are increasingly constrained in addressing the current dynamics and potential interactions of atmospheric chemistry, climate and ecosystem change. To see the full picture we need global data synthesis, more targeted data gathering, and models that let us explore questions about the natural and anthropogenic dynamics of atmospheric ON. Collectively, our research community already has a substantial amount of atmospheric ON data. Published reports extend back over a century and now have near-global coverage. However, datasets available from the literature are very piecemeal and too often lack crucially important information that would enable aggregation or re-use. I am initiating an open collaborative process to construct a community database, so we can begin to systematically synthesize these datasets (generally from individual studies at a local and temporally limited scale) to increase their scientific usability and statistical power for studies of global change and anthropogenic perturbation. In drawing together our disparate knowledge, we must address various challenges and concerns, not least about the comparability of analysis and sampling methodologies, and the known complexity of composition of ON. We need to discuss and develop protocols that work for diverse research needs. The database will need to be harmonized or merged into existing global N data initiatives. This presentation therefore launches a standing invitation for experts to contribute and share rain and aerosol ON and chemical composition data, and jointly refine the preliminary database structure and metadata requirements for optimal mutual use. Reference: Vet et al. (2014) A global assessment of precipitation chemistry… Atmos Environ 93: 3-100

  15. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation

    PubMed Central

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B.

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing are now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarksand that uses digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware implementations. With this dataset we hope to (1) promote meaningful comparison between algorithms in the field of neural computation, (2) allow comparison with conventional image recognition methods, (3) provide an assessment of the state of the art in spike-based visual recognition, and (4) help researchers identify future directions and advance the field. PMID:27853419

  16. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation.

    PubMed

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing are now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarksand that uses digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware implementations. With this dataset we hope to (1) promote meaningful comparison between algorithms in the field of neural computation, (2) allow comparison with conventional image recognition methods, (3) provide an assessment of the state of the art in spike-based visual recognition, and (4) help researchers identify future directions and advance the field.

  17. Analysis of the relationship between the monthly temperatures and weather types in Iberian Peninsula

    NASA Astrophysics Data System (ADS)

    Peña Angulo, Dhais; Trigo, Ricardo; Nicola, Cortesi; José Carlos, González-Hidalgo

    2016-04-01

    In this study, the relationship between the atmospheric circulation and weather types and the monthly average maximum and minimum temperatures in the Iberian Peninsula is modeled (period 1950-2010). The temperature data used were obtained from a high spatial resolution (10km x 10km) dataset (MOTEDAS dataset, Gonzalez-Hidalgo et al., 2015a). In addition, a dataset of Portuguese temperatures was used (obtained from the Portuguese Institute of Sea and Atmosphere). The weather type classification used was the one developed by Jenkinson and Collison, which was adapted for the Iberian Peninsula by Trigo and DaCamara (2000), using Sea Level Pressure data from NCAR/NCEP Reanalysis dataset (period 1951-2010). The analysis of the behaviour of monthly temperatures based on the weather types was carried out using a stepwise regression procedure of type forward to estimate temperatures in each cell of the considered grid, for each month, and for both maximum and minimum monthly average temperatures. The model selects the weather types that best estimate the temperatures. From the validation model it was obtained the error distribution in the time (months) and space (Iberian Peninsula). The results show that best estimations are obtained for minimum temperatures, during the winter months and in coastal areas. González-Hidalgo J.C., Peña-Angulo D., Brunetti M., Cortesi, C. (2015a): MOTEDAS: a new monthly temperature database for mainland Spain and the trend in temperature (1951-2010). International Journal of Climatology 31, 715-731. DOI: 10.1002/joc.4298

  18. Modelling surface-water depression storage in a Prairie Pothole Region

    USGS Publications Warehouse

    Hay, Lauren E.; Norton, Parker A.; Viger, Roland; Markstrom, Steven; Regan, R. Steven; Vanderhoof, Melanie

    2018-01-01

    In this study, the Precipitation-Runoff Modelling System (PRMS) was used to simulate changes in surface-water depression storage in the 1,126-km2 Upper Pipestem Creek basin located within the Prairie Pothole Region of North Dakota, USA. The Prairie Pothole Region is characterized by millions of small water bodies (or surface-water depressions) that provide numerous ecosystem services and are considered an important contribution to the hydrologic cycle. The Upper Pipestem PRMS model was extracted from the U.S. Geological Survey's (USGS) National Hydrologic Model (NHM), developed to support consistent hydrologic modelling across the conterminous United States. The Geospatial Fabric database, created for the USGS NHM, contains hydrologic model parameter values derived from datasets that characterize the physical features of the entire conterminous United States for 109,951 hydrologic response units. Each hydrologic response unit in the Geospatial Fabric was parameterized using aggregated surface-water depression area derived from the National Hydrography Dataset Plus, an integrated suite of application-ready geospatial datasets. This paper presents a calibration strategy for the Upper Pipestem PRMS model that uses normalized lake elevation measurements to calibrate the parameters influencing simulated fractional surface-water depression storage. Results indicate that inclusion of measurements that give an indication of the change in surface-water depression storage in the calibration procedure resulted in accurate changes in surface-water depression storage in the water balance. Regionalized parameterization of the USGS NHM will require a proxy for change in surface-storage to accurately parameterize surface-water depression storage within the USGS NHM.

  19. Introducing a New Interface for the Online MagIC Database by Integrating Data Uploading, Searching, and Visualization

    NASA Astrophysics Data System (ADS)

    Jarboe, N.; Minnett, R.; Constable, C.; Koppers, A. A.; Tauxe, L.

    2013-12-01

    The Magnetics Information Consortium (MagIC) is dedicated to supporting the paleomagnetic, geomagnetic, and rock magnetic communities through the development and maintenance of an online database (http://earthref.org/MAGIC/), data upload and quality control, searches, data downloads, and visualization tools. While MagIC has completed importing some of the IAGA paleomagnetic databases (TRANS, PINT, PSVRL, GPMDB) and continues to import others (ARCHEO, MAGST and SECVR), further individual data uploading from the community contributes a wealth of easily-accessible rich datasets. Previously uploading of data to the MagIC database required the use of an Excel spreadsheet using either a Mac or PC. The new method of uploading data utilizes an HTML 5 web interface where the only computer requirement is a modern browser. This web interface will highlight all errors discovered in the dataset at once instead of the iterative error checking process found in the previous Excel spreadsheet data checker. As a web service, the community will always have easy access to the most up-to-date and bug free version of the data upload software. The filtering search mechanism of the MagIC database has been changed to a more intuitive system where the data from each contribution is displayed in tables similar to how the data is uploaded (http://earthref.org/MAGIC/search/). Searches themselves can be saved as a permanent URL, if desired. The saved search URL could then be used as a citation in a publication. When appropriate, plots (equal area, Zijderveld, ARAI, demagnetization, etc.) are associated with the data to give the user a quicker understanding of the underlying dataset. The MagIC database will continue to evolve to meet the needs of the paleomagnetic, geomagnetic, and rock magnetic communities.

  20. Fast 3D shape screening of large chemical databases through alignment-recycling

    PubMed Central

    Fontaine, Fabien; Bolton, Evan; Borodina, Yulia; Bryant, Stephen H

    2007-01-01

    Background Large chemical databases require fast, efficient, and simple ways of looking for similar structures. Although such tasks are now fairly well resolved for graph-based similarity queries, they remain an issue for 3D approaches, particularly for those based on 3D shape overlays. Inspired by a recent technique developed to compare molecular shapes, we designed a hybrid methodology, alignment-recycling, that enables efficient retrieval and alignment of structures with similar 3D shapes. Results Using a dataset of more than one million PubChem compounds of limited size (< 28 heavy atoms) and flexibility (< 6 rotatable bonds), we obtained a set of a few thousand diverse structures covering entirely the 3D shape space of the conformers of the dataset. Transformation matrices gathered from the overlays between these diverse structures and the 3D conformer dataset allowed us to drastically (100-fold) reduce the CPU time required for shape overlay. The alignment-recycling heuristic produces results consistent with de novo alignment calculation, with better than 80% hit list overlap on average. Conclusion Overlay-based 3D methods are computationally demanding when searching large databases. Alignment-recycling reduces the CPU time to perform shape similarity searches by breaking the alignment problem into three steps: selection of diverse shapes to describe the database shape-space; overlay of the database conformers to the diverse shapes; and non-optimized overlay of query and database conformers using common reference shapes. The precomputation, required by the first two steps, is a significant cost of the method; however, once performed, querying is two orders of magnitude faster. Extensions and variations of this methodology, for example, to handle more flexible and larger small-molecules are discussed. PMID:17880744

  1. Consistency Analysis of Genome-Scale Models of Bacterial Metabolism: A Metamodel Approach

    PubMed Central

    Ponce-de-Leon, Miguel; Calle-Espinosa, Jorge; Peretó, Juli; Montero, Francisco

    2015-01-01

    Genome-scale metabolic models usually contain inconsistencies that manifest as blocked reactions and gap metabolites. With the purpose to detect recurrent inconsistencies in metabolic models, a large-scale analysis was performed using a previously published dataset of 130 genome-scale models. The results showed that a large number of reactions (~22%) are blocked in all the models where they are present. To unravel the nature of such inconsistencies a metamodel was construed by joining the 130 models in a single network. This metamodel was manually curated using the unconnected modules approach, and then, it was used as a reference network to perform a gap-filling on each individual genome-scale model. Finally, a set of 36 models that had not been considered during the construction of the metamodel was used, as a proof of concept, to extend the metamodel with new biochemical information, and to assess its impact on gap-filling results. The analysis performed on the metamodel allowed to conclude: 1) the recurrent inconsistencies found in the models were already present in the metabolic database used during the reconstructions process; 2) the presence of inconsistencies in a metabolic database can be propagated to the reconstructed models; 3) there are reactions not manifested as blocked which are active as a consequence of some classes of artifacts, and; 4) the results of an automatic gap-filling are highly dependent on the consistency and completeness of the metamodel or metabolic database used as the reference network. In conclusion the consistency analysis should be applied to metabolic databases in order to detect and fill gaps as well as to detect and remove artifacts and redundant information. PMID:26629901

  2. Using a centralised database system and server in the European Union Framework Programme 7 project SEPServer

    NASA Astrophysics Data System (ADS)

    Heynderickx, Daniel

    2012-07-01

    The main objective of the SEPServer project (EU FP7 project 262773) is to produce a new tool, which greatly facilitates the investigation of solar energetic particles (SEPs) and their origin: a server providing SEP data, related electromagnetic (EM) observations and analysis methods, a comprehensive catalogue of the observed SEP events, and educational/outreach material on solar eruptions. The project is coordinated by the University of Helsinki. The project will combine data and knowledge from 11 European partners and several collaborating parties from Europe and US. The datasets provided by the consortium partners are collected in a MySQL database (using the ESA Open Data Interface under licence) on a server operated by DH Consultancy, which also hosts a web interface providing browsing, plotting and post-processing and analysis tools developed by the consortium, as well as a Solar Energetic Particle event catalogue. At this stage of the project, a prototype server has been established, which is presently undergoing testing by users inside the consortium. Using a centralized database has numerous advantages, including: homogeneous storage of the data, which eliminates the need for dataset specific file access routines once the data are ingested in the database; a homogeneous set of metadata describing the datasets on both a global and detailed level, allowing for automated access to and presentation of the various data products; standardised access to the data in different programming environments (e.g. php, IDL); elimination of the need to download data for individual data requests. SEPServer will, thus, add value to several space missions and Earth-based observations by facilitating the coordinated exploitation of and open access to SEP data and related EM observations, and promoting correct use of these data for the entire space research community. This will lead to new knowledge on the production and transport of SEPs during solar eruptions and facilitate the development of models for predicting solar radiation storms and calculation of expected fluxes/fluences of SEPs encountered by spacecraft in the interplanetary medium.

  3. Highly parameterized model calibration with cloud computing: an example of regional flow model calibration in northeast Alberta, Canada

    NASA Astrophysics Data System (ADS)

    Hayley, Kevin; Schumacher, J.; MacMillan, G. J.; Boutin, L. C.

    2014-05-01

    Expanding groundwater datasets collected by automated sensors, and improved groundwater databases, have caused a rapid increase in calibration data available for groundwater modeling projects. Improved methods of subsurface characterization have increased the need for model complexity to represent geological and hydrogeological interpretations. The larger calibration datasets and the need for meaningful predictive uncertainty analysis have both increased the degree of parameterization necessary during model calibration. Due to these competing demands, modern groundwater modeling efforts require a massive degree of parallelization in order to remain computationally tractable. A methodology for the calibration of highly parameterized, computationally expensive models using the Amazon EC2 cloud computing service is presented. The calibration of a regional-scale model of groundwater flow in Alberta, Canada, is provided as an example. The model covers a 30,865-km2 domain and includes 28 hydrostratigraphic units. Aquifer properties were calibrated to more than 1,500 static hydraulic head measurements and 10 years of measurements during industrial groundwater use. Three regionally extensive aquifers were parameterized (with spatially variable hydraulic conductivity fields), as was the aerial recharge boundary condition, leading to 450 adjustable parameters in total. The PEST-based model calibration was parallelized on up to 250 computing nodes located on Amazon's EC2 servers.

  4. Bio-inspired benchmark generator for extracellular multi-unit recordings

    PubMed Central

    Mondragón-González, Sirenia Lizbeth; Burguière, Eric

    2017-01-01

    The analysis of multi-unit extracellular recordings of brain activity has led to the development of numerous tools, ranging from signal processing algorithms to electronic devices and applications. Currently, the evaluation and optimisation of these tools are hampered by the lack of ground-truth databases of neural signals. These databases must be parameterisable, easy to generate and bio-inspired, i.e. containing features encountered in real electrophysiological recording sessions. Towards that end, this article introduces an original computational approach to create fully annotated and parameterised benchmark datasets, generated from the summation of three components: neural signals from compartmental models and recorded extracellular spikes, non-stationary slow oscillations, and a variety of different types of artefacts. We present three application examples. (1) We reproduced in-vivo extracellular hippocampal multi-unit recordings from either tetrode or polytrode designs. (2) We simulated recordings in two different experimental conditions: anaesthetised and awake subjects. (3) Last, we also conducted a series of simulations to study the impact of different level of artefacts on extracellular recordings and their influence in the frequency domain. Beyond the results presented here, such a benchmark dataset generator has many applications such as calibration, evaluation and development of both hardware and software architectures. PMID:28233819

  5. Image segmentation evaluation for very-large datasets

    NASA Astrophysics Data System (ADS)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes are achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.

  6. A Geospatial Database that Supports Derivation of Climatological Features of Severe Weather

    NASA Astrophysics Data System (ADS)

    Phillips, M.; Ansari, S.; Del Greco, S.

    2007-12-01

    The Severe Weather Data Inventory (SWDI) at NOAA's National Climatic Data Center (NCDC) provides user access to archives of several datasets critical to the detection and evaluation of severe weather. These datasets include archives of: · NEXRAD Level-III point features describing general storm structure, hail, mesocyclone and tornado signatures · National Weather Service Storm Events Database · National Weather Service Local Storm Reports collected from storm spotters · National Weather Service Warnings · Lightning strikes from Vaisala's National Lightning Detection Network (NLDN) SWDI archives all of these datasets in a spatial database that allows for convenient searching and subsetting. These data are accessible via the NCDC web site, Web Feature Services (WFS) or automated web services. The results of interactive web page queries may be saved in a variety of formats, including plain text, XML, Google Earth's KMZ, standards-based NetCDF and Shapefile. NCDC's Storm Risk Assessment Project (SRAP) uses data from the SWDI database to derive gridded climatology products that show the spatial distributions of the frequency of various events. SRAP also can relate SWDI events to other spatial data such as roads, population, watersheds, and other geographic, sociological, or economic data to derive products that are useful in municipal planning, emergency management, the insurance industry, and other areas where there is a need to quantify and qualify how severe weather patterns affect people and property.

  7. An open access thyroid ultrasound image database

    NASA Astrophysics Data System (ADS)

    Pedraza, Lina; Vargas, Carlos; Narváez, Fabián.; Durán, Oscar; Muñoz, Emma; Romero, Eduardo

    2015-01-01

    Computer aided diagnosis systems (CAD) have been developed to assist radiologists in the detection and diagnosis of abnormalities and a large number of pattern recognition techniques have been proposed to obtain a second opinion. Most of these strategies have been evaluated using different datasets making their performance incomparable. In this work, an open access database of thyroid ultrasound images is presented. The dataset consists of a set of B-mode Ultrasound images, including a complete annotation and diagnostic description of suspicious thyroid lesions by expert radiologists. Several types of lesions as thyroiditis, cystic nodules, adenomas and thyroid cancers were included while an accurate lesion delineation is provided in XML format. The diagnostic description of malignant lesions was confirmed by biopsy. The proposed new database is expected to be a resource for the community to assess different CAD systems.

  8. RBscore&NBench: a high-level web server for nucleic acid binding residues prediction with a large-scale benchmarking database.

    PubMed

    Miao, Zhichao; Westhof, Eric

    2016-07-08

    RBscore&NBench combines a web server, RBscore and a database, NBench. RBscore predicts RNA-/DNA-binding residues in proteins and visualizes the prediction scores and features on protein structures. The scoring scheme of RBscore directly links feature values to nucleic acid binding probabilities and illustrates the nucleic acid binding energy funnel on the protein surface. To avoid dataset, binding site definition and assessment metric biases, we compared RBscore with 18 web servers and 3 stand-alone programs on 41 datasets, which demonstrated the high and stable accuracy of RBscore. A comprehensive comparison led us to develop a benchmark database named NBench. The web server is available on: http://ahsoka.u-strasbg.fr/rbscorenbench/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  9. Learning Computational Models of Video Memorability from fMRI Brain Imaging.

    PubMed

    Han, Junwei; Chen, Changyuan; Shao, Ling; Hu, Xintao; Han, Jungong; Liu, Tianming

    2015-08-01

    Generally, various visual media are unequally memorable by the human brain. This paper looks into a new direction of modeling the memorability of video clips and automatically predicting how memorable they are by learning from brain functional magnetic resonance imaging (fMRI). We propose a novel computational framework by integrating the power of low-level audiovisual features and brain activity decoding via fMRI. Initially, a user study experiment is performed to create a ground truth database for measuring video memorability and a set of effective low-level audiovisual features is examined in this database. Then, human subjects' brain fMRI data are obtained when they are watching the video clips. The fMRI-derived features that convey the brain activity of memorizing videos are extracted using a universal brain reference system. Finally, due to the fact that fMRI scanning is expensive and time-consuming, a computational model is learned on our benchmark dataset with the objective of maximizing the correlation between the low-level audiovisual features and the fMRI-derived features using joint subspace learning. The learned model can then automatically predict the memorability of videos without fMRI scans. Evaluations on publically available image and video databases demonstrate the effectiveness of the proposed framework.

  10. The BioGRID interaction database: 2013 update.

    PubMed

    Chatr-Aryamontri, Andrew; Breitkreutz, Bobby-Joe; Heinicke, Sven; Boucher, Lorrie; Winter, Andrew; Stark, Chris; Nixon, Julie; Ramage, Lindsay; Kolas, Nadine; O'Donnell, Lara; Reguly, Teresa; Breitkreutz, Ashton; Sellam, Adnane; Chen, Daici; Chang, Christie; Rust, Jennifer; Livstone, Michael; Oughtred, Rose; Dolinski, Kara; Tyers, Mike

    2013-01-01

    The Biological General Repository for Interaction Datasets (BioGRID: http//thebiogrid.org) is an open access archive of genetic and protein interactions that are curated from the primary biomedical literature for all major model organism species. As of September 2012, BioGRID houses more than 500 000 manually annotated interactions from more than 30 model organisms. BioGRID maintains complete curation coverage of the literature for the budding yeast Saccharomyces cerevisiae, the fission yeast Schizosaccharomyces pombe and the model plant Arabidopsis thaliana. A number of themed curation projects in areas of biomedical importance are also supported. BioGRID has established collaborations and/or shares data records for the annotation of interactions and phenotypes with most major model organism databases, including Saccharomyces Genome Database, PomBase, WormBase, FlyBase and The Arabidopsis Information Resource. BioGRID also actively engages with the text-mining community to benchmark and deploy automated tools to expedite curation workflows. BioGRID data are freely accessible through both a user-defined interactive interface and in batch downloads in a wide variety of formats, including PSI-MI2.5 and tab-delimited files. BioGRID records can also be interrogated and analyzed with a series of new bioinformatics tools, which include a post-translational modification viewer, a graphical viewer, a REST service and a Cytoscape plugin.

  11. The ASTARTE Paleotsunami Deposits data base - web-based references for tsunami research in the NEAM region

    NASA Astrophysics Data System (ADS)

    De Martini, P. M.; Pantosti, D.; Orefice, S.; Smedile, A.; Patera, A.; Paris, R.; Terrinha, P.; Hunt, J.; Papadopoulos, G. A.; Noiva, J.; Triantafyllou, I.; Yalciner, A. C.

    2017-12-01

    EU project ASTARTE aimed at developing a higher level of tsunami hazard assessment in the North-Eastern Atlantic, the Mediterranean and connected seas (NEAM) region by a combination of field work, experimental work, numerical modeling and technical development. The project was a cooperative work of 26 institutes from 16 countries and linked together the description of past tsunamigenic events, the identification and characterization of tsunami sources, the calculation of the impact of such events, and the development of adequate resilience and risks mitigation strategies (http://www.astarte-project.eu/). Within ASTARTE, a web-based database on Paleotsunami Deposits in the NEAM area was created with the purpose to be the future information repository for tsunami research in the broad region. The aim of the project is the integration of every existing official scientific reports and peer reviewed papers on this topic. The database, which archives information and detailed data crucial for tsunami modeling, will be updated on new entries every 12 months. A relational database managed by ArcGIS for Desktop 10.x software has been implemented. One of the final goals of the project is the public sharing of the archived dataset through a web-based map service that will allow visualizing, querying, analyzing, and interpreting the dataset. The interactive map service is hosted by ArcGIS Online and will deploy the cloud capabilities of the portal. Any interested users will be able to access the online GIS resources through any Internet browser or specific apps that run on desktop machines, smartphones, or tablets and will be able to use the analytical tools, key tasks, and workflows of the service. We will present the database structure (characterized by the presence of two main tables: the Site table and the Event table) and topics as well as their ArcGIS Online version. To date, a total of 151 sites and 220 tsunami evidence have been recorded within the ASTARTE database. The ASTARTE Paleotsunami Deposits database - NEAM region is now available online at the address http://arcg.is/1CWz0. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 603839 (Project ASTARTE - Assessment, Strategy and Risk Reduction for Tsunamis in Europe).

  12. FOAM (Functional Ontology Assignments for Metagenomes): A Hidden Markov Model (HMM) database with environmental focus

    DOE PAGES

    Prestat, Emmanuel; David, Maude M.; Hultman, Jenni; ...

    2014-09-26

    A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classify gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. ‘profiles’) were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. Within this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associatedmore » functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.« less

  13. Large-Scale Image Analytics Using Deep Learning

    NASA Astrophysics Data System (ADS)

    Ganguly, S.; Nemani, R. R.; Basu, S.; Mukhopadhyay, S.; Michaelis, A.; Votava, P.

    2014-12-01

    High resolution land cover classification maps are needed to increase the accuracy of current Land ecosystem and climate model outputs. Limited studies are in place that demonstrates the state-of-the-art in deriving very high resolution (VHR) land cover products. In addition, most methods heavily rely on commercial softwares that are difficult to scale given the region of study (e.g. continents to globe). Complexities in present approaches relate to (a) scalability of the algorithm, (b) large image data processing (compute and memory intensive), (c) computational cost, (d) massively parallel architecture, and (e) machine learning automation. In addition, VHR satellite datasets are of the order of terabytes and features extracted from these datasets are of the order of petabytes. In our present study, we have acquired the National Agricultural Imaging Program (NAIP) dataset for the Continental United States at a spatial resolution of 1-m. This data comes as image tiles (a total of quarter million image scenes with ~60 million pixels) and has a total size of ~100 terabytes for a single acquisition. Features extracted from the entire dataset would amount to ~8-10 petabytes. In our proposed approach, we have implemented a novel semi-automated machine learning algorithm rooted on the principles of "deep learning" to delineate the percentage of tree cover. In order to perform image analytics in such a granular system, it is mandatory to devise an intelligent archiving and query system for image retrieval, file structuring, metadata processing and filtering of all available image scenes. Using the Open NASA Earth Exchange (NEX) initiative, which is a partnership with Amazon Web Services (AWS), we have developed an end-to-end architecture for designing the database and the deep belief network (following the distbelief computing model) to solve a grand challenge of scaling this process across quarter million NAIP tiles that cover the entire Continental United States. The AWS core components that we use to solve this problem are DynamoDB along with S3 for database query and storage, ElastiCache shared memory architecture for image segmentation, Elastic Map Reduce (EMR) for image feature extraction, and the memory optimized Elastic Cloud Compute (EC2) for the learning algorithm.

  14. An object model and database for functional genomics.

    PubMed

    Jones, Andrew; Hunt, Ela; Wastling, Jonathan M; Pizarro, Angel; Stoeckert, Christian J

    2004-07-10

    Large-scale functional genomics analysis is now feasible and presents significant challenges in data analysis, storage and querying. Data standards are required to enable the development of public data repositories and to improve data sharing. There is an established data format for microarrays (microarray gene expression markup language, MAGE-ML) and a draft standard for proteomics (PEDRo). We believe that all types of functional genomics experiments should be annotated in a consistent manner, and we hope to open up new ways of comparing multiple datasets used in functional genomics. We have created a functional genomics experiment object model (FGE-OM), developed from the microarray model, MAGE-OM and two models for proteomics, PEDRo and our own model (Gla-PSI-Glasgow Proposal for the Proteomics Standards Initiative). FGE-OM comprises three namespaces representing (i) the parts of the model common to all functional genomics experiments; (ii) microarray-specific components; and (iii) proteomics-specific components. We believe that FGE-OM should initiate discussion about the contents and structure of the next version of MAGE and the future of proteomics standards. A prototype database called RNA And Protein Abundance Database (RAPAD), based on FGE-OM, has been implemented and populated with data from microbial pathogenesis. FGE-OM and the RAPAD schema are available from http://www.gusdb.org/fge.html, along with a set of more detailed diagrams. RAPAD can be accessed by registration at the site.

  15. The NCBI BioSystems database

    PubMed Central

    Geer, Lewis Y.; Marchler-Bauer, Aron; Geer, Renata C.; Han, Lianyi; He, Jane; He, Siqian; Liu, Chunlei; Shi, Wenyao; Bryant, Stephen H.

    2010-01-01

    The NCBI BioSystems database, found at http://www.ncbi.nlm.nih.gov/biosystems/, centralizes and cross-links existing biological systems databases, increasing their utility and target audience by integrating their pathways and systems into NCBI resources. This integration allows users of NCBI’s Entrez databases to quickly categorize proteins, genes and small molecules by metabolic pathway, disease state or other BioSystem type, without requiring time-consuming inference of biological relationships from the literature or multiple experimental datasets. PMID:19854944

  16. Comparative Evaluation of Five Fire Emissions Datasets Using the GEOS-5 Model

    NASA Astrophysics Data System (ADS)

    Ichoku, C. M.; Pan, X.; Chin, M.; Bian, H.; Darmenov, A.; Ellison, L.; Kucsera, T. L.; da Silva, A. M., Jr.; Petrenko, M. M.; Wang, J.; Ge, C.; Wiedinmyer, C.

    2017-12-01

    Wildfires and other types of biomass burning affect most vegetated parts of the globe, contributing 40% of the annual global atmospheric loading of carbonaceous aerosols, as well as significant amounts of numerous trace gases, such as carbon dioxide, carbon monoxide, and methane. Many of these smoke constituents affect the air quality and/or the climate system directly or through their interactions with solar radiation and cloud properties. However, fire emissions are poorly constrained in global and regional models, resulting in high levels of uncertainty in understanding their real impacts. With the advent of satellite remote sensing of fires and burned areas in the last couple of decades, a number of fire emissions products have become available for use in relevant research and applications. In this study, we evaluated five global biomass burning emissions datasets, namely: (1) GFEDv3.1 (Global Fire Emissions Database version 3.1); (2) GFEDv4s (Global Fire Emissions Database version 4 with small fires); (3) FEERv1 (Fire Energetics and Emissions Research version 1.0); (4) QFEDv2.4 (Quick Fire Emissions Dataset version 2.4); and (5) Fire INventory from NCAR (FINN) version 1.5. Overall, the spatial patterns of biomass burning emissions from these inventories are similar, although the magnitudes of the emissions can be noticeably different. The inventories derived using top-down approaches (QFEDv2.4 and FEERv1) are larger than those based on bottom-up approaches. For example, global organic carbon (OC) emissions in 2008 are: QFEDv2.4 (51.93 Tg), FEERv1 (28.48 Tg), FINN v1.5 (19.48 Tg), GFEDv3.1 (15.65 Tg) and GFEDv4s (13.76 Tg); representing a factor of 3.7 difference between the largest and the least. We also used all five biomass-burning emissions datasets to conduct aerosol simulations using the NASA Goddard Earth Observing System Model, Version 5 (GEOS-5), and compared the resulting aerosol optical depth (AOD) output to the corresponding retrievals from MODIS and AERONET. Simulated AOD based on all five emissions inventories show significant underestimation in biomass burning dominated regions. Attributions of possible factors responsible for the differences among the inventories were further explored in Southern Africa and South America, two of the major biomass burning regions of the world.

  17. Development of a video tampering dataset for forensic investigation.

    PubMed

    Ismael Al-Sanjary, Omar; Ahmed, Ahmed Abdullah; Sulong, Ghazali

    2016-09-01

    Forgery is an act of modifying a document, product, image or video, among other media. Video tampering detection research requires an inclusive database of video modification. This paper aims to discuss a comprehensive proposal to create a dataset composed of modified videos for forensic investigation, in order to standardize existing techniques for detecting video tampering. The primary purpose of developing and designing this new video library is for usage in video forensics, which can be consciously associated with reliable verification using dynamic and static camera recognition. To the best of the author's knowledge, there exists no similar library among the research community. Videos were sourced from YouTube and by exploring social networking sites extensively by observing posted videos and rating their feedback. The video tampering dataset (VTD) comprises a total of 33 videos, divided among three categories in video tampering: (1) copy-move, (2) splicing, and (3) swapping-frames. Compared to existing datasets, this is a higher number of tampered videos, and with longer durations. The duration of every video is 16s, with a 1280×720 resolution, and a frame rate of 30 frames per second. Moreover, all videos possess the same formatting quality (720p(HD).avi). Both temporal and spatial video features were considered carefully during selection of the videos, and there exists complete information related to the doctored regions in every modified video in the VTD dataset. This database has been made publically available for research on splicing, Swapping frames, and copy-move tampering, and, as such, various video tampering detection issues with ground truth. The database has been utilised by many international researchers and groups of researchers. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  18. A data skimming service for locally resident analysis data

    NASA Astrophysics Data System (ADS)

    Cranshaw, J.; Gardner, R. W.; Gieraltowski, J.; Malon, D.; Mambelli, M.; May, E.

    2008-07-01

    A Data Skimming Service (DSS) is a site-level service for rapid event filtering and selection from locally resident datasets based on metadata queries to associated 'tag' databases. In US ATLAS, we expect most if not all of the AOD-based datasets to be replicated to each of the five Tier 2 regional facilities in the US Tier 1 'cloud' coordinated by Brookhaven National Laboratory. Entire datasets will consist of on the order of several terabytes of data, and providing easy, quick access to skimmed subsets of these data will be vital to physics working groups. Typically, physicists will be interested in portions of the complete datasets, selected according to event-level attributes (number of jets, missing Et, etc) and content (specific analysis objects for subsequent processing). In this paper we describe methods used to classify data (metadata tag generation) and to store these results in a local database. Next we discuss a general framework which includes methods for accessing this information, defining skims, specifying event output content, accessing locally available storage through a variety of interfaces (SRM, dCache/dccp, gridftp), accessing remote storage elements as specified, and user job submission tools through local or grid schedulers. The advantages of the DSS are the ability to quickly 'browse' datasets and design skims, for example, pre-adjusting cuts to get to a desired skim level with minimal use of compute resources, and to encode these analysis operations in a database for re-analysis and archival purposes. Additionally the framework has provisions to operate autonomously in the event that external, central resources are not available, and to provide, as a reduced package, a minimal skimming service tailored to the needs of small Tier 3 centres or individual users.

  19. The MIND PALACE: A Multi-Spectral Imaging and Spectroscopy Database for Planetary Science

    NASA Astrophysics Data System (ADS)

    Eshelman, E.; Doloboff, I.; Hara, E. K.; Uckert, K.; Sapers, H. M.; Abbey, W.; Beegle, L. W.; Bhartia, R.

    2017-12-01

    The Multi-Instrument Database (MIND) is the web-based home to a well-characterized set of analytical data collected by a suite of deep-UV fluorescence/Raman instruments built at the Jet Propulsion Laboratory (JPL). Samples derive from a growing body of planetary surface analogs, mineral and microbial standards, meteorites, spacecraft materials, and other astrobiologically relevant materials. In addition to deep-UV spectroscopy, datasets stored in MIND are obtained from a variety of analytical techniques obtained over multiple spatial and spectral scales including electron microscopy, optical microscopy, infrared spectroscopy, X-ray fluorescence, and direct fluorescence imaging. Multivariate statistical analysis techniques, primarily Principal Component Analysis (PCA), are used to guide interpretation of these large multi-analytical spectral datasets. Spatial co-referencing of integrated spectral/visual maps is performed using QGIS (geographic information system software). Georeferencing techniques transform individual instrument data maps into a layered co-registered data cube for analysis across spectral and spatial scales. The body of data in MIND is intended to serve as a permanent, reliable, and expanding database of deep-UV spectroscopy datasets generated by this unique suite of JPL-based instruments on samples of broad planetary science interest.

  20. Analyzing legacy U.S. Geological Survey geochemical databases using GIS: applications for a national mineral resource assessment

    USGS Publications Warehouse

    Yager, Douglas B.; Hofstra, Albert H.; Granitto, Matthew

    2012-01-01

    This report emphasizes geographic information system analysis and the display of data stored in the legacy U.S. Geological Survey National Geochemical Database for use in mineral resource investigations. Geochemical analyses of soils, stream sediments, and rocks that are archived in the National Geochemical Database provide an extensive data source for investigating geochemical anomalies. A study area in the Egan Range of east-central Nevada was used to develop a geographic information system analysis methodology for two different geochemical datasets involving detailed (Bureau of Land Management Wilderness) and reconnaissance-scale (National Uranium Resource Evaluation) investigations. ArcGIS was used to analyze and thematically map geochemical information at point locations. Watershed-boundary datasets served as a geographic reference to relate potentially anomalous sample sites with hydrologic unit codes at varying scales. The National Hydrography Dataset was analyzed with Hydrography Event Management and ArcGIS Utility Network Analyst tools to delineate potential sediment-sample provenance along a stream network. These tools can be used to track potential upstream-sediment-contributing areas to a sample site. This methodology identifies geochemically anomalous sample sites, watersheds, and streams that could help focus mineral resource investigations in the field.

  1. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.

  2. Prediction of lung cancer patient survival via supervised machine learning classification techniques.

    PubMed

    Lynch, Chip M; Abdollahi, Behnaz; Fuqua, Joshua D; de Carlo, Alexandra R; Bartholomai, James A; Balgemann, Rayeanne N; van Berkel, Victor H; Frieboes, Hermann B

    2017-12-01

    Outcomes for cancer patients have been previously estimated by applying various machine learning techniques to large datasets such as the Surveillance, Epidemiology, and End Results (SEER) program database. In particular for lung cancer, it is not well understood which types of techniques would yield more predictive information, and which data attributes should be used in order to determine this information. In this study, a number of supervised learning techniques is applied to the SEER database to classify lung cancer patients in terms of survival, including linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM), and a custom ensemble. Key data attributes in applying these methods include tumor grade, tumor size, gender, age, stage, and number of primaries, with the goal to enable comparison of predictive power between the various methods The prediction is treated like a continuous target, rather than a classification into categories, as a first step towards improving survival prediction. The results show that the predicted values agree with actual values for low to moderate survival times, which constitute the majority of the data. The best performing technique was the custom ensemble with a Root Mean Square Error (RMSE) value of 15.05. The most influential model within the custom ensemble was GBM, while Decision Trees may be inapplicable as it had too few discrete outputs. The results further show that among the five individual models generated, the most accurate was GBM with an RMSE value of 15.32. Although SVM underperformed with an RMSE value of 15.82, statistical analysis singles the SVM as the only model that generated a distinctive output. The results of the models are consistent with a classical Cox proportional hazards model used as a reference technique. We conclude that application of these supervised learning techniques to lung cancer data in the SEER database may be of use to estimate patient survival time with the ultimate goal to inform patient care decisions, and that the performance of these techniques with this particular dataset may be on par with that of classical methods. Copyright © 2017 Elsevier B.V. All rights reserved.

  3. Application of the Conway-Maxwell-Poisson generalized linear model for analyzing motor vehicle crashes.

    PubMed

    Lord, Dominique; Guikema, Seth D; Geedipally, Srinivas Reddy

    2008-05-01

    This paper documents the application of the Conway-Maxwell-Poisson (COM-Poisson) generalized linear model (GLM) for modeling motor vehicle crashes. The COM-Poisson distribution, originally developed in 1962, has recently been re-introduced by statisticians for analyzing count data subjected to over- and under-dispersion. This innovative distribution is an extension of the Poisson distribution. The objectives of this study were to evaluate the application of the COM-Poisson GLM for analyzing motor vehicle crashes and compare the results with the traditional negative binomial (NB) model. The comparison analysis was carried out using the most common functional forms employed by transportation safety analysts, which link crashes to the entering flows at intersections or on segments. To accomplish the objectives of the study, several NB and COM-Poisson GLMs were developed and compared using two datasets. The first dataset contained crash data collected at signalized four-legged intersections in Toronto, Ont. The second dataset included data collected for rural four-lane divided and undivided highways in Texas. Several methods were used to assess the statistical fit and predictive performance of the models. The results of this study show that COM-Poisson GLMs perform as well as NB models in terms of GOF statistics and predictive performance. Given the fact the COM-Poisson distribution can also handle under-dispersed data (while the NB distribution cannot or has difficulties converging), which have sometimes been observed in crash databases, the COM-Poisson GLM offers a better alternative over the NB model for modeling motor vehicle crashes, especially given the important limitations recently documented in the safety literature about the latter type of model.

  4. A comparison of spatial interpolation methods for soil temperature over a complex topographical region

    NASA Astrophysics Data System (ADS)

    Wu, Wei; Tang, Xiao-Ping; Ma, Xue-Qing; Liu, Hong-Bin

    2016-08-01

    Soil temperature variability data provide valuable information on understanding land-surface ecosystem processes and climate change. This study developed and analyzed a spatial dataset of monthly mean soil temperature at a depth of 10 cm over a complex topographical region in southwestern China. The records were measured at 83 stations during the period of 1961-2000. Nine approaches were compared for interpolating soil temperature. The accuracy indicators were root mean square error (RMSE), modelling efficiency (ME), and coefficient of residual mass (CRM). The results indicated that thin plate spline with latitude, longitude, and elevation gave the best performance with RMSE varying between 0.425 and 0.592 °C, ME between 0.895 and 0.947, and CRM between -0.007 and 0.001. A spatial database was developed based on the best model. The dataset showed that larger seasonal changes of soil temperature were from autumn to winter over the region. The northern and eastern areas with hilly and low-middle mountains experienced larger seasonal changes.

  5. The MAREDAT Global Database of High Performance Liquid Chromatography Marine Pigment Measurements

    NASA Technical Reports Server (NTRS)

    Peloquin, J.; Swan, C.; Gruber, N.; Vogt, M.; Claustre, H.; Ras, J.; Uitz, J.; Barlow, R.; Behrenfeld, M.; Bidigare, R.; hide

    2013-01-01

    A global pigment database consisting of 35 634 pigment suites measured by high performance liquid chromatography was assembled in support of the MARine Ecosytem DATa (MAREDAT) initiative. These data originate from 136 field surveys within the global ocean, were solicited from investigators and databases, compiled, and then quality controlled. Nearly one quarter of the data originates from the Laboratoire d'Oc´eanographie de Villefranche (LOV), with an additional 17% and 19% stemming from the US JGOFS and LTER programs, respectively. The MAREDAT pigment database provides high quality measurements of the major taxonomic pigments including chlorophylls a and b, 19'-butanoyloxyfucoxanthin, 19'- hexanoyloxyfucoxanthin, alloxanthin, divinyl chlorophyll a, fucoxanthin, lutein, peridinin, prasinoxanthin, violaxanthin and zeaxanthin, which may be used in varying combinations to estimate phytoplankton community composition. Quality control measures consisted of flagging samples that had a total chlorophyll a concentration of zero, had fewer than four reported accessory pigments, or exceeded two standard deviations of the log-linear regression of total chlorophyll a with total accessory pigment concentrations. We anticipate the MAREDAT pigment database to be of use in the marine ecology, remote sensing and ecological modeling communities, where it will support model validation and advance our global perspective on marine biodiversity. The original dataset together with quality control flags as well as the gridded MAREDAT pigment data may be downloaded from PANGAEA: http://doi.pangaea.de/10. 1594/PANGAEA.793246.

  6. Completion of the National Land Cover Database (NLCD) 1992-2001 Land Cover Change Retrofit Product

    EPA Science Inventory

    The Multi-Resolution Land Characteristics Consortium has supported the development of two national digital land cover products: the National Land Cover Dataset (NLCD) 1992 and National Land Cover Database (NLCD) 2001. Substantial differences in imagery, legends, and methods betwe...

  7. A Hadoop-based Molecular Docking System

    NASA Astrophysics Data System (ADS)

    Dong, Yueli; Guo, Quan; Sun, Bin

    2017-10-01

    Molecular docking always faces the challenge of managing tens of TB datasets. It is necessary to improve the efficiency of the storage and docking. We proposed the molecular docking platform based on Hadoop for virtual screening, it provides the preprocessing of ligand datasets and the analysis function of the docking results. A molecular cloud database that supports mass data management is constructed. Through this platform, the docking time is reduced, the data storage is efficient, and the management of the ligand datasets is convenient.

  8. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research.

    PubMed

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R; Beresford, Michael W

    2018-02-01

    This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  9. Automated mapping of persistent ice and snow cover across the western U.S. with Landsat

    NASA Astrophysics Data System (ADS)

    Selkowitz, David J.; Forster, Richard R.

    2016-07-01

    We implemented an automated approach for mapping persistent ice and snow cover (PISC) across the conterminous western U.S. using all available Landsat TM and ETM+ scenes acquired during the late summer/early fall period between 2010 and 2014. Two separate validation approaches indicate this dataset provides a more accurate representation of glacial ice and perennial snow cover for the region than either the U.S. glacier database derived from US Geological Survey (USGS) Digital Raster Graphics (DRG) maps (based on aerial photography primarily from the 1960s-1980s) or the National Land Cover Database 2011 perennial ice and snow cover class. Our 2010-2014 Landsat-derived dataset indicates 28% less glacier and perennial snow cover than the USGS DRG dataset. There are larger differences between the datasets in some regions, such as the Rocky Mountains of Northwest Wyoming and Southwest Montana, where the Landsat dataset indicates 54% less PISC area. Analysis of Landsat scenes from 1987-1988 and 2008-2010 for three regions using a more conventional, semi-automated approach indicates substantial decreases in glaciers and perennial snow cover that correlate with differences between PISC mapped by the USGS DRG dataset and the automated Landsat-derived dataset. This suggests that most of the differences in PISC between the USGS DRG and the Landsat-derived dataset can be attributed to decreases in PISC, as opposed to differences between mapping techniques. While the dataset produced by the automated Landsat mapping approach is not designed to serve as a conventional glacier inventory that provides glacier outlines and attribute information, it allows for an updated estimate of PISC for the conterminous U.S. as well as for smaller regions. Additionally, the new dataset highlights areas where decreases in PISC have been most significant over the past 25-50 years.

  10. Automated mapping of persistent ice and snow cover across the western U.S. with Landsat

    USGS Publications Warehouse

    Selkowitz, David J.; Forster, Richard R.

    2016-01-01

    We implemented an automated approach for mapping persistent ice and snow cover (PISC) across the conterminous western U.S. using all available Landsat TM and ETM+ scenes acquired during the late summer/early fall period between 2010 and 2014. Two separate validation approaches indicate this dataset provides a more accurate representation of glacial ice and perennial snow cover for the region than either the U.S. glacier database derived from US Geological Survey (USGS) Digital Raster Graphics (DRG) maps (based on aerial photography primarily from the 1960s–1980s) or the National Land Cover Database 2011 perennial ice and snow cover class. Our 2010–2014 Landsat-derived dataset indicates 28% less glacier and perennial snow cover than the USGS DRG dataset. There are larger differences between the datasets in some regions, such as the Rocky Mountains of Northwest Wyoming and Southwest Montana, where the Landsat dataset indicates 54% less PISC area. Analysis of Landsat scenes from 1987–1988 and 2008–2010 for three regions using a more conventional, semi-automated approach indicates substantial decreases in glaciers and perennial snow cover that correlate with differences between PISC mapped by the USGS DRG dataset and the automated Landsat-derived dataset. This suggests that most of the differences in PISC between the USGS DRG and the Landsat-derived dataset can be attributed to decreases in PISC, as opposed to differences between mapping techniques. While the dataset produced by the automated Landsat mapping approach is not designed to serve as a conventional glacier inventory that provides glacier outlines and attribute information, it allows for an updated estimate of PISC for the conterminous U.S. as well as for smaller regions. Additionally, the new dataset highlights areas where decreases in PISC have been most significant over the past 25–50 years.

  11. Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform

    PubMed Central

    Wang, Min; Tian, Yun

    2018-01-01

    The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach system speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance. PMID:29861711

  12. Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets

    NASA Astrophysics Data System (ADS)

    Juric, Mario

    2011-01-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than >10^2 nodes, and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ('column groups'), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping ``cells'' by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce ``kernels'' that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we sweeped through 1.1 Grows of PanSTARRS+SDSS data (220GB) less than 15 minutes on a dual CPU machine. In a cluster environment, we achieved bandwidths of 17Gbits/sec (I/O limited). Based on current experience, we believe LSD should scale to be useful for analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.

  13. A novel algorithm for reducing false arrhythmia alarms in intensive care units.

    PubMed

    Srivastava, Chandan; Sharma, Sonal; Jalali, Ali

    2016-08-01

    Alarm fatigue in intensive care units (ICU) is one of the top healthcare issues in the US. False alarms in ICU will decrease the quality of care and staff response time over the alarms. Normally, false alarm will cause desensitization of the clinical staff which leads to warnings and misleading, if the triggered alarm is true. In this study, we have proposed a multi-model ensemble approach to reduce the false alarm rate in monitoring systems. We have used 750 patient records from PhysioNet database. At First arrhythmia based features from electrocardiogram (ECG), arterial blood pressure (ABP) and photoplethysmogram (PPG) features were extracted from the records. Next, the dataset has been separated into two subsets on the basis of available features information. The first dataset (DS1) is the combination of ECG physiological, ABP and PPG features. Their correlation coefficient and p-values criteria have been applied for relevant alarm-wise feature-set selection, and random forest classifier was used for model development and validation. The threshold based approach was used on second dataset (DS2) which is the combination of arrhythmia, ABP and PPG features. The developed ensemble model is able to achieve sensitivity 83.33-100 % (average 95.56 %) being true alarms and suppress false alarms rate 66.67-89% (average 77.25%). The predictability of classifier shows the advantage to deal with unbalanced set of information, therefore overall model performance has reached to 83.96% accuracy.

  14. Here the data: the new FLUXNET collection and the future for model-data integration

    NASA Astrophysics Data System (ADS)

    Papale, D.; Pastorello, G.; Trotta, C.; Chu, H.; Canfora, E.; Agarwal, D.; Baldocchi, D. D.; Torn, M. S.

    2016-12-01

    Seven years after the release of the LaThuile FLUXNET database, widely used in synthesis activities and model-data fusion exercises, a new FLUXNET collection has been released (FLUXNET 2015 - http://fluxnet.fluxdata.org) with the aim to increase the quality of the measurements and provide high quality standardized data obtained by a new processing pipeline. The new FLUXNET collection includes also sites with timeseries of 20 years of continuous carbon and energy fluxes, opening new opportunities in their use in the context of models parameterization and validation. The main characteristics of the FLUXNET 2015 dataset are the uncertainty quantification, the multiple products (e.g. partitioning in photosynthesis and ecosystem respiration) that allow consistency analysis for each site, and new long term downscaled meteorological data provided with the data. Feedbacks from new users, in particular from the modelling communities, are crucial to further improve the quality of the products and move in the direction of a coherent integration across multi-disciplinary communities. In this presentation, the new FLUXNET2015 dataset will be explained and explored, with particular focus on the meaning of the different products and variables, their potentiality but also their limitations. The future development of the dataset will be discussed, with the role of the regional networks and the ongoing efforts to provide new and advanced services such a near real time data provision and a completely open access policy to high quality standardized measurements of GHGs exchanges and additional ecological quantities.

  15. Antarctic Porifera database from the Spanish benthic expeditions

    PubMed Central

    Rios, Pilar; Cristobo, Javier

    2014-01-01

    Abstract The information about the sponges in this dataset is derived from the samples collected during five Spanish Antarctic expeditions: Bentart 94, Bentart 95, Gebrap 96, Ciemar 99/00 and Bentart 2003. Samples were collected in the Antarctic Peninsula and Bellingshausen Sea at depths ranging from 4 to 2044 m using various sampling gears. The Antarctic Porifera database from the Spanish benthic expeditions is unique as it provides information for an under-explored region of the Southern Ocean (Bellingshausen Sea). It fills an information gap on Antarctic deep-sea sponges, for which there were previously very few data. This phylum is an important part of the Antarctic biota and plays a key role in the structure of the Antarctic marine benthic community due to its considerable diversity and predominance in different areas. It is often a dominant component of Southern Ocean benthic communities. The quality of the data was controlled very thoroughly with GPS systems onboard the R/V Hesperides and by checking the data against the World Porifera Database (which is part of the World Register of Marine Species, WoRMS). The data are therefore fit for completing checklists, inclusion in biodiversity pattern analysis and niche modelling. The authors can be contacted if any additional information is needed before carrying out detailed biodiversity or biogeographic studies. The dataset currently contains 767 occurrence data items that have been checked for systematic reliability. This database is not yet complete and the collection is growing. Specimens are stored in the author’s collection at the Spanish Institute of Oceanography (IEO) in the city of Gijón (Spain). The data are available in GBIF. PMID:24843257

  16. Development of the Global Earthquake Model’s neotectonic fault database

    USGS Publications Warehouse

    Christophersen, Annemarie; Litchfield, Nicola; Berryman, Kelvin; Thomas, Richard; Basili, Roberto; Wallace, Laura; Ries, William; Hayes, Gavin P.; Haller, Kathleen M.; Yoshioka, Toshikazu; Koehler, Richard D.; Clark, Dan; Wolfson-Schwehr, Monica; Boettcher, Margaret S.; Villamor, Pilar; Horspool, Nick; Ornthammarath, Teraphan; Zuñiga, Ramon; Langridge, Robert M.; Stirling, Mark W.; Goded, Tatiana; Costa, Carlos; Yeats, Robert

    2015-01-01

    The Global Earthquake Model (GEM) aims to develop uniform, openly available, standards, datasets and tools for worldwide seismic risk assessment through global collaboration, transparent communication and adapting state-of-the-art science. GEM Faulted Earth (GFE) is one of GEM’s global hazard module projects. This paper describes GFE’s development of a modern neotectonic fault database and a unique graphical interface for the compilation of new fault data. A key design principle is that of an electronic field notebook for capturing observations a geologist would make about a fault. The database is designed to accommodate abundant as well as sparse fault observations. It features two layers, one for capturing neotectonic faults and fold observations, and the other to calculate potential earthquake fault sources from the observations. In order to test the flexibility of the database structure and to start a global compilation, five preexisting databases have been uploaded to the first layer and two to the second. In addition, the GFE project has characterised the world’s approximately 55,000 km of subduction interfaces in a globally consistent manner as a basis for generating earthquake event sets for inclusion in earthquake hazard and risk modelling. Following the subduction interface fault schema and including the trace attributes of the GFE database schema, the 2500-km-long frontal thrust fault system of the Himalaya has also been characterised. We propose the database structure to be used widely, so that neotectonic fault data can make a more complete and beneficial contribution to seismic hazard and risk characterisation globally.

  17. NHEXAS PHASE I MARYLAND STUDY--STANDARD OPERATING PROCEDURE FOR LAB RESULTS DATA ENTRY AND PREPARATION (D03)

    EPA Science Inventory

    The purpose of this SOP is to describe how lab results are organized and processed into the official database known as the Complete Dataset (CDS); to describe the structure and creation of the Analysis-ready Dataset (ADS); and to describe the structure and process of creating the...

  18. Analysing and correcting the differences between multi-source and multi-scale spatial remote sensing observations.

    PubMed

    Dong, Yingying; Luo, Ruisen; Feng, Haikuan; Wang, Jihua; Zhao, Jinling; Zhu, Yining; Yang, Guijun

    2014-01-01

    Differences exist among analysis results of agriculture monitoring and crop production based on remote sensing observations, which are obtained at different spatial scales from multiple remote sensors in same time period, and processed by same algorithms, models or methods. These differences can be mainly quantitatively described from three aspects, i.e. multiple remote sensing observations, crop parameters estimation models, and spatial scale effects of surface parameters. Our research proposed a new method to analyse and correct the differences between multi-source and multi-scale spatial remote sensing surface reflectance datasets, aiming to provide references for further studies in agricultural application with multiple remotely sensed observations from different sources. The new method was constructed on the basis of physical and mathematical properties of multi-source and multi-scale reflectance datasets. Theories of statistics were involved to extract statistical characteristics of multiple surface reflectance datasets, and further quantitatively analyse spatial variations of these characteristics at multiple spatial scales. Then, taking the surface reflectance at small spatial scale as the baseline data, theories of Gaussian distribution were selected for multiple surface reflectance datasets correction based on the above obtained physical characteristics and mathematical distribution properties, and their spatial variations. This proposed method was verified by two sets of multiple satellite images, which were obtained in two experimental fields located in Inner Mongolia and Beijing, China with different degrees of homogeneity of underlying surfaces. Experimental results indicate that differences of surface reflectance datasets at multiple spatial scales could be effectively corrected over non-homogeneous underlying surfaces, which provide database for further multi-source and multi-scale crop growth monitoring and yield prediction, and their corresponding consistency analysis evaluation.

  19. Analysing and Correcting the Differences between Multi-Source and Multi-Scale Spatial Remote Sensing Observations

    PubMed Central

    Dong, Yingying; Luo, Ruisen; Feng, Haikuan; Wang, Jihua; Zhao, Jinling; Zhu, Yining; Yang, Guijun

    2014-01-01

    Differences exist among analysis results of agriculture monitoring and crop production based on remote sensing observations, which are obtained at different spatial scales from multiple remote sensors in same time period, and processed by same algorithms, models or methods. These differences can be mainly quantitatively described from three aspects, i.e. multiple remote sensing observations, crop parameters estimation models, and spatial scale effects of surface parameters. Our research proposed a new method to analyse and correct the differences between multi-source and multi-scale spatial remote sensing surface reflectance datasets, aiming to provide references for further studies in agricultural application with multiple remotely sensed observations from different sources. The new method was constructed on the basis of physical and mathematical properties of multi-source and multi-scale reflectance datasets. Theories of statistics were involved to extract statistical characteristics of multiple surface reflectance datasets, and further quantitatively analyse spatial variations of these characteristics at multiple spatial scales. Then, taking the surface reflectance at small spatial scale as the baseline data, theories of Gaussian distribution were selected for multiple surface reflectance datasets correction based on the above obtained physical characteristics and mathematical distribution properties, and their spatial variations. This proposed method was verified by two sets of multiple satellite images, which were obtained in two experimental fields located in Inner Mongolia and Beijing, China with different degrees of homogeneity of underlying surfaces. Experimental results indicate that differences of surface reflectance datasets at multiple spatial scales could be effectively corrected over non-homogeneous underlying surfaces, which provide database for further multi-source and multi-scale crop growth monitoring and yield prediction, and their corresponding consistency analysis evaluation. PMID:25405760

  20. Asteroseismology and the Virtual Observatory

    NASA Astrophysics Data System (ADS)

    Suárez, J. C.

    2010-12-01

    Virtual Observatory is an international project aiming at solving the problem of interoperability among astronomical archives and the scalability in the classical methods of retrieving and analyzing astronomical data in order to deal with huge amounts of datasets. This is being tackled thanks to the standardization of astronomical archives favoring their access in a efficient manner. This project, which is nowadays a reality, is more and more adopted by many fields of Science. In the present paper I will describe the origin of a new era in Stellar Physics whose main role is played by the relationship between asteroseismology and V.O. I will summarize the main concerns of both fields and the current development of VO tools for the development of what we could name as asteroseismology online, in which not only observed datasets are concerned but also the management of model databases.

  1. Heterogeneous postsurgical data analytics for predictive modeling of mortality risks in intensive care units.

    PubMed

    Yun Chen; Hui Yang

    2014-01-01

    The rapid advancements of biomedical instrumentation and healthcare technology have resulted in data-rich environments in hospitals. However, the meaningful information extracted from rich datasets is limited. There is a dire need to go beyond current medical practices, and develop data-driven methods and tools that will enable and help (i) the handling of big data, (ii) the extraction of data-driven knowledge, (iii) the exploitation of acquired knowledge for optimizing clinical decisions. This present study focuses on the prediction of mortality rates in Intensive Care Units (ICU) using patient-specific healthcare recordings. It is worth mentioning that postsurgical monitoring in ICU leads to massive datasets with unique properties, e.g., variable heterogeneity, patient heterogeneity, and time asyncronization. To cope with the challenges in ICU datasets, we developed the postsurgical decision support system with a series of analytical tools, including data categorization, data pre-processing, feature extraction, feature selection, and predictive modeling. Experimental results show that the proposed data-driven methodology outperforms traditional approaches and yields better results based on the evaluation of real-world ICU data from 4000 subjects in the database. This research shows great potentials for the use of data-driven analytics to improve the quality of healthcare services.

  2. RankProd Combined with Genetic Algorithm Optimized Artificial Neural Network Establishes a Diagnostic and Prognostic Prediction Model that Revealed C1QTNF3 as a Biomarker for Prostate Cancer.

    PubMed

    Hou, Qi; Bing, Zhi-Tong; Hu, Cheng; Li, Mao-Yin; Yang, Ke-Hu; Mo, Zu; Xie, Xiang-Wei; Liao, Ji-Lin; Lu, Yan; Horie, Shigeo; Lou, Ming-Wu

    2018-06-01

    Prostate cancer (PCa) is the most commonly diagnosed cancer in males in the Western world. Although prostate-specific antigen (PSA) has been widely used as a biomarker for PCa diagnosis, its results can be controversial. Therefore, new biomarkers are needed to enhance the clinical management of PCa. From publicly available microarray data, differentially expressed genes (DEGs) were identified by meta-analysis with RankProd. Genetic algorithm optimized artificial neural network (GA-ANN) was introduced to establish a diagnostic prediction model and to filter candidate genes. The diagnostic and prognostic capability of the prediction model and candidate genes were investigated in both GEO and TCGA datasets. Candidate genes were further validated by qPCR, Western Blot and Tissue microarray. By RankProd meta-analyses, 2306 significantly up- and 1311 down-regulated probes were found in 133 cases and 30 controls microarray data. The overall accuracy rate of the PCa diagnostic prediction model, consisting of a 15-gene signature, reached up to 100% in both the training and test dataset. The prediction model also showed good results for the diagnosis (AUC = 0.953) and prognosis (AUC of 5 years overall survival time = 0.808) of PCa in the TCGA database. The expression levels of three genes, FABP5, C1QTNF3 and LPHN3, were validated by qPCR. C1QTNF3 high expression was further validated in PCa tissue by Western Blot and Tissue microarray. In the GEO datasets, C1QTNF3 was a good predictor for the diagnosis of PCa (GSE6956: AUC = 0.791; GSE8218: AUC = 0.868; GSE26910: AUC = 0.972). In the TCGA database, C1QTNF3 was significantly associated with PCa patient recurrence free survival (P < .001, AUC = 0.57). In this study, we have developed a diagnostic and prognostic prediction model for PCa. C1QTNF3 was revealed as a promising biomarker for PCa. This approach can be applied to other high-throughput data from different platforms for the discovery of oncogenes or biomarkers in different kinds of diseases. Copyright © 2018. Published by Elsevier B.V.

  3. A Novel Multi-Class Ensemble Model for Classifying Imbalanced Biomedical Datasets

    NASA Astrophysics Data System (ADS)

    Bikku, Thulasi; Sambasiva Rao, N., Dr; Rao, Akepogu Ananda, Dr

    2017-08-01

    This paper mainly focuseson developing aHadoop based framework for feature selection and classification models to classify high dimensionality data in heterogeneous biomedical databases. Wide research has been performing in the fields of Machine learning, Big data and Data mining for identifying patterns. The main challenge is extracting useful features generated from diverse biological systems. The proposed model can be used for predicting diseases in various applications and identifying the features relevant to particular diseases. There is an exponential growth of biomedical repositories such as PubMed and Medline, an accurate predictive model is essential for knowledge discovery in Hadoop environment. Extracting key features from unstructured documents often lead to uncertain results due to outliers and missing values. In this paper, we proposed a two phase map-reduce framework with text preprocessor and classification model. In the first phase, mapper based preprocessing method was designed to eliminate irrelevant features, missing values and outliers from the biomedical data. In the second phase, a Map-Reduce based multi-class ensemble decision tree model was designed and implemented in the preprocessed mapper data to improve the true positive rate and computational time. The experimental results on the complex biomedical datasets show that the performance of our proposed Hadoop based multi-class ensemble model significantly outperforms state-of-the-art baselines.

  4. StarView: The object oriented design of the ST DADS user interface

    NASA Technical Reports Server (NTRS)

    Williams, J. D.; Pollizzi, J. A.

    1992-01-01

    StarView is the user interface being developed for the Hubble Space Telescope Data Archive and Distribution Service (ST DADS). ST DADS is the data archive for HST observations and a relational database catalog describing the archived data. Users will use StarView to query the catalog and select appropriate datasets for study. StarView sends requests for archived datasets to ST DADS which processes the requests and returns the database to the user. StarView is designed to be a powerful and extensible user interface. Unique features include an internal relational database to navigate query results, a form definition language that will work with both CRT and X interfaces, a data definition language that will allow StarView to work with any relational database, and the ability to generate adhoc queries without requiring the user to understand the structure of the ST DADS catalog. Ultimately, StarView will allow the user to refine queries in the local database for improved performance and merge in data from external sources for correlation with other query results. The user will be able to create a query from single or multiple forms, merging the selected attributes into a single query. Arbitrary selection of attributes for querying is supported. The user will be able to select how query results are viewed. A standard form or table-row format may be used. Navigation capabilities are provided to aid the user in viewing query results. Object oriented analysis and design techniques were used in the design of StarView to support the mechanisms and concepts required to implement these features. One such mechanism is the Model-View-Controller (MVC) paradigm. The MVC allows the user to have multiple views of the underlying database, while providing a consistent mechanism for interaction regardless of the view. This approach supports both CRT and X interfaces while providing a common mode of user interaction. Another powerful abstraction is the concept of a Query Model. This concept allows a single query to be built form a single or multiple forms before it is submitted to ST DADS. Supporting this concept is the adhoc query generator which allows the user to select and qualify an indeterminate number attributes from the database. The user does not need any knowledge of how the joins across various tables are to be resolved. The adhoc generator calculates the joins automatically and generates the correct SQL query.

  5. Database resources of the National Center for Biotechnology Information.

    PubMed

    2016-01-04

    The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank(®) nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. Additional NCBI resources focus on literature (PubMed Central (PMC), Bookshelf and PubReader), health (ClinVar, dbGaP, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen), genomes (BioProject, Assembly, Genome, BioSample, dbSNP, dbVar, Epigenomics, the Map Viewer, Nucleotide, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser and the Trace Archive), genes (Gene, Gene Expression Omnibus (GEO), HomoloGene, PopSet and UniGene), proteins (Protein, the Conserved Domain Database (CDD), COBALT, Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB) and Protein Clusters) and chemicals (Biosystems and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for most of these databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized datasets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  6. Spatial cyberinfrastructures, ontologies, and the humanities.

    PubMed

    Sieber, Renee E; Wellen, Christopher C; Jin, Yuan

    2011-04-05

    We report on research into building a cyberinfrastructure for Chinese biographical and geographic data. Our cyberinfrastructure contains (i) the McGill-Harvard-Yenching Library Ming Qing Women's Writings database (MQWW), the only online database on historical Chinese women's writings, (ii) the China Biographical Database, the authority for Chinese historical people, and (iii) the China Historical Geographical Information System, one of the first historical geographic information systems. Key to this integration is that linked databases retain separate identities as bases of knowledge, while they possess sufficient semantic interoperability to allow for multidatabase concepts and to support cross-database queries on an ad hoc basis. Computational ontologies create underlying semantics for database access. This paper focuses on the spatial component in a humanities cyberinfrastructure, which includes issues of conflicting data, heterogeneous data models, disambiguation, and geographic scale. First, we describe the methodology for integrating the databases. Then we detail the system architecture, which includes a tier of ontologies and schema. We describe the user interface and applications that allow for cross-database queries. For instance, users should be able to analyze the data, examine hypotheses on spatial and temporal relationships, and generate historical maps with datasets from MQWW for research, teaching, and publication on Chinese women writers, their familial relations, publishing venues, and the literary and social communities. Last, we discuss the social side of cyberinfrastructure development, as people are considered to be as critical as the technical components for its success.

  7. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases.

    PubMed

    Voss, Erica A; Makadia, Rupa; Matcho, Amy; Ma, Qianli; Knoll, Chris; Schuemie, Martijn; DeFalco, Frank J; Londhe, Ajit; Zhu, Vivienne; Ryan, Patrick B

    2015-05-01

    To evaluate the utility of applying the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) across multiple observational databases within an organization and to apply standardized analytics tools for conducting observational research. Six deidentified patient-level datasets were transformed to the OMOP CDM. We evaluated the extent of information loss that occurred through the standardization process. We developed a standardized analytic tool to replicate the cohort construction process from a published epidemiology protocol and applied the analysis to all 6 databases to assess time-to-execution and comparability of results. Transformation to the CDM resulted in minimal information loss across all 6 databases. Patients and observations excluded were due to identified data quality issues in the source system, 96% to 99% of condition records and 90% to 99% of drug records were successfully mapped into the CDM using the standard vocabulary. The full cohort replication and descriptive baseline summary was executed for 2 cohorts in 6 databases in less than 1 hour. The standardization process improved data quality, increased efficiency, and facilitated cross-database comparisons to support a more systematic approach to observational research. Comparisons across data sources showed consistency in the impact of inclusion criteria, using the protocol and identified differences in patient characteristics and coding practices across databases. Standardizing data structure (through a CDM), content (through a standard vocabulary with source code mappings), and analytics can enable an institution to apply a network-based approach to observational research across multiple, disparate observational health databases. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  8. Challenges and Experiences of Building Multidisciplinary Datasets across Cultures

    NASA Astrophysics Data System (ADS)

    Jamiyansharav, K.; Laituri, M.; Fernandez-Gimenez, M.; Fassnacht, S. R.; Venable, N. B. H.; Allegretti, A. M.; Reid, R.; Baival, B.; Jamsranjav, C.; Ulambayar, T.; Linn, S.; Angerer, J.

    2017-12-01

    Efficient data sharing and management are key challenges to multidisciplinary scientific research. These challenges are further complicated by adding a multicultural component. We address the construction of a complex database for social-ecological analysis in Mongolia. Funded by the National Science Foundation (NSF) Dynamics of Coupled Natural and Human (CNH) Systems, the Mongolian Rangelands and Resilience (MOR2) project focuses on the vulnerability of Mongolian pastoral systems to climate change and adaptive capacity. The MOR2 study spans over three years of fieldwork in 36 paired districts (Soum) from 18 provinces (Aimag) of Mongolia that covers steppe, mountain forest steppe, desert steppe and eastern steppe ecological zones. Our project team is composed of hydrologists, social scientists, geographers, and ecologists. The MOR2 database includes multiple ecological, social, meteorological, geospatial and hydrological datasets, as well as archives of original data and survey in multiple formats. Managing this complex database requires significant organizational skills, attention to detail and ability to communicate within collective team members from diverse disciplines and across multiple institutions in the US and Mongolia. We describe the database's rich content, organization, structure and complexity. We discuss lessons learned, best practices and recommendations for complex database management, sharing, and archiving in creating a cross-cultural and multi-disciplinary database.

  9. Analysing inter-relationships among water, governance, human development variables in developing countries: WatSan4Dev database coherency analysis

    NASA Astrophysics Data System (ADS)

    Dondeynaz, C.; Carmona Moreno, C.; Céspedes Lorente, J. J.

    2012-01-01

    The "Integrated Water Resources Management" principle was formally laid down at the International Conference on Water and Sustainable development in Dublin 1992. One of the main results of this conference is that improving Water and Sanitation Services (WSS), being a complex and interdisciplinary issue, passes through collaboration and coordination of different sectors (environment, health, economic activities, governance, and international cooperation). These sectors influence or are influenced by the access to WSS. The understanding of these interrelations appears as crucial for decision makers in the water sector. In this framework, the Joint Research Centre (JRC) of the European Commission (EC) has developed a new database (WatSan4Dev database) containing 45 indicators (called variables in this paper) from environmental, socio-economic, governance and financial aid flows data in developing countries. This paper describes the development of the WatSan4Dev dataset, the statistical processes needed to improve the data quality; and, finally, the analysis to verify the database coherence is presented. At the light of the first analysis, WatSan4Dev Dataset shows the coherency among the different variables that are confirmed by the direct field experience and/or the scientific literature in the domain. Preliminary analysis of the relationships indicates that the informal urbanisation development is an important factor influencing negatively the percentage of the population having access to WSS. Health, and in particular children health, benefits from the improvement of WSS. Efficient environmental governance is also an important factor for providing improved water supply services. The database would be at the base of posterior analyses to better understand the interrelationships between the different indicators associated in the water sector in developing countries. A data model using the different indicators will be realised on the next phase of this research work.

  10. Generation of a precise DEM by interactive synthesis of multi-temporal elevation datasets: a case study of Schirmacher Oasis, East Antarctica

    NASA Astrophysics Data System (ADS)

    Jawak, Shridhar D.; Luis, Alvarinho J.

    2016-05-01

    Digital elevation model (DEM) is indispensable for analysis such as topographic feature extraction, ice sheet melting, slope stability analysis, landscape analysis and so on. Such analysis requires a highly accurate DEM. Available DEMs of Antarctic region compiled by using radar altimetry and the Antarctic digital database indicate elevation variations of up to hundreds of meters, which necessitates the generation of local improved DEM. An improved DEM of the Schirmacher Oasis, East Antarctica has been generated by synergistically fusing satellite-derived laser altimetry data from Geoscience Laser Altimetry System (GLAS), Radarsat Antarctic Mapping Project (RAMP) elevation data and Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) global elevation data (GDEM). This is a characteristic attempt to generate a DEM of any part of Antarctica by fusing multiple elevation datasets, which is essential to model the ice elevation change and address the ice mass balance. We analyzed a suite of interpolation techniques for constructing a DEM from GLAS, RAMP and ASTER DEM-based point elevation datasets, in order to determine the level of confidence with which the interpolation techniques can generate a better interpolated continuous surface, and eventually improve the elevation accuracy of DEM from synergistically fused RAMP, GLAS and ASTER point elevation datasets. The DEM presented in this work has a vertical accuracy (≈ 23 m) better than RAMP DEM (≈ 57 m) and ASTER DEM (≈ 64 m) individually. The RAMP DEM and ASTER DEM elevations were corrected using differential GPS elevations as ground reference data, and the accuracy obtained after fusing multitemporal datasets is found to be improved than that of existing DEMs constructed by using RAMP or ASTER alone. This is our second attempt of fusing multitemporal, multisensory and multisource elevation data to generate a DEM of Antarctica, in order to address the ice elevation change and address the ice mass balance. Our approach focuses on the strengths of each elevation data source to produce an accurate elevation model.

  11. A dataset of stereoscopic images and ground-truth disparity mimicking human fixations in peripersonal space

    PubMed Central

    Canessa, Andrea; Gibaldi, Agostino; Chessa, Manuela; Fato, Marco; Solari, Fabio; Sabatini, Silvio P.

    2017-01-01

    Binocular stereopsis is the ability of a visual system, belonging to a live being or a machine, to interpret the different visual information deriving from two eyes/cameras for depth perception. From this perspective, the ground-truth information about three-dimensional visual space, which is hardly available, is an ideal tool both for evaluating human performance and for benchmarking machine vision algorithms. In the present work, we implemented a rendering methodology in which the camera pose mimics realistic eye pose for a fixating observer, thus including convergent eye geometry and cyclotorsion. The virtual environment we developed relies on highly accurate 3D virtual models, and its full controllability allows us to obtain the stereoscopic pairs together with the ground-truth depth and camera pose information. We thus created a stereoscopic dataset: GENUA PESTO—GENoa hUman Active fixation database: PEripersonal space STereoscopic images and grOund truth disparity. The dataset aims to provide a unified framework useful for a number of problems relevant to human and computer vision, from scene exploration and eye movement studies to 3D scene reconstruction. PMID:28350382

  12. Accessibility, searchability, transparency and engagement of soil carbon data: The International Soil Carbon Network

    NASA Astrophysics Data System (ADS)

    Harden, Jennifer W.; Hugelius, Gustaf; Koven, Charlie; Sulman, Ben; O'Donnell, Jon; He, Yujie

    2016-04-01

    Soils are capacitors for carbon and water entering and exiting through land-atmosphere exchange. Capturing the spatiotemporal variations in soil C exchange through monitoring and modeling is difficult in part because data are reported unevenly across spatial, temporal, and management scales and in part because the unit of measure generally involves destructive harvest or non-recurrent measurements. In order to improve our fundamental basis for understanding soil C exchange, a multi-user, open source, searchable database and network of scientists has been formed. The International Soil Carbon Network (ISCN) is a self-chartered, member-based and member-owned network of scientists dedicated to soil carbon science. Attributes of the ISCN include 1) Targeted ISCN Action Groups which represent teams of motivated researchers that propose and pursue specific soil C research questions with the aim of synthesizing seminal articles regarding soil C fate. 2) Datasets to date contributed by institutions and individuals to a comprehensive, searchable open-access database that currently includes over 70,000 geolocated profiles for which soil C and other soil properties. 3) Derivative products resulting from the database, including depth attenuation attributes for C concentration and storage; C storage maps; and model-based assessments of emission/sequestration for future climate scenarios. Several examples illustrate the power of such a database and its engagement with the science community. First, a simplified, data-constrained global ecosystem model estimated a global sensitivity of permafrost soil carbon to climate change (g sensitivity) of -14 to -19 Pg C °C-1 of warming on a 100 years time scale. Second, using mathematical characterizations of depth profiles for organic carbon storage, C at the soil surface reflects Net Primary Production (NPP) and its allotment as moss or litter, while e-folding depths are correlated to rooting depth. Third, storage of deep C is highly correlated with bulk density and porosity of the rock/sediment matrix. Thus C storage is most stable at depth, yet is susceptible to changes in tillage, rooting depths, and erosion/sedimentation. Fourth, current ESMs likely overestimate the turnover time of soil organic carbon and subsequently overestimate soil carbon sequestration, thus datasets combined with other soil properties will help constrain the ESM predictions. Last, analysis of soil horizon and carbon data showed that soils with a history of tillage had significantly lower carbon concentrations in both near-surface and deep layers, and that the effect persisted even in reforested areas. In addition to the opportunities for empirical science using a large database, the database has great promise for evaluation of biogeochemical and earth system models. The preservation of individual soil core measurements avoids issues with spatial averaging while facilitating evaluation of advanced model processes such as depth distributions of soil carbon, land use impacts, and spatial heterogeneity.

  13. Web-based Visualization and Query of semantically segmented multiresolution 3D Models in the Field of Cultural Heritage

    NASA Astrophysics Data System (ADS)

    Auer, M.; Agugiaro, G.; Billen, N.; Loos, L.; Zipf, A.

    2014-05-01

    Many important Cultural Heritage sites have been studied over long periods of time by different means of technical equipment, methods and intentions by different researchers. This has led to huge amounts of heterogeneous "traditional" datasets and formats. The rising popularity of 3D models in the field of Cultural Heritage in recent years has brought additional data formats and makes it even more necessary to find solutions to manage, publish and study these data in an integrated way. The MayaArch3D project aims to realize such an integrative approach by establishing a web-based research platform bringing spatial and non-spatial databases together and providing visualization and analysis tools. Especially the 3D components of the platform use hierarchical segmentation concepts to structure the data and to perform queries on semantic entities. This paper presents a database schema to organize not only segmented models but also different Levels-of-Details and other representations of the same entity. It is further implemented in a spatial database which allows the storing of georeferenced 3D data. This enables organization and queries by semantic, geometric and spatial properties. As service for the delivery of the segmented models a standardization candidate of the OpenGeospatialConsortium (OGC), the Web3DService (W3DS) has been extended to cope with the new database schema and deliver a web friendly format for WebGL rendering. Finally a generic user interface is presented which uses the segments as navigation metaphor to browse and query the semantic segmentation levels and retrieve information from an external database of the German Archaeological Institute (DAI).

  14. The State Geologic Map Compilation (SGMC) geodatabase of the conterminous United States

    USGS Publications Warehouse

    Horton, John D.; San Juan, Carma A.; Stoeser, Douglas B.

    2017-06-30

    The State Geologic Map Compilation (SGMC) geodatabase of the conterminous United States (https://doi. org/10.5066/F7WH2N65) represents a seamless, spatial database of 48 State geologic maps that range from 1:50,000 to 1:1,000,000 scale. A national digital geologic map database is essential in interpreting other datasets that support numerous types of national-scale studies and assessments, such as those that provide geochemistry, remote sensing, or geophysical data. The SGMC is a compilation of the individual U.S. Geological Survey releases of the Preliminary Integrated Geologic Map Databases for the United States. The SGMC geodatabase also contains updated data for seven States and seven entirely new State geologic maps that have been added since the preliminary databases were published. Numerous errors have been corrected and enhancements added to the preliminary datasets using thorough quality assurance/quality control procedures. The SGMC is not a truly integrated geologic map database because geologic units have not been reconciled across State boundaries. However, the geologic data contained in each State geologic map have been standardized to allow spatial analyses of lithology, age, and stratigraphy at a national scale.

  15. MetReS, an Efficient Database for Genomic Applications.

    PubMed

    Vilaplana, Jordi; Alves, Rui; Solsona, Francesc; Mateo, Jordi; Teixidó, Ivan; Pifarré, Marc

    2018-02-01

    MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database performance could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis. The study was started with a relational database, MySQL, which is the current database server used by the applications. We also tested the performance of an alternative data-handling framework, Apache Hadoop. Hadoop is currently used for large-scale data processing. We found that this data handling framework is likely to greatly improve the efficiency of the MetReS applications as the dataset and the processing needs increase by several orders of magnitude, as expected to happen in the near future.

  16. Data augmentation-assisted deep learning of hand-drawn partially colored sketches for visual search

    PubMed Central

    Muhammad, Khan; Baik, Sung Wook

    2017-01-01

    In recent years, image databases are growing at exponential rates, making their management, indexing, and retrieval, very challenging. Typical image retrieval systems rely on sample images as queries. However, in the absence of sample query images, hand-drawn sketches are also used. The recent adoption of touch screen input devices makes it very convenient to quickly draw shaded sketches of objects to be used for querying image databases. This paper presents a mechanism to provide access to visual information based on users’ hand-drawn partially colored sketches using touch screen devices. A key challenge for sketch-based image retrieval systems is to cope with the inherent ambiguity in sketches due to the lack of colors, textures, shading, and drawing imperfections. To cope with these issues, we propose to fine-tune a deep convolutional neural network (CNN) using augmented dataset to extract features from partially colored hand-drawn sketches for query specification in a sketch-based image retrieval framework. The large augmented dataset contains natural images, edge maps, hand-drawn sketches, de-colorized, and de-texturized images which allow CNN to effectively model visual contents presented to it in a variety of forms. The deep features extracted from CNN allow retrieval of images using both sketches and full color images as queries. We also evaluated the role of partial coloring or shading in sketches to improve the retrieval performance. The proposed method is tested on two large datasets for sketch recognition and sketch-based image retrieval and achieved better classification and retrieval performance than many existing methods. PMID:28859140

  17. Compiling an Open Database of Dam Inundation Areas on the Irrawaddy, Salween, Mekong, and Red River Basins

    NASA Astrophysics Data System (ADS)

    Cutter, P. G.; Walcutt, A.; O'Neil-Dunne, J.; Geheb, K.; Troy, A.; Saah, D. S.; Ganz, D.

    2016-12-01

    Dam construction in mainland Southeast Asia has increased substantially in recent years with extensive regional impacts including alterations to water regimes, the loss and degradation of natural forests and biodiversity, and reductions in soil and water quality. The CGIAR Water Land Ecosystem program (WLE) and partners maintain a comprehensive database of locations and other data relating to existing, planned, and proposed dams in the region's major transboundary rivers spanning areas in Thailand, Cambodia, Laos, Vietnam, Myanmar, and China. A recent regional needs assessment and specific stakeholder requests revealed the need for a dataset reflecting the inundation areas of these dams for use in measuring impacts to river ecology, analyzing disaster risk, monitoring land cover and land use change, evaluating carbon emissions, and assessing the actual and potential impacts to communities. In conjunction with WLE and other partners, SERVIR-Mekong, a regional hub of the USAID and NASA-supported SERVIR program, formulated an explicit procedure to produce this dataset. The procedure includes leveraging data from OpenStreetMap and other sources, creating polygons based on surface water classification procedures achieved via Google Earth Engine, manual digitizing, and modeling of planned/proposed dams based on a DEM and the location and planned height of dams. A quality assurance step ensures that all polygons conform to spatial data quality standards agreed upon by a wide range of production partners. When complete, the dataset will be made publicly available to encourage greater understanding and more informed decisions related to the actual and potential impacts of dams in the region.

  18. Relax with CouchDB--into the non-relational DBMS era of bioinformatics.

    PubMed

    Manyam, Ganiraju; Payton, Michelle A; Roth, Jack A; Abruzzo, Lynne V; Coombes, Kevin R

    2012-07-01

    With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services. Copyright © 2012 Elsevier Inc. All rights reserved.

  19. Soil Erosion map of Europe based on high resolution input datasets

    NASA Astrophysics Data System (ADS)

    Panagos, Panos; Borrelli, Pasquale; Meusburger, Katrin; Ballabio, Cristiano; Alewell, Christine

    2015-04-01

    Modelling soil erosion in European Union is of major importance for agro-environmental policies. Soil erosion estimates are important inputs for the Common Agricultural Policy (CAP) and the implementation of the Soil Thematic Strategy. Using the findings of a recent pan-European data collection through the EIONET network, it was concluded that most Member States are applying the empirical Revised Universal Soil Loss Equation (RUSLE) for the modelling soil erosion at National level. This model was chosen for the pan-European soil erosion risk assessment and it is based on 6 input factors. Compared to past approaches, each of the factors is modelled using the latest pan-European datasets, expertise and data from Member states and high resolution remote sensing data. The soil erodibility (K-factor) is modelled using the recently published LUCAS topsoil database with 20,000 point measurements and incorporating the surface stone cover which can reduce K-factor by 15%. The rainfall erosivity dataset (R-factor) has been implemented using high temporal resolution rainfall data from more than 1,500 precipitation stations well distributed in Europe. The cover-management (C-factor) incorporates crop statistics and management practices such as cover crops, tillage practices and plant residuals. The slope length and steepness (combined LS-factor) is based on the first ever 25m Digital Elevation Model (DEM) of Europe. Finally, the support practices (P-factor) is modelled for first time at this scale taking into account the 270,000 LUCAS earth observations and the Good Agricultural and Environmental Condition (GAEC) that farmers have to follow in Europe. The high resolution input layers produce the final soil erosion risk map at 100m resolution and allow policy makers to run future land use, management and climate change scenarios.

  20. Photorealistic virtual anatomy based on Chinese Visible Human data.

    PubMed

    Heng, P A; Zhang, S X; Xie, Y M; Wong, T T; Chui, Y P; Cheng, C Y

    2006-04-01

    Virtual reality based learning of human anatomy is feasible when a database of 3D organ models is available for the learner to explore, visualize, and dissect in virtual space interactively. In this article, we present our latest work on photorealistic virtual anatomy applications based on the Chinese Visible Human (CVH) data. We have focused on the development of state-of-the-art virtual environments that feature interactive photo-realistic visualization and dissection of virtual anatomical models constructed from ultra-high resolution CVH datasets. We also outline our latest progress in applying these highly accurate virtual and functional organ models to generate realistic look and feel to advanced surgical simulators. (c) 2006 Wiley-Liss, Inc.

  1. Static facial expression recognition with convolution neural networks

    NASA Astrophysics Data System (ADS)

    Zhang, Feng; Chen, Zhong; Ouyang, Chao; Zhang, Yifei

    2018-03-01

    Facial expression recognition is a currently active research topic in the fields of computer vision, pattern recognition and artificial intelligence. In this paper, we have developed a convolutional neural networks (CNN) for classifying human emotions from static facial expression into one of the seven facial emotion categories. We pre-train our CNN model on the combined FER2013 dataset formed by train, validation and test set and fine-tune on the extended Cohn-Kanade database. In order to reduce the overfitting of the models, we utilized different techniques including dropout and batch normalization in addition to data augmentation. According to the experimental result, our CNN model has excellent classification performance and robustness for facial expression recognition.

  2. A current landscape of provincial perinatal data collection in Canada.

    PubMed

    Massey, Kiran A; Magee, Laura A; Dale, Sheryll; Claydon, Jennifer; Morris, Tara J; von Dadelszen, Peter; Liston, Robert M; Ansermino, J Mark

    2009-03-01

    The Canadian Perinatal Network (CPN) was launched in 2005 as a national perinatal database project designed to identify best practices in maternity care. The inaugural project of CPN is focused on interventions that optimize maternal and perinatal outcomes in women with threatened preterm birth at 22+0 to 28+6 weeks' gestation. To examine existing data collection by perinatal health programs (PHPs) to inform decisions about shared data collection and CPN database construction. We reviewed the database manuals and websites of all Canadian PHPs and compiled a list of data fields and their definitions. We compared these fields and definitions with those of CPN and the Canadian Minimal Dataset, proposed as a common dataset by the Canadian Perinatal Programs Coalition of Canadian PHPs. PHPs collect information on 2/3 of deliveries in Canada. PHPs consistently collect information on maternal demographics (including both maternal and neonatal personal identifiers), past obstetrical history, maternal lifestyle, aspects of labour and delivery, and basic neonatal outcomes. However, most PHPs collect insufficient data to enable identification of obstetric (and neonatal) practices associated with improved maternal and perinatal outcomes. In addition, there is between-PHP variability in defining many data fields. Construction of a separate CPN database was needed although harmonization of data field definitions with those of the proposed Canadian Minimal Dataset was done to plan for future shared data collection. This convergence should be the goal of researchers and clinicians alike as we construct a common language for electronic health records.

  3. Gridded global surface ozone metrics for atmospheric chemistry model evaluation

    NASA Astrophysics Data System (ADS)

    Sofen, E. D.; Bowdalo, D.; Evans, M. J.; Apadula, F.; Bonasoni, P.; Cupeiro, M.; Ellul, R.; Galbally, I. E.; Girgzdiene, R.; Luppo, S.; Mimouni, M.; Nahas, A. C.; Saliba, M.; Tørseth, K.; Wmo Gaw, Epa Aqs, Epa Castnet, Capmon, Naps, Airbase, Emep, Eanet Ozone Datasets, All Other Contributors To

    2015-07-01

    The concentration of ozone at the Earth's surface is measured at many locations across the globe for the purposes of air quality monitoring and atmospheric chemistry research. We have brought together all publicly available surface ozone observations from online databases from the modern era to build a consistent dataset for the evaluation of chemical transport and chemistry-climate (Earth System) models for projects such as the Chemistry-Climate Model Initiative and Aer-Chem-MIP. From a total dataset of approximately 6600 sites and 500 million hourly observations from 1971-2015, approximately 2200 sites and 200 million hourly observations pass screening as high-quality sites in regional background locations that are appropriate for use in global model evaluation. There is generally good data volume since the start of air quality monitoring networks in 1990 through 2013. Ozone observations are biased heavily toward North America and Europe with sparse coverage over the rest of the globe. This dataset is made available for the purposes of model evaluation as a set of gridded metrics intended to describe the distribution of ozone concentrations on monthly and annual timescales. Metrics include the moments of the distribution, percentiles, maximum daily eight-hour average (MDA8), SOMO35, AOT40, and metrics related to air quality regulatory thresholds. Gridded datasets are stored as netCDF-4 files and are available to download from the British Atmospheric Data Centre (doi:10.5285/08fbe63d-fa6d-4a7a-b952-5932e3ab0452). We provide recommendations to the ozone measurement community regarding improving metadata reporting to simplify ongoing and future efforts in working with ozone data from disparate networks in a consistent manner.

  4. An ensemble model of QSAR tools for regulatory risk assessment.

    PubMed

    Pradeep, Prachi; Povinelli, Richard J; White, Shannon; Merrill, Stephen J

    2016-01-01

    Quantitative structure activity relationships (QSARs) are theoretical models that relate a quantitative measure of chemical structure to a physical property or a biological effect. QSAR predictions can be used for chemical risk assessment for protection of human and environmental health, which makes them interesting to regulators, especially in the absence of experimental data. For compatibility with regulatory use, QSAR models should be transparent, reproducible and optimized to minimize the number of false negatives. In silico QSAR tools are gaining wide acceptance as a faster alternative to otherwise time-consuming clinical and animal testing methods. However, different QSAR tools often make conflicting predictions for a given chemical and may also vary in their predictive performance across different chemical datasets. In a regulatory context, conflicting predictions raise interpretation, validation and adequacy concerns. To address these concerns, ensemble learning techniques in the machine learning paradigm can be used to integrate predictions from multiple tools. By leveraging various underlying QSAR algorithms and training datasets, the resulting consensus prediction should yield better overall predictive ability. We present a novel ensemble QSAR model using Bayesian classification. The model allows for varying a cut-off parameter that allows for a selection in the desirable trade-off between model sensitivity and specificity. The predictive performance of the ensemble model is compared with four in silico tools (Toxtree, Lazar, OECD Toolbox, and Danish QSAR) to predict carcinogenicity for a dataset of air toxins (332 chemicals) and a subset of the gold carcinogenic potency database (480 chemicals). Leave-one-out cross validation results show that the ensemble model achieves the best trade-off between sensitivity and specificity (accuracy: 83.8 % and 80.4 %, and balanced accuracy: 80.6 % and 80.8 %) and highest inter-rater agreement [kappa ( κ ): 0.63 and 0.62] for both the datasets. The ROC curves demonstrate the utility of the cut-off feature in the predictive ability of the ensemble model. This feature provides an additional control to the regulators in grading a chemical based on the severity of the toxic endpoint under study.

  5. An ensemble model of QSAR tools for regulatory risk assessment

    DOE PAGES

    Pradeep, Prachi; Povinelli, Richard J.; White, Shannon; ...

    2016-09-22

    Quantitative structure activity relationships (QSARs) are theoretical models that relate a quantitative measure of chemical structure to a physical property or a biological effect. QSAR predictions can be used for chemical risk assessment for protection of human and environmental health, which makes them interesting to regulators, especially in the absence of experimental data. For compatibility with regulatory use, QSAR models should be transparent, reproducible and optimized to minimize the number of false negatives. In silico QSAR tools are gaining wide acceptance as a faster alternative to otherwise time-consuming clinical and animal testing methods. However, different QSAR tools often make conflictingmore » predictions for a given chemical and may also vary in their predictive performance across different chemical datasets. In a regulatory context, conflicting predictions raise interpretation, validation and adequacy concerns. To address these concerns, ensemble learning techniques in the machine learning paradigm can be used to integrate predictions from multiple tools. By leveraging various underlying QSAR algorithms and training datasets, the resulting consensus prediction should yield better overall predictive ability. We present a novel ensemble QSAR model using Bayesian classification. The model allows for varying a cut-off parameter that allows for a selection in the desirable trade-off between model sensitivity and specificity. The predictive performance of the ensemble model is compared with four in silico tools (Toxtree, Lazar, OECD Toolbox, and Danish QSAR) to predict carcinogenicity for a dataset of air toxins (332 chemicals) and a subset of the gold carcinogenic potency database (480 chemicals). Leave-one-out cross validation results show that the ensemble model achieves the best trade-off between sensitivity and specificity (accuracy: 83.8 % and 80.4 %, and balanced accuracy: 80.6 % and 80.8 %) and highest inter-rater agreement [kappa (κ): 0.63 and 0.62] for both the datasets. The ROC curves demonstrate the utility of the cut-off feature in the predictive ability of the ensemble model. In conclusion, this feature provides an additional control to the regulators in grading a chemical based on the severity of the toxic endpoint under study.« less

  6. Probabilistic Elastic Part Model: A Pose-Invariant Representation for Real-World Face Verification.

    PubMed

    Li, Haoxiang; Hua, Gang

    2018-04-01

    Pose variation remains to be a major challenge for real-world face recognition. We approach this problem through a probabilistic elastic part model. We extract local descriptors (e.g., LBP or SIFT) from densely sampled multi-scale image patches. By augmenting each descriptor with its location, a Gaussian mixture model (GMM) is trained to capture the spatial-appearance distribution of the face parts of all face images in the training corpus, namely the probabilistic elastic part (PEP) model. Each mixture component of the GMM is confined to be a spherical Gaussian to balance the influence of the appearance and the location terms, which naturally defines a part. Given one or multiple face images of the same subject, the PEP-model builds its PEP representation by sequentially concatenating descriptors identified by each Gaussian component in a maximum likelihood sense. We further propose a joint Bayesian adaptation algorithm to adapt the universally trained GMM to better model the pose variations between the target pair of faces/face tracks, which consistently improves face verification accuracy. Our experiments show that we achieve state-of-the-art face verification accuracy with the proposed representations on the Labeled Face in the Wild (LFW) dataset, the YouTube video face database, and the CMU MultiPIE dataset.

  7. JADDS - towards a tailored global atmospheric composition data service for CAMS forecasts and reanalysis

    NASA Astrophysics Data System (ADS)

    Stein, Olaf; Schultz, Martin G.; Rambadt, Michael; Saini, Rajveer; Hoffmann, Lars; Mallmann, Daniel

    2017-04-01

    Global model data of atmospheric composition produced by the Copernicus Atmospheric Monitoring Service (CAMS) is collected since 2010 at FZ Jülich and serves as boundary condition for use by Regional Air Quality (RAQ) modellers world-wide. RAQ models need time-resolved meteorological as well as chemical lateral boundary conditions for their individual model domains. While the meteorological data usually come from well-established global forecast systems, the chemical boundary conditions are not always well defined. In the past, many models used 'climatic' boundary conditions for the tracer concentrations, which can lead to significant concentration biases, particularly for tracers with longer lifetimes which can be transported over long distances (e.g. over the whole northern hemisphere) with the mean wind. The Copernicus approach utilizes extensive near-realtime data assimilation of atmospheric composition data observed from space which gives additional reliability to the global modelling data and is well received by the RAQ communities. An existing Web Coverage Service (WCS) for sharing these individually tailored model results is currently being re-engineered to make use of a modern, scalable database technology in order to improve performance, enhance flexibility, and allow the operation of catalogue services. The new Jülich Atmospheric Data Distributions Server (JADDS) adheres to the Web Coverage Service WCS2.0 standard as defined by the Open Geospatial Consortium OGC. This enables the user groups to flexibly define datasets they need by selecting a subset of chemical species or restricting geographical boundaries or the length of the time series. The data is made available in the form of different catalogues stored locally on our server. In addition, the Jülich OWS Interface (JOIN) provides interoperable web services allowing for easy download and visualization of datasets delivered from WCS servers via the internet. We will present the prototype JADDS server and address the major issues identified when relocating large four-dimensional datasets into a RASDAMAN raster array database. So far the RASDAMAN support for data available in netCDF format is limited with respect to metadata related to variables and axes. For community-wide accepted solutions, selected data coverages shall result in downloadable netCDF files including metadata complying with the netCDF CF Metadata Conventions standard (http://cfconventions.org/). This can be achieved by adding custom metadata elements for RASDAMAN bands (model levels) on data ingestion. Furthermore, an optimization strategy for ingestion of several TB of 4D model output data will be outlined.

  8. Decoys Selection in Benchmarking Datasets: Overview and Perspectives

    PubMed Central

    Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu

    2018-01-01

    Virtual Screening (VS) is designed to prospectively help identifying potential hits, i.e., compounds capable of interacting with a given target and potentially modulate its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is the most adapted to the query/target system under study and that yields the most reliable output. To this aim, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compounds subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds that has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoys selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509

  9. The Alaska Arctic Vegetation Archive (AVA-AK)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Walker, Donald; Breen, Amy; Druckenmiller, Lisa

    The Alaska Arctic Vegetation Archive (AVA-AK, GIVD-ID: NA-US-014) is a free, publically available database archive of vegetation-plot data from the Arctic tundra region of northern Alaska. The archive currently contains 24 datasets with 3,026 non-overlapping plots. Of these, 74% have geolocation data with 25-m or better precision. Species cover data and header data are stored in a Turboveg database. A standardized Pan Arctic Species List provides a consistent nomenclature for vascular plants, bryophytes, and lichens in the archive. A web-based online Alaska Arctic Geoecological Atlas (AGA-AK) allows viewing and downloading the species data in a variety of formats, and providesmore » access to a wide variety of ancillary data. We conducted a preliminary cluster analysis of the first 16 datasets (1,613 plots) to examine how the spectrum of derived clusters is related to the suite of datasets, habitat types, and environmental gradients. Here, we present the contents of the archive, assess its strengths and weaknesses, and provide three supplementary files that include the data dictionary, a list of habitat types, an overview of the datasets, and details of the cluster analysis.« less

  10. ExpTreeDB: web-based query and visualization of manually annotated gene expression profiling experiments of human and mouse from GEO.

    PubMed

    Ni, Ming; Ye, Fuqiang; Zhu, Juanjuan; Li, Zongwei; Yang, Shuai; Yang, Bite; Han, Lu; Wu, Yongge; Chen, Ying; Li, Fei; Wang, Shengqi; Bo, Xiaochen

    2014-12-01

    Numerous public microarray datasets are valuable resources for the scientific communities. Several online tools have made great steps to use these data by querying related datasets with users' own gene signatures or expression profiles. However, dataset annotation and result exhibition still need to be improved. ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as 'agent→drug→estrogen receptor antagonist→tamoxifen'. Consequently, retrieved results are exhibited by an interactive tree-structured graphics, which provide an overview for related experiments and might enlighten users on key items of interest. The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. Web site is implemented in Perl, PHP, R, MySQL and Apache. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  11. The Alaska Arctic Vegetation Archive (AVA-AK)

    DOE PAGES

    Walker, Donald; Breen, Amy; Druckenmiller, Lisa; ...

    2016-05-17

    The Alaska Arctic Vegetation Archive (AVA-AK, GIVD-ID: NA-US-014) is a free, publically available database archive of vegetation-plot data from the Arctic tundra region of northern Alaska. The archive currently contains 24 datasets with 3,026 non-overlapping plots. Of these, 74% have geolocation data with 25-m or better precision. Species cover data and header data are stored in a Turboveg database. A standardized Pan Arctic Species List provides a consistent nomenclature for vascular plants, bryophytes, and lichens in the archive. A web-based online Alaska Arctic Geoecological Atlas (AGA-AK) allows viewing and downloading the species data in a variety of formats, and providesmore » access to a wide variety of ancillary data. We conducted a preliminary cluster analysis of the first 16 datasets (1,613 plots) to examine how the spectrum of derived clusters is related to the suite of datasets, habitat types, and environmental gradients. Here, we present the contents of the archive, assess its strengths and weaknesses, and provide three supplementary files that include the data dictionary, a list of habitat types, an overview of the datasets, and details of the cluster analysis.« less

  12. Molecular determinants of blood-brain barrier permeation.

    PubMed

    Geldenhuys, Werner J; Mohammad, Afroz S; Adkins, Chris E; Lockman, Paul R

    2015-01-01

    The blood-brain barrier (BBB) is a microvascular unit which selectively regulates the permeability of drugs to the brain. With the rise in CNS drug targets and diseases, there is a need to be able to accurately predict a priori which compounds in a company database should be pursued for favorable properties. In this review, we will explore the different computational tools available today, as well as underpin these to the experimental methods used to determine BBB permeability. These include in vitro models and the in vivo models that yield the dataset we use to generate predictive models. Understanding of how these models were experimentally derived determines our accurate and predicted use for determining a balance between activity and BBB distribution.

  13. Molecular determinants of blood–brain barrier permeation

    PubMed Central

    Geldenhuys, Werner J; Mohammad, Afroz S; Adkins, Chris E; Lockman, Paul R

    2015-01-01

    The blood–brain barrier (BBB) is a microvascular unit which selectively regulates the permeability of drugs to the brain. With the rise in CNS drug targets and diseases, there is a need to be able to accurately predict a priori which compounds in a company database should be pursued for favorable properties. In this review, we will explore the different computational tools available today, as well as underpin these to the experimental methods used to determine BBB permeability. These include in vitro models and the in vivo models that yield the dataset we use to generate predictive models. Understanding of how these models were experimentally derived determines our accurate and predicted use for determining a balance between activity and BBB distribution. PMID:26305616

  14. Artificial Intelligence in Prediction of Secondary Protein Structure Using CB513 Database

    PubMed Central

    Avdagic, Zikrija; Purisevic, Elvir; Omanovic, Samir; Coralic, Zlatan

    2009-01-01

    In this paper we describe CB513 a non-redundant dataset, suitable for development of algorithms for prediction of secondary protein structure. A program was made in Borland Delphi for transforming data from our dataset to make it suitable for learning of neural network for prediction of secondary protein structure implemented in MATLAB Neural-Network Toolbox. Learning (training and testing) of neural network is researched with different sizes of windows, different number of neurons in the hidden layer and different number of training epochs, while using dataset CB513. PMID:21347158

  15. The Problem with the Delta Cost Project Database

    ERIC Educational Resources Information Center

    Jaquette, Ozan; Parra, Edna

    2016-01-01

    The Integrated Postsecondary Education System (IPEDS) collects data on Title IV institutions. The Delta Cost Project (DCP) integrated data from multiple IPEDS survey components into a public-use longitudinal dataset. The DCP Database was the basis for dozens of journal articles and a series of influential policy reports. Unfortunately, a flaw in…

  16. Introducing the Infant Bookreading Database (IBDb)

    ERIC Educational Resources Information Center

    Hudson Kam, Carla L.; Matthewson, Lisa

    2017-01-01

    Studies on the relationship between bookreading and language development typically lack data about which books are actually read to children. This paper reports on an Internet survey designed to address this data gap. The resulting dataset (the Infant Bookreading Database or IBDb) includes responses from 1,107 caregivers of children aged 0-36…

  17. FDT 2.0: Improving scalability of the fuzzy decision tree induction tool - integrating database storage.

    PubMed

    Durham, Erin-Elizabeth A; Yu, Xiaxia; Harrison, Robert W

    2014-12-01

    Effective machine-learning handles large datasets efficiently. One key feature of handling large data is the use of databases such as MySQL. The freeware fuzzy decision tree induction tool, FDT, is a scalable supervised-classification software tool implementing fuzzy decision trees. It is based on an optimized fuzzy ID3 (FID3) algorithm. FDT 2.0 improves upon FDT 1.0 by bridging the gap between data science and data engineering: it combines a robust decisioning tool with data retention for future decisions, so that the tool does not need to be recalibrated from scratch every time a new decision is required. In this paper we briefly review the analytical capabilities of the freeware FDT tool and its major features and functionalities; examples of large biological datasets from HIV, microRNAs and sRNAs are included. This work shows how to integrate fuzzy decision algorithms with modern database technology. In addition, we show that integrating the fuzzy decision tree induction tool with database storage allows for optimal user satisfaction in today's Data Analytics world.

  18. NoSQL technologies for the CMS Conditions Database

    NASA Astrophysics Data System (ADS)

    Sipos, Roland

    2015-12-01

    With the restart of the LHC in 2015, the growth of the CMS Conditions dataset will continue, therefore the need of consistent and highly available access to the Conditions makes a great cause to revisit different aspects of the current data storage solutions. We present a study of alternative data storage backends for the Conditions Databases, by evaluating some of the most popular NoSQL databases to support a key-value representation of the CMS Conditions. The definition of the database infrastructure is based on the need of storing the conditions as BLOBs. Because of this, each condition can reach the size that may require special treatment (splitting) in these NoSQL databases. As big binary objects may be problematic in several database systems, and also to give an accurate baseline, a testing framework extension was implemented to measure the characteristics of the handling of arbitrary binary data in these databases. Based on the evaluation, prototypes of a document store, using a column-oriented and plain key-value store, are deployed. An adaption layer to access the backends in the CMS Offline software was developed to provide transparent support for these NoSQL databases in the CMS context. Additional data modelling approaches and considerations in the software layer, deployment and automatization of the databases are also covered in the research. In this paper we present the results of the evaluation as well as a performance comparison of the prototypes studied.

  19. [Technical improvement of cohort constitution in administrative health databases: Providing a tool for integration and standardization of data applicable in the French National Health Insurance Database (SNIIRAM)].

    PubMed

    Ferdynus, C; Huiart, L

    2016-09-01

    Administrative health databases such as the French National Heath Insurance Database - SNIIRAM - are a major tool to answer numerous public health research questions. However the use of such data requires complex and time-consuming data management. Our objective was to develop and make available a tool to optimize cohort constitution within administrative health databases. We developed a process to extract, transform and load (ETL) data from various heterogeneous sources in a standardized data warehouse. This data warehouse is architected as a star schema corresponding to an i2b2 star schema model. We then evaluated the performance of this ETL using data from a pharmacoepidemiology research project conducted in the SNIIRAM database. The ETL we developed comprises a set of functionalities for creating SAS scripts. Data can be integrated into a standardized data warehouse. As part of the performance assessment of this ETL, we achieved integration of a dataset from the SNIIRAM comprising more than 900 million lines in less than three hours using a desktop computer. This enables patient selection from the standardized data warehouse within seconds of the request. The ETL described in this paper provides a tool which is effective and compatible with all administrative health databases, without requiring complex database servers. This tool should simplify cohort constitution in health databases; the standardization of warehouse data facilitates collaborative work between research teams. Copyright © 2016 Elsevier Masson SAS. All rights reserved.

  20. Technical note: Space-time analysis of rainfall extremes in Italy: clues from a reconciled dataset

    NASA Astrophysics Data System (ADS)

    Libertino, Andrea; Ganora, Daniele; Claps, Pierluigi

    2018-05-01

    Like other Mediterranean areas, Italy is prone to the development of events with significant rainfall intensity, lasting for several hours. The main triggering mechanisms of these events are quite well known, but the aim of developing rainstorm hazard maps compatible with their actual probability of occurrence is still far from being reached. A systematic frequency analysis of these occasional highly intense events would require a complete countrywide dataset of sub-daily rainfall records, but this kind of information was still lacking for the Italian territory. In this work several sources of data are gathered, for assembling the first comprehensive and updated dataset of extreme rainfall of short duration in Italy. The resulting dataset, referred to as the Italian Rainfall Extreme Dataset (I-RED), includes the annual maximum rainfalls recorded in 1 to 24 consecutive hours from more than 4500 stations across the country, spanning the period between 1916 and 2014. A detailed description of the spatial and temporal coverage of the I-RED is presented, together with an exploratory statistical analysis aimed at providing preliminary information on the climatology of extreme rainfall at the national scale. Due to some legal restrictions, the database can be provided only under certain conditions. Taking into account the potentialities emerging from the analysis, a description of the ongoing and planned future work activities on the database is provided.

Top