OpenFDA: an innovative platform providing access to a wealth of FDA's publicly available data.
Kass-Hout, Taha A; Xu, Zhiheng; Mohebbi, Matthew; Nelsen, Hans; Baker, Adam; Levine, Jonathan; Johanson, Elaine; Bright, Roselie A
2016-05-01
The objective of openFDA is to facilitate access to and use of large, important Food and Drug Administration (FDA) public datasets by developers, researchers, and the public through harmonization of data across disparate FDA datasets provided via application programming interfaces (APIs). Using cutting-edge technologies deployed on FDA's new public cloud computing infrastructure, openFDA provides open data for easier, faster (over 300 requests per second per process), and better access to FDA datasets; open source code and documentation shared on GitHub for open community contributions of examples, apps, and ideas; and infrastructure that can be adopted for other public health big data challenges. Since its launch on June 2, 2014, openFDA has developed four APIs for drug and device adverse events, recall information for all FDA-regulated products, and drug labeling. There have been more than 20 million API calls (more than half from outside the United States), 6000 registered users, 20,000 connected Internet Protocol addresses, and dozens of new software (mobile or web) apps developed. A case study demonstrates a use of openFDA data to understand an apparent association of a drug with an adverse event. With easier and faster access to these datasets, consumers worldwide can learn more about FDA-regulated products. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved.
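The openFDA endpoints described above are plain HTTPS APIs. As a minimal sketch (the endpoint path and field names follow openFDA's published drug adverse event schema; the specific query is illustrative only), a request URL can be assembled like this:

```python
from urllib.parse import urlencode

# openFDA drug adverse event endpoint (per openFDA documentation)
BASE = "https://api.fda.gov/drug/event.json"

def build_query(search, count=None, limit=1):
    """Build an openFDA request URL from search/count/limit parameters."""
    params = {"search": search}
    if count is not None:
        params["count"] = count  # ask the API to aggregate instead of returning raw reports
    else:
        params["limit"] = limit  # number of raw reports to return
    return BASE + "?" + urlencode(params)

# Illustrative query: most frequently reported reactions for one drug product
url = build_query(
    'patient.drug.medicinalproduct:"ASPIRIN"',
    count="patient.reaction.reactionmeddrapt.exact",
)
print(url)
```

Fetching the resulting URL with any HTTP client returns JSON; per openFDA's published rate limits, an API key is optional for low request volumes.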
NASA Astrophysics Data System (ADS)
Weber, J.; Domenico, B.
2004-12-01
This paper is an example of what we call data interactive publications. With a properly configured workstation, readers can click on "hotspots" in the document that launch an interactive analysis tool called the Unidata Integrated Data Viewer (IDV). The IDV enables readers to access, analyze and display datasets on remote servers as well as documents describing them. Beyond the parameters and datasets initially configured into the paper, the analysis tool will have access to all the other dataset parameters as well as to a host of other datasets on remote servers. These data interactive publications are built on top of several data delivery, access, discovery, and visualization tools developed by Unidata and its partner organizations. For purposes of illustrating this integrative technology, we use data from Hurricane Charley over Florida, August 13-15, 2004. This event illustrates how the components of this process fit together. The Local Data Manager (LDM), Open-source Project for a Network Data Access Protocol (OPeNDAP) and Abstract Data Distribution Environment (ADDE) services, Thematic Realtime Environmental Distributed Data Service (THREDDS) cataloging services, and the IDV are highlighted in this example of a publication with embedded pointers for accessing and interacting with remote datasets. An important objective of this paper is to illustrate how these integrated technologies foster the creation of documents that allow the reader to learn scientific concepts by direct interaction with illustrative datasets, and help build a framework for integrated Earth System science.
McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr
2016-01-01
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of filesystem structure within Dataverse—which is essential for both in-place computation and supporting non-http data transfers. PMID:27862010
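The abstract notes that SBDG metadata are compliant with the DataCite Schema. As a rough sketch, a dataset record covering DataCite's documented mandatory properties can be represented as a small dictionary (the field names follow the DataCite schema; all values here are invented placeholders, not real SBDG records):

```python
# Minimal DataCite-style metadata record; values are invented placeholders
record = {
    "identifier": {"identifier": "10.1234/EXAMPLE/1", "identifierType": "DOI"},
    "creators": [{"creatorName": "Doe, Jane"}],
    "titles": [{"title": "X-ray diffraction images for an example structure"}],
    "publisher": "Structural Biology Data Grid",
    "publicationYear": "2016",
    "resourceType": {"resourceTypeGeneral": "Dataset",
                     "resourceType": "Diffraction images"},
}

# DataCite's mandatory properties should all be present
mandatory = {"identifier", "creators", "titles",
             "publisher", "publicationYear", "resourceType"}
missing = mandatory - set(record)
print(missing)
```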
The link provided access to all the datasets and metadata used in this manuscript for the model development and evaluation per Geoscientific Model Development's publication guidelines, with the exception of the model output due to its size. This dataset is associated with the following publication: Bash, J., K. Baker, and M. Beaver. Evaluation of improved land use and canopy representation in BEIS v3.61 with biogenic VOC measurements in California. Geoscientific Model Development. Copernicus Publications, Katlenburg-Lindau, GERMANY, 9: 2191-2207, (2016).
Hrynaszkiewicz, Iain; Khodiyar, Varsha; Hufton, Andrew L; Sansone, Susanna-Assunta
2016-01-01
Sharing of experimental clinical research data usually happens between individuals or research groups rather than via public repositories, in part due to the need to protect research participant privacy. This approach to data sharing makes it difficult to connect journal articles with their underlying datasets and is often insufficient for ensuring access to data in the long term. Voluntary data sharing services such as the Yale Open Data Access (YODA) and Clinical Study Data Request (CSDR) projects have increased the accessibility of clinical datasets for secondary uses while protecting patient privacy and the legitimacy of secondary analyses, but these resources are generally disconnected from journal articles, where researchers typically search for reliable information to inform future research. New scholarly journal and article types dedicated to increasing the accessibility of research data have emerged in recent years and, in general, journals are developing stronger links with data repositories. There is a need for increased collaboration between journals, data repositories, researchers, funders, and voluntary data sharing services to increase the visibility and reliability of clinical research. Using the journal Scientific Data as a case study, we propose and show examples of changes to the format and peer-review process for journal articles to more robustly link them to data that are only available on request. We also propose additional features for data repositories to better accommodate non-public clinical datasets, including Data Use Agreements (DUAs).
ATACseqQC: a Bioconductor package for post-alignment quality assessment of ATAC-seq data.
Ou, Jianhong; Liu, Haibo; Yu, Jun; Kelliher, Michelle A; Castilla, Lucio H; Lawson, Nathan D; Zhu, Lihua Julie
2018-03-01
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a recently developed technique for genome-wide analysis of chromatin accessibility. Compared to earlier methods for assaying chromatin accessibility, ATAC-seq is faster and easier to perform, does not require cross-linking, has a higher signal-to-noise ratio, and can be performed on small cell numbers. However, to ensure a successful ATAC-seq experiment, step-by-step quality assurance processes, including both wet lab quality control and in silico quality assessment, are essential. While several tools have been developed or adopted for assessing read quality, identifying nucleosome occupancy and accessible regions from ATAC-seq data, none of the tools provide a comprehensive set of functionalities for preprocessing and quality assessment of aligned ATAC-seq datasets. We have developed a Bioconductor package, ATACseqQC, for easily generating various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data. In addition, this package contains functions to preprocess aligned ATAC-seq data for subsequent peak calling. Here we demonstrate the utilities of our package using 25 publicly available ATAC-seq datasets from four studies. We also provide guidelines on what the diagnostic plots should look like for an ideal ATAC-seq dataset. This software package has been used successfully for preprocessing and assessing several in-house and public ATAC-seq datasets. Diagnostic plots generated by this package will facilitate the quality assessment of ATAC-seq data, and help researchers to evaluate their own ATAC-seq experiments as well as select high-quality ATAC-seq datasets from public repositories such as GEO to avoid generating hypotheses or drawing conclusions from low-quality ATAC-seq experiments. The software, source code, and documentation are freely available as a Bioconductor package at https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html.
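ATACseqQC itself is an R/Bioconductor package; as a language-agnostic illustration of one diagnostic it produces, the fragment size distribution can be computed from aligned fragment coordinates. This sketch uses toy data, and the size cutoffs are common rules of thumb for nucleosome-free and mono-nucleosome fragments, not values taken from the package:

```python
from collections import Counter

def fragment_size_distribution(fragments):
    """Count insert sizes from (start, end) fragment coordinates."""
    return Counter(end - start for start, end in fragments)

# Toy fragments; real input would come from an aligned BAM file
frags = [(100, 160), (200, 250), (300, 490), (500, 560)]
dist = fragment_size_distribution(frags)

# Rule-of-thumb bins for ATAC-seq libraries (assumed cutoffs)
nfr = sum(n for size, n in dist.items() if size < 100)           # nucleosome-free
mono = sum(n for size, n in dist.items() if 180 <= size <= 247)  # mono-nucleosome
print(nfr, mono)
```

In a healthy library, the histogram of these sizes shows a large sub-100 bp peak and periodic nucleosomal peaks, which is what the package's diagnostic plot visualizes.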
Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.
Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi
2015-09-01
Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. In total, 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world.
It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the conduct of more complex research. Copyright © 2015 Elsevier Ltd and National Safety Council. All rights reserved.
Data publication, documentation and user friendly landing pages - improving data discovery and reuse
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland
2016-04-01
Research data are the basis for scientific research and often irreplaceable (e.g. observational data). Storage of such data in appropriate, theme-specific or institutional repositories is an essential part of ensuring their long-term preservation and access. The free and open access to research data for reuse and scrutiny has been identified as a key issue by the scientific community as well as by research agencies and the public. To ensure that datasets are intelligible and usable by others, they must be accompanied by a comprehensive data description and standardized metadata for data discovery, and should ideally be published with a digital object identifier (DOI). DOIs make datasets citable, ensure their long-term accessibility, and are accepted in reference lists of journal articles (http://www.copdess.org/statement-of-commitment/). The GFZ German Research Centre for Geosciences is the national laboratory for geosciences in Germany and part of the Helmholtz Association, Germany's largest scientific organization. The development and maintenance of data systems is a key component of 'GFZ Data Services' to support state-of-the-art research. The datasets archived in and published by the GFZ Data Repository cover all geoscientific disciplines and range from large dynamic datasets deriving from global monitoring seismic or geodetic networks with real-time data acquisition, to remotely sensed satellite products, to automatically generated data publications from a database of micro-meteorological stations, to various model results, to geochemical and rock-mechanical analyses from various labs and field observations. The user-friendly presentation of published datasets via a DOI landing page is as important for reuse as the storage itself, and the required information is highly specific to each scientific discipline.
If dataset descriptions are too general, or require the download of a dataset before its suitability can be judged, researchers often decide not to reuse a published dataset. In contrast to large data repositories without thematic specification, theme-specific data repositories have deep expertise in data discovery and the opportunity to develop usable, discipline-specific formats and layouts for specific datasets, including consultation on different formats for the data description (e.g., via a data report or an article in a data journal) with full consideration of international metadata standards.
Carroll, Adam J; Badger, Murray R; Harvey Millar, A
2010-07-14
Standardization of analytical approaches and reporting methods via community-wide collaboration can work synergistically with web-tool development to result in rapid community-driven expansion of online data repositories suitable for data mining and meta-analysis. In metabolomics, the inter-laboratory reproducibility of gas chromatography/mass spectrometry (GC/MS) makes it an obvious target for such development. While a number of web-tools offer access to datasets and/or tools for raw data processing and statistical analysis, none of these systems are currently set up to act as a public repository by easily accepting, processing and presenting publicly submitted GC/MS metabolomics datasets for public re-analysis. Here, we present MetabolomeExpress (https://www.metabolome-express.org), a new File Transfer Protocol (FTP) server and web-tool for the online storage, processing, visualisation and statistical re-analysis of publicly submitted GC/MS metabolomics datasets. Users may search a quality-controlled database of metabolite response statistics from publicly submitted datasets by a number of parameters (e.g., metabolite, species, organ/biofluid). Users may also perform meta-analysis comparisons of multiple independent experiments or re-analyse public primary datasets via user-friendly tools for t-test, principal components analysis, hierarchical cluster analysis and correlation analysis. They may interact with chromatograms, mass spectra and peak detection results via an integrated raw data viewer. Researchers who register for a free account may upload (via FTP) their own data to the server for online processing via a novel raw data processing pipeline. MetabolomeExpress provides a new opportunity for the general metabolomics community to transparently present online the raw and processed GC/MS data underlying their metabolomics publications.
Transparent sharing of these data will allow researchers to assess data quality and draw their own insights from published metabolomics datasets.
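One of the re-analysis tools mentioned above is the t-test. As a self-contained sketch (pure Python with toy numbers, not MetabolomeExpress code), Welch's t statistic for one metabolite measured in two conditions can be computed as:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # sample variance, ddof=1
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Toy relative abundances of one metabolite in control vs. treatment
control = [1.0, 2.0, 3.0]
treated = [2.0, 3.0, 4.0]
t = welch_t(control, treated)
print(round(t, 4))  # negative t: lower mean abundance in control
```

A p-value would additionally require the Welch-Satterthwaite degrees of freedom and a t-distribution; the statistic alone suffices to illustrate the computation.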
NASA Astrophysics Data System (ADS)
Archibong, B.
2014-12-01
Do precolonial institutions, geography and ecological diversity affect population access to public infrastructure services over a century later? Can local leaders from historically centralized or 'conqueror' groups still influence access to public goods today? Do precolonial states located in ecologically diverse environments have better access to water, power and sanitation resources today? A growing body of literature examining the sources of the current state of African economic development has cited the enduring impacts of precolonial institutions and geography on contemporary African economic development using large sample cross-sectional analysis. In this paper, I focus on within country effects of local ethnic and political state institutions on access to public infrastructure services in present day Nigeria. Specifically, I combine information on the spatial distribution of ethnic states and ecological diversity in Nigeria circa mid 19th century and political states in Nigeria circa 1785 and 1850 with information, from a novel geocoded survey dataset, on access to public infrastructure at the local government level in present day Nigeria to examine the impact of precolonial state centralization on the current unequal access to public infrastructure services in Nigeria, accounting for the effects of ecological diversity and other geographic covariates. Some preliminary results show evidence for the long-term impacts of institutions, geography and ecological diversity on access to public infrastructure in Nigeria.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wood, Eric; Duran, Adam; Burton, Evan
This report includes a detailed comparison of the TomTom national road grade database relative to a local road grade dataset generated by Southwest Research Institute and a national elevation dataset publicly available from the U.S. Geological Survey. This analysis concluded that the TomTom national road grade database was a suitable source of road grade data for purposes of this study.
Anguita, Alberto; García-Remesal, Miguel; Graf, Norbert; Maojo, Victor
2016-04-01
Modern biomedical research relies on the semantic integration of heterogeneous data sources to find data correlations. Researchers access multiple datasets of disparate origin and identify elements (e.g., genes, compounds, pathways) that lead to interesting correlations. Normally, they must refer to additional public databases in order to enrich the information about the identified entities (e.g., scientific literature, published clinical trial results, etc.). While semantic integration techniques have traditionally focused on providing homogeneous access to private datasets, thus helping automate the first part of the research, and different solutions exist for browsing public data, there is still a need for tools that facilitate merging public repositories with private datasets. This paper presents a framework that automatically locates public data of interest to the researcher and semantically integrates it with existing private datasets. The framework has been designed as an extension of traditional data integration systems and has been validated with an existing data integration platform from a European research project by integrating a private biological dataset with data from the National Center for Biotechnology Information (NCBI). Copyright © 2016 Elsevier Inc. All rights reserved.
Accession numbers for microarray datasets used in Oshida et al. Chemical and Hormonal Effects on STAT5b-Dependent Sexual Dimorphism of the Liver Transcriptome. PLoS One. 2016 Mar 9;11(3):e0150284. This dataset is associated with the following publication: Oshida, K., D. Waxman, and C. Corton. Chemical and Hormonal Effects on STAT5b-Dependent Sexual Dimorphism of the Liver Transcriptome. PLoS ONE. Public Library of Science, San Francisco, CA, USA, 11(3): NA, (2016).
An examination of data quality on QSAR Modeling in regards ...
The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPISuite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data, and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy. Presentation at UNC-CH.
NASA Astrophysics Data System (ADS)
Archibong, Belinda
While previous literature has emphasized the importance of energy and public infrastructure services for economic development, questions surrounding the implications of unequal spatial distribution of access to these resources remain, particularly in the developing-country context. This dissertation provides evidence on the nature, origins, and implications of this distribution, uniting three strands of research from the development and political economy, regional science, and energy economics fields. The dissertation unites three papers: on the nature of spatial inequality of access to energy and infrastructure, with further implications for conflict risk; on the historical institutional and biogeographical determinants of the current distribution of access to energy and public infrastructure services; and on the response of households to fuel price changes over time. Chapter 2 uses a novel survey dataset to provide evidence for spatial clustering of public infrastructure non-functionality at schools by geopolitical zone in Nigeria, with further implications for armed conflict risk in the region. Chapter 3 investigates the drivers of the results in Chapter 2, exploiting variation in the spatial distribution of precolonial institutions and geography in the region, to provide evidence for the long-term impacts of these factors on the current heterogeneity of access to public services. Chapter 4 addresses the policy implications of energy access, providing the first multi-year evidence on firewood demand elasticities in India, using the spatial variation in prices for estimation.
interPopula: a Python API to access the HapMap Project dataset
2010-01-01
Background The HapMap project is a publicly available catalogue of common genetic variants that occur in humans, currently including several million SNPs across 1115 individuals spanning 11 different populations. This important database does not provide any programmatic access to the dataset; furthermore, no standard relational database interface is provided. Results interPopula is a Python API to access the HapMap dataset. interPopula provides integration facilities with both the Python ecosystem of software (e.g. Biopython and matplotlib) and other relevant human population datasets (e.g. Ensembl gene annotation and UCSC Known Genes). A set of guidelines and code examples to address possible inconsistencies across heterogeneous data sources is also provided. Conclusions interPopula is a straightforward and flexible Python API that facilitates the construction of scripts and applications that require access to the HapMap dataset. PMID:21210977
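interPopula's exact API is not reproduced here; as an illustration of the kind of computation such programmatic access enables, allele frequencies for one SNP can be derived from a set of genotype calls. The simplified two-letter genotype encoding and the toy data are assumptions for this sketch:

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Allele frequencies from genotype calls encoded as 2-letter strings, e.g. 'AG'."""
    counts = Counter()
    for g in genotypes:
        counts.update(g)  # count both alleles of the pair
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Toy genotypes for one SNP across four individuals
freqs = allele_frequencies(["AA", "AG", "GG", "AG"])
print(freqs)
```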
Development of a Planetary Web GIS at the ``Photothèque Planétaire'' in Orsay
NASA Astrophysics Data System (ADS)
Marmo, C.
2012-09-01
The “Photothèque Planétaire d'Orsay” belongs to the Regional Planetary Image Facilities (RPIF) network started by NASA in 1984. The original purpose of the RPIF was mainly to provide easy access to data from US space missions throughout the world. The “Photothèque” itself specializes in planetary data processing and distribution for research and public outreach. Planetary data are heterogeneous, and combining different observations is particularly challenging, especially if they belong to different datasets. A common description framework is needed, similar to the existing Geographical Information Systems (GIS) that have been developed for manipulating Earth data. In their present state, GIS software and standards cannot directly be applied to other planets because they still lack flexibility in managing coordinate systems. Yet, the GIS framework serves as an excellent starting point for the implementation of a Virtual Observatory for Planetary Sciences, provided it is made more generic and interoperable. The “Photothèque Planétaire d'Orsay” has produced some planetary GIS examples using historical and public datasets. Our main project is a Web-based visualization system for planetary data, which features direct point-and-click access to quantitative measurements. Thanks to being compatible with all recent web browsers, our interface can also be used for public outreach and to make data accessible for education and training.
NASA Astrophysics Data System (ADS)
Lary, D. J.
2013-12-01
A BigData case study is described where multiple datasets from several satellites, high-resolution global meteorological data, social media, and in-situ observations are combined using machine learning on a distributed cluster using an automated workflow. The global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of the multiple big datasets, in-situ data, and machine learning. To greatly reduce development time and enhance functionality, a high-level language capable of parallel processing (MATLAB) has been used. Key considerations for the system are high-speed access due to the large data volume, persistence of the large data volumes, and a precise process-time scheduling capability.
National Scale Marine Geophysical Data Portal for the Israel EEZ with Public Access Web-GIS Platform
NASA Astrophysics Data System (ADS)
Ketter, T.; Kanari, M.; Tibor, G.
2017-12-01
Recent offshore discoveries and regulation in the Israel Exclusive Economic Zone (EEZ) are the driving forces behind increasing marine research and development initiatives such as infrastructure development, environmental protection, and decision making, among many others. All marine operations rely on existing seabed information, while some also generate new data. We aim to create a single-platform knowledge base that enables access to existing information through a comprehensive, publicly accessible web-based interface. The Israel EEZ covers approximately 26,000 km² and has been surveyed continuously with various geophysical instruments over the past decades, including 10,000 km of multibeam survey lines, 8,000 km of sub-bottom seismic lines, and hundreds of sediment sampling stations. Our database consists of vector and raster datasets from multiple sources compiled into a repository of geophysical data and metadata, acquired nation-wide by several research institutes and universities. The repository will enable public access via a web portal based on a GIS platform, including datasets from multibeam, sub-bottom profiling, single- and multi-channel seismic surveys, and sediment sampling analysis. Respective data products will also be available, e.g. bathymetry, substrate type, granulometry, geological structure, etc. Operating a web-GIS-based repository allows retrieval of pre-existing data so that potential users can plan future activities, e.g. conducting marine surveys, constructing marine infrastructure, and carrying out other private or public projects. The user interface is based on map-oriented spatial selection, which will reveal any relevant data for designated areas of interest. Querying the database will allow the user to obtain information about the data owner and to address them for data retrieval as required.
Wide and free public access to existing data and metadata can save time and funds for academia, government and commercial sectors, while aiding in cooperation and data sharing among the various stakeholders.
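The map-oriented spatial selection described above reduces, at its core, to a bounding-box intersection test between survey footprints and a user's area of interest. A minimal sketch follows; the survey names and coordinates are invented for illustration:

```python
def intersects(a, b):
    """Axis-aligned bounding-box test; boxes are (min_lon, min_lat, max_lon, max_lat)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Invented survey footprints (roughly eastern-Mediterranean coordinates)
surveys = {
    "multibeam_A": (33.0, 31.5, 34.0, 32.5),
    "seismic_B": (34.5, 32.0, 35.0, 33.0),
}

aoi = (33.8, 32.0, 34.2, 32.4)  # user's area of interest
hits = [name for name, box in surveys.items() if intersects(box, aoi)]
print(hits)
```

A production web-GIS would run the same test via a spatial index on the server, but the selection semantics are identical.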
NASA Astrophysics Data System (ADS)
Tisdale, M.
2016-12-01
NASA's Atmospheric Science Data Center (ASDC) is operationally using the Esri ArcGIS Platform to improve data discoverability, accessibility, and interoperability to meet requirements driven by diversifying government, private, public, and academic communities. The ASDC is actively working to provide its mission-essential datasets as ArcGIS Image Services, Open Geospatial Consortium (OGC) Web Mapping Services (WMS), and OGC Web Coverage Services (WCS), leveraging the ArcGIS multidimensional mosaic dataset structure. Science teams and the ASDC are utilizing these services, developing applications using the Web AppBuilder for ArcGIS and the ArcGIS API for JavaScript, and evaluating restructuring their data production and access scripts within the ArcGIS Python Toolbox framework and Geoprocessing service environment. These capabilities yield greater usage and exposure of ASDC data holdings and provide improved geospatial analytical tools for a mission-critical understanding of the earth's radiation budget, clouds, aerosols, and tropospheric chemistry.
IMG/M: integrated genome and metagenome comparative data analysis system
Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; ...
2016-10-13
The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.
IMG/M: integrated genome and metagenome comparative data analysis system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken
The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.
IMG/M: integrated genome and metagenome comparative data analysis system
Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; Palaniappan, Krishna; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Andersen, Evan; Huntemann, Marcel; Varghese, Neha; Hadjithomas, Michalis; Tennessen, Kristin; Nielsen, Torben; Ivanova, Natalia N.; Kyrpides, Nikos C.
2017-01-01
The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system. PMID:27738135
NASA Astrophysics Data System (ADS)
Horsburgh, J. S.; Jones, A. S.
2016-12-01
Data and models used within the hydrologic science community are diverse. New research data and model repositories have succeeded in making data and models more accessible, but have been, in most cases, limited to particular types or classes of data or models, and they also lack the collaborative, iterative functionality needed to enable shared data collection and modeling workflows. File sharing systems currently used within many scientific communities for private sharing of preliminary and intermediate data and modeling products do not support collaborative data capture, description, visualization, and annotation. More recently, hydrologic datasets and models have been cast as "social objects" that can be published, collaborated around, annotated, discovered, and accessed. Yet it can be difficult using existing software tools to achieve the kind of collaborative workflows and data/model reuse that many envision. HydroShare is a new, web-based system for sharing hydrologic data and models with specific functionality aimed at making collaboration easier and achieving new levels of interactive functionality and interoperability. Within HydroShare, we have developed new functionality for creating datasets, describing them with metadata, and sharing them with collaborators. HydroShare is enabled by a generic data model and content packaging scheme that supports describing and sharing diverse hydrologic datasets and models. Interoperability among the diverse types of data and models used by hydrologic scientists is achieved through the use of consistent storage, management, sharing, publication, and annotation within HydroShare.
In this presentation, we highlight and demonstrate how the flexibility of HydroShare's data model and packaging scheme, HydroShare's access control and sharing functionality, and versioning and publication capabilities have enabled the sharing and publication of research datasets for a large, interdisciplinary water research project called iUTAH (innovative Urban Transitions and Aridregion Hydro-sustainability). We discuss the experiences of iUTAH researchers now using HydroShare to collaboratively create, curate, and publish datasets and models in a way that encourages collaboration, promotes reuse, and meets funding agency requirements.
Key Lessons in Building "Data Commons": The Open Science Data Cloud Ecosystem
NASA Astrophysics Data System (ADS)
Patterson, M.; Grossman, R.; Heath, A.; Murphy, M.; Wells, W.
2015-12-01
Cloud computing technology has created a shift around data and data analysis by allowing researchers to push computation to data as opposed to having to pull data to an individual researcher's computer. Subsequently, cloud-based resources can provide unique opportunities to capture computing environments used both to access raw data in its original form and also to create analysis products which may be the source of data for tables and figures presented in research publications. Since 2008, the Open Cloud Consortium (OCC) has operated the Open Science Data Cloud (OSDC), which provides scientific researchers with computational resources for storing, sharing, and analyzing large (terabyte and petabyte-scale) scientific datasets. OSDC has provided compute and storage services to over 750 researchers in a wide variety of data intensive disciplines. Recently, internal users have logged about 2 million core hours each month. The OSDC also serves the research community by colocating these resources with access to nearly a petabyte of public scientific datasets in a variety of fields also accessible for download externally by the public. In our experience operating these resources, researchers are well served by "data commons," meaning cyberinfrastructure that colocates data archives, computing, and storage infrastructure and supports essential tools and services for working with scientific data. In addition to the OSDC public data commons, the OCC operates a data commons in collaboration with NASA and is developing a data commons for NOAA datasets. As cloud-based infrastructures for distributing and computing over data become more pervasive, we ask, "What does it mean to publish data in a data commons?" Here we present the OSDC perspective and discuss several services that are key in architecting data commons, including digital identifier services.
Remote visual analysis of large turbulence databases at multiple scales
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pulido, Jesus; Livescu, Daniel; Kanov, Kalin
The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.
Remote visual analysis of large turbulence databases at multiple scales
Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...
2018-06-15
The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.
Nord, Derek; Nye-Lengerman, Kelly
2015-08-01
Public benefits are widely used by people with intellectual and developmental disabilities (IDD) as crucial financial supports. Using Rehabilitation Services Administration 911 and Annual Review Report datasets to account for individual and state vocational rehabilitation (VR) agency variables, a sample of 21,869 people with IDD was analyzed using hierarchical linear modeling to model the effects of public benefits on hours worked per week. Findings point to associations indicating that public benefits not only limit access to employment participation, but also have a restricting effect on the growth of weekly hours that typically comes with higher-wage positions, compared with those who do not access benefits. The article also lays out important implications and recommendations to increase the inclusion of people with IDD in the workplace.
Cluster Active Archive: lessons learnt
NASA Astrophysics Data System (ADS)
Laakso, H. E.; Perry, C. H.; Taylor, M. G.; Escoubet, C. P.; Masson, A.
2010-12-01
The ESA Cluster Active Archive (CAA) was opened to the public in February 2006 after an initial three-year development phase. It provides access (both a web GUI and a command-line tool are available) to the calibrated full-resolution datasets of the four-satellite Cluster mission. The data archive is publicly accessible and suitable for science use and publication by the world-wide scientific community. There are more than 350 datasets from each spacecraft, including high-resolution magnetic and electric DC and AC fields as well as full 3-dimensional electron and ion distribution functions and moments from a few eV to hundreds of keV. The Cluster mission has been in operation since February 2001; although the CAA can provide access to some recent observations, the ingestion of other datasets can be delayed by a few years due to the large and difficult calibration routines required for aging detectors. The quality of the datasets is a central concern for the CAA. Having the same instrument on four spacecraft allows cross-instrument comparisons and provides confidence in some of the instrumental calibration parameters. Furthermore, it is highly important that many physical parameters are measured by more than one instrument, which allows extensive and continuous cross-calibration analyses. In addition, some of the instruments can be regarded as absolute or reference measurements for other instruments. The CAA avoids mission-specific acronyms and concepts as much as possible and tends to use more generic terms in describing the datasets and their contents, in order to ease the usage of the CAA data by “non-Cluster” scientists. Currently the CAA has more than 1000 users, and every month more than 150 different users log in to the CAA to plot and/or download observations. The users download about 1 TeraByte of data every month.
The CAA has separated the graphical tool from the download tool because full-resolution datasets can be visualized in many ways, so there is no one-to-one correspondence between graphical products and full-resolution datasets. The CAA encourages users to contact the CAA team with all kinds of issues, whether they concern the user interface, the content of the datasets, the quality of the observations or requests for new types of services. The CAA runs regular annual reviews of the data products and the user services in order to improve the quality and usability of the CAA system for the world-wide user community. The CAA is continuously being upgraded in terms of datasets and services.
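As a toy illustration of the cross-calibration idea described above (not the CAA's actual procedure), a least-squares scale factor between two co-sampled instruments measuring the same physical parameter can be computed in a few lines; the numbers are invented.

```python
def cross_calibration_factor(ref, test):
    """Least-squares scale factor alpha minimizing sum((alpha*test - ref)**2).

    `ref` holds values from the instrument treated as the absolute reference;
    `test` holds simultaneous values from the instrument being calibrated.
    """
    if len(ref) != len(test):
        raise ValueError("series must be co-sampled")
    sxy = sum(r * t for r, t in zip(ref, test))  # cross term
    sxx = sum(t * t for t in test)               # normalisation
    return sxy / sxx

# Toy numbers: the second instrument reads a factor of ~2 low, so alpha ≈ 2.
alpha = cross_calibration_factor([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Real cross-calibration involves far more (offsets, time dependence, detector aging), but the comparison of redundant measurements is the common core.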
Publicly Releasing a Large Simulation Dataset with NDS Labs
NASA Astrophysics Data System (ADS)
Goldbaum, Nathan
2016-03-01
Optimally, all publicly funded research should be accompanied by the tools, code, and data necessary to fully reproduce the analysis performed in journal articles describing the research. This ideal can be difficult to attain, particularly when dealing with large (>10 TB) simulation datasets. In this lightning talk, we describe the process of publicly releasing a large simulation dataset to accompany the submission of a journal article. The simulation was performed using Enzo, an open source, community-developed N-body/hydrodynamics code, and was analyzed using a wide range of community-developed tools in the scientific Python ecosystem. Although the simulation was performed and analyzed using an ecosystem of sustainably developed tools, we enable sustainable science using our data by making it publicly available. Combining the data release with the NDS Labs infrastructure allows a substantial amount of added value, including web-based access to analysis and visualization using the yt analysis package through an IPython notebook interface. In addition, we are able to accompany the paper submission to the arXiv preprint server with links to the raw simulation data as well as interactive real-time data visualizations that readers can explore on their own or share with colleagues during journal club discussions. It is our hope that the value added by these services will substantially increase the impact and readership of the paper.
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland
2017-04-01
Open access to research data is an increasing international request and includes not only data underlying scholarly publications, but also raw and curated data. Especially in the framework of the observed shift in many scientific fields towards data science and data mining, data repositories are becoming important players as data archives and access points to curated research data. While general and institutional data repositories are available across all scientific disciplines, domain-specific data repositories are specialised for particular disciplines, e.g. the bio- or geosciences, with the possibility to use more discipline-specific and richer metadata models than general repositories. Data publication is increasingly regarded as an important scientific achievement, and datasets with a digital object identifier (DOI) are now fully citable in journal articles. Moreover, following their signing of the "Statement of Commitment of the Coalition on Publishing Data in the Earth and Space Sciences" (COPDESS), many publishers have adapted their data policies and recommend, or even request, that data underlying scholarly publications be stored and published in (domain-specific) data repositories rather than as classical supplementary material directly attached to the respective article. The curation of large dynamic data from global networks in, e.g., seismology, magnetics or geodesy has always required a high degree of professional, IT-supported data management, simply to store and access the huge number of files and manage dynamic datasets. In contrast, the vast amount of research data acquired by individual investigators or small teams, known as 'long-tail data', has often not been the focus of data curation infrastructure development. Nevertheless, even though these data are small in size and highly variable, in total they represent a significant portion of the total scientific output.
The curation of long-tail data requires more individual approaches and personal involvement of the data curator, especially regarding the data description. Here we introduce best practices for the publication of long-tail data that help to reduce the individual effort and improve the quality of the data description. The data repository of GFZ Data Services, hosted at the GFZ German Research Centre for Geosciences in Potsdam, is a domain-specific data repository for the geosciences. In addition to large dynamic datasets from different disciplines, it has a strong focus on the DOI-referenced publication of long-tail data, with the aim of reaching a high degree of reusability through comprehensive data description while at the same time providing and distributing standardised, machine-actionable metadata for data discovery (FAIR data). The development of templates for data reports, metadata provision by scientists via an XML metadata editor, and discipline-specific DOI landing pages help the data curators handle all kinds of datasets and enable scientists, i.e. users, to quickly decide whether a published dataset fulfils their needs. In addition, GFZ Data Services has developed DOI-registration services for several international networks (e.g. ICGEM, World Stress Map, IGETS, etc.) as well as project- or network-specific designs of the DOI landing pages featuring the logo or design of the networks or projects.
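A minimal sketch of the kind of machine-actionable metadata record a curation workflow like the one above might emit, loosely modelled on DataCite-style elements; the field layout, names and DOI are illustrative only and do not reproduce the exact schema GFZ Data Services uses.

```python
import xml.etree.ElementTree as ET

def minimal_record(doi, title, creator, year):
    """Serialize a simplified, DataCite-style XML metadata record."""
    root = ET.Element("resource")
    ET.SubElement(root, "identifier", identifierType="DOI").text = doi
    creators = ET.SubElement(root, "creators")
    ET.SubElement(ET.SubElement(creators, "creator"), "creatorName").text = creator
    ET.SubElement(ET.SubElement(root, "titles"), "title").text = title
    ET.SubElement(root, "publicationYear").text = str(year)
    return ET.tostring(root, encoding="unicode")

# Invented DOI and dataset, for illustration only.
xml_doc = minimal_record("10.5880/example.2017.001", "Long-tail dataset",
                         "Doe, Jane", 2017)
```

A real record would carry the schema namespace and many more properties, but even this skeleton is enough for a landing page or harvester to identify and cite the dataset.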
Science Education with the LSST
NASA Astrophysics Data System (ADS)
Jacoby, S. H.; Khandro, L. M.; Larson, A. M.; McCarthy, D. W.; Pompea, S. M.; Shara, M. M.
2004-12-01
LSST will create the first true celestial cinematography - a revolution in public access to the changing universe. The challenge will be to take advantage of the unique capabilities of the LSST while presenting the data in ways that are manageable, engaging, and supportive of national science education goals. To prepare for this opportunity for exploration, tools and displays will be developed using current deep-sky multi-color imaging data. Education professionals from LSST partners invite input from interested members of the community. Initial LSST science education priorities include: - Fostering authentic student-teacher research projects at all levels, - Exploring methods of visualizing the large and changing datasets in science centers, - Defining Web-based interfaces and tools for access and interaction with the data, - Delivering online instructional materials, and - Developing meaningful interactions between LSST scientists and the public.
Harvard Aging Brain Study: Dataset and accessibility.
Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G; Chatwal, Jasmeer P; Papp, Kathryn V; Amariglio, Rebecca E; Blacker, Deborah; Rentz, Dorene M; Johnson, Keith A; Sperling, Reisa A; Schultz, Aaron P
2017-01-01
The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging. To promote more extensive analyses, imaging data was designed to be compatible with other publicly available datasets. A cloud-based system enables access to interested researchers with blinded data available contingent upon completion of a data usage agreement and administrative approval. Data collection is ongoing and currently in its fifth year. Copyright © 2015 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Lloyd, S. A.; Acker, J. G.; Prados, A. I.; Leptoukh, G. G.
2008-12-01
One of the biggest obstacles for the average Earth science student today is locating and obtaining satellite-based remote sensing datasets in a format that is accessible and optimal for their data analysis needs. At the Goddard Earth Sciences Data and Information Services Center (GES-DISC) alone, on the order of hundreds of Terabytes of data are available for distribution to scientists, students and the general public. The single biggest and most time-consuming hurdle for most students when they begin their study of the various datasets is how to slog through this mountain of data to arrive at a properly subsetted and manageable dataset to answer their science question(s). The GES DISC provides a number of tools for data access and visualization, including the Google-like Mirador search engine and the powerful GES-DISC Interactive Online Visualization ANd aNalysis Infrastructure (Giovanni) web interface. Giovanni provides a simple way to visualize, analyze and access vast amounts of satellite-based Earth science data. Giovanni's features and practical examples of its use will be demonstrated, with an emphasis on how satellite remote sensing can help students understand recent events in the atmosphere and biosphere. Giovanni is actually a series of sixteen similar web-based data interfaces, each of which covers a single satellite dataset (such as TRMM, TOMS, OMI, AIRS, MLS, HALOE, etc.) or a group of related datasets (such as MODIS and MISR for aerosols, SeaWiFS and MODIS for ocean color, and the suite of A-Train observations co-located along the CloudSat orbital path). Recently, ground-based datasets have been included in Giovanni, including the Northern Eurasian Earth Science Partnership Initiative (NEESPI) and EPA fine particulate matter (PM2.5) for air quality. Model data such as the Goddard GOCART model and MERRA meteorological reanalyses (in process) are being increasingly incorporated into Giovanni to facilitate model-data intercomparison.
A full suite of data analysis and visualization tools is also available within Giovanni. The GES DISC is currently developing a systematic series of training modules for Earth science satellite data, associated with our development of additional datasets and data visualization tools for Giovanni. Training sessions will include an overview of the Earth science datasets archived at Goddard, an overview of terms and techniques associated with satellite remote sensing, dataset-specific issues, an overview of Giovanni functionality, and a series of examples of how data can be readily accessed and visualized.
Process mining in oncology using the MIMIC-III dataset
NASA Astrophysics Data System (ADS)
Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen
2018-03-01
Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option, and this paper describes the potential to use MIMIC-III for process mining in oncology. MIMIC-III is a large open access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining, and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
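As a minimal illustration of the process-discovery idea (not the paper's actual L* workflow), the directly-follows relation can be counted from an event log assembled from event-table rows; the case IDs, timestamps and activities below are invented.

```python
from collections import Counter
from itertools import groupby

# Toy event log: (case_id, timestamp, activity) rows, as they might be
# assembled from event tables; values are illustrative, not MIMIC-III data.
events = [
    ("p1", 1, "admission"), ("p1", 2, "chemotherapy"), ("p1", 3, "discharge"),
    ("p2", 1, "admission"), ("p2", 2, "surgery"),
    ("p2", 3, "chemotherapy"), ("p2", 4, "discharge"),
]

def directly_follows(log):
    """Count how often activity a is immediately followed by b within a case."""
    pairs = Counter()
    # Sort by (case_id, timestamp) so each case's events are in order.
    for _, case in groupby(sorted(log), key=lambda e: e[0]):
        acts = [activity for _, _, activity in case]
        pairs.update(zip(acts, acts[1:]))
    return pairs

df = directly_follows(events)
```

Discovery algorithms such as the alpha miner start from exactly this directly-follows matrix to propose a process model.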
Addition of a breeding database in the Genome Database for Rosaceae
Evans, Kate; Jung, Sook; Lee, Taein; Brutcher, Lisa; Cho, Ilhyung; Peace, Cameron; Main, Dorrie
2013-01-01
Breeding programs produce large datasets that require efficient management systems to keep track of performance, pedigree, geographical and image-based data. With the development of DNA-based screening technologies, more breeding programs perform genotyping in addition to phenotyping for performance evaluation. The integration of breeding data with other genomic and genetic data is instrumental for the refinement of marker-assisted breeding tools, enhances genetic understanding of important crop traits and maximizes access and utility by crop breeders and allied scientists. New infrastructure in the Genome Database for Rosaceae (GDR) was designed and implemented to enable secure and efficient storage, management and analysis of large datasets from the Washington State University apple breeding program, and subsequently expanded to fit datasets from other Rosaceae breeders. The infrastructure was built using the software Chado and Drupal, making use of the Natural Diversity module to accommodate large-scale phenotypic and genotypic data. Breeders can search accessions within the GDR to identify individuals with specific trait combinations. Results from Search by Parentage list individuals with parents in common, and results from Individual Variety pages link to all data available on each chosen individual, including pedigree, phenotypic and genotypic information. Genotypic data are searchable by markers and alleles; results are linked to other pages in the GDR to enable the user to access tools such as GBrowse and CMap. This breeding database provides users with the opportunity to search datasets in a fully targeted manner, retrieve and compare performance data from multiple selections, years and sites, and output the data needed for variety release publications and patent applications. The breeding database facilitates efficient program management.
Storing publicly available breeding data in a database together with genomic and genetic data will further accelerate the cross-utilization of diverse data types by researchers from various disciplines. Database URL: http://www.rosaceae.org/breeders_toolbox PMID:24247530
Addition of a breeding database in the Genome Database for Rosaceae.
Evans, Kate; Jung, Sook; Lee, Taein; Brutcher, Lisa; Cho, Ilhyung; Peace, Cameron; Main, Dorrie
2013-01-01
Breeding programs produce large datasets that require efficient management systems to keep track of performance, pedigree, geographical and image-based data. With the development of DNA-based screening technologies, more breeding programs perform genotyping in addition to phenotyping for performance evaluation. The integration of breeding data with other genomic and genetic data is instrumental for the refinement of marker-assisted breeding tools, enhances genetic understanding of important crop traits and maximizes access and utility by crop breeders and allied scientists. New infrastructure in the Genome Database for Rosaceae (GDR) was designed and implemented to enable secure and efficient storage, management and analysis of large datasets from the Washington State University apple breeding program, and subsequently expanded to fit datasets from other Rosaceae breeders. The infrastructure was built using the software Chado and Drupal, making use of the Natural Diversity module to accommodate large-scale phenotypic and genotypic data. Breeders can search accessions within the GDR to identify individuals with specific trait combinations. Results from Search by Parentage list individuals with parents in common, and results from Individual Variety pages link to all data available on each chosen individual, including pedigree, phenotypic and genotypic information. Genotypic data are searchable by markers and alleles; results are linked to other pages in the GDR to enable the user to access tools such as GBrowse and CMap. This breeding database provides users with the opportunity to search datasets in a fully targeted manner, retrieve and compare performance data from multiple selections, years and sites, and output the data needed for variety release publications and patent applications. The breeding database facilitates efficient program management.
Storing publicly available breeding data in a database together with genomic and genetic data will further accelerate the cross-utilization of diverse data types by researchers from various disciplines. Database URL: http://www.rosaceae.org/breeders_toolbox.
A Research Graph dataset for connecting research data repositories using RD-Switchboard.
Aryani, Amir; Poblet, Marta; Unsworth, Kathryn; Wang, Jingbo; Evans, Ben; Devaraju, Anusuriya; Hausstein, Brigitte; Klas, Claus-Peter; Zapilko, Benjamin; Kaplun, Samuele
2018-05-29
This paper describes the open access graph dataset that shows the connections between Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim to discover and connect the related research datasets based on publication co-authorship or jointly funded grants. The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation. Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.
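The co-authorship linking described above can be sketched in a few lines: two dataset records are connected when they share at least one author. The record identifiers and names below are invented, not actual Research Graph data.

```python
# Toy records keyed by repository-style identifiers (invented for illustration).
records = {
    "dryad:doi-1": {"authors": {"Chen", "Lee"}},
    "cern:rec-7":  {"authors": {"Lee", "Garcia"}},
    "ands:ds-42":  {"authors": {"Okafor"}},
}

def connected_datasets(records):
    """Return unordered pairs of dataset ids that share at least one author."""
    ids = sorted(records)
    return {(a, b)
            for i, a in enumerate(ids)
            for b in ids[i + 1:]
            if records[a]["authors"] & records[b]["authors"]}

links = connected_datasets(records)
```

In the real graph, edges also come from jointly funded grants and links to publications; the pairwise-intersection idea stays the same.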
NASA Astrophysics Data System (ADS)
Tisdale, M.
2017-12-01
NASA's Atmospheric Science Data Center (ASDC) is operationally using the Esri ArcGIS Platform to improve data discoverability, accessibility and interoperability to meet the diversifying user requirements of government, private, public and academic communities. The ASDC is actively working to provide its mission-essential datasets as ArcGIS Image Services, Open Geospatial Consortium (OGC) Web Mapping Services (WMS), and OGC Web Coverage Services (WCS), while leveraging the ArcGIS multidimensional mosaic dataset structure. Science teams at ASDC are utilizing these services through the development of applications using the Web AppBuilder for ArcGIS and the ArcGIS API for JavaScript. These services provide greater exposure of ASDC data holdings to the GIS community and allow for broader sharing and distribution to various end users. These capabilities provide interactive visualization tools and improved geospatial analytical tools for mission-critical understanding in the areas of the earth's radiation budget, clouds, aerosols, and tropospheric chemistry. The presentation will cover how the ASDC is developing geospatial web services and applications to improve data discoverability, accessibility, and interoperability.
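Exposing data through OGC WMS means any client can retrieve a rendered map with a plain HTTP request. The sketch below builds a standard WMS 1.3.0 GetMap URL; the endpoint and layer name are placeholders, not actual ASDC service paths.

```python
from urllib.parse import urlencode

# Sketch of a standard OGC WMS 1.3.0 GetMap request. The endpoint and
# layer name below are hypothetical placeholders.
def wms_getmap_url(endpoint, layer, bbox, width=1024, height=512,
                   crs="EPSG:4326", fmt="image/png"):
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": crs,  # WMS 1.3.0 uses CRS (1.1.1 used SRS)
        # Note: in WMS 1.3.0 with EPSG:4326 the axis order is lat,lon.
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return endpoint + "?" + urlencode(params)

url = wms_getmap_url(
    "https://example.gov/arcgis/services/demo/MapServer/WMSServer",
    "toa_net_flux", (-90, -180, 90, 180))
print(url)
```

Because the interface is standardized, the same request pattern works against any compliant server, which is exactly what makes OGC services valuable for interoperability.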
Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S
2017-01-01
As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationships in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.
These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines. PMID:29372115
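The downloading workflow pairs a descriptive spreadsheet with a script that fetches each listed accession. The sketch below shows the pattern; the column names and TSV contents are illustrative inventions, not the actual format defined in the WGS-standards-and-analysis/datasets repository.

```python
import csv, io

# Hypothetical miniature of the spreadsheet-driven download: parse a
# descriptive TSV and emit one download command per SRA run accession.
# Column names and data below are invented for illustration.
sheet = """\
biosample\tsrarun_acc\tstrain
SAMN001\tSRR1000001\tLmono-outbreak-01
SAMN002\tSRR1000002\tLmono-outbreak-02
"""

def download_commands(tsv_text):
    """Build one fasterq-dump command per run accession (sketch only)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return ["fasterq-dump " + row["srarun_acc"] for row in reader]

print(download_commands(sheet))
```

Keeping the dataset description machine-readable is what lets one generic script download every benchmark the community contributes.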
The Geoscience Internet of Things
NASA Astrophysics Data System (ADS)
Lehnert, K.; Klump, J.
2012-04-01
Internet of Things is a term that refers to "uniquely identifiable objects (things) and their virtual representations in an Internet-like structure" (Wikipedia). We here use the term to describe new and innovative ways to integrate physical samples in the Earth Sciences into the emerging digital infrastructures that are being developed to support research and education in the Geosciences. Many Earth Science data are acquired on solid earth samples through observations and experiments conducted in the field or in the lab. The application and long-term utility of sample-based data for science is critically dependent on (a) the availability of information (metadata) about the samples, such as the geographical location where the sample was collected, time of sampling, sampling method, etc.; (b) links between the different data types available for individual samples that are dispersed in the literature and in digital data repositories; and (c) access to the samples themselves. None of these requirements could be met in the past due to incomplete documentation of samples in publications, use of ambiguous sample names, and the lack of a central catalog that allows researchers to find a sample's archiving location. New internet-based capabilities have been developed over the past few years for the registration and unique identification of samples that make it possible to overcome these problems. Services for the registration and unique identification of samples are provided by the System for Earth Sample Registration SESAR (www.geosamples.org). SESAR developed the International Geo Sample Number, or IGSN, as a unique identifier for samples and specimens collected from our natural environment.
Since December 2011, the IGSN has been governed by an international organization, the IGSN e.V. (www.igsn.org), which endorses and promotes an internationally unified approach for registration and discovery of physical specimens in the Geoscience community and is establishing a new modular and scalable architecture for the IGSN to advance global implementation. Use of the IGSN will, for the first time, make it possible to establish links between samples (or their digital representations), data acquired on these samples, and the publications that report these data. Samples can be linked to a dataset by including IGSNs in the metadata record of a dataset's DOI when the dataset is registered with the DOI system for unique identification. Links between datasets and publications, based on dataset DOIs, have already been implemented between some Geoscience journals and data centers that are Publication Agents in the DataCite consortium (www.datacite.org). Links between IGSNs, dataset DOIs, and publication DOIs will in the future allow researchers to find and access, with a single query and without ambiguity, all data acquired on a specific sample across the entire literature.
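Concretely, the sample-to-dataset link travels as related identifiers inside the dataset's DOI metadata. The field names below follow the DataCite metadata schema (which includes IGSN among its related-identifier types); the DOI and IGSN values themselves are invented examples.

```python
# Sketch of a DataCite-style DOI metadata record carrying IGSN links.
# Field names follow the DataCite schema; all values are invented.
record = {
    "doi": "10.1234/example-dataset",
    "relatedIdentifiers": [
        {"relatedIdentifier": "IECUR0001",
         "relatedIdentifierType": "IGSN", "relationType": "References"},
        {"relatedIdentifier": "IECUR0002",
         "relatedIdentifierType": "IGSN", "relationType": "References"},
        {"relatedIdentifier": "10.5555/journal.article",
         "relatedIdentifierType": "DOI", "relationType": "IsCitedBy"},
    ],
}

def sample_igsns(rec):
    """Extract the IGSNs of the samples this dataset was measured on."""
    return [r["relatedIdentifier"] for r in rec["relatedIdentifiers"]
            if r["relatedIdentifierType"] == "IGSN"]

print(sample_igsns(record))
```

Once IGSNs sit in DOI metadata this way, a query service can walk sample → dataset → publication without string-matching ambiguous sample names.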
Open Access Data Centers as an Essential Partner to a Data Publication Journal
NASA Astrophysics Data System (ADS)
Carlson, D.; Pfeiffenberger, H.
2016-12-01
The success of Earth System Science Data (ESSD) derives in part from key infrastructure: digital object identifiers (DOIs) and open access data centers. Our concept that a data journal should promote access and exchange through publication of reviewed data descriptions presupposed third parties to hold the data. As minimum criteria for those data centers we expected an international reputation for quality of service and an active lifetime extending at least a decade into the future. We also expected modern access interfaces offering geographic, topical and parameter-based browsing, so that users could discover related holdings through an ESSD link or discover ESSD by way of links in datasets revealed through the center's browse tools, as well as true open access. True open access means one or two clicks from an abstract in ESSD to the data itself, without barriers. We started with PANGAEA and CDIAC. Data providers already used these centers, the staff welcomed the ESSD initiative, and all parties cooperated on DOIs. With this initial support ESSD proved the basic concept of data publication and demonstrated utility to a larger group of data providers, many of whom suggested additional centers. So long as those data centers met expectations for open access and quality and durability of service, ESSD agreed to collaborate. Through back-door collaborations, e.g. service on particular data sets, ESSD developed working partnerships with more than 30 data centers in 13 countries. Data centers ask to join our list. We encourage those centers to stimulate local providers to submit a data set to ESSD, thus preserving our practical data-set by data-set partnership mode. For a few data centers where national policies impose a registration step, center staff and ESSD editors created bypass access routes to facilitate anonymous reviews. For ESSD purposes, open access and DOI cooperation leading to reliable curation allows a win-win-win partnership among centers, providers, and journal.
Persistent identifiers for CMIP6 data in the Earth System Grid Federation
NASA Astrophysics Data System (ADS)
Buurman, Merret; Weigel, Tobias; Juckes, Martin; Lautenschlager, Michael; Kindermann, Stephan
2016-04-01
The Earth System Grid Federation (ESGF) is a distributed data infrastructure that will provide access to the CMIP6 experiment data. The data consist of thousands of datasets composed of millions of files. Over the course of the CMIP6 operational phase, datasets may be retracted and replaced by newer versions that consist of completely or partly new files. Each dataset is hosted at a single data centre, but can have one or several backups (replicas) at other data centres. To keep track of the different data entities and relationships between them, to ensure their consistency and improve exchange of information about them, Persistent Identifiers (PIDs) are used. These are unique identifiers that are registered at a globally accessible server, along with some metadata (the PID record). While a PID usually provides access to the data object it refers to, as long as that object exists, the metadata record will remain available even beyond the object's lifetime. Besides providing access to data and metadata, PIDs will allow scientists to communicate effectively and at a fine granularity about CMIP6 data. The initiative to introduce PIDs in the ESGF infrastructure has been described and agreed upon through a series of white papers governed by the WGCM Infrastructure Panel (WIP). In CMIP6, each dataset and each file is assigned a PID that keeps track of the data object's physical copies throughout the object lifetime. In addition, its relationships with other data objects are stored in the PID record. A human-readable version of this information is available on an information page, also linked in the PID record. A possible application that exploits the information available from the PID records is a smart information tool, which a scientific user can call to find out if his/her version was replaced by a new one, to view and browse the related datasets and files, and to get access to the various copies or to additional metadata on a dedicated website.
The PID registration process is embedded in the ESGF data publication process. During first publication, the PID records are populated with metadata including the parent dataset(s), other existing versions and the physical location. Every subsequent publication, un-publication or replica publication of a dataset or file then updates the PID records to keep track of changing physical locations of the data (or lack thereof) and of reported errors in the data. Assembling the metadata records and registering the PIDs on a central server is a potential performance bottleneck, as millions of data objects may be published in a short timeframe when the CMIP6 experiment phase begins. For this reason, the PID registration and metadata update tasks are pushed to a message queueing system that facilitates high availability and scalability, and are then processed asynchronously. This will lead to a slight delay in PID registration but will avoid blocking resources at the data centres and slowing down the publication of the data so eagerly awaited by the scientists.
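The queue-based decoupling described above can be sketched with an in-memory queue and a single worker: publishers enqueue events and return immediately, while the worker updates PID records asynchronously. The handle string and record layout are invented for illustration; the real system uses a dedicated messaging service, not `queue.Queue`.

```python
import queue, threading

# Simplified stand-in for asynchronous PID registration: publication
# events go onto a queue; a worker applies them to the PID records.
# Handle names and record fields are invented examples.
pid_records = {}
tasks = queue.Queue()

def worker():
    while True:
        task = tasks.get()
        if task is None:          # sentinel: shut down the worker
            break
        handle, payload = task
        pid_records.setdefault(handle, {}).update(payload)
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

# A data node publishes a dataset; later a replica location is added.
tasks.put(("hdl:21.14100/abc", {"locations": ["dkrz.de"]}))
tasks.put(("hdl:21.14100/abc", {"locations": ["dkrz.de", "ceda.ac.uk"]}))
tasks.put(None)
t.join()
print(pid_records["hdl:21.14100/abc"]["locations"])
```

The publisher never blocks on the registration server; the cost is only the slight registration delay the abstract mentions.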
ISRUC-Sleep: A comprehensive public dataset for sleep researchers.
Khalighi, Sirvan; Sousa, Teresa; Santos, José Moutinho; Nunes, Urbano
2016-02-01
To facilitate the performance comparison of new methods for sleep pattern analysis, publicly available datasets with quality content are very important and useful. We introduce an open-access comprehensive sleep dataset, called ISRUC-Sleep. The data were obtained from human adults, including healthy subjects, subjects with sleep disorders, and subjects under the effect of sleep medication. Each recording was randomly selected from PSG recordings acquired by the Sleep Medicine Centre of the Hospital of Coimbra University (CHUC). The dataset comprises three groups of data: (1) data concerning 100 subjects, with one recording session per subject; (2) data gathered from 8 subjects, with two recording sessions per subject; and (3) data collected from one recording session for each of 10 healthy subjects. The polysomnography (PSG) recordings associated with each subject were visually scored by two human experts. Compared with existing sleep-related public datasets, ISRUC-Sleep provides data from a reasonable number of subjects with different characteristics, such as data useful for studies involving changes in the PSG signals over time, and data from healthy subjects useful for studies comparing healthy subjects with patients suffering from sleep disorders. This dataset was created to complement existing datasets by providing easy-to-apply data with some characteristics not yet covered. ISRUC-Sleep can be useful for new contributions: (i) in biomedical signal processing; (ii) in the development of ASSC methods; and (iii) in sleep physiology studies. To evaluate and compare new contributions that use this dataset as a benchmark, results of applying a subject-independent automatic sleep stage classification (ASSC) method to the ISRUC-Sleep dataset are presented. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
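Because every recording was scored by two experts, a natural first analysis on such a dataset is inter-rater agreement. Cohen's kappa is the standard statistic for this; the two tiny hypnogram fragments below are invented for illustration, not ISRUC-Sleep data.

```python
from collections import Counter

# Cohen's kappa between two scorers' sleep-stage labels.
def cohens_kappa(a, b):
    assert len(a) == len(b) and a
    n = len(a)
    # Observed agreement.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each scorer's label distribution.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n**2
    return (po - pe) / (1 - pe)

# Invented 8-epoch hypnogram fragments (AASM stage labels).
scorer1 = ["W", "N1", "N2", "N2", "N3", "REM", "N2", "W"]
scorer2 = ["W", "N2", "N2", "N2", "N3", "REM", "N2", "W"]
print(round(cohens_kappa(scorer1, scorer2), 3))
```

The same routine can compare an automatic classifier's output against either expert, which is how benchmark results on this dataset are typically reported.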
Connecting the Public to Scientific Research Data - Science On a Sphere®
NASA Astrophysics Data System (ADS)
Henderson, M. A.; Russell, E. L.; Science on a Sphere Datasets
2011-12-01
Maurice Henderson (NASA Goddard Space Flight Center) and Elizabeth Russell (NOAA Earth System Research Laboratory, University of Colorado Cooperative Institute for Research in Environmental Sciences). Science On a Sphere® is a six-foot animated globe developed by the National Oceanic and Atmospheric Administration (NOAA) as a means to display global scientific research data in an intuitive, engaging format in public forums. With over 70 permanent installations of SOS around the world in science museums, visitor centers and universities, the audience that enjoys SOS yearly is substantial, wide-ranging, and diverse. Through partnerships with the National Aeronautics and Space Administration (NASA), the SOS Data Catalog (http://sos.noaa.gov/datasets/) has grown to a collection of over 350 datasets from NOAA, NASA, and many others. Using an external projection system, these datasets are displayed onto the sphere, creating a seamless global image. In a cross-site evaluation of Science On a Sphere®, 82% of participants said that seeing information displayed on a sphere changed their understanding of the information. This unique technology captivates viewers and exposes them to scientific research data in a way that is accessible, presentable, and understandable. The datasets that comprise the SOS Data Catalog are scientific research data that have been formatted for display on SOS. By formatting research data into visualizations that can be used on SOS, NOAA and NASA are able to turn research data into educational materials that are easily accessible for users. In many cases, visualizations do not need to be modified because SOS uses a common map projection. The SOS Data Catalog has become a "one-stop shop" for a broad range of global datasets from across NOAA and NASA, and as a result, the traffic on the site is more than just SOS users.
While the target audience for this site is SOS users, many inquiries come from teachers, book editors, film producers and students interested in using the available datasets. The SOS Data Catalog online includes a written description of each dataset, rendered images of the data, animated movies of the data, links to more information, details on the data source and creator, and a link to a FTP server where each dataset can be downloaded. Many of the datasets are also displayed on the SOS YouTube Channel and Facebook page. In addition, NASA has developed NASA Earth Observations, NEO, which is a collection of global satellite datasets. The NEO website allows users to layer multiple datasets and perform basic analysis. Through a new iPad application, the NASA Earth Observations datasets can be exported to SOS and analyzed on the sphere. This new capability greatly expands the number of datasets that can be shown on SOS and adds a new element of interactivity with the datasets.
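The note that most visualizations need no reprojection follows from the use of one common projection. Assuming that projection is equirectangular (plate carrée), a frequent choice for global spherical displays, mapping a lat/lon point onto a frame is simple arithmetic; the frame size below is an arbitrary example.

```python
# Lat/lon to pixel mapping under an assumed equirectangular
# (plate carrée) projection; (0, 0) is the top-left pixel.
def latlon_to_pixel(lat, lon, width=2048, height=1024):
    """Map lat in [-90, 90], lon in [-180, 180] to (col, row)."""
    col = (lon + 180.0) / 360.0 * (width - 1)
    row = (90.0 - lat) / 180.0 * (height - 1)
    return round(col), round(row)

print(latlon_to_pixel(90, -180))  # top-left corner: (0, 0)
print(latlon_to_pixel(0, 0))      # image centre
```

In this projection a rendered global image wraps seamlessly in longitude, which is what allows a flat frame to tile a sphere without visible seams.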
The HydroServer Platform for Sharing Hydrologic Data
NASA Astrophysics Data System (ADS)
Tarboton, D. G.; Horsburgh, J. S.; Schreuders, K.; Maidment, D. R.; Zaslavsky, I.; Valentine, D. W.
2010-12-01
The CUAHSI Hydrologic Information System (HIS) is an internet-based system that supports sharing of hydrologic data. HIS consists of databases connected using the Internet through Web services, as well as software for data discovery, access, and publication. The HIS system architecture comprises servers for publishing and sharing data, a centralized catalog to support cross-server data discovery, and a desktop client to access and analyze data. This paper focuses on HydroServer, the component developed for sharing and publishing space-time hydrologic datasets. A HydroServer is a computer server that contains a collection of databases, web services, tools, and software applications that allow data producers to store, publish, and manage the data from an experimental watershed or project site. HydroServer is designed to permit publication of data as part of a distributed national/international system, while still locally managing access to the data. We describe the HydroServer architecture and software stack, including tools for managing and publishing time series data for fixed-point monitoring sites as well as spatially distributed GIS datasets that describe a particular study area, watershed, or region. HydroServer adopts a standards-based approach to data publication, relying on accepted and emerging standards for data storage and transfer. CUAHSI-developed HydroServer code is free, with community code development managed through the CodePlex open-source code repository and development system. There is some reliance on widely used commercial software for general-purpose and standard data publication capability. The sharing of data in a common format is one way to stimulate interdisciplinary research and collaboration.
It is anticipated that the growing, distributed network of HydroServers will facilitate cross-site comparisons and large-scale studies that synthesize information from diverse settings, making the network as a whole greater than the sum of its parts in advancing hydrologic research. Details of the CUAHSI HIS can be found at http://his.cuahsi.org, and the HydroServer CodePlex site at http://hydroserver.codeplex.com.
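At its core, the time-series side of HydroServer stores point observations as sites, variables, and data values. The sketch below is a drastically simplified, hypothetical miniature of that model; the real system uses the CUAHSI Observations Data Model tables and exposes the data through web services rather than a local database call.

```python
import sqlite3

# Hypothetical miniature of a point-observations store: sites,
# variables, and timestamped values. Not the real ODM schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sites (code TEXT PRIMARY KEY, name TEXT, lat REAL, lon REAL);
CREATE TABLE datavalues (site TEXT, variable TEXT, t TEXT, value REAL);
""")
db.execute("INSERT INTO sites VALUES ('LR_1', 'Logan River', 41.74, -111.80)")
db.executemany("INSERT INTO datavalues VALUES (?, ?, ?, ?)", [
    ("LR_1", "discharge_cms", "2010-06-01T00:00", 1.92),
    ("LR_1", "discharge_cms", "2010-06-01T00:30", 1.88),
])

def get_values(site, variable):
    """Local analogue of a get-values web service call (sketch only)."""
    rows = db.execute("SELECT t, value FROM datavalues "
                      "WHERE site = ? AND variable = ? ORDER BY t",
                      (site, variable))
    return list(rows)

print(get_values("LR_1", "discharge_cms"))
```

Separating site metadata from the value stream is what lets a central catalog harvest "what is measured where" without copying the observations themselves.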
The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.
Afgan, Enis; Baker, Dannon; Batut, Bérénice; van den Beek, Marius; Bouvier, Dave; Cech, Martin; Chilton, John; Clements, Dave; Coraor, Nate; Grüning, Björn A; Guerler, Aysam; Hillman-Jackson, Jennifer; Hiltemann, Saskia; Jalili, Vahid; Rasche, Helena; Soranzo, Nicola; Goecks, Jeremy; Taylor, James; Nekrutenko, Anton; Blankenberg, Daniel
2018-05-22
Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.
Efficient and Flexible Climate Analysis with Python in a Cloud-Based Distributed Computing Framework
NASA Astrophysics Data System (ADS)
Gannon, C.
2017-12-01
As climate models become progressively more advanced, and spatial resolution is further improved through various downscaling projects, climate projections at a local level are increasingly insightful and valuable. However, the raw size of climate datasets presents numerous hurdles for analysts wishing to develop customized climate risk metrics or perform site-specific statistical analysis. Four Twenty Seven, a climate risk consultancy, has implemented a Python-based distributed framework to analyze large climate datasets in the cloud. With the freedom afforded by efficiently processing these datasets, we are able to customize and continually develop new climate risk metrics using the most up-to-date data. Here we outline our process: using Python packages such as XArray and Dask to evaluate netCDF files in a distributed framework, StarCluster to operate in a cluster-computing environment, and cloud computing services to access publicly hosted datasets; we also show how this setup is particularly valuable for generating climate change indicators and performing localized statistical analysis.
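The pipeline described uses XArray and Dask over netCDF in the cloud; the pattern itself is chunk-map-combine. The stand-in below shows that pattern with only the standard library on a synthetic daily temperature series, computing a simple extreme-heat indicator; the threshold and data are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for chunked, parallel evaluation of a climate indicator
# (the real pipeline would use xarray + dask on netCDF files).
THRESHOLD = 35.0  # degrees C; an illustrative extreme-heat cutoff

def hot_days(chunk):
    """Partial result: days above the threshold within one chunk."""
    return sum(1 for t in chunk if t > THRESHOLD)

# Synthetic "dataset": 4 years of fake daily maximum temperatures.
series = [20 + 18 * ((i % 365) / 365.0) for i in range(4 * 365)]
chunks = [series[i:i + 365] for i in range(0, len(series), 365)]

# Map the indicator over chunks in parallel, then combine.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(hot_days, chunks))
print(total)  # 240
```

Because the per-chunk computation is independent, the same code scales from threads on one machine to a Dask cluster over cloud object storage with essentially the same structure.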
Applications of the LBA-ECO Metadata Warehouse
NASA Astrophysics Data System (ADS)
Wilcox, L.; Morrell, A.; Griffith, P. C.
2006-05-01
The LBA-ECO Project Office has developed a system to harvest and warehouse metadata resulting from the Large-Scale Biosphere Atmosphere Experiment in Amazonia. The harvested metadata is used to create dynamically generated reports, available at www.lbaeco.org, which facilitate access to LBA-ECO datasets. The reports are generated for specific controlled vocabulary terms (such as an investigation team or a geospatial region), and are cross-linked with one another via these terms. This approach creates a rich contextual framework enabling researchers to find datasets relevant to their research. It maximizes data discovery by association and provides a greater understanding of the scientific and social context of each dataset. For example, our website provides a profile (e.g. participants, abstract(s), study sites, and publications) for each LBA-ECO investigation. Linked from each profile is a list of associated registered dataset titles, each of which link to a dataset profile that describes the metadata in a user-friendly way. The dataset profiles are generated from the harvested metadata, and are cross-linked with associated reports via controlled vocabulary terms such as geospatial region. The region name appears on the dataset profile as a hyperlinked term. When researchers click on this link, they find a list of reports relevant to that region, including a list of dataset titles associated with that region. Each dataset title in this list is hyperlinked to its corresponding dataset profile. Moreover, each dataset profile contains hyperlinks to each associated data file at its home data repository and to publications that have used the dataset. We also use the harvested metadata in administrative applications to assist quality assurance efforts. 
These include processes to check for broken hyperlinks to data files, automated emails that inform our administrators when critical metadata fields are updated, dynamically generated reports of metadata records that link to datasets with questionable file formats, and dynamically generated region/site coordinate quality assurance reports. These applications are as important as those that facilitate access to information because they help ensure a high standard of quality for the information. This presentation will discuss reports currently in use, provide a technical overview of the system, and discuss plans to extend this system to harvest metadata resulting from the North American Carbon Program by drawing on datasets in many different formats, residing in many thematic data centers and also distributed among hundreds of investigators.
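The quality-assurance processes above amount to automated scans over harvested metadata records. The sketch below shows one such pass in miniature: flag structurally invalid data-file links and missing critical fields. The records are invented, and a real checker would also issue HTTP requests to detect links that parse but no longer resolve.

```python
from urllib.parse import urlparse

# Invented metadata records for illustration only.
records = [
    {"title": "CO2 flux, km 67 tower", "region": "Santarem",
     "data_url": "http://daac.ornl.gov/lba/ds1"},
    {"title": "Soil moisture transects", "region": "",
     "data_url": "not-a-url"},
]

def qa_report(records):
    """Flag records with malformed data links or missing regions."""
    problems = []
    for r in records:
        u = urlparse(r["data_url"])
        if u.scheme not in ("http", "https", "ftp") or not u.netloc:
            problems.append((r["title"], "bad data link"))
        if not r["region"]:
            problems.append((r["title"], "missing region"))
    return problems

print(qa_report(records))
```

Running such checks on every harvest is what keeps the dynamically generated reports trustworthy as the underlying repositories change.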
Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige
2017-01-01
The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once annotated in GXB, multiple sample groupings were made in order to create rank lists to allow easy data interpretation and comparison. The GXB tool also allows the users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616
Loftus, Stacie K
2018-05-01
The number of melanocyte- and melanoma-derived next-generation sequencing genome-scale datasets has rapidly expanded over the past several years. This resource guide provides a summary of publicly available sources of melanocyte-derived whole genome, exome, mRNA and miRNA transcriptome, chromatin accessibility and epigenetic datasets. Also highlighted are bioinformatic resources and tools for visualization and data queries which allow researchers a genome-scale view of the melanocyte. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.
Includes 1) list of genes in the STAT5b biomarker and 2) list of accession numbers for microarray datasets used in the study. This dataset is associated with the following publication: Oshida, K., N. Vasani, D. Waxman, and C. Corton. Disruption of STAT5b-Regulated Sexual Dimorphism of the Liver Transcriptome by Diverse Factors Is a Common Event. PLoS ONE. Public Library of Science, San Francisco, CA, USA, 11(3): NA, (2016).
Thomas, Kathryn A.; Fornwall, Mark D.; Weltzin, Jake F.; Griffis, R.B.
2014-01-01
Among the many effects of climate change is its influence on the phenology of biota. In marine and coastal ecosystems, phenological shifts have been documented for multiple life forms; however, biological data related to marine species' phenology remain difficult to access and are under-used. We conducted an assessment of potential sources of biological data for marine species and their availability for use in phenological analyses and assessments. Our evaluations showed that data potentially related to understanding marine species' phenology are available through online resources of governmental, academic, and non-governmental organizations, but appropriate datasets are often difficult to discover and access, presenting opportunities for scientific infrastructure improvement. The developing Federal Marine Data Architecture, when fully implemented, will improve data flow and standardization for marine data within major federal repositories and provide an archival repository for collaborating academic and public data contributors. Another opportunity, largely untapped, is the engagement of citizen scientists in standardized collection of marine phenology data and contribution of these data to established data flows. Use of metadata with marine-phenology-related keywords could improve discovery of and access to appropriate datasets. When data originators choose to self-publish, publication of research datasets with a digital object identifier, linked to metadata, will also improve subsequent discovery and access. Phenological changes in the marine environment will affect human economics, food systems, and recreation. No one source of data will be sufficient to understand these changes.
The collective attention of marine data collectors is needed—whether with an agency, an educational institution, or a citizen scientist group—toward adopting the data management processes and standards needed to ensure availability of sufficient and useable marine data to understand marine phenology.
Making Geoscience Data Relevant for Students, Teachers, and the Public
NASA Astrophysics Data System (ADS)
Taber, M.; Ledley, T. S.; Prakash, A.; Domenico, B.
2009-12-01
The scientific data collected by government-funded research belongs to the public. As such, the scientific and technical communities are responsible for making scientific data accessible and usable by the educational community. However, much geoscience data are difficult for educators and students to find and use. Such data are generally described by metadata that are narrowly focused and contain scientific language. Thus, data access presents a challenge to educators in determining whether a particular dataset is relevant to their needs, and in effectively accessing and using the data. The AccessData project (EAR-0623136, EAR-0305058) has developed a model for bridging the scientific and educational communities to develop robust inquiry-based activities using scientific datasets in the form of Earth Exploration Toolbook (EET, http://serc.carleton.edu/eet) chapters. EET chapters provide step-by-step instructions for accessing specific data and analyzing it with a software analysis tool to explore issues or concepts in science, technology, and mathematics. The AccessData model involves working directly with small teams made up of data providers from scientific data archives or research teams, data analysis tool specialists, scientists, curriculum developers, and educators (AccessData, http://serc.carleton.edu/usingdata/accessdata). The process involves a number of steps, including 1) building the team; 2) pre-workshop facilitation; 3) a face-to-face 2.5-day workshop; 4) post-workshop follow-up; and 5) completion and review of the EET chapter. The AccessData model has evolved over a series of six annual workshops hosting ~10 teams each. It has been extended to other venues to explore expanding its scope and developing sustainable mechanisms.
These venues include 1) workshops focused on the data collected by a large research program (RIDGE, EarthScope); 2) a workshop focused on developing a citizen scientist guide to conducting research; and 3) facilitating a team on an annual basis within the structure of the Federation of Earth Science Information Partners (ESIP Federation), leveraging their semi-annual meetings. In this presentation we will describe the AccessData model of making geoscience data accessible and usable in educational contexts from the perspective of both the organizers and from a team. We will also describe how this model has been adapted to other contexts to facilitate a broader reach of geoscience data.
Optimizing tertiary storage organization and access for spatio-temporal datasets
NASA Technical Reports Server (NTRS)
Chen, Ling Tony; Rotem, Doron; Shoshani, Arie; Drach, Bob; Louis, Steve; Keating, Meridith
1994-01-01
We address in this paper data management techniques for efficiently retrieving requested subsets of large datasets stored on mass storage devices. This problem represents a major bottleneck that can negate the benefits of fast networks, because the time to access a subset from a large dataset stored on a mass storage system is much greater than the time to transmit that subset over a network. This paper focuses on very large spatial and temporal datasets generated by simulation programs in the area of climate modeling, but the techniques developed can be applied to other applications that deal with large multidimensional datasets. The main requirement we have addressed is the efficient access of subsets of information contained within much larger datasets, for the purpose of analysis and interactive visualization. We have developed data partitioning techniques that partition datasets into 'clusters' based on analysis of data access patterns and storage device characteristics. The goal is to minimize the number of clusters read from mass storage systems when subsets are requested. We emphasize in this paper proposed enhancements to current storage server protocols to permit control over physical placement of data on storage devices. We also discuss in some detail the aspects of the interface between the application programs and the mass storage system, as well as a workbench to help scientists design the best reorganization of a dataset for anticipated access patterns.
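The clustering idea above can be illustrated with a small sketch (not the paper's actual algorithm; the grid shape and cluster size are invented for illustration): cells of a time/latitude/longitude grid map to fixed-size storage clusters, and a subset request is costed by the number of distinct clusters it would stage from mass storage.

```python
def cluster_id(t, lat, lon, shape=(4, 8, 8)):
    """Map a grid cell to the cluster (storage unit) that holds it."""
    return (t // shape[0], lat // shape[1], lon // shape[2])

def clusters_touched(t_range, lat_range, lon_range, shape=(4, 8, 8)):
    """Distinct clusters that must be read to satisfy a subset query."""
    return {
        cluster_id(t, la, lo, shape)
        for t in range(*t_range)
        for la in range(*lat_range)
        for lo in range(*lon_range)
    }

# A short temporal slice over a regional window touches few clusters
# when the partitioning matches the access pattern.
hit = clusters_touched((0, 4), (0, 8), (0, 16))
```

Choosing `shape` to match anticipated access patterns is exactly the optimization problem the abstract describes: a good partitioning keeps `len(hit)` small for common queries.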
NASA Astrophysics Data System (ADS)
Hsu, L.; Lehnert, K. A.; Carbotte, S. M.; Arko, R. A.; Ferrini, V.; O'hara, S. H.; Walker, J. D.
2012-12-01
The Integrated Earth Data Applications (IEDA) facility maintains multiple data systems with a wide range of solid earth data types from the marine, terrestrial, and polar environments. Examples of the different data types include syntheses of ultra-high resolution seafloor bathymetry collected on large collaborative cruises and analytical geochemistry measurements collected by single investigators in small, unique projects. These different data types have historically been channeled into separate, discipline-specific databases with search and retrieval tailored for the specific data type. However, a current major goal is to integrate data from different systems to allow interdisciplinary data discovery and scientific analysis. To increase discovery and access across these heterogeneous systems, IEDA employs several unique IDs, including sample IDs (International Geo Sample Number, IGSN), person IDs (GeoPass ID), funding award IDs (NSF Award Number), cruise IDs (from the Marine Geoscience Data System Expedition Metadata Catalog), dataset IDs (DOIs), and publication IDs (DOIs). These IDs allow linking of a sample registry (System for Earth SAmple Registration), data libraries and repositories (e.g. Geochemical Research Library, Marine Geoscience Data System), integrated synthesis databases (e.g. EarthChem Portal, PetDB), and investigator services (IEDA Data Compliance Tool). The linked systems allow efficient discovery of related data across different levels of granularity. In addition, IEDA data systems maintain links with several external data systems, including digital journal publishers. Links have been established between the EarthChem Portal and ScienceDirect through publication DOIs, returning sample-level objects and geochemical analyses for a particular publication. Linking IEDA-hosted data to digital publications with IGSNs at the sample level and with IEDA-allocated dataset DOIs is under development.
As an example, an individual investigator could sign up for a GeoPass account ID, write a proposal to NSF and create a data plan using the IEDA Data Management Plan Tool. Having received the grant, the investigator then collects rock samples on a scientific cruise from dredges and registers the samples with IGSNs. The investigator then performs analytical geochemistry on the samples, and submits the full dataset to the Geochemical Resource Library for a dataset DOI. Finally, the investigator writes an article that is published in Science Direct. Knowing any of the following IDs: Investigator GeoPass ID, NSF Award Number, Cruise ID, Sample IGSNs, dataset DOI, or publication DOI, a user would be able to navigate to all samples, datasets, and publications in IEDA and external systems. Use of persistent identifiers to link heterogeneous data systems in IEDA thus increases access, discovery, and proper citation of hard-earned investigator datasets.
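The ID-based linking in the example above can be sketched as a simple graph traversal. All identifiers below are invented for illustration; the point is only that any one persistent ID can reach every related record.

```python
# Hypothetical link graph: each record points at related records via
# persistent identifiers (GeoPass ID, award number, cruise ID, IGSN, DOI).
links = {
    "geopass:0001": ["award:NSF-123"],                    # investigator -> award
    "award:NSF-123": ["cruise:EX-42"],                    # award -> cruise
    "cruise:EX-42": ["igsn:ABC0001", "igsn:ABC0002"],     # cruise -> samples
    "igsn:ABC0001": ["doi:10.1234/data.1"],               # sample -> dataset DOI
    "igsn:ABC0002": ["doi:10.1234/data.1"],
    "doi:10.1234/data.1": ["doi:10.5678/paper.1"],        # dataset -> publication
}

def related(start):
    """All identifiers reachable from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(links.get(node, []))
    return seen
```

Starting from the investigator's GeoPass ID, the traversal reaches the award, cruise, samples, dataset, and publication, which is the navigation the abstract describes.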
MAAMD: a workflow to standardize meta-analyses and comparison of affymetrix microarray data
2014-01-01
Background Mandatory deposit of raw microarray data files for public access, prior to study publication, provides significant opportunities to conduct new bioinformatics analyses within and across multiple datasets. Analysis of raw microarray data files (e.g. Affymetrix CEL files) can be time consuming and complex, and requires fundamental computational and bioinformatics skills. The development of analytical workflows to automate these tasks simplifies the processing of, improves the efficiency of, and serves to standardize multiple and sequential analyses. Once installed, workflows facilitate the tedious steps required to run rapid intra- and inter-dataset comparisons. Results We developed a workflow to facilitate and standardize Meta-Analysis of Affymetrix Microarray Data analysis (MAAMD) in Kepler. Two freely available stand-alone software tools, R and AltAnalyze, were embedded in MAAMD. The inputs of MAAMD are user-editable csv files, which contain sample information and parameters describing the locations of input files and required tools. MAAMD was tested by analyzing 4 different GEO datasets from mice and Drosophila. MAAMD automates data downloading, data organization, data quality control assessment, differential gene expression analysis, clustering analysis, pathway visualization, gene-set enrichment analysis, and cross-species orthologous-gene comparisons. MAAMD was utilized to identify gene orthologues responding to hypoxia or hyperoxia in both mice and Drosophila. The entire set of analyses for 4 datasets (34 total microarrays) finished in approximately one hour. Conclusions MAAMD saves time, minimizes the required computer skills, and offers a standardized procedure for users to analyze microarray datasets and make new intra- and inter-dataset comparisons. PMID:24621103
PMLB: a large benchmark suite for machine learning evaluation and comparison.
Olson, Randal S; La Cava, William; Orzechowski, Patryk; Urbanowicz, Ryan J; Moore, Jason H
2017-01-01
The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly benchmark machine learning algorithms, and there are several gaps in benchmarking problems that still need to be considered. This work represents another important step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.
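The meta-feature comparison mentioned above can be illustrated with a minimal sketch; the dataset below is a toy stand-in, not an actual PMLB benchmark, and the feature set is a small invented subset of what such a resource might report.

```python
from collections import Counter

def meta_features(X, y):
    """A few simple meta-features used to characterise a labelled dataset."""
    counts = Counter(y)
    majority = max(counts.values()) / len(y)  # class-balance indicator
    return {
        "n_samples": len(X),
        "n_features": len(X[0]),
        "n_classes": len(counts),
        "majority_class_fraction": round(majority, 3),
    }

# Toy stand-in for a benchmark dataset: 4 samples, 2 features, 2 classes.
X = [[0.1, 1.0], [0.2, 0.9], [0.9, 0.1], [0.8, 0.2]]
y = [0, 0, 1, 1]
mf = meta_features(X, y)
```

Computing such features across an entire suite is what lets one ask whether the benchmarks collectively cover a diverse range of sizes, dimensionalities, and class balances.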
Collaborative Planetary GIS with JMARS
NASA Astrophysics Data System (ADS)
Dickenshied, S.; Christensen, P. R.; Edwards, C. S.; Prashad, L. C.; Anwar, S.; Engle, E.; Noss, D.; Jmars Development Team
2010-12-01
Traditional GIS tools have allowed users to work locally with their own datasets in their own computing environment. More recently, data providers have started offering online repositories of preprocessed data, which helps minimize the learning curve required to access new datasets. The ideal collaborative GIS tool provides the functionality of a traditional GIS and easy access to preprocessed data repositories while also enabling users to contribute data, analysis, and ideas back into the very tools they're using. JMARS (Java Mission-planning and Analysis for Remote Sensing) is a suite of geospatial applications developed by the Mars Space Flight Facility at Arizona State University. This software is used for mission planning and scientific data analysis by several NASA missions, including Mars Odyssey, Mars Reconnaissance Orbiter, and the Lunar Reconnaissance Orbiter. It is used by scientists, researchers and students of all ages from more than 40 countries around the world. In addition to offering a rich set of global and regional maps and publicly released orbiter images, the JMARS software development team has been working on ways to encourage the creation of collaborative datasets. Bringing together users from diverse teams and backgrounds allows new features to be developed with an interest in making the application useful and accessible to as wide a potential audience as possible. Actively engaging the scientific community in development strategy and hands-on tasks allows the creation of user-driven data content that would not otherwise be possible. The first community-generated dataset to result from this effort is a tool mapping peer-reviewed papers to the locations they relate to on Mars, with links to ancillary data. This allows users of JMARS to browse to an area of interest and then quickly locate papers corresponding to that area.
Alternatively, users can search for published papers over a specified time interval and visually see which areas of Mars have received the most attention over the requested time span.
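The paper-to-location mapping described above can be sketched as a point query against papers indexed by a footprint on Mars. Titles, years, and coordinates below are all made up for illustration.

```python
# Hypothetical index: (title, year, (lon_min, lon_max, lat_min, lat_max)).
papers = [
    ("Crater floor mineralogy", 2008, (70, 80, -10, 0)),
    ("Dune field dynamics",     2010, (72, 75, -8, -4)),
    ("Polar layered deposits",  2009, (0, 360, 80, 90)),
]

def papers_at(lon, lat, year_range=None):
    """Papers whose footprint contains the point, optionally filtered by year."""
    return [
        (title, year)
        for title, year, (x0, x1, y0, y1) in papers
        if x0 <= lon <= x1 and y0 <= lat <= y1
        and (year_range is None or year_range[0] <= year <= year_range[1])
    ]
```

A real implementation would use a spatial index rather than a linear scan, but the query shape, point plus optional time interval, matches what the abstract describes.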
NASA Astrophysics Data System (ADS)
Pascoe, Stephen; Cinquini, Luca; Lawrence, Bryan
2010-05-01
The Phase 5 Coupled Model Intercomparison Project (CMIP5) will produce a petabyte-scale archive of climate data relevant to future international assessments of climate science (e.g., the IPCC's 5th Assessment Report scheduled for publication in 2013). The infrastructure for the CMIP5 archive must meet many challenges to support this ambitious international project. We describe here the distributed software architecture being deployed worldwide to meet these challenges. The CMIP5 architecture extends the Earth System Grid (ESG) distributed architecture of Datanodes, providing data access and visualisation services, and Gateways, providing the user interface including registration, search and browse services. Additional features developed for CMIP5 include a publication workflow incorporating quality control and metadata submission, data replication, version control, update notification, and production of citable metadata records. Implementation of these features has been driven by the requirements of reliable global access to over 1 PB of data and consistent citability of data and metadata. Central to the implementation is the concept of Atomic Datasets that are identifiable through a Data Reference Syntax (DRS). Atomic Datasets are immutable to allow them to be replicated and tracked whilst maintaining data consistency. However, since occasional errors in data production and processing are inevitable, new versions can be published and users notified of these updates. As deprecated datasets may be the target of existing citations, they can remain visible in the system. Replication of Atomic Datasets is designed to improve regional access and provide fault tolerance. Several datanodes in the system are designated replicating nodes and hold replicas of a portion of the archive expected to be of broad interest to the community.
Gateways provide a system-wide interface to users where they can track the version history and location of replicas to select the most appropriate location for download. In addition to meeting the immediate needs of CMIP5 this architecture provides a basis for the Earth System Modeling e-infrastructure being further developed within the EU FP7 IS-ENES project.
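The Data Reference Syntax mentioned above can be sketched as a dot-separated identifier whose ordered components name the dataset. The component list follows the general shape of the CMIP5 DRS, but the example identifier is invented and real DRS validation against controlled vocabularies is far stricter.

```python
# Ordered component names, loosely following the CMIP5 DRS layout.
DRS_COMPONENTS = [
    "activity", "product", "institute", "model", "experiment",
    "frequency", "realm", "mip_table", "ensemble", "version",
]

def parse_drs(dataset_id):
    """Split a DRS-style identifier into its named components."""
    parts = dataset_id.split(".")
    if len(parts) != len(DRS_COMPONENTS):
        raise ValueError(f"expected {len(DRS_COMPONENTS)} components")
    return dict(zip(DRS_COMPONENTS, parts))

# Invented example identifier in the DRS shape.
rec = parse_drs(
    "cmip5.output1.MOHC.HadGEM2-ES.rcp45.mon.atmos.Amon.r1i1p1.v20120101"
)
```

Because Atomic Datasets are immutable, an update is published as a new `version` component rather than an in-place change, which is what lets replicas and citations stay consistent.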
Data Publication in the Meteorological Sciences: the OJIMS project
NASA Astrophysics Data System (ADS)
Callaghan, Sarah; Hewer, Fiona; Pepler, Sam; Hardaker, Paul; Gadian, Alan
2010-05-01
Historically speaking, scientific publication has mainly focussed on the analysis, interpretation and conclusions drawn from a given dataset, as this is the information that can be easily published in hard-copy text format with the aid of diagrams. Examining the raw data that forms the dataset is often difficult to do, as datasets are usually stored in digital media, in a variety of (often proprietary or non-standard) formats. This means that the peer-review process is generally applied only to the methodology and final conclusions of a piece of work, and not the underlying data itself. Yet for the conclusions to stand, the data must be of good quality, and the peer-review process must be used to judge the data quality. Data publication, involving the peer review of datasets, would be of benefit to many sectors of the academic community. For the data scientists, who often spend considerable time and effort ensuring that their data and metadata are complete, valid and stored in an accredited data repository, this would provide academic credit in the form of extra publications and citations. Data publication would benefit the wider community, allowing discovery and reuse of useful datasets, ensuring their curation and providing the best possible value for money. Overlay journals are a technology which is already being used to facilitate peer review and publication on-line. The Overlay Journal Infrastructure for Meteorological Sciences (OJIMS) Project aimed to develop the mechanisms that could support both a new (overlay) Journal of Meteorological Data and an Open-Access Repository for documents related to the meteorological sciences. The OJIMS project was conducted by a partnership between the UK's Royal Meteorological Society (RMetS) and two members of the National Centre for Atmospheric Science (NCAS), the British Atmospheric Data Centre (BADC) and the University of Leeds.
Conference delegates at the NCAS Conference in Bristol of 8-10 December 2008 were invited to complete a survey to assess the potential implications for the meteorological sciences should a data journal and an open access subject repository be created and operated. Supervised run-throughs of a demonstrator Journal of Meteorological Data were also carried out by seven volunteers at the conference. The feedback from the surveys and demonstrations became part of the reports and recommendations produced by the project. This included discussion of the benefits to data creators, the review process, branding, version control and citations. The project concluded that standard online journal technologies are suitable for the development and operation of a data journal as they allow the use of all the functions of journals without the need to engineer new solutions. The user surveys and interviews also showed that there is a significant desire in the meteorological sciences community for a data journal.
Merino-Sáinz, Izaskun; Anadón, Araceli; Torralba-Burrial, Antonio
2013-01-01
There are significant gaps in accessible knowledge about the distribution and phenology of Iberian harvestmen (Arachnida: Opiliones). Accessible harvestmen datasets from the Iberian Peninsula are largely lacking, and only two other datasets available in GBIF are composed exclusively of harvestmen records. Moreover, only a few harvestmen data from the Iberian Peninsula are available in the GBIF network (or in any network that allows public retrieval or use of these data). This paper describes the data associated with the Opiliones kept in the BOS Arthropod Collection of the University of Oviedo, Spain (hosted in the Department of Biología de Organismos y Sistemas), filling some of those gaps. The specimens were mainly collected from the northern third of the Iberian Peninsula. The earliest specimen deposited in the collection, dating back to the early 20th century, belongs to the P. Franganillo Collection. The dataset documents the collection of 16,455 specimens, preserved in 3,772 vials. Approximately 38% of the specimens belong to the family Sclerosomatidae, and 26% to Phalangiidae; six other families with fewer specimens are also included. Data quality control was incorporated at several steps of the digitisation process to facilitate reuse and improve accuracy. The complete dataset is also provided in Darwin Core Archive format, allowing public retrieval, use, and combination with other biological, biodiversity, or geographical datasets.
NASA Astrophysics Data System (ADS)
Santhana Vannan, S. K.; Ramachandran, R.; Deb, D.; Beaty, T.; Wright, D.
2017-12-01
This paper summarizes the workflow challenges of curating and publishing data produced from disparate data sources and provides a generalized workflow solution to efficiently archive data generated by researchers. The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) for biogeochemical dynamics and the Global Hydrology Resource Center (GHRC) DAAC have been collaborating on the development of a generalized workflow solution to efficiently manage the data publication process. The generalized workflow presented here is built on lessons learned from implementations of the workflow system. Data publication consists of the following steps: (1) accepting the data package from the data providers and ensuring the full integrity of the data files; (2) identifying and addressing data quality issues; (3) assembling standardized, detailed metadata and documentation, including file-level details, processing methodology, and characteristics of data files; (4) setting up data access mechanisms; (5) setting up the data in data tools and services for improved data dissemination and user experience; (6) registering the dataset in online search and discovery catalogues; and (7) preserving the data location through Digital Object Identifiers (DOIs). We will describe the steps taken to automate, and realize efficiencies in, the above process. The goals of the workflow system are to reduce the time taken to publish a dataset, to increase the quality of documentation and metadata, and to track individual datasets through the data curation process. Utilities developed to achieve these goals will be described. We will also share metrics-driven value of the workflow system and discuss future steps toward the creation of a common software framework.
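The publication steps above can be sketched as a minimal pipeline. This is an illustrative sketch, not the DAACs' actual software: the step names mirror the abstract's list, the checksum comparison stands in for "ensuring full integrity", and the DOI is a placeholder value.

```python
import hashlib

def checksum(payload: bytes) -> str:
    """SHA-256 digest used as a simple file-integrity check."""
    return hashlib.sha256(payload).hexdigest()

def publish(package):
    """Run a package through simplified curation steps, recording each one."""
    log = []
    # Step 1: accept the package and verify the provider-supplied checksum.
    ok = checksum(package["data"]) == package["sha256"]
    log.append(("integrity", ok))
    if not ok:
        return log  # reject early; nothing downstream should run
    # Steps 2-3: quality screening and metadata assembly (placeholders).
    log.append(("quality", True))
    log.append(("metadata", {"files": 1, "method": package["method"]}))
    # Steps 4-7 collapse here into minting a placeholder identifier.
    log.append(("doi", "10.0000/example-doi"))
    return log

data = b"temperature,10.5\n"
pkg = {"data": data, "sha256": checksum(data), "method": "field survey"}
steps = publish(pkg)
```

Recording each step in a log is one simple way to get the per-dataset tracking through the curation process that the abstract names as a goal.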
NABIC: A New Access Portal to Search, Visualize, and Share Agricultural Genomics Data.
Seol, Young-Joo; Lee, Tae-Ho; Park, Dong-Suk; Kim, Chang-Kug
2016-01-01
The National Agricultural Biotechnology Information Center developed an access portal to search, visualize, and share agricultural genomics data with a focus on South Korean information and resources. The portal features an agricultural biotechnology database containing a wide range of omics data from public and proprietary sources. We collected 28.4 TB of data from 162 agricultural organisms, with 10 types of omics data comprising next-generation sequencing sequence read archive, genome, gene, nucleotide, DNA chip, expressed sequence tag, interactome, protein structure, molecular marker, and single-nucleotide polymorphism datasets. Our genomic resources contain information on five animals, seven plants, and one fungus, which is accessed through a genome browser. We also developed a data submission and analysis system as a web service, with easy-to-use functions and cutting-edge algorithms, including those for handling next-generation sequencing data.
Public Data Archiving in Ecology and Evolution: How Well Are We Doing?
Roche, Dominique G.; Kruuk, Loeske E. B.; Lanfear, Robert; Binning, Sandra A.
2015-01-01
Policies that mandate public data archiving (PDA) successfully increase accessibility to data underlying scientific publications. However, is the data quality sufficient to allow reuse and reanalysis? We surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong PDA policy. Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely prevented reuse. We suggest that cultural shifts facilitating clearer benefits to authors are necessary to achieve high-quality PDA and highlight key guidelines to help authors increase their data’s reuse potential and compliance with journal data policies. PMID:26556502
NASA Astrophysics Data System (ADS)
Blower, Jon; Lawrence, Bryan; Kershaw, Philip; Nagni, Maurizio
2014-05-01
The research process can be thought of as an iterative activity, initiated based on prior domain knowledge as well as on a number of external inputs, and producing a range of outputs including datasets, studies and peer-reviewed publications. These outputs may describe the problem under study, the methodology used, the results obtained, etc. In any new publication, the author may cite or comment on other papers or datasets in order to support their research hypothesis. However, as their work progresses, the researcher may draw from many other latent channels of information. These could include, for example, a private conversation following a lecture or during a social dinner, or an opinion expressed concerning some significant event such as an earthquake or a satellite failure. In addition, public sources of grey literature are important, such as informal papers (for example, arXiv deposits), reports and studies. The climate science community is no exception to this pattern; the CHARMe project, funded under the European FP7 framework, is developing an online system for collecting and sharing user feedback on climate datasets. This is to help users judge how suitable such climate data are for an intended application. The user feedback could be comments about assessments, citations, or provenance of the dataset, or other information such as descriptions of uncertainty or data quality. We define this as a distinct category of metadata called Commentary or C-metadata. We link C-metadata with target climate datasets using a Linked Data approach via the Open Annotation data model. In the context of Linked Data, C-metadata plays the role of a resource which, depending on its nature, may be accessed as simple text or as more structured content. The project is implementing a range of software tools to create, search or visualize C-metadata, including a JavaScript plugin enabling this functionality to be integrated in situ with data provider portals.
Since commentary metadata may originate from a range of sources, moderation of this information will become a crucial issue. If the project is successful, expert human moderation (analogous to peer review) will become impracticable as annotation numbers increase, and some combination of algorithmic and crowd-sourced evaluation of commentary metadata will be necessary. To that end, future work will need to extend the tools under development to enable access control and checking of inputs, in order to deal with scale.
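A C-metadata record linking a comment to a dataset might look like the following sketch. The property names follow the Open Annotation vocabulary, but the target DOI and the comment text are invented, and a real CHARMe annotation would carry more structure.

```python
import json

# Sketch of an Open-Annotation-style commentary record: a textual body
# (the comment) attached to a target (the dataset being commented on).
annotation = {
    "@context": "http://www.w3.org/ns/oa.jsonld",
    "@type": "oa:Annotation",
    "oa:hasTarget": {"@id": "http://dx.doi.org/10.0000/example-dataset"},
    "oa:hasBody": {
        "@type": "cnt:ContentAsText",
        "cnt:chars": "Known cold bias at high latitudes before v2.",
    },
    "oa:motivatedBy": "oa:commenting",
}

doc = json.dumps(annotation, indent=2)
```

Because the annotation is itself a Linked Data resource, tools can aggregate, search, or moderate C-metadata independently of the dataset it targets.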
The Ever-Present Demand for Public Computing Resources. CDS Spotlight
ERIC Educational Resources Information Center
Pirani, Judith A.
2014-01-01
This Core Data Service (CDS) Spotlight focuses on public computing resources, including lab/cluster workstations in buildings, virtual lab/cluster workstations, kiosks, laptop and tablet checkout programs, and workstation access in unscheduled classrooms. The findings are derived from 758 CDS 2012 participating institutions. A dataset of 529…
Data publication and sharing using the SciDrive service
NASA Astrophysics Data System (ADS)
Mishin, Dmitry; Medvedev, D.; Szalay, A. S.; Plante, R. L.
2014-01-01
Despite progress in scientific data storage over recent years, there remains the problem of a public data storage and sharing system for relatively small scientific datasets. These are the collections forming the "long tail" of the power-law distribution of dataset sizes. The aggregated size of the long-tail data is comparable to the size of all data collections from large archives, and the value of the data is significant. The SciDrive project's main goal is to provide the scientific community with a place to reliably and freely store such data and provide access to it for the broad scientific community. The primary target audience of the project is the astronomy community, and it will be extended to other fields. We aim to create a simple way of publishing a dataset, which can then be shared with other people. The data owner controls the permissions to modify and access the data and can assign a group of users or open the access to everyone. The data contained in the dataset will be automatically recognized by a background process. Known data formats will be extracted according to the user's settings. Currently, tabular data can be automatically extracted to the user's MyDB table, where the user can make SQL queries to the dataset and merge it with other public CasJobs resources. Other data formats can be processed using a set of plugins that upload the data or metadata to user-defined side services. The current implementation targets some of the data formats commonly used by the astronomy communities, including FITS, ASCII and Excel tables, TIFF images, and YT simulation data archives. Along with generic metadata, format-specific metadata is also processed. For example, basic information about celestial objects is extracted from FITS files and TIFF images, if present. A 100 TB implementation has just been put into production at Johns Hopkins University.
The system features a public data storage REST service supporting the VOSpace 2.0 and Dropbox protocols, an HTML5 web portal, a command-line client, and a Java standalone client to synchronize a local folder with the remote storage. We use the VAO SSO (Single Sign-On) service from NCSA for user authentication, which provides free registration for everyone.
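The background format-recognition step can be sketched as a plugin registry keyed by file extension. This is a hypothetical illustration; the plugin names, return values, and dispatch rule are all assumptions, not SciDrive's actual interface.

```python
# Hypothetical per-format plugins, each extracting what it understands.
def fits_plugin(name):
    return {"format": "FITS", "action": "extract object coordinates"}

def table_plugin(name):
    return {"format": "table", "action": "load into MyDB for SQL queries"}

PLUGINS = {
    ".fits": fits_plugin,
    ".csv": table_plugin,
    ".xls": table_plugin,
}

def recognize(filename):
    """Dispatch an uploaded file to the first plugin matching its extension."""
    for ext, plugin in PLUGINS.items():
        if filename.lower().endswith(ext):
            return plugin(filename)
    return {"format": "unknown", "action": "store as-is"}
```

A registry like this makes the pipeline extensible: supporting a new format means registering one more plugin, without touching the upload or storage services.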
A Web-Accessible Framework for Discovery, Visualization and Dissemination of Polar Data
NASA Astrophysics Data System (ADS)
Kirsch, P. J.; Breen, P.; Barnes, T. D.
2007-12-01
A web-accessible information framework, currently under development within the Physical Sciences Division of the British Antarctic Survey, is described. The datasets accessed are generally heterogeneous in nature, from fields including space physics, meteorology, atmospheric chemistry, ice physics, and oceanography. Many of these are returned in near real time over a 24/7 limited-bandwidth link from remote Antarctic stations and ships. The requirement is to provide various user groups, each with disparate interests and demands, a system incorporating a browsable and searchable catalogue; bespoke data summary visualization; metadata access facilities; and download utilities. The system allows timely access to raw and processed datasets through an easily navigable discovery interface. Once discovered, a summary of the dataset can be visualized in a manner prescribed by the particular projects and user communities, or the dataset may be downloaded, subject to any accessibility restrictions that exist. In addition, access to related ancillary information, including software, documentation, related URLs, and information concerning non-electronic media (of particular relevance to some legacy datasets), is made directly available, having automatically been associated with a dataset during the discovery phase. Major components of the framework include the relational database containing the catalogue; the organizational structure of the systems holding the data, which enables automatic updates of the system catalogue and real-time access to data; the user interface design; and administrative and data management scripts allowing straightforward incorporation of utilities, datasets, and system maintenance.
EPOS Multi-Scale Laboratory platform: a long-term reference tool for experimental Earth Sciences
NASA Astrophysics Data System (ADS)
Trippanera, Daniele; Tesei, Telemaco; Funiciello, Francesca; Sagnotti, Leonardo; Scarlato, Piergiorgio; Rosenau, Matthias; Elger, Kirsten; Ulbricht, Damian; Lange, Otto; Calignano, Elisa; Spiers, Chris; Drury, Martin; Willingshofer, Ernst; Winkler, Aldo
2017-04-01
As scientific research progresses, a large number of datasets has been and will be produced. Data access and sharing, along with storage and homogenization within a unique and coherent framework, is a new challenge for the whole scientific community. This is particularly true for geo-scientific laboratories, which encompass the most diverse Earth Science disciplines and types of data. To this aim, the "Multi-scale Laboratories" Work Package (WP16), operating in the framework of the European Plate Observing System (EPOS), is developing a virtual platform of geo-scientific data and services for the worldwide laboratory community. This long-term project aims at merging top-class multidisciplinary laboratories in Geoscience into a coherent and collaborative network, facilitating the standardization of virtual access to data, data products and software. This will help our community evolve beyond the stage in which most of the data produced by the different laboratories are available only within the related scholarly publications (often as print versions only) or remain unpublished and inaccessible on local devices. The EPOS Multi-scale Laboratory platform will make it possible to easily share and discover data by means of open-access, DOI-referenced, online data publication, including long-term storage, management and curation services, and to build a cohesive community of laboratories. WP16 is starting with three pilot laboratory communities: (1) rock physics, (2) palaeomagnetism, and (3) analogue modelling. As a proof of concept, the first analogue modelling datasets have been published via GFZ Data Services (http://doidb.wdc-terra.org/search/public/ui?&sort=updated+desc&q=epos). The datasets include rock-analogue material properties (e.g. friction data, rheology data, SEM imagery), as well as supplementary figures, images and movies from experiments on tectonic processes.
A metadata catalogue tailored to the specific communities will link the growing number of datasets to a centralized EPOS hub. Acknowledging that the different labs vary in the maturity of their available data infrastructures, we have set up an architecture that provides different scenarios for participation. Thus, research groups without access to local repositories and catalogues for sustainable storage of data and metadata can rely on shared services within the Multi-scale Laboratories community. As an example of the usage of data retrieved through the community, an experimentalist deciding which material is suitable for an experimental setup can get "virtual lab access" to retrieve information about material parameters with minimal effort, and may then decide to visit a specific laboratory equipped with the instruments needed. The currently participating and collaborating laboratories (Utrecht University, GFZ, Roma Tre University, INGV, NERC, CSIC-ICTJA, CNRS, LMU, UBI, ETH, CNR) warmly welcome everyone interested in contributing to the development of this project.
NASA Earth Observations (NEO): Data Imagery for Education and Visualization
NASA Astrophysics Data System (ADS)
Ward, K.
2008-12-01
NASA Earth Observations (NEO) has dramatically simplified public access to georeferenced imagery of NASA remote sensing data. NEO targets the non-traditional data users who are currently underserved by the functionality and formats available from the existing data ordering systems. These users include formal and informal educators, museum and science center personnel, professional communicators, and citizen scientists. NEO currently serves imagery from 45 different datasets with daily, weekly, and/or monthly temporal resolutions, and more datasets are under development. The imagery from these datasets is produced in coordination with several data partners who are affiliated either with the instrument science teams or with the respective data processing centers. NEO is a system of three components -- website, WMS (Web Mapping Service), and ftp archive -- which together are able to meet the wide-ranging needs of our users. Some of these needs include the ability to: view and manipulate imagery using the NEO website -- e.g., applying color palettes, resizing, exporting to a variety of formats including PNG, JPEG, KMZ (Google Earth), GeoTIFF; access the NEO collection via a standards-based API (WMS); and create customized exports for select users (ftp archive) such as Science on a Sphere, NASA's Earth Observatory, and others.
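The standards-based WMS access mentioned above follows the OGC WMS conventions: a GetMap call is simply an HTTP GET with well-known query parameters. A sketch of building such a request (the endpoint URL is illustrative, not NEO's actual service address, and real layer names come from the service's GetCapabilities response):

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=1024, height=512,
                   fmt="image/png", crs="EPSG:4326"):
    """Build an OGC WMS 1.3.0 GetMap request URL."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": crs,
        # WMS 1.3.0 with EPSG:4326 orders the bbox min_lat,min_lon,max_lat,max_lon.
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return base_url + "?" + urlencode(params)

# Hypothetical layer identifier, for illustration only.
url = wms_getmap_url("https://example.org/wms", "MOD_LSTD_M",
                     bbox=(-90, -180, 90, 180))
```

Fetching that URL would return a rendered map image, which is what lets thin clients such as web pages and classroom tools use the archive without any data-format knowledge.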
Becnel, Lauren B; Darlington, Yolanda F; Ochsner, Scott A; Easton-Marks, Jeremy R; Watkins, Christopher M; McOwiti, Apollo; Kankanamge, Wasula H; Wise, Michael W; DeHart, Michael; Margolis, Ronald N; McKenna, Neil J
2015-01-01
Signaling pathways involving nuclear receptors (NRs), their ligands and coregulators, regulate tissue-specific transcriptomes in diverse processes, including development, metabolism, reproduction, the immune response and neuronal function, as well as in their associated pathologies. The Nuclear Receptor Signaling Atlas (NURSA) is a Consortium focused around a Hub website (www.nursa.org) that annotates and integrates diverse 'omics datasets originating from the published literature and NURSA-funded Data Source Projects (NDSPs). These datasets are then exposed to the scientific community on an Open Access basis through user-friendly data browsing and search interfaces. Here, we describe the redesign of the Hub, version 3.0, to deploy "Web 2.0" technologies and add richer, more diverse content. The Molecule Pages, which aggregate information relevant to NR signaling pathways from myriad external databases, have been enhanced to include resources for basic scientists, such as post-translational modification sites and targeting miRNAs, and for clinicians, such as clinical trials. A portal to NURSA's Open Access, PubMed-indexed journal Nuclear Receptor Signaling has been added to facilitate manuscript submissions. Datasets and information on reagents generated by NDSPs are available, as is information concerning periodic new NDSP funding solicitations. Finally, the new website integrates the Transcriptomine analysis tool, which allows for mining of millions of richly annotated public transcriptomic data points in the field, providing an environment for dataset re-use and citation, bench data validation and hypothesis generation. We anticipate that this new release of the NURSA database will have tangible, long term benefits for both basic and clinical research in this field.
New Tools to Search for Data in the European Space Agency's Planetary Science Archive
NASA Astrophysics Data System (ADS)
Grotheer, E.; Macfarlane, A. J.; Rios, C.; Arviset, C.; Heather, D.; Fraga, D.; Vallejo, F.; De Marchi, G.; Barbarisi, I.; Saiz, J.; Barthelemy, M.; Docasal, R.; Martinez, S.; Besse, S.; Lim, T.
2016-12-01
The European Space Agency's (ESA) Planetary Science Archive (PSA), which can be accessed at http://archives.esac.esa.int/psa, provides public access to the archived data of Europe's missions to our neighboring planets. These datasets are compliant with the Planetary Data System (PDS) standards. Recently, a new interface has been released, which includes upgrades to make PDS4 data available from newer missions such as ExoMars and BepiColombo. Additionally, the PSA development team has been working to ensure that the legacy PDS3 data will be more easily accessible via the new interface as well. In addition to a new querying interface, the new PSA also allows access via the EPN-TAP and PDAP protocols. This makes the PSA data sets compatible with other archive-related tools and projects, such as the Virtual European Solar and Planetary Access (VESPA) project for creating a virtual observatory.
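EPN-TAP access as described above rides on the standard IVOA TAP protocol: a synchronous query is an HTTP request against the service's `/sync` endpoint carrying an ADQL statement. A sketch (the table name `epn_core` follows the EPN-TAP convention, but the service URL here is illustrative):

```python
from urllib.parse import urlencode

def tap_sync_url(service_url, adql, fmt="votable"):
    """Build a synchronous IVOA TAP query URL (LANG=ADQL)."""
    params = {
        "REQUEST": "doQuery",
        "LANG": "ADQL",
        "FORMAT": fmt,
        "QUERY": adql,
    }
    return service_url.rstrip("/") + "/sync?" + urlencode(params)

# EPN-TAP services expose one standard table, epn_core, per data service.
query = "SELECT TOP 10 granule_uid, dataproduct_type FROM epn_core"
url = tap_sync_url("https://example.org/tap", query)
```

Because the protocol is shared, the same client code can query any EPN-TAP-compliant archive, which is precisely what makes the PSA data reachable from VESPA tools.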
Improving the discoverability, accessibility, and citability of omics datasets: a case report.
Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J
2017-03-01
Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Merino-Sáinz, Izaskun; Anadón, Araceli; Torralba-Burrial, Antonio
2013-01-01
There are significant gaps in accessible knowledge about the distribution and phenology of Iberian harvestmen (Arachnida: Opiliones). Accessible harvestman datasets for the Iberian Peninsula were previously lacking, and only two other datasets available in GBIF are composed exclusively of harvestman records. Moreover, only a few harvestman data from the Iberian Peninsula are available in the GBIF network (or in any network that allows public retrieval or use of these data). This paper describes the data associated with the Opiliones kept in the BOS Arthropod Collection of the University of Oviedo, Spain (hosted in the Department of Biología de Organismos y Sistemas), filling some of those gaps. The specimens were mainly collected from the northern third of the Iberian Peninsula. The earliest specimen deposited in the collection, dating back to the early 20th century, belongs to the P. Franganillo Collection. The dataset documents the collection of 16,455 specimens, preserved in 3,772 vials. Approximately 38% of the specimens belong to the family Sclerosomatidae, and 26% to Phalangiidae; six other families with fewer specimens are also included. Data quality control was incorporated at several steps of the digitisation process to facilitate reuse and improve accuracy. The complete dataset is also provided in Darwin Core Archive format, allowing public retrieval, use and combination with other biological, biodiversity or geographical datasets. PMID:24146596
NASA Astrophysics Data System (ADS)
Domenico, B.; Weber, J.
2012-04-01
For some years now, the authors have developed examples of online documents that allowed the reader to interact directly with datasets, but limitations restricted the interaction to specific desktop analysis and display tools that were not generally available to all readers of the documents. Recent advances in web service technology and related standards are making it possible to develop systems for publishing online documents that enable readers to access, analyze, and display the data discussed in the publication from the perspective, and in the manner, the author intends. By clicking on embedded links, the reader accesses not only the usual textual information in a publication, but also data residing on a local or remote web server, as well as a set of processing tools for analyzing and displaying the data. With the option of having the analysis and display processing performed on the server (or in the cloud), there is now a broader set of possibilities on the client side, where the reader can interact with the data via a thin web client, a rich desktop application, or a mobile platform "app." The presentation will outline the architecture of data-interactive publications along with illustrative examples.
NASA Astrophysics Data System (ADS)
Wyborn, Lesley; Car, Nicholas; Evans, Benjamin; Klump, Jens
2016-04-01
Persistent identifiers in the form of a Digital Object Identifier (DOI) are becoming more mainstream, assigned at both the collection and dataset level. For static datasets, this is a relatively straightforward matter. However, many new data collections are dynamic, with new data being appended, models and derivative products being revised with new data, or the data itself revised as processing methods are improved. Further, because data collections are becoming accessible as services, researchers can log in and dynamically create user-defined subsets for specific research projects; they can also easily mix and match data from multiple collections, each of which can have a complex history. Inevitably, extracts from such dynamic datasets underpin scholarly publications, and this presents new challenges. The National Computational Infrastructure (NCI) has been experiencing these issues and making progress towards addressing them. The NCI is a large node of the Research Data Services initiative (RDS) of the Australian Government's research infrastructure, which currently makes available over 10 PBytes of priority research collections, ranging from geosciences, geophysics, environment, and climate, through to astronomy, bioinformatics, and social sciences. Data are replicated to, or produced at, NCI and then processed there into higher-level data products or analysed directly. Individual datasets range from multi-petabyte computational models and large-volume raster arrays down to gigabyte-size, ultra-high-resolution datasets. To facilitate access, maximise reuse and enable integration across disciplines, datasets have been organized on a platform called the National Environmental Research Data Interoperability Platform (NERDIP). Combined, the NERDIP data collections form a rich and diverse asset for researchers: their co-location and standardization optimise the value of existing data and form a new resource to underpin data-intensive science.
New publication procedures require that a persistent identifier (DOI) be provided for the dataset that underpins a publication. Producing these for data extracts from the NCI data node using DOIs alone is proving difficult: preserving a copy of each data extract is not possible due to data scale. One proposal is for researchers to use workflows that capture the provenance of each data extraction, including metadata (e.g., the version of the dataset used, the query, and the time of extraction). In parallel, NCI is now working with the NERDIP dataset providers to ensure that the provenance of data publication is also captured in provenance systems, including references to previous versions and a history of data appended or modified. This proposed solution would require an enhancement to scholarly publication procedures whereby the reference to the dataset underlying a publication would be the persistent identifier of the provenance workflow that created the data extract. In turn, the provenance workflow would itself link to a series of persistent identifiers that, at a minimum, provide complete dataset production transparency and, if required, would facilitate reconstruction of the dataset. Such a solution will require strict adherence to design patterns for provenance representation, to ensure that the provenance representation of the workflow does indeed contain the information required to deliver dataset generation transparency and a pathway to reconstruction.
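The workflow-level identifier proposed above could be backed by a simple provenance record capturing exactly what the text lists: the dataset version, the query, and the time of extraction, plus a checksum of the extract itself. A minimal sketch; the field names are assumptions for illustration, not an NCI or NERDIP schema:

```python
import hashlib
from datetime import datetime, timezone

def extraction_record(dataset_id, dataset_version, query, subset_bytes):
    """Capture the provenance of a data extract so it can be cited and,
    in principle, reconstructed from the versioned source dataset."""
    return {
        "dataset_id": dataset_id,            # persistent ID of the source collection
        "dataset_version": dataset_version,  # which revision the extract was cut from
        "query": query,                      # the subsetting request, verbatim
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        # Checksum lets a later reader verify that a reconstruction matches.
        "subset_sha256": hashlib.sha256(subset_bytes).hexdigest(),
    }

rec = extraction_record("doi:10.1234/example", "v2.1",
                        "time >= 2000-01 AND region = 'AU'", b"...data...")
```

A record like this is tiny compared to the extract it describes, which is the whole point: the publication cites the record, not a preserved copy of the subset.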
CrossCheck: an open-source web tool for high-throughput screen data analysis.
Najafov, Jamil; Najafov, Ayaz
2017-07-19
Modern high-throughput screening methods allow researchers to generate large datasets that potentially contain important biological information. However, picking relevant hits from such screens and generating testable hypotheses often requires training in bioinformatics and the skills to efficiently perform database mining. There are currently no tools available to the general public that allow users to cross-reference their screen datasets with published screen datasets. To this end, we developed CrossCheck, an online platform for high-throughput screen data analysis. CrossCheck is a centralized database that allows effortless comparison of a user-entered list of gene symbols with 16,231 published datasets. These datasets include published data from genome-wide RNAi and CRISPR screens, interactome proteomics and phosphoproteomics screens, cancer mutation databases, low-throughput studies of major cell signaling mediators, such as kinases, E3 ubiquitin ligases and phosphatases, and gene ontological information. Moreover, CrossCheck includes a novel database of predicted protein kinase substrates, which was developed using proteome-wide consensus motif searches. CrossCheck dramatically simplifies high-throughput screen data analysis and enables researchers to dig deep into the published literature and streamline data-driven hypothesis generation. CrossCheck is freely accessible as a web-based application at http://proteinguru.com/crosscheck.
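At its core, the comparison a tool like CrossCheck performs is set intersection between the user's gene list and each published dataset, ranked by overlap. A toy sketch of that idea (not CrossCheck's actual implementation; the dataset names below are invented):

```python
def cross_check(user_genes, published_datasets):
    """Rank published datasets by how many of the user's genes they contain.

    published_datasets: mapping of dataset name -> set of gene symbols.
    Returns (dataset, sorted hit list) pairs, most overlapping first.
    """
    user = {g.upper() for g in user_genes}  # normalize symbol case
    hits = []
    for name, genes in published_datasets.items():
        overlap = sorted(user & {g.upper() for g in genes})
        if overlap:
            hits.append((name, overlap))
    hits.sort(key=lambda pair: len(pair[1]), reverse=True)
    return hits

results = cross_check(
    ["tp53", "EGFR", "MYC"],
    {"RNAi screen A": {"TP53", "EGFR"}, "CRISPR screen B": {"MYC"}},
)
```

The real service adds the hard parts this sketch omits: curating 16,231 datasets into comparable gene-symbol sets in the first place.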
Virtual Hubs for facilitating access to Open Data
NASA Astrophysics Data System (ADS)
Mazzetti, Paolo; Latre, Miguel Á.; Ernst, Julia; Brumana, Raffaella; Brauman, Stefan; Nativi, Stefano
2015-04-01
In October 2014, the ENERGIC-OD (European NEtwork for Redistributing Geospatial Information to user Communities - Open Data) project, funded by the European Union under the Competitiveness and Innovation framework Programme (CIP), started. In response to the EU call, the general objective of the project is to "facilitate the use of open (freely available) geographic data from different sources for the creation of innovative applications and services through the creation of Virtual Hubs". In ENERGIC-OD, Virtual Hubs are conceived as information systems supporting the full life cycle of Open Data: publishing, discovery and access. They facilitate the use of Open Data by lowering, and possibly removing, the main barriers that hamper geo-information (GI) usage by end-users and application developers. The heterogeneity of data and data services is recognized as one of the major barriers to Open Data (re-)use: it forces end-users and developers to spend considerable effort accessing different infrastructures and harmonizing datasets. Such heterogeneity cannot be completely removed through the adoption of standard specifications for service interfaces, metadata and data models, since different infrastructures adopt different standards to meet specific challenges and address specific use-cases. Thus, beyond a certain extent, heterogeneity is irreducible, especially in interdisciplinary contexts. ENERGIC-OD Virtual Hubs address heterogeneity by adopting a mediation and brokering approach: specific components (brokers) are dedicated to harmonizing service interfaces, metadata and data models, enabling seamless discovery of, and access to, heterogeneous infrastructures and datasets. As an innovation project, ENERGIC-OD will integrate several existing technologies to implement Virtual Hubs as single points of access to geospatial datasets provided by new or existing platforms and infrastructures, including INSPIRE-compliant systems and Copernicus services.
ENERGIC-OD will deploy a set of five Virtual Hubs (VHs) at the national level in France, Germany, Italy, Poland and Spain, and an additional one at the European level. VHs will be provided according to the cloud Software-as-a-Service model. The main expected impact of the VHs is the creation of new business opportunities by opening up access to Research Data and Public Sector Information. Therefore, ENERGIC-OD addresses not only end-users, who will have the opportunity to access a VH through a geo-portal, but also application developers, who will be able to access VH functionalities through simple Application Programming Interfaces (APIs). The ENERGIC-OD Consortium will develop ten different applications on top of the deployed VHs. These aim to demonstrate how VHs facilitate the development of new and multidisciplinary applications based on the full exploitation of (open) GI, hence stimulating innovation and business activities.
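The brokering approach described above can be pictured as a set of adapters that translate each provider's native metadata into one common record, so a client queries a single model regardless of the source. A minimal sketch of that pattern; the provider and field names are invented for illustration:

```python
def broker_discover(keyword, providers):
    """Query heterogeneous providers and harmonize results into one model.

    providers: list of (search_fn, adapter_fn) pairs, where search_fn
    returns provider-native records and adapter_fn maps each record
    into the common {'title', 'url'} model.
    """
    results = []
    for search, adapt in providers:
        for native in search(keyword):
            results.append(adapt(native))
    return results

# Two hypothetical providers with different native schemas.
inspire = lambda kw: [{"md_title": f"{kw} map", "md_link": "http://a"}]
copernicus = lambda kw: [{"name": f"{kw} scenes", "uri": "http://b"}]

records = broker_discover("land-cover", [
    (inspire, lambda r: {"title": r["md_title"], "url": r["md_link"]}),
    (copernicus, lambda r: {"title": r["name"], "url": r["uri"]}),
])
```

Adding a new infrastructure then means writing one adapter, while every existing client keeps working unchanged; that is the sense in which the broker lowers the heterogeneity barrier.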
NABIC: A New Access Portal to Search, Visualize, and Share Agricultural Genomics Data
Seol, Young-Joo; Lee, Tae-Ho; Park, Dong-Suk; Kim, Chang-Kug
2016-01-01
The National Agricultural Biotechnology Information Center developed an access portal to search, visualize, and share agricultural genomics data with a focus on South Korean information and resources. The portal features an agricultural biotechnology database containing a wide range of omics data from public and proprietary sources. We collected 28.4 TB of data from 162 agricultural organisms, with 10 types of omics data comprising next-generation sequencing sequence read archive, genome, gene, nucleotide, DNA chip, expressed sequence tag, interactome, protein structure, molecular marker, and single-nucleotide polymorphism datasets. Our genomic resources contain information on five animals, seven plants, and one fungus, accessed through a genome browser. We also developed a data submission and analysis system as a web service, with easy-to-use functions and cutting-edge algorithms, including those for handling next-generation sequencing data. PMID:26848255
Recommendations for a service framework to access astronomical archives
NASA Technical Reports Server (NTRS)
Travisano, J. J.; Pollizzi, J.
1992-01-01
There are a large number of astronomical archives and catalogs on-line for network access, with many different user interfaces and features. Some systems are moving towards distributed access, supplying users with client software for their home sites that connects to servers at the archive site. Many of the issues involved in defining a standard framework of services that archive/catalog suppliers can use to achieve a basic level of interoperability are described. Such a framework would simplify the development of client and server programs to access the wide variety of astronomical archive systems. The primary services supplied by current systems include catalog browsing, dataset retrieval, name resolution, and data analysis. The following issues (and probably more) need to be considered in establishing a standard set of client/server interfaces and protocols: Archive Access - dataset retrieval, delivery, file formats, data browsing, analysis, etc.; Catalog Access - database management systems, query languages, data formats, synchronous/asynchronous modes of operation, etc.; Interoperability - transaction/message protocols, distributed processing mechanisms (DCE, ONC/SunRPC, etc.), networking protocols, etc.; Security - user registration, authorization/authentication mechanisms, etc.; Service Directory - service registration, lookup, port/task mapping, parameters, etc.; Software - public vs. proprietary, client/server software, standard interfaces to client/server functions, software distribution, operating system portability, data portability, etc. Several archive/catalog groups, notably the Astrophysics Data System (ADS), are already working in many of these areas. In the process of developing StarView, the user interface to the Space Telescope Data Archive and Distribution Service (ST-DADS), these issues and the work of others were analyzed.
A framework of standard interfaces for accessing services on any archive system, which would benefit archive users and suppliers alike, is proposed.
OpenNEX, a private-public partnership in support of the national climate assessment
NASA Astrophysics Data System (ADS)
Nemani, R. R.; Wang, W.; Michaelis, A.; Votava, P.; Ganguly, S.
2016-12-01
The NASA Earth Exchange (NEX) is a collaborative computing platform that has been developed with the objective of bringing scientists together with the software tools, massive global datasets, and supercomputing resources necessary to accelerate research in Earth systems science and global change. NEX is funded as an enabling tool for sustaining the National Climate Assessment. Over the past five years, researchers have used the NEX platform to produce a number of datasets highly relevant to the National Climate Assessment. These include high-resolution climate projections using different downscaling techniques and trends in historical climate from satellite data. To enable a broader community to exploit these datasets, the NEX team partnered with public cloud providers to create the OpenNEX platform. OpenNEX provides ready access to NEX data holdings on a number of public cloud platforms, along with pertinent analysis tools and workflows in the form of Machine Images and Docker containers, and lectures and tutorials by experts. We will showcase some of the applications of OpenNEX data and tools by the community on Amazon Web Services, Google Cloud and the NEX Sandbox.
A digital library for medical imaging activities
NASA Astrophysics Data System (ADS)
dos Santos, Marcelo; Furuie, Sérgio S.
2007-03-01
This work presents the development of an electronic infrastructure to make available a free, online, multipurpose and multimodality medical image database. The proposed infrastructure implements a distributed architecture for the medical image database, authoring tools, and a repository for multimedia documents. It also includes a peer-review model that assures the quality of the datasets. This public repository provides a single point of access to medical images and related information to facilitate retrieval tasks. The proposed approach has also been used as an electronic teaching system in Radiology.
Comparison of patients' experiences in public and private primary care clinics in Malta.
Pullicino, Glorianne; Sciortino, Philip; Calleja, Neville; Schäfer, Willemijn; Boerma, Wienke; Groenewegen, Peter
2015-06-01
Demographic changes, technological developments and rising expectations require the analysis of public-private primary care (PC) service provision to inform policy makers. We conducted a descriptive, cross-sectional study using the dataset of the Maltese arm of the QUALICOPC Project to compare PC patients' experiences of care provided by publicly funded and private (independent) general practitioners in Malta. Seven hundred patients from 70 clinics completed a self-administered questionnaire. Direct logistic regression showed that patients visiting the private sector experienced better continuity of care but more difficulty in accessing out-of-hours care. Such findings help to improve (primary) healthcare service provision and resource allocation. © The Author 2014. Published by Oxford University Press on behalf of the European Public Health Association. All rights reserved.
Goddard Atmospheric Composition Data Center: Aura Data and Services in One Place
NASA Technical Reports Server (NTRS)
Leptoukh, G.; Kempler, S.; Gerasimov, I.; Ahmad, S.; Johnson, J.
2005-01-01
The Goddard Atmospheric Composition Data and Information Services Center (AC-DISC) is a portal to an Atmospheric Composition-specific, user-driven, multi-sensor, on-line, easy-access archive and distribution system employing data analysis and visualization, data mining, and other user-requested techniques for better science data usage. It provides convenient access to Atmospheric Composition data and information from various remote-sensing missions, from TOMS, UARS, MODIS, and AIRS, to the most recent data from Aura OMI, MLS, and HIRDLS (once these datasets are released to the public), as well as Atmospheric Composition datasets residing at other remote archive sites.
RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes.
Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa
2017-08-29
Gene expression data are accumulating exponentially; thus, functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression patterns of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by gene name, various types of IDs, chromosomal regions in genetic maps, gene families based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/.
RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes
Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa
2017-01-01
Gene expression data are accumulating exponentially; thus, functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression patterns of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by gene name, various types of IDs, chromosomal regions in genetic maps, gene families based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/. PMID:28850115
Improving Existing EPO Efforts with Data Access through the National Virtual Observatory
NASA Astrophysics Data System (ADS)
Raddick, M. J.; Christian, C. A.; O'Mullane, W. J.
2005-05-01
The National Virtual Observatory (NVO) is developing tools to enable astronomy data to be shared seamlessly across the Internet. The goal of the NVO is to allow anyone on the Internet to access all astronomy data ever measured, with any instrument, at any wavelength. The NVO's research efforts focus on allowing scientists to access existing online data, adding value to each dataset by virtue of its connection to others. Similarly, the NVO's Education and Public Outreach (EPO) efforts focus on connecting existing projects with our seamless access to real, modern astronomy data from thousands of research projects. We hope that this connection will provide countless opportunities to expand and enhance existing EPO projects. Some of the projects currently working with the NVO are the CLEA labs at Gettysburg College, Project LITE at Boston University, and the Adler Planetarium. In this poster, I will describe the current EPO efforts that incorporate the NVO's data access tools, provide a tutorial for EPO developers with practical suggestions on how to incorporate NVO tools into existing projects, and give contact information for further help.
Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets.
Clark, Alex M; Dole, Krishna; Coulon-Spektor, Anna; McNutt, Andrew; Grass, George; Freundlich, Joel S; Reynolds, Robert C; Ekins, Sean
2015-06-22
Hundreds of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) models have been described in the literature in the past decade, yet they are more often than not inaccessible to anyone but their authors. Public accessibility is also an issue with computational models for bioactivity, and the ability to share such models still remains a major challenge limiting drug discovery. We describe the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps. We use this implementation to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity, and other physicochemical properties. We show that these models possess cross-validation receiver operator curve values comparable to those generated previously using alternative tools. We have now described how the implementation of Bayesian models with FCFP6 descriptors generated in the CDD Vault enables the rapid production of robust machine learning models from public data or the user's own datasets. The current study sets the stage for generating models in proprietary software (such as CDD) and exporting these models in a format that could be run in open source software using CDK components. This work also demonstrates that we can enable biocomputation across distributed private or public datasets to enhance drug discovery.
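The reference implementation described here is Java code in the CDK; as a rough, language-agnostic illustration of the underlying technique, the Laplacian-corrected naive Bayes scoring commonly used for such fingerprint-based activity models can be sketched as follows. The integer feature IDs below are toy stand-ins for hashed FCFP6 bits, not CDK output.

```python
from collections import Counter
from math import log

def train_laplacian_bayes(samples):
    """Laplacian-corrected naive Bayes over sparse binary fingerprints.

    `samples` is a list of (bits, active) pairs, where `bits` is the set
    of fingerprint features present in a molecule and `active` is a bool.
    Returns per-feature log weights; a molecule's score is the sum of
    the weights of its features."""
    n_total = len(samples)
    n_active = sum(1 for _, active in samples if active)
    base_rate = n_active / n_total
    total_with = Counter()
    active_with = Counter()
    for bits, active in samples:
        for b in bits:
            total_with[b] += 1
            if active:
                active_with[b] += 1
    return {
        b: log((active_with[b] + 1.0) / (total_with[b] * base_rate + 1.0))
        for b in total_with
    }

def score(weights, bits):
    """Sum the learned weights of the features present in a molecule."""
    return sum(weights.get(b, 0.0) for b in bits)

# Toy data: feature 7 co-occurs with activity, feature 3 with inactivity.
data = [({7, 1}, True), ({7, 2}, True), ({3, 1}, False), ({3, 2}, False)]
w = train_laplacian_bayes(data)
assert score(w, {7}) > 0 > score(w, {3})
```

The Laplacian correction keeps rarely seen features from dominating the score, which is what makes this estimator robust on sparse public screening data.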
Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets
2015-01-01
Hundreds of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) models have been described in the literature in the past decade, yet they are more often than not inaccessible to anyone but their authors. Public accessibility is also an issue with computational models for bioactivity, and the ability to share such models still remains a major challenge limiting drug discovery. We describe the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps. We use this implementation to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity, and other physicochemical properties. We show that these models possess cross-validation receiver operator curve values comparable to those generated previously using alternative tools. We have now described how the implementation of Bayesian models with FCFP6 descriptors generated in the CDD Vault enables the rapid production of robust machine learning models from public data or the user’s own datasets. The current study sets the stage for generating models in proprietary software (such as CDD) and exporting these models in a format that could be run in open source software using CDK components. This work also demonstrates that we can enable biocomputation across distributed private or public datasets to enhance drug discovery. PMID:25994950
Interoperable Solar Data and Metadata via LISIRD 3
NASA Astrophysics Data System (ADS)
Wilson, A.; Lindholm, D. M.; Pankratz, C. K.; Snow, M. A.; Woods, T. N.
2015-12-01
LISIRD 3 is a major upgrade of the LASP Interactive Solar Irradiance Data Center (LISIRD), which serves several dozen space-based solar irradiance and related data products to the public. Through interactive plots, LISIRD 3 provides data browsing supported by data subsetting and aggregation. Incorporating a semantically enabled metadata repository, LISIRD 3 users see current, vetted, consistent information about the datasets offered. Users can now also search for datasets based on metadata fields such as dataset type and/or spectral or temporal range. This semantic database enables metadata browsing, so users can discover the relationships between datasets, instruments, spacecraft, missions, and PIs. The database also enables creation and publication of metadata records in a variety of formats, such as SPASE or ISO, making these datasets more discoverable. The database also enables the possibility of a public SPARQL endpoint, making the metadata browsable in an automated fashion. LISIRD 3's data access middleware, LaTiS, provides dynamic, on-demand reformatting of data and timestamps, subsetting and aggregation, and other server-side functionality via a RESTful OPeNDAP-compliant API, enabling interoperability between LASP datasets and many common tools. LISIRD 3's templated front-end design, coupled with the uniform data interface offered by LaTiS, allows easy integration of new datasets. Consequently the number and variety of datasets offered by LISIRD have grown to encompass several dozen, with many more to come. This poster will discuss the design and implementation of LISIRD 3, including tools used, capabilities enabled, and issues encountered.
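LaTiS itself exposes these operations through a RESTful, OPeNDAP-compliant API; the sketch below only illustrates the kind of server-side subsetting and aggregation described, applied to a toy time series (the times and values are arbitrary integers, not real irradiance data and not LaTiS's actual interface).

```python
def subset(series, start, end):
    """Select (time, value) samples with start <= time < end."""
    return [(t, v) for t, v in series if start <= t < end]

def aggregate(series, bin_size):
    """Average values into fixed-width time bins (time in days)."""
    bins = {}
    for t, v in series:
        bins.setdefault(t // bin_size, []).append(v)
    return [(b * bin_size, sum(vs) / len(vs)) for b, vs in sorted(bins.items())]

daily = [(0, 10), (1, 14), (2, 12), (3, 8)]
assert subset(daily, 1, 3) == [(1, 14), (2, 12)]
assert aggregate(daily, 2) == [(0, 12.0), (2, 10.0)]
```

Performing these reductions on the server keeps large irradiance records from ever being shipped whole to a browser-based plot.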
Becnel, Lauren B.; Darlington, Yolanda F.; Ochsner, Scott A.; Easton-Marks, Jeremy R.; Watkins, Christopher M.; McOwiti, Apollo; Kankanamge, Wasula H.; Wise, Michael W.; DeHart, Michael; Margolis, Ronald N.; McKenna, Neil J.
2015-01-01
Signaling pathways involving nuclear receptors (NRs), their ligands and coregulators, regulate tissue-specific transcriptomes in diverse processes, including development, metabolism, reproduction, the immune response and neuronal function, as well as in their associated pathologies. The Nuclear Receptor Signaling Atlas (NURSA) is a Consortium focused around a Hub website (www.nursa.org) that annotates and integrates diverse ‘omics datasets originating from the published literature and NURSA-funded Data Source Projects (NDSPs). These datasets are then exposed to the scientific community on an Open Access basis through user-friendly data browsing and search interfaces. Here, we describe the redesign of the Hub, version 3.0, to deploy “Web 2.0” technologies and add richer, more diverse content. The Molecule Pages, which aggregate information relevant to NR signaling pathways from myriad external databases, have been enhanced to include resources for basic scientists, such as post-translational modification sites and targeting miRNAs, and for clinicians, such as clinical trials. A portal to NURSA’s Open Access, PubMed-indexed journal Nuclear Receptor Signaling has been added to facilitate manuscript submissions. Datasets and information on reagents generated by NDSPs are available, as is information concerning periodic new NDSP funding solicitations. Finally, the new website integrates the Transcriptomine analysis tool, which allows for mining of millions of richly annotated public transcriptomic data points in the field, providing an environment for dataset re-use and citation, bench data validation and hypothesis generation. We anticipate that this new release of the NURSA database will have tangible, long term benefits for both basic and clinical research in this field. PMID:26325041
NASA Astrophysics Data System (ADS)
Walker, J. I.; Blodgett, D. L.; Suftin, I.; Kunicki, T.
2013-12-01
High-resolution data for use in environmental modeling are increasingly becoming available at broad spatial and temporal scales. Downscaled climate projections, remotely sensed landscape parameters, and land-use/land-cover projections are examples of datasets that may exceed an individual investigation's data management and analysis capacity. To allow projects on limited budgets to work with many of these datasets, the burden of working with them must be reduced. The approach being pursued at the U.S. Geological Survey Center for Integrated Data Analytics uses standard self-describing web services that allow machine-to-machine data access and manipulation. These techniques have been implemented and deployed in production-level server-based Web Processing Services that can be accessed from a web application or scripted workflow. Data publication techniques that allow machine interpretation of large collections of data have also been implemented for numerous datasets at U.S. Geological Survey data centers as well as partner agencies and academic institutions. Discovery of data services is accomplished using a method in which a machine-generated metadata record holds content, derived from the data's source web service, that is intended for human interpretation as well as machine interpretation. A distributed search application has been developed that demonstrates the utility of a decentralized search of data-owner metadata catalogs from multiple agencies. The integrated but decentralized system of metadata, data, and server-based processing capabilities will be presented. The design, utility, and value of these solutions will be illustrated with applied science examples and success stories. Datasets such as the EPA's Integrated Climate and Land Use Scenarios, USGS/NASA MODIS-derived land cover attributes, and downscaled climate projections from several sources are examples of data this system includes.
These and other datasets have been published as standard, self-describing web services that provide the ability to inspect and subset the data. This presentation will demonstrate this file-to-web-service concept and how it can be used from script-based workflows or web applications.
Vempati, Uma D.; Przydzial, Magdalena J.; Chung, Caty; Abeyruwan, Saminda; Mir, Ahsan; Sakurai, Kunie; Visser, Ubbo; Lemmon, Vance P.; Schürer, Stephan C.
2012-01-01
Huge amounts of high-throughput screening (HTS) data for probe and drug development projects are being generated in the pharmaceutical industry and more recently in the public sector. The resulting experimental datasets are increasingly being disseminated via publicly accessible repositories. However, existing repositories lack sufficient metadata to describe the experiments and are often difficult to navigate by non-experts. The lack of standardized descriptions and semantics of biological assays and screening results hinders targeted data retrieval, integration, aggregation, and analyses across different HTS datasets, for example to infer mechanisms of action of small molecule perturbagens. To address these limitations, we created the BioAssay Ontology (BAO). BAO has been developed with a focus on data integration and analysis enabling the classification of assays and screening results by concepts that relate to format, assay design, technology, target, and endpoint. Previously, we reported on the higher-level design of BAO and on the semantic querying capabilities offered by the ontology-indexed triple store of HTS data. Here, we report on our detailed design, annotation pipeline, substantially enlarged annotation knowledgebase, and analysis results. We used BAO to annotate assays from the largest public HTS data repository, PubChem, and demonstrate its utility to categorize and analyze diverse HTS results from numerous experiments. BAO is publicly available from the NCBO BioPortal at http://bioportal.bioontology.org/ontologies/1533. BAO provides controlled terminology and uniform scope to report probe and drug discovery screening assays and results. BAO leverages description logic to formalize the domain knowledge and facilitate the semantic integration with diverse other resources. As a consequence, BAO offers the potential to infer new knowledge from a corpus of assay results, for example molecular mechanisms of action of perturbagens. PMID:23155465
NASA Astrophysics Data System (ADS)
Thibault, K. M.
2013-12-01
As the construction of NEON and its transition to operations progresses, more and more data will become available to the scientific community, both from NEON directly and from the concomitant growth of existing data repositories. Many of these datasets include ecological observations of a diversity of taxa in both aquatic and terrestrial environments. Although observational data have been collected and used throughout the history of organismal biology, the field has not yet fully developed a culture of data management, documentation, standardization, sharing and discoverability to facilitate the integration and synthesis of datasets. Moreover, the tools required to accomplish these goals, namely database design, implementation, and management, and automation and parallelization of analytical tasks through computational techniques, have not historically been included in biology curricula, at either the undergraduate or graduate levels. To ensure the success of data-generating projects like NEON in advancing organismal ecology and to increase transparency and reproducibility of scientific analyses, an acceleration of the cultural shift to open science practices, the development and adoption of data standards, such as the DarwinCore standard for taxonomic data, and increased training in computational approaches for biologists need to be realized. Here I highlight several initiatives that are intended to increase access to and discoverability of publicly available datasets and equip biologists and other scientists with the skills that are needed to manage, integrate, and analyze data from multiple large-scale projects. The EcoData Retriever (ecodataretriever.org) is a tool that downloads publicly available datasets, re-formats the data into an efficient relational database structure, and then automatically imports the data tables onto a user's local drive into the database tool of the user's choice. 
The automation of these tasks results in nearly instantaneous execution of tasks that previously required hours to days of each data user's time, with decreased error rates and increased usability of the data. The Ecological Data wiki (ecologicaldata.org) provides a forum for users of ecological datasets to share relevant metadata and tips and tricks for using the data, in order to flatten learning curves, as well as minimize redundancy of efforts among users of the same datasets. Finally, Software Carpentry (software-carpentry.org) has developed curricula for scientific computing and provides both online training and low-cost, short courses that can be tailored to the specific needs of the students. Demand for these courses has been increasing exponentially in recent years, and they represent a significant educational resource for biologists. I will conclude by linking these initiatives to the challenges facing ecologists related to the effective and efficient exploitation of NEON's diverse data streams.
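The Retriever's download-reformat-import pattern can be sketched in a few lines of stdlib Python; this is a minimal illustration, not the Retriever's actual code, and the inline CSV string and table name stand in for a real downloaded data file.

```python
import csv
import io
import sqlite3

# Toy stand-in for a downloaded dataset file; the real tool fetches
# published data over HTTP before loading it.
raw = "species,site,count\nPeromyscus,A,12\nDipodomys,A,5\nPeromyscus,B,7\n"

# Load the flat file into a relational table for efficient querying.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE surveys (species TEXT, site TEXT, count INTEGER)")
rows = list(csv.DictReader(io.StringIO(raw)))
conn.executemany("INSERT INTO surveys VALUES (:species, :site, :count)", rows)

# Once loaded, the data can be queried instead of re-parsed each time.
total = conn.execute(
    "SELECT SUM(count) FROM surveys WHERE species = 'Peromyscus'"
).fetchone()[0]
assert total == 19
```

Automating this step once per dataset is what turns hours of per-user munging into a single reproducible import.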
Building a better search engine for earth science data
NASA Astrophysics Data System (ADS)
Armstrong, E. M.; Yang, C. P.; Moroni, D. F.; McGibbney, L. J.; Jiang, Y.; Huang, T.; Greguska, F. R., III; Li, Y.; Finch, C. J.
2017-12-01
Free-text data searching of earth science datasets has been implemented with varying degrees of success and completeness across the spectrum of the 12 NASA earth science data centers. At the JPL Physical Oceanography Distributed Active Archive Center (PO.DAAC) the search engine has been developed around the Solr/Lucene platform. Others have chosen other popular enterprise search platforms like Elasticsearch. Regardless, the default implementations of these search engines, leveraging factors such as dataset popularity, term frequency, and inverse document frequency, do not fully meet the needs of precise relevancy and ranking of earth science search results. For the PO.DAAC, this shortcoming has been identified for several years by its external User Working Group, which has assigned several recommendations to improve the relevancy and discoverability of datasets related to remotely sensed sea surface temperature, ocean wind, waves, salinity, height, and gravity that comprise a total count of over 500 publicly available datasets. Recently, the PO.DAAC has teamed with an effort led by George Mason University to improve the search and relevancy ranking of oceanographic data via a simple search interface and powerful backend services called MUDROD (Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to Improve Data Discovery), funded by the NASA AIST program. MUDROD has mined and utilized the combination of PO.DAAC earth science dataset metadata, usage metrics, and user feedback and search history to objectively extract relevance for improved data discovery and access. In addition to improved dataset relevance and ranking, the MUDROD search engine also returns recommendations to related datasets and related user queries. 
This presentation will report on use cases that drove the architecture and development, and the success metrics and improvements on search precision and recall that MUDROD has demonstrated over the existing PO.DAAC search interfaces.
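MUDROD's actual ranking is learned from mined logs and metadata; as a toy illustration of the core idea of blending a content match with a usage signal, one might write the following (the dataset names, descriptions, and usage counts are invented, and the scoring formula is a stand-in for the learned model).

```python
def rank(query_terms, datasets, usage, w_usage=0.5):
    """Rank datasets by term overlap with the query, boosted by a
    normalized usage signal (e.g. download counts).  A toy stand-in
    for a learned relevance model."""
    max_usage = max(usage.values(), default=0) or 1
    def score(name):
        overlap = sum(1 for t in query_terms if t in datasets[name].split())
        return overlap + w_usage * usage.get(name, 0) / max_usage
    return sorted(datasets, key=score, reverse=True)

datasets = {
    "ghrsst_l4": "sea surface temperature global analysis",
    "ascat_wind": "ocean surface wind vectors",
}
usage = {"ghrsst_l4": 900, "ascat_wind": 100}
assert rank(["surface", "temperature"], datasets, usage)[0] == "ghrsst_l4"
```

The point of the blend is that two datasets matching the same terms are separated by how the community actually uses them, which plain TF-IDF ranking cannot see.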
SCPortalen: human and mouse single-cell centric database
Noguchi, Shuhei; Böttcher, Michael; Hasegawa, Akira; Kouno, Tsukasa; Kato, Sachi; Tada, Yuhki; Ura, Hiroki; Abe, Kuniya; Shin, Jay W; Plessy, Charles; Carninci, Piero
2018-01-01
Abstract Published single-cell datasets are rich resources for investigators who want to address questions not originally asked by the creators of the datasets. The single-cell datasets might be obtained by different protocols and diverse analysis strategies. The main challenge in utilizing such single-cell data is how to make the various large-scale datasets comparable and reusable in a different context. To address this issue, we developed the single-cell centric database ‘SCPortalen’ (http://single-cell.clst.riken.jp/). The current version of the database covers human and mouse single-cell transcriptomics datasets that are publicly available from the INSDC sites. The original metadata was manually curated and single-cell samples were annotated with standard ontology terms. Following that, common quality assessment procedures were conducted to check the quality of the raw sequences. Furthermore, primary data processing of the raw data followed by advanced analyses and interpretation have been performed from scratch using our pipeline. In addition to the transcriptomics data, SCPortalen provides access to single-cell image files whenever available. The target users of SCPortalen are all researchers interested in specific cell types or population heterogeneity. Through the web interface of SCPortalen users are easily able to search, explore and download the single-cell datasets of their interest. PMID:29045713
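The abstract does not spell out SCPortalen's QC pipeline; a generic sketch of one common raw-sequence check, mean Phred quality per FASTQ read, is shown below (Phred+33 encoding and the threshold of 20 are assumptions for illustration).

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33)."""
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)

def passes_qc(record, threshold=20):
    """`record` is the 4-line FASTQ tuple (header, seq, plus, quality)."""
    return mean_phred(record[3]) >= threshold

good = ("@cell_1", "ACGT", "+", "IIII")  # 'I' encodes Phred 40
bad = ("@cell_2", "ACGT", "+", "!!!!")   # '!' encodes Phred 0
assert passes_qc(good) and not passes_qc(bad)
```

Running a uniform check like this across heterogeneously produced datasets is what makes them comparable downstream.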
Architecture of the local spatial data infrastructure for regional climate change research
NASA Astrophysics Data System (ADS)
Titov, Alexander; Gordov, Evgeny
2013-04-01
Georeferenced datasets (meteorological databases, modeling and reanalysis results, etc.) are actively used in modeling and analysis of climate change for various spatial and temporal scales. Due to the inherent heterogeneity of environmental datasets, as well as their size, which can reach tens of terabytes for a single dataset, studies in the area of climate and environmental change require special software support based on the SDI approach. A dedicated architecture of the local spatial data infrastructure aiming at regional climate change analysis using modern web mapping technologies is presented. The geoportal is a key element of any SDI, allowing searching of geoinformation resources (datasets and services) using metadata catalogs, producing geospatial data selections by their parameters (data access functionality), as well as managing services and applications of cartographical visualization. It should be noted that, due to objective reasons such as big dataset volume, complexity of the data models used, and syntactic and semantic differences between various datasets, the development of environmental geodata access, processing and visualization services turns out to be quite a complex task. Those circumstances were taken into account while developing the architecture of the local spatial data infrastructure as a universal framework providing geodata services. Accordingly, the architecture presented includes: 1. An effective model, in terms of search, access, retrieval and subsequent statistical processing, for storing big sets of regional georeferenced data, allowing in particular the storage of frequently used values (like monthly and annual climate change indices, etc.), thus providing different temporal views of the datasets 2. The general architecture of the corresponding software components handling geospatial datasets within the storage model 3. 
A metadata catalog describing the datasets used in climate research in detail, using the ISO 19115 and CF-convention standards, as a basic element of the spatial data infrastructure, as well as its publication according to the OGC CSW (Catalog Service for the Web) specification 4. Computational and mapping web services for working with geospatial datasets based on OWS (OGC Web Services) standards: WMS, WFS, WPS 5. A geoportal as a key element of the thematic regional spatial data infrastructure, also providing a software framework for dedicated web application development To realize the web mapping services, GeoServer software is used, since it provides a native WPS implementation as a separate software module. To provide geospatial metadata services, the GeoNetwork opensource (http://geonetwork-opensource.org) product is planned to be used, because it supports the ISO 19115/ISO 19119/ISO 19139 metadata standards as well as the ISO CSW 2.0 profile for both client and server. To implement thematic applications based on geospatial web services within the framework of the local SDI geoportal, the following open source software has been selected: 1. The OpenLayers JavaScript library, providing basic web mapping functionality for a thin client such as a web browser 2. The GeoExt/ExtJS JavaScript libraries for building client-side web applications working with geodata services. The web interface developed will be similar to the interfaces of such popular desktop GIS applications as uDig and QuantumGIS. The work is partially supported by RF Ministry of Education and Science grant 8345, SB RAS Program VIII.80.2.1 and IP 131.
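The WMS piece of such an architecture can be illustrated with a plain request builder; the parameter names below follow the WMS 1.3.0 GetMap operation, while the endpoint URL and layer name are hypothetical examples, not services from this infrastructure.

```python
from urllib.parse import urlencode

def wms_getmap_url(base, layer, bbox, width=512, height=512,
                   crs="EPSG:4326", fmt="image/png"):
    """Build a WMS 1.3.0 GetMap request URL.

    `bbox` is (min1, min2, max1, max2) in the axis order the CRS defines."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": crs,
        "BBOX": ",".join(str(c) for c in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return base + "?" + urlencode(params)

# Hypothetical GeoServer endpoint serving a climate-index layer.
url = wms_getmap_url("http://example.org/geoserver/wms",
                     "climate:annual_mean_temp", (50, 60, 56, 90))
assert "REQUEST=GetMap" in url and "BBOX=50%2C60%2C56%2C90" in url
```

Because the request is just a URL, the same map layer is reachable from OpenLayers in a browser, a scripted workflow, or a desktop GIS.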
Data reuse and the open data citation advantage
Vision, Todd J.
2013-01-01
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003. PMID:24109559
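When the outcome of such a regression is log-transformed citation count, a coefficient b on the "data available" indicator translates into a percent citation benefit of (e^b - 1) x 100. The sketch below shows that translation; the coefficient and interval bounds are chosen to mirror the reported 9% (5% to 13%) figures and are not the study's actual estimates.

```python
from math import exp, log

def percent_benefit(coef):
    """Percent change in citations implied by a regression coefficient
    `coef` on an indicator variable, when the outcome is log(citations)."""
    return (exp(coef) - 1) * 100

# Hypothetical point estimate and 95% CI bounds on the log scale,
# back-computed to match the reported 9% (5%-13%) benefit.
b, lo_b, hi_b = log(1.09), log(1.05), log(1.13)
assert round(percent_benefit(b)) == 9
assert round(percent_benefit(lo_b)) == 5
assert round(percent_benefit(hi_b)) == 13
```

This is why log-scale coefficients near zero read roughly as percentages, while larger ones (like the ~30% benefit for 2004-2005 papers) diverge from the raw coefficient.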
Frizzelle, Brian G; Evenson, Kelly R; Rodriguez, Daniel A; Laraia, Barbara A
2009-01-01
Background Health researchers have increasingly adopted the use of geographic information systems (GIS) for analyzing environments in which people live and how those environments affect health. One aspect of this research that is often overlooked is the quality and detail of the road data and whether or not it is appropriate for the scale of analysis. Many readily available road datasets, both public domain and commercial, contain positional errors or generalizations that may not be compatible with highly accurate geospatial locations. This study examined the accuracy, completeness, and currency of four readily available public and commercial sources for road data (North Carolina Department of Transportation, StreetMap Pro, TIGER/Line 2000, TIGER/Line 2007) relative to a custom road dataset which we developed and used for comparison. Methods and Results A custom road network dataset was developed to examine associations between health behaviors and the environment among pregnant and postpartum women living in central North Carolina in the United States. Three analytical measures were developed to assess the comparative accuracy and utility of four publicly and commercially available road datasets and the custom dataset in relation to participants' residential locations over three time periods. The exclusion of road segments and positional errors in the four comparison road datasets resulted in between 5.9% and 64.4% of respondents lying farther than 15.24 meters from their nearest road, the threshold distance set by the project to facilitate spatial analysis. Agreement between the custom road dataset and the four comparison road datasets, measured using Pearson's correlation coefficient, ranged from 0.01 to 0.82. Conclusion This study demonstrates the importance of examining available road datasets and assessing their completeness, accuracy, and currency for their particular study area. 
This paper serves as an example for assessing the feasibility of readily available commercial or public road datasets, and outlines the steps by which an improved custom dataset for a study area can be developed. PMID:19409088
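The 15.24 m (50 ft) check reduces geometrically to point-to-segment distance; a self-contained sketch follows, assuming planar projected coordinates in meters (the study's actual GIS tooling is not specified here, and the road geometry below is invented).

```python
from math import hypot

def dist_point_to_segment(p, a, b):
    """Shortest planar distance from point p to segment ab
    (coordinates in meters, e.g. a projected State Plane CRS)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return hypot(px - ax, py - ay)
    # Project p onto the segment, clamped to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

THRESHOLD_M = 15.24  # 50 feet, the project's snapping threshold

def near_a_road(point, segments):
    return any(dist_point_to_segment(point, a, b) <= THRESHOLD_M
               for a, b in segments)

road = [((0, 0), (100, 0))]
assert near_a_road((50, 10), road)      # 10 m from the centerline: ok
assert not near_a_road((50, 30), road)  # 30 m away: a positional error
```

Residences failing this test against a given road dataset are exactly the 5.9% to 64.4% of respondents the study reports.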
SAFE: SPARQL Federation over RDF Data Cubes with Access Control.
Khan, Yasar; Saleem, Muhammad; Mehdi, Muntazir; Hogan, Aidan; Mehmood, Qaiser; Rebholz-Schuhmann, Dietrich; Sahay, Ratnesh
2017-02-01
Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains, resources are sensitive and access to these resources is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain, real-world datasets contain sensitive statistical information: strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisers, etc. Therefore, the legal and ethical concerns of (i) preserving the anonymity of patients (or clinical subjects) and (ii) respecting data ownership through access control are key challenges faced by the data analytics community working within the HCLS domain. Likewise, statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has looked to optimise federated queries over such statistical data. We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE's indexing system does not hold any data instances; it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change. 
We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces the source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.
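Since SAFE's summary stores only predicates and endpoints, source selection over it can be illustrated with plain set operations; the sketch below is a toy rendering of join-aware, access-controlled selection, with invented endpoint and predicate names and none of SAFE's actual data structures.

```python
def select_sources(query_predicates, index, user_endpoints):
    """Join-aware source selection with access control.

    `index` maps each endpoint to the set of predicates it serves (no
    instance data, mirroring a predicate/endpoint-only summary).  An
    endpoint is selected only if the user is authorised for it and it
    contributes at least one query predicate; the query is answerable
    only if selected endpoints jointly cover every predicate."""
    wanted = set(query_predicates)
    selected = {
        ep: preds & wanted
        for ep, preds in index.items()
        if ep in user_endpoints and preds & wanted
    }
    covered = set().union(*selected.values()) if selected else set()
    return selected, covered == wanted

index = {
    "hospital_a": {"qb:dataSet", "sdmx:refPeriod"},
    "trial_org": {"qb:dataSet", "ex:enrolment"},
    "hospital_b": {"ex:enrolment"},
}
# The user is authorised for only two of the three endpoints.
sources, answerable = select_sources(
    ["qb:dataSet", "ex:enrolment"], index, {"hospital_a", "trial_org"})
assert answerable and set(sources) == {"hospital_a", "trial_org"}
```

Unauthorised endpoints (here `hospital_b`) are never even contacted, which is both the privacy guarantee and the source of the reduced query times.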
Demonstration of Data Interactive Publications
NASA Astrophysics Data System (ADS)
Domenico, B.; Weber, J.
2012-04-01
This is a demonstration version of the talk given in session ESSI2.4 "Full lifecycle of data." For some years now, the authors have developed examples of online documents that allowed the reader to interact directly with datasets, but there were limitations that restricted the interaction to specific desktop analysis and display tools that were not generally available to all readers of the documents. Recent advances in web service technology and related standards are making it possible to develop systems for publishing online documents that enable readers to access, analyze, and display the data discussed in the publication from the perspective and in the manner from which the author wants it to be represented. By clicking on embedded links, the reader accesses not only the usual textual information in a publication, but also data residing on a local or remote web server as well as a set of processing tools for analyzing and displaying the data. With the option of having the analysis and display processing provided on the server (or in the cloud), there are now a broader set of possibilities on the client side where the reader can interact with the data via a thin web client, a rich desktop application, or a mobile platform "app." The presentation will outline the architecture of data interactive publications along with illustrative examples.
ToxRefDB - Release user-friendly web-based tool for mining ToxRefDB
The updated URL link is for a table of NCCT ToxCast public datasets. The next-to-last row of the table has the link for the US EPA ToxCast ToxRefDB Data Release October 2014. ToxRefDB provides detailed chemical toxicity data in a publicly accessible searchable format. ToxRefD...
Online collaboration and model sharing in volcanology via VHub.org
NASA Astrophysics Data System (ADS)
Valentine, G.; Patra, A. K.; Bajo, J. V.; Bursik, M. I.; Calder, E.; Carn, S. A.; Charbonnier, S. J.; Connor, C.; Connor, L.; Courtland, L. M.; Gallo, S.; Jones, M.; Palma Lizana, J. L.; Moore-Russo, D.; Renschler, C. S.; Rose, W. I.
2013-12-01
VHub (short for VolcanoHub, and accessible at vhub.org) is an online platform for barrier-free access to high-end modeling, simulation, and collaboration in research and training related to volcanoes, the hazards they pose, and risk mitigation. The underlying concept is to provide a platform, building upon the successful HUBzero software infrastructure (hubzero.org), that enables workers to collaborate online and to easily share information, modeling and analysis tools, and educational materials with colleagues around the globe. Collaboration occurs around several different points: (1) modeling and simulation; (2) data sharing; (3) education and training; (4) volcano observatories; and (5) project-specific groups. VHub promotes modeling and simulation in two ways: (1) some models can be implemented on VHub for online execution, with VHub providing a central warehouse for such models that should result in broader dissemination; and (2) VHub provides a platform that supports the more complex CFD models by enabling the sharing of code development and problem-solving knowledge, benchmarking datasets, and the development of validation exercises. VHub also provides a platform for sharing of data and datasets. The VHub development team is implementing the iRODS data sharing middleware (see irods.org). iRODS allows a researcher to access data that are located at participating data sources around the world (a cloud of data) as if the data were housed in a single virtual database. Projects associated with VHub are also going to introduce the use of data-driven workflow tools to support the use of multistage analysis processes where computing and data are integrated for model validation, hazard analysis, etc. Audio-video recordings of seminars, PowerPoint slide sets, and educational simulations are all items that can be placed onto VHub for use by the community or by selected collaborators.
An important point is that the manager of a given educational resource (or any other resource, such as a dataset or a model) can control the privacy of that resource, ranging from private (only accessible by, and known to, specific collaborators) to completely public. VHub is a very useful platform for project-specific collaborations. With a group site on VHub collaborators share documents, datasets, maps, and have ongoing discussions using the discussion board function. VHub is funded by the U.S. National Science Foundation, and is participating in development of larger earth-science cyberinfrastructure initiatives (EarthCube), as well as supporting efforts such as the Global Volcano Model. Emerging VHub-facilitated efforts include model benchmarking, collaborative code development, and growth in online modeling tools.
Data Publication: A Partnership between Scientists, Data Managers and Librarians
NASA Astrophysics Data System (ADS)
Raymond, L.; Chandler, C.; Lowry, R.; Urban, E.; Moncoiffe, G.; Pissierssens, P.; Norton, C.; Miller, H.
2012-04-01
Current literature on the topic of data publication suggests that success is best achieved when there is a partnership between scientists, data managers, and librarians. The Marine Biological Laboratory/Woods Hole Oceanographic Institution (MBLWHOI) Library and the Biological and Chemical Oceanography Data Management Office (BCO-DMO) have developed tools and processes to automate the ingestion of metadata from BCO-DMO for deposit with datasets into the Institutional Repository (IR) Woods Hole Open Access Server (WHOAS). The system also incorporates functionality for BCO-DMO to request a Digital Object Identifier (DOI) from the Library. This partnership allows the Library to work with a trusted data repository to ensure high quality data while the data repository utilizes library services and is assured of a permanent archive of the copy of the data extracted from the repository database. The assignment of persistent identifiers enables accurate data citation. The Library can assign a DOI to appropriate datasets deposited in WHOAS. A primary activity is working with authors to deposit datasets associated with published articles. The DOI would ideally be assigned before submission and be included in the published paper so readers can link directly to the dataset, but DOIs are also being assigned to datasets related to articles after publication. WHOAS metadata records link the article to the datasets and the datasets to the article. The assignment of DOIs has enabled another important collaboration with Elsevier, publisher of educational and professional science journals. Elsevier can now link from articles in the Science Direct database to the datasets available from WHOAS that are related to that article. The data associated with the article are freely available from WHOAS and accompanied by a Dublin Core metadata record. 
In addition, the Library has worked with researchers to deposit datasets in WHOAS that are not appropriate for national, international, or domain specific data repositories. These datasets currently include audio, text and image files. This research is being conducted by a team of librarians, data managers and scientists that are collaborating with representatives from the Scientific Committee on Oceanic Research (SCOR) and the International Oceanographic Data and Information Exchange (IODE) of the Intergovernmental Oceanographic Commission (IOC). The goal is to identify best practices for tracking data provenance and clearly attributing credit to data collectors/providers.
PIVOT: platform for interactive analysis and visualization of transcriptomics data.
Zhu, Qin; Fisher, Stephen A; Dueck, Hannah; Middleton, Sarah; Khaladkar, Mugdha; Kim, Junhyong
2018-01-05
Many R packages have been developed for transcriptome analysis, but their use often requires familiarity with R, and integrating the results of different packages requires scripts to wrangle the data types. Furthermore, exploratory data analyses often generate multiple derived datasets, such as data subsets or data transformations, which can be difficult to track. Here we present PIVOT, an R-based platform that wraps open source transcriptome analysis packages with a uniform user interface and graphical data management, allowing non-programmers to interactively explore transcriptomics data. PIVOT supports more than 40 popular open source packages for transcriptome analysis and provides an extensive set of tools for statistical data manipulation. A graph-based visual interface is used to represent the links between derived datasets, allowing easy tracking of data versions. PIVOT further supports automatic report generation, publication-quality plots, and program/data state saving, such that all analyses can be saved, shared and reproduced. PIVOT will allow researchers with broad backgrounds to easily access sophisticated transcriptome analysis tools and interactively explore transcriptome datasets.
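The graph-based tracking of derived datasets can be sketched as a simple lineage chain (this is an illustrative model of the idea, not PIVOT's actual R implementation; class and field names are invented):

```python
# Each derived dataset records its parent and the operation that
# produced it, so any result can be traced back through subsets
# and transformations.
class DataNode:
    def __init__(self, name, parent=None, operation=None):
        self.name = name
        self.parent = parent        # dataset this one was derived from
        self.operation = operation  # e.g. "subset", "normalize"

    def lineage(self):
        """Walk back to the root, returning the chain of derivations."""
        node, chain = self, []
        while node is not None:
            chain.append((node.name, node.operation))
            node = node.parent
        return list(reversed(chain))

raw = DataNode("counts_raw")
norm = DataNode("counts_norm", parent=raw, operation="normalize")
subset = DataNode("counts_norm_subsetA", parent=norm, operation="subset")
```

Recording lineage explicitly is what makes a saved analysis reproducible: the full chain of operations travels with the derived dataset.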
Booly: a new data integration platform.
Do, Long H; Esteves, Francisco F; Karten, Harvey J; Bier, Ethan
2010-10-13
Data integration is an escalating problem in bioinformatics. We have developed a web tool and warehousing system, Booly, that features a simple yet flexible data model coupled with the ability to perform powerful comparative analysis, including the use of Boolean logic to merge datasets together, and an integrated aliasing system to decipher differing names of the same gene or protein. Furthermore, Booly features a collaborative sharing system and a public repository so that users can retrieve new datasets while contributors can easily disseminate new content. We illustrate the uses of Booly with several examples including: the versatile creation of homebrew datasets, the integration of heterogeneous data to identify genes useful for comparing avian and mammalian brain architecture, and generation of a list of Food and Drug Administration (FDA) approved drugs with possible alternative disease targets. The Booly paradigm for data storage and analysis should facilitate integration between disparate biological and medical fields and result in novel discoveries that can then be validated experimentally. Booly can be accessed at http://booly.ucsd.edu.
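The two ideas the abstract highlights, aliasing and Boolean merging, can be sketched with plain Python sets (the alias table and gene names below are illustrative, not Booly's actual data):

```python
# Aliasing maps differing gene names onto one canonical symbol;
# Boolean logic then merges datasets over the canonical names.
ALIASES = {"p53": "TP53", "Trp53": "TP53", "her2": "ERBB2"}

def canonical(names):
    """Resolve every name through the alias table."""
    return {ALIASES.get(n, n) for n in names}

dataset_a = canonical({"p53", "BRCA1", "her2"})
dataset_b = canonical({"Trp53", "EGFR"})

both = dataset_a & dataset_b    # AND: genes present in both datasets
either = dataset_a | dataset_b  # OR: union of both datasets
only_a = dataset_a - dataset_b  # NOT: in A but not in B
```

Without the aliasing step, "p53" and "Trp53" would fail to match even though they name the same gene, which is exactly the problem the integrated aliasing system addresses.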
Booly: a new data integration platform
2010-01-01
Background Data integration is an escalating problem in bioinformatics. We have developed a web tool and warehousing system, Booly, that features a simple yet flexible data model coupled with the ability to perform powerful comparative analysis, including the use of Boolean logic to merge datasets together, and an integrated aliasing system to decipher differing names of the same gene or protein. Furthermore, Booly features a collaborative sharing system and a public repository so that users can retrieve new datasets while contributors can easily disseminate new content. Results We illustrate the uses of Booly with several examples including: the versatile creation of homebrew datasets, the integration of heterogeneous data to identify genes useful for comparing avian and mammalian brain architecture, and generation of a list of Food and Drug Administration (FDA) approved drugs with possible alternative disease targets. Conclusions The Booly paradigm for data storage and analysis should facilitate integration between disparate biological and medical fields and result in novel discoveries that can then be validated experimentally. Booly can be accessed at http://booly.ucsd.edu. PMID:20942966
OpenTopography: Enabling Online Access to High-Resolution Lidar Topography Data and Processing Tools
NASA Astrophysics Data System (ADS)
Crosby, Christopher; Nandigam, Viswanath; Baru, Chaitan; Arrowsmith, J. Ramon
2013-04-01
High-resolution topography data acquired with lidar (light detection and ranging) technology are revolutionizing the way we study the Earth's surface and overlying vegetation. These data, collected from airborne, tripod, or mobile-mounted scanners, have emerged as a fundamental tool for research on topics ranging from earthquake hazards to hillslope processes. Lidar data provide a digital representation of the Earth's surface at a resolution sufficient to appropriately capture the processes that contribute to landscape evolution. The U.S. National Science Foundation-funded OpenTopography Facility (http://www.opentopography.org) is a web-based system designed to democratize access to earth science-oriented lidar topography data. OpenTopography provides free, online access to lidar data in a number of forms, including the raw point cloud and associated geospatial-processing tools for customized analysis. The point cloud data are co-located with on-demand processing tools to generate digital elevation models, derived products, and visualizations, which allow users to quickly access data in a format appropriate for their scientific application. The OpenTopography system is built using a service-oriented architecture (SOA) that leverages cyberinfrastructure resources at the San Diego Supercomputer Center at the University of California San Diego to allow users, regardless of expertise level, to access these massive lidar datasets and derived products for use in research and teaching. OpenTopography hosts over 500 billion lidar returns covering 85,000 km². These data are all in the public domain and are provided by a variety of partners under joint agreements and memoranda of understanding with OpenTopography. Partners include national facilities such as the NSF-funded National Center for Airborne Lidar Mapping (NCALM), as well as non-governmental organizations and local, state, and federal agencies.
OpenTopography has become a hub for high-resolution topography resources. Datasets hosted by other organizations, as well as lidar-specific software, can be registered into the OpenTopography catalog, providing users a "one-stop shop" for such information. With several thousand active users, OpenTopography is an excellent example of a mature Spatial Data Infrastructure system that is enabling access to challenging data for research, education and outreach. Ongoing OpenTopography design and development work includes the archive and publication of datasets using digital object identifiers (DOIs); creation of a more flexible and scalable high-performance environment for processing of large datasets; expanded support for satellite and terrestrial lidar; and creation of a "pluggable" infrastructure for third-party programs and algorithms. OpenTopography has successfully created a facility for sharing lidar data. In the project's next phase, we are working to enable equally easy and successful sharing of services for processing and analysis of these data.
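The on-demand gridding step that turns a point cloud into a digital elevation model can be sketched as simple cell binning (a toy illustration only; OpenTopography's actual services handle billions of returns with more sophisticated interpolation):

```python
# Bin lidar returns (x, y, z) into cells and keep a per-cell
# mean elevation. Coordinates and cell size are illustrative.
from collections import defaultdict

def grid_dem(points, cell_size):
    """Average z per (x, y) cell; returns {(col, row): mean_z}."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y, z in points:
        key = (int(x // cell_size), int(y // cell_size))
        sums[key][0] += z
        sums[key][1] += 1
    return {key: s / n for key, (s, n) in sums.items()}

points = [(0.2, 0.3, 10.0), (0.8, 0.1, 12.0), (1.5, 0.4, 20.0)]
dem = grid_dem(points, cell_size=1.0)
```

Running this kind of reduction server-side, next to the data, is what lets users receive a ready-to-use raster instead of the raw point cloud.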
Publishing datasets with eSciDoc and panMetaDocs
NASA Astrophysics Data System (ADS)
Ulbricht, D.; Klump, J.; Bertelmann, R.
2012-04-01
Currently several research institutions worldwide undertake considerable efforts to have their scientific datasets published and to syndicate them to data portals as extensively described objects identified by a persistent identifier. This is done to foster the reuse of data, to make scientific work more transparent, and to create a citable entity that can be referenced unambiguously in written publications. GFZ Potsdam established a publishing workflow for file-based research datasets. Key software components are an eSciDoc infrastructure [1] and multiple instances of the data curation tool panMetaDocs [2]. The eSciDoc repository holds data objects and their associated metadata in container objects, called eSciDoc items. A key metadata element in this context is the publication status of the referenced dataset. PanMetaDocs, which is based on PanMetaWorks [3], is a PHP-based web application that allows data to be described with any XML-based metadata schema. The metadata fields can be filled with static or dynamic content to reduce the number of fields that require manual entry to a minimum and to make use of contextual information in a project setting. Access rights can be applied to set the visibility of datasets to other project members and to enable collaboration on datasets, notification about changes (RSS), and interaction with the internal messaging system inherited from panMetaWorks. When a dataset is to be published, panMetaDocs allows the publication status of the eSciDoc item to be changed from "private" to "submitted", preparing the dataset for verification by an external reviewer. After quality checks, the item's publication status can be changed to "published". This makes the data and metadata available through the internet worldwide. PanMetaDocs is developed as an eSciDoc application. It is an easy-to-use graphical user interface to eSciDoc items, their data and metadata.
It is also an application supporting a DOI publication agent during the process of publishing scientific datasets as electronic data supplements to research papers. The publication of research manuscripts already has a well-established workflow; dataset publication shares junctures with it and involves several parties. The activities of the author, the reviewer, the print publisher, and the data publisher have to be coordinated into a common data publication workflow. The case of data publication at GFZ Potsdam displays some specifics, e.g. the DOIDB web service. The DOIDB is a proxy service at GFZ for the DataCite [4] DOI registration and its metadata store. DOIDB provides a local summary of the dataset DOIs registered through GFZ as a publication agent. An additional use case for the DOIDB is its ability to enrich the DataCite metadata with additional custom attributes, such as a geographic reference in a DIF record. These attributes are at the moment not available in the DataCite metadata schema but would be valuable elements for the compilation of data catalogues in the earth sciences and for dissemination of catalogue data via OAI-PMH. [1] http://www.escidoc.org , eSciDoc, FIZ Karlsruhe, Germany [2] http://panmetadocs.sf.net , panMetaDocs, GFZ Potsdam, Germany [3] http://metaworks.pangaea.de , panMetaWorks, Dr. R. Huber, MARUM, Univ. Bremen, Germany [4] http://www.datacite.org
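The OAI-PMH dissemination mentioned above amounts to issuing simple parameterized HTTP requests; a harvester's request can be sketched as follows (the endpoint URL and set name are illustrative; `verb`, `metadataPrefix`, and `set` are standard OAI-PMH request parameters):

```python
# Build an OAI-PMH harvesting request of the kind a data catalogue
# would issue against a metadata provider.
from urllib.parse import urlencode

def oai_request(base_url, verb, **kwargs):
    """Compose an OAI-PMH request URL from verb and arguments."""
    return f"{base_url}?{urlencode({'verb': verb, **kwargs})}"

url = oai_request(
    "https://doidb.example.org/oai", "ListRecords",
    metadataPrefix="oai_dc", set="datasets",
)
```

A catalogue would then fetch this URL periodically and parse the returned XML records into its own index.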
Enrichment of Data Publications in Earth Sciences - Data Reports as a Missing Link
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Bertelmann, Roland; Haberland, Christian; Evans, Peter L.
2015-04-01
During the past decade, the relevance of research data stewardship has risen significantly. Preservation and publication of scientific data for long-term use, including storage in adequate repositories, has been identified as a key issue by the scientific community as well as by bodies such as research agencies. Essential for any kind of re-use is a proper description of the datasets. As a result of this increasing interest, data repositories have been developed, and the research data they include are accompanied by at least a minimum set of metadata. This metadata is useful for data discovery and a first insight into the content of a dataset, but data re-use often requires more extensive information. Many datasets are accompanied by a small 'readme' file with basic information on the data structure, or by other accompanying documents. A source of additional information could be an article published in one of the newly emerging data journals (e.g. Copernicus's Earth System Science Data (ESSD) or Nature's Scientific Data). Obviously there is an information gap between a 'readme' file that is only accessible after data download (which often leads to less usage of published datasets than if the information were available beforehand) and the much larger effort of preparing an article for a peer-reviewed data journal. For many years, the GFZ German Research Centre for Geosciences has published 'Scientific Technical Reports (STR)' as a report series that is electronically and persistently available and citable with assigned DOIs. This series was opened for the description of parallel published datasets as 'STR Data'. These are internally reviewed and offer a flexible publication format describing published data in depth, suitable for different datasets ranging from long-term monitoring time series of observatories to field data, (meta-)databases, and software publications.
STR Data reports offer a full and consistent overview and description of all relevant parameters of a linked published dataset. These reports are readable and citable on their own but are, of course, closely connected to the respective datasets. They therefore give full insight into the framework of the data before data download. This is especially relevant for large and often heterogeneous datasets, such as controlled-source seismic data gathered with instruments of the 'Geophysical Instrument Pool Potsdam (GIPP)'. Here, details of the instrumentation, data organization, data format, accuracy, geographical coordinates, timing, data completeness, etc. need to be documented. STR Data reports are also attractive for the publication of historic datasets, e.g. 30-40-year-old seismic experiments. A single STR Data report can also describe several datasets, e.g. from multiple diverse instrument types or distinct regions of interest. The publication of DOI-assigned data reports is a helpful tool to fill the gap between basic metadata and restricted 'readme' information on the one hand and extended journal articles on the other. They open the way for informed re-use and, with their comprehensive data description, may act as an 'appetizer' for the re-use of published datasets.
Moderate-Resolution Sea Surface Temperature Data for the Nearshore North Pacific
Coastal sea surface temperature (SST) is an important environmental characteristic defining habitat suitability for nearshore marine and estuarine organisms. The purpose of this publication is to provide access to an easy-to-use coastal SST dataset for ecologists, biogeographers...
The tragedy of the biodiversity data commons: a data impediment creeping nigher?
Galicia, David; Ariño, Arturo H
2018-01-01
Abstract Researchers are embracing the open access movement to facilitate unrestricted availability of scientific results. One sign of this willingness is the steady increase in data freely shared online, which has prompted a corresponding increase in the number of papers using such data. Publishing datasets is a time-consuming process that is often seen as a courtesy, rather than a necessary step in the research process. Making data accessible allows further research, provides basic information for decision-making and contributes to transparency in science. Nevertheless, the ease of access to heaps of data carries a perception of ‘free lunch for all’, and the work of data publishers is largely going unnoticed. Acknowledging such a significant effort involving the creation, management and publication of a dataset remains a flimsy, not well established practice in the scientific community. In a meta-analysis of published literature, we have observed various dataset citation practices, but mostly (92%) consisting of merely citing the data repository rather than the data publisher. Failing to recognize the work of data publishers might lead to a decrease in the number of quality datasets shared online, compromising potential research that is dependent on the availability of such data. We make an urgent appeal to raise awareness about this issue. PMID:29688384
GC31G-1182: Opennex, a Private-Public Partnership in Support of the National Climate Assessment
NASA Technical Reports Server (NTRS)
Nemani, Ramakrishna R.; Wang, Weile; Michaelis, Andrew; Votava, Petr; Ganguly, Sangram
2016-01-01
The NASA Earth Exchange (NEX) is a collaborative computing platform that has been developed with the objective of bringing scientists together with the software tools, massive global datasets, and supercomputing resources necessary to accelerate research in Earth systems science and global change. NEX is funded as an enabling tool for sustaining the national climate assessment. Over the past five years, researchers have used the NEX platform and produced a number of data sets highly relevant to the National Climate Assessment. These include high-resolution climate projections using different downscaling techniques and trends in historical climate from satellite data. To enable a broader community in exploiting the above datasets, the NEX team partnered with public cloud providers to create the OpenNEX platform. OpenNEX provides ready access to NEX data holdings on a number of public cloud platforms along with pertinent analysis tools and workflows in the form of Machine Images and Docker Containers, lectures and tutorials by experts. We will showcase some of the applications of OpenNEX data and tools by the community on Amazon Web Services, Google Cloud and the NEX Sandbox.
Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein
2017-01-01
An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled "biosigdata.com." It is a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users can download the datasets and can also share their own supplementary materials while maintaining their privacy (citation and fee). Commenting is also available for all datasets, and an automatic sitemap and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary file (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).
Collaboration-Centred Cities through Urban Apps Based on Open and User-Generated Data.
Aguilera, Unai; López-de-Ipiña, Diego; Pérez, Jorge
2016-07-01
This paper describes the IES Cities platform conceived to streamline the development of urban apps that combine heterogeneous datasets provided by diverse entities, namely, government, citizens, sensor infrastructure and other information data sources. This work pursues the challenge of achieving effective citizen collaboration by empowering them to prosume urban data across time. Particularly, this paper focuses on the query mapper; a key component of the IES Cities platform devised to democratize the development of open data-based mobile urban apps. This component allows developers not only to use available data, but also to contribute to existing datasets with the execution of SQL sentences. In addition, the component allows developers to create ad hoc storages for their applications, publishable as new datasets accessible by other consumers. As multiple users could be contributing and using a dataset, our solution also provides a data level permission mechanism to control how the platform manages the access to its datasets. We have evaluated the advantages brought forward by IES Cities from the developers' perspective by describing an exemplary urban app created on top of it. In addition, we include an evaluation of the main functionalities of the query mapper.
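The query mapper's combination of SQL access and data-level permissions can be sketched as follows; the app names, dataset, and permission map are invented for illustration, and the real component works against the platform's own storage rather than an in-memory SQLite database:

```python
# Apps read and write urban datasets through SQL, gated by a
# per-(app, dataset) permission map.
import sqlite3

PERMISSIONS = {
    ("citizen_app", "noise_reports"): {"read", "write"},
    ("viewer_app", "noise_reports"): {"read"},
}

def run(app, dataset, sql, params=(), conn=None):
    """Execute SQL on a dataset only if the app holds the right."""
    needed = "read" if sql.lstrip().upper().startswith("SELECT") else "write"
    if needed not in PERMISSIONS.get((app, dataset), set()):
        raise PermissionError(f"{app} may not {needed} {dataset}")
    cur = conn.execute(sql, params)
    conn.commit()
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE noise_reports (location TEXT, db REAL)")
run("citizen_app", "noise_reports",
    "INSERT INTO noise_reports VALUES (?, ?)", ("plaza", 71.5), conn)
rows = run("viewer_app", "noise_reports",
           "SELECT location, db FROM noise_reports", conn=conn)
```

Here the citizen app both contributes and reads data, while the viewer app can only read, mirroring the prosumer-versus-consumer distinction the platform draws.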
Schofield, E C; Carver, T; Achuthan, P; Freire-Pritchett, P; Spivakov, M; Todd, J A; Burren, O S
2016-08-15
Promoter capture Hi-C (PCHi-C) allows the genome-wide interrogation of physical interactions between distal DNA regulatory elements and gene promoters in multiple tissue contexts. Visual integration of the resultant chromosome interaction maps with other sources of genomic annotations can provide insight into underlying regulatory mechanisms. We have developed Capture HiC Plotter (CHiCP), a web-based tool that allows interactive exploration of PCHi-C interaction maps and integration with both public and user-defined genomic datasets. CHiCP is freely accessible from www.chicp.org and supports most major HTML5-compliant web browsers. Full source code and installation instructions are available from http://github.com/D-I-L/django-chicp. Contact: ob219@cam.ac.uk. © The Author 2016. Published by Oxford University Press. All rights reserved.
EnviroAtlas - Austin, TX - Residents with Minimal Potential Window Views of Trees by Block Group
This EnviroAtlas dataset shows the total block group population and the percentage of the block group population that has little access to potential window views of trees at home. Having little potential access to window views of trees is defined as having no trees & forest land cover within 50 meters. The window views are considered potential because the procedure does not account for presence or directionality of windows in one's home. Forest is defined as Trees & Forest. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
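The 50-meter rule described above can be illustrated with a toy computation (coordinates, in meters, and the tree/home locations are invented; the actual EnviroAtlas procedure works from land cover rasters, not point lists):

```python
# A home has "little potential access" to window views of trees
# when no tree cover falls within 50 meters of it.
import math

def lacks_tree_view(home, trees, radius=50.0):
    """True if no tree lies within `radius` meters of the home."""
    return all(math.dist(home, t) > radius for t in trees)

trees = [(10.0, 10.0), (200.0, 200.0)]
homes = [(20.0, 20.0), (120.0, 120.0)]
share_without = sum(lacks_tree_view(h, trees) for h in homes) / len(homes)
```

Aggregating this per-home flag over everyone living in a block group yields the percentage the dataset reports.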
Data Interactive Publications Revisited
NASA Astrophysics Data System (ADS)
Domenico, B.; Weber, W. J.
2011-12-01
A few years back, the authors presented examples of online documents that allowed the reader to interact directly with datasets, but there were limitations that restricted the interaction to specific desktop analysis and display tools that were not generally available to all readers of the documents. Recent advances in web service technology and related standards are making it possible to develop systems for publishing online documents that enable readers to access, analyze, and display the data discussed in the publication from the perspective and in the manner from which the author wants it to be represented. By clicking on embedded links, the reader accesses not only the usual textual information in a publication, but also data residing on a local or remote web server as well as a set of processing tools for analyzing and displaying the data. With the option of having the analysis and display processing provided on the server, there are now a broader set of possibilities on the client side where the reader can interact with the data via a thin web client, a rich desktop application, or a mobile platform "app." The presentation will outline the architecture of data interactive publications along with illustrative examples.
Murphy, David J; Rubinson, Lewis; Blum, James; Isakov, Alexander; Bhagwanjee, Satish; Cairns, Charles B; Cobb, J Perren; Sevransky, Jonathan E
2015-11-01
In developed countries, public health systems have become adept at rapidly identifying the etiology and impact of public health emergencies. However, within the time course of clinical responses, shortfalls in readily analyzable patient-level data limit capabilities to understand clinical course, predict outcomes, ensure resource availability, and evaluate the effectiveness of diagnostic and therapeutic strategies for seriously ill and injured patients. To be useful in the timeline of a public health emergency, multi-institutional clinical investigation systems must be in place to rapidly collect, analyze, and disseminate detailed clinical information regarding patients across prehospital, emergency department, and acute care hospital settings, including ICUs. As an initial step to near real-time clinical learning during public health emergencies, we sought to develop an "all-hazards" core dataset to characterize serious illness and injuries and the resource requirements for acute medical response across the care continuum. The dataset was developed by a multidisciplinary panel of clinicians, public health professionals, and researchers with expertise in public health emergencies, using a group consensus process. The consensus process included regularly scheduled conference calls, electronic communications, and an in-person meeting to generate candidate variables. Candidate variables were then reviewed by the group against the competing criteria of utility and feasibility, resulting in the core dataset. The 40-member panel generated 215 candidate variables for potential dataset inclusion. The final dataset includes 140 patient-level variables in the domains of demographics and anthropometrics (7), prehospital (11), emergency department (13), diagnosis (8), severity of illness (54), medications and interventions (38), and outcomes (9).
The resulting all-hazard core dataset for seriously ill and injured persons provides a foundation to facilitate rapid collection, analyses, and dissemination of information necessary for clinicians, public health officials, and policymakers to optimize public health emergency response. Further work is needed to validate the effectiveness of the dataset in a variety of emergency settings.
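The domain structure reported above can be captured directly, and the per-domain counts checked against the stated total of 140 patient-level variables (the dictionary keys are shortened labels for the domains named in the abstract):

```python
# Per-domain variable counts from the all-hazards core dataset.
CORE_DATASET_DOMAINS = {
    "demographics_anthropometrics": 7,
    "prehospital": 11,
    "emergency_department": 13,
    "diagnosis": 8,
    "severity_of_illness": 54,
    "medications_interventions": 38,
    "outcomes": 9,
}
total_variables = sum(CORE_DATASET_DOMAINS.values())  # 140
```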
The NCAR Research Data Archive's Hybrid Approach for Data Discovery and Access
NASA Astrophysics Data System (ADS)
Schuster, D.; Worley, S. J.
2013-12-01
The NCAR Research Data Archive (RDA, http://rda.ucar.edu) maintains a variety of data discovery and access capabilities for its 600+ dataset collections to support the varying needs of a diverse user community. In-house developed and standards-based community tools offer services to more than 10,000 users annually. By number of users, the largest group is external and accesses the RDA through web-based protocols; the internal NCAR HPC users are fewer in number but typically access more data volume. This paper will detail the data discovery and access services maintained by the RDA to support both user groups, and show metrics that illustrate how the community is using the services. The distributed search capability enabled by standards-based community tools, such as Geoportal and an OAI-PMH access point that serves multiple metadata standards, provides pathways for external users to initially discover RDA holdings. From here, in-house developed web interfaces leverage primary discovery-level metadata databases that support keyword and faceted searches. Internal NCAR HPC users, or those familiar with the RDA, may go directly to the dataset collection of interest and refine their search based on rich file collection metadata. Multiple levels of metadata have proven to be invaluable for discovery within terabyte-sized archives composed of many atmospheric or oceanic levels, hundreds of parameters, and often numerous grid and time resolutions. Once users find the data they want, their access needs may vary as well. A THREDDS data server running on targeted dataset collections enables remote file access through OPeNDAP and other web-based protocols, primarily for external users. In-house developed tools give all users the capability to submit data subset extraction and format conversion requests through scalable, HPC-based delayed-mode batch processing.
Users can monitor their RDA-based data processing progress and receive instructions on how to access the data when it is ready. External users are provided with RDA server generated scripts to download the resulting request output. Similarly they can download native dataset collection files or partial files using Wget or cURL based scripts supplied by the RDA server. Internal users can access the resulting request output or native dataset collection files directly from centralized file systems.
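The server-generated Wget or cURL scripts mentioned above are essentially lists of download commands. A minimal Python sketch of extracting the file URLs from such a script; the script layout and URLs here are illustrative, not the actual RDA format:

```python
import shlex

def extract_wget_urls(script_text):
    """Pull download URLs out of a Wget-based shell script.

    Assumes each download line invokes wget with the URL as an argument,
    which matches the general shape of server-generated scripts (the
    exact RDA script format may differ).
    """
    urls = []
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("wget"):
            continue
        for token in shlex.split(line):
            if token.startswith(("http://", "https://")):
                urls.append(token)
    return urls

# Hypothetical script body; real RDA scripts also embed authentication.
script = """#!/bin/sh
wget -N https://rda.example.org/data/ds083.2/file1.grib2
wget -N https://rda.example.org/data/ds083.2/file2.grib2
"""
print(extract_wget_urls(script))
```

Option tokens such as `-N` are simply ignored because they do not start with `http(s)`; a real script's authentication flags would pass through the same way.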
Nanomaterial datasets to advance tomography in scanning transmission electron microscopy
Levin, Barnaby D. A.; Padgett, Elliot; Chen, Chien-Chun; ...
2016-06-07
Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data.
Nanomaterial datasets to advance tomography in scanning transmission electron microscopy.
Levin, Barnaby D A; Padgett, Elliot; Chen, Chien-Chun; Scott, M C; Xu, Rui; Theis, Wolfgang; Jiang, Yi; Yang, Yongsoo; Ophus, Colin; Zhang, Haitao; Ha, Don-Hyung; Wang, Deli; Yu, Yingchao; Abruña, Hector D; Robinson, Richard D; Ercius, Peter; Kourkoutis, Lena F; Miao, Jianwei; Muller, David A; Hovden, Robert
2016-06-07
Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data.
Nanomaterial datasets to advance tomography in scanning transmission electron microscopy
Levin, Barnaby D.A.; Padgett, Elliot; Chen, Chien-Chun; Scott, M.C.; Xu, Rui; Theis, Wolfgang; Jiang, Yi; Yang, Yongsoo; Ophus, Colin; Zhang, Haitao; Ha, Don-Hyung; Wang, Deli; Yu, Yingchao; Abruña, Hector D.; Robinson, Richard D.; Ercius, Peter; Kourkoutis, Lena F.; Miao, Jianwei; Muller, David A.; Hovden, Robert
2016-01-01
Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data. PMID:27272459
NASA Astrophysics Data System (ADS)
Valentine, G. A.
2012-12-01
VHub (short for VolcanoHub, accessible at vhub.org) is an online platform for collaboration in research and training related to volcanoes, the hazards they pose, and risk mitigation. The underlying concept is to provide a mechanism that enables workers to share information with colleagues around the globe; VHub and similar hub technologies could prove very powerful in collaborating and communicating about circum-Pacific volcanic hazards. Collaboration occurs around several different points: (1) modeling and simulation; (2) data sharing; (3) education and training; (4) volcano observatories; and (5) project-specific groups. VHub promotes modeling and simulation in two ways. First, some models can be implemented on VHub for online execution, which eliminates the need to download and compile a code on a local computer; VHub can provide a central "warehouse" for such models that should result in broader dissemination. Second, VHub provides a platform that supports the more complex CFD models by enabling the sharing of code-development and problem-solving knowledge, benchmarking datasets, and the development of validation exercises. VHub also provides a platform for the sharing of data and datasets. The VHub development team is implementing the iRODS data-sharing middleware (see irods.org). iRODS allows a researcher to access data located at participating data sources around the world (a "cloud" of data) as if the data were housed in a single virtual database. Education and training is another important use of the VHub platform. Audio-video recordings of seminars, PowerPoint slide sets, and educational simulations are all items that can be placed onto VHub for use by the community or by selected collaborators.
An important point is that the "manager" of a given educational resource (or any other resource, such as a dataset or a model) can control the privacy of that resource, ranging from private (only accessible by, and known to, specific collaborators) to completely public. Materials for use in the classroom can be shared via VHub. VHub is also a very useful platform for project-specific collaborations: a group site on VHub gives collaborators a place to share documents, datasets, and maps, and to hold ongoing discussions using the discussion board function. VHub is funded by the U.S. National Science Foundation, and is participating in the development of larger earth-science cyberinfrastructure initiatives (EarthCube), as well as supporting efforts such as the Global Volcano Model.
Data You May Like: A Recommender System for Research Data Discovery
NASA Astrophysics Data System (ADS)
Devaraju, A.; Davy, R.; Hogan, D.
2016-12-01
Various data portals have been developed to facilitate access to research datasets from different sources, for example, the Data Publisher for Earth & Environmental Science (PANGAEA), the Registry of Research Data Repositories (re3data.org), and the National Geoscience Data Centre (NGDC). Due to data quantity and heterogeneity, finding relevant datasets on these portals may be difficult and tedious. Keyword searches based on specific metadata elements or multi-key indexes may return irrelevant results, and faceted searches may be unsatisfactory and time consuming, especially when facet values are exhaustive. We need a much more intelligent way to complement existing search mechanisms in order to enhance the user experience of the data portals. We developed a recommender system that helps users find the most relevant research datasets on CSIRO's Data Access Portal (DAP). The system is based on content-based filtering: we computed the similarity of datasets based on data attributes (e.g., descriptions, fields of research, location, contributors, and provenance) and on inference from transaction logs (e.g., the relations among datasets and between queries and datasets). We improved the recommendation quality by assigning weights to the data similarities, with weight values drawn from a survey involving data users. The recommender results for a given dataset are accessible programmatically via a web service. Taking both data attributes and user actions into account, the recommender system will make it easier for researchers to find and reuse data offered through the data portal.
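A content-based filter of the kind described can be sketched as a weighted sum of per-field text similarities. The field names and weight values below are illustrative only; the paper derives its actual weights from a user survey:

```python
import math
from collections import Counter

def tokens(text):
    # Bag-of-words vector for one metadata field.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative weights; the real values come from the user survey.
WEIGHTS = {"description": 0.5, "fields_of_research": 0.3, "contributors": 0.2}

def dataset_similarity(d1, d2):
    """Weighted sum of per-field cosine similarities between two datasets."""
    return sum(w * cosine(tokens(d1.get(f, "")), tokens(d2.get(f, "")))
               for f, w in WEIGHTS.items())

a = {"description": "sea surface temperature grids",
     "fields_of_research": "oceanography", "contributors": "smith"}
b = {"description": "sea surface salinity grids",
     "fields_of_research": "oceanography", "contributors": "jones"}
print(dataset_similarity(a, b))
```

Because the weights sum to 1, an identical pair scores exactly 1.0; ranking candidates by this score gives the "data you may like" list for a seed dataset.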
Accessing Multi-Dimensional Images and Data Cubes in the Virtual Observatory
NASA Astrophysics Data System (ADS)
Tody, Douglas; Plante, R. L.; Berriman, G. B.; Cresitello-Dittmar, M.; Good, J.; Graham, M.; Greene, G.; Hanisch, R. J.; Jenness, T.; Lazio, J.; Norris, P.; Pevunova, O.; Rots, A. H.
2014-01-01
Telescopes across the spectrum are routinely producing multi-dimensional images and datasets, such as Doppler velocity cubes, polarization datasets, and time-resolved “movies.” Examples of current telescopes producing such multi-dimensional images include the JVLA, ALMA, and the IFU instruments on large optical and near-infrared wavelength telescopes. In the near future, both the LSST and JWST will also produce such multi-dimensional images routinely. High-energy instruments such as Chandra produce event datasets that are also a form of multi-dimensional data, in effect being a very sparse multi-dimensional image. Ensuring that the data sets produced by these telescopes can be both discovered and accessed by the community is essential and is part of the mission of the Virtual Observatory (VO). The Virtual Astronomical Observatory (VAO, http://www.usvao.org/), in conjunction with its international partners in the International Virtual Observatory Alliance (IVOA), has developed a protocol and an initial demonstration service designed for the publication, discovery, and access of arbitrarily large multi-dimensional images. The protocol describing multi-dimensional images is the Simple Image Access Protocol, version 2, which provides the minimal set of metadata required to characterize a multi-dimensional image for its discovery and access. A companion Image Data Model formally defines the semantics and structure of multi-dimensional images independently of how they are serialized, while providing capabilities such as support for sparse data that are essential to deal effectively with large cubes. A prototype data access service has been deployed and tested, using a suite of multi-dimensional images from a variety of telescopes. The prototype has demonstrated the capability to discover and remotely access multi-dimensional data via standard VO protocols. 
The prototype informs the specification of a protocol that will be submitted to the IVOA for approval, with an operational data cube service to be delivered in mid-2014. An associated user-installable VO data service framework will provide the capabilities required to publish VO-compatible multi-dimensional images or data cubes.
Field of genes: using Apache Kafka as a bioinformatic data repository.
Lawlor, Brendan; Lynch, Richard; Mac Aogáin, Micheál; Walsh, Paul
2018-04-01
Bioinformatic research is increasingly dependent on large-scale datasets, accessed either from private or public repositories. An example of a public repository is National Center for Biotechnology Information's (NCBI's) Reference Sequence (RefSeq). These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI's RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files. The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files. Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets such as RefSeq and for private clinical and experimental data.
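One way to picture the "lightly structured" representation is one small JSON message per sequence record, keyed by its identifier so the topic can be partitioned and consumed in parallel. The sketch below covers only the record-to-message step; producing to a broker (shown as comments, using the hypothetical topic name "refseq") would require a running Kafka cluster and a client library such as kafka-python:

```python
import json

def fasta_to_messages(fasta_text):
    """Turn FASTA text into (key, value) pairs for a Kafka topic.

    The key is the record ID (useful for partitioning); the value is a
    small JSON document -- "lightly structured" in the sense above.
    """
    messages, header, seq = [], None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if header is not None:
                messages.append((header, json.dumps({"id": header, "seq": "".join(seq)})))
            header, seq = line[1:].split()[0], []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        messages.append((header, json.dumps({"id": header, "seq": "".join(seq)})))
    return messages

msgs = fasta_to_messages(">A1 some description\nACGT\nAC\n>B2\nGGG\n")
print([k for k, _ in msgs])  # ['A1', 'B2']

# With a running broker, each pair would then be produced, e.g.:
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   for key, value in msgs:
#       producer.send("refseq", key=key.encode(), value=value.encode())
```

Consumers can stream these messages as-is or transform them into more highly structured stores, which is the flexibility the abstract argues for.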
Toward a complete dataset of drug-drug interaction information from publicly available sources.
Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D
2015-06-01
Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and by specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol.
Future work will focus on improvement of the methods for mapping between PDDI information sources, identifying methods to improve the use of the merged dataset in PDDI NLP algorithms, integrating high-quality PDDI information from the merged dataset into Wikidata, and making the combined dataset accessible as Semantic Web Linked Data. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
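The overlap analysis above reduces to set operations on normalized drug pairs, since a PDDI is unordered: (A, B) and (B, A) are the same interaction. A small illustrative sketch (the drug pairs are made up, not entries from the merged dataset):

```python
def norm_pairs(pairs):
    # A PDDI is an unordered drug pair; sort each so (A, B) == (B, A).
    return {tuple(sorted(p)) for p in pairs}

def overlap(source_a, source_b):
    """Fraction of source_a's interactions that also appear in source_b."""
    a, b = norm_pairs(source_a), norm_pairs(source_b)
    return len(a & b) / len(a) if a else 0.0

# Made-up example lists, not real source contents.
drugbank = [("warfarin", "aspirin"), ("simvastatin", "amiodarone")]
kegg = [("aspirin", "warfarin")]
print(overlap(drugbank, kegg))  # 0.5
print(overlap(kegg, drugbank))  # 1.0
```

Note the measure is asymmetric, which matters when comparing a comprehensive list against a small clinically focused one. In practice the hard part is the drug-name mapping between sources, which this sketch assumes has already been done.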
Collaboration-Centred Cities through Urban Apps Based on Open and User-Generated Data
Aguilera, Unai; López-de-Ipiña, Diego; Pérez, Jorge
2016-01-01
This paper describes the IES Cities platform, conceived to streamline the development of urban apps that combine heterogeneous datasets provided by diverse entities, namely government, citizens, sensor infrastructure and other information sources. This work pursues the challenge of achieving effective citizen collaboration by empowering citizens to prosume urban data over time. In particular, this paper focuses on the query mapper, a key component of the IES Cities platform devised to democratize the development of open-data-based mobile urban apps. This component allows developers not only to use available data, but also to contribute to existing datasets through the execution of SQL sentences. In addition, the component allows developers to create ad hoc storage for their applications, publishable as new datasets accessible to other consumers. As multiple users could be contributing to and using a dataset, our solution also provides a data-level permission mechanism to control how the platform manages access to its datasets. We have evaluated the advantages brought forward by IES Cities from the developers' perspective by describing an exemplary urban app created on top of it. In addition, we include an evaluation of the main functionalities of the query mapper. PMID:27376300
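A data-level permission mechanism of the kind described can be sketched as a check performed before any SQL sentence is executed. The dataset, users, and permission model below are hypothetical simplifications, not the query mapper's actual API:

```python
import sqlite3

# Hypothetical per-dataset permissions: dataset -> user -> allowed operations.
PERMISSIONS = {
    "noise_reports": {"alice": {"select", "insert"}, "bob": {"select"}},
}

def run_query(conn, user, dataset, sql, params=()):
    """Execute an SQL sentence only if the user holds the needed permission."""
    op = sql.strip().split()[0].lower()
    if op not in PERMISSIONS.get(dataset, {}).get(user, set()):
        raise PermissionError(f"{user} may not {op} on {dataset}")
    cur = conn.execute(sql, params)
    conn.commit()
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE noise_reports (street TEXT, level REAL)")
run_query(conn, "alice", "noise_reports",
          "INSERT INTO noise_reports VALUES (?, ?)", ("Main St", 72.5))
print(run_query(conn, "bob", "noise_reports", "SELECT * FROM noise_reports"))
# A write attempt by bob would raise PermissionError.
```

This captures the "prosumer" idea: the same SQL interface serves both consumption and contribution, with the platform deciding per user and per dataset which verbs are allowed.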
MONTRA: An agile architecture for data publishing and discovery.
Bastião Silva, Luís; Trifan, Alina; Luís Oliveira, José
2018-07-01
Data catalogues are a common form of capturing and presenting information about a specific kind of entity (e.g. products, services, professionals, datasets, etc.). However, the construction of a web-based catalogue for a particular scenario normally implies the development of a specific and dedicated solution. In this paper, we present MONTRA, a rapid-application-development framework designed to facilitate the integration and discovery of heterogeneous objects, which may be characterized by distinct data structures. MONTRA was developed following a plugin-based architecture to allow dynamic composition of services over the represented datasets. The core of MONTRA's functionality resides in a flexible data skeleton used to characterize data entities, from which a fully-fledged web data catalogue is automatically generated, ensuring access control and data privacy. MONTRA is being successfully used by several European projects to collect and manage biomedical databases; in this paper, we describe three of these application scenarios. This work was motivated by the plethora of geographically scattered biomedical repositories, and by the role they can play together in the understanding of diseases and of the real-world effectiveness of treatments. Using metadata to expose datasets' characteristics, MONTRA greatly simplifies the task of building data catalogues. The source code is publicly available at https://github.com/bioinformatics-ua/montra. Copyright © 2018 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Li, J.; Zhang, T.; Huang, Q.; Liu, Q.
2014-12-01
Today's climate datasets are characterized by large volume and a high degree of spatiotemporal complexity, and they evolve quickly over time. Because visualizing large-volume, distributed climate datasets is computationally intensive, traditional desktop-based visualization applications fail to handle the computational load. Recently, scientists have developed remote visualization techniques to address this issue. Remote visualization techniques usually leverage server-side parallel computing capabilities to perform visualization tasks and deliver the results to clients over the network. In this research, we aim to build a remote parallel visualization platform for visualizing and analyzing massive climate data. Our visualization platform was built on ParaView, one of the most popular open-source remote visualization and analysis applications. To further enhance the scalability and stability of the platform, we employed cloud computing techniques to support its deployment. In this platform, all climate datasets are regular-grid data stored in NetCDF format. Three types of data access are supported: accessing remote datasets provided by OPeNDAP servers, accessing datasets hosted on the web visualization server, and accessing local datasets. Regardless of the access method, all visualization tasks are completed on the server side to reduce the workload of clients. As a proof of concept, we have implemented a set of scientific visualization methods to show the feasibility of the platform. Preliminary results indicate that the framework can address the computational limitations of desktop-based visualization applications.
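The server-side work on regular-grid data typically starts with subset extraction over a bounding box. A pure-Python sketch of that step (the platform itself delegates such work to ParaView and OPeNDAP servers; the grid layout here is illustrative):

```python
def subset(grid, lat_min, lat_max, lon_min, lon_max):
    """Extract a lat/lon bounding-box subset from a regular grid.

    grid: {"lats": [...], "lons": [...], "values": 2-D list indexed [lat][lon]}
    Mimics the kind of subset request a client would send so that only
    the reduced data travels over the network.
    """
    li = [i for i, la in enumerate(grid["lats"]) if lat_min <= la <= lat_max]
    lj = [j for j, lo in enumerate(grid["lons"]) if lon_min <= lo <= lon_max]
    return {"lats": [grid["lats"][i] for i in li],
            "lons": [grid["lons"][j] for j in lj],
            "values": [[grid["values"][i][j] for j in lj] for i in li]}

grid = {"lats": [0.0, 10.0, 20.0],
        "lons": [100.0, 110.0, 120.0],
        "values": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]}
print(subset(grid, 0.0, 10.0, 110.0, 120.0)["values"])  # [[2, 3], [5, 6]]
```

Performing this reduction where the data lives is the whole point of the remote approach: the client receives a small slice instead of the full NetCDF file.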
Scrubchem: Building Bioactivity Datasets from Pubchem ...
The PubChem Bioassay database is a non-curated public repository with data from 64 sources, including ChEMBL, BindingDB, DrugBank, EPA Tox21, the NIH Molecular Libraries Screening Program, and various other academic, government, and industrial contributors. Methods for extracting this public data into quality datasets, usable for analytical research, present several big-data challenges for which we have designed manageable solutions. According to our preliminary work, there are approximately 549 million bioactivity values and related metadata within PubChem that can be mapped to over 10,000 biological targets. However, this data is not ready for use in data-driven research, mainly due to a lack of structured annotations. We used a pragmatic approach that provides increasing access to bioactivity values in the PubChem Bioassay database. This included restructuring individual PubChem Bioassay files into a relational database (ScrubChem). ScrubChem contains all primary PubChem Bioassay data, which was reparsed; error-corrected (when applicable); enriched with additional data links from other NCBI databases; and improved by adding key biological and assay annotations derived from logic-based language-processing rules. The utility of ScrubChem and the curation process were illustrated using an example bioactivity dataset for the androgen receptor protein. This initial work serves as a trial ground for establishing the technical framework for accessing, integrating, cu
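The restructuring step can be pictured as flattening per-assay records into rows of a relational table, after which building a target-specific dataset is a single query. The columns and example records below are hypothetical simplifications, not ScrubChem's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE bioactivity (
    aid INTEGER, cid INTEGER, target TEXT, outcome TEXT, value REAL)""")

# Hypothetical parsed records; real PubChem Bioassay files are large
# JSON/XML documents needing reparsing, error correction and annotation.
records = [
    {"aid": 743053, "cid": 2244, "target": "androgen receptor",
     "outcome": "Active", "value": 1.3},
    {"aid": 743053, "cid": 3672, "target": "androgen receptor",
     "outcome": "Inactive", "value": None},
]
conn.executemany(
    "INSERT INTO bioactivity VALUES (:aid, :cid, :target, :outcome, :value)",
    records)

# Once relational, assembling a per-target bioactivity dataset is one query.
actives = conn.execute(
    "SELECT cid FROM bioactivity WHERE target = ? AND outcome = 'Active'",
    ("androgen receptor",)).fetchall()
print(actives)  # [(2244,)]
```

The same pattern extends to joins against annotation tables, which is where the added biological and assay annotations pay off.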
Data publication activities in the Natural Environment Research Council
NASA Astrophysics Data System (ADS)
Leadbetter, A.; Callaghan, S.; Lowry, R.; Moncoiffé, G.; Donnegan, S.; Pepler, S.; Cunningham, N.; Kirsch, P.; Ault, L.; Bell, P.; Bowie, R.; Harrison, K.; Smith-Haddon, B.; Wetherby, A.; Wright, D.; Thorley, M.
2012-04-01
The Natural Environment Research Council (NERC) is implementing its Science Information Strategy in order to provide a world-class service to deliver integrated data for earth system science. One project within this strategy is Data Citation and Publication, which aims to put the promotion and recognition stages of the data lifecycle into place alongside the traditional data management activities of NERC's Environmental Data Centres (EDCs). The NERC EDCs have made a distinction between the serving of data and its publication. Data serving is defined in this case as the day-to-day data management tasks of:
• acquiring data and metadata from the originating scientists;
• metadata and format harmonisation prior to database ingestion;
• ensuring the metadata is adequate and accurate and that the data are available in appropriate file formats;
• and making the data available for interested parties.
Whereas publication:
• requires the assignment of a digital object identifier to a dataset, which guarantees that an EDC has assessed the quality of the metadata and the file format and will maintain an unchanged version of the data for the foreseeable future;
• requires the peer-review of the scientific quality of the data by a scientist with knowledge of the scientific domain in which the data were collected, using a framework for peer-review of datasets such as that developed by the CLADDIER project;
• requires collaboration with journal publishers, who have access to a well-established peer-review system.
The first of these requirements can be managed in-house by the EDCs, while the remainder require collaboration with the wider scientific and publishing communities.
It is anticipated that a scientist may achieve a lower level of academic credit for a dataset which is assigned a DOI but does not follow through to the scientific peer-review stage, similar to publication in a report or other non-peer reviewed publication normally described as grey literature, or in a conference proceedings. At the time of writing, the project has successfully assigned DOIs to more than ten legacy datasets held by EDCs through the British Library acting on behalf of the DataCite network. The project is in the process of developing guidelines for which datasets are suitable for submission to an EDC by a scientist wishing to receive a DOI for their data. While maintaining a United Kingdom focus, this project is not operating in isolation as its members are working alongside international groups such as the CODATA-ICSTI Task Group on Data Citations, the DataCite Working Group on Criteria for Datacentres, and the joint Scientific Commission for Oceanography / International Oceanographic Data and Information Exchange / Marine Biological Laboratory, Woods Hole Oceanographic Institution Library working group on data publication.
Tracking Research Data Footprints via Integration with Research Graph
NASA Astrophysics Data System (ADS)
Evans, B. J. K.; Wang, J.; Aryani, A.; Conlon, M.; Wyborn, L. A.; Choudhury, S. A.
2017-12-01
The researcher of today is likely to be part of a team that will use subsets of data from at least one, if not more, external repositories, and that same data could be used by multiple researchers for many different purposes. At best, the repositories that host this data will know who is accessing their data, but rarely what they are using it for, with the result that the funders of data-collection programs and the repositories that store the data are unlikely to know: 1) which research funding contributed to the collection and preservation of a dataset, and 2) which data contributed to high-impact research and publications. In times of funding shortages, there is a growing need to be able to trace the footprint of a dataset from the originator that collected the data to the repository that stores the data and ultimately to any derived publications. The Research Data Alliance's Data Description Registry Interoperability Working Group (DDRIWG) has addressed this problem through the development of a distributed graph, called Research Graph, that can map each piece of the research-interaction puzzle by building aggregated graphs. It can connect datasets on the basis of co-authorship or other collaboration models, such as joint funding and grants, and can connect research datasets, publications, grants and researcher profiles across research repositories and infrastructures such as DataCite and ORCID. National Computational Infrastructure (NCI) in Australia is one of the early adopters of Research Graph. The graphic view and quantitative analysis help NCI track the usage of their national reference data collections, thus quantifying the role that these NCI-hosted data assets play within the funding-researcher-data-publication cycle. The graph can unlock the complex interactions of research projects by tracking the contribution of datasets, the various funding bodies and the downstream data users.
The RMap Project is a similar initiative that aims to capture complex relationships among scholarly publications and their underlying data, including IEEE publications. It is hoped that RMap and Research Graph can be combined in the near future, and that physical samples will also be added to Research Graph.
BCO-DMO: Enabling Access to Federally Funded Research Data
NASA Astrophysics Data System (ADS)
Kinkade, D.; Allison, M. D.; Chandler, C. L.; Groman, R. C.; Rauch, S.; Shepherd, A.; Gegg, S. R.; Wiebe, P. H.; Glover, D. M.
2013-12-01
In a February, 2013 memo1, the White House Office of Science and Technology Policy (OSTP) outlined principles and objectives to increase access by the public to federally funded research publications and data. Such access is intended to drive innovation by allowing private and commercial efforts to take full advantage of existing resources, thereby maximizing Federal research dollars and efforts. The Biological and Chemical Oceanography Data Management Office (BCO-DMO; bco-dmo.org) serves as a model resource for organizations seeking compliance with the OSTP policy. BCO-DMO works closely with scientific investigators to publish their data from research projects funded by the National Science Foundation (NSF), within the Biological and Chemical Oceanography Sections (OCE) and the Division of Polar Programs Antarctic Organisms & Ecosystems Program (PLR). BCO-DMO addresses many of the OSTP objectives for public access to digital scientific data: (1) Marine biogeochemical and ecological data and metadata are disseminated via a public website, and curated on intermediate time frames; (2) Preservation needs are met by collaborating with appropriate national data facilities for data archive; (3) Cost and administrative burden associated with data management is minimized by the use of one dedicated office providing hundreds of NSF investigators support for data management plan development, data organization, metadata generation and deposition of data and metadata into the BCO-DMO repository; (4) Recognition of intellectual property is reinforced through the office's citation policy and the use of digital object identifiers (DOIs); (5) Education and training in data stewardship and use of the BCO-DMO system is provided by office staff through a variety of venues. Oceanographic research data and metadata from thousands of datasets generated by hundreds of investigators are now available through BCO-DMO. 
1 White House Office of Science and Technology Policy, Memorandum for the Heads of Executive Departments and Agencies: Increasing Access to the Results of Federally Funded Scientific Research, February 23, 2013. http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein
2017-01-01
An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled "biosigdata.com." It is a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users can download the datasets and can also share their own supplementary materials while maintaining their privacy (citation and fee). Commenting is also available for all datasets, and an automatic sitemap and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary file (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf). PMID:28487832
The Community Line Source (C-LINE) modeling system estimates emissions and dispersion of toxic air pollutants for roadways within the continental United States. It accesses publicly available traffic and meteorological datasets, and is optimized for use on community-sized areas (...
PGP repository: a plant phenomics and genomics data publication infrastructure
Arend, Daniel; Junker, Astrid; Scholz, Uwe; Schüler, Danuta; Wylie, Juliane; Lange, Matthias
2016-01-01
Plant genomics and phenomics represent the most promising tools for accelerating yield gains and overcoming emerging crop productivity bottlenecks. However, accessing this wealth of plant diversity requires the characterization of this material using state-of-the-art genomic, phenomic and molecular technologies and the release of subsequent research data via a long-term stable, open-access portal. Although several international consortia and public resource centres offer services for plant research data management, valuable digital assets remain unpublished and thus inaccessible to the scientific community. Recently, the Leibniz Institute of Plant Genetics and Crop Plant Research and the German Plant Phenotyping Network have jointly initiated the Plant Genomics and Phenomics Research Data Repository (PGP) as an infrastructure to comprehensively publish plant research data. This covers in particular cross-domain datasets that are not being published in central repositories because of their volume or unsupported data scope, such as image collections from plant phenotyping and microscopy, unfinished genomes, genotyping data, visualizations of morphological plant models, data from mass spectrometry, as well as software and documents. The repository is hosted at the Leibniz Institute of Plant Genetics and Crop Plant Research, using e!DAL as the software infrastructure and a Hierarchical Storage Management System as the data archival backend. A newly developed data submission tool was made available for the consortium that features a high level of automation to lower the barriers of data publication. After an internal review process, data are published with citable digital object identifiers (DOIs), and a core set of technical metadata is registered at DataCite. The e!DAL-embedded Web frontend generates a landing page for each dataset and supports interactive exploration.
PGP is registered as a research data repository at BioSharing.org, re3data.org and OpenAIRE as a valid EU Horizon 2020 open data archive. These features, together with the programmatic interface and the support of standard metadata formats, enable PGP to fulfil the FAIR data principles: findable, accessible, interoperable, reusable. Database URL: http://edal.ipk-gatersleben.de/repos/pgp/ PMID:27087305
NASA Astrophysics Data System (ADS)
Willmes, M.; McMorrow, L.; Kinsley, L.; Armstrong, R.; Aubert, M.; Eggins, S.; Falguères, C.; Maureille, B.; Moffat, I.; Grün, R.
2013-11-01
Strontium isotope ratios (87Sr/86Sr) are a key geochemical tracer used in a wide range of fields including archaeology, ecology, food and forensic sciences. These applications are based on the principle that the Sr isotopic ratios of natural materials reflect the sources of strontium available during their formation. A major constraint for current studies is the lack of robust reference maps to evaluate the source of strontium isotope ratios measured in the samples. Here we provide a new dataset of bioavailable Sr isotope ratios for the major geologic units of France, based on plant and soil samples (Pangaea data repository doi:10.1594/PANGAEA.819142). The IRHUM (Isotopic Reconstruction of Human Migration) database is a web platform to access, explore and map our dataset. The database provides the spatial context and metadata for each sample, allowing the user to evaluate the suitability of the sample for their specific study. In addition, it allows users to upload and share their own datasets and data products, which will enhance collaboration across the different research fields. This article describes the sampling and analytical methods used to generate the dataset and explains how to access and use the dataset through the IRHUM database. Any interpretation of the isotope dataset is outside the scope of this publication.
Securely Measuring the Overlap between Private Datasets with Cryptosets
Swamidass, S. Joshua; Matlock, Matthew; Rozenblit, Leon
2015-01-01
Many scientific questions are best approached by sharing data—collected by different groups or across large collaborative networks—into a combined analysis. Unfortunately, some of the most interesting and powerful datasets—like health records, genetic data, and drug discovery data—cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset’s contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach “information-theoretic” security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure. PMID:25714898
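As a simplified sketch of the idea (not the authors' implementation), a cryptoset can be modeled as a fixed-length histogram of hashed identifiers; the overlap of two private datasets is then estimated from the dot product of their public histograms minus the contribution expected by chance. The bin count, hash choice, and identifiers below are illustrative assumptions.

```python
import hashlib

def cryptoset(ids, length=1000):
    """Public summary: a histogram of identifiers hashed into `length` bins."""
    bins = [0] * length
    for i in ids:
        h = int(hashlib.md5(i.encode()).hexdigest(), 16)
        bins[h % length] += 1
    return bins

def estimate_overlap(cs_a, cs_b, n_a, n_b):
    """Estimate |A ∩ B| from two cryptosets: the dot product of the
    histograms minus the expected contribution of non-shared items."""
    length = len(cs_a)
    dot = sum(a * b for a, b in zip(cs_a, cs_b))
    return dot - n_a * n_b / length

# Two hypothetical datasets of 1000 identifiers sharing 500
a = [f"id{i}" for i in range(1000)]
b = [f"id{i}" for i in range(500, 1500)]
est = estimate_overlap(cryptoset(a), cryptoset(b), len(a), len(b))
print(round(est))  # close to the true overlap of 500
```

The estimate is intentionally noisy: the same histograms that enable the overlap estimate reveal nothing about which individual identifiers are shared, which is the security property the paper formalizes.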
App-lifying USGS Earth Science Data: Engaging the public through Challenge.gov
NASA Astrophysics Data System (ADS)
Frame, M. T.
2013-12-01
With the goal of promoting innovative use and applications of USGS data, USGS Core Science Analytics and Synthesis (CSAS) launched the first USGS Challenge: App-lifying USGS Earth Science Data. While initiated before the Office of Science and Technology Policy's recent memorandum 'Increasing Access to the Results of Federally Funded Scientific Research', our challenge focused on one of the core tenets of the memorandum: expanding discoverability, accessibility and usability of CSAS data. From January 9 to April 1, 2013, we invited developers, information scientists, biologists/ecologists, and scientific data visualization specialists to create applications for selected USGS datasets. Identifying new, innovative ways to represent, apply, and make these data available is a high priority for our leadership. To help boost innovation, the only constraint we placed on challengers was that they must incorporate at least one of the identified datasets in their application. Winners were selected based on relevance to the USGS and CSAS missions, innovation in design, and overall ease of use of the application. The winner for Best Overall App was TaxaViewer by the rOpenSci group. TaxaViewer is a Web interface to a mashup of data from the USGS-sponsored interagency Integrated Taxonomic Information System (ITIS) and other data from the Phylotastic taxonomic Name service, the Global Invasive Species Database, Phylomatic, and the Global Biodiversity Information Facility. The Popular Choice App award, selected through a public vote on the submissions, went to the Species Comparison Tool by Kimberly Sparks of Raleigh, N.C., which allows users to explore the USGS Gap Analysis Program habitat distribution and/or range of two species concurrently. The application also incorporates ITIS data and provides external links to NatureServe species information.
Our results indicated that running a challenge was an effective method for promoting our data products and thereby improving accessibility. We had approximately 7,000 unique visitors to our challenge site and a corresponding 50% increase in visits to our CSAS Web site. Similarly, we saw an increase for some of our data products' Web sites. For instance, ScienceBase received three times more visits during the period of the challenge. Using the challenge as a test case for accessibility of our data, we identified improvements for making our datasets more accessible, identified new ways to integrate across our datasets, and increased the visibility of our program. Feedback we received from participants led us to form a Web Services Team to establish good governance through a best-practices approach to the data services for our national products. Because this is the first challenge that USGS has run, all of our documentation is available for others in the USGS to use in running their own challenges, hopefully leading to an increase in accessibility not just for CSAS but for all of USGS. In future challenges, we expect to focus more narrowly on specific natural resource questions.
Liu, Bin; Wu, Hao; Zhang, Deyuan; Wang, Xiaolong; Chou, Kuo-Chen
2017-02-21
To expedite the pace in conducting genome/proteome analysis, we have developed a Python package called Pse-Analysis. The powerful package can automatically complete the following five procedures: (1) sample feature extraction, (2) optimal parameter selection, (3) model training, (4) cross validation, and (5) evaluating prediction quality. All the work a user needs to do is to input a benchmark dataset along with the query biological sequences concerned. Based on the benchmark dataset, Pse-Analysis will automatically construct an ideal predictor, followed by yielding the predicted results for the submitted query samples. All the aforementioned tedious jobs can be automatically done by the computer. Moreover, the multiprocessing technique was adopted to enhance computational speed by about six-fold. The Pse-Analysis Python package is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/Pse-Analysis/, and can be directly run on Windows, Linux, and Unix.
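The five automated steps can be illustrated with a minimal, self-contained stand-in. This is not Pse-Analysis's actual API (the abstract does not give its function names): k-mer composition stands in for feature extraction, a nearest-centroid classifier for the package's predictors, leave-one-out evaluation for cross validation, and a tiny search over k for parameter selection.

```python
from collections import Counter
from itertools import product

def kmer_features(seq, k):
    """Normalised k-mer composition of a DNA sequence (feature extraction)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(1, len(seq) - k + 1)
    return [counts[m] / total for m in kmers]

def nearest_centroid_cv(seqs, labels, k):
    """Leave-one-out accuracy of a nearest-centroid classifier on k-mer features."""
    feats = [kmer_features(s, k) for s in seqs]
    correct = 0
    for i in range(len(seqs)):
        cents = {}
        for lab in set(labels):
            rows = [f for f, l, j in zip(feats, labels, range(len(seqs)))
                    if l == lab and j != i]          # hold out sample i
            cents[lab] = [sum(c) / len(rows) for c in zip(*rows)]
        pred = min(cents, key=lambda lab: sum((a - b) ** 2
                                              for a, b in zip(feats[i], cents[lab])))
        correct += pred == labels[i]
    return correct / len(seqs)

# Toy benchmark dataset (invented sequences and labels)
seqs = ["ATATATATAT", "ATATATTAAT", "GCGCGCGCGC", "GGCCGCGCGG"]
labels = ["AT-rich", "AT-rich", "GC-rich", "GC-rich"]
# "Optimal parameter selection": pick the k with the best cross-validated accuracy
best_k = max((1, 2), key=lambda k: nearest_centroid_cv(seqs, labels, k))
print(best_k, nearest_centroid_cv(seqs, labels, best_k))
```

The real package wraps exactly this kind of loop (plus multiprocessing) behind a single call; the sketch only shows how the five procedures fit together.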
NASA Astrophysics Data System (ADS)
Schmidt-Kloiber, Astrid; De Wever, Aaike; Bremerich, Vanessa; Strackbein, Jörg; Hering, Daniel; Jähnig, Sonja; Kiesel, Jens; Martens, Koen; Tockner, Klement
2017-04-01
Species distribution data is crucial for improving our understanding of biodiversity and its threats. This is especially the case for freshwater environments, which are heavily affected by the global biodiversity crisis. Currently, a huge body of freshwater biodiversity data is often difficult to access, because systematic data publishing practices have not yet been adopted by the freshwater research community. The Freshwater Information Platform (FIP; www.freshwaterplatform.eu) - initiated through the BioFresh project - aims at pooling freshwater related research information from a variety of projects and initiatives to make it easily accessible for scientists, water managers and conservationists as well as the interested public. It consists of several major components, three of which we want to specifically address: (1) The Freshwater Biodiversity Data Portal aims at mobilising freshwater biodiversity data and making them available online. Datasets in the portal are described and documented in the (2) Freshwater Metadatabase and published as open access articles in the Freshwater Metadata Journal. The use of collected datasets for large-scale analyses and models is demonstrated in the (3) Global Freshwater Biodiversity Atlas, which publishes interactive online maps featuring research results on freshwater biodiversity, resources, threats and conservation priorities. Here we present the main components of the FIP as tools to streamline open access freshwater data publication, arguing that this will improve the capacity to protect and manage freshwater biodiversity in the face of global change.
SEEG initiative estimates of Brazilian greenhouse gas emissions from 1970 to 2015.
de Azevedo, Tasso Rezende; Costa Junior, Ciniro; Brandão Junior, Amintas; Cremer, Marcelo Dos Santos; Piatto, Marina; Tsai, David Shiling; Barreto, Paulo; Martins, Heron; Sales, Márcio; Galuchi, Tharic; Rodrigues, Alessandro; Morgado, Renato; Ferreira, André Luis; Barcellos E Silva, Felipe; Viscondi, Gabriel de Freitas; Dos Santos, Karoline Costal; Cunha, Kamyla Borges da; Manetti, Andrea; Coluna, Iris Moura Esteves; Albuquerque, Igor Reis de; Junior, Shigueo Watanabe; Leite, Clauber; Kishinami, Roberto
2018-05-29
This work presents the SEEG platform, a 46-year-long dataset of greenhouse gas (GHG) emissions in Brazil (1970-2015) providing more than 2 million data records for the Agriculture, Energy, Industry, Waste and Land Use Change Sectors at national and subnational levels. The SEEG dataset was developed by the Climate Observatory, a Brazilian civil society initiative, based on the IPCC guidelines and Brazilian National Inventories embedded with country-specific emission factors and processes, raw data from multiple official and non-official sources, and organized together with social and economic indicators. Once completed, the SEEG dataset was converted into a spreadsheet format and shared via a web platform that, by means of simple queries, allows users to search data by emission sources and country- and state-level activities. Because of its effectiveness in producing and making available data on a consistent and accessible basis, SEEG may significantly increase the capacity of civil society, scientists and stakeholders to understand and anticipate trends related to GHG emissions as well as their implications for public policy in Brazil.
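The "simple queries" the web platform supports amount to filter-and-aggregate operations over tabular records. The field names and values below are invented for illustration, not SEEG's actual schema.

```python
# Hypothetical records in a SEEG-style layout: sector, state, year, emissions (tCO2e)
records = [
    {"sector": "Energy",      "state": "SP", "year": 2015, "tco2e": 120.0},
    {"sector": "Energy",      "state": "SP", "year": 2014, "tco2e": 115.0},
    {"sector": "Agriculture", "state": "MT", "year": 2015, "tco2e": 300.0},
    {"sector": "Energy",      "state": "RJ", "year": 2015, "tco2e": 80.0},
]

def query(records, **filters):
    """Sum emissions over all records matching the given field=value filters."""
    rows = [r for r in records if all(r.get(k) == v for k, v in filters.items())]
    return sum(r["tco2e"] for r in rows)

print(query(records, sector="Energy", year=2015))  # 200.0
```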
101 Labeled Brain Images and a Consistent Human Cortical Labeling Protocol
Klein, Arno; Tourville, Jason
2012-01-01
We introduce the Mindboggle-101 dataset, the largest and most complete set of free, publicly accessible, manually labeled human brain images. To manually label the macroscopic anatomy in magnetic resonance images of 101 healthy participants, we created a new cortical labeling protocol that relies on robust anatomical landmarks and minimal manual edits after initialization with automated labels. The “Desikan–Killiany–Tourville” (DKT) protocol is intended to improve the ease, consistency, and accuracy of labeling human cortical areas. Given how difficult it is to label brains, the Mindboggle-101 dataset is intended to serve as a set of brain atlases for use in labeling other brains, as a normative dataset to establish morphometric variation in a healthy population for comparison against clinical populations, and to contribute to the development, training, testing, and evaluation of automated registration and labeling algorithms. To this end, we also introduce benchmarks for the evaluation of such algorithms by comparing our manual labels with labels automatically generated by probabilistic and multi-atlas registration-based approaches. All data, related software, and updated information are available on the http://mindboggle.info/data website. PMID:23227001
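Benchmark comparisons of manual against automatically generated labels are commonly scored with per-label overlap measures such as the Dice coefficient. A minimal sketch follows; this is a generic illustration with hypothetical voxel labels, not Mindboggle's own evaluation code.

```python
def dice(labels_a, labels_b, label):
    """Dice overlap for one label between two labelled volumes (flattened lists)."""
    a = [i for i, v in enumerate(labels_a) if v == label]
    b = [i for i, v in enumerate(labels_b) if v == label]
    inter = len(set(a) & set(b))
    return 2 * inter / (len(a) + len(b)) if (a or b) else 1.0

manual    = [0, 1, 1, 2, 2, 2, 0]   # hypothetical voxel labels
automated = [0, 1, 2, 2, 2, 2, 0]
print(dice(manual, automated, 2))
```

A full evaluation would average this score over every cortical label and every subject, which is essentially what registration-based labeling benchmarks report.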
Fathead minnow genome sequencing and assembly
The dataset provides the URLs for accessing the genome sequence data and two draft assemblies, as well as fathead minnow genotyping data associated with estimating the heterozygosity of the inbred line. This dataset is associated with the following publication: Burns, F., L. Cogburn, G. Ankley, D. Villeneuve, E. Waits, Y. Chang, V. Llaca, S. Deschamps, R. Jackson, and R. Hoke. Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Reference Genome. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 35(1): 212-217, (2016).
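The heterozygosity estimate mentioned can be illustrated as observed heterozygosity: the fraction of called sites whose two alleles differ. The genotype encoding and the values below are assumptions for illustration only.

```python
def observed_heterozygosity(genotypes):
    """Fraction of called sites whose two alleles differ.
    Genotypes are (allele1, allele2) tuples; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    het = sum(1 for a1, a2 in called if a1 != a2)
    return het / len(called)

# Hypothetical genotype calls at six loci for one inbred individual
calls = [("A", "A"), ("A", "G"), ("C", "C"), None, ("T", "T"), ("G", "G")]
print(observed_heterozygosity(calls))  # 0.2: one heterozygous site out of five called
```

For an inbred line this fraction is expected to be low, which is why the genotyping data accompany the assemblies.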
MLACP: machine-learning-based prediction of anticancer peptides
Manavalan, Balachandran; Basith, Shaherin; Shin, Tae Hwan; Choi, Sun; Kim, Myeong Ok; Lee, Gwang
2017-01-01
Cancer is the second leading cause of death globally, and use of therapeutic peptides to target and kill cancer cells has received considerable attention in recent years. Identification of anticancer peptides (ACPs) through wet-lab experimentation is expensive and often time consuming; therefore, development of an efficient computational method is essential to identify potential ACP candidates prior to in vitro experimentation. In this study, we developed support vector machine- and random forest-based machine-learning methods for the prediction of ACPs using the features calculated from the amino acid sequence, including amino acid composition, dipeptide composition, atomic composition, and physicochemical properties. We trained our methods using the Tyagi-B dataset and determined the machine parameters by 10-fold cross-validation. Furthermore, we evaluated the performance of our methods on two benchmarking datasets, with our results showing that the random forest-based method outperformed the existing methods with an average accuracy and Matthews correlation coefficient value of 88.7% and 0.78, respectively. To assist the scientific community, we also developed a publicly accessible web server at www.thegleelab.org/MLACP.html. PMID:29100375
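Two of the sequence-derived features named in the abstract, amino acid composition and dipeptide composition, are straightforward to compute. A sketch with a hypothetical peptide follows; the SVM/random forest training step is omitted.

```python
from collections import Counter
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """20-dimensional amino acid composition (fractions summing to 1)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AA]

def dipeptide_composition(seq):
    """400-dimensional dipeptide composition."""
    pairs = ["".join(p) for p in product(AA, repeat=2)]
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return [counts[p] / (len(seq) - 1) for p in pairs]

peptide = "GLFDIVKKVVGALGSL"   # a hypothetical peptide sequence
feats = aa_composition(peptide) + dipeptide_composition(peptide)
print(len(feats))  # 420
```

Feature vectors of this form (optionally extended with atomic composition and physicochemical properties, as in the paper) are what the random forest is trained on.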
Academic Research Library as Broker in Addressing Interoperability Challenges for the Geosciences
NASA Astrophysics Data System (ADS)
Smith, P., II
2015-12-01
Data capture is an important process in the research lifecycle. Complete descriptive and representative information about the data or database is necessary during data collection, whether in the field or in the research lab. The National Science Foundation's (NSF) Public Access Plan (2015) mandates that federally funded projects make their research data more openly available. Developing, implementing, and integrating metadata workflows into the research process of the data lifecycle facilitates improved data access while also addressing interoperability challenges for the geosciences, such as data description and representation. Lack of metadata or data curation can contribute to (1) semantic, (2) ontology, and (3) data integration issues within and across disciplinary domains and projects. Some researchers of EarthCube funded projects have identified these issues as gaps. These gaps can contribute to interoperability issues in data access, discovery, and integration between domain-specific and general data repositories. Academic Research Libraries have expertise in providing long-term discovery and access through the use of metadata standards and provision of access to research data, datasets, and publications via institutional repositories. Metadata crosswalks, open archival information systems (OAIS), trusted repositories, the Data Seal of Approval, persistent URLs, and the linking of data, objects, resources, and publications in institutional repositories and digital content management systems are common components in the library discipline. These components contribute to a library perspective on data access and discovery that can benefit the geosciences. The USGS Community for Data Integration (CDI) has developed the Science Support Framework (SSF) for data management and integration within its community of practice for contribution to improved understanding of the Earth's physical and biological systems.
The USGS CDI SSF can be used as a reference model to map to EarthCube Funded projects with academic research libraries facilitating the data and information assets components of the USGS CDI SSF via institutional repositories and/or digital content management. This session will explore the USGS CDI SSF for cross-discipline collaboration considerations from a library perspective.
This EnviroAtlas dataset contains polygons depicting the geographic areas of market-based programs, referred to herein as markets, and projects addressing ecosystem services protection in the United States. Depending upon the type of market or project and data availability, polygons reflect market coverage areas, project footprints, or project primary impact areas in which ecosystem service markets and projects operate. The data were collected via surveys and desk research conducted by Forest Trends' Ecosystem Marketplace from 2008 to 2016 on biodiversity (i.e., imperiled species/habitats; wetlands and streams), carbon, and water markets. Additional biodiversity data were obtained from the Regulatory In-lieu Fee and Bank Information Tracking System (RIBITS) database in 2015. Attribute data include information regarding the methodology, design, and development of biodiversity, carbon, and water markets and projects. This dataset was produced by Forest Trends' Ecosystem Marketplace for EnviroAtlas in order to support public access to and use of information related to environmental markets. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about thi
This EnviroAtlas dataset contains points depicting the location of market-based programs, referred to herein as markets, and projects addressing ecosystem services protection in the United States. The data were collected via surveys and desk research conducted by Forest Trends' Ecosystem Marketplace from 2008 to 2016 on biodiversity (i.e., imperiled species/habitats; wetlands and streams), carbon, and water markets. Additional biodiversity data were obtained from the Regulatory In-lieu Fee and Bank Information Tracking System (RIBITS) database in 2015. Points represent the centroids (i.e., center points) of market coverage areas, project footprints, or project primary impact areas in which ecosystem service markets or projects operate. National-level markets are an exception to this norm, with points representing administrative headquarters locations. Attribute data include information regarding the methodology, design, and development of biodiversity, carbon, and water markets and projects. This dataset was produced by Forest Trends' Ecosystem Marketplace for EnviroAtlas in order to support public access to and use of information related to environmental markets. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service.
Innovations in user-defined analysis: dynamic grouping and customized user datasets in VistaPHw.
Solet, David; Glusker, Ann; Laurent, Amy; Yu, Tianji
2006-01-01
Flexible, ready access to community health assessment data is a feature of innovative Web-based data query systems. An example is VistaPHw, which provides access to Washington state data and statistics used in community health assessment. Because of its flexible analysis options, VistaPHw customizes local, population-based results to be relevant to public health decision-making. The advantages of two innovations, dynamic grouping and the Custom Data Module, are described. Dynamic grouping permits the creation of user-defined aggregations of geographic areas, age groups, race categories, and years. Standard VistaPHw measures such as rates, confidence intervals, and other statistics may then be calculated for the new groups. Dynamic grouping has provided data for major, successful grant proposals, building partnerships with local governments and organizations, and informing program planning for community organizations. The Custom Data Module allows users to prepare virtually any dataset so it may be analyzed in VistaPHw. Uses for this module may include datasets too sensitive to be placed on a Web server or datasets that are not standardized across the state. Limitations and other system needs are also discussed.
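Dynamic grouping reduces to aggregating event and population counts over a user-defined set of areas and recomputing the standard measures. The sketch below is a generic illustration (not VistaPHw code) with invented counts, using a normal approximation to the Poisson confidence interval for the crude rate.

```python
import math

def grouped_rate(records, group, per=100_000):
    """Crude rate per `per` population, with a 95% normal-approximation
    Poisson confidence interval, for a user-defined group of areas."""
    events = sum(r["events"] for r in records if r["area"] in group)
    pop = sum(r["population"] for r in records if r["area"] in group)
    rate = events / pop * per
    half = 1.96 * math.sqrt(events) / pop * per
    return rate, rate - half, rate + half

# Hypothetical area-level counts; the user groups two areas on the fly
data = [
    {"area": "A", "events": 30, "population": 40_000},
    {"area": "B", "events": 20, "population": 60_000},
    {"area": "C", "events": 90, "population": 50_000},
]
rate, low, high = grouped_rate(data, group={"A", "B"})
print(round(rate, 1))  # 50.0 per 100,000
```

The same function serves any user-defined aggregation (age bands, race categories, year ranges) once records carry the corresponding keys, which is the essence of dynamic grouping.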
DOE Office of Scientific and Technical Information (OSTI.GOV)
Levin, Barnaby D. A.; Padgett, Elliot; Chen, Chien-Chun
Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state-of-the-art experimental test data.
The health care and life sciences community profile for dataset descriptions
Alexiev, Vladimir; Ansell, Peter; Bader, Gary; Baran, Joachim; Bolleman, Jerven T.; Callahan, Alison; Cruz-Toledo, José; Gaudet, Pascale; Gombocz, Erich A.; Gonzalez-Beltran, Alejandra N.; Groth, Paul; Haendel, Melissa; Ito, Maori; Jupp, Simon; Juty, Nick; Katayama, Toshiaki; Kobayashi, Norio; Krishnaswami, Kalpana; Laibe, Camille; Le Novère, Nicolas; Lin, Simon; Malone, James; Miller, Michael; Mungall, Christopher J.; Rietveld, Laurens; Wimalaratne, Sarala M.; Yamaguchi, Atsuko
2016-01-01
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. PMID:27602295
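The guideline's element groups (description, identification, attribution, versioning, provenance, content summarization) can be pictured as a machine-readable record. The keys below are illustrative groupings only, not the profile's actual RDF predicates, and the identifiers are placeholders.

```python
import json

# Illustrative dataset description grouped by the guideline's element types;
# keys and values are invented, not the HCLS profile's RDF vocabulary.
record = {
    "identification": {"identifier": "doi:10.xxxx/example", "title": "Example dataset"},
    "description":    {"summary": "A versioned biomedical dataset.", "keywords": ["biomedical"]},
    "attribution":    {"creator": "Example Lab", "license": "CC-BY-4.0"},
    "versioning":     {"version": "2.1", "priorVersion": "2.0"},
    "provenance":     {"derivedFrom": "doi:10.xxxx/source", "created": "2016-01-01"},
    "contentSummary": {"records": 12345, "format": "RDF"},
}
print(len(record), "element groups")
```

In practice each group maps to terms from existing vocabularies (e.g., Dublin Core, DCAT, PROV), which is exactly the reuse the profile prescribes.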
DOIs for Data: Progress in Data Citation and Publication in the Geosciences
NASA Astrophysics Data System (ADS)
Callaghan, S.; Murphy, F.; Tedds, J.; Allan, R.
2012-12-01
Identifiers for data are the bedrock on which data citation and publication rests. These, in their turn, are widely proposed as methods for encouraging researchers to share their datasets, and at the same time receive academic credit for their efforts in producing them. However, neither data citation nor publication can be properly achieved without a method of identifying clearly what is, and what isn't, part of the dataset. Once a dataset becomes part of the scientific record (either through formal data publication or through being cited) then issues such as dataset stability and permanence become vital to address. In the geosciences, several projects in the UK are concentrating on issues of dataset identification, citation and publication. The UK's Natural Environment Research Council's (NERC) Science Information Strategy data citation and publication project is addressing the issue of identifiers for data, stability, transparency, and credit for data producers through data citation. At a data publication level, 2012 has seen the launch of the new Wiley title Geoscience Data Journal and the PREPARDE (Peer Review for Publication & Accreditation of Research Data in the Earth sciences) project, both aiming to encourage data publication by addressing issues such as data paper submission workflows and the scientific peer-review of data. All of these initiatives work with a range of partners including academic institutions, learned societies, data centers and commercial publishers, both nationally and internationally, with a cross-project aim of developing the mechanisms so data can be identified, cited and published with confidence. 
This involves investigating barriers and drivers to data publishing and sharing, peer review, and re-use of geoscientific datasets, and specifically such topics as dataset requirements for citation, workflows for dataset ingestion into data centers and publishers, procedures and policies for editors, reviewers and authors of data publication, and assessing the trustworthiness of data archives. A key goal is to ensure that these projects reach out to, and are informed by, other related initiatives on a global basis, in particular anyone interested in developing long-term sustainable policies, processes, incentives and business models for managing and publishing research data. This presentation will give an overview of progress in the projects mentioned above, specifically focussing on the use of DOIs for datasets hosted in the NERC environmental data centers, and how DOIs are enabling formal data citation and publication in the geosciences.
Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset
2014-12-23
publications for benchmarking prognostics algorithms. The turbofan degradation datasets have received over seven thousand unique downloads in the last five ... approaches that researchers have taken to implement prognostics using these turbofan datasets. Some unique characteristics of these datasets are also ... [Table: description of the five turbofan degradation datasets available from the NASA repository; columns: Datasets, #Fault Modes, #Conditions, #Train Units, #Test Units]
The multilayer temporal network of public transport in Great Britain
NASA Astrophysics Data System (ADS)
Gallotti, Riccardo; Barthelemy, Marc
2015-01-01
Despite the widespread availability of information concerning public transport coming from different sources, it is extremely hard to have a complete picture, in particular at a national scale. Here, we integrate timetable data obtained from the United Kingdom open-data program together with timetables of domestic flights, and obtain a comprehensive snapshot of the temporal characteristics of the whole UK public transport system for a week in October 2010. In order to focus on multi-modal aspects of the system, we use a coarse graining procedure and define explicitly the coupling between different transport modes such as connections at airports, ferry docks, rail, metro, coach and bus stations. The resulting weighted, directed, temporal and multilayer network is provided in simple, commonly used formats, ensuring easy access and the possibility of a straightforward use of old or specifically developed methods on this new and extensive dataset.
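A weighted, directed, temporal, multilayer network in "simple, commonly used formats" typically amounts to an edge list with layer and time columns. The column layout and values below are assumptions for illustration, not the published dataset's actual schema.

```python
import csv, io
from collections import defaultdict

# Assumed column layout: origin, destination, layer (transport mode),
# departure and arrival times in minutes from the start of the week.
edges_csv = """origin,destination,layer,dep,arr
StPancras,Luton,rail,480,520
Luton,Edinburgh,air,560,635
Victoria,Brighton,coach,490,590
"""

layers = defaultdict(list)
for row in csv.DictReader(io.StringIO(edges_csv)):
    layers[row["layer"]].append(
        (row["origin"], row["destination"], int(row["dep"]), int(row["arr"])))

# A time-respecting multi-modal connection: arrive before the next departure
rail = layers["rail"][0]
air = layers["air"][0]
print(rail[3] <= air[2])  # the rail leg arrives before the flight departs
```

Grouping edges by layer while keeping departure and arrival times is what lets temporal-network methods distinguish feasible (time-respecting) multi-modal paths from merely topological ones.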
Integrative Spatial Data Analytics for Public Health Studies of New York State
Chen, Xin; Wang, Fusheng
2016-01-01
Increased accessibility of health data made available by the government provides a unique opportunity for spatial analytics at much higher resolution, to discover patterns of diseases and their correlation with spatial impact indicators. This paper demonstrates our vision of integrative spatial analytics for public health by linking the New York Cancer Mapping Dataset with datasets containing potential spatial impact indicators. We performed spatially based discovery of disease patterns and variations across New York State, and identified potential correlations between diseases and demographic, socio-economic and environmental indicators. Our methods were validated by three correlation studies: the correlation between stomach cancer and Asian race, the correlation between breast cancer and a highly educated population, and the correlation between lung cancer and air toxics. Our work will allow public health researchers, government officials and other practitioners to adequately identify, analyze, and monitor health problems at the community or neighborhood level for New York State. PMID:28269834
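The region-level correlation studies described above can be sketched with a plain Pearson coefficient; the indicator and rate values below are illustrative only, not figures from the paper:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative region-level data: an exposure indicator vs. a disease rate
# per region (e.g., ZIP-code area). Values are invented for demonstration.
indicator = [0.10, 0.25, 0.40, 0.55, 0.70]
rate      = [12.0, 14.5, 15.1, 18.2, 19.9]
r = pearson(indicator, rate)   # close to +1 for these toy data
```

In practice such a coefficient would be paired with a significance test and spatial autocorrelation checks before any claim of association is made.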
Simulation of Smart Home Activity Datasets
Synnott, Jonathan; Nugent, Chris; Jeffers, Paul
2015-01-01
A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation. PMID:26087371
Simulation of Smart Home Activity Datasets.
Synnott, Jonathan; Nugent, Chris; Jeffers, Paul
2015-06-16
A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.
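A model-based simulation approach of the kind reviewed above can be sketched as an activity model that probabilistically fires virtual sensors; the sensor names, probabilities, and timings below are invented for illustration:

```python
import random

# Hypothetical activity model: each activity fires a set of virtual
# sensors with a given probability (names and probabilities illustrative).
ACTIVITY_MODEL = {
    "prepare_meal": [("kitchen_motion", 0.95), ("fridge_door", 0.80),
                     ("stove_power", 0.60)],
    "sleep":        [("bedroom_motion", 0.90), ("bed_pressure", 0.95)],
}

def simulate(activity, start_time, rng):
    """Generate timestamped virtual-sensor events for one activity instance."""
    events = []
    t = start_time
    for sensor, prob in ACTIVITY_MODEL[activity]:
        if rng.random() < prob:
            t += rng.randint(5, 60)          # seconds between firings
            events.append((t, sensor, "ON"))
    return events

rng = random.Random(42)  # fixed seed -> reproducible simulated dataset
events = simulate("prepare_meal", 0, rng)
```

Repeating the simulation over a schedule of activities yields a labeled dataset of the sort needed to train and benchmark activity recognition approaches.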
Digital Object Identifier (DOI) usage and adoption in the U.S. Geological Survey (USGS)
NASA Astrophysics Data System (ADS)
Frame, M. T.; Palanisamy, G.
2013-12-01
Addressing grand environmental science challenges requires unprecedented access to easily understood data that cross the breadth of temporal, spatial, and thematic scales. From a scientist's perspective, the big challenges lie in discovering the relevant data, dealing with extreme data heterogeneity, large data volumes, and converting data to information and knowledge. Historical linkages between derived products (i.e., publications) and associated datasets have not existed in the Earth science community. The USGS Core Science Analytics and Synthesis, in collaboration with DOE's Oak Ridge National Laboratory (ORNL) Mercury Consortium (funded by NASA, USGS and DOE), established a Digital Object Identifier (DOI) service for USGS data, metadata, and other media. This service is offered in partnership through the University of California Digital Library EZID service. USGS scientists, data managers, and other professionals can generate globally unique, persistent and resolvable identifiers for any kind of digital object. Additional efforts to assign DOIs to historical data and publications have also been underway. These DOI identifiers are being used to cite data in journal articles, web-accessible datasets, and other media for distribution, integration, and in support of improved data management practices. The session will discuss the current DOI efforts within USGS, including a discussion on adoption, challenges, and future efforts necessary to improve access, reuse, sharing, and discoverability of USGS data and information.
NCAR's Research Data Archive: OPeNDAP Access for Complex Datasets
NASA Astrophysics Data System (ADS)
Dattore, R.; Worley, S. J.
2014-12-01
Many datasets have complex structures including hundreds of parameters and numerous vertical levels, grid resolutions, and temporal products. Making these data accessible is a challenge for a data provider. OPeNDAP is a powerful protocol for delivering multi-file datasets in real time so that they can be ingested by many analysis and visualization tools, but for these datasets there are too many choices about how to aggregate. Simple aggregation schemes can fail to support, or at least make very challenging, many potential studies based on complex datasets. We address this issue by using a rich file-content metadata collection to create a real-time customized OPeNDAP service that matches the full suite of access possibilities for complex datasets. The Climate Forecast System Reanalysis (CFSR) and its extension, the Climate Forecast System Version 2 (CFSv2), datasets produced by the National Centers for Environmental Prediction (NCEP) and hosted by the Research Data Archive (RDA) at the Computational and Information Systems Laboratory (CISL) at NCAR, are examples of complex datasets that are difficult to aggregate with existing data server software. CFSR and CFSv2 contain 141 distinct parameters on 152 vertical levels, six grid resolutions and 36 products (analyses, n-hour forecasts, multi-hour averages, etc.), where not all parameter/level combinations are available at all grid resolution/product combinations. These data are archived in the RDA with the data structure provided by the producer; no additional re-organization or aggregation has been applied. Since 2011, users have been able to request customized subsets (e.g., temporal, parameter, spatial) from the CFSR/CFSv2, which are processed in delayed mode and then downloaded to a user's system. Until now, the complexity has made it difficult to provide real-time OPeNDAP access to the data.
We have developed a service that leverages the already-existing subsetting interface and allows users to create a virtual dataset with its own structure (das, dds). The user receives a URL to the customized dataset that can be used by existing tools to ingest, analyze, and visualize the data. This presentation will detail the metadata system and OPeNDAP server that enable user-customized real-time access and show an example of how a visualization tool can access the data.
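In OPeNDAP, a client constrains a dataset by appending a constraint expression (hyperslab indices per dimension) to the dataset URL, so a customized virtual dataset ultimately resolves to URLs like the one sketched below. The endpoint path and variable name are hypothetical, not the actual RDA layout:

```python
from urllib.parse import quote

def opendap_constraint(base_url, variable, t0, t1, y0, y1, x0, x1):
    """Build an OPeNDAP URL with a hyperslab constraint expression:
    variable[time_start:time_stop][lat_start:lat_stop][lon_start:lon_stop]."""
    ce = f"{variable}[{t0}:{t1}][{y0}:{y1}][{x0}:{x1}]"
    return f"{base_url}?{quote(ce, safe='[]:,')}"

url = opendap_constraint(
    "https://rda.example.org/opendap/cfsr_subset",  # hypothetical endpoint
    "TMP_L100",                                     # hypothetical variable
    0, 7, 100, 160, 200, 280)
```

A client such as an OPeNDAP-aware netCDF tool would then fetch only the requested hyperslab, which is what makes real-time access to such large, complex datasets feasible.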
Web-GIS visualisation of permafrost-related Remote Sensing products for ESA GlobPermafrost
NASA Astrophysics Data System (ADS)
Haas, A.; Heim, B.; Schaefer-Neth, C.; Laboor, S.; Nitze, I.; Grosse, G.; Bartsch, A.; Kaab, A.; Strozzi, T.; Wiesmann, A.; Seifert, F. M.
2016-12-01
The ESA GlobPermafrost project (www.globpermafrost.info) provides a remote sensing service for permafrost research and applications. The service comprises data product generation for various sites and regions as well as specific infrastructure allowing overview of, and access to, datasets. Based on an online user survey conducted within the project, the user community extensively applies GIS software to handle remote sensing-derived datasets and requires preview functionalities before accessing them. In response, we are developing the Permafrost Information System PerSys, conceptualized as an open-access geospatial data dissemination and visualization portal. PerSys will allow visualisation of GlobPermafrost raster and vector products such as land cover classifications, Landsat multispectral index trend datasets, lake and wetland extents, InSAR-based land surface deformation maps, rock glacier velocity fields, spatially distributed permafrost model outputs, and land surface temperature datasets. The datasets will be published as WebGIS services relying on OGC-standardized Web Mapping Service (WMS) and Web Feature Service (WFS) technologies for data display and visualization. The WebGIS environment will be hosted at the AWI computing centre, where a geodata infrastructure has been implemented comprising ArcGIS for Server 10.4, PostgreSQL 9.2 and a browser-driven data viewer based on Leaflet (http://leafletjs.com). Independently, we will provide an 'Access-Restricted Data Dissemination Service', available to registered users for testing frequently updated versions of project datasets. PerSys will become a core project of the Arctic Permafrost Geospatial Centre (APGC) within the ERC-funded PETA-CARB project (www.awi.de/petacarb).
The APGC Data Catalogue will contain all final products of GlobPermafrost, allow in-depth dataset search via keywords, spatial and temporal coverage, data type, etc., and will provide DOI-based links to the datasets archived in the long-term, open access PANGAEA data repository.
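Products published through an OGC-standardized WMS, as above, can be fetched with a GetMap request; below is a sketch using the standard WMS 1.3.0 parameter set, with the server URL and layer name as hypothetical placeholders:

```python
from urllib.parse import urlencode

def getmap_url(endpoint, layer, bbox, width=800, height=600):
    """Build an OGC WMS 1.3.0 GetMap request URL.
    bbox is (min_lat, min_lon, max_lat, max_lon) for EPSG:4326,
    whose axis order in WMS 1.3.0 is latitude first."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{endpoint}?{urlencode(params)}"

url = getmap_url("https://maps.awi.example/wms",    # hypothetical endpoint
                 "landcover_trend",                 # hypothetical layer name
                 (60.0, -180.0, 90.0, 180.0))       # pan-Arctic extent
```

Any WMS-capable viewer (Leaflet, ArcGIS, QGIS) issues requests of exactly this form, which is why publishing via OGC services gives immediate client-side compatibility.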
Field of genes: using Apache Kafka as a bioinformatic data repository
Lynch, Richard; Walsh, Paul
2018-01-01
Abstract Background Bioinformatic research is increasingly dependent on large-scale datasets, accessed either from private or public repositories. An example of a public repository is National Center for Biotechnology Information's (NCBI’s) Reference Sequence (RefSeq). These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI’s RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files. Results The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files. Conclusions Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets such as RefSeq and for private clinical and experimental data. PMID:29635394
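The "lightly structured" idea above can be sketched without a running broker: each record is a key plus a small JSON payload, and a deterministic key-to-partition mapping is what lets a consumer group read disjoint partitions in parallel. CRC32 stands in here for Kafka's default murmur2 partitioner; the partition count and accession values are illustrative:

```python
import json
import zlib

N_PARTITIONS = 8  # illustrative partition count for the topic

def to_record(accession, sequence):
    """A 'lightly structured' record: a key plus a small JSON payload,
    rather than a fully normalized schema or an opaque flat file."""
    return accession, json.dumps({"acc": accession, "seq": sequence})

def partition_for(key, n_partitions=N_PARTITIONS):
    """Deterministic key -> partition mapping (CRC32 for illustration;
    Kafka's default producer hashes with murmur2, but the idea is the
    same): records with the same key land on the same partition, and
    each consumer in a group reads a disjoint subset of partitions."""
    return zlib.crc32(key.encode()) % n_partitions

key, value = to_record("NC_000913.3", "AGCTTTTCATTCTGACTGCA")
p = partition_for(key)
```

Because downstream consumers see an ordered stream per partition, the same records can be re-streamed into more highly structured representations (indices, databases) whenever a new access pattern is needed.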
Nascimento, Leandro Costa; Salazar, Marcela Mendes; Lepikson-Neto, Jorge; Camargo, Eduardo Leal Oliveira; Parreiras, Lucas Salera; Carazzolle, Marcelo Falsarella
2017-01-01
Abstract Tree species of the genus Eucalyptus are the most valuable and widely planted hardwoods in the world. Given the economic importance of Eucalyptus trees, much effort has been made towards the generation of specimens with superior forestry properties that can deliver high-quality feedstocks, customized to the industry's needs for both cellulosic (paper) and lignocellulosic biomass production. In line with these efforts, large sets of molecular data have been generated by several scientific groups, providing invaluable information that can be applied in the development of improved specimens. In order to fully explore the potential of available datasets, the development of a public database that provides integrated access to genomic and transcriptomic data from Eucalyptus is needed. EUCANEXT is a database that analyses and integrates publicly available Eucalyptus molecular data, such as the E. grandis genome assembly and predicted genes, ESTs from several species and digital gene expression from 26 RNA-Seq libraries. The database has been implemented in a Fedora Linux machine running MySQL and Apache, while Perl CGI was used for the web interfaces. EUCANEXT provides a user-friendly web interface for easy access and analysis of publicly available molecular data from Eucalyptus species. This integrated database allows for complex searches by gene name, keyword or sequence similarity and is publicly accessible at http://www.lge.ibi.unicamp.br/eucalyptusdb. Through EUCANEXT, users can perform complex analyses to identify genes related to traits of interest using RNA-Seq libraries and tools for differential expression analysis. Moreover, the entire bioinformatics pipeline described here, including the database schema and Perl scripts, is readily available and can be applied to any genomic and transcriptomic project, regardless of the organism. Database URL: http://www.lge.ibi.unicamp.br/eucalyptusdb PMID:29220468
Common Web Mapping and Mobile Device Framework for Display of NASA Real-time Data
NASA Astrophysics Data System (ADS)
Burks, J. E.
2013-12-01
Scientists have strategic goals to deliver their unique datasets and research to both collaborative partners and more broadly to the public. These datasets can have a significant impact locally and globally as has been shown by the success of the NASA Short-term Prediction Research and Transition (SPoRT) Center and SERVIR programs at Marshall Space Flight Center. Each of these respective organizations provides near real-time data at the best resolution possible to address concerns of the operational weather forecasting community (SPoRT) and to support environmental monitoring and disaster assessment (SERVIR). However, one of the biggest challenges in delivering the data to these and other Earth science community partners is formatting the product to fit into an end user's Decision Support System (DSS). The problem of delivering the data to the end user's DSS can be a significant impediment to transitioning research to operational environments, especially for disaster response, where delivery time is critical. The decision makers, in addition to the DSS, need seamless access to these same datasets from a web browser or a mobile phone for support when they are away from their DSS or for personnel out in the field. A framework has been developed for the MSFC Earth Science program that can be used to easily enable seamless delivery of scientific data to end users in multiple formats. The first format is an open geospatial format, Web Mapping Service (WMS), which is easily integrated into most DSSs. The second format is a web browser display, which can be embedded within any MSFC Science web page with just a few lines of web page coding. The third format is accessible in the form of iOS and Android native mobile applications that could be downloaded from an 'app store'. The framework developed has reduced the level of effort needed to bring new and existing NASA datasets to each of these end user platforms and helps extend the reach of science data.
Common Web Mapping and Mobile Device Framework for Display of NASA Real-time Data
NASA Technical Reports Server (NTRS)
Burks, Jason
2013-01-01
Scientists have strategic goals to deliver their unique datasets and research to both collaborative partners and more broadly to the public. These datasets can have a significant impact locally and globally as has been shown by the success of the NASA Short-term Prediction Research and Transition (SPoRT) Center and SERVIR programs at Marshall Space Flight Center. Each of these respective organizations provides near real-time data at the best resolution possible to address concerns of the operational weather forecasting community (SPoRT) and to support environmental monitoring and disaster assessment (SERVIR). However, one of the biggest challenges in delivering the data to these and other Earth science community partners is formatting the product to fit into an end user's Decision Support System (DSS). The problem of delivering the data to the end user's DSS can be a significant impediment to transitioning research to operational environments, especially for disaster response, where delivery time is critical. The decision makers, in addition to the DSS, need seamless access to these same datasets from a web browser or a mobile phone for support when they are away from their DSS or for personnel out in the field. A framework has been developed for the MSFC Earth Science program that can be used to easily enable seamless delivery of scientific data to end users in multiple formats. The first format is an open geospatial format, Web Mapping Service (WMS), which is easily integrated into most DSSs. The second format is a web browser display, which can be embedded within any MSFC Science web page with just a few lines of web page coding. The third format is accessible in the form of iOS and Android native mobile applications that could be downloaded from an "app store". The framework developed has reduced the level of effort needed to bring new and existing NASA datasets to each of these end user platforms and helps extend the reach of science data.
Open data and digital morphology
Davies, Thomas G.; Cunningham, John A.; Asher, Robert J.; Bates, Karl T.; Bengtson, Stefan; Benson, Roger B. J.; Boyer, Doug M.; Braga, José; Dong, Xi-Ping; Evans, Alistair R.; Friedman, Matt; Garwood, Russell J.; Goswami, Anjali; Hutchinson, John R.; Jeffery, Nathan S.; Lebrun, Renaud; Martínez-Pérez, Carlos; O'Higgins, Paul M.; Orliac, Maëva; Rowe, Timothy B.; Sánchez-Villagra, Marcelo R.; Shubin, Neil H.; Starck, J. Matthias; Stringer, Chris; Summers, Adam P.; Sutton, Mark D.; Walsh, Stig A.; Weisbecker, Vera; Witmer, Lawrence M.; Wroe, Stephen; Yin, Zongjun
2017-01-01
Over the past two decades, the development of methods for visualizing and analysing specimens digitally, in three and even four dimensions, has transformed the study of living and fossil organisms. However, the initial promise that the widespread application of such methods would facilitate access to the underlying digital data has not been fully achieved. The underlying datasets for many published studies are not readily or freely available, introducing a barrier to verification and reproducibility, and the reuse of data. There is no current agreement or policy on the amount and type of data that should be made available alongside studies that use, and in some cases are wholly reliant on, digital morphology. Here, we propose a set of recommendations for minimum standards and additional best practice for three-dimensional digital data publication, and review the issues around data storage, management and accessibility. PMID:28404779
Open data and digital morphology.
Davies, Thomas G; Rahman, Imran A; Lautenschlager, Stephan; Cunningham, John A; Asher, Robert J; Barrett, Paul M; Bates, Karl T; Bengtson, Stefan; Benson, Roger B J; Boyer, Doug M; Braga, José; Bright, Jen A; Claessens, Leon P A M; Cox, Philip G; Dong, Xi-Ping; Evans, Alistair R; Falkingham, Peter L; Friedman, Matt; Garwood, Russell J; Goswami, Anjali; Hutchinson, John R; Jeffery, Nathan S; Johanson, Zerina; Lebrun, Renaud; Martínez-Pérez, Carlos; Marugán-Lobón, Jesús; O'Higgins, Paul M; Metscher, Brian; Orliac, Maëva; Rowe, Timothy B; Rücklin, Martin; Sánchez-Villagra, Marcelo R; Shubin, Neil H; Smith, Selena Y; Starck, J Matthias; Stringer, Chris; Summers, Adam P; Sutton, Mark D; Walsh, Stig A; Weisbecker, Vera; Witmer, Lawrence M; Wroe, Stephen; Yin, Zongjun; Rayfield, Emily J; Donoghue, Philip C J
2017-04-12
Over the past two decades, the development of methods for visualizing and analysing specimens digitally, in three and even four dimensions, has transformed the study of living and fossil organisms. However, the initial promise that the widespread application of such methods would facilitate access to the underlying digital data has not been fully achieved. The underlying datasets for many published studies are not readily or freely available, introducing a barrier to verification and reproducibility, and the reuse of data. There is no current agreement or policy on the amount and type of data that should be made available alongside studies that use, and in some cases are wholly reliant on, digital morphology. Here, we propose a set of recommendations for minimum standards and additional best practice for three-dimensional digital data publication, and review the issues around data storage, management and accessibility. © 2017 The Authors.
PERSIANN-CDR Daily Precipitation Dataset for Hydrologic Applications and Climate Studies.
NASA Astrophysics Data System (ADS)
Sorooshian, S.; Hsu, K. L.; Ashouri, H.; Braithwaite, D.; Nguyen, P.; Thorstensen, A. R.
2015-12-01
Precipitation Estimation from Remotely Sensed Information using Artificial Neural Network - Climate Data Record (PERSIANN-CDR) is a newly developed and released dataset covering more than three decades (01/01/1983 - 03/31/2015 to date) of daily precipitation estimates at 0.25° resolution for the 60°S-60°N latitude band. PERSIANN-CDR is processed using the archive of the Gridded Satellite IRWIN CDR (GridSat-B1) from the International Satellite Cloud Climatology Project (ISCCP), with the Global Precipitation Climatology Project (GPCP) 2.5° monthly product used for bias correction. The dataset has been released and made available for public access through NOAA's National Centers for Environmental Information (NCEI) (http://www1.ncdc.noaa.gov/pub/data/sds/cdr/CDRs/PERSIANN/Overview.pdf). PERSIANN-CDR has already shown its usefulness for a wide range of applications, including climate variability and change monitoring, hydrologic applications, and water resources system planning and management. This precipitation CDR has also been used in studying the behavior of historical extreme precipitation events. We will present demonstrations of PERSIANN-CDR data in detecting trends and variability of precipitation over the past 30 years, as well as the potential usefulness of the dataset for evaluating climate model performance relevant to precipitation in retrospective mode.
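Given the fixed 0.25° resolution over the 60°S-60°N band described above, locating the grid cell for a point is simple arithmetic. The origin convention below (south-west corner at 60°S, 0°E) is an assumption for illustration, so check the product documentation before indexing real files:

```python
from math import floor

RES = 0.25                 # degrees per grid cell
LAT0, LON0 = -60.0, 0.0    # assumed grid origin (south-west corner)

def cell_index(lat, lon):
    """Row/column of the 0.25-degree cell containing (lat, lon).
    Valid for -60 <= lat < 60; longitude is wrapped to [0, 360)."""
    if not -60.0 <= lat < 60.0:
        raise ValueError("PERSIANN-CDR covers the 60S-60N band only")
    row = floor((lat - LAT0) / RES)
    col = floor(((lon - LON0) % 360.0) / RES)
    return row, col

# Under these assumptions the grid has 480 rows x 1440 columns.
assert cell_index(-60.0, 0.0) == (0, 0)
```

With row/column indices in hand, extracting a basin's daily series for a hydrologic application reduces to slicing the gridded arrays.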
PinAPL-Py: A comprehensive web-application for the analysis of CRISPR/Cas9 screens.
Spahn, Philipp N; Bath, Tyler; Weiss, Ryan J; Kim, Jihoon; Esko, Jeffrey D; Lewis, Nathan E; Harismendy, Olivier
2017-11-20
Large-scale genetic screens using CRISPR/Cas9 technology have emerged as a major tool for functional genomics. With its increased popularity, experimental biologists frequently acquire large sequencing datasets for which they often do not have an easy analysis option. While a few bioinformatic tools have been developed for this purpose, their utility is still hindered either by limited functionality or by the requirement of bioinformatic expertise. To make sequencing data analysis of CRISPR/Cas9 screens more accessible to a wide range of scientists, we developed a Platform-independent Analysis of Pooled Screens using Python (PinAPL-Py), which is operated as an intuitive web service. PinAPL-Py implements state-of-the-art tools and statistical models, assembled in a comprehensive workflow covering sequence quality control, automated sgRNA sequence extraction, alignment, sgRNA enrichment/depletion analysis and gene ranking. The workflow is set up to use a variety of popular sgRNA libraries as well as custom libraries that can be easily uploaded. Various analysis options are offered, suitable to analyze a large variety of CRISPR/Cas9 screening experiments. Analysis output includes ranked lists of sgRNAs and genes, and publication-ready plots. PinAPL-Py helps to advance genome-wide screening efforts by combining comprehensive functionality with user-friendly implementation. PinAPL-Py is freely accessible at http://pinapl-py.ucsd.edu with instructions and test datasets.
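The core enrichment/depletion statistic in such a workflow can be sketched as a library-size-normalized log2 fold change per sgRNA; the counts below are illustrative, and PinAPL-Py's actual statistical models are more sophisticated than this:

```python
from math import log2

def log2_fold_changes(treatment, control, pseudo=1.0):
    """Normalized log2 fold change per sgRNA between treatment and
    control read counts. A pseudocount avoids division by zero, and
    dividing by total reads corrects for sequencing-depth differences."""
    t_total = sum(treatment.values())
    c_total = sum(control.values())
    lfc = {}
    for sgrna in treatment:
        t = (treatment[sgrna] + pseudo) / t_total
        c = (control[sgrna] + pseudo) / c_total
        lfc[sgrna] = log2(t / c)
    return lfc

# Invented read counts: sgRNA_1 enriched, sgRNA_3 depleted after selection.
control   = {"sgRNA_1": 100, "sgRNA_2": 100, "sgRNA_3": 100}
treatment = {"sgRNA_1": 400, "sgRNA_2": 100, "sgRNA_3": 25}
lfc = log2_fold_changes(treatment, control)
```

Gene-level ranking then aggregates these per-sgRNA scores across the several guides targeting each gene.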
A large-scale dataset of solar event reports from automated feature recognition modules
NASA Astrophysics Data System (ADS)
Schuh, Michael A.; Angryk, Rafal A.; Martens, Petrus C.
2016-05-01
The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validating of the individual event reports and the entire dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through these important prerequisite analyses presented here, the results of KDD from Solar Big Data will be overall more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.
OpenAQ: A Platform to Aggregate and Freely Share Global Air Quality Data
NASA Astrophysics Data System (ADS)
Hasenkopf, C. A.; Flasher, J. C.; Veerman, O.; DeWitt, H. L.
2015-12-01
Thousands of ground-based air quality monitors around the world publicly publish real-time air quality data; however, researchers and the public do not have access to this information in the ways most useful to them. Often, air quality data are posted on obscure websites showing only current values, are programmatically inaccessible, and/or are in inconsistent data formats across sites. Yet, historical and programmatic access to such a global dataset would be transformative to several scientific fields, from epidemiology to low-cost sensor technologies to estimates of ground-level aerosol by satellite retrievals. To increase accessibility and standardize this disparate dataset, we have built OpenAQ, an innovative, open platform created by a group of scientists and open data programmers. The source code for the platform is viewable at github.com/openaq. Currently, we are aggregating, storing, and making publicly available real-time air quality data (PM2.5, PM10, SO2, NO2, and O3) via an Application Program Interface (API). We will present the OpenAQ platform, which currently has the following specific capabilities:
- A continuous ingest mechanism for some of the most polluted cities, generalizable to more sources
- An API providing data querying, including the ability to filter by location, measurement type, value, and date, as well as custom sort options
- A generalized, chart-based visualization tool to explore data accessible via the API
At this stage, we are seeking wider participation and input from multiple research communities in expanding our data retrieval sites, standardizing our protocols, receiving feedback on quality issues, and creating tools that can be built on top of this open platform.
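Querying such an API reduces to building a filtered request URL. The sketch below follows the v1-era parameter names (city, parameter, date_from, ...); the API has evolved since, so consult the current OpenAQ documentation before relying on these exact names:

```python
from urllib.parse import urlencode

def measurements_query(base="https://api.openaq.org/v1/measurements",
                       **filters):
    """Build a query URL for the OpenAQ measurements API.
    Filter names here (city, parameter, date_from, date_to, limit)
    follow the early v1 API and may differ in later versions."""
    return f"{base}?{urlencode(filters)}"

url = measurements_query(city="Delhi", parameter="pm25",
                         date_from="2015-10-01", date_to="2015-10-31",
                         limit=100)
```

Fetching that URL with any HTTP client returns JSON measurement records, which is what makes the aggregated dataset programmatically useful to epidemiologists and satellite-retrieval groups alike.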
Visualizing Earth and Planetary Remote Sensing Data Using JMARS
NASA Astrophysics Data System (ADS)
Dickenshied, S.; Christensen, P. R.; Carter, S.; Anwar, S.; Noss, D.
2014-12-01
JMARS (Java Mission-planning and Analysis for Remote Sensing) is a free geospatial application developed by the Mars Space Flight Facility at Arizona State University. Originally written as a mission planning tool for the THEMIS instrument on board the Mars Odyssey spacecraft, it was released as an analysis tool to the general public in 2003. Since then it has expanded to be used for mission planning and scientific data analysis by additional NASA missions to Mars, the Moon, and Vesta, and it has come to be used by scientists, researchers and students of all ages from more than 40 countries around the world. The public version of JMARS now also includes remote sensing data for Mercury, Venus, Earth, the Moon, Mars, and a number of the moons of Jupiter and Saturn. Additional datasets for asteroids and other smaller bodies are being added as they become available and time permits. JMARS fuses data from different instruments in a geographical context. One core strength of JMARS is that it provides access to geospatially registered data via a consistent interface. Such data include global images (graphical and numeric), local mosaics, individual instrument images, spectra, and vector-oriented data. By hosting these products, users are able to avoid searching for, downloading, decoding, and projecting data on their own using a disparate set of tools and procedures. The JMARS team processes, indexes, and reorganizes data to make it quickly and easily accessible in a consistent manner. JMARS leverages many open-source technologies and tools to accomplish these data preparation steps. In addition to visualizing multiple datasets in context with one another, JMARS allows a user to find data products from differing missions that intersect the same geographical location, time range, or observational parameters. Any number of georegistered datasets can then be viewed or analyzed simultaneously with one another.
A user can easily create a mosaic of graphic data, plot numeric data, or project any arbitrary scene over surface topography. All of these visualization options can be exported for use in presentations, publications, or for further analysis in other tools.
Federal Register 2010, 2011, 2012, 2013, 2014
2012-05-07
... and evaluates potential datasets and recommends which datasets are appropriate for assessment analyses..., 2012: 9 a.m.-8 p.m. Using datasets provided by the Data Scoping Workshops, participants will develop...
VIEWCACHE: An incremental pointer-based access method for autonomous interoperable databases
NASA Technical Reports Server (NTRS)
Roussopoulos, N.; Sellis, Timos
1993-01-01
One of the biggest problems facing NASA today is to provide scientists efficient access to a large number of distributed databases. Our pointer-based incremental database access method, VIEWCACHE, provides such an interface for accessing distributed datasets and directories. VIEWCACHE allows database browsing and searching, performing inter-database cross-referencing with no actual data movement between database sites. This organization and processing is especially suitable for managing astrophysics databases which are physically distributed all over the world. Once the search is complete, the set of collected pointers pointing to the desired data are cached. VIEWCACHE includes spatial access methods for accessing image datasets, which provide much easier query formulation by referring directly to the image and very efficient search for objects contained within a two-dimensional window. We will develop and optimize a VIEWCACHE External Gateway Access to database management systems to facilitate database search.
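The pointer-caching idea pairs naturally with the two-dimensional window search described above: once a cross-database search has cached pointers tagged with positions, window filtering needs no further data movement. A minimal sketch, with invented site and image identifiers:

```python
def window_search(pointer_cache, x0, y0, x1, y1):
    """Return cached pointers whose (x, y) position falls inside a
    two-dimensional query window, without touching the remote data."""
    return [ptr for (x, y), ptr in pointer_cache
            if x0 <= x <= x1 and y0 <= y <= y1]

# Cached results of a cross-database search: (position, remote pointer).
# Site names and image identifiers are illustrative placeholders.
cache = [
    ((10.0, 20.0), "siteA/image/0042"),
    ((15.5, 22.0), "siteB/image/0007"),
    ((40.0, -5.0), "siteA/image/0101"),
]
hits = window_search(cache, 5, 15, 20, 25)
```

Only after the window query narrows the candidate set would the actual image data be fetched from the remote sites, which is the incremental-access benefit the method aims for.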
Carr, N.B.; Babel, N.; Diffendorfer, J.; Ignizio, D.; Hawkins, S.; Latysh, N.; Leib, K.; Linard, J.; Matherne, A.
2012-01-01
Throughout the western United States, increased demand for energy is driving the rapid development of oil, gas (including shale gas and coal-bed methane), and uranium, as well as renewable energy resources such as geothermal, solar, and wind. Much of the development in the West is occurring on public lands, including those under Federal and State jurisdictions. In Colorado and New Mexico, these public lands make up about 40 percent of the land area. Both states benefit from the revenue generated by energy production, but resource managers and other decisionmakers must balance the benefits of energy development with the potential consequences for ecosystems, recreation, and other resources. Although a substantial amount of geospatial data on existing energy development and energy potential is available, much of this information is not readily accessible to natural resource decisionmakers, policymakers, or the public. Furthermore, the data often exist in varied formats, requiring considerable processing before these datasets can be used to evaluate tradeoffs among resources, compare development alternatives, or quantify cumulative impacts. To allow for a comprehensive evaluation among different energy types, an interdisciplinary team of U.S. Geological Survey (USGS) scientists has developed an online Interactive Energy Atlas for Colorado and New Mexico. The Energy and Environment in the Rocky Mountain Area (EERMA) interdisciplinary team includes investigators from several USGS science centers. The purpose of the EERMA Interactive Energy Atlas is to facilitate access to geospatial data related to energy resources, energy infrastructure, and natural resources that may be affected by energy development. The Atlas is designed to meet the needs of various users, including GIS analysts, resource managers, policymakers, and the public, who seek information about energy in the western United States.
Currently, the Atlas has two primary capabilities, a GIS data viewer and an interactive map gallery.
NASA Astrophysics Data System (ADS)
Brissebrat, Guillaume; Fleury, Laurence; Boichard, Jean-Luc; Cloché, Sophie; Eymard, Laurence; Mastrorillo, Laurence; Moulaye, Oumarou; Ramage, Karim; Asencio, Nicole; Favot, Florence; Roussot, Odile
2013-04-01
The AMMA information system aims at expediting data and scientific results communication inside the AMMA community and beyond. It has already been adopted as the data management system by several projects and is meant to become a reference information system about the West Africa area for the whole scientific community. The AMMA database and the associated online tools have been developed and are managed by two French teams (IPSL Database Centre, Palaiseau and OMP Data Service, Toulouse). The complete system has been fully duplicated and is operated by AGRHYMET Regional Centre in Niamey, Niger. The AMMA database contains a wide variety of datasets: - about 250 local observation datasets that cover geophysical components (atmosphere, ocean, soil, vegetation) and human activities (agronomy, health...); they come from either operational networks or scientific experiments, and include historical data in West Africa from 1850; - 1350 outputs of a socio-economic questionnaire; - 60 operational satellite products and several research products; - 10 output sets of meteorological and ocean operational models and 15 of research simulations. Database users can access all the data using either the portal http://database.amma-international.org or http://amma.agrhymet.ne/amma-data. Different modules are available. The complete catalogue enables users to access metadata (i.e. information about the datasets) that are compliant with international standards (ISO19115, INSPIRE...). Registration pages enable users to read and sign the data and publication policy, and to apply for a user database account. The data access interface enables users to easily build a data extraction request by selecting various criteria like location, time, parameters...
At present, the AMMA database has more than 740 registered users and processes about 80 data requests every month. In order to monitor day-to-day meteorological and environmental information over West Africa, some quick-look and report display websites have been developed. They met the operational needs of the observational teams during the AMMA 2006 (http://aoc.amma-international.org) and FENNEC 2011 (http://fenoc.sedoo.fr) campaigns. But they also enable scientific teams to share physical indices throughout the monsoon season (http://misva.sedoo.fr from 2011). A collaborative WIKINDX tool has been set up online in order to manage scientific publications and communications of interest to AMMA (http://biblio.amma-international.org). The bibliographic database now contains about 1200 references. It is the most exhaustive document collection about the African monsoon available to all. Every scientist is invited to make use of the different AMMA online tools and data. Scientists or project leaders who have data management needs for existing or future datasets over West Africa are welcome to use the AMMA database framework and to contact ammaAdmin@sedoo.fr.
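The criteria-based extraction described above (selecting by location, time, and parameter) can be sketched as a filter over a metadata catalog. The catalog schema below is invented for illustration; the real AMMA extraction runs server-side against the full database.

```python
# Hedged sketch of a criteria-based extraction request, in the spirit of the
# AMMA data access interface (location, time, parameter filters). The catalog
# entries and field names are invented for illustration.
from datetime import date

catalog = [
    {"dataset": "rain_gauge_niamey", "param": "precipitation",
     "lat": 13.5, "lon": 2.1, "start": date(1990, 1, 1), "end": date(2010, 12, 31)},
    {"dataset": "sst_gulf_guinea", "param": "sea_surface_temperature",
     "lat": 3.0, "lon": -2.0, "start": date(2005, 1, 1), "end": date(2007, 12, 31)},
]

def extract(catalog, param=None, bbox=None, day=None):
    """Select datasets matching optional parameter, bounding-box (lat_min,
    lon_min, lat_max, lon_max), and date criteria; None means no constraint."""
    out = []
    for d in catalog:
        if param and d["param"] != param:
            continue
        if bbox and not (bbox[0] <= d["lat"] <= bbox[2] and bbox[1] <= d["lon"] <= bbox[3]):
            continue
        if day and not (d["start"] <= day <= d["end"]):
            continue
        out.append(d["dataset"])
    return out

print(extract(catalog, param="precipitation", day=date(2006, 8, 1)))
```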
Solutions for research data from a publisher's perspective
NASA Astrophysics Data System (ADS)
Cotroneo, P.
2015-12-01
Sharing research data has the potential to make research more efficient and reproducible. Elsevier has developed several initiatives to address the different needs of research data users. These include PANGAEA linked data, which provides geo-referenced, citable datasets from the earth and life sciences, archived by the PANGAEA data repository as supplementary data to publications; Mendeley Data, which allows users to freely upload and share their data; a database linking program that creates links between articles on ScienceDirect and datasets held in external data repositories such as EarthRef and EarthChem; a pilot for searching for research data through a map interface; an open data pilot that allows authors publishing in Elsevier journals to store and share research data and make this publicly available as a supplementary file alongside their article; and data journals, including Data in Brief, which allow researchers to share their data open access. Through these initiatives, researchers are not only encouraged to share their research data, but also supported in optimizing their research data management. By making data more readily citable and visible, and hence generating citations for authors, these initiatives also aim to ensure that researchers get the recognition they deserve for publishing their data.
EVpedia: a community web portal for extracellular vesicles research.
Kim, Dae-Kyum; Lee, Jaewook; Kim, Sae Rom; Choi, Dong-Sic; Yoon, Yae Jin; Kim, Ji Hyun; Go, Gyeongyun; Nhung, Dinh; Hong, Kahye; Jang, Su Chul; Kim, Si-Hyun; Park, Kyong-Su; Kim, Oh Youn; Park, Hyun Taek; Seo, Ji Hye; Aikawa, Elena; Baj-Krzyworzeka, Monika; van Balkom, Bas W M; Belting, Mattias; Blanc, Lionel; Bond, Vincent; Bongiovanni, Antonella; Borràs, Francesc E; Buée, Luc; Buzás, Edit I; Cheng, Lesley; Clayton, Aled; Cocucci, Emanuele; Dela Cruz, Charles S; Desiderio, Dominic M; Di Vizio, Dolores; Ekström, Karin; Falcon-Perez, Juan M; Gardiner, Chris; Giebel, Bernd; Greening, David W; Gross, Julia Christina; Gupta, Dwijendra; Hendrix, An; Hill, Andrew F; Hill, Michelle M; Nolte-'t Hoen, Esther; Hwang, Do Won; Inal, Jameel; Jagannadham, Medicharla V; Jayachandran, Muthuvel; Jee, Young-Koo; Jørgensen, Malene; Kim, Kwang Pyo; Kim, Yoon-Keun; Kislinger, Thomas; Lässer, Cecilia; Lee, Dong Soo; Lee, Hakmo; van Leeuwen, Johannes; Lener, Thomas; Liu, Ming-Lin; Lötvall, Jan; Marcilla, Antonio; Mathivanan, Suresh; Möller, Andreas; Morhayim, Jess; Mullier, François; Nazarenko, Irina; Nieuwland, Rienk; Nunes, Diana N; Pang, Ken; Park, Jaesung; Patel, Tushar; Pocsfalvi, Gabriella; Del Portillo, Hernando; Putz, Ulrich; Ramirez, Marcel I; Rodrigues, Marcio L; Roh, Tae-Young; Royo, Felix; Sahoo, Susmita; Schiffelers, Raymond; Sharma, Shivani; Siljander, Pia; Simpson, Richard J; Soekmadji, Carolina; Stahl, Philip; Stensballe, Allan; Stępień, Ewa; Tahara, Hidetoshi; Trummer, Arne; Valadi, Hadi; Vella, Laura J; Wai, Sun Nyunt; Witwer, Kenneth; Yáñez-Mó, María; Youn, Hyewon; Zeidler, Reinhard; Gho, Yong Song
2015-03-15
Extracellular vesicles (EVs) are spherical bilayered proteolipids, harboring various bioactive molecules. Due to the complexity of the vesicular nomenclatures and components, online searches for EV-related publications and vesicular components are currently challenging. We present an improved version of EVpedia, a public database for EVs research. This community web portal contains a database of publications and vesicular components, identification of orthologous vesicular components, bioinformatic tools and a personalized function. EVpedia includes 6879 publications, 172 080 vesicular components from 263 high-throughput datasets, and has been accessed more than 65 000 times from more than 750 cities. In addition, about 350 members from 73 international research groups have participated in developing EVpedia. This free web-based database might serve as a useful resource to stimulate the emerging field of EV research. The web site was implemented in PHP, Java, MySQL and Apache, and is freely available at http://evpedia.info. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Beyond PARR - PMEL's Integrated Data Management Strategy
NASA Astrophysics Data System (ADS)
Burger, E. F.; O'Brien, K.; Manke, A. B.; Schweitzer, R.; Smith, K. M.
2016-12-01
NOAA's Pacific Marine Environmental Laboratory (PMEL) hosts a wide range of scientific projects that span a number of scientific and environmental research disciplines. Each of these 14 research projects has its own data streams, which are as diverse as the research. With its requirements for public access to federally funded research results and data, the 2013 White House Office of Science and Technology memo on Public Access to Research Results (PARR) changed the data management landscape for Federal agencies. In 2015, with support from the PMEL Director, Dr. Christopher Sabine, PMEL's Science Data Integration Group (SDIG) initiated a multi-year effort to formulate and implement an integrated data-management strategy for PMEL research efforts. Instead of using external requirements, such as PARR, to define our approach, we focused on strategies to provide PMEL science projects with a unified framework for data submission, interoperable data access, data storage, and easier data archival to National Data Centers. This improves data access for PMEL scientists, their collaborators, and the public, and also provides a unified lab framework that allows our projects to meet their data management objectives, as well as those required by the PARR. We are implementing this solution in stages that allow us to test technology and architecture choices before committing to a large-scale implementation. SDIG developers have completed the first year of development, where our approach is to reuse and leverage existing frameworks and standards. This presentation will describe our data management strategy, explain our phased implementation approach and the software and framework choices, and show how these elements help us meet the objectives of this strategy. We will share the lessons learned in dealing with diverse and complex datasets in this first year of implementation and how these outcomes will shape our decisions for this ongoing effort.
The data management capabilities now available to scientific projects, and other services being developed to manage and preserve PMEL's scientific data assets for our researchers, their collaborators, and future generations, will be described.
Sustainable Data Evolution Technology for Power Grid Optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
The SDET Tool is used to create open-access power grid datasets and facilitate updates of these datasets by the community. Pacific Northwest National Laboratory (PNNL) and its power industry and software vendor partners are developing an innovative sustainable data evolution technology (SDET) to create open-access power grid datasets and facilitate updates to these datasets by the power grid community. The objective is to make this a sustained effort within and beyond the ARPA-E GRID DATA program so that the datasets can evolve over time and meet the current and future needs for power grid optimization, and potentially other applications in power grid operation and planning.
Earth Science Data for a Mobile Age
NASA Astrophysics Data System (ADS)
Oostra, D.; Chambers, L. H.; Lewis, P. M.; Baize, R.; Oots, P.; Rogerson, T.; Crecelius, S.; Coleman, T.
2012-12-01
Earth science data access needs to be interoperable and automatic. Recently, increasingly savvy data users combined with more complex web and mobile applications have placed increasing demands on how Earth science data is delivered to educators and students. The MY NASA DATA (MND) and S'COOL projects are developing a strategy to interact with the education community in the age of mobile devices and platforms. How can we provide data and meaningful scientific experiences to educational users through mobile technologies? This initiative will seek out existing technologies and stakeholders within the Earth science community to identify datasets that are relevant and appropriate for mobile application development and use by the educational community. Targeting efforts within the educational community will give the project a better understanding of previous attempts at data/mobile application use in the classroom and their problems. In addition, we will query developers and data providers on what successes and failures they have experienced in trying to provide data for applications designed on mobile platforms. This feedback will be implemented in new websites, applications, and lessons that will provide authentic scientific experiences for students and end users. We want to create tools that help sort through the vast amounts of NASA data and deliver it to users automatically. NASA provides millions of gigabytes of data that are publicly available through a large number of services spread across the World Wide Web. Accessing and navigating this data can be time-consuming and problematic, given the variety of file types and access methods. The MND project, through its Live Access Server system, provides selected datasets that are relevant, target National Standards of Learning, and are easy for educators to integrate into existing curricula.
In the future, we want to provide desired data to users with automatic updates, anticipate future data queries/needs, and generate new data combinations, targeting users with a web 3.0 methodology. We will examine applications that give users direct access to data in near real-time and find solutions for the educational community. MND and S'COOL will identify trends in the mobile and web application sectors to provide the greatest effect upon relevant audiences within the science and educational communities. Greater access is the goal, with an acute focus on educating our future explorers and scientists with tools and data that will provide the most efficacy, use, and enriching science experiences. Current trends point to cross-platform web applications as being the most effective and efficient means of delivering content, data, and ideas to end users. Universal availability of key datasets on any device will encourage users to continue to use data and attract potential data users and providers. Projected outcomes: initially, the outcome for this work is to increase the effectiveness of the MND and S'COOL projects by learning more about our users' needs and anticipating how data will be used in the future. Through our work we will increase exposure and ease of access to NASA datasets relevant to our communities. Our goal is to focus on our participants' mobile usage in the classroom, thereby gaining a greater understanding of how data is being used to teach students about the Earth, and begin to develop better tools and technologies.
The paper presents the Community Line Source (C-LINE) modeling system that estimates toxic air pollutant (air toxics) concentration gradients within 500 meters of busy roadways for community-sized areas on the order of 100 km2. C-LINE accesses publicly available datasets with nat...
NHDPlusHR: A national geospatial framework for surface-water information
Viger, Roland; Rea, Alan H.; Simley, Jeffrey D.; Hanson, Karen M.
2016-01-01
The U.S. Geological Survey is developing a new geospatial hydrographic framework for the United States, called the National Hydrography Dataset Plus High Resolution (NHDPlusHR), that integrates a diversity of the best-available information, robustly supports ongoing dataset improvements, enables hydrographic generalization to derive alternate representations of the network while maintaining feature identity, and supports modern scientific computing and Internet accessibility needs. This framework is based on the High Resolution National Hydrography Dataset, the Watershed Boundaries Dataset, and elevation from the 3-D Elevation Program, and will provide an authoritative, high precision, and attribute-rich geospatial framework for surface-water information for the United States. Using this common geospatial framework will provide a consistent basis for indexing water information in the United States, eliminate redundancy, and harmonize access to, and exchange of water information.
EnviroAtlas - Austin, TX - Park Access by Block Group
This EnviroAtlas dataset shows the block group population that is within and beyond an easy walking distance (500m) of a park entrance. Park entrances were included in this analysis if they were within 5km of the EnviroAtlas community boundary. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
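The 500 m "easy walking distance" classification in this layer can be sketched as a great-circle distance test against park-entrance locations. The coordinates below are made up for illustration; the actual EnviroAtlas analysis uses network or Euclidean distance on projected data, not necessarily this haversine shortcut.

```python
# Minimal sketch of a 500 m walking-distance test against park entrances,
# in the spirit of the EnviroAtlas park-access layer. Coordinates are
# hypothetical; the real analysis uses EPA's own methodology and data.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def within_walking_distance(point, entrances, threshold_m=500.0):
    """True if any park entrance lies within threshold_m of the point."""
    return any(haversine_m(*point, *e) <= threshold_m for e in entrances)

entrances = [(30.2700, -97.7400), (30.3000, -97.7000)]  # hypothetical Austin parks
print(within_walking_distance((30.2710, -97.7405), entrances))  # roughly 120 m away
print(within_walking_distance((30.5000, -97.9000), entrances))  # tens of km away
```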
Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset
NASA Technical Reports Server (NTRS)
Ramasso, Emmanuel; Saxena, Abhinav
2014-01-01
Benchmarking of prognostic algorithms has been challenging due to limited availability of common datasets suitable for prognostics. In an attempt to alleviate this problem several benchmarking datasets have been collected by NASA's prognostic center of excellence and made available to the Prognostics and Health Management (PHM) community to allow evaluation and comparison of prognostics algorithms. Among those datasets are five C-MAPSS datasets that have been extremely popular due to their unique characteristics making them suitable for prognostics. The C-MAPSS datasets pose several challenges that have been tackled by different methods in the PHM literature. In particular, management of high variability due to sensor noise, effects of operating conditions, and presence of multiple simultaneous fault modes are some factors that have great impact on the generalization capabilities of prognostics algorithms. More than 70 publications have used the C-MAPSS datasets for developing data-driven prognostic algorithms. The C-MAPSS datasets are also shown to be well-suited for development of new machine learning and pattern recognition tools for several key preprocessing steps such as feature extraction and selection, failure mode assessment, operating conditions assessment, health status estimation, uncertainty management, and prognostics performance evaluation. This paper summarizes a comprehensive literature review of publications using C-MAPSS datasets and provides guidelines and references to further usage of these datasets in a manner that allows clear and consistent comparison between different approaches.
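A central element of the prognostics performance evaluation mentioned above is the asymmetric scoring function associated with the C-MAPSS remaining-useful-life (RUL) benchmarks (as popularized by the PHM08 challenge), which penalizes late predictions more heavily than early ones. The sketch below uses the commonly cited exponential form; parameter values are the usual ones but should be checked against the specific dataset documentation.

```python
# Sketch of the asymmetric RUL scoring function commonly used with the
# C-MAPSS benchmarks: for d = predicted - true, early predictions (d < 0)
# incur exp(-d/a1) - 1 and late predictions (d >= 0) incur exp(d/a2) - 1,
# with a1 = 13 and a2 = 10 in the usual formulation.
import math

def rul_score(true_rul, predicted_rul, a1=13.0, a2=10.0):
    """Sum of exponential penalties over units; lower is better."""
    s = 0.0
    for y, y_hat in zip(true_rul, predicted_rul):
        d = y_hat - y
        s += math.exp(-d / a1) - 1 if d < 0 else math.exp(d / a2) - 1
    return s

# A 10-cycle-late prediction costs more than a 10-cycle-early one.
early = rul_score([100], [90])   # d = -10 -> exp(10/13) - 1, about 1.16
late  = rul_score([100], [110])  # d = +10 -> exp(10/10) - 1, about 1.72
print(early, late)
```

The asymmetry reflects maintenance reality: predicting a failure too late is riskier than retiring a component slightly early.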
The Ocean Observatories Initiative: Data Access and Visualization via the Graphical User Interface
NASA Astrophysics Data System (ADS)
Garzio, L. M.; Belabbassi, L.; Knuth, F.; Smith, M. J.; Crowley, M. F.; Vardaro, M.; Kerfoot, J.
2016-02-01
The Ocean Observatories Initiative (OOI), funded by the National Science Foundation, is a broad-scale, multidisciplinary effort to transform oceanographic research by providing users with real-time access to long-term datasets from a variety of deployed physical, chemical, biological, and geological sensors. The global array component of the OOI includes four high latitude sites: Irminger Sea off Greenland, Station Papa in the Gulf of Alaska, Argentine Basin off the coast of Argentina, and Southern Ocean near coordinates 55°S and 90°W. Each site is composed of fixed moorings, hybrid profiler moorings and mobile assets, with a total of approximately 110 instruments at each site. Near real-time (telemetered) and recovered data from these instruments can be visualized and downloaded via the OOI Graphical User Interface. In this Interface, the user can visualize scientific parameters via six different plotting functions with options to specify time ranges and apply various QA/QC tests. Data streams from all instruments can also be downloaded in different formats (CSV, JSON, and NetCDF) for further data processing, visualization, and comparison to supplementary datasets. In addition, users can view alerts and alarms in the system, access relevant metadata and deployment information for specific instruments, and find infrastructure specifics for each array including location, sampling strategies, deployment schedules, and technical drawings. These datasets from the OOI provide an unprecedented opportunity to transform oceanographic research and education, and will be readily accessible to the general public via the OOI's Graphical User Interface.
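The download-then-process workflow described above (export a data stream as CSV, JSON, or NetCDF, then apply QA/QC) can be sketched for the JSON case with a simple gross-range quality test. The stream contents and field names below are invented; real OOI streams have their own schemas and QC parameters.

```python
# Hedged sketch of post-download processing for an ocean data stream
# exported as JSON: parse the records and apply a gross-range QC test.
# Field names and bounds are illustrative, not actual OOI definitions.
import json

raw = json.dumps([
    {"time": "2016-01-01T00:00:00Z", "seawater_temperature": 4.1},
    {"time": "2016-01-01T01:00:00Z", "seawater_temperature": 45.0},  # spurious
    {"time": "2016-01-01T02:00:00Z", "seawater_temperature": 4.3},
])

def gross_range_qc(records, field, lo, hi):
    """Flag each record 1 (pass) or 4 (fail) against plausible bounds."""
    return [1 if lo <= r[field] <= hi else 4 for r in records]

records = json.loads(raw)
flags = gross_range_qc(records, "seawater_temperature", -2.0, 35.0)
print(flags)  # the 45.0 reading fails the range test
```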
Data Basin: Expanding Access to Conservation Data, Tools, and People
NASA Astrophysics Data System (ADS)
Comendant, T.; Strittholt, J.; Frost, P.; Ward, B. C.; Bachelet, D. M.; Osborne-Gowey, J.
2009-12-01
Mapping and spatial analysis are a fundamental part of problem solving in conservation science, yet spatial data are widely scattered, difficult to locate, and often unavailable. Valuable time and resources are wasted locating and gaining access to important biological, cultural, and economic datasets, scientific analyses, and experts. As conservation problems become more serious and the demand to solve them grows more urgent, a new way to connect science and practice is needed. To meet this need, an open-access web tool called Data Basin (www.databasin.org) has been created by the Conservation Biology Institute in partnership with ESRI and the Wilburforce Foundation. Users of Data Basin can gain quick access to datasets, experts, groups, and tools to help solve real-world problems. Individuals and organizations can perform essential tasks such as exploring and downloading from a vast library of conservation datasets, uploading existing datasets, connecting to other external data sources, creating groups, and producing customized maps that can be easily shared. Data Basin encourages sharing and publishing, but also provides privacy and security for sensitive information when needed. Users can publish projects within Data Basin to tell more complete and rich stories of discovery and solutions. Projects are an ideal way to publish collections of datasets, maps, and other information on the internet to reach wider audiences. Data Basin also houses individual centers that provide direct access to data, maps, and experts focused on specific geographic areas or conservation topics. Current centers being developed include the Boreal Information Centre, the Data Basin Climate Center, and proposed Aquatic and Forest Conservation Centers.
The NAS Computational Aerosciences Archive
NASA Technical Reports Server (NTRS)
Miceli, Kristina D.; Globus, Al; Lasinski, T. A. (Technical Monitor)
1995-01-01
In order to further the state-of-the-art in computational aerosciences (CAS) technology, researchers must be able to gather and understand existing work in the field. One aspect of this information gathering is studying published work available in scientific journals and conference proceedings. However, current scientific publications are very limited in the type and amount of information that they can disseminate. Information is typically restricted to text, a few images, and a bibliography list. Additional information that might be useful to the researcher, such as additional visual results, referenced papers, and datasets, are not available. New forms of electronic publication, such as the World Wide Web (WWW), limit publication size only by available disk space and data transmission bandwidth, both of which are improving rapidly. The Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center is in the process of creating an archive of CAS information on the WWW. This archive will be based on the large amount of information produced by researchers associated with the NAS facility. The archive will contain technical summaries and reports of research performed on NAS supercomputers, visual results (images, animations, visualization system scripts), datasets, and any other supporting meta-information. This information will be available via the WWW through the NAS homepage, located at http://www.nas.nasa.gov/, fully indexed for searching. The main components of the archive are technical summaries and reports, visual results, and datasets. Technical summaries are gathered every year by researchers who have been allotted resources on NAS supercomputers. These summaries, together with supporting visual results and references, are browsable by interested researchers. Referenced papers made available by researchers can be accessed through hypertext links. 
Technical reports are in-depth accounts of tools and applications research projects performed by NAS staff members and collaborators. Visual results, which may be available in the form of images, animations, and/or visualization scripts, are generated by researchers with respect to a certain research project, depicting dataset features that were determined important by the investigating researcher. For example, script files for visualization systems (e.g. FAST, PLOT3D, AVS) are provided to create visualizations on the user's local workstation to elucidate the key points of the numerical study. Users can then interact with the data starting where the investigator left off. Datasets are intended to give researchers an opportunity to understand previous work, 'mine' solutions for new information (for example, have you ever read a paper thinking "I wonder what the helicity density looks like?"), compare new techniques with older results, collaborate with remote colleagues, and perform validation. Supporting meta-information is also important to provide additional context for research projects; this may include information such as the software used in the simulation (e.g. grid generators, flow solvers, visualization). In addition to serving the CAS research community, the information archive will also be helpful to students, visualization system developers and researchers, and management. Students (of any age) can use the data to study fluid dynamics, compare results from different flow solvers, learn about meshing techniques, etc., leading to better informed individuals. For these users it is particularly important that visualization be integrated into dataset archives. Visualization researchers can use dataset archives to test algorithms and techniques, leading to better visualization systems. Management can use the data to figure out what is really going on behind the viewgraphs.
All users will benefit from fast, easy, and convenient access to CFD datasets. The CAS information archive hopes to serve as a useful resource to those interested in computational sciences. At present, only information that may be distributed internationally is made available via the archive. Studies are underway to determine security requirements and solutions to make additional information available. By providing access to the archive via the WWW, the process of information gathering can be more productive and fruitful due to ease of access and ability to manage many different types of information. As the archive grows, additional resources from outside NAS will be added, providing a dynamic source of research results.
GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains
Lu, Zhiyong
2015-01-01
The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator. PMID:26380306
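The benchmark F1 figures cited for GNormPlus rest on the standard precision/recall computation over predicted versus gold gene identifiers per document. The sketch below shows the micro-averaged version; the identifiers are made up and do not come from the BioCreative gold standards.

```python
# Sketch of micro-averaged precision/recall/F1 over per-document identifier
# sets, as used for gene-normalization benchmarks. Identifiers are invented.

def prf1(gold, predicted):
    """Micro precision, recall, and F1 over per-document identifier sets."""
    tp = fp = fn = 0
    for doc in gold:
        g, p = gold[doc], predicted.get(doc, set())
        tp += len(g & p)   # correct identifiers
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold identifiers
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {"pmid1": {"7157", "672"}, "pmid2": {"1017"}}
pred = {"pmid1": {"7157"}, "pmid2": {"1017", "999"}}
print(prf1(gold, pred))  # 2 TP, 1 FP, 1 FN -> P = R = F1 = 2/3
```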
Software Applications to Access Earth Science Data: Building an ECHO Client
NASA Astrophysics Data System (ADS)
Cohen, A.; Cechini, M.; Pilone, D.
2010-12-01
Historically, developing an ECHO (NASA’s Earth Observing System (EOS) ClearingHOuse) client required interaction with its SOAP API. SOAP, as a framework for web service communication, has numerous advantages for enterprise applications and Java/C#-type programming languages. However, as interest in quick development cycles and more intriguing “mashups” has grown, ECHO has seen the SOAP API lose its appeal. In order to address these changing needs, ECHO has introduced two new interfaces facilitating simple access to its metadata holdings. The first interface is built upon the OpenSearch format and ESIP Federated Search framework. The second interface is built upon the Representational State Transfer (REST) architecture. Using the REST and OpenSearch APIs to access ECHO makes development with modern languages much more feasible and simpler. Client developers can leverage the simple interaction with ECHO to focus more of their time on the advanced functionality they are presenting to users. To demonstrate the simplicity of developing with the REST API, participants will be led through a hands-on experience where they will develop an ECHO client that performs the following actions: login; provider discovery; provider-based dataset discovery; dataset, temporal, and spatial constraint-based granule discovery; and online data access.
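The granule-discovery step in such a client amounts to composing a query URL from dataset, temporal, and spatial constraints. The sketch below shows the general OpenSearch pattern; the endpoint and parameter names are illustrative assumptions, not the actual ECHO API.

```python
# Hedged sketch of the granule-discovery step of an OpenSearch-style client:
# build a query URL from dataset, temporal, and spatial constraints. The
# endpoint and parameter names are illustrative, not the real ECHO API.
from urllib.parse import urlencode

def granule_query(base_url, dataset_id, start, end, bbox):
    """Compose an OpenSearch-style granule request with common constraints."""
    params = {
        "datasetId": dataset_id,
        "startTime": start,
        "endTime": end,
        "boundingBox": ",".join(str(v) for v in bbox),  # west,south,east,north
        "clientId": "demo_client",  # hypothetical client identifier
    }
    return base_url + "?" + urlencode(params)

url = granule_query(
    "https://example.invalid/opensearch/granules",  # placeholder endpoint
    "MOD021KM",
    "2010-01-01T00:00:00Z",
    "2010-01-02T00:00:00Z",
    (-180, -90, 180, 90),
)
print(url)
```

A real client would then issue an HTTP GET against this URL and page through the Atom or JSON results.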
SIMOcean: Maritime Open Data and Services Platform for Portuguese Institutions
NASA Astrophysics Data System (ADS)
Almeida, Nuno; Grosso, Nuno; Catarino, Nuno; Gutierrez, Antonio; Lamas, Luísa; Alves, Margarida; Almeida, Sara; Deus, Ricardo; Oliveira, Paulo
2016-04-01
Portugal is the country with the largest EEZ in the EU and the 10th largest EEZ in the world, at 3,877,408 km2, making integrated management of the Portuguese marine system crucial for monitoring a wide range of interdependent domains. Such a system assimilates data and information from different thematic areas, ranging from ocean and atmosphere state variables to higher-level datasets describing human activities and related environmental, social and economic impacts. Currently, these datasets are collected by a large number of public and private institutions for very diverse purposes (e.g., monitoring, research, recreation, vigilance), leading to dataset duplication, the absence of common data and metadata standards across organizations, and the propagation of closed information systems with different implementation solutions. This lack of coordination and visibility hinders marine management, monitoring and vigilance capabilities, not only by making it more difficult to access, or even be aware of, certain datasets, but also by limiting the ability to create added-value products or services through the integration of datasets from different sources. Adopting an Open Data approach will bring significant benefits by reducing the cost of information exchange and data integration, promoting the extensive use of this data. SIMOcean (System for Integrated Monitoring of the Ocean), co-funded by the EEA Grants Programme, is part of an initiative of the Portuguese Government to develop a set of coordinated systems providing access to national marine data. These systems aim to improve Portuguese marine management, monitoring and vigilance capabilities by aggregating different data, including specific human-activity datasets (vessel traffic, fishing records, oil spills) and environmental variables (waves, currents, wind).
Those datasets, currently scattered among different departments of the Portuguese Meteorological (IPMA) and the Navy's Hydrographic (IH) Institutes, will be brought together in the SIMOcean Open Data system. The SIMOcean system will also exploit this data in three flagship value-added services: 1) Characterisation of Fishing Areas; 2) Wave Alerts for Sea Ports; and 3) Support to Search and Rescue Missions. These services will be driven by end users such as Civil Protection Authorities, Port Authorities and Fishing Associations, for whom these new products will have a significant positive impact on operations. SIMOcean will be based on open-source, web-based GIS interoperable solutions, compliant with OGC and INSPIRE directive standards, to support the evolution of a set of open interfaces and protocols in the development of a common European spatial data infrastructure. The Catalogue solution (based on CKAN) will consider the Portuguese Metadata Profile for the Sea developed by the SNIM@R project, the guidelines provided by Directive 2013/37/EU, and the Goldbook provided by the European Data Portal. The system will be based on the SenSyF approach of a scalable cloud computing system, providing authorised entities a single access point for data catalogue, visualisation, processing and value-added service deployment. It will be used by two of the main Portuguese sea data providers with operational responsibilities in marine management, monitoring and vigilance.
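Because the catalogue solution is CKAN-based, dataset discovery against SIMOcean would presumably follow CKAN's standard action API. The sketch below uses stock CKAN's `package_search` route; the portal URL is a made-up placeholder, not SIMOcean's actual address.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def ckan_search_url(base_url, query, rows=10):
    """Build the URL for CKAN's standard package_search action."""
    return f"{base_url}/api/3/action/package_search?{urlencode({'q': query, 'rows': rows})}"

def search_ckan(base_url, query, rows=10):
    """Return the titles of datasets matching `query` on a CKAN portal."""
    with urlopen(ckan_search_url(base_url, query, rows)) as resp:
        payload = json.load(resp)
    return [ds["title"] for ds in payload["result"]["results"]]

# Example against a live portal (placeholder URL):
# titles = search_ckan("https://catalogue.example.pt", "wave height")
```

The same two calls work against any CKAN catalogue, which is part of the interoperability argument made above.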
Using the Proteomics Identifications Database (PRIDE).
Martens, Lennart; Jones, Phil; Côté, Richard
2008-03-01
The Proteomics Identifications Database (PRIDE) is a public data repository designed to store, disseminate, and analyze mass spectrometry-based proteomics datasets. The PRIDE database can accommodate any level of detailed metadata about the submitted results, which can be queried, explored, viewed, or downloaded via the PRIDE Web interface. The PRIDE database also provides a simple yet powerful access control mechanism that fully supports confidential peer review of data related to a manuscript, ensuring that these results remain invisible to the general public while allowing referees and journal editors anonymized access to the data. This unit describes in detail the functionality that PRIDE provides with regard to searching, viewing, and comparing the available data, as well as the different options for submitting data to PRIDE.
sbtools: A package connecting R to cloud-based data for collaborative online research
Winslow, Luke; Chamberlain, Scott; Appling, Alison P.; Read, Jordan S.
2016-01-01
The adoption of high-quality tools for collaboration and reproducible research such as R and GitHub is becoming more common in many research fields. While GitHub and other version management systems are excellent resources, they were originally designed to handle code and scale poorly to large text-based or binary datasets. A number of scientific data repositories are coming online and are often focused on dataset archival and publication. To handle collaborative workflows using large scientific datasets, there is increasing need to connect cloud-based online data storage to R. In this article, we describe how the new R package sbtools enables direct access to the advanced online data functionality provided by ScienceBase, the U.S. Geological Survey’s online scientific data storage platform.
Interoperable and accessible census and survey data from IPUMS.
Kugler, Tracy A; Fitch, Catherine A
2018-02-27
The first version of the Integrated Public Use Microdata Series (IPUMS) was released to users in 1993, and since that time IPUMS has come to stand for interoperable and accessible census and survey data. Initially created to harmonize U.S. census microdata over time, IPUMS now includes microdata from the U.S. and international censuses and from surveys on health, employment, and other topics. IPUMS also provides geo-spatial data, aggregate population data, and environmental data. IPUMS supports ten data products, each disseminating an integrated data collection with a set of tools that make complex data easy to find, access, and use. Key features are record-level integration to create interoperable datasets, user-friendly interfaces, and comprehensive metadata and documentation. The IPUMS philosophy aligns closely with the FAIR principles of findability, accessibility, interoperability, and re-usability. IPUMS data have catalyzed knowledge generation across a wide range of social science and other disciplines, as evidenced by the large volume of publications and other products created by the vast IPUMS user community.
Ingwersen, Peter; Chavan, Vishwas
2011-01-01
A professional recognition mechanism is required to encourage expedited publishing of an adequate volume of 'fit-for-use' biodiversity data. As a component of such a recognition mechanism, we propose the development of a Data Usage Index (DUI) to demonstrate to data publishers that their efforts in creating biodiversity datasets have impact, through being accessed and used by a wide spectrum of user communities. We propose, and give examples of, a range of 14 absolute and normalized biodiversity dataset usage indicators for the development of a DUI based on search events and dataset download instances. The DUI is proposed to include relative as well as species-profile-weighted comparative indicators. We believe that, in addition to providing recognition to the data publisher and all players involved in the data life cycle, a DUI will also provide much-needed and novel insight into how users use primary biodiversity data. A DUI consisting of a range of usage indicators obtained from the GBIF network and other relevant access points is within reach. The usage of biodiversity datasets thus leads to the development of a family of indicators in line with well-known citation-based measurements of recognition.
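As a toy illustration of the kind of absolute and normalized usage indicators proposed above, the sketch below derives per-record rates from download and search counts. The specific formulas are illustrative assumptions, not the paper's 14 indicators.

```python
def usage_indicators(downloads, searches, n_records):
    """Absolute and per-record (normalized) usage indicators for one dataset.
    Formulas are illustrative, not the DUI's actual definitions."""
    return {
        "absolute_usage": downloads + searches,
        "downloads_per_record": downloads / n_records,
        "searches_per_record": searches / n_records,
    }

# A hypothetical dataset with 300,000 published records:
ind = usage_indicators(downloads=1200, searches=4800, n_records=300000)
```

Normalizing by record count lets small, heavily used datasets compare fairly against very large ones, which is the point of including relative indicators.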
Johansen, Morten Bo; Izarzugaza, Jose M. G.; Brunak, Søren; Petersen, Thomas Nordahl; Gupta, Ramneek
2013-01-01
We have developed a sequence conservation-based artificial neural network predictor called NetDiseaseSNP, which classifies nsSNPs as disease-causing or neutral. Our method uses the excellent alignment generation algorithm of SIFT to identify related sequences and a combination of 31 features assessing sequence conservation and predicted surface accessibility to produce a single score that can be used to rank nsSNPs by their potential to cause disease. NetDiseaseSNP successfully classifies disease-causing and neutral mutations. In addition, we show that NetDiseaseSNP discriminates cancer driver and passenger mutations satisfactorily. Our method outperforms other state-of-the-art methods on several disease/neutral datasets as well as on cancer driver/passenger mutation datasets, and can thus be used to pinpoint and prioritize plausible disease candidates among nsSNPs for further investigation. NetDiseaseSNP is publicly available as an online tool as well as a web service: http://www.cbs.dtu.dk/services/NetDiseaseSNP PMID:23935863
VTCdb: a gene co-expression database for the crop species Vitis vinifera (grapevine).
Wong, Darren C J; Sweetman, Crystal; Drew, Damian P; Ford, Christopher M
2013-12-16
Gene expression datasets in model plants such as Arabidopsis have contributed to our understanding of gene function and how a single underlying biological process can be governed by a diverse network of genes. The accumulation of publicly available microarray data encompassing a wide range of biological and environmental conditions has enabled the development of additional capabilities including gene co-expression analysis (GCA). GCA is based on the understanding that genes encoding proteins involved in similar and/or related biological processes may exhibit comparable expression patterns over a range of experimental conditions, developmental stages and tissues. We present an open access database for the investigation of gene co-expression networks within the cultivated grapevine, Vitis vinifera. The new gene co-expression database, VTCdb (http://vtcdb.adelaide.edu.au/Home.aspx), offers an online platform for transcriptional regulatory inference in the cultivated grapevine. Using condition-independent and condition-dependent approaches, grapevine co-expression networks were constructed using the latest publicly available microarray datasets from diverse experimental series, utilising the Affymetrix Vitis vinifera GeneChip (16 K) and the NimbleGen Grape Whole-genome microarray chip (29 K), thus making it possible to profile approximately 29,000 genes (95% of the predicted grapevine transcriptome). Applications available with the online platform include the use of gene names, probesets, modules or biological processes to query the co-expression networks, with the option to choose between Affymetrix or Nimblegen datasets and between multiple co-expression measures. Alternatively, the user can browse existing network modules using interactive network visualisation and analysis via CytoscapeWeb. 
To demonstrate the utility of the database, we present examples from three fundamental biological processes (berry development, photosynthesis and flavonoid biosynthesis) whereby the recovered sub-networks reconfirm established plant gene functions and also identify novel associations. Together, we present valuable insights into grapevine transcriptional regulation by developing network models applicable to researchers in their prioritisation of gene candidates, for on-going study of biological processes related to grapevine development, metabolism and stress responses.
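Condition-independent GCA of the kind VTCdb offers rests on scoring expression-profile similarity across many experiments; Pearson correlation is one of the common co-expression measures. The sketch below uses invented expression vectors, not VTCdb data.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical genes profiled across six conditions:
gene_a = [2.1, 3.4, 5.0, 4.2, 6.8, 7.1]
gene_b = [1.9, 3.1, 4.8, 4.0, 6.5, 7.0]
r = pearson(gene_a, gene_b)  # near +1 -> candidate co-expressed pair
```

In a full pipeline, pairwise scores above a threshold become network edges, and modules are then extracted from the resulting graph.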
Increasing global accessibility and understanding of water column sonar data
NASA Astrophysics Data System (ADS)
Wall, C.; Anderson, C.; Mesick, S.; Parsons, A. R.; Boyer, T.; McLean, S. J.
2016-02-01
Active acoustic (sonar) technology is of increasing importance for research examining the water column. NOAA uses water column sonar data to map acoustic properties from the ocean surface to the seafloor - from bubbles to biology to bottom. Scientific echosounders aboard fishery survey vessels are used to estimate biomass, measure fish school morphology, and characterize habitat. These surveys produce large volumes of data that are costly and difficult to maintain due to their size, complexity, and proprietary format that require specific software and extensive knowledge. However, through proper management they can deliver valuable information beyond their original collection purpose. In order to maximize the benefit to the public, the data must be easily discoverable and accessible. Access to ancillary data is also needed for complete environmental context and ecosystem assessment. NOAA's National Centers for Environmental Information, in partnership with NOAA's National Marine Fisheries Service and the University of Colorado, created a national archive for the stewardship and distribution of water column sonar data collected on NOAA and academic vessels. A web-based access page allows users to query the metadata and access the raw sonar data. Visualization products being developed allow researchers and the public to understand the quality and content of large volumes of archived data more easily. Such products transform the complex data into a digestible image or graphic and are highly valuable for a broad audience of varying backgrounds. Concurrently collected oceanographic data and bathymetric data are being integrated into the data access web page to provide an ecosystem-wide understanding of the area ensonified. Benefits of the archive include global access to an unprecedented nationwide dataset and the increased potential for researchers to address cross-cutting scientific questions to advance the field of marine ecosystem acoustics.
Mahamdallie, Shazia; Ruark, Elise; Yost, Shawn; Ramsay, Emma; Uddin, Imran; Wylie, Harriett; Elliott, Anna; Strydom, Ann; Renwick, Anthony; Seal, Sheila; Rahman, Nazneen
2017-01-01
Detection of deletions and duplications of whole exons (exon CNVs) is a key requirement of genetic testing. Accurate detection of this variant type has proved very challenging in targeted next-generation sequencing (NGS) data, particularly if only a single exon is involved. Many different NGS exon CNV calling methods have been developed over the last five years. Such methods are usually evaluated using simulated and/or in-house data due to a lack of publicly available datasets with orthogonally generated results. This hinders tool comparisons, transparency and reproducibility. To provide a community resource for assessment of exon CNV calling methods in targeted NGS data, we here present the ICR96 exon CNV validation series. The dataset includes high-quality sequencing data from a targeted NGS assay (the TruSight Cancer Panel) together with Multiplex Ligation-dependent Probe Amplification (MLPA) results for 96 independent samples. 66 samples contain at least one validated exon CNV and 30 samples have validated negative results for exon CNVs in 26 genes. The dataset includes 46 exon CNVs in BRCA1, BRCA2, TP53, MLH1, MSH2, MSH6, PMS2, EPCAM or PTEN, giving excellent representation of the cancer predisposition genes most frequently tested in clinical practice. Moreover, the validated exon CNVs include 25 single-exon CNVs, the most difficult type of exon CNV to detect. The FASTQ files for the ICR96 exon CNV validation series can be accessed through the European Genome-phenome Archive (EGA) under the accession number EGAS00001002428.
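A validation series like ICR96 is typically used by comparing a caller's per-sample exon CNV calls against the orthogonal MLPA truth set. The sketch below shows one way such a comparison could be scored; the sample IDs and calls are invented for illustration.

```python
def evaluate_caller(truth, calls):
    """truth/calls: dicts mapping sample ID -> set of (gene, cnv) tuples.
    Counts per-call true positives, false negatives and false positives."""
    tp = fn = fp = 0
    for sample, true_cnvs in truth.items():
        called = calls.get(sample, set())
        tp += len(true_cnvs & called)
        fn += len(true_cnvs - called)
        fp += len(called - true_cnvs)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    return {"tp": tp, "fp": fp, "fn": fn, "sensitivity": sensitivity}

# Hypothetical MLPA truth and caller output for three samples:
truth = {"S1": {("BRCA1", "del_ex13")}, "S2": {("PMS2", "dup_ex11")}, "S3": set()}
calls = {"S1": {("BRCA1", "del_ex13")}, "S2": set(), "S3": {("MSH2", "del_ex7")}}
metrics = evaluate_caller(truth, calls)  # one hit, one miss, one false call
```

Reporting sensitivity separately for single-exon CNVs, the hardest class noted above, would use the same comparison restricted to those events.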
PGP repository: a plant phenomics and genomics data publication infrastructure.
Arend, Daniel; Junker, Astrid; Scholz, Uwe; Schüler, Danuta; Wylie, Juliane; Lange, Matthias
2016-01-01
Plant genomics and phenomics represent the most promising tools for accelerating yield gains and overcoming emerging crop productivity bottlenecks. However, accessing this wealth of plant diversity requires the characterization of this material using state-of-the-art genomic, phenomic and molecular technologies and the release of the subsequent research data via a long-term stable, open-access portal. Although several international consortia and public resource centres offer services for plant research data management, valuable digital assets remain unpublished and thus inaccessible to the scientific community. Recently, the Leibniz Institute of Plant Genetics and Crop Plant Research and the German Plant Phenotyping Network jointly initiated the Plant Genomics and Phenomics Research Data Repository (PGP) as an infrastructure to comprehensively publish plant research data. This covers in particular cross-domain datasets that are not published in central repositories because of their volume or unsupported data scope, such as image collections from plant phenotyping and microscopy, unfinished genomes, genotyping data, visualizations of morphological plant models, data from mass spectrometry, as well as software and documents. The repository is hosted at the Leibniz Institute of Plant Genetics and Crop Plant Research, using e!DAL as the software infrastructure and a Hierarchical Storage Management System as the data archival backend. A newly developed data submission tool was made available to the consortium that features a high level of automation to lower the barriers to data publication. After an internal review process, data are published with citable digital object identifiers, and a core set of technical metadata is registered at DataCite. The e!DAL-embedded Web frontend generates a landing page for each dataset and supports interactive exploration.
PGP is registered as a research data repository at BioSharing.org, re3data.org and OpenAIRE as a valid EU Horizon 2020 open data archive. The above features, together with the programmatic interface and support for standard metadata formats, enable PGP to fulfil the FAIR data principles: findable, accessible, interoperable, reusable. Database URL: http://edal.ipk-gatersleben.de/repos/pgp/. © The Author(s) 2016. Published by Oxford University Press.
Song, Ruiguang; Hall, H Irene; Harrison, Kathleen McDavid; Sharpe, Tanya Telfair; Lin, Lillian S; Dean, Hazel D
2011-01-01
We developed a statistical tool that brings together standard, accessible, and well-understood analytic approaches and uses area-based information and other publicly available data to identify social determinants of health (SDH) that significantly affect the morbidity of a specific disease. We specified AIDS as the disease of interest and used data from the American Community Survey and the National HIV Surveillance System. Morbidity and socioeconomic variables in the two data systems were linked through geographic areas that can be identified in both systems. Correlation and partial correlation coefficients were used to measure the impact of socioeconomic factors on AIDS diagnosis rates in certain geographic areas. We developed an easily explained approach that can be used by a data analyst with access to publicly available datasets and standard statistical software to identify the impact of SDH. We found that the AIDS diagnosis rate was highly correlated with the distribution of race/ethnicity, population density, and marital status in an area. The impact of poverty, education level, and unemployment depended on other SDH variables. Area-based measures of socioeconomic variables can be used to identify risk factors associated with a disease of interest. When correlation analysis is used to identify risk factors, potential confounding from other variables must be taken into account.
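The pairing of correlation with partial correlation described above is what lets the analysis separate a factor's direct association from confounding by other SDH variables. For three variables, the first-order partial correlation has a standard closed form, sketched below on invented area-level data (not American Community Survey or surveillance figures).

```python
import math

def pearson(x, y):
    """Pearson correlation between two variables over the same areas."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Invented per-area values: diagnosis rate, % below poverty, population density.
rate    = [12.0, 8.5, 20.1, 15.3, 30.2, 25.4]
poverty = [10.0, 7.0, 18.0, 12.0, 25.0, 22.0]
density = [3000, 1500, 8000, 4000, 12000, 9000]
r_direct = pearson(rate, poverty)
r_partial = partial_corr(rate, poverty, density)  # poverty effect net of density
```

A large drop from `r_direct` to `r_partial` would suggest the poverty association is partly carried by density, which mirrors the paper's finding that some SDH effects depend on other SDH variables.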
Food Safety in the Age of Next Generation Sequencing, Bioinformatics, and Open Data Access.
Taboada, Eduardo N; Graham, Morag R; Carriço, João A; Van Domselaar, Gary
2017-01-01
Public health labs and food regulatory agencies globally are embracing whole genome sequencing (WGS) as a revolutionary new method that is positioned to replace numerous existing diagnostic and microbial typing technologies with a single new target: the microbial draft genome. The ability to cheaply generate large amounts of microbial genome sequence data, combined with emerging policies of food regulatory and public health institutions making their microbial sequences increasingly available and public, has served to open up the field to the general scientific community. This open data access policy shift has resulted in a proliferation of data being deposited into sequence repositories and of novel bioinformatics software designed to analyze these vast datasets. There also has been a more recent drive for improved data sharing to achieve more effective global surveillance, public health and food safety. Such developments have heightened the need for enhanced analytical systems in order to process and interpret this new type of data in a timely fashion. In this review we outline the emergence of genomics, bioinformatics and open data in the context of food safety. We also survey major efforts to translate genomics and bioinformatics technologies out of the research lab and into routine use in modern food safety labs. We conclude by discussing the challenges and opportunities that remain, including those expected to play a major role in the future of food safety science.
Secure access control and large scale robust representation for online multimedia event detection.
Liu, Changyu; Lu, Bin; Li, Huiling
2014-01-01
We developed an online multimedia event detection (MED) system. However, integrating traditional event detection algorithms into an online environment raises two issues: secure access control and large-scale robust representation. For the first issue, we proposed a tree proxy-based and service-oriented access control (TPSAC) model based on the traditional role-based access control model. Verification experiments were conducted on the CloudSim simulation platform, and the results showed that the TPSAC model is suitable for the access control of dynamic online environments. For the second issue, inspired by the object-bank scene descriptor, we proposed a 1000-object-bank (1000OBK) event descriptor. Feature vectors of the 1000OBK were extracted from the response pyramids of 1000 generic object detectors, which were trained on standard annotated image datasets such as the ImageNet dataset. A spatial bag-of-words tiling approach was then adopted to encode these feature vectors, bridging the gap between objects and events. Furthermore, we performed experiments in the context of event classification on the challenging TRECVID MED 2012 dataset, and the results showed that the robust 1000OBK event descriptor outperforms state-of-the-art approaches.
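Spatial bag-of-words tiling, as used for the 1000OBK descriptor, pools detector responses per spatial tile and concatenates the per-tile histograms so that coarse layout information is preserved. The sketch below is a minimal illustration with an invented 2x2 grid and toy detections, not the paper's actual configuration.

```python
def spatial_bow(detections, width, height, n_vocab, grid=2):
    """detections: list of (object_id, x, y) detector hits in one frame.
    Returns the concatenation of one object histogram per spatial tile."""
    hist = [0.0] * (grid * grid * n_vocab)
    for obj_id, x, y in detections:
        tx = min(int(x / width * grid), grid - 1)   # tile column
        ty = min(int(y / height * grid), grid - 1)  # tile row
        hist[(ty * grid + tx) * n_vocab + obj_id] += 1.0
    return hist

# Toy example: a 2-word vocabulary over a 100x100 frame.
hits = [(0, 10, 10), (1, 90, 90), (0, 95, 5)]
vec = spatial_bow(hits, width=100, height=100, n_vocab=2)
# length = 2x2 tiles * 2 vocabulary entries = 8
```

Unlike a global histogram, this encoding distinguishes the same objects appearing in different parts of the frame, which matters for event classification.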
This EnviroAtlas web service contains layers depicting market-based programs and projects addressing ecosystem services protection in the United States. Layers include data collected via surveys and desk research conducted by Forest Trends' Ecosystem Marketplace from 2008 to 2016 on biodiversity (i.e., imperiled species/habitats; wetlands and streams), carbon, and water markets and enabling conditions that facilitate, directly or indirectly, market-based approaches to protecting and investing in those ecosystem services. This dataset was produced by Forest Trends' Ecosystem Marketplace for EnviroAtlas in order to support public access to and use of information related to environmental markets. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Revamped Website Features Easier Access to Travel Survey Data, Offers New Datasets
NREL News
Each survey or study now has its own page, allowing users to bookmark it or provide a link to it.
The Importance of Biodiversity E-infrastructures for Megadiverse Countries
Canhos, Dora A. L.; Sousa-Baena, Mariane S.; de Souza, Sidnei; Maia, Leonor C.; Stehmann, João R.; Canhos, Vanderlei P.; De Giovanni, Renato; Bonacelli, Maria B. M.; Los, Wouter; Peterson, A. Townsend
2015-01-01
Addressing the challenges of biodiversity conservation and sustainable development requires global cooperation, support structures, and new governance models to integrate diverse initiatives and achieve massive, open exchange of data, tools, and technology. The traditional paradigm of sharing scientific knowledge through publications is not sufficient to meet contemporary demands that require not only the results but also data, knowledge, and skills to analyze the data. E-infrastructures are key in facilitating access to data and providing the framework for collaboration. Here we discuss the importance of e-infrastructures of public interest and the lack of long-term funding policies. We present the example of Brazil’s speciesLink network, an e-infrastructure that provides free and open access to biodiversity primary data and associated tools. SpeciesLink currently integrates 382 datasets from 135 national institutions and 13 institutions from abroad, openly sharing ~7.4 million records, 94% of which are associated to voucher specimens. Just as important as the data is the network of data providers and users. In 2014, more than 95% of its users were from Brazil, demonstrating the importance of local e-infrastructures in enabling and promoting local use of biodiversity data and knowledge. From the outset, speciesLink has been sustained through project-based funding, normally public grants for 2–4-year periods. In between projects, there are short-term crises in trying to keep the system operational, a fact that has also been observed in global biodiversity portals, as well as in social and physical sciences platforms and even in computing services portals. In the last decade, the open access movement propelled the development of many web platforms for sharing data. Adequate policies unfortunately did not follow the same tempo, and now many initiatives may perish. PMID:26204382
Enabling cross-disciplinary research by linking data to Open Access publications
NASA Astrophysics Data System (ADS)
Rettberg, N.
2012-04-01
OpenAIREplus focuses on the linking of research data to associated publications. The interlinking of research objects has implications for optimising the research process, allowing the sharing, enrichment and reuse of data, and ultimately serving to make open data an essential part of first class research. The growing call for more concrete data management and sharing plans, apparent at funder and national level, is complemented by the increasing support for a scientific infrastructure that supports the seamless access to a range of research materials. This paper will describe the recently launched OpenAIREplus and will detail how it plans to achieve its goals of developing an Open Access participatory infrastructure for scientific information. OpenAIREplus extends the current collaborative OpenAIRE project, which provides European researchers with a service network for the deposit of peer-reviewed FP7 grant-funded Open Access publications. This new project will focus on opening up the infrastructure to data sources from subject-specific communities to provide metadata about research data and publications, facilitating the linking between these objects. The ability to link within a publication out to a citable database, or other research data material, is fairly innovative and this project will enable users to search, browse, view, and create relationships between different information objects. In this regard, OpenAIREplus will build on prototypes of so-called "Enhanced Publications", originally conceived in the DRIVER-II project. OpenAIREplus recognizes the importance of representing the context of publications and datasets, thus linking to resources about the authors, their affiliation, location, project data and funding. The project will explore how links between text-based publications and research data are managed in different scientific fields. 
This complements a previous study in OpenAIRE on current disciplinary practices and future needs for infrastructural Open Access services, taking into account the variety within research approaches. Adopting Linked Data mechanisms on top of citation and content mining, it will approach the interchange of data between generic infrastructures such as OpenAIREplus and subject specific service providers. The paper will also touch on the other challenges envisaged in the project with regard to the culture of sharing data, as well as IPR, licensing and organisational issues.
76 FR 4904 - Agency Information Collection Request; 30-Day Public Comment Request
Federal Register 2010, 2011, 2012, 2013, 2014
2011-01-27
... datasets that are not specific to individual's personal health information to improve decision making by... making health indicator datasets (data that is not associated with any individuals) and tools available.../health . These datasets and tools are anticipated to benefit development of applications, web-based tools...
Just, Anaïs; Gourvil, Johan; Millet, Jérôme; Boullet, Vincent; Milon, Thomas; Mandon, Isabelle; Dutrève, Bruno
2015-01-01
More than 20 years ago, the French Muséum National d'Histoire Naturelle (MNHN, Secretariat of the Fauna and Flora) published the first part of an atlas of the flora of France at a 20km spatial resolution, accounting for 645 taxa (Dupont 1990). Since then, at the national level, there has not been any work on this scale relating to flora distribution, despite the obvious need for a better understanding. In 2011, in response to this need, the Fédération des Conservatoires Botaniques Nationaux (FCBN, http://www.fcbn.fr) launched an ambitious collaborative project involving eleven national botanical conservatories of France. The project aims to establish a formal procedure and standardized system for data hosting, aggregation and publication for four areas: flora, fungi, vegetation and habitats. In 2014, the first phase of the project led to the development of the national flora dataset: SIFlore. As it includes about 21 million records of flora occurrences, this is currently the most comprehensive dataset on the distribution of vascular plants (Tracheophyta) in the French territory. SIFlore contains information for about 15,454 plant taxa occurrences (indigenous and alien taxa) in metropolitan France and Réunion Island, from 1545 until 2014. The data records were originally collated from inventories, checklists, literature and herbarium records. SIFlore was developed by assembling flora datasets from the regional to the national level. At the regional level, source records are managed by the national botanical conservatories that are responsible for flora data collection and validation. In order to present our results, a geoportal was developed by the Fédération des conservatoires botaniques nationaux that allows the SIFlore dataset to be publicly viewed. This portal is available at: http://siflore.fcbn.fr.
As the FCBN belongs to the Information System for Nature and Landscapes (SINP), a governmental program, the dataset is also accessible through the websites of the National Inventory of Natural Heritage (http://www.inpn.fr) and the Global Biodiversity Information Facility (http://www.gbif.fr). SIFlore is regularly updated with additional data records. It is also planned to expand the scope of the dataset to include information about taxon biology, phenology, ecology, chorology, frequency, conservation status and seed banks. A map showing an estimation of the dataset completeness (based on the Jackknife 1 estimator) is presented and included as a numerical appendix. SIFlore aims to make the data of the flora of France available at the national level for conservation, policy management and scientific research. Such a dataset will provide enough information to allow for macro-ecological reviews of species distribution patterns and, coupled with climatic or topographic datasets, the identification of determinants of these patterns. This dataset can be considered as the primary indicator of the current state of knowledge of flora distribution across France. At a policy level, and in the context of global warming, this should promote the adoption of new measures aiming to improve and intensify flora conservation and surveys.
Just, Anaïs; Gourvil, Johan; Millet, Jérôme; Boullet, Vincent; Milon, Thomas; Mandon, Isabelle; Dutrève, Bruno
2015-01-01
Abstract More than 20 years ago, the French Muséum National d’Histoire Naturelle (MNHN, Secretariat of the Fauna and Flora) published the first part of an atlas of the flora of France at a 20km spatial resolution, accounting for 645 taxa (Dupont 1990). Since then, at the national level, there has not been any work on this scale relating to flora distribution, despite the obvious need for a better understanding. In 2011, in response to this need, the Fédération des Conservatoires Botaniques Nationaux (FCBN, http://www.fcbn.fr) launched an ambitious collaborative project involving eleven national botanical conservatories of France. The project aims to establish a formal procedure and standardized system for data hosting, aggregation and publication for four areas: flora, fungi, vegetation and habitats. In 2014, the first phase of the project led to the development of the national flora dataset: SIFlore. As it includes about 21 million records of flora occurrences, this is currently the most comprehensive dataset on the distribution of vascular plants (Tracheophyta) in the French territory. SIFlore contains information for about 15,454 plant taxa occurrences (indigenous and alien taxa) in metropolitan France and Réunion Island, from 1545 until 2014. The data records were originally collated from inventories, checklists, literature and herbarium records. SIFlore was developed by assembling flora datasets from the regional to the national level. At the regional level, source records are managed by the national botanical conservatories that are responsible for flora data collection and validation. In order to present our results, a geoportal was developed by the Fédération des conservatoires botaniques nationaux that allows the SIFlore dataset to be publicly viewed. This portal is available at: http://siflore.fcbn.fr.
As the FCBN belongs to the Information System for Nature and Landscapes (SINP), a governmental program, the dataset is also accessible through the websites of the National Inventory of Natural Heritage (http://www.inpn.fr) and the Global Biodiversity Information Facility (http://www.gbif.fr). SIFlore is regularly updated with additional data records. It is also planned to expand the scope of the dataset to include information about taxon biology, phenology, ecology, chorology, frequency, conservation status and seed banks. A map showing an estimation of the dataset completeness (based on the Jackknife 1 estimator) is presented and included as a numerical appendix. Purpose: SIFlore aims to make the data of the flora of France available at the national level for conservation, policy management and scientific research. Such a dataset will provide enough information to allow for macro-ecological reviews of species distribution patterns and, coupled with climatic or topographic datasets, the identification of determinants of these patterns. This dataset can be considered as the primary indicator of the current state of knowledge of flora distribution across France. At a policy level, and in the context of global warming, this should promote the adoption of new measures aiming to improve and intensify flora conservation and surveys. PMID:26491386
He, Zilong; Zhang, Huangkai; Gao, Shenghan; Lercher, Martin J; Chen, Wei-Hua; Hu, Songnian
2016-07-08
Evolview is an online visualization and management tool for customized and annotated phylogenetic trees. It allows users to visualize phylogenetic trees in various formats, customize the trees through built-in functions and user-supplied datasets, and export the customization results to publication-ready figures. Its 'dataset system' contains not only the data to be visualized on the tree, but also 'modifiers' that control various aspects of the graphical annotation. Evolview is a single-page application (like Gmail); its carefully designed interface allows users to upload, visualize, manipulate and manage trees and datasets all in a single webpage. Developments since the last public release include a modern dataset editor with keyword highlighting functionality, seven newly added types of annotation datasets, collaboration support that allows users to share their trees and datasets, and various improvements of the web interface and performance. In addition, we included eleven new 'Demo' trees to demonstrate the basic functionalities of Evolview, and five new 'Showcase' trees inspired by publications to showcase the power of Evolview in producing publication-ready figures. Evolview is freely available at: http://www.evolgenius.info/evolview/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Learning to recognize rat social behavior: Novel dataset and cross-dataset application.
Lorbach, Malte; Kyriakou, Elisavet I; Poppe, Ronald; van Dam, Elsbeth A; Noldus, Lucas P J J; Veltkamp, Remco C
2018-04-15
Social behavior is an important aspect of rodent models. Automated measuring tools that make use of video analysis and machine learning are an increasingly attractive alternative to manual annotation. Because machine learning-based methods need to be trained, it is important that they are validated using data from different experiment settings. To develop and validate automated measuring tools, there is a need for annotated rodent interaction datasets. Currently, the availability of such datasets is limited to two mouse datasets. We introduce the first publicly available rat social interaction dataset, RatSI. We demonstrate the practical value of the novel dataset by using it as the training set for a rat interaction recognition method. We show that behavior variations induced by the experiment setting can lead to reduced performance, which illustrates the importance of cross-dataset validation. Consequently, we add a simple adaptation step to our method and improve the recognition performance. Most existing methods are trained and evaluated in one experimental setting, which limits the predictive power of the evaluation to that particular setting. We demonstrate that cross-dataset experiments provide more insight in the performance of classifiers. With our novel, public dataset we encourage the development and validation of automated recognition methods. We are convinced that cross-dataset validation enhances our understanding of rodent interactions and facilitates the development of more sophisticated recognition methods. Combining them with adaptation techniques may enable us to apply automated recognition methods to a variety of animals and experiment settings. Copyright © 2017 Elsevier B.V. All rights reserved.
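The cross-dataset effect described above can be illustrated with a toy sketch (not the paper's method or data): a classifier fit in one recording setting degrades when the second setting's tracker reports features on a different scale, and a simple per-dataset standardization step, analogous to the adaptation step the authors add, recovers accuracy. The single "speed" feature and the two behavior labels are invented.

```python
# Toy cross-dataset validation: train on dataset A, test on dataset B
# recorded with a different (affine) feature scale, then adapt by
# z-scoring each dataset separately. All numbers are illustrative.

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def zscore(xs):
    m, s = mean(xs), std(xs)
    return [(x - m) / s for x in xs]

def nearest_centroid(train_x, train_y, test_x):
    """1-D nearest-centroid classifier; labels are 0/1."""
    c0 = mean([x for x, y in zip(train_x, train_y) if y == 0])
    c1 = mean([x for x, y in zip(train_x, train_y) if y == 1])
    return [0 if abs(x - c0) <= abs(x - c1) else 1 for x in test_x]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Dataset A (training setting); dataset B has the same behaviors but the
# camera/tracker scales speeds differently: x -> 2x + 1.
a_x = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
b_x = [2 * x + 1 for x in a_x]
labels = [0, 0, 0, 1, 1, 1]          # 0 = following, 1 = fighting

raw = accuracy(nearest_centroid(a_x, labels, b_x), labels)
adapted = accuracy(nearest_centroid(zscore(a_x), labels, zscore(b_x)), labels)
print(raw, adapted)                   # -> 0.5 1.0
```

Within-dataset evaluation would report perfect accuracy here and hide the failure; only the cross-dataset test exposes it.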
NASA Astrophysics Data System (ADS)
Brekke, L. D.; Pruitt, T.; Maurer, E. P.; Duffy, P. B.
2007-12-01
Incorporating climate change information into long-term evaluations of water and energy resources requires analysts to have access to climate projection data that have been spatially downscaled to "basin-relevant" resolution. This is necessary in order to develop system-specific hydrology and demand scenarios consistent with projected climate scenarios. Analysts currently have access to "climate model" resolution data (e.g., at LLNL PCMDI), but not spatially downscaled translations of these datasets. Motivated by a common interest in supporting regional and local assessments, the U.S. Bureau of Reclamation and LLNL (through support from the DOE National Energy Technology Laboratory) have teamed to develop an archive of downscaled climate projections (temperature and precipitation) with geographic coverage consistent with the North American Land Data Assimilation System domain, encompassing the contiguous United States. A web-based information service, hosted at LLNL Green Data Oasis, has been developed to provide Reclamation, LLNL, and other interested analysts free access to archive content. A contemporary statistical method was used to bias-correct and spatially disaggregate projection datasets, and was applied to 112 projections included in the WCRP CMIP3 multi-model dataset hosted by LLNL PCMDI (i.e., 16 GCMs and their multiple simulations of SRES A2, A1B, and B1 emissions pathways).
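The bias-correction half of such a method is commonly done by quantile mapping. A minimal sketch follows: a model value is located within the model's historical distribution, and the observation at the same quantile is read off. Real implementations operate on full monthly climatologies per grid cell; the tiny arrays here are illustrative only.

```python
# Minimal empirical quantile-mapping bias correction, the kind of
# statistical step used (alongside spatial disaggregation) in archives
# like this one. Sample values are invented.
import bisect

def quantile_map(model_hist, obs, value):
    """Map a model value onto the observed distribution by matching quantiles."""
    hist = sorted(model_hist)
    # Empirical quantile of `value` within the model's historical run.
    q = bisect.bisect_left(hist, value) / len(hist)
    obs_sorted = sorted(obs)
    # Read off the same quantile from the observations.
    idx = min(int(q * len(obs_sorted)), len(obs_sorted) - 1)
    return obs_sorted[idx]

# Model runs about 2 degrees C too warm relative to observations:
model = [12.0, 13.0, 14.0, 15.0, 16.0]
obs   = [10.0, 11.0, 12.0, 13.0, 14.0]
print(quantile_map(model, obs, 14.0))   # -> 12.0
```

Because the mapping matches quantiles rather than subtracting a single offset, it can correct biases that differ between the warm and cold tails of the distribution.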
NASA Technical Reports Server (NTRS)
Al-Hamdan, Mohammad; Crosson, William; Economou, Sigrid; Estes, Maurice Jr; Estes, Sue; Hemmings, Sarah; Kent, Shia; Puckett, Mark; Quattrochi, Dale; Wade, Gina
2013-01-01
NASA Marshall Space Flight Center is collaborating with the University of Alabama at Birmingham (UAB) School of Public Health and the Centers for Disease Control and Prevention (CDC) National Center for Public Health Informatics to address issues of environmental health and enhance public health decision-making using NASA remotely-sensed data and products. The objectives of this study are to develop high-quality spatial data sets of environmental variables, link these with public health data from a national cohort study, and deliver the linked data sets and associated analyses to local, state and federal end-user groups. Three daily environmental data sets were developed for the conterminous U.S. at different spatial resolutions for the period 2003-2008: (1) spatial surfaces of estimated fine particulate matter (PM2.5) exposures on a 10-km grid using the US Environmental Protection Agency (EPA) ground observations and NASA's MODerate-resolution Imaging Spectroradiometer (MODIS) data; (2) a 1-km grid of Land Surface Temperature (LST) using MODIS data; and (3) a 12-km grid of daily Incoming Solar Radiation (Insolation) and heat-related products using the North American Land Data Assimilation System (NLDAS) forcing data. These environmental data sets were linked with public health data from the UAB REasons for Geographic And Racial Differences in Stroke (REGARDS) national cohort study to determine whether exposures to these environmental risk factors are related to cognitive decline, stroke and other health outcomes. These environmental datasets and the results of the public health linkage analyses will be disseminated to end-users for decision-making through the CDC Wide-ranging Online Data for Epidemiologic Research (WONDER) system and through peer-reviewed publications, respectively. The linkage of these data with the CDC WONDER system substantially expands public access to NASA data, making their use by a wide range of decision makers feasible.
By successful completion of this research, decision-making activities, including policy-making and clinical decision-making, can be positively affected through utilization of the data products and analyses provided on the CDC WONDER system.
Wide-Open: Accelerating public data release by automating detection of overdue datasets
Poon, Hoifung; Howe, Bill
2017-01-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
Wide-Open: Accelerating public data release by automating detection of overdue datasets.
Grechkin, Maxim; Poon, Hoifung; Howe, Bill
2017-06-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
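The text-mining half of a Wide-Open-style pipeline can be sketched briefly: scan article text for GEO and SRA accession numbers, which could then be checked against the repository to see whether each dataset is public. The regexes below cover common accession shapes, and the sample article text is invented; this is not the authors' implementation.

```python
# Sketch of accession-number extraction for an overdue-dataset detector.
import re

ACCESSION_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d{3,}\b"),      # GEO series, e.g. GSE12345
    "SRA": re.compile(r"\bSRP\d{5,}\b"),      # SRA study, e.g. SRP012345
}

def find_accessions(text):
    """Return {repository: sorted unique accessions} mentioned in `text`."""
    return {repo: sorted(set(pat.findall(text)))
            for repo, pat in ACCESSION_PATTERNS.items()}

article = ("Raw reads were deposited in SRA under SRP098765; processed "
           "counts are available from GEO (GSE87654 and GSE87655).")
print(find_accessions(article))
```

The second half of the pipeline, querying each accession against the repository and flagging those that are still private despite appearing in a published article, would consume this function's output.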
Karst database development in Minnesota: Design and data assembly
Gao, Y.; Alexander, E.C.; Tipping, R.G.
2005-01-01
The Karst Feature Database (KFD) of Minnesota is a relational GIS-based Database Management System (DBMS). Previous karst feature datasets used inconsistent attributes to describe karst features in different areas of Minnesota. Existing metadata were modified and standardized to provide comprehensive metadata for all the karst features in Minnesota. Microsoft Access 2000 and ArcView 3.2 were used to develop this working database. Existing county and sub-county karst feature datasets have been assembled into the KFD, which is capable of visualizing and analyzing the entire data set. By November 17, 2002, 11,682 karst features were stored in the KFD of Minnesota. Data tables are stored in a Microsoft Access 2000 DBMS and linked to corresponding ArcView applications. The current KFD of Minnesota has been moved from a Windows NT server to a Windows 2000 Citrix server accessible to researchers and planners through networked interfaces. © Springer-Verlag 2005.
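The relational core of such a system can be sketched with a small table of karst features and a typical per-county query feeding a GIS layer. The column names and the two sample records below are invented for illustration; the actual KFD schema in Access is richer and standardized across counties.

```python
# Hypothetical relational sketch of a karst-feature table (columns and
# records are invented, not the KFD's actual schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE karst_feature (
        feature_id   INTEGER PRIMARY KEY,
        feature_type TEXT NOT NULL,      -- e.g. 'sinkhole', 'spring'
        county       TEXT NOT NULL,
        utm_easting  REAL,
        utm_northing REAL
    )
""")
conn.executemany(
    "INSERT INTO karst_feature VALUES (?, ?, ?, ?, ?)",
    [(1, "sinkhole", "Fillmore", 567100.0, 4840200.0),
     (2, "spring",   "Fillmore", 569300.0, 4838900.0)],
)
# A typical per-county summary query feeding a GIS layer:
rows = conn.execute(
    "SELECT feature_type, COUNT(*) FROM karst_feature "
    "WHERE county = ? GROUP BY feature_type ORDER BY feature_type",
    ("Fillmore",),
).fetchall()
print(rows)
```

Standardizing attributes into one table like this is what allows a single query, or a single linked ArcView layer, to cover every county at once.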
NASA Astrophysics Data System (ADS)
Kraft, Angelina; Sens, Irina; Löwe, Peter; Dreyer, Britta
2016-04-01
Globally resolvable, persistent digital identifiers have become an essential tool to enable unambiguous links between published research results and their underlying digital resources. In addition, this unambiguous identification allows citation. In an ideal research world, any scientific content should be citable and the coherent content, as well as the citation itself, should be persistent. However, today's scientists do not just produce traditional research papers - they produce comprehensive digital collections of objects which, alongside digital texts, include digital resources such as research data, audiovisual media, digital lab journals, images, statistics and software code. Researchers start to look for services which allow management of these digital resources with minimum time investment. In light of this, we show how the German National Library of Science and Technology (TIB) develops supportive frameworks to accompany the life cycle of scientific knowledge generation and transfer. This includes technical infrastructures for • indexing, cataloguing, digital preservation, DOI names and licencing for text and digital objects (the TIB DOI registration, active since 2004) and • a digital repository for the deposition and provision of accessible, traceable and citeable research data (RADAR). One particular problem for the management of data originating from (collaborating) research infrastructures is their dynamic nature in terms of growth, access rights and quality. On a global scale, systems for access and preservation are in place for the big data domains (e.g. environmental sciences, space, climate). However, the stewardship for disciplines without a tradition of data sharing, including the fields of the so-called long tail, remains uncertain. The RADAR - Research Data Repository - project establishes a generic end-point data repository, which can be used in a collaborative way. 
RADAR enables clients to upload, edit, structure and describe their (collaborative) data in an organizational workspace. In such a workspace, administrators and curators can manage access and editorial rights before the data enters the preservation and optional publication phase. RADAR applies different PID strategies for closed vs. open data. For closed datasets, RADAR uses handles as identifiers and offers format-independent data preservation between 5 and 15 years, which can also be prolonged. By default, preserved data are only available to the respective data curators, who may selectively grant other researchers access to preserved data. For open datasets, RADAR provides a Digital Object Identifier (DOI) to enable researchers to clearly reference and reuse data and to guarantee data accessibility. RADAR offers a publication service for research data together with format-independent data preservation for an unlimited time period. Each published dataset can be enriched with discipline-specific metadata, and an optional embargo period can be specified. With these two services, RADAR aims to meet demands from a broad range of specialized research disciplines: to provide secure, citable data storage for researchers who need to retain restricted access to data on the one hand, and an e-infrastructure which allows for research data to be stored, found, managed, annotated, cited, curated and published in a digital platform available 24/7, on the other.
Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community.
Krampis, Konstantinos; Booth, Tim; Chapman, Brad; Tiwari, Bela; Bicak, Mesude; Field, Dawn; Nelson, Karen E
2012-03-19
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure. Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance is also publicly available for download and use by researchers with access to private clouds. Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. 
An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community
2012-01-01
Background A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure. Results Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance is also publicly available for download and use by researchers with access to private clouds. Conclusions Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. 
An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them. PMID:22429538
Ram K. Deo; Matthew B. Russell; Grant M. Domke; Christopher W. Woodall; Michael J. Falkowski; Warren B. Cohen
2017-01-01
The publicly accessible archive of Landsat imagery and increasing regional-scale LiDAR acquisitions offer an opportunity to periodically estimate aboveground forest biomass (AGB) from 1990 to the present to align with the reporting needs of National Greenhouse Gas Inventories (NGHGIs). This study integrated Landsat time-series data, a state-wide LiDAR dataset, and a recent...
Federal Register 2010, 2011, 2012, 2013, 2014
2011-06-14
... Workshop. The product of the Data Workshop is a data report which compiles and evaluates potential datasets and recommends which datasets are appropriate for assessment analyses. The product of the Stock....m. Using datasets provided by the Data Workshop, participants will develop population models to...
DATS, the data tag suite to enable discoverability of datasets.
Sansone, Susanna-Assunta; Gonzalez-Beltran, Alejandra; Rocca-Serra, Philippe; Alter, George; Grethe, Jeffrey S; Xu, Hua; Fore, Ian M; Lyle, Jared; Gururaj, Anupama E; Chen, Xiaoling; Kim, Hyeon-Eui; Zong, Nansu; Li, Yueling; Liu, Ruiling; Ozyurt, I Burak; Ohno-Machado, Lucila
2017-06-06
Today's science increasingly requires effective ways to find and access existing datasets that are distributed across a range of repositories. For researchers in the life sciences, discoverability of datasets may soon become as essential as identifying the latest publications via PubMed. Through an international collaborative effort funded by the National Institutes of Health (NIH)'s Big Data to Knowledge (BD2K) initiative, we have designed and implemented the DAta Tag Suite (DATS) model to support the DataMed data discovery index. DataMed's goal is to be for data what PubMed has been for the scientific literature. Akin to the Journal Article Tag Suite (JATS) used in PubMed, the DATS model enables submission of metadata on datasets to DataMed. DATS has a core set of elements, which are generic and applicable to any type of dataset, and an extended set that can accommodate more specialized data types. DATS is a platform-independent model also available as an annotated serialization in schema.org, which in turn is widely used by major search engines like Google, Microsoft, Yahoo and Yandex.
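Dataset metadata of the kind DATS carries can also be serialized for search engines as a schema.org Dataset in JSON-LD, the route the abstract mentions. The sketch below describes an imaginary dataset with a few generic fields; DATS itself defines a richer core/extended element set than shown here.

```python
# Hedged sketch of dataset metadata as a schema.org Dataset in JSON-LD.
# All field values are invented for illustration.
import json

dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example clinical imaging dataset",
    "description": "An invented dataset used to illustrate the markup.",
    "identifier": "https://doi.org/10.1234/example",
    "creator": {"@type": "Organization", "name": "Example Lab"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["imaging", "example"],
}

jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```

Embedding such a block in a dataset landing page is what lets generic crawlers, and discovery indexes akin to DataMed, pick the record up without repository-specific integration.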
Scribl: an HTML5 Canvas-based graphics library for visualizing genomic data over the web.
Miller, Chase A; Anthony, Jon; Meyer, Michelle M; Marth, Gabor
2013-02-01
High-throughput biological research requires simultaneous visualization as well as analysis of genomic data, e.g. read alignments, variant calls and genomic annotations. Traditionally, such integrative analysis required desktop applications operating on locally stored data. Many current terabyte-size datasets generated by large public consortia projects, however, are already only feasibly stored at specialist genome analysis centers. As even small laboratories can afford very large datasets, local storage and analysis are becoming increasingly limiting, and it is likely that most such datasets will soon be stored remotely, e.g. in the cloud. These developments will require web-based tools that enable users to access, analyze and view vast remotely stored data with a level of sophistication and interactivity that approximates desktop applications. As rapidly dropping cost enables researchers to collect data intended to answer questions in very specialized contexts, developers must also provide software libraries that empower users to implement customized data analyses and data views for their particular application. Such specialized, yet lightweight, applications would empower scientists to better answer specific biological questions than possible with general-purpose genome browsers currently available. Using recent advances in core web technologies (HTML5), we developed Scribl, a flexible genomic visualization library specifically targeting coordinate-based data such as genomic features, DNA sequence and genetic variants. Scribl simplifies the development of sophisticated web-based graphical tools that approach the dynamism and interactivity of desktop applications. Software is freely available online at http://chmille4.github.com/Scribl/ and is implemented in JavaScript with all modern browsers supported.
NASA Technical Reports Server (NTRS)
Johnson, Jeffrey R.
2006-01-01
This viewgraph presentation reviews the problems that non-mission researchers have in accessing data to use in their analysis of Mars. The increasing complexity of Mars datasets results in custom software development by instrument teams that is often the only means to visualize and analyze the data. The proposed solutions are to continue efforts toward synergizing data from multiple missions and making the data, software, and derived products available in standardized, easily accessible formats; to encourage release of "lite" versions of mission-related software prior to end-of-mission; and to process planetary image data systematically, in a coordinated way, making it available in an easily accessed form. The recommendations of the Mars Environmental GIS Workshop are reviewed.
The Problem with Big Data: Operating on Smaller Datasets to Bridge the Implementation Gap.
Mann, Richard P; Mushtaq, Faisal; White, Alan D; Mata-Cervantes, Gabriel; Pike, Tom; Coker, Dalton; Murdoch, Stuart; Hiles, Tim; Smith, Clare; Berridge, David; Hinchliffe, Suzanne; Hall, Geoff; Smye, Stephen; Wilkie, Richard M; Lodge, J Peter A; Mon-Williams, Mark
2016-01-01
Big datasets have the potential to revolutionize public health. However, there is a mismatch between the political and scientific optimism surrounding big data and the public's perception of its benefit. We suggest a systematic and concerted emphasis on developing models derived from smaller datasets to illustrate to the public how big data can produce tangible benefits in the long term. In order to highlight the immediate value of a small data approach, we produced a proof-of-concept model predicting hospital length of stay. The results demonstrate that existing small datasets can be used to create models that generate a reasonable prediction, facilitating health-care delivery. We propose that greater attention (and funding) needs to be directed toward the utilization of existing information resources in parallel with current efforts to create and exploit "big data."
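A proof-of-concept in the spirit of the authors' "small data" argument can be sketched in a few lines (this is not their model or data): fit a one-variable least-squares line predicting hospital length of stay from patient age, using a handful of invented records.

```python
# Tiny "small data" model: ordinary least squares on invented records
# relating patient age to hospital length of stay (LOS), illustrating
# that a useful predictive model needs no big-data infrastructure.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

ages = [30, 40, 50, 60, 70]
los_days = [2.0, 3.0, 4.0, 5.0, 6.0]   # invented: LOS rises with age
a, b = fit_line(ages, los_days)
print(round(a + b * 55, 2))             # predicted LOS for a 55-year-old -> 4.5
```

Even a deliberately simple model like this can support planning (e.g. bed occupancy), which is the kind of tangible near-term benefit the authors argue should be demonstrated to the public.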
Howe, E.A.; de Souza, A.; Lahr, D.L.; Chatwin, S.; Montgomery, P.; Alexander, B.R.; Nguyen, D.-T.; Cruz, Y.; Stonich, D.A.; Walzer, G.; Rose, J.T.; Picard, S.C.; Liu, Z.; Rose, J.N.; Xiang, X.; Asiedu, J.; Durkin, D.; Levine, J.; Yang, J.J.; Schürer, S.C.; Braisted, J.C.; Southall, N.; Southern, M.R.; Chung, T.D.Y.; Brudz, S.; Tanega, C.; Schreiber, S.L.; Bittker, J.A.; Guha, R.; Clemons, P.A.
2015-01-01
BARD, the BioAssay Research Database (https://bard.nih.gov/) is a public database and suite of tools developed to provide access to bioassay data produced by the NIH Molecular Libraries Program (MLP). Data from 631 MLP projects were migrated to a new structured vocabulary designed to capture bioassay data in a formalized manner, with particular emphasis placed on the description of assay protocols. New data can be submitted to BARD with a user-friendly set of tools that assist in the creation of appropriately formatted datasets and assay definitions. Data published through the BARD application program interface (API) can be accessed by researchers using web-based query tools or a desktop client. Third-party developers wishing to create new tools can use the API to produce stand-alone tools or new plug-ins that can be integrated into BARD. The entire BARD suite of tools therefore supports three classes of researcher: those who wish to publish data, those who wish to mine data for testable hypotheses, and those in the developer community who wish to build tools that leverage this carefully curated chemical biology resource. PMID:25477388
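As a rough illustration of the kind of client code the BARD API enabled, the sketch below parses an assay listing out of a JSON payload, as a third-party tool built on the API might. The payload shape and field names ("assays", "id", "name") are invented for illustration and do not reproduce the actual BARD response schema.

```python
import json

# Hypothetical payload shaped like a REST assay listing; the field names
# are invented, not the real BARD schema.
sample_payload = json.dumps({
    "assays": [
        {"id": 1, "name": "Kinase inhibition"},
        {"id": 2, "name": "Cell viability"},
    ]
})

def summarize_assays(payload):
    """Parse an assay listing and return (count, assay names)."""
    data = json.loads(payload)
    assays = data.get("assays", [])
    return len(assays), [a["name"] for a in assays]

count, names = summarize_assays(sample_payload)
print(count, names)  # 2 ['Kinase inhibition', 'Cell viability']
```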
From Data to Knowledge: GEOSS experience and the GEOSS Knowledge Base contribution to the GCI
NASA Astrophysics Data System (ADS)
Santoro, M.; Nativi, S.; Mazzetti, P., Sr.; Plag, H. P.
2016-12-01
According to systems theory, data is raw: it simply exists and has no significance beyond its existence, while information is data that has been given meaning by way of relational connection. The appropriate collection of information, such that it contributes to understanding, is a process of knowledge creation. The Global Earth Observation System of Systems (GEOSS) developed by the Group on Earth Observations (GEO) is a set of coordinated, independent Earth observation, information and processing systems that interact and provide access to diverse information for a broad range of users in both public and private sectors. GEOSS links these systems to strengthen the monitoring of the state of the Earth. In the past ten years, the development of GEOSS has taught several lessons about the need to move from (open) data to information and knowledge sharing. Advanced user-focused services require a move from a data-driven framework to a knowledge-sharing platform. Such a platform needs to manage information and knowledge, in addition to the datasets linked to them. For this purpose, GEO has launched a specific task called the "GEOSS Knowledge Base", which deals with resources such as user requirements, Sustainable Development Goals (SDGs), observation and processing ontologies, publications, guidelines, best practices, business processes/algorithms, and definitions of advanced concepts like Essential Variables (EVs), indicators, and strategic goals. In turn, information and knowledge (e.g., guidelines, best practices, user requirements, business processes, algorithms) can be used to generate additional information and knowledge from shared datasets. To fully utilize and leverage the GEOSS Knowledge Base, the current GEOSS Common Infrastructure (GCI) model will be extended and advanced to consider important concepts and implementation artifacts, such as data processing services and environmental/economic models as well as EVs, Primary Indicators, and SDGs.
The new GCI model will link these concepts to the present dataset, observation and sensor concepts, enabling a set of very important new capabilities to be offered to GEOSS users.
The Great Lakes Water Balance: Data availability and annotated bibliography of selected references
Neff, Brian P.; Killian, Jason R.
2003-01-01
Water balance calculations for the Great Lakes have been made for several decades and are a key component of Great Lakes water management. Despite the importance of the water balance, little has been done to inventory and describe the data available for use in water balance calculations. This report provides a catalog and brief description of major datasets that are used to calculate the Great Lakes water balance. Several additional datasets are identified that could be used to calculate parts of the water balance but currently are not being used. Individual offices and web pages that are useful for obtaining these datasets are included. Four specific data gaps are also identified. An annotated bibliography of important publications dealing with the Great Lakes water balance is included. The findings of this investigation permit resource managers and scientists to access data more easily, assess shortcomings of current datasets, and identify which data are not currently being utilized in water balance calculations.
Dataset from chemical gas sensor array in turbulent wind tunnel.
Fonollosa, Jordi; Rodríguez-Luján, Irene; Trincavelli, Marco; Huerta, Ramón
2015-06-01
The dataset includes the acquired time series of a chemical detection platform exposed to different gas conditions in a turbulent wind tunnel. The chemo-sensory elements sampled the environment directly. In contrast to traditional approaches that include measurement chambers, open sampling systems are sensitive to the dispersion mechanisms of gaseous chemical analytes, namely diffusion, turbulence, and advection, making the identification and monitoring of chemical substances more challenging. The sensing platform included 72 metal-oxide gas sensors that were positioned at 6 different locations of the wind tunnel. At each location, 10 distinct chemical gases were released in the wind tunnel, the sensors were evaluated at 5 different operating temperatures, and 3 different wind speeds were generated in the wind tunnel to induce different levels of turbulence. Moreover, each configuration was repeated 20 times, yielding a dataset of 18,000 measurements. The dataset was collected over a period of 16 months. The data is related to "On the performance of gas sensor arrays in open sampling systems using Inhibitory Support Vector Machines", by Vergara et al. [1]. The dataset can be accessed publicly at the UCI repository upon citation of [1]: http://archive.ics.uci.edu/ml/datasets/Gas+sensor+arrays+in+open+sampling+settings.
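The factorial design described above can be sanity-checked with a short sketch; the integer ranges below are stand-in indices for the actual locations, gases, temperatures and wind speeds, not their real values.

```python
from itertools import product

# Stand-in indices for the experimental factors described above.
LOCATIONS = range(6)       # 6 positions in the wind tunnel
GASES = range(10)          # 10 distinct chemical gases
TEMPERATURES = range(5)    # 5 sensor operating temperatures
WIND_SPEEDS = range(3)     # 3 wind speeds / turbulence levels
REPETITIONS = range(20)    # 20 repetitions per configuration

configurations = list(product(LOCATIONS, GASES, TEMPERATURES, WIND_SPEEDS))
measurements = [cfg + (rep,) for cfg in configurations for rep in REPETITIONS]

print(len(configurations))  # 900 distinct configurations
print(len(measurements))    # 18000, matching the dataset size
```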
2011-01-01
Background A professional recognition mechanism is required to encourage expedited publishing of an adequate volume of 'fit-for-use' biodiversity data. As a component of such a recognition mechanism, we propose the development of the Data Usage Index (DUI) to demonstrate to data publishers that their efforts in creating biodiversity datasets have impact by being accessed and used by a wide spectrum of user communities. Discussion We propose and give examples of a range of 14 absolute and normalized biodiversity dataset usage indicators for the development of a DUI based on search events and dataset download instances. The DUI is proposed to include relative as well as species-profile-weighted comparative indicators. Conclusions We believe that, in addition to providing recognition to the data publisher and all players involved in the data life cycle, a DUI will also provide much-needed yet novel insight into how users use primary biodiversity data. A DUI consisting of a range of usage indicators obtained from the GBIF network and other relevant access points is within reach. The usage of biodiversity datasets leads to the development of a family of indicators in line with well-known citation-based measures of recognition. PMID:22373200
FACETS: using open data to measure community social determinants of health.
Cantor, Michael N; Chandras, Rajan; Pulgarin, Claudia
2018-04-01
To develop a dataset based on open data sources reflective of community-level social determinants of health (SDH). We created FACETS (Factors Affecting Communities and Enabling Targeted Services), an architecture that incorporates open data related to SDH into a single dataset mapped at the census-tract level for New York City. FACETS (https://github.com/mcantor2/FACETS) can be easily used to map individual addresses to their census-tract-level SDH. This dataset facilitates analysis across different determinants that are often not easily accessible. Wider access to open data from government agencies at the local, state, and national level would facilitate the aggregation and analysis of community-level determinants. Timeliness of updates to federal non-census data sources may limit their usefulness. FACETS is an important first step in standardizing and compiling SDH-related data in an open architecture that can give context to a patient's condition and enable better decision-making when developing a plan of care.
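A minimal sketch of the FACETS idea, joining an address to census-tract-level determinants; the tract identifiers, indicator names and values below are invented, and the geocoding step is stubbed out with a lookup table.

```python
# Invented census-tract-level SDH indicators, keyed by tract ID.
TRACT_SDH = {
    "36061018700": {"poverty_rate": 0.12, "park_access": 0.81},
    "36047052300": {"poverty_rate": 0.28, "park_access": 0.44},
}

# Output of a geocoding step, stubbed here with fictional addresses.
ADDRESS_TO_TRACT = {
    "100 Main St": "36061018700",
    "200 Ocean Ave": "36047052300",
}

def sdh_for_address(address):
    """Map an address to its census tract's SDH record (empty if unknown)."""
    tract = ADDRESS_TO_TRACT.get(address)
    return TRACT_SDH.get(tract, {})

print(sdh_for_address("200 Ocean Ave")["poverty_rate"])  # 0.28
```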
ChiLin: a comprehensive ChIP-seq and DNase-seq quality control and analysis pipeline.
Qin, Qian; Mei, Shenglin; Wu, Qiu; Sun, Hanfei; Li, Lewyn; Taing, Len; Chen, Sujun; Li, Fugen; Liu, Tao; Zang, Chongzhi; Xu, Han; Chen, Yiwen; Meyer, Clifford A; Zhang, Yong; Brown, Myles; Long, Henry W; Liu, X Shirley
2016-10-03
Transcription factor binding, histone modification, and chromatin accessibility studies are important approaches to understanding the biology of gene regulation. ChIP-seq and DNase-seq have become the standard techniques for studying protein-DNA interactions and chromatin accessibility, respectively, and comprehensive quality control (QC) and analysis tools are critical to extracting the most value from these assay types. Although many analysis and QC tools have been reported, few combine ChIP-seq and DNase-seq data analysis and quality control in a unified framework with a comprehensive and unbiased reference of data quality metrics. ChiLin is a computational pipeline that automates the quality control and data analyses of ChIP-seq and DNase-seq data. It is developed using a flexible and modular software framework that can be easily extended and modified. ChiLin is ideal for batch processing of many datasets and is well suited for large collaborative projects involving ChIP-seq and DNase-seq from different designs. ChiLin generates comprehensive quality control reports that include comparisons with historical data derived from over 23,677 public ChIP-seq and DNase-seq samples (11,265 datasets) from eight literature-based classified categories. To the best of our knowledge, this atlas represents the most comprehensive ChIP-seq and DNase-seq related quality metric resource currently available. These historical metrics provide useful heuristic quality references for experiments across all commonly used assay types. Using representative datasets, we demonstrate the versatility of the pipeline by applying it to different ChIP-seq assay types. The pipeline software is available open source at https://github.com/cfce/chilin. ChiLin is a scalable and powerful tool to process large batches of ChIP-seq and DNase-seq datasets. The analysis output and quality metrics have been structured into user-friendly directories and reports.
We have successfully compiled 23,677 profiles into a comprehensive quality atlas with fine classification for users.
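The historical-reference idea can be illustrated with a toy percentile ranking of a new sample's QC metric against a sorted historical distribution; the metric name (fraction of reads in peaks) and the values are invented and are not taken from the actual ChiLin atlas.

```python
from bisect import bisect_left

# Invented historical distribution of a QC metric (e.g., fraction of
# reads in peaks); the real ChiLin atlas aggregates thousands of samples.
historical_frip = sorted([0.01, 0.02, 0.05, 0.08, 0.12, 0.20, 0.35])

def percentile_rank(value, history):
    """Percent of historical samples strictly below `value`."""
    return 100.0 * bisect_left(history, value) / len(history)

print(round(percentile_rank(0.10, historical_frip), 1))  # 57.1
```

A new sample's metric can then be reported as "better than N% of historical samples", which is the kind of heuristic quality reference the report provides.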
OLS Client and OLS Dialog: Open Source Tools to Annotate Public Omics Datasets.
Perez-Riverol, Yasset; Ternent, Tobias; Koch, Maximilian; Barsnes, Harald; Vrousgou, Olga; Jupp, Simon; Vizcaíno, Juan Antonio
2017-10-01
The availability of user-friendly software to annotate biological datasets and experimental details is becoming essential in data management practices, both in local storage systems and in public databases. The Ontology Lookup Service (OLS, http://www.ebi.ac.uk/ols) is a popular centralized service to query, browse and navigate biomedical ontologies and controlled vocabularies. Recently, the OLS framework has been completely redeveloped (version 3.0), including enhancements in the data model, like the added support for Web Ontology Language based ontologies, among many other improvements. However, the new OLS is not backwards compatible and new software tools are needed to enable access to this widely used framework now that the previous version is no longer available. We here present the OLS Client as a free, open-source Java library to retrieve information from the new version of the OLS. It enables rapid tool creation by providing a robust, pluggable programming interface and common data model to programmatically access the OLS. The library has already been integrated and is routinely used by several bioinformatics resources and related data annotation tools. Secondly, we also introduce an updated version of the OLS Dialog (version 2.0), a Java graphical user interface that can be easily plugged into Java desktop applications to access the OLS. The software and related documentation are freely available at https://github.com/PRIDE-Utilities/ols-client and https://github.com/PRIDE-Toolsuite/ols-dialog. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
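As a sketch of the kind of request a client library might build, the snippet below constructs an OLS search URL. The base path and parameter names follow our reading of the public OLS web API and should be verified against current EBI documentation before being relied upon.

```python
from urllib.parse import urlencode

# Base path and parameter names follow our understanding of the public
# OLS web API; verify against current EBI documentation.
OLS_BASE = "https://www.ebi.ac.uk/ols/api"

def search_url(query, ontology=None):
    """Build an OLS term-search URL, optionally restricted to one ontology."""
    params = {"q": query}
    if ontology:
        params["ontology"] = ontology
    return "%s/search?%s" % (OLS_BASE, urlencode(params))

print(search_url("apoptosis", ontology="go"))
# https://www.ebi.ac.uk/ols/api/search?q=apoptosis&ontology=go
```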
Smart-DS: Synthetic Models for Advanced, Realistic Testing: Distribution Systems and Scenarios
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krishnan, Venkat K; Palmintier, Bryan S; Hodge, Brian S
The National Renewable Energy Laboratory (NREL) in collaboration with Massachusetts Institute of Technology (MIT), Universidad Pontificia Comillas (Comillas-IIT, Spain) and GE Grid Solutions, is working on an ARPA-E GRID DATA project, titled Smart-DS, to create: 1) High-quality, realistic, synthetic distribution network models, and 2) Advanced tools for automated scenario generation based on high-resolution weather data and generation growth projections. Through these advancements, the Smart-DS project is envisioned to accelerate the development, testing, and adoption of advanced algorithms, approaches, and technologies for sustainable and resilient electric power systems, especially in the realm of U.S. distribution systems. This talk will present the goals and overall approach of the Smart-DS project, including the process of creating the synthetic distribution datasets using reference network model (RNM) and the comprehensive validation process to ensure network realism, feasibility, and applicability to advanced use cases. The talk will provide demonstrations of early versions of synthetic models, along with the lessons learnt from expert engagements to enhance future iterations. Finally, the scenario generation framework, its development plans, and co-ordination with GRID DATA repository teams to house these datasets for public access will also be discussed.
Hur, Manhoi; Campbell, Alexis Ann; Almeida-de-Macedo, Marcia; Li, Ling; Ransom, Nick; Jose, Adarsh; Crispin, Matt; Nikolau, Basil J; Wurtele, Eve Syrkin
2013-04-01
Discovering molecular components and their functionality is key to the development of hypotheses concerning the organization and regulation of metabolic networks. The iterative experimental testing of such hypotheses is the trajectory that can ultimately enable accurate computational modelling and prediction of metabolic outcomes. This information can be particularly important for understanding the biology of natural products, whose metabolism itself is often only poorly defined. Here, we describe factors that must be in place to optimize the use of metabolomics in predictive biology. A key to achieving this vision is a collection of accurate time-resolved and spatially defined metabolite abundance data and associated metadata. One formidable challenge associated with metabolite profiling is the complexity and analytical limits associated with comprehensively determining the metabolome of an organism. Further, for metabolomics data to be efficiently used by the research community, it must be curated in publicly available metabolomics databases. Such databases require clear, consistent formats, easy access to data and metadata, data download, and accessible computational tools to integrate genome system-scale datasets. Although transcriptomics and proteomics integrate the linear predictive power of the genome, the metabolome represents the nonlinear, final biochemical products of the genome, which results from the intricate system(s) that regulate genome expression. For example, the relationship of metabolomics data to the metabolic network is confounded by redundant connections between metabolites and gene-products. However, connections among metabolites are predictable through the rules of chemistry. Therefore, enhancing the ability to integrate the metabolome with anchor-points in the transcriptome and proteome will enhance the predictive power of genomics data. 
We detail a public database repository for metabolomics, tools and approaches for statistical analysis of metabolomics data, and methods for integrating these datasets with transcriptomic data to create hypotheses concerning specialized metabolisms that generate the diversity in natural product chemistry. We discuss the importance of close collaborations among biologists, chemists, computer scientists and statisticians throughout the development of such integrated metabolism-centric databases and software.
Hur, Manhoi; Campbell, Alexis Ann; Almeida-de-Macedo, Marcia; Li, Ling; Ransom, Nick; Jose, Adarsh; Crispin, Matt; Nikolau, Basil J.
2013-01-01
Discovering molecular components and their functionality is key to the development of hypotheses concerning the organization and regulation of metabolic networks. The iterative experimental testing of such hypotheses is the trajectory that can ultimately enable accurate computational modelling and prediction of metabolic outcomes. This information can be particularly important for understanding the biology of natural products, whose metabolism itself is often only poorly defined. Here, we describe factors that must be in place to optimize the use of metabolomics in predictive biology. A key to achieving this vision is a collection of accurate time-resolved and spatially defined metabolite abundance data and associated metadata. One formidable challenge associated with metabolite profiling is the complexity and analytical limits associated with comprehensively determining the metabolome of an organism. Further, for metabolomics data to be efficiently used by the research community, it must be curated in publicly available metabolomics databases. Such databases require clear, consistent formats, easy access to data and metadata, data download, and accessible computational tools to integrate genome system-scale datasets. Although transcriptomics and proteomics integrate the linear predictive power of the genome, the metabolome represents the nonlinear, final biochemical products of the genome, which results from the intricate system(s) that regulate genome expression. For example, the relationship of metabolomics data to the metabolic network is confounded by redundant connections between metabolites and gene-products. However, connections among metabolites are predictable through the rules of chemistry. Therefore, enhancing the ability to integrate the metabolome with anchor-points in the transcriptome and proteome will enhance the predictive power of genomics data. 
We detail a public database repository for metabolomics, tools and approaches for statistical analysis of metabolomics data, and methods for integrating these datasets with transcriptomic data to create hypotheses concerning specialized metabolisms that generate the diversity in natural product chemistry. We discuss the importance of close collaborations among biologists, chemists, computer scientists and statisticians throughout the development of such integrated metabolism-centric databases and software. PMID:23447050
Retinal fundus images for glaucoma analysis: the RIGA dataset
NASA Astrophysics Data System (ADS)
Almazroa, Ahmed; Alodhayb, Sami; Osman, Essameldin; Ramadan, Eslam; Hummadi, Mohammed; Dlaim, Mohammed; Alkatee, Muhannad; Raahemifar, Kaamran; Lakshminarayanan, Vasudevan
2018-03-01
Glaucoma neuropathy is a major cause of irreversible blindness worldwide. Current models of chronic care will not be able to close the gap between the growing prevalence of glaucoma and the challenges of access to healthcare services. Teleophthalmology is being developed to close this gap. In order to develop automated techniques for glaucoma detection that can be used in teleophthalmology, we have developed a large retinal fundus dataset. A de-identified dataset of retinal fundus images for glaucoma analysis (RIGA) was derived from three sources for a total of 750 images. The optic cup and disc boundaries for each image were marked and annotated manually by six experienced ophthalmologists, and the annotations included the cup-to-disc ratio (CDR) estimates. Six parameters were extracted and assessed (the disc area and centroid, cup area and centroid, and horizontal and vertical cup-to-disc ratios) among the ophthalmologists. The inter-observer annotations were compared by calculating, for every image, the standard deviation (SD) across the six ophthalmologists in order to identify outliers among the six; the SD was also used to filter the corresponding images. The dataset will be made available to the research community in order to crowdsource further analyses from other research groups and to develop, validate and implement analysis algorithms appropriate for tele-glaucoma assessment. The RIGA dataset can be freely accessed online through the University of Michigan Deep Blue website (doi:10.7302/Z23R0R29).
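The inter-observer comparison can be sketched as follows: compute, per image, the standard deviation of a parameter across the six annotators and flag images with poor agreement. The CDR values and the threshold below are invented for illustration, not taken from the RIGA dataset.

```python
from statistics import pstdev

# Invented vertical cup-to-disc ratios marked by six ophthalmologists.
annotations = {
    "img_001": [0.42, 0.45, 0.44, 0.41, 0.43, 0.44],
    "img_002": [0.30, 0.65, 0.33, 0.31, 0.62, 0.34],  # poor agreement
}

def disagreement(annotations, threshold=0.05):
    """Return image IDs whose across-observer SD exceeds the threshold."""
    return [img for img, vals in annotations.items()
            if pstdev(vals) > threshold]

print(disagreement(annotations))  # ['img_002']
```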
NASA Astrophysics Data System (ADS)
Taylor, Faith E.; Malamud, Bruce D.; Millington, James D. A.
2016-04-01
Access to reliable spatial and quantitative datasets (e.g., infrastructure maps, historical observations, environmental variables) at regional and site-specific scales can be a limiting factor for understanding hazards and risks in developing country settings. Here we present a 'living database' of >75 freely available data sources relevant to hazard and risk in Africa (and more globally). Data sources include national scientific foundations, non-governmental bodies, crowd-sourced efforts, academic projects, special interest groups and others. The database is available at http://tinyurl.com/africa-datasets and is continually being updated, particularly in the context of broader natural hazards research we are doing in Malawi and Kenya. For each data source, we review the spatiotemporal resolution and extent and make our own assessments of the reliability and usability of datasets. Although such freely available datasets are sometimes presented as a panacea for improving our understanding of hazards and risk in developing countries, there are both pitfalls and opportunities unique to using data of this type. These include factors such as resolution, homogeneity, uncertainty, access to metadata and training for usage. Based on our experience, use in the field and the grey/peer-reviewed literature, we present a suggested set of guidelines for using these free and open source data in developing country contexts.
Secure Access Control and Large Scale Robust Representation for Online Multimedia Event Detection
Liu, Changyu; Li, Huiling
2014-01-01
We developed an online multimedia event detection (MED) system. However, integrating traditional event detection algorithms into the online environment raises two issues: secure access control and large-scale robust representation. For the first issue, we proposed a tree proxy-based and service-oriented access control (TPSAC) model based on the traditional role-based access control model. Verification experiments were conducted on the CloudSim simulation platform, and the results showed that the TPSAC model is suitable for the access control of dynamic online environments. For the second issue, inspired by the object-bank scene descriptor, we proposed a 1000-object-bank (1000OBK) event descriptor. Feature vectors of the 1000OBK were extracted from response pyramids of 1000 generic object detectors which were trained on standard annotated image datasets, such as the ImageNet dataset. A spatial bag-of-words tiling approach was then adopted to encode these feature vectors to bridge the gap between objects and events. Furthermore, we performed experiments in the context of event classification on the challenging TRECVID MED 2012 dataset, and the results showed that the robust 1000OBK event descriptor outperforms the state-of-the-art approaches. PMID:25147840
Open data used in water sciences - Review of access, licenses and understandability
NASA Astrophysics Data System (ADS)
Falkenroth, Esa; Lagerbäck Adolphi, Emma; Arheimer, Berit
2016-04-01
The amount of open data available for hydrology research is continually growing. In the EU-funded project SWITCH-ON (Sharing Water-related Information to Tackle Changes in the Hydrosphere - for Operational Needs: www.water-switch-on.eu), we are addressing water concerns by exploring and exploiting the untapped potential of these new open data. This work is enabled by many ongoing efforts to facilitate the use of open data. For instance, a number of portals provide the means to search for open datasets and open spatial data services (such as the GEOSS Portal, the INSPIRE community geoportal, and various Climate Services and public portals). However, in general, many research groups in the water sciences still hesitate to use this open data. We therefore examined some limiting factors. Factors that limit the usability of a dataset include: (1) accessibility, (2) understandability and (3) licences. In the SWITCH-ON project we have developed a search tool for finding and accessing data with relevance to water science in Europe, as existing tools do not specifically address data needs in the water sciences. The tool is filled with some 9000 sets of metadata, each linked to water-related keywords. The keywords are based on the ones developed within the CUAHSI community in the USA, but extended with non-hydrosphere topics and additional subclasses, and only keywords that actually have data are shown. Access to datasets: 78% of the data is directly accessible, while the rest is either available after registration and request, or through a web client for visualisation but without direct download. However, several datasets were found to be inaccessible due to server downtime, incorrect links or problems with the host database management system. One possible explanation is that many datasets were assembled by research projects that are no longer funded; hence, their server infrastructure is less well maintained than that of large-scale operational services.
Understandability of the datasets: 13 major formats were found, but the major issues encountered were incomplete documentation or metadata and problems with decoding binary formats. Ideally, open datasets should be represented in well-known formats and accompanied by sufficient documentation so that the dataset can be understood. The development efforts on WaterML, NetCDF and other standards could improve the understandability of datasets over time, but in this review only a few datasets were provided in these formats. Instead, the majority of datasets were stored in various text-based or binary formats, or even document-oriented formats such as PDF. Other disciplines such as meteorology have long-standing traditions of operational data exchange formats, whereas hydrology research is still quite fragmented and data exchange is usually done on a case-by-case basis. With the increased sharing of open data there is a good chance the situation will improve for datasets used in the water sciences as well. Licence issues: only 3% of the data is completely free to use, while 57% can be used for non-commercial purposes or research. A high number of datasets did not have a clear statement on terms of use and limitations on access. In most cases the provider could be contacted regarding licensing issues.
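A toy version of the licence audit described above might look like the following; the records and category labels are invented, but the function mirrors the kind of percentage breakdown reported in the review.

```python
from collections import Counter

# Invented dataset metadata records; None marks a missing licence statement.
records = [
    {"id": 1, "license": "free"},
    {"id": 2, "license": "non-commercial"},
    {"id": 3, "license": "non-commercial"},
    {"id": 4, "license": None},  # no clear statement on terms of use
    {"id": 5, "license": "free"},
]

def license_breakdown(records):
    """Return percentage of records per licence category."""
    counts = Counter(r["license"] or "unclear" for r in records)
    total = len(records)
    return {lic: round(100 * n / total, 1) for lic, n in counts.items()}

print(license_breakdown(records))
# {'free': 40.0, 'non-commercial': 40.0, 'unclear': 20.0}
```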
YummyData: providing high-quality open life science data
Yamaguchi, Atsuko; Splendiani, Andrea
2018-01-01
Many life science datasets are now available via Linked Data technologies, meaning that they are represented in a common format (the Resource Description Framework), and are accessible via standard APIs (SPARQL endpoints). While this is an important step toward developing an interoperable bioinformatics data landscape, it also creates a new set of obstacles, as it is often difficult for researchers to find the datasets they need. Different providers frequently offer the same datasets, with different levels of support: as well as having more or less up-to-date data, some providers add metadata to describe the content, structures, and ontologies of the stored datasets while others do not. We currently lack a place where researchers can go to easily assess datasets from different providers in terms of metrics such as service stability or metadata richness. We also lack a space for collecting feedback and improving data providers’ awareness of user needs. To address this issue, we have developed YummyData, which consists of two components. One periodically polls a curated list of SPARQL endpoints, monitoring the states of their Linked Data implementations and content. The other presents the information measured for the endpoints and provides a forum for discussion and feedback. YummyData is designed to improve the findability and reusability of life science datasets provided as Linked Data and to foster its adoption. It is freely accessible at http://yummydata.org/. Database URL: http://yummydata.org/ PMID:29688370
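The periodic-polling idea behind YummyData can be sketched as a simple uptime monitor over a list of SPARQL endpoints; the endpoint URLs and poll history below are invented, and the real YummyData metrics are considerably richer than this.

```python
# Invented poll history: True = endpoint answered, False = it did not.
polls = {
    "https://sparql.example.org/a": [True, True, True, True],
    "https://sparql.example.org/b": [True, False, False, True],
}

def uptime(history):
    """Fraction of successful polls."""
    return sum(history) / len(history)

def unstable(polls, minimum=0.9):
    """Endpoints whose uptime falls below the service-stability bar."""
    return [ep for ep, hist in polls.items() if uptime(hist) < minimum]

print(unstable(polls))  # ['https://sparql.example.org/b']
```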
ESASky: a new Astronomy Multi-Mission Interface
NASA Astrophysics Data System (ADS)
Baines, D.; Merin, B.; Salgado, J.; Giordano, F.; Sarmiento, M.; Lopez Marti, B.; Racero, E.; Gutierrez, R.; De Teodoro, P.; Nieto, S.
2016-06-01
ESA is working on a science-driven discovery portal for all its astronomy missions at ESAC called ESASky. The first public release of this service will be shown, featuring interfaces for sky exploration and for single and multiple targets. It requires no operational knowledge of any of the missions involved. A first public beta release took place in October 2015 and gives users world-wide simplified access to high-level science-ready data products from ESA Astronomy missions plus a number of ESA-produced source catalogues. XMM-Newton data, metadata and products were some of the first to be accessible through ESASky. In the next decade, ESASky aims to include not only ESA missions but also access to data from other space and ground-based astronomy missions and observatories. From a technical point of view, ESASky is a web application that offers all-sky projections of full mission datasets using a new-generation HEALPix projection called HiPS; detailed geometrical footprints to connect all-sky mosaics to individual observations; direct access to the underlying mission-specific science archives and catalogues. The poster will be accompanied by a demo booth at the conference.
NASA Astrophysics Data System (ADS)
Forte, M.; Hesser, T.; Knee, K.; Ingram, I.; Hathaway, K. K.; Brodie, K. L.; Spore, N.; Bird, A.; Fratantonio, R.; Dopsovic, R.; Keith, A.; Gadomski, K.
2016-02-01
The U.S. Army Engineer Research and Development Center's (USACE ERDC) Coastal and Hydraulics Laboratory (CHL) Coastal Observations and Analysis Branch (COAB) Measurements Program has a 35-year record of coastal observations. These datasets include oceanographic point source measurements, Real-Time Kinematic (RTK) GPS bathymetry surveys, and remote sensing data from both the Field Research Facility (FRF) in Duck, NC and from other project and experiment sites around the nation. The data has been used to support a variety of USACE mission areas, including coastal wave model development, beach and bar response, coastal project design, coastal storm surge, and other coastal hazard investigations. Furthermore, these data have been widely used by a number of federal and state agencies, academic institutions, and private industries in hundreds of scientific and engineering investigations, publications, conference presentations and model advancement studies. A limiting factor to the use of FRF data has been the lack of rapid, reliable access and of publicly available metadata for each data type. The addition of web tools, accessible data files, and well-documented metadata will open the door to much future collaboration. With the help of industry partner RPS ASA and the U.S. Army Corps of Engineers Mobile District Spatial Data Branch, a Data Integration Framework (DIF) was developed. The DIF represents a combination of processes, standards, people, and tools used to transform disconnected enterprise data into useful, easily accessible information for analysis and reporting. A front-end data portal connects the user to the framework, which integrates both oceanographic observation and geomorphology measurements using a combination of ESRI and open-source technology while providing a seamless data discovery, access, and analysis experience to the user. The user interface was built with ESRI's JavaScript API and all project metadata is managed using Geoportal.
The geomorphology data is made available through ArcGIS Server, while the oceanographic data sets have been formatted to netCDF4 and made available through a THREDDS server. Additional web tools run alongside the THREDDS server to provide rapid statistical calculations and plotting, allowing for user defined data access and visualization.
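The subset-request pattern behind a THREDDS/netCDF4 data portal like the one described above can be sketched in a few lines. The server URL, dataset path, and variable names below are hypothetical placeholders, not actual FRF endpoints:

```python
from urllib.parse import urlencode

def ncss_subset_url(base, dataset_path, variables, start, end, fmt="netcdf4"):
    """Build a THREDDS NetCDF Subset Service (NCSS) request URL for a
    time-windowed subset of the named variables."""
    query = urlencode({
        "var": ",".join(variables),
        "time_start": start,
        "time_end": end,
        "accept": fmt,
    })
    return f"{base}/ncss/{dataset_path}?{query}"

url = ncss_subset_url(
    "https://example.org/thredds",          # hypothetical server
    "frf/oceanography/waves/waverider.nc",  # hypothetical dataset path
    ["waveHs", "waveTp"],
    "2016-01-01T00:00:00Z",
    "2016-01-02T00:00:00Z",
)
print(url)
```

A user-defined time interval maps directly onto the `time_start`/`time_end` parameters, which is what makes this style of service convenient for pulling model forcing or validation windows.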
NASA Astrophysics Data System (ADS)
Jarboe, N.; Minnett, R.; Constable, C.; Koppers, A. A.; Tauxe, L.
2013-12-01
The Magnetics Information Consortium (MagIC) is dedicated to supporting the paleomagnetic, geomagnetic, and rock magnetic communities through the development and maintenance of an online database (http://earthref.org/MAGIC/), data upload and quality control, searches, data downloads, and visualization tools. While MagIC has completed importing some of the IAGA paleomagnetic databases (TRANS, PINT, PSVRL, GPMDB) and continues to import others (ARCHEO, MAGST and SECVR), further individual data uploading from the community contributes a wealth of easily-accessible rich datasets. Previously uploading of data to the MagIC database required the use of an Excel spreadsheet using either a Mac or PC. The new method of uploading data utilizes an HTML 5 web interface where the only computer requirement is a modern browser. This web interface will highlight all errors discovered in the dataset at once instead of the iterative error checking process found in the previous Excel spreadsheet data checker. As a web service, the community will always have easy access to the most up-to-date and bug free version of the data upload software. The filtering search mechanism of the MagIC database has been changed to a more intuitive system where the data from each contribution is displayed in tables similar to how the data is uploaded (http://earthref.org/MAGIC/search/). Searches themselves can be saved as a permanent URL, if desired. The saved search URL could then be used as a citation in a publication. When appropriate, plots (equal area, Zijderveld, ARAI, demagnetization, etc.) are associated with the data to give the user a quicker understanding of the underlying dataset. The MagIC database will continue to evolve to meet the needs of the paleomagnetic, geomagnetic, and rock magnetic communities.
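The all-at-once error reporting described above, in contrast to an iterative check-fix-recheck loop, can be sketched as a single validation pass that collects every problem before returning. The column names and rules here are illustrative, not the actual MagIC data model:

```python
def validate_rows(rows, required=("site", "lat", "lon")):
    """Scan every row and accumulate all errors, rather than stopping
    at the first one."""
    errors = []
    for i, row in enumerate(rows, start=1):
        for field in required:
            if field not in row or row[field] in ("", None):
                errors.append(f"row {i}: missing '{field}'")
        lat = row.get("lat")
        if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
            errors.append(f"row {i}: latitude {lat} out of range")
    return errors

errs = validate_rows([
    {"site": "A1", "lat": 95.0, "lon": 10.0},  # bad latitude
    {"site": "", "lat": 45.0, "lon": 10.0},    # missing site name
])
print(errs)
```

Reporting the full error list in one response is what lets a browser-based uploader highlight every problem in the dataset at once.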
BrainBrowser: distributed, web-based neurological data visualization.
Sherif, Tarek; Kassis, Nicolas; Rousseau, Marc-Étienne; Adalat, Reza; Evans, Alan C
2014-01-01
Recent years have seen massive, distributed datasets become the norm in neuroimaging research, and the methodologies used to analyze them have, in response, become more collaborative and exploratory. Tools and infrastructure are continuously being developed and deployed to facilitate research in this context: grid computation platforms to process the data, distributed data stores to house and share them, high-speed networks to move them around and collaborative, often web-based, platforms to provide access to and sometimes manage the entire system. BrainBrowser is a lightweight, high-performance JavaScript visualization library built to provide easy-to-use, powerful, on-demand visualization of remote datasets in this new research environment. BrainBrowser leverages modern web technologies, such as WebGL, HTML5 and Web Workers, to visualize 3D surface and volumetric neuroimaging data in any modern web browser without requiring any browser plugins. It is thus trivial to integrate BrainBrowser into any web-based platform. BrainBrowser is simple enough to produce a basic web-based visualization in a few lines of code, while at the same time being robust enough to create full-featured visualization applications. BrainBrowser can dynamically load the data required for a given visualization, so no network bandwidth needs to be wasted on data that will not be used. BrainBrowser's integration into the standardized web platform also allows users to consider using 3D data visualization in novel ways, such as for data distribution, data sharing and dynamic online publications. BrainBrowser is already being used in two major online platforms, CBRAIN and LORIS, and has been used to make the 1TB MACACC dataset openly accessible.
NASA Astrophysics Data System (ADS)
Smith, M. J.; Vardaro, M.; Crowley, M. F.; Glenn, S. M.; Schofield, O.; Belabbassi, L.; Garzio, L. M.; Knuth, F.; Fram, J. P.; Kerfoot, J.
2016-02-01
The Ocean Observatories Initiative (OOI), funded by the National Science Foundation, provides users with access to long-term datasets from a variety of oceanographic sensors. The Endurance Array in the Pacific Ocean consists of two separate lines off the coasts of Oregon and Washington. The Oregon line consists of 7 moorings, two cabled benthic experiment packages and 6 underwater gliders. The Washington line comprises 6 moorings and 6 gliders. Each mooring is outfitted with a variety of instrument packages. The raw data from these instruments are sent to shore via satellite communication and in some cases, via fiber optic cable. Raw data is then sent to the cyberinfrastructure (CI) group at Rutgers where it is aggregated, parsed into thousands of different data streams, and integrated into a software package called uFrame. The OOI CI delivers the data to the general public via a web interface that outputs data into commonly used scientific data file formats such as JSON, netCDF, and CSV. The Rutgers data management team has developed a series of command-line Python tools that streamline data acquisition in order to facilitate the QA/QC review process. The first step in the process is querying the uFrame database for a list of all available platforms. From this list, a user can choose a specific platform and automatically download all available datasets from the specified platform. The downloaded dataset is plotted using a generalized Python netcdf plotting routine that utilizes a data visualization toolbox called matplotlib. This routine loads each netCDF file separately and outputs plots by each available parameter. These Python tools have been uploaded to a Github repository that is openly available to help facilitate OOI data access and visualization.
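The first step described above, querying the database for available platforms and their data streams, might look like the following. The JSON structure and stream names are mocked for illustration; the real uFrame/OOI API differs:

```python
import json

# A mocked uFrame-style JSON response (structure is illustrative only).
response = json.loads("""
[
  {"platform": "CE01ISSM", "streams": ["ctdbp_cdef_instrument", "velpt_ab_dcl_instrument"]},
  {"platform": "CE02SHSM", "streams": ["metbk_a_dcl_instrument"]}
]
""")

def streams_for(platforms, name):
    """Return all stream names advertised for one platform."""
    for p in platforms:
        if p["platform"] == name:
            return p["streams"]
    return []

print(streams_for(response, "CE01ISSM"))
```

From a listing like this, a command-line tool can iterate over every stream of a chosen platform and request each one as netCDF for plotting and QA/QC review.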
BrainBrowser: distributed, web-based neurological data visualization
Sherif, Tarek; Kassis, Nicolas; Rousseau, Marc-Étienne; Adalat, Reza; Evans, Alan C.
2015-01-01
Recent years have seen massive, distributed datasets become the norm in neuroimaging research, and the methodologies used to analyze them have, in response, become more collaborative and exploratory. Tools and infrastructure are continuously being developed and deployed to facilitate research in this context: grid computation platforms to process the data, distributed data stores to house and share them, high-speed networks to move them around and collaborative, often web-based, platforms to provide access to and sometimes manage the entire system. BrainBrowser is a lightweight, high-performance JavaScript visualization library built to provide easy-to-use, powerful, on-demand visualization of remote datasets in this new research environment. BrainBrowser leverages modern web technologies, such as WebGL, HTML5 and Web Workers, to visualize 3D surface and volumetric neuroimaging data in any modern web browser without requiring any browser plugins. It is thus trivial to integrate BrainBrowser into any web-based platform. BrainBrowser is simple enough to produce a basic web-based visualization in a few lines of code, while at the same time being robust enough to create full-featured visualization applications. BrainBrowser can dynamically load the data required for a given visualization, so no network bandwidth needs to be wasted on data that will not be used. BrainBrowser's integration into the standardized web platform also allows users to consider using 3D data visualization in novel ways, such as for data distribution, data sharing and dynamic online publications. BrainBrowser is already being used in two major online platforms, CBRAIN and LORIS, and has been used to make the 1TB MACACC dataset openly accessible. PMID:25628562
Including all voices in international data-sharing governance.
Kaye, Jane; Terry, Sharon F; Juengst, Eric; Coy, Sarah; Harris, Jennifer R; Chalmers, Don; Dove, Edward S; Budin-Ljøsne, Isabelle; Adebamowo, Clement; Ogbe, Emilomo; Bezuidenhout, Louise; Morrison, Michael; Minion, Joel T; Murtagh, Madeleine J; Minari, Jusaku; Teare, Harriet; Isasi, Rosario; Kato, Kazuto; Rial-Sebbag, Emmanuelle; Marshall, Patricia; Koenig, Barbara; Cambon-Thomsen, Anne
2018-03-07
Governments, funding bodies, institutions, and publishers have developed a number of strategies to encourage researchers to facilitate access to datasets. The rationale behind this approach is that this will bring a number of benefits and enable advances in healthcare and medicine by allowing the maximum returns from the investment in research, as well as reducing waste and promoting transparency. As this approach gains momentum, these data-sharing practices have implications for many kinds of research as they become standard practice across the world. The governance frameworks that have been developed to support biomedical research are not well equipped to deal with the complexities of international data sharing. This system is nationally based and is dependent upon expert committees for oversight and compliance, which has often led to piecemeal decision-making. This system tends to perpetuate inequalities by obscuring the contributions and the important role of different data providers along the data stream, whether they be low- or middle-income country researchers, patients, research participants, groups, or communities. As research and data-sharing activities are largely publicly funded, there is a strong moral argument for including the people who provide the data in decision-making and to develop governance systems for their continued participation. We recommend that governance of science becomes more transparent, representative, and responsive to the voices of many constituencies by conducting public consultations about data-sharing addressing issues of access and use; including all data providers in decision-making about the use and sharing of data along the whole of the data stream; and using digital technologies to encourage accessibility, transparency, and accountability.
We anticipate that this approach could enhance the legitimacy of the research process, generate insights that may otherwise be overlooked or ignored, and help to bring valuable perspectives into the decision-making around international data sharing.
MetaboLights: An Open-Access Database Repository for Metabolomics Data.
Kale, Namrata S; Haug, Kenneth; Conesa, Pablo; Jayseelan, Kalaivani; Moreno, Pablo; Rocca-Serra, Philippe; Nainala, Venkata Chandrasekhar; Spicer, Rachel A; Williams, Mark; Li, Xuefei; Salek, Reza M; Griffin, Julian L; Steinbeck, Christoph
2016-03-24
MetaboLights is the first general purpose, open-access database repository for cross-platform and cross-species metabolomics research at the European Bioinformatics Institute (EMBL-EBI). Based upon the open-source ISA framework, MetaboLights provides Metabolomics Standard Initiative (MSI) compliant metadata and raw experimental data associated with metabolomics experiments. Users can upload their study datasets into the MetaboLights Repository. These studies are then automatically assigned a stable and unique identifier (e.g., MTBLS1) that can be used for publication reference. The MetaboLights Reference Layer associates metabolites with metabolomics studies in the archive and is extensively annotated with data fields such as structural and chemical information, NMR and MS spectra, target species, metabolic pathways, and reactions. The database is manually curated with no specific release schedules. MetaboLights is also recommended by journals for metabolomics data deposition. This unit provides a guide to using MetaboLights, downloading experimental data, and depositing metabolomics datasets using user-friendly submission tools. Copyright © 2016 John Wiley & Sons, Inc.
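Stable accessions of the form shown above (e.g., MTBLS1) are simple to validate programmatically before using them as publication references. The pattern below assumes "MTBLS" followed by digits with no leading zeros, which is an assumption inferred from the example rather than a documented rule:

```python
import re

# Assumed accession pattern: "MTBLS" + a positive integer (e.g. MTBLS1).
ACCESSION = re.compile(r"^MTBLS[1-9]\d*$")

def is_valid_accession(s):
    """Check whether a string looks like a MetaboLights study accession."""
    return bool(ACCESSION.match(s))

print(is_valid_accession("MTBLS1"))    # well-formed
print(is_valid_accession("mtbls1"))    # case-sensitive, rejected
```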
NASA Astrophysics Data System (ADS)
Wilkinson, Mark; Beven, Keith; Brewer, Paul; El-khatib, Yehia; Gemmell, Alastair; Haygarth, Phil; Mackay, Ellie; Macklin, Mark; Marshall, Keith; Quinn, Paul; Stutter, Marc; Thomas, Nicola; Vitolo, Claudia
2013-04-01
Today's world is dominated by a wide range of informatics tools that are readily available to a wide range of stakeholders. There is growing recognition that the appropriate involvement of local communities in land and water management decisions can result in multiple environmental, economic and social benefits. Therefore, local stakeholder groups are increasingly being asked to participate in decision making alongside policy makers, government agencies and scientists. As such, addressing flooding issues requires new ways of engaging with the catchment and its inhabitants at a local level. To support this, new tools and approaches are required. The growth of cloud based technologies offers new novel ways to facilitate this process of exchange of information in earth sciences. The Environmental Virtual Observatory Pilot project (EVOp) is a new initiative from the UK Natural Environment Research Council (NERC) designed to deliver proof of concept for new tools and approaches to support the challenges as outlined above (http://www.evo-uk.org/). The long term vision of the Environmental Virtual Observatory is to: • Make environmental data more visible and accessible to a wide range of potential users including public good applications; • Provide tools to facilitate the integrated analysis of data, greater access to added knowledge and expert analysis and visualisation of the results; • Develop new, added-value knowledge from public and private sector data assets to help tackle environmental challenges. As part of the EVO pilot, an interactive cloud based tool has been developed with local stakeholders. The Local Landscape Visualisation Tool attempts to communicate flood risk in local impacted communities. The tool has been developed iteratively to reflect the needs, interests and capabilities of a wide range of stakeholders. 
This tool (accessible via a web portal) combines numerous cloud based tools and services, local catchment datasets, hydrological models and novel visualisation techniques. This pilot tool has been developed by engaging with different stakeholder groups in three catchments in the UK; the Afon Dyfi (Wales), the River Tarland (Scotland) and the River Eden (England). Stakeholders were interested in accessing live data in their catchments and looking at different land use change scenarios on flood peaks. Visualisation tools have been created which offer access to real time data (such as river level, rainfall and webcam images). Other tools allow land owners to use cloud based models (example presented here uses Topmodel, a rainfall-runoff model, on a custom virtual machine image on Amazon web services) and local datasets to explore future land use scenarios, allowing them to understand the associated flood risk. Different ways to communicate model uncertainty are currently being investigated and discussed with stakeholders. In summary the pilot project has had positive feedback and has evolved into two unique parts; a web based map tool and a model interface tool. Users can view live data from different sources, combine different data types together (data mash-up), develop local scenarios for land use and flood risk and exploit the dynamic, elastic cloud modelling capability. This local toolkit will reside within a wider EVO platform that will include national and global datasets, models and state of the art cloud computer systems.

Evans-Lacko, S; Brohan, E; Mojtabai, R; Thornicroft, G
2012-08-01
Little is known about how the views of the public are related to self-stigma among people with mental health problems. Despite increasing activity aimed at reducing mental illness stigma, there is little evidence to guide and inform specific anti-stigma campaign development and messages to be used in mass campaigns. A better understanding of the association between public knowledge, attitudes and behaviours and the internalization of stigma among people with mental health problems is needed. This study links two large, international datasets to explore the association between public stigma in 14 European countries (Eurobarometer survey) and individual reports of self-stigma, perceived discrimination and empowerment among persons with mental illness (n=1835) residing in those countries [the Global Alliance of Mental Illness Advocacy Networks (GAMIAN) study]. Individuals with mental illness living in countries with less stigmatizing attitudes, higher rates of help-seeking and treatment utilization and better perceived access to information had lower rates of self-stigma and perceived discrimination and those living in countries where the public felt more comfortable talking to people with mental illness had less self-stigma and felt more empowered. Targeting the general public through mass anti-stigma interventions may lead to a virtuous cycle by disrupting the negative feedback engendered by public stigma, thereby reducing self-stigma among people with mental health problems. A combined approach involving knowledge, attitudes and behaviour is needed; mass interventions that facilitate disclosure and positive social contact may be the most effective. Improving availability of information about mental health issues and facilitating access to care and help-seeking also show promise with regard to stigma.
Genetic and Diagnostic Biomarker Development in ASD Toddlers Using Resting State Functional MRI
2015-09-01
for public release; distribution unlimited. Autism spectrum disorder (ASD); biomarker; early brain development; intrinsic functional brain networks ... three large neuroimaging/neurobehavioral datasets to identify brain-imaging based biomarkers for Autism Spectrum Disorders (ASD). At Yale, we focus ... neurobehavioral datasets in order to identify brain-imaging based biomarkers for Autism Spectrum Disorders (ASD), including 1) BrainMap, developed and
A Free and Open Source Web-based Data Catalog Evaluation Tool
NASA Astrophysics Data System (ADS)
O'Brien, K.; Schweitzer, R.; Burger, E. F.
2015-12-01
For many years, the Unified Access Framework (UAF) project has worked to provide improved access to scientific data by leveraging widely used data standards and conventions. These standards include the Climate and Forecast (CF) metadata conventions, the Data Access Protocol (DAP) and various Open Geospatial Consortium (OGC) standards such as WMS and WCS. The UAF has also worked to create a unified access point for scientific data access through THREDDS and ERDDAP catalogs. A significant effort was made by the UAF project to build a catalog-crawling tool that was designed to crawl remote catalogs, analyze their content and then build a clean catalog that 1) represented only CF compliant data; 2) provided a uniform set of access services and 3) where possible, aggregated data in time. That catalog is available at http://ferret.pmel.noaa.gov/geoide/geoIDECleanCatalog.html. Although this tool has proved immensely valuable in allowing the UAF project to create a high quality data catalog, the need for a catalog evaluation service or tool to operate on a more local level also exists. Many programs that generate data of interest to the public are recognizing the utility and power of using the THREDDS data server (TDS) to serve that data. However, for some groups that lack the resources to maintain dedicated IT personnel, it can be difficult to set up a properly configured TDS. The TDS catalog evaluating service that is under development and will be discussed in this presentation is an effort, through the UAF project, to bridge that gap. Based upon the power of the original UAF catalog cleaner, the web evaluator will have the ability to scan and crawl a local TDS catalog, evaluate the contents for compliance with CF standards, analyze the services offered, and identify datasets where possible temporal aggregation would benefit data access.
The results of the catalog evaluator will guide the configuration of the dataset in TDS to ensure that it meets the standards as promoted by the UAF framework.
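The first stage of a catalog crawler like the one described above is simply parsing a THREDDS catalog document and enumerating its datasets. The snippet below uses a minimal hand-written catalog with made-up dataset paths, not a real UAF or PMEL catalog:

```python
import xml.etree.ElementTree as ET

# A minimal THREDDS InvCatalog snippet (names and paths are invented).
CATALOG = """<?xml version="1.0"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="SST analysis" urlPath="sst/analysis.nc"/>
  <dataset name="Wave model" urlPath="waves/model.nc"/>
</catalog>"""

NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

def list_datasets(xml_text):
    """Return (name, urlPath) pairs for every dataset in the catalog."""
    root = ET.fromstring(xml_text)
    return [(d.get("name"), d.get("urlPath"))
            for d in root.findall(".//t:dataset", NS)]

print(list_datasets(CATALOG))
```

A real evaluator would then fetch each dataset's metadata and check attribute conventions against CF before deciding whether to include or aggregate it.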
NASA Astrophysics Data System (ADS)
Dyer, T.; Brodie, K. L.; Spore, N.
2016-02-01
Modern LIDAR systems, while capable of providing highly accurate and dense datasets, introduce significant challenges in data processing and end-user accessibility. At the United States Army Corps of Engineers Field Research Facility in Duck, North Carolina, we have developed a stationary LIDAR tower for the continuous monitoring of the ocean, beach, and foredune, as well as an automated workflow capable of providing scientific data products from the LIDAR scanner in near real-time through an online data portal. The LIDAR performs hourly scans, taking approximately 50 minutes to complete and producing datasets on the order of 1GB. Processing of the LIDAR data includes coordinate transformations, data rectification and coregistration, filtering to remove noise and unwanted objects, gridding, and time-series analysis to generate products for use by end-users. Examples of these products include water levels and significant wave heights, virtual wave gauge time-series and FFTs, wave runup, foreshore elevations and slopes, and bare earth DEMs. Immediately after processing, data products are combined with ISO compliant metadata and stored using the NetCDF-4 file format, making them easily discoverable through a web portal which provides an interactive map that allows users to explore datasets both spatially and temporally. End-users can download datasets in user-defined time intervals, which can be used, for example, as forcing or validation parameters in numerical models. Funded by the USACE Coastal Ocean Data Systems Program.
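One of the products mentioned above, significant wave height from a surface-elevation time series, can be estimated with the standard textbook relation Hm0 ≈ 4σ, where σ is the standard deviation of the de-meaned elevation. This is a generic sketch, not the FRF's actual processing chain:

```python
import math
from statistics import pstdev

def significant_wave_height(eta):
    """Spectral estimate Hm0 ~ 4 * std(surface elevation),
    assuming eta is an evenly sampled, de-meaned elevation record."""
    return 4.0 * pstdev(eta)

# Synthetic 1 m-amplitude sinusoid sampled over one full period:
n = 1000
eta = [math.sin(2 * math.pi * i / n) for i in range(n)]
hs = significant_wave_height(eta)
print(round(hs, 2))  # a 1 m sinusoid gives Hm0 of about 2.83 m
```

For the sinusoid the standard deviation is 1/sqrt(2), so the estimate lands at 4/sqrt(2) ≈ 2.83 m, a quick sanity check for the routine.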
The BLUEPRINT Data Analysis Portal.
Fernández, José María; de la Torre, Victor; Richardson, David; Royo, Romina; Puiggròs, Montserrat; Moncunill, Valentí; Fragkogianni, Stamatina; Clarke, Laura; Flicek, Paul; Rico, Daniel; Torrents, David; Carrillo de Santa Pau, Enrique; Valencia, Alfonso
2016-11-23
The impact of large and complex epigenomic datasets on biological insights or clinical applications is limited by the lack of accessibility by easy, intuitive, and fast tools. Here, we describe an epigenomics comparative cyber-infrastructure (EPICO), an open-access reference set of libraries to develop comparative epigenomic data portals. Using EPICO, large epigenome projects can make available their rich datasets to the community without requiring specific technical skills. As a first instance of EPICO, we implemented the BLUEPRINT Data Analysis Portal (BDAP). BDAP provides a desktop for the comparative analysis of epigenomes of hematopoietic cell types based on results, such as the position of epigenetic features, from basic analysis pipelines. The BDAP interface facilitates interactive exploration of genomic regions, genes, and pathways in the context of differentiation of hematopoietic lineages. This work represents initial steps toward broadly accessible integrative analysis of epigenomic data across international consortia. EPICO can be accessed at https://github.com/inab, and BDAP can be accessed at http://blueprint-data.bsc.es. Copyright © 2016 Elsevier Inc. All rights reserved.
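At the core of the interactive region exploration described above is an interval-overlap query: which epigenetic features fall in the genomic window the user is viewing. The sketch below is generic, with invented coordinates, and is not the EPICO/BDAP API:

```python
# Toy feature table: (feature name, chromosome, start, end).
features = [
    ("H3K4me3_peak", "chr1", 1000, 1800),
    ("H3K27ac_peak", "chr1", 2500, 3200),
    ("H3K4me3_peak", "chr2", 500, 900),
]

def features_in_region(features, chrom, start, end):
    """Return names of features overlapping the query interval."""
    return [name for name, c, s, e in features
            if c == chrom and s < end and e > start]

print(features_in_region(features, "chr1", 1500, 2600))
```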
The igmspec database of public spectra probing the intergalactic medium
NASA Astrophysics Data System (ADS)
Prochaska, J. X.
2017-04-01
We describe v02 of igmspec, a database of publicly available ultraviolet, optical, and near-infrared spectra that probe the intergalactic medium (IGM). This database, a child of the specdb repository in the specdb github organization, comprises 403 277 unique sources and 434 686 spectra obtained with the world's greatest observatories. All of these data are distributed in a single ≈ 25GB HDF5 file maintained at the University of California Observatories and the University of California, Santa Cruz. The specdb software package includes Python scripts and modules for searching the source catalog and spectral datasets, and software links to the linetools package for spectral analysis. The repository also includes software to generate private spectral datasets that are compliant with International Virtual Observatory Alliance (IVOA) protocols and a Python-based interface for IVOA Simple Spectral Access queries. Future versions of igmspec will ingest other sources (e.g. gamma-ray burst afterglows) and other surveys as they become publicly available. The overall goal is to include every spectrum that effectively probes the IGM. Future databases of specdb may include publicly available galaxy spectra (exgalspec) and published supernovae spectra (snspec). The community is encouraged to join the effort on github: https://github.com/specdb.
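Searching a source catalog like igmspec's typically reduces to a cone search: return all sources within some angular radius of a sky position. The following is an illustration of that idea with invented coordinates, not the actual specdb query interface:

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(d1) * math.sin(d2)
               + math.cos(d1) * math.cos(d2) * math.cos(r1 - r2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sep))))

# Toy source catalog: (name, RA, Dec) in degrees, invented values.
catalog = [("J0001", 10.0, 2.0), ("J0002", 10.1, 2.05), ("J0003", 180.0, -30.0)]

def cone_search(catalog, ra, dec, radius_deg):
    return [name for name, r, d in catalog
            if angular_sep_deg(ra, dec, r, d) <= radius_deg]

print(cone_search(catalog, 10.0, 2.0, 0.2))
```

The same separation formula underlies IVOA Simple Spectral Access positional queries, just applied server-side over the full HDF5-backed catalog.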
NASA Astrophysics Data System (ADS)
Stall, S.
2017-12-01
Integrity and transparency within research is solidified by a complete set of research products that are findable, accessible, interoperable, and reusable. In other words, they follow the FAIR Guidelines developed by FORCE11.org. Your datasets, images, video, software, scripts, models, physical samples, and other tools and technology are an integral part of the narrative you tell about your research. These research products increasingly are being captured through workflow tools and preserved and connected through persistent identifiers across multiple repositories that keep them safe. They help secure, with your publications, the supporting evidence and integrity of the scientific record. This is the direction that Earth and space science as well as other disciplines is moving. Within our community, some science domains are further along, and others are taking more measured steps. AGU as a publisher is working to support the full scientific record with peer reviewed publications. Working with our community and all the Earth and space science journals, AGU is developing new policies to encourage researchers to plan for proper data preservation and provide data citations along with their research submission and to encourage adoption of best practices throughout the research workflow and data life cycle. Providing incentives, community standards, and easy-to-use tools are some important factors for helping researchers embrace the FAIR Guidelines and support transparency and integrity.
Srivastava, Divya; McGuire, Alistair
2014-07-30
Access to medicines is an important health policy issue. This paper considers demand structures in a selection of low-income countries from the perspective of public authorities as the evidence base is limited. Analysis of the demand for medicines in low-income countries is critical for effective pharmaceutical policy where regulation is less developed, health systems are cash-constrained and medicines are not typically subsidised by a public health insurance system. This study analyses the demand for medicines in low-income countries from the perspective of the prices paid by public authorities. The analysis draws on a unique dataset from World Health Organization (WHO) and Health Action International (HAI) using 2003 data on procurement prices of medicines across 16 low-income countries covering 48 branded drugs and 18 therapeutic categories. Variation in prices, the mark-ups over marginal costs and estimation of price elasticities allows assessment of whether these elasticities are correlated with a country's national income. Using the Ramsey pricing rule, the study's findings suggest that substantial cross-country variation in prices and mark-ups exist, with price elasticities ranging from -1 to -2, which are weakly correlated with national income. Government demand for medicines thus appears to be price elastic, raising important policy implications aimed at improving access to medicines for patients in low-income countries.
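A price elasticity in the -1 to -2 range, as reported above, can be illustrated with the arc (midpoint) formula. The price and quantity figures below are invented for the example, not taken from the WHO/HAI dataset:

```python
def arc_elasticity(q1, q2, p1, p2):
    """Arc (midpoint) price elasticity of demand: percentage change in
    quantity divided by percentage change in price, both measured
    against the midpoint of the two observations."""
    dq = (q2 - q1) / ((q1 + q2) / 2)
    dp = (p2 - p1) / ((p1 + p2) / 2)
    return dq / dp

# Price rises 1.00 -> 1.20 and procured quantity falls 100 -> 75:
e = arc_elasticity(100, 75, 1.00, 1.20)
print(round(e, 2))  # about -1.57, i.e. elastic demand
```

A value below -1 in magnitude means a given percentage price increase produces a larger percentage fall in quantity demanded, consistent with the paper's finding that government demand is price elastic.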
The need for data standards in zoomorphology.
Vogt, Lars; Nickel, Michael; Jenner, Ronald A; Deans, Andrew R
2013-07-01
eScience is a new approach to research that focuses on data mining and exploration rather than data generation or simulation. This new approach is arguably a driving force for scientific progress and requires data to be openly available, easily accessible via the Internet, and compatible with each other. eScience relies on modern standards for the reporting and documentation of data and metadata. Here, we suggest necessary components (i.e., content, concept, nomenclature, format) of such standards in the context of zoomorphology. We document the need for using data repositories to prevent data loss and how publication practice is currently changing, with the emergence of dynamic publications and the publication of digital datasets. Subsequently, we demonstrate that in zoomorphology the scientific record is still limited to published literature and that zoomorphological data are usually not accessible through data repositories. The underlying problem is that zoomorphology lacks the standards for data and metadata. As a consequence, zoomorphology cannot participate in eScience. We argue that the standardization of morphological data requires i) a standardized framework for terminologies for anatomy and ii) a formalized method of description that allows computer-parsable morphological data to be communicable, compatible, and comparable. The role of controlled vocabularies (e.g., ontologies) for developing respective terminologies and methods of description is discussed, especially in the context of data annotation and semantic enhancement of publications. Finally, we introduce the International Consortium for Zoomorphology Standards, a working group that is open to everyone and whose aim is to stimulate and synthesize dialog about standards. It is the Consortium's ultimate goal to assist the zoomorphology community in developing modern data and metadata standards, including anatomy ontologies, thereby facilitating the participation of zoomorphology in eScience. 
Copyright © 2013 Wiley Periodicals, Inc.
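The formalized, computer-parsable description advocated above might take the shape of structured statements that reference ontology terms instead of free text. The term identifiers below are placeholders invented for illustration, not real identifiers from any published anatomy ontology:

```python
import json

# Sketch of a machine-readable morphological statement: each assertion
# points at ontology terms rather than prose (IDs are placeholders).
description = {
    "taxon": "Aplysina aerophoba",
    "statements": [
        {"entity": "ZMO:0000123",    # hypothetical ID for an anatomical entity
         "quality": "PATO:0000000",  # placeholder quality term
         "value": "absent"}
    ],
}

# Round-trip through JSON: the description stays communicable and comparable.
text = json.dumps(description, indent=2)
parsed = json.loads(text)
print(parsed["statements"][0]["value"])
```

Because every statement is keyed by stable term IDs, descriptions from different authors can be compared and queried programmatically, which is exactly what free-text descriptions prevent.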
NASA Astrophysics Data System (ADS)
Asch, Kristine; Tellez-Arenas, Agnes
2010-05-01
OneGeology-Europe is making geological spatial data held by the geological surveys of Europe more easily discoverable and accessible via the internet. This will provide a fundamental scientific layer to the European Plate Observation System. Rich geological data assets exist in the geological survey of each individual EC Member State, but they are difficult to discover and are not interoperable. For those outside the geological surveys they are not easy to obtain, to understand or to use. Geological spatial data is essential to the prediction and mitigation of landslides, subsidence, earthquakes, flooding and pollution. These issues are global in nature and their profile has also been raised by the OneGeology global initiative for the International Year of Planet Earth 2008. Geology is also a key dataset in the EC INSPIRE Directive, where it is also fundamental to the themes of natural risk zones, energy and mineral resources. The OneGeology-Europe project is delivering a web-accessible, interoperable geological spatial dataset for the whole of Europe at the 1:1 million scale based on existing data held by the European geological surveys. Proof of concept will be applied to key areas at a higher resolution and some geological surveys will deliver their data at high resolution. An important role is developing a European specification for basic geological map data and making significant progress towards harmonising the dataset (an essential first step to addressing harmonisation at higher data resolutions). It is accelerating the development and deployment of a nascent international interchange standard for geological data - GeoSciML, which will enable the sharing and exchange of the data within and beyond the geological community within Europe and globally. The geological dataset for the whole of Europe is not a centralized database but a distributed system.
Each geological survey implements and hosts an interoperable web service, delivering their national harmonized geological data. These datasets are registered in a multilingual catalogue, which is one of the main parts of this system. This catalogue and a common metadata profile allow the discovery of national geological and applied geological maps at all scales. Such an architecture facilitates re-use and the addition of value by a wide spectrum of users in the public and private sectors, and supports identifying, documenting and disseminating strategies for the reduction of technical and business barriers to re-use. By identifying and raising awareness in the user and provider communities, it is moving geological knowledge closer to the end-user, where it will have greater societal impact and ensure fuller exploitation of a key data resource gathered at huge public expense. The project is providing examples of best practice in the delivery of digital geological spatial data to users, e.g. in the insurance, property, engineering, planning, mineral resource and environmental sectors. The scientifically attributed map data of the project will provide a pan-European base for science research and, importantly, a prime geoscience dataset capable of integration with other data sets within and beyond the geoscience domain. This presentation will demonstrate the first results of this project and will indicate how OneGeology-Europe is ensuring that Europe may play a leading role in the development of a geoscience spatial data infrastructure (SDI) globally.
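The distributed architecture described above rests on standard OGC web services: each survey exposes its harmonized data through an interoperable map service. A minimal sketch of building an OGC WMS GetMap request URL in Python follows; the endpoint and layer name are illustrative placeholders, not actual OneGeology-Europe service addresses.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for a national geological survey's WMS service;
# real service URLs are discovered through the project's multilingual catalogue.
BASE_URL = "https://example-survey.eu/geology/wms"

def build_getmap_url(layer, bbox, width=800, height=600):
    """Build an OGC WMS 1.3.0 GetMap request for a geological map layer."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return BASE_URL + "?" + urlencode(params)

# Illustrative layer name and bounding box (lat/lon order under EPSG:4326).
url = build_getmap_url("GE.GeologicUnit", (47.0, 5.0, 55.0, 15.0))
print(url)
```

A client would issue this URL with any HTTP library and receive a rendered map tile, which is what makes the national services composable into one pan-European view.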
MATISSE: A novel tool to access, visualize and analyse data from planetary exploration missions
NASA Astrophysics Data System (ADS)
Zinzi, A.; Capria, M. T.; Palomba, E.; Giommi, P.; Antonelli, L. A.
2016-04-01
The increasing number and complexity of planetary exploration space missions require new tools to access, visualize and analyse data to improve their scientific return. ASI Science Data Center (ASDC) addresses this request with the web tool MATISSE (Multi-purpose Advanced Tool for the Instruments of the Solar System Exploration), which allows the visualization of single observations, or of high-order products computed in real time, projected directly onto a three-dimensional model of the selected target body. With MATISSE it is no longer necessary to download huge quantities of data or to write specific code for every instrument analysed, which greatly encourages studies based on the joint analysis of different datasets. In addition, the extremely high-resolution output, which can be used offline with Python-based free software, together with files readable by specific GIS software, makes it a valuable tool for further processing the data at the best spatial accuracy available. MATISSE's modular structure permits the addition of new missions or tasks and, through dedicated future developments, it could be made compliant with the Planetary Virtual Observatory standards currently under definition. In this context, an interface to the NASA ODE REST API has recently been developed, through which public repositories can be accessed.
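The abstract mentions an interface to the NASA ODE REST API. As a hedged illustration of the kind of parameterized REST query such an interface might issue, the sketch below only builds a query URL; the parameter names are assumptions for illustration and should not be taken as the actual ODE schema.

```python
from urllib.parse import urlencode

# Illustrative sketch of constructing a REST query to a planetary data
# repository. The parameter names below are assumed for the example and
# do not document the real ODE REST API.
ODE_BASE = "https://oderest.rsl.wustl.edu/live2/"

def build_query(target, instrument_host, result_format="JSON"):
    params = {
        "query": "product",          # assumed: ask for data products
        "target": target,            # e.g. a planetary body
        "ihid": instrument_host,     # assumed: instrument host identifier
        "output": result_format,
    }
    return ODE_BASE + "?" + urlencode(params)

q = build_query("mars", "MRO")
print(q)
```

A tool like MATISSE would issue such requests server-side, so the user never handles raw repository downloads directly.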
NASA Technical Reports Server (NTRS)
Al-Hamdan, Mohammad; Crosson, William; Economou, Sigrid; Estes, Maurice, Jr.; Estes, Sue; Hemmings, Sarah; Kent, Shia; Puckett, Mark; Quattrochi, Dale; Wade, Gina;
2012-01-01
The overall goal of this study is to address issues of environmental health and enhance public health decision making by utilizing NASA remotely-sensed data and products. This study is a collaboration between NASA Marshall Space Flight Center, Universities Space Research Association (USRA), the University of Alabama at Birmingham (UAB) School of Public Health and the Centers for Disease Control and Prevention (CDC) National Center for Public Health Informatics. The objectives of this study are to develop high-quality spatial data sets of environmental variables, link these with public health data from a national cohort study, and deliver the linked data sets and associated analyses to local, state and federal end-user groups. Three daily environmental data sets were developed for the conterminous U.S. on different spatial resolutions for the period 2003-2008: (1) spatial surfaces of estimated fine particulate matter (PM2.5) exposures on a 10-km grid utilizing the US Environmental Protection Agency (EPA) ground observations and NASA's MODerate-resolution Imaging Spectroradiometer (MODIS) data; (2) a 1-km grid of Land Surface Temperature (LST) using MODIS data; and (3) a 12-km grid of daily Solar Insolation (SI) and maximum and minimum air temperature using the North American Land Data Assimilation System (NLDAS) forcing data. These environmental datasets were linked with public health data from the UAB REasons for Geographic and Racial Differences in Stroke (REGARDS) national cohort study to determine whether exposures to these environmental risk factors are related to cognitive decline and other health outcomes. These environmental national datasets will also be made available to public health professionals, researchers and the general public via the CDC Wide-ranging Online Data for Epidemiologic Research (WONDER) system, where they can be aggregated to the county, state or regional level as per users' needs and downloaded in tabular, graphical, and map formats.
The linkage of these data provides a useful addition to CDC WONDER, allowing public health researchers and policy makers to better include environmental exposure data in the context of other health data available in this online system. It also substantially expands public access to NASA data, making their use by a wide range of decision makers feasible.
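The linkage described above, joining gridded environmental exposures to cohort records by place and day, can be sketched with pandas; all column names and values below are invented for illustration, not the REGARDS or WONDER schemas.

```python
import pandas as pd

# Toy stand-in for daily 10-km PM2.5 grid cells, each tagged with the
# county (FIPS code) it falls in.
grid = pd.DataFrame({
    "county_fips": ["01073", "01073", "01089"],
    "date": ["2003-06-01", "2003-06-01", "2003-06-01"],
    "pm25": [14.2, 15.8, 9.1],
})

# Aggregate grid cells to a county-level daily mean exposure.
county_pm25 = grid.groupby(["county_fips", "date"], as_index=False)["pm25"].mean()

# Toy stand-in for de-identified cohort records.
cohort = pd.DataFrame({
    "participant_id": [101, 102],
    "county_fips": ["01073", "01089"],
    "date": ["2003-06-01", "2003-06-01"],
})

# Link each participant to the exposure estimate for their county and day.
linked = cohort.merge(county_pm25, on=["county_fips", "date"], how="left")
print(linked)
```

The same aggregate-then-join pattern underlies the county/state/regional roll-ups that WONDER exposes to its users.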
Jiang, Yueyang; Kim, John B.; Still, Christopher J.; Kerns, Becky K.; Kline, Jeffrey D.; Cunningham, Patrick G.
2018-01-01
Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest. PMID:29461513
Jiang, Yueyang; Kim, John B; Still, Christopher J; Kerns, Becky K; Kline, Jeffrey D; Cunningham, Patrick G
2018-02-20
Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest.
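One simple way to quantify the pair-wise disagreement the study describes is a grid-cell RMSE between two downscaled datasets on a common grid; the toy 3x3 arrays below stand in for real temperature surfaces, and this is only a sketch of the idea, not the paper's actual variability metric.

```python
import numpy as np

def grid_rmse(a, b):
    """Root-mean-square difference between two co-registered grids."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Toy downscaled temperature surfaces on the same 3x3 grid (degrees C).
dataset_a = np.array([[10.0, 11.0, 12.0],
                      [ 9.5, 10.5, 11.5],
                      [ 9.0, 10.0, 11.0]])
# A second dataset that disagrees by +/-0.5 C in some cells.
dataset_b = dataset_a + np.array([[0.5, 0.0, -0.5],
                                  [0.0, 0.5,  0.0],
                                  [0.5, 0.0,  0.5]])

rmse = round(grid_rmse(dataset_a, dataset_b), 3)
print(rmse)
```

Computing this per watershed, rather than over the whole domain, is the kind of spatial aggregation the abstract uses to turn cell-level disagreement into dataset-selection guidance.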
Clickstream data yields high-resolution maps of science
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bollen, Johan; Van De Sompel, Herbert; Hagberg, Aric
2009-01-01
Intricate maps of science have been created from citation data to visualize the structure of scientific activity. However, most scientific publications are now accessed online. Scholarly web portals record detailed log data at a scale that exceeds the number of all existing citations combined. Such log data is recorded immediately upon publication and keeps track of the sequences of user requests (clickstreams) that are issued by a variety of users across many different domains. Given these advantages of log datasets over citation data, we investigate whether they can produce high-resolution, more current maps of science.
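The clickstream idea, deriving relationships between journals from sequences of user requests rather than citations, can be sketched as transition counts over sessions; the session data below is invented for illustration and the normalization is one simple choice among several.

```python
from collections import Counter

# Toy user sessions: each is an ordered sequence of journals a user visited.
sessions = [
    ["Nature", "Science", "Cell"],
    ["Nature", "Cell"],
    ["Science", "Nature", "Science"],
]

# Count consecutive journal-to-journal transitions (the "clickstream").
transitions = Counter()
for seq in sessions:
    for src, dst in zip(seq, seq[1:]):
        transitions[(src, dst)] += 1

# Row-normalize: estimate P(dst | src) from the counts. These conditional
# probabilities form the weighted graph from which a map of science is drawn.
totals = Counter()
for (src, _), n in transitions.items():
    totals[src] += n
probs = {pair: n / totals[pair[0]] for pair, n in transitions.items()}
print(probs[("Nature", "Science")])
```

Clustering and laying out the resulting weighted graph is what produces the high-resolution maps the abstract refers to.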
De-identification of patient notes with recurrent neural networks.
Dernoncourt, Franck; Lee, Ji Young; Uzuner, Ozlem; Szolovits, Peter
2017-05-01
Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information that needs to be removed to de-identify patient notes. Manual de-identification is impractical given the size of electronic health record databases, the limited number of researchers with access to non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value. We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and is twice as large as the i2b2 2014 dataset. Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 98.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.21. Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no manual feature engineering. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com
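The F1-scores quoted above are the harmonic mean of precision and recall; the short check below reproduces both reported figures from the stated precision and recall values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (here on a 0-100 scale)."""
    return 2 * precision * recall / (precision + recall)

# Values as reported in the abstract.
print(round(f1(98.32, 97.38), 2))  # i2b2 2014 dataset
print(round(f1(99.21, 99.25), 2))  # MIMIC de-identification dataset
```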
Skounakis, Emmanouil; Farmaki, Christina; Sakkalis, Vangelis; Roniotis, Alexandros; Banitsas, Konstantinos; Graf, Norbert; Marias, Konstantinos
2010-01-01
This paper presents a novel, open access interactive platform for 3D medical image analysis, simulation and visualization, focusing on oncology images. The platform was developed through constant interaction and feedback from expert clinicians, integrating a thorough analysis of their requirements, with the ultimate goal of assisting in accurately delineating tumors. It allows clinicians not only to work with a large number of 3D tomographic datasets but also to efficiently annotate multiple regions of interest in the same session. Manual and semi-automatic segmentation techniques combined with integrated correction tools assist in the quick and refined delineation of tumors, while different users can add different components related to oncology such as tumor growth and simulation algorithms for improving therapy planning. The platform has been tested by different users and over a large number of heterogeneous tomographic datasets to ensure stability, usability, extensibility and robustness, with promising results. The platform, a manual and tutorial videos are available at http://biomodeling.ics.forth.gr. It is free to use under the GNU General Public License.
Solar Irradiance Data Products at the LASP Interactive Solar IRradiance Datacenter (LISIRD)
NASA Astrophysics Data System (ADS)
Lindholm, D. M.; Ware DeWolfe, A.; Wilson, A.; Pankratz, C. K.; Snow, M. A.; Woods, T. N.
2011-12-01
The Laboratory for Atmospheric and Space Physics (LASP) has developed the LASP Interactive Solar IRradiance Datacenter (LISIRD, http://lasp.colorado.edu/lisird/) web site to provide access to a comprehensive set of solar irradiance measurements and related datasets. Current data holdings include products from NASA missions SORCE, UARS, SME, and TIMED-SEE. The data provided covers a wavelength range from soft X-ray (XUV) at 0.1 nm up to the near infrared (NIR) at 2400 nm, as well as Total Solar Irradiance (TSI). Other datasets include solar indices, spectral and flare models, solar images, and more. The LISIRD web site features updated plotting, browsing, and download capabilities enabled by dygraphs, JavaScript, and Ajax calls to the LASP Time Series Server (LaTiS). In addition to the web browser interface, most of the LISIRD datasets can be accessed via the LaTiS web service interface that supports the OPeNDAP standard. OPeNDAP clients and other programming APIs are available for making requests that subset, aggregate, or filter data on the server before it is transported to the user. This poster provides an overview of the LISIRD system, summarizes the datasets currently available, and provides details on how to access solar irradiance data products through LISIRD's interfaces.
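Access via the LaTiS web service interface amounts to constraining a dataset's variables in the request URL before any data is transported. The sketch below builds such a time-range query; the dataset identifier is illustrative, and actual dataset names should be taken from the LISIRD site itself.

```python
# Sketch of a LaTiS-style time-series request to LISIRD. The dataset name
# "sorce_tsi_24hr" is used here as a plausible placeholder, not a verified
# identifier; the query syntax shown (inequality constraints on the time
# variable, output format chosen by suffix) follows the LaTiS pattern.
LATIS_BASE = "https://lasp.colorado.edu/lisird/latis/dap/"

def build_time_range_query(dataset, start, stop, fmt="csv"):
    """Constrain the dataset's time variable to [start, stop]; pick a format."""
    return f"{LATIS_BASE}{dataset}.{fmt}?time>={start}&time<={stop}"

url = build_time_range_query("sorce_tsi_24hr", "2003-01-01", "2003-12-31")
print(url)
```

Because the subsetting happens server-side, a client downloads only the year of TSI values it asked for rather than the full mission record.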
Enabling Open Research Data Discovery through a Recommender System
NASA Astrophysics Data System (ADS)
Devaraju, Anusuriya; Jayasinghe, Gaya; Klump, Jens; Hogan, Dominic
2017-04-01
Government agencies, universities, research and nonprofit organizations are increasingly publishing their datasets to promote transparency, induce new research and generate economic value through the development of new products or services. The datasets may be downloaded from various data portals (data repositories), which are general or domain-specific. The Registry of Research Data Repositories (re3data.org) lists more than 2500 such data repositories from around the globe. Data portals allow keyword search and faceted navigation to facilitate discovery of research datasets. However, the volume and variety of datasets have made finding relevant datasets more difficult. Common dataset search mechanisms may be time consuming, may produce irrelevant results and are primarily suitable for users who are familiar with the general structure and contents of the respective database. Therefore, we need new approaches to support research data discovery. Recommender systems offer new possibilities for users to find datasets that are relevant to their research interests. This study presents a recommender system developed for the CSIRO Data Access Portal (DAP, http://data.csiro.au). The datasets hosted on the portal are diverse, published by researchers from 13 business units in the organisation. The goal of the study is not to replace the current search mechanisms on the data portal, but rather to extend data discovery through exploratory search, in this case by building a recommender system. We adopted a hybrid recommendation approach, comprising content-based filtering and item-item collaborative filtering. The content-based filtering computes similarities between datasets based on metadata such as title, keywords, descriptions, fields of research, location, contributors, etc. The collaborative filtering utilizes user search behaviour and download patterns derived from the server logs to determine similar datasets.
Similarities above are then combined with different degrees of importance (weights) to determine the overall data similarity. We determined the similarity weights based on a survey involving 150 users of the portal. The recommender results for a given dataset are accessible programmatically via a RESTful web service. An offline evaluation involving data users demonstrates the ability of the recommender system to discover relevant and 'novel' datasets.
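The hybrid scheme described above, content-based plus collaborative signals combined with survey-derived weights, can be sketched as a weighted sum of two similarity scores; the weights, keyword sets and collaborative score below are illustrative, not the values used by the DAP recommender.

```python
def jaccard(a, b):
    """Content-based similarity over keyword sets (one simple choice)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_score(content_sim, collab_sim, w_content=0.6, w_collab=0.4):
    """Weighted combination of the two signals; weights are assumed here,
    whereas the study derived them from a 150-user survey."""
    return w_content * content_sim + w_collab * collab_sim

# Toy metadata keywords for two datasets.
content = jaccard({"soil", "moisture", "remote-sensing"},
                  {"soil", "salinity", "remote-sensing"})
# Stand-in for a co-access similarity mined from server logs.
collab = 0.5

score = round(hybrid_score(content, collab), 3)
print(score)
```

Ranking all candidate datasets by this combined score for a given item yields the "related datasets" list a RESTful endpoint would return.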
NASA Astrophysics Data System (ADS)
McDougall, C.; McLaughlin, J.
2008-12-01
NOAA has developed several programs aimed at facilitating the use of earth system science data and data visualizations by formal and informal educators. One of them, Science On a Sphere, a visualization display tool and system that uses networked LCD projectors to display animated global datasets onto the outside of a suspended, 1.7-meter diameter opaque sphere, enables science centers, museums, and universities to display real-time and current earth system science data. NOAA's Office of Education has provided grants to such education institutions to develop exhibits featuring Science On a Sphere (SOS) and create content for and evaluate audience impact. Currently, 20 public education institutions have permanent Science On a Sphere exhibits and 6 more will be installed soon. These institutions and others that are working to create and evaluate content for this system work collaboratively as a network to improve our collective knowledge about how to create educationally effective visualizations. Network members include other federal agencies, such as NASA and the Dept. of Energy, and major museums such as the Smithsonian and the American Museum of Natural History, as well as a variety of mid-sized and small museums and universities. Although the audiences in these institutions vary widely in their scientific awareness and understanding, we find there are misconceptions and a lack of familiarity with viewing visualizations that are common among the audiences. Through evaluations performed in these institutions we continue to evolve our understanding of how to create content that is understandable by those with minimal scientific literacy. The findings from our network will be presented, including the importance of providing context, real-world connections and imagery to accompany the visualizations, and the need for audience orientation before the visualizations are viewed.
Additionally, we will review the publicly accessible virtual library housing over 200 datasets for SOS and any other real or virtual globe. These datasets represent contributions from NOAA, NASA, Dept. of Energy, and the public institutions that are displaying the spheres.
NASA Technical Reports Server (NTRS)
Al-Hamdan, Mohammad; Crosson, William; Economou, Sigrid; Estes,Maurice, Jr.; Estes, Sue; Hemmings, Sarah; Kent, Shia; Puckett, Mark; Quattrochi, Dale; Wade, Gina;
2012-01-01
The overall goal of this study is to address issues of environmental health and enhance public health decision making by using NASA remotely sensed data and products. This study is a collaboration between NASA Marshall Space Flight Center, Universities Space Research Association (USRA), the University of Alabama at Birmingham (UAB) School of Public Health and the Centers for Disease Control and Prevention (CDC) Office of Surveillance, Epidemiology and Laboratory Services. The objectives of this study are to develop high-quality spatial data sets of environmental variables, link these with public health data from a national cohort study, and deliver the environmental data sets and associated public health analyses to local, state and federal end-user groups. Three daily environmental data sets were developed for the conterminous U.S. on different spatial resolutions for the period 2003-2008: (1) spatial surfaces of estimated fine particulate matter (PM2.5) on a 10-km grid using US Environmental Protection Agency (EPA) ground observations and NASA's MODerate-resolution Imaging Spectroradiometer (MODIS) data; (2) a 1-km grid of MODIS Land Surface Temperature (LST); and (3) a 12-km grid of daily incoming solar radiation and maximum and minimum air temperature using the North American Land Data Assimilation System (NLDAS) data. These environmental datasets were linked with public health data from the UAB REasons for Geographic and Racial Differences in Stroke (REGARDS) national cohort study to determine whether exposures to these environmental risk factors are related to cognitive decline, stroke and other health outcomes.
These environmental national datasets will also be made available to public health professionals, researchers and the general public via the CDC Wide-ranging Online Data for Epidemiologic Research (WONDER) system, where they can be aggregated to the county-level, state-level, or regional-level as per users' need and downloaded in tabular, graphical, and map formats. This provides a significant addition to the CDC WONDER online system, allowing public health researchers and policy makers to better include environmental exposure data in the context of other health data available in CDC WONDER. It also substantially expands public access to NASA data, making their use by a wide range of decision-makers feasible.
NASA Astrophysics Data System (ADS)
Al-Hamdan, M. Z.; Crosson, W. L.; Economou, S.; Estes, M., Jr.; Estes, S. M.; Hemmings, S. N.; Kent, S.; Loop, M.; Puckett, M.; Quattrochi, D. A.; Wade, G.; McClure, L.
2012-12-01
The overall goal of this study is to address issues of environmental health and enhance public health decision making by using NASA remotely sensed data and products. This study is a collaboration between NASA Marshall Space Flight Center, Universities Space Research Association (USRA), the University of Alabama at Birmingham (UAB) School of Public Health and the Centers for Disease Control and Prevention (CDC) Office of Surveillance, Epidemiology and Laboratory Services. The objectives of this study are to develop high-quality spatial data sets of environmental variables, link these with public health data from a national cohort study, and deliver the environmental data sets and associated public health analyses to local, state and federal end-user groups. Three daily environmental data sets were developed for the conterminous U.S. on different spatial resolutions for the period 2003-2008: (1) spatial surfaces of estimated fine particulate matter (PM2.5) on a 10-km grid using US Environmental Protection Agency (EPA) ground observations and NASA's MODerate-resolution Imaging Spectroradiometer (MODIS) data; (2) a 1-km grid of MODIS Land Surface Temperature (LST); and (3) a 12-km grid of daily incoming solar radiation and maximum and minimum air temperature using the North American Land Data Assimilation System (NLDAS) data. These environmental datasets were linked with public health data from the UAB REasons for Geographic and Racial Differences in Stroke (REGARDS) national cohort study to determine whether exposures to these environmental risk factors are related to cognitive decline, stroke and other health outcomes. 
These environmental national datasets will also be made available to public health professionals, researchers and the general public via the CDC Wide-ranging Online Data for Epidemiologic Research (WONDER) system, where they can be aggregated to the county-level, state-level, or regional-level as per users' need and downloaded in tabular, graphical, and map formats. This provides a significant addition to the CDC WONDER online system, allowing public health researchers and policy makers to better include environmental exposure data in the context of other health data available in CDC WONDER. It also substantially expands public access to NASA data, making their use by a wide range of decision-makers feasible.
Rinchai, Darawan; Boughorbel, Sabri; Presnell, Scott; Quinn, Charlie; Chaussabel, Damien
2016-01-01
Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp. PMID:27158452
Scribl: an HTML5 Canvas-based graphics library for visualizing genomic data over the web
Miller, Chase A.; Anthony, Jon; Meyer, Michelle M.; Marth, Gabor
2013-01-01
Motivation: High-throughput biological research requires simultaneous visualization as well as analysis of genomic data, e.g. read alignments, variant calls and genomic annotations. Traditionally, such integrative analysis required desktop applications operating on locally stored data. Many current terabyte-size datasets generated by large public consortia projects, however, are already only feasibly stored at specialist genome analysis centers. As even small laboratories can afford very large datasets, local storage and analysis are becoming increasingly limiting, and it is likely that most such datasets will soon be stored remotely, e.g. in the cloud. These developments will require web-based tools that enable users to access, analyze and view vast remotely stored data with a level of sophistication and interactivity that approximates desktop applications. As rapidly dropping cost enables researchers to collect data intended to answer questions in very specialized contexts, developers must also provide software libraries that empower users to implement customized data analyses and data views for their particular application. Such specialized, yet lightweight, applications would empower scientists to better answer specific biological questions than possible with general-purpose genome browsers currently available. Results: Using recent advances in core web technologies (HTML5), we developed Scribl, a flexible genomic visualization library specifically targeting coordinate-based data such as genomic features, DNA sequence and genetic variants. Scribl simplifies the development of sophisticated web-based graphical tools that approach the dynamism and interactivity of desktop applications. Availability and implementation: Software is freely available online at http://chmille4.github.com/Scribl/ and is implemented in JavaScript with all modern browsers supported. 
Contact: gabor.marth@bc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23172864
Data integration: Combined imaging and electrophysiology data in the cloud.
Kini, Lohith G; Davis, Kathryn A; Wagenaar, Joost B
2016-01-01
There has been an increasing effort to correlate electrophysiology data with imaging in patients with refractory epilepsy over recent years. IEEG.org provides a free-access, rapidly growing archive of imaging data combined with electrophysiology data and patient metadata. It currently contains over 1200 human and animal datasets, with multiple data modalities associated with each dataset (neuroimaging, EEG, EKG, de-identified clinical and experimental data, etc.). The platform is developed around the concept that scientific data sharing requires a flexible platform that allows sharing of data from multiple file formats. IEEG.org provides high- and low-level access to the data in addition to providing an environment in which domain experts can find, visualize, and analyze data in an intuitive manner. Here, we present a summary of the current infrastructure of the platform, available datasets and goals for the near future. Copyright © 2015 Elsevier Inc. All rights reserved.
Data integration: Combined Imaging and Electrophysiology data in the cloud
Kini, Lohith G.; Davis, Kathryn A.; Wagenaar, Joost B.
2015-01-01
There has been an increasing effort to correlate electrophysiology data with imaging in patients with refractory epilepsy over recent years. IEEG.org provides a free-access, rapidly growing archive of imaging data combined with electrophysiology data and patient metadata. It currently contains over 1200 human and animal datasets, with multiple data modalities associated with each dataset (neuroimaging, EEG, EKG, de-identified clinical and experimental data, etc.). The platform is developed around the concept that scientific data sharing requires a flexible platform that allows sharing of data from multiple file-formats. IEEG.org provides high and low-level access to the data in addition to providing an environment in which domain experts can find, visualize, and analyze data in an intuitive manner. Here, we present a summary of the current infrastructure of the platform, available datasets and goals for the near future. PMID:26044858
24 CFR 598.410 - Public access to materials and proceedings.
Code of Federal Regulations, 2014 CFR
2014-04-01
... 24 Housing and Urban Development 3 2014-04-01 2013-04-01 true Public access to materials and proceedings. 598.410 Section 598.410 Housing and Urban Development Regulations Relating to Housing and Urban... DESIGNATIONS Post-Designation Requirements § 598.410 Public access to materials and proceedings. After...
24 CFR 598.410 - Public access to materials and proceedings.
Code of Federal Regulations, 2012 CFR
2012-04-01
... 24 Housing and Urban Development 3 2012-04-01 2012-04-01 false Public access to materials and proceedings. 598.410 Section 598.410 Housing and Urban Development Regulations Relating to Housing and Urban... DESIGNATIONS Post-Designation Requirements § 598.410 Public access to materials and proceedings. After...
24 CFR 598.410 - Public access to materials and proceedings.
Code of Federal Regulations, 2013 CFR
2013-04-01
... 24 Housing and Urban Development 3 2013-04-01 2013-04-01 false Public access to materials and proceedings. 598.410 Section 598.410 Housing and Urban Development Regulations Relating to Housing and Urban... DESIGNATIONS Post-Designation Requirements § 598.410 Public access to materials and proceedings. After...
TCGA Expedition: A Data Acquisition and Management System for TCGA Data
Chandran, Uma R.; Medvedeva, Olga P.; Barmada, M. Michael; Blood, Philip D.; Chakka, Anish; Luthra, Soumya; Ferreira, Antonio; Wong, Kim F.; Lee, Adrian V.; Zhang, Zhihui; Budden, Robert; Scott, J. Ray; Berndt, Annerose; Berg, Jeremy M.; Jacobson, Rebecca S.
2016-01-01
Background The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 petabytes in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but many challenges exist in navigating and using data obtained from these sites. We developed TCGA Expedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGA Expedition supports command line access at high-performance computing facilities as well as some functionality with third party tools. For a subset of TCGA data collected at University of Pittsburgh, we also re-associate TCGA data with de-identified data from the electronic health records. Here we describe the software as well as the architecture of our repository, methods for loading of TCGA data to multiple platforms, and security and regulatory controls that conform to federal best practices. Results TCGA Expedition software consists of a set of scripts written in Bash, Python and Java that download, extract, harmonize, version and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools.
Using this software, we created a collaborative repository, the Pittsburgh Genome Resource Repository (PGRR) that enabled investigators at our institution to work with all TCGA data formats, and to interrogate these data with analysis pipelines, and associated tools. WGS data are especially challenging for individual investigators to use, due to issues with downloading, storage, and processing; having locally accessible WGS BAM files has proven invaluable. Conclusion Our open-source, freely available TCGA Expedition software can be used to create a local collaborative infrastructure for acquiring, managing, and analyzing TCGA data and other large public datasets. PMID:27788220
Development of the public information and communication technology assessment tool.
Ripat, Jacquie; Watzke, James; Birch, Gary
2008-09-01
Public information and communication technologies, such as information kiosks, automated banking machines and ticket dispensers, allow people to access services in a convenient and timely manner. However, the development of these technologies has occurred largely without consideration of access by people with disabilities. Inaccessible technical features make operation of a public technology difficult and barriers in the environment create navigational challenges, limiting the opportunity of people with disabilities to use these devices and access the services they provide. This paper describes the development of a tool that individuals, disability advocacy groups, business owners, healthcare providers, and urban planners can use to evaluate the accessibility of public technologies and the surrounding environment. Evaluation results can then be used to develop recommendations and advocate for technical and environmental changes to improve access. Tool development consisted of a review of the literature and key Canadian Standards Association documents, task analysis, and consultation with accessibility experts. Studies of content validity, tool usability, inter-rater and test-retest reliability were conducted in sites across Canada. Accessibility experts verified the content validity of the tool. The current version of the tool has incorporated the findings of a usability study. Initial testing indicated excellent agreement for inter-rater and test-retest reliability scores. Social exclusion can arise when public technologies are not accessible. This newly developed instrument provides detailed information that can be used to advocate for more accessible and inclusive public information and communication technologies.
User-based representation of time-resolved multimodal public transportation networks.
Alessandretti, Laura; Karsai, Márton; Gauvin, Laetitia
2016-07-01
Multimodal transportation systems, with several coexisting services like bus, tram and metro, can be represented as time-resolved multilayer networks where the different transportation modes connecting the same set of nodes are associated with distinct network layers. Their quantitative description became possible recently due to openly accessible datasets describing the geo-localized transportation dynamics of large urban areas. Advancements call for novel analytics, which combines earlier established methods and exploits the inherent complexity of the data. Here, we provide a novel user-based representation of public transportation systems, which combines representations, accounting for the presence of multiple lines and reducing the effect of spatial embeddedness, while considering the total travel time, its variability across the schedule, and taking into account the number of transfers necessary. After the adjustment of earlier techniques to the novel representation framework, we analyse the public transportation systems of several French municipal areas and identify hidden patterns of privileged connections. Furthermore, we study their efficiency as compared to the commuting flow. The proposed representation could help to enhance resilience of local transportation systems to provide better design policies for future developments.
User-based representation of time-resolved multimodal public transportation networks
Alessandretti, Laura; Gauvin, Laetitia
2016-01-01
Multimodal transportation systems, with several coexisting services like bus, tram and metro, can be represented as time-resolved multilayer networks where the different transportation modes connecting the same set of nodes are associated with distinct network layers. Their quantitative description became possible recently due to openly accessible datasets describing the geo-localized transportation dynamics of large urban areas. Advancements call for novel analytics, which combines earlier established methods and exploits the inherent complexity of the data. Here, we provide a novel user-based representation of public transportation systems, which combines representations, accounting for the presence of multiple lines and reducing the effect of spatial embeddedness, while considering the total travel time, its variability across the schedule, and taking into account the number of transfers necessary. After the adjustment of earlier techniques to the novel representation framework, we analyse the public transportation systems of several French municipal areas and identify hidden patterns of privileged connections. Furthermore, we study their efficiency as compared to the commuting flow. The proposed representation could help to enhance resilience of local transportation systems to provide better design policies for future developments. PMID:27493773
The Earth Microbiome Project and Global Systems Biology
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gilbert, Jack A.; Jansson, Janet K.; Knight, Rob
Recently, we published the first large-scale analysis of data from the Earth Microbiome Project (1, 2), a truly multidisciplinary research program involving more than 500 scientists and 27,751 samples acquired from 43 countries. These samples represent myriad specimen types and span a wide range of biotic and abiotic factors, geographic locations, and physicochemical properties. The database (https://qiita.ucsd.edu/emp/) is still growing, with over 90,000 amplicon datasets, >500 metagenomic runs, and metabolomics datasets from a similar number of samples. Importantly, the techniques, data and analytical tools are all standardized and publicly accessible, providing a framework to support research at a scale of integration that just 7 years ago seemed impossible.
Accessing the SEED genome databases via Web services API: tools for programmers.
Disz, Terry; Akhter, Sajia; Cuevas, Daniel; Olson, Robert; Overbeek, Ross; Vonstein, Veronika; Stevens, Rick; Edwards, Robert A
2010-06-14
The SEED integrates many publicly available genome sequences into a single resource. The database contains accurate and up-to-date annotations based on the subsystems concept that leverages clustering between genomes and other clues to accurately and efficiently annotate microbial genomes. The backend is used as the foundation for many genome annotation tools, such as the Rapid Annotation using Subsystems Technology (RAST) server for whole genome annotation, the metagenomics RAST server for random community genome annotations, and the annotation clearinghouse for exchanging annotations from different resources. In addition to a web user interface, the SEED also provides a Web services-based API for programmatic access to the data in the SEED, allowing the development of third-party tools and mash-ups. The currently exposed Web services encompass over forty different methods for accessing data related to microbial genome annotations. The Web services provide comprehensive access to the database back end, allowing any programmer access to the most consistent and accurate genome annotations available. The Web services are deployed using a platform independent service-oriented approach that allows the user to choose the most suitable programming platform for their application. Example code demonstrates that Web services can be used to access the SEED using common bioinformatics programming languages such as Perl, Python, and Java. We present a novel approach to access the SEED database. Using Web services, a robust API for access to genomics data is provided, without requiring large volume downloads all at once. The API ensures timely access to the most current datasets available, including the new genomes as soon as they come online.
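As a minimal sketch of the REST-style programmatic access pattern the abstract describes: compose a query URL for a named API method, then parse the JSON response into annotation records. The base URL, method name, and response shape below are illustrative placeholders only; consult the SEED Web services documentation for the actual endpoints and signatures.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL for illustration only.
BASE_URL = "https://servers.example.org/seed/api"

def build_request_url(method, **params):
    """Compose a REST-style query URL for a named API method."""
    return "%s/%s?%s" % (BASE_URL, method, urlencode(sorted(params.items())))

def parse_annotations(payload):
    """Extract (feature_id, function) pairs from a JSON response body."""
    return [(f["id"], f["function"]) for f in json.loads(payload)["features"]]

url = build_request_url("genome_features", genome="83333.1", type="peg")

# A canned response standing in for what a live server might return.
sample = '{"features": [{"id": "fig|83333.1.peg.1", "function": "Thr operon leader"}]}'
print(parse_annotations(sample))
```

The same two-step pattern (build request, parse structured response) applies whether the client is written in Perl, Python, or Java, which is what makes a platform-independent service-oriented API practical for third-party tools.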
Smith, Tanya; Page-Nicholson, Samantha; Morrison, Kerryn; Gibbons, Bradley; Jones, M Genevieve W; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin; Roxburgh, Lizanne
2016-01-01
The International Crane Foundation (ICF) / Endangered Wildlife Trust's (EWT) African Crane Conservation Programme has recorded 26 403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus, Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus. The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. This is the single most comprehensive dataset published on African Crane species that adds new information about the distribution of these three threatened species. We hope this will further aid conservation authorities to monitor and protect these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal.
Sargeant, Tobias; Laperrière, David; Ismail, Houssam; Boucher, Geneviève; Rozendaal, Marieke; Lavallée, Vincent-Philippe; Ashton-Beaucage, Dariel; Wilhelm, Brian; Hébert, Josée; Hilton, Douglas J.
2017-01-01
Abstract Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets. PMID:28472340
NASA Astrophysics Data System (ADS)
Brown, L. E.; Faden, J.; Vandegriff, J. D.; Kurth, W. S.; Mitchell, D. G.
2017-12-01
We present a plan to provide enhanced longevity to analysis software and science data used throughout the Cassini mission for viewing Magnetosphere and Plasma Science (MAPS) data. While a final archive is being prepared for Cassini, the tools that read from this archive will eventually become moribund as real world hardware and software systems evolve. We will add an access layer over existing and planned Cassini data products that will allow multiple tools to access many public MAPS datasets. The access layer is called the Heliophysics Application Programmer's Interface (HAPI), and this is a mechanism being adopted at many data centers across Heliophysics and planetary science for the serving of time series data. Two existing tools are also being enhanced to read from HAPI servers, namely Autoplot from the University of Iowa and MIDL (Mission Independent Data Layer) from The Johns Hopkins Applied Physics Lab. Thus both tools will be able to access data from RPWS, MAG, CAPS, and MIMI. In addition to being able to access data from each other's institutions, these tools will be able to read from all the new datasets expected to come online using the HAPI standard in the near future. The PDS also plans to use HAPI for all the holdings at the Planetary and Plasma Interactions (PPI) node. A basic presentation of the new HAPI data server mechanism is presented, as is an early demonstration of the modified tools.
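The HAPI access layer described above standardizes time-series retrieval: per the HAPI specification, a client requests data from a server's `/hapi/data` endpoint with `id`, `time.min`, and `time.max` query parameters and receives CSV whose first column is an ISO 8601 timestamp. A minimal client sketch follows; the server URL and dataset id are placeholders, not real Cassini identifiers.

```python
from urllib.parse import urlencode

def hapi_data_url(server, dataset, start, stop):
    """Build a HAPI /hapi/data request URL for one dataset and time range."""
    query = urlencode([("id", dataset), ("time.min", start), ("time.max", stop)])
    return "%s/hapi/data?%s" % (server.rstrip("/"), query)

def parse_hapi_csv(text):
    """Split a HAPI CSV response into (timestamp, values) records."""
    records = []
    for line in text.strip().splitlines():
        fields = line.split(",")
        records.append((fields[0], [float(v) for v in fields[1:]]))
    return records

url = hapi_data_url("https://example.org/cassini", "RPWS_KEY_PARAMS",
                    "2012-01-01T00:00:00Z", "2012-01-02T00:00:00Z")

# A canned two-row response standing in for a live server's CSV stream.
sample = "2012-01-01T00:00:00Z,1.5,2.0\n2012-01-01T00:01:00Z,1.6,2.1\n"
print(parse_hapi_csv(sample))
```

Because every HAPI server answers the same request shape, a tool like Autoplot or MIDL that implements this client once can read from any compliant data center, which is the interoperability payoff the abstract describes.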
NASA Astrophysics Data System (ADS)
Wagemann, Julia; Siemen, Stephan
2017-04-01
The European Centre for Medium-Range Weather Forecasts (ECMWF) has been providing an increasing amount of data to the public. The most widely used datasets include the global climate reanalyses (e.g. ERA-Interim) and atmospheric composition data, which are available to the public free of charge. The centre is further operating, on behalf of the European Commission, two Copernicus Services, the Copernicus Atmosphere Monitoring Service (CAMS) and Climate Change Service (C3S), which are making up-to-date environmental information freely available for scientists, policy makers and businesses. However, to fully benefit from open data, large environmental datasets also have to be easily accessible in a standardised, machine-readable format. Traditional data centres, such as ECMWF, currently face challenges in providing interoperable standardised access to increasingly large and complex datasets for scientists and industry. Therefore, ECMWF put open data in the spotlight during a week of events in March 2017 exploring the potential of freely available weather- and climate-related data and to review technological solutions serving these data. Key events included a Workshop on Meteorological Operational Systems (MOS) and a two-day hackathon. The MOS workshop aimed at reviewing technologies and practices to ensure efficient (open) data processing and provision. The hackathon focused on exploring creative uses of open environmental data and to see how open data is beneficial for various industries. The presentation aims to give a review of the outcomes and conclusions of the Open Data Week at ECMWF. A specific focus will be set on the importance of data standards and web services to make open environmental data a success. The presentation overall examines the opportunities and challenges of open environmental data from a data provider's perspective.
Data Publication: The Evolving Lifecycle
NASA Astrophysics Data System (ADS)
Studwell, S.; Elliott, J.; Anderson, A.
2015-12-01
Datasets are recognized as valuable information entities in their own right that, now and in the future, need to be available for citation, discovery, retrieval and reuse. The U.S. Department of Energy's Office of Scientific and Technical Information (OSTI) provides Digital Object Identifiers (DOIs) to DOE-funded data through partnership with DataCite. The Geothermal Data Repository (GDR) has been using OSTI's Data ID Service since summer 2014 and is a success story for data publishing in several different ways. This presentation attributes the initial success to the insistence of DOE's Geothermal Technologies Office on detailed planning, robust data curation, and submitter participation. OSTI widely disseminates these data products across both U.S. and international platforms and continually enhances the Data ID Service to facilitate better linkage between published literature, supplementary data components, and the underlying datasets within the structure of the GDR repository. Issues of granularity in DOI assignment, the role of new federal government guidelines on public access to digital data, and the challenges still ahead will be addressed.
NASA Astrophysics Data System (ADS)
Lehnert, K. A.; Carbotte, S. M.; Ferrini, V.; Hsu, L.; Arko, R. A.; Walker, J. D.; O'hara, S. H.
2012-12-01
Substantial volumes of data in the Earth Sciences are collected in small- to medium-size projects by individual investigators or small research teams, known as the 'Long Tail' of science. Traditionally, these data have largely stayed 'in the dark', i.e. they have not been properly archived, and have therefore been inaccessible and underutilized. The primary reason has been the lack of appropriate infrastructure, from adequate repositories to resources and support for investigators to properly manage their data, to community standards and best practices. Lack of credit for data management and for the data themselves has contributed to the reluctance of investigators to share their data. IEDA (Integrated Earth Data Applications), a NSF-funded data facility for solid earth geoscience data, has developed a comprehensive suite of data services that are designed to address the concerns and needs of investigators. IEDA's data publication service registers datasets with DOI and ensures their proper citation and attribution. IEDA is working with publishers on advanced linkages between datasets in the IEDA repository and scientific online articles to facilitate access to the data, enhance their visibility, and augment their use and citation. IEDA's investigator support ranges from individual support for data management to tools, tutorials, and virtual or face-to-face workshops that guide and assist investigators with data management planning, data submission, and data documentation. A critical aspect of IEDA's concept has been the disciplinary expertise within the team and its strong liaison with the science community, as well as a community-based governance. These have been fundamental to gain the trust and support of the community that have led to significantly improved data preservation and access in the communities served by IEDA.
dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins.
Huang, Kai-Yao; Su, Min-Gang; Kao, Hui-Ju; Hsieh, Yun-Chung; Jhong, Jhih-Hua; Cheng, Kuang-Hao; Huang, Hsien-Da; Lee, Tzong-Yi
2016-01-04
Owing to the importance of the post-translational modifications (PTMs) of proteins in regulating biological processes, the dbPTM (http://dbPTM.mbc.nctu.edu.tw/) was developed as a comprehensive database of experimentally verified PTMs from several databases with annotations of potential PTMs for all UniProtKB protein entries. For this 10th anniversary of dbPTM, the updated resource provides not only a comprehensive dataset of experimentally verified PTMs, supported by the literature, but also an integrative interface for accessing all available databases and tools that are associated with PTM analysis. As well as collecting experimental PTM data from 14 public databases, this update manually curates over 12 000 modified peptides, including the emerging S-nitrosylation, S-glutathionylation and succinylation, from approximately 500 research articles, which were retrieved by text mining. As the number of available PTM prediction methods increases, this work compiles a non-homologous benchmark dataset to evaluate the predictive power of online PTM prediction tools. An increasing interest in the structural investigation of PTM substrate sites motivated the mapping of all experimental PTM peptides to protein entries of Protein Data Bank (PDB) based on database identifier and sequence identity, which enables users to examine spatially neighboring amino acids, solvent-accessible surface area and side-chain orientations for PTM substrate sites on tertiary structures. Since drug binding in PDB is annotated, this update identified over 1100 PTM sites that are associated with drug binding. The update also integrates metabolic pathways and protein-protein interactions to support the PTM network analysis for a group of proteins. Finally, the web interface is redesigned and enhanced to facilitate access to this resource. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
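A core step in the PDB mapping described above is locating an experimentally identified PTM peptide inside a full-length protein sequence so the modified residue gets an absolute position. The sketch below shows that mapping for the simple case of an exact subsequence match; the sequences are toy examples, not real dbPTM or PDB entries, and real mappings must also handle sequence-identity thresholds and ambiguous matches.

```python
def map_ptm_site(protein_seq, peptide, mod_offset):
    """Return the 1-based position of the modified residue in the protein,
    given its 0-based offset within the peptide, or None if no match."""
    start = protein_seq.find(peptide)
    if start == -1:
        return None
    return start + mod_offset + 1

# Toy protein and a peptide whose modified residue is the S at offset 1.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
peptide = "KSHFSRQ"
print(map_ptm_site(protein, peptide, 1))
```

Once the absolute position is known, it can be cross-referenced against a structure's residue numbering to examine spatially neighboring amino acids or solvent accessibility, as the resource does for PDB-mapped sites.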
Gururaj, Anupama E.; Chen, Xiaoling; Pournejati, Saeid; Alter, George; Hersh, William R.; Demner-Fushman, Dina; Ohno-Machado, Lucila
2017-01-01
Abstract The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval. Database URL: https://biocaddie.org/benchmark-data PMID:29220453
To improve public health and the environment, the United States Environmental Protection Agency (USEPA) collects information about facilities, sites, or places subject to environmental regulation or of environmental interest. Through the Geospatial Data Download Service, the public is now able to download the EPA Geodata shapefile containing facility and site information from EPA's national program systems. The file is Internet accessible from the Envirofacts Web site (https://www3.epa.gov/enviro/). The data may be used with geospatial mapping applications. (Note: The shapefile omits facilities without latitude/longitude coordinates.) The EPA Geospatial Data contains the name, location (latitude/longitude), and EPA program information about specific facilities and sites. In addition, the file contains a Uniform Resource Locator (URL), which allows mapping applications to present an option to users to access additional EPA data resources on a specific facility or site. This dataset shows Brownfields listed in the 2012 Facility Registry System.
NASA Astrophysics Data System (ADS)
Hudspeth, W. B.; Barrett, H.; Diller, S.; Valentin, G.
2016-12-01
Energize is New Mexico's Experimental Program to Stimulate Competitive Research (NM EPSCoR), funded by the NSF with a focus on building capacity to conduct scientific research. Energize New Mexico leverages the work of faculty and students from NM universities and colleges to provide the tools necessary to a quantitative, science-driven discussion of the state's water policy options and to realize New Mexico's potential for sustainable energy development. This presentation discusses the architectural details of NM EPSCoR's collaborative data management system, GSToRE, and how New Mexico researchers use it to share and analyze diverse research data, with the goal of attaining sustainable energy development in the state. The Earth Data Analysis Center (EDAC) at The University of New Mexico leads the development of computational interoperability capacity that allows the wide use and sharing of energy-related data among NM EPSCoR researchers. Data from a variety of research disciplines is stored and maintained in EDAC's Geographic Storage, Transformation and Retrieval Engine (GSToRE), a distributed platform for large-scale vector and raster data discovery, subsetting, and delivery via Web services that are based on Open Geospatial Consortium (OGC) and REST Web-service standards. Researchers upload and register scientific datasets using a front-end client that collects the critical metadata. In addition, researchers have the option to register their datasets with DataONE, a national, community-driven project that provides access to data across multiple member repositories. The GSToRE platform maintains a searchable, core collection of metadata elements that can be used to deliver metadata in multiple formats, including ISO 19115-2/19139 and FGDC CSDGM. Stored metadata elements also permit the platform to automate the registration of Energize datasets into DataONE, once the datasets are approved for release to the public.
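To illustrate the OGC-standard delivery path mentioned above, the sketch below builds a WFS 2.0 GetFeature request of the kind a client might send to a vector-data service. The endpoint and layer name are placeholders, not actual GSToRE identifiers; the query parameters (`service`, `version`, `request`, `typeNames`, `count`) follow the WFS 2.0 specification.

```python
from urllib.parse import urlencode

def wfs_getfeature_url(endpoint, type_name, max_features=10):
    """Compose a WFS 2.0 GetFeature request URL for one feature type."""
    params = [
        ("service", "WFS"),
        ("version", "2.0.0"),
        ("request", "GetFeature"),
        ("typeNames", type_name),
        ("count", str(max_features)),
    ]
    return "%s?%s" % (endpoint, urlencode(params))

# Hypothetical endpoint and layer name for illustration.
url = wfs_getfeature_url("https://gstore.example.edu/wfs", "epscor:watersheds")
print(url)
```

Because the request grammar is standardized, any OGC-aware client (desktop GIS, web map, script) can subset and retrieve the same datasets without custom integration work, which is the interoperability the platform is built around.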
Who shares? Who doesn't? Factors associated with openly archiving raw research data.
Piwowar, Heather A
2011-01-01
Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.
Tomato Expression Database (TED): a suite of data presentation and analysis tools
Fei, Zhangjun; Tang, Xuemei; Alba, Rob; Giovannoni, James
2006-01-01
The Tomato Expression Database (TED) includes three integrated components. The Tomato Microarray Data Warehouse serves as a central repository for raw gene expression data derived from the public tomato cDNA microarray. In addition to expression data, TED stores experimental design and array information in compliance with the MIAME guidelines and provides web interfaces for researchers to retrieve data for their own analysis and use. The Tomato Microarray Expression Database contains normalized and processed microarray data for ten time points with nine pair-wise comparisons during fruit development and ripening in a normal tomato variety and nearly isogenic single gene mutants impacting fruit development and ripening. Finally, the Tomato Digital Expression Database contains raw and normalized digital expression (EST abundance) data derived from analysis of the complete public tomato EST collection containing >150 000 ESTs derived from 27 different non-normalized EST libraries. This last component also includes tools for the comparison of tomato and Arabidopsis digital expression data. A set of query interfaces and analysis, and visualization tools have been developed and incorporated into TED, which aid users in identifying and deciphering biologically important information from our datasets. TED can be accessed at . PMID:16381976
Tomato Expression Database (TED): a suite of data presentation and analysis tools.
Fei, Zhangjun; Tang, Xuemei; Alba, Rob; Giovannoni, James
2006-01-01
The Tomato Expression Database (TED) includes three integrated components. The Tomato Microarray Data Warehouse serves as a central repository for raw gene expression data derived from the public tomato cDNA microarray. In addition to expression data, TED stores experimental design and array information in compliance with the MIAME guidelines and provides web interfaces for researchers to retrieve data for their own analysis and use. The Tomato Microarray Expression Database contains normalized and processed microarray data for ten time points with nine pair-wise comparisons during fruit development and ripening in a normal tomato variety and nearly isogenic single gene mutants impacting fruit development and ripening. Finally, the Tomato Digital Expression Database contains raw and normalized digital expression (EST abundance) data derived from analysis of the complete public tomato EST collection containing >150,000 ESTs derived from 27 different non-normalized EST libraries. This last component also includes tools for the comparison of tomato and Arabidopsis digital expression data. A set of query interfaces and analysis, and visualization tools have been developed and incorporated into TED, which aid users in identifying and deciphering biologically important information from our datasets. TED can be accessed at http://ted.bti.cornell.edu.
The NCAR Digital Asset Services Hub (DASH): Implementing Unified Data Discovery and Access
NASA Astrophysics Data System (ADS)
Stott, D.; Worley, S. J.; Hou, C. Y.; Nienhouse, E.
2017-12-01
The National Center for Atmospheric Research (NCAR) Directorate created the Data Stewardship Engineering Team (DSET) to plan and implement an integrated single entry point for uniform digital asset discovery and access across the organization in order to improve the efficiency of access, reduce the costs, and establish the foundation for interoperability with other federated systems. This effort supports new policies included in federal funding mandates, NSF data management requirements, and journal citation recommendations. An inventory during the early planning stage identified diverse asset types across the organization that included publications, datasets, metadata, models, images, and software tools and code. The NCAR Digital Asset Services Hub (DASH) is being developed and phased in this year to improve the quality of users' experiences in finding and using these assets. DASH serves to provide engagement, training, search, and support through the following four nodes (see figure). DASH MetadataDASH provides resources for creating and cataloging metadata to the NCAR Dialect, a subset of ISO 19115. NMDEdit, an editor based on a European open source application, has been configured for manual entry of NCAR metadata. CKAN, an open source data portal platform, harvests these XML records (along with records output directly from databases) from a Web Accessible Folder (WAF) on GitHub for validation. DASH SearchThe NCAR Dialect metadata drives cross-organization search and discovery through CKAN, which provides the display interface of search results. DASH search will establish interoperability by facilitating metadata sharing with other federated systems. DASH ConsultingThe DASH Data Curation & Stewardship Coordinator assists with Data Management (DM) Plan preparation and advises on Digital Object Identifiers. The coordinator arranges training sessions on the DASH metadata tools and DM planning, and provides one-on-one assistance as requested. 
DASH Repository: A repository is under development for NCAR datasets currently not in existing lab-managed archives. The DASH repository will be under NCAR governance and meet Trustworthy Repositories Audit & Certification (TRAC) requirements. This poster will highlight the processes, lessons learned, and current status of the DASH effort at NCAR.
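The harvest-and-validate step described above (CKAN pulling XML records from a WAF for validation) can be sketched as follows. This is an illustrative stand-in, not the DSET team's actual pipeline: the record names are invented, and the check here is only XML well-formedness, whereas CKAN's ISO 19115 validation is far richer.

```python
import xml.etree.ElementTree as ET

def validate_record(xml_text):
    """Return True if a metadata record is well-formed XML with a root element.
    A minimal stand-in for the portal's ISO 19115 validation step."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return len(root.tag) > 0

def harvest(records):
    """Partition a batch of fetched WAF records (name -> XML text)
    into accepted and rejected lists."""
    valid, rejected = [], []
    for name, xml_text in records.items():
        (valid if validate_record(xml_text) else rejected).append(name)
    return valid, rejected
```

In a real harvester the `records` dict would be populated by fetching each file listed in the WAF over HTTP before validation.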
DePriest, Adam D; Fiandalo, Michael V; Schlanger, Simon; Heemers, Frederike; Mohler, James L; Liu, Song; Heemers, Hannelore V
2016-01-01
Androgen receptor (AR) is a ligand-activated transcription factor that is the main target for treatment of non-organ-confined prostate cancer (CaP). Failure of life-prolonging AR-targeting androgen deprivation therapy is due to flexibility in steroidogenic pathways that control intracrine androgen levels and variability in the AR transcriptional output. Androgen biosynthesis enzymes, androgen transporters and AR-associated coregulators are attractive novel CaP treatment targets. These proteins, however, are characterized by multiple transcript variants and isoforms, are subject to genomic alterations, and are differentially expressed among CaPs. Determining their therapeutic potential requires evaluation of extensive, diverse datasets that are dispersed over multiple databases, websites and literature reports. Mining and integrating these datasets are cumbersome, time-consuming tasks and provide only snapshots of relevant information. To overcome this impediment to effective, efficient study of AR and potential drug targets, we developed the Regulators of Androgen Action Resource (RAAR), a non-redundant, curated and user-friendly searchable web interface. RAAR centralizes information on gene function, clinical relevance, and resources for 55 genes that encode proteins involved in biosynthesis, metabolism and transport of androgens and for 274 AR-associated coregulator genes. Data in RAAR are organized in two levels: (i) Information pertaining to production of androgens is contained in a 'pre-receptor level' database, and coregulator gene information is provided in a 'post-receptor level' database, and (ii) an 'other resources' database contains links to additional databases that are complementary to and useful to pursue further the information provided in RAAR. For each of its 329 entries, RAAR provides access to more than 20 well-curated publicly available databases, and thus, access to thousands of data points. 
Hyperlinks provide direct access to gene-specific entries in the respective database(s). RAAR is a novel, freely available resource that provides fast, reliable and easy access to integrated information that is needed to develop alternative CaP therapies. Database URL: http://www.lerner.ccf.org/cancerbio/heemers/RAAR/search/. © The Author(s) 2016. Published by Oxford University Press.
The Stream-Catchment (StreamCat) Dataset: A database of watershed metrics for the conterminous USA
We developed an extensive database of landscape metrics for ~2.65 million streams, and their associated catchments, within the conterminous USA: The Stream-Catchment (StreamCat) Dataset. These data are publically available and greatly reduce the specialized geospatial expertise n...
Code of Federal Regulations, 2013 CFR
2013-07-01
... DEVELOPMENT AREA Glossary of Terms § 910.51 Access. Access, when used in reference to parking or loading... 36 Parks, Forests, and Public Property 3 2013-07-01 2012-07-01 true Access. § 910.51 Section § 910.51 Parks, Forests, and Public Property PENNSYLVANIA AVENUE DEVELOPMENT CORPORATION GENERAL...
Beyond Food Access: The Impact of Parent-, Home-, and Neighborhood-Level Factors on Children’s Diets
Futrell Dunaway, Lauren; Carton, Thomas; Ma, Ping; Mundorf, Adrienne R.; Keel, Kelsey; Theall, Katherine P.
2017-01-01
Despite the growth in empirical research on neighborhood environmental characteristics and their influence on children’s diets, physical activity, and obesity, much remains to be learned, as few have examined the relationship between neighborhood food availability on dietary behavior in children, specifically. This analysis utilized data from a community-based, cross-sectional sample of children (n = 199) that was collected in New Orleans, Louisiana, in 2010. This dataset was linked to food environment data to assess the impact of neighborhood food access as well as household and parent factors on children’s diets. We observed a negligible impact of the neighborhood food environment on children’s diets, except with respect to fast food, with children who had access to fast food within 500 m around their home significantly less likely (OR = 0.35, 95% CI: 0.1, 0.8) to consume vegetables. Key parental and household factors did play a role in diet, including receipt of public assistance and cooking meals at home. Children receiving public assistance were 2.5 times (95% CI: 1.1, 5.4) more likely to consume fruit more than twice per day compared with children not receiving public assistance. Children whose family cooked dinner at home more than 5 times per week had significantly more consumption of fruit (64% vs. 58%) and vegetables (55% vs. 39%), but less soda (27% vs. 43%). Findings highlight the need for future research that focuses on the dynamic and complex relationships between built and social factors in the communities and homes of children that impact their diet in order to develop multilevel prevention approaches that address childhood obesity. PMID:28632162
Futrell Dunaway, Lauren; Carton, Thomas; Ma, Ping; Mundorf, Adrienne R; Keel, Kelsey; Theall, Katherine P
2017-06-20
Despite the growth in empirical research on neighborhood environmental characteristics and their influence on children's diets, physical activity, and obesity, much remains to be learned, as few have examined the relationship between neighborhood food availability on dietary behavior in children, specifically. This analysis utilized data from a community-based, cross-sectional sample of children ( n = 199) that was collected in New Orleans, Louisiana, in 2010. This dataset was linked to food environment data to assess the impact of neighborhood food access as well as household and parent factors on children's diets. We observed a negligible impact of the neighborhood food environment on children's diets, except with respect to fast food, with children who had access to fast food within 500 m around their home significantly less likely (OR = 0.35, 95% CI: 0.1, 0.8) to consume vegetables. Key parental and household factors did play a role in diet, including receipt of public assistance and cooking meals at home. Children receiving public assistance were 2.5 times (95% CI: 1.1, 5.4) more likely to consume fruit more than twice per day compared with children not receiving public assistance. Children whose family cooked dinner at home more than 5 times per week had significantly more consumption of fruit (64% vs. 58%) and vegetables (55% vs. 39%), but less soda (27% vs. 43%). Findings highlight the need for future research that focuses on the dynamic and complex relationships between built and social factors in the communities and homes of children that impact their diet in order to develop multilevel prevention approaches that address childhood obesity.
Asafu-Adjaye, John; Byrne, Dominic; Alvarez, Maximiliano
2017-02-01
The data presented in this article are related to the research article entitled 'Economic Growth, Fossil Fuel and Non-Fossil Consumption: A Pooled Mean Group Analysis using Proxies for Capital' (J. Asafu-Adjaye, D. Byrne, M. Alvarez, 2016) [1]. This article describes data modified from three publicly available data sources: the World Bank's World Development Indicators (http://databank.worldbank.org/data/reports.aspx?source=world-development-indicators), the U.S. Energy Information Administration's International Energy Statistics (http://www.eia.gov/cfapps/ipdbproject/IEDIndex3.cfm?tid=44&pid=44&aid=2) and the Barro-Lee Educational Attainment Dataset (http://www.barrolee.com). These data can be used to examine the relationships between economic growth and different forms of energy consumption. The dataset is made publicly available to promote further analyses.
Development of an Open Global Oil and Gas Infrastructure Inventory and Geodatabase
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rose, Kelly
This submission contains a technical report describing the development process and visual graphics for the Global Oil and Gas Infrastructure database. Access the GOGI database using the following link: https://edx.netl.doe.gov/dataset/global-oil-gas-features-database
Privacy-Preserving Data Exploration in Genome-Wide Association Studies.
Johnson, Aaron; Shmatikov, Vitaly
2013-08-01
Genome-wide association studies (GWAS) have become a popular method for analyzing sets of DNA sequences in order to discover the genetic basis of disease. Unfortunately, statistics published as the result of GWAS can be used to identify individuals participating in the study. To prevent privacy breaches, even previously published results have been removed from public databases, impeding researchers' access to the data and hindering collaborative research. Existing techniques for privacy-preserving GWAS focus on answering specific questions, such as correlations between a given pair of SNPs (DNA sequence variations). This does not fit the typical GWAS process, where the analyst may not know in advance which SNPs to consider and which statistical tests to use, how many SNPs are significant for a given dataset, etc. We present a set of practical, privacy-preserving data mining algorithms for GWAS datasets. Our framework supports exploratory data analysis, where the analyst does not know a priori how many and which SNPs to consider. We develop privacy-preserving algorithms for computing the number and location of SNPs that are significantly associated with the disease, the significance of any statistical test between a given SNP and the disease, any measure of correlation between SNPs, and the block structure of correlations. We evaluate our algorithms on real-world datasets and demonstrate that they produce significantly more accurate results than prior techniques while guaranteeing differential privacy.
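The differentially private release of GWAS statistics described above typically rests on the Laplace mechanism: a noisy answer whose noise scale is the query's sensitivity divided by the privacy budget epsilon. The sketch below shows that basic mechanism for a count query (e.g. the number of significant SNPs); it is a textbook illustration, not the paper's algorithms, which add further machinery for exploratory analysis.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def dp_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Release a count with epsilon-differential privacy: one individual
    changes a count by at most `sensitivity`, so Laplace noise with
    scale = sensitivity / epsilon masks any single participant."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

Smaller epsilon means stronger privacy but noisier answers; the paper's contribution is spending that budget well when the analyst does not know in advance which SNPs or tests to query.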
Moderate-resolution sea surface temperature data for the nearshore North Pacific
Payne, Meredith C.; Reusser, Deborah A.; Lee, Henry; Brown, Cheryl A.
2011-01-01
Coastal sea surface temperature (SST) is an important environmental characteristic in determining the suitability of habitat for nearshore marine and estuarine organisms. This publication describes and provides access to an easy-to-use coastal SST dataset for ecologists, biogeographers, oceanographers, and other scientists conducting research on nearshore marine habitats or processes. The data cover the Temperate Northern Pacific Ocean as defined by the 'Marine Ecosystems of the World' (MEOW) biogeographic schema developed by The Nature Conservancy. The spatial resolution of the SST data is 4-km grid cells within 20 km of the shore. The data span a 29-year period - from September 1981 to December 2009. These SST data were derived from Advanced Very High Resolution Radiometer (AVHRR) instrument measurements compiled into monthly means as part of the Pathfinder versions 5.0 and 5.1 (PFSST V50 and V51) Project. The processing methods used to transform the data from their native Hierarchical Data Format Scientific Data Set (HDF SDS) to georeferenced, spatial datasets capable of being read into geographic information systems (GIS) software are explained. In addition, links are provided to examples of scripts involved in the data processing steps. The scripts were written in the Python programming language, which is supported by ESRI's ArcGIS version 9 or later. The processed data files are also provided in text (.csv) and Access 2003 Database (.mdb) formats. All data except the raster files include attributes identifying realm, province, and ecoregion as defined by the MEOW classification schema.
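Since the processed data are distributed as .csv files with MEOW attributes, a typical use is aggregating monthly SSTs by ecoregion. The sketch below does this with the standard library; the column names (`ecoregion`, `month`, `sst_c`) and sample values are illustrative assumptions, as the published files define their own schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical schema and values; consult the dataset's documentation
# for the actual column names and units.
SAMPLE = """ecoregion,year,month,sst_c
Puget Trough/Georgia Basin,2009,7,13.2
Puget Trough/Georgia Basin,2009,8,14.0
Oregon Washington Vancouver Coast and Shelf,2009,7,11.5
"""

def monthly_climatology(csv_text):
    """Average SST per (ecoregion, month) across all years in the file."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["ecoregion"], int(row["month"]))
        sums[key] += float(row["sst_c"])
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}
```

The same grouping logic extends naturally to realms and provinces, since those attributes travel with each record.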
Armed Forces 1996 Equal Opportunity Survey: Administration, Datasets, and Codebook
1997-12-01
was taken in the preparation of analysis files. These files balance two needs: public access to data with sufficient information for accurate estimates...Native American/Alaskan Native, and Other. The duty location variable has two levels: US (a duty station in any of the 50 states or the District of...More specifically, the new DMDC procedures most closely follow CASRO’s Sample Type U design. As discussed by CASRO, the overall response rate has two
Use of court records for supplementing occupational disease surveillance.
Schwartz, E; Landrigan, P
1987-01-01
To conduct surveillance of occupationally related health events, the New Hampshire Division of Public Health Services analyzes death certificates and workers' compensation claims. In an effort to bolster these limited data sources, a previously unrecognized dataset composed of court records was explored. Court records obtained from the Federal District Court proved to be a readily accessible and detailed source of information for identifying suspected cases of asbestos-related disease and potential sources of asbestos exposure. PMID:2959164
McMahon, Alex D; Elliott, Lawrie; Macpherson, Lorna Md; Sharpe, Katharine H; Connelly, Graham; Milligan, Ian; Wilson, Philip; Clark, David; King, Albert; Wood, Rachael; Conway, David I
2018-01-01
There is limited evidence on the health needs and service access among children and young people who are looked after by the state. The aim of this study was to compare dental treatment needs and access to dental services (as an exemplar of wider health and well-being concerns) among children and young people who are looked after with the general child population. Population data linkage study utilising national datasets of social work referrals for 'looked after' placements, the Scottish census of children in local authority schools, and national health service's dental health and service datasets. 633 204 children in publicly funded schools in Scotland during the academic year 2011/2012, of whom 10 927 (1.7%) were known to be looked after during that or a previous year (from 2007-2008). The children in the looked after children (LAC) group were more likely to have urgent dental treatment need at 5 years of age: 23% vs 10% (n=209/16533), adjusted (for age, sex and area socioeconomic deprivation) OR 2.65 (95% CI 2.30 to 3.05); were less likely to attend a dentist regularly: 51% vs 63% (n=5519/388934), 0.55 (0.53 to 0.58) and more likely to have teeth extracted under general anaesthesia: 9% vs 5% (n=967/30253), 1.91 (1.78 to 2.04). LAC are more likely to have dental treatment needs and less likely to access dental services even when accounting for sociodemographic factors. Greater efforts are required to integrate child social and healthcare for LAC and to develop preventive care pathways on entering and throughout their time in the care system. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
NASA Astrophysics Data System (ADS)
Jiang, Y.
2015-12-01
Oceanographic resource discovery is a critical step for developing ocean science applications. With the increasing number of resources available online, many Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) have been developed to help manage and discover oceanographic resources. However, efficient and accurate resource discovery is still a big challenge because of the lack of data relevancy information. In this article, we propose a search engine framework for mining and utilizing dataset relevancy from oceanographic dataset metadata, usage metrics, and user feedback. The objective is to improve discovery accuracy of oceanographic data and reduce the time scientists need to discover, download, and reformat data for their projects. Experiments and a search example show that the proposed engine helps both scientists and general users search for more accurate results with enhanced performance and user experience through a user-friendly interface.
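The framework's core idea, blending metadata relevance with usage metrics and user feedback, can be sketched as a weighted scoring function. The weights, the term-overlap text score, and the download normalization below are invented stand-ins for illustration, not the article's actual ranking model.

```python
def relevance(query_terms, metadata_text, downloads, feedback_score,
              w_text=0.6, w_usage=0.3, w_feedback=0.1, max_downloads=10000):
    """Blend three signals into one ranking score in [0, 1]:
    term overlap with the metadata, normalized download count, and a
    user-feedback score in [0, 1]. Weights are illustrative assumptions."""
    terms = set(t.lower() for t in query_terms)
    words = set(metadata_text.lower().split())
    text_score = len(terms & words) / len(terms) if terms else 0.0
    usage_score = min(downloads / max_downloads, 1.0)
    return w_text * text_score + w_usage * usage_score + w_feedback * feedback_score
```

A dataset whose metadata matches the query and which other users have downloaded and rated well thus outranks a textually weaker, little-used one.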
Expanding understanding of optical variability in Lake Superior with a 4-year dataset
NASA Astrophysics Data System (ADS)
Mouw, Colleen B.; Ciochetto, Audrey B.; Grunert, Brice; Yu, Angela
2017-07-01
Lake Superior is one of the largest freshwater lakes on our planet, but few optical observations have been made to allow for the development and validation of visible spectral satellite remote sensing products. The dataset described here focuses on coincidently observing inherent and apparent optical properties along with biogeochemical parameters. Specifically, we observe remote sensing reflectance, absorption, scattering, backscattering, attenuation, chlorophyll concentration, and suspended particulate matter over the ice-free months of 2013-2016. The dataset substantially increases the optical knowledge of the lake. In addition to visible spectral satellite algorithm development, the dataset is valuable for characterizing the variable light field, particle, phytoplankton, and colored dissolved organic matter distributions, and helpful in food web and carbon cycle investigations. The compiled data can be freely accessed at https://seabass.gsfc.nasa.gov/archive/URI/Mouw/LakeSuperior/.
Eilbeck, Karen L; Lipstein, Julie; McGarvey, Sunanda; Staes, Catherine J
2014-01-01
The Reportable Condition Knowledge Management System (RCKMS) is envisioned to be a single, comprehensive, authoritative, real-time portal to author, view and access computable information about reportable conditions. The system is designed for use by hospitals, laboratories, health information exchanges, and providers to meet public health reporting requirements. The RCKMS Knowledge Representation Workgroup was tasked to explore the need for ontologies to support RCKMS functionality. The workgroup reviewed relevant projects and defined criteria to evaluate candidate knowledge domain areas for ontology development. The use of ontologies is justified for this project to unify the semantics used to describe similar reportable events and concepts between different jurisdictions and over time, to aid data integration, and to manage large, unwieldy datasets that evolve, and are sometimes externally managed.
Thermodynamic Data Rescue and Informatics for Deep Carbon Science
NASA Astrophysics Data System (ADS)
Zhong, H.; Ma, X.; Prabhu, A.; Eleish, A.; Pan, F.; Parsons, M. A.; Ghiorso, M. S.; West, P.; Zednik, S.; Erickson, J. S.; Chen, Y.; Wang, H.; Fox, P. A.
2017-12-01
A large number of legacy datasets are contained in geoscience literature published between 1930 and 1980 and not expressed external to the publication text in digitized formats. Extracting, organizing, and reusing these "dark" datasets is highly valuable for many within the Earth and planetary science community. As a part of the Deep Carbon Observatory (DCO) data legacy missions, the DCO Data Science Team and Extreme Physics and Chemistry community identified thermodynamic datasets related to carbon, or more specifically datasets about the enthalpy and entropy of chemicals, as a proof of principle analysis. The data science team endeavored to develop a semi-automatic workflow, which includes identifying relevant publications, extracting contained datasets using OCR methods, collaborative reviewing, and registering the datasets via the DCO Data Portal, where the 'Linked Data' feature of the data portal provides a mechanism for connecting rescued datasets beyond their individual data sources, to research domains, DCO Communities, and more, making data discovery and retrieval more effective. To date, the team has successfully rescued, deposited and registered additional datasets from publications with thermodynamic sources. These datasets contain 3 main types of data: (1) heat content or enthalpy data determined for a given compound as a function of temperature using high-temperature calorimetry, (2) heat content or enthalpy data determined for a given compound as a function of temperature using adiabatic calorimetry, and (3) direct determination of heat capacity of a compound as a function of temperature using differential scanning calorimetry. The data science team integrated these datasets and delivered a spectrum of data analytics, including visualizations, which will lead to a comprehensive characterization of the thermodynamics of carbon and carbon-related materials.
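The heat-content and heat-capacity tables described above are linked by integration: H(T2) - H(T1) = ∫ Cp dT. A common parameterization (not necessarily the one the DCO team used) is the Maier-Kelley form Cp(T) = a + bT - c/T², whose integral is analytic, as the sketch below shows with made-up coefficients.

```python
def enthalpy_increment(a, b, c, t1, t2):
    """H(T2) - H(T1) for the Maier-Kelley heat capacity
    Cp(T) = a + b*T - c/T**2.  The analytic integral is
    a*(T2-T1) + (b/2)*(T2**2 - T1**2) + c*(1/T2 - 1/T1)."""
    return a * (t2 - t1) + 0.5 * b * (t2**2 - t1**2) + c * (1.0 / t2 - 1.0 / t1)
```

Fitting (a, b, c) to rescued calorimetric tables thus lets enthalpy data from drop calorimetry be cross-checked against directly measured heat capacities.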
Cyberinfrastructure for Open Science at the Montreal Neurological Institute
Das, Samir; Glatard, Tristan; Rogers, Christine; Saigle, John; Paiva, Santiago; MacIntyre, Leigh; Safi-Harab, Mouna; Rousseau, Marc-Etienne; Stirling, Jordan; Khalili-Mahani, Najmeh; MacFarlane, David; Kostopoulos, Penelope; Rioux, Pierre; Madjar, Cecile; Lecours-Boucher, Xavier; Vanamala, Sandeep; Adalat, Reza; Mohaddes, Zia; Fonov, Vladimir S.; Milot, Sylvain; Leppert, Ilana; Degroot, Clotilde; Durcan, Thomas M.; Campbell, Tara; Moreau, Jeremy; Dagher, Alain; Collins, D. Louis; Karamchandani, Jason; Bar-Or, Amit; Fon, Edward A.; Hoge, Rick; Baillet, Sylvain; Rouleau, Guy; Evans, Alan C.
2017-01-01
Data sharing is becoming more of a requirement as technologies mature and as global research and communications diversify. As a result, researchers are looking for practical solutions, not only to enhance scientific collaborations, but also to acquire larger amounts of data, and to access specialized datasets. In many cases, the realities of data acquisition present a significant burden, therefore gaining access to public datasets allows for more robust analyses and broadly enriched data exploration. To answer this demand, the Montreal Neurological Institute has announced its commitment to Open Science, harnessing the power of making both clinical and research data available to the world (Owens, 2016a,b). As such, the LORIS and CBRAIN (Das et al., 2016) platforms have been tasked with the technical challenges specific to the institutional-level implementation of open data sharing, including:
- Comprehensive linking of multimodal data (phenotypic, clinical, neuroimaging, biobanking, and genomics, etc.).
- Secure database encryption, specifically designed for institutional and multi-project data sharing, ensuring subject confidentiality (using multi-tiered identifiers).
- Querying capabilities with multiple levels of single-study and institutional permissions, allowing public data sharing for all consented and de-identified subject data.
- Configurable pipelines and flags to facilitate acquisition and analysis, as well as access to High Performance Computing clusters for rapid data processing and sharing of software tools.
- Robust workflows and quality control mechanisms ensuring transparency and consistency in best practices.
- Long-term storage (and web access) of data, reducing loss of institutional data assets.
- Enhanced web-based visualization of imaging, genomic, and phenotypic data, allowing for real-time viewing and manipulation of data from anywhere in the world.
- Numerous modules for data filtering, summary statistics, and personalized and configurable dashboards.
Implementing the vision of Open Science at the Montreal Neurological Institute will be a concerted undertaking that seeks to facilitate data sharing for the global research community. Our goal is to utilize the years of experience in multi-site collaborative research infrastructure to implement the technical requirements to achieve this level of public data sharing in a practical yet robust manner, in support of accelerating scientific discovery. PMID:28111547
Cyberinfrastructure for Open Science at the Montreal Neurological Institute.
Das, Samir; Glatard, Tristan; Rogers, Christine; Saigle, John; Paiva, Santiago; MacIntyre, Leigh; Safi-Harab, Mouna; Rousseau, Marc-Etienne; Stirling, Jordan; Khalili-Mahani, Najmeh; MacFarlane, David; Kostopoulos, Penelope; Rioux, Pierre; Madjar, Cecile; Lecours-Boucher, Xavier; Vanamala, Sandeep; Adalat, Reza; Mohaddes, Zia; Fonov, Vladimir S; Milot, Sylvain; Leppert, Ilana; Degroot, Clotilde; Durcan, Thomas M; Campbell, Tara; Moreau, Jeremy; Dagher, Alain; Collins, D Louis; Karamchandani, Jason; Bar-Or, Amit; Fon, Edward A; Hoge, Rick; Baillet, Sylvain; Rouleau, Guy; Evans, Alan C
2016-01-01
Data sharing is becoming more of a requirement as technologies mature and as global research and communications diversify. As a result, researchers are looking for practical solutions, not only to enhance scientific collaborations, but also to acquire larger amounts of data, and to access specialized datasets. In many cases, the realities of data acquisition present a significant burden, therefore gaining access to public datasets allows for more robust analyses and broadly enriched data exploration. To answer this demand, the Montreal Neurological Institute has announced its commitment to Open Science, harnessing the power of making both clinical and research data available to the world (Owens, 2016a,b). As such, the LORIS and CBRAIN (Das et al., 2016) platforms have been tasked with the technical challenges specific to the institutional-level implementation of open data sharing, including:
- Comprehensive linking of multimodal data (phenotypic, clinical, neuroimaging, biobanking, and genomics, etc.).
- Secure database encryption, specifically designed for institutional and multi-project data sharing, ensuring subject confidentiality (using multi-tiered identifiers).
- Querying capabilities with multiple levels of single-study and institutional permissions, allowing public data sharing for all consented and de-identified subject data.
- Configurable pipelines and flags to facilitate acquisition and analysis, as well as access to High Performance Computing clusters for rapid data processing and sharing of software tools.
- Robust workflows and quality control mechanisms ensuring transparency and consistency in best practices.
- Long-term storage (and web access) of data, reducing loss of institutional data assets.
- Enhanced web-based visualization of imaging, genomic, and phenotypic data, allowing for real-time viewing and manipulation of data from anywhere in the world.
- Numerous modules for data filtering, summary statistics, and personalized and configurable dashboards.
Implementing the vision of Open Science at the Montreal Neurological Institute will be a concerted undertaking that seeks to facilitate data sharing for the global research community. Our goal is to utilize the years of experience in multi-site collaborative research infrastructure to implement the technical requirements to achieve this level of public data sharing in a practical yet robust manner, in support of accelerating scientific discovery.
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Xing, Z.
2007-12-01
The General Earth Science Investigation Suite (GENESIS) project is a NASA-sponsored partnership between the Jet Propulsion Laboratory, academia, and NASA data centers to develop a new suite of Web Services tools to facilitate multi-sensor investigations in Earth System Science. The goal of GENESIS is to enable large-scale, multi-instrument atmospheric science using combined datasets from the AIRS, MODIS, MISR, and GPS sensors. Investigations include cross-comparison of spaceborne climate sensors, cloud spectral analysis, study of upper troposphere-stratosphere water transport, study of the aerosol indirect cloud effect, and global climate model validation. The challenges are to bring together very large datasets, reformat and understand the individual instrument retrievals, co-register or re-grid the retrieved physical parameters, perform computationally-intensive data fusion and data mining operations, and accumulate complex statistics over months to years of data. To meet these challenges, we have developed a Grid computing and dataflow framework, named SciFlo, in which we are deploying a set of versatile and reusable operators for data access, subsetting, registration, mining, fusion, compression, and advanced statistical analysis. SciFlo leverages remote Web Services, called via Simple Object Access Protocol (SOAP) or REST (one-line) URLs, and the Grid Computing standards (WS-* & Globus Alliance toolkits), and enables scientists to do multi-instrument Earth Science by assembling reusable Web Services and native executables into a distributed computing flow (tree of operators). The SciFlo client & server engines optimize the execution of such distributed data flows and allow the user to transparently find and use datasets and operators without worrying about the actual location of the Grid resources.
In particular, SciFlo exploits the wealth of datasets accessible by OpenGIS Consortium (OGC) Web Mapping Servers & Web Coverage Servers (WMS/WCS), and by Open Data Access Protocol (OpenDAP) servers. SciFlo also publishes its own SOAP services for space/time query and subsetting of Earth Science datasets, and automated access to large datasets via lists of (FTP, HTTP, or DAP) URLs which point to on-line HDF or netCDF files. Typical distributed workflows obtain datasets by calling standard WMS/WCS servers or discovering and fetching data granules from ftp sites; invoke remote analysis operators available as SOAP services (interface described by a WSDL document); and merge results into binary containers (netCDF or HDF files) for further analysis using local executable operators. Naming conventions (HDFEOS and CF-1.0 for netCDF) are exploited to automatically understand and read on-line datasets. More interoperable conventions, and broader adoption of existing conventions, are vital if we are to "scale up" automated choreography of Web Services beyond toy applications. Recently, the ESIP Federation sponsored a collaborative activity in which several ESIP members developed some collaborative science scenarios for atmospheric and aerosol science, and then choreographed services from multiple groups into demonstration workflows using the SciFlo engine and a Business Process Execution Language (BPEL) workflow engine. We will discuss the lessons learned from this activity, the need for standardized interfaces (like WMS/WCS), the difficulty in agreeing on even simple XML formats and interfaces, the benefits of doing collaborative science analysis at the "touch of a button" once services are connected, and further collaborations that are being pursued.
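The "tree of operators" at the heart of SciFlo can be sketched as a nested flow evaluated bottom-up, with each interior node naming an operator and each leaf supplying a literal input. The operator names and data below are invented stand-ins, not actual SciFlo services.

```python
def run_flow(node, operators):
    """Evaluate a dataflow expressed as nested tuples:
    (op_name, child, child, ...).  Non-tuple leaves pass through
    unchanged as literal inputs; children are evaluated first."""
    if not isinstance(node, tuple):
        return node
    op, *children = node
    args = [run_flow(child, operators) for child in children]
    return operators[op](*args)

# Invented stand-in operators for subsetting and merging granule values.
OPERATORS = {
    "subset": lambda data, lo, hi: [x for x in data if lo <= x <= hi],
    "merge": lambda a, b: sorted(a + b),
}
```

In the real system each operator would be a remote SOAP/REST call or a native executable, and the engine would schedule independent subtrees in parallel rather than recursing serially.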
PRGdb: a bioinformatics platform for plant resistance gene analysis
Sanseverino, Walter; Roma, Guglielmo; De Simone, Marco; Faino, Luigi; Melito, Sara; Stupka, Elia; Frusciante, Luigi; Ercolano, Maria Raffaella
2010-01-01
PRGdb is a web accessible open-source (http://www.prgdb.org) database that represents the first bioinformatic resource providing a comprehensive overview of resistance genes (R-genes) in plants. PRGdb holds more than 16 000 known and putative R-genes belonging to 192 plant species challenged by 115 different pathogens and linked with useful biological information. The complete database includes a set of 73 manually curated reference R-genes, 6 308 putative R-genes collected from NCBI and 10 463 computationally predicted putative R-genes. Thanks to a user-friendly interface, data can be examined using different query tools. A home-made prediction pipeline called Disease Resistance Analysis and Gene Orthology (DRAGO), based on reference R-gene sequence data, was developed to search for plant resistance genes in public datasets such as Unigene and Genbank. New putative R-gene classes containing unknown domain combinations were discovered and characterized. The development of the PRG platform represents an important starting point to conduct various experimental tasks. The inferred cross-link between genomic and phenotypic information allows access to a large body of information to find answers to several biological questions. The database structure also permits easy integration with other data types and opens up prospects for future implementations. PMID:19906694
Measuring water affordability in developed economies. The added value of a needs-based approach.
Vanhille, Josefine; Goedemé, Tim; Penne, Tess; Van Thielen, Leen; Storms, Bérénice
2018-07-01
In developed countries, water affordability problems remain on the agenda as the increasing financial costs of water services can impede the realisation of equal access to water. More than ever, public authorities that define water tariffs face the challenge of reconciling environmental and cost-recovery objectives with equity and financial accessibility for all. Indicators of water affordability can be helpful in this regard. Conventional affordability indicators often rely on the actual amount that households spend on water use. In contrast, we propose a needs-based indicator that measures the risk of being unable to afford the amount of water necessary to fulfill essential needs, i.e. needs that should be fulfilled for adequate participation in society. In this paper we set forth the methodological choices inherent in constructing a needs-based affordability indicator. Using a micro-dataset on households in Flanders (Belgium), we compare its results with the outcomes of a more common actual-expenses indicator. The paper illustrates how the constructed needs-based indicator can complement existing affordability indicators, and its capacity to reveal important risk groups. Copyright © 2018 Elsevier Ltd. All rights reserved.
Development of South Australian-Victorian Prostate Cancer Health Outcomes Research Dataset.
Ruseckaite, Rasa; Beckmann, Kerri; O'Callaghan, Michael; Roder, David; Moretti, Kim; Zalcberg, John; Millar, Jeremy; Evans, Sue
2016-01-22
Prostate cancer is the most commonly diagnosed and prevalent malignancy reported to Australian cancer registries, with numerous studies from single institutions summarizing patient outcomes at individual hospitals or States. In order to provide an overview of patterns of care of men with prostate cancer across multiple institutions in Australia, a specialized dataset was developed. This dataset, containing amalgamated data from South Australian and Victorian prostate cancer registries, is called the South Australian-Victorian Prostate Cancer Health Outcomes Research Dataset (SA-VIC PCHORD). A total of 13,598 de-identified records of men with prostate cancer diagnosed and consented between 2008 and 2013 in South Australia and Victoria were merged into the SA-VIC PCHORD. SA-VIC PCHORD contains detailed information about socio-demographic, diagnostic and treatment characteristics of patients with prostate cancer in South Australia and Victoria. Data from individual registries are available to researchers and can be accessed under individual data access policies in each State. The SA-VIC PCHORD will be used for numerous studies summarizing trends in diagnostic characteristics, survival and patterns of care in men with prostate cancer in Victoria and South Australia. It is expected that in the future the SA-VIC PCHORD will become a principal component of the recently developed bi-national Australian and New Zealand Prostate Cancer Outcomes Registry to collect and report patterns of care and standardised patient reported outcome measures of men nation-wide in Australia and New Zealand.
Mansouri, K; Grulke, C M; Richard, A M; Judson, R S; Williams, A J
2016-11-01
The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
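Identifier validation of the kind described in this curation workflow can catch many transcription errors mechanically. As one illustrative example (a standard check, not code drawn from the workflow itself), CAS Registry Numbers carry a built-in check digit that a curation script can verify:

```python
def casrn_is_valid(casrn: str) -> bool:
    """Validate a CAS Registry Number (e.g. '7732-18-5') via its checksum.

    The check digit equals the weighted sum of the remaining digits,
    taken right-to-left with weights 1, 2, 3, ..., modulo 10.
    """
    parts = casrn.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    digits = parts[0] + parts[1]      # all digits except the check digit
    check = int(parts[2])
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == check

# Water is 7732-18-5; a digit transposition breaks the checksum.
print(casrn_is_valid("7732-18-5"))   # True
print(casrn_is_valid("7732-81-5"))   # False
```

Checks like this flag mistyped or transposed identifiers before any structure-identity pairing is attempted.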
HDDTOOLS: an R package serving Hydrological Data Discovery Tools
NASA Astrophysics Data System (ADS)
Vitolo, C.; Buytaert, W.
2014-12-01
Many governmental bodies and institutions are currently committed to publishing open data as the result of a trend of increasing transparency, based on which a wide variety of information produced at public expense is now becoming open and freely available to improve public involvement in the process of decision and policy making. Discovery, access and retrieval of information is, however, not always a simple task. Especially when programmatic access to data resources is not allowed, downloading metadata catalogues, selecting the information needed, requesting datasets, and then decompressing, converting, manually filtering and parsing them can become rather tedious. The R package "hddtools" is an open source project, designed to make all the above operations more efficient by means of re-usable functions. The package facilitates non-programmatic access to various online data sources such as the Global Runoff Data Centre, NASA's TRMM mission, and the Data60UK database, amongst others. This package complements R's growing functionality in environmental web technologies to bridge the gap between data providers and data consumers, and it is designed to be the starting building block of scientific workflows for linking data and models in a seamless fashion.
A Self-Directed Method for Cell-Type Identification and Separation of Gene Expression Microarrays
Zuckerman, Neta S.; Noam, Yair; Goldsmith, Andrea J.; Lee, Peter P.
2013-01-01
Gene expression analysis is generally performed on heterogeneous tissue samples consisting of multiple cell types. Current methods developed to separate heterogeneous gene expression rely on prior knowledge of the cell-type composition and/or signatures - these are not available in most public datasets. We present a novel method to identify the cell-type composition, signatures and proportions per sample without need for a priori information. The method was successfully tested on controlled and semi-controlled datasets and performed as accurately as current methods that do require additional information. As such, this method enables the analysis of cell-type specific gene expression using existing large pools of publicly available microarray datasets. PMID:23990767
Robinson, Nathaniel; Allred, Brady; Jones, Matthew; ...
2017-08-21
Satellite derived vegetation indices (VIs) are broadly used in ecological research, ecosystem modeling, and land surface monitoring. The Normalized Difference Vegetation Index (NDVI), perhaps the most utilized VI, has countless applications across ecology, forestry, agriculture, wildlife, biodiversity, and other disciplines. Calculating satellite derived NDVI is not always straight-forward, however, as satellite remote sensing datasets are inherently noisy due to cloud and atmospheric contamination, data processing failures, and instrument malfunction. Readily available NDVI products that account for these complexities are generally at coarse resolution; high resolution NDVI datasets are not conveniently accessible and developing them often presents numerous technical and methodological challenges. Here, we address this deficiency by producing a Landsat derived, high resolution (30 m), long-term (30+ years) NDVI dataset for the conterminous United States. We use Google Earth Engine, a planetary-scale cloud-based geospatial analysis platform, for processing the Landsat data and distributing the final dataset. We use a climatology driven approach to fill missing data and validate the dataset with established remote sensing products at multiple scales. We provide access to the composites through a simple web application, allowing users to customize key parameters appropriate for their application, question, and region of interest.
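For readers unfamiliar with the index, NDVI is computed per pixel from the red and near-infrared reflectance bands; a minimal sketch follows (the reflectance values are illustrative, not taken from the dataset):

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Ranges from -1 to 1; dense vegetation typically yields high positive
    values, while water and bare soil yield values near or below zero.
    """
    denom = nir + red
    if denom == 0:
        return 0.0  # guard against division by zero on no-signal pixels
    return (nir - red) / denom

# Illustrative surface reflectances:
print(round(ndvi(0.5, 0.1), 3))  # 0.667 -- vegetated pixel
print(round(ndvi(0.1, 0.2), 3))  # -0.333 -- water/bare pixel
```

The complexity the abstract refers to lies not in this formula but in cleaning the inputs: masking clouds, filling gaps, and compositing over time.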
NASA Astrophysics Data System (ADS)
Ward, Dennis W.; Bennett, Kelly W.
2017-05-01
The Sensor Information Testbed COllaborative Research Environment (SITCORE) and the Automated Online Data Repository (AODR) are significant enablers of the U.S. Army Research Laboratory (ARL)'s Open Campus Initiative and together create a highly collaborative research laboratory and testbed environment focused on sensor data and information fusion. SITCORE creates a virtual research development environment allowing collaboration from other locations, including DoD, industry, academia, and coalition facilities. SITCORE combined with AODR provides end-to-end algorithm development, experimentation, demonstration, and validation. The AODR enterprise allows the U.S. Army Research Laboratory (ARL), as well as other government organizations, industry, and academia to store and disseminate multiple intelligence (Multi-INT) datasets collected at field exercises and demonstrations, and to facilitate research and development (R&D) and the advancement of analytical tools and algorithms supporting the Intelligence, Surveillance, and Reconnaissance (ISR) community. The AODR provides a potential central repository for standards-compliant datasets to serve as the "go-to" location for lessons learned and reference products. Many of the AODR datasets have associated ground truth and other metadata, which provides a rich and robust data suite for researchers to develop, test, and refine their algorithms. Researchers download the test data to their own environments using a sophisticated web interface. The AODR allows researchers to request copies of stored datasets and for the government to process the requests and approvals in an automated fashion. Access to the AODR requires two-factor authentication in the form of a Common Access Card (CAC) or External Certificate Authority (ECA)
The Wired Island: The First Two Years of Public Access To Cable Television In Manhattan.
ERIC Educational Resources Information Center
Othmer, David
A review is presented of the first two years of free public access programming on New York City's cable television (CATV) systems. The report provides some background information on franchising, public access to CATV in New York City, and Federal Communications Commission regulations. It also deals with the public access programming developed; it…
Remote visualization and scale analysis of large turbulence datasets
NASA Astrophysics Data System (ADS)
Livescu, D.; Pulido, J.; Burns, R.; Canada, C.; Ahrens, J.; Hamann, B.
2015-12-01
Accurate simulations of turbulent flows require solving all the dynamically relevant scales of motions. This technique, called Direct Numerical Simulation, has been successfully applied to a variety of simple flows; however, the large-scale flows encountered in Geophysical Fluid Dynamics (GFD) would require meshes outside the range of the most powerful supercomputers for the foreseeable future. Nevertheless, the current generation of petascale computers has enabled unprecedented simulations of many types of turbulent flows which focus on various GFD aspects, from the idealized configurations extensively studied in the past to more complex flows closer to the practical applications. The pace at which such simulations are performed only continues to increase; however, the simulations themselves are restricted to a small number of groups with access to large computational platforms. Yet the petabytes of turbulence data offer almost limitless information on many different aspects of the flow, from the hierarchy of turbulence moments, spectra and correlations, to structure-functions, geometrical properties, etc. The ability to share such datasets with other groups can significantly reduce the time to analyze the data, help the creative process and increase the pace of discovery. Using the largest DOE supercomputing platforms, we have performed some of the biggest turbulence simulations to date, in various configurations, addressing specific aspects of turbulence production and mixing mechanisms. Until recently, the visualization and analysis of such datasets was restricted by access to large supercomputers. The public Johns Hopkins Turbulence database simplifies the access to multi-Terabyte turbulence datasets and facilitates turbulence analysis through the use of commodity hardware. 
First, one of our datasets, which is part of the database, will be described and then a framework that adds high-speed visualization and wavelet support for multi-resolution analysis of turbulence will be highlighted. The addition of wavelet support reduces the latency and bandwidth requirements for visualization, allowing for many concurrent users, and enables new types of analyses, including scale decomposition and coherent feature extraction.
A database of marine phytoplankton abundance, biomass and species composition in Australian waters
NASA Astrophysics Data System (ADS)
Davies, Claire H.; Coughlan, Alex; Hallegraeff, Gustaaf; Ajani, Penelope; Armbrecht, Linda; Atkins, Natalia; Bonham, Prudence; Brett, Steve; Brinkman, Richard; Burford, Michele; Clementson, Lesley; Coad, Peter; Coman, Frank; Davies, Diana; Dela-Cruz, Jocelyn; Devlin, Michelle; Edgar, Steven; Eriksen, Ruth; Furnas, Miles; Hassler, Christel; Hill, David; Holmes, Michael; Ingleton, Tim; Jameson, Ian; Leterme, Sophie C.; Lønborg, Christian; McLaughlin, James; McEnnulty, Felicity; McKinnon, A. David; Miller, Margaret; Murray, Shauna; Nayar, Sasi; Patten, Renee; Pritchard, Tim; Proctor, Roger; Purcell-Meyerink, Diane; Raes, Eric; Rissik, David; Ruszczyk, Jason; Slotwinski, Anita; Swadling, Kerrie M.; Tattersall, Katherine; Thompson, Peter; Thomson, Paul; Tonks, Mark; Trull, Thomas W.; Uribe-Palomino, Julian; Waite, Anya M.; Yauwenas, Rouna; Zammit, Anthony; Richardson, Anthony J.
2016-06-01
There have been many individual phytoplankton datasets collected across Australia since the mid 1900s, but most are unavailable to the research community. We have searched archives, contacted researchers, and scanned the primary and grey literature to collate 3,621,847 records of marine phytoplankton species from Australian waters from 1844 to the present. Many of these are small datasets collected for local questions, but combined they provide over 170 years of data on phytoplankton communities in Australian waters. Units and taxonomy have been standardised, obviously erroneous data removed, and all metadata included. We have lodged this dataset with the Australian Ocean Data Network (http://portal.aodn.org.au/) allowing public access. The Australian Phytoplankton Database will be invaluable for global change studies, as it allows analysis of ecological indicators of climate change and eutrophication (e.g., changes in distribution; diatom:dinoflagellate ratios). In addition, the standardised conversion of abundance records to biomass provides modellers with quantifiable data to initialise and validate ecosystem models of lower marine trophic levels.
Otegui, Javier; Ariño, Arturo H
2012-08-15
In any data quality workflow, data publishers must become aware of issues in their data so these can be corrected. User feedback mechanisms provide one avenue, while global assessments of datasets provide another. To date, there is no publicly available tool that allows both biodiversity data institutions sharing their data through the Global Biodiversity Information Facility network and its potential users to assess datasets as a whole. To help bridge this gap for both publishers and users, we introduce the BIoDiversity DataSets Assessment Tool (BIDDSAT), an online tool that enables selected diagnostic visualizations of the content of data publishers and/or their individual collections. The online application is accessible at http://www.unav.es/unzyec/mzna/biddsat/ and is supported by all major browsers. The source code is licensed under the GNU GPLv3 license (http://www.gnu.org/licenses/gpl-3.0.txt) and is available at https://github.com/jotegui/BIDDSAT.
Ontology for Transforming Geo-Spatial Data for Discovery and Integration of Scientific Data
NASA Astrophysics Data System (ADS)
Nguyen, L.; Chee, T.; Minnis, P.
2013-12-01
Discovery of and access to geo-spatial scientific data across heterogeneous repositories and multi-discipline datasets can present challenges for scientists. We propose to build a workflow for transforming geo-spatial datasets into a semantic environment by using relationships to describe each resource with the OWL Web Ontology Language, RDF, and a proposed geo-spatial vocabulary. We will present methods for transforming traditional scientific datasets, the use of a semantic repository, and querying using SPARQL to integrate and access datasets. This unique repository will enable discovery of scientific data by geospatial bounds or other criteria.
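The query pattern behind SPARQL — matching RDF-style triples against partially bound patterns — can be illustrated without a triple store. The dataset names and properties below are invented for the sketch and are not the authors' vocabulary:

```python
# Minimal triple-matching sketch: a stand-in for a SPARQL basic graph
# pattern over an RDF repository (subjects/predicates are illustrative).
triples = [
    ("dsA", "region", "Arctic"),
    ("dsA", "variable", "cloud_fraction"),
    ("dsB", "region", "Tropics"),
    ("dsB", "variable", "aerosol_depth"),
]

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    like an unbound ?variable in a SPARQL query."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Analogous to: SELECT ?ds WHERE { ?ds :region "Arctic" }
arctic = [s for s, _, _ in match(p="region", o="Arctic")]
print(arctic)  # ['dsA']
```

A real deployment would replace the list with a semantic repository and the `match` call with a SPARQL `SELECT`, but the discovery logic is the same.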
BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data.
Ausmees, Kristiina; John, Aji; Toor, Salman Z; Hellander, Andreas; Nettelblad, Carl
2018-06-26
The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. In order to make re-analysis of such data feasible without the need to have access to a large-scale computing facility, we have developed a highly scalable, storage-agnostic framework, an associated API and an easy-to-use web user interface to execute custom filters on large genomic datasets. We present BAMSI, a Software-as-a-Service (SaaS) solution for filtering of the 1000 Genomes phase 3 set of aligned reads, with the possibility of extension and customization to other sets of files. Unique to our solution is the capability of simultaneously utilizing many different mirrors of the data to increase the speed of the analysis. In particular, if the data is available in private or public clouds - an increasingly common scenario for both academic and commercial cloud providers - our framework allows for seamless deployment of filtering workers close to data. We show results indicating that such a setup improves the horizontal scalability of the system, and present a possible use case of the framework by performing an analysis of structural variation in the 1000 Genomes data set. BAMSI constitutes a framework for efficient filtering of large genomic data sets that is flexible in the use of compute as well as storage resources. The data resulting from the filter is assumed to be greatly reduced in size, and can easily be downloaded or routed into e.g. a Hadoop cluster for subsequent interactive analysis using Hive, Spark or similar tools.
In this respect, our framework also suggests a general model for making very large datasets of high scientific value more accessible by offering the possibility for organizations to share the cost of hosting data on hot storage, without compromising the scalability of downstream analysis.
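The mirror-parallel filtering idea can be sketched in a few lines. The mirror names, chunk identifiers, and filter stub below are hypothetical illustrations, not BAMSI's actual API:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

# Hypothetical mirrors hosting identical copies of a dataset's chunks.
MIRRORS = ["mirror-eu", "mirror-us", "mirror-asia"]
CHUNKS = [f"chunk{i:03d}" for i in range(9)]

def fetch_and_filter(mirror: str, chunk: str) -> str:
    # Stand-in for: download `chunk` from `mirror`, apply the user's
    # filter, and return the (much smaller) filtered result.
    return f"{chunk}@{mirror}"

# Round-robin chunks across mirrors so no single source is a bottleneck;
# workers run concurrently, one per mirror.
assignments = list(zip(cycle(MIRRORS), CHUNKS))
with ThreadPoolExecutor(max_workers=len(MIRRORS)) as pool:
    results = list(pool.map(lambda a: fetch_and_filter(*a), assignments))

print(len(results))  # 9
print(results[0])    # chunk000@mirror-eu
```

The essential point the abstract makes is that aggregate download bandwidth scales with the number of mirrors, since each worker pulls from a different copy of the data.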
MGIS: managing banana (Musa spp.) genetic resources information and high-throughput genotyping data
Guignon, V.; Sempere, G.; Sardos, J.; Hueber, Y.; Duvergey, H.; Andrieu, A.; Chase, R.; Jenny, C.; Hazekamp, T.; Irish, B.; Jelali, K.; Adeka, J.; Ayala-Silva, T.; Chao, C.P.; Daniells, J.; Dowiya, B.; Effa effa, B.; Gueco, L.; Herradura, L.; Ibobondji, L.; Kempenaers, E.; Kilangi, J.; Muhangi, S.; Ngo Xuan, P.; Paofa, J.; Pavis, C.; Thiemele, D.; Tossou, C.; Sandoval, J.; Sutanto, A.; Vangu Paka, G.; Yi, G.; Van den houwe, I.; Roux, N.
2017-01-01
Unraveling the genetic diversity held in genebanks on a large scale is underway, due to advances in next-generation sequencing (NGS)-based technologies that produce high-density genetic markers for a large number of samples at low cost. Genebank users should be in a position to identify and select germplasm from the global genepool based on a combination of passport, genotypic and phenotypic data. To facilitate this, a new generation of information systems is being designed to efficiently handle data and link it with other external resources such as genome or breeding databases. The Musa Germplasm Information System (MGIS), the database for global ex situ-held banana genetic resources, has been developed to address those needs in a user-friendly way. In developing MGIS, we selected a generic database schema (Chado), the robust content management system Drupal for the user interface, and Tripal, a set of Drupal modules which links the Chado schema to Drupal. MGIS allows germplasm collection examination, accession browsing, advanced search functions, and germplasm orders. Additionally, we developed unique graphical interfaces to compare accessions and to explore them based on their taxonomic information. Accession-based data has been enriched with publications, genotyping studies and associated genotyping datasets reporting on germplasm use. Finally, an interoperability layer has been implemented to facilitate the link with complementary databases like the Banana Genome Hub and the MusaBase breeding database. Database URL: https://www.crop-diversity.org/mgis/ PMID:29220435
Maximizing Accessibility to Spatially Referenced Digital Data.
ERIC Educational Resources Information Center
Hunt, Li; Joselyn, Mark
1995-01-01
Discusses some widely available spatially referenced datasets, including raster and vector datasets. Strategies for improving accessibility include: acquisition of data in a software-dependent format; reorganization of data into logical geographic units; acquisition of intelligent retrieval software; improving computer hardware; and intelligent…
Development of Disruptive Open Access Journals
ERIC Educational Resources Information Center
Anderson, Terry; McConkey, Brigette
2009-01-01
Open access (OA) publication has emerged, with disruptive effects, as a major outlet for scholarly publication. OA publication is usually associated with on-line distribution and provides access to scholarly publications to anyone, anywhere--regardless of their ability to pay subscription fees or their association with an educational institution.…
A Secure Architecture to Provide a Medical Emergency Dataset for Patients in Germany and Abroad.
Storck, Michael; Wohlmann, Jan; Krudwig, Sarah; Vogel, Alexander; Born, Judith; Weber, Thomas; Dugas, Martin; Juhra, Christian
2017-01-01
The ongoing fragmentation of medical care and mobility of patients severely restrains exchange of lifesaving information about patient's medical history in case of emergencies. Therefore, the objective of this work is to offer a secure technical solution to supply medical professionals with emergency-relevant information concerning the current patient via mobile accessibility. To achieve this goal, the official national emergency data set was extended by additional features to form a patient summary for emergencies, a software architecture was developed and data security and data protection issues were taken into account. The patient has sovereignty over his/her data and can therefore decide who has access to or can change his/her stored data, but the treating physician composes the validated dataset. Building upon the introduced concept, future activities are the development of user-interfaces for the software components of the different user groups as well as functioning prototypes for upcoming field tests.
NASA Astrophysics Data System (ADS)
Eberle, Jonas; Urban, Marcel; Hüttich, Christian; Schmullius, Christiane
2014-05-01
Numerous datasets providing temperature information from meteorological stations or remote sensing satellites are available. However, the challenging issue is to search the archives and process the time series information for further analysis. These steps can be automated for each individual product, provided the preconditions are met, e.g. data access through web services (HTTP, FTP) or legal rights to redistribute the datasets. Therefore, a Python-based package was developed to provide data access and data processing tools for MODIS Land Surface Temperature (LST) data, which is provided by the NASA Land Processes Distributed Active Archive Center (LP DAAC), as well as the Global Surface Summary of the Day (GSOD) and the Global Historical Climatology Network (GHCN) daily datasets provided by the NOAA National Climatic Data Center (NCDC). The package to access and process the information is available as web services used by an interactive web portal for simple data access and analysis. Tools for time series analysis were linked to the system, e.g. time series plotting, decomposition, aggregation (monthly, seasonal, etc.), trend analyses, and breakpoint detection. Especially for temperature data, a plot was integrated for the comparison of two temperature datasets based on the work by Urban et al. (2013). As a first result, a kernel density plot compares daily MODIS LST from the satellites Aqua and Terra with daily means from GSOD and GHCN datasets. Without any data download and data processing, users can analyze different time series datasets in an easy-to-use web portal. As a first use case, we built up this complementary system with remotely sensed MODIS data and in situ measurements from meteorological stations for Siberia within the Siberian Earth System Science Cluster (www.sibessc.uni-jena.de). References: Urban, Marcel; Eberle, Jonas; Hüttich, Christian; Schmullius, Christiane; Herold, Martin. 2013.
"Comparison of Satellite-Derived Land Surface Temperature and Air Temperature from Meteorological Stations on the Pan-Arctic Scale." Remote Sens. 5, no. 5: 2348-2367. Further materials: Eberle, Jonas; Clausnitzer, Siegfried; Hüttich, Christian; Schmullius, Christiane. 2013. "Multi-Source Data Processing Middleware for Land Monitoring within a Web-Based Spatial Data Infrastructure for Siberia." ISPRS Int. J. Geo-Inf. 2, no. 3: 553-576.
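Aggregation of the kind the package above offers (e.g., daily series to monthly means) amounts to grouping observations by calendar month; a minimal standard-library sketch with illustrative values:

```python
from collections import defaultdict
from datetime import date

# Toy daily land-surface-temperature series (values are illustrative).
daily = [
    (date(2013, 1, 1), -21.0), (date(2013, 1, 2), -19.0),
    (date(2013, 2, 1), -15.0), (date(2013, 2, 2), -13.0),
]

def monthly_means(series):
    """Aggregate (date, value) pairs into per-(year, month) means."""
    buckets = defaultdict(list)
    for d, v in series:
        buckets[(d.year, d.month)].append(v)
    return {ym: sum(vs) / len(vs) for ym, vs in sorted(buckets.items())}

print(monthly_means(daily))
# {(2013, 1): -20.0, (2013, 2): -14.0}
```

Seasonal aggregation works the same way with a different grouping key (e.g., mapping months to DJF/MAM/JJA/SON).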
MemAxes Visualization Software
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hardware advancements such as Intel's PEBS and AMD's IBS, as well as software developments such as the perf_event API in Linux, have made it possible to acquire memory access samples together with performance information. MemAxes is a visualization and analysis tool for memory access sample data. By mapping the samples to their associated code, variables, node topology, and application dataset, MemAxes provides intuitive views of the data.
Providing Access to a Diverse Set of Global Reanalysis Dataset Collections
NASA Astrophysics Data System (ADS)
Schuster, D.; Worley, S. J.
2015-12-01
The National Center for Atmospheric Research (NCAR) Research Data Archive (RDA, http://rda.ucar.edu) provides open access to a variety of global reanalysis dataset collections to support atmospheric and related sciences research worldwide. These include products from the European Centre for Medium-Range Weather Forecasts (ECMWF), Japan Meteorological Agency (JMA), National Centers for Environmental Prediction (NCEP), National Oceanic and Atmospheric Administration (NOAA), and NCAR. All RDA-hosted reanalysis collections are freely accessible to registered users through a variety of methods. Standard access methods include traditional browser and scripted HTTP file download. Enhanced downloads are available through the Globus GridFTP "fire and forget" data transfer service, which provides an efficient, reliable, and preferred alternative to traditional HTTP-based methods. For those that favor interoperable access using compatible tools, the Unidata THREDDS Data Server provides remote access to complete reanalysis collections via virtual dataset aggregation "files". Finally, users can request data subsets and format conversions to be prepared for them through web interface form requests or web service API batch requests. This approach uses NCAR HPC and central file systems to efficiently prepare products from the high-resolution and very large reanalysis archives. The presentation will include a detailed inventory of all RDA reanalysis dataset collection holdings, and highlight access capabilities to these collections through use case examples.
Smith, Tanya; Page-Nicholson, Samantha; Gibbons, Bradley; Jones, M. Genevieve W.; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin
2016-01-01
Background: The International Crane Foundation (ICF) / Endangered Wildlife Trust's (EWT) African Crane Conservation Programme has recorded 26,403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus, Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus. The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. New information: This is the single most comprehensive dataset published on African crane species, adding new information about the distribution of these three threatened species. We hope this will further aid conservation authorities to monitor and protect these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal. PMID:27956850
Large-Scale Pattern Discovery in Music
NASA Astrophysics Data System (ADS)
Bertin-Mahieux, Thierry
This work focuses on extracting patterns in musical data from very large collections. The problem is split into two parts. First, we build such a large collection, the Million Song Dataset, to provide researchers access to commercial-size datasets. Second, we use this collection to study cover song recognition, which involves finding harmonic patterns in audio features. Regarding the Million Song Dataset, we detail how we built the original collection from an online API and how we encouraged other organizations to participate in the project. The result is the largest research dataset with heterogeneous sources of data available to music technology researchers. We demonstrate some of its potential and discuss the impact it already has on the field. On cover song recognition, we must revisit the existing literature, since there are no publicly available results on a dataset of more than a few thousand entries. We present two solutions to tackle the problem: one using a hashing method, and one using a higher-level feature computed from the chromagram (dubbed the 2DFTM). We further investigate the 2DFTM, since it has the potential to be a relevant representation for any task involving audio harmonic content. Finally, we discuss the future of the dataset and the hope of seeing more work making use of the different sources of data that are linked in the Million Song Dataset. Regarding cover songs, we explain how this might be a first step towards defining a harmonic manifold of music, a space where harmonic similarities between songs would be more apparent.
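A minimal sketch of the 2DFTM idea mentioned above: taking the magnitude of a 2D Fourier transform of a chroma patch discards phase, so the representation is invariant to circular shifts along both axes, in particular to key transposition (a circular shift of the 12 pitch classes), which is useful when matching cover songs. The patch dimensions here are arbitrary toy values.

```python
import numpy as np

def two_dftm(chroma_patch):
    # Magnitude of the 2D FFT; a circular shift of the input only changes
    # the phase of the transform, so the magnitude is shift-invariant.
    return np.abs(np.fft.fft2(chroma_patch))

rng = np.random.default_rng(0)
patch = rng.random((12, 75))             # 12 pitch classes x 75 beats (toy sizes)
transposed = np.roll(patch, 3, axis=0)   # transpose up 3 semitones (circular)

# The representation is unchanged under transposition.
assert np.allclose(two_dftm(patch), two_dftm(transposed))
```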
Global Oil & Gas Features Database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kelly Rose; Jennifer Bauer; Vic Baker
This submission contains a zip file with the developed Global Oil & Gas Features Database (as an ArcGIS geodatabase). Access the technical report describing how this database was produced using the following link: https://edx.netl.doe.gov/dataset/development-of-an-open-global-oil-and-gas-infrastructure-inventory-and-geodatabase
Development of a SPARK Training Dataset
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sayre, Amanda M.; Olson, Jarrod R.
2015-03-01
In its first five years, the National Nuclear Security Administration’s (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). In the past seven years, the NGSI program has produced, and continues to produce, a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. Not only does the NGSI program carry out activities across multiple disciplines, but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no incorporated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed as a knowledge storage, retrieval, and analysis capability to capture safeguards knowledge so that it exists beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications and evaluated the science-policy interface at PNNL as a practical demonstration of SPARK’s intended analysis capability.
The analysis demonstration sought to answer the question, “Who leads research and development at PNNL, scientists or policy researchers?” The analysis was inconclusive as to whether policy researchers or scientists are the primary drivers of research at PNNL. However, the dataset development and analysis activity did demonstrate the utility and usability of the SPARK dataset. After the initiation of the NGSI program, there is a clear increase in the number of publications of safeguards products. Employing the natural language analysis tool IN-SPIRE™ showed the presence of vocation- and topic-specific vernacular within NGSI sub-topics. The methodology developed to define the scope of the dataset was useful in describing safeguards applications, but may be applicable for research on other topics beyond safeguards. The analysis emphasized the need for an expanded dataset to fully understand the scope of safeguards publications and research both nationally and internationally. As the SPARK dataset grows to include publications outside PNNL, topics crosscutting disciplines and DOE/NNSA locations should become more apparent. NGSI was established in 2008 to cultivate the next generation of safeguards professionals and support the development of core safeguards capabilities (NNSA 2012). Now a robust system to preserve and share institutional memory, such as SPARK, is needed to inspire and equip the next generation of safeguards experts, technologies, and policies.
Squillace, Joe
2013-01-01
Children in Medicaid/CHIP public coverage programs who reside in rural counties have limited access to dental care services. Shortages of dental professionals in rural areas impede utilization of dental care. Public and private initiatives are attempting to address this crisis. Missourians instituted deregulatory policies and invested in community-based initiatives. Using a Medicaid/CHIP administrative claims dataset from 2004 to 2007, this research explored patterns of utilization to assess the impact of these efforts. The number of participating private dental office providers declined over the study period, and the number of children utilizing clinics increased. Trends observed within the public health dental care market demonstrate that clinics are replacing private dentists as providers of Medicaid/CHIP dental services. Allowing greater market entry through deregulation could provide states with greater improvements to their public dental health infrastructure. © 2012 American Association of Public Health Dentistry.
Issues and Solutions for Bringing Heterogeneous Water Cycle Data Sets Together
NASA Technical Reports Server (NTRS)
Acker, James; Kempler, Steven; Teng, William; Belvedere, Deborah; Liu, Zhong; Leptoukh, Gregory
2010-01-01
The water cycle research community has generated many regional to global scale products using data from individual NASA missions or sensors (e.g., TRMM, AMSR-E); multiple ground- and space-based data sources (e.g., Global Precipitation Climatology Project [GPCP] products); and sophisticated data assimilation systems (e.g., Land Data Assimilation Systems [LDAS]). However, it is often difficult to access, explore, merge, analyze, and inter-compare these data in a coherent manner due to issues of data resolution, format, and structure. These difficulties were substantiated at the recent Collaborative Energy and Water Cycle Information Services (CEWIS) Workshop, where members of the NASA Energy and Water cycle Study (NEWS) community gave presentations, provided feedback, and developed scenarios which illustrated the difficulties and techniques for bringing together heterogeneous datasets. This presentation reports on the findings of the workshop, thus defining the problems and challenges of multi-dataset research. In addition, the CEWIS prototype shown at the workshop will be presented to illustrate new technologies that can mitigate data access roadblocks encountered in multi-dataset research, including: (1) Quick and easy search and access of selected NEWS data sets. (2) Multi-parameter data subsetting, manipulation, analysis, and display tools. (3) Access to input and derived water cycle data (data lineage). It is hoped that this presentation will encourage community discussion and feedback on heterogeneous data analysis scenarios, issues, and remedies.
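One of the merging difficulties named above is mismatched data resolution. A minimal, illustrative remedy (not the CEWIS implementation) is to regrid the finer dataset onto the coarser grid by block averaging before inter-comparison; the grid sizes below are toy values.

```python
import numpy as np

def block_average(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2D field,
    reducing its resolution by `factor` along each axis."""
    ny, nx = grid.shape
    assert ny % factor == 0 and nx % factor == 0, "grid must tile evenly"
    return grid.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

fine = np.arange(16, dtype=float).reshape(4, 4)  # toy high-resolution field
coarse = block_average(fine, 2)                  # now matches a 2x coarser grid
# coarse == [[2.5, 4.5], [10.5, 12.5]]
```

Real harmonization also has to reconcile projections, formats, and time steps, but spatial aggregation of this kind is typically the first step.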
NASA Astrophysics Data System (ADS)
Heather, David
2016-07-01
Introduction: The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces (e.g. FTP browser, Map based, Advanced search, and Machine interface): http://archives.esac.esa.int/psa All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. Updating the PSA: The PSA is currently implementing a number of significant changes, both to its web-based interface to the scientific community, and to its database structure. The new PSA will be up-to-date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's upcoming ExoMars and BepiColombo missions. The newly designed PSA homepage will provide direct access to scientific datasets via a text search for targets or missions. This will significantly reduce the complexity for users to find their data and will promote one-click access to the datasets. Additionally, the homepage will provide direct access to advanced views and searches of the datasets. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. A login mechanism will provide additional functionalities to the users to aid / ease their searches (e.g. saving queries, managing default views). Queries to the PSA database will be possible either via the homepage (for simple searches of missions or targets), or through a filter menu for more tailored queries. The filter menu will offer multiple options to search for a particular dataset or product, and will manage queries for both in-situ and remote sensing instruments. Parameters such as start-time, phase angle, and heliocentric distance will be emphasized. 
A further advanced search function will allow users to query all the metadata present in the PSA database. Results will be displayed in 3 different ways: 1) A table listing all the corresponding data matching the criteria in the filter menu, 2) a projection of the products onto the surface of the object when applicable (i.e. planets, small bodies), and 3) a list of images for the relevant instruments to enjoy the beauty of our Solar System. These different ways of viewing the datasets will ensure that scientists and non-professionals alike will have access to the specific data they are looking for, regardless of their background. Conclusions: The new PSA will maintain the various interfaces and services it had in the past, and will include significant improvements designed to allow easier and more effective access to the scientific data and supporting materials. The new PSA is expected to be released by mid-2016. It will support the past, present and future missions, ancillary datasets, and will enhance the scientific output of ESA's missions. As such, the PSA will become a unique archive ensuring the long-term preservation and usage of scientific datasets together with user-friendly access.
NASA Astrophysics Data System (ADS)
Heather, David; Besse, Sebastien; Barbarisi, Isa; Arviset, Christophe; de Marchi, Guido; Barthelemy, Maud; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; Macfarlane, Alan; Martinez, Santa; Rios, Carlos
2016-04-01
Introduction: The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces (e.g. FTP browser, Map based, Advanced search, and Machine interface): http://archives.esac.esa.int/psa All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. Updating the PSA: The PSA is currently implementing a number of significant changes, both to its web-based interface to the scientific community, and to its database structure. The new PSA will be up-to-date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's upcoming ExoMars and BepiColombo missions. The newly designed PSA homepage will provide direct access to scientific datasets via a text search for targets or missions. This will significantly reduce the complexity for users to find their data and will promote one-click access to the datasets. Additionally, the homepage will provide direct access to advanced views and searches of the datasets. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. A login mechanism will provide additional functionalities to the users to aid / ease their searches (e.g. saving queries, managing default views). Queries to the PSA database will be possible either via the homepage (for simple searches of missions or targets), or through a filter menu for more tailored queries. The filter menu will offer multiple options to search for a particular dataset or product, and will manage queries for both in-situ and remote sensing instruments. Parameters such as start-time, phase angle, and heliocentric distance will be emphasized. 
A further advanced search function will allow users to query all the metadata present in the PSA database. Results will be displayed in 3 different ways: 1) A table listing all the corresponding data matching the criteria in the filter menu, 2) a projection of the products onto the surface of the object when applicable (i.e. planets, small bodies), and 3) a list of images for the relevant instruments to enjoy the beauty of our Solar System. These different ways of viewing the datasets will ensure that scientists and non-professionals alike will have access to the specific data they are looking for, regardless of their background. Conclusions: The new PSA will maintain the various interfaces and services it had in the past, and will include significant improvements designed to allow easier and more effective access to the scientific data and supporting materials. The new PSA is expected to be released by mid-2016. It will support the past, present and future missions, ancillary datasets, and will enhance the scientific output of ESA's missions. As such, the PSA will become a unique archive ensuring the long-term preservation and usage of scientific datasets together with user-friendly access.
panMetaDocs and DataSync - providing a convenient way to share and publish research data
NASA Astrophysics Data System (ADS)
Ulbricht, D.; Klump, J. F.
2013-12-01
In recent years, research institutions, geological surveys, and funding organizations have started to build infrastructures to facilitate the re-use of research data from previous work. At present, several intermeshed activities are coordinated to make data systems of the earth sciences interoperable and recorded data discoverable. Driven by governmental authorities, ISO 19115/19139 emerged as metadata standards for the discovery of data and services. Established metadata transport protocols like OAI-PMH and OGC CSW are used to disseminate metadata to data portals. With persistent identifiers like DOI and IGSN, research data and corresponding physical samples can be given unambiguous names and thus become citable. In summary, these activities focus primarily on 'ready to give away' data, already stored in an institutional repository and described with appropriate metadata. Many datasets are not 'born' in this state but are produced in small and federated research projects. To make access and reuse of these 'small data' easier, these data should be centrally stored and version controlled from the very beginning of activities. We developed DataSync [1] as a supplemental application to the panMetaDocs [2] data exchange platform, serving as a data management tool for small science projects. DataSync is a Java application that runs on a local computer and synchronizes directory trees into an eSciDoc repository [3] by creating eSciDoc objects via eSciDoc's REST API. DataSync can be installed on multiple computers and is in this way able to synchronize the files of a research team over the internet. XML metadata can be added as separate files that are managed together with the data files as versioned eSciDoc objects. A project-customized instance of panMetaDocs is provided to show a web-based overview of the previously uploaded file collection and to allow further annotation with metadata inside the eSciDoc repository.
panMetaDocs is a PHP-based web application that assists in the creation of metadata in any XML-based metadata schema. To reduce manual entry of metadata to a minimum and make use of contextual information in a project setting, metadata fields can be populated with static or dynamic content. Access rights can be defined to control visibility of and access to stored objects. Notifications about recently updated datasets are available by RSS and e-mail, and the entire inventory can be harvested via OAI-PMH. panMetaDocs is optimized to be harvested by panFMP [4]. panMetaDocs is able to mint dataset DOIs through DataCite and uses eSciDoc's REST API to transfer eSciDoc objects from a non-public 'pending' status to the published status 'released', which makes the data and metadata of the published object available worldwide through the internet. The application scenario presented here shows the adoption of open source applications for data sharing and the publication of data. An eSciDoc repository is used as storage for data and metadata. DataSync serves as a file ingester and distributor, whereas panMetaDocs' main function is to annotate the dataset files with metadata to make them ready for publication and sharing with one's own team or with the scientific community.
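The OAI-PMH harvesting mentioned above works over plain HTTP: a harvester issues requests such as `?verb=ListRecords&metadataPrefix=oai_dc` and parses the XML reply. The snippet below parses a heavily simplified canned response offline (real `oai_dc` replies wrap records in an `oai_dc:dc` container with more metadata elements); it is a sketch of the parsing step, not a complete harvester.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes in Clark notation for ElementTree lookups.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Minimal stand-in for a ListRecords response (simplified structure).
sample = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/">
          <title>Example dataset</title>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(sample)
n_records = sum(1 for _ in root.iter(OAI + "record"))
titles = [t.text for t in root.iter(DC + "title")]
```

A real harvester would additionally follow `resumptionToken` elements to page through large inventories.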
Do citations and readership identify seminal publications?
Herrmannova, Drahomira; Patton, Robert M.; Knoth, Petr; ...
2018-02-10
This work presents a new approach for analysing the ability of existing research metrics to identify research which has strongly influenced future developments. More specifically, we focus on the ability of citation counts and Mendeley reader counts to distinguish between publications regarded as seminal and publications regarded as literature reviews by field experts. The main motivation behind our research is to gain a better understanding of whether and how well the existing research metrics relate to research quality. For this experiment we have created a new dataset, which we call TrueImpactDataset, and which contains two types of publications: seminal papers and literature reviews. Using the dataset, we conduct a set of experiments to study how citation and reader counts perform in distinguishing these publication types, following the intuition that causing a change in a field signifies research quality. Our research shows that citation counts work better than a random baseline (by a margin of 10%) in distinguishing important seminal research papers from literature reviews, while Mendeley reader counts do not work better than the baseline.
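The comparison against a random baseline described above can be sketched on synthetic data: classify papers as "seminal" when their citation count exceeds a threshold and compare the accuracy to random guessing. All numbers below are made up for illustration and have nothing to do with TrueImpactDataset itself.

```python
import random

random.seed(42)

# Synthetic citation counts: assume seminal papers are cited somewhat more.
seminal = [random.gauss(120, 30) for _ in range(200)]
reviews = [random.gauss(80, 30) for _ in range(200)]

# Threshold classifier: "seminal" if citations exceed the cutoff.
threshold = 100
correct = sum(c > threshold for c in seminal) + sum(c <= threshold for c in reviews)
accuracy = correct / (len(seminal) + len(reviews))

baseline = 0.5  # random guessing on a balanced two-class set

# On these synthetic counts the threshold beats random guessing.
assert accuracy > baseline
```

In the paper's actual experiments the margin over the baseline was about 10% for citation counts, while reader counts showed no such margin.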
Do citations and readership identify seminal publications?
DOE Office of Scientific and Technical Information (OSTI.GOV)
Herrmannova, Drahomira; Patton, Robert M.; Knoth, Petr
This work presents a new approach for analysing the ability of existing research metrics to identify research which has strongly influenced future developments. More specifically, we focus on the ability of citation counts and Mendeley reader counts to distinguish between publications regarded as seminal and publications regarded as literature reviews by field experts. The main motivation behind our research is to gain a better understanding of whether and how well the existing research metrics relate to research quality. For this experiment we have created a new dataset, which we call TrueImpactDataset, and which contains two types of publications: seminal papers and literature reviews. Using the dataset, we conduct a set of experiments to study how citation and reader counts perform in distinguishing these publication types, following the intuition that causing a change in a field signifies research quality. Our research shows that citation counts work better than a random baseline (by a margin of 10%) in distinguishing important seminal research papers from literature reviews, while Mendeley reader counts do not work better than the baseline.
The Path from Large Earth Science Datasets to Information
NASA Astrophysics Data System (ADS)
Vicente, G. A.
2013-12-01
The NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) is one of the major Science Mission Directorate (SMD) data centers for the archiving and distribution of Earth Science remote sensing data, products, and services. This virtual portal provides convenient access to Atmospheric Composition and Dynamics, Hydrology, Precipitation, Ozone, and model-derived datasets (generated by GSFC's Global Modeling and Assimilation Office), as well as the North American Land Data Assimilation System (NLDAS) and the Global Land Data Assimilation System (GLDAS) data products (both generated by GSFC's Hydrological Sciences Branch). This presentation demonstrates various tools and computational technologies developed at the GES DISC to manage the huge volume of data and products acquired from various missions and programs over the years. It explores approaches to archive, document, distribute, access, and analyze Earth Science data and information, and addresses the technical and scientific issues, governance, and user support problems faced by scientists in need of multi-disciplinary datasets. It also discusses data and product metrics, user distribution profiles, and lessons learned through interactions with the science communities around the world. Finally, it demonstrates some of the most used data and product visualization and analysis tools developed and maintained by the GES DISC.
MEXPRESS: visualizing expression, DNA methylation and clinical TCGA data.
Koch, Alexander; De Meyer, Tim; Jeschke, Jana; Van Criekinge, Wim
2015-08-26
In recent years, increasing amounts of genomic and clinical cancer data have become publicly available through large-scale collaborative projects such as The Cancer Genome Atlas (TCGA). However, as long as these datasets are difficult to access and interpret, they are essentially useless for a major part of the research community and their scientific potential will not be fully realized. To address these issues we developed MEXPRESS, a straightforward and easy-to-use web tool for the integration and visualization of expression, DNA methylation, and clinical TCGA data on a single-gene level (http://mexpress.be). In comparison to existing tools, MEXPRESS allows researchers to quickly visualize and interpret the different TCGA datasets and their relationships for a single gene, as demonstrated for GSTP1 in prostate adenocarcinoma. We also used MEXPRESS to reveal the differences in the DNA methylation status of the PAM50 marker gene MLPH between the breast cancer subtypes and how these differences are linked to the expression of MLPH. We have created a user-friendly tool for the visualization and interpretation of TCGA data, offering clinical researchers a simple way to evaluate the TCGA data for their genes or candidate biomarkers of interest.
TISSUES 2.0: an integrative web resource on mammalian tissue expression
Palasca, Oana; Santos, Alberto; Stolte, Christian; Gorodkin, Jan; Jensen, Lars Juhl
2018-01-01
Abstract Physiological and molecular similarities between organisms make it possible to translate findings from simpler experimental systems—model organisms—into more complex ones, such as human. This translation facilitates the understanding of biological processes under normal or disease conditions. Researchers aiming to identify the similarities and differences between organisms at the molecular level need resources collecting multi-organism tissue expression data. We have developed a database of gene–tissue associations in human, mouse, rat and pig by integrating multiple sources of evidence: transcriptomics covering all four species and proteomics (human only), manually curated and mined from the scientific literature. Through a scoring scheme, these associations are made comparable across all sources of evidence and across organisms. Furthermore, the scoring produces a confidence score assigned to each of the associations. The TISSUES database (version 2.0) is publicly accessible through a user-friendly web interface and as part of the STRING app for Cytoscape. In addition, we analyzed the agreement between datasets, across and within organisms, and identified that the agreement is mainly affected by the quality of the datasets rather than by the technologies used or organisms compared. Database URL: http://tissues.jensenlab.org/ PMID:29617745
ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.
Teodoro, Douglas; Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio
2018-01-01
The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high-complexity procedure information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating the insert throughput and query latency performance of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.
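The kind of measurement ORBDA enables, insert throughput and query latency against a storage backend, can be sketched with a minimal harness. A Python dict stands in for a real NoSQL server here, so the absolute numbers are meaningless; only the structure of the measurement is the point.

```python
import time

def benchmark_inserts(store, records):
    """Insert (key, document) pairs and return inserts per second."""
    t0 = time.perf_counter()
    for key, doc in records:
        store[key] = doc
    elapsed = time.perf_counter() - t0
    return len(records) / elapsed if elapsed > 0 else float("inf")

# Toy stand-ins for openEHR composition records and a NoSQL store.
store = {}
records = [(i, {"composition": f"record-{i}"}) for i in range(10_000)]
throughput = benchmark_inserts(store, records)  # inserts per second
```

A real benchmark run would additionally repeat the measurement, discard warm-up iterations, and record per-query latency percentiles rather than a single throughput figure.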
ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers
Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio
2018-01-01
The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high-complexity procedure information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating the insert throughput and query latency performance of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms. PMID:29293556
Dataset from Dick et al published in Sawyer et al 2016
Dataset is a time course description of lindane disappearance in blood plasma after dermal exposure in human volunteers. This dataset is associated with the following publication: Sawyer, M.E., M.V. Evans, C. Wilson, L.J. Beesley, L. Leon, C. Eklund, E. Croom, and R. Pegram. Development of a Human Physiologically Based Pharmacokinetics (PBPK) Model for Dermal Permeability for Lindane. TOXICOLOGY LETTERS. Elsevier Science Ltd, New York, NY, USA, 245: pp. 106-109, (2016).
NASA Astrophysics Data System (ADS)
Shiklomanov, A. I.; Okladnikov, I.; Gordov, E. P.; Proussevitch, A. A.; Titov, A. G.
2016-12-01
Presented is a collaborative project carried out by a joint team of researchers from the Institute of Monitoring of Climatic and Ecological Systems, Russia, and the Earth Systems Research Center, University of New Hampshire, USA. Its main objective is the development of a hardware and software prototype of a Distributed Research Center (DRC) for monitoring and projecting regional climatic changes and their impacts on the environment over the Northern extratropical areas. In the framework of the project, new approaches to "cloud" processing and analysis of large geospatial datasets (big geospatial data) are being developed. The prototype will be deployed on the technical platforms of both institutions and applied in research on climate change and its consequences. Datasets available at NCEI and IMCES include multidimensional arrays of climatic, environmental, demographic, and socio-economic characteristics. The project is aimed at solving several major research and engineering tasks: 1) structure analysis of the huge heterogeneous climate and environmental geospatial datasets used in the project, their preprocessing and unification; 2) development of a new distributed storage and processing model based on a "shared nothing" paradigm; 3) development of a dedicated database of metadata describing the geospatial datasets used in the project; 4) development of a dedicated geoportal and a high-end graphical frontend providing an intuitive user interface, internet-accessible online tools for analysis of geospatial data, and web services for interoperability with other geoprocessing software packages. The DRC will operate as a single access point to distributed archives of spatial data and online tools for their processing. A flexible modular computational engine running verified data processing routines will provide solid results of geospatial data analysis. The "cloud" data analysis and visualization approach will guarantee access to the DRC online tools and data from all over the world.
Additionally, data processing results will be exported through WMS and WFS services to ensure their interoperability. Financial support of this activity by the RF Ministry of Education and Science under Agreement 14.613.21.0037 (RFMEFI61315X0037) and by the Iola Hubbard Climate Change Endowment is acknowledged.
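The WMS export mentioned above follows the OGC Web Map Service standard, so a client only needs to assemble a GetMap request. The sketch below builds such a request URL; the endpoint URL and layer name are invented placeholders, not the DRC's actual service.

```python
from urllib.parse import urlencode

def wms_getmap_url(endpoint, layer, bbox, width=512, height=512):
    """Build an OGC WMS 1.3.0 GetMap request URL.
    bbox is (min_lat, min_lon, max_lat, max_lon) in the EPSG:4326 axis order."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical layer covering the Northern extratropics
url = wms_getmap_url("https://drc.example.org/wms", "air_temperature_anomaly",
                     (50.0, 60.0, 75.0, 180.0))
```

Any WMS-capable client (desktop GIS, web map library) could consume the same service, which is the interoperability point the abstract makes.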
Collaborative development of predictive toxicology applications
2010-01-01
OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals. The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation. 
Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way. PMID:20807436
Collaborative development of predictive toxicology applications.
Hardy, Barry; Douglas, Nicki; Helma, Christoph; Rautenberg, Micha; Jeliazkova, Nina; Jeliazkov, Vedrin; Nikolova, Ivelina; Benigni, Romualdo; Tcheremenskaia, Olga; Kramer, Stefan; Girschick, Tobias; Buchwald, Fabian; Wicker, Joerg; Karwath, Andreas; Gütlein, Martin; Maunz, Andreas; Sarimveis, Haralambos; Melagraki, Georgia; Afantitis, Antreas; Sopasakis, Pantelis; Gallagher, David; Poroikov, Vladimir; Filimonov, Dmitry; Zakharov, Alexey; Lagunin, Alexey; Gloriozova, Tatyana; Novikov, Sergey; Skvortsova, Natalia; Druzhilovsky, Dmitry; Chawla, Sunil; Ghosh, Indira; Ray, Surajit; Patel, Hitesh; Escher, Sylvia
2010-08-31
OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals. The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services.
The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation. Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.
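The REST services described above expose compounds, datasets, algorithms, and models as addressable resources. The sketch below prepares a request in that style; the service hostname is a placeholder, and the `chemical/x-daylight-smiles` media type for SMILES follows OpenTox conventions but should be verified against a given service's documentation.

```python
import urllib.request

def compound_request(service_root, compound_id):
    """Prepare a GET for a compound resource, asking for a SMILES representation.
    service_root is a placeholder for an OpenTox API-compliant service."""
    return urllib.request.Request(
        f"{service_root}/compound/{compound_id}",
        headers={"Accept": "chemical/x-daylight-smiles"},
    )

req = compound_request("https://opentox.example.org", "42")
# urllib.request.urlopen(req) would fetch the representation from a live service
```

Content negotiation via the Accept header is how OpenTox-style services select among representations of the same resource.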
ERIC Educational Resources Information Center
Anshien, Carol M.; And Others
A short review of the development of cable television in New York City, a brief description of wiring patterns, a history of public access, and some statistical data on public channel usage are provided in the first portion of this report. The second major part describes the Public Access Celebration, a three-day informational event held in July…
2011-02-01
…search capability for Air Force Research Information Management System (AFRIMS) data as a part of federated search under DTIC Online Access…
• Provide vetted requests to dataset owners.
• Develop a federated search capability for databases containing limited distribution material.
• Deploy…
Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses
Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi
2018-01-01
Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also indexes 4,901 published bioinformatics software tools and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625
Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses.
Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V; Ma'ayan, Avi
2018-02-27
Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated 'canned' analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also indexes 4,901 published bioinformatics software tools and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools.
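The API mentioned in the abstract can be queried over HTTP. The sketch below only assembles a query URL; the `/api/search` path and the parameter names (`object_type`, `dataset_accession`, `tool_name`) are illustrative assumptions, not the documented interface, which should be consulted at the project site.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint path under the project's base URL
BASE = "http://amp.pharm.mssm.edu/datasets2tools/api/search"

def search_url(object_type, **filters):
    """Assemble a search URL; parameter names here are assumptions for illustration."""
    params = {"object_type": object_type, **filters}
    return BASE + "?" + urlencode(sorted(params.items()))

# e.g., find canned analyses of a GEO dataset produced with a given tool
url = search_url("canned_analysis", dataset_accession="GSE10325", tool_name="Enrichr")
```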
Lemieux, Sebastien; Sargeant, Tobias; Laperrière, David; Ismail, Houssam; Boucher, Geneviève; Rozendaal, Marieke; Lavallée, Vincent-Philippe; Ashton-Beaucage, Dariel; Wilhelm, Brian; Hébert, Josée; Hilton, Douglas J; Mader, Sylvie; Sauvageau, Guy
2017-07-27
Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
The Alaska Arctic Vegetation Archive (AVA-AK)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Walker, Donald; Breen, Amy; Druckenmiller, Lisa
The Alaska Arctic Vegetation Archive (AVA-AK, GIVD-ID: NA-US-014) is a free, publicly available database archive of vegetation-plot data from the Arctic tundra region of northern Alaska. The archive currently contains 24 datasets with 3,026 non-overlapping plots. Of these, 74% have geolocation data with 25-m or better precision. Species cover data and header data are stored in a Turboveg database. A standardized Pan Arctic Species List provides a consistent nomenclature for vascular plants, bryophytes, and lichens in the archive. A web-based online Alaska Arctic Geoecological Atlas (AGA-AK) allows viewing and downloading the species data in a variety of formats, and provides access to a wide variety of ancillary data. We conducted a preliminary cluster analysis of the first 16 datasets (1,613 plots) to examine how the spectrum of derived clusters is related to the suite of datasets, habitat types, and environmental gradients. Here, we present the contents of the archive, assess its strengths and weaknesses, and provide three supplementary files that include the data dictionary, a list of habitat types, an overview of the datasets, and details of the cluster analysis.
The Alaska Arctic Vegetation Archive (AVA-AK)
Walker, Donald; Breen, Amy; Druckenmiller, Lisa; ...
2016-05-17
The Alaska Arctic Vegetation Archive (AVA-AK, GIVD-ID: NA-US-014) is a free, publicly available database archive of vegetation-plot data from the Arctic tundra region of northern Alaska. The archive currently contains 24 datasets with 3,026 non-overlapping plots. Of these, 74% have geolocation data with 25-m or better precision. Species cover data and header data are stored in a Turboveg database. A standardized Pan Arctic Species List provides a consistent nomenclature for vascular plants, bryophytes, and lichens in the archive. A web-based online Alaska Arctic Geoecological Atlas (AGA-AK) allows viewing and downloading the species data in a variety of formats, and provides access to a wide variety of ancillary data. We conducted a preliminary cluster analysis of the first 16 datasets (1,613 plots) to examine how the spectrum of derived clusters is related to the suite of datasets, habitat types, and environmental gradients. Here, we present the contents of the archive, assess its strengths and weaknesses, and provide three supplementary files that include the data dictionary, a list of habitat types, an overview of the datasets, and details of the cluster analysis.
Public Libraries and Internet Public Access Models: Describing Possible Approaches.
ERIC Educational Resources Information Center
Tomasello, Tami K.; McClure, Charles R.
2002-01-01
Discusses ways of providing Internet access to the general public and analyzes eight models currently in use: public schools, public libraries, cybermobiles, public housing, community technology centers, community networks, kiosks, and cyber cafes. Concludes that public libraries may wish to develop collaborative strategies with other…
Multi-modal two-step floating catchment area analysis of primary health care accessibility.
Langford, Mitchel; Higgs, Gary; Fry, Richard
2016-03-01
Two-step floating catchment area (2SFCA) techniques are popular for measuring potential geographical accessibility to health care services. This paper proposes methodological enhancements to increase the sophistication of the 2SFCA methodology by incorporating both public and private transport modes using dedicated network datasets. The proposed model yields separate accessibility scores for each modal group at each demand point to better reflect the differential accessibility levels experienced by each cohort. An empirical study of primary health care facilities in South Wales, UK, is used to illustrate the approach. Outcomes suggest the bus-riding cohort of each census tract experience much lower accessibility levels than those estimated by an undifferentiated (car-only) model. Car drivers' accessibility may also be misrepresented in an undifferentiated model because they potentially profit from the lower demand placed upon service provision points by bus riders. The ability to specify independent catchment sizes for each cohort in the multi-modal model allows aspects of preparedness to travel to be investigated. Copyright © 2016. Published by Elsevier Ltd.
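The core 2SFCA computation is compact enough to sketch. The toy implementation below is a single-mode, all-or-nothing catchment version with invented names and numbers; the paper's enhancement computes separate accessibility scores per transport-mode cohort over dedicated network datasets, which this sketch does not attempt.

```python
def two_step_fca(supply, demand, dist, threshold):
    """Minimal single-mode 2SFCA.
    supply[j]: provider capacity; demand[i]: population at demand point i;
    dist[(i, j)]: travel cost from i to j; threshold: catchment size."""
    # Step 1: provider-to-population ratio R_j within each provider's catchment
    ratios = {}
    for j, capacity in supply.items():
        population = sum(p for i, p in demand.items() if dist[(i, j)] <= threshold)
        ratios[j] = capacity / population if population else 0.0
    # Step 2: accessibility A_i sums the ratios of all providers reachable from i
    return {
        i: sum(r for j, r in ratios.items() if dist[(i, j)] <= threshold)
        for i in demand
    }

# Toy example: two clinics, three census tracts, 15-minute catchment
supply = {"clinic_a": 10, "clinic_b": 5}
demand = {"tract_1": 100, "tract_2": 200, "tract_3": 50}
dist = {
    ("tract_1", "clinic_a"): 5,  ("tract_1", "clinic_b"): 20,
    ("tract_2", "clinic_a"): 8,  ("tract_2", "clinic_b"): 30,
    ("tract_3", "clinic_a"): 25, ("tract_3", "clinic_b"): 6,
}
access = two_step_fca(supply, demand, dist, threshold=15)
```

A multi-modal variant in the spirit of the paper would split each tract's population into modal cohorts, compute step 1 against mode-specific travel costs and catchment sizes, and report a separate score per cohort.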
Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study
2015-01-01
Objective This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are “invisible” or not deposited in a known repository. Methods We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article. Results About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects. Conclusion In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a “dataset,” determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets. PMID:26207759
Web-Based Geographic Information System Tool for Accessing Hanford Site Environmental Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Triplett, Mark B.; Seiple, Timothy E.; Watson, David J.
Data volume, complexity, and access issues pose severe challenges for analysts, regulators and stakeholders attempting to efficiently use legacy data to support decision making at the U.S. Department of Energy’s (DOE) Hanford Site. DOE has partnered with the Pacific Northwest National Laboratory (PNNL) on the PHOENIX (PNNL-Hanford Online Environmental Information System) project, which seeks to address data access, transparency, and integration challenges at Hanford to provide effective decision support. PHOENIX is a family of spatially-enabled web applications providing quick access to decades of valuable scientific data and insight through intuitive query, visualization, and analysis tools. PHOENIX realizes broad, public accessibility by relying only on ubiquitous web-browsers, eliminating the need for specialized software. It accommodates a wide range of users with intuitive user interfaces that require little or no training to quickly obtain and visualize data. Currently, PHOENIX is actively hosting three applications focused on groundwater monitoring, groundwater clean-up performance reporting, and in-tank monitoring. PHOENIX-based applications are being used to streamline investigative and analytical processes at Hanford, saving time and money. But more importantly, by integrating previously isolated datasets and developing relevant visualization and analysis tools, PHOENIX applications are enabling DOE to discover new correlations hidden in legacy data, allowing them to more effectively address complex issues at Hanford.
WDS Trusted Data Services in Support of International Science
NASA Astrophysics Data System (ADS)
Mokrane, M.; Minster, J. B. H.
2014-12-01
Today's research is international, transdisciplinary, and data-enabled, which requires scrupulous data stewardship, full and open access to data, and efficient collaboration and coordination. New expectations on researchers based on policies from governments and funders to share data fully, openly, and in a timely manner present significant challenges but are also opportunities to improve the quality and efficiency of research and its accountability to society. Researchers should be able to archive and disseminate data as required by many institutions or funders, and civil society should be able to scrutinize the datasets underlying public policies. Thus, the trustworthiness of data services must be verifiable. In addition, the need to integrate large and complex datasets across disciplines and domains with variable levels of maturity calls for greater coordination to achieve sufficient interoperability and sustainability. The World Data System (WDS) of the International Council for Science (ICSU) promotes long-term stewardship of, and universal and equitable access to, quality-assured scientific data and services across a range of disciplines in the natural and social sciences. WDS aims at coordinating and supporting trusted scientific data services for the provision, use, and preservation of relevant datasets to facilitate scientific research, in particular under the ICSU umbrella, while strengthening their links with the research community. WDS certifies its Members, holders and providers of data or data products, using internationally recognized standards, thus providing the building blocks of a searchable common infrastructure from which a data system that is both interoperable and distributed can be formed.
This presentation will describe the coordination role of WDS and, more specifically, activities developed by its Scientific Committee to: improve and stimulate basic-level Certification for Scientific Data Services, in particular through collaboration with the Data Seal of Approval; identify and define best practices for Publishing Data and test their implementation by involving the core stakeholders, i.e. researchers, institutions, data centres, scholarly publishers, and funders; and establish an open WDS Metadata Catalogue, Knowledge Network, and Global Registry of Trusted Data Services.
Stakeholder values and ecosystems in developing open access to research data.
NASA Astrophysics Data System (ADS)
Wessels, Bridgette; Sveinsdottir, Thordis; Smallwood, Rod
2014-05-01
One aspect of understanding how to develop open access to research data is to understand the values of stakeholders in the emerging open data ecosystem. The EU FP7 funded project Policy RECommendations for Open Access to Research Data in Europe (RECODE) (Grant Agreement No: 321463) undertook such research to identify stakeholder values and to map the emerging ecosystem. In this paper we outline and discuss the findings of this research. We address three key objectives, which are: (a) the identification and mapping of the diverse range of stakeholder values in Open Access data and data dissemination and preservation; (b) mapping stakeholder values onto research ecosystems using case studies from different disciplinary perspectives; and (c) evaluating and identifying good practice in addressing conflicting value chains and stakeholder fragmentation. The research was structured around three related actions: (a) an analysis of policy and related documents and protocols, in order to map the formal expression of values and motivations; (b) conducting five case studies in particle physics, health sciences, bioengineering, environmental research and archaeology, which explored issues of data size; quality control, ethics and data security; replication of large datasets; interoperability; and the preservation of diverse types of data; and (c) undertaking a validation and dissemination workshop that sought to better understand how to match policies with stakeholder drivers and motivations to increase their effectiveness in promoting Open Access to research data. The findings show a clear overall drive toward Open Data Access within the policy documents, as part of a wider drive for open science in general. This is underpinned by the view of science as an open enterprise. Although there is a strong argument for publicly funded science to be made open to the public, the details of how to make research data open are as yet unclear.
Our research found that discussions of Open Data tend to refer to science as a single sector, leading to differences between disciplines being ignored in policy making. Each discipline has different methods for gathering and analysing data, some disciplines deal with sensitive data, and others deal with data that may have IPR or legal issues. We recommend that these differences are recognised, as they will inform the debate about subject specific requirements and common infrastructures for Open Data Access.
McKenzie, Grant; Janowicz, Krzysztof
2017-01-01
Gaining access to inexpensive, high-resolution, up-to-date, three-dimensional road network data is a top priority beyond research, as such data would fuel applications in industry, governments, and the broader public alike. Road network data are openly available via user-generated content such as OpenStreetMap (OSM) but lack the resolution required for many tasks, e.g., emergency management. More importantly, however, few publicly available data offer information on elevation and slope. For most parts of the world, up-to-date digital elevation products with a resolution of less than 10 meters are a distant dream and, if available, those datasets have to be matched to the road network through an error-prone process. In this paper we present a radically different approach by deriving road network elevation data from massive amounts of in-situ observations extracted from user-contributed data from an online social fitness tracking application. While each individual observation may be of low-quality in terms of resolution and accuracy, taken together they form an accurate, high-resolution, up-to-date, three-dimensional road network that excels where other technologies such as LiDAR fail, e.g., in case of overpasses, overhangs, and so forth. In fact, the 1m spatial resolution dataset created in this research based on 350 million individual 3D location fixes has an RMSE of approximately 3.11m compared to a LiDAR-based ground-truth and can be used to enhance existing road network datasets where individual elevation fixes differ by up to 60m. In contrast, using interpolated data from the National Elevation Dataset (NED) results in 4.75m RMSE compared to the base line. We utilize Linked Data technologies to integrate the proposed high-resolution dataset with OpenStreetMap road geometries without requiring any changes to the OSM data model.
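The accuracy figures quoted above (3.11 m vs. 4.75 m RMSE against the LiDAR baseline) come from a root-mean-square error comparison, which is straightforward to reproduce. The elevation values in the sketch below are invented for illustration.

```python
import math

def rmse(estimates, ground_truth):
    """Root-mean-square error between two equal-length elevation profiles (metres)."""
    if len(estimates) != len(ground_truth):
        raise ValueError("profiles must be the same length")
    n = len(estimates)
    return math.sqrt(sum((e - g) ** 2 for e, g in zip(estimates, ground_truth)) / n)

# Invented crowd-sourced elevation fixes vs. a LiDAR-derived baseline
crowd_sourced = [101.2, 103.5, 98.7]
lidar_baseline = [100.0, 104.0, 99.0]
error = rmse(crowd_sourced, lidar_baseline)
```

In the paper's setting, `estimates` would be the aggregated fitness-tracker elevations sampled along the road network and `ground_truth` the co-located LiDAR values.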
Computer assisted screening, correction, and analysis of historical weather measurements
NASA Astrophysics Data System (ADS)
Burnette, Dorian J.; Stahle, David W.
2013-04-01
A computer program, Historical Observation Tools (HOB Tools), has been developed to facilitate many of the calculations used by historical climatologists to develop instrumental and documentary temperature and precipitation datasets and makes them readily accessible to other researchers. The primitive methodology used by the early weather observers makes the application of standard techniques difficult. HOB Tools provides a step-by-step framework to visually and statistically assess, adjust, and reconstruct historical temperature and precipitation datasets. These routines include the ability to check for undocumented discontinuities, adjust temperature data for poor thermometer exposures and diurnal averaging, and assess and adjust daily precipitation data for undercount. This paper provides an overview of the Visual Basic.NET program and a demonstration of how it can assist in the development of extended temperature and precipitation datasets using modern and early instrumental measurements from the United States.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hodge, Bri-Mathias
2016-04-08
The primary objective of this work was to create a state-of-the-art national wind resource data set and to provide detailed wind plant output data for specific sites based on that data set. Corresponding retrospective wind forecasts were also included at all selected locations. The combined information from these activities was used to create the Wind Integration National Dataset (WIND), and an extraction tool was developed to allow web-based data access.
DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures
Yin, Xu-Cheng; Yang, Chun; Pei, Wei-Yi; Man, Haixia; Zhang, Jun; Learned-Miller, Erik; Yu, Hong
2015-01-01
Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/. PMID:25951377
The Climate Data Analytic Services (CDAS) Framework.
NASA Astrophysics Data System (ADS)
Maxwell, T. P.; Duffy, D.
2016-12-01
Faced with unprecedented growth in climate data volume and demand, NASA has developed the Climate Data Analytic Services (CDAS) framework. This framework enables scientists to execute data processing workflows combining common analysis operations in a high performance environment close to the massive data stores at NASA. The data is accessed in standard (NetCDF, HDF, etc.) formats in a POSIX file system and processed using vetted climate data analysis tools (ESMF, CDAT, NCO, etc.). A dynamic caching architecture enables interactive response times. CDAS utilizes Apache Spark for parallelization and a custom array framework for processing huge datasets within limited memory spaces. CDAS services are accessed via a WPS API being developed in collaboration with the ESGF Compute Working Team to support server-side analytics for ESGF. The API can be accessed using direct web service calls, a python script, a unix-like shell client, or a javascript-based web application. Client packages in python, scala, or javascript contain everything needed to make CDAS requests. The CDAS architecture brings together the tools, data storage, and high-performance computing required for timely analysis of large-scale data sets, where the data resides, to ultimately produce societal benefits. It is currently deployed at NASA in support of the Collaborative REAnalysis Technical Environment (CREATE) project, which centralizes numerous global reanalysis datasets onto a single advanced data analytics platform. This service permits decision makers to investigate climate changes around the globe, inspect model trends and variability, and compare multiple reanalysis datasets.
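A WPS API of the kind described can be exercised with plain key-value-pair requests per the OGC WPS 1.0.0 standard. The sketch below assembles such a request; the endpoint and the operation identifier are hypothetical placeholders, not CDAS's actual deployment.

```python
from urllib.parse import urlencode

def wps_execute_url(endpoint, identifier, inputs):
    """Build an OGC WPS 1.0.0 KVP Execute request URL.
    'inputs' is a dict of named process inputs, joined per the DataInputs syntax."""
    datainputs = ";".join(f"{k}={v}" for k, v in inputs.items())
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": identifier,
        "datainputs": datainputs,
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical server-side averaging over a reanalysis temperature variable
url = wps_execute_url("https://cdas.example.org/wps", "timeseries.average",
                      {"variable": "tas", "domain": "global"})
```

The point of the server-side design is that only this small request, and the reduced result, cross the network, while the computation runs next to the data store.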
Assessing and quantifying public transit access.
DOT National Transportation Integrated Search
2014-03-01
Measuring access to transit services is important in evaluating existing services, predicting travel demands, allocating transportation investments and making decisions on land development. A composite index for assessing accessibility of public tran...
National Geothermal Data System: an Exemplar of Open Access to Data
NASA Astrophysics Data System (ADS)
Allison, M. L.; Richard, S. M.; Blackman, H.; Anderson, A.
2013-12-01
The National Geothermal Data System's (NGDS - www.geothermaldata.org) formal launch in 2014 will provide open access to millions of datasets, sharing technical geothermal-relevant data across the geosciences to propel geothermal development and production. With information from all of the Department of Energy's sponsored development and research projects and geologic data from all 50 states, this free, interactive tool is opening new exploration opportunities and shortening project development by making data easily discoverable and accessible. We continue to populate our prototype functional data system with multiple data nodes and nationwide data online and available to the public. Data from state geological surveys and partners includes more than 5 million records online, including 1.48 million well headers (oil and gas, water, geothermal), 732,000 well logs, and 314,000 borehole temperatures and is growing rapidly. There are over 250 Web services and another 138 WMS (Web Map Services) registered in the system as of August, 2013. Companion projects run by Boise State University, Southern Methodist University, and USGS are adding millions of additional data records. The National Renewable Energy Laboratory is managing the Geothermal Data Repository which will serve as a system node and clearinghouse for data from hundreds of DOE-funded geothermal projects. NGDS is built on the US Geoscience Information Network data integration framework, which is a joint undertaking of the USGS and the Association of American State Geologists (AASG). NGDS is fully compliant with the White House Executive Order of May 2013, requiring all federal agencies to make their data holdings publicly accessible online in open source, interoperable formats with common core and extensible metadata. The National Geothermal Data System is being designed, built, deployed, and populated primarily with grants from the US Department of Energy, Geothermal Technologies Office. 
To keep this operational system sustainable after the original implementation will require four core elements: continued serving of data and applications by providers; maintenance of system operations; a governance structure; and an effective business model. Each of these presents a number of challenges currently under consideration.
NASA Astrophysics Data System (ADS)
Cavallo, Eugenio; Biddoccu, Marcella; Bagagiolo, Giorgia; De Marziis, Massimo; Gaia Forni, Emanuela; Alemanno, Laura; Ferraris, Stefano; Canone, Davide; Previati, Maurizio; Turconi, Laura; Arattano, Massimo; Coviello, Velio
2016-04-01
Environmental sensor monitoring is continuously developing, both in terms of quantity (i.e. measurement sites) and quality (i.e. technological innovation). Environmental monitoring is carried out by either public or private entities for their own specific purposes, such as scientific research, civil protection, support to industrial and agricultural activities, services for citizens, security, education, and information. However, the acquired datasets can have cross-cutting appeal, proving useful for purposes beyond their original intended use. The CIRCE project (Cooperative Internet-of-Data Rural-alpine Community Environment) aimed to gather, manage, use and distribute data obtained from sensors and from people, in a multipurpose approach. The CIRCE project was selected within a call for tender launched by the Piedmont Region (in collaboration with CSI Piemonte) in order to improve the digital ecosystem represented by YUCCA, an open source platform oriented to the acquisition, sharing and reuse of data resulting from both real-time and on-demand applications. The partnership of the CIRCE project comprised scientific research bodies (IMAMOTER-CNR, IRPI-CNR, DIST) together with SMEs involved in the environmental monitoring and ICT sectors (namely 3a srl, EnviCons srl, Impresa Verde Cuneo srl, and NetValue srl). Within the project, a shared network of agro-meteo-hydrological sensors was created, followed by a platform and interface for the collection, management and distribution of data. The CIRCE network currently comprises 171 sensors remotely connected and originally belonging to different networks. They are set up to monitor and investigate agro-meteo-hydrological processes in different rural and mountain areas of the Piedmont Region (NW Italy), including some very sensitive locations that are difficult to access.
The sensor networks differ from one another in terms of purpose of monitoring, monitored parameters, instrumentation, system architecture, and data acquisition and communication processes. In addition to real-time data, the CIRCE database includes many historical datasets, which were harmonized with the adopted database architecture. Such datasets were collected before the implementation of the project, both from the connected sensors and from sensors no longer active. To help reduce the gap between the research community and end users, dedicated apps for smartphones and tablets were created. These tools facilitate access to, and enrichment of, the CIRCE database both for the hydrological section (APP IDRO) and for the agro-meteorological section (APP AGRO). Non-specialists can enrich the point sensor data by sending qualitative and quantitative information about the observed processes (e.g. watercourse levels, erosion processes, presence of pathogens, damage pictures, etc.). The territorial investigation and the data acquisition also involved groups of citizens (namely farmers, technicians and volunteers), who were engaged in creating and testing the informatics tools, in accordance with the "Living Lab" approach. Finally, the CIRCE platform was interfaced with the YUCCA platform, allowing open access to the CIRCE dataset and its integration into the SmartDataNet system of the Regione Piemonte public administration. The CIRCE project was funded by the EU FESR, the Italian Government and Regione Piemonte within the programme Regione Piemonte POR/FESR 2007-2013.
Oceans 2.0 API: Programmatic access to Ocean Networks Canada's sensor data.
NASA Astrophysics Data System (ADS)
Heesemann, M.; Ross, R.; Hoeberechts, M.; Pirenne, B.; MacArthur, M.; Jeffries, M. A.; Morley, M. G.
2017-12-01
Ocean Networks Canada (ONC) is a not-for-profit society that operates and manages innovative cabled observatories on behalf of the University of Victoria. These observatories supply continuous power and Internet connectivity to various scientific instruments located in coastal, deep-ocean and Arctic environments. The data from the instruments are relayed to the University of Victoria, where they are archived, quality-controlled and made freely available to researchers, educators, and the public. The Oceans 2.0 data management system currently contains over 500 terabytes of data collected over 11 years from thousands of sensors. In order to facilitate access to the data, particularly for large datasets and long time series of high-resolution data, a project was started in 2016 to create a comprehensive Application Programming Interface, the "Oceans 2.0 API," to provide programmatic access to all ONC data products. The development is part of a project entitled "A Research Platform for User-Defined Oceanographic Data Products," funded through CANARIE, a Canadian organization responsible for the design and delivery of digital infrastructure for research, education and innovation [1]. Providing quick and easy access to ONC data products from within custom software solutions allows researchers, modelers and decision makers to focus on what is important: solving their problems, answering their questions and making informed decisions. In this paper, we discuss how to access ONC's vast archive of data programmatically through the Oceans 2.0 API, covering access to ONC data products, access to ONC sensor data in near real-time, programming language support, and use cases. References: [1] CANARIE. Internet: https://www.canarie.ca/; accessed March 6, 2017.
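Programmatic access to a sensor-data REST API of this kind typically amounts to building a tokenized, time-bounded query. The endpoint path, method name, and parameter names below are illustrative placeholders, not the documented Oceans 2.0 API:

```python
from urllib.parse import urlencode

BASE = "https://data.oceannetworks.ca/api"  # host taken as given; path below is hypothetical

def scalar_data_request(station, sensor, begin, end, token):
    """Sketch of a time-bounded scalar-data request.

    All parameter names here are assumptions for illustration; long time
    series would normally be fetched in successive date-range chunks.
    """
    query = urlencode({
        "method": "getData",   # hypothetical method name
        "station": station,
        "sensor": sensor,
        "dateFrom": begin,     # ISO 8601 timestamps
        "dateTo": end,
        "token": token,        # per-user API token
    })
    return f"{BASE}/scalardata?{query}"

req = scalar_data_request("NC89", "temperature",
                          "2016-01-01T00:00:00Z", "2016-01-02T00:00:00Z",
                          "YOUR-TOKEN")
```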
Oscar, Nels; Fox, Pamela A; Croucher, Racheal; Wernick, Riana; Keune, Jessica; Hooker, Karen
2017-09-01
Social scientists need practical methods for harnessing large, publicly available datasets that inform the social context of aging. We describe our development of a semi-automated text coding method and use a content analysis of Alzheimer's disease (AD) and dementia portrayal on Twitter to demonstrate its use. The approach improves feasibility of examining large publicly available datasets. Machine learning techniques modeled stigmatization expressed in 31,150 AD-related tweets collected via Twitter's search API based on 9 AD-related keywords. Two researchers manually coded 311 random tweets on 6 dimensions. This input from 1% of the dataset was used to train a classifier against the tweet text and code the remaining 99% of the dataset. Our automated process identified that 21.13% of the AD-related tweets used AD-related keywords to perpetuate public stigma, which could impact stereotypes and negative expectations for individuals with the disease and increase "excess disability". This technique could be applied to questions in social gerontology related to how social media outlets reflect and shape attitudes bearing on other developmental outcomes. Recommendations for the collection and analysis of large Twitter datasets are discussed. © The Author 2017. Published by Oxford University Press on behalf of The Gerontological Society of America. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
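The semi-automated coding workflow described above, i.e. hand-code a small seed sample, train a text classifier, then label the remainder, can be sketched with a minimal bag-of-words naive Bayes classifier. The classifier choice and the toy tweets below are illustrative assumptions, not the authors' model or data:

```python
import math, re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(labeled):
    """Count word frequencies per class from a small hand-coded sample."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in labeled:
        class_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Assign the class with the highest naive Bayes log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in tokenize(text):
            # Laplace smoothing over the training vocabulary
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab) + 1))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

seed = [  # toy stand-in for the manually coded 1% of tweets
    ("grandma is losing her mind lol", "stigmatizing"),
    ("total alzheimers moment, so dumb", "stigmatizing"),
    ("new alzheimers research offers hope", "informational"),
    ("fundraiser for dementia care today", "informational"),
]
model = train(seed)
label = classify("what an alzheimers moment lol", model)
```

The trained model is then applied to the uncoded 99% of the corpus in a single pass, with periodic manual spot-checks of the predicted labels.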
Cloud-Based Mobile Application Development Tools and NASA Science Datasets
NASA Astrophysics Data System (ADS)
Oostra, D.; Lewis, P. M.; Chambers, L. H.; Moore, S. W.
2011-12-01
A number of cloud-based visual development tools have emerged that provide methods for developing mobile applications quickly and without previous programming experience. This paper will explore how our new and current data users can best combine these cloud-based mobile application tools and available NASA climate science datasets. Our vision is that users will create their own mobile applications for visualizing our data and will develop tools for their own needs. The approach we are documenting is based on two main ideas. The first is to provide training and information. Through examples, sharing experiences, and providing workshops, users can be shown how to use free online tools to easily create mobile applications that interact with NASA datasets. The second approach is to provide application programming interfaces (APIs), databases, and web applications to access data in a way that educators, students and scientists can quickly integrate it into their own mobile application development. This framework allows us to foster development activities and boost interaction with NASA's data while saving resources that would be required for a large internal application development staff. The findings of this work will include data gathered through meetings with local data providers, educators, libraries and individuals. From the very first queries into this topic, a high level of interest has been identified from our groups of users. This overt interest, combined with the marked popularity of mobile applications, has created a new channel for outreach and communications between the science and education communities. As a result, we would like to offer educators and other stakeholders some insight into the mobile application development arena, and provide some next steps and new approaches. Our hope is that, through our efforts, we will broaden the scope and usage of NASA's climate science data by providing new ways to access environmentally relevant datasets.
Eilbeck, Karen L.; Lipstein, Julie; McGarvey, Sunanda; Staes, Catherine J.
2014-01-01
The Reportable Condition Knowledge Management System (RCKMS) is envisioned to be a single, comprehensive, authoritative, real-time portal to author, view and access computable information about reportable conditions. The system is designed for use by hospitals, laboratories, health information exchanges, and providers to meet public health reporting requirements. The RCKMS Knowledge Representation Workgroup was tasked to explore the need for ontologies to support RCKMS functionality. The workgroup reviewed relevant projects and defined criteria to evaluate candidate knowledge domain areas for ontology development. The use of ontologies is justified for this project to unify the semantics used to describe similar reportable events and concepts between different jurisdictions and over time, to aid data integration, and to manage large, unwieldy datasets that evolve, and are sometimes externally managed. PMID:25954354
Paretti, Nicholas V.; Kennedy, Jeffrey R.; Turney, Lovina A.; Veilleux, Andrea G.
2014-01-01
The regional regression equations were integrated into the U.S. Geological Survey’s StreamStats program. The StreamStats program is a national map-based web application that allows the public to easily access published flood frequency and basin characteristic statistics. The interactive web application allows a user to select a point within a watershed (gaged or ungaged) and retrieve flood-frequency estimates derived from the current regional regression equations and geographic information system data within the selected basin. StreamStats provides users with an efficient and accurate means for retrieving the most up to date flood frequency and basin characteristic data. StreamStats is intended to provide consistent statistics, minimize user error, and reduce the need for large datasets and costly geographic information system software.
In-field Access to Geoscientific Metadata through GPS-enabled Mobile Phones
NASA Astrophysics Data System (ADS)
Hobona, Gobe; Jackson, Mike; Jordan, Colm; Butchart, Ben
2010-05-01
Fieldwork is an integral part of much geosciences research. But whilst geoscientists have physical or online access to data collections in the laboratory or at base stations, equivalent in-field access is not standard or straightforward. The increasing availability of mobile internet and GPS-supported mobile phones, however, now provides the basis for addressing this issue. The SPACER project was commissioned by the Rapid Innovation initiative of the UK Joint Information Systems Committee (JISC) to explore the potential for GPS-enabled mobile phones to access geoscientific metadata collections. Metadata collections within the geosciences and the wider geospatial domain can be disseminated through web services based on the Catalogue Service for the Web (CSW) standard of the Open Geospatial Consortium (OGC) - a global grouping of over 380 private, public and academic organisations aiming to improve interoperability between geospatial technologies. CSW offers an XML-over-HTTP interface for querying and retrieval of geospatial metadata. By default, the metadata returned by CSW is based on the ISO 19115 standard and encoded in XML conformant to ISO 19139. The SPACER project has created a prototype application that enables mobile phones to send queries to CSW containing user-defined keywords and coordinates acquired from the GPS devices built into the phones. The prototype has been developed using the free and open source Google Android platform. The mobile application offers views for listing titles, presenting multiple metadata elements and a Google Map with an overlay of the bounding coordinates of datasets. The presentation will describe the architecture and approach applied in the development of the prototype.
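A CSW query combining a user keyword with a bounding box around a GPS fix looks roughly like the following GetRecords request body. This is a pared-down sketch of the kind of request such a prototype would issue, not SPACER's actual code; a production client would use a proper XML builder and the full OGC Filter schema:

```python
def csw_getrecords_body(keyword, lat, lon, radius_deg=0.5):
    """Simplified CSW 2.0.2 GetRecords POST body: keyword search ANDed
    with a bounding box centred on a GPS position (degrees)."""
    return f"""<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ogc="http://www.opengis.net/ogc" service="CSW" version="2.0.2"
    resultType="results">
  <csw:Query typeNames="csw:Record">
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <ogc:And>
          <ogc:PropertyIsLike wildCard="%" singleChar="_" escapeChar="\\">
            <ogc:PropertyName>AnyText</ogc:PropertyName>
            <ogc:Literal>%{keyword}%</ogc:Literal>
          </ogc:PropertyIsLike>
          <ogc:BBOX>
            <ogc:PropertyName>ows:BoundingBox</ogc:PropertyName>
            <gml:Envelope xmlns:gml="http://www.opengis.net/gml">
              <gml:lowerCorner>{lat - radius_deg} {lon - radius_deg}</gml:lowerCorner>
              <gml:upperCorner>{lat + radius_deg} {lon + radius_deg}</gml:upperCorner>
            </gml:Envelope>
          </ogc:BBOX>
        </ogc:And>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>"""

# Hypothetical GPS fix near Nottingham, searching for "borehole" records.
body = csw_getrecords_body("borehole", 53.0, -1.0)
```

The body is POSTed to the catalogue endpoint over HTTP, and the ISO 19139 records in the response are parsed for display on the phone.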
Varley-Winter, Olivia; Shah, Hetan
2016-12-28
In order to generate the gains that can come from analysing and linking big datasets, data holders need to consider the ethical frameworks, principles and applications that help to maintain public trust. In the USA, the National Science Foundation helped to set up a Council for Big Data, Ethics and Society, of which there is no equivalent in the UK. In November 2015, the Royal Statistical Society convened a workshop of 28 participants from government, academia and the private sector, and discussed the practical priorities that might be assisted by a new Council of Data Ethics in the UK. This article draws together the views from that meeting. Priorities for policy-makers and others include seeking a public mandate and informing the terms of the social contract for use of data; building professional competence and due diligence on data protection; appointment of champions who are competent to address public concerns; and transparency, across all dimensions. For government data, further priorities include improvements to data access, and development of data infrastructure. In conclusion, we support the establishment of a national Data Ethics Council, alongside wider and deeper engagement of the public to address data ethics dilemmas.This article is part of the themed issue 'The ethical impact of data science'. © 2016 The Author(s).
Statistical Reference Datasets
National Institute of Standards and Technology Data Gateway
Statistical Reference Datasets (Web, free access). The Statistical Reference Datasets project is also supported by the Standard Reference Data Program. Its purpose is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software.
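The evaluation pattern the StRD project enables is simple: run your statistical routine on a reference dataset, then compare the result against the certified value to many digits. A minimal sketch, where both the data points and the "certified" slope are made up for illustration (real certified values come from the NIST StRD site):

```python
def linear_fit(xs, ys):
    """Ordinary least squares y = b0 + b1*x via the textbook formulas."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Hypothetical reference dataset and certified result, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.0, 7.1, 8.9]
b0, b1 = linear_fit(xs, ys)
certified_b1 = 1.95          # hypothetical certified slope
agrees = abs(b1 - certified_b1) < 1e-9
```

Certified StRD results are stated to enough digits that agreement (or its absence) exposes numerically unstable implementations, e.g. of the normal equations on ill-conditioned data.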
Evaluating Geologic Sources of Arsenic in Well Water in Virginia (USA)
VanDerwerker, Tiffany; Zhang, Lin; Ling, Erin; Benham, Brian; Schreiber, Madeline
2018-01-01
We investigated if geologic factors are linked to elevated arsenic (As) concentrations above 5 μg/L in well water in the state of Virginia, USA. Using geologic unit data mapped within GIS and two datasets of measured As concentrations in well water (one from public wells, the other from private wells), we evaluated occurrences of elevated As (above 5 μg/L) based on geologic unit. We also constructed a logistic regression model to examine statistical relationships between elevated As and geologic units. Two geologic units, including Triassic-aged sedimentary rocks and Triassic-Jurassic intrusives of the Culpeper Basin in north-central Virginia, had higher occurrences of elevated As in well water than other geologic units in Virginia. Model results support these patterns, showing a higher probability for As occurrence above 5 μg/L in well water in these two units. Due to the lack of observations (<5%) having elevated As concentrations in our data set, our model cannot be used to predict As concentrations in other parts of the state. However, our results are useful for identifying areas of Virginia, defined by underlying geology, that are more likely to have elevated As concentrations in well water. Due to the ease of obtaining publicly available data and the accessibility of GIS, this study approach can be applied to other areas with existing datasets of As concentrations in well water and accessible data on geology. PMID:29670010
Ruffier, Magali; Kähäri, Andreas; Komorowska, Monika; Keenan, Stephen; Laird, Matthew; Longden, Ian; Proctor, Glenn; Searle, Steve; Staines, Daniel; Taylor, Kieron; Vullo, Alessandro; Yates, Andrew; Zerbino, Daniel; Flicek, Paul
2017-01-01
The Ensembl software resources are a stable infrastructure to store, access and manipulate genome assemblies and their functional annotations. The Ensembl 'Core' database and Application Programming Interface (API) was our first major piece of software infrastructure and remains at the centre of all of our genome resources. Since its initial design more than fifteen years ago, the number of publicly available genomic, transcriptomic and proteomic datasets has grown enormously, accelerated by continuous advances in DNA-sequencing technology. Initially intended to provide annotation for the reference human genome, we have extended our framework to support the genomes of all species as well as richer assembly models. Cross-referenced links to other informatics resources facilitate searching our database with a variety of popular identifiers such as UniProt and RefSeq. Our comprehensive and robust framework storing a large diversity of genome annotations in one location serves as a platform for other groups to generate and maintain their own tailored annotation. We welcome reuse and contributions: our databases and APIs are publicly available, all of our source code is released with a permissive Apache v2.0 licence at http://github.com/Ensembl and we have an active developer mailing list ( http://www.ensembl.org/info/about/contact/index.html ). http://www.ensembl.org. © The Author(s) 2017. Published by Oxford University Press.
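Alongside the Perl Core API described in the abstract, Ensembl also exposes its annotation over a public REST service at rest.ensembl.org. As a sketch, the request below builds (but does not send) a lookup for a stable gene identifier; the `;expand=1` flag is the REST service's convention for including child features:

```python
from urllib.parse import quote

BASE = "https://rest.ensembl.org"

def lookup_url(stable_id, expand=False):
    """URL for looking up an Ensembl stable ID (gene, transcript, ...)
    as JSON via the REST service."""
    url = f"{BASE}/lookup/id/{quote(stable_id)}?content-type=application/json"
    if expand:
        url += ";expand=1"
    return url

url = lookup_url("ENSG00000157764")   # BRAF, as an example stable ID
```

The JSON response carries the feature's coordinates, biotype, and display name, and further endpoints resolve cross-references to identifiers such as UniProt and RefSeq.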
Booth, N.L.; Everman, E.J.; Kuo, I.-L.; Sprague, L.; Murphy, L.
2011-01-01
The U.S. Geological Survey National Water Quality Assessment Program has completed a number of water-quality prediction models for nitrogen and phosphorus for the conterminous United States as well as for regional areas of the nation. In addition to estimating water-quality conditions at unmonitored streams, the calibrated SPAtially Referenced Regressions On Watershed attributes (SPARROW) models can be used to produce estimates of yield, flow-weighted concentration, or load of constituents in water under various land-use condition, change, or resource management scenarios. A web-based decision support infrastructure has been developed to provide access to SPARROW simulation results on stream water-quality conditions and to offer sophisticated scenario testing capabilities for research and water-quality planning via a graphical user interface with familiar controls. The SPARROW decision support system (DSS) is delivered through a web browser over an Internet connection, making it widely accessible to the public in a format that allows users to easily display water-quality conditions and to describe, test, and share modeled scenarios of future conditions. SPARROW models currently supported by the DSS are based on the modified digital versions of the 1:500,000-scale River Reach File (RF1) and 1:100,000-scale National Hydrography Dataset (medium-resolution, NHDPlus) stream networks. ?? 2011 American Water Resources Association. This article is a U.S. Government work and is in the public domain in the USA.
Federal Register 2010, 2011, 2012, 2013, 2014
2013-05-01
... NATIONAL SCIENCE FOUNDATION Public Access to Federally Supported Research and Development Data and... for Health and Human Services, Agency for Healthcare Research and Quality, Centers for Disease Control... Veterans Affairs, Environmental Protection Agency, Institute of Museum and Library Services, National...
A Generalized Distributed Data Match-Up Service in Support of Oceanographic Application
NASA Astrophysics Data System (ADS)
Tsontos, V. M.; Huang, T.; Holt, B.; Smith, S. R.; Bourassa, M. A.; Worley, S. J.; Ji, Z.; Elya, J. L.; Stallard, A. P.
2016-02-01
Oceanographic applications increasingly rely on the integration and colocation of satellite and field observations providing complementary data coverage over a continuum of spatio-temporal scales. Here we report on a collaborative venture between NASA/JPL, NCAR and FSU/COAPS to develop a Distributed Oceanographic Match-up Service (DOMS). The DOMS project aims to implement a technical infrastructure providing a generalized, publicly accessible data collocation capability for satellite and in situ datasets utilizing remote data stores in support of satellite mission cal/val and a range of research and operational applications. The service will provide a mechanism for users to specify geospatial references and receive collocated satellite and field observations within the selected spatio-temporal domain and matchup window extent. DOMS will include several representative in situ and satellite datasets. Field data will focus on surface observations from NCAR's International Comprehensive Ocean-Atmosphere Data Set (ICOADS), the Shipboard Automated Meteorological and Oceanographic System Initiative (SAMOS) at FSU/COAPS, and the Salinity Processes in the Upper Ocean Regional Study (SPURS) data hosted at JPL/PO.DAAC. Satellite data will include JPL ASCAT L2 12.5 km winds, the Aquarius L2 orbital dataset, MODIS L2 swath data, and the high-resolution gridded L4 MUR-SST product. Importantly, while DOMS will be developed with these select datasets, it will be readily extendable for other in situ and satellite data collections and easily ported to other remote providers, thus potentially supporting additional science disciplines. Technical challenges to be addressed include: 1) ensuring accurate, efficient, and scalable match-up algorithm performance, 2) undertaking colocation using datasets that are distributed on the network, and 3) returning matched observations with sufficient metadata so that value differences can be properly interpreted. 
DOMS leverages existing technologies (EDGE, w10n, OPeNDAP, relational and graph/triple-store databases) and cloud computing. It will implement both a web portal interface for users to review and submit match-up requests interactively and underlying web service interface facilitating large-scale and automated machine-to-machine based queries.
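The matchup criterion itself, i.e. pair an in situ observation with every satellite observation inside a spatial radius and time window, can be sketched as below. DOMS distributes this search across remote data stores with spatial indexing; this brute-force O(n*m) loop over toy observations only illustrates the collocation rule:

```python
import math
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def matchup(insitu, satellite, max_km, max_dt):
    """Collocate observations given as (lat, lon, time, value) tuples."""
    pairs = []
    for obs in insitu:
        for sat in satellite:
            if (haversine_km(obs[0], obs[1], sat[0], sat[1]) <= max_km
                    and abs(obs[2] - sat[2]) <= max_dt):
                pairs.append((obs, sat))
    return pairs

# Toy data: one ship SST report against two satellite pixels.
t0 = datetime(2015, 6, 1, 12, 0)
insitu = [(10.0, -140.0, t0, 28.1)]
satellite = [
    (10.05, -140.02, t0 + timedelta(hours=2), 28.3),  # ~6 km away, in window
    (12.00, -140.00, t0 + timedelta(hours=2), 27.0),  # ~220 km away
]
pairs = matchup(insitu, satellite, max_km=25.0, max_dt=timedelta(hours=6))
```

Returning each pair with full provenance metadata, as the abstract notes, is what lets users interpret the value differences between the matched observations.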
ESTuber db: an online database for Tuber borchii EST sequences.
Lazzari, Barbara; Caprera, Andrea; Cosentino, Cristian; Stella, Alessandra; Milanesi, Luciano; Viotti, Angelo
2007-03-08
The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.
Geoscience data visualization and analysis using GeoMapApp
NASA Astrophysics Data System (ADS)
Ferrini, Vicki; Carbotte, Suzanne; Ryan, William; Chan, Samantha
2013-04-01
Increased availability of geoscience data resources has resulted in new opportunities for developing visualization and analysis tools that not only promote data integration and synthesis, but also facilitate quantitative cross-disciplinary access to data. Interdisciplinary investigations, in particular, frequently require visualizations and quantitative access to specialized data resources across disciplines, which has historically required specialist knowledge of data formats and software tools. GeoMapApp (www.geomapapp.org) is a free online data visualization and analysis tool that provides direct quantitative access to a wide variety of geoscience data for a broad international interdisciplinary user community. While GeoMapApp provides access to online data resources, it can also be packaged to work offline through the deployment of a small portable hard drive. This mode of operation can be particularly useful during field programs to provide functionality and direct access to data when a network connection is not possible. Hundreds of data sets from a variety of repositories are directly accessible in GeoMapApp, without the need for the user to understand the specifics of file formats or data reduction procedures. Available data include global and regional gridded data, images, as well as tabular and vector datasets. In addition to basic visualization and data discovery functionality, users are provided with simple tools for creating customized maps and visualizations and to quantitatively interrogate data. Specialized data portals with advanced functionality are also provided for power users to further analyze data resources and access underlying component datasets. Users may import and analyze their own geospatial datasets by loading local versions of geospatial data and can access content made available through Web Feature Services (WFS) and Web Map Services (WMS). 
Once data are loaded in GeoMapApp, a variety of options is provided to export data and/or 2D/3D visualizations into common formats, including grids, images, text files, and spreadsheets. Examples of interdisciplinary investigations that make use of GeoMapApp visualization and analysis functionality will be provided.
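The WMS access mentioned above follows the standard OGC GetMap key-value encoding, sketched here. The endpoint URL and layer name in the example call are placeholders, not actual GeoMapApp services:

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build an OGC WMS 1.1.1 GetMap request for a lat/lon bounding box."""
    params = {
        "service": "WMS",
        "version": "1.1.1",
        "request": "GetMap",
        "layers": layer,
        "styles": "",
        "srs": "EPSG:4326",
        "bbox": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "width": width,
        "height": height,
        "format": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"

url = wms_getmap_url("https://example.org/wms", "global_topography",
                     (-180, -90, 180, 90))
```

A client like GeoMapApp issues such requests tile by tile as the user pans and zooms, compositing the returned images over its base maps.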
Description of the U.S. Geological Survey Geo Data Portal data integration framework
Blodgett, David L.; Booth, Nathaniel L.; Kunicki, Thomas C.; Walker, Jordan I.; Lucido, Jessica M.
2012-01-01
The U.S. Geological Survey has developed an open-standard data integration framework for working efficiently and effectively with large collections of climate and other geoscience data. A web interface accesses catalog datasets to find data services. Data resources can then be rendered for mapping and dataset metadata are derived directly from these web services. Algorithm configuration and information needed to retrieve data for processing are passed to a server where all large-volume data access and manipulation takes place. The data integration strategy described here was implemented by leveraging existing free and open source software. Details of the software used are omitted; rather, emphasis is placed on how open-standard web services and data encodings can be used in an architecture that integrates common geographic and atmospheric data.
24 CFR 570.508 - Public access to program records.
Code of Federal Regulations, 2010 CFR
2010-04-01
... 24 Housing and Urban Development 3 2010-04-01 2010-04-01 false Public access to program records. 570.508 Section 570.508 Housing and Urban Development Regulations Relating to Housing and Urban Development (Continued) OFFICE OF ASSISTANT SECRETARY FOR COMMUNITY PLANNING AND DEVELOPMENT, DEPARTMENT OF...
SPICE: exploration and analysis of post-cytometric complex multivariate datasets.
Roederer, Mario; Nozzi, Joshua L; Nason, Martha C
2011-02-01
Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.
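The thresholding issue raised in the abstract, i.e. that noise-level components distort pie-chart representations, can be illustrated with a minimal sketch: drop components below a frequency threshold and renormalize the remainder. The threshold value and normalization rule here are illustrative, not SPICE's actual defaults:

```python
def threshold_for_pie(fractions, threshold=0.001):
    """Zero out components below a frequency threshold and renormalize
    so the surviving fractions again sum to 1 for pie display."""
    kept = {k: v for k, v in fractions.items() if v >= threshold}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}

# Toy T cell functional profile: fraction of responding cells per
# cytokine combination (hypothetical values).
profile = {"IFNg+IL2+TNF+": 0.50, "IFNg+IL2+": 0.30,
           "IFNg+": 0.1995, "TNF+": 0.0005}
clean = threshold_for_pie(profile)
```

Whether to renormalize (versus display the removed mass explicitly) matters when comparing normalized against unnormalized data across specimen groups, which is one of the display choices the paper discusses.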
STScI Archive Manual, Version 7.0
NASA Astrophysics Data System (ADS)
Padovani, Paolo
1999-06-01
The STScI Archive Manual provides the information a user needs to access the HST archive via its two user interfaces: StarView and a World Wide Web (WWW) interface. It provides descriptions of the StarView screens used to access information in the database and the format of that information, and introduces the user to the WWW interface. Using the two interfaces, users can search for observations, preview public data, and retrieve data from the archive. Using StarView one can also find calibration reference files and perform detailed association searches. With the WWW interface, archive users can access, and obtain information on, all Multimission Archive at Space Telescope (MAST) data, a collection of mainly optical and ultraviolet datasets which include, amongst others, the International Ultraviolet Explorer (IUE) Final Archive. Both interfaces feature a name resolver which simplifies searches based on target name.
NASA Technical Reports Server (NTRS)
Lloyd, Steven; Acker, James G.; Prados, Ana I.; Leptoukh, Gregory G.
2008-01-01
One of the biggest obstacles for the average Earth science student today is locating and obtaining satellite-based remote sensing datasets in a format that is accessible and optimal for their data analysis needs. At the Goddard Earth Sciences Data and Information Services Center (GES DISC) alone, on the order of hundreds of terabytes of data are available for distribution to scientists, students, and the general public. The single biggest and most time-consuming hurdle for most students when they begin their study of the various datasets is how to slog through this mountain of data to arrive at a properly subsetted and manageable dataset to answer their science question(s). The GES DISC provides a number of tools for data access and visualization, including the Google-like Mirador search engine and the powerful GES-DISC Interactive Online Visualization ANd aNalysis Infrastructure (Giovanni) web interface.
Ng, Kok-Hoe
2016-06-01
The study aims to project future trends in living arrangements and access to children's cash contributions and market income sources among older people in Hong Kong. A cell-based model was constructed by combining available population projections, labour force projections, an extrapolation of the historical trend in living arrangements based on national survey datasets and a regression model on income sources. Under certain assumptions, the proportion of older people living with their children may decline from 59 to 48% during 2006-2030. Although access to market income sources may improve slightly, up to 20% of older people may have no access to either children's financial support or market income sources, and will not live with their children by 2030. Family support is expected to contract in the next two decades. Public pensions should be expanded to protect financially vulnerable older people. © 2015 AJA Inc.
Robertson, Tim; Döring, Markus; Guralnick, Robert; Bloom, David; Wieczorek, John; Braak, Kyle; Otegui, Javier; Russell, Laura; Desmet, Peter
2014-01-01
The planet is experiencing an ongoing global biodiversity crisis. Measuring the magnitude and rate of change more effectively requires access to organized, easily discoverable, and digitally-formatted biodiversity data, both legacy and new, from across the globe. Assembling this coherent digital representation of biodiversity requires the integration of data that have historically been analog, dispersed, and heterogeneous. The Integrated Publishing Toolkit (IPT) is a software package developed to support biodiversity dataset publication in a common format. The IPT's two primary functions are to 1) encode existing species occurrence datasets and checklists, such as records from natural history collections or observations, in the Darwin Core standard to enhance interoperability of data, and 2) publish and archive data and metadata for broad use in a Darwin Core Archive, a set of files following a standard format. Here we discuss the key need for the IPT, how it has developed in response to community input, and how it continues to evolve to streamline and enhance the interoperability, discoverability, and mobilization of new data types beyond basic Darwin Core records. We close with a discussion of how the IPT has impacted the biodiversity research community and how it enhances data publishing in more traditional journal venues, along with new features implemented in the latest version of the IPT and plans for future enhancements. PMID:25099149
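The first IPT function, encoding occurrence records against Darwin Core terms, can be sketched with plain CSV. The column headers below are genuine Darwin Core term names; the records and the helper function are invented for illustration:

```python
import csv
import io

# A minimal set of Darwin Core terms used as CSV column headers.
FIELDS = ["occurrenceID", "scientificName", "eventDate",
          "decimalLatitude", "decimalLongitude"]

def write_occurrences(records):
    """Serialize occurrence dicts to Darwin Core-style CSV text,
    one row per occurrence, header row first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# An invented occurrence record:
sample = [{"occurrenceID": "occ-1", "scientificName": "Puma concolor",
           "eventDate": "2013-05-04", "decimalLatitude": "39.7",
           "decimalLongitude": "-105.2"}]
```

A real Darwin Core Archive would bundle such a CSV with a `meta.xml` descriptor and dataset metadata, which this sketch omits.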
enDNA-Prot: identification of DNA-binding proteins by applying ensemble learning.
Xu, Ruifeng; Zhou, Jiyun; Liu, Bin; Yao, Lin; He, Yulan; Zou, Quan; Wang, Xiaolong
2014-01-01
DNA-binding proteins are crucial for various cellular processes, such as recognition of specific nucleotides, regulation of transcription, and regulation of gene expression. Developing an effective model for identifying DNA-binding proteins is an urgent research problem. Up to now, many methods have been proposed, but most of them focus on only one classifier and cannot make full use of the large number of negative samples to improve predicting performance. This study proposed a predictor called enDNA-Prot for DNA-binding protein identification by employing the ensemble learning technique. Experimental results showed that enDNA-Prot was comparable with DNA-Prot and outperformed DNAbinder and iDNA-Prot, with performance improvements in the range of 3.97-9.52% in ACC and 0.08-0.19 in MCC. Furthermore, when the benchmark dataset was expanded with negative samples, enDNA-Prot outperformed the three existing methods by 2.83-16.63% in terms of ACC and 0.02-0.16 in terms of MCC. This indicates that enDNA-Prot is an effective method for DNA-binding protein identification and that expanding the training dataset with negative samples can improve its performance. For the convenience of the vast majority of experimental scientists, we developed a user-friendly web server for enDNA-Prot which is freely accessible to the public.
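The ensemble idea behind such a predictor, combining several base classifiers by vote, can be sketched generically. The feature names and threshold rules below are invented; the real predictor's features and learners are not reproduced here:

```python
def majority_vote(classifiers, features):
    """Combine binary base classifiers (each returning 0 or 1)
    by simple majority vote."""
    votes = sum(clf(features) for clf in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0

# Toy base learners: each votes "DNA-binding" (1) from one invented
# protein feature. Real ensembles would train these from data.
base = [
    lambda f: 1 if f["charge"] > 0.2 else 0,
    lambda f: 1 if f["helix_frac"] > 0.3 else 0,
    lambda f: 1 if f["motif_score"] > 0.5 else 0,
]
```

Two of three base learners agreeing is enough to flip the ensemble prediction, which is what makes voting more robust than any single classifier when their errors are uncorrelated.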
Walker, Lindsay; Chang, Lin-Ching; Nayak, Amritha; Irfanoglu, M Okan; Botteron, Kelly N; McCracken, James; McKinstry, Robert C; Rivkin, Michael J; Wang, Dah-Jyuu; Rumsey, Judith; Pierpaoli, Carlo
2016-01-01
The NIH MRI Study of Normal Brain Development sought to characterize typical brain development in a population of infants, toddlers, children and adolescents/young adults, covering the socio-economic and ethnic diversity of the population of the United States. The study began in 1999, with data collection commencing in 2001 and concluding in 2007. The study was designed with the final goal of providing a controlled-access database, open to qualified researchers and clinicians, that could serve as a powerful tool for elucidating typical brain development and identifying deviations associated with brain-based disorders and diseases, and as a resource for developing computational methods and image processing tools. This paper focuses on the DTI component of the NIH MRI Study of Normal Brain Development. In this work, we describe the DTI data acquisition protocols, data processing steps, quality assessment procedures, and data included in the database, along with database access requirements. For more details, visit http://www.pediatricmri.nih.gov. This longitudinal DTI dataset includes raw and processed diffusion data from 498 low-resolution (3 mm) DTI datasets from 274 unique subjects and 193 high-resolution (2.5 mm) DTI datasets from 152 unique subjects. Subjects range in age from 10 days (from date of birth) through 22 years. Additionally, a set of age-specific DTI templates is included. This forms one component of the larger NIH MRI Study of Normal Brain Development, which also includes T1-, T2-, proton density-weighted, and proton magnetic resonance spectroscopy (MRS) imaging data, and demographic, clinical and behavioral data. Published by Elsevier Inc.
Ontology-based meta-analysis of global collections of high-throughput public data.
Kupershmidt, Ilya; Su, Qiaojuan Jane; Grewal, Anoop; Sundaresh, Suman; Halperin, Inbal; Flynn, James; Shekar, Mamatha; Wang, Helen; Park, Jenny; Cui, Wenwu; Wall, Gregory D; Wisotzkey, Robert; Alag, Satnam; Akhtari, Saeid; Ronaghi, Mostafa
2010-09-29
The investigation of the interconnections between the molecular and genetic events that govern biological systems is essential if we are to understand the development of disease and design effective novel treatments. Microarray and next-generation sequencing technologies have the potential to provide this information. However, taking full advantage of these approaches requires that biological connections be made across large quantities of highly heterogeneous genomic datasets. Leveraging the increasingly huge quantities of genomic data in the public domain is fast becoming one of the key challenges in the research community today. We have developed a novel data mining framework that enables researchers to use this growing collection of public high-throughput data to investigate any set of genes or proteins. The connectivity between molecular states across thousands of heterogeneous datasets from microarrays and other genomic platforms is determined through a combination of rank-based enrichment statistics, meta-analyses, and biomedical ontologies. We address data quality concerns through dataset replication and meta-analysis and ensure that the majority of the findings are derived using multiple lines of evidence. As an example of our strategy and the utility of this framework, we apply our data mining approach to explore the biology of brown fat within the context of the thousands of publicly available gene expression datasets. Our work presents a practical strategy for organizing, mining, and correlating global collections of large-scale genomic data to explore normal and disease biology. Using a hypothesis-free approach, we demonstrate how a data-driven analysis across very large collections of genomic data can reveal novel discoveries and evidence to support existing hypotheses.
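A minimal rank-based enrichment score, far simpler than the combination of enrichment statistics and meta-analyses described above, could be sketched as follows (gene names are invented):

```python
def mean_rank_enrichment(ranked_genes, gene_set):
    """Signed mean-rank enrichment of `gene_set` in a ranked gene list:
    -1 means the set sits at the very top of the ranking, +1 at the
    very bottom, and values near 0 indicate no enrichment."""
    n = len(ranked_genes)
    ranks = [i for i, g in enumerate(ranked_genes) if g in gene_set]
    if not ranks:
        raise ValueError("no gene from the set appears in the ranking")
    expected = (n - 1) / 2  # mean rank of a randomly placed set
    return (sum(ranks) / len(ranks) - expected) / expected

# A toy differential-expression ranking, most up-regulated first:
ranking = ["g%d" % i for i in range(1, 11)]
```

Across thousands of datasets, scores like this would then be pooled by meta-analysis; this sketch covers only the single-dataset step.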
Ó Conchúir, Shane; Barlow, Kyle A; Pache, Roland A; Ollikainen, Noah; Kundert, Kale; O'Meara, Matthew J; Smith, Colin A; Kortemme, Tanja
2015-01-01
The development and validation of computational macromolecular modeling and design methods depend on suitable benchmark datasets and informative metrics for comparing protocols. In addition, if a method is intended to be adopted broadly in diverse biological applications, there needs to be information on appropriate parameters for each protocol, as well as metrics describing the expected accuracy compared to experimental data. In certain disciplines, there exist established benchmarks and public resources where experts in a particular methodology are encouraged to supply their most efficient implementation of each particular benchmark. We aim to provide such a resource for protocols in macromolecular modeling and design. We present a freely accessible web resource (https://kortemmelab.ucsf.edu/benchmarks) to guide the development of protocols for protein modeling and design. The site provides benchmark datasets and metrics to compare the performance of a variety of modeling protocols using different computational sampling methods and energy functions, providing a "best practice" set of parameters for each method. Each benchmark has an associated downloadable benchmark capture archive containing the input files, analysis scripts, and tutorials for running the benchmark. The captures may be run with any suitable modeling method; we supply command lines for running the benchmarks using the Rosetta software suite. We have compiled initial benchmarks for the resource spanning three key areas: prediction of energetic effects of mutations, protein design, and protein structure prediction, each with associated state-of-the-art modeling protocols. With the help of the wider macromolecular modeling community, we hope to expand the variety of benchmarks included on the website and continue to evaluate new iterations of current methods as they become available.
The Gulf of Mexico Coastal Ocean Observing System: A Decade of Data Aggregation and Services.
NASA Astrophysics Data System (ADS)
Howard, M.; Gayanilo, F.; Kobara, S.; Baum, S. K.; Currier, R. D.; Stoessel, M. M.
2016-02-01
The Gulf of Mexico Coastal Ocean Observing System Regional Association (GCOOS-RA) celebrated its 10-year anniversary in 2015. GCOOS-RA is one of 11 RAs organized under the NOAA-led U.S. Integrated Ocean Observing System (IOOS) Program Office to aggregate regional data and make these data publicly available in preferred forms and formats via standards-based web services. Initial development of GCOOS focused on building elements of the IOOS Data Management and Communications Plan, which is a framework for end-to-end interoperability. These elements included: data discovery, catalog, metadata, online browse, data access and transport. Initial data types aggregated included near real-time physical oceanographic, marine meteorological and satellite data. Our focus in the middle of the past decade was on the production of basic products such as maps of current oceanographic conditions and quasi-static datasets such as bathymetry and climatologies. In the latter part of the decade we incorporated historical physical oceanographic datasets and historical coastal and offshore water quality data into our holdings and added our first biological dataset. We also developed web environments and products to support citizen scientists and stakeholder groups such as recreational boaters. Current efforts are directed towards applying data quality assurance (testing and flagging) to non-federal data, data archiving at national repositories, serving and visualizing numerical model output, providing data services for glider operators, and supporting marine biodiversity observing networks. GCOOS Data Management works closely with the Gulf of Mexico Research Initiative Information and Data Cooperative and various groups involved with Gulf Restoration. GCOOS-RA has influenced attitudes and behaviors associated with good data stewardship and data management practices across the Gulf and will continue to do so into the next decade.
Measuring river from the cloud - River width algorithm development on Google Earth Engine
NASA Astrophysics Data System (ADS)
Yang, X.; Pavelsky, T.; Allen, G. H.; Donchyts, G.
2017-12-01
Rivers are some of the most dynamic features of the terrestrial land surface. They help distribute freshwater, nutrients, and sediment, and they are also responsible for some of the greatest natural hazards. Despite their importance, our understanding of river behavior is limited at the global scale, in part because we do not have a river observational dataset that spans both time and space. Remote sensing data represent a rich, largely untapped resource for observing river dynamics. In particular, publicly accessible archives of satellite optical imagery, which date back to the 1970s, can be used to study the planview morphodynamics of rivers at the global scale. Here we present an image processing algorithm, developed on the Google Earth Engine cloud-based platform, that automatically extracts river centerlines and widths from Landsat 5, 7, and 8 scenes at 30 m resolution. Our algorithm makes use of the latest monthly global surface water history dataset and the existing Global River Width from Landsat (GRWL) dataset to efficiently extract river masks from each Landsat scene. A combination of distance transform and skeletonization techniques is then used to extract river centerlines. Finally, our algorithm calculates the wetted river width at each centerline pixel, perpendicular to its local centerline direction. We validated this algorithm using in situ data from 16 USGS gauge stations (N=1781). We find that 92% of the width differences are within 60 m (i.e., the minimum length of 2 Landsat pixels). Leveraging Earth Engine's infrastructure of collocated data and processing power, our goal is to use this algorithm to reconstruct the morphodynamic history of rivers globally by processing over 100,000 Landsat 5 scenes covering 1984 to 2013.
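The distance-transform step of such a width algorithm can be sketched on a toy binary water mask. This is an illustrative reimplementation, not the authors' Earth Engine code; the assumption that the mask has a non-water border, and the helper names, are ours:

```python
from collections import deque

def distance_transform(mask):
    """4-connected multi-source BFS distance from each water pixel to
    the nearest non-water pixel. Assumes the mask has a non-water
    border (pad the scene if necessary)."""
    rows, cols = len(mask), len(mask[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:          # seed BFS from all non-water pixels
                dist[r][c] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist

def width_at(mask, r, c, pixel_size=30.0):
    """Wetted width at a centerline pixel: twice the pixel distance to
    the nearest bank, minus one pixel, scaled by pixel size
    (30 m for Landsat)."""
    d = distance_transform(mask)[r][c]
    return (2 * d - 1) * pixel_size

# Toy scene: a straight channel 3 pixels wide in a 7 x 12 grid.
mask = [[1 if 2 <= r <= 4 and 1 <= c <= 10 else 0 for c in range(12)]
        for r in range(7)]
```

At the channel's central row this yields 3 pixels, i.e. 90 m at Landsat resolution; the published algorithm additionally skeletonizes the mask to find those centerline pixels and measures perpendicular to the local flow direction.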
Preserving Data for Renewable Energy
NASA Astrophysics Data System (ADS)
Macduff, M.; Sivaraman, C.
2017-12-01
The EERE Atmosphere to Electrons (A2e) program established the Data Archive and Portal (DAP) to ensure the long-term preservation of and access to A2e research data. The DAP has been operated by PNNL for 2 years, holding data from more than a dozen projects, with 1 PB of data and hundreds of datasets expected to be stored this year. The data are a diverse mix of model runs, observational data, and derived products. While most of the data is public, the DAP has securely stored many proprietary datasets provided by energy producers that are critical to the research goals of the A2e program. The DAP uses Amazon Web Services (AWS) and PNNL resources to provide long-term archival and access to the data with appropriate access controls. As a key element of the DAP, metadata are collected for each dataset to assist with data discovery and the usefulness of the data. Further, the DAP has begun a process of standardizing observation data into NetCDF, which allows users to focus on the data instead of parsing the many formats. Creating a central repository that is in tune with the unique needs of the A2e research community is helping active tasks today as well as making many future research efforts possible. In this presentation, we provide an overview of the DAP's capabilities and benefits to the renewable energy community.
Arend, Daniel; Lange, Matthias; Pape, Jean-Michel; Weigelt-Fischer, Kathleen; Arana-Ceballos, Fernando; Mücke, Ingo; Klukas, Christian; Altmann, Thomas; Scholz, Uwe; Junker, Astrid
2016-01-01
With the implementation of novel automated, high-throughput methods and facilities in the last years, plant phenomics has developed into a highly interdisciplinary research domain integrating biology, engineering and bioinformatics. Here we present a dataset of a non-invasive high-throughput plant phenotyping experiment, which uses image- and image analysis-based approaches to monitor the growth and development of 484 Arabidopsis thaliana plants (thale cress). The result is a comprehensive dataset of images and extracted phenotypical features. Such datasets require detailed documentation, standardized description of experimental metadata as well as sustainable data storage and publication in order to ensure the reproducibility of experiments, data reuse and comparability among the scientific community. The dataset presented here has therefore been annotated using the standardized ISA-Tab format, considering the recently published recommendations for the semantic description of plant phenotyping experiments. PMID:27529152
EMERALD: Coping with the Explosion of Seismic Data
NASA Astrophysics Data System (ADS)
West, J. D.; Fouch, M. J.; Arrowsmith, R.
2009-12-01
The geosciences are currently generating an unparalleled quantity of new public broadband seismic data with the establishment of large-scale seismic arrays such as the EarthScope USArray, which are enabling new and transformative scientific discoveries of the structure and dynamics of the Earth’s interior. Much of this explosion of data is a direct result of the formation of the IRIS consortium, which has enabled an unparalleled level of open exchange of seismic instrumentation, data, and methods. The production of these massive volumes of data has generated new and serious data management challenges for the seismological community. A significant challenge is the maintenance and updating of seismic metadata, which includes information such as station location, sensor orientation, instrument response, and clock timing data. This key information changes at unknown intervals, and the changes are not generally communicated to data users who have already downloaded and processed data. Another basic challenge is the ability to handle massive seismic datasets when waveform file volumes exceed the fundamental limitations of a computer’s operating system. A third, long-standing challenge is the difficulty of exchanging seismic processing codes between researchers; each scientist typically develops his or her own unique directory structure and file naming convention, requiring that codes developed by another researcher be rewritten before they can be used. To address these challenges, we are developing EMERALD (Explore, Manage, Edit, Reduce, & Analyze Large Datasets). The overarching goal of the EMERALD project is to enable more efficient and effective use of seismic datasets ranging from just a few hundred to millions of waveforms with a complete database-driven system, leading to higher quality seismic datasets for scientific analysis and enabling faster, more efficient scientific research. 
We will present a preliminary (beta) version of EMERALD, an integrated, extensible, standalone database server system based on the open-source PostgreSQL database engine. The system is designed for fast and easy processing of seismic datasets, and provides the necessary tools to manage very large datasets and all associated metadata. EMERALD provides methods for efficient preprocessing of seismic records; large record sets can be easily and quickly searched, reviewed, revised, reprocessed, and exported. EMERALD can retrieve and store station metadata and alert the user to metadata changes. The system provides many methods for visualizing data, analyzing dataset statistics, and tracking the processing history of individual datasets. EMERALD allows development and sharing of visualization and processing methods using any of 12 programming languages. EMERALD is designed to integrate existing software tools; the system provides wrapper functionality for existing widely-used programs such as GMT, SOD, and TauP. Users can interact with EMERALD via a web browser interface, or they can directly access their data from a variety of database-enabled external tools. Data can be imported and exported from the system in a variety of file formats, or can be directly requested and downloaded from the IRIS DMC from within EMERALD.
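A database-driven store of waveform metadata of the kind EMERALD maintains can be sketched with SQLite standing in for its PostgreSQL engine. The table layout and column names are invented for illustration (the station codes happen to be real US network/station identifiers, used only as sample values):

```python
import sqlite3

# In-memory SQLite stand-in for EMERALD's PostgreSQL backend.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE waveforms (
    id INTEGER PRIMARY KEY,
    network TEXT, station TEXT, channel TEXT,
    starttime TEXT, sample_rate REAL)""")

# A few invented waveform records with their station metadata.
rows = [
    ("TA", "109C", "BHZ", "2009-04-01T00:00:00", 40.0),
    ("TA", "109C", "BHN", "2009-04-01T00:00:00", 40.0),
    ("IU", "ANMO", "BHZ", "2009-04-01T00:00:00", 20.0),
]
conn.executemany(
    "INSERT INTO waveforms (network, station, channel, starttime, sample_rate)"
    " VALUES (?, ?, ?, ?, ?)", rows)

def count_by_network(net):
    """How many waveform records a given network contributes."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM waveforms WHERE network = ?", (net,))
    return cur.fetchone()[0]
```

Putting waveform inventories behind SQL like this is what lets record sets of millions of traces be "searched, reviewed, revised, reprocessed, and exported" without hitting file-system limits on directory size.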
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Elger, Kirsten; Bertelmann, Roland; Klump, Jens
2016-04-01
With the foundation of DataCite in 2009 and the technical infrastructure installed in the last six years it has become very easy to create citable dataset DOIs. Nowadays, dataset DOIs are increasingly accepted and required by journals in reference lists of manuscripts. In addition, DataCite provides usage statistics [1] of assigned DOIs and offers a public search API to make research data count. By linking related information to the data, they become more useful for future generations of scientists. For this purpose, several identifier systems, as ISBN for books, ISSN for journals, DOI for articles or related data, Orcid for authors, and IGSN for physical samples can be attached to DOIs using the DataCite metadata schema [2]. While these are good preconditions to publish data, free and open solutions that help with the curation of data, the publication of research data, and the assignment of DOIs in one software seem to be rare. At GFZ Potsdam we built a modular software stack that is made of several free and open software solutions and we established 'GFZ Data Services'. 'GFZ Data Services' provides storage, a metadata editor for publication and a facility to moderate minted DOIs. All software solutions are connected through web APIs, which makes it possible to reuse and integrate established software. Core component of 'GFZ Data Services' is an eSciDoc [3] middleware that is used as central storage, and has been designed along the OAIS reference model for digital preservation. Thus, data are stored in self-contained packages that are made of binary file-based data and XML-based metadata. The eSciDoc infrastructure provides access control to data and it is able to handle half-open datasets, which is useful in embargo situations when a subset of the research data are released after an adequate period. 
The data exchange platform panMetaDocs [4] makes use of eSciDoc's REST API to upload file-based data into eSciDoc and uses a metadata editor [5] to annotate the files with metadata. The metadata editor has a user-friendly interface with nominal lists, extensive explanations, and an interactive mapping tool to provide assistance to scientists describing the data. It is possible to deposit metadata templates to fill certain fields with default values. The metadata editor generates metadata in the schemas ISO19139, NASA GCMD DIF, and DataCite, and could be extended for other schemas. panMetaDocs is able to mint dataset DOIs through DOIDB, which is our component to moderate dataset DOIs issued through 'GFZ Data Services'. DOIDB accepts metadata in the schemas ISO19139, DIF, and DataCite. In addition, DOIDB provides an OAI-PMH interface to disseminate all deposited metadata to data portals. The presentation of datasets on DOI landing pages is done through XSLT stylesheet transformation of the XML-based metadata. The landing pages have been designed to meet the needs of scientists. We are able to render the metadata to different layouts. Furthermore, additional information about datasets and publications is assembled into the webpage by querying public databases on the internet. The work presented here focuses on the technical details of the software stack. [1] http://stats.datacite.org [2] http://www.dlib.org/dlib/january11/starr/01starr.html [3] http://www.escidoc.org [4] http://panmetadocs.sf.net [5] http://github.com/ulbricht
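A stripped-down DataCite-style record, of the kind such a metadata editor emits, can be assembled with the standard library. Only a few schema elements are shown; XML namespaces and several required attributes of the real schema are omitted, and the DOI and field values below are examples, not a real deposit:

```python
import xml.etree.ElementTree as ET

def datacite_record(doi, title, creators, year):
    """Build a minimal DataCite-style metadata record as an XML string
    (identifier, creators, titles, publicationYear only)."""
    root = ET.Element("resource")
    ident = ET.SubElement(root, "identifier", identifierType="DOI")
    ident.text = doi
    creators_el = ET.SubElement(root, "creators")
    for name in creators:
        creator = ET.SubElement(creators_el, "creator")
        ET.SubElement(creator, "creatorName").text = name
    titles = ET.SubElement(root, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(root, "publicationYear").text = str(year)
    return ET.tostring(root, encoding="unicode")
```

Because the record is plain XML, a stylesheet transformation like the one described above can render the same metadata into landing pages with different layouts.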
Chèneby, Jeanne; Gheorghe, Marius; Artufel, Marie
2018-01-01
With this latest release of ReMap (http://remap.cisreg.eu), we present a unique collection of regulatory regions in human, as a result of a large-scale integrative analysis of ChIP-seq experiments for hundreds of transcriptional regulators (TRs) such as transcription factors, transcriptional co-activators and chromatin regulators. In 2015, we introduced the ReMap database to capture the genome regulatory space by integrating public ChIP-seq datasets, covering 237 TRs across 13 million (M) peaks. In this release, we have extended this catalog to constitute a unique collection of regulatory regions. Specifically, we have collected, analyzed and retained after quality control a total of 2829 ChIP-seq datasets available from public sources, covering a total of 485 TRs with a catalog of 80M peaks. Additionally, the updated database includes new search features for TR names as well as aliases, including cell line names, and the ability to navigate the data directly within genome browsers via public track hubs. Finally, full access to this catalog is available online together with a TR binding enrichment analysis tool. ReMap 2018 provides a significant update of the ReMap database, providing an in-depth view of the complexity of the regulatory landscape in human. PMID:29126285
The Earth Data Analytic Services (EDAS) Framework
NASA Astrophysics Data System (ADS)
Maxwell, T. P.; Duffy, D.
2017-12-01
Faced with unprecedented growth in earth data volume and demand, NASA has developed the Earth Data Analytic Services (EDAS) framework, a high-performance big data analytics framework built on Apache Spark. This framework enables scientists to execute data processing workflows combining common analysis operations close to the massive data stores at NASA. The data is accessed in standard (NetCDF, HDF, etc.) formats in a POSIX file system and processed using vetted earth data analysis tools (ESMF, CDAT, NCO, etc.). EDAS utilizes a dynamic caching architecture, a custom distributed array framework, and a streaming parallel in-memory workflow for efficiently processing huge datasets within limited memory spaces with interactive response times. EDAS services are accessed via a WPS API being developed in collaboration with the ESGF Compute Working Team to support server-side analytics for ESGF. The API can be accessed using direct web service calls, a Python script, a Unix-like shell client, or a JavaScript-based web application. New analytic operations can be developed in Python, Java, or Scala (with support for other languages planned). Client packages in Python, Java/Scala, or JavaScript contain everything needed to build and submit EDAS requests. The EDAS architecture brings together the tools, data storage, and high-performance computing required for timely analysis of large-scale datasets, where the data resides, to ultimately produce societal benefits. It is currently deployed at NASA in support of the Collaborative REAnalysis Technical Environment (CREATE) project, which centralizes numerous global reanalysis datasets onto a single advanced data analytics platform. This service enables decision makers to compare multiple reanalysis datasets and investigate trends, variability, and anomalies in earth system dynamics around the globe.
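A WPS Execute request of the kind such an API accepts can be composed as a key-value-pair GET URL. The `service`/`version`/`request` keys follow the general OGC WPS 1.0 KVP convention, but the endpoint, operation identifier, and input names below are placeholders, not documented EDAS parameters:

```python
from urllib.parse import urlencode

def wps_execute_url(base, identifier, inputs):
    """Compose a WPS-style Execute request as a GET URL; `inputs` is a
    dict of input-name -> value pairs joined in the datainputs field."""
    data_inputs = ";".join(f"{k}={v}" for k, v in inputs.items())
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": identifier,
        "datainputs": data_inputs,
    }
    return base + "?" + urlencode(params)

# Hypothetical request: average an invented variable over a domain.
url = wps_execute_url("https://edas.example/wps", "average",
                      {"variable": "tas", "domain": "d0"})
```

The same request could equally be submitted through the Python or shell clients mentioned above; the URL form just makes the KVP structure visible.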
Distributed Structure-Searchable Toxicity (DSSTox) Database Network: Making Public Toxicity Data Resources More Accessible and Usable for Data Exploration and SAR Development
Many sources of public toxicity data are not currently linked to chemical structure, are not ...
GI-axe: an access broker framework for the geosciences
NASA Astrophysics Data System (ADS)
Boldrini, E.; Nativi, S.; Santoro, M.; Papeschi, F.; Mazzetti, P.
2012-12-01
The efficient and effective discovery of heterogeneous geospatial resources (e.g. data and services) is currently addressed by implementing "Discovery Brokering components", such as GI-cat, which is successfully used by the GEO brokering framework. A related (and subsequent) problem is the access of discovered resources. As in the discovery case, there is a clear challenge: the geospatial Community makes use of heterogeneous access protocols and data models. In fact, different standards (and best practices) are defined and used by the diverse Geoscience domains and Communities of practice. Besides, through a client application, Users want to access diverse data to be jointly used in a common Geospatial Environment (CGE): a geospatial environment characterized by a spatio-temporal CRS (Coordinate Reference System), resolution, and extension. Users want to define a CGE and get the selected data ready to be used in such an environment. Finally, they want to download data according to a common encoding (either binary or textual). It is therefore possible to introduce the concept of an "Access Brokering component" which addresses all these intermediation needs transparently for both clients (i.e. Users) and access servers (i.e. Data Providers). This work presents GI-axe: a flexible Access Broker capable of intermediating the different access standards and of retrieving data according to a CGE previously specified by the User. In doing so, GI-axe complements the capabilities of the brokered access servers, in keeping with the brokering principles. Consider a sample use case of a User who needs to access a global temperature dataset available online on a THREDDS Data Server and a rainfall dataset accessible through a WFS; she/he may have obtained the datasets as a search result from a discovery broker. Distribution metadata accompanying the temperature dataset further indicate that a given OPeNDAP service has to be accessed to retrieve it.
At this point, the user would be responsible for finding an existing OPeNDAP client and retrieving the desired data with the desired CGE; worse, he/she might need to write his/her own OPeNDAP client. Meanwhile, the user would have to use a GIS to access the rainfall data and perform all the transformations necessary to obtain the same CGE. The GI-axe access broker takes this interoperability burden off the user by taking on the work of accessing the available services and performing the adaptations needed to deliver both datasets according to the same CGE. GI-axe can also expose both the THREDDS Data Server and the WFS as (for example) a WMS, allowing the user to work with a single and (perhaps) more familiar client. In this way the user can concentrate on less technological aspects more inherent to his/her scientific field. GI-axe was first developed and tested in the multidisciplinary interoperability framework of the European Community funded EuroGEOSS project. Presently, it is utilized in the GEOSS Discovery & Access Brokering framework.
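The broker pattern described in this abstract can be sketched as a minimal protocol dispatcher: given distribution metadata for a discovered dataset, the broker selects an access handler and hands back data normalized to a common geospatial environment. All names below (the handlers, the CGE dictionary, the example URLs) are illustrative assumptions, not GI-axe's actual API.

```python
# Minimal sketch of an access-broker dispatcher, in the spirit of GI-axe.
# Handler names and the CGE structure are hypothetical illustrations.

# A Common Geospatial Environment: target CRS, resolution, and extent.
CGE = {"crs": "EPSG:4326", "resolution": 0.5, "bbox": (-180, -90, 180, 90)}

def access_opendap(url, cge):
    # A real broker would open the remote dataset via OPeNDAP and
    # subset/regrid it to the requested CGE.
    return {"source": url, "protocol": "OPeNDAP", "cge": cge}

def access_wfs(url, cge):
    # Likewise for an OGC Web Feature Service endpoint.
    return {"source": url, "protocol": "WFS", "cge": cge}

# Registry mapping a protocol identifier from distribution metadata to a handler.
HANDLERS = {"OPeNDAP": access_opendap, "WFS": access_wfs}

def broker_access(distribution, cge):
    """Dispatch to the right access handler; the caller never sees the protocol."""
    try:
        handler = HANDLERS[distribution["protocol"]]
    except KeyError:
        raise ValueError(f"no handler for protocol {distribution['protocol']!r}")
    return handler(distribution["url"], cge)

temperature = broker_access({"protocol": "OPeNDAP", "url": "http://tds.example/temp"}, CGE)
rainfall = broker_access({"protocol": "WFS", "url": "http://wfs.example/rain"}, CGE)
```

The point of the registry is that both datasets come back described by the same CGE, regardless of the server-side protocol, which is the transparency the abstract attributes to the access broker.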
Code of Federal Regulations, 2014 CFR
2014-07-01
§ 910.51 Access. Access, when used in reference to parking or loading, means... (36 CFR, Parks, Forests, and Public Property; Pennsylvania Avenue Development Corporation General Guidelines and... Area Glossary of Terms; 2014-07-01 edition).
Code of Federal Regulations, 2010 CFR
2010-07-01
§ 910.51 Access. Access, when used in reference to parking or loading, means... (36 CFR, Parks, Forests, and Public Property; Pennsylvania Avenue Development Corporation General Guidelines and... Area Glossary of Terms; 2010-07-01 edition).
Code of Federal Regulations, 2012 CFR
2012-07-01
§ 910.51 Access. Access, when used in reference to parking or loading, means... (36 CFR, Parks, Forests, and Public Property; Pennsylvania Avenue Development Corporation General Guidelines and... Area Glossary of Terms; 2012-07-01 edition).
Code of Federal Regulations, 2011 CFR
2011-07-01
§ 910.51 Access. Access, when used in reference to parking or loading, means... (36 CFR, Parks, Forests, and Public Property; Pennsylvania Avenue Development Corporation General Guidelines and... Area Glossary of Terms; 2011-07-01 edition).
22 CFR 214.51 - Administrative review of denial for public access to records.
Code of Federal Regulations, 2010 CFR
2010-04-01
§ 214.51 Administrative review of denial for public access to... (22 CFR, Foreign Relations; Agency for International Development, Advisory Committee Management, Administrative Remedies; 2010-04-01 edition).
Progress in computational toxicology.
Ekins, Sean
2014-01-01
Computational methods have been widely applied to toxicology across the pharmaceutical, consumer product and environmental fields over the past decade. Progress in computational toxicology is reviewed here. A literature review was performed on computational models for hepatotoxicity (e.g. for drug-induced liver injury (DILI)), cardiotoxicity, renal toxicity and genotoxicity. In addition, various publications that use machine learning methods have been highlighted. Several computational toxicology model datasets from past publications were used to compare Bayesian and Support Vector Machine (SVM) learning methods. The increasing amounts of data for defined toxicology endpoints have enabled machine learning models that are increasingly used for predictions. It is shown that across many different models Bayesian and SVM methods perform similarly based on cross-validation data. Considerable progress has been made in computational toxicology over the past decade, both in model development and in the availability of larger scale or 'big data' models. Future efforts in toxicology data generation will likely provide hundreds of thousands of compounds readily accessible for machine learning models. These models will cover relevant chemistry space for pharmaceutical, consumer product and environmental applications. Copyright © 2013 Elsevier Inc. All rights reserved.
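The cross-validation underlying the model comparisons reviewed above can be sketched in a few lines: partition the samples into k folds, then train on k-1 folds and score on the held-out fold. This is a generic k-fold splitter written for illustration, not the review's actual evaluation protocol.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k (train_indices, test_indices) pairs."""
    # Distribute samples as evenly as possible across folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # Each fold takes a turn as the held-out test set.
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

splits = kfold_indices(10, 5)
```

In practice a model (Bayesian, SVM, or otherwise) would be fitted on each `train` set and scored on the corresponding `test` set, and the per-fold scores averaged.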
Arbelaez, Juan D; Moreno, Laura T; Singh, Namrata; Tung, Chih-Wei; Maron, Lyza G; Ospina, Yolima; Martinez, César P; Grenier, Cécile; Lorieux, Mathias; McCouch, Susan
Two populations of interspecific introgression lines (ILs) in a common recurrent parent were developed for use in pre-breeding and QTL mapping. The ILs were derived from crosses between cv Curinga, a tropical japonica upland cultivar, and two different wild donors, Oryza meridionalis Ng. accession (W2112) and Oryza rufipogon Griff. accession (IRGC 105491). The lines were genotyped using genotyping-by-sequencing (GBS) and SSRs. The 32 Curinga/O. meridionalis ILs contain 76.73% of the donor genome in individual introgressed segments, and each line has an average of 94.9% recurrent parent genome. The 48 Curinga/O. rufipogon ILs collectively contain 97.6% of the donor genome with an average of 89.9% recurrent parent genome per line. To confirm that these populations were segregating for traits of interest, they were phenotyped for pericarp color in the greenhouse and for four agronomic traits (days to flowering, plant height, number of tillers, and number of panicles) in an upland field environment. Seeds from these IL libraries and the accompanying GBS datasets are publicly available and represent valuable genetic resources for exploring the genetics and breeding potential of rice wild relatives.
Defining Data Access Pathways for Atmosphere to Electrons Wind Energy Data
NASA Astrophysics Data System (ADS)
Macduff, M.; Sivaraman, C.
2016-12-01
Atmosphere to Electrons (A2e) is a U.S. Department of Energy (DOE) Wind Program research initiative designed to optimize the performance of wind power plants by lowering the levelized cost of energy (LCOE). The Data Archive and Portal (DAP), managed by PNNL and hosted on Amazon Web Services, is a key capability of the A2e initiative. The DAP is used to collect, store, catalog, preserve and disseminate results from the experimental and computational studies, serving a diverse user community that requires both open and proprietary data archival solutions (http://a2e.pnnl.gov). To enable consumer access to the data in the DAP, it is being built on a set of publicly accessible APIs. These include persistent references for key metadata objects as well as authenticated access to the data itself. The goal is to make the DAP catalog visible through a variety of data access paths, bringing the data and metadata closer to the consumer. By providing persistent metadata records we hope to be able to build services that capture consumer utility and make referencing datasets easier.
The Role of the Virtual Astronomical Observatory in the Era of Big Data
NASA Astrophysics Data System (ADS)
Berriman, G. B.; Hanisch, R. J.; Lazio, T. J.
2013-01-01
The Virtual Observatory (VO) is realizing global electronic integration of astronomy data. The rapid growth in the size and complexity of data sets is transforming the computing landscape in astronomy. One of the long-term goals of the U.S. VO project, the Virtual Astronomical Observatory (VAO), is development of an information backbone that responds to this growth. Such a backbone will, when complete, provide innovative mechanisms for fast discovery of, and access to, massive data sets, and services that enable distributed storage, publication, and processing of large datasets. All these services will be built so that new projects can incorporate them as part of their data management and processing plans. Services under development to date include a general purpose indexing scheme for fast access to data sets, a cross-comparison engine that operates on catalogs of 1 billion records or more, and an interface for managing distributed data sets and connecting them to data discovery and analysis tools. The VAO advises projects on technology solutions for their data access and processing needs, and recently advised the Sagan Workshop on using cloud computing to support hands-on data analysis sessions for 150+ participants. Acknowledgements: The Virtual Astronomical Observatory (VAO) is managed by the VAO, LLC, a non-profit company established as a partnership of the Associated Universities, Inc. and the Association of Universities for Research in Astronomy, Inc. The VAO is sponsored by the National Science Foundation and the National Aeronautics and Space Administration.
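Fast cross-comparison over large catalogs rests on spatial indexing so that each match probe touches only a few buckets instead of the whole catalog. Below is a toy declination-zone index, offered only as an illustration of the idea; the VAO's actual indexing scheme and cross-match engine are not reproduced here, and the zone height is an arbitrary choice.

```python
from collections import defaultdict
from math import cos, radians, hypot

ZONE_HEIGHT = 0.5  # degrees of declination per zone (an illustrative choice)

def build_zone_index(catalog):
    """Bucket (ra, dec) sources, in degrees, by declination zone."""
    index = defaultdict(list)
    for ra, dec in catalog:
        index[int(dec // ZONE_HEIGHT)].append((ra, dec))
    return index

def crossmatch(index, ra, dec, radius):
    """Return catalog sources within `radius` degrees, probing only nearby zones."""
    zone = int(dec // ZONE_HEIGHT)
    span = int(radius // ZONE_HEIGHT) + 1
    matches = []
    for z in range(zone - span, zone + span + 1):
        for cra, cdec in index.get(z, []):
            # Small-angle flat-sky approximation, adequate for tiny radii.
            if hypot((cra - ra) * cos(radians(dec)), cdec - dec) <= radius:
                matches.append((cra, cdec))
    return matches

index = build_zone_index([(10.0, 20.0), (10.001, 20.001), (180.0, -45.0)])
near = crossmatch(index, 10.0, 20.0, 0.01)
```

Production systems use sky pixelizations such as HEALPix for the same purpose; the principle is identical: restrict each probe to a handful of spatial buckets.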
Increasing value and reducing waste: addressing inaccessible research
Chan, An-Wen; Song, Fujian; Vickers, Andrew; Jefferson, Tom; Dickersin, Kay; Gøtzsche, Peter C.; Krumholz, Harlan M.; Ghersi, Davina; van der Worp, H. Bart
2015-01-01
The study protocol, publications, full study report detailing all analyses, and participant-level dataset constitute the main documentation of methods and results for health research. However, journal publications are available for only half of all studies and are plagued by selective reporting of methods and results. The protocol, full study report, and participant-level dataset are rarely available. The quality of information provided in study protocols and reports is variable and often incomplete. Inaccessibility of full information for the vast majority of studies wastes billions of dollars, introduces bias, and has a detrimental impact on patient care and research. To help improve this situation at a systemic level, three main actions are warranted. Firstly, it is important that academic institutions and funders reward investigators who fully disseminate their research protocols, reports, and participant-level datasets. Secondly, standards for the content of protocols, full study reports, and data sharing practices should be rigorously developed and adopted for all types of health research. Finally, journals, funders, sponsors, research ethics committees, regulators, and legislators should implement and enforce policies supporting study registration and availability of journal publications, full study reports, and participant-level datasets. PMID:24411650
Ten years of maintaining and expanding a microbial genome and metagenome analysis system.
Markowitz, Victor M; Chen, I-Min A; Chu, Ken; Pati, Amrita; Ivanova, Natalia N; Kyrpides, Nikos C
2015-11-01
Launched in March 2005, the Integrated Microbial Genomes (IMG) system is a comprehensive data management system that supports multidimensional comparative analysis of genomic data. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets sequenced at the Joint Genome Institute or provided by scientific users, as well as public genome datasets available at the National Center for Biotechnology Information GenBank sequence data archive. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are integrated into the data warehouse using IMG's data integration toolkits. Microbial genome and metagenome application-specific data marts and user interfaces provide access to different subsets of IMG's data and analysis toolkits. This review article revisits IMG's original aims, highlights key milestones reached by the system during the past 10 years, and discusses the main challenges faced by a rapidly expanding system, in particular the complexity of maintaining such a system in an academic setting with limited budgets and computing and data management infrastructure. Copyright © 2015 Elsevier Ltd. All rights reserved.
A database of marine phytoplankton abundance, biomass and species composition in Australian waters
Davies, Claire H.; Coughlan, Alex; Hallegraeff, Gustaaf; Ajani, Penelope; Armbrecht, Linda; Atkins, Natalia; Bonham, Prudence; Brett, Steve; Brinkman, Richard; Burford, Michele; Clementson, Lesley; Coad, Peter; Coman, Frank; Davies, Diana; Dela-Cruz, Jocelyn; Devlin, Michelle; Edgar, Steven; Eriksen, Ruth; Furnas, Miles; Hassler, Christel; Hill, David; Holmes, Michael; Ingleton, Tim; Jameson, Ian; Leterme, Sophie C.; Lønborg, Christian; McLaughlin, James; McEnnulty, Felicity; McKinnon, A. David; Miller, Margaret; Murray, Shauna; Nayar, Sasi; Patten, Renee; Pritchard, Tim; Proctor, Roger; Purcell-Meyerink, Diane; Raes, Eric; Rissik, David; Ruszczyk, Jason; Slotwinski, Anita; Swadling, Kerrie M.; Tattersall, Katherine; Thompson, Peter; Thomson, Paul; Tonks, Mark; Trull, Thomas W.; Uribe-Palomino, Julian; Waite, Anya M.; Yauwenas, Rouna; Zammit, Anthony; Richardson, Anthony J.
2016-01-01
There have been many individual phytoplankton datasets collected across Australia since the mid-1900s, but most are unavailable to the research community. We have searched archives, contacted researchers, and scanned the primary and grey literature to collate 3,621,847 records of marine phytoplankton species from Australian waters from 1844 to the present. Many of these are small datasets collected for local questions, but combined they provide over 170 years of data on phytoplankton communities in Australian waters. Units and taxonomy have been standardised, obviously erroneous data removed, and all metadata included. We have lodged this dataset with the Australian Ocean Data Network (http://portal.aodn.org.au/), allowing public access. The Australian Phytoplankton Database will be invaluable for global change studies, as it allows analysis of ecological indicators of climate change and eutrophication (e.g., changes in distribution; diatom:dinoflagellate ratios). In addition, the standardised conversion of abundance records to biomass provides modellers with quantifiable data to initialise and validate ecosystem models of lower marine trophic levels. PMID:27328409
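An abundance-to-biomass conversion of the kind mentioned above typically multiplies cell counts by a species-specific per-cell carbon content. The sketch below illustrates the arithmetic only; the species names and carbon values are entirely hypothetical, and the database's actual conversion factors are not reproduced here.

```python
# Hypothetical per-cell carbon content (pg C per cell), for illustration only.
CARBON_PER_CELL_PG = {
    "Thalassiosira_sp": 120.0,
    "Gymnodinium_sp": 300.0,
}

def abundance_to_biomass(species, cells_per_litre):
    """Convert abundance (cells/L) to carbon biomass (ug C per litre)."""
    pg_per_litre = cells_per_litre * CARBON_PER_CELL_PG[species]
    return pg_per_litre / 1e6  # pg -> ug

biomass = abundance_to_biomass("Thalassiosira_sp", 5000)  # 5000 cells/L
```

Real conversions usually derive carbon content from measured cell biovolume via published volume-to-carbon regressions, which is why standardising the factors across datasets matters.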
A data discovery index for the social sciences
Krämer, Thomas; Klas, Claus-Peter; Hausstein, Brigitte
2018-01-01
This paper describes a novel search index for social and economic research data, one that enables users to search up-to-date references for data holdings in these disciplines. The index can be used for comparative analysis of publication of datasets in different areas of social science. The core of the index is the da|ra registration agency’s database for social and economic data, which contains high-quality searchable metadata from registered data publishers. Metadata records for research data are harvested from data providers around the world and included in the index. In this paper, we describe the currently available indices on social science datasets and their shortcomings. Next, we describe the motivation behind and the purpose for the data discovery index as a dedicated and curated platform for finding social science research data, and gesisDataSearch, its user interface. Further, we explain the harvesting, filtering and indexing procedure and give usage instructions for the dataset index. Lastly, we show that the index is currently the most comprehensive and most accessible collection of social science data descriptions available. PMID:29633988
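The indexing step behind such a discovery index can be sketched as a tiny inverted index over harvested metadata records: each token maps to the identifiers of the records containing it, and a query intersects those sets. This is a toy illustration, not gesisDataSearch's actual engine, and the sample records are invented.

```python
from collections import defaultdict
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records):
    """Map each token in title/keywords to the DOIs of matching records."""
    index = defaultdict(set)
    for rec in records:
        for token in tokenize(rec["title"] + " " + " ".join(rec["keywords"])):
            index[token].add(rec["doi"])
    return index

def search(index, query):
    """AND-search: return DOIs of records matching every query token."""
    sets = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*sets) if sets else set()

# Hypothetical harvested metadata records.
records = [
    {"doi": "10.0001/a", "title": "German General Social Survey",
     "keywords": ["ALLBUS", "survey"]},
    {"doi": "10.0001/b", "title": "European Values Study",
     "keywords": ["survey", "values"]},
]
index = build_index(records)
hits = search(index, "survey values")
```

Production indexes add field weighting, stemming, and ranking on top of this core structure, but the token-to-record mapping is the same idea.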
Multi-facetted Metadata - Describing datasets with different metadata schemas at the same time
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Klump, Jens; Bertelmann, Roland
2013-04-01
Inspired by the wish to re-use research data, much work is being done to bring the data systems of the earth sciences together. Discovery metadata is disseminated to data portals to allow the building of customized indexes of catalogued dataset items. Data that were once acquired in the context of a scientific project are open for reappraisal and can now be used by scientists who were not part of the original research team. To make data re-use easier, measurement methods and measurement parameters must be documented in an application metadata schema and described in a written publication. Linking datasets to publications, as DataCite [1] does, again requires a specific metadata schema, and every new use context of the measured data may require yet another metadata schema sharing only a subset of information with the metadata already present. To cope with the problem of metadata schema diversity in our common data repository at GFZ Potsdam, we established a solution to store file-based research data and describe it with an arbitrary number of metadata schemas. The core component of the data repository is an eSciDoc infrastructure that provides versioned container objects, called eSciDoc [2] "items". The eSciDoc content model allows files to be assigned to "items" and any number of metadata records to be added to them. The eSciDoc items can be submitted, revised, and finally published, which makes the data and metadata available worldwide through the internet. GFZ Potsdam uses eSciDoc to support its scientific publishing workflow, including mechanisms for data review in peer review processes by providing temporary web links for external reviewers who do not have credentials to access the data. Based on the eSciDoc API, panMetaDocs [3] provides a web portal for data management in research projects. PanMetaDocs, which is based on panMetaWorks [4], is a PHP-based web application that allows data to be described with any XML-based schema.
It uses the eSciDoc infrastructure's REST interface to store versioned dataset files and metadata in XML format. The software is able to manage more than one eSciDoc metadata record per item and thus allows a dataset to be described according to its context. The metadata fields can be filled with static or dynamic content to reduce the number of fields that require manual entry to a minimum and, at the same time, make use of contextual information available in a project setting. Access rights can be adjusted to set the visibility of datasets to the required degree of openness. Metadata from separate instances of panMetaDocs can be syndicated to portals through RSS and OAI-PMH interfaces. The application architecture presented here allows file-based datasets to be stored and described with any number of metadata schemas, depending on the intended use case. Data and metadata are stored in the same entity (eSciDoc items) and are managed by a software tool through the eSciDoc REST interface - in this case the application is panMetaDocs. Other software may re-use the produced items and modify the appropriate metadata records by accessing the web API of the eSciDoc data infrastructure. Presentation of the datasets in a web browser is not bound to panMetaDocs; it is done by stylesheet transformation of the eSciDoc item. [1] http://www.datacite.org [2] http://www.escidoc.org , eSciDoc, FIZ Karlsruhe, Germany [3] http://panmetadocs.sf.net , panMetaDocs, GFZ Potsdam, Germany [4] http://metaworks.pangaea.de , panMetaWorks, Dr. R. Huber, MARUM, Univ. Bremen, Germany
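The multi-facetted idea (one stored item carrying several metadata records, each in its own schema) can be sketched with a dictionary mapping schema name to an XML record. The class, field names, and schema labels below are illustrative assumptions, not eSciDoc's actual content model or API.

```python
import xml.etree.ElementTree as ET

class Item:
    """A data item carrying files plus any number of named metadata records."""
    def __init__(self, files):
        self.files = files
        self.metadata = {}  # schema name -> XML root element

    def add_record(self, schema, fields):
        # Build one metadata record as a small XML tree for the given schema.
        root = ET.Element("record", {"schema": schema})
        for name, value in fields.items():
            ET.SubElement(root, name).text = value
        self.metadata[schema] = root

    def serialize(self, schema):
        # Return the record for one schema as an XML string.
        return ET.tostring(self.metadata[schema], encoding="unicode")

item = Item(files=["borehole_temps.csv"])
item.add_record("datacite", {"title": "Borehole temperatures", "publisher": "GFZ"})
item.add_record("iso19115", {"abstract": "Temperature logs", "extent": "global"})
xml_out = item.serialize("datacite")
```

The essential property mirrored here is that the two records describe the same files independently, so a citation portal and a geospatial catalog can each harvest the schema they understand.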
Wong, Paul Wai-Ching; Fu, King-Wa; Yau, Rickey Sai-Pong; Ma, Helen Hei-Man; Law, Yik-Wa; Chang, Shu-Sen; Yip, Paul Siu-Fai
2013-01-11
The Internet's potential impact on suicide is of major public health interest, as easy online access to pro-suicide information or specific suicide methods may increase suicide risk among vulnerable Internet users. Little is known, however, about users' actual searching and browsing behaviors around online suicide-related information. To investigate what webpages people actually clicked on after searching with suicide-related queries on a search engine, and to examine what queries people used to get access to pro-suicide websites, a retrospective observational study was done. We used a web search dataset released by America Online (AOL). The dataset was randomly sampled from all AOL subscribers' web queries between March and May 2006 and was generated by 657,000 service subscribers. We found 5526 search queries (0.026%, 5526/21,000,000) that included the keyword "suicide". The 5526 search queries comprised 1586 different search terms and were generated by 1625 unique subscribers (0.25%, 1625/657,000). Of these queries, 61.38% (3392/5526) were followed by users clicking on a search result. These 3392 queries led to 1344 (39.62%) clicked webpages, accessed by 930 unique users, but only 1314 of those webpages were accessible during the study period. Each clicked-through webpage was classified into one of 11 categories. The categories of the most visited webpages were: entertainment (30.13%; 396/1314), scientific information (18.31%; 240/1314), and community resources (14.53%; 191/1314). Among the 1314 accessed webpages, we could identify only two pro-suicide websites. We found that the search terms used to access these sites included "commiting suicide with a gas oven", "hairless goat", "pictures of murder by strangulation", and "photo of a severe burn". A limitation of our study is that the database may be dated and is confined mainly to English webpages.
Searching or browsing suicide-related or pro-suicide webpages was uncommon, although a small group of users did access websites that contain detailed suicide method information.
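The core of this kind of query-log analysis (keyword filtering followed by a click-through rate) reduces to a few lines over (query, clicked_url) pairs. The tiny log below is invented for illustration and has no relation to the AOL dataset's contents.

```python
def analyze(log, keyword):
    """Return (matching queries, click-through fraction) for a keyword."""
    matching = [(q, url) for q, url in log if keyword in q.lower()]
    clicked = [url for _, url in matching if url is not None]
    ctr = len(clicked) / len(matching) if matching else 0.0
    return matching, ctr

# Hypothetical (query, clicked_url-or-None) pairs standing in for a search log.
log = [
    ("suicide hotline", "http://example.org/help"),
    ("suicide statistics", None),                      # no result clicked
    ("weather today", "http://example.org/wx"),
    ("suicide prevention resources", "http://example.org/prevent"),
]
matching, ctr = analyze(log, "suicide")
```

The study's reported figures (5526 matching queries, 61.38% followed by a click) are exactly this computation applied to the full AOL sample, plus manual classification of the clicked pages.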
PatGen--a consolidated resource for searching genetic patent sequences.
Rouse, Richard J D; Castagnetto, Jesus; Niedner, Roland H
2005-04-15
Compared to the wealth of online resources covering genomic, proteomic and derived data, the bioinformatics community is rather underserved when it comes to patent information related to biological sequences. The current online resources are either incomplete or rather expensive. This paper describes PatGen, an integrated database containing data from bioinformatic and patent resources. This effort addresses the inconsistent coverage of publicly available genetic patent data by providing access to a consolidated dataset. PatGen can be searched at http://www.patgendb.com. Contact: rjdrouse@patentinformatics.com.
Data Discovery, Exploration, Integration and Delivery - a practical experience
NASA Astrophysics Data System (ADS)
Kirsch, Peter; Barnes, Tim; Breen, Paul
2010-05-01
To fully address the questions and issues arising within Earth Systems Science, the discovery, exploration, integration, delivery and sharing of data, metadata and services across potentially many disciplines and areas of expertise is fundamental. The British Antarctic Survey (BAS) collects, manages and curates data across many fields of the geophysical and biological sciences (including upper atmospheric physics, atmospheric chemistry, meteorology, glaciology, oceanography, and Polar ecology and biology). BAS, through its Polar Data Centre, has an interest in constructing and delivering a user-friendly, informative interface onto these data holdings with a low administrative overhead. Designing effective interfaces and frameworks onto the heterogeneous datasets described above is non-trivial. We will discuss some of our approaches and implementations, particularly those addressing the following issues. How can the user be aided and guided towards accurate discovery of data? Many portals do not inform users clearly enough about the datasets they actually hold. As a result, the search interface by which a user is meant to discover information is often inadequate and assumes prior knowledge (for example, that the dataset you are looking for actually exists; that a particular event, campaign or research cruise took place; and that you have specialist knowledge of the terminology in a particular field), assumptions that cannot be made in multi-disciplinary topic areas. How easily can provenance, quality, and metadata information be displayed and accessed? Once informed through the portal that data is available, it is often extremely difficult to assess its provenance and quality information and broader documentation (including field reports, notebooks and software repositories). We shall demonstrate some simple methodologies. Can the user access summary data or visualizations of the dataset?
It may be that the user is interested in some event, feature or threshold within the dataset; mechanisms need to be provided to allow a user to browse the data (or at least a summary of the data in the most appropriate form, be it a plot, table, video etc.) prior to making the decision to download or request data. A framework should be flexible enough to allow several methods of visualization. Can datasets be compared and/or integrated? By allowing the inclusion of open, third-party, standards-compliant utilities (e.g. Open Geospatial Consortium WxS clients) into the framework, a data access system can be made more valuable. Is accessing the actual data straightforward? The process of accessing the data should follow naturally from the data discovery and browsing stages. The user should be made aware of the terms and conditions of access. Access restrictions (if applicable) and security should be made as unobtrusive as possible. How is user feedback and comment monitored and acted upon? In general these systems exist to serve science communities; appropriate notice and acknowledgement of their needs and requirements must be taken into account when designing and developing these systems if they are to be of continued use in the future.
Information management systems for pharmacogenomics.
Thallinger, Gerhard G; Trajanoski, Slave; Stocker, Gernot; Trajanoski, Zlatko
2002-09-01
The value of high-throughput genomic research is dramatically enhanced by association with key patient data. These data are generally available but of disparate quality and not typically directly associated. A system that could bring these disparate data sources into a common resource connected with functional genomic data would be tremendously advantageous. However, the integration of clinical data and the accurate interpretation of the generated functional genomic data require the development of information management systems capable of effectively capturing the data, as well as tools to make those data accessible to the laboratory scientist or to the clinician. In this review these challenges and current information technology solutions associated with the management, storage and analysis of high-throughput data are highlighted. It is suggested that the development of a pharmacogenomic data management system which integrates public and proprietary databases, clinical datasets, and data mining tools embedded in a high-performance computing environment should include the following components: parallel processing systems, storage technologies, network technologies, databases and database management systems (DBMS), and application services.
NASA Astrophysics Data System (ADS)
Willis, D. M.; Coffey, H. E.; Henwood, R.; Erwin, E. H.; Hoyt, D. V.; Wild, M. N.; Denig, W. F.
2013-11-01
The measurements of sunspot positions and areas that were published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), as the Greenwich Photo-heliographic Results (GPR), 1874-1976, exist in both printed and digital forms. These printed and digital sunspot datasets have been archived in various libraries and data centres. Unfortunately, however, typographic, systematic and isolated errors can be found in the various datasets. The purpose of the present paper is to begin the task of identifying and correcting these errors. In particular, the intention is to provide in one foundational paper all the necessary background information on the original solar observations, their various applications in scientific research, the format of the different digital datasets, the necessary definitions of the quantities measured, and the initial identification of errors in both the printed publications and the digital datasets. Two companion papers address the question of specific identifiable errors; namely, typographic errors in the printed publications, and both isolated and systematic errors in the digital datasets. The existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the programme of solar observations supported for more than a century by the Royal Observatory, Greenwich, and the Royal Greenwich Observatory. This improved dataset should be of value in many future scientific investigations.
bioWeb3D: an online webGL 3D data visualisation tool.
Pettit, Jean-Baptiste; Marioni, John C
2013-06-07
Data visualization is critical for interpreting biological data. However, in practice it can prove to be a bottleneck for researchers without specific training; this is especially true for three-dimensional (3D) data representation. Whilst existing software can provide all the necessary functionality to represent and manipulate biological 3D datasets, very few tools are easily accessible (browser based), cross-platform and usable by non-expert users. An online HTML5/WebGL based 3D visualisation tool has been developed to allow biologists to quickly and easily view interactive and customizable three-dimensional representations of their data along with multiple layers of information. Using the WebGL library Three.js, written in Javascript, bioWeb3D allows the simultaneous visualisation of multiple large datasets inputted via a simple JSON, XML or CSV file, which can be read and analysed locally thanks to HTML5 capabilities. Using basic 3D representation techniques in a technologically innovative context, we provide a program that is not intended to compete with professional 3D representation software, but that instead enables a quick and intuitive representation of reasonably large 3D datasets.
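A 3D dataset for such a browser tool is typically a list of named point series. The JSON layout below is an illustrative guess at the general shape of input a bioWeb3D-style viewer might accept, not the tool's documented format; field names like "chain" are assumptions.

```python
import json

# Build a hypothetical 3D dataset: two labelled point clouds.
dataset = {
    "dataset": {
        "name": "toy_expression_space",
        "chain": [
            {"name": "cluster_A", "points": [[0.0, 0.0, 0.0], [1.0, 0.5, 0.2]]},
            {"name": "cluster_B", "points": [[3.0, 3.1, 2.9]]},
        ],
    }
}

payload = json.dumps(dataset)    # what would be uploaded to the viewer
roundtrip = json.loads(payload)  # the browser parses it back the same way
```

Because HTML5 file APIs let the page read such a file locally, no data ever needs to leave the user's machine, which is the point the abstract makes about local analysis.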
OpenClimateGIS - A Web Service Providing Climate Model Data in Commonly Used Geospatial Formats
NASA Astrophysics Data System (ADS)
Erickson, T. A.; Koziol, B. W.; Rood, R. B.
2011-12-01
The goal of the OpenClimateGIS project is to make climate model datasets readily available in commonly used, modern geospatial formats used by GIS software, browser-based mapping tools, and virtual globes. The climate modeling community typically stores climate data in multidimensional gridded formats capable of efficiently storing large volumes of data (such as netCDF or GRIB), while the geospatial community typically uses flexible vector and raster formats that are capable of storing small volumes of data (relative to the multidimensional gridded formats). OpenClimateGIS seeks to address this difference in data formats by clipping climate data to user-specified vector geometries (i.e. areas of interest) and translating the gridded data on-the-fly into multiple vector formats. The OpenClimateGIS system does not store climate data archives locally, but rather works in conjunction with external climate archives that expose climate data via the OPeNDAP protocol. OpenClimateGIS provides a RESTful API web service for accessing climate data resources via HTTP, allowing a wide range of applications to access the climate data. The OpenClimateGIS system has been developed using open source development practices and the source code is publicly available. The project integrates libraries from several other open source projects (including Django, PostGIS, numpy, Shapely, and netcdf4-python). OpenClimateGIS development is supported by a grant from NOAA's Climate Program Office.
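The clipping step (keeping only grid cells whose centers fall inside a user-supplied area of interest) can be sketched with a simple bounding-box mask. This is a toy stand-in under stated assumptions; OpenClimateGIS itself handles arbitrary polygon geometries via Shapely/PostGIS rather than plain boxes, and the coordinates below are invented.

```python
def clip_grid(lons, lats, values, bbox):
    """Return (lon, lat, value) triples whose cell centers fall inside bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    clipped = []
    for i, lat in enumerate(lats):
        for j, lon in enumerate(lons):
            if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
                clipped.append((lon, lat, values[i][j]))
    return clipped

# A tiny 2 x 3 temperature grid (Kelvin), one value per (lat, lon) cell center.
lons = [-105.0, -104.0, -103.0]
lats = [39.0, 40.0]
values = [[280.1, 281.3, 282.0],
          [279.5, 280.8, 281.9]]

# Hypothetical area of interest as (min_lon, min_lat, max_lon, max_lat).
subset = clip_grid(lons, lats, values, (-105.5, 39.5, -103.5, 40.5))
```

The clipped triples can then be serialized into whatever vector format the consumer requested, which is the on-the-fly translation the abstract describes.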
Building infrastructure to prevent disasters like Hurricane Maria
NASA Astrophysics Data System (ADS)
Bandaragoda, C.; Phuong, J.; Mooney, S.; Stephens, K.; Istanbulluoglu, E.; Pieper, K.; Rhoads, W.; Edwards, M.; Pruden, A.; Bales, J.; Clark, E.; Brazil, L.; Leon, M.; McDowell, W. G.; Horsburgh, J. S.; Tarboton, D. G.; Jones, A. S.; Hutton, E.; Tucker, G. E.; McCready, L.; Peckham, S. D.; Lenhardt, W. C.; Idaszak, R.
2017-12-01
Recovery efforts from natural disasters can be more efficient with data-driven information on current needs and future risks. We aim to advance open-source software infrastructure to support scientific investigation and data-driven decision making, with a prototype system using a water quality assessment developed to investigate post-Hurricane Maria drinking water contamination in Puerto Rico. The widespread disruption of water treatment processes and uncertain drinking water quality within distribution systems in Puerto Rico pose risks to human health. However, there is no existing digital infrastructure to scientifically determine the impacts of the hurricane. After every natural disaster, it is difficult to answer elementary questions on how to provide high quality water supplies and health services. This project will archive and make accessible data on environmental variables unique to Puerto Rico and damage caused by Hurricane Maria, and will begin to address time-sensitive needs of citizens. The initial focus is to work directly with public utilities to collect and archive samples of biological and inorganic drinking water quality. Our goal is to advance understanding of how the severity of a hazard to human health (e.g., no access to safe culinary water) is related to the sophistication, connectivity, and operations of the physical and related digital infrastructure systems. By rapidly collecting data in the early stages of recovery, we will test the design of an integrated cyberinfrastructure system for usability of environmental and health data to understand the impacts of natural disasters.
We will test and stress the CUAHSI HydroShare data publication mechanisms and capabilities to (1) assess the spatial and temporal presence of waterborne pathogens in public water systems impacted by a natural disaster, (2) demonstrate usability of HydroShare as a clearinghouse to centralize selected datasets related to Hurricane Maria, and (3) develop a prototype cyberinfrastructure to assess environmental conditions and public health impacted by natural disasters. The project thus serves not only to document post-disaster conditions, but also to develop a process to track the impact of recovery over time, as monitored through health, power availability and water quality.
ERIC Educational Resources Information Center
Pritchard-Schoch, Teresa
1995-01-01
Examines developments among public record information providers, including a shift from file acquisition to entire company acquisition. Highlights include a table of remote access to public records by state; pricing information; privacy issues; and information about the three main companies offering access to public records: LEXIS, CDB Infotek,…
NASA Astrophysics Data System (ADS)
Erwin, E. H.; Coffey, H. E.; Denig, W. F.; Willis, D. M.; Henwood, R.; Wild, M. N.
2013-11-01
A new sunspot and faculae digital dataset for the interval 1874 - 1955 has been prepared under the auspices of the NOAA National Geophysical Data Center (NGDC). This digital dataset contains measurements of the positions and areas of both sunspots and faculae published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), under the title Greenwich Photo-heliographic Results (GPR), 1874 - 1976. Quality control (QC) procedures based on logical consistency have been used to identify the more obvious errors in the RGO publications. Typical examples of identifiable errors are North versus South errors in specifying heliographic latitude, errors in specifying heliographic (Carrington) longitude, errors in the dates and times, errors in sunspot group numbers, arithmetic errors in the summation process, and the occasional omission of solar ephemerides. Although the number of errors in the RGO publications is remarkably small, an initial table of necessary corrections is provided for the interval 1874 - 1917. Moreover, as noted in the preceding companion papers, the existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the long programme of solar observations supported first by the Royal Observatory, Greenwich, and then by the Royal Greenwich Observatory.
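Logical-consistency checks of the kind described can be expressed as simple record-level tests. The sketch below is illustrative only; the record fields are hypothetical and do not reproduce the actual GPR/NGDC format.

```python
# Illustrative sketch (not the NGDC code) of logical-consistency QC checks
# like those applied to the Greenwich Photo-heliographic Results.
# The record layout here is a hypothetical simplification.

def qc_sunspot_record(record):
    """Return a list of QC flags for one daily sunspot-group record."""
    flags = []
    # Heliographic latitude must lie in [-90, 90]; N/S sign errors are the
    # classic failure mode noted in the publications.
    if not -90.0 <= record["latitude"] <= 90.0:
        flags.append("latitude out of range")
    # Carrington longitude must lie in [0, 360).
    if not 0.0 <= record["carrington_longitude"] < 360.0:
        flags.append("longitude out of range")
    # Arithmetic check: the published daily total area should equal the sum
    # of the individual group areas.
    if sum(record["group_areas"]) != record["total_area"]:
        flags.append("area summation error")
    return flags

good = {"latitude": 15.2, "carrington_longitude": 123.4,
        "group_areas": [120, 45], "total_area": 165}
bad = {"latitude": -95.0, "carrington_longitude": 123.4,
       "group_areas": [120, 45], "total_area": 160}
```

Checks like these flag individual records for manual comparison against the printed RGO volumes.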
Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset.
Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert
2018-02-13
We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for 'target' versus 'non-target' (dataset A) and symbol 'O' versus symbol 'X' (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques.
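As an illustration of the ERP analysis mentioned (comparing epochs around 'target' versus 'non-target' events), a minimal epoching routine might look like the following; the sampling rate and window lengths are assumptions for the sketch, not values taken from the dataset description.

```python
# Minimal sketch of epoch extraction for ERP analysis: cut fixed windows
# around event markers. The 1000 Hz rate and 100/400-sample window are
# illustrative assumptions.

def extract_epochs(signal, event_samples, pre=100, post=400):
    """Cut windows [t - pre, t + post) around each event sample index."""
    epochs = []
    for t in event_samples:
        if t - pre >= 0 and t + post <= len(signal):
            epochs.append(signal[t - pre:t + post])
    return epochs

# 10 s of fake single-channel EEG at 1000 Hz, with events at 2 s and 5 s.
signal = [0.0] * 10_000
epochs = extract_epochs(signal, [2_000, 5_000])
# Each epoch spans pre + post = 500 samples.
```

Averaging such epochs per condition yields the event-related potentials whose spatiotemporal characteristics the dataset documents.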
Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset
Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert
2018-01-01
We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for ‘target’ versus ‘non-target’ (dataset A) and symbol ‘O’ versus symbol ‘X’ (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques. PMID:29437166
NASA Astrophysics Data System (ADS)
Shukla, S.; Husak, G. J.; Macharia, D.; Peterson, P.; Landsfeld, M. F.; Funk, C.; Flores, A.
2017-12-01
Remote sensing, reanalysis and model based earth observations (EOs) are crucial for environmental decision making, particularly in a region like Eastern and Southern Africa, where ground-based observations are sparse. NASA and the Famine Early Warning System Network (FEWS NET) provide several EOs relevant for monitoring and early warning of agroclimatic conditions. Nonetheless, real-time application of those EOs for decision making in the region is still limited. This presentation reports on an ongoing SERVIR-supported Applied Science Team (AST) project that aims to fill that gap by working in close collaboration with the Regional Centre for Mapping of Resources for Development (RCMRD), the NASA SERVIR regional hub. The three main avenues being taken to enhance access and usage of EOs in the region are: (1) transition and implementation of web-based tools to RCMRD to allow easy processing and visualization of EOs; (2) capacity building of personnel from regional and national agroclimate service agencies in using EOs, through training using targeted case studies; and (3) development of new datasets to meet the specific needs of RCMRD and regional stakeholders. The presentation will report on the initial success, lessons learned, and feedback thus far in this project regarding the implementation of web-based tools and capacity building efforts. It will also briefly describe three new datasets, currently in development, to improve agroclimate monitoring in the region: (1) a satellite infrared and station-based maximum temperature dataset (CHIRTS); (2) NASA GEOS-5 and NCEP CFSv2 based seasonal-scale reference evapotranspiration forecasts; and (3) NCEP GEFS based medium-range weather forecasts, bias-corrected to the USGS and UCSB rainfall monitoring dataset (CHIRPS).
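As a hedged sketch of the bias correction mentioned for the GEFS forecasts, a simple multiplicative correction of forecast rainfall toward a reference climatology such as CHIRPS could look like this; operational products typically use more sophisticated methods (e.g. quantile mapping), and the numbers below are illustrative.

```python
# Illustrative multiplicative bias correction: scale a forecast by the ratio
# of the reference climatology to the forecast climatology. Not the actual
# CHIRPS-GEFS procedure; values are invented.

def mean_ratio_correction(forecast, forecast_clim, reference_clim):
    """Scale forecast values (mm/day) toward a reference climatology."""
    ratio = reference_clim / forecast_clim if forecast_clim else 1.0
    return [f * ratio for f in forecast]

# Forecast climatology 8 mm/day vs reference climatology 10 mm/day
# -> every forecast value is scaled by 1.25.
corrected = mean_ratio_correction([4.0, 8.0], forecast_clim=8.0, reference_clim=10.0)
```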
An open access thyroid ultrasound image database
NASA Astrophysics Data System (ADS)
Pedraza, Lina; Vargas, Carlos; Narváez, Fabián.; Durán, Oscar; Muñoz, Emma; Romero, Eduardo
2015-01-01
Computer aided diagnosis systems (CAD) have been developed to assist radiologists in the detection and diagnosis of abnormalities, and a large number of pattern recognition techniques have been proposed to obtain a second opinion. Most of these strategies have been evaluated using different datasets, making their performance incomparable. In this work, an open access database of thyroid ultrasound images is presented. The dataset consists of a set of B-mode ultrasound images, including a complete annotation and diagnostic description of suspicious thyroid lesions by expert radiologists. Several types of lesions, such as thyroiditis, cystic nodules, adenomas and thyroid cancers, were included, and an accurate lesion delineation is provided in XML format. The diagnostic description of malignant lesions was confirmed by biopsy. The proposed new database is expected to be a resource for the community to assess different CAD systems.
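Since the lesion delineations are distributed in XML, a minimal reader can be sketched with the standard library; the tag and attribute names below are hypothetical, not the database's actual schema.

```python
# Sketch of reading an XML lesion delineation. The <lesion>/<point> schema
# here is invented for illustration; the real database defines its own format.
import xml.etree.ElementTree as ET

SAMPLE = """
<annotation>
  <lesion type="cystic nodule">
    <point x="102" y="87"/>
    <point x="110" y="92"/>
    <point x="104" y="99"/>
  </lesion>
</annotation>
"""

def read_lesions(xml_text):
    """Return a list of {type, outline} dicts, one per annotated lesion."""
    root = ET.fromstring(xml_text)
    lesions = []
    for lesion in root.iter("lesion"):
        outline = [(int(p.get("x")), int(p.get("y"))) for p in lesion.iter("point")]
        lesions.append({"type": lesion.get("type"), "outline": outline})
    return lesions

lesions = read_lesions(SAMPLE)
```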
NASA Astrophysics Data System (ADS)
Versteeg, R. J.; Johnson, T.; Henrie, A.; Johnson, D.
2013-12-01
The Hanford 300 Area, located adjacent to the Columbia River in south-central Washington, USA, is the site of former research and uranium fuel rod fabrication facilities. Waste disposal practices at the site included discharging between 33 and 59 metric tons of uranium over a 40 year period into shallow infiltration galleries, resulting in persistent uranium contamination within the vadose and saturated zones. Uranium transport from the vadose zone to the saturated zone is intimately linked with water table fluctuations and river water driven by upstream dam operations. Different remedial efforts have occurred at the site to address uranium contamination. Numerous investigations are occurring at the site, both to investigate remedial performance and to increase the understanding of uranium dynamics. Several of these studies include acquisition of large hydrological and time-lapse electrical geophysical datasets. Such datasets contain large amounts of information on hydrological processes. There are substantial challenges in how to effectively deal with the data volumes of such datasets, how to process them, and how to provide users with the ability to effectively access and synergize the hydrological information contained in raw and processed data. These challenges motivated the development of a cloud-based cyberinfrastructure for dealing with large electrical hydrogeophysical datasets. This cyberinfrastructure is modular and extensible and includes data management, data processing, visualization, and result mining capabilities. Specifically, it provides for data transmission to a central server, data parsing in a relational database, and processing of the data using a PNNL-developed parallel inversion code on either dedicated or commodity compute clusters.
Access to results is done through a browser with interactive tools allowing for generation of on-demand visualization of the inversion results as well as interactive data mining and statistical calculation. This infrastructure was used for the acquisition and processing of an electrical geophysical time-lapse survey which was collected over a highly instrumented field site in the Hanford 300 Area. Over a 13-month period between November 2011 and December 2012, 1,823 time-lapse datasets were collected (roughly 5 datasets a day, for a total of 23 million individual measurements) on three parallel resistivity lines of 30 m each with 0.5 m electrode spacing. In addition, hydrological and environmental data were collected from dedicated and general purpose sensors. This dataset contains rich information on near surface processes on a range of different spatial and temporal scales (ranging from hourly to seasonal). We will show how this cyberinfrastructure was used to manage and process this dataset and how the cyberinfrastructure can be used to access, mine and visualize the resulting data and information.
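The quoted data volumes are easy to sanity-check:

```python
# Back-of-envelope check of the Hanford 300 Area survey numbers: 1,823
# time-lapse datasets between November 2011 and December 2012, totalling
# 23 million individual measurements.
from datetime import date

days = (date(2012, 12, 1) - date(2011, 11, 1)).days  # the ~13-month window
datasets = 1823
measurements = 23_000_000

per_day = datasets / days              # ~4.6, i.e. "roughly 5 datasets a day"
per_dataset = measurements / datasets  # ~12,600 individual measurements each
```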
NASA Astrophysics Data System (ADS)
Pollak, J.; Berry, K.; Couch, A.; Arrigo, J.; Hooper, R. P.
2013-12-01
Scientific data about water are collected and distributed by numerous sources which can differ tremendously in scale. As competition for water resources increases, access to and understanding of information about water will be critical. The mission of the new CUAHSI Water Data Center (WDC) is to provide those researchers who collect data a medium to publish their datasets and give those wanting to discover data the proper tools to efficiently find the data that they seek. These tools include standards-based data publication, data discovery tools based upon faceted and telescoping search, and a data analysis tool, HydroDesktop, that downloads and unifies data in standardized formats. The CUAHSI Hydrologic Information System (HIS) is a community-developed and open source system for sharing water data. As a federated, web service oriented system it enables data publication for a diverse user population including scientific investigators (Research Coordination Networks, Critical Zone Observatories), government agencies (USGS, NASA, EPA), and citizen scientists (watershed associations). HydroDesktop is an end user application for data consumption in this system that the WDC supports. This application can be used for finding, downloading, and analyzing data from the HIS. It provides a GIS interface that allows users to incorporate spatial data that are not accessible via HIS, simple analysis tools to facilitate graphing and visualization, tools to export data to common file types, and an extensible architecture that developers can build upon. HydroDesktop, however, is just one example of a data access client for HIS. The web service oriented architecture enables data access by an unlimited number of clients provided they can consume the web services used in HIS. One such example developed at the WDC is the 'Faceted Search Client', which capitalizes upon exploratory search concepts to improve accuracy and precision during search.
We highlight such features of the CUAHSI-HIS which make it particularly appropriate for providing unified access to several sources of water data. A growing community of researchers and educators are employing these tools for education, including sharing best practices around creating modules, supporting researchers and educators in accessing the services, and cataloging and sharing modules. The CUAHSI WDC is a community-governed organization. Our agenda is driven by the community's voice through a Board of Directors and committees that decide strategic direction (new products), make tactical decisions (product improvement), and evaluate usability. By providing the aforementioned services within a community-driven framework, we believe the WDC is providing critical services that include improving water data discoverability, accessibility, and usability within a sustainable governance structure.
Geobrowser Enhanced Access of Real-Time Antarctic Data
NASA Astrophysics Data System (ADS)
Breen, P.; Judge, D.; Cunningham, N.; Kirsch, P. J.
2007-12-01
A proof of principle project was initiated in the Fall of 2006 to develop a system enabling remote field station and ship borne data, collected in near real-time, to be discovered, visualised and acquired through a web accessible framework. The two principal enabling drivers for this system were the recent improvements in communications with remote field stations and ships and the advent of low cost, easily accessible geobrowser technology providing the ability to visualise multiple, sometimes physically disparate datasets within a common interface. Strongly spatial in nature, the oceanographic datasets suggested the incorporation of geobrowser (Google Earth) technology into this framework. A number of scientific benefits were identified by the project; these include the overall enhancing of the value of many of the datasets through their real-time contribution to forecasting models, satellite ground truthing and calibration of autonomous instrumentation. Fieldwork also became more effective: problems were discovered rapidly and dealt with promptly, experiment parameters could be corrected or improved, and the capability for routine collection of high-quality data increased. In the past it may have been over a year before data arrived back at HQ, potentially unusable, definitely unrepeatable, and significantly reducing or delaying scientific output. The geobrowser interface provides the platform from which the spatial data is discovered; for example, ship tracks and aspects of the physical oceanography such as sea surface temperature can be directly visualized. Importantly, ancillary and auxiliary information and metadata can be linked to the cruise data in a straightforward and accessible manner; scientists in Cambridge using a geobrowser were able to access and visualize cruise data from the Southern Ocean 20 minutes after collection.
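As a sketch of how such ship-track data can be fed to a geobrowser, the snippet below emits a minimal KML LineString, the format Google Earth consumes; the function name and coordinates are illustrative, not taken from the project's code.

```python
# Hypothetical sketch: serialise a list of ship position fixes as a minimal
# KML document for display in a geobrowser such as Google Earth.
import xml.etree.ElementTree as ET

def ship_track_kml(name, fixes):
    """fixes: list of (lon, lat) tuples in decimal degrees."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    placemark = ET.SubElement(doc, "Placemark")
    ET.SubElement(placemark, "name").text = name
    line = ET.SubElement(placemark, "LineString")
    # KML coordinates are lon,lat,altitude triples separated by whitespace.
    ET.SubElement(line, "coordinates").text = " ".join(
        f"{lon},{lat},0" for lon, lat in fixes)
    return ET.tostring(kml, encoding="unicode")

kml = ship_track_kml("Cruise track", [(-60.1, -62.5), (-59.8, -62.9)])
```

Regenerating and re-serving such a file as new fixes arrive is one simple way to keep a geobrowser view current in near real-time.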
Quantifying and Mapping Global Data Poverty.
Leidig, Mathias; Teeuw, Richard M
2015-01-01
Digital information technologies, such as the Internet, mobile phones and social media, provide vast amounts of data for decision-making and resource management. However, access to these technologies, as well as their associated software and training materials, is not evenly distributed: since the 1990s there has been concern about a "Digital Divide" between the data-rich and the data-poor. We present an innovative metric for evaluating international variations in access to digital data: the Data Poverty Index (DPI). The DPI is based on Internet speeds, numbers of computer owners and Internet users, mobile phone ownership and network coverage, as well as provision of higher education. The datasets used to produce the DPI are provided annually for almost all the countries of the world and can be freely downloaded. The index that we present in this 'proof of concept' study is the first to quantify and visualise the problem of global data poverty, using the most recent datasets, for 2013. The effects of severe data poverty, particularly limited access to geoinformatic data, free software and online training materials, are discussed in the context of sustainable development and disaster risk reduction. The DPI highlights countries where support is needed for improving access to the Internet and for the provision of training in geoinformatics. We conclude that the DPI is of value as a potential metric for monitoring the Sustainable Development Goals of the Sendai Framework for Disaster Risk Reduction.
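An illustrative composite-index calculation in the spirit of the DPI: min-max normalise each indicator across countries, then average. The indicator names, values, and equal weighting below are hypothetical, not those of the published index.

```python
# Hypothetical composite index: min-max normalise each indicator across
# countries, then take the unweighted mean. All values are invented.

def minmax(values):
    """Scale a list of values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def composite_index(indicators):
    """indicators: dict of name -> per-country values (higher = better access)."""
    normalised = [minmax(vals) for vals in indicators.values()]
    return [sum(col) / len(normalised) for col in zip(*normalised)]

# Three countries, two indicators.
scores = composite_index({
    "internet_users_pct": [10.0, 50.0, 90.0],
    "mobile_coverage_pct": [40.0, 60.0, 80.0],
})
# scores -> [0.0, 0.5, 1.0]: the first country scores as the most data-poor.
```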
Limb-Enhancer Genie: An accessible resource of accurate enhancer predictions in the developing limb
Monti, Remo; Barozzi, Iros; Osterwalder, Marco; ...
2017-08-21
Epigenomic mapping of enhancer-associated chromatin modifications facilitates the genome-wide discovery of tissue-specific enhancers in vivo. However, reliance on single chromatin marks leads to high rates of false-positive predictions. More sophisticated, integrative methods have been described, but commonly suffer from limited accessibility to the resulting predictions and reduced biological interpretability. Here we present the Limb-Enhancer Genie (LEG), a collection of highly accurate, genome-wide predictions of enhancers in the developing limb, available through a user-friendly online interface. We predict limb enhancers using a combination of > 50 published limb-specific datasets and clusters of evolutionarily conserved transcription factor binding sites, taking advantage of the patterns observed at elements previously validated in vivo. By combining different statistical models, our approach outperforms current state-of-the-art methods and provides interpretable measures of feature importance. Our results indicate that including a previously unappreciated score that quantifies tissue-specific nuclease accessibility significantly improves prediction performance. We demonstrate the utility of our approach through in vivo validation of newly predicted elements. Moreover, we describe general features that can guide the type of datasets to include when predicting tissue-specific enhancers genome-wide, while providing an accessible resource to the general biological community and facilitating the functional interpretation of genetic studies of limb malformations.
Access NASA Satellite Global Precipitation Data Visualization on YouTube
NASA Astrophysics Data System (ADS)
Liu, Z.; Su, J.; Acker, J. G.; Huffman, G. J.; Vollmer, B.; Wei, J.; Meyer, D. J.
2017-12-01
Since the satellite era began, NASA has collected a large volume of Earth science observations for research and applications around the world. Satellite data at 12 NASA data centers can also be used for STEM activities such as disaster events, climate change, etc. However, accessing satellite data can be a daunting task for non-professional users such as teachers and students because of unfamiliarity with terminology, disciplines, data formats, data structures, computing resources, processing software, programming languages, etc. Over the years, many efforts have been developed to improve satellite data access, but barriers still exist for non-professionals. In this presentation, we will present our latest activity that uses the popular online video sharing web site, YouTube, to access visualization of global precipitation datasets at the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC). With YouTube, users can access and visualize a large volume of satellite data without the need to learn new software or download data. The dataset in this activity is the 3-hourly TRMM (Tropical Rainfall Measuring Mission) Multi-satellite Precipitation Analysis (TMPA). The video consists of over 50,000 data files collected from 1998 onwards, covering a zone between 50°N-S. The YouTube video will last 36 minutes for the entire dataset record (over 19 years). Since the time stamp is on each frame of the video, users can begin at any time by dragging the time progress bar. This precipitation animation will allow viewing precipitation events and processes (e.g., hurricanes, fronts, atmospheric rivers, etc.) on a global scale. The next plan is to develop a similar animation for the GPM (Global Precipitation Measurement) Integrated Multi-satellitE Retrievals for GPM (IMERG).
IMERG provides near-global (60°N-S) precipitation at a half-hourly interval, showing more detail on precipitation processes and development compared to the 3-hourly TMPA product. The entire video will contain more than 330,000 files and will last 3.6 hours. Future plans include development of fly-over videos for orbital data for an entire satellite mission or project. All videos will be uploaded and available at the GES DISC site on YouTube (https://www.youtube.com/user/NASAGESDISC).
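The quoted file counts and video lengths follow directly from the product intervals:

```python
# File counts for a 3-hourly product (8 files/day) versus a half-hourly
# product (48 files/day) over the ~19-year record described above.
YEARS = 19
DAYS = YEARS * 365.25

tmpa_files = DAYS * 8    # ~55,500 -> "over 50,000 data files"
imerg_files = DAYS * 48  # ~333,000 -> "more than 330,000 files"

# A 36-minute video holding one frame per TMPA file implies roughly 26 fps.
fps = tmpa_files / (36 * 60)
```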
Semantic technologies improving the recall and precision of the Mercury metadata search engine
NASA Astrophysics Data System (ADS)
Pouchard, L. C.; Cook, R. B.; Green, J.; Palanisamy, G.; Noy, N.
2011-12-01
The Mercury federated metadata system [1] was developed at the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC), a NASA-sponsored effort holding datasets about biogeochemical dynamics, ecological data, and environmental processes. Mercury currently indexes over 100,000 records from several data providers conforming to community standards, e.g. EML, FGDC, FGDC Biological Profile, ISO 19115 and DIF. With the breadth of sciences represented in Mercury, the potential exists to address some key interdisciplinary scientific challenges related to climate change, its environmental and ecological impacts, and mitigation of these impacts. However, this wealth of metadata also hinders pinpointing datasets relevant to a particular inquiry. We implemented a semantic solution after concluding that traditional search approaches cannot improve the accuracy of the search results in this domain because: a) unlike everyday queries, scientific queries seek to return specific datasets with numerous parameters that may or may not be exposed to search (Deep Web queries); b) the relevance of a dataset cannot be judged by its popularity, as each scientific inquiry tends to be unique; and c) each domain science has its own terminology, more or less curated, consensual, and standardized depending on the domain. The same terms may refer to different concepts across domains (homonyms), while different terms may mean the same thing (synonyms). Interdisciplinary research is arduous because an expert in a domain must become fluent in the language of another, just to find relevant datasets. Thus, we decided to use scientific ontologies because they can provide a context for a free-text search, in a way that string-based keywords never will. With added context, relevant datasets are more easily discoverable. To enable search and programmatic access to ontology entities in Mercury, we are using an instance of the BioPortal ontology repository.
Mercury accesses ontology entities using the BioPortal REST API by passing a search parameter to BioPortal that may return domain context, parameter attributes, or entity annotations depending on the entity's associated ontological relationships. As Mercury's faceted search is popular with users, the results are displayed as facets. Unlike a faceted search, however, the ontology-based solution implements both restrictions (improving precision) and expansions (improving recall) on the results of the initial search. For instance, "carbon" acquires a scientific context and additional key terms or phrases for discovering domain-specific datasets. A limitation of our solution is that the user must perform an additional step. Another limitation is that the quality of the newly discovered metadata is contingent upon the quality of the ontologies we use. Our solution leverages Mercury's federated capabilities to collect records from heterogeneous domains, and BioPortal's storage, curation and access capabilities for ontology entities. With minimal additional development, our approach builds on two mature systems for finding relevant datasets for interdisciplinary inquiries. We thus indicate a path forward for linking environmental, ecological and biological sciences. References: [1] Devarakonda, R., Palanisamy, G., Wilson, B. E., & Green, J. M. (2010). Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics, 3(1-2), 87-94.
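A sketch of the kind of call involved: querying BioPortal's public search REST endpoint for a free-text term such as "carbon". The endpoint shape follows BioPortal's documented REST API, but the API key and response handling below are placeholders, and the request is built without being sent.

```python
# Hypothetical query builder for BioPortal's search endpoint. The service
# requires an API key; "YOUR-KEY" is a placeholder.
from urllib.parse import urlencode
from urllib.request import Request

BIOPORTAL = "https://data.bioontology.org/search"

def build_search_request(term, apikey, ontologies=None):
    """Build (but do not send) a BioPortal free-text search request."""
    params = {"q": term, "apikey": apikey}
    if ontologies:
        # Optionally restrict the search to specific ontologies.
        params["ontologies"] = ",".join(ontologies)
    return Request(BIOPORTAL + "?" + urlencode(params))

req = build_search_request("carbon", apikey="YOUR-KEY", ontologies=["ENVO"])
# urllib.request.urlopen(req) would fetch the JSON response, whose matched
# entities could then drive the query expansions and restrictions described above.
```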
DOE Office of Scientific and Technical Information (OSTI.GOV)
De Carlo, Francesco; Gürsoy, Doğa; Ching, Daniel J.
There is a widening gap between the fast advancement of computational methods for tomographic reconstruction and their successful implementation in production software at various synchrotron facilities. This is due in part to the lack of readily available instrument datasets and phantoms representative of real materials for validation and comparison of new numerical methods. Recent advancements in detector technology made sub-second and multi-energy tomographic data collection possible [1], but also increased the demand to develop new reconstruction methods able to handle in-situ [2] and dynamic systems [3] that can be quickly incorporated in beamline production software [4]. The X-ray Tomography Datamore » Bank, tomoBank, provides a repository of experimental and simulated datasets with the aim to foster collaboration among computational scientists, beamline scientists, and experimentalists and to accelerate the development and implementation of tomographic reconstruction methods for synchrotron facility production software by providing easy access to challenging dataset and their descriptors.« less
Mavraki, Dimitra; Fanini, Lucia; Tsompanou, Marilena; Gerovasileiou, Vasilis; Nikolopoulou, Stamatina; Chatzinikolaou, Eva; Plaitis, Wanda
2016-01-01
Abstract Background This article describes the digitization of a series of historical datasets based on the reports of the 1908–1910 Danish Oceanographical Expeditions to the Mediterranean and adjacent seas. All station and sampling metadata as well as biodiversity data regarding calcareous rhodophytes, pelagic polychaetes, and fish (families Engraulidae and Clupeidae) obtained during these expeditions were digitized within the activities of the LifeWatchGreece Research Infrastructure project and presented in the present paper. The aim was to safeguard public data availability by using an open access infrastructure, and to prevent potential loss of valuable historical data on the Mediterranean marine biodiversity. New information The datasets digitized here cover 2,043 samples taken at 567 stations during a time period from 1904 to 1930 in the Mediterranean and adjacent seas. The samples resulted in 1,588 occurrence records of pelagic polychaetes, fish (Clupeiformes) and calcareous algae (Rhodophyta). Basic environmental data (e.g. sea surface temperature, salinity) as well as meteorological conditions are included for most sampling events. In addition to the description of the digitized datasets, a detailed description of the problems encountered during the digitization of this historical dataset and a discussion on the value of such data are provided. PMID:28174510
NASA Astrophysics Data System (ADS)
Akanda, A. S.; Hasan, M. A.; Nusrat, F.; Jutla, A.; Huq, A.; Alam, M.; Colwell, R. R.
2016-12-01
The United Nations Sustainable Development Goals call for universal and equitable access to safe and affordable drinking water, improvement of water quality, and adequate and equitable sanitation for all, with special attention to the needs of women and girls and those in vulnerable situations (Goal 6). In addition, the world community also aims to end preventable deaths of newborns and children under 5 years of age, and end the epidemics of neglected tropical diseases and combat hepatitis, water-borne diseases and other infectious diseases (Goal 3). Water and sanitation-related diseases remain the leading causes of death in children under five, mostly in South Asia and sub-Saharan Africa, due to diarrheal diseases linked to poor sanitation and hygiene. Water scarcity affects more than 40 per cent of the global population and is projected to rise substantially. More than 80 per cent of wastewater resulting from human activities is also discharged into rivers or sea without any treatment and poor water quality controls. As a result, around 1.8 billion people globally are still forced to use a source of drinking water that is fecally contaminated. Earth observation techniques provide the most effective and encompassing tool to monitor both regional and local scale changes in water quality and quantity, impacts of droughts and flooding, and water resources vulnerabilities in delta regions around the globe. 
University of Rhode Island, along with partners in the US and Bangladesh, is using satellite remote sensing datasets and earth observation techniques to develop a series of tools for surveillance, analysis and decision support for various government, academic, and non-government stakeholder organizations in South Asia to achieve sustainable development goals by 1) providing safe water and sanitation access in vulnerable regions through safe water resources mapping, 2) increasing access to medicine and vaccines through estimation of disease burden and identification of hotspots, and 3) reducing child mortality due to water-borne diseases in vulnerable regions by empowering public health personnel with prediction of diarrheal disease outbreaks.
The Hyperspectral Infrared Imager (HyspIRI) Public Health and Air Quality Applications
NASA Technical Reports Server (NTRS)
Luvall, Jeffrey C.; Hook, Simon J.
2014-01-01
The neglected tropical diseases (NTDs), a group of chronic, debilitating, and poverty-promoting parasitic, bacterial, and some viral and fungal infections, are among the most common causes of illness of the poorest people living in developing countries. Abiotic environmental factors are important in determining the distribution of disease-causing vectors and their life-cycles. HyspIRI observations, merged through a Land Data Assimilation System (LDAS), can be used to drive spatially-explicit ecological models of NTD vector distributions and life cycles. Assimilations will be driven by LDAS observational data and satellite-derived meteorological forcing data, parameter datasets, and assimilation observations. HyspIRI hyperspectral measurements would provide global measurements of surface mineralogy and biotic crusts, important in assessing the impact of dust on human health. HyspIRI surface thermal measurements would also help identify the variability of dust sources due to surface moisture conditions and map mineralogy.
Petousis, Ioannis; Mrdjenovich, David; Ballouz, Eric; Liu, Miao; Winston, Donald; Chen, Wei; Graf, Tanja; Schladt, Thomas D.; Persson, Kristin A.; Prinz, Fritz B.
2017-01-31
Dielectrics are an important class of materials that are ubiquitous in modern electronic applications. Even though their properties are important for the performance of devices, the number of compounds with known dielectric constant is on the order of a few hundred. Here, we use Density Functional Perturbation Theory as a way to screen for the dielectric constant and refractive index of materials in a fast and computationally efficient way. Our results constitute the largest dielectric tensors database to date, containing 1,056 compounds. Details regarding the computational methodology and technical validation are presented along with the format of our publicly available data. In addition, we integrate our dataset with the Materials Project allowing users easy access to material properties. Finally, we explain how our dataset and calculation methodology can be used in the search for novel dielectric compounds.
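The link between the two screened quantities is simple: for a non-absorbing material, the refractive index follows from the electronic dielectric constant as n = sqrt(ε), and the isotropic (polycrystalline) average of a dielectric tensor is one third of its trace. A minimal sketch of that relationship; the tensor values below are illustrative, not entries from the dataset:

```python
import math

def poly_dielectric(eps_tensor):
    """Isotropic (polycrystalline) average of a 3x3 dielectric tensor:
    one third of its trace."""
    return sum(eps_tensor[i][i] for i in range(3)) / 3.0

def refractive_index(eps_electronic):
    """For a non-absorbing material, n = sqrt(eps_electronic)."""
    return math.sqrt(eps_electronic)

# Hypothetical electronic dielectric tensor (illustrative values only)
eps = [[5.0, 0.0, 0.0],
       [0.0, 5.0, 0.0],
       [0.0, 0.0, 5.2]]

eps_avg = poly_dielectric(eps)   # (5.0 + 5.0 + 5.2) / 3
n = refractive_index(eps_avg)
print(round(eps_avg, 3), round(n, 3))
```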
NASA Astrophysics Data System (ADS)
Hsueh, D.; Farnham, D. J.; Gibson, R.; McGillis, W. R.; Culligan, P. J.; Cooper, C.; Larson, L.; Mailloux, B. J.; Buchanan, R.; Borus, N.; Zain, N.; Eddowes, D.; Butkiewicz, L.; Loiselle, S. A.
2015-12-01
Citizen Science is a fast-growing ecological research tool with proven potential to rapidly produce large datasets. While the fields of astronomy and ornithology demonstrate particularly successful histories of enlisting the public in conducting scientific work, citizen science applications to the field of hydrology have been relatively underutilized. We demonstrate the potential of citizen science for monitoring water quality, particularly in the impervious, urban environment of New York City (NYC) where pollution via stormwater runoff is a leading source of waterway contamination. Through partnerships with HSBC, Earthwatch, and the NYC Water Trail Association, we have trained two citizen science communities to monitor the quality of NYC waterways, testing for a suite of water quality parameters including pH, turbidity, phosphate, nitrate, and Enterococci (an indicator bacteria for the presence of harmful pathogens associated with fecal pollution). We continue to enhance these citizen science programs with two additions to our methodology. First, we designed and produced at-home incubation ovens for Enterococci analysis, and second, we are developing automated photo-imaging for nitrate and phosphate concentrations. These improvements make our work more publicly accessible while maintaining scientific accuracy. We also initiated a volunteer survey assessing the motivations for participation among our citizen scientists. These three endeavors will inform future applications of citizen science for urban hydrological research. Ultimately, the spatiotemporally-rich dataset of waterway quality produced from our citizen science efforts will help advise NYC policy makers about the impacts of green infrastructure and other types of government-led efforts to clean up NYC waterways.
Florida: Library Networking and Technology Development.
ERIC Educational Resources Information Center
Wilkins, Barratt, Ed.
1996-01-01
Explains the development of library networks in Florida and the role of the state library. Topics include regional multitype library consortia; a statewide bibliographic database; interlibrary loan; Internet access in public libraries; government information, including remote public access; automation projects; telecommunications; and free-nets.…
Hg concentrations in fish from coastal waters of California and Western North America
Davis, Jay; Ross, John; Bezalel, Shira; Sim, Lawrence; Bonnema, Autumn; Ichikawa, Gary; Heim, Wes; Schiff, Kenneth C; Eagles-Smith, Collin A.; Ackerman, Joshua T.
2016-01-01
The State of California conducted an extensive and systematic survey of mercury (Hg) in fish from the California coast in 2009 and 2010. The California survey sampled 3483 fish representing 46 species at 68 locations, and demonstrated that methylHg in fish presents a widespread exposure risk to fish consumers. Most of the locations sampled (37 of 68) had a species with an average concentration above 0.3 μg/g wet weight (ww), and 10 locations had a species with an average above 1.0 μg/g ww. The recent and robust dataset from California provided a basis for a broader examination of spatial and temporal patterns in fish Hg in coastal waters of Western North America. There is a striking lack of data in publicly accessible databases on Hg and other contaminants in coastal fish. An assessment of the raw data from these databases suggested the presence of relatively high concentrations along the California coast and in Puget Sound, and relatively low concentrations along the coasts of Alaska and Oregon, and the outer coast of Washington. The dataset suggests that Hg concentrations of public health concern can be observed at any location on the coast of Western North America where long-lived predator species are sampled. Output from a linear mixed-effects model resembled the spatial pattern observed for the raw data and suggested, based on the limited dataset, a lack of trend in fish Hg over the nearly 30-year period covered by the dataset. Expanded and continued monitoring, accompanied by rigorous data management procedures, would be of great value in characterizing methylHg exposure, and tracking changes in contamination of coastal fish in response to possible increases in atmospheric Hg emissions in Asia, climate change, and terrestrial Hg control efforts in coastal watersheds.
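The screening logic described above, flagging a location when any sampled species' average concentration exceeds a level of concern, can be sketched as follows; the species names and concentration values are hypothetical illustrations, not survey results:

```python
# Hypothetical species-average Hg concentrations (ug/g wet weight) per
# location; the 0.3 and 1.0 ug/g thresholds mirror the screening levels
# used in the survey described above.
samples = {
    "Location A": {"white croaker": 0.12, "barred sand bass": 0.45},
    "Location B": {"chub mackerel": 0.08, "shiner perch": 0.10},
    "Location C": {"leopard shark": 1.30},
}

def flag_locations(data, threshold):
    """Return locations where at least one species' average exceeds threshold."""
    return sorted(loc for loc, species in data.items()
                  if any(avg > threshold for avg in species.values()))

above_03 = flag_locations(samples, 0.3)   # -> ['Location A', 'Location C']
above_10 = flag_locations(samples, 1.0)   # -> ['Location C']
print(above_03, above_10)
```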
Patent law--balancing profit maximization and public access to technology.
Beckerman-Rodau, Andrew
2003-01-01
This article addresses the contemporary issue of balancing the need for patent protection for intellectual property with the resulting restriction of public access to new technology. The author argues that patent law protects private property rights rather than creating monopolies. Additionally, the author discusses how restricting access to patented technology, such as pharmaceuticals, can affect public health problems, such as the HIV/AIDS epidemic in developing nations. The author then concludes with some proposals for making patented technology available to people in developing nations who need access to such technology but who are unable to afford its high costs due to patent protection.
Tools for Interdisciplinary Data Assimilation and Sharing in Support of Hydrologic Science
NASA Astrophysics Data System (ADS)
Blodgett, D. L.; Walker, J.; Suftin, I.; Warren, M.; Kunicki, T.
2013-12-01
Information consumed and produced in hydrologic analyses is interdisciplinary and massive. These factors put a heavy information management burden on the hydrologic science community. The U.S. Geological Survey (USGS) Office of Water Information Center for Integrated Data Analytics (CIDA) seeks to assist hydrologic science investigators with all components of their scientific data management life cycle. Ongoing data publication and software development projects will be presented demonstrating publicly available data access services and manipulation tools being developed with support from two Department of the Interior initiatives. The USGS-led National Water Census seeks to provide both data and tools in support of nationally consistent water availability estimates. Newly available data include national coverages of radar-indicated precipitation, actual evapotranspiration, water use estimates aggregated by county, and South East region estimates of streamflow for 12-digit hydrologic unit code watersheds. Web services making these data available and applications to access them will be demonstrated. Web-available processing services able to provide numerous streamflow statistics for any USGS daily flow record or model result time series and other National Water Census processing tools will also be demonstrated. The National Climate Change and Wildlife Science Center is a USGS center leading DOI-funded academic global change adaptation research. It has a mission goal to ensure data used and produced by funded projects is available via web services and tools that streamline data management tasks in interdisciplinary science. For example, collections of downscaled climate projections, typically large collections of files that must be downloaded to be accessed, are being published using web services that allow access to the entire dataset via simple web-service requests and numerous processing tools.
Recent progress on this front includes data web services for Climate Model Intercomparison Phase 5 based downscaled climate projections, EPA's Integrated Climate and Land Use Scenarios projections of population and land cover metrics, and MODIS-derived land cover parameters from NASA's Land Processes Distributed Active Archive Center. These new services, and ways to discover others, will be presented through demonstration of a recently open-sourced project, from a web application or scripted workflow. Development and public deployment of server-based processing tools to subset and summarize these and other data is ongoing at the CIDA with partner groups such as 52 Degrees North and Unidata. The latest progress on subsetting, spatial summarization to areas of interest, and temporal summarization via common statistical methods will be presented.
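The abstract does not name the specific streamflow statistics the processing services compute; two common flow-duration statistics (the flow exceeded 10% of the time and the flow exceeded 90% of the time) can be sketched with a simple nearest-rank rule. The daily flow values below are hypothetical:

```python
import statistics

def exceedance_flow(flows, pct):
    """Flow exceeded `pct` percent of the time (flow-duration statistic).
    Nearest-rank sketch; production tools typically interpolate."""
    ranked = sorted(flows, reverse=True)
    k = max(0, min(len(ranked) - 1, round(pct / 100.0 * len(ranked)) - 1))
    return ranked[k]

# Hypothetical daily mean flows (cfs) for one 30-day record
daily_flows = [12, 15, 9, 40, 33, 21, 18, 14, 11, 10,
               9, 8, 25, 60, 55, 30, 22, 19, 16, 13,
               12, 11, 10, 9, 9, 8, 8, 7, 7, 6]

mean_flow = statistics.fmean(daily_flows)
q10 = exceedance_flow(daily_flows, 10)   # high-flow statistic
q90 = exceedance_flow(daily_flows, 90)   # low-flow statistic
print(round(mean_flow, 2), q10, q90)
```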
Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S; Langs, Georg; Simader, Christian; Waldstein, Sebastian M; Schmidt-Erfurth, Ursula M
2016-01-01
Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited due to the availability of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies making them difficult to compare. Thus we present and evaluate a multiple expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used, which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprising 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows a good correlation, thus making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge.
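The evaluation framework itself is not detailed in the abstract; the Dice coefficient is a common segmentation-overlap metric and serves here only as a hedged illustration of how an algorithm's IRF mask might be scored against a grader's annotation:

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks (flat 0/1 sequences):
    2*|A∩B| / (|A| + |B|). Returns 1.0 for two empty masks by convention."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    size = sum(mask_a) + sum(mask_b)
    return 1.0 if size == 0 else 2.0 * inter / size

# Hypothetical grader annotation vs. algorithm output for a tiny patch
grader    = [0, 1, 1, 1, 0, 0, 1, 0]
algorithm = [0, 1, 1, 0, 0, 1, 1, 0]

print(dice_coefficient(grader, algorithm))  # 2*3 / (4+4) = 0.75
```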
NASA Astrophysics Data System (ADS)
Mantas, Vasco M.; Pereira, A. J. S. C.; Liu, Zhong
2013-12-01
A project was devised to develop a set of freely available applications and web services that can (1) simplify access from mobile devices to TOVAS data and (2) support the development of new datasets through data repackaging and mash-up. The bottom-up approach enables the multiplication of new services, often of limited direct interest to the organizations that produce the original global datasets, but significant to small, local users. Through this multiplication of services, the development cost is transferred to the intermediate or end users and the entire process is made more efficient, even allowing new players to use the data in innovative ways.
Kohli, Marc D; Summers, Ronald M; Geis, J Raymond
2017-08-01
At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.
Zinkgraf, Matthew; Liu, Lijun; Groover, Andrew; Filkov, Vladimir
2017-06-01
Trees modify wood formation through integration of environmental and developmental signals in complex but poorly defined transcriptional networks, allowing trees to produce woody tissues appropriate to diverse environmental conditions. In order to identify relationships among genes expressed during wood formation, we integrated data from new and publicly available datasets in Populus. These datasets were generated from woody tissue and include transcriptome profiling, transcription factor binding, DNA accessibility and genome-wide association mapping experiments. Coexpression modules were calculated, each of which contains genes showing similar expression patterns across experimental conditions, genotypes and treatments. Conserved gene coexpression modules (four modules totaling 8398 genes) were identified that were highly preserved across diverse environmental conditions and genetic backgrounds. Functional annotations as well as correlations with specific experimental treatments associated individual conserved modules with distinct biological processes underlying wood formation, such as cell-wall biosynthesis, meristem development and epigenetic pathways. Module genes were also enriched for DNase I hypersensitivity footprints and binding from four transcription factors associated with wood formation. The conserved modules are excellent candidates for modeling core developmental pathways common to wood formation in diverse environments and genotypes, and serve as testbeds for hypothesis generation and testing for future studies. No claim to original US government works. New Phytologist © 2017 New Phytologist Trust.
Teaching the Thrill of Discovery: Student Exploration of the Large-Scale Structures of the Universe
NASA Astrophysics Data System (ADS)
Juneau, Stephanie; Dey, Arjun; Walker, Constance E.; NOAO Data Lab
2018-01-01
In collaboration with the Teen Astronomy Cafes program, the NOAO Data Lab is developing online Jupyter Notebooks as a free and publicly accessible tool for students and teachers. Each interactive activity teaches students simultaneously about coding and astronomy with a focus on large datasets. Therefore, students learn state-of-the-art techniques at the intersection of astronomy and data science. During the activity entitled “Our Vast Universe”, students use real spectroscopic data to measure the distance to galaxies before moving on to a catalog with distances to over 100,000 galaxies. Exploring this dataset gives students an appreciation of the large number of galaxies in the universe (2 trillion!), and leads them to discover how galaxies are located in large and impressive filamentary structures. During the Teen Astronomy Cafes program, the notebook is supplemented with visual material conducive to discussion, and hands-on activities involving cubes representing model universes. These steps help build the students’ physical intuition and give them a better grasp of the concepts before using software and coding. At the end of the activity, students have made their own measurements, and have experienced scientific research directly. More information is available online for the Teen Astronomy Cafes (teensciencecafe.org/cafes) and the NOAO Data Lab (datalab.noao.edu).
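The distance measurement the notebook walks students through can be sketched as two steps: redshift from a shifted emission line, then distance via Hubble's law. The observed wavelength and the assumed H0 = 70 km/s/Mpc below are illustrative choices, not values from the activity:

```python
C_KM_S = 299_792.458      # speed of light, km/s
H0 = 70.0                 # assumed Hubble constant, km/s/Mpc

def redshift(observed_wavelength, rest_wavelength):
    """z = (lambda_obs - lambda_rest) / lambda_rest."""
    return (observed_wavelength - rest_wavelength) / rest_wavelength

def hubble_distance_mpc(z):
    """Low-redshift approximation of Hubble's law: d = c*z / H0 (Mpc)."""
    return C_KM_S * z / H0

# H-alpha (rest wavelength 6562.8 A) observed at a hypothetical 6890.9 A
z = redshift(6890.9, 6562.8)           # ~0.05
d = hubble_distance_mpc(z)             # distance in Mpc
print(round(z, 4), round(d, 1))
```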
36 CFR 909.149 - Program accessibility: Discrimination prohibited.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 36 Parks, Forests, and Public Property 3 2010-07-01 2010-07-01 false Program accessibility: Discrimination prohibited. 909.149 Section 909.149 Parks, Forests, and Public Property PENNSYLVANIA AVENUE... CONDUCTED BY THE PENNSYLVANIA AVENUE DEVELOPMENT CORPORATION § 909.149 Program accessibility: Discrimination...
Virtual Observatory Interfaces to the Chandra Data Archive
NASA Astrophysics Data System (ADS)
Tibbetts, M.; Harbo, P.; Van Stone, D.; Zografou, P.
2014-05-01
The Chandra Data Archive (CDA) plays a central role in the operation of the Chandra X-ray Center (CXC) by providing access to Chandra data. Proprietary interfaces have been the backbone of the CDA throughout the Chandra mission. While these interfaces continue to provide the depth and breadth of mission specific access Chandra users expect, the CXC has been adding Virtual Observatory (VO) interfaces to the Chandra proposal catalog and observation catalog. VO interfaces provide standards-based access to Chandra data through simple positional queries or more complex queries using the Astronomical Data Query Language. Recent development at the CDA has generalized our existing VO services to create a suite of services that can be configured to provide VO interfaces to any dataset. This approach uses a thin web service layer for the individual VO interfaces, a middle-tier query component which is shared among the VO interfaces for parsing, scheduling, and executing queries, and existing web services for file and data access. The CXC VO services provide Simple Cone Search (SCS), Simple Image Access (SIA), and Table Access Protocol (TAP) implementations for both the Chandra proposal and observation catalogs within the existing archive architecture. Our work with the Chandra proposal and observation catalogs, as well as additional datasets beyond the CDA, illustrates how we can provide configurable VO services to extend core archive functionality.
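An IVOA Simple Cone Search request, one of the three interfaces named above, is just an HTTP GET with RA, DEC, and SR (search radius) parameters in decimal degrees. A minimal sketch of building such a request; the endpoint URL is hypothetical, not the actual CDA service address:

```python
from urllib.parse import urlencode

def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build an IVOA Simple Cone Search request: an HTTP GET with
    RA, DEC and SR (search radius) in decimal degrees."""
    params = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    return f"{base_url}?{params}"

# Hypothetical service endpoint; position is the Crab Nebula region.
url = cone_search_url("https://cda.example.org/scs", 83.633, 22.014, 0.1)
print(url)
```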
Three visualization approaches for communicating and exploring PIT tag data
Letcher, Benjamin; Walker, Jeffrey D.; O'Donnell, Matthew; Whiteley, Andrew R.; Nislow, Keith; Coombs, Jason
2018-01-01
As the number, size and complexity of ecological datasets has increased, narrative and interactive raw data visualizations have emerged as important tools for exploring and understanding these large datasets. As a demonstration, we developed three visualizations to communicate and explore passive integrated transponder tag data from two long-term field studies. We created three independent visualizations for the same dataset, allowing separate entry points for users with different goals and experience levels. The first visualization uses a narrative approach to introduce users to the study. The second visualization provides interactive cross-filters that allow users to explore multi-variate relationships in the dataset. The last visualization allows users to visualize the movement histories of individual fish within the stream network. This suite of visualization tools allows a progressive discovery of more detailed information and should make the data accessible to users with a wide variety of backgrounds and interests.
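The cross-filter interaction in the second visualization, where every active filter narrows the same underlying record set, can be sketched as below; the field names and detection records are hypothetical, not drawn from the study data:

```python
# Minimal sketch of the cross-filter idea: all active criteria are
# applied together to the same set of detection records.
detections = [
    {"tag": "A1", "species": "brook trout", "antenna": 2, "year": 2015},
    {"tag": "A2", "species": "brown trout", "antenna": 1, "year": 2015},
    {"tag": "A1", "species": "brook trout", "antenna": 3, "year": 2016},
    {"tag": "A3", "species": "brook trout", "antenna": 1, "year": 2016},
]

def cross_filter(records, **criteria):
    """Keep records matching every active criterion (field == value)."""
    return [r for r in records
            if all(r.get(field) == value for field, value in criteria.items())]

print(len(cross_filter(detections, species="brook trout")))            # 3
print(len(cross_filter(detections, species="brook trout", year=2016)))  # 2
```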
National Hydropower Plant Dataset, Version 2 (FY18Q3)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Samu, Nicole; Kao, Shih-Chieh; O'Connor, Patrick
The National Hydropower Plant Dataset, Version 2 (FY18Q3) is a geospatially comprehensive point-level dataset containing locations and key characteristics of U.S. hydropower plants that are currently either in the hydropower development pipeline (pre-operational), operational, withdrawn, or retired. These data are provided in GIS and tabular formats with corresponding metadata for each. In addition, we include access to download 2 versions of the National Hydropower Map, which was produced with these data (i.e. Map 1 displays the geospatial distribution and characteristics of all operational hydropower plants; Map 2 displays the geospatial distribution and characteristics of operational hydropower plants with pumped storage and mixed capabilities only). This dataset is a subset of ORNL's Existing Hydropower Assets data series, updated quarterly as part of ORNL's National Hydropower Asset Assessment Program.
ERIC Educational Resources Information Center
Pinnock, Katherine; Evans, Ruth
2008-01-01
As part of the prevention and social inclusion agenda, the Children's Fund, set up in 2000, has developed preventative services for children at risk of social exclusion. Drawing on a large qualitative dataset of interviews conducted in 2004/05 with children, young people and their parents/carers who accessed Children Fund services, this article…
Genomics Portals: integrative web-platform for mining genomics data.
Shinde, Kaustubh; Phatak, Mukta; Freudenberg, Johannes M; Chen, Jing; Li, Qian; Joshi, Vineet K; Hu, Zhen; Ghosh, Krishnendu; Meller, Jaroslaw; Medvedovic, Mario
2010-01-13
A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into functioning of living systems. The Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical visualization tools. It provides the context for analyzing and interpreting new experimental data and the tool for effective mining of a large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc), and the integration with an extensive knowledge base that can be used in such analysis. The integrated access to primary genomics data, functional knowledge and analytical tools makes the Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals backend databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org.
NASA Astrophysics Data System (ADS)
Martinez, Santa; Besse, Sebastien; Heather, Dave; Barbarisi, Isa; Arviset, Christophe; De Marchi, Guido; Barthelemy, Maud; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; Macfarlane, Alan; Rios, Carlos; Vallejo, Fran; Saiz, Jaime; ESDC (European Space Data Centre) Team
2016-10-01
The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://archives.esac.esa.int/psa. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA is currently implementing a number of significant improvements, mostly driven by the evolution of the PDS standard, and the growing need for better interfaces and advanced applications to support science exploitation. The newly designed PSA will enhance the user experience and will significantly reduce the complexity for users to find their data, promoting one-click access to the scientific datasets with more specialised views when needed. This includes a better integration with Planetary GIS analysis tools and Planetary interoperability services (search and retrieve data, supporting e.g. PDAP, EPN-TAP). It will also be kept up to date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's ExoMars and upcoming BepiColombo missions. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. A login mechanism will provide additional functionalities to the users to aid their searches (e.g. saving queries, managing default views). This contribution will introduce the new PSA, its key features and access interfaces.
Rolling Deck to Repository (R2R): Standards and Semantics for Open Access to Research Data
NASA Astrophysics Data System (ADS)
Arko, Robert; Carbotte, Suzanne; Chandler, Cynthia; Smith, Shawn; Stocks, Karen
2015-04-01
In recent years, a growing number of funding agencies and professional societies have issued policies calling for open access to research data. The Rolling Deck to Repository (R2R) program is working to ensure open access to the environmental sensor data routinely acquired by the U.S. academic research fleet. Currently 25 vessels deliver 7 terabytes of data to R2R each year, acquired from a suite of geophysical, oceanographic, meteorological, and navigational sensors on over 400 cruises worldwide. R2R is working to ensure these data are preserved in trusted repositories, discoverable via standard protocols, and adequately documented for reuse. R2R maintains a master catalog of cruises for the U.S. academic research fleet, currently holding essential documentation for over 3,800 expeditions including vessel and cruise identifiers, start/end dates and ports, project titles and funding awards, science parties, dataset inventories with instrument types and file formats, data quality assessments, and links to related content at other repositories. A Digital Object Identifier (DOI) is published for 1) each cruise, 2) each original field sensor dataset, 3) each post-field data product such as quality-controlled shiptrack navigation produced by the R2R program, and 4) each document such as a cruise report submitted by the science party. Scientists are linked to personal identifiers, such as the Open Researcher and Contributor ID (ORCID), where known. Using standard global identifiers such as DOIs and ORCIDs facilitates linking with journal publications and generation of citation metrics. Since its inception, the R2R program has worked in close collaboration with other data repositories in the development of shared semantics for oceanographic research. 
The R2R cruise catalog uses community-standard terms and definitions hosted by the NERC Vocabulary Server, and publishes ISO metadata records for each cruise that use community-standard profiles developed with the NOAA Data Centers and the EU SeaDataNet project. R2R is a partner in the Ocean Data Interoperability Platform (ODIP), working to strengthen links among regional and national data systems, as well as a lead partner in the EarthCube "GeoLink" project, developing a standard set of ontology design patterns for publishing research data using Semantic Web protocols.
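DOIs like those R2R publishes for cruises, datasets, and documents can be dereferenced programmatically via doi.org content negotiation, which returns machine-readable citation metadata. A minimal sketch (the DOI shown is a made-up placeholder, not a real R2R identifier):

```python
from urllib.request import Request

# Hypothetical placeholder DOI, not a real R2R identifier.
doi = "10.7284/900000"

# doi.org supports HTTP content negotiation: requesting CSL JSON
# returns citation metadata instead of following the landing-page redirect.
req = Request(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)

# The request object is built but not sent here; urlopen(req) would
# retrieve the metadata document.
print(req.full_url)
```

The same mechanism works for any registered DOI, which is part of why standard identifiers ease citation-metric generation as the abstract notes.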
Blodgett, David L.
2013-01-01
The increasing availability of downscaled climate projections and other data products that summarize or predict climate conditions is making climate data use more common in research and management. Scientists and decision makers often need to construct ensembles and compare climate hindcasts and future projections for particular spatial areas. These tasks generally require an investigator to procure all datasets of interest en masse, integrate the various data formats and representations into commonly accessible and comparable formats, and then extract the subsets of the datasets that are actually of interest. This process can be challenging and time-intensive due to data-transfer, data-storage, or data-processing limits, or unfamiliarity with methods of accessing climate data. Data management for modeling and assessing the impacts of future climate conditions is also becoming increasingly expensive due to the size of the datasets. The Climate Geo Data Portal (http://cida.usgs.gov/climate/gdp/) addresses these limitations, making access to numerous climate datasets for particular areas of interest a simple and efficient task.
NASA Astrophysics Data System (ADS)
Ryan, J. G.; McIlrath, J. A.
2008-12-01
Web-accessible geospatial information system (GIS) technologies have advanced in concert with an expansion of data resources that can be accessed and used by researchers, educators and students. These resources facilitate the development of data-rich instructional resources and activities that can be used to transition seamlessly into undergraduate research projects. MARGINS Data in the Classroom (http://serc.carleton.edu/margins/index.html) seeks to engage MARGINS researchers and educators in using the images, datasets, and visualizations produced by NSF-MARGINS Program-funded research and related efforts to create Web-deliverable instructional materials for use in undergraduate-level geoscience courses (MARGINS Mini-Lessons). MARGINS science data is managed by the Marine Geosciences Data System (MGDS), and these and all other MGDS-hosted data can be accessed, manipulated and visualized using GeoMapApp (www.geomapapp.org; Carbotte et al, 2004), a freely available geographic information system focused on the marine environment. Both "packaged" MGDS datasets (e.g., global earthquake foci, volcanoes, bathymetry) and "raw" data (seismic surveys, magnetics, gravity) are accessible via GeoMapApp, with WFS linkages to other resources (geodesy from UNAVCO; seismic profiles from IRIS; geochemical and drillsite data from EarthChem, IODP, and others), permitting the comprehensive characterization of many regions of the ocean basins. Geospatially controlled datasets can be imported into GeoMapApp visualizations, and these visualizations can be exported into Google Earth as .kmz image files. Many of the MARGINS Mini-Lessons produced thus far use (or have students use) the varied capabilities of GeoMapApp (e.g., constructing topographic profiles, overlaying varied geophysical and bathymetric datasets, characterizing geochemical data). These materials are available for use and testing from the project webpage (http://serc.carleton.edu/margins/).
Classroom testing and assessment of the Mini- Lessons begins this Fall.
Scaling Up Scientific Discovery in Sleep Medicine: The National Sleep Research Resource.
Dean, Dennis A; Goldberger, Ary L; Mueller, Remo; Kim, Matthew; Rueschman, Michael; Mobley, Daniel; Sahoo, Satya S; Jayapandian, Catherine P; Cui, Licong; Morrical, Michael G; Surovec, Susan; Zhang, Guo-Qiang; Redline, Susan
2016-05-01
Professional sleep societies have identified a need for strategic research in multiple areas that may benefit from access to and aggregation of large, multidimensional datasets. Technological advances provide opportunities to extract and analyze physiological signals and other biomedical information from datasets of unprecedented size, heterogeneity, and complexity. The National Institutes of Health has implemented a Big Data to Knowledge (BD2K) initiative that aims to develop and disseminate state-of-the-art big data access tools and analytical methods. The National Sleep Research Resource (NSRR) is a new National Heart, Lung, and Blood Institute resource designed to provide big data resources to the sleep research community. The NSRR is a web-based data portal that aggregates, harmonizes, and organizes sleep and clinical data from thousands of individuals studied as part of cohort studies or clinical trials and provides the user a suite of tools to facilitate data exploration and data visualization. Each deidentified study record minimally includes the summary results of an overnight sleep study; annotation files with scored events; the raw physiological signals from the sleep record; and available clinical and physiological data. NSRR is designed to be interoperable with other public data resources such as the Biologic Specimen and Data Repository Information Coordinating Center Demographics (BioLINCC) data and analyzed with methods provided by the Research Resource for Complex Physiological Signals (PhysioNet). This article reviews the key objectives, challenges, and operational solutions for addressing big data opportunities for sleep research in the context of the national sleep research agenda. It provides information to facilitate further interactions of the user community with NSRR, a community resource. © 2016 Associated Professional Sleep Societies, LLC.
Davidson, Robert L; Weber, Ralf J M; Liu, Haoyu; Sharma-Oates, Archana; Viant, Mark R
2016-01-01
Metabolomics is increasingly recognized as an invaluable tool in the biological, medical and environmental sciences yet lags behind the methodological maturity of other omics fields. To achieve its full potential, including the integration of multiple omics modalities, the accessibility, standardization and reproducibility of computational metabolomics tools must be improved significantly. Here we present our end-to-end mass spectrometry metabolomics workflow in the widely used platform, Galaxy. Named Galaxy-M, our workflow has been developed for both direct infusion mass spectrometry (DIMS) and liquid chromatography mass spectrometry (LC-MS) metabolomics. The range of tools presented spans from processing of raw data, e.g. peak picking and alignment, through data cleansing, e.g. missing value imputation, to preparation for statistical analysis, e.g. normalization and scaling, and principal components analysis (PCA) with associated statistical evaluation. We demonstrate the ease of using these Galaxy workflows via the analysis of DIMS and LC-MS datasets, and provide PCA scores and associated statistics to help other users to ensure that they can accurately repeat the processing and analysis of these two datasets. Galaxy and data are all provided pre-installed in a virtual machine (VM) that can be downloaded from the GigaDB repository. Additionally, source code, executables and installation instructions are available from GitHub. The Galaxy platform has enabled us to produce an easily accessible and reproducible computational metabolomics workflow. More tools could be added by the community to expand its functionality. We recommend that Galaxy-M workflow files are included within the supplementary information of publications, enabling metabolomics studies to achieve greater reproducibility.
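Two of the data-cleansing and preparation steps named in the Galaxy-M abstract, missing-value imputation and scaling ahead of PCA, can be sketched in plain Python. This is an illustrative stand-in rather than Galaxy-M's actual implementation, and the intensity values are invented:

```python
from statistics import mean, stdev

# Toy intensity matrix: rows are samples, columns are metabolite features;
# None marks a missing peak. All values are invented for illustration.
X = [
    [10.0, 200.0, None],
    [12.0, None, 3.0],
    [11.0, 220.0, 5.0],
]
ncols = len(X[0])

# Step 1 -- missing-value imputation: replace None with the mean of the
# observed values in that column (one simple strategy among several).
for j in range(ncols):
    observed = [row[j] for row in X if row[j] is not None]
    fill = mean(observed)
    for row in X:
        if row[j] is None:
            row[j] = fill

# Step 2 -- autoscaling: centre each column and divide by its standard
# deviation, so every feature contributes comparably to a subsequent PCA.
means = [mean(row[j] for row in X) for j in range(ncols)]
sds = [stdev([row[j] for row in X]) for j in range(ncols)]
scaled = [[(row[j] - means[j]) / sds[j] for j in range(ncols)] for row in X]
print(scaled)
```

After these steps each column of `scaled` has zero mean and unit variance, which is the usual input convention for PCA.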
NASA Astrophysics Data System (ADS)
Gross, M. B.; Mayernik, M. S.; Rowan, L. R.; Khan, H.; Boler, F. M.; Maull, K. E.; Stott, D.; Williams, S.; Corson-Rikert, J.; Johns, E. M.; Daniels, M. D.; Krafft, D. B.
2015-12-01
UNAVCO, UCAR, and Cornell University are working together to leverage semantic web technologies to enable discovery of people, datasets, publications and other research products, as well as the connections between them. The EarthCollab project, an EarthCube Building Block, is enhancing an existing open-source semantic web application, VIVO, to address connectivity gaps across distributed networks of researchers and resources related to the following two geoscience-based communities: (1) the Bering Sea Project, an interdisciplinary field program whose data archive is hosted by NCAR's Earth Observing Laboratory (EOL), and (2) UNAVCO, a geodetic facility and consortium that supports diverse research projects informed by geodesy. People, publications, datasets and grant information have been mapped to an extended version of the VIVO-ISF ontology and ingested into VIVO's database. Data is ingested using a custom set of scripts that include the ability to perform basic automated and curated disambiguation. VIVO can display a page for every object ingested, including connections to other objects in the VIVO database. A dataset page, for example, includes the dataset type, time interval, DOI, related publications, and authors. The dataset type field provides a connection to all other datasets of the same type. The author's page will show, among other information, related datasets and co-authors. Information previously spread across several unconnected databases is now stored in a single location. In addition to VIVO's default display, the new database can also be queried using SPARQL, a query language for semantic data. EarthCollab will also extend the VIVO web application. One such extension is the ability to cross-link separate VIVO instances across institutions, allowing local display of externally curated information. 
For example, Cornell's VIVO faculty pages will display UNAVCO's dataset information and UNAVCO's VIVO will display Cornell faculty member contact and position information. Additional extensions, including enhanced geospatial capabilities, will be developed following task-centered usability testing.
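A SPARQL query against a VIVO endpoint, as described for the EarthCollab database above, might look like the following sketch. The endpoint URL is a hypothetical placeholder, and the query simplifies the VIVO-ISF modelling (real VIVO-ISF links authorship through intermediate relationship nodes):

```python
from urllib.parse import urlencode

# Hypothetical SPARQL endpoint for an EarthCollab VIVO instance.
ENDPOINT = "https://connect.unavco.org/sparql"

# List datasets and their labels; prefixes follow the VIVO core ontology.
sparql = """
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?dataset ?label WHERE {
  ?dataset a vivo:Dataset ;
           rdfs:label ?label .
} LIMIT 10
"""

# Encode the query as an HTTP GET request (built here, not sent).
query_url = ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})
print(query_url)
```

Sending this request to a live endpoint would return a JSON result set of dataset URIs and labels drawn from the ingested triples.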
36 CFR 903.6 - Appeal of initial denial of access.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 36 Parks, Forests, and Public Property 3 2010-07-01 2010-07-01 false Appeal of initial denial of access. 903.6 Section 903.6 Parks, Forests, and Public Property PENNSYLVANIA AVENUE DEVELOPMENT... the Executive Director, Pennsylvania Avenue Development Corporation, 1331 Pennsylvania Avenue, NW...
Pharos: Collating protein information to shed light on the druggable genome.
Nguyen, Dac-Trung; Mathias, Stephen; Bologa, Cristian; Brunak, Soren; Fernandez, Nicolas; Gaulton, Anna; Hersey, Anne; Holmes, Jayme; Jensen, Lars Juhl; Karlsson, Anneli; Liu, Guixia; Ma'ayan, Avi; Mandava, Geetha; Mani, Subramani; Mehta, Saurabh; Overington, John; Patel, Juhee; Rouillard, Andrew D; Schürer, Stephan; Sheils, Timothy; Simeonov, Anton; Sklar, Larry A; Southall, Noel; Ursu, Oleg; Vidovic, Dusica; Waller, Anna; Yang, Jeremy; Jadhav, Ajit; Oprea, Tudor I; Guha, Rajarshi
2017-01-04
The 'druggable genome' encompasses several protein families, but only a subset of targets within them have attracted significant research attention and thus have information about them publicly available. The Illuminating the Druggable Genome (IDG) program, initiated in 2014, has the goal of developing experimental techniques and a Knowledge Management Center (KMC) that would collect and organize information about protein targets from four families, representing the most common druggable targets, with an emphasis on understudied proteins. Here, we describe two resources developed by the KMC: the Target Central Resource Database (TCRD), which collates many heterogeneous gene/protein datasets, and Pharos (https://pharos.nih.gov), a multimodal web interface that presents the data from TCRD. We briefly describe the types and sources of data considered by the KMC and then highlight features of the Pharos interface designed to enable intuitive access to the IDG knowledgebase. The aim of Pharos is to encourage 'serendipitous browsing', whereby related, relevant information is made easily discoverable. We conclude by describing two use cases that highlight the utility of Pharos and TCRD. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
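Programmatic retrieval of a target record from a resource like Pharos could look like the following sketch. The base path and route are hypothetical illustrations, not taken from Pharos documentation:

```python
from urllib.parse import quote

# Hypothetical REST base path; consult the Pharos site for the real API.
BASE = "https://pharos.nih.gov/idg/api/v1"

def target_url(gene_symbol: str) -> str:
    """Build a URL for looking up a target record by gene symbol.

    The /targets/ route is an assumed example, not a documented endpoint.
    """
    return f"{BASE}/targets/{quote(gene_symbol)}"

print(target_url("GPR3"))
```

A client would fetch that URL and parse the returned record; only the URL construction is shown here.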
The inequitable impact of health shocks on the uninsured in Namibia.
Gustafsson-Wright, Emily; Janssens, Wendy; van der Gaag, Jacques
2011-03-01
The AIDS pandemic in sub-Saharan Africa puts increasing pressure on the buffer capacity of low- and middle-income households without access to health insurance. This paper examines the relationship between health shocks, insurance status and health-seeking behaviour. It also investigates the possible mitigating effects of insurance on income loss and out-of-pocket health expenditure. The study uses a unique dataset based on a random sample of 1769 households and 7343 individuals living in the Greater Windhoek area in Namibia. The survey includes medical testing for HIV infection which allows for the explicit analysis of HIV-related health shocks. We find that the economic consequences of health shocks can be severe for uninsured households even in a country with a relatively well-developed public health care system such as Namibia. The uninsured resort to a variety of coping strategies to deal with the high medical expenses and reductions in income, such as selling assets, taking up credit or receiving financial support from relatives and friends. As HIV-infected individuals increasingly develop AIDS, this will put substantial pressure on the public health care system as well as social support networks. Evidence suggests that private insurance, currently unaffordable to the poor, protects households from the most severe consequences of health shocks.
Providing Access and Visualization to Global Cloud Properties from GEO Satellites
NASA Astrophysics Data System (ADS)
Chee, T.; Nguyen, L.; Minnis, P.; Spangenberg, D.; Palikonda, R.; Ayers, J. K.
2015-12-01
Providing public access to cloud macro- and microphysical properties is a key concern for the NASA Langley Research Center Cloud and Radiation Group. This work describes a tool and method that allow end users to easily browse and access cloud information that is otherwise difficult to acquire and manipulate. The core of the tool is an application programming interface that is made available to the public. One goal of the tool is to provide a demonstration to end users so that they can use the dynamically generated imagery as an input to their own workflows for both image generation and cloud product requisition. This project builds upon the NASA Langley Cloud and Radiation Group's experience with making real-time and historical satellite cloud product imagery accessible and easily searchable. As virtual supply chains that provide additional value at each link become more common, there is value in making satellite-derived cloud product information available through a simple access method, and in allowing users to browse and view that imagery as they need it rather than in the manner most convenient for the data provider. Using the Open Geospatial Consortium's Web Processing Service as our access method, we describe a system that uses a hybrid local and cloud-based parallel processing system to return both satellite imagery and cloud product imagery, as well as the binary data used to generate them, in multiple formats. The images and cloud products are sourced from multiple satellites and also from "merged" datasets created by temporally and spatially matching satellite sensors. Finally, the tool and API allow users to access information spanning the time ranges for which our group has information available. In the case of satellite imagery, the temporal range can span the entire lifetime of the sensor.
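The OGC Web Processing Service access method mentioned above supports key-value-pair (KVP) requests over HTTP GET. The sketch below builds a standard WPS 1.0.0 GetCapabilities request and an Execute request; the endpoint URL and process identifier are assumed placeholders, not documented NASA Langley values:

```python
from urllib.parse import urlencode

# Hypothetical WPS endpoint; the real service address is not shown
# in the abstract.
WPS_ENDPOINT = "https://example.larc.nasa.gov/wps"

# GetCapabilities lists the processes a WPS server offers.
capabilities_url = WPS_ENDPOINT + "?" + urlencode({
    "service": "WPS",
    "request": "GetCapabilities",
})

# Execute names a process and its inputs; in KVP syntax, DataInputs is a
# semicolon-separated list of name=value pairs.
execute_url = WPS_ENDPOINT + "?" + urlencode({
    "service": "WPS",
    "version": "1.0.0",
    "request": "Execute",
    "identifier": "CloudProductImagery",  # hypothetical process name
    "DataInputs": "product=cloud_phase;time=2015-07-01T12:00:00Z",
})

print(capabilities_url)
print(execute_url)
```

A client would issue these GET requests and parse the XML responses; the URLs are only constructed here, not sent.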