Interoperability Across the Stewardship Spectrum in the DataONE Repository Federation
NASA Astrophysics Data System (ADS)
Jones, M. B.; Vieglais, D.; Wilson, B. E.
2016-12-01
Thousands of earth and environmental science repositories serve many researchers and communities, each with their own community and legal mandates, sustainability models, and historical infrastructure. These repositories span the stewardship spectrum from highly curated collections that employ large numbers of staff members to review and improve data, to small, minimal budget repositories that accept data caveat emptor and where all responsibility for quality lies with the submitter. Each repository fills a niche, providing services that meet the stewardship tradeoffs of one or more communities. We have reviewed these stewardship tradeoffs for several DataONE member repositories ranging from minimally (KNB) to highly curated (Arctic Data Center), as well as general purpose (Dryad) to highly discipline or project specific (NEON). The rationale behind different levels of stewardship reflect resolution of these tradeoffs. Some repositories aim to encourage extensive uptake by keeping processes simple and minimizing the amount of information collected, but this limits the long-term utility of the data and the search, discovery, and integration systems that are possible. Other repositories require extensive metadata input, review, and assessment, allowing for excellent preservation, discovery, and integration but at the cost of significant time for submitters and expense for curatorial staff. DataONE recognizes these different levels of curation, and attempts to embrace them to create a federation that is useful across the stewardship spectrum. DataONE provides a tiered model for repositories with growing utility of DataONE services at higher tiers of curation. The lowest tier supports read-only access to data and requires little more than title and contact metadata. Repositories can gradually phase in support for higher levels of metadata and services as needed. These tiered capabilities are possible through flexible support for multiple metadata standards and services, where repositories can incrementally increase their requirements as they want to satisfy more use cases. Within DataONE, metadata search services support minimal metadata models, but significantly expanded precision and recall become possible when repositories provide more extensively curated metadata.
A Window to the World: Lessons Learned from NASA's Collaborative Metadata Curation Effort
NASA Astrophysics Data System (ADS)
Bugbee, K.; Dixon, V.; Baynes, K.; Shum, D.; le Roux, J.; Ramachandran, R.
2017-12-01
Well written descriptive metadata adds value to data by making data easier to discover as well as increases the use of data by providing the context or appropriateness of use. While many data centers acknowledge the importance of correct, consistent and complete metadata, allocating resources to curate existing metadata is often difficult. To lower resource costs, many data centers seek guidance on best practices for curating metadata but struggle to identify those recommendations. In order to assist data centers in curating metadata and to also develop best practices for creating and maintaining metadata, NASA has formed a collaborative effort to improve the Earth Observing System Data and Information System (EOSDIS) metadata in the Common Metadata Repository (CMR). This effort has taken significant steps in building consensus around metadata curation best practices. However, this effort has also revealed gaps in EOSDIS enterprise policies and procedures within the core metadata curation task. This presentation will explore the mechanisms used for building consensus on metadata curation, the gaps identified in policies and procedures, the lessons learned from collaborating with both the data centers and metadata curation teams, and the proposed next steps for the future.
NASA Astrophysics Data System (ADS)
le Roux, J.; Baker, A.; Caltagirone, S.; Bugbee, K.
2017-12-01
The Common Metadata Repository (CMR) is a high-performance, high-quality repository for Earth science metadata records, and serves as the primary way to search NASA's growing 17.5 petabytes of Earth science data holdings. Released in 2015, CMR has the capability to support several different metadata standards already being utilized by NASA's combined network of Earth science data providers, or Distributed Active Archive Centers (DAACs). The Analysis and Review of CMR (ARC) Team located at Marshall Space Flight Center is working to improve the quality of records already in CMR with the goal of making records optimal for search and discovery. This effort entails a combination of automated and manual review, where each NASA record in CMR is checked for completeness, accuracy, and consistency. This effort is highly collaborative in nature, requiring communication and transparency of findings amongst NASA personnel, DAACs, the CMR team and other metadata curation teams. Through the evolution of this project it has become apparent that there is a need to document and report findings, as well as track metadata improvements in a more efficient manner. The ARC team has collaborated with Element 84 in order to develop a metadata curation tool to meet these needs. In this presentation, we will provide an overview of this metadata curation tool and its current capabilities. Challenges and future plans for the tool will also be discussed.
Li, Zhao; Li, Jin; Yu, Peng
2018-01-01
Abstract Metadata curation has become increasingly important for biological discovery and biomedical research because a large amount of heterogeneous biological data is currently freely available. To facilitate efficient metadata curation, we developed an easy-to-use web-based curation application, GEOMetaCuration, for curating the metadata of Gene Expression Omnibus datasets. It can eliminate mechanical operations that consume precious curation time and can help coordinate curation efforts among multiple curators. It improves the curation process by introducing various features that are critical to metadata curation, such as a back-end curation management system and a curator-friendly front-end. The application is based on a commonly used web development framework of Python/Django and is open-sourced under the GNU General Public License V3. GEOMetaCuration is expected to benefit the biocuration community and to contribute to computational generation of biological insights using large-scale biological data. An example use case can be found at the demo website: http://geometacuration.yubiolab.org. Database URL: https://bitbucket.com/yubiolab/GEOMetaCuration PMID:29688376
Schmedes, Sarah E; King, Jonathan L; Budowle, Bruce
2015-01-01
Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes with relative ease and time with a substantially reduced cost per nucleotide, hence cost per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publically available genomes can be readily downloaded; however, there are challenges to verify the specific supporting data contained within the download and to identify errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local database curation of supporting data that accompany downloaded genomes from the National Center for Biotechnology Information. AutoCurE provides an automated approach to curate local genomic databases by flagging inconsistencies or errors by comparing the downloaded supporting data to the genome reports to verify genome name, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and genomes reports and if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local database curation for large-scale genome data prior to downstream analyses.
Collaborative Metadata Curation in Support of NASA Earth Science Data Stewardship
NASA Technical Reports Server (NTRS)
Sisco, Adam W.; Bugbee, Kaylin; le Roux, Jeanne; Staton, Patrick; Freitag, Brian; Dixon, Valerie
2018-01-01
Growing collection of NASA Earth science data is archived and distributed by EOSDIS’s 12 Distributed Active Archive Centers (DAACs). Each collection and granule is described by a metadata record housed in the Common Metadata Repository (CMR). Multiple metadata standards are in use, and core elements of each are mapped to and from a common model – the Unified Metadata Model (UMM). Work done by the Analysis and Review of CMR (ARC) Team.
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Richard, S. M.; Valentine, D. W., Jr.; Grethe, J. S.; Hsu, L.; Malik, T.; Bermudez, L. E.; Gupta, A.; Lehnert, K. A.; Whitenack, T.; Ozyurt, I. B.; Condit, C.; Calderon, R.; Musil, L.
2014-12-01
EarthCube is envisioned as a cyberinfrastructure that fosters new, transformational geoscience by enabling sharing, understanding and scientifically-sound and efficient re-use of formerly unconnected data resources, software, models, repositories, and computational power. Its purpose is to enable science enterprise and workforce development via an extensible and adaptable collaboration and resource integration framework. A key component of this vision is development of comprehensive inventories supporting resource discovery and re-use across geoscience domains. The goal of the EarthCube CINERGI (Community Inventory of EarthCube Resources for Geoscience Interoperability) project is to create a methodology and assemble a large inventory of high-quality information resources with standard metadata descriptions and traceable provenance. The inventory is compiled from metadata catalogs maintained by geoscience data facilities, as well as from user contributions. The latter mechanism relies on community resource viewers: online applications that support update and curation of metadata records. Once harvested into CINERGI, metadata records from domain catalogs and community resource viewers are loaded into a staging database implemented in MongoDB, and validated for compliance with ISO 19139 metadata schema. Several types of metadata defects detected by the validation engine are automatically corrected with help of several information extractors or flagged for manual curation. The metadata harvesting, validation and processing components generate provenance statements using W3C PROV notation, which are stored in a Neo4J database. Thus curated metadata, along with the provenance information, is re-published and accessed programmatically and via a CINERGI online application. This presentation focuses on the role of resource inventories in a scalable and adaptable information infrastructure, and on the CINERGI metadata pipeline and its implementation challenges. Key project components are described at the project's website (http://workspace.earthcube.org/cinergi), which also provides access to the initial resource inventory, the inventory metadata model, metadata entry forms and a collection of the community resource viewers.
Making Metadata Better with CMR and MMT
NASA Technical Reports Server (NTRS)
Gilman, Jason Arthur; Shum, Dana
2016-01-01
Ensuring complete, consistent and high quality metadata is a challenge for metadata providers and curators. The CMR and MMT systems provide providers and curators options to build in metadata quality from the start and also assess and improve the quality of already existing metadata.
NASA Technical Reports Server (NTRS)
Shum, Dana; Bugbee, Kaylin
2017-01-01
This talk explains the ongoing metadata curation activities in the Common Metadata Repository. It explores tools that exist today which are useful for building quality metadata and also opens up the floor for discussions on other potentially useful tools.
Standards-based curation of a decade-old digital repository dataset of molecular information.
Harvey, Matthew J; Mason, Nicholas J; McLean, Andrew; Murray-Rust, Peter; Rzepa, Henry S; Stewart, James J P
2015-01-01
The desirable curation of 158,122 molecular geometries derived from the NCI set of reference molecules together with associated properties computed using the MOPAC semi-empirical quantum mechanical method and originally deposited in 2005 into the Cambridge DSpace repository as a data collection is reported. The procedures involved in the curation included annotation of the original data using new MOPAC methods, updating the syntax of the CML documents used to express the data to ensure schema conformance and adding new metadata describing the entries together with a XML schema transformation to map the metadata schema to that used by the DataCite organisation. We have adopted a granularity model in which a DataCite persistent identifier (DOI) is created for each individual molecule to enable data discovery and data metrics at this level using DataCite tools. We recommend that the future research data management (RDM) of the scientific and chemical data components associated with journal articles (the "supporting information") should be conducted in a manner that facilitates automatic periodic curation. Graphical abstractStandards and metadata-based curation of a decade-old digital repository dataset of molecular information.
Making Interoperability Easier with NASA's Metadata Management Tool (MMT)
NASA Technical Reports Server (NTRS)
Shum, Dana; Reese, Mark; Pilone, Dan; Baynes, Katie
2016-01-01
While the ISO-19115 collection level metadata format meets many users' needs for interoperable metadata, it can be cumbersome to create it correctly. Through the MMT's simple UI experience, metadata curators can create and edit collections which are compliant with ISO-19115 without full knowledge of the NASA Best Practices implementation of ISO-19115 format. Users are guided through the metadata creation process through a forms-based editor, complete with field information, validation hints and picklists. Once a record is completed, users can download the metadata in any of the supported formats with just 2 clicks.
Reddy, T.B.K.; Thomas, Alex D.; Stamatis, Dimitri; Bertsch, Jon; Isbandi, Michelle; Jansson, Jakob; Mallajosyula, Jyothi; Pagani, Ioanna; Lobos, Elizabeth A.; Kyrpides, Nikos C.
2015-01-01
The Genomes OnLine Database (GOLD; http://www.genomesonline.org) is a comprehensive online resource to catalog and monitor genetic studies worldwide. GOLD provides up-to-date status on complete and ongoing sequencing projects along with a broad array of curated metadata. Here we report version 5 (v.5) of the database. The newly designed database schema and web user interface supports several new features including the implementation of a four level (meta)genome project classification system and a simplified intuitive web interface to access reports and launch search tools. The database currently hosts information for about 19 200 studies, 56 000 Biosamples, 56 000 sequencing projects and 39 400 analysis projects. More than just a catalog of worldwide genome projects, GOLD is a manually curated, quality-controlled metadata warehouse. The problems encountered in integrating disparate and varying quality data into GOLD are briefly highlighted. GOLD fully supports and follows the Genomic Standards Consortium (GSC) Minimum Information standards. PMID:25348402
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Richard, S. M.; Malik, T.; Hsu, L.; Gupta, A.; Grethe, J. S.; Valentine, D. W., Jr.; Lehnert, K. A.; Bermudez, L. E.; Ozyurt, I. B.; Whitenack, T.; Schachne, A.; Giliarini, A.
2015-12-01
While many geoscience-related repositories and data discovery portals exist, finding information about available resources remains a pervasive problem, especially when searching across multiple domains and catalogs. Inconsistent and incomplete metadata descriptions, disparate access protocols and semantic differences across domains, and troves of unstructured or poorly structured information which is hard to discover and use are major hindrances toward discovery, while metadata compilation and curation remain manual and time-consuming. We report on methodology, main results and lessons learned from an ongoing effort to develop a geoscience-wide catalog of information resources, with consistent metadata descriptions, traceable provenance, and automated metadata enhancement. Developing such a catalog is the central goal of CINERGI (Community Inventory of EarthCube Resources for Geoscience Interoperability), an EarthCube building block project (earthcube.org/group/cinergi). The key novel technical contributions of the projects include: a) development of a metadata enhancement pipeline and a set of document enhancers to automatically improve various aspects of metadata descriptions, including keyword assignment and definition of spatial extents; b) Community Resource Viewers: online applications for crowdsourcing community resource registry development, curation and search, and channeling metadata to the unified CINERGI inventory, c) metadata provenance, validation and annotation services, d) user interfaces for advanced resource discovery; and e) geoscience-wide ontology and machine learning to support automated semantic tagging and faceted search across domains. We demonstrate these CINERGI components in three types of user scenarios: (1) improving existing metadata descriptions maintained by government and academic data facilities, (2) supporting work of several EarthCube Research Coordination Network projects in assembling information resources for their domains, and (3) enhancing the inventory and the underlying ontology to address several complicated data discovery use cases in hydrology, geochemistry, sedimentology, and critical zone science. Support from the US National Science Foundation under award ICER-1343816 is gratefully acknowledged.
The Metadata Coverage Index (MCI): A standardized metric for quantifying database metadata richness.
Liolios, Konstantinos; Schriml, Lynn; Hirschman, Lynette; Pagani, Ioanna; Nosrat, Bahador; Sterk, Peter; White, Owen; Rocca-Serra, Philippe; Sansone, Susanna-Assunta; Taylor, Chris; Kyrpides, Nikos C; Field, Dawn
2012-07-30
Variability in the extent of the descriptions of data ('metadata') held in public repositories forces users to assess the quality of records individually, which rapidly becomes impractical. The scoring of records on the richness of their description provides a simple, objective proxy measure for quality that enables filtering that supports downstream analysis. Pivotally, such descriptions should spur on improvements. Here, we introduce such a measure - the 'Metadata Coverage Index' (MCI): the percentage of available fields actually filled in a record or description. MCI scores can be calculated across a database, for individual records or for their component parts (e.g., fields of interest). There are many potential uses for this simple metric: for example; to filter, rank or search for records; to assess the metadata availability of an ad hoc collection; to determine the frequency with which fields in a particular record type are filled, especially with respect to standards compliance; to assess the utility of specific tools and resources, and of data capture practice more generally; to prioritize records for further curation; to serve as performance metrics of funded projects; or to quantify the value added by curation. Here we demonstrate the utility of MCI scores using metadata from the Genomes Online Database (GOLD), including records compliant with the 'Minimum Information about a Genome Sequence' (MIGS) standard developed by the Genomic Standards Consortium. We discuss challenges and address the further application of MCI scores; to show improvements in annotation quality over time, to inform the work of standards bodies and repository providers on the usability and popularity of their products, and to assess and credit the work of curators. Such an index provides a step towards putting metadata capture practices and in the future, standards compliance, into a quantitative and objective framework.
Improving Access to NASA Earth Science Data through Collaborative Metadata Curation
NASA Astrophysics Data System (ADS)
Sisco, A. W.; Bugbee, K.; Shum, D.; Baynes, K.; Dixon, V.; Ramachandran, R.
2017-12-01
The NASA-developed Common Metadata Repository (CMR) is a high-performance metadata system that currently catalogs over 375 million Earth science metadata records. It serves as the authoritative metadata management system of NASA's Earth Observing System Data and Information System (EOSDIS), enabling NASA Earth science data to be discovered and accessed by a worldwide user community. The size of the EOSDIS data archive is steadily increasing, and the ability to manage and query this archive depends on the input of high quality metadata to the CMR. Metadata that does not provide adequate descriptive information diminishes the CMR's ability to effectively find and serve data to users. To address this issue, an innovative and collaborative review process is underway to systematically improve the completeness, consistency, and accuracy of metadata for approximately 7,000 data sets archived by NASA's twelve EOSDIS data centers, or Distributed Active Archive Centers (DAACs). The process involves automated and manual metadata assessment of both collection and granule records by a team of Earth science data specialists at NASA Marshall Space Flight Center. The team communicates results to DAAC personnel, who then make revisions and reingest improved metadata into the CMR. Implementation of this process relies on a network of interdisciplinary collaborators leveraging a variety of communication platforms and long-range planning strategies. Curating metadata at this scale and resolving metadata issues through community consensus improves the CMR's ability to serve current and future users and also introduces best practices for stewarding the next generation of Earth Observing System data. This presentation will detail the metadata curation process, its outcomes thus far, and also share the status of ongoing curation activities.
Omics Metadata Management Software (OMMS).
Perez-Arriaga, Martha O; Wilson, Susan; Williams, Kelly P; Schoeniger, Joseph; Waymire, Russel L; Powell, Amy Jo
2015-01-01
Next-generation sequencing projects have underappreciated information management tasks requiring detailed attention to specimen curation, nucleic acid sample preparation and sequence production methods required for downstream data processing, comparison, interpretation, sharing and reuse. The few existing metadata management tools for genome-based studies provide weak curatorial frameworks for experimentalists to store and manage idiosyncratic, project-specific information, typically offering no automation supporting unified naming and numbering conventions for sequencing production environments that routinely deal with hundreds, if not thousands of samples at a time. Moreover, existing tools are not readily interfaced with bioinformatics executables, (e.g., BLAST, Bowtie2, custom pipelines). Our application, the Omics Metadata Management Software (OMMS), answers both needs, empowering experimentalists to generate intuitive, consistent metadata, and perform analyses and information management tasks via an intuitive web-based interface. Several use cases with short-read sequence datasets are provided to validate installation and integrated function, and suggest possible methodological road maps for prospective users. Provided examples highlight possible OMMS workflows for metadata curation, multistep analyses, and results management and downloading. The OMMS can be implemented as a stand alone-package for individual laboratories, or can be configured for webbased deployment supporting geographically-dispersed projects. The OMMS was developed using an open-source software base, is flexible, extensible and easily installed and executed. The OMMS can be obtained at http://omms.sandia.gov. The OMMS can be obtained at http://omms.sandia.gov.
Omics Metadata Management Software (OMMS)
Perez-Arriaga, Martha O; Wilson, Susan; Williams, Kelly P; Schoeniger, Joseph; Waymire, Russel L; Powell, Amy Jo
2015-01-01
Next-generation sequencing projects have underappreciated information management tasks requiring detailed attention to specimen curation, nucleic acid sample preparation and sequence production methods required for downstream data processing, comparison, interpretation, sharing and reuse. The few existing metadata management tools for genome-based studies provide weak curatorial frameworks for experimentalists to store and manage idiosyncratic, project-specific information, typically offering no automation supporting unified naming and numbering conventions for sequencing production environments that routinely deal with hundreds, if not thousands of samples at a time. Moreover, existing tools are not readily interfaced with bioinformatics executables, (e.g., BLAST, Bowtie2, custom pipelines). Our application, the Omics Metadata Management Software (OMMS), answers both needs, empowering experimentalists to generate intuitive, consistent metadata, and perform analyses and information management tasks via an intuitive web-based interface. Several use cases with short-read sequence datasets are provided to validate installation and integrated function, and suggest possible methodological road maps for prospective users. Provided examples highlight possible OMMS workflows for metadata curation, multistep analyses, and results management and downloading. The OMMS can be implemented as a stand alone-package for individual laboratories, or can be configured for webbased deployment supporting geographically-dispersed projects. The OMMS was developed using an open-source software base, is flexible, extensible and easily installed and executed. The OMMS can be obtained at http://omms.sandia.gov. Availability The OMMS can be obtained at http://omms.sandia.gov PMID:26124554
Dataworks for GNSS: Software for Supporting Data Sharing and Federation of Geodetic Networks
NASA Astrophysics Data System (ADS)
Boler, F. M.; Meertens, C. M.; Miller, M. M.; Wier, S.; Rost, M.; Matykiewicz, J.
2015-12-01
Continuously-operating Global Navigation Satellite System (GNSS) networks are increasingly being installed globally for a wide variety of science and societal applications. GNSS enables Earth science research in areas including tectonic plate interactions, crustal deformation in response to loading by tectonics, magmatism, water and ice, and the dynamics of water - and thereby energy transfer - in the atmosphere at regional scale. The many individual scientists and organizations that set up GNSS stations globally are often open to sharing data, but lack the resources or expertise to deploy systems and software to manage and curate data and metadata and provide user tools that would support data sharing. UNAVCO previously gained experience in facilitating data sharing through the NASA-supported development of the Geodesy Seamless Archive Centers (GSAC) open source software. GSAC provides web interfaces and simple web services for data and metadata discovery and access, supports federation of multiple data centers, and simplifies transfer of data and metadata to long-term archives. The NSF supported the dissemination of GSAC to multiple European data centers forming the European Plate Observing System. To expand upon GSAC to provide end-to-end, instrument-to-distribution capability, UNAVCO developed Dataworks for GNSS with NSF funding to the COCONet project, and deployed this software on systems that are now operating as Regional GNSS Data Centers as part of the NSF-funded TLALOCNet and COCONet projects. Dataworks consists of software modules written in Python and Java for data acquisition, management and sharing. There are modules for GNSS receiver control and data download, a database schema for metadata, tools for metadata handling, ingest software to manage file metadata, data file management scripts, GSAC, scripts for mirroring station data and metadata from partner GSACs, and extensive software and operator documentation. UNAVCO plans to provide a cloud VM image of Dataworks that would allow standing up a Dataworks-enabled GNSS data center without requiring upfront investment in server hardware. By enabling data creators to organize their data and metadata for sharing, Dataworks helps scientists expand their data curation awareness and responsibility, and enhances data access for all.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Reddy, Tatiparthi B. K.; Thomas, Alex D.; Stamatis, Dimitri
The Genomes OnLine Database (GOLD; http://www.genomesonline.org) is a comprehensive online resource to catalog and monitor genetic studies worldwide. GOLD provides up-to-date status on complete and ongoing sequencing projects along with a broad array of curated metadata. Within this paper, we report version 5 (v.5) of the database. The newly designed database schema and web user interface supports several new features including the implementation of a four level (meta)genome project classification system and a simplified intuitive web interface to access reports and launch search tools. The database currently hosts information for about 19 200 studies, 56 000 Biosamples, 56 000 sequencingmore » projects and 39 400 analysis projects. More than just a catalog of worldwide genome projects, GOLD is a manually curated, quality-controlled metadata warehouse. The problems encountered in integrating disparate and varying quality data into GOLD are briefly highlighted. Lastly, GOLD fully supports and follows the Genomic Standards Consortium (GSC) Minimum Information standards.« less
Sequencing Data Discovery and Integration for Earth System Science with MetaSeek
NASA Astrophysics Data System (ADS)
Hoarfrost, A.; Brown, N.; Arnosti, C.
2017-12-01
Microbial communities play a central role in biogeochemical cycles. Sequencing data resources from environmental sources have grown exponentially in recent years, and represent a singular opportunity to investigate microbial interactions with Earth system processes. Carrying out such meta-analyses depends on our ability to discover and curate sequencing data into large-scale integrated datasets. However, such integration efforts are currently challenging and time-consuming, with sequencing data scattered across multiple repositories and metadata that is not easily or comprehensively searchable. MetaSeek is a sequencing data discovery tool that integrates sequencing metadata from all the major data repositories, allowing the user to search and filter on datasets in a lightweight application with an intuitive, easy-to-use web-based interface. Users can save and share curated datasets, while other users can browse these data integrations or use them as a jumping off point for their own curation. Missing and/or erroneous metadata are inferred automatically where possible, and where not possible, users are prompted to contribute to the improvement of the sequencing metadata pool by correcting and amending metadata errors. Once an integrated dataset has been curated, users can follow simple instructions to download their raw data and quickly begin their investigations. In addition to the online interface, the MetaSeek database is easily queryable via an open API, further enabling users and facilitating integrations of MetaSeek with other data curation tools. This tool lowers the barriers to curation and integration of environmental sequencing data, clearing the path forward to illuminating the ecosystem-scale interactions between biological and abiotic processes.
Omics Metadata Management Software v. 1 (OMMS)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Our application, the Omics Metadata Management Software (OMMS), answers both needs, empowering experimentalists to generate intuitive, consistent metadata, and to perform bioinformatics analyses and information management tasks via a simple and intuitive web-based interface. Several use cases with short-read sequence datasets are provided to showcase the full functionality of the OMMS, from metadata curation tasks, to bioinformatics analyses and results management and downloading. The OMMS can be implemented as a stand alone-package for individual laboratories, or can be configured for web-based deployment supporting geographically dispersed research teams. Our software was developed with open-source bundles, is flexible, extensible and easily installedmore » and run by operators with general system administration and scripting language literacy.« less
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs and Increasing Value
NASA Astrophysics Data System (ADS)
Myers, J.; Hedstrom, M.; Plale, B. A.; Kumar, P.; McDonald, R.; Kooper, R.; Marini, L.; Kouper, I.; Chandrasekar, K.
2013-12-01
What if everything that researchers know about their data, and everything their applications know, were directly available to curators? What if all the information that data consumers discover and infer about data were also available? What if curation and preservation activities occurred incrementally, during research projects instead of after they end, and could be leveraged to make it easier to manage research data from the moment of its creation? These are questions that the Sustainable Environments - Actionable Data (SEAD) project, funded as part of the National Science Foundation's DataNet partnership, was designed to answer. Data curation is challenging, but it is made more difficult by the historical separation of data production, data use, and formal curation activities across organizations, locations, and applications, and across time. Modern computing and networking technologies allow a much different approach in which data and metadata can easily flow between these activities throughout the data lifecycle, and in which heterogeneous and evolving data and metadata can be managed. Sustainability research, SEAD's initial focus area, is a clear example of an area where the nature of the research (cross-disciplinary, integrating heterogeneous data from independent sources, small teams, rapid evolution of sensing and analysis techniques) and the barriers and costs inherent in traditional methods have limited adoption of existing curation tools and techniques, to the detriment of overall scientific progress. To explore these ideas and create a sustainable curation capability for communities such as sustainability research, the SEAD team has developed and is now deploying an interacting set of open source data services that demonstrate this approach. These services provide end-to-end support for management of data during research projects; publication of that data into long-term archives; and integration of it into community networks of publications, research center activities, and synthesis efforts. They build on a flexible ';semantic content management' architecture and incorporate notions of ';active' and ';social' curation - continuous, incremental curation activities performed by the data producers (active) and the community (social) that are motivated by a range of direct benefits. Examples include the use of metadata (tags) to allow generation of custom geospatial maps, automated metadata extraction to generate rich data pages for known formats, and the use of information about data authorship to allow automatic updates of personal and project research profiles when data is published. In this presentation, we describe the core capabilities of SEAD's services and their application in sustainability research. We also outline the key features of the SEAD architecture - the use of global semantic identifiers, extensible data and metadata models, web services to manage context shifts, scalable cloud storage - and highlight how this approach is particularly well suited to extension by independent third parties. We conclude with thoughts on how this approach can be applied to challenging issues such as exposing ';dark' data and reducing duplicate creation of derived data products, and can provide a new level of analytics for community analysis and coordination.
Simplified Metadata Curation via the Metadata Management Tool
NASA Astrophysics Data System (ADS)
Shum, D.; Pilone, D.
2015-12-01
The Metadata Management Tool (MMT) is the newest capability developed as part of NASA Earth Observing System Data and Information System's (EOSDIS) efforts to simplify metadata creation and improve metadata quality. The MMT was developed via an agile methodology, taking into account inputs from GCMD's science coordinators and other end-users. In its initial release, the MMT uses the Unified Metadata Model for Collections (UMM-C) to allow metadata providers to easily create and update collection records in the ISO-19115 format. Through a simplified UI experience, metadata curators can create and edit collections without full knowledge of the NASA Best Practices implementation of ISO-19115 format, while still generating compliant metadata. More experienced users are also able to access raw metadata to build more complex records as needed. In future releases, the MMT will build upon recent work done in the community to assess metadata quality and compliance with a variety of standards through application of metadata rubrics. The tool will provide users with clear guidance as to how to easily change their metadata in order to improve their quality and compliance. Through these features, the MMT allows data providers to create and maintain compliant and high quality metadata in a short amount of time.
Rocca-Serra, Philippe; Brandizi, Marco; Maguire, Eamonn; Sklyar, Nataliya; Taylor, Chris; Begley, Kimberly; Field, Dawn; Harris, Stephen; Hide, Winston; Hofmann, Oliver; Neumann, Steffen; Sterk, Peter; Tong, Weida; Sansone, Susanna-Assunta
2010-01-01
Summary: The first open source software suite for experimentalists and curators that (i) assists in the annotation and local management of experimental metadata from high-throughput studies employing one or a combination of omics and other technologies; (ii) empowers users to uptake community-defined checklists and ontologies; and (iii) facilitates submission to international public repositories. Availability and Implementation: Software, documentation, case studies and implementations at http://www.isa-tools.org Contact: isatools@googlegroups.com PMID:20679334
Making Interoperability Easier with the NASA Metadata Management Tool
NASA Astrophysics Data System (ADS)
Shum, D.; Reese, M.; Pilone, D.; Mitchell, A. E.
2016-12-01
ISO 19115 has enabled interoperability amongst tools, yet many users find it hard to build ISO metadata for their collections because it can be large and overly flexible for their needs. The Metadata Management Tool (MMT), part of NASA's Earth Observing System Data and Information System (EOSDIS), offers users a modern, easy to use browser based tool to develop ISO compliant metadata. Through a simplified UI experience, metadata curators can create and edit collections without any understanding of the complex ISO-19115 format, while still generating compliant metadata. The MMT is also able to assess the completeness of collection level metadata by evaluating it against a variety of metadata standards. The tool provides users with clear guidance as to how to change their metadata in order to improve their quality and compliance. It is based on NASA's Unified Metadata Model for Collections (UMM-C) which is a simpler metadata model which can be cleanly mapped to ISO 19115. This allows metadata authors and curators to meet ISO compliance requirements faster and more accurately. The MMT and UMM-C have been developed in an agile fashion, with recurring end user tests and reviews to continually refine the tool, the model and the ISO mappings. This process is allowing for continual improvement and evolution to meet the community's needs.
NASA Astrophysics Data System (ADS)
Budden, A. E.; Arzayus, K. M.; Baker-Yeboah, S.; Casey, K. S.; Dozier, J.; Jones, C. S.; Jones, M. B.; Schildhauer, M.; Walker, L.
2016-12-01
The newly established NSF Arctic Data Center plays a critical support role in archiving and curating the data and software generated by Arctic researchers from diverse disciplines. The Arctic community, comprising Earth science, archaeology, geography, anthropology, and other social science researchers, are supported through data curation services and domain agnostic tools and infrastructure, ensuring data are accessible in the most transparent and usable way possible. This interoperability across diverse disciplines within the Arctic community facilitates collaborative research and is mirrored by interoperability between the Arctic Data Center infrastructure and other large scale cyberinfrastructure initiatives. The Arctic Data Center leverages the DataONE federation to standardize access to and replication of data and metadata to other repositories, specifically the NOAA's National Centers for Environmental Information (NCEI). This approach promotes long-term preservation of the data and metadata, as well as opening the door for other data repositories to leverage this replication infrastructure with NCEI and other DataONE member repositories. The Arctic Data Center uses rich, detailed metadata following widely recognized standards. Particularly, measurement-level and provenance metadata provide scientists the details necessary to integrate datasets across studies and across repositories while enabling a full understanding of the provenance of data used in the system. The Arctic Data Center gains this deep metadata and provenance support by simply adopting DataONE services, which results in significant efficiency gains by eliminating the need to develop systems de novo. Similarly, the advanced search tool developed by the Knowledge Network for Biocomplexity and extended for data submission by the Arctic Data Center, can be used by other DataONE-compliant repositories without further development. By standardizing interfaces and leveraging the DataONE federation, the Arctic Data Center has advanced rapidly and can itself contribute to raising the capabilities of all members of the federation.
NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases.
Bagewadi, Shweta; Adhikari, Subash; Dhrangadhariya, Anjani; Irin, Afroza Khanam; Ebeling, Christian; Namasivayam, Aishwarya Alex; Page, Matthew; Hofmann-Apitius, Martin; Senger, Philipp
2015-01-01
Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analysis; increasing the power to detect differentially regulated genes in disease and explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance to defined standards for submitted metadata in public databases. Much of the information to complete, or refine meta-annotations are distributed in the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article's supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases explicate annotations that distinguish human and animal models in neurodegeneration context. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation and discuss the key challenges encountered. Curated metadata for Alzheimer's disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html. © The Author(s) 2015. Published by Oxford University Press.
NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases
Bagewadi, Shweta; Adhikari, Subash; Dhrangadhariya, Anjani; Irin, Afroza Khanam; Ebeling, Christian; Namasivayam, Aishwarya Alex; Page, Matthew; Hofmann-Apitius, Martin
2015-01-01
Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analysis; increasing the power to detect differentially regulated genes in disease and explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance to defined standards for submitted metadata in public databases. Much of the information to complete, or refine meta-annotations are distributed in the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article’s supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases explicate annotations that distinguish human and animal models in neurodegeneration context. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation and discuss the key challenges encountered. Curated metadata for Alzheimer’s disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html PMID:26475471
Automated Database Mediation Using Ontological Metadata Mappings
Marenco, Luis; Wang, Rixin; Nadkarni, Prakash
2009-01-01
Objective To devise an automated approach for integrating federated database information using database ontologies constructed from their extended metadata. Background One challenge of database federation is that the granularity of representation of equivalent data varies across systems. Dealing effectively with this problem is analogous to dealing with precoordinated vs. postcoordinated concepts in biomedical ontologies. Model Description The authors describe an approach based on ontological metadata mapping rules defined with elements of a global vocabulary, which allows a query specified at one granularity level to fetch data, where possible, from databases within the federation that use different granularities. This is implemented in OntoMediator, a newly developed production component of our previously described Query Integrator System. OntoMediator's operation is illustrated with a query that accesses three geographically separate, interoperating databases. An example based on SNOMED also illustrates the applicability of high-level rules to support the enforcement of constraints that can prevent inappropriate curator or power-user actions. Summary A rule-based framework simplifies the design and maintenance of systems where categories of data must be mapped to each other, for the purpose of either cross-database query or for curation of the contents of compositional controlled vocabularies. PMID:19567801
NASA Astrophysics Data System (ADS)
Moore, J.; Serreze, M. C.; Middleton, D.; Ramamurthy, M. K.; Yarmey, L.
2013-12-01
The NSF funds the Advanced Cooperative Arctic Data and Information System (ACADIS), url: (http://www.aoncadis.org/). It serves the growing and increasingly diverse data management needs of NSF's arctic research community. The ACADIS investigator team combines experienced data managers, curators and software engineers from the NSIDC, UCAR and NCAR. ACADIS fosters scientific synthesis and discovery by providing a secure long-term data archive to NSF investigators. The system provides discovery and access to arctic related data from this and other archives. This paper updates the technical components of ACADIS, the implementation of best practices, the value of ACADIS to the community and the major challenges facing this archive for the future in handling the diverse data coming from NSF Arctic investigators. ACADIS provides sustainable data management, data stewardship services and leadership for the NSF Arctic research community through open data sharing, adherence to best practices and standards, capitalizing on appropriate evolving technologies, community support and engagement. ACADIS leverages other pertinent projects, capitalizing on appropriate emerging technologies and participating in emerging cyberinfrastructure initiatives. The key elements of ACADIS user services to the NSF Arctic community include: data and metadata upload; support for datasets with special requirements; metadata and documentation generation; interoperability and initiatives with other archives; and science support to investigators and the community. Providing a self-service data publishing platform requiring minimal curation oversight while maintaining rich metadata for discovery, access and preservation is challenging. Implementing metadata standards are a first step towards consistent content. The ACADIS Gateway and ADE offer users choices for data discovery and access with the clear objective of increasing discovery and use of all Arctic data especially for analysis activities. Metadata is at the core of ACADIS activities, from capturing metadata at the point of data submission to ensuring interoperability , providing data citations, and supporting data discovery. ACADIS metadata efforts include: 1) Evolution of the ACADIS metadata profile to increase flexibility in search; 2) Documentation guidelines; and 3) Metadata standardization efforts. A major activity is now underway to ensure consistency in the metadata profile across all archived datasets. ACADIS is embarking on a critical activity to create Digital Object Identifiers (DOI) for all its holdings. The data services offered by ACADIS focus on meeting the needs of the data providers, providing dynamic search capabilities to peruse the ACADIS and related cyrospheric data repositories, efficient data download and some special services including dataset reformatting and visualization. The service is built around of the following key technical elements: The ACADIS Gateway housed at NCAR has been developed to support NSF Arctic data coming from AON and now broadly across PLR/ARC and related archives: The Arctic Data Explorer (ADE) developed at NSIDC is an integral service of ACADIS bringing the rich archive from NSIDC together with catalogs from ACADIS and international partners in Arctic research: and Rosetta and the Digital Object Identifier (DOI) generation scheme are tools available to the community to help publish and utilize datasets in integration and synthesis and publication.
Science Support: The Building Blocks of Active Data Curation
NASA Astrophysics Data System (ADS)
Guillory, A.
2013-12-01
While the scientific method is built on reproducibility and transparency, and results are published in peer reviewed literature, we have come to the digital age of very large datasets (now of the order of petabytes and soon exabytes) which cannot be published in the traditional way. To preserve reproducibility and transparency, active curation is necessary to keep and protect the information in the long term, and 'science support' activities provide the building blocks for active data curation. With the explosive growth of data in all fields in recent years, there is a pressing urge for data centres to now provide adequate services to ensure long-term preservation and digital curation of project data outputs, however complex those may be. Science support provides advice and support to science projects on data and information management, from file formats through to general data management awareness. Another purpose of science support is to raise awareness in the science community of data and metadata standards and best practice, engendering a culture where data outputs are seen as valued assets. At the heart of Science support is the Data Management Plan (DMP) which sets out a coherent approach to data issues pertaining to the data generating project. It provides an agreed record of the data management needs and issues within the project. The DMP is agreed upon with project investigators to ensure that a high quality documented data archive is created. It includes conditions of use and deposit to clearly express the ownership, responsibilities and rights associated with the data. Project specific needs are also identified for data processing, visualization tools and data sharing services. As part of the National Centre for Atmospheric Science (NCAS) and National Centre for Earth Observation (NCEO), the Centre for Environmental Data Archival (CEDA) fulfills this science support role of facilitating atmospheric and Earth observation data generating projects to ensure successful management of the data and accompanying information for reuse and repurpose. Specific examples at CEDA include science support provided to FAAM (Facility for Airborne Atmospheric Measurements) aircraft campaigns and large-scale modelling projects such as UPSCALE, the largest ever PRACE (Partnership for Advanced Computing in Europe) computational project, dependent on CEDA to provide the high-performance storage, transfer capability and data analysis environment on the 'super-data-cluster' JASMIN. The impact of science support on scientific research is conspicuous: better documented datasets with an increasing collection of metadata associated to the archived data, ease of data sharing with the use of standards in formats and metadata and data citation. These establish a high-quality of data management ensuring long-term preservation and enabling re-use by peer scientists which ultimately leads to faster paced progress in science.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karcher, Sandra; Willighagen, Egon L.; Rumble, John
Many groups within the broad field of nanoinformatics are already developing data repositories and analytical tools driven by their individual organizational goals. Integrating these data resources across disciplines and with non-nanotechnology resources can support multiple objectives by enabling the reuse of the same information. Integration can also serve as the impetus for novel scientific discoveries by providing the framework to support deeper data analyses. This article discusses current data integration practices in nanoinformatics and in comparable mature fields, and nanotechnology-specific challenges impacting data integration. Based on results from a nanoinformatics-community-wide survey, recommendations for achieving integration of existing operational nanotechnology resourcesmore » are presented. Nanotechnology-specific data integration challenges, if effectively resolved, can foster the application and validation of nanotechnology within and across disciplines. This paper is one of a series of articles by the Nanomaterial Data Curation Initiative that address data issues such as data curation workflows, data completeness and quality, curator responsibilities, and metadata.« less
NASA Technical Reports Server (NTRS)
Todd, Nancy S.
2016-01-01
The rock and soil samples returned from the Apollo missions from 1969-72 have supported 46 years of research leading to advances in our understanding of the formation and evolution of the inner Solar System. NASA has been engaged in several initiatives that aim to restore, digitize, and make available to the public existing published and unpublished research data for the Apollo samples. One of these initiatives is a collaboration with IEDA (Interdisciplinary Earth Data Alliance) to develop MoonDB, a lunar geochemical database modeled after PetDB (Petrological Database of the Ocean Floor). In support of this initiative, NASA has adopted the use of IGSN (International Geo Sample Number) to generate persistent, unique identifiers for lunar samples that scientists can use when publishing research data. To facilitate the IGSN registration of the original 2,200 samples and over 120,000 subdivided samples, NASA has developed an application that retrieves sample metadata from the Lunar Curation Database and uses the SESAR API to automate the generation of IGSNs and registration of samples into SESAR (System for Earth Sample Registration). This presentation will describe the work done by NASA to map existing sample metadata to the IGSN metadata and integrate the IGSN registration process into the sample curation workflow, the lessons learned from this effort, and how this work can be extended in the future to help deal with the registration of large numbers of samples.
NASA Astrophysics Data System (ADS)
Todd, N. S.
2016-12-01
The rock and soil samples returned from the Apollo missions from 1969-72 have supported 46 years of research leading to advances in our understanding of the formation and evolution of the inner Solar System. NASA has been engaged in several initiatives that aim to restore, digitize, and make available to the public existing published and unpublished research data for the Apollo samples. One of these initiatives is a collaboration with IEDA (Interdisciplinary Earth Data Alliance) to develop MoonDB, a lunar geochemical database modeled after PetDB. In support of this initiative, NASA has adopted the use of IGSN (International Geo Sample Number) to generate persistent, unique identifiers for lunar samples that scientists can use when publishing research data. To facilitate the IGSN registration of the original 2,200 samples and over 120,000 subdivided samples, NASA has developed an application that retrieves sample metadata from the Lunar Curation Database and uses the SESAR API to automate the generation of IGSNs and registration of samples into SESAR (System for Earth Sample Registration). This presentation will describe the work done by NASA to map existing sample metadata to the IGSN metadata and integrate the IGSN registration process into the sample curation workflow, the lessons learned from this effort, and how this work can be extended in the future to help deal with the registration of large numbers of samples.
OntoCheck: verifying ontology naming conventions and metadata completeness in Protégé 4.
Schober, Daniel; Tudose, Ilinca; Svatek, Vojtech; Boeker, Martin
2012-09-21
Although policy providers have outlined minimal metadata guidelines and naming conventions, ontologies of today still display inter- and intra-ontology heterogeneities in class labelling schemes and metadata completeness. This fact is at least partially due to missing or inappropriate tools. Software support can ease this situation and contribute to overall ontology consistency and quality by helping to enforce such conventions. We provide a plugin for the Protégé Ontology editor to allow for easy checks on compliance towards ontology naming conventions and metadata completeness, as well as curation in case of found violations. In a requirement analysis, derived from a prior standardization approach carried out within the OBO Foundry, we investigate the needed capabilities for software tools to check, curate and maintain class naming conventions. A Protégé tab plugin was implemented accordingly using the Protégé 4.1 libraries. The plugin was tested on six different ontologies. Based on these test results, the plugin could be refined, also by the integration of new functionalities. The new Protégé plugin, OntoCheck, allows for ontology tests to be carried out on OWL ontologies. In particular the OntoCheck plugin helps to clean up an ontology with regard to lexical heterogeneity, i.e. enforcing naming conventions and metadata completeness, meeting most of the requirements outlined for such a tool. Found test violations can be corrected to foster consistency in entity naming and meta-annotation within an artefact. Once specified, check constraints like name patterns can be stored and exchanged for later re-use. Here we describe a first version of the software, illustrate its capabilities and use within running ontology development efforts and briefly outline improvements resulting from its application. Further, we discuss OntoChecks capabilities in the context of related tools and highlight potential future expansions. The OntoCheck plugin facilitates labelling error detection and curation, contributing to lexical quality assurance in OWL ontologies. Ultimately, we hope this Protégé extension will ease ontology alignments as well as lexical post-processing of annotated data and hence can increase overall secondary data usage by humans and computers.
Enhancing the Impact of Science Data: Toward Data Discovery and Reuse
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chappell, Alan R.; Weaver, Jesse R.; Purohit, Sumit
The amount of data produced in support of scientific research continues to grow rapidly. Despite the accumulation and demand for scientific data, relatively little data are actually made available for the broader scientific community. We surmise that one root of this problem is the perceived difficulty of electronically publishing scientific data and associated metadata in a way that makes it discoverable. We propose exploiting Semantic Web technologies and best practices to make metadata both discoverable and easy to publish. We share experiences in curating metadata to illustrate the cumbersome nature of data reuse in the current research environment. We alsomore » make recommendations with a real-world example of how data publishers can provide their metadata by adding limited additional markup to HTML pages on the Web. With little additional effort from data publishers, the difficulty of data discovery, access, and sharing can be greatly reduced and the impact of research data greatly enhanced.« less
Willoughby, Cerys; Bird, Colin L; Coles, Simon J; Frey, Jeremy G
2014-12-22
The drive toward more transparency in research, the growing willingness to make data openly available, and the reuse of data to maximize the return on research investment all increase the importance of being able to find information and make links to the underlying data. The use of metadata in Electronic Laboratory Notebooks (ELNs) to curate experiment data is an essential ingredient for facilitating discovery. The University of Southampton has developed a Web browser-based ELN that enables users to add their own metadata to notebook entries. A survey of these notebooks was completed to assess user behavior and patterns of metadata usage within ELNs, while user perceptions and expectations were gathered through interviews and user-testing activities within the community. The findings indicate that while some groups are comfortable with metadata and are able to design a metadata structure that works effectively, many users are making little attempts to use it, thereby endangering their ability to recover data in the future. A survey of patterns of metadata use in these notebooks, together with feedback from the user community, indicated that while a few groups are comfortable with metadata and are able to design a metadata structure that works effectively, many users adopt a "minimum required" approach to metadata. To investigate whether the patterns of metadata use in LabTrove were unusual, a series of surveys were undertaken to investigate metadata usage in a variety of platforms supporting user-defined metadata. These surveys also provided the opportunity to investigate whether interface designs in these other environments might inform strategies for encouraging metadata creation and more effective use of metadata in LabTrove.
Evolution in Metadata Quality: Common Metadata Repository's Role in NASA Curation Efforts
NASA Technical Reports Server (NTRS)
Gilman, Jason; Shum, Dana; Baynes, Katie
2016-01-01
Metadata Quality is one of the chief drivers of discovery and use of NASA EOSDIS (Earth Observing System Data and Information System) data. Issues with metadata such as lack of completeness, inconsistency, and use of legacy terms directly hinder data use. As the central metadata repository for NASA Earth Science data, the Common Metadata Repository (CMR) has a responsibility to its users to ensure the quality of CMR search results. This poster covers how we use humanizers, a technique for dealing with the symptoms of metadata issues, as well as our plans for future metadata validation enhancements. The CMR currently indexes 35K collections and 300M granules.
MetaRNA-Seq: An Interactive Tool to Browse and Annotate Metadata from RNA-Seq Studies.
Kumar, Pankaj; Halama, Anna; Hayat, Shahina; Billing, Anja M; Gupta, Manish; Yousri, Noha A; Smith, Gregory M; Suhre, Karsten
2015-01-01
The number of RNA-Seq studies has grown in recent years. The design of RNA-Seq studies varies from very simple (e.g., two-condition case-control) to very complicated (e.g., time series involving multiple samples at each time point with separate drug treatments). Most of these publically available RNA-Seq studies are deposited in NCBI databases, but their metadata are scattered throughout four different databases: Sequence Read Archive (SRA), Biosample, Bioprojects, and Gene Expression Omnibus (GEO). Although the NCBI web interface is able to provide all of the metadata information, it often requires significant effort to retrieve study- or project-level information by traversing through multiple hyperlinks and going to another page. Moreover, project- and study-level metadata lack manual or automatic curation by categories, such as disease type, time series, case-control, or replicate type, which are vital to comprehending any RNA-Seq study. Here we describe "MetaRNA-Seq," a new tool for interactively browsing, searching, and annotating RNA-Seq metadata with the capability of semiautomatic curation at the study level.
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland
2017-04-01
Open access to research data is an increasing international request and includes not only data underlying scholarly publication, but also raw and curated data. Especially in the framework of the observed shift in many scientific fields towards data science and data mining, data repositories are becoming important player as data archives and access point to curated research data. While general and institutional data repositories are available across all scientific disciplines, domain-specific data repositories are specialised for scientific disciplines, like, e.g., bio- or geosciences, with the possibility to use more discipline-specific and richer metadata models than general repositories. Data publication is increasingly regarded as important scientific achievement, and datasets with digital object identifier (DOI) are now fully citable in journal articles. Moreover, following in their signature of the "Statement of Commitment of the Coalition on Publishing Data in the Earth and Space Sciences" (COPDESS), many publishers have adopted their data policies and recommend and even request to store and publish data underlying scholarly publications in (domain-specific) data repositories and not as classical supplementary material directly attached to the respective article. The curation of large dynamic data from global networks in, e.g., seismology, magnetics or geodesy, always required a high grade of professional, IT-supported data management, simply to be able to store and access the huge number of files and manage dynamic datasets. In contrast to these, the vast amount of research data acquired by individual investigators or small teams known as 'long-tail data' was often not the focus for the development of data curation infrastructures. Nevertheless, even though they are small in size and highly variable, in total they represent a significant portion of the total scientific outcome. The curation of long-tail data requires more individual approaches and personal involvement of the data curator, especially regarding the data description. Here we will introduce best practices for the publication of long-tail data that are helping to reduce the individual effort, improve the quality of the data description. The data repository of GFZ Data Services, which is hosted at GFZ German Research Centre for Geosciences in Potsdam, is a domain-specific data repository for geosciences. In addition to large dynamic datasets from different disciplines, it has a large focus on the DOI-referenced publication of long-tail data with the aim to reach a high grade of reusability through a comprehensive data description and in the same time provide and distribute standardised, machine actionable metadata for data discovery (FAIR data). The development of templates for data reports, metadata provision by scientists via an XML Metadata Editor and discipline-specific DOI landing pages are helping both, the data curators to handle all kinds of datasets and enabling the scientists, i.e. user, to quickly decide whether a published dataset is fulfilling their needs. In addition, GFZ Data Services have developed DOI-registration services for several international networks (e.g. ICGEM, World Stress Map, IGETS, etc.). In addition, we have developed project-or network-specific designs of the DOI landing pages with the logo or design of the networks or project
Mining dark information resources to develop new informatics capabilities to support science
NASA Astrophysics Data System (ADS)
Ramachandran, Rahul; Maskey, Manil; Bugbee, Kaylin
2016-04-01
Dark information resources are digital resources that organizations collect, process, and store for regular business or operational activities but fail to realize their potential for other purposes. The challenge for any organization is to recognize, identify and effectively exploit these dark information stores. Metadata catalogs at different data centers store dark information resources consisting of structured information, free form descriptions of data and browse images. These information resources are never fully exploited beyond a few fields used for search and discovery. For example, the NASA Earth science catalog holds greater than 6000 data collections, 127 million records for individual files and 67 million browse images. We believe that the information contained in the metadata catalogs and the browse images can be utilized beyond their original design intent to provide new data discovery and exploration pathways to support science and education communities. In this paper we present two research applications using information stored in the metadata catalog in a completely novel way. The first application is designing a data curation service. The objective of the data curation service is to augment the existing data search capabilities. Given a specific atmospheric phenomenon, the data curation service returns the user a ranked list of relevant data sets. Different fields in the metadata records including textual descriptions are mined. A specialized relevancy ranking algorithm has been developed that uses a "bag of words" to define phenomena along with an ensemble of known approaches such as the Jaccard Coefficient, Cosine Similarity and Zone ranking to rank the data sets. This approach is also extended to map from the data set level to data file variable level. The second application is focused on providing a service where a user can search and discover browse images containing specific phenomena from the vast catalog. This service will aid researchers in uncovering interesting event in the data for case study analysis. The challenge of this second application is to bridge the semantic gap between the low level image pixel values and the semantic concept perceived by a user when he or she sees an image. A deep learning algorithm, specifically the Convolution Neural Network (CNN), has been trained and tested to identify three types of Earth science phenomena - Hurricanes, Dust, and Smoke/Haze in MODIS imagery. Latest results from both the applications will be presented in this paper.
OntoCheck: verifying ontology naming conventions and metadata completeness in Protégé 4
2012-01-01
Background Although policy providers have outlined minimal metadata guidelines and naming conventions, ontologies of today still display inter- and intra-ontology heterogeneities in class labelling schemes and metadata completeness. This fact is at least partially due to missing or inappropriate tools. Software support can ease this situation and contribute to overall ontology consistency and quality by helping to enforce such conventions. Objective We provide a plugin for the Protégé Ontology editor to allow for easy checks on compliance towards ontology naming conventions and metadata completeness, as well as curation in case of found violations. Implementation In a requirement analysis, derived from a prior standardization approach carried out within the OBO Foundry, we investigate the needed capabilities for software tools to check, curate and maintain class naming conventions. A Protégé tab plugin was implemented accordingly using the Protégé 4.1 libraries. The plugin was tested on six different ontologies. Based on these test results, the plugin could be refined, also by the integration of new functionalities. Results The new Protégé plugin, OntoCheck, allows for ontology tests to be carried out on OWL ontologies. In particular the OntoCheck plugin helps to clean up an ontology with regard to lexical heterogeneity, i.e. enforcing naming conventions and metadata completeness, meeting most of the requirements outlined for such a tool. Found test violations can be corrected to foster consistency in entity naming and meta-annotation within an artefact. Once specified, check constraints like name patterns can be stored and exchanged for later re-use. Here we describe a first version of the software, illustrate its capabilities and use within running ontology development efforts and briefly outline improvements resulting from its application. Further, we discuss OntoChecks capabilities in the context of related tools and highlight potential future expansions. Conclusions The OntoCheck plugin facilitates labelling error detection and curation, contributing to lexical quality assurance in OWL ontologies. Ultimately, we hope this Protégé extension will ease ontology alignments as well as lexical post-processing of annotated data and hence can increase overall secondary data usage by humans and computers. PMID:23046606
NASA Astrophysics Data System (ADS)
Stroker, K. J.; Jencks, J. H.; Eakins, B.
2016-12-01
The Index to Marine and Lacustrine Geological Samples (IMLGS) is a community designed and maintained resource enabling researchers to locate and request seafloor and lakebed geologic samples curated by partner institutions. The Index was conceived in the dawn of the digital age by representatives from U.S. academic and government marine core repositories and the NOAA National Geophysical Data Center, now the National Centers for Environmental Information (NCEI), at a 1977 meeting convened by the National Science Foundation (NSF). The Index is based on core concepts of community oversight, common vocabularies, consistent metadata and a shared interface. The Curators Consortium, international in scope, meets biennially to share ideas and discuss best practices. NCEI serves the group by providing database access and maintenance, a list server, digitizing support and long-term archival of sample metadata, data and imagery. Over three decades, participating curators have performed the laborious task of creating and contributing metadata for over 205,000 sea floor and lake-bed cores, grabs, and dredges archived in their collections. Some partners use the Index for primary web access to their collections while others use it to increase exposure of more in-depth institutional systems. The IMLGS has a persistent URL/Digital Object Identifier (DOI), as well as DOIs assigned to partner collections for citation and to provide a persistent link to curator collections. The Index is currently a geospatially-enabled relational database, publicly accessible via Web Feature and Web Map Services, and text- and ArcGIS map-based web interfaces. To provide as much knowledge as possible about each sample, the Index includes curatorial contact information and links to related data, information and images : 1) at participating institutions, 2) in the NCEI archive, and 3) through a Linked Data interface maintained by the Rolling Deck to Repository R2R. Over 43,000 International GeoSample Numbers (IGSNs) linking to the System for Earth Sample Registration (SESAR) are included in anticipation of opportunities for interconnectivity with Integrated Earth Data Applications (IEDA) systems. The paper will discuss the database with a goal to increase the connections and links to related data at partner institutions.
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Valentine, D.; Richard, S. M.; Gupta, A.; Meier, O.; Peucker-Ehrenbrink, B.; Hudman, G.; Stocks, K. I.; Hsu, L.; Whitenack, T.; Grethe, J. S.; Ozyurt, I. B.
2017-12-01
EarthCube Data Discovery Hub (DDH) is an EarthCube Building Block project using technologies developed in CINERGI (Community Inventory of EarthCube Resources for Geoscience Interoperability) to enable geoscience users to explore a growing portfolio of EarthCube-created and other geoscience-related resources. Over 1 million metadata records are available for discovery through the project portal (cinergi.sdsc.edu). These records are retrieved from data facilities, including federal, state and academic sources, or contributed by geoscientists through workshops, surveys, or other channels. CINERGI metadata augmentation pipeline components 1) provide semantic enhancement based on a large ontology of geoscience terms, using text analytics to generate keywords with references to ontology classes, 2) add spatial extents based on place names found in the metadata record, and 3) add organization identifiers to the metadata. The records are indexed and can be searched via a web portal and standard search APIs. The added metadata content improves discoverability and interoperability of the registered resources. Specifically, the addition of ontology-anchored keywords enables faceted browsing and lets users navigate to datasets related by variables measured, equipment used, science domain, processes described, geospatial features studied, and other dataset characteristics that are generated by the pipeline. DDH also lets data curators access and edit the automatically generated metadata records using the CINERGI metadata editor, accept or reject the enhanced metadata content, and consider it in updating their metadata descriptions. We consider several complex data discovery workflows, in environmental seismology (quantifying sediment and water fluxes using seismic data), marine biology (determining available temperature, location, weather and bleaching characteristics of coral reefs related to measurements in a given coral reef survey), and river geochemistry (discovering observations relevant to geochemical measurements outside the tidal zone, given specific discharge conditions).
PanMetaDocs - A tool for collecting and managing the long tail of "small science data"
NASA Astrophysics Data System (ADS)
Klump, J.; Ulbricht, D.
2011-12-01
In the early days of thinking about cyberinfrastructure the focus was on "big science data". Today, the challenge is not anymore to store several terabytes of data, but to manage data objects in a way that facilitates their re-use. Key to re-use by a user as a data consumer is proper documentation of the data. Also, data consumers need discovery metadata to find the data they need and they need descriptive metadata to be able to use the data they retrieved. Thus, data documentation faces the challenge to extensively and completely describe these objects, hold the items easily accessible at a sustainable cost level. However, data curation and documentation do not rank high in the everyday work of a scientist as a data producer. Data producers are often frustrated by being asked to provide metadata on their data over and over again, information that seemed very obvious from the context of their work. A challenge to data archives is the wide variety of metadata schemata in use, which creates a number of maintenance and design challenges of its own. PanMetaDocs addresses these issues by allowing an uploaded files to be described by more than one metadata object. PanMetaDocs, which was developed from PanMetaWorks, is a PHP based web application that allow to describe data with any xml-based metadata schema. Its user interface is browser based and was developed to collect metadata and data in collaborative scientific projects situated at one or more institutions. The metadata fields can be filled with static or dynamic content to reduce the number of fields that require manual entries to a minimum and make use of contextual information in a project setting. In the development of PanMetaDocs the business logic of panMetaWorks is reused, except for the authentication and data management functions of PanMetaWorks, which are delegated to the eSciDoc framework. The eSciDoc repository framework is designed as a service oriented architecture that can be controlled through a REST interface to create version controlled items with metadata records in XML format. PanMetaDocs utilizes the eSciDoc items model to add multiple metadata records that describe uploaded files in different metadata schemata. While datasets are collected and described, shared to collaborate with other scientists and finally published, data objects are transferred from a shared data curation domain into a persistent data curation domain. Through an RSS interface for recent datasets PanMetaWorks allows project members to be informed about data uploaded by other project members. The implementation of the OAI-PMH interface can be used to syndicate data catalogs to research data portals, such as the panFMP data portal framework. Once data objects are uploaded to the eSciDoc infrastructure it is possible to drop the software instance that was used for collecting the data, while the compiled data and metadata are accessible for other authorized applications through the institution's eSciDoc middleware. This approach of "expendable data curation tools" allows for a significant reduction in costs for software maintenance as expensive data capture applications do not need to be maintained indefinitely to ensure long term access to the stored data.
SnoVault and encodeD: A novel object-based storage system and applications to ENCODE metadata.
Hitz, Benjamin C; Rowe, Laurence D; Podduturi, Nikhil R; Glick, David I; Baymuradov, Ulugbek K; Malladi, Venkat S; Chan, Esther T; Davidson, Jean M; Gabdank, Idan; Narayana, Aditi K; Onate, Kathrina C; Hilton, Jason; Ho, Marcus C; Lee, Brian T; Miyasato, Stuart R; Dreszer, Timothy R; Sloan, Cricket A; Strattan, J Seth; Tanaka, Forrest Y; Hong, Eurie L; Cherry, J Michael
2017-01-01
The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements initiated shortly after the completion of the Human Genome Project. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata and a robust API for querying the metadata. The software is fully open-source, code and installation instructions can be found at: http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ to store genomic data in the manner of ENCODE. The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data) has been released as a separate Python package.
SnoVault and encodeD: A novel object-based storage system and applications to ENCODE metadata
Podduturi, Nikhil R.; Glick, David I.; Baymuradov, Ulugbek K.; Malladi, Venkat S.; Chan, Esther T.; Davidson, Jean M.; Gabdank, Idan; Narayana, Aditi K.; Onate, Kathrina C.; Hilton, Jason; Ho, Marcus C.; Lee, Brian T.; Miyasato, Stuart R.; Dreszer, Timothy R.; Sloan, Cricket A.; Strattan, J. Seth; Tanaka, Forrest Y.; Hong, Eurie L.; Cherry, J. Michael
2017-01-01
The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements initiated shortly after the completion of the Human Genome Project. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata and a robust API for querying the metadata. The software is fully open-source, code and installation instructions can be found at: http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ to store genomic data in the manner of ENCODE. The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data) has been released as a separate Python package. PMID:28403240
Evolution of data stewardship over two decades at a NASA data center
NASA Astrophysics Data System (ADS)
Armstrong, E. M.; Moroni, D. F.; Hausman, J.; Tsontos, V. M.
2013-12-01
Whether referred to as data science or data engineering, the technical nature and practice of data curation has seen a noticeable shift in the last two decades. The majority of this has been driven by factors of increasing data volumes and complexity, new data structures, and data virtualization through internet access that have themselves spawned new fields or advances in semantic ontologies, metadata, advanced distributed computing and new file formats. As a result of this shifting landscape, the role of the data scientist/engineer has also evolved.. We will discuss the key elements of this evolutionary shift from the perspective of data curation at the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC), which is one of 12 NASA Earth Science data centers responsible for archiving and distributing oceanographic satellite data since 1993. Earlier responsibilities of data curation in the history of the PO.DAAC focused strictly on data archiving, low-level data quality assessments, understanding and building read software for terse binary data or limited applications of self-describing file formats and metadata. Data discovery was often word of mouth or based on perusing simple web pages built for specific products. At that time the PO.DAAC served only a few tens of datasets. A single data engineer focused on a specific mission or suite of datasets from a specific physical parameter (e.g., ocean topography measurements). Since that time the number of datasets in the PO.DAAC has grown to approach one thousand, with increasing complexity of data and metadata structures in self-describing formats. Advances in ontologies, metadata, applications of MapReduce distributed computing and "big data", improvements in data discovery, data mining, and tools for visualization and analysis have all required new and evolving skill sets. The community began requiring more rigorous assessments of data quality and uncertainty. Although the expert knowledge of the physical domain was still critical, especially relevant to assessments of data quality, additional skills in computer science, statistics and system engineering also became necessary. Furthermore, the level of effort to implement data curation has not expanded linearly either. Management of ongoing data operations demands increased productivity on a continual basis and larger volumes of data, with constraints on funding, must be managed with proportionately less human resources. The role of data curation has also changed within the perspective of satellite missions. In many early missions, data management and curation was an afterthought (since there were no explicit data management plans written into the proposals), while current NASA mission proposals must have explicit data management plans to identify resources and funds for archiving, distribution and implementing overall data stewardship. In conclusion, the role of the data scientist/engineer at the PO.DAAC has shifted from supporting singular missions and primarily representing a point of contact for the science community to complete end-to-end stewardship through the implementation of a robust set of dataset lifecycle policies from ingest, to archiving, including data quality assessment for a broad swath of parameter based datasets that can number in the hundreds.
McMahon, Christiana; Denaxas, Spiros
2016-01-01
Metadata are critical in epidemiological and public health research. However, a lack of biomedical metadata quality frameworks and limited awareness of the implications of poor quality metadata renders data analyses problematic. In this study, we created and evaluated a novel framework to assess metadata quality of epidemiological and public health research datasets. We performed a literature review and surveyed stakeholders to enhance our understanding of biomedical metadata quality assessment. The review identified 11 studies and nine quality dimensions; none of which were specifically aimed at biomedical metadata. 96 individuals completed the survey; of those who submitted data, most only assessed metadata quality sometimes, and eight did not at all. Our framework has four sections: a) general information; b) tools and technologies; c) usability; and d) management and curation. We evaluated the framework using three test cases and sought expert feedback. The framework can assess biomedical metadata quality systematically and robustly. PMID:27570670
McMahon, Christiana; Denaxas, Spiros
2016-01-01
Metadata are critical in epidemiological and public health research. However, a lack of biomedical metadata quality frameworks and limited awareness of the implications of poor quality metadata renders data analyses problematic. In this study, we created and evaluated a novel framework to assess metadata quality of epidemiological and public health research datasets. We performed a literature review and surveyed stakeholders to enhance our understanding of biomedical metadata quality assessment. The review identified 11 studies and nine quality dimensions; none of which were specifically aimed at biomedical metadata. 96 individuals completed the survey; of those who submitted data, most only assessed metadata quality sometimes, and eight did not at all. Our framework has four sections: a) general information; b) tools and technologies; c) usability; and d) management and curation. We evaluated the framework using three test cases and sought expert feedback. The framework can assess biomedical metadata quality systematically and robustly.
Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra; Pereira, Emiliano; Schnetzer, Julia; Arvanitidis, Christos; Jensen, Lars Juhl
2016-01-01
The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15-25% and helps curators to detect terms that would otherwise have been missed. Database URL: https://extract.hcmr.gr/. © The Author(s) 2016. Published by Oxford University Press.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra
The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, wellmore » documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Here the comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.« less
Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra; ...
2016-01-01
The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, wellmore » documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Here the comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.« less
NASA Astrophysics Data System (ADS)
Keck, N. N.; Macduff, M.; Martin, T.
2017-12-01
The Atmospheric Radiation Measurement's (ARM) Data Management Facility (DMF) plays a critical support role in processing and curating data generated by the Department of Energy's ARM Program. Data are collected near real time from hundreds of observational instruments spread out all over the globe. Data are then ingested hourly to provide time series data in NetCDF (network Common Data Format) and includes standardized metadata. Based on automated processes and a variety of user reviews the data may need to be reprocessed. Final data sets are then stored and accessed by users through the ARM Archive. Over the course of 20 years, a suite of data visualization tools have been developed to facilitate the operational processes to manage and maintain the more than 18,000 real time events, that move 1.3 TB of data each day through the various stages of the DMF's data system. This poster will present the resources and methodology used to capture metadata and the tools that assist in routine data management and discoverability.
The NOAA OneStop System: From Well-Curated Metadata to Data Discovery
NASA Astrophysics Data System (ADS)
McQuinn, E.; Jakositz, A.; Caldwell, A.; Delk, Z.; Neufeld, D.; Shapiro, J.; Partee, R.; Milan, A.
2017-12-01
The NOAA OneStop project is a pathfinder in the realm of enabling users to search for, discover, and access NOAA data. As the project continues along its path to maturity, it has become evident that three areas are of utmost importance to its success in the Earth science community: ensuring quality metadata, building a robust and scalable backend architecture, and keeping the user interface simple to use. Why is this the case? Because, simply put, we are dealing with all aspects of a Big Data problem: large volumes of disparate data needing to be quickly and easily processed and retrieved. In this presentation we discuss the three key aspects of OneStop architecture and how development in each area must be done through cross-team collaboration in order to succeed. We cover aspects of the web-based user interface and OneStop API and how metadata curators and software engineers have worked together to continually iterate on an ever-improving data discovery tool meant to be used by a variety of users searching across a broad assortment of data types.
Metazen – metadata capture for metagenomes
2014-01-01
Background As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. Unfortunately, these tools are not specifically designed for metagenomic surveys; in particular, they lack the appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools. Results Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata. Conclusions Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility. PMID:25780508
Metazen - metadata capture for metagenomes.
Bischof, Jared; Harrison, Travis; Paczian, Tobias; Glass, Elizabeth; Wilke, Andreas; Meyer, Folker
2014-01-01
As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. Unfortunately, these tools are not specifically designed for metagenomic surveys; in particular, they lack the appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools. Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata. Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility.
NASA Astrophysics Data System (ADS)
Stevens, T.
2016-12-01
NASA's Global Change Master Directory (GCMD) curates a hierarchical set of controlled vocabularies (keywords) covering Earth sciences and associated information (data centers, projects, platforms, and instruments). The purpose of the keywords is to describe Earth science data and services in a consistent and comprehensive manner, allowing for precise metadata search and subsequent retrieval of data and services. The keywords are accessible in a standardized SKOS/RDF/OWL representation and are used as an authoritative taxonomy, as a source for developing ontologies, and to search and access Earth Science data within online metadata catalogs. The keyword curation approach involves: (1) receiving community suggestions; (2) triaging community suggestions; (3) evaluating keywords against a set of criteria coordinated by the NASA Earth Science Data and Information System (ESDIS) Standards Office; (4) implementing the keywords; and (5) publication/notification of keyword changes. This approach emphasizes community input, which helps ensure a high quality, normalized, and relevant keyword structure that will evolve with users' changing needs. The Keyword Community Forum, which promotes a responsive, open, and transparent process, is an area where users can discuss keyword topics and make suggestions for new keywords. Others could potentially use this formalized approach as a model for keyword curation.
NASA Astrophysics Data System (ADS)
Peng, G.; Austin, M.
2017-12-01
Identification and prioritization of targeted user community needs are not always considered until after data has been created and archived. Gaps in data curation and documentation in the data production and delivery phases limit data's broad utility specifically for decision makers. Expert understanding and knowledge of a particular dataset is often required as a part of the data and metadata curation process to establish the credibility of the data and support informed decision-making. To enhance curation practices, content from NOAA's Observing System Integrated Assessment (NOSIA) Value Tree, NOAA's Data Catalog/Digital Object Identifier (DOI) projects (collection-level metadata) have been integrated with Data/Stewardship Maturity Matrices (data and stewardship quality information) focused on assessment of user community needs. This results in user focused evidence based decision making tools created by NOAA's National Environmental Satellite, Data, and Information Service (NESDIS) through identification and assessment of data content gaps related to scientific knowledge and application to key areas of societal benefit. Through enabling user need feedback from the beginning of data creation through archive allows users to determine the quality and value of data that is fit for purpose. Data gap assessment and prioritization are presented in a user-friendly way using the data stewardship maturity matrices as measurement of data management quality. These decision maker tools encourages data producers and data providers/stewards to consider users' needs prior to data creation and dissemination resulting in user driven data requirements increasing return on investment. A use case focused on need for NOAA observations linked societal benefit will be used to demonstrate the value of these tools.
GES DISC Datalist Improves Earth Science Data Discoverbility
NASA Astrophysics Data System (ADS)
Li, A.; Teng, W. L.; Hegde, M.; Petrenko, M.; Shen, S.; Shie, C. L.; Liu, Z.; Hearty, T.; Bryant, K.; Vollmer, B.; Meyer, D. J.
2017-12-01
At American Geophysical Union(AGU) 2016 Fall Meeting, Goddard Earth Sciences Data Information Services Center (GES DISC) unveiled a novel way to access data: Datalist. Currently, datalist is a collection of predefined data variables from one or more archived datasets, curated by our subject matter expert (SME). Our science support team has curated a predefined Hurricane Datalist and received very positive feedback from the user community. Datalist uses the same architecture our new website uses and have the same look and feel as other datasets on our web site. and also provides a one-stop shopping for data, metadata, citation, documentation, visualization and other available services. Since the last AGU Meeting, we have further developed a few new datalists corresponding to the Big Earth Data Initiative (BEDI) Societal Benefit Areas and A-Train data. We now have four datalists: Hurricane, Wind Energy, Greenhouse Gas and A-Train. We have also started working with our User Working Group members to create their favorite datalists and working with other DAAC to explore the possibility to include their products in our datalists that may also lead to a future of potential federated (cross-DAAC) datalists. Since our datalist prototype effort was a success, we are planning to make datalist operational. It's extremely important to have a common metadata model to support datalist, this will also be the foundation of federated datalist. We mapped our datalist metadata model to the unpublished UMM(Universal Metadata Model)-Var (Variable) (June version) and found that the UMM-var together with UMM-C (Collection) and possible UMM-S (Service) will meet our basic requirements. For example: Dataset shortname, and version are already specified in UMM-C, variable name, long name, units, dimensions are all specified in UMM-Var. UMM-Var also facilitates ScienceKeywords to allow tagging at variable level and Characteristics for optional variable characteristics. Measurements is useful for grouping of the variables and Set is promising to define datalist. And finally, the UMM-Service model to specify the available services for the variable will be very beneficial. In summary, UMM-Var, UMM-C and UMM-S are the basis of federated datalist and the development and deployment of datalist will contribute to the evolution of the UMM.
GES DISC Datalist Improves Earth Science Data Discoverability
NASA Technical Reports Server (NTRS)
Li, A.; Teng, W.; Hegde, M.; Petrenko, M.; Shen, S.; Shie, C.; Liu, Z.; Hearty, T.; Bryant, K.; Vollmer, B.;
2017-01-01
At American Geophysical Union(AGU) 2016 Fall Meeting, Goddard Earth Sciences Data Information Services Center (GES DISC) unveiled a novel way to access data: Datalist. Currently, datalist is a collection of predefined data variables from one or more archived datasets, curated by our subject matter expert (SME). Our science support team has curated a predefined Hurricane Datalist and received very positive feedback from the user community. Datalist uses the same architecture our new website uses and have the same look and feel as other datasets on our web site. and also provides a one-stop shopping for data, metadata, citation, documentation, visualization and other available services. Since the last AGU Meeting, we have further developed a few new datalists corresponding to the Big Earth Data Initiative (BEDI) Societal Benefit Areas and A-Train data. We now have four datalists: Hurricane, Wind Energy, Greenhouse Gas and A-Train. We have also started working with our User Working Group members to create their favorite datalists and working with other DAAC to explore the possibility to include their products in our datalists that may also lead to a future of potential federated (cross-DAAC) datalists. Since our datalist prototype effort was a success, we are planning to make datalist operational. It's extremely important to have a common metadata model to support datalist, this will also be the foundation of federated datalist. We mapped our datalist metadata model to the unpublished UMM(Universal Metadata Model)-Var (Variable) (June version) and found that the UMM-var together with UMM-C (Collection) and possible UMM-S (Service) will meet our basic requirements. For example: Dataset shortname, and version are already specified in UMM-C, variable name, long name, units, dimensions are all specified in UMM-Var. UMM-Var also facilitates Science Keywords to allow tagging at variable level and Characteristics for optional variable characteristics. Measurements is useful for grouping of the variables and Set is promising to define datalist. And finally, the UMM-Service model to specify the available services for the variable will be very beneficial. In summary, UMM-Var, UMM-C and UMM-S are the basis of federated datalist and the development and deployment of datalist will contribute to the evolution of the UMM.
Service architecture challenges in building the KNMI Data Centre
NASA Astrophysics Data System (ADS)
Som de Cerff, Wim; van de Vegte, John; Plieger, Maarten; de Vreede, Ernst; Sluiter, Raymond; Willem Noteboom, Jan; van der Neut, Ian; Verhoef, Hans; van Versendaal, Robert; van Binnendijk, Martin; Kalle, Henk; Knopper, Arthur; Calis, Gijs; Ha, Siu Siu; van Moosel, WIm; Klein Ikkink, Henk-Jan; Tosun, Tuncay
2013-04-01
One of the objectives of KNMI is to act as a National Data centre for weather, climate and seismological data. KNMI has experience in curation of data for many years however important scientific data is not well accessible. New technologies also are available to improve the current infrastructure. Therefore a data curation program is initiated with two main goals: setup a Satellite Data Platform (SDP) and a KNMI data centre (KDC). KDC will provide, besides curation, data access, and storage and retrieval portal for KNMI data. In 2010 first requirements were gathered, in 2011 the main architecture was sketched, KDC was implemented in 2012 and is available on: http://data.knmi.nl KDC is built with the data providers involved with as key challenge: 'adding a dataset should be as simple as creating an HTML page'. This is enabled by a three step process, in which the data provider is responsible for two steps: 1. Provide dataset metadata: An easy to use web interface for providing metadata, with automated validation. Metadata consists of an ISO 19115 profile (matching INSPIRE and WMO requirements) and additional technical metadata regarding the data structure and access rights to the data. The interface hides certain metadata fields, which are filed by KDC automatically. 2. Provide data: after metadata has been entered, an upload location for uploading the dataset is provided. Also scripts for pushing large datasets are available. 3. Process and publish: once files are uploaded, they are processed for metadata (e.g., geolocation, time, version) and made available in KDC. The data is put into archive and made available using the in-house developed Virtual File System, which provides a persistent virtual path to the data. For the end-user of the data, KDC provides a web interface with search filters on key words, geolocation and time. Data can be downloaded using HTTP or FTP and can be scripted. Users can register to gain access to restricted datasets. The architecture combines Open Source software components (e.g. Geonetwork, Magnolia, MongoDB, MySQL) with in-house built software (ADAGUC, NADC) and newly developed software. Challenges faced and solved are: How to deal with the different file formats used at KNMI? (e.g. NetCDF, GRIB, BUFR, ASCII); How to deal with the different metadata profiles while hiding the complexity of this to the user? How to incorporate the existing archives? KDC is a node in several networks (WMO WIS, INSPIRE, Open Data): how to do this? In the presentation/poster we will describe what has been done for each of these challenges and how it is implemented in KDC.
The NCAR Digital Asset Services Hub (DASH): Implementing Unified Data Discovery and Access
NASA Astrophysics Data System (ADS)
Stott, D.; Worley, S. J.; Hou, C. Y.; Nienhouse, E.
2017-12-01
The National Center for Atmospheric Research (NCAR) Directorate created the Data Stewardship Engineering Team (DSET) to plan and implement an integrated single entry point for uniform digital asset discovery and access across the organization in order to improve the efficiency of access, reduce the costs, and establish the foundation for interoperability with other federated systems. This effort supports new policies included in federal funding mandates, NSF data management requirements, and journal citation recommendations. An inventory during the early planning stage identified diverse asset types across the organization that included publications, datasets, metadata, models, images, and software tools and code. The NCAR Digital Asset Services Hub (DASH) is being developed and phased in this year to improve the quality of users' experiences in finding and using these assets. DASH serves to provide engagement, training, search, and support through the following four nodes (see figure). DASH MetadataDASH provides resources for creating and cataloging metadata to the NCAR Dialect, a subset of ISO 19115. NMDEdit, an editor based on a European open source application, has been configured for manual entry of NCAR metadata. CKAN, an open source data portal platform, harvests these XML records (along with records output directly from databases) from a Web Accessible Folder (WAF) on GitHub for validation. DASH SearchThe NCAR Dialect metadata drives cross-organization search and discovery through CKAN, which provides the display interface of search results. DASH search will establish interoperability by facilitating metadata sharing with other federated systems. DASH ConsultingThe DASH Data Curation & Stewardship Coordinator assists with Data Management (DM) Plan preparation and advises on Digital Object Identifiers. The coordinator arranges training sessions on the DASH metadata tools and DM planning, and provides one-on-one assistance as requested. DASH RepositoryA repository is under development for NCAR datasets currently not in existing lab-managed archives. The DASH repository will be under NCAR governance and meet Trustworthy Repositories Audit & Certification (TRAC) requirements. This poster will highlight the processes, lessons learned, and current status of the DASH effort at NCAR.
Predicting structured metadata from unstructured metadata.
Posch, Lisa; Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
2016-01-01
Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is accurate, structured and complete description of the data or data about the data-defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata for improving quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation model (LDA) to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms can be the most accurately predicted using the TF-IDF approach followed by LDA both outperforming the majority vote baseline. While some accuracy is lost by the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority classifier baseline. Overall this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction. © The Author(s) 2016. Published by Oxford University Press.
Metazen – metadata capture for metagenomes
Bischof, Jared; Harrison, Travis; Paczian, Tobias; ...
2014-12-08
Background: As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. These tools are not specifically designed for metagenomic surveys; in particular, they lack themore » appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools. Results: Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata. Conclusion: Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility.« less
Metazen – metadata capture for metagenomes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bischof, Jared; Harrison, Travis; Paczian, Tobias
Background: As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. These tools are not specifically designed for metagenomic surveys; in particular, they lack themore » appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools. Results: Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata. Conclusion: Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility.« less
Predicting structured metadata from unstructured metadata
Posch, Lisa; Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
2016-01-01
Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is accurate, structured and complete description of the data or data about the data—defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata for improving quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation model (LDA) to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms can be the most accurately predicted using the TF-IDF approach followed by LDA both outperforming the majority vote baseline. While some accuracy is lost by the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority classifier baseline. Overall this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction. Database URL: http://www.yeastgenome.org/ PMID:28637268
NASA Technical Reports Server (NTRS)
Evans, Cindy; Todd, Nancy
2014-01-01
The Astromaterials Acquisition & Curation Office at NASA's Johnson Space Center (JSC) is the designated facility for curating all of NASA's extraterrestrial samples. Today, the suite of collections includes the lunar samples from the Apollo missions, cosmic dust particles falling into the Earth's atmosphere, meteorites collected in Antarctica, comet and interstellar dust particles from the Stardust mission, asteroid particles from Japan's Hayabusa mission, solar wind atoms collected during the Genesis mission, and space-exposed hardware from several missions. To support planetary science research on these samples, JSC's Astromaterials Curation Office hosts NASA's Astromaterials Curation digital repository and data access portal [http://curator.jsc.nasa.gov/], providing descriptions of the missions and collections, and critical information about each individual sample. Our office is designing and implementing several informatics initiatives to better serve the planetary research community. First, we are re-hosting the basic database framework by consolidating legacy databases for individual collections and providing a uniform access point for information (descriptions, imagery, classification) on all of our samples. Second, we continue to upgrade and host digital compendia that summarize and highlight published findings on the samples (e.g., lunar samples, meteorites from Mars). We host high resolution imagery of samples as it becomes available, including newly scanned images of historical prints from the Apollo missions. Finally we are creating plans to collect and provide new data, including 3D imagery, point cloud data, micro CT data, and external links to other data sets on selected samples. Together, these individual efforts will provide unprecedented digital access to NASA's Astromaterials, enabling preservation of the samples through more specific and targeted requests, and supporting new planetary science research and collaborations on the samples.
The French initiative for scientific cores virtual curating : a user-oriented integrated approach
NASA Astrophysics Data System (ADS)
Pignol, Cécile; Godinho, Elodie; Galabertier, Bruno; Caillo, Arnaud; Bernardet, Karim; Augustin, Laurent; Crouzet, Christian; Billy, Isabelle; Teste, Gregory; Moreno, Eva; Tosello, Vanessa; Crosta, Xavier; Chappellaz, Jérome; Calzas, Michel; Rousseau, Denis-Didier; Arnaud, Fabien
2016-04-01
Managing scientific data is probably one the most challenging issue in modern science. The question is made even more sensitive with the need of preserving and managing high value fragile geological sam-ples: cores. Large international scientific programs, such as IODP or ICDP are leading an intense effort to solve this problem and propose detailed high standard work- and dataflows thorough core handling and curating. However most results derived from rather small-scale research programs in which data and sample management is generally managed only locally - when it is … The national excellence equipment program (Equipex) CLIMCOR aims at developing French facilities for coring and drilling investigations. It concerns indiscriminately ice, marine and continental samples. As part of this initiative, we initiated a reflexion about core curating and associated coring-data management. The aim of the project is to conserve all metadata from fieldwork in an integrated cyber-environment which will evolve toward laboratory-acquired data storage in a near future. In that aim, our demarche was conducted through an close relationship with field operators as well laboratory core curators in order to propose user-oriented solutions. The national core curating initiative currently proposes a single web portal in which all scientifics teams can store their field data. For legacy samples, this will requires the establishment of a dedicated core lists with associated metadata. For forthcoming samples, we propose a mobile application, under Android environment to capture technical and scientific metadata on the field. This application is linked with a unique coring tools library and is adapted to most coring devices (gravity, drilling, percussion, etc...) including multiple sections and holes coring operations. Those field data can be uploaded automatically to the national portal, but also referenced through international standards or persistent identifiers (IGSN, ORCID and INSPIRE) and displayed in international portals (currently, NOAA's IMLGS). In this paper, we present the architecture of the integrated system, future perspectives and the approach we adopted to reach our goals. We will also present in front of our poster, one of the three mobile applications, dedicated more particularly to the operations of continental drillings.
The art and science of data curation: Lessons learned from constructing a virtual collection
NASA Astrophysics Data System (ADS)
Bugbee, Kaylin; Ramachandran, Rahul; Maskey, Manil; Gatlin, Patrick
2018-03-01
A digital, or virtual, collection is a value added service developed by libraries that curates information and resources around a topic, theme or organization. Adoption of the virtual collection concept as an Earth science data service improves the discoverability, accessibility and usability of data both within individual data centers but also across data centers and disciplines. In this paper, we introduce a methodology for systematically and rigorously curating Earth science data and information into a cohesive virtual collection. This methodology builds on the geocuration model of searching, selecting and synthesizing Earth science data, metadata and other information into a single and useful collection. We present our experiences curating a virtual collection for one of NASA's twelve Distributed Active Archive Centers (DAACs), the Global Hydrology Resource Center (GHRC), and describe lessons learned as a result of this curation effort. We also provide recommendations and best practices for data centers and data providers who wish to curate virtual collections for the Earth sciences.
ASDC Collaborations and Processes to Ensure Quality Metadata and Consistent Data Availability
NASA Astrophysics Data System (ADS)
Trapasso, T. J.
2017-12-01
With the introduction of new tools, faster computing, and less expensive storage, increased volumes of data are expected to be managed with existing or fewer resources. Metadata management is becoming a heightened challenge from the increase in data volume, resulting in more metadata records needed to be curated for each product. To address metadata availability and completeness, NASA ESDIS has taken significant strides with the creation of the United Metadata Model (UMM) and Common Metadata Repository (CMR). These UMM helps address hurdles experienced by the increasing number of metadata dialects and the CMR provides a primary repository for metadata so that required metadata fields can be served through a growing number of tools and services. However, metadata quality remains an issue as metadata is not always inherent to the end-user. In response to these challenges, the NASA Atmospheric Science Data Center (ASDC) created the Collaboratory for quAlity Metadata Preservation (CAMP) and defined the Product Lifecycle Process (PLP) to work congruently. CAMP is unique in that it provides science team members a UI to directly supply metadata that is complete, compliant, and accurate for their data products. This replaces back-and-forth communication that often results in misinterpreted metadata. Upon review by ASDC staff, metadata is submitted to CMR for broader distribution through Earthdata. Further, approval of science team metadata in CAMP automatically triggers the ASDC PLP workflow to ensure appropriate services are applied throughout the product lifecycle. This presentation will review the design elements of CAMP and PLP as well as demonstrate interfaces to each. It will show the benefits that CAMP and PLP provide to the ASDC that could potentially benefit additional NASA Earth Science Data and Information System (ESDIS) Distributed Active Archive Centers (DAACs).
Collaborative Sharing of Multidimensional Space-time Data Using HydroShare
NASA Astrophysics Data System (ADS)
Gan, T.; Tarboton, D. G.; Horsburgh, J. S.; Dash, P. K.; Idaszak, R.; Yi, H.; Blanton, B.
2015-12-01
HydroShare is a collaborative environment being developed for sharing hydrological data and models. It includes capability to upload data in many formats as resources that can be shared. The HydroShare data model for resources uses a specific format for the representation of each type of data and specifies metadata common to all resource types as well as metadata unique to specific resource types. The Network Common Data Form (NetCDF) was chosen as the format for multidimensional space-time data in HydroShare. NetCDF is widely used in hydrological and other geoscience modeling because it contains self-describing metadata and supports the creation of array-oriented datasets that may include three spatial dimensions, a time dimension and other user defined dimensions. For example, NetCDF may be used to represent precipitation or surface air temperature fields that have two dimensions in space and one dimension in time. This presentation will illustrate how NetCDF files are used in HydroShare. When a NetCDF file is loaded into HydroShare, header information is extracted using the "ncdump" utility. Python functions developed for the Django web framework on which HydroShare is based, extract science metadata present in the NetCDF file, saving the user from having to enter it. Where the file follows Climate Forecast (CF) convention and Attribute Convention for Dataset Discovery (ACDD) standards, metadata is thus automatically populated. Users also have the ability to add metadata to the resource that may not have been present in the original NetCDF file. HydroShare's metadata editing functionality then writes this science metadata back into the NetCDF file to maintain consistency between the science metadata in HydroShare and the metadata in the NetCDF file. This further helps researchers easily add metadata information following the CF and ACDD conventions. Additional data inspection and subsetting functions were developed, taking advantage of Python and command line libraries for working with NetCDF files. We describe the design and implementation of these features and illustrate how NetCDF files from a modeling application may be curated in HydroShare and thus enhance reproducibility of the associated research. We also discuss future development planned for multidimensional space-time data in HydroShare.
González-Beltrán, Alejandra; Neumann, Steffen; Maguire, Eamonn; Sansone, Susanna-Assunta; Rocca-Serra, Philippe
2014-01-01
The ISA-Tab format and software suite have been developed to break the silo effect induced by technology-specific formats for a variety of data types and to better support experimental metadata tracking. Experimentalists seldom use a single technique to monitor biological signals. Providing a multi-purpose, pragmatic and accessible format that abstracts away common constructs for describing Investigations, Studies and Assays, ISA is increasingly popular. To attract further interest towards the format and extend support to ensure reproducible research and reusable data, we present the Risa package, which delivers a central component to support the ISA format by enabling effortless integration with R, the popular, open source data crunching environment. The Risa package bridges the gap between the metadata collection and curation in an ISA-compliant way and the data analysis using the widely used statistical computing environment R. The package offers functionality for: i) parsing ISA-Tab datasets into R objects, ii) augmenting annotation with extra metadata not explicitly stated in the ISA syntax; iii) interfacing with domain specific R packages iv) suggesting potentially useful R packages available in Bioconductor for subsequent processing of the experimental data described in the ISA format; and finally v) saving back to ISA-Tab files augmented with analysis specific metadata from R. We demonstrate these features by presenting use cases for mass spectrometry data and DNA microarray data. The Risa package is open source (with LGPL license) and freely available through Bioconductor. By making Risa available, we aim to facilitate the task of processing experimental data, encouraging a uniform representation of experimental information and results while delivering tools for ensuring traceability and provenance tracking. The Risa package is available since Bioconductor 2.11 (version 1.0.0) and version 1.2.1 appeared in Bioconductor 2.12, both along with documentation and examples. The latest version of the code is at the development branch in Bioconductor and can also be accessed from GitHub https://github.com/ISA-tools/Risa, where the issue tracker allows users to report bugs or feature requests.
The Risa R/Bioconductor package: integrative data analysis from experimental metadata and back again
2014-01-01
Background The ISA-Tab format and software suite have been developed to break the silo effect induced by technology-specific formats for a variety of data types and to better support experimental metadata tracking. Experimentalists seldom use a single technique to monitor biological signals. Providing a multi-purpose, pragmatic and accessible format that abstracts away common constructs for describing Investigations, Studies and Assays, ISA is increasingly popular. To attract further interest towards the format and extend support to ensure reproducible research and reusable data, we present the Risa package, which delivers a central component to support the ISA format by enabling effortless integration with R, the popular, open source data crunching environment. Results The Risa package bridges the gap between the metadata collection and curation in an ISA-compliant way and the data analysis using the widely used statistical computing environment R. The package offers functionality for: i) parsing ISA-Tab datasets into R objects, ii) augmenting annotation with extra metadata not explicitly stated in the ISA syntax; iii) interfacing with domain specific R packages iv) suggesting potentially useful R packages available in Bioconductor for subsequent processing of the experimental data described in the ISA format; and finally v) saving back to ISA-Tab files augmented with analysis specific metadata from R. We demonstrate these features by presenting use cases for mass spectrometry data and DNA microarray data. Conclusions The Risa package is open source (with LGPL license) and freely available through Bioconductor. By making Risa available, we aim to facilitate the task of processing experimental data, encouraging a uniform representation of experimental information and results while delivering tools for ensuring traceability and provenance tracking. Software availability The Risa package is available since Bioconductor 2.11 (version 1.0.0) and version 1.2.1 appeared in Bioconductor 2.12, both along with documentation and examples. The latest version of the code is at the development branch in Bioconductor and can also be accessed from GitHub https://github.com/ISA-tools/Risa, where the issue tracker allows users to report bugs or feature requests. PMID:24564732
The CompTox Chemistry Dashboard provides access to data for ~760,000 chemicals. High quality curated data and rich metadata facilitates mass spec analysis. “MS-Ready” processed data enables structure identification.
McQuilton, Peter; Gonzalez-Beltran, Alejandra; Rocca-Serra, Philippe; Thurston, Milo; Lister, Allyson; Maguire, Eamonn; Sansone, Susanna-Assunta
2016-01-01
BioSharing (http://www.biosharing.org) is a manually curated, searchable portal of three linked registries. These resources cover standards (terminologies, formats and models, and reporting guidelines), databases, and data policies in the life sciences, broadly encompassing the biological, environmental and biomedical sciences. Launched in 2011 and built by the same core team as the successful MIBBI portal, BioSharing harnesses community curation to collate and cross-reference resources across the life sciences from around the world. BioSharing makes these resources findable and accessible (the core of the FAIR principle). Every record is designed to be interlinked, providing a detailed description not only on the resource itself, but also on its relations with other life science infrastructures. Serving a variety of stakeholders, BioSharing cultivates a growing community, to which it offers diverse benefits. It is a resource for funding bodies and journal publishers to navigate the metadata landscape of the biological sciences; an educational resource for librarians and information advisors; a publicising platform for standard and database developers/curators; and a research tool for bench and computer scientists to plan their work. BioSharing is working with an increasing number of journals and other registries, for example linking standards and databases to training material and tools. Driven by an international Advisory Board, the BioSharing user-base has grown by over 40% (by unique IP address), in the last year thanks to successful engagement with researchers, publishers, librarians, developers and other stakeholders via several routes, including a joint RDA/Force11 working group and a collaboration with the International Society for Biocuration. In this article, we describe BioSharing, with a particular focus on community-led curation.Database URL: https://www.biosharing.org. © The Author(s) 2016. Published by Oxford University Press.
Evolving Metadata in NASA Earth Science Data Systems
NASA Astrophysics Data System (ADS)
Mitchell, A.; Cechini, M. F.; Walter, J.
2011-12-01
NASA's Earth Observing System (EOS) is a coordinated series of satellites for long term global observations. NASA's Earth Observing System Data and Information System (EOSDIS) is a petabyte-scale archive of environmental data that supports global climate change research by providing end-to-end services from EOS instrument data collection to science data processing to full access to EOS and other earth science data. On a daily basis, the EOSDIS ingests, processes, archives and distributes over 3 terabytes of data from NASA's Earth Science missions representing over 3500 data products ranging from various types of science disciplines. EOSDIS is currently comprised of 12 discipline specific data centers that are collocated with centers of science discipline expertise. Metadata is used in all aspects of NASA's Earth Science data lifecycle from the initial measurement gathering to the accessing of data products. Missions use metadata in their science data products when describing information such as the instrument/sensor, operational plan, and geographically region. Acting as the curator of the data products, data centers employ metadata for preservation, access and manipulation of data. EOSDIS provides a centralized metadata repository called the Earth Observing System (EOS) ClearingHouse (ECHO) for data discovery and access via a service-oriented-architecture (SOA) between data centers and science data users. ECHO receives inventory metadata from data centers who generate metadata files that complies with the ECHO Metadata Model. NASA's Earth Science Data and Information System (ESDIS) Project established a Tiger Team to study and make recommendations regarding the adoption of the international metadata standard ISO 19115 in EOSDIS. The result was a technical report recommending an evolution of NASA data systems towards a consistent application of ISO 19115 and related standards including the creation of a NASA-specific convention for core ISO 19115 elements. Part of NASA's effort to continually evolve its data systems led ECHO to enhancing the method in which it receives inventory metadata from the data centers to allow for multiple metadata formats including ISO 19115. ECHO's metadata model will also be mapped to the NASA-specific convention for ingesting science metadata into the ECHO system. As NASA's new Earth Science missions and data centers are migrating to the ISO 19115 standards, EOSDIS is developing metadata management resources to assist in the reading, writing and parsing ISO 19115 compliant metadata. To foster interoperability with other agencies and international partners, NASA is working to ensure that a common ISO 19115 convention is developed, enhancing data sharing capabilities and other data analysis initiatives. NASA is also investigating the use of ISO 19115 standards to encode data quality, lineage and provenance with stored values. A common metadata standard across NASA's Earth Science data systems promotes interoperability, enhances data utilization and removes levels of uncertainty found in data products.
NASA Astrophysics Data System (ADS)
Lawrence, B.; Bennett, V.; Callaghan, S.; Juckes, M. N.; Pepler, S.
2013-12-01
The UK Centre for Environmental Data Archival (CEDA) hosts a number of formal data centres, including the British Atmospheric Data Centre (BADC), and is a partner in a range of national and international data federations, including the InfraStructure for the European Network for Earth system Simulation, the Earth System Grid Federation, and the distributed IPCC Data Distribution Centres. The mission of CEDA is to formally curate data from, and facilitate the doing of, environmental science. The twin aims are symbiotic: data curation helps facilitate science, and facilitating science helps with data curation. Here we cover how CEDA delivers this strategy by established internal processes supplemented by short-term projects, supported by staff with a range of roles. We show how CEDA adds value to data in the curated archive, and how it supports science, and show examples of the aforementioned symbiosis. We begin by discussing curation: CEDA has the formal responsibility for curating the data products of atmospheric science and earth observation research funded by the UK Natural Environment Research Council (NERC). However, curation is not just about the provider community, the consumer communities matter too, and the consumers of these data cross the boundaries of science, including engineers, medics, as well as the gamut of the environmental sciences. There is a small, and growing cohort of non-science users. For both producers and consumers of data, information about data is crucial, and a range of CEDA staff have long worked on tools and techniques for creating, managing, and delivering metadata (as well as data). CEDA "science support" staff work with scientists to help them prepare and document data for curation. As one of a spectrum of activities, CEDA has worked on data Publication as a method of both adding value to some data, and rewarding the effort put into the production of quality datasets. As such, we see this activity as both a curation and a facilitation activity. A range of more focused facilitation activities are carried out, from providing a computing platform suitable for big-data analytics (the Joint Analysis System, JASMIN), to working on distributed data analysis (EXARCH), and the acquisition of third party data to support science and impact (e.g. in the context of the facility for Climate and Environmental Monitoring from Space, CEMS). We conclude by confronting the view of Parsons and Fox (2013) that metaphors such as Data Publication, Big Iron, Science Support etc are limiting, and suggest the CEDA experience is that these sorts of activities can and do co-exist, much as they conclude they should. However, we also believe that within co-existing metaphors, production systems need to be limited in their scope, even if they are on a road to a more joined up infrastructure. We shouldn't confuse what we can do now with what we might want to do in the future.
Data Curation for the Exploitation of Large Earth Observation Products Databases - The MEA system
NASA Astrophysics Data System (ADS)
Mantovani, Simone; Natali, Stefano; Barboni, Damiano; Cavicchi, Mario; Della Vecchia, Andrea
2014-05-01
National Space Agencies under the umbrella of the European Space Agency are performing a strong activity to handle and provide solutions to Big Data and related knowledge (metadata, software tools and services) management and exploitation. The continuously increasing amount of long-term and of historic data in EO facilities in the form of online datasets and archives, the incoming satellite observation platforms that will generate an impressive amount of new data and the new EU approach on the data distribution policy make necessary to address technologies for the long-term management of these data sets, including their consolidation, preservation, distribution, continuation and curation across multiple missions. The management of long EO data time series of continuing or historic missions - with more than 20 years of data available already today - requires technical solutions and technologies which differ considerably from the ones exploited by existing systems. Several tools, both open source and commercial, are already providing technologies to handle data and metadata preparation, access and visualization via OGC standard interfaces. This study aims at describing the Multi-sensor Evolution Analysis (MEA) system and the Data Curation concept as approached and implemented within the ASIM and EarthServer projects, funded by the European Space Agency and the European Commission, respectively.
Expanding Access to NCAR's Digital Assets: Towards a Unified Scientific Data Management System
NASA Astrophysics Data System (ADS)
Stott, D.
2016-12-01
In 2014 the National Center for Atmospheric Research (NCAR) Directorate created the Data Stewardship Engineering Team (DSET) to plan and implement the strategic vision of an integrated front door for data discovery and access across the organization, including all laboratories, the library, and UCAR Community Programs. The DSET is focused on improving the quality of users' experiences in finding and using NCAR's digital assets. This effort also supports new policies included in federal mandates, NSF requirements, and journal publication rules. An initial survey with 97 respondents identified 68 persons responsible for more than 3 petabytes of data. An inventory, using the Data Asset Framework produced by the UK Digital Curation Centre as a starting point, identified asset types that included files and metadata, publications, images, and software (visualization, analysis, model codes). User story sessions with representatives from each lab identified and ranked desired features for a unified Scientific Data Management System (SDMS). A process beginning with an organization-wide assessment of metadata by the HDF Group and followed by meetings with labs to identify key documentation concepts, culminated in the development of an NCAR metadata dialect that leverages the DataCite and ISO 19115 standards. The tasks ahead are to build out an SDMS and populate it with rich standardized metadata. Software packages have been prototyped and currently are being tested and reviewed by DSET members. Key challenges for the DSET include technical and non-technical issues. First, the status quo with regard to how assets are managed varies widely across the organization. There are differences in file format standards, technologies, and discipline-specific vocabularies. Metadata diversity is another real challenge. The types of metadata, the standards used, and the capacity to create new metadata varies across the organization. Significant effort is required to develop tools to create new standard metadata across the organization, adapt and integrate current digital assets, and establish consistent data management practices going forward. To be successful, best practices must be infused into daily activities. This poster will highlight the processes, lessons learned, and current status of the DSET effort at NCAR.
NASA Astrophysics Data System (ADS)
Ferrini, V. L.; Grange, B.; Morton, J. J.; Soule, S. A.; Carbotte, S. M.; Lehnert, K.
2016-12-01
The National Deep Submergence Facility (NDSF) operates the Human Occupied Vehicle (HOV) Alvin, the Remotely Operated Vehicle (ROV) Jason, and the Autonomous Underwater Vehicle (AUV) Sentry. These vehicles are deployed throughout the global oceans to acquire sensor data and physical samples for a variety of interdisciplinary science programs. As part of the EarthCube Integrative Activity Alliance Testbed Project (ATP), new web services were developed to improve access to existing online NDSF data and metadata resources. These services make use of tools and infrastructure developed by the Interdisciplinary Earth Data Alliance (IEDA) and enable programmatic access to metadata and data resources as well as the development of new service-driven user interfaces. The Alvin Frame Grabber and Jason Virtual Van enable the exploration of frame-grabbed images derived from video cameras on NDSF dives. Metadata available for each image includes time and vehicle position, data from environmental sensors, and scientist-generated annotations, and data are organized and accessible by cruise and/or dive. A new FrameGrabber web service and service-driven user interface were deployed to offer integrated access to these data resources through a single API and allows users to search across content curated in both systems. In addition, a new NDSF Dive Metadata web service and service-driven user interface was deployed to provide consolidated access to basic information about each NDSF dive (e.g. vehicle name, dive ID, location, etc), which is important for linking distributed data resources curated in different data systems.
NASA Astrophysics Data System (ADS)
Moore, C.
2011-12-01
The Index to Marine and Lacustrine Geological Samples is a community designed and maintained resource enabling researchers to locate and request sea floor and lakebed geologic samples archived by partner institutions. Conceived in the dawn of the digital age by representatives from U.S. academic and government marine core repositories and the NOAA National Geophysical Data Center (NGDC) at a 1977 meeting convened by the National Science Foundation (NSF), the Index is based on core concepts of community oversight, common vocabularies, consistent metadata and a shared interface. Form and content of underlying vocabularies and metadata continue to evolve according to the needs of the community, as do supporting technologies and access methodologies. The Curators Consortium, now international in scope, meets at partner institutions biennially to share ideas and discuss best practices. NGDC serves the group by providing database access and maintenance, a list server, digitizing support and long-term archival of sample metadata, data and imagery. Over three decades, participating curators have performed the herculean task of creating and contributing metadata for over 195,000 sea floor and lakebed cores, grabs, and dredges archived in their collections. Some partners use the Index for primary web access to their collections while others use it to increase exposure of more in-depth institutional systems. The Index is currently a geospatially-enabled relational database, publicly accessible via Web Feature and Web Map Services, and text- and ArcGIS map-based web interfaces. To provide as much knowledge as possible about each sample, the Index includes curatorial contact information and links to related data, information and images; 1) at participating institutions, 2) in the NGDC archive, and 3) at sites such as the Rolling Deck to Repository (R2R) and the System for Earth Sample Registration (SESAR). Over 34,000 International GeoSample Numbers (IGSNs) linking to SESAR are included in anticipation of opportunities for interconnectivity with Integrated Earth Data Applications (IEDA) systems. To promote interoperability and broaden exposure via the semantic web, NGDC is publishing lithologic classification schemes and terminology used in the Index as Simple Knowledge Organization System (SKOS) vocabularies, coordinating with R2R and the Consortium for Ocean Leadership for consistency. Availability in SKOS form will also facilitate use of the vocabularies in International Standards Organization (ISO) 19115-2 compliant metadata records. NGDC provides stewardship for the Index on behalf of U.S. repositories as the NSF designated "appropriate National Data Center" for data and metadata pertaining to sea floor samples as specified in the 2011 Division of Ocean Sciences Sample and Data Policy, and on behalf of international partners via a collocated World Data Center. NGDC operates on the Open Archival Information System (OAIS) reference model. Active Partners: Antarctic Marine Geology Research Facility, Florida State University; British Ocean Sediment Core Research Facility; Geological Survey of Canada; Integrated Ocean Drilling Program; Lamont-Doherty Earth Observatory; National Lacustrine Core Repository, University of Minnesota; Oregon State University; Scripps Institution of Oceanography; University of Rhode Island; U.S. Geological Survey; Woods Hole Oceanographic Institution.
Climate Data Initiative: A Geocuration Effort to Support Climate Resilience
NASA Technical Reports Server (NTRS)
Ramachandran, Rahul; Bugbee, Kaylin; Tilmes, Curt; Pinheiro Privette, Ana
2015-01-01
Curation is traditionally defined as the process of collecting and organizing information around a common subject matter or a topic of interest and typically occurs in museums, art galleries, and libraries. The task of organizing data around specific topics or themes is a vibrant and growing effort in the biological sciences but to date this effort has not been actively pursued in the Earth sciences. In this paper, we introduce the concept of geocuration and define it as the act of searching, selecting, and synthesizing Earth science data/metadata and information from across disciplines and repositories into a single, cohesive, and useful compendium We present the Climate Data Initiative (CDI) project as an exemplar example. The CDI project is a systematic effort to manually curate and share openly available climate data from various federal agencies. CDI is a broad multi-agency effort of the U.S. government and seeks to leverage the extensive existing federal climate-relevant data to stimulate innovation and private-sector entrepreneurship to support national climate-change preparedness. We describe the geocuration process used in CDI project, lessons learned, and suggestions to improve similar geocuration efforts in the future.
Climate data initiative: A geocuration effort to support climate resilience
NASA Astrophysics Data System (ADS)
Ramachandran, Rahul; Bugbee, Kaylin; Tilmes, Curt; Privette, Ana Pinheiro
2016-03-01
Curation is traditionally defined as the process of collecting and organizing information around a common subject matter or a topic of interest and typically occurs in museums, art galleries, and libraries. The task of organizing data around specific topics or themes is a vibrant and growing effort in the biological sciences but to date this effort has not been actively pursued in the Earth sciences. In this paper, we introduce the concept of geocuration and define it as the act of searching, selecting, and synthesizing Earth science data/metadata and information from across disciplines and repositories into a single, cohesive, and useful collection. We present the Climate Data Initiative (CDI) project as a prototypical example. The CDI project is a systematic effort to manually curate and share openly available climate data from various federal agencies. CDI is a broad multi-agency effort of the U.S. government and seeks to leverage the extensive existing federal climate-relevant data to stimulate innovation and private-sector entrepreneurship to support national climate-change preparedness. We describe the geocuration process used in the CDI project, lessons learned, and suggestions to improve similar geocuration efforts in the future.
SNPMeta: SNP annotation and SNP metadata collection without a reference genome
USDA-ARS?s Scientific Manuscript database
The increase in availability of resequencing data is greatly accelerating SNP discovery and has facilitated the development of SNP genotyping assays. This, in turn, is increasing interest in annotation of individual SNPs. Currently, these data are only available through curation, or comparison to a ...
TCIA: An information resource to enable open science.
Prior, Fred W; Clark, Ken; Commean, Paul; Freymann, John; Jaffe, Carl; Kirby, Justin; Moore, Stephen; Smith, Kirk; Tarbox, Lawrence; Vendt, Bruce; Marquez, Guillermo
2013-01-01
Reusable, publicly available data is a pillar of open science. The Cancer Imaging Archive (TCIA) is an open image archive service supporting cancer research. TCIA collects, de-identifies, curates and manages rich collections of oncology image data. Image data sets have been contributed by 28 institutions and additional image collections are underway. Since June of 2011, more than 2,000 users have registered to search and access data from this freely available resource. TCIA encourages and supports cancer-related open science communities by hosting and managing the image archive, providing project wiki space and searchable metadata repositories. The success of TCIA is measured by the number of active research projects it enables (>40) and the number of scientific publications and presentations that are produced using data from TCIA collections (39).
Long-term Science Data Curation Using a Digital Object Model and Open-Source Frameworks
NASA Astrophysics Data System (ADS)
Pan, J.; Lenhardt, W.; Wilson, B. E.; Palanisamy, G.; Cook, R. B.
2010-12-01
Scientific digital content, including Earth Science observations and model output, has become more heterogeneous in format and more distributed across the Internet. In addition, data and metadata are becoming necessarily linked internally and externally on the Web. As a result, such content has become more difficult for providers to manage and preserve and for users to locate, understand, and consume. Specifically, it is increasingly harder to deliver relevant metadata and data processing lineage information along with the actual content consistently. Readme files, data quality information, production provenance, and other descriptive metadata are often separated in the storage level as well as in the data search and retrieval interfaces available to a user. Critical archival metadata, such as auditing trails and integrity checks, are often even more difficult for users to access, if they exist at all. We investigate the use of several open-source software frameworks to address these challenges. We use Fedora Commons Framework and its digital object abstraction as the repository, Drupal CMS as the user-interface, and the Islandora module as the connector from Drupal to Fedora Repository. With the digital object model, metadata of data description and data provenance can be associated with data content in a formal manner, so are external references and other arbitrary auxiliary information. Changes are formally audited on an object, and digital contents are versioned and have checksums automatically computed. Further, relationships among objects are formally expressed with RDF triples. Data replication, recovery, metadata export are supported with standard protocols, such as OAI-PMH. We provide a tentative comparative analysis of the chosen software stack with the Open Archival Information System (OAIS) reference model, along with our initial results with the existing terrestrial ecology data collections at NASA’s ORNL Distributed Active Archive Center for Biogeochemical Dynamics (ORNL DAAC).
NASA Astrophysics Data System (ADS)
Benedict, K. K.; Servilla, M. S.; Vanderbilt, K.; Wheeler, J.
2015-12-01
The growing volume, variety and velocity of production of Earth science data magnifies the impact of inefficiencies in data acquisition, processing, analysis, and sharing workflows, potentially to the point of impairing the ability of researchers to accomplish their desired scientific objectives. The adaptation of agile software development principles (http://agilemanifesto.org/principles.html) to data curation processes has significant potential to lower barriers to effective scientific data discovery and reuse - barriers that otherwise may force the development of new data to replace existing but unusable data, or require substantial effort to make data usable in new research contexts. This paper outlines a data curation process that was developed at the University of New Mexico that provides a cross-walk of data and associated documentation between the data archive developed by the Long Term Ecological Research (LTER) Network Office (PASTA - http://lno.lternet.edu/content/network-information-system) and UNM's institutional repository (LoboVault - http://repository.unm.edu). The developed automated workflow enables the replication of versioned data objects and their associated standards-based metadata between the LTER system and LoboVault - providing long-term preservation for those data/metadata packages within LoboVault while maintaining the value-added services that the PASTA platform provides. The relative ease with which this workflow was developed is a product of the capabilities independently developed on both platforms - including the simplicity of providing a well-documented application programming interface (API) for each platform enabling scripted interaction and the use of well-established documentation standards (EML in the case of PASTA, Dublin Core in the case of LoboVault) by both systems. These system characteristics, when combined with an iterative process of interaction between the Data Curation Librarian (on the LoboVault side of the process), the Sevilleta LTER Information Manager and the LTER Network Information System developer, yielded a rapid and relatively streamlined process for targeted replication of data and metadata between the two systems - increasing the discoverability and usability of the LTER data assets.
Resource Discovery for Extreme Scale Collaboration (RDESC) Final Report - RPI/TWC - Year 3
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fox, Peter
The amount of data produced in the practice of science is growing rapidly. Despite the accumulation and demand for scientific data, relatively little is actually made available for the broader scientific community. We surmise that the root of the problem is the perceived difficulty to electronically publish scientific data and associated metadata in a way that makes it discoverable. We propose to exploit Semantic Web technologies and practices to make (meta)data discoverable and easy to publish. We share our experiences in curating metadata to illustrate both the flexibility of our approach and the pain of discovering data in the currentmore » research environment. We also make recommendations by concrete example of how data publishers can provide their (meta)data by adding some limited, additional markup to HTML pages on the Web. With little additional effort from data publishers, the difficulty of data discovery/access/sharing can be greatly reduced and the impact of research data greatly enhanced.« less
Hill, A A; Crotta, M; Wall, B; Good, L; O'Brien, S J; Guitian, J
2017-03-01
Foodborne infection is a result of exposure to complex, dynamic food systems. The efficiency of foodborne infection is driven by ongoing shifts in genetic machinery. Next-generation sequencing technologies can provide high-fidelity data about the genetics of a pathogen. However, food safety surveillance systems do not currently provide similar high-fidelity epidemiological metadata to associate with genetic data. As a consequence, it is rarely possible to transform genetic data into actionable knowledge that can be used to genuinely inform risk assessment or prevent outbreaks. Big data approaches are touted as a revolution in decision support, and pose a potentially attractive method for closing the gap between the fidelity of genetic and epidemiological metadata for food safety surveillance. We therefore developed a simple food chain model to investigate the potential benefits of combining 'big' data sources, including both genetic and high-fidelity epidemiological metadata. Our results suggest that, as for any surveillance system, the collected data must be relevant and characterize the important dynamics of a system if we are to properly understand risk: this suggests the need to carefully consider data curation, rather than the more ambitious claims of big data proponents that unstructured and unrelated data sources can be combined to generate consistent insight. Of interest is that the biggest influencers of foodborne infection risk were contamination load and processing temperature, not genotype. This suggests that understanding food chain dynamics would probably more effectively generate insight into foodborne risk than prescribing the hazard in ever more detail in terms of genotype.
Crotta, M.; Wall, B.; Good, L.; O'Brien, S. J.; Guitian, J.
2017-01-01
Foodborne infection is a result of exposure to complex, dynamic food systems. The efficiency of foodborne infection is driven by ongoing shifts in genetic machinery. Next-generation sequencing technologies can provide high-fidelity data about the genetics of a pathogen. However, food safety surveillance systems do not currently provide similar high-fidelity epidemiological metadata to associate with genetic data. As a consequence, it is rarely possible to transform genetic data into actionable knowledge that can be used to genuinely inform risk assessment or prevent outbreaks. Big data approaches are touted as a revolution in decision support, and pose a potentially attractive method for closing the gap between the fidelity of genetic and epidemiological metadata for food safety surveillance. We therefore developed a simple food chain model to investigate the potential benefits of combining ‘big’ data sources, including both genetic and high-fidelity epidemiological metadata. Our results suggest that, as for any surveillance system, the collected data must be relevant and characterize the important dynamics of a system if we are to properly understand risk: this suggests the need to carefully consider data curation, rather than the more ambitious claims of big data proponents that unstructured and unrelated data sources can be combined to generate consistent insight. Of interest is that the biggest influencers of foodborne infection risk were contamination load and processing temperature, not genotype. This suggests that understanding food chain dynamics would probably more effectively generate insight into foodborne risk than prescribing the hazard in ever more detail in terms of genotype. PMID:28405360
Genomes OnLine Database (GOLD) v.6: data updates and feature enhancements
Mukherjee, Supratim; Stamatis, Dimitri; Bertsch, Jon; Ovchinnikova, Galina; Verezemska, Olena; Isbandi, Michelle; Thomas, Alex D.; Ali, Rida; Sharma, Kaushal; Kyrpides, Nikos C.; Reddy, T. B. K.
2017-01-01
The Genomes Online Database (GOLD) (https://gold.jgi.doe.gov) is a manually curated data management system that catalogs sequencing projects with associated metadata from around the world. In the current version of GOLD (v.6), all projects are organized based on a four level classification system in the form of a Study, Organism (for isolates) or Biosample (for environmental samples), Sequencing Project and Analysis Project. Currently, GOLD provides information for 26 117 Studies, 239 100 Organisms, 15 887 Biosamples, 97 212 Sequencing Projects and 78 579 Analysis Projects. These are integrated with over 312 metadata fields from which 58 are controlled vocabularies with 2067 terms. The web interface facilitates submission of a diverse range of Sequencing Projects (such as isolate genome, single-cell genome, metagenome, metatranscriptome) and complex Analysis Projects (such as genome from metagenome, or combined assembly from multiple Sequencing Projects). GOLD provides a seamless interface with the Integrated Microbial Genomes (IMG) system and supports and promotes the Genomic Standards Consortium (GSC) Minimum Information standards. This paper describes the data updates and additional features added during the last two years. PMID:27794040
linkedISA: semantic representation of ISA-Tab experimental metadata.
González-Beltrán, Alejandra; Maguire, Eamonn; Sansone, Susanna-Assunta; Rocca-Serra, Philippe
2014-01-01
Reporting and sharing experimental metadata- such as the experimental design, characteristics of the samples, and procedures applied, along with the analysis results, in a standardised manner ensures that datasets are comprehensible and, in principle, reproducible, comparable and reusable. Furthermore, sharing datasets in formats designed for consumption by humans and machines will also maximize their use. The Investigation/Study/Assay (ISA) open source metadata tracking framework facilitates standards-compliant collection, curation, visualization, storage and sharing of datasets, leveraging on other platforms to enable analysis and publication. The ISA software suite includes several components used in increasingly diverse set of life science and biomedical domains; it is underpinned by a general-purpose format, ISA-Tab, and conversions exist into formats required by public repositories. While ISA-Tab works well mainly as a human readable format, we have also implemented a linked data approach to semantically define the ISA-Tab syntax. We present a semantic web representation of the ISA-Tab syntax that complements ISA-Tab's syntactic interoperability with semantic interoperability. We introduce the linkedISA conversion tool from ISA-Tab to the Resource Description Framework (RDF), supporting mappings from the ISA syntax to multiple community-defined, open ontologies and capitalising on user-provided ontology annotations in the experimental metadata. We describe insights of the implementation and how annotations can be expanded driven by the metadata. We applied the conversion tool as part of Bio-GraphIIn, a web-based application supporting integration of the semantically-rich experimental descriptions. Designed in a user-friendly manner, the Bio-GraphIIn interface hides most of the complexities to the users, exposing a familiar tabular view of the experimental description to allow seamless interaction with the RDF representation, and visualising descriptors to drive the query over the semantic representation of the experimental design. In addition, we defined queries over the linkedISA RDF representation and demonstrated its use over the linkedISA conversion of datasets from Nature' Scientific Data online publication. Our linked data approach has allowed us to: 1) make the ISA-Tab semantics explicit and machine-processable, 2) exploit the existing ontology-based annotations in the ISA-Tab experimental descriptions, 3) augment the ISA-Tab syntax with new descriptive elements, 4) visualise and query elements related to the experimental design. Reasoning over ISA-Tab metadata and associated data will facilitate data integration and knowledge discovery.
Supporting Data Stewardship Throughout the Data Life Cycle in the Solid Earth Sciences
NASA Astrophysics Data System (ADS)
Ferrini, V.; Lehnert, K. A.; Carbotte, S. M.; Hsu, L.
2013-12-01
Stewardship of scientific data is fundamental to enabling new data-driven research, and ensures preservation, accessibility, and quality of the data, yet researchers, especially in disciplines that typically generate and use small, but complex, heterogeneous, and unstructured datasets are challenged to fulfill increasing demands of properly managing their data. The IEDA Data Facility (www.iedadata.org) provides tools and services that support data stewardship throughout the full life cycle of observational data in the solid earth sciences, with a focus on the data management needs of individual researchers. IEDA builds upon and brings together over a decade of development and experiences of its component data systems, the Marine Geoscience Data System (MGDS, www.marine-geo.org) and EarthChem (www.earthchem.org). IEDA services include domain-focused data curation and synthesis, tools for data discovery, access, visualization and analysis, as well as investigator support services that include tools for data contribution, data publication services, and data compliance support. IEDA data synthesis efforts (e.g. PetDB and Global Multi-Resolution Topography (GMRT) Synthesis) focus on data integration and analysis while emphasizing provenance and attribution. IEDA's domain-focused data catalogs (e.g. MGDS and EarthChem Library) provide access to metadata-rich long-tail data complemented by extensive metadata including attribution information and links to related publications. IEDA's visualization and analysis tools (e.g. GeoMapApp) broaden access to earth science data for domain specialist and non-specialists alike, facilitating both interdisciplinary research and education and outreach efforts. As a disciplinary data repository, a key role IEDA plays is to coordinate with its user community and to bridge the requirements and standards for data curation with both the evolving needs of its science community and emerging technologies. Development of IEDA tools and services is based first and foremost on the scientific needs of its user community. As data stewardship becomes a more integral component of the scientific workflow, IEDA investigator support services (e.g. Data Management Plan Tool and Data Compliance Reporting Tool) continue to evolve with the goal of lessening the 'burden' of data management for individual investigators by increasing awareness and facilitating the adoption of data management practices. We will highlight a variety of IEDA system components that support investigators throughout the data life cycle, and will discuss lessons learned and future directions.
NASA Astrophysics Data System (ADS)
Hernández, B. E.; Bugbee, K.; le Roux, J.; Beaty, T.; Hansen, M.; Staton, P.; Sisco, A. W.
2017-12-01
Earth observation (EO) data collected as part of NASA's Earth Observing System Data and Information System (EOSDIS) is now searchable via the Common Metadata Repository (CMR). The Analysis and Review of CMR (ARC) Team at Marshall Space Flight Center has been tasked with reviewing all NASA metadata records in the CMR ( 7,000 records). Each collection level record and constituent granule level metadata are reviewed for both completeness as well as compliance with the CMR's set of metadata standards, as specified in the Unified Metadata Model (UMM). NASA's Distributed Active Archive Centers (DAACs) have been harmonizing priority metadata records within the context of the inter-agency federal Big Earth Data Initiative (BEDI), which seeks to improve the discoverability, accessibility, and usability of EO data. Thus, the first phase of this project constitutes reviewing BEDI metadata records, while the second phase will constitute reviewing the remaining non-BEDI records in CMR. This presentation will discuss the ARC team's findings in terms of the overall quality of BEDI records across all DAACs as well as compliance with UMM standards. For instance, only a fifth of the collection-level metadata fields needed correction, compared to a quarter of the granule-level fields. It should be noted that the degree to which DAACs' metadata did not comply with the UMM standards may reflect multiple factors, such as recent changes in the UMM standards, and the utilization of different metadata formats (e.g. DIF 10, ECHO 10, ISO 19115-1) across the DAACs. Insights, constructive criticism, and lessons learned from this metadata review process will be contributed from both ORNL and SEDAC. Further inquiry along such lines may lead to insights which may improve the metadata curation process moving forward. In terms of the broader implications for metadata compliance with the UMM standards, this research has shown that a large proportion of the prioritized collections have already been made compliant, although the process of improving metadata quality is ongoing and iterative. Further research is also warranted into whether or not the gains in metadata quality are also driving gains in data use.
Evaluating the Interdisciplinary Discoverability of Data
NASA Astrophysics Data System (ADS)
Gordon, S.; Habermann, T.
2017-12-01
Documentation needs are similar across communities. Communities tend to agree on many of the basic concepts necessary for discovery. Shared concepts such as a title or a description of the data exist in most metadata dialects. Many dialects have been designed and recommendations implemented to create metadata valuable for data discovery. These implementations can create barriers to discovering the right data. How can we ensure that the documentation we curate will be discoverable and understandable by researchers outside of our own disciplines and organizations? Since communities tend to use and understand many of the same documentation concepts, the barriers to interdisciplinary discovery are caused by the differences in the implementation. Thus tools and methods designed for the conceptual layer that evaluate records for documentation concepts, regardless of the dialect, can be effective in identifying opportunities for improvement and providing guidance. The Metadata Evaluation Web Service combined with a Jupyter Notebook interface allows a user to gather insight about a collection of records with respect to different communities' conceptual recommendations. It accomplishes this via data visualizations and provides links to implementation specific guidance on the ESIP Wiki for each recommendation applied to the collection. By utilizing these curation tools as part of an iterative process the data's impact can be increased by making it discoverable to a greater scientific and research community. Due to the conceptual focus of the methods and tools used, they can be utilized by any community or organization regardless of their documentation dialect or tools.
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Elger, Kirsten; Bertelmann, Roland; Klump, Jens
2016-04-01
With the foundation of DataCite in 2009 and the technical infrastructure installed in the last six years it has become very easy to create citable dataset DOIs. Nowadays, dataset DOIs are increasingly accepted and required by journals in reference lists of manuscripts. In addition, DataCite provides usage statistics [1] of assigned DOIs and offers a public search API to make research data count. By linking related information to the data, they become more useful for future generations of scientists. For this purpose, several identifier systems, as ISBN for books, ISSN for journals, DOI for articles or related data, Orcid for authors, and IGSN for physical samples can be attached to DOIs using the DataCite metadata schema [2]. While these are good preconditions to publish data, free and open solutions that help with the curation of data, the publication of research data, and the assignment of DOIs in one software seem to be rare. At GFZ Potsdam we built a modular software stack that is made of several free and open software solutions and we established 'GFZ Data Services'. 'GFZ Data Services' provides storage, a metadata editor for publication and a facility to moderate minted DOIs. All software solutions are connected through web APIs, which makes it possible to reuse and integrate established software. Core component of 'GFZ Data Services' is an eSciDoc [3] middleware that is used as central storage, and has been designed along the OAIS reference model for digital preservation. Thus, data are stored in self-contained packages that are made of binary file-based data and XML-based metadata. The eSciDoc infrastructure provides access control to data and it is able to handle half-open datasets, which is useful in embargo situations when a subset of the research data are released after an adequate period. The data exchange platform panMetaDocs [4] makes use of eSciDoc's REST API to upload file-based data into eSciDoc and uses a metadata editor [5] to annotate the files with metadata. The metadata editor has a user-friendly interface with nominal lists, extensive explanations, and an interactive mapping tool to provide assistance to scientists describing the data. It is possible to deposit metadata templates to fill certain fields with default values. The metadata editor generates metadata in the schemas ISO19139, NASA GCMD DIF, and DataCite and could be extended for other schemas. panMetaDocs is able to mint dataset DOIs through DOIDB, which is our component to moderate dataset DOIs issued through 'GFZ Data Services'. DOIDB accepts metadata in the schemas ISO19139, DIF, and DataCite. In addition, DOIDB provides an OAI-PMH interface to disseminate all deposited metadata to data portals. The presentation of datasets on DOI landing pages is done though XSLT stylesheet transformation of the XML-based metadata. The landing pages have been designed to meet needs of scientists. We are able to render the metadata to different layouts. Furthermore, additional information about datasets and publications is assembled into the webpage by querying public databases on the internet. The work presented here will focus on technical details of the software stack. [1] http://stats.datacite.org [2] http://www.dlib.org/dlib/january11/starr/01starr.html [3] http://www.escidoc.org [4] http://panmetadocs.sf.net [5] http://github.com/ulbricht
NASA Astrophysics Data System (ADS)
Morton, J. J.; Ferrini, V. L.
2015-12-01
The Marine Geoscience Data System (MGDS, www.marine-geo.org) operates an interactive digital data repository and metadata catalog that provides access to a variety of marine geology and geophysical data from throughout the global oceans. Its Marine-Geo Digital Library includes common marine geophysical data types and supporting data and metadata, as well as complementary long-tail data. The Digital Library also includes community data collections and custom data portals for the GeoPRISMS, MARGINS and Ridge2000 programs, for active source reflection data (Academic Seismic Portal), and for marine data acquired by the US Antarctic Program (Antarctic and Southern Ocean Data Portal). Ensuring that these data are discoverable not only through our own interfaces but also through standards-compliant web services is critical for enabling investigators to find data of interest.Over the past two years, MGDS has developed several new RESTful web services that enable programmatic access to metadata and data holdings. These web services are compliant with the EarthCube GeoWS Building Blocks specifications and are currently used to drive our own user interfaces. New web applications have also been deployed to provide a more intuitive user experience for searching, accessing and browsing metadata and data. Our new map-based search interface combines components of the Google Maps API with our web services for dynamic searching and exploration of geospatially constrained data sets. Direct introspection of nearly all data formats for hundreds of thousands of data files curated in the Marine-Geo Digital Library has allowed for precise geographic bounds, which allow geographic searches to an extent not previously possible. All MGDS map interfaces utilize the web services of the Global Multi-Resolution Topography (GMRT) synthesis for displaying global basemap imagery and for dynamically provide depth values at the cursor location.
Developing Cyberinfrastructure Tools and Services for Metadata Quality Evaluation
NASA Astrophysics Data System (ADS)
Mecum, B.; Gordon, S.; Habermann, T.; Jones, M. B.; Leinfelder, B.; Powers, L. A.; Slaughter, P.
2016-12-01
Metadata and data quality are at the core of reusable and reproducible science. While great progress has been made over the years, much of the metadata collected only addresses data discovery, covering concepts such as titles and keywords. Improving metadata beyond the discoverability plateau means documenting detailed concepts within the data such as sampling protocols, instrumentation used, and variables measured. Given that metadata commonly do not describe their data at this level, how might we improve the state of things? Giving scientists and data managers easy to use tools to evaluate metadata quality that utilize community-driven recommendations is the key to producing high-quality metadata. To achieve this goal, we created a set of cyberinfrastructure tools and services that integrate with existing metadata and data curation workflows which can be used to improve metadata and data quality across the sciences. These tools work across metadata dialects (e.g., ISO19115, FGDC, EML, etc.) and can be used to assess aspects of quality beyond what is internal to the metadata such as the congruence between the metadata and the data it describes. The system makes use of a user-friendly mechanism for expressing a suite of checks as code in popular data science programming languages such as Python and R. This reduces the burden on scientists and data managers to learn yet another language. We demonstrated these services and tools in three ways. First, we evaluated a large corpus of datasets in the DataONE federation of data repositories against a metadata recommendation modeled after existing recommendations such as the LTER best practices and the Attribute Convention for Dataset Discovery (ACDD). Second, we showed how this service can be used to display metadata and data quality information to data producers during the data submission and metadata creation process, and to data consumers through data catalog search and access tools. Third, we showed how the centrally deployed DataONE quality service can achieve major efficiency gains by allowing member repositories to customize and use recommendations that fit their specific needs without having to create de novo infrastructure at their site.
Forster, Samuel C; Browne, Hilary P; Kumar, Nitin; Hunt, Martin; Denise, Hubert; Mitchell, Alex; Finn, Robert D; Lawley, Trevor D
2016-01-04
The Human Pan-Microbe Communities (HPMC) database (http://www.hpmcd.org/) provides a manually curated, searchable, metagenomic resource to facilitate investigation of human gastrointestinal microbiota. Over the past decade, the application of metagenome sequencing to elucidate the microbial composition and functional capacity present in the human microbiome has revolutionized many concepts in our basic biology. When sufficient high quality reference genomes are available, whole genome metagenomic sequencing can provide direct biological insights and high-resolution classification. The HPMC database provides species level, standardized phylogenetic classification of over 1800 human gastrointestinal metagenomic samples. This is achieved by combining a manually curated list of bacterial genomes from human faecal samples with over 21000 additional reference genomes representing bacteria, viruses, archaea and fungi with manually curated species classification and enhanced sample metadata annotation. A user-friendly, web-based interface provides the ability to search for (i) microbial groups associated with health or disease state, (ii) health or disease states and community structure associated with a microbial group, (iii) the enrichment of a microbial gene or sequence and (iv) enrichment of a functional annotation. The HPMC database enables detailed analysis of human microbial communities and supports research from basic microbiology and immunology to therapeutic development in human health and disease. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Lessons Learned in over Two Decades of GPS/GNSS Data Center Support
NASA Astrophysics Data System (ADS)
Boler, F. M.; Estey, L. H.; Meertens, C. M.; Maggert, D.
2014-12-01
The UNAVCO Data Center in Boulder, Colorado, curates, archives, and distributes geodesy data and products, mainly GPS/GNSS data from 3,000 permanent stations and 10,000 campaign sites around the globe. Although now having core support from NSF and NASA, the archive began around 1992 as a grass-roots effort of a few UNAVCO staff and community members to preserve data going back to 1986. Open access to this data is generally desired, but the Data Center in fact operates under an evolving suite of data access policies ranging from open access to nondisclosure for special cases. Key to processing this data is having the correct equipment metadata; reliably obtaining this metadata continues to be a challenge, in spite of modern cyberinfrastructure and tools, mostly due to human errors or lack of consistent operator training. New metadata problems surface when trying to design and publish modern Digital Object Identifiers for data sets where PIs, funding sources, and historical project names now need to be corrected and verified for data sets going back almost three decades. Originally, the data was GPS-only based on three signals on two carrier frequencies. Modern GNSS covers GPS modernization (three more signals and one additional carrier) as well as open signals and carriers of additional systems such as GLONASS, Galileo, BeiDou, and QZSS, requiring ongoing adaptive strategies to assess the quality of modern datasets. Also, new scientific uses of these data benefit from higher data rates than was needed for early tectonic applications. In addition, there has been a migration from episodic campaign sites (hence sparse data) to continuously operating stations (hence dense data) over the last two decades. All of these factors make it difficult to realistically plan even simple data center functions such as on-line storage capacity.
Expanding the Impact of Photogrammetric Topography Through Improved Data Archiving and Access
NASA Astrophysics Data System (ADS)
Crosby, C. J.; Arrowsmith, R.; Nandigam, V.
2016-12-01
Centimeter to decimeter-scale 2.5 to 3D sampling of the Earth surface topography coupled with the potential for photorealistic coloring of point clouds and texture mapping of meshes enables a wide range of science applications. Not only is the configuration and state of the surface as imaged valuable, but repeat surveys enable quantification of topographic change (erosion, deposition, and displacement) caused by various geologic processes. We are in an era of ubiquitous point clouds which come from both active sources such as laser scanners and radar as well as passive scene reconstruction via structure from motion (SfM) photogrammetry. With the decreasing costs of high-resolution topography (HRT) data collection, via methods such as SfM, the number of researchers collecting these data is increasing. These "long-tail" topographic data are of modest size but great value, and challenges exist to making them widely discoverable, shared, annotated, cited, managed and archived. Presently, there are no central repositories or services to support storage and curation of these datasets. The NSF funded OpenTopography (OT) employs cyberinfrastructure including large-scale data management, high-performance computing, and service-oriented architectures, to provide efficient online access to large HRT (mostly lidar) datasets, metadata, and processing tools. With over 200 datasets and 12,000 registered users, OT is well positioned to provide curation for community collected photogrammetric topographic data. OT is developing a "Community DataSpace", a service built on a low cost storage cloud (e.g. AWS S3) to make it easy for researchers to upload, curate, annotate and distribute their datasets. The system's ingestion workflow will extract metadata from data uploaded; validate it; assign a digital object identifier (DOI); and create a searchable catalog entry, before publishing via the OT portal. The OT Community DataSpace will enable wider discovery and utilization of these HRT datasets via the OT portal and sources that federate the OT data catalog, promote citations, and most importantly increase the impact of investments in data to catalyze scientific discovery.
Managing the explosion of high resolution topography in the geosciences
NASA Astrophysics Data System (ADS)
Crosby, Christopher; Nandigam, Viswanath; Arrowsmith, Ramon; Phan, Minh; Gross, Benjamin
2017-04-01
Centimeter to decimeter-scale 2.5 to 3D sampling of the Earth surface topography coupled with the potential for photorealistic coloring of point clouds and texture mapping of meshes enables a wide range of science applications. Not only is the configuration and state of the surface as imaged valuable, but repeat surveys enable quantification of topographic change (erosion, deposition, and displacement) caused by various geologic processes. We are in an era of ubiquitous point clouds that come from both active sources such as laser scanners and radar as well as passive scene reconstruction via structure from motion (SfM) photogrammetry. With the decreasing costs of high-resolution topography (HRT) data collection, via methods such as SfM and UAS-based laser scanning, the number of researchers collecting these data is increasing. These "long-tail" topographic data are of modest size but great value, and challenges exist to making them widely discoverable, shared, annotated, cited, managed and archived. Presently, there are no central repositories or services to support storage and curation of these datasets. The U.S. National Science Foundation funded OpenTopography (OT) Facility employs cyberinfrastructure including large-scale data management, high-performance computing, and service-oriented architectures, to provide efficient online access to large HRT (mostly lidar) datasets, metadata, and processing tools. With over 225 datasets and 15,000 registered users, OT is well positioned to provide curation for community collected high-resolution topographic data. OT has developed a "Community DataSpace", a service built on a low cost storage cloud (e.g. AWS S3) to make it easy for researchers to upload, curate, annotate and distribute their datasets. The system's ingestion workflow will extract metadata from data uploaded; validate it; assign a digital object identifier (DOI); and create a searchable catalog entry, before publishing via the OT portal. The OT Community DataSpace enables wider discovery and utilization of these HRT datasets via the OT portal and sources that federate the OT data catalog, promote citations, and most importantly increase the impact of investments in data to catalyzes scientific discovery.
Petascale Computing for Ground-Based Solar Physics with the DKIST Data Center
NASA Astrophysics Data System (ADS)
Berukoff, Steven J.; Hays, Tony; Reardon, Kevin P.; Spiess, DJ; Watson, Fraser; Wiant, Scott
2016-05-01
When construction is complete in 2019, the Daniel K. Inouye Solar Telescope will be the most-capable large aperture, high-resolution, multi-instrument solar physics facility in the world. The telescope is designed as a four-meter off-axis Gregorian, with a rotating Coude laboratory designed to simultaneously house and support five first-light imaging and spectropolarimetric instruments. At current design, the facility and its instruments will generate data volumes of 3 PB per year, and produce 107-109 metadata elements.The DKIST Data Center is being designed to store, curate, and process this flood of information, while providing association of science data and metadata to its acquisition and processing provenance. The Data Center will produce quality-controlled calibrated data sets, and make them available freely and openly through modern search interfaces and APIs. Documented software and algorithms will also be made available through community repositories like Github for further collaboration and improvement.We discuss the current design and approach of the DKIST Data Center, describing the development cycle, early technology analysis and prototyping, and the roadmap ahead. We discuss our iterative development approach, the underappreciated challenges of calibrating ground-based solar data, the crucial integration of the Data Center within the larger Operations lifecycle, and how software and hardware support, intelligently deployed, will enable high-caliber solar physics research and community growth for the DKIST's 40-year lifespan.
Scientific Workflows + Provenance = Better (Meta-)Data Management
NASA Astrophysics Data System (ADS)
Ludaescher, B.; Cuevas-Vicenttín, V.; Missier, P.; Dey, S.; Kianmajd, P.; Wei, Y.; Koop, D.; Chirigati, F.; Altintas, I.; Belhajjame, K.; Bowers, S.
2013-12-01
The origin and processing history of an artifact is known as its provenance. Data provenance is an important form of metadata that explains how a particular data product came about, e.g., how and when it was derived in a computational process, which parameter settings and input data were used, etc. Provenance information provides transparency and helps to explain and interpret data products. Other common uses and applications of provenance include quality control, data curation, result debugging, and more generally, 'reproducible science'. Scientific workflow systems (e.g. Kepler, Taverna, VisTrails, and others) provide controlled environments for developing computational pipelines with built-in provenance support. Workflow results can then be explained in terms of workflow steps, parameter settings, input data, etc. using provenance that is automatically captured by the system. Scientific workflows themselves provide a user-friendly abstraction of the computational process and are thus a form of ('prospective') provenance in their own right. The full potential of provenance information is realized when combining workflow-level information (prospective provenance) with trace-level information (retrospective provenance). To this end, the DataONE Provenance Working Group (ProvWG) has developed an extension of the W3C PROV standard, called D-PROV. Whereas PROV provides a 'least common denominator' for exchanging and integrating provenance information, D-PROV adds new 'observables' that described workflow-level information (e.g., the functional steps in a pipeline), as well as workflow-specific trace-level information ( timestamps for each workflow step executed, the inputs and outputs used, etc.) Using examples, we will demonstrate how the combination of prospective and retrospective provenance provides added value in managing scientific data. The DataONE ProvWG is also developing tools based on D-PROV that allow scientists to get more mileage from provenance metadata. DataONE is a federation of member nodes that store data and metadata for discovery and access. By enriching metadata with provenance information, search and reuse of data is enhanced, and the 'social life' of data (being the product of many workflow runs, different people, etc.) is revealed. We are currently prototyping a provenance repository (PBase) to demonstrate what can be achieved with advanced provenance queries. The ProvExplorer and ProPub tools support advanced ad-hoc querying and visualization of provenance as well as customized provenance publications (e.g., to address privacy issues, or to focus provenance to relevant details). In a parallel line of work, we are exploring ways to add provenance support to widely-used scripting platforms (e.g. R and Python) and then expose that information via D-PROV.
Soto, Axel J; Zerva, Chrysoula; Batista-Navarro, Riza; Ananiadou, Sophia
2018-04-15
Pathway models are valuable resources that help us understand the various mechanisms underpinning complex biological processes. Their curation is typically carried out through manual inspection of published scientific literature to find information relevant to a model, which is a laborious and knowledge-intensive task. Furthermore, models curated manually cannot be easily updated and maintained with new evidence extracted from the literature without automated support. We have developed LitPathExplorer, a visual text analytics tool that integrates advanced text mining, semi-supervised learning and interactive visualization, to facilitate the exploration and analysis of pathway models using statements (i.e. events) extracted automatically from the literature and organized according to levels of confidence. LitPathExplorer supports pathway modellers and curators alike by: (i) extracting events from the literature that corroborate existing models with evidence; (ii) discovering new events which can update models; and (iii) providing a confidence value for each event that is automatically computed based on linguistic features and article metadata. Our evaluation of event extraction showed a precision of 89% and a recall of 71%. Evaluation of our confidence measure, when used for ranking sampled events, showed an average precision ranging between 61 and 73%, which can be improved to 95% when the user is involved in the semi-supervised learning process. Qualitative evaluation using pair analytics based on the feedback of three domain experts confirmed the utility of our tool within the context of pathway model exploration. LitPathExplorer is available at http://nactem.ac.uk/LitPathExplorer_BI/. sophia.ananiadou@manchester.ac.uk. Supplementary data are available at Bioinformatics online.
If we build it, will they come? Curation and use of the ESO telescope bibliography
NASA Astrophysics Data System (ADS)
Grothkopf, Uta; Meakins, Silvia; Bordelon, Dominic
2015-12-01
The ESO Telescope Bibliography (telbib) is a database of refereed papers published by the ESO users community. It links data in the ESO Science Archive with the published literature, and vice versa. Developed and maintained by the ESO library, telbib also provides insights into the organization's research output and impact as measured through bibliometric studies. Curating telbib is a multi-step process that involves extensive tagging of the database records. Based on selected use cases, this talk will explain how the rich metadata provide parameters for reports and statistics in order to investigate the performance of ESO's facilities and to understand trends and developments in the publishing behaviour of the user community.
A Relevancy Algorithm for Curating Earth Science Data Around Phenomenon
NASA Technical Reports Server (NTRS)
Maskey, Manil; Ramachandran, Rahul; Li, Xiang; Weigel, Amanda; Bugbee, Kaylin; Gatlin, Patrick; Miller, J. J.
2017-01-01
Earth science data are being collected for various science needs and applications, processed using different algorithms at multiple resolutions and coverages, and then archived at different archiving centers for distribution and stewardship causing difficulty in data discovery. Curation, which typically occurs in museums, art galleries, and libraries, is traditionally defined as the process of collecting and organizing information around a common subject matter or a topic of interest. Curating data sets around topics or areas of interest addresses some of the data discovery needs in the field of Earth science, especially for unanticipated users of data. This paper describes a methodology to automate search and selection of data around specific phenomena. Different components of the methodology including the assumptions, the process, and the relevancy ranking algorithm are described. The paper makes two unique contributions to improving data search and discovery capabilities. First, the paper describes a novel methodology developed for automatically curating data around a topic using Earthscience metadata records. Second, the methodology has been implemented as a standalone web service that is utilized to augment search and usability of data in a variety of tools.
A relevancy algorithm for curating earth science data around phenomenon
NASA Astrophysics Data System (ADS)
Maskey, Manil; Ramachandran, Rahul; Li, Xiang; Weigel, Amanda; Bugbee, Kaylin; Gatlin, Patrick; Miller, J. J.
2017-09-01
Earth science data are being collected for various science needs and applications, processed using different algorithms at multiple resolutions and coverages, and then archived at different archiving centers for distribution and stewardship causing difficulty in data discovery. Curation, which typically occurs in museums, art galleries, and libraries, is traditionally defined as the process of collecting and organizing information around a common subject matter or a topic of interest. Curating data sets around topics or areas of interest addresses some of the data discovery needs in the field of Earth science, especially for unanticipated users of data. This paper describes a methodology to automate search and selection of data around specific phenomena. Different components of the methodology including the assumptions, the process, and the relevancy ranking algorithm are described. The paper makes two unique contributions to improving data search and discovery capabilities. First, the paper describes a novel methodology developed for automatically curating data around a topic using Earth science metadata records. Second, the methodology has been implemented as a stand-alone web service that is utilized to augment search and usability of data in a variety of tools.
The Internet of Scientific Research Things
NASA Astrophysics Data System (ADS)
Chandler, Cynthia; Shepherd, Adam; Arko, Robert; Leadbetter, Adam; Groman, Robert; Kinkade, Danie; Rauch, Shannon; Allison, Molly; Copley, Nancy; Gegg, Stephen; Wiebe, Peter; Glover, David
2016-04-01
The sum of the parts is greater than the whole, but for scientific research how do we identify the parts when they are curated at distributed locations? Results from environmental research represent an enormous investment and constitute essential knowledge required to understand our planet in this time of rapid change. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) curates data from US NSF Ocean Sciences funded research awards, but BCO-DMO is only one repository in a landscape that includes many other sites that carefully curate results of scientific research. Recent efforts to use persistent identifiers (PIDs), most notably Open Researcher and Contributor ID (ORCiD) for person, Digital Object Identifier (DOI) for publications including data sets, and Open Funder Registry (FundRef) codes for research grants and awards are realizing success in unambiguously identifying the pieces that represent results of environmental research. This presentation uses BCO-DMO as a test case for adding PIDs to the locally-curated information published out as standards compliant metadata records. We present a summary of progress made thus far; what has worked and why, and thoughts on logical next steps.
Geocuration Lessons Learned from the Climate Data Initiative Project
NASA Technical Reports Server (NTRS)
Ramachandran, Rahul; Bugbee, Kaylin; Tilmes, Curt; Pinheiro Privette, Ana
2015-01-01
Curation is traditionally defined as the process of collecting and organizing information around a common subject matter or a topic of interest and typically occurs in museums, art galleries, and libraries. The task of organizing data around specific topics or themes is a vibrant and growing effort in the biological sciences but to date this effort has not been actively pursued in the Earth sciences. This presentation will introduce the concept of geocuration, which we define it as the act of searching, selecting, and synthesizing Earth science data/metadata and information from across disciplines and repositories into a single, cohesive, and useful compendium. We also present the Climate Data Initiative (CDI) project as an prototypical example. The CDI project is a systematic effort to manually curate and share openly available climate data from various federal agencies. CDI is a broad multi-agency effort of the U.S. government and seeks to leverage the extensive existing federal climate-relevant data to stimulate innovation and private-sector entrepreneurship to support national climate change preparedness. The geocuration process used in the CDI project, key lessons learned, and suggestions to improve similar geocuration efforts in the future will be part of this presentation.
Geocuration Lessons Learned from the Climate Data Initiative Project
NASA Astrophysics Data System (ADS)
Ramachandran, R.; Bugbee, K.; Tilmes, C.; Privette, A. P.
2015-12-01
Curation is traditionally defined as the process of collecting and organizing information around a common subject matter or a topic of interest and typically occurs in museums, art galleries, and libraries. The task of organizing data around specific topics or themes is a vibrant and growing effort in the biological sciences but to date this effort has not been actively pursued in the Earth sciences. This presentation will introduce the concept of geocuration, which we define it as the act of searching, selecting, and synthesizing Earth science data/metadata and information from across disciplines and repositories into a single, cohesive, and useful compendium.We also present the Climate Data Initiative (CDI) project as an exemplar example. The CDI project is a systematic effort to manually curate and share openly available climate data from various federal agencies. CDI is a broad multi-agency effort of the U.S. government and seeks to leverage the extensive existing federal climate-relevant data to stimulate innovation and private-sector entrepreneurship to support national climate-change preparedness. The geocuration process used in CDI project, key lessons learned, and suggestions to improve similar geocuration efforts in the future will be part of this presentation.
Curatr: a web application for creating, curating and sharing a mass spectral library.
Palmer, Andrew; Phapale, Prasad; Fay, Dominik; Alexandrov, Theodore
2018-04-15
We have developed a web application curatr for the rapid generation of high quality mass spectral fragmentation libraries from liquid-chromatography mass spectrometry datasets. Curatr handles datasets from single or multiplexed standards and extracts chromatographic profiles and potential fragmentation spectra for multiple adducts. An intuitive interface helps users to select high quality spectra that are stored along with searchable molecular information, the providence of each standard and experimental metadata. Curatr supports exports to several standard formats for use with third party software or submission to repositories. We demonstrate the use of curatr to generate the EMBL Metabolomics Core Facility spectral library http://curatr.mcf.embl.de. Source code and example data are at http://github.com/alexandrovteam/curatr/. palmer@embl.de. Supplementary data are available at Bioinformatics online.
Digital Curation of Earth Science Samples Starts in the Field
NASA Astrophysics Data System (ADS)
Lehnert, K. A.; Hsu, L.; Song, L.; Carter, M. R.
2014-12-01
Collection of physical samples in the field is an essential part of research in the Earth Sciences. Samples provide a basis for progress across many disciplines, from the study of global climate change now and over the Earth's history, to present and past biogeochemical cycles, to magmatic processes and mantle dynamics. The types of samples, methods of collection, and scope and scale of sampling campaigns are highly diverse, ranging from large-scale programs to drill rock and sediment cores on land, in lakes, and in the ocean, to environmental observation networks with continuous sampling, to single investigator or small team expeditions to remote areas around the globe or trips to local outcrops. Cyberinfrastructure for sample-related fieldwork needs to cater to the different needs of these diverse sampling activities, aligning with specific workflows, regional constraints such as connectivity or climate, and processing of samples. In general, digital tools should assist with capture and management of metadata about the sampling process (location, time, method) and the sample itself (type, dimension, context, images, etc.), management of the physical objects (e.g., sample labels with QR codes), and the seamless transfer of sample metadata to data systems and software relevant to the post-sampling data acquisition, data processing, and sample curation. In order to optimize CI capabilities for samples, tools and workflows need to adopt community-based standards and best practices for sample metadata, classification, identification and registration. This presentation will provide an overview and updates of several ongoing efforts that are relevant to the development of standards for digital sample management: the ODM2 project that has generated an information model for spatially-discrete, feature-based earth observations resulting from in-situ sensors and environmental samples, aligned with OGC's Observation & Measurements model (Horsburgh et al, AGU FM 2014); implementation of the IGSN (International Geo Sample Number) as a globally unique sample identifier via a distributed system of allocating agents and a central registry; and the EarthCube Research Coordination Network iSamplES (Internet of Samples in the Earth Sciences) that aims to improve sharing and curation of samples through the use of CI.
Curating and Preserving the Big Canopy Database System: an Active Curation Approach using SEAD
NASA Astrophysics Data System (ADS)
Myers, J.; Cushing, J. B.; Lynn, P.; Weiner, N.; Ovchinnikova, A.; Nadkarni, N.; McIntosh, A.
2015-12-01
Modern research is increasingly dependent upon highly heterogeneous data and on the associated cyberinfrastructure developed to organize, analyze, and visualize that data. However, due to the complexity and custom nature of such combined data-software systems, it can be very challenging to curate and preserve them for the long term at reasonable cost and in a way that retains their scientific value. In this presentation, we describe how this challenge was met in preserving the Big Canopy Database (CanopyDB) system using an agile approach and leveraging the Sustainable Environment - Actionable Data (SEAD) DataNet project's hosted data services. The CanopyDB system was developed over more than a decade at Evergreen State College to address the needs of forest canopy researchers. It is an early yet sophisticated exemplar of the type of system that has become common in biological research and science in general, including multiple relational databases for different experiments, a custom database generation tool used to create them, an image repository, and desktop and web tools to access, analyze, and visualize this data. SEAD provides secure project spaces with a semantic content abstraction (typed content with arbitrary RDF metadata statements and relationships to other content), combined with a standards-based curation and publication pipeline resulting in packaged research objects with Digital Object Identifiers. Using SEAD, our cross-project team was able to incrementally ingest CanopyDB components (images, datasets, software source code, documentation, executables, and virtualized services) and to iteratively define and extend the metadata and relationships needed to document them. We believe that both the process, and the richness of the resultant standards-based (OAI-ORE) preservation object, hold lessons for the development of best-practice solutions for preserving scientific data in association with the tools and services needed to derive value from it.
The IRIS Data Management Center: Enabling Access to Observational Time Series Spanning Decades
NASA Astrophysics Data System (ADS)
Ahern, T.; Benson, R.; Trabant, C.
2009-04-01
The Incorporated Research Institutions for Seismology (IRIS) is funded by the National Science Foundation (NSF) to operate the facilities to generate, archive, and distribute seismological data to research communities in the United States and internationally. The IRIS Data Management System (DMS) is responsible for the ingestion, archiving, curation and distribution of these data. The IRIS Data Management Center (DMC) manages data from more than 100 permanent seismic networks, hundreds of temporary seismic deployments as well as data from other geophysical observing networks such as magnetotelluric sensors, ocean bottom sensors, superconducting gravimeters, strainmeters, surface meteorological measurements, and in-situ atmospheric pressure measurements. The IRIS DMC has data from more than 20 different types of sensors. The IRIS DMC manages approximately 100 terabytes of primary observational data. These data are archived in multiple distributed storage systems that insure data availability independent of any single catastrophic failure. Storage systems include both RAID systems of greater than 100 terabytes as well as robotic tape robots of petabyte capacity. IRIS performs routine transcription of the data to new media and storage systems to insure the long-term viability of the scientific data. IRIS adheres to the OAIS Data Preservation Model in most cases. The IRIS data model requires the availability of metadata describing the characteristics and geographic location of sensors before data can be fully archived. IRIS works with the International Federation of Digital Seismographic Networks (FDSN) in the definition and evolution of the metadata. The metadata insures that the data remain useful to both current and future generations of earth scientists. Curation of the metadata and time series is one of the most important activities at the IRIS DMC. Data analysts and an automated quality assurance system monitor the quality of the incoming data. This insures data are of acceptably high quality. The formats and data structures used by the seismological community are esoteric. IRIS and its FDSN partners are developing web services that can transform the data holdings to structures that are more easily used by broader scientific communities. For instance, atmospheric scientists are interested in using global observations of microbarograph data but that community does not understand the methods of applying instrument corrections to the observations. Web processing services under development at IRIS will transform these data in a manner that allows direct use within such analysis tools as MATLAB® already in use by that community. By continuing to develop web-service based methods of data discovery and access, IRIS is enabling broader access to its data holdings. We currently support data discovery using many of the Open Geospatial Consortium (OGC) web mapping services. We are involved in portal technologies to support data discovery and distribution for all data from the EarthScope project. We are working with computer scientists at several universities including the University of Washington as part of a DataNet proposal and we intend to enhance metadata, further develop ontologies, develop a Registry Service to aid in the discovery of data sets and services, and in general improve the semantic interoperability of the data managed at the IRIS DMC. Finally IRIS has been identified as one of four scientific organizations that the External Research Division of Microsoft wants to work with in the development of web services and specifically with the development of a scientific workflow engine. More specific details of current and future developments at the IRIS DMC will be included in this presentation.
NASA Technical Reports Server (NTRS)
Wong, Min Minnie; Ward, Kevin; Boller, Ryan; Gunnoe, Taylor; Baynes, Kathleen; King, Benjamin
2016-01-01
Constellations of NASA Earth Observing System (EOS) satellites orbit the earth to collect images and data about the planet in near real-time. Within hours of satellite overpass, you can discover where the latest wildfires, severe storms, volcanic eruptions, and dust and haze events are occurring using NASA's Worldview web application. By harnessing a repository of curated natural event metadata from NASA Earth Observatory's Natural Event Tracker (EONET), Worldview has moved natural event discovery to the forefront and allows users to select events-of-interest from a curated list, zooms to the area, and adds the most relevant imagery layers for that type of natural event. This poster will highlight NASA Worldviews new natural event feed functionality.
IGSN e.V.: Registration and Identification Services for Physical Samples in the Digital Universe
NASA Astrophysics Data System (ADS)
Lehnert, K. A.; Klump, J.; Arko, R. A.; Bristol, S.; Buczkowski, B.; Chan, C.; Chan, S.; Conze, R.; Cox, S. J.; Habermann, T.; Hangsterfer, A.; Hsu, L.; Milan, A.; Miller, S. P.; Noren, A. J.; Richard, S. M.; Valentine, D. W.; Whitenack, T.; Wyborn, L. A.; Zaslavsky, I.
2011-12-01
The International Geo Sample Number (IGSN) is a unique identifier for samples and specimens collected from our natural environment. It was developed by the System for Earth Sample Registration SESAR to overcome the problem of ambiguous naming of samples that has limited the ability to share, link, and integrate data for samples across Geoscience data systems. Over the past 5 years, SESAR has made substantial progress in implementing the IGSN for sample and data management, working with Geoscience researchers, Geoinformatics specialists, and sample curators to establish metadata requirements, registration procedures, and best practices for the use of the IGSN. The IGSN is now recognized as the primary solution for sample identification and registration, and supported by a growing user community of investigators, repositories, science programs, and data systems. In order to advance broad disciplinary and international implementation of the IGSN, SESAR organized a meeting of international leaders in Geoscience informatics in 2011 to develop a consensus strategy for the long-term operations of the registry with approaches for sustainable operation, organizational structure, governance, and funding. The group endorsed an internationally unified approach for registration and discovery of physical specimens in the Geosciences, and refined the existing SESAR architecture to become a modular and scalable approach, separating the IGSN Registry from a central Sample Metadata Clearinghouse (SESAR), and introducing 'Local Registration Agents' that provide registration services to specific disciplinary or organizational communities, with tools for metadata submission and management, and metadata archiving. Development and implementation of the new IGSN architecture is underway with funding provided by the US NSF Office of International Science and Engineering. A formal governance structure is being established for the IGSN model, consisting of (a) an international not-for-profit organization, the IGSN e.V. (e.V. = 'Eingetragener Verein', legal status for a registered voluntary association in Germany), that defines the IGSN scope and syntax and maintains the IGSN Handle system, and (b) a Science Advisory Board that guides policies, technology, and best practices of the SESAR Sample Metadata Clearinghouse and Local Registration Agents. The IGSN e.V. is being incorporated in Germany at the GFZ Potsdam, a founding event is planned for the AGU Fall Meeting.
NASA Astrophysics Data System (ADS)
Hou, C. Y.; Dattore, R.; Peng, G. S.
2014-12-01
The National Center for Atmospheric Research's Global Climate Four-Dimensional Data Assimilation (CFDDA) Hourly 40km Reanalysis dataset is a dynamically downscaled dataset with high temporal and spatial resolution. The dataset contains three-dimensional hourly analyses in netCDF format for the global atmospheric state from 1985 to 2005 on a 40km horizontal grid (0.4°grid increment) with 28 vertical levels, providing good representation of local forcing and diurnal variation of processes in the planetary boundary layer. This project aimed to make the dataset publicly available, accessible, and usable in order to provide a unique resource to allow and promote studies of new climate characteristics. When the curation project started, it had been five years since the data files were generated. Also, although the Principal Investigator (PI) had generated a user document at the end of the project in 2009, the document had not been maintained. Furthermore, the PI had moved to a new institution, and the remaining team members were reassigned to other projects. These factors made data curation in the areas of verifying data quality, harvest metadata descriptions, documenting provenance information especially challenging. As a result, the project's curation process found that: Data curator's skill and knowledge helped make decisions, such as file format and structure and workflow documentation, that had significant, positive impact on the ease of the dataset's management and long term preservation. Use of data curation tools, such as the Data Curation Profiles Toolkit's guidelines, revealed important information for promoting the data's usability and enhancing preservation planning. Involving data curators during each stage of the data curation life cycle instead of at the end could improve the curation process' efficiency. Overall, the project showed that proper resources invested in the curation process would give datasets the best chance to fulfill their potential to help with new climate pattern discovery.
New GES DISC Services Shortening the Path in Science Data Discovery
NASA Technical Reports Server (NTRS)
Li, Angela; Shie, Chung-Lin; Petrenko, Maksym; Hegde, Mahabaleshwa; Teng, William; Liu, Zhong; Bryant, Keith; Shen, Suhung; Hearty, Thomas; Wei, Jennifer;
2017-01-01
The Current GES DISC available services only allow user to select variables from a single dataset at a time and too many variables from a dataset are displayed, choice is hard. At American Geophysical Union (AGU) 2016 Fall Meeting, Goddard Earth Sciences Data Information Services Center (GES DISC) unveiled a new service: Datalist. A Datalist is a collection of predefined or user-defined data variables from one or more archived datasets. Our science support team curated predefined datalist and provided value to the user community. Imagine some novice user wants to study hurricane and typed in hurricane in the search box. The first item in the search result is GES DISC provided Hurricane Datalist. It contains scientists recommended variables from multiple datasets like TRMM, GPM, MERRA, etc. Datalist uses the same architecture as that of our new website, which also provides one-stop shopping for data, metadata, citation, documentation, visualization and other available services.We implemented Datalist with new GES DISC web architecture, one single web page that unified all user interfaces. From that webpage, users can find data by either type in keyword, or browse by category. It also provides user with a sophisticated integrated data and services package, including metadata, citation, documentation, visualization, and data-specific services, all available from one-stop shopping.
Academic Research Library as Broker in Addressing Interoperability Challenges for the Geosciences
NASA Astrophysics Data System (ADS)
Smith, P., II
2015-12-01
Data capture is an important process in the research lifecycle. Complete descriptive and representative information of the data or database is necessary during data collection whether in the field or in the research lab. The National Science Foundation's (NSF) Public Access Plan (2015) mandates the need for federally funded projects to make their research data more openly available. Developing, implementing, and integrating metadata workflows into to the research process of the data lifecycle facilitates improved data access while also addressing interoperability challenges for the geosciences such as data description and representation. Lack of metadata or data curation can contribute to (1) semantic, (2) ontology, and (3) data integration issues within and across disciplinary domains and projects. Some researchers of EarthCube funded projects have identified these issues as gaps. These gaps can contribute to interoperability data access, discovery, and integration issues between domain-specific and general data repositories. Academic Research Libraries have expertise in providing long-term discovery and access through the use of metadata standards and provision of access to research data, datasets, and publications via institutional repositories. Metadata crosswalks, open archival information systems (OAIS), trusted-repositories, data seal of approval, persistent URL, linking data, objects, resources, and publications in institutional repositories and digital content management systems are common components in the library discipline. These components contribute to a library perspective on data access and discovery that can benefit the geosciences. The USGS Community for Data Integration (CDI) has developed the Science Support Framework (SSF) for data management and integration within its community of practice for contribution to improved understanding of the Earth's physical and biological systems. The USGS CDI SSF can be used as a reference model to map to EarthCube Funded projects with academic research libraries facilitating the data and information assets components of the USGS CDI SSF via institutional repositories and/or digital content management. This session will explore the USGS CDI SSF for cross-discipline collaboration considerations from a library perspective.
BCO-DMO: Enabling Access to Federally Funded Research Data
NASA Astrophysics Data System (ADS)
Kinkade, D.; Allison, M. D.; Chandler, C. L.; Groman, R. C.; Rauch, S.; Shepherd, A.; Gegg, S. R.; Wiebe, P. H.; Glover, D. M.
2013-12-01
In a February, 2013 memo1, the White House Office of Science and Technology Policy (OSTP) outlined principles and objectives to increase access by the public to federally funded research publications and data. Such access is intended to drive innovation by allowing private and commercial efforts to take full advantage of existing resources, thereby maximizing Federal research dollars and efforts. The Biological and Chemical Oceanography Data Management Office (BCO-DMO; bco-dmo.org) serves as a model resource for organizations seeking compliance with the OSTP policy. BCO-DMO works closely with scientific investigators to publish their data from research projects funded by the National Science Foundation (NSF), within the Biological and Chemical Oceanography Sections (OCE) and the Division of Polar Programs Antarctic Organisms & Ecosystems Program (PLR). BCO-DMO addresses many of the OSTP objectives for public access to digital scientific data: (1) Marine biogeochemical and ecological data and metadata are disseminated via a public website, and curated on intermediate time frames; (2) Preservation needs are met by collaborating with appropriate national data facilities for data archive; (3) Cost and administrative burden associated with data management is minimized by the use of one dedicated office providing hundreds of NSF investigators support for data management plan development, data organization, metadata generation and deposition of data and metadata into the BCO-DMO repository; (4) Recognition of intellectual property is reinforced through the office's citation policy and the use of digital object identifiers (DOIs); (5) Education and training in data stewardship and use of the BCO-DMO system is provided by office staff through a variety of venues. Oceanographic research data and metadata from thousands of datasets generated by hundreds of investigators are now available through BCO-DMO. 1 White House Office of Science and Technology Policy, Memorandum for the Heads of Executive Departments and Agencies: Increasing Access to the Results of Federally Funded Scientific Research, February 23, 2013. http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
The IGSN Experience: Successes and Challenges of Implementing Persistent Identifiers for Samples
NASA Astrophysics Data System (ADS)
Lehnert, Kerstin; Arko, Robert
2016-04-01
Physical samples collected and studied in the Earth sciences represent both a research resource and a research product in the Earth Sciences. As such they need to be properly managed, curated, documented, and cited to ensure re-usability and utility for future science, reproducibility of the data generated by their study, and credit for funding agencies and researchers who invested substantial resources and intellectual effort into their collection and curation. Use of persistent and unique identifiers and deposition of metadata in a persistent registry are therefore as important for physical samples as they are for digital data. The International Geo Sample Number (IGSN) is a persistent, globally unique identifier. Its adoption by individual investigators, repository curators, publishers, and data managers is rapidly growing world-wide. This presentation will provide an analysis of the development and implementation path of the IGSN and relevant insights and experiences gained along its way. Development of the IGSN started in 2004 as part of a US NSF-funded project to establish a registry for sample metadata, the System for Earth Sample Registration (SESAR). The initial system provided a centralized solution for users to submit information about their samples and obtain IGSNs and bar codes. Challenges encountered during this initial phase related to defining the scope of the registry, granularity of registered objects, responsibilities of relevant actors, and workflows, and designing the registry's metadata schema, its user interfaces, and the identifier itself, including its syntax. The most challenging task though was to make the IGSN an integral part of personal and institutional sample management, digital management of sample-based data, and data publication on a global scale. Besides convincing individual researchers, curators, editors and publishers, as well as data managers in US and non-US academia, state and federal agencies, the PIs of the SESAR project needed to identify ways to organize, operate, and govern the global registry in the short and in the long-term. A major breakthrough was achieved at an international workshop in February 2011, at which participants designed a new distributed and scalable architecture for the IGSN with international governance by a membership organization modeled after the DataCite consortium. The founding of the international governing body and implementation organization for the IGSN, the IGSN e.V., took place at the AGU Fall Meeting 2011. Recent progress came at a workshop in September 2015, where stakeholders from both geoscience and life science disciplines drafted a standard IGSN metadata schema for describing samples, to complement the existing schema for registering samples. Consensus was achieved on an essential set of properties to describe a sample's origin and classification, creating a "birth certificate" for the sample. Further consensus was achieved in clarifying that an IGSN may represent exactly one physical sample, sampling feature, or collection of samples; and in aligning the IGSN schema with the existing Observations Data Model (ODM-2). The resulting schema was published online at schema.igsn.org and presented at the AGU Fall Meeting 2015.
NASA Astrophysics Data System (ADS)
Hwang, L.; Kellogg, L. H.
2017-12-01
Curation of software promotes discoverability and accessibility and works hand in hand with scholarly citation to ascribe value to, and provide recognition for software development. To meet this challenge, the Computational Infrastructure for Geodynamics (CIG) maintains a community repository built on custom and open tools to promote discovery, access, identification, credit, and provenance of research software for the geodynamics community. CIG (geodynamics.org) originated from recognition of the tremendous effort required to develop sound software and the need to reduce duplication of effort and to sustain community codes. CIG curates software across 6 domains and has developed and follows software best practices that include establishing test cases, documentation, and a citable publication for each software package. CIG software landing web pages provide access to current and past releases; many are also accessible through the CIG community repository on github. CIG has now developed abc - attribution builder for citation to enable software users to give credit to software developers. abc uses zenodo as an archive and as the mechanism to obtain a unique identifier (DOI) for scientific software. To assemble the metadata, we searched the software's documentation and research publications and then requested the primary developers to verify. In this process, we have learned that each development community approaches software attribution differently. The metadata gathered is based on guidelines established by groups such as FORCE11 and OntoSoft. The rollout of abc is gradual as developers are forward-looking, rarely willing to go back and archive prior releases in zenodo. Going forward all actively developed packages will utilize the zenodo and github integration to automate the archival process when a new release is issued. How to handle legacy software, multi-authored libraries, and assigning roles to software remain open issues.
The NASA Astrophysics Data System joins the Revolution
NASA Astrophysics Data System (ADS)
Accomazzi, Alberto; Kurtz, Michael J.; Henneken, Edwin; Grant, Carolyn S.; Thompson, Donna M.; Chyla, Roman; Holachek, Alexandra; Sudilovsky, Vladimir; Elliott, Jonathan; Murray, Stephen S.
2015-08-01
Whether or not scholarly publications are going through an evolution or revolution, one comforting certainty remains: the NASA Astrophysics Data System (ADS) is here to help the working astronomer and librarian navigate through the increasingly complex communication environment we find ourselves in. Born as a bibliographic database, today's ADS is best described as a an "aggregator" of scholarly resources relevant to the needs of researchers in astronomy and physics. In addition to indexing content from a variety of publishers, data and software archives, the ADS enriches its records by text-mining and indexing the full-text articles, enriching its metadata through the extraction of citations and acknowledgments and the ingest of bibliographies and data links maintained by astronomy institutions and data archives. In addition, ADS generates and maintains citation and co-readership networks to support discovery and bibliometric analysis.In this talk I will summarize new and ongoing curation activities and technology developments of the ADS in the face of the ever-changing world of scholarly publishing and the trends in information-sharing behavior of astronomers. Recent curation efforts include the indexing of non-standard scholarly content (such as software packages, IVOA documents and standards, and NASA award proposals); the indexing of additional content (full-text of articles, acknowledgments, affiliations, ORCID ids); and enhanced support for bibliographic groups and data links. Recent technology developments include a new Application Programming Interface which provides access to a variety of ADS microservices, a new user interface featuring a variety of visualizations and bibliometric analysis, and integration with ORCID services to support paper claiming.
NASA Astrophysics Data System (ADS)
Devaraju, Anusuriya; Klump, Jens; Tey, Victor; Fraser, Ryan
2016-04-01
Physical samples such as minerals, soil, rocks, water, air and plants are important observational units for understanding the complexity of our environment and its resources. They are usually collected and curated by different entities, e.g., individual researchers, laboratories, state agencies, or museums. Persistent identifiers may facilitate access to physical samples that are scattered across various repositories. They are essential to locate samples unambiguously and to share their associated metadata and data systematically across the Web. The International Geo Sample Number (IGSN) is a persistent, globally unique label for identifying physical samples. The IGSNs of physical samples are registered by end-users (e.g., individual researchers, data centers and projects) through allocating agents. Allocating agents are the institutions acting on behalf of the implementing organization (IGSN e.V.). The Commonwealth Scientific and Industrial Research Organisation CSIRO) is one of the allocating agents in Australia. To implement IGSN in our organisation, we developed a RESTful service and a metadata model. The web service enables a client to register sub-namespaces and multiple samples, and retrieve samples' metadata programmatically. The metadata model provides a framework in which different types of samples may be represented. It is generic and extensible, therefore it may be applied in the context of multi-disciplinary projects. The metadata model has been implemented as an XML schema and a PostgreSQL database. The schema is used to handle sample registrations requests and to disseminate their metadata, whereas the relational database is used to preserve the metadata records. The metadata schema leverages existing controlled vocabularies to minimize the scope for error and incorporates some simplifications to reduce complexity of the schema implementation. The solutions developed have been applied and tested in the context of two sample repositories in CSIRO, the Capricorn Distal Footprints project and the Rock Store.
The PDS4 Information Model and its Role in Agile Science Data Curation
NASA Astrophysics Data System (ADS)
Hughes, J. S.; Crichton, D.
2017-12-01
PDS4 is an information model-driven service architecture supporting the capture, management, distribution and integration of massive planetary science data captured in distributed data archives world-wide. The PDS4 Information Model (IM), the core element of the architecture, was developed using lessons learned from 20 years of archiving Planetary Science Data and best practices for information model development. The foundational principles were adopted from the Open Archival Information System (OAIS) Reference Model (ISO 14721), the Metadata Registry Specification (ISO/IEC 11179), and W3C XML (Extensible Markup Language) specifications. These provided respectively an object oriented model for archive information systems, a comprehensive schema for data dictionaries and hierarchical governance, and rules for rules for encoding documents electronically. The PDS4 Information model is unique in that it drives the PDS4 infrastructure by providing the representation of concepts and their relationships, constraints, rules, and operations; a sharable, stable, and organized set of information requirements; and machine parsable definitions that are suitable for configuring and generating code. This presentation will provide an over of the PDS4 Information Model and how it is being leveraged to develop and evolve the PDS4 infrastructure and enable agile curation of over 30 years of science data collected by the international Planetary Science community.
The Global Registry of Biodiversity Repositories: A Call for Community Curation.
Schindel, David E; Miller, Scott E; Trizna, Michael G; Graham, Eileen; Crane, Adele E
2016-01-01
The Global Registry of Biodiversity Repositories is an online metadata resource for biodiversity collections, the institutions that contain them, and associated staff members. The registry provides contact and address information, characteristics of the institutions and collections using controlled vocabularies and free-text descripitons, links to related websites, unique identifiers for each institution and collection record, text fields for loan and use policies, and a variety of other descriptors. Each institution record includes an institutionCode that must be unique, and each collection record must have a collectionCode that is unique within that institution. The registry is populated with records imported from the largest similar registries and more can be harmonized and added. Doing so will require community input and curation and would produce a truly comprehensive and unifying information resource.
ALE: automated label extraction from GEO metadata.
Giles, Cory B; Brown, Chase A; Ripperger, Michael; Dennis, Zane; Roopnarinesingh, Xiavan; Porter, Hunter; Perz, Aleksandra; Wren, Jonathan D
2017-12-28
NCBI's Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because it ensures standardization and consistency. While some of these labels can be extracted directly from the textual metadata, many of the data available do not contain explicit text informing the researcher about the age and gender of the subjects with the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence. Our analysis shows only 26% of metadata text contains information about gender and 21% about age. In order to ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels, based upon gene expression of the samples and compare this to the text-based method. Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach as well as machine learning. We show the two methods together improve accuracy of label assignment to GEO samples.
Integrating Semantic Information in Metadata Descriptions for a Geoscience-wide Resource Inventory.
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Richard, S. M.; Gupta, A.; Valentine, D.; Whitenack, T.; Ozyurt, I. B.; Grethe, J. S.; Schachne, A.
2016-12-01
Integrating semantic information into legacy metadata catalogs is a challenging issue and so far has been mostly done on a limited scale. We present experience of CINERGI (Community Inventory of Earthcube Resources for Geoscience Interoperability), an NSF Earthcube Building Block project, in creating a large cross-disciplinary catalog of geoscience information resources to enable cross-domain discovery. The project developed a pipeline for automatically augmenting resource metadata, in particular generating keywords that describe metadata documents harvested from multiple geoscience information repositories or contributed by geoscientists through various channels including surveys and domain resource inventories. The pipeline examines available metadata descriptions using text parsing, vocabulary management and semantic annotation and graph navigation services of GeoSciGraph. GeoSciGraph, in turn, relies on a large cross-domain ontology of geoscience terms, which bridges several independently developed ontologies or taxonomies including SWEET, ENVO, YAGO, GeoSciML, GCMD, SWO, and CHEBI. The ontology content enables automatic extraction of keywords reflecting science domains, equipment used, geospatial features, measured properties, methods, processes, etc. We specifically focus on issues of cross-domain geoscience ontology creation, resolving several types of semantic conflicts among component ontologies or vocabularies, and constructing and managing facets for improved data discovery and navigation. The ontology and keyword generation rules are iteratively improved as pipeline results are presented to data managers for selective manual curation via a CINERGI Annotator user interface. We present lessons learned from applying CINERGI metadata augmentation pipeline to a number of federal agency and academic data registries, in the context of several use cases that require data discovery and integration across multiple earth science data catalogs of varying quality and completeness. The inventory is accessible at http://cinergi.sdsc.edu, and the CINERGI project web page is http://earthcube.org/group/cinergi
Winsor, Geoffrey L; Griffiths, Emma J; Lo, Raymond; Dhillon, Bhavjinder K; Shay, Julie A; Brinkman, Fiona S L
2016-01-04
The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Site-based data curation based on hot spring geobiology
Palmer, Carole L.; Thomer, Andrea K.; Baker, Karen S.; Wickett, Karen M.; Hendrix, Christie L.; Rodman, Ann; Sigler, Stacey; Fouke, Bruce W.
2017-01-01
Site-Based Data Curation (SBDC) is an approach to managing research data that prioritizes sharing and reuse of data collected at scientifically significant sites. The SBDC framework is based on geobiology research at natural hot spring sites in Yellowstone National Park as an exemplar case of high value field data in contemporary, cross-disciplinary earth systems science. Through stakeholder analysis and investigation of data artifacts, we determined that meaningful and valid reuse of digital hot spring data requires systematic documentation of sampling processes and particular contextual information about the site of data collection. We propose a Minimum Information Framework for recording the necessary metadata on sampling locations, with anchor measurements and description of the hot spring vent distinct from the outflow system, and multi-scale field photography to capture vital information about hot spring structures. The SBDC framework can serve as a global model for the collection and description of hot spring systems field data that can be readily adapted for application to the curation of data from other kinds scientifically significant sites. PMID:28253269
The Materials Data Facility: Data Services to Advance Materials Science Research
NASA Astrophysics Data System (ADS)
Blaiszik, B.; Chard, K.; Pruyne, J.; Ananthakrishnan, R.; Tuecke, S.; Foster, I.
2016-08-01
With increasingly strict data management requirements from funding agencies and institutions, expanding focus on the challenges of research replicability, and growing data sizes and heterogeneity, new data needs are emerging in the materials community. The materials data facility (MDF) operates two cloud-hosted services, data publication and data discovery, with features to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific) and automatically-extracted metadata in a registry while the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. The MDF services empower individual researchers, research projects, and institutions to (I) publish research datasets, regardless of size, from local storage, institutional data stores, or cloud storage, without involvement of third-party publishers; (II) build, share, and enforce extensible domain-specific custom metadata schemas; (III) interact with published data and metadata via representational state transfer (REST) application program interfaces (APIs) to facilitate automation, analysis, and feedback; and (IV) access a data discovery model that allows researchers to search, interrogate, and eventually build on existing published data. We describe MDF's design, current status, and future plans.
The Global Registry of Biodiversity Repositories: A Call for Community Curation
Miller, Scott E.; Trizna, Michael G.; Graham, Eileen; Crane, Adele E.
2016-01-01
Abstract The Global Registry of Biodiversity Repositories is an online metadata resource for biodiversity collections, the institutions that contain them, and associated staff members. The registry provides contact and address information, characteristics of the institutions and collections using controlled vocabularies and free-text descripitons, links to related websites, unique identifiers for each institution and collection record, text fields for loan and use policies, and a variety of other descriptors. Each institution record includes an institutionCode that must be unique, and each collection record must have a collectionCode that is unique within that institution. The registry is populated with records imported from the largest similar registries and more can be harmonized and added. Doing so will require community input and curation and would produce a truly comprehensive and unifying information resource. PMID:27660523
ASDC Advances in the Utilization of Microservices and Hybrid Cloud Environments
NASA Astrophysics Data System (ADS)
Baskin, W. E.; Herbert, A.; Mazaika, A.; Walter, J.
2017-12-01
The Atmospheric Science Data Center (ASDC) is transitioning many of its software tools and applications to standalone microservices deployable in a hybrid cloud, offering benefits such as scalability and efficient environment management. This presentation features several projects the ASDC staff have implemented leveraging the OpenShift Container Application Platform and OpenStack Hybrid Cloud Environment focusing on key tools and techniques applied to: Earth Science data processing Spatial-Temporal metadata generation, validation, repair, and curation Archived Data discovery, visualization, and access
DOIDB: Reusing DataCite's search software as metadata portal for GFZ Data Services
NASA Astrophysics Data System (ADS)
Elger, K.; Ulbricht, D.; Bertelmann, R.
2016-12-01
GFZ Data Services is the central service point for the publication of research data at the Helmholtz Centre Potsdam GFZ German Research Centre for Geosciences (GFZ). It provides data publishing services to scientists of GFZ, associated projects, and associated institutions. The publishing services aim to make research data and physical samples visible and citable, by assigning persistent identifiers (DOI, IGSN) and by complementing existing IT infrastructure. To integrate several research domains a modular software stack that is made of free software components has been created to manage data and metadata as well as register persistent identifiers [1]. Pivotal component for the registration of DOIs is the DOIDB. It has been derived from three software components provided by DataCite [2] that moderate the registration of DOIs and the deposition of metadata, allow the dissemination of metadata, and provide a user interface to navigate and discover datasets. The DOIDB acts as a proxy to the DataCite infrastructure and in addition to the DataCite metadata schema, it allows to deposit and disseminate metadata following the schemas ISO19139 and NASA GCMD DIF. The search component has been modified to meet the requirements of a geosciences metadata portal. In particular, the search component has been altered to make use of Apache SOLRs capability to index and query spatial coordinates. Furthermore, the user interface has been adjusted to provide a first impression of the data by showing a map, summary information and subjects. DOIDB and its components are available on GitHub [3].We present a software solution for registration of DOIs that allows to integrate existing data systems, keeps track of registered DOIs, and provides a metadata portal to discover datasets [4]. [1] Ulbricht, D.; Elger, K.; Bertelmann, R.; Klump, J. panMetaDocs, eSciDoc, and DOIDB—An Infrastructure for the Curation and Publication of File-Based Datasets for GFZ Data Services. ISPRS Int. J. Geo-Inf. 2016, 5, 25. http://doi.org/10.3390/ijgi5030025[2] https://github.com/datacite[3] https://github.com/ulbricht/search/tree/doidb , https://github.com/ulbricht/mds/tree/doidb , https://github.com/ulbricht/oaip/tree/doidb[4] http://doidb.wdc-terra.org
GraphMeta: Managing HPC Rich Metadata in Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dai, Dong; Chen, Yong; Carns, Philip
High-performance computing (HPC) systems face increasingly critical metadata management challenges, especially in the approaching exascale era. These challenges arise not only from exploding metadata volumes, but also from increasingly diverse metadata, which contains data provenance and arbitrary user-defined attributes in addition to traditional POSIX metadata. This ‘rich’ metadata is becoming critical to supporting advanced data management functionality such as data auditing and validation. In our prior work, we identified a graph-based model as a promising solution to uniformly manage HPC rich metadata due to its flexibility and generality. However, at the same time, graph-based HPC rich metadata anagement also introducesmore » significant challenges to the underlying infrastructure. In this study, we first identify the challenges on the underlying infrastructure to support scalable, high-performance rich metadata management. Based on that, we introduce GraphMeta, a graphbased engine designed for this use case. It achieves performance scalability by introducing a new graph partitioning algorithm and a write-optimal storage engine. We evaluate GraphMeta under both synthetic and real HPC metadata workloads, compare it with other approaches, and demonstrate its advantages in terms of efficiency and usability for rich metadata management in HPC systems.« less
MicRhoDE: a curated database for the analysis of microbial rhodopsin diversity and evolution
Boeuf, Dominique; Audic, Stéphane; Brillet-Guéguen, Loraine; Caron, Christophe; Jeanthon, Christian
2015-01-01
Microbial rhodopsins are a diverse group of photoactive transmembrane proteins found in all three domains of life and in viruses. Today, microbial rhodopsin research is a flourishing research field in which new understandings of rhodopsin diversity, function and evolution are contributing to broader microbiological and molecular knowledge. Here, we describe MicRhoDE, a comprehensive, high-quality and freely accessible database that facilitates analysis of the diversity and evolution of microbial rhodopsins. Rhodopsin sequences isolated from a vast array of marine and terrestrial environments were manually collected and curated. To each rhodopsin sequence are associated related metadata, including predicted spectral tuning of the protein, putative activity and function, taxonomy for sequences that can be linked to a 16S rRNA gene, sampling date and location, and supporting literature. The database currently covers 7857 aligned sequences from more than 450 environmental samples or organisms. Based on a robust phylogenetic analysis, we introduce an operational classification system with multiple phylogenetic levels ranging from superclusters to species-level operational taxonomic units. An integrated pipeline for online sequence alignment and phylogenetic tree construction is also provided. With a user-friendly interface and integrated online bioinformatics tools, this unique resource should be highly valuable for upcoming studies of the biogeography, diversity, distribution and evolution of microbial rhodopsins. Database URL: http://micrhode.sb-roscoff.fr. PMID:26286928
MicRhoDE: a curated database for the analysis of microbial rhodopsin diversity and evolution.
Boeuf, Dominique; Audic, Stéphane; Brillet-Guéguen, Loraine; Caron, Christophe; Jeanthon, Christian
2015-01-01
Microbial rhodopsins are a diverse group of photoactive transmembrane proteins found in all three domains of life and in viruses. Today, microbial rhodopsin research is a flourishing research field in which new understandings of rhodopsin diversity, function and evolution are contributing to broader microbiological and molecular knowledge. Here, we describe MicRhoDE, a comprehensive, high-quality and freely accessible database that facilitates analysis of the diversity and evolution of microbial rhodopsins. Rhodopsin sequences isolated from a vast array of marine and terrestrial environments were manually collected and curated. To each rhodopsin sequence are associated related metadata, including predicted spectral tuning of the protein, putative activity and function, taxonomy for sequences that can be linked to a 16S rRNA gene, sampling date and location, and supporting literature. The database currently covers 7857 aligned sequences from more than 450 environmental samples or organisms. Based on a robust phylogenetic analysis, we introduce an operational classification system with multiple phylogenetic levels ranging from superclusters to species-level operational taxonomic units. An integrated pipeline for online sequence alignment and phylogenetic tree construction is also provided. With a user-friendly interface and integrated online bioinformatics tools, this unique resource should be highly valuable for upcoming studies of the biogeography, diversity, distribution and evolution of microbial rhodopsins. Database URL: http://micrhode.sb-roscoff.fr. © The Author(s) 2015. Published by Oxford University Press.
A Lifecycle Approach to Brokered Data Management for Hydrologic Modeling Data Using Open Standards.
NASA Astrophysics Data System (ADS)
Blodgett, D. L.; Booth, N.; Kunicki, T.; Walker, J.
2012-12-01
The U.S. Geological Survey Center for Integrated Data Analytics has formalized an information management-architecture to facilitate hydrologic modeling and subsequent decision support throughout a project's lifecycle. The architecture is based on open standards and open source software to decrease the adoption barrier and to build on existing, community supported software. The components of this system have been developed and evaluated to support data management activities of the interagency Great Lakes Restoration Initiative, Department of Interior's Climate Science Centers and WaterSmart National Water Census. Much of the research and development of this system has been in cooperation with international interoperability experiments conducted within the Open Geospatial Consortium. Community-developed standards and software, implemented to meet the unique requirements of specific disciplines, are used as a system of interoperable, discipline specific, data types and interfaces. This approach has allowed adoption of existing software that satisfies the majority of system requirements. Four major features of the system include: 1) assistance in model parameter and forcing creation from large enterprise data sources; 2) conversion of model results and calibrated parameters to standard formats, making them available via standard web services; 3) tracking a model's processes, inputs, and outputs as a cohesive metadata record, allowing provenance tracking via reference to web services; and 4) generalized decision support tools which rely on a suite of standard data types and interfaces, rather than particular manually curated model-derived datasets. Recent progress made in data and web service standards related to sensor and/or model derived station time series, dynamic web processing, and metadata management are central to this system's function and will be presented briefly along with a functional overview of the applications that make up the system. As the separate pieces of this system progress, they will be combined and generalized to form a sort of social network for nationally consistent hydrologic modeling.
Mercury Toolset for Spatiotemporal Metadata
NASA Technical Reports Server (NTRS)
Wilson, Bruce E.; Palanisamy, Giri; Devarakonda, Ranjeet; Rhyne, B. Timothy; Lindsley, Chris; Green, James
2010-01-01
Mercury (http://mercury.ornl.gov) is a set of tools for federated harvesting, searching, and retrieving metadata, particularly spatiotemporal metadata. Version 3.0 of the Mercury toolset provides orders of magnitude improvements in search speed, support for additional metadata formats, integration with Google Maps for spatial queries, facetted type search, support for RSS (Really Simple Syndication) delivery of search results, and enhanced customization to meet the needs of the multiple projects that use Mercury. It provides a single portal to very quickly search for data and information contained in disparate data management systems, each of which may use different metadata formats. Mercury harvests metadata and key data from contributing project servers distributed around the world and builds a centralized index. The search interfaces then allow the users to perform a variety of fielded, spatial, and temporal searches across these metadata sources. This centralized repository of metadata with distributed data sources provides extremely fast search results to the user, while allowing data providers to advertise the availability of their data and maintain complete control and ownership of that data. Mercury periodically (typically daily) harvests metadata sources through a collection of interfaces and re-indexes these metadata to provide extremely rapid search capabilities, even over collections with tens of millions of metadata records. A number of both graphical and application interfaces have been constructed within Mercury, to enable both human users and other computer programs to perform queries. Mercury was also designed to support multiple different projects, so that the particular fields that can be queried and used with search filters are easy to configure for each different project.
Mercury Toolset for Spatiotemporal Metadata
NASA Astrophysics Data System (ADS)
Devarakonda, Ranjeet; Palanisamy, Giri; Green, James; Wilson, Bruce; Rhyne, B. Timothy; Lindsley, Chris
2010-06-01
Mercury (http://mercury.ornl.gov) is a set of tools for federated harvesting, searching, and retrieving metadata, particularly spatiotemporal metadata. Version 3.0 of the Mercury toolset provides orders of magnitude improvements in search speed, support for additional metadata formats, integration with Google Maps for spatial queries, facetted type search, support for RSS (Really Simple Syndication) delivery of search results, and enhanced customization to meet the needs of the multiple projects that use Mercury. It provides a single portal to very quickly search for data and information contained in disparate data management systems, each of which may use different metadata formats. Mercury harvests metadata and key data from contributing project servers distributed around the world and builds a centralized index. The search interfaces then allow the users to perform a variety of fielded, spatial, and temporal searches across these metadata sources. This centralized repository of metadata with distributed data sources provides extremely fast search results to the user, while allowing data providers to advertise the availability of their data and maintain complete control and ownership of that data. Mercury periodically (typically daily)harvests metadata sources through a collection of interfaces and re-indexes these metadata to provide extremely rapid search capabilities, even over collections with tens of millions of metadata records. A number of both graphical and application interfaces have been constructed within Mercury, to enable both human users and other computer programs to perform queries. Mercury was also designed to support multiple different projects, so that the particular fields that can be queried and used with search filters are easy to configure for each different project.
Bookshelf: a simple curation system for the storage of biomolecular simulation data.
Vohra, Shabana; Hall, Benjamin A; Holdbrook, Daniel A; Khalid, Syma; Biggin, Philip C
2010-01-01
Molecular dynamics simulations can now routinely generate data sets of several hundreds of gigabytes in size. The ability to generate this data has become easier over recent years and the rate of data production is likely to increase rapidly in the near future. One major problem associated with this vast amount of data is how to store it in a way that it can be easily retrieved at a later date. The obvious answer to this problem is a database. However, a key issue in the development and maintenance of such a database is its sustainability, which in turn depends on the ease of the deposition and retrieval process. Encouraging users to care about meta-data is difficult and thus the success of any storage system will ultimately depend on how well used by end-users the system is. In this respect we suggest that even a minimal amount of metadata if stored in a sensible fashion is useful, if only at the level of individual research groups. We discuss here, a simple database system which we call 'Bookshelf', that uses python in conjunction with a mysql database to provide an extremely simple system for curating and keeping track of molecular simulation data. It provides a user-friendly, scriptable solution to the common problem amongst biomolecular simulation laboratories; the storage, logging and subsequent retrieval of large numbers of simulations. Download URL: http://sbcb.bioch.ox.ac.uk/bookshelf/
Bookshelf: a simple curation system for the storage of biomolecular simulation data
Vohra, Shabana; Hall, Benjamin A.; Holdbrook, Daniel A.; Khalid, Syma; Biggin, Philip C.
2010-01-01
Molecular dynamics simulations can now routinely generate data sets of several hundreds of gigabytes in size. The ability to generate this data has become easier over recent years and the rate of data production is likely to increase rapidly in the near future. One major problem associated with this vast amount of data is how to store it in a way that it can be easily retrieved at a later date. The obvious answer to this problem is a database. However, a key issue in the development and maintenance of such a database is its sustainability, which in turn depends on the ease of the deposition and retrieval process. Encouraging users to care about meta-data is difficult and thus the success of any storage system will ultimately depend on how well used by end-users the system is. In this respect we suggest that even a minimal amount of metadata if stored in a sensible fashion is useful, if only at the level of individual research groups. We discuss here, a simple database system which we call ‘Bookshelf’, that uses python in conjunction with a mysql database to provide an extremely simple system for curating and keeping track of molecular simulation data. It provides a user-friendly, scriptable solution to the common problem amongst biomolecular simulation laboratories; the storage, logging and subsequent retrieval of large numbers of simulations. Download URL: http://sbcb.bioch.ox.ac.uk/bookshelf/ PMID:21169341
NASA Astrophysics Data System (ADS)
Averett, A.; DeJarnett, B. B.
2016-12-01
The University Of Texas Bureau Of Economic Geology (BEG) serves as the geological survey for Texas and operates three geological sample repositories that house well over 2 million boxes of geological samples (cores and cuttings) and an abundant amount of geoscience data (geophysical logs, thin sections, geochemical analyses, etc.). Material is accessible and searchable online, and it is publically available to the geological community for research and education. Patrons access information about our collection by using our online core and log database (SQL format). BEG is currently undertaking a large project to: 1) improve the internal accuracy of metadata associated with the collection; 2) enhance the capabilities of the database for both BEG curators and researchers as well as our external patrons; and 3) ensure easy and efficient navigation for patrons through our online portal. As BEG undertakes this project, BEG is in the early stages of planning to export the metadata for its collection into SESAR (System for Earth Sample Registration) and have IGSN's (International GeoSample Numbers) assigned to its samples. Education regarding the value of IGSN's and an external registry (SESAR) has been crucial to receiving management support for the project because the concept and potential benefits of registering samples in a registry outside of the institution were not well-known prior to this project. Potential benefits such as increases in discoverability, repository recognition in publications, and interoperability were presented. The project was well-received by management, and BEG fully supports the effort to register our physical samples with SESAR. Since BEG is only in the initial phase of this project, any stumbling blocks, workflow issues, successes/failures, etc. can only be predicted at this point, but by mid-December, BEG expects to have several concrete issues to present in the session. Currently, our most pressing issue involves establishing the most efficient workflow for exporting of large amounts of metadata in a format that SESAR can easily ingest, and how this can be best accomplished with very few BEG staff assigned to the project.
Allelic variation in Salmonella: an underappreciated driver of adaptation and virulence
Yue, Min; Schifferli, Dieter M.
2014-01-01
Salmonella enterica causes substantial morbidity and mortality in humans and animals. Infection and intestinal colonization by S. enterica require virulence factors that mediate bacterial binding and invasion of enterocytes and innate immune cells. Some S. enterica colonization factors and their alleles are host restricted, suggesting a potential role in regulation of host specificity. Recent data also suggest that colonization factors promote horizontal gene transfer of antimicrobial resistance genes by increasing the local density of Salmonella in colonized intestines. Although a profusion of genes are involved in Salmonella pathogenesis, the relative importance of their allelic variation has only been studied intensely in the type 1 fimbrial adhesin FimH. Although other Salmonella virulence factors demonstrate allelic variation, their association with specific metadata (e.g., host species, disease or carrier state, time and geographic place of isolation, antibiotic resistance profile, etc.) remains to be interrogated. To date, genome-wide association studies (GWAS) in bacteriology have been limited by the paucity of relevant metadata. In addition, due to the many variables amid metadata categories, a very large number of strains must be assessed to attain statistically significant results. However, targeted approaches in which genes of interest (e.g., virulence factors) are specifically sequenced alleviates the time-consuming and costly statistical GWAS analysis and increases statistical power, as larger numbers of strains can be screened for non-synonymous single nucleotide polymorphisms (SNPs) that are associated with available metadata. Congruence of specific allelic variants with specific metadata from strains that have a relevant clinical and epidemiological history will help to prioritize functional wet-lab and animal studies aimed at determining cause-effect relationships. Such an approach should be applicable to other pathogens that are being collected in well-curated repositories. PMID:24454310
A community dataspace for distribution and processing of "long tail" high resolution topography data
NASA Astrophysics Data System (ADS)
Crosby, C. J.; Nandigam, V.; Arrowsmith, R.
2016-12-01
Topography is a fundamental observable for Earth and environmental science and engineering. High resolution topography (HRT) is revolutionary for Earth science. Cyberinfrastructure that enables users to discover, manage, share, and process these data increases the impact of investments in data collection and catalyzes scientific discovery.National Science Foundation funded OpenTopography (OT, www.opentopography.org) employs cyberinfrastructure that includes large-scale data management, high-performance computing, and service-oriented architectures, providing researchers with efficient online access to large, HRT (mostly lidar) datasets, metadata, and processing tools. HRT data are collected from satellite, airborne, and terrestrial platforms at increasingly finer resolutions, greater accuracy, and shorter repeat times. There has been a steady increase in OT data holdings due to partnerships and collaborations with various organizations with the academic NSF domain and beyond.With the decreasing costs of HRT data collection, via methods such as Structure from Motion, the number of researchers collecting these data is increasing. Researchers collecting these "long- tail" topography data (of modest size but great value) face an impediment, especially with costs associated in making them widely discoverable, shared, annotated, cited, managed and archived. Also because there are no existing central repositories or services to support storage and curation of these datasets, much of it is isolated and difficult to locate and preserve. To overcome these barriers and provide efficient centralized access to these high impact datasets, OT is developing a "Community DataSpace", a service built on a low cost storage cloud, (e.g. AWS S3) to make it easy for researchers to upload, curate, annotate and distribute their datasets. The system's ingestion workflow will extract metadata from data uploaded; validate it; assign a digital object identifier (DOI); and create a searchable catalog entry, before publishing via the OT portal. The OT Community DataSpace will enable wider discovery and utilization of these datasets via the OT portal and sources that federate the OT data catalog (e.g. data.gov), promote citations and more importantly increase the impact of investments in these data.
GeneLab Phase 2: Integrated Search Data Federation of Space Biology Experimental Data
NASA Technical Reports Server (NTRS)
Tran, P. B.; Berrios, D. C.; Gurram, M. M.; Hashim, J. C. M.; Raghunandan, S.; Lin, S. Y.; Le, T. Q.; Heher, D. M.; Thai, H. T.; Welch, J. D.;
2016-01-01
The GeneLab project is a science initiative to maximize the scientific return of omics data collected from spaceflight and from ground simulations of microgravity and radiation experiments, supported by a data system for a public bioinformatics repository and collaborative analysis tools for these data. The mission of GeneLab is to maximize the utilization of the valuable biological research resources aboard the ISS by collecting genomic, transcriptomic, proteomic and metabolomic (so-called omics) data to enable the exploration of the molecular network responses of terrestrial biology to space environments using a systems biology approach. All GeneLab data are made available to a worldwide network of researchers through its open-access data system. GeneLab is currently being developed by NASA to support Open Science biomedical research in order to enable the human exploration of space and improve life on earth. Open access to Phase 1 of the GeneLab Data Systems (GLDS) was implemented in April 2015. Download volumes have grown steadily, mirroring the growth in curated space biology research data sets (61 as of June 2016), now exceeding 10 TB/month, with over 10,000 file downloads since the start of Phase 1. For the period April 2015 to May 2016, most frequently downloaded were data from studies of Mus musculus (39) followed closely by Arabidopsis thaliana (30), with the remaining downloads roughly equally split across 12 other organisms (each 10 of total downloads). GLDS Phase 2 is focusing on interoperability, supporting data federation, including integrated search capabilities, of GLDS-housed data sets with external data sources, such as gene expression data from NIHNCBIs Gene Expression Omnibus (GEO), proteomic data from EBIs PRIDE system, and metagenomic data from Argonne National Laboratory's MG-RAST. GEO and MG-RAST employ specifications for investigation metadata that are different from those used by the GLDS and PRIDE (e.g., ISA-Tab). The GLDS Phase 2 system will implement a Google-like, full-text search engine using a Service-Oriented Architecture by utilizing publicly available RESTful web services Application Programming Interfaces (e.g., GEO Entrez Programming Utilities) and a Common Metadata Model (CMM) in order to accommodate the different metadata formats between the heterogeneous bioinformatics databases. GLDS Phase 2 completion with fully implemented capabilities will be made available to the general public in September 2017.
DOE Office of Scientific and Technical Information (OSTI.GOV)
He, Fei; Maslov, Sergei; Yoo, Shinjae
Here, transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metadata or differences in annotation styles by different labs. In this study, we carefully selected and integrated 6,057 Arabidopsis microarray expression samples from 304 experiments deposited to NCBI GEO. Metadata such as tissue type, growth condition, and developmental stage were manually curated for each sample. We then studied global expression landscape of the integrated dataset andmore » found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome compared to aerial tissues, but the transcriptome of cultured root is more similar to those of aerial tissues as the former samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating re-use of plant transcriptome data. As a proof of principle we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified accuracy of our predictions with samples’ metadata provided by authors.« less
He, Fei; Maslov, Sergei; Yoo, Shinjae; ...
2016-05-25
Here, transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metadata or differences in annotation styles by different labs. In this study, we carefully selected and integrated 6,057 Arabidopsis microarray expression samples from 304 experiments deposited to NCBI GEO. Metadata such as tissue type, growth condition, and developmental stage were manually curated for each sample. We then studied global expression landscape of the integrated dataset andmore » found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome compared to aerial tissues, but the transcriptome of cultured root is more similar to those of aerial tissues as the former samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating re-use of plant transcriptome data. As a proof of principle we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified accuracy of our predictions with samples’ metadata provided by authors.« less
Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy
Alexopoulou, Dimitra; Andreopoulos, Bill; Dietze, Heiko; Doms, Andreas; Gandon, Fabien; Hakenberg, Jörg; Khelif, Khaled; Schroeder, Michael; Wächter, Thomas
2009-01-01
Background Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation are metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively. Results The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves on average 80% success rate. Conclusion Metadata is valuable for disambiguation, but requires high quality training data. Closest Sense requires no training, but a large, consistently modelled ontology, which are two opposing conditions. Term Cooc achieves greater 90% success given a consistently modelled ontology. Overall, the results show that well structured ontologies can play a very important role to improve disambiguation. Availability The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1. PMID:19159460
YummyData: providing high-quality open life science data
Yamaguchi, Atsuko; Splendiani, Andrea
2018-01-01
Abstract Many life science datasets are now available via Linked Data technologies, meaning that they are represented in a common format (the Resource Description Framework), and are accessible via standard APIs (SPARQL endpoints). While this is an important step toward developing an interoperable bioinformatics data landscape, it also creates a new set of obstacles, as it is often difficult for researchers to find the datasets they need. Different providers frequently offer the same datasets, with different levels of support: as well as having more or less up-to-date data, some providers add metadata to describe the content, structures, and ontologies of the stored datasets while others do not. We currently lack a place where researchers can go to easily assess datasets from different providers in terms of metrics such as service stability or metadata richness. We also lack a space for collecting feedback and improving data providers’ awareness of user needs. To address this issue, we have developed YummyData, which consists of two components. One periodically polls a curated list of SPARQL endpoints, monitoring the states of their Linked Data implementations and content. The other presents the information measured for the endpoints and provides a forum for discussion and feedback. YummyData is designed to improve the findability and reusability of life science datasets provided as Linked Data and to foster its adoption. It is freely accessible at http://yummydata.org/. Database URL: http://yummydata.org/ PMID:29688370
Publishing datasets with eSciDoc and panMetaDocs
NASA Astrophysics Data System (ADS)
Ulbricht, D.; Klump, J.; Bertelmann, R.
2012-04-01
Currently serveral research institutions worldwide undertake considerable efforts to have their scientific datasets published and to syndicate them to data portals as extensively described objects identified by a persistent identifier. This is done to foster the reuse of data, to make scientific work more transparent, and to create a citable entity that can be referenced unambigously in written publications. GFZ Potsdam established a publishing workflow for file based research datasets. Key software components are an eSciDoc infrastructure [1] and multiple instances of the data curation tool panMetaDocs [2]. The eSciDoc repository holds data objects and their associated metadata in container objects, called eSciDoc items. A key metadata element in this context is the publication status of the referenced data set. PanMetaDocs, which is based on PanMetaWorks [3], is a PHP based web application that allows to describe data with any XML-based metadata schema. The metadata fields can be filled with static or dynamic content to reduce the number of fields that require manual entries to a minimum and make use of contextual information in a project setting. Access rights can be applied to set visibility of datasets to other project members and allow collaboration on and notifying about datasets (RSS) and interaction with the internal messaging system, that was inherited from panMetaWorks. When a dataset is to be published, panMetaDocs allows to change the publication status of the eSciDoc item from status "private" to "submitted" and prepare the dataset for verification by an external reviewer. After quality checks, the item publication status can be changed to "published". This makes the data and metadata available through the internet worldwide. PanMetaDocs is developed as an eSciDoc application. It is an easy to use graphical user interface to eSciDoc items, their data and metadata. It is also an application supporting a DOI publication agent during the process of publishing scientific datasets as electronic data supplements to research papers. Publication of research manuscripts has an already well established workflow that shares junctures with other processes and involves several parties in the process of dataset publication. Activities of the author, the reviewer, the print publisher and the data publisher have to be coordinated into a common data publication workflow. The case of data publication at GFZ Potsdam displays some specifics, e.g. the DOIDB webservice. The DOIDB is a proxy service at GFZ for the DataCite [4] DOI registration and its metadata store. DOIDB provides a local summary of the dataset DOIs registered through GFZ as a publication agent. An additional use case for the DOIDB is its function to enrich the datacite metadata with additional custom attributes, like a geographic reference in a DIF record. These attributes are at the moment not available in the datacite metadata schema but would be valuable elements for the compilation of data catalogues in the earth sciences and for dissemination of catalogue data via OAI-PMH. [1] http://www.escidoc.org , eSciDoc, FIZ Karlruhe, Germany [2] http://panmetadocs.sf.net , panMetaDocs, GFZ Potsdam, Germany [3] http://metaworks.pangaea.de , panMetaWorks, Dr. R. Huber, MARUM, Univ. Bremen, Germany [4] http://www.datacite.org
Sharma, Deepak K; Solbrig, Harold R; Tao, Cui; Weng, Chunhua; Chute, Christopher G; Jiang, Guoqian
2017-06-05
Detailed Clinical Models (DCMs) have been regarded as the basis for retaining computable meaning when data are exchanged between heterogeneous computer systems. To better support clinical cancer data capturing and reporting, there is an emerging need to develop informatics solutions for standards-based clinical models in cancer study domains. The objective of the study is to develop and evaluate a cancer genome study metadata management system that serves as a key infrastructure in supporting clinical information modeling in cancer genome study domains. We leveraged a Semantic Web-based metadata repository enhanced with both ISO11179 metadata standard and Clinical Information Modeling Initiative (CIMI) Reference Model. We used the common data elements (CDEs) defined in The Cancer Genome Atlas (TCGA) data dictionary, and extracted the metadata of the CDEs using the NCI Cancer Data Standards Repository (caDSR) CDE dataset rendered in the Resource Description Framework (RDF). The ITEM/ITEM_GROUP pattern defined in the latest CIMI Reference Model is used to represent reusable model elements (mini-Archetypes). We produced a metadata repository with 38 clinical cancer genome study domains, comprising a rich collection of mini-Archetype pattern instances. We performed a case study of the domain "clinical pharmaceutical" in the TCGA data dictionary and demonstrated enriched data elements in the metadata repository are very useful in support of building detailed clinical models. Our informatics approach leveraging Semantic Web technologies provides an effective way to build a CIMI-compliant metadata repository that would facilitate the detailed clinical modeling to support use cases beyond TCGA in clinical cancer study domains.
DOE Office of Scientific and Technical Information (OSTI.GOV)
2006-10-25
The purpose of the eXtended MetaData Registry (XMDR) prototype is to demonstrate the feasibility and utility of constructing an extended metadata registry, i.e., one which encompasses richer classification support, facilities for including terminologies, and better support for formal specification of semantics. The prototype registry will also serve as a reference implementation for the revised versions of ISO 11179, Parts 2 and 3 to help guide production implementations.
The CUAHSI Water Data Center: Empowering scientists to discover, use, store, and share water data
NASA Astrophysics Data System (ADS)
Couch, A. L.; Hooper, R. P.; Arrigo, J. S.
2012-12-01
The proposed CUAHSI Water Data Center (WDC) will provide production-quality water data resources based upon the successful large-scale data services prototype developed by the CUAHSI Hydrologic Information System (HIS) project. The WDC, using the HIS technology, concentrates on providing time series data collected at fixed points or on moving platforms from sensors primarily (but not exclusively) in the medium of water. The WDC's missions include providing simple and effective data discovery tools useful to researchers in a variety of water-related disciplines, and providing simple and cost-effective data publication mechanisms for projects that do not desire to run their own data servers. The WDC's activities will include: 1. Rigorous curation of the water data catalog already assembled during the CUAHSI HIS project, to ensure accuracy of records and existence of declared sources. 2. Data backup and failover services for "at risk" data sources. 3. Creation and support for ubiquitously accessible data discovery and access, web-based search and smartphone applications. 4. Partnerships with researchers to extend the state of the art in water data use. 5. Partnerships with industry to create plug-and-play data publishing from sensors, and to create domain-specific tools. The WDC will serve as a knowledge resource for researchers of water-related issues, and will interface with other data centers to make their data more accessible to water researchers. The WDC will serve as a vehicle for addressing some of the grand challenges of accessing and using water data, including: a. Cross-domain data discovery: different scientific domains refer to the same kind of water data using different terminologies, making discovery of data difficult for researchers outside the data provider's domain. b. Cross-validation of data sources: much water data comes from sources lacking rigorous quality control procedures; such sources can be compared against others with rigorous quality control. The WDC enables this by making both kinds of sources available in the same search interface. c. Data provenance: the appropriateness of data for use in a specific model or analysis often depends upon the exact details of how data was gathered and processed. The WDC will aid this by curating standards for metadata that are as descriptive as practical of the collection procedures. "Plug and play" sensor interfaces will fill in metadata appropriate to each sensor without human intervention. d. Contextual search: discovering data based upon geological (e.g. aquifer) or geographic (e.g., location in a stream network) features external to metadata. e. Data-driven search: discovering data that exhibit quality factors that are not described by the metadata. The WDC will partner with researchers desiring contextual and data driven search, and make results available to all. Many major data providers (e.g. federal agencies) are not mandated to provide access to data other than those they collect. The HIS project assembled data from over 90 different sources, thus demonstrating the promise of this approach. Meeting the grand challenges listed above will greatly enhance scientists' ability to discover, interpret, access, and analyze water data from across domains and sources to test Earth system hypotheses.
Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations.
Martínez-Romero, Marcos; O'Connor, Martin J; Shankar, Ravi D; Panahiazar, Maryam; Willrett, Debra; Egyedi, Attila L; Gevaert, Olivier; Graybeal, John; Musen, Mark A
2017-01-01
In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository.
Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations
Martínez-Romero, Marcos; O’Connor, Martin J.; Shankar, Ravi D.; Panahiazar, Maryam; Willrett, Debra; Egyedi, Attila L.; Gevaert, Olivier; Graybeal, John; Musen, Mark A.
2017-01-01
In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository. PMID:29854196
NASA Astrophysics Data System (ADS)
Berukoff, Steven; Reardon, Kevin; Hays, Tony; Spiess, DJ; Watson, Fraser
2015-08-01
When construction is complete in 2019, the Daniel K. Inouye Solar Telescope will be the most-capable large aperture, high-resolution, multi-instrument solar physics facility in the world. The telescope is designed as a four-meter off-axis Gregorian, with a rotating Coude laboratory designed to simultaneously house and support five first-light imaging and spectropolarimetric instruments. At current design, the facility and its instruments will generate data volumes of 5 PB, produce 108 images, and 107-109 metadata elements annually. This data will not only forge new understanding of solar phenomena at high resolution, but enhance participation in solar physics and further grow a small but vibrant international community.The DKIST Data Center is being designed to store, curate, and process this flood of information, while augmenting its value by providing association of science data and metadata to its acquisition and processing provenance. In early Operations, the Data Center will produce, by autonomous, semi-automatic, and manual means, quality-controlled and -assured calibrated data sets, closely linked to facility and instrument performance during the Operations lifecycle. These data sets will be made available to the community openly and freely, and software and algorithms made available through community repositories like Github for further collaboration and improvement.We discuss the current design and approach of the DKIST Data Center, describing the development cycle, early technology analysis and prototyping, and the roadmap ahead. In this budget-conscious era, a key design criterion is elasticity, the ability of the built system to adapt to changing work volumes, types, and the shifting scientific landscape, without undue cost or operational impact. We discuss our deep iterative development approach, the underappreciated challenges of calibrating ground-based solar data, the crucial integration of the Data Center within the larger Operations lifecycle, and how software and hardware support, intelligently deployed, will enable high-caliber solar physics research and community growth for the DKIST's 40-year lifespan.
Assessing Metadata Quality of a Federally Sponsored Health Data Repository.
Marc, David T; Beattie, James; Herasevich, Vitaly; Gatewood, Laël; Zhang, Rui
2016-01-01
The U.S. Federal Government developed HealthData.gov to disseminate healthcare datasets to the public. Metadata is provided for each datasets and is the sole source of information to find and retrieve data. This study employed automated quality assessments of the HealthData.gov metadata published from 2012 to 2014 to measure completeness, accuracy, and consistency of applying standards. The results demonstrated that metadata published in earlier years had lower completeness, accuracy, and consistency. Also, metadata that underwent modifications following their original creation were of higher quality. HealthData.gov did not uniformly apply Dublin Core Metadata Initiative to the metadata, which is a widely accepted metadata standard. These findings suggested that the HealthData.gov metadata suffered from quality issues, particularly related to information that wasn't frequently updated. The results supported the need for policies to standardize metadata and contributed to the development of automated measures of metadata quality.
Assessing Metadata Quality of a Federally Sponsored Health Data Repository
Marc, David T.; Beattie, James; Herasevich, Vitaly; Gatewood, Laël; Zhang, Rui
2016-01-01
The U.S. Federal Government developed HealthData.gov to disseminate healthcare datasets to the public. Metadata is provided for each datasets and is the sole source of information to find and retrieve data. This study employed automated quality assessments of the HealthData.gov metadata published from 2012 to 2014 to measure completeness, accuracy, and consistency of applying standards. The results demonstrated that metadata published in earlier years had lower completeness, accuracy, and consistency. Also, metadata that underwent modifications following their original creation were of higher quality. HealthData.gov did not uniformly apply Dublin Core Metadata Initiative to the metadata, which is a widely accepted metadata standard. These findings suggested that the HealthData.gov metadata suffered from quality issues, particularly related to information that wasn’t frequently updated. The results supported the need for policies to standardize metadata and contributed to the development of automated measures of metadata quality. PMID:28269883
Metadata to Support Data Warehouse Evolution
NASA Astrophysics Data System (ADS)
Solodovnikova, Darja
The focus of this chapter is metadata necessary to support data warehouse evolution. We present the data warehouse framework that is able to track evolution process and adapt data warehouse schemata and data extraction, transformation, and loading (ETL) processes. We discuss the significant part of the framework, the metadata repository that stores information about the data warehouse, logical and physical schemata and their versions. We propose the physical implementation of multiversion data warehouse in a relational DBMS. For each modification of a data warehouse schema, we outline the changes that need to be made to the repository metadata and in the database.
ISO 19115 Experiences in NASA's Earth Observing System (EOS) ClearingHOuse (ECHO)
NASA Astrophysics Data System (ADS)
Cechini, M. F.; Mitchell, A.
2011-12-01
Metadata is an important entity in the process of cataloging, discovering, and describing earth science data. As science research and the gathered data increases in complexity, so does the complexity and importance of descriptive metadata. To meet these growing needs, the metadata models required utilize richer and more mature metadata attributes. Categorizing, standardizing, and promulgating these metadata models to a politically, geographically, and scientifically diverse community is a difficult process. An integral component of metadata management within NASA's Earth Observing System Data and Information System (EOSDIS) is the Earth Observing System (EOS) ClearingHOuse (ECHO). ECHO is the core metadata repository for the EOSDIS data centers providing a centralized mechanism for metadata and data discovery and retrieval. ECHO has undertaken an internal restructuring to meet the changing needs of scientists, the consistent advancement in technology, and the advent of new standards such as ISO 19115. These improvements were based on the following tenets for data discovery and retrieval: + There exists a set of 'core' metadata fields recommended for data discovery. + There exists a set of users who will require the entire metadata record for advanced analysis. + There exists a set of users who will require a 'core' set metadata fields for discovery only. + There will never be a cessation of new formats or a total retirement of all old formats. + Users should be presented metadata in a consistent format of their choosing. In order to address the previously listed items, ECHO's new metadata processing paradigm utilizes the following approach: + Identify a cross-format set of 'core' metadata fields necessary for discovery. + Implement format-specific indexers to extract the 'core' metadata fields into an optimized query capability. + Archive the original metadata in its entirety for presentation to users requiring the full record. + Provide on-demand translation of 'core' metadata to any supported result format. Lessons learned by the ECHO team while implementing its new metadata approach to support usage of the ISO 19115 standard will be presented. These lessons learned highlight some discovered strengths and weaknesses in the ISO 19115 standard as it is introduced to an existing metadata processing system.
2011-05-01
iTunes illustrate the difference between the centralized approach of digital library systems and the distributed approach of container file formats...metadata in a container file format. Apple’s iTunes uses a centralized metadata approach and allows users to maintain song metadata in a single...one iTunes library to another the metadata must be copied separately or reentered in the new library. This demonstrates the utility of storing metadata
NASA Astrophysics Data System (ADS)
Santhana Vannan, S. K.; Ramachandran, R.; Deb, D.; Beaty, T.; Wright, D.
2017-12-01
This paper summarizes the workflow challenges of curating and publishing data produced from disparate data sources and provides a generalized workflow solution to efficiently archive data generated by researchers. The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) for biogeochemical dynamics and the Global Hydrology Resource Center (GHRC) DAAC have been collaborating on the development of a generalized workflow solution to efficiently manage the data publication process. The generalized workflow presented here are built on lessons learned from implementations of the workflow system. Data publication consists of the following steps: Accepting the data package from the data providers, ensuring the full integrity of the data files. Identifying and addressing data quality issues Assembling standardized, detailed metadata and documentation, including file level details, processing methodology, and characteristics of data files Setting up data access mechanisms Setup of the data in data tools and services for improved data dissemination and user experience Registering the dataset in online search and discovery catalogues Preserving the data location through Digital Object Identifiers (DOI) We will describe the steps taken to automate, and realize efficiencies to the above process. The goals of the workflow system are to reduce the time taken to publish a dataset, to increase the quality of documentation and metadata, and to track individual datasets through the data curation process. Utilities developed to achieve these goal will be described. We will also share metrics driven value of the workflow system and discuss the future steps towards creation of a common software framework.
Vlachos, Ioannis S; Paraskevopoulou, Maria D; Karagkouni, Dimitra; Georgakilas, Georgios; Vergoulis, Thanasis; Kanellos, Ilias; Anastasopoulos, Ioannis-Laertis; Maniou, Sofia; Karathanou, Konstantina; Kalfakakou, Despina; Fevgas, Athanasios; Dalamagas, Theodore; Hatzigeorgiou, Artemis G
2015-01-01
microRNAs (miRNAs) are short non-coding RNA species, which act as potent gene expression regulators. Accurate identification of miRNA targets is crucial to understanding their function. Currently, hundreds of thousands of miRNA:gene interactions have been experimentally identified. However, this wealth of information is fragmented and hidden in thousands of manuscripts and raw next-generation sequencing data sets. DIANA-TarBase was initially released in 2006 and it was the first database aiming to catalog published experimentally validated miRNA:gene interactions. DIANA-TarBase v7.0 (http://www.microrna.gr/tarbase) aims to provide for the first time hundreds of thousands of high-quality manually curated experimentally validated miRNA:gene interactions, enhanced with detailed meta-data. DIANA-TarBase v7.0 enables users to easily identify positive or negative experimental results, the utilized experimental methodology, experimental conditions including cell/tissue type and treatment. The new interface provides also advanced information ranging from the binding site location, as identified experimentally as well as in silico, to the primer sequences used for cloning experiments. More than half a million miRNA:gene interactions have been curated from published experiments on 356 different cell types from 24 species, corresponding to 9- to 250-fold more entries than any other relevant database. DIANA-TarBase v7.0 is freely available. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
NASA Astrophysics Data System (ADS)
Busby, R. W.; Woodward, R.; Aderhold, K.; Frassetto, A.
2017-12-01
The Alaska Transportable Array deployment is completely installed, totaling 280 stations, with 194 new stations and 86 existing stations, 28 of those upgraded with new sensor emplacement. We briefly summarize the deployment of this seismic network, describe the added meteorological instruments and soil temperature gauges, and review our expectations for operation and demobilization. Curation of data from the contiguous Lower-48 States deployment of Transportable Array (>1800 stations, 2004-2015) has continued with the few gaps in real-time data replaced by locally archived files as well as minor adjustments in metadata. We highlight station digests that provide more detail on the components and settings of individual stations, documentation of standard procedures used throughout the deployment and other resources available online. In cooperation with IRIS DMC, a copy of the complete TA archive for the Lower-48 period has been transferred to a local disk to experiment with data access and software workflows that utilize most or all of the seismic timeseries, in contrast to event segments. Assembling such large datasets reliably - from field stations to a well managed data archive to a user's workspace - is complex. Sharing a curated and defined data volume with researchers is a potentially straightforward way to make data intensive analyses less difficult. We note that data collection within the Lower-48 continues with 160 stations of the N4 network operating at increased sample rates (100 sps) as part of the CEUSN, as operational support transitions from NSF to USGS.
Canto: an online tool for community literature curation.
Rutherford, Kim M; Harris, Midori A; Lock, Antonia; Oliver, Stephen G; Wood, Valerie
2014-06-15
Detailed curation of published molecular data is essential for any model organism database. Community curation enables researchers to contribute data from their papers directly to databases, supplementing the activity of professional curators and improving coverage of a growing body of literature. We have developed Canto, a web-based tool that provides an intuitive curation interface for both curators and researchers, to support community curation in the fission yeast database, PomBase. Canto supports curation using OBO ontologies, and can be easily configured for use with any species. Canto code and documentation are available under an Open Source license from http://curation.pombase.org/. Canto is a component of the Generic Model Organism Database (GMOD) project (http://www.gmod.org/). © The Author 2014. Published by Oxford University Press.
Building Format-Agnostic Metadata Repositories
NASA Astrophysics Data System (ADS)
Cechini, M.; Pilone, D.
2010-12-01
This presentation will discuss the problems that surround persisting and discovering metadata in multiple formats; a set of tenets that must be addressed in a solution; and NASA’s Earth Observing System (EOS) ClearingHOuse’s (ECHO) proposed approach. In order to facilitate cross-discipline data analysis, Earth Scientists will potentially interact with more than one data source. The most common data discovery paradigm relies on services and/or applications facilitating the discovery and presentation of metadata. What may not be common are the formats in which the metadata are formatted. As the number of sources and datasets utilized for research increases, it becomes more likely that a researcher will encounter conflicting metadata formats. Metadata repositories, such as the EOS ClearingHOuse (ECHO), along with data centers, must identify ways to address this issue. In order to define the solution to this problem, the following tenets are identified: - There exists a set of ‘core’ metadata fields recommended for data discovery. - There exists a set of users who will require the entire metadata record for advanced analysis. - There exists a set of users who will require a ‘core’ set of metadata fields for discovery only. - There will never be a cessation of new formats or a total retirement of all old formats. - Users should be presented metadata in a consistent format. ECHO has undertaken an effort to transform its metadata ingest and discovery services in order to support the growing set of metadata formats. In order to address the previously listed items, ECHO’s new metadata processing paradigm utilizes the following approach: - Identify a cross-format set of ‘core’ metadata fields necessary for discovery. - Implement format-specific indexers to extract the ‘core’ metadata fields into an optimized query capability. - Archive the original metadata in its entirety for presentation to users requiring the full record. - Provide on-demand translation of ‘core’ metadata to any supported result format. With this identified approach, the Earth Scientist is provided with a consistent data representation as they interact with a variety of datasets that utilize multiple metadata formats. They are then able to focus their efforts on the more critical research activities which they are undertaking.
NASA Astrophysics Data System (ADS)
Wood, Chris
2016-04-01
Under the Marine Strategy Framework Directive (MSFD), EU Member States are mandated to achieve or maintain 'Good Environmental Status' (GES) in their marine areas by 2020, through a series of Programme of Measures (PoMs). The Celtic Seas Partnership (CSP), an EU LIFE+ project, aims to support policy makers, special-interest groups, users of the marine environment, and other interested stakeholders on MSFD implementation in the Celtic Seas geographical area. As part of this support, a metadata portal has been built to provide a signposting service to datasets that are relevant to MSFD within the Celtic Seas. To ensure that the metadata has the widest possible reach, a linked data approach was employed to construct the database. Although the metadata are stored in a traditional RDBS, the metadata are exposed as linked data via the D2RQ platform, allowing virtual RDF graphs to be generated. SPARQL queries can be executed against the end-point allowing any user to manipulate the metadata. D2RQ's mapping language, based on turtle, was used to map a wide range of relevant ontologies to the metadata (e.g. The Provenance Ontology (prov-o), Ocean Data Ontology (odo), Dublin Core Elements and Terms (dc & dcterms), Friend of a Friend (foaf), and Geospatial ontologies (geo)) allowing users to browse the metadata, either via SPARQL queries or by using D2RQ's HTML interface. The metadata were further enhanced by mapping relevant parameters to the NERC Vocabulary Server, itself built on a SPARQL endpoint. Additionally, a custom web front-end was built to enable users to browse the metadata and express queries through an intuitive graphical user interface that requires no prior knowledge of SPARQL. As well as providing means to browse the data via MSFD-related parameters (Descriptor, Criteria, and Indicator), the metadata records include the dataset's country of origin, the list of organisations involved in the management of the data, and links to any relevant INSPIRE-compliant services relating to the dataset. The web front-end therefore enables users to effectively filter, sort, or search the metadata. As the MSFD timeline requires Member States to review their progress on achieving or maintaining GES every six years, the timely development of this metadata portal will not only aid interested stakeholders in understanding how member states are meeting their targets, but also shows how linked data can be used effectively to support policy makers and associated legislative bodies.
ArrayWiki: an enabling technology for sharing public microarray data repositories and meta-analyses
Stokes, Todd H; Torrance, JT; Li, Henry; Wang, May D
2008-01-01
Background A survey of microarray databases reveals that most of the repository contents and data models are heterogeneous (i.e., data obtained from different chip manufacturers), and that the repositories provide only basic biological keywords linking to PubMed. As a result, it is difficult to find datasets using research context or analysis parameters information beyond a few keywords. For example, to reduce the "curse-of-dimension" problem in microarray analysis, the number of samples is often increased by merging array data from different datasets. Knowing chip data parameters such as pre-processing steps (e.g., normalization, artefact removal, etc), and knowing any previous biological validation of the dataset is essential due to the heterogeneity of the data. However, most of the microarray repositories do not have meta-data information in the first place, and do not have a a mechanism to add or insert this information. Thus, there is a critical need to create "intelligent" microarray repositories that (1) enable update of meta-data with the raw array data, and (2) provide standardized archiving protocols to minimize bias from the raw data sources. Results To address the problems discussed, we have developed a community maintained system called ArrayWiki that unites disparate meta-data of microarray meta-experiments from multiple primary sources with four key features. First, ArrayWiki provides a user-friendly knowledge management interface in addition to a programmable interface using standards developed by Wikipedia. Second, ArrayWiki includes automated quality control processes (caCORRECT) and novel visualization methods (BioPNG, Gel Plots), which provide extra information about data quality unavailable in other microarray repositories. Third, it provides a user-curation capability through the familiar Wiki interface. Fourth, ArrayWiki provides users with simple text-based searches across all experiment meta-data, and exposes data to search engine crawlers (Semantic Agents) such as Google to further enhance data discovery. Conclusions Microarray data and meta information in ArrayWiki are distributed and visualized using a novel and compact data storage format, BioPNG. Also, they are open to the research community for curation, modification, and contribution. By making a small investment of time to learn the syntax and structure common to all sites running MediaWiki software, domain scientists and practioners can all contribute to make better use of microarray technologies in research and medical practices. ArrayWiki is available at . PMID:18541053
A curated database of cyanobacterial strains relevant for modern taxonomy and phylogenetic studies.
Ramos, Vitor; Morais, João; Vasconcelos, Vitor M
2017-04-25
The dataset herein described lays the groundwork for an online database of relevant cyanobacterial strains, named CyanoType (http://lege.ciimar.up.pt/cyanotype). It is a database that includes categorized cyanobacterial strains useful for taxonomic, phylogenetic or genomic purposes, with associated information obtained by means of a literature-based curation. The dataset lists 371 strains and represents the first version of the database (CyanoType v.1). Information for each strain includes strain synonymy and/or co-identity, strain categorization, habitat, accession numbers for molecular data, taxonomy and nomenclature notes according to three different classification schemes, hierarchical automatic classification, phylogenetic placement according to a selection of relevant studies (including this), and important bibliographic references. The database will be updated periodically, namely by adding new strains meeting the criteria for inclusion and by revising and adding up-to-date metadata for strains already listed. A global 16S rDNA-based phylogeny is provided in order to assist users when choosing the appropriate strains for their studies.
A curated database of cyanobacterial strains relevant for modern taxonomy and phylogenetic studies
Ramos, Vitor; Morais, João; Vasconcelos, Vitor M.
2017-01-01
The dataset herein described lays the groundwork for an online database of relevant cyanobacterial strains, named CyanoType (http://lege.ciimar.up.pt/cyanotype). It is a database that includes categorized cyanobacterial strains useful for taxonomic, phylogenetic or genomic purposes, with associated information obtained by means of a literature-based curation. The dataset lists 371 strains and represents the first version of the database (CyanoType v.1). Information for each strain includes strain synonymy and/or co-identity, strain categorization, habitat, accession numbers for molecular data, taxonomy and nomenclature notes according to three different classification schemes, hierarchical automatic classification, phylogenetic placement according to a selection of relevant studies (including this), and important bibliographic references. The database will be updated periodically, namely by adding new strains meeting the criteria for inclusion and by revising and adding up-to-date metadata for strains already listed. A global 16S rDNA-based phylogeny is provided in order to assist users when choosing the appropriate strains for their studies. PMID:28440791
NASA Astrophysics Data System (ADS)
Radhakrishnan, A.; Balaji, V.; Schweitzer, R.; Nikonov, S.; O'Brien, K.; Vahlenkamp, H.; Burger, E. F.
2016-12-01
There are distinct phases in the development cycle of an Earth system model. During the model development phase, scientists make changes to code and parameters and require rapid access to results for evaluation. During the production phase, scientists may make an ensemble of runs with different settings, and produce large quantities of output, that must be further analyzed and quality controlled for scientific papers and submission to international projects such as the Climate Model Intercomparison Project (CMIP). During this phase, provenance is a key concern:being able to track back from outputs to inputs. We will discuss one of the paths taken at GFDL in delivering tools across this lifecycle, offering on-demand analysis of data by integrating the use of GFDL's in-house FRE-Curator, Unidata's THREDDS and NOAA PMEL's Live Access Servers (LAS).Experience over this lifecycle suggests that a major difficulty in developing analysis capabilities is only partially the scientific content, but often devoted to answering the questions "where is the data?" and "how do I get to it?". "FRE-Curator" is the name of a database-centric paradigm used at NOAA GFDL to ingest information about the model runs into an RDBMS (Curator database). The components of FRE-Curator are integrated into Flexible Runtime Environment workflow and can be invoked during climate model simulation. The front end to FRE-Curator, known as the Model Development Database Interface (MDBI) provides an in-house web-based access to GFDL experiments: metadata, analysis output and more. In order to provide on-demand visualization, MDBI uses Live Access Servers which is a highly configurable web server designed to provide flexible access to geo-referenced scientific data, that makes use of OPeNDAP. Model output saved in GFDL's tape archive, the size of the database and experiments, continuous model development initiatives with more dynamic configurations add complexity and challenges in providing an on-demand visualization experience to our GFDL users.
Scalable Data Mining and Archiving for the Square Kilometre Array
NASA Astrophysics Data System (ADS)
Jones, D. L.; Mattmann, C. A.; Hart, A. F.; Lazio, J.; Bennett, T.; Wagstaff, K. L.; Thompson, D. R.; Preston, R.
2011-12-01
As the technologies for remote observation improve, the rapid increase in the frequency and fidelity of those observations translates into an avalanche of data that is already beginning to eclipse the resources, both human and technical, of the institutions and facilities charged with managing the information. Common data management tasks like cataloging both data itself and contextual meta-data, creating and maintaining scalable permanent archive, and making data available on-demand for research present significant software engineering challenges when considered at the scales of modern multi-national scientific enterprises such as the upcoming Square Kilometre Array project. The NASA Jet Propulsion Laboratory (JPL), leveraging internal research and technology development funding, has begun to explore ways to address the data archiving and distribution challenges with a number of parallel activities involving collaborations with the EVLA and ALMA teams at the National Radio Astronomy Observatory (NRAO), and members of the Square Kilometre Array South Africa team. To date, we have leveraged the Apache OODT Process Control System framework and its catalog and archive service components that provide file management, workflow management, resource management as core web services. A client crawler framework ingests upstream data (e.g., EVLA raw directory output), identifies its MIME type and automatically extracts relevant metadata including temporal bounds, and job-relevant/processing information. A remote content acquisition (pushpull) service is responsible for staging remote content and handing it off to the crawler framework. A science algorithm wrapper (called CAS-PGE) wraps underlying code including CASApy programs for the EVLA, such as Continuum Imaging and Spectral Line Cube generation, executes the algorithm, and ingests its output (along with relevant extracted metadata). In addition to processing, the Process Control System has been leveraged to provide data curation and automatic ingestion for the MeerKAT/KAT-7 precursor instrument in South Africa, helping to catalog and archive correlator and sensor output from KAT-7, and to make the information available for downstream science analysis. These efforts, supported by the increasing availability of high-quality open source software, represent a concerted effort to seek a cost-conscious methodology for maintaining the integrity of observational data from the upstream instrument to the archive, and at the same time ensuring that the data, with its richly annotated catalog of meta-data, remains a viable resource for research into the future.
NASA Astrophysics Data System (ADS)
Gordon, S.; Dattore, E.; Williams, S.
2014-12-01
Even when a data center makes it's datasets accessible, they can still be hard to discover if the user is unaware of the laboratory or organization the data center supports. NCAR's Earth Observing Laboratory (EOL) is no exception. In response to this problem and as an inquiry into the feasibility of inter-connecting all of NCAR's repositories at a discovery layer, ESRI's Geoportal was researched. It was determined that an implementation of Geoportal would be a good choice to build a proof of concept model of inter-repository discovery around. This collaborative project between the University of Illinois and NCAR is coordinated through the Data Curation Education in Research Centers program. This program is funded by the Institute of Museum and Library Services.
Geoportal is open source software. It serves as an aggregation point for metadata catalogs of earth science datasets, with a focus on geospatial information. EOL's metadata is in static THREDDS catalogs. Geoportal can only create records from a THREDDS Data Server. The first step was to make EOL metadata more accessible by utilizing the ISO 19115-2 standard. It was also decided to create DIF records so EOL datasets could be ingested in NASA's Global Change Master Directory (GCMD).
To offer records for harvest, it was decided to develop an OAI-PMH server. To make a compliant server, the OAI_DC standard was also implemented. A server was written in Perl to serve a set of static records. We created a sample set of records in ISO 19115-2, FGDC, DIF, and OAI_DC. We utilized GCMD shared vocabularies to enhance discoverability and precision. The proof of concept was tested and verified by having another NCAR laboratory's Geoportal harvest our sample set.
To prepare for production, templates for each standard were developed and mapped to the database. These templates will help the automated creation of records. Once the OAI-PMH server is re-written in a Grails framework a dynamic representation of EOL's metadata will be available for harvest.
EOL will need to develop an implementation of a Geoportal and point GCMD to the OAI-PMH server. We will also seek out partnerships with other earth science and related discipline repositories that can communicate by OAI-PMH or Geoportal so that the scientific community will benefit from more discoverable data.
Deploying Object Oriented Data Technology to the Planetary Data System
NASA Technical Reports Server (NTRS)
Kelly, S.; Crichton, D.; Hughes, J. S.
2003-01-01
How do you provide more than 350 scientists and researchers access to data from every instrument in Odyssey when the data is curated across half a dozen institutions and in different formats and is too big to mail on a CD-ROM anymore? The Planetary Data System (PDS) faced this exact question. The solution was to use a metadata-based middleware framework developed by the Object Oriented Data Technology task at NASA s Jet Propulsion Laboratory. Using OODT, PDS provided - for the first time ever - data from all mission instruments through a single system immediately upon data delivery.
Challenges and opportunities of open data in ecology.
Reichman, O J; Jones, Matthew B; Schildhauer, Mark P
2011-02-11
Ecology is a synthetic discipline benefiting from open access to data from the earth, life, and social sciences. Technological challenges exist, however, due to the dispersed and heterogeneous nature of these data. Standardization of methods and development of robust metadata can increase data access but are not sufficient. Reproducibility of analyses is also important, and executable workflows are addressing this issue by capturing data provenance. Sociological challenges, including inadequate rewards for sharing data, must also be resolved. The establishment of well-curated, federated data repositories will provide a means to preserve data while promoting attribution and acknowledgement of its use.
Community-managed Data Sharing, Curation, and Publication: SEN on SEAD
NASA Astrophysics Data System (ADS)
Martin, R. L.; Myers, J.; Hsu, L.
2017-12-01
While data publication in support of reuse and scientific reproducibility is increasingly being recognized as a key aspect of modern research practice, best practices are still to be developed at the level of scientific communities. Often, such practices are discussed in the abstract - as community standards for data plans or as requirements for yet-to-be-built software - with no clear path to community adoption. In contrast, the Sediment Experimentalist Network, supported through the National Science Foundation's (NSF) EarthCube initiative, has encouraged an iterative, practice-based approach within its community that has resulted in the publication of dozens of datasets, comprised of millions of files totaling more than 4 TB in size, and the documentation of more than 100 experimental procedures, instruments, and facilities, by multiple research teams. A key element of SEN's approach has been to leverage cloud-based data services that provide robust core capabilities with community-based management and customization capabilities. These services - data sharing, curation, and publication services developed through the NSF-supported Sustainable Environment - Actionable Data (SEAD) project and the wiki-based SEN Knowledge Base (KB) - have allowed the SEN team to ground discussions in reality and leverage the practical questions arising as researchers publish data to drive discussion and evolve towards better practices. In this presentation we summarize how SEN interacts with researchers, the best practices that have been developed, and the capabilities of SEAD and the SEN KB that support them. We also describe issues that have arisen in the community - related, for example, to recommended and required metadata, individual, project and community branding, and data version and derivation relationships - and describe how SEN's outreach activities, collaboration with the SEAD team, and the flexible design of the data services themselves have, in combination, been able to provide rapid incremental solutions to support researchers needs while also helping the community align with broader semantic and data publication standards. We conclude with thoughts on how this approach could be applied in other communities as a way to drive progress towards data reuse and reproducible research.
AtomPy: an open atomic-data curation environment
NASA Astrophysics Data System (ADS)
Bautista, Manuel; Mendoza, Claudio; Boswell, Josiah S; Ajoku, Chukwuemeka
2014-06-01
We present a cloud-computing environment for atomic data curation, networking among atomic data providers and users, teaching-and-learning, and interfacing with spectral modeling software. The system is based on Google-Drive Sheets, Pandas (Python Data Analysis Library) DataFrames, and IPython Notebooks for open community-driven curation of atomic data for scientific and technological applications. The atomic model for each ionic species is contained in a multi-sheet Google-Drive workbook, where the atomic parameters from all known public sources are progressively stored. Metadata (provenance, community discussion, etc.) accompanying every entry in the database are stored through Notebooks. Education tools on the physics of atomic processes as well as their relevance to plasma and spectral modeling are based on IPython Notebooks that integrate written material, images, videos, and active computer-tool workflows. Data processing workflows and collaborative software developments are encouraged and managed through the GitHub social network. Relevant issues this platform intends to address are: (i) data quality by allowing open access to both data producers and users in order to attain completeness, accuracy, consistency, provenance and currentness; (ii) comparisons of different datasets to facilitate accuracy assessment; (iii) downloading to local data structures (i.e. Pandas DataFrames) for further manipulation and analysis by prospective users; and (iv) data preservation by avoiding the discard of outdated sets.
PDS4 - Some Principles for Agile Data Curation
NASA Astrophysics Data System (ADS)
Hughes, J. S.; Crichton, D. J.; Hardman, S. H.; Joyner, R.; Algermissen, S.; Padams, J.
2015-12-01
PDS4, a research data management and curation system for NASA's Planetary Science Archive, was developed using principles that promote the characteristics of agile development. The result is an efficient system that produces better research data products while using less resources (time, effort, and money) and maximizes their usefulness for current and future scientists. The key principle is architectural. The PDS4 information architecture is developed and maintained independent of the infrastructure's process, application and technology architectures. The information architecture is based on an ontology-based information model developed to leverage best practices from standard reference models for digital archives, digital object registries, and metadata registries and capture domain knowledge from a panel of planetary science domain experts. The information model provides a sharable, stable, and formal set of information requirements for the system and is the primary source for information to configure most system components, including the product registry, search engine, validation and display tools, and production pipelines. Multi-level governance is also allowed for the effective management of the informational elements at the common, discipline, and project level. This presentation will describe the development principles, components, and uses of the information model and how an information model-driven architecture exhibits characteristics of agile curation including early delivery, evolutionary development, adaptive planning, continuous improvement, and rapid and flexible response to change.
Optimizing Resources for Trustworthiness and Scientific Impact of Domain Repositories
NASA Astrophysics Data System (ADS)
Lehnert, K.
2017-12-01
Domain repositories, i.e. data archives tied to specific scientific communities, are widely recognized and trusted by their user communities for ensuring a high level of data quality, enhancing data value, access, and reuse through a unique combination of disciplinary and digital curation expertise. Their data services are guided by the practices and values of the specific community they serve and designed to support the advancement of their science. Domain repositories need to meet user expectations for scientific utility in order to be successful, but they also need to fulfill the requirements for trustworthy repository services to be acknowledged by scientists, funders, and publishers as a reliable facility that curates and preserves data following international standards. Domain repositories therefore need to carefully plan and balance investments to optimize the scientific impact of their data services and user satisfaction on the one hand, while maintaining a reliable and robust operation of the repository infrastructure on the other hand. Staying abreast of evolving repository standards to certify as a trustworthy repository and conducting a regular self-assessment and certification alone requires resources that compete with the demands for improving data holdings or usability of systems. The Interdisciplinary Earth Data Alliance (IEDA), a data facility funded by the US National Science Foundation, operates repositories for geochemical, marine Geoscience, and Antarctic research data, while also maintaining data products (global syntheses) and data visualization and analysis tools that are of high value for the science community and have demonstrated considerable scientific impact. Balancing the investments in the growth and utility of the syntheses with resources required for certifcation of IEDA's repository services has been challenging, and a major self-assessment effort has been difficult to accommodate. IEDA is exploring a partnership model to share generic repository functions (e.g. metadata registration, long-term archiving) with other repositories. This could substantially reduce the effort of certification and allow effort to focus on the domain-specific data curation and value-added services.
Achieving Sub-Second Search in the CMR
NASA Astrophysics Data System (ADS)
Gilman, J.; Baynes, K.; Pilone, D.; Mitchell, A. E.; Murphy, K. J.
2014-12-01
The Common Metadata Repository (CMR) is the next generation Earth Science Metadata catalog for NASA's Earth Observing data. It joins together the holdings from the EOS Clearing House (ECHO) and the Global Change Master Directory (GCMD), creating a unified, authoritative source for EOSDIS metadata. The CMR allows ingest in many different formats while providing consistent search behavior and retrieval in any supported format. Performance is a critical component of the CMR, ensuring improved data discovery and client interactivity. The CMR delivers sub-second search performance for any of the common query conditions (including spatial) across hundreds of millions of metadata granules. It also allows the addition of new metadata concepts such as visualizations, parameter metadata, and documentation. The CMR's goals presented many challenges. This talk will describe the CMR architecture, design, and innovations that were made to achieve its goals. This includes: * Architectural features like immutability and backpressure. * Data management techniques such as caching and parallel loading that give big performance gains. * Open Source and COTS tools like Elasticsearch search engine. * Adoption of Clojure, a functional programming language for the Java Virtual Machine. * Development of a custom spatial search plugin for Elasticsearch and why it was necessary. * Introduction of a unified model for metadata that maps every supported metadata format to a consistent domain model.
Facilitating Stewardship of scientific data through standards based workflows
NASA Astrophysics Data System (ADS)
Bastrakova, I.; Kemp, C.; Potter, A. K.
2013-12-01
There are main suites of standards that can be used to define the fundamental scientific methodology of data, methods and results. These are firstly Metadata standards to enable discovery of the data (ISO 19115), secondly the Sensor Web Enablement (SWE) suite of standards that include the O&M and SensorML standards and thirdly Ontology that provide vocabularies to define the scientific concepts and relationships between these concepts. All three types of standards have to be utilised by the practicing scientist to ensure that those who ultimately have to steward the data stewards to ensure that the data can be preserved curated and reused and repurposed. Additional benefits of this approach include transparency of scientific processes from the data acquisition to creation of scientific concepts and models, and provision of context to inform data use. Collecting and recording metadata is the first step in scientific data flow. The primary role of metadata is to provide details of geographic extent, availability and high-level description of data suitable for its initial discovery through common search engines. The SWE suite provides standardised patterns to describe observations and measurements taken for these data, capture detailed information about observation or analytical methods, used instruments and define quality determinations. This information standardises browsing capability over discrete data types. The standardised patterns of the SWE standards simplify aggregation of observation and measurement data enabling scientists to transfer disintegrated data to scientific concepts. The first two steps provide a necessary basis for the reasoning about concepts of ';pure' science, building relationship between concepts of different domains (linked-data), and identifying domain classification and vocabularies. Geoscience Australia is re-examining its marine data flows, including metadata requirements and business processes, to achieve a clearer link between scientific data acquisition and analysis requirements and effective interoperable data management and delivery. This includes participating in national and international dialogue on development of standards, embedding data management activities in business processes, and developing scientific staff as effective data stewards. Similar approach is applied to the geophysical data. By ensuring the geophysical datasets at GA strictly follow metadata and industry standards we are able to implement a provenance based workflow where the data is easily discoverable, geophysical processing can be applied to it and results can be stored. The provenance based workflow enables metadata records for the results to be produced automatically from the input dataset metadata.
Finding Atmospheric Composition (AC) Metadata
NASA Technical Reports Server (NTRS)
Strub, Richard F..; Falke, Stefan; Fiakowski, Ed; Kempler, Steve; Lynnes, Chris; Goussev, Oleg
2015-01-01
The Atmospheric Composition Portal (ACP) is an aggregator and curator of information related to remotely sensed atmospheric composition data and analysis. It uses existing tools and technologies and, where needed, enhances those capabilities to provide interoperable access, tools, and contextual guidance for scientists and value-adding organizations using remotely sensed atmospheric composition data. The initial focus is on Essential Climate Variables identified by the Global Climate Observing System CH4, CO, CO2, NO2, O3, SO2 and aerosols. This poster addresses our efforts in building the ACP Data Table, an interface to help discover and understand remotely sensed data that are related to atmospheric composition science and applications. We harvested GCMD, CWIC, GEOSS metadata catalogs using machine to machine technologies - OpenSearch, Web Services. We also manually investigated the plethora of CEOS data providers portals and other catalogs where that data might be aggregated. This poster is our experience of the excellence, variety, and challenges we encountered.Conclusions:1.The significant benefits that the major catalogs provide are their machine to machine tools like OpenSearch and Web Services rather than any GUI usability improvements due to the large amount of data in their catalog.2.There is a trend at the large catalogs towards simulating small data provider portals through advanced services. 3.Populating metadata catalogs using ISO19115 is too complex for users to do in a consistent way, difficult to parse visually or with XML libraries, and too complex for Java XML binders like CASTOR.4.The ability to search for Ids first and then for data (GCMD and ECHO) is better for machine to machine operations rather than the timeouts experienced when returning the entire metadata entry at once. 5.Metadata harvest and export activities between the major catalogs has led to a significant amount of duplication. (This is currently being addressed) 6.Most (if not all) Earth science atmospheric composition data providers store a reference to their data at GCMD.
Automated Atmospheric Composition Dataset Level Metadata Discovery. Difficulties and Surprises
NASA Astrophysics Data System (ADS)
Strub, R. F.; Falke, S. R.; Kempler, S.; Fialkowski, E.; Goussev, O.; Lynnes, C.
2015-12-01
The Atmospheric Composition Portal (ACP) is an aggregator and curator of information related to remotely sensed atmospheric composition data and analysis. It uses existing tools and technologies and, where needed, enhances those capabilities to provide interoperable access, tools, and contextual guidance for scientists and value-adding organizations using remotely sensed atmospheric composition data. The initial focus is on Essential Climate Variables identified by the Global Climate Observing System - CH4, CO, CO2, NO2, O3, SO2 and aerosols. This poster addresses our efforts in building the ACP Data Table, an interface to help discover and understand remotely sensed data that are related to atmospheric composition science and applications. We harvested GCMD, CWIC, GEOSS metadata catalogs using machine to machine technologies - OpenSearch, Web Services. We also manually investigated the plethora of CEOS data providers portals and other catalogs where that data might be aggregated. This poster is our experience of the excellence, variety, and challenges we encountered.Conclusions:1.The significant benefits that the major catalogs provide are their machine to machine tools like OpenSearch and Web Services rather than any GUI usability improvements due to the large amount of data in their catalog.2.There is a trend at the large catalogs towards simulating small data provider portals through advanced services. 3.Populating metadata catalogs using ISO19115 is too complex for users to do in a consistent way, difficult to parse visually or with XML libraries, and too complex for Java XML binders like CASTOR.4.The ability to search for Ids first and then for data (GCMD and ECHO) is better for machine to machine operations rather than the timeouts experienced when returning the entire metadata entry at once. 5.Metadata harvest and export activities between the major catalogs has led to a significant amount of duplication. (This is currently being addressed) 6.Most (if not all) Earth science atmospheric composition data providers store a reference to their data at GCMD.
The Digital Sample: Metadata, Unique Identification, and Links to Data and Publications
NASA Astrophysics Data System (ADS)
Lehnert, K. A.; Vinayagamoorthy, S.; Djapic, B.; Klump, J.
2006-12-01
A significant part of digital data in the Geosciences refers to physical samples of Earth materials, from igneous rocks to sediment cores to water or gas samples. The application and long-term utility of these sample-based data in research is critically dependent on (a) the availability of information (metadata) about the samples such as geographical location and time of sampling, or sampling method, (b) links between the different data types available for individual samples that are dispersed in the literature and in digital data repositories, and (c) access to the samples themselves. Major problems for achieving this include incomplete documentation of samples in publications, use of ambiguous sample names, and the lack of a central catalog that allows to find a sample's archiving location. The International Geo Sample Number IGSN, managed by the System for Earth Sample Registration SESAR, provides solutions for these problems. The IGSN is a unique persistent identifier for samples and other GeoObjects that can be obtained by submitting sample metadata to SESAR (www.geosamples.org). If data in a publication is referenced to an IGSN (rather than an ambiguous sample name), sample metadata can readily be extracted from the SESAR database, which evolves into a Global Sample Catalog that also allows to locate the owner or curator of the sample. Use of the IGSN in digital data systems allows building linkages between distributed data. SESAR is contributing to the development of sample metadata standards. SESAR will integrate the IGSN in persistent, resolvable identifiers based on the handle.net service to advance direct linkages between the digital representation of samples in SESAR (sample profiles) and their related data in the literature and in web-accessible digital data repositories. Technologies outlined by Klump et al. (this session) such as the automatic creation of ontologies by text mining applications will be explored for harvesting identifiers of publications and datasets that contain information about a specific sample in order to establish comprehensive data profiles for samples.
Towards Data Value-Level Metadata for Clinical Studies.
Zozus, Meredith Nahm; Bonner, Joseph
2017-01-01
While several standards for metadata describing clinical studies exist, comprehensive metadata to support traceability of data from clinical studies has not been articulated. We examine uses of metadata in clinical studies. We examine and enumerate seven sources of data value-level metadata in clinical studies inclusive of research designs across the spectrum of the National Institutes of Health definition of clinical research. The sources of metadata inform categorization in terms of metadata describing the origin of a data value, the definition of a data value, and operations to which the data value was subjected. The latter is further categorized into information about changes to a data value, movement of a data value, retrieval of a data value, and data quality checks, constraints or assessments to which the data value was subjected. The implications of tracking and managing data value-level metadata are explored.
The Materials Data Facility: Data Services to Advance Materials Science Research
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blaiszik, B.; Chard, K.; Pruyne, J.
2016-07-06
With increasingly strict data management requirements from funding agencies and institutions, expanding focus on the challenges of research replicability, and growing data sizes and heterogeneity, new data needs are emerging in the materials community. The materials data facility (MDF) operates two cloudhosted services, data publication and data discovery, with features to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific)andmore » automatically-extractedmetadata in a registrywhile the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. TheMDF services empower individual researchers, research projects, and institutions to (I) publish research datasets, regardless of size, from local storage, institutional data stores, or cloud storage, without involvement of thirdparty publishers; (II) build, share, and enforce extensible domain-specific custom metadata schemas; (III) interact with published data and metadata via representational state transfer (REST) application program interfaces (APIs) to facilitate automation, analysis, and feedback; and (IV) access a data discovery model that allows researchers to search, interrogate, and eventually build on existing published data. We describe MDF’s design, current status, and future plans.« less
MoonDB: Restoration and Synthesis of Lunar Petrological and Geochemical Data
NASA Technical Reports Server (NTRS)
Lehnert, Kerstin A.; Cai, Yue; Mana, Sara; Todd, Nancy S.; Zeigler, Ryan A.; Evans, Cindy A.
2016-01-01
About 2,200 samples were collected from the Moon during the Apollo missions, forming a unique and irreplaceable legacy of the Apollo program. These samples, obtained at tremendous cost and great risk, are the only samples that have ever been returned by astronauts from the surface of another planetary body. These lunar samples have been curated at NASA Johnson Space Center and made available to the global research community. Over more than 45 years, a vast body of petrological, geochemical, and geochronological studies of these samples have been amassed, which helped to expand our understanding of the history and evolution of the Moon, the Earth itself, and the history of our entire solar system. Unfortunately, data from these studies are dispersed in the literature, often only available in analog format in older publications, and/or lacking sample metadata and analytical metadata (e.g., information about analytical procedure and data quality), which greatly limits their usage for new scientific endeavors. Even worse is that much lunar data have never been published, simply because no forum existed at the time (e.g., electronic supplements). Thousands of valuable analyses remain inaccessible, often preserved only in personal records, and are in danger of being lost forever, when investigators retire or pass away. Making these data and metadata publicly accessible in a digital format would dramatically help guide current and future research and eliminate duplicated analyses of precious lunar samples.
The MAR databases: development and implementation of databases specific for marine metagenomics
Klemetsen, Terje; Raknes, Inge A; Fu, Juan; Agafonov, Alexander; Balasundaram, Sudhagar V; Tartari, Giacomo; Robertsen, Espen
2018-01-01
Abstract We introduce the marine databases; MarRef, MarDB and MarCat (https://mmp.sfb.uit.no/databases/), which are publicly available resources that promote marine research and innovation. These data resources, which have been implemented in the Marine Metagenomics Portal (MMP) (https://mmp.sfb.uit.no/), are collections of richly annotated and manually curated contextual (metadata) and sequence databases representing three tiers of accuracy. While MarRef is a database for completely sequenced marine prokaryotic genomes, which represent a marine prokaryote reference genome database, MarDB includes all incomplete sequenced prokaryotic genomes regardless level of completeness. The last database, MarCat, represents a gene (protein) catalog of uncultivable (and cultivable) marine genes and proteins derived from marine metagenomics samples. The first versions of MarRef and MarDB contain 612 and 3726 records, respectively. Each record is built up of 106 metadata fields including attributes for sampling, sequencing, assembly and annotation in addition to the organism and taxonomic information. Currently, MarCat contains 1227 records with 55 metadata fields. Ontologies and controlled vocabularies are used in the contextual databases to enhance consistency. The user-friendly web interface lets the visitors browse, filter and search in the contextual databases and perform BLAST searches against the corresponding sequence databases. All contextual and sequence databases are freely accessible and downloadable from https://s1.sfb.uit.no/public/mar/. PMID:29106641
A Generic Metadata Editor Supporting System Using Drupal CMS
NASA Astrophysics Data System (ADS)
Pan, J.; Banks, N. G.; Leggott, M.
2011-12-01
Metadata handling is a key factor in preserving and reusing scientific data. In recent years, standardized structural metadata has become widely used in Geoscience communities. However, there exist many different standards in Geosciences, such as the current version of the Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata (FGDC CSDGM), the Ecological Markup Language (EML), the Geography Markup Language (GML), and the emerging ISO 19115 and related standards. In addition, there are many different subsets within the Geoscience subdomain such as the Biological Profile of the FGDC (CSDGM), or for geopolitical regions, such as the European Profile or the North American Profile in the ISO standards. It is therefore desirable to have a software foundation to support metadata creation and editing for multiple standards and profiles, without re-inventing the wheels. We have developed a software module as a generic, flexible software system to do just that: to facilitate the support for multiple metadata standards and profiles. The software consists of a set of modules for the Drupal Content Management System (CMS), with minimal inter-dependencies to other Drupal modules. There are two steps in using the system's metadata functions. First, an administrator can use the system to design a user form, based on an XML schema and its instances. The form definition is named and stored in the Drupal database as a XML blob content. Second, users in an editor role can then use the persisted XML definition to render an actual metadata entry form, for creating or editing a metadata record. Behind the scenes, the form definition XML is transformed into a PHP array, which is then rendered via Drupal Form API. When the form is submitted the posted values are used to modify a metadata record. Drupal hooks can be used to perform custom processing on metadata record before and after submission. It is trivial to store the metadata record as an actual XML file or in a storage/archive system. We are working on adding many features to help editor users, such as auto completion, pre-populating of forms, partial saving, as well as automatic schema validation. In this presentation we will demonstrate a few sample editors, including an FGDC editor and a bare bone editor for ISO 19115/19139. We will also demonstrate the use of templates during the definition phase, with the support of export and import functions. Form pre-population and input validation will also be covered. Theses modules are available as open-source software from the Islandora software foundation, as a component of a larger Drupal-based data archive system. They can be easily installed as stand-alone system, or to be plugged into other existing metadata platforms.
Advancing Site-Based Data Curation for Geobiology: The Yellowstone Exemplar (Invited)
NASA Astrophysics Data System (ADS)
Palmer, C. L.; Fouke, B. W.; Rodman, A.; Choudhury, G. S.
2013-12-01
While advances in the management and archiving of scientific digital data are proceeding apace, there is an urgent need for data curation services to collect and provide access to high-value data fit for reuse. The Site-Based Data Curation (SBDC) project is establishing a framework of guidelines and processes for the curation of research data generated at scientifically significant sites. The project is a collaboration among information scientists, geobiologists, data archiving experts, and resource managers at Yellowstone National Park (YNP). Based on our previous work with the Data Conservancy on indicators of value for research data, several factors made YNP an optimal site for developing the SBDC framework, including unique environmental conditions, a permitting process for data collection, and opportunities for geo-located longitudinal data and multiple data sources for triangulation and context. Stakeholder analysis is informing the SBDC requirements, through engagement with geologists, geochemists, and microbiologists conducting research at YNP and personnel from the Yellowstone Center for Resources and other YNP units. To date, results include data value indicators specific to site-based research, minimum and optimal parameters for data description and metadata, and a strategy for organizing data around sampling events. New value indicators identified by the scientists include ease of access to park locations for verification and correction of data, and stable environmental conditions important for controlling variables. Researchers see high potential for data aggregated from the many individual investigators conducting permitted research at YNP, however reuse is clearly contingent on detailed and consistent sampling records. Major applications of SBDC include identifying connections in dynamic systems, spatial temporal synthesis, analyzing variability within and across geological features, tracking site evolution, assessing anomalies, and greater awareness of complementary research and opportunities for collaboration. Moreover, making evident the range of available YNP data will inform what should be explored next, even beyond YNP. Like funding agencies and policy makers, YNP researchers and resource managers are invested in data curation for strategic purposes related to the big picture and efficiency of science. For the scientists, YNP represents an ideal, protected natural system that can serve as an indicator of world events, and SBDC provides the ability to ask and answer broader research questions and leverage an extensive store of highly applicable data. SBDC affords YNP improved coordination and transparency of data collection activities, and easier identification of trends and connections across projects. SBDC capabilities that support broader inquiry and better coordination of scientific effort have clear implications for data curation at other research intensive sites, and may also inform how data systems can provide strategic assistance to science more generally.
NASA Astrophysics Data System (ADS)
Raugh, Anne; Henneken, Edwin
The Planetary Data System (PDS) is actively involved in designing both metadata and interfaces to make the assignment of Digital Object Identifiers (DOIs) to archival data a part of the archiving process for all data creators. These DOIs will be registered through DataCite, a non-profit organization whose members are all deeply concerned with archival research data, provenance tracking through the literature, and proper acknowledgement of the various types of efforts that contribute to the creation of an archival reference data set. Making the collection of citation metadata and its ingestion into the DataCite DOI database easy - and easy to do correctly - is in the best interests of all stakeholders: the data creators; the curators; the indexing organizations like the Astrophysics Data System (ADS); and the data users. But in order to realize the promise of DOIs, there are three key issues to address: 1) How do we incorporate the metadata collection process simply and naturally into the PDS archive creation process; 2) How do we encourage journal editors to require references to previously published data with the same rigor with which they require references to previously published research and analysis; and finally, 3) How can we change the culture of academic and research employers to recognize that the effort required to prepare a PDS archival data set is a career achievement on par with contributing to a refereed article in the professional literature. Data archives and scholarly publications are the long-term return on investment that funding agencies and the science community expect in exchange for research spending. The traceability and reproducibility ensured by the integration of DOIs and their related metadata into indexing and search services is an essential part of providing and optimizing that return.
MiMiR – an integrated platform for microarray data sharing, mining and analysis
Tomlinson, Chris; Thimma, Manjula; Alexandrakis, Stelios; Castillo, Tito; Dennis, Jayne L; Brooks, Anthony; Bradley, Thomas; Turnbull, Carly; Blaveri, Ekaterini; Barton, Geraint; Chiba, Norie; Maratou, Klio; Soutter, Pat; Aitman, Tim; Game, Laurence
2008-01-01
Background Despite considerable efforts within the microarray community for standardising data format, content and description, microarray technologies present major challenges in managing, sharing, analysing and re-using the large amount of data generated locally or internationally. Additionally, it is recognised that inconsistent and low quality experimental annotation in public data repositories significantly compromises the re-use of microarray data for meta-analysis. MiMiR, the Microarray data Mining Resource was designed to tackle some of these limitations and challenges. Here we present new software components and enhancements to the original infrastructure that increase accessibility, utility and opportunities for large scale mining of experimental and clinical data. Results A user friendly Online Annotation Tool allows researchers to submit detailed experimental information via the web at the time of data generation rather than at the time of publication. This ensures the easy access and high accuracy of meta-data collected. Experiments are programmatically built in the MiMiR database from the submitted information and details are systematically curated and further annotated by a team of trained annotators using a new Curation and Annotation Tool. Clinical information can be annotated and coded with a clinical Data Mapping Tool within an appropriate ethical framework. Users can visualise experimental annotation, assess data quality, download and share data via a web-based experiment browser called MiMiR Online. All requests to access data in MiMiR are routed through a sophisticated middleware security layer thereby allowing secure data access and sharing amongst MiMiR registered users prior to publication. Data in MiMiR can be mined and analysed using the integrated EMAAS open source analysis web portal or via export of data and meta-data into Rosetta Resolver data analysis package. Conclusion The new MiMiR suite of software enables systematic and effective capture of extensive experimental and clinical information with the highest MIAME score, and secure data sharing prior to publication. MiMiR currently contains more than 150 experiments corresponding to over 3000 hybridisations and supports the Microarray Centre's large microarray user community and two international consortia. The MiMiR flexible and scalable hardware and software architecture enables secure warehousing of thousands of datasets, including clinical studies, from microarray and potentially other -omics technologies. PMID:18801157
MiMiR--an integrated platform for microarray data sharing, mining and analysis.
Tomlinson, Chris; Thimma, Manjula; Alexandrakis, Stelios; Castillo, Tito; Dennis, Jayne L; Brooks, Anthony; Bradley, Thomas; Turnbull, Carly; Blaveri, Ekaterini; Barton, Geraint; Chiba, Norie; Maratou, Klio; Soutter, Pat; Aitman, Tim; Game, Laurence
2008-09-18
Despite considerable efforts within the microarray community for standardising data format, content and description, microarray technologies present major challenges in managing, sharing, analysing and re-using the large amount of data generated locally or internationally. Additionally, it is recognised that inconsistent and low quality experimental annotation in public data repositories significantly compromises the re-use of microarray data for meta-analysis. MiMiR, the Microarray data Mining Resource was designed to tackle some of these limitations and challenges. Here we present new software components and enhancements to the original infrastructure that increase accessibility, utility and opportunities for large scale mining of experimental and clinical data. A user friendly Online Annotation Tool allows researchers to submit detailed experimental information via the web at the time of data generation rather than at the time of publication. This ensures the easy access and high accuracy of meta-data collected. Experiments are programmatically built in the MiMiR database from the submitted information and details are systematically curated and further annotated by a team of trained annotators using a new Curation and Annotation Tool. Clinical information can be annotated and coded with a clinical Data Mapping Tool within an appropriate ethical framework. Users can visualise experimental annotation, assess data quality, download and share data via a web-based experiment browser called MiMiR Online. All requests to access data in MiMiR are routed through a sophisticated middleware security layer thereby allowing secure data access and sharing amongst MiMiR registered users prior to publication. Data in MiMiR can be mined and analysed using the integrated EMAAS open source analysis web portal or via export of data and meta-data into Rosetta Resolver data analysis package. The new MiMiR suite of software enables systematic and effective capture of extensive experimental and clinical information with the highest MIAME score, and secure data sharing prior to publication. MiMiR currently contains more than 150 experiments corresponding to over 3000 hybridisations and supports the Microarray Centre's large microarray user community and two international consortia. The MiMiR flexible and scalable hardware and software architecture enables secure warehousing of thousands of datasets, including clinical studies, from microarray and potentially other -omics technologies.
Improvements to the Ontology-based Metadata Portal for Unified Semantics (OlyMPUS)
NASA Astrophysics Data System (ADS)
Linsinbigler, M. A.; Gleason, J. L.; Huffer, E.
2016-12-01
The Ontology-based Metadata Portal for Unified Semantics (OlyMPUS), funded by the NASA Earth Science Technology Office Advanced Information Systems Technology program, is an end-to-end system designed to support Earth Science data consumers and data providers, enabling the latter to register data sets and provision them with the semantically rich metadata that drives the Ontology-Driven Interactive Search Environment for Earth Sciences (ODISEES). OlyMPUS complements the ODISEES' data discovery system with an intelligent tool to enable data producers to auto-generate semantically enhanced metadata and upload it to the metadata repository that drives ODISEES. Like ODISEES, the OlyMPUS metadata provisioning tool leverages robust semantics, a NoSQL database and query engine, an automated reasoning engine that performs first- and second-order deductive inferencing, and uses a controlled vocabulary to support data interoperability and automated analytics. The ODISEES data discovery portal leverages this metadata to provide a seamless data discovery and access experience for data consumers who are interested in comparing and contrasting the multiple Earth science data products available across NASA data centers. Olympus will support scientists' services and tools for performing complex analyses and identifying correlations and non-obvious relationships across all types of Earth System phenomena using the full spectrum of NASA Earth Science data available. By providing an intelligent discovery portal that supplies users - both human users and machines - with detailed information about data products, their contents and their structure, ODISEES will reduce the level of effort required to identify and prepare large volumes of data for analysis. This poster will explain how OlyMPUS leverages deductive reasoning and other technologies to create an integrated environment for generating and exploiting semantically rich metadata.
Document Classification in Support of Automated Metadata Extraction Form Heterogeneous Collections
ERIC Educational Resources Information Center
Flynn, Paul K.
2014-01-01
A number of federal agencies, universities, laboratories, and companies are placing their documents online and making them searchable via metadata fields such as author, title, and publishing organization. To enable this, every document in the collection must be catalogued using the metadata fields. Though time consuming, the task of identifying…
Metadata squared: enhancing its usability for volunteered geographic information and the GeoWeb
Poore, Barbara S.; Wolf, Eric B.; Sui, Daniel Z.; Elwood, Sarah; Goodchild, Michael F.
2013-01-01
The Internet has brought many changes to the way geographic information is created and shared. One aspect that has not changed is metadata. Static spatial data quality descriptions were standardized in the mid-1990s and cannot accommodate the current climate of data creation where nonexperts are using mobile phones and other location-based devices on a continuous basis to contribute data to Internet mapping platforms. The usability of standard geospatial metadata is being questioned by academics and neogeographers alike. This chapter analyzes current discussions of metadata to demonstrate how the media shift that is occurring has affected requirements for metadata. Two case studies of metadata use are presented—online sharing of environmental information through a regional spatial data infrastructure in the early 2000s, and new types of metadata that are being used today in OpenStreetMap, a map of the world created entirely by volunteers. Changes in metadata requirements are examined for usability, the ease with which metadata supports coproduction of data by communities of users, how metadata enhances findability, and how the relationship between metadata and data has changed. We argue that traditional metadata associated with spatial data infrastructures is inadequate and suggest several research avenues to make this type of metadata more interactive and effective in the GeoWeb.
Metadata Design in the New PDS4 Standards - Something for Everybody
NASA Astrophysics Data System (ADS)
Raugh, Anne C.; Hughes, John S.
2015-11-01
The Planetary Data System (PDS) archives, supports, and distributes data of diverse targets, from diverse sources, to diverse users. One of the core problems addressed by the PDS4 data standard redesign was that of metadata - how to accommodate the increasingly sophisticated demands of search interfaces, analytical software, and observational documentation into label standards without imposing limits and constraints that would impinge on the quality or quantity of metadata that any particular observer or team could supply. And yet, as an archive, PDS must have detailed documentation for the metadata in the labels it supports, or the institutional knowledge encoded into those attributes will be lost - putting the data at risk.The PDS4 metadata solution is based on a three-step approach. First, it is built on two key ISO standards: ISO 11179 "Information Technology - Metadata Registries", which provides a common framework and vocabulary for defining metadata attributes; and ISO 14721 "Space Data and Information Transfer Systems - Open Archival Information System (OAIS) Reference Model", which provides the framework for the information architecture that enforces the object-oriented paradigm for metadata modeling. Second, PDS has defined a hierarchical system that allows it to divide its metadata universe into namespaces ("data dictionaries", conceptually), and more importantly to delegate stewardship for a single namespace to a local authority. This means that a mission can develop its own data model with a high degree of autonomy and effectively extend the PDS model to accommodate its own metadata needs within the common ISO 11179 framework. Finally, within a single namespace - even the core PDS namespace - existing metadata structures can be extended and new structures added to the model as new needs are identifiedThis poster illustrates the PDS4 approach to metadata management and highlights the expected return on the development investment for PDS, users and data preparers.
Data Provenance Architecture for the Geosciences
NASA Astrophysics Data System (ADS)
Murphy, F.; Irving, D. H.
2012-12-01
The pace at which geoscientific insights inform societal development quickens with time and these insights drive decisions and actions of ever-increasing human and economic significance. Until recently academic, commercial and government bodies have maintained distinct bodies of knowledge to support scientific enquiry as well as societal development. However, it has become clear that the curation of the body of data is an activity of equal or higher social and commercial value. We address the community challenges in the curation of, access to, and analysis of scientific data including: the tensions between creators, providers and users; incentives and barriers to sharing; ownership and crediting. We also discuss the technical and financial challenges in maximising the return on the effort made in generating geoscientific data. To illustrate how these challenges might be addressed in the broader geoscientific domain, we describe the high-level data governance and analytical architecture in the upstream Oil Industry. This domain is heavily dependent on costly and highly diverse geodatasets collected and assimilated over timeframes varying from seconds to decades. These data must support both operational decisions at the minute-hour timefame and strategic and economic decisions of enterprise or national scale, and yet be sufficiently robust to last the life of a producing field. We develop three themes around data provenance, data ownership and business models for data curation. 1/ The overarching aspiration is to ensure that data provenance and quality is maintained along the analytical workflow. Hence if data on which a publication or report changes, the report and its publishers can be notified and we describe a mechanism by which dependent knowledge products can be flagged. 2/ From a cost and management point of view we look at who "owns" data especially in cases where the cost of curation and stewardship is significant compared to the cost of acquiring the data in the first place. Analytical value can be placed on data and this can govern the mode of custodianship. 3/ The broader scientific domain requires the development of new business models, whereby the need to re-examine how scientific credit and reputation are built and assessed, the current management of the scientific canon still refers back to the paper system and an appraisal of how we curate layered/rich/dynamic scientific content is timely. Using our review of the upstream Oil and Gas industry, we expand our ideas to the interplay between government, academic, private and public geo- and other datasets. Whilst there is no simple answer, much work has already been achieved around standardised data exchange formats, metadata and searchability via semantic frameworks, and cost/charge/licensing models. The key observation from the Oil industry is that it is the practicalities of data management issues that are driving change rather than any commercial/philosophical agenda - driven in particular by the custodians, managers and architects of the data rather than the users or the owners.
OlyMPUS - The Ontology-based Metadata Portal for Unified Semantics
NASA Astrophysics Data System (ADS)
Huffer, E.; Gleason, J. L.
2015-12-01
The Ontology-based Metadata Portal for Unified Semantics (OlyMPUS), funded by the NASA Earth Science Technology Office Advanced Information Systems Technology program, is an end-to-end system designed to support data consumers and data providers, enabling the latter to register their data sets and provision them with the semantically rich metadata that drives the Ontology-Driven Interactive Search Environment for Earth Sciences (ODISEES). OlyMPUS leverages the semantics and reasoning capabilities of ODISEES to provide data producers with a semi-automated interface for producing the semantically rich metadata needed to support ODISEES' data discovery and access services. It integrates the ODISEES metadata search system with multiple NASA data delivery tools to enable data consumers to create customized data sets for download to their computers, or for NASA Advanced Supercomputing (NAS) facility registered users, directly to NAS storage resources for access by applications running on NAS supercomputers. A core function of NASA's Earth Science Division is research and analysis that uses the full spectrum of data products available in NASA archives. Scientists need to perform complex analyses that identify correlations and non-obvious relationships across all types of Earth System phenomena. Comprehensive analytics are hindered, however, by the fact that many Earth science data products are disparate and hard to synthesize. Variations in how data are collected, processed, gridded, and stored, create challenges for data interoperability and synthesis, which are exacerbated by the sheer volume of available data. Robust, semantically rich metadata can support tools for data discovery and facilitate machine-to-machine transactions with services such as data subsetting, regridding, and reformatting. Such capabilities are critical to enabling the research activities integral to NASA's strategic plans. However, as metadata requirements increase and competing standards emerge, metadata provisioning becomes increasingly burdensome to data producers. The OlyMPUS system helps data providers produce semantically rich metadata, making their data more accessible to data consumers, and helps data consumers quickly discover and download the right data for their research.
New Tools to Document and Manage Data/Metadata: Example NGEE Arctic and ARM
NASA Astrophysics Data System (ADS)
Crow, M. C.; Devarakonda, R.; Killeffer, T.; Hook, L.; Boden, T.; Wullschleger, S.
2017-12-01
Tools used for documenting, archiving, cataloging, and searching data are critical pieces of informatics. This poster describes tools being used in several projects at Oak Ridge National Laboratory (ORNL), with a focus on the U.S. Department of Energy's Next Generation Ecosystem Experiment in the Arctic (NGEE Arctic) and Atmospheric Radiation Measurements (ARM) project, and their usage at different stages of the data lifecycle. The Online Metadata Editor (OME) is used for the documentation and archival stages while a Data Search tool supports indexing, cataloging, and searching. The NGEE Arctic OME Tool [1] provides a method by which researchers can upload their data and provide original metadata with each upload while adhering to standard metadata formats. The tool is built upon a Java SPRING framework to parse user input into, and from, XML output. Many aspects of the tool require use of a relational database including encrypted user-login, auto-fill functionality for predefined sites and plots, and file reference storage and sorting. The Data Search Tool conveniently displays each data record in a thumbnail containing the title, source, and date range, and features a quick view of the metadata associated with that record, as well as a direct link to the data. The search box incorporates autocomplete capabilities for search terms and sorted keyword filters are available on the side of the page, including a map for geo-searching. These tools are supported by the Mercury [2] consortium (funded by DOE, NASA, USGS, and ARM) and developed and managed at Oak Ridge National Laboratory. Mercury is a set of tools for collecting, searching, and retrieving metadata and data. Mercury collects metadata from contributing project servers, then indexes the metadata to make it searchable using Apache Solr, and provides access to retrieve it from the web page. Metadata standards that Mercury supports include: XML, Z39.50, FGDC, Dublin-Core, Darwin-Core, EML, and ISO-19115.
Sustainable data and metadata management at the BD2K-LINCS Data Coordination and Integration Center
Stathias, Vasileios; Koleti, Amar; Vidović, Dušica; Cooper, Daniel J.; Jagodnik, Kathleen M.; Terryn, Raymond; Forlin, Michele; Chung, Caty; Torre, Denis; Ayad, Nagi; Medvedovic, Mario; Ma'ayan, Avi; Pillai, Ajay; Schürer, Stephan C.
2018-01-01
The NIH-funded LINCS Consortium is creating an extensive reference library of cell-based perturbation response signatures and sophisticated informatics tools incorporating a large number of perturbagens, model systems, and assays. To date, more than 350 datasets have been generated including transcriptomics, proteomics, epigenomics, cell phenotype and competitive binding profiling assays. The large volume and variety of data necessitate rigorous data standards and effective data management including modular data processing pipelines and end-user interfaces to facilitate accurate and reliable data exchange, curation, validation, standardization, aggregation, integration, and end user access. Deep metadata annotations and the use of qualified data standards enable integration with many external resources. Here we describe the end-to-end data processing and management at the DCIC to generate a high-quality and persistent product. Our data management and stewardship solutions enable a functioning Consortium and make LINCS a valuable scientific resource that aligns with big data initiatives such as the BD2K NIH Program and concords with emerging data science best practices including the findable, accessible, interoperable, and reusable (FAIR) principles. PMID:29917015
A data discovery index for the social sciences
Krämer, Thomas; Klas, Claus-Peter; Hausstein, Brigitte
2018-01-01
This paper describes a novel search index for social and economic research data, one that enables users to search up-to-date references for data holdings in these disciplines. The index can be used for comparative analysis of publication of datasets in different areas of social science. The core of the index is the da|ra registration agency’s database for social and economic data, which contains high-quality searchable metadata from registered data publishers. Research data’s metadata records are harvested from data providers around the world and included in the index. In this paper, we describe the currently available indices on social science datasets and their shortcomings. Next, we describe the motivation behind and the purpose for the data discovery index as a dedicated and curated platform for finding social science research data and gesisDataSearch, its user interface. Further, we explain the harvesting, filtering and indexing procedure and give usage instructions for the dataset index. Lastly, we show that the index is currently the most comprehensive and most accessible collection of social science data descriptions available. PMID:29633988
Obuch, Raymond C.; Carlino, Jennifer; Zhang, Lin; Blythe, Jonathan; Dietrich, Christopher; Hawkinson, Christine
2018-04-12
The Department of the Interior (DOI) is a Federal agency with over 90,000 employees across 10 bureaus and 8 agency offices. Its primary mission is to protect and manage the Nation’s natural resources and cultural heritage; provide scientific and other information about those resources; and honor its trust responsibilities or special commitments to American Indians, Alaska Natives, and affiliated island communities. Data and information are critical in day-to-day operational decision making and scientific research. DOI is committed to creating, documenting, managing, and sharing high-quality data and metadata in and across its various programs that support its mission. Documenting data through metadata is essential in realizing the value of data as an enterprise asset. The completeness, consistency, and timeliness of metadata affect users’ ability to search for and discover the most relevant data for the intended purpose; and facilitates the interoperability and usability of these data among DOI bureaus and offices. Fully documented metadata describe data usability, quality, accuracy, provenance, and meaning.Across DOI, there are different maturity levels and phases of information and metadata management implementations. The Department has organized a committee consisting of bureau-level points-of-contacts to collaborate on the development of more consistent, standardized, and more effective metadata management practices and guidance to support this shared mission and the information needs of the Department. DOI’s metadata implementation plans establish key roles and responsibilities associated with metadata management processes, procedures, and a series of actions defined in three major metadata implementation phases including: (1) Getting started—Planning Phase, (2) Implementing and Maintaining Operational Metadata Management Phase, and (3) the Next Steps towards Improving Metadata Management Phase. DOI’s phased approach for metadata management addresses some of the major data and metadata management challenges that exist across the diverse missions of the bureaus and offices. All employees who create, modify, or use data are involved with data and metadata management. Identifying, establishing, and formalizing the roles and responsibilities associated with metadata management are key to institutionalizing a framework of best practices, methodologies, processes, and common approaches throughout all levels of the organization; these are the foundation for effective data resource management. For executives and managers, metadata management strengthens their overarching views of data assets, holdings, and data interoperability; and clarifies how metadata management can help accelerate the compliance of multiple policy mandates. For employees, data stewards, and data professionals, formalized metadata management will help with the consistency of definitions, and approaches addressing data discoverability, data quality, and data lineage. In addition to data professionals and others associated with information technology; data stewards and program subject matter experts take on important metadata management roles and responsibilities as data flow through their respective business and science-related workflows. The responsibilities of establishing, practicing, and governing the actions associated with their specific metadata management roles are critical to successful metadata implementation.
Mercury- Distributed Metadata Management, Data Discovery and Access System
NASA Astrophysics Data System (ADS)
Palanisamy, Giri; Wilson, Bruce E.; Devarakonda, Ranjeet; Green, James M.
2007-12-01
Mercury is a federated metadata harvesting, search and retrieval tool based on both open source and ORNL- developed software. It was originally developed for NASA, and the Mercury development consortium now includes funding from NASA, USGS, and DOE. Mercury supports various metadata standards including XML, Z39.50, FGDC, Dublin-Core, Darwin-Core, EML, and ISO-19115 (under development). Mercury provides a single portal to information contained in disparate data management systems. It collects metadata and key data from contributing project servers distributed around the world and builds a centralized index. The Mercury search interfaces then allow the users to perform simple, fielded, spatial and temporal searches across these metadata sources. This centralized repository of metadata with distributed data sources provides extremely fast search results to the user, while allowing data providers to advertise the availability of their data and maintain complete control and ownership of that data. Mercury supports various projects including: ORNL DAAC, NBII, DADDI, LBA, NARSTO, CDIAC, OCEAN, I3N, IAI, ESIP and ARM. The new Mercury system is based on a Service Oriented Architecture and supports various services such as Thesaurus Service, Gazetteer Web Service and UDDI Directory Services. This system also provides various search services including: RSS, Geo-RSS, OpenSearch, Web Services and Portlets. Other features include: Filtering and dynamic sorting of search results, book-markable search results, save, retrieve, and modify search criteria.
Improving Metadata Compliance for Earth Science Data Records
NASA Astrophysics Data System (ADS)
Armstrong, E. M.; Chang, O.; Foster, D.
2014-12-01
One of the recurring challenges of creating earth science data records is to ensure a consistent level of metadata compliance at the granule level where important details of contents, provenance, producer, and data references are necessary to obtain a sufficient level of understanding. These details are important not just for individual data consumers but also for autonomous software systems. Two of the most popular metadata standards at the granule level are the Climate and Forecast (CF) Metadata Conventions and the Attribute Conventions for Dataset Discovery (ACDD). Many data producers have implemented one or both of these models including the Group for High Resolution Sea Surface Temperature (GHRSST) for their global SST products and the Ocean Biology Processing Group for NASA ocean color and SST products. While both the CF and ACDD models contain various level of metadata richness, the actual "required" attributes are quite small in number. Metadata at the granule level becomes much more useful when recommended or optional attributes are implemented that document spatial and temporal ranges, lineage and provenance, sources, keywords, and references etc. In this presentation we report on a new open source tool to check the compliance of netCDF and HDF5 granules to the CF and ACCD metadata models. The tool, written in Python, was originally implemented to support metadata compliance for netCDF records as part of the NOAA's Integrated Ocean Observing System. It outputs standardized scoring for metadata compliance for both CF and ACDD, produces an objective summary weight, and can be implemented for remote records via OPeNDAP calls. Originally a command-line tool, we have extended it to provide a user-friendly web interface. Reports on metadata testing are grouped in hierarchies that make it easier to track flaws and inconsistencies in the record. We have also extended it to support explicit metadata structures and semantic syntax for the GHRSST project that can be easily adapted to other satellite missions as well. Overall, we hope this tool will provide the community with a useful mechanism to improve metadata quality and consistency at the granule level by providing objective scoring and assessment, as well as encourage data producers to improve metadata quality and quantity.
Improving Scientific Metadata Interoperability And Data Discoverability using OAI-PMH
NASA Astrophysics Data System (ADS)
Devarakonda, Ranjeet; Palanisamy, Giri; Green, James M.; Wilson, Bruce E.
2010-12-01
While general-purpose search engines (such as Google or Bing) are useful for finding many things on the Internet, they are often of limited usefulness for locating Earth Science data relevant (for example) to a specific spatiotemporal extent. By contrast, tools that search repositories of structured metadata can locate relevant datasets with fairly high precision, but the search is limited to that particular repository. Federated searches (such as Z39.50) have been used, but can be slow and the comprehensiveness can be limited by downtime in any search partner. An alternative approach to improve comprehensiveness is for a repository to harvest metadata from other repositories, possibly with limits based on subject matter or access permissions. Searches through harvested metadata can be extremely responsive, and the search tool can be customized with semantic augmentation appropriate to the community of practice being served. However, there are a number of different protocols for harvesting metadata, with some challenges for ensuring that updates are propagated and for collaborations with repositories using differing metadata standards. The Open Archive Initiative Protocol for Metadata Handling (OAI-PMH) is a standard that is seeing increased use as a means for exchanging structured metadata. OAI-PMH implementations must support Dublin Core as a metadata standard, with other metadata formats as optional. We have developed tools which enable our structured search tool (Mercury; http://mercury.ornl.gov) to consume metadata from OAI-PMH services in any of the metadata formats we support (Dublin Core, Darwin Core, FCDC CSDGM, GCMD DIF, EML, and ISO 19115/19137). We are also making ORNL DAAC metadata available through OAI-PMH for other metadata tools to utilize, such as the NASA Global Change Master Directory, GCMD). This paper describes Mercury capabilities with multiple metadata formats, in general, and, more specifically, the results of our OAI-PMH implementations and the lessons learned. References: [1] R. Devarakonda, G. Palanisamy, B.E. Wilson, and J.M. Green, "Mercury: reusable metadata management data discovery and access system", Earth Science Informatics, vol. 3, no. 1, pp. 87-94, May 2010. [2] R. Devarakonda, G. Palanisamy, J.M. Green, B.E. Wilson, "Data sharing and retrieval using OAI-PMH", Earth Science Informatics DOI: 10.1007/s12145-010-0073-0, (2010). [3] Devarakonda, R.; Palanisamy, G.; Green, J.; Wilson, B. E. "Mercury: An Example of Effective Software Reuse for Metadata Management Data Discovery and Access", Eos Trans. AGU, 89(53), Fall Meet. Suppl., IN11A-1019 (2008).
NASA Astrophysics Data System (ADS)
Noren, A.; Brady, K.; Myrbo, A.; Ito, E.
2007-12-01
Lacustrine sediment cores comprise an integral archive for the determination of continental paleoclimate, for their potentially high temporal resolution and for their ability to resolve spatial variability in climate across vast sections of the globe. Researchers studying these archives now have a large, nationally-funded, public facility dedicated to the support of their efforts. The LRC LacCore Facility, funded by NSF and the University of Minnesota, provides free or low-cost assistance to any portion of research projects, depending on the specific needs of the project. A large collection of field equipment (site survey equipment, coring devices, boats/platforms, water sampling devices) for nearly any lacustrine setting is available for rental, and Livingstone-type corers and drive rods may be purchased. LacCore staff can accompany field expeditions to operate these devices and curate samples, or provide training prior to device rental. The Facility maintains strong connections to experienced shipping agents and customs brokers, which vastly improves transport and importation of samples. In the lab, high-end instrumentation (e.g., multisensor loggers, high-resolution digital linescan cameras) provides a baseline of fundamental analyses before any sample material is consumed. LacCore staff provide support and training in lithological description, including smear-slide, XRD, and SEM analyses. The LRC botanical macrofossil reference collection is a valuable resource for both core description and detailed macrofossil analysis. Dedicated equipment and space for various subsample analyses streamlines these endeavors; subsamples for several analyses may be submitted for preparation or analysis by Facility technicians for a fee (e.g., carbon and sulfur coulometry, grain size, pollen sample preparation and analysis, charcoal, biogenic silica, LOI, freeze drying). The National Lacustrine Core Repository now curates ~9km of sediment cores from expeditions around the world, and stores metadata and analytical data for all cores processed at the facility. Any researcher may submit sample requests for material in archived cores. Supplies for field (e.g., polycarbonate pipe, endcaps), lab (e.g., sample containers, pollen sample spike), and curation (e.g., D-tubes) are sold at cost. In collaboration with facility users, staff continually develop new equipment, supplies, and procedures as needed in order to provide the best and most comprehensive set of services to the research community.
A metadata reporting framework (FRAMES) for synthesis of ecohydrological observations
Christianson, Danielle S.; Varadharajan, Charuleka; Christoffersen, Bradley; ...
2017-06-20
Metadata describe the ancillary information needed for data interpretation, comparison across heterogeneous datasets, and quality control and quality assessment (QA/QC). Metadata enable the synthesis of diverse ecohydrological and biogeochemical observations, an essential step in advancing a predictive understanding of earth systems. Environmental observations can be taken across a wide range of spatiotemporal scales in a variety of measurement settings and approaches, and saved in multiple formats. Thus, well-organized, consistent metadata are required to produce usable data products from diverse observations collected in disparate field sites. However, existing metadata reporting protocols do not support the complex data synthesis needs of interdisciplinarymore » earth system research. We developed a metadata reporting framework (FRAMES) to enable predictive understanding of carbon cycling in tropical forests under global change. FRAMES adheres to best practices for data and metadata organization, enabling consistent data reporting and thus compatibility with a variety of standardized data protocols. We used an iterative scientist-centered design process to develop FRAMES. The resulting modular organization streamlines metadata reporting and can be expanded to incorporate additional data types. The flexible data reporting format incorporates existing field practices to maximize data-entry efficiency. With FRAMES’s multi-scale measurement position hierarchy, data can be reported at observed spatial resolutions and then easily aggregated and linked across measurement types to support model-data integration. FRAMES is in early use by both data providers and users. Here in this article, we describe FRAMES, identify lessons learned, and discuss areas of future development.« less
A metadata reporting framework (FRAMES) for synthesis of ecohydrological observations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Christianson, Danielle S.; Varadharajan, Charuleka; Christoffersen, Bradley
Metadata describe the ancillary information needed for data interpretation, comparison across heterogeneous datasets, and quality control and quality assessment (QA/QC). Metadata enable the synthesis of diverse ecohydrological and biogeochemical observations, an essential step in advancing a predictive understanding of earth systems. Environmental observations can be taken across a wide range of spatiotemporal scales in a variety of measurement settings and approaches, and saved in multiple formats. Thus, well-organized, consistent metadata are required to produce usable data products from diverse observations collected in disparate field sites. However, existing metadata reporting protocols do not support the complex data synthesis needs of interdisciplinarymore » earth system research. We developed a metadata reporting framework (FRAMES) to enable predictive understanding of carbon cycling in tropical forests under global change. FRAMES adheres to best practices for data and metadata organization, enabling consistent data reporting and thus compatibility with a variety of standardized data protocols. We used an iterative scientist-centered design process to develop FRAMES. The resulting modular organization streamlines metadata reporting and can be expanded to incorporate additional data types. The flexible data reporting format incorporates existing field practices to maximize data-entry efficiency. With FRAMES’s multi-scale measurement position hierarchy, data can be reported at observed spatial resolutions and then easily aggregated and linked across measurement types to support model-data integration. FRAMES is in early use by both data providers and users. Here in this article, we describe FRAMES, identify lessons learned, and discuss areas of future development.« less
Achieving interoperability for metadata registries using comparative object modeling.
Park, Yu Rang; Kim, Ju Han
2010-01-01
Achieving data interoperability between organizations relies upon agreed meaning and representation (metadata) of data. For managing and registering metadata, many organizations have built metadata registries (MDRs) in various domains based on international standard for MDR framework, ISO/IEC 11179. Following this trend, two pubic MDRs in biomedical domain have been created, United States Health Information Knowledgebase (USHIK) and cancer Data Standards Registry and Repository (caDSR), from U.S. Department of Health & Human Services and National Cancer Institute (NCI), respectively. Most MDRs are implemented with indiscriminate extending for satisfying organization-specific needs and solving semantic and structural limitation of ISO/IEC 11179. As a result it is difficult to address interoperability among multiple MDRs. In this paper, we propose an integrated metadata object model for achieving interoperability among multiple MDRs. To evaluate this model, we developed an XML Schema Definition (XSD)-based metadata exchange format. We created an XSD-based metadata exporter, supporting both the integrated metadata object model and organization-specific MDR formats.
Explorative Analyses of Nursing Research Data.
Kim, Hyeoneui; Jang, Imho; Quach, Jimmy; Richardson, Alex; Kim, Jaemin; Choi, Jeeyae
2016-10-26
As a first step of pursuing the vision of "big data science in nursing," we described the characteristics of nursing research data reported in 194 published nursing studies. We also explored how completely the Version 1 metadata specification of biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) represents these metadata. The metadata items of the nursing studies were all related to one or more of the bioCADDIE metadata entities. However, values of many metadata items of the nursing studies were not sufficiently represented through the bioCADDIE metadata. This was partly due to the differences in the scope of the content that the bioCADDIE metadata are designed to represent. The 194 nursing studies reported a total of 1,181 unique data items, the majority of which take non-numeric values. This indicates the importance of data standardization to enable the integrative analyses of these data to support big data science in nursing. © The Author(s) 2016.
Metadata Repository for Improved Data Sharing and Reuse Based on HL7 FHIR.
Ulrich, Hannes; Kock, Ann-Kristin; Duhm-Harbeck, Petra; Habermann, Jens K; Ingenerf, Josef
2016-01-01
Unreconciled data structures and formats are a common obstacle to the urgently required sharing and reuse of data within healthcare and medical research. Within the North German Tumor Bank of Colorectal Cancer, clinical and sample data, based on a harmonized data set, is collected and can be pooled by using a hospital-integrated Research Data Management System supporting biobank and study management. Adding further partners who are not using the core data set requires manual adaptations and mapping of data elements. Facing this manual intervention and focusing the reuse of heterogeneous healthcare instance data (value level) and data elements (metadata level), a metadata repository has been developed. The metadata repository is an ISO 11179-3 conformant server application built for annotating and mediating data elements. The implemented architecture includes the translation of metadata information about data elements into the FHIR standard using the FHIR Data Element resource with the ISO 11179 Data Element Extensions. The FHIR-based processing allows exchange of data elements with clinical and research IT systems as well as with other metadata systems. With increasingly annotated and harmonized data elements, data quality and integration can be improved for successfully enabling data analytics and decision support.
Semantic technologies improving the recall and precision of the Mercury metadata search engine
NASA Astrophysics Data System (ADS)
Pouchard, L. C.; Cook, R. B.; Green, J.; Palanisamy, G.; Noy, N.
2011-12-01
The Mercury federated metadata system [1] was developed at the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC), a NASA-sponsored effort holding datasets about biogeochemical dynamics, ecological data, and environmental processes. Mercury currently indexes over 100,000 records from several data providers conforming to community standards, e.g. EML, FGDC, FGDC Biological Profile, ISO 19115 and DIF. With the breadth of sciences represented in Mercury, the potential exists to address some key interdisciplinary scientific challenges related to climate change, its environmental and ecological impacts, and mitigation of these impacts. However, this wealth of metadata also hinders pinpointing datasets relevant to a particular inquiry. We implemented a semantic solution after concluding that traditional search approaches cannot improve the accuracy of the search results in this domain because: a) unlike everyday queries, scientific queries seek to return specific datasets with numerous parameters that may or may not be exposed to search (Deep Web queries); b) the relevance of a dataset cannot be judged by its popularity, as each scientific inquiry tends to be unique; and c)each domain science has its own terminology, more or less curated, consensual, and standardized depending on the domain. The same terms may refer to different concepts across domains (homonyms), but different terms mean the same thing (synonyms). Interdisciplinary research is arduous because an expert in a domain must become fluent in the language of another, just to find relevant datasets. Thus, we decided to use scientific ontologies because they can provide a context for a free-text search, in a way that string-based keywords never will. With added context, relevant datasets are more easily discoverable. To enable search and programmatic access to ontology entities in Mercury, we are using an instance of the BioPortal ontology repository. Mercury accesses ontology entities using the BioPortal REST API by passing a search parameter to BioPortal that may return domain context, parameter attribute, or entity annotations depending on the entity's associated ontological relationships. As Mercury's facetted search is popular with users, the results are displayed as facets. Unlike a facetted search however, the ontology-based solution implements both restrictions (improving precision) and expansions (improving recall) on the results of the initial search. For instance, "carbon" acquires a scientific context and additional key terms or phrases for discovering domain-specific datasets. A limitation of our solution is that the user must perform an additional step. Another limitation is that the quality of the newly discovered metadata is contingent upon the quality of the ontologies we use. Our solution leverages Mercury's federated capabilities to collect records from heterogeneous domains, and BioPortal's storage, curation and access capabilities for ontology entities. With minimal additional development, our approach builds on two mature systems for finding relevant datasets for interdisciplinary inquiries. We thus indicate a path forward for linking environmental, ecological and biological sciences. References: [1] Devarakonda, R., Palanisamy, G., Wilson, B. E., & Green, J. M. (2010). Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics, 3(1-2), 87-94.
Data Curation Education in Research Centers (DCERC)
NASA Astrophysics Data System (ADS)
Marlino, M. R.; Mayernik, M. S.; Kelly, K.; Allard, S.; Tenopir, C.; Palmer, C.; Varvel, V. E., Jr.
2012-12-01
Digital data both enable and constrain scientific research. Scientists are enabled by digital data to develop new research methods, utilize new data sources, and investigate new topics, but they also face new data collection, management, and preservation burdens. The current data workforce consists primarily of scientists who receive little formal training in data management and data managers who are typically educated through on-the-job training. The Data Curation Education in Research Centers (DCERC) program is investigating a new model for educating data professionals to contribute to scientific research. DCERC is a collaboration between the University of Illinois at Urbana-Champaign Graduate School of Library and Information Science, the University of Tennessee School of Information Sciences, and the National Center for Atmospheric Research. The program is organized around a foundations course in data curation and provides field experiences in research and data centers for both master's and doctoral students. This presentation will outline the aims and the structure of the DCERC program and discuss results and lessons learned from the first set of summer internships in 2012. Four masters students participated and worked with both data mentors and science mentors, gaining first hand experiences in the issues, methods, and challenges of scientific data curation. They engaged in a diverse set of topics, including climate model metadata, observational data management workflows, and data cleaning, documentation, and ingest processes within a data archive. The students learned current data management practices and challenges while developing expertise and conducting research. They also made important contributions to NCAR data and science teams by evaluating data management workflows and processes, preparing data sets to be archived, and developing recommendations for particular data management activities. The master's student interns will return in summer of 2013, and two Ph.D. students will conduct data curation-related dissertation fieldwork during the 2013-2014 academic year.
Habitat-Lite: A GSC case study based on free text terms for environmental metadata
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kyrpides, Nikos; Hirschman, Lynette; Clark, Cheryl
2008-04-01
There is an urgent need to capture metadata on the rapidly growing number of genomic, metagenomic and related sequences, such as 16S ribosomal genes. This need is a major focus within the Genomic Standards Consortium (GSC), and Habitat is a key metadata descriptor in the proposed 'Minimum Information about a Genome Sequence' (MIGS) specification. The goal of the work described here is to provide a light-weight, easy-to-use (small) set of terms ('Habitat-Lite') that captures high-level information about habitat while preserving a mapping to the recently launched Environment Ontology (EnvO). Our motivation for building Habitat-Lite is to meet the needs ofmore » multiple users, such as annotators curating these data, database providers hosting the data, and biologists and bioinformaticians alike who need to search and employ such data in comparative analyses. Here, we report a case study based on semi-automated identification of terms from GenBank and GOLD. We estimate that the terms in the initial version of Habitat-Lite would provide useful labels for over 60% of the kinds of information found in the GenBank isolation-source field, and around 85% of the terms in the GOLD habitat field. We present a revised version of Habitat-Lite and invite the community's feedback on its further development in order to provide a minimum list of terms to capture high-level habitat information and to provide classification bins needed for future studies.« less
The MAR databases: development and implementation of databases specific for marine metagenomics.
Klemetsen, Terje; Raknes, Inge A; Fu, Juan; Agafonov, Alexander; Balasundaram, Sudhagar V; Tartari, Giacomo; Robertsen, Espen; Willassen, Nils P
2018-01-04
We introduce the marine databases; MarRef, MarDB and MarCat (https://mmp.sfb.uit.no/databases/), which are publicly available resources that promote marine research and innovation. These data resources, which have been implemented in the Marine Metagenomics Portal (MMP) (https://mmp.sfb.uit.no/), are collections of richly annotated and manually curated contextual (metadata) and sequence databases representing three tiers of accuracy. While MarRef is a database for completely sequenced marine prokaryotic genomes, which represent a marine prokaryote reference genome database, MarDB includes all incomplete sequenced prokaryotic genomes regardless level of completeness. The last database, MarCat, represents a gene (protein) catalog of uncultivable (and cultivable) marine genes and proteins derived from marine metagenomics samples. The first versions of MarRef and MarDB contain 612 and 3726 records, respectively. Each record is built up of 106 metadata fields including attributes for sampling, sequencing, assembly and annotation in addition to the organism and taxonomic information. Currently, MarCat contains 1227 records with 55 metadata fields. Ontologies and controlled vocabularies are used in the contextual databases to enhance consistency. The user-friendly web interface lets the visitors browse, filter and search in the contextual databases and perform BLAST searches against the corresponding sequence databases. All contextual and sequence databases are freely accessible and downloadable from https://s1.sfb.uit.no/public/mar/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
NASA Astrophysics Data System (ADS)
Benedict, K. K.; Scott, S.
2013-12-01
While there has been a convergence towards a limited number of standards for representing knowledge (metadata) about geospatial (and other) data objects and collections, there exist a variety of community conventions around the specific use of those standards and within specific data discovery and access systems. This combination of limited (but multiple) standards and conventions creates a challenge for system developers that aspire to participate in multiple data infrastrucutres, each of which may use a different combination of standards and conventions. While Extensible Markup Language (XML) is a shared standard for encoding most metadata, traditional direct XML transformations (XSLT) from one standard to another often result in an imperfect transfer of information due to incomplete mapping from one standard's content model to another. This paper presents the work at the University of New Mexico's Earth Data Analysis Center (EDAC) in which a unified data and metadata management system has been developed in support of the storage, discovery and access of heterogeneous data products. This system, the Geographic Storage, Transformation and Retrieval Engine (GSTORE) platform has adopted a polyglot database model in which a combination of relational and document-based databases are used to store both data and metadata, with some metadata stored in a custom XML schema designed as a superset of the requirements for multiple target metadata standards: ISO 19115-2/19139/19110/19119, FGCD CSDGM (both with and without remote sensing extensions) and Dublin Core. Metadata stored within this schema is complemented by additional service, format and publisher information that is dynamically "injected" into produced metadata documents when they are requested from the system. While mapping from the underlying common metadata schema is relatively straightforward, the generation of valid metadata within each target standard is necessary but not sufficient for integration into multiple data infrastructures, as has been demonstrated through EDAC's testing and deployment of metadata into multiple external systems: Data.Gov, the GEOSS Registry, the DataONE network, the DSpace based institutional repository at UNM and semantic mediation systems developed as part of the NASA ACCESS ELSeWEB project. Each of these systems requires valid metadata as a first step, but to make most effective use of the delivered metadata each also has a set of conventions that are specific to the system. This presentation will provide an overview of the underlying metadata management model, the processes and web services that have been developed to automatically generate metadata in a variety of standard formats and highlight some of the specific modifications made to the output metadata content to support the different conventions used by the multiple metadata integration endpoints.
Metadata Means Communication: The Challenges of Producing Useful Metadata
NASA Astrophysics Data System (ADS)
Edwards, P. N.; Batcheller, A. L.
2010-12-01
Metadata are increasingly perceived as an important component of data sharing systems. For instance, metadata accompanying atmospheric model output may indicate the grid size, grid type, and parameter settings used in the model configuration. We conducted a case study of a data portal in the atmospheric sciences using in-depth interviews, document review, and observation. OUr analysis revealed a number of challenges in producing useful metadata. First, creating and managing metadata required considerable effort and expertise, yet responsibility for these tasks was ill-defined and diffused among many individuals, leading to errors, failure to capture metadata, and uncertainty about the quality of the primary data. Second, metadata ended up stored in many different forms and software tools, making it hard to manage versions and transfer between formats. Third, the exact meanings of metadata categories remained unsettled and misunderstood even among a small community of domain experts -- an effect we expect to be exacerbated when scientists from other disciplines wish to use these data. In practice, we found that metadata problems due to these obstacles are often overcome through informal, personal communication, such as conversations or email. We conclude that metadata serve to communicate the context of data production from the people who produce data to those who wish to use it. Thus while formal metadata systems are often public, critical elements of metadata (those embodied in informal communication) may never be recorded. Therefore, efforts to increase data sharing should include ways to facilitate inter-investigator communication. Instead of tackling metadata challenges only on the formal level, we can improve data usability for broader communities by better supporting metadata communication.
Data Management Rubric for Video Data in Organismal Biology.
Brainerd, Elizabeth L; Blob, Richard W; Hedrick, Tyson L; Creamer, Andrew T; Müller, Ulrike K
2017-07-01
Standards-based data management facilitates data preservation, discoverability, and access for effective data reuse within research groups and across communities of researchers. Data sharing requires community consensus on standards for data management, such as storage and formats for digital data preservation, metadata (i.e., contextual data about the data) that should be recorded and stored, and data access. Video imaging is a valuable tool for measuring time-varying phenotypes in organismal biology, with particular application for research in functional morphology, comparative biomechanics, and animal behavior. The raw data are the videos, but videos alone are not sufficient for scientific analysis. Nearly endless videos of animals can be found on YouTube and elsewhere on the web, but these videos have little value for scientific analysis because essential metadata such as true frame rate, spatial calibration, genus and species, weight, age, etc. of organisms, are generally unknown. We have embarked on a project to build community consensus on video data management and metadata standards for organismal biology research. We collected input from colleagues at early stages, organized an open workshop, "Establishing Standards for Video Data Management," at the Society for Integrative and Comparative Biology meeting in January 2017, and then collected two more rounds of input on revised versions of the standards. The result we present here is a rubric consisting of nine standards for video data management, with three levels within each standard: good, better, and best practices. The nine standards are: (1) data storage; (2) video file formats; (3) metadata linkage; (4) video data and metadata access; (5) contact information and acceptable use; (6) camera settings; (7) organism(s); (8) recording conditions; and (9) subject matter/topic. The first four standards address data preservation and interoperability for sharing, whereas standards 5-9 establish minimum metadata standards for organismal biology video, and suggest additional metadata that may be useful for some studies. This rubric was developed with substantial input from researchers and students, but still should be viewed as a living document that should be further refined and updated as technology and research practices change. The audience for these standards includes researchers, journals, and granting agencies, and also the developers and curators of databases that may contribute to video data sharing efforts. We offer this project as an example of building community consensus for data management, preservation, and sharing standards, which may be useful for future efforts by the organismal biology research community. © The Author 2017. Published by Oxford University Press on behalf of the Society for Integrative and Comparative Biology.
Data Management Rubric for Video Data in Organismal Biology
Brainerd, Elizabeth L.; Blob, Richard W.; Hedrick, Tyson L.; Creamer, Andrew T.; Müller, Ulrike K.
2017-01-01
Synopsis Standards-based data management facilitates data preservation, discoverability, and access for effective data reuse within research groups and across communities of researchers. Data sharing requires community consensus on standards for data management, such as storage and formats for digital data preservation, metadata (i.e., contextual data about the data) that should be recorded and stored, and data access. Video imaging is a valuable tool for measuring time-varying phenotypes in organismal biology, with particular application for research in functional morphology, comparative biomechanics, and animal behavior. The raw data are the videos, but videos alone are not sufficient for scientific analysis. Nearly endless videos of animals can be found on YouTube and elsewhere on the web, but these videos have little value for scientific analysis because essential metadata such as true frame rate, spatial calibration, genus and species, weight, age, etc. of organisms, are generally unknown. We have embarked on a project to build community consensus on video data management and metadata standards for organismal biology research. We collected input from colleagues at early stages, organized an open workshop, “Establishing Standards for Video Data Management,” at the Society for Integrative and Comparative Biology meeting in January 2017, and then collected two more rounds of input on revised versions of the standards. The result we present here is a rubric consisting of nine standards for video data management, with three levels within each standard: good, better, and best practices. The nine standards are: (1) data storage; (2) video file formats; (3) metadata linkage; (4) video data and metadata access; (5) contact information and acceptable use; (6) camera settings; (7) organism(s); (8) recording conditions; and (9) subject matter/topic. The first four standards address data preservation and interoperability for sharing, whereas standards 5–9 establish minimum metadata standards for organismal biology video, and suggest additional metadata that may be useful for some studies. This rubric was developed with substantial input from researchers and students, but still should be viewed as a living document that should be further refined and updated as technology and research practices change. The audience for these standards includes researchers, journals, and granting agencies, and also the developers and curators of databases that may contribute to video data sharing efforts. We offer this project as an example of building community consensus for data management, preservation, and sharing standards, which may be useful for future efforts by the organismal biology research community. PMID:28881939
NASA Astrophysics Data System (ADS)
Pignol, C.; Arnaud, F.; Godinho, E.; Galabertier, B.; Caillo, A.; Billy, I.; Augustin, L.; Calzas, M.; Rousseau, D. D.; Crosta, X.
2016-12-01
Managing scientific data is probably one the most challenging issues in modern science. In plaeosciences the question is made even more sensitive with the need of preserving and managing high value fragile geological samples: cores. Large international scientific programs, such as IODP or ICDP led intense effort to solve this problem and proposed detailed high standard work- and dataflows thorough core handling and curating. However many paleoscience results derived from small-scale research programs in which data and sample management is too often managed only locally - when it is… In this paper we present a national effort leads in France to develop an integrated system to curate ice and sediment cores. Under the umbrella of the national excellence equipment program CLIMCOR, we launched a reflexion about core curating and the management of associated fieldwork data. Our aim was then to conserve all data from fieldwork in an integrated cyber-environment which will evolve toward laboratory-acquired data storage in a near future. To do so, our demarche was conducted through an intimate relationship with field operators as well laboratory core curators in order to propose user-oriented solutions. The national core curating initiative proposes a single web portal in which all teams can store their fieldwork data. This portal is used as a national hub to attribute IGSNs. For legacy samples, this requires the establishment of a dedicated core list with associated metadata. However, for forthcoming core data, we developed a mobile application to capture technical and scientific data directly on the field. This application is linked with a unique coring-tools library and is adapted to most coring devices (gravity, drilling, percussion etc.) including multiple sections and holes coring operations. Those field data can be uploaded automatically to the national portal, but also referenced through international standards (IGSN and INSPIRE) and displayed in international portals (currently, NOAA's IMLGS). In this paper, we present the architecture of the integrated system, future perspectives and the approach we adopted to reach our goals. We will also present our mobile application through didactic examples.
ESGF and WDCC: The Double Structure of the Digital Data Storage at DKRZ
NASA Astrophysics Data System (ADS)
Toussaint, F.; Höck, H.
2016-12-01
Since a couple of years, Digital Repositories of climate science face new challenges: International projects are global collaborations. The data storage in parallel moved to federated, distributed storage systems like ESGF. For the long term archival storage (LTA) on the other hand, communities, funders, and data users make stronger demands for data and metadata quality to facilitate data use and reuse. At DKRZ, this situation led to a twofold data dissemination system - a situation which has influence on administration, workflows, and sustainability of the data. The ESGF system is focused on the needs of users as partners in global projects. It includes replication tools, detailed global project standards, and efficient search for the data to download. In contrast, DKRZ's classical CERA LTA storage aims for long term data holding and data curation as well as for data reuse requiring high metadata quality standards. In addition, for LTA data a Digital Object Identifier publication service for the direct integration of research data in scientific publications has been implemented. The editorial process at DKRZ-LTA ensures the quality of metadata and research data. The DOI and a citation code are provided and afterwards registered under DataCite's (datacite.org) regulations. In the overall data life cycle continuous reliability of the data and metadata quality is essential to allow for data handling at Petabytes level, data long term usability, and adequate publication of the results. These considerations lead to the question "What is quality" - with respect to data, to the repository itself, to the publisher, and the user? Global consensus is needed for these assessments as the phases of the end to end workflow gear into each other: For data and metadata, checks need to go hand in hand with the processes of production and storage. The results can be judged following a Quality Maturity Matrix (QMM). Repositories can be certified according to their trustworthiness. For the publication of any scientific conclusions, scientific community, funders, media, and policy makers ask for the publisher's impact in terms of readers' credit, run, and presentation quality. The paper describes the data life cycle. Emphasis is put on the different levels of quality assessment which at DKRZ ensure the data and metadata quality.
Digital Curation of Marine Physical Samples at Ocean Networks Canada
NASA Astrophysics Data System (ADS)
Jenkyns, R.; Tomlin, M. C.; Timmerman, R.
2015-12-01
Ocean Networks Canada (ONC) has collected hundreds of geological, biological and fluid samples from the water column and seafloor during its maintenance expeditions. These samples have been collected by Remotely Operated Vehicles (ROVs), divers, networked and autonomously deployed instruments, and rosettes. Subsequent measurements are used for scientific experiments, calibration of in-situ and remote sensors, monitoring of Marine Protected Areas, and environment characterization. Tracking the life cycles of these samples from collection to dissemination of results with all the pertinent documents (e.g., protocols, imagery, reports), metadata (e.g., location, identifiers, purpose, method) and data (e.g., measurements, taxonomic classification) is a challenge. The initial collection of samples is normally documented in SeaScribe (an ROV dive logging tool within ONC's Oceans 2.0 software) for which ONC has defined semantics and syntax. Next, samples are often sent to individual scientists and institutions (e.g., Royal BC Museum) for processing and storage, making acquisition of results and life cycle metadata difficult. Finally, this information needs to be retrieved and collated such that multiple user scenarios can be addressed. ONC aims to improve and extend its digital infrastructure for physical samples to support this complex array of samples, workflows and applications. However, in order to promote effective data discovery and exchange, interoperability and community standards must be an integral part of the design. Thus, integrating recommendations and outcomes of initiatives like the EarthCube iSamples working groups are essential. Use cases, existing tools, schemas and identifiers are reviewed, while remaining gaps and challenges are identified. The current status, selected approaches and possible future directions to enhance ONC's digital infrastructure for each sample type are presented.
Trustworthy Digital Repositories: Building Trust the Old Fashion Way, EARNING IT.
NASA Astrophysics Data System (ADS)
Kinkade, D.; Chandler, C. L.; Shepherd, A.; Rauch, S.; Groman, R. C.; Wiebe, P. H.; Glover, D. M.; Allison, M. D.; Copley, N. J.; Ake, H.; York, A.
2016-12-01
There are several drivers increasing the importance of high quality data management and curation in today's research process (e.g., OSTP PARR memo, journal publishers, funders, academic and private institutions), and proper management is necessary throughout the data lifecycle to enable reuse and reproducibility of results. Many digital data repositories are capable of satisfying the basic management needs of an investigator looking to share their data (i.e., publish data in the public domain), but repository services vary greatly and not all provide mature services that facilitate discovery, access, and reuse of research data. Domain-specific repositories play a vital role in the data curation process by working closely with investigators to create robust metadata, perform first order QC, and assemble and publish research data. In addition, they may employ technologies and services that enable increased discovery, access, and long-term archive. However, smaller domain facilities operate in varying states of capacity and curation ability. Within this repository environment, individual investigators (driven by publishers, funders, or institutions) need to find trustworthy repositories for their data; and funders need to direct investigators to quality repositories to ensure return on their investment. So, how can one determine the best home for valuable research data? Metrics can be applied to varying aspects of data curation, and many credentialing organizations offer services that assess and certify the trustworthiness of a given data management facility. Unfortunately, many of these certifications can be inaccessible to a small repository in cost, time, or scope. Are there alternatives? This presentation will discuss methods and approaches used by the Biological and Chemical Oceanography Data Management Office (BCO-DMO; a domain-specific, intermediate digital data repository) to demonstrate trustworthiness in the face of a daunting accreditation landscape.
Workflows for ingest of research data into digital archives - tests with Archivematica
NASA Astrophysics Data System (ADS)
Kirchner, I.; Bertelmann, R.; Gebauer, P.; Hasler, T.; Hirt, M.; Klump, J. F.; Peters-Kotting, W.; Rusch, B.; Ulbricht, D.
2013-12-01
Publication of research data and future re-use of measured data require the long-term preservation of digital objects. The ISO OAIS reference model defines responsibilities for long-term preservation of digital objects and although there is software available to support preservation of digital data, there are still problems remaining to be solved. A key task in preservation is to make the datasets ready for ingest into the archive, which is called the creation of Submission Information Packages (SIPs) in the OAIS model. This includes the creation of appropriate preservation metadata. Scientists need to be trained to deal with different types of data and to heighten their awareness for quality metadata. Other problems arise during the assembly of SIPs and during ingest into the archive because file format validators may produce conflicting output for identical data files and these conflicts are difficult to resolve automatically. Also, validation and identification tools are notorious for their poor performance. In the project EWIG Zuse-Institute Berlin acts as an infrastructure facility, while the Institute for Meteorology at FU Berlin and the German research Centre for Geosciences GFZ act as two different data producers. The aim of the project is to develop workflows for the transfer of research data into digital archives and the future re-use of data from long-term archives with emphasis on data from the geosciences. The technical work is supplemented by interviews with data practitioners at several institutions to identify problems in digital preservation workflows and by the development of university teaching materials to train students in the curation of research data and metadata. The free and open-source software Archivematica [1] is used as digital preservation system. The creation and ingest of SIPs has to meet several archival standards and be compatible to the Metadata Encoding and Transmission Standard (METS). The two data producers use different software in their workflows to test the assembly of SIPs and ingest of SIPs into the archive. GFZ Potsdam uses a combination of eSciDoc [2], panMetaDocs [3], and bagit [4] to collect research data and assemble SIPs for ingest into Archivematica, while the Institute for Meteorology at FU Berlin evaluates a variety of software solutions to describe data and publications and to generate SIPs. [1] http://www.archivematica.org [2] http://www.escidoc.org [3] http://panmetadocs.sf.net [4] http://sourceforge.net/projects/loc-xferutils/
NASA Astrophysics Data System (ADS)
Thomas, R.; Connell, D.; Spears, T.; Leadbetter, A.; Burger, E. F.
2016-12-01
The scientific literature heavily features small-scale studies with the impact of the results extrapolated to regional/global importance. There are on-going initiatives (e.g. OA-ICC, GOA-ON, GEOTRACES, EMODNet Chemistry) aiming to assemble regional to global-scale datasets that are available for trend or meta-analyses. Assessing the quality and comparability of these data requires information about the processing chain from "sampling to spreadsheet". This provenance information needs to be captured and readily available to assess data fitness for purpose. The NOAA Ocean Acidification metadata template was designed in consultation with domain experts for this reason; the core carbonate chemistry variables have 23-37 metadata fields each and for scientists generating these datasets there could appear to be an ever increasing amount of metadata expected to accompany a dataset. While this provenance metadata should be considered essential by those generating or using the data, for those discovering data there is a sliding scale between what is considered discovery metadata (title, abstract, contacts, etc.) versus usage metadata (methodology, environmental setup, lineage, etc.), the split depending on the intended use of data. As part of the OA-ICC's activities, the metadata fields from the NOAA template relevant to the sample processing chain and QA criteria have been factored to develop profiles for, and extensions to, the OM-JSON encoding supported by the PROV ontology. While this work started focused on carbonate chemistry variable specific metadata, the factorization could be applied within the O&M model across other disciplines such as trace metals or contaminants. In a linked data world with a suitable high level model for sample processing and QA available, tools and support can be provided to link reproducible units of metadata (e.g. the standard protocol for a variable as adopted by a community) and simplify the provision of metadata and subsequent discovery.
Metadata Creation, Management and Search System for your Scientific Data
NASA Astrophysics Data System (ADS)
Devarakonda, R.; Palanisamy, G.
2012-12-01
Mercury Search Systems is a set of tools for creating, searching, and retrieving of biogeochemical metadata. Mercury toolset provides orders of magnitude improvements in search speed, support for any metadata format, integration with Google Maps for spatial queries, multi-facetted type search, search suggestions, support for RSS (Really Simple Syndication) delivery of search results, and enhanced customization to meet the needs of the multiple projects that use Mercury. Mercury's metadata editor provides a easy way for creating metadata and Mercury's search interface provides a single portal to search for data and information contained in disparate data management systems, each of which may use any metadata format including FGDC, ISO-19115, Dublin-Core, Darwin-Core, DIF, ECHO, and EML. Mercury harvests metadata and key data from contributing project servers distributed around the world and builds a centralized index. The search interfaces then allow the users to perform a variety of fielded, spatial, and temporal searches across these metadata sources. This centralized repository of metadata with distributed data sources provides extremely fast search results to the user, while allowing data providers to advertise the availability of their data and maintain complete control and ownership of that data. Mercury is being used more than 14 different projects across 4 federal agencies. It was originally developed for NASA, with continuing development funded by NASA, USGS, and DOE for a consortium of projects. Mercury search won the NASA's Earth Science Data Systems Software Reuse Award in 2008. References: R. Devarakonda, G. Palanisamy, B.E. Wilson, and J.M. Green, "Mercury: reusable metadata management data discovery and access system", Earth Science Informatics, vol. 3, no. 1, pp. 87-94, May 2010. R. Devarakonda, G. Palanisamy, J.M. Green, B.E. Wilson, "Data sharing and retrieval using OAI-PMH", Earth Science Informatics DOI: 10.1007/s12145-010-0073-0, (2010);
Metadata for data rescue and data at risk
Anderson, William L.; Faundeen, John L.; Greenberg, Jane; Taylor, Fraser
2011-01-01
Scientific data age, become stale, fall into disuse and run tremendous risks of being forgotten and lost. These problems can be addressed by archiving and managing scientific data over time, and establishing practices that facilitate data discovery and reuse. Metadata documentation is integral to this work and essential for measuring and assessing high priority data preservation cases. The International Council for Science: Committee on Data for Science and Technology (CODATA) has a newly appointed Data-at-Risk Task Group (DARTG), participating in the general arena of rescuing data. The DARTG primary objective is building an inventory of scientific data that are at risk of being lost forever. As part of this effort, the DARTG is testing an approach for documenting endangered datasets. The DARTG is developing a minimal and easy to use set of metadata properties for sufficiently describing endangered data, which will aid global data rescue missions. The DARTG metadata framework supports rapid capture, and easy documentation, across an array of scientific domains. This paper reports on the goals and principles supporting the DARTG metadata schema, and provides a description of the preliminary implementation.
The ISS SOLAR payload data preservation in the frame of the PERICLES FP-7 project: metadata aspects.
NASA Astrophysics Data System (ADS)
Muller, Christian; Pandey, Praveen
2016-04-01
PERICLES (Promoting and Enhancing the Reuse of Information throughout the Content Lifecycle exploiting Evolving Semantics) is an FP7 project started on February 2013. It aims at preserving by design large and complex data sets. PERICLES is coordinated by King's College London, UK and its partners are University of Borås (Sweden), CERTH- ITI (Greece), DotSoft (Greece), Georg-August-Universität Göttingen (Germany), University of Liverpool (UK), Space Application Services (Belgium), XEROX France and University of Edinburgh (UK). Two additional partners provide the two case studies: Tate Gallery (UK) brings the digital art and media case study and B.USOC (Belgian Users Support and Operations Centre) brings the space science case study . PERICLES addresses the life-cycle of large and complex data sets in order to cater for the evolution of context of data sets and user communities, including groups unanticipated when the data was created. Semantics of data sets are thus also expected to evolve and the project includes elements which could address the reuse of data sets at periods where the data providers and even their institutions are not available any more. PERICLES uses the Linked Resources Model (LRM) which will be compared with the OAIS standard. In this study we present the space science case associated with PERICLES. B.USOC supports experiments on the International Space Station and is the curator of the collected data and operations history. B.USOC has chosen to analyse the SOLAR payload flying since 2008 on the ESA COLUMBUS module of the ISS as the PERICLES prime space science case. Solar observation data are prime candidates for long term data preservation as variabilities of the solar spectral irradiance have an influence on earth climate. The nature of the data to be preserved for the reuse of the current SOLAR series is much more extended than a simple set of time tagged tables of spectral irradiances, it is an important inventory of more than 50 classes of documents and/or metadata directly relevant to SOLAR, it includes not only the calibration elements and their evolution during SOLAR lifetime but also all tools used as the software's and versions used in interpretation. In order to foster potential user's present and future needs, there is an ongoing survey involving a community of practice. Its initial outcome will be presented here. The aspects of data appraisal and metadata constitution in the frame of the PERICLES project will be developed.
Credentialing Data Scientists: A Domain Repository Perspective
NASA Astrophysics Data System (ADS)
Lehnert, K. A.; Furukawa, H.
2015-12-01
A career in data science can have many paths: data curation, data analysis, metadata modeling - all of these in different commercial or scientific applications. Can a certification as 'data scientist' provide the guarantee that an applicant or candidate for a data science position has just the right skills? How valuable is a 'generic' certification as data scientist for an employer looking to fill a data science position? Credentials that are more specific and discipline-oriented may be more valuable to both the employer and the job candidate. One employment sector for data scientists are the data repositories that provide discipline-specific data services for science communities. Data science positions within domain repositories include a wide range of responsibilities in support of the full data life cycle - from data preservation and curation to development of data models, ontologies, and user interfaces, to development of data analysis and visualization tools to community education and outreach, and require a substantial degree of discipline-specific knowledge of scientific data acquisition and analysis workflows, data quality measures, and data cultures. Can there be certification programs for domain-specific data scientists that help build the urgently needed workforce for the repositories? The American Geophysical Union has recently started an initiative to develop a program for data science continuing education and data science professional certification for the Earth and space sciences. An Editorial Board has been charged to identify and develop curricula and content for these programs and to provide input and feedback in the implementation of the program. This presentation will report on the progress of this initiative and evaluate its utility for the needs of domain repositories in the Earth and space sciences.
Data Stewardship in the Ocean Sciences Needs to Include Physical Samples
NASA Astrophysics Data System (ADS)
Carter, M.; Lehnert, K.
2016-02-01
Across the Ocean Sciences, research involves the collection and study of samples collected above, at, and below the seafloor, including but not limited to rocks, sediments, fluids, gases, and living organisms. Many domains in the Earth Sciences have recently expressed the need for better discovery, access, and sharing of scientific samples and collections (EarthCube End-User Domain workshops, 2012 and 2013, http://earthcube.org/info/about/end-user-workshops), as has the US government (OSTP Memo, March 2014). iSamples (Internet of Samples in the Earth Sciences) is a Research Coordination Network within the EarthCube program that aims to advance the use of innovative cyberinfrastructure to support and advance the utility of physical samples and sample collections for science and ensure reproducibility of sample-based data and research results. iSamples strives to build, grow, and foster a new community of practice, in which domain scientists, curators of sample repositories and collections, computer and information scientists, software developers and technology innovators engage in and collaborate on defining, articulating, and addressing the needs and challenges of physical samples as a critical component of digital data infrastructure. A primary goal of iSamples is to deliver a community-endorsed set of best practices and standards for the registration, description, identification, and citation of physical specimens and define an actionable plan for implementation. iSamples conducted a broad community survey about sample sharing and has created 5 different working groups to address the different challenges of developing the internet of samples - from metadata schemas and unique identifiers to an architecture for a shared cyberinfrastructure to manage collections, to digitization of existing collections, to education, and ultimately to establishing the physical infrastructure that will ensure preservation and access of the physical samples. Repositories that curate marine sediment cores and dredge samples from the oceanic crust are participating in iSamples, but many other samples collected in the Ocean sciences are not yet represented. This presentation aims to engage a wider spectrum of Ocean scientists and sample curators in iSamples.
FRAMES Metadata Reporting Templates for Ecohydrological Observations, version 1.1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Christianson, Danielle; Varadharajan, Charuleka; Christoffersen, Brad
FRAMES is a a set of Excel metadata files and package-level descriptive metadata that are designed to facilitate and improve capture of desired metadata for ecohydrological observations. The metadata are bundled with data files into a data package and submitted to a data repository (e.g. the NGEE Tropics Data Repository) via a web form. FRAMES standardizes reporting of diverse ecohydrological and biogeochemical data for synthesis across a range of spatiotemporal scales and incorporates many best data science practices. This version of FRAMES supports observations for primarily automated measurements collected by permanently located sensors, including sap flow (tree water use), leafmore » surface temperature, soil water content, dendrometry (stem diameter growth increment), and solar radiation. Version 1.1 extend the controlled vocabulary and incorporates functionality to facilitate programmatic use of data and FRAMES metadata (R code available at NGEE Tropics Data Repository).« less
NASA Astrophysics Data System (ADS)
Rogowitz, Bernice E.; Matasci, Naim
2011-03-01
The explosion of online scientific data from experiments, simulations, and observations has given rise to an avalanche of algorithmic, visualization and imaging methods. There has also been enormous growth in the introduction of tools that provide interactive interfaces for exploring these data dynamically. Most systems, however, do not support the realtime exploration of patterns and relationships across tools and do not provide guidance on which colors, colormaps or visual metaphors will be most effective. In this paper, we introduce a general architecture for sharing metadata between applications and a "Metadata Mapper" component that allows the analyst to decide how metadata from one component should be represented in another, guided by perceptual rules. This system is designed to support "brushing [1]," in which highlighting a region of interest in one application automatically highlights corresponding values in another, allowing the scientist to develop insights from multiple sources. Our work builds on the component-based iPlant Cyberinfrastructure [2] and provides a general approach to supporting interactive, exploration across independent visualization and visual analysis components.
PhysiomeSpace: digital library service for biomedical data
Testi, Debora; Quadrani, Paolo; Viceconti, Marco
2010-01-01
Every research laboratory has a wealth of biomedical data locked up, which, if shared with other experts, could dramatically improve biomedical and healthcare research. With the PhysiomeSpace service, it is now possible with a few clicks to share with selected users biomedical data in an easy, controlled and safe way. The digital library service is managed using a client–server approach. The client application is used to import, fuse and enrich the data information according to the PhysiomeSpace resource ontology and upload/download the data to the library. The server services are hosted on the Biomed Town community portal, where through a web interface, the user can complete the metadata curation and share and/or publish the data resources. A search service capitalizes on the domain ontology and on the enrichment of metadata for each resource, providing a powerful discovery environment. Once the users have found the data resources they are interested in, they can add them to their basket, following a metaphor popular in e-commerce web sites. When all the necessary resources have been selected, the user can download the basket contents into the client application. The digital library service is now in beta and open to the biomedical research community. PMID:20478910
PhysiomeSpace: digital library service for biomedical data.
Testi, Debora; Quadrani, Paolo; Viceconti, Marco
2010-06-28
Every research laboratory has a wealth of biomedical data locked up, which, if shared with other experts, could dramatically improve biomedical and healthcare research. With the PhysiomeSpace service, it is now possible with a few clicks to share with selected users biomedical data in an easy, controlled and safe way. The digital library service is managed using a client-server approach. The client application is used to import, fuse and enrich the data information according to the PhysiomeSpace resource ontology and upload/download the data to the library. The server services are hosted on the Biomed Town community portal, where through a web interface, the user can complete the metadata curation and share and/or publish the data resources. A search service capitalizes on the domain ontology and on the enrichment of metadata for each resource, providing a powerful discovery environment. Once the users have found the data resources they are interested in, they can add them to their basket, following a metaphor popular in e-commerce web sites. When all the necessary resources have been selected, the user can download the basket contents into the client application. The digital library service is now in beta and open to the biomedical research community.
Comprehensive Optimal Manpower and Personnel Analytic Simulation System (COMPASS)
2009-10-01
4 The EDB consists of 4 major components (some of which are re-usable): 1. Metadata Editor ( MDE ): Also considered a leaf node, the metadata...end-user queries via the QB. The EDB supports multiple instances of the MDE , although currently, only a single instance is recommended. 2 Query...the MSB is a central collection of web services, responsible for the authentication and authorization of users, maintenance of the EDB metadata
Curating and Integrating Data from Multiple Sources to Support Healthcare Analytics.
Ng, Kenney; Kakkanatt, Chris; Benigno, Michael; Thompson, Clay; Jackson, Margaret; Cahan, Amos; Zhu, Xinxin; Zhang, Ping; Huang, Paul
2015-01-01
As the volume and variety of healthcare related data continues to grow, the analysis and use of this data will increasingly depend on the ability to appropriately collect, curate and integrate disparate data from many different sources. We describe our approach to and highlight our experiences with the development of a robust data collection, curation and integration infrastructure that supports healthcare analytics. This system has been successfully applied to the processing of a variety of data types including clinical data from electronic health records and observational studies, genomic data, microbiomic data, self-reported data from surveys and self-tracked data from wearable devices from over 600 subjects. The curated data is currently being used to support healthcare analytic applications such as data visualization, patient stratification and predictive modeling.
DIANA-LncBase v2: indexing microRNA targets on non-coding transcripts
Paraskevopoulou, Maria D.; Vlachos, Ioannis S.; Karagkouni, Dimitra; Georgakilas, Georgios; Kanellos, Ilias; Vergoulis, Thanasis; Zagganas, Konstantinos; Tsanakas, Panayiotis; Floros, Evangelos; Dalamagas, Theodore; Hatzigeorgiou, Artemis G.
2016-01-01
microRNAs (miRNAs) are short non-coding RNAs (ncRNAs) that act as post-transcriptional regulators of coding gene expression. Long non-coding RNAs (lncRNAs) have been recently reported to interact with miRNAs. The sponge-like function of lncRNAs introduces an extra layer of complexity in the miRNA interactome. DIANA-LncBase v1 provided a database of experimentally supported and in silico predicted miRNA Recognition Elements (MREs) on lncRNAs. The second version of LncBase (www.microrna.gr/LncBase) presents an extensive collection of miRNA:lncRNA interactions. The significantly enhanced database includes more than 70 000 low and high-throughput, (in)direct miRNA:lncRNA experimentally supported interactions, derived from manually curated publications and the analysis of 153 AGO CLIP-Seq libraries. The new experimental module presents a 14-fold increase compared to the previous release. LncBase v2 hosts in silico predicted miRNA targets on lncRNAs, identified with the DIANA-microT algorithm. The relevant module provides millions of predicted miRNA binding sites, accompanied with detailed metadata and MRE conservation metrics. LncBase v2 caters information regarding cell type specific miRNA:lncRNA regulation and enables users to easily identify interactions in 66 different cell types, spanning 36 tissues for human and mouse. Database entries are also supported by accurate lncRNA expression information, derived from the analysis of more than 6 billion RNA-Seq reads. PMID:26612864
Architecture for the Interdisciplinary Earth Data Alliance
NASA Astrophysics Data System (ADS)
Richard, S. M.
2016-12-01
The Interdisciplinary Earth Data Alliance (IEDA) is leading an EarthCube (EC) Integrative Activity to develop a governance structure and technology framework that enables partner data systems to share technology, infrastructure, and practice for documenting, curating, and accessing heterogeneous geoscience data. The IEDA data facility provides capabilities in an extensible framework that enables domain-specific requirements for each partner system in the Alliance to be integrated into standardized cross-domain workflows. The shared technology infrastructure includes a data submission hub, a domain-agnostic file-based repository, an integrated Alliance catalog and a Data Browser for data discovery across all partner holdings, as well as services for registering identifiers for datasets (DOI) and samples (IGSN). The submission hub will be a platform that facilitates acquisition of cross-domain resource documentation and channels users into domain and resource-specific workflows tailored for each partner community. We are exploring an event-based message bus architecture with a standardized plug-in interface for adding capabilities. This architecture builds on the EC CINERGI metadata pipeline as well as the message-based architecture of the SEAD project. Plug-in components for file introspection to match entities to a data type registry (extending EC Digital Crust and Research Data Alliance work), extract standardized keywords (using CINERGI components), location, cruise, personnel and other metadata linkage information (building on GeoLink and existing IEDA partner components). The submission hub will feed submissions to appropriate partner repositories and service endpoints targeted by domain and resource type for distribution. The Alliance governance will adopt patterns (vocabularies, operations, resource types) for self-describing data services using standard HTTP protocol for simplified data access (building on EC GeoWS and other `RESTful' approaches). Exposure of resource descriptions (datasets and service distributions) for harvesting by commercial search engines as well as geoscience-data focused crawlers (like EC B-Cube crawler) will increase discoverability of IEDA resources with minimal effort by curators.
OzFlux data: network integration from collection to curation
NASA Astrophysics Data System (ADS)
Isaac, Peter; Cleverly, James; McHugh, Ian; van Gorsel, Eva; Ewenz, Cacilia; Beringer, Jason
2017-06-01
Measurement of the exchange of energy and mass between the surface and the atmospheric boundary-layer by the eddy covariance technique has undergone great change in the last 2 decades. Early studies of these exchanges were confined to brief field campaigns in carefully controlled conditions followed by months of data analysis. Current practice is to run tower-based eddy covariance systems continuously over several years due to the need for continuous monitoring as part of a global effort to develop local-, regional-, continental- and global-scale budgets of carbon, water and energy. Efficient methods of processing the increased quantities of data are needed to maximise the time available for analysis and interpretation. Standardised methods are needed to remove differences in data processing as possible contributors to observed spatial variability. Furthermore, public availability of these data sets assists with undertaking global research efforts. The OzFlux data path has been developed (i) to provide a standard set of quality control and post-processing tools across the network, thereby facilitating inter-site integration and spatial comparisons; (ii) to increase the time available to researchers for analysis and interpretation by reducing the time spent collecting and processing data; (iii) to propagate both data and metadata to the final product; and (iv) to facilitate the use of the OzFlux data by adopting a standard file format and making the data available from web-based portals. Discovery of the OzFlux data set is facilitated through incorporation in FLUXNET data syntheses and the publication of collection metadata via the RIF-CS format. This paper serves two purposes. The first is to describe the data sets, along with their quality control and post-processing, for the other papers of this Special Issue. The second is to provide an example of one solution to the data collection and curation challenges that are encountered by similar flux tower networks worldwide.
A case for user-generated sensor metadata
NASA Astrophysics Data System (ADS)
Nüst, Daniel
2015-04-01
Cheap and easy to use sensing technology and new developments in ICT towards a global network of sensors and actuators promise previously unthought of changes for our understanding of the environment. Large professional as well as amateur sensor networks exist, and they are used for specific yet diverse applications across domains such as hydrology, meteorology or early warning systems. However the impact this "abundance of sensors" had so far is somewhat disappointing. There is a gap between (community-driven) sensor networks that could provide very useful data and the users of the data. In our presentation, we argue this is due to a lack of metadata which allows determining the fitness of use of a dataset. Syntactic or semantic interoperability for sensor webs have made great progress and continue to be an active field of research, yet they often are quite complex, which is of course due to the complexity of the problem at hand. But still, we see the most generic information to determine fitness for use is a dataset's provenance, because it allows users to make up their own minds independently from existing classification schemes for data quality. In this work we will make the case how curated user-contributed metadata has the potential to improve this situation. This especially applies for scenarios in which an observed property is applicable in different domains, and for set-ups where the understanding about metadata concepts and (meta-)data quality differs between data provider and user. On the one hand a citizen does not understand the ISO provenance metadata. On the other hand a researcher might find issues in publicly accessible time series published by citizens, which the latter might not be aware of or care about. Because users will have to determine fitness for use for each application on their own anyway, we suggest an online collaboration platform for user-generated metadata based on an extremely simplified data model. In the most basic fashion, metadata generated by users can be boiled down to a basic property of the world wide web: many information items, such as news or blog posts, allow users to create comments and rate the content. Therefore we argue to focus a core data model on one text field for a textual comment, one optional numerical field for a rating, and a resolvable identifier for the dataset that is commented on. We present a conceptual framework that integrates user comments in existing standards and relevant applications of online sensor networks and discuss possible approaches, such as linked data, brokering, or standalone metadata portals. We relate this framework to existing work in user generated content, such as proprietary rating systems on commercial websites, microformats, the GeoViQua User Quality Model, the CHARMe annotations, or W3C Open Annotation. These systems are also explored for commonalities and based on their very useful concepts and ideas; we present an outline for future extensions of the minimal model. Building on this framework we present a concept how a simplistic comment-rating-system can be extended to capture provenance information for spatio-temporal observations in the sensor web, and how this framework can be evaluated.
Metadata management for high content screening in OMERO
Li, Simon; Besson, Sébastien; Blackburn, Colin; Carroll, Mark; Ferguson, Richard K.; Flynn, Helen; Gillen, Kenneth; Leigh, Roger; Lindner, Dominik; Linkert, Melissa; Moore, William J.; Ramalingam, Balaji; Rozbicki, Emil; Rustici, Gabriella; Tarkowska, Aleksandra; Walczysko, Petr; Williams, Eleanor; Allan, Chris; Burel, Jean-Marie; Moore, Josh; Swedlow, Jason R.
2016-01-01
High content screening (HCS) experiments create a classic data management challenge—multiple, large sets of heterogeneous structured and unstructured data, that must be integrated and linked to produce a set of “final” results. These different data include images, reagents, protocols, analytic output, and phenotypes, all of which must be stored, linked and made accessible for users, scientists, collaborators and where appropriate the wider community. The OME Consortium has built several open source tools for managing, linking and sharing these different types of data. The OME Data Model is a metadata specification that supports the image data and metadata recorded in HCS experiments. Bio-Formats is a Java library that reads recorded image data and metadata and includes support for several HCS screening systems. OMERO is an enterprise data management application that integrates image data, experimental and analytic metadata and makes them accessible for visualization, mining, sharing and downstream analysis. We discuss how Bio-Formats and OMERO handle these different data types, and how they can be used to integrate, link and share HCS experiments in facilities and public data repositories. OME specifications and software are open source and are available at https://www.openmicroscopy.org. PMID:26476368
Metadata management for high content screening in OMERO.
Li, Simon; Besson, Sébastien; Blackburn, Colin; Carroll, Mark; Ferguson, Richard K; Flynn, Helen; Gillen, Kenneth; Leigh, Roger; Lindner, Dominik; Linkert, Melissa; Moore, William J; Ramalingam, Balaji; Rozbicki, Emil; Rustici, Gabriella; Tarkowska, Aleksandra; Walczysko, Petr; Williams, Eleanor; Allan, Chris; Burel, Jean-Marie; Moore, Josh; Swedlow, Jason R
2016-03-01
High content screening (HCS) experiments create a classic data management challenge-multiple, large sets of heterogeneous structured and unstructured data, that must be integrated and linked to produce a set of "final" results. These different data include images, reagents, protocols, analytic output, and phenotypes, all of which must be stored, linked and made accessible for users, scientists, collaborators and where appropriate the wider community. The OME Consortium has built several open source tools for managing, linking and sharing these different types of data. The OME Data Model is a metadata specification that supports the image data and metadata recorded in HCS experiments. Bio-Formats is a Java library that reads recorded image data and metadata and includes support for several HCS screening systems. OMERO is an enterprise data management application that integrates image data, experimental and analytic metadata and makes them accessible for visualization, mining, sharing and downstream analysis. We discuss how Bio-Formats and OMERO handle these different data types, and how they can be used to integrate, link and share HCS experiments in facilities and public data repositories. OME specifications and software are open source and are available at https://www.openmicroscopy.org. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Environmental Information Management For Data Discovery and Access System
NASA Astrophysics Data System (ADS)
Giriprakash, P.
2011-01-01
Mercury is a federated metadata harvesting, search and retrieval tool based on both open source software and software developed at Oak Ridge National Laboratory. It was originally developed for NASA, and the Mercury development consortium now includes funding from NASA, USGS, and DOE. A major new version of Mercury was developed during 2007 and released in early 2008. This new version provides orders of magnitude improvements in search speed, support for additional metadata formats, integration with Google Maps for spatial queries, support for RSS delivery of search results, and ready customization to meet the needs of the multiple projects which use Mercury. For the end users, Mercury provides a single portal to very quickly search for data and information contained in disparate data management systems. It collects metadata and key data from contributing project servers distributed around the world and builds a centralized index. The Mercury search interfaces then allow ! the users to perform simple, fielded, spatial and temporal searches across these metadata sources. This centralized repository of metadata with distributed data sources provides extremely fast search results to the user, while allowing data providers to advertise the availability of their data and maintain complete control and ownership of that data.
Howe, E.A.; de Souza, A.; Lahr, D.L.; Chatwin, S.; Montgomery, P.; Alexander, B.R.; Nguyen, D.-T.; Cruz, Y.; Stonich, D.A.; Walzer, G.; Rose, J.T.; Picard, S.C.; Liu, Z.; Rose, J.N.; Xiang, X.; Asiedu, J.; Durkin, D.; Levine, J.; Yang, J.J.; Schürer, S.C.; Braisted, J.C.; Southall, N.; Southern, M.R.; Chung, T.D.Y.; Brudz, S.; Tanega, C.; Schreiber, S.L.; Bittker, J.A.; Guha, R.; Clemons, P.A.
2015-01-01
BARD, the BioAssay Research Database (https://bard.nih.gov/) is a public database and suite of tools developed to provide access to bioassay data produced by the NIH Molecular Libraries Program (MLP). Data from 631 MLP projects were migrated to a new structured vocabulary designed to capture bioassay data in a formalized manner, with particular emphasis placed on the description of assay protocols. New data can be submitted to BARD with a user-friendly set of tools that assist in the creation of appropriately formatted datasets and assay definitions. Data published through the BARD application program interface (API) can be accessed by researchers using web-based query tools or a desktop client. Third-party developers wishing to create new tools can use the API to produce stand-alone tools or new plug-ins that can be integrated into BARD. The entire BARD suite of tools therefore supports three classes of researcher: those who wish to publish data, those who wish to mine data for testable hypotheses, and those in the developer community who wish to build tools that leverage this carefully curated chemical biology resource. PMID:25477388
The Role of Community-Driven Data Curation for Enterprises
NASA Astrophysics Data System (ADS)
Curry, Edward; Freitas, Andre; O'Riáin, Sean
With increased utilization of data within their operational and strategic processes, enterprises need to ensure data quality and accuracy. Data curation is a process that can ensure the quality of data and its fitness for use. Traditional approaches to curation are struggling with increased data volumes, and near real-time demands for curated data. In response, curation teams have turned to community crowd-sourcing and semi-automatedmetadata tools for assistance. This chapter provides an overview of data curation, discusses the business motivations for curating data and investigates the role of community-based data curation, focusing on internal communities and pre-competitive data collaborations. The chapter is supported by case studies from Wikipedia, The New York Times, Thomson Reuters, Protein Data Bank and ChemSpider upon which best practices for both social and technical aspects of community-driven data curation are described.
Researcher-library collaborations: Data repositories as a service for researchers.
Gordon, Andrew S; Millman, David S; Steiger, Lisa; Adolph, Karen E; Gilmore, Rick O
New interest has arisen in organizing, preserving, and sharing the raw materials-the data and metadata-that undergird the published products of research. Library and information scientists have valuable expertise to bring to bear in the effort to create larger, more diverse, and more widely used data repositories. However, for libraries to be maximally successful in providing the research data management and preservation services required of a successful data repository, librarians must work closely with researchers and learn about their data management workflows. Databrary is a data repository that is closely linked to the needs of a specific scholarly community-researchers who use video as a main source of data to study child development and learning. The project's success to date is a result of its focus on community outreach and providing services for scholarly communication, engaging institutional partners, offering services for data curation with the guidance of closely involved information professionals, and the creation of a strong technical infrastructure. Databrary plans to improve its curation tools that allow researchers to deposit their own data, enhance the user-facing feature set, increase integration with library systems, and implement strategies for long-term sustainability.
Enriching Earthdata by Improving Content Curation
NASA Astrophysics Data System (ADS)
Bagwell, R.; Wong, M. M.; Murphy, K. J.
2014-12-01
Since the launch of Earthdata in the later part of 2011, there has been an emphasis on improving the user experience and providing more enriched content to the user, ultimately with the focus to bring the "pixels to the people" or to ensure that a user clicks the fewest amount of times to get to the data, tools, or information which they seek. Earthdata was founded to be a single source of information for Earth Observing System Data and Information System (EOSDIS) components and services as a conglomeration between over 15 different websites. With an increased focus on access to Earth science data, the recognition is now on transforming Earthdata from a static website to one that is a dynamic, data-driven site full of enriched content.In the near future, Earthdata will have a number of components that will drive the access to the data, such as Earthdata Search, the Common Metadata Repository (CMR), and a redesign of the Earthdata website. The focus on content curation will be to leverage the use of these components to provide an enriched content environment and a better overall user experience, with an emphasis on Earthdata being "powered by EOSDIS" components and services.
The PSML format and library for norm-conserving pseudopotential data curation and interoperability
NASA Astrophysics Data System (ADS)
García, Alberto; Verstraete, Matthieu J.; Pouillon, Yann; Junquera, Javier
2018-06-01
Norm-conserving pseudopotentials are used by a significant number of electronic-structure packages, but the practical differences among codes in the handling of the associated data hinder their interoperability and make it difficult to compare their results. At the same time, existing formats lack provenance data, which makes it difficult to track and document computational workflows. To address these problems, we first propose a file format (PSML) that maps the basic concepts of the norm-conserving pseudopotential domain in a flexible form and supports the inclusion of provenance information and other important metadata. Second, we provide a software library (libPSML) that can be used by electronic structure codes to transparently extract the information in the file and adapt it to their own data structures, or to create converters for other formats. Support for the new file format has been already implemented in several pseudopotential generator programs (including ATOM and ONCVPSP), and the library has been linked with SIESTA and ABINIT, allowing them to work with the same pseudopotential operator (with the same local part and fully non-local projectors) thus easing the comparison of their results for the structural and electronic properties, as shown for several example systems. This methodology can be easily transferred to any other package that uses norm-conserving pseudopotentials, and offers a proof-of-concept for a general approach to interoperability.
A multi-service data management platform for scientific oceanographic products
NASA Astrophysics Data System (ADS)
D'Anca, Alessandro; Conte, Laura; Nassisi, Paola; Palazzo, Cosimo; Lecci, Rita; Cretì, Sergio; Mancini, Marco; Nuzzo, Alessandra; Mirto, Maria; Mannarini, Gianandrea; Coppini, Giovanni; Fiore, Sandro; Aloisio, Giovanni
2017-02-01
An efficient, secure and interoperable data platform solution has been developed in the TESSA project to provide fast navigation and access to the data stored in the data archive, as well as a standard-based metadata management support. The platform mainly targets scientific users and the situational sea awareness high-level services such as the decision support systems (DSS). These datasets are accessible through the following three main components: the Data Access Service (DAS), the Metadata Service and the Complex Data Analysis Module (CDAM). The DAS allows access to data stored in the archive by providing interfaces for different protocols and services for downloading, variables selection, data subsetting or map generation. Metadata Service is the heart of the information system of the TESSA products and completes the overall infrastructure for data and metadata management. This component enables data search and discovery and addresses interoperability by exploiting widely adopted standards for geospatial data. Finally, the CDAM represents the back-end of the TESSA DSS by performing on-demand complex data analysis tasks.
EUDAT B2FIND : A Cross-Discipline Metadata Service and Discovery Portal
NASA Astrophysics Data System (ADS)
Widmann, Heinrich; Thiemann, Hannes
2016-04-01
The European Data Infrastructure (EUDAT) project aims at a pan-European environment that supports a variety of multiple research communities and individuals to manage the rising tide of scientific data by advanced data management technologies. This led to the establishment of the community-driven Collaborative Data Infrastructure that implements common data services and storage resources to tackle the basic requirements and the specific challenges of international and interdisciplinary research data management. The metadata service B2FIND plays a central role in this context by providing a simple and user-friendly discovery portal to find research data collections stored in EUDAT data centers or in other repositories. For this we store the diverse metadata collected from heterogeneous sources in a comprehensive joint metadata catalogue and make them searchable in an open data portal. The implemented metadata ingestion workflow consists of three steps. First the metadata records - provided either by various research communities or via other EUDAT services - are harvested. Afterwards the raw metadata records are converted and mapped to unified key-value dictionaries as specified by the B2FIND schema. The semantic mapping of the non-uniform, community specific metadata to homogenous structured datasets is hereby the most subtle and challenging task. To assure and improve the quality of the metadata this mapping process is accompanied by • iterative and intense exchange with the community representatives, • usage of controlled vocabularies and community specific ontologies and • formal and semantic validation. Finally the mapped and checked records are uploaded as datasets to the catalogue, which is based on the open source data portal software CKAN. CKAN provides a rich RESTful JSON API and uses SOLR for dataset indexing that enables users to query and search in the catalogue. The homogenization of the community specific data models and vocabularies enables not only the unique presentation of these datasets as tables of field-value pairs but also the faceted, spatial and temporal search in the B2FIND metadata portal. Furthermore the service provides transparent access to the scientific data objects through the given references and identifiers in the metadata. B2FIND offers support for new communities interested in publishing their data within EUDAT. We present here the functionality and the features of the B2FIND service and give an outlook of further developments as interfaces to external libraries and use of Linked Data.
openPDS: protecting the privacy of metadata through SafeAnswers.
de Montjoye, Yves-Alexandre; Shmueli, Erez; Wang, Samuel S; Pentland, Alex Sandy
2014-01-01
The rise of smartphones and web services made possible the large-scale collection of personal metadata. Information about individuals' location, phone call logs, or web-searches, is collected and used intensively by organizations and big data researchers. Metadata has however yet to realize its full potential. Privacy and legal concerns, as well as the lack of technical solutions for personal metadata management is preventing metadata from being shared and reconciled under the control of the individual. This lack of access and control is furthermore fueling growing concerns, as it prevents individuals from understanding and managing the risks associated with the collection and use of their data. Our contribution is two-fold: (1) we describe openPDS, a personal metadata management framework that allows individuals to collect, store, and give fine-grained access to their metadata to third parties. It has been implemented in two field studies; (2) we introduce and analyze SafeAnswers, a new and practical way of protecting the privacy of metadata at an individual level. SafeAnswers turns a hard anonymization problem into a more tractable security one. It allows services to ask questions whose answers are calculated against the metadata instead of trying to anonymize individuals' metadata. The dimensionality of the data shared with the services is reduced from high-dimensional metadata to low-dimensional answers that are less likely to be re-identifiable and to contain sensitive information. These answers can then be directly shared individually or in aggregate. openPDS and SafeAnswers provide a new way of dynamically protecting personal metadata, thereby supporting the creation of smart data-driven services and data science research.
openPDS: Protecting the Privacy of Metadata through SafeAnswers
de Montjoye, Yves-Alexandre; Shmueli, Erez; Wang, Samuel S.; Pentland, Alex Sandy
2014-01-01
The rise of smartphones and web services made possible the large-scale collection of personal metadata. Information about individuals' location, phone call logs, or web-searches, is collected and used intensively by organizations and big data researchers. Metadata has however yet to realize its full potential. Privacy and legal concerns, as well as the lack of technical solutions for personal metadata management is preventing metadata from being shared and reconciled under the control of the individual. This lack of access and control is furthermore fueling growing concerns, as it prevents individuals from understanding and managing the risks associated with the collection and use of their data. Our contribution is two-fold: (1) we describe openPDS, a personal metadata management framework that allows individuals to collect, store, and give fine-grained access to their metadata to third parties. It has been implemented in two field studies; (2) we introduce and analyze SafeAnswers, a new and practical way of protecting the privacy of metadata at an individual level. SafeAnswers turns a hard anonymization problem into a more tractable security one. It allows services to ask questions whose answers are calculated against the metadata instead of trying to anonymize individuals' metadata. The dimensionality of the data shared with the services is reduced from high-dimensional metadata to low-dimensional answers that are less likely to be re-identifiable and to contain sensitive information. These answers can then be directly shared individually or in aggregate. openPDS and SafeAnswers provide a new way of dynamically protecting personal metadata, thereby supporting the creation of smart data-driven services and data science research. PMID:25007320
Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO).
Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
2017-08-01
A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type and organism) from over 1.3million GEO records. We examined the quality of well supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithm increases due of the decreasing of dimensionality of the unique values of these elements (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Bruland, Philipp; Doods, Justin; Storck, Michael; Dugas, Martin
2017-01-01
Data dictionaries provide structural meta-information about data definitions in health information technology (HIT) systems. In this regard, reusing healthcare data for secondary purposes offers several advantages (e.g. reduce documentation times or increased data quality). Prerequisites for data reuse are its quality, availability and identical meaning of data. In diverse projects, research data warehouses serve as core components between heterogeneous clinical databases and various research applications. Given the complexity (high number of data elements) and dynamics (regular updates) of electronic health record (EHR) data structures, we propose a clinical metadata warehouse (CMDW) based on a metadata registry standard. Metadata of two large hospitals were automatically inserted into two CMDWs containing 16,230 forms and 310,519 data elements. Automatic updates of metadata are possible as well as semantic annotations. A CMDW allows metadata discovery, data quality assessment and similarity analyses. Common data models for distributed research networks can be established based on similarity analyses.
Dhaval, Rakesh; Borlawsky, Tara; Ostrander, Michael; Santangelo, Jennifer; Kamal, Jyoti; Payne, Philip R O
2008-11-06
In order to enhance interoperability between enterprise systems, and improve data validity and reliability throughout The Ohio State University Medical Center (OSUMC), we have initiated the development of an ontology-anchored metadata architecture and knowledge collection for our enterprise data warehouse. The metadata and corresponding semantic relationships stored in the OSUMC knowledge collection are intended to promote consistency and interoperability across the heterogeneous clinical, research, business and education information managed within the data warehouse.
A justification for semantic training in data curation frameworks development
NASA Astrophysics Data System (ADS)
Ma, X.; Branch, B. D.; Wegner, K.
2013-12-01
In the complex data curation activities involving proper data access, data use optimization and data rescue, opportunities exist where underlying skills in semantics may play a crucial role in data curation professionals ranging from data scientists, to informaticists, to librarians. Here, We provide a conceptualization of semantics use in the education data curation framework (EDCF) [1] under development by Purdue University and endorsed by the GLOBE program [2] for further development and application. Our work shows that a comprehensive data science training includes both spatial and non-spatial data, where both categories are promoted by standard efforts of organizations such as the Open Geospatial Consortium (OGC) and the World Wide Web Consortium (W3C), as well as organizations such as the Federation of Earth Science Information Partners (ESIP) that share knowledge and propagate best practices in applications. Outside the context of EDCF, semantics training may be same critical to such data scientists, informaticists or librarians in other types of data curation activity. Past works by the authors have suggested that such data science should augment an ontological literacy where data science may become sustainable as a discipline. As more datasets are being published as open data [3] and made linked to each other, i.e., in the Resource Description Framework (RDF) format, or at least their metadata are being published in such a way, vocabularies and ontologies of various domains are being created and used in the data management, such as the AGROVOC [4] for agriculture and the GCMD keywords [5] and CLEAN vocabulary [6] for climate sciences. The new generation of data scientist should be aware of those technologies and receive training where appropriate to incorporate those technologies into their reforming daily works. References [1] Branch, B.D., Fosmire, M., 2012. The role of interdisciplinary GIS and data curation librarians in enhancing authentic scientific research in the classroom. American Geophysical Union 2013 Fall Meeting, San Francisco, CA, USA. Abstract# ED43A-0727 [2] http://www.globe.gov [3] http://www.whitehouse.gov/sites/default/files/omb/memoranda/2013/m-13-13.pdf [4] http://aims.fao.org/standards/agrovoc [5] http://gcmd.nasa.gov/learn/keyword_list.html [6] http://cleanet.org/clean/about/climate_energy_.html
Misra, Dharitri; Chen, Siyuan; Thoma, George R
2009-01-01
One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, official government records, etc., where the metadata is contained within the body of the documents, a cost effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques.At the U. S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts.In this paper, we describe the design of our AME system, with focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.
NASA Astrophysics Data System (ADS)
Car, Nicholas; Cox, Simon; Fitch, Peter
2015-04-01
With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty of multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects.We present a method to address both these issues by combining UncertML with PROV-O, and delivering resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, adds further elements to support uncertainty representation without a conceptual model and the integration of UncertML through links to documents. The Linked ID API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again. The Linked Data API's 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follows the UncertML model it can be automatically interpreted and may also support automatic uncertainty propagation . Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements that are associated with uncertainty (PROV-O Entity and Activity classes) have UncertML elements recorded. This methodology is intentionally flexible to allow uncertainty metadata in many forms, not limited to UncertML. While the more formal representation of uncertainty metadata is desirable (using UncertProv elements to implement the UncertML conceptual model ), this will not always be possible, and any uncertainty data stored will be better than none. Since the UncertProv ontology contains a superset of UncertML elements to facilitate the representation of non-UncertML uncertainty data, it could easily be extended to include other formal uncertainty conceptual models thus allowing non-UncertML propagation calculations.
Challenges and Opportunities in Geocuration for Disaster Response
NASA Astrophysics Data System (ADS)
Molthan, A.; Burks, J. E.; McGrath, K.; Ramachandran, R.; Goodman, H. M.
2015-12-01
Following a significant disaster event, a wide range of resources and science teams are leveraged to aid in the response effort. Often, these efforts include the acquisition and use of non-traditional data sets, or the generation of prototyped products using new image analysis techniques. These efforts may also include acquisition and hosting of remote sensing data sets from domestic and international partners - from the public or private sector - which differ from standard remote sensing holdings, or may be accompanied by specific licensing agreements that limit their use and dissemination. In addition, at time periods well beyond the initial disaster event, other science teams may incorporate airborne or field campaign measurements that support the assessment of damage but also acquire information necessary to address key science questions about the specific disaster or a broader category of similar events. The immediate need to gather data and provide information to the response effort can result in large data holdings that require detailed curation to improve the efficiency of response efforts, but also ensure that collected data can be used on a longer time scale to address underlying science questions. Data collected in response to a disaster event may be thought of as a "field campaign" - consisting of traditional data sets managed through physical or virtual holdings, but also a larger number of ad hoc data collections, derived products, and metadata, including the potential for airborne or ground-based data collections. Appropriate metadata and documentation are needed to ensure that derived products have traceability to their source data, along with documentation of algorithm authors, versions, and outcomes so that others can reproduce their results, and to ensure that data sets remain available and well-documented for longer-term analysis that may in turn create new products relevant to understanding a type of disaster, or support future recovery efforts. The authors will review some activities related to recent data acquisitions for severe weather events and the 2015 Nepal Earthquake, discuss challenges and opportunities for geocuration, and suggest means of improving the management of data and scientific collaboration if applied to future events.
Inheritance rules for Hierarchical Metadata Based on ISO 19115
NASA Astrophysics Data System (ADS)
Zabala, A.; Masó, J.; Pons, X.
2012-04-01
Mainly, ISO19115 has been used to describe metadata for datasets and services. Furthermore, ISO19115 standard (as well as the new draft ISO19115-1) includes a conceptual model that allows to describe metadata at different levels of granularity structured in hierarchical levels, both in aggregated resources such as particularly series, datasets, and also in more disaggregated resources such as types of entities (feature type), types of attributes (attribute type), entities (feature instances) and attributes (attribute instances). In theory, to apply a complete metadata structure to all hierarchical levels of metadata, from the whole series to an individual feature attributes, is possible, but to store all metadata at all levels is completely impractical. An inheritance mechanism is needed to store each metadata and quality information at the optimum hierarchical level and to allow an ease and efficient documentation of metadata in both an Earth observation scenario such as a multi-satellite mission multiband imagery, as well as in a complex vector topographical map that includes several feature types separated in layers (e.g. administrative limits, contour lines, edification polygons, road lines, etc). Moreover, and due to the traditional split of maps in tiles due to map handling at detailed scales or due to the satellite characteristics, each of the previous thematic layers (e.g. 1:5000 roads for a country) or band (Landsat-5 TM cover of the Earth) are tiled on several parts (sheets or scenes respectively). According to hierarchy in ISO 19115, the definition of general metadata can be supplemented by spatially specific metadata that, when required, either inherits or overrides the general case (G.1.3). Annex H of this standard states that only metadata exceptions are defined at lower levels, so it is not necessary to generate the full registry of metadata for each level but to link particular values to the general value that they inherit. Conceptually the metadata registry is complete for each metadata hierarchical level, but at the implementation level most of the metadata elements are not stored at both levels but only at more generic one. This communication defines a metadata system that covers 4 levels, describes which metadata has to support series-layer inheritance and in which way, and how hierarchical levels are defined and stored. Metadata elements are classified according to the type of inheritance between products, series, tiles and the datasets. It explains the metadata elements classification and exemplifies it using core metadata elements. The communication also presents a metadata viewer and edition tool that uses the described model to propagate metadata elements and to show to the user a complete set of metadata for each level in a transparent way. This tool is integrated in the MiraMon GIS software.
Text Mining to Support Gene Ontology Curation and Vice Versa.
Ruch, Patrick
2017-01-01
In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the developments of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of an automatic text categorizer and show a large improvement of +225 % in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Databases workflows must start recording explicitly all the data they curate and ideally also some of the data they do not curate.
Challenges to Standardization: A Case Study Using Coastal and Deep-Ocean Water Level Data
NASA Astrophysics Data System (ADS)
Sweeney, A. D.; Stroker, K. J.; Mungov, G.; McLean, S. J.
2015-12-01
Sea levels recorded at coastal stations and inferred from deep-ocean pressure observations at the seafloor are submitted for archive in multiple data and metadata formats. These formats include two forms of schema-less XML and a custom binary format accompanied by metadata in a spreadsheet. The authors report on efforts to use existing standards to make this data more discoverable and more useful beyond their initial use in detecting tsunamis. An initial review of data formats for sea level data around the globe revealed heterogeneity in presentation and content. In the absence of a widely-used domain-specific format, we adopted the general model for structuring data and metadata expressed by the Network Common Data Form (netCDF). netCDF has been endorsed by the Open Geospatial Consortium and has the advantages of small size when compared to equivalent plain text representation and provides a standard way of embedding metadata in the same file. We followed the orthogonal time-series profile of the Climate and Forecast discrete sampling geometries as the convention for structuring the data and describing metadata relevant for use. We adhered to the Attribute Convention for Data Discovery for capturing metadata to support user search. Beyond making it possible to structure data and metadata in a standard way, netCDF is supported by multiple software tools in providing programmatic cataloging, access, subsetting, and transformation to other formats. We will describe our successes and failures in adhering to existing standards and provide requirements for either augmenting existing conventions or developing new ones. Some of these enhancements are specific to sea level data, while others are applicable to time-series data in general.
"CanCore": In Canada and around the World
ERIC Educational Resources Information Center
Friesen, Norm
2005-01-01
In this article, the author discusses "CanCore," a learning resource metadata initiative funded by Industry Canada and supported by Athabasca University, Alberta, and TeleUniversite du Quebec, and describes the increasing range of international uses of the "CanCore" metadata for the indexing of learning objects.…
NASA Astrophysics Data System (ADS)
Baumann, Peter
2013-04-01
There is a traditional saying that metadata are understandable, semantic-rich, and searchable. Data, on the other hand, are big, with no accessible semantics, and just downloadable. Not only has this led to an imbalance of search support form a user perspective, but also underneath to a deep technology divide often using relational databases for metadata and bespoke archive solutions for data. Our vision is that this barrier will be overcome, and data and metadata become searchable likewise, leveraging the potential of semantic technologies in combination with scalability technologies. Ultimately, in this vision ad-hoc processing and filtering will not distinguish any longer, forming a uniformly accessible data universe. In the European EarthServer initiative, we work towards this vision by federating database-style raster query languages with metadata search and geo broker technology. We present our approach taken, how it can leverage OGC standards, the benefits envisaged, and first results.
Metadata-Driven SOA-Based Application for Facilitation of Real-Time Data Warehousing
NASA Astrophysics Data System (ADS)
Pintar, Damir; Vranić, Mihaela; Skočir, Zoran
Service-oriented architecture (SOA) has already been widely recognized as an effective paradigm for achieving integration of diverse information systems. SOA-based applications can cross boundaries of platforms, operation systems and proprietary data standards, commonly through the usage of Web Services technology. On the other side, metadata is also commonly referred to as a potential integration tool given the fact that standardized metadata objects can provide useful information about specifics of unknown information systems with which one has interest in communicating with, using an approach commonly called "model-based integration". This paper presents the result of research regarding possible synergy between those two integration facilitators. This is accomplished with a vertical example of a metadata-driven SOA-based business process that provides ETL (Extraction, Transformation and Loading) and metadata services to a data warehousing system in need of a real-time ETL support.
ODISEES: A New Paradigm in Data Access
NASA Astrophysics Data System (ADS)
Huffer, E.; Little, M. M.; Kusterer, J.
2013-12-01
As part of its ongoing efforts to improve access to data, the Atmospheric Science Data Center has developed a high-precision Earth Science domain ontology (the 'ES Ontology') implemented in a graph database ('the Semantic Metadata Repository') that is used to store detailed, semantically-enhanced, parameter-level metadata for ASDC data products. The ES Ontology provides the semantic infrastructure needed to drive the ASDC's Ontology-Driven Interactive Search Environment for Earth Science ('ODISEES'), a data discovery and access tool, and will support additional data services such as analytics and visualization. The ES ontology is designed on the premise that naming conventions alone are not adequate to provide the information needed by prospective data consumers to assess the suitability of a given dataset for their research requirements; nor are current metadata conventions adequate to support seamless machine-to-machine interactions between file servers and end-user applications. Data consumers need information not only about what two data elements have in common, but also about how they are different. End-user applications need consistent, detailed metadata to support real-time data interoperability. The ES ontology is a highly precise, bottom-up, queriable model of the Earth Science domain that focuses on critical details about the measurable phenomena, instrument techniques, data processing methods, and data file structures. Earth Science parameters are described in detail in the ES Ontology and mapped to the corresponding variables that occur in ASDC datasets. Variables are in turn mapped to well-annotated representations of the datasets that they occur in, the instrument(s) used to create them, the instrument platforms, the processing methods, etc., creating a linked-data structure that allows both human and machine users to access a wealth of information critical to understanding and manipulating the data. The mappings are recorded in the Semantic Metadata Repository as RDF-triples. An off-the-shelf Ontology Development Environment and a custom Metadata Conversion Tool comprise a human-machine/machine-machine hybrid tool that partially automates the creation of metadata as RDF-triples by interfacing with existing metadata repositories and providing a user interface that solicits input from a human user, when needed. RDF-triples are pushed to the Ontology Development Environment, where a reasoning engine executes a series of inference rules whose antecedent conditions can be satisfied by the initial set of RDF-triples, thereby generating the additional detailed metadata that is missing in existing repositories. A SPARQL Endpoint, a web-based query service and a Graphical User Interface allow prospective data consumers - even those with no familiarity with NASA data products - to search the metadata repository to find and order data products that meet their exact specifications. A web-based API will provide an interface for machine-to-machine transactions.
Geospatial resources for supporting data standards, guidance and best practice in health informatics
2011-01-01
Background The 1980s marked the occasion when Geographical Information System (GIS) technology was broadly introduced into the geo-spatial community through the establishment of a strong GIS industry. This technology quickly disseminated across many countries, and has now become established as an important research, planning and commercial tool for a wider community that includes organisations in the public and private health sectors. The broad acceptance of GIS technology and the nature of its functionality have meant that numerous datasets have been created over the past three decades. Most of these datasets have been created independently, and without any structured documentation systems in place. However, search and retrieval systems can only work if there is a mechanism for datasets existence to be discovered and this is where proper metadata creation and management can greatly help. This situation must be addressed through support mechanisms such as Web-based portal technologies, metadata editor tools, automation, metadata standards and guidelines and collaborative efforts with relevant individuals and organisations. Engagement with data developers or administrators should also include a strategy of identifying the benefits associated with metadata creation and publication. Findings The establishment of numerous Spatial Data Infrastructures (SDIs), and other Internet resources, is a testament to the recognition of the importance of supporting good data management and sharing practices across the geographic information community. These resources extend to health informatics in support of research, public services and teaching and learning. This paper identifies many of these resources available to the UK academic health informatics community. It also reveals the reluctance of many spatial data creators across the wider UK academic community to use these resources to create and publish metadata, or deposit their data in repositories for sharing. The Go-Geo! service is introduced as an SDI developed to provide UK academia with the necessary resources to address the concerns surrounding metadata creation and data sharing. The Go-Geo! portal, Geodoc metadata editor tool, ShareGeo spatial data repository, and a range of other support resources, are described in detail. Conclusions This paper describes a variety of resources available for the health research and public health sector to use for managing and sharing their data. The Go-Geo! service is one resource which offers an SDI for the eclectic range of disciplines using GIS in UK academia, including health informatics. The benefits of data management and sharing are immense, and in these times of cost restraints, these resources can be seen as solutions to find cost savings which can be reinvested in more research. PMID:21269487
DIRAC File Replica and Metadata Catalog
NASA Astrophysics Data System (ADS)
Tsaregorodtsev, A.; Poss, S.
2012-12-01
File replica and metadata catalogs are essential parts of any distributed data management system, which are largely determining its functionality and performance. A new File Catalog (DFC) was developed in the framework of the DIRAC Project that combines both replica and metadata catalog functionality. The DFC design is based on the practical experience with the data management system of the LHCb Collaboration. It is optimized for the most common patterns of the catalog usage in order to achieve maximum performance from the user perspective. The DFC supports bulk operations for replica queries and allows quick analysis of the storage usage globally and for each Storage Element separately. It supports flexible ACL rules with plug-ins for various policies that can be adopted by a particular community. The DFC catalog allows to store various types of metadata associated with files and directories and to perform efficient queries for the data based on complex metadata combinations. Definition of file ancestor-descendent relation chains is also possible. The DFC catalog is implemented in the general DIRAC distributed computing framework following the standard grid security architecture. In this paper we describe the design of the DFC and its implementation details. The performance measurements are compared with other grid file catalog implementations. The experience of the DFC Catalog usage in the CLIC detector project are discussed.
MMI's Metadata and Vocabulary Solutions: 10 Years and Growing
NASA Astrophysics Data System (ADS)
Graybeal, J.; Gayanilo, F.; Rueda-Velasquez, C. A.
2014-12-01
The Marine Metadata Interoperability project (http://marinemetadata.org) held its public opening at AGU's 2004 Fall Meeting. For 10 years since that debut, the MMI guidance and vocabulary sites have served over 100,000 visitors, with 525 community members and continuous Steering Committee leadership. Originally funded by the National Science Foundation, over the years multiple organizations have supported the MMI mission: "Our goal is to support collaborative research in the marine science domain, by simplifying the incredibly complex world of metadata into specific, straightforward guidance. MMI encourages scientists and data managers at all levels to apply good metadata practices from the start of a project, by providing the best guidance and resources for data management, and developing advanced metadata tools and services needed by the community." Now hosted by the Harte Research Institute at Texas A&M University at Corpus Christi, MMI continues to provide guidance and services to the community, and is planning for marine science and technology needs for the next 10 years. In this presentation we will highlight our major accomplishments, describe our recent achievements and imminent goals, and propose a vision for improving marine data interoperability for the next 10 years, including Ontology Registry and Repository (http://mmisw.org/orr) advancements and applications (http://mmisw.org/cfsn).
Managing biomedical image metadata for search and retrieval of similar images.
Korenblum, Daniel; Rubin, Daniel; Napel, Sandy; Rodriguez, Cesar; Beaulieu, Chris
2011-08-01
Radiology images are generally disconnected from the metadata describing their contents, such as imaging observations ("semantic" metadata), which are usually described in text reports that are not directly linked to the images. We developed a system, the Biomedical Image Metadata Manager (BIMM) to (1) address the problem of managing biomedical image metadata and (2) facilitate the retrieval of similar images using semantic feature metadata. Our approach allows radiologists, researchers, and students to take advantage of the vast and growing repositories of medical image data by explicitly linking images to their associated metadata in a relational database that is globally accessible through a Web application. BIMM receives input in the form of standard-based metadata files using Web service and parses and stores the metadata in a relational database allowing efficient data query and maintenance capabilities. Upon querying BIMM for images, 2D regions of interest (ROIs) stored as metadata are automatically rendered onto preview images included in search results. The system's "match observations" function retrieves images with similar ROIs based on specific semantic features describing imaging observation characteristics (IOCs). We demonstrate that the system, using IOCs alone, can accurately retrieve images with diagnoses matching the query images, and we evaluate its performance on a set of annotated liver lesion images. BIMM has several potential applications, e.g., computer-aided detection and diagnosis, content-based image retrieval, automating medical analysis protocols, and gathering population statistics like disease prevalences. The system provides a framework for decision support systems, potentially improving their diagnostic accuracy and selection of appropriate therapies.
The Library as Partner in University Data Curation: A Case Study in Collaboration
ERIC Educational Resources Information Center
Latham, Bethany; Poe, Jodi Welch
2012-01-01
Data curation is a concept with many facets. Curation goes beyond research-generated data, and its principles can support the preservation of institutions' historical data. Libraries are well-positioned to bring relevant expertise to such problems, especially those requiring collaboration, because of their experience as neutral caretakers and…
Earth Science Data Grid System
NASA Astrophysics Data System (ADS)
Chi, Y.; Yang, R.; Kafatos, M.
2004-05-01
The Earth Science Data Grid System (ESDGS) is a software system in support of earth science data storage and access. It is built upon the Storage Resource Broker (SRB) data grid technology. We have developed a complete data grid system consistent of SRB server providing users uniform access to diverse storage resources in a heterogeneous computing environment and metadata catalog server (MCAT) managing the metadata associated with data set, users, and resources. We also develop the earth science application metadata; geospatial, temporal, and content-based indexing; and some other tools. In this paper, we will describe software architecture and components of the data grid system, and use a practical example in support of storage and access of rainfall data from the Tropical Rainfall Measuring Mission (TRMM) to illustrate its functionality and features.
Misra, Dharitri; Chen, Siyuan; Thoma, George R.
2010-01-01
One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, official government records, etc., where the metadata is contained within the body of the documents, a cost effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques. At the U. S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts. In this paper, we describe the design of our AME system, with focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system. PMID:21179386
A public database of macromolecular diffraction experiments.
Grabowski, Marek; Langner, Karol M; Cymborowski, Marcin; Porebski, Przemyslaw J; Sroka, Piotr; Zheng, Heping; Cooper, David R; Zimmerman, Matthew D; Elsliger, Marc André; Burley, Stephen K; Minor, Wladek
2016-11-01
The low reproducibility of published experimental results in many scientific disciplines has recently garnered negative attention in scientific journals and the general media. Public transparency, including the availability of `raw' experimental data, will help to address growing concerns regarding scientific integrity. Macromolecular X-ray crystallography has led the way in requiring the public dissemination of atomic coordinates and a wealth of experimental data, making the field one of the most reproducible in the biological sciences. However, there remains no mandate for public disclosure of the original diffraction data. The Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) has been developed to archive raw data from diffraction experiments and, equally importantly, to provide related metadata. Currently, the database of our resource contains data from 2920 macromolecular diffraction experiments (5767 data sets), accounting for around 3% of all depositions in the Protein Data Bank (PDB), with their corresponding partially curated metadata. IRRMC utilizes distributed storage implemented using a federated architecture of many independent storage servers, which provides both scalability and sustainability. The resource, which is accessible via the web portal at http://www.proteindiffraction.org, can be searched using various criteria. All data are available for unrestricted access and download. The resource serves as a proof of concept and demonstrates the feasibility of archiving raw diffraction data and associated metadata from X-ray crystallographic studies of biological macromolecules. The goal is to expand this resource and include data sets that failed to yield X-ray structures in order to facilitate collaborative efforts that will improve protein structure-determination methods and to ensure the availability of `orphan' data left behind for various reasons by individual investigators and/or extinct structural genomics projects.
NASA Astrophysics Data System (ADS)
Kunkel, K.; Champion, S.
2015-12-01
Data Management and the National Climate Assessment: A Data Quality Solution Sarah M. Champion and Kenneth E. Kunkel Cooperative Institute for Climate and Satellites, Asheville, NC The Third National Climate Assessment (NCA), anticipated for its authoritative climate change analysis, was also a vanguard in climate communication. From the cutting-edge website to the organization of information, the Assessment content appealed to, and could be accessed by, many demographics. One such pivotal presentation of information in the NCA was the availability of complex metadata directly connected to graphical products. While the basic metadata requirement is federally mandated through a series of federal guidelines as a part of the Information Quality Act, the NCA is also deemed a Highly Influential Scientific Assessment, which requires demonstration of the transparency and reproducibility of the content. To meet these requirements, the Technical Support Unit (TSU) for the NCA embarked on building a system for collecting and presenting metadata that not only met these requirements, but one that has since been employed in support of additional Assessments. The metadata effort for this NCA proved invaluable for many reasons, one of which being that it showcased that there is a critical need for a culture change within the scientific community to support collection and transparency of data and methods to the level produced with the NCA. Irregardless of being federally mandated, it proves to simply be a good practice in science communication. This presentation will detail the collection system built by the TSU, the improvements employed with additional Assessment products, as well as illustrate examples of successful transparency. Through this presentation, we hope to impel the discussion in support of detailed metadata becoming the cultural norm within the scientific community to support influential and highly policy-relevant documents such as the NCA.
Data-centric Science: New challenges for long-term archives and data publishers
NASA Astrophysics Data System (ADS)
Stockhause, Martina; Lautenschlager, Michael
2016-04-01
In the recent years the publication of data has become more and more common. Data and metadata for a single project are often disseminated by multiple data centers in federated data infrastructures. In the same time data is shared earlier to enable collaboration within research projects. The research data environment has become more heterogeneous and the data more dynamic. Only few data or metadata repositories are long-term archives (LTAs) with WDS/DSA certificates, complying to Force 11's 'Joint Declaration of Data Citation Principles'. Therefore for long-term usage of these data and information, a small number of LTAs have the task to preserve these pieces of information. They replicate, connect, quality assure, harmonize, archive, and curate these different types of data from multiple data centers with different operation procedures and data standards. Consortia or federations of certified LTAs are needed to meet the challenges of big data storage and citations. Data publishers play a central role in storing, preserving, and disseminating scientific information. Portals of these federations of LTAs or data registration agencies like DataCite might even become the portals of the future for scientific knowledge discovery. The example CMIP6 is used to illustrate this future perspective of the role of LTAs/data publishers.
Representing Hydrologic Models as HydroShare Resources to Facilitate Model Sharing and Collaboration
NASA Astrophysics Data System (ADS)
Castronova, A. M.; Goodall, J. L.; Mbewe, P.
2013-12-01
The CUAHSI HydroShare project is a collaborative effort that aims to provide software for sharing data and models within the hydrologic science community. One of the early focuses of this work has been establishing metadata standards for describing models and model-related data as HydroShare resources. By leveraging this metadata definition, a prototype extension has been developed to create model resources that can be shared within the community using the HydroShare system. The extension uses a general model metadata definition to create resource objects, and was designed so that model-specific parsing routines can extract and populate metadata fields from model input and output files. The long term goal is to establish a library of supported models where, for each model, the system has the ability to extract key metadata fields automatically, thereby establishing standardized model metadata that will serve as the foundation for model sharing and collaboration within HydroShare. The Soil Water & Assessment Tool (SWAT) is used to demonstrate this concept through a case study application.
Leveraging Metadata to Create Interactive Images... Today!
NASA Astrophysics Data System (ADS)
Hurt, Robert L.; Squires, G. K.; Llamas, J.; Rosenthal, C.; Brinkworth, C.; Fay, J.
2011-01-01
The image gallery for NASA's Spitzer Space Telescope has been newly rebuilt to fully support the Astronomy Visualization Metadata (AVM) standard to create a new user experience both on the website and in other applications. We encapsulate all the key descriptive information for a public image, including color representations and astronomical and sky coordinates and make it accessible in a user-friendly form on the website, but also embed the same metadata within the image files themselves. Thus, images downloaded from the site will carry with them all their descriptive information. Real-world benefits include display of general metadata when such images are imported into image editing software (e.g. Photoshop) or image catalog software (e.g. iPhoto). More advanced support in Microsoft's WorldWide Telescope can open a tagged image after it has been downloaded and display it in its correct sky position, allowing comparison with observations from other observatories. An increasing number of software developers are implementing AVM support in applications and an online image archive for tagged images is under development at the Spitzer Science Center. Tagging images following the AVM offers ever-increasing benefits to public-friendly imagery in all its standard forms (JPEG, TIFF, PNG). The AVM standard is one part of the Virtual Astronomy Multimedia Project (VAMP); http://www.communicatingastronomy.org
Training and Best Practice Guidelines: Implications for Metadata Creation
ERIC Educational Resources Information Center
Chuttur, Mohammad Y.
2012-01-01
In response to the rapid development of digital libraries over the past decade, researchers have focused on the use of metadata as an effective means to support resource discovery within online repositories. With the increasing involvement of libraries in digitization projects and the growing number of institutional repositories, it is anticipated…
A Meta-Data Driven Approach to Searching for Educational Resources in a Global Context.
ERIC Educational Resources Information Center
Wade, Vincent P.; Doherty, Paul
This paper presents the design of an Internet-enabled search service that supports educational resource discovery within an educational brokerage service. More specifically, it presents the design and implementation of a metadata-driven approach to implementing the distributed search and retrieval of Internet-based educational resources and…
Now That We've Found the "Hidden Web," What Can We Do with It?
ERIC Educational Resources Information Center
Cole, Timothy W.; Kaczmarek, Joanne; Marty, Paul F.; Prom, Christopher J.; Sandore, Beth; Shreeves, Sarah
The Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH) is designed to facilitate discovery of the "hidden web" of scholarly information, such as that contained in databases, finding aids, and XML documents. OAI-PMH supports standardized exchange of metadata describing items in disparate collections, of such as those…
NASA Technical Reports Server (NTRS)
Fries, M. D.; Allen, C. C.; Calaway, M. J.; Evans, C. A.; Stansbery, E. K.
2015-01-01
Curation of NASA's astromaterials sample collections is a demanding and evolving activity that supports valuable science from NASA missions for generations, long after the samples are returned to Earth. For example, NASA continues to loan hundreds of Apollo program samples to investigators every year and those samples are often analyzed using instruments that did not exist at the time of the Apollo missions themselves. The samples are curated in a manner that minimizes overall contamination, enabling clean, new high-sensitivity measurements and new science results over 40 years after their return to Earth. As our exploration of the Solar System progresses, upcoming and future NASA sample return missions will return new samples with stringent contamination control, sample environmental control, and Planetary Protection requirements. Therefore, an essential element of a healthy astromaterials curation program is a research and development (R&D) effort that characterizes and employs new technologies to maintain current collections and enable new missions - an Advanced Curation effort. JSC's Astromaterials Acquisition & Curation Office is continually performing Advanced Curation research, identifying and defining knowledge gaps about research, development, and validation/verification topics that are critical to support current and future NASA astromaterials sample collections. The following are highlighted knowledge gaps and research opportunities.
Wang, Yanchao; Sunderraman, Rajshekhar
2006-01-01
In this paper, we propose two architectures for curating PDB data to improve its quality. The first one, PDB Data Curation System, is developed by adding two parts, Checking Filter and Curation Engine, between User Interface and Database. This architecture supports the basic PDB data curation. The other one, PDB Data Curation System with XCML, is designed for further curation which adds four more parts, PDB-XML, PDB, OODB, Protin-OODB, into the previous one. This architecture uses XCML language to automatically check errors of PDB data that enables PDB data more consistent and accurate. These two tools can be used for cleaning existing PDB files and creating new PDB files. We also show some ideas how to add constraints and assertions with XCML to get better data. In addition, we discuss the data provenance that may affect data accuracy and consistency.
DIANA-LncBase v2: indexing microRNA targets on non-coding transcripts.
Paraskevopoulou, Maria D; Vlachos, Ioannis S; Karagkouni, Dimitra; Georgakilas, Georgios; Kanellos, Ilias; Vergoulis, Thanasis; Zagganas, Konstantinos; Tsanakas, Panayiotis; Floros, Evangelos; Dalamagas, Theodore; Hatzigeorgiou, Artemis G
2016-01-04
microRNAs (miRNAs) are short non-coding RNAs (ncRNAs) that act as post-transcriptional regulators of coding gene expression. Long non-coding RNAs (lncRNAs) have been recently reported to interact with miRNAs. The sponge-like function of lncRNAs introduces an extra layer of complexity in the miRNA interactome. DIANA-LncBase v1 provided a database of experimentally supported and in silico predicted miRNA Recognition Elements (MREs) on lncRNAs. The second version of LncBase (www.microrna.gr/LncBase) presents an extensive collection of miRNA:lncRNA interactions. The significantly enhanced database includes more than 70 000 low and high-throughput, (in)direct miRNA:lncRNA experimentally supported interactions, derived from manually curated publications and the analysis of 153 AGO CLIP-Seq libraries. The new experimental module presents a 14-fold increase compared to the previous release. LncBase v2 hosts in silico predicted miRNA targets on lncRNAs, identified with the DIANA-microT algorithm. The relevant module provides millions of predicted miRNA binding sites, accompanied with detailed metadata and MRE conservation metrics. LncBase v2 caters information regarding cell type specific miRNA:lncRNA regulation and enables users to easily identify interactions in 66 different cell types, spanning 36 tissues for human and mouse. Database entries are also supported by accurate lncRNA expression information, derived from the analysis of more than 6 billion RNA-Seq reads. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Data governance in predictive toxicology: A review.
Fu, Xin; Wojak, Anna; Neagu, Daniel; Ridley, Mick; Travis, Kim
2011-07-13
Due to recent advances in data storage and sharing for further data processing in predictive toxicology, there is an increasing need for flexible data representations, secure and consistent data curation and automated data quality checking. Toxicity prediction involves multidisciplinary data. There are hundreds of collections of chemical, biological and toxicological data that are widely dispersed, mostly in the open literature, professional research bodies and commercial companies. In order to better manage and make full use of such large amount of toxicity data, there is a trend to develop functionalities aiming towards data governance in predictive toxicology to formalise a set of processes to guarantee high data quality and better data management. In this paper, data quality mainly refers in a data storage sense (e.g. accuracy, completeness and integrity) and not in a toxicological sense (e.g. the quality of experimental results). This paper reviews seven widely used predictive toxicology data sources and applications, with a particular focus on their data governance aspects, including: data accuracy, data completeness, data integrity, metadata and its management, data availability and data authorisation. This review reveals the current problems (e.g. lack of systematic and standard measures of data quality) and desirable needs (e.g. better management and further use of captured metadata and the development of flexible multi-level user access authorisation schemas) of predictive toxicology data sources development. The analytical results will help to address a significant gap in toxicology data quality assessment and lead to the development of novel frameworks for predictive toxicology data and model governance. While the discussed public data sources are well developed, there nevertheless remain some gaps in the development of a data governance framework to support predictive toxicology. In this paper, data governance is identified as the new challenge in predictive toxicology, and a good use of it may provide a promising framework for developing high quality and easy accessible toxicity data repositories. This paper also identifies important research directions that require further investigation in this area.
Data governance in predictive toxicology: A review
2011-01-01
Background Due to recent advances in data storage and sharing for further data processing in predictive toxicology, there is an increasing need for flexible data representations, secure and consistent data curation and automated data quality checking. Toxicity prediction involves multidisciplinary data. There are hundreds of collections of chemical, biological and toxicological data that are widely dispersed, mostly in the open literature, professional research bodies and commercial companies. In order to better manage and make full use of such large amount of toxicity data, there is a trend to develop functionalities aiming towards data governance in predictive toxicology to formalise a set of processes to guarantee high data quality and better data management. In this paper, data quality mainly refers in a data storage sense (e.g. accuracy, completeness and integrity) and not in a toxicological sense (e.g. the quality of experimental results). Results This paper reviews seven widely used predictive toxicology data sources and applications, with a particular focus on their data governance aspects, including: data accuracy, data completeness, data integrity, metadata and its management, data availability and data authorisation. This review reveals the current problems (e.g. lack of systematic and standard measures of data quality) and desirable needs (e.g. better management and further use of captured metadata and the development of flexible multi-level user access authorisation schemas) of predictive toxicology data sources development. The analytical results will help to address a significant gap in toxicology data quality assessment and lead to the development of novel frameworks for predictive toxicology data and model governance. Conclusions While the discussed public data sources are well developed, there nevertheless remain some gaps in the development of a data governance framework to support predictive toxicology. In this paper, data governance is identified as the new challenge in predictive toxicology, and a good use of it may provide a promising framework for developing high quality and easy accessible toxicity data repositories. This paper also identifies important research directions that require further investigation in this area. PMID:21752279
Earth Science Data Grid System
NASA Astrophysics Data System (ADS)
Chi, Y.; Yang, R.; Kafatos, M.
2004-12-01
The Earth Science Data Grid System (ESDGS) is a software in support of earth science data storage and access. It is built upon the Storage Resource Broker (SRB) data grid technology. We have developed a complete data grid system consistent of SRB server providing users uniform access to diverse storage resources in a heterogeneous computing environment and metadata catalog server (MCAT) managing the metadata associated with data set, users, and resources. We are also developing additional services of 1) metadata management, 2) geospatial, temporal, and content-based indexing, and 3) near/on site data processing, in response to the unique needs of Earth science applications. In this paper, we will describe the software architecture and components of the system, and use a practical example in support of storage and access of rainfall data from the Tropical Rainfall Measuring Mission (TRMM) to illustrate its functionality and features.
A New Look at Data Usage by Using Metadata Attributes as Indicators of Data Quality
NASA Technical Reports Server (NTRS)
Won, Young-In; Wanchoo, Lalit; Behnke, Jeanne
2016-01-01
This study reviews the key metrics (users, distributed volume, and files) in multiple ways to gain an understanding of the significance of the metadata. Characterizing the usability of data by key metadata elements, such as discipline and study area, will assist in understanding how the user needs have evolved over time. The data usage pattern based on product level provides insight into the level of data quality. In addition, the data metrics by various services, such as the Open-source Project for a Network Data Access Protocol (OPeNDAP) and subsets, address how these services have extended the usage of data. Over-all, this study presents the usage of data and metadata by metrics analyses, which may assist data centers in better supporting the needs of the users.
Divorcing Strain Classification from Species Names.
Baltrus, David A
2016-06-01
Confusion about strain classification and nomenclature permeates modern microbiology. Although taxonomists have traditionally acted as gatekeepers of order, the numbers of, and speed at which, new strains are identified has outpaced the opportunity for professional classification for many lineages. Furthermore, the growth of bioinformatics and database-fueled investigations have placed metadata curation in the hands of researchers with little taxonomic experience. Here I describe practical challenges facing modern microbial taxonomy, provide an overview of complexities of classification for environmentally ubiquitous taxa like Pseudomonas syringae, and emphasize that classification can be independent of nomenclature. A move toward implementation of relational classification schemes based on inherent properties of whole genomes could provide sorely needed continuity in how strains are referenced across manuscripts and data sets. Copyright © 2016 Elsevier Ltd. All rights reserved.
ClipCard: Sharable, Searchable Visual Metadata Summaries on the Cloud to Render Big Data Actionable
NASA Astrophysics Data System (ADS)
Saripalli, P.; Davis, D.; Cunningham, R.
2013-12-01
Research firm IDC estimates that approximately 90 percent of the Enterprise Big Data go un-analyzed, as 'dark data' - an enormous corpus of undiscovered, untagged information residing on data warehouses, servers and Storage Area Networks (SAN). In the geosciences, these data range from unpublished model runs to vast survey data assets to raw sensor data. Many of these are now being collected instantaneously, at a greater volume and in new data formats. Not all of these data can be analyzed, nor processed in real time, and their features may not be well described at the time of collection. These dark data are a serious data management problem for science organizations of all types, especially ones with mandated or required data reporting and compliance requirements. Additionally, data curators and scientists are encouraged to quantify the impact of their data holdings as a way to measure research success. Deriving actionable insights is the foremost goal of Big Data Analytics (BDA), which is especially true with geoscience, given its direct impact on most of the pressing global issues. Clearly, there is a pressing need for innovative approaches to making dark data discoverable, measurable, and actionable. We report on ClipCard, a Cloud-based SaaS analytic platform for instant summarization, quick search, visualization and easy sharing of metadata summaries form the Dark Data at hierarchical levels of detail, thus rendering it 'white', i.e., actionable. We present a use case of the ClipCard platform, a cloud-based application which helps generate (abstracted) visual metadata summaries and meta-analytics for environmental data at hierarchical scales within and across big data containers. These summaries and analyses provide important new tools for managing big data and simplifying collaboration through easy to deploy sharing APIs. The ClipCard application solves a growing data management bottleneck by helping enterprises and large organizations to summarize, search, discover, and share the potential in their unused data and information assets. Using Cloud as the base platform enables wider reach, quick dissemination and easy sharing of the metadata summaries, without actually storing or sharing the original data assets per se.
NASA Astrophysics Data System (ADS)
Delory, E.; Jirka, S.
2016-02-01
Discovering sensors and observation data is important when enabling the exchange of oceanographic data between observatories and scientists that need the data sets for their work. To better support this discovery process, one task of the European project FixO3 (Fixed-point Open Ocean Observatories) is dealing with the question which elements are needed for developing a better registry for sensors. This has resulted in four items which are addressed by the FixO3 project in cooperation with further European projects such as NeXOS (http://www.nexosproject.eu/). 1.) Metadata description format: To store and retrieve information about sensors and platforms it is necessary to have a common approach how to provide and encode the metadata. For this purpose, the OGC Sensor Model Language (SensorML) 2.0 standard was selected. Especially the opportunity to distinguish between sensor types and instances offers new chances for a more efficient provision and maintenance of sensor metadata. 2.) Conversion of existing metadata into a SensorML 2.0 representation: In order to ensure a sustainable re-use of already provided metadata content (e.g. from ESONET-FixO3 yellow pages), it is important to provide a mechanism which is capable of transforming these already available metadata sets into the new SensorML 2.0 structure. 3.) Metadata editor: To create descriptions of sensors and platforms, it is not possible to expect users to manually edit XML-based description files. Thus, a visual interface is necessary to help during the metadata creation. We will outline a prototype of this editor, building upon the development of the ESONET sensor registry interface. 4.) Sensor Metadata Store: A server is needed that for storing and querying the created sensor descriptions. For this purpose different options exist which will be discussed. In summary, we will present a set of different elements enabling sensor discovery ranging from metadata formats, metadata conversion and editing to metadata storage. Furthermore, the current development status will be demonstrated.
ERIC Educational Resources Information Center
Dietze, Stefan; Gugliotta, Alessio; Domingue, John
2009-01-01
Current E-Learning technologies primarily follow a data and metadata-centric paradigm by providing the learner with composite content containing the learning resources and the learning process description, usually based on specific metadata standards such as ADL SCORM or IMS Learning Design. Due to the design-time binding of learning resources,…
ERIC Educational Resources Information Center
Campbell, D. Grant; Fast, Karl V.
2004-01-01
This paper examines how future metadata capabilities could enable academic libraries to exploit information on the emerging Semantic Web in their library catalogues. Whereas current metadata architectures treat the Web as a simple means of interchanging bibliographic data that have been created by libraries, this paper suggests that academic…
Data and the Shift in Systems, Services, and Literacy
ERIC Educational Resources Information Center
Mitchell, Erik T.
2012-01-01
This month, the "Journal of Web Librarianship" is exploring the idea of data curation and its uses in libraries. The word "data" is as universal now as the word "cloud" was last year, and it is no accident that libraries are exploring how best to support data curation services. Data curation involves library activities in just about every way,…
HTTP-based Search and Ordering Using ECHO's REST-based and OpenSearch APIs
NASA Astrophysics Data System (ADS)
Baynes, K.; Newman, D. J.; Pilone, D.
2012-12-01
Metadata is an important entity in the process of cataloging, discovering, and describing Earth science data. NASA's Earth Observing System (EOS) ClearingHOuse (ECHO) acts as the core metadata repository for EOSDIS data centers, providing a centralized mechanism for metadata and data discovery and retrieval. By supporting both the ESIP's Federated Search API and its own search and ordering interfaces, ECHO provides multiple capabilities that facilitate ease of discovery and access to its ever-increasing holdings. Users are able to search and export metadata in a variety of formats including ISO 19115, json, and ECHO10. This presentation aims to inform technically savvy clients interested in automating search and ordering of ECHO's metadata catalog. The audience will be introduced to practical and applicable examples of end-to-end workflows that demonstrate finding, sub-setting and ordering data that is bound by keyword, temporal and spatial constraints. Interaction with the ESIP OpenSearch Interface will be highlighted, as will ECHO's own REST-based API.
NASA Technical Reports Server (NTRS)
Singley, P. T.; Bell, J. D.; Daugherty, P. F.; Hubbs, C. A.; Tuggle, J. G.
1993-01-01
This paper will discuss user interface development and the structure and use of metadata for the Atmospheric Radiation Measurement (ARM) Archive. The ARM Archive, located at Oak Ridge National Laboratory (ORNL) in Oak Ridge, Tennessee, is the data repository for the U.S. Department of Energy's (DOE's) ARM Project. After a short description of the ARM Project and the ARM Archive's role, we will consider the philosophy and goals, constraints, and prototype implementation of the user interface for the archive. We will also describe the metadata that are stored at the archive and support the user interface.
Scrubchem: Building Bioactivity Datasets from Pubchem ...
The PubChem Bioassay database is a non-curated public repository with data from 64 sources, including: ChEMBL, BindingDb, DrugBank, EPA Tox21, NIH Molecular Libraries Screening Program, and various other academic, government, and industrial contributors. Methods for extracting this public data into quality datasets, useable for analytical research, presents several big-data challenges for which we have designed manageable solutions. According to our preliminary work, there are approximately 549 million bioactivity values and related meta-data within PubChem that can be mapped to over 10,000 biological targets. However, this data is not ready for use in data-driven research, mainly due to lack of structured annotations.We used a pragmatic approach that provides increasing access to bioactivity values in the PubChem Bioassay database. This included restructuring of individual PubChem Bioassay files into a relational database (ScrubChem). ScrubChem contains all primary PubChem Bioassay data that was: reparsed; error-corrected (when applicable); enriched with additional data links from other NCBI databases; and improved by adding key biological and assay annotations derived from logic-based language processing rules. The utility of ScrubChem and the curation process were illustrated using an example bioactivity dataset for the androgen receptor protein. This initial work serves as a trial ground for establishing the technical framework for accessing, integrating, cu
TR32DB - Management of Research Data in a Collaborative, Interdisciplinary Research Project
NASA Astrophysics Data System (ADS)
Curdt, Constanze; Hoffmeister, Dirk; Waldhoff, Guido; Lang, Ulrich; Bareth, Georg
2015-04-01
The management of research data in a well-structured and documented manner is essential in the context of collaborative, interdisciplinary research environments (e.g. across various institutions). Consequently, set-up and use of a research data management (RDM) system like a data repository or project database is necessary. These systems should accompany and support scientists during the entire research life cycle (e.g. data collection, documentation, storage, archiving, sharing, publishing) and operate cross-disciplinary in interdisciplinary research projects. Challenges and problems of RDM are well-know. Consequently, the set-up of a user-friendly, well-documented, sustainable RDM system is essential, as well as user support and further assistance. In the framework of the Transregio Collaborative Research Centre 32 'Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation' (CRC/TR32), funded by the German Research Foundation (DFG), a RDM system was self-designed and implemented. The CRC/TR32 project database (TR32DB, www.tr32db.de) is operating online since early 2008. The TR32DB handles all data, which are created by the involved project participants from several institutions (e.g. Universities of Cologne, Bonn, Aachen, and the Research Centre Jülich) and research fields (e.g. soil and plant sciences, hydrology, geography, geophysics, meteorology, remote sensing). Very heterogeneous research data are considered, which are resulting from field measurement campaigns, meteorological monitoring, remote sensing, laboratory studies and modelling approaches. Furthermore, outcomes like publications, conference contributions, PhD reports and corresponding images are regarded. The TR32DB project database is set-up in cooperation with the Regional Computing Centre of the University of Cologne (RRZK) and also located in this hardware environment. The TR32DB system architecture is composed of three main components: (i) a file-based data storage including backup, (ii) a database-based storage for administrative data and metadata, and (iii) a web-interface for user access. The TR32DB offers common features of RDM systems. These include data storage, entry of corresponding metadata by a user-friendly input wizard, search and download of data depending on user permission, as well as secure internal exchange of data. In addition, a Digital Object Identifier (DOI) can be allocated for specific datasets and several web mapping components are supported (e.g. Web-GIS and map search). The centrepiece of the TR32DB is the self-provided and implemented CRC/TR32 specific metadata schema. This enables the documentation of all involved, heterogeneous data with accurate, interoperable metadata. The TR32DB Metadata Schema is set-up in a multi-level approach and supports several metadata standards and schemes (e.g. Dublin Core, ISO 19115, INSPIRE, DataCite). Furthermore, metadata properties with focus on the CRC/TR32 background (e.g. CRC/TR32 specific keywords) and the supported data types are complemented. Mandatory, optional and automatic metadata properties are specified. Overall, the TR32DB is designed and implemented according to the needs of the CRC/TR32 (e.g. huge amount of heterogeneous data) and demands of the DFG (e.g. cooperation with a computing centre). The application of a self-designed, project-specific, interoperable metadata schema enables the accurate documentation of all CRC/TR32 data. The implementation of the TR32DB in the hardware environment of the RRZK ensures the access to the data after the end of the CRC/TR32 funding in 2018.
Handling Metadata in a Neurophysiology Laboratory
Zehl, Lyuba; Jaillet, Florent; Stoewer, Adrian; Grewe, Jan; Sobolev, Andrey; Wachtler, Thomas; Brochier, Thomas G.; Riehle, Alexa; Denker, Michael; Grün, Sonja
2016-01-01
To date, non-reproducibility of neurophysiological research is a matter of intense discussion in the scientific community. A crucial component to enhance reproducibility is to comprehensively collect and store metadata, that is, all information about the experiment, the data, and the applied preprocessing steps on the data, such that they can be accessed and shared in a consistent and simple manner. However, the complexity of experiments, the highly specialized analysis workflows and a lack of knowledge on how to make use of supporting software tools often overburden researchers to perform such a detailed documentation. For this reason, the collected metadata are often incomplete, incomprehensible for outsiders or ambiguous. Based on our research experience in dealing with diverse datasets, we here provide conceptual and technical guidance to overcome the challenges associated with the collection, organization, and storage of metadata in a neurophysiology laboratory. Through the concrete example of managing the metadata of a complex experiment that yields multi-channel recordings from monkeys performing a behavioral motor task, we practically demonstrate the implementation of these approaches and solutions with the intention that they may be generalized to other projects. Moreover, we detail five use cases that demonstrate the resulting benefits of constructing a well-organized metadata collection when processing or analyzing the recorded data, in particular when these are shared between laboratories in a modern scientific collaboration. Finally, we suggest an adaptable workflow to accumulate, structure and store metadata from different sources using, by way of example, the odML metadata framework. PMID:27486397
Handling Metadata in a Neurophysiology Laboratory.
Zehl, Lyuba; Jaillet, Florent; Stoewer, Adrian; Grewe, Jan; Sobolev, Andrey; Wachtler, Thomas; Brochier, Thomas G; Riehle, Alexa; Denker, Michael; Grün, Sonja
2016-01-01
To date, non-reproducibility of neurophysiological research is a matter of intense discussion in the scientific community. A crucial component to enhance reproducibility is to comprehensively collect and store metadata, that is, all information about the experiment, the data, and the applied preprocessing steps on the data, such that they can be accessed and shared in a consistent and simple manner. However, the complexity of experiments, the highly specialized analysis workflows and a lack of knowledge on how to make use of supporting software tools often overburden researchers to perform such a detailed documentation. For this reason, the collected metadata are often incomplete, incomprehensible for outsiders or ambiguous. Based on our research experience in dealing with diverse datasets, we here provide conceptual and technical guidance to overcome the challenges associated with the collection, organization, and storage of metadata in a neurophysiology laboratory. Through the concrete example of managing the metadata of a complex experiment that yields multi-channel recordings from monkeys performing a behavioral motor task, we practically demonstrate the implementation of these approaches and solutions with the intention that they may be generalized to other projects. Moreover, we detail five use cases that demonstrate the resulting benefits of constructing a well-organized metadata collection when processing or analyzing the recorded data, in particular when these are shared between laboratories in a modern scientific collaboration. Finally, we suggest an adaptable workflow to accumulate, structure and store metadata from different sources using, by way of example, the odML metadata framework.
Advancements in Large-Scale Data/Metadata Management for Scientific Data.
NASA Astrophysics Data System (ADS)
Guntupally, K.; Devarakonda, R.; Palanisamy, G.; Frame, M. T.
2017-12-01
Scientific data often comes with complex and diverse metadata which are critical for data discovery and users. The Online Metadata Editor (OME) tool, which was developed by an Oak Ridge National Laboratory team, effectively manages diverse scientific datasets across several federal data centers, such as DOE's Atmospheric Radiation Measurement (ARM) Data Center and USGS's Core Science Analytics, Synthesis, and Libraries (CSAS&L) project. This presentation will focus mainly on recent developments and future strategies for refining OME tool within these centers. The ARM OME is a standard based tool (https://www.archive.arm.gov/armome) that allows scientists to create and maintain metadata about their data products. The tool has been improved with new workflows that help metadata coordinators and submitting investigators to submit and review their data more efficiently. The ARM Data Center's newly upgraded Data Discovery Tool (http://www.archive.arm.gov/discovery) uses rich metadata generated by the OME to enable search and discovery of thousands of datasets, while also providing a citation generator and modern order-delivery techniques like Globus (using GridFTP), Dropbox and THREDDS. The Data Discovery Tool also supports incremental indexing, which allows users to find new data as and when they are added. The USGS CSAS&L search catalog employs a custom version of the OME (https://www1.usgs.gov/csas/ome), which has been upgraded with high-level Federal Geographic Data Committee (FGDC) validations and the ability to reserve and mint Digital Object Identifiers (DOIs). The USGS's Science Data Catalog (SDC) (https://data.usgs.gov/datacatalog) allows users to discover a myriad of science data holdings through a web portal. Recent major upgrades to the SDC and ARM Data Discovery Tool include improved harvesting performance and migration using new search software, such as Apache Solr 6.0 for serving up data/metadata to scientific communities. Our presentation will highlight the future enhancements of these tools which enable users to retrieve fast search results, along with parallelizing the retrieval process from online and High Performance Storage Systems. In addition, these improvements to the tools will support additional metadata formats like the Large-Eddy Simulation (LES) ARM Symbiotic and Observation (LASSO) bundle data.
New Tools to Document and Manage Data/Metadata: Example NGEE Arctic and UrbIS
NASA Astrophysics Data System (ADS)
Crow, M. C.; Devarakonda, R.; Hook, L.; Killeffer, T.; Krassovski, M.; Boden, T.; King, A. W.; Wullschleger, S. D.
2016-12-01
Tools used for documenting, archiving, cataloging, and searching data are critical pieces of informatics. This discussion describes tools being used in two different projects at Oak Ridge National Laboratory (ORNL), but at different stages of the data lifecycle. The Metadata Entry and Data Search Tool is being used for the documentation, archival, and data discovery stages for the Next Generation Ecosystem Experiment - Arctic (NGEE Arctic) project while the Urban Information Systems (UrbIS) Data Catalog is being used to support indexing, cataloging, and searching. The NGEE Arctic Online Metadata Entry Tool [1] provides a method by which researchers can upload their data and provide original metadata with each upload. The tool is built upon a Java SPRING framework to parse user input into, and from, XML output. Many aspects of the tool require use of a relational database including encrypted user-login, auto-fill functionality for predefined sites and plots, and file reference storage and sorting. The UrbIS Data Catalog is a data discovery tool supported by the Mercury cataloging framework [2] which aims to compile urban environmental data from around the world into one location, and be searchable via a user-friendly interface. Each data record conveniently displays its title, source, and date range, and features: (1) a button for a quick view of the metadata, (2) a direct link to the data and, for some data sets, (3) a button for visualizing the data. The search box incorporates autocomplete capabilities for search terms and sorted keyword filters are available on the side of the page, including a map for searching by area. References: [1] Devarakonda, Ranjeet, et al. "Use of a metadata documentation and search tool for large data volumes: The NGEE arctic example." Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015. [2] Devarakonda, R., Palanisamy, G., Wilson, B. E., & Green, J. M. (2010). Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics, 3(1-2), 87-94.
NaKnowBaseTM: The EPA Nanomaterials Research ...
The ability to predict the environmental and health implications of engineered nanomaterials is an important research priority due to the exponential rate at which nanotechnology is being incorporated into consumer, industrial and biomedical applications. To address this need and develop predictive capability, we have created the NaKnowbaseTM, which provides a platform for the curation and dissemination of EPA nanomaterials data to support functional assay development, hazard risk models and informatic analyses. To date, we have combined relevant physicochemical parameters from other organizations (e.g., OECD, NIST), with those requested for nanomaterial data submitted to EPA under the Toxic Substances Control Act (TSCA). Physiochemical characterization data were collated from >400 unique nanomaterials including metals, metal oxides, carbon-based and hybrid materials evaluated or synthesized by EPA researchers. We constructed parameter requirements and table structures for encoding research metadata, including experimental factors and measured response variables. As a proof of concept, we illustrate how SQL-based queries facilitate a range of interrogations including, for example, relationships between nanoparticle characteristics and environmental or toxicological endpoints. The views expressed in this poster are those of the authors and may not reflect U.S. EPA policy. The purpose of this submission for clearance is an abstract for submission to a scientific
The XML Metadata Editor of GFZ Data Services
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Elger, Kirsten; Tesei, Telemaco; Trippanera, Daniele
2017-04-01
Following the FAIR data principles, research data should be Findable, Accessible, Interoperable and Reuseable. Publishing data under these principles requires to assign persistent identifiers to the data and to generate rich machine-actionable metadata. To increase the interoperability, metadata should include shared vocabularies and crosslink the newly published (meta)data and related material. However, structured metadata formats tend to be complex and are not intended to be generated by individual scientists. Software solutions are needed that support scientists in providing metadata describing their data. To facilitate data publication activities of 'GFZ Data Services', we programmed an XML metadata editor that assists scientists to create metadata in different schemata popular in the earth sciences (ISO19115, DIF, DataCite), while being at the same time usable by and understandable for scientists. Emphasis is placed on removing barriers, in particular the editor is publicly available on the internet without registration [1] and the scientists are not requested to provide information that may be generated automatically (e.g. the URL of a specific licence or the contact information of the metadata distributor). Metadata are stored in browser cookies and a copy can be saved to the local hard disk. To improve usability, form fields are translated into the scientific language, e.g. 'creators' of the DataCite schema are called 'authors'. To assist filling in the form, we make use of drop down menus for small vocabulary lists and offer a search facility for large thesauri. Explanations to form fields and definitions of vocabulary terms are provided in pop-up windows and a full documentation is available for download via the help menu. In addition, multiple geospatial references can be entered via an interactive mapping tool, which helps to minimize problems with different conventions to provide latitudes and longitudes. Currently, we are extending the metadata editor to be reused to generate metadata for data discovery and contextual metadata developed by the 'Multi-scale Laboratories' Thematic Core Service of the European Plate Observing System (EPOS-IP). The Editor will be used to build a common repository of a large variety of geological and geophysical datasets produced by multidisciplinary laboratories throughout Europe, thus contributing to a significant step toward the integration and accessibility of earth science data. This presentation will introduce the metadata editor and show the adjustments made for EPOS-IP. [1] http://dataservices.gfz-potsdam.de/panmetaworks/metaedit
Ontology-Based Search of Genomic Metadata.
Fernandez, Javier D; Lenzerini, Maurizio; Masseroli, Marco; Venco, Francesco; Ceri, Stefano
2016-01-01
The Encyclopedia of DNA Elements (ENCODE) is a huge and still expanding public repository of more than 4,000 experiments and 25,000 data files, assembled by a large international consortium since 2007; unknown biological knowledge can be extracted from these huge and largely unexplored data, leading to data-driven genomic, transcriptomic, and epigenomic discoveries. Yet, search of relevant datasets for knowledge discovery is limitedly supported: metadata describing ENCODE datasets are quite simple and incomplete, and not described by a coherent underlying ontology. Here, we show how to overcome this limitation, by adopting an ENCODE metadata searching approach which uses high-quality ontological knowledge and state-of-the-art indexing technologies. Specifically, we developed S.O.S. GeM (http://www.bioinformatics.deib.polimi.it/SOSGeM/), a system supporting effective semantic search and retrieval of ENCODE datasets. First, we constructed a Semantic Knowledge Base by starting with concepts extracted from ENCODE metadata, matched to and expanded on biomedical ontologies integrated in the well-established Unified Medical Language System. We prove that this inference method is sound and complete. Then, we leveraged the Semantic Knowledge Base to semantically search ENCODE data from arbitrary biologists' queries. This allows correctly finding more datasets than those extracted by a purely syntactic search, as supported by the other available systems. We empirically show the relevance of found datasets to the biologists' queries.
The GEOSS Clearinghouse based on the GeoNetwork opensource
NASA Astrophysics Data System (ADS)
Liu, K.; Yang, C.; Wu, H.; Huang, Q.
2010-12-01
The Global Earth Observation System of Systems (GEOSS) is established to support the study of the Earth system in a global community. It provides services for social management, quick response, academic research, and education. The purpose of GEOSS is to achieve comprehensive, coordinated and sustained observations of the Earth system, improve monitoring of the state of the Earth, increase understanding of Earth processes, and enhance prediction of the behavior of the Earth system. In 2009, GEO called for a competition for an official GEOSS clearinghouse to be selected as a source to consolidating catalogs for Earth observations. The Joint Center for Intelligent Spatial Computing at George Mason University worked with USGS to submit a solution based on the open-source platform - GeoNetwork. In the spring of 2010, the solution is selected as the product for GEOSS clearinghouse. The GEOSS Clearinghouse is a common search facility for the Intergovernmental Group on Ea rth Observation (GEO). By providing a list of harvesting functions in Business Logic, GEOSS clearinghouse can collect metadata from distributed catalogs including other GeoNetwork native nodes, webDAV/sitemap/WAF, catalog services for the web (CSW)2.0, GEOSS Component and Service Registry (http://geossregistries.info/), OGC Web Services (WCS, WFS, WMS and WPS), OAI Protocol for Metadata Harvesting 2.0, ArcSDE Server and Local File System. Metadata in GEOSS clearinghouse are managed in a database (MySQL, Postgresql, Oracle, or MckoiDB) and an index of the metadata is maintained through Lucene engine. Thus, EO data, services, and related resources can be discovered and accessed. It supports a variety of geospatial standards including CSW and SRU for search, FGDC and ISO metadata, and WMS related OGC standards for data access and visualization, as linked from the metadata.
Morard, Raphaël; Darling, Kate F; Mahé, Frédéric; Audic, Stéphane; Ujiié, Yurika; Weiner, Agnes K M; André, Aurore; Seears, Heidi A; Wade, Christopher M; Quillévéré, Frédéric; Douady, Christophe J; Escarguel, Gilles; de Garidel-Thoron, Thibault; Siccha, Michael; Kucera, Michal; de Vargas, Colomban
2015-11-01
Planktonic foraminifera (Rhizaria) are ubiquitous marine pelagic protists producing calcareous shells with conspicuous morphology. They play an important role in the marine carbon cycle, and their exceptional fossil record serves as the basis for biochronostratigraphy and past climate reconstructions. A major worldwide sampling effort over the last two decades has resulted in the establishment of multiple large collections of cryopreserved individual planktonic foraminifera samples. Thousands of 18S rDNA partial sequences have been generated, representing all major known morphological taxa across their worldwide oceanic range. This comprehensive data coverage provides an opportunity to assess patterns of molecular ecology and evolution in a holistic way for an entire group of planktonic protists. We combined all available published and unpublished genetic data to build PFR(2), the Planktonic foraminifera Ribosomal Reference database. The first version of the database includes 3322 reference 18S rDNA sequences belonging to 32 of the 47 known morphospecies of extant planktonic foraminifera, collected from 460 oceanic stations. All sequences have been rigorously taxonomically curated using a six-rank annotation system fully resolved to the morphological species level and linked to a series of metadata. The PFR(2) website, available at http://pfr2.sb-roscoff.fr, allows downloading the entire database or specific sections, as well as the identification of new planktonic foraminiferal sequences. Its novel, fully documented curation process integrates advances in morphological and molecular taxonomy. It allows for an increase in its taxonomic resolution and assures that integrity is maintained by including a complete contingency tracking of annotations and assuring that the annotations remain internally consistent. © 2015 John Wiley & Sons Ltd.
Strope, Pooja K; Chaverri, Priscila; Gazis, Romina; Ciufo, Stacy; Domrachev, Michael; Schoch, Conrad L
2017-01-01
Abstract The ITS (nuclear ribosomal internal transcribed spacer) RefSeq database at the National Center for Biotechnology Information (NCBI) is dedicated to the clear association between name, specimen and sequence data. This database is focused on sequences obtained from type material stored in public collections. While the initial ITS sequence curation effort together with numerous fungal taxonomy experts attempted to cover as many orders as possible, we extended our latest focus to the family and genus ranks. We focused on Trichoderma for several reasons, mainly because the asexual and sexual synonyms were well documented, and a list of proposed names and type material were recently proposed and published. In this case study the recent taxonomic information was applied to do a complete taxonomic audit for the genus Trichoderma in the NCBI Taxonomy database. A name status report is available here: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi. As a result, the ITS RefSeq Targeted Loci database at NCBI has been augmented with more sequences from type and verified material from Trichoderma species. Additionally, to aid in the cross referencing of data from single loci and genomes we have collected a list of quality records of the RPB2 gene obtained from type material in GenBank that could help validate future submissions. During the process of curation misidentified genomes were discovered, and sequence records from type material were found hidden under previous classifications. Source metadata curation, although more cumbersome, proved to be useful as confirmation of the type material designation. Database URL: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA177353 PMID:29220466
THE NEW ONLINE METADATA EDITOR FOR GENERATING STRUCTURED METADATA
DOE Office of Scientific and Technical Information (OSTI.GOV)
Devarakonda, Ranjeet; Shrestha, Biva; Palanisamy, Giri
Nobody is better suited to describe data than the scientist who created it. This description about a data is called Metadata. In general terms, Metadata represents the who, what, when, where, why and how of the dataset [1]. eXtensible Markup Language (XML) is the preferred output format for metadata, as it makes it portable and, more importantly, suitable for system discoverability. The newly developed ORNL Metadata Editor (OME) is a Web-based tool that allows users to create and maintain XML files containing key information, or metadata, about the research. Metadata include information about the specific projects, parameters, time periods, andmore » locations associated with the data. Such information helps put the research findings in context. In addition, the metadata produced using OME will allow other researchers to find these data via Metadata clearinghouses like Mercury [2][4]. OME is part of ORNL s Mercury software fleet [2][3]. It was jointly developed to support projects funded by the United States Geological Survey (USGS), U.S. Department of Energy (DOE), National Aeronautics and Space Administration (NASA) and National Oceanic and Atmospheric Administration (NOAA). OME s architecture provides a customizable interface to support project-specific requirements. Using this new architecture, the ORNL team developed OME instances for USGS s Core Science Analytics, Synthesis, and Libraries (CSAS&L), DOE s Next Generation Ecosystem Experiments (NGEE) and Atmospheric Radiation Measurement (ARM) Program, and the international Surface Ocean Carbon Dioxide ATlas (SOCAT). Researchers simply use the ORNL Metadata Editor to enter relevant metadata into a Web-based form. From the information on the form, the Metadata Editor can create an XML file on the server that the editor is installed or to the user s personal computer. Researchers can also use the ORNL Metadata Editor to modify existing XML metadata files. As an example, an NGEE Arctic scientist use OME to register their datasets to the NGEE data archive and allows the NGEE archive to publish these datasets via a data search portal (http://ngee.ornl.gov/data). These highly descriptive metadata created using OME allows the Archive to enable advanced data search options using keyword, geo-spatial, temporal and ontology filters. Similarly, ARM OME allows scientists or principal investigators (PIs) to submit their data products to the ARM data archive. How would OME help Big Data Centers like the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC)? The ORNL DAAC is one of NASA s Earth Observing System Data and Information System (EOSDIS) data centers managed by the Earth Science Data and Information System (ESDIS) Project. The ORNL DAAC archives data produced by NASA's Terrestrial Ecology Program. The DAAC provides data and information relevant to biogeochemical dynamics, ecological data, and environmental processes, critical for understanding the dynamics relating to the biological, geological, and chemical components of the Earth's environment. Typically data produced, archived and analyzed is at a scale of multiple petabytes, which makes the discoverability of the data very challenging. Without proper metadata associated with the data, it is difficult to find the data you are looking for and equally difficult to use and understand the data. OME will allow data centers like the NGEE and ORNL DAAC to produce meaningful, high quality, standards-based, descriptive information about their data products in-turn helping with the data discoverability and interoperability. Useful Links: USGS OME: http://mercury.ornl.gov/OME/ NGEE OME: http://ngee-arctic.ornl.gov/ngeemetadata/ ARM OME: http://archive2.ornl.gov/armome/ Contact: Ranjeet Devarakonda (devarakondar@ornl.gov) References: [1] Federal Geographic Data Committee. Content standard for digital geospatial metadata. Federal Geographic Data Committee, 1998. [2] Devarakonda, Ranjeet, et al. "Mercury: reusable metadata management, data discovery and access system." Earth Science Informatics 3.1-2 (2010): 87-94. [3] Wilson, B. E., Palanisamy, G., Devarakonda, R., Rhyne, B. T., Lindsley, C., & Green, J. (2010). Mercury Toolset for Spatiotemporal Metadata. [4] Pouchard, L. C., Branstetter, M. L., Cook, R. B., Devarakonda, R., Green, J., Palanisamy, G., ... & Noy, N. F. (2013). A Linked Science investigation: enhancing climate change data discovery with semantic technologies. Earth science informatics, 6(3), 175-185.« less
Content-aware network storage system supporting metadata retrieval
NASA Astrophysics Data System (ADS)
Liu, Ke; Qin, Leihua; Zhou, Jingli; Nie, Xuejun
2008-12-01
Nowadays, content-based network storage has become the hot research spot of academy and corporation[1]. In order to solve the problem of hit rate decline causing by migration and achieve the content-based query, we exploit a new content-aware storage system which supports metadata retrieval to improve the query performance. Firstly, we extend the SCSI command descriptor block to enable system understand those self-defined query requests. Secondly, the extracted metadata is encoded by extensible markup language to improve the universality. Thirdly, according to the demand of information lifecycle management (ILM), we store those data in different storage level and use corresponding query strategy to retrieval them. Fourthly, as the file content identifier plays an important role in locating data and calculating block correlation, we use it to fetch files and sort query results through friendly user interface. Finally, the experiments indicate that the retrieval strategy and sort algorithm have enhanced the retrieval efficiency and precision.
Utilizing Linked Open Data Sources for Automatic Generation of Semantic Metadata
NASA Astrophysics Data System (ADS)
Nummiaho, Antti; Vainikainen, Sari; Melin, Magnus
In this paper we present an application that can be used to automatically generate semantic metadata for tags given as simple keywords. The application that we have implemented in Java programming language creates the semantic metadata by linking the tags to concepts in different semantic knowledge bases (CrunchBase, DBpedia, Freebase, KOKO, Opencyc, Umbel and/or WordNet). The steps that our application takes in doing so include detecting possible languages, finding spelling suggestions and finding meanings from amongst the proper nouns and common nouns separately. Currently, our application supports English, Finnish and Swedish words, but other languages could be included easily if the required lexical tools (spellcheckers, etc.) are available. The created semantic metadata can be of great use in, e.g., finding and combining similar contents, creating recommendations and targeting advertisements.
NASA Astrophysics Data System (ADS)
Do, Hong; Gudmundsson, Lukas; Leonard, Michael; Westra, Seth; Senerivatne, Sonia
2017-04-01
In-situ observations of daily streamflow with global coverage are a crucial asset for understanding large-scale freshwater resources which are an essential component of the Earth system and a prerequisite for societal development. Here we present the Global Streamflow Indices and Metadata archive (G-SIM), a collection indices derived from more than 20,000 daily streamflow time series across the globe. These indices are designed to support global assessments of change in wet and dry extremes, and have been compiled from 12 free-to-access online databases (seven national databases and five international collections). The G-SIM archive also includes significant metadata to help support detailed understanding of streamflow dynamics, with the inclusion of drainage area shapefile and many essential catchment properties such as land cover type, soil and topographic characteristics. The automated procedure in data handling and quality control of the project makes G-SIM a reproducible, extendible archive and can be utilised for many purposes in large-scale hydrology. Some potential applications include the identification of observational trends in hydrological extremes, the assessment of climate change impacts on streamflow regimes, and the validation of global hydrological models.
Pfaff, Claas-Thido; Eichenberg, David; Liebergesell, Mario; König-Ries, Birgitta; Wirth, Christian
2017-01-01
Ecology has become a data intensive science over the last decades which often relies on the reuse of data in cross-experimental analyses. However, finding data which qualifies for the reuse in a specific context can be challenging. It requires good quality metadata and annotations as well as efficient search strategies. To date, full text search (often on the metadata only) is the most widely used search strategy although it is known to be inaccurate. Faceted navigation is providing a filter mechanism which is based on fine granular metadata, categorizing search objects along numeric and categorical parameters relevant for their discovery. Selecting from these parameters during a full text search creates a system of filters which allows to refine and improve the results towards more relevance. We developed a framework for the efficient annotation and faceted navigation in ecology. It consists of an XML schema for storing the annotation of search objects and is accompanied by a vocabulary focused on ecology to support the annotation process. The framework consolidates ideas which originate from widely accepted metadata standards, textbooks, scientific literature, and vocabularies as well as from expert knowledge contributed by researchers from ecology and adjacent disciplines.
Uciteli, Alexandr; Herre, Heinrich
2015-01-01
The specification of metadata in clinical and epidemiological study projects absorbs significant expense. The validity and quality of the collected data depend heavily on the precise and semantical correct representation of their metadata. In various research organizations, which are planning and coordinating studies, the required metadata are specified differently, depending on many conditions, e.g., on the used study management software. The latter does not always meet the needs of a particular research organization, e.g., with respect to the relevant metadata attributes and structuring possibilities. The objective of the research, set forth in this paper, is the development of a new approach for ontology-based representation and management of metadata. The basic features of this approach are demonstrated by the software tool OntoStudyEdit (OSE). The OSE is designed and developed according to the three ontology method. This method for developing software is based on the interactions of three different kinds of ontologies: a task ontology, a domain ontology and a top-level ontology. The OSE can be easily adapted to different requirements, and it supports an ontologically founded representation and efficient management of metadata. The metadata specifications can by imported from various sources; they can be edited with the OSE, and they can be exported in/to several formats, which are used, e.g., by different study management software. Advantages of this approach are the adaptability of the OSE by integrating suitable domain ontologies, the ontological specification of mappings between the import/export formats and the DO, the specification of the study metadata in a uniform manner and its reuse in different research projects, and an intuitive data entry for non-expert users.
CINERGI: Community Inventory of EarthCube Resources for Geoscience Interoperability
NASA Astrophysics Data System (ADS)
Zaslavsky, Ilya; Bermudez, Luis; Grethe, Jeffrey; Gupta, Amarnath; Hsu, Leslie; Lehnert, Kerstin; Malik, Tanu; Richard, Stephen; Valentine, David; Whitenack, Thomas
2014-05-01
Organizing geoscience data resources to support cross-disciplinary data discovery, interpretation, analysis and integration is challenging because of different information models, semantic frameworks, metadata profiles, catalogs, and services used in different geoscience domains, not to mention different research paradigms and methodologies. The central goal of CINERGI, a new project supported by the US National Science Foundation through its EarthCube Building Blocks program, is to create a methodology and assemble a large inventory of high-quality information resources capable of supporting data discovery needs of researchers in a wide range of geoscience domains. The key characteristics of the inventory are: 1) collaboration with and integration of metadata resources from a number of large data facilities; 2) reliance on international metadata and catalog service standards; 3) assessment of resource "interoperability-readiness"; 4) ability to cross-link and navigate data resources, projects, models, researcher directories, publications, usage information, etc.; 5) efficient inclusion of "long-tail" data, which are not appearing in existing domain repositories; 6) data registration at feature level where appropriate, in addition to common dataset-level registration, and 7) integration with parallel EarthCube efforts, in particular focused on EarthCube governance, information brokering, service-oriented architecture design and management of semantic information. We discuss challenges associated with accomplishing CINERGI goals, including defining the inventory scope; managing different granularity levels of resource registration; interaction with search systems of domain repositories; explicating domain semantics; metadata brokering, harvesting and pruning; managing provenance of the harvested metadata; and cross-linking resources based on the linked open data (LOD) approaches. At the higher level of the inventory, we register domain-wide resources such as domain catalogs, vocabularies, information models, data service specifications, identifier systems, and assess their conformance with international standards (such as those adopted by ISO and OGC, and used by INSPIRE) or de facto community standards using, in part, automatic validation techniques. The main level in CINERGI leverages a metadata aggregation platform (currently Geoportal Server) to organize harvested resources from multiple collections and contributed by community members during EarthCube end-user domain workshops or suggested online. The latter mechanism uses the SciCrunch toolkit originally developed within the Neuroscience Information Framework (NIF) project and now being extended to other communities. The inventory is designed to support requests such as "Find resources with theme X in geographic area S", "Find datasets with subject Y using query concept expansion", "Find geographic regions having data of type Z", "Find datasets that contain property P". With the added LOD support, additional types of requests, such as "Find example implementations of specification X", "Find researchers who have worked in Domain X, dataset Y, location L", "Find resources annotated by person X", will be supported. Project's website (http://workspace.earthcube.org/cinergi) provides access to the initial resource inventory, a gallery of EarthCube researchers, collections of geoscience models, metadata entry forms, and other software modules and inventories being integrated into the CINERGI system. Support from the US National Science Foundation under award NSF ICER-1343816 is gratefully acknowledged.
Wu, Honghan; Oellrich, Anika; Girges, Christine; de Bono, Bernard; Hubbard, Tim J P; Dobson, Richard J B
2017-01-01
Neurodegenerative disorders such as Parkinson's and Alzheimer's disease are devastating and costly illnesses, a source of major global burden. In order to provide successful interventions for patients and reduce costs, both causes and pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of neurodegenerative disorders by manually curating and abstracting data from the vast body of literature amassed on these illnesses. As curation is labour-intensive, we aimed to speed up the process by automatically highlighting those parts of the PDF document of primary importance to the curator. Using techniques similar to those of summarisation, we developed an algorithm that relies on linguistic, semantic and spatial features. Employing this algorithm on a test set manually corrected for tool imprecision, we achieved a macro F 1 -measure of 0.51, which is an increase of 132% compared to the best bag-of-words baseline model. A user based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications, which reveals that in 85% of cases all highlighted sentences are relevant to the curation task and in about 65% of the cases, the highlights are sufficient to support the knowledge curation task without needing to consult the full text. In conclusion, we believe that these are promising results for a step in automating the recognition of curation-relevant sentences. Refining our approach to pre-digest papers will lead to faster processing and cost reduction in the curation process. https://github.com/KHP-Informatics/NapEasy. © The Author(s) 2017. Published by Oxford University Press.
Oellrich, Anika; Girges, Christine; de Bono, Bernard; Hubbard, Tim J.P.; Dobson, Richard J.B.
2017-01-01
Abstract Neurodegenerative disorders such as Parkinson’s and Alzheimer’s disease are devastating and costly illnesses, a source of major global burden. In order to provide successful interventions for patients and reduce costs, both causes and pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of neurodegenerative disorders by manually curating and abstracting data from the vast body of literature amassed on these illnesses. As curation is labour-intensive, we aimed to speed up the process by automatically highlighting those parts of the PDF document of primary importance to the curator. Using techniques similar to those of summarisation, we developed an algorithm that relies on linguistic, semantic and spatial features. Employing this algorithm on a test set manually corrected for tool imprecision, we achieved a macro F1-measure of 0.51, which is an increase of 132% compared to the best bag-of-words baseline model. A user based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications, which reveals that in 85% of cases all highlighted sentences are relevant to the curation task and in about 65% of the cases, the highlights are sufficient to support the knowledge curation task without needing to consult the full text. In conclusion, we believe that these are promising results for a step in automating the recognition of curation-relevant sentences. Refining our approach to pre-digest papers will lead to faster processing and cost reduction in the curation process. Database URL: https://github.com/KHP-Informatics/NapEasy PMID:28365743
PomBase: a comprehensive online resource for fission yeast
Wood, Valerie; Harris, Midori A.; McDowall, Mark D.; Rutherford, Kim; Vaughan, Brendan W.; Staines, Daniel M.; Aslett, Martin; Lock, Antonia; Bähler, Jürg; Kersey, Paul J.; Oliver, Stephen G.
2012-01-01
PomBase (www.pombase.org) is a new model organism database established to provide access to comprehensive, accurate, and up-to-date molecular data and biological information for the fission yeast Schizosaccharomyces pombe to effectively support both exploratory and hypothesis-driven research. PomBase encompasses annotation of genomic sequence and features, comprehensive manual literature curation and genome-wide data sets, and supports sophisticated user-defined queries. The implementation of PomBase integrates a Chado relational database that houses manually curated data with Ensembl software that supports sequence-based annotation and web access. PomBase will provide user-friendly tools to promote curation by experts within the fission yeast community. This will make a key contribution to shaping its content and ensuring its comprehensiveness and long-term relevance. PMID:22039153
Information integration for a sky survey by data warehousing
NASA Astrophysics Data System (ADS)
Luo, A.; Zhang, Y.; Zhao, Y.
The virtualization service of data system for a sky survey LAMOST is very important for astronomers The service needs to integrate information from data collections catalogs and references and support simple federation of a set of distributed files and associated metadata Data warehousing has been in existence for several years and demonstrated superiority over traditional relational database management systems by providing novel indexing schemes that supported efficient on-line analytical processing OLAP of large databases Now relational database systems such as Oracle etc support the warehouse capability which including extensions to the SQL language to support OLAP operations and a number of metadata management tools have been created The information integration of LAMOST by applying data warehousing is to effectively provide data and knowledge on-line
NASA Astrophysics Data System (ADS)
Benedict, K. K.; Lenhardt, W. C.; Young, J. W.; Gordon, L. C.; Hughes, S.; Santhana Vannan, S. K.
2017-12-01
The planning for and development of efficient workflows for the creation, reuse, sharing, documentation, publication and preservation of research data is a general challenge that research teams of all sizes face. In response to: requirements from funding agencies for full-lifecycle data management plans that will result in well documented, preserved, and shared research data products increasing requirements from publishers for shared data in conjunction with submitted papers interdisciplinary research team's needs for efficient data sharing within projects, and increasing reuse of research data for replication and new, unanticipated research, policy development, and public use alternative strategies to traditional data life cycle approaches must be developed and shared that enable research teams to meet these requirements while meeting the core science objectives of their projects within the available resources. In support of achieving these goals, the concept of Agile Data Curation has been developed in which there have been parallel activities in support of 1) identifying a set of shared values and principles that underlie the objectives of agile data curation, 2) soliciting case studies from the Earth science and other research communities that illustrate aspects of what the contributors consider agile data curation methods and practices, and 3) identifying or developing design patterns that are high-level abstractions from successful data curation practice that are related to common data curation problems for which common solution strategies may be employed. This paper provides a collection of case studies that have been contributed by the Earth science community, and an initial analysis of those case studies to map them to emerging shared data curation problems and their potential solutions. Following the initial analysis of these problems and potential solutions, existing design patterns from software engineering and related disciplines are identified as a starting point for the development of a catalog of data curation design patterns that may be reused in the design and execution of new data curation processes.
NASA Astrophysics Data System (ADS)
Sheldon, W.; Chamblee, J.; Cary, R. H.
2013-12-01
Environmental scientists are under increasing pressure from funding agencies and journal publishers to release quality-controlled data in a timely manner, as well as to produce comprehensive metadata for submitting data to long-term archives (e.g. DataONE, Dryad and BCO-DMO). At the same time, the volume of digital data that researchers collect and manage is increasing rapidly due to advances in high frequency electronic data collection from flux towers, instrumented moorings and sensor networks. However, few pre-built software tools are available to meet these data management needs, and those tools that do exist typically focus on part of the data management lifecycle or one class of data. The GCE Data Toolbox has proven to be both a generalized and effective software solution for environmental data management in the Long Term Ecological Research Network (LTER). This open source MATLAB software library, developed by the Georgia Coastal Ecosystems LTER program, integrates metadata capture, creation and management with data processing, quality control and analysis to support the entire data lifecycle. Raw data can be imported directly from common data logger formats (e.g. SeaBird, Campbell Scientific, YSI, Hobo), as well as delimited text files, MATLAB files and relational database queries. Basic metadata are derived from the data source itself (e.g. parsed from file headers) and by value inspection, and then augmented using editable metadata templates containing boilerplate documentation, attribute descriptors, code definitions and quality control rules. Data and metadata content, quality control rules and qualifier flags are then managed together in a robust data structure that supports database functionality and ensures data validity throughout processing. A growing suite of metadata-aware editing, quality control, analysis and synthesis tools are provided with the software to support managing data using graphical forms and command-line functions, as well as developing automated workflows for unattended processing. Finalized data and structured metadata can be exported in a wide variety of text and MATLAB formats or uploaded to a relational database for long-term archiving and distribution. The GCE Data Toolbox can be used as a complete, light-weight solution for environmental data and metadata management, but it can also be used in conjunction with other cyber infrastructure to provide a more comprehensive solution. For example, newly acquired data can be retrieved from a Data Turbine or Campbell LoggerNet Database server for quality control and processing, then transformed to CUAHSI Observations Data Model format and uploaded to a HydroServer for distribution through the CUAHSI Hydrologic Information System. The GCE Data Toolbox can also be leveraged in analytical workflows developed using Kepler or other systems that support MATLAB integration or tool chaining. This software can therefore be leveraged in many ways to help researchers manage, analyze and distribute the data they collect.
The Benefits and Future of Standards: Metadata and Beyond
NASA Astrophysics Data System (ADS)
Stracke, Christian M.
This article discusses the benefits and future of standards and presents the generic multi-dimensional Reference Model. First the importance and the tasks of interoperability as well as quality development and their relationship are analyzed. Especially in e-Learning their connection and interdependence is evident: Interoperability is one basic requirement for quality development. In this paper, it is shown how standards and specifications are supporting these crucial issues. The upcoming ISO metadata standard MLR (Metadata for Learning Resource) will be introduced and used as example for identifying the requirements and needs for future standardization. In conclusion a vision of the challenges and potentials for e-Learning standardization is outlined.
NASA Astrophysics Data System (ADS)
Hibbert, F. D.; Williams, F. H.; Fallon, S.; Rohling, E. J.
2017-12-01
The last deglacial was an interval of rapid climate and sea-level change, including the collapse of large continental ice sheets. This database collates carefully assessed sea-level data from peer-reviewed sources for the interval 0 to 25 thousand years ago (ka), from the last glacial maximum to the present interglacial conditions. In addition to facilitating site-specific reconstructions of past sea levels, the database provides a suite of data beyond the range of modern/instrumental variability that may help hone future sea-level projections. The database is global in scope, internally consistent, and contains U-series and radiocarbon dated indicators from both biological and geomorpohological archives. We focus on far-field data (i.e., away from the sites of the former continental ice sheets), but some key intermediate (i.e., from the Caribbean) data are also included. All primary fields (i.e., sample location, elevation, age and context) possess quantified uncertainties, which - in conjunction with available metadata - allows the reconstructed sea levels to be interpreted within both their uncertainties and geological context. Consistent treatment of each of the individual records in the database, and incorporation of fully expressed uncertainties, allows datasets to be easily compared. The compilation contains 145 studies from 40 locations (>2,000 data points) and includes all raw information and metadata.
The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases
Orchard, Sandra; Ammari, Mais; Aranda, Bruno; Breuza, Lionel; Briganti, Leonardo; Broackes-Carter, Fiona; Campbell, Nancy H.; Chavali, Gayatri; Chen, Carol; del-Toro, Noemi; Duesbury, Margaret; Dumousseau, Marine; Galeota, Eugenia; Hinz, Ursula; Iannuccelli, Marta; Jagannathan, Sruthi; Jimenez, Rafael; Khadake, Jyoti; Lagreid, Astrid; Licata, Luana; Lovering, Ruth C.; Meldal, Birgit; Melidoni, Anna N.; Milagros, Mila; Peluso, Daniele; Perfetto, Livia; Porras, Pablo; Raghunath, Arathi; Ricard-Blum, Sylvie; Roechert, Bernd; Stutz, Andre; Tognolli, Michael; van Roey, Kim; Cesareni, Gianni; Hermjakob, Henning
2014-01-01
IntAct (freely available at http://www.ebi.ac.uk/intact) is an open-source, open data molecular interaction database populated by data either curated from the literature or from direct data depositions. IntAct has developed a sophisticated web-based curation tool, capable of supporting both IMEx- and MIMIx-level curation. This tool is now utilized by multiple additional curation teams, all of whom annotate data directly into the IntAct database. Members of the IntAct team supply appropriate levels of training, perform quality control on entries and take responsibility for long-term data maintenance. Recently, the MINT and IntAct databases decided to merge their separate efforts to make optimal use of limited developer resources and maximize the curation output. All data manually curated by the MINT curators have been moved into the IntAct database at EMBL-EBI and are merged with the existing IntAct dataset. Both IntAct and MINT are active contributors to the IMEx consortium (http://www.imexconsortium.org). PMID:24234451
The Role of the Curator in Modern Hospitals: A Transcontinental Perspective.
Moss, Hilary; O'Neill, Desmond
2016-12-13
This paper explores the role of the curator in hospitals. The arts play a significant role in every society; however, recent studies indicate a neglect of the aesthetic environment of healthcare. This international study explores the complex role of the curator in modern hospitals. Semi-structured interviews were conducted with ten arts specialists in hospitals across five countries and three continents for a qualitative, phenomenological study. Five themes arose from the data: (1) Patient involvement and influence on the arts programme in hospital (2) Understanding the role of the curator in hospital (3) Influences on arts programming in hospital (4) Types of arts programmes (5) Limitations to effective curation in hospital. Recommendations arising from the research included recognition of the specialised role of the curator in hospitals; building positive links with clinical staff to effect positive hospital arts programmes and increasing formal involvement of patients in arts planning in hospital. Hospital curation can be a vibrant arena for arts development, and the role of the hospital curator is a ground-breaking specialist role that can bring benefits to hospital life. The role of curator in hospital deserves to be supported and developed by both the arts and health sectors.
NASA Astrophysics Data System (ADS)
Tanner, S.; Schwab, M.; Beam, K.; Skaug, M.
2017-12-01
Operation IceBridge has been flying campaigns in the Arctic and Antarctic for nearly 10 years and will soon be a decadal mission. During that time, the generation and use of file level metadata has evolved from nearly non-existent to robust spatio-temporal support. This evolution has been difficult at times, but the results speak for themselves in the form of production tools for search, discovery, access and analysis. The lessons learned from this experience are now being incorporated into SnowEx, a new mission to measure snow cover using airborne and ground-based measurements. This presentation will focus on techniques for generating metadata for such a diverse set of measurements as well as the resulting tools that utilize this information. This includes the development and deployment of MetGen, a semi-automated metadata generation capability that relies on collaboration between data producers and data archivers, the newly deployed IceBridge data portal which incorporates data browse capabilities and limited in-line analysis, and programmatic access to metadata and data for incorporation into larger automated workflows.
Operational Support for Instrument Stability through ODI-PPA Metadata Visualization and Analysis
NASA Astrophysics Data System (ADS)
Young, M. D.; Hayashi, S.; Gopu, A.; Kotulla, R.; Harbeck, D.; Liu, W.
2015-09-01
Over long time scales, quality assurance metrics taken from calibration and calibrated data products can aid observatory operations in quantifying the performance and stability of the instrument, and identify potential areas of concern or guide troubleshooting and engineering efforts. Such methods traditionally require manual SQL entries, assuming the requisite metadata has even been ingested into a database. With the ODI-PPA system, QA metadata has been harvested and indexed for all data products produced over the life of the instrument. In this paper we will describe how, utilizing the industry standard Highcharts Javascript charting package with a customized AngularJS-driven user interface, we have made the process of visualizing the long-term behavior of these QA metadata simple and easily replicated. Operators can easily craft a custom query using the powerful and flexible ODI-PPA search interface and visualize the associated metadata in a variety of ways. These customized visualizations can be bookmarked, shared, or embedded externally, and will be dynamically updated as new data products enter the system, enabling operators to monitor the long-term health of their instrument with ease.
NASA Technical Reports Server (NTRS)
Golden, Keith; Clancy, Dan (Technical Monitor)
2001-01-01
The data management problem comprises data processing and data tracking. Data processing is the creation of new data based on existing data sources. Data tracking consists of storing metadata descriptions of available data. This paper addresses the data management problem by casting it as an AI planning problem. Actions are data-processing commands, plans are dataflow programs and goals are metadata descriptions of desired data products. Data manipulation is simply plan generation and execution, and a key component of data tracking is inferring the effects of an observed plan. We introduce a new action language for data management domains, called ADILM. We discuss the connection between data processing and information integration and show how a language for the latter must be modified to support the former. The paper also discusses information gathering within a data-processing framework, and show how ADILM metadata expressions are a generalization of Local Completeness.
Cheminformatics and the Semantic Web: adding value with linked data and enhanced provenance
Frey, Jeremy G; Bird, Colin L
2013-01-01
Cheminformatics is evolving from being a field of study associated primarily with drug discovery into a discipline that embraces the distribution, management, access, and sharing of chemical data. The relationship with the related subject of bioinformatics is becoming stronger and better defined, owing to the influence of Semantic Web technologies, which enable researchers to integrate heterogeneous sources of chemical, biochemical, biological, and medical information. These developments depend on a range of factors: the principles of chemical identifiers and their role in relationships between chemical and biological entities; the importance of preserving provenance and properly curated metadata; and an understanding of the contribution that the Semantic Web can make at all stages of the research lifecycle. The movements toward open access, open source, and open collaboration all contribute to progress toward the goals of integration. PMID:24432050
Astromaterials Curation Online Resources for Principal Investigators
NASA Technical Reports Server (NTRS)
Todd, Nancy S.; Zeigler, Ryan A.; Mueller, Lina
2017-01-01
The Astromaterials Acquisition and Curation office at NASA Johnson Space Center curates all of NASA's extraterrestrial samples, the most extensive set of astromaterials samples available to the research community worldwide. The office allocates 1500 individual samples to researchers and students each year and has served the planetary research community for 45+ years. The Astromaterials Curation office provides access to its sample data repository and digital resources to support the research needs of sample investigators and to aid in the selection and request of samples for scientific study. These resources can be found on the Astromaterials Acquisition and Curation website at https://curator.jsc.nasa.gov. To better serve our users, we have engaged in several activities to enhance the data available for astromaterials samples, to improve the accessibility and performance of the website, and to address user feedback. We havealso put plans in place for continuing improvements to our existing data products.
NASA Astrophysics Data System (ADS)
Oggioni, Alessandro; Tagliolato, Paolo; Fugazza, Cristiano; Bastianini, Mauro; Pavesi, Fabio; Pepe, Monica; Menegon, Stefano; Basoni, Anna; Carrara, Paola
2015-04-01
Sensor observation systems for environmental data have become increasingly important in the last years. The EGU's Informatics in Oceanography and Ocean Science track stressed the importance of management tools and solutions for marine infrastructures. We think that full interoperability among sensor systems is still an open issue and that the solution to this involves providing appropriate metadata. Several open source applications implement the SWE specification and, particularly, the Sensor Observation Services (SOS) standard. These applications allow for the exchange of data and metadata in XML format between computer systems. However, there is a lack of metadata editing tools supporting end users in this activity. Generally speaking, it is hard for users to provide sensor metadata in the SensorML format without dedicated tools. In particular, such a tool should ease metadata editing by providing, for standard sensors, all the invariant information to be included in sensor metadata, thus allowing the user to concentrate on the metadata items that are related to the specific deployment. RITMARE, the Italian flagship project on marine research, envisages a subproject, SP7, for the set-up of the project's spatial data infrastructure. SP7 developed EDI, a general purpose, template-driven metadata editor that is composed of a backend web service and an HTML5/javascript client. EDI can be customized for managing the creation of generic metadata encoded as XML. Once tailored to a specific metadata format, EDI presents the users a web form with advanced auto completion and validation capabilities. In the case of sensor metadata (SensorML versions 1.0.1 and 2.0), the EDI client is instructed to send an "insert sensor" request to an SOS endpoint in order to save the metadata in an SOS server. In the first phase of project RITMARE, EDI has been used to simplify the creation from scratch of SensorML metadata by the involved researchers and data managers. An interesting by-product of this ongoing work is currently constituting an archive of predefined sensor descriptions. This information is being collected in order to further ease metadata creation in the next phase of the project. Users will be able to choose among a number of sensor and sensor platform prototypes: These will be specific instances on which it will be possible to define, in a bottom-up approach, "sensor profiles". We report on the outcome of this activity.
Combined use of semantics and metadata to manage Research Data Life Cycle in Environmental Sciences
NASA Astrophysics Data System (ADS)
Aguilar Gómez, Fernando; de Lucas, Jesús Marco; Pertinez, Esther; Palacio, Aida
2017-04-01
The use of metadata to contextualize datasets is quite extended in Earth System Sciences. There are some initiatives and available tools to help data managers to choose the best metadata standard that fit their use cases, like the DCC Metadata Directory (http://www.dcc.ac.uk/resources/metadata-standards). In our use case, we have been gathering physical, chemical and biological data from a water reservoir since 2010. A well metadata definition is crucial not only to contextualize our own data but also to integrate datasets from other sources like satellites or meteorological agencies. That is why we have chosen EML (Ecological Metadata Language), which integrates many different elements to define a dataset, including the project context, instrumentation and parameters definition, and the software used to process, provide quality controls and include the publication details. Those metadata elements can contribute to help both human and machines to understand and process the dataset. However, the use of metadata is not enough to fully support the data life cycle, from the Data Management Plan definition to the Publication and Re-use. To do so, we need to define not only metadata and attributes but also the relationships between them, so semantics are needed. Ontologies, being a knowledge representation, can contribute to define the elements of a research data life cycle, including DMP, datasets, software, etc. They also can define how the different elements are related between them and how they interact. The first advantage of developing an ontology of a knowledge domain is that they provide a common vocabulary hierarchy (i.e. a conceptual schema) that can be used and standardized by all the agents interested in the domain (either humans or machines). This way of using ontologies is one of the basis of the Semantic Web, where ontologies are set to play a key role in establishing a common terminology between agents. To develop an ontology we are using a graphical tool Protégé, which is a graphical ontology-development tool that supports a rich knowledge model and it is open-source and freely available. To process and manage the ontology, we are using Semantic MediaWiki, which is able to process queries. Semantic MediaWiki is an extension of MediaWiki where we can do semantic search and export data in RDF. Our final goal is integrating our data repository portal and semantic processing engine in order to have a complete system to manage the data life cycle stages and their relationships, including machine-actionable DMP solution, datasets and software management, computing resources for processing and analysis and publication features (DOI mint). This way we will be able to reproduce the full data life cycle chain warranting the FAIR+R principles.
Mercury: An Example of Effective Software Reuse for Metadata Management, Data Discovery and Access
NASA Astrophysics Data System (ADS)
Devarakonda, Ranjeet; Palanisamy, Giri; Green, James; Wilson, Bruce E.
2008-12-01
Mercury is a federated metadata harvesting, data discovery and access tool based on both open source packages and custom developed software. Though originally developed for NASA, the Mercury development consortium now includes funding from NASA, USGS, and DOE. Mercury supports the reuse of metadata by enabling searching across a range of metadata specification and standards including XML, Z39.50, FGDC, Dublin-Core, Darwin-Core, EML, and ISO-19115. Mercury provides a single portal to information contained in distributed data management systems. It collects metadata and key data from contributing project servers distributed around the world and builds a centralized index. The Mercury search interfaces then allow the users to perform simple, fielded, spatial and temporal searches across these metadata sources. One of the major goals of the recent redesign of Mercury was to improve the software reusability across the 12 projects which currently fund the continuing development of Mercury. These projects span a range of land, atmosphere, and ocean ecological communities and have a number of common needs for metadata searches, but they also have a number of needs specific to one or a few projects. To balance these common and project-specific needs, Mercury's architecture has three major reusable components; a harvester engine, an indexing system and a user interface component. The harvester engine is responsible for harvesting metadata records from various distributed servers around the USA and around the world. The harvester software was packaged in such a way that all the Mercury projects will use the same harvester scripts but each project will be driven by a set of project specific configuration files. The harvested files are structured metadata records that are indexed against the search library API consistently, so that it can render various search capabilities such as simple, fielded, spatial and temporal. This backend component is supported by a very flexible, easy to use Graphical User Interface which is driven by cascading style sheets, which make it even simpler for reusable design implementation. The new Mercury system is based on a Service Oriented Architecture and effectively reuses components for various services such as Thesaurus Service, Gazetteer Web Service and UDDI Directory Services. The software also provides various search services including: RSS, Geo-RSS, OpenSearch, Web Services and Portlets, integrated shopping cart to order datasets from various data centers (ORNL DAAC, NSIDC) and integrated visualization tools. Other features include: Filtering and dynamic sorting of search results, book- markable search results, save, retrieve, and modify search criteria.
Mercury: An Example of Effective Software Reuse for Metadata Management, Data Discovery and Access
DOE Office of Scientific and Technical Information (OSTI.GOV)
Devarakonda, Ranjeet
2008-01-01
Mercury is a federated metadata harvesting, data discovery and access tool based on both open source packages and custom developed software. Though originally developed for NASA, the Mercury development consortium now includes funding from NASA, USGS, and DOE. Mercury supports the reuse of metadata by enabling searching across a range of metadata specification and standards including XML, Z39.50, FGDC, Dublin-Core, Darwin-Core, EML, and ISO-19115. Mercury provides a single portal to information contained in distributed data management systems. It collects metadata and key data from contributing project servers distributed around the world and builds a centralized index. The Mercury search interfacesmore » then allow the users to perform simple, fielded, spatial and temporal searches across these metadata sources. One of the major goals of the recent redesign of Mercury was to improve the software reusability across the 12 projects which currently fund the continuing development of Mercury. These projects span a range of land, atmosphere, and ocean ecological communities and have a number of common needs for metadata searches, but they also have a number of needs specific to one or a few projects. To balance these common and project-specific needs, Mercury's architecture has three major reusable components; a harvester engine, an indexing system and a user interface component. The harvester engine is responsible for harvesting metadata records from various distributed servers around the USA and around the world. The harvester software was packaged in such a way that all the Mercury projects will use the same harvester scripts but each project will be driven by a set of project specific configuration files. The harvested files are structured metadata records that are indexed against the search library API consistently, so that it can render various search capabilities such as simple, fielded, spatial and temporal. This backend component is supported by a very flexible, easy to use Graphical User Interface which is driven by cascading style sheets, which make it even simpler for reusable design implementation. The new Mercury system is based on a Service Oriented Architecture and effectively reuses components for various services such as Thesaurus Service, Gazetteer Web Service and UDDI Directory Services. The software also provides various search services including: RSS, Geo-RSS, OpenSearch, Web Services and Portlets, integrated shopping cart to order datasets from various data centers (ORNL DAAC, NSIDC) and integrated visualization tools. Other features include: Filtering and dynamic sorting of search results, book- markable search results, save, retrieve, and modify search criteria.« less
NASA Astrophysics Data System (ADS)
Boldrini, Enrico; Schaap, Dick M. A.; Nativi, Stefano
2013-04-01
SeaDataNet implements a distributed pan-European infrastructure for Ocean and Marine Data Management whose nodes are maintained by 40 national oceanographic and marine data centers from 35 countries riparian to all European seas. A unique portal makes possible distributed discovery, visualization and access of the available sea data across all the member nodes. Geographic metadata play an important role in such an infrastructure, enabling an efficient documentation and discovery of the resources of interest. In particular: - Common Data Index (CDI) metadata describe the sea datasets, including identification information (e.g. product title, interested area), evaluation information (e.g. data resolution, constraints) and distribution information (e.g. download endpoint, download protocol); - Cruise Summary Reports (CSR) metadata describe cruises and field experiments at sea, including identification information (e.g. cruise title, name of the ship), acquisition information (e.g. utilized instruments, number of samples taken) In the context of the second phase of SeaDataNet (SeaDataNet 2 EU FP7 project, grant agreement 283607, started on October 1st, 2011 for a duration of 4 years) a major target is the setting, adoption and promotion of common international standards, to the benefit of outreach and interoperability with the international initiatives and communities (e.g. OGC, INSPIRE, GEOSS, …). A standardization effort conducted by CNR with the support of MARIS, IFREMER, STFC, BODC and ENEA has led to the creation of a ISO 19115 metadata profile of CDI and its XML encoding based on ISO 19139. The CDI profile is now in its stable version and it's being implemented and adopted by the SeaDataNet community tools and software. The effort has then continued to produce an ISO based metadata model and its XML encoding also for CSR. The metadata elements included in the CSR profile belong to different models: - ISO 19115: E.g. cruise identification information, including title and area of interest; metadata responsible party information - ISO 19115-2: E.g. acquisition information, including date of sampling, instruments used - SeaDataNet: E.g. SeaDataNet community specific, including EDMO and EDMERP code lists Two main guidelines have been followed in the metadata model drafting: - All the obligations and constraints required by both the ISO standards and INSPIRE directive had to be satisfied. These include the presence of specific elements with given cardinality (e.g. mandatory metadata date stamp, mandatory lineage information) - All the content information of legacy CSR format had to be supported by the new metadata model. An XML encoding of the CSR profile has been defined as well. Based on the ISO 19139 XML schema and constraints, it adds the new elements specific of the SeaDataNet community. The associated Schematron rules are used to enforce constraints not enforceable just with the Schema and to validate elements content against the SeaDataNet code lists vocabularies.
NASA Astrophysics Data System (ADS)
Williams, J. W.; Ashworth, A. C.; Betancourt, J. L.; Bills, B.; Blois, J.; Booth, R.; Buckland, P.; Charles, D.; Curry, B. B.; Goring, S. J.; Davis, E.; Grimm, E. C.; Graham, R. W.; Smith, A. J.
2015-12-01
Community-supported data repositories (CSDRs) in paleoecology and paleoclimatology have a decades-long tradition and serve multiple critical scientific needs. CSDRs facilitate synthetic large-scale scientific research by providing open-access and curated data that employ community-supported metadata and data standards. CSDRs serve as a 'middle tail' or boundary organization between information scientists and the long-tail community of individual geoscientists collecting and analyzing paleoecological data. Over the past decades, a distributed network of CSDRs has emerged, each serving a particular suite of data and research communities, e.g. Neotoma Paleoecology Database, Paleobiology Database, International Tree Ring Database, NOAA NCEI for Paleoclimatology, Morphobank, iDigPaleo, and Integrated Earth Data Alliance. Recently, these groups have organized into a common Paleobiology Data Consortium dedicated to improving interoperability and sharing best practices and protocols. The Neotoma Paleoecology Database offers one example of an active and growing CSDR, designed to facilitate research into ecological and evolutionary dynamics during recent past global change. Neotoma combines a centralized database structure with distributed scientific governance via multiple virtual constituent data working groups. The Neotoma data model is flexible and can accommodate a variety of paleoecological proxies from many depositional contests. Data input into Neotoma is done by trained Data Stewards, drawn from their communities. Neotoma data can be searched, viewed, and returned to users through multiple interfaces, including the interactive Neotoma Explorer map interface, REST-ful Application Programming Interfaces (APIs), the neotoma R package, and the Tilia stratigraphic software. Neotoma is governed by geoscientists and provides community engagement through training workshops for data contributors, stewards, and users. Neotoma is engaged in the Paleobiological Data Consortium and other efforts to improve interoperability among cyberinfrastructure in the paleogeosciences.
NASA Astrophysics Data System (ADS)
Koppe, Roland; Scientific MaNIDA-Team
2013-04-01
The Marine Network for Integrated Data Access (MaNIDA) aims to build a sustainable e-infrastructure to support discovery and re-use of marine data from distinct data providers in Germany (see related abstracts in session ESSI 1.2). In order to provide users integrated access and retrieval of expedition or cruise metadata, data, services and publications as well as relationships among the various objects, we are developing (web) applications based on state of the art technologies: the Data Portal of German Marine Research. Since the German network of distributed content providers have distinct objectives and mandates for storing digital objects (e.g. long-term data preservation, near real time data, publication repositories), we have to cope with heterogeneous metadata in terms of syntax and semantic, data types and formats as well as access solutions. We have defined a set of core metadata elements which are common to our content providers and therefore useful for discovery and building relationships among objects. Existing catalogues for various types of vocabularies are being used to assure the mapping to community-wide used terms. We distinguish between expedition metadata and continuously harvestable metadata objects from distinct data providers. • Existing expedition metadata from distinct sources is integrated and validated in order to create an expedition metadata catalogue which is used as authoritative source for expedition-related content. The web application allows browsing by e.g. research vessel and date, exploring expeditions and research gaps by tracklines and viewing expedition details (begin/end, ports, platforms, chief scientists, events, etc.). Also expedition-related objects from harvesting are dynamically associated with expedition information and presented to the user. Hence we will provide web services to detailed expedition information. • Other harvestable content is separated into four categories: archived data and data products, near real time data, publications and reports. Reports are a special case of publication, describing cruise planning, cruise reports or popular reports on expeditions and are orthogonal to e.g. peer-reviewed articles. Each object's metadata contains at least: identifier(s) e.g. doi/hdl, title, author(s), date, expedition(s), platform(s) e.g. research vessel Polarstern. Furthermore project(s), parameter(s), device(s) and e.g. geographic coverage are of interest. An international gazetteer resolves geographic coverage to region names and annotates to object metadata. Information is homogenously presented to the user, independent of the underlying format, but adaptable to specific disciplines e.g. bathymetry. Also data access and dissemination information is available to the user as data download link or web services (e.g. WFS, WMS). Based on relationship metadata we are dynamically building graphs of objects to support the user in finding possible relevant associated objects. Technically metadata is based on ISO / OGC standards or provider specification. Metadata is harvested via OAI-PMH or OGC CSW and indexed with Apache Lucene. This enables powerful full-text search, geographic and temporal search as well as faceting. In this presentation we will illustrate the architecture and the current implementation of our integrated approach.
Wu, Tsung-Jung; Shamsaddini, Amirhossein; Pan, Yang; Smith, Krista; Crichton, Daniel J; Simonyan, Vahan; Mazumder, Raja
2014-01-01
Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators has led to a rich repository of information on functional sites of genes and proteins. This information along with variation-related annotation can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are also included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform HIVE (High-performance Integrated Virtual Environment) for storing, analyzing, computing and curating NGS data and associated metadata has been developed. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identifications of novel or common nsSNVs that can be prioritized in validation studies. Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu.
Identifying the Functional Requirements for an Arizona Astronomy Data Hub (AADH)
NASA Astrophysics Data System (ADS)
Stahlman, G.; Heidorn, P. B.
2015-12-01
Astronomy data represent a curation challenge for information managers, as well as for astronomers. Extracting knowledge from these heterogeneous and complex datasets is particularly complicated and requires both interdisciplinary and domain expertise to accomplish true curation, with an overall goal of facilitating reproducible science through discoverability and persistence. A group of researchers and professional staff at the University of Arizona held several meetings during the spring of 2015 about astronomy data and the role of the university in curation of that data. The group decided that it was critical to obtain a broader consensus on the needs of the community. With assistance from a Start for Success grant provided by the University of Arizona Office of Research and Discovery and funding from the American Astronomical Society (AAS), a workshop was held in early July 2015, with 28 participants plus 4 organizers in attendance. Representing University researchers as well as astronomical facilities and a scholarly society, the group verified that indeed there is a problem with the long-term curation of some astronomical data not associated with major facilities, and that a repository or "data hub" with the correct functionality could facilitate research and the preservation and use of astronomy data. The workshop members also identified a set of next steps, including the identification of possible data and metadata to be included in the Hub. The participants further helped to identify additional information that must be gathered before construction of the AADH could begin, including identifying significant datasets that do not currently have sufficient preservation and dissemination infrastructure, as well as some data associated with journal publications and the broader context of the data beyond that directly published in the journals. Workshop participants recommended that a set of grant proposal should be developed that ensures community buy-in and participation. The project should be developed in an agile, incremental manner that will allow consistent community growth from the early stages of the project, building on existing iPlant infrastructure (www.iplantcollaborative.org) initially developed for the biology community.
Creating preservation metadata from XML-metadata profiles
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Bertelmann, Roland; Gebauer, Petra; Hasler, Tim; Klump, Jens; Kirchner, Ingo; Peters-Kottig, Wolfgang; Mettig, Nora; Rusch, Beate
2014-05-01
Registration of dataset DOIs at DataCite makes research data citable and comes with the obligation to keep data accessible in the future. In addition, many universities and research institutions measure data that is unique and not repeatable like the data produced by an observational network and they want to keep these data for future generations. In consequence, such data should be ingested in preservation systems, that automatically care for file format changes. Open source preservation software that is developed along the definitions of the ISO OAIS reference model is available but during ingest of data and metadata there are still problems to be solved. File format validation is difficult, because format validators are not only remarkably slow - due to variety in file formats different validators return conflicting identification profiles for identical data. These conflicts are hard to resolve. Preservation systems have a deficit in the support of custom metadata. Furthermore, data producers are sometimes not aware that quality metadata is a key issue for the re-use of data. In the project EWIG an university institute and a research institute work together with Zuse-Institute Berlin, that is acting as an infrastructure facility, to generate exemplary workflows for research data into OAIS compliant archives with emphasis on the geosciences. The Institute for Meteorology provides timeseries data from an urban monitoring network whereas GFZ Potsdam delivers file based data from research projects. To identify problems in existing preservation workflows the technical work is complemented by interviews with data practitioners. Policies for handling data and metadata are developed. Furthermore, university teaching material is created to raise the future scientists awareness of research data management. As a testbed for ingest workflows the digital preservation system Archivematica [1] is used. During the ingest process metadata is generated that is compliant to the Metadata Encoding and Transmission Standard (METS). To find datasets in future portals and to make use of this data in own scientific work, proper selection of discovery metadata and application metadata is very important. Some XML-metadata profiles are not suitable for preservation, because version changes are very fast and make it nearly impossible to automate the migration. For other XML-metadata profiles schema definitions are changed after publication of the profile or the schema definitions become inaccessible, which might cause problems during validation of the metadata inside the preservation system [2]. Some metadata profiles are not used widely enough and might not even exist in the future. Eventually, discovery and application metadata have to be embedded into the mdWrap-subtree of the METS-XML. [1] http://www.archivematica.org [2] http://dx.doi.org/10.2218/ijdc.v7i1.215
NASA Technical Reports Server (NTRS)
Carnahan, Richard S., Jr.; Corey, Stephen M.; Snow, John B.
1989-01-01
Applications of rapid prototyping and Artificial Intelligence techniques to problems associated with Space Station-era information management systems are described. In particular, the work is centered on issues related to: (1) intelligent man-machine interfaces applied to scientific data user support, and (2) the requirement that intelligent information management systems (IIMS) be able to efficiently process metadata updates concerning types of data handled. The advanced IIMS represents functional capabilities driven almost entirely by the needs of potential users. Space Station-era scientific data projected to be generated is likely to be significantly greater than data currently processed and analyzed. Information about scientific data must be presented clearly, concisely, and with support features to allow users at all levels of expertise efficient and cost-effective data access. Additionally, mechanisms for allowing more efficient IIMS metadata update processes must be addressed. The work reported covers the following IIMS design aspects: IIMS data and metadata modeling, including the automatic updating of IIMS-contained metadata, IIMS user-system interface considerations, including significant problems associated with remote access, user profiles, and on-line tutorial capabilities, and development of an IIMS query and browse facility, including the capability to deal with spatial information. A working prototype has been developed and is being enhanced.
Java Library for Input and Output of Image Data and Metadata
NASA Technical Reports Server (NTRS)
Deen, Robert; Levoe, Steven
2003-01-01
A Java-language library supports input and output (I/O) of image data and metadata (label data) in the format of the Video Image Communication and Retrieval (VICAR) image-processing software and in several similar formats, including a subset of the Planetary Data System (PDS) image file format. The library does the following: It provides low-level, direct access layer, enabling an application subprogram to read and write specific image files, lines, or pixels, and manipulate metadata directly. Two coding/decoding subprograms ("codecs" for short) based on the Java Advanced Imaging (JAI) software provide access to VICAR and PDS images in a file-format-independent manner. The VICAR and PDS codecs enable any program that conforms to the specification of the JAI codec to use VICAR or PDS images automatically, without specific knowledge of the VICAR or PDS format. The library also includes Image I/O plugin subprograms for VICAR and PDS formats. Application programs that conform to the Image I/O specification of Java version 1.4 can utilize any image format for which such a plug-in subprogram exists, without specific knowledge of the format itself. Like the aforementioned codecs, the VICAR and PDS Image I/O plug-in subprograms support reading and writing of metadata.
NASA Astrophysics Data System (ADS)
Brown, Nicholas J.; Lloyd, David S.; Reynolds, Melvin I.; Plummer, David L.
2002-05-01
A visible digital image is rendered from a set of digital image data. Medical digital image data can be stored as either: (a) pre-rendered format, corresponding to a photographic print, or (b) un-rendered format, corresponding to a photographic negative. The appropriate image data storage format and associated header data (metadata) required by a user of the results of a diagnostic procedure recorded electronically depends on the task(s) to be performed. The DICOM standard provides a rich set of metadata that supports the needs of complex applications. Many end user applications, such as simple report text viewing and display of a selected image, are not so demanding and generic image formats such as JPEG are sometimes used. However, these are lacking some basic identification requirements. In this paper we make specific proposals for minimal extensions to generic image metadata of value in various domains, which enable safe use in the case of two simple healthcare end user scenarios: (a) viewing of text and a selected JPEG image activated by a hyperlink and (b) viewing of one or more JPEG images together with superimposed text and graphics annotation using a file specified by a profile of the ISO/IEC Basic Image Interchange Format (BIIF).
Eichenberg, David; Liebergesell, Mario; König-Ries, Birgitta; Wirth, Christian
2017-01-01
Ecology has become a data intensive science over the last decades which often relies on the reuse of data in cross-experimental analyses. However, finding data which qualifies for the reuse in a specific context can be challenging. It requires good quality metadata and annotations as well as efficient search strategies. To date, full text search (often on the metadata only) is the most widely used search strategy although it is known to be inaccurate. Faceted navigation is providing a filter mechanism which is based on fine granular metadata, categorizing search objects along numeric and categorical parameters relevant for their discovery. Selecting from these parameters during a full text search creates a system of filters which allows to refine and improve the results towards more relevance. We developed a framework for the efficient annotation and faceted navigation in ecology. It consists of an XML schema for storing the annotation of search objects and is accompanied by a vocabulary focused on ecology to support the annotation process. The framework consolidates ideas which originate from widely accepted metadata standards, textbooks, scientific literature, and vocabularies as well as from expert knowledge contributed by researchers from ecology and adjacent disciplines. PMID:29023519
Storing, Browsing, Querying, and Sharing Data: the THREDDS Data Repository (TDR)
NASA Astrophysics Data System (ADS)
Wilson, A.; Lindholm, D.; Baltzer, T.
2005-12-01
The Unidata Internet Data Distribution (IDD) network delivers gigabytes of data per day in near real time to sites across the U.S. and beyond. The THREDDS Data Server (TDS) supports public browsing of metadata and data access via OPeNDAP enabled URLs for datasets such as these. With such large quantities of data, sites generally employ a simple data management policy, keeping the data for a relatively short term on the order of hours to perhaps a week or two. In order to save interesting data in longer term storage and make it available for sharing, a user must move the data herself. In this case the user is responsible for determining where space is available, executing the data movement, generating any desired metadata, and setting access control to enable sharing. This task sequence is generally based on execution of a sequence of low level operating system specific commands with significant user involvement. The LEAD (Linked Environments for Atmospheric Discovery) project is building a cyberinfrastructure to support research and education in mesoscale meteorology. LEAD orchestrations require large, robust, and reliable storage with speedy access to stage data and store both intermediate and final results. These requirements suggest storage solutions that involve distributed storage, replication, and interfacing to archival storage systems such as mass storage systems and tape or removable disks. LEAD requirements also include metadata generation and access in order to support querying. In support of both THREDDS and LEAD requirements, Unidata is designing and prototyping the THREDDS Data Repository (TDR), a framework for a modular data repository to support distributed data storage and retrieval using a variety of back end storage media and interchangeable software components. The TDR interface will provide high level abstractions for long term storage, controlled, fast and reliable access, and data movement capabilities via a variety of technologies such as OPeNDAP and gridftp. The modular structure will allow substitution of software components so that both simple and complex storage media can be integrated into the repository. It will also allow integration of different varieties of supporting software. For example, if replication is desired, replica management could be handled via a simple hash table or a complex solution such as Replica Locater Service (RLS). In order to ensure that metadata is available for all the data in the repository, the TDR will also generate THREDDS metadata when necessary. Users will be able to establish levels of access control to their metadata and data. Coupled with a THREDDS Data Server, both browsing via THREDDS catalogs and querying capabilities will be supported. This presentation will describe the motivating factors, current status, and future plans of the TDR. References: IDD: http://www.unidata.ucar.edu/content/software/idd/index.html THREDDS: http://www.unidata.ucar.edu/content/projects/THREDDS/tech/server/ServerStatus.html LEAD: http://lead.ou.edu/ RLS: http://www.isi.edu/~annc/papers/chervenakRLSjournal05.pdf
MMI: Increasing Community Collaboration
NASA Astrophysics Data System (ADS)
Galbraith, N. R.; Stocks, K.; Neiswender, C.; Maffei, A.; Bermudez, L.
2007-12-01
Building community requires a collaborative environment and guidance to help move members towards a common goal. An effective environment for community collaboration is a workspace that fosters participation and cooperation; effective guidance furthers common understanding and promotes best practices. The Marine Metadata Interoperability (MMI) project has developed a community web site to provide a collaborative environment for scientists, technologists, and data managers from around the world to learn about metadata and exchange ideas. Workshops, demonstration projects, and presentations also provide community-building opportunities for MMI. MMI has developed comprehensive online guides to help users understand and work with metadata standards, ontologies, and other controlled vocabularies. Documents such as "The Importance of Metadata Standards", "Usage vs. Discovery Vocabularies" and "Developing Controlled Vocabularies" guide scientists and data managers through a variety of metadata-related concepts. Members from eight organizations involved in marine science and informatics collaborated on this effort. The MMI web site has moved from Plone to Drupal, two content management systems which provide different opportunities for community-based work. Drupal's "organic groups" feature will be used to provide workspace for future teams tasked with content development, outreach, and other MMI mission-critical work. The new site is designed to enable members to easily create working areas, to build communities dedicated to developing consensus on metadata and other interoperability issues. Controlled-vocabulary-driven menus, integrated mailing-lists, member-based content creation and review tools are facets of the new web site architecture. This move provided the challenge of developing a hierarchical vocabulary to describe the resources presented on the site; consistent and logical tagging of web pages is the basis of Drupal site navigation. The new MMI web site presents enhanced opportunities for electronic discussions, focused collaborative work, and even greater community participation. The MMI project is beginning a new initiative to comprehensively catalog and document tools for marine metadata. The new MMI community-based web site will be used to support this work and to support the work of other ad-hoc teams in the future. We are seeking broad input from the community on this effort.
Building an Internet of Samples: The Australian Contribution
NASA Astrophysics Data System (ADS)
Wyborn, Lesley; Klump, Jens; Bastrakova, Irina; Devaraju, Anusuriya; McInnes, Brent; Cox, Simon; Karssies, Linda; Martin, Julia; Ross, Shawn; Morrissey, John; Fraser, Ryan
2017-04-01
Physical samples are often the ground truth to research reported in the scientific literature across multiple domains. They are collected by many different entities (individual researchers, laboratories, government agencies, mining companies, citizens, museums, etc.). Samples must be curated over the long-term to ensure both that their existence is known, and to allow any data derived from them through laboratory and field tests to be linked to the physical samples. For example, having unique identifiers that link back ground truth data on the original sample helps calibrate large volumes of remotely sensed data. Access to catalogues of reliably identified samples from several collections promotes collaboration across all Earth Science disciplines. It also increases the cost effectiveness of research by reducing the need to re-collect samples in the field. The assignment of web identifiers to the digital representations of these physical objects allows us to link to data, literature, investigators and institutions, thus creating an "Internet of Samples". An Australian implementation of the "Internet of Samples" is using the IGSN (International Geo Sample Number, http://igsn.github.io) to identify samples in a globally unique and persistent way. IGSN was developed in the solid earth science community and is recommended for sample identification by the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS). IGSN is interoperable with other persistent identifier systems such as DataCite. Furthermore, the basic IGSN description metadata schema is compatible with existing schemas such as OGC Observations and Measurements (O&M) and DataCite Metadata Schema which makes crosswalks to other metadata schemas easy. IGSN metadata is disseminated through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) allowing it to be aggregated in other applications such as portals (e.g. the Australian IGSN catalogue http://igsn2.csiro.au). The metadata is available in more than one format. The software for IGSN web services is based on components developed for DataCite and adapted to the specific requirements of IGSN. This cooperation in open source development ensures sustainable implementation and faster turnaround times for updates. IGSN, in particular in its Australian implementation, is characterised by a federated approach to system architecture and organisational governance giving it the necessary flexibility to adapt to particular local practices within multiple domains, whilst maintaining an overarching international standard. The three current IGSN allocation agents in Australia: Geoscience Australia, CSIRO and Curtin University, represent different sectors. Through funding from the Australian Research Data Services Program they have combined to develop a common web portal that allows discovery of physical samples and sample collections at a national level.International governance then ensures we can link to an international community but at the same time act locally to ensure the services offered are relevant to the needs of Australian researchers. This flexibility aids the integration of new disciplines into a global community of a physical samples information network.
Koleti, Amar; Terryn, Raymond; Stathias, Vasileios; Chung, Caty; Cooper, Daniel J; Turner, John P; Vidović, Dušica; Forlin, Michele; Kelley, Tanya T; D’Urso, Alessandro; Allen, Bryce K; Torre, Denis; Jagodnik, Kathleen M; Wang, Lily; Jenkins, Sherry L; Mader, Christopher; Niu, Wen; Fazel, Mehdi; Mahi, Naim; Pilarczyk, Marcin; Clark, Nicholas; Shamsaei, Behrouz; Meller, Jarek; Vasiliauskas, Juozas; Reichard, John; Medvedovic, Mario; Ma’ayan, Avi; Pillai, Ajay
2018-01-01
Abstract The Library of Integrated Network-based Cellular Signatures (LINCS) program is a national consortium funded by the NIH to generate a diverse and extensive reference library of cell-based perturbation-response signatures, along with novel data analytics tools to improve our understanding of human diseases at the systems level. In contrast to other large-scale data generation efforts, LINCS Data and Signature Generation Centers (DSGCs) employ a wide range of assay technologies cataloging diverse cellular responses. Integration of, and unified access to LINCS data has therefore been particularly challenging. The Big Data to Knowledge (BD2K) LINCS Data Coordination and Integration Center (DCIC) has developed data standards specifications, data processing pipelines, and a suite of end-user software tools to integrate and annotate LINCS-generated data, to make LINCS signatures searchable and usable for different types of users. Here, we describe the LINCS Data Portal (LDP) (http://lincsportal.ccs.miami.edu/), a unified web interface to access datasets generated by the LINCS DSGCs, and its underlying database, LINCS Data Registry (LDR). LINCS data served on the LDP contains extensive metadata and curated annotations. We highlight the features of the LDP user interface that is designed to enable search, browsing, exploration, download and analysis of LINCS data and related curated content. PMID:29140462
Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.
Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin
2016-01-01
First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All Data are available as described in the supplementary material.
Large-Scale Data Collection Metadata Management at the National Computation Infrastructure
NASA Astrophysics Data System (ADS)
Wang, J.; Evans, B. J. K.; Bastrakova, I.; Ryder, G.; Martin, J.; Duursma, D.; Gohar, K.; Mackey, T.; Paget, M.; Siddeswara, G.
2014-12-01
Data Collection management has become an essential activity at the National Computation Infrastructure (NCI) in Australia. NCI's partners (CSIRO, Bureau of Meteorology, Australian National University, and Geoscience Australia), supported by the Australian Government and Research Data Storage Infrastructure (RDSI), have established a national data resource that is co-located with high-performance computing. This paper addresses the metadata management of these data assets over their lifetime. NCI manages 36 data collections (10+ PB) categorised as earth system sciences, climate and weather model data assets and products, earth and marine observations and products, geosciences, terrestrial ecosystem, water management and hydrology, astronomy, social science and biosciences. The data is largely sourced from NCI partners, the custodians of many of the national scientific records, and major research community organisations. The data is made available in a HPC and data-intensive environment - a ~56000 core supercomputer, virtual labs on a 3000 core cloud system, and data services. By assembling these large national assets, new opportunities have arisen to harmonise the data collections, making a powerful cross-disciplinary resource.To support the overall management, a Data Management Plan (DMP) has been developed to record the workflows, procedures, the key contacts and responsibilities. The DMP has fields that can be exported to the ISO19115 schema and to the collection level catalogue of GeoNetwork. The subset or file level metadata catalogues are linked with the collection level through parent-child relationship definition using UUID. A number of tools have been developed that support interactive metadata management, bulk loading of data, and support for computational workflows or data pipelines. NCI creates persistent identifiers for each of the assets. The data collection is tracked over its lifetime, and the recognition of the data providers, data owners, data generators and data aggregators are updated. A Digital Object Identifier is assigned using the Australian National Data Service (ANDS). Once the data has been quality assured, a DOI is minted and the metadata record updated. NCI's data citation policy establishes the relationship between research outcomes, data providers, and the data.
EnviroAtlas Tree Cover Configuration and Connectivity, Water Background Web Service
This EnviroAtlas web service supports research and online mapping activities related to EnviroAtlas (https://www.epa.gov/enviroatlas). The 1-meter resolution tree cover configuration and connectivity map categorizes tree cover into structural elements (e.g. core, edge, connector, etc.). Source imagery varies by community. For specific information about methods and accuracy of each community's tree cover configuration and connectivity classification, consult their individual metadata records: Austin, TX (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B29D2B039-905C-4825-B0B4-9315122D6A9F%7D); Cleveland, OH (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B03cd54e1-4328-402e-ba75-e198ea9fbdc7%7D); Des Moines, IA (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B350A83E6-10A2-4D5D-97E6-F7F368D268BB%7D); Durham, NC (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7BC337BA5F-8275-4BA8-9647-F63C443F317D%7D); Fresno, CA (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B84B98749-9C1C-4679-AE24-9B9C0998EBA5%7D); Green Bay, WI (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B69E48A44-3D30-4E84-A764-38FBDCCAC3D0%7D); Memphis, TN (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7BB7313ADA-04F7-4D80-ABBA-77E753AAD002%7D); Milwaukee, WI (https://edg.epa.gov/metadata/catalog/search/resource/details.page?u
Towards Making Data Bases Practical for use in the Field
NASA Astrophysics Data System (ADS)
Fischer, T. P.; Lehnert, K. A.; Chiodini, G.; McCormick, B.; Cardellini, C.; Clor, L. E.; Cottrell, E.
2014-12-01
Geological, geochemical, and geophysical research is often field based with travel to remote areas and collection of samples and data under challenging environmental conditions. Cross-disciplinary investigations would greatly benefit from near real-time data access and visualisation within the existing framework of databases and GIS tools. An example of complex, interdisciplinary field-based and data intensive investigations is that of volcanologists and gas geochemists, who sample gases from fumaroles, hot springs, dry gas vents, hydrothermal vents and wells. Compositions of volcanic gas plumes are measured directly or by remote sensing. Soil gas fluxes from volcanic areas are measured by accumulation chamber and involve hundreds of measurements to calculate the total emission of a region. Many investigators also collect rock samples from recent or ancient volcanic eruptions. Structural, geochronological, and geophysical data collected during the same or related field campaigns complement these emissions data. All samples and data collected in the field require a set of metadata including date, time, location, sample or measurement id, and descriptive comments. Currently, most of these metadata are written in field notebooks and later transferred into a digital format. Final results such as laboratory analyses of samples and calculated flux data are tabulated for plotting, correlation with other types of data, modeling and finally publication and presentation. Data handling, organization and interpretation could be greatly streamlined by using digital tools available in the field to record metadata, assign an International Geo Sample Number (IGSN), upload measurements directly from field instruments, and arrange sample curation. Available data display tools such as GeoMapApp and existing data sets (PetDB, IRIS, UNAVCO) could be integrated to direct locations for additional measurements during a field campaign. Nearly live display of sampling locations, pictures, and comments could be used as an educational and outreach tool during sampling expeditions. Achieving these goals requires the integration of existing online data resources, with common access through a dedicated web portal.
Park, Yu Rang; Yoon, Young Jo; Jang, Tae Hun; Seo, Hwa Jeong; Kim, Ju Han
2014-01-01
Extension of the standard model while retaining compliance with it is a challenging issue because there is currently no method for semantically or syntactically verifying an extended data model. A metadata-based extended model, named CCR+, was designed and implemented to achieve interoperability between standard and extended models. Furthermore, a multilayered validation method was devised to validate the standard and extended models. The American Society for Testing and Materials (ASTM) Community Care Record (CCR) standard was selected to evaluate the CCR+ model; two CCR and one CCR+ XML files were evaluated. In total, 188 metadata were extracted from the ASTM CCR standard; these metadata are semantically interconnected and registered in the metadata registry. An extended-data-model-specific validation file was generated from these metadata. This file can be used in a smartphone application (Health Avatar CCR+) as a part of a multilayered validation. The new CCR+ model was successfully evaluated via a patient-centric exchange scenario involving multiple hospitals, with the results supporting both syntactic and semantic interoperability between the standard CCR and extended, CCR+, model. A feasible method for delivering an extended model that complies with the standard model is presented herein. There is a great need to extend static standard models such as the ASTM CCR in various domains: the methods presented here represent an important reference for achieving interoperability between standard and extended models.
Evolving the Living With a Star Data System Definition
NASA Astrophysics Data System (ADS)
Otranto, J. F.; Dijoseph, M.
2003-12-01
NASA's Living With a Star (LWS) Program is a space weather-focused and applications-driven research program. The LWS Program is soliciting input from the solar, space physics, space weather, and climate science communities to develop a system that enables access to science data associated with these disciplines, and advances the development of discipline and interdisciplinary findings. The LWS Program will implement a data system that builds upon the existing and planned data capture, processing, and storage components put in place by individual spacecraft missions and also inter-project data management systems, including active and deep archives, and multi-mission data repositories. It is technically feasible for the LWS Program to integrate data from a broad set of resources, assuming they are either publicly accessible or allow access by permission. The LWS Program data system will work in coordination with spacecraft mission data systems and science data repositories, integrating their holdings using a common metadata representation. This common representation relies on a robust metadata definition that provides journalistic and technical data descriptions, plus linkages to supporting data products and tools. The LWS Program intends to become an enabling resource to PIs, interdisciplinary scientists, researchers, and students facilitating both access to a broad collection of science data, as well as the necessary supporting components to understand and make productive use of these data. For the LWS Program to represent science data that are physically distributed across various ground system elements, information will be collected about these distributed data products through a series of LWS Program-created agents. These agents will be customized to interface or interact with each one of these data systems, collect information, and forward any new metadata records to a LWS Program-developed metadata library. A populated LWS metadata library will function as a single point-of-contact that serves the entire science community as a first stop for data availability, whether or not science data are physically stored in an LWS-operated repository. Further, this metadata library will provide the user access to information for understanding these data including descriptions of the associated spacecraft and instrument, data format, calibration and operations issues, links to ancillary and correlative data products, links to processing tools and models associated with these data, and any corresponding findings produced using these data. The LWS may also support an active archive for solar, space physics, space weather, and climate data when these data would otherwise be discarded or archived off-line. This archive could potentially serve also as a data storage backup facility for LWS missions. The plan for the LWS Program metadata library is developed based upon input received from the solar and geospace science communities; the library's architecture is based on existing systems developed for serving science metadata. The LWS Program continues to seek constructive input from the science community, examples of both successes and failures in dealing with science data systems, and insights regarding the obstacles between the current state-of-the-practice and this vision for the LWS Program metadata library.
Lindsköld, Lars; Wintell, Mikael; Edgren, Lars; Aspelin, Peter; Lundberg, Nina
2013-07-01
Challenges related to the cross-organizational access of accurate and timely information about a patient's condition has become a critical issue in healthcare. Interoperability of different local sources is necessary. To identify and present missing and semantically incorrect data elements of metadata in the radiology enterprise service that supports cross-organizational sharing of dynamic information about patients' visits, in the Region Västra Götaland, Sweden. Quantitative data elements of metadata were collected yearly from the first Wednesday in March from 2006 to 2011 from the 24 in-house radiology departments in Region Västra Götaland. These radiology departments were organized into four hospital groups and three stand-alone hospitals. Included data elements of metadata were the patient name, patient ID, institutional department name, referring physician's name, and examination description. The majority of missing data elements of metadata was related to the institutional department name for Hospital 2, from 87% in 2007 to 25% in 2011. All data elements of metadata except the patient ID contained semantic errors. For example, for the data element "patient name", only three names out of 3537 were semantically correct. This study shows that the semantics of metadata elements are poorly structured and inconsistently used. Although a cross-organizational solution may technically be fully functional, semantic errors may prevent it from serving as an information infrastructure for collaboration between all departments and hospitals in the region. For interoperability, it is important that the agreed semantic models are implemented in vendor systems using the information infrastructure.
Triage by ranking to support the curation of protein interactions
Pasche, Emilie; Gobeill, Julien; Rech de Laval, Valentine; Gleizes, Anne; Michel, Pierre-André; Bairoch, Amos
2017-01-01
Abstract Today, molecular biology databases are the cornerstone of knowledge sharing for life and health sciences. The curation and maintenance of these resources are labour intensive. Although text mining is gaining impetus among curators, its integration in curation workflow has not yet been widely adopted. The Swiss Institute of Bioinformatics Text Mining and CALIPHO groups joined forces to design a new curation support system named nextA5. In this report, we explore the integration of novel triage services to support the curation of two types of biological data: protein–protein interactions (PPIs) and post-translational modifications (PTMs). The recognition of PPIs and PTMs poses a special challenge, as it not only requires the identification of biological entities (proteins or residues), but also that of particular relationships (e.g. binding or position). These relationships cannot be described with onto-terminological descriptors such as the Gene Ontology for molecular functions, which makes the triage task more challenging. Prioritizing papers for these tasks thus requires the development of different approaches. In this report, we propose a new method to prioritize articles containing information specific to PPIs and PTMs. The new resources (RESTful APIs, semantically annotated MEDLINE library) enrich the neXtA5 platform. We tuned the article prioritization model on a set of 100 proteins previously annotated by the CALIPHO group. The effectiveness of the triage service was tested with a dataset of 200 annotated proteins. We defined two sets of descriptors to support automatic triage: the first set to enrich for papers with PPI data, and the second for PTMs. All occurrences of these descriptors were marked-up in MEDLINE and indexed, thus constituting a semantically annotated version of MEDLINE. These annotations were then used to estimate the relevance of a particular article with respect to the chosen annotation type. This relevance score was combined with a local vector-space search engine to generate a ranked list of PMIDs. We also evaluated a query refinement strategy, which adds specific keywords (such as ‘binds’ or ‘interacts’) to the original query. Compared to PubMed, the search effectiveness of the nextA5 triage service is improved by 190% for the prioritization of papers with PPIs information and by 260% for papers with PTMs information. Combining advanced retrieval and query refinement strategies with automatically enriched MEDLINE contents is effective to improve triage in complex curation tasks such as the curation of protein PPIs and PTMs. Database URL: http://candy.hesge.ch/nextA5 PMID:29220432
NASA Technical Reports Server (NTRS)
Todd, N. S.; Evans, C.
2015-01-01
The Astromaterials Acquisition and Curation Office at NASA's Johnson Space Center (JSC) is the designated facility for curating all of NASA's extraterrestrial samples. The suite of collections includes the lunar samples from the Apollo missions, cosmic dust particles falling into the Earth's atmosphere, meteorites collected in Antarctica, comet and interstellar dust particles from the Stardust mission, asteroid particles from the Japanese Hayabusa mission, and solar wind atoms collected during the Genesis mission. To support planetary science research on these samples, NASA's Astromaterials Curation Office hosts the Astromaterials Curation Digital Repository, which provides descriptions of the missions and collections, and critical information about each individual sample. Our office is implementing several informatics initiatives with the goal of better serving the planetary research community. One of these initiatives aims to increase the availability and discoverability of sample data and images through the use of a newly designed common architecture for Astromaterials Curation databases.
Using social media to support small group learning.
Cole, Duncan; Rengasamy, Emma; Batchelor, Shafqat; Pope, Charles; Riley, Stephen; Cunningham, Anne Marie
2017-11-10
Medical curricula are increasingly using small group learning and less didactic lecture-based teaching. This creates new challenges and opportunities in how students are best supported with information technology. We explored how university-supported and external social media could support collaborative small group working on our new undergraduate medical curriculum. We made available a curation platform (Scoop.it) and a wiki within our virtual learning environment as part of year 1 Case-Based Learning, and did not discourage the use of other tools such as Facebook. We undertook student surveys to capture perceptions of the tools and information on how they were used, and employed software user metrics to explore the extent to which they were used during the year. Student groups developed a preferred way of working early in the course. Most groups used Facebook to facilitate communication within the group, and to host documents and notes. There were more barriers to using the wiki and curation platform, although some groups did make extensive use of them. Staff engagement was variable, with some tutors reviewing the content posted on the wiki and curation platform in face-to-face sessions, but not outside these times. A small number of staff posted resources and reviewed student posts on the curation platform. Optimum use of these tools depends on sufficient training of both staff and students, and an opportunity to practice using them, with ongoing support. The platforms can all support collaborative learning, and may help develop digital literacy, critical appraisal skills, and awareness of wider health issues in society.
NASA Astrophysics Data System (ADS)
Accomazzi, Alberto; Kurtz, M. J.; Henneken, E. A.; Grant, C. S.; Thompson, D.; Luker, J.; Chyla, R.; Murray, S. S.
2014-01-01
In the spring of 1993, the Smithsonian/NASA Astrophysics Data System (ADS) first launched its bibliographic search system. It was known then as the ADS Abstract Service, a component of the larger Astrophysics Data System effort which had developed an interoperable data system now seen as a precursor of the Virtual Observatory. As a result of the massive technological and sociological changes in the field of scholarly communication, the ADS is now completing the most ambitious technological upgrade in its twenty-year history. Code-named ADS 2.0, the new system features: an IT platform built on web and digital library standards; a new, extensible, industrial strength search engine; a public API with various access control capabilities; a set of applications supporting search, export, visualization, analysis; a collaborative, open source development model; and enhanced indexing of content which includes the full-text of astronomy and physics publications. The changes in the ADS platform affect all aspects of the system and its operations, including: the process through which data and metadata are harvested, curated and indexed; the interface and paradigm used for searching the database; and the follow-up analysis capabilities available to the users. This poster describes the choices behind the technical overhaul of the system, the technology stack used, and the opportunities which the upgrade is providing us with, namely gains in productivity and enhancements in our system capabilities.
Bridging the gap between data, publications, and images
NASA Astrophysics Data System (ADS)
Ritchey, N. A.; Collins, D.; Sprain, M.
2017-12-01
NOAA's National Centers for Environmental Information (NCEI) manages the most comprehensive, accessible, and trusted source of environmental data and information in the US. It archives data from the depths of the ocean to the surface of the sun and from million-year-old sediment records to near real-time satellite observations. NCEI has a wealth of knowledge and experience in long-term data preservation with the goal of supporting today's scientists as well as future generations. In order to reduce fragmentation of data, publications, images, and documentation, and to improve preservation, curation, and stewardship of data, NCEI continues to partner with the NOAA Central Library (NCL). NCEI and NCL have long-established linkages between data metadata, published reports, and data or archival information packages (AIP). We also have analog AIPs that are stored and maintained in the NCL collection and discoverable in both NCEI and NCL collections via the AIP identifier. We are currently working with NCL to establish a workflow for submitting reports to their Institutional Repository and linking the data and report via digital object identifiers. We hope to establish linkages between images of physical samples and the NCL Photo Collection management infrastructure in the future. This presentation will detail how NCEI engages with the NCL in order to fully integrate documentation, images, publications, and data in preservation practices and improve the discovery and usability of NOAA's billion dollar investment in environmental data and information.
The Modeling and Simulation Catalog for Discovery, Knowledge and Reuse
NASA Technical Reports Server (NTRS)
Stone, George F. III; Greenberg, Brandi; Daehler-Wilking, Richard; Hunt, Steven
2011-01-01
The DoD M&S Steering Committee has noted that the current DoD and Service's modeling and simulation resource repository (MSRR) services are not up-to-date limiting their value to the using communities. However, M&S leaders and managers also determined that the Department needs a functional M&S registry card catalog to facilitate M&S tool and data visibility to support M&S activities across the DoD. The M&S Catalog will discover and access M&S metadata maintained at nodes distributed across DoD networks in a centrally managed, decentralized process that employs metadata collection and management. The intent is to link information stores, precluding redundant location updating. The M&S Catalog uses a standard metadata schemas based on the DoD's Net-Centric Data Strategy Community of Interest metadata specification. The Air Force, Navy and OSD (CAPE) have provided initial information to participating DoD nodes, but plans on the horizon are being made to bring in hundreds of source providers.
Page, Roderic D M
2011-05-23
The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive. A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site http://biostor.org/openurl/. This resolver can be used on the web, or called by bibliographic tools that support OpenURL. BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from http://biostor.org/.
NASA Astrophysics Data System (ADS)
Mitchell, A. E.; Lowe, D. R.; Murphy, K. J.; Ramapriyan, H. K.
2011-12-01
Initiated in 1990, NASA's Earth Observing System Data and Information System (EOSDIS) is currently a petabyte-scale archive of data designed to receive, process, distribute and archive several terabytes of science data per day from NASA's Earth science missions. Comprised of 12 discipline specific data centers collocated with centers of science discipline expertise, EOSDIS manages over 6800 data products from many science disciplines and sources. NASA supports global climate change research by providing scalable open application layers to the EOSDIS distributed information framework. This allows many other value-added services to access NASA's vast Earth Science Collection and allows EOSDIS to interoperate with data archives from other domestic and international organizations. EOSDIS is committed to NASA's Data Policy of full and open sharing of Earth science data. As metadata is used in all aspects of NASA's Earth science data lifecycle, EOSDIS provides a spatial and temporal metadata registry and order broker called the EOS Clearing House (ECHO) that allows efficient search and access of cross domain data and services through the Reverb Client and Application Programmer Interfaces (APIs). Another core metadata component of EOSDIS is NASA's Global Change Master Directory (GCMD) which represents more than 25,000 Earth science data set and service descriptions from all over the world, covering subject areas within the Earth and environmental sciences. With inputs from the ECHO, GCMD and Soil Moisture Active Passive (SMAP) mission metadata models, EOSDIS is developing a NASA ISO 19115 Best Practices Convention. Adoption of an international metadata standard enables a far greater level of interoperability among national and international data products. NASA recently concluded a 'Metadata Harmony Study' of EOSDIS metadata capabilities/processes of ECHO and NASA's Global Change Master Directory (GCMD), to evaluate opportunities for improved data access and use, reduce efforts by data providers and improve metadata integrity. The result was a recommendation for EOSDIS to develop a 'Common Metadata Repository (CMR)' to manage the evolution of NASA Earth Science metadata in a unified and consistent way by providing a central storage and access capability that streamlines current workflows while increasing overall data quality and anticipating future capabilities. For applications users interested in monitoring and analyzing a wide variety of natural and man-made phenomena, EOSDIS provides access to near real-time products from the MODIS, OMI, AIRS, and MLS instruments in less than 3 hours from observation. To enable interactive exploration of NASA's Earth imagery, EOSDIS is developing a set of standard services to deliver global, full-resolution satellite imagery in a highly responsive manner. EOSDIS is also playing a lead role in the development of the CEOS WGISS Integrated Catalog (CWIC), which provides search and access to holdings of participating international data providers. EOSDIS provides a platform to expose and share information on NASA Earth science tools and data via Earthdata.nasa.gov while offering a coherent and interoperable system for the NASA Earth Science Data System (ESDS) Program.
NASA Astrophysics Data System (ADS)
Mitchell, A. E.; Lowe, D. R.; Murphy, K. J.; Ramapriyan, H. K.
2013-12-01
Initiated in 1990, NASA's Earth Observing System Data and Information System (EOSDIS) is currently a petabyte-scale archive of data designed to receive, process, distribute and archive several terabytes of science data per day from NASA's Earth science missions. Comprised of 12 discipline specific data centers collocated with centers of science discipline expertise, EOSDIS manages over 6800 data products from many science disciplines and sources. NASA supports global climate change research by providing scalable open application layers to the EOSDIS distributed information framework. This allows many other value-added services to access NASA's vast Earth Science Collection and allows EOSDIS to interoperate with data archives from other domestic and international organizations. EOSDIS is committed to NASA's Data Policy of full and open sharing of Earth science data. As metadata is used in all aspects of NASA's Earth science data lifecycle, EOSDIS provides a spatial and temporal metadata registry and order broker called the EOS Clearing House (ECHO) that allows efficient search and access of cross domain data and services through the Reverb Client and Application Programmer Interfaces (APIs). Another core metadata component of EOSDIS is NASA's Global Change Master Directory (GCMD) which represents more than 25,000 Earth science data set and service descriptions from all over the world, covering subject areas within the Earth and environmental sciences. With inputs from the ECHO, GCMD and Soil Moisture Active Passive (SMAP) mission metadata models, EOSDIS is developing a NASA ISO 19115 Best Practices Convention. Adoption of an international metadata standard enables a far greater level of interoperability among national and international data products. NASA recently concluded a 'Metadata Harmony Study' of EOSDIS metadata capabilities/processes of ECHO and NASA's Global Change Master Directory (GCMD), to evaluate opportunities for improved data access and use, reduce efforts by data providers and improve metadata integrity. The result was a recommendation for EOSDIS to develop a 'Common Metadata Repository (CMR)' to manage the evolution of NASA Earth Science metadata in a unified and consistent way by providing a central storage and access capability that streamlines current workflows while increasing overall data quality and anticipating future capabilities. For applications users interested in monitoring and analyzing a wide variety of natural and man-made phenomena, EOSDIS provides access to near real-time products from the MODIS, OMI, AIRS, and MLS instruments in less than 3 hours from observation. To enable interactive exploration of NASA's Earth imagery, EOSDIS is developing a set of standard services to deliver global, full-resolution satellite imagery in a highly responsive manner. EOSDIS is also playing a lead role in the development of the CEOS WGISS Integrated Catalog (CWIC), which provides search and access to holdings of participating international data providers. EOSDIS provides a platform to expose and share information on NASA Earth science tools and data via Earthdata.nasa.gov while offering a coherent and interoperable system for the NASA Earth Science Data System (ESDS) Program.
The curation of genetic variants: difficulties and possible solutions.
Pandey, Kapil Raj; Maden, Narendra; Poudel, Barsha; Pradhananga, Sailendra; Sharma, Amit Kumar
2012-12-01
The curation of genetic variants from biomedical articles is required for various clinical and research purposes. Nowadays, establishment of variant databases that include overall information about variants is becoming quite popular. These databases have immense utility, serving as a user-friendly information storehouse of variants for information seekers. While manual curation is the gold standard method for curation of variants, it can turn out to be time-consuming on a large scale thus necessitating the need for automation. Curation of variants described in biomedical literature may not be straightforward mainly due to various nomenclature and expression issues. Though current trends in paper writing on variants is inclined to the standard nomenclature such that variants can easily be retrieved, we have a massive store of variants in the literature that are present as non-standard names and the online search engines that are predominantly used may not be capable of finding them. For effective curation of variants, knowledge about the overall process of curation, nature and types of difficulties in curation, and ways to tackle the difficulties during the task are crucial. Only by effective curation, can variants be correctly interpreted. This paper presents the process and difficulties of curation of genetic variants with possible solutions and suggestions from our work experience in the field including literature support. The paper also highlights aspects of interpretation of genetic variants and the importance of writing papers on variants following standard and retrievable methods. Copyright © 2012. Published by Elsevier Ltd.
The Curation of Genetic Variants: Difficulties and Possible Solutions
Pandey, Kapil Raj; Maden, Narendra; Poudel, Barsha; Pradhananga, Sailendra; Sharma, Amit Kumar
2012-01-01
The curation of genetic variants from biomedical articles is required for various clinical and research purposes. Nowadays, establishment of variant databases that include overall information about variants is becoming quite popular. These databases have immense utility, serving as a user-friendly information storehouse of variants for information seekers. While manual curation is the gold standard method for curation of variants, it can turn out to be time-consuming on a large scale thus necessitating the need for automation. Curation of variants described in biomedical literature may not be straightforward mainly due to various nomenclature and expression issues. Though current trends in paper writing on variants is inclined to the standard nomenclature such that variants can easily be retrieved, we have a massive store of variants in the literature that are present as non-standard names and the online search engines that are predominantly used may not be capable of finding them. For effective curation of variants, knowledge about the overall process of curation, nature and types of difficulties in curation, and ways to tackle the difficulties during the task are crucial. Only by effective curation, can variants be correctly interpreted. This paper presents the process and difficulties of curation of genetic variants with possible solutions and suggestions from our work experience in the field including literature support. The paper also highlights aspects of interpretation of genetic variants and the importance of writing papers on variants following standard and retrievable methods. PMID:23317699
A Metadata Management Framework for Collaborative Review of Science Data Products
NASA Astrophysics Data System (ADS)
Hart, A. F.; Cinquini, L.; Mattmann, C. A.; Thompson, D. R.; Wagstaff, K.; Zimdars, P. A.; Jones, D. L.; Lazio, J.; Preston, R. A.
2012-12-01
Data volumes generated by modern scientific instruments often preclude archiving the complete observational record. To compensate, science teams have developed a variety of "triage" techniques for identifying data of potential scientific interest and marking it for prioritized processing or permanent storage. This may involve multiple stages of filtering with both automated and manual components operating at different timescales. A promising approach exploits a fast, fully automated first stage followed by a more reliable offline manual review of candidate events. This hybrid approach permits a 24-hour rapid real-time response while also preserving the high accuracy of manual review. To support this type of second-level validation effort, we have developed a metadata-driven framework for the collaborative review of candidate data products. The framework consists of a metadata processing pipeline and a browser-based user interface that together provide a configurable mechanism for reviewing data products via the web, and capturing the full stack of associated metadata in a robust, searchable archive. Our system heavily leverages software from the Apache Object Oriented Data Technology (OODT) project, an open source data integration framework that facilitates the construction of scalable data systems and places a heavy emphasis on the utilization of metadata to coordinate processing activities. OODT provides a suite of core data management components for file management and metadata cataloging that form the foundation for this effort. The system has been deployed at JPL in support of the V-FASTR experiment [1], a software-based radio transient detection experiment that operates commensally at the Very Long Baseline Array (VLBA), and has a science team that is geographically distributed across several countries. Daily review of automatically flagged data is a shared responsibility for the team, and is essential to keep the project within its resource constraints. We describe the development of the platform using open source software, and discuss our experience deploying the system operationally. [1] R.B.Wayth,W.F.Brisken,A.T.Deller,W.A.Majid,D.R.Thompson, S. J. Tingay, and K. L. Wagstaff, "V-fastr: The vlba fast radio transients experiment," The Astrophysical Journal, vol. 735, no. 2, p. 97, 2011. Acknowledgement: This effort was supported by the Jet Propulsion Laboratory, managed by the California Institute of Technology under a contract with the National Aeronautics and Space Administration.
Annotation of phenotypic diversity: decoupling data curation and ontology curation using Phenex.
Balhoff, James P; Dahdul, Wasila M; Dececchi, T Alexander; Lapp, Hilmar; Mabee, Paula M; Vision, Todd J
2014-01-01
Phenex (http://phenex.phenoscape.org/) is a desktop application for semantically annotating the phenotypic character matrix datasets common in evolutionary biology. Since its initial publication, we have added new features that address several major bottlenecks in the efficiency of the phenotype curation process: allowing curators during the data curation phase to provisionally request terms that are not yet available from a relevant ontology; supporting quality control against annotation guidelines to reduce later manual review and revision; and enabling the sharing of files for collaboration among curators. We decoupled data annotation from ontology development by creating an Ontology Request Broker (ORB) within Phenex. Curators can use the ORB to request a provisional term for use in data annotation; the provisional term can be automatically replaced with a permanent identifier once the term is added to an ontology. We added a set of annotation consistency checks to prevent common curation errors, reducing the need for later correction. We facilitated collaborative editing by improving the reliability of Phenex when used with online folder sharing services, via file change monitoring and continual autosave. With the addition of these new features, and in particular the Ontology Request Broker, Phenex users have been able to focus more effectively on data annotation. Phenoscape curators using Phenex have reported a smoother annotation workflow, with much reduced interruptions from ontology maintenance and file management issues.
Improving Earth Science Metadata: Modernizing ncISO
NASA Astrophysics Data System (ADS)
O'Brien, K.; Schweitzer, R.; Neufeld, D.; Burger, E. F.; Signell, R. P.; Arms, S. C.; Wilcox, K.
2016-12-01
ncISO is a package of tools developed at NOAA's National Center for Environmental Information (NCEI) that facilitates the generation of ISO 19115-2 metadata from NetCDF data sources. The tool currently exists in two iterations: a command line utility and a web-accessible service within the THREDDS Data Server (TDS). Several projects, including NOAA's Unified Access Framework (UAF), depend upon ncISO to generate the ISO-compliant metadata from their data holdings and use the resulting information to populate discovery tools such as NCEI's ESRI Geoportal and NOAA's data.noaa.gov CKAN system. In addition to generating ISO 19115-2 metadata, the tool calculates a rubric score based on how well the dataset follows the Attribute Conventions for Dataset Discovery (ACDD). The result of this rubric calculation, along with information about what has been included and what is missing is displayed in an HTML document generated by the ncISO software package. Recently ncISO has fallen behind in terms of supporting updates to conventions such updates to the ACDD. With the blessing of the original programmer, NOAA's UAF has been working to modernize the ncISO software base. In addition to upgrading ncISO to utilize version1.3 of the ACDD, we have been working with partners at Unidata and IOOS to unify the tool's code base. In essence, we are merging the command line capabilities into the same software that will now be used by the TDS service, allowing easier updates when conventions such as ACDD are updated in the future. In this presentation, we will discuss the work the UAF project has done to support updated conventions within ncISO, as well as describe how the updated tool is helping to improve metadata throughout the earth and ocean sciences.
BIO::Phylo-phyloinformatic analysis using perl.
Vos, Rutger A; Caravas, Jason; Hartmann, Klaas; Jensen, Mark A; Miller, Chase
2011-02-27
Phyloinformatic analyses involve large amounts of data and metadata of complex structure. Collecting, processing, analyzing, visualizing and summarizing these data and metadata should be done in steps that can be automated and reproduced. This requires flexible, modular toolkits that can represent, manipulate and persist phylogenetic data and metadata as objects with programmable interfaces. This paper presents Bio::Phylo, a Perl5 toolkit for phyloinformatic analysis. It implements classes and methods that are compatible with the well-known BioPerl toolkit, but is independent from it (making it easy to install) and features a richer API and a data model that is better able to manage the complex relationships between different fundamental data and metadata objects in phylogenetics. It supports commonly used file formats for phylogenetic data including the novel NeXML standard, which allows rich annotations of phylogenetic data to be stored and shared. Bio::Phylo can interact with BioPerl, thereby giving access to the file formats that BioPerl supports. Many methods for data simulation, transformation and manipulation, the analysis of tree shape, and tree visualization are provided. Bio::Phylo is composed of 59 richly documented Perl5 modules. It has been deployed successfully on a variety of computer architectures (including various Linux distributions, Mac OS X versions, Windows, Cygwin and UNIX-like systems). It is available as open source (GPL) software from http://search.cpan.org/dist/Bio-Phylo.
BIO::Phylo-phyloinformatic analysis using perl
2011-01-01
Background Phyloinformatic analyses involve large amounts of data and metadata of complex structure. Collecting, processing, analyzing, visualizing and summarizing these data and metadata should be done in steps that can be automated and reproduced. This requires flexible, modular toolkits that can represent, manipulate and persist phylogenetic data and metadata as objects with programmable interfaces. Results This paper presents Bio::Phylo, a Perl5 toolkit for phyloinformatic analysis. It implements classes and methods that are compatible with the well-known BioPerl toolkit, but is independent from it (making it easy to install) and features a richer API and a data model that is better able to manage the complex relationships between different fundamental data and metadata objects in phylogenetics. It supports commonly used file formats for phylogenetic data including the novel NeXML standard, which allows rich annotations of phylogenetic data to be stored and shared. Bio::Phylo can interact with BioPerl, thereby giving access to the file formats that BioPerl supports. Many methods for data simulation, transformation and manipulation, the analysis of tree shape, and tree visualization are provided. Conclusions Bio::Phylo is composed of 59 richly documented Perl5 modules. It has been deployed successfully on a variety of computer architectures (including various Linux distributions, Mac OS X versions, Windows, Cygwin and UNIX-like systems). It is available as open source (GPL) software from http://search.cpan.org/dist/Bio-Phylo PMID:21352572
Evolution of Web Services in EOSDIS: Search and Order Metadata Registry (ECHO)
NASA Technical Reports Server (NTRS)
Mitchell, Andrew; Ramapriyan, Hampapuram; Lowe, Dawn
2009-01-01
During 2005 through 2008, NASA defined and implemented a major evolutionary change in it Earth Observing system Data and Information System (EOSDIS) to modernize its capabilities. This implementation was based on a vision for 2015 developed during 2005. The EOSDIS 2015 Vision emphasizes increased end-to-end data system efficiency and operability; increased data usability; improved support for end users; and decreased operations costs. One key feature of the Evolution plan was achieving higher operational maturity (ingest, reconciliation, search and order, performance, error handling) for the NASA s Earth Observing System Clearinghouse (ECHO). The ECHO system is an operational metadata registry through which the scientific community can easily discover and exchange NASA's Earth science data and services. ECHO contains metadata for 2,726 data collections comprising over 87 million individual data granules and 34 million browse images, consisting of NASA s EOSDIS Data Centers and the United States Geological Survey's Landsat Project holdings. ECHO is a middleware component based on a Service Oriented Architecture (SOA). The system is comprised of a set of infrastructure services that enable the fundamental SOA functions: publish, discover, and access Earth science resources. It also provides additional services such as user management, data access control, and order management. The ECHO system has a data registry and a services registry. The data registry enables organizations to publish EOS and other Earth-science related data holdings to a common metadata model. These holdings are described through metadata in terms of datasets (types of data) and granules (specific data items of those types). ECHO also supports browse images, which provide a visual representation of the data. The published metadata can be mapped to and from existing standards (e.g., FGDC, ISO 19115). With ECHO, users can find the metadata stored in the data registry and then access the data either directly online or through a brokered order to the data archive organization. ECHO stores metadata from a variety of science disciplines and domains, including Climate Variability and Change, Carbon Cycle and Ecosystems, Earth Surface and Interior, Atmospheric Composition, Weather, and Water and Energy Cycle. ECHO also has a services registry for community-developed search services and data services. ECHO provides a platform for the publication, discovery, understanding and access to NASA s Earth Observation resources (data, service and clients). In their native state, these data, service and client resources are not necessarily targeted for use beyond their original mission. However, with the proper interoperability mechanisms, users of these resources can expand their value, by accessing, combining and applying them in unforeseen ways.
Supplemental Journal Article Materials: A progress report on an information industry initiative
NASA Astrophysics Data System (ADS)
Schwarzman, A. B.
2011-12-01
Who could possibly quibble with the idea of publishing supplemental materials to a journal article? Making them available makes it possible for the Earth and space scientists to demonstrate supporting evidence, such as multimedia, computer programs, and datasets; gives the authors the opportunity to present in-depth studies that would not otherwise be available; and enables the readers to replicate experiments and verify their results. However, the scholarly publishing ecosystem is now being threatened by a veritable tsunami of supplemental materials that have to be peer reviewed, identified, described, and made discoverable and citeable; such materials also have to be archived, preserved, and perpetually converted to the contemporary formats to be available to a future researcher. Moreover, the readers often have no clear indication of how critical a particular supplemental material is to the scientific conclusions of the article and thus are not sure whether they should spend their time reading/viewing/running it. In some cases it is not even clear what the material actually supplements. While one segment of the research community argues that even more supplemental materials should be made available, another segment increasingly voices its concern stating categorically that a research article is not a data dump or an FTP site. From the publisher's perspective, dealing with supplemental materials in a responsible fashion is becoming an increasingly costly proposition. Faced with formidable challenges of managing supplemental materials, the information profession community in 2010 formed a joint NISO/NFAIS Working Group to develop Recommended Practices for curating supplemental materials during their life cycle, including but not limited to their selection, peer review, editing, production, presentation, providing context, identification, linking, citing, hosting, discovery, metadata and markup, packaging, accessibility, and preservation. The Recommended Practices also intend to address roles and responsibilities of authors, editors, peer reviewers, publishers, libraries, abstracting and indexing services, and official data centers and institutional repositories. Finally, the document is going to contain broad principles and detailed technical implementation related to metadata, linking, packaging, and accessibility of supplemental materials. In this presentation, a co-chair of the NISO/NFAIS Working Group will report on the Group's latest progress in developing the Recommended Practices for Supplemental Journal Article Materials.
Natural Language Processing in aid of FlyBase curators
Karamanis, Nikiforos; Seal, Ruth; Lewin, Ian; McQuilton, Peter; Vlachos, Andreas; Gasperin, Caroline; Drysdale, Rachel; Briscoe, Ted
2008-01-01
Background Despite increasing interest in applying Natural Language Processing (NLP) to biomedical text, whether this technology can facilitate tasks such as database curation remains unclear. Results PaperBrowser is the first NLP-powered interface that was developed under a user-centered approach to improve the way in which FlyBase curators navigate an article. In this paper, we first discuss how observing curators at work informed the design and evaluation of PaperBrowser. Then, we present how we appraise PaperBrowser's navigational functionalities in a user-based study using a text highlighting task and evaluation criteria of Human-Computer Interaction. Our results show that PaperBrowser reduces the amount of interactions between two highlighting events and therefore improves navigational efficiency by about 58% compared to the navigational mechanism that was previously available to the curators. Moreover, PaperBrowser is shown to provide curators with enhanced navigational utility by over 74% irrespective of the different ways in which they highlight text in the article. Conclusion We show that state-of-the-art performance in certain NLP tasks such as Named Entity Recognition and Anaphora Resolution can be combined with the navigational functionalities of PaperBrowser to support curation quite successfully. PMID:18410678
EnviroAtlas One Meter Resolution Urban Land Cover Data (2008-2012) Web Service
This EnviroAtlas web service supports research and online mapping activities related to EnviroAtlas (https://www.epa.gov/enviroatlas ). The EnviroAtlas One Meter-scale Urban Land Cover (MULC) Data were generated individually for each EnviroAtlas community. Source imagery varies by community. Land cover classes mapped also vary by community and include the following: water, impervious surfaces, soil and barren land, trees, shrub, grass and herbaceous, agriculture, orchards, woody wetlands, and emergent wetlands. Accuracy assessments were completed for each community's classification. For specific information about methods and accuracy of each community's land cover classification, consult their individual metadata records: Austin, TX (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B91A32A9D-96F5-4FA0-BC97-73BAD5D1F158%7D); Cleveland, OH (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B82ab1edf-8fc8-4667-9c52-5a5acffffa34%7D); Des Moines, IA (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7BA4152198-978D-4C0B-959F-42EABA9C4E1B%7D); Durham, NC (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B2FF66877-A037-4693-9718-D1870AA3F084%7D); Fresno, CA (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7B87041CF3-05BC-43C3-82DA-F066267C9871%7D); Green Bay, WI (https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7BD602E7C9-7F53-4C24
Interoperable Solar Data and Metadata via LISIRD 3
NASA Astrophysics Data System (ADS)
Wilson, A.; Lindholm, D. M.; Pankratz, C. K.; Snow, M. A.; Woods, T. N.
2015-12-01
LISIRD 3 is a major upgrade of the LASP Interactive Solar Irradiance Data Center (LISIRD), which serves several dozen space based solar irradiance and related data products to the public. Through interactive plots, LISIRD 3 provides data browsing supported by data subsetting and aggregation. Incorporating a semantically enabled metadata repository, LISIRD 3 users see current, vetted, consistent information about the datasets offered. Users can now also search for datasets based on metadata fields such as dataset type and/or spectral or temporal range. This semantic database enables metadata browsing, so users can discover the relationships between datasets, instruments, spacecraft, mission and PI. The database also enables creation and publication of metadata records in a variety of formats, such as SPASE or ISO, making these datasets more discoverable. The database also enables the possibility of a public SPARQL endpoint, making the metadata browsable in an automated fashion. LISIRD 3's data access middleware, LaTiS, provides dynamic, on demand reformatting of data and timestamps, subsetting and aggregation, and other server side functionality via a RESTful OPeNDAP compliant API, enabling interoperability between LASP datasets and many common tools. LISIRD 3's templated front end design, coupled with the uniform data interface offered by LaTiS, allows easy integration of new datasets. Consequently the number and variety of datasets offered by LISIRD has grown to encompass several dozen, with many more to come. This poster will discuss design and implementation of LISIRD 3, including tools used, capabilities enabled, and issues encountered.
Park, Yu Rang; Yoon, Young Jo; Jang, Tae Hun; Seo, Hwa Jeong
2014-01-01
Objectives Extension of the standard model while retaining compliance with it is a challenging issue because there is currently no method for semantically or syntactically verifying an extended data model. A metadata-based extended model, named CCR+, was designed and implemented to achieve interoperability between standard and extended models. Methods Furthermore, a multilayered validation method was devised to validate the standard and extended models. The American Society for Testing and Materials (ASTM) Community Care Record (CCR) standard was selected to evaluate the CCR+ model; two CCR and one CCR+ XML files were evaluated. Results In total, 188 metadata were extracted from the ASTM CCR standard; these metadata are semantically interconnected and registered in the metadata registry. An extended-data-model-specific validation file was generated from these metadata. This file can be used in a smartphone application (Health Avatar CCR+) as a part of a multilayered validation. The new CCR+ model was successfully evaluated via a patient-centric exchange scenario involving multiple hospitals, with the results supporting both syntactic and semantic interoperability between the standard CCR and extended, CCR+, model. Conclusions A feasible method for delivering an extended model that complies with the standard model is presented herein. There is a great need to extend static standard models such as the ASTM CCR in various domains: the methods presented here represent an important reference for achieving interoperability between standard and extended models. PMID:24627817
The Five Cs of Digital Curation: Supporting Twenty-First-Century Teaching and Learning
ERIC Educational Resources Information Center
Deschaine, Mark E.; Sharma, Sue Ann
2015-01-01
Digital curation is a process that allows university professors to adapt and adopt resources from multidisciplinary fields to meet the educational needs of twenty-first-century learners. Looking through the lens of new media literacy studies (Vasquez, Harste, & Albers, 2010) and new literacies studies (Gee, 2010), we propose that university…
A Study of Faculty Data Curation Behaviors and Attitudes at a Teaching-Centered University
ERIC Educational Resources Information Center
Scaramozzino, Jeanine Marie; Ramírez, Marisa L.; McGaughey, Karen J.
2012-01-01
Academic libraries need reliable information on researcher data needs, data curation practices, and attitudes to identify and craft appropriate services that support outreach and teaching. This paper describes information gathered from a survey distributed to the College of Science and Mathematics faculty at California Polytechnic State…
Disease model curation improvements at Mouse Genome Informatics
Bello, Susan M.; Richardson, Joel E.; Davis, Allan P.; Wiegers, Thomas C.; Mattingly, Carolyn J.; Dolan, Mary E.; Smith, Cynthia L.; Blake, Judith A.; Eppig, Janan T.
2012-01-01
Optimal curation of human diseases requires an ontology or structured vocabulary that contains terms familiar to end users, is robust enough to support multiple levels of annotation granularity, is limited to disease terms and is stable enough to avoid extensive reannotation following updates. At Mouse Genome Informatics (MGI), we currently use disease terms from Online Mendelian Inheritance in Man (OMIM) to curate mouse models of human disease. While OMIM provides highly detailed disease records that are familiar to many in the medical community, it lacks structure to support multilevel annotation. To improve disease annotation at MGI, we evaluated the merged Medical Subject Headings (MeSH) and OMIM disease vocabulary created by the Comparative Toxicogenomics Database (CTD) project. Overlaying MeSH onto OMIM provides hierarchical access to broad disease terms, a feature missing from the OMIM. We created an extended version of the vocabulary to meet the genetic disease-specific curation needs at MGI. Here we describe our evaluation of the CTD application, the extensions made by MGI and discuss the strengths and weaknesses of this approach. Database URL: http://www.informatics.jax.org/ PMID:22434831
WormBase 2014: new views of curated biology
Harris, Todd W.; Baran, Joachim; Bieri, Tamberlyn; Cabunoc, Abigail; Chan, Juancarlos; Chen, Wen J.; Davis, Paul; Done, James; Grove, Christian; Howe, Kevin; Kishore, Ranjana; Lee, Raymond; Li, Yuling; Muller, Hans-Michael; Nakamura, Cecilia; Ozersky, Philip; Paulini, Michael; Raciti, Daniela; Schindelman, Gary; Tuli, Mary Ann; Auken, Kimberly Van; Wang, Daniel; Wang, Xiaodong; Williams, Gary; Wong, J. D.; Yook, Karen; Schedl, Tim; Hodgkin, Jonathan; Berriman, Matthew; Kersey, Paul; Spieth, John; Stein, Lincoln; Sternberg, Paul W.
2014-01-01
WormBase (http://www.wormbase.org/) is a highly curated resource dedicated to supporting research using the model organism Caenorhabditis elegans. With an electronic history predating the World Wide Web, WormBase contains information ranging from the sequence and phenotype of individual alleles to genome-wide studies generated using next-generation sequencing technologies. In recent years, we have expanded the contents to include data on additional nematodes of agricultural and medical significance, bringing the knowledge of C. elegans to bear on these systems and providing support for underserved research communities. Manual curation of the primary literature remains a central focus of the WormBase project, providing users with reliable, up-to-date and highly cross-linked information. In this update, we describe efforts to organize the original atomized and highly contextualized curated data into integrated syntheses of discrete biological topics. Next, we discuss our experiences coping with the vast increase in available genome sequences made possible through next-generation sequencing platforms. Finally, we describe some of the features and tools of the new WormBase Web site that help users better find and explore data of interest. PMID:24194605
How should the completeness and quality of curated nanomaterial data be evaluated?
NASA Astrophysics Data System (ADS)
Marchese Robinson, Richard L.; Lynch, Iseult; Peijnenburg, Willie; Rumble, John; Klaessig, Fred; Marquardt, Clarissa; Rauscher, Hubert; Puzyn, Tomasz; Purian, Ronit; Åberg, Christoffer; Karcher, Sandra; Vriens, Hanne; Hoet, Peter; Hoover, Mark D.; Hendren, Christine Ogilvie; Harper, Stacey L.
2016-05-01
Nanotechnology is of increasing significance. Curation of nanomaterial data into electronic databases offers opportunities to better understand and predict nanomaterials' behaviour. This supports innovation in, and regulation of, nanotechnology. It is commonly understood that curated data need to be sufficiently complete and of sufficient quality to serve their intended purpose. However, assessing data completeness and quality is non-trivial in general and is arguably especially difficult in the nanoscience area, given its highly multidisciplinary nature. The current article, part of the Nanomaterial Data Curation Initiative series, addresses how to assess the completeness and quality of (curated) nanomaterial data. In order to address this key challenge, a variety of related issues are discussed: the meaning and importance of data completeness and quality, existing approaches to their assessment and the key challenges associated with evaluating the completeness and quality of curated nanomaterial data. Considerations which are specific to the nanoscience area and lessons which can be learned from other relevant scientific disciplines are considered. Hence, the scope of this discussion ranges from physicochemical characterisation requirements for nanomaterials and interference of nanomaterials with nanotoxicology assays to broader issues such as minimum information checklists, toxicology data quality schemes and computational approaches that facilitate evaluation of the completeness and quality of (curated) data. This discussion is informed by a literature review and a survey of key nanomaterial data curation stakeholders. Finally, drawing upon this discussion, recommendations are presented concerning the central question: how should the completeness and quality of curated nanomaterial data be evaluated?Nanotechnology is of increasing significance. Curation of nanomaterial data into electronic databases offers opportunities to better understand and predict nanomaterials' behaviour. This supports innovation in, and regulation of, nanotechnology. It is commonly understood that curated data need to be sufficiently complete and of sufficient quality to serve their intended purpose. However, assessing data completeness and quality is non-trivial in general and is arguably especially difficult in the nanoscience area, given its highly multidisciplinary nature. The current article, part of the Nanomaterial Data Curation Initiative series, addresses how to assess the completeness and quality of (curated) nanomaterial data. In order to address this key challenge, a variety of related issues are discussed: the meaning and importance of data completeness and quality, existing approaches to their assessment and the key challenges associated with evaluating the completeness and quality of curated nanomaterial data. Considerations which are specific to the nanoscience area and lessons which can be learned from other relevant scientific disciplines are considered. Hence, the scope of this discussion ranges from physicochemical characterisation requirements for nanomaterials and interference of nanomaterials with nanotoxicology assays to broader issues such as minimum information checklists, toxicology data quality schemes and computational approaches that facilitate evaluation of the completeness and quality of (curated) data. This discussion is informed by a literature review and a survey of key nanomaterial data curation stakeholders. Finally, drawing upon this discussion, recommendations are presented concerning the central question: how should the completeness and quality of curated nanomaterial data be evaluated? Electronic supplementary information (ESI) available: (1) Detailed information regarding issues raised in the main text; (2) original survey responses. See DOI: 10.1039/c5nr08944a
Cyberinfrastructure to Support Collaborative and Reproducible Computational Hydrologic Modeling
NASA Astrophysics Data System (ADS)
Goodall, J. L.; Castronova, A. M.; Bandaragoda, C.; Morsy, M. M.; Sadler, J. M.; Essawy, B.; Tarboton, D. G.; Malik, T.; Nijssen, B.; Clark, M. P.; Liu, Y.; Wang, S. W.
2017-12-01
Creating cyberinfrastructure to support reproducibility of computational hydrologic models is an important research challenge. Addressing this challenge requires open and reusable code and data with machine and human readable metadata, organized in ways that allow others to replicate results and verify published findings. Specific digital objects that must be tracked for reproducible computational hydrologic modeling include (1) raw initial datasets, (2) data processing scripts used to clean and organize the data, (3) processed model inputs, (4) model results, and (5) the model code with an itemization of all software dependencies and computational requirements. HydroShare is a cyberinfrastructure under active development designed to help users store, share, and publish digital research products in order to improve reproducibility in computational hydrology, with an architecture supporting hydrologic-specific resource metadata. Researchers can upload data required for modeling, add hydrology-specific metadata to these resources, and use the data directly within HydroShare.org for collaborative modeling using tools like CyberGIS, Sciunit-CLI, and JupyterHub that have been integrated with HydroShare to run models using notebooks, Docker containers, and cloud resources. Current research aims to implement the Structure For Unifying Multiple Modeling Alternatives (SUMMA) hydrologic model within HydroShare to support hypothesis-driven hydrologic modeling while also taking advantage of the HydroShare cyberinfrastructure. The goal of this integration is to create the cyberinfrastructure that supports hypothesis-driven model experimentation, education, and training efforts by lowering barriers to entry, reducing the time spent on informatics technology and software development, and supporting collaborative research within and across research groups.
BioModels: expanding horizons to include more modelling approaches and formats
Nguyen, Tung V N; Graesslin, Martin; Hälke, Robert; Ali, Raza; Schramm, Jochen; Wimalaratne, Sarala M; Kothamachu, Varun B; Rodriguez, Nicolas; Swat, Maciej J; Eils, Jurgen; Eils, Roland; Laibe, Camille; Chelliah, Vijayalakshmi
2018-01-01
Abstract BioModels serves as a central repository of mathematical models representing biological processes. It offers a platform to make mathematical models easily shareable across the systems modelling community, thereby supporting model reuse. To facilitate hosting a broader range of model formats derived from diverse modelling approaches and tools, a new infrastructure for BioModels has been developed that is available at http://www.ebi.ac.uk/biomodels. This new system allows submitting and sharing of a wide range of models with improved support for formats other than SBML. It also offers a version-control backed environment in which authors and curators can work collaboratively to curate models. This article summarises the features available in the current system and discusses the potential benefit they offer to the users over the previous system. In summary, the new portal broadens the scope of models accepted in BioModels and supports collaborative model curation which is crucial for model reproducibility and sharing. PMID:29106614
Trippi, Michael H.; Kinney, Scott A.; Gunther, Gregory; Ryder, Robert T.; Ruppert, Leslie F.; Ruppert, Leslie F.; Ryder, Robert T.
2014-01-01
Metadata for these datasets are available in HTML and XML formats. Metadata files contain information about the sources of data used to create the dataset, the creation process steps, the data quality, the geographic coordinate system and horizontal datum used for the dataset, the values of attributes used in the dataset table, information about the publication and the publishing organization, and other information that may be useful to the reader. All links in the metadata were valid at the time of compilation. Some of these links may no longer be valid. No attempt has been made to determine the new online location (if one exists) for the data.
2016-01-01
Moving Target Indicator, Unmanned Aircraft System (UAS), Rivet Joint, U-2, and ground signals intelligence (PROPHET). At the BCT, Ranger Regiment and... metadata catalog managed by the DIB management office (outside of the DCGS-A system ). A metadata is a searchable description of data, and users across...challenge for users . The system required reboots about every 20 hours for users who had heavy workloads such as the fire support analysts and data
GEO Label Web Services for Dynamic and Effective Communication of Geospatial Metadata Quality
NASA Astrophysics Data System (ADS)
Lush, Victoria; Nüst, Daniel; Bastin, Lucy; Masó, Joan; Lumsden, Jo
2014-05-01
We present demonstrations of the GEO label Web services and their integration into a prototype extension of the GEOSS portal (http://scgeoviqua.sapienzaconsulting.com/web/guest/geo_home), the GMU portal (http://gis.csiss.gmu.edu/GADMFS/) and a GeoNetwork catalog application (http://uncertdata.aston.ac.uk:8080/geonetwork/srv/eng/main.home). The GEO label is designed to communicate, and facilitate interrogation of, geospatial quality information with a view to supporting efficient and effective dataset selection on the basis of quality, trustworthiness and fitness for use. The GEO label which we propose was developed and evaluated according to a user-centred design (UCD) approach in order to maximise the likelihood of user acceptance once deployed. The resulting label is dynamically generated from producer metadata in ISO or FDGC format, and incorporates user feedback on dataset usage, ratings and discovered issues, in order to supply a highly informative summary of metadata completeness and quality. The label was easily incorporated into a community portal as part of the GEO Architecture Implementation Programme (AIP-6) and has been successfully integrated into a prototype extension of the GEOSS portal, as well as the popular metadata catalog and editor, GeoNetwork. The design of the GEO label was based on 4 user studies conducted to: (1) elicit initial user requirements; (2) investigate initial user views on the concept of a GEO label and its potential role; (3) evaluate prototype label visualizations; and (4) evaluate and validate physical GEO label prototypes. The results of these studies indicated that users and producers support the concept of a label with drill-down interrogation facility, combining eight geospatial data informational aspects, namely: producer profile, producer comments, lineage information, standards compliance, quality information, user feedback, expert reviews, and citations information. These are delivered as eight facets of a wheel-like label, which are coloured according to metadata availability and are clickable to allow a user to engage with the original metadata and explore specific aspects in more detail. To support this graphical representation and allow for wider deployment architectures we have implemented two Web services, a PHP and a Java implementation, that generate GEO label representations by combining producer metadata (from standard catalogues or other published locations) with structured user feedback. Both services accept encoded URLs of publicly available metadata documents or metadata XML files as HTTP POST and GET requests and apply XPath and XSLT mappings to transform producer and feedback XML documents into clickable SVG GEO label representations. The label and services are underpinned by two XML-based quality models. The first is a producer model that extends ISO 19115 and 19157 to allow fuller citation of reference data, presentation of pixel- and dataset- level statistical quality information, and encoding of 'traceability' information on the lineage of an actual quality assessment. The second is a user quality model (realised as a feedback server and client) which allows reporting and query of ratings, usage reports, citations, comments and other domain knowledge. Both services are Open Source and are available on GitHub at https://github.com/lushv/geolabel-service and https://github.com/52North/GEO-label-java. The functionality of these services can be tested using our GEO label generation demos, available online at http://www.geolabel.net/demo.html and http://geoviqua.dev.52north.org/glbservice/index.jsf.
GeoViQua: quality-aware geospatial data discovery and evaluation
NASA Astrophysics Data System (ADS)
Bigagli, L.; Papeschi, F.; Mazzetti, P.; Nativi, S.
2012-04-01
GeoViQua (QUAlity aware VIsualization for the Global Earth Observation System of Systems) is a recently started FP7 project aiming at complementing the Global Earth Observation System of Systems (GEOSS) with rigorous data quality specifications and quality-aware capabilities, in order to improve reliability in scientific studies and policy decision-making. GeoViQua main scientific and technical objective is to enhance the GEOSS Common Infrastructure (GCI) providing the user community with innovative quality-aware search and evaluation tools, which will be integrated in the GEO-Portal, as well as made available to other end-user interfaces. To this end, GeoViQua will promote the extension of the current standard metadata for geographic information with accurate and expressive quality indicators, also contributing to the definition of a quality label (GEOLabel). GeoViQua proposed solutions will be assessed in several pilot case studies covering the whole Earth Observation chain, from remote sensing acquisition to data processing, to applications in the main GEOSS Societal Benefit Areas. This work presents the preliminary results of GeoViQua Work Package 4 "Enhanced geo-search tools" (WP4), started in January 2012. Its major anticipated technical innovations are search and evaluation tools that communicate and exploit data quality information from the GCI. In particular, GeoViQua will investigate a graphical search interface featuring a coherent and meaningful aggregation of statistics and metadata summaries (e.g. in the form of tables, charts), thus enabling end users to leverage quality constraints for data discovery and evaluation. Preparatory work on WP4 requirements indicated that users need the "best" data for their purpose, implying a high degree of subjectivity in judgment. This suggests that the GeoViQua system should exploit a combination of provider-generated metadata (objective indicators such as summary statistics), system-generated metadata (contextual/tracking information such as provenance of data and metadata), and user-generated metadata (informal user comments, usage information, rating, etc.). Moreover, metadata should include sufficiently complete access information, to allow rich data visualization and propagation. The following main enabling components are currently identified within WP4: - Quality-aware access services, e.g. a quality-aware extension of the OGC Sensor Observation Service (SOS-Q) specification, to support quality constraints for sensor data publishing and access; - Quality-aware discovery services, namely a quality-aware extension of the OGC Catalog Service for the Web (CSW-Q), to cope with quality constrained search; - Quality-augmentation broker (GeoViQua Broker), to support the linking and combination of the existing GCI metadata with GeoViQua- and user-generated metadata required to support the users in selecting the "best" data for their intended use. We are currently developing prototypes of the above quality-enabled geo-search components, that will be assessed in a sensor-based pilot case study in the next months. In particular, the GeoViQua Broker will be integrated with the EuroGEOSS Broker, to implement CSW-Q and federate (either via distribution or harvesting schemes) quality-aware data sources, GeoViQua will constitute a valuable test-bed for advancing the current best practices and standards in geospatial quality representation and exploitation. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 265178.
Assessing the reproducibility of discriminant function analyses
Andrew, Rose L.; Albert, Arianne Y.K.; Renaut, Sebastien; Rennison, Diana J.; Bock, Dan G.
2015-01-01
Data are the foundation of empirical research, yet all too often the datasets underlying published papers are unavailable, incorrect, or poorly curated. This is a serious issue, because future researchers are then unable to validate published results or reuse data to explore new ideas and hypotheses. Even if data files are securely stored and accessible, they must also be accompanied by accurate labels and identifiers. To assess how often problems with metadata or data curation affect the reproducibility of published results, we attempted to reproduce Discriminant Function Analyses (DFAs) from the field of organismal biology. DFA is a commonly used statistical analysis that has changed little since its inception almost eight decades ago, and therefore provides an opportunity to test reproducibility among datasets of varying ages. Out of 100 papers we initially surveyed, fourteen were excluded because they did not present the common types of quantitative result from their DFA or gave insufficient details of their DFA. Of the remaining 86 datasets, there were 15 cases for which we were unable to confidently relate the dataset we received to the one used in the published analysis. The reasons ranged from incomprehensible or absent variable labels, the DFA being performed on an unspecified subset of the data, or the dataset we received being incomplete. We focused on reproducing three common summary statistics from DFAs: the percent variance explained, the percentage correctly assigned and the largest discriminant function coefficient. The reproducibility of the first two was fairly high (20 of 26, and 44 of 60 datasets, respectively), whereas our success rate with the discriminant function coefficients was lower (15 of 26 datasets). When considering all three summary statistics, we were able to completely reproduce 46 (65%) of 71 datasets. While our results show that a majority of studies are reproducible, they highlight the fact that many studies still are not the carefully curated research that the scientific community and public expects. PMID:26290793
DOE Office of Scientific and Technical Information (OSTI.GOV)
Faria, Jose P.; Overbeek, Ross; Taylor, Ronald C.
Here, we introduce a manually constructed and curated regulatory network model that describes the current state of knowledge of transcriptional regulation of B. subtilis. The model corresponds to an updated and enlarged version of the regulatory model of central metabolism originally proposed in 2008. We extended the original network to the whole genome by integration of information from DBTBS, a compendium of regulatory data that includes promoters, transcription factors (TFs), binding sites, motifs and regulated operons. Additionally, we consolidated our network with all the information on regulation included in the SporeWeb and Subtiwiki community-curated resources on B. subtilis. Finally, wemore » reconciled our network with data from RegPrecise, which recently released their own less comprehensive reconstruction of the regulatory network for B. subtilis. Our model describes 275 regulators and their target genes, representing 30 different mechanisms of regulation such as TFs, RNA switches, Riboswitches and small regulatory RNAs. Overall, regulatory information is included in the model for approximately 2500 of the ~4200 genes in B. subtilis 168. In an effort to further expand our knowledge of B. subtilis regulation, we reconciled our model with expression data. For this process, we reconstructed the Atomic Regulons (ARs) for B. subtilis, which are the sets of genes that share the same “ON” and “OFF” gene expression profiles across multiple samples of experimental data. We show how atomic regulons for B. subtilis are able to capture many sets of genes corresponding to regulated operons in our manually curated network. Additionally, we demonstrate how atomic regulons can be used to help expand or validate the knowledge of the regulatory networks by looking at highly correlated genes in the ARs for which regulatory information is lacking. During this process, we were also able to infer novel stimuli for hypothetical genes by exploring the genome expression metadata relating to experimental conditions, gaining insights into novel biology.« less
Faria, Jose P.; Overbeek, Ross; Taylor, Ronald C.; ...
2016-03-18
Here, we introduce a manually constructed and curated regulatory network model that describes the current state of knowledge of transcriptional regulation of B. subtilis. The model corresponds to an updated and enlarged version of the regulatory model of central metabolism originally proposed in 2008. We extended the original network to the whole genome by integration of information from DBTBS, a compendium of regulatory data that includes promoters, transcription factors (TFs), binding sites, motifs and regulated operons. Additionally, we consolidated our network with all the information on regulation included in the SporeWeb and Subtiwiki community-curated resources on B. subtilis. Finally, wemore » reconciled our network with data from RegPrecise, which recently released their own less comprehensive reconstruction of the regulatory network for B. subtilis. Our model describes 275 regulators and their target genes, representing 30 different mechanisms of regulation such as TFs, RNA switches, Riboswitches and small regulatory RNAs. Overall, regulatory information is included in the model for approximately 2500 of the ~4200 genes in B. subtilis 168. In an effort to further expand our knowledge of B. subtilis regulation, we reconciled our model with expression data. For this process, we reconstructed the Atomic Regulons (ARs) for B. subtilis, which are the sets of genes that share the same “ON” and “OFF” gene expression profiles across multiple samples of experimental data. We show how atomic regulons for B. subtilis are able to capture many sets of genes corresponding to regulated operons in our manually curated network. Additionally, we demonstrate how atomic regulons can be used to help expand or validate the knowledge of the regulatory networks by looking at highly correlated genes in the ARs for which regulatory information is lacking. During this process, we were also able to infer novel stimuli for hypothetical genes by exploring the genome expression metadata relating to experimental conditions, gaining insights into novel biology.« less
NASA Astrophysics Data System (ADS)
Agarwal, D.; Varadharajan, C.; Cholia, S.; Snavely, C.; Hendrix, V.; Gunter, D.; Riley, W. J.; Jones, M.; Budden, A. E.; Vieglais, D.
2017-12-01
The ESS-DIVE archive is a new U.S. Department of Energy (DOE) data archive designed to provide long-term stewardship and use of data from observational, experimental, and modeling activities in the earth and environmental sciences. The ESS-DIVE infrastructure is constructed with the long-term vision of enabling broad access to and usage of the DOE sponsored data stored in the archive. It is designed as a scalable framework that incentivizes data providers to contribute well-structured, high-quality data to the archive and that enables the user community to easily build data processing, synthesis, and analysis capabilities using those data. The key innovations in our design include: (1) application of user-experience research methods to understand the needs of users and data contributors; (2) support for early data archiving during project data QA/QC and before public release; (3) focus on implementation of data standards in collaboration with the community; (4) support for community built tools for data search, interpretation, analysis, and visualization tools; (5) data fusion database to support search of the data extracted from packages submitted and data available in partner data systems such as the Earth System Grid Federation (ESGF) and DataONE; and (6) support for archiving of data packages that are not to be released to the public. ESS-DIVE data contributors will be able to archive and version their data and metadata, obtain data DOIs, search for and access ESS data and metadata via web and programmatic portals, and provide data and metadata in standardized forms. The ESS-DIVE archive and catalog will be federated with other existing catalogs, allowing cross-catalog metadata search and data exchange with existing systems, including DataONE's Metacat search. ESS-DIVE is operated by a multidisciplinary team from Berkeley Lab, the National Center for Ecological Analysis and Synthesis (NCEAS), and DataONE. The primarily data copies are hosted at DOE's NERSC supercomputing facility with replicas at DataONE nodes.
A Grid Metadata Service for Earth and Environmental Sciences
NASA Astrophysics Data System (ADS)
Fiore, Sandro; Negro, Alessandro; Aloisio, Giovanni
2010-05-01
Critical challenges for climate modeling researchers are strongly connected with the increasingly complex simulation models and the huge quantities of produced datasets. Future trends in climate modeling will only increase computational and storage requirements. For this reason the ability to transparently access to both computational and data resources for large-scale complex climate simulations must be considered as a key requirement for Earth Science and Environmental distributed systems. From the data management perspective (i) the quantity of data will continuously increases, (ii) data will become more and more distributed and widespread, (iii) data sharing/federation will represent a key challenging issue among different sites distributed worldwide, (iv) the potential community of users (large and heterogeneous) will be interested in discovery experimental results, searching of metadata, browsing collections of files, compare different results, display output, etc.; A key element to carry out data search and discovery, manage and access huge and distributed amount of data is the metadata handling framework. What we propose for the management of distributed datasets is the GRelC service (a data grid solution focusing on metadata management). Despite the classical approaches, the proposed data-grid solution is able to address scalability, transparency, security and efficiency and interoperability. The GRelC service we propose is able to provide access to metadata stored in different and widespread data sources (relational databases running on top of MySQL, Oracle, DB2, etc. leveraging SQL as query language, as well as XML databases - XIndice, eXist, and libxml2 based documents, adopting either XPath or XQuery) providing a strong data virtualization layer in a grid environment. Such a technological solution for distributed metadata management leverages on well known adopted standards (W3C, OASIS, etc.); (ii) supports role-based management (based on VOMS), which increases flexibility and scalability; (iii) provides full support for Grid Security Infrastructure, which means (authorization, mutual authentication, data integrity, data confidentiality and delegation); (iv) is compatible with existing grid middleware such as gLite and Globus and finally (v) is currently adopted at the Euro-Mediterranean Centre for Climate Change (CMCC - Italy) to manage the entire CMCC data production activity as well as in the international Climate-G testbed.
Managing Quality, Identity and Adversaries in Public Discourse with Machine Learning
ERIC Educational Resources Information Center
Brennan, Michael
2012-01-01
Automation can mitigate issues when scaling and managing quality and identity in public discourse on the web. Discourse needs to be curated and filtered. Anonymous speech has to be supported while handling adversaries. Reliance on human curators or analysts does not scale and content can be missed. These scaling and management issues include the…
Smart Mobility Stakeholders - Curating Urban Data & Models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sperling, Joshua
This presentation provides an overview of the curation of urban data and models through engaging SMART mobility stakeholders. SMART Mobility Urban Science Efforts are helping to expose key data sets, models, and roles for the U.S. Department of Energy in engaging across stakeholders to ensure useful insights. This will help to support other Urban Science and broader SMART initiatives.
2011-01-01
Background The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive. Description A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site http://biostor.org/openurl/. This resolver can be used on the web, or called by bibliographic tools that support OpenURL. Conclusions BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from http://biostor.org/. PMID:21605356
Lessons Learned From 104 Years of Mobile Observatories
NASA Astrophysics Data System (ADS)
Miller, S. P.; Clark, P. D.; Neiswender, C.; Raymond, L.; Rioux, M.; Norton, C.; Detrick, R.; Helly, J.; Sutton, D.; Weatherford, J.
2007-12-01
As the oceanographic community ventures into a new era of integrated observatories, it may be helpful to look back on the era of "mobile observatories" to see what Cyberinfrastructure lessons might be learned. For example, SIO has been operating research vessels for 104 years, supporting a wide range of disciplines: marine geology and geophysics, physical oceanography, geochemistry, biology, seismology, ecology, fisheries, and acoustics. In the last 6 years progress has been made with diverse data types, formats and media, resulting in a fully-searchable online SIOExplorer Digital Library of more than 800 cruises (http://SIOExplorer.ucsd.edu). Public access to SIOExplorer is considerable, with 795,351 files (206 GB) downloaded last year. During the last 3 years the efforts have been extended to WHOI, with a "Multi-Institution Testbed for Scalable Digital Archiving" funded by the Library of Congress and NSF (IIS 0455998). The project has created a prototype digital library of data from both institutions, including cruises, Alvin submersible dives, and ROVs. In the process, the team encountered technical and cultural issues that will be facing the observatory community in the near future. Technological Lessons Learned: Shipboard data from multiple institutions are extraordinarily diverse, and provide a good training ground for observatories. Data are gathered from a wide range of authorities, laboratories, servers and media, with little documentation. Conflicting versions exist, generated by alternative processes. Domain- and institution-specific issues were addressed during initial staging. Data files were categorized and metadata harvested with automated procedures. With our second-generation approach to staging, we achieve higher levels of automation with greater use of controlled vocabularies. Database and XML- based procedures deal with the diversity of raw metadata values and map them to agreed-upon standard values, in collaboration with the Marine Metadata Interoperability (MMI) community. All objects are tagged with an expert level, thus serving an educational audience, as well as research users. After staging, publication into the digital library is completely automated. The technical challenges have been largely overcome, thanks to a scalable, federated digital library architecture from the San Diego Supercomputer Center, implemented at SIO, WHOI and other sites. The metadata design is flexible, supporting modular blocks of metadata tailored to the needs of instruments, samples, documents, derived products, cruises or dives, as appropriate. Controlled metadata vocabularies, with content and definitions negotiated by all parties, are critical. Metadata may be mapped to required external standards and formats, as needed. Cultural Lessons Learned: The cultural challenges have been more formidable than expected. They became most apparent during attempts to categorize and stage digital data objects across two institutions, each with their own naming conventions and practices, generally undocumented, and evolving across decades. Whether the questions concerned data ownership, collection techniques, data diversity or institutional practices, the solution involved a joint discussion with scientists, data managers, technicians and archivists, working together. Because metadata discussions go on endlessly, significant benefit comes from dictionaries with definitions of all community-authorized metadata values.
Event selection services in ATLAS
NASA Astrophysics Data System (ADS)
Cranshaw, J.; Cuhadar-Donszelmann, T.; Gallas, E.; Hrivnac, J.; Kenyon, M.; McGlone, H.; Malon, D.; Mambelli, M.; Nowak, M.; Viegas, F.; Vinek, E.; Zhang, Q.
2010-04-01
ATLAS has developed and deployed event-level selection services based upon event metadata records ("TAGS") and supporting file and database technology. These services allow physicists to extract events that satisfy their selection predicates from any stage of data processing and use them as input to later analyses. One component of these services is a web-based Event-Level Selection Service Interface (ELSSI). ELSSI supports event selection by integrating run-level metadata, luminosity-block-level metadata (e.g., detector status and quality information), and event-by-event information (e.g., triggers passed and physics content). The list of events that survive after some selection criterion is returned in a form that can be used directly as input to local or distributed analysis; indeed, it is possible to submit a skimming job directly from the ELSSI interface using grid proxy credential delegation. ELSSI allows physicists to explore ATLAS event metadata as a means to understand, qualitatively and quantitatively, the distributional characteristics of ATLAS data. In fact, the ELSSI service provides an easy interface to see the highest missing ET events or the events with the most leptons, to count how many events passed a given set of triggers, or to find events that failed a given trigger but nonetheless look relevant to an analysis based upon the results of offline reconstruction, and more. This work provides an overview of ATLAS event-level selection services, with an emphasis upon the interactive Event-Level Selection Service Interface.
Payao: a community platform for SBML pathway model curation
Matsuoka, Yukiko; Ghosh, Samik; Kikuchi, Norihiro; Kitano, Hiroaki
2010-01-01
Summary: Payao is a community-based, collaborative web service platform for gene-regulatory and biochemical pathway model curation. The system combines Web 2.0 technologies and online model visualization functions to enable a collaborative community to annotate and curate biological models. Payao reads the models in Systems Biology Markup Language format, displays them with CellDesigner, a process diagram editor, which complies with the Systems Biology Graphical Notation, and provides an interface for model enrichment (adding tags and comments to the models) for the access-controlled community members. Availability and implementation: Freely available for model curation service at http://www.payaologue.org. Web site implemented in Seaser Framework 2.0 with S2Flex2, MySQL 5.0 and Tomcat 5.5, with all major browsers supported. Contact: kitano@sbi.jp PMID:20371497
Cole, Charles; Krampis, Konstantinos; Karagiannis, Konstantinos; Almeida, Jonas S; Faison, William J; Motwani, Mona; Wan, Quan; Golikov, Anton; Pan, Yang; Simonyan, Vahan; Mazumder, Raja
2014-01-27
Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data thereby allowing users better navigate, search and compute on it. To address the above challenge, we have implemented a NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs) and integrating the data with tools that allow analysis of effect nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
2014-01-01
Background Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data thereby allowing users better navigate, search and compute on it. Results To address the above challenge, we have implemented a NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs) and integrating the data with tools that allow analysis of effect nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). Conclusions Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides. PMID:24467687
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bent, John M.; Faibish, Sorin; Pedone, Jr., James M.
A cluster file system is provided having a plurality of distributed metadata servers with shared access to one or more shared low latency persistent key-value metadata stores. A metadata server comprises an abstract storage interface comprising a software interface module that communicates with at least one shared persistent key-value metadata store providing a key-value interface for persistent storage of key-value metadata. The software interface module provides the key-value metadata to the at least one shared persistent key-value metadata store in a key-value format. The shared persistent key-value metadata store is accessed by a plurality of metadata servers. A metadata requestmore » can be processed by a given metadata server independently of other metadata servers in the cluster file system. A distributed metadata storage environment is also disclosed that comprises a plurality of metadata servers having an abstract storage interface to at least one shared persistent key-value metadata store.« less
Lunar Reference Suite to Support Instrument Development and Testing
NASA Technical Reports Server (NTRS)
Allen, Carlton; Sellar, Glenn; Nunez, Jorge I.; Winterhalter, Daniel; Farmer, Jack
2010-01-01
Astronauts on long-duration lunar missions will need the capability to "high-grade" their samples - to select the highest value samples for transport to Earth - and to leave others on the Moon. Instruments that may be useful for such high-grading are under development. Instruments are also being developed for possible use on future lunar robotic landers, for lunar field work, and for more sophisticated analyses at a lunar outpost. The Johnson Space Center Astromaterials acquisition and Curation Office (JSC Curation) wll support such instrument testing by providing lunar sample "ground truth".
Using Controlled Vocabularies and Semantics to Improve Ocean Data Discovery (Invited)
NASA Astrophysics Data System (ADS)
Chandler, C. L.; Groman, R. C.; Shepherd, A.; Allison, M. D.; Kinkade, D.; Rauch, S.; Wiebe, P. H.; Glover, D. M.
2013-12-01
The Biological and Chemical Oceanography Data Management Office (BCO-DMO) was created in late 2006, by combining the formerly independent data management offices for the U.S. GLOBal Ocean ECosystems Dynamics (GLOBEC) and U.S. Joint Global Ocean flux Study (JGOFS) programs. BCO-DMO staff members work with investigators to publish data from research projects funded by the NSF Geosciences Directorate (GEO) Division of Ocean Sciences (OCE) Biological and Chemical Oceanography Sections and Polar Programs (PLR) Antarctic Sciences Organisms & Ecosystems Program (ANT). Since 2006, researchers have been contributing new data to the BCO-DMO data system. As the data from new research efforts have been added to the data previously shared by U.S. GLOBEC and U.S. JGOFS researchers, the BCO-DMO system has developed into a rich repository of data from ocean, coastal, and Great Lakes research programs. The metadata records for the original research program data (prior to 2006) were stored in human-readable flat files of text, translated on-demand to Web-retrievable files. Beginning in 2006, the metadata records from multiple data systems managed by BCO-DMO were ingested into a relational database (MySQL). Since that time, efforts have been made to incorporate lists of controlled vocabulary terms for key information concepts stored in the MySQL database (e.g. names of research programs, deployments, instruments and measurements). This presents a challenge for a data system that includes legacy data and is continually expanding with the addition of new contributions. Over the years, BCO-DMO has developed a series of data delivery systems driven by the supporting metadata. Improved access to research data, a primary goal of the BCO-DMO project, is achieved through geospatial and text-based data access systems that support data discovery, access, display, assessment, integration, and export of data resources. The addition of a semantically-enabled search capability improves data discovery options particularly for those investigators whose research interests are cross-domain and multi-disciplinary. Current efforts by BCO-DMO staff members are focused on identifying globally unique, persistent identifiers to unambiguously identify resources of interest curated by and available from BCO-DMO. The process involves several essential components: (1) identifying a trusted authoritative source of complementary content and the appropriate contact; (2) determining the globally unique, persistent identifier system for resources of interest and (3) negotiating the requisite syntactic and semantic exchange systems. A variety of technologies have been deployed including: (1) controlled vocabulary term lists for some of the essential concepts/classes; (2) the Ocean Data Ontology; (3) publishing content as Linked Open Data and (4) SPARQL queries and inference. The final results are emerging as a semantic layer comprising domain-specific controlled vocabularies typed to community standard definitions, an ontology with the concepts and relationships needed to describe ocean data, a semantically-enabled faceted search, and inferencing services. We are exploring use of these technologies to improve the accuracy of the BCO-DMO data collection and to facilitate exchange of information with complementary ocean data repositories. Integrating a semantic layer into the BCO-DMO data system architecture improves data and information resource discovery, access and integration.
In-field Access to Geoscientific Metadata through GPS-enabled Mobile Phones
NASA Astrophysics Data System (ADS)
Hobona, Gobe; Jackson, Mike; Jordan, Colm; Butchart, Ben
2010-05-01
Fieldwork is an integral part of much geosciences research. But whilst geoscientists have physical or online access to data collections whilst in the laboratory or at base stations, equivalent in-field access is not standard or straightforward. The increasing availability of mobile internet and GPS-supported mobile phones, however, now provides the basis for addressing this issue. The SPACER project was commissioned by the Rapid Innovation initiative of the UK Joint Information Systems Committee (JISC) to explore the potential for GPS-enabled mobile phones to access geoscientific metadata collections. Metadata collections within the geosciences and the wider geospatial domain can be disseminated through web services based on the Catalogue Service for Web(CSW) standard of the Open Geospatial Consortium (OGC) - a global grouping of over 380 private, public and academic organisations aiming to improve interoperability between geospatial technologies. CSW offers an XML-over-HTTP interface for querying and retrieval of geospatial metadata. By default, the metadata returned by CSW is based on the ISO19115 standard and encoded in XML conformant to ISO19139. The SPACER project has created a prototype application that enables mobile phones to send queries to CSW containing user-defined keywords and coordinates acquired from GPS devices built-into the phones. The prototype has been developed using the free and open source Google Android platform. The mobile application offers views for listing titles, presenting multiple metadata elements and a Google Map with an overlay of bounding coordinates of datasets. The presentation will describe the architecture and approach applied in the development of the prototype.
How should the completeness and quality of curated nanomaterial data be evaluated?†
Marchese Robinson, Richard L.; Lynch, Iseult; Peijnenburg, Willie; Rumble, John; Klaessig, Fred; Marquardt, Clarissa; Rauscher, Hubert; Puzyn, Tomasz; Purian, Ronit; Åberg, Christoffer; Karcher, Sandra; Vriens, Hanne; Hoet, Peter; Hoover, Mark D.; Hendren, Christine Ogilvie; Harper, Stacey L.
2016-01-01
Nanotechnology is of increasing significance. Curation of nanomaterial data into electronic databases offers opportunities to better understand and predict nanomaterials’ behaviour. This supports innovation in, and regulation of, nanotechnology. It is commonly understood that curated data need to be sufficiently complete and of sufficient quality to serve their intended purpose. However, assessing data completeness and quality is non-trivial in general and is arguably especially difficult in the nanoscience area, given its highly multidisciplinary nature. The current article, part of the Nanomaterial Data Curation Initiative series, addresses how to assess the completeness and quality of (curated) nanomaterial data. In order to address this key challenge, a variety of related issues are discussed: the meaning and importance of data completeness and quality, existing approaches to their assessment and the key challenges associated with evaluating the completeness and quality of curated nanomaterial data. Considerations which are specific to the nanoscience area and lessons which can be learned from other relevant scientific disciplines are considered. Hence, the scope of this discussion ranges from physicochemical characterisation requirements for nanomaterials and interference of nanomaterials with nanotoxicology assays to broader issues such as minimum information checklists, toxicology data quality schemes and computational approaches that facilitate evaluation of the completeness and quality of (curated) data. This discussion is informed by a literature review and a survey of key nanomaterial data curation stakeholders. Finally, drawing upon this discussion, recommendations are presented concerning the central question: how should the completeness and quality of curated nanomaterial data be evaluated? PMID:27143028
Public health and primary care: struggling to "win friends and influence people".
Mayes, Rick; McKenna, Sean
2011-01-01
Why are the goals of public health and primary care less politically popular and financially supported than those of curative medicine? A major part of the answer to this question lies in the fact that humans often worry wrongly by assessing risk poorly. This reality is a significant obstacle to the adequate promotion of and investment in public health, primary care, and prevention. Also, public health's tendency to infringe on personal privacy-as well as to call for difficult behavioral change-often sparks intense controversy and interest group opposition that discourage broader political support. Finally, in contrast to curative medicine, both the cost-benefit structure of public health (costs now, benefits later) and the way in which the profession operates make it largely invisible to and, thus, underappreciated by the general public. When curative medicine works well, most everybody notices. When public health and primary care work well, virtually nobody notices.
Classification of Chemical Compounds to Support Complex Queries in a Pathway Database
Weidemann, Andreas; Kania, Renate; Peiss, Christian; Rojas, Isabel
2004-01-01
Data quality in biological databases has become a topic of great discussion. To provide high quality data and to deal with the vast amount of biochemical data, annotators and curators need to be supported by software that carries out part of their work in an (semi-) automatic manner. The detection of errors and inconsistencies is a part that requires the knowledge of domain experts, thus in most cases it is done manually, making it very expensive and time-consuming. This paper presents two tools to partially support the curation of data on biochemical pathways. The tool enables the automatic classification of chemical compounds based on their respective SMILES strings. Such classification allows the querying and visualization of biochemical reactions at different levels of abstraction, according to the level of detail at which the reaction participants are described. Chemical compounds can be classified in a flexible manner based on different criteria. The support of the process of data curation is provided by facilitating the detection of compounds that are identified as different but that are actually the same. This is also used to identify similar reactions and, in turn, pathways. PMID:18629066
With the increasing need to leverage data and models to perform cutting edge analyses within the environmental science community, collection and organization of that data into a readily accessible format for consumption is a pressing need. The EPA CompTox chemical dashboard is i...
Plant Reactome: a resource for plant pathways and comparative analysis
Naithani, Sushma; Preece, Justin; D'Eustachio, Peter; Gupta, Parul; Amarasinghe, Vindhya; Dharmawardhana, Palitha D.; Wu, Guanming; Fabregat, Antonio; Elser, Justin L.; Weiser, Joel; Keays, Maria; Fuentes, Alfonso Munoz-Pomer; Petryszak, Robert; Stein, Lincoln D.; Ware, Doreen; Jaiswal, Pankaj
2017-01-01
Plant Reactome (http://plantreactome.gramene.org/) is a free, open-source, curated plant pathway database portal, provided as part of the Gramene project. The database provides intuitive bioinformatics tools for the visualization, analysis and interpretation of pathway knowledge to support genome annotation, genome analysis, modeling, systems biology, basic research and education. Plant Reactome employs the structural framework of a plant cell to show metabolic, transport, genetic, developmental and signaling pathways. We manually curate molecular details of pathways in these domains for reference species Oryza sativa (rice) supported by published literature and annotation of well-characterized genes. Two hundred twenty-two rice pathways, 1025 reactions associated with 1173 proteins, 907 small molecules and 256 literature references have been curated to date. These reference annotations were used to project pathways for 62 model, crop and evolutionarily significant plant species based on gene homology. Database users can search and browse various components of the database, visualize curated baseline expression of pathway-associated genes provided by the Expression Atlas and upload and analyze their Omics datasets. The database also offers data access via Application Programming Interfaces (APIs) and in various standardized pathway formats, such as SBML and BioPAX. PMID:27799469
Valdez, Joshua; Rueschman, Michael; Kim, Matthew; Redline, Susan; Sahoo, Satya S
2016-10-01
Extraction of structured information from biomedical literature is a complex and challenging problem due to the complexity of biomedical domain and lack of appropriate natural language processing (NLP) techniques. High quality domain ontologies model both data and metadata information at a fine level of granularity, which can be effectively used to accurately extract structured information from biomedical text. Extraction of provenance metadata, which describes the history or source of information, from published articles is an important task to support scientific reproducibility. Reproducibility of results reported by previous research studies is a foundational component of scientific advancement. This is highlighted by the recent initiative by the US National Institutes of Health called "Principles of Rigor and Reproducibility". In this paper, we describe an effective approach to extract provenance metadata from published biomedical research literature using an ontology-enabled NLP platform as part of the Provenance for Clinical and Healthcare Research (ProvCaRe). The ProvCaRe-NLP tool extends the clinical Text Analysis and Knowledge Extraction System (cTAKES) platform using both provenance and biomedical domain ontologies. We demonstrate the effectiveness of ProvCaRe-NLP tool using a corpus of 20 peer-reviewed publications. The results of our evaluation demonstrate that the ProvCaRe-NLP tool has significantly higher recall in extracting provenance metadata as compared to existing NLP pipelines such as MetaMap.
A metadata schema for data objects in clinical research.
Canham, Steve; Ohmann, Christian
2016-11-24
A large number of stakeholders have accepted the need for greater transparency in clinical research and, in the context of various initiatives and systems, have developed a diverse and expanding number of repositories for storing the data and documents created by clinical studies (collectively known as data objects). To make the best use of such resources, we assert that it is also necessary for stakeholders to agree and deploy a simple, consistent metadata scheme. The relevant data objects and their likely storage are described, and the requirements for metadata to support data sharing in clinical research are identified. Issues concerning persistent identifiers, for both studies and data objects, are explored. A scheme is proposed that is based on the DataCite standard, with extensions to cover the needs of clinical researchers, specifically to provide (a) study identification data, including links to clinical trial registries; (b) data object characteristics and identifiers; and (c) data covering location, ownership and access to the data object. The components of the metadata scheme are described. The metadata schema is proposed as a natural extension of a widely agreed standard to fill a gap not tackled by other standards related to clinical research (e.g., Clinical Data Interchange Standards Consortium, Biomedical Research Integrated Domain Group). The proposal could be integrated with, but is not dependent on, other moves to better structure data in clinical research.
UAV field demonstration of social media enabled tactical data link
NASA Astrophysics Data System (ADS)
Olson, Christopher C.; Xu, Da; Martin, Sean R.; Castelli, Jonathan C.; Newman, Andrew J.
2015-05-01
This paper addresses the problem of enabling Command and Control (C2) and data exfiltration functions for missions using small, unmanned, airborne surveillance and reconnaissance platforms. The authors demonstrated the feasibility of using existing commercial wireless networks as the data transmission infrastructure to support Unmanned Aerial Vehicle (UAV) autonomy functions such as transmission of commands, imagery, metadata, and multi-vehicle coordination messages. The authors developed and integrated a C2 Android application for ground users with a common smart phone, a C2 and data exfiltration Android application deployed on-board the UAVs, and a web server with database to disseminate the collected data to distributed users using standard web browsers. The authors performed a mission-relevant field test and demonstration in which operators commanded a UAV from an Android device to search and loiter; and remote users viewed imagery, video, and metadata via web server to identify and track a vehicle on the ground. Social media served as the tactical data link for all command messages, images, videos, and metadata during the field demonstration. Imagery, video, and metadata were transmitted from the UAV to the web server via multiple Twitter, Flickr, Facebook, YouTube, and similar media accounts. The web server reassembled images and video with corresponding metadata for distributed users. The UAV autopilot communicated with the on-board Android device via on-board Bluetooth network.
NASA Astrophysics Data System (ADS)
Christianson, D. S.; Varadharajan, C.; Detto, M.; Faybishenko, B.; Gimenez, B.; Jardine, K.; Negron Juarez, R. I.; Pastorello, G.; Powell, T.; Warren, J.; Wolfe, B.; McDowell, N. G.; Kueppers, L. M.; Chambers, J.; Agarwal, D.
2016-12-01
The U.S. Department of Energy's (DOE) Next Generation Ecosystem Experiment (NGEE) Tropics project aims to develop a process-rich tropical forest ecosystem model that is parameterized and benchmarked by field observations. Thus, data synthesis, quality assurance and quality control (QA/QC), and data product generation of a diverse and complex set of ecohydrological observations, including sapflux, leaf surface temperature, soil water content, and leaf gas exchange from sites across the Tropics, are required to support model simulations. We have developed a metadata reporting framework, implemented in conjunction with the NGEE Tropics Data Archive tool, to enable cross-site and cross-method comparison, data interpretability, and QA/QC. We employed a modified User-Centered Design approach, which involved short development cycles based on user-identified needs, and iterative testing with data providers and users. The metadata reporting framework currently has been implemented for sensor-based observations and leverages several existing metadata protocols. The framework consists of templates that define a multi-scale measurement position hierarchy, descriptions of measurement settings, and details about data collection and data file organization. The framework also enables data providers to define data-access permission settings, provenance, and referencing to enable appropriate data usage, citation, and attribution. In addition to describing the metadata reporting framework, we discuss tradeoffs and impressions from both data providers and users during the development process, focusing on the scalability, usability, and efficiency of the framework.
Text mining and expert curation to develop a database on psychiatric diseases and their genes
Gutiérrez-Sacristán, Alba; Bravo, Àlex; Portero-Tresserra, Marta; Valverde, Olga; Armario, Antonio; Blanco-Gandía, M.C.; Farré, Adriana; Fernández-Ibarrondo, Lierni; Fonseca, Francina; Giraldo, Jesús; Leis, Angela; Mané, Anna; Mayer, M.A.; Montagud-Romero, Sandra; Nadal, Roser; Ortiz, Jordi; Pavon, Francisco Javier; Perez, Ezequiel Jesús; Rodríguez-Arias, Marta; Serrano, Antonia; Torrens, Marta; Warnault, Vincent; Sanz, Ferran
2017-01-01
Abstract Psychiatric disorders constitute one of the main causes of disability worldwide. During the past years, considerable research has been conducted on the genetic architecture of such diseases, although little understanding of their etiology has been achieved. The difficulty to access up-to-date, relevant genotype-phenotype information has hampered the application of this wealth of knowledge to translational research and clinical practice in order to improve diagnosis and treatment of psychiatric patients. PsyGeNET (http://www.psygenet.org/) has been developed with the aim of supporting research on the genetic architecture of psychiatric diseases, by providing integrated and structured accessibility to their genotype–phenotype association data, together with analysis and visualization tools. In this article, we describe the protocol developed for the sustainable update of this knowledge resource. It includes the recruitment of a team of domain experts in order to perform the curation of the data extracted by text mining. Annotation guidelines and a web-based annotation tool were developed to support the curators’ tasks. A curation workflow was designed including a pilot phase and two rounds of curation and analysis phases. Negative evidence from the literature on gene–disease associations (GDAs) was taken into account in the curation process. We report the results of the application of this workflow to the curation of GDAs for PsyGeNET, including the analysis of the inter-annotator agreement and suggest this model as a suitable approach for the sustainable development and update of knowledge resources. Database URL: http://www.psygenet.org PsyGeNET corpus: http://www.psygenet.org/ds/PsyGeNET/results/psygenetCorpus.tar PMID:29220439
Multi-facetted Metadata - Describing datasets with different metadata schemas at the same time
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Klump, Jens; Bertelmann, Roland
2013-04-01
Inspired by the wish to re-use research data a lot of work is done to bring data systems of the earth sciences together. Discovery metadata is disseminated to data portals to allow building of customized indexes of catalogued dataset items. Data that were once acquired in the context of a scientific project are open for reappraisal and can now be used by scientists that were not part of the original research team. To make data re-use easier, measurement methods and measurement parameters must be documented in an application metadata schema and described in a written publication. Linking datasets to publications - as DataCite [1] does - requires again a specific metadata schema and every new use context of the measured data may require yet another metadata schema sharing only a subset of information with the meta information already present. To cope with the problem of metadata schema diversity in our common data repository at GFZ Potsdam we established a solution to store file-based research data and describe these with an arbitrary number of metadata schemas. Core component of the data repository is an eSciDoc infrastructure that provides versioned container objects, called eSciDoc [2] "items". The eSciDoc content model allows assigning files to "items" and adding any number of metadata records to these "items". The eSciDoc items can be submitted, revised, and finally published, which makes the data and metadata available through the internet worldwide. GFZ Potsdam uses eSciDoc to support its scientific publishing workflow, including mechanisms for data review in peer review processes by providing temporary web links for external reviewers that do not have credentials to access the data. Based on the eSciDoc API, panMetaDocs [3] provides a web portal for data management in research projects. PanMetaDocs, which is based on panMetaWorks [4], is a PHP based web application that allows to describe data with any XML-based schema. It uses the eSciDoc infrastructures REST-interface to store versioned dataset files and metadata in a XML-format. The software is able to administrate more than one eSciDoc metadata record per item and thus allows the description of a dataset according to its context. The metadata fields can be filled with static or dynamic content to reduce the number of fields that require manual entries to a minimum and, at the same time, make use of contextual information available in a project setting. Access rights can be adjusted to set visibility of datasets to the required degree of openness. Metadata from separate instances of panMetaDocs can be syndicated to portals through RSS and OAI-PMH interfaces. The application architecture presented here allows storing file-based datasets and describe these datasets with any number of metadata schemas, depending on the intended use case. Data and metadata are stored in the same entity (eSciDoc items) and are managed by a software tool through the eSciDoc REST interface - in this case the application is panMetaDocs. Other software may re-use the produced items and modify the appropriate metadata records by accessing the web API of the eSciDoc data infrastructure. For presentation of the datasets in a web browser we are not bound to panMetaDocs. This is done by stylesheet transformation of the eSciDoc-item. [1] http://www.datacite.org [2] http://www.escidoc.org , eSciDoc, FIZ Karlruhe, Germany [3] http://panmetadocs.sf.net , panMetaDocs, GFZ Potsdam, Germany [4] http://metaworks.pangaea.de , panMetaWorks, Dr. R. Huber, MARUM, Univ. Bremen, Germany
Log-less metadata management on metadata server for parallel file systems.
Liao, Jianwei; Xiao, Guoqiang; Peng, Xiaoning
2014-01-01
This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage for achieving highly available metadata service, as well as better performance improvement in metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than that brought by the metadata server, while it adopts logging or journaling to yield highly available metadata service. The experimental results show that this newly proposed mechanism can significantly improve the speed of metadata processing and render a better I/O data throughput, in contrast to conventional metadata management schemes, that is, logging or journaling on MDS. Besides, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients, when the metadata server has crashed or gone into nonoperational state exceptionally.
Log-Less Metadata Management on Metadata Server for Parallel File Systems
Xiao, Guoqiang; Peng, Xiaoning
2014-01-01
This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage for achieving highly available metadata service, as well as better performance improvement in metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than that brought by the metadata server, while it adopts logging or journaling to yield highly available metadata service. The experimental results show that this newly proposed mechanism can significantly improve the speed of metadata processing and render a better I/O data throughput, in contrast to conventional metadata management schemes, that is, logging or journaling on MDS. Besides, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients, when the metadata server has crashed or gone into nonoperational state exceptionally. PMID:24892093
2011-01-01
Background Improvements in the techniques for metabolomics analyses and growing interest in metabolomic approaches are resulting in the generation of increasing numbers of metabolomic profiles. Platforms are required for profile management, as a function of experimental design, and for metabolite identification, to facilitate the mining of the corresponding data. Various databases have been created, including organism-specific knowledgebases and analytical technique-specific spectral databases. However, there is currently no platform meeting the requirements for both profile management and metabolite identification for nuclear magnetic resonance (NMR) experiments. Description MeRy-B, the first platform for plant 1H-NMR metabolomic profiles, is designed (i) to provide a knowledgebase of curated plant profiles and metabolites obtained by NMR, together with the corresponding experimental and analytical metadata, (ii) for queries and visualization of the data, (iii) to discriminate between profiles with spectrum visualization tools and statistical analysis, (iv) to facilitate compound identification. It contains lists of plant metabolites and unknown compounds, with information about experimental conditions, the factors studied and metabolite concentrations for several plant species, compiled from more than one thousand annotated NMR profiles for various organs or tissues. Conclusion MeRy-B manages all the data generated by NMR-based plant metabolomics experiments, from description of the biological source to identification of the metabolites and determinations of their concentrations. It is the first database allowing the display and overlay of NMR metabolomic profiles selected through queries on data or metadata. MeRy-B is available from http://www.cbib.u-bordeaux2.fr/MERYB/index.php. PMID:21668943
New Features of the re3data Registry of Research Data Repositories
NASA Astrophysics Data System (ADS)
Elger, K.; Pampel, H.; Vierkant, P.; Witt, M.
2016-12-01
re3data is a registry of research data repositories that lists over 1,600 repositories from around the world, making it the largest and most comprehensive online catalog of data repositories on the web. The registry offers researchers, funding agencies, libraries and publishers a comprehensive overview of the heterogeneous landscape of data repositories. The repositories are described, following the "Metadata Schema for the Description of Research Data Repositories". re3data summarises the properties of a repository into a user-friendly icon system helping users to easily identify an adequate repository for the storage of their datasets. The re3data entries are curated by an international, multi-disciplinary editorial board. An application programming interface (API) enables other information systems to list and fetch metadata for integration and interoperability. Funders like the European Commission (2015) and publishers like Springer Nature (2016) recommend the use of re3data.org in their policies. The original re3data project partners are the GFZ German Research Centre for Geosciences, the Humboldt-Universität zu Berlin, the Purdue University Libraries and the Karlsruhe Institute of Technology (KIT). Since 2015 re3data is operated as a service of DataCite, a global non-profit organisation that provides persistent identifiers (DOIs) for research data. At the 2016 AGU Fall Meeting we will describe the current status of re3data. An overview of the major developments and new features will be given. Furthermore, we will present our plans to increase the quality of the re3data entries.
Next-Generation Search Engines for Information Retrieval
DOE Office of Scientific and Technical Information (OSTI.GOV)
Devarakonda, Ranjeet; Hook, Leslie A; Palanisamy, Giri
In the recent years, there have been significant advancements in the areas of scientific data management and retrieval techniques, particularly in terms of standards and protocols for archiving data and metadata. Scientific data is rich, and spread across different places. In order to integrate these pieces together, a data archive and associated metadata should be generated. Data should be stored in a format that can be retrievable and more importantly it should be in a format that will continue to be accessible as technology changes, such as XML. While general-purpose search engines (such as Google or Bing) are useful formore » finding many things on the Internet, they are often of limited usefulness for locating Earth Science data relevant (for example) to a specific spatiotemporal extent. By contrast, tools that search repositories of structured metadata can locate relevant datasets with fairly high precision, but the search is limited to that particular repository. Federated searches (such as Z39.50) have been used, but can be slow and the comprehensiveness can be limited by downtime in any search partner. An alternative approach to improve comprehensiveness is for a repository to harvest metadata from other repositories, possibly with limits based on subject matter or access permissions. Searches through harvested metadata can be extremely responsive, and the search tool can be customized with semantic augmentation appropriate to the community of practice being served. One such system, Mercury, a metadata harvesting, data discovery, and access system, built for researchers to search to, share and obtain spatiotemporal data used across a range of climate and ecological sciences. Mercury is open-source toolset, backend built on Java and search capability is supported by the some popular open source search libraries such as SOLR and LUCENE. Mercury harvests the structured metadata and key data from several data providing servers around the world and builds a centralized index. The harvested files are indexed against SOLR search API consistently, so that it can render search capabilities such as simple, fielded, spatial and temporal searches across a span of projects ranging from land, atmosphere, and ocean ecology. Mercury also provides data sharing capabilities using Open Archive Initiatives Protocol for Metadata Handling (OAI-PMH). In this paper we will discuss about the best practices for archiving data and metadata, new searching techniques, efficient ways of data retrieval and information display.« less
Özdemir, Vural; Kolker, Eugene; Hotez, Peter J; Mohin, Sophie; Prainsack, Barbara; Wynne, Brian; Vayena, Effy; Coşkun, Yavuz; Dereli, Türkay; Huzair, Farah; Borda-Rodriguez, Alexander; Bragazzi, Nicola Luigi; Faris, Jack; Ramesar, Raj; Wonkam, Ambroise; Dandara, Collet; Nair, Bipin; Llerena, Adrián; Kılıç, Koray; Jain, Rekha; Reddy, Panga Jaipal; Gollapalli, Kishore; Srivastava, Sanjeeva; Kickbusch, Ilona
2014-01-01
Metadata refer to descriptions about data or as some put it, "data about data." Metadata capture what happens on the backstage of science, on the trajectory from study conception, design, funding, implementation, and analysis to reporting. Definitions of metadata vary, but they can include the context information surrounding the practice of science, or data generated as one uses a technology, including transactional information about the user. As the pursuit of knowledge broadens in the 21(st) century from traditional "science of whats" (data) to include "science of hows" (metadata), we analyze the ways in which metadata serve as a catalyst for responsible and open innovation, and by extension, science diplomacy. In 2015, the United Nations Millennium Development Goals (MDGs) will formally come to an end. Therefore, we propose that metadata, as an ingredient of responsible innovation, can help achieve the Sustainable Development Goals (SDGs) on the post-2015 agenda. Such responsible innovation, as a collective learning process, has become a key component, for example, of the European Union's 80 billion Euro Horizon 2020 R&D Program from 2014-2020. Looking ahead, OMICS: A Journal of Integrative Biology, is launching an initiative for a multi-omics metadata checklist that is flexible yet comprehensive, and will enable more complete utilization of single and multi-omics data sets through data harmonization and greater visibility and accessibility. The generation of metadata that shed light on how omics research is carried out, by whom and under what circumstances, will create an "intervention space" for integration of science with its socio-technical context. This will go a long way to addressing responsible innovation for a fairer and more transparent society. If we believe in science, then such reflexive qualities and commitments attained by availability of omics metadata are preconditions for a robust and socially attuned science, which can then remain broadly respected, independent, and responsibly innovative. "In Sierra Leone, we have not too much electricity. The lights will come on once in a week, and the rest of the month, dark[ness]. So I made my own battery to power light in people's houses." Kelvin Doe (Global Minimum, 2012) MIT Visiting Young Innovator Cambridge, USA, and Sierra Leone "An important function of the (Global) R&D Observatory will be to provide support and training to build capacity in the collection and analysis of R&D flows, and how to link them to the product pipeline." World Health Organization (2013) Draft Working Paper on a Global Health R&D Observatory.
NASA Astrophysics Data System (ADS)
McWhirter, J.; Boler, F. M.; Bock, Y.; Jamason, P.; Squibb, M. B.; Noll, C. E.; Blewitt, G.; Kreemer, C. W.
2010-12-01
Three geodesy Archive Centers, Scripps Orbit and Permanent Array Center (SOPAC), NASA's Crustal Dynamics Data Information System (CDDIS) and UNAVCO are engaged in a joint effort to define and develop a common Web Service Application Programming Interface (API) for accessing geodetic data holdings. This effort is funded by the NASA ROSES ACCESS Program to modernize the original GPS Seamless Archive Centers (GSAC) technology which was developed in the 1990s. A new web service interface, the GSAC-WS, is being developed to provide uniform and expanded mechanisms through which users can access our data repositories. In total, our respective archives hold tens of millions of files and contain a rich collection of site/station metadata. Though we serve similar user communities, we currently provide a range of different access methods, query services and metadata formats. This leads to a lack of consistency in the userís experience and a duplication of engineering efforts. The GSAC-WS API and its reference implementation in an underlying Java-based GSAC Service Layer (GSL) supports metadata and data queries into site/station oriented data archives. The general nature of this API makes it applicable to a broad range of data systems. The overall goals of this project include providing consistent and rich query interfaces for end users and client programs, the development of enabling technology to facilitate third party repositories in developing these web service capabilities and to enable the ability to perform data queries across a collection of federated GSAC-WS enabled repositories. A fundamental challenge faced in this project is to provide a common suite of query services across a heterogeneous collection of data yet enabling each repository to expose their specific metadata holdings. To address this challenge we are developing a "capabilities" based service where a repository can describe its specific query and metadata capabilities. Furthermore, the architecture of the GSL is based on a model-view paradigm that decouples the underlying data model semantics from particular representations of the data model. This will allow for the GSAC-WS enabled repositories to evolve their service offerings to incorporate new metadata definition formats (e.g., ISO-19115, FGDC, JSON, etc.) and new techniques for accessing their holdings. Building on the core GSAC-WS implementations the project is also developing a federated/distributed query service. This service will seamlessly integrate with the GSAC Service Layer and will support data and metadata queries across a collection of federated GSAC repositories.
User's guide for mapIMG 3--Map image re-projection software package
Finn, Michael P.; Mattli, David M.
2012-01-01
Version 0.0 (1995), Dan Steinwand, U.S. Geological Survey (USGS)/Earth Resources Observation Systems (EROS) Data Center (EDC)--Version 0.0 was a command line version for UNIX that required four arguments: the input metadata, the output metadata, the input data file, and the output destination path. Version 1.0 (2003), Stephen Posch and Michael P. Finn, USGS/Mid-Continent Mapping Center (MCMC--Version 1.0 added a GUI interface that was built using the Qt library for cross platform development. Version 1.01 (2004), Jason Trent and Michael P. Finn, USGS/MCMC--Version 1.01 suggested bounds for the parameters of each projection. Support was added for larger input files, storage of the last used input and output folders, and for TIFF/ GeoTIFF input images. Version 2.0 (2005), Robert Buehler, Jason Trent, and Michael P. Finn, USGS/National Geospatial Technical Operations Center (NGTOC)--Version 2.0 added Resampling Methods (Mean, Mode, Min, Max, and Sum), updated the GUI design, and added the viewer/pre-viewer. The metadata style was changed to XML and was switched to a new naming convention. Version 3.0 (2009), David Mattli and Michael P. Finn, USGS/Center of Excellence for Geospatial Information Science (CEGIS)--Version 3.0 brings optimized resampling methods, an updated GUI, support for less than global datasets, UTM support and the whole codebase was ported to Qt4.
Curating NASA's Past, Present, and Future Astromaterial Sample Collections
NASA Technical Reports Server (NTRS)
Zeigler, R. A.; Allton, J. H.; Evans, C. A.; Fries, M. D.; McCubbin, F. M.; Nakamura-Messenger, K.; Righter, K.; Zolensky, M.; Stansbery, E. K.
2016-01-01
The Astromaterials Acquisition and Curation Office at NASA Johnson Space Center (hereafter JSC curation) is responsible for curating all of NASA's extraterrestrial samples. JSC presently curates 9 different astromaterials collections in seven different clean-room suites: (1) Apollo Samples (ISO (International Standards Organization) class 6 + 7); (2) Antarctic Meteorites (ISO 6 + 7); (3) Cosmic Dust Particles (ISO 5); (4) Microparticle Impact Collection (ISO 7; formerly called Space-Exposed Hardware); (5) Genesis Solar Wind Atoms (ISO 4); (6) Stardust Comet Particles (ISO 5); (7) Stardust Interstellar Particles (ISO 5); (8) Hayabusa Asteroid Particles (ISO 5); (9) OSIRIS-REx Spacecraft Coupons and Witness Plates (ISO 7). Additional cleanrooms are currently being planned to house samples from two new collections, Hayabusa 2 (2021) and OSIRIS-REx (2023). In addition to the labs that house the samples, we maintain a wide variety of infra-structure facilities required to support the clean rooms: HEPA-filtered air-handling systems, ultrapure dry gaseous nitrogen systems, an ultrapure water system, and cleaning facilities to provide clean tools and equipment for the labs. We also have sample preparation facilities for making thin sections, microtome sections, and even focused ion-beam sections. We routinely monitor the cleanliness of our clean rooms and infrastructure systems, including measurements of inorganic or organic contamination, weekly airborne particle counts, compositional and isotopic monitoring of liquid N2 deliveries, and daily UPW system monitoring. In addition to the physical maintenance of the samples, we track within our databases the current and ever changing characteristics (weight, location, etc.) of more than 250,000 individually numbered samples across our various collections, as well as more than 100,000 images, and countless "analog" records that record the sample processing records of each individual sample. JSC Curation is co-located with JSC's Astromaterials Research Office, which houses a world-class suite of analytical instrumentation and scientists. We leverage these labs and personnel to better curate the samples. Part of the cu-ration process is planning for the future, and we refer to these planning efforts as "advanced curation". Advanced Curation is tasked with developing procedures, technology, and data sets necessary for curating new types of collections as envi-sioned by NASA exploration goals. We are (and have been) planning for future cu-ration, including cold curation, extended curation of ices and volatiles, curation of samples with special chemical considerations such as perchlorate-rich samples, and curation of organically- and biologically-sensitive samples.
Foster, Joseph M; Moreno, Pablo; Fabregat, Antonio; Hermjakob, Henning; Steinbeck, Christoph; Apweiler, Rolf; Wakelam, Michael J O; Vizcaíno, Juan Antonio
2013-01-01
Protein sequence databases are the pillar upon which modern proteomics is supported, representing a stable reference space of predicted and validated proteins. One example of such resources is UniProt, enriched with both expertly curated and automatic annotations. Taken largely for granted, similar mature resources such as UniProt are not available yet in some other "omics" fields, lipidomics being one of them. While having a seasoned community of wet lab scientists, lipidomics lies significantly behind proteomics in the adoption of data standards and other core bioinformatics concepts. This work aims to reduce the gap by developing an equivalent resource to UniProt called 'LipidHome', providing theoretically generated lipid molecules and useful metadata. Using the 'FASTLipid' Java library, a database was populated with theoretical lipids, generated from a set of community agreed upon chemical bounds. In parallel, a web application was developed to present the information and provide computational access via a web service. Designed specifically to accommodate high throughput mass spectrometry based approaches, lipids are organised into a hierarchy that reflects the variety in the structural resolution of lipid identifications. Additionally, cross-references to other lipid related resources and papers that cite specific lipids were used to annotate lipid records. The web application encompasses a browser for viewing lipid records and a 'tools' section where an MS1 search engine is currently implemented. LipidHome can be accessed at http://www.ebi.ac.uk/apweiler-srv/lipidhome.
The Next Generation of HLA Image Products
NASA Astrophysics Data System (ADS)
Gaffney, N. I.; Casertano, S.; Ferguson, B.
2012-09-01
We present the re-engineered pipeline based on existing and improved algorithms with the aim of improving processing quality, cross-instrument portability, data flow management, and software maintenance. The Hubble Legacy Archive (HLA) is a project to add value to the Hubble Space Telescope data archive by producing and delivering science-ready drizzled data products and source lists derived from these products. Initially, ACS, NICMOS, and WFCP2 data were combined using instrument-specific pipelines based on scripts developed to process the ACS GOODS data and a separate set of scripts to generate source extractor and DAOPhot source lists. The new pipeline, initially designed for WFC3 data, isolates instrument-specific processing and is easily extendable to other instruments and to generating wide-area mosaics. Significant improvements have been made in image combination using improved alignment, source detection, and background equalization routines. It integrates improved alignment procedures, better noise model, and source list generation within a single code base. Wherever practical, PyRAF based routines have been replaced with non-IRAF based python libraries (e.g. NumPy and PyFITS). The data formats have been modified to handle better and more consistent propagation of information from individual exposures to the combined products. A new exposure layer stores the effective exposure time for each pixel in the sky which is key in properly interpreting combined images from diverse data that were not initially planned to be mosaiced. We worked to improve the validity of the metadata within our FITS headers for these products relative to standard IRAF/PyRAF processing. Any keywords that pertain to individual exposures have been removed from the primary and extension headers and placed in a table extension for more direct and efficient perusal. This mechanism also allows for more detailed information on the processing of individual images to be stored and propagated providing a more hierarchical metadata storage system than key value pair FITS headers provide. In this poster we will discuss the changes to the pipeline processing and source list generation and the lessons learned which may be applicable to other archive projects as well as discuss our new metadata curation and preservation process.
Data Publishing and Sharing Via the THREDDS Data Repository
NASA Astrophysics Data System (ADS)
Wilson, A.; Caron, J.; Davis, E.; Baltzer, T.
2007-12-01
The terms "Team Science" and "Networked Science" have been coined to describe a virtual organization of researchers tied via some intellectual challenge, but often located in different organizations and locations. A critical component to these endeavors is publishing and sharing of content, including scientific data. Imagine pointing your web browser to a web page that interactively lets you upload data and metadata to a repository residing on a remote server, which can then be accessed by others in a secure fasion via the web. While any content can be added to this repository, it is designed particularly for storing and sharing scientific data and metadata. Server support includes uploading of data files that can subsequently be subsetted, aggregrated, and served in NetCDF or other scientific data formats. Metadata can be associated with the data and interactively edited. The THREDDS Data Repository (TDR) is a server that provides client initiated, on demand, location transparent storage for data of any type that can then be served by the THREDDS Data Server (TDS). The TDR provides functionality to: * securely store and "own" data files and associated metadata * upload files via HTTP and gridftp * upload a collection of data as single file * modify and restructure repository contents * incorporate metadata provided by the user * generate additional metadata programmatically * edit individual metadata elements The TDR can exist separately from a TDS, serving content via HTTP. Also, it can work in conjunction with the TDS, which includes functionality to provide: * access to data in a variety of formats via -- OPeNDAP -- OGC Web Coverage Service (for gridded datasets) -- bulk HTTP file transfer * a NetCDF view of datasets in NetCDF, OPeNDAP, HDF-5, GRIB, and NEXRAD formats * serving of very large volume datasets, such as NEXRAD radar * aggregation into virtual datasets * subsetting via OPeNDAP and NetCDF Subsetting services This talk will discuss TDR/TDS capabilities as well as how users can install this software to create their own repositories.
The STP (Solar-Terrestrial Physics) Semantic Web based on the RSS1.0 and the RDF
NASA Astrophysics Data System (ADS)
Kubo, T.; Murata, K. T.; Kimura, E.; Ishikura, S.; Shinohara, I.; Kasaba, Y.; Watari, S.; Matsuoka, D.
2006-12-01
In the Solar-Terrestrial Physics (STP), it is pointed out that circulation and utilization of observation data among researchers are insufficient. To archive interdisciplinary researches, we need to overcome this circulation and utilization problems. Under such a background, authors' group has developed a world-wide database that manages meta-data of satellite and ground-based observation data files. It is noted that retrieving meta-data from the observation data and registering them to database have been carried out by hand so far. Our goal is to establish the STP Semantic Web. The Semantic Web provides a common framework that allows a variety of data shared and reused across applications, enterprises, and communities. We also expect that the secondary information related with observations, such as event information and associated news, are also shared over the networks. The most fundamental issue on the establishment is who generates, manages and provides meta-data in the Semantic Web. We developed an automatic meta-data collection system for the observation data using the RSS (RDF Site Summary) 1.0. The RSS1.0 is one of the XML-based markup languages based on the RDF (Resource Description Framework), which is designed for syndicating news and contents of news-like sites. The RSS1.0 is used to describe the STP meta-data, such as data file name, file server address and observation date. To describe the meta-data of the STP beyond RSS1.0 vocabulary, we defined original vocabularies for the STP resources using the RDF Schema. The RDF describes technical terms on the STP along with the Dublin Core Metadata Element Set, which is standard for cross-domain information resource descriptions. Researchers' information on the STP by FOAF, which is known as an RDF/XML vocabulary, creates a machine-readable metadata describing people. Using the RSS1.0 as a meta-data distribution method, the workflow from retrieving meta-data to registering them into the database is automated. This technique is applied for several database systems, such as the DARTS database system and NICT Space Weather Report Service. The DARTS is a science database managed by ISAS/JAXA in Japan. We succeeded in generating and collecting the meta-data automatically for the CDF (Common data Format) data, such as Reimei satellite data, provided by the DARTS. We also create an RDF service for space weather report and real-time global MHD simulation 3D data provided by the NICT. Our Semantic Web system works as follows: The RSS1.0 documents generated on the data sites (ISAS and NICT) are automatically collected by a meta-data collection agent. The RDF documents are registered and the agent extracts meta-data to store them in the Sesame, which is an open source RDF database with support for RDF Schema inferencing and querying. The RDF database provides advanced retrieval processing that has considered property and relation. Finally, the STP Semantic Web provides automatic processing or high level search for the data which are not only for observation data but for space weather news, physical events, technical terms and researches information related to the STP.
MyOcean Internal Information System (Dial-P)
NASA Astrophysics Data System (ADS)
Blanc, Frederique; Jolibois, Tony; Loubrieu, Thomas; Manzella, Giuseppe; Mazzetti, Paolo; Nativi, Stefano
2010-05-01
MyOcean is a three-year project (2008-2011) which goal is the development and pre-operational validation of the GMES Marine Core Service for ocean monitoring and forecasting. It's a transition project that will conduct the European "operational oceanography" community towards the operational phase of a GMES European service, which demands more European integration, more operationality, and more service. Observations, model-based data, and added-value products will be generated - and enhanced thanks to dedicated expertise - by the following production units: • Five Thematic Assembly Centers, each of them dealing with a specific set of observation data: Sea Level, Ocean colour, Sea Surface Temperature, Sea Ice & Wind, and In Situ data, • Seven Monitoring and Forecasting Centers to serve the Global Ocean, the Arctic area, the Baltic Sea, the Atlantic North-West shelves area, the Atlantic Iberian-Biscay-Ireland area, the Mediterranean Sea and the Black sea. Intermediate and final users will discover, view and get the products by means of a central web desk, a central re-active manned service desk and thematic experts distributed across Europe. The MyOcean Information System (MIS) is considering the various aspects of an interoperable - federated information system. Data models support data and computer systems by providing the definition and format of data. The possibility of including the information in the data file is depending on data model adopted. In general there is little effort in the actual project to develop a ‘generic' data model. A strong push to develop a common model is provided by the EU Directive INSPIRE. At present, there is no single de-facto data format for storing observational data. Data formats are still evolving, with their underlying data models moving towards the concept of Feature Types based on ISO/TC211 standards. For example, Unidata are developing the Common Data Model that can represent scientific data types such as point, trajectory, station, grid, etc., which will be implemented in netCDF format. SeaDataNet is recommending ODV and NetCDF formats. Another problem related to data curation and interoperability is the possibility to use common vocabularies. Common vocabularies are developed in many international initiatives, such as GEMET (promoted by INSPIRE as a multilingual thesaurus), UNIDATA, SeaDataNet, Marine Metadata Initiative (MMI). MIS is considering the SeaDataNet vocabulary as a base for interoperability. Four layers of different abstraction levels of interoperability an be defined: - Technical/basic: this layer is implemented at each TAC or MFC through internet connection and basic services for data transfer and browsing (e.g FTP, HTTP, etc). - Syntactic: allowing the interchange of metadata and protocol elements. This layer corresponds to a definition Core Metadata Set, the format of exchange/delivery for the data and associated metadata and possible software. This layer is implemented by the DIAL-P logical interface (e.g. adoption of INSPIRE compliant metadata set and common data formats). - Functional/pragmatic: based on a common set of functional primitives or on a common set of service definitions. This layer refers to the definition of services based on Web services standards. This layer is implemented by the DIAL-P logical interface (e.g. adoption of INSPIRE compliant network services). - Semantic: allowing to access similar classes of objects and services across multiple sites, with multilinguality of content as one specific aspect. This layer corresponds to MIS interface, terminology and thesaurus. Given the above requirements, the proposed solution is a federation of systems, where the individual participants are self-contained autonomous systems, but together form a consistent wider picture. A mid-tier integration layer mediates between existing systems, adapting their data and service model schema to the MIS. The developed MIS is a read-only system, i.e. does not allow updating (or inserting) data into the participant resource systems. The main advantages of the proposed approach are: • to enable information sources to join the MIS and publish their data and metadata in a secure way, without any modification to their existing resources and procedures and without any restriction to their autonomy; • to enable users to browse and query the MIS, receiving an aggregated result incorporating relevant data and metadata from across different sources; • to accommodate the growth of such a MIS, either in terms of its clients or of its information resources, as well as the evolution of the underlying data model.
Integrated Array/Metadata Analytics
NASA Astrophysics Data System (ADS)
Misev, Dimitar; Baumann, Peter
2015-04-01
Data comes in various forms and types, and integration usually presents a problem that is often simply ignored and solved with ad-hoc solutions. Multidimensional arrays are an ubiquitous data type, that we find at the core of virtually all science and engineering domains, as sensor, model, image, statistics data. Naturally, arrays are richly described by and intertwined with additional metadata (alphanumeric relational data, XML, JSON, etc). Database systems, however, a fundamental building block of what we call "Big Data", lack adequate support for modelling and expressing these array data/metadata relationships. Array analytics is hence quite primitive or non-existent at all in modern relational DBMS. Recognizing this, we extended SQL with a new SQL/MDA part seamlessly integrating multidimensional array analytics into the standard database query language. We demonstrate the benefits of SQL/MDA with real-world examples executed in ASQLDB, an open-source mediator system based on HSQLDB and rasdaman, that already implements SQL/MDA.
Lustre Distributed Name Space (DNE) Evaluation at the Oak Ridge Leadership Computing Facility (OLCF)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Simmons, James S.; Leverman, Dustin B.; Hanley, Jesse A.
This document describes the Lustre Distributed Name Space (DNE) evaluation carried at the Oak Ridge Leadership Computing Facility (OLCF) between 2014 and 2015. DNE is a development project funded by the OpenSFS, to improve Lustre metadata performance and scalability. The development effort has been split into two parts, the first part (DNE P1) providing support for remote directories over remote Lustre Metadata Server (MDS) nodes and Metadata Target (MDT) devices, while the second phase (DNE P2) addressed split directories over multiple remote MDS nodes and MDT devices. The OLCF have been actively evaluating the performance, reliability, and the functionality ofmore » both DNE phases. For these tests, internal OLCF testbed were used. Results are promising and OLCF is planning on a full DNE deployment by mid-2016 timeframe on production systems.« less
Lee, Taein; Cheng, Chun-Huai; Ficklin, Stephen; Yu, Jing; Humann, Jodi; Main, Dorrie
2017-01-01
Abstract Tripal is an open-source database platform primarily used for development of genomic, genetic and breeding databases. We report here on the release of the Chado Loader, Chado Data Display and Chado Search modules to extend the functionality of the core Tripal modules. These new extension modules provide additional tools for (1) data loading, (2) customized visualization and (3) advanced search functions for supported data types such as organism, marker, QTL/Mendelian Trait Loci, germplasm, map, project, phenotype, genotype and their respective metadata. The Chado Loader module provides data collection templates in Excel with defined metadata and data loaders with front end forms. The Chado Data Display module contains tools to visualize each data type and the metadata which can be used as is or customized as desired. The Chado Search module provides search and download functionality for the supported data types. Also included are the tools to visualize map and species summary. The use of materialized views in the Chado Search module enables better performance as well as flexibility of data modeling in Chado, allowing existing Tripal databases with different metadata types to utilize the module. These Tripal Extension modules are implemented in the Genome Database for Rosaceae (rosaceae.org), CottonGen (cottongen.org), Citrus Genome Database (citrusgenomedb.org), Genome Database for Vaccinium (vaccinium.org) and the Cool Season Food Legume Database (coolseasonfoodlegume.org). Database URL: https://www.citrusgenomedb.org/, https://www.coolseasonfoodlegume.org/, https://www.cottongen.org/, https://www.rosaceae.org/, https://www.vaccinium.org/
Database integration in a multimedia-modeling environment
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dorow, Kevin E.
2002-09-02
Integration of data from disparate remote sources has direct applicability to modeling, which can support Brownfield assessments. To accomplish this task, a data integration framework needs to be established. A key element in this framework is the metadata that creates the relationship between the pieces of information that are important in the multimedia modeling environment and the information that is stored in the remote data source. The design philosophy is to allow modelers and database owners to collaborate by defining this metadata in such a way that allows interaction between their components. The main parts of this framework include toolsmore » to facilitate metadata definition, database extraction plan creation, automated extraction plan execution / data retrieval, and a central clearing house for metadata and modeling / database resources. Cross-platform compatibility (using Java) and standard communications protocols (http / https) allow these parts to run in a wide variety of computing environments (Local Area Networks, Internet, etc.), and, therefore, this framework provides many benefits. Because of the specific data relationships described in the metadata, the amount of data that have to be transferred is kept to a minimum (only the data that fulfill a specific request are provided as opposed to transferring the complete contents of a data source). This allows for real-time data extraction from the actual source. Also, the framework sets up collaborative responsibilities such that the different types of participants have control over the areas in which they have domain knowledge-the modelers are responsible for defining the data relevant to their models, while the database owners are responsible for mapping the contents of the database using the metadata definitions. Finally, the data extraction mechanism allows for the ability to control access to the data and what data are made available.« less
Cultivating Data Expertise and Roles at a National Research Center
NASA Astrophysics Data System (ADS)
Thompson, C. A.
2015-12-01
As research becomes more computation and data-intensive, it brings new demands for staff that can manage complex data, design user services, and facilitate open access. Responding to these new demands, universities and research institutions are developing data services to support their scientists and scholarly communities. As more organizations extend their operations to research data, a better understanding of the staff roles and expertise required to support data-intensive research services is needed. What is data expertise - knowledge, skills, and roles? This study addresses this question through a case study of an exemplar research center, the National Center for Atmospheric Research (NCAR) in Boulder, CO. The NCAR case study results were supplemented and validated with a set of interviews of managers at additional geoscience data centers. To date, 11 interviews with NCAR staff and 19 interviews with managers at supplementary data centers have been completed. Selected preliminary results from the qualitative analysis will be reported in the poster: Data professionals have cultivated expertise in areas such as managing scientific data and products, understanding use and users, harnessing technology for data solutions, and standardizing metadata and data sets. Staff roles and responsibilities have evolved over the years to create new roles for data scientists, data managers/curators, data engineers, and senior managers of data teams, embedding data expertise into each NCAR lab. Explicit career paths and ladders for data professionals are limited but starting to emerge. NCAR has supported organization-wide efforts for data management, leveraging knowledge and best practices across all the labs and their staff. Based on preliminary results, NCAR provides a model for how organizations can build expertise and roles into their data service models. Data collection for this study is ongoing. The author anticipates that the results will help answer questions on what are the knowledge and skills required for data professionals and how organizations can develop data expertise.
Crowdsourcing and curation: perspectives from biology and natural language processing
Hirschman, Lynette; Fort, Karën; Boué, Stéphanie; ...
2016-08-08
Crowdsourcing is increasingly utilized for performing tasks in both natural language processing and biocuration. Although there have been many applications of crowdsourcing in these fields, there have been fewer high-level discussions of the methodology and its applicability to biocuration. This paper explores crowdsourcing for biocuration through several case studies that highlight different ways of leveraging ‘the crowd’; these raise issues about the kind(s) of expertise needed, the motivations of participants, and questions related to feasibility, cost and quality. The paper is an outgrowth of a panel session held at BioCreative V (Seville, September 9–11, 2015). The session consisted of fourmore » short talks, followed by a discussion. In their talks, the panelists explored the role of expertise and the potential to improve crowd performance by training; the challenge of decomposing tasks to make them amenable to crowdsourcing; and the capture of biological data and metadata through community editing.« less
Crowdsourcing and curation: perspectives from biology and natural language processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hirschman, Lynette; Fort, Karën; Boué, Stéphanie
Crowdsourcing is increasingly utilized for performing tasks in both natural language processing and biocuration. Although there have been many applications of crowdsourcing in these fields, there have been fewer high-level discussions of the methodology and its applicability to biocuration. This paper explores crowdsourcing for biocuration through several case studies that highlight different ways of leveraging ‘the crowd’; these raise issues about the kind(s) of expertise needed, the motivations of participants, and questions related to feasibility, cost and quality. The paper is an outgrowth of a panel session held at BioCreative V (Seville, September 9–11, 2015). The session consisted of fourmore » short talks, followed by a discussion. In their talks, the panelists explored the role of expertise and the potential to improve crowd performance by training; the challenge of decomposing tasks to make them amenable to crowdsourcing; and the capture of biological data and metadata through community editing.« less
MetaboLights: An Open-Access Database Repository for Metabolomics Data.
Kale, Namrata S; Haug, Kenneth; Conesa, Pablo; Jayseelan, Kalaivani; Moreno, Pablo; Rocca-Serra, Philippe; Nainala, Venkata Chandrasekhar; Spicer, Rachel A; Williams, Mark; Li, Xuefei; Salek, Reza M; Griffin, Julian L; Steinbeck, Christoph
2016-03-24
MetaboLights is the first general purpose, open-access database repository for cross-platform and cross-species metabolomics research at the European Bioinformatics Institute (EMBL-EBI). Based upon the open-source ISA framework, MetaboLights provides Metabolomics Standard Initiative (MSI) compliant metadata and raw experimental data associated with metabolomics experiments. Users can upload their study datasets into the MetaboLights Repository. These studies are then automatically assigned a stable and unique identifier (e.g., MTBLS1) that can be used for publication reference. The MetaboLights Reference Layer associates metabolites with metabolomics studies in the archive and is extensively annotated with data fields such as structural and chemical information, NMR and MS spectra, target species, metabolic pathways, and reactions. The database is manually curated with no specific release schedules. MetaboLights is also recommended by journals for metabolomics data deposition. This unit provides a guide to using MetaboLights, downloading experimental data, and depositing metabolomics datasets using user-friendly submission tools. Copyright © 2016 John Wiley & Sons, Inc.
Our journey to digital curation of the Jeghers Medical Index.
Gawdyda, Lori; Carter, Kimbroe; Willson, Mark; Bedford, Denise
2017-07-01
Harold Jeghers, a well-known medical educator of the twentieth century, maintained a print collection of about one million medical articles from the late 1800s to the 1990s. This case study discusses how a print collection of these articles was transformed to a digital database. Staff in the Jeghers Medical Index, St. Elizabeth Youngstown Hospital, converted paper articles to Adobe portable document format (PDF)/A-1a files. Optical character recognition was used to obtain searchable text. The data were then incorporated into a specialized database. Lastly, articles were matched to PubMed bibliographic metadata through automation and human review. An online database of the collection was ultimately created. The collection was made part of a discovery search service, and semantic technologies have been explored as a method of creating access points. This case study shows how a small medical library made medical writings of the nineteenth and twentieth centuries available in electronic format for historic or semantic research, highlighting the efficiencies of contemporary information technology.
Metadata for Web Resources: How Metadata Works on the Web.
ERIC Educational Resources Information Center
Dillon, Martin
This paper discusses bibliographic control of knowledge resources on the World Wide Web. The first section sets the context of the inquiry. The second section covers the following topics related to metadata: (1) definitions of metadata, including metadata as tags and as descriptors; (2) metadata on the Web, including general metadata systems,…
Metadata Dictionary Database: A Proposed Tool for Academic Library Metadata Management
ERIC Educational Resources Information Center
Southwick, Silvia B.; Lampert, Cory
2011-01-01
This article proposes a metadata dictionary (MDD) be used as a tool for metadata management. The MDD is a repository of critical data necessary for managing metadata to create "shareable" digital collections. An operational definition of metadata management is provided. The authors explore activities involved in metadata management in…
Harvesting NASA's Common Metadata Repository (CMR)
NASA Technical Reports Server (NTRS)
Shum, Dana; Durbin, Chris; Norton, James; Mitchell, Andrew
2017-01-01
As part of NASA's Earth Observing System Data and Information System (EOSDIS), the Common Metadata Repository (CMR) stores metadata for over 30,000 datasets from both NASA and international providers along with over 300M granules. This metadata enables sub-second discovery and facilitates data access. While the CMR offers a robust temporal, spatial and keyword search functionality to the general public and international community, it is sometimes more desirable for international partners to harvest the CMR metadata and merge the CMR metadata into a partner's existing metadata repository. This poster will focus on best practices to follow when harvesting CMR metadata to ensure that any changes made to the CMR can also be updated in a partner's own repository. Additionally, since each partner has distinct metadata formats they are able to consume, the best practices will also include guidance on retrieving the metadata in the desired metadata format using CMR's Unified Metadata Model translation software.
Harvesting NASA's Common Metadata Repository
NASA Astrophysics Data System (ADS)
Shum, D.; Mitchell, A. E.; Durbin, C.; Norton, J.
2017-12-01
As part of NASA's Earth Observing System Data and Information System (EOSDIS), the Common Metadata Repository (CMR) stores metadata for over 30,000 datasets from both NASA and international providers along with over 300M granules. This metadata enables sub-second discovery and facilitates data access. While the CMR offers a robust temporal, spatial and keyword search functionality to the general public and international community, it is sometimes more desirable for international partners to harvest the CMR metadata and merge the CMR metadata into a partner's existing metadata repository. This poster will focus on best practices to follow when harvesting CMR metadata to ensure that any changes made to the CMR can also be updated in a partner's own repository. Additionally, since each partner has distinct metadata formats they are able to consume, the best practices will also include guidance on retrieving the metadata in the desired metadata format using CMR's Unified Metadata Model translation software.
An Analysis of the Climate Data Initiative's Data Collection
NASA Astrophysics Data System (ADS)
Ramachandran, R.; Bugbee, K.
2015-12-01
The Climate Data Initiative (CDI) is a broad multi-agency effort of the U.S. government that seeks to leverage the extensive existing federal climate-relevant data to stimulate innovation and private-sector entrepreneurship to support national climate-change preparedness. The CDI project is a systematic effort to manually curate and share openly available climate data from various federal agencies. To date, the CDI has curated seven themes, or topics, relevant to climate change resiliency. These themes include Coastal Flooding, Food Resilience, Water, Ecosystem Vulnerability, Human Health, Energy Infrastructure, and Transportation. Each theme was curated by subject matter experts who selected datasets relevant to the topic at hand. An analysis of the entire Climate Data Initiative data collection and the data curated for each theme offers insights into which datasets are considered most relevant in addressing climate resiliency. Other aspects of the data collection will be examined including which datasets were the most visited or popular and which datasets were the most sought after for curation by the theme teams. Results from the analysis of the CDI collection will be presented in this talk.
NASA Technical Reports Server (NTRS)
Olsen, Lola M.
2006-01-01
The capabilities of the International Directory Network's (IDN) version MD9.5, along with a new version of the metadata authoring tool, "docBUILDER", will be presented during the Technology and Services Subgroup session of the Working Group on Information Systems and Services (WGISS). Feedback provided through the international community has proven instrumental in positively influencing the direction of the IDN s development. The international community was instrumental in encouraging support for using the IS0 international character set that is now available through the directory. Supporting metadata descriptions in additional languages encourages extended use of the IDN. Temporal and spatial attributes often prove pivotal in the search for data. Prior to the new software release, the IDN s geospatial and temporal searches suffered from browser incompatibilities and often resulted in unreliable performance for users attempting to initiate a spatial search using a map based on aging Java applet technology. The IDN now offers an integrated Google map and date search that replaces that technology. In addition, one of the most defining characteristics in the search for data relates to the temporal and spatial resolution of the data. The ability to refine the search for data sets meeting defined resolution requirements is now possible. Data set authors are encouraged to indicate the precise resolution values for their data sets and subsequently bin these into one of the pre-selected resolution ranges. New metadata authoring tools have been well received. In response to requests for a standalone metadata authoring tool, a new shareable software package called "docBUILDER solo" will soon be released to the public. This tool permits researchers to document their data during experiments and observational periods in the field. interoperability has been enhanced through the use of the Open Archives Initiative s (OAI) Protocol for Metadata Harvesting (PMH). Harvesting of XML content through OAI-MPH has been successfully tested with several organizations. The protocol appears to be a prime candidate for sharing metadata throughout the international community. Data services for visualizing and analyzing data have become valuable assets in facilitating the use of data. Data providers are offering many of their data-related services through the directory. The IDN plans to develop a service-based architecture to further promote the use of web services. During the IDN Task Team session, ideas for further enhancements will be discussed.
Legacy2Drupal: Conversion of an existing relational oceanographic database to a Drupal 7 CMS
NASA Astrophysics Data System (ADS)
Work, T. T.; Maffei, A. R.; Chandler, C. L.; Groman, R. C.
2011-12-01
Content Management Systems (CMSs) such as Drupal provide powerful features that can be of use to oceanographic (and other geo-science) data managers. However, in many instances, geo-science data management offices have already designed and implemented customized schemas for their metadata. The NSF funded Biological Chemical and Biological Data Management Office (BCO-DMO) has ported an existing relational database containing oceanographic metadata, along with an existing interface coded in Cold Fusion middleware, to a Drupal 7 Content Management System. This is an update on an effort described as a proof-of-concept in poster IN21B-1051, presented at AGU2009. The BCO-DMO project has translated all the existing database tables, input forms, website reports, and other features present in the existing system into Drupal CMS features. The replacement features are made possible by the use of Drupal content types, CCK node-reference fields, a custom theme, and a number of other supporting modules. This presentation describes the process used to migrate content in the original BCO-DMO metadata database to Drupal 7, some problems encountered during migration, and the modules used to migrate the content successfully. Strategic use of Drupal 7 CMS features that enable three separate but complementary interfaces to provide access to oceanographic research metadata will also be covered: 1) a Drupal 7-powered user front-end; 2) REST-ful JSON web services (providing a Mapserver interface to the metadata and data; and 3) a SPARQL interface to a semantic representation of the repository metadata (this feeding a new faceted search capability currently under development). The existing BCO-DMO ontology, developed in collaboration with Rensselaer Polytechnic Institute's Tetherless World Constellation, makes strategic use of pre-existing ontologies and will be used to drive semantically-enabled faceted search capabilities planned for the site. At this point, the use of semantic technologies included in the Drupal 7 core is anticipated. Using a public domain CMS as opposed to proprietary middleware, and taking advantage of the many features of Drupal 7 that are designed to support semantically-enabled interfaces will help prepare the BCO-DMO and other science data repositories for interoperability between systems that serve ecosystem research data.
Sea Level Station Metadata for Tsunami Detection, Warning and Research
NASA Astrophysics Data System (ADS)
Stroker, K. J.; Marra, J.; Kari, U. S.; Weinstein, S. A.; Kong, L.
2007-12-01
The devastating earthquake and tsunami of December 26, 2004 has greatly increased recognition of the need for water level data both from the coasts and the deep-ocean. In 2006, the National Oceanic and Atmospheric Administration (NOAA) completed a Tsunami Data Management Report describing the management of data required to minimize the impact of tsunamis in the United States. One of the major gaps defined in this report is the access to global coastal water level data. NOAA's National Geophysical Data Center (NGDC) and National Climatic Data Center (NCDC) are working cooperatively to bridge this gap. NOAA relies on a network of global data, acquired and processed in real-time to support tsunami detection and warning, as well as high-quality global databases of archived data to support research and advanced scientific modeling. In 2005, parties interested in enhancing the access and use of sea level station data united under the NOAA NCDC's Integrated Data and Environmental Applications (IDEA) Center's Pacific Region Integrated Data Enterprise (PRIDE) program to develop a distributed metadata system describing sea level stations (Kari et. al., 2006; Marra et.al., in press). This effort started with pilot activities in a regional framework and is targeted at tsunami detection and warning systems being developed by various agencies. It includes development of the components of a prototype sea level station metadata web service and accompanying Google Earth-based client application, which use an XML-based schema to expose, at a minimum, information in the NOAA National Weather Service (NWS) Pacific Tsunami Warning Center (PTWC) station database needed to use the PTWC's Tide Tool application. As identified in the Tsunami Data Management Report, the need also exists for long-term retention of the sea level station data. NOAA envisions that the retrospective water level data and metadata will also be available through web services, using an XML-based schema. Five high-priority metadata requirements identified at a water level workshop held at the XXIV IUGG Meeting in Perugia will be addressed: consistent, validated, and well defined numbers (e.g. amplitude); exact location of sea level stations; a complete record of sea level data stored in the archive; identifying high-priority sea level stations; and consistent definitions. NOAA's National Geophysical Data Center (NGDC) and co-located World Data Center for Solid Earth Geophysics (including tsunamis) would hold the archive of the sea level station data and distribute the standard metadata. Currently, NGDC is also archiving and distributing the DART buoy deep-ocean water level data and metadata in standards based formats. Kari, Uday S., John J. Marra, Stuart A. Weinstein, 2006 A Tsunami Focused Data Sharing Framework For Integration of Databases that Describe Water Level Station Specifications. AGU Fall Meeting, 2006. San Francisco, California. Marra, John, J., Uday S. Kari, and Stuart A. Weinstein (in press). A Tsunami Detection and Warning-focused Sea Level Station Metadata Web Service. IUGG XXIV, July 2-13, 2007. Perugia, Italy.
Federated ontology-based queries over cancer data
2012-01-01
Background Personalised medicine provides patients with treatments that are specific to their genetic profiles. It requires efficient data sharing of disparate data types across a variety of scientific disciplines, such as molecular biology, pathology, radiology and clinical practice. Personalised medicine aims to offer the safest and most effective therapeutic strategy based on the gene variations of each subject. In particular, this is valid in oncology, where knowledge about genetic mutations has already led to new therapies. Current molecular biology techniques (microarrays, proteomics, epigenetic technology and improved DNA sequencing technology) enable better characterisation of cancer tumours. The vast amounts of data, however, coupled with the use of different terms - or semantic heterogeneity - in each discipline makes the retrieval and integration of information difficult. Results Existing software infrastructures for data-sharing in the cancer domain, such as caGrid, support access to distributed information. caGrid follows a service-oriented model-driven architecture. Each data source in caGrid is associated with metadata at increasing levels of abstraction, including syntactic, structural, reference and domain metadata. The domain metadata consists of ontology-based annotations associated with the structural information of each data source. However, caGrid's current querying functionality is given at the structural metadata level, without capitalising on the ontology-based annotations. This paper presents the design of and theoretical foundations for distributed ontology-based queries over cancer research data. Concept-based queries are reformulated to the target query language, where join conditions between multiple data sources are found by exploiting the semantic annotations. The system has been implemented, as a proof of concept, over the caGrid infrastructure. The approach is applicable to other model-driven architectures. A graphical user interface has been developed, supporting ontology-based queries over caGrid data sources. An extensive evaluation of the query reformulation technique is included. Conclusions To support personalised medicine in oncology, it is crucial to retrieve and integrate molecular, pathology, radiology and clinical data in an efficient manner. The semantic heterogeneity of the data makes this a challenging task. Ontologies provide a formal framework to support querying and integration. This paper provides an ontology-based solution for querying distributed databases over service-oriented, model-driven infrastructures. PMID:22373043
The use of advanced web-based survey design in Delphi research.
Helms, Christopher; Gardner, Anne; McInnes, Elizabeth
2017-12-01
A discussion of the application of metadata, paradata and embedded data in web-based survey research, using two completed Delphi surveys as examples. Metadata, paradata and embedded data use in web-based Delphi surveys has not been described in the literature. The rapid evolution and widespread use of online survey methods imply that paper-based Delphi methods will likely become obsolete. Commercially available web-based survey tools offer a convenient and affordable means of conducting Delphi research. Researchers and ethics committees may be unaware of the benefits and risks of using metadata in web-based surveys. Discussion paper. Two web-based, three-round Delphi surveys were conducted sequentially between August 2014 - January 2015 and April - May 2016. Their aims were to validate the Australian nurse practitioner metaspecialties and their respective clinical practice standards. Our discussion paper is supported by researcher experience and data obtained from conducting both web-based Delphi surveys. Researchers and ethics committees should consider the benefits and risks of metadata use in web-based survey methods. Web-based Delphi research using paradata and embedded data may introduce efficiencies that improve individual participant survey experiences and reduce attrition across iterations. Use of embedded data allows the efficient conduct of multiple simultaneous Delphi surveys across a shorter timeframe than traditional survey methods. The use of metadata, paradata and embedded data appears to improve response rates, identify bias and give possible explanation for apparent outlier responses, providing an efficient method of conducting web-based Delphi surveys. © 2017 John Wiley & Sons Ltd.
docBUILDER - Building Your Useful Metadata for Earth Science Data and Services.
NASA Astrophysics Data System (ADS)
Weir, H. M.; Pollack, J.; Olsen, L. M.; Major, G. R.
2005-12-01
The docBUILDER tool, created by NASA's Global Change Master Directory (GCMD), assists the scientific community in efficiently creating quality data and services metadata. Metadata authors are asked to complete five required fields to ensure enough information is provided for users to discover the data and related services they seek. After the metadata record is submitted to the GCMD, it is reviewed for semantic and syntactic consistency. Currently, two versions are available - a Web-based tool accessible with most browsers (docBUILDERweb) and a stand-alone desktop application (docBUILDERsolo). The Web version is available through the GCMD website, at http://gcmd.nasa.gov/User/authoring.html. This version has been updated and now offers: personalized templates to ease entering similar information for multiple data sets/services; automatic population of Data Center/Service Provider URLs based on the selected center/provider; three-color support to indicate required, recommended, and optional fields; an editable text window containing the XML record, to allow for quick editing; and improved overall performance and presentation. The docBUILDERsolo version offers the ability to create metadata records on a computer wherever you are. Except for installation and the occasional update of keywords, data/service providers are not required to have an Internet connection. This freedom will allow users with portable computers (Windows, Mac, and Linux) to create records in field campaigns, whether in Antarctica or the Australian Outback. This version also offers a spell-checker, in addition to all of the features found in the Web version.
NASA Astrophysics Data System (ADS)
Kindermann, Stephan; Berger, Katharina; Toussaint, Frank
2014-05-01
The integration of well-established legacy data centers into newly developed data federation infrastructures is a key requirement to enhance climate data access based on widely agreed interfaces. We present the approach taken to integrate the ICSU World Data Center for Climate (WDCC) located in Hamburg, Germany into the European ENES climate data Federation which is part of the international ESGF data federation. The ENES / ESGF data federation hosts petabytes of climate model data and provides scalable data search and access services across the worldwide distributed data centers. Parts of the data provided by the ENES / ESGF data federation is also long term archived and curated at the WDCC data archive, allowing e.g. for DOI based data citation. An integration of the WDCC into the ENES / ESGF federation allows end users to search and access WDCC data using consistent interfaces worldwide. We will summarize the integration approach we have taken for WDCC legacy system and ESGF infrastructure integration. On the technical side we describe the provisioning of ESGF consistent metadata and data interfaces as well as the security infrastructure adoption. On the non-technical side we describe our experiences in integrating a long-term archival center with costly quality assurance procedures with an integrated distributed data federation putting emphasis on providing early and consistent data search and access services to scientists. The experiences were gained in the process of curating ESGF hosted CMIP5 data at the WDCC. Approximately one petabyte of CMIP5 data which was used for the IPCC climate report is being replicated and archived at the WDCC.
Powers, Christina M; Mills, Karmann A; Morris, Stephanie A; Klaessig, Fred; Gaheen, Sharon; Lewinski, Nastassja
2015-01-01
Summary There is a critical opportunity in the field of nanoscience to compare and integrate information across diverse fields of study through informatics (i.e., nanoinformatics). This paper is one in a series of articles on the data curation process in nanoinformatics (nanocuration). Other articles in this series discuss key aspects of nanocuration (temporal metadata, data completeness, database integration), while the focus of this article is on the nanocuration workflow, or the process of identifying, inputting, and reviewing nanomaterial data in a data repository. In particular, the article discusses: 1) the rationale and importance of a defined workflow in nanocuration, 2) the influence of organizational goals or purpose on the workflow, 3) established workflow practices in other fields, 4) current workflow practices in nanocuration, 5) key challenges for workflows in emerging fields like nanomaterials, 6) examples to make these challenges more tangible, and 7) recommendations to address the identified challenges. Throughout the article, there is an emphasis on illustrating key concepts and current practices in the field. Data on current practices in the field are from a group of stakeholders active in nanocuration. In general, the development of workflows for nanocuration is nascent, with few individuals formally trained in data curation or utilizing available nanocuration resources (e.g., ISA-TAB-Nano). Additional emphasis on the potential benefits of cultivating nanomaterial data via nanocuration processes (e.g., capability to analyze data from across research groups) and providing nanocuration resources (e.g., training) will likely prove crucial for the wider application of nanocuration workflows in the scientific community. PMID:26425437
The National Map seamless digital elevation model specifications
Archuleta, Christy-Ann M.; Constance, Eric W.; Arundel, Samantha T.; Lowe, Amanda J.; Mantey, Kimberly S.; Phillips, Lori A.
2017-08-02
This specification documents the requirements and standards used to produce the seamless elevation layers for The National Map of the United States. Seamless elevation data are available for the conterminous United States, Hawaii, Alaska, and the U.S. territories, in three different resolutions—1/3-arc-second, 1-arc-second, and 2-arc-second. These specifications include requirements and standards information about source data requirements, spatial reference system, distribution tiling schemes, horizontal resolution, vertical accuracy, digital elevation model surface treatment, georeferencing, data source and tile dates, distribution and supporting file formats, void areas, metadata, spatial metadata, and quality assurance and control.
Klevebro, Fredrik; Ekman, Simon; Nilsson, Magnus
2017-09-01
Multimodality treatment has now been widely introduced in the curatively intended treatment of esophageal and gastroesophageal junction cancer. We aim to give an overview of the scientific evidence for the available treatment strategies and to describe which trends that are currently developing. We conducted a review of the scientific evidence for the different curatively intended treatment strategies that are available today. Relevant articles of randomized controlled trials, cohort studies, and meta analyses were included. After a systematic search of relevant papers we have included 64 articles in the review. The results show that adenocarcinomas and squamous cell carcinomas of the esophagus and gastroesophageal junction are two separate entities and should be analysed and studied as two different diseases. Neoadjuvant treatment followed by surgical resection is the gold standard of the curatively intended treatment today. There is no scientific evidence to support the use of chemoradiotherapy over chemotherapy in the neoadjuvant setting for esophageal or junctional adenocarcinoma. There is reasonable evidence to support definitive chemoradiotherapy as a treatment option for squamous cell carcinoma of the esophagus. The evidence base for curatively intended treatments of esophageal and gastroesophageal junction cancer is not very strong. Several on-going trials have the potential to change the gold standard treatments of today. Copyright © 2017 Elsevier Ltd. All rights reserved.
Plant Reactome: a resource for plant pathways and comparative analysis.
Naithani, Sushma; Preece, Justin; D'Eustachio, Peter; Gupta, Parul; Amarasinghe, Vindhya; Dharmawardhana, Palitha D; Wu, Guanming; Fabregat, Antonio; Elser, Justin L; Weiser, Joel; Keays, Maria; Fuentes, Alfonso Munoz-Pomer; Petryszak, Robert; Stein, Lincoln D; Ware, Doreen; Jaiswal, Pankaj
2017-01-04
Plant Reactome (http://plantreactome.gramene.org/) is a free, open-source, curated plant pathway database portal, provided as part of the Gramene project. The database provides intuitive bioinformatics tools for the visualization, analysis and interpretation of pathway knowledge to support genome annotation, genome analysis, modeling, systems biology, basic research and education. Plant Reactome employs the structural framework of a plant cell to show metabolic, transport, genetic, developmental and signaling pathways. We manually curate molecular details of pathways in these domains for reference species Oryza sativa (rice) supported by published literature and annotation of well-characterized genes. Two hundred twenty-two rice pathways, 1025 reactions associated with 1173 proteins, 907 small molecules and 256 literature references have been curated to date. These reference annotations were used to project pathways for 62 model, crop and evolutionarily significant plant species based on gene homology. Database users can search and browse various components of the database, visualize curated baseline expression of pathway-associated genes provided by the Expression Atlas and upload and analyze their Omics datasets. The database also offers data access via Application Programming Interfaces (APIs) and in various standardized pathway formats, such as SBML and BioPAX. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Ning, Shangwei; Zhang, Jizhou; Wang, Peng; Zhi, Hui; Wang, Jianjian; Liu, Yue; Gao, Yue; Guo, Maoni; Yue, Ming; Wang, Lihua; Li, Xia
2016-01-01
Lnc2Cancer (http://www.bio-bigdata.net/lnc2cancer) is a manually curated database of cancer-associated long non-coding RNAs (lncRNAs) with experimental support that aims to provide a high-quality and integrated resource for exploring lncRNA deregulation in various human cancers. LncRNAs represent a large category of functional RNA molecules that play a significant role in human cancers. A curated collection and summary of deregulated lncRNAs in cancer is essential to thoroughly understand the mechanisms and functions of lncRNAs. Here, we developed the Lnc2Cancer database, which contains 1057 manually curated associations between 531 lncRNAs and 86 human cancers. Each association includes lncRNA and cancer name, the lncRNA expression pattern, experimental techniques, a brief functional description, the original reference and additional annotation information. Lnc2Cancer provides a user-friendly interface to conveniently browse, retrieve and download data. Lnc2Cancer also offers a submission page for researchers to submit newly validated lncRNA-cancer associations. With the rapidly increasing interest in lncRNAs, Lnc2Cancer will significantly improve our understanding of lncRNA deregulation in cancer and has the potential to be a timely and valuable resource. PMID:26481356
Enriched Video Semantic Metadata: Authorization, Integration, and Presentation.
ERIC Educational Resources Information Center
Mu, Xiangming; Marchionini, Gary
2003-01-01
Presents an enriched video metadata framework including video authorization using the Video Annotation and Summarization Tool (VAST)-a video metadata authorization system that integrates both semantic and visual metadata-- metadata integration, and user level applications. Results demonstrated that the enriched metadata were seamlessly…
McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr
2016-01-01
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of filesystem structure within Dataverse—which is essential for both in-place computation and supporting non-http data transfers. PMID:27862010
Mercury: Reusable software application for Metadata Management, Data Discovery and Access
NASA Astrophysics Data System (ADS)
Devarakonda, Ranjeet; Palanisamy, Giri; Green, James; Wilson, Bruce E.
2009-12-01
Mercury is a federated metadata harvesting, data discovery and access tool based on both open source packages and custom developed software. It was originally developed for NASA, and the Mercury development consortium now includes funding from NASA, USGS, and DOE. Mercury is itself a reusable toolset for metadata, with current use in 12 different projects. Mercury also supports the reuse of metadata by enabling searching across a range of metadata specification and standards including XML, Z39.50, FGDC, Dublin-Core, Darwin-Core, EML, and ISO-19115. Mercury provides a single portal to information contained in distributed data management systems. It collects metadata and key data from contributing project servers distributed around the world and builds a centralized index. The Mercury search interfaces then allow the users to perform simple, fielded, spatial and temporal searches across these metadata sources. One of the major goals of the recent redesign of Mercury was to improve the software reusability across the projects which currently fund the continuing development of Mercury. These projects span a range of land, atmosphere, and ocean ecological communities and have a number of common needs for metadata searches, but they also have a number of needs specific to one or a few projects To balance these common and project-specific needs, Mercury’s architecture includes three major reusable components; a harvester engine, an indexing system and a user interface component. The harvester engine is responsible for harvesting metadata records from various distributed servers around the USA and around the world. The harvester software was packaged in such a way that all the Mercury projects will use the same harvester scripts but each project will be driven by a set of configuration files. The harvested files are then passed to the Indexing system, where each of the fields in these structured metadata records are indexed properly, so that the query engine can perform simple, keyword, spatial and temporal searches across these metadata sources. The search user interface software has two API categories; a common core API which is used by all the Mercury user interfaces for querying the index and a customized API for project specific user interfaces. For our work in producing a reusable, portable, robust, feature-rich application, Mercury received a 2008 NASA Earth Science Data Systems Software Reuse Working Group Peer-Recognition Software Reuse Award. The new Mercury system is based on a Service Oriented Architecture and effectively reuses components for various services such as Thesaurus Service, Gazetteer Web Service and UDDI Directory Services. The software also provides various search services including: RSS, Geo-RSS, OpenSearch, Web Services and Portlets, integrated shopping cart to order datasets from various data centers (ORNL DAAC, NSIDC) and integrated visualization tools. Other features include: Filtering and dynamic sorting of search results, book-markable search results, save, retrieve, and modify search criteria.
Unified Science Information Model for SoilSCAPE using the Mercury Metadata Search System
NASA Astrophysics Data System (ADS)
Devarakonda, Ranjeet; Lu, Kefa; Palanisamy, Giri; Cook, Robert; Santhana Vannan, Suresh; Moghaddam, Mahta Clewley, Dan; Silva, Agnelo; Akbar, Ruzbeh
2013-12-01
SoilSCAPE (Soil moisture Sensing Controller And oPtimal Estimator) introduces a new concept for a smart wireless sensor web technology for optimal measurements of surface-to-depth profiles of soil moisture using in-situ sensors. The objective is to enable a guided and adaptive sampling strategy for the in-situ sensor network to meet the measurement validation objectives of spaceborne soil moisture sensors such as the Soil Moisture Active Passive (SMAP) mission. This work is being carried out at the University of Michigan, the Massachusetts Institute of Technology, University of Southern California, and Oak Ridge National Laboratory. At Oak Ridge National Laboratory we are using Mercury metadata search system [1] for building a Unified Information System for the SoilSCAPE project. This unified portal primarily comprises three key pieces: Distributed Search/Discovery; Data Collections and Integration; and Data Dissemination. Mercury, a Federally funded software for metadata harvesting, indexing, and searching would be used for this module. Soil moisture data sources identified as part of this activity such as SoilSCAPE and FLUXNET (in-situ sensors), AirMOSS (airborne retrieval), SMAP (spaceborne retrieval), and are being indexed and maintained by Mercury. Mercury would be the central repository of data sources for cal/val for soil moisture studies and would provide a mechanism to identify additional data sources. Relevant metadata from existing inventories such as ORNL DAAC, USGS Clearinghouse, ARM, NASA ECHO, GCMD etc. would be brought in to this soil-moisture data search/discovery module. The SoilSCAPE [2] metadata records will also be published in broader metadata repositories such as GCMD, data.gov. Mercury can be configured to provide a single portal to soil moisture information contained in disparate data management systems located anywhere on the Internet. Mercury is able to extract, metadata systematically from HTML pages or XML files using a variety of methods including OAI-PMH [3]. The Mercury search interface then allows users to perform simple, fielded, spatial and temporal searches across a central harmonized index of metadata. Mercury supports various metadata standards including FGDC, ISO-19115, DIF, Dublin-Core, Darwin-Core, and EML. This poster describes in detail how Mercury implements the Unified Science Information Model for Soil moisture data. References: [1]Devarakonda R., et al. Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics (2010), 3(1): 87-94. [2]Devarakonda R., et al. Daymet: Single Pixel Data Extraction Tool. http://daymet.ornl.gov/singlepixel.html (2012). Last Accesses 10-01-2013 [3]Devarakonda R., et al. Data sharing and retrieval using OAI-PMH. Earth Science Informatics (2011), 4(1): 1-5.
Curating Big Data Made Simple: Perspectives from Scientific Communities.
Sowe, Sulayman K; Zettsu, Koji
2014-03-01
The digital universe is exponentially producing an unprecedented volume of data that has brought benefits as well as fundamental challenges for enterprises and scientific communities alike. This trend is inherently exciting for the development and deployment of cloud platforms to support scientific communities curating big data. The excitement stems from the fact that scientists can now access and extract value from the big data corpus, establish relationships between bits and pieces of information from many types of data, and collaborate with a diverse community of researchers from various domains. However, despite these perceived benefits, to date, little attention is focused on the people or communities who are both beneficiaries and, at the same time, producers of big data. The technical challenges posed by big data are as big as understanding the dynamics of communities working with big data, whether scientific or otherwise. Furthermore, the big data era also means that big data platforms for data-intensive research must be designed in such a way that research scientists can easily search and find data for their research, upload and download datasets for onsite/offsite use, perform computations and analysis, share their findings and research experience, and seamlessly collaborate with their colleagues. In this article, we present the architecture and design of a cloud platform that meets some of these requirements, and a big data curation model that describes how a community of earth and environmental scientists is using the platform to curate data. Motivation for developing the platform, lessons learnt in overcoming some challenges associated with supporting scientists to curate big data, and future research directions are also presented.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fisher,D
Concerns about the long-term viability of SFS as the metadata store for HPSS have been increasing. A concern that Transarc may discontinue support for SFS motivates us to consider alternative means to store HPSS metadata. The obvious alternative is a commercial database. Commercial databases have the necessary characteristics for storage of HPSS metadata records. They are robust and scalable and can easily accommodate the volume of data that must be stored. They provide programming interfaces, transactional semantics and a full set of maintenance and performance enhancement tools. A team was organized within the HPSS project to study and recommend anmore » approach for the replacement of SFS. Members of the team are David Fisher, Jim Minton, Donna Mecozzi, Danny Cook, Bart Parliman and Lynn Jones. We examined several possible solutions to the problem of replacing SFS, and recommended on May 22, 2000, in a report to the HPSS Technical and Executive Committees, to change HPSS into a database application over either Oracle or DB2. We recommended either Oracle or DB2 on the basis of market share and technical suitability. Oracle and DB2 are dominant offerings in the market, and it is in the best interest of HPSS to use a major player's product. Both databases provide a suitable programming interface. Transaction management functions, support for multi-threaded clients and data manipulation languages (DML) are available. These findings were supported in meetings held with technical experts from both companies. In both cases, the evidence indicated that either database would provide the features needed to host HPSS.« less
Partnerships To Mine Unexploited Sources of Metadata.
ERIC Educational Resources Information Center
Reynolds, Regina Romano
This paper discusses the metadata created for other purposes as a potential source of bibliographic data. The first section addresses collecting metadata by means of templates, including the Nordic Metadata Project's Dublin Core Metadata Template. The second section considers potential partnerships for re-purposing metadata for bibliographic use,…
Enhanced Management of and Access to Hurricane Sandy Ocean and Coastal Mapping Data
NASA Astrophysics Data System (ADS)
Eakins, B.; Neufeld, D.; Varner, J. D.; McLean, S. J.
2014-12-01
NOAA's National Geophysical Data Center (NGDC) has significantly improved the discovery and delivery of its geophysical data holdings, initially targeting ocean and coastal mapping (OCM) data in the U.S. coastal region impacted by Hurricane Sandy in 2012. We have developed a browser-based, interactive interface that permits users to refine their initial map-driven data-type choices prior to bulk download (e.g., by selecting individual surveys), including the ability to choose ancillary files, such as reports or derived products. Initial OCM data types now available in a U.S. East Coast map viewer, as well as underlying web services, include: NOS hydrographic soundings and multibeam sonar bathymetry. Future releases will include trackline geophysics, airborne topographic and bathymetric-topographic lidar, bottom sample descriptions, and digital elevation models.This effort also includes working collaboratively with other NOAA offices and partners to develop automated methods to receive and verify data, stage data for archive, and notify data providers when ingest and archive are completed. We have also developed improved metadata tools to parse XML and auto-populate OCM data catalogs, support the web-based creation and editing of ISO-compliant metadata records, and register metadata in appropriate data portals. This effort supports a variety of NOAA mission requirements, from safe navigation to coastal flood forecasting and habitat characterization.
MOPED enables discoveries through consistently processed proteomics data
Higdon, Roger; Stewart, Elizabeth; Stanberry, Larissa; Haynes, Winston; Choiniere, John; Montague, Elizabeth; Anderson, Nathaniel; Yandl, Gregory; Janko, Imre; Broomall, William; Fishilevich, Simon; Lancet, Doron; Kolker, Natali; Kolker, Eugene
2014-01-01
The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org), is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration, as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project’s efforts to generate chromosome and diseases specific proteomes by providing links from proteins to chromosome and disease information, as well as many complementary resources. MOPED supports a new omics metadata checklist in order to harmonize data integration, analysis and use. MOPED’s development is driven by the user community, which spans 90 countries guiding future development that will transform MOPED into a multi-omics resource. MOPED encourages users to submit data in a simple format. They can use the metadata a checklist generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries. PMID:24350770
NASA Astrophysics Data System (ADS)
Bargatze, L. F.
2015-12-01
Active Data Archive Product Tracking (ADAPT) is a collection of software routines that permits one to generate XML metadata files to describe and register data products in support of the NASA Heliophysics Virtual Observatory VxO effort. ADAPT is also a philosophy. The ADAPT concept is to use any and all available metadata associated with scientific data to produce XML metadata descriptions in a consistent, uniform, and organized fashion to provide blanket access to the full complement of data stored on a targeted data server. In this poster, we present an application of ADAPT to describe all of the data products that are stored by using the Common Data File (CDF) format served out by the CDAWEB and SPDF data servers hosted at the NASA Goddard Space Flight Center. These data servers are the primary repositories for NASA Heliophysics data. For this purpose, the ADAPT routines have been used to generate data resource descriptions by using an XML schema named Space Physics Archive, Search, and Extract (SPASE). SPASE is the designated standard for documenting Heliophysics data products, as adopted by the Heliophysics Data and Model Consortium. The set of SPASE XML resource descriptions produced by ADAPT includes high-level descriptions of numerical data products, display data products, or catalogs and also includes low-level "Granule" descriptions. A SPASE Granule is effectively a universal access metadata resource; a Granule associates an individual data file (e.g. a CDF file) with a "parent" high-level data resource description, assigns a resource identifier to the file, and lists the corresponding assess URL(s). The CDAWEB and SPDF file systems were queried to provide the input required by the ADAPT software to create an initial set of SPASE metadata resource descriptions. Then, the CDAWEB and SPDF data repositories were queried subsequently on a nightly basis and the CDF file lists were checked for any changes such as the occurrence of new, modified, or deleted files, or the addition of new or the deletion of old data products. Next, ADAPT routines analyzed the query results and issued updates to the metadata stored in the UCLA CDAWEB and SPDF metadata registries. In this way, the SPASE metadata registries generated by ADAPT can be relied on to provide up to date and complete access to Heliophysics CDF data resources on a daily basis.
NASA Astrophysics Data System (ADS)
Maffei, A. R.; Chandler, C. L.; Work, T.; Allen, J.; Groman, R. C.; Fox, P. A.
2009-12-01
Content Management Systems (CMSs) provide powerful features that can be of use to oceanographic (and other geo-science) data managers. However, in many instances, geo-science data management offices have previously designed customized schemas for their metadata. The WHOI Ocean Informatics initiative and the NSF funded Biological Chemical and Biological Data Management Office (BCO-DMO) have jointly sponsored a project to port an existing, relational database containing oceanographic metadata, along with an existing interface coded in Cold Fusion middleware, to a Drupal6 Content Management System. The goal was to translate all the existing database tables, input forms, website reports, and other features present in the existing system to employ Drupal CMS features. The replacement features include Drupal content types, CCK node-reference fields, themes, RDB, SPARQL, workflow, and a number of other supporting modules. Strategic use of some Drupal6 CMS features enables three separate but complementary interfaces that provide access to oceanographic research metadata via the MySQL database: 1) a Drupal6-powered front-end; 2) a standard SQL port (used to provide a Mapserver interface to the metadata and data; and 3) a SPARQL port (feeding a new faceted search capability being developed). Future plans include the creation of science ontologies, by scientist/technologist teams, that will drive semantically-enabled faceted search capabilities planned for the site. Incorporation of semantic technologies included in the future Drupal 7 core release is also anticipated. Using a public domain CMS as opposed to proprietary middleware, and taking advantage of the many features of Drupal 6 that are designed to support semantically-enabled interfaces will help prepare the BCO-DMO database for interoperability with other ecosystem databases.
DAS: A Data Management System for Instrument Tests and Operations
NASA Astrophysics Data System (ADS)
Frailis, M.; Sartor, S.; Zacchei, A.; Lodi, M.; Cirami, R.; Pasian, F.; Trifoglio, M.; Bulgarelli, A.; Gianotti, F.; Franceschi, E.; Nicastro, L.; Conforti, V.; Zoli, A.; Smart, R.; Morbidelli, R.; Dadina, M.
2014-05-01
The Data Access System (DAS) is a and data management software system, providing a reusable solution for the storage of data acquired both from telescopes and auxiliary data sources during the instrument development phases and operations. It is part of the Customizable Instrument WorkStation system (CIWS-FW), a framework for the storage, processing and quick-look at the data acquired from scientific instruments. The DAS provides a data access layer mainly targeted to software applications: quick-look displays, pre-processing pipelines and scientific workflows. It is logically organized in three main components: an intuitive and compact Data Definition Language (DAS DDL) in XML format, aimed for user-defined data types; an Application Programming Interface (DAS API), automatically adding classes and methods supporting the DDL data types, and providing an object-oriented query language; a data management component, which maps the metadata of the DDL data types in a relational Data Base Management System (DBMS), and stores the data in a shared (network) file system. With the DAS DDL, developers define the data model for a particular project, specifying for each data type the metadata attributes, the data format and layout (if applicable), and named references to related or aggregated data types. Together with the DDL user-defined data types, the DAS API acts as the only interface to store, query and retrieve the metadata and data in the DAS system, providing both an abstract interface and a data model specific one in C, C++ and Python. The mapping of metadata in the back-end database is automatic and supports several relational DBMSs, including MySQL, Oracle and PostgreSQL.
Argo: an integrative, interactive, text mining-based workbench supporting curation
Rak, Rafal; Rowley, Andrew; Black, William; Ananiadou, Sophia
2012-01-01
Curation of biomedical literature is often supported by the automatic analysis of textual content that generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variety of tasks, types of information and applications. Processing components usually come from different sources and often lack interoperability. The well established Unstructured Information Management Architecture is a framework that addresses interoperability by defining common data structures and interfaces. However, most of the efforts are targeted towards software developers and are not suitable for curators, or are otherwise inconvenient to use on a higher level of abstraction. To overcome these issues we introduce Argo, an interoperable, integrative, interactive and collaborative system for text analysis with a convenient graphic user interface to ease the development of processing workflows and boost productivity in labour-intensive manual curation. Robust, scalable text analytics follow a modular approach, adopting component modules for distinct levels of text analysis. The user interface is available entirely through a web browser that saves the user from going through often complicated and platform-dependent installation procedures. Argo comes with a predefined set of processing components commonly used in text analysis, while giving the users the ability to deposit their own components. The system accommodates various areas and levels of user expertise, from TM and computational linguistics to ontology-based curation. One of the key functionalities of Argo is its ability to seamlessly incorporate user-interactive components, such as manual annotation editors, into otherwise completely automatic pipelines. As a use case, we demonstrate the functionality of an in-built manual annotation editor that is well suited for in-text corpus annotation tasks. Database URL: http://www.nactem.ac.uk/Argo PMID:22434844
Go Digital! Making Physical Samples a Valued Part of the Online Record of Science
NASA Astrophysics Data System (ADS)
Klump, J. F.; Lehnert, K.
2016-12-01
Physical samples, at first glance, seem to be the opposite to the virtual world of the internet. Yet, as anything not natively digital, physical samples can have a digital representation that is accessible through the internet. Most museums and other institutions have many more objects in their collections than they could ever put on display and many samples exist outside of formal curation workflows. Nevertheless, these objects can be of importance to science, maybe because this particular fossil is a holotype that defines an extinct animal species, or it is a mineral sample that was used to derive a reference optical reflectance spectrum that is used in the interpretation of remote sensing data from satellites. As these examples show, the value of a scientific collection lies not only in its objects but also in how these objects are integrated into the record of science. Fundamental to this are, of course, catalogues of the samples held in a collection. Significant value can be added to a collection if its catalogue is web accessible, and even better if its catalogue can be harvested into disciplinary portals to aid the discovery of samples. Sample curation in the digital age, however, must go beyond simply labeling and cataloguing. In the same way that publications and datasets can now be identified and accessed over the web, steps are now being made to do the same for physical samples. Globally unique, resolvable identifiers of samples, datasets and literature can serve as nodes to link these resources together and in this way, then cross-link between scientific interpretation in the literature, data interpreted in these works, and samples from which these data were derived. These linkages must not only be recorded in the metadata but must also be machine actionable to allow integration of these digital assets into the ever growing body and richness of the scientific record. This presentation will discuss cyberinfrastructures for samples and sample curation through case studies that illustrate how the life cycle of a sample relates to other digital objects in literature and data, and how added value is generated through these linkages.
Evaluating and Evolving Metadata in Multiple Dialects
NASA Astrophysics Data System (ADS)
Kozimor, J.; Habermann, T.; Powers, L. A.; Gordon, S.
2016-12-01
Despite many long-term homogenization efforts, communities continue to develop focused metadata standards along with related recommendations and (typically) XML representations (aka dialects) for sharing metadata content. Different representations easily become obstacles to sharing information because each representation generally requires a set of tools and skills that are designed, built, and maintained specifically for that representation. In contrast, community recommendations are generally described, at least initially, at a more conceptual level and are more easily shared. For example, most communities agree that dataset titles should be included in metadata records although they write the titles in different ways. This situation has led to the development of metadata repositories that can ingest and output metadata in multiple dialects. As an operational example, the NASA Common Metadata Repository (CMR) includes three different metadata dialects (DIF, ECHO, and ISO 19115-2). These systems raise a new question for metadata providers: if I have a choice of metadata dialects, which should I use and how do I make that decision? We have developed a collection of metadata evaluation tools that can be used to evaluate metadata records in many dialects for completeness with respect to recommendations from many organizations and communities. We have applied these tools to over 8000 collection and granule metadata records in four different dialects. This large collection of identical content in multiple dialects enables us to address questions about metadata and dialect evolution and to answer those questions quantitatively. We will describe those tools and results from evaluating the NASA CMR metadata collection.
DataMed - an open source discovery index for finding biomedical datasets.
Chen, Xiaoling; Gururaj, Anupama E; Ozyurt, Burak; Liu, Ruiling; Soysal, Ergin; Cohen, Trevor; Tiryaki, Firat; Li, Yueling; Zong, Nansu; Jiang, Min; Rogith, Deevakar; Salimi, Mandana; Kim, Hyeon-Eui; Rocca-Serra, Philippe; Gonzalez-Beltran, Alejandra; Farcas, Claudiu; Johnson, Todd; Margolis, Ron; Alter, George; Sansone, Susanna-Assunta; Fore, Ian M; Ohno-Machado, Lucila; Grethe, Jeffrey S; Xu, Hua
2018-01-13
Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain. DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine. Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the number of relevant results in the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. Currently, we have made the DataMed system publically available as an open source package for the biomedical community. © The Author 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com
BrainMap VBM: An environment for structural meta-analysis.
Vanasse, Thomas J; Fox, P Mickle; Barron, Daniel S; Robertson, Michaela; Eickhoff, Simon B; Lancaster, Jack L; Fox, Peter T
2018-05-02
The BrainMap database is a community resource that curates peer-reviewed, coordinate-based human neuroimaging literature. By pairing the results of neuroimaging studies with their relevant meta-data, BrainMap facilitates coordinate-based meta-analysis (CBMA) of the neuroimaging literature en masse or at the level of experimental paradigm, clinical disease, or anatomic location. Initially dedicated to the functional, task-activation literature, BrainMap is now expanding to include voxel-based morphometry (VBM) studies in a separate sector, titled: BrainMap VBM. VBM is a whole-brain, voxel-wise method that measures significant structural differences between or within groups which are reported as standardized, peak x-y-z coordinates. Here we describe BrainMap VBM, including the meta-data structure, current data volume, and automated reverse inference functions (region-to-disease profile) of this new community resource. CBMA offers a robust methodology for retaining true-positive and excluding false-positive findings across studies in the VBM literature. As with BrainMap's functional database, BrainMap VBM may be synthesized en masse or at the level of clinical disease or anatomic location. As a use-case scenario for BrainMap VBM, we illustrate a trans-diagnostic data-mining procedure wherein we explore the underlying network structure of 2,002 experiments representing over 53,000 subjects through independent components analysis (ICA). To reduce data-redundancy effects inherent to any database, we demonstrate two data-filtering approaches that proved helpful to ICA. Finally, we apply hierarchical clustering analysis (HCA) to measure network- and disease-specificity. This procedure distinguished psychiatric from neurological diseases. We invite the neuroscientific community to further exploit BrainMap VBM with other modeling approaches. © 2018 Wiley Periodicals, Inc.
An Assessment of the Need for Standard Variable Names for Airborne Field Campaigns
NASA Astrophysics Data System (ADS)
Beach, A. L., III; Chen, G.; Northup, E. A.; Kusterer, J.; Quam, B. M.
2017-12-01
The NASA Earth Venture Program has led to a dramatic increase in airborne observations, requiring updated data management practices with clearly defined data standards and protocols for metadata. An airborne field campaign can involve multiple aircraft and a variety of instruments. It is quite common to have different instruments/techniques measure the same parameter on one or more aircraft platforms. This creates a need to allow instrument Principal Investigators (PIs) to name their variables in a way that would distinguish them across various data sets. A lack of standardization of variables names presents a challenge for data search tools in enabling discovery of similar data across airborne studies, aircraft platforms, and instruments. This was also identified by data users as one of the top issues in data use. One effective approach for mitigating this problem is to enforce variable name standardization, which can effectively map the unique PI variable names to fixed standard names. In order to ensure consistency amongst the standard names, it will be necessary to choose them from a controlled list. However, no such list currently exists despite a number of previous efforts to establish a sufficient list of atmospheric variable names. The Atmospheric Composition Variable Standard Name Working Group was established under the auspices of NASA's Earth Science Data Systems Working Group (ESDSWG) to solicit research community feedback to create a list of standard names that are acceptable to data providers and data users This presentation will discuss the challenges and recommendations of standard variable names in an effort to demonstrate how airborne metadata curation/management can be improved to streamline data ingest, improve interoperability, and discoverability to a broader user community.
The Dublin Core is a metadata element set intended to facilitate discovery of electronic resources. It was originally conceived for author-generated descriptions of Web resources, and the Dublin Core has attracted broad ranging international and interdisciplinary support. The cha...
An asynchronous traversal engine for graph-based rich metadata management
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dai, Dong; Carns, Philip; Ross, Robert B.
Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenancemore » data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains; but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format. We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.« less
An asynchronous traversal engine for graph-based rich metadata management
Dai, Dong; Carns, Philip; Ross, Robert B.; ...
2016-06-23
Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenancemore » data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains; but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format. We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.« less
Web Based Data Access to the World Data Center for Climate
NASA Astrophysics Data System (ADS)
Toussaint, F.; Lautenschlager, M.
2006-12-01
The World Data Center for Climate (WDC-Climate, www.wdc-climate.de) is hosted by the Model &Data Group (M&D) of the Max Planck Institute for Meteorology. The M&D department is financed by the German government and uses the computers and mass storage facilities of the German Climate Computing Centre (Deutsches Klimarechenzentrum, DKRZ). The WDC-Climate provides web access to 200 Terabytes of climate data; the total mass storage archive contains nearly 4 Petabytes. Although the majority of the datasets concern model output data, some satellite and observational data are accessible as well. The underlying relational database is distributed on five servers. The CERA relational data model is used to integrate catalogue data and mass data. The flexibility of the model allows to store and access very different types of data and metadata. The CERA metadata catalogue provides easy access to the content of the CERA database as well as to other data in the web. Visit ceramodel.wdc-climate.de for additional information on the CERA data model. The majority of the users access data via the CERA metadata catalogue, which is open without registration. However, prior to retrieving data user are required to check in and apply for a userid and password. The CERA metadata catalogue is servlet based. So it is accessible worldwide through any web browser at cera.wdc-climate.de. In addition to data and metadata access by the web catalogue, WDC-Climate offers a number of other forms of web based data access. All metadata are available via http request as xml files in various metadata formats (ISO, DC, etc., see wini.wdc-climate.de) which allows for easy data interchange with other catalogues. Model data can be retrieved in GRIB, ASCII, NetCDF, and binary (IEEE) format. WDC-Climate serves as data centre for various projects. Since xml files are accessible by http, the integration of data into applications of different projects is very easy. Projects supported by WDC-Climate are e.g. CEOP, IPCC, and CARIBIC. A script tool for data download (jblob) is offered on the web page, to make retrieval of huge data quantities more comfortable.
The Road to Independently Understandable Information
NASA Astrophysics Data System (ADS)
Habermann, T.; Robinson, E.
2017-12-01
The turn of the 21st century was a pivotal time in the Earth and Space Science information ecosystem. The Content Standard for Digital Geospatial Metadata (CSDGM) had existed for nearly a decade and ambitious new standards were just emerging. The U.S. Federal Geospatial Data Committee (FGDC) had extended many of the concepts from CSDGM into the International community with ISO 19115:2003 and the Consultative Committee for Space Data Systems (CCSDS) had migrated their Open Archival Information System (OAIS) Reference Model into an international standard (ISO 14721:2003). The OAIS model outlined the roles and responsibilities of archives with the principle role being preserving information and making it available to users, a "designated community", as a service to the data producer. It was mandatory for the archive to ensure that information is "independently understandable" to the designated community and to maintain that understanding through on-going partnerships between archives and designated communities. Standards can play a role in supporting these partnerships as designated communities expand across disciplinary and geographic boundaries. The ISO metadata standards include many capabilities that might make critical contributions to this goal. These include connections to resources outside of the metadata record (i.e. documentation) and mechanisms for ongoing incorporation of user feedback into the metadata stream. We will demonstrate these capabilities with examples of how they can increase understanding.
The Role of GIS and Data Librarians in Cyber-infrastructure Support and Governance
NASA Astrophysics Data System (ADS)
Branch, B. D.
2012-12-01
A governance road-map for cyber-infrastructure in the geosciences will include an intentional librarian core capable of technical skills that include GIS and open source support for data curation that involves all aspects of data life cycle management. Per Executive Order 12906 and other policy; spatial data, literacy, and curation are critical cyber-infrastructure needs in the near future. A formal earth science and space informatics librarian may be an outcome of such development. From e-science to e-research, STEM pipelines need librarians as critical data intermediaries in technical assistance and collaboration efforts with scientists' data and outreach needs. Future training concerns should advocate trans-disciplinary data science and policy skills that will be necessary for data management support and procurement.
EOS ODL Metadata On-line Viewer
NASA Astrophysics Data System (ADS)
Yang, J.; Rabi, M.; Bane, B.; Ullman, R.
2002-12-01
We have recently developed and deployed an EOS ODL metadata on-line viewer. The EOS ODL metadata viewer is a web server that takes: 1) an EOS metadata file in Object Description Language (ODL), 2) parameters, such as which metadata to view and what style of display to use, and returns an HTML or XML document displaying the requested metadata in the requested style. This tool is developed to address widespread complaints by science community that the EOS Data and Information System (EOSDIS) metadata files in ODL are difficult to read by allowing users to upload and view an ODL metadata file in different styles using a web browser. Users have the selection to view all the metadata or part of the metadata, such as Collection metadata, Granule metadata, or Unsupported Metadata. Choices of display styles include 1) Web: a mouseable display with tabs and turn-down menus, 2) Outline: Formatted and colored text, suitable for printing, 3) Generic: Simple indented text, a direct representation of the underlying ODL metadata, and 4) None: No stylesheet is applied and the XML generated by the converter is returned directly. Not all display styles are implemented for all the metadata choices. For example, Web style is only implemented for Collection and Granule metadata groups with known attribute fields, but not for Unsupported, Other, and All metadata. The overall strategy of the ODL viewer is to transform an ODL metadata file to a viewable HTML in two steps. The first step is to convert the ODL metadata file to an XML using a Java-based parser/translator called ODL2XML. The second step is to transform the XML to an HTML using stylesheets. Both operations are done on the server side. This allows a lot of flexibility in the final result, and is very portable cross-platform. Perl CGI behind the Apache web server is used to run the Java ODL2XML, and then run the results through an XSLT processor. The EOS ODL viewer can be accessed from either a PC or a Mac using Internet Explorer 5.0+ or Netscape 4.7+.
Evolutions in Metadata Quality
NASA Astrophysics Data System (ADS)
Gilman, J.
2016-12-01
Metadata Quality is one of the chief drivers of discovery and use of NASA EOSDIS (Earth Observing System Data and Information System) data. Issues with metadata such as lack of completeness, inconsistency, and use of legacy terms directly hinder data use. As the central metadata repository for NASA Earth Science data, the Common Metadata Repository (CMR) has a responsibility to its users to ensure the quality of CMR search results. This talk will cover how we encourage metadata authors to improve the metadata through the use of integrated rubrics of metadata quality and outreach efforts. In addition we'll demonstrate Humanizers, a technique for dealing with the symptoms of metadata issues. Humanizers allow CMR administrators to identify specific metadata issues that are fixed at runtime when the data is indexed. An example Humanizer is the aliasing of processing level "Level 1" to "1" to improve consistency across collections. The CMR currently indexes 35K collections and 300M granules.
Generation of Multiple Metadata Formats from a Geospatial Data Repository
NASA Astrophysics Data System (ADS)
Hudspeth, W. B.; Benedict, K. K.; Scott, S.
2012-12-01
The Earth Data Analysis Center (EDAC) at the University of New Mexico is partnering with the CYBERShARE and Environmental Health Group from the Center for Environmental Resource Management (CERM), located at the University of Texas, El Paso (UTEP), the Biodiversity Institute at the University of Kansas (KU), and the New Mexico Geo- Epidemiology Research Network (GERN) to provide a technical infrastructure that enables investigation of a variety of climate-driven human/environmental systems. Two significant goals of this NASA-funded project are: a) to increase the use of NASA Earth observational data at EDAC by various modeling communities through enabling better discovery, access, and use of relevant information, and b) to expose these communities to the benefits of provenance for improving understanding and usability of heterogeneous data sources and derived model products. To realize these goals, EDAC has leveraged the core capabilities of its Geographic Storage, Transformation, and Retrieval Engine (Gstore) platform, developed with support of the NSF EPSCoR Program. The Gstore geospatial services platform provides general purpose web services based upon the REST service model, and is capable of data discovery, access, and publication functions, metadata delivery functions, data transformation, and auto-generated OGC services for those data products that can support those services. Central to the NASA ACCESS project is the delivery of geospatial metadata in a variety of formats, including ISO 19115-2/19139, FGDC CSDGM, and the Proof Markup Language (PML). This presentation details the extraction and persistence of relevant metadata in the Gstore data store, and their transformation into multiple metadata formats that are increasingly utilized by the geospatial community to document not only core library catalog elements (e.g. title, abstract, publication data, geographic extent, projection information, and database elements), but also the processing steps used to generate derived modeling products. In particular, we discuss the generation and service delivery of provenance, or trace of data sources and analytical methods used in a scientific analysis, for archived data. We discuss the workflows developed by EDAC to capture end-to-end provenance, the storage model for those data in a delivery format independent data structure, and delivery of PML, ISO, and FGDC documents to clients requesting those products.
GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records.
Tahsin, Tasnia; Weissenbacher, Davy; O'Connor, Karen; Magge, Arjun; Scotch, Matthew; Gonzalez-Hernandez, Graciela
2018-05-01
GeoBoost is a command-line software package developed to address sparse or incomplete metadata in GenBank sequence records that relate to the location of the infected host (LOIH) of viruses. Given a set of GenBank accession numbers corresponding to virus GenBank records, GeoBoost extracts, integrates and normalizes geographic information reflecting the LOIH of the viruses using integrated information from GenBank metadata and related full-text publications. In addition, to facilitate probabilistic geospatial modeling, GeoBoost assigns probability scores for each possible LOIH. Binaries and resources required for running GeoBoost are packed into a single zipped file and freely available for download at https://tinyurl.com/geoboost. A video tutorial is included to help users quickly and easily install and run the software. The software is implemented in Java 1.8, and supported on MS Windows and Linux platforms. gragon@upenn.edu. Supplementary data are available at Bioinformatics online.
The role of metadata in managing large environmental science datasets. Proceedings
DOE Office of Scientific and Technical Information (OSTI.GOV)
Melton, R.B.; DeVaney, D.M.; French, J. C.
1995-06-01
The purpose of this workshop was to bring together computer science researchers and environmental sciences data management practitioners to consider the role of metadata in managing large environmental sciences datasets. The objectives included: establishing a common definition of metadata; identifying categories of metadata; defining problems in managing metadata; and defining problems related to linking metadata with primary data.
Hewitt, Robin; Gobbi, Alberto; Lee, Man-Ling
2005-01-01
Relational databases are the current standard for storing and retrieving data in the pharmaceutical and biotech industries. However, retrieving data from a relational database requires specialized knowledge of the database schema and of the SQL query language. At Anadys, we have developed an easy-to-use system for searching and reporting data in a relational database to support our drug discovery project teams. This system is fast and flexible and allows users to access all data without having to write SQL queries. This paper presents the hierarchical, graph-based metadata representation and SQL-construction methods that, together, are the basis of this system's capabilities.
Logic programming and metadata specifications
NASA Technical Reports Server (NTRS)
Lopez, Antonio M., Jr.; Saacks, Marguerite E.
1992-01-01
Artificial intelligence (AI) ideas and techniques are critical to the development of intelligent information systems that will be used to collect, manipulate, and retrieve the vast amounts of space data produced by 'Missions to Planet Earth.' Natural language processing, inference, and expert systems are at the core of this space application of AI. This paper presents logic programming as an AI tool that can support inference (the ability to draw conclusions from a set of complicated and interrelated facts). It reports on the use of logic programming in the study of metadata specifications for a small problem domain of airborne sensors, and the dataset characteristics and pointers that are needed for data access.
miRSponge: a manually curated database for experimentally supported miRNA sponges and ceRNAs.
Wang, Peng; Zhi, Hui; Zhang, Yunpeng; Liu, Yue; Zhang, Jizhou; Gao, Yue; Guo, Maoni; Ning, Shangwei; Li, Xia
2015-01-01
In this study, we describe miRSponge, a manually curated database, which aims at providing an experimentally supported resource for microRNA (miRNA) sponges. Recent evidence suggests that miRNAs are themselves regulated by competing endogenous RNAs (ceRNAs) or 'miRNA sponges' that contain miRNA binding sites. These competitive molecules can sequester miRNAs to prevent them interacting with their natural targets to play critical roles in various biological and pathological processes. It has become increasingly important to develop a high quality database to record and store ceRNA data to support future studies. To this end, we have established the experimentally supported miRSponge database that contains data on 599 miRNA-sponge interactions and 463 ceRNA relationships from 11 species following manual curating from nearly 1200 published articles. Database classes include endogenously generated molecules including coding genes, pseudogenes, long non-coding RNAs and circular RNAs, along with exogenously introduced molecules including viral RNAs and artificial engineered sponges. Approximately 70% of the interactions were identified experimentally in disease states. miRSponge provides a user-friendly interface for convenient browsing, retrieval and downloading of dataset. A submission page is also included to allow researchers to submit newly validated miRNA sponge data. Database URL: http://www.bio-bigdata.net/miRSponge. © The Author(s) 2015. Published by Oxford University Press.
miRSponge: a manually curated database for experimentally supported miRNA sponges and ceRNAs
Wang, Peng; Zhi, Hui; Zhang, Yunpeng; Liu, Yue; Zhang, Jizhou; Gao, Yue; Guo, Maoni; Ning, Shangwei; Li, Xia
2015-01-01
In this study, we describe miRSponge, a manually curated database, which aims at providing an experimentally supported resource for microRNA (miRNA) sponges. Recent evidence suggests that miRNAs are themselves regulated by competing endogenous RNAs (ceRNAs) or ‘miRNA sponges’ that contain miRNA binding sites. These competitive molecules can sequester miRNAs to prevent them interacting with their natural targets to play critical roles in various biological and pathological processes. It has become increasingly important to develop a high quality database to record and store ceRNA data to support future studies. To this end, we have established the experimentally supported miRSponge database that contains data on 599 miRNA-sponge interactions and 463 ceRNA relationships from 11 species following manual curating from nearly 1200 published articles. Database classes include endogenously generated molecules including coding genes, pseudogenes, long non-coding RNAs and circular RNAs, along with exogenously introduced molecules including viral RNAs and artificial engineered sponges. Approximately 70% of the interactions were identified experimentally in disease states. miRSponge provides a user-friendly interface for convenient browsing, retrieval and downloading of dataset. A submission page is also included to allow researchers to submit newly validated miRNA sponge data. Database URL: http://www.bio-bigdata.net/miRSponge. PMID:26424084
Ning, Shangwei; Zhang, Jizhou; Wang, Peng; Zhi, Hui; Wang, Jianjian; Liu, Yue; Gao, Yue; Guo, Maoni; Yue, Ming; Wang, Lihua; Li, Xia
2016-01-04
Lnc2Cancer (http://www.bio-bigdata.net/lnc2cancer) is a manually curated database of cancer-associated long non-coding RNAs (lncRNAs) with experimental support that aims to provide a high-quality and integrated resource for exploring lncRNA deregulation in various human cancers. LncRNAs represent a large category of functional RNA molecules that play a significant role in human cancers. A curated collection and summary of deregulated lncRNAs in cancer is essential to thoroughly understand the mechanisms and functions of lncRNAs. Here, we developed the Lnc2Cancer database, which contains 1057 manually curated associations between 531 lncRNAs and 86 human cancers. Each association includes lncRNA and cancer name, the lncRNA expression pattern, experimental techniques, a brief functional description, the original reference and additional annotation information. Lnc2Cancer provides a user-friendly interface to conveniently browse, retrieve and download data. Lnc2Cancer also offers a submission page for researchers to submit newly validated lncRNA-cancer associations. With the rapidly increasing interest in lncRNAs, Lnc2Cancer will significantly improve our understanding of lncRNA deregulation in cancer and has the potential to be a timely and valuable resource. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Spatial Data Transfer Standard (SDTS), part 5 : SDTS raster profile and extensions
DOT National Transportation Integrated Search
1999-02-01
The Spatial Data Transfer Standard (SDTS) defines a general mechanism for the transfer of : geographically referenced spatial data and its supporting metadata, i.e., attributes, data quality reports, : coordinate reference systems, security informati...
Scalable Metadata Management for a Large Multi-Source Seismic Data Repository
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gaylord, J. M.; Dodge, D. A.; Magana-Zook, S. A.
In this work, we implemented the key metadata management components of a scalable seismic data ingestion framework to address limitations in our existing system, and to position it for anticipated growth in volume and complexity. We began the effort with an assessment of open source data flow tools from the Hadoop ecosystem. We then began the construction of a layered architecture that is specifically designed to address many of the scalability and data quality issues we experience with our current pipeline. This included implementing basic functionality in each of the layers, such as establishing a data lake, designing a unifiedmore » metadata schema, tracking provenance, and calculating data quality metrics. Our original intent was to test and validate the new ingestion framework with data from a large-scale field deployment in a temporary network. This delivered somewhat unsatisfying results, since the new system immediately identified fatal flaws in the data relatively early in the pipeline. Although this is a correct result it did not allow us to sufficiently exercise the whole framework. We then widened our scope to process all available metadata from over a dozen online seismic data sources to further test the implementation and validate the design. This experiment also uncovered a higher than expected frequency of certain types of metadata issues that challenged us to further tune our data management strategy to handle them. Our result from this project is a greatly improved understanding of real world data issues, a validated design, and prototype implementations of major components of an eventual production framework. This successfully forms the basis of future development for the Geophysical Monitoring Program data pipeline, which is a critical asset supporting multiple programs. It also positions us very well to deliver valuable metadata management expertise to our sponsors, and has already resulted in an NNSA Office of Defense Nuclear Nonproliferation commitment to a multi-year project for follow-on work.« less
NASA Astrophysics Data System (ADS)
Stone, N.; Lafuente, B.; Bristow, T.; Keller, R.; Downs, R. T.; Blake, D. F.; Fonda, M.; Pires, A.
2016-12-01
Working primarily with astrobiology researchers at NASA Ames, the Open Data Repository (ODR) has been conducting a software pilot to meet the varying needs of this multidisciplinary community. Astrobiology researchers often have small communities or operate individually with unique data sets that don't easily fit into existing database structures. The ODR constructed its Data Publisher software to allow researchers to create databases with common metadata structures and subsequently extend them to meet their individual needs and data requirements. The software accomplishes these tasks through a web-based interface that allows collaborative creation and revision of common metadata templates and individual extensions to these templates for custom data sets. This allows researchers to search disparate datasets based on common metadata established through the metadata tools, but still facilitates distinct analyses and data that may be stored alongside the required common metadata. The software produces web pages that can be made publicly available at the researcher's discretion so that users may search and browse the data in an effort to make interoperability and data discovery a human-friendly task while also providing semantic data for machine-based discovery. Once relevant data has been identified, researchers can utilize the built-in application programming interface (API) that exposes the data for machine-based consumption and integration with existing data analysis tools (e.g. R, MATLAB, Project Jupyter - http://jupyter.org). The current evolution of the project has created the Astrobiology Habitable Environments Database (AHED)[1] which provides an interface to databases connected through a common metadata core. In the next project phase, the goal is for small research teams and groups to be self-sufficient in publishing their research data to meet funding mandates and academic requirements as well as fostering increased data discovery and interoperability through human-readable and machine-readable interfaces. This project is supported by the Science-Enabling Research Activity (SERA) and NASA NNX11AP82A, MSL. [1] B. Lafuente et al. (2016) AGU, submitted.
Kwong, Waldan K.; McFrederick, Quinn; Anderson, Kirk E.; Barribeau, Seth Michael; Chandler, James Angus; Cornman, R. Scott; Dainat, Jacques; Doublet, Vincent; Emery, Olivier; Evans, Jay D.; Farinelli, Laurent; Flenniken, Michelle L.; Granberg, Fredrik; Grasis, Juris A.; Gauthier, Laurent; Hayer, Juliette; Koch, Hauke; Kocher, Sarah; Martinson, Vincent G.; Moran, Nancy; Munoz-Torres, Monica; Newton, Irene; Paxton, Robert J.; Powell, Eli; Sadd, Ben M.; Schmid-Hempel, Paul; Schmid-Hempel, Regula; Schwarz, Ryan S.; vanEngelsdorp, Dennis
2016-01-01
ABSTRACT As pollinators, bees are cornerstones for terrestrial ecosystem stability and key components in agricultural productivity. All animals, including bees, are associated with a diverse community of microbes, commonly referred to as the microbiome. The bee microbiome is likely to be a crucial factor affecting host health. However, with the exception of a few pathogens, the impacts of most members of the bee microbiome on host health are poorly understood. Further, the evolutionary and ecological forces that shape and change the microbiome are unclear. Here, we discuss recent progress in our understanding of the bee microbiome, and we present challenges associated with its investigation. We conclude that global coordination of research efforts is needed to fully understand the complex and highly dynamic nature of the interplay between the bee microbiome, its host, and the environment. High-throughput sequencing technologies are ideal for exploring complex biological systems, including host-microbe interactions. To maximize their value and to improve assessment of the factors affecting bee health, sequence data should be archived, curated, and analyzed in ways that promote the synthesis of different studies. To this end, the BeeBiome consortium aims to develop an online database which would provide reference sequences, archive metadata, and host analytical resources. The goal would be to support applied and fundamental research on bees and their associated microbes and to provide a collaborative framework for sharing primary data from different research programs, thus furthering our understanding of the bee microbiome and its impact on pollinator health. PMID:27118586
Engel, Philipp; Kwong, Waldan K.; McFrederick, Quinn; Anderson, Kirk E.; Barribeau, Seth Michael; Chandler, James Angus; Cornman, Robert S.; Dainat, Jacques; de Miranda, Joachim R.; Doublet, Vincent; Emery, Olivier; Evans, Jay D.; Farinelli, Laurent; Flenniken, Michelle L.; Granberg, Fredrik; Grasis, Juris A.; Gauthier, Laurent; Hayer, Juliette; Koch, Hauke; Kocher, Sarah; Martinson, Vincent G.; Moran, Nancy; Munoz-Torres, Monica; Newton, Irene; Paxton, Robert J.; Powell, Eli; Sadd, Ben M.; Schmid-Hempel, Paul; Schmid-Hempel, Regula; Song, Se Jin; Schwarz, Ryan S.; vanEngelsdorp, Dennis; Dainat, Benjamin
2016-01-01
As pollinators, bees are cornerstones for terrestrial ecosystem stability and key components in agricultural productivity. All animals, including bees, are associated with a diverse community of microbes, commonly referred to as the microbiome. The bee microbiome is likely to be a crucial factor affecting host health. However, with the exception of a few pathogens, the impacts of most members of the bee microbiome on host health are poorly understood. Further, the evolutionary and ecological forces that shape and change the microbiome are unclear. Here, we discuss recent progress in our understanding of the bee microbiome, and we present challenges associated with its investigation. We conclude that global coordination of research efforts is needed to fully understand the complex and highly dynamic nature of the interplay between the bee microbiome, its host, and the environment. High-throughput sequencing technologies are ideal for exploring complex biological systems, including host-microbe interactions. To maximize their value and to improve assessment of the factors affecting bee health, sequence data should be archived, curated, and analyzed in ways that promote the synthesis of different studies. To this end, the BeeBiome consortium aims to develop an online database which would provide reference sequences, archive metadata, and host analytical resources. The goal would be to support applied and fundamental research on bees and their associated microbes and to provide a collaborative framework for sharing primary data from different research programs, thus furthering our understanding of the bee microbiome and its impact on pollinator health.
Engel, Philipp; Kwong, Waldan K; McFrederick, Quinn; Anderson, Kirk E; Barribeau, Seth Michael; Chandler, James Angus; Cornman, R Scott; Dainat, Jacques; de Miranda, Joachim R; Doublet, Vincent; Emery, Olivier; Evans, Jay D; Farinelli, Laurent; Flenniken, Michelle L; Granberg, Fredrik; Grasis, Juris A; Gauthier, Laurent; Hayer, Juliette; Koch, Hauke; Kocher, Sarah; Martinson, Vincent G; Moran, Nancy; Munoz-Torres, Monica; Newton, Irene; Paxton, Robert J; Powell, Eli; Sadd, Ben M; Schmid-Hempel, Paul; Schmid-Hempel, Regula; Song, Se Jin; Schwarz, Ryan S; vanEngelsdorp, Dennis; Dainat, Benjamin
2016-04-26
As pollinators, bees are cornerstones for terrestrial ecosystem stability and key components in agricultural productivity. All animals, including bees, are associated with a diverse community of microbes, commonly referred to as the microbiome. The bee microbiome is likely to be a crucial factor affecting host health. However, with the exception of a few pathogens, the impacts of most members of the bee microbiome on host health are poorly understood. Further, the evolutionary and ecological forces that shape and change the microbiome are unclear. Here, we discuss recent progress in our understanding of the bee microbiome, and we present challenges associated with its investigation. We conclude that global coordination of research efforts is needed to fully understand the complex and highly dynamic nature of the interplay between the bee microbiome, its host, and the environment. High-throughput sequencing technologies are ideal for exploring complex biological systems, including host-microbe interactions. To maximize their value and to improve assessment of the factors affecting bee health, sequence data should be archived, curated, and analyzed in ways that promote the synthesis of different studies. To this end, the BeeBiome consortium aims to develop an online database which would provide reference sequences, archive metadata, and host analytical resources. The goal would be to support applied and fundamental research on bees and their associated microbes and to provide a collaborative framework for sharing primary data from different research programs, thus furthering our understanding of the bee microbiome and its impact on pollinator health. Copyright © 2016 Engel et al.