Automatic Extraction of Metadata from Scientific Publications for CRIS Systems
ERIC Educational Resources Information Center
Kovacevic, Aleksandar; Ivanovic, Dragan; Milosavljevic, Branko; Konjovic, Zora; Surla, Dusan
2011-01-01
Purpose: The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the scientific research activity of the University of Novi Sad (CRIS UNS). Design/methodology/approach: The system is based on machine learning and performs automatic extraction…
EXIF Custom: Automatic image metadata extraction for Scratchpads and Drupal.
Baker, Ed
2013-01-01
Many institutions and individuals use embedded metadata to aid in the management of their image collections. Many desktop image management solutions such as Adobe Bridge and online tools such as Flickr also make use of embedded metadata to describe, categorise and license images. Until now Scratchpads (a data management system and virtual research environment for biodiversity) have not made use of these metadata, and users have had to manually re-enter this information if they have wanted to display it on their Scratchpad site. The Drupal module described here allows users to map metadata embedded in their images to the associated field in the Scratchpads image form using one or more customised mappings. The module works seamlessly with the bulk image uploader used on Scratchpads and it is therefore possible to upload hundreds of images easily with automatic metadata (EXIF, XMP and IPTC) extraction and mapping. PMID:24723768
Automated software system for checking the structure and format of ACM SIG documents
NASA Astrophysics Data System (ADS)
Mirza, Arsalan Rahman; Sah, Melike
2017-04-01
Microsoft (MS) Office Word is one of the most commonly used software tools for creating documents. MS Word 2007 and above uses XML to represent the structure of MS Word documents. Metadata about the documents are automatically created using Office Open XML (OOXML) syntax. We develop a new framework, called ADFCS (Automated Document Format Checking System), that takes advantage of the OOXML metadata in order to extract semantic information from MS Office Word documents. In particular, we develop a new ontology for Association for Computing Machinery (ACM) Special Interest Group (SIG) documents that represents the structure and format of these documents using OWL (Web Ontology Language). Then, the metadata is extracted automatically in RDF (Resource Description Framework) according to this ontology using the developed software. Finally, we generate extensive rules in order to infer whether the documents are formatted according to ACM SIG standards. This paper introduces the ACM SIG ontology, the metadata extraction process, the inference engine, the ADFCS online user interface, the system evaluation and the user study evaluations.
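The abstract relies on the fact that an OOXML document is a ZIP package whose parts are XML files. The following minimal Python sketch illustrates that idea only; it is not the ADFCS implementation, which is not described here in detail. It reads the title and creator from the `docProps/core.xml` part. The function names and the demo package (which holds only that one part and is not a complete Word document) are assumptions made for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an OOXML package.
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC_NS = "http://purl.org/dc/elements/1.1/"
NS = {"dc": DC_NS}


def read_core_properties(package_bytes):
    """Return the Dublin Core title and creator stored in docProps/core.xml."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "title": root.findtext("dc:title", default=None, namespaces=NS),
        "creator": root.findtext("dc:creator", default=None, namespaces=NS),
    }


def make_demo_package():
    """Build a stand-in ZIP with only the core-properties part (hypothetical data)."""
    core_xml = (
        f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dc="{DC_NS}">'
        "<dc:title>Sample SIG paper</dc:title>"
        "<dc:creator>A. Author</dc:creator>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core_xml)
    return buffer.getvalue()
```

Calling `read_core_properties(make_demo_package())` returns the two fields as a dictionary. A format checker would inspect other parts of the package in the same way, but which parts and rules ADFCS uses is not stated in the abstract.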
Normalized Metadata Generation for Human Retrieval Using Multiple Video Surveillance Cameras.
Jung, Jaehoon; Yoon, Inhye; Lee, Seungwon; Paik, Joonki
2016-06-24
Since it is impossible for surveillance personnel to keep monitoring videos from a multiple camera-based surveillance system, an efficient technique is needed to help recognize important situations by retrieving the metadata of an object-of-interest. In a multiple camera-based surveillance system, an object detected in a camera has a different shape in another camera, which is a critical issue of wide-range, real-time surveillance systems. In order to address the problem, this paper presents an object retrieval method by extracting the normalized metadata of an object-of-interest from multiple, heterogeneous cameras. The proposed metadata generation algorithm consists of three steps: (i) generation of a three-dimensional (3D) human model; (ii) human object-based automatic scene calibration; and (iii) metadata generation. More specifically, an appropriately-generated 3D human model provides the foot-to-head direction information that is used as the input of the automatic calibration of each camera. The normalized object information is used to retrieve an object-of-interest in a wide-range, multiple-camera surveillance system in the form of metadata. Experimental results show that the 3D human model matches the ground truth, and automatic calibration-based normalization of metadata enables a successful retrieval and tracking of a human object in the multiple-camera video surveillance system. PMID:27347961
Liu, Z; Sun, J; Smith, M; Smith, L; Warr, R
2013-11-01
Computer-assisted diagnosis (CAD) of malignant melanoma (MM) has been advocated to help clinicians to achieve a more objective and reliable assessment. However, conventional CAD systems examine only the features extracted from digital photographs of lesions. Failure to incorporate patients' personal information constrains their applicability in clinical settings. The objective was to develop a new CAD system that improves the performance of automatic diagnosis of melanoma and that, for the first time, incorporates digital features of lesions with important patient metadata into a learning process. Thirty-two features were extracted from digital photographs to characterize skin lesions. Patients' personal information, such as age, gender, and lesion site, and their combinations, was quantified as metadata. The integration of digital features and metadata was realized through an extended Laplacian eigenmap, a dimensionality-reduction method grouping lesions with similar digital features and metadata into the same classes. The diagnosis reached 82.1% sensitivity and 86.1% specificity when only multidimensional digital features were used, but improved to 95.2% sensitivity and 91.0% specificity after metadata were incorporated appropriately. The proposed system achieves a level of sensitivity comparable with that of experienced dermatologists aided by conventional dermoscopes. This demonstrates the potential of our method for assisting clinicians in diagnosing melanoma, and the benefit it could provide to patients and hospitals by greatly reducing unnecessary excisions of benign naevi. This paper proposes an enhanced CAD system incorporating clinical metadata into the learning process for automatic classification of melanoma. Results demonstrate that the additional metadata and the mechanism to incorporate them are useful for improving CAD of melanoma. © 2013 British Association of Dermatologists.
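The sensitivity and specificity figures quoted above follow the standard definitions: sensitivity is the fraction of true melanomas that are flagged, and specificity is the fraction of benign lesions that are correctly cleared. A minimal Python sketch of that arithmetic is given below. The function name and the label convention (1 for melanoma, 0 for benign) are assumptions for illustration, and the sketch does not reproduce the study's data or its classifier.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) from binary labels (1 = melanoma, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # melanomas correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # melanomas missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign lesions correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign lesions wrongly flagged
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```

With made-up labels `[1, 1, 1, 1, 0, 0, 0, 0, 0]` and predictions `[1, 1, 1, 0, 0, 0, 0, 0, 1]`, three of four melanomas are detected and four of five benign lesions are cleared, giving a sensitivity of 0.75 and a specificity of 0.8.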
Representing Hydrologic Models as HydroShare Resources to Facilitate Model Sharing and Collaboration
NASA Astrophysics Data System (ADS)
Castronova, A. M.; Goodall, J. L.; Mbewe, P.
2013-12-01
The CUAHSI HydroShare project is a collaborative effort that aims to provide software for sharing data and models within the hydrologic science community. One of the early focuses of this work has been establishing metadata standards for describing models and model-related data as HydroShare resources. By leveraging this metadata definition, a prototype extension has been developed to create model resources that can be shared within the community using the HydroShare system. The extension uses a general model metadata definition to create resource objects, and was designed so that model-specific parsing routines can extract and populate metadata fields from model input and output files. The long-term goal is to establish a library of supported models where, for each model, the system has the ability to extract key metadata fields automatically, thereby establishing standardized model metadata that will serve as the foundation for model sharing and collaboration within HydroShare. The Soil and Water Assessment Tool (SWAT) is used to demonstrate this concept through a case study application.
Automated DICOM metadata and volumetric anatomical information extraction for radiation dosimetry
NASA Astrophysics Data System (ADS)
Papamichail, D.; Ploussi, A.; Kordolaimi, S.; Karavasilis, E.; Papadimitroulas, P.; Syrgiamiotis, V.; Efstathopoulos, E.
2015-09-01
Patient-specific dosimetry calculations based on simulation techniques have as a prerequisite the modeling of the modality system and the creation of voxelized phantoms. This procedure requires knowledge of the scanning parameters and patients’ information included in a DICOM file as well as image segmentation. However, the extraction of this information is complicated and time-consuming. The objective of this study was to develop a simple graphical user interface (GUI) to (i) automatically extract metadata from every slice image of a DICOM file in a single query and (ii) interactively specify the regions of interest (ROI) without explicit access to the radiology information system. The user-friendly application was developed in the Matlab environment. The user can select a series of DICOM files and manage their text and graphical data. The metadata are automatically formatted and presented to the user as a Microsoft Excel file. The volumetric maps are formed by interactively specifying the ROIs and by assigning a specific value in every ROI. The result is stored in DICOM format, for data and trend analysis. The developed GUI is easy to use and fast, and constitutes a very useful tool for individualized dosimetry. One of the future goals is to incorporate remote access to a PACS server.
NASA Technical Reports Server (NTRS)
Ullman, Richard; Bane, Bob; Yang, Jingli
2008-01-01
A shell script has been written as a means of automatically making HDF-EOS-formatted data sets available via the World Wide Web. ("HDF-EOS" and variants thereof are defined in the first of the two immediately preceding articles.) The shell script chains together some software tools developed by the Data Usability Group at Goddard Space Flight Center to perform the following actions: (1) extract metadata in Object Definition Language (ODL) from an HDF-EOS file; (2) convert the metadata from ODL to Extensible Markup Language (XML); (3) reformat the XML metadata into human-readable Hypertext Markup Language (HTML); (4) publish the HTML metadata and the original HDF-EOS file to a Web server and an Open-source Project for a Network Data Access Protocol (OPeN-DAP) server computer; and (5) reformat the XML metadata and submit the resulting file to the EOS Clearinghouse, which is a Web-based metadata clearinghouse that facilitates searching for, and exchange of, Earth-Science data.
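The ODL-to-XML conversion step in the chain above can be pictured with a small Python sketch. It is an illustration under strong assumptions, not the Goddard tools themselves: it handles only a simplified subset of ODL (nested `GROUP`/`OBJECT` blocks and single-line `KEY = VALUE` pairs), and the function name, the `metadata` root element and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical ODL fragment in the nested GROUP/OBJECT style.
SAMPLE_ODL = """
GROUP = INVENTORYMETADATA
  GROUPTYPE = MASTERGROUP
  OBJECT = SHORTNAME
    NUM_VAL = 1
    VALUE = "MOD021KM"
  END_OBJECT = SHORTNAME
END_GROUP = INVENTORYMETADATA
END
"""


def odl_to_xml(odl_text):
    """Convert a simplified ODL subset into a nested XML element tree."""
    root = ET.Element("metadata")
    stack = [root]  # the innermost open GROUP/OBJECT is always stack[-1]
    for raw_line in odl_text.splitlines():
        line = raw_line.strip()
        if not line or line == "END":
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"')
        if key in ("GROUP", "OBJECT"):
            # Open a new nested element named after the group or object.
            stack.append(ET.SubElement(stack[-1], value))
        elif key in ("END_GROUP", "END_OBJECT"):
            stack.pop()
        else:
            # Plain KEY = VALUE pair becomes a child element with text.
            ET.SubElement(stack[-1], key).text = value
    return root
```

`ET.tostring(odl_to_xml(SAMPLE_ODL), encoding="unicode")` then yields an XML string that a later step could restyle as HTML. Multi-line values and value lists, which real ODL permits, are deliberately left out of this sketch.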
The STP (Solar-Terrestrial Physics) Semantic Web based on the RSS1.0 and the RDF
NASA Astrophysics Data System (ADS)
Kubo, T.; Murata, K. T.; Kimura, E.; Ishikura, S.; Shinohara, I.; Kasaba, Y.; Watari, S.; Matsuoka, D.
2006-12-01
In Solar-Terrestrial Physics (STP), it has been pointed out that the circulation and utilization of observation data among researchers are insufficient. To achieve interdisciplinary research, we need to overcome these circulation and utilization problems. Against this background, the authors' group has developed a world-wide database that manages meta-data of satellite and ground-based observation data files. Until now, retrieving meta-data from the observation data and registering them in the database have been carried out by hand. Our goal is to establish the STP Semantic Web. The Semantic Web provides a common framework that allows a variety of data to be shared and reused across applications, enterprises, and communities. We also expect that secondary information related to observations, such as event information and associated news, will be shared over the networks. The most fundamental issue in establishing it is who generates, manages and provides meta-data in the Semantic Web. We developed an automatic meta-data collection system for the observation data using RSS (RDF Site Summary) 1.0. RSS 1.0 is an XML-based markup language built on RDF (Resource Description Framework) and designed for syndicating news and the contents of news-like sites. RSS 1.0 is used to describe the STP meta-data, such as data file name, file server address and observation date. To describe STP meta-data beyond the RSS 1.0 vocabulary, we defined original vocabularies for the STP resources using RDF Schema. The RDF describes technical terms of the STP along with the Dublin Core Metadata Element Set, which is a standard for cross-domain information resource descriptions. Information about STP researchers is described with FOAF, an RDF/XML vocabulary for creating machine-readable metadata describing people. Using RSS 1.0 as a meta-data distribution method, the workflow from retrieving meta-data to registering them in the database is automated.
This technique is applied to several database systems, such as the DARTS database system and the NICT Space Weather Report Service. DARTS is a science database managed by ISAS/JAXA in Japan. We succeeded in generating and collecting the meta-data automatically for CDF (Common Data Format) data, such as Reimei satellite data, provided by DARTS. We also created an RDF service for the space weather report and the real-time global MHD simulation 3D data provided by NICT. Our Semantic Web system works as follows: The RSS 1.0 documents generated on the data sites (ISAS and NICT) are automatically collected by a meta-data collection agent. The RDF documents are registered and the agent extracts meta-data to store them in Sesame, which is an open source RDF database with support for RDF Schema inferencing and querying. The RDF database provides advanced retrieval processing that takes properties and relations into account. Finally, the STP Semantic Web provides automatic processing and high-level search not only for observation data but also for space weather news, physical events, technical terms and researcher information related to the STP.
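The abstract says that RSS 1.0 carries the file name, file server address and observation date of each data file. The Python sketch below shows one plausible shape for a single RSS 1.0 `item` with a Dublin Core date, and how a collection agent might read it back. The mapping of fields to elements, the function names and the example URL are assumptions; the authors' own STP vocabulary is not reproduced here, and a complete RSS 1.0 document would also need the enclosing `rdf:RDF` and `channel` elements.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
DC = "http://purl.org/dc/elements/1.1/"


def build_item(file_name, file_url, observation_date):
    """Describe one data file as an RSS 1.0 item (assumed field mapping)."""
    item = ET.Element(f"{{{RSS}}}item", {f"{{{RDF}}}about": file_url})
    ET.SubElement(item, f"{{{RSS}}}title").text = file_name
    ET.SubElement(item, f"{{{RSS}}}link").text = file_url
    ET.SubElement(item, f"{{{DC}}}date").text = observation_date
    return item


def read_item(item):
    """Extract the meta-data fields a collection agent would register."""
    return {
        "name": item.findtext(f"{{{RSS}}}title"),
        "url": item.get(f"{{{RDF}}}about"),
        "date": item.findtext(f"{{{DC}}}date"),
    }


def round_trip(file_name, file_url, observation_date):
    """Serialize an item to XML text and parse it again, as an agent would."""
    xml_text = ET.tostring(build_item(file_name, file_url, observation_date), encoding="unicode")
    return read_item(ET.fromstring(xml_text))
```

For example, `round_trip("sample_20060101.cdf", "http://example.org/data/sample_20060101.cdf", "2006-01-01")` returns the same three values that went in, which is the property the automated registration workflow depends on.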
Integrating Semantic Information in Metadata Descriptions for a Geoscience-wide Resource Inventory.
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Richard, S. M.; Gupta, A.; Valentine, D.; Whitenack, T.; Ozyurt, I. B.; Grethe, J. S.; Schachne, A.
2016-12-01
Integrating semantic information into legacy metadata catalogs is a challenging issue and so far has been mostly done on a limited scale. We present experience of CINERGI (Community Inventory of Earthcube Resources for Geoscience Interoperability), an NSF Earthcube Building Block project, in creating a large cross-disciplinary catalog of geoscience information resources to enable cross-domain discovery. The project developed a pipeline for automatically augmenting resource metadata, in particular generating keywords that describe metadata documents harvested from multiple geoscience information repositories or contributed by geoscientists through various channels including surveys and domain resource inventories. The pipeline examines available metadata descriptions using text parsing, vocabulary management and semantic annotation and graph navigation services of GeoSciGraph. GeoSciGraph, in turn, relies on a large cross-domain ontology of geoscience terms, which bridges several independently developed ontologies or taxonomies including SWEET, ENVO, YAGO, GeoSciML, GCMD, SWO, and CHEBI. The ontology content enables automatic extraction of keywords reflecting science domains, equipment used, geospatial features, measured properties, methods, processes, etc. We specifically focus on issues of cross-domain geoscience ontology creation, resolving several types of semantic conflicts among component ontologies or vocabularies, and constructing and managing facets for improved data discovery and navigation. The ontology and keyword generation rules are iteratively improved as pipeline results are presented to data managers for selective manual curation via a CINERGI Annotator user interface. 
We present lessons learned from applying the CINERGI metadata augmentation pipeline to a number of federal agency and academic data registries, in the context of several use cases that require data discovery and integration across multiple earth science data catalogs of varying quality and completeness. The inventory is accessible at http://cinergi.sdsc.edu, and the CINERGI project web page is http://earthcube.org/group/cinergi
Automated sea floor extraction from underwater video
NASA Astrophysics Data System (ADS)
Kelly, Lauren; Rahmes, Mark; Stiver, James; McCluskey, Mike
2016-05-01
Ocean floor mapping using video is a simple and cost-effective method of recording large areas of the seafloor. Obtaining visual and elevation models has noteworthy applications in search and recovery missions. Hazards to navigation are abundant and pose a significant threat to the safety, effectiveness, and speed of naval operations and commercial vessels. This project's objective was to develop a workflow to automatically extract metadata from marine video and create optical image and elevation surface mosaics. Three developments made this possible. First, optical character recognition (OCR) by means of two-dimensional correlation, using a known character set, allowed for the capture of metadata from image files. Second, exploiting the image metadata (i.e., latitude, longitude, heading, camera angle, and depth readings) allowed for the determination of the location and orientation of each image frame in the mosaic. Image registration improved the accuracy of mosaicking. Finally, overlapping data allowed us to determine height information. A disparity map was created using the parallax from overlapping viewpoints of a given area, and the relative height data was utilized to create a three-dimensional, textured elevation map.
ERIC Educational Resources Information Center
Vrablecová, Petra; Šimko, Marián
2016-01-01
The domain model is an essential part of an adaptive learning system. For each educational course, it involves educational content and semantics, which is also viewed as a form of conceptual metadata about educational content. Due to the size of a domain model, manual domain model creation is a challenging and demanding task for teachers or…
Bruland, Philipp; Doods, Justin; Storck, Michael; Dugas, Martin
2017-01-01
Data dictionaries provide structural meta-information about data definitions in health information technology (HIT) systems. In this regard, reusing healthcare data for secondary purposes offers several advantages (e.g. reduced documentation times or increased data quality). Prerequisites for data reuse are the quality, availability and identical meaning of the data. In diverse projects, research data warehouses serve as core components between heterogeneous clinical databases and various research applications. Given the complexity (high number of data elements) and dynamics (regular updates) of electronic health record (EHR) data structures, we propose a clinical metadata warehouse (CMDW) based on a metadata registry standard. Metadata of two large hospitals were automatically inserted into two CMDWs containing 16,230 forms and 310,519 data elements. Automatic updates of metadata are possible as well as semantic annotations. A CMDW allows metadata discovery, data quality assessment and similarity analyses. Common data models for distributed research networks can be established based on similarity analyses.
An Observation Knowledgebase for Hinode Data
NASA Astrophysics Data System (ADS)
Hurlburt, Neal E.; Freeland, S.; Green, S.; Schiff, D.; Seguin, R.; Slater, G.; Cirtain, J.
2007-05-01
We have developed a standards-based system for the Solar Optical and X Ray Telescopes on the Hinode orbiting solar observatory which can serve as part of a developing Heliophysics informatics system. Our goal is to make the scientific data acquired by Hinode more accessible and useful to scientists by allowing them to do reasoning and flexible searches on observation metadata and to ask higher-level questions of the system than previously allowed. The Hinode Observation Knowledgebase relates the intentions and goals of the observation planners (as-planned metadata) with actual observational data (as-run metadata), along with connections to related models, data products and identified features (follow-up metadata) through a citation system. Summaries of the data (both as image thumbnails and short "film strips") serve to guide researchers to the observations appropriate for their research, and these are linked directly to the data catalog for easy extraction and delivery. The semantic information of the observation (field of view, wavelength, type of observable, average cadence, etc.) is captured through simple user interfaces and encoded using the VOEvent XML standard (with the addition of some solar-related extensions). These interfaces merge metadata acquired automatically during both the mission planning and data analysis phases (see Seguin et al. 2007 at this meeting) with that obtained directly from the planner/analyst and send them to be incorporated into the knowledgebase. The resulting information is automatically rendered into standard categories based on planned and recent observations, as well as by popularity and recommendations by the science team. They are also directly searchable through both web-based searches and direct calls to the API. Observation details can also be rendered as RSS, iTunes and Google Earth interfaces. The resulting system provides a useful tool to researchers and can act as a demonstration for larger, more complex systems.
Collaborative Sharing of Multidimensional Space-time Data Using HydroShare
NASA Astrophysics Data System (ADS)
Gan, T.; Tarboton, D. G.; Horsburgh, J. S.; Dash, P. K.; Idaszak, R.; Yi, H.; Blanton, B.
2015-12-01
HydroShare is a collaborative environment being developed for sharing hydrological data and models. It includes the capability to upload data in many formats as resources that can be shared. The HydroShare data model for resources uses a specific format for the representation of each type of data and specifies metadata common to all resource types as well as metadata unique to specific resource types. The Network Common Data Form (NetCDF) was chosen as the format for multidimensional space-time data in HydroShare. NetCDF is widely used in hydrological and other geoscience modeling because it contains self-describing metadata and supports the creation of array-oriented datasets that may include three spatial dimensions, a time dimension and other user-defined dimensions. For example, NetCDF may be used to represent precipitation or surface air temperature fields that have two dimensions in space and one dimension in time. This presentation will illustrate how NetCDF files are used in HydroShare. When a NetCDF file is loaded into HydroShare, header information is extracted using the "ncdump" utility. Python functions developed for the Django web framework on which HydroShare is based extract science metadata present in the NetCDF file, saving the user from having to enter it. Where the file follows the Climate and Forecast (CF) conventions and the Attribute Convention for Dataset Discovery (ACDD) standards, metadata is thus automatically populated. Users also have the ability to add metadata to the resource that may not have been present in the original NetCDF file. HydroShare's metadata editing functionality then writes this science metadata back into the NetCDF file to maintain consistency between the science metadata in HydroShare and the metadata in the NetCDF file. This further helps researchers easily add metadata information following the CF and ACDD conventions.
Additional data inspection and subsetting functions were developed, taking advantage of Python and command line libraries for working with NetCDF files. We describe the design and implementation of these features and illustrate how NetCDF files from a modeling application may be curated in HydroShare and thus enhance reproducibility of the associated research. We also discuss future development planned for multidimensional space-time data in HydroShare.
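The header-extraction step described above can be sketched in Python as follows. This is a hedged illustration rather than HydroShare's code: it assumes the header text has already been produced by `ncdump -h`, it parses only string-valued global attributes, and the function name and sample header are hypothetical.

```python
import re

# Hypothetical text in the layout that "ncdump -h" prints for a small file.
SAMPLE_HEADER = """netcdf sample {
dimensions:
    time = UNLIMITED ; // (12 currently)
variables:
    float pr(time) ;
        pr:units = "kg m-2 s-1" ;

// global attributes:
        :title = "Sample precipitation" ;
        :Conventions = "CF-1.6" ;
}
"""

# Global attributes start with a bare colon; variable attributes such as
# "pr:units" have a variable name before the colon and are skipped.
GLOBAL_ATTRIBUTE = re.compile(r'^[ \t]*:(\w+)[ \t]*=[ \t]*"(.*)"[ \t]*;', re.MULTILINE)


def global_attributes(header_text):
    """Return string-valued global attributes found in an ncdump-style header."""
    return {name: value for name, value in GLOBAL_ATTRIBUTE.findall(header_text)}
```

Applied to `SAMPLE_HEADER`, the function returns the `title` and `Conventions` attributes, which are the kind of fields a resource form could pre-fill. Numeric attributes and multi-line values would need extra handling that this sketch omits.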
Automated Transformation of CDISC ODM to OpenClinica.
Gessner, Sophia; Storck, Michael; Hegselmann, Stefan; Dugas, Martin; Soto-Rey, Iñaki
2017-01-01
Due to the increasing use of electronic data capture systems for clinical research, the interest in saving resources by automatically generating and reusing case report forms in clinical studies is growing. OpenClinica, an open-source electronic data capture system, enables the reuse of metadata in its own Excel import template, which hampers the reuse of metadata defined in other standard formats. One of these standard formats is the Operational Data Model for metadata, administrative and clinical data in clinical studies. This work suggests a mapping from the Operational Data Model to OpenClinica and describes the implementation of a converter that automatically generates OpenClinica-conformant case report forms based upon metadata in the Operational Data Model.
NASA Astrophysics Data System (ADS)
Hart, Andrew F.; Cinquini, Luca; Khudikyan, Shakeh E.; Thompson, David R.; Mattmann, Chris A.; Wagstaff, Kiri; Lazio, Joseph; Jones, Dayton
2015-01-01
“Fast radio transients” are defined here as bright millisecond pulses of radio-frequency energy. These short-duration pulses can be produced by known objects such as pulsars or potentially by more exotic objects such as evaporating black holes. The identification and verification of such an event would be of great scientific value. This is one major goal of the Very Long Baseline Array (VLBA) Fast Transient Experiment (V-FASTR), a software-based detection system installed at the VLBA. V-FASTR uses a “commensal” (piggy-back) approach, analyzing all array data continually during routine VLBA observations and identifying candidate fast transient events. Raw data can be stored from a buffer memory, which enables a comprehensive off-line analysis. This is invaluable for validating the astrophysical origin of any detection. Candidates discovered by the automatic system must be reviewed each day by analysts to identify any promising signals that warrant a more in-depth investigation. To support the timely analysis of fast transient detection candidates by V-FASTR scientists, we have developed a metadata-driven, collaborative candidate review framework. The framework consists of a software pipeline for metadata processing composed of both open source software components and project-specific code written expressly to extract and catalog metadata from the incoming V-FASTR data products, and a web-based data portal that facilitates browsing and inspection of the available metadata for candidate events extracted from the VLBA radio data.
Automatic textual annotation of video news based on semantic visual object extraction
NASA Astrophysics Data System (ADS)
Boujemaa, Nozha; Fleuret, Francois; Gouet, Valerie; Sahbi, Hichem
2003-12-01
In this paper, we present our work on automatic generation of textual metadata based on visual content analysis of video news. We present two methods for semantic object detection and recognition from a cross-modal image-text thesaurus. This thesaurus represents a supervised association between models and semantic labels. This paper is concerned with two semantic objects: faces and TV logos. In the first part, we present our work on efficient face detection and recognition with automatic name generation. This method also allows us to suggest textual annotation of shots through close-up estimation. In the second part, we were interested in automatically detecting and recognizing the different TV logos present in incoming news from different TV channels. This work was done jointly with the French TV channel TF1 within the "MediaWorks" project, which consists of a hybrid text-image indexing and retrieval platform for video news.
A metadata approach for clinical data management in translational genomics studies in breast cancer.
Papatheodorou, Irene; Crichton, Charles; Morris, Lorna; Maccallum, Peter; Davies, Jim; Brenton, James D; Caldas, Carlos
2009-11-30
In molecular profiling studies of cancer patients, experimental and clinical data are combined in order to understand the clinical heterogeneity of the disease: clinical information for each subject needs to be linked to tumour samples, macromolecules extracted, and experimental results. This may involve the integration of clinical data sets from several different sources: these data sets may employ different data definitions and some may be incomplete. In this work we employ semantic web techniques developed within the CancerGrid project, in particular the use of metadata elements and logic-based inference to annotate heterogeneous clinical information, integrate and query it. We show how this integration can be achieved automatically, following the declaration of appropriate metadata elements for each clinical data set; we demonstrate the practicality of this approach through application to experimental results and clinical data from five hospitals in the UK and Canada, undertaken as part of the METABRIC project (Molecular Taxonomy of Breast Cancer International Consortium). We describe a metadata approach for managing similarities and differences in clinical datasets in a standardized way that uses Common Data Elements (CDEs). We apply and evaluate the approach by integrating the five different clinical datasets of METABRIC.
The ground truth about metadata and community detection in networks.
Peel, Leto; Larremore, Daniel B; Clauset, Aaron
2017-05-01
Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex system's components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because these networks' links are formed explicitly based on those known communities. However, there are no planted communities in real-world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. We show that metadata are not the same as ground truth and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value, so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structures.
Policy enabled information sharing system
Jorgensen, Craig R.; Nelson, Brian D.; Ratheal, Steve W.
2014-09-02
A technique for dynamically sharing information includes executing a sharing policy indicating when to share a data object responsive to the occurrence of an event. The data object is created by formatting a data file to be shared with a receiving entity. The data object includes a file data portion and a sharing metadata portion. The data object is encrypted and then automatically transmitted to the receiving entity upon occurrence of the event. The sharing metadata portion includes metadata characterizing the data file and referenced in connection with the sharing policy to determine when to automatically transmit the data object to the receiving entity.
Zhang, Mingyuan; Fiol, Guilherme Del; Grout, Randall W.; Jonnalagadda, Siddhartha; Medlin, Richard; Mishra, Rashmi; Weir, Charlene; Liu, Hongfang; Mostafa, Javed; Fiszman, Marcelo
2014-01-01
Online knowledge resources such as Medline can address most clinicians’ patient care information needs. Yet, significant barriers, notably lack of time, limit the use of these sources at the point of care. The most common information needs raised by clinicians are treatment-related. Comparative effectiveness studies allow clinicians to consider multiple treatment alternatives for a particular problem. Still, solutions are needed to enable efficient and effective consumption of comparative effectiveness research at the point of care. Objective: Design and assess an algorithm for automatically identifying comparative effectiveness studies and extracting the interventions investigated in these studies. Methods: The algorithm combines semantic natural language processing, Medline citation metadata, and machine learning techniques. We assessed the algorithm in a case study of treatment alternatives for depression. Results: Both precision and recall for identifying comparative studies were 0.83. A total of 86% of the interventions extracted perfectly or partially matched the gold standard. Conclusion: Overall, the algorithm achieved reasonable performance. The method provides building blocks for the automatic summarization of comparative effectiveness research to inform point of care decision-making. PMID:23920677
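For reference, the reported figures can be read against the standard definitions of precision and recall. The abstract does not report an F-measure; the worked value below simply follows from the usual balanced F-measure definition when precision and recall are both 0.83, with TP, FP and FN denoting true positives, false positives and false negatives.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2(0.83)(0.83)}{0.83 + 0.83} = 0.83
```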
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hart, Andrew F.; Cinquini, Luca; Khudikyan, Shakeh E.
2015-01-01
“Fast radio transients” are defined here as bright millisecond pulses of radio-frequency energy. These short-duration pulses can be produced by known objects such as pulsars or potentially by more exotic objects such as evaporating black holes. The identification and verification of such an event would be of great scientific value. This is one major goal of the Very Long Baseline Array (VLBA) Fast Transient Experiment (V-FASTR), a software-based detection system installed at the VLBA. V-FASTR uses a “commensal” (piggy-back) approach, analyzing all array data continually during routine VLBA observations and identifying candidate fast transient events. Raw data can be stored from a buffer memory, which enables a comprehensive off-line analysis. This is invaluable for validating the astrophysical origin of any detection. Candidates discovered by the automatic system must be reviewed each day by analysts to identify any promising signals that warrant a more in-depth investigation. To support the timely analysis of fast transient detection candidates by V-FASTR scientists, we have developed a metadata-driven, collaborative candidate review framework. The framework consists of a software pipeline for metadata processing composed of both open source software components and project-specific code written expressly to extract and catalog metadata from the incoming V-FASTR data products, and a web-based data portal that facilitates browsing and inspection of the available metadata for candidate events extracted from the VLBA radio data.
Extraction of CT dose information from DICOM metadata: automated Matlab-based approach.
Dave, Jaydev K; Gingold, Eric L
2013-01-01
The purpose of this study was to extract exposure parameters and dose-relevant indexes of CT examinations from information embedded in DICOM metadata. DICOM dose report files were identified and retrieved from a PACS. An automated software program was used to extract from these files information from the structured elements in the DICOM metadata relevant to exposure. Extracting information from DICOM metadata eliminated potential errors inherent in techniques based on optical character recognition, yielding 100% accuracy.
iLOG: A Framework for Automatic Annotation of Learning Objects with Empirical Usage Metadata
ERIC Educational Resources Information Center
Miller, L. D.; Soh, Leen-Kiat; Samal, Ashok; Nugent, Gwen
2012-01-01
Learning objects (LOs) are digital or non-digital entities used for learning, education or training commonly stored in repositories searchable by their associated metadata. Unfortunately, based on the current standards, such metadata is often missing or incorrectly entered making search difficult or impossible. In this paper, we investigate…
PDF text classification to leverage information extraction from publication reports.
Bui, Duy Duc An; Del Fiol, Guilherme; Jonnalagadda, Siddhartha
2016-06-01
Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges to the underlying natural language processing algorithm. Our goal is to categorize PDF texts for strategic use by IE systems. We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, and compared it with machine learning classifiers, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p<0.001) higher than the best performing machine learning classifier that used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved performance of the outcome extraction system (F-measure +4.1%, p=0.002). It also reduced the number of sentences to be processed by 44.9% (p<0.001), which corresponds to a processing time reduction of 50% (p=0.005).
The rule-based multi-pass sieve framework can be used effectively in categorizing texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents. Copyright © 2016 Elsevier Inc. All rights reserved.
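The multi-pass sieve idea described above can be sketched roughly as follows. The abstract does not give the actual rules, so the three sieves, their thresholds, and the `UNKNOWN` fallback label are hypothetical placeholders; only the control flow (ordered passes, most precise first, each labeling what earlier passes left open) is meant to illustrate the technique.

```python
import re
from typing import Callable, List, Optional

# A sieve inspects one snippet and returns a label, or None to leave it
# for a later (less precise) pass.
Sieve = Callable[[str], Optional[str]]


def metadata_sieve(text: str) -> Optional[str]:
    # Hypothetical high-precision rule: copyright lines, DOIs, e-mail addresses.
    if re.search(r"(©|copyright|doi:|@\w+\.)", text, re.IGNORECASE):
        return "METADATA"
    return None


def semistructure_sieve(text: str) -> Optional[str]:
    # Hypothetical rule: snippets dominated by digits look like table rows.
    chars = [c for c in text if not c.isspace()]
    if chars and sum(c.isdigit() for c in chars) / len(chars) > 0.4:
        return "SEMISTRUCTURE"
    return None


def bodytext_sieve(text: str) -> Optional[str]:
    # Hypothetical low-precision rule: longer leftovers are treated as narrative.
    if len(text.split()) >= 8:
        return "BODYTEXT"
    return None


# Ordered from most to least precise (an assumption of this sketch).
SIEVES: List[Sieve] = [metadata_sieve, semistructure_sieve, bodytext_sieve]


def classify_all(snippets: List[str]) -> List[str]:
    labels: List[Optional[str]] = [None] * len(snippets)
    for sieve in SIEVES:  # one full pass over the snippets per sieve
        for i, text in enumerate(snippets):
            if labels[i] is None:  # earlier passes take priority
                labels[i] = sieve(text)
    return [label or "UNKNOWN" for label in labels]
```

Because each pass only touches snippets that are still unlabeled, a new rule can be added at the appropriate precision level without revisiting the others.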
Textural-Contextual Labeling and Metadata Generation for Remote Sensing Applications
NASA Technical Reports Server (NTRS)
Kiang, Richard K.
1999-01-01
Despite the extensive research and the advent of several new information technologies in the last three decades, machine labeling of ground categories using remotely sensed data has not become a routine process. A considerable amount of human intervention is needed to achieve an acceptable level of labeling accuracy. A number of fundamental reasons may explain why machine labeling has not become automatic. In addition, there may be shortcomings in the methodology for labeling ground categories. The spatial information of a pixel, whether textural or contextual, relates a pixel to its surroundings. This information should be utilized to improve the performance of machine labeling of ground categories. Landsat-4 Thematic Mapper (TM) data taken in July 1982 over an area in the vicinity of Washington, D.C. are used in this study. On-line texture extraction by neural networks may not be the most efficient way to incorporate textural information into the labeling process. Texture features are pre-computed from cooccurrence matrices and then combined with a pixel's spectral and contextual information as the input to a neural network. The improvement in labeling accuracy with spatial information included is significant. The prospect of automatic generation of metadata consisting of ground categories, textural and contextual information is discussed.
Utilizing Linked Open Data Sources for Automatic Generation of Semantic Metadata
NASA Astrophysics Data System (ADS)
Nummiaho, Antti; Vainikainen, Sari; Melin, Magnus
In this paper we present an application that can be used to automatically generate semantic metadata for tags given as simple keywords. The application, which we have implemented in the Java programming language, creates the semantic metadata by linking the tags to concepts in different semantic knowledge bases (CrunchBase, DBpedia, Freebase, KOKO, Opencyc, Umbel and/or WordNet). The steps that our application takes in doing so include detecting possible languages, finding spelling suggestions and finding meanings from amongst the proper nouns and common nouns separately. Currently, our application supports English, Finnish and Swedish words, but other languages could be included easily if the required lexical tools (spellcheckers, etc.) are available. The created semantic metadata can be of great use in, e.g., finding and combining similar contents, creating recommendations and targeting advertisements.
The ground truth about metadata and community detection in networks
Peel, Leto; Larremore, Daniel B.; Clauset, Aaron
2017-01-01
Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex system’s components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because these networks’ links are formed explicitly based on those known communities. However, there are no planted communities in real-world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. We show that metadata are not the same as ground truth and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value, so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structures. PMID:28508065
Chao, Tian-Jy; Kim, Younghun
2015-02-03
Automatically translating a building architecture file format (Industry Foundation Class) to a simulation file, in one aspect, may extract data and metadata used by a target simulation tool from a building architecture file. Interoperability data objects may be created and the extracted data is stored in the interoperability data objects. A model translation procedure may be prepared to identify a mapping from a Model View Definition to a translation and transformation function. The extracted data may be transformed using the data stored in the interoperability data objects, an input Model View Definition template, and the translation and transformation function to convert the extracted data to correct geometric values needed for a target simulation file format used by the target simulation tool. The simulation file in the target simulation file format may be generated.
Misra, Dharitri; Chen, Siyuan; Thoma, George R
2009-01-01
One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, official government records, etc., where the metadata is contained within the body of the documents, a cost effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques. At the U. S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts. In this paper, we describe the design of our AME system, with focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.
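A rule-based pattern search of the kind described above might look, in very reduced form, like the sketch below. This is not the NLM system: the field names, the regular expressions, and the idea of using cue words such as "Case No." as the surrounding context are all assumptions made for illustration, and the layout recognition step (SVM and HMM) is not modeled at all.

```python
import re
from typing import Dict

# Hypothetical rules for one recognized layout: field name -> pattern with a
# single capture group. The cue words in front of each group stand in for the
# string context "surrounding" a field.
RULES: Dict[str, re.Pattern] = {
    "case_number": re.compile(r"Case\s+No\.?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "date": re.compile(r"Dated?:?\s*(\w+ \d{1,2}, \d{4})"),
}


def extract_fields(text: str) -> Dict[str, str]:
    """Return every field whose pattern matches; unmatched fields are omitted."""
    found: Dict[str, str] = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1).strip()
    return found
```

In a fuller design, one such rule set would presumably be selected per layout class, so that the search only runs patterns that make sense for the recognized document type.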
A metadata-driven approach to data repository design.
Harvey, Matthew J; McLean, Andrew; Rzepa, Henry S
2017-01-01
The design and use of a metadata-driven data repository for research data management is described. Metadata is collected automatically during the submission process whenever possible and is registered with DataCite in accordance with their current metadata schema, in exchange for a persistent digital object identifier. Two examples of data preview are illustrated, including the demonstration of a method for integration with commercial software that confers rich domain-specific data analytics without introducing customisation into the repository itself.
Software for minimalistic data management in large camera trap studies
Krishnappa, Yathin S.; Turner, Wendy C.
2014-01-01
The use of camera traps is now widespread and their importance in wildlife studies well understood. Camera trap studies can produce millions of photographs and there is a need for software to help manage photographs efficiently. In this paper, we describe a software system that was built to successfully manage a large behavioral camera trap study that produced more than a million photographs. We describe the software architecture and the design decisions that shaped the evolution of the program over the study’s three-year period. The software system has the ability to automatically extract metadata from images, and add customized metadata to the images in a standardized format. The software system can be installed as a standalone application on popular operating systems. It is minimalistic, scalable and extendable so that it can be used by small teams or individual researchers for a broad variety of camera trap studies. PMID:25110471
Protocols for Scholarly Communication
NASA Astrophysics Data System (ADS)
Pepe, A.; Yeomans, J.
2007-10-01
CERN, the European Organization for Nuclear Research, has operated an institutional preprint repository for more than 10 years. The repository contains over 850,000 records of which more than 450,000 are full-text OA preprints, mostly in the field of particle physics, and it is integrated with the library's holdings of books, conference proceedings, journals and other grey literature. In order to encourage effective propagation and open access to scholarly material, CERN is implementing a range of innovative library services into its document repository: automatic keywording, reference extraction, collaborative management tools and bibliometric tools. Some of these services, such as user reviewing and automatic metadata extraction, could make up an interesting testbed for future publishing solutions and certainly provide an exciting environment for e-science possibilities. The future protocol for scientific communication should guide authors naturally towards OA publication, and CERN wants to help reach a full open access publishing environment for the particle physics community and related sciences in the next few years.
NASA Astrophysics Data System (ADS)
Dolloff, John; Hottel, Bryant; Edwards, David; Theiss, Henry; Braun, Aaron
2017-05-01
This paper presents an overview of the Full Motion Video-Geopositioning Test Bed (FMV-GTB) developed to investigate algorithm performance and issues related to the registration of motion imagery and subsequent extraction of feature locations along with predicted accuracy. A case study is included corresponding to a video taken from a quadcopter. Registration of the corresponding video frames is performed without the benefit of a priori sensor attitude (pointing) information. In particular, tie points are automatically measured between adjacent frames using standard optical flow matching techniques from computer vision, an a priori estimate of sensor attitude is then computed based on supplied GPS sensor positions contained in the video metadata and a photogrammetric/search-based structure from motion algorithm, and then a Weighted Least Squares adjustment of all a priori metadata across the frames is performed. Extraction of absolute 3D feature locations, including their predicted accuracy based on the principles of rigorous error propagation, is then performed using a subset of the registered frames. Results are compared to known locations (check points) over a test site. Throughout this entire process, no external control information (e.g. surveyed points) is used other than for evaluation of solution errors and corresponding accuracy.
Misra, Dharitri; Chen, Siyuan; Thoma, George R.
2010-01-01
One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, official government records, etc., where the metadata is contained within the body of the documents, a cost effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques. At the U. S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts. In this paper, we describe the design of our AME system, with focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system. PMID:21179386
Scalable Data Mining and Archiving for the Square Kilometre Array
NASA Astrophysics Data System (ADS)
Jones, D. L.; Mattmann, C. A.; Hart, A. F.; Lazio, J.; Bennett, T.; Wagstaff, K. L.; Thompson, D. R.; Preston, R.
2011-12-01
As the technologies for remote observation improve, the rapid increase in the frequency and fidelity of those observations translates into an avalanche of data that is already beginning to eclipse the resources, both human and technical, of the institutions and facilities charged with managing the information. Common data management tasks like cataloging both the data itself and contextual meta-data, creating and maintaining a scalable permanent archive, and making data available on-demand for research present significant software engineering challenges when considered at the scales of modern multi-national scientific enterprises such as the upcoming Square Kilometre Array project. The NASA Jet Propulsion Laboratory (JPL), leveraging internal research and technology development funding, has begun to explore ways to address the data archiving and distribution challenges with a number of parallel activities involving collaborations with the EVLA and ALMA teams at the National Radio Astronomy Observatory (NRAO), and members of the Square Kilometre Array South Africa team. To date, we have leveraged the Apache OODT Process Control System framework and its catalog and archive service components that provide file management, workflow management, and resource management as core web services. A client crawler framework ingests upstream data (e.g., EVLA raw directory output), identifies its MIME type and automatically extracts relevant metadata including temporal bounds and job-relevant/processing information. A remote content acquisition (pushpull) service is responsible for staging remote content and handing it off to the crawler framework. A science algorithm wrapper (called CAS-PGE) wraps underlying code including CASApy programs for the EVLA, such as Continuum Imaging and Spectral Line Cube generation, executes the algorithm, and ingests its output (along with relevant extracted metadata).
In addition to processing, the Process Control System has been leveraged to provide data curation and automatic ingestion for the MeerKAT/KAT-7 precursor instrument in South Africa, helping to catalog and archive correlator and sensor output from KAT-7, and to make the information available for downstream science analysis. These efforts, supported by the increasing availability of high-quality open source software, represent a concerted effort to seek a cost-conscious methodology for maintaining the integrity of observational data from the upstream instrument to the archive, and at the same time ensuring that the data, with its richly annotated catalog of meta-data, remains a viable resource for research into the future.
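The crawler step described above (walk incoming files, identify a MIME type, extract basic metadata) can be illustrated with a small standard-library sketch. This is not the Apache OODT API; the function name `crawl` and the record keys are hypothetical, and `mimetypes.guess_type` guesses from the file name only, not from file content.

```python
import mimetypes
import os
from datetime import datetime, timezone
from typing import Dict, Iterator


def crawl(root: str) -> Iterator[Dict[str, object]]:
    """Yield one metadata record per file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Extension-based guess; falls back to a generic binary type.
            mime, _encoding = mimetypes.guess_type(path)
            stat = os.stat(path)
            modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
            yield {
                "path": path,
                "mime_type": mime or "application/octet-stream",
                "size_bytes": stat.st_size,
                # Modification time used as a stand-in for temporal bounds.
                "modified_utc": modified.isoformat(),
            }
```

A real ingest pipeline would hand each record to a catalog service; here the records are simply yielded so a caller can decide what to do with them.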
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chao, Tian-Jy; Kim, Younghun
Automatically translating a building architecture file format (Industry Foundation Class) to a simulation file, in one aspect, may extract data and metadata used by a target simulation tool from a building architecture file. Interoperability data objects may be created and the extracted data is stored in the interoperability data objects. A model translation procedure may be prepared to identify a mapping from a Model View Definition to a translation and transformation function. The extracted data may be transformed using the data stored in the interoperability data objects, an input Model View Definition template, and the translation and transformation function to convert the extracted data to correct geometric values needed for a target simulation file format used by the target simulation tool. The simulation file in the target simulation file format may be generated.
The Materials Data Facility: Data Services to Advance Materials Science Research
NASA Astrophysics Data System (ADS)
Blaiszik, B.; Chard, K.; Pruyne, J.; Ananthakrishnan, R.; Tuecke, S.; Foster, I.
2016-08-01
With increasingly strict data management requirements from funding agencies and institutions, expanding focus on the challenges of research replicability, and growing data sizes and heterogeneity, new data needs are emerging in the materials community. The materials data facility (MDF) operates two cloud-hosted services, data publication and data discovery, with features to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific) and automatically-extracted metadata in a registry while the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. The MDF services empower individual researchers, research projects, and institutions to (I) publish research datasets, regardless of size, from local storage, institutional data stores, or cloud storage, without involvement of third-party publishers; (II) build, share, and enforce extensible domain-specific custom metadata schemas; (III) interact with published data and metadata via representational state transfer (REST) application program interfaces (APIs) to facilitate automation, analysis, and feedback; and (IV) access a data discovery model that allows researchers to search, interrogate, and eventually build on existing published data. We describe MDF's design, current status, and future plans.
Streamlining Metadata and Data Management for Evolving Digital Libraries
NASA Astrophysics Data System (ADS)
Clark, D.; Miller, S. P.; Peckman, U.; Smith, J.; Aerni, S.; Helly, J.; Sutton, D.; Chase, A.
2003-12-01
What began two years ago as an effort to stabilize the Scripps Institution of Oceanography (SIO) data archives from more than 700 cruises going back 50 years, has now become the operational fully-searchable "SIOExplorer" digital library, complete with thousands of historic photographs, images, maps, full text documents, binary data files, and 3D visualization experiences, totaling nearly 2 terabytes of digital content. Coping with data diversity and complexity has proven to be more challenging than dealing with large volumes of digital data. SIOExplorer has been built with scalability in mind, so that the addition of new data types and entire new collections may be accomplished with ease. It is a federated system, currently interoperating with three independent data-publishing authorities, each responsible for their own quality control, metadata specifications, and content selection. The IT architecture implemented at the San Diego Supercomputer Center (SDSC) streamlines the integration of additional projects in other disciplines with a suite of metadata management and collection building tools for "arbitrary digital objects." Metadata are automatically harvested from data files into domain-specific metadata blocks, and mapped into various specification standards as needed. Metadata can be browsed and objects can be viewed onscreen or downloaded for further analysis, with automatic proprietary-hold request management.
CMO: Cruise Metadata Organizer for JAMSTEC Research Cruises
NASA Astrophysics Data System (ADS)
Fukuda, K.; Saito, H.; Hanafusa, Y.; Vanroosebeke, A.; Kitayama, T.
2011-12-01
JAMSTEC's Data Research Center for Marine-Earth Sciences manages and distributes a wide variety of observational data and samples obtained from JAMSTEC research vessels and deep sea submersibles. Generally, metadata are essential to identify how data and samples were obtained. In JAMSTEC, cruise metadata include cruise information such as cruise ID, name of vessel, research theme, and diving information such as dive number, name of submersible and position of diving point. They are submitted by chief scientists of research cruises in the Microsoft Excel® spreadsheet format, and registered into a data management database to confirm receipt of observational data files, cruise summaries, and cruise reports. The cruise metadata are also published via "JAMSTEC Data Site for Research Cruises" within two months after the end of a cruise. Furthermore, these metadata are distributed with observational data, images and samples via several data and sample distribution websites after a publication moratorium period. However, there are two operational issues in the metadata publishing process. One is duplicated effort and asynchronous metadata across multiple distribution websites, caused by manual metadata entry into individual websites by administrators. The other is differing data types or representations of metadata on each website. To solve those problems, we have developed a cruise metadata organizer (CMO) which allows cruise metadata to be connected from the data management database to several distribution websites. CMO is comprised of three components: an Extensible Markup Language (XML) database, Enterprise Application Integration (EAI) software, and a web-based interface. The XML database is used because of its flexibility for any change of metadata. Daily differential uptake of metadata from the data management database to the XML database is automatically processed via the EAI software.
Some metadata are entered into the XML database using the web-based interface by a metadata editor in CMO as needed. Then daily differential uptake of metadata from the XML database to databases in several distribution websites is automatically processed using a convertor defined by the EAI software. Currently, CMO is available for three distribution websites: "Deep Sea Floor Rock Sample Database GANSEKI", "Marine Biological Sample Database", and "JAMSTEC E-library of Deep-sea Images". CMO is planned to provide "JAMSTEC Data Site for Research Cruises" with metadata in the future.
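The "daily differential uptake" described above amounts to computing what changed in a source store since the last run and applying only those changes downstream. A minimal sketch follows, with plain dictionaries keyed by a hypothetical cruise ID standing in for the XML database and the website databases; CMO itself uses EAI software, and whether deletions are propagated is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Store = Dict[str, Record]  # hypothetical: cruise ID -> metadata fields


def diff_records(source: Store, target: Store) -> Tuple[List[str], List[str], List[str]]:
    """Return the keys that were added, changed, or removed relative to target."""
    added = sorted(k for k in source if k not in target)
    changed = sorted(k for k in source if k in target and source[k] != target[k])
    removed = sorted(k for k in target if k not in source)
    return added, changed, removed


def sync(source: Store, target: Store) -> Tuple[int, int, int]:
    """Apply only the differences to target; return (added, changed, removed) counts."""
    added, changed, removed = diff_records(source, target)
    for key in added + changed:
        target[key] = dict(source[key])  # copy so the stores stay independent
    for key in removed:
        del target[key]
    return len(added), len(changed), len(removed)
```

Running `sync` once per day for each distribution website would keep every target aligned with the single source, which is the duplication and asynchrony problem the abstract sets out to address.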
de Lusignan, Simon; Liaw, Siaw-Teng; Michalakidis, Georgios; Jones, Simon
2011-01-01
The burden of chronic disease is increasing, and research and quality improvement will be less effective if case finding strategies are suboptimal. To describe an ontology-driven approach to case finding in chronic disease and how this approach can be used to create a data dictionary and make the codes used in case finding transparent. A five-step process: (1) identifying a reference coding system or terminology; (2) using an ontology-driven approach to identify cases; (3) developing metadata that can be used to identify the extracted data; (4) mapping the extracted data to the reference terminology; and (5) creating the data dictionary. Hypertension is presented as an exemplar. A patient with hypertension can be represented by a range of codes including diagnostic, history and administrative. Metadata can link the coding system and data extraction queries to the correct data mapping and translation tool, which then maps it to the equivalent code in the reference terminology. The code extracted, the term, its domain and subdomain, and the name of the data extraction query can then be automatically grouped and published online as a readily searchable data dictionary. An exemplar online is: www.clininf.eu/qickd-data-dictionary.html Adopting an ontology-driven approach to case finding could improve the quality of disease registers and of research based on routine data. It would offer considerable advantages over using limited datasets to define cases. This approach should be considered by those involved in research and quality improvement projects which utilise routine data.
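The grouping step described in this abstract (code, term, domain, subdomain and query name collected into a searchable data dictionary) can be illustrated with a minimal Python sketch. All codes, terms, query names and the mapping table below are hypothetical placeholders invented for illustration; they are not taken from the study or from any real terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedCode:
    code: str        # code as stored in the source system (hypothetical)
    term: str        # human-readable term
    domain: str      # e.g. diagnostic, history, administrative
    subdomain: str
    query_name: str  # name of the data extraction query that returned it


# Hypothetical mapping from local codes to a reference terminology.
LOCAL_TO_REFERENCE = {
    "LOCAL-HT-01": "REF-HYPERTENSION",
    "LOCAL-HT-02": "REF-HYPERTENSION",
    "LOCAL-BP-REVIEW": "REF-HYPERTENSION-MONITORING",
}

EXAMPLE = [
    ExtractedCode("LOCAL-HT-01", "Essential hypertension", "diagnostic", "cardiovascular", "q_htn_dx"),
    ExtractedCode("LOCAL-HT-02", "History of hypertension", "history", "cardiovascular", "q_htn_hx"),
    ExtractedCode("LOCAL-BP-REVIEW", "Hypertension annual review", "administrative", "recall", "q_htn_admin"),
    ExtractedCode("LOCAL-UNKNOWN", "Unrecognised entry", "diagnostic", "other", "q_htn_dx"),
]


def build_data_dictionary(extracted):
    """Group extracted codes under their reference-terminology code."""
    dictionary = {}
    for item in extracted:
        # Codes without a mapping are kept visible rather than dropped.
        reference = LOCAL_TO_REFERENCE.get(item.code, "UNMAPPED")
        dictionary.setdefault(reference, []).append(item)
    return dictionary


if __name__ == "__main__":
    for reference, items in build_data_dictionary(EXAMPLE).items():
        print(reference, [item.code for item in items])
```

Keeping unmapped codes under an explicit "UNMAPPED" key is a design choice of this sketch: it makes gaps in the mapping transparent, which is in the spirit of the transparency goal stated in the abstract.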
Semantic Similarity Graphs of Mathematics Word Problems: Can Terminology Detection Help?
ERIC Educational Resources Information Center
John, Rogers Jeffrey Leo; Passonneau, Rebecca J.; McTavish, Thomas S.
2015-01-01
Curricula often lack metadata to characterize the relatedness of concepts. To investigate automatic methods for generating relatedness metadata for a mathematics curriculum, we first address the task of identifying which terms in the vocabulary from mathematics word problems are associated with the curriculum. High chance-adjusted interannotator…
Automatic Content Recommendation and Aggregation According to SCORM
ERIC Educational Resources Information Center
Neves, Daniel Eugênio; Brandão, Wladmir Cardoso; Ishitani, Lucila
2017-01-01
Although widely used, the SCORM metadata model for content aggregation is difficult for educators, content developers and instructional designers to use. Particularly, the identification of contents related to each other, in large repositories, and their aggregation using metadata as defined in SCORM, has been demanding efforts of computer…
Exploring Characterizations of Learning Object Repositories Using Data Mining Techniques
NASA Astrophysics Data System (ADS)
Segura, Alejandra; Vidal, Christian; Menendez, Victor; Zapata, Alfredo; Prieto, Manuel
Learning object repositories provide a platform for the sharing of Web-based educational resources. As these repositories evolve independently, it is difficult for users to have a clear picture of the kind of contents they give access to. Metadata can be used to automatically extract a characterization of these resources by using machine learning techniques. This paper presents an exploratory study carried out on the contents of four public repositories that uses clustering and association rule mining algorithms to extract characterizations of repository contents. The results of the analysis include potential relationships between different attributes of learning objects that may be useful to gain an understanding of the kind of resources available and eventually develop search mechanisms that consider repository descriptions as a criterion in federated search.
DOE Office of Scientific and Technical Information (OSTI.GOV)
He, Fei; Maslov, Sergei; Yoo, Shinjae
Here, transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metadata or differences in annotation styles by different labs. In this study, we carefully selected and integrated 6,057 Arabidopsis microarray expression samples from 304 experiments deposited to NCBI GEO. Metadata such as tissue type, growth condition, and developmental stage were manually curated for each sample. We then studied the global expression landscape of the integrated dataset and found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome compared to aerial tissues, but the transcriptome of cultured root is more similar to those of aerial tissues as the former samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating re-use of plant transcriptome data. As a proof of principle we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified accuracy of our predictions with samples’ metadata provided by authors.
He, Fei; Maslov, Sergei; Yoo, Shinjae; ...
2016-05-25
Here, transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metadata or differences in annotation styles by different labs. In this study, we carefully selected and integrated 6,057 Arabidopsis microarray expression samples from 304 experiments deposited to NCBI GEO. Metadata such as tissue type, growth condition, and developmental stage were manually curated for each sample. We then studied the global expression landscape of the integrated dataset and found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome compared to aerial tissues, but the transcriptome of cultured root is more similar to those of aerial tissues as the former samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating re-use of plant transcriptome data. As a proof of principle we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified accuracy of our predictions with samples’ metadata provided by authors.
Valdez, Joshua; Rueschman, Michael; Kim, Matthew; Redline, Susan; Sahoo, Satya S
2016-10-01
Extraction of structured information from biomedical literature is a complex and challenging problem due to the complexity of the biomedical domain and the lack of appropriate natural language processing (NLP) techniques. High quality domain ontologies model both data and metadata information at a fine level of granularity, which can be effectively used to accurately extract structured information from biomedical text. Extraction of provenance metadata, which describes the history or source of information, from published articles is an important task to support scientific reproducibility. Reproducibility of results reported by previous research studies is a foundational component of scientific advancement. This is highlighted by the recent initiative by the US National Institutes of Health called "Principles of Rigor and Reproducibility". In this paper, we describe an effective approach to extract provenance metadata from published biomedical research literature using an ontology-enabled NLP platform as part of the Provenance for Clinical and Healthcare Research (ProvCaRe) project. The ProvCaRe-NLP tool extends the clinical Text Analysis and Knowledge Extraction System (cTAKES) platform using both provenance and biomedical domain ontologies. We demonstrate the effectiveness of the ProvCaRe-NLP tool using a corpus of 20 peer-reviewed publications. The results of our evaluation demonstrate that the ProvCaRe-NLP tool has significantly higher recall in extracting provenance metadata as compared to existing NLP pipelines such as MetaMap.
Assessing Public Metabolomics Metadata, Towards Improving Quality.
Ferreira, João D; Inácio, Bruno; Salek, Reza M; Couto, Francisco M
2017-12-13
Public resources need to be appropriately annotated with metadata in order to make them discoverable, reproducible and traceable, further enabling them to be interoperable or integrated with other datasets. While data-sharing policies exist to promote the annotation process by data owners, these guidelines are still largely ignored. In this manuscript, we analyse automatic measures of metadata quality, and suggest their application as a means to encourage data owners to increase the metadata quality of their resources and submissions, thereby contributing to higher quality data, improved data sharing, and the overall accountability of scientific publications. We analyse these metadata quality measures in the context of a real-world repository of metabolomics data (i.e. MetaboLights), including a manual validation of the measures, and an analysis of their evolution over time. Our findings suggest that the proposed measures can be used to mimic a manual assessment of metadata quality.
NASA Astrophysics Data System (ADS)
Peckham, S. D.
2017-12-01
Standardized, deep descriptions of digital resources (e.g. data sets, computational models, software tools and publications) make it possible to develop user-friendly software systems that assist scientists with the discovery and appropriate use of these resources. Semantic metadata makes it possible for machines to take actions on behalf of humans, such as automatically identifying the resources needed to solve a given problem, retrieving them and then automatically connecting them (despite their heterogeneity) into a functioning workflow. Standardized model metadata also helps model users to understand the important details that underpin computational models and to compare the capabilities of different models. These details include simplifying assumptions on the physics, governing equations and the numerical methods used to solve them, discretization of space (the grid) and time (the time-stepping scheme), state variables (input or output), and model configuration parameters. This kind of metadata provides a "deep description" of a computational model that goes well beyond other types of metadata (e.g. author, purpose, scientific domain, programming language, digital rights, provenance, execution) and captures the science that underpins a model. A carefully constructed, unambiguous and rules-based schema to address this problem, called the Geoscience Standard Names ontology, will be presented; it utilizes Semantic Web best practices and technologies. It has also been designed to work across science domains and to be readable by both humans and machines.
Automatic publishing ISO 19115 metadata with PanMetaDocs using SensorML information
NASA Astrophysics Data System (ADS)
Stender, Vivien; Ulbricht, Damian; Schroeder, Matthias; Klump, Jens
2014-05-01
Terrestrial Environmental Observatories (TERENO) is an interdisciplinary and long-term research project spanning an Earth observation network across Germany. It includes four test sites within Germany from the North German lowlands to the Bavarian Alps and is operated by six research centers of the Helmholtz Association. The contribution by the participating research centers is organized as regional observatories. A challenge for TERENO and its observatories is to integrate all aspects of data management, data workflows, data modeling and visualizations into the design of a monitoring infrastructure. TERENO Northeast is one of the sub-observatories of TERENO and is operated by the German Research Centre for Geosciences (GFZ) in Potsdam. This observatory investigates geoecological processes in the northeastern lowland of Germany by collecting large amounts of environmentally relevant data. The success of long-term projects like TERENO depends on well-organized data management, data exchange between the partners involved and on the availability of the captured data. Data discovery and dissemination are facilitated not only through data portals of the regional TERENO observatories but also through a common spatial data infrastructure TEODOOR (TEreno Online Data repOsitORry). TEODOOR bundles the data, provided by the different web services of the single observatories, and provides tools for data discovery, visualization and data access. The TERENO Northeast data infrastructure integrates data from more than 200 instruments and makes data available through standard web services. Geographic sensor information and services are described using the ISO 19115 metadata schema. TEODOOR accesses the OGC Sensor Web Enablement (SWE) interfaces offered by the regional observatories. In addition to the SWE interface, TERENO Northeast also published data through DataCite. 
The necessary metadata are created in an automated process by extracting information from the SWE SensorML to create ISO 19115 compliant metadata. The resulting metadata file is stored in the GFZ Potsdam data infrastructure. The publishing workflow for file based research datasets at GFZ Potsdam is based on the eSciDoc infrastructure, using PanMetaDocs (PMD) as the graphical user interface. PMD is a collaborative, metadata based data and information exchange platform [1]. Besides SWE, metadata are also syndicated by PMD through an OAI-PMH interface. In addition, metadata from other observatories, projects or sensors in TERENO can be accessed through the TERENO Northeast data portal. [1] http://meetingorganizer.copernicus.org/EGU2012/EGU2012-7058-2.pdf
Passenger baggage object database (PBOD)
NASA Astrophysics Data System (ADS)
Gittinger, Jaxon M.; Suknot, April N.; Jimenez, Edward S.; Spaulding, Terry W.; Wenrich, Steve A.
2018-04-01
Detection of anomalies of interest in x-ray images is an ever-evolving problem that requires the rapid development of automatic detection algorithms. Automatic detection algorithms are developed using machine learning techniques, which would require developers to obtain the x-ray machine that was used to create the images being trained on, and compile all associated metadata for those images by hand. The Passenger Baggage Object Database (PBOD) and data acquisition application were designed and developed for acquiring and persisting 2-D and 3-D x-ray image data and associated metadata. PBOD was specifically created to capture simulated airline passenger "stream of commerce" luggage data, but could be applied to other areas of x-ray imaging to utilize machine-learning methods.
Raising orphans from a metadata morass: A researcher's guide to re-use of public 'omics data.
Bhandary, Priyanka; Seetharam, Arun S; Arendsee, Zebulun W; Hur, Manhoi; Wurtele, Eve Syrkin
2018-02-01
More than 15 petabases of raw RNAseq data is now accessible through public repositories. Acquisition of other 'omics data types is expanding, though most lack a centralized archival repository. Data-reuse provides tremendous opportunity to extract new knowledge from existing experiments, and offers a unique opportunity for robust, multi-'omics analyses by merging metadata (information about experimental design, biological samples, protocols) and data from multiple experiments. We illustrate how predictive research can be accelerated by meta-analysis with a study of orphan (species-specific) genes. Computational predictions are critical to infer orphan function because their coding sequences provide very few clues. The metadata in public databases is often confusing; a test case with Zea mays mRNA seq data reveals a high proportion of missing, misleading or incomplete metadata. This metadata morass significantly diminishes the insight that can be extracted from these data. We provide tips for data submitters and users, including specific recommendations to improve metadata quality by more use of controlled vocabulary and by metadata reviews. Finally, we advocate for a unified, straightforward metadata submission and retrieval system. Copyright © 2017 Elsevier B.V. All rights reserved.
Park, Yu Rang; Yoon, Young Jo; Kim, Hye Hyeon; Kim, Ju Han
2013-01-01
Achieving semantic interoperability is critical for biomedical data sharing between individuals, organizations and systems. The ISO/IEC 11179 MetaData Registry (MDR) standard has been recognized as one of the solutions for this purpose. The standard model, however, is limited. Concepts that consist of two or more values, such as blood pressure with systolic and diastolic values, cannot be represented. We addressed the structural limitations of ISO/IEC 11179 by an integrated metadata object model in our previous research. In the present study, we introduce semantic extensions for the model by defining three new types of semantic relationships: dependency, composite and variable relationships. To evaluate our extensions in a real world setting, we measured the efficiency of metadata reduction achieved by mapping metadata items to other existing ones. We extracted metadata from the College of American Pathologists Cancer Protocols and then evaluated our extensions. With no semantic loss, one third of the extracted metadata could be successfully eliminated, suggesting a better strategy for implementing clinical MDRs with improved efficiency and utility.
The Digital Sample: Metadata, Unique Identification, and Links to Data and Publications
NASA Astrophysics Data System (ADS)
Lehnert, K. A.; Vinayagamoorthy, S.; Djapic, B.; Klump, J.
2006-12-01
A significant part of digital data in the Geosciences refers to physical samples of Earth materials, from igneous rocks to sediment cores to water or gas samples. The application and long-term utility of these sample-based data in research is critically dependent on (a) the availability of information (metadata) about the samples such as geographical location and time of sampling, or sampling method, (b) links between the different data types available for individual samples that are dispersed in the literature and in digital data repositories, and (c) access to the samples themselves. Major problems for achieving this include incomplete documentation of samples in publications, use of ambiguous sample names, and the lack of a central catalog in which a sample's archiving location can be found. The International Geo Sample Number IGSN, managed by the System for Earth Sample Registration SESAR, provides solutions for these problems. The IGSN is a unique persistent identifier for samples and other GeoObjects that can be obtained by submitting sample metadata to SESAR (www.geosamples.org). If data in a publication is referenced to an IGSN (rather than an ambiguous sample name), sample metadata can readily be extracted from the SESAR database, which evolves into a Global Sample Catalog that also makes it possible to locate the owner or curator of the sample. Use of the IGSN in digital data systems allows building linkages between distributed data. SESAR is contributing to the development of sample metadata standards. SESAR will integrate the IGSN in persistent, resolvable identifiers based on the handle.net service to advance direct linkages between the digital representation of samples in SESAR (sample profiles) and their related data in the literature and in web-accessible digital data repositories. Technologies outlined by Klump et al. 
(this session) such as the automatic creation of ontologies by text mining applications will be explored for harvesting identifiers of publications and datasets that contain information about a specific sample in order to establish comprehensive data profiles for samples.
ASDC Collaborations and Processes to Ensure Quality Metadata and Consistent Data Availability
NASA Astrophysics Data System (ADS)
Trapasso, T. J.
2017-12-01
With the introduction of new tools, faster computing, and less expensive storage, increased volumes of data are expected to be managed with existing or fewer resources. Metadata management is becoming a heightened challenge as data volume increases, resulting in more metadata records needing to be curated for each product. To address metadata availability and completeness, NASA ESDIS has taken significant strides with the creation of the Unified Metadata Model (UMM) and Common Metadata Repository (CMR). The UMM helps address hurdles created by the increasing number of metadata dialects, and the CMR provides a primary repository for metadata so that required metadata fields can be served through a growing number of tools and services. However, metadata quality remains an issue as metadata is not always inherent to the end-user. In response to these challenges, the NASA Atmospheric Science Data Center (ASDC) created the Collaboratory for quAlity Metadata Preservation (CAMP) and defined the Product Lifecycle Process (PLP) to work congruently. CAMP is unique in that it provides science team members a UI to directly supply metadata that is complete, compliant, and accurate for their data products. This replaces back-and-forth communication that often results in misinterpreted metadata. Upon review by ASDC staff, metadata is submitted to CMR for broader distribution through Earthdata. Further, approval of science team metadata in CAMP automatically triggers the ASDC PLP workflow to ensure appropriate services are applied throughout the product lifecycle. This presentation will review the design elements of CAMP and PLP as well as demonstrate interfaces to each. It will show the benefits that CAMP and PLP provide to the ASDC that could potentially benefit additional NASA Earth Science Data and Information System (ESDIS) Distributed Active Archive Centers (DAACs).
Composing Data Parallel Code for a SPARQL Graph Engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste
Big data analytics processes large amounts of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph-based representation provides several benefits, such as the possibility to perform in-memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which require more than basic graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 Shared-memory Multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.
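To make "queries on RDF-structured data through graph matching" concrete, here is a minimal sequential Python sketch of basic graph pattern matching over a list of triples. It is not the generated OpenMP code the abstract describes and makes no claim about how that tool works internally; the triples, the predicate names and the `?variable` convention in plain tuples are assumptions chosen for illustration.

```python
def is_variable(term):
    """Terms starting with '?' are treated as query variables."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Return an extended copy of `binding` if `pattern` matches `triple`, else None."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            # A variable must keep the value it was bound to earlier.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def query(triples, patterns):
    """Evaluate a basic graph pattern by joining one triple pattern at a time."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


# Hypothetical data: (subject, predicate, object) triples.
TRIPLES = [
    ("paper1", "hasAuthor", "alice"),
    ("paper2", "hasAuthor", "bob"),
    ("alice", "worksAt", "labX"),
    ("bob", "worksAt", "labY"),
]

# Roughly: SELECT ?paper WHERE { ?paper hasAuthor ?person . ?person worksAt labX }
PATTERNS = [
    ("?paper", "hasAuthor", "?person"),
    ("?person", "worksAt", "labX"),
]

if __name__ == "__main__":
    print(query(TRIPLES, PATTERNS))
```

The nested loop over bindings and triples is the part that a parallel implementation would distribute across threads; this sketch keeps it sequential for readability.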
A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records
Weissenbacher, Davy; Rivera, Robert; Beard, Rachel; Firago, Mari; Wallstrom, Garrick; Scotch, Matthew; Gonzalez, Graciela
2016-01-01
Objective: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping them to their latitude/longitudes using knowledge derived from external geographical databases. Materials and Methods: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitudes of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus. Results: We found the precision, recall, and f-measure of our system for linking GenBank records to the latitude/longitudes of their LOIH to be 0.832, 0.967, and 0.894, respectively. Discussion: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction. Conclusion: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles. PMID:26911818
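The three reported scores can be checked against each other. The short Python sketch below assumes that "f-measure" means the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state this explicitly, but the reported numbers are consistent with it.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported in the abstract.
    score = f_measure(0.832, 0.967)
    print(round(score, 3))  # 0.894
```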
A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records.
Tahsin, Tasnia; Weissenbacher, Davy; Rivera, Robert; Beard, Rachel; Firago, Mari; Wallstrom, Garrick; Scotch, Matthew; Gonzalez, Graciela
2016-09-01
The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping them to their latitude/longitudes using knowledge derived from external geographical databases. We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitudes of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus. We found the precision, recall, and f-measure of our system for linking GenBank records to the latitude/longitudes of their LOIH to be 0.832, 0.967, and 0.894, respectively. Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction. Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Enriching the trustworthiness of health-related web pages.
Gaudinat, Arnaud; Cruchet, Sarah; Boyer, Celia; Chrawdhry, Pravir
2011-06-01
We present an experimental mechanism for enriching web content with quality metadata. This mechanism is based on a simple and well-known initiative in the field of the health-related web, the HONcode. The Resource Description Framework (RDF) format and the Dublin Core Metadata Element Set were used to formalize these metadata. The model of trust proposed is based on a quality model for health-related web pages that has been tested in practice over a period of thirteen years. Our model has been explored in the context of a project to develop a research tool that automatically detects the occurrence of quality criteria in health-related web pages.
MEMOPS: data modelling and automatic code generation.
Fogh, Rasmus H; Boucher, Wayne; Ionides, John M C; Vranken, Wim F; Stevens, Tim J; Laue, Ernest D
2010-03-25
In recent years the amount of biological data has exploded to the point where much useful information can only be extracted by complex computational analyses. Such analyses are greatly facilitated by metadata standards, both in terms of the ability to compare data originating from different sources, and in terms of exchanging data in standard forms, e.g. when running processes on a distributed computing infrastructure. However, standards thrive on stability whereas science tends to constantly move, with new methods being developed and old ones modified. Therefore maintaining both metadata standards, and all the code that is required to make them useful, is a non-trivial problem. Memops is a framework that uses an abstract definition of the metadata (described in UML) to generate internal data structures and subroutine libraries for data access (application programming interfaces--APIs--currently in Python, C and Java) and data storage (in XML files or databases). For the individual project these libraries obviate the need for writing code for input parsing, validity checking or output. Memops also ensures that the code is always internally consistent, massively reducing the need for code reorganisation. Across a scientific domain a Memops-supported data model makes it easier to support complex standards that can capture all the data produced in a scientific area, share them among all programs in a complex software pipeline, and carry them forward to deposition in an archive. The principles behind the Memops generation code will be presented, along with example applications in Nuclear Magnetic Resonance (NMR) spectroscopy and structural biology.
Ignizio, Drew A.; O'Donnell, Michael S.; Talbert, Colin B.
2014-01-01
Creating compliant metadata for scientific data products is mandated for all federal Geographic Information Systems professionals and is a best practice for members of the geospatial data community. However, the complexity of the Federal Geographic Data Committee’s Content Standards for Digital Geospatial Metadata, the limited availability of easy-to-use tools, and recent changes in the ESRI software environment continue to make metadata creation a challenge. Staff at the U.S. Geological Survey Fort Collins Science Center have developed a Python toolbox for ESRI ArcDesktop to facilitate a semi-automated workflow to create and update metadata records in ESRI’s 10.x software. The U.S. Geological Survey Metadata Wizard tool automatically populates several metadata elements: the spatial reference, spatial extent, geospatial presentation format, vector feature count or raster column/row count, native system/processing environment, and the metadata creation date. Once the software auto-populates these elements, users can easily add attribute definitions and other relevant information in a simple Graphical User Interface. The tool, which offers a simple design free of esoteric metadata language, has the potential to save many government and non-government organizations a significant amount of time and costs by facilitating the development of metadata compliant with the Federal Geographic Data Committee’s Content Standards for Digital Geospatial Metadata for ESRI software users. A working version of the tool is now available for ESRI ArcDesktop, versions 10.0, 10.1, and 10.2 (downloadable at http://www.sciencebase.gov/metadatawizard).
Document Classification in Support of Automated Metadata Extraction Form Heterogeneous Collections
ERIC Educational Resources Information Center
Flynn, Paul K.
2014-01-01
A number of federal agencies, universities, laboratories, and companies are placing their documents online and making them searchable via metadata fields such as author, title, and publishing organization. To enable this, every document in the collection must be catalogued using the metadata fields. Though time consuming, the task of identifying…
ERIC Educational Resources Information Center
Association for Computing Machinery, New York, NY.
Papers in this Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (Roanoke, Virginia, June 24-28, 2001) discuss: automatic genre analysis; text categorization; automated name authority control; automatic event generation; linked active content; designing e-books for legal research; metadata harvesting; mapping the…
2008-06-01
provides a means for file owners to add metadata which can then be used by iTunes for cataloging and searching [4]. Metadata can be stored in different...based and contain AAC data formats [3]. Specifically, Apple uses Protected AAC to encode copy-protected music titles purchased from the iTunes Music...Store [4]. The files purchased from the iTunes Music Store include the following metadata. • Name • Email address of purchaser • Year • Album
Data publication and sharing using the SciDrive service
NASA Astrophysics Data System (ADS)
Mishin, Dmitry; Medvedev, D.; Szalay, A. S.; Plante, R. L.
2014-01-01
Despite recent progress in scientific data storage, there is still no public storage and sharing system for relatively small scientific datasets. These collections form the “long tail” of the power-law distribution of dataset sizes. The aggregated size of the long-tail data is comparable to the size of all data collections from large archives, and the value of the data is significant. The SciDrive project's main goal is to provide the scientific community with a place to store such data reliably and freely and to make them accessible to the broad scientific community. The primary target audience of the project is the astronomy community, and it will be extended to other fields. We aim to create a simple way of publishing a dataset, which can then be shared with other people. The data owner controls the permissions to modify and access the data and can assign a group of users or open the access to everyone. The data contained in the dataset will be automatically recognized by a background process. Known data formats will be extracted according to the user's settings. Currently, tabular data can be automatically extracted to the user's MyDB table, where the user can run SQL queries against the dataset and merge it with other public CasJobs resources. Other data formats can be processed using a set of plugins that upload the data or metadata to user-defined side services. The current implementation targets some of the data formats commonly used by the astronomy communities, including FITS, ASCII and Excel tables, TIFF images, and YT simulation data archives. Along with generic metadata, format-specific metadata is also processed. For example, basic information about celestial objects is extracted from FITS files and TIFF images, if present. A 100TB implementation has just been put into production at Johns Hopkins University.
The system features a public data storage REST service supporting the VOSpace 2.0 and Dropbox protocols, an HTML5 web portal, a command-line client, and a standalone Java client to synchronize a local folder with the remote storage. We use the VAO SSO (Single Sign On) service from NCSA for user authentication, which provides free registration for everyone.
ModelArchiver—A program for facilitating the creation of groundwater model archives
Winston, Richard B.
2018-03-01
ModelArchiver is a program designed to facilitate the creation of groundwater model archives that meet the requirements of the U.S. Geological Survey (USGS) policy (Office of Groundwater Technical Memorandum 2016.02, https://water.usgs.gov/admin/memo/GW/gw2016.02.pdf, https://water.usgs.gov/ogw/policy/gw-model/). ModelArchiver version 1.0 leads the user step-by-step through the process of creating a USGS groundwater model archive. The user specifies the contents of each of the subdirectories within the archive and provides descriptions of the archive contents. Descriptions of some files can be specified automatically using file extensions. Descriptions also can be specified individually. Those descriptions are added to a readme.txt file provided by the user. ModelArchiver moves the content of the archive to the archive folder and compresses some folders into .zip files. As part of the archive, the modeler must create a metadata file describing the archive. The program has a built-in metadata editor and provides links to websites that can aid in creation of the metadata. The built-in metadata editor is also available as a stand-alone program named FgdcMetaEditor version 1.0, which also is described in this report. ModelArchiver updates the metadata file provided by the user with descriptions of the files in the archive. An optional archive list file generated automatically by ModelMuse can streamline the creation of archives by identifying input files, output files, model programs, and ancillary files for inclusion in the archive.
EARS: Repositioning data management near data acquisition.
NASA Astrophysics Data System (ADS)
Sinquin, Jean-Marc; Sorribas, Jordi; Diviacco, Paolo; Vandenberghe, Thomas; Munoz, Raquel; Garcia, Oscar
2016-04-01
The EU FP7 projects Eurofleets and Eurofleets2 are a Europe-wide alliance of marine research centers that aim to share their research vessels, to improve information sharing on planned, current and completed cruises and on details of ocean-going research vessels and specialized equipment, and to durably improve the cost-effectiveness of cruises. Within this context, a log of how, when and where anything happens on board the vessel is crucial information for data users at a later stage. It forms an essential step in data quality control, as it can assist in understanding anomalies and unexpected trends recorded in the acquired data sets. In this way the completeness of the metadata is improved, as it is recorded accurately at the origin of the measurement. Across the European research fleet, this information has been collected in very different ways, using different procedures, formats and pieces of software. When the Eurofleets project started, every institution and country had adopted different strategies and approaches. This complicated the task of users who need to log general-purpose information and events on board whenever they access a different platform, and the opportunity to produce this valuable metadata on board was lost. Among the many goals of the Eurofleets project, a very important task is the development of "event log software" called EARS (Eurofleets Automatic Reporting System) that enables scientists and operators to record what happens during a survey. EARS will allow users to fill, in a standardized way, the current gap in metadata description, which only very seldom links data with its history. Events generated automatically by acquisition instruments will also be handled, enhancing the granularity and precision of the event annotation.
The adoption of a common procedure to log survey events and a common terminology to describe them is crucial to providing a user-friendly and successful procedure for creating metadata on board across the whole European fleet. The possibility of automatically reporting metadata and general-purpose data will simplify the work of scientists and data managers with regard to data transmission. Improved accuracy and completeness of metadata are expected when events are recorded at acquisition time. This will also enhance multiple uses of the data, as it allows verification of the different requirements existing in different disciplines.
Using RDF and Git to Realize a Collaborative Metadata Repository.
Stöhr, Mark R; Majeed, Raphael W; Günther, Andreas
2018-01-01
The German Center for Lung Research (DZL) is a research network with the aim of researching respiratory diseases. The participating study sites' register data differs in terms of software and coding system as well as data field coverage. To perform meaningful consortium-wide queries through one single interface, a uniform conceptual structure is required covering the DZL common data elements. No single existing terminology includes all our concepts. Potential candidates such as LOINC and SNOMED only cover specific subject areas or are not granular enough for our needs. To achieve a broadly accepted and complete ontology, we developed a platform for collaborative metadata management. The DZL data management group formulated detailed requirements regarding the metadata repository and the user interfaces for metadata editing. Our solution builds upon existing standard technologies allowing us to meet those requirements. Its key parts are RDF and the distributed version control system Git. We developed a software system to publish updated metadata automatically and immediately after performing validation tests for completeness and consistency.
A Prototype Publishing Registry for the Virtual Observatory
NASA Astrophysics Data System (ADS)
Williamson, R.; Plante, R.
2004-07-01
In the Virtual Observatory (VO), a registry helps users locate resources, such as data and services, in a distributed environment. A general framework for VO registries is now under development within the International Virtual Observatory Alliance (IVOA) Registry Working Group. We present a prototype of one component of this framework: the publishing registry. The publishing registry allows data providers to expose metadata descriptions of their resources to the VO environment. Searchable registries can harvest the metadata from many publishing registries and make them searchable by users. We have developed a prototype publishing registry that data providers can install at their sites to publish their resources. The descriptions are exposed using the Open Archives Initiative (OAI) Protocol for Metadata Harvesting. Automating the input of metadata into registries is critical when a provider wishes to describe many resources. We illustrate various strategies for such automation, both currently in use and planned for the future. We also describe how future versions of the registry can adapt automatically to evolving metadata schemas for describing resources.
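As a rough illustration of the harvesting step, the sketch below builds an OAI-PMH `ListRecords` request URL and pulls the record identifiers out of a response. The base URL, the identifiers and the sample XML are made up for illustration; only the argument names (`verb`, `metadataPrefix`, `resumptionToken`) and the response namespace come from the OAI-PMH specification. It is not the prototype registry's own code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH a resumptionToken is an exclusive argument, so it replaces
    metadataPrefix when a partial list is being continued.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the identifier of every record header in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hypothetical response from a publishing registry, trimmed to the headers.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>ivo://example.org/catalog-a</identifier></header></record>
    <record><header><identifier>ivo://example.org/service-b</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://registry.example.org/oai"))
print(record_identifiers(SAMPLE_RESPONSE))
```

A searchable registry would repeat the request with each returned resumption token until the publishing registry stops supplying one.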
Hybrid Multiagent System for Automatic Object Learning Classification
NASA Astrophysics Data System (ADS)
Gil, Ana; de La Prieta, Fernando; López, Vivian F.
The rapid evolution within the context of e-learning is closely linked to international efforts on the standardization of learning object metadata, which provides learners in a web-based educational system with ubiquitous access to multiple distributed repositories. This article presents a hybrid agent-based architecture that enables the recovery of learning objects tagged in Learning Object Metadata (LOM) and provides individualized help with selecting learning materials to make the most suitable choice among many alternatives.
ALE: automated label extraction from GEO metadata.
Giles, Cory B; Brown, Chase A; Ripperger, Michael; Dennis, Zane; Roopnarinesingh, Xiavan; Porter, Hunter; Perz, Aleksandra; Wren, Jonathan D
2017-12-28
NCBI's Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because they ensure standardization and consistency. While some of these labels can be extracted directly from the textual metadata, many of the data available do not contain explicit text informing the researcher about the age and gender of the subjects in the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence. Our analysis shows only 26% of metadata text contains information about gender and 21% about age. In order to ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels, based upon gene expression of the samples, and compare this to the text-based method. Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach and machine learning. We show that the two methods together improve the accuracy of label assignment to GEO samples.
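A text-based heuristic of the kind mentioned above might look like the following sketch. The regular expressions, the `extract_labels` function and the sample strings are assumptions made for illustration; they are not the rules used by ALE, and real GEO descriptions are far less regular than this.

```python
import re

# Illustrative patterns only: they expect "key: value" or "key = value" pairs.
GENDER_RE = re.compile(
    r"\b(?:gender|sex)\s*[:=]\s*(male|female|m|f)\b", re.IGNORECASE
)
AGE_RE = re.compile(
    r"\bage\s*[:=]\s*(\d+(?:\.\d+)?)\s*(years?|yrs?|months?|weeks?)?",
    re.IGNORECASE,
)


def extract_labels(text):
    """Return whichever of gender/age can be read directly from the text."""
    labels = {}
    gender = GENDER_RE.search(text)
    if gender:
        value = gender.group(1).lower()
        labels["gender"] = "male" if value in ("male", "m") else "female"
    age = AGE_RE.search(text)
    if age:
        labels["age"] = float(age.group(1))
        labels["age_unit"] = (age.group(2) or "unspecified").lower()
    return labels


print(extract_labels("tissue: liver; Sex: F; age: 54 years"))
print(extract_labels("total RNA from cultured cells, stage: 3"))
```

Samples for which such rules return nothing are the ones where, as the abstract describes, expression-based prediction has to fill the gap.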
Database integration in a multimedia-modeling environment
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dorow, Kevin E.
2002-09-02
Integration of data from disparate remote sources has direct applicability to modeling, which can support Brownfield assessments. To accomplish this task, a data integration framework needs to be established. A key element in this framework is the metadata that creates the relationship between the pieces of information that are important in the multimedia modeling environment and the information that is stored in the remote data source. The design philosophy is to allow modelers and database owners to collaborate by defining this metadata in such a way that allows interaction between their components. The main parts of this framework include tools to facilitate metadata definition, database extraction plan creation, automated extraction plan execution / data retrieval, and a central clearing house for metadata and modeling / database resources. Cross-platform compatibility (using Java) and standard communications protocols (http / https) allow these parts to run in a wide variety of computing environments (Local Area Networks, Internet, etc.), and, therefore, this framework provides many benefits. Because of the specific data relationships described in the metadata, the amount of data that have to be transferred is kept to a minimum (only the data that fulfill a specific request are provided as opposed to transferring the complete contents of a data source). This allows for real-time data extraction from the actual source. Also, the framework sets up collaborative responsibilities such that the different types of participants have control over the areas in which they have domain knowledge: the modelers are responsible for defining the data relevant to their models, while the database owners are responsible for mapping the contents of the database using the metadata definitions. Finally, the data extraction mechanism allows for the ability to control access to the data and what data are made available.
Harvesting Intelligence in Multimedia Social Tagging Systems
NASA Astrophysics Data System (ADS)
Giannakidou, Eirini; Kaklidou, Foteini; Chatzilari, Elisavet; Kompatsiaris, Ioannis; Vakali, Athena
As more people adopt tagging practices, social tagging systems tend to form rich knowledge repositories that enable the extraction of patterns reflecting the way content semantics is perceived by the web users. This is of particular importance, especially in the case of multimedia content, since the availability of such content in the web is very high and its efficient retrieval using textual annotations or content-based automatically extracted metadata still remains a challenge. It is argued that complementing multimedia analysis techniques with knowledge drawn from web social annotations may facilitate multimedia content management. This chapter focuses on analyzing tagging patterns and combining them with content feature extraction methods, generating, thus, intelligence from multimedia social tagging systems. Emphasis is placed on using all available "tracks" of knowledge, that is tag co-occurrence together with semantic relations among tags and low-level features of the content. Towards this direction, a survey on the theoretical background and the adopted practices for analysis of multimedia social content are presented. A case study from Flickr illustrates the efficiency of the proposed approach.
Managing biomedical image metadata for search and retrieval of similar images.
Korenblum, Daniel; Rubin, Daniel; Napel, Sandy; Rodriguez, Cesar; Beaulieu, Chris
2011-08-01
Radiology images are generally disconnected from the metadata describing their contents, such as imaging observations ("semantic" metadata), which are usually described in text reports that are not directly linked to the images. We developed a system, the Biomedical Image Metadata Manager (BIMM) to (1) address the problem of managing biomedical image metadata and (2) facilitate the retrieval of similar images using semantic feature metadata. Our approach allows radiologists, researchers, and students to take advantage of the vast and growing repositories of medical image data by explicitly linking images to their associated metadata in a relational database that is globally accessible through a Web application. BIMM receives input in the form of standard-based metadata files using Web service and parses and stores the metadata in a relational database allowing efficient data query and maintenance capabilities. Upon querying BIMM for images, 2D regions of interest (ROIs) stored as metadata are automatically rendered onto preview images included in search results. The system's "match observations" function retrieves images with similar ROIs based on specific semantic features describing imaging observation characteristics (IOCs). We demonstrate that the system, using IOCs alone, can accurately retrieve images with diagnoses matching the query images, and we evaluate its performance on a set of annotated liver lesion images. BIMM has several potential applications, e.g., computer-aided detection and diagnosis, content-based image retrieval, automating medical analysis protocols, and gathering population statistics like disease prevalences. The system provides a framework for decision support systems, potentially improving their diagnostic accuracy and selection of appropriate therapies.
Solutions for extracting file level spatial metadata from airborne mission data
NASA Astrophysics Data System (ADS)
Schwab, M. J.; Stanley, M.; Pals, J.; Brodzik, M.; Fowler, C.; Icebridge Engineering/Spatial Metadata
2011-12-01
Authors: Michael Stanley, Mark Schwab, Jon Pals, Mary J. Brodzik, Cathy Fowler. Collaboration: Raytheon EED and NSIDC. Raytheon / EED, 5700 Rivertech Court, Riverdale, MD 20737; NSIDC, University of Colorado, UCB 449, Boulder, CO 80309-0449. Data sets acquired from satellites and aircraft may differ in many ways. We will focus on the differences in spatial coverage between the two platforms. Satellite data sets over a given period typically cover large geographic regions. These data are collected in a consistent, predictable and well understood manner due to the uniformity of satellite orbits. Since satellite data collection paths are typically smooth and uniform, the data from satellite instruments can usually be described with simple spatial metadata. Consequently, these spatial metadata can be stored and searched easily and efficiently. Conversely, aircraft have significantly more freedom to change paths, circle, overlap, and vary altitude, all of which add complexity to the spatial metadata. Aircraft are also subject to wind and other elements that result in even more complicated and unpredictable spatial coverage areas. This unpredictability and complexity make it more difficult to extract usable spatial metadata from data sets collected on aircraft missions. It is not feasible to use all of the location data from aircraft mission data sets as spatial metadata. The number of data points in typical data sets poses serious performance problems for spatial searching. In order to provide efficient spatial searching of the large number of files cataloged in our systems, we need to extract approximate spatial descriptions as geo-polygons from a small number of vertices (fewer than two hundred). We present some of the challenges and solutions for creating airborne mission-derived spatial metadata.
We are implementing these methods to create the spatial metadata for insertion of IceBridge mission data into ECS for public access through NSIDC and ECHO, but they are potentially extensible to any aircraft mission data.
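One simple way to reduce a dense flight track to a polygon with few vertices is a convex hull, sketched below with the monotone chain algorithm. This is a stand-in chosen for illustration and is not the method described in the abstract: a hull overestimates coverage for winding or overlapping flight lines, and this planar version ignores the antimeridian and the poles. The sample track and the `MAX_VERTICES` name are hypothetical.

```python
MAX_VERTICES = 200  # vertex budget quoted in the abstract ("fewer than two hundred")


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Hypothetical (lon, lat) samples from one flight, in degrees.
track = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.0), (2.0, 2.0),
         (1.0, 1.0), (0.0, 2.0), (1.0, 1.8)]
footprint = convex_hull(track)
print(footprint, len(footprint) < MAX_VERTICES)
```

A hull of a long track can still exceed the vertex budget, so a production system would need an additional simplification step, and probably a concave outline, to describe circling or overlapping paths faithfully.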
Automatic Inference of Cryptographic Key Length Based on Analysis of Proof Tightness
2016-06-01
within an attack tree structure, then expand attack tree methodology to include cryptographic reductions. We then provide the algorithms for...maintaining and automatically reasoning about these expanded attack trees . We provide a software tool that utilizes machine-readable proof and attack metadata...and the attack tree methodology to provide rapid and precise answers regarding security parameters and effective security. This eliminates the need
Bridging the semantic gap in sports
NASA Astrophysics Data System (ADS)
Li, Baoxin; Errico, James; Pan, Hao; Sezan, M. Ibrahim
2003-01-01
One of the major challenges facing current media management systems and the related applications is the so-called "semantic gap" between the rich meaning that a user desires and the shallowness of the content descriptions that are automatically extracted from the media. In this paper, we address the problem of bridging this gap in the sports domain. We propose a general framework for indexing and summarizing sports broadcast programs. The framework is based on a high-level model of sports broadcast video using the concept of an event, defined according to domain-specific knowledge for different types of sports. Within this general framework, we develop automatic event detection algorithms that are based on automatic analysis of the visual and aural signals in the media. We have successfully applied the event detection algorithms to different types of sports including American football, baseball, Japanese sumo wrestling, and soccer. Event modeling and detection contribute to the reduction of the semantic gap by providing rudimentary semantic information obtained through media analysis. We further propose a novel approach, which makes use of independently generated rich textual metadata, to fill the gap completely through synchronization of the information-laden textual data with the basic event segments. An MPEG-7 compliant prototype browsing system has been implemented to demonstrate semantic retrieval and summarization of sports video.
MetaRNA-Seq: An Interactive Tool to Browse and Annotate Metadata from RNA-Seq Studies.
Kumar, Pankaj; Halama, Anna; Hayat, Shahina; Billing, Anja M; Gupta, Manish; Yousri, Noha A; Smith, Gregory M; Suhre, Karsten
2015-01-01
The number of RNA-Seq studies has grown in recent years. The design of RNA-Seq studies varies from very simple (e.g., two-condition case-control) to very complicated (e.g., time series involving multiple samples at each time point with separate drug treatments). Most of these publicly available RNA-Seq studies are deposited in NCBI databases, but their metadata are scattered throughout four different databases: Sequence Read Archive (SRA), Biosample, Bioprojects, and Gene Expression Omnibus (GEO). Although the NCBI web interface is able to provide all of the metadata information, it often requires significant effort to retrieve study- or project-level information by traversing through multiple hyperlinks and going to another page. Moreover, project- and study-level metadata lack manual or automatic curation by categories, such as disease type, time series, case-control, or replicate type, which are vital to comprehending any RNA-Seq study. Here we describe "MetaRNA-Seq," a new tool for interactively browsing, searching, and annotating RNA-Seq metadata with the capability of semiautomatic curation at the study level.
NASA Astrophysics Data System (ADS)
Prasad, U.; Rahabi, A.
2001-05-01
The following utilities, developed for dumping HDF-EOS format data, are of special use for Earth science data from NASA's Earth Observation System (EOS). This poster demonstrates their use and application. The first four tools take HDF-EOS data files as input. HDF-EOS Metadata Dumper - metadmp: The metadata dumper extracts metadata from EOS data granules. It operates by simply copying blocks of metadata from the file to the standard output. It does not process the metadata in any way. Since all metadata in EOS granules is encoded in the Object Description Language (ODL), the output of metadmp will be in the form of complete ODL statements. EOS data granules may contain up to three different sets of metadata (Core, Archive, and Structural Metadata). HDF-EOS Contents Dumper - heosls: The heosls dumper displays the contents of HDF-EOS files. This utility provides detailed information on the POINT, SWATH, and GRID data sets in the files. For example, it will list the geolocation fields, data fields and objects. HDF-EOS ASCII Dumper - asciidmp: The ASCII dump utility extracts fields from EOS data granules into plain ASCII text. The output from asciidmp should be easily human readable. With minor editing, asciidmp's output can be made ingestible by any application with ASCII import capabilities. HDF-EOS Binary Dumper - bindmp: The binary dumper utility dumps HDF-EOS objects in binary format. This is useful for feeding the output into existing programs that do not understand HDF, for example custom software and COTS products. HDF-EOS User Friendly Metadata - UFM: The UFM utility is useful for viewing ECS metadata. UFM takes an EOSDIS ODL metadata file and produces an HTML report of the metadata for display using a web browser. HDF-EOS METCHECK - METCHECK: METCHECK can be invoked from either a Unix or DOS environment with a set of command-line options that direct the tool's inputs and output.
METCHECK validates the inventory metadata in a .met file using the descriptor file (.desc) as the reference. The tool takes a .desc file and a .met file (an ODL file) as inputs and generates a simple output file that contains the results of the checking process.
Soto, Axel J; Zerva, Chrysoula; Batista-Navarro, Riza; Ananiadou, Sophia
2018-04-15
Pathway models are valuable resources that help us understand the various mechanisms underpinning complex biological processes. Their curation is typically carried out through manual inspection of published scientific literature to find information relevant to a model, which is a laborious and knowledge-intensive task. Furthermore, models curated manually cannot be easily updated and maintained with new evidence extracted from the literature without automated support. We have developed LitPathExplorer, a visual text analytics tool that integrates advanced text mining, semi-supervised learning and interactive visualization, to facilitate the exploration and analysis of pathway models using statements (i.e. events) extracted automatically from the literature and organized according to levels of confidence. LitPathExplorer supports pathway modellers and curators alike by: (i) extracting events from the literature that corroborate existing models with evidence; (ii) discovering new events which can update models; and (iii) providing a confidence value for each event that is automatically computed based on linguistic features and article metadata. Our evaluation of event extraction showed a precision of 89% and a recall of 71%. Evaluation of our confidence measure, when used for ranking sampled events, showed an average precision ranging between 61 and 73%, which can be improved to 95% when the user is involved in the semi-supervised learning process. Qualitative evaluation using pair analytics based on the feedback of three domain experts confirmed the utility of our tool within the context of pathway model exploration. LitPathExplorer is available at http://nactem.ac.uk/LitPathExplorer_BI/. Contact: sophia.ananiadou@manchester.ac.uk. Supplementary data are available at Bioinformatics online.
Visualization of JPEG Metadata
NASA Astrophysics Data System (ADS)
Malik Mohamad, Kamaruddin; Deris, Mustafa Mat
A JPEG image embeds much more information than just graphics. Visualizing its metadata would help digital forensic investigators view embedded data, including in corrupted images where no graphics can be displayed, and so assist in evidence collection for cases such as child pornography or steganography. Tools such as metadata readers, editors and extraction tools already exist, but they mostly focus on visualizing the attribute information of JPEG Exif. None, however, visualizes metadata by consolidating the marker summary, header structure, Huffman table and quantization table in a single program. In this paper, metadata visualization is done by developing a program that is able to summarize all existing markers, the header structure, the Huffman table and the quantization table in a JPEG file. The results show that visualization of metadata makes it easier to view the hidden information within a JPEG.
Park, Yu Rang; Kim, Ju Han
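A marker summary of the kind described can be approximated by walking the segment structure of the file. The sketch below is a minimal illustration and not the authors' program: it lists markers up to the start-of-scan segment, relies on the standard JPEG layout (each marker is `0xFF` plus a code byte, followed by a two-byte big-endian length that includes the length field itself), and names only a handful of marker codes. The sample byte string is synthetic.

```python
import struct

# A few common marker codes; DQT holds quantization tables, DHT Huffman tables.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE1: "APP1", 0xDB: "DQT",
    0xC0: "SOF0", 0xC4: "DHT", 0xDA: "SOS", 0xD9: "EOI",
}


def list_markers(data):
    """Return (name, offset, segment_length) tuples up to and including SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    found = [("SOI", 0, 0)]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        code = data[pos + 1]
        if code == 0xFF:  # fill byte before the real marker
            pos += 1
            continue
        name = MARKER_NAMES.get(code, f"0x{code:02X}")
        if code == 0xD9:  # EOI carries no length field
            found.append((name, pos, 0))
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        found.append((name, pos, length))
        if code == 0xDA:  # entropy-coded data follows; stop the summary here
            break
        pos += 2 + length
    return found


# Synthetic stream: SOI, APP0 with 2 payload bytes, DQT with 1, then SOS.
sample = (b"\xff\xd8"
          + b"\xff\xe0" + struct.pack(">H", 4) + b"JF"
          + b"\xff\xdb" + struct.pack(">H", 3) + b"\x00"
          + b"\xff\xda" + struct.pack(">H", 2) + b"\x00\xff\xd9")
print(list_markers(sample))
```

Because the walk depends only on segment headers, it can still report the markers of a file whose image data no longer decodes, which is the forensic case the abstract has in mind.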
2006-01-01
Standardized management of data elements (DEs) for Case Report Forms (CRFs) is crucial in a Clinical Trials Information System (CTIS). Traditional CTISs utilize organization-specific definitions and storage methods for DEs and CRFs. We developed a metadata-based DE management system for clinical trials, the Clinical and Histopathological Metadata Registry (CHMR), using the international standard for metadata registries (ISO 11179) for the management of cancer clinical trials information. CHMR was evaluated in cancer clinical trials with 1625 DEs extracted from the College of American Pathologists Cancer Protocols for 20 major cancers. PMID:17238675
Metadata to Support Data Warehouse Evolution
NASA Astrophysics Data System (ADS)
Solodovnikova, Darja
The focus of this chapter is the metadata necessary to support data warehouse evolution. We present a data warehouse framework that is able to track the evolution process and adapt data warehouse schemata and data extraction, transformation, and loading (ETL) processes. We discuss a significant part of the framework, the metadata repository, which stores information about the data warehouse, its logical and physical schemata and their versions. We propose a physical implementation of a multiversion data warehouse in a relational DBMS. For each modification of a data warehouse schema, we outline the changes that need to be made to the repository metadata and in the database.
Automatic evidence quality prediction to support evidence-based decision making.
Sarker, Abeed; Mollá, Diego; Paris, Cécile
2015-06-01
Evidence-based medicine practice requires practitioners to obtain the best available medical evidence, and appraise the quality of the evidence when making clinical decisions. Primarily due to the plethora of electronically available data from the medical literature, the manual appraisal of the quality of evidence is a time-consuming process. We present a fully automatic approach for predicting the quality of medical evidence in order to aid practitioners at point-of-care. Our approach extracts relevant information from medical article abstracts and utilises data from a specialised corpus to apply supervised machine learning for the prediction of the quality grades. Following an in-depth analysis of the usefulness of features (e.g., publication types of articles), they are extracted from the text via rule-based approaches and from the meta-data associated with the articles, and then applied in the supervised classification model. We propose the use of a highly scalable and portable approach using a sequence of high precision classifiers, and introduce a simple evaluation metric called average error distance (AED) that simplifies the comparison of systems. We also perform elaborate human evaluations to compare the performance of our system against human judgments. We test and evaluate our approaches on a publicly available, specialised, annotated corpus containing 1132 evidence-based recommendations. Our rule-based approach performs exceptionally well at the automatic extraction of publication types of articles, with F-scores of up to 0.99 for high-quality publication types. For evidence quality classification, our approach obtains an accuracy of 63.84% and an AED of 0.271. The human evaluations show that the performance of our system, in terms of AED and accuracy, is comparable to the performance of humans on the same data. The experiments suggest that our structured text classification framework achieves evaluation results comparable to those of human performance. 
Our overall classification approach and evaluation technique are also highly portable and can be used for various evidence grading scales. Copyright © 2015 Elsevier B.V. All rights reserved.
Li, Zuofeng; Wen, Jingran; Zhang, Xiaoyan; Wu, Chunxiao; Li, Zuogao; Liu, Lei
2012-01-01
Aiming to ease the secondary use of clinical data in clinical research, we introduce a metadata-driven, web-based clinical data management system named ClinData Express. ClinData Express is made up of two parts: 1) m-designer, standalone software for metadata definition; 2) a web-based data warehouse system for data management. With ClinData Express, researchers only need to define the metadata and data model in m-designer; the web interface for data collection and the specific database for data storage are then generated automatically. The standards used in the system and the data export module ensure that the data can be reused. The system has been tested on seven disease data collections in Chinese and one form from dbGaP. The flexibility of the system gives it great potential for use in clinical research. The system is available at http://code.google.com/p/clindataexpress. PMID:23304327
Page, Roderic D M
2011-05-23
The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive. A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site http://biostor.org/openurl/. This resolver can be used on the web, or called by bibliographic tools that support OpenURL. BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from http://biostor.org/.
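The matching step described above can be illustrated with a minimal sketch of approximate string matching. It uses Python's standard `difflib.SequenceMatcher`, not BioStor's own code; the function names, the normalisation rule, and the 0.8 threshold are assumptions made for illustration only.

```python
import re
from difflib import SequenceMatcher


def normalise(title: str) -> str:
    """Lower-case and collapse punctuation/whitespace so trivial differences do not count."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def best_match(query: str, candidates: list[str], threshold: float = 0.8):
    """Return (candidate, score) for the closest title, or None if nothing reaches the threshold.

    The threshold is an illustrative assumption, not a value taken from the abstract.
    """
    q = normalise(query)
    scored = [(c, SequenceMatcher(None, q, normalise(c)).ratio()) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


if __name__ == "__main__":
    # Hypothetical titles, used only to exercise the function.
    print(best_match("Notes on the genus Rana",
                     ["Notes on the Genus Rana.", "A revision of Bufo"]))
```

A real resolver would presumably also compare volume, page range and year, which is where the regular expressions and string alignment mentioned in the abstract would come in.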
Metadata mapping and reuse in caBIG.
Kunz, Isaac; Lin, Ming-Chin; Frey, Lewis
2009-02-05
This paper proposes that interoperability across biomedical databases can be improved by utilizing a repository of Common Data Elements (CDEs), UML model class-attributes and simple lexical algorithms to facilitate the building of domain models. This is examined in the context of an existing system, the National Cancer Institute (NCI)'s cancer Biomedical Informatics Grid (caBIG). The goal is to demonstrate the deployment of open source tools that can be used to effectively map models and enable the reuse of existing information objects and CDEs in the development of new models for translational research applications. This effort is intended to help developers reuse appropriate CDEs to enable interoperability of their systems when developing within the caBIG framework or other frameworks that use metadata repositories. The Dice (di-grams) and Dynamic algorithms are compared and both algorithms have similar performance matching UML model class-attributes to CDE class object-property pairs. With the algorithms used, the baselines for automatically finding the matches are reasonable for the data models examined. It suggests that automatic mapping of UML models and CDEs is feasible within the caBIG framework and potentially any framework that uses a metadata repository. This work opens up the possibility of using mapping algorithms to reduce cost and time required to map local data models to a reference data model such as those used within caBIG. This effort contributes to facilitating the development of interoperable systems within caBIG as well as other metadata frameworks. Such efforts are critical to address the need to develop systems to handle enormous amounts of diverse data that can be leveraged from new biomedical methodologies.
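As a rough illustration of the Dice (di-gram) comparison named above, the following sketch computes the standard Dice coefficient over character bigrams. It is not the caBIG tooling; the function names and the example attribute names are hypothetical, and lower-casing is an assumed preprocessing step.

```python
from collections import Counter


def bigrams(text: str) -> list[str]:
    """Split a string into overlapping two-character chunks (di-grams)."""
    s = text.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Dice coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b)."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0  # both strings shorter than two characters
    overlap = sum((ca & cb).values())  # multiset intersection
    return 2 * overlap / total


if __name__ == "__main__":
    # Hypothetical UML attribute name compared with hypothetical CDE property names.
    for candidate in ("Patient.birthDate", "Patient.deathDate", "Specimen.collectionDate"):
        print(candidate, round(dice("Patient.dateOfBirth", candidate), 3))
```

Ranking candidate CDEs by this score gives the kind of automatic baseline the abstract refers to; a curator would still confirm the top matches.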
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Valentine, D.; Richard, S. M.; Gupta, A.; Meier, O.; Peucker-Ehrenbrink, B.; Hudman, G.; Stocks, K. I.; Hsu, L.; Whitenack, T.; Grethe, J. S.; Ozyurt, I. B.
2017-12-01
EarthCube Data Discovery Hub (DDH) is an EarthCube Building Block project using technologies developed in CINERGI (Community Inventory of EarthCube Resources for Geoscience Interoperability) to enable geoscience users to explore a growing portfolio of EarthCube-created and other geoscience-related resources. Over 1 million metadata records are available for discovery through the project portal (cinergi.sdsc.edu). These records are retrieved from data facilities, including federal, state and academic sources, or contributed by geoscientists through workshops, surveys, or other channels. CINERGI metadata augmentation pipeline components 1) provide semantic enhancement based on a large ontology of geoscience terms, using text analytics to generate keywords with references to ontology classes, 2) add spatial extents based on place names found in the metadata record, and 3) add organization identifiers to the metadata. The records are indexed and can be searched via a web portal and standard search APIs. The added metadata content improves discoverability and interoperability of the registered resources. Specifically, the addition of ontology-anchored keywords enables faceted browsing and lets users navigate to datasets related by variables measured, equipment used, science domain, processes described, geospatial features studied, and other dataset characteristics that are generated by the pipeline. DDH also lets data curators access and edit the automatically generated metadata records using the CINERGI metadata editor, accept or reject the enhanced metadata content, and consider it in updating their metadata descriptions. 
We consider several complex data discovery workflows, in environmental seismology (quantifying sediment and water fluxes using seismic data), marine biology (determining available temperature, location, weather and bleaching characteristics of coral reefs related to measurements in a given coral reef survey), and river geochemistry (discovering observations relevant to geochemical measurements outside the tidal zone, given specific discharge conditions).
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Richard, S. M.; Valentine, D. W., Jr.; Grethe, J. S.; Hsu, L.; Malik, T.; Bermudez, L. E.; Gupta, A.; Lehnert, K. A.; Whitenack, T.; Ozyurt, I. B.; Condit, C.; Calderon, R.; Musil, L.
2014-12-01
EarthCube is envisioned as a cyberinfrastructure that fosters new, transformational geoscience by enabling sharing, understanding and scientifically-sound and efficient re-use of formerly unconnected data resources, software, models, repositories, and computational power. Its purpose is to enable science enterprise and workforce development via an extensible and adaptable collaboration and resource integration framework. A key component of this vision is development of comprehensive inventories supporting resource discovery and re-use across geoscience domains. The goal of the EarthCube CINERGI (Community Inventory of EarthCube Resources for Geoscience Interoperability) project is to create a methodology and assemble a large inventory of high-quality information resources with standard metadata descriptions and traceable provenance. The inventory is compiled from metadata catalogs maintained by geoscience data facilities, as well as from user contributions. The latter mechanism relies on community resource viewers: online applications that support update and curation of metadata records. Once harvested into CINERGI, metadata records from domain catalogs and community resource viewers are loaded into a staging database implemented in MongoDB, and validated for compliance with ISO 19139 metadata schema. Several types of metadata defects detected by the validation engine are automatically corrected with help of several information extractors or flagged for manual curation. The metadata harvesting, validation and processing components generate provenance statements using W3C PROV notation, which are stored in a Neo4J database. Thus curated metadata, along with the provenance information, is re-published and accessed programmatically and via a CINERGI online application. This presentation focuses on the role of resource inventories in a scalable and adaptable information infrastructure, and on the CINERGI metadata pipeline and its implementation challenges. 
Key project components are described at the project's website (http://workspace.earthcube.org/cinergi), which also provides access to the initial resource inventory, the inventory metadata model, metadata entry forms and a collection of the community resource viewers.
NASA Astrophysics Data System (ADS)
Car, Nicholas; Cox, Simon; Fitch, Peter
2015-04-01
With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty of multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects. We present a method to address both these issues by combining UncertML with PROV-O, and delivering resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, adds further elements to support uncertainty representation without a conceptual model and the integration of UncertML through links to documents. The Linked Data API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again. The Linked Data API's 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follows the UncertML model it can be automatically interpreted and may also support automatic uncertainty propagation.
Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements that are associated with uncertainty (PROV-O Entity and Activity classes) have UncertML elements recorded. This methodology is intentionally flexible to allow uncertainty metadata in many forms, not limited to UncertML. While the more formal representation of uncertainty metadata is desirable (using UncertProv elements to implement the UncertML conceptual model), this will not always be possible, and any uncertainty data stored will be better than none. Since the UncertProv ontology contains a superset of UncertML elements to facilitate the representation of non-UncertML uncertainty data, it could easily be extended to include other formal uncertainty conceptual models thus allowing non-UncertML propagation calculations.
VizieR Online Data Catalog: Hubble Legacy Archive ACS grism data (Kuemmel+, 2011)
NASA Astrophysics Data System (ADS)
Kuemmel, M.; Rosati, P.; Fosbury, R.; Haase, J.; Hook, R. N.; Kuntschner, H.; Lombardi, M.; Micol, A.; Nilsson, K. K.; Stoehr, F.; Walsh, J. R.
2011-09-01
A public release of slitless spectra, obtained with ACS/WFC and the G800L grism, is presented. Spectra were automatically extracted in a uniform way from 153 archival fields (or "associations") distributed across the two Galactic caps, covering all observations to 2008. The ACS G800L grism provides a wavelength range of 0.55-1.00um, with a dispersion of 40Å/pixel and a resolution of ~80Å for point-like sources. The ACS G800L images and matched direct images were reduced with an automatic pipeline that handles all steps from archive retrieval, alignment and astrometric calibration, direct image combination, catalogue generation, spectral extraction and collection of metadata. The large number of extracted spectra (73,581) demanded automatic methods for quality control and an automated classification algorithm was trained on the visual inspection of several thousand spectra. The final sample of quality controlled spectra includes 47919 datasets (65% of the total number of extracted spectra) for 32149 unique objects, with a median iAB-band magnitude of 23.7, reaching 26.5 AB for the faintest objects. Each released dataset contains science-ready 1D and 2D spectra, as well as multi-band image cutouts of corresponding sources and a useful preview page summarising the direct and slitless data, astrometric and photometric parameters. This release is part of the continuing effort to enhance the content of the Hubble Legacy Archive (HLA) with highly processed data products which significantly facilitate the scientific exploitation of the Hubble data. In order to characterize the slitless spectra, emission-line flux and equivalent width sensitivity of the ACS data were compared with public ground-based spectra in the GOODS-South field. An example list of emission line galaxies with two or more identified lines is also included, covering the redshift range 0.2-4.6. Almost all redshift determinations outside of the GOODS fields are new. 
The scope of science projects possible with the ACS slitless release data is large, from studies of Galactic stars to searches for high redshift galaxies. (3 data files).
The Hubble Legacy Archive ACS grism data
NASA Astrophysics Data System (ADS)
Kümmel, M.; Rosati, P.; Fosbury, R.; Haase, J.; Hook, R. N.; Kuntschner, H.; Lombardi, M.; Micol, A.; Nilsson, K. K.; Stoehr, F.; Walsh, J. R.
2011-06-01
A public release of slitless spectra, obtained with ACS/WFC and the G800L grism, is presented. Spectra were automatically extracted in a uniform way from 153 archival fields (or "associations") distributed across the two Galactic caps, covering all observations to 2008. The ACS G800L grism provides a wavelength range of 0.55-1.00 μm, with a dispersion of 40 Å/pixel and a resolution of ~80 Å for point-like sources. The ACS G800L images and matched direct images were reduced with an automatic pipeline that handles all steps from archive retrieval, alignment and astrometric calibration, direct image combination, catalogue generation, spectral extraction and collection of metadata. The large number of extracted spectra (73,581) demanded automatic methods for quality control and an automated classification algorithm was trained on the visual inspection of several thousand spectra. The final sample of quality controlled spectra includes 47 919 datasets (65% of the total number of extracted spectra) for 32 149 unique objects, with a median iAB-band magnitude of 23.7, reaching 26.5 AB for the faintest objects. Each released dataset contains science-ready 1D and 2D spectra, as well as multi-band image cutouts of corresponding sources and a useful preview page summarising the direct and slitless data, astrometric and photometric parameters. This release is part of the continuing effort to enhance the content of the Hubble Legacy Archive (HLA) with highly processed data products which significantly facilitate the scientific exploitation of the Hubble data. In order to characterize the slitless spectra, emission-line flux and equivalent width sensitivity of the ACS data were compared with public ground-based spectra in the GOODS-South field. An example list of emission line galaxies with two or more identified lines is also included, covering the redshift range 0.2 - 4.6. Almost all redshift determinations outside of the GOODS fields are new. 
The scope of science projects possible with the ACS slitless release data is large, from studies of Galactic stars to searches for high redshift galaxies.
FIR: An Effective Scheme for Extracting Useful Metadata from Social Media.
Chen, Long-Sheng; Lin, Zue-Cheng; Chang, Jing-Rong
2015-11-01
Recently, the use of social media for health information exchange has been expanding among patients, physicians, and other health care professionals. In medical areas, social media allows non-experts to access, interpret, and generate medical information for their own care and the care of others. Researchers have paid much attention to social media in medical education, patient-pharmacist communication, adverse drug reaction detection, the impact of social media on medicine and healthcare, and so on. However, relatively few papers discuss how to effectively extract useful knowledge from the huge amount of textual comments in social media. Therefore, this study proposes a Fuzzy adaptive resonance theory network based Information Retrieval (FIR) scheme that combines a Fuzzy adaptive resonance theory (ART) network, Latent Semantic Indexing (LSI), and association rules (AR) discovery to extract knowledge from social media. In our FIR scheme, the Fuzzy ART network is first employed to segment comments. Next, for each customer segment, we use the LSI technique to retrieve important keywords. Then, to make the extracted keywords understandable, association rule mining is used to organize them into metadata. These extracted voices of customers are then transformed into design needs using Quality Function Deployment (QFD) for further decision making. Unlike conventional information retrieval techniques, which acquire too many keywords to get key points, our FIR scheme can extract understandable metadata from social media.
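The final step of the scheme, organising keywords through association rules, can be sketched as follows. This is a generic support/confidence computation over keyword pairs, not the authors' implementation; the function name, the thresholds, and the sample comments are assumptions for illustration.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) between pairs of keywords.

    Each transaction is the set of keywords extracted from one comment.
    Thresholds are illustrative defaults, not values from the abstract.
    """
    n = len(transactions)
    if n == 0:
        return []
    sets = [set(t) for t in transactions]
    items = sorted(set().union(*sets))

    def support(itemset):
        # Fraction of comments that contain every keyword in the itemset.
        return sum(itemset <= s for s in sets) / n

    rules = []
    for a, b in combinations(items, 2):
        s_ab = support({a, b})
        if s_ab < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = s_ab / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, s_ab, confidence))
    return rules


if __name__ == "__main__":
    # Hypothetical keyword sets, one per comment.
    comments = [["pain", "dose"], ["pain", "dose", "sleep"], ["pain"], ["sleep"]]
    for lhs, rhs, sup, conf in pair_rules(comments):
        print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

Restricting rules to keyword pairs keeps the sketch short; a full Apriori-style miner would extend the same support and confidence tests to larger itemsets.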
HELIOGate, a Portal for the Heliophysics Community
NASA Astrophysics Data System (ADS)
Pierantoni, Gabriele; Carley, Eoin
2014-10-01
Heliophysics is the branch of physics that investigates the interactions between the Sun and the other bodies of the solar system. Heliophysicists rely on data collected from numerous sources scattered across the Solar System. The data collected from these sources is processed to extract metadata and the metadata extracted in this fashion is then used to build indexes of features and events called catalogues. Heliophysicists also develop conceptual and mathematical models of the phenomena and the environment of the Solar System. More specifically, they investigate the physical characteristics of the phenomena and they simulate how they propagate throughout the Solar System with mathematical and physical abstractions called propagation models. HELIOGate aims at addressing the need to combine and orchestrate existing web services in a flexible and easily configurable fashion to tackle different scientific questions. HELIOGate also offers a tool capable of connecting to sizeable computation and storage infrastructures to execute data processing codes that are needed to calibrate raw data and to extract metadata.
[Radiological dose and metadata management].
Walz, M; Kolodziej, M; Madsack, B
2016-12-01
This article describes the features of management systems currently available in Germany for extraction, registration and evaluation of metadata from radiological examinations, particularly in the digital imaging and communications in medicine (DICOM) environment. In addition, the probable relevant developments in this area concerning radiation protection legislation, terminology, standardization and information technology are presented.
The XML Metadata Editor of GFZ Data Services
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Elger, Kirsten; Tesei, Telemaco; Trippanera, Daniele
2017-04-01
Following the FAIR data principles, research data should be Findable, Accessible, Interoperable and Reusable. Publishing data under these principles requires assigning persistent identifiers to the data and generating rich machine-actionable metadata. To increase the interoperability, metadata should include shared vocabularies and crosslink the newly published (meta)data and related material. However, structured metadata formats tend to be complex and are not intended to be generated by individual scientists. Software solutions are needed that support scientists in providing metadata describing their data. To facilitate data publication activities of 'GFZ Data Services', we programmed an XML metadata editor that assists scientists in creating metadata in different schemata popular in the earth sciences (ISO19115, DIF, DataCite), while being at the same time usable by and understandable for scientists. Emphasis is placed on removing barriers: in particular, the editor is publicly available on the internet without registration [1] and the scientists are not requested to provide information that may be generated automatically (e.g. the URL of a specific licence or the contact information of the metadata distributor). Metadata are stored in browser cookies and a copy can be saved to the local hard disk. To improve usability, form fields are translated into the scientific language, e.g. 'creators' of the DataCite schema are called 'authors'. To assist filling in the form, we make use of drop down menus for small vocabulary lists and offer a search facility for large thesauri. Explanations of form fields and definitions of vocabulary terms are provided in pop-up windows and a full documentation is available for download via the help menu. In addition, multiple geospatial references can be entered via an interactive mapping tool, which helps to minimize problems with different conventions to provide latitudes and longitudes.
Currently, we are extending the metadata editor to be reused to generate metadata for data discovery and contextual metadata developed by the 'Multi-scale Laboratories' Thematic Core Service of the European Plate Observing System (EPOS-IP). The Editor will be used to build a common repository of a large variety of geological and geophysical datasets produced by multidisciplinary laboratories throughout Europe, thus contributing to a significant step toward the integration and accessibility of earth science data. This presentation will introduce the metadata editor and show the adjustments made for EPOS-IP. [1] http://dataservices.gfz-potsdam.de/panmetaworks/metaedit
NDSI products system based on Hadoop platform
NASA Astrophysics Data System (ADS)
Zhou, Yan; Jiang, He; Yang, Xiaoxia; Geng, Erhui
2015-12-01
Snow is the solid state of water resources on Earth and plays an important role in human life. Satellite remote sensing is important for snow extraction because its observations are periodic, large-scale, comprehensive, objective and timely. With the continuous development of remote sensing technology, remote sensing data are increasingly acquired from multiple platforms, multiple sensors and multiple viewing angles. At the same time, the demand for compute-intensive applications of remote sensing data is gradually increasing. However, current production systems for remote sensing products run in a serial mode and are mostly used by professional remote sensing researchers; systems that achieve automatic or semi-automatic production are relatively rare. Faced with massive remote sensing data, traditional serial production systems are too inefficient to meet the requirement of processing mass data in a timely and efficient manner. To improve the production efficiency of NDSI products and meet the demand for timely and efficient processing of large-scale remote sensing data, this paper builds an NDSI product production system based on the Hadoop platform. The system mainly includes a remote sensing image management module, an NDSI production module, and a system service module. The main research contents and results are as follows. (1) The remote sensing image management module comprises two parts: image import and image metadata management. It imports massive base IRS images and NDSI product images (the output of the system's production tasks) into the HDFS file system; at the same time, it reads the corresponding orbit row and column numbers, maximum/minimum longitude and latitude, product date, HDFS storage path, Hadoop task ID (for NDSI products), and other metadata, then creates thumbnails, assigns a unique ID number to each record, and imports it into the base/product image metadata database.
(2) The NDSI production module comprises two parts: index calculation, and production task submission and monitoring. It reads the HDF images related to a production task as a byte stream and uses the Beam library to parse the image byte stream into a Product; it uses the MapReduce distributed framework to perform production tasks while monitoring task status; when a production task completes, it calls the remote sensing image management module to store the NDSI products. (3) The system service module comprises image search and NDSI product download. Given image metadata attributes described in JSON format, it returns the sequence IDs of the matching images in the HDFS file system; for a given MapReduce task ID, it packages the NDSI products output by the task into a ZIP file and returns the download link. (4) System evaluation: massive remote sensing data were downloaded and processed with the system to produce NDSI products and test its performance. The results show that the system has high extensibility, strong fault tolerance and fast production speed, and that the image processing results are highly accurate.
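The index calculation in step (2) can be sketched per pixel. The abstract does not state the formula or the bands used; the sketch below assumes the conventional definition of the Normalized Difference Snow Index, NDSI = (green - SWIR) / (green + SWIR), and represents each band as a plain nested list. The function names are hypothetical and no Hadoop or Beam calls are shown.

```python
def ndsi(green: float, swir: float):
    """NDSI for one pixel; returns None where the denominator is zero (no signal)."""
    total = green + swir
    if total == 0:
        return None
    return (green - swir) / total


def ndsi_grid(green_band, swir_band):
    """Apply the index to two equally sized 2-D bands given as nested lists."""
    return [
        [ndsi(g, s) for g, s in zip(green_row, swir_row)]
        for green_row, swir_row in zip(green_band, swir_band)
    ]


if __name__ == "__main__":
    # Hypothetical reflectance values for a 2 x 2 tile.
    green = [[0.75, 0.25], [0.5, 0.0]]
    swir = [[0.25, 0.25], [0.5, 0.0]]
    print(ndsi_grid(green, swir))
```

Because each output pixel depends only on the two input pixels at the same position, the calculation splits naturally across image tiles, which is what makes it a reasonable fit for a map-style distributed job.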
Automated metadata--final project report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schissel, David
This report summarizes the work of the Automated Metadata, Provenance Cataloging, and Navigable Interfaces: Ensuring the Usefulness of Extreme-Scale Data Project (MPO Project) funded by the United States Department of Energy (DOE), Offices of Advanced Scientific Computing Research and Fusion Energy Sciences. Initially funded for three years starting in 2012, it was extended for 6 months with additional funding. The project was a collaboration between scientists at General Atomics, Lawrence Berkeley National Laboratory (LBNL), and Massachusetts Institute of Technology (MIT). The group leveraged existing computer science technology where possible, and extended or created new capabilities where required. The MPO project was able to successfully create a suite of software tools that can be used by a scientific community to automatically document their scientific workflows. These tools were integrated into workflows for fusion energy and climate research illustrating the general applicability of the project’s toolkit. Feedback was very positive on the project’s toolkit and the value of such automatic workflow documentation to the scientific endeavor.
Metadata mapping and reuse in caBIG™
Kunz, Isaac; Lin, Ming-Chin; Frey, Lewis
2009-01-01
Background This paper proposes that interoperability across biomedical databases can be improved by utilizing a repository of Common Data Elements (CDEs), UML model class-attributes and simple lexical algorithms to facilitate the building of domain models. This is examined in the context of an existing system, the National Cancer Institute (NCI)'s cancer Biomedical Informatics Grid (caBIG™). The goal is to demonstrate the deployment of open source tools that can be used to effectively map models and enable the reuse of existing information objects and CDEs in the development of new models for translational research applications. This effort is intended to help developers reuse appropriate CDEs to enable interoperability of their systems when developing within the caBIG™ framework or other frameworks that use metadata repositories. Results The Dice (di-grams) and Dynamic algorithms are compared and both algorithms have similar performance matching UML model class-attributes to CDE class object-property pairs. With the algorithms used, the baselines for automatically finding the matches are reasonable for the data models examined. It suggests that automatic mapping of UML models and CDEs is feasible within the caBIG™ framework and potentially any framework that uses a metadata repository. Conclusion This work opens up the possibility of using mapping algorithms to reduce cost and time required to map local data models to a reference data model such as those used within caBIG™. This effort contributes to facilitating the development of interoperable systems within caBIG™ as well as other metadata frameworks. Such efforts are critical to address the need to develop systems to handle enormous amounts of diverse data that can be leveraged from new biomedical methodologies. PMID:19208192
DAS: A Data Management System for Instrument Tests and Operations
NASA Astrophysics Data System (ADS)
Frailis, M.; Sartor, S.; Zacchei, A.; Lodi, M.; Cirami, R.; Pasian, F.; Trifoglio, M.; Bulgarelli, A.; Gianotti, F.; Franceschi, E.; Nicastro, L.; Conforti, V.; Zoli, A.; Smart, R.; Morbidelli, R.; Dadina, M.
2014-05-01
The Data Access System (DAS) is a data management software system, providing a reusable solution for the storage of data acquired both from telescopes and auxiliary data sources during the instrument development phases and operations. It is part of the Customizable Instrument WorkStation system (CIWS-FW), a framework for the storage, processing and quick-look at the data acquired from scientific instruments. The DAS provides a data access layer mainly targeted to software applications: quick-look displays, pre-processing pipelines and scientific workflows. It is logically organized in three main components: an intuitive and compact Data Definition Language (DAS DDL) in XML format, aimed at user-defined data types; an Application Programming Interface (DAS API), automatically adding classes and methods supporting the DDL data types, and providing an object-oriented query language; a data management component, which maps the metadata of the DDL data types in a relational Data Base Management System (DBMS), and stores the data in a shared (network) file system. With the DAS DDL, developers define the data model for a particular project, specifying for each data type the metadata attributes, the data format and layout (if applicable), and named references to related or aggregated data types. Together with the DDL user-defined data types, the DAS API acts as the only interface to store, query and retrieve the metadata and data in the DAS system, providing both an abstract interface and a data model specific one in C, C++ and Python. The mapping of metadata in the back-end database is automatic and supports several relational DBMSs, including MySQL, Oracle and PostgreSQL.
2011-01-01
Background The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive. Description A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site http://biostor.org/openurl/. This resolver can be used on the web, or called by bibliographic tools that support OpenURL. Conclusions BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from http://biostor.org/. PMID:21605356
Automatic Conversion of Metadata from the Study of Health in Pomerania to ODM.
Hegselmann, Stefan; Gessner, Sophia; Neuhaus, Philipp; Henke, Jörg; Schmidt, Carsten Oliver; Dugas, Martin
2017-01-01
Electronic collection and high-quality analysis of medical data are expected to have great potential to improve patient care and medical research. However, the integration of data from different stakeholders poses a crucial problem. The exchange and reuse of medical data models, as well as annotations with unique semantic identifiers, were proposed as a solution. The objective was to convert metadata from the Study of Health in Pomerania to the standardized CDISC ODM format. The structure of the two data formats is analyzed and a mapping is suggested and implemented. The metadata from the Study of Health in Pomerania was successfully converted to ODM. All relevant information was included in the resulting forms. Three sample forms were evaluated in depth, which demonstrates the feasibility of this conversion. Hundreds of data entry forms with more than 15,000 items can be converted into a standardized format with some limitations, e.g. regarding logical constraints. This enables the integration of the Study of Health in Pomerania metadata into various systems, facilitating the implementation and reuse in different study sites.
NASA Astrophysics Data System (ADS)
Rogowitz, Bernice E.; Matasci, Naim
2011-03-01
The explosion of online scientific data from experiments, simulations, and observations has given rise to an avalanche of algorithmic, visualization and imaging methods. There has also been enormous growth in the introduction of tools that provide interactive interfaces for exploring these data dynamically. Most systems, however, do not support the real-time exploration of patterns and relationships across tools and do not provide guidance on which colors, colormaps or visual metaphors will be most effective. In this paper, we introduce a general architecture for sharing metadata between applications and a "Metadata Mapper" component that allows the analyst to decide how metadata from one component should be represented in another, guided by perceptual rules. This system is designed to support "brushing" [1], in which highlighting a region of interest in one application automatically highlights corresponding values in another, allowing the scientist to develop insights from multiple sources. Our work builds on the component-based iPlant Cyberinfrastructure [2] and provides a general approach to supporting interactive exploration across independent visualization and visual analysis components.
Trends in Fetal Medicine: A 10-Year Bibliometric Analysis of Prenatal Diagnosis
Dhombres, Ferdinand; Bodenreider, Olivier
2018-01-01
The objective is to automatically identify trends in Fetal Medicine over the past 10 years through a bibliometric analysis of articles published in Prenatal Diagnosis, using text mining techniques. We processed 2,423 full-text articles published in Prenatal Diagnosis between 2006 and 2015. We extracted salient terms, calculated their frequencies over time, and established evolution profiles for terms, from which we derived falling, stable, and rising trends. We identified 618 terms with a falling trend, 2,142 stable terms, and 839 terms with a rising trend. Terms with increasing frequencies include those related to statistics and medical study design. The most recent of these terms reflect the new opportunities of next-generation sequencing. Many terms related to cytogenetics exhibit a falling trend. A bibliometric analysis based on text mining effectively supports identification of trends over time. This scalable approach is complementary to analyses based on metadata or expert opinion. PMID:29295220
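One simple way to derive falling, stable and rising labels from yearly term frequencies is to fit a straight line and look at its slope. The Python sketch below does this with an ordinary least-squares slope; it is an assumption made for illustration, not the evolution-profile method of the paper, and the tolerance value is hypothetical.

```python
def slope(ys: list) -> float:
    """Least-squares slope of ys against x = 0, 1, 2, ... (one value per year)."""
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def classify(frequencies: list, tolerance: float = 0.01) -> str:
    """Label a yearly frequency series as rising, falling or stable.

    A series is 'stable' when the absolute slope is within the
    (hypothetical) tolerance.
    """
    s = slope(frequencies)
    if s > tolerance:
        return "rising"
    if s < -tolerance:
        return "falling"
    return "stable"
```

For example, a term whose relative frequency grows by 0.1 per year would be labelled "rising", while a flat series would be labelled "stable".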
Using phrases and document metadata to improve topic modeling of clinical reports.
Speier, William; Ong, Michael K; Arnold, Corey W
2016-06-01
Probabilistic topic models provide an unsupervised method for analyzing unstructured text and have the potential to be integrated into clinical automatic summarization systems. Clinical documents are accompanied by metadata in a patient's medical history and frequently contain multiword concepts that can be valuable for accurately interpreting the included text. While existing methods have attempted to address these problems individually, we present a unified model for free-text clinical documents that integrates contextual patient- and document-level data, and discovers multi-word concepts. In the proposed model, phrases are represented by chained n-grams and a Dirichlet hyper-parameter is weighted by both document-level and patient-level context. This method and three other Latent Dirichlet allocation models were fit to a large collection of clinical reports. Examples of resulting topics demonstrate the results of the new model, and the quality of the representations is evaluated using empirical log likelihood. The proposed model was able to create informative prior probabilities based on patient and document information, and captured phrases that represented various clinical concepts. The representation using the proposed model had a significantly higher empirical log likelihood than the compared methods. Integrating document metadata and capturing phrases in clinical text greatly improves the topic representation of clinical documents. The resulting clinically informative topics may effectively serve as the basis for an automatic summarization system for clinical reports. Copyright © 2016 Elsevier Inc. All rights reserved.
An Interactive, Web-Based Approach to Metadata Authoring
NASA Technical Reports Server (NTRS)
Pollack, Janine; Wharton, Stephen W. (Technical Monitor)
2001-01-01
NASA's Global Change Master Directory (GCMD) serves a growing number of users by assisting the scientific community in the discovery of and linkage to Earth science data sets and related services. The GCMD holds over 8000 data set descriptions in Directory Interchange Format (DIF) and 200 data service descriptions in Service Entry Resource Format (SERF), encompassing the disciplines of geology, hydrology, oceanography, meteorology, and ecology. Data descriptions also contain geographic coverage information, thus allowing researchers to discover data pertaining to a particular geographic location, as well as subject of interest. The GCMD strives to be the preeminent data locator for world-wide directory level metadata. In this vein, scientists and data providers must have access to intuitive and efficient metadata authoring tools. Existing GCMD tools are not currently attracting widespread usage. With usage being the prime indicator of utility, it has become apparent that current tools must be improved. As a result, the GCMD has released a new suite of web-based authoring tools that enable a user to create new data and service entries, as well as modify existing data entries. With these tools, a more interactive approach to metadata authoring is taken, as they feature a visual "checklist" of data/service fields that automatically updates when a field is completed. In this way, the user can quickly gauge which of the required and optional fields have not been populated. With the release of these tools, the Earth science community will be further assisted in efficiently creating quality data and services metadata. Keywords: metadata, Earth science, metadata authoring tools
Auspice: Automatic Service Planning in Cloud/Grid Environments
NASA Astrophysics Data System (ADS)
Chiu, David; Agrawal, Gagan
Recent scientific advances have fostered a mounting number of services and data sets available for utilization. These resources, though scattered across disparate locations, are often loosely coupled both semantically and operationally. This loosely coupled relationship implies the possibility of linking together operations and data sets to answer queries. This task, generally known as automatic service composition, therefore abstracts the process of complex scientific workflow planning from the user. We have been exploring a metadata-driven approach toward automatic service workflow composition, among other enabling mechanisms, in our system, Auspice: Automatic Service Planning in Cloud/Grid Environments. In this paper, we present a complete overview of our system's unique features and outlooks for future deployment as the Cloud computing paradigm becomes increasingly eminent in enabling scientific computing.
Musick, Charles R [Castro Valley, CA; Critchlow, Terence [Livermore, CA; Ganesh, Madhaven [San Jose, CA; Slezak, Tom [Livermore, CA; Fidelis, Krzysztof [Brentwood, CA
2006-12-19
A system and method is disclosed for integrating and accessing multiple data sources within a data warehouse architecture. The metadata formed by the present method provide a way to declaratively present domain specific knowledge, obtained by analyzing data sources, in a consistent and useable way. Four types of information are represented by the metadata: abstract concepts, databases, transformations and mappings. A mediator generator automatically generates data management computer code based on the metadata. The resulting code defines a translation library and a mediator class. The translation library provides a data representation for domain specific knowledge represented in a data warehouse, including "get" and "set" methods for attributes that call transformation methods and derive a value of an attribute if it is missing. The mediator class defines methods that take "distinguished" high-level objects as input and traverse their data structures and enter information into the data warehouse.
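The "get" and "set" behaviour described above, where a missing attribute is derived through a transformation, can be sketched in a few lines of Python. The class name `Record`, the attribute names and the unit conversion are hypothetical and only illustrate the pattern; they are not taken from the disclosed system.

```python
from typing import Callable, Dict, Optional


class Record:
    """Attribute store whose getter derives a value when it is missing."""

    def __init__(self, derivations: Optional[Dict[str, Callable]] = None):
        self._values: Dict[str, object] = {}
        # Maps an attribute name to a transformation that computes it
        # from other attributes of the same record.
        self._derivations = derivations or {}

    def set(self, name: str, value: object) -> None:
        self._values[name] = value

    def get(self, name: str) -> object:
        if name in self._values:
            return self._values[name]
        derive = self._derivations.get(name)
        if derive is None:
            raise KeyError(name)
        value = derive(self)          # call the transformation method
        self._values[name] = value    # cache the derived value
        return value
```

A record created with `Record({"length_nm": lambda r: r.get("length_angstrom") / 10})` would return a nanometre value even if only the angstrom value had been set.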
NASA Astrophysics Data System (ADS)
Benedict, K. K.; Scott, S.
2013-12-01
While there has been a convergence towards a limited number of standards for representing knowledge (metadata) about geospatial (and other) data objects and collections, there exist a variety of community conventions around the specific use of those standards and within specific data discovery and access systems. This combination of limited (but multiple) standards and conventions creates a challenge for system developers that aspire to participate in multiple data infrastructures, each of which may use a different combination of standards and conventions. While Extensible Markup Language (XML) is a shared standard for encoding most metadata, traditional direct XML transformations (XSLT) from one standard to another often result in an imperfect transfer of information due to incomplete mapping from one standard's content model to another. This paper presents the work at the University of New Mexico's Earth Data Analysis Center (EDAC) in which a unified data and metadata management system has been developed in support of the storage, discovery and access of heterogeneous data products. This system, the Geographic Storage, Transformation and Retrieval Engine (GSTORE) platform, has adopted a polyglot database model in which a combination of relational and document-based databases are used to store both data and metadata, with some metadata stored in a custom XML schema designed as a superset of the requirements for multiple target metadata standards: ISO 19115-2/19139/19110/19119, FGDC CSDGM (both with and without remote sensing extensions) and Dublin Core. Metadata stored within this schema is complemented by additional service, format and publisher information that is dynamically "injected" into produced metadata documents when they are requested from the system.
While mapping from the underlying common metadata schema is relatively straightforward, the generation of valid metadata within each target standard is necessary but not sufficient for integration into multiple data infrastructures, as has been demonstrated through EDAC's testing and deployment of metadata into multiple external systems: Data.Gov, the GEOSS Registry, the DataONE network, the DSpace based institutional repository at UNM and semantic mediation systems developed as part of the NASA ACCESS ELSeWEB project. Each of these systems requires valid metadata as a first step, but to make most effective use of the delivered metadata each also has a set of conventions that are specific to the system. This presentation will provide an overview of the underlying metadata management model, the processes and web services that have been developed to automatically generate metadata in a variety of standard formats and highlight some of the specific modifications made to the output metadata content to support the different conventions used by the multiple metadata integration endpoints.
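The idea of rendering one target standard from a superset record, with extra information injected at request time, might look roughly like the Python sketch below. It targets a flat Dublin Core-style output only; the internal field names, the mapping table and the wrapper element `metadata` are assumptions, not GSTORE's actual schema or services.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical internal field name -> Dublin Core element name.
DC_FIELD_MAP = {
    "title": "title",
    "abstract": "description",
    "originator": "creator",
    "publisher": "publisher",
}


def to_dublin_core(record: dict, injected: dict) -> ET.Element:
    """Build a flat Dublin Core-style element from a superset record.

    `injected` holds request-time values (for example the publisher) that
    are not stored with the record; they override stored values.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    merged = {**record, **injected}
    for field, dc_name in DC_FIELD_MAP.items():
        if field in merged:
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            child.text = str(merged[field])
    return root
```

Fields of the internal record that have no counterpart in the target standard are simply skipped, which is where a superset schema avoids the lossy chain of standard-to-standard transformations mentioned in the abstract.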
Ontology-Based Search of Genomic Metadata.
Fernandez, Javier D; Lenzerini, Maurizio; Masseroli, Marco; Venco, Francesco; Ceri, Stefano
2016-01-01
The Encyclopedia of DNA Elements (ENCODE) is a huge and still expanding public repository of more than 4,000 experiments and 25,000 data files, assembled by a large international consortium since 2007; unknown biological knowledge can be extracted from these huge and largely unexplored data, leading to data-driven genomic, transcriptomic, and epigenomic discoveries. Yet, support for searching relevant datasets for knowledge discovery is limited: metadata describing ENCODE datasets are quite simple and incomplete, and not described by a coherent underlying ontology. Here, we show how to overcome this limitation by adopting an ENCODE metadata searching approach which uses high-quality ontological knowledge and state-of-the-art indexing technologies. Specifically, we developed S.O.S. GeM (http://www.bioinformatics.deib.polimi.it/SOSGeM/), a system supporting effective semantic search and retrieval of ENCODE datasets. First, we constructed a Semantic Knowledge Base by starting with concepts extracted from ENCODE metadata, matched to and expanded on biomedical ontologies integrated in the well-established Unified Medical Language System. We prove that this inference method is sound and complete. Then, we leveraged the Semantic Knowledge Base to semantically search ENCODE data from arbitrary biologists' queries. This allows correctly finding more datasets than those extracted by a purely syntactic search, as supported by the other available systems. We empirically show the relevance of found datasets to the biologists' queries.
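The difference between a purely syntactic search and an ontology-backed one can be shown with a toy Python sketch: the query term is expanded to all narrower terms before matching. The tiny hierarchy, the dataset annotations and the function names are invented for illustration and do not reflect the S.O.S. GeM implementation.

```python
def expand(term: str, narrower: dict) -> set:
    """Return the term together with all transitively narrower terms."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(narrower.get(current, ()))
    return seen


def semantic_search(query: str, datasets: dict, narrower: dict) -> list:
    """Names of datasets annotated with the query term or a narrower term."""
    wanted = expand(query, narrower)
    return sorted(name for name, terms in datasets.items() if wanted & set(terms))
```

With a hierarchy in which "T cell" is narrower than "leukocyte", which is narrower than "blood cell", a query for "blood cell" also returns datasets annotated only with "T cell", which an exact string match would miss.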
Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra; Pereira, Emiliano; Schnetzer, Julia; Arvanitidis, Christos; Jensen, Lars Juhl
2016-01-01
The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15-25% and helps curators to detect terms that would otherwise have been missed. Database URL: https://extract.hcmr.gr/. © The Author(s) 2016. Published by Oxford University Press.
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Richard, S. M.; Malik, T.; Hsu, L.; Gupta, A.; Grethe, J. S.; Valentine, D. W., Jr.; Lehnert, K. A.; Bermudez, L. E.; Ozyurt, I. B.; Whitenack, T.; Schachne, A.; Giliarini, A.
2015-12-01
While many geoscience-related repositories and data discovery portals exist, finding information about available resources remains a pervasive problem, especially when searching across multiple domains and catalogs. Inconsistent and incomplete metadata descriptions, disparate access protocols and semantic differences across domains, and troves of unstructured or poorly structured information which is hard to discover and use are major hindrances toward discovery, while metadata compilation and curation remain manual and time-consuming. We report on methodology, main results and lessons learned from an ongoing effort to develop a geoscience-wide catalog of information resources, with consistent metadata descriptions, traceable provenance, and automated metadata enhancement. Developing such a catalog is the central goal of CINERGI (Community Inventory of EarthCube Resources for Geoscience Interoperability), an EarthCube building block project (earthcube.org/group/cinergi). The key novel technical contributions of the project include: a) development of a metadata enhancement pipeline and a set of document enhancers to automatically improve various aspects of metadata descriptions, including keyword assignment and definition of spatial extents; b) Community Resource Viewers: online applications for crowdsourcing community resource registry development, curation and search, and channeling metadata to the unified CINERGI inventory; c) metadata provenance, validation and annotation services; d) user interfaces for advanced resource discovery; and e) geoscience-wide ontology and machine learning to support automated semantic tagging and faceted search across domains.
We demonstrate these CINERGI components in three types of user scenarios: (1) improving existing metadata descriptions maintained by government and academic data facilities, (2) supporting work of several EarthCube Research Coordination Network projects in assembling information resources for their domains, and (3) enhancing the inventory and the underlying ontology to address several complicated data discovery use cases in hydrology, geochemistry, sedimentology, and critical zone science. Support from the US National Science Foundation under award ICER-1343816 is gratefully acknowledged.
NASA Astrophysics Data System (ADS)
Fazliev, A.
2009-04-01
The information and knowledge layers of an information-computational system for water spectroscopy are described. Semantic metadata for all the tasks of the domain information model that form the basis of these layers have been studied. The principle of semantic metadata determination and the mechanisms of its use during information systematization in molecular spectroscopy are presented. The software developed for working with semantic metadata is described as well. Formation of a domain model in the framework of the Semantic Web is based on an explicit specification of its conceptualization or, in other words, its ontologies. Formation of the conceptualization for molecular spectroscopy was described in Refs. [1, 2]. In these works two chains of tasks are selected as a zeroth approximation for the knowledge domain description: the direct tasks chain and the inverse tasks chain. The solution schemes of these tasks define an approximation of the data layer for the knowledge domain conceptualization. The properties of spectroscopy task solutions lead to a step-by-step extension of the molecular spectroscopy conceptualization, and the information layer of the information system corresponds to this extension. An advantage of a molecular spectroscopy model designed as a chain of tasks is that one can explicitly define data and metadata at each step of the solution of these tasks. The metadata structure (task solution properties) in the knowledge domain also has the form of a chain, in which the input data and metadata of the previous task become metadata of the following tasks. The term metadata is used in its narrow sense: metadata are the properties of spectroscopy task solutions. Semantic metadata represented with the help of OWL [3] are formed automatically and are individuals of classes (A-box). The combination of the T-box and A-box is an ontology that can be processed with the help of an inference engine.
In this work we analyzed the formation of individuals of molecular spectroscopy applied ontologies as well as the software used for their creation by means of the OWL DL language. The results of this work are presented in the form of an information layer and a knowledge layer in the W@DIS information system [4].
1 FORMATION OF INDIVIDUALS OF WATER SPECTROSCOPY APPLIED ONTOLOGY
The applied tasks ontology contains an explicit description of the input and output data of physical tasks solved in two chains of molecular spectroscopy tasks. Besides physical concepts related to spectroscopy task solutions, an information source, which is a key concept of the knowledge domain information model, is also used. Each solution of a knowledge domain task is linked to the information source, which contains a reference to the published task solution, the molecule and the task solution properties. Each information source allows us to identify a certain knowledge domain task solution contained in the information system. Water spectroscopy applied ontology classes are formed on the basis of the molecular spectroscopy concepts taxonomy. They are defined by constraints on properties of the selected conceptualization. Extension of the applied ontology in the W@DIS information system follows two scenarios. Formation of individuals (ontology facts or axioms) takes place when a task solution is uploaded into the information system. Ontology operations that involve the molecular spectroscopy taxonomy and individuals are performed solely by the user; the Protege ontology editor was used for this purpose. Software was designed and implemented for the formation, processing and visualization of knowledge domain task individuals. The method of individual formation determines the sequence of steps in which ontology individuals are generated. Task solution properties (metadata) have qualitative and quantitative values.
Qualitative metadata describe the qualitative side of a task, such as the solution method or other information that can be explicitly specified by object properties of the OWL DL language. Quantitative metadata describe quantitative properties of a task solution, such as minimal and maximal data values or other information that can be explicitly obtained by programmed algorithmic operations. These metadata are related to DatatypeProperty properties of the OWL specification language. Quantitative metadata can be obtained automatically during data upload into the information system. Since ObjectProperty values are objects, processing of qualitative metadata requires logical constraints. For tasks solved in the W@DIS ICS, qualitative metadata can be formed automatically (for example, in the spectral functions calculation task). The methods used to translate qualitative metadata into quantitative metadata are characterized as a roughened representation of knowledge in the knowledge domain. The existence of two ways of obtaining data is a key point in the formation of the applied ontology of molecular spectroscopy tasks:
- the experimental method (metadata for experimental data contain a description of the equipment, experiment conditions and so on) at the initial stage, and inverse task solution at the following stages;
- the calculation method (metadata for calculated data are closely related to the metadata used for the description of physical and mathematical models of molecular spectroscopy).
2 SOFTWARE FOR ONTOLOGY OPERATION
Data collection in the water spectroscopy information system is organized as a workflow that contains such operations as information source creation, entry of bibliographic data on publications, formation of the uploaded data schema and so on. Metadata are generated in the information source as well. Two methods are used for their formation: automatic metadata generation and manual metadata generation (performed by the user).
Software support for actions related to metadata formation is provided by the META+ module. The functions of the META+ module can be divided into two groups: the first contains the functions needed by the software developer, and the second the functions needed by a user of the information system. The META+ module functions needed by the developer are:
1. creation of the taxonomy (T-boxes) of applied ontology classes of knowledge domain tasks;
2. creation of instances of task classes;
3. creation of data schemes of tasks in the form of an XML pattern based on XML syntax; the XML pattern is developed for the instances generator and created according to certain rules imposed on the software generator implementation;
4. implementation of metadata value calculation algorithms;
5. creation of a request interface and additional knowledge processing functions for the solution of these tasks;
6. unification of the created functions and interfaces into one information system.
The following sequence is universal for the generation of individuals of the task classes that form chains. Special interfaces for managing user operations are provided for the software developer in the META+ module. There are means for updating qualitative metadata values when data are re-uploaded to an information source.
The list of functions needed by the end user contains:
- visualization and editing of data sets, taking into account their metadata, e.g. display of the unique number of bands in transitions for a certain data source;
- export of OWL/RDF models from the information system to the environment in XML syntax;
- visualization of instances of classes of applied ontology tasks on molecular spectroscopy;
- import of OWL/RDF models into the information system and their integration with the domain vocabulary;
- formation of additional knowledge of the knowledge domain for the construction of ontological instances of task classes using GTML-formats and their processing;
- formation of additional knowledge in the knowledge domain for the construction of instances of task classes, using a software algorithm for data set processing;
- semantic search, using an interface that formulates questions in the form of related triplets in order to obtain an adequate answer.
3 STRUCTURE OF META+ MODULE
The META+ software module that provides the above functions contains the following components:
- a knowledge base that stores the semantic metadata and taxonomies of the information system;
- the software libraries POWL and RAP [5], created by third-party developers and providing access to the ontological storage;
- function classes and libraries that form the core of the module and perform the tasks of formation, storage and visualization of class instances;
- configuration files and module patterns that allow one to adjust and organize the operation of different functional blocks.
The META+ module also contains scripts and patterns implemented according to the rules of the W@DIS information system development environment:
- scripts for interaction with the environment by means of the software core of the information system.
These scripts provide web-oriented interactive communication;
- patterns for the visualization of the functionality realized by the scripts.
The software core of the scientific information-computational system W@DIS is built using the MVC (Model - View - Controller) design pattern, which allows us to separate the logic of the application from its representation. It implements the interaction of three logical components, providing interactivity with the environment via the Web and performing its preprocessing. The functions of the "Controller" logical component are implemented by scripts designed according to the rules imposed by the software core of the information system. Each script is an object-oriented class with an obligatory script initiation method called "start". The functions that present the results of domain application operation (i.e. the "View" component) are sets of HTML patterns that allow one to visualize these results with the help of additional constructions processed by the software core of the system. Besides interacting with the software core of the scientific information system, this module also deals with the configuration files of the software core and its database. Such organization of work provides closer integration with the software core and a deeper and more adequate connection in operating system support.
4 CONCLUSION
In this work the problems of semantic metadata creation in an information system oriented towards information representation in the area of molecular spectroscopy have been discussed. The method of forming semantic metadata and functions, as well as the implementation and structure of the META+ module, have been described. The architecture of the META+ module is closely related to the existing software of the "Molecular spectroscopy" scientific information system. The module is implemented using modern approaches to Web-oriented application development.
It uses the existing applied interfaces. The developed software allows us to:
- perform automatic metadata annotation of calculated task solutions directly in the information system;
- perform automatic metadata annotation when task solution results obtained outside the information system are uploaded, forming an instance of the solved task on the basis of the entry data;
- use ontological instances of task solutions for identification of data in the information tasks of viewing, comparison and search solved by the information system;
- export applied task ontologies for operation on them by external means;
- solve the task of semantic search according to a pattern and using a question-answer type interface.
5 ACKNOWLEDGEMENT
The authors are grateful to RFBR for the financial support of the development of the distributed information system for molecular spectroscopy.
REFERENCES
[1] A.D. Bykov, A.Z. Fazliev, N.N. Filippov, A.V. Kozodoev, A.I. Privezentsev, L.N. Sinitsa, M.V. Tonkov and M.Yu. Tretyakov, Distributed information system on atmospheric spectroscopy // Geophysical Research Abstracts, SRef-ID: 1607-7962/gra/EGU2007-A-01906, 2007, v. 9, p. 01906.
[2] A.I. Privezentsev, A.Z. Fazliev, Applied task ontology for molecular spectroscopy information resources systematization. The Proceedings of 9th Russian scientific conference "Electronic libraries: advanced methods and technologies, electronic collections" - RCDL'2007, Pereslavl Zalesskii, 2007, part 1, p. 201-210.
[3] OWL Web Ontology Language Semantics and Abstract Syntax, W3C Recommendation 10 February 2004, http://www.w3.org/TR/2004/REC-owl-semantics-20040210/
[4] W@DIS information system, http://wadis.saga.iao.ru
[5] RAP library, http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/
Metadata-Driven SOA-Based Application for Facilitation of Real-Time Data Warehousing
NASA Astrophysics Data System (ADS)
Pintar, Damir; Vranić, Mihaela; Skočir, Zoran
Service-oriented architecture (SOA) has already been widely recognized as an effective paradigm for achieving integration of diverse information systems. SOA-based applications can cross boundaries of platforms, operating systems and proprietary data standards, commonly through the usage of Web Services technology. On the other hand, metadata is also commonly referred to as a potential integration tool, given that standardized metadata objects can provide useful information about the specifics of unknown information systems with which one is interested in communicating, using an approach commonly called "model-based integration". This paper presents the result of research regarding possible synergy between those two integration facilitators. This is accomplished with a vertical example of a metadata-driven SOA-based business process that provides ETL (Extraction, Transformation and Loading) and metadata services to a data warehousing system in need of real-time ETL support.
ISO 19115 Experiences in NASA's Earth Observing System (EOS) ClearingHOuse (ECHO)
NASA Astrophysics Data System (ADS)
Cechini, M. F.; Mitchell, A.
2011-12-01
Metadata is an important entity in the process of cataloging, discovering, and describing earth science data. As science research and the gathered data increase in complexity, so do the complexity and importance of descriptive metadata. To meet these growing needs, the required metadata models use richer and more mature metadata attributes. Categorizing, standardizing, and promulgating these metadata models to a politically, geographically, and scientifically diverse community is a difficult process. An integral component of metadata management within NASA's Earth Observing System Data and Information System (EOSDIS) is the Earth Observing System (EOS) ClearingHOuse (ECHO). ECHO is the core metadata repository for the EOSDIS data centers, providing a centralized mechanism for metadata and data discovery and retrieval. ECHO has undertaken an internal restructuring to meet the changing needs of scientists, the consistent advancement in technology, and the advent of new standards such as ISO 19115. These improvements were based on the following tenets for data discovery and retrieval: + There exists a set of 'core' metadata fields recommended for data discovery. + There exists a set of users who will require the entire metadata record for advanced analysis. + There exists a set of users who will require a 'core' set of metadata fields for discovery only. + There will never be a cessation of new formats or a total retirement of all old formats. + Users should be presented metadata in a consistent format of their choosing. In order to address the previously listed items, ECHO's new metadata processing paradigm utilizes the following approach: + Identify a cross-format set of 'core' metadata fields necessary for discovery. + Implement format-specific indexers to extract the 'core' metadata fields into an optimized query capability. + Archive the original metadata in its entirety for presentation to users requiring the full record. 
+ Provide on-demand translation of 'core' metadata to any supported result format. Lessons learned by the ECHO team while implementing its new metadata approach to support usage of the ISO 19115 standard will be presented. These lessons learned highlight some discovered strengths and weaknesses in the ISO 19115 standard as it is introduced to an existing metadata processing system.
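The four-step approach listed in this abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three 'core' fields, the two XML layouts labelled "A" and "B", and the names `make_indexer`, `Repository` and `translate_to_a` are hypothetical and are not taken from ECHO. The sketch only shows the shape of the idea: per-format indexers fill a common core index, the original record is archived untouched, and core fields can be rendered back into a supported format on demand.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from 'core' field name to an element path, per format.
# Neither layout is a real metadata standard.
FORMAT_PATHS = {
    "A": {"title": "Title", "start_date": "Start", "end_date": "End"},
    "B": {"title": "identification/name",
          "start_date": "extent/begin",
          "end_date": "extent/end"},
}


def make_indexer(paths):
    """Build a format-specific indexer that extracts only the core fields."""
    def indexer(xml_text):
        root = ET.fromstring(xml_text)
        return {field: root.findtext(path) for field, path in paths.items()}
    return indexer


INDEXERS = {fmt: make_indexer(paths) for fmt, paths in FORMAT_PATHS.items()}


class Repository:
    """Core fields go into a query index; originals are archived in full."""

    def __init__(self):
        self.index = {}    # record id -> core fields
        self.archive = {}  # record id -> original metadata text

    def ingest(self, record_id, fmt, xml_text):
        self.index[record_id] = INDEXERS[fmt](xml_text)
        self.archive[record_id] = xml_text

    def search(self, **criteria):
        """Return ids whose core fields match every criterion."""
        return sorted(rid for rid, core in self.index.items()
                      if all(core.get(k) == v for k, v in criteria.items()))


def translate_to_a(core):
    """On-demand translation of core fields into the invented format 'A'."""
    root = ET.Element("Granule")
    for field, tag in FORMAT_PATHS["A"].items():
        ET.SubElement(root, tag).text = core[field]
    return ET.tostring(root, encoding="unicode")


REC_A = ("<Granule><Title>SST</Title><Start>2011-01-01</Start>"
         "<End>2011-01-02</End></Granule>")
REC_B = ("<md><identification><name>SST</name></identification>"
         "<extent><begin>2011-02-01</begin><end>2011-02-02</end></extent></md>")

repo = Repository()
repo.ingest("g1", "A", REC_A)
repo.ingest("g2", "B", REC_B)
```

With the two sample records ingested, `repo.search(title="SST")` finds both regardless of their source format, while `repo.archive["g2"]` still returns the untouched format "B" text for users who need the full record.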
A document centric metadata registration tool constructing earth environmental data infrastructure
NASA Astrophysics Data System (ADS)
Ichino, M.; Kinutani, H.; Ono, M.; Shimizu, T.; Yoshikawa, M.; Masuda, K.; Fukuda, K.; Kawamoto, H.
2009-12-01
DIAS (Data Integration and Analysis System) is one of the GEOSS activities in Japan. It is also a leading part of the GEOSS task of the same name defined in the GEOSS Ten Year Implementation Plan. The main mission of DIAS is to construct a data infrastructure that can effectively integrate earth environmental data such as observation data, numerical model outputs, and socio-economic data from the fields of climate, water cycle, ecosystem, ocean, biodiversity and agriculture. Some of DIAS's data products are available at http://www.jamstec.go.jp/e/medid/dias. Most earth environmental data have spatial and temporal attributes in common, such as the geographic extent covered or the creation date. Metadata standards that include these common attributes are published by the geographic information technical committee (TC211) of ISO (the International Organization for Standardization) as ISO 19115:2003 and 19139:2007. Accordingly, DIAS metadata is based on the ISO/TC211 metadata standards. From the viewpoint of data users, metadata is useful not only for data retrieval and analysis but also for interoperability and information sharing among experts, beginners and nonprofessionals. From the viewpoint of data providers, on the other hand, discussions identified two problems. One is that data providers want to minimize the additional work and time spent creating metadata. The other is that data providers want to manage and publish documents that explain their data sets more comprehensively. To solve these problems, we have been developing a document centric metadata registration tool. Its key features are that the generated documents are available instantly and that generating metadata costs data providers no extra effort. The tool is developed as a Web application, so data providers need no software other than a web browser. 
The interface of the tool provides the section titles of the documents and by filling out the content of each section, the documents for the data sets are automatically published in PDF and HTML format. Furthermore, the metadata XML file which is compliant with ISO19115 and ISO19139 is created at the same moment. The generated metadata are managed in the metadata database of the DIAS project, and will be used in various ISO19139 compliant metadata management tools, such as GeoNetwork.
NASA Astrophysics Data System (ADS)
Do, Hong Xuan; Gudmundsson, Lukas; Leonard, Michael; Westra, Seth
2018-04-01
This is the first part of a two-paper series presenting the Global Streamflow Indices and Metadata archive (GSIM), a worldwide collection of metadata and indices derived from more than 35 000 daily streamflow time series. This paper focuses on the compilation of the daily streamflow time series based on 12 free-to-access streamflow databases (seven national databases and five international collections). It also describes the development of three metadata products (freely available at https://doi.pangaea.de/10.1594/PANGAEA.887477): (1) a GSIM catalogue collating basic metadata associated with each time series, (2) catchment boundaries for the contributing area of each gauge, and (3) catchment metadata extracted from 12 gridded global data products representing essential properties such as land cover type, soil type, and climate and topographic characteristics. The quality of the delineated catchment boundary is also made available and should be consulted in GSIM application. The second paper in the series then explores production and analysis of streamflow indices. Having collated an unprecedented number of stations and associated metadata, GSIM can be used to advance large-scale hydrological research and improve understanding of the global water cycle.
Building Format-Agnostic Metadata Repositories
NASA Astrophysics Data System (ADS)
Cechini, M.; Pilone, D.
2010-12-01
This presentation will discuss the problems that surround persisting and discovering metadata in multiple formats; a set of tenets that must be addressed in a solution; and NASA’s Earth Observing System (EOS) ClearingHOuse’s (ECHO) proposed approach. In order to facilitate cross-discipline data analysis, Earth Scientists will potentially interact with more than one data source. The most common data discovery paradigm relies on services and/or applications facilitating the discovery and presentation of metadata. What may not be common are the formats in which the metadata are formatted. As the number of sources and datasets utilized for research increases, it becomes more likely that a researcher will encounter conflicting metadata formats. Metadata repositories, such as the EOS ClearingHOuse (ECHO), along with data centers, must identify ways to address this issue. In order to define the solution to this problem, the following tenets are identified: - There exists a set of ‘core’ metadata fields recommended for data discovery. - There exists a set of users who will require the entire metadata record for advanced analysis. - There exists a set of users who will require a ‘core’ set of metadata fields for discovery only. - There will never be a cessation of new formats or a total retirement of all old formats. - Users should be presented metadata in a consistent format. ECHO has undertaken an effort to transform its metadata ingest and discovery services in order to support the growing set of metadata formats. In order to address the previously listed items, ECHO’s new metadata processing paradigm utilizes the following approach: - Identify a cross-format set of ‘core’ metadata fields necessary for discovery. - Implement format-specific indexers to extract the ‘core’ metadata fields into an optimized query capability. - Archive the original metadata in its entirety for presentation to users requiring the full record. 
- Provide on-demand translation of ‘core’ metadata to any supported result format. With this identified approach, the Earth Scientist is provided with a consistent data representation as they interact with a variety of datasets that utilize multiple metadata formats. They are then able to focus their efforts on the more critical research activities which they are undertaking.
NASA Astrophysics Data System (ADS)
Pascoe, Charlotte; Lawrence, Bryan; Moine, Marie-Pierre; Ford, Rupert; Devine, Gerry
2010-05-01
The EU METAFOR Project (http://metaforclimate.eu) has created a web-based model documentation questionnaire to collect metadata from the modelling groups that are running simulations in support of the Coupled Model Intercomparison Project - 5 (CMIP5). The CMIP5 model documentation questionnaire will retrieve information about the details of the models used, how the simulations were carried out, how the simulations conformed to the CMIP5 experiment requirements and details of the hardware used to perform the simulations. The metadata collected by the CMIP5 questionnaire will allow CMIP5 data to be compared in a scientifically meaningful way. This paper describes the life-cycle of the CMIP5 questionnaire development which starts with relatively unstructured input from domain specialists and ends with formal XML documents that comply with the METAFOR Common Information Model (CIM). Each development step is associated with a specific tool. (1) Mind maps are used to capture information requirements from domain experts and build a controlled vocabulary, (2) a python parser processes the XML files generated by the mind maps, (3) Django (python) is used to generate the dynamic structure and content of the web based questionnaire from processed xml and the METAFOR CIM, (4) Python parsers ensure that information entered into the CMIP5 questionnaire is output as CIM compliant xml, (5) CIM compliant output allows automatic information capture tools to harvest questionnaire content into databases such as the Earth System Grid (ESG) metadata catalogue. This paper will focus on how Django (python) and XML input files are used to generate the structure and content of the CMIP5 questionnaire. 
It will also address how the choice of development tools listed above provided a framework that enabled working scientists (who would not ordinarily interact with UML and XML) to be part of the iterative development process and to ensure that the CMIP5 model documentation questionnaire reflects what scientists want to know about the models. Keywords: metadata, CMIP5, automatic information capture, tool development
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra
The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Here the comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.
Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra; ...
2016-01-01
The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Here the comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.
Master Metadata Repository and Metadata-Management System
NASA Technical Reports Server (NTRS)
Armstrong, Edward; Reed, Nate; Zhang, Wen
2007-01-01
A master metadata repository (MMR) software system manages the storage and searching of metadata pertaining to data from national and international satellite sources of the Global Ocean Data Assimilation Experiment (GODAE) High Resolution Sea Surface Temperature Pilot Project [GHRSSTPP]. These sources produce a total of hundreds of data files daily, each file classified as one of more than ten data products representing global sea-surface temperatures. The MMR is a relational database wherein the metadata are divided into granule-level records [denoted file records (FRs)] for individual satellite files and collection-level records [denoted data set descriptions (DSDs)] that describe metadata common to all the files from a specific data product. FRs and DSDs adhere to the NASA Directory Interchange Format (DIF). The FRs and DSDs are contained in separate subdatabases linked by a common field. The MMR is configured in MySQL database software with custom Practical Extraction and Reporting Language (PERL) programs to validate and ingest the metadata records. The database contents are converted into the Federal Geographic Data Committee (FGDC) standard format by use of the Extensible Markup Language (XML). A Web interface enables users to search for availability of data from all sources.
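The split between collection-level and granule-level records linked by a common field can be sketched as two relational tables. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` in place of the MySQL and PERL stack named in the abstract, and the table names, column names and the `ingest_fr` validation rule are hypothetical rather than the MMR's actual schema.

```python
import sqlite3


def build_mmr(conn):
    """Create hypothetical DSD (collection) and FR (granule) tables."""
    conn.executescript("""
        CREATE TABLE dsd (
            product_id TEXT PRIMARY KEY,   -- the common linking field
            title      TEXT NOT NULL
        );
        CREATE TABLE fr (
            file_name  TEXT PRIMARY KEY,
            product_id TEXT NOT NULL REFERENCES dsd(product_id),
            start_time TEXT NOT NULL
        );
    """)


def ingest_fr(conn, file_name, product_id, start_time):
    """Validate a file record against the known DSDs, then store it."""
    known = conn.execute(
        "SELECT 1 FROM dsd WHERE product_id = ?", (product_id,)
    ).fetchone()
    if known is None:
        raise ValueError(f"unknown data product: {product_id}")
    conn.execute(
        "INSERT INTO fr (file_name, product_id, start_time) VALUES (?, ?, ?)",
        (file_name, product_id, start_time),
    )


def files_for_product(conn, product_id):
    """Join granule records to their collection record via the common field."""
    rows = conn.execute(
        """SELECT fr.file_name, dsd.title
           FROM fr JOIN dsd ON fr.product_id = dsd.product_id
           WHERE dsd.product_id = ?
           ORDER BY fr.file_name""",
        (product_id,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
build_mmr(conn)
conn.execute("INSERT INTO dsd VALUES (?, ?)", ("P1", "Example SST product"))
ingest_fr(conn, "a.nc", "P1", "2007-01-01T00:00:00")
ingest_fr(conn, "b.nc", "P1", "2007-01-01T12:00:00")
```

Keeping product-wide metadata in one DSD row means each of the hundreds of daily file records carries only its own granule-specific fields plus the linking key.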
The MPO system for automatic workflow documentation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Abla, G.; Coviello, E. N.; Flanagan, S. M.
Data from large-scale experiments and extreme-scale computing is expensive to produce and may be used for critical applications. However, it is not the mere existence of data that is important, but our ability to make use of it. Experience has shown that when metadata is better organized and more complete, the underlying data becomes more useful. Traditionally, capturing the steps of scientific workflows and metadata was the role of the lab notebook, but the digital era has resulted instead in the fragmentation of data, processing, and annotation. Here, this article presents the Metadata, Provenance, and Ontology (MPO) System, the software that can automate the documentation of scientific workflows and associated information. Based on recorded metadata, it provides explicit information about the relationships among the elements of workflows in notebook form augmented with directed acyclic graphs. A set of web-based graphical navigation tools and Application Programming Interface (API) have been created for searching and browsing, as well as programmatically accessing the workflows and data. We describe the MPO concepts and its software architecture. We also report the current status of the software as well as the initial deployment experience.
The MPO system for automatic workflow documentation
Abla, G.; Coviello, E. N.; Flanagan, S. M.; ...
2016-04-18
Data from large-scale experiments and extreme-scale computing is expensive to produce and may be used for critical applications. However, it is not the mere existence of data that is important, but our ability to make use of it. Experience has shown that when metadata is better organized and more complete, the underlying data becomes more useful. Traditionally, capturing the steps of scientific workflows and metadata was the role of the lab notebook, but the digital era has resulted instead in the fragmentation of data, processing, and annotation. Here, this article presents the Metadata, Provenance, and Ontology (MPO) System, the software that can automate the documentation of scientific workflows and associated information. Based on recorded metadata, it provides explicit information about the relationships among the elements of workflows in notebook form augmented with directed acyclic graphs. A set of web-based graphical navigation tools and Application Programming Interface (API) have been created for searching and browsing, as well as programmatically accessing the workflows and data. We describe the MPO concepts and its software architecture. We also report the current status of the software as well as the initial deployment experience.
Earthquake and failure forecasting in real-time: A Forecasting Model Testing Centre
NASA Astrophysics Data System (ADS)
Filgueira, Rosa; Atkinson, Malcolm; Bell, Andrew; Main, Ian; Boon, Steven; Meredith, Philip
2013-04-01
Across Europe there are a large number of rock deformation laboratories, each of which runs many experiments. Similarly there are a large number of theoretical rock physicists who develop constitutive and computational models both for rock deformation and changes in geophysical properties. Here we consider how to open up opportunities for sharing experimental data in a way that is integrated with multiple hypothesis testing. We present a prototype for a new forecasting model testing centre based on e-infrastructures for capturing and sharing data and models to accelerate Rock Physicist (RP) research. This proposal is triggered by our work on data assimilation in the NERC EFFORT (Earthquake and Failure Forecasting in Real Time) project, using data provided by the NERC CREEP 2 experimental project as a test case. EFFORT is a multi-disciplinary collaboration between Geoscientists, Rock Physicists and Computer Scientists. Brittle failure of the crust is likely to play a key role in controlling the timing of a range of geophysical hazards, such as volcanic eruptions, yet the predictability of brittle failure is unknown. Our aim is to provide a facility for developing and testing models to forecast brittle failure in experimental and natural data. Model testing is performed in real-time, verifiably prospective mode, in order to avoid selection biases that are possible in retrospective analyses. The project will ultimately quantify the predictability of brittle failure, and how this predictability scales from simple, controlled laboratory conditions to the complex, uncontrolled real world. Experimental data are collected from controlled laboratory experiments, which include data from the UCL Laboratory and from the Creep2 project, which will undertake experiments in a deep-sea laboratory. 
We illustrate the properties of the prototype testing centre by streaming and analysing realistically noisy synthetic data, as an aid to generating and improving testing methodologies in imperfect conditions. The forecasting model testing centre uses a repository to hold all the data and models and a catalogue to hold all the corresponding metadata. It supports the following operations. Data transfer: Upload experimental data: We have developed the FAST (Flexible Automated Streaming Transfer) tool to upload data from RP laboratories to the repository. FAST sets up data transfer requirements and automatically selects the transfer protocol. Metadata are automatically created and stored. Web data access: Create synthetic data: Users can choose a generator and supply parameters. Synthetic data are automatically stored with corresponding metadata. Select data and models: Search the metadata using criteria designed for RP. The metadata of each data set (synthetic or from a laboratory) and each model are described in their respective catalogues, accessible through the web portal. Upload models: Upload and store a model with associated metadata. This provides an opportunity to share models. The web portal solicits and creates metadata describing each model. Run model and visualise results: Selected data and a model are submitted to a High Performance Computational resource, with the technical details hidden. Results are displayed in accelerated time and stored, allowing retrieval, inspection and aggregation. The forecasting model testing centre proposed could be integrated into EPOS. Its expected benefits are: Improved understanding of brittle failure prediction and its scalability to natural phenomena. Accelerated and extensive testing and rapid sharing of insights. Increased impact and visibility of RP and GeoScience research. Resources for education and training. A key challenge is to agree on the framework for sharing RP data and models. Our work is a provocative first step.
Automatic image assessment from facial attributes
NASA Astrophysics Data System (ADS)
Ptucha, Raymond; Kloosterman, David; Mittelstaedt, Brian; Loui, Alexander
2013-03-01
Personal consumer photography collections often contain photos captured by numerous devices stored both locally and via online services. The task of gathering, organizing, and assembling still and video assets in preparation for sharing with others can be quite challenging. Current commercial photobook applications are mostly manual, requiring significant user interaction. To assist the consumer in organizing these assets, we propose an automatic method to assign a fitness score to each asset, whereby the top scoring assets are used for product creation. Our method uses cues extracted from analyzing pixel data, metadata embedded in the file, as well as ancillary tags or online comments. When a face occurs in an image, its features have a dominating influence on both aesthetic and compositional properties of the displayed image. As such, this paper will emphasize the contributions faces have on affecting the overall fitness score of an image. To understand consumer preference, we conducted a psychophysical study that spanned 27 judges, 5,598 faces, and 2,550 images. Preferences on a per-face and per-image basis were independently gathered to train our classifiers. We describe how to use machine learning techniques to merge differing facial attributes into a single classifier. Our novel methods of facial weighting, fusion of facial attributes, and dimensionality reduction produce state-of-the-art results suitable for commercial applications.
Creating preservation metadata from XML-metadata profiles
NASA Astrophysics Data System (ADS)
Ulbricht, Damian; Bertelmann, Roland; Gebauer, Petra; Hasler, Tim; Klump, Jens; Kirchner, Ingo; Peters-Kottig, Wolfgang; Mettig, Nora; Rusch, Beate
2014-05-01
Registration of dataset DOIs at DataCite makes research data citable and comes with the obligation to keep data accessible in the future. In addition, many universities and research institutions collect data that are unique and not repeatable, such as the data produced by an observational network, and they want to keep these data for future generations. Consequently, such data should be ingested into preservation systems that automatically take care of file format changes. Open source preservation software developed along the definitions of the ISO OAIS reference model is available, but there are still problems to be solved during the ingest of data and metadata. File format validation is difficult: not only are format validators remarkably slow, but, owing to the variety of file formats, different validators also return conflicting identification profiles for identical data. These conflicts are hard to resolve. Preservation systems have a deficit in the support of custom metadata. Furthermore, data producers are sometimes not aware that quality metadata is a key issue for the re-use of data. In the project EWIG a university institute and a research institute work together with Zuse-Institute Berlin, which acts as an infrastructure facility, to generate exemplary workflows for ingesting research data into OAIS compliant archives, with emphasis on the geosciences. The Institute for Meteorology provides timeseries data from an urban monitoring network, whereas GFZ Potsdam delivers file based data from research projects. To identify problems in existing preservation workflows, the technical work is complemented by interviews with data practitioners. Policies for handling data and metadata are developed. Furthermore, university teaching material is created to raise future scientists' awareness of research data management. As a testbed for ingest workflows the digital preservation system Archivematica [1] is used. 
During the ingest process metadata is generated that is compliant with the Metadata Encoding and Transmission Standard (METS). To find datasets in future portals and to make use of these data in one's own scientific work, proper selection of discovery metadata and application metadata is very important. Some XML-metadata profiles are not suitable for preservation, because version changes are very fast and make it nearly impossible to automate the migration. For other XML-metadata profiles, schema definitions are changed after publication of the profile or the schema definitions become inaccessible, which might cause problems during validation of the metadata inside the preservation system [2]. Some metadata profiles are not used widely enough and might not even exist in the future. Eventually, discovery and application metadata have to be embedded into the mdWrap-subtree of the METS-XML. [1] http://www.archivematica.org [2] http://dx.doi.org/10.2218/ijdc.v7i1.215
A Research Agenda and Vision for Data Science
NASA Astrophysics Data System (ADS)
Mattmann, C. A.
2014-12-01
Big Data has emerged as a first-class citizen in the research community spanning disciplines in the domain sciences - Astronomy is pushing velocity with new ground-based instruments such as the Square Kilometre Array (SKA) and its unprecedented data rates (700 TB/sec!); Earth science is pushing the boundaries of volume, with growing numbers of experiments in the international Intergovernmental Panel on Climate Change (IPCC) and with the climate modeling and remote sensing communities increasing the size of the total archives to the Exabyte scale; airborne missions from NASA such as the JPL Airborne Snow Observatory (ASO) are both increasing velocity and decreasing the overall turnaround time required to receive products and to make them available to water managers and decision makers. Proteomics and the computational biology community are sequencing genomes and providing near real time answers to clinicians, researchers, and ultimately to patients, helping to process and understand and create diagnoses. Data complexity is on the rise, and the norm is no longer 100s of metadata attributes, but thousands to hundreds of thousands, including complex interrelationships between data and metadata and knowledge. I published a vision for data science in Nature 2013 that encapsulates four thrust areas and foci that I believe the computer science, Big Data, and data science communities need to attack over the next decade to make fundamental progress in the data volume, velocity and complexity challenges arising from the domain sciences such as those described above. These areas include: (1) rapid and unobtrusive algorithm integration; (2) intelligent and automatic data movement; (3) automated and rapid extraction of text, metadata and language from heterogeneous file formats; and (4) participation and people power via open source communities. 
In this talk I will revisit these four areas and describe current progress, future work and the challenges ahead as we move forward in this exciting age of Data Science.
An algebra for spatio-temporal information generation
NASA Astrophysics Data System (ADS)
Pebesma, Edzer; Scheider, Simon; Gräler, Benedikt; Stasch, Christoph; Hinz, Matthias
2016-04-01
When we accept the premises of James Frew's laws of metadata (Frew's first law: scientists don't write metadata; Frew's second law: any scientist can be forced to write bad metadata), but also assume that scientists try to maximise the impact of their research findings, can we develop our information infrastructures such that useful metadata is generated automatically? Currently, sharing of data and software to completely reproduce research findings is becoming standard, e.g. in the Journal of Statistical Software [1]. The reproduction (e.g. R) scripts however convey correct syntax, but still limited semantics. We propose [2] a new, platform-neutral way to algebraically describe how data is generated, e.g. by observation, and how data is derived, e.g. by processing observations. It starts with forming functions composed of four reference system types (space, time, quality, entity), which express for instance continuity of objects over time, and continuity of fields over space and time. Data, which is discrete by definition, is generated by evaluating such functions at discrete space and time instances, or by evaluating a convolution (aggregation) over them. Derived data is obtained by inputting data to data derivation functions, which for instance interpolate, estimate, aggregate, or convert fields into objects and vice versa. As opposed to the traditional when, where and what semantics of data sets, our algebra focuses on describing how a data set was generated. We argue that it can be used to discover data sets that were derived from a particular source x, or derived by a particular procedure y. It may also form the basis for inferring meaningfulness of derivation procedures [3]. Current research focuses on automatically generating provenance documentation from R scripts. [1] http://www.jstatsoft.org/ (open access) [2] http://www.meaningfulspatialstatistics.org has the full paper (in review) [3] Stasch, C., S. Scheider, E. Pebesma, W. Kuhn, 2014. 
Meaningful Spatial Prediction and Aggregation. Environmental Modelling & Software, 51, 149-165 (open access)
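The distinction this abstract draws between generating data (evaluating a function at discrete space and time instances) and deriving data (aggregating it), while recording how each data set came about, can be pictured with a small Python sketch. It is not the authors' algebra: the `Dataset` class, the nested-tuple provenance in `how`, and the functions `observe`, `aggregate_over_time` and `derived_from` are hypothetical names chosen for illustration, and the field function is invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Dataset:
    values: dict  # discrete samples, keyed by (location, time) or by location
    how: tuple    # nested description of how the data set was produced


def observe(field, locations, times, source_name):
    """Generate data by evaluating a continuous field at discrete instances."""
    values = {(s, t): field(s, t) for s in locations for t in times}
    return Dataset(values, ("observe", source_name))


def aggregate_over_time(ds, fn, fn_name):
    """Derive data by aggregating each location's samples over time."""
    grouped = {}
    for (s, _t), v in ds.values.items():
        grouped.setdefault(s, []).append(v)
    values = {s: fn(vs) for s, vs in grouped.items()}
    return Dataset(values, ("aggregate", fn_name, ds.how))


def derived_from(ds, source_name):
    """Walk the provenance chain to test whether ds stems from a source."""
    node = ds.how
    while node[0] != "observe":
        node = node[2]  # step to the parent description
    return node[1] == source_name


# Invented field: value depends on location s and time t.
temperature = lambda s, t: s + 10 * t

raw = observe(temperature, locations=[0, 1], times=[1, 2], source_name="x")
daily_mean = aggregate_over_time(raw, mean, "mean")
```

Because every derived `Dataset` carries the description of its parent, a query such as `derived_from(daily_mean, "x")` can be answered from the metadata alone, which is the kind of discovery by source or procedure the abstract argues for.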
Chen, Mingyang; Stott, Amanda C; Li, Shenggang; Dixon, David A
2012-04-01
A robust metadata database called the Collaborative Chemistry Database Tool (CCDBT) for massive amounts of computational chemistry raw data has been designed and implemented. It performs data synchronization and simultaneously extracts the metadata. Computational chemistry data in various formats from different computing sources, software packages, and users can be parsed into uniform metadata for storage in a MySQL database. Parsing is performed by a parsing pyramid, including parsers written for different levels of data types and sets created by the parser loader after loading parser engines and configurations. Copyright © 2011 Elsevier Inc. All rights reserved.
McMahon, Christiana; Denaxas, Spiros
2017-11-06
Informed consent is an important feature of longitudinal research studies as it enables the linking of the baseline participant information with administrative data. The lack of standardized models to capture consent elements can lead to substantial challenges. A structured approach to capturing consent-related metadata can address these. a) Explore the state-of-the-art for recording consent; b) Identify key elements of consent required for record linkage; and c) Create and evaluate a novel metadata management model to capture consent-related metadata. The main methodological components of our work were: a) a systematic literature review and qualitative analysis of consent forms; b) the development and evaluation of a novel metadata model. We qualitatively analyzed 61 manuscripts and 30 consent forms. We extracted data elements related to obtaining consent for linkage. We created a novel metadata management model for consent and evaluated it by comparison with the existing standards and by iteratively applying it to case studies. The developed model can facilitate the standardized recording of consent for linkage in longitudinal research studies and enable the linkage of external participant data. Furthermore, it can provide a structured way of recording consent-related metadata and facilitate the harmonization and streamlining of processes.
ERIC Educational Resources Information Center
Miller, L. Dee; Soh, Leen-Kiat; Samal, Ashok; Kupzyk, Kevin; Nugent, Gwen
2015-01-01
Learning objects (LOs) are important online resources for both learners and instructors and usage for LOs is growing. Automatic LO tracking collects large amounts of metadata about individual students as well as data aggregated across courses, learning objects, and other demographic characteristics (e.g. gender). The challenge becomes identifying…
mzML2ISA & nmrML2ISA: generating enriched ISA-Tab metadata files from metabolomics XML data
Larralde, Martin; Lawson, Thomas N.; Weber, Ralf J. M.; Moreno, Pablo; Haug, Kenneth; Rocca-Serra, Philippe; Viant, Mark R.; Steinbeck, Christoph; Salek, Reza M.
2017-01-01
Abstract. Summary: Submission to the MetaboLights repository for metabolomics data currently places the burden of reporting instrument and acquisition parameters in ISA-Tab format on users, who have to do it manually, a process that is time consuming and prone to user input error. Since the large majority of these parameters are embedded in instrument raw data files, an opportunity exists to capture this metadata more accurately. Here we report a set of Python packages that can automatically generate ISA-Tab metadata file stubs from raw XML metabolomics data files. The parsing packages are separated into mzML2ISA (encompassing mzML and imzML formats) and nmrML2ISA (nmrML format only). Overall, the use of mzML2ISA & nmrML2ISA reduces the time needed to capture metadata substantially (capturing 90% of metadata on assay and sample levels), is much less prone to user input errors, improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets. Availability and Implementation: mzML2ISA & nmrML2ISA are available under version 3 of the GNU General Public Licence at https://github.com/ISA-tools. Documentation is available from http://2isa.readthedocs.io/en/latest/. Contact: reza.salek@ebi.ac.uk or isatools@googlegroups.com. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28402395
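As a rough illustration of the point that acquisition parameters are already embedded in the raw XML, the sketch below pulls `cvParam` entries out of a tiny mzML-like fragment using only the Python standard library. It is not the mzML2ISA API: the function name `collect_cv_params` and the sample fragment (including its parameter value) are hypothetical, and a real mzML file is far larger.

```python
import xml.etree.ElementTree as ET

# XML namespace used by mzML documents.
MZML_NS = {"mz": "http://psi.hupo.org/ms/mzml"}

# Hypothetical, heavily trimmed fragment for illustration only.
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="IC1">
      <cvParam cvRef="MS" accession="MS:1000529"
               name="instrument serial number" value="SN-0001"/>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>"""


def collect_cv_params(xml_text):
    """Return {parameter name: value} for every cvParam in the document."""
    root = ET.fromstring(xml_text)
    params = {}
    for node in root.iterfind(".//mz:cvParam", MZML_NS):
        # cvParam elements without a value attribute are stored as "".
        params[node.get("name")] = node.get("value", "")
    return params


metadata_stub = collect_cv_params(SAMPLE)
print(metadata_stub)
```

A dictionary like `metadata_stub` could then be written into the relevant columns of a metadata file stub instead of being typed in by hand.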
mzML2ISA & nmrML2ISA: generating enriched ISA-Tab metadata files from metabolomics XML data.
Larralde, Martin; Lawson, Thomas N; Weber, Ralf J M; Moreno, Pablo; Haug, Kenneth; Rocca-Serra, Philippe; Viant, Mark R; Steinbeck, Christoph; Salek, Reza M
2017-08-15
Submission to the MetaboLights repository for metabolomics data currently places the burden of reporting instrument and acquisition parameters in ISA-Tab format on users, who have to do it manually, a process that is time consuming and prone to user input error. Since the large majority of these parameters are embedded in instrument raw data files, an opportunity exists to capture this metadata more accurately. Here we report a set of Python packages that can automatically generate ISA-Tab metadata file stubs from raw XML metabolomics data files. The parsing packages are separated into mzML2ISA (encompassing mzML and imzML formats) and nmrML2ISA (nmrML format only). Overall, the use of mzML2ISA & nmrML2ISA reduces the time needed to capture metadata substantially (capturing 90% of metadata on assay and sample levels), is much less prone to user input errors, improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets. mzML2ISA & nmrML2ISA are available under version 3 of the GNU General Public Licence at https://github.com/ISA-tools. Documentation is available from http://2isa.readthedocs.io/en/latest/. Contact: reza.salek@ebi.ac.uk or isatools@googlegroups.com. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
docBUILDER - Building Your Useful Metadata for Earth Science Data and Services.
NASA Astrophysics Data System (ADS)
Weir, H. M.; Pollack, J.; Olsen, L. M.; Major, G. R.
2005-12-01
The docBUILDER tool, created by NASA's Global Change Master Directory (GCMD), assists the scientific community in efficiently creating quality data and services metadata. Metadata authors are asked to complete five required fields to ensure enough information is provided for users to discover the data and related services they seek. After the metadata record is submitted to the GCMD, it is reviewed for semantic and syntactic consistency. Currently, two versions are available - a Web-based tool accessible with most browsers (docBUILDERweb) and a stand-alone desktop application (docBUILDERsolo). The Web version is available through the GCMD website, at http://gcmd.nasa.gov/User/authoring.html. This version has been updated and now offers: personalized templates to ease entering similar information for multiple data sets/services; automatic population of Data Center/Service Provider URLs based on the selected center/provider; three-color support to indicate required, recommended, and optional fields; an editable text window containing the XML record, to allow for quick editing; and improved overall performance and presentation. The docBUILDERsolo version offers the ability to create metadata records on a computer wherever you are. Except for installation and the occasional update of keywords, data/service providers are not required to have an Internet connection. This freedom will allow users with portable computers (Windows, Mac, and Linux) to create records in field campaigns, whether in Antarctica or the Australian Outback. This version also offers a spell-checker, in addition to all of the features found in the Web version.
Automatic meta-data collection of STP observation data
NASA Astrophysics Data System (ADS)
Ishikura, S.; Kimura, E.; Murata, K.; Kubo, T.; Shinohara, I.
2006-12-01
In geoscience and STP (Solar-Terrestrial Physics) studies, a wide variety of observations have been made by satellites and ground-based observatories. These data are stored and managed at many organizations, but there is no common procedure or rule for providing and sharing the data files. Researchers have had difficulty searching and analyzing such different types of data distributed over the Internet. To support such cross-over analyses of observation data, we have developed STARS (Solar-Terrestrial data Analysis and Reference System). STARS consists of a client application (STARS-app), the meta-database (STARS-DB), the portal Web service (STARS-WS) and the download agent Web service (STARS DLAgent-WS). The STARS-DB includes directory information, access permissions, protocol information for retrieving data files, hierarchy information for mission/team/data, and user information. By using the STARS-DB, users of STARS can download observation data files without knowing where the files are located. We have implemented the Portal-WS to retrieve meta-data from the meta-database. One reason we use a Web service is to overcome the variety of firewall restrictions, which have become stricter in recent years. It is now difficult for the STARS client application to access the STARS-DB directly by sending SQL queries to obtain meta-data. Using the Web service, we succeeded in placing the STARS-DB behind the Portal-WS and in preventing it from being exposed on the Internet. STARS accesses the Portal-WS by sending a SOAP (Simple Object Access Protocol) request over HTTP. Meta-data is received as a SOAP response. The STARS DLAgent-WS provides clients with data files downloaded from data sites. The data files are provided over a variety of protocols (e.g., FTP, HTTP, FTPS and SFTP), which are selected individually at each site.
The clients send a SOAP request with download request messages and receive observation data files as a SOAP response with a DIME attachment. By introducing the DLAgent-WS, we overcame the problem that the data management policies of each data site are independent. Another important issue to be overcome is how to collect the meta-data of observation data files. So far, STARS-DB managers have added new records to the meta-database and updated them manually. Maintaining the meta-database has been troublesome because observation data are generated every day and the number of data files is increasing explosively. We have therefore attempted to automate the collection of the meta-data. In this research, we adopted RSS 1.0 (RDF Site Summary) as a format for exchanging meta-data in the STP field. RSS is an RDF vocabulary that provides a multipurpose, extensible meta-data description and is suitable for syndication of meta-data. Most of the data in the present study are described in CDF (Common Data Format), which is a self-describing data format. We have converted meta-information extracted from the CDF data files into RSS files. The program that generates the RSS files is executed on the data site server once a day, and the RSS files provide information about new data files. The RSS files are collected by an RSS collection server once a day and the meta-data are stored in the STARS-DB.
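The daily "meta-information to RSS file" step can be sketched as follows. This is a minimal illustration using the Python standard library, not the STARS implementation: the function `build_rss`, the entry fields (`url`, `title`, `date`) and the example.org URLs are assumptions, and reading the meta-information out of the CDF files is left out entirely.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("rdf", RDF), ("rss", RSS), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def build_rss(channel_url, entries):
    """Serialize file-level meta-data entries as an RSS 1.0 (RDF) document."""
    root = ET.Element(f"{{{RDF}}}RDF")
    channel = ET.SubElement(root, f"{{{RSS}}}channel",
                            {f"{{{RDF}}}about": channel_url})
    ET.SubElement(channel, f"{{{RSS}}}title").text = "New observation data files"
    ET.SubElement(channel, f"{{{RSS}}}link").text = channel_url
    ET.SubElement(channel, f"{{{RSS}}}description").text = "Daily list of new data files"
    items = ET.SubElement(channel, f"{{{RSS}}}items")
    seq = ET.SubElement(items, f"{{{RDF}}}Seq")
    for entry in entries:
        # The channel lists each item; the item itself carries the meta-data.
        ET.SubElement(seq, f"{{{RDF}}}li", {f"{{{RDF}}}resource": entry["url"]})
        item = ET.SubElement(root, f"{{{RSS}}}item",
                             {f"{{{RDF}}}about": entry["url"]})
        ET.SubElement(item, f"{{{RSS}}}title").text = entry["title"]
        ET.SubElement(item, f"{{{RSS}}}link").text = entry["url"]
        ET.SubElement(item, f"{{{DC}}}date").text = entry["date"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical entries standing in for meta-information read from CDF files.
NEW_FILES = [
    {"url": "https://data.example.org/a_20060101.cdf",
     "title": "a_20060101.cdf", "date": "2006-01-01"},
    {"url": "https://data.example.org/a_20060102.cdf",
     "title": "a_20060102.cdf", "date": "2006-01-02"},
]
rss_text = build_rss("https://data.example.org/rss", NEW_FILES)
print(rss_text)
```

A collection server could fetch such a file once a day, parse the `item` elements, and insert any unseen URLs into the meta-database.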
NASA Technical Reports Server (NTRS)
Carnahan, Richard S., Jr.; Corey, Stephen M.; Snow, John B.
1989-01-01
Applications of rapid prototyping and Artificial Intelligence techniques to problems associated with Space Station-era information management systems are described. In particular, the work is centered on issues related to: (1) intelligent man-machine interfaces applied to scientific data user support, and (2) the requirement that intelligent information management systems (IIMS) be able to efficiently process metadata updates concerning types of data handled. The advanced IIMS represents functional capabilities driven almost entirely by the needs of potential users. The volume of scientific data projected to be generated in the Space Station era is likely to be significantly greater than that of the data currently processed and analyzed. Information about scientific data must be presented clearly, concisely, and with support features to allow users at all levels of expertise efficient and cost-effective data access. Additionally, mechanisms for allowing more efficient IIMS metadata update processes must be addressed. The work reported covers the following IIMS design aspects: IIMS data and metadata modeling, including the automatic updating of IIMS-contained metadata; IIMS user-system interface considerations, including significant problems associated with remote access, user profiles, and on-line tutorial capabilities; and development of an IIMS query and browse facility, including the capability to deal with spatial information. A working prototype has been developed and is being enhanced.
Java Library for Input and Output of Image Data and Metadata
NASA Technical Reports Server (NTRS)
Deen, Robert; Levoe, Steven
2003-01-01
A Java-language library supports input and output (I/O) of image data and metadata (label data) in the format of the Video Image Communication and Retrieval (VICAR) image-processing software and in several similar formats, including a subset of the Planetary Data System (PDS) image file format. The library does the following: It provides a low-level, direct-access layer, enabling an application subprogram to read and write specific image files, lines, or pixels, and manipulate metadata directly. Two coding/decoding subprograms ("codecs" for short) based on the Java Advanced Imaging (JAI) software provide access to VICAR and PDS images in a file-format-independent manner. The VICAR and PDS codecs enable any program that conforms to the specification of the JAI codec to use VICAR or PDS images automatically, without specific knowledge of the VICAR or PDS format. The library also includes Image I/O plug-in subprograms for VICAR and PDS formats. Application programs that conform to the Image I/O specification of Java version 1.4 can utilize any image format for which such a plug-in subprogram exists, without specific knowledge of the format itself. Like the aforementioned codecs, the VICAR and PDS Image I/O plug-in subprograms support reading and writing of metadata.
MPEG-7-based description infrastructure for an audiovisual content analysis and retrieval system
NASA Astrophysics Data System (ADS)
Bailer, Werner; Schallauer, Peter; Hausenblas, Michael; Thallinger, Georg
2005-01-01
We present a case study of establishing a description infrastructure for an audiovisual content-analysis and retrieval system. The description infrastructure consists of an internal metadata model and access tools for using it. Based on an analysis of requirements, we have selected, out of a set of candidates, MPEG-7 as the basis of our metadata model. The openness and generality of MPEG-7 allow using it in a broad range of applications, but increase complexity and hinder interoperability. Profiling has been proposed as a solution, with the focus on selecting and constraining description tools. Semantic constraints are currently only described in textual form. Conformance in terms of semantics can thus not be evaluated automatically and mappings between different profiles can only be defined manually. As a solution, we propose an approach to formalize the semantic constraints of an MPEG-7 profile using a formal vocabulary expressed in OWL, which allows automated processing of semantic constraints. We have defined the Detailed Audiovisual Profile as the profile to be used in our metadata model and we show how some of the semantic constraints of this profile can be formulated using ontologies. To work practically with the metadata model, we have implemented an MPEG-7 library and a client/server document access infrastructure.
A Generic Metadata Editor Supporting System Using Drupal CMS
NASA Astrophysics Data System (ADS)
Pan, J.; Banks, N. G.; Leggott, M.
2011-12-01
Metadata handling is a key factor in preserving and reusing scientific data. In recent years, standardized structural metadata has become widely used in Geoscience communities. However, there exist many different standards in Geosciences, such as the current version of the Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata (FGDC CSDGM), the Ecological Markup Language (EML), the Geography Markup Language (GML), and the emerging ISO 19115 and related standards. In addition, there are many different subsets for Geoscience subdomains, such as the Biological Profile of the FGDC CSDGM, or for geopolitical regions, such as the European Profile or the North American Profile in the ISO standards. It is therefore desirable to have a software foundation to support metadata creation and editing for multiple standards and profiles, without reinventing the wheel. We have developed a generic, flexible software system to do just that: to facilitate the support for multiple metadata standards and profiles. The software consists of a set of modules for the Drupal Content Management System (CMS), with minimal interdependencies with other Drupal modules. There are two steps in using the system's metadata functions. First, an administrator can use the system to design a user form, based on an XML schema and its instances. The form definition is named and stored in the Drupal database as an XML blob. Second, users in an editor role can then use the persisted XML definition to render an actual metadata entry form, for creating or editing a metadata record. Behind the scenes, the form definition XML is transformed into a PHP array, which is then rendered via the Drupal Form API. When the form is submitted, the posted values are used to modify a metadata record. Drupal hooks can be used to perform custom processing on the metadata record before and after submission.
It is trivial to store the metadata record as an actual XML file or in a storage/archive system. We are working on adding many features to help editor users, such as auto-completion, pre-populating of forms, partial saving, as well as automatic schema validation. In this presentation we will demonstrate a few sample editors, including an FGDC editor and a bare-bones editor for ISO 19115/19139. We will also demonstrate the use of templates during the definition phase, with the support of export and import functions. Form pre-population and input validation will also be covered. These modules are available as open-source software from the Islandora software foundation, as a component of a larger Drupal-based data archive system. They can be easily installed as a stand-alone system or plugged into other existing metadata platforms.
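The "form definition XML to nested array" step described above can be sketched as below. The system itself produces a PHP array for the Drupal Form API; Python is used here only for illustration. The `<form>`/`<element>` layout of the definition and the function `to_form_array` are assumptions, not the module's actual schema; only the `#type`, `#title` and `#required` keys follow Drupal Form API naming.

```python
import xml.etree.ElementTree as ET

# Hypothetical persisted form definition (the real XML layout is not given).
FORM_DEFINITION = """<form name="minimal_record">
  <element name="title" type="textfield" label="Title" required="true"/>
  <element name="contact" type="fieldset" label="Contact">
    <element name="email" type="textfield" label="E-mail"/>
  </element>
</form>"""


def to_form_array(node):
    """Recursively turn <element> children into a Form-API-style nested dict."""
    form = {}
    for child in node.findall("element"):
        entry = {"#type": child.get("type"), "#title": child.get("label")}
        if child.get("required") == "true":
            entry["#required"] = True
        # Nested elements become child keys of their parent entry.
        entry.update(to_form_array(child))
        form[child.get("name")] = entry
    return form


form_array = to_form_array(ET.fromstring(FORM_DEFINITION))
print(form_array)
```

Keeping the definition as data rather than code is what lets one rendering routine serve several metadata standards: a new standard or profile needs a new definition, not a new form builder.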
Towards a semantic web of paleoclimatology
NASA Astrophysics Data System (ADS)
Emile-Geay, J.; Eshleman, J. A.
2012-12-01
The paleoclimate record is information-rich, yet significant technical barriers currently exist before it can be used to automatically answer scientific questions. Here we make the case for a universal format to structure paleoclimate data. A simple example demonstrates the scientific utility of such a self-contained way of organizing coral data and meta-data in the Matlab language. This example is generalized to a universal ontology that may form the backbone of an open-source, open-access and crowd-sourced paleoclimate database. Its key attributes are: 1. Parsability: the format is self-contained (hence machine-readable), and would therefore enable a semantic web of paleoclimate information. 2. Universality: the format is platform-independent (readable on all computers and operating systems), and language-independent (readable in major programming languages). 3. Extensibility: the format requires a minimum set of fields to appropriately define a paleoclimate record, but allows for the database to grow organically as more records are added, or - equally important - as more metadata are added to existing records. 4. Citability: The format enables the automatic citation of peer-reviewed articles as well as data citations whenever a data record is being used for analysis, making due recognition of scientific work an automatic part and foundational principle of paleoclimate data analysis. 5. Ergonomy: The format will be easy to use, update and manage. This structure is designed to enable semantic searches, and is expected to help accelerate discovery in all workflows where paleoclimate data are being used. Practical steps towards the implementation of such a system at the community level are then discussed. Figure caption: Preliminary ontology describing relationships between the data and meta-data fields of the Nurhati et al. [2011] climate record.
Several fields are viewed as instances of larger classes (ProxyClass, Site, Reference), which would allow computers to perform operations on all records within a specific class (e.g. if the measurement type is δ18O, or if the proxy class is 'Tree Ring Width', or if the resolution is less than 3 months, etc.). All records in such a database would be bound to each other by similar links, allowing machines to automatically process any form of query involving existing information. Such a design would also allow growth, by adding records and/or additional information about each record.
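A minimal sketch of such a record structure is given below, assuming a small set of required fields plus a free-form `extra` mapping for extensibility. The abstract's own example is in Matlab; Python is used here only for illustration. The class `PaleoRecord`, its field names, and every sample value are hypothetical placeholders, not fields or data from the proposed ontology.

```python
from dataclasses import dataclass, field


@dataclass
class PaleoRecord:
    # Minimum set of fields assumed necessary to define a record.
    name: str
    proxy_class: str
    site: str
    reference: str
    measurement_type: str
    resolution_months: float
    # Extensibility: further metadata can be attached later.
    extra: dict = field(default_factory=dict)


def select(records, **criteria):
    """Return records whose attributes equal every given criterion."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]


def citations(records):
    """Citability: collect the references of all records used in an analysis."""
    return sorted({r.reference for r in records})


# Placeholder records; the values are invented for the example.
DATABASE = [
    PaleoRecord("rec-A", "Coral", "Site 1", "Author A (year)", "d18O", 1.0),
    PaleoRecord("rec-B", "Tree Ring Width", "Site 2", "Author B (year)", "width", 12.0),
    PaleoRecord("rec-C", "Coral", "Site 3", "Author C (year)", "Sr/Ca", 1.0),
]

corals = select(DATABASE, proxy_class="Coral")
sub_seasonal = [r for r in DATABASE if r.resolution_months < 3]
print(citations(corals))
```

Because every record exposes the same class-level fields, a query such as "all coral records with sub-seasonal resolution" becomes a filter over attributes, and the citation list falls out of the selection automatically.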
Unified Science Information Model for SoilSCAPE using the Mercury Metadata Search System
NASA Astrophysics Data System (ADS)
Devarakonda, Ranjeet; Lu, Kefa; Palanisamy, Giri; Cook, Robert; Santhana Vannan, Suresh; Moghaddam, Mahta; Clewley, Dan; Silva, Agnelo; Akbar, Ruzbeh
2013-12-01
SoilSCAPE (Soil moisture Sensing Controller And oPtimal Estimator) introduces a new concept for a smart wireless sensor web technology for optimal measurements of surface-to-depth profiles of soil moisture using in-situ sensors. The objective is to enable a guided and adaptive sampling strategy for the in-situ sensor network to meet the measurement validation objectives of spaceborne soil moisture sensors such as the Soil Moisture Active Passive (SMAP) mission. This work is being carried out at the University of Michigan, the Massachusetts Institute of Technology, University of Southern California, and Oak Ridge National Laboratory. At Oak Ridge National Laboratory we are using the Mercury metadata search system [1] for building a Unified Information System for the SoilSCAPE project. This unified portal primarily comprises three key pieces: Distributed Search/Discovery; Data Collections and Integration; and Data Dissemination. Mercury, federally funded software for metadata harvesting, indexing, and searching, would be used for this module. Soil moisture data sources identified as part of this activity, such as SoilSCAPE and FLUXNET (in-situ sensors), AirMOSS (airborne retrieval), and SMAP (spaceborne retrieval), are being indexed and maintained by Mercury. Mercury would be the central repository of data sources for cal/val for soil moisture studies and would provide a mechanism to identify additional data sources. Relevant metadata from existing inventories such as ORNL DAAC, USGS Clearinghouse, ARM, NASA ECHO, GCMD etc. would be brought in to this soil-moisture data search/discovery module. The SoilSCAPE [2] metadata records will also be published in broader metadata repositories such as GCMD and data.gov. Mercury can be configured to provide a single portal to soil moisture information contained in disparate data management systems located anywhere on the Internet.
Mercury is able to extract metadata systematically from HTML pages or XML files using a variety of methods including OAI-PMH [3]. The Mercury search interface then allows users to perform simple, fielded, spatial and temporal searches across a central harmonized index of metadata. Mercury supports various metadata standards including FGDC, ISO-19115, DIF, Dublin-Core, Darwin-Core, and EML. This poster describes in detail how Mercury implements the Unified Science Information Model for soil moisture data. References: [1] Devarakonda R., et al. Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics (2010), 3(1): 87-94. [2] Devarakonda R., et al. Daymet: Single Pixel Data Extraction Tool. http://daymet.ornl.gov/singlepixel.html (2012). Last accessed 10-01-2013. [3] Devarakonda R., et al. Data sharing and retrieval using OAI-PMH. Earth Science Informatics (2011), 4(1): 1-5.
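To make the OAI-PMH harvesting step more concrete, here is a minimal sketch using only the Python standard library. It is not Mercury's code: the functions `list_records_url` and `harvest_titles`, the example.org endpoint, and the sample response are assumptions, and the response is parsed from an inline string rather than fetched over the network. The `ListRecords` verb, the `oai_dc` metadata prefix, and the XML namespaces are those defined by OAI-PMH and Dublin Core.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hypothetical, trimmed response used instead of a live request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:soil-001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example soil moisture dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_titles(response_text):
    """Map each record identifier to its Dublin Core title."""
    root = ET.fromstring(response_text)
    index = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        index[identifier] = record.findtext(".//dc:title", namespaces=NS)
    return index


request_url = list_records_url("https://example.org/oai")
harvested = harvest_titles(SAMPLE_RESPONSE)
print(request_url)
print(harvested)
```

A real harvester would also follow `resumptionToken` elements to page through large result sets and would map the harvested fields into a central, harmonized index.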
A quality score for coronary artery tree extraction results
NASA Astrophysics Data System (ADS)
Cao, Qing; Broersen, Alexander; Kitslaar, Pieter H.; Lelieveldt, Boudewijn P. F.; Dijkstra, Jouke
2018-02-01
Coronary artery trees (CATs) are often extracted to aid the fully automatic analysis of coronary artery disease on coronary computed tomography angiography (CCTA) images. Automatically extracted CATs often miss some arteries or include wrong extractions, which require manual correction before subsequent steps can be performed. For analyzing a large number of datasets, a manual quality check of the extraction results is time-consuming. This paper presents a method to automatically calculate quality scores for extracted CATs in terms of clinical significance of the extracted arteries and the completeness of the extracted CAT. Both right dominant (RD) and left dominant (LD) anatomical statistical models are generated and exploited in developing the quality score. To automatically determine which model should be used, a dominance type detection method is also designed. Experiments are performed on the automatically extracted and manually refined CATs from 42 datasets to evaluate the proposed quality score. In 39 (92.9%) cases, the proposed method is able to measure the quality of the manually refined CATs with higher scores than the automatically extracted CATs. On a 100-point scale, the average scores for the automatically extracted and manually refined CATs are 82.0 (±15.8) and 88.9 (±5.4), respectively. The proposed quality score will assist the automatic processing of the CAT extractions for large cohorts which contain both RD and LD cases. To the best of our knowledge, this is the first time that a general quality score for an extracted CAT is presented.
Sharma, Deepak K; Solbrig, Harold R; Tao, Cui; Weng, Chunhua; Chute, Christopher G; Jiang, Guoqian
2017-06-05
Detailed Clinical Models (DCMs) have been regarded as the basis for retaining computable meaning when data are exchanged between heterogeneous computer systems. To better support clinical cancer data capturing and reporting, there is an emerging need to develop informatics solutions for standards-based clinical models in cancer study domains. The objective of the study is to develop and evaluate a cancer genome study metadata management system that serves as a key infrastructure in supporting clinical information modeling in cancer genome study domains. We leveraged a Semantic Web-based metadata repository enhanced with both ISO11179 metadata standard and Clinical Information Modeling Initiative (CIMI) Reference Model. We used the common data elements (CDEs) defined in The Cancer Genome Atlas (TCGA) data dictionary, and extracted the metadata of the CDEs using the NCI Cancer Data Standards Repository (caDSR) CDE dataset rendered in the Resource Description Framework (RDF). The ITEM/ITEM_GROUP pattern defined in the latest CIMI Reference Model is used to represent reusable model elements (mini-Archetypes). We produced a metadata repository with 38 clinical cancer genome study domains, comprising a rich collection of mini-Archetype pattern instances. We performed a case study of the domain "clinical pharmaceutical" in the TCGA data dictionary and demonstrated enriched data elements in the metadata repository are very useful in support of building detailed clinical models. Our informatics approach leveraging Semantic Web technologies provides an effective way to build a CIMI-compliant metadata repository that would facilitate the detailed clinical modeling to support use cases beyond TCGA in clinical cancer study domains.
Expressive map design: OGC SLD/SE++ extension for expressive map styles
NASA Astrophysics Data System (ADS)
Christophe, Sidonie; Duménieu, Bertrand; Masse, Antoine; Hoarau, Charlotte; Ory, Jérémie; Brédif, Mathieu; Lecordix, François; Mellado, Nicolas; Turbet, Jérémie; Loi, Hugo; Hurtut, Thomas; Vanderhaeghe, David; Vergne, Romain; Thollot, Joëlle
2018-05-01
In the context of custom map design, handling more artistic and expressive tools has been identified as a cartographic need, in order to design stylized and expressive maps. Based on previous work on style formalization, an approach for specifying the map style has been proposed and tested on particular use cases. A first step deals with the analysis of inspiration sources, in order to extract 'what makes the style of the source', i.e. the salient visual characteristics to be automatically reproduced (textures, spatial arrangements, linear stylization, etc.). In a second step, in order to mimic and generate those visual characteristics, existing and innovative rendering techniques have been implemented in our GIS engine, thus extending its capabilities to generate expressive renderings. Therefore, an extension of the existing cartographic pipeline has been proposed based on the following aspects: 1- extension of the symbolization specifications OGC SLD/SE in order to provide a formalism to specify and reference expressive rendering methods; 2- separation of the specification of each rendering method and its parameterization, as metadata. The main contribution has been described in (Christophe et al. 2016). In this paper, we focus firstly on the extension of the cartographic pipeline (SLD++ and metadata) and secondly on map design capabilities which have been tested on various topographic styles: old cartographic styles (Cassini), artistic styles (watercolor, impressionism, Japanese print), hybrid topographic styles (ortho-imagery & vector data) and finally abstract and photo-realist styles for the geovisualization of coastal areas. The genericity and interoperability of our approach are promising and have already been tested for 3D visualization.
Seqenv: linking sequences to environments through text mining.
Sinclair, Lucas; Ijaz, Umer Z; Jensen, Lars Juhl; Coolen, Marco J L; Gubry-Rangin, Cecile; Chroňáková, Alica; Oulas, Anastasis; Pavloudi, Christina; Schnetzer, Julia; Weimann, Aaron; Ijaz, Ali; Eiler, Alexander; Quince, Christopher; Pafilis, Evangelos
2016-01-01
Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes, leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the "nt" nucleotide database provided by NCBI and, for every hit, extracts the textual metadata field when it is available. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv: to a survey of ammonia-oxidizing archaea and to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv.
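The "tag isolation sources, then summarize a sample by weighted summation" idea can be sketched as follows. This is a simplified stand-in and not seqenv's algorithm: plain substring matching replaces the text mining step, the vocabulary `VOCAB` is a tiny hypothetical excerpt rather than EnvO, and the normalization used in `summarize_sample` (each sequence distributes its abundance over its matched terms) is an assumption.

```python
from collections import defaultdict

# Hypothetical phrase -> environment term mapping (not the real EnvO).
VOCAB = {
    "soil": "soil",
    "sediment": "sediment",
    "seawater": "sea water",
    "sea water": "sea water",
}


def terms_in(text):
    """Return the set of environment terms mentioned in one isolation source."""
    lowered = text.lower()
    return {term for phrase, term in VOCAB.items() if phrase in lowered}


def summarize_sample(hits, abundances):
    """Weighted summation of per-sequence term frequencies.

    hits:       {sequence id: [isolation source strings from its search hits]}
    abundances: {sequence id: relative abundance of the sequence in the sample}
    """
    totals = defaultdict(float)
    for seq_id, sources in hits.items():
        counts = defaultdict(int)
        for source in sources:
            for term in terms_in(source):
                counts[term] += 1
        matched = sum(counts.values())
        if matched == 0:
            continue  # no recognizable environment term for this sequence
        for term, count in counts.items():
            totals[term] += abundances.get(seq_id, 0.0) * count / matched
    return dict(totals)


# Invented example input.
HITS = {
    "seq1": ["forest soil", "agricultural soil"],
    "seq2": ["marine sediment", "surface seawater"],
}
ABUNDANCES = {"seq1": 0.75, "seq2": 0.25}

sample_profile = summarize_sample(HITS, ABUNDANCES)
print(sample_profile)
```

In this toy input the sample profile is dominated by "soil" because the abundant sequence only ever matched soil-related isolation sources.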
Creating a Collaborative Research Network for Scientists
NASA Astrophysics Data System (ADS)
Gunn, W.
2012-12-01
This abstract proposes a discussion of how professional science communication and scientific cooperation can become more efficient through the use of modern social network technology, using the example of Mendeley. Mendeley is a research workflow and collaboration tool which crowdsources real-time research trend information and semantic annotations of research papers in a central data store, thereby creating a "social research network" that is emergent from the research data added to the platform. We describe how Mendeley's model can overcome barriers for collaboration by turning research papers into social objects, making academic data publicly available via an open API, and promoting more efficient collaboration. Central to the success of Mendeley has been the creation of a tool that works for the researcher without the requirement of being part of an explicit social network. Mendeley automatically extracts metadata from research papers, and allows a researcher to annotate, tag and organize their research collection. The tool integrates with the paper writing workflow and provides advanced collaboration options, thus significantly improving researchers' productivity. By anonymously aggregating usage data, Mendeley enables the emergence of social metrics and real-time usage stats on top of the articles' abstract metadata. In this way a social network of collaborators, and people genuinely interested in content, emerges. By building this research network around the article as the social object, a social layer of direct relevance to academia emerges. As science, particularly the Earth sciences with their large shared resources, becomes more and more global, the management and coordination of research is more and more dependent on technology to support these distributed collaborations.
Sequencing Data Discovery and Integration for Earth System Science with MetaSeek
NASA Astrophysics Data System (ADS)
Hoarfrost, A.; Brown, N.; Arnosti, C.
2017-12-01
Microbial communities play a central role in biogeochemical cycles. Sequencing data resources from environmental sources have grown exponentially in recent years, and represent a singular opportunity to investigate microbial interactions with Earth system processes. Carrying out such meta-analyses depends on our ability to discover and curate sequencing data into large-scale integrated datasets. However, such integration efforts are currently challenging and time-consuming, with sequencing data scattered across multiple repositories and metadata that is not easily or comprehensively searchable. MetaSeek is a sequencing data discovery tool that integrates sequencing metadata from all the major data repositories, allowing the user to search and filter on datasets in a lightweight application with an intuitive, easy-to-use web-based interface. Users can save and share curated datasets, while other users can browse these data integrations or use them as a jumping off point for their own curation. Missing and/or erroneous metadata are inferred automatically where possible, and where not possible, users are prompted to contribute to the improvement of the sequencing metadata pool by correcting and amending metadata errors. Once an integrated dataset has been curated, users can follow simple instructions to download their raw data and quickly begin their investigations. In addition to the online interface, the MetaSeek database is easily queryable via an open API, further enabling users and facilitating integrations of MetaSeek with other data curation tools. This tool lowers the barriers to curation and integration of environmental sequencing data, clearing the path forward to illuminating the ecosystem-scale interactions between biological and abiotic processes.
PH5 for integrating and archiving different data types
NASA Astrophysics Data System (ADS)
Azevedo, Steve; Hess, Derick; Beaudoin, Bruce
2016-04-01
PH5 is IRIS PASSCAL's file organization of HDF5 used for seismic data. The extensibility and portability of HDF5 allows the PH5 format to evolve and operate on a variety of platforms and interfaces. To make PH5 even more flexible, the seismic metadata is separated from the time series data in order to achieve gains in performance as well as ease of use and to simplify user interaction. This separation affords easy updates to metadata after the data are archived without having to access waveform data. To date, PH5 is currently used for integrating and archiving active source, passive source, and onshore-offshore seismic data sets with the IRIS Data Management Center (DMC). Active development to make PH5 fully compatible with FDSN web services and deliver StationXML is near completion. We are also exploring the feasibility of utilizing QuakeML for active seismic source representation. The PH5 software suite, PIC KITCHEN, comprises in-field tools that include data ingestion (e.g. RefTek format, SEG-Y, and SEG-D), meta-data management tools including QC, and a waveform review tool. These tools enable building archive ready data in-field during active source experiments greatly decreasing the time to produce research ready data sets. Once archived, our online request page generates a unique web form and pre-populates much of it based on the metadata provided to it from the PH5 file. The data requester then can intuitively select the extraction parameters as well as data subsets they wish to receive (current output formats include SEG-Y, SAC, mseed). The web interface then passes this on to the PH5 processing tools to generate the requested seismic data, and e-mail the requester a link to the data set automatically as soon as the data are ready. PH5 file organization was originally designed to hold seismic time series data and meta-data from controlled source experiments using RefTek data loggers. 
The flexibility of HDF5 has enabled us to extend the use of PH5 in several areas one of which is using PH5 to handle very large data sets. PH5 is also good at integrating data from various types of seismic experiments such as OBS, onshore-offshore, controlled source, and passive recording. HDF5 is capable of holding practically any type of digital data so integrating GPS data with seismic data is possible. Since PH5 is a common format and data contained in HDF5 is accessible randomly it has been easy to extend to include new input and output data formats as community needs arise.
Current Development at the Southern California Earthquake Data Center (SCEDC)
NASA Astrophysics Data System (ADS)
Appel, V. L.; Clayton, R. W.
2005-12-01
Over the past year, the SCEDC completed or is near completion of three featured projects: Station Information System (SIS) Development: The SIS will provide users with an interface into complete and accurate station metadata for all current and historic data at the SCEDC. The goal of this project is to develop a system that can interact with a single database source to enter, update and retrieve station metadata easily and efficiently. The system will provide accurate station/channel information for active stations to the SCSN real-time processing system, as well as station/channel information for stations that have parametric data at the SCEDC, i.e., for users retrieving data via STP. Additionally, the SIS will supply information required to generate dataless SEED and COSMOS V0 volumes and allow stations to be added to the system with a minimal but incomplete set of information using predefined defaults that can be easily updated as more information becomes available. Finally, the system will facilitate statewide metadata exchange for both real-time processing and provide a common approach to CISN historic station metadata. Moment Tensor Solutions: The SCEDC is currently archiving and delivering Moment Magnitudes and Moment Tensor Solutions (MTS) produced by the SCSN in real-time and post-processing solutions for events spanning back to 1999. The automatic MTS runs on all local events with magnitudes > 3.0, and all regional events > 3.5. The distributed solution automatically creates links from all USGS Simpson Maps to a text e-mail summary solution, creates a .gif image of the solution, and updates the moment tensor database tables at the SCEDC. Searchable Scanned Waveforms Site: The Caltech Seismological Lab has made available 12,223 scanned images of pre-digital analog recordings of major earthquakes recorded in Southern California between 1962 and 1992 at http://www.data.scec.org/research/scans/. 
The SCEDC has developed a searchable web interface that allows users to search the available files, select multiple files for download and then retrieve a zipped file containing the results. Scanned images of paper records for M>3.5 southern California earthquakes and several significant teleseisms are available for download via the SCEDC through this search tool.
Dynamic Non-Hierarchical File Systems for Exascale Storage
DOE Office of Scientific and Technical Information (OSTI.GOV)
Long, Darrell E.; Miller, Ethan L
This constitutes the final report for “Dynamic Non-Hierarchical File Systems for Exascale Storage”. The ultimate goal of this project was to improve data management in scientific computing and high-end computing (HEC) applications, and to achieve this goal we proposed: to develop the first, HEC-targeted, file system featuring rich metadata and provenance collection, extreme scalability, and future storage hardware integration as core design goals, and to evaluate and develop a flexible non-hierarchical file system interface suitable for providing more powerful and intuitive data management interfaces to HEC and scientific computing users. Data management is swiftly becoming a serious problem in the scientific community – while copious amounts of data are good for obtaining results, finding the right data is often daunting and sometimes impossible. Scientists participating in a Department of Energy workshop noted that most of their time was spent “...finding, processing, organizing, and moving data and it’s going to get much worse”. Scientists should not be forced to become data mining experts in order to retrieve the data they want, nor should they be expected to remember the naming convention they used several years ago for a set of experiments they now wish to revisit. Ideally, locating the data you need would be as easy as browsing the web. Unfortunately, existing data management approaches are usually based on hierarchical naming, a 40-year-old technology designed to manage thousands of files, not exabytes of data. Today’s systems do not take advantage of the rich array of metadata that current high-end computing (HEC) file systems can gather, including content-based metadata and provenance information. As a result, current metadata search approaches are typically ad hoc and often work by providing a parallel management system to the “main” file system, as is done in Linux (the locate utility), personal computers, and enterprise search appliances. 
These search applications are often optimized for a single file system, making it difficult to move files and their metadata between file systems. Users have tried to solve this problem in several ways, including the use of separate databases to index file properties, the encoding of file properties into file names, and separately gathering and managing provenance data, but none of these approaches has worked well, either due to limited usefulness or scalability, or both. Our research addressed several key issues: High-performance, real-time metadata harvesting: extracting important attributes from files dynamically and immediately updating indexes used to improve search; Transparent, automatic, and secure provenance capture: recording the data inputs and processing steps used in the production of each file in the system; Scalable indexing: indexes that are optimized for integration with the file system; Dynamic file system structure: our approach provides dynamic directories similar to those in semantic file systems, but these are the native organization rather than a feature grafted onto a conventional system. In addition to these goals, our research effort will include evaluating the impact of new storage technologies on the file system design and performance. In particular, the indexing and metadata harvesting functions can potentially benefit from the performance improvements promised by new storage class memories.
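The combination of immediate metadata harvesting and dynamic directories described in this report can be illustrated with a small sketch. This is not the project's file system: the class name `MetadataIndex`, the attribute names, and the file paths are hypothetical, and an in-memory inverted index stands in for the scalable, file-system-integrated indexes the report refers to.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index mapping (attribute, value) pairs to file paths."""

    def __init__(self):
        self._index = defaultdict(set)  # (attribute, value) -> {paths}
        self._attrs = {}                # path -> last harvested attributes

    def harvest(self, path, attrs):
        """Record a file's attributes and update the index immediately."""
        # Drop index entries left over from an earlier harvest of this path.
        for key, value in self._attrs.get(path, {}).items():
            self._index[(key, value)].discard(path)
        self._attrs[path] = dict(attrs)
        for key, value in attrs.items():
            self._index[(key, value)].add(path)

    def dynamic_directory(self, **query):
        """Return the sorted paths whose attributes match every query pair."""
        if not query:
            return sorted(self._attrs)
        matches = [self._index.get(pair, set()) for pair in query.items()]
        return sorted(set.intersection(*matches))


# Hypothetical files and attributes, for illustration only.
index = MetadataIndex()
index.harvest("/runs/a.h5", {"experiment": "combustion", "year": 2011})
index.harvest("/runs/b.h5", {"experiment": "combustion", "year": 2012})
index.harvest("/runs/c.h5", {"experiment": "climate", "year": 2012})
print(index.dynamic_directory(experiment="combustion", year=2012))
```

In this sketch a "directory" is simply the result of a query, so a file appears in every directory whose attribute constraints it satisfies, and re-harvesting a file moves it between directories without renaming it.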
Empirical Analysis of Exploiting Review Helpfulness for Extractive Summarization of Online Reviews
ERIC Educational Resources Information Center
Xiong, Wenting; Litman, Diane
2014-01-01
We propose a novel unsupervised extractive approach for summarizing online reviews by exploiting review helpfulness ratings. In addition to using the helpfulness ratings for review-level filtering, we suggest using them as the supervision of a topic model for sentence-level content scoring. The proposed method is metadata-driven, requiring no…
Park, Yu Rang; Yoon, Young Jo; Jang, Tae Hun; Seo, Hwa Jeong; Kim, Ju Han
2014-01-01
Extension of the standard model while retaining compliance with it is a challenging issue because there is currently no method for semantically or syntactically verifying an extended data model. A metadata-based extended model, named CCR+, was designed and implemented to achieve interoperability between standard and extended models. Furthermore, a multilayered validation method was devised to validate the standard and extended models. The American Society for Testing and Materials (ASTM) Community Care Record (CCR) standard was selected to evaluate the CCR+ model; two CCR and one CCR+ XML files were evaluated. In total, 188 metadata were extracted from the ASTM CCR standard; these metadata are semantically interconnected and registered in the metadata registry. An extended-data-model-specific validation file was generated from these metadata. This file can be used in a smartphone application (Health Avatar CCR+) as a part of a multilayered validation. The new CCR+ model was successfully evaluated via a patient-centric exchange scenario involving multiple hospitals, with the results supporting both syntactic and semantic interoperability between the standard CCR and extended, CCR+, model. A feasible method for delivering an extended model that complies with the standard model is presented herein. There is a great need to extend static standard models such as the ASTM CCR in various domains: the methods presented here represent an important reference for achieving interoperability between standard and extended models.
Long-term Science Data Curation Using a Digital Object Model and Open-Source Frameworks
NASA Astrophysics Data System (ADS)
Pan, J.; Lenhardt, W.; Wilson, B. E.; Palanisamy, G.; Cook, R. B.
2010-12-01
Scientific digital content, including Earth Science observations and model output, has become more heterogeneous in format and more distributed across the Internet. In addition, data and metadata are becoming necessarily linked internally and externally on the Web. As a result, such content has become more difficult for providers to manage and preserve and for users to locate, understand, and consume. Specifically, it is increasingly harder to deliver relevant metadata and data processing lineage information along with the actual content consistently. Readme files, data quality information, production provenance, and other descriptive metadata are often separated in the storage level as well as in the data search and retrieval interfaces available to a user. Critical archival metadata, such as auditing trails and integrity checks, are often even more difficult for users to access, if they exist at all. We investigate the use of several open-source software frameworks to address these challenges. We use Fedora Commons Framework and its digital object abstraction as the repository, Drupal CMS as the user-interface, and the Islandora module as the connector from Drupal to Fedora Repository. With the digital object model, metadata of data description and data provenance can be associated with data content in a formal manner, so are external references and other arbitrary auxiliary information. Changes are formally audited on an object, and digital contents are versioned and have checksums automatically computed. Further, relationships among objects are formally expressed with RDF triples. Data replication, recovery, metadata export are supported with standard protocols, such as OAI-PMH. 
We provide a tentative comparative analysis of the chosen software stack with the Open Archival Information System (OAIS) reference model, along with our initial results with the existing terrestrial ecology data collections at NASA’s ORNL Distributed Active Archive Center for Biogeochemical Dynamics (ORNL DAAC).
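The digital object behaviour described in this abstract (versioned content, automatically computed checksums, an audit trail, and relationships expressed as triples) can be sketched in a few lines. The sketch below does not use the Fedora Commons or Islandora APIs; the `DigitalObject` class, its method names, the identifiers, and the predicate string are all hypothetical, and SHA-256 is an assumed choice of checksum algorithm.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Toy digital object: versioned content plus audit and relation records."""

    pid: str
    versions: list = field(default_factory=list)     # (number, checksum, content)
    audit_trail: list = field(default_factory=list)  # human-readable change log
    relations: list = field(default_factory=list)    # (subject, predicate, object)

    def add_version(self, content: bytes, who: str) -> str:
        """Store a new version, compute its checksum, and log the change."""
        checksum = hashlib.sha256(content).hexdigest()
        number = len(self.versions) + 1
        self.versions.append((number, checksum, content))
        self.audit_trail.append(f"{who} stored version {number}")
        return checksum

    def relate(self, predicate: str, target: str) -> None:
        """Record a relationship as a subject-predicate-object triple."""
        self.relations.append((self.pid, predicate, target))

    def verify(self, number: int) -> bool:
        """Recompute the checksum of a stored version and compare it."""
        _, checksum, content = self.versions[number - 1]
        return hashlib.sha256(content).hexdigest() == checksum


# Hypothetical identifiers and content, for illustration only.
dataset = DigitalObject("demo:dataset-1")
dataset.add_version(b"site,biomass\nA,1.2\n", who="curator")
dataset.add_version(b"site,biomass\nA,1.3\n", who="curator")
dataset.relate("isDescribedBy", "demo:readme-1")
print(len(dataset.versions), dataset.verify(2), dataset.relations)
```

Keeping the checksum beside each version is what makes a later integrity check possible without consulting any external record.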
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ferrell, Paul; Hanson, Paige; Ardi, Calvin
2016-11-04
A system for processing network packet capture streams, extracting metadata and generating flow records (via Argus). The system can be used by network security operators and analysts to enable forensic investigations for network security events.
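As a rough illustration of what generating flow records from a packet stream involves, the sketch below groups packets by protocol and endpoint pair and accumulates per-flow counters. It does not use Argus or reproduce its record format; the `Packet` fields, the `flow_records` function, and the sample addresses are hypothetical, and the packets are assumed to have been parsed already.

```python
from collections import namedtuple

# Hypothetical, already-parsed packet metadata.
Packet = namedtuple("Packet", "ts src sport dst dport proto size")


def flow_records(packets):
    """Aggregate packets into bidirectional flow records keyed by endpoints."""
    flows = {}
    for p in packets:
        a, b = (p.src, p.sport), (p.dst, p.dport)
        # Order the endpoints so both directions share one key.
        key = (p.proto,) + (a + b if a <= b else b + a)
        rec = flows.setdefault(
            key, {"first": p.ts, "last": p.ts, "packets": 0, "bytes": 0}
        )
        rec["first"] = min(rec["first"], p.ts)
        rec["last"] = max(rec["last"], p.ts)
        rec["packets"] += 1
        rec["bytes"] += p.size
    return flows


# Made-up traffic using documentation address ranges.
capture = [
    Packet(0.00, "192.0.2.1", 50000, "198.51.100.7", 443, "tcp", 60),
    Packet(0.05, "198.51.100.7", 443, "192.0.2.1", 50000, "tcp", 1500),
    Packet(0.20, "192.0.2.1", 53000, "198.51.100.9", 53, "udp", 80),
]
for key, rec in flow_records(capture).items():
    print(key, rec)
```

A record of this kind keeps only metadata (endpoints, timing, volume), which is what makes it compact enough to retain for later forensic queries.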
NASA Astrophysics Data System (ADS)
Helge Østerås, Bjørn; Skaane, Per; Gullien, Randi; Catrine Trægde Martinsen, Anne
2018-02-01
The main purpose was to compare average glandular dose (AGD) for same-compression digital mammography (DM) and digital breast tomosynthesis (DBT) acquisitions in a population-based screening program, with and without breast density stratification, as determined by automatically calculated breast density (Quantra™). A secondary aim was to compare AGD estimates based on measured breast density, air kerma and half value layer (HVL) to DICOM metadata based estimates. AGD was estimated for 3819 women participating in the screening trial. All received craniocaudal and mediolateral oblique views of each breast with paired DM and DBT acquisitions. Exposure parameters were extracted from DICOM metadata. Air kerma and HVL were measured for all beam qualities used to acquire the mammograms. Volumetric breast density was estimated using Quantra™. AGD was estimated using the Dance model. AGD reported directly from the DICOM metadata was also assessed. Mean AGD was 1.74 and 2.10 mGy for DM and DBT, respectively. Mean DBT/DM AGD ratio was 1.24. For fatty breasts: mean AGD was 1.74 and 2.27 mGy for DM and DBT, respectively. For dense breasts: mean AGD was 1.73 and 1.79 mGy, for DM and DBT, respectively. For breasts of similar thickness, dense breasts had higher AGD for DM and similar AGD for DBT. The DBT/DM dose ratio was substantially lower for dense compared to fatty breasts (1.08 versus 1.33). The average c-factor was 1.16. Using previously published polynomials to estimate glandularity from thickness underestimated the c-factor by 5.9% on average. Mean AGD error between estimates based on measurements (air kerma and HVL) versus DICOM header data was 3.8%, but for one mammography unit as high as 7.9%. Mean error of using the AGD value reported in the DICOM header was 10.7 and 13.3%, respectively. Thus, measurement of breast density, radiation dose and beam quality can substantially affect AGD estimates.
Østerås, Bjørn Helge; Skaane, Per; Gullien, Randi; Martinsen, Anne Catrine Trægde
2018-01-25
The main purpose was to compare average glandular dose (AGD) for same-compression digital mammography (DM) and digital breast tomosynthesis (DBT) acquisitions in a population-based screening program, with and without breast density stratification, as determined by automatically calculated breast density (Quantra™). A secondary aim was to compare AGD estimates based on measured breast density, air kerma and half value layer (HVL) to DICOM metadata based estimates. AGD was estimated for 3819 women participating in the screening trial. All received craniocaudal and mediolateral oblique views of each breast with paired DM and DBT acquisitions. Exposure parameters were extracted from DICOM metadata. Air kerma and HVL were measured for all beam qualities used to acquire the mammograms. Volumetric breast density was estimated using Quantra™. AGD was estimated using the Dance model. AGD reported directly from the DICOM metadata was also assessed. Mean AGD was 1.74 and 2.10 mGy for DM and DBT, respectively. Mean DBT/DM AGD ratio was 1.24. For fatty breasts: mean AGD was 1.74 and 2.27 mGy for DM and DBT, respectively. For dense breasts: mean AGD was 1.73 and 1.79 mGy, for DM and DBT, respectively. For breasts of similar thickness, dense breasts had higher AGD for DM and similar AGD for DBT. The DBT/DM dose ratio was substantially lower for dense compared to fatty breasts (1.08 versus 1.33). The average c-factor was 1.16. Using previously published polynomials to estimate glandularity from thickness underestimated the c-factor by 5.9% on average. Mean AGD error between estimates based on measurements (air kerma and HVL) versus DICOM header data was 3.8%, but for one mammography unit as high as 7.9%. Mean error of using the AGD value reported in the DICOM header was 10.7 and 13.3%, respectively. Thus, measurement of breast density, radiation dose and beam quality can substantially affect AGD estimates.
Image processing tool for automatic feature recognition and quantification
Chen, Xing; Stoddard, Ryan J.
2017-05-02
A system for defining structures within an image is described. The system includes reading of an input file, preprocessing the input file while preserving metadata such as scale information and then detecting features of the input file. In one version the detection first uses an edge detector followed by identification of features using a Hough transform. The output of the process is identified elements within the image.
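The two-stage detection mentioned in this abstract (an edge detector followed by a Hough transform) can be sketched as follows. The abstract does not say which edge detector or which Hough variant is used, so this sketch assumes a simple forward-difference gradient threshold and the straight-line Hough transform; the function names and the synthetic test image are hypothetical.

```python
import math


def detect_edges(image, threshold=0.5):
    """Return (x, y) pixels whose forward-difference gradient exceeds threshold."""
    height, width = len(image), len(image[0])
    edges = []
    for y in range(height - 1):
        for x in range(width - 1):
            gx = image[y][x + 1] - image[y][x]
            gy = image[y + 1][x] - image[y][x]
            if math.hypot(gx, gy) > threshold:
                edges.append((x, y))
    return edges


def hough_lines(edge_points, n_theta=180):
    """Vote for lines rho = x*cos(theta) + y*sin(theta); keys are (theta index, rho)."""
    accumulator = {}
    for x, y in edge_points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            accumulator[(t, rho)] = accumulator.get((t, rho), 0) + 1
    return accumulator


def strongest_line(accumulator):
    """Return (theta index, rho, votes) for a cell with the most votes."""
    (t, rho), votes = max(accumulator.items(), key=lambda item: item[1])
    return t, rho, votes


# Synthetic 8x8 image: dark left half, bright right half (a vertical step edge).
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
edges = detect_edges(image)
print(len(edges), strongest_line(hough_lines(edges)))
```

Each edge pixel votes for every line that could pass through it, so collinear pixels pile up in one accumulator cell; neighbouring cells can tie on a coarse grid, which is why practical implementations usually add peak suppression.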
NASA Technical Reports Server (NTRS)
Jahnsen, Vilhelm J. (Inventor); Campen, Jr., Charles F. (Inventor)
1980-01-01
A sample processor and method for the automatic extraction of families of compounds, known as extracts, from liquid and/or homogenized solid samples are disclosed. The sample processor includes a tube support structure which supports a plurality of extraction tubes, each containing a sample from which families of compounds are to be extracted. The support structure is moveable automatically with respect to one or more extraction stations, so that as each tube is at each station a solvent system, consisting of a solvent and reagents, is introduced therein. As a result an extract is automatically extracted from the tube. The sample processor includes an arrangement for directing the different extracts from each tube to different containers, or to direct similar extracts from different tubes to the same utilization device.
Distributed digital music archives and libraries
NASA Astrophysics Data System (ADS)
Fujinaga, Ichiro
2005-09-01
The main goal of this research program is to develop and evaluate practices, frameworks, and tools for the design and construction of worldwide distributed digital music archives and libraries. Over the last few millennia, humans have amassed an enormous amount of musical information that is scattered around the world. It is becoming abundantly clear that the optimal path for acquisition is to distribute the task of digitizing the wealth of historical and cultural heritage material that exists in analogue formats, which may include books and manuscripts related to music, music scores, photographs, videos, audio tapes, and phonograph records. In order to achieve this goal, libraries, museums, and archives throughout the world, large or small, need well-researched policies, proper guidance, and efficient tools to digitize their collections and to make them available economically. The research conducted within the program addresses unique and imminent challenges posed by the digitization and dissemination of music media. There are four major research projects in progress: development and evaluation of digitization methods for preservation of analogue recordings; optical music recognition using microfilms; design of workflow management system with automatic metadata extraction; and formulation of interlibrary communication strategies.
Standards-based curation of a decade-old digital repository dataset of molecular information.
Harvey, Matthew J; Mason, Nicholas J; McLean, Andrew; Murray-Rust, Peter; Rzepa, Henry S; Stewart, James J P
2015-01-01
The desirable curation of 158,122 molecular geometries derived from the NCI set of reference molecules together with associated properties computed using the MOPAC semi-empirical quantum mechanical method and originally deposited in 2005 into the Cambridge DSpace repository as a data collection is reported. The procedures involved in the curation included annotation of the original data using new MOPAC methods, updating the syntax of the CML documents used to express the data to ensure schema conformance and adding new metadata describing the entries together with an XML schema transformation to map the metadata schema to that used by the DataCite organisation. We have adopted a granularity model in which a DataCite persistent identifier (DOI) is created for each individual molecule to enable data discovery and data metrics at this level using DataCite tools. We recommend that the future research data management (RDM) of the scientific and chemical data components associated with journal articles (the "supporting information") should be conducted in a manner that facilitates automatic periodic curation. Graphical abstract: Standards and metadata-based curation of a decade-old digital repository dataset of molecular information.
In situ data analytics and indexing of protein trajectories.
Johnston, Travis; Zhang, Boyu; Liwo, Adam; Crivelli, Silvia; Taufer, Michela
2017-06-15
The transition toward exascale computing will be accompanied by a performance dichotomy. Computational peak performance will rapidly increase; I/O performance will either grow slowly or be completely stagnant. Essentially, the rate at which data are generated will grow much faster than the rate at which data can be read from and written to the disk. MD simulations will soon face the I/O problem of efficiently writing to and reading from disk on the next generation of supercomputers. This article targets MD simulations at the exascale and proposes a novel technique for in situ data analysis and indexing of MD trajectories. Our technique maps individual trajectories' substructures (i.e., α-helices, β-strands) to metadata frame by frame. The metadata captures the conformational properties of the substructures. The ensemble of metadata can be used for automatic, strategic analysis within a trajectory or across trajectories, without manually identifying those portions of trajectories in which critical changes take place. We demonstrate our technique's effectiveness by applying it to 26.3k helices and 31.2k strands from 9917 PDB proteins and by providing three empirical case studies. © 2017 Wiley Periodicals, Inc.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Harvey, Dustin Yewell
Echo™ is a MATLAB-based software package designed for robust and scalable analysis of complex data workflows. An alternative to tedious, error-prone conventional processes, Echo is based on three transformative principles for data analysis: self-describing data, name-based indexing, and dynamic resource allocation. The software takes an object-oriented approach to data analysis, intimately connecting measurement data with associated metadata. Echo operations in an analysis workflow automatically track and merge metadata and computation parameters to provide a complete history of the process used to generate final results, while automated figure and report generation tools eliminate the potential to mislabel those results. History reporting and visualization methods provide straightforward auditability of analysis processes. Furthermore, name-based indexing on metadata greatly improves code readability for analyst collaboration and reduces opportunities for errors to occur. Echo efficiently manages large data sets using a framework that seamlessly allocates resources such that only the necessary computations to produce a given result are executed. Echo provides a versatile and extensible framework, allowing advanced users to add their own tools and data classes tailored to their own specific needs. Applying these transformative principles and powerful features, Echo greatly improves analyst efficiency and quality of results in many application areas.
Presentation video retrieval using automatically recovered slide and spoken text
NASA Astrophysics Data System (ADS)
Cooper, Matthew
2013-03-01
Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.
Park, Yu Rang; Yoon, Young Jo; Jang, Tae Hun; Seo, Hwa Jeong
2014-01-01
Objectives Extension of the standard model while retaining compliance with it is a challenging issue because there is currently no method for semantically or syntactically verifying an extended data model. A metadata-based extended model, named CCR+, was designed and implemented to achieve interoperability between standard and extended models. Methods Furthermore, a multilayered validation method was devised to validate the standard and extended models. The American Society for Testing and Materials (ASTM) Community Care Record (CCR) standard was selected to evaluate the CCR+ model; two CCR and one CCR+ XML files were evaluated. Results In total, 188 metadata were extracted from the ASTM CCR standard; these metadata are semantically interconnected and registered in the metadata registry. An extended-data-model-specific validation file was generated from these metadata. This file can be used in a smartphone application (Health Avatar CCR+) as a part of a multilayered validation. The new CCR+ model was successfully evaluated via a patient-centric exchange scenario involving multiple hospitals, with the results supporting both syntactic and semantic interoperability between the standard CCR and extended, CCR+, model. Conclusions A feasible method for delivering an extended model that complies with the standard model is presented herein. There is a great need to extend static standard models such as the ASTM CCR in various domains: the methods presented here represent an important reference for achieving interoperability between standard and extended models. PMID:24627817
A Geospatial Semantic Enrichment and Query Service for Geotagged Photographs
Ennis, Andrew; Nugent, Chris; Morrow, Philip; Chen, Liming; Ioannidis, George; Stan, Alexandru; Rachev, Preslav
2015-01-01
With the increasing abundance of technologies and smart devices, equipped with a multitude of sensors for sensing the environment around them, information creation and consumption has now become effortless. This, in particular, is the case for photographs with vast amounts being created and shared every day. For example, at the time of this writing, Instagram users upload 70 million photographs a day. Nevertheless, it still remains a challenge to discover the “right” information for the appropriate purpose. This paper describes an approach to create semantic geospatial metadata for photographs, which can facilitate photograph search and discovery. To achieve this we have developed and implemented a semantic geospatial data model by which a photograph can be enriched with geospatial metadata extracted from several geospatial data sources based on the raw low-level geo-metadata from a smartphone photograph. We present the details of our method and implementation for searching and querying the semantic geospatial metadata repository to enable a user or third party system to find the information they are looking for. PMID:26205265
[Construction of chemical information database based on optical structure recognition technique].
Lv, C Y; Li, M N; Zhang, L R; Liu, Z M
2018-04-18
The objective was to create a protocol that could be used to construct a chemical information database from scientific literature quickly and automatically. Scientific literature, patents and technical reports from different chemical disciplines were collected and stored in PDF format as fundamental datasets. Chemical structures were transformed from published documents and images to machine-readable data by using the name conversion technology and the optical structure recognition tool CLiDE. In the process of molecular structure information extraction, Markush structures were enumerated into well-defined monomer molecules by means of QueryTools in the molecule editor ChemDraw. The document management software EndNote X8 was applied to acquire bibliographical references involving title, author, journal and year of publication. The text mining toolkit ChemDataExtractor was adopted to retrieve information that could be used to populate a structured chemical database from figures, tables, and textual paragraphs. After this step, detailed manual revision and annotation were conducted in order to ensure the accuracy and completeness of the data. In addition to the literature data, the computing simulation platform Pipeline Pilot 7.5 was utilized to calculate the physical and chemical properties and predict molecular attributes. Furthermore, the open database ChEMBL was linked to fetch known bioactivities, such as indications and targets. After information extraction and data expansion, five separate metadata files were generated, including molecular structure data file, molecular information, bibliographical references, predictable attributes and known bioactivities. With the canonical simplified molecular input line entry specification (SMILES) as the primary key, the metadata files were associated through common key nodes including molecular number and PDF number to construct an integrated chemical information database. A reasonable construction protocol for a chemical information database was created successfully. 
A total of 174 research articles and 25 reviews published in Marine Drugs from January 2015 to June 2016 were collected as the essential data source, and an elementary marine natural product database named PKU-MNPD was built in accordance with this protocol, which contained 3 262 molecules and 19 821 records. This data aggregation protocol is of great help for constructing chemical information databases from original documents with accuracy, comprehensiveness and efficiency. The structured chemical information database can facilitate access to medical intelligence and accelerate the transformation of scientific research achievements.
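The association step in this abstract (separate metadata files joined on a canonical SMILES primary key, with references attached through the PDF number) can be sketched roughly as follows. This is a minimal illustration only: the field names (`pdf_number`, `mol_number`) and the sample values are hypothetical and are not taken from PKU-MNPD.

```python
def integrate(structures, references, properties, bioactivities):
    """Join per-molecule tables on canonical SMILES; attach references via PDF number.

    All field names here are hypothetical placeholders for illustration.
    """
    database = {}
    for smiles, struct in structures.items():
        record = {"smiles": smiles, **struct}
        record["properties"] = properties.get(smiles, {})
        record["bioactivities"] = bioactivities.get(smiles, [])
        # The reference table is keyed by PDF number, not by SMILES.
        record["reference"] = references.get(struct.get("pdf_number"))
        database[smiles] = record
    return database


# Made-up example data (one molecule, one source document).
structures = {"CCO": {"mol_number": 1, "pdf_number": 12}}
references = {12: {"journal": "Marine Drugs", "year": 2015}}
properties = {"CCO": {"heavy_atoms": 3}}
bioactivities = {}

db = integrate(structures, references, properties, bioactivities)
```

Missing entries fall back to empty values, so a molecule without known bioactivities still yields a complete record.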
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs and Increasing Value
NASA Astrophysics Data System (ADS)
Myers, J.; Hedstrom, M.; Plale, B. A.; Kumar, P.; McDonald, R.; Kooper, R.; Marini, L.; Kouper, I.; Chandrasekar, K.
2013-12-01
What if everything that researchers know about their data, and everything their applications know, were directly available to curators? What if all the information that data consumers discover and infer about data were also available? What if curation and preservation activities occurred incrementally, during research projects instead of after they end, and could be leveraged to make it easier to manage research data from the moment of its creation? These are questions that the Sustainable Environments - Actionable Data (SEAD) project, funded as part of the National Science Foundation's DataNet partnership, was designed to answer. Data curation is challenging, but it is made more difficult by the historical separation of data production, data use, and formal curation activities across organizations, locations, and applications, and across time. Modern computing and networking technologies allow a much different approach in which data and metadata can easily flow between these activities throughout the data lifecycle, and in which heterogeneous and evolving data and metadata can be managed. Sustainability research, SEAD's initial focus area, is a clear example of an area where the nature of the research (cross-disciplinary, integrating heterogeneous data from independent sources, small teams, rapid evolution of sensing and analysis techniques) and the barriers and costs inherent in traditional methods have limited adoption of existing curation tools and techniques, to the detriment of overall scientific progress. To explore these ideas and create a sustainable curation capability for communities such as sustainability research, the SEAD team has developed and is now deploying an interacting set of open source data services that demonstrate this approach. 
These services provide end-to-end support for management of data during research projects; publication of that data into long-term archives; and integration of it into community networks of publications, research center activities, and synthesis efforts. They build on a flexible 'semantic content management' architecture and incorporate notions of 'active' and 'social' curation - continuous, incremental curation activities performed by the data producers (active) and the community (social) that are motivated by a range of direct benefits. Examples include the use of metadata (tags) to allow generation of custom geospatial maps, automated metadata extraction to generate rich data pages for known formats, and the use of information about data authorship to allow automatic updates of personal and project research profiles when data is published. In this presentation, we describe the core capabilities of SEAD's services and their application in sustainability research. We also outline the key features of the SEAD architecture - the use of global semantic identifiers, extensible data and metadata models, web services to manage context shifts, scalable cloud storage - and highlight how this approach is particularly well suited to extension by independent third parties. We conclude with thoughts on how this approach can be applied to challenging issues such as exposing 'dark' data and reducing duplicate creation of derived data products, and can provide a new level of analytics for community analysis and coordination.
Software Implements a Space-Mission File-Transfer Protocol
NASA Technical Reports Server (NTRS)
Rundstrom, Kathleen; Ho, Son Q.; Levesque, Michael; Sanders, Felicia; Burleigh, Scott; Veregge, John
2004-01-01
CFDP is a computer program that implements the CCSDS (Consultative Committee for Space Data Systems) File Delivery Protocol, which is an international standard for automatic, reliable transfers of files of data between locations on Earth and in outer space. CFDP administers concurrent file transfers in both directions, delivery of data out of transmission order, reliable and unreliable transmission modes, and automatic retransmission of lost or corrupted data by use of one or more of several lost-segment-detection modes. The program also implements several data-integrity measures, including file checksums and optional cyclic redundancy checks for each protocol data unit. The metadata accompanying each file can include messages to user application programs and commands for operating on remote file systems.
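The idea of lost-segment detection with out-of-order delivery can be illustrated with a small sketch. It assumes each received segment is described by a byte offset and a length, and it computes the byte ranges a receiver would still need to request. It does not reproduce the CCSDS wire format or any of the protocol's actual detection modes; the function name `find_gaps` is hypothetical.

```python
def find_gaps(received, file_size):
    """Return (start, end) byte ranges not covered by received (offset, length) segments.

    Segments may arrive in any order and may overlap; end offsets are exclusive.
    """
    gaps = []
    cursor = 0  # first byte not yet known to be covered
    for offset, length in sorted(received):
        if offset > cursor:
            gaps.append((cursor, offset))
        cursor = max(cursor, offset + length)
    if cursor < file_size:
        gaps.append((cursor, file_size))
    return gaps


# Segments arrived out of order; two ranges are still missing.
missing = find_gaps([(0, 100), (300, 100), (100, 50)], file_size=500)
```

In a reliable mode, the ranges in `missing` are what the receiver would ask the sender to retransmit.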
Analysis of Technique to Extract Data from the Web for Improved Performance
NASA Astrophysics Data System (ADS)
Gupta, Neena; Singh, Manish
2010-11-01
The World Wide Web is rapidly leading the world into a new electronic era in which anyone can publish anything in electronic form and extract almost any information. Extraction of information from semi-structured or unstructured documents, such as web pages, is a useful yet complex task. Data extraction, which is important for many applications, extracts the records from HTML files automatically. Ontologies can achieve a high degree of accuracy in data extraction. We analyze a data extraction method, OBDE (Ontology-Based Data Extraction), which automatically extracts the query result records from the web with the help of agents. OBDE first constructs an ontology for a domain according to information matching between the query interfaces and query result pages from different web sites within the same domain. Then, the constructed domain ontology is used during data extraction to identify the query result section in a query result page and to align and label the data values in the extracted records. The ontology-assisted data extraction method is fully automatic and overcomes many of the deficiencies of current automatic data extraction methods.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schissel, David; Greenwald, Martin
The MPO (Metadata, Provenance, Ontology) Project successfully addressed the goal of improving the usefulness and traceability of scientific data by building a system that could capture and display all steps in the process of creating, analyzing and disseminating that data. Throughout history, scientists have generated handwritten logbooks to keep track of data, their hypotheses, assumptions, experimental setup, and computational processes as well as reflections on observations and issues encountered. Over the last several decades, with the growth of personal computers, handheld devices, and the World Wide Web, the handwritten logbook has begun to be replaced by electronic logbooks. This transition has brought increased capability such as supporting multi-media, hypertext, and fast searching. However, content creation and metadata (a set of data that describes and gives information about other data) capturing has for the most part remained a manual activity just as it was with handwritten logbooks. This has led to a fragmentation of data, processing, and annotation that has only accelerated as scientific workflows continue to increase in complexity. From a scientific perspective, it is very important to be able to understand the lineage of any piece of data: who, what, when, how, and why. This is typically referred to as data provenance. The fragmentation discussed previously often means that data provenance is lost. As scientific workflows move to powerful computers and become more complex, tracking all of the steps involved in creating a piece of data becomes even more difficult. It was the goal of the MPO (Metadata, Provenance, Ontology) Project to create a system (the MPO System) that allows for automatic provenance and metadata capturing in such a way as to allow easy searching and browsing.
This goal needed to be accomplished in a general way so that it may be used across a broad range of scientific domains, yet allow the addition of vocabulary (Ontology) that is domain specific as is required for intelligent searching and browsing in the scientific context. Through the creation and deployment of the MPO system, the goals of the project were achieved. An enhanced metadata, provenance, and ontology storage system was created. This was combined with innovative methodologies for navigating and exploring these data using a web browser for both experimental and simulation-based scientific research. In addition, a system to allow scientists to instrument their existing workflows for automatic metadata and provenance capture is part of the MPO system. In that way, a scientist can continue to use their existing methodology yet easily document their work. Workflows and data provenance can be displayed either graphically or in an electronic notebook format and support advanced search features including via ontology. The MPO system was successfully used in both Climate and Magnetic Fusion Energy Research. The software for the MPO system is located at https://github.com/MPO-Group/MPO and is open source distributed under the Revised BSD License. A demonstration site of the MPO system is open to the public and is available at https://mpo.psfc.mit.edu/. A Docker container release of the command line client is available for public download using the command docker pull jcwright/mpo-cli at https://hub.docker.com/r/jcwright/mpo-cli.
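To make the notion of instrumenting an existing workflow concrete, here is a minimal sketch of automatic provenance capture using a Python decorator. It does not use the MPO client or its API. The names `record_step`, `PROVENANCE_LOG`, `load_shot` and `smooth` are hypothetical, and the in-memory list stands in for a real provenance store.

```python
import functools
import time

PROVENANCE_LOG = []  # hypothetical in-memory store, not the MPO storage system


def record_step(func):
    """Wrap a workflow step so each call is logged with its inputs and output."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.time()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "step": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "started": started,
        })
        return result
    return wrapper


@record_step
def load_shot(shot_number):
    # Placeholder for reading experimental data.
    return [shot_number, shot_number + 1]


@record_step
def smooth(values):
    # Placeholder analysis step: a simple mean.
    return sum(values) / len(values)


PROVENANCE_LOG.clear()
answer = smooth(load_shot(1))
steps = [entry["step"] for entry in PROVENANCE_LOG]
```

The scientist's functions are unchanged apart from the decorator, which is the sense in which an existing methodology can be kept while the lineage is recorded.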
Storing files in a parallel computing system based on user-specified parser function
Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Manzanares, Adam; Torres, Aaron
2014-10-21
Techniques are provided for storing files in a parallel computing system based on a user-specified parser function. A plurality of files generated by a distributed application in a parallel computing system are stored by obtaining a parser from the distributed application for processing the plurality of files prior to storage; and storing one or more of the plurality of files in one or more storage nodes of the parallel computing system based on the processing by the parser. The plurality of files comprise one or more of a plurality of complete files and a plurality of sub-files. The parser can optionally store only those files that satisfy one or more semantic requirements of the parser. The parser can also extract metadata from one or more of the files and the extracted metadata can be stored with one or more of the plurality of files and used for searching for files.
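A rough sketch of the idea follows: the application supplies a parser, the storage layer keeps only the files the parser accepts, and the extracted metadata is stored alongside each file for later search. The dictionary-based storage, the `CKPT` header convention and all function names are assumptions made for illustration, not details from the patent.

```python
def store_files(files, parser, storage):
    """Store each file the parser accepts, together with its extracted metadata.

    files: dict mapping file name to bytes.
    parser: callable returning a metadata dict, or None to reject the file.
    """
    for name, content in files.items():
        metadata = parser(name, content)
        if metadata is None:
            continue  # file does not satisfy the parser's semantic requirements
        storage[name] = {"content": content, "metadata": metadata}
    return storage


def example_parser(name, content):
    """Hypothetical parser: accept only files starting with a 'CKPT' header."""
    if not content.startswith(b"CKPT"):
        return None
    return {"size": len(content), "kind": "checkpoint"}


def search(storage, **criteria):
    """Find stored file names whose metadata matches every given key/value."""
    return sorted(
        name for name, entry in storage.items()
        if all(entry["metadata"].get(k) == v for k, v in criteria.items())
    )


storage = store_files(
    {"a.dat": b"CKPT0001", "b.tmp": b"scratch", "c.dat": b"CKPT02"},
    example_parser,
    {},
)
hits = search(storage, kind="checkpoint", size=8)
```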
System for definition of the central-chest vasculature
NASA Astrophysics Data System (ADS)
Taeprasartsit, Pinyo; Higgins, William E.
2009-02-01
Accurate definition of the central-chest vasculature from three-dimensional (3D) multi-detector CT (MDCT) images is important for pulmonary applications. For instance, the aorta and pulmonary artery help in automatic definition of the Mountain lymph-node stations for lung-cancer staging. This work presents a system for defining major vascular structures in the central chest. The system provides automatic methods for extracting the aorta and pulmonary artery and semi-automatic methods for extracting the other major central chest arteries/veins, such as the superior vena cava and azygos vein. Automatic aorta and pulmonary artery extraction are performed by model fitting and selection. The system also extracts certain vascular structure information to validate outputs. A semi-automatic method extracts vasculature by finding the medial axes between provided important sites. Results of the system are applied to lymph-node station definition and guidance of bronchoscopic biopsy.
Content-aware network storage system supporting metadata retrieval
NASA Astrophysics Data System (ADS)
Liu, Ke; Qin, Leihua; Zhou, Jingli; Nie, Xuejun
2008-12-01
Nowadays, content-based network storage has become a hot research topic in academia and industry. To solve the problem of hit-rate decline caused by migration and to support content-based queries, we develop a new content-aware storage system that supports metadata retrieval to improve query performance. First, we extend the SCSI command descriptor block to enable the system to understand self-defined query requests. Second, the extracted metadata is encoded in Extensible Markup Language (XML) to improve universality. Third, according to the demands of information lifecycle management (ILM), we store data at different storage levels and use a corresponding query strategy to retrieve them. Fourth, as the file content identifier plays an important role in locating data and calculating block correlation, we use it to fetch files and sort query results through a user-friendly interface. Finally, experiments indicate that the retrieval strategy and sort algorithm enhance retrieval efficiency and precision.
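The XML encoding step might look roughly like the following sketch, which uses Python's standard `xml.etree.ElementTree`. The element and attribute names (`file`, `attr`, `content_id`) are hypothetical; the abstract does not describe the actual schema.

```python
import xml.etree.ElementTree as ET


def encode_metadata(content_id, attributes):
    """Serialize extracted metadata to an XML string (hypothetical schema)."""
    root = ET.Element("file", {"content_id": content_id})
    for key, value in sorted(attributes.items()):
        child = ET.SubElement(root, "attr", {"name": key})
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def decode_metadata(xml_text):
    """Parse the XML back into (content_id, attributes); values come back as strings."""
    root = ET.fromstring(xml_text)
    attributes = {child.get("name"): child.text for child in root.findall("attr")}
    return root.get("content_id"), attributes


xml_text = encode_metadata("abc123", {"owner": "lab", "tier": 2})
content_id, attributes = decode_metadata(xml_text)
```

Because every value is serialized as text, a consumer that needs typed values (such as the integer storage tier) has to convert them after decoding.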
Designing Extensible Data Management for Ocean Observatories, Platforms, and Devices
NASA Astrophysics Data System (ADS)
Graybeal, J.; Gomes, K.; McCann, M.; Schlining, B.; Schramm, R.; Wilkin, D.
2002-12-01
The Monterey Bay Aquarium Research Institute (MBARI) has been collecting science data for 15 years from all kinds of oceanographic instruments and systems, and is building a next-generation observing system, the MBARI Ocean Observing System (MOOS). To meet the data management requirements of the MOOS, the Institute began developing a flexible, extensible data management solution, the Shore Side Data System (SSDS). This data management system must address a wide variety of oceanographic instruments and data sources, including instruments and platforms of the future. Our data management solution will address all elements of the data management challenge, from ingest (including suitable pre-definition of metadata) through to access and visualization. Key to its success will be ease of use, and automatic incorporation of new data streams and data sets. The data will be of many different forms, and come from many different types of instruments. Instruments will be designed for fixed locations (as with moorings), changing locations (drifters and AUVs), and cruise-based sampling. Data from airplanes, satellites, models, and external archives must also be considered. Providing an architecture which allows data from these varied sources to be automatically archived and processed, yet readily accessed, is only possible with the best practices in metadata definition, software design, and re-use of third-party components. The current status of SSDS development will be presented, including lessons learned from our science users and from previous data management designs.
Target recognition based on convolutional neural network
NASA Astrophysics Data System (ADS)
Wang, Liqiang; Wang, Xin; Xi, Fubiao; Dong, Jian
2017-11-01
One of the important parts of object target recognition is feature extraction, which can be classified into manual feature extraction and automatic feature extraction. The traditional neural network is one of the automatic feature extraction methods, but its global connectivity makes over-fitting highly likely. The deep learning algorithm used in this paper is a hierarchical automatic feature extraction method, trained as a layer-by-layer convolutional neural network (CNN), which can extract features from lower layers to higher layers. The resulting features are more discriminative, which benefits object target recognition.
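The contrast between global connection and the local, layered feature extraction of a CNN can be illustrated with a toy forward pass: a small kernel slides over the image (local connectivity with shared weights), followed by a ReLU and max pooling. This is a generic pure-Python sketch with a hand-picked kernel, not the network or training procedure from the paper.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation, as used in CNN layers: each output sees a local patch."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)
            ))
        out.append(row)
    return out


def relu(feature_map):
    """Element-wise rectified linear unit."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[-1, 0], [0, 1]]  # hand-picked diagonal-difference filter
features = max_pool(relu(conv2d(image, kernel)))
```

The kernel here has four weights regardless of image size, whereas a fully connected layer would need one weight per input pixel per output unit.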
Facilitating Stewardship of scientific data through standards based workflows
NASA Astrophysics Data System (ADS)
Bastrakova, I.; Kemp, C.; Potter, A. K.
2013-12-01
There are three main suites of standards that can be used to define the fundamental scientific methodology of data, methods and results. These are, firstly, metadata standards to enable discovery of the data (ISO 19115); secondly, the Sensor Web Enablement (SWE) suite of standards, which includes the O&M and SensorML standards; and thirdly, ontologies that provide vocabularies to define the scientific concepts and the relationships between these concepts. All three types of standards have to be utilised by the practicing scientist so that those who ultimately have to steward the data can ensure that it is preserved, curated, reused and repurposed. Additional benefits of this approach include transparency of scientific processes from the data acquisition to creation of scientific concepts and models, and provision of context to inform data use. Collecting and recording metadata is the first step in scientific data flow. The primary role of metadata is to provide details of geographic extent, availability and high-level description of data suitable for its initial discovery through common search engines. The SWE suite provides standardised patterns to describe observations and measurements taken for these data, capture detailed information about observation or analytical methods and the instruments used, and define quality determinations. This information standardises browsing capability over discrete data types. The standardised patterns of the SWE standards simplify aggregation of observation and measurement data, enabling scientists to transfer disintegrated data to scientific concepts. The first two steps provide a necessary basis for reasoning about concepts of 'pure' science, building relationships between concepts of different domains (linked data), and identifying domain classifications and vocabularies.
Geoscience Australia is re-examining its marine data flows, including metadata requirements and business processes, to achieve a clearer link between scientific data acquisition and analysis requirements and effective interoperable data management and delivery. This includes participating in national and international dialogue on development of standards, embedding data management activities in business processes, and developing scientific staff as effective data stewards. A similar approach is applied to the geophysical data. By ensuring the geophysical datasets at GA strictly follow metadata and industry standards, we are able to implement a provenance-based workflow where the data is easily discoverable, geophysical processing can be applied to it and results can be stored. The provenance-based workflow enables metadata records for the results to be produced automatically from the input dataset metadata.
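As an illustration of producing a result's metadata record automatically from the input dataset metadata, consider the sketch below. It copies the input record and appends a lineage entry describing the processing step. The dictionary layout (`identifier`, `title`, `lineage`) is a hypothetical simplification and does not follow the ISO 19115 schema or GA's actual implementation.

```python
import copy


def derive_result_metadata(input_metadata, process_name, parameters):
    """Build a metadata record for a processing result from its input's metadata."""
    result = copy.deepcopy(input_metadata)  # leave the input record untouched
    result["title"] = f"{input_metadata['title']} ({process_name})"
    lineage = result.setdefault("lineage", [])
    lineage.append({
        "process": process_name,
        "parameters": parameters,
        "source": input_metadata["identifier"],
    })
    return result


# Made-up input record for a geophysical survey.
survey = {
    "identifier": "survey-001",
    "title": "Example survey",
    "extent": [110.0, -45.0, 155.0, -10.0],
}
gridded = derive_result_metadata(survey, "gridding", {"cell_size_m": 80})
```

Fields such as the geographic extent carry over unchanged, so only the processing details need to be supplied at each step.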
NASA Astrophysics Data System (ADS)
Sheldon, W.
2013-12-01
Managing data for a large, multidisciplinary research program such as a Long Term Ecological Research (LTER) site is a significant challenge, but also presents unique opportunities for data stewardship. LTER research is conducted within multiple organizational frameworks (i.e. a specific LTER site as well as the broader LTER network), and addresses both specific goals defined in an NSF proposal as well as broader goals of the network; therefore, every LTER data can be linked to rich contextual information to guide interpretation and comparison. The challenge is how to link the data to this wealth of contextual metadata. At the Georgia Coastal Ecosystems LTER we developed an integrated information management system (GCE-IMS) to manage, archive and distribute data, metadata and other research products as well as manage project logistics, administration and governance (figure 1). This system allows us to store all project information in one place, and provide dynamic links through web applications and services to ensure content is always up to date on the web as well as in data set metadata. The database model supports tracking changes over time in personnel roles, projects and governance decisions, allowing these databases to serve as canonical sources of project history. Storing project information in a central database has also allowed us to standardize both the formatting and content of critical project information, including personnel names, roles, keywords, place names, attribute names, units, and instrumentation, providing consistency and improving data and metadata comparability. Lookup services for these standard terms also simplify data entry in web and database interfaces. We have also coupled the GCE-IMS to our MATLAB- and Python-based data processing tools (i.e. through database connections) to automate metadata generation and packaging of tabular and GIS data products for distribution. 
Data processing history is automatically tracked throughout the data lifecycle, from initial import through quality control, revision and integration by our data processing system (GCE Data Toolbox for MATLAB), and included in metadata for versioned data products. This high level of automation and system integration has proven very effective in managing the chaos and scalability of our information management program.
Integration of external metadata into the Earth System Grid Federation (ESGF)
NASA Astrophysics Data System (ADS)
Berger, Katharina; Levavasseur, Guillaume; Stockhause, Martina; Lautenschlager, Michael
2015-04-01
International projects with high volume data usually disseminate their data in a federated data infrastructure, e.g. the Earth System Grid Federation (ESGF). The ESGF aims to make the geographically distributed data seamlessly discoverable and accessible. Additional data-related information is currently collected and stored in separate repositories by each data provider. This scattered but useful information is not, or is only partly, available to ESGF users. Examples of such additional information systems are ES-DOC/metafor for model and simulation information, IPSL's versioning information, CHARMe for user annotations, DKRZ's quality information and data citation information. The ESGF Quality Control working team (esgf-qcwt) aims to integrate these valuable pieces of additional information into the ESGF in order to make them available to users and data archive managers by (i) integrating external information into the ESGF portal, (ii) integrating links to external information objects into the ESGF metadata index, e.g. by the use of PIDs (Persistent IDentifiers), and (iii) automating the collection of external information during the ESGF data publication process. For the sixth phase of CMIP (Coupled Model Intercomparison Project), the ESGF metadata index is to be enriched by additional information on data citation, file version, etc. This information will support users directly and can be automatically exploited by higher level services (human and machine readability).
Simple, Script-Based Science Processing Archive
NASA Technical Reports Server (NTRS)
Lynnes, Christopher; Hegde, Mahabaleshwara; Barth, C. Wrandle
2007-01-01
The Simple, Scalable, Script-based Science Processing (S4P) Archive (S4PA) is a disk-based archival system for remote sensing data. It is based on the data-driven framework of S4P and is used for data transfer, data preprocessing, metadata generation, data archive, and data distribution. New data are automatically detected by the system. S4P provides services such as data access control, data subscription, metadata publication, data replication, and data recovery. It comprises scripts that control the data flow. The system detects the availability of data on an FTP (file transfer protocol) server, initiates data transfer, preprocesses data if necessary, and archives it on readily available disk drives with FTP and HTTP (Hypertext Transfer Protocol) access, allowing instantaneous data access. There are options for plug-ins for data preprocessing before storage. Publication of metadata to external applications such as the Earth Observing System Clearinghouse (ECHO) is also supported. S4PA includes a graphical user interface for monitoring the system operation and a tool for deploying the system. To ensure reliability, S4P continuously checks stored data for integrity. Further reliability is provided by tape backups of disks made once a disk partition is full and closed. The system is designed for low maintenance, requiring minimal operator oversight.
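A periodic integrity check of stored data can be sketched as a comparison between freshly computed checksums and a manifest recorded at archive time. The abstract does not say which checksum the system uses; SHA-256 from Python's standard `hashlib` is an assumption here, and the in-memory `archive` dictionary stands in for files on disk.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest of the data (SHA-256 is an assumed choice, not S4PA's documented one)."""
    return hashlib.sha256(data).hexdigest()


def find_corrupted(archive, manifest):
    """Return names whose stored bytes are missing or no longer match the manifest."""
    return sorted(
        name for name, expected in manifest.items()
        if name not in archive or checksum(archive[name]) != expected
    )


# Record checksums at archive time, then simulate damage and loss.
archive = {"granule_a.hdf": b"radiance data", "granule_b.hdf": b"more data"}
manifest = {name: checksum(data) for name, data in archive.items()}
archive["granule_a.hdf"] = b"radiance dat\x00"  # simulated corruption
del archive["granule_b.hdf"]                     # simulated loss
damaged = find_corrupted(archive, manifest)
```

Anything reported in `damaged` would be a candidate for recovery from a replica or a tape backup.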
Java-Library for the Access, Storage and Editing of Calibration Metadata of Optical Sensors
NASA Astrophysics Data System (ADS)
Firlej, M.; Kresse, W.
2016-06-01
The standardization of the calibration of optical sensors in photogrammetry and remote sensing has been discussed for more than a decade. Projects of the German DGPF and the European EuroSDR led to the abstract International Technical Specification ISO/TS 19159-1:2014 "Calibration and validation of remote sensing imagery sensors and data - Part 1: Optical sensors". This article presents the first software interface for a read- and write-access to all metadata elements standardized in the ISO/TS 19159-1. This interface is based on an xml-schema that was automatically derived by ShapeChange from the UML-model of the Specification. The software interface serves two cases. First, the more than 300 standardized metadata elements are stored individually according to the xml-schema. Secondly, the camera manufacturers are using many administrative data that are not a part of the ISO/TS 19159-1. The new software interface provides a mechanism for input, storage, editing, and output of both types of data. Finally, an output channel towards a usual calibration protocol is provided. The interface is written in Java. The article also addresses observations made when analysing the ISO/TS 19159-1 and compiles a list of proposals for maturing the document, i.e. for an updated version of the Specification.
Metadata Exporter for Scientific Photography Management
NASA Astrophysics Data System (ADS)
Staudigel, D.; English, B.; Delaney, R.; Staudigel, H.; Koppers, A.; Hart, S.
2005-12-01
Photographs have become an increasingly important medium, especially with the advent of digital cameras. It has become inexpensive to take photographs and quickly post them on a website. However informative photos may be, they still need to be displayed in a convenient way, and be cataloged in such a manner that makes them easily locatable. Managing the great number of photographs that digital cameras allow and creating a format for efficient dissemination of the information related to the photos is a tedious task. Products such as Apple's iPhoto have greatly eased the task of managing photographs. However, they often have limitations. Un-customizable metadata fields and poor metadata extraction tools limit their scientific usefulness. A solution to this persistent problem is a customizable metadata exporter. On the ALIA expedition, we successfully managed the thousands of digital photos we took. We did this with iPhoto and a version of the exporter that is now available to the public under the name "CustomHTMLExport" (http://www.versiontracker.com/dyn/moreinfo/macosx/27777), which is currently undergoing formal beta testing. This software allows the use of customized metadata fields (including description, time, date, GPS data, etc.), which are exported along with the photo. It can also produce webpages with this data straight from iPhoto, in a much more flexible way than is already allowed. With this tool it becomes very easy to manage and distribute scientific photos.
NASA Astrophysics Data System (ADS)
Vines, Aleksander; Hansen, Morten W.; Korosov, Anton
2017-04-01
Existing international and Norwegian infrastructure projects, e.g., NorDataNet, NMDC and NORMAP, provide open data access through the OPeNDAP protocol following the conventions for CF (Climate and Forecast) metadata, designed to promote the processing and sharing of files created with the NetCDF application programming interface (API). This approach is now also being implemented in the Norwegian Sentinel Data Hub (satellittdata.no) to provide satellite EO data to the user community. In addition to providing simplified and unified data access, these projects also seek to use and establish common standards for use and discovery metadata. This then allows development of standardized tools for data search and (subset) streaming over the internet to perform actual scientific analysis. A combination of software tools, which we call a Scientific Platform as a Service (SPaaS), will take advantage of these opportunities to harmonize and streamline the search, retrieval and analysis of integrated satellite and auxiliary observations of the oceans in a seamless system. The SPaaS is a cloud solution for integration of analysis tools with scientific datasets via an API. The core part of the SPaaS is a distributed metadata catalog to store granular metadata describing the structure, location and content of available satellite, model, and in situ datasets. The analysis tools include software for visualization (also online), interactive in-depth analysis, and server-based processing chains. The API conveys search requests between system nodes (i.e., interactive and server tools) and provides easy access to the metadata catalog, data repositories, and the tools. The SPaaS components are integrated in virtual machines, whose provisioning and deployment are automated using existing state-of-the-art open-source tools (e.g., Vagrant, Ansible, Docker).
The open-source code for scientific tools and virtual machine configurations is under version control at https://github.com/nansencenter/, and is coupled to an online continuous integration system (e.g., Travis CI).
Automatic Keyword Extraction from Individual Documents
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rose, Stuart J.; Engel, David W.; Cramer, Nicholas O.
2010-05-03
This paper introduces a novel and domain-independent method for automatically extracting keywords, as sequences of one or more words, from individual documents. We describe the method’s configuration parameters and algorithm, and present an evaluation on a benchmark corpus of technical abstracts. We also present a method for generating lists of stop words for specific corpora and domains, and evaluate its ability to improve keyword extraction on the benchmark corpus. Finally, we apply our method of automatic keyword extraction to a corpus of news articles and define metrics for characterizing the exclusivity, essentiality, and generality of extracted keywords within a corpus.
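The abstract does not spell out the algorithm, so the sketch below is an assumption: candidate keywords are taken to be runs of words delimited by stop words and punctuation, each word is scored by its degree-to-frequency ratio, and a candidate's score is the sum of its word scores. The tiny stop-word list and all function names are illustrative only.

```python
import re
from collections import defaultdict

# Deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on",
              "is", "are", "to", "from", "we", "as", "or"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text into word sequences delimited by punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop_words:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def score_keywords(text, stop_words=STOP_WORDS):
    """Score each candidate as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text, stop_words)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, including itself
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = score_keywords("Automatic keyword extraction from individual documents")
```

Because the stop-word list decides where candidates are cut, a corpus-specific list of the kind the abstract describes would directly change which keywords come out.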
Extraction of Pharmacokinetic Evidence of Drug–Drug Interactions from the Literature
Kolchinsky, Artemy; Lourenço, Anália; Wu, Heng-Yi; Li, Lang; Rocha, Luis M.
2015-01-01
Drug-drug interaction (DDI) is a major cause of morbidity and mortality and a subject of intense scientific interest. Biomedical literature mining can aid DDI research by extracting evidence for large numbers of potential interactions from published literature and clinical databases. Though DDI is investigated in domains ranging in scale from intracellular biochemistry to human populations, literature mining has not been used to extract specific types of experimental evidence, which are reported differently for distinct experimental goals. We focus on pharmacokinetic evidence for DDI, essential for identifying causal mechanisms of putative interactions and as input for further pharmacological and pharmacoepidemiology investigations. We used manually curated corpora of PubMed abstracts and annotated sentences to evaluate the efficacy of literature mining on two tasks: first, identifying PubMed abstracts containing pharmacokinetic evidence of DDIs; second, extracting sentences containing such evidence from abstracts. We implemented a text mining pipeline and evaluated it using several linear classifiers and a variety of feature transforms. The most important textual features in the abstract and sentence classification tasks were analyzed. We also investigated the performance benefits of using features derived from PubMed metadata fields, various publicly available named entity recognizers, and pharmacokinetic dictionaries. Several classifiers performed very well in distinguishing relevant and irrelevant abstracts (reaching F1≈0.93, MCC≈0.74, iAUC≈0.99) and sentences (F1≈0.76, MCC≈0.65, iAUC≈0.83). We found that word bigram features were important for achieving optimal classifier performance and that features derived from Medical Subject Headings (MeSH) terms significantly improved abstract classification. 
We also found that some drug-related named entity recognition tools and dictionaries led to slight but significant improvements, especially in classification of evidence sentences. Based on our thorough analysis of classifiers and feature transforms and the high classification performance achieved, we demonstrate that literature mining can aid DDI discovery by supporting automatic extraction of specific types of experimental evidence. PMID:25961290
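The two threshold-based scores quoted in this abstract (F1 and MCC) can be illustrated with a minimal Python sketch. It assumes a plain binary confusion matrix and a whitespace tokenizer for word bigrams; the function names are hypothetical and this is not the authors' pipeline.

```python
import math


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) computed from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


def word_bigrams(text):
    """Lower-case, split on whitespace (an assumption) and pair adjacent tokens."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))
```

For example, `f1_and_mcc(8, 2, 2, 8)` gives an F1 of about 0.8 and an MCC of 0.6, which shows how the two scores can differ on the same predictions.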
Automatic Extraction of Urban Built-Up Area Based on Object-Oriented Method and Remote Sensing Data
NASA Astrophysics Data System (ADS)
Li, L.; Zhou, H.; Wen, Q.; Chen, T.; Guan, F.; Ren, B.; Yu, H.; Wang, Z.
2018-04-01
Built-up area reflects how urban construction land is used at different stages of a city's development, and extracting it accurately is key to studying urban expansion. This paper studies automatic extraction of urban built-up area based on an object-oriented method and remote sensing data, and achieves automatic extraction of a city's main built-up area, which greatly reduces manpower cost. First, construction land is extracted with the object-oriented method. The main technical steps are: (1) multi-resolution segmentation; (2) feature construction and selection; (3) information extraction of construction land based on a rule set. The characteristic parameters used in the rule set mainly include the mean of the red band (Mean R), the Normalized Difference Vegetation Index (NDVI), the ratio of residential index (RRI), and the mean of the blue band (Mean B). Combining these parameters allows the construction land information to be extracted. Then, based on the degree of adaptability, distance and area of the object domain, the urban built-up area can be quickly and accurately delineated from the construction land information without depending on other data or expert knowledge, achieving automatic extraction of the urban built-up area. With Beijing as the experimental area, the results show that the city's built-up area was extracted automatically with a boundary accuracy of 2359.65 m, which meets the requirements. The automatic extraction of urban built-up area is highly practical and can be applied to monitoring changes in the main built-up area of a city.
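A rule set of the kind described above can be sketched in Python as follows. NDVI uses its standard definition, (NIR - Red) / (NIR + Red). The thresholds, the dictionary keys and the restriction to two parameters (NDVI and Mean R) are assumptions for illustration only; the abstract does not give the actual rules, and RRI is omitted because its definition is not stated.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one image object."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def is_construction_land(obj, ndvi_max=0.2, mean_red_min=90.0):
    """Hypothetical rule: little vegetation and a bright red band.

    `obj` holds per-object band means under the assumed keys
    "mean_nir" and "mean_r"; both thresholds are illustrative.
    """
    low_vegetation = ndvi(obj["mean_nir"], obj["mean_r"]) < ndvi_max
    bright_red = obj["mean_r"] > mean_red_min
    return low_vegetation and bright_red
```

In an object-oriented workflow the rule would be applied to each segment produced by the multi-resolution segmentation step rather than to individual pixels.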
Old document image segmentation using the autocorrelation function and multiresolution analysis
NASA Astrophysics Data System (ADS)
Mehri, Maroua; Gomez-Krämer, Petra; Héroux, Pierre; Mullot, Rémy
2013-01-01
Recent progress in the digitization of heterogeneous collections of ancient documents has raised new challenges in information retrieval in digital libraries and document layout analysis. Therefore, in order to control the quality of historical document image digitization and to meet the need for a characterization of their content using intermediate level metadata (between image and document structure), we propose a fast automatic layout segmentation of old document images based on five descriptors. Those descriptors, based on the autocorrelation function, are obtained by multiresolution analysis and used afterwards in a specific clustering method. The method proposed in this article has the advantage that it is performed without any hypothesis on the document structure, either about the document model (physical structure), or the typographical parameters (logical structure). It is also parameter-free since it automatically adapts to the image content. In this paper, firstly, we detail our proposal to characterize the content of old documents by extracting the autocorrelation features in the different areas of a page and at several resolutions. Then, we show that it is possible to automatically find the homogeneous regions defined by similar indices of autocorrelation without knowledge about the number of clusters using adapted hierarchical ascendant classification and consensus clustering approaches. To assess our method, we apply our algorithm on 316 old document images, which encompass six centuries (1200-1900) of French history, in order to demonstrate the performance of our proposal in terms of segmentation and characterization of heterogeneous corpus content. Moreover, we define a new evaluation metric, the homogeneity measure, which aims at evaluating the segmentation and characterization accuracy of our methodology. We obtain a mean homogeneity accuracy of 85%. 
Those results help to represent a document by a hierarchy of layout structure and content, and to define one or more signatures for each page, on the basis of a hierarchical representation of homogeneous blocks and their topology.
1989-08-01
Automatic Line Network Extraction from Aerial Imagery of Urban Areas through Knowledge Based Image Analysis. Final Technical Report, December...pattern recognition, blackboard oriented symbolic processing, knowledge based image analysis, image understanding, aerial imagery, urban area
ANNUAL REPORT-AUTOMATIC INDEXING AND ABSTRACTING.
ERIC Educational Resources Information Center
Lockheed Missiles and Space Co., Palo Alto, CA. Electronic Sciences Lab.
THE INVESTIGATION IS CONCERNED WITH THE DEVELOPMENT OF AUTOMATIC INDEXING, ABSTRACTING, AND EXTRACTING SYSTEMS. BASIC INVESTIGATIONS IN ENGLISH MORPHOLOGY, PHONETICS, AND SYNTAX ARE PURSUED AS NECESSARY MEANS TO THIS END. IN THE FIRST SECTION THE THEORY AND DESIGN OF THE "SENTENCE DICTIONARY" EXPERIMENT IN AUTOMATIC EXTRACTION IS OUTLINED. SOME OF…
Validation of crowdsourced automatic rain gauge measurements in Amsterdam
NASA Astrophysics Data System (ADS)
de Vos, Lotte; Leijnse, Hidde; Overeem, Aart; Uijlenhoet, Remko
2016-04-01
The increasing number of privately owned weather stations and the facilitating role of the internet in making their data publicly available have led to several online platforms that collect and visualize crowdsourced weather data. This has resulted in ever-growing, freely available datasets of weather measurements generated by amateur weather enthusiasts. Because of the lack of quality control and the frequent absence of metadata, these measurements are often considered unreliable. Given the often large variability of weather variables in space and time, and the generally low number of official weather stations, this growing quantity of crowdsourced data may become an important additional source of information. Amateur weather observations have become more frequent over the past decade as weather stations have become more user-friendly and affordable. The variables measured by these weather stations are temperature, pressure and dew point, and in some cases wind and rainfall. Meteorological data from crowdsourced automatic weather stations in cities have primarily been used to examine the urban heat island effect. Thus far, these studies have focused on comparing the crowdsourced station temperature measurements with a nearby WMO-standard weather station, which is often located in a rural area or on the outskirts of a city and is generally not representative of the city center. Here, the rainfall measurements of the stations are examined instead of temperature. This research focuses on the combined ability of a large number of privately owned weather stations in an urban setting to correctly monitor rainfall. A set of 64 automatic weather stations distributed over Amsterdam (The Netherlands) that have at least 3 months of precipitation measurements during one year is evaluated. Precipitation measurements from the stations are compared to a merged radar-gauge precipitation product. 
Disregarding sudden jumps in station measured precipitation, the accumulative rainfall over time in most stations showed an underestimation of rainfall compared to the accumulative values found in the corresponding radar pixel of the reference. Special consideration is given to the identification of faulty measurements without the need to obtain additional meta-data, such as setup and surroundings. This validation will show the potential of crowdsourced automatic weather stations for future urban rainfall monitoring.
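The comparison of cumulative station rainfall against a reference can be sketched in a few lines of Python. This is a minimal illustration, assuming per-interval rainfall amounts in millimetres for a station and for the matching reference pixel. The jump threshold and all function names are hypothetical and are not taken from the study.

```python
from itertools import accumulate


def filter_jumps(station_mm, reference_mm, max_step_mm=30.0):
    """Drop intervals where the station reports an implausibly large amount.

    The same intervals are removed from the reference so that both
    series stay aligned. The threshold is an illustrative assumption.
    """
    kept = [(s, r) for s, r in zip(station_mm, reference_mm) if s <= max_step_mm]
    return [s for s, _ in kept], [r for _, r in kept]


def cumulative(series_mm):
    """Running total of rainfall over time."""
    return list(accumulate(series_mm))


def relative_bias(station_mm, reference_mm):
    """Relative difference of the totals; negative means underestimation."""
    total_ref = sum(reference_mm)
    if total_ref == 0:
        raise ValueError("reference total is zero; bias is undefined")
    return (sum(station_mm) - total_ref) / total_ref
```

A station that records 4 mm where the reference records 8 mm has a relative bias of -0.5, which is the kind of underestimation the abstract reports for most stations.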
MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive.
Bernstein, Matthew N; Doan, AnHai; Dewey, Colin N
2017-09-15
The NCBI's Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms that should be used to describe the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA. We present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline. The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline. Contact: cdewey@biostat.wisc.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
Automatic River Network Extraction from LIDAR Data
NASA Astrophysics Data System (ADS)
Maderal, E. N.; Valcarcel, N.; Delgado, J.; Sevilla, C.; Ojeda, J. C.
2016-06-01
The National Geographic Institute of Spain (IGN-ES) has launched a new production system for automatic river network extraction for the Geospatial Reference Information (GRI) within the hydrography theme. The goal is to obtain an accurate and up-to-date river network, extracted as automatically as possible. For this, IGN-ES has full LiDAR coverage of the whole Spanish territory with a density of 0.5 points per square meter. To implement this work, the technical feasibility was validated and a methodology was developed to automate each production phase: generation of hydrological terrain models with a 2 meter grid size, and river network extraction combining hydrographic criteria (topographic network) and hydrological criteria (flow accumulation river network). Finally, production was launched. The key points of this work have been managing a big data environment (more than 160,000 LiDAR data files); the infrastructure to store (up to 40 Tb between results and intermediate files) and process the data, using local virtualization and Amazon Web Services (AWS), which made it possible to complete this automatic production within 6 months; the stability of the software (TerraScan-TerraSolid, GlobalMapper-Blue Marble, FME-Safe, ArcGIS-Esri); and, finally, the management of human resources. The result of this production has been an accurate, automatically extracted river network for the whole country, with a significant improvement in the altimetric component of the 3D linear vector. This article presents the technical feasibility, the production methodology, the automatic river network extraction production and its advantages over traditional vector extraction systems.
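The idea of a flow accumulation river network can be illustrated with a small Python sketch. It assumes the common D8 scheme, in which each grid cell drains to its steepest downslope neighbour, and a hypothetical accumulation threshold for marking river cells. The abstract does not say which algorithm or threshold IGN-ES used, so this only shows the general technique.

```python
import math


def d8_flow_accumulation(dem):
    """Count, for every cell, how many cells drain through it (itself included).

    `dem` is a list of rows of elevations. Cells are visited from highest
    to lowest, so every upstream cell is processed before its receiver.
    """
    rows, cols = len(dem), len(dem[0])
    acc = [[1] * cols for _ in range(rows)]
    cells = sorted(
        ((dem[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )
    for z, r, c in cells:
        best_slope, target = 0.0, None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Drop per unit distance; diagonals are sqrt(2) away.
                    slope = (z - dem[nr][nc]) / math.hypot(dr, dc)
                    if slope > best_slope:
                        best_slope, target = slope, (nr, nc)
        if target is not None:
            acc[target[0]][target[1]] += acc[r][c]
    return acc


def river_cells(acc, threshold):
    """Cells whose accumulation reaches the (assumed) threshold."""
    return [
        (r, c)
        for r, row in enumerate(acc)
        for c, value in enumerate(row)
        if value >= threshold
    ]
```

Flat areas and pits are left undrained in this sketch; production workflows normally fill or breach them in the terrain model before routing flow.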
Efforts to integrate CMIP metadata and standards into NOAA-GFDL's climate model workflow
NASA Astrophysics Data System (ADS)
Blanton, C.; Lee, M.; Mason, E. E.; Radhakrishnan, A.
2017-12-01
Modeling centers participating in CMIP6 run model simulations, publish requested model output (conforming to community data standards), and document models and simulations using ES-DOC. GFDL developed workflow software implementing some best practices to meet these metadata and documentation requirements. The CMIP6 Data Request defines the variables that should be archived for each experiment and specifies their spatial and temporal structure. We used the Data Request's dreqPy python library to write GFDL model configuration files as an alternative to hand-crafted tables. There was also a largely successful effort to standardize variable names within the model to reduce the additional overhead of translating "GFDL to CMOR" variables at a later stage in the pipeline. The ES-DOC ecosystem provides tools and standards to create, publish, and view various types of community-defined CIM documents, most notably model and simulation documents. Although ES-DOC will automatically create simulation documents during publishing by harvesting NetCDF global attributes, the information must be collected, stored, and placed in the NetCDF files by the workflow. We propose to develop a GUI to collect the simulation document precursors. In addition, a new MIP for CMIP6-CPMIP, a comparison of computational performance of climate models-is documented using machine and performance CIM documents. We used ES-DOC's pyesdoc python library to automatically create these machine and performance documents. We hope that these and similar efforts will become permanent features of the GFDL workflow to facilitate future participation in CMIP-like activities.
Service architecture challenges in building the KNMI Data Centre
NASA Astrophysics Data System (ADS)
Som de Cerff, Wim; van de Vegte, John; Plieger, Maarten; de Vreede, Ernst; Sluiter, Raymond; Willem Noteboom, Jan; van der Neut, Ian; Verhoef, Hans; van Versendaal, Robert; van Binnendijk, Martin; Kalle, Henk; Knopper, Arthur; Calis, Gijs; Ha, Siu Siu; van Moosel, Wim; Klein Ikkink, Henk-Jan; Tosun, Tuncay
2013-04-01
One of the objectives of KNMI is to act as a national data centre for weather, climate and seismological data. KNMI has many years of experience in data curation; however, important scientific data are not easily accessible. New technologies are also available to improve the current infrastructure. Therefore a data curation program was initiated with two main goals: to set up a Satellite Data Platform (SDP) and a KNMI Data Centre (KDC). Besides curation, KDC will provide data access and a storage and retrieval portal for KNMI data. In 2010 the first requirements were gathered, in 2011 the main architecture was sketched, and in 2012 KDC was implemented; it is available at http://data.knmi.nl. KDC is built together with the data providers, with as key challenge: 'adding a dataset should be as simple as creating an HTML page'. This is enabled by a three-step process, in which the data provider is responsible for two steps: 1. Provide dataset metadata: an easy-to-use web interface for providing metadata, with automated validation. Metadata consists of an ISO 19115 profile (matching INSPIRE and WMO requirements) and additional technical metadata regarding the data structure and access rights to the data. The interface hides certain metadata fields, which are filled in by KDC automatically. 2. Provide data: after metadata has been entered, an upload location for the dataset is provided. Scripts for pushing large datasets are also available. 3. Process and publish: once files are uploaded, they are processed for metadata (e.g., geolocation, time, version) and made available in KDC. The data is archived and made available using the in-house developed Virtual File System, which provides a persistent virtual path to the data. For the end-user of the data, KDC provides a web interface with search filters on keywords, geolocation and time. Data can be downloaded using HTTP or FTP, and downloads can be scripted. Users can register to gain access to restricted datasets. 
The architecture combines Open Source software components (e.g. Geonetwork, Magnolia, MongoDB, MySQL) with in-house built software (ADAGUC, NADC) and newly developed software. Challenges faced and solved are: How to deal with the different file formats used at KNMI (e.g. NetCDF, GRIB, BUFR, ASCII)? How to deal with the different metadata profiles while hiding their complexity from the user? How to incorporate the existing archives? KDC is a node in several networks (WMO WIS, INSPIRE, Open Data): how to do this? In the presentation/poster we will describe what has been done for each of these challenges and how it is implemented in KDC.
Integrated workflows for spiking neuronal network simulations
Antolík, Ján; Davison, Andrew P.
2013-01-01
The increasing availability of computational resources is enabling more detailed, realistic modeling in computational neuroscience, resulting in a shift toward more heterogeneous models of neuronal circuits, and employment of complex experimental protocols. This poses a challenge for existing tool chains, as the set of tools involved in a typical modeler's workflow is expanding concomitantly, with growing complexity in the metadata flowing between them. For many parts of the workflow, a range of tools is available; however, numerous areas lack dedicated tools, while integration of existing tools is limited. This forces modelers to either handle the workflow manually, leading to errors, or to write substantial amounts of code to automate parts of the workflow, in both cases reducing their productivity. To address these issues, we have developed Mozaik: a workflow system for spiking neuronal network simulations written in Python. Mozaik integrates model, experiment and stimulation specification, simulation execution, data storage, data analysis and visualization into a single automated workflow, ensuring that all relevant metadata are available to all workflow components. It is based on several existing tools, including PyNN, Neo, and Matplotlib. It offers a declarative way to specify models and recording configurations using hierarchically organized configuration files. Mozaik automatically records all data together with all relevant metadata about the experimental context, allowing automation of the analysis and visualization stages. Mozaik has a modular architecture, and the existing modules are designed to be extensible with minimal programming effort. Mozaik increases the productivity of running virtual experiments on highly structured neuronal networks by automating the entire experimental cycle, while increasing the reliability of modeling studies by relieving the user from manual handling of the flow of metadata between the individual workflow stages. 
PMID:24368902
NASA Astrophysics Data System (ADS)
Pilone, D.; Gilman, J.; Baynes, K.; Shum, D.
2015-12-01
This talk introduces a new NASA Earth Observing System Data and Information System (EOSDIS) capability to automatically generate and maintain derived, Virtual Product information allowing DAACs and Data Providers to create tailored and more discoverable variations of their products. After this talk the audience will be aware of the new EOSDIS Virtual Product capability, applications of it, and how to take advantage of it. Much of the data made available in the EOSDIS are organized for generation and archival rather than for discovery and use. The EOSDIS Common Metadata Repository (CMR) is launching a new capability providing automated generation and maintenance of user-oriented Virtual Product information. DAACs can easily surface variations on established data products tailored to specific use cases and users, leveraging DAAC exposed services such as custom ordering or access services like OPeNDAP for on-demand product generation and distribution. Virtual Data Products enjoy support for spatial and temporal information, keyword discovery, association with imagery, and are fully discoverable by tools such as NASA Earthdata Search, Worldview, and Reverb. Virtual Product generation has applicability across many use cases:
- Describing derived products such as Surface Kinetic Temperature information (AST_08) from source products (ASTER L1A)
- Providing streamlined access to data products (e.g. AIRS) containing many (>800) data variables covering an enormous variety of physical measurements
- Attaching additional EOSDIS offerings such as Visual Metadata, external services, and documentation metadata
- Publishing alternate formats for a product (e.g. netCDF for HDF products) with the actual conversion happening on request
- Publishing granules to be modified by on-the-fly services, like GES-DISC's Data Quality Screening Service
- Publishing "bundled" products where granules from one product correspond to granules from one or more other related products
GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records.
Tahsin, Tasnia; Weissenbacher, Davy; O'Connor, Karen; Magge, Arjun; Scotch, Matthew; Gonzalez-Hernandez, Graciela
2018-05-01
GeoBoost is a command-line software package developed to address sparse or incomplete metadata in GenBank sequence records that relate to the location of the infected host (LOIH) of viruses. Given a set of GenBank accession numbers corresponding to virus GenBank records, GeoBoost extracts, integrates and normalizes geographic information reflecting the LOIH of the viruses using integrated information from GenBank metadata and related full-text publications. In addition, to facilitate probabilistic geospatial modeling, GeoBoost assigns probability scores for each possible LOIH. Binaries and resources required for running GeoBoost are packed into a single zipped file and freely available for download at https://tinyurl.com/geoboost. A video tutorial is included to help users quickly and easily install and run the software. The software is implemented in Java 1.8, and supported on MS Windows and Linux platforms. Contact: gragon@upenn.edu. Supplementary data are available at Bioinformatics online.
SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents
Heifets, Abraham; Jurisica, Igor
2012-01-01
The patent literature is a rich catalog of biologically relevant chemicals; many public and commercial molecular databases contain the structures disclosed in patent claims. However, patents are an equally rich source of metadata about bioactive molecules, including mechanism of action, disease class, homologous experimental series, structural alternatives, or the synthetic pathways used to produce molecules of interest. Unfortunately, this metadata is discarded when chemical structures are deposited separately in databases. SCRIPDB is a chemical structure database designed to make this metadata accessible. SCRIPDB provides the full original patent text, reactions and relationships described within any individual patent, in addition to the molecular files common to structural databases. We discuss how such information is valuable in medical text mining, chemical image analysis, reaction extraction and in silico pharmaceutical lead optimization. SCRIPDB may be searched by exact chemical structure, substructure or molecular similarity and the results may be restricted to patents describing synthetic routes. SCRIPDB is available at http://dcv.uhnres.utoronto.ca/SCRIPDB. PMID:22067445
The Materials Data Facility: Data Services to Advance Materials Science Research
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blaiszik, B.; Chard, K.; Pruyne, J.
2016-07-06
With increasingly strict data management requirements from funding agencies and institutions, expanding focus on the challenges of research replicability, and growing data sizes and heterogeneity, new data needs are emerging in the materials community. The materials data facility (MDF) operates two cloud-hosted services, data publication and data discovery, with features to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific) and automatically-extracted metadata in a registry while the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. The MDF services empower individual researchers, research projects, and institutions to (I) publish research datasets, regardless of size, from local storage, institutional data stores, or cloud storage, without involvement of third-party publishers; (II) build, share, and enforce extensible domain-specific custom metadata schemas; (III) interact with published data and metadata via representational state transfer (REST) application program interfaces (APIs) to facilitate automation, analysis, and feedback; and (IV) access a data discovery model that allows researchers to search, interrogate, and eventually build on existing published data. We describe MDF’s design, current status, and future plans.
XML at the ADC: Steps to a Next Generation Data Archive
NASA Astrophysics Data System (ADS)
Shaya, E.; Blackwell, J.; Gass, J.; Oliversen, N.; Schneider, G.; Thomas, B.; Cheung, C.; White, R. A.
1999-05-01
The eXtensible Markup Language (XML) is a document markup language that allows users to specify their own tags, to create hierarchical structures to qualify their data, and to support automatic checking of documents for structural validity. It is being intensively supported by nearly every major corporate software developer. Under the funds of a NASA AISRP proposal, the Astronomical Data Center (ADC, http://adc.gsfc.nasa.gov) is developing an infrastructure for importation, enhancement, and distribution of data and metadata using XML as the document markup language. We discuss the preliminary Document Type Definition (DTD, at http://adc.gsfc.nasa.gov/xml) which specifies the elements and their attributes in our metadata documents. This attempts to define both the metadata of an astronomical catalog and the `header' information of an astronomical table. In addition, we give an overview of the planned flow of data through automated pipelines from authors and journal presses into our XML archive and retrieval through the web via the XML-QL Query Language and eXtensible Style Language (XSL) scripts. When completed, the catalogs and journal tables at the ADC will be tightly hyperlinked to enhance data discovery. In addition one will be able to search on fragmentary information. For instance, one could query for a table by entering that the second author is so-and-so or that the third author is at such-and-such institution.
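The "second author is so-and-so" query mentioned above can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`catalogs`, `catalog`, `author`, `id`) and the sample records are hypothetical stand-ins, not the ADC's actual DTD, and the abstract's own plan uses XML-QL rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document; the real ADC DTD is not reproduced here.
CATALOG_XML = """
<catalogs>
  <catalog id="demo-1">
    <title>Example photometric catalog</title>
    <author>First, A.</author>
    <author>Second, B.</author>
  </catalog>
  <catalog id="demo-2">
    <title>Example astrometric table</title>
    <author>Second, B.</author>
    <author>Third, C.</author>
  </catalog>
</catalogs>
"""


def catalogs_with_nth_author(xml_text, position, name):
    """Return ids of catalogs whose author at `position` (1-based) is `name`."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for catalog in root.findall("catalog"):
        authors = [a.text for a in catalog.findall("author")]
        if len(authors) >= position and authors[position - 1] == name:
            hits.append(catalog.get("id"))
    return hits
```

Because the author order is preserved by the hierarchical markup, the same name matches `demo-1` as second author and `demo-2` as first author, which a flat keyword index could not distinguish.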
The Self-Organized Archive: SPASE, PDS and Archive Cooperatives
NASA Astrophysics Data System (ADS)
King, T. A.; Hughes, J. S.; Roberts, D. A.; Walker, R. J.; Joy, S. P.
2005-05-01
Information systems with high quality metadata enable uses and services which often go beyond the original purpose. There are two types of metadata: annotations, which are items that comment on or describe the content of a resource, and identification attributes, which describe the external properties of the resource itself. For example, annotations may indicate which columns are present in a table of data, whereas an identification attribute would indicate the source of the table, such as the observatory, instrument, organization, and data type. When the identification attributes are collected and used as the basis of a search engine, a user can constrain on an attribute and the archive can then self-organize around the constraint, presenting the user with a particular view of the archive. In an archive cooperative where each participating data system or archive may have its own metadata standards, providing a multi-system search engine requires that individual archive metadata be mapped to a broad based standard. To explore how cooperative archives can form a larger self-organized archive we will show how the Space Physics Archive Search and Extract (SPASE) data model will allow different systems to create a cooperative and will use Planetary Data System (PDS) plus existing space physics activities as a demonstration.
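Mapping each archive's identification attributes to a common standard and then constraining on an attribute can be sketched in Python as follows. The archive names, field names and mappings are invented for illustration and are not actual SPASE or PDS keywords.

```python
# Hypothetical per-archive mappings from local field names to common ones.
FIELD_MAPS = {
    "archive_a": {"obs": "observatory", "inst": "instrument"},
    "archive_b": {"OBSERVATORY_NAME": "observatory", "INSTRUMENT_ID": "instrument"},
}


def to_common(archive, record):
    """Translate one archive-specific record into the common attribute names."""
    mapping = FIELD_MAPS[archive]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def search(records, **constraints):
    """Return the records whose common attributes match every constraint."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in constraints.items())
    ]
```

Once records from both archives share the same attribute names, a single constraint such as `observatory="X"` selects a view that spans the whole cooperative.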
A Metadata Management Framework for Collaborative Review of Science Data Products
NASA Astrophysics Data System (ADS)
Hart, A. F.; Cinquini, L.; Mattmann, C. A.; Thompson, D. R.; Wagstaff, K.; Zimdars, P. A.; Jones, D. L.; Lazio, J.; Preston, R. A.
2012-12-01
Data volumes generated by modern scientific instruments often preclude archiving the complete observational record. To compensate, science teams have developed a variety of "triage" techniques for identifying data of potential scientific interest and marking it for prioritized processing or permanent storage. This may involve multiple stages of filtering with both automated and manual components operating at different timescales. A promising approach exploits a fast, fully automated first stage followed by a more reliable offline manual review of candidate events. This hybrid approach permits a 24-hour rapid real-time response while also preserving the high accuracy of manual review. To support this type of second-level validation effort, we have developed a metadata-driven framework for the collaborative review of candidate data products. The framework consists of a metadata processing pipeline and a browser-based user interface that together provide a configurable mechanism for reviewing data products via the web, and capturing the full stack of associated metadata in a robust, searchable archive. Our system heavily leverages software from the Apache Object Oriented Data Technology (OODT) project, an open source data integration framework that facilitates the construction of scalable data systems and places a heavy emphasis on the utilization of metadata to coordinate processing activities. OODT provides a suite of core data management components for file management and metadata cataloging that form the foundation for this effort. The system has been deployed at JPL in support of the V-FASTR experiment [1], a software-based radio transient detection experiment that operates commensally at the Very Long Baseline Array (VLBA), and has a science team that is geographically distributed across several countries. Daily review of automatically flagged data is a shared responsibility for the team, and is essential to keep the project within its resource constraints. 
We describe the development of the platform using open source software, and discuss our experience deploying the system operationally. [1] R. B. Wayth, W. F. Brisken, A. T. Deller, W. A. Majid, D. R. Thompson, S. J. Tingay, and K. L. Wagstaff, "V-FASTR: The VLBA fast radio transients experiment," The Astrophysical Journal, vol. 735, no. 2, p. 97, 2011. Acknowledgement: This effort was supported by the Jet Propulsion Laboratory, managed by the California Institute of Technology under a contract with the National Aeronautics and Space Administration.
NASA Astrophysics Data System (ADS)
Jovicic, A.; Castelli, A.; Kljajic, Z.
2012-04-01
As a result of efforts to standardize oceanographic data sets collected since 2002 in the south-east Adriatic, a relational data model suitable for storing metadata and in situ measurements was designed and implemented. Using a combination of customized tools developed for extracting metadata and data records from CTD files, together with standard office applications, data were extracted, transformed, processed and unified by attributes and units of measurement. To make these data available to the wider scientific community, we have developed a web portal that supports data retrieval based on various filters (spatial, temporal, by project and/or by sampling instrument). The selected data model also proves very efficient for generating the data-exchange formats required by various projects and initiatives (e.g. SeaDataNet), so, extended with particular dictionaries, it allows fast implementation of integration services. As part of the Ecoport 8 project, a newly available type of data was recently introduced: real-time data provided by permanent sensors, which need to be automatically collected and stored in the database. Visualization of such data was also required, as well as exchange with the project data center. To fulfill those requirements, an additional data schema and appropriate B2B services were developed. Additional care was taken over data transfer security, as the database was not hosted at the same location as the workstation used for remote access to the sensor equipment. The third section of the portal is "Tide Tables", an interactive, graphical application that visualizes tide predictions for the ports of Bar and Kotor and also allows correction for atmospheric pressure. Developed in Java and based on Mike Foreman's well-known Fortran 77 code, it can be used as a stand-alone product without an Internet connection.
The last section of the portal is a Google Earth file containing the positions of stations as well as some spatial features that can be useful when planning future oceanographic cruises in this area (e.g. explosives dumping grounds, administrative lines and depth contours). Technically speaking, the present level of implementation can provide a fast response to any future requirement. However, some administrative issues need to be resolved. Multilateral (or bilateral) data-exchange policies need to be signed by all interested parties before all data can become fully available to the wider scientific community.
Targeted exploration and analysis of large cross-platform human transcriptomic compendia
Zhu, Qian; Wong, Aaron K; Krishnan, Arjun; Aure, Miriam R; Tadych, Alicja; Zhang, Ran; Corney, David C; Greene, Casey S; Bongo, Lars A; Kristensen, Vessela N; Charikar, Moses; Li, Kai; Troyanskaya, Olga G.
2016-01-01
We present SEEK (http://seek.princeton.edu), a query-based search engine across very large transcriptomic data collections, including thousands of human data sets from almost 50 microarray and next-generation sequencing platforms. SEEK uses a novel query-level cross-validation-based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify query-coregulated genes, pathways, and processes. SEEK provides cross-platform handling, multi-gene query search, iterative metadata-based search refinement, and extensive visualization-based analysis options. PMID:25581801
High-throughput neuroimaging-genetics computational infrastructure
Dinov, Ivo D.; Petrosyan, Petros; Liu, Zhizhong; Eggert, Paul; Hobel, Sam; Vespa, Paul; Woo Moon, Seok; Van Horn, John D.; Franco, Joseph; Toga, Arthur W.
2014-01-01
Many contemporary neuroscientific investigations face significant challenges in terms of data management, computational processing, data mining, and results interpretation. These four pillars define the core infrastructure necessary to plan, organize, orchestrate, validate, and disseminate novel scientific methods, computational resources, and translational healthcare findings. Data management includes protocols for data acquisition, archival, query, transfer, retrieval, and aggregation. Computational processing involves the necessary software, hardware, and networking infrastructure required to handle large amounts of heterogeneous neuroimaging, genetics, clinical, and phenotypic data and meta-data. Data mining refers to the process of automatically extracting data features, characteristics and associations, which are not readily visible by human exploration of the raw dataset. Result interpretation includes scientific visualization, community validation of findings and reproducible findings. In this manuscript we describe the novel high-throughput neuroimaging-genetics computational infrastructure available at the Institute for Neuroimaging and Informatics (INI) and the Laboratory of Neuro Imaging (LONI) at University of Southern California (USC). INI and LONI include ultra-high-field and standard-field MRI brain scanners along with an imaging-genetics database for storing the complete provenance of the raw and derived data and meta-data. In addition, the institute provides a large number of software tools for image and shape analysis, mathematical modeling, genomic sequence processing, and scientific visualization. A unique feature of this architecture is the Pipeline environment, which integrates the data management, processing, transfer, and visualization. 
Through its client-server architecture, the Pipeline environment provides a graphical user interface for designing, executing, monitoring, validating, and disseminating complex protocols that utilize diverse suites of software tools and web-services. These pipeline workflows are represented as portable XML objects which transfer the execution instructions and user specifications from the client user machine to remote pipeline servers for distributed computing. Using Alzheimer's and Parkinson's data, we provide several examples of translational applications using this infrastructure. PMID:24795619
Towards Text Copyright Detection Using Metadata in Web Applications
ERIC Educational Resources Information Center
Poulos, Marios; Korfiatis, Nikolaos; Bokos, George
2011-01-01
Purpose: This paper aims to present the semantic content identifier (SCI), a permanent identifier, computed through a linear-time onion-peeling algorithm that enables the extraction of semantic features from a text, and the integration of this information within the permanent identifier. Design/methodology/approach: The authors employ SCI to…
NASA Astrophysics Data System (ADS)
Sidiropoulos, Panagiotis; Muller, Jan-Peter; Watson, Gillian; Michael, Gregory; Walter, Sebastian
2018-02-01
This work presents the coregistered, orthorectified and mosaiced high-resolution products of the MC11 quadrangle of Mars, which have been processed using novel, fully automatic techniques. We discuss the development of a pipeline that achieves fully automatic and parameter-independent geometric alignment of high-resolution planetary images, starting from raw input images in NASA PDS format and following all required steps to produce a coregistered GeoTIFF image, a corresponding footprint and useful metadata. Additionally, we describe the development of a radiometric calibration technique that post-processes coregistered images to make them radiometrically consistent. Finally, we present a batch-mode application of the developed techniques over the MC11 quadrangle to validate their potential, as well as to generate end products, which are released to the planetary science community, thus assisting in the analysis of Mars static and dynamic features. This case study is a step towards the full automation of signal processing tasks that are essential to increase the usability of planetary data but currently require the extensive use of human resources.
PS1-41: Just Add Data: Implementing an Event-Based Data Model for Clinical Trial Tracking
Fuller, Sharon; Carrell, David; Pardee, Roy
2012-01-01
Background/Aims Clinical research trials often have similar fundamental tracking needs, despite being quite variable in their specific logic and activities. A model tracking database that can be quickly adapted by a variety of studies has the potential to achieve significant efficiencies in database development and maintenance. Methods Over the course of several different clinical trials, we have developed a database model that is highly adaptable to a variety of projects. Rather than hard-coding each specific event that might occur in a trial, along with its logical consequences, this model considers each event and its parameters to be a data record in its own right. Each event may have related variables (metadata) describing its prerequisites, subsequent events due, associated mailings, or events that it overrides. The metadata for each event is stored in the same record with the event name. When changes are made to the study protocol, no structural changes to the database are needed. One has only to add or edit events and their metadata. Changes in the event metadata automatically determine any related logic changes. In addition to streamlining application code, this model simplifies communication between the programmer and other team members. Database requirements can be phrased as changes to the underlying data, rather than to the application code. The project team can review a single report of events and metadata and easily see where changes might be needed. In addition to benefitting from streamlined code, the front end database application can also implement useful standard features such as automated mail merges and to do lists. Results The event-based data model has proven itself to be robust, adaptable and user-friendly in a variety of study contexts. We have chosen to implement it as a SQL Server back end and distributed Access front end. Interested readers may request a copy of the Access front end and scripts for creating the back end database. 
Discussion An event-based database with a consistent, robust set of features has the potential to significantly reduce development time and maintenance expense for clinical trial tracking databases.
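The event-as-data idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' SQL Server/Access implementation: the event names, the metadata fields (`prerequisites`, `next_due`, `overrides`) and the `due_events` helper are all hypothetical, chosen only to show how editing event records, rather than application code, can change the tracking logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EventType:
    """One event record with its metadata; all field names are illustrative."""
    name: str
    prerequisites: list[str] = field(default_factory=list)  # events that must already have occurred
    next_due: list[str] = field(default_factory=list)       # events that become due afterwards
    overrides: list[str] = field(default_factory=list)      # due events cancelled by this one


# Hypothetical protocol: a protocol change means editing these rows only.
PROTOCOL = {
    e.name: e
    for e in [
        EventType("consent_signed", next_due=["baseline_survey"]),
        EventType("baseline_survey", prerequisites=["consent_signed"],
                  next_due=["followup_call"]),
        EventType("followup_call", prerequisites=["baseline_survey"]),
        EventType("withdrawn", overrides=["baseline_survey", "followup_call"]),
    ]
}


def due_events(history: list[str]) -> list[str]:
    """Return the events currently due for a participant, derived from metadata alone."""
    done = set(history)
    cancelled = {o for name in history for o in PROTOCOL[name].overrides}
    due: list[str] = []
    for name in history:
        for nxt in PROTOCOL[name].next_due:
            ready = all(p in done for p in PROTOCOL[nxt].prerequisites)
            if ready and nxt not in done and nxt not in cancelled and nxt not in due:
                due.append(nxt)
    return due
```

In this sketch, adding a new study step amounts to adding one `EventType` row; `due_events` itself does not change.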
Brady, S L; Kaufman, R A
2015-05-01
To develop an automated methodology to estimate patient examination dose in digital radiography (DR) imaging using DICOM metadata as a quality assurance (QA) tool. Patient examination and demographical information were gathered from metadata analysis of DICOM header data. The x-ray system radiation output (i.e., air KERMA) was characterized for all filter combinations used for patient examinations. Average patient thicknesses were measured for head, chest, abdomen, knees, and hands using volumetric images from CT. Backscatter factors (BSFs) were calculated from examination kVp. Patient entrance skin air KERMA (ESAK) was calculated by (1) looking up examination technique factors taken from DICOM header metadata (i.e., kVp and mA s) to derive an air KERMA (k air) value based on an x-ray characteristic radiation output curve; (2) scaling k air with a BSF value; and (3) correcting k air for patient thickness. Finally, patient entrance skin dose (ESD) was calculated by multiplying a mass-energy attenuation coefficient ratio by ESAK. Patient ESD calculations were computed for common DR examinations at our institution: dual view chest, anteroposterior (AP) abdomen, lateral (LAT) skull, dual view knee, and bone age (left hand only) examinations. ESD was calculated for a total of 3794 patients; mean age was 11 ± 8 yr (range: 2 months to 55 yr). The mean ESD range was 0.19-0.42 mGy for dual view chest, 0.28-1.2 mGy for AP abdomen, 0.18-0.65 mGy for LAT view skull, 0.15-0.63 mGy for dual view knee, and 0.10-0.12 mGy for bone age (left hand) examinations. A methodology combining DICOM header metadata and basic x-ray tube characterization curves was demonstrated. In a regulatory era where patient dose reporting has become increasingly in demand, this methodology will allow a knowledgeable user the means to establish an automatable dose reporting program for DR and perform patient dose related QA testing for digital x-ray imaging.
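The three-step ESAK calculation and the final ESD scaling described in this abstract can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the callables `output_curve` and `bsf_curve`, the parameter names, and in particular the inverse-square form of the thickness correction are assumptions, since the abstract does not state how the thickness correction is applied.

```python
def entrance_skin_dose(kvp, mas, thickness_cm, *, output_curve, bsf_curve,
                       mu_en_ratio, sid_cm=100.0, ref_distance_cm=100.0):
    """Estimate entrance skin dose (mGy) from DICOM technique factors.

    output_curve(kvp) -> air kerma per mAs (mGy/mAs) at ref_distance_cm  [assumed form]
    bsf_curve(kvp)    -> backscatter factor for that kVp                 [assumed form]
    mu_en_ratio       -> mass-energy attenuation coefficient ratio
    sid_cm            -> source-to-image distance (hypothetical default)
    """
    # (1) Look up air kerma from the tube output curve using kVp and mAs.
    k_air = output_curve(kvp) * mas
    # (2) Scale by the backscatter factor.
    k_air *= bsf_curve(kvp)
    # (3) Correct for patient thickness. Inverse-square scaling to the skin
    #     entrance plane is an assumption, not taken from the abstract.
    source_to_skin_cm = sid_cm - thickness_cm
    esak = k_air * (ref_distance_cm / source_to_skin_cm) ** 2
    # ESD = ESAK multiplied by the mass-energy attenuation coefficient ratio.
    return esak * mu_en_ratio
```

A real deployment would replace the two callables with curves measured for the specific x-ray system and filter combination, as the abstract describes.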
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brady, S. L., E-mail: samuel.brady@stjude.org; Kaufman, R. A., E-mail: robert.kaufman@stjude.org
Purpose: To develop an automated methodology to estimate patient examination dose in digital radiography (DR) imaging using DICOM metadata as a quality assurance (QA) tool. Methods: Patient examination and demographical information were gathered from metadata analysis of DICOM header data. The x-ray system radiation output (i.e., air KERMA) was characterized for all filter combinations used for patient examinations. Average patient thicknesses were measured for head, chest, abdomen, knees, and hands using volumetric images from CT. Backscatter factors (BSFs) were calculated from examination kVp. Patient entrance skin air KERMA (ESAK) was calculated by (1) looking up examination technique factors taken from DICOM header metadata (i.e., kVp and mA s) to derive an air KERMA (k_air) value based on an x-ray characteristic radiation output curve; (2) scaling k_air with a BSF value; and (3) correcting k_air for patient thickness. Finally, patient entrance skin dose (ESD) was calculated by multiplying a mass–energy attenuation coefficient ratio by ESAK. Patient ESD calculations were computed for common DR examinations at our institution: dual view chest, anteroposterior (AP) abdomen, lateral (LAT) skull, dual view knee, and bone age (left hand only) examinations. Results: ESD was calculated for a total of 3794 patients; mean age was 11 ± 8 yr (range: 2 months to 55 yr). The mean ESD range was 0.19–0.42 mGy for dual view chest, 0.28–1.2 mGy for AP abdomen, 0.18–0.65 mGy for LAT view skull, 0.15–0.63 mGy for dual view knee, and 0.10–0.12 mGy for bone age (left hand) examinations. Conclusions: A methodology combining DICOM header metadata and basic x-ray tube characterization curves was demonstrated.
In a regulatory era where patient dose reporting has become increasingly in demand, this methodology will allow a knowledgeable user the means to establish an automatable dose reporting program for DR and perform patient dose related QA testing for digital x-ray imaging.
NASA Astrophysics Data System (ADS)
Hsu, Kuo-Hsien
2012-11-01
Formosat-2 imagery is high-spatial-resolution (2 m GSD) remote sensing satellite data comprising one panchromatic band and four multispectral bands (blue, green, red, near-infrared). An essential step in the daily processing of received Formosat-2 images is estimating the cloud statistic of each image using an Automatic Cloud Coverage Assessment (ACCA) algorithm. The cloud statistic is subsequently recorded as important metadata for the image product catalog. In this paper, we propose an ACCA method with two consecutive stages: pre-processing and post-processing analysis. For pre-processing analysis, unsupervised K-means classification, Sobel's method, a thresholding method, non-cloudy pixel reexamination, and a cross-band filter method are implemented in sequence to determine the cloud statistic. For post-processing analysis, the box-counting fractal method is implemented. In other words, the cloud statistic is first determined via pre-processing analysis, and its correctness across the different spectral bands is then cross-examined qualitatively and quantitatively via post-processing analysis. The selection of an appropriate thresholding method is critical to the result of the ACCA method. Therefore, in this work, we first conduct a series of experiments comparing the performance of clustering-based and spatial thresholding methods, including Otsu's, Local Entropy (LE), Joint Entropy (JE), Global Entropy (GE), and Global Relative Entropy (GRE) methods. The results show that Otsu's and GE methods both perform better than the others for Formosat-2 images. Additionally, our proposed ACCA method, using Otsu's method as the thresholding method, successfully extracted the cloudy pixels of Formosat-2 images for accurate cloud statistic estimation.
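As an illustration of the thresholding step only, the following is a minimal pure-Python sketch of Otsu's method applied to a flat list of grey levels. It does not reproduce the paper's full ACCA pipeline (K-means, Sobel, cross-band filtering, box-counting); the function names and the convention that pixels above the threshold count as cloud are assumptions made for this example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    `pixels` is a flat iterable of integers in [0, levels). Pixels <= t form
    the lower class; pixels > t form the upper class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of grey levels in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def cloud_fraction(pixels, levels=256):
    """Fraction of pixels brighter than the Otsu threshold (assumed to be cloud)."""
    pixels = list(pixels)  # must be non-empty
    t = otsu_threshold(pixels, levels)
    return sum(p > t for p in pixels) / len(pixels)
```

The resulting fraction is the kind of per-image cloud statistic that could be stored as catalog metadata.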
NASA Astrophysics Data System (ADS)
Bargatze, L. F.
2015-12-01
Active Data Archive Product Tracking (ADAPT) is a collection of software routines that permits one to generate XML metadata files to describe and register data products in support of the NASA Heliophysics Virtual Observatory VxO effort. ADAPT is also a philosophy. The ADAPT concept is to use any and all available metadata associated with scientific data to produce XML metadata descriptions in a consistent, uniform, and organized fashion to provide blanket access to the full complement of data stored on a targeted data server. In this poster, we present an application of ADAPT to describe all of the data products that are stored by using the Common Data File (CDF) format served out by the CDAWEB and SPDF data servers hosted at the NASA Goddard Space Flight Center. These data servers are the primary repositories for NASA Heliophysics data. For this purpose, the ADAPT routines have been used to generate data resource descriptions by using an XML schema named Space Physics Archive, Search, and Extract (SPASE). SPASE is the designated standard for documenting Heliophysics data products, as adopted by the Heliophysics Data and Model Consortium. The set of SPASE XML resource descriptions produced by ADAPT includes high-level descriptions of numerical data products, display data products, or catalogs and also includes low-level "Granule" descriptions. A SPASE Granule is effectively a universal access metadata resource; a Granule associates an individual data file (e.g. a CDF file) with a "parent" high-level data resource description, assigns a resource identifier to the file, and lists the corresponding access URL(s). The CDAWEB and SPDF file systems were queried to provide the input required by the ADAPT software to create an initial set of SPASE metadata resource descriptions.
Then, the CDAWEB and SPDF data repositories were queried subsequently on a nightly basis and the CDF file lists were checked for any changes such as the occurrence of new, modified, or deleted files, or the addition of new or the deletion of old data products. Next, ADAPT routines analyzed the query results and issued updates to the metadata stored in the UCLA CDAWEB and SPDF metadata registries. In this way, the SPASE metadata registries generated by ADAPT can be relied on to provide up to date and complete access to Heliophysics CDF data resources on a daily basis.
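The nightly change check described above (new, modified, or deleted files) can be sketched as a comparison of two directory snapshots. This is not ADAPT's actual code: the `diff_listings` function, the snapshot layout (path mapped to a fingerprint such as size plus modification time), and the example paths are hypothetical.

```python
def diff_listings(previous, current):
    """Compare two {path: fingerprint} snapshots of a data server.

    A fingerprint is any value that changes when the file changes, for
    example a (size, mtime) tuple or a checksum.
    """
    prev_paths, curr_paths = set(previous), set(current)
    return {
        "new": sorted(curr_paths - prev_paths),
        "deleted": sorted(prev_paths - curr_paths),
        "modified": sorted(
            p for p in prev_paths & curr_paths if previous[p] != current[p]
        ),
    }


# Hypothetical snapshots taken on consecutive nights.
last_night = {"ac/h0/2015/a.cdf": (1024, 100), "ac/h0/2015/b.cdf": (2048, 100)}
tonight = {"ac/h0/2015/b.cdf": (4096, 200), "ac/h0/2015/c.cdf": (512, 200)}
changes = diff_listings(last_night, tonight)
```

Each entry in `changes` would then drive a registry update: create a description for a new file, refresh one for a modified file, and retire one for a deleted file.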
The Automatic Integration of Folksonomies with Taxonomies Using Non-axiomatic Logic
NASA Astrophysics Data System (ADS)
Geldart, Joe; Cummins, Stephen
Cooperative tagging systems such as folksonomies are powerful tools when used to annotate information resources. The inherent power of folksonomies is in their ability to allow casual users to easily contribute ad hoc, yet meaningful, resource metadata without any specialist training. Older folksonomies have begun to degrade due to the lack of internal structure and from the use of many low quality tags. This chapter describes a remedy for some of the problems associated with folksonomies. We introduce a method of automatic integration and inference of the relationships between tags and resources in a folksonomy using non-axiomatic logic. We test this method on the CiteULike corpus of tags by comparing precision and recall between it and standard keyword search. Our results show that non-axiomatic reasoning is a promising technique for integrating tagging systems with more structured knowledge representations.
Douyère, Magaly; Soualmia, Lina F; Névéol, Aurélie; Rogozan, Alexandrina; Dahamna, Badisse; Leroy, Jean-Philippe; Thirion, Benoît; Darmoni, Stefan J
2004-12-01
The amount of health information available on the Internet is considerable. In this context, several health gateways have been developed. Among them, CISMeF (Catalogue and Index of Health Resources in French) was designed to catalogue and index health resources in French. The goal of this article is to describe the various enhancements to the MeSH thesaurus developed by the CISMeF team to adapt this terminology to the broader field of health Internet resources instead of scientific articles for the medline bibliographic database. CISMeF uses two standard tools for organizing information: the MeSH thesaurus and several metadata element sets, in particular the Dublin Core metadata format. The heterogeneity of Internet health resources led the CISMeF team to enhance the MeSH thesaurus with the introduction of two new concepts, respectively, resource types and metaterms. CISMeF resource types are a generalization of the publication types of medline. A resource type describes the nature of the resource and MeSH keyword/qualifier pairs describe the subject of the resource. A metaterm is generally a medical specialty or a biological science, which has semantic links with one or more MeSH keywords, qualifiers and resource types. The CISMeF terminology is exploited for several tasks: resource indexing performed manually, resource categorization performed automatically, visualization and navigation through the concept hierarchies and information retrieval using the Doc'CISMeF search engine. The CISMeF health gateway uses several MeSH thesaurus enhancements to optimize information retrieval, hierarchy navigation and automatic indexing.
The Application and Future Direction of the SPASE Metadata Standard in the U.S. and Worldwide
NASA Astrophysics Data System (ADS)
King, Todd; Thieman, James; Roberts, D. Aaron
2013-04-01
The Space Physics Archive Search and Extract (SPASE) Metadata standard for Heliophysics and related data is now an established standard within the NASA-funded space and solar physics community and is spreading to the international groups within that community. Development of SPASE had involved a number of international partners and the current version of the SPASE Metadata Model (version 2.2.2) has been stable since January 2011. The SPASE standard has been adopted by groups such as NASA's Heliophysics division, the Canadian Space Science Data Portal (CSSDP), Canada's AUTUMN network, Japan's Inter-university Upper atmosphere Global Observation NETwork (IUGONET), Centre de Données de la Physique des Plasmas (CDPP), and the near-Earth space data infrastructure for e-Science (ESPAS). In addition, portions of the SPASE dictionary have been modeled in semantic web ontologies for use with reasoners and semantic searches. In development are modifications to accommodate simulation and model data, as well as enhancements to describe data accessibility. These additions will add features to describe a broader range of data types. In keeping with a SPASE principle of back-compatibility, these changes will not affect the data descriptions already generated for instrument-related datasets. We also look at the long term commitment by NASA to support the SPASE effort and how SPASE metadata can enable value-added services.
An automatic rat brain extraction method based on a deformable surface model.
Li, Jiehua; Liu, Xiaofeng; Zhuo, Jiachen; Gullapalli, Rao P; Zara, Jason M
2013-08-15
The extraction of the brain from the skull in medical images is a necessary first step before image registration or segmentation. While pre-clinical MR imaging studies on small animals, such as rats, are increasing, fully automatic imaging processing techniques specific to small animal studies remain lacking. In this paper, we present an automatic rat brain extraction method, the Rat Brain Deformable model method (RBD), which adapts the popular human brain extraction tool (BET) through the incorporation of information on the brain geometry and MR image characteristics of the rat brain. The robustness of the method was demonstrated on T2-weighted MR images of 64 rats and compared with other brain extraction methods (BET, PCNN, PCNN-3D). The results demonstrate that RBD reliably extracts the rat brain with high accuracy (>92% volume overlap) and is robust against signal inhomogeneity in the images. Copyright © 2013 Elsevier B.V. All rights reserved.
Gorgolewski, Krzysztof J; Varoquaux, Gael; Rivera, Gabriel; Schwartz, Yannick; Sochat, Vanessa V; Ghosh, Satrajit S; Maumet, Camille; Nichols, Thomas E; Poline, Jean-Baptiste; Yarkoni, Tal; Margulies, Daniel S; Poldrack, Russell A
2016-01-01
NeuroVault.org is dedicated to storing outputs of analyses in the form of statistical maps, parcellations and atlases, a unique strategy that contrasts with most neuroimaging repositories that store raw acquisition data or stereotaxic coordinates. Such maps are indispensable for performing meta-analyses, validating novel methodology, and deciding on precise outlines for regions of interest (ROIs). NeuroVault is open to maps derived from both healthy and clinical populations, as well as from various imaging modalities (sMRI, fMRI, EEG, MEG, PET, etc.). The repository uses modern web technologies such as interactive web-based visualization, cognitive decoding, and comparison with other maps to provide researchers with efficient, intuitive tools to improve the understanding of their results. Each dataset and map is assigned a permanent Universal Resource Locator (URL), and all of the data is accessible through a REST Application Programming Interface (API). Additionally, the repository supports the NIDM-Results standard and has the ability to parse outputs from popular FSL and SPM software packages to automatically extract relevant metadata. This ease of use, modern web-integration, and pioneering functionality holds promise to improve the workflow for making inferences about and sharing whole-brain statistical maps. Copyright © 2015 Elsevier Inc. All rights reserved.
SAS- Semantic Annotation Service for Geoscience resources on the web
NASA Astrophysics Data System (ADS)
Elag, M.; Kumar, P.; Marini, L.; Li, R.; Jiang, P.
2015-12-01
There is a growing need for increased integration across the data and model resources that are disseminated on the web to advance their reuse across different earth science applications. Meaningful reuse of resources requires semantic metadata to realize the semantic web vision of pragmatic linkage and integration among resources. Semantic metadata associates standard metadata with resources to turn them into semantically-enabled resources on the web. However, the lack of a common standardized metadata framework, as well as the uncoordinated use of metadata fields across different geo-information systems, has led to a situation in which standards and related Standard Names abound. To address this need, we have designed SAS to provide a bridge between the core ontologies required to annotate resources and information systems in order to enable queries and analysis over annotation from a single environment (web). SAS is one of the services that are provided by the Geosemantic framework, which is a decentralized semantic framework to support the integration between models and data and allow semantically heterogeneous resources to interact with minimum human intervention. Here we present the design of SAS and demonstrate its application for annotating data and models. First we describe how predicates and their attributes are extracted from standards and ingested in the knowledge-base of the Geosemantic framework. Then we illustrate the application of SAS in annotating data managed by SEAD and annotating simulation models that have a web interface. SAS is a step in a broader approach to raise the quality of geoscience data and models that are published on the web and allow users to better search, access, and use the existing resources based on standard vocabularies that are encoded and published using semantic technologies.
Lamy, Jean-Baptiste; Ugon, Adrien; Berthelot, Hélène
2016-01-01
Potential adverse effects (AEs) of drugs are described in their summary of product characteristics (SPCs), a textual document. Automatic extraction of AEs from SPCs is useful for detecting AEs and for building drug databases. However, this task is difficult because each AE is associated with a frequency that must be extracted and the presentation of AEs in SPCs is heterogeneous, consisting of plain text and tables in many different formats. We propose a taxonomy for the presentation of AEs in SPCs. We set up natural language processing (NLP) and table parsing methods for extracting AEs from texts and tables of any format, and evaluate them on 10 SPCs. Automatic extraction performed better on tables than on texts. Tables should be recommended for the presentation of the AEs section of the SPCs.
Automatic Molar Extraction from Dental Panoramic Radiographs for Forensic Personal Identification
NASA Astrophysics Data System (ADS)
Samopa, Febriliyan; Asano, Akira; Taguchi, Akira
Measurement of an individual molar provides rich information for forensic personal identification. We propose a computer-based system for extracting an individual molar from dental panoramic radiographs. A molar is obtained by extracting the region-of-interest, separating the maxilla and mandible, and extracting the boundaries between teeth. The proposed system is almost fully automatic; all that the user has to do is click three points on the boundary between the maxilla and the mandible.
Information retrieval and terminology extraction in online resources for patients with diabetes.
Seljan, Sanja; Baretić, Maja; Kucis, Vlasta
2014-06-01
Terminology use, as a means for information retrieval or document indexing, plays an important role in health literacy. Specific types of users, i.e. patients with diabetes, need access to various online resources (in a foreign and/or native language) when searching for information on self-education in basic diabetic knowledge, on self-care activities regarding the importance of dietetic food, medications and physical exercise, and on self-management of insulin pumps. Automatic extraction of corpus-based terminology from online texts, manuals or professional papers can help in building terminology lists or lists of "browsing phrases" useful in information retrieval or in document indexing. Specific terminology lists represent an intermediate step between free text search and controlled vocabulary, and between users' demands and existing online resources in native and foreign languages. The research, which aims to detect the role of terminology in online resources, is conducted on English and Croatian manuals and Croatian online texts, and is divided into three interrelated parts: i) comparison of professional and popular terminology use; ii) evaluation of automatic statistically-based terminology extraction on English and Croatian texts; iii) comparison and evaluation of extracted terminology performed on an English manual using statistical and hybrid approaches. Extracted terminology candidates are evaluated by comparison with three types of reference lists: a list created by a professional medical person, a list of highly professional vocabulary contained in MeSH, and a list created by non-medical persons, made as the intersection of 15 lists. Results report on the use of popular and professional terminology in online diabetes resources, on the evaluation of automatically extracted terminology candidates in English and Croatian texts, and on the comparison of statistical and hybrid extraction methods in English text.
Evaluation of automatic and semi-automatic terminology extraction methods is performed by recall, precision and f-measure.
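The evaluation measures named in this abstract can be illustrated with a minimal sketch. It assumes that extracted candidates and reference terms are compared as case-insensitive exact strings; the abstract does not say how matching was done, and the term lists below are hypothetical examples, not data from the study.

```python
def evaluate_terms(extracted, reference):
    """Return (precision, recall, f_measure) of extracted terms against a reference list."""
    # Assumption: terms match only if identical after lowercasing and trimming.
    extracted_set = {term.lower().strip() for term in extracted}
    reference_set = {term.lower().strip() for term in reference}
    true_positives = len(extracted_set & reference_set)

    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical candidate and reference lists, for illustration only.
candidates = ["insulin pump", "blood glucose", "user manual", "basal rate"]
reference = ["insulin pump", "blood glucose", "basal rate", "bolus"]
precision, recall, f_measure = evaluate_terms(candidates, reference)
```

Running the same function against each of the three reference lists would give one precision/recall/F-measure triple per list.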
Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text
NASA Astrophysics Data System (ADS)
Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.
2015-12-01
We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we call Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. 
Our investigation leads us to extend the automatic knowledge extraction process of cTAKES for the biomedical research domain by improving the ontology-guided information extraction process. We will describe our experience and implementation of our system and share lessons learned from our development. We will also discuss ways in which this could be adapted to other science fields. [1] Funk et al., 2014. [2] Kang et al., 2014. [3] Utopia Documents, http://utopiadocs.com [4] Apache cTAKES, http://ctakes.apache.org
Overview of long-term field experiments in Germany - metadata visualization
NASA Astrophysics Data System (ADS)
Muqit Zoarder, Md Abdul; Heinrich, Uwe; Svoboda, Nikolai; Grosse, Meike; Hierold, Wilfried
2017-04-01
BonaRes ("soil as a sustainable resource for the bioeconomy") is collecting data and metadata of agricultural long-term field experiments (LTFE) in Germany. It is funded by the German Federal Ministry of Education and Research (BMBF) under the umbrella of the National Research Strategy BioEconomy 2030. BonaRes consists of ten interdisciplinary research project consortia and the 'BonaRes - Centre for Soil Research'. The BonaRes Data Centre is responsible for collecting all LTFE data and the associated metadata in an enterprise database with a high level of security, and for visualizing the data and metadata through a data portal. In the frame of the BonaRes project, we are compiling an overview of long-term field experiments in Germany that is based on a literature review, the results of an online survey and direct contacts with LTFE operators. Information about research topic, contact person, website, experiment setup and analyzed parameters is collected. Based on the collected LTFE data, an enterprise geodatabase has been developed and a GIS-based web information system about LTFE in Germany has been set up. Various aspects of the LTFE, like experiment type, land-use type, agricultural category and duration of experiment, are presented in thematic maps. This information system is dynamically linked to the database, which means changes in the data directly affect the presentation. An easy data search option using LTFE name, location or operator, together with dynamic layer selection, ensures a user-friendly web application. Dispersing and visualizing overlapping LTFE points on the overview map is also challenging; this is automated at every zoom level and is an integral part of the application. The application provides both the spatial location and the meta-information of LTFEs, backed by an enterprise geodatabase, a GIS server for hosting map services and a JavaScript API for web application development.
NASA Astrophysics Data System (ADS)
Gebhardt, Steffen; Wehrmann, Thilo; Klinger, Verena; Schettler, Ingo; Huth, Juliane; Künzer, Claudia; Dech, Stefan
2010-10-01
The German-Vietnamese water-related information system for the Mekong Delta (WISDOM) project supports business processes in Integrated Water Resources Management in Vietnam. Multiple disciplines bring together earth and ground based observation themes, such as environmental monitoring, water management, demographics, economy, information technology, and infrastructural systems. This paper introduces the components of the web-based WISDOM system including data, logic and presentation tier. It focuses on the data models upon which the database management system is built, including techniques for tagging or linking metadata with the stored information. The model also uses ordered groupings of spatial, thematic and temporal reference objects to semantically tag datasets to enable fast data retrieval, such as finding all data in a specific administrative unit belonging to a specific theme. A spatial database extension is employed by the PostgreSQL database. This object-oriented database was chosen over a relational database to tag spatial objects to tabular data, improving the retrieval of census and observational data at regional, provincial, and local areas. While the spatial database hinders processing raster data, a "work-around" was built into WISDOM to permit efficient management of both raster and vector data. The data model also incorporates styling aspects of the spatial datasets through styled layer descriptions (SLD) and web mapping service (WMS) layer specifications, allowing retrieval of rendered maps. Metadata elements of the spatial data are based on the ISO19115 standard. XML structured information of the SLD and metadata are stored in an XML database. The data models and the data management system are robust for managing the large quantity of spatial objects, sensor observations, census and document data. 
The operational WISDOM information system prototype contains modules for data management, automatic data integration, and web services for data retrieval, analysis, and distribution. The graphical user interfaces facilitate metadata cataloguing, data warehousing, web sensor data analysis and thematic mapping.
TR32DB - Management of Research Data in a Collaborative, Interdisciplinary Research Project
NASA Astrophysics Data System (ADS)
Curdt, Constanze; Hoffmeister, Dirk; Waldhoff, Guido; Lang, Ulrich; Bareth, Georg
2015-04-01
The management of research data in a well-structured and documented manner is essential in the context of collaborative, interdisciplinary research environments (e.g. across various institutions). Consequently, the set-up and use of a research data management (RDM) system like a data repository or project database is necessary. These systems should accompany and support scientists during the entire research life cycle (e.g. data collection, documentation, storage, archiving, sharing, publishing) and operate across disciplines in interdisciplinary research projects. The challenges and problems of RDM are well known. Consequently, the set-up of a user-friendly, well-documented, sustainable RDM system is essential, as well as user support and further assistance. In the framework of the Transregio Collaborative Research Centre 32 'Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation' (CRC/TR32), funded by the German Research Foundation (DFG), an RDM system was self-designed and implemented. The CRC/TR32 project database (TR32DB, www.tr32db.de) has been operating online since early 2008. The TR32DB handles all data created by the involved project participants from several institutions (e.g. Universities of Cologne, Bonn, Aachen, and the Research Centre Jülich) and research fields (e.g. soil and plant sciences, hydrology, geography, geophysics, meteorology, remote sensing). Very heterogeneous research data are considered, resulting from field measurement campaigns, meteorological monitoring, remote sensing, laboratory studies and modelling approaches. Furthermore, outcomes like publications, conference contributions, PhD reports and corresponding images are included. The TR32DB project database is set up in cooperation with the Regional Computing Centre of the University of Cologne (RRZK) and is also located in this hardware environment. 
The TR32DB system architecture is composed of three main components: (i) a file-based data storage including backup, (ii) a database-based storage for administrative data and metadata, and (iii) a web-interface for user access. The TR32DB offers common features of RDM systems. These include data storage, entry of corresponding metadata by a user-friendly input wizard, search and download of data depending on user permission, as well as secure internal exchange of data. In addition, a Digital Object Identifier (DOI) can be allocated for specific datasets and several web mapping components are supported (e.g. Web-GIS and map search). The centrepiece of the TR32DB is the self-provided and implemented CRC/TR32 specific metadata schema. This enables the documentation of all involved, heterogeneous data with accurate, interoperable metadata. The TR32DB Metadata Schema is set-up in a multi-level approach and supports several metadata standards and schemes (e.g. Dublin Core, ISO 19115, INSPIRE, DataCite). Furthermore, metadata properties with focus on the CRC/TR32 background (e.g. CRC/TR32 specific keywords) and the supported data types are complemented. Mandatory, optional and automatic metadata properties are specified. Overall, the TR32DB is designed and implemented according to the needs of the CRC/TR32 (e.g. huge amount of heterogeneous data) and demands of the DFG (e.g. cooperation with a computing centre). The application of a self-designed, project-specific, interoperable metadata schema enables the accurate documentation of all CRC/TR32 data. The implementation of the TR32DB in the hardware environment of the RRZK ensures the access to the data after the end of the CRC/TR32 funding in 2018.
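The distinction between mandatory, optional and automatic metadata properties can be sketched as follows. The property names are hypothetical placeholders and are not taken from the actual TR32DB Metadata Schema; the sketch only illustrates the idea that user input is checked against the first two groups while the system fills in the third.

```python
from datetime import date

# Hypothetical property names; the real TR32DB schema is not reproduced here.
MANDATORY = {"title", "creator", "data_type"}
OPTIONAL = {"keywords", "description"}


def build_record(user_fields, today=None):
    """Validate user-entered metadata and add automatically generated properties."""
    entered = set(user_fields)
    missing = MANDATORY - entered
    if missing:
        raise ValueError(f"missing mandatory properties: {sorted(missing)}")
    unknown = entered - MANDATORY - OPTIONAL
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")

    record = dict(user_fields)
    # Automatic properties are set by the system rather than entered by the user.
    record["date_submitted"] = (today or date.today()).isoformat()
    return record


record = build_record(
    {"title": "Soil moisture 2012", "creator": "A. Example", "data_type": "data"},
    today=date(2015, 4, 1),
)
```

An input wizard of the kind mentioned in the abstract could call such a check before a dataset is accepted.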
NASA Reverb: Standards-Driven Earth Science Data and Service Discovery
NASA Astrophysics Data System (ADS)
Cechini, M. F.; Mitchell, A.; Pilone, D.
2011-12-01
NASA's Earth Observing System Data and Information System (EOSDIS) is a core capability in NASA's Earth Science Data Systems Program. NASA's EOS ClearingHOuse (ECHO) is a metadata catalog for the EOSDIS, providing a centralized catalog of data products and registry of related data services. Working closely with the EOSDIS community, the ECHO team identified a need to develop the next generation EOS data and service discovery tool. This development effort relied on the following principles: + Metadata Driven User Interface - Users should be presented with data and service discovery capabilities based on dynamic processing of metadata describing the targeted data. + Integrated Data & Service Discovery - Users should be able to discover data and associated data services that facilitate their research objectives. + Leverage Common Standards - Users should be able to discover and invoke services that utilize common interface standards. Metadata plays a vital role in facilitating data discovery and access. As data providers enhance their metadata, more advanced search capabilities become available, enriching a user's search experience. Maturing metadata formats such as ISO 19115 provide the necessary depth of metadata that facilitates advanced data discovery capabilities. Data discovery and access is not limited to simply the retrieval of data granules, but is growing into the more complex discovery of data services. These services include, but are not limited to, services facilitating additional data discovery, subsetting, reformatting, and re-projecting. The discovery and invocation of these data services is made significantly simpler through the use of consistent and interoperable standards. Once a standard is adopted, standard-specific adapters can be developed to communicate with multiple services implementing a specific protocol. The emergence of metadata standards such as ISO 19119 plays a similarly important role in discovery as the 19115 standard. 
After a yearlong design, development, and testing process, the ECHO team successfully released "Reverb - The Next Generation Earth Science Discovery Tool." Reverb relies heavily on the information contained in dataset and granule metadata, such as ISO 19115, to provide a dynamic experience to users based on identified search facet values extracted from science metadata. Such an approach allows users to perform cross-dataset correlation and searches, discovering additional data that they may not previously have been aware of. In addition to data discovery, Reverb users may discover services associated with their data of interest. When services utilize supported standards and/or protocols, Reverb can facilitate the invocation of both synchronous and asynchronous data processing services. This greatly enhances a user's ability to discover data of interest and accomplish their research goals. Extrapolating from the current movement towards interoperable standards and an increase in available services, data service invocation and chaining will become a natural part of data discovery. Reverb is one example of a discovery tool that provides a mechanism for transforming the earth science data discovery paradigm.
A Risk Assessment System with Automatic Extraction of Event Types
NASA Astrophysics Data System (ADS)
Capet, Philippe; Delavallade, Thomas; Nakamura, Takuya; Sandor, Agnes; Tarsitano, Cedric; Voyatzi, Stavroula
In this article we describe the joint effort of experts in linguistics, information extraction and risk assessment to integrate EventSpotter, an automatic event extraction engine, into ADAC, an automated early warning system. By detecting as early as possible weak signals of emerging risks ADAC provides a dynamic synthetic picture of situations involving risk. The ADAC system calculates risk on the basis of fuzzy logic rules operated on a template graph whose leaves are event types. EventSpotter is based on a general purpose natural language dependency parser, XIP, enhanced with domain-specific lexical resources (Lexicon-Grammar). Its role is to automatically feed the leaves with input data.
Biomedical Informatics on the Cloud: A Treasure Hunt for Advancing Cardiovascular Medicine.
Ping, Peipei; Hermjakob, Henning; Polson, Jennifer S; Benos, Panagiotis V; Wang, Wei
2018-04-27
In the digital age of cardiovascular medicine, the rate of biomedical discovery can be greatly accelerated by the guidance and resources required to unearth potential collections of knowledge. A unified computational platform leverages metadata to not only provide direction but also empower researchers to mine a wealth of biomedical information and forge novel mechanistic insights. This review takes the opportunity to present an overview of the cloud-based computational environment, including the functional roles of metadata, the architecture schema of indexing and search, and the practical scenarios of machine learning-supported molecular signature extraction. By introducing several established resources and state-of-the-art workflows, we share with our readers a broadly defined informatics framework to phenotype cardiovascular health and disease. © 2018 American Heart Association, Inc.
2D Automatic body-fitted structured mesh generation using advancing extraction method
USDA-ARS's Scientific Manuscript database
This paper presents an automatic mesh generation algorithm for body-fitted structured meshes in Computational Fluid Dynamics (CFD) analysis using the Advancing Extraction Method (AEM). The method is applicable to two-dimensional domains with complex geometries, which have the hierarchical tree-like...
Vessel extraction in retinal images using automatic thresholding and Gabor Wavelet.
Ali, Aziah; Hussain, Aini; Wan Zaki, Wan Mimi Diyana
2017-07-01
Retinal image analysis has been widely used for early detection and diagnosis of multiple systemic diseases. Accurate vessel extraction in retinal images is a crucial step towards a fully automated diagnosis system. This work presents an efficient unsupervised method for extracting blood vessels from retinal images by combining the existing Gabor Wavelet (GW) method with automatic thresholding. The green channel image is extracted from the color retinal image and used to produce a Gabor feature image using GW. Both the green channel image and the Gabor feature image undergo a vessel-enhancement step in order to highlight blood vessels. Next, the two vessel-enhanced images are transformed to binary images using automatic thresholding before being combined to produce the final vessel output. Combining the images results in a significant improvement in blood vessel extraction performance compared with using either image alone. The effectiveness of the proposed method was demonstrated via comparative analysis with existing methods, validated using the publicly available DRIVE database.
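The threshold-then-combine step described in this abstract can be sketched in a few lines. Two assumptions are made here that the abstract does not confirm: Otsu's method stands in for the unspecified "automatic thresholding", and the two binary images are merged with a pixel-wise logical OR. The inputs are assumed to be vessel-enhanced 8-bit images, flattened to lists, in which vessels appear bright; the pixel values are invented for illustration.

```python
def otsu_threshold(pixels):
    """Return the 8-bit level that maximises between-class variance (Otsu's method)."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        sum_bg += level * hist[level]
        weight_fg = total - weight_bg
        if weight_bg == 0 or weight_fg == 0:
            continue  # one class is empty, so the variance is undefined here
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels):
    """Mark pixels brighter than the automatically chosen level as vessel (1)."""
    level = otsu_threshold(pixels)
    return [1 if value > level else 0 for value in pixels]


def combine(mask_a, mask_b):
    """Assumed combination rule: a pixel is vessel if either mask marks it."""
    return [a | b for a, b in zip(mask_a, mask_b)]


# Invented vessel-enhanced pixel values for the two input images.
green_enhanced = [10, 12, 11, 200, 210, 205]
gabor_enhanced = [5, 180, 8, 190, 6, 185]
vessels = combine(binarize(green_enhanced), binarize(gabor_enhanced))
```

A logical AND would instead keep only pixels on which both images agree, trading sensitivity for fewer false detections; which rule the authors used is not stated.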
NASA Astrophysics Data System (ADS)
Kontoes, Charalampos; Papoutsis, Ioannis; Herekakis, Themistoklis; Michail, Dimitrios; Ieronymidi, Emmanuela
2013-04-01
Remote sensing tools for the accurate, robust and timely assessment of the damages inflicted by forest wildfires provide information that is of paramount importance to public environmental agencies and related stakeholders before, during and after the crisis. The Institute for Astronomy, Astrophysics, Space Applications and Remote Sensing of the National Observatory of Athens (IAASARS/NOA) has developed a fully automatic single and/or multi-date processing chain that takes as input archived Landsat 4, 5 or 7 raw images and produces precise diachronic burnt area polygons and damage assessments over the Greek territory. The methodology consists of three fully automatic stages: 1) the pre-processing stage where the metadata of the raw images are extracted, followed by the application of the LEDAPS software platform for calibration and mask production and the Automated Precise Orthorectification Package, developed by NASA, for image geo-registration and orthorectification, 2) the core-BSM (Burn Scar Mapping) processing stage which incorporates a published classification algorithm based on a series of physical indexes, the application of two filters for noise removal using graph-based techniques and the grouping of pixels classified as burnt to form the appropriate pixel clusters before proceeding to conversion from raster to vector, and 3) the post-processing stage where the products are thematically refined and enriched using auxiliary GIS layers (underlying land cover/use, administrative boundaries, etc.) and human logic/evidence to suppress false alarms and omission errors. The established processing chain has been successfully applied to the entire archive of Landsat imagery over Greece spanning from 1984 to 2012, which has been collected and managed at IAASARS/NOA. A total of 415 full Landsat frames were processed in the framework of the study. 
These burn scar mapping products are generated for the first time to such a temporal and spatial extent and are ideal for use in further environmental time series analyses, production of statistical indexes (frequency, geographical distribution and number of fires per prefecture) and applications, including change detection and climate change models, urban planning, correlation with manmade activities, etc.
Automatic sentence extraction for the detection of scientific paper relations
NASA Astrophysics Data System (ADS)
Sibaroni, Y.; Prasetiyowati, S. S.; Miftachudin, M.
2018-03-01
The relations between scientific papers help researchers quickly see how papers are interconnected. By observing inter-article relationships, researchers can identify, among other things, the weaknesses of existing research, the performance improvements achieved to date, and the tools or data typically used in research in specific fields. So far, methods that have been developed to detect paper relations include machine learning and rule-based methods. However, a problem remains in the process of sentence extraction from scientific paper documents, which is still done manually. This manual process makes the detection of scientific paper relations slower and less efficient. To overcome this problem, this study performs automatic sentence extraction, while the paper relations are identified based on the citation sentence. The performance of the built system is then compared with that of the manual extraction system. The analysis results suggest that automatic sentence extraction achieves a very high level of performance in the detection of paper relations, close to that of manual sentence extraction.
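A rough sketch of automatic citation-sentence extraction is shown below. The abstract does not describe the extraction rules, so the regular expressions here are assumptions: sentences are split on terminal punctuation followed by a capital letter, and a sentence counts as a citation sentence if it contains a numeric marker such as [3] or an author-year marker such as (Kang et al., 2014). The sample text is invented.

```python
import re

# Assumed citation markers: "[3]", "[1, 2]", "[4-6]" or "(Author, 2014)" / "(Author et al., 2014)".
CITATION = re.compile(
    r"\[\d+(?:\s*[,-]\s*\d+)*\]"
    r"|\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}\)"
)

# Naive sentence boundary: terminal punctuation, whitespace, then a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def citation_sentences(text):
    """Return the sentences of `text` that contain at least one citation marker."""
    sentences = SENTENCE_BOUNDARY.split(text.strip())
    return [sentence for sentence in sentences if CITATION.search(sentence)]


sample = (
    "Term extraction is well studied. "
    "Our method improves on the baseline in [3]. "
    "Prior work used rules (Kang et al., 2014). "
    "We evaluate on abstracts."
)
found = citation_sentences(sample)
```

The extracted citation sentences would then be passed to whatever relation classifier is in use; that step is outside the scope of this sketch.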
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bent, John M.; Faibish, Sorin; Pedone, Jr., James M.
A cluster file system is provided having a plurality of distributed metadata servers with shared access to one or more shared low latency persistent key-value metadata stores. A metadata server comprises an abstract storage interface comprising a software interface module that communicates with at least one shared persistent key-value metadata store providing a key-value interface for persistent storage of key-value metadata. The software interface module provides the key-value metadata to the at least one shared persistent key-value metadata store in a key-value format. The shared persistent key-value metadata store is accessed by a plurality of metadata servers. A metadata request can be processed by a given metadata server independently of other metadata servers in the cluster file system. A distributed metadata storage environment is also disclosed that comprises a plurality of metadata servers having an abstract storage interface to at least one shared persistent key-value metadata store.
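The arrangement described here, several metadata servers sharing one key-value store through a common interface, can be illustrated with a toy sketch. All class and method names are hypothetical, and an in-memory dictionary stands in for the low latency persistent store; nothing below reflects the actual implementation.

```python
class SharedKeyValueStore:
    """Stand-in for the shared persistent key-value metadata store (in-memory here)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class MetadataServer:
    """Hypothetical metadata server that reaches the store only via put/get."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def set_attr(self, path, attr, value):
        # Metadata is handed to the store in key-value form.
        self._store.put(f"{path}#{attr}", value)

    def get_attr(self, path, attr):
        return self._store.get(f"{path}#{attr}")


store = SharedKeyValueStore()
mds_a = MetadataServer("mds-a", store)
mds_b = MetadataServer("mds-b", store)

# One server writes the metadata; another answers a request for it on its own.
mds_a.set_attr("/data/run1.h5", "size", 4096)
size_seen_by_b = mds_b.get_attr("/data/run1.h5", "size")
```

Because both servers read and write the same store, either can serve a request without consulting the other, which is the independence property the abstract describes.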
Event selection services in ATLAS
NASA Astrophysics Data System (ADS)
Cranshaw, J.; Cuhadar-Donszelmann, T.; Gallas, E.; Hrivnac, J.; Kenyon, M.; McGlone, H.; Malon, D.; Mambelli, M.; Nowak, M.; Viegas, F.; Vinek, E.; Zhang, Q.
2010-04-01
ATLAS has developed and deployed event-level selection services based upon event metadata records ("TAGS") and supporting file and database technology. These services allow physicists to extract events that satisfy their selection predicates from any stage of data processing and use them as input to later analyses. One component of these services is a web-based Event-Level Selection Service Interface (ELSSI). ELSSI supports event selection by integrating run-level metadata, luminosity-block-level metadata (e.g., detector status and quality information), and event-by-event information (e.g., triggers passed and physics content). The list of events that survive after some selection criterion is returned in a form that can be used directly as input to local or distributed analysis; indeed, it is possible to submit a skimming job directly from the ELSSI interface using grid proxy credential delegation. ELSSI allows physicists to explore ATLAS event metadata as a means to understand, qualitatively and quantitatively, the distributional characteristics of ATLAS data. In fact, the ELSSI service provides an easy interface to see the highest missing ET events or the events with the most leptons, to count how many events passed a given set of triggers, or to find events that failed a given trigger but nonetheless look relevant to an analysis based upon the results of offline reconstruction, and more. This work provides an overview of ATLAS event-level selection services, with an emphasis upon the interactive Event-Level Selection Service Interface.
Zen and the Art of Virtual Observatory Maintenance
NASA Astrophysics Data System (ADS)
Bargatze, L. F.
2014-12-01
The NASA Science Mission Directive Science Plan stresses that the primary goals of Heliophysics research focus on the understanding of the Sun's influence on the Earth and other bodies in the solar system. The NASA Heliophysics Division has adopted the Virtual Observatory, or VxO, concept in order to enable scientists to easily discover and access all data products relevant to these goals via web portals that act as clearinghouses. Furthermore, Heliophysics discipline scientists have defined the Space Physics Archive Search and Extract (SPASE) metadata schema in order to describe the contents of such applicable data products with detail extending all the way down to the parameter level. One SPASE metadata description file must be written to describe each data product at the global level. And the collection of such data product metadata description files, stored in repositories, provides the searchable content that the VxO web sites require in order to match the list of products to the unique needs of each researcher. The VxO metadata repository content also allows one to provide links to each unique data file contained in the full complement of files on a per data product basis. These links are contained within SPASE "Granule" description files and permit uniform access, worldwide, regardless of data server location thus permitting the VxO clearinghouse capability. The VxO concept is sound in theory but difficult in practice given that the Heliophysics data environment is diverse, ever expanding, and volatile. Thus, it is imperative to update the VxO metadata repositories in order to provide a complete, accurate, and current portrayal of the data environment. Such attention to detail is not a VxO desire but a necessity in order to support Heliophysics researchers and foster VxO user loyalty. 
These basic tenets are applied to the construction of a VxO repository dedicated to providing access to the CDF-formatted data collection hosted on the NASA Goddard CDAWeb data server. Note that the CDF format is self-describing and thus it provides a source of information for initiating SPASE metadata description at the data product level. Also, the CDAWeb data server provides high-quality data product tracking down to the individual data file level, permitting easy updating of SPASE Granule metadata.
NASA Astrophysics Data System (ADS)
Davenport, Jack H.
2016-05-01
Intelligence analysts demand rapid information fusion capabilities to develop and maintain accurate situational awareness and understanding of dynamic enemy threats in asymmetric military operations. The ability to extract relationships between people, groups, and locations from a variety of text datasets is critical to proactive decision making. The derived network of entities must be automatically created and presented to analysts to assist in decision making. DECISIVE ANALYTICS Corporation (DAC) provides capabilities to automatically extract entities, relationships between entities, semantic concepts about entities, and network models of entities from text and multi-source datasets. DAC's Natural Language Processing (NLP) Entity Analytics model entities as complex systems of attributes and interrelationships which are extracted from unstructured text via NLP algorithms. The extracted entities are automatically disambiguated via machine learning algorithms, and resolution recommendations are presented to the analyst for validation; the analyst's expertise is leveraged in this hybrid human/computer collaborative model. Military capability is enhanced by these NLP Entity Analytics because analysts can now create/update an entity profile with intelligence automatically extracted from unstructured text, thereby fusing entity knowledge from structured and unstructured data sources. Operational and sustainment costs are reduced since analysts do not have to manually tag and resolve entities.
The Gemini Recipe System: A Dynamic Workflow for Automated Data Reduction
NASA Astrophysics Data System (ADS)
Labrie, K.; Hirst, P.; Allen, C.
2011-07-01
Gemini's next generation data reduction software suite aims to offer greater automation of the data reduction process without compromising the flexibility required by science programs using advanced or unusual observing strategies. The Recipe System is central to our new data reduction software. Developed in Python, it facilitates near-real time processing for data quality assessment, and both on- and off-line science quality processing. The Recipe System can be run as a standalone application or as the data processing core of an automatic pipeline. Building on concepts that originated in ORAC-DR, a data reduction process is defined in a Recipe written in a science (as opposed to computer) oriented language, and consists of a sequence of data reduction steps called Primitives. The Primitives are written in Python and can be launched from the PyRAF user interface by users wishing for more hands-on optimization of the data reduction process. The fact that the same processing Primitives can be run within both the pipeline context and interactively in a PyRAF session is an important strength of the Recipe System. The Recipe System offers dynamic flow control allowing for decisions regarding processing and calibration to be made automatically, based on the pixel and the metadata properties of the dataset at the stage in processing where the decision is being made, and the context in which the processing is being carried out. Processing history and provenance recording are provided by the AstroData middleware, which also offers header abstraction and data type recognition to facilitate the development of instrument-agnostic processing routines. All observatory or instrument specific definitions are isolated from the core of the AstroData system and distributed in external configuration packages that define a lexicon including classifications, uniform metadata elements, and transformations.
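The idea of a recipe as an ordered sequence of primitives, with decisions taken from the dataset's metadata at the point of execution, can be sketched as follows. Every name here is hypothetical and the dataset is a plain dictionary; this is not the Recipe System or AstroData API.

```python
def subtract_bias(dataset):
    """Hypothetical primitive: remove a constant bias level recorded in the metadata."""
    bias = dataset["meta"]["bias_level"]
    dataset["pixels"] = [value - bias for value in dataset["pixels"]]
    dataset["history"].append("subtract_bias")
    return dataset


def flat_correct(dataset):
    """Hypothetical primitive: divide by a flat field, skipping if none is available."""
    flat = dataset["meta"].get("flat")
    if flat is None:
        # Flow-control decision made from the metadata at this stage of processing.
        dataset["history"].append("flat_correct:skipped")
        return dataset
    dataset["pixels"] = [value / f for value, f in zip(dataset["pixels"], flat)]
    dataset["history"].append("flat_correct")
    return dataset


PRIMITIVES = {"subtract_bias": subtract_bias, "flat_correct": flat_correct}

# A recipe is just an ordered list of primitive names.
QUICK_LOOK = ["subtract_bias", "flat_correct"]


def run_recipe(recipe, dataset):
    """Apply each primitive in order; the same primitives could also be called one by one."""
    for step in recipe:
        dataset = PRIMITIVES[step](dataset)
    return dataset


result = run_recipe(
    QUICK_LOOK,
    {"pixels": [110.0, 120.0], "meta": {"bias_level": 100.0}, "history": []},
)
```

The `history` list plays the role of a minimal processing record, and calling `subtract_bias` directly mirrors the interactive use of individual primitives that the abstract mentions.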
Earth Science Datacasting v2.0
NASA Technical Reports Server (NTRS)
Bingham, Andrew W.; Deen, Robert G.; Hussey, Kevin J.; Stough, Timothy M.; McCleese, Sean W.; Toole, Nicholas T.
2012-01-01
The Datacasting software, which consists of a server and a client, has been developed as part of the Earth Science (ES) Datacasting project. The goal of ES Datacasting is to provide scientists the ability to automatically and continuously download Earth science data that meets a precise, predefined need, and then to instantaneously visualize it on a local computer. This is achieved by applying the concept of podcasting to deliver science data over the Internet using RSS (Really Simple Syndication) XML feeds. By extending the RSS specification, scientists can filter a feed and only download the files that are required for a particular application (for example, only files that contain information about a particular event, such as a hurricane or flood). The extension also provides the ability for the client to understand the format of the data and visualize the information locally. The server part enables a data provider to create and serve basic Datacasting (RSS-based) feeds. The user can subscribe to any number of feeds, view the information related to each item contained within a feed (including browse pre-made images), manually download files associated with items, and place these files in a local store. The client-server architecture enables users to: a) Subscribe and interpret multiple Datacasting feeds (same look and feel as a typical mail client), b) Maintain a list of all items within each feed, c) Enable filtering on the lists based on different metadata attributes contained within the feed (list will reference only data files of interest), d) Visualize the reference data and associated metadata, e) Download files referenced within the list, and f) Automatically download files as new items become available.
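Filtering an extended RSS feed so that only matching files are downloaded can be sketched with the standard library. The namespace URI, the `ex:event` element and the feed contents are invented for illustration; the actual Datacasting extension elements are not given in this abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension namespace; not the real Datacasting schema.
NS = {"ex": "http://example.org/datacasting"}

FEED = """<rss version="2.0" xmlns:ex="http://example.org/datacasting">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Granule A</title>
      <enclosure url="http://example.org/a.nc" type="application/x-netcdf" length="0"/>
      <ex:event>hurricane</ex:event>
    </item>
    <item>
      <title>Granule B</title>
      <enclosure url="http://example.org/b.nc" type="application/x-netcdf" length="0"/>
      <ex:event>none</ex:event>
    </item>
  </channel>
</rss>"""


def select_urls(feed_xml, event):
    """Return enclosure URLs of feed items whose (hypothetical) event tag matches."""
    root = ET.fromstring(feed_xml)
    urls = []
    for item in root.iter("item"):
        tag = item.find("ex:event", NS)
        enclosure = item.find("enclosure")
        if tag is not None and tag.text == event and enclosure is not None:
            urls.append(enclosure.get("url"))
    return urls


to_download = select_urls(FEED, "hurricane")
```

A client would poll the feed periodically and fetch only the URLs returned by such a filter, which corresponds to items e) and f) in the list above.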
iRODS: A Distributed Data Management Cyberinfrastructure for Observatories
NASA Astrophysics Data System (ADS)
Rajasekar, A.; Moore, R.; Vernon, F.
2007-12-01
Large-scale and long-term preservation of both observational and synthesized data requires a system that virtualizes data management concepts. A methodology is needed that can work across long distances in space (distribution) and long periods of time (preservation). The system needs to manage data stored on multiple types of storage systems, including new systems that become available in the future. This concept is called infrastructure independence, and is typically implemented through virtualization mechanisms. Data grids are built upon concepts of data and trust virtualization. These concepts enable the management of collections of data that are distributed across multiple institutions, stored on multiple types of storage systems, and accessed by multiple types of clients. Data virtualization ensures that the name spaces used to identify files, users, and storage systems are persistent, even when files are migrated onto future technology. This is required to preserve authenticity, the link between the record and descriptive and provenance metadata. Trust virtualization ensures that access controls remain invariant as files are moved within the data grid. This is required to track the chain of custody of records over time. The Storage Resource Broker (http://www.sdsc.edu/srb) is one such data grid used in a wide variety of applications in earth and space sciences such as ROADNet (roadnet.ucsd.edu), SEEK (seek.ecoinformatics.org), GEON (www.geongrid.org) and NOAO (www.noao.edu). Recent extensions to data grids provide one more level of virtualization: policy or management virtualization. Management virtualization ensures that execution of management policies can be automated, and that rules can be created that verify assertions about the shared collections of data. When dealing with distributed large-scale data over long periods of time, the policies used to manage the data and provide assurances about the authenticity of the data become paramount. The integrated Rule-Oriented Data System (iRODS) (http://irods.sdsc.edu) provides the mechanisms needed not only to describe management policies, but also to track how the policies are applied and their execution results. The iRODS data grid maps management policies to rules that control the execution of remote micro-services. As an example, a rule can be created that automatically creates a replica whenever a file is added to a specific collection, or extracts its metadata automatically and registers it in a searchable catalog. For the replication operation, the persistent state information consists of the replica location, the creation date, the owner, the replica size, etc. The mechanism used by iRODS for providing policy virtualization is based on well-defined functions, called micro-services, which are chained into alternative workflows using rules. A rule engine based on the event-condition-action paradigm executes the rule-based workflows after an event. Rules can be deferred to a pre-determined time or executed on a periodic basis. As the data management policies evolve, the iRODS system can implement new rules, new micro-services, and new state information (metadata content) needed to manage the new policies. Each sub-collection can be managed using a different set of policies. The discussion of the concepts in rule-based policy virtualization and its application to long-term and large-scale data management for observatories such as ORION and NEON will be the basis of the paper.
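The event-condition-action pattern described above can be sketched in a few lines. This is not the iRODS rule language or its API; the `Rule` class, the two stand-in micro-services, and the collection path are hypothetical names chosen only to show how an event triggers a chain of actions when a condition holds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, str]


@dataclass
class Rule:
    event: str                                # event name that triggers the rule
    condition: Callable[[Context], bool]      # guard evaluated against the event context
    actions: List[Callable[[Context], str]]   # chained "micro-services" (hypothetical)


def replicate(ctx: Context) -> str:
    # Stand-in for a replication micro-service.
    return f"replicated {ctx['path']}"


def extract_metadata(ctx: Context) -> str:
    # Stand-in for a metadata extraction and registration micro-service.
    return f"registered metadata for {ctx['path']}"


RULES = [
    Rule(
        event="put",
        condition=lambda ctx: ctx["collection"] == "/zone/archive",
        actions=[replicate, extract_metadata],
    )
]


def fire(event: str, ctx: Context, rules: List[Rule] = RULES) -> List[str]:
    """Run every rule whose event matches and whose condition holds; return a log."""
    log = []
    for rule in rules:
        if rule.event == event and rule.condition(ctx):
            for action in rule.actions:
                log.append(action(ctx))
    return log


print(fire("put", {"collection": "/zone/archive", "path": "/zone/archive/a.dat"}))
```

The returned log plays the role of the persistent state information mentioned in the abstract: it records which policy steps ran for a given file.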
The Service Environment for Enhanced Knowledge and Research (SEEKR) Framework
NASA Astrophysics Data System (ADS)
King, T. A.; Walker, R. J.; Weigel, R. S.; Narock, T. W.; McGuire, R. E.; Candey, R. M.
2011-12-01
The Service Environment for Enhanced Knowledge and Research (SEEKR) Framework is a configurable service-oriented framework to enable the discovery, access and analysis of data shared in a community. The SEEKR framework integrates many existing independent services through the use of web technologies and standard metadata. Services are hosted on systems by using an application server and are callable by using REpresentational State Transfer (REST) protocols. Messages and metadata are transferred with eXtensible Markup Language (XML) encoding that conforms to a published XML schema. Space Physics Archive Search and Extract (SPASE) metadata is central to utilizing the services. Resources (data, documents, software, etc.) are described with SPASE and the associated Resource Identifier is used to access and exchange resources. The configurable options for the service can be set by using a web interface. Services are packaged as web application resource (WAR) files for direct deployment on application servers such as Tomcat or Jetty. We discuss the composition of the SEEKR framework, how new services can be integrated and the steps necessary to deploy the framework. The SEEKR Framework emerged from NASA's Virtual Magnetospheric Observatory (VMO) and other systems and we present an overview of these systems from a SEEKR Framework perspective.
A Geospatial Data Recommender System based on Metadata and User Behaviour
NASA Astrophysics Data System (ADS)
Li, Y.; Jiang, Y.; Yang, C. P.; Armstrong, E. M.; Huang, T.; Moroni, D. F.; Finch, C. J.; McGibbney, L. J.
2017-12-01
Earth observations are produced at high velocity by real-time sensors, reaching terabytes to petabytes of geospatial data daily. Discovering and accessing the right data in this massive volume is like finding a needle in a haystack. To help researchers find the right data for study and decision support, a considerable body of research on improving search performance has been proposed, including recommendation algorithms. However, few papers have discussed how to implement a recommendation algorithm in a geospatial data retrieval system. To address this problem, we propose a recommendation engine that improves the discovery of relevant geospatial data by mining and utilizing metadata and user behavior data: 1) metadata-based recommendation considers the correlation of each attribute (i.e., spatiotemporal, categorical, and ordinal) to the data to be found; in particular, a phrase extraction method is used to improve the accuracy of the description similarity; 2) user behavior data are utilized to predict the interest of a user through collaborative filtering; 3) an integration method is designed to combine the results of the above two methods to achieve better recommendations. Experiments show that, in the hybrid recommendation list, precision is larger than 0.8 at every position from 1 to 10.
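The abstract does not specify how the two recommendation lists are integrated or how precision is computed, so the following is only a sketch under stated assumptions: the hybrid score is taken to be a weighted sum of a metadata-based score and a behavior-based score, and precision at position k is the fraction of the top k results that are relevant. The function names, dataset names, and scores are hypothetical.

```python
def hybrid_rank(metadata_scores, behavior_scores, weight=0.5):
    """Rank datasets by a weighted sum of two score dictionaries (assumed scheme)."""
    names = set(metadata_scores) | set(behavior_scores)
    combined = {
        n: weight * metadata_scores.get(n, 0.0)
        + (1 - weight) * behavior_scores.get(n, 0.0)
        for n in names
    }
    # Sort by descending score; break ties by name so the order is deterministic.
    return sorted(names, key=lambda n: (-combined[n], n))


def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked items that appear in the relevant set."""
    return sum(1 for n in ranked[:k] if n in relevant) / k


# Made-up scores for three hypothetical datasets.
metadata = {"sst": 0.9, "wind": 0.2, "ice": 0.4}
behavior = {"sst": 0.5, "wind": 0.8}

ranked = hybrid_rank(metadata, behavior)
print(ranked, precision_at_k(ranked, {"sst", "ice"}, 2))
```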
An Open Catalog for Supernova Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Guillochon, James; Parrent, Jerod; Kelley, Luke Zoltan
We present the Open Supernova Catalog, an online collection of observations and metadata for presently 36,000+ supernovae and related candidates. The catalog is freely available on the web (https://sne.space), with its main interface having been designed to be a user-friendly, rapidly searchable table accessible on desktop and mobile devices. In addition to the primary catalog table containing supernova metadata, an individual page is generated for each supernova, which displays its available metadata, light curves, and spectra spanning X-ray to radio frequencies. The data presented in the catalog is automatically rebuilt on a daily basis and is constructed by parsing several dozen sources, including the data presented in the supernova literature and from secondary sources such as other web-based catalogs. Individual supernova data is stored in the hierarchical, human- and machine-readable JSON format, with the entirety of each supernova’s data being contained within a single JSON file bearing its name. The setup we present here, which is based on open-source software maintained via git repositories hosted on github, enables anyone to download the entirety of the supernova data set to their home computer in minutes, and to make contributions of their own data back to the catalog via git. As the supernova data set continues to grow, especially in the upcoming era of all-sky synoptic telescopes, which will increase the total number of events by orders of magnitude, we hope that the catalog we have designed will be a valuable tool for the community to analyze both historical and contemporary supernovae.
An Open Catalog for Supernova Data
NASA Astrophysics Data System (ADS)
Guillochon, James; Parrent, Jerod; Kelley, Luke Zoltan; Margutti, Raffaella
2017-01-01
We present the Open Supernova Catalog, an online collection of observations and metadata for presently 36,000+ supernovae and related candidates. The catalog is freely available on the web (https://sne.space), with its main interface having been designed to be a user-friendly, rapidly searchable table accessible on desktop and mobile devices. In addition to the primary catalog table containing supernova metadata, an individual page is generated for each supernova, which displays its available metadata, light curves, and spectra spanning X-ray to radio frequencies. The data presented in the catalog is automatically rebuilt on a daily basis and is constructed by parsing several dozen sources, including the data presented in the supernova literature and from secondary sources such as other web-based catalogs. Individual supernova data is stored in the hierarchical, human- and machine-readable JSON format, with the entirety of each supernova’s data being contained within a single JSON file bearing its name. The setup we present here, which is based on open-source software maintained via git repositories hosted on github, enables anyone to download the entirety of the supernova data set to their home computer in minutes, and to make contributions of their own data back to the catalog via git. As the supernova data set continues to grow, especially in the upcoming era of all-sky synoptic telescopes, which will increase the total number of events by orders of magnitude, we hope that the catalog we have designed will be a valuable tool for the community to analyze both historical and contemporary supernovae.
Hierarchical video summarization based on context clustering
NASA Astrophysics Data System (ADS)
Tseng, Belle L.; Smith, John R.
2003-11-01
A personalized video summary is dynamically generated in our video personalization and summarization system based on user preference and usage environment. The three-tier personalization system adopts the server-middleware-client architecture in order to maintain, select, adapt, and deliver rich media content to the user. The server stores the content sources along with their corresponding MPEG-7 metadata descriptions. In this paper, the metadata includes visual semantic annotations and automatic speech transcriptions. Our personalization and summarization engine in the middleware selects the optimal set of desired video segments by matching shot annotations and sentence transcripts with user preferences. Besides finding the desired contents, the objective is to present a coherent summary. There are diverse methods for creating summaries, and we focus on the challenges of generating a hierarchical video summary based on context information. In our summarization algorithm, three inputs are used to generate the hierarchical video summary output. These inputs are (1) MPEG-7 metadata descriptions of the contents in the server, (2) user preference and usage environment declarations from the user client, and (3) context information including MPEG-7 controlled term list and classification scheme. In a video sequence, descriptions and relevance scores are assigned to each shot. Based on these shot descriptions, context clustering is performed to collect consecutively similar shots to correspond to hierarchical scene representations. The context clustering is based on the available context information, and may be derived from domain knowledge or rules engines. Finally, the selection of structured video segments to generate the hierarchical summary efficiently balances between scene representation and shot selection.
A prototype system to support evidence-based practice.
Demner-Fushman, Dina; Seckman, Charlotte; Fisher, Cheryl; Hauser, Susan E; Clayton, Jennifer; Thoma, George R
2008-11-06
Translating evidence into clinical practice is a complex process that depends on the availability of evidence, the environment into which the research evidence is translated, and the system that facilitates the translation. This paper presents InfoBot, a system designed for automatic delivery of patient-specific information from evidence-based resources. A prototype system has been implemented to support development of individualized patient care plans. The prototype explores possibilities to automatically extract patients' problems from the interdisciplinary team notes and query evidence-based resources using the extracted terms. Using 4,335 de-identified interdisciplinary team notes for 525 patients, the system automatically extracted biomedical terminology from 4,219 notes and linked resources to 260 patient records. Sixty of those records (15 each for Pediatrics, Oncology & Hematology, Medical & Surgical, and Behavioral Health units) have been selected for an ongoing evaluation of the quality of automatically proactively delivered evidence and its usefulness in development of care plans.
A Prototype System to Support Evidence-based Practice
Demner-Fushman, Dina; Seckman, Charlotte; Fisher, Cheryl; Hauser, Susan E.; Clayton, Jennifer; Thoma, George R.
2008-01-01
Translating evidence into clinical practice is a complex process that depends on the availability of evidence, the environment into which the research evidence is translated, and the system that facilitates the translation. This paper presents InfoBot, a system designed for automatic delivery of patient-specific information from evidence-based resources. A prototype system has been implemented to support development of individualized patient care plans. The prototype explores possibilities to automatically extract patients’ problems from the interdisciplinary team notes and query evidence-based resources using the extracted terms. Using 4,335 de-identified interdisciplinary team notes for 525 patients, the system automatically extracted biomedical terminology from 4,219 notes and linked resources to 260 patient records. Sixty of those records (15 each for Pediatrics, Oncology & Hematology, Medical & Surgical, and Behavioral Health units) have been selected for an ongoing evaluation of the quality of automatically proactively delivered evidence and its usefulness in development of care plans. PMID:18998835
Application of Magnetic Nanoparticles in Pretreatment Device for POPs Analysis in Water
NASA Astrophysics Data System (ADS)
Chu, Dongzhi; Kong, Xiangfeng; Wu, Bingwei; Fan, Pingping; Cao, Xuan; Zhang, Ting
2018-01-01
To reduce the processing time and labour required for POPs pretreatment, and to solve the problem of easily clogged extraction columns, the paper proposes a new extraction and enrichment technology that uses magnetic nanoparticles. The automatic pretreatment system comprises an automatic sampling unit, an extraction enrichment unit and an elution enrichment unit. The paper briefly introduces the preparation of the magnetic nanoparticles and describes in detail the structure and control system of the automatic pretreatment system. Mass recovery experiments with the magnetic nanoparticles showed that the system is capable of preprocessing samples for POPs analysis, and that the recovery rate of the magnetic nanoparticles was over 70%. In conclusion, the authors propose three optimization recommendations.
NASA Astrophysics Data System (ADS)
Peng, G. S.
2016-12-01
Research necessarily expands upon the volume and variety of data used in prior work. Increasingly, investigators look outside their primary areas of expertise for data to incorporate into their research. Locating and using the data that they need, which may be described in terminology from other fields of science or be encoded in unfamiliar data formats, present often insurmountable barriers for potential users. As a data provider of a diverse collection of over 600 atmospheric and oceanic data sets (DS) (http://rda.ucar.edu), we seek to reduce or remove those barriers. Serving a broadening and increasing user base with fixed and finite resources requires automation. Our software harvests metadata descriptors about the data from the data files themselves. Data curators/subject matter experts augment the machine-generated metadata as needed. Metadata powers our data search tools. Users may search for data in a myriad of ways ranging from free text queries to GCMD keywords to faceted searches capable of narrowing down selections by specific criteria. Users are offered customized lists of DSs fitting their criteria with links to DS main information pages that provide detailed information about each DS. Where appropriate, they link to the NCAR Climate Data Guide for expert guidance about strengths and weaknesses of that particular DS. Once users find the data sets they need, we provide modular lessons for common data tasks. The lessons may be data tool install guides, data recipes, blog posts, or short YouTube videos. Rather than overloading users with reams of information, we provide targeted lessons when the user is most receptive, e.g. when they want to use data in an unfamiliar format. We add new material when we discover common points of confusion. Each educational resource is tagged with DS ID numbers so that they are automatically linked with the relevant DSs. How can data providers leverage the work of other data providers? 
Can a common tagging scheme for data education materials help us automatically share our data lessons? Research is at the frontier of knowledge. Questions from users seeking to create new uses for old data will not always be answerable with canned responses. Humans will remain in the loop. But we need automation and cross-center cooperation to reduce the areas spanned by "edge cases."
Impact of translation on named-entity recognition in radiology texts
Pedro, Vasco
2017-01-01
Radiology reports describe the results of radiography procedures and have the potential of being a useful source of information which can bring benefits to health care systems around the world. One way to automatically extract information from the reports is by using Text Mining tools. The problem is that these tools are mostly developed for English and reports are usually written in the native language of the radiologist, which is not necessarily English. This creates an obstacle to the sharing of Radiology information between different communities. This work explores the solution of translating the reports to English before applying the Text Mining tools, probing the question of what translation approach should be used. We created MRRAD (Multilingual Radiology Research Articles Dataset), a parallel corpus of Portuguese research articles related to Radiology and a number of alternative translations (human, automatic and semi-automatic) to English. This is a novel corpus which can be used to move forward the research on this topic. Using MRRAD we studied which kind of automatic or semi-automatic translation approach is more effective on the named-entity recognition task of finding RadLex terms in the English version of the articles. Considering the terms extracted from human translations as our gold standard, we calculated how similar to this standard were the terms extracted using other translations. We found that a completely automatic translation approach using Google leads to F-scores (between 0.861 and 0.868, depending on the extraction approach) similar to the ones obtained through a more expensive semi-automatic translation approach using Unbabel (between 0.862 and 0.870). To better understand the results we also performed a qualitative analysis of the type of errors found in the automatic and semi-automatic translations. Database URL: https://github.com/lasigeBioTM/MRRAD PMID:29220455
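The comparison against a gold standard described above relies on the F-score, the harmonic mean of precision and recall. A minimal sketch follows, assuming that terms are compared as exact strings in sets; the abstract does not state the matching rule, and the example terms are invented rather than taken from MRRAD.

```python
def f_score(gold_terms, extracted_terms):
    """F1 of extracted terms against gold-standard terms, using exact set matching."""
    gold, extracted = set(gold_terms), set(extracted_terms)
    true_positives = len(gold & extracted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)   # share of extracted terms that are correct
    recall = true_positives / len(gold)           # share of gold terms that were found
    return 2 * precision * recall / (precision + recall)


# Invented example: terms from a human translation versus a machine translation.
gold = {"lung", "nodule", "pleura", "thorax"}
machine = {"lung", "nodule", "thorax", "rib"}
print(f_score(gold, machine))
```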
GeoDeepDive: Towards a Machine Reading-Ready Digital Library and Information Integration Resource
NASA Astrophysics Data System (ADS)
Husson, J. M.; Peters, S. E.; Livny, M.; Ross, I.
2015-12-01
Recent developments in machine reading and learning approaches to text and data mining hold considerable promise for accelerating the pace and quality of literature-based data synthesis, but these advances have outpaced even basic levels of access to the published literature. For many geoscience domains, particularly those based on physical samples and field-based descriptions, this limitation is significant. Here we describe a general infrastructure to support published literature-based machine reading and learning approaches to information integration and knowledge base creation. This infrastructure supports rate-controlled automated fetching of original documents, along with full bibliographic citation metadata, from remote servers, the secure storage of original documents, and the utilization of considerable high-throughput computing resources for the pre-processing of these documents by optical character recognition, natural language parsing, and other document annotation and parsing software tools. New tools and versions of existing tools can be automatically deployed against original documents when they are made available. The products of these tools (text/XML files) are managed by MongoDB and are available for use in data extraction applications. Basic search and discovery functionality is provided by ElasticSearch, which is used to identify documents of potential relevance to a given data extraction task. Relevant files derived from the original documents are then combined into basic starting points for application building; these starting points are kept up-to-date as new relevant documents are incorporated into the digital library. Currently, our digital library contains more than 360K documents supplied by Elsevier and the USGS and we are actively seeking additional content providers. 
By focusing on building a dependable infrastructure to support the retrieval, storage, and pre-processing of published content, we are establishing a foundation for complex, and continually improving, information integration and data extraction applications. We have developed one such application, which we present as an example, and invite new collaborations to develop other such applications.
Log-less metadata management on metadata server for parallel file systems.
Liao, Jianwei; Xiao, Guoqiang; Peng, Xiaoning
2014-01-01
This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage to achieve a highly available metadata service, which also improves the performance of metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than the overhead incurred by the metadata server when it adopts logging or journaling to yield a highly available metadata service. The experimental results show that the proposed mechanism can significantly improve the speed of metadata processing and deliver better I/O data throughput than conventional metadata management schemes, that is, logging or journaling on the MDS. In addition, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients when the metadata server has crashed or unexpectedly entered a nonoperational state.
Log-Less Metadata Management on Metadata Server for Parallel File Systems
Xiao, Guoqiang; Peng, Xiaoning
2014-01-01
This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage to achieve a highly available metadata service, which also improves the performance of metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than the overhead incurred by the metadata server when it adopts logging or journaling to yield a highly available metadata service. The experimental results show that the proposed mechanism can significantly improve the speed of metadata processing and deliver better I/O data throughput than conventional metadata management schemes, that is, logging or journaling on the MDS. In addition, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients when the metadata server has crashed or unexpectedly entered a nonoperational state. PMID:24892093
Research on Automatic Classification, Indexing and Extracting. Annual Progress Report.
ERIC Educational Resources Information Center
Baker, F.T.; And Others
In order to contribute to the success of several studies for automatic classification, indexing and extracting currently in progress, as well as to further the theoretical and practical understanding of textual item distributions, the development of a frequency program capable of supplying these types of information was undertaken. The program…
The collection of Intelligence, Surveillance, and Reconnaissance (ISR) Full Motion Video (FMV) is growing at an exponential rate, and the manual... intelligence for the warfighter. This paper will address the question of how can automatic pattern extraction, based on computer vision, extract anomalies in
NASA Astrophysics Data System (ADS)
Li, Lin; Li, Dalin; Zhu, Haihong; Li, You
2016-10-01
Street trees interlaced with other objects in cluttered point clouds of urban scenes inhibit the automatic extraction of individual trees. This paper proposes a method for the automatic extraction of individual trees from mobile laser scanning data, according to the general constitution of trees. The two components of each individual tree, a trunk and a crown, can be extracted by the dual growing method. This method consists of coarse classification, through which most artifacts are removed; the automatic selection of appropriate seeds for individual trees, by which the common manual initial setting is avoided; a dual growing process that separates one tree from others by circumscribing a trunk in an adaptive growing radius and segmenting a crown in constrained growing regions; and a refining process that draws a singular trunk from the other interlaced objects. The method is verified on two datasets with over 98% completeness and over 96% correctness. The low mean absolute percentage errors in capturing the morphological parameters of individual trees indicate that this method can output individual trees with high precision.
Extraction of small boat harmonic signatures from passive sonar.
Ogden, George L; Zurk, Lisa M; Jones, Mark E; Peterson, Mary E
2011-06-01
This paper investigates the extraction of acoustic signatures from small boats using a passive sonar system. Noise radiated from a small boat consists of broadband noise and harmonically related tones that correspond to engine and propeller specifications. A signal processing method to automatically extract the harmonic structure of noise radiated from small boats is developed. The Harmonic Extraction and Analysis Tool (HEAT) estimates the instantaneous fundamental frequency of the harmonic tones, refines the fundamental frequency estimate using a Kalman filter, and automatically extracts the amplitudes of the harmonic tonals to generate a harmonic signature for the boat. Results are presented that show the HEAT algorithm's ability to extract these signatures. © 2011 Acoustical Society of America
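The refinement step mentioned above can be illustrated with a scalar Kalman filter. The abstract does not describe HEAT's state model, so this sketch assumes the simplest case: the fundamental frequency follows a random walk and each frame yields one noisy frequency measurement. The variances and the sample values are made-up illustration parameters.

```python
def kalman_smooth(measurements, process_var=0.01, measurement_var=4.0):
    """Refine noisy per-frame frequency estimates with a scalar random-walk Kalman filter."""
    estimate = measurements[0]      # initialise the state from the first measurement
    variance = measurement_var      # initial uncertainty of that state
    smoothed = [estimate]
    for z in measurements[1:]:
        variance += process_var                         # predict: uncertainty grows
        gain = variance / (variance + measurement_var)  # Kalman gain, between 0 and 1
        estimate += gain * (z - estimate)               # update towards the measurement
        variance *= 1 - gain                            # uncertainty shrinks after update
        smoothed.append(estimate)
    return smoothed


# Made-up fundamental frequency measurements in Hz.
print(kalman_smooth([50.0, 52.0, 48.0, 50.0]))
```

Because each update is a convex combination of the previous estimate and the new measurement, the refined track always stays within the range of the measurements while varying less from frame to frame.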
#Healthy Selfies: Exploration of Health Topics on Instagram.
Muralidhara, Sachin; Paul, Michael J
2018-06-29
Social media provides a complementary source of information for public health surveillance. The dominant data source for this type of monitoring is the microblogging platform Twitter, which is convenient due to the free availability of public data. Less is known about the utility of other social media platforms, despite their popularity. This work aims to characterize the health topics that are prominently discussed in the image-sharing platform Instagram, as a step toward understanding how this data might be used for public health research. The study uses a topic modeling approach to discover topics in a dataset of 96,426 Instagram posts containing hashtags related to health. We use a polylingual topic model, initially developed for datasets in different natural languages, to model different modalities of data: hashtags, caption words, and image tags automatically extracted using a computer vision tool. We identified 47 health-related topics in the data (kappa=.77), covering ten broad categories: acute illness, alternative medicine, chronic illness and pain, diet, exercise, health care & medicine, mental health, musculoskeletal health and dermatology, sleep, and substance use. The most prevalent topics were related to diet (8,293/96,426; 8.6% of posts) and exercise (7,328/96,426; 7.6% of posts). A large and diverse set of health topics is discussed on Instagram. The extracted image tags were generally too coarse and noisy to be used for identifying posts but were in some cases accurate for identifying images relevant to studying diet and substance use. Instagram shows potential as a source of public health information, though limitations in data collection and metadata availability may limit its use in comparison to platforms like Twitter. ©Sachin Muralidhara, Michael J. Paul. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 29.06.2018.
A Forensic Examination of Online Search Facility URL Record Structures.
Horsman, Graeme
2018-05-29
The use of search engines and associated search functions to locate content online is now common practice. As a result, a forensic examination of a suspect's online search activity can be a critical aspect in establishing whether an offense has been committed in many investigations. This article offers an analysis of online search URL structures to help law enforcement and associated digital forensics practitioners interpret acts of online searching during an investigation. Google, Bing, Yahoo!, and DuckDuckGo searching functions are examined, and key URL attribute structures and metadata have been documented. In addition, an overview of social media searching covering Twitter, Facebook, Instagram, and YouTube is offered. Results show the ability to extract embedded metadata from search engine URLs which can establish online searching behaviors and the timing of searches. © 2018 American Academy of Forensic Sciences.
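Extracting search terms from a URL of the kind discussed above can be sketched with the Python standard library. The attribute structures documented in the article are not reproduced in this abstract, so the query parameter names `q` and `p` and the example URL below are assumptions for illustration only.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms(url, query_keys=("q", "p")):
    """Return (host, search string) from a search URL, or (host, None) if no key matches.

    `query_keys` lists candidate parameter names; these are assumed, not taken
    from the article.
    """
    parts = urlsplit(url)
    params = parse_qs(parts.query)   # decodes percent-encoding and '+' as space
    for key in query_keys:
        if key in params:
            return parts.netloc, params[key][0]
    return parts.netloc, None


# Hypothetical URL as it might appear in a browser history record.
print(search_terms("https://www.example.com/search?q=forensic+url+analysis&hl=en"))
```

Other attributes in the same query string can be read from the parsed parameters in the same way once their meaning has been established for the search engine in question.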
Intuitive web-based experimental design for high-throughput biomedical data.
Friedrich, Andreas; Kenar, Erhan; Kohlbacher, Oliver; Nahnsen, Sven
2015-01-01
Big data bioinformatics aims at drawing biological conclusions from huge and complex biological datasets. Added value from the analysis of big data, however, is only possible if the data is accompanied by accurate metadata annotation. Particularly in high-throughput experiments, intelligent approaches are needed to keep track of the experimental design, including the conditions that are studied as well as information that might be interesting for failure analysis or further experiments in the future. In addition to the management of this information, means for an integrated design and interfaces for structured data annotation are urgently needed by researchers. Here, we propose a factor-based experimental design approach that enables scientists to easily create large-scale experiments with the help of a web-based system. We present a novel implementation of a web-based interface allowing the collection of arbitrary metadata. To exchange and edit information we provide a spreadsheet-based, human-readable format. Subsequently, sample sheets with identifiers and metainformation for data generation facilities can be created. Data files created after measurement of the samples can be uploaded to a datastore, where they are automatically linked to the previously created experimental design model.
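A factor-based design of the kind described above can be illustrated as a full factorial expansion: every combination of factor levels, repeated for each replicate, becomes one row of a sample sheet. This is a sketch under that assumption, not the authors' implementation; the function name, the identifier format, and the example factors are hypothetical.

```python
from itertools import product


def build_design(factors, replicates=1, prefix="S"):
    """Expand factors (name -> list of levels) into sample-sheet rows with identifiers."""
    names = list(factors)
    rows = []
    for levels in product(*(factors[n] for n in names)):
        for rep in range(1, replicates + 1):
            row = dict(zip(names, levels))
            row["replicate"] = rep
            row["sample_id"] = f"{prefix}{len(rows) + 1:03d}"   # e.g. S001, S002, ...
            rows.append(row)
    return rows


# Two hypothetical factors with two levels each, measured in triplicate.
design = build_design(
    {"genotype": ["wt", "ko"], "treatment": ["ctrl", "drug"]},
    replicates=3,
)
print(len(design), design[0])
```

Each row can then be written to a spreadsheet, and the `sample_id` column gives the key by which uploaded data files could later be linked back to the design.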
A spatial data handling system for retrieval of images by unrestricted regions of user interest
NASA Technical Reports Server (NTRS)
Dorfman, Erik; Cromp, Robert F.
1992-01-01
The Intelligent Data Management (IDM) project at NASA/Goddard Space Flight Center has prototyped an Intelligent Information Fusion System (IIFS), which automatically ingests metadata from remote sensor observations into a large catalog which is directly queryable by end-users. The greatest challenge in the implementation of this catalog was supporting spatially-driven searches, where the user has a possibly complex region of interest and wishes to recover those images that overlap all or simply a part of that region. A spatial data management system is described, which is capable of storing and retrieving records of image data regardless of their source. This system was designed and implemented as part of the IIFS catalog. A new data structure, called a hypercylinder, is central to the design. The hypercylinder is specifically tailored for data distributed over the surface of a sphere, such as satellite observations of the Earth or space. Operations on the hypercylinder are regulated by two expert systems. The first governs the ingest of new metadata records, and maintains the efficiency of the data structure as it grows. The second translates, plans, and executes users' spatial queries, performing incremental optimization as partial query results are returned.
Methods for automatically analyzing humpback song units.
Rickwood, Peter; Taylor, Andrew
2008-03-01
This paper presents mathematical techniques for automatically extracting and analyzing bioacoustic signals. Automatic techniques are described for isolation of target signals from background noise, extraction of features from target signals and unsupervised classification (clustering) of the target signals based on these features. The only user-provided input, other than raw sound, is an initial set of signal processing and control parameters. Of particular note is that the number of signal categories is determined automatically. The techniques, applied to hydrophone recordings of humpback whales (Megaptera novaeangliae), produce promising initial results, suggesting that they may be of use in automated analysis not only of humpbacks but possibly also in other bioacoustic settings where automated analysis is desirable.
An algorithm for automatic parameter adjustment for brain extraction in BrainSuite
NASA Astrophysics Data System (ADS)
Rajagopal, Gautham; Joshi, Anand A.; Leahy, Richard M.
2017-02-01
Brain Extraction (classification of brain and non-brain tissue) of MRI brain images is a crucial pre-processing step necessary for imaging-based anatomical studies of the human brain. Several automated methods and software tools are available for performing this task, but differences in MR image parameters (pulse sequence, resolution) and instrument- and subject-dependent noise and artefacts affect the performance of these automated methods. We describe and evaluate a method that automatically adapts the default parameters of the Brain Surface Extraction (BSE) algorithm to optimize a cost function chosen to reflect accurate brain extraction. BSE uses a combination of anisotropic filtering, Marr-Hildreth edge detection, and binary morphology for brain extraction. Our algorithm automatically adapts four parameters associated with these steps to maximize the brain surface area to volume ratio. We evaluate the method on a total of 109 brain volumes with ground truth brain masks generated by an expert user. A quantitative evaluation of the performance of the proposed algorithm showed an improvement in the mean (s.d.) Dice coefficient from 0.8969 (0.0376) for default parameters to 0.9509 (0.0504) for the optimized case. These results indicate that automatic parameter optimization can result in significant improvements in definition of the brain mask.
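For readers unfamiliar with the evaluation metric quoted above, the Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The following minimal Python sketch computes it on flattened masks; the function name and the toy masks are illustrative assumptions and are unrelated to the BrainSuite code or data.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) of two flattened binary masks."""
    a = [bool(v) for v in mask_a]
    b = [bool(v) for v in mask_b]
    if len(a) != len(b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    if total == 0:
        # Convention chosen here: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Toy masks (hypothetical): 1 marks a voxel labelled as brain.
predicted = [1, 1, 1, 0, 0, 0]
ground_truth = [0, 1, 1, 1, 0, 0]
score = dice_coefficient(predicted, ground_truth)
print(round(score, 4))
```

A value of 1 means the automatic mask and the expert mask coincide exactly, and 0 means they share no voxels.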
ERIC Educational Resources Information Center
DOLBY, J.L.; AND OTHERS
THE STUDY IS CONCERNED WITH THE LINGUISTIC PROBLEM INVOLVED IN TEXT COMPRESSION--EXTRACTING, INDEXING, AND THE AUTOMATIC CREATION OF SPECIAL-PURPOSE CITATION DICTIONARIES. IN SPITE OF EARLY SUCCESS IN USING LARGE-SCALE COMPUTERS TO AUTOMATE CERTAIN HUMAN TASKS, THESE PROBLEMS REMAIN AMONG THE MOST DIFFICULT TO SOLVE. ESSENTIALLY, THE PROBLEM IS TO…
NASA Astrophysics Data System (ADS)
Šilhavý, Jakub; Minár, Jozef; Mentlík, Pavel; Sládek, Ján
2016-07-01
This paper presents a new method of automatic lineament extraction which includes the removal of the 'artefacts effect' which is associated with the process of raster based analysis. The core of the proposed Multi-Hillshade Hierarchic Clustering (MHHC) method incorporates a set of variously illuminated and rotated hillshades in combination with hierarchic clustering of derived 'protolineaments'. The algorithm also includes classification into positive and negative lineaments. MHHC was tested in two different territories in Bohemian Forest and Central Western Carpathians. The original vector-based algorithm was developed for comparison of the individual lineaments proximity. Its use confirms the compatibility of manual and automatic extraction and their similar relationships to structural data in the study areas.
Automatic extraction of road features in urban environments using dense ALS data
NASA Astrophysics Data System (ADS)
Soilán, Mario; Truong-Hong, Linh; Riveiro, Belén; Laefer, Debra
2018-02-01
This paper describes a methodology that automatically extracts semantic information from urban ALS data for urban parameterization and road network definition. First, building façades are segmented from the ground surface by combining knowledge-based information with both voxel and raster data. Next, heuristic rules and unsupervised learning are applied to the ground surface data to distinguish sidewalk and pavement points as a means for curb detection. Then radiometric information is employed for road marking extraction. Using high-density ALS data from Dublin, Ireland, this fully automatic workflow was able to generate an F-score close to 95% for pavement and sidewalk identification with a resolution of 20 cm and better than 80% for road marking detection.
2D automatic body-fitted structured mesh generation using advancing extraction method
NASA Astrophysics Data System (ADS)
Zhang, Yaoxin; Jia, Yafei
2018-01-01
This paper presents an automatic mesh generation algorithm for body-fitted structured meshes in Computational Fluid Dynamics (CFD) analysis using the Advancing Extraction Method (AEM). The method is applicable to two-dimensional domains with complex geometries, which have a hierarchical tree-like topography with extrusion-like structures (i.e., branches or tributaries) and intrusion-like structures (i.e., peninsulas or dikes). With the AEM, the hierarchical levels of sub-domains can be identified, and the block boundary of each sub-domain in convex polygon shape in each level can be extracted in an advancing scheme. Several examples are used to illustrate the implementation of the method and the effectiveness and applicability of the proposed algorithm for automatic structured mesh generation.
Automatic extraction of discontinuity orientation from rock mass surface 3D point cloud
NASA Astrophysics Data System (ADS)
Chen, Jianqin; Zhu, Hehua; Li, Xiaojun
2016-10-01
This paper presents a new method for extracting discontinuity orientation automatically from rock mass surface 3D point clouds. The proposed method consists of four steps: (1) automatic grouping of discontinuity sets using an improved K-means clustering method, (2) discontinuity segmentation and optimization, (3) discontinuity plane fitting using the Random Sample Consensus (RANSAC) method, and (4) coordinate transformation of the discontinuity plane. The method is first validated on the point cloud of a small piece of a rock slope acquired by photogrammetry. The extracted discontinuity orientations are compared with those measured in the field. Then it is applied to publicly available LiDAR data of a road cut rock slope from the Rockbench repository. The extracted discontinuity orientations are compared with those obtained by the method proposed by Riquelme et al. (2014). The results show that the presented method is reliable and of high accuracy, and can meet engineering needs.
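Step (3) of the method above relies on RANSAC plane fitting. The following is a minimal sketch of generic RANSAC for a single plane, written in plain Python; it is not the authors' implementation, and the distance threshold, iteration count, function names and toy point set are all assumptions chosen for illustration.

```python
import math
import random


def plane_through(p, q, r):
    """Return (unit normal, offset d) with n.x + d = 0, or None if collinear."""
    u = (q[0] - p[0], q[1] - p[1], q[2] - p[2])
    v = (r[0] - p[0], r[1] - p[1], r[2] - p[2])
    # The cross product of two in-plane vectors gives the plane normal.
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    length = math.sqrt(n[0] ** 2 + n[1] ** 2 + n[2] ** 2)
    if length < 1e-12:
        return None
    n = (n[0] / length, n[1] / length, n[2] / length)
    d = -(n[0] * p[0] + n[1] * p[1] + n[2] * p[2])
    return n, d


def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Keep the candidate plane supported by the largest number of inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = plane_through(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        inliers = [pt for pt in points
                   if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Toy data (hypothetical): a flat 5 x 5 patch at z = 0 plus three outliers.
surface = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
outliers = [(1.0, 1.0, 5.0), (3.0, 2.0, -4.0), (0.5, 4.0, 7.0)]
model, inliers = ransac_plane(surface + outliers)
print(len(inliers), model[0])
```

The unit normal in `model[0]` is the quantity that a later step would convert into an orientation (dip and dip direction); that conversion is not shown here.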
Automatic extraction of blocks from 3D point clouds of fractured rock
NASA Astrophysics Data System (ADS)
Chen, Na; Kemeny, John; Jiang, Qinghui; Pan, Zhiwen
2017-12-01
This paper presents a new method for extracting blocks and calculating block size automatically from rock surface 3D point clouds. Block size is an important rock mass characteristic and forms the basis for several rock mass classification schemes. The proposed method consists of four steps: 1) the automatic extraction of discontinuities using an improved Ransac Shape Detection method, 2) the calculation of discontinuity intersections based on plane geometry, 3) the extraction of block candidates based on three discontinuities intersecting one another to form corners, and 4) the identification of "true" blocks using an improved Floodfill algorithm. The calculated block sizes were compared with manual measurements in two case studies, one with fabricated cardboard blocks and the other from an actual rock mass outcrop. The results demonstrate that the proposed method is accurate and overcomes the inaccuracies, safety hazards, and biases of traditional techniques.
Kim, Heejun; Bian, Jiantao; Mostafa, Javed; Jonnalagadda, Siddhartha; Del Fiol, Guilherme
2016-01-01
Motivation: Clinicians need up-to-date evidence from high quality clinical trials to support clinical decisions. However, applying evidence from the primary literature requires significant effort. Objective: To examine the feasibility of automatically extracting key clinical trial information from ClinicalTrials.gov. Methods: We assessed the coverage of ClinicalTrials.gov for high quality clinical studies that are indexed in PubMed. Using 140 random ClinicalTrials.gov records, we developed and tested rules for the automatic extraction of key information. Results: The rate of high quality clinical trial registration in ClinicalTrials.gov increased from 0.2% in 2005 to 17% in 2015. Trials reporting results increased from 3% in 2005 to 19% in 2015. The accuracy of the automatic extraction algorithm for 10 trial attributes was 90% on average. Future research is needed to improve the algorithm accuracy and to design information displays to optimally present trial information to clinicians.
NASA Astrophysics Data System (ADS)
McInnes, B.; Brown, A.; Liffers, M.
2015-12-01
Publicly funded laboratories have a responsibility to generate, archive and disseminate analytical data to the research community. Laboratory managers know, however, that a long tail of analytical effort never escapes researchers' thumb drives once they leave the lab. This work reports on a research data management project (Digital Mineralogy Library) where integrated hardware and software systems automatically archive and deliver analytical data and metadata to institutional and community data portals. The scientific objective of the DML project was to quantify the modal abundance of heavy minerals extracted from key lithological units in Western Australia. The selected analytical platform was a TESCAN Integrated Mineral Analyser (TIMA) that uses EDS-based mineral classification software to image and quantify mineral abundance and grain size at micron scale resolution. The analytical workflow used a bespoke laboratory information management system (LIMS) to orchestrate: (1) the preparation of grain mounts with embedded QR codes that serve as enduring links between physical samples and analytical data, (2) the assignment of an International Geo Sample Number (IGSN) and Digital Object Identifier (DOI) to each grain mount via the System for Earth Sample Registry (SESAR), (3) the assignment of a DOI to instrument metadata via Research Data Australia, (4) the delivery of TIMA analytical outputs, including spatially registered mineralogy images and mineral abundance data, to an institutionally-based data management server, and (5) the downstream delivery of a final data product via a Google Maps interface such as the AuScope Discovery Portal. The modular design of the system permits the networking of multiple instruments within a single site or multiple collaborating research institutions.
Although sharing analytical data does provide new opportunities for the geochemistry community, the creation of an open data network requires: (1) adopting open data reporting standards and conventions, (2) requiring instrument manufacturers and software developers to deliver and process data in formats compatible with open standards, and (3) public funding agencies to incentivise researchers, laboratories and institutions to make their data open and accessible to consumers.
Metadata for Web Resources: How Metadata Works on the Web.
ERIC Educational Resources Information Center
Dillon, Martin
This paper discusses bibliographic control of knowledge resources on the World Wide Web. The first section sets the context of the inquiry. The second section covers the following topics related to metadata: (1) definitions of metadata, including metadata as tags and as descriptors; (2) metadata on the Web, including general metadata systems,…
Metadata Dictionary Database: A Proposed Tool for Academic Library Metadata Management
ERIC Educational Resources Information Center
Southwick, Silvia B.; Lampert, Cory
2011-01-01
This article proposes a metadata dictionary (MDD) be used as a tool for metadata management. The MDD is a repository of critical data necessary for managing metadata to create "shareable" digital collections. An operational definition of metadata management is provided. The authors explore activities involved in metadata management in…
Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations
Martínez-Romero, Marcos; O’Connor, Martin J.; Shankar, Ravi D.; Panahiazar, Maryam; Willrett, Debra; Egyedi, Attila L.; Gevaert, Olivier; Graybeal, John; Musen, Mark A.
2017-01-01
In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository. PMID:29854196
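The abstract describes a value recommendation framework that draws on previously entered metadata. As a rough illustration of that general idea only, the sketch below ranks candidate values for a field by how often they co-occurred with the values already filled in. This simple frequency ranking is a stand-in chosen for the example and is not the authors' algorithm; the field names and records are hypothetical, and no ontology lookup is attempted.

```python
from collections import Counter


def recommend_values(records, target_field, context, top_k=3):
    """Rank values of target_field seen in records that match the context."""
    counts = Counter()
    for record in records:
        if target_field not in record:
            continue
        # Only count records that agree with every field already filled in.
        if all(record.get(key) == value for key, value in context.items()):
            counts[record[target_field]] += 1
    total = sum(counts.values())
    # An empty Counter yields an empty list, so total is never zero below.
    return [(value, n / total) for value, n in counts.most_common(top_k)]


# Hypothetical previously entered metadata records.
history = [
    {"organism": "human", "tissue": "liver"},
    {"organism": "human", "tissue": "liver"},
    {"organism": "human", "tissue": "brain"},
    {"organism": "mouse", "tissue": "liver"},
]
suggestions = recommend_values(history, "tissue", {"organism": "human"})
print(suggestions)
```

Each suggestion is paired with the fraction of matching records that used it, which a form could use to order a drop-down list.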
Towards structured sharing of raw and derived neuroimaging data across existing resources
Keator, D.B.; Helmer, K.; Steffener, J.; Turner, J.A.; Van Erp, T.G.M.; Gadde, S.; Ashish, N.; Burns, G.A.; Nichols, B.N.
2013-01-01
Data sharing efforts increasingly contribute to the acceleration of scientific discovery. Neuroimaging data is accumulating in distributed domain-specific databases and there is currently no integrated access mechanism nor an accepted format for the critically important meta-data that is necessary for making use of the combined, available neuroimaging data. In this manuscript, we present work from the Derived Data Working Group, an open-access group sponsored by the Biomedical Informatics Research Network (BIRN) and the International Neuroimaging Coordinating Facility (INCF) focused on practical tools for distributed access to neuroimaging data. The working group develops models and tools facilitating the structured interchange of neuroimaging meta-data and is making progress towards a unified set of tools for such data and meta-data exchange. We report on the key components required for integrated access to raw and derived neuroimaging data as well as associated meta-data and provenance across neuroimaging resources. The components include (1) a structured terminology that provides semantic context to data, (2) a formal data model for neuroimaging with robust tracking of data provenance, (3) a web service-based application programming interface (API) that provides a consistent mechanism to access and query the data model, and (4) a provenance library that can be used for the extraction of provenance data by image analysts and imaging software developers. We believe that the framework and set of tools outlined in this manuscript have great potential for solving many of the issues the neuroimaging community faces when sharing raw and derived neuroimaging data across the various existing database systems for the purpose of accelerating scientific discovery. PMID:23727024
Harvesting NASA's Common Metadata Repository (CMR)
NASA Technical Reports Server (NTRS)
Shum, Dana; Durbin, Chris; Norton, James; Mitchell, Andrew
2017-01-01
As part of NASA's Earth Observing System Data and Information System (EOSDIS), the Common Metadata Repository (CMR) stores metadata for over 30,000 datasets from both NASA and international providers along with over 300M granules. This metadata enables sub-second discovery and facilitates data access. While the CMR offers a robust temporal, spatial and keyword search functionality to the general public and international community, it is sometimes more desirable for international partners to harvest the CMR metadata and merge the CMR metadata into a partner's existing metadata repository. This poster will focus on best practices to follow when harvesting CMR metadata to ensure that any changes made to the CMR can also be updated in a partner's own repository. Additionally, since each partner has distinct metadata formats they are able to consume, the best practices will also include guidance on retrieving the metadata in the desired metadata format using CMR's Unified Metadata Model translation software.
Harvesting NASA's Common Metadata Repository
NASA Astrophysics Data System (ADS)
Shum, D.; Mitchell, A. E.; Durbin, C.; Norton, J.
2017-12-01
As part of NASA's Earth Observing System Data and Information System (EOSDIS), the Common Metadata Repository (CMR) stores metadata for over 30,000 datasets from both NASA and international providers along with over 300M granules. This metadata enables sub-second discovery and facilitates data access. While the CMR offers a robust temporal, spatial and keyword search functionality to the general public and international community, it is sometimes more desirable for international partners to harvest the CMR metadata and merge the CMR metadata into a partner's existing metadata repository. This poster will focus on best practices to follow when harvesting CMR metadata to ensure that any changes made to the CMR can also be updated in a partner's own repository. Additionally, since each partner has distinct metadata formats they are able to consume, the best practices will also include guidance on retrieving the metadata in the desired metadata format using CMR's Unified Metadata Model translation software.
Mining the pharmacogenomics literature—a survey of the state of the art
Cohen, K. Bretonnel; Garten, Yael; Shah, Nigam H.
2012-01-01
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research. PMID:22833496
Mining the pharmacogenomics literature--a survey of the state of the art.
Hahn, Udo; Cohen, K Bretonnel; Garten, Yael; Shah, Nigam H
2012-07-01
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
Geophysical Event Casting: Assembling & Broadcasting Data Relevant to Events and Disasters
NASA Astrophysics Data System (ADS)
Manipon, G. M.; Wilson, B. D.
2012-12-01
Broadcast Atom feeds are already being used to publish metadata and support discovery of data collections, granules, and web services. Such data and service casting advertises the existence of new granules in a dataset and available services to access or transform data. Similarly, data and services relevant to studying topical geophysical events (earthquakes, hurricanes, etc.) or periodic/regional structures (El Nino, deep convection) can be broadcast by publishing new entries and links in a feed for that topic. By using the GeoRSS conventions, the time and space location of the event (e.g. a moving hurricane track) is specified in the feed, along with science description, images, relevant data granules, and links to useful web services (e.g. OGC/WMS). The topic cast is used to assemble all of the relevant data/images as they come in, and publish the metadata (images, links, services) to a broad group of subscribers. All of the information in the feed is structured using standardized XML tags (e.g. georss for space & time, and tags to point to external data & services), and is thus machine-readable, which is an improvement over collecting ad hoc links on a wiki. We have created a software suite in Python to generate such "event casts" when a geophysical event first happens, then update them with more information as it becomes available, and display them as an event album in a web browser. Figure 1 shows a snapshot of our Event Cast Browser displaying information from a set of casts about the hurricanes in the Western Pacific during the year 2011. The 19th cyclone is selected in the left panel, so the top right panels display the entries in that feed with metadata such as maximum wind speed, while the bottom right panel displays the hurricane track (positions every 12 hours) as KML in the Google Earth plug-in, where additional data/image layers from the feed can be turned on or off by the user.
The software automatically converts (georss) space & time information to KML placemarks, and can also generate various KML visualizations for other data layers that are pointed to in the feed. The user can replay all of the data images as an animation over the several days as the cyclone develops. The goal of "event casting" is to standardize several metadata micro-formats and use them within Atom feeds to create a rich ecosystem of topical event data that can be automatically manipulated by scripts and many interfaces. For our event cast browser, the same code can display all kinds of casts, whether about hurricanes, fire, earthquakes, or even El Nino. The presentation will describe: the event cast format and its standard micro-formats, software to generate and augment casts, and the browser GUI with KML visualizations.
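The conversion of GeoRSS space and time information to KML placemarks mentioned in this abstract can be sketched with the Python standard library. The example below is an assumption-laden illustration, not the authors' software suite: it handles only GeoRSS-Simple `georss:point` elements, the sample feed is hypothetical, and the function name is invented for the example. Note that GeoRSS points are written "latitude longitude" while KML coordinates are "longitude,latitude".

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}
KML_NS = "http://www.opengis.net/kml/2.2"


def atom_to_kml(atom_xml):
    """Build a KML document with one Placemark per geo-tagged Atom entry."""
    feed = ET.fromstring(atom_xml)
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for entry in feed.findall("atom:entry", NS):
        point = entry.find("georss:point", NS)
        if point is None or not point.text:
            continue  # entries without a location are skipped
        lat, lon = point.text.split()
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        name = ET.SubElement(placemark, f"{{{KML_NS}}}name")
        name.text = entry.findtext("atom:title", default="", namespaces=NS)
        updated = entry.findtext("atom:updated", namespaces=NS)
        if updated:
            stamp = ET.SubElement(placemark, f"{{{KML_NS}}}TimeStamp")
            ET.SubElement(stamp, f"{{{KML_NS}}}when").text = updated
        geometry = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude.
        ET.SubElement(geometry, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical one-entry feed used only to exercise the function.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <title>Hypothetical cyclone cast</title>
  <entry>
    <title>Track position 1</title>
    <updated>2011-09-01T00:00:00Z</updated>
    <georss:point>15.0 135.5</georss:point>
  </entry>
</feed>"""

kml_text = atom_to_kml(SAMPLE_FEED)
print(kml_text)
```

A track such as the 12-hourly hurricane positions described above would correspond to one entry, and therefore one placemark, per position.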
SciFlo: Semantically-Enabled Grid Workflow for Collaborative Science
NASA Astrophysics Data System (ADS)
Yunck, T.; Wilson, B. D.; Raskin, R.; Manipon, G.
2005-12-01
SciFlo is a system for Scientific Knowledge Creation on the Grid using a Semantically-Enabled Dataflow Execution Environment. SciFlo leverages Simple Object Access Protocol (SOAP) Web Services and the Grid Computing standards (WS-* standards and the Globus Alliance toolkits), and enables scientists to do multi-instrument Earth Science by assembling reusable SOAP Services, native executables, local command-line scripts, and Python code into a distributed computing flow (a graph of operators). SciFlo's XML dataflow documents can be a mixture of concrete operators (fully bound operations) and abstract template operators (late binding via semantic lookup). All data objects and operators can be both simply typed (simple and complex types in XML schema) and semantically typed using controlled vocabularies (linked to OWL ontologies such as SWEET). By exploiting ontology-enhanced search and inference, one can discover (and automatically invoke) Web Services and operators that have been semantically labeled as performing the desired transformation, and adapt a particular invocation to the proper interface (number, types, and meaning of inputs and outputs). The SciFlo client & server engines optimize the execution of such distributed data flows and allow the user to transparently find and use datasets and operators without worrying about the actual location of the Grid resources. The scientist injects a distributed computation into the Grid by simply filling out an HTML form or directly authoring the underlying XML dataflow document, and results are returned directly to the scientist's desktop. A Visual Programming tool is also being developed, but it is not required. Once an analysis has been specified for a granule or day of data, it can be easily repeated with different control parameters and over months or years of data. SciFlo uses and preserves semantics, and also generates and infers new semantic annotations.
Specifically, the SciFlo engine uses semantic metadata to understand (infer) what it is doing and potentially improve the data flow; preserves semantics by saving links to the semantics of (metadata describing) the input datasets, related datasets, and the data transformations (algorithms) used to generate downstream products; generates new metadata by allowing the user to add semantic annotations to the generated data products (or simply accept automatically generated provenance annotations); and infers new semantic metadata by understanding and applying logic to the semantics of the data and the transformations performed. Much ontology development still needs to be done but, nevertheless, SciFlo documents provide a substrate for using and preserving more semantics as ontologies develop. We will give a live demonstration of the growing SciFlo network using an example dataflow in which atmospheric temperature and water vapor profiles from three Earth Observing System (EOS) instruments are retrieved using SOAP (geo-location query & data access) services, co-registered, and visually & statistically compared on demand (see http://sciflo.jpl.nasa.gov for more information).
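The notion of a dataflow as a graph of operators can be illustrated with a few lines of Python. This sketch is not SciFlo and does not use its XML dataflow documents, SOAP services or semantic lookup; it only shows, under assumed names, how operators can be run in dependency order once the graph is known. The operator names, inputs and helper function are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_dataflow(operators, dependencies, inputs):
    """Run each operator once all of its upstream results are available."""
    results = dict(inputs)
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            continue  # supplied as an input, nothing to compute
        args = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = operators[name](*args)
    return results


# Hypothetical two-step flow: co-register two profiles, then compare them.
dependencies = {
    "coregister": ["instrument_a", "instrument_b"],
    "compare": ["coregister"],
}
operators = {
    "coregister": lambda a, b: list(zip(a, b)),
    "compare": lambda pairs: [y - x for x, y in pairs],
}
inputs = {"instrument_a": [1, 2, 3], "instrument_b": [2, 3, 4]}
results = run_dataflow(operators, dependencies, inputs)
print(results["compare"])
```

Re-running the same flow over other days of data, as the abstract describes, amounts to calling `run_dataflow` again with different `inputs`.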
Gurulingappa, Harsha; Toldo, Luca; Rajput, Abdul Mateen; Kors, Jan A; Taweel, Adel; Tayrouz, Yorki
2013-11-01
The aim of this study was to assess the impact of automatically detected adverse event signals from text and open-source data on the prediction of drug label changes. Open-source adverse effect data were collected from the FAERS, Yellow Cards and SIDER databases. A shallow linguistic relation extraction system (JSRE) was applied for extraction of adverse effects from MEDLINE case reports. A statistical approach was applied to the extracted datasets for signal detection and subsequent prediction of label changes issued for 29 drugs by the UK Regulatory Authority in 2009. 76% of drug label changes were automatically predicted. Out of these, 6% of drug label changes were detected only by text mining. JSRE enabled precise identification of four adverse drug events from MEDLINE that were undetectable otherwise. Changes in drug labels can be predicted automatically using data and text mining techniques. Text mining technology is mature and well-placed to support pharmacovigilance tasks. Copyright © 2013 John Wiley & Sons, Ltd.
NASA Astrophysics Data System (ADS)
Lv, Zheng; Sui, Haigang; Zhang, Xilin; Huang, Xianfeng
2007-11-01
As one of the most important geo-spatial objects and military establishments, the airport is always a key target in the fields of transportation and military affairs. Therefore, automatic recognition and extraction of airports from remote sensing images is important and urgent for updating civil aviation and military applications. In this paper, a new multi-source data fusion approach to automatic airport information extraction, updating and 3D modeling is addressed. The corresponding key technologies are discussed in detail, including feature extraction of airport information based on a modified Otsu algorithm, automatic change detection based on a new parallel-lines-based buffer detection algorithm, 3D modeling based on a gradual elimination of non-building points algorithm, 3D change detection between the old airport model and LIDAR data, and import of typical CAD models. Finally, based on these technologies, we develop a prototype system, and the results show that our method achieves good results.
Simplified Metadata Curation via the Metadata Management Tool
NASA Astrophysics Data System (ADS)
Shum, D.; Pilone, D.
2015-12-01
The Metadata Management Tool (MMT) is the newest capability developed as part of NASA Earth Observing System Data and Information System's (EOSDIS) efforts to simplify metadata creation and improve metadata quality. The MMT was developed via an agile methodology, taking into account inputs from GCMD's science coordinators and other end-users. In its initial release, the MMT uses the Unified Metadata Model for Collections (UMM-C) to allow metadata providers to easily create and update collection records in the ISO-19115 format. Through a simplified UI experience, metadata curators can create and edit collections without full knowledge of the NASA Best Practices implementation of ISO-19115 format, while still generating compliant metadata. More experienced users are also able to access raw metadata to build more complex records as needed. In future releases, the MMT will build upon recent work done in the community to assess metadata quality and compliance with a variety of standards through application of metadata rubrics. The tool will provide users with clear guidance as to how to easily change their metadata in order to improve their quality and compliance. Through these features, the MMT allows data providers to create and maintain compliant and high quality metadata in a short amount of time.
Ravikumar, Ke; Liu, Haibin; Cohn, Judith D; Wall, Michael E; Verspoor, Karin
2012-10-05
We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to correctly associate each mention with a protein also named in the text. The methods presented in this work will enable improved protein functional site extraction from articles, ultimately supporting protein function prediction. Our method made use of linguistic patterns for identifying the amino acid residue mentions in text. Further, we applied an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. We finally present an approach to automated construction of relevant training and test data using the distant supervision model. The performance of the method was assessed by extracting protein-residue relations from a new automatically generated test set of sentences containing high confidence examples found using distant supervision. It achieved an F-measure of 0.84 on an automatically created silver corpus and 0.79 on a manually annotated gold data set for this task, outperforming previous methods. The primary contributions of this work are to (1) demonstrate the effectiveness of distant supervision for automatic creation of training data for protein-residue relation extraction, substantially reducing the effort and time involved in manual annotation of a data set and (2) show that the graph-based relation extraction approach we used generalizes well to the problem of protein-residue association extraction. This work paves the way towards effective extraction of protein functional residues from the literature.
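To make the idea of pattern-based residue mention detection concrete, here is a minimal sketch that matches three-letter amino acid codes followed by a sequence position. The pattern, the name `find_residue_mentions`, and the example sentence are illustrative assumptions; the authors' actual linguistic patterns are not given in the abstract, and this sketch does not attempt the harder step of linking a residue to its protein.

```python
import re
from typing import List, Tuple

# The 20 standard three-letter amino acid codes.
CODES = ("Ala Arg Asn Asp Cys Gln Glu Gly His Ile "
         "Leu Lys Met Phe Pro Ser Thr Trp Tyr Val").split()

# Code, optional hyphen or space, then the position, e.g. Ser195, His-57, Asp 102.
# Matching is case-sensitive to avoid hits on ordinary words such as "his".
RESIDUE_RE = re.compile(r"\b(" + "|".join(CODES) + r")[- ]?(\d+)\b")


def find_residue_mentions(text: str) -> List[Tuple[str, int]]:
    """Return (amino acid code, position) pairs in order of appearance."""
    return [(m.group(1), int(m.group(2))) for m in RESIDUE_RE.finditer(text)]


sentence = ("Mutation of Ser195 and His-57 abolished activity, "
            "while Asp 102 was unaffected.")
mentions = find_residue_mentions(sentence)
```

A hand-written pattern like this still produces false positives (for example a sentence-initial "His 3 ..."), which is one reason a learned, syntax-aware approach such as the one described above is attractive.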
Enriched Video Semantic Metadata: Authorization, Integration, and Presentation.
ERIC Educational Resources Information Center
Mu, Xiangming; Marchionini, Gary
2003-01-01
Presents an enriched video metadata framework including video authorization using the Video Annotation and Summarization Tool (VAST) (a video metadata authorization system that integrates both semantic and visual metadata), metadata integration, and user level applications. Results demonstrated that the enriched metadata were seamlessly…
Ruano, Juan; Gómez-García, Francisco; Gay-Mimbrera, Jesús; Aguilar-Luque, Macarena; Fernández-Rueda, José Luis; Fernández-Chaichio, Jesús; Alcalde-Mellado, Patricia; Carmona-Fernandez, Pedro J; Sanz-Cabanillas, Juan Luis; Viguera-Guerra, Isabel; Franco-García, Francisco; Cárdenas-Aranzana, Manuel; Romero, José Luis Hernández; Gonzalez-Padilla, Marcelino; Isla-Tejera, Beatriz; Garcia-Nieto, Antonio Velez
2018-03-09
Epidemiology and the reporting characteristics of systematic reviews (SRs) and meta-analyses (MAs) are well known. However, no study has analyzed the influence of protocol features on the probability that a study's results will be finally reported, thereby indirectly assessing the reporting bias of International Prospective Register of Systematic Reviews (PROSPERO) registration records. The objective of this study is to explore which factors are associated with a higher probability that results derived from a non-Cochrane PROSPERO registration record for a systematic review will be finally reported as an original article in a scientific journal. The PROSPERO repository will be web scraped to automatically and iteratively obtain all completed non-Cochrane registration records stored from February 2011 to December 2017. Downloaded records will be screened, and those that are less than 90% complete or are duplicates (i.e., those sharing titles and reviewers) will be excluded. Manual and human-supervised automatic methods will be used for data extraction, depending on the data source (fields of PROSPERO registration records, bibliometric databases, etc.). Records will be classified into published, discontinued, and abandoned review subgroups. All articles derived from published reviews will be obtained through multiple parallel searches using the full protocol "title" and/or "list reviewers" in MEDLINE/PubMed databases and Google Scholar. Reviewer, author, article, and journal metadata will be obtained using different sources. R and Python programming and analysis languages will be used to describe the datasets; perform text mining, machine learning, and deep learning analyses; and visualize the data. We will report the study according to the recommendations for meta-epidemiological studies adapted from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement for SRs and MAs.
This meta-epidemiological study will explore, for the first time, characteristics of PROSPERO records that may be associated with the publication of a completed systematic review. The evidence may help to improve review workflow performance in terms of research topic selection, decision-making regarding team selection, planning relationships with funding sources, implementing literature search strategies, and efficient data extraction and analysis. We expect to make our results, datasets, and R and Python code scripts publicly available during the third quarter of 2018.
NASA Technical Reports Server (NTRS)
Nagihara, S.; Nakamura, Y.; Williams, D. R.; Taylor, P. T.; Kiefer, W. S.; Hager, M. A.; Hills, H. K.
2016-01-01
In 2010, 440 original data archival tapes for the Apollo Lunar Science Experiment Package (ALSEP) experiments were found at the Washington National Records Center. These tapes hold raw instrument data received from the Moon for all the ALSEP instruments for the period of April through June 1975. We have recently completed extraction of binary files from these tapes, and we have delivered them to the NASA Space Science Data Coordinated Archive (NSSDCA). We are currently processing the raw data into higher order data products in file formats more readily usable by contemporary researchers. These data products will fill a number of gaps in the current ALSEP data collection at NSSDCA. In addition, we have established a digital, searchable archive of ALSEP documents and metadata as part of the web portal of the Lunar and Planetary Institute. It currently holds approx. 700 documents totaling approx. 40,000 pages.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Studwell, Sara; Robinson, Carly; Elliott, Jannean
Scientific research is producing ever-increasing amounts of data. Organizing and reflecting relationships across data collections, datasets, publications, and other research objects are essential functionalities of the modern science environment, yet challenging to implement. Landing pages are often used for providing ‘big picture’ contextual frameworks for datasets and data collections, and many large-volume data holders are utilizing them in thoughtful, creative ways. The benefits of their organizational efforts, however, are not realized unless the user eventually sees the landing page at the end point of their search. What if that organization and ‘big picture’ context could benefit the user at the beginning of the search? That is a challenging approach, but The Department of Energy’s (DOE) Office of Scientific and Technical Information (OSTI) is redesigning the database functionality of the DOE Data Explorer (DDE) with that goal in mind. Phase I is focused on redesigning the DDE database to leverage relationships between two existing distinct populations in DDE, data Projects and individual Datasets, and then adding a third intermediate population, data Collections. Mapped, structured linkages, designed to show user relationships, will allow users to make informed search choices. These linkages will be sustainable and scalable, created automatically with the use of new metadata fields and existing authorities. Phase II will study selected DOE Data ID Service clients, analyzing how their landing pages are organized, and how that organization might be used to improve DDE search capabilities. At the heart of both phases is the realization that adding more metadata information for cross-referencing may require additional effort for data scientists. Finally, OSTI’s approach seeks to leverage existing metadata and landing page intelligence without imposing an additional burden on the data creators.
Automatic identification of high impact articles in PubMed to support clinical decision making.
Bian, Jiantao; Morid, Mohammad Amin; Jonnalagadda, Siddhartha; Luo, Gang; Del Fiol, Guilherme
2017-09-01
The practice of evidence-based medicine involves integrating the latest best available evidence into patient care decisions. Yet, critical barriers exist for clinicians' retrieval of evidence that is relevant for a particular patient from primary sources such as randomized controlled trials and meta-analyses. To help address those barriers, we investigated machine learning algorithms that find clinical studies with high clinical impact from PubMed®. Our machine learning algorithms use a variety of features including bibliometric features (e.g., citation count), social media attention, journal impact factors, and citation metadata. The algorithms were developed and evaluated with a gold standard composed of 502 high impact clinical studies that are referenced in 11 clinical evidence-based guidelines on the treatment of various diseases. We tested the following hypotheses: (1) our high impact classifier outperforms a state-of-the-art classifier based on citation metadata and citation terms, and PubMed's® relevance sort algorithm; and (2) the performance of our high impact classifier does not decrease significantly after removing proprietary features such as citation count. The mean top 20 precision of our high impact classifier was 34% versus 11% for the state-of-the-art classifier and 4% for PubMed's® relevance sort (p=0.009); and the performance of our high impact classifier did not decrease significantly after removing proprietary features (mean top 20 precision=34% vs. 36%; p=0.085). The high impact classifier, using features such as bibliometrics, social media attention and MEDLINE® metadata, outperformed previous approaches and is a promising alternative to identifying high impact studies for clinical decision support. Copyright © 2017 Elsevier Inc. All rights reserved.
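The headline metric above, mean top 20 precision, can be sketched as follows. This is a generic illustration of precision at k averaged over queries, assuming each query yields a ranked list of article identifiers and a set of gold-standard identifiers; the function names and the input layout are hypothetical and do not come from the study.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 20) -> float:
    """Fraction of the top-k ranked items that are in the gold standard."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 20) -> float:
    """Average precision at k over (ranking, gold standard) pairs, one per query."""
    scores: List[float] = [precision_at_k(r, rel, k) for r, rel in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Dividing by k rather than by the number of returned items penalizes rankings that return fewer than k results; whether the study made the same choice is not stated in the abstract.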
Studwell, Sara; Robinson, Carly; Elliott, Jannean
2017-04-04
Scientific research is producing ever-increasing amounts of data. Organizing and reflecting relationships across data collections, datasets, publications, and other research objects are essential functionalities of the modern science environment, yet challenging to implement. Landing pages are often used for providing ‘big picture’ contextual frameworks for datasets and data collections, and many large-volume data holders are utilizing them in thoughtful, creative ways. The benefits of their organizational efforts, however, are not realized unless the user eventually sees the landing page at the end point of their search. What if that organization and ‘big picture’ context could benefit the user at the beginning of the search? That is a challenging approach, but The Department of Energy’s (DOE) Office of Scientific and Technical Information (OSTI) is redesigning the database functionality of the DOE Data Explorer (DDE) with that goal in mind. Phase I is focused on redesigning the DDE database to leverage relationships between two existing distinct populations in DDE, data Projects and individual Datasets, and then adding a third intermediate population, data Collections. Mapped, structured linkages, designed to show user relationships, will allow users to make informed search choices. These linkages will be sustainable and scalable, created automatically with the use of new metadata fields and existing authorities. Phase II will study selected DOE Data ID Service clients, analyzing how their landing pages are organized, and how that organization might be used to improve DDE search capabilities. At the heart of both phases is the realization that adding more metadata information for cross-referencing may require additional effort for data scientists. Finally, OSTI’s approach seeks to leverage existing metadata and landing page intelligence without imposing an additional burden on the data creators.
Assessing Metadata Quality of a Federally Sponsored Health Data Repository
Marc, David T.; Beattie, James; Herasevich, Vitaly; Gatewood, Laël; Zhang, Rui
2016-01-01
The U.S. Federal Government developed HealthData.gov to disseminate healthcare datasets to the public. Metadata is provided for each dataset and is the sole source of information to find and retrieve data. This study employed automated quality assessments of the HealthData.gov metadata published from 2012 to 2014 to measure completeness, accuracy, and consistency of applying standards. The results demonstrated that metadata published in earlier years had lower completeness, accuracy, and consistency. Also, metadata that underwent modifications following their original creation were of higher quality. HealthData.gov did not uniformly apply the Dublin Core Metadata Initiative standard, a widely accepted metadata standard, to the metadata. These findings suggested that the HealthData.gov metadata suffered from quality issues, particularly related to information that wasn’t frequently updated. The results supported the need for policies to standardize metadata and contributed to the development of automated measures of metadata quality. PMID:28269883
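An automated completeness measure of the kind mentioned above could look like the following minimal sketch. It assumes each metadata record is a dictionary and that completeness is the share of required fields with a non-empty value; the field list (a handful of Dublin Core element names) and the function name are illustrative choices, not the measures used in the study.

```python
from typing import Any, Mapping, Sequence

# Hypothetical required fields, named after Dublin Core elements.
REQUIRED_FIELDS = ("title", "description", "creator", "date", "identifier")


def completeness(record: Mapping[str, Any],
                 required_fields: Sequence[str] = REQUIRED_FIELDS) -> float:
    """Share of required fields that are present and not blank."""
    if not required_fields:
        raise ValueError("required_fields must not be empty")
    filled = sum(
        1 for field in required_fields
        if str(record.get(field) or "").strip()  # None, "" and "  " count as missing
    )
    return filled / len(required_fields)


record = {"title": "Hospital Compare", "description": "  ", "creator": "CMS"}
score = completeness(record, ["title", "description", "creator", "date"])
```

Averaging this score per publication year would give one way to compare older and newer records, under the assumptions stated here.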
Partnerships To Mine Unexploited Sources of Metadata.
ERIC Educational Resources Information Center
Reynolds, Regina Romano
This paper discusses the metadata created for other purposes as a potential source of bibliographic data. The first section addresses collecting metadata by means of templates, including the Nordic Metadata Project's Dublin Core Metadata Template. The second section considers potential partnerships for re-purposing metadata for bibliographic use,…
VISA: An Automatic Aware and Visual Aids Mechanism for Improving the Correct Use of Geospatial Data
NASA Astrophysics Data System (ADS)
Hong, J. H.; Su, Y. T.
2016-06-01
With the fast growth of internet-based sharing mechanisms and OpenGIS technology, users nowadays enjoy the luxury of quickly locating and accessing a variety of geospatial data for the tasks at hand. While this sharing innovation tremendously expands the possibilities of application and reduces development cost, users nevertheless have to deal with all kinds of "differences" implicitly hidden behind the acquired georesources. We argue that the next generation of GIS-based environments, whether internet-based or not, must have built-in knowledge to automatically and correctly assess the fitness of data for use and present the analyzed results to users in an intuitive and meaningful way. The VISA approach proposed in this paper refers to four different types of visual aids that can be used for presenting analyzed results, namely, virtual layer, informative window, symbol transformation and augmented TOC. The VISA-enabled interface works in an automatic-aware fashion: the standardized metadata serve as the known facts about the selected geospatial resources, algorithms were designed for analyzing differences in the temporality and quality of the geospatial resources, and the analyzed results are automatically transformed into visual aids. It successfully presents a new way of bridging the communication gaps between systems and users. GIS has long been seen as a powerful integration tool, but its achievements would be highly restricted if it failed to provide a friendly and correct working platform.
46 CFR 161.002-2 - Types of fire-protective systems.
Code of Federal Regulations, 2013 CFR
2013-10-01
..., but not be limited to, automatic fire and smoke detecting systems, manual fire alarm systems, sample extraction smoke detection systems, watchman's supervisory systems, and combinations of these systems. (b) Automatic fire detecting systems. For the purpose of this subpart, automatic fire and smoke detecting...
46 CFR 161.002-2 - Types of fire-protective systems.
Code of Federal Regulations, 2014 CFR
2014-10-01
..., but not be limited to, automatic fire and smoke detecting systems, manual fire alarm systems, sample extraction smoke detection systems, watchman's supervisory systems, and combinations of these systems. (b) Automatic fire detecting systems. For the purpose of this subpart, automatic fire and smoke detecting...
NASA Astrophysics Data System (ADS)
Servilla, M.; Brunt, J.
2011-12-01
Emerging in the 1980s as a U.S. National Science Foundation funded research network, the Long Term Ecological Research (LTER) Network began with six sites and with the goal of performing comparative data collection and analysis of major biotic regions of North America. Today, the LTER Network includes 26 sites located in North America, Antarctica, Puerto Rico, and French Polynesia and has contributed a corpus of over 7,000 data sets to the public domain. The diversity of LTER research has led to a wealth of scientific data derived from atmospheric to terrestrial to oceanographic to anthropogenic studies. Such diversity, however, is a contributing factor to data being published with poor or inconsistent quality or to data lacking descriptive documentation sufficient for understanding their origin or performing derivative studies. It is for these reasons that the LTER community, in collaboration with the LTER Network Office, has embarked on the development of the LTER Network Information System (NIS), an integrative data management approach to improve the process by which quality LTER data and metadata are assembled into a central archive, thereby enabling better discovery, analysis, and synthesis of derived data products. The mission of the LTER NIS is to promote advances in collaborative and synthetic ecological science at multiple temporal and spatial scales by providing the information management and technology infrastructure to increase: (1) the availability and quality of data from LTER sites, by the use and support of standardized approaches to metadata management and access to data; (2) the timeliness and number of LTER derived data products, by creating a suite of middleware programs and workflows that make it easy to create and maintain integrated data sets derived from LTER data; and (3) the knowledge generated from the synthesis of LTER data, by creating standardized access and easy-to-use applications to discover, access, and use LTER data.
The LTER NIS will utilize the Provenance Aware Synthesis Tracking Architecture (PASTA), which will provide the LTER community a metadata-driven data-flow framework to automatically harvest data from LTER research sites and make it available through a well defined software interface. We distinguish PASTA from the more generalized NIS by classifying framework components as critical and enabling cyberinfrastructure that, collectively, provide the services defined by the above mission. Data and metadata will have to pass a set of community defined quality criteria before entry into PASTA, including the use of semantic informing metadata elements and the conformance of data to their structural descriptions provided by metadata. As a result, consumers of data products from PASTA will be assured that metadata are complete and include provenance information where applicable and the data are of the highest quality. Development of the NIS is being performed through community participation. Advisory groups, called "Tiger Teams", are enlisted from the general LTER membership to provide input to the design of the NIS. Other LTER working groups contribute community-based software into the NIS; these include modules for controlled vocabularies, scientific units, and personnel. We anticipate a 2014 release of the LTER NIS.
CINERGI: Community Inventory of EarthCube Resources for Geoscience Interoperability
NASA Astrophysics Data System (ADS)
Zaslavsky, Ilya; Bermudez, Luis; Grethe, Jeffrey; Gupta, Amarnath; Hsu, Leslie; Lehnert, Kerstin; Malik, Tanu; Richard, Stephen; Valentine, David; Whitenack, Thomas
2014-05-01
Organizing geoscience data resources to support cross-disciplinary data discovery, interpretation, analysis and integration is challenging because of different information models, semantic frameworks, metadata profiles, catalogs, and services used in different geoscience domains, not to mention different research paradigms and methodologies. The central goal of CINERGI, a new project supported by the US National Science Foundation through its EarthCube Building Blocks program, is to create a methodology and assemble a large inventory of high-quality information resources capable of supporting data discovery needs of researchers in a wide range of geoscience domains. The key characteristics of the inventory are: 1) collaboration with and integration of metadata resources from a number of large data facilities; 2) reliance on international metadata and catalog service standards; 3) assessment of resource "interoperability-readiness"; 4) ability to cross-link and navigate data resources, projects, models, researcher directories, publications, usage information, etc.; 5) efficient inclusion of "long-tail" data, which are not appearing in existing domain repositories; 6) data registration at feature level where appropriate, in addition to common dataset-level registration, and 7) integration with parallel EarthCube efforts, in particular focused on EarthCube governance, information brokering, service-oriented architecture design and management of semantic information. We discuss challenges associated with accomplishing CINERGI goals, including defining the inventory scope; managing different granularity levels of resource registration; interaction with search systems of domain repositories; explicating domain semantics; metadata brokering, harvesting and pruning; managing provenance of the harvested metadata; and cross-linking resources based on the linked open data (LOD) approaches. 
At the higher level of the inventory, we register domain-wide resources such as domain catalogs, vocabularies, information models, data service specifications, identifier systems, and assess their conformance with international standards (such as those adopted by ISO and OGC, and used by INSPIRE) or de facto community standards using, in part, automatic validation techniques. The main level in CINERGI leverages a metadata aggregation platform (currently Geoportal Server) to organize harvested resources from multiple collections and contributed by community members during EarthCube end-user domain workshops or suggested online. The latter mechanism uses the SciCrunch toolkit originally developed within the Neuroscience Information Framework (NIF) project and now being extended to other communities. The inventory is designed to support requests such as "Find resources with theme X in geographic area S", "Find datasets with subject Y using query concept expansion", "Find geographic regions having data of type Z", "Find datasets that contain property P". With the added LOD support, additional types of requests, such as "Find example implementations of specification X", "Find researchers who have worked in Domain X, dataset Y, location L", "Find resources annotated by person X", will be supported. The project's website (http://workspace.earthcube.org/cinergi) provides access to the initial resource inventory, a gallery of EarthCube researchers, collections of geoscience models, metadata entry forms, and other software modules and inventories being integrated into the CINERGI system. Support from the US National Science Foundation under award NSF ICER-1343816 is gratefully acknowledged.
A Window to the World: Lessons Learned from NASA's Collaborative Metadata Curation Effort
NASA Astrophysics Data System (ADS)
Bugbee, K.; Dixon, V.; Baynes, K.; Shum, D.; le Roux, J.; Ramachandran, R.
2017-12-01
Well-written descriptive metadata adds value to data by making the data easier to discover, and it increases the use of data by providing the context or appropriateness of use. While many data centers acknowledge the importance of correct, consistent and complete metadata, allocating resources to curate existing metadata is often difficult. To lower resource costs, many data centers seek guidance on best practices for curating metadata but struggle to identify those recommendations. In order to assist data centers in curating metadata and to also develop best practices for creating and maintaining metadata, NASA has formed a collaborative effort to improve the Earth Observing System Data and Information System (EOSDIS) metadata in the Common Metadata Repository (CMR). This effort has taken significant steps in building consensus around metadata curation best practices. However, this effort has also revealed gaps in EOSDIS enterprise policies and procedures within the core metadata curation task. This presentation will explore the mechanisms used for building consensus on metadata curation, the gaps identified in policies and procedures, the lessons learned from collaborating with both the data centers and metadata curation teams, and the proposed next steps for the future.
Shape and texture fused recognition of flying targets
NASA Astrophysics Data System (ADS)
Kovács, Levente; Utasi, Ákos; Kovács, Andrea; Szirányi, Tamás
2011-06-01
This paper presents visual detection and recognition of flying targets (e.g. planes, missiles) based on automatically extracted shape and object texture information, for application areas like alerting, recognition and tracking. Targets are extracted based on robust background modeling and a novel contour extraction approach, and object recognition is done by comparisons to shape and texture based query results on a previously gathered real life object dataset. Application areas involve passive defense scenarios, including automatic object detection and tracking with cheap commodity hardware components (CPU, camera and GPS).
Evaluating and Evolving Metadata in Multiple Dialects
NASA Astrophysics Data System (ADS)
Kozimor, J.; Habermann, T.; Powers, L. A.; Gordon, S.
2016-12-01
Despite many long-term homogenization efforts, communities continue to develop focused metadata standards along with related recommendations and (typically) XML representations (aka dialects) for sharing metadata content. Different representations easily become obstacles to sharing information because each representation generally requires a set of tools and skills that are designed, built, and maintained specifically for that representation. In contrast, community recommendations are generally described, at least initially, at a more conceptual level and are more easily shared. For example, most communities agree that dataset titles should be included in metadata records although they write the titles in different ways. This situation has led to the development of metadata repositories that can ingest and output metadata in multiple dialects. As an operational example, the NASA Common Metadata Repository (CMR) includes three different metadata dialects (DIF, ECHO, and ISO 19115-2). These systems raise a new question for metadata providers: if I have a choice of metadata dialects, which should I use and how do I make that decision? We have developed a collection of metadata evaluation tools that can be used to evaluate metadata records in many dialects for completeness with respect to recommendations from many organizations and communities. We have applied these tools to over 8000 collection and granule metadata records in four different dialects. This large collection of identical content in multiple dialects enables us to address questions about metadata and dialect evolution and to answer those questions quantitatively. We will describe those tools and results from evaluating the NASA CMR metadata collection.
Study of Burn Scar Extraction Automatically Based on Level Set Method using Remote Sensing Data
Liu, Yang; Dai, Qin; Liu, JianBo; Liu, ShiBin; Yang, Jin
2014-01-01
Burn scar extraction using remote sensing data is an efficient way to precisely evaluate burn area and measure vegetation recovery. Traditional burn scar extraction methodologies perform poorly on burn scar images with blurred and irregular edges. To address these issues, this paper proposes an automatic method to extract burn scars based on the Level Set Method (LSM). This method utilizes the advantages of the different features in remote sensing images and considers the practical need to extract burn scars rapidly and automatically. The approach integrates Change Vector Analysis (CVA), the Normalized Difference Vegetation Index (NDVI) and the Normalized Burn Ratio (NBR) to obtain a difference image, and modifies the conventional Level Set Method Chan-Vese (C-V) model with a new initial curve that results from a binary image obtained by applying the K-means method to the fitting errors of two near-infrared band images. Landsat 5 TM and Landsat 8 OLI data sets are used to validate the proposed method. Comparisons with the conventional C-V model, the Otsu algorithm, and the Fuzzy C-means (FCM) algorithm show that the proposed approach can extract the outline curve of a fire burn scar effectively and exactly. The method has higher extraction accuracy and lower algorithmic complexity than the conventional C-V model. PMID:24503563
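The two spectral indices named above follow the standard normalized-difference form. The sketch below computes them per pixel from reflectance values and derives a simple pre-fire minus post-fire NBR difference. How the paper combines CVA, NDVI and NBR into its difference image is not described in the abstract, so the `delta_nbr` helper and all function names are assumptions for illustration only.

```python
def normalized_difference(a: float, b: float) -> float:
    """(a - b) / (a + b), returning 0.0 where the denominator vanishes."""
    denom = a + b
    return 0.0 if denom == 0 else (a - b) / denom


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from near-infrared and red reflectance."""
    return normalized_difference(nir, red)


def nbr(nir: float, swir: float) -> float:
    """Normalized Burn Ratio from near-infrared and shortwave-infrared reflectance."""
    return normalized_difference(nir, swir)


def delta_nbr(pre_nir: float, pre_swir: float, post_nir: float, post_swir: float) -> float:
    """Pre-fire NBR minus post-fire NBR; larger values suggest more severe burning."""
    return nbr(pre_nir, pre_swir) - nbr(post_nir, post_swir)
```

Healthy vegetation reflects strongly in the near infrared and weakly in the shortwave infrared, while burned surfaces show the opposite tendency, which is why the NBR difference rises over a burn scar.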
NASA Astrophysics Data System (ADS)
David, Peter; Hansen, Nichole; Nolan, James J.; Alcocer, Pedro
2015-05-01
The growth in text data available online is accompanied by a growth in the diversity of available documents. Corpora with extreme heterogeneity in terms of file formats, document organization, page layout, text style, and content are common. The absence of meaningful metadata describing the structure of online and open-source data leads to text extraction results that contain no information about document structure and are cluttered with page headers and footers, web navigation controls, advertisements, and other items that are typically considered noise. We describe an approach to document structure and metadata recovery that uses visual analysis of documents to infer the communicative intent of the author. Our algorithm identifies the components of documents such as titles, headings, and body content, based on their appearance. Because it operates on an image of a document, our technique can be applied to any type of document, including scanned images. Our approach to document structure recovery considers a finer-grained set of component types than prior approaches. In this initial work, we show that a machine learning approach to document structure recovery using a feature set based on the geometry and appearance of images of documents achieves a 60% greater F1-score than a baseline random classifier.
EBI metagenomics--a new resource for the analysis and archiving of metagenomic data.
Hunter, Sarah; Corbett, Matthew; Denise, Hubert; Fraser, Matthew; Gonzalez-Beltran, Alejandra; Hunter, Christopher; Jones, Philip; Leinonen, Rasko; McAnulla, Craig; Maguire, Eamonn; Maslen, John; Mitchell, Alex; Nuka, Gift; Oisel, Arnaud; Pesseat, Sebastien; Radhakrishnan, Rajesh; Rocca-Serra, Philippe; Scheremetjew, Maxim; Sterk, Peter; Vaughan, Daniel; Cochrane, Guy; Field, Dawn; Sansone, Susanna-Assunta
2014-01-01
Metagenomics is a relatively recently established but rapidly expanding field that uses high-throughput next-generation sequencing technologies to characterize the microbial communities inhabiting different ecosystems (including oceans, lakes, soil, tundra, plants and body sites). Metagenomics brings with it a number of challenges, including the management, analysis, storage and sharing of data. In response to these challenges, we have developed a new metagenomics resource (http://www.ebi.ac.uk/metagenomics/) that allows users to easily submit raw nucleotide reads for functional and taxonomic analysis by a state-of-the-art pipeline, and have them automatically stored (together with descriptive, standards-compliant metadata) in the European Nucleotide Archive.
SOCIB Glider toolbox: from sensor to data repository
NASA Astrophysics Data System (ADS)
Pau Beltran, Joan; Heslop, Emma; Ruiz, Simón; Troupin, Charles; Tintoré, Joaquín
2015-04-01
Nowadays in oceanography, gliders constitute a mature, cost-effective technology for the acquisition of measurements independently of the sea state (unlike ships), providing subsurface data during sustained periods, including extreme weather events. The SOCIB glider toolbox is a set of MATLAB/Octave scripts and functions developed to manage the data collected by a glider fleet. They cover the main stages of the data management process, both in real-time and delayed-time modes: metadata aggregation, downloading, processing, and automatic generation of data products and figures. The toolbox is distributed under the GNU licence (http://www.gnu.org/copyleft/gpl.html) and is available at http://www.socib.es/users/glider/glider_toolbox.
EBI metagenomics—a new resource for the analysis and archiving of metagenomic data
Hunter, Sarah; Corbett, Matthew; Denise, Hubert; Fraser, Matthew; Gonzalez-Beltran, Alejandra; Hunter, Christopher; Jones, Philip; Leinonen, Rasko; McAnulla, Craig; Maguire, Eamonn; Maslen, John; Mitchell, Alex; Nuka, Gift; Oisel, Arnaud; Pesseat, Sebastien; Radhakrishnan, Rajesh; Rocca-Serra, Philippe; Scheremetjew, Maxim; Sterk, Peter; Vaughan, Daniel; Cochrane, Guy; Field, Dawn; Sansone, Susanna-Assunta
2014-01-01
Metagenomics is a relatively recently established but rapidly expanding field that uses high-throughput next-generation sequencing technologies to characterize the microbial communities inhabiting different ecosystems (including oceans, lakes, soil, tundra, plants and body sites). Metagenomics brings with it a number of challenges, including the management, analysis, storage and sharing of data. In response to these challenges, we have developed a new metagenomics resource (http://www.ebi.ac.uk/metagenomics/) that allows users to easily submit raw nucleotide reads for functional and taxonomic analysis by a state-of-the-art pipeline, and have them automatically stored (together with descriptive, standards-compliant metadata) in the European Nucleotide Archive. PMID:24165880
Gordaliza, P M; Muñoz-Barrutia, A; Via, L E; Sharpe, S; Desco, M; Vaquero, J J
2018-05-29
Computed tomography (CT) images enable capturing specific manifestations of tuberculosis (TB) that are undetectable using common diagnostic tests, which suffer from limited specificity. In this study, we aimed to automatically quantify the burden of Mycobacterium tuberculosis (Mtb) using biomarkers extracted from x-ray CT images. Nine macaques were aerosol-infected with Mtb and treated with various antibiotic cocktails. Chest CT scans were acquired in all animals at specific times independently of disease progression. First, a fully automatic segmentation of the healthy lungs from the acquired chest CT volumes was performed and air-like structures were extracted. Next, unsegmented pulmonary regions corresponding to damaged parenchymal tissue and TB lesions were included. CT biomarkers were extracted by classification of the probability distribution of the intensity of the segmented images into three tissue types: (1) healthy tissue, parenchyma free from infection; (2) soft diseased tissue; and (3) hard diseased tissue. The probability distribution of tissue intensities was assumed to follow a Gaussian mixture model. The thresholds identifying each region were automatically computed using an expectation-maximization algorithm. The estimated longitudinal course of TB infection shows that subjects that have followed the same antibiotic treatment present a similar response (relative change in the diseased volume) with respect to baseline. More interestingly, the correlation between the diseased volume (soft tissue + hard tissue), which was manually delineated by an expert, and the automatically extracted volume with the proposed method was very strong (R² ≈ 0.8). We present a methodology that is suitable for automatic extraction of a radiological biomarker from CT images for TB disease burden. The method could be used to describe the longitudinal evolution of Mtb infection in a clinical trial devoted to the design of new drugs.
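The Gaussian mixture step can be sketched in one dimension as follows. This is a generic expectation-maximization fit, not the authors' implementation: the initial means, the initial standard deviation and the rule of assigning each intensity to its most probable component (which implicitly defines the thresholds between tissue types) are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, initial_means, initial_std, iterations=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by expectation-maximization."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [initial_std ** 2] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            if total == 0.0:
                # Value is far from every component: fall back to a uniform split.
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate the weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # Empty component: keep its previous parameters.
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


def classify(x, weights, means, variances):
    """Index of the component with the highest weighted density at x."""
    dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return max(range(len(dens)), key=dens.__getitem__)
```

With three components ordered by increasing mean intensity, the returned index could stand for healthy, soft diseased and hard diseased tissue respectively. The intensity at which the most probable component changes plays the role of the threshold mentioned in the abstract.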
NASA Astrophysics Data System (ADS)
Cong, Chao; Liu, Dingsheng; Zhao, Lingjun
2008-12-01
This paper discusses a new method for the automatic matching of ground control points (GCPs) between satellite remote sensing images and digital raster graphics (DRGs) in urban areas. The key to this method is the automatic extraction of tie point pairs from such heterogeneous images according to geographic characteristics. Since these heterogeneous images differ greatly with respect to texture and corner features, a more detailed analysis is performed to find the similarities and differences between high-resolution remote sensing images and DRGs. Furthermore, a new algorithm based on the fuzzy c-means (FCM) method is proposed to extract linear features in remote sensing images. Crossings and corners extracted from these linear features are chosen as GCPs. A similar method is used to find the same features in DRGs. Finally, the Hausdorff distance is adopted to pick matching GCPs from the two GCP groups. Experiments show that the method can extract GCPs from such images with a reasonable RMS error.
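The Hausdorff distance used for the final matching step has a standard definition for finite point sets, sketched here. How the paper applies it to candidate GCP groups is not described in the abstract, so the sketch only shows the distance itself, and the function names are illustrative.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

The distance is small only when every point of each set lies close to some point of the other set, which is why it suits comparing two groups of candidate control points. A single unmatched outlier in either group is enough to make it large.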
Automatic Feature Extraction from Planetary Images
NASA Technical Reports Server (NTRS)
Troglio, Giulia; Le Moigne, Jacqueline; Benediktsson, Jon A.; Moser, Gabriele; Serpico, Sebastiano B.
2010-01-01
With the launch of several planetary missions in the last decade, a large number of planetary images have already been acquired and many more will be available for analysis in the coming years. The image data need to be analyzed, preferably by automatic processing techniques because of the huge amount of data. Although many automatic feature extraction methods have been proposed and utilized for Earth remote sensing images, these methods are not always applicable to planetary data, which often present low contrast and uneven illumination characteristics. Different methods have already been presented for crater extraction from planetary images, but the detection of other types of planetary features has not been addressed yet. Here, we propose a new unsupervised method for the extraction of different features from the surface of the analyzed planet, based on the combination of several image processing techniques, including a watershed segmentation and the generalized Hough Transform. The method has many applications, including image registration, and can be applied to arbitrary planetary images.
Kim, Heejun; Bian, Jiantao; Mostafa, Javed; Jonnalagadda, Siddhartha; Del Fiol, Guilherme
2016-01-01
Motivation: Clinicians need up-to-date evidence from high quality clinical trials to support clinical decisions. However, applying evidence from the primary literature requires significant effort. Objective: To examine the feasibility of automatically extracting key clinical trial information from ClinicalTrials.gov. Methods: We assessed the coverage of ClinicalTrials.gov for high quality clinical studies that are indexed in PubMed. Using 140 random ClinicalTrials.gov records, we developed and tested rules for the automatic extraction of key information. Results: The rate of high quality clinical trial registration in ClinicalTrials.gov increased from 0.2% in 2005 to 17% in 2015. Trials reporting results increased from 3% in 2005 to 19% in 2015. The accuracy of the automatic extraction algorithm for 10 trial attributes was 90% on average. Future research is needed to improve the algorithm accuracy and to design information displays to optimally present trial information to clinicians. PMID:28269867
NASA Astrophysics Data System (ADS)
Kobayashi, Hiroshi; Suzuki, Seiji; Takahashi, Hisanori; Tange, Akira; Kikuchi, Kohki
This study deals with a method for automatic contour extraction of facial features such as eyebrows, eyes and mouth from time-series frontal face images with various facial expressions. Because Snakes, one of the best-known contour extraction methods, has several disadvantages, we propose a new method to overcome these issues. We define an elastic contour model in order to hold the contour shape and then determine the elastic energy from the amount of modification of the elastic contour model. We also use the image energy obtained from brightness differences at the control points on the elastic contour model. Applying the dynamic programming method, we determine the contour position where the sum of the elastic energy and the image energy is minimal. Using frontal face images captured at 1/30 s intervals while the expression changes from neutral to one of six typical facial expressions, obtained from 20 subjects, we have evaluated our method and found that it enables high-accuracy automatic contour extraction of facial features.
Scientific Workflows + Provenance = Better (Meta-)Data Management
NASA Astrophysics Data System (ADS)
Ludaescher, B.; Cuevas-Vicenttín, V.; Missier, P.; Dey, S.; Kianmajd, P.; Wei, Y.; Koop, D.; Chirigati, F.; Altintas, I.; Belhajjame, K.; Bowers, S.
2013-12-01
The origin and processing history of an artifact is known as its provenance. Data provenance is an important form of metadata that explains how a particular data product came about, e.g., how and when it was derived in a computational process, which parameter settings and input data were used, etc. Provenance information provides transparency and helps to explain and interpret data products. Other common uses and applications of provenance include quality control, data curation, result debugging, and more generally, 'reproducible science'. Scientific workflow systems (e.g. Kepler, Taverna, VisTrails, and others) provide controlled environments for developing computational pipelines with built-in provenance support. Workflow results can then be explained in terms of workflow steps, parameter settings, input data, etc. using provenance that is automatically captured by the system. Scientific workflows themselves provide a user-friendly abstraction of the computational process and are thus a form of ('prospective') provenance in their own right. The full potential of provenance information is realized when combining workflow-level information (prospective provenance) with trace-level information (retrospective provenance). To this end, the DataONE Provenance Working Group (ProvWG) has developed an extension of the W3C PROV standard, called D-PROV. Whereas PROV provides a 'least common denominator' for exchanging and integrating provenance information, D-PROV adds new 'observables' that describe workflow-level information (e.g., the functional steps in a pipeline), as well as workflow-specific trace-level information (timestamps for each workflow step executed, the inputs and outputs used, etc.). Using examples, we will demonstrate how the combination of prospective and retrospective provenance provides added value in managing scientific data.
The DataONE ProvWG is also developing tools based on D-PROV that allow scientists to get more mileage from provenance metadata. DataONE is a federation of member nodes that store data and metadata for discovery and access. By enriching metadata with provenance information, search and reuse of data is enhanced, and the 'social life' of data (being the product of many workflow runs, different people, etc.) is revealed. We are currently prototyping a provenance repository (PBase) to demonstrate what can be achieved with advanced provenance queries. The ProvExplorer and ProPub tools support advanced ad-hoc querying and visualization of provenance as well as customized provenance publications (e.g., to address privacy issues, or to focus provenance to relevant details). In a parallel line of work, we are exploring ways to add provenance support to widely-used scripting platforms (e.g. R and Python) and then expose that information via D-PROV.
EOS ODL Metadata On-line Viewer
NASA Astrophysics Data System (ADS)
Yang, J.; Rabi, M.; Bane, B.; Ullman, R.
2002-12-01
We have recently developed and deployed an EOS ODL metadata on-line viewer. The EOS ODL metadata viewer is a web server that takes 1) an EOS metadata file in Object Description Language (ODL) and 2) parameters, such as which metadata to view and what style of display to use, and returns an HTML or XML document displaying the requested metadata in the requested style. This tool was developed to address widespread complaints from the science community that the EOS Data and Information System (EOSDIS) metadata files in ODL are difficult to read. It allows users to upload and view an ODL metadata file in different styles using a web browser. Users can choose to view all the metadata or part of the metadata, such as Collection metadata, Granule metadata, or Unsupported Metadata. Choices of display styles include 1) Web: a mouseable display with tabs and turn-down menus, 2) Outline: formatted and colored text, suitable for printing, 3) Generic: simple indented text, a direct representation of the underlying ODL metadata, and 4) None: no stylesheet is applied and the XML generated by the converter is returned directly. Not all display styles are implemented for all the metadata choices. For example, Web style is only implemented for Collection and Granule metadata groups with known attribute fields, but not for Unsupported, Other, and All metadata. The overall strategy of the ODL viewer is to transform an ODL metadata file into viewable HTML in two steps. The first step is to convert the ODL metadata file to XML using a Java-based parser/translator called ODL2XML. The second step is to transform the XML to HTML using stylesheets. Both operations are done on the server side. This allows a lot of flexibility in the final result and is highly portable across platforms. Perl CGI behind the Apache web server is used to run the Java ODL2XML, and then run the results through an XSLT processor.
The EOS ODL viewer can be accessed from either a PC or a Mac using Internet Explorer 5.0+ or Netscape 4.7+.
Willoughby, Cerys; Bird, Colin L; Coles, Simon J; Frey, Jeremy G
2014-12-22
The drive toward more transparency in research, the growing willingness to make data openly available, and the reuse of data to maximize the return on research investment all increase the importance of being able to find information and make links to the underlying data. The use of metadata in Electronic Laboratory Notebooks (ELNs) to curate experiment data is an essential ingredient for facilitating discovery. The University of Southampton has developed a Web browser-based ELN that enables users to add their own metadata to notebook entries. A survey of these notebooks was completed to assess user behavior and patterns of metadata usage within ELNs, while user perceptions and expectations were gathered through interviews and user-testing activities within the community. The findings indicate that while some groups are comfortable with metadata and are able to design a metadata structure that works effectively, many users are making little attempt to use it, thereby endangering their ability to recover data in the future. A survey of patterns of metadata use in these notebooks, together with feedback from the user community, indicated that while a few groups are comfortable with metadata and are able to design a metadata structure that works effectively, many users adopt a "minimum required" approach to metadata. To investigate whether the patterns of metadata use in LabTrove were unusual, a series of surveys was undertaken to investigate metadata usage in a variety of platforms supporting user-defined metadata. These surveys also provided the opportunity to investigate whether interface designs in these other environments might inform strategies for encouraging metadata creation and more effective use of metadata in LabTrove.
Metadata squared: enhancing its usability for volunteered geographic information and the GeoWeb
Poore, Barbara S.; Wolf, Eric B.; Sui, Daniel Z.; Elwood, Sarah; Goodchild, Michael F.
2013-01-01
The Internet has brought many changes to the way geographic information is created and shared. One aspect that has not changed is metadata. Static spatial data quality descriptions were standardized in the mid-1990s and cannot accommodate the current climate of data creation where nonexperts are using mobile phones and other location-based devices on a continuous basis to contribute data to Internet mapping platforms. The usability of standard geospatial metadata is being questioned by academics and neogeographers alike. This chapter analyzes current discussions of metadata to demonstrate how the media shift that is occurring has affected requirements for metadata. Two case studies of metadata use are presented—online sharing of environmental information through a regional spatial data infrastructure in the early 2000s, and new types of metadata that are being used today in OpenStreetMap, a map of the world created entirely by volunteers. Changes in metadata requirements are examined for usability, the ease with which metadata supports coproduction of data by communities of users, how metadata enhances findability, and how the relationship between metadata and data has changed. We argue that traditional metadata associated with spatial data infrastructures is inadequate and suggest several research avenues to make this type of metadata more interactive and effective in the GeoWeb.
Evolutions in Metadata Quality
NASA Astrophysics Data System (ADS)
Gilman, J.
2016-12-01
Metadata Quality is one of the chief drivers of discovery and use of NASA EOSDIS (Earth Observing System Data and Information System) data. Issues with metadata such as lack of completeness, inconsistency, and use of legacy terms directly hinder data use. As the central metadata repository for NASA Earth Science data, the Common Metadata Repository (CMR) has a responsibility to its users to ensure the quality of CMR search results. This talk will cover how we encourage metadata authors to improve the metadata through the use of integrated rubrics of metadata quality and outreach efforts. In addition we'll demonstrate Humanizers, a technique for dealing with the symptoms of metadata issues. Humanizers allow CMR administrators to identify specific metadata issues that are fixed at runtime when the data is indexed. An example Humanizer is the aliasing of processing level "Level 1" to "1" to improve consistency across collections. The CMR currently indexes 35K collections and 300M granules.
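The Humanizer idea described above, a value alias applied when a record is indexed, can be sketched as follows. The rule layout, the field name `processing_level` and the function name `humanize` are hypothetical and do not reflect the CMR's actual configuration format. Only the "Level 1" to "1" alias comes from the abstract.

```python
# Hypothetical rule list: each rule rewrites one exact value of one field.
HUMANIZERS = [
    {"field": "processing_level", "source": "Level 1", "replacement": "1"},
]


def humanize(record, humanizers=HUMANIZERS):
    """Return a copy of `record` with alias rules applied; the input is not modified."""
    indexed = dict(record)
    for rule in humanizers:
        if indexed.get(rule["field"]) == rule["source"]:
            indexed[rule["field"]] = rule["replacement"]
    return indexed
```

Because the rule is applied to a copy at indexing time, the provider's original metadata stays untouched while search results become consistent across collections.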
Automatic Rail Extraction and Clearance Check with a Point Cloud Captured by MLS in a Railway
NASA Astrophysics Data System (ADS)
Niina, Y.; Honma, R.; Honma, Y.; Kondo, K.; Tsuji, K.; Hiramatsu, T.; Oketani, E.
2018-05-01
Recently, MLS (Mobile Laser Scanning) has been successfully used in road maintenance. In this paper, we present the application of MLS to the inspection of clearance along the railway tracks of West Japan Railway Company. Point clouds around the track are captured by MLS mounted on a bogie, and the rail position is determined by matching the shape of the ideal rail head to the point cloud using the ICP algorithm. A clearance check is executed automatically with a virtual clearance model laid along the extracted rail. In the evaluation, the error in the extracted rail positions is less than 3 mm. With respect to the automatic clearance check, the objects inside the clearance and those related to a contact line are successfully detected, as confirmed by visual inspection.
Robust extraction of the aorta and pulmonary artery from 3D MDCT image data
NASA Astrophysics Data System (ADS)
Taeprasartsit, Pinyo; Higgins, William E.
2010-03-01
Accurate definition of the aorta and pulmonary artery from three-dimensional (3D) multi-detector CT (MDCT) images is important for pulmonary applications. This work presents robust methods for defining the aorta and pulmonary artery in the central chest. The methods work on both contrast-enhanced and no-contrast 3D MDCT image data. The automatic methods use a common approach employing model fitting and selection and adaptive refinement. For the occasional case in which more precise vascular extraction is desired or the automatic method fails, we also provide an alternate semi-automatic fail-safe method. The semi-automatic method extracts the vasculature by extending the medial axes in a user-guided direction. A ground-truth study over a series of 40 human 3D MDCT images demonstrates the efficacy, accuracy, robustness, and efficiency of the methods.
Metadata Means Communication: The Challenges of Producing Useful Metadata
NASA Astrophysics Data System (ADS)
Edwards, P. N.; Batcheller, A. L.
2010-12-01
Metadata are increasingly perceived as an important component of data sharing systems. For instance, metadata accompanying atmospheric model output may indicate the grid size, grid type, and parameter settings used in the model configuration. We conducted a case study of a data portal in the atmospheric sciences using in-depth interviews, document review, and observation. Our analysis revealed a number of challenges in producing useful metadata. First, creating and managing metadata required considerable effort and expertise, yet responsibility for these tasks was ill-defined and diffused among many individuals, leading to errors, failure to capture metadata, and uncertainty about the quality of the primary data. Second, metadata ended up stored in many different forms and software tools, making it hard to manage versions and transfer between formats. Third, the exact meanings of metadata categories remained unsettled and misunderstood even among a small community of domain experts -- an effect we expect to be exacerbated when scientists from other disciplines wish to use these data. In practice, we found that metadata problems due to these obstacles are often overcome through informal, personal communication, such as conversations or email. We conclude that metadata serve to communicate the context of data production from the people who produce data to those who wish to use it. Thus while formal metadata systems are often public, critical elements of metadata (those embodied in informal communication) may never be recorded. Therefore, efforts to increase data sharing should include ways to facilitate inter-investigator communication. Instead of tackling metadata challenges only on the formal level, we can improve data usability for broader communities by better supporting metadata communication.
Inheritance rules for Hierarchical Metadata Based on ISO 19115
NASA Astrophysics Data System (ADS)
Zabala, A.; Masó, J.; Pons, X.
2012-04-01
ISO 19115 has mainly been used to describe metadata for datasets and services. Furthermore, the ISO 19115 standard (as well as the new draft ISO 19115-1) includes a conceptual model that allows metadata to be described at different levels of granularity structured in hierarchical levels, both in aggregated resources such as series and datasets, and in more disaggregated resources such as types of entities (feature type), types of attributes (attribute type), entities (feature instances) and attributes (attribute instances). In theory, it is possible to apply a complete metadata structure to all hierarchical levels of metadata, from the whole series to an individual feature attribute, but storing all metadata at all levels is completely impractical. An inheritance mechanism is needed to store each metadata and quality information element at the optimum hierarchical level and to allow easy and efficient documentation of metadata both in an Earth observation scenario, such as multiband imagery from a multi-satellite mission, and in a complex vector topographical map that includes several feature types separated in layers (e.g. administrative limits, contour lines, edification polygons, road lines, etc). Moreover, because maps are traditionally split into tiles for handling at detailed scales, or because of satellite characteristics, each of the previous thematic layers (e.g. 1:5000 roads for a country) or bands (Landsat-5 TM cover of the Earth) is tiled into several parts (sheets or scenes, respectively). According to the hierarchy in ISO 19115, the definition of general metadata can be supplemented by spatially specific metadata that, when required, either inherits or overrides the general case (G.1.3). Annex H of this standard states that only metadata exceptions are defined at lower levels, so it is not necessary to generate the full registry of metadata for each level but to link particular values to the general value that they inherit.
Conceptually, the metadata registry is complete for each metadata hierarchical level, but at the implementation level most of the metadata elements are not stored at both levels but only at the more generic one. This communication defines a metadata system that covers 4 levels, describes which metadata has to support series-layer inheritance and in which way, and how hierarchical levels are defined and stored. Metadata elements are classified according to the type of inheritance between products, series, tiles and the datasets. It explains the classification of metadata elements and exemplifies it using core metadata elements. The communication also presents a metadata viewer and editing tool that uses the described model to propagate metadata elements and to show the user a complete set of metadata for each level in a transparent way. This tool is integrated in the MiraMon GIS software.
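The inheritance mechanism described in this abstract, where lower levels store only their exceptions and inherit everything else, can be sketched with a chain of dictionaries. The level contents and field names in the usage note are hypothetical, and the sketch does not model the element-by-element inheritance types that the communication classifies.

```python
from collections import ChainMap


def resolve(*levels):
    """Merge metadata levels given from most general to most specific.

    Each level is a dict holding only the elements defined at that level;
    a more specific level overrides the values it redefines.
    """
    # ChainMap searches its first mapping first, so the most specific level leads.
    return dict(ChainMap(*reversed(levels)))
```

A series-level record such as `{"title": "Roads 1:5000", "license": "CC-BY"}` combined with a tile-level record `{"title": "Roads 1:5000, sheet 12"}` resolves to the tile title plus the inherited license. This gives a complete view for each level while storing each element only once.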
The role of metadata in managing large environmental science datasets. Proceedings
DOE Office of Scientific and Technical Information (OSTI.GOV)
Melton, R.B.; DeVaney, D.M.; French, J. C.
1995-06-01
The purpose of this workshop was to bring together computer science researchers and environmental sciences data management practitioners to consider the role of metadata in managing large environmental sciences datasets. The objectives included: establishing a common definition of metadata; identifying categories of metadata; defining problems in managing metadata; and defining problems related to linking metadata with primary data.
NASA Astrophysics Data System (ADS)
Fripp, Jurgen; Crozier, Stuart; Warfield, Simon K.; Ourselin, Sébastien
2007-03-01
The accurate segmentation of the articular cartilages from magnetic resonance (MR) images of the knee is important for clinical studies and drug trials into conditions like osteoarthritis. Currently, segmentations are obtained using time-consuming manual or semi-automatic algorithms which have high inter- and intra-observer variabilities. This paper presents an important step towards obtaining automatic and accurate segmentations of the cartilages, namely an approach to automatically segment the bones and extract the bone-cartilage interfaces (BCI) in the knee. The segmentation is performed using three-dimensional active shape models, which are initialized using an affine registration to an atlas. The BCI are then extracted using image information and prior knowledge about the likelihood of each point belonging to the interface. The accuracy and robustness of the approach was experimentally validated using an MR database of fat suppressed spoiled gradient recall images. The (femur, tibia, patella) bone segmentation had a median Dice similarity coefficient of (0.96, 0.96, 0.89) and an average point-to-surface error of 0.16 mm on the BCI. The extracted BCI had a median surface overlap of 0.94 with the real interface, demonstrating its usefulness for subsequent cartilage segmentation or quantitative analysis.
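The Dice similarity coefficient reported above is defined as twice the size of the overlap divided by the sum of the two segmentation sizes. The sketch below computes it for segmentations represented as collections of voxel coordinates or indices. That representation, and the convention of returning 1.0 for two empty segmentations, are assumptions made for illustration.

```python
def dice(segmentation_a, segmentation_b):
    """Dice similarity coefficient 2|A ∩ B| / (|A| + |B|) for two voxel collections."""
    a, b = set(segmentation_a), set(segmentation_b)
    if not a and not b:
        return 1.0  # Assumed convention: two empty segmentations agree perfectly.
    return 2 * len(a & b) / (len(a) + len(b))
```

A value of 1.0 means the automatic and reference segmentations coincide, and 0.0 means they share no voxels.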
Automatic emotional expression analysis from eye area
NASA Astrophysics Data System (ADS)
Akkoç, Betül; Arslan, Ahmet
2015-02-01
Eyes play an important role in expressing emotions in nonverbal communication. In the present study, emotional expression classification was performed based on the features that were automatically extracted from the eye area. First, the face area and the eye area were automatically extracted from the captured image. Afterwards, the parameters to be used for the analysis through discrete wavelet transformation were obtained from the eye area. Using these parameters, emotional expression analysis was performed through artificial intelligence techniques. As the result of the experimental studies, 6 universal emotions consisting of expressions of happiness, sadness, surprise, disgust, anger and fear were classified at a success rate of 84% using artificial neural networks.
Towards automatic music transcription: note extraction based on independent subspace analysis
NASA Astrophysics Data System (ADS)
Wellhausen, Jens; Hoynck, Michael
2005-01-01
Due to the increasing amount of music available electronically, the need for automatic search, retrieval and classification systems for music becomes more and more important. In this paper an algorithm for automatic transcription of polyphonic piano music into MIDI data is presented, which is a very interesting basis for database applications, music analysis and music classification. The first part of the algorithm performs a note-accurate temporal audio segmentation. In the second part, the resulting segments are examined using Independent Subspace Analysis to extract sounding notes. Finally, the results are used to build a MIDI file as a new representation of the piece of music which is examined.
Towards automatic music transcription: note extraction based on independent subspace analysis
NASA Astrophysics Data System (ADS)
Wellhausen, Jens; Höynck, Michael
2004-12-01
Due to the increasing amount of music available electronically, the need for automatic search, retrieval and classification systems for music is becoming more and more important. In this paper an algorithm for automatic transcription of polyphonic piano music into MIDI data is presented, which is a very interesting basis for database applications, music analysis and music classification. The first part of the algorithm performs a note-accurate temporal audio segmentation. In the second part, the resulting segments are examined using Independent Subspace Analysis to extract sounding notes. Finally, the results are used to build a MIDI file as a new representation of the piece of music being examined.
NASA Astrophysics Data System (ADS)
Adiri, Zakaria; El Harti, Abderrazak; Jellouli, Amine; Lhissou, Rachid; Maacha, Lhou; Azmi, Mohamed; Zouhair, Mohamed; Bachaoui, El Mostafa
2017-12-01
Lineament mapping occupies an important place in several fields, including geology, hydrogeology and topography. With the help of remote sensing techniques, lineaments can be better identified owing to strong advances in the data and methods used. This has made it possible to go beyond the usual classical procedures and achieve more precise results. The aim of this work is to compare ASTER, Landsat-8 and Sentinel 1 sensor data in automatic lineament extraction. In addition to image data, the approach followed includes the use of the pre-existing geological map, the Digital Elevation Model (DEM) as well as the ground truth. Through a fully automatic approach consisting of a combination of an edge detection algorithm and a line-linking algorithm, we have found the optimal parameters for automatic lineament extraction in the study area. Thereafter, the comparison and validation of the obtained results showed that the Sentinel 1 data are more efficient in the restitution of lineaments. This indicates the performance of radar data compared to optical data in this kind of study.
NASA Astrophysics Data System (ADS)
Li, Y.; Jiang, Y.; Yang, C. P.; Armstrong, E. M.; Huang, T.; Moroni, D. F.; McGibbney, L. J.
2016-12-01
Big oceanographic data have been produced, archived and made available online, but finding the right data for scientific research and application development is still a significant challenge. A long-standing problem in data discovery is how to find the interrelationships between keywords and data, as well as the intrarelationships of the two individually. Most previous research attempted to solve this problem by building domain-specific ontologies either manually or through automatic machine learning techniques. The former is costly, labor intensive and hard to keep up-to-date, while the latter is prone to noise and may be difficult for humans to understand. Large-scale user behavior data modelling represents a largely untapped, unique, and valuable source for discovering semantic relationships among domain-specific vocabulary. In this article, we propose a search engine framework for mining and utilizing dataset relevancy from oceanographic dataset metadata, user behaviors, and existing ontologies. The objective is to improve the discovery accuracy of oceanographic data and reduce the time scientists need to discover, download and reformat data for their projects. Experiments and a search example show that the proposed search engine helps both scientists and general users search with better ranking results, recommendation, and ontology navigation.
NASA Technical Reports Server (NTRS)
Lopez, Antonio M., Jr.
1993-01-01
Development of an Intelligent Information System (IIS) involves application of numerous artificial intelligence (AI) paradigms and advanced technologies. The National Aeronautics and Space Administration (NASA) is interested in an IIS that can automatically collect, classify, store and retrieve data, as well as develop, manipulate and restructure knowledge regarding the data and its application (Campbell et al., 1987, p.3). This interest stems in part from a NASA initiative in support of the interagency Global Change Research program. NASA's space data problems are so large and varied that scientific researchers will find it almost impossible to access the most suitable information from a software system if meta-information (metadata and meta-knowledge) is not embedded in that system. Even if more, faster, larger hardware is used, new innovative software systems will be required to organize, link, maintain, and properly archive the Earth Observing System (EOS) data that is to be stored and distributed by the EOS Data and Information System (EOSDIS) (Dozier, 1990). Although efforts are being made to specify the metadata that will be used in EOSDIS, meta-knowledge specification issues are not clear. With the expectation that EOSDIS might evolve into an IIS, this paper presents certain ideas on the concept of meta-knowledge and demonstrates how meta-knowledge might be represented in a pixel classification problem.
NASA Astrophysics Data System (ADS)
Qin, Xulei; Cong, Zhibin; Halig, Luma V.; Fei, Baowei
2013-03-01
An automatic framework is proposed to segment the right ventricle on ultrasound images. This method can automatically segment both epicardial and endocardial boundaries from a continuous echocardiography series by combining sparse matrix transform (SMT), a training model, and a localized region based level set. First, the sparse matrix transform extracts the main motion regions of the myocardium as eigenimages by analyzing statistical information of these images. Second, a training model of the right ventricle is registered to the extracted eigenimages in order to automatically detect the main location of the right ventricle and the corresponding transform relationship between the training model and the SMT-extracted results in the series. Third, the training model is then adjusted as an adapted initialization for the segmentation of each image in the series. Finally, based on the adapted initializations, a localized region based level set algorithm is applied to segment both epicardial and endocardial boundaries of the right ventricle from the whole series. Experimental results from real subject data validated the performance of the proposed framework in segmenting the right ventricle from echocardiography. The mean Dice scores for the epicardial and endocardial boundaries are 89.1% ± 2.3% and 83.6% ± 7.3%, respectively. The automatic segmentation method based on sparse matrix transform and level set can provide a useful tool for quantitative cardiac imaging.
Making Metadata Better with CMR and MMT
NASA Technical Reports Server (NTRS)
Gilman, Jason Arthur; Shum, Dana
2016-01-01
Ensuring complete, consistent and high quality metadata is a challenge for metadata providers and curators. The CMR and MMT systems give providers and curators options to build in metadata quality from the start and also to assess and improve the quality of already existing metadata.
Automatic Extraction of Planetary Image Features
NASA Technical Reports Server (NTRS)
Troglio, G.; LeMoigne, J.; Moser, G.; Serpico, S. B.; Benediktsson, J. A.
2009-01-01
With the launch of several Lunar missions such as the Lunar Reconnaissance Orbiter (LRO) and Chandrayaan-1, a large number of Lunar images will be acquired and will need to be analyzed. Although many automatic feature extraction methods have been proposed and utilized for Earth remote sensing images, these methods are not always applicable to Lunar data, which often present low contrast and uneven illumination characteristics. In this paper, we propose a new method for the extraction of Lunar features (that can be generalized to other planetary images), based on the combination of several image processing techniques, a watershed segmentation and the generalized Hough Transform. This feature extraction has many applications, including image registration.
Count Me In! on the Automaticity of Numerosity Processing
ERIC Educational Resources Information Center
Naparstek, Sharon; Henik, Avishai
2010-01-01
Extraction of numerosity (i.e., enumeration) is an essential component of mathematical abilities. The current study asked how automatic is the processing of numerosity and whether automatic activation is task dependent. Participants were presented with displays containing a variable number of digits and were asked to pay attention to the number of…
Evolution in Metadata Quality: Common Metadata Repository's Role in NASA Curation Efforts
NASA Technical Reports Server (NTRS)
Gilman, Jason; Shum, Dana; Baynes, Katie
2016-01-01
Metadata Quality is one of the chief drivers of discovery and use of NASA EOSDIS (Earth Observing System Data and Information System) data. Issues with metadata such as lack of completeness, inconsistency, and use of legacy terms directly hinder data use. As the central metadata repository for NASA Earth Science data, the Common Metadata Repository (CMR) has a responsibility to its users to ensure the quality of CMR search results. This poster covers how we use humanizers, a technique for dealing with the symptoms of metadata issues, as well as our plans for future metadata validation enhancements. The CMR currently indexes 35K collections and 300M granules.
Patridge, Jeff; Namulanda, Gonza
2008-01-01
The Environmental Public Health Tracking (EPHT) Network provides an opportunity to bring together diverse environmental and health effects data by integrating local, state, and national databases of environmental hazards, environmental exposures, and health effects. To help users locate data on the EPHT Network, the network will utilize descriptive metadata that provide critical information as to the purpose, location, content, and source of these data. Since 2003, the Centers for Disease Control and Prevention's EPHT Metadata Subgroup has been working to initiate the creation and use of descriptive metadata. Efforts undertaken by the group include the adoption of a metadata standard, creation of an EPHT-specific metadata profile, development of an open-source metadata creation tool, and promotion of the creation of descriptive metadata by changing the perception of metadata in the public health culture.
Hancock, David; Wilson, Michael; Velarde, Giles; Morrison, Norman; Hayes, Andrew; Hulme, Helen; Wood, A Joseph; Nashar, Karim; Kell, Douglas B; Brass, Andy
2005-11-03
maxdLoad2 is a relational database schema and Java application for microarray experimental annotation and storage. It is compliant with all standards for microarray meta-data capture; including the specification of what data should be recorded, extensive use of standard ontologies and support for data exchange formats. The output from maxdLoad2 is of a form acceptable for submission to the ArrayExpress microarray repository at the European Bioinformatics Institute. maxdBrowse is a PHP web-application that makes contents of maxdLoad2 databases accessible via web-browser, the command-line and web-service environments. It thus acts as both a dissemination and data-mining tool. maxdLoad2 presents an easy-to-use interface to an underlying relational database and provides a full complement of facilities for browsing, searching and editing. There is a tree-based visualization of data connectivity and the ability to explore the links between any pair of data elements, irrespective of how many intermediate links lie between them. Its principal novel features are: the flexibility of the meta-data that can be captured, the tools provided for importing data from spreadsheets and other tabular representations, the tools provided for the automatic creation of structured documents, and the ability to browse and access the data via web and web-services interfaces. Within maxdLoad2 it is very straightforward to customise the meta-data that is being captured or change the definitions of the meta-data. These meta-data definitions are stored within the database itself allowing client software to connect properly to a modified database without having to be specially configured. The meta-data definitions (configuration file) can also be centralized allowing changes made in response to revisions of standards or terminologies to be propagated to clients without user intervention. maxdBrowse is hosted on a web-server and presents multiple interfaces to the contents of maxd databases.
maxdBrowse emulates many of the browse and search features available in the maxdLoad2 application via a web-browser. This allows users who are not familiar with maxdLoad2 to browse and export microarray data from the database for their own analysis. The same browse and search features are also available via command-line and SOAP server interfaces. This both enables scripting of data export for use embedded in data repositories and analysis environments, and allows access to the maxd databases via web-service architectures. maxdLoad2 http://www.bioinf.man.ac.uk/microarray/maxd/ and maxdBrowse http://dbk.ch.umist.ac.uk/maxdBrowse are portable and compatible with all common operating systems and major database servers. They provide a powerful, flexible package for annotation of microarray experiments and a convenient dissemination environment. They are available for download and open sourced under the Artistic License.
Metadata: Standards for Retrieving WWW Documents (and Other Digitized and Non-Digitized Resources)
NASA Astrophysics Data System (ADS)
Rusch-Feja, Diann
The use of metadata for indexing digitized and non-digitized resources for resource discovery in a networked environment is being increasingly implemented all over the world. Greater precision is achieved using metadata than relying on universal search engines and furthermore, meta-data can be used as filtering mechanisms for search results. An overview of various metadata sets is given, followed by a more focussed presentation of Dublin Core Metadata including examples of sub-elements and qualifiers. Especially the use of the Dublin Core Relation element provides connections between the metadata of various related electronic resources, as well as the metadata for physical, non-digitized resources. This facilitates more comprehensive search results without losing precision and brings together different genres of information which would otherwise be only searchable in separate databases. Furthermore, the advantages of Dublin Core Metadata in comparison with library cataloging and the use of universal search engines are discussed briefly, followed by a listing of types of implementation of Dublin Core Metadata.
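To make the role of the Dublin Core Relation element concrete, here is a minimal sketch that serializes a small record in which a digitized resource points to its physical counterpart. It uses Python's standard `xml.etree.ElementTree` and the Dublin Core element namespace; the wrapper element name `metadata`, the helper `dublin_core_record`, and all field values (including the `urn:example:` identifier) are hypothetical and do not come from the article.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML record from (element_name, value) pairs."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return root


# Made-up record: dc:relation links the digitized item to a printed edition.
record = dublin_core_record([
    ("title", "Example digitized report"),
    ("creator", "Doe, Jane"),
    ("relation", "urn:example:printed-edition-1"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A search service that follows `dc:relation` values can return the physical and electronic versions together, which is the kind of cross-resource connection the abstract describes.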
Obuch, Raymond C.; Carlino, Jennifer; Zhang, Lin; Blythe, Jonathan; Dietrich, Christopher; Hawkinson, Christine
2018-04-12
The Department of the Interior (DOI) is a Federal agency with over 90,000 employees across 10 bureaus and 8 agency offices. Its primary mission is to protect and manage the Nation’s natural resources and cultural heritage; provide scientific and other information about those resources; and honor its trust responsibilities or special commitments to American Indians, Alaska Natives, and affiliated island communities. Data and information are critical in day-to-day operational decision making and scientific research. DOI is committed to creating, documenting, managing, and sharing high-quality data and metadata in and across its various programs that support its mission. Documenting data through metadata is essential in realizing the value of data as an enterprise asset. The completeness, consistency, and timeliness of metadata affect users’ ability to search for and discover the most relevant data for the intended purpose, and facilitate the interoperability and usability of these data among DOI bureaus and offices. Fully documented metadata describe data usability, quality, accuracy, provenance, and meaning. Across DOI, there are different maturity levels and phases of information and metadata management implementations. The Department has organized a committee consisting of bureau-level points of contact to collaborate on the development of more consistent, standardized, and more effective metadata management practices and guidance to support this shared mission and the information needs of the Department. DOI’s metadata implementation plans establish key roles and responsibilities associated with metadata management processes, procedures, and a series of actions defined in three major metadata implementation phases including: (1) Getting started—Planning Phase, (2) Implementing and Maintaining Operational Metadata Management Phase, and (3) the Next Steps towards Improving Metadata Management Phase.
DOI’s phased approach for metadata management addresses some of the major data and metadata management challenges that exist across the diverse missions of the bureaus and offices. All employees who create, modify, or use data are involved with data and metadata management. Identifying, establishing, and formalizing the roles and responsibilities associated with metadata management are key to institutionalizing a framework of best practices, methodologies, processes, and common approaches throughout all levels of the organization; these are the foundation for effective data resource management. For executives and managers, metadata management strengthens their overarching views of data assets, holdings, and data interoperability; and clarifies how metadata management can help accelerate the compliance of multiple policy mandates. For employees, data stewards, and data professionals, formalized metadata management will help with the consistency of definitions, and approaches addressing data discoverability, data quality, and data lineage. In addition to data professionals and others associated with information technology; data stewards and program subject matter experts take on important metadata management roles and responsibilities as data flow through their respective business and science-related workflows. The responsibilities of establishing, practicing, and governing the actions associated with their specific metadata management roles are critical to successful metadata implementation.
Making Interoperability Easier with the NASA Metadata Management Tool
NASA Astrophysics Data System (ADS)
Shum, D.; Reese, M.; Pilone, D.; Mitchell, A. E.
2016-12-01
ISO 19115 has enabled interoperability amongst tools, yet many users find it hard to build ISO metadata for their collections because it can be large and overly flexible for their needs. The Metadata Management Tool (MMT), part of NASA's Earth Observing System Data and Information System (EOSDIS), offers users a modern, easy to use browser based tool to develop ISO compliant metadata. Through a simplified UI experience, metadata curators can create and edit collections without any understanding of the complex ISO-19115 format, while still generating compliant metadata. The MMT is also able to assess the completeness of collection level metadata by evaluating it against a variety of metadata standards. The tool provides users with clear guidance as to how to change their metadata in order to improve its quality and compliance. It is based on NASA's Unified Metadata Model for Collections (UMM-C), a simpler metadata model that can be cleanly mapped to ISO 19115. This allows metadata authors and curators to meet ISO compliance requirements faster and more accurately. The MMT and UMM-C have been developed in an agile fashion, with recurring end user tests and reviews to continually refine the tool, the model and the ISO mappings. This process is allowing for continual improvement and evolution to meet the community's needs.
Huang, Yukun; Chen, Rong; Wei, Jingbo; Pei, Xilong; Cao, Jing; Prakash Jayaraman, Prem; Ranjan, Rajiv
2014-01-01
JNI in the Android platform is often observed to have low efficiency and high coding complexity. Although many researchers have investigated the JNI mechanism, few of them solve the efficiency and the complexity problems of JNI in the Android platform simultaneously. In this paper, a hybrid polylingual object (HPO) model is proposed to allow a CAR object to be accessed as a Java object and vice versa in the Dalvik virtual machine. It is an acceptable substitute for JNI to reuse the CAR-compliant components in Android applications in a seamless and efficient way. The metadata injection mechanism is designed to support the automatic mapping and reflection between CAR objects and Java objects. A prototype virtual machine, called HPO-Dalvik, is implemented by extending the Dalvik virtual machine to support the HPO model. Lifespan management, garbage collection, and data type transformation of HPO objects are also handled in the HPO-Dalvik virtual machine automatically. The experimental results show that the HPO model outperforms the standard JNI, with lower overhead on the native side and better execution performance, and with no JNI bridging code required.
Automatic Authorship Detection Using Textual Patterns Extracted from Integrated Syntactic Graphs
Gómez-Adorno, Helena; Sidorov, Grigori; Pinto, David; Vilariño, Darnes; Gelbukh, Alexander
2016-01-01
We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation allows integrating different levels of language description into a single structure. We extract textual patterns based on features obtained from shortest path walks over integrated syntactic graphs and apply them to determine the authors of documents. On average, our method outperforms the state of the art approaches and gives consistently high results across different corpora, unlike existing methods. Our results show that our textual patterns are useful for the task of authorship attribution. PMID:27589740
Data System for HS3 Airborne Field Campaign
NASA Astrophysics Data System (ADS)
Maskey, M.; Mceniry, M.; Berendes, T.; Bugbee, K.; Conover, H.; Ramachandran, R.
2014-12-01
Hurricane and Severe Storm Sentinel (HS3) is a NASA airborne field campaign aimed at better understanding the physical processes that control hurricane intensity change. HS3 will help answer questions related to the roles of environmental conditions and internal storm structures in storm intensification. Due to the nature of the questions that the HS3 mission is addressing, it involves a variety of in-situ, satellite observations, airborne data, meteorological analyses, and simulation data. This variety of datasets presents numerous data management challenges for HS3. The methods used for airborne data management differ greatly from the methods used for space-borne data. In particular, metadata extraction, spatial and temporal indexing, and the large number of instruments and subsequent variables are a few of the data management challenges unique to airborne missions. A robust data system is required to successfully help HS3 scientists achieve their mission goals. Furthermore, the data system also needs to provide for data management that assists in broader use of HS3 data to enable future research activities. The Global Hydrology Resource Center (GHRC) is considering all these needs and designing a data system for HS3. Experience with past airborne field campaigns puts GHRC in a good position to address HS3 needs. However, the scale of this mission along with science requirements separates HS3 from previous field campaigns. The HS3 data system will include automated services for geo-location, metadata extraction, discovery, and distribution for all HS3 data. To answer the science questions, the data system will include a visual data exploration tool that is fully integrated into the data catalog. The tool will allow visually augmenting airborne data with analyses and simulations. Satellite data will provide contextual information during such data explorations.
All HS3 tools will be supported by an enterprise service architecture that will allow scaling, easy integration of new tools and existing services, and integration of new ESDIS metadata and security guidelines.
Using Activity-Related Behavioural Features towards More Effective Automatic Stress Detection
Giakoumis, Dimitris; Drosou, Anastasios; Cipresso, Pietro; Tzovaras, Dimitrios; Hassapis, George; Gaggioli, Andrea; Riva, Giuseppe
2012-01-01
This paper introduces activity-related behavioural features that can be automatically extracted from a computer system, with the aim of increasing the effectiveness of automatic stress detection. The proposed features are based on processing of appropriate video and accelerometer recordings taken from the monitored subjects. For the purposes of the present study, an experiment was conducted that utilized a stress-induction protocol based on the Stroop colour word test. Video, accelerometer and biosignal (Electrocardiogram and Galvanic Skin Response) recordings were collected from nineteen participants. Then, an explorative study was conducted by following a methodology mainly based on spatiotemporal descriptors (Motion History Images) that are extracted from video sequences. A large set of activity-related behavioural features, potentially useful for automatic stress detection, were proposed and examined. Experimental evaluation showed that several of these behavioural features significantly correlate with self-reported stress. Moreover, it was found that the use of the proposed features can significantly enhance the performance of typical automatic stress detection systems, commonly based on biosignal processing. PMID:23028461
1997-11-01
Rapid, Potentially Automatable, Method Extract Biomarkers for HPLC/ESI/MS/MS to Detect and Identify BW Agents
status can sometimes be reflected in the infectious potential or drug resistance of those pathogens. For example, in Mycobacterium tuberculosis ... Mycobacterium tuberculosis, its antibiotic resistance and prediction of pathogenicity amongst Mycobacterium spp. based on signature lipid biomarkers ...
Design of Automatic Extraction Algorithm of Knowledge Points for MOOCs
Chen, Haijian; Han, Dongmei; Zhao, Lina
2015-01-01
In recent years, Massive Open Online Courses (MOOCs) have become very popular among college students and have had a powerful impact on academic institutions. In the MOOCs environment, knowledge discovery and knowledge sharing are very important, and are currently often achieved by ontology techniques. In building an ontology, automatic extraction technology is crucial. Because general text mining algorithms do not work well on online courses, we designed the automatic extracting course knowledge points (AECKP) algorithm for online courses. It includes document classification, Chinese word segmentation, and POS tagging for each document. The Vector Space Model (VSM) is used to calculate similarity and to design the weight that optimizes the TF-IDF algorithm output values, and the terms with the higher scores are selected as knowledge points. Course documents of “C programming language” are selected for the experiment in this study. The results show that the proposed approach can achieve satisfactory accuracy rate and recall rate. PMID:26448738
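The TF-IDF and vector-space similarity steps mentioned above can be sketched as follows. This is plain textbook TF-IDF with cosine similarity, not the AECKP weighting, whose details the abstract does not give; the function names and the tiny tokenized "course documents" are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up, already-segmented course documents.
docs = [
    ["pointer", "array", "pointer", "function"],
    ["loop", "array", "function"],
    ["pointer", "struct", "function"],
]
weights = tf_idf(docs)
top_term = max(weights[0], key=weights[0].get)  # highest-scoring candidate
similarity = cosine(weights[0], weights[1])
print(top_term, round(similarity, 3))
```

Terms that occur in every document (here `function`) receive zero weight, so the highest-scoring terms are those that are frequent in one document but rare across the collection, which is why high scores are a reasonable proxy for candidate knowledge points.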
Road Network Extraction from Dsm by Mathematical Morphology and Reasoning
NASA Astrophysics Data System (ADS)
Li, Yan; Wu, Jianliang; Zhu, Lin; Tachibana, Kikuo
2016-06-01
The objective of this research is the automatic extraction of the road network in an urban scene from a high resolution digital surface model (DSM). Automatic road extraction and modeling from remotely sensed data has been studied for more than a decade. The methods vary greatly due to differences in data types, regions, resolutions, etc. An advanced automatic road network extraction scheme is proposed to address the tedious steps of segmentation, recognition and grouping. It is based on a geometric road model that describes a multiple-level structure. The 0-dimensional element is the intersection. The 1-dimensional elements are the central line and the side. The 2-dimensional element is the plane, which is generated from the 1-dimensional elements. The key feature of the presented approach is the cross validation of the three road elements, which runs through the entire procedure of their extraction. The advantage of our model and method is that linear elements of the road can be derived directly, without any complex, non-robust connection hypothesis. An example of a Japanese scene is presented to show the procedure and the performance of the approach.
GraphMeta: Managing HPC Rich Metadata in Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dai, Dong; Chen, Yong; Carns, Philip
High-performance computing (HPC) systems face increasingly critical metadata management challenges, especially in the approaching exascale era. These challenges arise not only from exploding metadata volumes, but also from increasingly diverse metadata, which contains data provenance and arbitrary user-defined attributes in addition to traditional POSIX metadata. This ‘rich’ metadata is becoming critical to supporting advanced data management functionality such as data auditing and validation. In our prior work, we identified a graph-based model as a promising solution to uniformly manage HPC rich metadata due to its flexibility and generality. However, at the same time, graph-based HPC rich metadata management also introduces significant challenges to the underlying infrastructure. In this study, we first identify the challenges on the underlying infrastructure to support scalable, high-performance rich metadata management. Based on that, we introduce GraphMeta, a graph-based engine designed for this use case. It achieves performance scalability by introducing a new graph partitioning algorithm and a write-optimal storage engine. We evaluate GraphMeta under both synthetic and real HPC metadata workloads, compare it with other approaches, and demonstrate its advantages in terms of efficiency and usability for rich metadata management in HPC systems.
Generation of Multiple Metadata Formats from a Geospatial Data Repository
NASA Astrophysics Data System (ADS)
Hudspeth, W. B.; Benedict, K. K.; Scott, S.
2012-12-01
The Earth Data Analysis Center (EDAC) at the University of New Mexico is partnering with the CYBERShARE and Environmental Health Group from the Center for Environmental Resource Management (CERM), located at the University of Texas, El Paso (UTEP), the Biodiversity Institute at the University of Kansas (KU), and the New Mexico Geo-Epidemiology Research Network (GERN) to provide a technical infrastructure that enables investigation of a variety of climate-driven human/environmental systems. Two significant goals of this NASA-funded project are: a) to increase the use of NASA Earth observational data at EDAC by various modeling communities through enabling better discovery, access, and use of relevant information, and b) to expose these communities to the benefits of provenance for improving understanding and usability of heterogeneous data sources and derived model products. To realize these goals, EDAC has leveraged the core capabilities of its Geographic Storage, Transformation, and Retrieval Engine (Gstore) platform, developed with support of the NSF EPSCoR Program. The Gstore geospatial services platform provides general purpose web services based upon the REST service model, and is capable of data discovery, access, and publication functions, metadata delivery functions, data transformation, and auto-generated OGC services for those data products that can support those services. Central to the NASA ACCESS project is the delivery of geospatial metadata in a variety of formats, including ISO 19115-2/19139, FGDC CSDGM, and the Proof Markup Language (PML). This presentation details the extraction and persistence of relevant metadata in the Gstore data store, and their transformation into multiple metadata formats that are increasingly utilized by the geospatial community to document not only core library catalog elements (e.g. 
title, abstract, publication data, geographic extent, projection information, and database elements), but also the processing steps used to generate derived modeling products. In particular, we discuss the generation and service delivery of provenance, or trace of data sources and analytical methods used in a scientific analysis, for archived data. We discuss the workflows developed by EDAC to capture end-to-end provenance, the storage model for those data in a delivery format independent data structure, and delivery of PML, ISO, and FGDC documents to clients requesting those products.
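The idea of keeping metadata in one format-independent structure and rendering it on request can be sketched as follows. This is a minimal illustration, not the Gstore implementation: the record fields, the layout names, and the element paths are all hypothetical placeholders and are not taken from the ISO, FGDC, or PML schemas.

```python
import xml.etree.ElementTree as ET

# One format-independent record (placeholder values).
RECORD = {"title": "Example dataset", "abstract": "Placeholder text."}

# Hypothetical crosswalks: internal field -> element path in each output layout.
CROSSWALKS = {
    "nested": {"title": "idinfo/citation/title",
               "abstract": "idinfo/descript/abstract"},
    "flat": {"title": "title", "abstract": "description"},
}

def render(record, layout):
    """Render one stored record as XML using the crosswalk for `layout`."""
    root = ET.Element("metadata")
    for field, path in CROSSWALKS[layout].items():
        node = root
        for tag in path.split("/"):
            # Reuse an existing child so sibling fields share parent elements.
            child = node.find(tag)
            node = child if child is not None else ET.SubElement(node, tag)
        node.text = record[field]
    return ET.tostring(root, encoding="unicode")

print(render(RECORD, "flat"))
print(render(RECORD, "nested"))
```

Adding another output format in this sketch means adding one crosswalk entry; the stored record itself does not change.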
Metabolonote: A Wiki-Based Database for Managing Hierarchical Metadata of Metabolome Analyses
Ara, Takeshi; Enomoto, Mitsuo; Arita, Masanori; Ikeda, Chiaki; Kera, Kota; Yamada, Manabu; Nishioka, Takaaki; Ikeda, Tasuku; Nihei, Yoshito; Shibata, Daisuke; Kanaya, Shigehiko; Sakurai, Nozomu
2015-01-01
Metabolomics – technology for comprehensive detection of small molecules in an organism – lags behind the other “omics” in terms of publication and dissemination of experimental data. Among the reasons for this are difficulty precisely recording information about complicated analytical experiments (metadata), existence of various databases with their own metadata descriptions, and low reusability of the published data, resulting in submitters (the researchers who generate the data) being insufficiently motivated. To tackle these issues, we developed Metabolonote, a Semantic MediaWiki-based database designed specifically for managing metabolomic metadata. We also defined a metadata and data description format, called “Togo Metabolome Data” (TogoMD), with an ID system that is required for unique access to each level of the tree-structured metadata such as study purpose, sample, analytical method, and data analysis. Separation of the management of metadata from that of data and permission to attach related information to the metadata provide advantages for submitters, readers, and database developers. The metadata are enriched with information such as links to comparable data, thereby functioning as a hub of related data resources. They also enhance not only readers’ understanding and use of data but also submitters’ motivation to publish the data. The metadata are computationally shared among other systems via APIs, which facilitate the construction of novel databases by database developers. A permission system that allows publication of immature metadata and feedback from readers also helps submitters to improve their metadata. Hence, this aspect of Metabolonote, as a metadata preparation tool, is complementary to high-quality and persistent data repositories such as MetaboLights. A total of 808 metadata for analyzed data obtained from 35 biological species are published currently. 
Metabolonote and related tools are available free of cost at http://metabolonote.kazusa.or.jp/. PMID:25905099
The Application of the SPASE Metadata Standard in the U.S. and Worldwide
NASA Astrophysics Data System (ADS)
Thieman, J. R.; King, T. A.; Roberts, D.
2012-12-01
The Space Physics Archive Search and Extract (SPASE) Metadata standard for Heliophysics and related data is now an established standard within the NASA-funded space and solar physics community and is spreading to the international groups within that community. Development of SPASE has involved a number of international partners, and the current version of the SPASE Metadata Model (version 2.2.2) has not needed any structural modifications since January 2011. The SPASE standard has been adopted by groups such as NASA's Heliophysics division, the Canadian Space Science Data Portal (CSSDP), Canada's AUTUMN network, Japan's Inter-university Upper atmosphere Global Observation NETwork (IUGONET), Centre de Données de la Physique des Plasmas (CDPP), and the near-Earth space data infrastructure for e-Science (ESPAS). In addition, portions of the SPASE dictionary have been modeled in semantic web ontologies for use with reasoners and semantic searches. While we anticipate additional modifications to the model in the future to accommodate simulation and model data, these changes will not affect the data descriptions already generated for instrument-related datasets. Examples of SPASE descriptions can be viewed at
SPASE, Metadata, and the Heliophysics Virtual Observatories
NASA Technical Reports Server (NTRS)
Thieman, James; King, Todd; Roberts, Aaron
2010-01-01
To provide data search and access capability in the field of Heliophysics (the study of the Sun and its effects on the Solar System, especially the Earth) a number of Virtual Observatories (VO) have been established both via direct funding from the U.S. National Aeronautics and Space Administration (NASA) and through other funding agencies in the U.S. and worldwide. At least 15 systems can be labeled as Virtual Observatories in the Heliophysics community, 9 of them funded by NASA. The problem is that different metadata and data search approaches are used by these VO's and a search for data relevant to a particular research question can involve consulting with multiple VO's - needing to learn a different approach for finding and acquiring data for each. The Space Physics Archive Search and Extract (SPASE) project is intended to provide a common data model for Heliophysics data and therefore a common set of metadata for searches of the VO's. The SPASE Data Model has been developed through the common efforts of the Heliophysics Data and Model Consortium (HDMC) representatives over a number of years. We currently have released Version 2.1 of the Data Model. The advantages and disadvantages of the Data Model will be discussed along with the plans for the future. Recent changes requested by new members of the SPASE community indicate some of the directions for further development.
Forecasting Chronic Diseases Using Data Fusion.
Acar, Evrim; Gürdeniz, Gözde; Savorani, Francesco; Hansen, Louise; Olsen, Anja; Tjønneland, Anne; Dragsted, Lars Ove; Bro, Rasmus
2017-07-07
Data fusion, that is, extracting information through the fusion of complementary data sets, is a topic of great interest in metabolomics because analytical platforms such as liquid chromatography-mass spectrometry (LC-MS) and nuclear magnetic resonance (NMR) spectroscopy commonly used for chemical profiling of biofluids provide complementary information. In this study, with a goal of forecasting acute coronary syndrome (ACS), breast cancer, and colon cancer, we jointly analyzed LC-MS, NMR measurements of plasma samples, and the metadata corresponding to the lifestyle of participants. We used supervised data fusion based on multiple kernel learning and exploited the linearity of the models to identify significant metabolites/features for the separation of healthy referents and the cases developing a disease. We demonstrated that (i) fusing LC-MS, NMR, and metadata provided better separation of ACS cases and referents compared with individual data sets, (ii) NMR data performed the best in terms of forecasting breast cancer, while fusion degraded the performance, and (iii) neither the individual data sets nor their fusion performed well for colon cancer. Furthermore, we showed the strengths and limitations of the fusion models by discussing their performance in terms of capturing known biomarkers for smoking and coffee. While fusion may improve performance in terms of separating certain conditions by jointly analyzing metabolomics and metadata sets, it is not necessarily always the best approach as in the case of breast cancer.
A novel murmur-based heart sound feature extraction technique using envelope-morphological analysis
NASA Astrophysics Data System (ADS)
Yao, Hao-Dong; Ma, Jia-Li; Fu, Bin-Bin; Wang, Hai-Yang; Dong, Ming-Chui
2015-07-01
Auscultation of heart sound (HS) signals has served for centuries as an important primary approach to diagnosing cardiovascular diseases (CVDs). To overcome the intrinsic drawbacks of traditional HS auscultation, computer-aided automatic HS auscultation based on feature extraction has developed rapidly. Yet most existing HS feature extraction methods adopt acoustic or time-frequency features that relate poorly to diagnostic information, which restricts the performance of further interpretation and analysis. To address this bottleneck, this paper proposes a novel murmur-based HS feature extraction method, since murmurs contain a large amount of pathological information and are regarded as the first indications of pathological occurrences of heart valves. Using the discrete wavelet transform (DWT) and the Shannon envelope, the envelope-morphological characteristics of murmurs are obtained and three features are extracted accordingly. Validated by discriminating normal HS from 5 kinds of abnormal HS signals with the extracted features, the proposed method is an attractive candidate for automatic HS auscultation.
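A Shannon envelope is commonly computed as the average Shannon energy, -x² log x², of the normalized signal over short frames. The sketch below illustrates only that step, assuming a plain list of samples and a hypothetical frame length. It omits the DWT stage and the three murmur features, and it is not the authors' implementation.

```python
import math

def shannon_envelope(samples, frame=4):
    """Average Shannon energy per non-overlapping frame.

    `frame` is a hypothetical frame length in samples; trailing samples
    that do not fill a whole frame are ignored.
    """
    # Normalize to [-1, 1]; fall back to 1.0 for an all-zero signal.
    peak = max(abs(s) for s in samples) or 1.0
    norm = [s / peak for s in samples]
    envelope = []
    for start in range(0, len(norm) - frame + 1, frame):
        energy = 0.0
        for x in norm[start:start + frame]:
            sq = x * x
            if sq > 0.0:  # treat 0 * log(0) as 0
                energy -= sq * math.log(sq)
        envelope.append(energy / frame)
    return envelope
```

Shannon energy weights medium-intensity samples more than very low or very high ones, which is why it is often preferred over squared energy for heart sound envelopes.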
NASA Astrophysics Data System (ADS)
Kamangir, H.; Momeni, M.; Satari, M.
2017-09-01
This paper presents an automatic method to extract road centerline networks from high and very high resolution satellite images. It addresses the automated extraction of roads covered by multiple natural and artificial objects such as trees, vehicles, and the shadows of buildings or trees. To extract roads precisely, the method implements three stages: classification of images with a maximum likelihood algorithm to categorize them into the classes of interest; modification of the classified images with connected component and morphological operators to extract the pixels of desired objects by removing undesirable pixels from each class; and finally line extraction based on the RANSAC algorithm. To evaluate the performance of the proposed method, the generated results are compared with a ground truth road map as a reference. The evaluation on representative test images shows completeness values ranging between 77% and 93%.
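The final stage, RANSAC line extraction, can be illustrated with a generic sketch that fits one line to 2-D points while ignoring outliers. The iteration count, tolerance, and seed are hypothetical parameters, and the code is a textbook RANSAC loop, not the pipeline described in the abstract.

```python
import random

def ransac_line(points, iterations=200, tolerance=1.0, seed=0):
    """Fit a line a*x + b*y + c = 0 to 2-D points, ignoring outliers.

    Returns the normalized (a, b, c) of the best candidate and its inliers.
    """
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(iterations):
        # Hypothesize a line from two randomly chosen points.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        a, b = y2 - y1, x1 - x2
        norm = (a * a + b * b) ** 0.5
        if norm == 0:
            continue  # identical points define no line
        c = -(a * x1 + b * y1)
        # Keep points whose perpendicular distance is within tolerance.
        inliers = [(x, y) for x, y in points
                   if abs(a * x + b * y + c) / norm <= tolerance]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = (a / norm, b / norm, c / norm), inliers
    return best_line, best_inliers
```

In a road-extraction setting the input points would be candidate road pixels, and the loop would typically be repeated on the remaining points to recover further segments.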
Robert E. Keane
2006-01-01
The Metadata (MD) table in the FIREMON database is used to record any information about the sampling strategy or data collected using the FIREMON sampling procedures. The MD method records metadata pertaining to a group of FIREMON plots, such as all plots in a specific FIREMON project. FIREMON plots are linked to metadata using a unique metadata identifier that is...
Valente, João; Vieira, Pedro M; Couto, Carlos; Lima, Carlos S
2018-02-01
Poor brain extraction in Magnetic Resonance Imaging (MRI) has negative consequences for several types of post-extraction processing, such as tissue segmentation and related statistical measures or pattern recognition algorithms. Current state-of-the-art algorithms for brain extraction work on T1- and T2-weighted images and are not adequate for non-whole-brain images such as T2*FLASH@7T partial volumes. This paper proposes two new methods that work directly on T2*FLASH@7T partial volumes. The first is an improvement of the semi-automatic threshold-with-morphology approach, adapted to incomplete volumes. The second uses an improved version of a current implementation of the fuzzy c-means algorithm with bias correction for brain segmentation. Under high inhomogeneity conditions the performance of the first method degrades, requiring user intervention, which is unacceptable. The second method performed well for all volumes and is entirely automatic. State-of-the-art algorithms for brain extraction are mainly semi-automatic, requiring correct initialization by the user and knowledge of the software. These methods cannot deal with partial volumes and/or need information from an atlas, which is not available for T2*FLASH@7T. Also, combined volumes suffer from manipulations such as re-sampling, which significantly deteriorates voxel intensity structures and makes segmentation tasks difficult. The proposed method can overcome all these difficulties, reaching good results for brain extraction using only T2*FLASH@7T volumes. This work will lead to improved automatic segmentation of brain lesions in T2*FLASH@7T volumes, which becomes more important when lesions such as cortical Multiple Sclerosis lesions need to be detected.
A data reduction package for multiple object spectroscopy
NASA Technical Reports Server (NTRS)
Hill, J. M.; Eisenhamer, J. D.; Silva, D. R.
1986-01-01
Experience with fiber-optic spectrometers has demonstrated improvements in observing efficiency for clusters of 30 or more objects that must in turn be matched by data reduction capability increases. The Medusa Automatic Reduction System reduces data generated by multiobject spectrometers in the form of two-dimensional images containing 44 to 66 individual spectra, using both software and hardware improvements to efficiently extract the one-dimensional spectra. Attention is given to the ridge-finding algorithm for automatic location of the spectra in the CCD frame. A simultaneous extraction of calibration frames allows an automatic wavelength calibration routine to determine dispersion curves, and both line measurements and cross-correlation techniques are used to determine galaxy redshifts.
Radar Determination of Fault Slip and Location in Partially Decorrelated Images
NASA Astrophysics Data System (ADS)
Parker, Jay; Glasscoe, Margaret; Donnellan, Andrea; Stough, Timothy; Pierce, Marlon; Wang, Jun
2017-06-01
Faced with the challenge of thousands of frames of radar interferometric images, automated feature extraction promises to spur data understanding and highlight geophysically active land regions for further study. We have developed techniques for automatically determining surface fault slip and location using deformation images from the NASA Uninhabited Aerial Vehicle Synthetic Aperture Radar (UAVSAR), which is similar to satellite-based SAR but has more mission flexibility and higher resolution (pixels are approximately 7 m). This radar interferometry provides a highly sensitive method, clearly indicating faults slipping at levels of 10 mm or less. But interferometric images are subject to decorrelation between revisit times, creating spots of bad data in the image. Our method begins with freely available data products from the UAVSAR mission, chiefly unwrapped interferograms, coherence images, and flight metadata. The computer vision techniques we use assume no data gaps or holes; so a preliminary step detects and removes spots of bad data and fills these holes by interpolation and blurring. Detected and partially validated surface fractures from earthquake main shocks, aftershocks, and aseismic-induced slip are shown for faults in California, including El Mayor-Cucapah (M7.2, 2010), the Ocotillo aftershock (M5.7, 2010), and South Napa (M6.0, 2014). Aseismic slip is detected on the San Andreas Fault from the El Mayor-Cucapah earthquake, in regions of highly patterned partial decorrelation. Validation is performed by comparing slip estimates from two interferograms with published ground truth measurements.
Documentation Resources on the ESIP Wiki
NASA Technical Reports Server (NTRS)
Habermann, Ted; Kozimor, John; Gordon, Sean
2017-01-01
The ESIP community includes data providers and users that communicate with one another through datasets and metadata that describe them. Improving this communication depends on consistent high-quality metadata. The ESIP Documentation Cluster and the wiki play an important central role in facilitating this communication. We will describe and demonstrate sections of the wiki that provide information about metadata concept definitions, metadata recommendation, metadata dialects, and guidance pages. We will also describe and demonstrate the ISO Explorer, a tool that the community is developing to help metadata creators.
Transforming Dermatologic Imaging for the Digital Era: Metadata and Standards.
Caffery, Liam J; Clunie, David; Curiel-Lewandrowski, Clara; Malvehy, Josep; Soyer, H Peter; Halpern, Allan C
2018-01-17
Imaging is increasingly being used in dermatology for documentation, diagnosis, and management of cutaneous disease. The lack of standards for dermatologic imaging is an impediment to clinical uptake. Standardization can occur in image acquisition, terminology, interoperability, and metadata. This paper presents the International Skin Imaging Collaboration position on standardization of metadata for dermatologic imaging. Metadata is essential to ensure that dermatologic images are properly managed and interpreted. There are two standards-based approaches to recording and storing metadata in dermatologic imaging. The first uses standard consumer image file formats, and the second is the file format and metadata model developed for the Digital Imaging and Communication in Medicine (DICOM) standard. DICOM would appear to provide an advantage over using consumer image file formats for metadata as it includes all the patient, study, and technical metadata necessary to use images clinically, whereas consumer image file formats only include technical metadata and need to be used in conjunction with another actor (for example, an electronic medical record) to supply the patient and study metadata. The use of DICOM may have some ancillary benefits in dermatologic imaging including leveraging DICOM network and workflow services, interoperability of images and metadata, leveraging existing enterprise imaging infrastructure, greater patient safety, and better compliance with legislative requirements for image retention.
Ismail, Mahmoud; Philbin, James
2015-01-01
The digital imaging and communications in medicine (DICOM) information model combines pixel data and its metadata in a single object. There are user scenarios that only need metadata manipulation, such as deidentification and study migration. Most picture archiving and communication systems use a database to store and update the metadata rather than updating the raw DICOM files themselves. The multiseries DICOM (MSD) format separates metadata from pixel data and eliminates duplicate attributes. This work promotes storing DICOM studies in MSD format to reduce the metadata processing time. A set of experiments is performed that updates the metadata of a set of DICOM studies for deidentification and migration. The studies are stored in both the traditional single frame DICOM (SFD) format and the MSD format. The results show that it is faster to update studies’ metadata in MSD format than in SFD format because the bulk data is separated in MSD and is not retrieved from the storage system. In addition, it is space efficient to store the deidentified studies in MSD format as it shares the same bulk data object with the original study. In summary, separation of metadata from pixel data using the MSD format provides fast metadata access and speeds up applications that process only the metadata. PMID:26158117
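The benefit of separating metadata from bulk data can be shown with a toy model. The sketch below assumes each image is a dictionary with `meta` and `pixels` keys. That layout and both function names are hypothetical and do not reproduce the actual MSD file format. It only shows that deidentification can run on the metadata part without reading or copying pixel data.

```python
def split_study(images):
    """Separate shared attributes, per-image attributes, and bulk pixel data."""
    # Attributes with the same value in every image are stored once.
    common = dict(images[0]["meta"])
    for image in images[1:]:
        common = {k: v for k, v in common.items()
                  if image["meta"].get(k) == v}
    per_image = [{k: v for k, v in image["meta"].items() if k not in common}
                 for image in images]
    bulk = [image["pixels"] for image in images]
    return {"common": common, "per_image": per_image}, bulk

def deidentify(metadata, fields=("PatientName", "PatientID")):
    """Return a metadata copy with identifying fields blanked.

    The bulk data list is never touched, so the original and the
    deidentified study can share it.
    """
    def blank(attrs):
        return {k: ("" if k in fields else v) for k, v in attrs.items()}
    return {"common": blank(metadata["common"]),
            "per_image": [blank(attrs) for attrs in metadata["per_image"]]}
```

A real deidentification profile covers many more attributes than the two used here; the short list keeps the sketch readable.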
ISO, FGDC, DIF and Dublin Core - Making Sense of Metadata Standards for Earth Science Data
NASA Astrophysics Data System (ADS)
Jones, P. R.; Ritchey, N. A.; Peng, G.; Toner, V. A.; Brown, H.
2014-12-01
Metadata standards provide common definitions of metadata fields for information exchange across user communities. Despite the broad adoption of metadata standards for Earth science data, there are still heterogeneous and incompatible representations of information due to differences between the many standards in use and how each standard is applied. Federal agencies are required to manage and publish metadata in different metadata standards and formats for various data catalogs. In 2014, the NOAA National Climatic Data Center (NCDC) managed metadata for its scientific datasets in ISO 19115-2 in XML, GCMD Directory Interchange Format (DIF) in XML, DataCite Schema in XML, Dublin Core in XML, and Data Catalog Vocabulary (DCAT) in JSON, with more standards and profiles of standards planned. Of these standards, the ISO 19115-series metadata is the most complete and feature-rich, and for this reason it is used by NCDC as the source for the other metadata standards. We will discuss the capabilities of metadata standards and how these standards are being implemented to document datasets. Successful implementations include developing translations and displays using XSLTs, creating links to related data and resources, documenting dataset lineage, and establishing best practices. Benefits, gaps, and challenges will be highlighted with suggestions for improved approaches to metadata storage and maintenance.
Automated Deployment of Advanced Controls and Analytics in Buildings
NASA Astrophysics Data System (ADS)
Pritoni, Marco
Buildings use 40% of primary energy in the US. Recent studies show that developing energy analytics and enhancing control strategies can significantly improve their energy performance. However, the deployment of advanced control software applications has been mostly limited to academic studies. Larger-scale implementations are prevented by the significant engineering time and customization required, due to significant differences among buildings. This study demonstrates how physics-inspired data-driven models can be used to develop portable analytics and control applications for buildings. Specifically, I demonstrate application of these models in all phases of the deployment of advanced controls and analytics in buildings: in the first phase, "Site Preparation and Interface with Legacy Systems" I used models to discover or map relationships among building components, automatically gathering metadata (information about data points) necessary to run the applications. During the second phase: "Application Deployment and Commissioning", models automatically learn system parameters, used for advanced controls and analytics. In the third phase: "Continuous Monitoring and Verification" I utilized models to automatically measure the energy performance of a building that has implemented advanced control strategies. In the conclusions, I discuss future challenges and suggest potential strategies for these innovative control systems to be widely deployed in the market. This dissertation provides useful new tools in terms of procedures, algorithms, and models to facilitate the automation of deployment of advanced controls and analytics and accelerate their wide adoption in buildings.
NASA Astrophysics Data System (ADS)
Hernández, B. E.; Bugbee, K.; le Roux, J.; Beaty, T.; Hansen, M.; Staton, P.; Sisco, A. W.
2017-12-01
Earth observation (EO) data collected as part of NASA's Earth Observing System Data and Information System (EOSDIS) is now searchable via the Common Metadata Repository (CMR). The Analysis and Review of CMR (ARC) Team at Marshall Space Flight Center has been tasked with reviewing all NASA metadata records in the CMR (~7,000 records). Each collection-level record and its constituent granule-level metadata are reviewed for both completeness and compliance with the CMR's set of metadata standards, as specified in the Unified Metadata Model (UMM). NASA's Distributed Active Archive Centers (DAACs) have been harmonizing priority metadata records within the context of the inter-agency federal Big Earth Data Initiative (BEDI), which seeks to improve the discoverability, accessibility, and usability of EO data. Thus, the first phase of this project constitutes reviewing BEDI metadata records, while the second phase will constitute reviewing the remaining non-BEDI records in CMR. This presentation will discuss the ARC team's findings in terms of the overall quality of BEDI records across all DAACs as well as compliance with UMM standards. For instance, only a fifth of the collection-level metadata fields needed correction, compared to a quarter of the granule-level fields. It should be noted that the degree to which DAACs' metadata did not comply with the UMM standards may reflect multiple factors, such as recent changes in the UMM standards, and the utilization of different metadata formats (e.g. DIF 10, ECHO 10, ISO 19115-1) across the DAACs. Insights, constructive criticism, and lessons learned from this metadata review process will be contributed from both ORNL and SEDAC. Further inquiry along such lines may lead to insights which may improve the metadata curation process moving forward. In terms of the broader implications for metadata compliance with the UMM standards, this research has shown that a large proportion of the prioritized collections have already been made compliant, although the process of improving metadata quality is ongoing and iterative. Further research is also warranted into whether or not the gains in metadata quality are also driving gains in data use.
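A completeness review of the kind described can be approximated by checking each record against a list of required fields and reporting the share of fields needing correction. The sketch below is only an illustration: the field names are hypothetical placeholders, not the UMM's required elements, and the actual ARC review also assesses compliance, which a presence check cannot capture.

```python
# Hypothetical required fields; not the UMM element list.
REQUIRED_FIELDS = ("title", "abstract", "temporal_extent")

def review_record(record, required=REQUIRED_FIELDS):
    """Return the required fields that are missing or empty in one record."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing

def fraction_needing_correction(records, required=REQUIRED_FIELDS):
    """Share of all required field slots that are missing or empty."""
    total = len(records) * len(required)
    flagged = sum(len(review_record(r, required)) for r in records)
    return flagged / total if total else 0.0
```

Running the same check separately on collection-level and granule-level records would give the kind of per-level proportions the abstract reports.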
Automatic Extraction of JPF Options and Documentation
NASA Technical Reports Server (NTRS)
Luks, Wojciech; Tkachuk, Oksana; Buschnell, David
2011-01-01
Documenting existing Java PathFinder (JPF) projects or developing new extensions is a challenging task. JPF provides a platform for creating new extensions and relies on key-value properties for their configuration. Keeping track of all possible options and extension mechanisms in JPF can be difficult. This paper presents jpf-autodoc-options, a tool that automatically extracts JPF project options and other documentation-related information, which can greatly help both JPF users and developers of JPF extensions.
Forum Guide to Metadata: The Meaning behind Education Data. NFES 2009-805
ERIC Educational Resources Information Center
National Forum on Education Statistics, 2009
2009-01-01
The purpose of this guide is to empower people to more effectively use data as information. To accomplish this, the publication explains what metadata are; why metadata are critical to the development of sound education data systems; what components comprise a metadata system; what value metadata bring to data management and use; and how to…
ERIC Educational Resources Information Center
Yang, Le
2016-01-01
This study analyzed digital item metadata and keywords from Internet search engines to learn what metadata elements actually facilitate discovery of digital collections through Internet keyword searching and how significantly each metadata element affects the discovery of items in a digital repository. The study found that keywords from Internet…
Automatic differential analysis of NMR experiments in complex samples.
Margueritte, Laure; Markov, Petar; Chiron, Lionel; Starck, Jean-Philippe; Vonthron-Sénécheau, Catherine; Bourjot, Mélanie; Delsuc, Marc-André
2018-06-01
Liquid state nuclear magnetic resonance (NMR) is a powerful tool for the analysis of complex mixtures of unknown molecules. This capacity has been used in many analytical approaches: metabolomics, identification of active compounds in natural extracts, and characterization of species. Such studies require the acquisition of many diverse NMR measurements on series of samples. Although acquisition can easily be performed automatically, the number of NMR experiments involved in these studies increases very rapidly, and this data avalanche requires automatic processing and analysis. We present here a program that allows the autonomous, unsupervised processing of a large corpus of 1D, 2D, and diffusion-ordered spectroscopy experiments from a series of samples acquired in different conditions. The program provides all the signal processing steps, as well as peak-picking and bucketing of 1D and 2D spectra; the program and its components are fully available. In an experiment mimicking the search for a bioactive species in a natural extract, we use it for the automatic detection of small amounts of artemisinin added to a series of plant extracts and for the generation of the spectral fingerprint of this molecule. This program, called Plasmodesma, is a novel tool that should be useful for deciphering complex mixtures, particularly in the discovery of biologically active natural products from plant extracts, but also in drug discovery or metabolomics studies.
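Bucketing reduces a spectrum to a fixed set of regions so that samples can be compared point by point. The sketch below shows the simplest variant, uniform-width bucketing of a 1D spectrum, with a hypothetical bucket width in ppm. The abstract does not specify which bucketing scheme the program uses, so this should be read as a generic illustration.

```python
def bucket_spectrum(ppm, intensity, width=0.5):
    """Sum intensities into fixed-width chemical-shift buckets.

    Returns a dict keyed by the lower edge of each bucket, in ascending
    order. `width` is a hypothetical bucket size in ppm.
    """
    buckets = {}
    for shift, value in zip(ppm, intensity):
        # Lower edge of the bucket containing this point; rounding
        # removes floating-point noise from the key.
        edge = round((shift // width) * width, 6)
        buckets[edge] = buckets.get(edge, 0.0) + value
    return dict(sorted(buckets.items()))
```

Comparing bucket vectors across a series of samples is one way to spot a signal that appears only when a compound has been added.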
McMahon, Christiana; Denaxas, Spiros
2016-01-01
Metadata are critical in epidemiological and public health research. However, a lack of biomedical metadata quality frameworks and limited awareness of the implications of poor quality metadata renders data analyses problematic. In this study, we created and evaluated a novel framework to assess metadata quality of epidemiological and public health research datasets. We performed a literature review and surveyed stakeholders to enhance our understanding of biomedical metadata quality assessment. The review identified 11 studies and nine quality dimensions; none of which were specifically aimed at biomedical metadata. 96 individuals completed the survey; of those who submitted data, most only assessed metadata quality sometimes, and eight did not at all. Our framework has four sections: a) general information; b) tools and technologies; c) usability; and d) management and curation. We evaluated the framework using three test cases and sought expert feedback. The framework can assess biomedical metadata quality systematically and robustly. PMID:27570670
New auto-segment method of cerebral hemorrhage
NASA Astrophysics Data System (ADS)
Wang, Weijiang; Shen, Tingzhi; Dang, Hua
2007-12-01
A novel method for automatic segmentation of cerebral hemorrhage (CH) in computerized tomography (CT) images is presented. It uses an expert system that models human knowledge about the CH segmentation problem. The algorithm proceeds through a series of dedicated steps and extracts easily overlooked CH features identified from statistics over a large set of real CH images, such as region area, region CT number, region smoothness, and statistical relationships between CH regions. A seven-step extraction mechanism ensures that these features are obtained correctly and efficiently. Using these features, a decision tree that models human knowledge about the CH segmentation problem has been built to ensure the rationality and accuracy of the algorithm. Finally, experiments were carried out to verify the correctness and reasonableness of the automatic segmentation; the high rate of correct segmentation and the fast speed make the method suitable for wide practical application.
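The decision-tree step in this abstract can be illustrated with a minimal sketch. The feature names follow the abstract (region area, region CT number, region smoothness), but the `Region` class, the thresholds, and the shape of the tree are hypothetical placeholders chosen for illustration; none of them are taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Features of one candidate region (hypothetical representation)."""
    area_px: int        # region area in pixels
    mean_ct: float      # mean CT number of the region
    smoothness: float   # assumed 0..1 scale, higher means more uniform


# Hypothetical thresholds, for illustration only.
MIN_AREA_PX = 50
CT_RANGE = (50.0, 90.0)
MIN_SMOOTHNESS = 0.6


def is_hemorrhage(region: Region) -> bool:
    """Walk a tiny hand-built decision tree over the region features."""
    if region.area_px < MIN_AREA_PX:
        return False  # too small to be considered
    if not (CT_RANGE[0] <= region.mean_ct <= CT_RANGE[1]):
        return False  # CT number outside the assumed range
    return region.smoothness >= MIN_SMOOTHNESS


print(is_hemorrhage(Region(area_px=200, mean_ct=65.0, smoothness=0.8)))
print(is_hemorrhage(Region(area_px=10, mean_ct=65.0, smoothness=0.8)))
```

Encoding each expert rule as one branch keeps the tree readable; in a real system the thresholds would come from statistics over labeled images, as the abstract describes.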
Automatic detection of typical dust devils from Mars landscape images
NASA Astrophysics Data System (ADS)
Ogohara, Kazunori; Watanabe, Takeru; Okumura, Susumu; Hatanaka, Yuji
2018-02-01
This paper presents an improved algorithm for automatic detection of Martian dust devils that successfully extracts tiny bright dust devils and obscured large dust devils from two subtracted landscape images. These dust devils are frequently observed using visible cameras onboard landers or rovers. Nevertheless, previous research on automated detection of dust devils has not focused on these common types of dust devils, but on dust devils that appear on images to be irregularly bright and large. In this study, we detect these common dust devils automatically using two kinds of parameter sets for thresholding when binarizing subtracted images. We automatically extract dust devils from 266 images taken by the Spirit rover to evaluate our algorithm. Taking dust devils detected by visual inspection to be ground truth, the precision, recall and F-measure values are 0.77, 0.86, and 0.81, respectively.
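The reported F-measure can be checked against the reported precision and recall, assuming it is the usual harmonic mean of the two (F1). The function name `f_measure` is only an illustrative choice.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract above.
print(round(f_measure(0.77, 0.86), 2))  # 0.81
```

The result, 2 × 0.77 × 0.86 / (0.77 + 0.86) ≈ 0.8125, rounds to the 0.81 stated in the abstract.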
Clinical Assistant Diagnosis for Electronic Medical Record Based on Convolutional Neural Network.
Yang, Zhongliang; Huang, Yongfeng; Jiang, Yiran; Sun, Yuxi; Zhang, Yu-Jin; Luo, Pengcheng
2018-04-20
Automatically extracting useful information from electronic medical records and conducting disease diagnoses is a promising task for both clinical decision support (CDS) and natural language processing (NLP). Most existing systems are based on manually constructed knowledge bases, with auxiliary diagnosis performed by rule matching. In this study, we present a clinical intelligent decision approach based on convolutional neural networks (CNN), which can automatically extract high-level semantic information from electronic medical records and then perform automatic diagnosis without manual construction of rules or knowledge bases. We use 18,590 real-world clinical electronic medical records to train and test the proposed model. Experimental results show that the proposed model can achieve 98.67% accuracy and 96.02% recall, which strongly supports the feasibility and effectiveness of using a convolutional neural network to automatically learn high-level semantic features of electronic medical records and then conduct assisted diagnosis.
samiDB: A Prototype Data Archive for Big Science Exploration
NASA Astrophysics Data System (ADS)
Konstantopoulos, I. S.; Green, A. W.; Cortese, L.; Foster, C.; Scott, N.
2015-04-01
samiDB is an archive, database, and query engine to serve the spectra, spectral hypercubes, and high-level science products that make up the SAMI Galaxy Survey. Based on the versatile Hierarchical Data Format (HDF5), samiDB does not depend on relational database structures and hence lightens the setup and maintenance load imposed on science teams by metadata tables. The code, written in Python, covers the ingestion, querying, and exporting of data as well as the automatic setup of an HTML schema browser. samiDB serves as a maintenance-light data archive for Big Science and can be adopted and adapted by science teams that lack the means to hire professional archivists to set up the data back end for their projects.
Towards Data Value-Level Metadata for Clinical Studies.
Zozus, Meredith Nahm; Bonner, Joseph
2017-01-01
While several standards for metadata describing clinical studies exist, comprehensive metadata to support traceability of data from clinical studies has not been articulated. We examine uses of metadata in clinical studies. We examine and enumerate seven sources of data value-level metadata in clinical studies inclusive of research designs across the spectrum of the National Institutes of Health definition of clinical research. The sources of metadata inform categorization in terms of metadata describing the origin of a data value, the definition of a data value, and operations to which the data value was subjected. The latter is further categorized into information about changes to a data value, movement of a data value, retrieval of a data value, and data quality checks, constraints or assessments to which the data value was subjected. The implications of tracking and managing data value-level metadata are explored.
Metadata-driven Clinical Data Loading into i2b2 for Clinical and Translational Science Institutes.
Post, Andrew R; Pai, Akshatha K; Willard, Richard; May, Bradley J; West, Andrew C; Agravat, Sanjay; Granite, Stephen J; Winslow, Raimond L; Stephens, David S
2016-01-01
Clinical and Translational Science Award (CTSA) recipients have a need to create research data marts from their clinical data warehouses, through research data networks and the use of i2b2 and SHRINE technologies. These data marts may have different data requirements and representations, thus necessitating separate extract, transform and load (ETL) processes for populating each mart. Maintaining duplicative procedural logic for each ETL process is onerous. We have created an entirely metadata-driven ETL process that can be customized for different data marts through separate configurations, each stored in an extension of i2b2's ontology database schema. We extended our previously reported and open source Eureka! Clinical Analytics software with this capability. The same software has created i2b2 data marts for several projects, the largest being the nascent Accrual for Clinical Trials (ACT) network, for which it has loaded over 147 million facts about 1.2 million patients.
Metadata-driven Clinical Data Loading into i2b2 for Clinical and Translational Science Institutes
Post, Andrew R.; Pai, Akshatha K.; Willard, Richard; May, Bradley J.; West, Andrew C.; Agravat, Sanjay; Granite, Stephen J.; Winslow, Raimond L.; Stephens, David S.
2016-01-01
Clinical and Translational Science Award (CTSA) recipients have a need to create research data marts from their clinical data warehouses, through research data networks and the use of i2b2 and SHRINE technologies. These data marts may have different data requirements and representations, thus necessitating separate extract, transform and load (ETL) processes for populating each mart. Maintaining duplicative procedural logic for each ETL process is onerous. We have created an entirely metadata-driven ETL process that can be customized for different data marts through separate configurations, each stored in an extension of i2b2's ontology database schema. We extended our previously reported and open source Eureka! Clinical Analytics software with this capability. The same software has created i2b2 data marts for several projects, the largest being the nascent Accrual for Clinical Trials (ACT) network, for which it has loaded over 147 million facts about 1.2 million patients. PMID:27570667
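The idea of a metadata-driven ETL process, one generic loader whose behavior is set entirely by a per-mart configuration, can be sketched as follows. This is only an illustration of the principle: the configuration format, the field names, and the `load` function are hypothetical and are not the Eureka! Clinical Analytics or i2b2 schema.

```python
from typing import Any, Callable, Dict, List, Tuple

# One configuration per data mart: target field -> (source column, converter).
MartConfig = Dict[str, Tuple[str, Callable[[str], Any]]]

# Hypothetical configurations for two marts with different requirements.
MART_A: MartConfig = {
    "patient_id": ("mrn", str),
    "systolic_bp": ("sbp_mmhg", int),
}
MART_B: MartConfig = {
    "patient_id": ("mrn", str),
    "weight_kg": ("wt_kg", float),
}


def load(rows: List[Dict[str, str]], config: MartConfig) -> List[Dict[str, Any]]:
    """Generic ETL step: the code is shared, only the configuration differs."""
    return [
        {target: convert(row[source]) for target, (source, convert) in config.items()}
        for row in rows
    ]


source_rows = [{"mrn": "001", "sbp_mmhg": "120", "wt_kg": "70.5"}]
print(load(source_rows, MART_A))  # [{'patient_id': '001', 'systolic_bp': 120}]
print(load(source_rows, MART_B))  # [{'patient_id': '001', 'weight_kg': 70.5}]
```

Adding a new data mart then means writing a new configuration rather than a new ETL program, which is the maintenance benefit the abstract points to.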
TokSearch: A search engine for fusion experimental data
Sammuli, Brian S.; Barr, Jayson L.; Eidietis, Nicholas W.; ...
2018-04-01
At a typical fusion research site, experimental data is stored using archive technologies that deal with each discharge as an independent set of data. These technologies (e.g. MDSplus or HDF5) are typically supplemented with a database that aggregates metadata for multiple shots to allow for efficient querying of certain predefined quantities. Often, however, a researcher will need to extract information from the archives, possibly for many shots, that is not available in the metadata store or otherwise indexed for quick retrieval. To address this need, a new search tool called TokSearch has been added to the General Atomics TokSys control design and analysis suite [1]. This tool provides the ability to rapidly perform arbitrary, parallelized queries of archived tokamak shot data (both raw and analyzed) over large numbers of shots. The TokSearch query API borrows concepts from SQL, and users can choose to implement queries in either Matlab™ or Python.
NASA Astrophysics Data System (ADS)
Lin, Po-Chuan; Chen, Bo-Wei; Chang, Hangbae
2016-07-01
This study presents a human-centric technique for social video expansion based on semantic processing and graph analysis. The objective is to increase metadata of an online video and to explore related information, thereby facilitating user browsing activities. To analyze the semantic meaning of a video, shots and scenes are firstly extracted from the video on the server side. Subsequently, this study uses annotations along with ConceptNet to establish the underlying framework. Detailed metadata, including visual objects and audio events among the predefined categories, are indexed by using the proposed method. Furthermore, relevant online media associated with each category are also analyzed to enrich the existing content. With the above-mentioned information, users can easily browse and search the content according to the link analysis and its complementary knowledge. Experiments on a video dataset are conducted for evaluation. The results show that our system can achieve satisfactory performance, thereby demonstrating the feasibility of the proposed idea.
Shao, Weixiang; Adams, Clive E; Cohen, Aaron M; Davis, John M; McDonagh, Marian S; Thakurta, Sujata; Yu, Philip S; Smalheiser, Neil R
2015-03-01
It is important to identify separate publications that report outcomes from the same underlying clinical trial, in order to avoid over-counting these as independent pieces of evidence. We created positive and negative training sets (comprised of pairs of articles reporting on the same condition and intervention) that were, or were not, linked to the same clinicaltrials.gov trial registry number. Features were extracted from MEDLINE and PubMed metadata; pairwise similarity scores were modeled using logistic regression. Article pairs from the same trial were identified with high accuracy (F1 score=0.843). We also created a clustering tool, Aggregator, that takes as input a PubMed user query for RCTs on a given topic, and returns article clusters predicted to arise from the same clinical trial. Although painstaking examination of full-text may be needed to be conclusive, metadata are surprisingly accurate in predicting when two articles derive from the same underlying clinical trial. Copyright © 2014 Elsevier Inc. All rights reserved.
TokSearch: A search engine for fusion experimental data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sammuli, Brian S.; Barr, Jayson L.; Eidietis, Nicholas W.
At a typical fusion research site, experimental data is stored using archive technologies that deal with each discharge as an independent set of data. These technologies (e.g. MDSplus or HDF5) are typically supplemented with a database that aggregates metadata for multiple shots to allow for efficient querying of certain predefined quantities. Often, however, a researcher will need to extract information from the archives, possibly for many shots, that is not available in the metadata store or otherwise indexed for quick retrieval. To address this need, a new search tool called TokSearch has been added to the General Atomics TokSys control design and analysis suite [1]. This tool provides the ability to rapidly perform arbitrary, parallelized queries of archived tokamak shot data (both raw and analyzed) over large numbers of shots. The TokSearch query API borrows concepts from SQL, and users can choose to implement queries in either Matlab™ or Python.
Managing Complex Change in Clinical Study Metadata
Brandt, Cynthia A.; Gadagkar, Rohit; Rodriguez, Cesar; Nadkarni, Prakash M.
2004-01-01
In highly functional metadata-driven software, the interrelationships within the metadata become complex, and maintenance becomes challenging. We describe an approach to metadata management that uses a knowledge-base subschema to store centralized information about metadata dependencies and use cases involving specific types of metadata modification. Our system borrows ideas from production-rule systems in that some of this information is a high-level specification that is interpreted and executed dynamically by a middleware engine. Our approach is implemented in TrialDB, a generic clinical study data management system. We review approaches that have been used for metadata management in other contexts and describe the features, capabilities, and limitations of our system. PMID:15187070
Rapid automatic keyword extraction for information retrieval and analysis
Rose, Stuart J [Richland, WA]; Cowley, Wendy E [Richland, WA]; Crow, Vernon L [Richland, WA]; Cramer, Nicholas O [Richland, WA]
2012-03-06
Methods and systems for rapid automatic keyword extraction for information retrieval and analysis. Embodiments can include parsing words in an individual document by delimiters, stop words, or both in order to identify candidate keywords. Word scores for each word within the candidate keywords are then calculated based on a function of co-occurrence degree, co-occurrence frequency, or both. Based on a function of the word scores for words within the candidate keyword, a keyword score is calculated for each of the candidate keywords. A portion of the candidate keywords are then extracted as keywords based, at least in part, on the candidate keywords having the highest keyword scores.
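The steps in this abstract (split at delimiters and stop words, score words by co-occurrence, sum word scores per candidate) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hypothetical one, the word score is taken as degree divided by frequency (one of the options the abstract allows), and the function names are illustrative.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not from the source).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "are", "on", "to", "by"}


def candidate_keywords(text):
    """Split text at delimiters and stop words; the remaining runs are candidates."""
    candidates = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        phrase = []
        for word in re.findall(r"[a-z0-9\-]+", fragment):
            if word in STOP_WORDS:
                if phrase:
                    candidates.append(tuple(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(tuple(phrase))
    return candidates


def rake(text, top_n=3):
    """Return the top_n candidate keywords with their scores."""
    candidates = candidate_keywords(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidates:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(candidates)}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


text = "Rapid automatic keyword extraction for information retrieval and analysis of documents."
print(rake(text))
```

With degree/frequency scoring, longer candidate phrases tend to score higher, so "rapid automatic keyword extraction" ranks above "information retrieval" in this example.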
Integrating Information Extraction Agents into a Tourism Recommender System
NASA Astrophysics Data System (ADS)
Esparcia, Sergio; Sánchez-Anguix, Víctor; Argente, Estefanía; García-Fornes, Ana; Julián, Vicente
Recommender systems face some problems. On the one hand, information needs to be kept up to date, which can be a costly task if it is not performed automatically. On the other hand, it may be worthwhile to include third-party services in the recommendation, since they improve its quality. In this paper, we present an add-on for the Social-Net Tourism Recommender System that uses information extraction and natural language processing techniques to automatically extract and classify information from the Web. Its goal is to keep the system up to date and to obtain information about third-party services that are not offered by service providers inside the system.
The impact of OCR accuracy on automated cancer classification of pathology reports.
Zuccon, Guido; Nguyen, Anthony N; Bergheim, Anton; Wickman, Sandra; Grayson, Narelle
2012-01-01
To evaluate the effects of Optical Character Recognition (OCR) on the automatic cancer classification of pathology reports. Scanned images of pathology reports were converted to electronic free-text using a commercial OCR system. A state-of-the-art cancer classification system, the Medical Text Extraction (MEDTEX) system, was used to automatically classify the OCR reports. Classifications produced by MEDTEX on the OCR versions of the reports were compared with the classification from a human-amended version of the OCR reports. The employed OCR system was found to recognise scanned pathology reports with up to 99.12% character accuracy and up to 98.95% word accuracy. Errors in the OCR processing were found to have minimal impact on the automatic classification of scanned pathology reports into notifiable groups. However, the impact of OCR errors is not negligible when considering the extraction of cancer notification items, such as primary site, histological type, etc. The automatic cancer classification system used in this work, MEDTEX, has proven to be robust to errors produced by the acquisition of free-text pathology reports from scanned images through OCR software. However, issues emerge when considering the extraction of cancer notification items.
NASA Astrophysics Data System (ADS)
Lugmayr, Artur R.; Mailaparampil, Anurag; Tico, Florina; Kalli, Seppo; Creutzburg, Reiner
2003-01-01
Digital television (digiTV) is an additional multimedia environment, where metadata is one key element for the description of arbitrary content. This implies adequate structures for content description, which is provided by XML metadata schemes (e.g. MPEG-7, MPEG-21). Content and metadata management is the task of a multimedia repository, from which digiTV clients - equipped with an Internet connection - can access rich additional multimedia types over an "All-HTTP" protocol layer. Within this research work, we focus on conceptual design issues of a metadata repository for the storage of metadata, accessible from the feedback channel of a local set-top box. Our concept describes the whole heterogeneous life-cycle chain of XML metadata from the service provider to the digiTV equipment, device independent representation of content, accessing and querying the metadata repository, management of metadata related to digiTV, and interconnection of basic system components (http front-end, relational database system, and servlet container). We present our conceptual test configuration of a metadata repository that is aimed at a real-world deployment, done within the scope of the future interaction (fiTV) project at the Digital Media Institute (DMI) Tampere (www.futureinteraction.tv).
Metazen – metadata capture for metagenomes
2014-01-01
Background: As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards-compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. Unfortunately, these tools are not specifically designed for metagenomic surveys; in particular, they lack the appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools. Results: Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata. Conclusions: Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility. PMID:25780508
Metazen - metadata capture for metagenomes.
Bischof, Jared; Harrison, Travis; Paczian, Tobias; Glass, Elizabeth; Wilke, Andreas; Meyer, Folker
2014-01-01
As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. Unfortunately, these tools are not specifically designed for metagenomic surveys; in particular, they lack the appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools. Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata. Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility.
Improving Access to NASA Earth Science Data through Collaborative Metadata Curation
NASA Astrophysics Data System (ADS)
Sisco, A. W.; Bugbee, K.; Shum, D.; Baynes, K.; Dixon, V.; Ramachandran, R.
2017-12-01
The NASA-developed Common Metadata Repository (CMR) is a high-performance metadata system that currently catalogs over 375 million Earth science metadata records. It serves as the authoritative metadata management system of NASA's Earth Observing System Data and Information System (EOSDIS), enabling NASA Earth science data to be discovered and accessed by a worldwide user community. The size of the EOSDIS data archive is steadily increasing, and the ability to manage and query this archive depends on the input of high quality metadata to the CMR. Metadata that does not provide adequate descriptive information diminishes the CMR's ability to effectively find and serve data to users. To address this issue, an innovative and collaborative review process is underway to systematically improve the completeness, consistency, and accuracy of metadata for approximately 7,000 data sets archived by NASA's twelve EOSDIS data centers, or Distributed Active Archive Centers (DAACs). The process involves automated and manual metadata assessment of both collection and granule records by a team of Earth science data specialists at NASA Marshall Space Flight Center. The team communicates results to DAAC personnel, who then make revisions and reingest improved metadata into the CMR. Implementation of this process relies on a network of interdisciplinary collaborators leveraging a variety of communication platforms and long-range planning strategies. Curating metadata at this scale and resolving metadata issues through community consensus improves the CMR's ability to serve current and future users and also introduces best practices for stewarding the next generation of Earth Observing System data. This presentation will detail the metadata curation process, its outcomes thus far, and also share the status of ongoing curation activities.
NASA Technical Reports Server (NTRS)
Shum, Dana; Bugbee, Kaylin
2017-01-01
This talk explains the ongoing metadata curation activities in the Common Metadata Repository. It explores tools that exist today which are useful for building quality metadata and also opens up the floor for discussions on other potentially useful tools.
NASA Astrophysics Data System (ADS)
Troyan, D.
2016-12-01
The Atmospheric Radiation Measurement (ARM) program has been collecting data from instruments in diverse climate regions for nearly twenty-five years. These data are made available to all interested parties at no cost via specially designed tools found on the ARM website (www.arm.gov). Metadata is created and applied to the various datastreams to facilitate information retrieval using the ARM website, the ARM Data Discovery Tool, and data quality reporting tools. Over the last year, the Metadata Manager - a relatively new position within the ARM program - created two documents that summarize the state of ARM metadata processes: ARM Metadata Workflow, and ARM Metadata Standards. These documents serve as guides to the creation and management of ARM metadata. With many of ARM's data functions spread around the Department of Energy national laboratory complex and with many of the original architects of the metadata structure no longer working for ARM, there is increased importance on using these documents to resolve issues from data flow bottlenecks and inaccurate metadata to improving data discovery and organizing web pages. This presentation will provide some examples from the workflow and standards documents. The examples will illustrate the complexity of the ARM metadata processes and the efficiency by which the metadata team works towards achieving the goal of providing access to data collected under the auspices of the ARM program.
Efficient processing of MPEG-21 metadata in the binary domain
NASA Astrophysics Data System (ADS)
Timmerer, Christian; Frank, Thomas; Hellwagner, Hermann; Heuer, Jörg; Hutter, Andreas
2005-10-01
XML-based metadata is widely adopted across different communities, and plenty of commercial and open source tools for processing and transforming it are available. However, all of these tools have one thing in common: they operate on plain-text encoded metadata, which may become a burden in constrained and streaming environments, i.e., when metadata needs to be processed together with multimedia content on the fly. In this paper we present an efficient approach for transforming metadata encoded using MPEG's Binary Format for Metadata (BiM) without additional en-/decoding overheads, i.e., within the binary domain. To this end, we have developed an event-based push parser for BiM encoded metadata which transforms the metadata by a limited set of processing instructions - based on traditional XML transformation techniques - operating on bit patterns instead of cost-intensive string comparisons.
A model for enhancing Internet medical document retrieval with "medical core metadata".
Malet, G; Munoz, F; Appleyard, R; Hersh, W
1999-01-01
Finding documents on the World Wide Web relevant to a specific medical information need can be difficult. The goal of this work is to define a set of document content description tags, or metadata encodings, that can be used to promote disciplined search access to Internet medical documents. The authors based their approach on a proposed metadata standard, the Dublin Core Metadata Element Set, which has recently been submitted to the Internet Engineering Task Force. Their model also incorporates the National Library of Medicine's Medical Subject Headings (MeSH) vocabulary and MEDLINE-type content descriptions. The model defines a medical core metadata set that can be used to describe the metadata for a wide variety of Internet documents. The authors propose that their medical core metadata set be used to assign metadata to medical documents to facilitate document retrieval by Internet search engines.
NASA Astrophysics Data System (ADS)
Jusman, Yessi; Ng, Siew-Cheok; Hasikin, Khairunnisa; Kurnia, Rahmadi; Osman, Noor Azuan Bin Abu; Teoh, Kean Hooi
2016-10-01
The capability of field emission scanning electron microscopy and energy dispersive x-ray spectroscopy (FE-SEM/EDX) to scan material structures at the microlevel and characterize the material with its elemental properties has inspired this research, which has developed an FE-SEM/EDX-based cervical cancer screening system. The developed computer-aided screening system consists of two parts: automatic feature extraction and classification. For feature extraction, an algorithm was introduced that extracts discriminant features from the FE-SEM/EDX images and spectra of cervical cells. The system automatically extracts two types of features, based on FE-SEM/EDX images and FE-SEM/EDX spectra. Textural features were extracted from the FE-SEM/EDX image using a gray level co-occurrence matrix technique, while the FE-SEM/EDX spectra features were calculated from peak heights and the corrected area under the peaks. A discriminant analysis technique was employed to classify the cervical precancerous stage into three classes: normal, low-grade intraepithelial squamous lesion (LSIL), and high-grade intraepithelial squamous lesion (HSIL). The capability of the developed screening system was tested using 700 FE-SEM/EDX spectra (300 normal, 200 LSIL, and 200 HSIL cases). The accuracy, sensitivity, and specificity were 98.2%, 99.0%, and 98.0%, respectively.
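The gray level co-occurrence matrix (GLCM) mentioned in this abstract can be sketched without any imaging library. The example below is a generic illustration, not the authors' pipeline: it counts pairs of gray levels at a fixed horizontal offset on a small toy image and derives one common texture feature (contrast). The abstract does not say which GLCM features were used, so the choice of contrast, the offset, and the function names are assumptions.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count co-occurrences of gray levels (i, j) at pixel offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """Contrast texture feature: sum of (i - j)^2 weighted by pair probability."""
    total = sum(sum(row) for row in matrix)
    return sum(
        (i - j) ** 2 * count / total
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )


# Toy 4x4 image with gray levels 0..3.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
m = glcm(image, levels=4)
print(m)            # [[2, 2, 1, 0], [0, 2, 0, 0], [0, 0, 3, 1], [0, 0, 0, 1]]
print(contrast(m))  # 7/12, about 0.583
```

A smooth texture concentrates counts on the diagonal of the matrix and gives low contrast; a rough texture spreads counts away from the diagonal and gives high contrast.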
Developing Cyberinfrastructure Tools and Services for Metadata Quality Evaluation
NASA Astrophysics Data System (ADS)
Mecum, B.; Gordon, S.; Habermann, T.; Jones, M. B.; Leinfelder, B.; Powers, L. A.; Slaughter, P.
2016-12-01
Metadata and data quality are at the core of reusable and reproducible science. While great progress has been made over the years, much of the metadata collected only addresses data discovery, covering concepts such as titles and keywords. Improving metadata beyond the discoverability plateau means documenting detailed concepts within the data such as sampling protocols, instrumentation used, and variables measured. Given that metadata commonly do not describe their data at this level, how might we improve the state of things? Giving scientists and data managers easy to use tools to evaluate metadata quality that utilize community-driven recommendations is the key to producing high-quality metadata. To achieve this goal, we created a set of cyberinfrastructure tools and services that integrate with existing metadata and data curation workflows which can be used to improve metadata and data quality across the sciences. These tools work across metadata dialects (e.g., ISO19115, FGDC, EML, etc.) and can be used to assess aspects of quality beyond what is internal to the metadata such as the congruence between the metadata and the data it describes. The system makes use of a user-friendly mechanism for expressing a suite of checks as code in popular data science programming languages such as Python and R. This reduces the burden on scientists and data managers to learn yet another language. We demonstrated these services and tools in three ways. First, we evaluated a large corpus of datasets in the DataONE federation of data repositories against a metadata recommendation modeled after existing recommendations such as the LTER best practices and the Attribute Convention for Dataset Discovery (ACDD). Second, we showed how this service can be used to display metadata and data quality information to data producers during the data submission and metadata creation process, and to data consumers through data catalog search and access tools. 
Third, we showed how the centrally deployed DataONE quality service can achieve major efficiency gains by allowing member repositories to customize and use recommendations that fit their specific needs without having to create de novo infrastructure at their site.
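The idea of expressing a suite of quality checks as ordinary code can be sketched as follows. This is a minimal illustration only, not the DataONE quality service itself: the `CheckResult` type, the check functions, and the dictionary layout of `metadata` (a `title` string and a `variables` list) are all hypothetical names chosen for the example. The second check illustrates congruence between metadata and the data it describes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    """Outcome of one quality check (hypothetical structure)."""
    name: str
    passed: bool
    message: str


def check_title_present(metadata: Dict) -> CheckResult:
    # Discovery-level check: the record must carry a non-empty title.
    title = (metadata.get("title") or "").strip()
    message = "title found" if title else "title is missing"
    return CheckResult("title_present", bool(title), message)


def check_variables_documented(metadata: Dict, data_columns: List[str]) -> CheckResult:
    # Congruence check: every column in the data file should be
    # described by a variable entry in the metadata.
    documented = {v["name"] for v in metadata.get("variables", [])}
    missing = [c for c in data_columns if c not in documented]
    message = "all columns documented" if not missing else "undocumented: " + ", ".join(missing)
    return CheckResult("variables_documented", not missing, message)


def run_suite(metadata: Dict, data_columns: List[str]) -> List[CheckResult]:
    # A "recommendation" is modeled here as an ordered list of checks.
    return [
        check_title_present(metadata),
        check_variables_documented(metadata, data_columns),
    ]


example_metadata = {"title": "Lake temperature profiles", "variables": [{"name": "depth"}]}
example_results = run_suite(example_metadata, ["depth", "temp"])
```

A repository could assemble a different list of checks in `run_suite` to match its own recommendation without changing the individual check functions.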
The New Online Metadata Editor for Generating Structured Metadata
NASA Astrophysics Data System (ADS)
Devarakonda, R.; Shrestha, B.; Palanisamy, G.; Hook, L.; Killeffer, T.; Boden, T.; Cook, R. B.; Zolly, L.; Hutchison, V.; Frame, M. T.; Cialella, A. T.; Lazer, K.
2014-12-01
Nobody is better suited to describe data than the scientist who created it. This description of the data is called metadata. In general terms, metadata represents the who, what, when, where, why and how of the dataset. eXtensible Markup Language (XML) is the preferred output format for metadata, as it makes it portable and, more importantly, suitable for system discoverability. The newly developed ORNL Metadata Editor (OME) is a Web-based tool that allows users to create and maintain XML files containing key information, or metadata, about the research. Metadata include information about the specific projects, parameters, time periods, and locations associated with the data. Such information helps put the research findings in context. In addition, the metadata produced using OME will allow other researchers to find these data via metadata clearinghouses like Mercury [1] [2]. Researchers simply use the ORNL Metadata Editor to enter relevant metadata into a Web-based form. How is OME helping Big Data Centers like ORNL DAAC? The ORNL DAAC is one of NASA's Earth Observing System Data and Information System (EOSDIS) data centers managed by the ESDIS Project. The ORNL DAAC archives data produced by NASA's Terrestrial Ecology Program. The DAAC provides data and information relevant to biogeochemical dynamics, ecological data, and environmental processes, critical for understanding the dynamics relating to the biological components of the Earth's environment. Typically, the data produced, archived and analyzed are at a scale of multiple petabytes, which makes discovering the data very challenging. Without proper metadata associated with the data, it is difficult to find the data you are looking for and equally difficult to use and understand the data.
OME will allow data centers like the ORNL DAAC to produce meaningful, high-quality, standards-based, descriptive information about their data products, in turn helping with data discoverability and interoperability. References: [1] Devarakonda, Ranjeet, et al. "Mercury: reusable metadata management, data discovery and access system." Earth Science Informatics 3.1-2 (2010): 87-94. [2] Wilson, Bruce E., et al. "Mercury Toolset for Spatiotemporal Metadata." NASA Technical Reports Server (NTRS) (2010).
Integrating historical clinical and financial data for pharmacological research.
Deshmukh, Vikrant G; Sower, N Brett; Hunter, Cheri Y; Mitchell, Joyce A
2011-11-18
Retrospective research requires longitudinal data, and repositories derived from electronic health records (EHR) can be sources of such data. With Health Information Technology for Economic and Clinical Health (HITECH) Act meaningful use provisions, many institutions are expected to adopt EHRs, but may be left with large amounts of financial and historical clinical data, which can differ significantly from data obtained from newer systems, due to lack or inconsistent use of controlled medical terminologies (CMT) in older systems. We examined different approaches for semantic enrichment of financial data with CMT, and integration of clinical data from disparate historical and current sources for research. Snapshots of financial data from 1999, 2004 and 2009 were mapped automatically to the current inpatient pharmacy catalog, and enriched with RxNorm. Administrative metadata from financial and dispensing systems, RxNorm and two commercial pharmacy vocabularies were used to integrate data from current and historical inpatient pharmacy modules, and the outpatient EHR. Data integration approaches were compared using percentages of automated matches, and effects on cohort size of a retrospective study. During 1999-2009, 71.52%-90.08% of items in use from the financial catalog were enriched using RxNorm; 64.95%-70.37% of items in use from the historical inpatient system were integrated using RxNorm, 85.96%-91.67% using a commercial vocabulary, 87.19%-94.23% using financial metadata, and 77.20%-94.68% using dispensing metadata. During 1999-2009, 48.01%-30.72% of items in use from the outpatient catalog were integrated using RxNorm, and 79.27%-48.60% using a commercial vocabulary. In a cohort of 16304 inpatients obtained from clinical systems, 4172 (25.58%) were found exclusively through integration of historical clinical data, while 15978 (98%) could be identified using semantically enriched financial data. 
Data integration using metadata from financial/dispensing systems and pharmacy vocabularies were comparable. Given the current state of EHR adoption, semantic enrichment of financial data and integration of historical clinical data would allow the repurposing of these data for research. With the push for HITECH meaningful use, institutions that are transitioning to newer EHRs will be able to use their older financial and clinical data for research using these methods.
Integrating historical clinical and financial data for pharmacological research
2011-01-01
Background Retrospective research requires longitudinal data, and repositories derived from electronic health records (EHR) can be sources of such data. With Health Information Technology for Economic and Clinical Health (HITECH) Act meaningful use provisions, many institutions are expected to adopt EHRs, but may be left with large amounts of financial and historical clinical data, which can differ significantly from data obtained from newer systems, due to lack or inconsistent use of controlled medical terminologies (CMT) in older systems. We examined different approaches for semantic enrichment of financial data with CMT, and integration of clinical data from disparate historical and current sources for research. Methods Snapshots of financial data from 1999, 2004 and 2009 were mapped automatically to the current inpatient pharmacy catalog, and enriched with RxNorm. Administrative metadata from financial and dispensing systems, RxNorm and two commercial pharmacy vocabularies were used to integrate data from current and historical inpatient pharmacy modules, and the outpatient EHR. Data integration approaches were compared using percentages of automated matches, and effects on cohort size of a retrospective study. Results During 1999-2009, 71.52%-90.08% of items in use from the financial catalog were enriched using RxNorm; 64.95%-70.37% of items in use from the historical inpatient system were integrated using RxNorm, 85.96%-91.67% using a commercial vocabulary, 87.19%-94.23% using financial metadata, and 77.20%-94.68% using dispensing metadata. During 1999-2009, 48.01%-30.72% of items in use from the outpatient catalog were integrated using RxNorm, and 79.27%-48.60% using a commercial vocabulary. In a cohort of 16304 inpatients obtained from clinical systems, 4172 (25.58%) were found exclusively through integration of historical clinical data, while 15978 (98%) could be identified using semantically enriched financial data. 
Conclusions Data integration using metadata from financial/dispensing systems and pharmacy vocabularies were comparable. Given the current state of EHR adoption, semantic enrichment of financial data and integration of historical clinical data would allow the repurposing of these data for research. With the push for HITECH meaningful use, institutions that are transitioning to newer EHRs will be able to use their older financial and clinical data for research using these methods. PMID:22099213
The TDR: A Repository for Long Term Storage of Geophysical Data and Metadata
NASA Astrophysics Data System (ADS)
Wilson, A.; Baltzer, T.; Caron, J.
2006-12-01
For many years Unidata has provided easy, low cost data access to universities and research labs. Historically Unidata technology provided access to data in near real time. In recent years Unidata has additionally turned to providing middleware to serve longer term data and associated metadata via its THREDDS technology, the most recent offering being the THREDDS Data Server (TDS). The TDS provides middleware for metadata access and management, OPeNDAP data access, and integration with the Unidata Integrated Data Viewer (IDV), among other benefits. The TDS was designed to support rolling archives of data, that is, data that exist only for a relatively short, predefined time window. Now we are creating an addition to the TDS, called the THREDDS Data Repository (TDR), which allows users to store and retrieve data and other objects for an arbitrarily long time period. Data in the TDR can also be served by the TDS. The TDR performs important functions of locating storage for the data, moving the data to and from the repository, assigning unique identifiers, and generating metadata. The TDR framework supports pluggable components that allow tailoring an implementation for a particular application. The Linked Environments for Atmospheric Discovery (LEAD) project provides an excellent use case for the TDR. LEAD is a multi-institutional Large Information Technology Research project funded by the National Science Foundation (NSF). The goal of LEAD is to create a framework based on Grid and Web Services to support mesoscale meteorology research and education. This includes capabilities such as launching forecast models, mining data for meteorological phenomena, and dynamic workflows that are automatically reconfigurable in response to changing weather. LEAD presents unique challenges in managing and storing large data volumes from real-time observational systems as well as data that are dynamically created during the execution of adaptive workflows. 
For example, in order to support storage of many large data products, the LEAD implementation of the TDR will provide a variety of data movement options, including gridftp. It will have a web service interface and will be callable programmatically as well as via interactive user requests. Future plans include the use of a mass storage device to provide robust long term storage. This talk will present the current state of the TDR effort.
NASA Astrophysics Data System (ADS)
Fredericks, J.; Rueda-Velasquez, C. A.
2016-12-01
As we move from keeping data on our disks to sharing it with the world, often in real-time, we are obligated to also tell an unknown user about how our observations were made. Shared data must carry more than ownership metadata, unit descriptions and content formatting information: the provider must also share the information needed to assess the data for potential re-use. A user must be able to assess the limitations and capabilities of the sensor, as it is configured, to understand its value. For example, the way an instrument is configured typically affects the data accuracy and operational limits of the sensor. An operator may sacrifice data accuracy to achieve a broader operational range and vice versa. If you are looking at newly discovered data, it is important to be able to find all of the information that relates to assessing the data quality for your particular application. Traditionally, metadata are captured by data managers who usually do not know how the data are collected. By the time data are distributed, this knowledge is often gone, buried within notebooks or hidden in documents that are not machine-harvestable and often not human-readable. In a recently funded NSF EarthCube Integrative Activity called X-DOMES (Cross-Domain Observational Metadata in EnviroSensing), mechanisms are being developed to enable the capture of sensor and deployment metadata by sensor manufacturers and field operators. The support has enabled the development of a community ontology repository (COR) within the Earth Science Information Partnership (ESIP) community, fostering easy creation of resolvable terms for the broader community. This tool enables non-experts to easily develop W3C standards-based content, promoting the implementation of Semantic Web technologies for enhanced discovery of content and interoperability in workflows.
The X-DOMES project is also developing a SensorML Viewer/Editor to provide an easy interface for sensor manufacturers and field operators to fully-describe sensor capabilities and configuration/deployment content - automatically generating it in machine-harvestable encodings that can be referenced by data managers and/or associated with the data through web-services, such as the OGC SWE Sensor Observation Service.
Distributed Information System for Dynamic Ocean Data in Indonesia
NASA Astrophysics Data System (ADS)
Romero, Laia; Sala, Joan; Polo, Isabel; Cases, Oscar; López, Alejandro; Jolibois, Tony; Carbou, Jérome
2014-05-01
Information systems are widely used to enable access to scientific data by different user communities. MyOcean information system is a good example of such applications in Europe. The present work describes a specific distributed information system for Ocean Numerical Model (ONM) data in the scope of the INDESO project, a project focused on Infrastructure Development of Space Oceanography in Indonesia. INDESO, as part of the Blue Revolution policy conducted by the Indonesian government for the sustainable development of fisheries and aquaculture, presents challenging service requirements in terms of services performance, reliability, security and overall usability. Following state-of-the-art technologies on scientific data networks, this robust information system provides a high level of interoperability of services to discover, view and access INDESO dynamic ONM scientific data. The entire system is automatically updated four times a day, including dataset metadata, taking into account every new file available in the data repositories. The INDESO system architecture has been designed in great part around the extension and integration of open-source flexible and mature technologies. It involves three separate modules: web portal, dissemination gateway, and user administration. Supporting different gridded and non-gridded data, the INDESO information system features search-based data discovery, data access by temporal and spatial subset extraction, direct download and ftp, and multiple-layer visualization of datasets. A complex authorization system has been designed and applied throughout all components, in order to enable services authorization at dataset level, according to the different user profiles stated in the data policy. Finally, a web portal has been developed as the single entry point and standardized interface to all data services (discover, view, and access). 
Apache SOLR has been implemented as the search server, allowing faceted browsing among ocean data products and the connection to an external catalogue of metadata records. ncWMS and Godiva2 have been the basis of the viewing server and client technologies developed, MOTU has been used for data subsetting and intelligent management of data queues, and has allowed the deployment of a centralised download interface applicable to all ONM products. Unidata's Thredds server has been employed to provide file metadata and remote access to ONM data. CAS has been used as the single sign-on protocol for all data services. The user management application developed has been based on GOSA2. Joomla and Bootstrap have been the technologies used for the web portal, compatible with mobile phone and tablet devices. The INDESO information system comes up as an information system that is scalable, extremely easy to use, operate and maintain. This will facilitate the extensive use of ocean numerical model data by the scientific community in Indonesia. Constituted mostly of open-source solutions, the system is able to meet strict operational requirements, and carry out complex functions. It is feasible to adapt this architecture to different static and dynamic oceanographic data sources and large data volumes, in an accessible, fast, and comprehensive manner.
NASA Astrophysics Data System (ADS)
Richard, S. M.
2011-12-01
The USGIN project has drafted and is using a specification for use of ISO 19115/19/39 metadata, recommendations for simple metadata content, and a proposal for a URI scheme to identify resources using resolvable http URIs (see http://lab.usgin.org/usgin-profiles). The principal target use case is a catalog in which resources can be registered and described by data providers for discovery by users. We are currently using the ESRI Geoportal (Open Source), with configuration files for the USGIN profile. The metadata offered by the catalog must provide sufficient content to guide search engines to locate requested resources, to describe the resource content, provenance, and quality so users can determine if the resource will serve for intended usage, and finally to enable human users and software clients to obtain or access the resource. In order to achieve an operational federated catalog system, provisions in the ISO specification must be restricted and usage clarified to reduce the heterogeneity of 'standard' metadata and service implementations such that a single client can search against different catalogs, and the metadata returned by catalogs can be parsed reliably to locate required information. Usage of the complex ISO 19139 XML schema allows for a great deal of structured metadata content, but the heterogeneity in approaches to content encoding has hampered development of sophisticated client software that can take advantage of the rich metadata; the lack of such clients in turn reduces motivation for metadata producers to produce content-rich metadata. If the only significant use of the detailed, structured metadata is to format into text for people to read, then the detailed information could be put in free text elements and be just as useful.
In order for complex metadata encoding and content to be useful, there must be clear and unambiguous conventions on the encoding that are utilized by the community that wishes to take advantage of advanced metadata content. The use cases for the detailed content must be well understood, and the degree of metadata complexity should be determined by requirements for those use cases. The ISO standard provides sufficient flexibility that relatively simple metadata records can be created that will serve for text-indexed search/discovery, resource evaluation by a user reading text content from the metadata, and access to the resource via http, ftp, or well-known service protocols (e.g. Thredds; OGC WMS, WFS, WCS).
Improving Scientific Metadata Interoperability And Data Discoverability using OAI-PMH
NASA Astrophysics Data System (ADS)
Devarakonda, Ranjeet; Palanisamy, Giri; Green, James M.; Wilson, Bruce E.
2010-12-01
While general-purpose search engines (such as Google or Bing) are useful for finding many things on the Internet, they are often of limited usefulness for locating Earth Science data relevant (for example) to a specific spatiotemporal extent. By contrast, tools that search repositories of structured metadata can locate relevant datasets with fairly high precision, but the search is limited to that particular repository. Federated searches (such as Z39.50) have been used, but can be slow and the comprehensiveness can be limited by downtime in any search partner. An alternative approach to improve comprehensiveness is for a repository to harvest metadata from other repositories, possibly with limits based on subject matter or access permissions. Searches through harvested metadata can be extremely responsive, and the search tool can be customized with semantic augmentation appropriate to the community of practice being served. However, there are a number of different protocols for harvesting metadata, with some challenges for ensuring that updates are propagated and for collaborations with repositories using differing metadata standards. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a standard that is seeing increased use as a means for exchanging structured metadata. OAI-PMH implementations must support Dublin Core as a metadata standard, with other metadata formats as optional. We have developed tools which enable our structured search tool (Mercury; http://mercury.ornl.gov) to consume metadata from OAI-PMH services in any of the metadata formats we support (Dublin Core, Darwin Core, FGDC CSDGM, GCMD DIF, EML, and ISO 19115/19137). We are also making ORNL DAAC metadata available through OAI-PMH for other metadata tools to utilize (such as the NASA Global Change Master Directory, GCMD).
This paper describes Mercury capabilities with multiple metadata formats, in general, and, more specifically, the results of our OAI-PMH implementations and the lessons learned. References: [1] R. Devarakonda, G. Palanisamy, B.E. Wilson, and J.M. Green, "Mercury: reusable metadata management data discovery and access system", Earth Science Informatics, vol. 3, no. 1, pp. 87-94, May 2010. [2] R. Devarakonda, G. Palanisamy, J.M. Green, B.E. Wilson, "Data sharing and retrieval using OAI-PMH", Earth Science Informatics DOI: 10.1007/s12145-010-0073-0, (2010). [3] Devarakonda, R.; Palanisamy, G.; Green, J.; Wilson, B. E. "Mercury: An Example of Effective Software Reuse for Metadata Management Data Discovery and Access", Eos Trans. AGU, 89(53), Fall Meet. Suppl., IN11A-1019 (2008).
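A harvesting round trip of the kind described above can be sketched in a few lines. This is a minimal illustration, not Mercury's implementation: the base URL `https://example.org/oai` and the embedded sample response are made up, and no network request is performed. The `ListRecords` verb, the `metadataPrefix=oai_dc` argument for Dublin Core, the `resumptionToken` element, and the XML namespaces are those defined by OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces used by OAI-PMH 2.0 responses carrying Dublin Core records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, next_token) from one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


# Hypothetical response page, hard-coded so the sketch runs offline.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Soil moisture, site A</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_records(SAMPLE_RESPONSE)
```

A harvester would loop, requesting `list_records_url(base, resumption_token=next_token)` until no token is returned, and index each page of records locally so that searches do not depend on the remote repository being available.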
Morphological feature extraction for the classification of digital images of cancerous tissues.
Thiran, J P; Macq, B
1996-10-01
This paper presents a new method for automatic recognition of cancerous tissues from an image of a microscopic section. Based on the shape and the size analysis of the observed cells, this method provides the physician with nonsubjective numerical values for four criteria of malignancy. This automatic approach is based on mathematical morphology, and more specifically on the use of Geodesy. This technique is used first to remove the background noise from the image and then to operate a segmentation of the nuclei of the cells and an analysis of their shape, their size, and their texture. From the values of the extracted criteria, an automatic classification of the image (cancerous or not) is finally operated.
Automatic extraction of building boundaries using aerial LiDAR data
NASA Astrophysics Data System (ADS)
Wang, Ruisheng; Hu, Yong; Wu, Huayi; Wang, Jian
2016-01-01
Building extraction is one of the main research topics of the photogrammetry community. This paper presents automatic algorithms for building boundary extractions from aerial LiDAR data. First, segmenting height information generated from LiDAR data, the outer boundaries of aboveground objects are expressed as closed chains of oriented edge pixels. Then, building boundaries are distinguished from nonbuilding ones by evaluating their shapes. The candidate building boundaries are reconstructed as rectangles or regular polygons by applying new algorithms, following the hypothesis verification paradigm. These algorithms include constrained searching in Hough space, enhanced Hough transformation, and the sequential linking technique. The experimental results show that the proposed algorithms successfully extract building boundaries at rates of 97%, 85%, and 92% for three LiDAR datasets with varying scene complexities.
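The role of Hough space in recovering straight boundary segments can be illustrated with a plain Hough line transform. This sketch is the textbook voting scheme only; it does not reproduce the constrained search or the enhanced transformation the abstract refers to, and the function name and the one-degree angular step are assumptions made for the example.

```python
import math
from collections import Counter


def hough_peak(points, n_theta=180):
    """Vote in (theta, rho) space and return the strongest line.

    Each edge pixel (x, y) votes for every line rho = x*cos(theta) + y*sin(theta)
    passing through it; collinear pixels accumulate votes in the same cell.
    Returns (theta in degrees, rho in pixels, number of votes).
    """
    accumulator = Counter()
    for x, y in points:
        for i in range(n_theta):
            theta = math.pi * i / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            accumulator[(i, rho)] += 1
    (i, rho), votes = accumulator.most_common(1)[0]
    return i * 180.0 / n_theta, rho, votes


# Edge pixels along the horizontal line y = 5 (for example, one wall of a roof outline).
wall_pixels = [(x, 5) for x in range(100)]
theta_deg, rho, votes = hough_peak(wall_pixels)
```

For a rectangular building, one would expect four strong peaks whose angles differ by about 90 degrees, which is the kind of regularity a constrained search in Hough space can exploit.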
Viangteeravat, Teeradache; Anyanwu, Matthew N; Ra Nagisetty, Venkateswara; Kuscu, Emin
2011-07-15
Massive datasets comprising high-resolution images, generated in neuro-imaging studies and in clinical imaging research, are increasingly challenging our ability to analyze, share, and filter such images in clinical and basic translational research. Pivot collection exploratory analysis provides each user the ability to fully interact with the massive amounts of visual data to fully facilitate sufficient sorting, flexibility and speed to fluidly access, explore or analyze the massive image data sets of high-resolution images and their associated meta information, such as neuro-imaging databases from the Allen Brain Atlas. It is used in clustering, filtering, data sharing and classifying of the visual data into various deep zoom levels and meta information categories to detect the underlying hidden pattern within the data set that has been used. We deployed prototype Pivot collections using the Linux CentOS running on the Apache web server. We also tested the prototype Pivot collections on other operating systems like Windows (the most common variants) and UNIX, etc. It is demonstrated that the approach yields very good results when compared with other approaches used by some researchers for generation, creation, and clustering of massive image collections such as the coronal and horizontal sections of the mouse brain from the Allen Brain Atlas. Pivot visual analytics was used to analyze a prototype of dataset Dab2 co-expressed genes from the Allen Brain Atlas. The metadata along with high-resolution images were automatically extracted using the Allen Brain Atlas API. It is then used to identify the hidden information based on the various categories and conditions applied by using options generated from automated collection. A metadata category like chromosome, as well as data for individual cases like sex, age, and plan attributes of a particular gene, is used to filter, sort and to determine if there exist other genes with a similar characteristics to Dab2. 
Online access to the mouse brain Pivot collection is available at http://edtech-dev.uthsc.edu/CTSI/teeDev1/unittest/PaPa/collection.html (user name: tviangte and password: demome). Our proposed algorithm has automated the creation of large image Pivot collections; this will enable investigators of clinical research projects to easily and quickly analyse the image collections through a perspective that is useful for making critical decisions about the image patterns discovered.
An Approach to Information Management for AIR7000 with Metadata and Ontologies
2009-10-01
metadata. We then propose an approach based on Semantic Technologies including the Resource Description Framework (RDF) and Upper Ontologies, for the...mandating specific metadata schemas can result in interoperability problems. For example, many standards within the ADO mandate the use of XML for metadata...such problems, we propose an architecture in which different metadata schemes can interoperate. By using RDF (Resource Description Framework) as a
Making Interoperability Easier with NASA's Metadata Management Tool (MMT)
NASA Technical Reports Server (NTRS)
Shum, Dana; Reese, Mark; Pilone, Dan; Baynes, Katie
2016-01-01
While the ISO-19115 collection level metadata format meets many users' needs for interoperable metadata, it can be cumbersome to create it correctly. Through the MMT's simple UI experience, metadata curators can create and edit collections which are compliant with ISO-19115 without full knowledge of the NASA Best Practices implementation of ISO-19115 format. Users are guided through the metadata creation process through a forms-based editor, complete with field information, validation hints and picklists. Once a record is completed, users can download the metadata in any of the supported formats with just 2 clicks.
Predicting structured metadata from unstructured metadata.
Posch, Lisa; Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
2016-01-01
Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is accurate, structured and complete description of the data or data about the data-defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata for improving quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation model (LDA) to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms can be the most accurately predicted using the TF-IDF approach followed by LDA both outperforming the majority vote baseline. While some accuracy is lost by the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority classifier baseline. Overall this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction. © The Author(s) 2016. Published by Oxford University Press.
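The TF-IDF side of such a framework can be sketched without any external library. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say which classifier was trained, so a nearest-neighbour rule over cosine similarity is swapped in here, and the four training descriptions and their organism labels are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in doc_freq.items()}


def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are ignored."""
    term_freq = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: count * idf[t] for t, count in term_freq.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def predict(text, train_docs, train_labels, idf):
    """Label of the most similar training description (1-nearest neighbour)."""
    query = tfidf(text, idf)
    scores = [cosine(query, tfidf(doc, idf)) for doc in train_docs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best]


def majority_baseline(train_labels):
    """Always predict the most frequent label, ignoring the text."""
    return Counter(train_labels).most_common(1)[0][0]


# Hypothetical free-text sample descriptions with a structured "organism" term.
train_docs = [
    "liver tissue from adult mouse",
    "mouse brain cortex sample",
    "human blood sample from patient",
    "mouse kidney tissue",
]
train_labels = ["Mus musculus", "Mus musculus", "Homo sapiens", "Mus musculus"]
idf = fit_idf(train_docs)
predicted = predict("blood drawn from human patient", train_docs, train_labels, idf)
baseline = majority_baseline(train_labels)
```

On this toy input the text-based prediction and the majority baseline disagree, which is the situation in which unstructured descriptions add information beyond label frequencies.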
Metazen – metadata capture for metagenomes
Bischof, Jared; Harrison, Travis; Paczian, Tobias; ...
2014-12-08
Background: As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. These tools are not specifically designed for metagenomic surveys; in particular, they lack the appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools. Results: Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata. Conclusion: Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility.
Metazen – metadata capture for metagenomes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bischof, Jared; Harrison, Travis; Paczian, Tobias
Background: As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. These tools are not specifically designed for metagenomic surveys; in particular, they lack the appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools. Results: Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata. Conclusion: Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility.
Predicting structured metadata from unstructured metadata
Posch, Lisa; Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
2016-01-01
Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is an accurate, structured and complete description of the data, or data about the data, defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata to improve the quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation (LDA) model to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms can be most accurately predicted using the TF-IDF approach, followed by LDA, both outperforming the majority vote baseline. While some accuracy is lost through the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority classifier baseline. Overall this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction. Database URL: http://www.yeastgenome.org/ PMID:28637268
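The TF-IDF idea in the abstract above can be illustrated with a minimal sketch. It does not reproduce the authors' classifiers, which the abstract does not specify; a nearest-neighbour rule over cosine similarity is swapped in as a stand-in, and the sample descriptions, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every training term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    # The +1 keeps terms that occur in every document from getting zero weight.
    return {term: math.log(n_docs / count) + 1.0 for term, count in doc_freq.items()}


def vectorize(doc, idf):
    """Map free text to a sparse {term: tf * idf} vector; unknown terms are dropped."""
    tokens = [t for t in doc.lower().split() if t in idf]
    term_freq = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in term_freq.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def predict(doc, train_docs, train_labels, idf):
    """Stand-in classifier (assumption): label of the most similar training text."""
    query = vectorize(doc, idf)
    scores = [cosine(query, vectorize(d, idf)) for d in train_docs]
    return train_labels[scores.index(max(scores))]


def majority_baseline(train_labels):
    """Baseline that always answers with the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


# Hypothetical unstructured descriptions and structured labels.
train_docs = [
    "total rna from liver tissue",
    "rna extracted from liver biopsy",
    "genomic dna from whole blood",
    "liver tissue total rna replicate",
]
train_labels = ["RNA", "RNA", "DNA", "RNA"]
idf = fit_idf(train_docs)

print(predict("dna isolated from blood", train_docs, train_labels, idf))
print(majority_baseline(train_labels))
```

On this toy data the TF-IDF prediction differs from the majority answer, which is the kind of gap over the baseline the abstract reports at scale.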
A method for real-time implementation of HOG feature extraction
NASA Astrophysics Data System (ADS)
Luo, Hai-bo; Yu, Xin-rong; Liu, Hong-mei; Ding, Qing-hai
2011-08-01
Histogram of oriented gradients (HOG) is an efficient feature extraction scheme, and HOG descriptors are feature descriptors that are widely used in computer vision and image processing for biometrics, target tracking, automatic target detection (ATD), automatic target recognition (ATR), and similar tasks. However, the computation of HOG features is poorly suited to hardware implementation because it includes complicated operations. In this paper, an optimal design method and theoretical framework for real-time HOG feature extraction based on FPGA are proposed. The main principle is as follows: first, a parallel gradient computing unit circuit based on a parallel pipeline structure was designed. Second, the calculation of the arctangent and square root operations was simplified. Finally, a histogram generator based on a parallel pipeline structure was designed to calculate the histogram of each sub-region. Experimental results showed that HOG extraction can be implemented in a pixel period by these computing units.
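As a software reference for the three stages named above (gradient, magnitude and angle, per-cell histogram), the following is a minimal floating-point sketch. It is not the FPGA design from the paper: it uses the exact square root and arctangent that the hardware version simplifies, and it assigns each pixel to a single bin without interpolation. The function name and the nine-bin default are assumptions.

```python
import math


def cell_histogram(cell, bins=9):
    """Unsigned-orientation histogram (0 to 180 degrees) for one grayscale cell.

    `cell` is a list of rows of pixel intensities. Border pixels are skipped
    because central differences need both neighbours.
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)         # the square root stage
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # the arctangent stage
            index = min(int(angle // bin_width), bins - 1)
            hist[index] += magnitude               # magnitude-weighted vote
    return hist


# A 4x4 cell containing a vertical edge: all gradient energy is horizontal.
edge = [[0, 0, 10, 10] for _ in range(4)]
print(cell_histogram(edge))
```

A hardware pipeline would typically replace `math.hypot` and `math.atan2` with cheaper approximations, which is the simplification step the abstract refers to.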
ERIC Educational Resources Information Center
Kerr, Deirdre; Mousavi, Hamid; Iseli, Markus R.
2013-01-01
The Common Core assessments emphasize short essay constructed-response items over multiple-choice items because they are more precise measures of understanding. However, such items are too costly and time consuming to be used in national assessments unless a way to score them automatically can be found. Current automatic essay-scoring techniques…
A new visual navigation system for exploring biomedical Open Educational Resource (OER) videos
Zhao, Baoquan; Xu, Songhua; Lin, Shujin; Luo, Xiaonan; Duan, Lian
2016-01-01
Objective Biomedical videos as open educational resources (OERs) are increasingly proliferating on the Internet. Unfortunately, seeking personally valuable content from among the vast corpus of quality yet diverse OER videos is nontrivial due to limitations of today’s keyword- and content-based video retrieval techniques. To address this need, this study introduces a novel visual navigation system that facilitates users’ information seeking from biomedical OER videos in mass quantity by interactively offering visual and textual navigational clues that are both semantically revealing and user-friendly. Materials and Methods The authors collected and processed around 25 000 YouTube videos, which collectively last for a total length of about 4000 h, in the broad field of biomedical sciences for our experiment. For each video, its semantic clues are first extracted automatically through computationally analyzing audio and visual signals, as well as text either accompanying or embedded in the video. These extracted clues are subsequently stored in a metadata database and indexed by a high-performance text search engine. During the online retrieval stage, the system renders video search results as dynamic web pages using a JavaScript library that allows users to interactively and intuitively explore video content both efficiently and effectively. Results The authors produced a prototype implementation of the proposed system, which is publicly accessible at https://patentq.njit.edu/oer. To examine the overall advantage of the proposed system for exploring biomedical OER videos, the authors further conducted a user study of a modest scale. The study results encouragingly demonstrate the functional effectiveness and user-friendliness of the new system for facilitating information seeking from and content exploration among massive biomedical OER videos. 
Conclusion Using the proposed tool, users can efficiently and effectively find videos of interest, precisely locate video segments delivering personally valuable information, as well as intuitively and conveniently preview essential content of a single or a collection of videos. PMID:26335986
Why can't I manage my digital images like MP3s? The evolution and intent of multimedia metadata
NASA Astrophysics Data System (ADS)
Goodrum, Abby; Howison, James
2005-01-01
This paper considers the deceptively simple question: Why can't digital images be managed in the simple and effective manner in which digital music files are managed? We make the case that the answer is different treatments of metadata in different domains with different goals. A central difference between the two formats stems from the fact that digital music metadata lookup services are collaborative and automate the movement from a digital file to the appropriate metadata, while image metadata services do not. To understand why this difference exists we examine the divergent evolution of metadata standards for digital music and digital images and observe that the processes differ in interesting ways according to their intent. Specifically, music metadata was developed primarily for personal file management and community resource sharing, while the focus of image metadata has largely been on information retrieval. We argue that lessons from MP3 metadata can assist individuals facing their growing personal image management challenges. Our focus therefore is not on metadata for cultural heritage institutions or the publishing industry; it is limited to the personal libraries growing on our hard drives. This bottom-up approach to file management combined with p2p distribution radically altered the music landscape. Might such an approach have a similar impact on image publishing? This paper outlines plans for improving the personal management of digital images, doing image metadata and file management the MP3 way, and considers the likelihood of success.
Why can't I manage my digital images like MP3s? The evolution and intent of multimedia metadata
NASA Astrophysics Data System (ADS)
Goodrum, Abby; Howison, James
2004-12-01
This paper considers the deceptively simple question: Why can't digital images be managed in the simple and effective manner in which digital music files are managed? We make the case that the answer is different treatments of metadata in different domains with different goals. A central difference between the two formats stems from the fact that digital music metadata lookup services are collaborative and automate the movement from a digital file to the appropriate metadata, while image metadata services do not. To understand why this difference exists we examine the divergent evolution of metadata standards for digital music and digital images and observe that the processes differ in interesting ways according to their intent. Specifically, music metadata was developed primarily for personal file management and community resource sharing, while the focus of image metadata has largely been on information retrieval. We argue that lessons from MP3 metadata can assist individuals facing their growing personal image management challenges. Our focus therefore is not on metadata for cultural heritage institutions or the publishing industry; it is limited to the personal libraries growing on our hard drives. This bottom-up approach to file management combined with p2p distribution radically altered the music landscape. Might such an approach have a similar impact on image publishing? This paper outlines plans for improving the personal management of digital images, doing image metadata and file management the MP3 way, and considers the likelihood of success.
The Role of Metadata Standards in EOSDIS Search and Retrieval Applications
NASA Technical Reports Server (NTRS)
Pfister, Robin
1999-01-01
Metadata standards play a critical role in data search and retrieval systems. Metadata tie software to data so the data can be processed, stored, searched, retrieved and distributed. Without metadata these actions are not possible. The process of populating metadata to describe science data is an important service to the end user community, so that a user who is unfamiliar with the data can easily find and learn about a particular dataset before an order decision is made. Once a good set of standards is in place, the accuracy with which data search can be performed depends on the degree to which metadata standards are adhered to during product definition. NASA's Earth Observing System Data and Information System (EOSDIS) provides examples of how metadata standards are used in data search and retrieval.
openPDS: protecting the privacy of metadata through SafeAnswers.
de Montjoye, Yves-Alexandre; Shmueli, Erez; Wang, Samuel S; Pentland, Alex Sandy
2014-01-01
The rise of smartphones and web services made possible the large-scale collection of personal metadata. Information about individuals' location, phone call logs, or web searches is collected and used intensively by organizations and big data researchers. Metadata has, however, yet to realize its full potential. Privacy and legal concerns, as well as the lack of technical solutions for personal metadata management, are preventing metadata from being shared and reconciled under the control of the individual. This lack of access and control is furthermore fueling growing concerns, as it prevents individuals from understanding and managing the risks associated with the collection and use of their data. Our contribution is two-fold: (1) we describe openPDS, a personal metadata management framework that allows individuals to collect, store, and give fine-grained access to their metadata to third parties. It has been implemented in two field studies; (2) we introduce and analyze SafeAnswers, a new and practical way of protecting the privacy of metadata at an individual level. SafeAnswers turns a hard anonymization problem into a more tractable security one. It allows services to ask questions whose answers are calculated against the metadata instead of trying to anonymize individuals' metadata. The dimensionality of the data shared with the services is reduced from high-dimensional metadata to low-dimensional answers that are less likely to be re-identifiable and to contain sensitive information. These answers can then be directly shared individually or in aggregate. openPDS and SafeAnswers provide a new way of dynamically protecting personal metadata, thereby supporting the creation of smart data-driven services and data science research.
openPDS: Protecting the Privacy of Metadata through SafeAnswers
de Montjoye, Yves-Alexandre; Shmueli, Erez; Wang, Samuel S.; Pentland, Alex Sandy
2014-01-01
The rise of smartphones and web services made possible the large-scale collection of personal metadata. Information about individuals' location, phone call logs, or web searches is collected and used intensively by organizations and big data researchers. Metadata has, however, yet to realize its full potential. Privacy and legal concerns, as well as the lack of technical solutions for personal metadata management, are preventing metadata from being shared and reconciled under the control of the individual. This lack of access and control is furthermore fueling growing concerns, as it prevents individuals from understanding and managing the risks associated with the collection and use of their data. Our contribution is two-fold: (1) we describe openPDS, a personal metadata management framework that allows individuals to collect, store, and give fine-grained access to their metadata to third parties. It has been implemented in two field studies; (2) we introduce and analyze SafeAnswers, a new and practical way of protecting the privacy of metadata at an individual level. SafeAnswers turns a hard anonymization problem into a more tractable security one. It allows services to ask questions whose answers are calculated against the metadata instead of trying to anonymize individuals' metadata. The dimensionality of the data shared with the services is reduced from high-dimensional metadata to low-dimensional answers that are less likely to be re-identifiable and to contain sensitive information. These answers can then be directly shared individually or in aggregate. openPDS and SafeAnswers provide a new way of dynamically protecting personal metadata, thereby supporting the creation of smart data-driven services and data science research. PMID:25007320
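The question-and-answer pattern described for SafeAnswers can be sketched as follows. This is an illustration of the idea only, not the openPDS API: the class, method names, log format, and example question are all hypothetical. The point shown is that the raw log stays inside the store and only a low-dimensional answer leaves it.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalDataStore:
    """Hypothetical stand-in for a personal data store (not the openPDS API)."""

    location_log: list = field(default_factory=list)       # (hour_of_day, place) tuples
    approved_questions: dict = field(default_factory=dict)  # name -> callable

    def register(self, name, question):
        """The individual approves a question that services may ask."""
        self.approved_questions[name] = question

    def answer(self, name):
        """Run an approved question against the raw log; return only its answer."""
        if name not in self.approved_questions:
            raise PermissionError(f"question {name!r} is not approved")
        return self.approved_questions[name](self.location_log)


def usually_home_at_night(log):
    """Example question: a single boolean instead of the full location history."""
    night = [place for hour, place in log if hour >= 22 or hour < 6]
    return bool(night) and night.count("home") / len(night) > 0.5


store = PersonalDataStore(
    location_log=[(23, "home"), (2, "home"), (14, "office"), (3, "bar")]
)
store.register("usually_home_at_night", usually_home_at_night)
print(store.answer("usually_home_at_night"))
```

A service calling `answer` never receives the tuples in `location_log`, which mirrors the reduction from high-dimensional metadata to low-dimensional answers described in the abstract.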
Automatic quantitative analysis of in-stent restenosis using FD-OCT in vivo intra-arterial imaging.
Mandelias, Kostas; Tsantis, Stavros; Spiliopoulos, Stavros; Katsakiori, Paraskevi F; Karnabatidis, Dimitris; Nikiforidis, George C; Kagadis, George C
2013-06-01
A new segmentation technique is implemented for automatic lumen area extraction and stent strut detection in intravascular optical coherence tomography (OCT) images for the purpose of quantitative analysis of in-stent restenosis (ISR). In addition, a user-friendly graphical user interface (GUI) is developed based on the employed algorithm toward clinical use. Four clinical datasets of frequency-domain OCT scans of the human femoral artery were analyzed. First, a segmentation method based on fuzzy C means (FCM) clustering and wavelet transform (WT) was applied toward inner luminal contour extraction. Subsequently, stent strut positions were detected by utilizing metrics derived from the local maxima of the wavelet transform into the FCM membership function. The inner lumen contour and the position of stent strut were extracted with high precision. Compared to manual segmentation by an expert physician, the automatic lumen contour delineation had an average overlap value of 0.917 ± 0.065 for all OCT images included in the study. The strut detection procedure achieved an overall accuracy of 93.80% and successfully identified 9.57 ± 0.5 struts for every OCT image. Processing time was confined to approximately 2.5 s per OCT frame. A new fast and robust automatic segmentation technique combining FCM and WT for lumen border extraction and strut detection in intravascular OCT images was designed and implemented. The proposed algorithm integrated in a GUI represents a step forward toward the employment of automated quantitative analysis of ISR in clinical practice.
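For readers unfamiliar with fuzzy C-means, the membership computation at its core can be sketched in a few lines. This shows only the standard FCM membership formula on scalar values; the wavelet-derived metrics and the strut detection logic of the paper are not reproduced, and the function name and fuzzifier default m = 2 are assumptions.

```python
def fcm_memberships(value, centers, m=2.0):
    """Standard fuzzy C-means membership of one value in each cluster.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance
    from `value` to cluster centre i and m > 1 is the fuzzifier.
    """
    distances = [abs(value - c) for c in centers]
    zero_hits = distances.count(0.0)
    if zero_hits:
        # The value coincides with one or more centres: share membership among them.
        return [1.0 / zero_hits if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


# Hypothetical intensity clusters: dark lumen near 0, bright tissue near 10.
print(fcm_memberships(2.0, [0.0, 10.0]))
print(fcm_memberships(5.0, [0.0, 10.0]))
```

Memberships always sum to one, so a pixel halfway between two cluster centres belongs equally to both rather than being forced into one class.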
Realtime automatic metal extraction of medical x-ray images for contrast improvement
NASA Astrophysics Data System (ADS)
Prangl, Martin; Hellwagner, Hermann; Spielvogel, Christian; Bischof, Horst; Szkaliczki, Tibor
2006-03-01
This paper focuses on an approach for real-time metal extraction of x-ray images taken from modern x-ray machines like C-arms. Such machines are used for vessel diagnostics, surgical interventions, as well as cardiology, neurology and orthopedic examinations. They are very fast in taking images from different angles. For this reason, manual adjustment of contrast is infeasible and automatic adjustment algorithms have been applied to try to select the optimal radiation dose for contrast adjustment. Problems occur when metallic objects, e.g., a prosthesis or a screw, are in the absorption area of interest. In this case, the automatic adjustment mostly fails because the dark, metallic objects lead the algorithm to overdose the x-ray tube. This outshining effect results in overexposed images and bad contrast. To overcome this limitation, metallic objects have to be detected and extracted from images that are taken as input for the adjustment algorithm. In this paper, we present a real-time solution for extracting metallic objects of x-ray images. We will explore the characteristic features of metallic objects in x-ray images and their distinction from bone fragments which form the basis to find a successful way for object segmentation and classification. Subsequently, we will present our edge based real-time approach for successful and fast automatic segmentation and classification of metallic objects. Finally, experimental results on the effectiveness and performance of our approach based on a vast amount of input image data sets will be presented.
Ontology-aided feature correlation for multi-modal urban sensing
NASA Astrophysics Data System (ADS)
Misra, Archan; Lantra, Zaman; Jayarajah, Kasthuri
2016-05-01
The paper explores the use of correlation across features extracted from different sensing channels to help in urban situational understanding. We use real-world datasets to show how such correlation can improve the accuracy of detection of city-wide events by combining metadata analysis with image analysis of Instagram content. We demonstrate this through a case study on the Singapore Haze. We show that simple ontological relationships and reasoning can significantly help in automating such correlation-based understanding of transient urban events.
Perspectives in astrophysical databases
NASA Astrophysics Data System (ADS)
Frailis, Marco; de Angelis, Alessandro; Roberto, Vito
2004-07-01
Astrophysics has become a domain extremely rich in scientific data. Data mining tools are needed for information extraction from such large data sets. This calls for an approach to data management that emphasizes the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods and simplicity is achieved by properly handling metadata. Moreover, clustering and classification techniques on large data sets pose additional requirements in terms of computation and memory scalability and interpretability of results. In this study we review some possible solutions.
Progress in defining a standard for file-level metadata
NASA Technical Reports Server (NTRS)
Williams, Joel; Kobler, Ben
1996-01-01
In the following narrative, metadata required to locate a file on tape or collection of tapes will be referred to as file-level metadata. This paper describes the rationale for and the history of the effort to define a standard for this metadata.
Achieving interoperability for metadata registries using comparative object modeling.
Park, Yu Rang; Kim, Ju Han
2010-01-01
Achieving data interoperability between organizations relies upon agreed meaning and representation (metadata) of data. For managing and registering metadata, many organizations have built metadata registries (MDRs) in various domains based on the international standard for MDR frameworks, ISO/IEC 11179. Following this trend, two public MDRs in the biomedical domain have been created, the United States Health Information Knowledgebase (USHIK) and the cancer Data Standards Registry and Repository (caDSR), from the U.S. Department of Health & Human Services and the National Cancer Institute (NCI), respectively. Most MDRs are implemented with indiscriminate extensions to satisfy organization-specific needs and to work around the semantic and structural limitations of ISO/IEC 11179. As a result, it is difficult to address interoperability among multiple MDRs. In this paper, we propose an integrated metadata object model for achieving interoperability among multiple MDRs. To evaluate this model, we developed an XML Schema Definition (XSD)-based metadata exchange format. We created an XSD-based metadata exporter, supporting both the integrated metadata object model and organization-specific MDR formats.
Request queues for interactive clients in a shared file system of a parallel computing system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bent, John M.; Faibish, Sorin
Interactive requests are processed from users of log-in nodes. A metadata server node is provided for use in a file system shared by one or more interactive nodes and one or more batch nodes. The interactive nodes comprise interactive clients to execute interactive tasks and the batch nodes execute batch jobs for one or more batch clients. The metadata server node comprises a virtual machine monitor; an interactive client proxy to store metadata requests from the interactive clients in an interactive client queue; a batch client proxy to store metadata requests from the batch clients in a batch client queue; and a metadata server to store the metadata requests from the interactive client queue and the batch client queue in a metadata queue based on an allocation of resources by the virtual machine monitor. The metadata requests can be prioritized, for example, based on one or more of a predefined policy and predefined rules.
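The queueing arrangement described above can be sketched as two client queues drained into one metadata queue. The record does not state the allocation policy, so a weighted round-robin is assumed here purely for illustration; the function name, share parameters, and request strings are hypothetical.

```python
from collections import deque


def merge_requests(interactive, batch, interactive_share=2, batch_share=1):
    """Drain two client queues into one metadata queue.

    Weighted round-robin is an assumed policy: in each round, up to
    `interactive_share` interactive requests are moved, then up to
    `batch_share` batch requests. The shares stand in for the resource
    allocation made by the virtual machine monitor.
    """
    metadata_queue = deque()
    while interactive or batch:
        for _ in range(interactive_share):
            if interactive:
                metadata_queue.append(interactive.popleft())
        for _ in range(batch_share):
            if batch:
                metadata_queue.append(batch.popleft())
    return metadata_queue


interactive_queue = deque(["stat /home/a", "ls /home/a", "open /home/a/x"])
batch_queue = deque(["create /scratch/1", "create /scratch/2", "create /scratch/3"])
print(list(merge_requests(interactive_queue, batch_queue)))
```

With a larger interactive share, requests from log-in users reach the metadata server sooner even while a batch job is issuing many requests, which is the prioritization the record describes.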
Text feature extraction based on deep learning: a review.
Liang, Hong; Sun, Xiao; Sun, Yunlei; Gao, Yuan
2017-01-01
Selection of text feature items is a basic and important matter for text mining and information retrieval. Traditional methods of feature extraction require handcrafted features. Hand-designing an effective feature is a lengthy process, whereas deep learning can acquire new, effective feature representations from training data for new applications. As a new feature extraction method, deep learning has made achievements in text mining. The major difference between deep learning and conventional methods is that deep learning automatically learns features from big data instead of adopting handcrafted features, which depend mainly on the prior knowledge of designers and can hardly take advantage of big data. Deep learning can automatically learn feature representations from big data, using models with millions of parameters. This review first outlines the common methods used in text feature extraction, then expands on frequently used deep learning methods in text feature extraction and their applications, and forecasts the application of deep learning in feature extraction.
Automatic extraction and visualization of object-oriented software design metrics
NASA Astrophysics Data System (ADS)
Lakshminarayana, Anuradha; Newman, Timothy S.; Li, Wei; Talburt, John
2000-02-01
Software visualization is a graphical representation of software characteristics and behavior. Certain modes of software visualization can be useful in isolating problems and identifying unanticipated behavior. In this paper we present a new approach to aid understanding of object-oriented software through 3D visualization of software metrics that can be extracted from the design phase of software development. The focus of the paper is a metric extraction method and a new collection of glyphs for multi-dimensional metric visualization. Our approach utilizes the extensibility interface of a popular CASE tool to access and automatically extract the metrics from Unified Modeling Language class diagrams. Following the extraction of the design metrics, 3D visualizations of these metrics are generated for each class in the design, utilizing intuitively meaningful 3D glyphs that are representative of the ensemble of metrics. Extraction and visualization of design metrics can aid software developers in the early study and understanding of design complexity.
D’Avolio, Leonard W.; Litwin, Mark S.; Rogers, Selwyn O.; Bui, Alex A. T.
2007-01-01
Prostate cancer removal surgeries that result in tumor found at the surgical margin, otherwise known as a positive surgical margin, have a significantly higher chance of biochemical recurrence and clinical progression. To support clinical outcomes assessment a system was designed to automatically identify, extract, and classify key phrases from pathology reports describing this outcome. Heuristics and boundary detection were used to extract phrases. Phrases were then classified using support vector machines into one of three classes: ‘positive (involved) margins,’ ‘negative (uninvolved) margins,’ and ‘not-applicable or definitive.’ A total of 851 key phrases were extracted from a sample of 782 reports produced between 1996 and 2006 from two major hospitals. Despite differences in reporting style, at least 1 sentence containing a diagnosis was extracted from 780 of the 782 reports (99.74%). Of the 851 sentences extracted, 97.3% contained diagnoses. Overall accuracy of automated classification of extracted sentences into the three categories was 97.18%. PMID:18693818
Hirose, Tomoaki; Igami, Tsuyoshi; Koga, Kusuto; Hayashi, Yuichiro; Ebata, Tomoki; Yokoyama, Yukihiro; Sugawara, Gen; Mizuno, Takashi; Yamaguchi, Junpei; Mori, Kensaku; Nagino, Masato
2017-03-01
Fusion angiography using reconstructed multidetector-row computed tomography (MDCT) images, and cholangiography using reconstructed images from MDCT with a cholangiographic agent include an anatomical gap due to the different periods of MDCT scanning. To conquer such gaps, we attempted to develop a cholangiography procedure that automatically reconstructs a cholangiogram from portal-phase MDCT images. The automatically produced cholangiography procedure utilized an original software program that was developed by the Graduate School of Information Science, Nagoya University. This program structured 5 candidate biliary tracts, and automatically selected one as the candidate for cholangiography. The clinical value of the automatically produced cholangiography procedure was estimated based on a comparison with manually produced cholangiography. Automatically produced cholangiograms were reconstructed for 20 patients who underwent MDCT scanning before biliary drainage for distal biliary obstruction. The procedure showed the ability to extract the 5 main biliary branches and the 21 subsegmental biliary branches in 55 and 25 % of the cases, respectively. The extent of aberrant connections and aberrant extractions outside the biliary tract was acceptable. Among all of the cholangiograms, 5 were clinically applied with no correction, 8 were applied with modest improvements, and 3 produced a correct cholangiography before automatic selection. Although our procedure requires further improvement based on the analysis of additional patient data, it may represent an alternative to direct cholangiography in the future.
Making metadata usable in a multi-national research setting.
Ellul, Claire; Foord, Joanna; Mooney, John
2013-11-01
SECOA (Solutions for Environmental Contrasts in Coastal Areas) is a multi-national research project examining the effects of human mobility on urban settlements in fragile coastal environments. This paper describes the setting up of a SECOA metadata repository for non-specialist researchers such as environmental scientists and tourism experts. Conflicting usability requirements of two groups - metadata creators and metadata users - are identified along with associated limitations of current metadata standards. A description is given of a configurable metadata system designed to grow as the project evolves. This work is of relevance for similar projects such as INSPIRE. Copyright © 2012 Elsevier Ltd and The Ergonomics Society. All rights reserved.
NASA Astrophysics Data System (ADS)
Yatagai, A. I.; Iyemori, T.; Ritschel, B.; Koyama, Y.; Hori, T.; Abe, S.; Tanaka, Y.; Shinbori, A.; Umemura, N.; Sato, Y.; Yagi, M.; Ueno, S.; Hashiguchi, N. O.; Kaneda, N.; Belehaki, A.; Hapgood, M. A.
2013-12-01
The IUGONET is a Japanese program to build a metadata database for ground-based observations of the upper atmosphere [1]. The project began in 2009 with five Japanese institutions which archive data observed by radars, magnetometers, photometers, radio telescopes and helioscopes, and so on, at various altitudes from the Earth's surface to the Sun. Systems have been developed to allow searching of the above described metadata. We have been updating the system and adding new and updated metadata. The IUGONET development team adopted the SPASE metadata model [2] to describe the upper atmosphere data. This model is used as the common metadata format by the virtual observatories for solar-terrestrial physics. It includes metadata referring to each data file (called a 'Granule'), which enable a search for data files as well as data sets. Further details are described in [2] and [3]. Currently, three additional Japanese institutions are being incorporated in IUGONET. Furthermore, metadata of observations of the troposphere, taken at the observatories of the middle and upper atmosphere radar at Shigaraki and the Meteor radar in Indonesia, have been incorporated. These additions will contribute to efficient interdisciplinary scientific research. In the beginning of 2013, the registration of the 'Observatory' and 'Instrument' metadata was completed, which makes it easy to get an overview of the metadata database. The number of registered metadata as of the end of July totalled 8.8 million, including 793 observatories and 878 instruments. It is important to promote interoperability and/or metadata exchange between the database development groups. A memorandum of agreement has been signed with the European Near-Earth Space Data Infrastructure for e-Science (ESPAS) project, which has similar objectives to IUGONET with regard to a framework for formal collaboration.
Furthermore, observations by satellites and the International Space Station are being incorporated with a view to making or linking metadata databases. The development of effective data systems will contribute to the progress of scientific research on solar terrestrial physics, climate and the geophysical environment. Any kind of cooperation, metadata input and feedback, especially for linkage of the databases, is welcomed. References 1. Hayashi, H. et al., Inter-university Upper Atmosphere Global Observation Network (IUGONET), Data Sci. J., 12, WDS179-184, 2013. 2. King, T. et al., SPASE 2.0: A standard data model for space physics. Earth Sci. Inform. 3, 67-73, 2010, doi:10.1007/s12145-010-0053-4. 3. Hori, T., et al., Development of IUGONET metadata format and metadata management system. J. Space Sci. Info. Jpn., 105-111, 2012. (in Japanese)
Towards Precise Metadata-set for Discovering 3D Geospatial Models in Geo-portals
NASA Astrophysics Data System (ADS)
Zamyadi, A.; Pouliot, J.; Bédard, Y.
2013-09-01
Accessing 3D geospatial models, eventually at no cost and for unrestricted use, is certainly an important issue as they become popular among participatory communities, consultants, and officials. Various geo-portals, mainly established for 2D resources, have tried to provide access to existing 3D resources such as digital elevation models, LIDAR or classic topographic data. Describing the content of data, metadata is a key component of data discovery in geo-portals. An inventory of seven online geo-portals and commercial catalogues shows that the metadata referring to 3D information is very different from one geo-portal to another as well as for similar 3D resources in the same geo-portal. The inventory considered 971 data resources affiliated with elevation. 51% of them were from three geo-portals running at Canadian federal and municipal levels whose metadata resources did not consider 3D models by any definition. Regarding the remaining 49% which refer to 3D models, different definitions of terms and metadata were found, resulting in confusion and misinterpretation. The overall assessment of these geo-portals clearly shows that the provided metadata do not integrate specific and common information about 3D geospatial models. Accordingly, the main objective of this research is to improve 3D geospatial model discovery in geo-portals by adding a specific metadata-set. Based on the knowledge and current practices on 3D modeling, and 3D data acquisition and management, a set of metadata is proposed to increase its suitability for 3D geospatial models. This metadata-set enables the definition of genuine classes, fields, and code-lists for a 3D metadata profile. The main structure of the proposal contains 21 metadata classes. These classes are classified in three packages as General and Complementary on contextual and structural information, and Availability on the transition from storage to delivery format.
The proposed metadata set is compared with the Canadian Geospatial Data Infrastructure (CGDI) metadata, which is an implementation of the North American Profile of ISO-19115. The comparison analyzes the two metadata sets against three simulated scenarios about discovering needed 3D geospatial datasets. Considering specific metadata about 3D geospatial models, the proposed metadata-set has six additional classes on geometric dimension, level of detail, geometric modeling, topology, and appearance information. In addition, classes on data acquisition, preparation, and modeling, and physical availability have been specialized for 3D geospatial models.
The Heliophysics Data Environment: Open Source, Open Systems and Open Data.
NASA Astrophysics Data System (ADS)
King, Todd; Roberts, Aaron; Walker, Raymond; Thieman, James
2012-07-01
The Heliophysics Data Environment (HPDE) is a place for scientific discovery. Today the Heliophysics Data Environment is a framework of technologies, standards and services which enables the international community to collaborate more effectively in space physics research. Crafting a framework for a data environment begins with defining a model of the tasks to be performed, then defining the functional aspects and the work flow. The foundation of any data environment is an information model which defines the structure and content of the metadata necessary to perform the tasks. In the Heliophysics Data Environment the information model is the Space Physics Archive Search and Extract (SPASE) model and available resources are described by using this model. A described resource can reside anywhere on the internet which makes it possible for a national archive, mission, data center or individual researcher to be a provider. The generated metadata is shared, reviewed and harvested to enable services. Virtual Observatories use the metadata to provide community based portals. Through unique identifiers and registry services tools can quickly discover and access data available anywhere on the internet. This enables a researcher to quickly view and analyze data in a variety of settings and enhances the Heliophysics Data Environment. To illustrate the current Heliophysics Data Environment we present the design, architecture and operation of the Heliophysics framework. We then walk through a real example of using available tools to investigate the effects of the solar wind on Earth's magnetosphere.
2011-05-01
iTunes illustrate the difference between the centralized approach of digital library systems and the distributed approach of container file formats...metadata in a container file format. Apple’s iTunes uses a centralized metadata approach and allows users to maintain song metadata in a single...one iTunes library to another the metadata must be copied separately or reentered in the new library. This demonstrates the utility of storing metadata
Collaborative Metadata Curation in Support of NASA Earth Science Data Stewardship
NASA Technical Reports Server (NTRS)
Sisco, Adam W.; Bugbee, Kaylin; le Roux, Jeanne; Staton, Patrick; Freitag, Brian; Dixon, Valerie
2018-01-01
A growing collection of NASA Earth science data is archived and distributed by EOSDIS’s 12 Distributed Active Archive Centers (DAACs). Each collection and granule is described by a metadata record housed in the Common Metadata Repository (CMR). Multiple metadata standards are in use, and core elements of each are mapped to and from a common model – the Unified Metadata Model (UMM). Work done by the Analysis and Review of CMR (ARC) Team.
Mitogenome metadata: current trends and proposed standards.
Strohm, Jeff H T; Gwiazdowski, Rodger A; Hanner, Robert
2016-09-01
Mitogenome metadata are descriptive terms about the sequence, and its specimen description that allow both to be digitally discoverable and interoperable. Here, we review a sampling of mitogenome metadata published in the journal Mitochondrial DNA between 2005 and 2014. Specifically, we have focused on a subset of metadata fields that are available for GenBank records, and specified by the Genomics Standards Consortium (GSC) and other biodiversity metadata standards; and we assessed their presence across three main categories: collection, biological and taxonomic information. To do this we reviewed 146 mitogenome manuscripts, and their associated GenBank records, and scored them for 13 metadata fields. We also explored the potential for mitogenome misidentification using their sequence diversity, and taxonomic metadata on the Barcode of Life Datasystems (BOLD). For this, we focused on all Lepidoptera and Perciformes mitogenomes included in the review, along with additional mitogenome sequence data mined from GenBank. Overall, we found that none of the 146 mitogenome projects provided all the metadata we looked for; and only 17 projects provided at least one category of metadata across the three main categories. Comparisons using mtDNA sequences from BOLD suggest that some mitogenomes may be misidentified. Lastly, we appreciate the research potential of mitogenomes announced through this journal; and we conclude with a suggestion of 13 metadata fields, available on GenBank, that, if provided in a mitogenome's GenBank record, would increase its research value.
Design and implementation of a fault-tolerant and dynamic metadata database for clinical trials
NASA Astrophysics Data System (ADS)
Lee, J.; Zhou, Z.; Talini, E.; Documet, J.; Liu, B.
2007-03-01
In recent imaging-based clinical trials, quantitative image analysis (QIA) and computer-aided diagnosis (CAD) methods are increasing in productivity due to higher resolution imaging capabilities. A radiology core doing clinical trials has been analyzing more treatment methods and there is a growing quantity of metadata that need to be stored and managed. These radiology centers are also collaborating with many off-site imaging field sites and need a way to communicate metadata between one another in a secure infrastructure. Our solution is to implement a data storage grid with a fault-tolerant and dynamic metadata database design to unify metadata from different clinical trial experiments and field sites. Although metadata from images follow the DICOM standard, clinical trials also produce metadata specific to regions-of-interest and quantitative image analysis. We have implemented a data access and integration (DAI) server layer where multiple field sites can access multiple metadata databases in the data grid through a single web-based grid service. The centralization of metadata database management simplifies the task of adding new databases into the grid and also decreases the risk of configuration errors seen in peer-to-peer grids. In this paper, we address the design and implementation of a data grid metadata storage that has fault-tolerance and dynamic integration for imaging-based clinical trials.
Metadata and Service at the GFZ ISDC Portal
NASA Astrophysics Data System (ADS)
Ritschel, B.
2008-05-01
The online service portal of the GFZ Potsdam Information System and Data Center (ISDC) is an access point for all manner of geoscientific geodata, its corresponding metadata, scientific documentation and software tools. At present almost 2000 national and international users and user groups have the opportunity to request Earth science data from a portfolio of 275 different product types and more than 20 million single data files with an added volume of approximately 12 TByte. The majority of the data and information the portal currently offers to the public are global geomonitoring products such as satellite orbit and Earth gravity field data as well as geomagnetic and atmospheric data for the exploration. These products for Earth's changing system are provided via state-of-the-art retrieval techniques. The data product catalog system behind these techniques is based on the extensive usage of standardized metadata, which describe the different geoscientific product types and data products in a uniform way. Whereas all ISDC product types are specified by NASA's Directory Interchange Format (DIF), Version 9.0 Parent XML DIF metadata files, the individual data files are described by extended DIF metadata documents. Depending on the beginning of the scientific project, one part of the data files is described by extended DIF, Version 6 metadata documents and the other part is specified by data Child XML DIF metadata documents. Both the product type dependent parent DIF metadata documents and the data file dependent child DIF metadata documents are derived from a base-DIF.xsd xml schema file. The ISDC metadata philosophy defines a geoscientific product as a package consisting of mostly one or sometimes more than one data file plus one extended DIF metadata file.
Because NASA's DIF metadata standard has been developed in order to specify a collection of data only, the extension of the DIF standard consists of new and specific attributes, which are necessary for an explicit identification of single data files and the set-up of a comprehensive Earth science data catalog. The huge ISDC data catalog is realized by product type dependent tables filled with data file related metadata, which have relations to corresponding metadata tables. The product type describing parent DIF XML metadata documents are stored and managed in ORACLE's XML storage structures. In order to improve the interoperability of the ISDC service portal, the existing proprietary catalog system will be extended by an ISO 19115 based web catalog service. In addition to this development, there is ISDC-related work concerning a semantic network of different kinds of metadata resources, such as different kinds of standardized and non-standardized metadata documents and literature, as well as Web 2.0 user-generated information derived from tagging activities and social navigation data.
NASA Astrophysics Data System (ADS)
Agarwal, D.; Varadharajan, C.; Cholia, S.; Snavely, C.; Hendrix, V.; Gunter, D.; Riley, W. J.; Jones, M.; Budden, A. E.; Vieglais, D.
2017-12-01
The ESS-DIVE archive is a new U.S. Department of Energy (DOE) data archive designed to provide long-term stewardship and use of data from observational, experimental, and modeling activities in the earth and environmental sciences. The ESS-DIVE infrastructure is constructed with the long-term vision of enabling broad access to and usage of the DOE sponsored data stored in the archive. It is designed as a scalable framework that incentivizes data providers to contribute well-structured, high-quality data to the archive and that enables the user community to easily build data processing, synthesis, and analysis capabilities using those data. The key innovations in our design include: (1) application of user-experience research methods to understand the needs of users and data contributors; (2) support for early data archiving during project data QA/QC and before public release; (3) focus on implementation of data standards in collaboration with the community; (4) support for community built tools for data search, interpretation, analysis, and visualization tools; (5) data fusion database to support search of the data extracted from packages submitted and data available in partner data systems such as the Earth System Grid Federation (ESGF) and DataONE; and (6) support for archiving of data packages that are not to be released to the public. ESS-DIVE data contributors will be able to archive and version their data and metadata, obtain data DOIs, search for and access ESS data and metadata via web and programmatic portals, and provide data and metadata in standardized forms. The ESS-DIVE archive and catalog will be federated with other existing catalogs, allowing cross-catalog metadata search and data exchange with existing systems, including DataONE's Metacat search. ESS-DIVE is operated by a multidisciplinary team from Berkeley Lab, the National Center for Ecological Analysis and Synthesis (NCEAS), and DataONE. 
The primary data copies are hosted at DOE's NERSC supercomputing facility with replicas at DataONE nodes.
Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO).
Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
2017-08-01
A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type and organism) from over 1.3 million GEO records. We examined the quality of well supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the dimensionality of the unique values of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
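The evaluation measures named in this abstract (precision, recall, F-measure, and a majority vote baseline) can be illustrated with a minimal sketch. The labels below are hypothetical toy values, not GEO data, and the function names are illustrative assumptions rather than anything from the paper.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label; a majority vote classifier predicts it for every record."""
    return Counter(train_labels).most_common(1)[0][0]


def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall and F-measure for one class value."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical values for a "molecule type" field (assumption, for illustration only).
true_labels = ["RNA", "RNA", "DNA", "RNA", "DNA"]
baseline_predictions = [majority_baseline(true_labels)] * len(true_labels)
scores = precision_recall_f1(true_labels, baseline_predictions, positive="RNA")
```

A learned predictor would be scored with the same function and compared against `scores` for the baseline.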
Mercury Toolset for Spatiotemporal Metadata
NASA Technical Reports Server (NTRS)
Wilson, Bruce E.; Palanisamy, Giri; Devarakonda, Ranjeet; Rhyne, B. Timothy; Lindsley, Chris; Green, James
2010-01-01
Mercury (http://mercury.ornl.gov) is a set of tools for federated harvesting, searching, and retrieving metadata, particularly spatiotemporal metadata. Version 3.0 of the Mercury toolset provides orders of magnitude improvements in search speed, support for additional metadata formats, integration with Google Maps for spatial queries, facetted type search, support for RSS (Really Simple Syndication) delivery of search results, and enhanced customization to meet the needs of the multiple projects that use Mercury. It provides a single portal to very quickly search for data and information contained in disparate data management systems, each of which may use different metadata formats. Mercury harvests metadata and key data from contributing project servers distributed around the world and builds a centralized index. The search interfaces then allow the users to perform a variety of fielded, spatial, and temporal searches across these metadata sources. This centralized repository of metadata with distributed data sources provides extremely fast search results to the user, while allowing data providers to advertise the availability of their data and maintain complete control and ownership of that data. Mercury periodically (typically daily) harvests metadata sources through a collection of interfaces and re-indexes these metadata to provide extremely rapid search capabilities, even over collections with tens of millions of metadata records. A number of both graphical and application interfaces have been constructed within Mercury, to enable both human users and other computer programs to perform queries. Mercury was also designed to support multiple different projects, so that the particular fields that can be queried and used with search filters are easy to configure for each different project.
Mercury Toolset for Spatiotemporal Metadata
NASA Astrophysics Data System (ADS)
Devarakonda, Ranjeet; Palanisamy, Giri; Green, James; Wilson, Bruce; Rhyne, B. Timothy; Lindsley, Chris
2010-06-01
Mercury (http://mercury.ornl.gov) is a set of tools for federated harvesting, searching, and retrieving metadata, particularly spatiotemporal metadata. Version 3.0 of the Mercury toolset provides orders of magnitude improvements in search speed, support for additional metadata formats, integration with Google Maps for spatial queries, facetted type search, support for RSS (Really Simple Syndication) delivery of search results, and enhanced customization to meet the needs of the multiple projects that use Mercury. It provides a single portal to very quickly search for data and information contained in disparate data management systems, each of which may use different metadata formats. Mercury harvests metadata and key data from contributing project servers distributed around the world and builds a centralized index. The search interfaces then allow the users to perform a variety of fielded, spatial, and temporal searches across these metadata sources. This centralized repository of metadata with distributed data sources provides extremely fast search results to the user, while allowing data providers to advertise the availability of their data and maintain complete control and ownership of that data. Mercury periodically (typically daily) harvests metadata sources through a collection of interfaces and re-indexes these metadata to provide extremely rapid search capabilities, even over collections with tens of millions of metadata records. A number of both graphical and application interfaces have been constructed within Mercury, to enable both human users and other computer programs to perform queries. Mercury was also designed to support multiple different projects, so that the particular fields that can be queried and used with search filters are easy to configure for each different project.
NASA Astrophysics Data System (ADS)
Jirka, Simon; del Rio, Joaquin; Toma, Daniel; Martinez, Enoc; Delory, Eric; Pearlman, Jay; Rieke, Matthes; Stasch, Christoph
2017-04-01
With the rapidly evolving technology for building Web-based (spatial) information infrastructures and Sensor Webs, there are new opportunities to improve the process by which ocean data is collected and managed. A central element in this development is the suite of Sensor Web Enablement (SWE) standards specified by the Open Geospatial Consortium (OGC). This framework of standards comprises on the one hand data models as well as formats for measurement data (ISO/OGC Observations and Measurement, O&M) and metadata describing measurement processes and sensors (OGC Sensor Model Language, SensorML). On the other hand the SWE standards comprise (Web service) interface specifications for pull-based access to observation data (OGC Sensor Observation Service, SOS) and for controlling or configuring sensors (OGC Sensor Planning Service, SPS). Also within the European INSPIRE framework the SWE standards play an important role as the SOS is the recommended download service interface for O&M-encoded observation data sets. In the context of the EU-funded Oceans of Tomorrow initiative the NeXOS (Next generation, Cost-effective, Compact, Multifunctional Web Enabled Ocean Sensor Systems Empowering Marine, Maritime and Fisheries Management) project is developing a new generation of in-situ sensors that make use of the SWE standards to facilitate the data publication process and the integration into Web based information infrastructures. This includes the development of a dedicated firmware for instruments and sensor platforms (SEISI, Smart Electronic Interface for Sensors and Instruments) maintained by the Universitat Politècnica de Catalunya (UPC). Among other features, SEISI makes use of OGC SWE standards such as OGC-PUCK, to enable a plug-and-play mechanism for sensors based on SensorML encoded metadata.
Thus, if a new instrument is attached to a SEISI-based platform, it automatically configures the connection to this instrument, automatically generates data files compliant with the ISO/OGC Observations and Measurements standard and initiates the data transmission into the NeXOS Sensor Web infrastructure. Besides these platform-related developments, NeXOS has realised the full path of data transmission from the sensor to the end user application. The conceptual architecture design is implemented by a series of open source SWE software packages provided by 52°North. This comprises especially different SWE server components (i.e. OGC Sensor Observation Service), tools for data visualisation (e.g. the 52°North Helgoland SOS viewer), and an editor for providing SensorML-based metadata (52°North smle). As a result, NeXOS has demonstrated how the SWE standards help to improve marine observation data collection. Within this presentation, we will present the experiences and findings of the NeXOS project and will provide recommendations for future work directions.
Metadata Realities for Cyberinfrastructure: Data Authors as Metadata Creators
ERIC Educational Resources Information Center
Mayernik, Matthew Stephen
2011-01-01
As digital data creation technologies become more prevalent, data and metadata management are necessary to make data available, usable, sharable, and storable. Researchers in many scientific settings, however, have little experience or expertise in data and metadata management. In this dissertation, I explore the everyday data and metadata…
NetCDF4/HDF5 and Linked Data in the Real World - Enriching Geoscientific Metadata without Bloat
NASA Astrophysics Data System (ADS)
Ip, Alex; Car, Nicholas; Druken, Kelsey; Poudjom-Djomani, Yvette; Butcher, Stirling; Evans, Ben; Wyborn, Lesley
2017-04-01
NetCDF4 has become the dominant generic format for many forms of geoscientific data, leveraging (and constraining) the versatile HDF5 container format, while providing metadata conventions for interoperability. However, the encapsulation of detailed metadata within each file can lead to metadata "bloat", and difficulty in maintaining consistency where metadata is replicated to multiple locations. Complex conceptual relationships are also difficult to represent in simple key-value netCDF metadata. Linked Data provides a practical mechanism to address these issues by associating the netCDF files and their internal variables with complex metadata stored in Semantic Web vocabularies and ontologies, while complying with and complementing existing metadata conventions. One of the stated objectives of the netCDF4/HDF5 formats is that they should be self-describing: containing metadata sufficient for cataloguing and using the data. However, this objective can be regarded as only partially-met where details of conventions and definitions are maintained externally to the data files. For example, one of the most widely used netCDF community standards, the Climate and Forecasting (CF) Metadata Convention, maintains standard vocabularies for a broad range of disciplines across the geosciences, but this metadata is currently neither readily discoverable nor machine-readable. We have previously implemented useful Linked Data and netCDF tooling (ncskos) that associates netCDF files, and individual variables within those files, with concepts in vocabularies formulated using the Simple Knowledge Organization System (SKOS) ontology. NetCDF files contain Uniform Resource Identifier (URI) links to terms represented as SKOS Concepts, rather than plain-text representations of those terms, so we can use simple, standardised web queries to collect and use rich metadata for the terms from any Linked Data-presented SKOS vocabulary. 
Geoscience Australia (GA) manages a large volume of diverse geoscientific data, much of which is being translated from proprietary formats to netCDF at NCI Australia. This data is made available through the NCI National Environmental Research Data Interoperability Platform (NERDIP) for programmatic access and interdisciplinary analysis. The netCDF files contain both scientific data variables (e.g. gravity, magnetic or radiometric values) and domain-specific operational values (e.g. specific instrument parameters) best described fully in formal vocabularies. Our ncskos codebase provides access to multiple stores of detailed external metadata in a standardised fashion. Geophysical datasets are generated from a "survey" event, and GA maintains corporate databases of all surveys and their associated metadata. It is impractical to replicate the full source survey metadata into each netCDF dataset so, instead, we link the netCDF files to survey metadata using public Linked Data URIs. These URIs link to Survey class objects which we model as a subclass of Activity objects as defined by the PROV Ontology, and we provide URI resolution for them via a custom Linked Data API which draws current survey metadata from GA's in-house databases. We have demonstrated that Linked Data is a practical way to associate netCDF data with detailed, external metadata. This allows us to ensure that catalogued metadata is kept consistent with metadata points-of-truth, and we can infer complex conceptual relationships not possible with netCDF key-value attributes alone.
Bercovich, A; Edan, Y; Alchanatis, V; Moallem, U; Parmet, Y; Honig, H; Maltz, E; Antler, A; Halachmi, I
2013-01-01
Body condition evaluation is a common tool to assess energy reserves of dairy cows and to estimate their fatness or thinness. This study presents a computer-vision tool that automatically estimates a cow's body condition score. Top-view images of 151 cows were collected on an Israeli research dairy farm using a digital still camera located at the entrance to the milking parlor. The cow's tailhead area and its contour were segmented and extracted automatically. Two types of features of the tailhead contour were extracted: (1) the angles and distances between 5 anatomical points; and (2) the cow signature, which is a 1-dimensional vector of the Euclidean distances from each point in the normalized tailhead contour to the shape center. Two methods were applied to describe the cow's signature and to reduce its dimension: (1) partial least squares regression, and (2) Fourier descriptors of the cow signature. Three prediction models were compared with manual scores of an expert. Results indicate that (1) it is possible to automatically extract and predict body condition from color images without any manual interference; and (2) Fourier descriptors of the cow's signature result in improved performance (R² = 0.77). Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
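The "signature" feature described above (distances from each contour point to the shape center, then Fourier descriptors) can be sketched in a few lines. This is only an illustration under stated assumptions: the shape center is taken as the mean of the contour points, the signature is scaled by its maximum distance, and the descriptors are plain DFT magnitudes. The abstract does not specify these details, and the function names are hypothetical.

```python
import cmath
import math


def contour_signature(points):
    """Distances from each contour point to the shape centre (assumed here to be
    the mean of the points), scaled by the largest distance."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    peak = max(dists)
    return [d / peak for d in dists]


def fourier_descriptors(signature, n_coeffs):
    """Magnitudes of the first n_coeffs discrete Fourier coefficients of the signature."""
    n = len(signature)
    descriptors = []
    for k in range(n_coeffs):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(signature))
        descriptors.append(abs(coeff) / n)
    return descriptors


# Toy contour: 16 points on a circle of radius 2 centred at (1, 1).
circle = [(1 + 2 * math.cos(2 * math.pi * i / 16),
           1 + 2 * math.sin(2 * math.pi * i / 16)) for i in range(16)]
signature = contour_signature(circle)
descriptors = fourier_descriptors(signature, 4)
```

For a circle the signature is constant, so only the zeroth descriptor is non-zero; a real contour would spread energy across higher coefficients, which is what makes a short descriptor vector usable as a reduced-dimension feature.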
Automatically extracting the significant aspects evaluated in game reviews
NASA Astrophysics Data System (ADS)
Fong, Chiok Hoong; Ng, Yen Kaow
2017-04-01
Understanding the criteria (or "aspects") that reviewers use to evaluate games is important to game developers and publishers, since this will give them important input on how to improve their products. Techniques for the extraction of such aspects have been studied by others, albeit not specific to the gaming industry. In this paper we demonstrate an aspect extraction and analysis system specific to computer games. The system extracts game review texts from a list of known websites and automatically extracts candidate aspects from the review text using techniques from natural language processing and sentiment analysis. It then ranks the candidate aspects using the HITS algorithm. To evaluate the correctness of the extracted aspects, we used the system to calculate an overall score for each game by aggregating its highly-rated aspects, weighted by the importance of the respective aspects. The aggregated scores resulted in a ranking of games, which we compared to a known ranking from a popular website - the rankings showed overall consistency, which suggests that the system has extracted valuable aspects from the reviews. Using the extracted aspects, our system also facilitates the analysis of a game, by evaluating how review articles have rated its performance in these extracted aspects.
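As a rough illustration of ranking with HITS, the sketch below runs the standard hub/authority iteration on a small graph. The graph construction is an assumption made for this example (each review links to the aspects it mentions); the abstract does not say how the authors build their graph, and the review and aspect names are invented.

```python
import math


def hits(links, iterations=50):
    """Standard HITS iteration. `links` maps each hub node to the nodes it points to.
    Returns (hub_scores, authority_scores), each L2-normalised."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes pointing at it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of the nodes it points at.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical review-to-aspect links (illustrative only).
links = {
    "review_1": ["graphics", "story"],
    "review_2": ["graphics", "controls"],
    "review_3": ["graphics"],
}
hub, auth = hits(links)
ranked_aspects = sorted((n for n in auth if n not in links),
                        key=lambda n: auth[n], reverse=True)
```

Aspects mentioned by many strong hubs receive the highest authority scores, which is the property a ranking of candidate aspects would rely on.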
Comparison of methods of DNA extraction for real-time PCR in a model of pleural tuberculosis.
Santos, Ana; Cremades, Rosa; Rodríguez, Juan Carlos; García-Pachón, Eduardo; Ruiz, Montserrat; Royo, Gloria
2010-01-01
Molecular methods have been reported to have different sensitivities in the diagnosis of pleural tuberculosis and this may in part be caused by the use of different methods of DNA extraction. Our study compares nine DNA extraction systems in an experimental model of pleural tuberculosis. An inoculum of Mycobacterium tuberculosis was added to 23 pleural liquid samples with different characteristics. DNA was subsequently extracted using nine different methods (seven manual and two automatic) for analysis with real-time PCR. Only two methods were able to detect the presence of M. tuberculosis DNA in all the samples: extraction using columns (Qiagen) and automated extraction with the TNAI system (Roche). The automatic method is more expensive, but requires less time. Almost all the false negatives were because of the difficulty involved in extracting M. tuberculosis DNA, as in general, all the methods studied are capable of eliminating inhibitory substances that block the amplification reaction. The method of M. tuberculosis DNA extraction used affects the results of the diagnosis of pleural tuberculosis by molecular methods. DNA extraction systems that have been shown to be effective in pleural liquid should be used.
Huang, Yukun; Chen, Rong; Wei, Jingbo; Pei, Xilong; Cao, Jing; Prakash Jayaraman, Prem; Ranjan, Rajiv
2014-01-01
JNI in the Android platform is often observed with low efficiency and high coding complexity. Although many researchers have investigated the JNI mechanism, few of them solve the efficiency and the complexity problems of JNI in the Android platform simultaneously. In this paper, a hybrid polylingual object (HPO) model is proposed to allow a CAR object to be accessed as a Java object and vice versa in the Dalvik virtual machine. It is an acceptable substitute for JNI to reuse the CAR-compliant components in Android applications in a seamless and efficient way. The metadata injection mechanism is designed to support the automatic mapping and reflection between CAR objects and Java objects. A prototype virtual machine, called HPO-Dalvik, is implemented by extending the Dalvik virtual machine to support the HPO model. Lifespan management, garbage collection, and data type transformation of HPO objects are also handled in the HPO-Dalvik virtual machine automatically. The experimental results show that the HPO model outperforms standard JNI, with lower overhead on the native side and better execution performance, and requires no JNI bridging code. PMID:25110745
Automatic Generation of Data Types for Classification of Deep Web Sources
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ngu, A H; Buttler, D J; Critchlow, T J
2005-02-14
A Service Class Description (SCD) is an effective meta-data based approach for discovering Deep Web sources whose data exhibit some regular patterns. However, it is tedious and error prone to create an SCD description manually. Moreover, a manually created SCD is not adaptive to the frequent changes of Web sources. It requires its creator to identify all the possible input and output types of a service a priori. In many domains, it is impossible to exhaustively list all the possible input and output data types of a source in advance. In this paper, we describe machine learning approaches for automatic generation of the data types of an SCD. We propose two different approaches for learning data types of a class of Web sources. The Brute-Force Learner is able to generate data types that can achieve high recall, but with low precision. The Clustering-based Learner generates data types that have a high precision rate, but with a lower recall rate. We demonstrate the feasibility of these two learning-based solutions for automatic generation of data types for citation Web sources and present a quantitative evaluation of these two solutions.
Computerized image analysis for quantitative neuronal phenotyping in zebrafish.
Liu, Tianming; Lu, Jianfeng; Wang, Ye; Campbell, William A; Huang, Ling; Zhu, Jinmin; Xia, Weiming; Wong, Stephen T C
2006-06-15
An integrated microscope image analysis pipeline is developed for automatic analysis and quantification of phenotypes in zebrafish with altered expression of Alzheimer's disease (AD)-linked genes. We hypothesize that a slight impairment of neuronal integrity in a large number of zebrafish carrying the mutant genotype can be detected through the computerized image analysis method. Key functionalities of our zebrafish image processing pipeline include quantification of neuron loss in zebrafish embryos due to knockdown of AD-linked genes, automatic detection of defective somites, and quantitative measurement of gene expression levels in zebrafish with altered expression of AD-linked genes or treatment with a chemical compound. These quantitative measurements enable the archival of analyzed results and relevant meta-data. The structured database is organized for statistical analysis and data modeling to better understand neuronal integrity and phenotypic changes of zebrafish under different perturbations. Our results show that the computerized analysis is comparable to manual counting with equivalent accuracy and improved efficacy and consistency. Development of such an automated data analysis pipeline represents a significant step forward to achieve accurate and reproducible quantification of neuronal phenotypes in large scale or high-throughput zebrafish imaging studies.
García-Remesal, Miguel; Maojo, Victor; Crespo, José
2010-01-01
In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. The latter are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. For such set, we achieved 87.76% precision and 97.70% recall respectively. This method can facilitate different research tasks. These include text mining, information extraction, and information retrieval research dealing with large collections of documents containing genetic sequences.
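The first stage described above, a recognizer that collects every candidate sequence before any filtering, can be sketched as a small two-state machine. This is an illustrative sketch only, not the authors' recognizer: the alphabet, the minimum length `min_len`, and the function name `candidate_sequences` are all assumptions, and the knowledge-based filtering stage is not modeled.

```python
def candidate_sequences(text, min_len=8):
    """Scan text with a two-state machine (outside / inside a run of
    nucleotide letters) and return runs of at least min_len letters.

    Assumptions (not from the paper): the alphabet is A, C, G, T, U in
    either case, and a run glued to other letters or digits is rejected.
    """
    alphabet = set("ACGTUacgtu")
    candidates = []
    run = []          # letters of the run currently being read
    tainted = False   # True once the current word contains a foreign character

    for ch in text + " ":              # trailing space flushes the last run
        if ch in alphabet and not tainted:
            run.append(ch)             # state: inside a candidate run
        elif ch.isalnum():
            run = []                   # foreign letter or digit: reject the word
            tainted = True
        else:
            # Separator: emit the run if it is long enough, then reset.
            if len(run) >= min_len and not tainted:
                candidates.append("".join(run).upper())
            run = []
            tainted = False
    return candidates
```

A deliberately permissive recognizer like this one favors recall; discarding false positives and repairing merged sequences is left to a later stage, as in the abstract.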
Automated Assessment of Child Vocalization Development Using LENA
ERIC Educational Resources Information Center
Richards, Jeffrey A.; Xu, Dongxin; Gilkerson, Jill; Yapanel, Umit; Gray, Sharmistha; Paul, Terrance
2017-01-01
Purpose: To produce a novel, efficient measure of children's expressive vocal development on the basis of automatic vocalization assessment (AVA), child vocalizations were automatically identified and extracted from audio recordings using Language Environment Analysis (LENA) System technology. Method: Assessment was based on full-day audio…
Content Metadata Standards for Marine Science: A Case Study
Riall, Rebecca L.; Marincioni, Fausto; Lightsom, Frances L.
2004-01-01
The U.S. Geological Survey developed a content metadata standard to meet the demands of organizing electronic resources in the marine sciences for a broad, heterogeneous audience. These metadata standards are used by the Marine Realms Information Bank project, a Web-based public distributed library of marine science from academic institutions and government agencies. The development and deployment of this metadata standard serve as a model, complete with lessons about mistakes, for the creation of similarly specialized metadata standards for digital libraries.
Yu, Yong-Jie; Xia, Qiao-Ling; Wang, Sheng; Wang, Bing; Xie, Fu-Wei; Zhang, Xiao-Bing; Ma, Yun-Ming; Wu, Hai-Long
2014-09-12
Peak detection and background drift correction (BDC) are the key stages in using chemometric methods to analyze chromatographic fingerprints of complex samples. This study developed a novel chemometric strategy for simultaneous automatic chromatographic peak detection and BDC. A robust statistical method was used for intelligent estimation of instrumental noise level coupled with first-order derivative of chromatographic signal to automatically extract chromatographic peaks in the data. A local curve-fitting strategy was then employed for BDC. Simulated and real liquid chromatographic data were designed with various kinds of background drift and degree of overlapped chromatographic peaks to verify the performance of the proposed strategy. The underlying chromatographic peaks can be automatically detected and reasonably integrated by this strategy. Meanwhile, chromatograms with BDC can be precisely obtained. The proposed method was used to analyze a complex gas chromatography dataset that monitored quality changes in plant extracts during storage procedure. Copyright © 2014 Elsevier B.V. All rights reserved.
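The peak-detection step, a robust noise estimate combined with the first-order derivative, can be illustrated with a short sketch. The abstract does not name the robust statistic, so the median absolute deviation (MAD) used here is an assumption, as are the threshold factor `k` and the function name `detect_peaks`; background drift correction is not shown.

```python
from statistics import median

def detect_peaks(signal, k=3.0):
    """Return (start, apex, end) index triples for peaks in a 1-D signal.

    Assumptions (not from the paper): noise is estimated as the median
    absolute deviation (MAD) of the first differences, scaled by 1.4826,
    and a peak starts where the derivative rises above k * noise.
    """
    diff = [b - a for a, b in zip(signal, signal[1:])]   # first-order derivative
    centre = median(diff)
    noise = 1.4826 * median([abs(d - centre) for d in diff])
    threshold = k * noise

    peaks = []
    i, n = 0, len(diff)
    while i < n:
        if diff[i] > threshold:                    # significant rise: peak begins
            start = i
            while i < n and diff[i] > 0:           # climb to the apex
                i += 1
            apex = i
            while i < n and diff[i] < -threshold:  # follow the steep trailing edge
                i += 1
            peaks.append((start, apex, i))
        else:
            i += 1
    return peaks
```

Because the noise level is taken from the median rather than the mean, a few large peaks do not inflate the threshold, which is the usual reason for preferring a robust estimator in this setting.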
Eccles, B A; Klevecz, R R
1986-06-01
Mitotic frequency in a synchronous culture of mammalian cells was determined fully automatically and in real time using low-intensity phase-contrast microscopy and a newvicon video camera connected to an EyeCom III image processor. Image samples, at a frequency of one per minute for 50 hours, were analyzed by first extracting the high-frequency picture components, then thresholding and probing for annular objects indicative of putative mitotic cells. Both the extraction of high-frequency components and the recognition of rings of varying radii and discontinuities employed novel algorithms. Spatial and temporal relationships between annuli were examined to discern the occurrences of mitoses, and such events were recorded in a computer data file. At present, the automatic analysis is suited for random cell proliferation rate measurements or cell cycle studies. The automatic identification of mitotic cells as described here provides a measure of the average proliferative activity of the cell population as a whole and eliminates more than eight hours of manual review per time-lapse video recording.
Automatic Author Profiling of Online Chat Logs
2007-03-01
CLASSIFICATION WITH PRIOR: 1. All Test Data; 2. Extracted Test Data: Teens and 20s; 3...Extracted Test Data: Teens and 30s; 4. Extracted Test Data: Teens and 40s; 5. Extracted Test Data: Teens and 50s; 6...Data. C. AGE: BINARY CLASSIFICATION WITH PRIOR: 1. Extracted Test Data: Teens and 20s; 2
Automatic indexing of scanned documents: a layout-based approach
NASA Astrophysics Data System (ADS)
Esser, Daniel; Schuster, Daniel; Muthmann, Klemens; Berger, Michael; Schill, Alexander
2012-01-01
Archiving official written documents such as invoices, reminders and account statements, in both business and private settings, is becoming increasingly important. Creating appropriate index entries for document archives, such as sender's name, creation date or document number, is tedious manual work. We present a novel approach to handle automatic indexing of documents based on generic positional extraction of index terms. For this purpose we apply the knowledge of document templates stored in a common full text search index to find index positions that were successfully extracted in the past.
The Extraction of Terrace in the Loess Plateau Based on radial method
NASA Astrophysics Data System (ADS)
Liu, W.; Li, F.
2016-12-01
Terraces on the Loess Plateau are a typical artificial landform and an important soil and water conservation measure; locating and automatically extracting them would simplify land use investigation. Existing terrace extraction methods fall into visual interpretation and automatic extraction. Manual interpretation is used in land use investigation, but it is time-consuming and laborious. Researchers have proposed several automatic extraction methods. For example, the Fourier transform method can recognize terraces and find their accurate position from the frequency-domain image, but it is strongly affected by linear objects running in the same direction as the terraces. The texture analysis method is simple and widely applied in image processing, but it cannot recognize terrace edges. Object-oriented classification is a newer method of image classification, but when it is applied to terrace extraction, fractured polygons become the most serious problem and their geological meaning is difficult to explain. To locate the terraces, we use high-resolution remote sensing images to extract and analyze the gray values of the pixels that the radials pass through. During recognition, we first use DEM data analysis or manual selection to roughly determine the positions of peak points; second, we take each peak point as a center and cast radials in all directions; finally, we extract the gray values of the pixels that the radials pass through and analyze how they change to determine whether a terrace exists. To obtain accurate terrace positions, the algorithm design took into account terrace discontinuity, extension direction, ridge width, the image processing algorithm, remote sensing image illumination, and other influencing factors.
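The three recognition steps (pick peak points, cast radials, analyze gray values along each radial) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the image is a plain 2-D list, `n_rays`, `length` and `min_jump` are hypothetical parameters, and counting large gray-value jumps stands in for whatever change analysis the authors actually use.

```python
import math

def radial_profiles(image, peak, n_rays=8, length=5):
    """Sample gray values along rays cast from a peak point.

    image  : 2-D list of gray values, indexed image[row][col]
    peak   : (row, col) of the assumed peak point
    n_rays : number of evenly spaced directions (hypothetical parameter)
    length : number of pixel steps per ray (hypothetical parameter)
    Returns a dict mapping ray angle in degrees to the sampled values.
    """
    rows, cols = len(image), len(image[0])
    profiles = {}
    for k in range(n_rays):
        angle = 2 * math.pi * k / n_rays
        values = []
        for step in range(1, length + 1):
            r = round(peak[0] - step * math.sin(angle))   # rows grow downward
            c = round(peak[1] + step * math.cos(angle))
            if not (0 <= r < rows and 0 <= c < cols):
                break                                     # ray left the image
            values.append(image[r][c])
        profiles[round(math.degrees(angle))] = values
    return profiles


def count_transitions(values, min_jump=50):
    """Count large gray-value jumps along one ray. Treating many jumps as a
    hint of terrace ridges is an illustrative criterion, not the paper's."""
    return sum(1 for a, b in zip(values, values[1:]) if abs(b - a) >= min_jump)
```

On an image with alternating bright and dark rings around the peak, each ray yields an alternating profile, and the jump count rises with the number of ring boundaries the ray crosses.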
Heidelberger, Philip; Steinmacher-Burow, Burkhard
2015-01-06
According to one embodiment, a method for implementing an array-based queue in memory of a memory system that includes a controller includes configuring, in the memory, metadata of the array-based queue. The configuring comprises defining, in metadata, an array start location in the memory for the array-based queue, defining, in the metadata, an array size for the array-based queue, defining, in the metadata, a queue top for the array-based queue and defining, in the metadata, a queue bottom for the array-based queue. The method also includes the controller serving a request for an operation on the queue, the request providing the location in the memory of the metadata of the queue.
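The four metadata fields named in the abstract (array start, array size, queue top, queue bottom) map naturally onto a small record. The sketch below is only an illustration of that idea: the field names, the circular wrap-around, the extra `count` field, and the choice of dequeuing at `top` are assumptions, and metadata is kept in a Python dict keyed by its location rather than inside the simulated memory.

```python
from dataclasses import dataclass

@dataclass
class QueueMetadata:
    """Metadata record for an array-based queue (field names are illustrative)."""
    array_start: int   # start location of the array in memory
    array_size: int    # number of slots in the array
    top: int = 0       # index of the next element to dequeue (assumed)
    bottom: int = 0    # index of the next free slot (assumed)
    count: int = 0     # extra bookkeeping (an assumption) to tell full from empty


class Controller:
    """Toy controller that serves queue requests given only the location
    of the queue's metadata."""

    def __init__(self, memory_words):
        self.memory = [None] * memory_words   # flat, word-addressed memory
        self.metadata = {}                    # metadata location -> QueueMetadata

    def configure(self, meta_loc, array_start, array_size):
        self.metadata[meta_loc] = QueueMetadata(array_start, array_size)

    def enqueue(self, meta_loc, value):
        m = self.metadata[meta_loc]
        if m.count == m.array_size:
            raise OverflowError("queue full")
        self.memory[m.array_start + m.bottom] = value
        m.bottom = (m.bottom + 1) % m.array_size   # wrap around (assumed circular)
        m.count += 1

    def dequeue(self, meta_loc):
        m = self.metadata[meta_loc]
        if m.count == 0:
            raise IndexError("queue empty")
        value = self.memory[m.array_start + m.top]
        m.top = (m.top + 1) % m.array_size
        m.count -= 1
        return value
```

Because a request carries only the metadata location, the caller never needs to know where the array itself lives or how large it is.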
WGISS-45 International Directory Network (IDN) Report
NASA Technical Reports Server (NTRS)
Morahan, Michael
2018-01-01
The objective of this presentation is to provide IDN (International Directory Network) updates on features and activities to the Committee on Earth Observation Satellites (CEOS) Working Group on Information Systems and Services (WGISS) and provider community. The following topics will be discussed during the presentation: Transition of Providers DIF-9 (Directory Interchange Format-9) to DIF-10 Metadata Records in the Common Metadata Repository (CMR); GCMD (Global Change Master Directory) Keyword Update; DIF-10 and UMM-C (Unified Metadata Model-Collections) Schema Changes; Metadata Validation of Provider Metadata; docBUILDER for Submitting IDN Metadata to the CMR (i.e. Registration); and Mapping WGClimate Essential Climate Variable (ECV) Inventory to IDN Records.
WebEAV: automatic metadata-driven generation of web interfaces to entity-attribute-value databases.
Nadkarni, P M; Brandt, C M; Marenco, L
2000-01-01
The task of creating and maintaining a front end to a large institutional entity-attribute-value (EAV) database can be cumbersome when using traditional client-server technology. Switching to Web technology as a delivery vehicle solves some of these problems but introduces others. In particular, Web development environments tend to be primitive, and many features that client-server developers take for granted are missing. WebEAV is a generic framework for Web development that is intended to streamline the process of Web application development for databases having a significant EAV component. It also addresses some challenging user interface issues that arise when any complex system is created. The authors describe the architecture of WebEAV and provide an overview of its features with suitable examples.
Explorative Analyses of Nursing Research Data.
Kim, Hyeoneui; Jang, Imho; Quach, Jimmy; Richardson, Alex; Kim, Jaemin; Choi, Jeeyae
2016-10-26
As a first step of pursuing the vision of "big data science in nursing," we described the characteristics of nursing research data reported in 194 published nursing studies. We also explored how completely the Version 1 metadata specification of biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) represents these metadata. The metadata items of the nursing studies were all related to one or more of the bioCADDIE metadata entities. However, values of many metadata items of the nursing studies were not sufficiently represented through the bioCADDIE metadata. This was partly due to the differences in the scope of the content that the bioCADDIE metadata are designed to represent. The 194 nursing studies reported a total of 1,181 unique data items, the majority of which take non-numeric values. This indicates the importance of data standardization to enable the integrative analyses of these data to support big data science in nursing. © The Author(s) 2016.
A Model for Enhancing Internet Medical Document Retrieval with “Medical Core Metadata”
Malet, Gary; Munoz, Felix; Appleyard, Richard; Hersh, William
1999-01-01
Objective: Finding documents on the World Wide Web relevant to a specific medical information need can be difficult. The goal of this work is to define a set of document content description tags, or metadata encodings, that can be used to promote disciplined search access to Internet medical documents. Design: The authors based their approach on a proposed metadata standard, the Dublin Core Metadata Element Set, which has recently been submitted to the Internet Engineering Task Force. Their model also incorporates the National Library of Medicine's Medical Subject Headings (MeSH) vocabulary and Medline-type content descriptions. Results: The model defines a medical core metadata set that can be used to describe the metadata for a wide variety of Internet documents. Conclusions: The authors propose that their medical core metadata set be used to assign metadata to medical documents to facilitate document retrieval by Internet search engines. PMID:10094069
MPEG-7: standard metadata for multimedia content
NASA Astrophysics Data System (ADS)
Chang, Wo
2005-08-01
The eXtensible Markup Language (XML) metadata technology for describing media content has emerged as a dominant mode of making media searchable for both human and machine consumption. To realize this premise, many online Web applications are pushing this concept to its fullest potential. However, a good metadata model does require a robust standardization effort so that the metadata content and its structure can reach maximum usage across various applications. An effective media content description technology should also use standard metadata structures, especially when dealing with various multimedia content. A new metadata technology called MPEG-7 content description has emerged from the ISO MPEG standards body with the charter of defining standard metadata to describe audiovisual content. This paper gives an overview of MPEG-7 technology and the impact it can have on the next generation of multimedia indexing and retrieval applications.
Quality Assurance for Digital Learning Object Repositories: Issues for the Metadata Creation Process
ERIC Educational Resources Information Center
Currier, Sarah; Barton, Jane; O'Beirne, Ronan; Ryan, Ben
2004-01-01
Metadata enables users to find the resources they require, therefore it is an important component of any digital learning object repository. Much work has already been done within the learning technology community to assure metadata quality, focused on the development of metadata standards, specifications and vocabularies and their implementation…
A Model for the Creation of Human-Generated Metadata within Communities
ERIC Educational Resources Information Center
Brasher, Andrew; McAndrew, Patrick
2005-01-01
This paper considers situations for which detailed metadata descriptions of learning resources are necessary, and focuses on human generation of such metadata. It describes a model which facilitates human production of good quality metadata by the development and use of structured vocabularies. Using examples, this model is applied to single and…
Enhancing SCORM Metadata for Assessment Authoring in E-Learning
ERIC Educational Resources Information Center
Chang, Wen-Chih; Hsu, Hui-Huang; Smith, Timothy K.; Wang, Chun-Chia
2004-01-01
With the rapid development of distance learning and the XML technology, metadata play an important role in e-Learning. Nowadays, many distance learning standards, such as SCORM, AICC CMI, IEEE LTSC LOM and IMS, use metadata to tag learning materials. However, most metadata models are used to define learning materials and test problems. Few…
Development of Health Information Search Engine Based on Metadata and Ontology
Song, Tae-Min; Jin, Dal-Lae
2014-01-01
Objectives: The aim of the study was to develop a metadata and ontology-based health information search engine ensuring semantic interoperability to collect and provide health information using different application programs. Methods: Health information metadata ontology was developed using a distributed semantic Web content publishing model based on vocabularies used to index the contents generated by the information producers as well as those used to search the contents by the users. Vocabulary for health information ontology was mapped to the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), and a list of about 1,500 terms was proposed. The metadata schema used in this study was developed by adding an element describing the target audience to the Dublin Core Metadata Element Set. Results: A metadata schema and an ontology ensuring interoperability of health information available on the internet were developed. The metadata and ontology-based health information search engine developed in this study produced a better search result compared to existing search engines. Conclusions: A health information search engine based on metadata and ontology will provide reliable health information to both information producers and information consumers. PMID:24872907
Development of health information search engine based on metadata and ontology.
Song, Tae-Min; Park, Hyeoun-Ae; Jin, Dal-Lae
2014-04-01
The aim of the study was to develop a metadata and ontology-based health information search engine ensuring semantic interoperability to collect and provide health information using different application programs. Health information metadata ontology was developed using a distributed semantic Web content publishing model based on vocabularies used to index the contents generated by the information producers as well as those used to search the contents by the users. Vocabulary for health information ontology was mapped to the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), and a list of about 1,500 terms was proposed. The metadata schema used in this study was developed by adding an element describing the target audience to the Dublin Core Metadata Element Set. A metadata schema and an ontology ensuring interoperability of health information available on the internet were developed. The metadata and ontology-based health information search engine developed in this study produced a better search result compared to existing search engines. A health information search engine based on metadata and ontology will provide reliable health information to both information producers and information consumers.
NASA Astrophysics Data System (ADS)
Hardy, D.; Janée, G.; Gallagher, J.; Frew, J.; Cornillon, P.
2006-12-01
The OPeNDAP Data Access Protocol (DAP) is a community standard for sharing scientific data across the Internet. Data providers using DAP have adopted a variety of metadata conventions to improve data utility, such as COARDS (1995) and CF (2003). Our results show, however, that metadata do not follow these conventions in practice. We collected metadata from over a hundred DAP servers, tens of thousands of data objects, and hundreds of collections. We found that a minority claim to adhere to a metadata convention, and a small percentage accurately adhere to their stated convention. We present descriptive statistics of our survey and highlight common traits such as well-populated attributes. Our empirical results indicate that unified search services cannot rely solely on metadata conventions. Although we encourage all providers to adopt a small subset of the CF convention for discovery purposes, we have no evidence to suggest that improved conventions would simplify the fundamental problem of heterogeneity. Large-scale discovery services must find methods for integrating incompatible metadata.
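The kind of tally behind a statement such as "a minority claim to adhere to a metadata convention" can be sketched as below. This is a hedged illustration, not the survey's code: plain dicts stand in for the global attributes harvested from each data object, the use of a `Conventions` global attribute follows the CF and COARDS documents, and the function names are hypothetical. Checking whether a claim is accurate is a much larger task and is not attempted here.

```python
from collections import Counter

def claimed_conventions(global_attrs):
    """Return the set of conventions a data object claims to follow, based
    on its `Conventions` global attribute (space- or comma-separated)."""
    value = str(global_attrs.get("Conventions", ""))
    names = set()
    for token in value.replace(",", " ").split():
        token = token.upper()
        if token.startswith("CF"):
            names.add("CF")
        elif token.startswith("COARDS"):
            names.add("COARDS")
    return names


def tally_claims(objects):
    """Count how many objects claim each convention, or none at all."""
    counts = Counter()
    for attrs in objects:
        claims = claimed_conventions(attrs)
        if not claims:
            counts["none"] += 1
        for name in claims:
            counts[name] += 1
    return counts
```

Dividing each count by the number of objects surveyed gives the share of objects making each claim.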
A metadata template for ocean acidification data
NASA Astrophysics Data System (ADS)
Jiang, L.
2014-12-01
Metadata is structured information that describes, explains, and locates an information resource (e.g., data). It is often loosely described as data about data, and it documents what was measured, by whom, when, and where, as well as how it was sampled and analyzed and with what instruments. Metadata is essential to ensuring the survivability and accessibility of the data into the future. With the rapid expansion of studies of biological responses to ocean acidification (OA), the lack of a common metadata template to document this type of data has become a significant gap in ocean acidification data management efforts. In this paper, we present a metadata template that can be applied to a broad spectrum of OA studies, including those studying the biological responses of organisms to ocean acidification. The "variable metadata section", which includes the variable name, observation type, whether the variable is a manipulation condition or response variable, and the biological subject on which the variable is studied, forms the core of this metadata template. Additional metadata elements, such as principal investigators, temporal and spatial coverage, sampling platforms, and data citation, are essential components that complete the template. We explain the structure of the template and define many metadata elements that may be unfamiliar to researchers. For that reason, this paper can serve as a user's manual for the template.
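The "variable metadata section" lists four pieces of information per variable, which can be pictured as a small record type. The sketch below is an assumption-laden illustration rather than the template itself: the class and field names are invented, the two allowed `role` values simply restate the abstract's wording, and the example values are made up.

```python
from dataclasses import dataclass, asdict

ALLOWED_ROLES = {"manipulation condition", "response variable"}

@dataclass
class VariableMetadata:
    """One entry of a variable metadata section (names are hypothetical)."""
    name: str                     # variable name
    observation_type: str         # how the observation was obtained
    role: str                     # manipulation condition or response variable
    biological_subject: str = ""  # organism studied, if any

    def __post_init__(self):
        # Reject roles outside the two categories named in the abstract.
        if self.role not in ALLOWED_ROLES:
            raise ValueError(f"role must be one of {sorted(ALLOWED_ROLES)}")


# Made-up example entries for a laboratory experiment.
variables = [
    VariableMetadata("pCO2", "laboratory experiment", "manipulation condition"),
    VariableMetadata("calcification rate", "laboratory experiment",
                     "response variable", biological_subject="coral"),
]
records = [asdict(v) for v in variables]   # plain dicts, ready to serialize
```

Validating the role at construction time keeps the distinction between what was manipulated and what was measured explicit in every record.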
A Shared Infrastructure for Federated Search Across Distributed Scientific Metadata Catalogs
NASA Astrophysics Data System (ADS)
Reed, S. A.; Truslove, I.; Billingsley, B. W.; Grauch, A.; Harper, D.; Kovarik, J.; Lopez, L.; Liu, M.; Brandt, M.
2013-12-01
The vast amount of science metadata can be overwhelming and highly complex. Comprehensive analysis and sharing of metadata is difficult since institutions often publish to their own repositories. There are many disjoint standards used for publishing scientific data, making it difficult to discover and share information from different sources. Services that publish metadata catalogs often have different protocols, formats, and semantics. The research community is limited by the exclusivity of separate metadata catalogs and thus it is desirable to have federated search interfaces capable of unified search queries across multiple sources. Aggregation of metadata catalogs also enables users to critique metadata more rigorously. With these motivations in mind, the National Snow and Ice Data Center (NSIDC) and Advanced Cooperative Arctic Data and Information Service (ACADIS) implemented two search interfaces for the community. Both the NSIDC Search and ACADIS Arctic Data Explorer (ADE) use a common infrastructure, which keeps maintenance costs low. The search clients are designed to make OpenSearch requests against Solr, an Open Source search platform. Solr indexes specific fields of the metadata, which in this instance optimizes queries containing keywords, spatial bounds and temporal ranges. NSIDC metadata is reused by both search interfaces, but the ADE also brokers additional sources. Users can quickly find relevant metadata with minimal effort, which ultimately lowers costs for research. This presentation will highlight the reuse of data and code between NSIDC and ACADIS, discuss challenges and milestones for each project, and identify the creation and use of Open Source libraries.
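A query that combines keywords, spatial bounds and a temporal range might be assembled for Solr roughly as follows. This is a sketch under assumptions: every field name (`west`, `east`, `south`, `north`, `temporal_start`, `temporal_end`) is a hypothetical placeholder rather than the NSIDC or ADE schema, and only standard Solr parameters (`q`, `fq`, `wt`) and range-query syntax are used.

```python
from urllib.parse import urlencode

def build_solr_params(keywords, bbox=None, time_range=None):
    """Return Solr request parameters as a list of (name, value) pairs.

    keywords   : free-text query string
    bbox       : (west, south, east, north) in degrees, or None
    time_range : (start, end) as ISO 8601 strings, or None
    All field names below are hypothetical placeholders.
    """
    params = [("q", keywords), ("wt", "json")]
    if bbox is not None:
        west, south, east, north = bbox
        # Keep records whose bounding box overlaps the requested box.
        params.append(("fq", f"west:[* TO {east}] AND east:[{west} TO *]"))
        params.append(("fq", f"south:[* TO {north}] AND north:[{south} TO *]"))
    if time_range is not None:
        start, end = time_range
        # Keep records whose time span overlaps the requested range.
        params.append(
            ("fq", f"temporal_start:[* TO {end}] AND temporal_end:[{start} TO *]")
        )
    return params


query_string = urlencode(build_solr_params(
    "sea ice",
    bbox=(-180, 60, 180, 90),
    time_range=("2010-01-01T00:00:00Z", "2010-12-31T23:59:59Z"),
))
```

Putting the spatial and temporal constraints in filter queries rather than in `q` keeps them out of relevance scoring; boxes that cross the antimeridian would need extra handling that this sketch omits.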
Harvesting geographic features from heterogeneous raster maps
NASA Astrophysics Data System (ADS)
Chiang, Yao-Yi
2010-11-01
Raster maps offer a great deal of geospatial information and are easily accessible compared to other geospatial data. However, harvesting geographic features locked in heterogeneous raster maps to obtain the geospatial information is challenging. This is because of the varying image quality of raster maps (e.g., scanned maps with poor image quality and computer-generated maps with good image quality), the overlapping geographic features in maps, and the typical lack of metadata (e.g., map geocoordinates, map source, and original vector data). Previous work on map processing is typically limited to a specific type of map and often relies on intensive manual work. In contrast, this thesis investigates a general approach that does not rely on any prior knowledge and requires minimal user effort to process heterogeneous raster maps. This approach includes automatic and supervised techniques to process raster maps for separating individual layers of geographic features from the maps and recognizing geographic features in the separated layers (i.e., detecting road intersections, generating and vectorizing road geometry, and recognizing text labels). The automatic technique eliminates user intervention by exploiting common map properties of how road lines and text labels are drawn in raster maps. For example, the road lines are elongated linear objects and the characters are small connected-objects. The supervised technique utilizes labels of road and text areas to handle complex raster maps, or maps with poor image quality, and can process a variety of raster maps with minimal user input. The results show that the general approach can handle raster maps with varying map complexity, color usage, and image quality. 
By matching extracted road intersections to another geospatial dataset, we can identify the geocoordinates of a raster map and further align the raster map, separated feature layers from the map, and recognized features from the layers with the geospatial dataset. The road vectorization and text recognition results outperform state-of-the-art commercial products, and with considerably less user input. The approach in this thesis allows us to make use of the geospatial information of heterogeneous maps locked in raster format.
A standard for measuring metadata quality in spectral libraries
NASA Astrophysics Data System (ADS)
Rasaiah, B.; Jones, S. D.; Bellman, C.
2013-12-01
Barbara Rasaiah, Simon Jones, Chris Bellman; RMIT University, Melbourne, Australia (barbara.rasaiah@rmit.edu.au, simon.jones@rmit.edu.au, chris.bellman@rmit.edu.au). There is an urgent need within the international remote sensing community to establish a metadata standard for field spectroscopy that ensures high quality, interoperable metadata sets that can be archived and shared efficiently within Earth observation data sharing systems. Metadata are an important component in the cataloguing and analysis of in situ spectroscopy datasets because of their central role in identifying and quantifying the quality and reliability of spectral data and the products derived from them. This paper presents approaches to measuring metadata completeness and quality in spectral libraries to determine reliability, interoperability, and reusability of a dataset. Explored are quality parameters that meet the unique requirements of in situ spectroscopy datasets, across many campaigns. Examined are the challenges presented by ensuring that data creators, owners, and data users ensure a high level of data integrity throughout the lifecycle of a dataset. Issues such as field measurement methods, instrument calibration, and data representativeness are investigated. The proposed metadata standard incorporates expert recommendations that include metadata protocols critical to all campaigns, and those that are restricted to campaigns for specific target measurements. The implication of semantics and syntax for a robust and flexible metadata standard are also considered. Approaches towards an operational and logistically viable implementation of a quality standard are discussed. This paper also proposes a way forward for adapting and enhancing current geospatial metadata standards to the unique requirements of field spectroscopy metadata quality.
[0430] BIOGEOSCIENCES / Computational methods and data processing [0480] BIOGEOSCIENCES / Remote sensing [1904] INFORMATICS / Community standards [1912] INFORMATICS / Data management, preservation, rescue [1926] INFORMATICS / Geospatial [1930] INFORMATICS / Data and information governance [1946] INFORMATICS / Metadata [1952] INFORMATICS / Modeling [1976] INFORMATICS / Software tools and services [9810] GENERAL OR MISCELLANEOUS / New fields
Metadata Design in the New PDS4 Standards - Something for Everybody
NASA Astrophysics Data System (ADS)
Raugh, Anne C.; Hughes, John S.
2015-11-01
The Planetary Data System (PDS) archives, supports, and distributes data of diverse targets, from diverse sources, to diverse users. One of the core problems addressed by the PDS4 data standard redesign was that of metadata - how to accommodate the increasingly sophisticated demands of search interfaces, analytical software, and observational documentation into label standards without imposing limits and constraints that would impinge on the quality or quantity of metadata that any particular observer or team could supply. And yet, as an archive, PDS must have detailed documentation for the metadata in the labels it supports, or the institutional knowledge encoded into those attributes will be lost - putting the data at risk. The PDS4 metadata solution is based on a three-step approach. First, it is built on two key ISO standards: ISO 11179 "Information Technology - Metadata Registries", which provides a common framework and vocabulary for defining metadata attributes; and ISO 14721 "Space Data and Information Transfer Systems - Open Archival Information System (OAIS) Reference Model", which provides the framework for the information architecture that enforces the object-oriented paradigm for metadata modeling. Second, PDS has defined a hierarchical system that allows it to divide its metadata universe into namespaces ("data dictionaries", conceptually), and more importantly to delegate stewardship for a single namespace to a local authority. This means that a mission can develop its own data model with a high degree of autonomy and effectively extend the PDS model to accommodate its own metadata needs within the common ISO 11179 framework.
Finally, within a single namespace - even the core PDS namespace - existing metadata structures can be extended and new structures added to the model as new needs are identified. This poster illustrates the PDS4 approach to metadata management and highlights the expected return on the development investment for PDS, users and data preparers.
NASA Astrophysics Data System (ADS)
Delory, E.; Jirka, S.
2016-02-01
Discovering sensors and observation data is important when enabling the exchange of oceanographic data between observatories and scientists who need the data sets for their work. To better support this discovery process, one task of the European project FixO3 (Fixed-point Open Ocean Observatories) addresses the question of which elements are needed to develop a better registry for sensors. This has resulted in four items, which are addressed by the FixO3 project in cooperation with further European projects such as NeXOS (http://www.nexosproject.eu/). 1.) Metadata description format: To store and retrieve information about sensors and platforms, it is necessary to have a common approach to providing and encoding the metadata. For this purpose, the OGC Sensor Model Language (SensorML) 2.0 standard was selected. In particular, the ability to distinguish between sensor types and instances allows sensor metadata to be provided and maintained more efficiently. 2.) Conversion of existing metadata into a SensorML 2.0 representation: In order to ensure a sustainable re-use of already provided metadata content (e.g. from ESONET-FixO3 yellow pages), it is important to provide a mechanism that is capable of transforming these already available metadata sets into the new SensorML 2.0 structure. 3.) Metadata editor: Users cannot be expected to manually edit XML-based description files to create descriptions of sensors and platforms. Thus, a visual interface is necessary to help during metadata creation. We will outline a prototype of this editor, building upon the development of the ESONET sensor registry interface. 4.) Sensor Metadata Store: A server is needed for storing and querying the created sensor descriptions. Different options exist for this purpose and will be discussed.
In summary, we will present a set of different elements enabling sensor discovery ranging from metadata formats, metadata conversion and editing to metadata storage. Furthermore, the current development status will be demonstrated.
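The distinction between sensor types and sensor instances mentioned above can be illustrated with a small in-memory registry. This is only a sketch under stated assumptions: the class names, field names and example values are hypothetical, and it does not reproduce the SensorML 2.0 encoding or the FixO3 registry itself. It shows why the split helps maintenance: properties shared by all devices of a model are stored once on the type and inherited by each instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorType:
    """Properties shared by every device of one model (hypothetical fields)."""
    type_id: str
    manufacturer: str
    observed_properties: tuple


@dataclass(frozen=True)
class SensorInstance:
    """One deployed device; refers to its type instead of repeating it."""
    instance_id: str
    type_id: str
    serial_number: str
    platform: str


class SensorRegistry:
    """Minimal in-memory store for sensor types and instances."""

    def __init__(self):
        self._types = {}
        self._instances = {}

    def add_type(self, sensor_type):
        self._types[sensor_type.type_id] = sensor_type

    def add_instance(self, instance):
        if instance.type_id not in self._types:
            raise KeyError(f"unknown sensor type: {instance.type_id}")
        self._instances[instance.instance_id] = instance

    def describe(self, instance_id):
        """Merge instance-level and type-level metadata into one record."""
        inst = self._instances[instance_id]
        stype = self._types[inst.type_id]
        return {
            "instance_id": inst.instance_id,
            "serial_number": inst.serial_number,
            "platform": inst.platform,
            "manufacturer": stype.manufacturer,
            "observed_properties": stype.observed_properties,
        }

    def find_by_property(self, observed_property):
        """Discover instances whose type observes the given property."""
        return sorted(
            inst.instance_id
            for inst in self._instances.values()
            if observed_property in self._types[inst.type_id].observed_properties
        )


registry = SensorRegistry()
registry.add_type(SensorType("ctd-model-x", "Example Instruments", ("temperature", "salinity")))
registry.add_instance(SensorInstance("ctd-001", "ctd-model-x", "SN-1001", "mooring-a"))
registry.add_instance(SensorInstance("ctd-002", "ctd-model-x", "SN-1002", "mooring-b"))
```

Correcting the manufacturer or the list of observed properties then requires a single change to the type record rather than one change per deployed device.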
Improving Metadata Compliance for Earth Science Data Records
NASA Astrophysics Data System (ADS)
Armstrong, E. M.; Chang, O.; Foster, D.
2014-12-01
One of the recurring challenges of creating earth science data records is to ensure a consistent level of metadata compliance at the granule level where important details of contents, provenance, producer, and data references are necessary to obtain a sufficient level of understanding. These details are important not just for individual data consumers but also for autonomous software systems. Two of the most popular metadata standards at the granule level are the Climate and Forecast (CF) Metadata Conventions and the Attribute Conventions for Dataset Discovery (ACDD). Many data producers have implemented one or both of these models including the Group for High Resolution Sea Surface Temperature (GHRSST) for their global SST products and the Ocean Biology Processing Group for NASA ocean color and SST products. While both the CF and ACDD models contain various levels of metadata richness, the number of actual "required" attributes is quite small. Metadata at the granule level becomes much more useful when recommended or optional attributes are implemented that document spatial and temporal ranges, lineage and provenance, sources, keywords, and references. In this presentation we report on a new open source tool to check the compliance of netCDF and HDF5 granules to the CF and ACDD metadata models. The tool, written in Python, was originally implemented to support metadata compliance for netCDF records as part of the NOAA's Integrated Ocean Observing System. It outputs standardized scoring for metadata compliance for both CF and ACDD, produces an objective summary weight, and can be implemented for remote records via OPeNDAP calls. Originally a command-line tool, we have extended it to provide a user-friendly web interface. Reports on metadata testing are grouped in hierarchies that make it easier to track flaws and inconsistencies in the record.
We have also extended it to support explicit metadata structures and semantic syntax for the GHRSST project that can be easily adapted to other satellite missions as well. Overall, we hope this tool will provide the community with a useful mechanism to improve metadata quality and consistency at the granule level by providing objective scoring and assessment, as well as encourage data producers to improve metadata quality and quantity.
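The idea of grouped, scored compliance checking can be sketched as follows. This is not the tool described in the abstract: the function names, the weights and the way the summary weight is computed are assumptions made for illustration, and the attribute lists are only a small subset of ACDD global attribute names whose grouping should be checked against the ACDD specification. The input is a plain dictionary of a granule's global attributes.

```python
# Illustrative subsets of ACDD global attribute names; not the full convention.
HIGHLY_RECOMMENDED = ("title", "summary", "keywords", "Conventions")
RECOMMENDED = (
    "id", "history", "source", "license", "creator_name",
    "time_coverage_start", "time_coverage_end",
    "geospatial_lat_min", "geospatial_lat_max",
    "geospatial_lon_min", "geospatial_lon_max",
)

# Hypothetical weights: a highly recommended attribute counts twice as much.
GROUPS = {
    "highly_recommended": (HIGHLY_RECOMMENDED, 2.0),
    "recommended": (RECOMMENDED, 1.0),
}


def is_present(attrs, name):
    """An attribute counts only if it exists and is not blank."""
    value = attrs.get(name)
    return value is not None and str(value).strip() != ""


def check_granule(attrs):
    """Return a per-group report plus one overall summary weight in [0, 1]."""
    report = {}
    earned = possible = 0.0
    for group, (names, weight) in GROUPS.items():
        present = [n for n in names if is_present(attrs, n)]
        missing = [n for n in names if n not in present]
        report[group] = {
            "present": present,
            "missing": missing,
            "score": f"{len(present)}/{len(names)}",
        }
        earned += weight * len(present)
        possible += weight * len(names)
    report["summary_weight"] = earned / possible
    return report


# Made-up attributes for illustration; "id" is blank and therefore not counted.
example_attrs = {
    "title": "Example SST granule",
    "summary": "Daily sea surface temperature analysis.",
    "keywords": "sea surface temperature",
    "Conventions": "CF-1.6, ACDD-1.3",
    "id": " ",
}
example_report = check_granule(example_attrs)
```

For a real file, the dictionary could be built with the netCDF4 package, for example `{k: ds.getncattr(k) for k in ds.ncattrs()}` on an open `Dataset`; grouping the missing attributes per category mirrors the hierarchical reports the abstract mentions.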
Evolving Metadata in NASA Earth Science Data Systems
NASA Astrophysics Data System (ADS)
Mitchell, A.; Cechini, M. F.; Walter, J.
2011-12-01
NASA's Earth Observing System (EOS) is a coordinated series of satellites for long term global observations. NASA's Earth Observing System Data and Information System (EOSDIS) is a petabyte-scale archive of environmental data that supports global climate change research by providing end-to-end services from EOS instrument data collection to science data processing to full access to EOS and other earth science data. On a daily basis, the EOSDIS ingests, processes, archives and distributes over 3 terabytes of data from NASA's Earth Science missions representing over 3500 data products spanning a variety of science disciplines. EOSDIS currently comprises 12 discipline specific data centers that are collocated with centers of science discipline expertise. Metadata is used in all aspects of NASA's Earth Science data lifecycle from the initial measurement gathering to the accessing of data products. Missions use metadata in their science data products when describing information such as the instrument/sensor, operational plan, and geographic region. Acting as the curator of the data products, data centers employ metadata for preservation, access and manipulation of data. EOSDIS provides a centralized metadata repository called the Earth Observing System (EOS) ClearingHouse (ECHO) for data discovery and access via a service-oriented-architecture (SOA) between data centers and science data users. ECHO receives inventory metadata from data centers, which generate metadata files that comply with the ECHO Metadata Model. NASA's Earth Science Data and Information System (ESDIS) Project established a Tiger Team to study and make recommendations regarding the adoption of the international metadata standard ISO 19115 in EOSDIS. The result was a technical report recommending an evolution of NASA data systems towards a consistent application of ISO 19115 and related standards including the creation of a NASA-specific convention for core ISO 19115 elements.
Part of NASA's effort to continually evolve its data systems led ECHO to enhance the way it receives inventory metadata from the data centers to allow for multiple metadata formats including ISO 19115. ECHO's metadata model will also be mapped to the NASA-specific convention for ingesting science metadata into the ECHO system. As NASA's new Earth Science missions and data centers are migrating to the ISO 19115 standards, EOSDIS is developing metadata management resources to assist in reading, writing and parsing ISO 19115 compliant metadata. To foster interoperability with other agencies and international partners, NASA is working to ensure that a common ISO 19115 convention is developed, enhancing data sharing capabilities and other data analysis initiatives. NASA is also investigating the use of ISO 19115 standards to encode data quality, lineage and provenance with stored values. A common metadata standard across NASA's Earth Science data systems promotes interoperability, enhances data utilization and removes levels of uncertainty found in data products.
NASA Astrophysics Data System (ADS)
Okaya, D.; Deelman, E.; Maechling, P.; Wong-Barnum, M.; Jordan, T. H.; Meyers, D.
2007-12-01
Large scientific collaborations, such as the SCEC Petascale Cyberfacility for Physics-based Seismic Hazard Analysis (PetaSHA) Project, involve interactions between many scientists who exchange ideas and research results. These groups must organize, manage, and make accessible their community materials of observational data, derivative (research) results, computational products, and community software. The integration of scientific workflows as a paradigm to solve complex computations provides advantages of efficiency, reliability, repeatability, choices, and ease of use. The underlying resource needed for a scientific workflow to function and create discoverable and exchangeable products is the construction, tracking, and preservation of metadata. In the scientific workflow environment there is a two-tier structure of metadata. Workflow-level metadata and provenance describe operational steps, identity of resources, execution status, and product locations and names. Domain-level metadata essentially define the scientific meaning of data, codes and products. To a large degree the metadata at these two levels are separate. However, between these two levels is a subset of metadata that is produced at one level but needed by the other. This crossover metadata suggests that some commonality in metadata handling is needed. SCEC researchers are collaborating with computer scientists at SDSC, the USC Information Sciences Institute, and Carnegie Mellon Univ. in order to perform earthquake science using high-performance computational resources. A primary objective of the "PetaSHA" collaboration is to perform physics-based estimations of strong ground motion associated with real and hypothetical earthquakes located within Southern California. Construction of 3D earth models, earthquake representations, and numerical simulation of seismic waves are key components of these estimations.
Scientific workflows are used to orchestrate the sequences of scientific tasks and to access distributed computational facilities such as the NSF TeraGrid. Different types of metadata are produced and captured within the scientific workflows. One workflow within PetaSHA ("Earthworks") performs a linear sequence of tasks with workflow and seismological metadata preserved. Downstream scientific codes ingest these metadata produced by upstream codes. The seismological metadata uses attribute-value pairing in plain text; an identified need is to use more advanced handling methods. Another workflow system within PetaSHA ("Cybershake") involves several complex workflows in order to perform statistical analysis of ground shaking due to thousands of hypothetical but plausible earthquakes. Metadata management has been challenging due to its construction around a number of legacy scientific codes. We describe difficulties arising in the scientific workflow due to the lack of this metadata and suggest corrective steps, which in some cases include the cultural shift of domain science programmers coding for metadata.
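The attribute-value pairing in plain text mentioned above can be pictured with a short parser of the kind a downstream code might use to ingest metadata written by an upstream code. This is a minimal sketch under assumptions: the `name = value` syntax, the `#` comment marker and the attribute names in the example are hypothetical, since the abstract does not describe the actual file layout used in PetaSHA.

```python
def coerce(value):
    """Turn a text value into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_attribute_values(text):
    """Parse 'name = value' lines into a dict, ignoring blanks and # comments."""
    metadata = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"line {lineno}: expected 'name = value'")
        name, value = (part.strip() for part in line.split("=", 1))
        if not name:
            raise ValueError(f"line {lineno}: missing attribute name")
        metadata[name] = coerce(value)
    return metadata


# Hypothetical metadata block written by an upstream simulation step.
upstream_output = """
# written by the mesh-generation step
velocity_model = example-model
grid_points_x = 200   # number of cells along x
grid_spacing_km = 0.5
"""
parsed = parse_attribute_values(upstream_output)
```

A flat format like this is easy to produce from legacy codes, but it carries no types, units or provenance, which is one way to read the abstract's remark that more advanced handling methods are needed.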
NASA Astrophysics Data System (ADS)
Brook, A.; Cristofani, E.; Vandewal, M.; Matheis, C.; Jonuscheit, J.; Beigang, R.
2012-05-01
The present study proposes a fully integrated, semi-automatic and near real-time mode-operated image processing methodology developed for Frequency-Modulated Continuous-Wave (FMCW) THz images with center frequencies around 100 GHz and 300 GHz. The quality control of aeronautics composite multi-layered materials and structures using Non-Destructive Testing is the main focus of this work. Image processing is applied to the 3-D images to extract useful information. The data is processed by extracting areas of interest. The detected areas are subjected to image analysis for more detailed investigation managed by a spatial model. Finally, the post-processing stage examines and evaluates the spatial accuracy of the extracted information.
Hwang, Mi-Jung; Seol, Geun Hee
2015-01-01
Heel blood sampling is a common but painful procedure for neonates. Automatic lancets have been shown to be more effective, with reduced pain and tissue damage, than manual lancets, but the effects of lancet type on cortical activation have not yet been compared. The study aimed to compare the effects of manual and automatic lancets on cerebral oxygenation and pain of heel blood sampling in 24 premature infants with respiratory distress syndrome. Effectiveness was measured by assessing numbers of pricks and squeezes and duration of heel blood sampling. Pain responses were measured using the premature infant pain profile score, heart rate, and oxygen saturation (SpO2). Regional cerebral oxygen saturation (rScO2) was measured using near-infrared spectroscopy, and cerebral fractional tissue oxygen extraction was calculated from SpO2 and rScO2. Measures of effectiveness were significantly better with automatic than with manual lancing, including fewer heel punctures (P = .009) and squeezes (P < .001) and shorter duration of heel blood sampling (P = .002). rScO2 was significantly higher (P = .013) and cerebral fractional tissue oxygen extraction after puncture significantly lower (P = .040) with automatic lancing. Premature infant pain profile scores during (P = .004) and after (P = .048) puncture were significantly lower in the automatic than in the manual lancet group. Automatic lancets for heel blood sampling in neonates with respiratory distress syndrome significantly reduced pain and enhanced cerebral oxygenation, suggesting that heel blood should be sampled routinely using an automatic lancet.
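The abstract does not state how fractional tissue oxygen extraction is computed from the two saturations. A minimal sketch, assuming the definition commonly used in the near-infrared spectroscopy literature, FTOE = (SpO2 - rScO2) / SpO2, with both values in percent; the function name and the input values are made up for illustration and are not data from the study.

```python
def fractional_tissue_oxygen_extraction(spo2, rsco2):
    """FTOE from arterial (SpO2) and regional cerebral (rScO2) saturation, in percent.

    Assumes the commonly used definition (SpO2 - rScO2) / SpO2.
    """
    if spo2 <= 0:
        raise ValueError("SpO2 must be positive")
    return (spo2 - rsco2) / spo2


# Illustrative values only: a higher rScO2 at the same SpO2 gives a lower FTOE.
ftoe_lower_rsco2 = fractional_tissue_oxygen_extraction(95.0, 71.25)
ftoe_higher_rsco2 = fractional_tissue_oxygen_extraction(95.0, 76.0)
```

Under this definition, the reported combination of higher rScO2 and lower extraction with automatic lancing is internally consistent, since FTOE falls as rScO2 rises for a given SpO2.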
Evaluating the privacy properties of telephone metadata.
Mayer, Jonathan; Mutchler, Patrick; Mitchell, John C
2016-05-17
Since 2013, a stream of disclosures has prompted reconsideration of surveillance law and policy. One of the most controversial principles, both in the United States and abroad, is that communications metadata receives substantially less protection than communications content. Several nations currently collect telephone metadata in bulk, including on their own citizens. In this paper, we attempt to shed light on the privacy properties of telephone metadata. Using a crowdsourcing methodology, we demonstrate that telephone metadata is densely interconnected, can trivially be reidentified, and can be used to draw sensitive inferences.
Studies of Big Data metadata segmentation between relational and non-relational databases
NASA Astrophysics Data System (ADS)
Golosova, M. V.; Grigorieva, M. A.; Klimentov, A. A.; Ryabinkin, E. A.; Dimitrov, G.; Potekhin, M.
2015-12-01
In recent years the concepts of Big Data have become well established in IT. Systems managing large data volumes produce metadata that describe data and workflows. These metadata are used to obtain information about the current system state and for statistical and trend analysis of the processes these systems drive. Over time, the amount of stored metadata can grow dramatically. In this article we present our studies to demonstrate how metadata storage scalability and performance can be improved by using a hybrid RDBMS/NoSQL architecture.
Evaluating the privacy properties of telephone metadata
Mayer, Jonathan; Mutchler, Patrick; Mitchell, John C.
2016-01-01
Since 2013, a stream of disclosures has prompted reconsideration of surveillance law and policy. One of the most controversial principles, both in the United States and abroad, is that communications metadata receives substantially less protection than communications content. Several nations currently collect telephone metadata in bulk, including on their own citizens. In this paper, we attempt to shed light on the privacy properties of telephone metadata. Using a crowdsourcing methodology, we demonstrate that telephone metadata is densely interconnected, can trivially be reidentified, and can be used to draw sensitive inferences. PMID:27185922
Incorporating ISO Metadata Using HDF Product Designer
NASA Technical Reports Server (NTRS)
Jelenak, Aleksandar; Kozimor, John; Habermann, Ted
2016-01-01
The need to store increasing amounts of metadata of various complexity in HDF5 files is rapidly outgrowing the capabilities of the Earth science metadata conventions currently in use. Until now, data producers did not have much choice but to come up with ad hoc solutions to this challenge. Such solutions, in turn, pose a wide range of issues for data managers, distributors, and, ultimately, data users. The HDF Group is experimenting with a novel approach of using ISO 19115 metadata objects as a catch-all container for all the metadata that cannot be fitted into the current Earth science data conventions. This presentation will showcase how the HDF Product Designer software can be utilized to help data producers include various ISO metadata objects in their products.
FacetGist: Collective Extraction of Document Facets in Large Technical Corpora.
Siddiqui, Tarique; Ren, Xiang; Parameswaran, Aditya; Han, Jiawei
2016-10-01
Given the large volume of technical documents available, it is crucial to automatically organize and categorize these documents to be able to understand and extract value from them. Towards this end, we introduce a new research problem called Facet Extraction. Given a collection of technical documents, the goal of Facet Extraction is to automatically label each document with a set of concepts for the key facets (e.g., application, technique, evaluation metrics, and dataset) that people may be interested in. Facet Extraction has numerous applications, including document summarization, literature search, patent search and business intelligence. The major challenge in performing Facet Extraction arises from multiple sources: concept extraction, concept to facet matching, and facet disambiguation. To tackle these challenges, we develop FacetGist, a framework for facet extraction. Facet Extraction involves constructing a graph-based heterogeneous network to capture information available across multiple local sentence-level features, as well as global context features. We then formulate a joint optimization problem, and propose an efficient algorithm for graph-based label propagation to estimate the facet of each concept mention. Experimental results on technical corpora from two domains demonstrate that Facet Extraction can lead to an improvement of over 25% in both precision and recall over competing schemes.
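Graph-based label propagation, which the abstract names as the way facets are estimated, can be illustrated with a generic textbook variant. This sketch is not the FacetGist algorithm: it uses a plain homogeneous weighted graph instead of a heterogeneous network, it does not solve the joint optimization problem, and the concept names, edge weights and seed facets are hypothetical. Seed nodes keep their facet, and every other node repeatedly takes the weighted average of its neighbours' facet scores.

```python
def propagate_labels(edges, seeds, iterations=20):
    """Spread seed labels over an undirected weighted graph.

    edges: iterable of (node_a, node_b, weight) with weight > 0
    seeds: dict mapping some nodes to their known label
    Returns a dict mapping every node to its highest-scoring label.
    """
    labels = sorted(set(seeds.values()))
    neighbors = {}
    for a, b, weight in edges:
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight
    for node in seeds:
        neighbors.setdefault(node, {})

    # Seeds start with full score on their label; all other nodes start at zero.
    scores = {
        node: {label: 1.0 if seeds.get(node) == label else 0.0 for label in labels}
        for node in neighbors
    }
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            total = sum(nbrs.values())
            if node in seeds or total == 0:
                updated[node] = scores[node]  # clamp seeds, skip isolated nodes
                continue
            updated[node] = {
                label: sum(w * scores[m][label] for m, w in nbrs.items()) / total
                for label in labels
            }
        scores = updated
    return {node: max(labels, key=s.get) for node, s in scores.items()}


# Hypothetical concept mentions linked by co-occurrence strength.
concept_edges = [
    ("SVM", "kernel method", 1.0),
    ("kernel method", "benchmark corpus", 0.1),
    ("benchmark corpus", "ImageNet", 1.0),
]
facets = propagate_labels(concept_edges, {"SVM": "technique", "ImageNet": "dataset"})
```

In this toy graph the weak middle edge keeps the two facets from bleeding into each other, so each unlabeled concept follows its strongly connected seed.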
FacetGist: Collective Extraction of Document Facets in Large Technical Corpora
Siddiqui, Tarique; Ren, Xiang; Parameswaran, Aditya; Han, Jiawei
2017-01-01
Given the large volume of technical documents available, it is crucial to automatically organize and categorize these documents to be able to understand and extract value from them. Towards this end, we introduce a new research problem called Facet Extraction. Given a collection of technical documents, the goal of Facet Extraction is to automatically label each document with a set of concepts for the key facets (e.g., application, technique, evaluation metrics, and dataset) that people may be interested in. Facet Extraction has numerous applications, including document summarization, literature search, patent search and business intelligence. The major challenge in performing Facet Extraction arises from multiple sources: concept extraction, concept to facet matching, and facet disambiguation. To tackle these challenges, we develop FacetGist, a framework for facet extraction. Facet Extraction involves constructing a graph-based heterogeneous network to capture information available across multiple local sentence-level features, as well as global context features. We then formulate a joint optimization problem, and propose an efficient algorithm for graph-based label propagation to estimate the facet of each concept mention. Experimental results on technical corpora from two domains demonstrate that Facet Extraction can lead to an improvement of over 25% in both precision and recall over competing schemes. PMID:28210517
Extraction of linear features on SAR imagery
NASA Astrophysics Data System (ADS)
Liu, Junyi; Li, Deren; Mei, Xin
2006-10-01
Linear features are usually extracted from SAR imagery by a few edge detectors derived from the contrast ratio edge detector with a constant probability of false alarm. The Hough Transform (HT), in turn, is an elegant way of extracting global features such as curve segments from binary edge images. The Randomized Hough Transform can drastically reduce the computation time and memory usage of the HT. However, the Randomized Hough Transform produces a large number of invalid cells during random sampling. In this paper, we propose a new approach to extracting linear features from SAR imagery: an almost automatic algorithm based on edge detection and the Randomized Hough Transform. The improved method makes full use of the directional information of each edge candidate point to solve the invalid accumulation problem. The results of applying the method are in good agreement with the theoretical study, and the main linear features in the SAR imagery were extracted automatically. The method saves storage space and computational time, which shows its effectiveness and applicability.
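The basic Randomized Hough Transform for lines can be sketched as below. This is the plain textbook variant, not the improved method of the paper: it ignores the directional information of edge points, and the function names, bin sizes, sample count and synthetic edge points are assumptions made for illustration. Instead of letting every edge point vote along a whole curve in parameter space, it repeatedly samples two edge points, computes the single line through them in normal form (x cos θ + y sin θ = ρ), and votes for that one cell in a sparse accumulator.

```python
import math
import random
from collections import Counter


def line_cell(p, q, theta_step=1.0, rho_step=1.0):
    """Quantized (theta in degrees, rho) cell of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    dx, dy = x2 - x1, y2 - y1
    # Angle of the line's normal vector (-dy, dx), folded into [0, pi).
    theta = math.atan2(dx, -dy) % math.pi
    rho = x1 * math.cos(theta) + y1 * math.sin(theta)
    t = round(math.degrees(theta) / theta_step)
    if t * theta_step >= 180.0:
        # 180 degrees describes the same line as 0 degrees with rho negated.
        t, rho = 0, -rho
    r = round(rho / rho_step)
    return (t * theta_step, r * rho_step)


def randomized_hough_lines(points, n_samples=2000, top_k=2, seed=0):
    """Return the top_k most voted (cell, votes) pairs from random point pairs."""
    rng = random.Random(seed)
    votes = Counter()  # sparse accumulator: only visited cells use memory
    for _ in range(n_samples):
        p, q = rng.sample(points, 2)
        if p == q:
            continue  # duplicate coordinates define no line
        votes[line_cell(p, q)] += 1
    return votes.most_common(top_k)


# Synthetic edge points on two lines: y = x and y = 20.
edge_points = [(i, i) for i in range(10)] + [(i, 20) for i in range(10)]
peaks = randomized_hough_lines(edge_points)
```

Pairs drawn from two different lines vote for cells that correspond to no real feature; these scattered votes are one way to picture the invalid cells that the abstract's directional constraint is meant to suppress.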
AAlAbdulsalam, Abdulrahman K.; Garvin, Jennifer H.; Redd, Andrew; Carter, Marjorie E.; Sweeny, Carol; Meystre, Stephane M.
2018-01-01
Cancer stage is one of the most important prognostic parameters in most cancer subtypes. The American Joint Committee on Cancer (AJCC) specifies criteria for staging each cancer type based on tumor characteristics (T), lymph node involvement (N), and tumor metastasis (M) known as TNM staging system. Information related to cancer stage is typically recorded in clinical narrative text notes and other informal means of communication in the Electronic Health Record (EHR). As a result, human chart-abstractors (known as certified tumor registrars) have to search through voluminous amounts of text to extract accurate stage information and resolve discordance between different data sources. This study proposes novel applications of natural language processing and machine learning to automatically extract and classify TNM stage mentions from records at the Utah Cancer Registry. Our results indicate that TNM stages can be extracted and classified automatically with high accuracy (extraction sensitivity: 95.5%–98.4% and classification sensitivity: 83.5%–87%). PMID:29888032
AAlAbdulsalam, Abdulrahman K; Garvin, Jennifer H; Redd, Andrew; Carter, Marjorie E; Sweeny, Carol; Meystre, Stephane M
2018-01-01
Cancer stage is one of the most important prognostic parameters in most cancer subtypes. The American Joint Committee on Cancer (AJCC) specifies criteria for staging each cancer type based on tumor characteristics (T), lymph node involvement (N), and tumor metastasis (M) known as TNM staging system. Information related to cancer stage is typically recorded in clinical narrative text notes and other informal means of communication in the Electronic Health Record (EHR). As a result, human chart-abstractors (known as certified tumor registrars) have to search through voluminous amounts of text to extract accurate stage information and resolve discordance between different data sources. This study proposes novel applications of natural language processing and machine learning to automatically extract and classify TNM stage mentions from records at the Utah Cancer Registry. Our results indicate that TNM stages can be extracted and classified automatically with high accuracy (extraction sensitivity: 95.5%-98.4% and classification sensitivity: 83.5%-87%).