Sample records for data

  1. CFSv2 Seasonal Climate Forecasts

    Science.gov Websites

    [Web-page table residue: links to CFSv2 forecast data for the Nino3.4 and Nino4 indices, ensemble members E1-E3, and sea surface height; no further abstract text is recoverable.]

  2. DataHub: Knowledge-based data management for data discovery

    NASA Astrophysics Data System (ADS)

    Handley, Thomas H.; Li, Y. Philip

    1993-08-01

    Currently available database technology is largely designed for business data-processing applications, and seems inadequate for scientific applications. The research described in this paper, the DataHub, will address the issues associated with this shortfall in technology utilization and development. The DataHub development addresses the key scientific data management issues of scientific database models and resource sharing in a geographically distributed, multi-disciplinary science research environment. Thus, the DataHub will be a server between the data suppliers and data consumers to facilitate data exchanges, to assist science data analysis, and to provide a systematic approach for science data management. More specifically, the DataHub's objectives are to provide support for (1) exploratory data analysis (i.e., data-driven analysis); (2) data transformations; (3) data semantics capture and usage; (4) analysis-related knowledge capture and usage; and (5) data discovery, ingestion, and extraction. Applying technologies that range from deductive databases, semantic data models, data discovery, knowledge representation and inferencing, and exploratory data analysis techniques to modern man-machine interfaces, the DataHub will provide a prototype, integrated environment to support research scientists' needs in multiple disciplines (i.e., oceanography, geology, and atmospheric science) while addressing the more general science data management issues. Additionally, the DataHub will provide data management services to exploratory data analysis applications such as LinkWinds and NCSA's XIMAGE.

  3. Automated Data Submission for the Data Center

    NASA Astrophysics Data System (ADS)

    Wright, D.; Beaty, T.; Wei, Y.; Shanafield, H.; Santhana Vannan, S. K.

    2014-12-01

    Data centers struggle with difficulties related to data submission. Data are acquired through many avenues by many people. Many data submission activities involve intensive manual processes. During the submission process, data end up on varied storage devices. The situation can easily become chaotic. Collecting information on the status of pending data sets is arduous. For data providers, the submission process can be inconsistent and confusing. Scientists generally provide data from previous projects, and archival can be a low priority. Incomplete or poor documentation accompanies many data sets. However, complicated questionnaires deter busy data providers. At the ORNL DAAC, we have semi-automated the data set submission process to create a uniform data product and provide a consistent data provider experience. The formalized workflow makes archival faster for the data center and data set submission easier for data providers. Software modules create a flexible, reusable submission package. Formalized data set submission provides several benefits to the data center. A single data upload area provides one point of entry and ensures data are stored in a consistent location. A central dashboard records pending data set submissions in a single table and simplifies reporting. Flexible role management allows team members to readily coordinate and increases efficiency. Data products and metadata become uniform and easily maintained. As data and metadata standards change, modules can be modified or re-written without affecting workflow. While each data center has unique challenges, the data ingestion process is generally the same: get data from the provider, scientist, or project and capture metadata pertinent to that data. The ORNL DAAC data set submission workflow and software modules can be reused entirely or in part by other data centers looking for a data set submission solution. These data set submission modules will be available on NASA's Earthdata Code Collaborative and by request.

  4. KNMI DataLab experiences in serving data-driven innovations

    NASA Astrophysics Data System (ADS)

    Noteboom, Jan Willem; Sluiter, Raymond

    2016-04-01

    Climate change research and innovations in weather forecasting rely more and more on (Big) data. Besides increasing data from traditional sources (such as observation networks, radars and satellites), the use of open data, crowd sourced data and the Internet of Things (IoT) is emerging. To deploy these sources of data optimally in our services and products, KNMI has established a DataLab to serve data-driven innovations in collaboration with public and private sector partners. Big data management, data integration, data analytics including machine learning and data visualization techniques are playing an important role in the DataLab. Cross-domain data-driven innovations that arise from public-private collaborative projects and research programmes can be explored, experimented and/or piloted by the KNMI DataLab. Furthermore, advice can be requested on (Big) data techniques and data sources. In support of collaborative (Big) data science activities, scalable environments are offered with facilities for data integration, data analysis and visualization. In addition, Data Science expertise is provided directly or from a pool of internal and external experts. At the EGU conference, gained experiences and best practices are presented in operating the KNMI DataLab to serve data-driven innovations for weather and climate applications optimally.

  5. NSSDC Data listing

    NASA Technical Reports Server (NTRS)

    1981-01-01

    A convenient reference to space science and supportive data available from the National Space Science Data Center (NSSDC) is provided. Satellite data are organized by NSSDC spacecraft common name. The launch date and NSSDC ID are given. Experiments are listed alphabetically by the principal investigator or team leader. The experiment name and NSSDC ID, data set ID, data set name, data form code, quantity of data, and the time span of the data as verified by NSSDC are shown. Ground-based data, models, computer routines, and composite spacecraft data that are available from NSSDC are listed alphabetically by discipline, source, data type, data content, and data set. The data set name, data form code, quantity of data, and the time span covered where appropriate are included.

  6. National Space Science Data Center (NSSDC) Data Listing

    NASA Technical Reports Server (NTRS)

    1980-01-01

    Satellite and nonsatellite data available from the National Space Science Data Center are listed. The Satellite Data listing includes the spacecraft name, launch date, and an alphabetical list of experiments. The Non-Satellite Data listing contains ground based data, models, computer routines, and composite spacecraft data. The data set name, data form code, quantity of data, and the time span covered are included in the data sets of both listings where appropriate. Geodetic tracking data sets are also included.

  7. NSSDC data listing

    NASA Technical Reports Server (NTRS)

    Horowitz, Richard

    1991-01-01

    The purpose here is to identify, in a highly summarized way, data available from the National Space Science Data Center (NSSDC). Most data are maintained as offline data sets gathered from individual instruments carried on spacecraft; these comprise the Satellite Data Listing. Descriptive names, time spans, data form, and quantity of these data sets are identified in the listing, which is sorted alphabetically, first by spacecraft name and then by the principal investigator's or team leader's last name. Several data sets not associated with individual spaceflight instruments are identified in separate listings following the Satellite Data Listing. These include composite spacecraft data sets, ground based data, models, and computer routines. NSSDC also offers data via special services and systems in a number of areas, including the Astronomical Data Center, Coordinated Data Analysis Workshops, NASA Climate Data System, Pilot Land Data System, and Crustal Dynamics Data Information System.

  8. US GeoData Available Through the Internet

    USGS Publications Warehouse

    2000-01-01

    The U.S. Geological Survey (USGS) offers certain US GeoData data sets through the Internet. They can be retrieved using the World Wide Web or anonymous File Transfer Protocol (FTP). The data bases and their directory paths are as follows:
    * 1:24,000-scale digital line graph data in SDTS format (/pub/data/DLG/24K)
    * 1:2,000,000-scale digital line graph data in SDTS format (/pub/data/DLG/2M)
    * 1:100,000-scale digital line graph data (/pub/data/DLG/100K)
    * 1:100,000-scale land use and land cover data (/pub/data/LULC/100K)
    * 1:250,000-scale land use and land cover data (/pub/data/LULC/250K)
    * 1:24,000-scale digital elevation data (/pub/data/DEM/7.5min)
    * 1-degree digital elevation model data (/pub/data/DEM/250)
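
    As an illustration of the retrieval described above, the sketch below lists and downloads files over anonymous FTP using Python's standard ftplib. The host name is an assumption (the record gives only the directory paths), so treat it as a placeholder.

    ```python
    from ftplib import FTP

    HOST = "edcftp.cr.usgs.gov"      # placeholder host; the record lists only directory paths
    PATH = "/pub/data/DLG/24K"       # 1:24,000-scale digital line graph data (SDTS format)

    def list_and_fetch(host, path, filename=None):
        """Anonymously list a US GeoData directory and optionally download one file."""
        with FTP(host) as ftp:
            ftp.login()                          # anonymous login
            ftp.cwd(path)
            names = ftp.nlst()                   # directory listing
            print(f"{len(names)} entries under {path}")
            if filename is not None:
                with open(filename, "wb") as fh:
                    ftp.retrbinary(f"RETR {filename}", fh.write)

    if __name__ == "__main__":
        list_and_fetch(HOST, PATH)
    ```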

  9. Environmental Data Store: A Web-Based System Providing Management and Exploitation for Multi-Data-Type Environmental Data

    NASA Astrophysics Data System (ADS)

    Ji, P.; Piasecki, M.

    2012-12-01

    With the rapid growth in data volumes, data diversity and data demands from multi-disciplinary research efforts, data management and exploitation increasingly pose significant challenges for the environmental science community. We describe the Environmental Data Store (EDS), a web-based, open-source system we are developing to manage and exploit multi-data-type environmental data. EDS provides repository services for the six fundamental data types that meet the demands of multi-disciplinary environmental research. These data types are: a) Time Series Data, b) GeoSpatial Data, c) Digital Data, d) Ex-Situ Sampling Data, e) Modeling Data, f) Raster Data. Through its data portal, EDS allows efficient consumption of these six types of data placed in a data pool, which is made up of different data nodes corresponding to the different data types, including iRODS, ODM, THREDDS, ESSDB, GeoServer, etc. The EDS data portal offers a unified submission interface for the above data types; provides fully integrated, scalable search across content from the different underlying data systems; and also features mapping, analysis, exporting and visualization through integration with other software. EDS builds on a number of existing systems, follows widely used data standards, and highlights thematic, semantic, and syntactic support for submission and search, in order to advance multi-disciplinary environmental research. The system will be installed and developed at the CrossRoads initiative at the City College of New York.

  10. Alternative Fuels Data Center: Data Downloads

    Science.gov Websites

  11. US GeoData Available Through the Internet

    USGS Publications Warehouse

    2000-01-01

    The U.S. Geological Survey (USGS) offers certain US GeoData data sets through the Internet. They can be retrieved using the World Wide Web or anonymous File Transfer Protocol (FTP). The data bases and their directory paths are as follows:
    * 1:24,000-scale digital line graph data in SDTS format (/pub/data/DLG/24K)
    * 1:2,000,000-scale digital line graph data in SDTS format (/pub/data/DLG/2M)
    * 1:100,000-scale digital line graph data (/pub/data/DLG/100K)
    * 1:100,000-scale land use and land cover data (/pub/data/LULC/100K)
    * 1:250,000-scale land use and land cover data (/pub/data/LULC/250K)
    * 1-degree digital elevation model data (/pub/data/DEM/250)

  12. Datalist: A Value Added Service to Enable Easy Data Selection

    NASA Technical Reports Server (NTRS)

    Li, Angela; Hegde, Mahabaleshwa; Bryant, Keith; Seiler, Edward; Shie, Chung-Lin; Teng, William; Liu, Zhong; Hearty, Thomas; Shen, Suhung; Kempler, Steven

    2016-01-01

    Imagine a user wanting to study hurricane events. This could involve searching and downloading multiple data variables from multiple data sets. The currently available services from the Goddard Earth Sciences Data and Information Services Center (GES DISC) only allow the user to select one data set at a time. The GES DISC started a Data List initiative, in order to enable users to easily select multiple data variables. A Data List is a collection of predefined or user-defined data variables from one or more archived data sets. Target users of Data Lists include science teams, individual science researchers, application users, and educational users. Data Lists are more than just data. Data Lists effectively provide users with a sophisticated integrated data and services package, including metadata, citation, documentation, visualization, and data-specific services, all available from one-stop shopping. Data Lists are created based on the software architecture of the GES DISC Unified User Interface (UUI). The Data List service is completely data-driven, and a Data List is treated just as any other data set. The predefined Data Lists, created by the experienced GES DISC science support team, should save a significant amount of time that users would otherwise have to spend.
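
    Since a Data List is described as a named collection of data variables drawn from one or more archived data sets, a minimal sketch of the idea might look like the following; the field names and product identifiers are illustrative, not the GES DISC schema.

    ```python
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataVariable:
        dataset_id: str      # archived data set the variable comes from
        variable_name: str   # e.g., precipitation rate, wind speed

    @dataclass
    class DataList:
        """A predefined or user-defined collection of variables from one or more data sets."""
        name: str
        variables: List[DataVariable] = field(default_factory=list)

        def datasets(self) -> set:
            """Distinct data sets referenced by this Data List."""
            return {v.dataset_id for v in self.variables}

    # Illustrative 'hurricane events' list spanning two hypothetical data sets.
    hurricane = DataList("hurricane-events", [
        DataVariable("TRMM_3B42", "precipitation"),
        DataVariable("MERRA2_SLV", "surface_wind_speed"),
    ])
    print(hurricane.datasets())
    ```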

  13. DataONE: A Distributed Environmental and Earth Science Data Network Supporting the Full Data Life Cycle

    NASA Astrophysics Data System (ADS)

    Cook, R.; Michener, W.; Vieglais, D.; Budden, A.; Koskela, R.

    2012-04-01

    Addressing grand environmental science challenges requires unprecedented access to easily understood data that cross the breadth of temporal, spatial, and thematic scales. Tools are needed to plan management of the data, discover the relevant data, integrate heterogeneous and diverse data, and convert the data to information and knowledge. Addressing these challenges requires new approaches for the full data life cycle of managing, preserving, sharing, and analyzing data. DataONE (Observation Network for Earth) represents a virtual organization that enables new science and knowledge creation through preservation and access to data about life on Earth and the environment that sustains it. The DataONE approach is to improve data collection and management techniques; facilitate easy, secure, and persistent storage of data; continue to increase access to data and tools that improve data interoperability; disseminate integrated and user-friendly tools for data discovery and novel analyses; work with researchers to build intuitive data exploration and visualization tools; and support communities of practice via education, outreach, and stakeholder engagement.

  14. Challenges in sharing of geospatial data by data custodians in South Africa

    NASA Astrophysics Data System (ADS)

    Kay, Sissiel E.

    2018-05-01

    As most development planning and rendering of public services happens at a place or in a space, geospatial data is required. This geospatial data is best managed through a spatial data infrastructure, a key objective of which is to share geospatial data. The collection and maintenance of geospatial data is expensive and time-consuming, and so the principle of "collect once - use many times" should apply. It is best to obtain the geospatial data from the authoritative source - the appointed data custodian. In South Africa the South African Spatial Data Infrastructure (SASDI) is the means to achieve the requirement for geospatial data sharing. This requires geospatial data sharing to take place between the data custodian and the user. All data custodians are expected to comply with the Spatial Data Infrastructure Act (SDI Act) in terms of geospatial data sharing. Currently data custodians are experiencing challenges with regard to the sharing of geospatial data. This research is based on the current ten data themes selected by the Committee for Spatial Information and the organisations identified as the data custodians for these ten data themes. The objectives are to determine whether the identified data custodians comply with the SDI Act with respect to geospatial data sharing, and if not, what the reasons for this are. Through an international comparative assessment it then determines whether compliance with the SDI Act is too onerous on the data custodians. The research concludes that there are challenges with geospatial data sharing in South Africa and that the data custodians only partially comply with the SDI Act in terms of geospatial data sharing. However, it is shown that the South African legislation is not too onerous on the data custodians.

  15. What Does it Mean to Publish Data in Earth System Science Data Journal?

    NASA Astrophysics Data System (ADS)

    Carlson, D.; Pfeiffenberger, H.

    2015-12-01

    The availability of more than 120 data sets in ESSD represents an unprecedented effort by providers, data centers and ESSD. ESSD data sets and their accompanying data descriptions undergo rigorous review. The data sets reside at any of more than 20 cooperating data centers. The ESSD publication process depends on but challenges the concepts of digital object identification and exacerbates the varied interpretations of the phrase 'data publication'. ESSD adopts the digital object identifier (doi). Key questions apply to doi's and other identifiers. How will persistent identifiers point accurately to distributed or replicated data? How should data centers and data publishers use identifier technologies to ensure authenticity and integrity? Should metadata associated with identifiers distinguish among raw, quality controlled and derived data processing levels, or indicate license or copyright status? Data centers publish data sets according to internal metadata standards but without indicators of quality control. Publication in this sense indicates availability. National data portals compile, serve and publish data products as a service to national researchers and, often, to meet national requirements. Publication in this second case indicates availability in a national context; the data themselves may still reside at separate data centers. Data journals such as ESSD or Scientific Data publish peer-reviewed, quality controlled data sets. These data sets almost always reside at a separate data center - the journal and the center maintain explicit identifier linkages. Data journals add quality to the feature of availability. A single data set processed through these layers will generate three independent doi's but the doi's will provide little information about availability or quality. Could the data world learn from the URL world to consider additions? Suffixes? Could we use our experience with processing levels or data maturity to propose and agree such extensions?

  16. Web-based data acquisition and management system for GOSAT validation Lidar data analysis

    NASA Astrophysics Data System (ADS)

    Okumura, Hiroshi; Takubo, Shoichiro; Kawasaki, Takeru; Abdullah, Indra N.; Uchino, Osamu; Morino, Isamu; Yokota, Tatsuya; Nagai, Tomohiro; Sakai, Tetsu; Maki, Takashi; Arai, Kohei

    2012-11-01

    A web-based data acquisition and management system for GOSAT (Greenhouse gases Observation SATellite) validation lidar data analysis has been developed. The system consists of a data acquisition sub-system (DAS) and a data management sub-system (DMS). The DAS, written in Perl, acquires AMeDAS ground-level meteorological data, rawinsonde upper-air meteorological data, ground-level oxidant data, skyradiometer data, skyview camera images, meteorological satellite IR image data and GOSAT validation lidar data. The DMS, written in PHP, presents satellite-pass dates and all acquired data.
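
    The record notes that the DAS is written in Perl; as a rough Python analogue of the acquire-then-catalog step it describes, one acquisition pass might look like the sketch below. The source URLs are placeholders, not the actual feeds.

    ```python
    import json, time, urllib.request
    from pathlib import Path

    # Placeholder source list -- the real DAS pulls AMeDAS, rawinsonde, oxidant,
    # skyradiometer, sky-view camera, satellite IR, and lidar data; URLs are not
    # given in the record, so these are stand-ins.
    SOURCES = {
        "amedas": "https://example.org/amedas/latest.csv",
        "lidar":  "https://example.org/gosat-validation/lidar/latest.dat",
    }
    ARCHIVE = Path("archive")

    def acquire_once():
        """Fetch each configured source and record what was stored (DAS-like step)."""
        ARCHIVE.mkdir(exist_ok=True)
        catalog = []
        for name, url in SOURCES.items():
            stamp = time.strftime("%Y%m%dT%H%M%S")
            target = ARCHIVE / f"{name}_{stamp}"
            urllib.request.urlretrieve(url, str(target))   # download one data file
            catalog.append({"source": name, "file": str(target), "acquired": stamp})
        (ARCHIVE / "catalog.json").write_text(json.dumps(catalog, indent=2))

    # acquire_once()  # one acquisition pass; a scheduler would call this periodically
    ```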

  17. Understanding the Data Complexity continuum to reduce data management costs and increase data usability through partnerships with the National Centers for Environmental Information

    NASA Astrophysics Data System (ADS)

    Mesick, S.; Weathers, K. W.

    2017-12-01

    Data complexity can be seen as a continuum from complex to simple. The term data complexity refers to data collections that are disorganized, poorly documented, and generally do not follow best data management practices. Complex data collections are challenging and expensive to manage. Simplified collections readily support automated archival processes, enhanced discovery and data access, as well as production of services that make data easier to reuse. In this session, NOAA NCEI scientific data stewards will discuss the data complexity continuum. This talk will explore data simplification concepts, methods, and tools that data managers can employ which may offer more control over data management costs and processes, while achieving policy goals for open data access and ready reuse. Topics will include guidance for data managers on best allocation of limited data management resources; models for partnering with NCEI to accomplish shared data management goals; and will demonstrate through case studies the benefits of investing in documentation, accessibility, and services to increase data value and return on investment.

  18. Definitions of components of the master water data index maintained by the National Water Data Exchange

    USGS Publications Warehouse

    Perry, R.A.; Williams, O.O.

    1982-01-01

    The Master Water Data Index is a computerized data base developed and maintained by the National Water Data Exchange (NAWDEX). The Index contains information about water-data collection sites. This information includes: the identification of new sites for which water data are available, the locations of these sites, the type of site, the data-collection organization, the types of data available, the major water-data parameters for which data are available, the frequency at which these parameters are measured, the period of time for which data are available, and the media in which the data are stored. This document, commonly referred to as the MWDI data dictionary, contains a definition and description of each component of the Master Water Data Index data base. (USGS)
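
    The components listed above map naturally onto a simple record structure; a sketch follows, with field names that are illustrative rather than the actual MWDI data dictionary codes.

    ```python
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class MWDISite:
        """One Master Water Data Index entry, mirroring the components listed above.

        Field names are illustrative; the MWDI data dictionary defines its own codes.
        """
        site_id: str                  # identification of the data-collection site
        location: str                 # where the site is
        site_type: str                # e.g., stream gage, observation well
        collecting_organization: str  # the data-collection organization
        data_types: List[str]         # types of data available
        parameters: List[str]         # major water-data parameters measured
        measurement_frequency: str    # how often the parameters are measured
        period_of_record: str         # time span for which data are available
        storage_media: List[str]      # media in which the data are stored
    ```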

  19. Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) - A New U.S. DOE Data Archive

    NASA Astrophysics Data System (ADS)

    Agarwal, D.; Varadharajan, C.; Cholia, S.; Snavely, C.; Hendrix, V.; Gunter, D.; Riley, W. J.; Jones, M.; Budden, A. E.; Vieglais, D.

    2017-12-01

    The ESS-DIVE archive is a new U.S. Department of Energy (DOE) data archive designed to provide long-term stewardship and use of data from observational, experimental, and modeling activities in the earth and environmental sciences. The ESS-DIVE infrastructure is constructed with the long-term vision of enabling broad access to and usage of the DOE sponsored data stored in the archive. It is designed as a scalable framework that incentivizes data providers to contribute well-structured, high-quality data to the archive and that enables the user community to easily build data processing, synthesis, and analysis capabilities using those data. The key innovations in our design include: (1) application of user-experience research methods to understand the needs of users and data contributors; (2) support for early data archiving during project data QA/QC and before public release; (3) focus on implementation of data standards in collaboration with the community; (4) support for community-built tools for data search, interpretation, analysis, and visualization; (5) a data fusion database to support search of the data extracted from packages submitted and data available in partner data systems such as the Earth System Grid Federation (ESGF) and DataONE; and (6) support for archiving of data packages that are not to be released to the public. ESS-DIVE data contributors will be able to archive and version their data and metadata, obtain data DOIs, search for and access ESS data and metadata via web and programmatic portals, and provide data and metadata in standardized forms. The ESS-DIVE archive and catalog will be federated with other existing catalogs, allowing cross-catalog metadata search and data exchange with existing systems, including DataONE's Metacat search. ESS-DIVE is operated by a multidisciplinary team from Berkeley Lab, the National Center for Ecological Analysis and Synthesis (NCEAS), and DataONE. The primary data copies are hosted at DOE's NERSC supercomputing facility with replicas at DataONE nodes.

  20. Liquid cooled data center design selection

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chainer, Timothy J.; Iyengar, Madhusudan K.; Parida, Pritish R.

    Input data, specifying aspects of a thermal design of a liquid cooled data center, is obtained. The input data includes data indicative of ambient outdoor temperature for a location of the data center; and/or data representing workload power dissipation for the data center. The input data is evaluated to obtain performance of the data center thermal design. The performance includes cooling energy usage; and/or one pertinent temperature associated with the data center. The performance of the data center thermal design is output.
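
    The abstract describes evaluating input data (ambient outdoor temperature and workload power dissipation) to obtain cooling energy usage and a pertinent temperature. The toy evaluation below illustrates that input/output shape only; the coefficients and the linear model are made up and are not the patented method.

    ```python
    from typing import List, Tuple

    def evaluate_design(ambient_c: List[float], workload_kw: List[float],
                        approach_c: float = 5.0, cop: float = 8.0) -> Tuple[float, float]:
        """Toy evaluation of a liquid-cooled data center design.

        ambient_c   -- hourly ambient outdoor temperatures at the site (deg C)
        workload_kw -- hourly IT power dissipation (kW)
        approach_c  -- assumed coolant approach above ambient (illustrative)
        cop         -- assumed coefficient of performance of the cooling loop (illustrative)

        Returns (cooling energy usage in kWh, peak coolant supply temperature in deg C).
        """
        cooling_kwh = sum(p / cop for p in workload_kw)   # energy spent rejecting heat
        peak_supply_c = max(ambient_c) + approach_c       # one pertinent temperature
        return cooling_kwh, peak_supply_c

    energy, peak = evaluate_design(ambient_c=[18.0, 24.0, 31.0], workload_kw=[400.0, 450.0, 500.0])
    print(f"cooling energy: {energy:.1f} kWh, peak supply temperature: {peak:.1f} C")
    ```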

  1. Geodynamics branch data base for main magnetic field analysis

    NASA Technical Reports Server (NTRS)

    Langel, Robert A.; Baldwin, R. T.

    1991-01-01

    The data sets used in geomagnetic field modeling at GSFC are described. Data are measured and obtained from a variety of sources. For clarity, data sets from different sources are categorized and processed separately. The data base is composed of magnetic observatory data, surface data, high-quality aeromagnetic data, high-quality total-intensity marine data, satellite data, and repeat data. These individual data categories are described in detail in a series of notebooks in the Geodynamics Branch, GSFC. This catalog reviews the original data sets, the processing history, and the final data sets available for each individual category of the data base and is to be used as a reference manual for the notebooks. Each data type used in geomagnetic field modeling has varying levels of complexity, requiring specialized processing routines for satellite and observatory data and two general routines for processing aeromagnetic, marine, land survey, and repeat data.

  2. Trends in Planetary Data Analysis. Executive summary of the Planetary Data Workshop

    NASA Technical Reports Server (NTRS)

    Evans, N.

    1984-01-01

    Planetary data include non-imaging remote sensing data, which comprises spectrometric, radiometric, and polarimetric remote sensing observations. Also included are in-situ data, radio/radar data, and Earth-based observations. Also discussed is the development of a planetary data system. A catalog to identify observations will be the initial entry point for all levels of users into the data system. There are seven distinct data support services: encyclopedia, data index, data inventory, browse, search, sample, and acquire. Data systems for planetary science users must provide access to data and must process, store, and display data. Two standards will be incorporated into the planetary data system: a standard communications protocol and the Standard Format Data Unit. The data system configuration must combine the features of a distributed system with those of a centralized system. Fiscal constraints have made prioritization important. Activities include saving previous mission data, planning/cost analysis, and publishing of proceedings.

  3. Data sharing platforms for de-identified data from human clinical trials.

    PubMed

    Huser, Vojtech; Shmueli-Blumberg, Dikla

    2018-04-01

    Data sharing of de-identified individual participant data is being adopted by an increasing number of sponsors of human clinical trials. In addition to standardizing data syntax for shared trial data, semantic integration of various data elements is the focus of several initiatives that define research common data elements. This perspective article, in the first part, compares several data sharing platforms for de-identified clinical research data in terms of their size, policies and supported features. In the second part, we use a case study approach to describe in greater detail one data sharing platform (Data Share from National Institute of Drug Abuse). We present data on the past use of the platform, data formats offered, data de-identification approaches and its use of research common data elements. We conclude with a summary of current and expected future trends that facilitate secondary research use of data from completed human clinical trials.

  4. Development and Implementation of Production Area of Agricultural Product Data Collection System Based on Embedded System

    NASA Astrophysics Data System (ADS)

    Xi, Lei; Guo, Wei; Che, Yinchao; Zhang, Hao; Wang, Qiang; Ma, Xinming

    To solve problems in detecting the origin of agricultural products, this paper presents an embedded data-collection terminal, applies middleware thinking, and provides a reusable long-range two-way data exchange module between business equipment and data acquisition systems. The system is composed of data collection nodes and data center nodes. A data collection node takes the embedded data terminal NetBoxII as its core and consists of a data acquisition interface layer, a controlling information layer and a data exchange layer; it reads data from different front-end acquisition devices and packs the data into TCP to realize data exchange with the data center node over the available physical link (GPRS/CDMA/Ethernet). The data center node consists of a data exchange layer, a data persistence layer, and a business interface layer, which make the collected data durable and provide standardized data to business systems based on the mapping relationship between collected data and business data. Relying on public communications networks, the system establishes a flow of information between the origin-certification scene and the management center, realizing real-time collection, storage and processing of origin-certification data in the certification organization's databases, and meeting the needs of long-range detection of agricultural origin.
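
    A minimal sketch of the TCP exchange between a collection node and the data center node described above might look like the following; the message format, length-prefix framing, host name and port are assumptions, not details from the paper.

    ```python
    import json, socket, time

    DATA_CENTER = ("datacenter.example.org", 9000)   # hypothetical data-center node address

    def send_reading(sensor_id: str, value: float) -> None:
        """Pack one reading and push it to the data-center node over TCP."""
        record = {
            "sensor": sensor_id,
            "value": value,
            "collected_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        }
        payload = json.dumps(record).encode("utf-8")
        with socket.create_connection(DATA_CENTER, timeout=10) as conn:
            conn.sendall(len(payload).to_bytes(4, "big"))   # simple length-prefix framing
            conn.sendall(payload)

    # A collection node would call this for each front-end acquisition device it reads:
    # send_reading("soil-moisture-01", 0.23)
    ```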

  5. The creation, management, and use of data quality information for life cycle assessment.

    PubMed

    Edelen, Ashley; Ingwersen, Wesley W

    2018-04-01

    Despite growing access to data, questions of "best fit" data and the appropriate use of results in supporting decision making still plague the life cycle assessment (LCA) community. This discussion paper addresses revisions to assessing data quality captured in a new US Environmental Protection Agency guidance document as well as additional recommendations on data quality creation, management, and use in LCA databases and studies. Existing data quality systems and approaches in LCA were reviewed and tested. The evaluations resulted in a revision to a commonly used pedigree matrix, for which flow and process level data quality indicators are described, more clarity for scoring criteria, and further guidance on interpretation are given. Increased training for practitioners on data quality application and its limits are recommended. A multi-faceted approach to data quality assessment utilizing the pedigree method alongside uncertainty analysis in result interpretation is recommended. A method of data quality score aggregation is proposed and recommendations for usage of data quality scores in existing data are made to enable improved use of data quality scores in LCA results interpretation. Roles for data generators, data repositories, and data users are described in LCA data quality management. Guidance is provided on using data with data quality scores from other systems alongside data with scores from the new system. The new pedigree matrix and recommended data quality aggregation procedure can now be implemented in openLCA software. Additional ways in which data quality assessment might be improved and expanded are described. Interoperability efforts in LCA data should focus on descriptors to enable user scoring of data quality rather than translation of existing scores. Developing and using data quality indicators for additional dimensions of LCA data, and automation of data quality scoring through metadata extraction and comparison to goal and scope are needed.
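
    The paper proposes a method of aggregating data quality scores, but the abstract does not spell it out; purely as an illustration, one common convention is a contribution-weighted mean of pedigree scores per indicator, sketched below.

    ```python
    from typing import Dict, List, Tuple

    # Each flow: (contribution weight to the result, pedigree scores per indicator).
    # Scores follow the usual pedigree convention of 1 (best) to 5 (worst).
    Flow = Tuple[float, Dict[str, int]]

    def aggregate_scores(flows: List[Flow]) -> Dict[str, float]:
        """Contribution-weighted mean score per indicator (illustrative, not the paper's exact rule)."""
        totals: Dict[str, float] = {}
        weight_sum = sum(w for w, _ in flows)
        for weight, scores in flows:
            for indicator, score in scores.items():
                totals[indicator] = totals.get(indicator, 0.0) + weight * score
        return {k: v / weight_sum for k, v in totals.items()}

    flows = [
        (0.7, {"reliability": 2, "temporal": 3, "geographical": 1}),
        (0.3, {"reliability": 4, "temporal": 2, "geographical": 3}),
    ]
    print(aggregate_scores(flows))   # {'reliability': 2.6, 'temporal': 2.7, 'geographical': 1.6}
    ```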

  6. Early Citability of Data vs Peer-Review like Data Publishing Procedures

    NASA Astrophysics Data System (ADS)

    Stockhause, Martina; Höck, Heinke; Toussaint, Frank; Lautenschlager, Michael

    2014-05-01

    The World Data Center for Climate (WDCC), hosted at the German Climate Computing Center (DKRZ), was one of the first data centers to establish a peer-review-like data publication procedure resulting in DataCite DOIs. Data in the long-term archive (LTA) is diligently reviewed by data managers and data authors to ensure high quality and wide reusability of the published data. This traditional data publication procedure for LTA data bearing DOIs is very time consuming, especially for WDCC's high data volumes of climate model data on the order of multiple TBytes. Data is shared with project members and selected scientists months before the data is long-term archived. The scientific community analyses and thus reviews the data, leading to data quality improvements. Scientists wish to cite these unstable data in scientific publications before the long-term archiving and the thorough data review process are finalized. A concept for early preprint DOIs for shared but not yet long-term archived data is presented. Requirements on data documentation, persistence and quality, and use cases for preprint DOIs within the data life cycle are discussed, as well as questions of how to document the differences between the two DOI types and how to relate them to each other, with the recommendation to use LTA DOIs in citations. WDCC wants to offer an additional user service for early citations of data of basic quality without compromising the LTA DOIs, i.e. WDCC's standard DOIs, as a trustworthy indicator of high-quality data. Referencing links: World Data Center for Climate (WDCC): http://www.wdc-climate.de German Climate Computing Center (DKRZ): http://www.dkrz.de DataCite: http://datacite.org

  7. Data Citation Concept for CMIP6

    NASA Astrophysics Data System (ADS)

    Stockhause, M.; Toussaint, F.; Lautenschlager, M.; Lawrence, B.

    2015-12-01

    There is a broad consensus among data centers and scientific publishers on Force 11's 'Joint Declaration of Data Citation Principles'. Putting these principles into operation is not always straightforward. The focus for CMIP6 data citations lies on the citation of data created by others and used in an analysis underlying the article. For this source data, usually no article by the data creators is available ('stand-alone data publication'). The planned data citation granularities are model data (data collections containing all datasets provided for the project by a single model) and experiment data (data collections containing all datasets for a scientific experiment run by a single model). In the case of large international projects or activities like CMIP, the data is commonly stored and disseminated by multiple repositories in a federated data infrastructure such as the Earth System Grid Federation (ESGF). The individual repositories are subject to different institutional and national policies. A Data Management Plan (DMP) will define a certain standard for the repositories including data handling procedures. Another aspect of CMIP data, relevant for data citations, is its dynamic nature. For such large data collections, datasets are added, revised and retracted for years, before the data collection becomes stable for a data citation entity including all model or simulation data. Thus, a critical issue for ESGF is data consistency, requiring thorough dataset versioning to enable the identification of the data collection in the cited version. Currently, the ESGF is designed for accessing the latest dataset versions. Data citation introduces the necessity to support older and retracted dataset versions by storing metadata even beyond data availability (data unpublished in ESGF). Apart from ESGF, other infrastructure components exist for CMIP, which provide information that has to be connected to the CMIP6 data, e.g. ES-DOC providing information on models and simulations and the IPCC Data Distribution Centre (DDC) storing a subset of data together with available metadata (ES-DOC) for the long-term reuse of the interdisciplinary community. Other connections exist to standard project vocabularies, to personal identifiers (e.g. ORCID), or to data products (including provenance information).

  8. Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science.

    PubMed

    Neff, Gina; Tanweer, Anissa; Fiore-Gartland, Brittany; Osburn, Laura

    2017-06-01

    What would data science look like if its key critics were engaged to help improve it, and how might critiques of data science improve with an approach that considers the day-to-day practices of data science? This article argues for scholars to bridge the conversations that seek to critique data science and those that seek to advance data science practice to identify and create the social and organizational arrangements necessary for a more ethical data science. We summarize four critiques that are commonly made in critical data studies: data are inherently interpretive, data are inextricable from context, data are mediated through the sociomaterial arrangements that produce them, and data serve as a medium for the negotiation and communication of values. We present qualitative research with academic data scientists, "data for good" projects, and specialized cross-disciplinary engineering teams to show evidence of these critiques in the day-to-day experience of data scientists as they acknowledge and grapple with the complexities of their work. Using ethnographic vignettes from two large multiresearcher field sites, we develop a set of concepts for analyzing and advancing the practice of data science and improving critical data studies, including (1) communication is central to the data science endeavor; (2) making sense of data is a collective process; (3) data are starting, not end points, and (4) data are sets of stories. We conclude with two calls to action for researchers and practitioners in data science and critical data studies alike. First, creating opportunities for bringing social scientific and humanistic expertise into data science practice simultaneously will advance both data science and critical data studies. Second, practitioners should leverage the insights from critical data studies to build new kinds of organizational arrangements, which we argue will help advance a more ethical data science. Engaging the insights of critical data studies will improve data science. Careful attention to the practices of data science will improve scholarly critiques. Genuine collaborative conversations between these different communities will help push for more ethical, and better, ways of knowing in increasingly datum-saturated societies.

  9. Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science

    PubMed Central

    Neff, Gina; Tanweer, Anissa; Fiore-Gartland, Brittany; Osburn, Laura

    2017-01-01

    Abstract What would data science look like if its key critics were engaged to help improve it, and how might critiques of data science improve with an approach that considers the day-to-day practices of data science? This article argues for scholars to bridge the conversations that seek to critique data science and those that seek to advance data science practice to identify and create the social and organizational arrangements necessary for a more ethical data science. We summarize four critiques that are commonly made in critical data studies: data are inherently interpretive, data are inextricable from context, data are mediated through the sociomaterial arrangements that produce them, and data serve as a medium for the negotiation and communication of values. We present qualitative research with academic data scientists, “data for good” projects, and specialized cross-disciplinary engineering teams to show evidence of these critiques in the day-to-day experience of data scientists as they acknowledge and grapple with the complexities of their work. Using ethnographic vignettes from two large multiresearcher field sites, we develop a set of concepts for analyzing and advancing the practice of data science and improving critical data studies, including (1) communication is central to the data science endeavor; (2) making sense of data is a collective process; (3) data are starting, not end points, and (4) data are sets of stories. We conclude with two calls to action for researchers and practitioners in data science and critical data studies alike. First, creating opportunities for bringing social scientific and humanistic expertise into data science practice simultaneously will advance both data science and critical data studies. Second, practitioners should leverage the insights from critical data studies to build new kinds of organizational arrangements, which we argue will help advance a more ethical data science. Engaging the insights of critical data studies will improve data science. Careful attention to the practices of data science will improve scholarly critiques. Genuine collaborative conversations between these different communities will help push for more ethical, and better, ways of knowing in increasingly datum-saturated societies. PMID:28632445

  10. ACTS data center

    NASA Technical Reports Server (NTRS)

    Syed, Ali; Vogel, Wolfhard J.

    1993-01-01

    Viewgraphs on ACTS Data Center status report are included. Topics covered include: ACTS Data Center Functions; data flow overview; PPD flow; RAW data flow; data compression; PPD distribution; RAW Data Archival; PPD Audit; and data analysis.

  11. Digital Archive Issues from the Perspective of an Earth Science Data Producer

    NASA Technical Reports Server (NTRS)

    Barkstrom, Bruce R.

    2004-01-01

    Contents include the following: Introduction. A Producer Perspective on Earth Science Data. Data Producers as Members of a Scientific Community. Some Unique Characteristics of Scientific Data. Spatial and Temporal Sampling for Earth (or Space) Science Data. The Influence of the Data Production System Architecture. The Spatial and Temporal Structures Underlying Earth Science Data. Earth Science Data File (or Relation) Schemas. Data Producer Configuration Management Complexities. The Topology of Earth Science Data Inventories. Some Thoughts on the User Perspective. Science Data User Communities. Spatial and Temporal Structure Needs of Different Users. User Spatial Objects. Data Search Services. Inventory Search. Parameter (Keyword) Search. Metadata Searches. Documentation Search. Secondary Index Search. Print Technology and Hypertext. Inter-Data Collection Configuration Management Issues. An Archive View. Producer Data Ingest and Production. User Data Searching and Distribution. Subsetting and Supersetting. Semantic Requirements for Data Interchange. Tentative Conclusions. An Object Oriented View of Archive Information Evolution. Scientific Data Archival Issues. A Perspective on the Future of Digital Archives for Scientific Data. References Index for this paper.

  12. Data citation in climate sciences: Improvements in CMIP6 compared to CMIP5

    NASA Astrophysics Data System (ADS)

    Stockhause, M.; Lautenschlager, M.

    2017-12-01

    Within CMIP5 (Coupled Model Intercomparison Project Phase 5) the citation of the data was not possible prior to its long-term archival in the IPCC Data Distribution Centre (DDC). The Reference Data Archive for AR5 (Assessment Report 5) was built up after the submission deadline for part 1 of the AR5. This was too late for many scientific articles. But even the AR5 data in the IPCC DDC is rarely cited in the literature, in spite of annual download volumes between one and three PBytes. On the other hand, the request for a citation possibility for the evolving CMIP6 data prior to long-term archival came from the CMIP6 data providers. The additional provision of data citations for the project input4MIPs (input data for CMIP6) could raise scientists' awareness of the discrepancy between the readiness to cite data and the desire to be cited and get credit. The CMIP6 Citation Service is a pragmatic approach built on existing services and services under development, such as ESGF (Earth System Grid Federation) as the data infrastructure component, DataCite as the DOI registration agency, and Scholix services for tracking data usage information. Other principles followed to overcome barriers to data citation are: collecting data and literature references in the data citation metadata to enable data-data and data-literature interlinking; making data citation information visible in the ESGF data portals (a low barrier to accessing data citation information); and providing data usage information in the literature for the data providers, data node managers and their funders (requested by some ESGF data node managers). The CMIP6 Citation Service implements only the credit part of the RDA WGDC recommendation for the citation of dynamic data. The second part, the identification of the data subset underlying an article, is planned for CMIP7 as a data cart approach comprising multiple pre-defined CMIP6 DataCite DOIs. Additional policies on long-term data availability are required. References: M. Stockhause and M. Lautenschlager (2017). CMIP6 Data Citation of Evolving Data. Data Science Journal, 16, p. 30. doi:10.5334/dsj-2017-030. https://doi.org/10.5334/dsj-2017-030. http://cmip6cite.wdc-climate.de

  13. IEDA: Making Small Data BIG Through Interdisciplinary Partnerships Among Long-tail Domains

    NASA Astrophysics Data System (ADS)

    Lehnert, K. A.; Carbotte, S. M.; Arko, R. A.; Ferrini, V. L.; Hsu, L.; Song, L.; Ghiorso, M. S.; Walker, D. J.

    2014-12-01

    The Big Data world in the Earth Sciences so far exists primarily for disciplines that generate massive volumes of observational or computed data using large-scale, shared instrumentation such as global sensor networks, satellites, or high-performance computing facilities. These data are typically managed and curated by well-supported community data facilities that also provide the tools for exploring the data through visualization or statistical analysis. In many other domains, especially those where data are primarily acquired by individual investigators or small teams (known as 'Long-tail data'), data are poorly shared and integrated, lacking a community-based data infrastructure that ensures persistent access, quality control, standardization, and integration of data, as well as appropriate tools to fully explore and mine the data within the context of broader Earth Science datasets. IEDA (Integrated Earth Data Applications, www.iedadata.org) is a data facility funded by the US NSF to develop and operate data services that support data stewardship throughout the full life cycle of observational data in the solid earth sciences, with a focus on the data management needs of individual researchers. IEDA builds on a strong foundation of mature disciplinary data systems for marine geology and geophysics, geochemistry, and geochronology. These systems have dramatically advanced data resources in those long-tail Earth science domains. IEDA has strengthened these resources by establishing a consolidated, enterprise-grade infrastructure that is shared by the domain-specific data systems, and implementing joint data curation and data publication services that follow community standards. In recent years, other domain-specific data efforts have partnered with IEDA to take advantage of this infrastructure and improve data services to their respective communities with formal data publication, long-term preservation of data holdings, and better sustainability. IEDA hopes to foster such partnerships with streamlined data services, including user-friendly, single-point interfaces for data submission, discovery, and access across the partner systems to support interdisciplinary science.

  14. Data governance in predictive toxicology: A review.

    PubMed

    Fu, Xin; Wojak, Anna; Neagu, Daniel; Ridley, Mick; Travis, Kim

    2011-07-13

    Due to recent advances in data storage and sharing for further data processing in predictive toxicology, there is an increasing need for flexible data representations, secure and consistent data curation and automated data quality checking. Toxicity prediction involves multidisciplinary data. There are hundreds of collections of chemical, biological and toxicological data that are widely dispersed, mostly in the open literature, professional research bodies and commercial companies. In order to better manage and make full use of such large amount of toxicity data, there is a trend to develop functionalities aiming towards data governance in predictive toxicology to formalise a set of processes to guarantee high data quality and better data management. In this paper, data quality mainly refers in a data storage sense (e.g. accuracy, completeness and integrity) and not in a toxicological sense (e.g. the quality of experimental results). This paper reviews seven widely used predictive toxicology data sources and applications, with a particular focus on their data governance aspects, including: data accuracy, data completeness, data integrity, metadata and its management, data availability and data authorisation. This review reveals the current problems (e.g. lack of systematic and standard measures of data quality) and desirable needs (e.g. better management and further use of captured metadata and the development of flexible multi-level user access authorisation schemas) of predictive toxicology data sources development. The analytical results will help to address a significant gap in toxicology data quality assessment and lead to the development of novel frameworks for predictive toxicology data and model governance. While the discussed public data sources are well developed, there nevertheless remain some gaps in the development of a data governance framework to support predictive toxicology. In this paper, data governance is identified as the new challenge in predictive toxicology, and a good use of it may provide a promising framework for developing high quality and easy accessible toxicity data repositories. This paper also identifies important research directions that require further investigation in this area.

  15. Data governance in predictive toxicology: A review

    PubMed Central

    2011-01-01

    Background Due to recent advances in data storage and sharing for further data processing in predictive toxicology, there is an increasing need for flexible data representations, secure and consistent data curation and automated data quality checking. Toxicity prediction involves multidisciplinary data. There are hundreds of collections of chemical, biological and toxicological data that are widely dispersed, mostly in the open literature, professional research bodies and commercial companies. In order to better manage and make full use of such large amount of toxicity data, there is a trend to develop functionalities aiming towards data governance in predictive toxicology to formalise a set of processes to guarantee high data quality and better data management. In this paper, data quality mainly refers in a data storage sense (e.g. accuracy, completeness and integrity) and not in a toxicological sense (e.g. the quality of experimental results). Results This paper reviews seven widely used predictive toxicology data sources and applications, with a particular focus on their data governance aspects, including: data accuracy, data completeness, data integrity, metadata and its management, data availability and data authorisation. This review reveals the current problems (e.g. lack of systematic and standard measures of data quality) and desirable needs (e.g. better management and further use of captured metadata and the development of flexible multi-level user access authorisation schemas) of predictive toxicology data sources development. The analytical results will help to address a significant gap in toxicology data quality assessment and lead to the development of novel frameworks for predictive toxicology data and model governance. Conclusions While the discussed public data sources are well developed, there nevertheless remain some gaps in the development of a data governance framework to support predictive toxicology. In this paper, data governance is identified as the new challenge in predictive toxicology, and a good use of it may provide a promising framework for developing high quality and easy accessible toxicity data repositories. This paper also identifies important research directions that require further investigation in this area. PMID:21752279

  16. Data Overview: Overview of an Epidemic

    MedlinePlus

  17. Data Resources | Geospatial Data Science | NREL

    Science.gov Websites

    These data sets, covering a variety of renewable energy technologies, are designed to be used in GIS software applications: Biomass Data, Geothermal Data, Hydrogen Data, Marine and Hydrokinetic Data, Solar Data, and Wind Data.

  18. Analysis Resistant Cipher Method and Apparatus

    NASA Technical Reports Server (NTRS)

    Oakley, Ernest C. (Inventor)

    2009-01-01

    A system for encoding and decoding data words including an anti-analysis encoder unit for receiving an original plaintext and producing a recoded data, a data compression unit for receiving the recoded data and producing a compressed recoded data, and an encryption unit for receiving the compressed recoded data and producing an encrypted data. The recoded data has an increased non-correlatable data redundancy compared with the original plaintext in order to mask the statistical distribution of characters in the plaintext data. The system of the present invention further includes a decryption unit for receiving the encrypted data and producing a decrypted data, a data decompression unit for receiving the decrypted data and producing an uncompressed recoded data, and an anti-analysis decoder unit for receiving the uncompressed recoded data and producing a recovered plaintext that corresponds with the original plaintext.
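
    The pipeline below mirrors the stages named in the abstract (recode to mask character statistics, compress, encrypt, then reverse the steps), but the specific recoding (random byte masking) and the hash-based XOR keystream are illustrative stand-ins, not the patented method, and not production cryptography.

    ```python
    import hashlib, os, zlib

    def _keystream(key: bytes, n: int) -> bytes:
        """Illustrative keystream from counter-mode hashing -- a stand-in, not real crypto."""
        out, counter = b"", 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def recode(plaintext: bytes) -> bytes:
        """Mask character statistics: each byte becomes the pair (random r, byte XOR r)."""
        masked = bytearray()
        for b in plaintext:
            r = os.urandom(1)[0]
            masked += bytes([r, b ^ r])
        return bytes(masked)

    def unrecode(recoded: bytes) -> bytes:
        return bytes(recoded[i + 1] ^ recoded[i] for i in range(0, len(recoded), 2))

    def encode(plaintext: bytes, key: bytes) -> bytes:
        compressed = zlib.compress(recode(plaintext))          # recode, then compress
        ks = _keystream(key, len(compressed))
        return bytes(a ^ b for a, b in zip(compressed, ks))    # then "encrypt"

    def decode(ciphertext: bytes, key: bytes) -> bytes:
        ks = _keystream(key, len(ciphertext))
        compressed = bytes(a ^ b for a, b in zip(ciphertext, ks))
        return unrecode(zlib.decompress(compressed))

    msg = b"attack at dawn"
    assert decode(encode(msg, b"secret"), b"secret") == msg    # round trip recovers the plaintext
    ```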

  19. Data Model Performance in Data Warehousing

    NASA Astrophysics Data System (ADS)

    Rorimpandey, G. C.; Sangkop, F. I.; Rantung, V. P.; Zwart, J. P.; Liando, O. E. S.; Mewengkang, A.

    2018-02-01

    Data warehouses have become increasingly important in organizations that have large amounts of data. A data warehouse is not a product but part of a solution for decision support in those organizations. The data model is the starting point for designing and developing data warehouse architectures. Thus, the data model needs stable interfaces and must remain consistent over a long period of time. The aim of this research is to determine which data model in data warehousing has the best performance. The research method is descriptive analysis, which has three main tasks: data collection and organization, analysis of data, and interpretation of data. The result of this research, assessed with a statistical analysis method, shows that there is no statistically significant difference among the data models used in data warehousing. Organizations can utilize the four data models proposed here when designing and developing a data warehouse.
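
    One common way to run the kind of comparison reported above is a one-way ANOVA over query timings collected for each data model; the sketch below uses made-up timings and placeholder model names.

    ```python
    from scipy import stats

    # Hypothetical query response times (seconds) for three warehouse data models.
    star_schema = [1.21, 1.35, 1.18, 1.29, 1.33]
    snowflake   = [1.30, 1.27, 1.41, 1.25, 1.36]
    data_vault  = [1.28, 1.39, 1.22, 1.31, 1.27]

    # One-way ANOVA: does mean performance differ across the models?
    f_stat, p_value = stats.f_oneway(star_schema, snowflake, data_vault)
    print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
    # A large p-value (> 0.05) would be consistent with the paper's finding of
    # no statistically significant difference among the data models.
    ```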

  20. Pricing and disseminating customer data with privacy awareness.

    PubMed

    Li, Xiao-Bai; Raghunathan, Srinivasan

    2014-03-01

    Organizations today regularly share their customer data with their partners to gain competitive advantages. They are also often requested or even required by a third party to provide customer data that are deemed sensitive. In these circumstances, organizations are obligated to protect the privacy of the individuals involved while still benefiting from sharing data or meeting the requirement for releasing data. In this study, we analyze the tradeoff between privacy and data utility from the perspective of the data owner. We develop an incentive-compatible mechanism for the data owner to price and disseminate private data. With this mechanism, a data user is motivated to reveal his true purpose of data usage and acquire the data that suit that purpose. Existing economic studies of information privacy primarily consider the interplay between the data owner and the individuals, focusing on problems that occur in the collection of private data. This study, however, examines the privacy issue facing a data owner organization in the distribution of private data to a third party data user when the real purpose of data usage is unclear and the released data could be misused.

  1. Pricing and disseminating customer data with privacy awareness

    PubMed Central

    Raghunathan, Srinivasan

    2014-01-01

    Organizations today regularly share their customer data with their partners to gain competitive advantages. They are also often requested or even required by a third party to provide customer data that are deemed sensitive. In these circumstances, organizations are obligated to protect the privacy of the individuals involved while still benefiting from sharing data or meeting the requirement for releasing data. In this study, we analyze the tradeoff between privacy and data utility from the perspective of the data owner. We develop an incentive-compatible mechanism for the data owner to price and disseminate private data. With this mechanism, a data user is motivated to reveal his true purpose of data usage and acquire the data that suit that purpose. Existing economic studies of information privacy primarily consider the interplay between the data owner and the individuals, focusing on problems that occur in the collection of private data. This study, however, examines the privacy issue facing a data owner organization in the distribution of private data to a third party data user when the real purpose of data usage is unclear and the released data could be misused. PMID:24839337

  2. Data, Analysis, and Visualization | Computational Science | NREL

    Science.gov Websites

    Data, Analysis, and Visualization Data, Analysis, and Visualization Data management, data analysis . At NREL, our data management, data analysis, and scientific visualization capabilities help move the approaches to image analysis and computer vision. Data Management and Big Data Systems, software, and tools

  3. Sharing Responsibility for Data Stewardship Between Scientists and Curators

    NASA Astrophysics Data System (ADS)

    Hedstrom, M. L.

    2012-12-01

    Data stewardship is becoming increasingly important to support accurate conclusions from new forms of data, integration of and computation across heterogeneous data types, interactions between models and data, replication of results, data governance and long-term archiving. In addition to increasing recognition of the importance of data management, data science, and data curation by US and international scientific agencies, the National Academies of Science Board on Research Data and Information is sponsoring a study on Data Curation Education and Workforce Issues. Effective data stewardship requires a distributed effort among scientists who produce data, IT staff and/or vendors who provide data storage and computational facilities and services, and curators who enhance data quality, manage data governance, provide access to third parties, and assume responsibility for long-term archiving of data. The expertise necessary for scientific data management includes a mix of knowledge of the scientific domain; an understanding of domain data requirements, standards, ontologies and analytical methods; facility with leading edge information technology; and knowledge of data governance, standards, and best practices for long-term preservation and access that rarely are found in a single individual. Rather than developing data science and data curation as new and distinct occupations, this paper examines the set of tasks required for data stewardship. The paper proposes an alternative model that embeds data stewardship in scientific workflows and coordinates hand-offs between instruments, repositories, analytical processing, publishers, distributors, and archives. This model forms the basis for defining knowledge and skill requirements for specific actors in the processes required for data stewardship and the corresponding educational and training needs.

  4. NSSDC data listing

    NASA Technical Reports Server (NTRS)

    Horowitz, Richard; King, Joseph H.

    1990-01-01

    In a highly summarized way, data available from the National Space Science Data Center (NSSDC) is identified. Most data are offline data sets (on magnetic tape or as film/print products of various sizes) from individual instruments carried on spacecraft; these compose the Satellite Data Listing. Descriptive names, time spans, data form, and quantity of these data sets are identified in the listing, which is sorted alphabetically-first by spacecraft name and then by the principal investigator's or team leader's last name. Several data sets held at NSSDC, not associated with individual spaceflight instruments, are identified in separate listings following the Satellite Data Listing. These data sets make up the Supplementary Data Listings and include composite spacecraft data sets, ground-based data, models, and computer routines. The identifiers used in the Supplementary Data Listings were created by NSSDC and are explained in the pages preceding the listings. Data set form codes are listed. NSSDC offers primarily archival, retrieval, replication, and dissemination services associated with the data sets discussed in the two major listings identified above. NSSDC also provides documentation which enables the data recipient to use the data received. NSSDC is working toward expanding presently limited capabilities for data subsetting and for promotion of data files to online residence for user downloading. NSSDC data holdings span the range of scientific disciplines in which NASA is involved, and include astrophysics, lunar and planetary science, solar physics, space plasma physics, and Earth science. In addition to the functions mentioned above, NSSDC offers data via special services and systems in a number of areas, including Astronomical Data Center (ADC), Coordinated Data Analysis Workshops (CDAWs), NASA Climate Data System (NCDS), Pilot Land Data System (PLDS), and Crustal Dynamics Data Information System (CDDIS). Furthermore, NSSDC has a no-password account on its SPAN/Telenet-accessible VAX through which the NASA Master Directory and selected online data bases are accessible and through which any data described here may be ordered. Astrophysics data support by NSSDC is not limited to the ADC. Each of these special services/systems is described briefly.

  5. Big Data Analytics in Medicine and Healthcare.

    PubMed

    Ristevski, Blagoj; Chen, Ming

    2018-05-10

    This paper surveys big data, highlighting big data analytics in medicine and healthcare. The big data characteristics of value, volume, velocity, variety, veracity and variability are described. Big data analytics in medicine and healthcare covers the integration and analysis of large amounts of complex heterogeneous data such as various -omics data (genomics, epigenomics, transcriptomics, proteomics, metabolomics, interactomics, pharmacogenomics, diseasomics), biomedical data and electronic health records data. We underline the challenging issues of big data privacy and security. With regard to big data characteristics, some directions for choosing suitable and promising open-source distributed data processing software platforms are given.

  6. CIRSS vertical data integration, San Bernardino County study phases 1-A, 1-B

    NASA Technical Reports Server (NTRS)

    Christenson, J.; Michel, R. (Principal Investigator)

    1981-01-01

    User needs, data types, data automation, and preliminary applications are described for an effort to assemble a single data base for San Bernardino County from data bases which exist at several administrative levels. Each of the data bases used was registered and converted to a grid-based data file at a resolution of 4 acres and used to create a multivariable data base for the entire study area. To this data base were added classified LANDSAT data from 1976 and 1979. The resulting data base thus integrated in a uniform format all of the separately automated data within the study area. Several possible interactions between existing geocoded data bases and LANDSAT data were tested. The use of LANDSAT to update existing data base is to be tested.

  7. Minimally buffered data transfers between nodes in a data communications network

    DOEpatents

    Miller, Douglas R.

    2015-06-23

    Methods, apparatus, and products for minimally buffered data transfers between nodes in a data communications network are disclosed that include: receiving, by a messaging module on an origin node, a storage identifier, an origin data type, and a target data type, the storage identifier specifying application storage containing data, the origin data type describing a data subset contained in the origin application storage, the target data type describing an arrangement of the data subset in application storage on a target node; creating, by the messaging module, origin metadata describing the origin data type; selecting, by the messaging module from the origin application storage in dependence upon the origin metadata and the storage identifier, the data subset; and transmitting, by the messaging module to the target node, the selected data subset for storing in the target application storage in dependence upon the target data type without temporarily buffering the data subset.

  8. A Metadata Action Language

    NASA Technical Reports Server (NTRS)

    Golden, Keith; Clancy, Dan (Technical Monitor)

    2001-01-01

    The data management problem comprises data processing and data tracking. Data processing is the creation of new data based on existing data sources. Data tracking consists of storing metadata descriptions of available data. This paper addresses the data management problem by casting it as an AI planning problem. Actions are data-processing commands, plans are dataflow programs and goals are metadata descriptions of desired data products. Data manipulation is simply plan generation and execution, and a key component of data tracking is inferring the effects of an observed plan. We introduce a new action language for data management domains, called ADILM. We discuss the connection between data processing and information integration and show how a language for the latter must be modified to support the former. The paper also discusses information gathering within a data-processing framework, and shows how ADILM metadata expressions are a generalization of Local Completeness.

  9. Processing data base information having nonwhite noise

    DOEpatents

    Gross, Kenneth C.; Morreale, Patricia

    1995-01-01

    A method and system for processing a set of data from an industrial process and/or a sensor. The method and system can include processing data from either real or calculated data related to an industrial process variable. One of the data sets can be an artificial signal data set generated by an autoregressive moving average technique. After obtaining two data sets associated with one physical variable, a difference function data set is obtained by determining the arithmetic difference between the two data sets over time. A frequency domain transformation is made of the difference function data set to obtain Fourier modes describing a composite function data set. A residual function data set is obtained by subtracting the composite function data set from the difference function data set, and the residual function data set (free of nonwhite noise) is analyzed by a statistical probability ratio test to provide a validated data base.
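
    The processing chain described (difference of paired signals, Fourier extraction of a composite, analysis of the residual by a probability ratio test) might be sketched roughly as follows; the function name, the number of retained modes, and the thresholds are illustrative assumptions, and the abstract's probability ratio test is approximated here by a standard sequential probability ratio test (SPRT), not the patented implementation.

```python
import numpy as np

def validate_signals(real, surrogate, n_modes=8, alpha=0.01, beta=0.01, h=1.0):
    """Sketch: difference -> Fourier composite -> residual -> SPRT on the residual."""
    diff = np.asarray(real, float) - np.asarray(surrogate, float)

    # Keep only the largest Fourier modes as the "composite" (serially correlated) part.
    spectrum = np.fft.rfft(diff)
    keep = np.argsort(np.abs(spectrum))[-n_modes:]
    filtered = np.zeros_like(spectrum)
    filtered[keep] = spectrum[keep]
    composite = np.fft.irfft(filtered, n=len(diff))

    # The residual should be approximately white noise if the process/sensor is healthy.
    residual = diff - composite

    # Two-sided SPRT on the residual mean (0 versus a drift of size h).
    sigma2 = np.var(residual) or 1.0
    a, b = np.log(beta / (1 - alpha)), np.log((1 - beta) / alpha)
    llr = 0.0
    for r in residual:
        llr += (h / sigma2) * (r - h / 2.0)
        if llr >= b:
            return False   # drift detected: data not validated
        if llr <= a:
            llr = 0.0      # accept locally and restart the test
    return True            # no alarm raised: data validated
```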

  10. A Clinical Data Warehouse Based on OMOP and i2b2 for Austrian Health Claims Data.

    PubMed

    Rinner, Christoph; Gezgin, Deniz; Wendl, Christopher; Gall, Walter

    2018-01-01

    Clinical data can be reused to develop simulation models for healthcare-related questions. The objective was to develop a clinical data warehouse that harmonizes different data sources in a standardized manner and provides a reproducible interface for clinical data reuse. The Kimball life cycle for the development of data warehouses was used; the development is split into the technical, data, and business intelligence pathways. Sample data were persisted in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). The i2b2 clinical data warehouse tools were used to query the OMOP CDM by applying the new i2b2 multi-fact table feature. A clinical data warehouse was set up, and sample data, data dimensions and ontologies for Austrian health claims data were created. The ability of the standardized data access layer to create and apply simulation models will be evaluated next.

  11. Data Recipes: Toward Creating How-To Knowledge Base for Earth Science Data

    NASA Technical Reports Server (NTRS)

    Shen, Suhung; Lynnes, Chris; Acker, James G.; Beaty, Tammy

    2015-01-01

    Both the diversity and volume of Earth science data from satellites and numerical models are growing dramatically, due to an increasing population of measured physical parameters, and also an increasing variety of spatial and temporal resolutions for many data products. To further complicate matters, Earth science data delivered to data archive centers are commonly found in different formats and structures. NASA data centers, managed by the Earth Observing System Data and Information System (EOSDIS), have developed a rich and diverse set of data services and tools with features intended to simplify finding, downloading, and working with these data. Although most data services and tools have user guides, many users still experience difficulties with accessing or reading data due to varying levels of familiarity with data services, tools, and formats. The data recipe project at the Goddard Earth Science Data and Information Services Center (GES DISC) was initiated in late 2012 to enhance user support. A data recipe is an online how-to explanatory document, with step-by-step instructions and examples of accessing and working with real data (http://disc.sci.gsfc.nasa.gov/recipes). The current suite of recipes has been found to be very helpful, especially to first-time users of particular data services, tools, or data products. Online traffic to the data recipe pages is significant, even though the data recipe topics are still limited. An Earth Science Data System Working Group (ESDSWG) for data recipes was established in the spring of 2014, aimed at initiating an EOSDIS-wide campaign for leveraging the distributed knowledge within EOSDIS and its user communities regarding their respective services and tools. The ESDSWG data recipe group is working on an inventory and analysis of existing data recipes and tutorials, and will provide guidelines and recommendations for writing and grouping data recipes, and for cross-linking recipes to data products. This presentation gives an overview of the data recipe activities at GES DISC and ESDSWG. We are seeking requirements and input from a broader data user community to establish a strong knowledge base for Earth science data research and application implementations.

  12. Semi-automated Data Set Submission Work Flow for Archival with the ORNL DAAC

    NASA Astrophysics Data System (ADS)

    Wright, D.; Beaty, T.; Cook, R. B.; Devarakonda, R.; Eby, P.; Heinz, S. L.; Hook, L. A.; McMurry, B. F.; Shanafield, H. A.; Sill, D.; Santhana Vannan, S.; Wei, Y.

    2013-12-01

    The ORNL DAAC archives and publishes, free of charge, data and information relevant to biogeochemical, ecological, and environmental processes. The ORNL DAAC primarily archives data produced by NASA's Terrestrial Ecology Program; however, any data that are pertinent to the biogeochemical and ecological community are of interest. The data set submission process at the ORNL DAAC has recently been updated and semi-automated to provide a consistent data provider experience and to create a uniform data product. The data archived at the ORNL DAAC must be well formatted, self-descriptive, and documented, as well as referenced in a peer-reviewed publication. If the ORNL DAAC is the appropriate archive for a data set, the data provider is sent an email with several URL links to guide them through the submission process. The data provider is asked to fill out a short online form to help the ORNL DAAC staff better understand the data set. These questions cover information about the data set, a description of the data set, temporal and spatial characteristics of the data set, and how the data were prepared and delivered. The questionnaire is generic and has been designed to gather input on the diverse data sets the ORNL DAAC archives. A data upload module and metadata editor further guide the data provider through the submission process. For submission purposes, a complete data set includes data files, document(s) describing the data, supplemental files, metadata record(s), and an online form. The ORNL DAAC performs five major functions during the process of archiving data: 1) Ingestion is the ORNL DAAC side of submission; data are checked, metadata records are compiled, and files are converted to archival formats. 2) Metadata records and data set documentation are made searchable, and the data set is given a permanent URL. 3) The data set is published, assigned a DOI, and advertised. 4) The data set is provided long-term post-project support. 5) Stewardship of data ensures the data are stored on state-of-the-art computer systems with reliable backups.

  13. Replacing missing values using trustworthy data values from web data sources

    NASA Astrophysics Data System (ADS)

    Izham Jaya, M.; Sidi, Fatimah; Mat Yusof, Sharmila; Suriani Affendey, Lilly; Ishak, Iskandar; Jabar, Marzanah A.

    2017-09-01

    In practice, collected data are usually incomplete and contain missing values. Existing approaches to managing missing values overlook the importance of trustworthy data values in replacing missing values. Given that trusted, complete data are very important for data analysis, we propose a framework for missing value replacement using trustworthy data values from web data sources. The proposed framework adopts an ontology to map data values from web data sources to the incomplete dataset. As data from the web conflict with each other, we propose a trust score measurement based on data accuracy and data reliability. The trust score is then used to select trustworthy data values from web data sources for missing value replacement. We implemented the proposed framework using a financial dataset and present the findings in this paper. Our experiment shows that replacing missing values with trustworthy data values is important, especially in cases of conflicting data, to solve the missing values problem.
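
    A minimal sketch of the idea follows, assuming a simple weighted trust score and pandas DataFrames aligned across sources; the weights, source names, and fill strategy are illustrative assumptions, not the authors' exact measurement.

```python
import pandas as pd

def trust_score(accuracy, reliability, w_acc=0.6, w_rel=0.4):
    """Illustrative trust score: weighted combination of source accuracy
    (agreement with reference values) and reliability (historical consistency)."""
    return w_acc * accuracy + w_rel * reliability

def fill_missing(dataset, web_values, source_scores):
    """Replace each missing cell with the value from the most trusted web source.
    web_values: {source_name: DataFrame aligned with `dataset`}."""
    filled = dataset.copy()
    ranked = sorted(source_scores, key=source_scores.get, reverse=True)
    for source in ranked:                  # most trusted source first
        filled = filled.fillna(web_values[source])  # only still-missing cells are taken
    return filled

# Hypothetical scores for two conflicting web sources
scores = {"sourceA": trust_score(0.9, 0.8), "sourceB": trust_score(0.7, 0.95)}
```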

  14. Earth Observation Data Quality Monitoring and Control: A Case Study of STAR Central Data Repository

    NASA Astrophysics Data System (ADS)

    Han, W.; Jochum, M.

    2017-12-01

    Earth observation data quality is very important for researchers and decision makers involved in weather forecasting, severe weather warning, disaster and emergency response, environmental monitoring, etc. Monitoring and controlling earth observation data quality, especially accuracy, completeness, and timeliness, is very useful in data management and governance to optimize data flow, discover potential transmission issues, and better connect data providers and users. Taking a centralized near real-time satellite data repository, the STAR (Center for Satellite Applications and Research of NOAA) Central Data Repository (SCDR), as an example, this paper describes how to develop new mechanisms to verify data integrity, check data completeness, and monitor data latency in an operational data management system. Such quality monitoring and control of large volumes of satellite data helps data providers and managers improve the transmission of near real-time satellite data, enhance its acquisition and management, and overcome performance and management issues to better serve research and development activities.
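
    The three monitored qualities could be sketched as small checks like the following; the checksum algorithm, granule identifiers, and timestamps are assumptions for illustration, not the SCDR implementation.

```python
import hashlib
from datetime import datetime, timezone

def file_checksum(path):
    """Integrity: recompute a checksum to compare against the provider's manifest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def completeness(received_granules, expected_granules):
    """Completeness: fraction of expected granules actually received."""
    return len(set(received_granules) & set(expected_granules)) / len(expected_granules)

def latency_minutes(observation_time, arrival_time):
    """Timeliness: minutes elapsed between observation and arrival in the repository."""
    return (arrival_time - observation_time).total_seconds() / 60.0

# Hypothetical example
obs = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc)
arr = datetime(2017, 6, 1, 13, 30, tzinfo=timezone.utc)
print(latency_minutes(obs, arr))   # 90.0
```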

  15. Improvement of web-based data acquisition and management system for GOSAT validation lidar data analysis

    NASA Astrophysics Data System (ADS)

    Okumura, Hiroshi; Takubo, Shoichiro; Kawasaki, Takeru; Abdullah, Indra Nugraha; Uchino, Osamu; Morino, Isamu; Yokota, Tatsuya; Nagai, Tomohiro; Sakai, Tetsu; Maki, Takashi; Arai, Kohei

    2013-01-01

    A web-based data acquisition and management system for GOSAT (Greenhouse gases Observation SATellite) validation lidar data analysis has been developed. The system consists of a data acquisition sub-system (DAS) and a data management sub-system (DMS). The DAS, written in Perl, acquires AMeDAS (Automated Meteorological Data Acquisition System) ground-level local meteorological data, GPS radiosonde upper-air meteorological data, ground-level oxidant data, skyradiometer data, skyview camera images, meteorological satellite IR image data and GOSAT validation lidar data. The DMS, written in PHP, displays satellite-pass dates and all acquired data. In this article, we briefly describe some improvements for higher performance and higher data usability. The DAS now automatically calculates molecular number density profiles from the GPS radiosonde upper-air meteorological data and the U.S. standard atmosphere model. Predicted ozone density profile images above Saga city are also calculated using the Meteorological Research Institute (MRI) chemistry-climate model version 2 for comparison with actual ozone DIAL data.

  16. Data Preparation Process for the Buildings Performance Database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Walter, Travis; Dunn, Laurel; Mercado, Andrea

    2014-06-30

    The Buildings Performance Database (BPD) includes empirically measured data from a variety of data sources with varying degrees of data quality and data availability. The purpose of the data preparation process is to maintain data quality within the database and to ensure that all database entries have sufficient data for meaningful analysis and for the database API. Data preparation is a systematic process of mapping data into the Building Energy Data Exchange Specification (BEDES), cleansing data using a set of criteria and rules of thumb, and deriving values such as energy totals and dominant asset types. The data preparation process takes the greatest amount of effort and time; therefore, most of the cleansing process has been automated. The process also needs to adapt as more data are contributed to the BPD and as building technologies evolve over time. The data preparation process is an essential step between data contributed by providers and data published to the public in the BPD.
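
    A rough sketch of such a preparation step is given below, assuming pandas and hypothetical column names and BEDES-style terms; the actual BPD mapping, cleansing criteria, and derivations are more extensive.

```python
import pandas as pd

# Hypothetical mapping from a contributor's column names to BEDES-style terms.
FIELD_MAP = {"sqft": "gross_floor_area", "elec_kwh": "electricity_use", "gas_therm": "gas_use"}

def prepare(records: pd.DataFrame) -> pd.DataFrame:
    df = records.rename(columns=FIELD_MAP)

    # Cleansing rules of thumb: require the minimum fields needed for analysis,
    # and discard implausible floor areas.
    df = df.dropna(subset=["gross_floor_area", "electricity_use"])
    df = df[df["gross_floor_area"].between(100, 5_000_000)]

    # Derived values: site energy total (kWh) and dominant fuel type.
    df["gas_use_kwh"] = df.get("gas_use", 0) * 29.3     # therms -> kWh
    df["total_energy_kwh"] = df["electricity_use"] + df["gas_use_kwh"]
    df["dominant_fuel"] = (df["electricity_use"] >= df["gas_use_kwh"]).map(
        {True: "electricity", False: "gas"})
    return df
```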

  17. Quality assessment concept of the World Data Center for Climate and its application to CMIP5 data

    NASA Astrophysics Data System (ADS)

    Stockhause, M.; Höck, H.; Toussaint, F.; Lautenschlager, M.

    2012-08-01

    The preservation of data in a high state of quality which is suitable for interdisciplinary use is one of the most pressing and challenging current issues in long-term archiving. For high volume data such as climate model data, the data and data replicas are no longer stored centrally but are distributed over several local data repositories, e.g. the data of the Climate Model Intercomparison Project Phase 5 (CMIP5). The most important part of the data is to be archived, assigned a DOI, and published according to the World Data Center for Climate's (WDCC) application of the DataCite regulations. The data quality assessment, an integrated part of WDCC's data publication process, was adapted to the requirements of a federated data infrastructure. A concept of a distributed and federated quality assessment procedure was developed, in which the workload and responsibility for quality control are shared among the three primary CMIP5 data centers: the Program for Climate Model Diagnosis and Intercomparison (PCMDI), the British Atmospheric Data Centre (BADC), and WDCC. This distributed quality control concept, its pilot implementation for CMIP5, and first experiences are presented. The distributed quality control approach is capable of identifying data inconsistencies and of making quality results immediately available to data creators, data users and data infrastructure managers. Continuous publication of new data versions and slow data replication prevent the quality control checks from being completed. This, together with ongoing developments of the data and metadata infrastructure, requires adaptations in the code and concept of the distributed quality control approach.

  18. Device Data Ingestion for Industrial Big Data Platforms with a Case Study †

    PubMed Central

    Ji, Cun; Shao, Qingshi; Sun, Jiao; Liu, Shijun; Pan, Li; Wu, Lei; Yang, Chenglei

    2016-01-01

    Despite having played a significant role in the Industry 4.0 era, the Internet of Things is currently faced with the challenge of how to ingest large-scale heterogeneous and multi-type device data. In response to this problem we present a heterogeneous device data ingestion model for an industrial big data platform. The model includes device templates and four strategies for data synchronization, data slicing, data splitting and data indexing, respectively. We can ingest device data from multiple sources with this heterogeneous device data ingestion model, which has been verified on our industrial big data platform. In addition, we present a case study on device data-based scenario analysis of industrial big data. PMID:26927121

  19. Geographic Data as Personal Data in Four EU Member States

    NASA Astrophysics Data System (ADS)

    de Jong, A. J.; van Loenen, B.; Zevenbergen, J. A.

    2016-06-01

    The EU Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data aims at harmonising data protection legislation in the European Union. This should promote the free flow of products and services within the EU. This research found a wide variety of interpretations of the application of data protection legislation to geographic data. The variety was found among the different EU Member States, the different stakeholders and the different types of geographic data. In the Netherlands, the Data Protection Authority (DPA) states that panoramic images of streets are considered personal data, while Dutch case law judges that the data protection legislation does not apply if certain features are blurred and no link to an address is provided. The topographic datasets studied in the case studies do not contain personal data, according to the Dutch DPA, while the German DPA and the Belgian DPA judge that topographic maps of a large scale can contain personal data, and impose conditions on the processing of topographic maps. The UK DPA considers these data outside the scope of the legal definition of personal data. The patchwork of differences in data protection legislation can be harmonised by using a traffic light model. This model focuses on the context in which the processing of the data takes place and has four categories of data: (1) sensitive personal data, (2) personal data, (3) data that can possibly lead to identification, and (4) non-personal data. Some geographic data, for example factual data that do not reveal sensitive information about a person, can be categorised in the third category, giving room to opening up data under the INSPIRE Directive.

  20. Transparent Reporting of Data Quality in Distributed Data Networks

    PubMed Central

    Kahn, Michael G.; Brown, Jeffrey S.; Chun, Alein T.; Davidson, Bruce N.; Meeker, Daniella; Ryan, Patrick B.; Schilling, Lisa M.; Weiskopf, Nicole G.; Williams, Andrew E.; Zozus, Meredith Nahm

    2015-01-01

    Introduction: Poor data quality can be a serious threat to the validity and generalizability of clinical research findings. The growing availability of electronic administrative and clinical data is accompanied by a growing concern about the quality of these data for observational research and other analytic purposes. Currently, there are no widely accepted guidelines for reporting data quality results that would enable investigators and consumers to independently determine if a data source is fit for use to support analytic inferences and reliable evidence generation. Model and Methods: We developed a conceptual model that captures the flow of data from the data originator across successive data stewards and finally to the data consumer. This “data lifecycle” model illustrates how data quality issues can result in data being returned to previous data custodians. We highlight the potential risks of poor data quality on clinical practice and research results. Because of the need to ensure transparent reporting of data quality issues, we created a unifying data-quality reporting framework and a complementary set of 20 data-quality reporting recommendations for studies that use observational clinical and administrative data for secondary data analysis. We obtained stakeholder input on the perceived value of each recommendation by soliciting public comments via two face-to-face meetings of informatics and comparative-effectiveness investigators, through multiple public webinars targeted to the health services research community, and with an open access online wiki. Recommendations: Our recommendations propose reporting on both general and analysis-specific data quality features. The goals of these recommendations are to improve the reporting of data quality measures for studies that use observational clinical and administrative data, to ensure transparency and consistency in computing data quality measures, and to facilitate best practices and trust in the new clinical discoveries based on secondary use of observational data. PMID:25992385

  1. Data Publishing - View from the Front

    NASA Astrophysics Data System (ADS)

    Carlson, David; Pfeiffenberger, Hans

    2014-05-01

    As data publishing journals - Earth System Science Data (ESSD, Copernicus, since 2009), Geophysical Data Journal (GDJ, Wiley, recent) and Scientific Data (SD, Nature Publishing Group, anticipated from May 2014) - expose data sets, implement data description and data review practices, and develop partnerships with data centres and data providers, we anticipate substantial benefits for the broad earth system and environmental research communities but also substantial challenges for all parties. A primary advantage emerges from open access to convergent data: subsurface hydrographic data near Antarctica, for example, now available for combination and comparison with nearby atmospheric data (both documented in ESSD), basin-scale precipitation data (accessed through GDJ) for comparison and interpolation with long-term global precipitation records (accessed from ESSD), or, imagining not too far into the future, stomach content and abundance data for European fish (from ESSD) linked to genetic or nutritional data (from SD). In addition to increased opportunity for discovery and collaboration, we also notice parallel developments of new tools for (published) data visualization and display and increasing acceptance of data publication as a useful and anticipated dissemination step included in project- and institution-based data management plans. All parties - providers, publishers and users - will benefit as various indexing services (SCI, SCOPUS, DCI etc.) acknowledge the creative, intellectual and meritorious efforts of data preparation and data provision. The challenges facing data publication, in most cases very familiar to the data community but made more acute by the advances in data publishing, include diverging metadata standards (among biomedical, green ocean modeling and meteorological communities, for example), adhering to standards and practices for permanent identification while also accommodating 'living' data, and maintaining prompt but rigorous review and evaluation processes in the face of unfamiliarity and overwhelming workloads.

  2. Optimizing Data Center Services to Foster Stewardship and Use of Geospatial Data by Heterogeneous Populations of Users

    NASA Astrophysics Data System (ADS)

    Downs, R. R.; Chen, R. S.; de Sherbinin, A. M.

    2017-12-01

    Growing recognition of the importance of sharing scientific data more widely and openly has refocused attention on the state of data repositories, including both discipline- or topic-oriented data centers and institutional repositories. Data creators often have several alternatives for depositing and disseminating their natural, social, health, or engineering science data. In selecting a repository for their data, data creators and other stakeholders such as their funding agencies may wish to consider the user community or communities served, the type and quality of data products already offered, and the degree of data stewardship and associated services provided. Some data repositories serve general communities, e.g., those in their host institution or region, whereas others tailor their services to particular scientific disciplines or topical areas. Some repositories are selective when acquiring data and conduct extensive curation and reviews to ensure that data products meet quality standards. Many repositories have secured credentials and established a track record for providing trustworthy, high quality data and services. The NASA Socioeconomic Data and Applications Center (SEDAC) serves users interested in human-environment interactions, including researchers, students, and applied users from diverse sectors. SEDAC is selective when choosing data for dissemination, conducting several reviews of data products and services prior to release. SEDAC works with data producers to continually improve the quality of its open data products and services. As a Distributed Active Archive Center (DAAC) of the NASA Earth Observing System Data and Information System, SEDAC is committed to improving the accessibility, interoperability, and usability of its data in conjunction with data available from other DAACs, as well as other relevant data sources. SEDAC is certified as a Regular Member of the International Council for Science World Data System (ICSU-WDS).

  3. "Small" data in a big data world: archiving terrestrial ecology data at ORNL DAAC

    NASA Astrophysics Data System (ADS)

    Santhana Vannan, S. K.; Beaty, T.; Boyer, A.; Deb, D.; Hook, L.; Shrestha, R.; Thornton, M.; Virdi, M.; Wei, Y.; Wright, D.

    2016-12-01

    The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC http://daac.ornl.gov), a NASA-funded data center, archives a diverse collection of terrestrial biogeochemistry and ecological dynamics observations and models in support of NASA's Earth Science program. The ORNL DAAC has been addressing the increasing challenge of publishing diverse small data products into an online archive while dealing with the enhanced need for integration and availability of these data to address big science questions. This paper will show examples of "small" diverse data holdings - ranging from the Daymet model output data to site-based soil moisture observation data. We define "small" by the data volume of these data products compared to petabyte scale observations. We will highlight the use of tools and services for visualizing diverse data holdings and subsetting services such as the MODIS land products subsets tool (at ORNL DAAC) that provides big MODIS data in small chunks. Digital Object Identifiers (DOI) and data citations have enhanced the availability of data. The challenge faced by data publishers now is to deal with the increased number of publishable data products and most importantly the difficulties of publishing small diverse data products into an online archive. This paper will also present our experiences designing a data curation system for these types of data. The characteristics of these data will be examined and their scientific value will be demonstrated via data citation metrics. We will present case studies of leveraging specialized tools and services that have enabled small data sets to realize their "big" scientific potential. Overall, we will provide a holistic view of the challenges and potential of small diverse terrestrial ecology data sets from data curation to distribution.

  4. Use of Schema on Read in Earth Science Data Archives

    NASA Technical Reports Server (NTRS)

    Hegde, Mahabaleshwara; Smit, Christine; Pilone, Paul; Petrenko, Maksym; Pham, Long

    2017-01-01

    Traditionally, NASA Earth Science data archives have file-based storage using proprietary data file formats, such as HDF and HDF-EOS, which are optimized to support fast and efficient storage of spaceborne and model data as they are generated. The use of file-based storage essentially imposes an indexing strategy based on data dimensions. In most cases, NASA Earth Science data uses time as the primary index, leading to poor performance in accessing data in spatial dimensions. For example, producing a time series for a single spatial grid cell involves accessing a large number of data files. With exponential growth in data volume due to the ever-increasing spatial and temporal resolution of the data, using file-based archives poses significant performance and cost barriers to data discovery and access. Storing and disseminating data in proprietary data formats imposes an additional access barrier for users outside the mainstream research community. At the NASA Goddard Earth Sciences Data Information Services Center (GES DISC), we have evaluated applying the schema-on-read principle to data access and distribution. We used Apache Parquet to store geospatial data, and have exposed data through Amazon Web Services (AWS) Athena, AWS Simple Storage Service (S3), and Apache Spark. Using the schema-on-read approach allows customization of indexing spatially or temporally to suit the data access pattern. The storage of data in open formats such as Apache Parquet has widespread support in popular programming languages. A wide range of solutions for handling big data lowers the access barrier for all users. This presentation will discuss formats used for data storage, frameworks with support for schema-on-read used for data access, and common use cases covering data usage patterns seen in a geospatial data archive.
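
    A minimal schema-on-read sketch with PySpark reading Parquet is shown below; the bucket path and column names are hypothetical assumptions, and the sketch only illustrates how partitioning by space or by time changes which query is cheap, not the GES DISC setup itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The schema is applied at read time, not fixed when the archive was written.
grid = spark.read.parquet("s3://example-bucket/precip-demo/")  # hypothetical path

# Time series for a single grid cell: cheap if the data were (re)partitioned by lat/lon...
cell_series = (grid
               .where((F.col("lat") == 35.0) & (F.col("lon") == -90.0))
               .select("time", "precip")
               .orderBy("time"))

# ...while a spatial snapshot favours partitioning by time instead.
snapshot = grid.where(F.col("time") == "2016-07-01").select("lat", "lon", "precip")

cell_series.show(5)
```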

  5. Querying Semi-Structured Data

    NASA Technical Reports Server (NTRS)

    Abiteboul, Serge

    1997-01-01

    The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured data in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases. As will be seen later, when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research. The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data.

  6. Writing through Big Data: New Challenges and Possibilities for Data-Driven Arguments

    ERIC Educational Resources Information Center

    Beveridge, Aaron

    2017-01-01

    As multimodal writing continues to shift and expand in the era of Big Data, writing studies must confront the new challenges and possibilities emerging from data mining, data visualization, and data-driven arguments. Often collected under the broad banner of "data literacy," students' experiences of data visualization and data-driven…

  7. Multi-registration of software library resources

    DOEpatents

    Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

    2011-04-05

    Data communications, including issuing, by an application program to a high level data communications library, a request for initialization of a data communications service; issuing to a low level data communications library a request for registration of data communications functions; registering the data communications functions, including instantiating a factory object for each of the one or more data communications functions; issuing by the application program an instruction to execute a designated data communications function; issuing, to the low level data communications library, an instruction to execute the designated data communications function, including passing to the low level data communications library a call parameter that identifies a factory object; creating with the identified factory object the data communications object that implements the data communications function according to the protocol; and executing by the low level data communications library the designated data communications function.
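
    The registration pattern described reads roughly like the following sketch, with hypothetical factory classes standing in for the patented data communications functions; it is only an illustration of factory-object registration, not the patented library.

```python
# The "low level library" keeps a registry of factory objects keyed by function name
# and instantiates the requested data-communications object on demand.

class BroadcastFactory:
    def create(self, protocol):
        return lambda data: print(f"broadcast {data!r} via {protocol}")

class AllReduceFactory:
    def create(self, protocol):
        return lambda data: print(f"allreduce {data!r} via {protocol}")

REGISTRY = {}

def register(name, factory):          # "registration of data communications functions"
    REGISTRY[name] = factory

def execute(name, protocol, data):    # "instruction to execute a designated function"
    operation = REGISTRY[name].create(protocol)   # the factory builds the concrete object
    operation(data)

register("broadcast", BroadcastFactory())
register("allreduce", AllReduceFactory())
execute("broadcast", protocol="eager", data=[1, 2, 3])
```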

  8. Solutions for research data from a publisher's perspective

    NASA Astrophysics Data System (ADS)

    Cotroneo, P.

    2015-12-01

    Sharing research data has the potential to make research more efficient and reproducible. Elsevier has developed several initiatives to address the different needs of research data users. These include PANGEA Linked data, which provides geo-referenced, citable datasets from earth and life sciences, archived as supplementary data from publications by the PANGEA data repository; Mendeley Data, which allows users to freely upload and share their data; a database linking program that creates links between articles on ScienceDirect and datasets held in external data repositories such as EarthRef and EarthChem; a pilot for searching for research data through a map interface; an open data pilot that allows authors publishing in Elsevier journals to store and share research data and make this publicly available as a supplementary file alongside their article; and data journals, including Data in Brief, which allow researchers to share their data open access. Through these initiatives, researchers are not only encouraged to share their research data, but also supported in optimizing their research data management. By making data more readily citable and visible, and hence generating citations for authors, these initiatives also aim to ensure that researchers get the recognition they deserve for publishing their data.

  9. BrainLiner: A Neuroinformatics Platform for Sharing Time-Aligned Brain-Behavior Data

    PubMed Central

    Takemiya, Makoto; Majima, Kei; Tsukamoto, Mitsuaki; Kamitani, Yukiyasu

    2016-01-01

    Data-driven neuroscience aims to find statistical relationships between brain activity and task behavior from large-scale datasets. To facilitate high-throughput data processing and modeling, we created BrainLiner as a web platform for sharing time-aligned, brain-behavior data. Using an HDF5-based data format, BrainLiner treats brain activity and data related to behavior with the same salience, aligning both behavioral and brain activity data on a common time axis. This facilitates learning the relationship between behavior and brain activity. Using a common data file format also simplifies data processing and analyses. Properties describing data are unambiguously defined using a schema, allowing machine-readable definition of data. The BrainLiner platform allows users to upload and download data, as well as to explore and search for data from the web platform. A WebGL-based data explorer can visualize highly detailed neurophysiological data from within the web browser, and a data-driven search feature allows users to search for similar time windows of data. This increases transparency, and allows for visual inspection of neural coding. BrainLiner thus provides an essential set of tools for data sharing and data-driven modeling. PMID:26858636
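
    A small h5py sketch of time-aligned storage in the spirit described is given below; the group and dataset names are hypothetical and do not reproduce the actual BrainLiner schema.

```python
import numpy as np
import h5py

# Hypothetical layout: brain activity and behavior share one time axis (seconds).
t = np.arange(0.0, 10.0, 0.1)
brain = np.random.randn(t.size, 64)        # e.g. 64 channels
behavior = (np.sin(t) > 0).astype(int)     # e.g. a binary task variable

with h5py.File("session01.h5", "w") as f:
    f.create_dataset("time", data=t)
    g = f.create_group("brain")
    g.create_dataset("activity", data=brain)
    g.attrs["sampling_rate_hz"] = 10.0
    f.create_dataset("behavior/response", data=behavior)

# Reading back a common time window for both modalities
with h5py.File("session01.h5", "r") as f:
    time = f["time"][:]
    mask = (time >= 2.0) & (time < 4.0)
    x = f["brain/activity"][:][mask]       # read, then mask in NumPy
    y = f["behavior/response"][:][mask]
```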

  10. Expediting Scientific Data Analysis with Reorganization of Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Byna, Surendra; Wu, Kesheng

    2013-08-19

    Data producers typically optimize the layout of data files to minimize the write time. In most cases, data analysis tasks read these files in access patterns different from the write patterns, causing poor read performance. In this paper, we introduce Scientific Data Services (SDS), a framework for bridging the performance gap between writing and reading scientific data. SDS reorganizes data to match the read patterns of analysis tasks and enables transparent data reads from the reorganized data. We implemented an HDF5 Virtual Object Layer (VOL) plugin to redirect the HDF5 dataset read calls to the reorganized data. To demonstrate the effectiveness of SDS, we applied two parallel data organization techniques: a sort-based organization on plasma physics data and a transpose-based organization on mass spectrometry imaging data. We also extended the HDF5 data access API to allow selection of data based on their values through a query interface, called SDS Query. We evaluated the execution time in accessing various subsets of data through the existing HDF5 Read API and SDS Query. We showed that reading the reorganized data using SDS is up to 55X faster than reading the original data.
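
    The benefit of sort-based reorganization for value-range queries can be illustrated with a NumPy sketch; this is the underlying idea only, not the SDS API.

```python
import numpy as np

rng = np.random.default_rng(0)
energy = rng.exponential(scale=1.0, size=10_000_000)   # stand-in for a particle property

# Write-optimized layout: values appear in simulation order, so a value-range
# query must scan everything.
hits_scan = energy[(energy >= 5.0) & (energy < 6.0)]

# Read-optimized (sort-based) layout: the same query touches one contiguous slice.
sorted_energy = np.sort(energy)
lo, hi = np.searchsorted(sorted_energy, [5.0, 6.0])
hits_sorted = sorted_energy[lo:hi]

assert hits_scan.size == hits_sorted.size
```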

  11. The Principles for Successful Scientific Data Management Revisited

    NASA Astrophysics Data System (ADS)

    Walker, R. J.; King, T. A.; Joy, S. P.

    2005-12-01

    It has been 23 years since the National Research Council's Committee on Data Management and Computation (CODMAC) published its famous list of principles for successful scientific data management that have provided the framework for modern space science data management. CODMAC outlined seven principles: 1. Scientific Involvement in all aspects of space science missions. 2. Scientific Oversight of all scientific data-management activities. 3. Data Availability - Validated data should be made available to the scientific community in a timely manner. They should include appropriate ancillary data, and complete documentation. 4. Facilities - A proper balance between cost and scientific productivity should be maintained. 5. Software - Transportable well documented software should be available to process and analyze the data. 6. Scientific Data Storage - The data should be preserved in retrievable form. 7. Data System Funding - Adequate data funding should be made available at the outset of missions and protected from overruns. In this paper we will review the lessons learned in trying to apply these principles to space derived data. The Planetary Data System created the concept of data curation to carry out the CODMAC principles. Data curators are scientists and technologists who work directly with the mission scientists to create data products. The efficient application of the CODMAC principles requires that data curators and the mission team start early in a mission to plan for data access and archiving. To build the data products the planetary discipline adopted data access and documentation standards and has adhered to them. The data curators and mission team work together to produce data products and make them available. However even with early planning and agreement on standards the needs of the science community frequently far exceed the available resources. This is especially true for smaller principal investigator run missions. We will argue that one way to make data systems for small missions more effective is for the data curators to provide software tools to help develop the mission data system.

  12. HRP Data Accessibility Current Status

    NASA Technical Reports Server (NTRS)

    Sams, Clarence

    2009-01-01

    Overview of talk: a) Content of Human Life Science data; b) Data archive structure; c) Applicable legal documents and policies; and d) Methods for data access. Life Science Data Archive (LSDA) contains research data from NASA-funded experiments, primarily data from flight experiments and ground analog data collected at NASA facilities. Longitudinal Study of Astronaut Health (LSAH) contains electronic health records (medical data) of all astronauts, including mission data. Data are collected for clinical purposes. Clinical data are analyzed by LSAH epidemiologists to identify trends in crew health and implement changes in pre-, in-, or post-flight medical care.

  13. A review of data quality assessment methods for public health information systems.

    PubMed

    Chen, Hong; Hailey, David; Wang, Ning; Yu, Ping

    2014-05-14

    High quality data and effective data quality assessment are required for accurately evaluating the impact of public health interventions and measuring public health outcomes. Data, data use, and the data collection process, as the three dimensions of data quality, all need to be assessed for overall data quality assessment. We reviewed current data quality assessment methods. Relevant studies were identified in major databases and on well-known institutional websites. We found that the dimension of data was most frequently assessed. Completeness, accuracy, and timeliness were the three most-used attributes among a total of 49 attributes of data quality. The major quantitative assessment methods were descriptive surveys and data audits, whereas the common qualitative assessment methods were interviews and documentation review. The limitations of the reviewed studies included inattentiveness to data use and the data collection process, inconsistency in the definition of attributes of data quality, failure to address data users' concerns and a lack of systematic procedures in data quality assessment. This review study is limited by the coverage of the databases and the breadth of public health information systems. Further research could develop consistent data quality definitions and attributes. More research effort should be given to assessing the quality of data use and the quality of the data collection process.
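
    The three most-used attributes could be operationalized roughly as below, assuming a pandas DataFrame and illustrative thresholds; real assessments define these attributes per information system.

```python
import pandas as pd

def completeness(df: pd.DataFrame, required: list) -> float:
    """Share of required fields that are populated across all records."""
    return df[required].notna().to_numpy().mean()

def accuracy(df: pd.DataFrame, column: str, valid_range: tuple) -> float:
    """Share of values that fall inside an agreed plausibility range."""
    lo, hi = valid_range
    return df[column].between(lo, hi).mean()

def timeliness(df: pd.DataFrame, event_col: str, report_col: str, max_days: int = 30) -> float:
    """Share of records reported within the agreed interval after the event.
    Assumes both columns are datetime-typed."""
    delay = (df[report_col] - df[event_col]).dt.days
    return (delay <= max_days).mean()
```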

  14. Data-centric Science: New challenges for long-term archives and data publishers

    NASA Astrophysics Data System (ADS)

    Stockhause, Martina; Lautenschlager, Michael

    2016-04-01

    In recent years the publication of data has become more and more common. Data and metadata for a single project are often disseminated by multiple data centers in federated data infrastructures. At the same time, data are shared earlier to enable collaboration within research projects. The research data environment has become more heterogeneous and the data more dynamic. Only a few data or metadata repositories are long-term archives (LTAs) with WDS/DSA certificates that comply with Force 11's 'Joint Declaration of Data Citation Principles'. Therefore, for the long-term usage of these data and information, a small number of LTAs have the task of preserving this information. They replicate, connect, quality assure, harmonize, archive, and curate these different types of data from multiple data centers with different operation procedures and data standards. Consortia or federations of certified LTAs are needed to meet the challenges of big data storage and citation. Data publishers play a central role in storing, preserving, and disseminating scientific information. Portals of these federations of LTAs, or data registration agencies like DataCite, might even become the portals of the future for scientific knowledge discovery. The example of CMIP6 is used to illustrate this future perspective on the role of LTAs/data publishers.

  15. A Review of Data Quality Assessment Methods for Public Health Information Systems

    PubMed Central

    Chen, Hong; Hailey, David; Wang, Ning; Yu, Ping

    2014-01-01

    High quality data and effective data quality assessment are required for accurately evaluating the impact of public health interventions and measuring public health outcomes. Data, data use, and the data collection process, as the three dimensions of data quality, all need to be assessed for overall data quality assessment. We reviewed current data quality assessment methods. Relevant studies were identified in major databases and on well-known institutional websites. We found that the dimension of data was most frequently assessed. Completeness, accuracy, and timeliness were the three most-used attributes among a total of 49 attributes of data quality. The major quantitative assessment methods were descriptive surveys and data audits, whereas the common qualitative assessment methods were interviews and documentation review. The limitations of the reviewed studies included inattentiveness to data use and the data collection process, inconsistency in the definition of attributes of data quality, failure to address data users’ concerns and a lack of systematic procedures in data quality assessment. This review study is limited by the coverage of the databases and the breadth of public health information systems. Further research could develop consistent data quality definitions and attributes. More research effort should be given to assessing the quality of data use and the quality of the data collection process. PMID:24830450

  16. Development of spatial data guidelines and standards: spatial data set documentation to support hydrologic analysis in the U.S. Geological Survey

    USGS Publications Warehouse

    Fulton, James L.

    1992-01-01

    Spatial data analysis has become an integral component in many surface and sub-surface hydrologic investigations within the U.S. Geological Survey (USGS). Currently, one of the largest costs in applying spatial data analysis is the cost of developing the needed spatial data. Therefore, guidelines and standards are required for the development of spatial data in order to allow for data sharing and reuse; this eliminates costly redevelopment. In order to attain this goal, the USGS is expanding efforts to identify guidelines and standards for the development of spatial data for hydrologic analysis. Because of the variety of project and database needs, the USGS has concentrated on developing standards for documenting spatial data sets to aid in the assessment of data set quality and the compatibility of different data sets. An interim data set documentation standard (1990) has been developed that provides a mechanism for associating a wide variety of information with a data set, including data about source material, data automation and editing procedures used, projection parameters, data statistics, descriptions of features and feature attributes, information on organizational contacts, lists of operations performed on the data, and free-form comments and notes about the data, made at various times in the evolution of the data set. The interim data set documentation standard has been automated using a commercial geographic information system (GIS) and data set documentation software developed by the USGS. Where possible, USGS-developed software is used to enter data into the data set documentation file automatically. The GIS software closely associates a data set with its data set documentation file; the documentation file is retained with the data set whenever it is modified, copied, or transferred to another computer system. The Water Resources Division of the USGS is continuing to develop spatial data and data processing standards, with emphasis on standards needed to support hydrologic analysis, hydrologic data processing, and the publication of hydrologic thematic maps. There is a need for the GIS vendor community to develop data set documentation tools similar to those developed by the USGS, or to incorporate USGS-developed tools in their software.

  17. Data warehouse implementation with clinical pharmacokinetic/pharmacodynamic data.

    PubMed

    Koprowski, S P; Barrett, J S

    2002-03-01

    We have created a data warehouse for human pharmacokinetic (PK) and pharmacodynamic (PD) data generated primarily within the Clinical PK Group of the Drug Metabolism and Pharmacokinetics (DM&PK) Department of DuPont Pharmaceuticals. Data that enter an Oracle-based LIMS, either directly from chromatography systems or through files from contract research organizations, are accessed via SAS/PH.Kinetics, GLP-compliant data analysis software residing on individual users' workstations. Upon completion of the final PK or PD analysis, data are pushed to a predefined location. Data analyzed or created with other software (e.g., WinNonlin, NONMEM, Adapt) are added to this file repository as well. The warehouse creates views to these data and accumulates metadata on all data sources defined in the warehouse. The warehouse is managed via the SAS/Warehouse Administrator product, which defines the environment, creates summarized data structures, and schedules data refreshes. The clinical PK/PD warehouse encompasses laboratory, biometric, PK, and PD data streams. Detailed logical tables for each compound are created and updated as the clinical PK/PD data warehouse is populated. The data model defined for the warehouse is based on a star schema. Summarized data structures such as multidimensional databases (MDDB), infomarts, and datamarts are created from the detail tables. Data mining and querying of highly summarized data, as well as drill-down to detail data, are possible via exploitation tools that front-end the warehouse data. Because the warehouse data are refreshed periodically, these applications access the most current data available and do not require a manual interface to update or populate the data store. Prototype applications have been web-enabled to facilitate their use by varied data customers across platforms and locations. The warehouse also contains automated mechanisms for constructing study data listings and SAS transport files for eventual incorporation into an electronic submission. This environment permits the management of online analytical processing by a single administrator once the data model and warehouse configuration have been designed. The expansion of the current environment will eventually connect data from all phases of research and development, ensuring a return on investment and, it is hoped, efficiencies in data processing not achievable with earlier legacy systems.
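
    As a rough sketch of the star-schema organization described above, the following Python fragment uses the built-in sqlite3 module to build a small fact table with two dimension tables and a summarized, datamart-style view; the table names, columns, and values are illustrative assumptions, not the warehouse's actual model.

      import sqlite3

      con = sqlite3.connect(":memory:")
      cur = con.cursor()

      # Dimension tables hold descriptive attributes (illustrative names and columns).
      cur.execute("CREATE TABLE dim_subject (subject_id INTEGER PRIMARY KEY, age INTEGER, sex TEXT)")
      cur.execute("CREATE TABLE dim_study (study_id INTEGER PRIMARY KEY, compound TEXT, phase TEXT)")

      # The fact table stores PK observations keyed to the dimension tables.
      cur.execute("""CREATE TABLE fact_pk_observation (
          subject_id INTEGER REFERENCES dim_subject(subject_id),
          study_id   INTEGER REFERENCES dim_study(study_id),
          time_hr    REAL,
          concentration_ng_ml REAL)""")

      cur.execute("INSERT INTO dim_subject VALUES (1, 34, 'F')")
      cur.execute("INSERT INTO dim_study VALUES (10, 'DMP-123', 'I')")
      cur.executemany("INSERT INTO fact_pk_observation VALUES (?, ?, ?, ?)",
                      [(1, 10, 0.5, 12.1), (1, 10, 1.0, 25.4), (1, 10, 2.0, 18.3)])

      # A summarized, datamart-style structure built from the detail table.
      cur.execute("""CREATE VIEW mart_cmax AS
          SELECT s.compound, f.subject_id, MAX(f.concentration_ng_ml) AS cmax
          FROM fact_pk_observation f JOIN dim_study s USING (study_id)
          GROUP BY s.compound, f.subject_id""")
      print(cur.execute("SELECT * FROM mart_cmax").fetchall())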

  18. Legal assessment tool (LAT): an interactive tool to address privacy and data protection issues for data sharing.

    PubMed

    Kuchinke, Wolfgang; Krauth, Christian; Bergmann, René; Karakoyun, Töresin; Woollard, Astrid; Schluender, Irene; Braasch, Benjamin; Eckert, Martin; Ohmann, Christian

    2016-07-07

    Data in the life sciences are being generated and stored in many different databases at an unprecedented rate. An ever-increasing share of these data are human health data and therefore fall under legal data protection regulations. As part of the BioMedBridges project, which created infrastructures that connect more than 10 ESFRI research infrastructures (RI), the legal and ethical prerequisites of data sharing were examined using a novel and pragmatic approach. We employed concepts from computer science to create legal requirement clusters that enable legal interoperability between databases in the areas of data protection, data security, Intellectual Property (IP), and security of biosample data. We analysed and extracted access rules and constraints from all data providers (databases) involved in building the data bridges, covering many of Europe's most important databases. These requirement clusters were applied to five usage scenarios representing the data flow in different data bridges: Image bridge, Phenotype data bridge, Personalised medicine data bridge, Structural data bridge, and Biosample data bridge. A matrix was built to relate the important concepts from data protection regulations (e.g. pseudonymisation, identifiability, access control, consent management) to the results of the requirement clusters. An interactive user interface for querying the matrix for the requirements necessary for compliant data sharing was created. To guide researchers through the legal requirements without the need for expert legal knowledge, an interactive tool, the Legal Assessment Tool (LAT), was developed. LAT interactively takes researchers through a selection process to characterise the types of data and databases involved and provides suitable requirements and recommendations for concrete data access and sharing situations. The results provided by LAT are based on an analysis of the data access and sharing conditions for different kinds of data in major databases in Europe. Human health data must be made shareable for research purposes, and LAT is one of the means to achieve this aim. In summary, LAT provides requirements in an interactive way for compliant data access and sharing, with appropriate safeguards, restrictions and responsibilities, by introducing a culture of responsibility and data governance when dealing with human data.

  19. The Materials Data Facility: Data Services to Advance Materials Science Research

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blaiszik, B.; Chard, K.; Pruyne, J.

    2016-07-06

    With increasingly strict data management requirements from funding agencies and institutions, expanding focus on the challenges of research replicability, and growing data sizes and heterogeneity, new data needs are emerging in the materials community. The materials data facility (MDF) operates two cloud-hosted services, data publication and data discovery, with features to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific) and automatically-extracted metadata in a registry while the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. The MDF services empower individual researchers, research projects, and institutions to (I) publish research datasets, regardless of size, from local storage, institutional data stores, or cloud storage, without involvement of third-party publishers; (II) build, share, and enforce extensible domain-specific custom metadata schemas; (III) interact with published data and metadata via representational state transfer (REST) application program interfaces (APIs) to facilitate automation, analysis, and feedback; and (IV) access a data discovery model that allows researchers to search, interrogate, and eventually build on existing published data. We describe MDF’s design, current status, and future plans.

  20. The Materials Data Facility: Data Services to Advance Materials Science Research

    NASA Astrophysics Data System (ADS)

    Blaiszik, B.; Chard, K.; Pruyne, J.; Ananthakrishnan, R.; Tuecke, S.; Foster, I.

    2016-08-01

    With increasingly strict data management requirements from funding agencies and institutions, expanding focus on the challenges of research replicability, and growing data sizes and heterogeneity, new data needs are emerging in the materials community. The materials data facility (MDF) operates two cloud-hosted services, data publication and data discovery, with features to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific) and automatically-extracted metadata in a registry while the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. The MDF services empower individual researchers, research projects, and institutions to (I) publish research datasets, regardless of size, from local storage, institutional data stores, or cloud storage, without involvement of third-party publishers; (II) build, share, and enforce extensible domain-specific custom metadata schemas; (III) interact with published data and metadata via representational state transfer (REST) application program interfaces (APIs) to facilitate automation, analysis, and feedback; and (IV) access a data discovery model that allows researchers to search, interrogate, and eventually build on existing published data. We describe MDF's design, current status, and future plans.
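
    As a hedged sketch of the REST-style interaction pattern described in this entry, the Python fragment below uses the requests library to register a record and run a faceted search; the base URL, endpoint paths, and JSON field names are placeholders invented for illustration and are not the actual MDF API.

      import requests

      BASE = "https://materials-data.example.org/api"  # placeholder endpoint, not the real MDF URL

      # (II)/(III): register a dataset record with custom, domain-specific metadata.
      record = {
          "title": "Nanoindentation of thin films",
          "authors": ["Doe, J."],
          "custom_metadata": {"technique": "nanoindentation", "substrate": "Si"},
      }
      resp = requests.post(f"{BASE}/publish", json=record, timeout=30)
      resp.raise_for_status()
      doi = resp.json().get("doi")  # citable persistent identifier returned by the service

      # (IV): query the discovery service with a faceted, free-text search.
      hits = requests.get(f"{BASE}/search",
                          params={"q": "nanoindentation", "facet": "technique"},
                          timeout=30).json()
      for h in hits.get("results", []):
          print(h.get("title"), h.get("doi"))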

  1. Data Integration for Heterogeneous Datasets

    PubMed Central

    2014-01-01

    Abstract More and more, the needs of data analysts require the use of data outside the control of their own organizations. The increasing amount of data available on the Web, the new technologies for linking data across datasets, and the increasing need to integrate structured and unstructured data are all driving this trend. In this article, we provide a technical overview of the emerging “broad data” area, in which the variety of heterogeneous data being used, rather than the scale of the data being analyzed, is the limiting factor in data analysis efforts. The article explores some of the emerging themes in data discovery, data integration, linked data, and the combination of structured and unstructured data. PMID:25553272

  2. Methods and apparatus of analyzing electrical power grid data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hafen, Ryan P.; Critchlow, Terence J.; Gibson, Tara D.

    Apparatus and methods of processing large-scale data regarding an electrical power grid are described. According to one aspect, a method of processing large-scale data regarding an electrical power grid includes accessing a large-scale data set comprising information regarding an electrical power grid; processing data of the large-scale data set to identify a filter which is configured to remove erroneous data from the large-scale data set; using the filter, removing erroneous data from the large-scale data set; and after the removing, processing data of the large-scale data set to identify an event detector which is configured to identify events of interest in the large-scale data set.
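
    As a rough, hedged illustration of the two-stage pattern described in this abstract (derive a filter to remove erroneous records, then run an event detector on the cleaned data), the Python sketch below uses a plausible-range filter and a fixed frequency-deviation threshold; the readings, ranges, and thresholds are invented for the example and are not the patented method.

      NOMINAL_HZ = 60.0

      # Synthetic grid-frequency readings (Hz); 0.0 and 120.0 stand in for erroneous records.
      readings = [59.98, 60.01, 60.00, 0.0, 59.99, 60.02, 61.5, 60.01, 120.0, 59.40]

      # Stage 1: apply a filter that removes erroneous data (values outside a plausible range).
      cleaned = [r for r in readings if 55.0 <= r <= 65.0]

      # Stage 2: an event detector flags excursions beyond an assumed +/-0.25 Hz band.
      events = [(i, r) for i, r in enumerate(cleaned) if abs(r - NOMINAL_HZ) > 0.25]
      print("events of interest:", events)   # -> [(5, 61.5), (7, 59.4)]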

  3. Efficient data management in a large-scale epidemiology research project.

    PubMed

    Meyer, Jens; Ostrzinski, Stefan; Fredrich, Daniel; Havemann, Christoph; Krafczyk, Janina; Hoffmann, Wolfgang

    2012-09-01

    This article describes the concept of a "Central Data Management" (CDM) and its implementation within the large-scale population-based medical research project "Personalized Medicine". The CDM can be summarized as a conjunction of data capturing, data integration, data storage, data refinement, and data transfer. A wide spectrum of reliable "Extract Transform Load" (ETL) software for automatic integration of data, as well as "electronic Case Report Forms" (eCRFs), was developed in order to integrate decentralized and heterogeneously captured data. Because of the high sensitivity of the captured data, high system resource availability, data privacy, data security, and quality assurance are of utmost importance. A complex data model was developed and implemented using an Oracle database in high-availability cluster mode in order to integrate different types of participant-related data. Intelligent data capturing and storage mechanisms improve the quality of data. Data privacy is ensured by a multi-layered role/right system for access control and by de-identification of identifying data. A well-defined backup process prevents data loss. Over a period of one and a half years, the CDM has captured a wide variety of data amounting to approximately 5 terabytes without experiencing any critical incidents of system breakdown or loss of data. The aim of this article is to demonstrate one possible way of establishing a Central Data Management in large-scale medical and epidemiological studies. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
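
    As a minimal sketch of the de-identification idea mentioned above (not the CDM's actual mechanism), the Python fragment below replaces an identifying field with a keyed HMAC pseudonym so records stay linkable without exposing the identifier; the field names and key handling are illustrative assumptions.

      import hmac, hashlib

      SECRET_KEY = b"store-me-in-an-access-controlled-vault"  # illustrative; real key management is separate

      def pseudonymize(value: str) -> str:
          """Derive a stable pseudonym so records can be linked without exposing the identifier."""
          return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

      record = {"name": "Jane Doe", "insurance_no": "123-45-678", "systolic_bp": 128}

      deidentified = {
          "participant_pseudonym": pseudonymize(record["insurance_no"]),
          "systolic_bp": record["systolic_bp"],   # medical payload kept, identifying fields dropped
      }
      print(deidentified)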

  4. Reporting Data with "Over-the-Counter" Data Analysis Supports Improves Educators' Data Analyses

    ERIC Educational Resources Information Center

    Rankin, Jenny Grant

    2014-01-01

    The benefits of making data-informed decisions to improve learning rely on educators correctly interpreting given data. Many educators routinely misinterpret data, even at districts with proactive support for data use. The tool most educators use for data analyses, which is an information technology data system or its reports, typically reports…

  5. ASTM Data Banks and Chemical Information Sources

    ERIC Educational Resources Information Center

    Batik, Albert; Hale, Eleanor

    1972-01-01

    Among the data described are infrared indexes, mass spectral data, chromatographic data, X-ray emission data, odor and taste threshold data, and thermodynamics data. This paper provides the chemical documentarian with a complete reference source to a wide variety of analytical data. (Author/NH)

  6. DataFed: A Federated Data System for Visualization and Analysis of Spatio-Temporal Air Quality Data

    NASA Astrophysics Data System (ADS)

    Husar, R. B.; Hoijarvi, K.

    2017-12-01

    DataFed is a distributed web-services-based computing environment for accessing, processing, and visualizing atmospheric data in support of air quality science and management. The flexible, adaptive environment facilitates the access and flow of atmospheric data from providers to users by enabling the creation of user-driven data processing/visualization applications. DataFed 'wrapper' components non-intrusively wrap heterogeneous, distributed datasets for access by standards-based GIS web services. The mediator components (also web services) map the heterogeneous data into a spatio-temporal data model. Chained web services provide homogeneous data views (e.g., geospatial and time views) using a global multi-dimensional data model. In addition to data access and rendering, the data processing component services can be programmed for filtering, aggregation, and fusion of multidimensional data. Complete application software is written in a custom-made data-flow language. Currently, the federated data pool consists of over 50 datasets originating from globally distributed data providers delivering surface-based air quality measurements, satellite observations, and emissions data as well as regional and global-scale air quality models. The web browser-based user interface allows point-and-click navigation and browsing of the XYZT multi-dimensional data space. The key applications of DataFed are exploring spatial patterns of pollutants and their seasonal, weekly, and diurnal cycles and frequency distributions for exploratory air quality research. Since 2008, DataFed has been used to support EPA in the implementation of the Exceptional Event Rule. The data system is also used at universities in the US, Europe, and Asia.

  7. Legacy data center integration into distributed data federations: The World Data Center for Climate (WDCC) experience

    NASA Astrophysics Data System (ADS)

    Kindermann, Stephan; Berger, Katharina; Toussaint, Frank

    2014-05-01

    The integration of well-established legacy data centers into newly developed data federation infrastructures is a key requirement for enhancing climate data access based on widely agreed interfaces. We present the approach taken to integrate the ICSU World Data Center for Climate (WDCC), located in Hamburg, Germany, into the European ENES climate data federation, which is part of the international ESGF data federation. The ENES / ESGF data federation hosts petabytes of climate model data and provides scalable data search and access services across worldwide distributed data centers. Part of the data provided by the ENES / ESGF data federation is also long-term archived and curated at the WDCC data archive, allowing, for example, DOI-based data citation. Integration of the WDCC into the ENES / ESGF federation allows end users to search and access WDCC data using consistent interfaces worldwide. We summarize the approach we have taken to integrate the WDCC legacy system with the ESGF infrastructure. On the technical side, we describe the provisioning of ESGF-consistent metadata and data interfaces as well as the adoption of the security infrastructure. On the non-technical side, we describe our experiences in integrating a long-term archival center, with its costly quality assurance procedures, into a distributed data federation that puts emphasis on providing early and consistent data search and access services to scientists. The experiences were gained in the process of curating ESGF-hosted CMIP5 data at the WDCC. Approximately one petabyte of CMIP5 data used for the IPCC climate report is being replicated and archived at the WDCC.

  8. Development of Data Acquisition Set-up for Steady-state Experiments

    NASA Astrophysics Data System (ADS)

    Srivastava, Amit K.; Gupta, Arnab D.; Sunil, S.; Khan, Ziauddin

    2017-04-01

    For short-duration experiments, digitized data are generally transferred for processing and storage after the experiment, whereas in a steady-state experiment the data are acquired, processed, displayed, and stored continuously in a pipelined manner. This requires special techniques for acquiring and storing data and for viewing data on the go to display current trends in various physical parameters. A small data acquisition set-up was developed for continuously acquiring signals for various physical parameters at different sampling rates for long-duration experiments. This includes the hardware set-up for signal digitization, a Field Programmable Gate Array (FPGA) based timing system for clock synchronization and event/trigger distribution, time slicing of data streams into chunks to enable viewing of data during acquisition, and channel profile display through down-sampling. In order to store a data stream of indefinite duration, the stream is divided into slices/chunks of user-defined duration. Data chunks avoid the problem of server data being inaccessible until the channel data file is closed at the end of a long-duration experiment. A graphical user interface was developed in the LabVIEW application development environment for configuring the data acquisition hardware and storing data chunks on the local machine as well as on a remote data server, through Python, for further data access. The data plotting and analysis utilities have been developed with Python software, which provides tools for further data processing. This paper describes the development and implementation of data acquisition for steady-state experiments.
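
    A minimal Python sketch of the time-slicing idea described above: a continuous sample stream is cut into chunk files of user-defined duration so that completed slices can be read while acquisition continues; the file naming, JSON format, and simulated digitizer are assumptions of the sketch, not the LabVIEW/FPGA implementation.

      import json

      CHUNK_SECONDS = 2          # user-defined slice duration
      SAMPLE_PERIOD = 0.1        # 10 Hz acquisition, simulated

      def acquire_sample(t):
          return {"t": round(t, 2), "value": 42.0}   # stand-in for a digitizer read

      chunk, chunk_start, chunk_index = [], 0.0, 0
      for i in range(100):                 # pretend this loop runs for the experiment duration
          t = i * SAMPLE_PERIOD
          chunk.append(acquire_sample(t))
          if t - chunk_start >= CHUNK_SECONDS:
              # Closing the chunk file makes this slice immediately available for viewing or upload.
              with open(f"channel01_chunk{chunk_index:04d}.json", "w") as f:
                  json.dump(chunk, f)
              chunk, chunk_start, chunk_index = [], t, chunk_index + 1

      if chunk:  # flush the final partial slice at the end of the run
          with open(f"channel01_chunk{chunk_index:04d}.json", "w") as f:
              json.dump(chunk, f)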

  9. Next Generation Cloud-based Science Data Systems and Their Implications on Data and Software Stewardship, Preservation, and Provenance

    NASA Astrophysics Data System (ADS)

    Hua, H.; Manipon, G.; Starch, M.

    2017-12-01

    NASA's upcoming missions are expected to generate data volumes at least an order of magnitude larger than those of current missions. A significant increase in data processing, data rates, data volumes, and long-term data archive capabilities is needed. Consequently, new challenges are emerging that impact traditional data and software management approaches. At large scales, next generation science data systems are exploring the move onto cloud computing paradigms to support these increased needs. New implications, such as costs, data movement, collocation of data systems and archives, and moving processing closer to the data, may result in changes to the stewardship, preservation, and provenance of science data and software. With more science data systems being on-boarded onto cloud computing facilities, we can expect more Earth science data records to be both generated and kept in the cloud. But at large scales, the cost of processing and storing global data may impact architectural and system designs. Data systems will trade the cost of keeping data in the cloud against data life-cycle approaches that move "colder" data back to traditional on-premise facilities. How will this impact data citation and processing software stewardship? What are the impacts of cloud-based on-demand processing and its effect on reproducibility and provenance? Similarly, with more science processing software being moved onto cloud platforms, virtual machines, and container-based approaches, more opportunities arise for improved stewardship and preservation. But will the science community trust data reprocessed years or decades later? We will also explore emerging questions about the stewardship of the science data system software that generates the science data records, both during and after the life of the mission.

  10. Evaluating the Quality and Usability of Open Data for Public Health Research: A Systematic Review of Data Offerings on 3 Open Data Platforms.

    PubMed

    Martin, Erika G; Law, Jennie; Ran, Weijia; Helbig, Natalie; Birkhead, Guthrie S

    Government datasets are newly available on open data platforms that are publicly accessible, available in nonproprietary formats, free of charge, and with unlimited use and distribution rights. They provide opportunities for health research, but their quality and usability are unknown. The objectives were to describe available open health data, identify whether data are presented in a way that is aligned with best practices and usable for researchers, and examine differences across platforms. Two reviewers systematically reviewed a random sample of data offerings on three open health data platforms at the federal, New York State, and New York City levels, using a standard coding guide: NYC OpenData (New York City, all offerings, n = 37), Health Data NY (New York State, 25% sample, n = 71), and HealthData.gov (US Department of Health and Human Services, 5% sample, n = 75). Data characteristics from the coding guide were aggregated into summary indices for intrinsic data quality, contextual data quality, adherence to the Dublin Core metadata standards, and the 5-star open data deployment scheme. One quarter of the offerings were structured datasets; other presentation styles included charts (14.7%), documents describing data (12.0%), maps (10.9%), and query tools (7.7%). Health Data NY had higher intrinsic data quality (P < .001), contextual data quality (P < .001), and Dublin Core metadata standards adherence (P < .001). All met basic "web availability" open data standards; fewer met the higher standard of being "hyperlinked to other data." Although all platforms need improvement, they already provide readily available data for health research. Sustained effort on improving open data websites and metadata is necessary for ensuring researchers use these data, thereby increasing their research value.

  11. Data hosting infrastructure for primary biodiversity data

    PubMed Central

    2011-01-01

    Background Today, an unprecedented volume of primary biodiversity data are being generated worldwide, yet significant amounts of these data have been and will continue to be lost after the conclusion of the projects tasked with collecting them. To get the most value out of these data it is imperative to seek a solution whereby these data are rescued, archived and made available to the biodiversity community. To this end, the biodiversity informatics community requires investment in processes and infrastructure to mitigate data loss and provide solutions for long-term hosting and sharing of biodiversity data. Discussion We review the current state of biodiversity data hosting and investigate the technological and sociological barriers to proper data management. We further explore the rescuing and re-hosting of legacy data, the state of existing toolsets and propose a future direction for the development of new discovery tools. We also explore the role of data standards and licensing in the context of data hosting and preservation. We provide five recommendations for the biodiversity community that will foster better data preservation and access: (1) encourage the community's use of data standards, (2) promote the public domain licensing of data, (3) establish a community of those involved in data hosting and archival, (4) establish hosting centers for biodiversity data, and (5) develop tools for data discovery. Conclusion The community's adoption of standards and development of tools to enable data discovery is essential to sustainable data preservation. Furthermore, the increased adoption of open content licensing, the establishment of data hosting infrastructure and the creation of a data hosting and archiving community are all necessary steps towards the community ensuring that data archival policies become standardized. PMID:22373257

  12. Use of Schema on Read in Earth Science Data Archives

    NASA Astrophysics Data System (ADS)

    Petrenko, M.; Hegde, M.; Smit, C.; Pilone, P.; Pham, L.

    2017-12-01

    Traditionally, NASA Earth Science data archives have file-based storage using proprietary data file formats, such as HDF and HDF-EOS, which are optimized to support fast and efficient storage of spaceborne and model data as they are generated. The use of file-based storage essentially imposes an indexing strategy based on data dimensions. In most cases, NASA Earth Science data uses time as the primary index, leading to poor performance in accessing data in spatial dimensions. For example, producing a time series for a single spatial grid cell involves accessing a large number of data files. With exponential growth in data volume due to the ever-increasing spatial and temporal resolution of the data, using file-based archives poses significant performance and cost barriers to data discovery and access. Storing and disseminating data in proprietary data formats imposes an additional access barrier for users outside the mainstream research community. At the NASA Goddard Earth Sciences Data Information Services Center (GES DISC), we have evaluated applying the "schema-on-read" principle to data access and distribution. We used Apache Parquet to store geospatial data, and have exposed data through Amazon Web Services (AWS) Athena, AWS Simple Storage Service (S3), and Apache Spark. Using the "schema-on-read" approach allows customization of indexing—spatial or temporal—to suit the data access pattern. The storage of data in open formats such as Apache Parquet has widespread support in popular programming languages. A wide range of solutions for handling big data lowers the access barrier for all users. This presentation will discuss formats used for data storage, frameworks with support for "schema-on-read" used for data access, and common use cases covering data usage patterns seen in a geospatial data archive.
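
    As a brief, hedged sketch of the schema-on-read idea using the pyarrow package: the same rows can be partitioned spatially rather than temporally so that one grid cell's time series is read without touching unrelated files; the column names and one-degree binning are assumptions made for the example.

      import pyarrow as pa
      import pyarrow.parquet as pq

      # A tiny gridded product flattened into rows (time, lat, lon, value).
      table = pa.table({
          "time": ["2017-01-01", "2017-01-02", "2017-01-01", "2017-01-02"],
          "lat_bin": [38, 38, 39, 39],          # 1-degree spatial bins chosen as the partition key
          "lon_bin": [-77, -77, -76, -76],
          "value": [0.12, 0.15, 0.31, 0.29],
      })

      # "Schema on read": the indexing strategy is a storage-layout choice, here spatial partitioning.
      pq.write_to_dataset(table, root_path="archive_parquet", partition_cols=["lat_bin"])

      # Producing a time series for one grid cell now touches only that partition.
      cell = pq.read_table("archive_parquet", filters=[("lat_bin", "=", 38)])
      print(cell.to_pydict())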

  13. Distributed Earth observation data integration and on-demand services based on a collaborative framework of geospatial data service gateway

    NASA Astrophysics Data System (ADS)

    Xie, Jibo; Li, Guoqing

    2015-04-01

    Earth observation (EO) data obtained by airborne or spaceborne sensors are heterogeneous and their storage is geographically distributed. These data sources belong to different organizations or agencies whose data management and storage methods differ widely. Different data sources provide different data publishing platforms or portals. As more remote sensing sensors are used for Earth observation missions, space agencies have accumulated massive, distributed archives of EO data. The distribution of EO data archives and the heterogeneity of the systems make it difficult to use geospatial data efficiently for many EO applications, such as hazard mitigation. To solve the interoperability problems of different EO data systems, an advanced architecture for a distributed geospatial data infrastructure is introduced in this paper to address the complexity of distributed and heterogeneous EO data integration and on-demand processing. The concept and architecture of a geospatial data service gateway (GDSG) is proposed to connect to heterogeneous EO data sources so that EO data can be retrieved and accessed through unified interfaces. The GDSG consists of a set of tools and services that encapsulate heterogeneous geospatial data sources into homogeneous service modules. The GDSG modules include EO metadata harvesters and translators, adaptors for different types of data systems, unified data query and access interfaces, EO data cache management, and a gateway GUI. The GDSG framework is used to implement interoperability and synchronization between distributed EO data sources with heterogeneous architectures. An on-demand distributed EO data platform was developed to validate the GDSG architecture and implementation techniques. Several distributed EO data archives were used for testing. Floods and earthquakes serve as two scenarios for the use cases of distributed EO data integration and interoperability.
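
    A hedged sketch of the gateway/adaptor idea described above: each heterogeneous archive is wrapped behind the same query interface so one on-demand request can fan out to all sources; the class names and return values are invented for illustration and do not reflect the GDSG implementation.

      from abc import ABC, abstractmethod

      class DataSourceAdaptor(ABC):
          """Gateway-style adaptor: each heterogeneous EO archive exposes the same query interface."""
          @abstractmethod
          def query(self, bbox, start, end):
              ...

      class AgencyACsvAdaptor(DataSourceAdaptor):
          def query(self, bbox, start, end):
              # Illustrative: translate the unified request into this archive's own CSV/FTP protocol.
              return [{"granule": "A_20150401.csv", "bbox": bbox}]

      class AgencyBWcsAdaptor(DataSourceAdaptor):
          def query(self, bbox, start, end):
              # Illustrative: translate the same request into an OGC-style web service call.
              return [{"granule": "B_20150401.tif", "bbox": bbox}]

      def federated_query(adaptors, bbox, start, end):
          """The gateway fans a single on-demand request out to every registered source."""
          results = []
          for adaptor in adaptors:
              results.extend(adaptor.query(bbox, start, end))
          return results

      print(federated_query([AgencyACsvAdaptor(), AgencyBWcsAdaptor()],
                            bbox=(100.0, 25.0, 105.0, 30.0), start="2015-04-01", end="2015-04-30"))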

  14. Lunar Data Node: Apollo Data Restoration and Archiving Update

    NASA Technical Reports Server (NTRS)

    Williams, David R.; Hills, Howard K.; Guiness, Edward A.; Taylor, Patrick T.; McBride, Marie Julia

    2013-01-01

    The Lunar Data Node (LDN) of the Planetary Data System (PDS) is responsible for the restoration and archiving of Apollo data. The LDN is located at the National Space Science Data Center (NSSDC), which holds much of the extant Apollo data on microfilm, microfiche, hard-copy documents, and magnetic tapes in older formats. The goal of the restoration effort is to convert the data into user-accessible PDS formats, create a full set of explanatory supporting data (metadata), archive the full data sets through PDS, and post the data online at the PDS Geosciences Node. This will both enable easy use of the data by current researchers and ensure that the data and metadata are securely preserved for future use. We are also attempting to locate and preserve Apollo data which were never archived at NSSDC. We will give a progress report on the data sets we have been restoring and future work.

  16. Authenticated sensor interface device

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Coleman, Jody Rustyn; Poland, Richard W.

    A system and method for the secure storage and transmission of data is provided. A data aggregate device can be configured to receive secure data from a data source, such as a sensor, and encrypt the secure data using a suitable encryption technique, such as a shared private key technique, a public key encryption technique, a Diffie-Hellman key exchange technique, or other suitable encryption technique. The encrypted secure data can be provided from the data aggregate device to different remote devices over a plurality of segregated or isolated data paths. Each of the isolated data paths can include an optoisolator that is configured to provide one-way transmission of the encrypted secure data from the data aggregate device over the isolated data path. External data can be received through a secure data filter which, by validating the external data, allows for key exchange and other various adjustments from an external source.
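
    As an illustration only (not the patented device), the Python sketch below uses the third-party cryptography package to perform one of the options named in the abstract, a Diffie-Hellman-style key exchange (X25519) followed by symmetric encryption of a sensor reading; key distribution, the one-way optoisolated paths, and the secure data filter are not modeled here.

      import os
      from cryptography.hazmat.primitives import hashes
      from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
      from cryptography.hazmat.primitives.ciphers.aead import AESGCM
      from cryptography.hazmat.primitives.kdf.hkdf import HKDF

      device_private = X25519PrivateKey.generate()
      remote_private = X25519PrivateKey.generate()   # stands in for the receiving system

      # Diffie-Hellman-style exchange, then derive a symmetric key from the shared secret.
      shared = device_private.exchange(remote_private.public_key())
      key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None, info=b"sensor-link").derive(shared)

      # Encrypt one sensor reading before it leaves the aggregate device.
      nonce = os.urandom(12)
      ciphertext = AESGCM(key).encrypt(nonce, b'{"sensor": "flow-7", "value": 3.14}', None)
      print(len(ciphertext), "bytes of ciphertext")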

  17. SDMS: A scientific data management system

    NASA Technical Reports Server (NTRS)

    Massena, W. A.

    1978-01-01

    SDMS is a data base management system developed specifically to support scientific programming applications. It consists of a data definition program to define the forms of data bases, and FORTRAN-compatible subroutine calls to create and access data within them. Each SDMS data base contains one or more data sets. A data set has the form of a relation. Each column of a data set is defined to be either a key or data element. Key elements must be scalar. Data elements may also be vectors or matrices. The data elements in each row of the relation form an element set. SDMS permits direct storage and retrieval of an element set by specifying the corresponding key element values. To support the scientific environment, SDMS allows the dynamic creation of data bases via subroutine calls. It also allows intermediate or scratch data to be stored in temporary data bases which vanish at job end.

  18. Good Data Can Be Better Data - How Data Management Maturity Can Help Repositories Improve Operations, Data Quality, And Usability, Helping Researchers

    NASA Astrophysics Data System (ADS)

    Stall, S.

    2015-12-01

    Much earth and space science data and metadata are managed and supported by an infrastructure of repositories, ranging from large agency or instrument facilities, to institutions, to smaller repositories including labs. Scientists face many challenges in this ecosystem, both in storing their data and in accessing data from others for new research. Critical for all uses is ensuring the credibility and integrity of the data and conveying that, along with provenance information, now and in the future. Accurate information is essential for future researchers to find (or discover) the data, evaluate the data for use (content, temporal coverage, geolocation, precision), and finally select (or discard) that data as meeting a "fit-for-purpose" criterion. We also need to optimize the effort it takes to describe the data for these determinations, which means making it efficient for the researchers who collect the data. At AGU we are developing a program aimed at helping repositories, and thereby researchers, improve data quality and data usability toward these goals. AGU has partnered with the CMMI Institute to develop their Data Management Maturity (DMM) framework within the Earth and space sciences. The CMMI DMM framework guides best practices in a range of data operations, and applying the DMM through an assessment reveals how repositories and institutions can best optimize efforts to improve operations and functionality throughout the data lifecycle and elevate best practices across a variety of data management operations. Supporting processes such as data operations, data governance, and data architecture are included. An assessment involves identifying accomplishments and weaknesses compared to leading practices for data management. Broad application of the DMM can help improve the quality of data and operations, and a consistency across the community that will facilitate interoperability, discovery, preservation, and reuse. Good data can be better data. Consistency results in sustainability.

  19. The Challenges of Data Quality Evaluation in a Joint Data Warehouse

    PubMed Central

    Bae, Charles J.; Griffith, Sandra; Fan, Youran; Dunphy, Cheryl; Thompson, Nicolas; Urchek, John; Parchman, Alandra; Katzan, Irene L.

    2015-01-01

    Introduction: The use of clinically derived data from electronic health records (EHRs) and other electronic clinical systems can greatly facilitate clinical research as well as operational and quality initiatives. One approach for making these data available is to incorporate data from different sources into a joint data warehouse. When using such a data warehouse, it is important to understand the quality of the data. The primary objective of this study was to determine the completeness and concordance of common types of clinical data available in the Knowledge Program (KP) joint data warehouse, which contains feeds from several electronic systems including the EHR. Methods: A manual review was performed of specific data elements for 250 patients from an EHR, and these were compared with corresponding elements in the KP data warehouse. Completeness and concordance were calculated for five categories of data including demographics, vital signs, laboratory results, diagnoses, and medications. Results: In general, data elements for demographics, vital signs, diagnoses, and laboratory results were present in more cases in the source EHR than in the KP. When data elements were available in both sources, there was high concordance. In contrast, the KP data warehouse documented a higher prevalence of deaths and medications than the EHR. Discussion: Several factors contributed to the discrepancies between data in the KP and the EHR, including the start date and frequency of data feed updates into the KP, the inability to transfer data located in nonstructured formats (e.g., free text or scanned documents), and incomplete or missing data variables in the source EHR. Conclusion: When evaluating the quality of a data warehouse with multiple data sources, assessing completeness and concordance between the data set and the source data may be better than designating one as a gold standard. This allows the user to optimize the method and timing of data transfer in order to capture data with better accuracy. PMID:26290882
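
    The two metrics used in the study can be stated concretely; the small Python sketch below, with invented values standing in for an EHR chart review and the corresponding KP warehouse extract, computes completeness as the fraction of source elements present in the warehouse and concordance as agreement among elements present in both.

      ehr       = {"p01": 120, "p02": 135, "p03": None, "p04": 142}   # systolic BP from chart review
      warehouse = {"p01": 120, "p02": 140, "p04": 142}                # same element from the KP warehouse

      source_present = {k for k, v in ehr.items() if v is not None}
      both_present   = source_present & set(warehouse)

      completeness = len(both_present) / len(source_present)
      concordance  = sum(ehr[k] == warehouse[k] for k in both_present) / len(both_present)
      print(f"completeness={completeness:.2f}, concordance={concordance:.2f}")
      # -> completeness=1.00, concordance=0.67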

  20. Lessons learned from setting up the NOWESP research data base: Experiences in an interdisciplinary research project

    NASA Astrophysics Data System (ADS)

    Radach, Günther; Gekeler, Jens

    1996-09-01

    Research carried out within the framework of the MAST project NOWESP (North-West European Shelf Programme) was based on a multi-parameter data set of existing marine data, relevant for estimating trends, variability and fluxes on the Northwest European Shelf. The data sets were provided by the partners of the project. Additional data sets were obtained from several other institutions. During the project, the data were organized in the NOWESP Research Data Base (NRDB), for which a special data base scheme was defined that was capable of storing different types of marine data. Data products, like time series and interpolated fields, were provided to the partners for analysis (Radach et al. [1997]). After three years of project time, the feasibility of such an approach is discussed. Ways of optimizing data access and evaluation are proposed. A project-oriented Research Data Base is a useful tool because of its flexibility and proximity to the research being carried out. However, several requirements must be met to derive optimum benefits from this type of service unit. Since this task usually is carried out by a limited number of staff, an early start of project data management is recommended. To enable future projects to succeed in an analogous compilation of relevant data for their use, as performed in NOWESP, the task of organizing the data sets for any short-term project should be shared between a research data base group and a national or international data centre whose experience and software could be used. It must be ensured that only quality controlled data sets from the individual data-producing projects are delivered to the national data centres. It is recommended that data quality control should be performed by the originators and/or data centres before delivering any data sets to the research data base. Delivery of the (full) data sets should be checked and their quality should be approved by authorized data centres.

  1. The Frictionless Data Package: Data Containerization for Automated Scientific Workflows

    NASA Astrophysics Data System (ADS)

    Shepherd, A.; Fils, D.; Kinkade, D.; Saito, M. A.

    2017-12-01

    As cross-disciplinary geoscience research increasingly relies on machines to discover and access data, one of the critical questions facing data repositories is how data and supporting materials should be packaged for consumption. Traditionally, data repositories have relied on a human's involvement throughout discovery and access workflows. This human could assess fitness for purpose by reading loosely coupled, unstructured information from web pages and documentation. In attempts to shorten the time to science and to access data resources across many disciplines, the expectation that machines will mediate the process of discovery and access is challenging data repository infrastructure. The challenge is to deliver data and information in ways that enable machines to make better decisions, by enabling them to understand the data and metadata of many data types. Additionally, once machines have recommended a data resource as relevant to an investigator's needs, the data resource should be easy to integrate into that investigator's toolkits for analysis and visualization. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) supports NSF-funded OCE and PLR investigators with their projects' data management needs. These needs involve a number of varying data types, some of which require multiple files with differing formats. Presently, BCO-DMO has described these data types, and the important relationships between each type's data files, through human-readable documentation on web pages. Machines directly accessing data files from BCO-DMO could overlook this documentation and misinterpret the data. Instead, BCO-DMO is exploring the idea of data containerization, or packaging data and related information for easier transport, interpretation, and use. In researching the landscape of data containerization, the Frictionless Data Package (http://frictionlessdata.io/) provides a number of valuable advantages over similar solutions. This presentation will focus on these advantages and how the Frictionless Data Package addresses a number of real-world use cases for data discovery, access, analysis, and visualization.
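
    As a minimal illustration of containerizing a multi-file data type, the Python fragment below writes a datapackage.json descriptor that names each file, its format, and its column schema so a machine can interpret the files without the web-page documentation; the dataset name, files, and fields are hypothetical, and the field layout follows the published Frictionless Data Package specification only in outline.

      import json

      descriptor = {
          "name": "example-ctd-cast",       # hypothetical oceanographic data product
          "resources": [
              {
                  "name": "ctd-profiles",
                  "path": "data/ctd_profiles.csv",
                  "format": "csv",
                  "schema": {
                      "fields": [
                          {"name": "depth_m", "type": "number"},
                          {"name": "temperature_c", "type": "number"},
                          {"name": "salinity_psu", "type": "number"},
                      ]
                  },
              },
              {"name": "cast-log", "path": "data/cast_log.txt", "format": "txt"},
          ],
      }

      with open("datapackage.json", "w") as f:
          json.dump(descriptor, f, indent=2)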

  2. New Data Services for Polar Investigators from Integrated Earth Data Applications (IEDA)

    NASA Astrophysics Data System (ADS)

    Nitsche, F. O.; Ferrini, V.; Morton, J. J.; Arko, R. A.; McLain, K.; O'hara, S. H.; Carbotte, S. M.; Lehnert, K. A.; IEDA Team, I.

    2013-12-01

    Accessibility and preservation of data are needed to support multi-disciplinary research in the key environmentally sensitive Polar Regions. IEDA (Integrated Earth Data Applications) is a community-based data facility funded by the US National Science Foundation (NSF) to support, sustain, and advance the geosciences by providing data services for observational solid earth data from the Ocean, Earth, and Polar Sciences. IEDA tools and services relevant to the Polar research community include the Antarctic and Southern Ocean Data System (ASODS), the U.S. Antarctic Program Data Coordination Center (USAP-DCC), GeoMapApp, as well as a number of services for sample-based data (SESAR and EarthChem). In addition to existing tools, which assist Polar investigators in archiving their data and creating DIF records for global searches in AMD, IEDA recently added several new tools and services that provide further support for investigators across the data life cycle. These include a data management plan tool (http://www.iedadata.org/compliance/plan) and a data compliance reporting tool (http://www.iedadata.org/compliance/report) that help investigators comply with the requirements of funding agencies such as the National Science Foundation (NSF). Data, especially from the challenging Polar Regions, are likely to be used by other scientists for future studies. Therefore, data acknowledgment is an important concern of many investigators. To encourage data acknowledgments by data users, we link references of publications (when known) to datasets and cruises registered within the ASODS system as part of our data curation services (http://www.marine-geo.org/portals/antarctic/references.php). In addition, IEDA offers a data publication service to register scientific data with DOIs, making data sets citable as publications with attribution to investigators as authors. IEDA is a publication agent of the DataCite consortium. Offering such services provides additional incentives for making data available through data centers. Such tools and services are important building blocks of a coherent and comprehensive (cyber) data support structure for Polar investigators.

  4. A Semi-Automated Workflow Solution for Data Set Publication

    DOE PAGES

    Vannan, Suresh; Beaty, Tammy W.; Cook, Robert B.; ...

    2016-03-08

    In order to address the need for published data, considerable effort has gone into formalizing the process of data publication. From funding agencies to publishers, data publication has rapidly become a requirement. Digital Object Identifiers (DOI) and data citations have enhanced the integration and availability of data. The challenge facing data publishers now is to deal with the increased number of publishable data products and most importantly the difficulties of publishing diverse data products into an online archive. The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC), a NASA-funded data center, faces these challenges as it deals with data products created by individual investigators. This paper summarizes the challenges of curating data and provides a summary of a workflow solution that ORNL DAAC research and technical staff have created to deal with publication of the diverse data products. Finally, the workflow solution presented here is generic and can be applied to data from any scientific domain and data located at any data center.

  5. Linked Data: Forming Partnerships at the Data Layer

    NASA Astrophysics Data System (ADS)

    Shepherd, A.; Chandler, C. L.; Arko, R. A.; Jones, M. B.; Hitzler, P.; Janowicz, K.; Krisnadhi, A.; Schildhauer, M.; Fils, D.; Narock, T.; Groman, R. C.; O'Brien, M.; Patton, E. W.; Kinkade, D.; Rauch, S.

    2015-12-01

    The challenges presented by big data are straining data management software architectures of the past. For smaller existing data facilities, the technical refactoring of software layers becomes costly to scale across the big data landscape. In response to these challenges, data facilities will need partnerships with external entities for improved solutions to perform tasks such as data cataloging, discovery and reuse, and data integration and processing with provenance. On its surface, the concept of linked open data suggests an uncalculated altruism. Yet, in his concept of five-star open data, Tim Berners-Lee explains the strategic costs and benefits of deploying linked open data from the perspective of its consumer and producer - a data partnership. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) addresses some of the emerging needs of its research community by partnering with groups doing complementary work and linking their respective data layers using linked open data principles. Examples will show how these links, explicit manifestations of partnerships, reduce technical debt and provide flexibility for future considerations.

  6. ClinData Express – A Metadata Driven Clinical Research Data Management System for Secondary Use of Clinical Data

    PubMed Central

    Li, Zuofeng; Wen, Jingran; Zhang, Xiaoyan; Wu, Chunxiao; Li, Zuogao; Liu, Lei

    2012-01-01

    Aiming to ease the secondary use of clinical data in clinical research, we introduce a metadata-driven web-based clinical data management system named ClinData Express. ClinData Express is made up of two parts: 1) m-designer, a standalone application for metadata definition; and 2) a web-based data warehouse system for data management. With ClinData Express, all the researchers need to do is define the metadata and data model in m-designer. The web interface for data collection and the specific database for data storage are then generated automatically. The standards used in the system and the data export module ensure data reuse. The system has been tested on seven disease data collections in Chinese and one form from dbGaP. The system's flexibility gives it great potential for use in clinical research. The system is available at http://code.google.com/p/clindataexpress. PMID:23304327

  7. A Semi-Automated Workflow Solution for Data Set Publication

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vannan, Suresh; Beaty, Tammy W.; Cook, Robert B.

    In order to address the need for published data, considerable effort has gone into formalizing the process of data publication. From funding agencies to publishers, data publication has rapidly become a requirement. Digital Object Identifiers (DOI) and data citations have enhanced the integration and availability of data. The challenge facing data publishers now is to deal with the increased number of publishable data products and most importantly the difficulties of publishing diverse data products into an online archive. The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC), a NASA-funded data center, faces these challenges as it deals with data products created by individual investigators. This paper summarizes the challenges of curating data and provides a summary of a workflow solution that ORNL DAAC research and technical staff have created to deal with publication of the diverse data products. Finally, the workflow solution presented here is generic and can be applied to data from any scientific domain and data located at any data center.

  8. DataONE: Gateway to Earth and Environmental Data Repositories

    NASA Astrophysics Data System (ADS)

    Koskela, R.; Michener, W. K.; Vieglais, D.; Budden, A. E.

    2017-12-01

    DataONE (Data Observation Network for Earth) is a National Science Foundation DataNet project that enables universal access to data and also facilitates researchers in fulfilling their need for data management and in providing secure and permanent access to their data. DataONE offers the scientific community a suite of tools and training materials that cover all aspects of the data life cycle from data collection, to management, analysis and publication. Data repositories affiliated with DataONE are referred to as Member Nodes and represent large regional, national and international research networks, agencies, and other institutions. As part of the DataONE Federation, the repositories gain access to a range of value-added services to support their users. These services include usage tracking and reporting, content replication, and the ability to register the services created by the repository. In addition, DataONE and the California Digital Library manage ONEShare, a repository that accepts content submitted through Dash, a platform allowing researchers to easily describe, deposit and share their research data.

  9. [Contemplation on the application of big data in clinical medicine].

    PubMed

    Lian, Lei

    2015-01-01

    Medicine is another area where big data is being used. The link between clinical treatment and outcome is the key step when applying big data in medicine. In the era of big data, it is critical to collect complete outcome data. Patient follow-up, comprehensive integration of data resources, quality control, and standardized data management are the predominant approaches to avoiding missing data and data islands. Therefore, the establishment of systematic patient follow-up protocols and a prospective data management strategy are important aspects of big data in medicine.

  10. Data Prospecting Framework - a new approach to explore "big data" in Earth Science

    NASA Astrophysics Data System (ADS)

    Ramachandran, R.; Rushing, J.; Lin, A.; Kuo, K.

    2012-12-01

    Due to advances in sensors, computation, and storage, the cost and effort required to produce large datasets have been significantly reduced. As a result, we are seeing a proliferation of large-scale data sets being assembled in almost every science field, especially in the geosciences. Opportunities to exploit "big data" are enormous, as new hypotheses can be generated by combining and analyzing large amounts of data. However, such a data-driven approach to science discovery assumes that scientists can find and isolate relevant subsets from vast amounts of available data. Current Earth Science data systems only provide data discovery through simple metadata and keyword-based searches and are not designed to support data exploration capabilities based on the actual content. Consequently, scientists often find themselves downloading large volumes of data, struggling with large amounts of storage, and learning new analysis technologies that will help them separate the wheat from the chaff. New mechanisms of data exploration are needed to help scientists discover the relevant subsets. We present data prospecting, a new content-based data analysis paradigm to support data-intensive science. Data prospecting allows researchers to explore big data when determining and isolating data subsets for further analysis. This is akin to geo-prospecting, in which mineral sites of interest are identified across the landscape through screening methods. The resulting "data prospects" only provide an interaction with and feel for the data through first-look analytics; researchers would still have to download the relevant datasets and analyze them deeply using their favorite analytical tools to determine whether the datasets will yield new hypotheses. Data prospecting combines two traditional categories of data analysis, data exploration and data mining, within the discovery step. Data exploration utilizes manual/interactive methods for data analysis such as standard statistical analysis and visualization, usually on small datasets. On the other hand, data mining utilizes automated algorithms to extract useful information. Humans guide these automated algorithms and specify algorithm parameters (training samples, clustering size, etc.). Data prospecting combines these two approaches using high-performance computing and new techniques for efficient distributed file access.

  11. The Environmental Data Initiative data repository: Trustworthy practices that foster preservation, fitness, and reuse for environmental and ecological data

    NASA Astrophysics Data System (ADS)

    Servilla, M. S.; Brunt, J.; Costa, D.; Gries, C.; Grossman-Clarke, S.; Hanson, P. C.; O'Brien, M.; Smith, C.; Vanderbilt, K.; Waide, R.

    2017-12-01

    The Environmental Data Initiative (EDI) is an outgrowth of more than 30 years of information management experience and technology from LTER Network data practitioners. EDI builds upon the PASTA data repository software used by the LTER Network Information System and manages more than 42,000 data packages containing tabular data, imagery, and other formats. Development of the repository was a community process beginning in 2009 that included numerous working groups for generating use cases and system requirements and for testing completed software, thereby creating a vested interest in its success and transparency in its design. All software is available for review on GitHub, and refinements and new features are ongoing. Documentation is also available on Read the Docs, including a comprehensive description of all web-service API methods. PASTA is metadata driven and uses the Ecological Metadata Language (EML) standard for describing environmental and ecological data; a simplified Dublin Core document is also available for each data package. Data are aggregated into packages consisting of metadata and other related content described by an OAI-ORE document. Once archived, each data package becomes immutable and permanent; updates are possible through the addition of new revisions. Components of each data package are accessible through a unique identifier, while the entire data package receives a DOI that is registered in DataCite. Preservation occurs through a combination of DataONE synchronization/replication and a series of local and remote backup strategies, including daily uploads to AWS Glacier storage. Checksums are computed for all data at initial upload, with random verification occurring on a continuous basis, thus ensuring the integrity of the data. PASTA incorporates a series of data quality tests to ensure that data are correctly documented with EML before they are archived; data packages that fail any test are not admitted to the repository. These tests are a measure of data fitness, which ultimately increases confidence in data reuse and synthesis. The EDI data repository is recognized by multiple organizations, including EarthCube's Council of Data Facilities, the United States Geological Survey, FAIRsharing.org, and re3data.org, and is a PLOS- and Nature-recommended data repository.
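
    The fixity checking described above can be sketched in a few lines of Python: compute a SHA-256 checksum at initial upload, record it in a manifest, and recompute it later during random verification; the manifest layout and file paths are assumptions of the sketch, not PASTA's internal format.

      import hashlib, json, pathlib

      def sha256_of(path):
          h = hashlib.sha256()
          with open(path, "rb") as f:
              for block in iter(lambda: f.read(1 << 20), b""):
                  h.update(block)
          return h.hexdigest()

      data_file = pathlib.Path("package_001/table1.csv")
      data_file.parent.mkdir(exist_ok=True)
      data_file.write_text("site,year,biomass\nA,2016,12.4\n")

      # At initial upload: record the checksum in a manifest alongside the package.
      manifest = {str(data_file): sha256_of(data_file)}
      pathlib.Path("package_001/manifest.json").write_text(json.dumps(manifest))

      # During continuous random verification: recompute and compare.
      stored = json.loads(pathlib.Path("package_001/manifest.json").read_text())
      assert sha256_of(data_file) == stored[str(data_file)], "fixity check failed"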

  12. DataUp 2.0: Improving On a Tool For Helping Researchers Archive, Manage, and Share Their Tabular Data

    NASA Astrophysics Data System (ADS)

    Strasser, C.; Borda, S.; Cruse, P.; Kunze, J.

    2013-12-01

    There are many barriers to data management and sharing among earth and environmental scientists; among the most significant are a lack of knowledge about best practices for data management, metadata standards, or appropriate data repositories for archiving and sharing data. Last year we developed an open source web application, DataUp, to help researchers overcome these barriers. DataUp helps scientists to (1) determine whether their file is CSV compatible, (2) generate metadata in a standard format, (3) retrieve an identifier to facilitate data citation, and (4) deposit their data into a repository. With funding from the NSF via a supplemental grant to the DataONE project, we are working to improve upon DataUp. Our main goal for DataUp 2.0 is to ensure organizations and repositories are able to adopt and adapt DataUp to meet their unique needs, including connecting to analytical tools, adding new metadata schema, and expanding the list of connected data repositories. DataUp is a collaborative project between the California Digital Library, DataONE, the San Diego Supercomputer Center, and Microsoft Research Connections.
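
    A minimal sketch of a "CSV compatibility" check in the spirit of DataUp, not its actual implementation; the rules below (parsable text, non-empty, consistent column count) are illustrative assumptions:

        import csv

        def is_csv_compatible(path: str) -> tuple:
            """Return (ok, reason) from a simple structural check of a tabular file."""
            try:
                with open(path, newline="", encoding="utf-8") as fh:
                    rows = list(csv.reader(fh))
            except (OSError, UnicodeDecodeError, csv.Error) as exc:
                return False, f"could not parse file: {exc}"
            if not rows:
                return False, "file is empty"
            width = len(rows[0])  # the header row defines the expected column count
            for i, row in enumerate(rows[1:], start=2):
                if len(row) != width:
                    return False, f"row {i} has {len(row)} columns, expected {width}"
            return True, "file looks CSV compatible"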

  13. DOE Office of Scientific and Technical Information (OSTI.GOV)

    NONE

    The Quarterly Environmental Data Summary (QEDS) for the fourth quarter of 1997 is prepared in support of the Weldon Spring Site Remedial Action Project Federal Facilities Agreement. The data presented constitute the QEDS. The data were received from the contract laboratories, verified by the Weldon Spring Site verification group and, except for air monitoring data and site KPA-generated data (uranium analyses), merged into the data base during the fourth quarter of 1997. Air monitoring data presented are the most recent complete sets of quarterly data. Air data are not stored in the data base, and KPA data are not merged into the regular data base. Significant data, defined as data values that have exceeded defined "above normal" level 2 values, are discussed in this letter for Environmental Monitoring Plan (EMP) generated data only. Above normal level 2 values are based, in ES and H procedures, on historical high values, DOE Derived Concentration Guides (DCGs), NPDES limits and other guidelines. The procedures also establish actions to be taken in response to such data. Data received and verified during the fourth quarter were within a permissible range of variability except for those that are detailed.

  14. 78 FR 60003 - Self-Regulatory Organizations; Financial Industry Regulatory Authority, Inc.; Notice of Filing...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-09-30

    ... month, per display application per Data Set of Real-Time TRACE transaction data. The fee waiver... data available in three Data Sets--the Corporate Bond Data Set, the Agency Data Set and the ABS Data Set. A fourth Data Set, the Rule 144A Data Set, will become available in 2014. See Securities Exchange...

  15. Attaining and maintaining data integrity with configuration management

    NASA Astrophysics Data System (ADS)

    Huffman, Dorothy J.; Jeane, Shirley A.

    1993-08-01

    Managers and scientists are concerned about data integrity because they draw conclusions from data that can have far-reaching effects. Project managers use Configuration Management to ensure that hardware, software, and project information are controlled. They have not, as yet, applied it rigorously to data. However, there is ample opportunity in the data collection and production process to jeopardize data integrity. Environmental changes, tampering and production problems can all affect data integrity. There are four functions included in the Configuration Management process: configuration identification, control, auditing and status accounting. These functions provide management with the means to attain data integrity and the visibility into engineering processes needed to maintain data integrity. When project managers apply Configuration Management processes to data, the data user can trace back through history to validate data integrity. The user knows that the project allowed only orderly changes to the data, is assured that project personnel followed procedures to maintain data quality, and has access to status information about the data. The user receives data products with a known integrity level and a means to assess the impact of past events on the conclusions derived from the data. To obtain these benefits, project managers should apply the Configuration Management discipline to data.

  16. Metadata Repository for Improved Data Sharing and Reuse Based on HL7 FHIR.

    PubMed

    Ulrich, Hannes; Kock, Ann-Kristin; Duhm-Harbeck, Petra; Habermann, Jens K; Ingenerf, Josef

    2016-01-01

    Unreconciled data structures and formats are a common obstacle to the urgently required sharing and reuse of data within healthcare and medical research. Within the North German Tumor Bank of Colorectal Cancer, clinical and sample data, based on a harmonized data set, are collected and can be pooled by using a hospital-integrated Research Data Management System supporting biobank and study management. Adding further partners who are not using the core data set requires manual adaptation and mapping of data elements. To reduce this manual intervention and to foster the reuse of heterogeneous healthcare instance data (value level) and data elements (metadata level), a metadata repository has been developed. The metadata repository is an ISO 11179-3 conformant server application built for annotating and mediating data elements. The implemented architecture includes the translation of metadata information about data elements into the FHIR standard, using the FHIR DataElement resource with the ISO 11179 Data Element Extensions. The FHIR-based processing allows exchange of data elements with clinical and research IT systems as well as with other metadata systems. With increasingly annotated and harmonized data elements, data quality and integration can be improved, successfully enabling data analytics and decision support.
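
    As a rough sketch of what exchanging an annotated data element might look like, the snippet below builds a simplified FHIR (STU3) DataElement resource as JSON; the identifier, element path, and definition are hypothetical placeholders rather than the repository's actual profile or ISO 11179 extension content:

        import json

        data_element = {
            "resourceType": "DataElement",
            "id": "tumor-size",            # hypothetical identifier
            "status": "active",
            "name": "TumorSize",
            "element": [
                {
                    "path": "TumorSize",
                    "definition": "Largest diameter of the primary tumour in millimetres.",
                    "type": [{"code": "Quantity"}],
                }
            ],
        }

        # Exchange with clinical and research systems typically happens as JSON
        # over a FHIR REST interface.
        print(json.dumps(data_element, indent=2))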

  17. Pilot climate data system user's guide

    NASA Technical Reports Server (NTRS)

    Reph, M. G.; Treinish, L. A.; Bloch, L.

    1984-01-01

    Instructions for using the Pilot Climate Data System (PCDS), an interactive, scientific data management system for locating, obtaining, manipulating, and displaying climate-research data, are presented. The PCDS currently provides this support for approximately twenty data sets. Figures that illustrate the terminal displays which a user sees when he/she runs the PCDS and some examples of the output from this system are included. The capabilities, which are described in detail, allow a user to perform the following: (1) obtain comprehensive descriptions of a number of climate parameter data sets and the associated sensor measurements from which they were derived; (2) obtain detailed information about the temporal coverage and data volume of data sets which are readily accessible via the PCDS; (3) extract portions of a data set using criteria such as time range and geographic location, and output the data to tape, user terminal, system printer, or online disk files in a special data-set-independent format; (4) access and manipulate the data in these data-set-independent files, performing such functions as combining the data, subsetting the data, and averaging the data; and (5) create various graphical representations of the data stored in the data-set-independent files.
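
    A minimal sketch of capabilities (3) and (4) above, subsetting by time range and geographic region and then averaging, assuming a flat table with time, lat, lon, and value columns; the column names and file are illustrative, not the PCDS data-set-independent format:

        import pandas as pd

        def subset_and_average(path: str) -> pd.Series:
            data = pd.read_csv(path, parse_dates=["time"])
            # Subset by time range and a geographic bounding box.
            window = data[
                (data["time"] >= "1980-01-01") & (data["time"] < "1981-01-01")
                & data["lat"].between(-30, 30) & data["lon"].between(120, 180)
            ]
            # Average the selected values by month.
            return window.groupby(window["time"].dt.to_period("M"))["value"].mean()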

  18. Open Access to Geophysical Data

    NASA Astrophysics Data System (ADS)

    Sergeyeva, Nataliya A.; Zabarinskaya, Ludmila P.

    2017-04-01

    Russian World Data Centers for Solar-Terrestrial Physics & Solid Earth Physics, hosted by the Geophysical Center of the Russian Academy of Sciences, are Regular Members of the ICSU World Data System. Guided by the principles of the WDS Constitution and the WDS Data Sharing Principles, the WDCs provide full and open access to data, long-term data stewardship, compliance with agreed-upon data standards and conventions, and mechanisms to facilitate and improve access to data. Historical and current geophysical data on different media, in the form of digital data sets, analog records, collections of maps, and descriptions, are stored and collected in the Centers. The WDCs regularly add new data to their repositories and databases and keep them up to date. The WDCs now focus on four new projects, aimed at increasing the amount of data available online through retrospective data collection and digital preservation; creating a modern system for registering and publishing data with digital object identifier (DOI) assignment, and promoting a data citation culture; creating databases instead of file systems for more convenient access to data; and participating in the WDS Metadata Catalogue and Data Portal by creating metadata for the WDCs' information resources.

  19. [Infrastructure and contents of clinical data management plan].

    PubMed

    Shen, Tong; Xu, Lie-dong; Fu, Hai-jun; Liu, Yan; He, Jia; Chen, Ping-yan; Song, Yu-fei

    2015-11-01

    Establishment of a quality management system (QMS) plays a critical role in clinical data management (CDM). The objectives of CDM are to ensure the quality and integrity of trial data. Thus, every stage or element that may affect the quality outcomes of clinical studies should be kept under control; this spans the full life cycle of CDM, from the collection and handling of trial data through to statistical analysis. Based on the QMS, this paper provides consensus on how to develop a compliant clinical data management plan (CDMP). According to the essential requirements of CDM, the CDMP should encompass each process of data collection, data capture and cleaning, medical coding, data verification and reconciliation, database monitoring and management, external data transmission and integration, data documentation, and data quality assurance. Creating a data management plan and following it up at each designed data management step, dynamically recording the systems used, actions taken, and parties involved, will build and confirm regulated data management processes, standard operating procedures, and effective quality metrics for all data management activities. The CDMP is one of the most important data management documents and is the solid foundation for clinical data quality.

  20. Ascertaining severe perineal trauma and associated risk factors by comparing birth data with multiple sources.

    PubMed

    Ampt, Amanda J; Ford, Jane B

    2015-09-30

    Population data are often used to monitor severe perineal trauma trends and investigate risk factors. Within New South Wales (NSW), two different datasets can be used, the Perinatal Data Collection ('birth' data) or a linked dataset combining birth data with the Admitted Patient Data Collection ('hospital' data). Severe perineal trauma can be ascertained by birth data alone, or by hospital International Classification of Diseases Australian Modification (ICD-10-AM) diagnosis and procedure coding in the linked dataset. The aim of this study was to compare rates and risk factors for severe perineal trauma using birth data alone versus using linked data. The study population consisted of all vaginal births in NSW between 2001 and 2011. Perineal injury coding in birth data was revised in 2006, so data were analysed separately for 2001-06 and 2006-11. Rates of severe perineal injury over time were compared in birth data alone versus linked data. Kappa and agreement statistics were calculated. Risk factor distributions (maternal age, primiparity, instrumental birth, birthweight ≥4 kg, Asian country of birth and episiotomy) were compared between women with severe perineal trauma identified by birth data alone, and those identified by linked data. Multivariable logistic regression was used to calculate the adjusted odds ratios (aORs) of severe perineal trauma. Among 697 202 women with vaginal births, 2.1% were identified with severe perineal trauma by birth data alone, and 2.6% by linked data. The rate discrepancy was higher among earlier data (1.7% for birth data, 2.4% for linked data). Kappa for earlier data was 0.78 (95% CI 0.78, 0.79), and 0.89 (95% CI 0.89, 0.89) for more recent data. With the exception of episiotomy, differences in risk factor distributions were small, with similar aORs. The aOR of severe perineal trauma for episiotomy was higher using linked data (1.33, 95% CI 1.27, 1.40) compared with birth data (1.02, 95% CI 0.97, 1.08). Although discrepancies in ascertainment of severe perineal trauma improved after revision of birth data coding in 2006, higher ascertainment by linked data was still evident for recent data. There were also higher risk estimates of severe perineal trauma with episiotomy by linked data than by birth data.
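
    A minimal sketch of the two analyses described above, agreement between ascertainment sources (Cohen's kappa) and adjusted odds ratios from multivariable logistic regression, run on a small synthetic table purely for illustration; the variable names echo the study but the numbers do not reproduce its results:

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf
        from sklearn.metrics import cohen_kappa_score

        rng = np.random.default_rng(0)
        n = 5000
        df = pd.DataFrame({
            "episiotomy": rng.integers(0, 2, n),
            "primiparous": rng.integers(0, 2, n),
        })
        # Synthetic outcome plus a second ascertainment source with imperfect agreement.
        df["trauma_linked"] = rng.binomial(1, 0.02 + 0.01 * df["episiotomy"])
        df["trauma_birth"] = np.where(rng.random(n) < 0.9, df["trauma_linked"], 0)

        # Agreement between the two data sources.
        print("kappa:", cohen_kappa_score(df["trauma_birth"], df["trauma_linked"]))

        # Exponentiated logistic-regression coefficients are the adjusted odds ratios.
        model = smf.logit("trauma_linked ~ episiotomy + primiparous", data=df).fit(disp=False)
        print(np.exp(model.params))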

  1. Supporting Data Stewardship Throughout the Data Life Cycle in the Solid Earth Sciences

    NASA Astrophysics Data System (ADS)

    Ferrini, V.; Lehnert, K. A.; Carbotte, S. M.; Hsu, L.

    2013-12-01

    Stewardship of scientific data is fundamental to enabling new data-driven research, and ensures preservation, accessibility, and quality of the data; yet researchers, especially in disciplines that typically generate and use small but complex, heterogeneous, and unstructured datasets, are challenged to fulfill increasing demands of properly managing their data. The IEDA Data Facility (www.iedadata.org) provides tools and services that support data stewardship throughout the full life cycle of observational data in the solid earth sciences, with a focus on the data management needs of individual researchers. IEDA builds upon and brings together over a decade of development and experiences of its component data systems, the Marine Geoscience Data System (MGDS, www.marine-geo.org) and EarthChem (www.earthchem.org). IEDA services include domain-focused data curation and synthesis, tools for data discovery, access, visualization and analysis, as well as investigator support services that include tools for data contribution, data publication services, and data compliance support. IEDA data synthesis efforts (e.g. PetDB and the Global Multi-Resolution Topography (GMRT) Synthesis) focus on data integration and analysis while emphasizing provenance and attribution. IEDA's domain-focused data catalogs (e.g. MGDS and the EarthChem Library) provide access to metadata-rich long-tail data complemented by extensive metadata, including attribution information and links to related publications. IEDA's visualization and analysis tools (e.g. GeoMapApp) broaden access to earth science data for domain specialists and non-specialists alike, facilitating both interdisciplinary research and education and outreach efforts. As a disciplinary data repository, a key role IEDA plays is to coordinate with its user community and to bridge the requirements and standards for data curation with both the evolving needs of its science community and emerging technologies. Development of IEDA tools and services is based first and foremost on the scientific needs of its user community. As data stewardship becomes a more integral component of the scientific workflow, IEDA investigator support services (e.g. the Data Management Plan Tool and Data Compliance Reporting Tool) continue to evolve with the goal of lessening the 'burden' of data management for individual investigators by increasing awareness and facilitating the adoption of data management practices. We will highlight a variety of IEDA system components that support investigators throughout the data life cycle, and will discuss lessons learned and future directions.

  2. The Role of Interdisciplinary GIS and Data Curation Librarians in Enhancing Authentic Scientific Research in the Classroom

    NASA Astrophysics Data System (ADS)

    Branch, B. D.; Fosmire, M.

    2012-12-01

    Data science is a recently evolved area of scientific inquiry in which data, often collected by others, are analyzed by independent investigators to draw new conclusions. As such, data literacy needs to be incorporated into authentic research activities. The earth sciences in particular have a trove of data that resides in national data centers as well as individual investigators' labs, which can be repurposed to provide the inputs for students to make their own inquiries into the data. With the amount of data available, students can make more substantive conclusions than if relying just on data they've collected themselves. A new scientific role is that of the data scientist or data curation specialist. This person understands best practices in data and knowledge management and can translate those skills into an environment appropriate for K-20 students and teachers. In particular, data curation specialists can transform raw data into audience-appropriate data that can be re-used. First, appropriate research data can be located, as well as foundational or baseline data (topography, political maps, etc.), and that data needs to be converted (or directions for conversion supplied) so that it can be ingested into the processing system used for the activity. Furthermore, data needs to be organized, especially as it is processed by students, and multiple versions of data created. Data also should be appropriately annotated to allow for effective sharing among students and determining reproducibility of the data. Finally, appropriate visualization of the data can be facilitated by a data curation specialist. To provide a concrete example, one of the authors developed a data-driven authentic research project for a group of middle school students looking at water quality in a North Carolina community. Students needed to find relevant hydrologic, environmental, and political data as inputs for their project. They then collected local data to add to the standard data, so they could build a profile of water quality over time. Once the data had been appropriately collected, processed, and added, students could then develop queries to run against the data to evaluate their research questions. Simple statistical analysis was then run to determine the validity of their conclusions, and finally, presentations were developed to explain their results. Furthermore, students were empowered to connect the results of the research project to suggest policy changes for their community.

  3. Analysis of Human Mobility Based on Cellular Data

    NASA Astrophysics Data System (ADS)

    Arifiansyah, F.; Saptawati, G. A. P.

    2017-01-01

    Nowadays not only adults but even teenagers and children have their own mobile phones. This phenomenon indicates that the mobile phone has become an important part of everyday life. As a result, the amount of cellular data has also increased rapidly. Cellular data are defined as the data that record communication among mobile phone users. Cellular data are easy to obtain because telecommunications companies already record them for their billing systems. Billing data keep a log of each user's cellular activity over time, and from these data we can obtain information about communication between users. Through data visualization, interesting patterns can be seen in the raw cellular data, so that users gain prior knowledge before performing data analysis. Cellular data can be processed with data mining to discover human mobility patterns in the existing data. In this paper, we use frequent pattern mining and association rules to observe the relations between attributes in cellular data and then visualize them. We used Weka for the data mining stage. In general, the use of cellular data can provide supporting information for the decision-making process and supply the solutions and information needed by decision makers.
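
    The abstract describes mining billing records with Weka; the sketch below shows the same frequent-pattern and association-rule idea in Python with the mlxtend library instead, on a tiny, purely illustrative transaction list (the attribute names are assumptions):

        import pandas as pd
        from mlxtend.frequent_patterns import apriori, association_rules
        from mlxtend.preprocessing import TransactionEncoder

        # Each "transaction" lists categorical attributes of one billing record.
        transactions = [
            ["weekday", "daytime", "cell_A", "voice"],
            ["weekday", "daytime", "cell_A", "sms"],
            ["weekend", "evening", "cell_B", "voice"],
            ["weekday", "daytime", "cell_A", "voice"],
        ]

        encoder = TransactionEncoder()
        onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                              columns=encoder.columns_)

        itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
        rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
        print(rules[["antecedents", "consequents", "support", "confidence"]])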

  4. An overview on integrated data system for archiving and sharing marine geology and geophysical data in Korea Institute of Ocean Science & Technology (KIOST)

    NASA Astrophysics Data System (ADS)

    Choi, Sang-Hwa; Kim, Sung Dae; Park, Hyuk Min; Lee, SeungHa

    2016-04-01

    We have established and operate an integrated data system for managing, archiving and sharing the marine geology and geophysical data around Korea produced by various research projects and programs at the Korea Institute of Ocean Science & Technology (KIOST). First of all, to keep the data system consistent through continuous data updates, we set up standard operating procedures (SOPs) for data archiving, data processing and conversion, data quality control, data uploading, database maintenance, and so on. The system comprises two databases, ARCHIVE DB and GIS DB. ARCHIVE DB stores archived data in the original forms and formats supplied by data providers, while GIS DB manages all other compilation, processed and derived data and information for data services and GIS application services. Oracle 11g was adopted as the relational database management system, and open-source GIS technologies were applied for the GIS services: OpenLayers for the user interface, GeoServer for the application server, and PostGIS on PostgreSQL for the GIS database. For convenient use of geophysical data in SEG-Y format, a viewer program was developed and embedded in the system. Users can search data through the GIS user interface and save the results as a report.
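
    A minimal sketch of the kind of spatial search the GIS DB supports, issued from Python with psycopg2 against PostGIS; the connection string, table, and column names are hypothetical placeholders rather than KIOST's actual schema:

        import psycopg2

        query = """
            SELECT survey_id, data_type, file_path
            FROM geophysical_tracklines
            WHERE ST_Intersects(geom, ST_MakeEnvelope(%s, %s, %s, %s, 4326));
        """

        with psycopg2.connect("dbname=gisdb user=reader") as conn:
            with conn.cursor() as cur:
                # Bounding box (lon_min, lat_min, lon_max, lat_max) around Korea.
                cur.execute(query, (124.0, 33.0, 132.0, 39.0))
                for survey_id, data_type, file_path in cur.fetchall():
                    print(survey_id, data_type, file_path)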

  5. Oceanographic Data in Europe: Minimal Effort for Data Providers, Maximal Ease of Use and Access for Data Users

    NASA Astrophysics Data System (ADS)

    De Bruin, T.

    2017-12-01

    SeaDataCloud/SeaDataNet (SDC/SDN) is a consortium and a data infrastructure as well as a series of European oceanographic data management projects, allowing data providers to store data at a data centre of their choice (usually a National Oceanographic Data Centre), while exposing and making the data available for download via a chain of interconnected data portals at local, regional, pan-European and global levels. SDC/SDN as an infrastructure connects over 100 data centres from 35 countries in and around Europe. The infrastructure has been operational since early 2009 and provides the user an overview of all available data as well as the possibility to download the data in a uniform format. This presentation will give a short introduction to the SDC/SDN infrastructure and describe how its development was based on sound data management principles. The emphasis will be on how the system is interconnected with other, non-discipline-specific (metadata) portals such as the Group on Earth Observations System of Systems (GEOSS), allowing oceanographic data stored at a local level in a data centre to be exposed at a global level to a wide audience from various disciplines.

  6. Apparatus And Method For Reconstructing Data Using Cross-Parity Stripes On Storage Media

    DOEpatents

    Hughes, James Prescott

    2003-06-17

    An apparatus and method for reconstructing missing data using cross-parity stripes on a storage medium is provided. The apparatus and method may operate on data symbols having sizes greater than a data bit. The apparatus and method makes use of a plurality of parity stripes for reconstructing missing data stripes. The parity symbol values in the parity stripes are used as a basis for determining the value of the missing data symbol in a data stripe. A correction matrix is shifted along the data stripes, correcting missing data symbols as it is shifted. The correction is performed from the outside data stripes towards the inner data stripes to thereby use previously reconstructed data symbols to reconstruct other missing data symbols.
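
    The basic idea can be illustrated with a single parity stripe: the parity is the XOR of the data stripes, so any one missing stripe can be rebuilt from the others. The sketch below shows only this simplified case, not the patent's cross-parity scheme with multiple parity stripes and larger symbols:

        def make_parity(stripes: list) -> bytes:
            """XOR all data stripes together to form the parity stripe."""
            parity = bytearray(len(stripes[0]))
            for stripe in stripes:
                for i, byte in enumerate(stripe):
                    parity[i] ^= byte
            return bytes(parity)

        def reconstruct(stripes: list, parity: bytes) -> bytes:
            """Rebuild the single missing stripe (marked None) from the parity."""
            missing = bytearray(parity)
            for stripe in stripes:
                if stripe is not None:
                    for i, byte in enumerate(stripe):
                        missing[i] ^= byte
            return bytes(missing)

        data = [b"stripe-0", b"stripe-1", b"stripe-2"]
        parity = make_parity(data)
        print(reconstruct([data[0], None, data[2]], parity))  # b'stripe-1'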

  7. Bad data packet capture device

    DOEpatents

    Chen, Dong; Gara, Alan; Heidelberger, Philip; Vranas, Pavlos

    2010-04-20

    An apparatus and method for capturing data packets for analysis on a network computing system includes a sending node and a receiving node connected by a bi-directional communication link. The sending node sends a data transmission to the receiving node on the bi-directional communication link, and the receiving node receives the data transmission and verifies the data transmission to determine valid data and invalid data and verify retransmissions of invalid data as corresponding valid data. A memory device communicates with the receiving node for storing the invalid data and the corresponding valid data. A computing node communicates with the memory device and receives and performs an analysis of the invalid data and the corresponding valid data received from the memory device.

  8. Data pre-processing in record linkage to find the same companies from different databases

    NASA Astrophysics Data System (ADS)

    Gunawan, D.; Lubis, M. S.; Arisandi, D.; Azzahry, B.

    2018-03-01

    As public agencies, the Badan Pelayanan Perizinan Terpadu (BPPT) and the Badan Lingkungan Hidup (BLH) of Medan city manage the process of issuing business licenses to the public. However, each agency may hold different company data because of separate data-entry processes, even though the records may refer to the same company. Therefore, it is necessary to identify and correlate records that refer to the same company but lie in different data sources. This research focuses on data pre-processing, including data cleaning, text pre-processing, indexing and record comparison. In addition, this research implements data matching using a support vector machine algorithm. The results of this algorithm are used for record linkage, identifying and connecting company records based on their degree of similarity. The data are first standardized to the format and structure appropriate for the pre-processing stage. After analyzing the data pre-processing, we found that neither database structure was designed to support data integration. We determined that data matching can be done with blocking criteria such as the company name and the name of the owner (or applicant). The classification identified 90 pairs of records with a high level of similarity.
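
    A minimal sketch of the cleaning, blocking, and pairwise-comparison steps described above; the study classifies candidate pairs with an SVM, whereas a simple string-similarity threshold stands in for it here, and all records, names, and thresholds are purely illustrative:

        from difflib import SequenceMatcher

        def clean(name: str) -> str:
            """Lower-case, strip punctuation, and collapse whitespace."""
            return " ".join(name.lower().replace(".", " ").split())

        bppt = [{"id": 1, "company": "PT. Maju Jaya", "owner": "Budi"},
                {"id": 2, "company": "CV Sumber Rezeki", "owner": "Sari"}]
        blh = [{"id": "A", "company": "PT Maju Jaya", "owner": "Budi"},
               {"id": "B", "company": "CV Tunas Baru", "owner": "Andi"}]

        matches = []
        for a in bppt:
            for b in blh:
                # Blocking: only compare records whose cleaned names share a first token.
                if clean(a["company"]).split()[0] != clean(b["company"]).split()[0]:
                    continue
                score = SequenceMatcher(None, clean(a["company"]), clean(b["company"])).ratio()
                if score > 0.85:
                    matches.append((a["id"], b["id"], round(score, 2)))

        print(matches)  # candidate pairs judged to refer to the same company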

  9. DataUp: Helping manage and archive data within the researcher's workflow

    NASA Astrophysics Data System (ADS)

    Strasser, C.

    2012-12-01

    There are many barriers to data management and sharing among earth and environmental scientists; among the most significant are a lack of knowledge about best practices for data management, metadata standards, and appropriate data repositories for archiving and sharing data. We have developed an open-source add-in for Excel and an open-source web application intended to help researchers overcome these barriers. DataUp helps scientists to (1) determine whether their file is CSV compatible, (2) generate metadata in a standard format, (3) retrieve an identifier to facilitate data citation, and (4) deposit their data into a repository. The researcher does not need a prior relationship with a data repository to use DataUp; the newly implemented ONEShare repository, a DataONE member node, is available for any researcher to archive and share their data. By meeting researchers where they already work, in spreadsheets, DataUp becomes part of the researcher's workflow, and data management and sharing become easier. Future enhancement of DataUp will rely on members of the community adopting and adapting the DataUp tools to meet their unique needs, including connecting to analytical tools, adding new metadata schema, and expanding the list of connected data repositories. DataUp is a collaborative project between Microsoft Research Connections, the University of California's California Digital Library, the Gordon and Betty Moore Foundation, and DataONE.

  10. Multi-protocol header generation system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roberts, David A.; Ignatowski, Michael; Jayasena, Nuwan

    A communication device includes a data source that generates data for transmission over a bus, and a data encoder that receives and encodes the outgoing data. An encoder system receives outgoing data from the data source and stores it in a first queue. An encoder encodes the outgoing data with a header type based upon a header-type indication from a controller and stores the encoded data, which may be a packet or a data word with at least one layered header, in a second queue for transmission. The device is configured to receive, at a payload extractor, a packet protocol change command from the controller, to remove the encoded data, and to re-encode the data to create a re-encoded data packet, placing the re-encoded data packet in the second queue for transmission.

  11. SeaDataCloud - further developing the pan-European SeaDataNet infrastructure for marine and ocean data management

    NASA Astrophysics Data System (ADS)

    Schaap, Dick M. A.; Fichaut, Michele

    2017-04-01

    SeaDataCloud marks the third phase of developing the pan-European SeaDataNet infrastructure for marine and ocean data management. The SeaDataCloud project is funded by the EU and runs for 4 years from 1st November 2016. It succeeds the successful SeaDataNet II (2011 - 2015) and SeaDataNet (2006 - 2011) projects. SeaDataNet has set up and operates a pan-European infrastructure for managing marine and ocean data and is undertaken by National Oceanographic Data Centres (NODCs) and oceanographic data focal points from 34 coastal states in Europe. The infrastructure comprises a network of interconnected data centres and the central SeaDataNet portal. The portal provides users with a harmonised set of metadata directories and controlled access to the large collections of datasets managed by the interconnected data centres. The population of the directories has increased considerably through cooperation with and involvement in many associated EU projects and initiatives such as EMODnet. SeaDataNet at present gives overview of and access to more than 1.9 million data sets for physical oceanography, chemistry, geology, geophysics, bathymetry and biology from more than 100 connected data centres in 34 countries riparian to European seas. SeaDataNet is also active in setting and governing marine data standards, and in exploring and establishing interoperability solutions to connect to other e-infrastructures on the basis of ISO (19115, 19139) and OGC (WMS, WFS, CS-W and SWE) standards. Standards and associated SeaDataNet tools are made available at the SeaDataNet portal for wide uptake by data handling and managing organisations. SeaDataCloud aims at further developing standards, innovating services & products, adopting new technologies, and giving more attention to users. Moreover, it is about implementing a cooperation between the SeaDataNet consortium of marine data centres and the EUDAT consortium of e-infrastructure service providers. SeaDataCloud aims at considerably advancing services and increasing their usage by adopting cloud and High Performance Computing technology. SeaDataCloud will empower researchers with a packaged collection of services and tools, tailored to their specific needs, supporting research and enabling generation of added-value products from marine and ocean data. Substantial activities will be focused on developing added-value services, such as data subsetting, analysis, visualisation, and publishing workflows for users, both regular and advanced, as part of a Virtual Research Environment (VRE). SeaDataCloud targets a number of leading user communities that pose new challenges for upgrading and expanding the SeaDataNet standards and services: science, EMODnet, the Copernicus Marine Environmental Monitoring Service (CMEMS), EuroGOOS, and international scientific programmes. The presentation will give information on the present services of the SeaDataNet infrastructure, the new challenges in SeaDataCloud, and a number of key achievements in SeaDataCloud so far.

  12. Simulation of EO-1 Hyperion Data from ALI Multispectral Data Based on the Spectral Reconstruction Approach

    PubMed Central

    Liu, Bo; Zhang, Lifu; Zhang, Xia; Zhang, Bing; Tong, Qingxi

    2009-01-01

    Data simulation is widely used in remote sensing to produce imagery for a new sensor in the design stage, for scale issues in some special applications, or for testing novel algorithms. Hyperspectral data can provide more abundant information than traditional multispectral data and thus greatly extend the range of remote sensing applications. Unfortunately, hyperspectral data are much more difficult and expensive to acquire and were not available prior to the development of operational hyperspectral instruments, while large amounts of multispectral data have been accumulated around the world over the past several decades. Therefore, it is reasonable to examine means of using these multispectral data to simulate or construct hyperspectral data, especially in situations where hyperspectral data are necessary but hard to acquire. Here, a method based on spectral reconstruction is proposed to simulate hyperspectral data (Hyperion data) from multispectral Advanced Land Imager (ALI) data. This method involves extracting the inherent information of the source data and reassigning it to the newly simulated data. A total of 106 bands of Hyperion data were simulated from ALI data covering the same area. To evaluate this method, we compare the simulated and original Hyperion data by visual interpretation, statistical comparison, and classification. The results generally showed good performance of this method and indicated that most bands were well simulated, with the information both preserved and presented well. This makes it possible to simulate hyperspectral data from multispectral data for testing the performance of algorithms, extending the use of multispectral data and helping the design of virtual sensors. PMID:22574064
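
    As a simplified stand-in for the spectral reconstruction approach (not the paper's actual algorithm), the sketch below learns a linear least-squares mapping from a few broad bands to many narrow bands using paired training spectra, then applies it to new multispectral pixels; all array shapes and data are illustrative:

        import numpy as np

        rng = np.random.default_rng(1)
        n_train, n_multi, n_hyper = 200, 9, 106   # ALI-like vs Hyperion-like band counts

        # Paired training spectra (in practice from a spectral library or coincident
        # imagery); synthetic here.
        hyper_train = rng.random((n_train, n_hyper))
        multi_train = hyper_train @ rng.random((n_hyper, n_multi)) / n_hyper

        # Fit a mapping M so that multi_train @ M approximates hyper_train.
        M, *_ = np.linalg.lstsq(multi_train, hyper_train, rcond=None)

        # Apply the mapping to new multispectral pixels to simulate hyperspectral ones.
        multi_scene = rng.random((5, n_multi))
        hyper_simulated = multi_scene @ M
        print(hyper_simulated.shape)  # (5, 106)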

  13. Big Data and medicine: a big deal?

    PubMed

    Mayer-Schönberger, V; Ingelsson, E

    2018-05-01

    Big Data promises huge benefits for medical research. Looking beyond superficial increases in the amount of data collected, we identify three key areas where Big Data differs from conventional analyses of data samples: (i) data are captured more comprehensively relative to the phenomenon under study; this reduces some bias but surfaces important trade-offs, such as between data quantity and data quality; (ii) data are often analysed using machine learning tools, such as neural networks rather than conventional statistical methods resulting in systems that over time capture insights implicit in data, but remain black boxes, rarely revealing causal connections; and (iii) the purpose of the analyses of data is no longer simply answering existing questions, but hinting at novel ones and generating promising new hypotheses. As a consequence, when performed right, Big Data analyses can accelerate research. Because Big Data approaches differ so fundamentally from small data ones, research structures, processes and mindsets need to adjust. The latent value of data is being reaped through repeated reuse of data, which runs counter to existing practices not only regarding data privacy, but data management more generally. Consequently, we suggest a number of adjustments such as boards reviewing responsible data use, and incentives to facilitate comprehensive data sharing. As data's role changes to a resource of insight, we also need to acknowledge the importance of collecting and making data available as a crucial part of our research endeavours, and reassess our formal processes from career advancement to treatment approval. © 2017 The Association for the Publication of the Journal of Internal Medicine.

  14. Nursing Needs Big Data and Big Data Needs Nursing.

    PubMed

    Brennan, Patricia Flatley; Bakken, Suzanne

    2015-09-01

    Contemporary big data initiatives in health care will benefit from greater integration with nursing science and nursing practice; in turn, nursing science and nursing practice has much to gain from the data science initiatives. Big data arises secondary to scholarly inquiry (e.g., -omics) and everyday observations like cardiac flow sensors or Twitter feeds. Data science methods that are emerging ensure that these data be leveraged to improve patient care. Big data encompasses data that exceed human comprehension, that exist at a volume unmanageable by standard computer systems, that arrive at a velocity not under the control of the investigator and possess a level of imprecision not found in traditional inquiry. Data science methods are emerging to manage and gain insights from big data. The primary methods included investigation of emerging federal big data initiatives, and exploration of exemplars from nursing informatics research to benchmark where nursing is already poised to participate in the big data revolution. We provide observations and reflections on experiences in the emerging big data initiatives. Existing approaches to large data set analysis provide a necessary but not sufficient foundation for nursing to participate in the big data revolution. Nursing's Social Policy Statement guides a principled, ethical perspective on big data and data science. There are implications for basic and advanced practice clinical nurses in practice, for the nurse scientist who collaborates with data scientists, and for the nurse data scientist. Big data and data science has the potential to provide greater richness in understanding patient phenomena and in tailoring interventional strategies that are personalized to the patient. © 2015 Sigma Theta Tau International.

  15. Restoration of Apollo Data for Future Lunar Exploration

    NASA Astrophysics Data System (ADS)

    Schultz, Alfred B.; Williams, D. R.; Hills, H. K.

    2007-10-01

    The Lunar Data Project (LDP) at NASA's National Space Science Data Center (NSSDC) is retrieving and restoring relevant, scientifically important Apollo data into accessible digital form for use by researchers and mission planners. Much of the Apollo data housed at the NSSDC are in forms which are not readily usable, such as microfilm, hardcopy, and magnetic tapes written using machine representations of computers no longer in use. The LDP has prioritized these data based on scientific and engineering value and the level of effort required, and is in the process of restoring these data collections. In association with the Planetary Data System (PDS), the restored data are converted into a standard format and subjected to a data peer review before ingestion into the PDS. The Apollo 12 and 15 Solar Wind Spectrometer data have been restored and are awaiting data review. The Apollo 14 and 15 ALSEP Cold Cathode Ion Gage data have been scanned, the Apollo 14 Dust, Thermal, and Radiation Engineering Measurements data are in the process of being scanned, and the Apollo 14 Charged Particle Lunar Environment Experiment data have been retrieved from magnetic tape. Optical character recognition software to produce digital tables of the scanned data, where appropriate, is under development. These data represent some of the only long-term lunar surface environment information that exists. We will report on our progress. Metadata, ancillary information to aid in the use and understanding of the data, will be included in these online data collections. These cover complete descriptions of the data sets, formats, processing history, relevant references and contacts, and instrument descriptions. Restored data and associated metadata are posted online and easily accessible to interested users. The data sets and more information on the LDP can be found at nssdc.gsfc.nasa.gov/planetary/lunar/lunar_data/

  16. Public-Private Partnership: Joint recommendations to improve downloads of large Earth observation data

    NASA Astrophysics Data System (ADS)

    Ramachandran, R.; Murphy, K. J.; Baynes, K.; Lynnes, C.

    2016-12-01

    With the volume of Earth observation data expanding rapidly, cloud computing is quickly changing the way Earth observation data are processed, analyzed, and visualized. The cloud infrastructure provides the flexibility to scale up to large volumes of data and handle high-velocity data streams efficiently. Having freely available Earth observation data collocated on a cloud infrastructure creates opportunities for innovation and value-added data re-use in ways unforeseen by the original data provider. These innovations spur new industries and applications and spawn new scientific pathways that were previously limited by data volume and computational infrastructure issues. NASA, in collaboration with Amazon, Google, and Microsoft, has jointly developed a set of recommendations to enable efficient transfer of Earth observation data from existing data systems to a cloud computing infrastructure. The purpose of these recommendations is to provide guidelines against which all data providers can evaluate existing data systems, and which can be used to address any issues uncovered so as to enable efficient search, access, and use of large volumes of data. Additionally, these guidelines ensure that all cloud providers utilize a common methodology for bulk-downloading data from data providers, thus sparing data providers from building custom capabilities to meet the needs of individual cloud providers. The intent is to share these recommendations with other Federal agencies and organizations that serve Earth observation data to enable efficient search, access, and use of large volumes of data. Additionally, the adoption of these recommendations will benefit data users interested in moving large volumes of data from data systems to any other location. These data users include the cloud providers, cloud users such as scientists, and other users working in high-performance computing environments who need to move large volumes of data.

  17. Real-time GIS data model and sensor web service platform for environmental data management.

    PubMed

    Gong, Jianya; Geng, Jing; Chen, Zeqiang

    2015-01-09

    Effective environmental data management is meaningful for human health. In the past, environmental data management involved developing a specific environmental data management system, but this method often lacks real-time data retrieval and sharing/interoperation capability. With the development of information technology, a Geospatial Service Web method is proposed that can be employed for environmental data management. The purpose of this study is to determine a method to realize environmental data management under the Geospatial Service Web framework. A real-time GIS (Geographic Information System) data model and a Sensor Web service platform to realize environmental data management under the Geospatial Service Web framework are proposed in this study. The real-time GIS data model manages real-time data. The Sensor Web service platform is applied to support the realization of the real-time GIS data model based on Sensor Web technologies. To support the realization of the proposed real-time GIS data model, a Sensor Web service platform is implemented. Real-time environmental data, such as meteorological data, air quality data, soil moisture data, soil temperature data, and landslide data, are managed in the Sensor Web service platform. In addition, two use cases, real-time air quality monitoring and real-time soil moisture monitoring based on the real-time GIS data model in the Sensor Web service platform, are realized and demonstrated. The total times of the two experiments are 3.7 s and 9.2 s. The experimental results show that the method integrating the real-time GIS data model and the Sensor Web service platform is an effective way to manage environmental data under the Geospatial Service Web framework.

  18. CB4-03: An Eye on the Future: A Review of Data Virtualization Techniques to Improve Research Analytics

    PubMed Central

    Richter, Jack; McFarland, Lela; Bredfeldt, Christine

    2012-01-01

    Background/Aims Integrating data across systems can be a daunting process. The traditional method of moving data to a common location, mapping fields with different formats and meanings, and performing data cleaning activities to ensure valid and reliable integration across systems can be both expensive and extremely time consuming. As the scope of needed research data increases, the traditional methodology may not be sustainable. Data Virtualization provides an alternative to traditional methods that may reduce the effort required to integrate data across disparate systems. Objective Our goal was to survey new methods in data integration, cloud computing, enterprise data management and virtual data management for opportunities to increase the efficiency of producing VDW and similar data sets. Methods Kaiser Permanente Information Technology (KPIT), in collaboration with the Mid-Atlantic Permanente Research Institute (MAPRI), reviewed methodologies in the burgeoning field of Data Virtualization. We identified potential strengths and weaknesses of new approaches to data integration. For each method, we evaluated its potential application for producing effective research data sets. Results Data Virtualization provides opportunities to reduce the amount of data movement required to integrate data sources on different platforms in order to produce research data sets. Additionally, Data Virtualization includes methods for managing “fuzzy” matching used to match fields known to have poor reliability, such as names, addresses and social security numbers. These methods could improve the efficiency of integrating state and federal data, such as patient race, death, and tumors, with internal electronic health record data. Discussion The emerging field of Data Virtualization has considerable potential for increasing the efficiency of producing research data sets. An important next step will be to develop a proof-of-concept project that will help us understand the benefits and drawbacks of these techniques.

  19. 15 CFR 718.1 - Definition.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ...(g)(1) and 304(e)(2) of the Act and other trade secrets as follows: (a) Financial data; (b) Sales and marketing data (other than shipment data); (c) Pricing data; (d) Personnel data; (e) Research data; (f) Patent data; (g) Data maintained for compliance with environmental or occupational health and safety...

  20. 15 CFR 718.1 - Definition.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ...(g)(1) and 304(e)(2) of the Act and other trade secrets as follows: (a) Financial data; (b) Sales and marketing data (other than shipment data); (c) Pricing data; (d) Personnel data; (e) Research data; (f) Patent data; (g) Data maintained for compliance with environmental or occupational health and safety...

  1. 15 CFR 718.1 - Definition.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ...(g)(1) and 304(e)(2) of the Act and other trade secrets as follows: (a) Financial data; (b) Sales and marketing data (other than shipment data); (c) Pricing data; (d) Personnel data; (e) Research data; (f) Patent data; (g) Data maintained for compliance with environmental or occupational health and safety...

  2. 15 CFR 718.1 - Definition.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ...(g)(1) and 304(e)(2) of the Act and other trade secrets as follows: (a) Financial data; (b) Sales and marketing data (other than shipment data); (c) Pricing data; (d) Personnel data; (e) Research data; (f) Patent data; (g) Data maintained for compliance with environmental or occupational health and safety...

  3. Big Data: You Are Adding to . . . and Using It

    ERIC Educational Resources Information Center

    Makela, Carole J.

    2016-01-01

    "Big data" prompts a whole lexicon of terms--data flow; analytics; data mining; data science; smart you name it (cars, houses, cities, wearables, etc.); algorithms; learning analytics; predictive analytics; data aggregation; data dashboards; digital tracks; and big data brokers. New terms are being coined frequently. Are we paying…

  4. 14 CFR 125.228 - Flight data recorders: filtered data.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 14 Aeronautics and Space 3 2011-01-01 2011-01-01 false Flight data recorders: filtered data. 125... Equipment Requirements § 125.228 Flight data recorders: filtered data. (a) A flight data signal is filtered... original sensor signal value can be reconstructed from the recorded data. This demonstration requires that...

  5. Enabling Data-as- a-Service (DaaS) - Biggest Challenge of Geoscience Australia

    NASA Astrophysics Data System (ADS)

    Bastrakova, I.; Kemp, C.; Car, N. J.

    2016-12-01

    Geoscience Australia (GA) is recognised and respected as the national repository and steward of multiple data collections of national significance, providing geoscience information, services and capability to the Australian Government, industry and stakeholders. Provision of Data-as-a-Service (DaaS) is both GA's key responsibility and its core business. Through the Science First Transformation Program, GA is undergoing a significant rethinking of its data architecture, curation and access to support the Digital Science capability, for which DaaS is both a dependency and an underpinning of its implementation. DaaS, being a service, means we can deliver its outputs in multiple ways, thus providing users with data on demand in ready-for-consumption forms. We can then reuse prebuilt data constructions to allow self-service integration of data underpinned by dynamic query tools. In GA's context, examples of DaaS are the Australian Geoscience Data Cube, the Foundation Spatial Data Framework and the data served through several Virtual Laboratories. We have implemented a three-layered architecture for DaaS in order to store and manage the data while honouring the semantics of the Scientific Data Models defined by subject matter experts and GA's Enterprise Data Architecture, as well as retaining that delivery flexibility. The foundation layer of DaaS is Canonical Datasets, which are optimised for long-term data stewardship and curation. Data are well structured, standardised, described and audited, and all data creation and editing happen within this layer. The middle Data Transformation layer assists with transforming data from the Canonical Datasets into the data integration layer; it provides mechanisms for multi-format and multi-technology data transformation. The top Data Integration layer is optimised for data access. Data can be easily reused and repurposed; the data formats made available are optimised for scientific computing and adjusted for access by multiple applications, tools and libraries. Moving to DaaS enables GA to increase data alertness, generate new capabilities and be prepared for emerging technological challenges.

  6. The National Center for Atmospheric Research (NCAR) Research Data Archive: a Data Education Center

    NASA Astrophysics Data System (ADS)

    Peng, G. S.; Schuster, D.

    2015-12-01

    The National Center for Atmospheric Research (NCAR) Research Data Archive (RDA), rda.ucar.edu, is not just another data center or data archive. It is a data education center. We not only serve data, we TEACH data. Weather and climate data is the original "Big Data" dataset and lessons learned while playing with weather data are applicable to a wide range of data investigations. Erroneous data assumptions are the Achilles heel of Big Data. It doesn't matter how much data you crunch if the data is not what you think it is. Each dataset archived at the RDA is assigned to a data specialist (DS) who curates the data. If a user has a question not answered in the dataset information web pages, they can call or email a skilled DS for further clarification. The RDA's diverse staff—with academic training in meteorology, oceanography, engineering (electrical, civil, ocean and database), mathematics, physics, chemistry and information science—means we likely have someone who "speaks your language." Data discovery is another difficult Big Data problem; one can only solve problems with data if one can find the right data. Metadata, both machine and human-generated, underpin the RDA data search tools. Users can quickly find datasets by name or dataset ID number. They can also perform a faceted search that successively narrows the options by user requirements or simply kick off an indexed search with a few words. Weather data formats can be difficult to read for non-expert users; it's usually packed in binary formats requiring specialized software and parameter names use specialized vocabularies. DSs create detailed information pages for each dataset and maintain lists of helpful software, documentation and links of information around the web. We further grow the level of sophistication of the users with tips, tutorials and data stories on the RDA Blog, http://ncarrda.blogspot.com/. How-to video tutorials are also posted on the NCAR Computational and Information Systems Laboratory (CISL) YouTube channel.

  7. Using Feedback from Data Consumers to Capture Quality Information on Environmental Research Data

    NASA Astrophysics Data System (ADS)

    Devaraju, A.; Klump, J. F.

    2015-12-01

    Data quality information is essential to facilitate the reuse of Earth science data. Recorded quality information must be sufficient for other researchers to select suitable data sets for their analysis and to confirm the results and conclusions. In the research data ecosystem, several entities are responsible for data quality. Data producers (researchers and agencies) play a major role in this respect, as they often include validation checks or data cleaning as part of their work. Quality information may not be supplied with published data sets, however, and where it is available the descriptions may be incomplete, ambiguous, or address only specific quality aspects. Data repositories have built infrastructures to share data, but not all of them assess data quality; they normally provide guidelines for documenting quality information. Some suggest that scholarly and data journals should take a role in ensuring data quality by involving reviewers to assess the data sets used in articles and by incorporating data quality criteria in their author guidelines; however, this mechanism primarily addresses data sets submitted to journals. We believe that data consumers can complement the existing entities in assessing and documenting the quality of published data sets, an approach already adopted in crowd-sourced platforms such as Zooniverse, OpenStreetMap, Wikipedia, Mechanical Turk and Tomnod. This paper presents a framework, built on open source tools, to capture and share data users' feedback on the application and assessment of research data. The framework comprises a browser plug-in, a web service and a data model, such that feedback can be easily reported, retrieved and searched. The feedback records are also made available as Linked Data to promote integration with other sources on the Web. Vocabularies from Dublin Core and PROV-O are used to clarify the source and attribution of feedback. The application of the framework is illustrated with CSIRO's Data Access Portal.
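
    A rough sketch of a single feedback record expressed as JSON-LD with Dublin Core and PROV-O terms, in the spirit of the framework described above; the identifiers, property choices, and values are illustrative assumptions, not the framework's actual data model:

        import json

        feedback = {
            "@context": {
                "dct": "http://purl.org/dc/terms/",
                "prov": "http://www.w3.org/ns/prov#",
            },
            "@id": "https://example.org/feedback/42",   # hypothetical identifier
            "@type": "prov:Entity",
            "dct:description": "Soil moisture values after 2014-03 look implausibly high.",
            "dct:created": "2015-08-01",
            # Who gave the feedback and which data set it concerns (placeholders).
            "prov:wasAttributedTo": {"@id": "https://orcid.org/0000-0000-0000-0000"},
            "prov:wasDerivedFrom": {"@id": "https://example.org/dataset/123"},
        }

        print(json.dumps(feedback, indent=2))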

  8. Data Citation Impediments: Human and Institutional Inertia

    NASA Astrophysics Data System (ADS)

    Mayernik, M. S.

    2013-12-01

    Data citations are growing in visibility in scientific and public policy circles. Data citations directly link scholarship and data, and as such provide a mechanism through which data can be discovered and accessed, scholarly use of data can be tracked, and the impact of data facilities can be identified. The interest in data citations is coming from many research stakeholders, including funders, policy makers, professional societies and their publication entities, research organizations, and individual researchers. Most of the efforts to date around data citations have focused on the challenges of assigning unique identifiers to digital data sets. While these challenges are significant, an additional challenge has gone relatively unaddressed, namely, the fact that data citation is not a common practice within scientific communities. This presentation will present findings from an interview study within the University Corporation for Atmospheric Research / National Center for Atmospheric Research (UCAR/NCAR). Through interviews with 14 scientists and engineers, we have found that there is little evidence that data citations have gained momentum as a common practice. Currently, data users acknowledge their use of particular data sets in either the research methods or acknowledgements sections of their papers, not as formal citations in a paper's bibliography. Data users are often 1) unaware that they can and should cite data sets, 2) unsure of how to cite data sets, and 3) lacking career motivations to forward data citations as a common activity. Data citation initiatives will have minimal impact on the scientific community if they do not address this practical inertia. Data users are a critical stakeholder in the data citation process. Their voice needs to be central to the data citation discussion. We will discuss how outreach efforts need to focus on raising the profile of data citations by informing scientists and administrators, being proactive in providing data users with recommended citations, and embedding data citations within larger scientific research institutions like academic tenure and scholarly peer review.

  9. The data facility of the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS)

    NASA Technical Reports Server (NTRS)

    Nielsen, Pia J.; Green, Robert O.; Murray, Alex T.; Eng, Bjorn T.; Novack, H. Ian; Solis, Manuel; Olah, Martin

    1993-01-01

    AVIRIS operations at the Jet Propulsion Laboratory include a significant data task. The AVIRIS data facility is responsible for data archiving, data calibration, quality monitoring and distribution. Since 1987, the data facility has archived over one terabyte of AVIRIS data and distributed these data to science investigators as requested. In this paper we describe recent improvements in the AVIRIS data facility.

  10. Data Mining and Homeland Security: An Overview

    DTIC Science & Technology

    2006-01-27

    which government agencies should use and mix commercial data with government data, whether data sources are being used for purposes other than those...example, a hardware store may compare their customers’ tool purchases with home ownership, type of...John Makulowich, “Government Data Mining...cleaning, data integration, data selection, data transformation, (data mining), pattern evaluation, and knowledge presentation. A number of advances in

  11. Transforming Research Data into Resource Data

    NASA Astrophysics Data System (ADS)

    Chandler, C. L.; Shepherd, A.; Groman, R. C.; Kinkade, D.; Rauch, S.; Allison, M. D.; Copley, N. J.; Ake, H.; York, A.; Wiebe, P. H.; Glover, D. M.

    2016-12-01

    Many of the Grand Challenge science questions are of interest to the marine science research community funded by the United States National Science Foundation (NSF). The highly diverse range of environmental data from the oceans, coastal regions, and Great Lakes is collected using a variety of platforms, instrument systems and sensors, and is complemented by experimental results, including sequence data, and by model results. The data are often collected with a particular research purpose in mind. Such data are costly to acquire, and environmental data, being temporally and geographically unique, cannot be acquired again. The NSF-funded research community, comprising funded investigators and their research teams, operators of the US academic research fleet, data managers, marine librarians, and NSF program managers, is working together to transform `research data' into `resource data'. The objective is to ensure that the original research data become available to a much wider community and have the potential to be used as `resource data' for new and different types of research well beyond the initial focus of the NSF research grant. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) manages a community-driven data repository that serves some of these data: the data and results from research funded by NSF Ocean Sciences and Polar Programs. Individually such data sets are typically small in size, but when integrated these data become a valuable resource for the global research effort. The data are analyzed, quality controlled, finalized by the original investigators and their research teams, and then contributed to BCO-DMO. The BCO-DMO data managers reformat the data if they were submitted in proprietary formats, perform quality assessment review, augment the data sets with additional documentation, and create structured, machine-actionable metadata. The BCO-DMO data system allows researchers to make connections between related data sets within the BCO-DMO catalog, and also to follow links to complementary data sets curated at other research data repositories. The key is to expose, in standards-compliant ways, essential elements of domain-specific metadata that enable discovery of related data, results, products, and publications from scientific research activities.
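
    One common, standards-compliant way to expose such machine-actionable discovery metadata is a schema.org/Dataset JSON-LD record; the sketch below is only an illustration with placeholder names and URLs, not BCO-DMO's actual (much richer) metadata model.

```python
import json

# Hypothetical discovery record for a contributed data set, expressed as
# schema.org/Dataset JSON-LD. All names and URLs are placeholders.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example CTD profiles, hypothetical cruise XX1234",
    "description": "Quality-controlled CTD profiles contributed by the original investigators.",
    "creator": [{"@type": "Person", "name": "A. Researcher"}],
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data/xx1234_ctd.csv",   # placeholder download URL
        "encodingFormat": "text/csv",
    },
    "isBasedOn": "https://example.org/data/xx1234_bottle.csv",     # link to a related data set
}

print(json.dumps(record, indent=2))
```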

  12. re3data.org - a global registry of research data repositories

    NASA Astrophysics Data System (ADS)

    Pampel, Heinz; Vierkant, Paul; Elger, Kirsten; Bertelmann, Roland; Witt, Michael; Schirmbacher, Peter; Rücknagel, Jessika; Kindling, Maxi; Scholze, Frank; Ulrich, Robert

    2016-04-01

    re3data.org - the registry of research data repositories - lists over 1,400 research data repositories from all over the world, making it the largest and most comprehensive online catalog of research data repositories on the web. The registry is a valuable tool for researchers, funding organizations, publishers and libraries. re3data.org provides detailed information about research data repositories, and its distinctive icons help researchers to easily identify relevant repositories for accessing and depositing data sets [1]. Funding agencies, like the European Commission [2], and research institutions, like the University of Bielefeld [3], already recommend the use of re3data.org in their guidelines and policies. Several publishers and journals, like Copernicus Publications, PeerJ, and Nature's Scientific Data, recommend re3data.org in their editorial policies as a tool for the easy identification of appropriate data repositories to store research data. Project partners in re3data.org are the Library and Information Services department (LIS) of the GFZ German Research Centre for Geosciences, the Computer and Media Service at the Humboldt-Universität zu Berlin, the Purdue University Libraries and the KIT Library at the Karlsruhe Institute of Technology (KIT). After its merger with the US registry DataBib in 2014, re3data.org continues from 2016 onwards as a service of DataCite. DataCite is the international organization for the registration of Digital Object Identifiers (DOIs) for research data and aims to improve their citation. The poster describes the current status and the future plans of re3data.org. [1] Pampel H, et al. (2013) Making Research Data Repositories Visible: The re3data.org Registry. PLoS ONE 8(11): e78080. doi:10.1371/journal.pone.0078080. [2] European Commission (2015): Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020. Available: http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf Accessed 11 January 2016. [3] Bielefeld University (2013): Resolution on Research Data Management. Available: http://data.uni-bielefeld.de/en/resolution Accessed 11 January 2016.

  13. SeaDataNet - Pan-European infrastructure for marine and ocean data management: Unified access to distributed data sets (www.seadatanet.org)

    NASA Astrophysics Data System (ADS)

    Schaap, Dick M. A.; Maudire, Gilbert

    2010-05-01

    SeaDataNet is a leading infrastructure in Europe for marine and ocean data management. It is actively operating and further developing a Pan-European infrastructure for managing, indexing and providing access to ocean and marine data sets and data products acquired via research cruises and other observational activities, both in situ and by remote sensing. The basis of SeaDataNet is the interconnection of 40 National Oceanographic Data Centres (NODCs) and marine data centres from 35 countries around the European seas into a distributed network of data resources with common standards for metadata, vocabularies, data transport formats, quality control methods and flags, and access. Most of the NODCs operate and/or are developing national networks with other institutes in their countries to ensure national coverage and long-term stewardship of available data sets. The majority of the data managed by SeaDataNet partners concern physical oceanography, marine chemistry and hydrography, together with substantial volumes of marine biology, geology and geophysics data. These are partly owned by the partner institutes themselves and, to a large extent, by other organizations in their countries. The SeaDataNet infrastructure is implemented with support of the EU via the EU FP6 SeaDataNet project to provide a Pan-European data management system adapted both to the fragmented observation system and to the users' need for integrated access to data, metadata, products and services. The SeaDataNet project has a duration of 5 years and started in 2006, but builds upon earlier data management infrastructure projects undertaken over a period of 20 years by an expanding network of oceanographic data centres from the countries around all European seas. Its predecessor project, Sea-Search, had a strict focus on metadata. SeaDataNet maintains significant interest in the further development of the metadata infrastructure, extending its services with the provision of easy data access and generic data products. Version 1 of its infrastructure upgrade was launched in April 2008, and work is now well underway to include all 40 data centres at the V1 level. It comprises the network of 40 interconnected data centres (NODCs) and a central SeaDataNet portal. V1 provides users with a unified and transparent overview of the metadata and controlled access to the large collections of data sets that are managed at these data centres.
The SeaDataNet V1 infrastructure comprises the following middleware services:
• Discovery services = Metadata directories and User interfaces
• Vocabulary services = Common vocabularies and Governance
• Security services = Authentication, Authorization & Accounting
• Delivery services = Requesting and Downloading of data sets
• Viewing services = Mapping of metadata
• Monitoring services = Statistics on system usage and performance, and Registration of data requests and transactions
• Maintenance services = Entry and updating of metadata by data centres
Good progress is also being made with extending the SeaDataNet infrastructure with V2 services:
• Viewing services = Quick views and Visualisation of data and data products
• Product services = Generic and standard products
• Exchange services = Transformation of SeaDataNet portal CDI output to INSPIRE compliance
As a basis for the V1 services, common standards have been defined for metadata and data formats, common vocabularies, quality flags, and quality control methods, based on international standards such as ISO 19115, OGC, NetCDF (CF) and ODV, on best practices from IOC and ICES, and following INSPIRE developments. An important objective of the SeaDataNet V1 infrastructure is to provide transparent access to the distributed data sets via a unique user interface and download service. In the SeaDataNet V1 architecture the Common Data Index (CDI) V1 metadata service provides the link between discovery and delivery of data sets. The CDI user interface gives users detailed insight into the availability and geographical distribution of marine data archived at the connected data centres, and provides sufficient information to allow the user to assess the data's relevance. Moreover, the CDI user interface provides the means for downloading data sets in common formats via a transaction mechanism. The SeaDataNet portal provides registered users with access to these distributed data sets via the CDI V1 Directory and a shopping basket mechanism. This allows registered users to locate data of interest and submit their data requests. The requests are forwarded automatically from the portal to the relevant SeaDataNet data centres. This process is controlled via the Request Status Manager (RSM) Web Service at the portal and a Download Manager (DM) Java software module implemented at each of the data centres. The RSM also enables registered users to check regularly the status of their requests and to download data sets after access has been granted. Data centres can follow all transactions for their data sets online and can handle requests which require their consent. The actual delivery of data sets takes place between the user and the selected data centre. Very good progress is being made with connecting all SeaDataNet data centres and their data sets to the CDI V1 system. At present the CDI V1 system provides users with functionality to discover and download more than 500,000 data sets, a number which is steadily increasing. The SeaDataNet architecture provides a coherent system of the various V1 services and allows for the inclusion of the V2 services. For the implementation, a range of technical components have been defined and developed. These make use of recent web technologies and also comprise Java components, to provide multi-platform support and syntactic interoperability. To facilitate sharing of resources and interoperability, SeaDataNet has adopted SOAP Web services for various communication tasks.
The SeaDataNet architecture has been designed as a multi-disciplinary system from the beginning. It is able to support a wide variety of data types and to serve several sector communities. SeaDataNet is willing to share its technologies and expertise, to spread and expand its approach, and to build bridges to other well-established infrastructures in the marine domain. SeaDataNet has therefore developed a strategy of seeking active cooperation on a national scale with other data-holding organisations via its NODC networks, and on an international scale with other European and international data management initiatives and networks, with the objective of achieving wider coverage of data sources and overall interoperability between data infrastructures in the marine and ocean domains. Recent examples include the EU FP7 projects Geo-Seas for geological and geophysical data sets, UpgradeBlackSeaScene for a Black Sea data management infrastructure, and CaspInfo for a Caspian Sea data management infrastructure, as well as the EU EMODNET pilot projects for hydrographic, chemical, and biological data sets. All these projects are adopting the SeaDataNet standards and extending its services. Active cooperation also takes place with EuroGOOS and MyOcean in the domain of real-time and delayed-mode metocean monitoring data. SeaDataNet Partners: IFREMER (France), MARIS (Netherlands), HCMR/HNODC (Greece), ULg (Belgium), OGS (Italy), NERC/BODC (UK), BSH/DOD (Germany), SMHI (Sweden), IEO (Spain), RIHMI/WDC (Russia), IOC (International), ENEA (Italy), INGV (Italy), METU (Turkey), CLS (France), AWI (Germany), IMR (Norway), NERI (Denmark), ICES (International), EC-DG JRC (International), MI (Ireland), IHPT (Portugal), RIKZ (Netherlands), RBINS/MUMM (Belgium), VLIZ (Belgium), MRI (Iceland), FIMR (Finland), IMGW (Poland), MSI (Estonia), IAE/UL (Latvia), CMR (Lithuania), SIO/RAS (Russia), MHI/DMIST (Ukraine), IO/BAS (Bulgaria), NIMRD (Romania), TSU (Georgia), INRH (Morocco), IOF (Croatia), PUT (Albania), NIB (Slovenia), UoM (Malta), OC/UCY (Cyprus), IOLR (Israel), NCSR/NCMS (Lebanon), CNR-ISAC (Italy), ISMAL (Algeria), INSTM (Tunisia)

  14. Data warehouse model design technology analysis and research

    NASA Astrophysics Data System (ADS)

    Jiang, Wenhua; Li, Qingshui

    2012-01-01

    Existing data storage formats cannot meet the needs of information analysis, which has brought the data warehouse onto the stage: a data warehouse is a data collection specially designed to support business decision making. With a data warehouse, a company stores all of its collected information in one place, organized so that the information is easy to access and has value. This paper focuses on the establishment of the data warehouse and on the analysis and design of two data warehouse models, and compares them.
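
    As a concrete illustration of data warehouse modelling in general (not necessarily the two models compared in the paper), the sketch below builds a minimal dimensional star schema, one fact table joined to descriptive dimension tables, and runs a typical analytical query.

```python
import sqlite3

# Minimal star schema: one fact table keyed to dimension tables.
# Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    revenue    REAL
);
""")

cur.execute("INSERT INTO dim_date VALUES (1, '2012-01-15', '2012-01', 2012)")
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 10, 99.5)")

# Typical analytical query: aggregate facts grouped by a dimension attribute.
for row in cur.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category
"""):
    print(row)
conn.close()
```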

  15. Marine asset security and tracking (MAST) system

    DOEpatents

    Hanson, Gregory Richard [Clinton, TN; Smith, Stephen Fulton [Loudon, TN; Moore, Michael Roy [Corryton, TN; Dobson, Eric Lesley [Charleston, SC; Blair, Jeffrey Scott [Charleston, SC; Duncan, Christopher Allen [Marietta, GA; Lenarduzzi, Roberto [Knoxville, TN

    2008-07-01

    Methods and apparatus are described for marine asset security and tracking (MAST). A method includes transmitting identification data, location data and environmental state sensor data from a radio frequency tag. An apparatus includes a radio frequency tag that transmits identification data, location data and environmental state sensor data. Another method includes transmitting identification data and location data from a radio frequency tag using hybrid spread-spectrum modulation. Another apparatus includes a radio frequency tag that transmits both identification data and location data using hybrid spread-spectrum modulation.

  16. Development of an Oceanographic Data Archiving and Service System for the Korean Researchers

    NASA Astrophysics Data System (ADS)

    Kim, Sung Dae; Park, Hyuk Min; Baek, Sang Ho

    2014-05-01

    The Oceanographic Data and Information Center of the Korea Institute of Ocean Science and Technology (KIOST) started to develop an oceanographic data archiving and service system in 2010 to support Korean ocean researchers by continuously providing quality-controlled data. Many physical oceanographic data available in the public domain, as well as Korean domestic data, were collected periodically, quality controlled, manipulated and provided to ocean modelers, who need ocean data continuously, and to marine biologists, who are less familiar with physical data but need it. The southern and northern limits of the spatial coverage are 20°N and 55°N, and the western and eastern limits are 110°E and 150°E, respectively. To archive TS (temperature and salinity) profile data, ARGO data were gathered from the ARGO GDACs (France and USA) and many historical TS profile data observed by CTD, OSD and BT were retrieved from the World Ocean Database 2009. Quality control software for TS profile data, which meets the QC criteria suggested by the ARGO program and the GTSPP (Global Temperature-Salinity Profile Program), was developed and applied to the collected data. By the end of 2013, the total number of vertical profiles from the ARGO GDACs was 59,642 and the total number of station records from WOD 2009 was 1,604,422. We also collected the global satellite SST data produced by NCDC and global SSH data from AVISO every day. An automatic program was written to collect the satellite data, extract subsets for the Northwest Pacific area and produce distribution maps. The total number of collected satellite data sets was 3,613 by the end of 2013. We use three different data services to provide the archived data to Korean experts. An FTP service was prepared to allow data users to download data in the original format. We developed a TS database system using Oracle RDBMS to hold all collected temperature and salinity data and to support SQL retrieval with various conditions. The KIOST ocean data portal is used as the data retrieval service for the TS database, with a GIS interface built from open source GIS software. We also installed the Live Access Server developed by US PMEL to serve the satellite netCDF data files, supporting on-the-fly visualization and an OPeNDAP (Open-source Project for a Network Data Access Protocol) service for remote connection and sub-setting of large data sets.
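
    The sketch below illustrates two QC tests of the kind suggested by the ARGO and GTSPP manuals, a global range check and a spike test; the numeric thresholds are illustrative placeholders, not the official criteria used by the KIOST software.

```python
# Sketch of two common TS-profile quality-control tests (global range check
# and spike test) in the spirit of the ARGO/GTSPP manuals. Thresholds below
# are illustrative placeholders, not the official values.

TEMP_RANGE = (-2.5, 40.0)   # assumed plausible ocean temperature bounds, deg C
SPIKE_THRESHOLD = 6.0       # assumed spike threshold, deg C

def range_check(values, lo, hi):
    """Flag 1 (good) or 4 (bad) for each value against a global range."""
    return [1 if lo <= v <= hi else 4 for v in values]

def spike_check(values, threshold):
    """Flag interior points whose deviation from neighbours exceeds a threshold."""
    flags = [1] * len(values)
    for i in range(1, len(values) - 1):
        v1, v2, v3 = values[i - 1], values[i], values[i + 1]
        test = abs(v2 - (v1 + v3) / 2.0) - abs((v1 - v3) / 2.0)
        if test > threshold:
            flags[i] = 4
    return flags

profile_temp = [18.2, 17.9, 29.5, 17.5, 17.1]       # toy profile with one spike
print(range_check(profile_temp, *TEMP_RANGE))        # all values within bounds -> all 1
print(spike_check(profile_temp, SPIKE_THRESHOLD))    # third value flagged as a spike (4)
```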

  17. A Survey on Next-generation Power Grid Data Architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    You, Shutang; Zhu, Dr. Lin; Liu, Yong

    2015-01-01

    The operation and control of power grids will increasingly rely on data. A high-speed, reliable, flexible and secure data architecture is the prerequisite of the next-generation power grid. This paper summarizes the challenges in collecting and utilizing power grid data, and then provides a reference data architecture for future power grids. Based on the data architecture deployment, related research on data architecture is reviewed and summarized in several categories, including data measurement/actuation, data transmission, the data service layer, and data utilization, as well as two cross-cutting issues, interoperability and cyber security. Research gaps and future work are also presented.

  18. System using data compression and hashing adapted for use for multimedia encryption

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Coffland, Douglas R

    2011-07-12

    A system and method is disclosed for multimedia encryption. Within the system of the present invention, a data compression module receives and compresses a media signal into a compressed data stream. A data acquisition module receives and selects a set of data from the compressed data stream. And, a hashing module receives and hashes the set of data into a keyword. The method of the present invention includes the steps of compressing a media signal into a compressed data stream; selecting a set of data from the compressed data stream; and hashing the set of data into a keyword.
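
    A minimal sketch of the general idea is shown below: compress the media signal, select a subset of the compressed stream, and hash that subset into a keyword. The choice of zlib, SHA-256, and the every-fourth-byte selection rule are illustrative assumptions, not the patented method.

```python
import hashlib
import zlib

# Minimal sketch: compress a media signal, select a subset of the compressed
# stream, and hash the subset into a keyword. Selection rule and algorithms
# are illustrative assumptions.

def keyword_from_media(media_bytes: bytes) -> str:
    compressed = zlib.compress(media_bytes)          # data compression module
    selected = compressed[::4]                       # data acquisition module: pick a subset
    return hashlib.sha256(selected).hexdigest()      # hashing module: subset -> keyword

print(keyword_from_media(b"example media signal" * 100))
```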

  19. Data Quality- and Master Data Management - A Hospital Case.

    PubMed

    Arthofer, Klaus; Girardi, Dominic

    2017-01-01

    Poor data quality prevents the analysis of data for decisions which are critical for business. It also has a negative impact on business processes. Nevertheless, the maturity level of data quality and master data management is still insufficient in many organizations today. This article discusses the corresponding maturity of companies and a management cycle integrating data quality and master data management, in a case dealing with benchmarking in hospitals. In conclusion, if data quality and master data are not properly managed, structured data should not be acquired in the first place, given the added expense and complexity.

  20. Concept for Future Data Services at the Long-Term Archive of WDCC combining DOIs with common PIDs

    NASA Astrophysics Data System (ADS)

    Stockhause, Martina; Weigel, Tobias; Toussaint, Frank; Höck, Heinke; Thiemann, Hannes; Lautenschlager, Michael

    2013-04-01

    The World Data Center for Climate (WDCC), hosted at the German Climate Computing Center (DKRZ), maintains a long-term archive (LTA) of climate model data as well as observational data. WDCC distinguishes between two types of LTA data. Structured data: the data output of an instrument or of a climate model run consists of numerous, highly structured individual datasets in a uniform format. Part of these data is also published on an ESGF (Earth System Grid Federation) data node. Detailed metadata are available, allowing for fine-grained, user-defined data access. Unstructured data: LTA data of finished scientific projects are in general unstructured and consist of datasets of different formats, different sizes, and different contents. For these data, compact metadata are available as content information. The structured data are suitable for WDCC's DataCite DOI process; the project data qualify only in exceptional cases. The DOI process includes thorough quality control of technical as well as scientific aspects by the publication agent and the data creator. DOIs are assigned to data collections that are appropriate for citation in scientific publications, such as a simulation run. The data collection is defined in agreement with the data creator. At the moment there is no way to identify and cite individual datasets within such a DOI data collection, analogous to citing chapters of a book. Also missing is a compact citation rule for a user-specified collection of data. WDCC therefore complements its existing LTA/DOI concept by assigning Persistent Identifiers (PIDs) to datasets using Handles. In addition to data identification for internal and external use, the concept of PIDs allows relations among PIDs to be defined. Such structural information is stored as key-value pairs directly in the Handles. Thus, relations provide basic provenance or lineage information, even if part of the data, such as intermediate results, is lost. WDCC intends to use additional PIDs on metadata entities with a relation to the data PID(s). These add background information on the data creation process (e.g. descriptions of the experiment, model, model set-up, and platform for the model run) to the data. These pieces of additional information significantly increase the reusability of the archived model data. Other valuable additional information for scientific collaboration could be added by the same mechanism, such as quality information and annotations. Apart from relations among data and metadata entities, PIDs on collections are advantageous for model data: collections allow for persistent references to single datasets or subsets of data assigned a DOI, and data objects and additional information objects can be consistently connected via relations (provenance, creation, and quality information for data).
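
    The sketch below illustrates how relations could be stored as key-value pairs in a PID record, as outlined above; the Handle names, key names and in-memory registry are hypothetical stand-ins for a real Handle server API.

```python
# Hypothetical sketch of storing relations as key-value pairs directly in a
# Handle/PID record. Key names and the in-memory "registry" are illustrative;
# a real deployment would use a Handle server.

registry = {}

def register_pid(pid, **kv_pairs):
    registry[pid] = dict(kv_pairs)

# A dataset PID pointing to its parent DOI collection and to a metadata PID.
register_pid(
    "21.T12345/dataset-0001",                     # hypothetical Handle
    URL="https://example.org/data/dataset-0001",
    IS_PART_OF="doi:10.0000/example-collection",  # relation to the citable DOI collection
    HAS_METADATA="21.T12345/experiment-meta-07",  # relation to an experiment-description PID
)

# Even if the data object itself were lost, the relations preserved in the
# PID record still provide basic provenance/lineage information.
print(registry["21.T12345/dataset-0001"])
```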

  1. AstroCloud, a Cyber-Infrastructure for Astronomy Research: Data Archiving and Quality Control

    NASA Astrophysics Data System (ADS)

    He, B.; Cui, C.; Fan, D.; Li, C.; Xiao, J.; Yu, C.; Wang, C.; Cao, Z.; Chen, J.; Yi, W.; Li, S.; Mi, L.; Yang, S.

    2015-09-01

    AstroCloud is a cyber-infrastructure for astronomy research initiated by the Chinese Virtual Observatory (China-VO) under funding support from the NDRC (National Development and Reform Commission) and CAS (Chinese Academy of Sciences) (Cui et al. 2014). To archive the astronomical data in China, we present the implementation of the astronomical data archiving system (ADAS). Data archiving and quality control are the infrastructure for AstroCloud. Throughout the entire data life cycle, the data archiving system standardizes data, transfers data, logs observational data, archives ambient data, and stores these data and metadata in a database. Quality control covers the whole process and all aspects of data archiving.

  2. High speed, very large (8 megabyte) first in/first out buffer memory (FIFO)

    DOEpatents

    Baumbaugh, Alan E.; Knickerbocker, Kelly L.

    1989-01-01

    A fast FIFO (First In First Out) memory buffer capable of storing data at rates of 100 megabytes per second. The invention includes a data packer which concatenates small bit data words into large bit data words, a memory array having individual data storage addresses adapted to store the large bit data words, a data unpacker into which large bit data words from the array can be read and reconstructed into small bit data words, and a controller to control and keep track of the individual data storage addresses in the memory array into which data from the packer is being written and data to the unpacker is being read.
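
    The sketch below illustrates the packer/unpacker idea in software: concatenating small data words into large words for storage and reconstructing them on read-out. The 8-bit/32-bit word widths are illustrative assumptions, not values from the patent.

```python
# Sketch of the packer/unpacker idea: pack small words into large words for
# storage, then reconstruct the small words on the way out.

def pack(small_words, in_bits=8, out_bits=32):
    ratio = out_bits // in_bits
    large_words = []
    for i in range(0, len(small_words), ratio):
        word = 0
        for j, w in enumerate(small_words[i:i + ratio]):
            word |= (w & ((1 << in_bits) - 1)) << (j * in_bits)
        large_words.append(word)
    return large_words

def unpack(large_words, in_bits=8, out_bits=32):
    ratio = out_bits // in_bits
    mask = (1 << in_bits) - 1
    return [(w >> (j * in_bits)) & mask for w in large_words for j in range(ratio)]

data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
packed = pack(data)                 # two 32-bit words
assert unpack(packed) == data       # round trip reconstructs the original small words
print([hex(w) for w in packed])
```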

  3. Challenges and Best Practices for the Curation and Publication of Long-Tail Data with GFZ Data Services

    NASA Astrophysics Data System (ADS)

    Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland

    2017-04-01

    Open access to research data is an increasing international request and includes not only data underlying scholarly publications, but also raw and curated data. Especially in the framework of the observed shift in many scientific fields towards data science and data mining, data repositories are becoming important players as data archives and access points to curated research data. While general and institutional data repositories are available across all scientific disciplines, domain-specific data repositories are specialised for particular disciplines, such as the bio- or geosciences, and can use more discipline-specific and richer metadata models than general repositories. Data publication is increasingly regarded as an important scientific achievement, and datasets with a digital object identifier (DOI) are now fully citable in journal articles. Moreover, following their signature of the "Statement of Commitment of the Coalition on Publishing Data in the Earth and Space Sciences" (COPDESS), many publishers have adapted their data policies and recommend, or even request, that data underlying scholarly publications be stored and published in (domain-specific) data repositories rather than as classical supplementary material attached directly to the respective article. The curation of large dynamic data from global networks in, e.g., seismology, magnetics or geodesy has always required a high degree of professional, IT-supported data management, simply to be able to store and access the huge number of files and manage dynamic datasets. In contrast, the vast amount of research data acquired by individual investigators or small teams, known as 'long-tail data', has often not been the focus of data curation infrastructure development. Nevertheless, even though these data are small in size and highly variable, in total they represent a significant portion of the overall scientific output. The curation of long-tail data requires more individual approaches and personal involvement of the data curator, especially regarding the data description. Here we introduce best practices for the publication of long-tail data that help to reduce the individual effort and improve the quality of the data description. The data repository of GFZ Data Services, hosted at the GFZ German Research Centre for Geosciences in Potsdam, is a domain-specific data repository for the geosciences. In addition to large dynamic datasets from different disciplines, it has a strong focus on the DOI-referenced publication of long-tail data, with the aim of reaching a high degree of reusability through comprehensive data description while at the same time providing and distributing standardised, machine-actionable metadata for data discovery (FAIR data). The development of templates for data reports, metadata provision by scientists via an XML Metadata Editor, and discipline-specific DOI landing pages helps both the data curators, in handling all kinds of datasets, and the scientists, i.e. the users, in quickly deciding whether a published dataset fulfils their needs. GFZ Data Services has also developed DOI-registration services for several international networks (e.g. ICGEM, World Stress Map, IGETS) and project- or network-specific designs of the DOI landing pages carrying the logo or design of the respective networks or projects.

  4. Opening Data in the Long Tail for Community Discovery, Curation and Action Using Active and Social Curation

    NASA Astrophysics Data System (ADS)

    Hedstrom, M. L.; Kumar, P.; Myers, J.; Plale, B. A.

    2012-12-01

    In data science, the most common sequence of steps for data curation is to 1) curate data, 2) enable data discovery, and 3) provide for data reuse. The Sustainable Environments - Actionable Data (SEAD) project, funded through NSF's DataNet program, is creating an environment for sustainability scientists to discover data first, reuse data next, and curate data through an on-going process that we call Active and Social Curation. For active curation we are developing tools and services that support data discovery, data management, and data enhancement for the community while the data is still being used actively for research. We are creating an Active Content Repository, using drop box, semantic web technologies, and a Flickr-like interface for researchers to "drop" data into a repository where it will be replicated and made minimally discoverable. For social curation, we are deploying a social networking tool, VIVO, which will allow researchers to discover data-publications-people (e.g. expertise) through a route that can start at any of those entry points. The other dimension of social curation is developing mechanisms to open data for community input, for example, using ranking and commenting mechanisms for data sets and a community-sourcing capability to add tags, clean up and validate data sets. SEAD's strategies and services are aimed at the sustainability science community, which faces numerous challenges including discovery of useful data, cleaning noisy observational data, synthesizing data of different types, defining appropriate models, managing and preserving their research data, and conveying holistic results to colleagues, students, decision makers, and the public. Sustainability researchers make significant use of centrally managed data from satellites and national sensor networks, national scientific and statistical agencies, and data archives. At the same time, locally collected data and custom derived data products that combine observations and measurements from local, national, and global sources are critical resources that have disproportionately high value relative to their size. Sustainability science includes a diverse and growing community of domain scientists, policy makers, private sector investors, green manufacturers, citizen scientists, and informed consumers. These communities need actionable data in order to assess the impacts of alternate scenarios, evaluate the cost-benefit tradeoffs of different solutions, and defend their recommendations and decisions. SEAD's goal is to extend its services to other communities in the "long tail" that may benefit from new approaches to infrastructure development which take into account the social and economic characteristics of diverse and dispersed data producers and consumers. For example, one barrier to data reuse is the difficulty of discovering data that might be valuable for a particular study, model, or decision. Making data minimally discoverable saves the community time expended on futile searches and creates a market, of sorts, for the data. Creating very low barriers to entry to a network where data can be discovered and acted upon vastly reduces this disincentive to sharing data. SEAD's approach allows communities to make small incremental improvements in data curation based on their own priorities and needs.

  5. Curating and Integrating Data from Multiple Sources to Support Healthcare Analytics.

    PubMed

    Ng, Kenney; Kakkanatt, Chris; Benigno, Michael; Thompson, Clay; Jackson, Margaret; Cahan, Amos; Zhu, Xinxin; Zhang, Ping; Huang, Paul

    2015-01-01

    As the volume and variety of healthcare-related data continue to grow, the analysis and use of these data will increasingly depend on the ability to appropriately collect, curate and integrate disparate data from many different sources. We describe our approach to, and highlight our experiences with, the development of a robust data collection, curation and integration infrastructure that supports healthcare analytics. This system has been successfully applied to the processing of a variety of data types, including clinical data from electronic health records and observational studies, genomic data, microbiomic data, self-reported data from surveys and self-tracked data from wearable devices, from over 600 subjects. The curated data are currently being used to support healthcare analytic applications such as data visualization, patient stratification and predictive modeling.

  6. Accessing memory

    DOEpatents

    Yoon, Doe Hyun; Muralimanohar, Naveen; Chang, Jichuan; Ranganthan, Parthasarathy

    2017-09-26

    A disclosed example method involves performing simultaneous data accesses on at least first and second independently selectable logical sub-ranks to access first data via a wide internal data bus in a memory device. The memory device includes a translation buffer chip, memory chips in independently selectable logical sub-ranks, a narrow external data bus to connect the translation buffer chip to a memory controller, and the wide internal data bus between the translation buffer chip and the memory chips. A data access is performed on only the first independently selectable logical sub-rank to access second data via the wide internal data bus. The example method also involves locating a first portion of the first data, a second portion of the first data, and the second data on the narrow external data bus during separate data transfers.

  7. Methods and Apparatus for Aggregation of Multiple Pulse Code Modulation Channels into a Signal Time Division Multiplexing Stream

    NASA Technical Reports Server (NTRS)

    Chang, Chen J. (Inventor); Liaghati, Jr., Amir L. (Inventor); Liaghati, Mahsa L. (Inventor)

    2018-01-01

    Methods and apparatus are provided for telemetry processing using a telemetry processor. The telemetry processor can include a plurality of communications interfaces, a computer processor, and data storage. The telemetry processor can buffer sensor data by: receiving a frame of sensor data using a first communications interface and clock data using a second communications interface, receiving an end of frame signal using a third communications interface, and storing the received frame of sensor data in the data storage. After buffering the sensor data, the telemetry processor can generate an encapsulated data packet including a single encapsulated data packet header, the buffered sensor data, and identifiers identifying telemetry devices that provided the sensor data. A format of the encapsulated data packet can comply with a Consultative Committee for Space Data Systems (CCSDS) standard. The telemetry processor can send the encapsulated data packet using a fourth and a fifth communications interfaces.
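
    The sketch below shows how a simple CCSDS-style packet could be assembled in front of buffered sensor data, using the well-known 6-byte Space Packet primary header layout; whether this is the exact CCSDS format used by the telemetry processor is not stated in the abstract.

```python
import struct

# Sketch of building a CCSDS Space Packet style primary header (6 bytes) in
# front of buffered sensor data: version, type, secondary-header flag, APID,
# sequence flags, sequence count, and packet data length.

def ccsds_space_packet(apid: int, seq_count: int, payload: bytes) -> bytes:
    version, pkt_type, sec_hdr_flag = 0, 0, 0          # telemetry packet, no secondary header
    word1 = (version << 13) | (pkt_type << 12) | (sec_hdr_flag << 11) | (apid & 0x7FF)
    word2 = (0b11 << 14) | (seq_count & 0x3FFF)        # '11' = unsegmented data
    length = len(payload) - 1                          # CCSDS: data field length minus one
    header = struct.pack(">HHH", word1, word2, length)
    return header + payload

sensor_frame = bytes(range(16))                         # stand-in for buffered sensor data
packet = ccsds_space_packet(apid=0x123, seq_count=1, payload=sensor_frame)
print(packet.hex())
```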

  8. A Data Quality Filter for PMU Measurements: Description, Experience, and Examples

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Follum, James D.; Amidan, Brett G.

    Networks of phasor measurement units (PMUs) continue to grow, and along with them, the amount of data available for analysis. With so much data, it is impractical to identify and remove poor quality data manually. The data quality filter described in this paper was developed for use with the Data Integrity and Situation Awareness Tool (DISAT), which analyzes PMU data to identify anomalous system behavior. The filter operates based only on the information included in the data files, without supervisory control and data acquisition (SCADA) data, state estimator values, or system topology information. Measurements are compared to preselected thresholds to determine if they are reliable. Along with the filter's description, examples of data quality issues from application of the filter to nine months of archived PMU data are provided. The paper is intended to aid the reader in recognizing and properly addressing data quality issues in PMU data.
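
    The sketch below illustrates a threshold-only quality screen of the kind described, using nothing but the values in the data record; the limits are illustrative placeholders, not the thresholds used in the DISAT filter.

```python
import math

# Sketch of a threshold-only quality screen: each measurement is compared to
# preselected bounds. The bounds here are illustrative placeholders.

LIMITS = {
    "frequency_hz": (59.0, 61.0),        # assumed nominal 60 Hz system
    "voltage_pu":   (0.8, 1.2),
}

def quality_flags(record):
    """Return a dict of signal -> True (usable) / False (suspect)."""
    flags = {}
    for signal, value in record.items():
        lo, hi = LIMITS[signal]
        flags[signal] = (value is not None
                         and not math.isnan(value)
                         and lo <= value <= hi)
    return flags

print(quality_flags({"frequency_hz": 59.98, "voltage_pu": 1.01}))           # both usable
print(quality_flags({"frequency_hz": 0.0,   "voltage_pu": float("nan")}))   # both suspect
```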

  9. Managing data from multiple disciplines, scales, and sites to support synthesis and modeling

    USGS Publications Warehouse

    Olson, R. J.; Briggs, J. M.; Porter, J.H.; Mah, Grant R.; Stafford, S.G.

    1999-01-01

    The synthesis and modeling of ecological processes at multiple spatial and temporal scales involves bringing together and sharing data from numerous sources. This article describes a data and information system model that facilitates assembling, managing, and sharing diverse data from multiple disciplines, scales, and sites to support integrated ecological studies. Cross-site scientific-domain working groups coordinate the development of data associated with their particular scientific working group, including decisions about data requirements, data to be compiled, data formats, derived data products, and schedules across the sites. The Web-based data and information system consists of nodes for each working group plus a central node that provides data access, project information, data query, and other functionality. The approach incorporates scientists and computer experts in the working groups and provides incentives for individuals to submit documented data to the data and information system.

  10. Packaging and distributing ecological data from multisite studies

    NASA Technical Reports Server (NTRS)

    Olson, R. J.; Voorhees, L. D.; Field, J. M.; Gentry, M. J.

    1996-01-01

    Studies of global change and other regional issues depend on ecological data collected at multiple study areas or sites. An information system model is proposed for compiling diverse data from dispersed sources so that the data are consistent, complete, and readily available. The model includes investigators who collect and analyze field measurements, science teams that synthesize data, a project information system that collates data, a data archive center that distributes data to secondary users, and a master data directory that provides broader searching opportunities. Special attention to format consistency is required, such as units of measure, spatial coordinates, dates, and notation for missing values. Often data may need to be enhanced by estimating missing values, aggregating to common temporal units, or adding other related data such as climatic and soils data. Full documentation, an efficient data distribution mechanism, and an equitable way to acknowledge the original source of data are also required.

  11. Parallel compression of data chunks of a shared data object using a log-structured file system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bent, John M.; Faibish, Sorin; Grider, Gary

    2016-10-25

    Techniques are provided for parallel compression of data chunks being written to a shared object. A client executing on a compute node or a burst buffer node in a parallel computing system stores a data chunk generated by the parallel computing system to a shared data object on a storage node by compressing the data chunk; and providing the compressed data chunk to the storage node that stores the shared object. The client and storage node may employ Log-Structured File techniques. The compressed data chunk can be de-compressed by the client when the data chunk is read. A storage node stores a data chunk as part of a shared object by receiving a compressed version of the data chunk from a compute node; and storing the compressed version of the data chunk to the shared data object on the storage node.
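
    The sketch below illustrates the client-side idea: compress chunks in parallel before they are appended to a shared object, then locate and decompress an individual chunk on read. The use of zlib and a process pool is an illustrative assumption and does not model burst buffers or log-structured storage.

```python
import zlib
from concurrent.futures import ProcessPoolExecutor

def compress_chunk(chunk: bytes) -> bytes:
    return zlib.compress(chunk)

def main():
    chunks = [bytes([i]) * 1_000_000 for i in range(8)]   # stand-in data chunks
    with ProcessPoolExecutor() as pool:
        compressed = list(pool.map(compress_chunk, chunks))

    # "Shared object": concatenation of compressed chunks plus their offsets,
    # so each chunk can later be located and decompressed independently.
    offsets, shared = [], b""
    for c in compressed:
        offsets.append((len(shared), len(c)))
        shared += c

    start, length = offsets[3]
    assert zlib.decompress(shared[start:start + length]) == chunks[3]
    print("stored", len(shared), "bytes for", sum(map(len, chunks)), "bytes of data")

if __name__ == "__main__":
    main()
```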

  12. Rdesign: A data dictionary with relational database design capabilities in Ada

    NASA Technical Reports Server (NTRS)

    Lekkos, Anthony A.; Kwok, Teresa Ting-Yin

    1986-01-01

    A data dictionary is defined to be the set of all data attributes, which describe data objects in terms of their intrinsic attributes, such as name, type, size, format and definition. It is recognized as the database for Information Resource Management, facilitating understanding and communication about the relationship between systems applications and systems data usage, and helping to achieve data independence by permitting systems applications to access data without knowledge of the location or storage characteristics of the data in the system. A research and development effort using Ada has produced a data dictionary with database design capabilities. This project supports data specification and analysis and offers a choice of the relational, network, or hierarchical model for logical database design. It provides a highly integrated set of analysis and design transformation tools, ranging from templates for data element definition and spreadsheets for defining functional dependencies, to normalization and a logical design generator.
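
    The sketch below illustrates one core check behind normalization support in such a tool: computing the closure of an attribute set under functional dependencies, which tests whether a candidate key determines the whole relation. The original tool is written in Ada, so this Python version is only an illustration of the idea.

```python
# Attribute closure under a set of functional dependencies (FDs): the basis
# for candidate-key detection and normalization checks in a design tool.

def closure(attrs, fds):
    """attrs: set of attribute names; fds: list of (lhs_set, rhs_set) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Example relation R(emp, dept, mgr) with emp -> dept and dept -> mgr.
fds = [({"emp"}, {"dept"}), ({"dept"}, {"mgr"})]
print(closure({"emp"}, fds))    # contains emp, dept, mgr: emp is a key of R
print(closure({"dept"}, fds))   # contains only dept, mgr: dept is not a key
```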

  13. Uniform Data Access Using GXD

    NASA Technical Reports Server (NTRS)

    Vanderbilt, Peter

    1999-01-01

    This paper gives an overview of GXD, a framework facilitating publication and use of data from diverse data sources. GXD defines an object-oriented data model designed to represent a wide range of things including data, its metadata, resources and query results. GXD also defines a data transport language, a dialect of XML, for representing instances of the data model. This language allows for a wide range of data source implementations by supporting both the direct incorporation of data and the specification of data by various rules. The GXD software library, prototyped in Java, includes client and server runtimes. The server runtime facilitates the generation of entities containing data encoded in the GXD transport language. The GXD client runtime interprets these entities (potentially from many data sources) to create the illusion of a globally interconnected data space, one that is independent of data source location and implementation.

  14. Data management support for selected climate data sets using the climate data access system

    NASA Technical Reports Server (NTRS)

    Reph, M. G.

    1983-01-01

    The functional capabilities of the Goddard Space Flight Center (GSFC) Climate Data Access System (CDAS), an interactive data storage and retrieval system, and the archival data sets which this system manages are discussed. The CDAS manages several climate-related data sets, such as the First Global Atmospheric Research Program (GARP) Global Experiment (FGGE) Level 2-b and Level 3-a data tapes. CDAS data management support consists of three basic functions: (1) an inventory capability which allows users to search or update a disk-resident inventory describing the contents of each tape in a data set, (2) a capability to depict graphically the spatial coverage of a tape in a data set, and (3) a data set selection capability which allows users to extract portions of a data set using criteria such as time, location, and data source/parameter and output the data to tape, user terminal, or system printer. This report includes figures that illustrate menu displays and output listings for each CDAS function.

  15. Sharing Health Big Data for Research - A Design by Use Cases: The INSHARE Platform Approach.

    PubMed

    Bouzillé, Guillaume; Westerlynck, Richard; Defossez, Gautier; Bouslimi, Dalel; Bayat, Sahar; Riou, Christine; Busnel, Yann; Le Guillou, Clara; Cauvin, Jean-Michel; Jacquelinet, Christian; Pladys, Patrick; Oger, Emmanuel; Stindel, Eric; Ingrand, Pierre; Coatrieux, Gouenou; Cuggia, Marc

    2017-01-01

    Sharing and exploiting Health Big Data (HBD) requires tackling several challenges: data protection and governance, taking into account legal, ethical, and deontological aspects, to enable trust and a transparent, win-win relationship between researchers, citizens, and data providers; and the lack of interoperability of compartmentalized and syntactically/semantically heterogeneous data. The INSHARE project explores, through an experimental proof of concept, how recent technologies can overcome such issues. Involving 6 data providers, the platform was designed in 3 steps: (1) analyze use cases, needs, and requirements; (2) define the data sharing governance and secure access to the platform; and (3) define the platform specifications. Three use cases - drawn from 5 studies and 11 data sources - were analyzed for the platform design. A governance model derived from the SCANNER model was adapted to data sharing. The platform architecture integrates: data repository and hosting, semantic integration services, data processing, aggregate computing, data quality and integrity monitoring, ID linking, a multisource query builder, visualization and data export services, data governance, a study management service, and security including data watermarking.

  16. Data publication with the structural biology data grid supports live analysis

    DOE PAGES

    Meyer, Peter A.; Socias, Stephanie; Key, Jason; ...

    2016-03-07

    Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. In conclusion, it is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

  17. Data publication with the structural biology data grid supports live analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meyer, Peter A.; Socias, Stephanie; Key, Jason

    Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. In conclusion, it is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

  18. [Accuracy improvement of spectral classification of crop using microwave backscatter data].

    PubMed

    Jia, Kun; Li, Qiang-Zi; Tian, Yi-Chen; Wu, Bing-Fang; Zhang, Fei-Fei; Meng, Ji-Hua

    2011-02-01

    In the present study, the use of VV-polarization microwave backscatter data to improve the accuracy of spectral classification of crops is investigated. Classification accuracies obtained with different classifiers based on the fusion of HJ satellite multi-spectral data and Envisat ASAR VV backscatter data are compared. The results indicate that the fusion data take full advantage of the spectral information of the HJ multi-spectral data and the structural sensitivity of the ASAR VV polarization data. The fusion data enlarge the spectral difference among classes and improve crop classification accuracy: classification accuracy using the fusion data can be increased by 5 percent compared to using the HJ data alone. Furthermore, ASAR VV polarization data are sensitive to non-agrarian areas within planted fields, and including VV polarization data in the classification can effectively distinguish field borders. Combining VV polarization data with multi-spectral data for crop classification broadens the application of satellite data and has the potential for wider use in the agricultural domain.

  19. SEURAT: visual analytics for the integrated analysis of microarray data.

    PubMed

    Gribov, Alexander; Sill, Martin; Lück, Sonja; Rücker, Frank; Döhner, Konstanze; Bullinger, Lars; Benner, Axel; Unwin, Antony

    2010-06-03

    In translational cancer research, gene expression data are collected together with clinical data and genomic data arising from other chip-based high-throughput technologies. Software tools for the joint analysis of such high-dimensional data sets together with clinical data are required. We have developed an open source software tool which provides interactive visualization capability for the integrated analysis of high-dimensional gene expression data together with associated clinical data, array CGH data and SNP array data. The different data types are organized by a comprehensive data manager. Interactive tools are provided for all graphics: heatmaps, dendrograms, barcharts, histograms, eventcharts and a chromosome browser, which displays genetic variations along the genome. All graphics are dynamic and fully linked so that any object selected in a graphic will be highlighted in all other graphics. For exploratory data analysis the software provides unsupervised data analytics such as clustering, seriation algorithms and biclustering algorithms. The SEURAT software meets the growing needs of researchers to perform joint analysis of gene expression, genomic and clinical data.

  20. An interdisciplinary analysis of multispectral satellite data for selected cover types in the Colorado Mountains, using automatic data processing techniques

    NASA Technical Reports Server (NTRS)

    Hoffer, R. M. (Principal Investigator)

    1975-01-01

    The author has reported the following significant results. A data set containing SKYLAB, LANDSAT, and topographic data has been overlaid, registered, and geometrically corrected to a scale of 1:24,000. After geometrically correcting both sets of data, the SKYLAB data were overlaid on the LANDSAT data. Digital topographic data were then obtained and reformatted, and a data channel containing elevation information was digitally overlaid onto the LANDSAT and SKYLAB spectral data. The 14,039 square kilometers involving 2,113,776 LANDSAT pixels represent a relatively large data set available for digital analysis. The overlaid data set enables investigators to numerically analyze and compare two sources of spectral data and topographic data from any point in the scene. This capability is new, and it will permit a numerical comparison of spectral response with elevation, slope, and aspect. Utilizing the spectral and topographic data together to obtain more accurate classifications of the various cover types present is feasible.

  1. Data publication with the structural biology data grid supports live analysis.

    PubMed

    Meyer, Peter A; Socias, Stephanie; Key, Jason; Ransey, Elizabeth; Tjon, Emily C; Buschiazzo, Alejandro; Lei, Ming; Botka, Chris; Withrow, James; Neau, David; Rajashankar, Kanagalaghatta; Anderson, Karen S; Baxter, Richard H; Blacklow, Stephen C; Boggon, Titus J; Bonvin, Alexandre M J J; Borek, Dominika; Brett, Tom J; Caflisch, Amedeo; Chang, Chung-I; Chazin, Walter J; Corbett, Kevin D; Cosgrove, Michael S; Crosson, Sean; Dhe-Paganon, Sirano; Di Cera, Enrico; Drennan, Catherine L; Eck, Michael J; Eichman, Brandt F; Fan, Qing R; Ferré-D'Amaré, Adrian R; Fromme, J Christopher; Garcia, K Christopher; Gaudet, Rachelle; Gong, Peng; Harrison, Stephen C; Heldwein, Ekaterina E; Jia, Zongchao; Keenan, Robert J; Kruse, Andrew C; Kvansakul, Marc; McLellan, Jason S; Modis, Yorgo; Nam, Yunsun; Otwinowski, Zbyszek; Pai, Emil F; Pereira, Pedro José Barbosa; Petosa, Carlo; Raman, C S; Rapoport, Tom A; Roll-Mecak, Antonina; Rosen, Michael K; Rudenko, Gabby; Schlessinger, Joseph; Schwartz, Thomas U; Shamoo, Yousif; Sondermann, Holger; Tao, Yizhi J; Tolia, Niraj H; Tsodikov, Oleg V; Westover, Kenneth D; Wu, Hao; Foster, Ian; Fraser, James S; Maia, Filipe R N C; Gonen, Tamir; Kirchhausen, Tom; Diederichs, Kay; Crosas, Mercè; Sliz, Piotr

    2016-03-07

    Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

  2. Data publication with the structural biology data grid supports live analysis

    PubMed Central

    Meyer, Peter A.; Socias, Stephanie; Key, Jason; Ransey, Elizabeth; Tjon, Emily C.; Buschiazzo, Alejandro; Lei, Ming; Botka, Chris; Withrow, James; Neau, David; Rajashankar, Kanagalaghatta; Anderson, Karen S.; Baxter, Richard H.; Blacklow, Stephen C.; Boggon, Titus J.; Bonvin, Alexandre M. J. J.; Borek, Dominika; Brett, Tom J.; Caflisch, Amedeo; Chang, Chung-I; Chazin, Walter J.; Corbett, Kevin D.; Cosgrove, Michael S.; Crosson, Sean; Dhe-Paganon, Sirano; Di Cera, Enrico; Drennan, Catherine L.; Eck, Michael J.; Eichman, Brandt F.; Fan, Qing R.; Ferré-D'Amaré, Adrian R.; Christopher Fromme, J.; Garcia, K. Christopher; Gaudet, Rachelle; Gong, Peng; Harrison, Stephen C.; Heldwein, Ekaterina E.; Jia, Zongchao; Keenan, Robert J.; Kruse, Andrew C.; Kvansakul, Marc; McLellan, Jason S.; Modis, Yorgo; Nam, Yunsun; Otwinowski, Zbyszek; Pai, Emil F.; Pereira, Pedro José Barbosa; Petosa, Carlo; Raman, C. S.; Rapoport, Tom A.; Roll-Mecak, Antonina; Rosen, Michael K.; Rudenko, Gabby; Schlessinger, Joseph; Schwartz, Thomas U.; Shamoo, Yousif; Sondermann, Holger; Tao, Yizhi J.; Tolia, Niraj H.; Tsodikov, Oleg V.; Westover, Kenneth D.; Wu, Hao; Foster, Ian; Fraser, James S.; Maia, Filipe R. N C.; Gonen, Tamir; Kirchhausen, Tom; Diederichs, Kay; Crosas, Mercè; Sliz, Piotr

    2016-01-01

    Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis. PMID:26947396

  3. The Lure of Statistics in Data Mining

    ERIC Educational Resources Information Center

    Grover, Lovleen Kumar; Mehra, Rajni

    2008-01-01

    The field of Data Mining, like Statistics, concerns itself with "learning from data" or "turning data into information". For statisticians, the term "Data mining" has a pejorative meaning. Instead of finding useful patterns in large volumes of data as in the case of Statistics, data mining has the connotation of searching for data to fit preconceived…

  4. 14 CFR 121.346 - Flight data recorders: filtered data.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 14 Aeronautics and Space 3 2011-01-01 2011-01-01 false Flight data recorders: filtered data. 121... § 121.346 Flight data recorders: filtered data. (a) A flight data signal is filtered when an original... sensor signal value can be reconstructed from the recorded data. This demonstration requires that: (i...

  5. 14 CFR 135.156 - Flight data recorders: filtered data.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 14 Aeronautics and Space 3 2011-01-01 2011-01-01 false Flight data recorders: filtered data. 135... Aircraft and Equipment § 135.156 Flight data recorders: filtered data. (a) A flight data signal is filtered... original sensor signal value can be reconstructed from the recorded data. This demonstration requires that...

  6. Format conversion between CAD data and GIS data based on ArcGIS

    NASA Astrophysics Data System (ADS)

    Xie, Qingqing; Wei, Bo; Zhang, Kailin; Wang, Zhichao

    2015-12-01

    To make full use of data resources and enable sharing of different types of data across industries, a method for format conversion between CAD data and GIS data based on ArcGIS was proposed. To preserve the integrity of the converted data, some key preprocessing steps were performed on the CAD data in AutoCAD before conversion. For example, deleting unnecessary elements such as the title, border, and legend prevented them from appearing after conversion, and re-layering the data according to a national standard prevented different types of elements from ending up in the same layer after conversion. In ArcGIS, the CAD data were converted to GIS data according to the correspondence between the graphic element classifications of AutoCAD and ArcGIS. In addition, an empty geodatabase and feature set had to be created in ArcGIS to store the text data from the CAD drawing. The experimental results show that the proposed method avoids a large amount of editing work during data conversion and maintains the integrity of spatial and attribute data before and after conversion.
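
    As a rough illustration of the conversion step described above, the sketch below uses the ArcGIS arcpy site package to load a cleaned CAD drawing into a file geodatabase. The file paths, dataset name, and reference scale are illustrative assumptions, not values taken from the study.

    ```python
    # Minimal sketch: convert a cleaned CAD drawing to a file geodatabase with arcpy.
    # Assumes an ArcGIS installation with the arcpy site package; paths are illustrative.
    import arcpy

    cad_drawing = r"C:\data\survey_cleaned.dwg"   # CAD file already re-layered per the national standard
    out_folder = r"C:\data"
    gdb_name = "converted.gdb"

    # Create an empty file geodatabase to hold the converted feature classes and annotation.
    arcpy.CreateFileGDB_management(out_folder, gdb_name)

    # Convert the CAD drawing; point, polyline, polygon, and annotation feature classes
    # are written into a feature dataset, preserving the layer-based classification.
    arcpy.CADToGeodatabase_conversion(cad_drawing,
                                      out_folder + "\\" + gdb_name,
                                      "survey_dataset",
                                      1000)   # reference scale for annotation
    ```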

  7. Authoring Data-Driven Videos with DataClips.

    PubMed

    Amini, Fereshteh; Riche, Nathalie Henry; Lee, Bongshin; Monroy-Hernandez, Andres; Irani, Pourang

    2017-01-01

    Data videos, or short data-driven motion graphics, are an increasingly popular medium for storytelling. However, creating data videos is difficult as it involves pulling together a unique combination of skills. We introduce DataClips, an authoring tool aimed at lowering the barriers to crafting data videos. DataClips allows non-experts to assemble data-driven "clips" together to form longer sequences. We constructed the library of data clips by analyzing the composition of over 70 data videos produced by reputable sources such as The New York Times and The Guardian. We demonstrate that DataClips can reproduce over 90% of our data videos corpus. We also report on a qualitative study comparing the authoring process and outcome achieved by (1) non-experts using DataClips, and (2) experts using Adobe Illustrator and After Effects to create data-driven clips. Results indicated that non-experts are able to learn and use DataClips with a short training period. In the span of one hour, they were able to produce more videos than experts using a professional editing tool, and their clips were rated similarly by an independent audience.

  8. The Role of NOAA's National Data Centers in the Earth and Space Science Infrastructure

    NASA Astrophysics Data System (ADS)

    Fox, C. G.

    2008-12-01

    NOAA's National Data Centers (NNDC) provide access to long-term archives of environmental data from NOAA and other sources. The NNDCs face significant challenges in the volume and complexity of modern data sets. Data volume challenges are being addressed using more capable data archive systems such as the Comprehensive Large Array-Data Stewardship System (CLASS). Assuring data quality and stewardship is in many ways more challenging. In the past, scientists at the Data Centers could provide reasonable stewardship of data sets in their area of expertise. As staff levels have decreased and data complexity has increased, Data Centers depend on their data providers and user communities to provide high-quality metadata, feedback on data problems and improvements. This relationship requires strong partnerships between the NNDCs and academic, commercial, and international partners, as well as advanced data management and access tools that conform to established international standards when available. The NNDCs are looking to geospatial databases, interactive mapping, web services, and other Application Program Interface approaches to help preserve NNDC data and information and to make it easily available to the scientific community.

  9. Method of and apparatus for generating an interstitial point in a data stream having an even number of data points

    NASA Technical Reports Server (NTRS)

    Edwards, T. R. (Inventor)

    1985-01-01

    Apparatus for doubling the data density rate of an analog-to-digital converter or doubling the data density storage capacity of a memory device is discussed. An interstitial data point midway between adjacent data points in a data stream having an even number of equal interval data points is generated by applying a set of predetermined one-dimensional convolute integer coefficients which can include a set of multiplier coefficients and a normalizer coefficient. Interpolator means apply the coefficients to the data points by weighting equally on each side of the center of the even number of equal interval data points to obtain an interstitial point value at the center of the data points. A one-dimensional output data set, which is twice as dense as a one-dimensional equal interval input data set, can be generated where the output data set includes interstitial points interdigitated between adjacent data points in the input data set. The method for generating the set of interstitial points is a weighted, nearest-neighbor, non-recursive, moving, smoothing averaging technique, equivalent to applying a polynomial regression calculation to the data set.
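
    A minimal numerical sketch of the midpoint-interpolation idea is given below. The (-1, 9, 9, -1)/16 weights are the standard four-point cubic midpoint filter and serve only as an illustrative set of multiplier coefficients and normalizer; they are not necessarily the coefficients claimed in the patent.

    ```python
    # Minimal sketch of generating interstitial (midpoint) values with integer convolute
    # coefficients. The (-1, 9, 9, -1)/16 weights are the standard four-point cubic
    # midpoint filter, used here only as an illustrative coefficient set.
    import numpy as np

    def interstitial_points(x, coeffs=(-1, 9, 9, -1), normalizer=16):
        """Return values interpolated midway between interior adjacent samples of x."""
        x = np.asarray(x, dtype=float)
        k = len(coeffs) // 2
        mids = []
        for i in range(k - 1, len(x) - k):
            window = x[i - k + 1 : i + k + 1]          # samples centred on the gap
            mids.append(np.dot(coeffs, window) / normalizer)
        return np.array(mids)

    def double_density(x):
        """Interdigitate interstitial points between the interior samples of x."""
        mids = interstitial_points(x)
        out = []
        for i, m in enumerate(mids, start=1):
            out.extend([x[i], m])
        out.append(x[len(mids) + 1])
        return np.array(out)

    samples = np.sin(np.linspace(0, np.pi, 8))
    print(double_density(samples))                     # nearly twice as dense as the input
    ```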

  10. FRIEDA: Flexible Robust Intelligent Elastic Data Management Framework

    DOE PAGES

    Ghoshal, Devarshi; Hendrix, Valerie; Fox, William; ...

    2017-02-01

    Scientific applications are increasingly using cloud resources for their data analysis workflows. However, managing data effectively and efficiently over these cloud resources is challenging due to the myriad storage choices with different performance, cost trade-offs, complex application choices and complexity associated with elasticity, failure rates in these environments. The different data access patterns for data-intensive scientific applications require a more flexible and robust data management solution than the ones currently in existence. FRIEDA is a Flexible Robust Intelligent Elastic Data Management framework that employs a range of data management strategies in cloud environments. FRIEDA can manage storage and data lifecycle of applications in cloud environments. There are four different stages in the data management lifecycle of FRIEDA – (i) storage planning, (ii) provisioning and preparation, (iii) data placement, and (iv) execution. FRIEDA defines a data control plane and an execution plane. The data control plane defines the data partition and distribution strategy, whereas the execution plane manages the execution of the application using a master-worker paradigm. FRIEDA also provides different data management strategies, either to partition the data in real-time, or predetermine the data partitions prior to application execution.
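
    The control-plane/execution-plane split described above can be pictured with the small sketch below, which uses a simple round-robin partitioning strategy and Python's multiprocessing for the master-worker step. This is an illustration of the pattern only; it does not use FRIEDA itself, and the partitioning and analysis functions are placeholders.

    ```python
    # Minimal sketch of a control-plane / execution-plane split: the control plane
    # predetermines round-robin partitions, the execution plane runs a master-worker
    # loop over them. The partitioning strategy and analysis function are placeholders.
    from multiprocessing import Pool

    def partition(files, n_workers):
        """Control plane: predetermine partitions before execution."""
        return [files[i::n_workers] for i in range(n_workers)]

    def analyze(partition_files):
        """Execution-plane worker: process one partition (stand-in for the real analysis)."""
        return {name: len(name) for name in partition_files}

    if __name__ == "__main__":
        files = [f"obs_{i:03d}.dat" for i in range(10)]
        parts = partition(files, n_workers=4)
        with Pool(4) as pool:                          # master distributes partitions to workers
            results = pool.map(analyze, parts)
        print(results)
    ```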

  11. Nonlinear analysis of EEG for epileptic seizures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hively, L.M.; Clapp, N.E.; Daw, C.S.

    1995-04-01

    We apply chaotic time series analysis (CTSA) to human electroencephalogram (EEG) data. Three epochs were examined: epileptic seizure, non-seizure, and transition from non-seizure to seizure. The CTSA tools were applied to four forms of these data: raw EEG data (e-data); artifact data (f-data), obtained by applying a quadratic zero-phase filter to the raw data; artifact-filtered data (g-data), the residual after subtracting f-data from e-data; and a low-pass-filtered version (h-data) of g-data. Two different seizures were analyzed for the same patient. Several nonlinear measures uniquely indicate an epileptic seizure in both cases, including an abrupt decrease in the time per wave cycle in f-data and an abrupt increase in the Kolmogorov entropy and in the correlation dimension for e-h data. The transition from normal to seizure state is also characterized by distinctly different trends in the nonlinear measures for each seizure, which may be potential seizure predictors for this patient. Surrogate analysis of e-data shows that statistically significant nonlinear structure is present during the non-seizure, transition, and seizure epochs.
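
    To make the e/f/g/h decomposition concrete, the sketch below builds the four data forms from a synthetic signal, assuming a generic zero-phase low-pass filter in place of the quadratic zero-phase filter used in the original analysis; the sampling rate, cutoff frequencies, and synthetic signal are illustrative.

    ```python
    # Minimal sketch of the e/f/g/h data forms, assuming a generic zero-phase low-pass
    # filter in place of the original quadratic zero-phase filter; sampling rate, cutoff
    # frequencies, and the synthetic signal are illustrative.
    import numpy as np
    from scipy.signal import butter, filtfilt

    fs = 250.0                                         # assumed sampling rate in Hz
    t = np.arange(0, 10, 1 / fs)
    e_data = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(t.size)   # stand-in raw EEG

    b, a = butter(4, 2.0 / (fs / 2))                   # slow component, filtered with zero phase shift
    f_data = filtfilt(b, a, e_data)                    # "artifact" estimate
    g_data = e_data - f_data                           # artifact-filtered residual
    b2, a2 = butter(4, 50.0 / (fs / 2))
    h_data = filtfilt(b2, a2, g_data)                  # low-pass-filtered version of g-data
    ```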

  12. FRIEDA: Flexible Robust Intelligent Elastic Data Management Framework

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ghoshal, Devarshi; Hendrix, Valerie; Fox, William

    Scientific applications are increasingly using cloud resources for their data analysis workflows. However, managing data effectively and efficiently over these cloud resources is challenging due to the myriad storage choices with different performance, cost trade-offs, complex application choices and complexity associated with elasticity, failure rates in these environments. The different data access patterns for data-intensive scientific applications require a more flexible and robust data management solution than the ones currently in existence. FRIEDA is a Flexible Robust Intelligent Elastic Data Management framework that employs a range of data management strategies in cloud environments. FRIEDA can manage storage and data lifecycle of applications in cloud environments. There are four different stages in the data management lifecycle of FRIEDA – (i) storage planning, (ii) provisioning and preparation, (iii) data placement, and (iv) execution. FRIEDA defines a data control plane and an execution plane. The data control plane defines the data partition and distribution strategy, whereas the execution plane manages the execution of the application using a master-worker paradigm. FRIEDA also provides different data management strategies, either to partition the data in real-time, or predetermine the data partitions prior to application execution.

  13. Modern data science for analytical chemical data - A comprehensive review.

    PubMed

    Szymańska, Ewa

    2018-10-22

    Efficient and reliable analysis of chemical analytical data is a great challenge due to the increase in data size, variety and velocity. New methodologies, approaches and methods are being proposed not only by chemometrics but also by other data science communities to extract relevant information from big datasets and provide their value to different applications. Besides the common goal of big data analysis, different perspectives on and terms for big data are being discussed in the scientific literature and public media. The aim of this comprehensive review is to present common trends in the analysis of chemical analytical data across different data scientific fields together with their data type-specific and generic challenges. Firstly, common data science terms used in different data scientific fields are summarized and discussed. Secondly, systematic methodologies to plan and run big data analysis projects are presented together with their steps. Moreover, different analysis aspects like assessing data quality, selecting data pre-processing strategies, data visualization and model validation are considered in more detail. Finally, an overview of standard and new data analysis methods is provided and their suitability for big analytical chemical datasets briefly discussed.

  14. Individual Data Linkage of Survey Data with Claims Data in Germany—An Overview Based on a Cohort Study

    PubMed Central

    March, Stefanie

    2017-01-01

    Research based on health insurance data has a long tradition in Germany. By contrast, data linkage of survey data with such claims data is a relatively new field of research with high potential. Data linkage opens up new opportunities for analyses in the field of health services research and public health. Germany has comprehensive rules and regulations on data protection that have to be followed; therefore, written informed consent is needed for individual data linkage. Additionally, the health system is characterized by heterogeneity of health insurance. The lidA (living at work) study is a cohort study on work, age and health, which linked survey data with claims data from a large number of statutory health insurance funds. All health insurance funds for which written consent had been given were contacted. This paper will give an overview of individual data linkage of survey data with German claims data, using the example of the lidA study results. The challenges and limitations of data linkage will be presented. Despite this heterogeneity, such studies are possible with a negligibly small influence of bias. The experience gained in lidA will be shown and provides important insights for other studies focusing on data linkage. PMID:29232834

  15. Individual Data Linkage of Survey Data with Claims Data in Germany-An Overview Based on a Cohort Study.

    PubMed

    March, Stefanie

    2017-12-09

    Research based on health insurance data has a long tradition in Germany. By contrast, data linkage of survey data with such claims data is a relatively new field of research with high potential. Data linkage opens up new opportunities for analyses in the field of health services research and public health. Germany has comprehensive rules and regulations on data protection that have to be followed; therefore, written informed consent is needed for individual data linkage. Additionally, the health system is characterized by heterogeneity of health insurance. The lidA (living at work) study is a cohort study on work, age and health, which linked survey data with claims data from a large number of statutory health insurance funds. All health insurance funds for which written consent had been given were contacted. This paper will give an overview of individual data linkage of survey data with German claims data, using the example of the lidA study results. The challenges and limitations of data linkage will be presented. Despite this heterogeneity, such studies are possible with a negligibly small influence of bias. The experience gained in lidA will be shown and provides important insights for other studies focusing on data linkage.

  16. Simple, Script-Based Science Processing Archive

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Hegde, Mahabaleshwara; Barth, C. Wrandle

    2007-01-01

    The Simple, Scalable, Script-based Science Processing (S4P) Archive (S4PA) is a disk-based archival system for remote sensing data. It is based on the data-driven framework of S4P and is used for data transfer, data preprocessing, metadata generation, data archive, and data distribution. New data are automatically detected by the system. S4P provides services such as data access control, data subscription, metadata publication, data replication, and data recovery. It comprises scripts that control the data flow. The system detects the availability of data on an FTP (file transfer protocol) server, initiates data transfer, preprocesses data if necessary, and archives it on readily available disk drives with FTP and HTTP (Hypertext Transfer Protocol) access, allowing instantaneous data access. There are options for plug-ins for data preprocessing before storage. Publication of metadata to external applications such as the Earth Observing System Clearinghouse (ECHO) is also supported. S4PA includes a graphical user interface for monitoring the system operation and a tool for deploying the system. To ensure reliability, S4P continuously checks stored data for integrity. Further reliability is provided by tape backups of disks, made once a disk partition is full and closed. The system is designed for low maintenance, requiring minimal operator oversight.
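
    The data-driven detection step can be pictured with the sketch below, which polls an FTP server and fetches any files not yet present in a local disk archive. The host, directory, and archive path are illustrative; S4PA itself comprises many cooperating scripts, of which this mimics only the detection-and-transfer stage.

    ```python
    # Minimal sketch of the data-driven detection stage: poll an FTP server and fetch any
    # files not yet present in the local disk archive. Host, directory, and archive path
    # are illustrative; the real system adds preprocessing, metadata, and integrity checks.
    import ftplib
    import os

    ARCHIVE_DIR = "/archive/incoming"                  # local disk-based archive (illustrative)

    def poll_and_fetch(host, remote_dir):
        seen = set(os.listdir(ARCHIVE_DIR))
        with ftplib.FTP(host) as ftp:
            ftp.login()                                # anonymous login for this sketch
            ftp.cwd(remote_dir)
            for name in ftp.nlst():
                if name not in seen:                   # new data detected: transfer it
                    with open(os.path.join(ARCHIVE_DIR, name), "wb") as fh:
                        ftp.retrbinary("RETR " + name, fh.write)

    if __name__ == "__main__":
        poll_and_fetch("ftp.example.org", "/outgoing")
    ```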

  17. Application and Prospect of Big Data in Water Resources

    NASA Astrophysics Data System (ADS)

    Xi, Danchi; Xu, Xinyi

    2017-04-01

    Because of developed information technology and affordable data storage, we have entered the era of data explosion. The term "Big Data" and the technology related to it have emerged and are commonly applied in many fields. However, academic studies have only recently paid attention to Big Data applications in water resources; as a result, water resource Big Data technology has not been fully developed. This paper introduces the concept of Big Data and its key technologies, including the Hadoop system and MapReduce. In addition, this paper focuses on the significance of applying big data in water resources and summarizes prior research by others. Most studies in this field only set up a theoretical framework, but we define "Water Big Data" and explain its three-dimensional properties: the time dimension, the spatial dimension and the intelligent dimension. Based on HBase, a classification system for Water Big Data is introduced: hydrology data, ecology data and socio-economic data. Then, after analyzing the challenges in water resources management, a series of solutions using Big Data technologies, such as data mining and web crawling, is proposed. Finally, the prospect of applying big data in water resources is discussed; it can be predicted that, as Big Data technology keeps developing, "3D" (Data Driven Decision) approaches will be utilized more in water resources management in the future.

  18. Integrated platform and API for electrophysiological data

    PubMed Central

    Sobolev, Andrey; Stoewer, Adrian; Leonhardt, Aljoscha; Rautenberg, Philipp L.; Kellner, Christian J.; Garbers, Christian; Wachtler, Thomas

    2014-01-01

    Recent advancements in technology and methodology have led to growing amounts of increasingly complex neuroscience data recorded from various species, modalities, and levels of study. The rapid data growth has made efficient data access and flexible, machine-readable data annotation a crucial requisite for neuroscientists. Clear and consistent annotation and organization of data is not only an important ingredient for reproducibility of results and re-use of data, but also essential for collaborative research and data sharing. In particular, efficient data management and interoperability requires a unified approach that integrates data and metadata and provides a common way of accessing this information. In this paper we describe GNData, a data management platform for neurophysiological data. GNData provides a storage system based on a data representation that is suitable to organize data and metadata from any electrophysiological experiment, with a functionality exposed via a common application programming interface (API). Data representation and API structure are compatible with existing approaches for data and metadata representation in neurophysiology. The API implementation is based on the Representational State Transfer (REST) pattern, which enables data access integration in software applications and facilitates the development of tools that communicate with the service. Client libraries that interact with the API provide direct data access from computing environments like Matlab or Python, enabling integration of data management into the scientist's experimental or analysis routines. PMID:24795616
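
    A REST-style API of this kind can be exercised with a generic HTTP client, as in the sketch below using the Python requests library. The base URL, endpoint paths, and JSON fields are hypothetical placeholders and are not the documented GNData API; the intent is only to show how such data access integrates into an analysis script.

    ```python
    # Minimal sketch of a client for a REST-style electrophysiology data service using
    # the requests library. The base URL, endpoints, and JSON fields are hypothetical
    # placeholders, not the documented GNData API.
    import requests

    BASE = "https://gndata.example.org/api"            # hypothetical server

    # List recording blocks, then enumerate the segments of the first block.
    blocks = requests.get(f"{BASE}/electrophysiology/blocks", timeout=30).json()
    first_block = blocks["objects"][0]["id"]

    segments = requests.get(f"{BASE}/electrophysiology/segments",
                            params={"block": first_block}, timeout=30).json()
    for seg in segments["objects"]:
        print(seg["id"], seg.get("name"))
    ```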

  19. Comparison of property between two Viking Seismic tapes

    NASA Astrophysics Data System (ADS)

    Yamamoto, Y.; Yamada, R.

    2016-12-01

    The restoration work on the seismometer data from Viking Lander 2 is still continuing. Originally, the data were processed and archived both at MIT and at UTIG separately, and each data set is accessible via the Internet today. The file formats used to store the data differ, but both are currently readable thanks to continued investigation. However, there is some inconsistency between their data, although most of the data are highly consistent. Understanding the differences requires knowledge of spacecraft archiving and off-line processing, because the differences are caused by the off-line processing. The data processing of spacecraft often requires merge and sort processing of raw data. The merge processing is normally performed to eliminate duplicated data, and the sort processing is performed to fix the data order. UTIG does not seem to have performed this merge and sort processing; therefore, the UTIG-processed data retain duplications. The MIT-processed data underwent merge and sort processing, but the raw data sometimes include wrong time tags, which cannot be fixed rigorously after sorting. Also, the MIT-processed data have sufficient documentation to understand the metadata, while the UTIG data have only brief instructions. Therefore, the MIT and UTIG data are treated as complementary, and a better data set can be established using both of them. In this presentation, we show the method to build a better data set of Viking Lander 2 seismic data.
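
    The merge and sort processing mentioned above amounts to combining two record streams, ordering them by time tag, and dropping duplicates, as in the small sketch below. The record structure is illustrative, and the real restoration work must additionally cope with wrong time tags, which this sketch ignores.

    ```python
    # Minimal sketch of merge and sort processing: combine two record streams, keep one
    # copy per time tag, and return the records in time order. Record fields are
    # illustrative; handling of wrong time tags is deliberately omitted.
    def merge_and_sort(records_a, records_b):
        merged = {}
        for rec in list(records_a) + list(records_b):
            merged.setdefault(rec["time_tag"], rec)     # keep the first copy of each time tag
        return [merged[t] for t in sorted(merged)]

    mit_records = [{"time_tag": 100, "counts": 5}, {"time_tag": 101, "counts": 7}]
    utig_records = [{"time_tag": 101, "counts": 7}, {"time_tag": 102, "counts": 4}]
    print(merge_and_sort(mit_records, utig_records))
    ```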

  20. Integrated platform and API for electrophysiological data.

    PubMed

    Sobolev, Andrey; Stoewer, Adrian; Leonhardt, Aljoscha; Rautenberg, Philipp L; Kellner, Christian J; Garbers, Christian; Wachtler, Thomas

    2014-01-01

    Recent advancements in technology and methodology have led to growing amounts of increasingly complex neuroscience data recorded from various species, modalities, and levels of study. The rapid data growth has made efficient data access and flexible, machine-readable data annotation a crucial requisite for neuroscientists. Clear and consistent annotation and organization of data is not only an important ingredient for reproducibility of results and re-use of data, but also essential for collaborative research and data sharing. In particular, efficient data management and interoperability requires a unified approach that integrates data and metadata and provides a common way of accessing this information. In this paper we describe GNData, a data management platform for neurophysiological data. GNData provides a storage system based on a data representation that is suitable to organize data and metadata from any electrophysiological experiment, with a functionality exposed via a common application programming interface (API). Data representation and API structure are compatible with existing approaches for data and metadata representation in neurophysiology. The API implementation is based on the Representational State Transfer (REST) pattern, which enables data access integration in software applications and facilitates the development of tools that communicate with the service. Client libraries that interact with the API provide direct data access from computing environments like Matlab or Python, enabling integration of data management into the scientist's experimental or analysis routines.

  1. SeaDataNet Pan-European infrastructure for Ocean & Marine Data Management

    NASA Astrophysics Data System (ADS)

    Manzella, G. M.; Maillard, C.; Maudire, G.; Schaap, D.; Rickards, L.; Nast, F.; Balopoulos, E.; Mikhailov, N.; Vladymyrov, V.; Pissierssens, P.; Schlitzer, R.; Beckers, J. M.; Barale, V.

    2007-12-01

    SeaDataNet is developing a pan-European data management infrastructure to ensure access to a large number of marine environmental data (i.e. temperature, salinity, current, sea level, and chemical, physical and biological properties), as well as their safeguarding and long-term archiving. Data are derived from many different sensors installed on board research vessels and satellites and on the various platforms of the marine observing system. SeaDataNet provides information on real-time and archived marine environmental data collected at a pan-European level, through directories of marine environmental data and projects. SeaDataNet provides access to the most comprehensive multidisciplinary sets of marine in-situ and remote sensing data, from about 40 laboratories, through user-friendly tools. Data selection and access are operated through the Common Data Index (CDI): XML files compliant with ISO standards and unified dictionaries. Technical developments carried out by SeaDataNet include: a library of standards - metadata standards, compliant with ISO 19115, for communication and interoperability between the data platforms; software for an interoperable online system - interconnection of distributed data centres by interfacing adapted communication technology tools; off-line data management software - the software representing the minimum equipment of all the data centres, "Ocean Data View" (ODV), developed by AWI; and training, education and capacity building - training 'on the job' is carried out by IOC-UNESCO in Ostende, and the SeaDataNet Virtual Educational Centre internet portal provides basic tools for informal education.

  2. A scalable neuroinformatics data flow for electrophysiological signals using MapReduce.

    PubMed

    Jayapandian, Catherine; Wei, Annan; Ramesh, Priya; Zonjy, Bilal; Lhatoo, Samden D; Loparo, Kenneth; Zhang, Guo-Qiang; Sahoo, Satya S

    2015-01-01

    Data-driven neuroscience research is providing new insights into the progression of neurological disorders and supporting the development of improved treatment approaches. However, the volume, velocity, and variety of neuroscience data generated from sophisticated recording instruments and acquisition methods have exacerbated the limited scalability of existing neuroinformatics tools. This makes it difficult for neuroscience researchers to effectively leverage the growing multi-modal neuroscience data to advance research in serious neurological disorders, such as epilepsy. We describe the development of the Cloudwave data flow, which uses new data partitioning techniques to store and analyze electrophysiological signals in a distributed computing infrastructure. The Cloudwave data flow uses the MapReduce parallel programming algorithm to implement an integrated signal data processing pipeline that scales with the large volume of data generated at high velocity. Using an epilepsy domain ontology together with an epilepsy-focused extensible data representation format called Cloudwave Signal Format (CSF), the data flow addresses the challenge of data heterogeneity and is interoperable with existing neuroinformatics data representation formats, such as HDF5. The scalability of the Cloudwave data flow is evaluated using a 30-node cluster installed with the open source Hadoop software stack. The results demonstrate that the Cloudwave data flow can process increasing volumes of signal data by leveraging Hadoop Data Nodes to reduce the total data processing time. The Cloudwave data flow is a template for developing highly scalable neuroscience data processing pipelines using MapReduce algorithms to support a variety of user applications.
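
    The MapReduce pattern underlying such a pipeline can be illustrated with the toy sketch below, which maps partitioned (channel, samples) records to partial aggregates and reduces them per channel. It shows the programming model only and is not the Cloudwave implementation or the CSF format.

    ```python
    # Toy MapReduce-style pipeline over partitioned signal data: map each partition to
    # (channel, partial aggregate) pairs, shuffle by key, then reduce to per-channel means.
    # Illustrates the programming model only; not the Cloudwave implementation.
    from collections import defaultdict
    from functools import reduce

    partitions = [
        ("EEG-1", [12.0, 13.5, 11.2]),
        ("EEG-2", [8.1, 7.9]),
        ("EEG-1", [12.8, 14.1]),
    ]

    def map_stats(channel, samples):
        # Mapper: emit (key, (count, sum)) pairs for one partition.
        return [(channel, (len(samples), sum(samples)))]

    def reduce_stats(a, b):
        # Reducer: combine two partial aggregates for the same key.
        return (a[0] + b[0], a[1] + b[1])

    grouped = defaultdict(list)
    for channel, samples in partitions:
        for key, value in map_stats(channel, samples):
            grouped[key].append(value)                  # shuffle: group values by key

    means = {key: total / count
             for key, (count, total) in
             ((k, reduce(reduce_stats, vals)) for k, vals in grouped.items())}
    print(means)                                        # per-channel mean amplitude
    ```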

  3. State-Level Immunization Information Systems: Potential for Childhood Immunization Data Linkages.

    PubMed

    Fuller, Jill E; Walter, Emmanuel B; Dole, Nancy; O'Hara, Richard; Herring, Amy H; Durkin, Maureen S; Specker, Bonny; Wey, Betty

    2017-01-01

    Objectives: Sources of immunization data include state registries or immunization information systems (IIS), medical records, and surveys. Little is known about the quality of these data sources or the feasibility of using IIS data for research. We assessed the feasibility of collecting immunization information for a national children's health study by accessing existing IIS data and comparing the completeness of these data against medical record abstractions (MRA) and parent report. Staff time needed to obtain IIS and MRA data was assessed. Methods: We administered a questionnaire to state-level IIS representatives to ascertain availability and completeness of their data for research and gather information about data formats. We evaluated quality of data from IIS, medical records, and reports from parents of 119 National Children's Study participants at three locations. Results: IIS data were comparable to MRA data and both were more complete than parental report. Agreement between IIS and MRA data was greater than between parental report and MRA, suggesting IIS and MRA are better sources than parental report. Obtaining IIS data took less staff time than chart review, making IIS data linkage for research a preferred choice. Conclusions: IIS survey results indicate data can be obtained by researchers using data linkages. IIS are an accessible and feasible child immunization information source and these registries reduce reliance on parental report or medical record abstraction. Researchers seeking to link IIS data with large multi-site studies should consider acquiring IIS data, but may need strategies to overcome barriers to data completeness and linkage.

  4. A scalable neuroinformatics data flow for electrophysiological signals using MapReduce

    PubMed Central

    Jayapandian, Catherine; Wei, Annan; Ramesh, Priya; Zonjy, Bilal; Lhatoo, Samden D.; Loparo, Kenneth; Zhang, Guo-Qiang; Sahoo, Satya S.

    2015-01-01

    Data-driven neuroscience research is providing new insights into the progression of neurological disorders and supporting the development of improved treatment approaches. However, the volume, velocity, and variety of neuroscience data generated from sophisticated recording instruments and acquisition methods have exacerbated the limited scalability of existing neuroinformatics tools. This makes it difficult for neuroscience researchers to effectively leverage the growing multi-modal neuroscience data to advance research in serious neurological disorders, such as epilepsy. We describe the development of the Cloudwave data flow, which uses new data partitioning techniques to store and analyze electrophysiological signals in a distributed computing infrastructure. The Cloudwave data flow uses the MapReduce parallel programming algorithm to implement an integrated signal data processing pipeline that scales with the large volume of data generated at high velocity. Using an epilepsy domain ontology together with an epilepsy-focused extensible data representation format called Cloudwave Signal Format (CSF), the data flow addresses the challenge of data heterogeneity and is interoperable with existing neuroinformatics data representation formats, such as HDF5. The scalability of the Cloudwave data flow is evaluated using a 30-node cluster installed with the open source Hadoop software stack. The results demonstrate that the Cloudwave data flow can process increasing volumes of signal data by leveraging Hadoop Data Nodes to reduce the total data processing time. The Cloudwave data flow is a template for developing highly scalable neuroscience data processing pipelines using MapReduce algorithms to support a variety of user applications. PMID:25852536

  5. Data Elevator

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    BYNA, SUNRENDRA; DONG, BIN; WU, KESHENG

    Data Elevator: Efficient Asynchronous Data Movement in Hierarchical Storage Systems. Multi-layer storage subsystems, including SSD-based burst buffers and disk-based parallel file systems (PFS), are becoming part of HPC systems. However, software for this storage hierarchy is still in its infancy. Applications may have to explicitly move data among the storage layers. We propose Data Elevator for transparently and efficiently moving data between a burst buffer and a PFS. Users specify the final destination for their data, typically on the PFS; Data Elevator intercepts the I/O calls, stages data on the burst buffer, and then asynchronously transfers the data to their final destination in the background. This system allows extensive optimizations, such as overlapping read and write operations, choosing I/O modes, and aligning buffer boundaries. In tests with large-scale scientific applications, Data Elevator is as much as 4.2X faster than Cray DataWarp, the state-of-the-art software for burst buffers, and 4X faster than directly writing to the PFS. The Data Elevator library uses HDF5's Virtual Object Layer (VOL) for intercepting parallel I/O calls that write data to the PFS. The intercepted calls are redirected to the Data Elevator, which provides a handle to write the file in a faster, intermediate burst buffer system. Once the application finishes writing the data to the burst buffer, the Data Elevator job uses HDF5 to move the data to the final destination in an asynchronous manner. Hence, the Data Elevator library is currently useful for applications that call HDF5 for writing data files. Also, the Data Elevator depends on the HDF5 VOL functionality.
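
    Because interception happens at the HDF5 Virtual Object Layer, an application keeps writing HDF5 as usual, along the lines of the sketch below (written with h5py; the file name and dataset are illustrative). No Data Elevator calls appear in application code; redirection to the burst buffer would be configured outside the application.

    ```python
    # Minimal sketch of an application writing HDF5 output with h5py. Under Data Elevator,
    # such writes could be intercepted via the HDF5 VOL and staged on a burst buffer
    # transparently; nothing Data Elevator-specific appears here, and names are illustrative.
    import h5py
    import numpy as np

    data = np.random.rand(1024, 1024)

    with h5py.File("checkpoint.h5", "w") as f:          # final destination nominally on the PFS
        f.create_dataset("fields/density", data=data, compression="gzip")
    ```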

  6. Supporting Snow Research: SnowEx Data and Services at the NASA National Snow and Ice Data Center DAAC

    NASA Astrophysics Data System (ADS)

    Leon, A.; Tanner, S.; Deems, J. S.

    2017-12-01

    The National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC), part of the Cooperative Institute for Research in Environmental Sciences (CIRES) at the University of Colorado Boulder, will archive and distribute all primary data sets collected during the NASA SnowEx campaigns. NSIDC DAAC's overarching goal for SnowEx data management is to steward the diverse SnowEx data sets to provide a reliable long-term archive, to enable effective data discovery, retrieval, and usage, and to support end user engagement. This goal will be achieved through coordination and collaboration with SnowEx project management and investigators. NSIDC DAAC's core functions for SnowEx data management include: Data Creation: Advise investigators on data formats and structure as well as metadata creation and content to enable preservation, usability, and discoverability. Data Documentation: Develop comprehensive data set documentation describing the instruments, data collection and derivation methods, and data file contents. Data Distribution: Provide discovery and access through NSIDC and NASA data portals to make SnowEx data available to a broad user community. Data & User Support: Assist user communities with the selection and usage of SnowEx data products. In an effort to educate and broaden the SnowEx user community, we will present an overview of the SnowEx data products, tools, and services which will be available at the NSIDC DAAC. We hope to gain further insight into how the DAAC can enable the user community to seamlessly and effectively utilize SnowEx data in their research and applications.

  7. Opportunities and challenges in conducting secondary analysis of HIV programmes using data from routine health information systems and personal health information.

    PubMed

    Gloyd, Stephen; Wagenaar, Bradley H; Woelk, Godfrey B; Kalibala, Samuel

    2016-01-01

    HIV programme data from routine health information systems (RHIS) and personal health information (PHI) provide ample opportunities for secondary data analysis. However, these data pose unique opportunities and challenges for use in health system monitoring, along with process and impact evaluations. Analyses focused on retrospective case reviews of four of the HIV-related studies published in this JIAS supplement. We identify specific opportunities and challenges with respect to the secondary analysis of RHIS and PHI data. Challenges working with both HIV-related RHIS and PHI included missing, inconsistent and implausible data; rapidly changing indicators; systematic differences in the utilization of services; and patient linkages over time and different data sources. Specific challenges among RHIS data included numerous registries and indicators, inconsistent data entry, gaps in data transmission, duplicate registry of information, numerator-denominator incompatibility and infrequent use of data for decision-making. Challenges specific to PHI included the time burden for busy providers, the culture of lax charting, overflowing archives for paper charts and infrequent chart review. Many of the challenges that undermine effective use of RHIS and PHI data for analyses are related to the processes and context of collecting the data, excessive data requirements, lack of knowledge of the purpose of data and the limited use of data among those generating the data. Recommendations include simplifying data sources, analysis and reporting; conducting systematic data quality audits; enhancing the use of data for decision-making; promoting routine chart review linked with simple patient tracking systems; and encouraging open access to RHIS and PHI data for increased use.

  8. Data and Data Products for Climate Research: Web Services at the Asia-Pacific Data-Research Center (APDRC)

    NASA Astrophysics Data System (ADS)

    DeCarlo, S.; Potemra, J. T.; Wang, K.

    2012-12-01

    The International Pacific Research Center (IPRC) at the University of Hawaii maintains a data center for climate studies called the Asia-Pacific Data-Research Center (APDRC). This data center was designed within a center of excellence in climate research with the intention of serving the needs of the research scientist. The APDRC provides easy access to a large collection of climate data and data products for a wide variety of users. The data center maintains an archive of approximately 100 data sets including in-situ and remote data, as well as a range of model-based output. All data are available via on-line browsing tools such as a Live Access Server (LAS) and DChart, and direct binary access is available through OPeNDAP services. On-line tutorials on how to use these services are now available. Users can keep up-to-date with new data and product announcements via the APDRC facebook page. The main focus of the APDRC has been climate scientists, and the services are therefore streamlined to such users, not only in the number and types of data served, but also in the way data are served. In addition, due to the integration of the APDRC within the IPRC, several value-added data products (see figure for an example using Argo floats) have been developed via a variety of research activities. The APDRC, therefore, has three main foci: 1. acquisition of climate-related data, 2. maintenance of integrated data servers, and 3. development and distribution of data products. The APDRC can be found at http://apdrc.soest.hawaii.edu. The presentation will provide an overview along with specific examples of the data, data products and data services available at the APDRC. (Figure: APDRC product example - a gridded field from Argo profiling floats.)
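
    Data served through OPeNDAP can be read remotely from analysis environments, for example with xarray as in the sketch below. The dataset URL and variable name are placeholders rather than an actual APDRC endpoint.

    ```python
    # Minimal sketch of remote access to a gridded product over OPeNDAP with xarray.
    # The URL and variable name are hypothetical; a real APDRC dataset URL would be
    # taken from the catalog at http://apdrc.soest.hawaii.edu.
    import xarray as xr

    url = "http://apdrc.soest.hawaii.edu/dods/some_gridded_product"   # placeholder URL
    ds = xr.open_dataset(url)                                         # lazy, remote access
    subset = ds["temp"].sel(lat=slice(10, 30), lon=slice(120, 160))   # placeholder variable
    print(float(subset.mean()))
    ```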

  9. Opening Health Data: What Do Researchers Want? Early Experiences With New York's Open Health Data Platform.

    PubMed

    Martin, Erika G; Helbig, Natalie; Birkhead, Guthrie S

    2015-01-01

    Governments are rapidly developing open data platforms to improve transparency and make information more accessible. New York is a leader, with currently the only state platform devoted to health. Although these platforms could build public health departments' capabilities to serve more researchers, agencies have little guidance on releasing meaningful and usable data. Structured focus groups with researchers and practitioners collected stakeholder feedback on potential uses of open health data and New York's open data strategy. Researchers and practitioners attended a 1-day November 2013 workshop on New York State's open health data resources. After learning about the state's open data platform and vision for open health data, participants were organized into 7 focus groups to discuss the essential elements of open data sets, practical challenges to obtaining and using health data, and potential uses of open data. Participants included 33 quantitative health researchers from State University of New York campuses and private partners and 10 practitioners from the New York State Department of Health. There was low awareness of open data, with 67% of researchers reporting never using open data portals prior to the workshop. Participants were interested in data sets that were geocoded, longitudinal, or aggregated to small area granularity and capabilities to link multiple data sets. Multiple environmental conditions and barriers hinder their capacity to use health data for research. Although open data platforms cannot address all barriers, they provide multiple opportunities for public health research and practice, and participants were overall positive about the state's efforts to release open data. Open data are not ideal for some researchers because they do not contain individually identifiable data, indicating a need for tiered data release strategies. However, they do provide important new opportunities to facilitate research and foster collaborations among agencies, researchers, and practitioners.

  10. Has open data arrived at the British Medical Journal (BMJ)? An observational study.

    PubMed

    Rowhani-Farid, Anisa; Barnett, Adrian G

    2016-10-13

    To quantify data sharing trends and data sharing policy compliance at the British Medical Journal (BMJ) by analysing the rate of data sharing practices, and to investigate attitudes towards and barriers to data sharing. Observational study. The BMJ research archive. 160 randomly sampled BMJ research articles from 2009 to 2015, excluding meta-analyses and systematic reviews. Percentages of research articles that indicated the availability of their raw data sets in their data sharing statements, and those that easily made their data sets available on request. 3 articles contained the data in the article. 50 out of 157 (32%) remaining articles indicated the availability of their data sets. 12 used publicly available data and the remaining 38 were sent email requests to access their data sets. Only 1 publicly available data set could be accessed and only 6 out of 38 shared their data via email. So only 7/157 research articles shared their data sets, 4.5% (95% CI 1.8% to 9%). For 21 clinical trials bound by the BMJ data sharing policy, the per cent shared was 24% (8% to 47%). Despite the BMJ's strong data sharing policy, sharing rates are low. Possible explanations for low data sharing rates could be: the wording of the BMJ data sharing policy, which leaves room for individual interpretation and possible loopholes; that our email requests ended up in researchers' spam folders; and that researchers are not rewarded for sharing their data. It might be time for a more effective data sharing policy and better incentives for health and medical researchers to share their data.

  11. Enabling the Usability of Earth Science Data Products and Services by Evaluating, Describing, and Improving Data Quality throughout the Data Lifecycle

    NASA Astrophysics Data System (ADS)

    Downs, R. R.; Peng, G.; Wei, Y.; Ramapriyan, H.; Moroni, D. F.

    2015-12-01

    Earth science data products and services are being used by representatives of various science and social science disciplines, by planning and decision-making professionals, by educators and learners ranging from primary through graduate and informal education, and by the general public. The diversity of users and uses of Earth science data is gratifying and offers new challenges for enabling the usability of these data by audiences with various purposes and levels of expertise. Users and other stakeholders need capabilities to efficiently find, explore, select, and determine the applicability and suitability of data products and services to meet their objectives and information needs. Similarly, they need to be able to understand the limitations of Earth science data, which can be complex, especially when considering combined or simultaneous use of multiple data products and services. Quality control efforts of stakeholders, throughout the data lifecycle, can contribute to the usability of Earth science data to meet the needs of diverse users. Such stakeholders include study design teams, data producers, data managers and curators, archives, systems professionals, data distributors, end-users, intermediaries, sponsoring organizations, hosting institutions, and others. Opportunities for engaging stakeholders to review, describe, and improve the quality of Earth science data products and services throughout the data lifecycle are identified and discussed. Insight is shared from the development of guidelines for implementing the Group on Earth Observations (GEO) Data Management Principles, the recommendations from the Earth Science Data System Working Group (ESDSWG) on Data Quality, and the efforts of the Information Quality Cluster of the Federation of Earth Science Information Partners (ESIP). Examples and outcomes from quality control efforts of data facilities, such as scientific data centers, that contribute to the usability of Earth science data also are offered.

  12. Advancing User Supports with a Structured How-To Knowledge Base for Earth Science Data

    NASA Technical Reports Server (NTRS)

    Shen, Suhung; Acker, James G.; Lynnes, Christopher S.; Beaty, Tammy; Lighty, Luther; Kempler, Steven J.

    2016-01-01

    It is a challenge to access and process fast-growing Earth science data from satellites and numerical models, which may be archived in very different data formats and structures. NASA data centers, managed by the Earth Observing System Data and Information System (EOSDIS), have developed a rich and diverse set of data services and tools with features intended to simplify finding, downloading, and working with these data. Although most data services and tools have user guides, many users still experience difficulties with accessing or reading data due to varying levels of familiarity with data services, tools, and/or formats. A type of structured online document, the data recipe, was created beginning in 2013 by the Goddard Earth Science Data and Information Services Center (GES DISC). A data recipe is a How-To document created using a fixed template, containing step-by-step instructions with screenshots and examples of accessing and working with real data. The recipes have been found to be very helpful, especially to first-time users of particular data services, tools, or data products. Online traffic to the data recipe pages is significant for some recipes. In 2014, the NASA Earth Science Data System Working Group (ESDSWG) for data recipes was established, aiming to initiate an EOSDIS-wide campaign for leveraging the distributed knowledge within EOSDIS and its user communities regarding their respective services and tools. The ESDSWG data recipe group started with an inventory and analysis of existing EOSDIS-wide online help documents, and provided recommendations and guidelines for writing and grouping data recipes. This presentation will give an overview of activities for creating How-To documents at GES DISC and ESDSWG. We encourage feedback and contributions from users for improving the data How-To knowledge base.

  13. Credentialing Data Scientists: A Domain Repository Perspective

    NASA Astrophysics Data System (ADS)

    Lehnert, K. A.; Furukawa, H.

    2015-12-01

    A career in data science can have many paths: data curation, data analysis, metadata modeling - all of these in different commercial or scientific applications. Can a certification as 'data scientist' provide the guarantee that an applicant or candidate for a data science position has just the right skills? How valuable is a 'generic' certification as data scientist for an employer looking to fill a data science position? Credentials that are more specific and discipline-oriented may be more valuable to both the employer and the job candidate. One employment sector for data scientists is the data repositories that provide discipline-specific data services for science communities. Data science positions within domain repositories include a wide range of responsibilities in support of the full data life cycle - from data preservation and curation, to development of data models, ontologies, and user interfaces, to development of data analysis and visualization tools, to community education and outreach - and require a substantial degree of discipline-specific knowledge of scientific data acquisition and analysis workflows, data quality measures, and data cultures. Can there be certification programs for domain-specific data scientists that help build the urgently needed workforce for the repositories? The American Geophysical Union has recently started an initiative to develop a program for data science continuing education and data science professional certification for the Earth and space sciences. An Editorial Board has been charged to identify and develop curricula and content for these programs and to provide input and feedback in the implementation of the program. This presentation will report on the progress of this initiative and evaluate its utility for the needs of domain repositories in the Earth and space sciences.

  14. Advancing User Supports with Structured How-To Knowledge Base for Earth Science Data

    NASA Astrophysics Data System (ADS)

    Shen, S.; Acker, J. G.; Lynnes, C.; Lighty, L.; Beaty, T.; Kempler, S.

    2016-12-01

    It is a challenge to access and process fast-growing Earth science data from satellites and numerical models, which may be archived in very different data formats and structures. NASA data centers, managed by the Earth Observing System Data and Information System (EOSDIS), have developed a rich and diverse set of data services and tools with features intended to simplify finding, downloading, and working with these data. Although most data services and tools have user guides, many users still experience difficulties with accessing or reading data due to varying levels of familiarity with data services, tools, and/or formats. A type of structured online document, the "data recipe", was created beginning in 2013 by the Goddard Earth Science Data and Information Services Center (GES DISC). A data recipe is a "How-To" document created using a fixed template, containing step-by-step instructions with screenshots and examples of accessing and working with real data. The recipes have been found to be very helpful, especially to first-time users of particular data services, tools, or data products. Online traffic to the data recipe pages is significant for some recipes. In 2014, the NASA Earth Science Data System Working Group (ESDSWG) for data recipes was established, aiming to initiate an EOSDIS-wide campaign for leveraging the distributed knowledge within EOSDIS and its user communities regarding their respective services and tools. The ESDSWG data recipe group started with an inventory and analysis of existing EOSDIS-wide online help documents, and provided recommendations and guidelines for writing and grouping data recipes. This presentation will give an overview of activities for creating How-To documents at GES DISC and ESDSWG. We encourage feedback and contributions from users for improving the data How-To knowledge base.

  15. Data Sharing & Publishing at Nature Publishing Group

    NASA Astrophysics Data System (ADS)

    VanDecar, J. C.; Hrynaszkiewicz, I.; Hufton, A. L.

    2015-12-01

    In recent years, the research community has come to recognize that upon-request data sharing has important limitations [1,2]. The Nature-titled journals feel that researchers have a duty to share data without undue qualifications, in a manner that allows others to replicate and build upon their published findings. Historically, the Nature journals have been strong supporters of data deposition in communities with existing data mandates, and have required data sharing upon request in all other cases. To help address some of the limitations of upon-request data sharing, the Nature titles have strengthened their existing data policies and forged a new partnership with Scientific Data, to promote wider data sharing in discoverable, citeable and reusable forms, and to ensure that scientists get appropriate credit for sharing [3]. Scientific Data is a new peer-reviewed journal for descriptions of research datasets, which works with a wide range of public data repositories [4]. Articles at Scientific Data may either expand on research publications at other journals or may be used to publish new datasets. The Nature Publishing Group has also signed the Joint Declaration of Data Citation Principles [5], and Scientific Data is our first journal to include formal data citations. We are currently in the process of adding data citation support to our various journals. [1] Wicherts, J. M., Borsboom, D., Kats, J. & Molenaar, D. The poor availability of psychological research data for reanalysis. Am. Psychol. 61, 726-728, doi:10.1037/0003-066x.61.7.726 (2006). [2] Vines, T. H. et al. Mandated data archiving greatly improves access to research data. FASEB J. 27, 1304-1308, doi:10.1096/fj.12-218164 (2013). [3] Data-access practices strengthened. Nature 515, 312, doi:10.1038/515312a (2014). [4] More bang for your byte. Sci. Data 1, 140010, doi:10.1038/sdata.2014.10 (2014). [5] Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. (FORCE11, San Diego, CA, 2014).

  16. Enabling Data-Driven Methodologies Across the Data Lifecycle and Ecosystem

    NASA Astrophysics Data System (ADS)

    Doyle, R. J.; Crichton, D.

    2017-12-01

    NASA has unlocked unprecedented scientific knowledge through exploration of the Earth, our solar system, and the larger universe. NASA is generating enormous amounts of data that are challenging traditional approaches to capturing, managing, analyzing and ultimately gaining scientific understanding from science data. New architectures, capabilities and methodologies are needed to span the entire observing system, from spacecraft to archive, while integrating data-driven discovery and analytic capabilities. NASA data have a definable lifecycle, from remote collection point to validated accessibility in multiple archives. Data challenges must be addressed across this lifecycle, to capture opportunities and avoid decisions that may limit or compromise what is achievable once data arrives at the archive. Data triage may be necessary when the collection capacity of the sensor or instrument overwhelms data transport or storage capacity. By migrating computational and analytic capability to the point of data collection, informed decisions can be made about which data to keep; in some cases, to close observational decision loops onboard, to enable attending to unexpected or transient phenomena. Along a different dimension than the data lifecycle, scientists and other end-users must work across an increasingly complex data ecosystem, where the range of relevant data is rarely owned by a single institution. To operate effectively, scalable data architectures and community-owned information models become essential. NASA's Planetary Data System is having success with this approach. Finally, there is the difficult challenge of reproducibility and trust. While data provenance techniques will be part of the solution, future interactive analytics environments must support an ability to provide a basis for a result: relevant data source and algorithms, uncertainty tracking, etc., to assure scientific integrity and to enable confident decision making. Advances in data science offer opportunities to gain new insights from space missions and their vast data collections. We are working to innovate new architectures, exploit emerging technologies, develop new data-driven methodologies, and transfer them across disciplines, while working across the dual dimensions of the data lifecycle and the data ecosystem.

  17. Cultivating Data Expertise and Roles at a National Research Center

    NASA Astrophysics Data System (ADS)

    Thompson, C. A.

    2015-12-01

    As research becomes more computation and data-intensive, it brings new demands for staff that can manage complex data, design user services, and facilitate open access. Responding to these new demands, universities and research institutions are developing data services to support their scientists and scholarly communities. As more organizations extend their operations to research data, a better understanding of the staff roles and expertise required to support data-intensive research services is needed. What is data expertise - knowledge, skills, and roles? This study addresses this question through a case study of an exemplar research center, the National Center for Atmospheric Research (NCAR) in Boulder, CO. The NCAR case study results were supplemented and validated with a set of interviews of managers at additional geoscience data centers. To date, 11 interviews with NCAR staff and 19 interviews with managers at supplementary data centers have been completed. Selected preliminary results from the qualitative analysis will be reported in the poster: Data professionals have cultivated expertise in areas such as managing scientific data and products, understanding use and users, harnessing technology for data solutions, and standardizing metadata and data sets. Staff roles and responsibilities have evolved over the years to create new roles for data scientists, data managers/curators, data engineers, and senior managers of data teams, embedding data expertise into each NCAR lab. Explicit career paths and ladders for data professionals are limited but starting to emerge. NCAR has supported organization-wide efforts for data management, leveraging knowledge and best practices across all the labs and their staff. Based on preliminary results, NCAR provides a model for how organizations can build expertise and roles into their data service models. Data collection for this study is ongoing. The author anticipates that the results will help answer questions on what are the knowledge and skills required for data professionals and how organizations can develop data expertise.

  18. Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data

    PubMed Central

    Zhang, Chao; Bijlard, Jochem; Staiger, Christine; Scollen, Serena; van Enckevort, David; Hoogstrate, Youri; Senf, Alexander; Hiltemann, Saskia; Repo, Susanna; Pipping, Wibo; Bierkens, Mariska; Payralbe, Stefan; Stringer, Bas; Heringa, Jaap; Stubbs, Andrew; Bonino Da Silva Santos, Luiz Olavo; Belien, Jeroen; Weistra, Ward; Azevedo, Rita; van Bochove, Kees; Meijer, Gerrit; Boiten, Jan-Willem; Rambla, Jordi; Fijneman, Remond; Spalding, J. Dylan; Abeln, Sanne

    2017-01-01

    The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data. PMID:29123641
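
    The linkage idea above can be sketched very simply: each well-defined ontology level (study, data access committee, physical sample, data sample, raw data file) carries a persistent identifier, so a cohort selected in one system can be traced back to the raw files held in another. The identifiers below only mimic the general form of EGA accessions and are invented for illustration; this is not the TraIT, tranSMART, or EGA API.

      # Minimal sketch: persistent identifiers per ontology level, and a resolver
      # from sample identifiers back to raw data files. All values are hypothetical.
      LINKAGE = [
          {
              "study": "EGAS00000000001",          # study-level accession (example form)
              "dac": "EGAC00000000001",            # data access committee
              "physical_sample": "CLUC-CELLLINE-07",
              "data_sample": "EGAN00000000042",
              "raw_file": "EGAF00000000314",
          },
      ]

      def raw_files_for_samples(sample_ids, linkage=LINKAGE):
          """Resolve a set of sample identifiers to the raw files behind them."""
          return sorted({row["raw_file"] for row in linkage
                         if row["data_sample"] in sample_ids
                         or row["physical_sample"] in sample_ids})

      # A cohort selected in a clinical data warehouse is just a list of sample IDs;
      # those IDs are enough to locate the raw data for re-analysis elsewhere.
      print(raw_files_for_samples({"EGAN00000000042"}))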

  19. Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data.

    PubMed

    Zhang, Chao; Bijlard, Jochem; Staiger, Christine; Scollen, Serena; van Enckevort, David; Hoogstrate, Youri; Senf, Alexander; Hiltemann, Saskia; Repo, Susanna; Pipping, Wibo; Bierkens, Mariska; Payralbe, Stefan; Stringer, Bas; Heringa, Jaap; Stubbs, Andrew; Bonino Da Silva Santos, Luiz Olavo; Belien, Jeroen; Weistra, Ward; Azevedo, Rita; van Bochove, Kees; Meijer, Gerrit; Boiten, Jan-Willem; Rambla, Jordi; Fijneman, Remond; Spalding, J Dylan; Abeln, Sanne

    2017-01-01

    The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.

  20. Quarterly environmental data summary for first quarter 1998

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    NONE

    1998-05-01

    In support of the Weldon Spring Site Remedial Action Project Federal Facilities Agreement, a copy of the Quarterly Environmental Data Summary (QEDS) for the first quarter of 1998 is enclosed. The data presented in this letter and attachment constitute the QEDS. The data were received from the contract laboratories, verified by the Weldon Spring Site verification group and, except for air monitoring data and site KPA generated data (uranium analyses), merged into the data base during the first quarter of 1998. Air monitoring data presented are the most recent complete sets of quarterly data. Air data are not stored in the data base, and KPA data are not merged into the regular data base. Significant data, defined as data values that have exceeded defined "above normal" Level 2 values, are discussed in this letter for Environmental Monitoring Plan (EMP) generated data only. Above normal Level 2 values are based, in ES&H procedures, on historical high values, DOE Derived Concentration Guides (DCGs), NPDES limits and other guidelines. The procedures also establish actions to be taken in the event that "above normal" data occur. All data received and verified during the first quarter were within a permissible range of variability except for those detailed below. Above normal occurrences are cited for groundwater, air, and NPDES data. There were none for springs or surface water. The following discussion offers a brief summary of the data merged during the first quarter that exceeded the above normal criteria and updates on past reported above normal data. The attached tables present the most recent data for air and the data merged into the data base during the first quarter 1998 for groundwater, NPDES, surface water, and springs. Graphs showing concentrations of selected contaminants of concern at some of the critical locations have also been included in this QEDS. The graphs are discussed in the separate sections.
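
    The screening logic described above (comparing merged results against "above normal" Level 2 values and reporting exceedances) can be sketched as follows. The analytes, units, and threshold values are hypothetical placeholders, not the actual ES&H procedure values for the Weldon Spring site.

      # Illustrative check: flag measurements that exceed an analyte's "above normal"
      # Level 2 screening value so they can be discussed and followed up.
      LEVEL2_ABOVE_NORMAL = {        # analyte -> hypothetical screening value
          "uranium_pCi_L": 600.0,
          "nitroaromatics_ug_L": 10.0,
      }

      def screen(records, thresholds=LEVEL2_ABOVE_NORMAL):
          """Return the records whose value exceeds the analyte's Level 2 value."""
          flagged = []
          for rec in records:  # rec: {"location": ..., "analyte": ..., "value": ...}
              limit = thresholds.get(rec["analyte"])
              if limit is not None and rec["value"] > limit:
                  flagged.append({**rec, "level2": limit})
          return flagged

      data = [
          {"location": "MW-2001", "analyte": "uranium_pCi_L", "value": 720.0},
          {"location": "SP-5303", "analyte": "uranium_pCi_L", "value": 35.0},
      ]
      for hit in screen(data):
          print(f'{hit["location"]}: {hit["analyte"]} = {hit["value"]} '
                f'(above normal Level 2 = {hit["level2"]})')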

  1. GeoMapApp as a platform for visualizing marine data from Polar Regions

    NASA Astrophysics Data System (ADS)

    Nitsche, F. O.; Ryan, W. B.; Carbotte, S. M.; Ferrini, V.; Goodwillie, A. M.; O'hara, S. H.; Weissel, R.; McLain, K.; Chinhong, C.; Arko, R. A.; Chan, S.; Morton, J. J.; Pomeroy, D.

    2012-12-01

    To maximize the investment in expensive fieldwork, the resulting data should be re-used as much as possible. In addition, unnecessary duplication of data collection effort should be avoided. This becomes even more important if access to field areas is as difficult and expensive as it is in Polar Regions. Making existing data discoverable in an easy-to-use platform is key to improving re-use and avoiding duplication. A common obstacle is that use of existing data is often limited to specialists who know of the data's existence and also have the right tools to view and analyze these data. GeoMapApp is a free, interactive, map-based tool that allows users to discover, visualize, and analyze a large number of data sets. In addition to a global view, it provides polar map projections for displaying data in Arctic and Antarctic areas. Data that have recently been added to the system include Arctic swath bathymetry data collected from the USCG icebreaker Healy. These data are collected almost continuously, including from cruises where bathymetry is not the main objective and for which the existence of the acquired data may not be well known. In contrast, the existence of seismic data from the Antarctic continental margin is well known in the seismic community. They are archived at, and can be accessed through, the Antarctic Seismic Data Library System (SDLS). Incorporating these data into GeoMapApp makes an even broader community aware of them, and the custom interface, which includes capabilities to visualize and explore these data, allows users without specific software or knowledge of the underlying data format to access the data. In addition to investigating these datasets, GeoMapApp provides links to the actual data sources to allow specialists the opportunity to re-use the original data. Identification of data sources and data references is achieved on different levels. For access to the actual Antarctic seismic data, GeoMapApp links to the SDLS site, where users have to register before downloading the data and where they are informed about data owners. For the swath bathymetry data, GeoMapApp links to an IEDA/MGDS web page for each cruise containing detailed information about investigators and surveys.

  2. SeaDataNet - Pan-European infrastructure for marine and ocean data management: Unified access to distributed data sets

    NASA Astrophysics Data System (ADS)

    Schaap, D. M. A.; Maudire, G.

    2009-04-01

    SeaDataNet is an Integrated research Infrastructure Initiative (I3) in EU FP6 (2006 - 2011) to provide a data management system adapted both to the fragmented observation system and to the users' need for integrated access to data, meta-data, products and services. SeaDataNet therefore ensures the long-term archiving of the large volume of multidisciplinary data (i.e. temperature, salinity, current, sea level, chemical, physical and biological properties) collected by many different sensors installed on board research vessels, satellites and the various platforms of the marine observing system. The SeaDataNet project started in 2006, but builds upon earlier data management infrastructure projects, undertaken over a period of 20 years by an expanding network of oceanographic data centres from the countries around all European seas. Its predecessor project Sea-Search had a strict focus on metadata. SeaDataNet maintains significant interest in the further development of the metadata infrastructure, but its primary objective is the provision of easy data access and generic data products. SeaDataNet is a distributed infrastructure that provides transnational access to marine data, meta-data, products and services through 40 interconnected Trans National Data Access Platforms (TAP) from 35 countries around the Black Sea, Mediterranean, North East Atlantic, North Sea, Baltic and Arctic regions. These include National Oceanographic Data Centres (NODCs) and Satellite Data Centres. Furthermore, the SeaDataNet consortium comprises a number of expert modelling centres, SMEs with IT expertise, and 3 international bodies (ICES, IOC and JRC). Planning: The SeaDataNet project is delivering and operating the infrastructure in 3 versions: Version 0: maintenance and further development of the metadata systems developed by the Sea-Search project, plus the development of a new metadata system for indexing and accessing individual data objects managed by the SeaDataNet data centres. This is known as the Common Data Index (CDI) V0 system. Version 1: harmonisation and upgrading of the metadatabases through adoption of the ISO 19115 metadata standard and provision of transparent data access and download services from all partner data centres through upgrading the Common Data Index and deployment of a data object delivery service. Version 2: adding data product services and OGC compliant viewing services and further virtualisation of data access. SeaDataNet Version 0: The SeaDataNet portal has been set up at http://www.seadatanet.org and it provides a platform for all SeaDataNet services and standards as well as background information about the project and its partners. It includes discovery services via the following catalogues: CSR - Cruise Summary Reports of research vessels; EDIOS - Locations and details of monitoring stations and networks / programmes; EDMED - High level inventory of Marine Environmental Data sets collected and managed by research institutes and organisations; EDMERP - Marine Environmental Research Projects; EDMO - Marine Organisations. These catalogues are interrelated, where possible, to facilitate cross searching and context searching. These catalogues connect to the Common Data Index (CDI). Common Data Index (CDI) The CDI gives detailed insight into the datasets available at partners' databases and paves the way to direct online data access or direct online requests for data access / data delivery. 
The CDI V0 metadatabase contains more than 340.000 individual data entries from 36 CDI partners from 29 countries across Europe, covering a broad scope and range of data, held by these organisations. For purposes of standardisation and international exchange the ISO19115 metadata standard has been adopted. The CDI format is defined as a dedicated subset of this standard. A CDI XML format supports the exchange between CDI-partners and the central CDI manager, and ensures interoperability with other systems and networks. CDI XML entries are generated by participating data centres, directly from their databases. CDI-partners can make use of dedicated SeaDataNet Tools to generate CDI XML files automatically. Approach for SeaDataNet V1 and V2: The approach for SeaDataNet V1 and V2, which is in line with the INSPIRE Directive, comprises the following services: Discovery services = Metadata directories Security services = Authentication, Authorization & Accounting (AAA) Delivery services = Data access & downloading of datasets Viewing services = Visualisation of metadata, data and data products Product services = Generic and standard products Monitoring services = Statistics on usage and performance of the system Maintenance services = Updating of metadata by SeaDataNet partners The services will be operated over a distributed network of interconnected Data Centres accessed through a central Portal. In addition to service access the portal will provide information on data management standards, tools and protocols. The architecture has been designed to provide a coherent system based on V1 services, whilst leaving the pathway open for later extension with V2 services. For the implementation, a range of technical components have been defined. Some are already operational with the remainder in the final stages of development and testing. These make use of recent web technologies, and also comprise Java components, to provide multi-platform support and syntactic interoperability. To facilitate sharing of resources and interoperability, SeaDataNet has adopted SOAP Web Service technology. The SeaDataNet architecture and components have been designed to handle all kinds of oceanographic and marine environmental data including both in-situ measurements and remote sensing observations. The V1 technical development is ready and the V1 system is now being implemented and adopted by all participating data centres in SeaDataNet. Interoperability: Interoperability is the key to distributed data management system success and it is achieved in SeaDataNet V1 by: Using common quality control protocols and flag scale Using controlled vocabularies from a single source that have been developed using international content governance Adopting the ISO 19115 metadata standard for all metadata directories Providing XML Validation Services to quality control the metadata maintenance, including field content verification based on Schematron. Providing standard metadata entry tools Using harmonised Data Transport Formats (NetCDF, ODV ASCII and MedAtlas ASCII) for data sets delivery Adopting of OGC standards for mapping and viewing services Using SOAP Web Services in the SeaDataNet architecture SeaDataNet V1 Delivery Services: An important objective of the V1 system is to provide transparent access to the distributed data sets via a unique user interface at the SeaDataNet portal and download service. In the SeaDataNet V1 architecture the Common Data Index (CDI) V1 provides the link between discovery and delivery. 
The CDI user interface enables users to have a detailed insight of the availability and geographical distribution of marine data, archived at the connected data centres, and it provides the means for downloading data sets in common formats via a transaction mechanism. The SeaDataNet portal provides registered users access to these distributed data sets via the CDI V1 Directory and a shopping basket mechanism. This allows registered users to locate data of interest and submit their data requests. The requests are forwarded automatically from the portal to the relevant SeaDataNet data centres. This process is controlled via the Request Status Manager (RSM) Web Service at the portal and a Download Manager (DM) java software module, implemented at each of the data centres. The RSM also enables registered users to check regularly the status of their requests and download data sets, after access has been granted. Data centres can follow all transactions for their data sets online and can handle requests which require their consent. The actual delivery of data sets is done between the user and the selected data centre. The CDI V1 system is now being populated by all participating data centres in SeaDataNet, thereby phasing out CDI V0. 0.1 SeaDataNet Partners: IFREMER (France), MARIS (Netherlands), HCMR/HNODC (Greece), ULg (Belgium), OGS (Italy), NERC/BODC (UK), BSH/DOD (Germany), SMHI (Sweden), IEO (Spain), RIHMI/WDC (Russia), IOC (International), ENEA (Italy), INGV (Italy), METU (Turkey), CLS (France), AWI (Germany), IMR (Norway), NERI (Denmark), ICES (International), EC-DG JRC (International), MI (Ireland), IHPT (Portugal), RIKZ (Netherlands), RBINS/MUMM (Belgium), VLIZ (Belgium), MRI (Iceland), FIMR (Finland ), IMGW (Poland), MSI (Estonia), IAE/UL (Latvia), CMR (Lithuania), SIO/RAS (Russia), MHI/DMIST (Ukraine), IO/BAS (Bulgaria), NIMRD (Romania), TSU (Georgia), INRH (Morocco), IOF (Croatia), PUT (Albania), NIB (Slovenia), UoM (Malta), OC/UCY (Cyprus), IOLR (Israel), NCSR/NCMS (Lebanon), CNR-ISAC (Italy), ISMAL (Algeria), INSTM (Tunisia)
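
    As described above, CDI XML entries are generated by participating data centres directly from their databases. The sketch below shows the general pattern of producing such an entry from a local database record with Python's xml.etree.ElementTree; the element names are simplified placeholders and do not reproduce the actual CDI or ISO 19115 schema.

      # Sketch: generating a metadata entry from a local database record.
      # Element names are simplified placeholders, not the real CDI/ISO 19115 schema.
      import xml.etree.ElementTree as ET

      def cdi_entry(record):
          root = ET.Element("metadata")                      # placeholder root element
          ET.SubElement(root, "identifier").text = record["local_id"]
          ET.SubElement(root, "title").text = record["title"]
          extent = ET.SubElement(root, "boundingBox")
          for key in ("west", "east", "south", "north"):
              ET.SubElement(extent, key).text = str(record[key])
          ET.SubElement(root, "parameter").text = record["parameter"]
          return ET.tostring(root, encoding="unicode")

      row = {"local_id": "NODC-12345", "title": "CTD cast, North Sea",
             "west": 2.1, "east": 2.2, "south": 54.0, "north": 54.1,
             "parameter": "sea_water_temperature"}
      print(cdi_entry(row))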

  3. Ocean Acidification Scientific Data Stewardship: An approach for end-to-end data management and integration

    NASA Astrophysics Data System (ADS)

    Arzayus, K. M.; Garcia, H. E.; Jiang, L.; Michael, P.

    2012-12-01

    As the designated Federal permanent oceanographic data center in the United States, NOAA's National Oceanographic Data Center (NODC) has been providing scientific stewardship for national and international marine environmental and ecosystem data for over 50 years. NODC is supporting NOAA's Ocean Acidification Program and the science community by providing end-to-end scientific data management of ocean acidification (OA) data, dedicated online data discovery, and user-friendly access to a diverse range of historical and modern OA and other chemical, physical, and biological oceanographic data. This effort is being catalyzed by the NOAA Ocean Acidification Program, but the intended reach is for the broader scientific ocean acidification community. The first three years of the project will be focused on infrastructure building. A complete ocean acidification data content standard is being developed to ensure that a full spectrum of ocean acidification data and metadata can be stored and utilized for optimal data discovery and access in usable data formats. We plan to develop a data access interface capable of allowing users to constrain their search based on real-time and delayed mode measured variables, scientific data quality, their observation types, the temporal coverage, methods, instruments, standards, collecting institutions, and the spatial coverage. In addition, NODC seeks to utilize the existing suite of international standards (including ISO 19115-2 and CF-compliant netCDF) to help our data producers use those standards for their data, and help our data consumers make use of the well-standardized metadata-rich data sets. These tools will be available through our NODC Ocean Acidification Scientific Data Stewardship (OADS) web page at http://www.nodc.noaa.gov/oceanacidification. NODC also has a goal to provide each archived dataset with a unique ID, to ensure a means of providing credit to the data provider. Working with partner institutions, such as the Carbon Dioxide Information Analysis Center (CDIAC), Biological and Chemical Oceanography Data management Office (BCO-DMO), and federal labs, NODC is exploring the challenges of coordinated data flow and quality control for diverse ocean acidification data sets. These data sets include data from coastal and ocean monitoring, laboratory and field experiments, model output, and remotely sensed data. NODC already has in place automated data extraction protocols for archiving oceanographic data from BCO-DMO and CDIAC. We present a vision for how these disparate data streams can be more fully utilized when brought together using data standards. Like the Multiple-Listing Service in the real estate market, the OADS project is dedicated to developing a repository of ocean acidification data from all sources, and to serving them to the ocean acidification community using a user-friendly interface in a timely manner. For further information please contact NODC.Ocean.Acidification@noaa.gov.
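
    The abstract's emphasis on CF-compliant netCDF can be illustrated with a minimal sketch of writing an ocean-acidification time series so that the metadata travel with the data. It assumes the netCDF4 Python package; the file name, variable values, and attributes are illustrative only and are not an actual OADS product.

      # Sketch: a CF-style netCDF time series with standard_name and units attributes.
      from netCDF4 import Dataset

      with Dataset("oa_mooring_example.nc", "w") as ds:
          ds.Conventions = "CF-1.6"
          ds.title = "Example ocean acidification time series"
          ds.createDimension("time", None)              # unlimited record dimension

          time = ds.createVariable("time", "f8", ("time",))
          time.units = "days since 2012-01-01 00:00:00"
          time.standard_name = "time"

          ph = ds.createVariable("ph", "f4", ("time",), fill_value=-999.0)
          ph.standard_name = "sea_water_ph_reported_on_total_scale"
          ph.units = "1"

          time[:] = [0.0, 1.0, 2.0]
          ph[:] = [8.05, 8.03, 8.04]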

  4. Increasing the availability and usability of terrestrial ecology data through geospatial Web services and visualization tools (Invited)

    NASA Astrophysics Data System (ADS)

    Santhana Vannan, S.; Cook, R. B.; Wilson, B. E.; Wei, Y.

    2010-12-01

    Terrestrial ecology data sets are produced from diverse data sources such as model output, field data collection, laboratory analysis and remote sensing observation. These data sets can be created, distributed, and consumed in diverse ways as well. However, this diversity can hinder the usability of the data, and limit data users’ abilities to validate and reuse data for science and application purposes. Geospatial web services, such as those described in this paper, are an important means of reducing this burden. Terrestrial ecology researchers generally create the data sets in diverse file formats, with file and data structures tailored to the specific needs of their project, possibly as tabular data, geospatial images, or documentation in a report. Data centers may reformat the data to an archive-stable format and distribute the data sets through one or more protocols, such as FTP, email, and WWW. Because of the diverse data preparation, delivery, and usage patterns, users have to invest time and resources to bring the data into the format and structure most useful for their analysis. This time-consuming data preparation process shifts valuable resources from data analysis to data assembly. To address these issues, the ORNL DAAC, a NASA-sponsored terrestrial ecology data center, has utilized geospatial Web service technology, such as Open Geospatial Consortium (OGC) Web Map Service (WMS) and OGC Web Coverage Service (WCS) standards, to increase the usability and availability of terrestrial ecology data sets. Data sets are standardized into non-proprietary file formats and distributed through OGC Web Service standards. OGC Web services allow the ORNL DAAC to store data sets in a single format and distribute them in multiple ways and formats. Registering the OGC Web services through search catalogues and other spatial data tools allows for publicizing the data sets and makes them more available across the Internet. The ORNL DAAC has also created a Web-based graphical user interface called Spatial Data Access Tool (SDAT) that utilizes OGC Web services standards and allows data distribution and consumption for users not familiar with OGC standards. SDAT also allows for users to visualize the data set prior to download. Google Earth visualizations of the data set are also provided through SDAT. The use of OGC Web service standards at the ORNL DAAC has enabled an increase in data consumption. In one case, a data set had ~10 fold increase in download through OGC Web service in comparison to the conventional FTP and WWW method of access. The increase in download suggests that users are not only finding the data sets they need but also able to consume them readily in the format they need.
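
    A client consumes an OGC WMS endpoint, as described above, with a plain HTTP GET carrying standardized parameters. The sketch below uses the standard WMS 1.3.0 GetMap parameters; the endpoint URL and layer name are placeholders, not an actual ORNL DAAC service address.

      # Sketch: requesting a rendered map from a WMS endpoint.
      import requests

      WMS_ENDPOINT = "https://example.org/wms"          # hypothetical service URL

      params = {
          "SERVICE": "WMS",
          "VERSION": "1.3.0",
          "REQUEST": "GetMap",
          "LAYERS": "example_biomass_layer",            # placeholder layer name
          "STYLES": "",
          "CRS": "EPSG:4326",
          "BBOX": "-90,-180,90,180",                    # lat/lon axis order for EPSG:4326 in WMS 1.3.0
          "WIDTH": 720,
          "HEIGHT": 360,
          "FORMAT": "image/png",
      }

      resp = requests.get(WMS_ENDPOINT, params=params, timeout=60)
      resp.raise_for_status()
      with open("map.png", "wb") as f:
          f.write(resp.content)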

  5. A future data environment - reusability vs. citability and synchronisation vs. ingestion

    NASA Astrophysics Data System (ADS)

    Fleischer, D.

    2012-04-01

    During the last decades, data managers dedicated their work to the pursuit of importable data. In recent years this chase seems to be coming to an end, as funding organisations assume that data publications with citable data sets will overcome scientists' reluctance to commit their data. But is this true for all the problems we face at the edge of a data avalanche and data-intensive science? The concept of citable data is a logical consequence of connecting the dots. Potential data providers have long complained about the lack of credit assignment for data providers, and they still do. The selected approach of DOI-captured data sets fits neatly into the credit system of publisher-driven publications with countable citations, a system scientists have known for approximately 400 years. Unfortunately, there is a double-bind situation between citeability and reusability. While cooperations between publishers and data archives are coming into existence, one question needs to be answered: "Is it really worthwhile in the twenty-first century to force data into the publication process of the seventeenth century?" Data publications enable easy citability, but they do not support easy data reusability for future users. Additional problems occur in such an environment when taking into account the chances of collaborative data corrections in the institutional repository. A future with huge amounts of data connected with publications makes reconsideration towards a more integrated approach reasonable. In the past, data archives were the only infrastructures taking care of long-term data retrievability and availability. Nevertheless, they were never part of the scientific process from data creation through analysis, interpretation and publication; data archives were regarded as isolated islands in the sea of scientific data. Accordingly, scientists considered data publications a stumbling block in their daily routines, and still do. The creation of data sets as additional publications is an additional workload many scientists are not yet convinced about. These times are coming to an end now because of the efforts of the funding organisations and the increased awareness of scientific institutions. Right now data archives have their expertise in retrievability and availability, but the new demand for data provenance is not yet included in their systems. So why not take the chance of the scientific institutes stepping in, and split the workload of retrievability and provenance? Such an integrated data environment would be characterized by increased functionality, creditability and structured data from the point of creation, accompanied throughout by data managers. The Kiel Data Management Infrastructure is creating such an institutional provenance system for the scientific site of Kiel, keeping data sets up to date by synchronisation with the institutional provenance system that captures all changes and improvements right where they happen. A sophisticated and scalable landscape needs to combine the advantages of existing data centers, such as their usability and retrievability functionality, with the advantages of decentralised data capturing and provenance. Such a data environment, offering synchronisation features and creditability of scientific data to future users, would be capable of handling these future tasks.

  6. Mining and Integration of Environmental Data

    NASA Astrophysics Data System (ADS)

    Tran, V.; Hluchy, L.; Habala, O.; Ciglan, M.

    2009-04-01

    The project ADMIRE (Advanced Data Mining and Integration Research for Europe) is a 7th FP EU ICT project aims to deliver a consistent and easy-to-use technology for extracting information and knowledge. The project is motivated by the difficulty of extracting meaningful information by data mining combinations of data from multiple heterogeneous and distributed resources. It will also provide an abstract view of data mining and integration, which will give users and developers the power to cope with complexity and heterogeneity of services, data and processes. The data sets describing phenomena from domains like business, society, and environment often contain spatial and temporal dimensions. Integration of spatio-temporal data from different sources is a challenging task due to those dimensions. Different spatio-temporal data sets contain data at different resolutions (e.g. size of the spatial grid) and frequencies. This heterogeneity is the principal challenge of geo-spatial and temporal data sets integration - the integrated data set should hold homogeneous data of the same resolution and frequency. Thus, to integrate heterogeneous spatio-temporal data from distinct source, transformation of one or more data sets is necessary. Following transformation operation are required: • transformation to common spatial and temporal representation - (e.g. transformation to common coordinate system), • spatial and/or temporal aggregation - data from detailed data source are aggregated to match the resolution of other resources involved in the integration process, • spatial and/or temporal record decomposition - records from source with lower resolution data are decomposed to match the granularity of the other data source. This operation decreases data quality (e.g. transformation of data from 50km grid to 10 km grid) - data from lower resolution data set in the integrated schema are imprecise, but it allows us to preserve higher resolution data. We can decompose the spatio-temporal data integration to following phases: • pre-integration data processing - different data set can be physically stored in different formats (e.g. relational databases, text files); it might be necessary to pre-process the data sets to be integrated, • identification of transformation operations necessary to integrate data in spatio-temporal dimensions, • identification of transformation operations to be performed on non-spatio-temporal attributes and • output data schema and set generation - given prepared data and the set of transformation, operations, the final integrated schema is produces. Spatio-temporal dimension brings its specifics also to the problem of mining spatio-temporal data sets. Spatio-temporal relationships exist among records in (s-t) data sets and those relationships should be considered in mining operation. This means that when analyzing a record in spatio-temporal data set, the records in its spatial and/or temporal proximity should be taken into account. In addition, the relationships discovered in spatio-temporal data can be different when mining the same data on different scales (e.g. mining the same data sets on 50 km grid with daily data vs. 10 km grid with hourly data). To be able to do effective data mining, we first needed to gather a sufficient amount of environmental data covering similar area and time span. 
    For this purpose we have engaged in cooperation with several organizations working in the environmental domain in Slovakia, some of which are also our partners from previous research efforts. The organizations which volunteered some of their data are the Slovak Hydro-meteorological Institute (SHMU), the Slovak Water Enterprise (SVP), the Soil Science and Conservation Institute (VUPOP), and the Institute of Hydrology of the Slovak Academy of Sciences (UHSAV). We have prepared scenarios from general meteorology, as well as specialized scenarios in hydrology and soil protection.
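
    The spatial aggregation operation described above (matching a fine-resolution grid to a coarser one) can be sketched as a simple block average, e.g. aggregating 10 km cells to 50 km cells, a factor of 5 in each direction. The array is synthetic and the function name is invented for illustration; this is not the ADMIRE project's implementation.

      # Sketch: block-averaging a fine 2-D grid onto a coarser grid with NumPy.
      import numpy as np

      def block_average(grid: np.ndarray, factor: int) -> np.ndarray:
          """Aggregate a 2-D grid by averaging non-overlapping factor x factor blocks."""
          ny, nx = grid.shape
          if ny % factor or nx % factor:
              raise ValueError("grid dimensions must be divisible by the factor")
          return grid.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

      fine = np.random.rand(100, 100)        # e.g., 100 x 100 cells of a 10 km grid
      coarse = block_average(fine, 5)        # 20 x 20 cells of a 50 km grid
      print(fine.shape, "->", coarse.shape)  # (100, 100) -> (20, 20)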

  7. 40 CFR 720.50 - Submission of test data and other data concerning the health and environmental effects of a...

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... following types of test data: (i) Health effects data. (ii) Ecological effects data. (iii) Physical and... data; progress to date; types of data collected; significant preliminary results; and anticipated... scientific literature, a description of the type of data and summary of the results, if available, and the...

  8. Production and Consumption of University Linked Data

    ERIC Educational Resources Information Center

    Zablith, Fouad; Fernandez, Miriam; Rowe, Matthew

    2015-01-01

    Linked Data increases the value of an organisation's data over the web by introducing explicit and machine processable links at the data level. We have adopted this new stream of data representation to produce and expose existing data within The Open University (OU) as Linked Data. We present in this paper our approach for producing the data,…

  9. Quantification of Operational Risk Using A Data Mining

    NASA Technical Reports Server (NTRS)

    Perera, J. Sebastian

    1999-01-01

    What is Data Mining? - Data Mining is the process of finding actionable information hidden in raw data. - Data Mining helps find hidden patterns, trends, and important relationships often buried in a sea of data - Typically, automated software tools based on advanced statistical analysis and data modeling technology can be utilized to automate the data mining process

  10. A Comparison of Key Concepts in Data Analytics and Data Science

    ERIC Educational Resources Information Center

    McMaster, Kirby; Rague, Brian; Wolthuis, Stuart L.; Sambasivam, Samuel

    2018-01-01

    This research study provides an examination of the relatively new fields of Data Analytics and Data Science. We compare word rates in Data Analytics and Data Science documents to determine which concepts are mentioned most often. The most frequent concept in both fields is "data." The word rate for "data" is more than twice the…

  11. Data Confidentiality Challenges in Big Data Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yin, Jian; Zhao, Dongfang

    In this paper, we address the problem of data confidentiality in big data analytics. In many fields, many useful patterns can be extracted by applying machine learning techniques to big data. However, data confidentiality must be protected. In many scenarios, data confidentiality could well be a prerequisite for data to be shared. We present a scheme to provide provably secure data confidentiality and discuss various techniques to optimize the performance of such a system.

  12. Evaluation of a data dictionary system. [information dissemination and computer systems programs

    NASA Technical Reports Server (NTRS)

    Driggers, W. G.

    1975-01-01

    The usefulness of a data dictionary/directory system was investigated for achieving optimum benefits from existing and planned investments in computer data files in the Data Systems Development Branch and the Institutional Data Systems Division. Potential applications of the data catalogue system are discussed along with an evaluation of the system. Other topics discussed include data description, data structure, programming aids, programming languages, program networks, and test data.

  13. LANDSAT-D data format control book. Volume 6: (Products)

    NASA Technical Reports Server (NTRS)

    Kabat, F.

    1981-01-01

    Four basic product types are generated from the raw thematic mapper (TM) and multispectral scanner (MSS) payload data by the NASA GSFC LANDSAT 4 data management system: (1) unprocessed data (raw sensor data); (2) partially processed data, which consists of radiometrically corrected sensor data with geometric correction information appended; (3) fully processed data, which consists of radiometrically and geometrically corrected sensor data; and (4) inventory data which consists of summary information about product types 2 and 3. High density digital recorder formatting and the radiometric correction process are described. Geometric correction information is included.

  14. A new data collaboration service based on cloud computing security

    NASA Astrophysics Data System (ADS)

    Ying, Ren; Li, Hua-Wei; Wang, Li na

    2017-09-01

    With the rapid development of cloud computing, the storage and usage of data have undergone revolutionary changes. Data owners can store data in the cloud. While this brings convenience, it also brings many new challenges for cloud data security. A key issue is how to support a secure data collaboration service that supports access and updates to cloud data. This paper proposes a secure, efficient and extensible data collaboration service, which prevents data leaks in cloud storage, supports one-to-many encryption mechanisms, and also enables cloud data writing and fine-grained access control.

  15. High speed data compactor

    DOEpatents

    Baumbaugh, Alan E.; Knickerbocker, Kelly L.

    1988-06-04

    A method and apparatus for suppressing from transmission, non-informational data words from a source of data words such as a video camera. Data words having values greater than a predetermined threshold are transmitted whereas data words having values less than a predetermined threshold are not transmitted but their occurrences instead are counted. Before being transmitted, the count of occurrences of invalid data words and valid data words are appended with flag digits which a receiving system decodes. The original data stream is fully reconstructable from the stream of valid data words and count of invalid data words.
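
    The compaction scheme described above can be sketched in software: data words at or above the threshold pass through, while runs of below-threshold (non-informational) words are replaced by a flagged count, from which the stream can be reconstructed. The tuple-based "flag" representation below is a simplified illustration, not the patented hardware format.

      # Sketch: threshold-based compaction and reconstruction of a data-word stream.
      def compact(words, threshold):
          out, run = [], 0
          for w in words:
              if w >= threshold:
                  if run:                       # flush any pending run of suppressed words
                      out.append(("SKIP", run))
                      run = 0
                  out.append(("DATA", w))
              else:
                  run += 1
          if run:
              out.append(("SKIP", run))
          return out

      def expand(stream, fill=0):
          """Rebuild the original sequence; suppressed words are restored as `fill`."""
          out = []
          for tag, value in stream:
              out.extend([fill] * value if tag == "SKIP" else [value])
          return out

      raw = [0, 0, 0, 37, 42, 0, 0, 5, 0]
      packed = compact(raw, threshold=5)
      print(packed)           # [('SKIP', 3), ('DATA', 37), ('DATA', 42), ('SKIP', 2), ('DATA', 5), ('SKIP', 1)]
      print(expand(packed))   # [0, 0, 0, 37, 42, 0, 0, 5, 0]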

  16. A browse facility for Earth science remote sensing data: Center director's discretionary fund final report

    NASA Technical Reports Server (NTRS)

    Meyer, P. J.

    1993-01-01

    An image data visual browse facility is developed for a UNIX platform using the X Windows 11 system. It allows one to visually examine reduced resolution image data to determine which data are applicable for further research. Links with a relational data base manager then allow one to extract not only the full resolution image data, but any other ancillary data related to the case study. Various techniques are examined for compression of the image data in order to reduce data storage requirements and time necessary to transmit the data on the internet. Data used were from the WetNet project.

  17. Introduction to the mining of clinical data.

    PubMed

    Harrison, James H

    2008-03-01

    The increasing volume of medical data online, including laboratory data, represents a substantial resource that can provide a foundation for improved understanding of disease presentation, response to therapy, and health care delivery processes. Data mining supports these goals by providing a set of techniques designed to discover similarities and relationships between data elements in large data sets. Currently, medical data have several characteristics that increase the difficulty of applying these techniques, although there have been notable medical data mining successes. Future developments in integrated medical data repositories, standardized data representation, and guidelines for the appropriate research use of medical data will decrease the barriers to mining projects.

  18. Dynamic visualization of data streams

    DOEpatents

    Wong, Pak Chung [Richland, WA]; Foote, Harlan P [Richland, WA]; Adams, Daniel R [Kennewick, WA]; Cowley, Wendy E [Richland, WA]; Thomas, James J [Richland, WA]

    2009-07-07

    One embodiment of the present invention includes a data communication subsystem to receive a data stream, and a data processing subsystem responsive to the data communication subsystem to generate a visualization output based on a group of data vectors corresponding to a first portion of the data stream. The processing subsystem is further responsive to a change in rate of receipt of the data to modify the visualization output with one or more other data vectors corresponding to a second portion of the data stream as a function of eigenspace defined with the group of data vectors. The system further includes a display device responsive to the visualization output to provide a corresponding visualization.
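
    The eigenspace idea in the claim above can be sketched as follows: a basis (here, the top principal components) is fitted to an initial group of data vectors, and later vectors from the stream are projected into that same space so the visualization can be updated cheaply. The sketch uses only NumPy on synthetic data and is an interpretation of the general technique, not the patented system.

      # Sketch: fit an eigenspace to an initial window, then project newly arrived vectors.
      import numpy as np

      def fit_eigenspace(vectors: np.ndarray, n_components: int = 2):
          """Return the mean and top eigenvectors of the vectors' covariance."""
          mean = vectors.mean(axis=0)
          centered = vectors - mean
          cov = np.cov(centered, rowvar=False)
          eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
          basis = eigvecs[:, ::-1][:, :n_components]      # keep the largest components
          return mean, basis

      def project(vectors: np.ndarray, mean: np.ndarray, basis: np.ndarray):
          return (vectors - mean) @ basis                 # low-dimensional coordinates for plotting

      rng = np.random.default_rng(0)
      first_window = rng.normal(size=(200, 10))           # initial portion of the stream
      later_window = rng.normal(size=(50, 10))            # newly arrived vectors

      mean, basis = fit_eigenspace(first_window)
      coords = project(later_window, mean, basis)         # reuse the existing eigenspace
      print(coords.shape)                                  # (50, 2)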

  19. International Data Base for the U.S. Renewable Energy Industry

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    none

    1986-05-01

    The International Data Base for the US Renewable Energy Industry was developed to provide the US renewable energy industry with background data for identifying and analyzing promising foreign market opportunities for their products and services. Specifically, the data base provides the following information for 161 developed and developing countries: (1) General Country Data--consisting of general energy indicators; (2) Energy Demand Data--covering commercial primary energy consumption; (3) Energy Resource Data--identifying annual average insolation, wind power, and river flow data; (4) Power System Data--indicating a wide range of electrical parameters; and (5) Business Data--including currency and credit worthiness data.

  20. Data Acquisition for Modular Biometric Monitoring System

    NASA Technical Reports Server (NTRS)

    Grodsinsky, Carlos M. (Inventor); Chmiel, Alan J. (Inventor); Humphreys, Bradley T. (Inventor)

    2014-01-01

    A modular system for acquiring biometric data includes a plurality of data acquisition modules configured to sample biometric data from at least one respective input channel at a data acquisition rate. A representation of the sampled biometric data is stored in memory of each of the plurality of data acquisition modules. A central control system is in communication with each of the plurality of data acquisition modules through a bus. The central control system is configured to collect data asynchronously, via the bus, from the memory of the plurality of data acquisition modules according to a relative fullness of the memory of the plurality of data acquisition modules.
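
    The collection policy described above can be illustrated with a small sketch: a central controller polls each acquisition module and drains the buffer that is relatively fullest first. The classes and names are simplified stand-ins for illustration, not the patented hardware design.

      # Sketch: collect from acquisition modules according to relative buffer fullness.
      class AcquisitionModule:
          def __init__(self, name, capacity):
              self.name, self.capacity, self.buffer = name, capacity, []

          def sample(self, value):
              self.buffer.append(value)          # sampled biometric data word

          def fullness(self):
              return len(self.buffer) / self.capacity

          def drain(self):
              data, self.buffer = self.buffer, []
              return data

      def collect_round(modules):
          """Collect from modules over the (simulated) bus, fullest buffer first."""
          collected = {}
          for m in sorted(modules, key=lambda m: m.fullness(), reverse=True):
              if m.buffer:
                  collected[m.name] = m.drain()
          return collected

      ecg = AcquisitionModule("ecg", capacity=8)
      spo2 = AcquisitionModule("spo2", capacity=8)
      for v in (101, 102, 103):
          ecg.sample(v)
      spo2.sample(98)
      print(collect_round([ecg, spo2]))   # {'ecg': [101, 102, 103], 'spo2': [98]}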

  1. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Burnett, R.A.

    A major goal of the Analysis of Large Data Sets (ALDS) research project at Pacific Northwest Laboratory (PNL) is to provide efficient data organization, storage, and access capabilities for statistical applications involving large amounts of data. As part of the effort to achieve this goal, a self-describing binary (SDB) data file structure has been designed and implemented together with a set of basic data manipulation functions and supporting SDB data access routines. Logical and physical data descriptors are stored in SDB files preceding the data values. SDB files thus provide a common data representation for interfacing diverse software components. This paper describes the various types of data descriptors and data structures permitted by the file design. Data buffering, file segmentation, and a segment overflow handler are also discussed.
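
    The self-describing idea can be sketched in a few lines: a small descriptor block (column names, value types, record count) precedes the packed values, so any reader can interpret the file without outside documentation. The layout below is a toy format invented for illustration and is not the actual PNL SDB design.

      # Sketch: write and read a small self-describing binary file.
      import json
      import struct

      def write_sdb_like(path, names, fmt, rows):
          header = json.dumps({"names": names, "fmt": fmt, "rows": len(rows)}).encode()
          with open(path, "wb") as f:
              f.write(struct.pack("<I", len(header)))     # 4-byte header length
              f.write(header)                              # logical descriptors
              for row in rows:
                  f.write(struct.pack(fmt, *row))          # packed data values

      def read_sdb_like(path):
          with open(path, "rb") as f:
              (hlen,) = struct.unpack("<I", f.read(4))
              meta = json.loads(f.read(hlen))
              size = struct.calcsize(meta["fmt"])
              data = [struct.unpack(meta["fmt"], f.read(size)) for _ in range(meta["rows"])]
          return meta, data

      write_sdb_like("demo.sdb", ["site_id", "value"], "<id", [(1, 3.5), (2, 4.25)])
      print(read_sdb_like("demo.sdb"))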

  2. A zero-error operational video data compression system

    NASA Technical Reports Server (NTRS)

    Kutz, R. L.

    1973-01-01

    A data compression system has been operating since February 1972, using ATS spin-scan cloud cover data. With the launch of ITOS 3 in October 1972, this data compression system has become the only source of near-realtime very high resolution radiometer image data at the data processing facility. The VHRR image data are compressed and transmitted over a 50 kilobit per second wideband ground link. The goal of the data compression experiment was to send data quantized to six bits at twice the rate possible when no compression is used, while maintaining zero error between the transmitted and reconstructed data. All objectives of the data compression experiment were met, and thus a capability of doubling the data throughput of the system has been achieved.

  3. Kepler Data Release 25 Notes (Q0-Q17)

    NASA Technical Reports Server (NTRS)

    Mullally, Susan E.; Caldwell, Douglas A.; Barclay, Thomas Stewart; Barentsen, Geert; Clarke, Bruce Donald; Bryson, Stephen T.; Burke, Christopher James; Campbell, Jennifer Roseanna; Catanzarite, Joseph H.; Christiansen, Jessie

    2016-01-01

    These Data Release Notes provide information specific to the current reprocessing and re-export of the Q0-Q17 data. The data products in this release include target pixel files, light curve files, FFIs, CBVs, ARP, Background, and Collateral files. This release marks the final processing of the Kepler Mission Data. See Tables 1 and 2 for a list of the reprocessed Kepler cadence data. See Table 3 for a list of the available FFIs. The Long Cadence Data, Short Cadence Data, and FFI data are documented in these data release notes. The ancillary files (i.e., cotrending basis vectors, artifact removal pixels, background, and collateral data) are described in the Archive Manual (Thompson et al., 2016).

  4. -Omic and Electronic Health Record Big Data Analytics for Precision Medicine.

    PubMed

    Wu, Po-Yen; Cheng, Chih-Wen; Kaddi, Chanchala D; Venugopalan, Janani; Hoffman, Ryan; Wang, May D

    2017-02-01

    Rapid advances of high-throughput technologies and wide adoption of electronic health records (EHRs) have led to fast accumulation of -omic and EHR data. These voluminous complex data contain abundant information for precision medicine, and big data analytics can extract such knowledge to improve the quality of healthcare. In this paper, we present -omic and EHR data characteristics, associated challenges, and data analytics including data preprocessing, mining, and modeling. To demonstrate how big data analytics enables precision medicine, we provide two case studies, including identifying disease biomarkers from multi-omic data and incorporating -omic information into EHR. Big data analytics is able to address -omic and EHR data challenges for paradigm shift toward precision medicine. Big data analytics makes sense of -omic and EHR data to improve healthcare outcome. It has long lasting societal impact.

  5. Satellite and earth science data management activities at the U.S. geological survey's EROS data center

    USGS Publications Warehouse

    Carneggie, David M.; Metz, Gary G.; Draeger, William C.; Thompson, Ralph J.

    1991-01-01

    The U.S. Geological Survey's Earth Resources Observation Systems (EROS) Data Center, the national archive for Landsat data, has 20 years of experience in acquiring, archiving, processing, and distributing Landsat and earth science data. The Center is expanding its satellite and earth science data management activities to support the U.S. Global Change Research Program and the National Aeronautics and Space Administration (NASA) Earth Observing System Program. The Center's current and future data management activities focus on land data and include: satellite and earth science data set acquisition, development and archiving; data set preservation, maintenance and conversion to more durable and accessible archive medium; development of an advanced Land Data Information System; development of enhanced data packaging and distribution mechanisms; and data processing, reprocessing, and product generation systems.

  6. The Role of Community-Driven Data Curation for Enterprises

    NASA Astrophysics Data System (ADS)

    Curry, Edward; Freitas, Andre; O'Riáin, Sean

    With increased utilization of data within their operational and strategic processes, enterprises need to ensure data quality and accuracy. Data curation is a process that can ensure the quality of data and its fitness for use. Traditional approaches to curation are struggling with increased data volumes and near real-time demands for curated data. In response, curation teams have turned to community crowd-sourcing and semi-automated metadata tools for assistance. This chapter provides an overview of data curation, discusses the business motivations for curating data and investigates the role of community-based data curation, focusing on internal communities and pre-competitive data collaborations. The chapter is supported by case studies from Wikipedia, The New York Times, Thomson Reuters, Protein Data Bank and ChemSpider, upon which best practices for both social and technical aspects of community-driven data curation are described.

  7. A repository based on a dynamically extensible data model supporting multidisciplinary research in neuroscience.

    PubMed

    Corradi, Luca; Porro, Ivan; Schenone, Andrea; Momeni, Parastoo; Ferrari, Raffaele; Nobili, Flavio; Ferrara, Michela; Arnulfo, Gabriele; Fato, Marco M

    2012-10-08

    Robust, extensible and distributed databases integrating clinical, imaging and molecular data represent a substantial challenge for modern neuroscience. It is even more difficult to provide extensible software environments able to effectively target the rapidly changing data requirements and structures of research experiments. There is an increasing request from the neuroscience community for software tools addressing technical challenges about: (i) supporting researchers in the medical field to carry out data analysis using integrated bioinformatics services and tools; (ii) handling multimodal/multiscale data and metadata, enabling the injection of several different data types according to structured schemas; (iii) providing high extensibility, in order to address different requirements deriving from a large variety of applications simply through a user runtime configuration. A dynamically extensible data structure supporting collaborative multidisciplinary research projects in neuroscience has been defined and implemented. We have considered extensibility issues from two different points of view. First, the improvement of data flexibility has been taken into account. This has been done through the development of a methodology for the dynamic creation and use of data types and related metadata, based on the definition of "meta" data model. This way, users are not constrainted to a set of predefined data and the model can be easily extensible and applicable to different contexts. Second, users have been enabled to easily customize and extend the experimental procedures in order to track each step of acquisition or analysis. This has been achieved through a process-event data structure, a multipurpose taxonomic schema composed by two generic main objects: events and processes. Then, a repository has been built based on such data model and structure, and deployed on distributed resources thanks to a Grid-based approach. Finally, data integration aspects have been addressed by providing the repository application with an efficient dynamic interface designed to enable the user to both easily query the data depending on defined datatypes and view all the data of every patient in an integrated and simple way. The results of our work have been twofold. First, a dynamically extensible data model has been implemented and tested based on a "meta" data-model enabling users to define their own data types independently from the application context. This data model has allowed users to dynamically include additional data types without the need of rebuilding the underlying database. Then a complex process-event data structure has been built, based on this data model, describing patient-centered diagnostic processes and merging information from data and metadata. Second, a repository implementing such a data structure has been deployed on a distributed Data Grid in order to provide scalability both in terms of data input and data storage and to exploit distributed data and computational approaches in order to share resources more efficiently. Moreover, data managing has been made possible through a friendly web interface. The driving principle of not being forced to preconfigured data types has been satisfied. It is up to users to dynamically configure the data model for the given experiment or data acquisition program, thus making it potentially suitable for customized applications. Based on such repository, data managing has been made possible through a friendly web interface. 
The driving principle of not being forced to preconfigured data types has been satisfied. It is up to users to dynamically configure the data model for the given experiment or data acquisition program, thus making it potentially suitable for customized applications.
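
    The "meta" data-model idea described above can be sketched very simply: data types are themselves described as data (a runtime registry of field definitions), so new types can be added without changing the database schema or the code. The type and field names are invented for illustration; this is not the repository's actual API.

      # Sketch: a runtime-extensible type registry with record validation.
      REGISTRY = {}   # type name -> {field name: python type}

      def define_type(name, fields):
          """Register a new data type at runtime."""
          REGISTRY[name] = dict(fields)

      def validate(type_name, record):
          """Check a record against the runtime definition of its type."""
          schema = REGISTRY[type_name]
          missing = [f for f in schema if f not in record]
          wrong = [f for f, t in schema.items() if f in record and not isinstance(record[f], t)]
          if missing or wrong:
              raise ValueError(f"missing={missing}, wrong_type={wrong}")
          return True

      # A research group adds an EEG acquisition type without touching existing code.
      define_type("eeg_acquisition", {"subject_id": str, "n_channels": int, "sampling_hz": float})
      validate("eeg_acquisition", {"subject_id": "S01", "n_channels": 64, "sampling_hz": 512.0})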

  8. A repository based on a dynamically extensible data model supporting multidisciplinary research in neuroscience

    PubMed Central

    2012-01-01

    Background Robust, extensible and distributed databases integrating clinical, imaging and molecular data represent a substantial challenge for modern neuroscience. It is even more difficult to provide extensible software environments able to effectively target the rapidly changing data requirements and structures of research experiments. There is an increasing request from the neuroscience community for software tools addressing technical challenges about: (i) supporting researchers in the medical field to carry out data analysis using integrated bioinformatics services and tools; (ii) handling multimodal/multiscale data and metadata, enabling the injection of several different data types according to structured schemas; (iii) providing high extensibility, in order to address different requirements deriving from a large variety of applications simply through a user runtime configuration. Methods A dynamically extensible data structure supporting collaborative multidisciplinary research projects in neuroscience has been defined and implemented. We have considered extensibility issues from two different points of view. First, the improvement of data flexibility has been taken into account. This has been done through the development of a methodology for the dynamic creation and use of data types and related metadata, based on the definition of “meta” data model. This way, users are not constrainted to a set of predefined data and the model can be easily extensible and applicable to different contexts. Second, users have been enabled to easily customize and extend the experimental procedures in order to track each step of acquisition or analysis. This has been achieved through a process-event data structure, a multipurpose taxonomic schema composed by two generic main objects: events and processes. Then, a repository has been built based on such data model and structure, and deployed on distributed resources thanks to a Grid-based approach. Finally, data integration aspects have been addressed by providing the repository application with an efficient dynamic interface designed to enable the user to both easily query the data depending on defined datatypes and view all the data of every patient in an integrated and simple way. Results The results of our work have been twofold. First, a dynamically extensible data model has been implemented and tested based on a “meta” data-model enabling users to define their own data types independently from the application context. This data model has allowed users to dynamically include additional data types without the need of rebuilding the underlying database. Then a complex process-event data structure has been built, based on this data model, describing patient-centered diagnostic processes and merging information from data and metadata. Second, a repository implementing such a data structure has been deployed on a distributed Data Grid in order to provide scalability both in terms of data input and data storage and to exploit distributed data and computational approaches in order to share resources more efficiently. Moreover, data managing has been made possible through a friendly web interface. The driving principle of not being forced to preconfigured data types has been satisfied. It is up to users to dynamically configure the data model for the given experiment or data acquisition program, thus making it potentially suitable for customized applications. 
Conclusions Based on this repository, data management has been made possible through a user-friendly web interface. The driving principle of not constraining users to preconfigured data types has been satisfied. It is up to users to dynamically configure the data model for a given experiment or data acquisition program, thus making the repository potentially suitable for customized applications. PMID:23043673
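
    A minimal sketch of how such a dynamically extensible ("meta") data model could look in practice is given below. All class, field and type names are hypothetical illustrations rather than the authors' implementation: data types are registered at runtime instead of being hard-coded as database tables, and acquisitions are tracked as generic processes composed of events.

        # Hypothetical sketch of a dynamically extensible ("meta") data model.
        from dataclasses import dataclass, field

        @dataclass
        class DataType:
            """Runtime description of a user-defined data type."""
            name: str
            attributes: dict          # attribute name -> expected Python type

        class MetaModel:
            def __init__(self):
                self.types = {}

            def register_type(self, name, attributes):
                # New types can be added without rebuilding the underlying store.
                self.types[name] = DataType(name, attributes)

            def new_record(self, type_name, **values):
                dtype = self.types[type_name]
                for attr, expected in dtype.attributes.items():
                    if not isinstance(values.get(attr), expected):
                        raise TypeError(f"{attr} must be {expected.__name__}")
                return {"type": type_name, **values}

        @dataclass
        class Event:
            """Atomic step of an acquisition or analysis."""
            label: str
            records: list = field(default_factory=list)

        @dataclass
        class Process:
            """Patient-centered sequence of events."""
            patient_id: str
            events: list = field(default_factory=list)

        model = MetaModel()
        model.register_type("MRIVolume", {"path": str, "voxel_size_mm": float})
        scan = Event("baseline_scan")
        scan.records.append(model.new_record("MRIVolume", path="t1.nii", voxel_size_mm=1.0))
        case = Process("patient-001", events=[scan])
        print(case)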

  9. The CUAHSI Water Data Center: Empowering scientists to discover, use, store, and share water data

    NASA Astrophysics Data System (ADS)

    Couch, A. L.; Hooper, R. P.; Arrigo, J. S.

    2012-12-01

    The proposed CUAHSI Water Data Center (WDC) will provide production-quality water data resources based upon the successful large-scale data services prototype developed by the CUAHSI Hydrologic Information System (HIS) project. The WDC, using the HIS technology, concentrates on providing time series data collected at fixed points or on moving platforms from sensors primarily (but not exclusively) in the medium of water. The WDC's missions include providing simple and effective data discovery tools useful to researchers in a variety of water-related disciplines, and providing simple and cost-effective data publication mechanisms for projects that do not desire to run their own data servers. The WDC's activities will include: 1. Rigorous curation of the water data catalog already assembled during the CUAHSI HIS project, to ensure accuracy of records and existence of declared sources. 2. Data backup and failover services for "at risk" data sources. 3. Creation and support for ubiquitously accessible data discovery and access, web-based search and smartphone applications. 4. Partnerships with researchers to extend the state of the art in water data use. 5. Partnerships with industry to create plug-and-play data publishing from sensors, and to create domain-specific tools. The WDC will serve as a knowledge resource for researchers of water-related issues, and will interface with other data centers to make their data more accessible to water researchers. The WDC will serve as a vehicle for addressing some of the grand challenges of accessing and using water data, including: a. Cross-domain data discovery: different scientific domains refer to the same kind of water data using different terminologies, making discovery of data difficult for researchers outside the data provider's domain. b. Cross-validation of data sources: much water data comes from sources lacking rigorous quality control procedures; such sources can be compared against others with rigorous quality control. The WDC enables this by making both kinds of sources available in the same search interface. c. Data provenance: the appropriateness of data for use in a specific model or analysis often depends upon the exact details of how data was gathered and processed. The WDC will aid this by curating standards for metadata that are as descriptive as practical of the collection procedures. "Plug and play" sensor interfaces will fill in metadata appropriate to each sensor without human intervention. d. Contextual search: discovering data based upon geological (e.g. aquifer) or geographic (e.g., location in a stream network) features external to metadata. e. Data-driven search: discovering data that exhibit quality factors that are not described by the metadata. The WDC will partner with researchers desiring contextual and data driven search, and make results available to all. Many major data providers (e.g. federal agencies) are not mandated to provide access to data other than those they collect. The HIS project assembled data from over 90 different sources, thus demonstrating the promise of this approach. Meeting the grand challenges listed above will greatly enhance scientists' ability to discover, interpret, access, and analyze water data from across domains and sources to test Earth system hypotheses.

  10. Embedding Data Stewardship in Geoscience Australia

    NASA Astrophysics Data System (ADS)

    Bastrakova, I.; Fyfe, S.

    2013-12-01

    Ten years of technological innovation now enable vast amounts of data to be collected, managed, processed and shared. At the same time, organisations have witnessed government legislative and policy requirements for open access to public sector data, and a demand for flexibility in access to data by both machine-to-machine and human consumption. Geoscience Australia (GA) has adopted Data Stewardship as an organisation-wide initiative to improve the way we manage and share our data. The benefits to GA include: - Consolidated understanding of GA's data assets and their value to the Agency; - Recognition of the significant role of data custodianship and data management; - Well-defined governance, policies, standards, practices and accountabilities that promote the accessibility, quality and interoperability of GA's data; - Integration of disparate data sets into cohesive information products available online in real time and equally accessible to researchers, government, industry and the public. Although the theory behind data stewardship is well-defined and accepted, and the benefits are generally well-understood, practical implementation requires an organisation to prepare for a long-term commitment of resources, both financial and human. Fundamentally this involves: 1. Raising awareness in the organisation of the need for data stewardship and the challenges this entails; 2. Establishing a data stewardship framework including a data governance office to set policy and drive organisational change; and 3. Embedding the functions and a culture of data stewardship into business as usual operations. GA holds a vast amount of data ranging from petabytes of Big Data to significant quantities of relatively small 'long tail' geoscientific observations and measurements. Over the past four years, GA has undertaken strategic activities that prepare us for Data Stewardship: - Organisation-wide audits of GA's data holdings and identification of custodians for each dataset; - Developing guiding Principles on how the Agency undertakes Science, Data Management and Cataloguing; - Developing a Data Classification Schema that aligns scientific requirements and business workflows with data architecture; - Creating the Scientific Data Stewardship Steering Committee, comprising champions from across the Agency to guide development and support implementation of Data Stewardship in GA; - Forming Scientific Data Communities of Practice of leading scientific experts to identify standards and practices across their domain, and integrate data stewardship practices into scientific workflows; - Establishing the Data Governance and Services Section to provide ongoing capacity for the development, communication, coordination and governance of data stewardship policies, strategies, standards and practices. GA is now moving towards Data Stewardship as an operational capability and culture within the Agency. The challenges we face in the future include: - Maintaining continuous and enthusiastic engagement from the Agency executive; - Implementing long term cultural change at all levels within the organisation; - Formal recognition that data stewardship is a continuous operational BAU activity; - Incorporation of data custodianship and management activities in work programs and budgets; and - Cultivation and support of the data stewardship champions.

  11. Why should we publish Linked Data?

    NASA Astrophysics Data System (ADS)

    Blower, Jon; Riechert, Maik; Koubarakis, Manolis; Pace, Nino

    2016-04-01

    We use the Web every day to access information from all kinds of different sources. But the complexity and diversity of scientific data mean that discovering, accessing and interpreting data remains a major challenge for researchers, decision-makers and other users. Different sources of useful information on data, algorithms, instruments and publications are scattered around the Web. How can we link all these things together to help users to better understand and exploit earth science data? How can we combine scientific data with other relevant data sources, when standards for describing and sharing data vary so widely between communities? "Linked Data" is a term that describes a set of standards and "best practices" for sharing data on the Web (http://www.w3.org/standards/semanticweb/data). These principles can be summarised as follows: 1. Create unique and persistent identifiers for the important "things" in a community (e.g. datasets, publications, algorithms, instruments). 2. Allow users to "look up" these identifiers on the web to find out more information about them. 3. Make this information machine-readable in a community-neutral format (such as RDF, the Resource Description Framework). 4. Within this information, embed links to other things and concepts and say how these are related. 5. Optionally, provide web service interfaces to allow the user to perform sophisticated queries over this information (using a language such as SPARQL). The promise of Linked Data is that, through these techniques, data will be more discoverable, more comprehensible and more usable by different communities, not just the community that produced the data. As a result, many data providers (particularly public-sector institutions) are now publishing data in this way. However, this area is still in its infancy in terms of real-world applications. Data users need guidance and tools to help them use Linked Data. Data providers need reassurance that the investments they are making in publishing Linked Data will result in tangible user benefits. This presentation will address a number of these issues, using real-world experience gathered from four recent European projects: MELODIES (http://melodiesproject.eu), LEO (http://linkedeodata.eu), CHARMe (http://linkedeodata.eu) and TELEIOS (http://www.earthobservatory.eu). These projects have all applied Linked Data techniques in practical, real-world situations involving the use of diverse data (including earth science data) by both industrial and academic users. Specifically, we will: • Identify a set of practical and valuable uses for Linked Data, focusing on areas where Linked Data fills gaps left by other technologies. These uses include: enabling the discovery of earth science data using mass-market search engines, helping users to understand data and its uses, combining data from multiple sources and enabling the annotation of data by users. • Enumerate some common challenges faced by developers of data-driven services who wish to use Linked Data in their applications. • Describe a new suite of tools for managing, processing and visualising Linked Data in earth science applications (including geospatial Linked Data).
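
    As a concrete illustration of principles 1-5, the short sketch below builds and queries a tiny Linked Data graph with the Python rdflib library. The identifiers and properties are made up for illustration and are not taken from the projects listed above.

        # Sketch: minting identifiers, describing them in RDF, linking, and querying.
        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import DCTERMS, RDFS

        EX = Namespace("http://example.org/id/")      # 1. persistent identifiers
        g = Graph()
        dataset = EX["dataset/land-cover-2015"]
        sensor = EX["instrument/hypothetical-sensor"]

        g.add((dataset, DCTERMS.title, Literal("Land cover map 2015")))
        g.add((dataset, DCTERMS.source, sensor))      # 4. typed link to another thing
        g.add((sensor, RDFS.label, Literal("Hypothetical imaging sensor")))

        # 3. community-neutral, machine-readable serialization
        print(g.serialize(format="turtle"))

        # 5. a SPARQL query over the same information
        q = """SELECT ?title ?instrument WHERE {
                 ?d <http://purl.org/dc/terms/title> ?title ;
                    <http://purl.org/dc/terms/source> ?instrument .
               }"""
        for row in g.query(q):
            print(row.title, row.instrument)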

  12. Semantic Data Access Services at NASA's Atmospheric Science Data Center

    NASA Astrophysics Data System (ADS)

    Huffer, E.; Hertz, J.; Kusterer, J.

    2012-12-01

    The corpus of Earth Science data products at the Atmospheric Science Data Center at NASA's Langley Research Center comprises a widely heterogeneous set of products, even among those whose subject matter is very similar. Two distinct data products may both contain data on the same parameter, for instance, solar irradiance; but the instruments used, and the circumstances under which the data were collected and processed, may differ significantly. Understanding the differences is critical to using the data effectively. Data distribution services must be able to provide prospective users with enough information to allow them to meaningfully compare and evaluate the data products offered. Semantic technologies - ontologies, triple stores, reasoners, linked data - offer functionality for addressing this issue. Ontologies can provide robust, high-fidelity domain models that serve as common schema for discovering, evaluating, comparing and integrating data from disparate products. Reasoning engines and triple stores can leverage ontologies to support intelligent search applications that allow users to discover, query, retrieve, and easily reformat data from a broad spectrum of sources. We argue that because of the extremely complex nature of scientific data, data distribution systems should wholeheartedly embrace semantic technologies in order to make their data accessible to a broad array of prospective end users, and to ensure that the data they provide will be clearly understood and used appropriately by consumers. Toward this end, we propose a distribution system in which formal ontological models that accurately and comprehensively represent the ASDC's data domain, and fully leverage the expressivity and inferential capabilities of first order logic, are used to generate graph-based representations of the relevant relationships among data sets, observational systems, metadata files, and geospatial, temporal and scientific parameters to help prospective data consumers navigate directly to relevant data sets and query, subset, retrieve and compare the measurement and calculation data they contain. A critical part of developing semantically-enabled data distribution capabilities is developing an ontology that adequately describes 1) the data products - their structure, their content, and any supporting documentation; 2) the data domain - the objects and processes that the products denote; and 3) the relationship between the data and the domain. The ontology, in addition, should be machine readable and capable of integrating with the larger data distribution system to provide an interactive user experience. We will demonstrate how a formal, high-fidelity, queriable ontology representing the atmospheric science domain objects and data products, together with a robust set of inference rules for generating interactive graphs, allows researchers to navigate quickly and painlessly through the large volume of data at the ASDC. Scientists will be able to discover data products that exactly meet their particular criteria, link to information about the instruments and processing methods that generated the data; and compare and contrast related products.

  13. 17 CFR 49.10 - Acceptance of data.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... technological protocols established by a swap data repository shall provide for the receipt of swap creation data, swap continuation data, real-time public reporting data, and all other data and information... swap data repository shall adopt policies and procedures, including technological protocols, which...

  14. Data banks for recreation supply and participation

    Treesearch

    E. M. Avedon; S. L. J. Smith

    1980-01-01

    Data archives and data banks have become increasingly important as more researchers begin to examine trends. Characteristics of data banks, sources of bias in secondary data sources and important trends in data banks are described. The paper concludes with advice about using data banks.

  15. The Environmental Data Initiative: A broad-use data repository for environmental and ecological data that strives to balance data quality and ease of submission

    NASA Astrophysics Data System (ADS)

    Servilla, M. S.; Brunt, J.; Costa, D.; Gries, C.; Grossman-Clarke, S.; Hanson, P. C.; O'Brien, M.; Smith, C.; Vanderbilt, K.; Waide, R.

    2017-12-01

    In the world of data repositories, there seems to be a never-ending struggle between the generation of high-quality data documentation and the ease of archiving a data product in a repository - the higher the documentation standards, the greater the effort required by the scientist, and the less likely the data will be archived. The Environmental Data Initiative (EDI) attempts to balance the rigor of data documentation against the amount of effort required by a scientist to upload and archive data. As an outgrowth of the LTER Network Information System, the EDI is funded by the US NSF Division of Environmental Biology to support the LTER, LTREB, OBFS, and MSB programs, in addition to providing an open data archive for environmental scientists without a viable archive. EDI uses the PASTA repository software, developed originally by the LTER. PASTA is metadata-driven and documents data with the Ecological Metadata Language (EML), a high-fidelity standard that can describe all types of data in great detail. PASTA incorporates a series of data quality tests to ensure that data are correctly documented with EML, in a process that is termed "metadata and data congruence"; incongruent data packages are rejected by the repository. EDI reduces the burden of data documentation on scientists in two ways: first, EDI provides hands-on assistance with data documentation best practices and with tools for generating EML, available in R and under development in Python. These tools hide the details of EML generation and syntax by providing a more natural and contextual setting for describing data. Second, EDI works closely with community information managers in defining the rules used in PASTA quality tests. Rules deemed too strict can be turned off completely or set to issue only a warning while the community learns how best to handle the situation and improves its documentation practices. Rules can also be added or refined over time to improve the overall quality of archived data. The outcomes of quality tests are stored as part of the data archive in PASTA and are accessible to all users of the EDI data repository. In summary, EDI's metadata support to scientists and the comprehensive set of data quality tests for metadata and data congruency provide an ideal archive for environmental and ecological data.
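
    The "metadata and data congruence" idea can be illustrated with a small, hypothetical check (this is a sketch, not the PASTA rule set itself): the attributes declared in an EML document are compared against the columns actually present in the data file, and mismatches are reported as error- or warning-level findings.

        # Hypothetical congruence check between EML attribute declarations and a CSV file.
        import csv
        import xml.etree.ElementTree as ET

        def declared_attributes(eml_path):
            """Collect attributeName elements from an EML document (assumes unqualified element names)."""
            tree = ET.parse(eml_path)
            return [el.text.strip() for el in tree.iter("attributeName")]

        def csv_columns(csv_path):
            with open(csv_path, newline="") as fh:
                return next(csv.reader(fh))   # header row

        def congruence_report(eml_path, csv_path):
            declared = set(declared_attributes(eml_path))
            observed = set(csv_columns(csv_path))
            return {
                "missing_in_data": sorted(declared - observed),        # error-level rule
                "undocumented_columns": sorted(observed - declared),   # warning-level rule
            }

        # Example (hypothetical file names):
        # congruence_report("package_metadata.xml", "package_table.csv")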

  16. The next generation of data capturing - digital ink for the data stewards of the future

    NASA Astrophysics Data System (ADS)

    Czerniak, A.; Fleischer, D.; Schirnick, C.

    2012-12-01

    Data stewardship of the future requires evolving from an expert-driven discipline into a general scientific routine. One way to achieve this expansion is to use data management infrastructures already during student education. Unsurprisingly, the well-known drawbacks of data stewardship in scientific practice also complicate this expansion into educational programs. The advantage that educational programs are usually based on standard methods is lost because data capture typically happens only at the point of publication or at the end of the project lifetime. If student courses are considered as short projects, there are no publications, and end-of-course exams keep students, just like scientists, away from data stewardship tasks. The Kiel Data Management Infrastructure brings data capture right into the data creation process. With this approach, student courses become just another data capture use case. The goals of smoothing the data capture process and making use of available technologies led the Kiel Data Management Infrastructure to prototype the use of 'digital ink' and, subsequently, handwriting recognition. The 'Smart Pens' use case makes data digitization as easy as possible without abandoning the standards of paper-based protocols. The technology bridges the gap between long-lasting paper protocols and the effort-intensive digitization of field and sampling data, and it is robust enough to be used alongside battery-powered devices. The combination of the Kiel Data Management Infrastructure with 'digital ink' technology enables data capture from student education through to high-end scientific lab work. Valuing educational data equally with scientific lab data sends a strong signal to the researchers of the future, as their work is recognized all the way from the undergraduate stage to the post-doc position. Students learn that their data work is never neglected, and so they realize that there is no excuse for keeping any data away from the data management infrastructure. 'Digital ink' technology is a milestone for the data stewardship discipline and fills many of the gaps between data creation and the data infrastructure; as long as we do not establish life-long data capture support across the scientific career, we cannot complain about reluctant data submissions.

  17. Large-Scale Data Collection Metadata Management at the National Computation Infrastructure

    NASA Astrophysics Data System (ADS)

    Wang, J.; Evans, B. J. K.; Bastrakova, I.; Ryder, G.; Martin, J.; Duursma, D.; Gohar, K.; Mackey, T.; Paget, M.; Siddeswara, G.

    2014-12-01

    Data Collection management has become an essential activity at the National Computation Infrastructure (NCI) in Australia. NCI's partners (CSIRO, Bureau of Meteorology, Australian National University, and Geoscience Australia), supported by the Australian Government and Research Data Storage Infrastructure (RDSI), have established a national data resource that is co-located with high-performance computing. This paper addresses the metadata management of these data assets over their lifetime. NCI manages 36 data collections (10+ PB) categorised as earth system sciences, climate and weather model data assets and products, earth and marine observations and products, geosciences, terrestrial ecosystem, water management and hydrology, astronomy, social science and biosciences. The data is largely sourced from NCI partners, the custodians of many of the national scientific records, and major research community organisations. The data is made available in an HPC and data-intensive environment - a ~56000 core supercomputer, virtual labs on a 3000 core cloud system, and data services. By assembling these large national assets, new opportunities have arisen to harmonise the data collections, making a powerful cross-disciplinary resource. To support the overall management, a Data Management Plan (DMP) has been developed to record the workflows, procedures, key contacts and responsibilities. The DMP has fields that can be exported to the ISO19115 schema and to the collection-level catalogue of GeoNetwork. The subset- or file-level metadata catalogues are linked with the collection level through parent-child relationships defined using UUIDs. A number of tools have been developed that support interactive metadata management, bulk loading of data, and computational workflows or data pipelines. NCI creates persistent identifiers for each of the assets. The data collection is tracked over its lifetime, and the recognition of the data providers, data owners, data generators and data aggregators is updated. A Digital Object Identifier is assigned using the Australian National Data Service (ANDS). Once the data has been quality assured, a DOI is minted and the metadata record updated. NCI's data citation policy establishes the relationship between research outcomes, data providers, and the data.
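
    A minimal sketch of the parent-child linkage described above is shown below; the record fields and paths are illustrative, not NCI's actual schema. Each collection-level record carries a UUID, and file- or subset-level records point back to their parent through that identifier.

        # Illustrative parent-child metadata linkage using UUIDs.
        import uuid

        def make_collection_record(title):
            return {"uuid": str(uuid.uuid4()), "title": title, "children": []}

        def make_file_record(parent, path):
            record = {"uuid": str(uuid.uuid4()), "parent_uuid": parent["uuid"], "path": path}
            parent["children"].append(record["uuid"])
            return record

        collection = make_collection_record("Hypothetical climate model output collection")
        granule = make_file_record(collection, "/data/example/tas_monthly_2000.nc")  # hypothetical path
        print(collection["uuid"], "<-", granule["parent_uuid"])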

  18. Best Practices for Making Scientific Data Discoverable and Accessible through Integrated, Standards-Based Data Portals

    NASA Astrophysics Data System (ADS)

    Lucido, J. M.

    2013-12-01

    Scientists in the fields of hydrology, geophysics, and climatology are increasingly using the vast quantity of publicly-available data to address broadly-scoped scientific questions. For example, researchers studying contamination of nearshore waters could use a combination of radar indicated precipitation, modeled water currents, and various sources of in-situ monitoring data to predict water quality near a beach. In discovering, gathering, visualizing and analyzing potentially useful data sets, data portals have become invaluable tools. The most effective data portals often aggregate distributed data sets seamlessly and allow multiple avenues for accessing the underlying data, facilitated by the use of open standards. Additionally, adequate metadata are necessary for attribution, documentation of provenance and relating data sets to one another. Metadata also enable thematic, geospatial and temporal indexing of data sets and entities. Furthermore, effective portals make use of common vocabularies for scientific methods, units of measure, geologic features, chemical, and biological constituents as they allow investigators to correctly interpret and utilize data from external sources. One application that employs these principles is the National Ground Water Monitoring Network (NGWMN) Data Portal (http://cida.usgs.gov/ngwmn), which makes groundwater data from distributed data providers available through a single, publicly accessible web application by mediating and aggregating native data exposed via web services on-the-fly into Open Geospatial Consortium (OGC) compliant service output. That output may be accessed either through the map-based user interface or through the aforementioned OGC web services. Furthermore, the Geo Data Portal (http://cida.usgs.gov/climate/gdp/), which is a system that provides users with data access, subsetting and geospatial processing of large and complex climate and land use data, exemplifies the application of International Standards Organization (ISO) metadata records to enhance data discovery for both human and machine interpretation. Lastly, the Water Quality Portal (http://www.waterqualitydata.us/) achieves interoperable dissemination of water quality data by referencing a vocabulary service for mapping constituents and methods between the USGS and USEPA. The NGWMN Data Portal, Geo Data Portal and Water Quality Portal are three examples of best practices when implementing data portals that provide distributed scientific data in an integrated, standards-based approach.

  19. Quality of Big Data in health care.

    PubMed

    Sukumar, Sreenivas R; Natarajan, Ramachandran; Ferrell, Regina K

    2015-01-01

    The current trend in Big Data analytics, and in particular in health information technology, is toward building sophisticated models, methods and tools for business, operational and clinical intelligence. However, the critical issue of the data quality required for these models is not getting the attention it deserves. The purpose of this paper is to highlight the issues of data quality in the context of Big Data health care analytics. The insights presented in this paper are the results of analytics work that was done in different organizations on a variety of health data sets. The data sets include Medicare and Medicaid claims, provider enrollment data sets from both public and private sources, and electronic health records from regional health centers accessed through partnerships with health care claims processing entities under health privacy protection guidelines. Assessment of data quality in health care has to consider: first, the entire lifecycle of health data; second, problems arising from errors and inaccuracies in the data itself; third, the source(s) and the pedigree of the data; and fourth, how the underlying purpose of data collection impacts the analytic processing and the knowledge expected to be derived. Automation in the form of data handling, storage, entry and processing technologies is to be viewed as a double-edged sword. At one level, automation can be a good solution, while at another level it can create a different set of data quality issues. Implementation of health care analytics with Big Data is enabled by a road map that addresses the organizational and technological aspects of data quality assurance. The value derived from the use of analytics should be the primary determinant of data quality. Based on this premise, health care enterprises embracing Big Data should have a road map for a systematic approach to data quality. Health care data quality problems can be so specific that organizations might have to build their own custom software or data quality rule engines. Today, data quality issues are diagnosed and addressed in a piecemeal fashion. The authors recommend a data lifecycle approach and provide a road map that is better suited to the dimensions of Big Data and fits the different stages of the analytical workflow.
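
    The recommendation to build custom data quality rule engines can be sketched as follows; this is a toy example with invented rules and field names, not any particular organization's engine. Each rule is a named predicate applied to a record, and the engine reports which rules fail.

        # Toy data quality rule engine for claims-like records.
        RULES = {
            "member_id_present": lambda rec: bool(rec.get("member_id")),
            "claim_amount_nonnegative": lambda rec: rec.get("claim_amount", 0) >= 0,
            "discharge_after_admission": lambda rec: rec.get("discharge_date", "") >= rec.get("admission_date", ""),
        }

        def evaluate(record, rules=RULES):
            """Return the names of all rules the record violates."""
            return [name for name, check in rules.items() if not check(record)]

        record = {"member_id": "A-123", "claim_amount": -50,
                  "admission_date": "2015-01-10", "discharge_date": "2015-01-08"}
        print(evaluate(record))   # -> ['claim_amount_nonnegative', 'discharge_after_admission']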

  20. The international Argo data infrastructure; past, present, and future.

    NASA Astrophysics Data System (ADS)

    Buck, J. J. H.; Pouliquen, S.; Thresher, A.; Schmechtig, C.; Ignaszewski, M.; Carval, T.; Scanderbeg, M.; Frost, M.

    2016-12-01

    The Argo array is composed of over 3,000 autonomous profiling floats that measure the temperature and salinity of the upper 2,000 m of the global deep ocean every ten days. Argo is a key component of the global ocean observing system, and the data address crucial questions such as quantifying the heat content of the upper ocean and steric sea level change. Further, these data are routinely assimilated into operational ocean forecast models. Argo is underpinned by an international data system that was founded in 2000 at the first meeting of the Argo data management team. The Argo data system is built on principles of open data and of supplying data to both operational ocean models and research communities within 24 hours of collection. The data system served as a template for the established international OceanSITES community and the emerging Everyone's Glider Observatories initiative. The Argo data system is composed of national Data Assembly Centers (DACs) that supply data to two mirrored Global Data Assembly Centres (GDACs). GDAC data exchanges are based on the File Transfer Protocol (FTP). A significant recent data system development is the assignment of a single dynamic DOI to the GDAC holdings, enabling time-dependent, unambiguous data citation at a monthly granularity. The ongoing evolution of Argo to address new global questions requires deeper data, shallower data, biogeochemical sampling and increased spatial coverage. These enhancements are increasing data complexity and volumes, necessitating significant recent data format adaptation. The challenge and achievement was to preserve data formats and quality for existing established users while still allowing the integration of new data streams. The implementation of these adaptations is currently in progress within the DACs. Argo data have traditionally been delivered via the FTP protocol, and developments are ongoing to facilitate new users and to meet emerging expectations for data delivery mechanisms. These experimental developments include access via Application Programming Interfaces such as ERDDAP, integration with other components of GOOS within the AtlantOS project, and a prototype 'Big Data' solution being developed within the EU ENVRIplus project.
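
    A minimal sketch of pulling a float's profile files from a GDAC-style FTP server is shown below. The host name and directory layout are placeholders, not the actual GDAC addresses, and would need to be replaced with real server details.

        # Hypothetical example of retrieving Argo NetCDF profiles over FTP.
        from ftplib import FTP

        GDAC_HOST = "gdac.example.org"                  # placeholder, not a real GDAC address
        FLOAT_DIR = "/argo/dac/example_dac/5900000"     # placeholder directory layout

        def fetch_profiles(host=GDAC_HOST, directory=FLOAT_DIR):
            with FTP(host) as ftp:
                ftp.login()                              # anonymous login
                ftp.cwd(directory)
                for name in ftp.nlst():
                    if name.endswith(".nc"):             # download only NetCDF profile files
                        with open(name, "wb") as out:
                            ftp.retrbinary(f"RETR {name}", out.write)

        # fetch_profiles()  # would download the profile files for one float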

  1. Geoscience Informatics and the National Science Foundation

    NASA Astrophysics Data System (ADS)

    Leinen, M.

    2005-12-01

    The central importance of data to the geosciences, the diversity of data types, the volume of data, and the importance of integrating data collected by many individual researchers to study Earth and space systems have often put geoscientists and space scientists at the forefront of developing ways to handle data. Recently, however, changes in the role of data in our science are changing how we deal with data: (a) Data volume is increasing - modern observing tools are providing larger volumes of data, and more people are collecting data; (b) Scientists are tackling increasingly complex problems that often require different types of data from different sources; (c) There is growth in observatory science that is generating real-time data streams. These pose a number of challenges. Part of the scientific utility of observatories comes from being able to detect and respond to events in real time. In many cases quality control must be done automatically. (d) Environmental data from unattended observing systems are likely to be used the way weather data are most often used by researchers - not directly but indirectly, as a data product obtained from assimilation of raw data into a numerical model of the system. These changes have required, and will continue to require, changes in how data are managed. For example, data discovery is an increasing challenge. Researchers must pay more attention to metadata - ensuring that they are created as data are collected, processed and archived, ensuring that they are standardized, and ensuring that metadata records are machine readable and permit machine reasoning about the relations between data. NSF is responding to these challenges in several ways. First, virtually all large projects must have sophisticated data management components. Second, we are funding many geoinformatics projects that propose to develop new tools for data management. Third, NSF's computer science programs fund research efforts in data management - several in partnership with geoscientists. Finally, NSF is in the process of evaluating our policies for funding data management in order to be responsive to the challenges of ensuring access to data and to state-of-the-art tools for using data.

  2. Striped Data Server for Scalable Parallel Data Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chang, Jin; Gutsche, Oliver; Mandrichenko, Igor

    A columnar data representation is known to be an efficient way for data storage, specifically in cases when the analysis is often done based only on a small fragment of the available data structures. A data representation like Apache Parquet is a step forward from a columnar representation, which splits data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, working on the [LDRD Project], we have developed a striped data representation, which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed no-SQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple, so it allows for a common data model and APIs for a wide range of underlying storage mechanisms such as distributed no-SQL databases and local file systems. Striped storage adopts Numpy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service, which hides the server implementation details from the end user, easily exposes data to WAN users, and allows well-known and mature data caching solutions to be used to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing. We have been testing this architecture with a 2TB dataset from a CMS dark matter search and plan to expand it to multiple 100 TB or even PB scale. We will present the striped format, Striped Data Server architecture and performance test results.
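
    The core idea of a striped, column-wise representation backed by Numpy arrays can be illustrated in a few lines of Python. This is a simplified sketch under assumed names, not the actual Striped Data Server code: each attribute of the event records is stored as its own array, split into fixed-size stripes that independent workers can process in parallel.

        # Simplified illustration of striped, columnar storage with Numpy.
        import numpy as np

        events = [  # row-oriented input, e.g. reconstructed physics events
            {"met": 210.5, "njets": 4}, {"met": 95.2, "njets": 2},
            {"met": 310.9, "njets": 6}, {"met": 150.0, "njets": 3},
        ]

        # Columnar step: one Numpy array per attribute.
        columns = {key: np.array([ev[key] for ev in events]) for key in events[0]}

        # Striping step: split every column into fixed-size stripes so that
        # independent workers can each process one stripe of all columns.
        STRIPE_SIZE = 2
        stripes = {key: np.array_split(col, len(col) // STRIPE_SIZE)
                   for key, col in columns.items()}

        # A worker computing a partial result on stripe 0 only.
        mask = stripes["njets"][0] >= 3
        print(stripes["met"][0][mask].sum())   # partial sum of MET for selected events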

  3. Policy-based Distributed Data Management

    NASA Astrophysics Data System (ADS)

    Moore, R. W.

    2009-12-01

    The analysis and understanding of climate variability and change builds upon access to massive collections of observational and simulation data. The analyses involve distributed computing, both at the storage systems (which support data subsetting) and at compute engines (for assimilation of observational data into simulations). The integrated Rule Oriented Data System (iRODS) organizes the distributed data into collections to facilitate enforcement of management policies, support remote data processing, and enable development of reference collections. Currently at RENCI, the iRODS data grid is being used to manage ortho-photos and lidar data for the State of North Carolina, provide a unifying storage environment for engagement centers across the state, support distributed access to visualizations of weather data, and is being explored to manage and disseminate collections of ensembles of meteorological and hydrological model results. In collaboration with the National Climatic Data Center, an iRODS data grid is being established to support data transmission from NCDC to ORNL, and to integrate NCDC archives with ORNL compute services. To manage the massive data transfers, parallel I/O streams are used between High Performance Storage System tape archives and the supercomputers at ORNL. Further, we are exploring the movement and management of large RADAR and in situ datasets to be used for data mining between RENCI and NCDC, and for the distributed creation of decision support and climate analysis tools. The iRODS data grid supports all phases of the scientific data life cycle, from management of data products for a project, to sharing of data between research institutions, to publication of data in a digital library, to preservation of data for use in future research projects. Each phase is characterized by a broader user community, with higher expectations for more detailed descriptions and analysis mechanisms for manipulating the data. The higher usage requirements are enforced by management policies that define the required metadata, the required data formats, and the required analysis tools. The iRODS policy based data management system automates the creation of the community chosen data products, validates integrity and authenticity assessment criteria, and enforces management policies across all accesses of the system.
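
    The kind of management policy described above (required metadata, required data formats, integrity checks) can be sketched as a simple pre-ingest validation. This is a generic Python illustration with invented policy fields, not iRODS rule code.

        # Generic pre-ingest policy check (illustrative, not an iRODS rule).
        import hashlib
        from pathlib import Path

        POLICY = {
            "required_metadata": {"project", "creator", "collection_date"},
            "allowed_suffixes": {".nc", ".tif", ".csv"},
        }

        def check_ingest(path, metadata, expected_sha256=None, policy=POLICY):
            """Return a list of policy violations for a candidate file."""
            violations = []
            missing = policy["required_metadata"] - metadata.keys()
            if missing:
                violations.append(f"missing metadata: {sorted(missing)}")
            if Path(path).suffix not in policy["allowed_suffixes"]:
                violations.append(f"format not allowed: {Path(path).suffix}")
            if expected_sha256 is not None:
                digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
                if digest != expected_sha256:
                    violations.append("checksum mismatch")
            return violations

        # Example (hypothetical file and metadata):
        # check_ingest("precip_2009.nc", {"project": "archive-transfer", "creator": "jdoe"})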

  4. The Planetary Data System— Archiving Planetary Data for the use of the Planetary Science Community

    NASA Astrophysics Data System (ADS)

    Morgan, Thomas H.; McLaughlin, Stephanie A.; Grayzeck, Edwin J.; Vilas, Faith; Knopf, William P.; Crichton, Daniel J.

    2014-11-01

    NASA’s Planetary Data System (PDS) archives, curates, and distributes digital data from NASA’s planetary missions. PDS provides the planetary science community convenient online access to data from NASA’s missions so that they can continue to mine these rich data sets for new discoveries. The PDS is a federated system consisting of nodes for specific discipline areas ranging from planetary geology to space physics. Our federation includes an engineering node that provides systems engineering support to the entire PDS. In order to adequately capture complete mission data sets containing not only raw and reduced instrument data, but also the calibration, documentation and geometry data required to interpret and use these data sets both singly and together (data from multiple instruments, or from multiple missions), PDS personnel work with NASA missions from the initial AO through the end of mission to define, organize, and document the data. This process includes peer review of data sets by members of the science community to ensure that the data sets are scientifically useful, effectively organized, and well documented. The PDS makes its data easily searchable so that members of the planetary community can both query the archive to find data relevant to specific scientific investigations and easily retrieve the data for analysis. To ensure long-term preservation of data and to make data sets more easily searchable with the new capabilities in information technology now available (and as existing technologies become obsolete), the PDS (together with the COSPAR-sponsored IPDA) developed and deployed a new data archiving system known as PDS4, released in 2013. The LADEE, MAVEN, OSIRIS-REx, InSight, and Mars2020 missions are using PDS4. ESA has adopted PDS4 for the upcoming BepiColombo mission. The PDS is actively migrating existing data records into PDS4 and developing tools to aid data providers and users. The PDS is also incorporating challenge-based competitions to rapidly and economically develop new tools for both users and data providers. Please visit our User Support Area at the meeting (Booth #114) if you have questions about accessing our data sets or providing data to the PDS.

  5. Evaluating lossy data compression on climate simulation data within a large ensemble

    DOE PAGES

    Baker, Allison H.; Hammerling, Dorit M.; Mickelson, Sheri A.; ...

    2016-12-07

    High-resolution Earth system model simulations generate enormous data volumes, and retaining the data from these simulations often strains institutional storage resources. Further, these exceedingly large storage requirements negatively impact science objectives, for example, by forcing reductions in data output frequency, simulation length, or ensemble size. To lessen data volumes from the Community Earth System Model (CESM), we advocate the use of lossy data compression techniques. While lossy data compression does not exactly preserve the original data (as lossless compression does), lossy techniques have an advantage in terms of smaller storage requirements. To preserve the integrity of the scientific simulation data, the effects of lossy data compression on the original data should, at a minimum, not be statistically distinguishable from the natural variability of the climate system, and previous preliminary work with data from CESM has shown this goal to be attainable. However, to ultimately convince climate scientists that it is acceptable to use lossy data compression, we provide climate scientists with access to publicly available climate data that have undergone lossy data compression. In particular, we report on the results of a lossy data compression experiment with output from the CESM Large Ensemble (CESM-LE) Community Project, in which we challenge climate scientists to examine features of the data relevant to their interests, and attempt to identify which of the ensemble members have been compressed and reconstructed. We find that while detecting distinguishing features is certainly possible, the compression effects noticeable in these features are often unimportant or disappear in post-processing analyses. In addition, we perform several analyses that directly compare the original data to the reconstructed data to investigate the preservation, or lack thereof, of specific features critical to climate science. Overall, we conclude that applying lossy data compression to climate simulation data is both advantageous in terms of data reduction and generally acceptable in terms of effects on scientific results.
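
    A simplified version of the "not statistically distinguishable from natural variability" criterion can be written as follows. This is a sketch with synthetic data and a stand-in rounding step for the compressor, not the paper's actual evaluation suite: the error introduced by compression is compared, point by point, with the spread of the original ensemble.

        # Sketch: compare compression error against ensemble natural variability.
        import numpy as np

        rng = np.random.default_rng(0)
        ensemble = rng.normal(loc=288.0, scale=1.5, size=(30, 96, 144))  # 30 members, lat x lon grid
        original = ensemble[0]

        # Stand-in for lossy compression + reconstruction (e.g. simple quantization).
        reconstructed = np.round(original, 1)

        natural_variability = ensemble.std(axis=0)           # spread across members
        compression_error = np.abs(reconstructed - original)

        # Fraction of grid points where the compression error exceeds a small
        # fraction of the natural variability of the climate system.
        exceed = (compression_error > 0.1 * natural_variability).mean()
        print(f"{exceed:.1%} of grid points exceed the tolerance")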

  6. Sankofa pediatric HIV disclosure intervention cyber data management: building capacity in a resource-limited setting and ensuring data quality

    PubMed Central

    Catlin, Ann Christine; Fernando, Sumudinie; Gamage, Ruwan; Renner, Lorna; Antwi, Sampson; Tettey, Jonas Kusah; Amisah, Kofi Aikins; Kyriakides, Tassos; Cong, Xiangyu; Reynolds, Nancy R.; Paintsil, Elijah

    2015-01-01

    Prevalence of pediatric HIV disclosure is low in resource-limited settings. Innovative, culturally sensitive, and patient-centered disclosure approaches are needed. Conducting such studies in resource-limited settings is not trivial considering the challenges of capturing, cleaning, and storing clinical research data. To overcome some of these challenges, the Sankofa pediatric disclosure intervention adopted an interactive cyber infrastructure for data capture and analysis. The Sankofa Project database system is built on the HUBzero cyber infrastructure (https://hubzero.org), an open source software platform. The hub database components support: (1) data management – the “databases” component creates, configures, and manages database access, backup, repositories, applications, and access control; (2) data collection – the “forms” component is used to build customized web case report forms that incorporate common data elements and include tailored form submit processing to handle error checking, data validation, and data linkage as the data are stored to the database; and (3) data exploration – the “dataviewer” component provides powerful methods for users to view, search, sort, navigate, explore, map, graph, visualize, aggregate, drill-down, compute, and export data from the database. The Sankofa cyber data management tool supports a user-friendly, secure, and systematic collection of all data. We have screened more than 400 child–caregiver dyads and enrolled nearly 300 dyads, with tens of thousands of data elements. The dataviews have successfully supported all data exploration and analysis needs of the Sankofa Project. Moreover, the ability of the sites to query and view data summaries has proven to be an incentive for collecting complete and accurate data. The data system has all the desirable attributes of an electronic data capture tool. It also provides an added advantage of building data management capacity in resource-limited settings due to its innovative data query and summary views and availability of real-time support by the data management team. PMID:26616131

  7. Challenges for Data Archival Centers in Evolving Environmental Sciences

    NASA Astrophysics Data System (ADS)

    Wei, Y.; Cook, R. B.; Gu, L.; Santhana Vannan, S. K.; Beaty, T.

    2015-12-01

    Environmental science has entered into a big data era as enormous data about the Earth environment are continuously collected through field and airborne missions, remote sensing observations, model simulations, sensor networks, etc. An open-access and open-management data infrastructure for data-intensive science is a major grand challenge in global environmental research (BERAC, 2010). Such an infrastructure, as exemplified in EOSDIS, GEOSS, and NSF EarthCube, will provide a complete lifecycle of environmental data and ensures that data will smoothly flow among different phases of collection, preservation, integration, and analysis. Data archival centers, as the data integration units closest to data providers, serve as the source power to compile and integrate heterogeneous environmental data into this global infrastructure. This presentation discusses the interoperability challenges and practices of geosciences from the aspect of data archival centers, based on the operational experiences of the NASA-sponsored Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) and related environmental data management activities. Specifically, we will discuss the challenges to 1) encourage and help scientists to more actively share data with the broader scientific community, so that valuable environmental data, especially those dark data collected by individual scientists in small independent projects, can be shared and integrated into the infrastructure to tackle big science questions; 2) curate heterogeneous multi-disciplinary data, focusing on the key aspects of identification, format, metadata, data quality, and semantics to make them ready to be plugged into a global data infrastructure. We will highlight data curation practices at the ORNL DAAC for global campaigns such as BOREAS, LBA, SAFARI 2000; and 3) enhance the capabilities to more effectively and efficiently expose and deliver "big" environmental data to broad range of users and systems. Experiences and challenges with integrating large data sets via the ORNL DAAC's data discovery and delivery Web services will be discussed.

  8. Evaluating lossy data compression on climate simulation data within a large ensemble

    NASA Astrophysics Data System (ADS)

    Baker, Allison H.; Hammerling, Dorit M.; Mickelson, Sheri A.; Xu, Haiying; Stolpe, Martin B.; Naveau, Phillipe; Sanderson, Ben; Ebert-Uphoff, Imme; Samarasinghe, Savini; De Simone, Francesco; Carbone, Francesco; Gencarelli, Christian N.; Dennis, John M.; Kay, Jennifer E.; Lindstrom, Peter

    2016-12-01

    High-resolution Earth system model simulations generate enormous data volumes, and retaining the data from these simulations often strains institutional storage resources. Further, these exceedingly large storage requirements negatively impact science objectives, for example, by forcing reductions in data output frequency, simulation length, or ensemble size. To lessen data volumes from the Community Earth System Model (CESM), we advocate the use of lossy data compression techniques. While lossy data compression does not exactly preserve the original data (as lossless compression does), lossy techniques have an advantage in terms of smaller storage requirements. To preserve the integrity of the scientific simulation data, the effects of lossy data compression on the original data should, at a minimum, not be statistically distinguishable from the natural variability of the climate system, and previous preliminary work with data from CESM has shown this goal to be attainable. However, to ultimately convince climate scientists that it is acceptable to use lossy data compression, we provide climate scientists with access to publicly available climate data that have undergone lossy data compression. In particular, we report on the results of a lossy data compression experiment with output from the CESM Large Ensemble (CESM-LE) Community Project, in which we challenge climate scientists to examine features of the data relevant to their interests, and attempt to identify which of the ensemble members have been compressed and reconstructed. We find that while detecting distinguishing features is certainly possible, the compression effects noticeable in these features are often unimportant or disappear in post-processing analyses. In addition, we perform several analyses that directly compare the original data to the reconstructed data to investigate the preservation, or lack thereof, of specific features critical to climate science. Overall, we conclude that applying lossy data compression to climate simulation data is both advantageous in terms of data reduction and generally acceptable in terms of effects on scientific results.

  9. Evaluating lossy data compression on climate simulation data within a large ensemble

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Baker, Allison H.; Hammerling, Dorit M.; Mickelson, Sheri A.

    High-resolution Earth system model simulations generate enormous data volumes, and retaining the data from these simulations often strains institutional storage resources. Further, these exceedingly large storage requirements negatively impact science objectives, for example, by forcing reductions in data output frequency, simulation length, or ensemble size. To lessen data volumes from the Community Earth System Model (CESM), we advocate the use of lossy data compression techniques. While lossy data compression does not exactly preserve the original data (as lossless compression does), lossy techniques have an advantage in terms of smaller storage requirements. To preserve the integrity of the scientific simulation data, the effects of lossy data compression on the original data should, at a minimum, not be statistically distinguishable from the natural variability of the climate system, and previous preliminary work with data from CESM has shown this goal to be attainable. However, to ultimately convince climate scientists that it is acceptable to use lossy data compression, we provide climate scientists with access to publicly available climate data that have undergone lossy data compression. In particular, we report on the results of a lossy data compression experiment with output from the CESM Large Ensemble (CESM-LE) Community Project, in which we challenge climate scientists to examine features of the data relevant to their interests, and attempt to identify which of the ensemble members have been compressed and reconstructed. We find that while detecting distinguishing features is certainly possible, the compression effects noticeable in these features are often unimportant or disappear in post-processing analyses. In addition, we perform several analyses that directly compare the original data to the reconstructed data to investigate the preservation, or lack thereof, of specific features critical to climate science. Overall, we conclude that applying lossy data compression to climate simulation data is both advantageous in terms of data reduction and generally acceptable in terms of effects on scientific results.

  10. Characteristics, availability and uses of vital registration and other mortality data sources in post-democracy South Africa.

    PubMed

    Joubert, Jané; Rao, Chalapati; Bradshaw, Debbie; Dorrington, Rob E; Vos, Theo; Lopez, Alan D

    2012-12-27

    The value of good-quality mortality data for public health is widely acknowledged. While effective civil registration systems remain the 'gold standard' source for continuous mortality measurement, less than 25% of deaths are registered in most African countries. Alternative data collection systems can provide mortality data to complement those from civil registration, given an understanding of data source characteristics and data quality. We aim to document mortality data sources in post-democracy South Africa; to report on availability, limitations, strengths, and possible complementary uses of the data; and to make recommendations for improved data for mortality measurement. Civil registration and alternative mortality data collection systems, data availability, and complementary uses were assessed by reviewing blank questionnaires, death notification forms, death data capture sheets, and patient cards; legislation; electronic data archives and databases; and related information in scientific journals, research reports, statistical releases, government reports and books. Recent transformation has enhanced civil registration and official mortality data availability. Additionally, a range of mortality data items are available in three population censuses, three demographic surveillance systems, and a number of national surveys, mortality audits, and disease notification programmes. Child and adult mortality items were found in all national data sources, and maternal mortality items in most. Detailed cause-of-death data are available from civil registration and demographic surveillance. In a continent often reported as lacking the basic data to infer levels, patterns and trends of mortality, there is evidence of substantial improvement in South Africa in the availability of data for mortality assessment. Mortality data sources are many and varied, providing opportunity for comparing results and improved public health planning. However, more can and must be done to improve mortality measurement by improving data quality, triangulating data, and expanding analytic capacity. Cause data, in particular, must be improved.

  11. Managing Large Datasets for Atmospheric Research

    NASA Technical Reports Server (NTRS)

    Chen, Gao

    2015-01-01

    Since the mid-1980s, airborne and ground measurements have been widely used to provide comprehensive characterization of atmospheric composition and processes. Field campaigns have generated a wealth of insitu data and have grown considerably over the years in terms of both the number of measured parameters and the data volume. This can largely be attributed to the rapid advances in instrument development and computing power. The users of field data may face a number of challenges spanning data access, understanding, and proper use in scientific analysis. This tutorial is designed to provide an introduction to using data sets, with a focus on airborne measurements, for atmospheric research. The first part of the tutorial provides an overview of airborne measurements and data discovery. This will be followed by a discussion on the understanding of airborne data files. An actual data file will be used to illustrate how data are reported, including the use of data flags to indicate missing data and limits of detection. Retrieving information from the file header will be discussed, which is essential to properly interpreting the data. Field measurements are typically reported as a function of sampling time, but different instruments often have different sampling intervals. To create a combined data set, the data merge process (interpolation of all data to a common time base) will be discussed in terms of the algorithm, data merge products available from airborne studies, and their application in research. Statistical treatment of missing data and data flagged for limit of detection will also be covered in this section. These basic data processing techniques are applicable to both airborne and ground-based observational data sets. Finally, the recently developed Toolsets for Airborne Data (TAD) will be introduced. TAD (tad.larc.nasa.gov) is an airborne data portal offering tools to create user defined merged data products with the capability to provide descriptive statistics and the option to treat measurement uncertainty.
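
    The merge step described above (interpolating all measurements onto a common time base while respecting missing-data flags) can be sketched with a few lines of Numpy. The flag value, sampling intervals and variable names are illustrative, not those of any specific airborne data file.

        # Sketch of merging two airborne time series onto a common time base.
        import numpy as np

        MISSING = -9999.0   # illustrative missing-data flag

        # Instrument A: 1-second ozone data; instrument B: 10-second CO data.
        t_a = np.arange(0.0, 60.0, 1.0)
        o3 = 40.0 + 0.1 * t_a
        t_b = np.arange(0.0, 60.0, 10.0)
        co = np.array([120.0, MISSING, 118.0, 121.0, MISSING, 119.0])

        # Common (merge) time base, e.g. 5-second intervals.
        t_merge = np.arange(0.0, 60.0, 5.0)

        def merge(t_target, t_source, values, missing=MISSING):
            """Interpolate to the target time base, ignoring flagged samples."""
            good = values != missing
            return np.interp(t_target, t_source[good], values[good])

        merged = np.column_stack([t_merge, merge(t_merge, t_a, o3), merge(t_merge, t_b, co)])
        print(merged[:3])   # time, interpolated O3, interpolated CO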

  12. Standardized Access and Processing of Multi-Source Earth Observation Time-Series Data within a Regional Data Middleware

    NASA Astrophysics Data System (ADS)

    Eberle, J.; Schmullius, C.

    2017-12-01

    Increasing archives of global satellite data present a new challenge to handle multi-source satellite data in a user-friendly way. Any user is confronted with different data formats and data access services. In addition, the handling of time-series data is complex, as automated processing and execution of data processing steps are needed to supply the user with the desired product for a specific area of interest. In order to simplify the access to data archives of various satellite missions and to facilitate the subsequent processing, a regional data and processing middleware has been developed. The aim of this system is to provide standardized and web-based interfaces to multi-source time-series data for individual regions on Earth. For further use and analysis, uniform data formats and data access services are provided. Interfaces to data archives of the sensor MODIS (NASA) as well as the satellites Landsat (USGS) and Sentinel (ESA) have been integrated in the middleware. Various scientific algorithms, such as the calculation of trends and breakpoints of time-series data, can be carried out on the preprocessed data on the basis of uniform data management. Jupyter Notebooks are linked to the data and further processing can be conducted directly on the server using Python and the statistical language R. In addition to accessing EO data, the middleware is also used as an intermediary between the user and external databases (e.g., Flickr, YouTube). Standardized web services as specified by OGC are provided for all tools of the middleware. Currently, the use of cloud services is being researched to bring algorithms to the data. As a thematic example, an operational monitoring of vegetation phenology is being implemented on the basis of various optical satellite data and validation data from the German Weather Service. Other examples demonstrate the monitoring of wetlands, focusing on automated discovery and access of Landsat and Sentinel data for local areas.
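
    As a simple illustration of the kind of time-series analysis the middleware exposes through its linked Jupyter notebooks, the Python sketch below fits an ordinary least-squares trend to a synthetic annual NDVI series; the variable names and values are placeholders and do not reflect the middleware's actual interfaces.

      import numpy as np

      years = np.arange(2001, 2018)                                  # annual composites
      ndvi = 0.45 + 0.003 * (years - 2001) + 0.01 * np.random.randn(years.size)

      slope, intercept = np.polyfit(years, ndvi, deg=1)              # trend in NDVI units per year
      trend_line = slope * years + intercept
      print(f"estimated trend: {slope:+.4f} NDVI/yr")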

  13. Characteristics, availability and uses of vital registration and other mortality data sources in post-democracy South Africa

    PubMed Central

    Joubert, Jané; Rao, Chalapati; Bradshaw, Debbie; Dorrington, Rob E.; Vos, Theo; Lopez, Alan D.

    2012-01-01

    The value of good-quality mortality data for public health is widely acknowledged. While effective civil registration systems remain the ‘gold standard’ source for continuous mortality measurement, less than 25% of deaths are registered in most African countries. Alternative data collection systems can provide mortality data to complement those from civil registration, given an understanding of data source characteristics and data quality. We aim to document mortality data sources in post-democracy South Africa; to report on availability, limitations, strengths, and possible complementary uses of the data; and to make recommendations for improved data for mortality measurement. Civil registration and alternative mortality data collection systems, data availability, and complementary uses were assessed by reviewing blank questionnaires, death notification forms, death data capture sheets, and patient cards; legislation; electronic data archives and databases; and related information in scientific journals, research reports, statistical releases, government reports and books. Recent transformation has enhanced civil registration and official mortality data availability. Additionally, a range of mortality data items are available in three population censuses, three demographic surveillance systems, and a number of national surveys, mortality audits, and disease notification programmes. Child and adult mortality items were found in all national data sources, and maternal mortality items in most. Detailed cause-of-death data are available from civil registration and demographic surveillance. In a continent often reported as lacking the basic data to infer levels, patterns and trends of mortality, there is evidence of substantial improvement in South Africa in the availability of data for mortality assessment. Mortality data sources are many and varied, providing opportunity for comparing results and improved public health planning. However, more can and must be done to improve mortality measurement by improving data quality, triangulating data, and expanding analytic capacity. Cause data, in particular, must be improved. PMID:23273252

  14. Sankofa pediatric HIV disclosure intervention cyber data management: building capacity in a resource-limited setting and ensuring data quality.

    PubMed

    Catlin, Ann Christine; Fernando, Sumudinie; Gamage, Ruwan; Renner, Lorna; Antwi, Sampson; Tettey, Jonas Kusah; Amisah, Kofi Aikins; Kyriakides, Tassos; Cong, Xiangyu; Reynolds, Nancy R; Paintsil, Elijah

    2015-01-01

    Prevalence of pediatric HIV disclosure is low in resource-limited settings. Innovative, culturally sensitive, and patient-centered disclosure approaches are needed. Conducting such studies in resource-limited settings is not trivial considering the challenges of capturing, cleaning, and storing clinical research data. To overcome some of these challenges, the Sankofa pediatric disclosure intervention adopted an interactive cyber infrastructure for data capture and analysis. The Sankofa Project database system is built on the HUBzero cyber infrastructure ( https://hubzero.org ), an open source software platform. The hub database components support: (1) data management - the "databases" component creates, configures, and manages database access, backup, repositories, applications, and access control; (2) data collection - the "forms" component is used to build customized web case report forms that incorporate common data elements and include tailored form submit processing to handle error checking, data validation, and data linkage as the data are stored to the database; and (3) data exploration - the "dataviewer" component provides powerful methods for users to view, search, sort, navigate, explore, map, graph, visualize, aggregate, drill-down, compute, and export data from the database. The Sankofa cyber data management tool supports a user-friendly, secure, and systematic collection of all data. We have screened more than 400 child-caregiver dyads and enrolled nearly 300 dyads, with tens of thousands of data elements. The dataviews have successfully supported all data exploration and analysis needs of the Sankofa Project. Moreover, the ability of the sites to query and view data summaries has proven to be an incentive for collecting complete and accurate data. The data system has all the desirable attributes of an electronic data capture tool. It also provides an added advantage of building data management capacity in resource-limited settings due to its innovative data query and summary views and availability of real-time support by the data management team.

  15. A New Look at Data Usage by Using Metadata Attributes as Indicators of Data Quality

    NASA Astrophysics Data System (ADS)

    Won, Y. I.; Wanchoo, L.; Behnke, J.

    2016-12-01

    NASA's Earth Observing System Data and Information System (EOSDIS) stores and distributes data from EOS satellites, as well as ancillary, airborne, in-situ, and socio-economic data. Twelve EOSDIS data centers support different scientific disciplines by providing products and services tailored to specific science communities. Although discipline oriented, these data centers provide common data management functions of ingest, archive and distribution, as well as documentation of their data and services on their websites. The Earth Science Data and Information System (ESDIS) Project collects metrics on these functions from the EOSDIS data centers on a daily basis through a tool called the ESDIS Metrics System (EMS). These metrics are used in this study. The implementation of the Earthdata Login - formerly known as the User Registration System (URS) - across the various NASA data centers provides the EMS additional information about users obtaining data products from EOSDIS data centers. These additional user attributes collected by the Earthdata Login, such as the user's primary area of study, can augment the understanding of data usage, which in turn can help the EOSDIS program better understand the users' needs. This study will review the key metrics (users, distributed volume, and files) in multiple ways to gain an understanding of the significance of the metadata. Characterizing the usability of data by key metadata elements such as discipline and study area will assist in understanding how the users have evolved over time. The data usage pattern based on version numbers may also provide some insight into the level of data quality. In addition, the data metrics by various services such as the Open-source Project for a Network Data Access Protocol (OPeNDAP), Web Map Service (WMS), Web Coverage Service (WCS), and subsets will address how these services have extended the usage of data. Overall, this study will present the usage of data and metadata by metrics analyses and will assist data centers in better supporting the needs of the users.
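
    A minimal sketch of the kind of characterization described above, aggregating distribution metrics by a user attribute such as primary study area; the column names and values are hypothetical and do not reflect the actual EMS or Earthdata Login schema.

      import pandas as pd

      metrics = pd.DataFrame({
          "study_area": ["Atmosphere", "Ocean", "Atmosphere", "Land", "Ocean"],
          "files":      [1200, 800, 450, 2300, 150],
          "volume_gb":  [35.2, 12.8, 9.1, 88.4, 2.3],
          "user_id":    ["u1", "u2", "u3", "u4", "u2"],
      })

      summary = (metrics.groupby("study_area")
                 .agg(users=("user_id", "nunique"),
                      files=("files", "sum"),
                      volume_gb=("volume_gb", "sum"))
                 .sort_values("volume_gb", ascending=False))
      print(summary)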

  16. Data Curation Education in Research Centers (DCERC)

    NASA Astrophysics Data System (ADS)

    Marlino, M. R.; Mayernik, M. S.; Kelly, K.; Allard, S.; Tenopir, C.; Palmer, C.; Varvel, V. E., Jr.

    2012-12-01

    Digital data both enable and constrain scientific research. Scientists are enabled by digital data to develop new research methods, utilize new data sources, and investigate new topics, but they also face new data collection, management, and preservation burdens. The current data workforce consists primarily of scientists who receive little formal training in data management and data managers who are typically educated through on-the-job training. The Data Curation Education in Research Centers (DCERC) program is investigating a new model for educating data professionals to contribute to scientific research. DCERC is a collaboration between the University of Illinois at Urbana-Champaign Graduate School of Library and Information Science, the University of Tennessee School of Information Sciences, and the National Center for Atmospheric Research. The program is organized around a foundations course in data curation and provides field experiences in research and data centers for both master's and doctoral students. This presentation will outline the aims and the structure of the DCERC program and discuss results and lessons learned from the first set of summer internships in 2012. Four master's students participated and worked with both data mentors and science mentors, gaining first-hand experience in the issues, methods, and challenges of scientific data curation. They engaged in a diverse set of topics, including climate model metadata, observational data management workflows, and data cleaning, documentation, and ingest processes within a data archive. The students learned current data management practices and challenges while developing expertise and conducting research. They also made important contributions to NCAR data and science teams by evaluating data management workflows and processes, preparing data sets to be archived, and developing recommendations for particular data management activities. The master's student interns will return in the summer of 2013, and two Ph.D. students will conduct data curation-related dissertation fieldwork during the 2013-2014 academic year.

  17. Digging Back In Time: Integrating Historical Data Into an Operational Ocean Observing System

    NASA Astrophysics Data System (ADS)

    McCammon, M.

    2016-02-01

    Modern technologies allow near real-time reporting and display of data from in situ instrumentation live on the internet. This has given users fast access to critical information for scientific applications, marine safety, planning, and numerous other activities. Equally valuable is access to historical data sets. However, it is challenging to identify sources of, and access to, historical data of interest, as they exist in many different locations depending on the funding source and provider. Also, time-varying formats can make it difficult to data-mine and display historical data. There is also the issue of data quality and the need for a systematic means of assessing the credibility of historical data sets. The Alaska Ocean Observing System (AOOS) data management system demonstrates the successful ingestion of historical data, both old and new (as recent as yesterday), and has integrated numerous historical data streams into user-friendly data portals, available for data upload and display on the AOOS Website. An example is the inclusion of non-real-time (e.g. day-old) AIS (Automatic Identification System) ship tracking data, important for scientists working in marine mammal migration regions. Other examples include historical sea ice data and various data streams from previous research projects (e.g. moored time series, HF Radar surface currents, weather, shipboard CTD). Most program or project websites only offer access to data specific to their agency or project alone, and do not have the capacity to provide access to the plethora of other data that might be available for the region and be useful for integration, comparison and synthesis. AOOS offers end users access to a one-stop shop for data in the area they want to research, helping them identify other sources of information and access. Demonstrations of data portals using historical data illustrate these benefits.

  18. Validating hospital antibiotic purchasing data as a metric of inpatient antibiotic use.

    PubMed

    Tan, Charlie; Ritchie, Michael; Alldred, Jason; Daneman, Nick

    2016-02-01

    Antibiotic purchasing data are a widely used, but unsubstantiated, measure of antibiotic consumption. To validate this source, we compared purchasing data from hospitals and external medical databases with patient-level dispensing data. Antibiotic purchasing and dispensing data from internal hospital records and purchasing data from IMS Health were obtained for two hospitals between May 2013 and April 2015. Internal purchasing data were validated against dispensing data, and IMS data were compared with both internal metrics. Scatterplots of individual antimicrobial data points were generated; Pearson's correlation and linear regression coefficients were computed. A secondary analysis re-examined these correlations over shorter calendar periods. Internal purchasing data were strongly correlated with dispensing data, with correlation coefficients of 0.90 (95% CI = 0.83-0.95) and 0.98 (95% CI = 0.95-0.99) at hospitals A and B, respectively. Although dispensing data were consistently lower than purchasing data, this was attributed to a single antibiotic at both hospitals. IMS data were favourably correlated with, but underestimated, internal purchasing and dispensing data. This difference was accounted for by eight antibiotics for which direct sales from some manufacturers were not included in the IMS database. The correlation between purchasing and dispensing data was consistent across periods as short as 3 months, but not at monthly intervals. Both internal and external antibiotic purchasing data are strongly correlated with dispensing data. If outliers are accounted for appropriately, internal purchasing data could be used for cost-effective evaluation of antimicrobial stewardship programmes, and external data sets could be used for surveillance and research across geographical regions. © The Author 2015. Published by Oxford University Press on behalf of the British Society for Antimicrobial Chemotherapy. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
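
    For readers unfamiliar with the analysis, a minimal Python sketch of the correlation and regression step might look as follows; the numbers are synthetic, not the study data.

      import numpy as np
      from scipy import stats

      purchased = np.array([120, 340, 95, 410, 1500, 60, 220, 780])   # units purchased per antimicrobial
      dispensed = np.array([115, 330, 90, 400, 1420, 58, 210, 760])   # units dispensed per antimicrobial

      r, p_value = stats.pearsonr(purchased, dispensed)
      slope, intercept, _, _, _ = stats.linregress(purchased, dispensed)
      print(f"Pearson r = {r:.2f} (p = {p_value:.3g}); "
            f"dispensed ~ {slope:.2f} * purchased + {intercept:.1f}")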

  19. Validating hospital antibiotic purchasing data as a metric of inpatient antibiotic use

    PubMed Central

    Tan, Charlie; Ritchie, Michael; Alldred, Jason; Daneman, Nick

    2016-01-01

    Objectives Antibiotic purchasing data are a widely used, but unsubstantiated, measure of antibiotic consumption. To validate this source, we compared purchasing data from hospitals and external medical databases with patient-level dispensing data. Methods Antibiotic purchasing and dispensing data from internal hospital records and purchasing data from IMS Health were obtained for two hospitals between May 2013 and April 2015. Internal purchasing data were validated against dispensing data, and IMS data were compared with both internal metrics. Scatterplots of individual antimicrobial data points were generated; Pearson's correlation and linear regression coefficients were computed. A secondary analysis re-examined these correlations over shorter calendar periods. Results Internal purchasing data were strongly correlated with dispensing data, with correlation coefficients of 0.90 (95% CI = 0.83–0.95) and 0.98 (95% CI = 0.95–0.99) at hospitals A and B, respectively. Although dispensing data were consistently lower than purchasing data, this was attributed to a single antibiotic at both hospitals. IMS data were favourably correlated with, but underestimated, internal purchasing and dispensing data. This difference was accounted for by eight antibiotics for which direct sales from some manufacturers were not included in the IMS database. The correlation between purchasing and dispensing data was consistent across periods as short as 3 months, but not at monthly intervals. Conclusions Both internal and external antibiotic purchasing data are strongly correlated with dispensing data. If outliers are accounted for appropriately, internal purchasing data could be used for cost-effective evaluation of antimicrobial stewardship programmes, and external data sets could be used for surveillance and research across geographical regions. PMID:26546668

  20. Restoration of Apollo Data by the Lunar Data Project/PDS Lunar Data Node: An Update

    NASA Technical Reports Server (NTRS)

    Williams, David R.; Hills, H. Kent; Taylor, Patrick T.; Grayzeck, Edwin J.; Guinness, Edward A.

    2016-01-01

    The Apollo 11, 12, and 14 through 17 missions orbited and landed on the Moon, carrying scientific instruments that returned data from all phases of the missions, including long-lived Apollo Lunar Surface Experiments Packages (ALSEPs) deployed by the astronauts on the lunar surface. Many of these data were never archived, and some of the archived data were on media and in formats that are outmoded, or were deposited with little or no useful documentation to aid outside users. This is particularly true of the ALSEP data returned autonomously for many years after the Apollo missions ended. The purpose of the Lunar Data Project and the Planetary Data System (PDS) Lunar Data Node is to take data collections already archived at the NASA Space Science Data Coordinated Archive (NSSDCA) and prepare them for archiving through PDS, and to locate lunar data that were never archived, bring them into NSSDCA, and then archive them through PDS. Preparing these data for archiving involves reading the data from the original media, be it magnetic tape, microfilm, microfiche, or hard-copy document; converting the outmoded, often binary, formats when necessary; putting them into a standard digital form accepted by PDS; collecting the necessary ancillary data and documentation (metadata) to ensure that the data are usable and well-described; summarizing the metadata in documentation to be included in the data set; adding other information such as references, mission and instrument descriptions, contact information, and related documentation; and packaging the results in a PDS-compliant data set. The data set is then validated and reviewed by a group of external scientists as part of the PDS final archive process. We present a status report on some of the data sets that we are processing.

  1. Development of an Operational TS Dataset Production System for the Data Assimilation System

    NASA Astrophysics Data System (ADS)

    Kim, Sung Dae; Park, Hyuk Min; Kim, Young Ho; Park, Kwang Soon

    2017-04-01

    An operational TS (Temperature and Salinity) dataset production system was developed to provide near real-time data to the data assimilation system periodically. It collects the latest 15 days' TS data for the northwestern Pacific area (20°N - 55°N, 110°E - 150°E), applies QC tests to the archived data, and supplies them to numerical prediction models of KIOST (Korea Institute of Ocean Science and Technology). The latest real-time TS data are collected from Argo GDAC and the GTSPP data server every week. Argo data are downloaded from the /latest_data directory of Argo GDAC. Because many duplicated data exist when all profile data are extracted from all Argo netCDF files, a DB system is used to avoid duplication. All metadata (float ID, location, observation date and time, etc.) of all Argo floats are stored in the database system, and a Matlab program was developed to manipulate DB data, to check for duplication, and to exclude duplicated data. GTSPP data are downloaded from the /realtime directory of the GTSPP data service. The latest data except Argo data are extracted from the original data. Another Matlab program was coded to inspect all collected data using 10 QC tests and to produce the final dataset which can be used by the assimilation system. Three regional range tests to inspect annual, seasonal and monthly variations are included in the QC procedures. A C program was developed to provide regional ranges to data managers. It can calculate the upper and lower limits of temperature and salinity at depths from 0 to 1550 m. The final TS dataset contains the latest 15 days' TS data in netCDF format. It is updated every week and transmitted to the numerical modelers of KIOST for operational use.
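
    A minimal sketch of a depth-dependent regional range test of the kind described; the limit table and flag values below are illustrative placeholders, not the operational limits produced by the C program.

      # QC flag convention used here: 1 = good, 4 = bad (illustrative).
      # Each tuple: (depth bin upper edge in m, min temperature, max temperature in deg C).
      LIMITS = [(50, 2.0, 30.0), (200, 1.0, 25.0), (1000, 0.5, 15.0), (1550, 0.0, 6.0)]

      def range_test(depth_m, temp_c):
          for upper_edge, t_min, t_max in LIMITS:
              if depth_m <= upper_edge:
                  return 1 if t_min <= temp_c <= t_max else 4
          return 4  # deeper than the table: flag for manual review

      profile = [(10, 18.2), (150, 11.7), (800, 4.3), (1400, 35.0)]   # (depth, temperature)
      flags = [range_test(d, t) for d, t in profile]                  # -> [1, 1, 1, 4]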

  2. The Open Data Repository's Data Publisher

    NASA Technical Reports Server (NTRS)

    Stone, N.; Lafuente, B.; Downs, R. T.; Blake, D.; Bristow, T.; Fonda, M.; Pires, A.

    2015-01-01

    Data management and data publication are becoming increasingly important components of researchers' workflows. The complexity of managing data, publishing data online, and archiving data has not decreased significantly even as computing access and power have greatly increased. The Open Data Repository's Data Publisher software strives to make data archiving, management, and publication a standard part of a researcher's workflow using simple, web-based tools and commodity server hardware. The publication engine allows for uploading, searching, and display of data with graphing capabilities and downloadable files. Access is controlled through a robust permissions system that can control publication at the field level and can be granted to the general public or protected so that only registered users at various permission levels receive access. Data Publisher also allows researchers to subscribe to meta-data standards through a plugin system, embargo data publication at their discretion, and collaborate with other researchers through various levels of data sharing. As the software matures, semantic data standards will be implemented to facilitate machine reading of data and each database will provide a REST application programming interface for programmatic access. Additionally, a citation system will allow snapshots of any data set to be archived and cited for publication while the data itself can remain living and continuously evolve beyond the snapshot date. The software runs on a traditional LAMP (Linux, Apache, MySQL, PHP) server and is available on GitHub (http://github.com/opendatarepository) under a GPLv2 open source license. The goal of the Open Data Repository is to lower the cost and training barrier to entry so that any researcher can easily publish their data and ensure it is archived for posterity.

  3. Integrating Engineering Data Systems for NASA Spaceflight Projects

    NASA Technical Reports Server (NTRS)

    Carvalho, Robert E.; Tollinger, Irene; Bell, David G.; Berrios, Daniel C.

    2012-01-01

    NASA has a large range of custom-built and commercial data systems to support spaceflight programs. Some of the systems are re-used by many programs and projects over time. Management and systems engineering processes require integration of data across many of these systems, a difficult problem given the widely diverse nature of system interfaces and data models. This paper describes an ongoing project to use a central data model with a web services architecture to support the integration and access of linked data across engineering functions for multiple NASA programs. The work involves the implementation of a web service-based middleware system called Data Aggregator to bring together data from a variety of systems to support space exploration. Data Aggregator includes a central data model registry for storing and managing links between the data in disparate systems. Initially developed for NASA's Constellation Program needs, Data Aggregator is currently being repurposed to support the International Space Station Program and new NASA projects with processes that involve significant aggregating and linking of data. This change in user needs led to development of a more streamlined data model registry for Data Aggregator in order to simplify adding new project application data as well as standardization of the Data Aggregator query syntax to facilitate cross-application querying by client applications. This paper documents the approach from a set of stand-alone engineering systems from which data are manually retrieved and integrated, to a web of engineering data systems from which the latest data are automatically retrieved and more quickly and accurately integrated. This paper includes the lessons learned through these efforts, including the design and development of a service-oriented architecture and the evolution of the data model registry approaches as the effort continues to evolve and adapt to support multiple NASA programs and priorities.

  4. DAS: A Data Management System for Instrument Tests and Operations

    NASA Astrophysics Data System (ADS)

    Frailis, M.; Sartor, S.; Zacchei, A.; Lodi, M.; Cirami, R.; Pasian, F.; Trifoglio, M.; Bulgarelli, A.; Gianotti, F.; Franceschi, E.; Nicastro, L.; Conforti, V.; Zoli, A.; Smart, R.; Morbidelli, R.; Dadina, M.

    2014-05-01

    The Data Access System (DAS) is a data management software system, providing a reusable solution for the storage of data acquired both from telescopes and from auxiliary data sources during the instrument development phases and operations. It is part of the Customizable Instrument WorkStation system (CIWS-FW), a framework for the storage, processing and quick-look analysis of the data acquired from scientific instruments. The DAS provides a data access layer mainly targeted to software applications: quick-look displays, pre-processing pipelines and scientific workflows. It is logically organized in three main components: an intuitive and compact Data Definition Language (DAS DDL) in XML format, aimed at user-defined data types; an Application Programming Interface (DAS API), automatically adding classes and methods supporting the DDL data types, and providing an object-oriented query language; and a data management component, which maps the metadata of the DDL data types onto a relational Data Base Management System (DBMS) and stores the data in a shared (network) file system. With the DAS DDL, developers define the data model for a particular project, specifying for each data type the metadata attributes, the data format and layout (if applicable), and named references to related or aggregated data types. Together with the DDL user-defined data types, the DAS API acts as the only interface to store, query and retrieve the metadata and data in the DAS system, providing both an abstract interface and a data-model-specific one in C, C++ and Python. The mapping of metadata in the back-end database is automatic and supports several relational DBMSs, including MySQL, Oracle and PostgreSQL.

  5. Searching for unpublished data for Cochrane reviews: cross sectional study

    PubMed Central

    Bero, Lisa; Gøtzsche, Peter C

    2013-01-01

    Objective To describe the experiences of authors of Cochrane reviews in searching for, getting access to, and using unpublished data. Design Cross sectional study. Setting Cochrane reviews. Participants 2184 corresponding authors of Cochrane reviews as of May 2012. Main outcome measure Frequencies of responses to open ended and closed questions in an online survey. Results Of 5915 authors contacted by email, 2184 replied (36.9% response rate). Of those, 1656 (75.8%) had searched for unpublished data. In 913 cases (55.1% of 1656), new data were obtained and we received details about these data for 794 data sources. The most common data source was “trialists/investigators,” accounting for 73.9% (n=587) of the 794 data sources. Most of the data were used in the review (82.0%, 651/794) and in 53.4% (424/794) of cases data were provided in less than a month. Summary data were most common, provided by 50.8% (403/794) of the data sources, whereas 20.5% (163/794) provided individual patient data. In only 6.3% (50/794) of cases were data reported to have been obtained from the manufacturers, and this group waited longer and had to make more contacts to get the data. The data from manufacturers were less likely to be for individual patients and less likely to be used in the review. Data from regulatory agencies accounted for 3.0% (24/794) of the obtained data. Conclusions Most authors of Cochrane reviews who searched for unpublished data received useful information, primarily from trialists. Our response rate was low and the authors who did not respond were probably less likely to have searched for unpublished data. Manufacturers and regulatory agencies were uncommon sources of unpublished data. PMID:23613540

  6. Searching for unpublished data for Cochrane reviews: cross sectional study.

    PubMed

    Schroll, Jeppe Bennekou; Bero, Lisa; Gøtzsche, Peter C

    2013-04-23

    To describe the experiences of authors of Cochrane reviews in searching for, getting access to, and using unpublished data. Cross sectional study. Cochrane reviews. 2184 corresponding authors of Cochrane reviews as of May 2012. Frequencies of responses to open ended and closed questions in an online survey. Of 5915 authors contacted by email, 2184 replied (36.9% response rate). Of those, 1656 (75.8%) had searched for unpublished data. In 913 cases (55.1% of 1656), new data were obtained and we received details about these data for 794 data sources. The most common data source was "trialists/investigators," accounting for 73.9% (n=587) of the 794 data sources. Most of the data were used in the review (82.0%, 651/794) and in 53.4% (424/794) of cases data were provided in less than a month. Summary data were most common, provided by 50.8% (403/794) of the data sources, whereas 20.5% (163/794) provided individual patient data. In only 6.3% (50/794) of cases were data reported to have been obtained from the manufacturers, and this group waited longer and had to make more contacts to get the data. The data from manufacturers were less likely to be for individual patients and less likely to be used in the review. Data from regulatory agencies accounted for 3.0% (24/794) of the obtained data. Most authors of Cochrane reviews who searched for unpublished data received useful information, primarily from trialists. Our response rate was low and the authors who did not respond were probably less likely to have searched for unpublished data. Manufacturers and regulatory agencies were uncommon sources of unpublished data.

  7. Can Data be Organized for Science and Reuse?

    NASA Astrophysics Data System (ADS)

    Habermann, T.; Shasharina, S.; Jelenak, A.; Fillmore, D. W.

    2015-12-01

    The Data Life Cycle is an important general concept for thinking about data collection, management, and preservation practices across the geophysical scientific data community. The cycle generally spans the scientific process from ideation, through experimental design, observation collection, data analysis and visualization, publication, archive, distribution and eventual reuse. During the cycle, the data may change through new analyses, presentations, and responsible parties, but, historically, the format and organization of the data have generally remained the same. Data collected as a time series at a point remain a time series, and data collected or calculated as grids remain grids. BIP is BIP and BSQ is BSQ. In fact, in many large data centers, the native format remains sacrosanct and, in the scientific community, reformatting is avoided because of fear of losing information or introducing data quality problems and irreproducible results. This traditional approach has worked well in areas where data are collected and used for a single purpose throughout the life cycle, and in domains where comparisons across different data sets are rare and problematic due to conflicting data organizational structures or incomplete documentation. This is not the world we live in today. Reuse for unexpected purposes and data (and model) comparisons are becoming increasingly common (e.g. climate model / observation comparisons). Data sets are preserved for future global investigators who may be unaware of the original project or purpose of the data. Also, it is becoming more common for data to be restructured and reformatted for a particular problem or to support a flexible web service. Unfortunately, many of these efforts do not preserve the metadata that, hopefully, accompanies the data in the original format. In this presentation we will discuss alternative approaches to data management that will facilitate data reuse across teams and disciplines.

  8. Applying a Data Stewardship Maturity Matrix to the NOAA Observing System Portfolio Integrated Assessment Process

    NASA Astrophysics Data System (ADS)

    Peng, G.; Austin, M.

    2017-12-01

    Identification and prioritization of targeted user community needs are not always considered until after data have been created and archived. Gaps in data curation and documentation in the data production and delivery phases limit the data's broad utility, specifically for decision makers. Expert understanding and knowledge of a particular dataset is often required as part of the data and metadata curation process to establish the credibility of the data and support informed decision-making. To enhance curation practices, content from NOAA's Observing System Integrated Assessment (NOSIA) Value Tree and NOAA's Data Catalog/Digital Object Identifier (DOI) projects (collection-level metadata) has been integrated with Data/Stewardship Maturity Matrices (data and stewardship quality information) focused on assessment of user community needs. This results in user-focused, evidence-based decision-making tools created by NOAA's National Environmental Satellite, Data, and Information Service (NESDIS) through identification and assessment of data content gaps related to scientific knowledge and application to key areas of societal benefit. Enabling user-need feedback from the beginning of data creation through archiving allows users to determine the quality and value of data that are fit for purpose. Data gap assessment and prioritization are presented in a user-friendly way using the data stewardship maturity matrices as a measurement of data management quality. These decision-maker tools encourage data producers and data providers/stewards to consider users' needs prior to data creation and dissemination, resulting in user-driven data requirements and increasing return on investment. A use case focused on the need for NOAA observations linked to societal benefit will be used to demonstrate the value of these tools.

  9. Data-resolution matrix and model-resolution matrix for Rayleigh-wave inversion using a damped least-squares method

    USGS Publications Warehouse

    Xia, J.; Miller, R.D.; Xu, Y.

    2008-01-01

    Inversion of multimode surface-wave data is of increasing interest in the near-surface geophysics community. For a given near-surface geophysical problem, it is essential to understand how well the data, calculated according to a layered-earth model, might match the observed data. A data-resolution matrix is a function of the data kernel (determined by a geophysical model and a priori information applied to the problem), not the data. A data-resolution matrix of high-frequency (>2 Hz) Rayleigh-wave phase velocities, therefore, offers a quantitative tool for designing field surveys and predicting the match between calculated and observed data. We employed a data-resolution matrix to select data that would be well predicted, and we found that there are advantages to incorporating higher modes in inversion. The resulting discussion using the data-resolution matrix provides insight into the process of inverting Rayleigh-wave phase velocities with higher-mode data to estimate S-wave velocity structure. Discussion also suggested that each near-surface geophysical target can only be resolved using Rayleigh-wave phase velocities within specific frequency ranges, and that higher-mode data are normally more accurately predicted than fundamental-mode data because of restrictions on the data kernel for the inversion system. We used synthetic and real-world examples to demonstrate that data selected with the data-resolution matrix can provide better inversion results and to explain, with the data-resolution matrix, why incorporating higher-mode data in inversion can provide better results. We also calculated model-resolution matrices in these examples to show the potential of increasing model resolution with selected surface-wave data. © Birkhäuser 2008.
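
    For reference, the standard damped least-squares expressions behind these matrices, in textbook notation for a linearized system d = Gm with damping factor epsilon (not reproduced from the paper itself), are:

      m_{est} = \left[ G^{T} G + \epsilon^{2} I \right]^{-1} G^{T} d_{obs}, \qquad
      N = G \left[ G^{T} G + \epsilon^{2} I \right]^{-1} G^{T}, \qquad
      R = \left[ G^{T} G + \epsilon^{2} I \right]^{-1} G^{T} G,

    where the data-resolution matrix N maps observed to predicted data (d_{pred} = N d_{obs}) and the model-resolution matrix R maps true to estimated model parameters (m_{est} = R m_{true}); rows of N that are close to delta-like spikes indicate data the inversion can predict well.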

  10. DBCG hypo trial validation of radiotherapy parameters from a national data bank versus manual reporting.

    PubMed

    Brink, Carsten; Lorenzen, Ebbe L; Krogh, Simon Long; Westberg, Jonas; Berg, Martin; Jensen, Ingelise; Thomsen, Mette Skovhus; Yates, Esben Svitzer; Offersen, Birgitte Vrou

    2018-01-01

    The current study evaluates the data quality achievable using a national data bank for reporting radiotherapy parameters relative to the classical manual reporting method of selected parameters. The data comparison is based on 1522 Danish patients of the DBCG hypo trial with data stored in the Danish national radiotherapy data bank. In line with standard DBCG trial practice, selected parameters were also reported manually to the DBCG database. Categorical variables are compared using contingency tables, and comparison of continuous parameters is presented in scatter plots. For categorical variables, 25 differences between the data bank and manual values were located. Of these, 23 were related to mistakes in the manually reported value, whilst the remaining two were wrong classifications in the data bank. The wrong classifications in the data bank were related to a lack of dose information: the two patients had been treated with an electron boost based on a manual calculation, so the data were not exported to the data bank, and this was not detected prior to comparison with the manual data. For a few database fields in the manual data, an ambiguity in the parameter definition of the specific field is seen in the data. This was not the case for the data bank, which extracts all data consistently. In terms of data quality the data bank is superior to manually reported values. However, there is a need to allocate resources for checking the validity of the available data as well as ensuring that all relevant data are present. The data bank contains more detailed information, and thus facilitates research related to the actual dose distribution in the patients.
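
    A minimal sketch of the categorical comparison via contingency tables; the values are synthetic, not DBCG trial data.

      import pandas as pd

      manual    = pd.Series(["boost", "no boost", "boost", "no boost", "boost"])
      data_bank = pd.Series(["boost", "no boost", "no boost", "no boost", "boost"])

      table = pd.crosstab(manual, data_bank, rownames=["manual"], colnames=["data bank"])
      print(table)   # off-diagonal cells are the discrepancies to audit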

  11. Terra Populus and DataNet Collaboration

    NASA Astrophysics Data System (ADS)

    Kugler, T.; Ruggles, S.; Fitch, C. A.; Clark, P. D.; Sobek, M.; Van Riper, D.

    2012-12-01

    Terra Populus, part of NSF's new DataNet initiative, is developing organizational and technical infrastructure to integrate, preserve, and disseminate data describing changes in the human population and environment over time. Terra Populus will incorporate large microdata and aggregate census datasets from the United States and around the world, as well as land use, land cover, climate and other environmental datasets. These data are widely dispersed, exist in a variety of data structures, have incompatible or inadequate metadata, and have incompatible geographic identifiers. Terra Populus is developing methods of integrating data from different domains and translating across data structures based on spatio-temporal linkages among data contents. The new infrastructure will enable researchers to identify and merge data from heterogeneous sources to study the relationships between human behavior and the natural world. Terra Populus will partner with data archives, data producers, and data users to create a sustainable international organization that will guarantee preservation and access over multiple decades. Terra Populus is also collaborating with the other projects in the DataNet initiative - DataONE, the DataNet Federation Consortium (DFC) and Sustainable Environment-Actionable Data (SEAD). Taken together, the four projects address aspects of the entire data lifecycle, including planning, collection, documentation, discovery, integration, curation, preservation, and collaboration; and encompass a wide range of disciplines including earth sciences, ecology, social sciences, hydrology, oceanography, and engineering. The four projects are pursuing activities to share data, tools, and expertise between pairs of projects as well as collaborating across the DataNet program on issues of cyberinfrastructure and community engagement. Topics to be addressed through program-wide collaboration include technical, organizational, and financial sustainability; semantic integration; data management training and education; and cross-disciplinary awareness of data resources.

  12. Data Foundry: Data Warehousing and Integration for Scientific Data Management

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Musick, R.; Critchlow, T.; Ganesh, M.

    2000-02-29

    Data warehousing is an approach for managing data from multiple sources by representing them with a single, coherent point of view. Commercial data warehousing products have been produced by companies such as RedBrick, IBM, Brio, Andyne, Ardent, NCR, Information Advantage, Informatica, and others. Other companies have chosen to develop their own in-house data warehousing solution using relational databases, such as those sold by Oracle, IBM, Informix and Sybase. The typical approaches include federated systems and mediated data warehouses, each of which, to some extent, makes use of a series of source-specific wrapper and mediator layers to integrate the data into a consistent format which is then presented to users as a single virtual data store. These approaches are successful when applied to traditional business data because the data format used by the individual data sources tends to be rather static. Therefore, once a data source has been integrated into a data warehouse, there is relatively little work required to maintain that connection. However, that is not the case for all data sources. Data sources from scientific domains tend to regularly change their data model, format and interface. This is problematic because each change requires the warehouse administrator to update the wrapper, mediator, and warehouse interfaces to properly read, interpret, and represent the modified data source. Furthermore, the data that scientists require to carry out research is continuously changing as their understanding of a research question develops, or as their research objectives evolve. The difficulty and cost of these updates effectively limit the number of sources that can be integrated into a single data warehouse, or make an approach based on warehousing too expensive to consider.

  13. Why is data sharing in collaborative natural resource efforts so hard and what can we do to improve it?

    PubMed

    Volk, Carol J; Lucero, Yasmin; Barnas, Katie

    2014-05-01

    Increasingly, research and management in natural resource science rely on very large datasets compiled from multiple sources. While it is generally good to have more data, utilizing large, complex datasets has introduced challenges in data sharing, especially for collaborating researchers in disparate locations ("distributed research teams"). We surveyed natural resource scientists about common data-sharing problems. The major issues identified by our survey respondents (n = 118) when providing data were lack of clarity in the data request (including format of data requested). When receiving data, survey respondents reported various insufficiencies in documentation describing the data (e.g., no data collection description/no protocol, data aggregated, or summarized without explanation). Since metadata, or "information about the data," is a central obstacle in efficient data handling, we suggest documenting metadata through data dictionaries, protocols, read-me files, explicit null value documentation, and process metadata as essential to any large-scale research program. We advocate for all researchers, but especially those involved in distributed teams to alleviate these problems with the use of several readily available communication strategies including the use of organizational charts to define roles, data flow diagrams to outline procedures and timelines, and data update cycles to guide data-handling expectations. In particular, we argue that distributed research teams magnify data-sharing challenges making data management training even more crucial for natural resource scientists. If natural resource scientists fail to overcome communication and metadata documentation issues, then negative data-sharing experiences will likely continue to undermine the success of many large-scale collaborative projects.

  14. Why is Data Sharing in Collaborative Natural Resource Efforts so Hard and What can We Do to Improve it?

    NASA Astrophysics Data System (ADS)

    Volk, Carol J.; Lucero, Yasmin; Barnas, Katie

    2014-05-01

    Increasingly, research and management in natural resource science rely on very large datasets compiled from multiple sources. While it is generally good to have more data, utilizing large, complex datasets has introduced challenges in data sharing, especially for collaborating researchers in disparate locations ("distributed research teams"). We surveyed natural resource scientists about common data-sharing problems. The major issues identified by our survey respondents ( n = 118) when providing data were lack of clarity in the data request (including format of data requested). When receiving data, survey respondents reported various insufficiencies in documentation describing the data (e.g., no data collection description/no protocol, data aggregated, or summarized without explanation). Since metadata, or "information about the data," is a central obstacle in efficient data handling, we suggest documenting metadata through data dictionaries, protocols, read-me files, explicit null value documentation, and process metadata as essential to any large-scale research program. We advocate for all researchers, but especially those involved in distributed teams to alleviate these problems with the use of several readily available communication strategies including the use of organizational charts to define roles, data flow diagrams to outline procedures and timelines, and data update cycles to guide data-handling expectations. In particular, we argue that distributed research teams magnify data-sharing challenges making data management training even more crucial for natural resource scientists. If natural resource scientists fail to overcome communication and metadata documentation issues, then negative data-sharing experiences will likely continue to undermine the success of many large-scale collaborative projects.

  15. Linked health data for pharmacovigilance in children: perceived legal and ethical issues for stakeholders and data guardians.

    PubMed

    Hopf, Yvonne Marina; Bond, Christine B; Francis, Jill J; Haughney, John; Helms, Peter J

    2014-02-12

    The inclusion of the Community Health Index in the recording of National Health Service (NHS) contacts in Scotland facilitates national linkage of data such as prescribing and healthcare utilisation. This linkage could be the basis for identification of adverse drug reactions. The aim of this article is to report the views of healthcare professionals on data sharing, ownership and the legal and other applicable frameworks relevant to linkage of routinely collected paediatric healthcare data. Qualitative study using semistructured face-to-face interviews addressing the study aims. Purposive sample of professional stakeholders (n=25) including experts on ethics, data protection, pharmacovigilance, data linkage, legal issues and prescribing. Interviews were audio-recorded, transcribed and thematically analysed using a framework approach. Participants identified existing data sharing systems in the UK. Access to healthcare data should be approved by the data owners. The definition of data ownership and associated legal responsibilities for linked healthcare data were seen as important factors to ensure accountability for the use of linked data. Yet data owners were seen as facilitators of the proposed data linkage. Twelve frameworks (legal, regulatory and governance) applicable to the linkage of healthcare data were identified. A large number of potentially relevant legal and regulatory frameworks were identified. Ownership of the linked data was seen as an extension of responsibility for, or guardianship of, the source datasets. The consensus emerging from the present study was that clarity is required on the definition of data sharing, data ownership and responsibilities of data owners.

  16. Opportunities for the Mashup of Heterogeneous Data Servers via Semantic Web Technology

    NASA Astrophysics Data System (ADS)

    Ritschel, Bernd; Seelus, Christoph; Neher, Günther; Iyemori, Toshihiko; Koyama, Yukinobu; Yatagai, Akiyo; Murayama, Yasuhiro; King, Todd; Hughes, John; Fung, Shing; Galkin, Ivan; Hapgood, Michael; Belehaki, Anna

    2015-04-01

    European Union ESPAS, Japanese IUGONET, and GFZ ISDC data servers were developed for the ingestion, archiving, and distribution of geo- and space-science domain data. Main parts of the data managed by the mentioned data servers are related to near-Earth space and geomagnetic field data. A smart mashup of the data servers would allow seamless browsing of and access to data and related context information. However, the achievement of a high level of interoperability is a challenge because the data servers are based on different data models and software frameworks. This paper is focused on the latest experiments and results for the mashup of the data servers using the semantic Web approach. Besides the mashup of domain and terminological ontologies, especially the options to connect data managed by relational databases using D2R server and SPARQL technology will be addressed. A successful realization of the data server mashup will not only have a positive impact on the data users of the specific scientific domain but also on related projects, such as the development of a new interoperable version of NASA's Planetary Data System (PDS) or ICSU's World Data System alliance. ESPAS data server: https://www.espas-fp7.eu/portal/ IUGONET data server: http://search.iugonet.org/iugonet/ GFZ ISDC data server (semantic Web based prototype): http://rz-vm30.gfz-potsdam.de/drupal-7.9/ NASA PDS: http://pds.nasa.gov ICSU-WDS: https://www.icsu-wds.org
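
    As an illustration of the SPARQL-based access mentioned above, the following Python sketch queries a D2R-style endpoint with SPARQLWrapper; the endpoint URL and vocabulary terms are hypothetical placeholders, not the actual ESPAS, IUGONET or ISDC services.

      from SPARQLWrapper import SPARQLWrapper, JSON

      endpoint = SPARQLWrapper("http://example.org/d2r/sparql")   # placeholder endpoint
      endpoint.setQuery("""
          PREFIX dct: <http://purl.org/dc/terms/>
          SELECT ?dataset ?title WHERE {
              ?dataset dct:title ?title .
              FILTER regex(?title, "geomagnetic", "i")
          } LIMIT 10
      """)
      endpoint.setReturnFormat(JSON)
      for row in endpoint.query().convert()["results"]["bindings"]:
          print(row["dataset"]["value"], "-", row["title"]["value"])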

  17. VORBrouter: A dynamic data routing system for Real-Time Seismic networks

    NASA Astrophysics Data System (ADS)

    Hansen, T.; Vernon, F.; Lindquist, K.; Orcutt, J.

    2004-12-01

    For anyone who has managed a moderately complex buffered real-time data transport system, the need for reliable adaptive data transport is clear. The ROADNet VORBrouter system, an extension to the ROADNet data catalog system [AGU-2003, Dynamic Dataflow Topology Monitoring for Real-time Seismic Networks], allows dynamic routing of real-time seismic data from sensor to end-user. Traditional networks consist of a series of data buffer computers with data transport interconnections configured by hand. This allows for arbitrarily complex data networks, which can often exceed full comprehension by network administrators, sometimes resulting in data loops or accidental data cutoff. In order to manage data transport systems in the event of a network failure, a network administrator must be called upon to change the data transport paths and to recover the missing data. Using VORBrouter, administrators can sleep at night while still providing 24/7 uninterrupted data streams at realistic cost. This software package uses information from the ROADNet data catalog system to route packets around failed links and to new consumers in real time. Dynamic data routing protocols operating on top of the Antelope data buffering layer allow authorized users to request data sets from their local buffer and to have them delivered from anywhere within the network of buffers. The VORBrouter software also allows for dynamic routing around network outages and the elimination of duplicate data paths within the network, while maintaining the nearly lossless data transport features exhibited by the underlying Antelope system. We present the design of the VORBrouter system, its features, limitations and some future research directions.

  18. The OCC NOAA Data Commons: First Year Experiences

    NASA Astrophysics Data System (ADS)

    Flamig, Z.; Patterson, M.; Wells, W.; Grossman, R.

    2016-12-01

    The Open Commons Consortium (OCC) is one of the five "Data Alliance" anchoring institutions in the NOAA Big Data Project (BDP) that was announced on April 21st, 2015. This study will present lessons learned from the first year of the BDP. The project so far has set up a pilot data commons with some initial datasets and established a digital ID service. Demonstrations on how to work with the digital ID service and the NEXRAD radar data will be shown. The proof of concept for the OCC NOAA Data Commons was established using the Level 2 NEXRAD data made available to the BDP partners. Approximately 50 TiB of NEXRAD data representing the year 2015 was incorporated into the data commons. The digital ID service supports a common persistent data ID that can access data from across multiple data locations. Using this digital ID service allows users to access the NEXRAD data from their choice of the OCC NOAA Data Commons or from Amazon's NEXRAD data holdings in the same manner. To demonstrate the concept further, a sample Jupyter notebook was created to utilize the data. The notebook, which uses the Py-ART package, creates an animated loop of the NEXRAD data showing a mayfly hatch in Wisconsin during June 2015. The notebook also demonstrates how to do a basic quality control procedure on the radar data, in this instance to remove meteorological echoes in favor of showcasing the biological scatterers. For grantees on the Open Science Data Cloud there are additional premade resources available, such as virtual machine images preloaded with the tools needed to access the NEXRAD data.
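
    A minimal Python sketch in the spirit of the notebook described above, reading one Level 2 NEXRAD volume with Py-ART and plotting reflectivity; the file name is a placeholder and the quality-control step from the notebook is omitted.

      import matplotlib.pyplot as plt
      import pyart

      radar = pyart.io.read_nexrad_archive("KMKX20150607_000000_V06")   # placeholder file name
      print(list(radar.fields))                                         # e.g. reflectivity, velocity, ...

      display = pyart.graph.RadarDisplay(radar)
      fig, ax = plt.subplots(figsize=(6, 6))
      display.plot("reflectivity", sweep=0, ax=ax, vmin=-20, vmax=60)
      plt.savefig("nexrad_reflectivity.png", dpi=150)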

  19. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research

    PubMed Central

    Weng, Chunhua

    2013-01-01

    Objective To review the methods and dimensions of data quality assessment in the context of electronic health record (EHR) data reuse for research. Materials and methods A review of the clinical research literature discussing data quality assessment methodology for EHR data was performed. Using an iterative process, the aspects of data quality being measured were abstracted and categorized, as well as the methods of assessment used. Results Five dimensions of data quality were identified, which are completeness, correctness, concordance, plausibility, and currency, and seven broad categories of data quality assessment methods: comparison with gold standards, data element agreement, data source agreement, distribution comparison, validity checks, log review, and element presence. Discussion Examination of the methods by which clinical researchers have investigated the quality and suitability of EHR data for research shows that there are fundamental features of data quality, which may be difficult to measure, as well as proxy dimensions. Researchers interested in the reuse of EHR data for clinical research are recommended to consider the adoption of a consistent taxonomy of EHR data quality, to remain aware of the task-dependence of data quality, to integrate work on data quality assessment from other fields, and to adopt systematic, empirically driven, statistically based methods of data quality assessment. Conclusion There is currently little consistency or potential generalizability in the methods used to assess EHR data quality. If the reuse of EHR data for clinical research is to become accepted, researchers should adopt validated, systematic methods of EHR data quality assessment. PMID:22733976

  20. Quality Analysis of Open Street Map Data

    NASA Astrophysics Data System (ADS)

    Wang, M.; Li, Q.; Hu, Q.; Zhou, M.

    2013-05-01

    Crowd-sourced geographic data are open geographic data contributed by large numbers of non-professionals and provided to the public. Typical examples include GPS track data such as OpenStreetMap, collaborative map data such as Wikimapia, content from social websites such as Twitter and Facebook, and points of interest tagged by Jiepang users. After processing, these data can provide canonical geographic information to the public. Compared with conventional geographic data collection and update methods, crowd-sourced geographic data from non-professionals offer large data volume, high currency, rich information, and low cost, and have become a research hotspot in international geographic information science in recent years. Large volumes of current crowd-sourced geographic data provide a new solution for geospatial database updating, provided the quality problems of data obtained from non-professionals can be addressed. In this paper, a quality analysis model for OpenStreetMap (OSM) crowd-sourced geographic data is proposed. Firstly, a quality analysis framework is designed based on an analysis of the characteristics of OSM data. Secondly, a quality assessment model for OSM data using three quality elements (completeness, thematic accuracy, and positional accuracy) is presented. Finally, taking the OSM data of Wuhan as an example, the paper analyses and assesses the quality of OSM data against a 2011 navigation map used as reference. The results show that the high-level roads and urban traffic network in the OSM data have high positional accuracy and completeness, so that these OSM data can be used for updating urban road network databases.
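
    A small sketch of how the completeness element of such an assessment might be estimated, as the ratio of total OSM road length to total reference road length in an area of interest; the coordinates and the planar length measure are illustrative assumptions.

    ```python
    # Sketch: estimate completeness as the ratio of total OSM road length to total
    # reference (navigation map) road length. Geometries are made-up illustrative values.
    from math import hypot

    def polyline_length(points):
        """Planar length of a polyline given as (x, y) coordinate pairs."""
        return sum(hypot(x2 - x1, y2 - y1)
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    osm_roads = [[(0, 0), (0, 500)], [(0, 500), (400, 500)]]
    reference_roads = [[(0, 0), (0, 500)], [(0, 500), (450, 500)], [(450, 500), (450, 900)]]

    osm_total = sum(polyline_length(road) for road in osm_roads)
    ref_total = sum(polyline_length(road) for road in reference_roads)
    print(f"completeness = {osm_total / ref_total:.2%}")
    ```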

  1. Usability Analysis within The DataONE Network of Collaborators

    NASA Astrophysics Data System (ADS)

    Budden, A. E.; Frame, M. T.; Tenopir, C.; Volentine, R.

    2014-12-01

    DataONE was conceived as a 10-year project to enable new science and knowledge creation through universal access to data about life on Earth and the environment that sustains it. In Phase I (2009-2014) more than 300 DataONE participants designed, developed and deployed a robust cyberinfrastructure (CI) with innovative services, and directly engaged and educated a broad stakeholder community. DataONE provides a resilient, scalable infrastructure using Member Nodes (data repositories), Coordinating Nodes, and an Investigator Toolkit to support the data access and data management needs of biological, Earth, and environmental science researchers in the U.S. and across the globe. DataONE collaborators, such as the U.S. Geological Survey, University of New Mexico, and the University of Tennessee, perform research to measure both the current data practices and opinions of DataONE stakeholders and the usability of DataONE for these stakeholders. Stakeholders include scientists, data managers, librarians, and educators, among others. The DataONE Usability and Assessment Working Group, which includes members from multiple sectors, carries out research, development, and implementation projects on DataONE processes, systems, and methods. These projects are essential to ensure that DataONE products and services meet network goals, include appropriate community involvement, and demonstrate progress and achievements of DataONE. This poster will provide an overview of DataONE's usability analysis and assessment methodologies, benefits to DataONE and its collaborators, and current tools and techniques being utilized by the participants.

  2. Methods for Specifying Scientific Data Standards and Modeling Relationships with Applications to Neuroscience

    PubMed Central

    Rübel, Oliver; Dougherty, Max; Prabhat; Denes, Peter; Conant, David; Chang, Edward F.; Bouchard, Kristofer

    2016-01-01

    Neuroscience continues to experience a tremendous growth in data, in terms of the volume and variety of data, the velocity at which data is acquired, and in turn the veracity of data. These challenges are a serious impediment to sharing of data, analyses, and tools within and across labs. Here, we introduce BRAINformat, a novel data standardization framework for the design and management of scientific data formats. The BRAINformat library defines application-independent design concepts and modules that together create a general framework for standardization of scientific data. We describe the formal specification of scientific data standards, which facilitates sharing and verification of data and formats. We introduce the concept of Managed Objects, enabling semantic components of data formats to be specified as self-contained units, supporting modular and reusable design of data format components and file storage. We also introduce the novel concept of Relationship Attributes for modeling and use of semantic relationships between data objects. Based on these concepts we demonstrate the application of our framework to design and implement a standard format for electrophysiology data and show how data standardization and relationship-modeling facilitate data analysis and sharing. The format uses HDF5, enabling portable, scalable, and self-describing data storage and integration with modern high-performance computing for data-driven discovery. The BRAINformat library is open source and easy to use, provides detailed user and developer documentation, and is freely available at: https://bitbucket.org/oruebel/brainformat. PMID:27867355
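
    The following sketch illustrates the general idea of self-describing storage and relationship modeling in HDF5 (using h5py); it is not the BRAINformat library's actual API, and the group names, attributes, and relationship attribute are assumptions.

    ```python
    # Sketch of the general idea (not the BRAINformat API): store an electrophysiology
    # recording as a self-describing HDF5 group and record an explicit relationship
    # between the data and the electrode layout via an attribute holding an object reference.
    import numpy as np
    import h5py

    with h5py.File("recording.h5", "w") as f:
        ephys = f.create_group("ephys_data")
        ephys.attrs["object_type"] = "Recording"          # self-describing metadata
        ephys.attrs["sampling_rate_hz"] = 30000.0

        traces = ephys.create_dataset("voltage", data=np.zeros((4, 1000), dtype="f4"))
        layout = f.create_dataset("electrode_positions", data=np.zeros((4, 3), dtype="f4"))

        # Relationship attribute: the voltage rows are indexed by the electrode layout rows.
        traces.attrs["indexed_by"] = layout.ref            # HDF5 object reference
    ```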

  3. Data Curation Education Grounded in Earth Sciences and the Science of Data

    NASA Astrophysics Data System (ADS)

    Palmer, C. L.

    2015-12-01

    This presentation looks back over ten years of experience advancing data curation education at two Information Schools, highlighting the vital role of earth science case studies, expertise, and collaborations in the development of curriculum and internships. We also consider current data curation practices and workforce demand in data centers in the geosciences, drawing on studies conducted in the Data Curation Education in Research Centers (DCERC) initiative and the Site-Based Data Curation project. Outcomes from this decade of data curation research and education have reinforced the importance of key areas of information science in preparing data professionals to respond to the needs of user communities, provide services across disciplines, invest in standards and interoperability, and promote open data practices. However, a serious void remains in principles to guide education and practice that are distinct to the development of data systems and services that meet both local and global aims. We identify principles emerging from recent empirical studies on the reuse value of data in the earth sciences and propose an approach for advancing data curation education that depends on systematic coordination with data-intensive research and propagation of current best practices from data centers into the curriculum. This collaborative model can increase both domain-based and cross-disciplinary expertise among data professionals, ultimately improving data systems and services in our universities and data centers while building the new base of knowledge needed for a foundational science of data.

  4. Semantic Approaches Applied to Scientific Ocean Drilling Data

    NASA Astrophysics Data System (ADS)

    Fils, D.; Jenkins, C. J.; Arko, R. A.

    2012-12-01

    The application of Linked Open Data methods to 40 years of data from scientific ocean drilling is providing users with several new methods for rich-content data search and discovery. Data from the Deep Sea Drilling Project (DSDP), Ocean Drilling Program (ODP) and Integrated Ocean Drilling Program (IODP) have been translated and placed in RDF triple stores to provide access via SPARQL, linked open data patterns, and embedded structured data through schema.org / RDFa. Existing search services have been re-encoded in this environment, which allows the new and established architectures to be contrasted. Vocabularies, including computed semantic relations between concepts, allow separate but related data sets to be connected on their concepts and resources even when they are expressed somewhat differently. Scientific ocean drilling produces a wide range of data types and data sets: borehole logging file-based data, images, measurements, visual observations and physical sample data. The steps involved in connecting these data to concepts using vocabularies will be presented, including the connection of data sets through the Vocabulary of Interlinked Datasets (VoID) and open entity collections such as Freebase and dbPedia. Demonstrated examples will include: (i) using RDF Schema for inferencing and in federated searches across NGDC and IODP data, (ii) using structured data in the data.oceandrilling.org web site, (iii) association through semantic methods of age models and depth-recorded data to facilitate age-based searches for data recorded by depth only.
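
    A hedged sketch of the linked-data pattern described above, using rdflib to build a tiny RDF graph of borehole measurements and query it with SPARQL; the namespace, predicates, and triples are hypothetical and do not reflect the actual data.oceandrilling.org vocabularies.

    ```python
    # Sketch: a tiny RDF graph in the spirit of the linked ocean drilling data, queried with
    # SPARQL via rdflib. The namespace, predicates, and triples are hypothetical illustrations.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/odp/")
    g = Graph()
    g.add((EX["hole_1256D"], EX.hasMeasurement, EX["density_1256D"]))
    g.add((EX["density_1256D"], EX.parameter, Literal("bulk density")))
    g.add((EX["density_1256D"], EX.depthMetres, Literal(250.5)))

    results = g.query("""
        PREFIX ex: <http://example.org/odp/>
        SELECT ?hole ?param WHERE {
            ?hole ex:hasMeasurement ?m .
            ?m ex:parameter ?param .
        }
    """)
    for hole, param in results:
        print(hole, param)
    ```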

  5. Validation of geometric accuracy of Global Land Survey (GLS) 2000 data

    USGS Publications Warehouse

    Rengarajan, Rajagopalan; Sampath, Aparajithan; Storey, James C.; Choate, Michael J.

    2015-01-01

    The Global Land Survey (GLS) 2000 data were generated from Geocover™ 2000 data with the aim of producing a global data set of accuracy better than 25 m Root Mean Square Error (RMSE). An assessment and validation of the accuracy of the GLS 2000 data set, and of its co-registration with the Geocover™ 2000 data set, is presented here. Since global reference data sets with higher nominal accuracy than the GLS 2000 are scarce, the data were assessed in three tiers. In the first tier, the data were compared with the Geocover™ 2000 data. This comparison provided a means of localizing regions of higher differences. In the second tier, the GLS 2000 data were compared with systematically corrected Landsat-7 scenes that were obtained in a time period when the spacecraft pointing information was extremely accurate. These comparisons localize regions where the data are consistently off, which may indicate regions of higher errors. The third tier consisted of comparing the GLS 2000 data against higher accuracy reference data. The reference data were the Digital Ortho Quads over the United States, orthorectified SPOT data over Australia, and high accuracy check points obtained using triangulation bundle adjustment of Landsat-7 images over selected sites around the world. The study reveals that the geometric errors in Geocover™ 2000 data have been rectified in GLS 2000 data, and that the accuracy of GLS 2000 data can be expected to be better than 25 m RMSE for most of its constituent scenes.
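
    A short sketch of the accuracy metric involved: horizontal RMSE computed from check-point offsets between GLS 2000 and reference data; the offset values are made-up illustrative numbers.

    ```python
    # Sketch: horizontal RMSE from check-point offsets, the figure of merit behind the
    # 25 m accuracy target. The offsets below are made-up illustrative values in metres.
    import numpy as np

    dx = np.array([5.2, -8.1, 12.4, -3.0])   # easting differences, GLS minus reference (m)
    dy = np.array([-6.7, 4.3, -9.9, 7.5])    # northing differences (m)

    rmse = np.sqrt(np.mean(dx**2 + dy**2))
    print(f"horizontal RMSE = {rmse:.1f} m")
    ```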

  6. Data policy for data sets from various sources: recent developments in the Integrated Carbon Observation System (ICOS)

    NASA Astrophysics Data System (ADS)

    Vermeulen, A. T.; Kutsch, W. L.; Lavric, J. V.; Juurola, E.

    2016-12-01

    Fluxnet is facing a transition from single-PI and project engagement to a cooperation of infrastructures such as ICOS, Ameriflux, NEON, Chinaflux or TERN. Each of these infrastructures has developed its own data life cycle, data license and data policy, which will have implications for future cooperation within Fluxnet and other global data integration efforts such as SOCAT in the ocean community. This presentation will introduce the recent developments of the ICOS data policy and show perspectives for future cooperation in global networks. The challenge in developing the ICOS data policy has been to find the best compromise between optimized access for users and sufficient visibility and acknowledgement of data providers. ICOS data will be provided under the Creative Commons 4.0 BY license, through the ICOS Carbon Portal. Data usage will be unrestricted; data need only be attributed as ICOS data. With the attribution, ICOS will provide a persistent identifier (PID, sometimes also called a digital object identifier, DOI) that directs to a landing page where the data provider and, if necessary, funding organisations are identified. In cooperation with other environmental research infrastructures in the framework of the European cluster project ENVRIplus and the Research Data Alliance (RDA), the ICOS Carbon Portal is currently developing a data citation system. This includes developing recommendations for data citation of integrated data sets from different sources.

  7. Common Data Model for Neuroscience Data and Data Model Exchange

    PubMed Central

    Gardner, Daniel; Knuth, Kevin H.; Abato, Michael; Erde, Steven M.; White, Thomas; DeBellis, Robert; Gardner, Esther P.

    2001-01-01

    Objective: Generalizing the data models underlying two prototype neurophysiology databases, the authors describe and propose the Common Data Model (CDM) as a framework for federating a broad spectrum of disparate neuroscience information resources. Design: Each component of the CDM derives from one of five superclasses—data, site, method, model, and reference—or from relations defined between them. A hierarchic attribute-value scheme for metadata enables interoperability with variable tree depth to serve specific intra- or broad inter-domain queries. To mediate data exchange between disparate systems, the authors propose a set of XML-derived schemas for describing not only data sets but also data models. These include the biophysical description markup language (BDML), which mediates interoperability between data resources by providing a meta-description for the CDM. Results: The set of superclasses potentially spans the data needs of contemporary neuroscience. Data elements abstracted from neurophysiology time series and histogram data represent data sets that differ in dimension and concordance. Site elements transcend neurons to describe subcellular compartments, circuits, regions, or slices; non-neuroanatomic sites range from sequences to patients. Methods and models are highly domain-dependent. Conclusions: True federation of data resources requires explicit public description, in a metalanguage, of the contents, query methods, data formats, and data models of each data resource. Any data model that can be derived from the defined superclasses is potentially conformant, and interoperability can be enabled by recognition of BDML-described compatibilities. Such metadescriptions can buffer technologic changes. PMID:11141510

  8. Exploring and Analyzing Climate Variations Online by Using NASA MERRA-2 Data at GES DISC

    NASA Technical Reports Server (NTRS)

    Shen, Suhung; Ostrenga, Dana M.; Vollmer, Bruce E.; Kempler, Steven J.

    2016-01-01

    NASA Giovanni (Goddard Interactive Online Visualization ANd aNalysis Infrastructure) (http://giovanni.sci.gsfc.nasa.gov/giovanni) is a web-based data visualization and analysis system developed by the Goddard Earth Sciences Data and Information Services Center (GES DISC). Current data analysis functions include Lat-Lon map, time series, scatter plot, correlation map, difference, cross-section, vertical profile, and animation. The system enables basic statistical analysis and comparisons of multiple variables. This web-based tool facilitates data discovery, exploration and analysis of large amounts of global and regional remote sensing and model data sets from a number of NASA data centers. Long-term global assimilated atmospheric, land, and ocean data have been integrated into the system, which enables quick exploration and analysis of climate data without downloading, preprocessing, or learning data formats. Example data include climate reanalysis data from the NASA Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), which provides data from 1980 to the present; land data from the NASA Global Land Data Assimilation System (GLDAS), which assimilates data from 1948 to 2012; as well as ocean biological data from the NASA Ocean Biogeochemical Model (NOBM), which provides data from 1998 to 2012. This presentation, using surface air temperature, precipitation, ozone, and aerosol from MERRA-2, demonstrates climate variation analysis with Giovanni for selected regions.

  9. Applying Triple-Matrix Masking for Privacy Preserving Data Collection and Sharing in HIV Studies.

    PubMed

    Pei, Qinglin; Chen, Shigang; Xiao, Yao; Wu, Samuel S

    2016-01-01

    Many HIV research projects are plagued by a high missing rate of self-reported information during data collection. Also, due to the sensitive nature of HIV research data, privacy protection is always a concern for data sharing in HIV studies. This paper applies a data masking approach, called triple-matrix masking [1], to the context of HIV research for ensuring privacy protection during the process of data collection and data sharing. Using a set of generated HIV patient data, we show step by step how the data are randomly transformed (masked) before leaving the patients' individual data collection device (which ensures that nobody sees the actual data) and how the masked data are further transformed by a masking service provider and a data collector. We demonstrate that the masked data retain the statistical utility of the original data, yielding exactly the same inference results in the planned logistic regression on the effect of age on adherence to antiretroviral therapy and in the Cox proportional hazards model for the age effect on time to viral load suppression. Privacy-preserving data collection methods may help resolve the privacy protection issue in HIV research. The individual sensitive data can be completely hidden while the same inference results can still be obtained from the masked data, with the use of common statistical analysis methods.
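
    The following sketch illustrates the general principle behind matrix masking, not the triple-matrix masking algorithm of [1]: pre-multiplying the design matrix and the response by the same random orthogonal matrix hides individual records while leaving least-squares estimates unchanged.

    ```python
    # Sketch of the general principle behind matrix masking (not the triple-matrix masking
    # algorithm of [1]): applying the same random orthogonal matrix to the design matrix and
    # response hides individual records but preserves least-squares estimates exactly.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.1, size=n)

    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal masking matrix
    X_masked, y_masked = Q @ X, Q @ y              # what the data collector would see

    beta_plain = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_masked = np.linalg.lstsq(X_masked, y_masked, rcond=None)[0]
    print(np.allclose(beta_plain, beta_masked))    # True: inference is preserved
    ```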

  10. Challenges to the Standardization of Burn Data Collection: A Call for Common Data Elements for Burn Care.

    PubMed

    Schneider, Jeffrey C; Chen, Liang; Simko, Laura C; Warren, Katherine N; Nguyen, Brian Phu; Thorpe, Catherine R; Jeng, James C; Hickerson, William L; Kazis, Lewis E; Ryan, Colleen M

    2018-02-20

    The use of common data elements (CDEs) is growing in medical research; CDEs have demonstrated benefit in maximizing the impact of existing research infrastructure and funding. However, the field of burn care does not have a standard set of CDEs. The objective of this study is to examine the extent of common data collected in current burn databases. This study examines the data dictionaries of six U.S. burn databases to ascertain the extent of common data. This was assessed from a quantitative and qualitative perspective. Thirty-two demographic and clinical data elements were examined. The number of databases that collect each data element was calculated. The data values for each data element were compared across the six databases for common terminology. Finally, the data prompts of the data elements were examined for common language and structure. Five (16%) of the 32 data elements are collected by all six burn databases; additionally, five data elements (16%) are present in only one database. Furthermore, there are considerable variations in the data values and prompts used among the burn databases. Only one of the 32 data elements (age) contains the same data values across all databases. The burn databases examined show minimal evidence of common data. There is a need to develop CDEs and standardized coding to enhance the interoperability of burn databases.

  11. Great Basin geoscience data base

    USGS Publications Warehouse

    Raines, Gary L.; Sawatzky, Don L.; Connors, Katherine A.

    1996-01-01

    This CD-ROM serves as the archive for 73 digital GIS data sets for the Great Basin. The data sets cover Nevada, eastern California, southeastern Oregon, southern Idaho, and western Utah. Some of the data sets are incomplete for the total area. On the CD-ROM, the data are provided in three formats: a prototype Federal Data Exchange standard format, the ESRI PC ARCVIEW1 format for viewing the data, and the ESRI ARC/INFO export format. Extensive documentation is provided to describe the data, the sources, and data enhancements. The following data are provided. One group of coverages comes primarily from 1:2,000,000-scale National Atlas data and can be assembled for use as base maps; these include various forms of topographic information. In addition, public land system data sets are provided, along with geology from the 1:2,500,000-scale Geologic Map of the United States and 1:500,000-scale geologic maps of Nevada, Oregon, and Utah. Geochemical data from the National Uranium Resource Evaluation (NURE) program are provided for most of the Great Basin. Geophysical data are provided for most of the Great Basin, typically gridded data with a spacing of 1 km. The geophysical data sets include aeromagnetics, gravity, radiometric data, and several derivative products. The thematic data sets include geochronology, calderas, pluvial lakes, tectonic extension domains, distribution of pre-Cenozoic terranes, limonite anomalies, Landsat linear features, mineral sites, and Bureau of Land Management exploration and mining permits.

  12. Born semantic: linking data from sensors to users and balancing hardware limitations with data standards

    NASA Astrophysics Data System (ADS)

    Buck, Justin; Leadbetter, Adam

    2015-04-01

    New users of the growing volume of ocean data, for purposes such as 'big data' data products and operational data assimilation/ingestion, require data to be readily ingestible. This can be achieved via the application of World Wide Web Consortium (W3C) Linked Data and Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE) standards to data management. As part of several Horizon 2020 European projects (SenseOCEAN, ODIP, AtlantOS), the British Oceanographic Data Centre (BODC) is working on combining existing data centre architecture and SWE software, such as Sensor Observation Services, with a Linked Data front end. The standards to enable data delivery are proven and well documented [1, 2]. There are practical difficulties when SWE standards are applied to real-time data because of internal hardware bandwidth restrictions and a requirement to constrain data transmission costs. A pragmatic approach is proposed where sensor metadata and data output in OGC standards are implemented "shore-side", with sensors and instruments transmitting unique resolvable web linkages to persistent OGC SensorML records published at the BODC. References: 1. World Wide Web Consortium. (2013). Linked Data. Available: http://www.w3.org/standards/semanticweb/data. Last accessed 8th October 2014. 2. Open Geospatial Consortium. (2014). Sensor Web Enablement (SWE). Available: http://www.opengeospatial.org/ogc/markets-technologies/swe. Last accessed 8th October 2014.
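
    A toy sketch of the proposed "shore-side" pattern: instruments transmit only a compact persistent identifier with each observation, and a shore-side service resolves it to the full SensorML record; the identifiers and URLs below are hypothetical.

    ```python
    # Toy illustration of the "shore-side" pattern: the instrument transmits only a compact
    # persistent identifier with each observation; a shore-side service resolves that identifier
    # to the full SensorML description. Identifiers and URLs are hypothetical.
    SENSORML_REGISTRY = {
        "bodc:sensor:0001": "https://example.org/sensorml/0001.xml",  # hypothetical record URL
    }

    def resolve_sensor(observation: dict) -> str:
        """Return the SensorML record URL for the sensor that produced an observation."""
        return SENSORML_REGISTRY[observation["sensor_id"]]

    obs = {"sensor_id": "bodc:sensor:0001", "parameter": "sea_water_temperature", "value": 12.7}
    print(resolve_sensor(obs))
    ```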

  13. Shifting from Stewardship to Analytics of Massive Science Data

    NASA Astrophysics Data System (ADS)

    Crichton, D. J.; Doyle, R.; Law, E.; Hughes, S.; Huang, T.; Mahabal, A.

    2015-12-01

    Currently, the analysis of large data collections is executed through traditional computational and data analysis approaches, which require users to bring data to their desktops and perform local data analysis. Data collection, archiving and analysis from future remote sensing missions, be it from earth science satellites, planetary robotic missions, or massive radio observatories, may not scale as more capable instruments stress existing architectural approaches and systems due to more continuous data streams, data from multiple observational platforms, and measurements and models from different agencies. A new paradigm is needed in order to increase the productivity and effectiveness of scientific data analysis. This paradigm must recognize that architectural choices, data processing, management, analysis, etc. are interrelated, and must be carefully coordinated in any system that aims to allow efficient, interactive scientific exploration and discovery to exploit massive data collections. Future observational systems, including satellite and airborne experiments, and research in climate modeling will significantly increase the size of the data, requiring new methodological approaches towards data analytics in which users can more effectively interact with the data and apply automated mechanisms for data reduction and fusion across these massive data repositories. This presentation will discuss architecture, use cases, and approaches for developing a big data analytics strategy across multiple science disciplines.

  14. Enriching and improving the quality of linked data with GIS

    NASA Astrophysics Data System (ADS)

    Iwaniak, Adam; Kaczmarek, Iwona; Strzelecki, Marek; Lukowicz, Jaromar; Jankowski, Piotr

    2016-06-01

    Standardization of methods for data exchange in GIS has a long history predating the creation of the World Wide Web. The advent of the World Wide Web brought the emergence of new solutions for data exchange and sharing, including, more recently, standards proposed by the W3C for data exchange involving Semantic Web technologies and linked data. Despite the growing interest in integration, GIS and linked data are still two separate paradigms for describing and publishing spatial data on the Web. At the same time, both paradigms offer complementary ways of representing real-world phenomena and means of analysis using different processing functions. The complementarity of linked data and GIS can be leveraged to synergize both paradigms, resulting in richer data content and more powerful inferencing. The article presents an approach aimed at integrating linked data with GIS. The approach relies on the use of GIS tools for integration, verification and enrichment of linked data. The GIS tools are employed to enrich linked data by furnishing access to collections of data resources, defining relationships between data resources, and subsequently facilitating GIS data integration with linked data. The proposed approach is demonstrated with examples using data from DBpedia, OSM, and tools developed by the authors for standard GIS software.

  15. Validity of administrative data claim-based methods for identifying individuals with diabetes at a population level.

    PubMed

    Southern, Danielle A; Roberts, Barbara; Edwards, Alun; Dean, Stafford; Norton, Peter; Svenson, Lawrence W; Larsen, Erik; Sargious, Peter; Lau, David C W; Ghali, William A

    2010-01-01

    This study assessed the validity of a widely-accepted administrative data surveillance methodology for identifying individuals with diabetes relative to three laboratory data reference standard definitions for diabetes. We used a combination of linked regional data (hospital discharge abstracts and physician data) and laboratory data to test the validity of administrative data surveillance definitions for diabetes relative to a laboratory data reference standard. The administrative discharge data methodology includes two definitions for diabetes: a strict administrative data definition of one hospitalization code or two physician claims indicating diabetes; and a more liberal definition of one hospitalization code or a single physician claim. The laboratory data, meanwhile, produced three reference standard definitions based on glucose levels +/- HbA1c levels. Sensitivities ranged from 68.4% to 86.9% for the administrative data definitions tested relative to the three laboratory data reference standards. Sensitivities were higher for the more liberal administrative data definition. Positive predictive values (PPV), meanwhile, ranged from 53.0% to 88.3%, with the liberal administrative data definition producing lower PPVs. These findings demonstrate the trade-offs of sensitivity and PPV for selecting diabetes surveillance definitions. Centralized laboratory data may be of value to future surveillance initiatives that use combined data sources to optimize case detection.
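
    For illustration, sensitivity and positive predictive value of the kind reported above can be computed from a two-by-two comparison of the administrative definition against the laboratory reference standard; the counts below are made-up values, not the study's data.

    ```python
    # Sketch: sensitivity and positive predictive value for an administrative case definition
    # evaluated against a laboratory reference standard. Counts are made-up illustrative values.
    true_positives = 684    # flagged by the definition and diabetic by the lab standard
    false_negatives = 316   # missed by the definition
    false_positives = 150   # flagged but not diabetic by the lab standard

    sensitivity = true_positives / (true_positives + false_negatives)
    ppv = true_positives / (true_positives + false_positives)
    print(f"sensitivity = {sensitivity:.1%}, PPV = {ppv:.1%}")
    ```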

  16. Considerations for blending data from various sensors

    USGS Publications Warehouse

    Bauer, Brian P.; Barringer, Anthony R.

    1980-01-01

    A project is being proposed at the EROS Data Center to blend the information from sensors aboard various satellites. The problems of, and considerations for, blending data from several satellite-borne sensors are discussed. System descriptions of the sensors aboard the HCMM, TIROS-N, GOES-D, Landsat 3, Landsat D, Seasat, SPOT, Stereosat, and NOSS satellites, and the quantity, quality, image dimensions, and availability of these data are summarized to define the attributes of a multi-sensor satellite data base. Unique configurations of equipment, storage media, and specialized hardware to meet the data system requirements are described, as well as archival media and improved sensors that will be on-line within the next 5 years. Definitions and the rigor required for blending various sensor data are given. Problems of merging data from the same sensor (intrasensor comparison) and from different sensors (intersensor comparison), the characteristics and advantages of cross-calibration of data, and integration of data into a product matrix field are addressed. Data processing considerations as affected by formation, resolution, and problems of merging large data sets, and the organization of data bases for blending data are presented. Examples utilizing GOES and Landsat data are presented to demonstrate techniques of data blending, and recommendations for future implementation of a set of standard scenes and their characteristics necessary for optimal data blending are discussed.

  17. 48 CFR 15.403-5 - Instructions for submission of certified cost or pricing data and data other than certified cost...

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... submission of certified cost or pricing data and data other than certified cost or pricing data. 15.403-5... AND CONTRACT TYPES CONTRACTING BY NEGOTIATION Contract Pricing 15.403-5 Instructions for submission of certified cost or pricing data and data other than certified cost or pricing data. (a) Taking into...

  18. 48 CFR 15.403-5 - Instructions for submission of certified cost or pricing data and data other than certified cost...

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... submission of certified cost or pricing data and data other than certified cost or pricing data. 15.403-5... AND CONTRACT TYPES CONTRACTING BY NEGOTIATION Contract Pricing 15.403-5 Instructions for submission of certified cost or pricing data and data other than certified cost or pricing data. (a) Taking into...

  19. 48 CFR 15.403-5 - Instructions for submission of certified cost or pricing data and data other than certified cost...

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... submission of certified cost or pricing data and data other than certified cost or pricing data. 15.403-5... AND CONTRACT TYPES CONTRACTING BY NEGOTIATION Contract Pricing 15.403-5 Instructions for submission of certified cost or pricing data and data other than certified cost or pricing data. (a) Taking into...

  20. 48 CFR 15.403-5 - Instructions for submission of certified cost or pricing data and data other than certified cost...

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... submission of certified cost or pricing data and data other than certified cost or pricing data. 15.403-5... AND CONTRACT TYPES CONTRACTING BY NEGOTIATION Contract Pricing 15.403-5 Instructions for submission of certified cost or pricing data and data other than certified cost or pricing data. (a) Taking into...

  1. 48 CFR 15.403-5 - Instructions for submission of certified cost or pricing data and data other than certified cost...

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... submission of certified cost or pricing data and data other than certified cost or pricing data. 15.403-5... AND CONTRACT TYPES CONTRACTING BY NEGOTIATION Contract Pricing 15.403-5 Instructions for submission of certified cost or pricing data and data other than certified cost or pricing data. (a) Taking into...

  2. Keeping up with Big Data--Designing an Introductory Data Analytics Class

    ERIC Educational Resources Information Center

    Hijazi, Sam

    2016-01-01

    Universities need to keep up with the demand of the business world when it comes to Big Data. The exponential increase in data has put additional demands on academia to meet the big gap in education. Business demand for Big Data has surpassed 1.9 million positions in 2015. Big Data, Business Intelligence, Data Analytics, and Data Mining are the…

  3. Implementation and Optimization of the Advanced Encryption Standard Algorithm on all 8-Bit Field Programmable Gate Array Hardware Platform

    DTIC Science & Technology

    2007-03-01

    01101101 " => data...3401101010" => inv_data := Xං" when �" => inv_data := X്" when �" => inv_data := X"b8" when " 01101101 " => inv_data := X"b3...3400110011" when " 01101101 " => data := �" when �" => data := �" when �" => data := �"

  4. TREC Dynamic Domain: Polar Science

    DTIC Science & Technology

    2015-11-20

    ...Science Foundation Advanced Cooperative Arctic Data and Information System (ACADIS), the ...Master Directory (AMD), the National Snow and Ice Data Center (NSIDC) Arctic Data Explorer (ADE), and the National Aeronautics and Space... These data sets represent a...

  5. Image display device in digital TV

    DOEpatents

    Choi, Seung Jong [Seoul, KR

    2006-07-18

    Disclosed is an image display device in a digital TV that is capable of carrying out conversion to various kinds of resolution by using single bit map data in the digital TV. The image display device includes: a data processing part for executing bit map conversion, compression, restoration and format-conversion for text data; a memory for storing the bit map data obtained according to the bit map conversion and compression in the data processing part and image data inputted from an arbitrary receiving part, the receiving part receiving one of digital image data and analog image data; an image outputting part for reading the image data from the memory; and a display processing part for mixing the image data read from the image outputting part and the bit map data converted in format by the data processing part. Therefore, the image display device according to the present invention can convert text data in such a manner as to correspond with various resolutions, carry out compression of the bit map data, thereby reducing the required memory space, and support text data of an HTML format, thereby providing the image with text data of various shapes.

  6. Data management in large-scale collaborative toxicity studies: how to file experimental data for automated statistical analysis.

    PubMed

    Stanzel, Sven; Weimer, Marc; Kopp-Schneider, Annette

    2013-06-01

    High-throughput screening approaches are carried out for the toxicity assessment of a large number of chemical compounds. In such large-scale in vitro toxicity studies, several hundred or several thousand concentration-response experiments are conducted. The automated evaluation of concentration-response data using statistical analysis scripts saves time and yields more consistent results than data analysis performed with menu-driven statistical software. Automated statistical analysis requires that concentration-response data are available in a standardised data format across all compounds. To obtain consistent data formats, a standardised data management workflow must be established, including guidelines for data storage, data handling and data extraction. In this paper two procedures for data management within large-scale toxicological projects are proposed. Both procedures are based on Microsoft Excel files as the researcher's primary data format and use a computer programme to automate the handling of data files. The first procedure assumes that data collection has not yet started, whereas the second procedure can be used when data files already exist. Successful implementation of the two approaches in the European project ACuteTox is illustrated. Copyright © 2012 Elsevier Ltd. All rights reserved.
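
    A minimal sketch, assuming per-compound Excel files with particular column names, of how such files might be collected into one standardised long-format table for automated analysis; the file layout and column names are assumptions, not the ACuteTox conventions.

    ```python
    # Sketch: collect concentration-response results from per-compound Excel files into one
    # standardised long-format table ready for automated analysis. File and column names are
    # illustrative assumptions.
    from pathlib import Path
    import pandas as pd

    frames = []
    for xlsx in Path("raw_data").glob("*.xlsx"):          # e.g. raw_data/compound_A.xlsx
        df = pd.read_excel(xlsx, sheet_name=0)
        df = df.rename(columns={"Conc (uM)": "concentration_um", "Viability %": "response_pct"})
        df["compound"] = xlsx.stem
        frames.append(df[["compound", "concentration_um", "response_pct"]])

    standardised = pd.concat(frames, ignore_index=True)
    standardised.to_csv("concentration_response_all.csv", index=False)
    ```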

  7. Scaling an expert system data mart: more facilities in real-time.

    PubMed

    McNamee, L A; Launsby, B D; Frisse, M E; Lehmann, R; Ebker, K

    1998-01-01

    Clinical Data Repositories are being rapidly adopted by large healthcare organizations as a method of centralizing and unifying clinical data currently stored in diverse and isolated information systems. Once stored in a clinical data repository, healthcare organizations seek to use this centralized data to store, analyze, interpret, and influence clinical care, quality and outcomes. A recent trend in the repository field has been the adoption of data marts--specialized subsets of enterprise-wide data taken from a larger repository designed specifically to answer highly focused questions. A data mart exploits the data stored in the repository, but can use unique structures or summary statistics generated specifically for an area of study. Thus, data marts benefit from the existence of a repository, are less general than a repository, but provide more effective and efficient support for an enterprise-wide data analysis task. In previous work, we described the use of batch processing for populating data marts directly from legacy systems. In this paper, we describe an architecture that uses both primary data sources and an evolving enterprise-wide clinical data repository to create real-time data sources for a clinical data mart to support highly specialized clinical expert systems.

  8. Toward a Literature-Driven Definition of Big Data in Healthcare

    PubMed Central

    Baro, Emilie; Degoul, Samuel; Beuscart, Régis; Chazard, Emmanuel

    2015-01-01

    Objective. The aim of this study was to provide a definition of big data in healthcare. Methods. A systematic search of PubMed literature published until May 9, 2014, was conducted. We noted the number of statistical individuals (n) and the number of variables (p) for all papers describing a dataset. These papers were classified into fields of study. Characteristics attributed to big data by authors were also considered. Based on this analysis, a definition of big data was proposed. Results. A total of 196 papers were included. Big data can be defined as datasets with Log(n*p) ≥ 7. Properties of big data are its great variety and high velocity. Big data raises challenges on veracity, on all aspects of the workflow, on extracting meaningful information, and on sharing information. Big data requires new computational methods that optimize data management. Related concepts are data reuse, false knowledge discovery, and privacy issues. Conclusion. Big data is defined by volume. Big data should not be confused with data reuse: data can be big without being reused for another purpose, for example, in omics. Inversely, data can be reused without being necessarily big, for example, secondary use of Electronic Medical Records (EMR) data. PMID:26137488
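
    A small worked example of the proposed volume criterion, assuming the logarithm is taken to base 10; the dataset sizes are hypothetical.

    ```python
    # Sketch: apply the proposed volume criterion, assuming a base-10 logarithm, to a
    # hypothetical dataset of n individuals and p variables per individual.
    import math

    def is_big_data(n_individuals: int, p_variables: int, threshold: float = 7.0) -> bool:
        """True when log10(n * p) meets the proposed big-data threshold."""
        return math.log10(n_individuals * p_variables) >= threshold

    print(is_big_data(20_000, 30))      # log10(6e5) ~ 5.8 -> False
    print(is_big_data(500_000, 200))    # log10(1e8) = 8.0 -> True
    ```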

  9. Toward a Literature-Driven Definition of Big Data in Healthcare.

    PubMed

    Baro, Emilie; Degoul, Samuel; Beuscart, Régis; Chazard, Emmanuel

    2015-01-01

    The aim of this study was to provide a definition of big data in healthcare. A systematic search of PubMed literature published until May 9, 2014, was conducted. We noted the number of statistical individuals (n) and the number of variables (p) for all papers describing a dataset. These papers were classified into fields of study. Characteristics attributed to big data by authors were also considered. Based on this analysis, a definition of big data was proposed. A total of 196 papers were included. Big data can be defined as datasets with Log(n∗p) ≥ 7. Properties of big data are its great variety and high velocity. Big data raises challenges on veracity, on all aspects of the workflow, on extracting meaningful information, and on sharing information. Big data requires new computational methods that optimize data management. Related concepts are data reuse, false knowledge discovery, and privacy issues. Big data is defined by volume. Big data should not be confused with data reuse: data can be big without being reused for another purpose, for example, in omics. Inversely, data can be reused without being necessarily big, for example, secondary use of Electronic Medical Records (EMR) data.

  10. Credit where credit is due: indexing and exposing data citations in international data repository networks

    NASA Astrophysics Data System (ADS)

    Jones, M. B.; Vieglais, D.; Cruse, P.; Chodacki, J.; Budden, A. E.; Fenner, M.; Lowenberg, D.; Abrams, S.

    2017-12-01

    Research data are fundamental to the success of the academic enterprise, and yet the practice of citing data in academic and applied works is not widespread among researchers. Researchers need credit for their contributions, and yet current citation infrastructure focuses primarily on citations to research literature. Some citation indexing systems even systematically exclude citations to data from their corpus. The Making Data Count (MDC) project will enable measuring the impact of research data much as is currently being done with publications, the primary vehicle for scholarly credit and accountability. The MDC team (including the California Digital Library, COUNTER, DataCite, and DataONE) are working together to publish a new COUNTER recommendation on data usage statistics; launch a DataCite-hosted MDC service for aggregated DLM based on the open-source Lagotto platform; and build tools for data repository and discovery services to easily integrate with the new MDC service. In providing such data-level metrics (DLM), the MDC project augments existing measures of scholarly success and so offers an important incentive promoting open data principles and quality research data through the adoption of research data management best practices.

  11. Protecting personal data in epidemiological research: DataSHIELD and UK law.

    PubMed

    Wallace, Susan E; Gaye, Amadou; Shoush, Osama; Burton, Paul R

    2014-01-01

    Data from individual collections, such as biobanks and cohort studies, are now being shared in order to create combined datasets which can be queried to ask complex scientific questions. But this sharing must be done with due regard for data protection principles. DataSHIELD is a new technology that queries nonaggregated, individual-level data in situ but returns query results in an anonymous format. This raises questions about the ability of DataSHIELD to adequately protect participant confidentiality. An ethico-legal analysis was conducted that examined each step of the DataSHIELD process from the perspective of UK case law, regulations, and guidance. DataSHIELD reaches agreed UK standards of protection for the sharing of biomedical data. All direct processing of personal data is conducted within the protected environment of the contributing study; participating studies have scientific, ethics, and data access approvals in place prior to the analysis; studies are clear that their consents conform with this use of data, and participants are informed that anonymisation for further disclosure will take place. DataSHIELD can provide a flexible means of interrogating data while protecting the participants' confidentiality in accordance with applicable legislation and guidance. © 2014 S. Karger AG, Basel.
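
    A conceptual sketch of the non-disclosive analysis idea (not DataSHIELD's actual R interface): the data-holding study computes only aggregate results in situ and refuses requests based on too few individuals; the disclosure threshold and data are illustrative assumptions.

    ```python
    # Conceptual sketch (not the DataSHIELD R API): the data-holding study computes only
    # aggregate statistics in situ and refuses to release results based on too few individuals.
    # The disclosure threshold and values are illustrative assumptions.
    MIN_CELL_COUNT = 5  # hypothetical disclosure-control threshold

    def non_disclosive_mean(values: list[float]) -> dict:
        """Return an aggregate result, or a refusal if the group is too small to be safe."""
        if len(values) < MIN_CELL_COUNT:
            return {"status": "refused", "reason": "group smaller than disclosure threshold"}
        return {"status": "ok", "n": len(values), "mean": sum(values) / len(values)}

    print(non_disclosive_mean([5.1, 6.3, 5.8]))                   # refused
    print(non_disclosive_mean([5.1, 6.3, 5.8, 6.0, 5.5, 6.1]))    # aggregate only
    ```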

  12. Knowledge sharing in the health scenario

    PubMed Central

    2014-01-01

    The understanding of certain data often requires the collection of similar data from different places to be analysed and interpreted. Interoperability standards and ontologies are facilitating data interchange around the world. However, beyond the existing networks and advances for data transfer, data sharing protocols that support multilateral agreements are useful to exploit the knowledge of distributed Data Warehouses. Access to a certain data set in a federated Data Warehouse may be constrained by the requirement to deliver another specific data set. When bilateral agreements between two nodes of a network are not enough to solve the constraints for accessing a certain data set, multilateral agreements for data exchange are needed. We present the implementation of a Multi-Agent System for multilateral exchange agreements of clinical data, and evaluate how those multilateral agreements increase the percentage of data collected by a single node from the total amount of data available in the network. Different strategies to reduce the number of messages needed to achieve an agreement are also considered. The results show that with this collaborative sharing scenario the percentage of data collected improves dramatically from bilateral agreements to multilateral ones, up to the point of reaching almost all data available in the network. PMID:25471452

  13. What’s So Different about Big Data?. A Primer for Clinicians Trained to Think Epidemiologically

    PubMed Central

    Liu, Vincent

    2014-01-01

    The Big Data movement in computer science has brought dramatic changes in what counts as data, how those data are analyzed, and what can be done with those data. Although increasingly pervasive in the business world, it has only recently begun to influence clinical research and practice. As Big Data draws from different intellectual traditions than clinical epidemiology, the ideas may be less familiar to practicing clinicians. There is an increasing role of Big Data in health care, and it has tremendous potential. This Demystifying Data Seminar identifies four main strands in Big Data relevant to health care. The first is the inclusion of many new kinds of data elements into clinical research and operations, in a volume not previously routinely used. Second, Big Data asks different kinds of questions of data and emphasizes the usefulness of analyses that are explicitly associational but not causal. Third, Big Data brings new analytic approaches to bear on these questions. And fourth, Big Data embodies a new set of aspirations for a breaking down of distinctions between research data and operational data and their merging into a continuously learning health system. PMID:25102315

  14. What's so different about big data?. A primer for clinicians trained to think epidemiologically.

    PubMed

    Iwashyna, Theodore J; Liu, Vincent

    2014-09-01

    The Big Data movement in computer science has brought dramatic changes in what counts as data, how those data are analyzed, and what can be done with those data. Although increasingly pervasive in the business world, it has only recently begun to influence clinical research and practice. As Big Data draws from different intellectual traditions than clinical epidemiology, the ideas may be less familiar to practicing clinicians. There is an increasing role of Big Data in health care, and it has tremendous potential. This Demystifying Data Seminar identifies four main strands in Big Data relevant to health care. The first is the inclusion of many new kinds of data elements into clinical research and operations, in a volume not previously routinely used. Second, Big Data asks different kinds of questions of data and emphasizes the usefulness of analyses that are explicitly associational but not causal. Third, Big Data brings new analytic approaches to bear on these questions. And fourth, Big Data embodies a new set of aspirations for a breaking down of distinctions between research data and operational data and their merging into a continuously learning health system.

  15. Digital data sets for map products produced as part of the Black Hills Hydrology Study, western South Dakota

    USGS Publications Warehouse

    Williamson, Joyce E.; Jarrell, Gregory J.; Clawges, Rick M.; Galloway, Joel M.; Carter, Janet M.

    2000-01-01

    This compact disk contains digital data produced as part of the 1:100,000-scale map products for the Black Hills Hydrology Study conducted in western South Dakota. The digital data include 28 individual Geographic Information System (GIS) data sets: data sets for the hydrogeologic unit map including all mapped hydrogeologic units within the study area (1 data set) and major geologic structure including anticlines and synclines (1 data set); data sets for potentiometric maps including the potentiometric contours for the Inyan Kara, Minnekahta, Minnelusa, Madison, and Deadwood aquifers (5 data sets), wells used as control points for each aquifer (5 data sets), and springs used as control points for the potentiometric contours (1 data set); and data sets for the structure-contour maps including the structure contours for the top of each formation that contains major aquifers (5 data sets), wells and tests holes used as control points for each formation (5 data sets), and surficial deposits (alluvium and terrace deposits) that directly overlie each of the major aquifer outcrops (5 data sets). These data sets were used to produce the maps published by the U.S. Geological Survey.

  16. SeaDataNet: Pan-European infrastructure for ocean and marine data management

    NASA Astrophysics Data System (ADS)

    Fichaut, M.; Schaap, D.; Maudire, G.; Manzella, G. M. R.

    2012-04-01

    The overall objective of the SeaDataNet project is to upgrade the present SeaDataNet infrastructure into an operationally robust and state-of-the-art pan-European infrastructure for providing up-to-date and high-quality access to ocean and marine metadata, data and data products originating from data acquisition activities by all engaged coastal states, by setting, adopting and promoting common data management standards and by realising technical and semantic interoperability with other relevant data management systems and initiatives on behalf of science, environmental management, policy making, and economy. SeaDataNet is undertaken by the National Oceanographic Data Centres (NODCs) and marine information services of major research institutes from 31 coastal states bordering the European seas, and also includes Satellite Data Centres, expert modelling centres and the international organisations IOC, ICES and EU-JRC in its network. Its 40 data centres are highly skilled, have been actively engaged in data management for many years, and have the essential capabilities and facilities for data quality control, long-term stewardship, retrieval and distribution. SeaDataNet undertakes activities to achieve data access and data products services that meet the requirements of end-users and intermediate user communities, such as GMES Marine Core Services (e.g. MyOcean), establishing SeaDataNet as the core data management component of the EMODNet infrastructure and contributing on behalf of Europe to global portal initiatives, such as the IOC/IODE Ocean Data Portal (ODP) and GEOSS. Moreover it aims to achieve INSPIRE compliance and to contribute to the INSPIRE process for developing implementing rules for oceanography.
    • As part of the SeaDataNet upgrading and capacity building, training courses will be organised aimed at data managers and technicians at the data centres. For the data managers it is important that they learn to work with the upgraded common SeaDataNet formats, procedures and software tools for preparing and updating metadata, processing and quality control of data, presentation of data in viewing services, and production of data products.
    • SeaDataNet maintains and operates several discovery services with overviews of marine organisations in Europe and their engagement in marine research projects, managing large datasets, and data acquisition by research vessels and monitoring programmes for the European seas and global oceans:
    o European Directory of Marine Environmental Data (EDMED) (at present > 4300 entries from more than 600 data holding centres in Europe) is a comprehensive reference to the marine data and sample collections held within Europe, providing marine scientists, engineers and policy makers with a simple discovery mechanism. It covers all marine environmental disciplines and needs regular maintenance.
    o European Directory of Marine Environmental Research Projects (EDMERP) (at present > 2200 entries from more than 300 organisations in Europe) gives an overview of research projects relating to the marine environment that are relevant in the context of the data sets and data acquisition activities (cruises, in situ monitoring networks, ..) covered in SeaDataNet. This needs regular updating, following activities by data-holding institutes in preparing metadata references for EDMED, EDIOS, CSR and CDI.
    o Cruise Summary Reports (CSR) directory (at present > 43000 entries) provides a coarse-grained inventory for tracking oceanographic data collected by research vessels.
    o European Directory of Oceanographic Observing Systems (EDIOS) (at present > 10000 entries) is an initiative of EuroGOOS and gives an overview of the ocean measuring and monitoring systems operated by European countries.
    • European Directory of Marine Organisations (EDMO) (at present > 2000 entries) contains the contact information and activity profiles for the organisations whose data and activities are described by the discovery services.
    • Common Vocabularies (at present > 120000 terms in > 100 lists), covering a broad spectrum of ocean and marine disciplines. The common terms are used to mark up metadata, data and data products in a consistent and coherent way. Governance is regulated by an international board.
    • Common Data Index (CDI) data discovery and access service: SeaDataNet provides online unified access via its portal website to the vast resources of marine and ocean datasets managed by the connected distributed data centres. The Common Data Index (CDI) service is the key discovery and delivery service. It enables users to gain a detailed insight into the availability and geographical distribution of marine data archived at the connected data centres, and it provides the means for downloading datasets in common formats via a transaction mechanism.

  17. Rethinking big data: A review on the data quality and usage issues

    NASA Astrophysics Data System (ADS)

    Liu, Jianzheng; Li, Jie; Li, Weifeng; Wu, Jiansheng

    2016-05-01

    The recent explosive publications of big data studies have well documented the rise of big data and its ongoing prevalence. Different types of 'big data' have emerged and have greatly enriched spatial information sciences and related fields in terms of breadth and granularity. Studies that were difficult to conduct in the past due to limited data availability can now be carried out. However, big data brings many 'big errors' in data quality and data usage, and it cannot be used as a substitute for sound research design and solid theories. We indicate and summarize the problems faced by current big data studies with regard to data collection, processing and analysis: inauthentic data collection, information incompleteness and noise of big data, unrepresentativeness, consistency and reliability, and ethical issues. Cases of empirical studies are provided as evidence for each problem. We propose that big data research should closely follow good scientific practice to provide reliable and scientific 'stories', as well as explore and develop techniques and methods to mitigate or rectify those 'big errors' brought by big data.

  18. Big Data and Data Science in Critical Care.

    PubMed

    Sanchez-Pinto, L Nelson; Luo, Yuan; Churpek, Matthew M

    2018-05-09

    The digitalization of the health-care system has resulted in a deluge of clinical Big Data and has prompted the rapid growth of data science in medicine. Data science, which is the field of study dedicated to the principled extraction of knowledge from complex data, is particularly relevant in the critical care setting. The availability of large amounts of data in the ICU, the need for better evidence-based care, and the complexity of critical illness make the use of data science techniques and data-driven research particularly appealing to intensivists. Despite the increasing number of studies and publications in the field, thus far there have been few examples of data science projects that have resulted in successful implementations of data-driven systems in the ICU. However, given the expected growth in the field, intensivists should be familiar with the opportunities and challenges of Big Data and data science. The present article reviews the definitions, types of algorithms, applications, challenges, and future of Big Data and data science in critical care. Copyright © 2018 American College of Chest Physicians. Published by Elsevier Inc. All rights reserved.

  19. Integrated data lookup and replication scheme in mobile ad hoc networks

    NASA Astrophysics Data System (ADS)

    Chen, Kai; Nahrstedt, Klara

    2001-11-01

    Accessing remote data is a challenging task in mobile ad hoc networks. Two problems have to be solved: (1) how to learn about available data in the network; and (2) how to access desired data even when the original copy of the data is unreachable. In this paper, we develop an integrated data lookup and replication scheme to solve these problems. In our scheme, a group of mobile nodes collectively host a set of data to improve data accessibility for all members of the group. They exchange data availability information by broadcasting advertising (ad) messages to the group using an adaptive sending rate policy. The ad messages are used by other nodes to derive a local data lookup table, and to reduce data redundancy within a connected group. Our data replication scheme predicts group partitioning based on each node's current location and movement patterns, and replicates data to other partitions before partitioning occurs. Our simulations show that data availability information can quickly propagate throughout the network, and that the successful data access ratio of each node is significantly improved.
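
    The advertising-and-lookup mechanism summarized above lends itself to a compact illustration. The sketch below is a hypothetical, simplified model, not the authors' protocol: the Advertisement fields, the freshness-based merge rule, and the "most recently advertised host" selection are our assumptions for illustration only.

      # Hypothetical sketch of an ad-message/local-lookup-table mechanism for a
      # MANET data group; field names and the selection rule are illustrative
      # assumptions, not the paper's protocol.
      from dataclasses import dataclass, field
      from typing import Dict, Optional, Set

      @dataclass
      class Advertisement:
          node_id: str        # node broadcasting the ad
          data_ids: Set[str]  # data items it currently hosts
          timestamp: float    # freshness of the advertisement

      @dataclass
      class LookupTable:
          # data_id -> {node_id: last_seen_timestamp}
          entries: Dict[str, Dict[str, float]] = field(default_factory=dict)

          def merge(self, ad: Advertisement) -> None:
              """Fold a received advertisement into the local lookup table."""
              for data_id in ad.data_ids:
                  hosts = self.entries.setdefault(data_id, {})
                  hosts[ad.node_id] = max(hosts.get(ad.node_id, 0.0), ad.timestamp)

          def lookup(self, data_id: str) -> Optional[str]:
              """Return the most recently advertised host for a data item, if any."""
              hosts = self.entries.get(data_id)
              if not hosts:
                  return None
              return max(hosts, key=hosts.get)

    In this toy form, merging ads keeps only the freshest sighting per host, which is one simple way a node could derive a local view of data availability from broadcast advertisements.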

  20. Globally distributed software defined storage (proposal)

    NASA Astrophysics Data System (ADS)

    Shevel, A.; Khoruzhnikov, S.; Grudinin, V.; Sadov, O.; Kairkanov, A.

    2017-10-01

    The volume of incoming data in HEP is growing, and so is the volume of data to be held for a long time. Large volumes of data - big data - are distributed around the planet. Methods and approaches for organizing and managing globally distributed data storage are therefore required. Several distributed storage systems exist for personal needs, such as own-cloud.org, pydio.com, seafile.com and sparkleshare.org. At the enterprise level there are a number of systems, such as SWIFT (the distributed storage component of OpenStack), CEPH and the like, which are mostly object stores. When the resources of several data centers are integrated, the organization of data links becomes a very important issue, especially if several parallel data links between data centers are used. The situation in the data centers and in the data links may vary from hour to hour. This means that each part of the distributed data storage has to be able to rearrange its usage of data links and storage servers in each data center. In addition, different customers of the distributed storage may have different requirements. These topics are planned to be discussed in the data storage proposal.

  1. The NASA master directory: Quick reference guide

    NASA Technical Reports Server (NTRS)

    Satin, Karen (Editor); Kanga, Carol (Editor)

    1989-01-01

    This is a quick reference guide to the NASA Master Directory (MD), which is a free, online, multidisciplinary directory of space and Earth science data sets (NASA and non-NASA data) that are of potential interest to the NASA-sponsored research community. The MD contains high-level descriptions of data sets, other data systems and archives, and campaigns and projects. It provides mechanisms for searching for data sets by important criteria such as geophysical parameters, time, and spatial coverage, and provides information on ordering the data. It also provides automatic connections to a number of data systems such as the NASA Climate Data System, the Planetary Data System, the NASA Ocean Data System, the Pilot Land Data System, and others. The MD includes general information about many data systems, data centers, and coordinated data analysis projects. It represents the first major step in the Catalog Interoperability project, whose objective is to enable researchers to quickly and efficiently identify, obtain information about, and get access to space and Earth science data. The guide describes how to access, use, and exit the MD and lists its features.

  2. SEURAT: Visual analytics for the integrated analysis of microarray data

    PubMed Central

    2010-01-01

    Background In translational cancer research, gene expression data is collected together with clinical data and genomic data arising from other chip-based high-throughput technologies. Software tools for the joint analysis of such high-dimensional data sets together with clinical data are required. Results We have developed an open source software tool which provides interactive visualization capability for the integrated analysis of high-dimensional gene expression data together with associated clinical data, array CGH data and SNP array data. The different data types are organized by a comprehensive data manager. Interactive tools are provided for all graphics: heatmaps, dendrograms, barcharts, histograms, eventcharts and a chromosome browser, which displays genetic variations along the genome. All graphics are dynamic and fully linked so that any object selected in a graphic will be highlighted in all other graphics. For exploratory data analysis the software provides unsupervised data analytics like clustering, seriation algorithms and biclustering algorithms. Conclusions The SEURAT software meets the growing needs of researchers to perform joint analysis of gene expression, genomic and clinical data. PMID:20525257

  3. A methodology for collecting valid software engineering data

    NASA Technical Reports Server (NTRS)

    Basili, Victor R.; Weiss, David M.

    1983-01-01

    An effective data collection method for evaluating software development methodologies and for studying the software development process is described. The method uses goal-directed data collection to evaluate methodologies with respect to the claims made for them. Such claims are used as a basis for defining the goals of the data collection, establishing a list of questions of interest to be answered by data analysis, defining a set of data categorization schemes, and designing a data collection form. The data to be collected are based on the changes made to the software during development, and are obtained when the changes are made. To insure accuracy of the data, validation is performed concurrently with software development and data collection. Validation is based on interviews with those people supplying the data. Results from using the methodology show that data validation is a necessary part of change data collection. Without it, as much as 50% of the data may be erroneous. Feasibility of the data collection methodology was demonstrated by applying it to five different projects in two different environments. The application showed that the methodology was both feasible and useful.

  4. AmeriFlux Data System Looking Forward

    NASA Astrophysics Data System (ADS)

    Agarwal, D.; Cheah, Y. W.; Chu, H.; Keenan, T. F.; Pastorello, G.; Torn, M. S.; Humphrey, M.; Beekwilder, N.

    2016-12-01

    The AmeriFlux data system is turning 20 along with the network. This anniversary provides an opportunity to reimagine the data system and services to support the next 20 years of AmeriFlux. The next generation goals of the data system will be continuous data quality and consistency improvements, and also increased services to both tower owners and data users. An example change is new data citation capabilities and support leveraging Digital Object Identifiers (DOI) to enable easier citation and improved tracking of data usage. A new processing pipeline will enable more regular production of gap-filled and partitioned datasets. Automation will be continuously increased to help improve the time from data submission to user data availability. New standard data submission and processing formats will enable automation of the data submission and improved QA/QC processes. This new data system is made possible by the extensive work that the AmeriFlux data team at CDIAC has done over the last 20 years in close collaboration with the tower teams. This presentation will provide both a historical perspective and a vision for the future of the AmeriFlux data system.

  5. Classification and data acquisition with incomplete data

    NASA Astrophysics Data System (ADS)

    Williams, David P.

    In remote-sensing applications, incomplete data can result when only a subset of sensors (e.g., radar, infrared, acoustic) are deployed at certain regions. The limitations of single sensor systems have spurred interest in employing multiple sensor modalities simultaneously. For example, in land mine detection tasks, different sensor modalities are better-suited to capture different aspects of the underlying physics of the mines. Synthetic aperture radar sensors may be better at detecting surface mines, while infrared sensors may be better at detecting buried mines. By employing multiple sensor modalities to address the detection task, the strengths of the disparate sensors can be exploited in a synergistic manner to improve performance beyond that which would be achievable with either single sensor alone. When multi-sensor approaches are employed, however, incomplete data can be manifested. If each sensor is located on a separate platform ( e.g., aircraft), each sensor may interrogate---and hence collect data over---only partially overlapping areas of land. As a result, some data points may be characterized by data (i.e., features) from only a subset of the possible sensors employed in the task. Equivalently, this scenario implies that some data points will be missing features. Increasing focus in the future on using---and fusing data from---multiple sensors will make such incomplete-data problems commonplace. In many applications involving incomplete data, it is possible to acquire the missing data at a cost. In multi-sensor remote-sensing applications, data is acquired by deploying sensors to data points. Acquiring data is usually an expensive, time-consuming task, a fact that necessitates an intelligent data acquisition process. Incomplete data is not limited to remote-sensing applications, but rather, can arise in virtually any data set. In this dissertation, we address the general problem of classification when faced with incomplete data. We also address the closely related problem of active data acquisition, which develops a strategy to acquire missing features and labels that will most benefit the classification task. We first address the general problem of classification with incomplete data, maintaining the view that all data (i.e., information) is valuable. We employ a logistic regression framework within which we formulate a supervised classification algorithm for incomplete data. This principled, yet flexible, framework permits several interesting extensions that allow all available data to be utilized. One extension incorporates labeling error, which permits the usage of potentially imperfectly labeled data in learning a classifier. A second major extension converts the proposed algorithm to a semi-supervised approach by utilizing unlabeled data via graph-based regularization. Finally, the classification algorithm is extended to the case in which (image) data---from which features are extracted---are available from multiple resolutions. Taken together, this family of incomplete-data classification algorithms exploits all available data in a principled manner by avoiding explicit imputation. Instead, missing data is integrated out analytically with the aid of an estimated conditional density function (conditioned on the observed features). This feat is accomplished by invoking only mild assumptions. We also address the problem of active data acquisition by determining which missing data should be acquired to most improve performance. 
Specifically, we examine this data acquisition task when the data to be acquired can be either labels or features. The proposed approach is based on a criterion that accounts for the expected benefit of the acquisition. This approach, which is applicable for any general missing data problem, exploits the incomplete-data classification framework introduced in the first part of this dissertation. This data acquisition approach allows for the acquisition of both labels and features. Moreover, several types of feature acquisition are permitted, including the acquisition of individual or multiple features for individual or multiple data points, which may be either labeled or unlabeled. Furthermore, if different types of data acquisition are feasible for a given application, the algorithm will automatically determine the most beneficial type of data to acquire. Experimental results on both benchmark machine learning data sets and real (i.e., measured) remote-sensing data demonstrate the advantages of the proposed incomplete-data classification and active data acquisition algorithms.
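
    The "integrating out" of missing features described above can be written compactly. The following display uses our own generic notation for an incomplete-data classifier, not the dissertation's exact formulation: x_o denotes observed features, x_m missing features, and y the class label.

      \[
        p\left(y \mid \mathbf{x}_o\right)
          = \int p\left(y \mid \mathbf{x}_o, \mathbf{x}_m\right)\,
                 p\left(\mathbf{x}_m \mid \mathbf{x}_o\right)\, d\mathbf{x}_m
      \]

    In words: rather than imputing the missing features explicitly, the classifier is averaged over an estimated conditional density of the missing features given the observed ones, so only observed information enters the final prediction.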

  6. The Modern Research Data Portal: a design pattern for networked, data-intensive science

    DOE PAGES

    Chard, Kyle; Dart, Eli; Foster, Ian; ...

    2018-01-15

    We describe best practices for providing convenient, high-speed, secure access to large data via research data portals. Here, we capture these best practices in a new design pattern, the Modern Research Data Portal, that disaggregates the traditional monolithic web-based data portal to achieve orders-of-magnitude increases in data transfer performance, support new deployment architectures that decouple control logic from data storage, and reduce development and operations costs. We introduce the design pattern; explain how it leverages high-performance data enclaves and cloud-based data management services; review representative examples at research laboratories and universities, including both experimental facilities and supercomputer sites; describe how to leverage Python APIs for authentication, authorization, data transfer, and data sharing; and use coding examples to demonstrate how these APIs can be used to implement a range of research data portal capabilities. Sample code at a companion web site, https://docs.globus.org/mrdp, provides application skeletons that readers can adapt to realize their own research data portals.
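
    The Python APIs referred to here are those of the Globus service, and the companion site hosts the authoritative skeletons. As a rough orientation only, a minimal transfer request with the Globus Python SDK (globus_sdk) might look like the sketch below; the client ID, endpoint UUIDs, and paths are placeholders, and the interactive token handling is deliberately simplified.

      # Rough orientation sketch using the Globus Python SDK (globus_sdk); the
      # client ID, endpoint UUIDs and paths are placeholders, and the official
      # MRDP skeletons at https://docs.globus.org/mrdp remain the reference.
      import globus_sdk

      CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"        # placeholder
      SRC_ENDPOINT = "source-endpoint-uuid"          # placeholder
      DST_ENDPOINT = "destination-endpoint-uuid"     # placeholder

      # Authenticate via the native-app OAuth2 flow (user pastes an auth code).
      auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
      auth_client.oauth2_start_flow()
      print("Log in at:", auth_client.oauth2_get_authorize_url())
      auth_code = input("Paste authorization code: ").strip()
      tokens = auth_client.oauth2_exchange_code_for_tokens(auth_code)
      transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

      # Submit an asynchronous, third-party transfer between two endpoints.
      tc = globus_sdk.TransferClient(
          authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
      )
      tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="portal download")
      tdata.add_item("/project/dataset/file.nc", "/~/file.nc")
      task = tc.submit_transfer(tdata)
      print("Submitted transfer task:", task["task_id"])

    The key design point reflected here is that the portal's web tier only issues control calls; the bulk data movement itself is delegated to the transfer service between data-transfer nodes.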

  7. Managing troubled data: Coastal data partnerships smooth data integration

    USGS Publications Warehouse

    Hale, S.S.; Hale, Miglarese A.; Bradley, M.P.; Belton, T.J.; Cooper, L.D.; Frame, M.T.; Friel, C.A.; Harwell, L.M.; King, R.E.; Michener, W.K.; Nicolson, D.T.; Peterjohn, B.G.

    2003-01-01

    Understanding the ecology, condition, and changes of coastal areas requires data from many sources. Broad-scale and long-term ecological questions, such as global climate change, biodiversity, and cumulative impacts of human activities, must be addressed with databases that integrate data from several different research and monitoring programs. Various barriers, including widely differing data formats, codes, directories, systems, and metadata used by individual programs, make such integration troublesome. Coastal data partnerships, by helping overcome technical, social, and organizational barriers, can lead to a better understanding of environmental issues, and may enable better management decisions. Characteristics of successful data partnerships include a common need for shared data, strong collaborative leadership, committed partners willing to invest in the partnership, and clear agreements on data standards and data policy. Emerging data and metadata standards that become widely accepted are crucial. New information technology is making it easier to exchange and integrate data. Data partnerships allow us to create broader databases than would be possible for any one organization to create by itself.

  8. Data compression: The end-to-end information systems perspective for NASA space science missions

    NASA Technical Reports Server (NTRS)

    Tai, Wallace

    1991-01-01

    The unique characteristics of compressed data have important implications for the design of space science data systems, science applications, and data compression techniques. The sequential nature or data dependence between each of the sample values within a block of compressed data introduces an error multiplication or propagation factor which compounds the effects of communication errors. The data communication characteristics of the onboard data acquisition, storage, and telecommunication channels may influence the size of the compressed blocks and the frequency of included re-initialization points. The organization of the compressed data is continually changing depending on the entropy of the input data. This also results in a variable output rate from the instrument which may require buffering to interface with the spacecraft data system. On the ground, there exist key tradeoff issues associated with the distribution and management of the science data products when data compression techniques are applied in order to alleviate the constraints imposed by ground communication bandwidth and data storage capacity.

  9. Analysis of Multiple Data Hiding Combined Coloured Visual Cryptography and LSB

    NASA Astrophysics Data System (ADS)

    Maulana, Halim; Rahman Syahputra, Edy

    2017-12-01

    Currently, the level of data security is becoming a major factor in data transfer. In every process of sending data through any medium, the risk that the data will be intercepted remains. Techniques such as steganography and cryptography are often used to secure data, but such protection may not last long once weaknesses in an algorithm are discovered, so security can no longer be assured. Consequently, a variety of new algorithms are needed to protect data so that data security can be guaranteed. This study attempts to combine two techniques, steganography and visual cryptography. In the experiments, two data types are secured, image data and text data; both are regarded as the message, so the receiver must obtain both types of data to recover the correct information.
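
    To make the steganographic side of such a combination concrete, here is a minimal, generic least-significant-bit (LSB) embedding sketch. It shows plain LSB hiding only, without the visual-cryptography step, and is not the authors' algorithm; the assumption is simply that the cover image is available as a NumPy uint8 array.

      # Minimal generic LSB steganography sketch (not the paper's combined
      # scheme): embeds a byte string into the least significant bits of a
      # uint8 cover array and recovers it again.
      import numpy as np

      def embed_lsb(cover: np.ndarray, payload: bytes) -> np.ndarray:
          """Hide `payload` in the LSBs of a flattened copy of `cover` (uint8)."""
          bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
          flat = cover.flatten().copy()
          if bits.size > flat.size:
              raise ValueError("payload too large for cover image")
          flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
          return flat.reshape(cover.shape)

      def extract_lsb(stego: np.ndarray, n_bytes: int) -> bytes:
          """Recover `n_bytes` previously embedded with embed_lsb."""
          bits = stego.flatten()[: n_bytes * 8] & 1
          return np.packbits(bits).tobytes()

      # Example: hide a short message in a random "image".
      cover = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
      stego = embed_lsb(cover, b"secret")
      assert extract_lsb(stego, 6) == b"secret"

    In a combined scheme of the kind described above, the payload would typically be encrypted (for example, by a visual-cryptography share) before being embedded, so that recovering the LSB plane alone is not enough to read the message.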

  10. An inventory of four-dimensional data sets for the earth sciences

    NASA Technical Reports Server (NTRS)

    Gregory, Terri

    1989-01-01

    The wide variety of data available to the diligent researcher and the myriad paths to obtaining it are emphasized. This inventory is an attempt to make accessible much of the four-dimensional data available in the world. In this quick-look catalog are listed separate data sets (Data Sets sections), sources of data including centers and large data bases (Sources section), and some data expected to be available in the future (Future Data section). In the Data Sets section, individual data sets are arranged alphabetically by institution, with those archived in the U.S.A. listed first, followed by those found elsewhere. The Sources section includes large data bases, centers, and directories. Sources are arranged alphabetically by country. This section is followed by a Future Data section which is a collection of data sets, experiments, and other future developments of which we are cognizant. A collection of further information and order blanks provided by some of the archiving institutions are presented in the appendix.

  11. The Modern Research Data Portal: a design pattern for networked, data-intensive science

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chard, Kyle; Dart, Eli; Foster, Ian

    We describe best practices for providing convenient, high-speed, secure access to large data via research data portals. Here, we capture these best practices in a new design pattern, the Modern Research Data Portal, that disaggregates the traditional monolithic web-based data portal to achieve orders-of-magnitude increases in data transfer performance, support new deployment architectures that decouple control logic from data storage, and reduce development and operations costs. We introduce the design pattern; explain how it leverages high-performance data enclaves and cloud-based data management services; review representative examples at research laboratories and universities, including both experimental facilities and supercomputer sites; describe how to leverage Python APIs for authentication, authorization, data transfer, and data sharing; and use coding examples to demonstrate how these APIs can be used to implement a range of research data portal capabilities. Sample code at a companion web site, https://docs.globus.org/mrdp, provides application skeletons that readers can adapt to realize their own research data portals.

  12. Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) Technology Infrastructure for a Distributed Data Network

    PubMed Central

    Schilling, Lisa M.; Kwan, Bethany M.; Drolshagen, Charles T.; Hosokawa, Patrick W.; Brandt, Elias; Pace, Wilson D.; Uhrich, Christopher; Kamerick, Michael; Bunting, Aidan; Payne, Philip R.O.; Stephens, William E.; George, Joseph M.; Vance, Mark; Giacomini, Kelli; Braddy, Jason; Green, Mika K.; Kahn, Michael G.

    2013-01-01

    Introduction: Distributed Data Networks (DDNs) offer infrastructure solutions for sharing electronic health data from across disparate data sources to support comparative effectiveness research. Data sharing mechanisms must address technical and governance concerns stemming from network security and data disclosure laws and best practices, such as HIPAA. Methods: The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) deploys TRIAD grid technology, a common data model, detailed technical documentation, and custom software for data harmonization to facilitate data sharing in collaboration with stakeholders in the care of safety net populations. Data sharing partners host TRIAD grid nodes containing harmonized clinical data within their internal or hosted network environments. Authorized users can use a central web-based query system to request analytic data sets. Discussion: SAFTINet DDN infrastructure achieved a number of data sharing objectives, including scalable and sustainable systems for ensuring harmonized data structures and terminologies and secure distributed queries. Initial implementation challenges were resolved through iterative discussions, development and implementation of technical documentation, governance, and technology solutions. PMID:25848567

  13. Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) Technology Infrastructure for a Distributed Data Network.

    PubMed

    Schilling, Lisa M; Kwan, Bethany M; Drolshagen, Charles T; Hosokawa, Patrick W; Brandt, Elias; Pace, Wilson D; Uhrich, Christopher; Kamerick, Michael; Bunting, Aidan; Payne, Philip R O; Stephens, William E; George, Joseph M; Vance, Mark; Giacomini, Kelli; Braddy, Jason; Green, Mika K; Kahn, Michael G

    2013-01-01

    Distributed Data Networks (DDNs) offer infrastructure solutions for sharing electronic health data from across disparate data sources to support comparative effectiveness research. Data sharing mechanisms must address technical and governance concerns stemming from network security and data disclosure laws and best practices, such as HIPAA. The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) deploys TRIAD grid technology, a common data model, detailed technical documentation, and custom software for data harmonization to facilitate data sharing in collaboration with stakeholders in the care of safety net populations. Data sharing partners host TRIAD grid nodes containing harmonized clinical data within their internal or hosted network environments. Authorized users can use a central web-based query system to request analytic data sets. SAFTINet DDN infrastructure achieved a number of data sharing objectives, including scalable and sustainable systems for ensuring harmonized data structures and terminologies and secure distributed queries. Initial implementation challenges were resolved through iterative discussions, development and implementation of technical documentation, governance, and technology solutions.

  14. Making Research Data Repositories Visible: The re3data.org Registry

    PubMed Central

    Pampel, Heinz; Vierkant, Paul; Scholze, Frank; Bertelmann, Roland; Kindling, Maxi; Klump, Jens; Goebelbecker, Hans-Jürgen; Gundlach, Jens; Schirmbacher, Peter; Dierolf, Uwe

    2013-01-01

    Researchers require infrastructures that ensure a maximum of accessibility, stability and reliability to facilitate working with and sharing of research data. Such infrastructures are being increasingly summarized under the term Research Data Repositories (RDR). The project re3data.org–Registry of Research Data Repositories–began indexing research data repositories in 2012 and offers researchers, funding organizations, libraries and publishers an overview of the heterogeneous research data repository landscape. As of July 2013, re3data.org lists 400 research data repositories and counting. 288 of these are described in detail using the re3data.org vocabulary. Information icons help researchers to easily identify an adequate repository for the storage and reuse of their data. This article describes the heterogeneous RDR landscape and presents a typology of institutional, disciplinary, multidisciplinary and project-specific RDR. Furthermore, the article outlines the features of re3data.org and shows how this registry helps to identify appropriate repositories for storage and search of research data. PMID:24223762

  15. Follow up: Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer.

    PubMed

    Hu, Ye; Bajorath, Jürgen

    2014-01-01

    In 2012, we reported 30 compound data sets and/or programs developed in our laboratory in a data article and made them freely available to the scientific community to support chemoinformatics and computational medicinal chemistry applications. These data sets and computational tools were provided for download from our website. Since publication of this data article, we have generated 13 new data sets with which we further extend our collection of publicly available data and tools. Due to changes in web servers and website architectures, data accessibility has recently been limited at times. Therefore, we have also transferred our data sets and tools to a public repository to ensure full and stable accessibility. To aid in data selection, we have classified the data sets according to scientific subject areas. Herein, we describe new data sets, introduce the data organization scheme, summarize the database content and provide detailed access information in ZENODO (doi: 10.5281/zenodo.8451 and doi:10.5281/zenodo.8455).

  16. Telecom Big Data for Urban Transport Analysis - a Case Study of Split-Dalmatia County in Croatia

    NASA Astrophysics Data System (ADS)

    Baučić, M.; Jajac, N.; Bućan, M.

    2017-09-01

    Today, big data has become widely available, and new technologies are being developed for big data storage architectures and big data analytics. An ongoing challenge is how to incorporate big data into GIS applications supporting the various domains. The International Transport Forum explains how the arrival of big data and real-time data, together with new data processing algorithms, leads to new insights and operational improvements in transport. Based on telecom customer data, the Study of Tourist Movement and Traffic in Split-Dalmatia County in Croatia was carried out as part of the "IPA Adriatic CBC//N.0086/INTERMODAL" project. This paper briefly explains the big data used in the study and the results of the study. Furthermore, the paper examines the main considerations when using telecom customer big data: data privacy and data quality. The paper concludes with GIS visualisation and proposes further uses of the big data employed in the study.

  17. Technical challenges for big data in biomedicine and health: data sources, infrastructure, and analytics.

    PubMed

    Peek, N; Holmes, J H; Sun, J

    2014-08-15

    To review technical and methodological challenges for big data research in biomedicine and health. We discuss sources of big datasets, survey infrastructures for big data storage and big data processing, and describe the main challenges that arise when analyzing big data. The life and biomedical sciences are massively contributing to the big data revolution through secondary use of data that were collected during routine care and through new data sources such as social media. Efficient processing of big datasets is typically achieved by distributing computation over a cluster of computers. Data analysts should be aware of pitfalls related to big data such as bias in routine care data and the risk of false-positive findings in high-dimensional datasets. The major challenge for the near future is to transform analytical methods that are used in the biomedical and health domain, to fit the distributed storage and processing model that is required to handle big data, while ensuring confidentiality of the data being analyzed.

  18. Making research data repositories visible: the re3data.org Registry.

    PubMed

    Pampel, Heinz; Vierkant, Paul; Scholze, Frank; Bertelmann, Roland; Kindling, Maxi; Klump, Jens; Goebelbecker, Hans-Jürgen; Gundlach, Jens; Schirmbacher, Peter; Dierolf, Uwe

    2013-01-01

    Researchers require infrastructures that ensure a maximum of accessibility, stability and reliability to facilitate working with and sharing of research data. Such infrastructures are being increasingly summarized under the term Research Data Repositories (RDR). The project re3data.org-Registry of Research Data Repositories-began indexing research data repositories in 2012 and offers researchers, funding organizations, libraries and publishers an overview of the heterogeneous research data repository landscape. As of July 2013, re3data.org lists 400 research data repositories and counting. 288 of these are described in detail using the re3data.org vocabulary. Information icons help researchers to easily identify an adequate repository for the storage and reuse of their data. This article describes the heterogeneous RDR landscape and presents a typology of institutional, disciplinary, multidisciplinary and project-specific RDR. Furthermore, the article outlines the features of re3data.org and shows how this registry helps to identify appropriate repositories for storage and search of research data.

  19. Managing troubled data: coastal data partnerships smooth data integration.

    PubMed

    Hale, Stephen S; Miglarese, Anne Hale; Bradley, M Patricia; Belton, Thomas J; Cooper, Larry D; Frame, Michael T; Friel, Christopher A; Harwell, Linda M; King, Robert E; Michener, William K; Nicolson, David T; Peterjohn, Bruce G

    2003-01-01

    Understanding the ecology, condition, and changes of coastal areas requires data from many sources. Broad-scale and long-term ecological questions, such as global climate change, biodiversity, and cumulative impacts of human activities, must be addressed with databases that integrate data from several different research and monitoring programs. Various barriers, including widely differing data formats, codes, directories, systems, and metadata used by individual programs, make such integration troublesome. Coastal data partnerships, by helping overcome technical, social, and organizational barriers, can lead to a better understanding of environmental issues, and may enable better management decisions. Characteristics of successful data partnerships include a common need for shared data, strong collaborative leadership, committed partners willing to invest in the partnership, and clear agreements on data standards and data policy. Emerging data and metadata standards that become widely accepted are crucial. New information technology is making it easier to exchange and integrate data. Data partnerships allow us to create broader databases than would be possible for any one organization to create by itself.

  20. Pre-coding method and apparatus for multiple source or time-shifted single source data and corresponding inverse post-decoding method and apparatus

    NASA Technical Reports Server (NTRS)

    Yeh, Pen-Shu (Inventor)

    1997-01-01

    A pre-coding method and device for improving data compression performance by removing correlation between a first original data set and a second original data set, each having M members, respectively. The pre-coding method produces a compression-efficiency-enhancing double-difference data set. The method and device produce a double-difference data set, i.e., an adjacent-delta calculation performed on a cross-delta data set or a cross-delta calculation performed on two adjacent-delta data sets, from either one of (1) two adjacent spectral bands coming from two discrete sources, respectively, or (2) two time-shifted data sets coming from a single source. The resulting double-difference data set is then coded using either a distortionless data encoding scheme (entropy encoding) or a lossy data compression scheme. Also, a post-decoding method and device for recovering a second original data set having been represented by such a double-difference data set.
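
    A toy illustration of the double-difference idea, an adjacent-delta pass applied to the cross-delta between two related data sets, is sketched below. The use of NumPy, the array names, and the "keep the first element" convention are our assumptions for illustration, not the patent's implementation.

      # Toy sketch of double-difference pre-coding: cross-delta between two
      # related data sets, then adjacent-delta along the result. Names and the
      # invertibility convention are illustrative assumptions only.
      import numpy as np

      def double_difference(band_a: np.ndarray, band_b: np.ndarray) -> np.ndarray:
          """Adjacent-delta of the cross-delta between two equal-length data sets."""
          cross_delta = band_b.astype(np.int64) - band_a.astype(np.int64)
          # First element kept as-is so the transform is invertible.
          return np.concatenate(([cross_delta[0]], np.diff(cross_delta)))

      def invert_double_difference(band_a: np.ndarray, dd: np.ndarray) -> np.ndarray:
          """Recover the second data set from the first and the double-difference."""
          cross_delta = np.cumsum(dd)
          return band_a.astype(np.int64) + cross_delta

      a = np.array([10, 12, 15, 15, 18])
      b = np.array([11, 14, 16, 17, 21])
      dd = double_difference(a, b)   # typically small, low-entropy values
      assert np.array_equal(invert_double_difference(a, dd), b)

    The point of the transform is that correlated data sets yield a double-difference stream of small values, which an entropy coder can then compress more effectively than either original set.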

  1. Pre-coding method and apparatus for multiple source or time-shifted single source data and corresponding inverse post-decoding method and apparatus

    NASA Technical Reports Server (NTRS)

    Yeh, Pen-Shu (Inventor)

    1998-01-01

    A pre-coding method and device for improving data compression performance by removing correlation between a first original data set and a second original data set, each having M members, respectively. The pre-coding method produces a compression-efficiency-enhancing double-difference data set. The method and device produce a double-difference data set, i.e., an adjacent-delta calculation performed on a cross-delta data set or a cross-delta calculation performed on two adjacent-delta data sets, from either one of (1) two adjacent spectral bands coming from two discrete sources, respectively, or (2) two time-shifted data sets coming from a single source. The resulting double-difference data set is then coded using either a distortionless data encoding scheme (entropy encoding) or a lossy data compression scheme. Also, a post-decoding method and device for recovering a second original data set having been represented by such a double-difference data set.

  2. PLC backplane analyzer for field forensics and intrusion detection

    DOEpatents

    Mulder, John; Schwartz, Moses Daniel; Berg, Michael; Van Houten, Jonathan Roger; Urrea, Jorge Mario; King, Michael Aaron; Clements, Abraham Anthony; Trent, Jason; Depoy, Jennifer M; Jacob, Joshua

    2015-05-12

    The various technologies presented herein relate to the determination of unexpected and/or malicious activity occurring between components communicatively coupled across a backplane. Control data, etc., can be intercepted at a backplane where the backplane facilitates communication between a controller and at least one device in an automation process. During interception of the control data, etc., a copy of the control data can be made, e.g., the original control data can be replicated to generate a copy of the original control data. The original control data can continue on to its destination, while the control data copy can be forwarded to an analyzer system to determine whether the control data contains a data anomaly. The content of the copy of the control data can be compared with a previously captured baseline data content, where the baseline data can be captured for a same operational state as the subsequently captured control data.
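
    At its core, the intercept-copy-compare flow described here checks captured control traffic against content previously recorded for the same operational state. The sketch below is a deliberately simplified caricature of that idea; the operational-state keys, SHA-256 hashing, and data structures are our illustrative assumptions, not the patented mechanism.

      # Caricature of the intercept / copy / compare-to-baseline idea; state
      # keys and hashing are illustrative assumptions only.
      import hashlib
      from typing import Dict

      baseline: Dict[str, str] = {}   # operational state -> expected frame digest

      def digest(frame: bytes) -> str:
          return hashlib.sha256(frame).hexdigest()

      def record_baseline(state: str, frame: bytes) -> None:
          """Capture the expected control-frame content for an operational state."""
          baseline[state] = digest(frame)

      def analyze(state: str, frame: bytes) -> bool:
          """Return True if the intercepted copy matches the recorded baseline."""
          expected = baseline.get(state)
          return expected is not None and expected == digest(frame)

      record_baseline("pump_running", b"\x01\x10\x00\x20")
      assert analyze("pump_running", b"\x01\x10\x00\x20")       # matches baseline
      assert not analyze("pump_running", b"\x01\x10\x00\x21")   # anomalous frame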

  3. -Omic and Electronic Health Records Big Data Analytics for Precision Medicine

    PubMed Central

    Wu, Po-Yen; Cheng, Chih-Wen; Kaddi, Chanchala D.; Venugopalan, Janani; Hoffman, Ryan; Wang, May D.

    2017-01-01

    Objective Rapid advances of high-throughput technologies and wide adoption of electronic health records (EHRs) have led to fast accumulation of -omic and EHR data. These voluminous complex data contain abundant information for precision medicine, and big data analytics can extract such knowledge to improve the quality of health care. Methods In this article, we present -omic and EHR data characteristics, associated challenges, and data analytics including data pre-processing, mining, and modeling. Results To demonstrate how big data analytics enables precision medicine, we provide two case studies, including identifying disease biomarkers from multi-omic data and incorporating -omic information into EHR. Conclusion Big data analytics is able to address -omic and EHR data challenges for a paradigm shift towards precision medicine. Significance Big data analytics makes sense of -omic and EHR data to improve healthcare outcomes. It has a long-lasting societal impact. PMID:27740470

  4. User's guide to the Nimbus-4 backscatter ultraviolet experiment data sets

    NASA Technical Reports Server (NTRS)

    Lowrey, B. E.

    1978-01-01

    The first year's data from the Nimbus 4 backscatter ultraviolet (BUV) experiment have been archived in the National Space Science Data Center (NSSDC). Backscattered radiances in the ultraviolet measured by the satellite were used to compute the global total ozone for the period April 1970 - April 1971. The data sets now in the NSSDC are the results obtained by the Ozone Processing Team, which has processed the data with the purpose of determining the best quality of the data. There are four basic sets of data available in the NSSDC representing various stages in processing. The primary data base contains organized and cleaned data in telemetry units. The radiance data has had most of the engineering calibrations performed. The detailed total ozone data is the result of computations to obtain the total ozone; the Compressed Total Ozone data is a convenient condensation of the detailed total ozone. Product data sets are also included.

  5. [Prevention and handling of missing data in clinical trials].

    PubMed

    Jiang, Zhi-wei; Li, Chan-juan; Wang, Ling; Xia, Jie-lai

    2015-11-01

    Missing data is a common and often unavoidable issue in clinical trials. It not only lowers the trial's power but also introduces bias into the trial results. Therefore, missing data handling methods are employed in data analysis on the one hand, while on the other hand it is vital to prevent missing data in the trials; prevention should come first. From the data perspective, measures should first be taken at the stages of protocol design, data collection and data checking to enhance patient compliance and reduce unnecessary missing data. Second, the causes of any missing data that does occur should be noted and recorded in detail, because they are very important for determining the missing-data mechanism and choosing suitable handling methods, e.g., last observation carried forward (LOCF), multiple imputation (MI), or the mixed-effect model repeated measures (MMRM) approach.
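
    Of the handling methods named above, LOCF is the simplest to illustrate. The pandas sketch below (the subject, visit, and response column names are hypothetical) carries each subject's last observed value forward within that subject's visits only.

      # Illustrative LOCF (last observation carried forward) imputation with
      # pandas; column names are hypothetical.
      import pandas as pd

      df = pd.DataFrame({
          "subject":  [1, 1, 1, 2, 2, 2],
          "visit":    [1, 2, 3, 1, 2, 3],
          "response": [5.0, None, None, 7.0, 6.5, None],
      })

      df = df.sort_values(["subject", "visit"])
      # Carry the last observed response forward within each subject only.
      df["response_locf"] = df.groupby("subject")["response"].ffill()
      print(df)

    Note that LOCF is shown only because it is easy to picture; as the abstract implies, the choice among LOCF, MI, and MMRM should follow from the presumed missing-data mechanism.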

  6. Computer systems and methods for the query and visualization multidimensional databases

    DOEpatents

    Stolte, Chris; Tang, Diane L.; Hanrahan, Patrick

    2017-04-25

    A method of generating a data visualization is performed at a computer having a display, one or more processors, and memory. The memory stores one or more programs for execution by the one or more processors. The process receives user specification of a plurality of characteristics of a data visualization. The data visualization is based on data from a multidimensional database. The characteristics specify at least x-position and y-position of data marks corresponding to tuples of data retrieved from the database. The process generates a data visualization according to the specified plurality of characteristics. The data visualization has an x-axis defined based on data for one or more first fields from the database that specify x-position of the data marks and the data visualization has a y-axis defined based on data for one or more second fields from the database that specify y-position of the data marks.

  7. Advancing data management and analysis in different scientific disciplines

    NASA Astrophysics Data System (ADS)

    Fischer, M.; Gasthuber, M.; Giesler, A.; Hardt, M.; Meyer, J.; Prabhune, A.; Rigoll, F.; Schwarz, K.; Streit, A.

    2017-10-01

    Over the past several years, rapid growth of data has affected many fields of science. This has often resulted in the need for overhauling or exchanging the tools and approaches in the disciplines’ data life cycles. However, this allows the application of new data analysis methods and facilitates improved data sharing. The project Large-Scale Data Management and Analysis (LSDMA) of the German Helmholtz Association has been addressing both specific and generic requirements in its data life cycle successfully since 2012. Its data scientists work together with researchers from the fields such as climatology, energy and neuroscience to improve the community-specific data life cycles, in several cases even all stages of the data life cycle, i.e. from data acquisition to data archival. LSDMA scientists also study methods and tools that are of importance to many communities, e.g. data repositories and authentication and authorization infrastructure.

  8. Materials Data Science: Current Status and Future Outlook

    NASA Astrophysics Data System (ADS)

    Kalidindi, Surya R.; De Graef, Marc

    2015-07-01

    The field of materials science and engineering is on the cusp of a digital data revolution. After reviewing the nature of data science and Big Data, we discuss the features of materials data that distinguish them from data in other fields. We introduce the concept of process-structure-property (PSP) linkages and illustrate how the determination of PSPs is one of the main objectives of materials data science. Then we review a selection of materials databases, as well as important aspects of materials data management, such as storage hardware, archiving strategies, and data access strategies. We introduce the emerging field of materials data analytics, which focuses on data-driven approaches to extract and curate materials knowledge from available data sets. The critical need for materials e-collaboration platforms is highlighted, and we conclude the article with a number of suggestions regarding the near-term future of the materials data science field.

  9. RSRA sixth scale wind tunnel test. Tabulated balance data, volume 2

    NASA Technical Reports Server (NTRS)

    Ruddell, A.; Flemming, R.

    1974-01-01

    Summaries are presented of all the force and moment data acquired during the RSRA Sixth Scale Wind Tunnel Test. These data include and supplement the data presented in curve form in previous reports. Each summary includes the model configuration, wing and empennage incidences and deflections, and recorded balance data. The first group of data in each summary presents the force and moment data in full scale parametric form, the dynamic pressure and velocity in the test section, and the powered nacelle fan speed. The second and third groups of data are the balance data in nondimensional coefficient form. The wind axis coefficient data corresponds to the parametric data divided by the wing area for forces and divided by the product of the wing area and wing span or mean aerodynamic chord for moments. The stability axis data resolves the wind axis data with respect to the angle of yaw.

  10. Mobile In Vivo Infrared Data Collection and Diagnoses Comparison System

    NASA Technical Reports Server (NTRS)

    Mintz, Frederick W. (Inventor); Gunapala, Sarath D. (Inventor); Moynihan, Philip I. (Inventor)

    2013-01-01

    Described is a mobile in vivo infrared brain scan and analysis system. The system includes a data collection subsystem and a data analysis subsystem. The data collection subsystem is a helmet with a plurality of infrared (IR) thermometer probes. Each of the IR thermometer probes includes an IR photodetector capable of detecting IR radiation generated by evoked potentials within a user's skull. The helmet is formed to collect brain data that is reflective of firing neurons in a mobile subject and transmit the brain data to the data analysis subsystem. The data analysis subsystem is configured to generate and display a three-dimensional image that depicts a location of the firing neurons. The data analysis subsystem is also configured to compare the brain data against a library of brain data to detect an anomaly in the brain data, and notify a user of any detected anomaly in the brain data.

  11. Data warehousing leads to improved business performance

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morris, R.

    1995-09-01

    Data warehousing is emerging as one of the most significant trends in information technology (IT) during the 1990s. According to William H. Inmon, sometimes referred to as the father of data warehousing, a data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data organized to support management needs. Data warehousing can: provide integrated, historical and operational data; integrate disparate application systems; and organize and store data for informational, analytical processing. Data warehousing offers an opportunity to address today's problems of realizing a return on the massive investments being made in acquiring and managing E and P data. Effective implementations require an understanding of the business benefits being sought and an adaptive, flexible IT architecture for supporting the processes and technologies involved. As national E and P data archives continue to emerge and complement existing data reserves within E and P companies, expect to see increased data warehousing use to merge these two environments.

  12. The performance of differential VLBI delay during interplanetary cruise

    NASA Technical Reports Server (NTRS)

    Moultrie, B.; Wolff, P. J.; Taylor, T. H.

    1984-01-01

    Project Voyager radio metric data are used to evaluate the orbit determination abilities of several data strategies during spacecraft interplanetary cruise. Benchmark performance is established with an operational data strategy of conventional coherent doppler, coherent range, and explicitly differenced range data from two intercontinental baselines to ameliorate the low declination singularity of the doppler data. Employing a Voyager operations trajectory as a reference, the performance of the operational data strategy is compared to the performances of data strategies using differential VLBI delay data (spacecraft delay minus quasar delay) in combination with the aforementioned conventional data types. The comparison of strategy performances indicates that high accuracy cruise orbit determination can be achieved with a data strategy employing differential VLBI delay data, where the quantity of coherent radio metric data has been greatly reduced.

  13. New and Improved GLDAS Data Sets and Data Services at NASA GES DISC

    NASA Technical Reports Server (NTRS)

    Rui, Hualan; Beaudoing, Hiroko; Teng, William; Vollmer, Bruce; Rodell, Matthew; Lei, Guang-Dih

    2012-01-01

    The goal of a Land Data Assimilation System (LDAS) is to ingest satellite- and ground-based observational data products, using advanced land surface modeling and data assimilation techniques, in order to generate optimal fields of land surface states and fluxes data and, thereby, facilitate hydrology and climate modeling, research, and forecast. With the motivation of creating more climatologically consistent data sets, NASA GSFC's Hydrological Sciences Laboratory has generated more than 60 years (Jan. 1948-- Dec. 2008) of Global LDAS Version 2 (GLDAS-2) data, by using the Princeton Forcing Data Set and upgraded versions of Land Surface Models (LSMs). GLDAS data and data services are provided at NASA GES DISC Hydrology Data and Information Services Center (HDISC), in collaboration with HSL and LDAS.

  14. Data distribution method of workflow in the cloud environment

    NASA Astrophysics Data System (ADS)

    Wang, Yong; Wu, Junjuan; Wang, Ying

    2017-08-01

    Cloud computing for workflow applications provides the required high-efficiency computation and large storage capacity, but it also brings challenges for the protection of trade secrets and other private data. Because handling private data increases the data transmission time, this paper presents a new data allocation algorithm, based on the degree of collaborative damage among data items, to improve the existing data allocation strategy. Security depends on keeping confidential data in the private cloud while the public cloud handles the rest; a static allocation method in the initial stage partitions only the non-confidential data, and as new data are generated in the operational phase the data distribution scheme is adjusted dynamically. The experimental results show that the improved method is effective in reducing the data transmission time.

  15. Compilation of fatigue, fatigue-crack propagation, and fracture data for 2024 and 7075 aluminum, Ti-6Al-4V titanium, and 300M steel. Volume 1: Description of data and data storage on magnetic tape. Volume 2: Data tape (7-track magnetic tape)

    NASA Technical Reports Server (NTRS)

    Rice, R. C.; Reynolds, J. L.

    1976-01-01

    Fatigue, fatigue-crack-propagation, and fracture data compiled and stored on magnetic tape are documented. Data for 2024 and 7075 aluminum alloys, Ti-6Al-4V titanium alloy, and 300M steel are included in the compilation. Approximately 4,500 fatigue, 6,500 fatigue-crack-propagation, and 1,500 fracture data points are stored on magnetic tape. Descriptions of the data, an index to the data on the magnetic tape, information on data storage format on the tape, a listing of all data source references, and abstracts of other pertinent test information from each data source reference are included.

  16. Revisiting the Procedures for the Vector Data Quality Assurance in Practice

    NASA Astrophysics Data System (ADS)

    Erdoğan, M.; Torun, A.; Boyacı, D.

    2012-07-01

    Immense use of topographical data in spatial data visualization, business GIS (Geographic Information Systems) solutions and applications, mobile and location-based services forced the topo-data providers to create standard, up-to-date and complete data sets in a sustainable frame. Data quality has been studied and researched for more than two decades. There have been un-countable numbers of references on its semantics, its conceptual logical and representations and many applications on spatial databases and GIS. However, there is a gap between research and practice in the sense of spatial data quality which increases the costs and decreases the efficiency of data production. Spatial data quality is well-known by academia and industry but usually in different context. The research on spatial data quality stated several issues having practical use such as descriptive information, metadata, fulfillment of spatial relationships among data, integrity measures, geometric constraints etc. The industry and data producers realize them in three stages; pre-, co- and post data capturing. The pre-data capturing stage covers semantic modelling, data definition, cataloguing, modelling, data dictionary and schema creation processes. The co-data capturing stage covers general rules of spatial relationships, data and model specific rules such as topologic and model building relationships, geometric threshold, data extraction guidelines, object-object, object-belonging class, object-non-belonging class, class-class relationships to be taken into account during data capturing. And post-data capturing stage covers specified QC (quality check) benchmarks and checking compliance to general and specific rules. The vector data quality criteria are different from the views of producers and users. But these criteria are generally driven by the needs, expectations and feedbacks of the users. This paper presents a practical method which closes the gap between theory and practice. Development of spatial data quality concepts into developments and application requires existence of conceptual, logical and most importantly physical existence of data model, rules and knowledge of realization in a form of geo-spatial data. The applicable metrics and thresholds are determined on this concrete base. This study discusses application of geo-spatial data quality issues and QA (quality assurance) and QC procedures in the topographic data production. Firstly we introduce MGCP (Multinational Geospatial Co-production Program) data profile of NATO (North Atlantic Treaty Organization) DFDD (DGIWG Feature Data Dictionary), the requirements of data owner, the view of data producers for both data capturing and QC and finally QA to fulfil user needs. Then, our practical and new approach which divides the quality into three phases is introduced. Finally, implementation of our approach to accomplish metrics, measures and thresholds of quality definitions is discussed. In this paper, especially geometry and semantics quality and quality control procedures that can be performed by the producers are discussed. Some applicable best-practices that we experienced on techniques of quality control, defining regulations that define the objectives and data production procedures are given in the final remarks. 
These quality control procedures should include visual checks of the source data, the captured vector data and printouts, automatic checks that can be performed by software, and semi-automatic checks carried out in interaction with quality control personnel. Finally, these quality control procedures should ensure the geometric, semantic, attribution and metadata quality of the vector data.

  17. Data Reorganization for Optimal Time Series Data Access, Analysis, and Visualization

    NASA Astrophysics Data System (ADS)

    Rui, H.; Teng, W. L.; Strub, R.; Vollmer, B.

    2012-12-01

    The way data are archived is often not optimal for their access by many user communities (e.g., hydrological), particularly if the data volumes and/or number of data files are large. The number of data records of a non-static data set generally increases with time. Therefore, most data sets are commonly archived by time steps, one step per file, often containing multiple variables. However, many research and application efforts need time series data for a given geographical location or area, i.e., a data organization that is orthogonal to the way the data are archived. The retrieval of a time series of the entire temporal coverage of a data set for a single variable at a single data point, in an optimal way, is an important and longstanding challenge, especially for large science data sets (i.e., with volumes greater than 100 GB). Two examples of such large data sets are the North American Land Data Assimilation System (NLDAS) and Global Land Data Assimilation System (GLDAS), archived at the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC; Hydrology Data Holdings Portal, http://disc.sci.gsfc.nasa.gov/hydrology/data-holdings). To date, the NLDAS data set, hourly 0.125x0.125° from Jan. 1, 1979 to present, has a total volume greater than 3 TB (compressed). The GLDAS data set, 3-hourly and monthly 0.25x0.25° and 1.0x1.0°, Jan. 1948 to present, has a total volume greater than 1 TB (compressed). Both data sets are accessible, in the archived time step format, via several convenient methods, including Mirador search and download (http://mirador.gsfc.nasa.gov/), GrADS Data Server (GDS; http://hydro1.sci.gsfc.nasa.gov/dods/), direct FTP (ftp://hydro1.sci.gsfc.nasa.gov/data/s4pa/), and Giovanni Online Visualization and Analysis (http://disc.sci.gsfc.nasa.gov/giovanni). However, users who need long time series currently have no efficient way to retrieve them. Continuing a longstanding tradition of facilitating data access, analysis, and visualization that contribute to knowledge discovery from large science data sets, the GES DISC recently began a NASA ACCESS-funded project to, in part, optimally reorganize selected large data sets for access and use by the hydrological user community. This presentation discusses the following aspects of the project: (1) explorations of approaches, such as database and file system; (2) findings for each approach, such as limitations and concerns, and pros and cons; (3) implementation of reorganizing data via the file system approach, including data processing (parameter and spatial subsetting), metadata and file structure of reorganized time series data (true "Data Rod," single variable, single grid point, and entire data range per file), and production and quality control. The reorganized time series data will be integrated into several broadly used data tools, such as NASA Giovanni and those provided by CUAHSI HIS (http://his.cuahsi.org/) and EPA BASINS (http://water.epa.gov/scitech/datait/models/basins/), as well as accessible via direct FTP, along with documentation and sample reading software. The data reorganization is initially, as part of the project, applied to selected popular hydrology-related parameters, with other parameters to be added as resources permit.
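
    The "data rod" reorganization described here amounts to turning a per-time-step archive into per-point time series. As a loose illustration only, and not the GES DISC production workflow, a single-point time series can be pulled from a stack of gridded files with xarray; the file pattern, variable name, and coordinates below are placeholders.

      # Loose illustration of extracting a single-grid-point time series
      # ("data rod") from many per-time-step gridded files; file pattern,
      # variable name and coordinates are placeholders.
      import xarray as xr

      # Open all per-time-step files as one virtual dataset along the time axis.
      ds = xr.open_mfdataset("forcing_*.nc4", combine="by_coords")

      # Nearest grid cell to a point of interest (lat/lon are placeholders).
      rod = ds["soil_moisture"].sel(lat=35.0, lon=-97.0, method="nearest")

      # Write the single-variable, single-point, full-record series to one file.
      rod.to_dataframe().to_csv("soil_moisture_35.0N_97.0W.csv")

    Doing this on demand requires touching every archived file, which is exactly the cost the pre-reorganized "data rod" files are meant to avoid.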

  18. Evolution of data stewardship over two decades at a NASA data center

    NASA Astrophysics Data System (ADS)

    Armstrong, E. M.; Moroni, D. F.; Hausman, J.; Tsontos, V. M.

    2013-12-01

    Whether referred to as data science or data engineering, the technical nature and practice of data curation has seen a noticeable shift in the last two decades. The majority of this has been driven by factors of increasing data volumes and complexity, new data structures, and data virtualization through internet access that have themselves spawned new fields or advances in semantic ontologies, metadata, advanced distributed computing and new file formats. As a result of this shifting landscape, the role of the data scientist/engineer has also evolved. We will discuss the key elements of this evolutionary shift from the perspective of data curation at the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC), which is one of 12 NASA Earth Science data centers responsible for archiving and distributing oceanographic satellite data since 1993. Earlier responsibilities of data curation in the history of the PO.DAAC focused strictly on data archiving, low-level data quality assessments, understanding and building read software for terse binary data, or limited applications of self-describing file formats and metadata. Data discovery was often word of mouth or based on perusing simple web pages built for specific products. At that time the PO.DAAC served only a few tens of datasets. A single data engineer focused on a specific mission or suite of datasets from a specific physical parameter (e.g., ocean topography measurements). Since that time the number of datasets in the PO.DAAC has grown to approach one thousand, with increasing complexity of data and metadata structures in self-describing formats. Advances in ontologies, metadata, applications of MapReduce distributed computing and "big data", improvements in data discovery, data mining, and tools for visualization and analysis have all required new and evolving skill sets. The community began requiring more rigorous assessments of data quality and uncertainty. Although the expert knowledge of the physical domain was still critical, especially relevant to assessments of data quality, additional skills in computer science, statistics and system engineering also became necessary. Furthermore, the level of effort required for data curation has not scaled linearly: management of ongoing data operations demands continually increasing productivity, and larger volumes of data must be managed, under funding constraints, with proportionately fewer human resources. The role of data curation has also changed within the perspective of satellite missions. In many early missions, data management and curation was an afterthought (since there were no explicit data management plans written into the proposals), while current NASA mission proposals must have explicit data management plans that identify resources and funds for archiving, distribution and implementing overall data stewardship. In conclusion, the role of the data scientist/engineer at the PO.DAAC has shifted from supporting single missions and serving primarily as a point of contact for the science community to providing complete end-to-end stewardship, implementing a robust set of dataset lifecycle policies from ingest through archiving, including data quality assessment, for a broad swath of parameter-based datasets that can number in the hundreds.

  19. SeaDataNet II - EMODNet - building a pan-European infrastructure for marine and ocean data management

    NASA Astrophysics Data System (ADS)

    Schaap, Dick M. A.; Fichaut, Michele

    2014-05-01

    The second phase of the SeaDataNet project has been well underway since October 2011 and is making good progress. The main objective is to improve operations and to progress towards an efficient data management infrastructure able to handle the diversity and large volume of data collected via research cruises and monitoring activities in European marine waters and global oceans. The SeaDataNet infrastructure comprises a network of interconnected data centres and a central SeaDataNet portal. The portal provides users a unified and transparent overview of the metadata and controlled access to the large collections of data sets managed by the interconnected data centres, as well as the various SeaDataNet standards and tools. Recently the 1st Innovation Cycle has been completed, including upgrading of the CDI Data Discovery and Access service to ISO 19139 and making it fully INSPIRE compliant. The extensive SeaDataNet Vocabularies have been upgraded too and implemented for all SeaDataNet European metadata directories. SeaDataNet is setting and governing marine data standards, and exploring and establishing interoperability solutions to connect to other e-infrastructures on the basis of standards of ISO (19115, 19139), OGC (WMS, WFS, CS-W and SWE), and OpenSearch. The population of directories has also increased considerably through cooperation and involvement in associated EU projects and initiatives. SeaDataNet now gives an overview of and access to more than 1.4 million data sets for physical oceanography, chemistry, geology, geophysics, bathymetry and biology from more than 90 connected data centres in 30 countries bordering the European seas. Access to marine data is also a key issue for the implementation of the EU Marine Strategy Framework Directive (MSFD). The EU communication 'Marine Knowledge 2020' underpins the importance of data availability and of harmonising access to marine data from different sources. SeaDataNet qualified itself for leading the data management component of EMODNet (European Marine Observation and Data Network), which is promoted in the EU Communication. In the past 4 years EMODNet portals have been initiated for the marine data themes digital bathymetry, chemistry, physical oceanography, geology, biology, and seabed habitat mapping. These portals are now being expanded to all European seas in successor projects, which started mid-2013 under EU DG MARE. EMODNet encourages more data providers to come forward to share data and participate in the process of making complete overviews and homogeneous data products. The EMODNet Bathymetry project illustrates well the synergy with SeaDataNet and the added value of generating public data products. The project develops and publishes Digital Terrain Models (DTM) for the European seas, produced from survey and aggregated data sets. The portal provides a versatile DTM viewing service with many relevant map layers and functions for data retrieval. A further refinement is taking place in the new phase. The presentation will give information on the present services of the SeaDataNet infrastructure, highlight key achievements in SeaDataNet II so far, and give further insights into the EMODNet Bathymetry progress.

  20. SEASAT synthetic-aperture radar data user's manual

    NASA Technical Reports Server (NTRS)

    Pravdo, S. H.; Huneycutt, B.; Holt, B. M.; Held, D. N.

    1983-01-01

    The SEASAT Synthetic-Aperture Radar (SAR) system, the data processors, the extent of the image data set, and the means by which a user obtains this data are described, and the data quality is evaluated. The user is alerted to some potential problems with the existing volume of SEASAT SAR image data so that use of that data can be modified accordingly. Secondly, the manual focuses on the ultimate capabilities of the raw data set and evaluates the potential of this data for processing into accurately located, amplitude-calibrated imagery of high resolution. This allows the user to decide whether their needs require special-purpose data processing of the SAR raw data.

  1. Data dissemination using gossiping in wireless sensor networks

    NASA Astrophysics Data System (ADS)

    Medidi, Muralidhar; Ding, Jin; Medidi, Sirisha

    2005-06-01

    Disseminating data among sensors is a fundamental operation in energy-constrained wireless sensor networks. We present a gossip-based adaptive protocol for data dissemination to improve the energy efficiency of this operation. To overcome the data implosion problems associated with the dissemination operation, our protocol uses meta-data to name the data using high-level data descriptors and negotiation to eliminate redundant transmissions of duplicate data in the network. Further, we adapt the gossiping to exploit data aggregation possibilities in sensor networks. We simulated our data dissemination protocol and compared it to the SPIN protocol. We find that our protocol reduces energy consumption by about 20% compared with the others, while significantly improving on the data dissemination rate of plain gossiping.
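
    As an illustration of the metadata negotiation and probabilistic forwarding described above, the following toy simulation is a minimal sketch; it is not the authors' protocol, and the topology, fan-out, and data descriptor are invented for the example.

```python
# A minimal, illustrative simulation of gossip-based dissemination with
# meta-data negotiation: a node only requests data it has not already seen
# (avoiding implosion), then forwards to a random subset of neighbors.
# Topology, fan-out, and the descriptor name are made up for this sketch.
import random


class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbors = []
        self.store = {}          # meta-data descriptor -> data payload

    def receive_advertisement(self, sender, name):
        # Negotiation: only fetch data we have not seen before.
        if name not in self.store:
            self.store[name] = sender.store[name]
            self.gossip(name)

    def gossip(self, name, fanout=2):
        # Forward the advertisement to a random subset of neighbors.
        targets = random.sample(self.neighbors, min(fanout, len(self.neighbors)))
        for target in targets:
            target.receive_advertisement(self, name)


def simulate(num_nodes=50, degree=4):
    nodes = [Node(i) for i in range(num_nodes)]
    for n in nodes:
        n.neighbors = random.sample([m for m in nodes if m is not n], degree)
    # A source node names its new reading with a high-level descriptor.
    source = nodes[0]
    source.store["temp/region-7"] = 21.5
    source.gossip("temp/region-7")
    reached = sum(1 for n in nodes if "temp/region-7" in n.store)
    print(f"{reached}/{num_nodes} nodes received the data")


if __name__ == "__main__":
    simulate()
```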

  2. [Advance in interferogram data processing technique].

    PubMed

    Jing, Juan-Juan; Xiangli, Bin; Lü, Qun-Bo; Huang, Min; Zhou, Jin-Song

    2011-04-01

    Fourier transform spectrometry is a novel information-acquisition technology that integrates the functions of imaging and spectroscopy, but the data the instrument acquires are interferograms of the target, an intermediate product that cannot be used directly; data processing must therefore be applied for the successful application of the interferometric data. In the present paper, data processing techniques are divided into two classes: general-purpose and special-type. First, advances in general-purpose interferometric data processing techniques are introduced; then the special-type interferometric data extraction methods and processing techniques are illustrated according to the classification of Fourier transform spectroscopy. Finally, trends in interferogram data processing techniques are discussed.

  3. Growing Data User Communities

    NASA Astrophysics Data System (ADS)

    Wiggin, B.

    2017-12-01

    Preserving data is not only a technical challenge. Perhaps the best way to protect data is to use it. Grassroots efforts to make research-quality copies of federal data continue to energize communities of data users who often did not previously recognize themselves as open earth data users. Beyond "data rescue" events, the Data Refuge project researches how federal climate and environmental data are used downstream in a variety of local communities and municipal governments to address everyday issues: public health, municipal safety, and even the preservation of cultural heritage assets. Documenting the diverse uses made of open earth data beyond the earth sciences research community grows the community who, in making use of data, also helps to preserve it.

  4. The development of health care data warehouses to support data mining.

    PubMed

    Lyman, Jason A; Scully, Kenneth; Harrison, James H

    2008-03-01

    Clinical data warehouses offer tremendous benefits as a foundation for data mining. By serving as a source for comprehensive clinical and demographic information on large patient populations, they streamline knowledge discovery efforts by providing standard and efficient mechanisms to replace time-consuming and expensive original data collection, organization, and processing. Building effective data warehouses requires knowledge of and attention to key issues in database design, data acquisition and processing, and data access and security. In this article, the authors provide an operational and technical definition of data warehouses, present examples of data mining projects enabled by existing data warehouses, and describe key issues and challenges related to warehouse development and implementation.

  5. Collection, quality control and delivery of ground-based magnetic data during ESA's Swarm mission

    NASA Astrophysics Data System (ADS)

    Macmillan, Susan; Humphries, Thomas; Flower, Simon; Swan, Anthony

    2016-04-01

    Ground-based magnetic data are used in a variety of ways when analysing satellite data. Selecting satellite data often involves the use of magnetic disturbance indices derived from ground-based stations and inverting satellite magnetic data for models of fields from various sources often requires ground-based data. Ground-based data can also be valuable independent data for validation purposes. We summarise data collection and quality control procedures in place at the British Geological Survey for global ground-based observatory and repeat station data. Whilst ongoing participation in the ICSU World Data System and INTERMAGNET facilitates this work, additional procedures have been specially developed for the Swarm mission. We describe these in detail.

  6. The ARM Data System and Archive

    DOE PAGES

    McCord, Raymond A.; Voyles, Jimmy W.

    2016-07-05

    Every observationally based research program needs a way to collect data from instruments, convert the data from its raw format into a more usable format, apply quality control, process it into higher-order data products, store the data, and make the data available to its scientific community. This data flow is illustrated pictorially in Fig. 11-1. These are the basic requirements of any scientific data system, and ARM’s data system would have to address these requirements and more. This research provides one view of the development of the ARM data system, which includes the ARM Data Archive, and some of the more notable decisions that were made along the way.

  7. Economics of data center optics

    NASA Astrophysics Data System (ADS)

    Huff, Lisa

    2016-03-01

    Traffic to and from data centers is now reaching zettabytes per year. Even the smallest of businesses now rely on data centers for revenue generation. And the largest data centers today are orders of magnitude larger than the supercomputing centers of a few years ago. Until quite recently, for most data center managers, optical data centers were nice to dream about, but not really essential. Today, the all-optical data center - perhaps even an all-single-mode-fiber (SMF) data center - is something that even managers of medium-sized data centers should be considering. Economical transceivers are the key to increased adoption of data center optics. An analysis of current and near-future data center optics economics will be discussed in this paper.

  8. System and method for responding to ground and flight system malfunctions

    NASA Technical Reports Server (NTRS)

    Anderson, Julie J. (Inventor); Fussell, Ronald M. (Inventor)

    2010-01-01

    A system for on-board anomaly resolution for a vehicle has a data repository. The data repository stores data related to the different systems, subsystems, and components of the vehicle. The stored data are encoded in a tree-based structure. A query engine is coupled to the data repository. The query engine provides user and automated interfaces and supports contextual queries to the data repository. An inference engine is coupled to the query engine. The inference engine compares current anomaly data to contextual data stored in the data repository using inference rules. The inference engine generates a potential solution to the current anomaly by referencing the data stored in the data repository.
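
    A schematic sketch of the described architecture follows: a tree-based repository of vehicle systems, a simple query engine, and an inference step that matches anomaly data against stored context. All names, ranges, and rules here are hypothetical illustrations rather than the patented implementation.

```python
# A schematic sketch of the architecture described above: a tree-structured
# repository, a depth-first query engine, and a rule-based inference step.
# Component names, nominal ranges, and remedies are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)      # stored contextual data
    children: list = field(default_factory=list)  # subsystems / components


def find(node, name):
    """Query engine: depth-first lookup of a system, subsystem, or component."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


def infer(repository, component, anomaly):
    """Inference engine: compare anomaly data to stored context via a rule."""
    node = find(repository, component)
    if node is None:
        return "unknown component"
    # Example rule: an out-of-range reading suggests the documented remedy.
    lo, hi = node.data.get("nominal_range", (None, None))
    if lo is not None and not (lo <= anomaly["reading"] <= hi):
        return node.data.get("remedy", "no remedy recorded")
    return "reading within nominal range"


# Build a tiny repository: vehicle -> power subsystem -> battery component.
battery = Node("battery", {"nominal_range": (24.0, 32.0),
                           "remedy": "switch to backup bus"})
power = Node("power", children=[battery])
vehicle = Node("vehicle", children=[power])

print(infer(vehicle, "battery", {"reading": 18.3}))  # -> switch to backup bus
```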

  9. SAR data compression: Application, requirements, and designs

    NASA Technical Reports Server (NTRS)

    Curlander, John C.; Chang, C. Y.

    1991-01-01

    The feasibility of reducing data volume and data rate is evaluated for the Earth Observing System (EOS) Synthetic Aperture Radar (SAR). All elements of the data stream, from the sensor downlink to electronic delivery of browse data products, are explored. The factors influencing the design of a data compression system are analyzed, including the signal data characteristics, the image quality requirements, and the throughput requirements. The conclusion is that little or no reduction can be achieved in the raw signal data using traditional data compression techniques (e.g., vector quantization, adaptive discrete cosine transform) due to the induced phase errors in the output image. However, after image formation, a number of techniques are effective for data compression.
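
    For readers unfamiliar with the "traditional" techniques evaluated here, the toy example below sketches vector quantization on generic sample blocks; it is not the EOS SAR design, and the block size and codebook size are arbitrary.

```python
# A toy illustration of vector quantization, one of the "traditional"
# compression techniques named above: train a small codebook on sample
# blocks, then store only the index of the nearest codeword for each block.
import numpy as np


def train_codebook(blocks, k=16, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    codebook = blocks[rng.choice(len(blocks), k, replace=False)]
    for _ in range(iters):
        # Assign each block to its nearest codeword, then update codewords.
        d = np.linalg.norm(blocks[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = blocks[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook


def quantize(blocks, codebook):
    d = np.linalg.norm(blocks[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)   # the indices are what gets stored or transmitted


# Fake signal split into 4-sample blocks (illustration only, not SAR data).
data = np.random.default_rng(1).normal(size=(1024, 4))
cb = train_codebook(data)
indices = quantize(data, cb)
print(f"compressed {data.size} samples to {indices.size} codebook indices")
```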

  10. Validation of Nimbus-7 cloud and SMMR data

    NASA Technical Reports Server (NTRS)

    Hwang, P. H.; Yeh, H. Y. M.; Macmillan, D. S.; Long, C. S.

    1986-01-01

    The relationship between cloud amount, water content (WC), and liquid water content (LWC) is studied. Nimbus-7 cloud data and LWC and WC data derived from the SMMR for July 1979 are analyzed and compared. The SMMR sea surface temperature (SST) data are also compared to Air Force SST data. The comparisons reveal that Nimbus-7 cloud data and the SMMR WC and LWC data correlate well, and there is also good agreement between the SMMR SST and the Air Force data. The data demonstrate that there is a relation between the WC, LWC, and cloud amount data.

  11. Evaluation of Healthcare Interventions and Big Data: Review of Associated Data Issues.

    PubMed

    Asche, Carl V; Seal, Brian; Kahler, Kristijan H; Oehrlein, Elisabeth M; Baumgartner, Meredith Greer

    2017-08-01

    Although the analysis of 'big data' holds tremendous potential to improve patient care, there remain significant challenges before it can be realized. Accuracy and completeness of data, linkage of disparate data sources, and access to data are areas that require particular focus. This article discusses these areas and shares strategies to promote progress. Improvement in clinical coding, innovative matching methodologies, and investment in data standardization are potential solutions to data validation and linkage problems. Challenges to data access still require significant attention with data ownership, security needs, and costs representing significant barriers to access.

  12. Accessing numeric data via flags and tags: A final report on a real world experiment

    NASA Technical Reports Server (NTRS)

    Kottenstette, J. P.; Freeman, J. E.; Staskin, E. R.; Hargrave, C. W.

    1978-01-01

    An experiment is reported which: extended the concepts of data flagging and tagging to the aerospace scientific and technical literature; generated experience with the assignment of data summaries and data terms by documentation specialists; and obtained real world assessments of data summaries and data terms in information products and services. Inclusion of data summaries and data terms improved users' understanding of referenced documents from a subject perspective as well as from a data perspective; furthermore, a radical shift in document ordering behavior occurred during the experiment toward proportionately more requests for data-summarized items.

  13. Big data mining: In-database Oracle data mining over hadoop

    NASA Astrophysics Data System (ADS)

    Kovacheva, Zlatinka; Naydenova, Ina; Kaloyanova, Kalinka; Markov, Krasimir

    2017-07-01

    Big data challenges different aspects of storing, processing and managing data, as well as analyzing and using data for business purposes. Applying data mining methods over big data is another challenge because of the huge data volumes, the variety of information, and the dynamic nature of the sources. Various applications have been developed in this area, but their successful use depends on understanding many specific parameters. In this paper we present several opportunities for using data mining techniques provided by the analytical engine of the Oracle RDBMS over data stored in the Hadoop Distributed File System (HDFS). Some experimental results are given and discussed.

  14. Evolving Metadata in NASA Earth Science Data Systems

    NASA Astrophysics Data System (ADS)

    Mitchell, A.; Cechini, M. F.; Walter, J.

    2011-12-01

    NASA's Earth Observing System (EOS) is a coordinated series of satellites for long-term global observations. NASA's Earth Observing System Data and Information System (EOSDIS) is a petabyte-scale archive of environmental data that supports global climate change research by providing end-to-end services from EOS instrument data collection to science data processing to full access to EOS and other earth science data. On a daily basis, the EOSDIS ingests, processes, archives and distributes over 3 terabytes of data from NASA's Earth Science missions, representing over 3500 data products spanning various science disciplines. EOSDIS currently comprises 12 discipline-specific data centers that are collocated with centers of science discipline expertise. Metadata is used in all aspects of NASA's Earth Science data lifecycle, from the initial measurement gathering to the accessing of data products. Missions use metadata in their science data products when describing information such as the instrument/sensor, operational plan, and geographic region. Acting as the curators of the data products, data centers employ metadata for preservation, access and manipulation of data. EOSDIS provides a centralized metadata repository called the Earth Observing System (EOS) ClearingHouse (ECHO) for data discovery and access via a service-oriented architecture (SOA) between data centers and science data users. ECHO receives inventory metadata from data centers, which generate metadata files that comply with the ECHO Metadata Model. NASA's Earth Science Data and Information System (ESDIS) Project established a Tiger Team to study and make recommendations regarding the adoption of the international metadata standard ISO 19115 in EOSDIS. The result was a technical report recommending an evolution of NASA data systems towards a consistent application of ISO 19115 and related standards, including the creation of a NASA-specific convention for core ISO 19115 elements. Part of NASA's effort to continually evolve its data systems led ECHO to enhance the method by which it receives inventory metadata from the data centers to allow for multiple metadata formats, including ISO 19115. ECHO's metadata model will also be mapped to the NASA-specific convention for ingesting science metadata into the ECHO system. As NASA's new Earth Science missions and data centers migrate to the ISO 19115 standards, EOSDIS is developing metadata management resources to assist in reading, writing and parsing ISO 19115-compliant metadata. To foster interoperability with other agencies and international partners, NASA is working to ensure that a common ISO 19115 convention is developed, enhancing data sharing capabilities and other data analysis initiatives. NASA is also investigating the use of ISO 19115 standards to encode data quality, lineage and provenance with stored values. A common metadata standard across NASA's Earth Science data systems promotes interoperability, enhances data utilization and removes levels of uncertainty found in data products.
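
    As a rough illustration of what an ISO 19115-style record looks like, the sketch below builds a skeletal gmd/gco document; the element subset is schematic and not schema-complete, and the identifier and titles are placeholders rather than actual ECHO or EOSDIS content.

```python
# A schematic sketch of a minimal ISO 19115-style record using the gmd/gco
# namespaces. Illustrative only: not schema-complete, and the identifier and
# titles are placeholders, not real EOSDIS/ECHO metadata.
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def char_string(parent, tag, text):
    """Add a gmd:<tag> element wrapping a gco:CharacterString value."""
    wrapper = ET.SubElement(parent, f"{{{GMD}}}{tag}")
    value = ET.SubElement(wrapper, f"{{{GCO}}}CharacterString")
    value.text = text
    return wrapper


record = ET.Element(f"{{{GMD}}}MD_Metadata")
char_string(record, "fileIdentifier", "example-granule-0001")  # placeholder ID

ident_info = ET.SubElement(record, f"{{{GMD}}}identificationInfo")
data_ident = ET.SubElement(ident_info, f"{{{GMD}}}MD_DataIdentification")
citation = ET.SubElement(data_ident, f"{{{GMD}}}citation")
ci_citation = ET.SubElement(citation, f"{{{GMD}}}CI_Citation")
char_string(ci_citation, "title", "Example Earth science data product")
char_string(data_ident, "abstract", "Schematic inventory metadata record.")

print(ET.tostring(record, encoding="unicode"))
```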

  15. NREL: Renewable Resource Data Center - Geothermal Resource Data

    Science.gov Websites

    Data related to geothermal technology and energy. The following geothermal resource data collections can be found in the Renewable Resource Data Center (RReDC).

  16. Applying Data Mining Principles to Library Data Collection.

    ERIC Educational Resources Information Center

    Guenther, Kim

    2000-01-01

    Explains how libraries can use data mining techniques for more effective data collection. Highlights include three phases: data selection and acquisition; data preparation and processing, including a discussion of the use of XML (extensible markup language); and data interpretation and integration, including database management systems. (LRW)

  17. Enabling New and More Transparent Science via DataONE—a Virtual Data Observation Network for Earth (Invited)

    NASA Astrophysics Data System (ADS)

    Michener, W.

    2010-12-01

    Addressing grand environmental science challenges requires unprecedented access to easily understood data that cross the breadth of temporal, spatial, and thematic scales. From a scientist’s perspective, the big challenges lie in discovering the relevant data, dealing with extreme data heterogeneity, and converting data to information and knowledge. Addressing these challenges requires new approaches for managing, preserving, analyzing, and sharing data. DataONE is designed to be the foundation of new innovative environmental research that addresses questions of relevance to science and society. DataONE will ensure preservation and access to multi-scale, multi-discipline, and multi-national data. Operationally, DataONE encompasses a distributed global network of Member Nodes (i.e., data repositories) that provide open and persistent access to well-described and easily discovered Earth observational data. In addition, a smaller number of Coordinating Nodes (i.e., metadata repositories and service centers) support network-wide services such as data replication and access to an array of enabling tools. DataONE’s objectives are to: make biological data available from the genome to the ecosystem; make environmental data available from atmospheric, ecological, hydrological, and oceanographic sources; provide secure and long-term preservation and access; and engage scientists, land-managers, policy makers, students, educators, and the public through logical access and intuitive visualizations. The foundation for excellence of DataONE is the established collaboration among participating organizations that have multi-decade expertise in a wide range of fields that includes: existing archive initiatives, libraries, environmental observing systems and research networks, data and information management, science synthesis centers, and professional societies. DataONE is a means to serve a broad range of science domains directly and indirectly through interoperability with partnering networks. DataONE engages its community of partners through working groups focused on identifying, describing, and implementing the DataONE cyberinfrastructure, governance, and sustainability models. These working groups, which consist of a diverse group of graduate students, educators, government and industry representatives, and leading computer, information, and library scientists: (1) perform computer science, informatics, and social science research related to all stages of the data life cycle; (2) develop DataONE interfaces and prototypes; (3) adopt/adapt interoperability standards; (4) create value-added technologies (e.g., semantic mediation, scientific workflow, and visualization) that facilitate data integration, analysis, and understanding; (5) address socio-cultural barriers to sustainable data preservation and data sharing; and (6) promote the adoption of best practices for managing the full data life cycle.

  18. Quality assurance and quality control for autonomously collected geoscience data

    NASA Astrophysics Data System (ADS)

    Versteeg, R. J.; Richardson, A.; Labrecque, D.

    2006-12-01

    The growing interest in processes, coupled with the reduction in cost and complexity of sensors that allow for continuous data collection and transmission, is giving rise to vast amounts of semi-autonomously collected data. Such data is typically collected from a range of physical and chemical sensors and transmitted - either at the time of collection, or periodically as a collection of measurements - to a central server. Such setups can collect vast amounts of data. In cases where power is not an issue, one datapoint can be collected every minute, resulting in tens of thousands of data points per month per sensor. Especially in cases in which multiple sensors are deployed, it is infeasible to examine each individual datapoint for each individual sensor, and users typically will look at aggregates of such data on a periodic (once a week to once every few months) basis. Such aggregates (and the time lag between data collection and data evaluation) will impact the ability to rapidly identify and resolve data issues. Thus, there is a need to integrate data QA/QC rules and procedures into the data collection process. These should be implemented such that data is analyzed for compliance the moment it arrives at the server, and that any issues with this data result in notification of cognizant personnel. Typical issues encountered in the field range from complete system failure (no data arriving at all), to complete sensor failure (data is collected but is meaningless), to partial sensor failure (the sensor gives erratic readings or starts to exhibit a bias), to partial power loss (the system collects and transmits data only intermittently). We have implemented a suite of such rules and tests as part of the INL-developed performance monitoring system. These rules are invoked as part of a data QA/QC workflow, and result in quality indicators for each datapoint as well as user alerts in case of issues. Tests applied to the data include tests on individual datapoints, tests on suites of datapoints, and tests applied over the whole dataset. Examples of tests include: Did data arrive on time? Is received data in a valid format? Are all measurements present? Is data within valid range? Is data collected at appropriate time intervals? Are the statistics of the data changing over time? And is the data collected within an appropriate instrument calibration window? This approach, which is executed automatically on all data, provides data end users with confidence and auditability regarding the quality and usability of autonomously collected data.
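
    The kinds of rules listed above can be expressed compactly in code. The sketch below is a minimal illustration, assuming hypothetical field names, ranges, and thresholds; it is not the INL system.

```python
# A minimal sketch of automated QA/QC rules of the kind listed above, applied
# to each incoming record as it arrives at the server. Field names, the valid
# range, and the thresholds are hypothetical.
from datetime import datetime, timedelta
from statistics import mean, stdev

EXPECTED_FIELDS = {"timestamp", "sensor_id", "value"}
VALID_RANGE = (-10.0, 60.0)           # e.g. degrees C for a temperature probe
EXPECTED_INTERVAL = timedelta(minutes=1)


def check_record(record, previous_timestamp, history):
    """Return a list of quality flags for one incoming datapoint."""
    flags = []
    # Is the record in a valid format, with all measurements present?
    if not EXPECTED_FIELDS.issubset(record):
        flags.append("missing_fields")
        return flags
    # Is the data within its valid range?
    if not (VALID_RANGE[0] <= record["value"] <= VALID_RANGE[1]):
        flags.append("out_of_range")
    # Was the data collected at the appropriate time interval?
    ts = datetime.fromisoformat(record["timestamp"])
    if previous_timestamp and (ts - previous_timestamp) > 2 * EXPECTED_INTERVAL:
        flags.append("late_or_missing_data")
    # Are the statistics of the data drifting over time (possible sensor bias)?
    if len(history) >= 30:
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(record["value"] - mu) > 4 * sigma:
            flags.append("statistical_outlier")
    return flags or ["ok"]


record = {"timestamp": "2006-07-01T12:05:00", "sensor_id": "well-3", "value": 71.2}
print(check_record(record, datetime(2006, 7, 1, 12, 4), [20.1, 20.3, 19.8] * 10))
```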

  19. Best Practices for Preparing Interoperable Geospatial Data

    NASA Astrophysics Data System (ADS)

    Wei, Y.; Santhana Vannan, S.; Cook, R. B.; Wilson, B. E.; Beaty, T. W.

    2010-12-01

    Geospatial data is critically important for a wide range of research and applications: carbon cycle and ecosystem studies, climate change, land use and urban planning, environmental protection, etc. Geospatial data is created by different organizations using different methods - remote sensing observations, field surveys, model simulations, etc. - and stored in various formats. Geospatial data is therefore diverse and heterogeneous, which creates a major barrier to the sharing and use of geospatial data, especially when targeting a broad user community. Many efforts have been made to address different aspects of using geospatial data by improving its interoperability. For example, the specification for Open Geospatial Consortium (OGC) catalog services defines a standard way for geospatial information discovery, and OGC Web Coverage Services (WCS) and OPeNDAP each define interoperable protocols for geospatial data access. But the reality is that having standard mechanisms for data discovery and access alone is not enough. The geospatial data content itself has to be organized in standard, easily understandable, and readily usable formats. The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) archives data and information relevant to biogeochemical dynamics, ecological data, and environmental processes. The Modeling and Synthesis Thematic Data Center (MAST-DC) prepares and distributes both input data and output data of carbon cycle models and provides data support for synthesis and terrestrial model inter-comparison at multiple scales. Both of these NASA-funded data centers compile and distribute a large amount of diverse geospatial data and have broad user communities, including GIS users, Earth science researchers, and ecosystem modeling teams. The ORNL DAAC and MAST-DC address this geospatial data interoperability issue by standardizing the data content and feeding it into a well-designed Spatial Data Infrastructure (SDI), which provides interoperable mechanisms to advertise, visualize, and distribute the standardized geospatial data. In this presentation, we summarize the lessons learned and the best practices for geospatial data standardization. The presentation will describe how diverse and historical data archived in the ORNL DAAC were converted into standard and non-proprietary formats; what tools were used to make the conversion; how the spatial and temporal information is properly captured in a consistent manner; how to name a data file or a variable to make it both human-friendly and semantically interoperable; how the NetCDF file format and CF convention can promote data usage in the ecosystem modeling user community; how the standardized geospatial data can be fed into OGC Web Services to support on-demand data visualization and access; and how the metadata should be collected and organized so that they can be discovered through standard catalog services.
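
    As a small example of the NetCDF/CF packaging practice recommended here, the sketch below writes a tiny CF-convention file with the netCDF4 library; variable names, units, and the grid are illustrative only and do not correspond to an actual ORNL DAAC or MAST-DC product.

```python
# A minimal sketch of writing a CF-convention NetCDF file with the netCDF4
# library. The variable, units, and the tiny grid are illustrative only.
import numpy as np
from netCDF4 import Dataset

with Dataset("example_cf.nc", "w", format="NETCDF4") as nc:
    nc.Conventions = "CF-1.6"
    nc.title = "Example standardized geospatial product"

    nc.createDimension("time", None)   # unlimited time dimension
    nc.createDimension("lat", 3)
    nc.createDimension("lon", 4)

    time = nc.createVariable("time", "f8", ("time",))
    time.units = "days since 2000-01-01 00:00:00"
    time.calendar = "standard"

    lat = nc.createVariable("lat", "f4", ("lat",))
    lat.units = "degrees_north"
    lat.standard_name = "latitude"

    lon = nc.createVariable("lon", "f4", ("lon",))
    lon.units = "degrees_east"
    lon.standard_name = "longitude"

    gpp = nc.createVariable("gpp", "f4", ("time", "lat", "lon"),
                            fill_value=-9999.0)
    gpp.units = "kg m-2 s-1"
    gpp.long_name = "gross primary productivity"

    # Fill with placeholder values for the illustration.
    time[:] = [0.0]
    lat[:] = [35.0, 36.0, 37.0]
    lon[:] = [-90.0, -89.0, -88.0, -87.0]
    gpp[0, :, :] = np.zeros((3, 4), dtype="f4")
```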

  20. A research on the security of wisdom campus based on geospatial big data

    NASA Astrophysics Data System (ADS)

    Wang, Haiying

    2018-05-01

    Wisdom campus initiatives face difficulties such as geospatial big data sharing, function expansion, data management, and the analysis and mining of geospatial big data for characteristic features; in particular, the inability to guarantee data security has drawn increasingly prominent attention. In this article we put forward a data-oriented software architecture, designed around the ideology of orienting to data and treating data as the kernel, to overcome the limitations of traditional software architectures, broaden campus spatial data research, and advance wisdom campus applications.

  1. Exploring NASA GES DISC Data with Interoperable Services

    NASA Technical Reports Server (NTRS)

    Zhao, Peisheng; Yang, Wenli; Hegde, Mahabal; Wei, Jennifer C.; Kempler, Steven; Pham, Long; Teng, William; Savtchenko, Andrey

    2015-01-01

    Overview of NASA GES DISC (NASA Goddard Earth Science Data and Information Services Center) data with interoperable services. Open-standard, interoperable services improve data discoverability, accessibility, and usability through metadata, catalogue, and portal standards, and achieve data, information, and knowledge sharing across applications through standardized interfaces and protocols. Open Geospatial Consortium (OGC) data services and specifications include: Web Coverage Service (WCS) for the data themselves; Web Map Service (WMS) for pictures of the data; Web Map Tile Service (WMTS) for pictures of data tiles; and Styled Layer Descriptors (SLD) for rendered styles.
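
    The OGC services listed above are invoked through standard key-value requests. The sketch below assembles example WMS GetMap and WCS GetCoverage URLs; the endpoint and layer/coverage identifiers are placeholders, not actual GES DISC values.

```python
# A hedged sketch of forming standard OGC requests of the kinds listed above:
# WMS GetMap for "pictures of data" and WCS GetCoverage for the data itself.
# The endpoint URL and the layer/coverage names are placeholders.
from urllib.parse import urlencode

ENDPOINT = "https://example.gov/ogc"   # placeholder service endpoint


def wms_getmap(layer, bbox, width=512, height=256):
    # WMS 1.3.0 with CRS=EPSG:4326 expects BBOX as min_lat,min_lon,max_lat,max_lon.
    params = {
        "SERVICE": "WMS", "VERSION": "1.3.0", "REQUEST": "GetMap",
        "LAYERS": layer, "STYLES": "", "CRS": "EPSG:4326",
        "BBOX": ",".join(map(str, bbox)),
        "WIDTH": width, "HEIGHT": height, "FORMAT": "image/png",
    }
    return f"{ENDPOINT}?{urlencode(params)}"


def wcs_getcoverage(coverage_id):
    # WCS 2.0.1 GetCoverage; the output format here is only an assumption.
    params = {
        "SERVICE": "WCS", "VERSION": "2.0.1", "REQUEST": "GetCoverage",
        "COVERAGEID": coverage_id, "FORMAT": "application/netcdf",
    }
    return f"{ENDPOINT}?{urlencode(params)}"


print(wms_getmap("example_precipitation", (-90, -180, 90, 180)))
print(wcs_getcoverage("example_precipitation"))
```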

  2. The AmeriFlux data activity and data system: an evolving collection of data management techniques, tools, products and services

    NASA Astrophysics Data System (ADS)

    Boden, T. A.; Krassovski, M.; Yang, B.

    2013-06-01

    The Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory (ORNL), USA, has provided scientific data management support for the US Department of Energy and international climate change science since 1982. Among the many data archived and available from CDIAC are collections from long-term measurement projects. One current example is the AmeriFlux measurement network. AmeriFlux provides continuous measurements from forests, grasslands, wetlands, and croplands in North, Central, and South America and offers important insight about carbon cycling in terrestrial ecosystems. To successfully manage AmeriFlux data and support climate change research, CDIAC has designed flexible data systems using proven technologies and standards blended with new, evolving technologies and standards. The AmeriFlux data system, composed primarily of a relational database, a PHP-based data interface, and an FTP server, offers a broad suite of AmeriFlux data. The data interface allows users to query the AmeriFlux collection in a variety of ways and then subset, visualize and download the data. From the perspective of data stewardship, on the other hand, this system is designed for CDIAC to easily control database content, automate data movement, track data provenance, manage metadata content, and handle frequent additions and corrections. CDIAC and researchers in the flux community developed data submission guidelines to enhance the AmeriFlux data collection, enable automated data processing, and promote standardization across regional networks. Both continuous flux and meteorological data and irregular biological data collected at AmeriFlux sites are carefully scrutinized by CDIAC using established quality-control algorithms before the data are ingested into the AmeriFlux data system. Other tasks at CDIAC include reformatting and standardizing the diverse and heterogeneous datasets received from individual sites into a uniform and consistent network database, generating high-level derived products to meet the current demands from a broad user group, and developing new products in anticipation of future needs. In this paper, we share our approaches to meet the challenges of standardizing, archiving and delivering quality, well-documented AmeriFlux data worldwide to benefit others with similar challenges of handling diverse climate change data, to further heighten awareness and use of an outstanding ecological data resource, and to highlight expanded software engineering applications being used for climate change measurement data.

  3. The AmeriFlux data activity and data system: an evolving collection of data management techniques, tools, products and services

    NASA Astrophysics Data System (ADS)

    Boden, T. A.; Krassovski, M.; Yang, B.

    2013-02-01

    The Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory (ORNL), USA, has provided scientific data management support for the US Department of Energy and international climate change science since 1982. Among the many data archived and available from CDIAC are collections from long-term measurement projects. One current example is the AmeriFlux measurement network. AmeriFlux provides continuous measurements from forests, grasslands, wetlands, and croplands in North, Central, and South America and offers important insight about carbon cycling in terrestrial ecosystems. To successfully manage AmeriFlux data and support climate change research, CDIAC has designed flexible data systems using proven technologies and standards blended with new, evolving technologies and standards. The AmeriFlux data system, composed primarily of a relational database, a PHP-based data interface, and an FTP server, offers a broad suite of AmeriFlux data. The data interface allows users to query the AmeriFlux collection in a variety of ways and then subset, visualize and download the data. From the perspective of data stewardship, on the other hand, this system is designed for CDIAC to easily control database content, automate data movement, track data provenance, manage metadata content, and handle frequent additions and corrections. CDIAC and researchers in the flux community developed data submission guidelines to enhance the AmeriFlux data collection, enable automated data processing, and promote standardization across regional networks. Both continuous flux and meteorological data and irregular biological data collected at AmeriFlux sites are carefully scrutinized by CDIAC using established quality-control algorithms before the data are ingested into the AmeriFlux data system. Other tasks at CDIAC include reformatting and standardizing the diverse and heterogeneous datasets received from individual sites into a uniform and consistent network database, generating high-level derived products to meet the current demands from a broad user group, and developing new products in anticipation of future needs. In this paper, we share our approaches to meet the challenges of standardizing, archiving and delivering quality, well-documented AmeriFlux data worldwide to benefit others with similar challenges of handling diverse climate change data, to further heighten awareness and use of an outstanding ecological data resource, and to highlight expanded software engineering applications being used for climate change measurement data.

  4. The Generic Data Capture Facility

    NASA Technical Reports Server (NTRS)

    Connell, Edward B.; Barnes, William P.; Stallings, William H.

    1987-01-01

    The Generic Data Capture Facility, which can provide data capture support for a variety of different types of spacecraft while enabling operations costs to be carefully controlled, is discussed. The data capture functions, data protection, isolation of users from data acquisition problems, data reconstruction, and quality and accounting are addressed. The TDM and packet data formats utilized by the system are described, and the development of generic facilities is considered.

  5. The Diverse Data, User Driven Services and the Power of Giovanni at NASA GES DISC

    NASA Technical Reports Server (NTRS)

    Shen, Suhung

    2017-01-01

    This presentation provides an overview of remote sensing and model data at the GES (Goddard Earth Sciences) DISC (Data and Information Services Center); of data services at the GES DISC, including registration with the NASA data system and searching for and downloading data; of Giovanni (Geospatial Interactive Online Visualization ANd aNalysis Infrastructure), an online data exploration tool; and of the NASA Earth Data and Information System.

  6. CyVerse Data Commons: lessons learned in cyberinfrastructure management and data hosting from the Life Sciences

    NASA Astrophysics Data System (ADS)

    Swetnam, T. L.; Walls, R.; Merchant, N.

    2017-12-01

    CyVerse is a US National Science Foundation-funded initiative "to design, deploy, and expand a national cyberinfrastructure for life sciences research, and to train scientists in its use," supporting and enabling cross-disciplinary collaborations across institutions. CyVerse's free, open-source cyberinfrastructure is being adopted in biogeoscience and space sciences research. CyVerse's domain-agnostic data science platforms provide shared data storage, high-performance computing, and cloud computing that allow analysis of very large data sets (including incomplete or work-in-progress data sets). Part of CyVerse's success has been in addressing the handling of data through its entire lifecycle, from creation to final publication in a digital data repository to reuse in new analyses. CyVerse developers and user communities have learned many lessons that are germane to Earth and Environmental Science. We present an overview of the tools and services available through CyVerse including: interactive computing with the Discovery Environment (https://de.cyverse.org/), an interactive data science workbench featuring data storage and transfer via the Data Store; cloud computing with Atmosphere (https://atmo.cyverse.org); and access to HPC via the Agave API (https://agaveapi.co/). Each CyVerse service emphasizes access to long-term data storage, including our own Data Commons (http://datacommons.cyverse.org), as well as external repositories. The Data Commons service manages, organizes, preserves, and publishes data, and allows for its discovery and reuse. All data published to CyVerse's Curated Data receive a permanent identifier (PID) in the form of a DOI (Digital Object Identifier) or ARK (Archival Resource Key). Data that is more fluid can also be published in the Data Commons through Community Collaborated Data. The Data Commons provides landing pages, permanent DOIs or ARKs, and supports data reuse and citation through features such as open data licenses and downloadable citations. The ability to access and do computing on data within the CyVerse framework, or with external compute resources when necessary, has proven highly beneficial to our user community, which has continuously grown since the inception of CyVerse nine years ago.

  7. The NSF Arctic Data Center: Leveraging the DataONE Federation to Build a Sustainable Archive for the NSF Arctic Research Community

    NASA Astrophysics Data System (ADS)

    Budden, A. E.; Arzayus, K. M.; Baker-Yeboah, S.; Casey, K. S.; Dozier, J.; Jones, C. S.; Jones, M. B.; Schildhauer, M.; Walker, L.

    2016-12-01

    The newly established NSF Arctic Data Center plays a critical support role in archiving and curating the data and software generated by Arctic researchers from diverse disciplines. The Arctic community, comprising Earth science, archaeology, geography, anthropology, and other social science researchers, is supported through data curation services and domain-agnostic tools and infrastructure, ensuring data are accessible in the most transparent and usable way possible. This interoperability across diverse disciplines within the Arctic community facilitates collaborative research and is mirrored by interoperability between the Arctic Data Center infrastructure and other large-scale cyberinfrastructure initiatives. The Arctic Data Center leverages the DataONE federation to standardize access to and replication of data and metadata to other repositories, specifically NOAA's National Centers for Environmental Information (NCEI). This approach promotes long-term preservation of the data and metadata, and opens the door for other data repositories to leverage this replication infrastructure with NCEI and other DataONE member repositories. The Arctic Data Center uses rich, detailed metadata following widely recognized standards. In particular, measurement-level and provenance metadata provide scientists with the details necessary to integrate datasets across studies and across repositories while enabling a full understanding of the provenance of data used in the system. The Arctic Data Center gains this deep metadata and provenance support by simply adopting DataONE services, which results in significant efficiency gains by eliminating the need to develop systems de novo. Similarly, the advanced search tool, developed by the Knowledge Network for Biocomplexity and extended for data submission by the Arctic Data Center, can be used by other DataONE-compliant repositories without further development. By standardizing interfaces and leveraging the DataONE federation, the Arctic Data Center has advanced rapidly and can itself contribute to raising the capabilities of all members of the federation.

  8. Data, data everywhere ...

    NASA Astrophysics Data System (ADS)

    Chandler, C. L.

    2016-12-01

    The scientific research endeavor requires data, and in some cases massive amounts of complex and highly diverse data. From experimental design, through data acquisition and analysis, hypothesis testing, and finally drawing conclusions, data collection and proper stewardship are critical to science. Even a single experiment conducted by a single researcher will produce data to test the working hypothesis. The types of complex science questions being tackled today often require large, diverse, multi-disciplinary teams of researchers who must be prepared to exchange their data. This 2016 AGU Leptoukh Lecture comprises a series of vignettes that illustrate a brief history of data stewardship: where we have come from, how and why we have arrived where we are today, and where we are headed with respect to data management. The specific focus will be on management of marine ecosystem research data and will include observations on the drivers, challenges, strategies, and solutions that have evolved over time. The lessons learned should be applicable to other disciplines and the hope is that many will recognize parallels in their chosen domain. From historical shipboard logbooks to the high-volume, digital, quality-controlled ocean science data sets created by today's researchers, there have been enormous changes in the way ocean data are collected and reported. Rapid change in data management practices is being driven by new data exchange requirements, by modern expectations for machine-interoperable exchange, and by the desire to achieve research transparency. Advances in technology and cultural shifts contribute to the changing conditions through which data managers and informatics specialists must navigate. The unique challenges associated with collecting and managing environmental data, complicated by the onset of the big data era, make this a fascinating time to be responsible for data. It seems there are data everywhere, being collected by everyone, for all sorts of reasons, and people have recognized the value of access to data. Properly managed and documented data, freely available to all, hold enormous potential for reuse beyond the original reason for collection.

  9. Analytics to Better Interpret and Use Large Amounts of Heterogeneous Data

    NASA Astrophysics Data System (ADS)

    Mathews, T. J.; Baskin, W. E.; Rinsland, P. L.

    2014-12-01

    Data scientists at NASA's Atmospheric Science Data Center (ASDC) are seasoned software application developers who have worked with the creation, archival, and distribution of large datasets (multiple terabytes and larger). In order for ASDC data scientists to effectively implement the most efficient processes for cataloging and organizing data access applications, they must be intimately familiar with data contained in the datasets with which they are working. Key technologies that are critical components to the background of ASDC data scientists include: large RDBMSs (relational database management systems) and NoSQL databases; web services; service-oriented architectures; structured and unstructured data access; as well as processing algorithms. However, as prices of data storage and processing decrease, sources of data increase, and technologies advance - granting more people access to data in real or near-real time - data scientists are being pressured to accelerate their ability to identify and analyze vast amounts of data. With existing tools this is becoming increasingly challenging to accomplish. For example, the NASA Earth Science Data and Information System (ESDIS) alone grew from having just over 4 PB of data in 2009 to nearly 6 PB of data in 2011. This amount then increased to roughly 10 PB of data in 2013. With data from at least ten new missions to be added to the ESDIS holdings by 2017, the current volume will continue to grow exponentially and drive the need to be able to analyze more data even faster. Though there are many highly efficient, off-the-shelf analytics tools available, these tools mainly cater towards business data, which is predominantly unstructured. As a result, there are very few known analytics tools that interface well to archived Earth science data, which is predominantly heterogeneous and structured. This presentation will identify use cases for data analytics from an Earth science perspective in order to begin to identify specific tools that may be able to address those challenges.

  10. The Biological and Chemical Oceanography Data Management Office

    NASA Astrophysics Data System (ADS)

    Allison, M. D.; Chandler, C. L.; Groman, R. C.; Wiebe, P. H.; Glover, D. M.; Gegg, S. R.

    2011-12-01

    Oceanography and marine ecosystem research are inherently interdisciplinary fields of study that generate and require access to a wide variety of measurements. In late 2006 the Biological and Chemical Oceanography Sections of the National Science Foundation (NSF) Geosciences Directorate Division of Ocean Sciences (OCE) funded the Biological and Chemical Oceanography Data Management Office (BCO-DMO). In late 2010 additional funding was contributed to support management of research data from the NSF Office of Polar Programs Antarctic Organisms & Ecosystems Program. The BCO-DMO is recognized in the 2011 Division of Ocean Sciences Sample and Data Policy as one of several program specific data offices that support NSF OCE funded researchers. BCO-DMO staff members offer data management support throughout the project life cycle to investigators from large national programs and medium-sized collaborative research projects, as well as researchers from single investigator awards. The office manages and serves all types of oceanographic data and information generated during the research process and contributed by the originating investigators. BCO-DMO has built a data system that includes the legacy data from several large ocean research programs (e.g. United States Joint Global Ocean Flux Study and United States GLOBal Ocean ECosystems Dynamics), to which data have been contributed from recently granted NSF OCE and OPP awards. The BCO-DMO data system can accommodate many different types of data including: in situ and experimental biological, chemical, and physical measurements; modeling results and synthesis data products. The system enables reuse of oceanographic data for new research endeavors, supports synthesis and modeling activities, provides availability of "real data" for K-12 and college level use, and provides decision-support field data for policy-relevant investigations. We will present an overview of the data management system capabilities including: map-based and text-based data discovery and access systems; recent enhancements to data search tools; data export and download utilities; and strategic use of controlled vocabularies to facilitate data integration and to improve data system interoperability.

  11. Managing Astronomy Research Data: Case Studies of Big and Small Research Projects

    NASA Astrophysics Data System (ADS)

    Sands, Ashley E.

    2015-01-01

    Astronomy data management refers to all actions taken upon data over the course of the entire research process. It includes activities involving the collection, organization, analysis, release, storage, archiving, preservation, and curation of research data. Astronomers have cultivated data management tools, infrastructures, and local practices to ensure the use and future reuse of their data. However, new sky surveys will soon amass petabytes of data requiring new data management strategies.The goal of this dissertation, to be completed in 2015, is to identify and understand data management practices and the infrastructure and expertise required to support best practices. This will benefit the astronomy community in efforts toward an integrated scholarly communication framework.This dissertation employs qualitative, social science research methods (including interviews, observations, and document analysis) to conduct case studies of data management practices, covering the entire data lifecycle, amongst three populations: Sloan Digital Sky Survey (SDSS) collaboration team members; Individual and small-group users of SDSS data; and Large Synoptic Survey Telescope (LSST) collaboration team members. I have been observing the collection, release, and archiving of data by the SDSS collaboration, the data practices of individuals and small groups using SDSS data in journal articles, and the LSST collaboration's planning and building of infrastructure to produce data.Preliminary results demonstrate that current data management practices in astronomy are complex, situational, and heterogeneous. Astronomers often have different management repertoires for working on sky surveys and for their own data collections, varying their data practices as they move between projects. The multitude of practices complicates coordinated efforts to maintain data.While astronomy expertise proves critical to managing astronomy data in the short, medium, and long term, the larger astronomy data workforce encompasses a greater breadth of educational backgrounds. Results show that teams of individuals with distinct expertise are key to ensuring the long-term preservation and usability of astronomy datasets.

  12. NCI's national environmental research data collection: metadata management built on standards and preparing for the semantic web

    NASA Astrophysics Data System (ADS)

    Wang, Jingbo; Bastrakova, Irina; Evans, Ben; Gohar, Kashif; Santana, Fabiana; Wyborn, Lesley

    2015-04-01

    National Computational Infrastructure (NCI) manages national environmental research data collections (10+ PB) as part of its specialized high-performance data node of the Research Data Storage Infrastructure (RDSI) program. We manage 40+ data collections using NCI's Data Management Plan (DMP), which is compatible with the ISO 19100 metadata standards. We utilize ISO standards to make sure our metadata is transferable and interoperable for sharing and harvesting. The DMP is used, along with metadata from the data itself, to create a hierarchy of data collection, dataset and time series catalogues that is then exposed through GeoNetwork for standard discoverability. The catalogues in this hierarchy are linked using parent-child relationships. The hierarchical infrastructure of our GeoNetwork catalogues system aims to address both discoverability and in-house administrative use-cases. At NCI, we are currently improving the metadata interoperability in our catalogue by linking with standardized community vocabulary services. These emerging vocabulary services are being established to help harmonise data from different national and international scientific communities. One such vocabulary service is currently being established by the Australian National Data Services (ANDS). Data citation is another important aspect of the NCI data infrastructure, which allows tracking of data usage and infrastructure investment, encourages data sharing, and increases trust in research that is reliant on these data collections. We incorporate the standard vocabularies into the data citation metadata so that data citations become machine-readable and semantically friendly for web-search purposes as well. By standardizing our metadata structure across our entire data corpus, we are laying the foundation to enable the application of appropriate semantic mechanisms to enhance discovery and analysis of NCI's national environmental research data information. We expect that this will further increase data discoverability and encourage data sharing and reuse within the community, increasing the value of the data well beyond its current use.

  13. Data sharing by scientists: Practices and perceptions

    USGS Publications Warehouse

    Tenopir, C.; Allard, S.; Douglass, K.; Aydinoglu, A.U.; Wu, L.; Read, E.; Manoff, M.; Frame, M.

    2011-01-01

    Background: Scientific research in the 21st century is more data intensive and collaborative than in the past. It is important to study the data practices of researchers - data accessibility, discovery, re-use, preservation and, particularly, data sharing. Data sharing is a valuable part of the scientific method allowing for verification of results and extending research from prior results. Methodology/Principal Findings: A total of 1329 scientists participated in this survey exploring current data sharing practices and perceptions of the barriers and enablers of data sharing. Scientists do not make their data electronically available to others for various reasons, including insufficient time and lack of funding. Most respondents are satisfied with their current processes for the initial and short-term parts of the data or research lifecycle (collecting their research data; searching for, describing or cataloging, analyzing, and short-term storage of their data) but are not satisfied with long-term data preservation. Many organizations do not provide support to their researchers for data management both in the short- and long-term. If certain conditions are met (such as formal citation and sharing reprints) respondents agree they are willing to share their data. There are also significant differences and approaches in data management practices based on primary funding agency, subject discipline, age, work focus, and world region. Conclusions/Significance: Barriers to effective data sharing and preservation are deeply rooted in the practices and culture of the research process as well as the researchers themselves. New mandates for data management plans from NSF and other federal agencies and world-wide attention to the need to share and preserve data could lead to changes. Large scale programs, such as the NSF-sponsored DataNET (including projects like DataONE) will both bring attention and resources to the issue and make it easier for scientists to apply sound data management principles. © 2011 Tenopir et al.

  14. Data Linkage Strategies to Advance Youth Suicide Prevention: A Systematic Review for a National Institutes of Health Pathways to Prevention Workshop.

    PubMed

    Wilcox, Holly C; Kharrazi, Hadi; Wilson, Renee F; Musci, Rashelle J; Susukida, Ryoko; Gharghabi, Fardad; Zhang, Allen; Wissow, Lawrence; Robinson, Karen A

    2016-12-06

    Linking national, state, and community data systems to data from prevention programs could allow for longer-term assessment of outcomes and evaluation of interventions to prevent suicide. To identify and describe data systems that can be linked to data from prevention studies to advance youth suicide prevention research. A systematic review, an environmental scan, and a targeted search were conducted to identify prevention studies and potentially linkable external data systems with suicide outcomes from January 1990 through December 2015. Studies and data systems had to be U.S.-based and include persons aged 25 years or younger. Data systems also had to include data on suicide, suicide attempt, or suicidal ideation. Information about participants, intervention type, suicide outcomes, primary analytic method used for linkage, statistical approach, analyses performed, and characteristics of data systems was abstracted by 2 reviewers. Of 47 studies (described in 59 articles) identified in the systematic review, only 6 were already linked to data systems. A total of 153 unique and potentially linkable data systems were identified, but only 66 were classified as "fairly accessible" and had data dictionaries available. Of the data systems identified, 19% were established primarily for research, 11% for clinical care or operations, 29% for administrative services (such as billing), and 52% for surveillance. About one third (37%) provided national data, 12% provided regional data, 63% provided state data, and 41% provided data below the state level (some provided coverage for >1 geographic unit). Only U.S.-based studies published in English were included. There is untapped potential to evaluate and enhance suicide prevention efforts by linking suicide prevention data with existing data systems. However, sparse availability of data dictionaries and lack of adherence to standard data elements limit this potential. Agency for Healthcare Research and Quality.

  15. Data Sharing by Scientists: Practices and Perceptions

    PubMed Central

    Tenopir, Carol; Allard, Suzie; Douglass, Kimberly; Aydinoglu, Arsev Umur; Wu, Lei; Read, Eleanor; Manoff, Maribeth; Frame, Mike

    2011-01-01

    Background Scientific research in the 21st century is more data intensive and collaborative than in the past. It is important to study the data practices of researchers – data accessibility, discovery, re-use, preservation and, particularly, data sharing. Data sharing is a valuable part of the scientific method allowing for verification of results and extending research from prior results. Methodology/Principal Findings A total of 1329 scientists participated in this survey exploring current data sharing practices and perceptions of the barriers and enablers of data sharing. Scientists do not make their data electronically available to others for various reasons, including insufficient time and lack of funding. Most respondents are satisfied with their current processes for the initial and short-term parts of the data or research lifecycle (collecting their research data; searching for, describing or cataloging, analyzing, and short-term storage of their data) but are not satisfied with long-term data preservation. Many organizations do not provide support to their researchers for data management both in the short- and long-term. If certain conditions are met (such as formal citation and sharing reprints) respondents agree they are willing to share their data. There are also significant differences and approaches in data management practices based on primary funding agency, subject discipline, age, work focus, and world region. Conclusions/Significance Barriers to effective data sharing and preservation are deeply rooted in the practices and culture of the research process as well as the researchers themselves. New mandates for data management plans from NSF and other federal agencies and world-wide attention to the need to share and preserve data could lead to changes. Large scale programs, such as the NSF-sponsored DataNET (including projects like DataONE) will both bring attention and resources to the issue and make it easier for scientists to apply sound data management principles. PMID:21738610

  16. Fundamental Data Standards for Science Data System Interoperability and Data Correlation

    NASA Astrophysics Data System (ADS)

    Hughes, J. Steven; Gopala Krishna, Barla; Rye, Elizabeth; Crichton, Daniel

    The advent of the Web and languages such as XML has brought an explosion of online science data repositories and the promise of correlated data and interoperable systems. However, there have been relatively few successes in meeting the expectations of science users in the internet age. For example, a Google-like search for images of Mars will return many highly-derived and appropriately tagged images but largely ignore the majority of images in most online image repositories. Once retrieved, users are further frustrated by poor data descriptions, arcane formats, and badly organized ancillary information. A wealth of research indicates that shared information models are needed to enable system interoperability and data correlation. However, at a more fundamental level, data correlation and system interoperability are dependent on a relatively few shared data standards. A common data dictionary standard, for example, allows the controlled vocabulary used in a science repository to be shared with potential collaborators. Common data registry and product identification standards enable systems to efficiently find, locate, and retrieve data products and their metadata from remote repositories. Information content standards define categories of descriptive data that help make the data products scientifically useful to users who were not part of the original team that produced the data. The Planetary Data System (PDS) has a plan to move the PDS to a fully online, federated system. This plan addresses new demands on the system including increasing data volume, numbers of missions, and complexity of missions. A key component of this plan is the upgrade of the PDS data standards. The adoption of the core PDS data standards by the International Planetary Data Alliance (IPDA) adds the element of international cooperation to the plan. This presentation will provide an overview of the fundamental data standards being adopted by the PDS that transcend science domains and that will help to meet the PDS's and IPDA's system interoperability and data correlation requirements.
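
    As a rough illustration of what a shared data dictionary buys, the sketch below validates a product label against a small controlled vocabulary. The attribute names and permitted values are invented for the example and are not drawn from the actual PDS data dictionary.

```python
# Minimal sketch of controlled-vocabulary validation against a shared data
# dictionary; attribute names and permitted values are illustrative only.
DATA_DICTIONARY = {
    "target_name": {"MARS", "MOON", "EUROPA"},
    "instrument_type": {"CAMERA", "SPECTROMETER", "RADAR"},
}

def validate_label(label: dict) -> list:
    """Return a list of problems found in a product label."""
    problems = []
    for attribute, value in label.items():
        allowed = DATA_DICTIONARY.get(attribute)
        if allowed is None:
            problems.append(f"unknown attribute: {attribute}")
        elif value not in allowed:
            problems.append(f"{attribute}={value!r} not in controlled vocabulary")
    return problems

print(validate_label({"target_name": "MARS", "instrument_type": "LIDAR"}))
# ["instrument_type='LIDAR' not in controlled vocabulary"]
```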

  17. Integrating Experiential and Distributional Data to Learn Semantic Representations

    ERIC Educational Resources Information Center

    Andrews, Mark; Vigliocco, Gabriella; Vinson, David

    2009-01-01

    The authors identify 2 major types of statistical data from which semantic representations can be learned. These are denoted as "experiential data" and "distributional data". Experiential data are derived by way of experience with the physical world and comprise the sensory-motor data obtained through sense receptors. Distributional data, by…

  18. Alternative Fuels Data Center: New Hampshire Transportation Data for

    Science.gov Websites

  19. Alternative Fuels Data Center: New Mexico Transportation Data for

    Science.gov Websites

  20. Alternative Fuels Data Center: New Jersey Transportation Data for

    Science.gov Websites

  1. Records Management Handbook; Source Data Automation Equipment Guide.

    ERIC Educational Resources Information Center

    National Archives and Records Service (GSA), Washington, DC. Office of Records Management.

    A detailed guide to selecting appropriate source data automation equipment is presented. Source data automation equipment is used to prepare data for electronic data processing or computerized recordkeeping. The guide contains specifications, performance data, cost, and pictures of the major types of machines used in source data automation.…

  2. Data Systems and Reports as Active Participants in Data Interpretation

    ERIC Educational Resources Information Center

    Rankin, Jenny Grant

    2016-01-01

    Most data-informed decision-making in education is undermined by flawed interpretations. Educator-driven interventions to improve data use are beneficial but not omnipotent, as data misunderstandings persist at schools and school districts commended for ideal data use support. Meanwhile, most data systems and reports display figures without…

  3. Alternative Fuels Data Center: North Dakota Transportation Data for

    Science.gov Websites

  4. Data, Data Everywhere but Not a Byte to Read: Managing Monitoring Information.

    ERIC Educational Resources Information Center

    Stafford, Susan G.

    1993-01-01

    Describes the Forest Science Data Bank that contains 2,400 data sets from over 350 existing ecological studies. Database features described include involvement of the scientific community; database documentation; data quality assurance; security; data access and retrieval; and data import/export flexibility. Appendices present the Quantitative…

  5. 17 CFR 45.3 - Swap data reporting: creation data.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 17 Commodity and Securities Exchanges 2 2014-04-01 2014-04-01 false Swap data reporting: creation... (CONTINUED) SWAP DATA RECORDKEEPING AND REPORTING REQUIREMENTS § 45.3 Swap data reporting: creation data. Registered entities and swap counterparties must report required swap creation data electronically to a swap...

  6. Advantages of Integrative Data Analysis for Developmental Research

    ERIC Educational Resources Information Center

    Bainter, Sierra A.; Curran, Patrick J.

    2015-01-01

    Amid recent progress in cognitive development research, high-quality data resources are accumulating, and data sharing and secondary data analysis are becoming increasingly valuable tools. Integrative data analysis (IDA) is an exciting analytical framework that can enhance secondary data analysis in powerful ways. IDA pools item-level data across…

  7. 14 CFR 60.13 - FSTD objective data requirements.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ..., the data made available to the NSPM (the validation data package) must include the aircraft...) The validation data package may contain flight test data from a source in addition to or independent..., as described in the applicable QPS. (c) The validation data package may also contain predicted data...

  8. 14 CFR 60.13 - FSTD objective data requirements.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ..., the data made available to the NSPM (the validation data package) must include the aircraft...) The validation data package may contain flight test data from a source in addition to or independent..., as described in the applicable QPS. (c) The validation data package may also contain predicted data...

  9. 14 CFR 60.13 - FSTD objective data requirements.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ..., the data made available to the NSPM (the validation data package) must include the aircraft...) The validation data package may contain flight test data from a source in addition to or independent..., as described in the applicable QPS. (c) The validation data package may also contain predicted data...

  10. 14 CFR 60.13 - FSTD objective data requirements.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ..., the data made available to the NSPM (the validation data package) must include the aircraft...) The validation data package may contain flight test data from a source in addition to or independent..., as described in the applicable QPS. (c) The validation data package may also contain predicted data...

  11. 14 CFR 60.13 - FSTD objective data requirements.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ..., the data made available to the NSPM (the validation data package) must include the aircraft...) The validation data package may contain flight test data from a source in addition to or independent..., as described in the applicable QPS. (c) The validation data package may also contain predicted data...

  12. Frequently Asked Questions | DOE Data Explorer

    Science.gov Websites

    What does DDE contain? DDE has three types of data records: data Projects, data Collections, and individual Datasets. The data themselves reside at data centers, user facilities, colleges and universities, or other websites.

  13. ECS DAAC Data Pools

    NASA Astrophysics Data System (ADS)

    Kiebuzinski, A. B.; Bories, C. M.; Kalluri, S.

    2002-12-01

    As part of its Earth Observing System (EOS), NASA supports operations for several satellites including Landsat 7, Terra, and Aqua. ECS (EOSDIS Core System) is a vast archival and distribution system and includes several Distributed Active Archive Centers (DAACs) located around the United States. EOSDIS reached a milestone in February when its data holdings exceeded one petabyte (1,000 terabytes) in size. It has been operational since 1999 and originally was intended to serve a large community of Earth Science researchers studying global climate change. The Synergy Program was initiated in 2000 with the purpose of exploring and expanding the use of remote sensing data beyond the traditional research community to the applications community, including natural resource managers, disaster/emergency managers, urban planners and others. This included facilitating data access at the DAACs to enable non-researchers to exploit the data for their specific applications. The combined volume of data archived daily across the DAACs is of the order of three terabytes. These archived data are made available to the research community and to general users of ECS data. Currently, the average data volume distributed daily is two terabytes, which, combined with an ever-increasing need for timely access to these data, taxes the ECS processing and archival resources beyond the research-oriented use originally intended. As a result, the delivery of data sets to users was being delayed, in many cases to unacceptable degrees. Raytheon, under the auspices of the Synergy Program, investigated methods of making data more accessible at a lower cost in resources (processing and archival) at the DAACs. Large on-line caches (as big as 70 terabytes) of data were determined to be a solution that would allow users who require contemporary data to access them without having to pull them from the archive. These on-line caches are referred to as "Data Pools." In the Data Pool concept, data are inserted via subscriptions based on ECS events, for example, arrival of data matching a specific spatial context. Upon acquisition, these data are written to the Data Pools as well as to the permanent archive. The data are then accessed via a public Web interface, which provides a drilldown search using data group, spatial, temporal and other flags. The result set is displayed as a list of ftp links to the data, which the user can click to download directly. Data Pool holdings are continuously renewed as data are allowed to expire and are replaced by more current insertions. In addition, the Data Pools may also house data sets that, though not contemporary, receive significant user attention, e.g., following a Chernobyl-type incident, a flood, or a forest fire. The benefits are that users who require contemporary data can access the data immediately (within 24 hours of acquisition) through a much improved access technique. Users not requiring contemporary data benefit from the Data Pools by having greater archival and processing resources (and a shorter processing queue) made available to them. All users now benefit from the capability to place standing data orders for data matching a geographic context (spatial subscription), a capability also developed under the Synergy Program. The Data Pools are currently being installed and checked at each of the DAACs. Additionally, several improvements to the search capabilities, data manipulation tools and overall storage capacity are being developed and will be installed in the first quarter of 2003.
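
    The subscription-driven insertion and expiration described above might, in outline, look like the sketch below, where newly acquired granules that match a spatial subscription are inserted into an expiring on-line pool alongside the permanent archive. The class, method names, retention window, and footprint test are illustrative assumptions, not the actual ECS software.

```python
import datetime

class DataPool:
    """Sketch of an on-line cache holding recent granules alongside the archive."""
    def __init__(self, retention_days=30):
        self.retention = datetime.timedelta(days=retention_days)
        self.holdings = {}   # granule_id -> (insert_time, ftp_url)

    def insert(self, granule_id, ftp_url, now=None):
        now = now or datetime.datetime.utcnow()
        self.holdings[granule_id] = (now, ftp_url)

    def expire(self, now=None):
        """Drop granules older than the retention window, mimicking pool turnover."""
        now = now or datetime.datetime.utcnow()
        expired = [g for g, (t, _) in self.holdings.items() if now - t > self.retention]
        for g in expired:
            del self.holdings[g]

def matches_subscription(granule, bbox):
    """Spatial-context subscription: keep granules whose footprint overlaps the bbox."""
    w, s, e, n = bbox
    gw, gs, ge, gn = granule["footprint"]
    return not (ge < w or gw > e or gn < s or gs > n)

pool = DataPool()
granule = {"id": "MOD09.A2002001", "footprint": (-105, 35, -100, 40),
           "ftp_url": "ftp://example.invalid/pool/MOD09.A2002001.hdf"}
if matches_subscription(granule, bbox=(-110, 30, -95, 45)):
    pool.insert(granule["id"], granule["ftp_url"])   # also written to permanent archive
print(pool.holdings)
```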

  14. Enabling innovative research by supporting the life cycle of high frequency streaming sensor data in the Global Lake Ecological Observatory Network (GLEON)

    NASA Astrophysics Data System (ADS)

    Gries, C.; Read, J. S.; Winslow, L. A.; Hanson, P. C.; Weathers, K. C.

    2014-12-01

    The Global Lake Ecological Observatory Network (GLEON) is an international community of scientists, educators and citizens with the mission to conduct innovative science by sharing and interpreting high-resolution sensor data to understand, predict and communicate the role and response of lakes in a changing global environment. During its almost ten years of existence and continual growth, GLEON has inspired innovative science and new modeling approaches, and has accumulated extensive experience in the management of streaming, high-resolution, and large-volume data. However, a recent 'data task force' identified inhibiting data infrastructure issues, including providing access to data, discovering distributed data, and integrating data into useful data products for scientific research and management. Accordingly, in support of the complete data lifecycle, tools are being developed by the GLEON community and integrated with innovative technology from other groups to improve environmental observation data management in the broader community. Specifically, we will discuss raw data handling with tools developed by the Consortium of Universities for the Advancement of Hydrologic Science (CUAHSI; Observations Data Model and DataLoader), quality control practices using a newly developed R package (sensorQC), data access with HydroDesktop and web services delivering WaterML, data analysis with the R package rLakeAnalyzer, and final storage of the quality-controlled, harmonized and value-added data product in a DataONE member node. Such a data product is then discoverable, accessible for new analyses, and citable in subsequent publications. Leveraging GLEON's organizational structure, community trust, extensive experience, and technological talent, the goal is to develop a design and implementation plan for a data publishing and sharing system that will address not only GLEON's needs, but also those of other environmental research communities.
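
    As a simple illustration of the kind of quality control applied to streaming sensor data, the sketch below flags out-of-range values and spikes in a temperature-like series. It is written in Python purely for illustration and does not reproduce the API or checks of the R sensorQC package mentioned above; the thresholds are placeholders.

```python
def qc_flags(values, valid_range=(0.0, 40.0), max_step=5.0):
    """Flag water-temperature-like readings that fall outside a plausible range
    or jump more than max_step between consecutive observations."""
    flags = []
    previous = None
    for v in values:
        flag = "ok"
        if not (valid_range[0] <= v <= valid_range[1]):
            flag = "out_of_range"
        elif previous is not None and abs(v - previous) > max_step:
            flag = "spike"
        flags.append(flag)
        previous = v
    return flags

print(qc_flags([12.1, 12.3, 25.0, 12.4, -9999.0]))
# ['ok', 'ok', 'spike', 'spike', 'out_of_range']
```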

  15. Topographic data requirements for EOS global change research

    USGS Publications Warehouse

    Gesch, Dean B.

    1994-01-01

    This document is a result of Earth Observing System Data and Information System (EOSDIS) Version 0 activities of the Land Processes Distributed Active Archive Center at the U.S. Geological Survey's EROS Data Center. A relatively small part of the Version 0 funding provided by NASA is used to address topographic data issues related to EOS. These issues include identifying and improving access to existing sources of topographic data, data generation, facilitating the use of topographic data in global change research by demonstrating derivative products, and inventorying the specific topographic data requirements of EOS investigators. There is a clear need for global topographic data in EOSDIS. Only 10 percent of the global land surface is covered by high-resolution data that are available to the global change science community. Alternative sources for new data exist or have been proposed; however, none of them alone can fulfill the data requirements by the launch of the first EOS platform in 4 years. There is no operational provider of all the source data that are required. Even if appropriate global source data existed, a concerted production effort would be necessary to ensure the existence of the requisite topographic data before EOS launch. Additionally, no funding from NASA or any other agency has been appropriated for a new mapping mission or for other means of data acquisition. This effort to document requirements is an initial step toward understanding the severity of the data shortage. It is well beyond the scope of Version 0 funding and capabilities to provide the required data in the proper timeframe. The shortage of data and the lack of a plan for providing the necessary topographic data through EOSDIS in time for launch are issues that must be addressed by the EOS program.

  16. The data paper: a mechanism to incentivize data publishing in biodiversity science.

    PubMed

    Chavan, Vishwas; Penev, Lyubomir

    2011-01-01

    Free and open access to primary biodiversity data is essential for informed decision-making to achieve conservation of biodiversity and sustainable development. However, primary biodiversity data are neither easily accessible nor discoverable. Among several impediments, one is a lack of incentives to data publishers for publishing their data resources. One such mechanism currently lacking is recognition through conventional scholarly publication of enriched metadata, which should ensure rapid discovery of 'fit-for-use' biodiversity data resources. We review the state of the art of data discovery options and the mechanisms in place for incentivizing data publishers' efforts towards easy, efficient and enhanced publishing, dissemination, sharing and re-use of biodiversity data. We propose the establishment of the 'biodiversity data paper' as one possible mechanism to offer scholarly recognition for efforts and investment by data publishers in authoring rich metadata and publishing them as citable academic papers. While detailing the benefits to data publishers, we describe the objectives, workflow and outcomes of the pilot project commissioned by the Global Biodiversity Information Facility in collaboration with scholarly publishers and pioneered by Pensoft Publishers through its journals ZooKeys, PhytoKeys, MycoKeys, BioRisk, NeoBiota, Nature Conservation and the forthcoming Biodiversity Data Journal. We then debate further enhancements of the data paper beyond the pilot project and attempt to forecast the future uptake of data papers as an incentivization mechanism by the stakeholder communities. We believe that in addition to recognition for those involved in the data publishing enterprise, data papers will also expedite publishing of fit-for-use biodiversity data resources. However, uptake and establishment of the data paper as a potential mechanism of scholarly recognition requires a high degree of commitment and investment by the cross-sectional stakeholder communities.

  17. Advancing Long Tail Data Capture and Access Through Trusted, Community-Driven Data Services at the IEDA Data Facility

    NASA Astrophysics Data System (ADS)

    Lehnert, K. A.; Carbotte, S. M.; Ferrini, V.; Hsu, L.; Arko, R. A.; Walker, J. D.; O'hara, S. H.

    2012-12-01

    Substantial volumes of data in the Earth Sciences are collected in small- to medium-size projects by individual investigators or small research teams, known as the 'Long Tail' of science. Traditionally, these data have largely stayed 'in the dark', i.e. they have not been properly archived, and have therefore been inaccessible and underutilized. The primary reason has been the lack of appropriate infrastructure, from adequate repositories to resources and support for investigators to properly manage their data, to community standards and best practices. Lack of credit for data management and for the data themselves has contributed to the reluctance of investigators to share their data. IEDA (Integrated Earth Data Applications), an NSF-funded data facility for solid earth geoscience data, has developed a comprehensive suite of data services that are designed to address the concerns and needs of investigators. IEDA's data publication service registers datasets with DOIs and ensures their proper citation and attribution. IEDA is working with publishers on advanced linkages between datasets in the IEDA repository and scientific online articles to facilitate access to the data, enhance their visibility, and augment their use and citation. IEDA's investigator support ranges from individual support for data management to tools, tutorials, and virtual or face-to-face workshops that guide and assist investigators with data management planning, data submission, and data documentation. A critical aspect of IEDA's concept has been the disciplinary expertise within the team and its strong liaison with the science community, as well as community-based governance. These have been fundamental in gaining the trust and support of the community and have led to significantly improved data preservation and access in the communities served by IEDA.

  18. A New Data Management System for Biological and Chemical Oceanography

    NASA Astrophysics Data System (ADS)

    Groman, R. C.; Chandler, C.; Allison, D.; Glover, D. M.; Wiebe, P. H.

    2007-12-01

    The Biological and Chemical Oceanography Data Management Office (BCO-DMO) was created to serve PIs principally funded by NSF to conduct marine chemical and ecological research. The new office is dedicated to providing open access to data and information developed in the course of scientific research on short and intermediate time frames. The data management system developed in support of the U.S. JGOFS and U.S. GLOBEC programs is being modified to support the larger scope of the BCO-DMO effort, which ultimately includes providing a way to exchange data with other data systems. The open access system is based on a philosophy of data stewardship, support for existing and evolving data standards, and use of public domain software. The DMO staff work closely with originating PIs to manage data gathered as part of their individual programs. In the new BCO-DMO data system, project and data set metadata records designed to support re-use of the data are stored in a relational database (MySQL), and the data are stored in, or made accessible by, the JGOFS/GLOBEC object-oriented, relational data management system. Data access will be provided via any standard Web browser through a GIS application (the open source, OGC-compliant MapServer), a directory listing from the data holdings catalog, or a custom search engine that facilitates data discovery. In an effort to maximize data system interoperability, data will also be available via Web services, and data set descriptions will be generated to comply with a variety of metadata content standards. The office is located at the Woods Hole Oceanographic Institution and web access is via http://www.bco-dmo.org.

  19. A Qualitative Readiness-Requirements Assessment Model for Enterprise Big-Data Infrastructure Investment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Olama, Mohammed M; McNair, Wade; Sukumar, Sreenivas R

    2014-01-01

    In the last three decades, there has been an exponential growth in the area of information technology providing the information processing needs of data-driven businesses in government, science, and private industry in the form of capturing, staging, integrating, conveying, analyzing, and transferring data that will help knowledge workers and decision makers make sound business decisions. Data integration across enterprise warehouses is one of the most challenging steps in the big data analytics strategy. Several levels of data integration have been identified across enterprise warehouses: data accessibility, common data platform, and consolidated data model. Each level of integration has its own set of complexities that requires a certain amount of time, budget, and resources to implement. Such levels of integration are designed to address the technical challenges inherent in consolidating the disparate data sources. In this paper, we present a methodology based on industry best practices to measure the readiness of an organization and its data sets against the different levels of data integration. We introduce a new Integration Level Model (ILM) tool, which is used for quantifying an organization and data system's readiness to share data at a certain level of data integration. It is based largely on the established and accepted framework provided in the Data Management Association (DAMA-DMBOK). It comprises several key data management functions and supporting activities, together with several environmental elements that describe and apply to each function. The proposed model scores the maturity of a system's data governance processes and provides a pragmatic methodology for evaluating integration risks. The higher the computed scores, the better managed the source data system and the greater the likelihood that the data system can be brought in at a higher level of integration.
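
    A minimal sketch of how readiness scoring along these lines could be computed is given below. The data management functions, weights, maturity scale, and the mapping of scores onto the three integration levels are placeholders for illustration, not the actual ILM or DAMA-DMBOK scoring scheme.

```python
# Illustrative readiness scoring across data management functions (0-5 maturity scale).
# Function names and weights are placeholders, not the actual ILM/DAMA-DMBOK scheme.
ASSESSMENT = {
    "data_governance":     {"score": 3, "weight": 0.30},
    "metadata_management": {"score": 2, "weight": 0.25},
    "data_quality":        {"score": 4, "weight": 0.25},
    "security_and_access": {"score": 3, "weight": 0.20},
}

def readiness_score(assessment):
    """Weighted average maturity, normalized to 0-100."""
    total = sum(f["score"] * f["weight"] for f in assessment.values())
    max_total = sum(5 * f["weight"] for f in assessment.values())
    return 100.0 * total / max_total

def integration_level(score):
    """Map the score onto the three integration levels named in the abstract."""
    if score >= 75:
        return "consolidated data model"
    if score >= 50:
        return "common data platform"
    return "data accessibility"

score = readiness_score(ASSESSMENT)
print(f"{score:.1f} -> {integration_level(score)}")   # 60.0 -> common data platform
```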

  20. Benefits of a clinical data warehouse with data mining tools to collect data for a radiotherapy trial

    PubMed Central

    Roelofs, Erik; Persoon, Lucas; Nijsten, Sebastiaan; Wiessler, Wolfgang; Dekker, André; Lambin, Philippe

    2016-01-01

    Introduction Collecting trial data in a medical environment is at present mostly performed manually and therefore time-consuming, prone to errors and often incomplete with the complex data considered. Faster and more accurate methods are needed to improve the data quality and to shorten data collection times where information is often scattered over multiple data sources. The purpose of this study is to investigate the possible benefit of modern data warehouse technology in the radiation oncology field. Material and methods In this study, a Computer Aided Theragnostics (CAT) data warehouse combined with automated tools for feature extraction was benchmarked against the regular manual data-collection processes. Two sets of clinical parameters were compiled for non-small cell lung cancer (NSCLC) and rectal cancer, using 27 patients per disease. Data collection times and inconsistencies were compared between the manual and the automated extraction method. Results The average time per case to collect the NSCLC data manually was 10.4 ± 2.1 min and 4.3 ± 1.1 min when using the automated method (p < 0.001). For rectal cancer, these times were 13.5 ± 4.1 and 6.8 ± 2.4 min, respectively (p < 0.001). In 3.2% of the data collected for NSCLC and 5.3% for rectal cancer, there was a discrepancy between the manual and automated method. Conclusions Aggregating multiple data sources in a data warehouse combined with tools for extraction of relevant parameters is beneficial for data collection times and offers the ability to improve data quality. The initial investments in digitizing the data are expected to be compensated due to the flexibility of the data analysis. Furthermore, successive investigations can easily select trial candidates and extract new parameters from the existing databases. PMID:23394741

  1. Benefits of a clinical data warehouse with data mining tools to collect data for a radiotherapy trial.

    PubMed

    Roelofs, Erik; Persoon, Lucas; Nijsten, Sebastiaan; Wiessler, Wolfgang; Dekker, André; Lambin, Philippe

    2013-07-01

    Collecting trial data in a medical environment is at present mostly performed manually and therefore time-consuming, prone to errors and often incomplete with the complex data considered. Faster and more accurate methods are needed to improve the data quality and to shorten data collection times where information is often scattered over multiple data sources. The purpose of this study is to investigate the possible benefit of modern data warehouse technology in the radiation oncology field. In this study, a Computer Aided Theragnostics (CAT) data warehouse combined with automated tools for feature extraction was benchmarked against the regular manual data-collection processes. Two sets of clinical parameters were compiled for non-small cell lung cancer (NSCLC) and rectal cancer, using 27 patients per disease. Data collection times and inconsistencies were compared between the manual and the automated extraction method. The average time per case to collect the NSCLC data manually was 10.4 ± 2.1 min and 4.3 ± 1.1 min when using the automated method (p<0.001). For rectal cancer, these times were 13.5 ± 4.1 and 6.8 ± 2.4 min, respectively (p<0.001). In 3.2% of the data collected for NSCLC and 5.3% for rectal cancer, there was a discrepancy between the manual and automated method. Aggregating multiple data sources in a data warehouse combined with tools for extraction of relevant parameters is beneficial for data collection times and offers the ability to improve data quality. The initial investments in digitizing the data are expected to be compensated due to the flexibility of the data analysis. Furthermore, successive investigations can easily select trial candidates and extract new parameters from the existing databases. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.

  2. Restoration of Apollo Data by the NSSDC and the PDS Lunar Data Node

    NASA Technical Reports Server (NTRS)

    Williams, David R.; Hills, H. Kent; Lowman, Paul D.; Taylor, Patrick T.; Guinness, Edward A.

    2011-01-01

    The Lunar Data Node (LDN), under the auspices of the Geosciences Node of the Planetary Data System (PDS), is restoring Apollo data archived at the National Space Science Data Center. The Apollo data were archived on older media (7-track tapes, microfilm, microfiche) and in obsolete digital formats, which limits use of the data. The LDN is making these data accessible by restoring them to standard formats and archiving them through PDS. The restoration involves reading the older media, collecting supporting data (metadata), deciphering and understanding the data, and organizing them into a data set. The data undergo a peer review before archive at PDS. We will give an update on last year's work. We have scanned notebooks from Otto Berg, P.I. for the Lunar Ejecta and Meteorites Experiment. These notebooks contain information on the data and calibration coefficients which we hope to be able to use to restore the raw data into a usable archive. We have scanned Apollo 14 and 15 Dust Detector data from microfilm and are in the process of archiving the scans with PDS. We are also restoring raw dust detector data from magnetic tape supplied by Yosio Nakamura (UT Austin). Seiichi Nagihara (Texas Tech Univ.) and others in cooperation with NSSDC are recovering ARCSAV tapes (tapes containing raw data streams from all the ALSEP instruments). We will be preparing these data for archive with PDS. We are also in the process of recovering and archiving data not previously archived from the Apollo 16 Gamma Ray Spectrometer and the Apollo 17 Infrared Spectrometer.

  3. Design and study of geosciences data share platform :platform framework, data interoperability, share approach

    NASA Astrophysics Data System (ADS)

    Lu, H.; Yi, D.

    2010-12-01

    Deep exploration is one of the important approaches to geoscience research. We began deep exploration in the 1980s and have since accumulated a large volume of data. Researchers usually integrate data from both space exploration and deep exploration to study geological structures and represent the Earth's subsurface, and they analyze and interpret results on the basis of the integrated data. Because the exploration approaches differ, the resulting data are heterogeneous, and obtaining the data has therefore been a persistent source of confusion for researchers. The problem of data sharing and interoperation has to be solved during the development of the SinoProbe research project. Drawing on well-known domestic and international exploration projects and geoscience data platforms, this work explores a solution for data sharing and interoperation. Based on a service-oriented architecture (SOA), we present a deep-exploration data sharing framework comprising three levels: the data level addresses data storage and the integration of heterogeneous data; the middle level provides geophysics, geochemistry, and other data services by means of Web services and supports application composition using GIS middleware and the Eclipse RCP; the interaction level gives professional and non-professional users access to data of different accuracies. The framework adopts the GeoSciML data interchange approach. GeoSciML is a geoscience information markup language, developed as an application of the OpenGIS Consortium's (OGC) Geography Markup Language (GML). It maps heterogeneous data into a common Earth reference frame and enables interoperation. In this article we discuss how heterogeneous data are integrated and shared within the SinoProbe project.

  4. Investigating Access Performance of Long Time Series with Restructured Big Model Data

    NASA Astrophysics Data System (ADS)

    Shen, S.; Ostrenga, D.; Vollmer, B.; Meyer, D. J.

    2017-12-01

    Data sets generated by models are substantially increasing in volume, due to increases in spatial and temporal resolution, and the number of output variables. Many users wish to download subsetted data in preferred data formats and structures, as it is getting increasingly difficult to handle the original full-size data files. For example, application research users, such as those involved with wind or solar energy, or extreme weather events, are likely only interested in daily or hourly model data at a single point or for a small area for a long time period, and prefer to have the data downloaded in a single file. With native model file structures, such as hourly data from NASA Modern-Era Retrospective analysis for Research and Applications Version-2 (MERRA-2), it may take over 10 hours to extract the parameters of interest at a single point for 30 years. The NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) is exploring methods to address this particular user need. One approach is to create value-added data by reconstructing the data files. Taking MERRA-2 data as an example, we have tested converting hourly data from one-day-per-file into different data cubes, such as one-month, one-year, or whole-mission. Performance is compared for reading local data files and accessing data through interoperable services, such as OPeNDAP. Results show that, compared to the original file structure, the new data cubes offer much better performance for accessing long time series. We have noticed that performance is associated with the cube size and structure, the compression method, and how the data are accessed. An optimized data cube structure will not only improve data access, but may also enable better online analytic services.
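
    A rough sketch of this kind of restructuring, using the xarray library to combine daily files into a single yearly cube and then pull a point time series from it, is shown below. The file names, the variable name T2M, and the chunk sizes are illustrative assumptions rather than the actual GES DISC workflow, and reading many files this way requires dask to be installed.

```python
import xarray as xr

# Combine a year of illustrative daily hourly files into one cube.
# Paths and the variable name "T2M" are placeholders for MERRA-2-like data.
daily = xr.open_mfdataset("merra2_hourly_2000-*.nc4", combine="by_coords")

# Rechunk so reads along the time axis are contiguous, which favors
# long-time-series extraction at a single grid point.
cube = daily.chunk({"time": -1, "lat": 10, "lon": 10})
cube.to_netcdf("merra2_hourly_2000_cube.nc4")

# Extracting a multi-year series then reduces to one nearest-neighbour
# selection per yearly cube instead of thousands of daily file reads.
point = xr.open_dataset("merra2_hourly_2000_cube.nc4")["T2M"].sel(
    lat=38.9, lon=-77.0, method="nearest"
)
print(point.sizes)
```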

  5. The data paper: a mechanism to incentivize data publishing in biodiversity science

    PubMed Central

    2011-01-01

    Background Free and open access to primary biodiversity data is essential for informed decision-making to achieve conservation of biodiversity and sustainable development. However, primary biodiversity data are neither easily accessible nor discoverable. Among several impediments, one is a lack of incentives to data publishers for publishing their data resources. One such mechanism currently lacking is recognition through conventional scholarly publication of enriched metadata, which should ensure rapid discovery of 'fit-for-use' biodiversity data resources. Discussion We review the state of the art of data discovery options and the mechanisms in place for incentivizing data publishers' efforts towards easy, efficient and enhanced publishing, dissemination, sharing and re-use of biodiversity data. We propose the establishment of the 'biodiversity data paper' as one possible mechanism to offer scholarly recognition for efforts and investment by data publishers in authoring rich metadata and publishing them as citable academic papers. While detailing the benefits to data publishers, we describe the objectives, workflow and outcomes of the pilot project commissioned by the Global Biodiversity Information Facility in collaboration with scholarly publishers and pioneered by Pensoft Publishers through its journals ZooKeys, PhytoKeys, MycoKeys, BioRisk, NeoBiota, Nature Conservation and the forthcoming Biodiversity Data Journal. We then debate further enhancements of the data paper beyond the pilot project and attempt to forecast the future uptake of data papers as an incentivization mechanism by the stakeholder communities. Conclusions We believe that in addition to recognition for those involved in the data publishing enterprise, data papers will also expedite publishing of fit-for-use biodiversity data resources. However, uptake and establishment of the data paper as a potential mechanism of scholarly recognition requires a high degree of commitment and investment by the cross-sectional stakeholder communities. PMID:22373175

  6. Key Lessons in Building "Data Commons": The Open Science Data Cloud Ecosystem

    NASA Astrophysics Data System (ADS)

    Patterson, M.; Grossman, R.; Heath, A.; Murphy, M.; Wells, W.

    2015-12-01

    Cloud computing technology has created a shift around data and data analysis by allowing researchers to push computation to data as opposed to having to pull data to an individual researcher's computer. Consequently, cloud-based resources can provide unique opportunities to capture computing environments used both to access raw data in its original form and also to create analysis products which may be the source of data for tables and figures presented in research publications. Since 2008, the Open Cloud Consortium (OCC) has operated the Open Science Data Cloud (OSDC), which provides scientific researchers with computational resources for storing, sharing, and analyzing large (terabyte and petabyte-scale) scientific datasets. OSDC has provided compute and storage services to over 750 researchers in a wide variety of data intensive disciplines. Recently, internal users have logged about 2 million core hours each month. The OSDC also serves the research community by colocating these resources with nearly a petabyte of public scientific datasets in a variety of fields, which are also accessible for download by the public. In our experience operating these resources, researchers are well served by "data commons," meaning cyberinfrastructure that colocates data archives, computing, and storage infrastructure and supports essential tools and services for working with scientific data. In addition to the OSDC public data commons, the OCC operates a data commons in collaboration with NASA and is developing a data commons for NOAA datasets. As cloud-based infrastructures for distributing and computing over data become more pervasive, we ask, "What does it mean to publish data in a data commons?" Here we present the OSDC perspective and discuss several services that are key in architecting data commons, including digital identifier services.

  7. NASA Earth Sciences Data Support System and Services for the Northern Eurasia Earth Science Partnership Initiative

    NASA Technical Reports Server (NTRS)

    Leptoukh, Gregory

    2005-01-01

    The presentation describes data management of NASA remote sensing data for the Northern Eurasia Earth Science Partnership Initiative (NEESPI). Many types of ground and integrative (e.g., satellite, GIS) data will be needed and many models must be applied, adapted or developed to properly understand the functioning of Northern Eurasia's cold and diverse regional system. Mechanisms for obtaining the requisite data sets and models and sharing them among the participating scientists are essential. The proposed project targets integration of remote sensing data from AVHRR, MODIS, and other NASA instruments on board US satellites (with potential expansion to data from non-US satellites), customized data products from climatology data sets (e.g., ISCCP, ISLSCP) and model data (e.g., NCEP/NCAR) into a single, well-architected data management system. It will utilize two existing components developed by the Goddard Earth Sciences Data & Information Services Center (GES DISC) at the NASA Goddard Space Flight Center: (1) an online archiving and distribution system that allows collection, processing and ingest of data from various sources into the online archive, and (2) a user-friendly, intelligent, web-based online visualization and analysis system, also known as Giovanni. The former includes various kinds of data preparation for seamless interoperability between measurements by different instruments. The latter provides convenient access to various geophysical parameters measured in the Northern Eurasia region without any need to learn complicated remote sensing data formats, or to retrieve and process large volumes of NASA data. Initial implementation of this data management system will concentrate on atmospheric data and surface data aggregated to coarse resolution to support collaborative environment and climate change studies and modeling, while at later stages, data from NASA and non-NASA satellites at higher resolution will be integrated into the system.

  8. Enabling Long-Term Earth Science Research: Changing Data Practices (Invited)

    NASA Astrophysics Data System (ADS)

    Baker, K. S.

    2013-12-01

    Data stewardship plans are shaped by our shared experiences. As a result, community engagement and collaborative activities are central to the stewardship of data. Since modes and mechanisms of engagement have changed, we benefit from asking anew: 'Who are the communities?' and 'What are the lessons learned?'. Data stewardship, with its long-term care perspective, is enriched by reflection on community experience. This presentation draws on data management issues and strategies originating from within long-term research communities as well as on recent studies informed by library and information science. Ethnographic case studies that capture project activities and histories are presented as resources for comparative analysis. Agency requirements and funding opportunities are stimulating collaborative endeavors focused on data re-use and archiving. Research groups including earth scientists, information professionals, and data systems designers are recognizing the possibilities for new ways of thinking about data in the digital arena. Together, these groups are re-conceptualizing and reconfiguring for data management and data curation. A differentiation between managing data for local use and producing data for re-use in locations and fields remote from the data origin is just one example of the concepts emerging to facilitate development of data management. While earth scientists as data generators have the responsibility to plan new workflows and documentation practices, data and information specialists have responsibility to promote best practices as well as to facilitate the development of community resources such as controlled vocabularies and data dictionaries. With data-centric activities and changing data practices, the potential for creating dynamic community information environments in conjunction with development of data facilities exists but remains elusive.

  9. Publication of sensor data in the long-term environmental monitoring infrastructure TERENO

    NASA Astrophysics Data System (ADS)

    Stender, V.; Schroeder, M.; Klump, J. F.

    2014-12-01

    Terrestrial Environmental Observatories (TERENO) is an interdisciplinary and long-term research project spanning an Earth observation network across Germany. It includes four test sites within Germany, from the North German lowlands to the Bavarian Alps, and is operated by six research centers of the Helmholtz Association. TERENO Northeast is one of the sub-observatories of TERENO and is operated by the German Research Centre for Geosciences GFZ in Potsdam. This observatory investigates geoecological processes in the northeastern lowland of Germany by collecting large amounts of environmentally relevant data. The success of long-term projects like TERENO depends on well-organized data management, data exchange between the partners involved, and the availability of the captured data. Data discovery and dissemination are facilitated not only through data portals of the regional TERENO observatories but also through a common spatial data infrastructure, TEODOOR (TEreno Online Data repOsitORry). TEODOOR bundles the data provided by the web services of the individual observatories and provides tools for data discovery, visualization and data access. The TERENO Northeast data infrastructure integrates data from more than 200 instruments and makes data available through standard web services. TEODOOR accesses the OGC Sensor Web Enablement (SWE) interfaces offered by the regional observatories. In addition to the SWE interface, TERENO Northeast also publishes time series of environmental sensor data through the online research data publication platform DataCite. The metadata required by DataCite are created in an automated process by extracting information from the SWE SensorML to create ISO 19115-compliant metadata. The GFZ data management tool kit panMetaDocs is used to register Digital Object Identifiers (DOIs) and preserve file-based datasets. In addition to DOIs, the International Geo Sample Number (IGSN) is used to uniquely identify research specimens.
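
    The automated metadata derivation described here might, in outline, resemble the sketch below, which pulls a couple of fields from a SensorML document and assembles a minimal DataCite-style metadata dictionary. The element paths, example document, DOI string, and field mapping are simplified placeholders, not the actual TEODOOR or panMetaDocs implementation.

```python
import xml.etree.ElementTree as ET

# Simplified placeholder document; real SensorML records are far richer.
SENSORML = """<sml:System xmlns:sml="http://www.opengis.net/sensorML/1.0.1">
  <sml:identification>
    <sml:Term name="longName">Soil moisture station DEMMIN-01</sml:Term>
  </sml:identification>
  <sml:contact>GFZ German Research Centre for Geosciences</sml:contact>
</sml:System>"""

def sensorml_to_datacite(xml_text, year, doi):
    """Extract a few fields and build a minimal DataCite-style metadata dict
    (identifier, creator, title, publisher, publication year)."""
    ns = {"sml": "http://www.opengis.net/sensorML/1.0.1"}
    root = ET.fromstring(xml_text)
    title = root.findtext(".//sml:Term[@name='longName']", namespaces=ns)
    publisher = root.findtext(".//sml:contact", namespaces=ns)
    return {
        "identifier": doi,
        "creators": [publisher],
        "title": title,
        "publisher": publisher,
        "publicationYear": year,
    }

print(sensorml_to_datacite(SENSORML, 2014, "10.1234/example-dataset"))
```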

  10. Investigating Access Performance of Long Time Series with Restructured Big Model Data

    NASA Technical Reports Server (NTRS)

    Shen, Suhung; Ostrenga, Dana M.; Vollmer, Bruce E.; Meyer, Dave

    2017-01-01

    Data sets generated by models are substantially increasing in volume, due to increases in spatial and temporal resolution, and the number of output variables. Many users wish to download subsetted data in preferred data formats and structures, as it is getting increasingly difficult to handle the original full-size data files. For example, application research users, such as those involved with wind or solar energy, or extreme weather events, are likely only interested in daily or hourly model data at a single point (or for a small area) for a long time period, and prefer to have the data downloaded in a single file. With native model file structures, such as hourly data from NASA Modern-Era Retrospective analysis for Research and Applications Version-2 (MERRA-2), it may take over 10 hours to extract the parameters of interest at a single point for 30 years. The NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) is exploring methods to address this particular user need. One approach is to create value-added data by reconstructing the data files. Taking MERRA-2 data as an example, we have tested converting hourly data from one-day-per-file into different data cubes, such as one-month or one-year. Performance is compared for reading local data files and accessing data through interoperable services, such as OPeNDAP. Results show that, compared to the original file structure, the new data cubes offer much better performance for accessing long time series. We have noticed that performance is associated with the cube size and structure, the compression method, and how the data are accessed. An optimized data cube structure will not only improve data access, but may also enable better online analysis services.

  11. A qualitative readiness-requirements assessment model for enterprise big-data infrastructure investment

    NASA Astrophysics Data System (ADS)

    Olama, Mohammed M.; McNair, Allen W.; Sukumar, Sreenivas R.; Nutaro, James J.

    2014-05-01

    In the last three decades, there has been an exponential growth in the area of information technology providing the information processing needs of data-driven businesses in government, science, and private industry in the form of capturing, staging, integrating, conveying, analyzing, and transferring data that will help knowledge workers and decision makers make sound business decisions. Data integration across enterprise warehouses is one of the most challenging steps in the big data analytics strategy. Several levels of data integration have been identified across enterprise warehouses: data accessibility, common data platform, and consolidated data model. Each level of integration has its own set of complexities that requires a certain amount of time, budget, and resources to implement. Such levels of integration are designed to address the technical challenges inherent in consolidating the disparate data sources. In this paper, we present a methodology based on industry best practices to measure the readiness of an organization and its data sets against the different levels of data integration. We introduce a new Integration Level Model (ILM) tool, which is used for quantifying an organization and data system's readiness to share data at a certain level of data integration. It is based largely on the established and accepted framework provided in the Data Management Association (DAMA-DMBOK). It comprises several key data management functions and supporting activities, together with several environmental elements that describe and apply to each function. The proposed model scores the maturity of a system's data governance processes and provides a pragmatic methodology for evaluating integration risks. The higher the computed scores, the better managed the source data system and the greater the likelihood that the data system can be brought in at a higher level of integration.

  12. Gap Assessment (FY 13 Update)

    DOE Data Explorer

    Getman, Dan

    2013-09-30

    To help guide its future data collection efforts, the DOE Geothermal Technologies Office (GTO) funded a data gap analysis in FY2012 to identify high-potential hydrothermal areas where critical data are needed. This analysis was updated in FY2013, and the resulting datasets are represented by this metadata. The original process was published in FY2012 and is available here: https://pangea.stanford.edu/ERE/db/GeoConf/papers/SGW/2013/Esposito.pdf Though there are many types of data that can be used for hydrothermal exploration, five types of exploration data were targeted for this analysis. These data types were selected for their regional reconnaissance potential, and include many of the primary exploration techniques currently used by the geothermal industry. The data types include: 1. well data 2. geologic maps 3. fault maps 4. geochemistry data 5. geophysical data To determine data coverage, metadata for exploration data (including data type, data status, and coverage information) were collected and catalogued from nodes on the National Geothermal Data System (NGDS). It is the intention of this analysis that the data be updated from this source in a semi-automated fashion as new datasets are added to the NGDS nodes. In addition to this upload, an online tool was developed to allow all geothermal data providers to access this assessment, to directly add metadata themselves, and to view the results of the analysis via maps of data coverage in Geothermal Prospector (http://maps.nrel.gov/gt_prospector). A grid of the contiguous U.S. was created with 88,000 10-km by 10-km grid cells, and each cell was populated with the status of data availability corresponding to the five data types. Using these five data coverage maps and the USGS Resource Potential Map, sites were identified for future data collection efforts. These sites signify both that the USGS has indicated high favorability of occurrence of geothermal resources and that data gaps exist. The uploaded data are contained in two data files for each data category. The first file contains the grid and is in the SHP (shapefile) format. Each populated grid cell represents a 10-km by 10-km area within which data are known to exist. The second file is a CSV (comma-separated value) file that contains all of the individual layers that intersected with the grid. This CSV can be joined with the map to retrieve a list of datasets that are available at any given site (a join of this kind is sketched below). The attributes in the CSV include: 1. grid_id: The ID of the grid cell that the data intersect with 2. title: The name of the WFS service that intersected with this grid cell 3. abstract: The description of the WFS service that intersected with this grid cell 4. gap_type: The category of data availability that these data fall within. As the current processing pulls data from the NGDS, this category universally represents data that are available in the NGDS and are ready for acquisition for analytic purposes. 5. proprietary_type: Whether the data are considered proprietary 6. service_type: The type of service 7. base_url: The service URL
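
    To illustrate how the two files per category might be used together, the sketch below joins the CSV of intersecting layers to the grid shapefile on grid_id with geopandas and pandas. File names and exact column spellings are assumptions based on the attribute list above, not the published dataset layout.

```python
# Sketch: join the per-category CSV of intersecting datasets to the 10-km grid
# shapefile on grid_id, then list datasets known to exist in one cell.
# File names and column spellings are assumed from the description above.
import geopandas as gpd
import pandas as pd

grid = gpd.read_file("well_data_grid.shp")        # populated 10-km grid cells
layers = pd.read_csv("well_data_layers.csv")      # one row per intersecting WFS layer

# Attach the intersecting-layer records to their grid cells.
joined = grid.merge(layers, on="grid_id", how="left")

# Datasets available within a single grid cell of interest.
cell = joined[joined["grid_id"] == 12345]
print(cell[["title", "abstract", "gap_type", "base_url"]])
```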

  13. Data Identifiers and Citations Enable Reproducible Science

    NASA Astrophysics Data System (ADS)

    Tilmes, C.

    2011-12-01

    Modern science often involves data processing with tremendous volumes of data. Keeping track of those data has been a growing challenge for data centers. Researchers who access and use the data do not always reference and cite their data sources adequately for consumers of their research to follow their methodology or reproduce their analyses or experiments. Recent research has led to recommendations for good identifiers and citations that can help address this problem. This paper will describe some of the best practices in data identification, referencing, and citation. Using a simplified example scenario based on a long-term remote sensing satellite mission, it will explore issues in identifying dynamic data sets and the importance of good data citations for reproducibility. It will describe the difference between granule- and collection-level identifiers, using UUIDs and DOIs to illustrate some recommendations for developing identifiers and assigning them during data processing. As data processors create data products, the provenance of the input products and the precise steps that led to their creation are recorded and published for users of the data to see. As researchers access the data from an archive, they can use the provenance to help understand the genesis of the data, which could affect their usage of the data. By citing the data when publishing their research, they enable others to retrieve the precise data used and to reproduce the analyses and experiments to confirm the results. Describing the experiment in sufficient detail to reproduce the research enforces a formal approach that lends credibility to the results and, ultimately, to the policies of decision makers depending on that research.
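
    As a minimal sketch of the granule-level identifier practice discussed here, the code below mints a UUID for each output granule and records the identifiers of its input granules as provenance. The record layout and field names are illustrative assumptions; the paper recommends practices rather than defining a schema.

```python
# Sketch: mint a granule-level identifier and record provenance of inputs.
# The record layout is illustrative, not a schema defined by the paper.
import uuid
import json
from datetime import datetime, timezone

def make_granule_record(product_name, input_granule_ids):
    """Return a provenance record for a newly produced data granule."""
    return {
        "granule_id": str(uuid.uuid4()),           # unique identifier for this granule
        "product": product_name,
        "created": datetime.now(timezone.utc).isoformat(),
        "inputs": list(input_granule_ids),         # identifiers of the input granules
        "processing_step": "level2_to_level3_gridding",   # hypothetical step name
    }

record = make_granule_record("L3_GRIDDED_EXAMPLE", ["input-granule-id-1", "input-granule-id-2"])
print(json.dumps(record, indent=2))
```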

  14. Data Access Services that Make Remote Sensing Data Easier to Use

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher

    2010-01-01

    This slide presentation reviews some of the processes that NASA uses to make remote sensing data easier to use over the World Wide Web. Preparing remote sensing data for use typically involves considerable research into data formats, geolocation structures, and quality indicators, often followed by coding a preprocessing program; only then are the data usable within the analysis tool of choice. The Goddard Earth Sciences Data and Information Services Center is deploying a variety of data access services that are designed to dramatically shorten the time consumed in the data preparation step. On-the-fly conversion to the standard Network Common Data Form (netCDF) format with Climate and Forecast (CF) conventions imposes a standard coordinate system framework that makes data instantly readable through several tools, such as the Integrated Data Viewer, the Grid Analysis and Display System, Panoply, and Ferret. A similar benefit is achieved by serving data through the Open Source Project for a Network Data Access Protocol (OPeNDAP), which also provides subsetting. The Data Quality Screening Service goes a step further by filtering out data points based on quality control flags, using science team recommendations or user-specified criteria. Further still is the Giovanni online analysis system, which goes beyond handling formatting and quality to provide visualization and basic statistics of the data. This general approach of automating the preparation steps has the important added benefit of enabling use of the data by non-human users (i.e., computer programs), which often make sub-optimal use of the available data due to the need to hard-code data preparation on the client side.
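
    The quality-screening idea mentioned above (dropping data points whose quality control flags fail a criterion) can be illustrated with a few lines of numpy. The variable names and flag convention below are assumptions for illustration only, not the actual Data Quality Screening Service interface.

```python
# Sketch: screen out retrievals whose quality flag is below a threshold.
# Flag convention (higher = better) and names are assumed for illustration.
import numpy as np

values = np.array([280.1, 281.3, 279.8, 283.0, 290.5])   # e.g. retrieved temperatures (made up)
quality = np.array([3, 2, 0, 3, 1])                       # per-point QC flag

GOOD_THRESHOLD = 2   # science-team- or user-specified criterion

screened = np.where(quality >= GOOD_THRESHOLD, values, np.nan)
print(screened)   # low-quality points replaced with NaN before analysis
```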

  15. Visualizing the quality of partially accruing data for use in decision making

    PubMed Central

    Eaton, Julia; Painter, Ian; Olson, Don; Lober, William B

    2015-01-01

    Secondary use of clinical health data for near real-time public health surveillance presents challenges surrounding its utility due to data quality issues. Data used for real-time surveillance must be timely, accurate and complete if it is to be useful; if incomplete data are used for surveillance, understanding the structure of the incompleteness is necessary. Such data are commonly aggregated due to privacy concerns. The Distribute project was a near real-time influenza-like-illness (ILI) surveillance system that relied on aggregated secondary clinical health data. The goal of this work is to disseminate the data quality tools developed to gain insight into the data quality problems associated with these data. These tools apply in general to any system where aggregate data are accrued over time and were created through the end-user-as-developer paradigm. Each tool was developed during the exploratory analysis to gain insight into structural aspects of data quality. Our key finding is that data quality of partially accruing data must be studied in the context of accrual lag—the difference between the time an event occurs and the time data for that event are received, i.e. the time at which data become available to the surveillance system. Our visualization methods therefore revolve around visualizing dimensions of data quality affected by accrual lag, in particular the tradeoff between timeliness and completion, and the effects of accrual lag on accuracy. Accounting for accrual lag in partially accruing data is necessary to avoid misleading or biased conclusions about trends in indicator values and data quality. PMID:27252794
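
    The accrual-lag idea lends itself to a small worked example: given event dates and the dates the corresponding records were received, one can compute the lag and the fraction of an event day's eventual total that had accrued after a given number of days. The sketch below, using made-up records, is one way to compute such a completeness-by-lag view; it is not the Distribute project's code.

```python
# Sketch: compute accrual lag and completeness-by-lag from made-up records.
import pandas as pd

records = pd.DataFrame({
    "event_date":    pd.to_datetime(["2015-01-01"] * 4 + ["2015-01-02"] * 3),
    "received_date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03",
                                     "2015-01-08", "2015-01-02", "2015-01-04",
                                     "2015-01-05"]),
})

# Accrual lag: days between the event and the data becoming available.
records["lag_days"] = (records["received_date"] - records["event_date"]).dt.days

# Fraction of each event day's eventual total that had accrued by each lag.
counts = records.groupby(["event_date", "lag_days"]).size()
totals = records.groupby("event_date").size()
completeness = counts.groupby(level=0).cumsum().div(totals, level=0)
print(completeness)
```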

  16. The Research Data Management Platform (RDMP): A novel, process driven, open-source tool for the management of longitudinal cohorts of clinical data.

    PubMed

    Nind, Thomas; Galloway, James; McAllister, Gordon; Scobbie, Donald; Bonney, Wilfred; Hall, Christopher; Tramma, Leandro; Reel, Parminder; Groves, Martin; Appleby, Philip; Doney, Alex; Guthrie, Bruce; Jefferson, Emily

    2018-05-22

    The Health Informatics Centre (HIC) at the University of Dundee provides a service to securely host clinical datasets and extract relevant data for anonymised cohorts, enabling researchers to answer key research questions. As is common in research using routine healthcare data, the service was historically delivered using ad-hoc processes, resulting in the slow provision of data whose provenance was often hidden from the researchers using it. This paper describes the development and evaluation of the Research Data Management Platform (RDMP): an open-source tool to load, manage, clean, and curate longitudinal healthcare data for research and to provide reproducible and updateable datasets for defined cohorts to researchers. Between 2013 and 2017, implementation of the RDMP tool tripled the productivity of data analysts producing data releases for researchers, from 7.1 to 25.3 per month, and reduced the error rate from 12.7% to 3.1%. The effort spent on data management fell from a mean of 24.6 to 3.0 hours per data release. The waiting time for researchers to receive data after agreeing a specification dropped from approximately 6 months to less than one week. The software is scalable and currently manages 163 datasets. 1,321 data extracts for research have been produced, with the largest extract linking data from 70 different datasets. The tools and processes that encompass the RDMP not only fulfil the research data management requirements of researchers but also support the seamless collaboration of data cleaning, data transformation, data summarisation and data quality assessment activities by different research groups.

  17. SeaView: bringing EarthCube to the Oceanographer

    NASA Astrophysics Data System (ADS)

    Stocks, K. I.; Diggs, S. C.; Arko, R. A.; Kinkade, D.; Shepherd, A.

    2016-12-01

    As new instrument types are developed and new observational programs begin, supporting a growing community of "dry" oceanographers, the ability to find, access, and visualize existing data of interest becomes increasingly critical. Yet ocean data, when available, are held in multiple data facilities, in different formats, and accessible through different pathways. This creates practical problems with integrating and working across different data sets. The SeaView project is building connections between the rich data resources in five major oceanographic data facilities - BCO-DMO, CCHDO, OBIS, OOI, and R2R* - creating a federated set of thematic data collections that are organized around common characteristics (geographic location, time, expedition, program, data type, etc.) and published online in Web Accessible Folders using standard file formats such as ODV and NetCDF. The work includes not simply reformatting data, but identifying and, where possible, addressing interoperability challenges: which common identifiers for core concepts can connect data across repositories; which search terms, if added to the data repositories, would increase discoverability for scientists; the presence of duplicate data across repositories; etc. We will present the data collections available to date, including data from the OOI Pioneer Array region, and seek scientists' input on the data types and formats they prefer, the tools they use to analyze and visualize data, and their specific recommendations for future data collections to support oceanographic science. * Biological and Chemical Oceanography Data Management Office (BCO-DMO), CLIVAR and Carbon Hydrographic Data Office (CCHDO), International Ocean Biogeographic Information System (iOBIS), Ocean Observatories Initiative (OOI), and Rolling Deck to Repository (R2R) Program.

  18. Common Data Servers as a Foundation for Specialized Services

    NASA Astrophysics Data System (ADS)

    Burger, E. F.; Schweitzer, R.; O'Brien, K.; Manke, A. B.; Smith, K. M.

    2017-12-01

    NOAA's Pacific Marine Environmental Laboratory (PMEL) hosts a broad range of research efforts that span many scientific and environmental research disciplines. Many of these research projects have their own data streams that are as diverse as the research. Data are collected using various platforms, including innovative new platforms such as Saildrones and autonomous profilers. With its requirements for public access to federally funded research results and data, the 2013 White House Office of Science and Technology Policy memo on Public Access to Research Results (PARR) changed the data landscape for Federal agencies. In 2015, with support from the PMEL Director, the PMEL Science Data Integration Group (SDIG) initiated a multi-year effort to formulate and implement an integrated data-management strategy for PMEL research efforts. The PMEL integrated data management strategy provides data access, visualization, and some archive services for PMEL data, using existing and proven frameworks for this capability. In addition to these foundational data services, the data access and visualization frameworks are also leveraged to provide enhanced services to scientists. One enhanced service developed is a data management "dashboard". This application provides scientists with a snapshot of their data assets, access to these data, a map view of data locations, and information on archival status. Ideally, information on the dashboard continually updates to accurately reflect the project's data asset status. This poster explains how frameworks such as ERDDAP and LAS were used as a foundation for the development of custom services, and describes the functionality of the PMEL data management dashboard. We will also highlight accomplishments of the PMEL integrated data management strategy implementation.
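
    Part of what makes ERDDAP useful as a foundation for custom services is that it exposes datasets through simple RESTful URLs. As a hedged sketch (the server path, dataset ID, variable names, and constraints below are placeholders, not actual PMEL datasets), a tabledap query can be read straight into a DataFrame:

```python
# Sketch: read a subset of an ERDDAP tabledap dataset into pandas.
# Server, dataset ID, variables, and constraints are placeholders;
# constraint characters such as ">=" may need URL-encoding in practice.
import pandas as pd

url = (
    "https://example-erddap-server.gov/erddap/tabledap/"
    "hypothetical_mooring.csv"
    "?time,latitude,longitude,sea_water_temperature"
    "&time>=2017-01-01&time<=2017-01-31"
)

df = pd.read_csv(url, skiprows=[1])   # second row of ERDDAP CSV output holds units
print(df.head())
```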

  19. Rules for processing genetic data for research purposes in view of the new EU General Data Protection Regulation.

    PubMed

    Shabani, Mahsa; Borry, Pascal

    2018-02-01

    Genetic data contain sensitive health and non-health-related information about individuals and their family members. Therefore, adopting adequate privacy safeguards is paramount when processing genetic data for research or clinical purposes. One of the major legal instruments for personal data protection in the EU is the new General Data Protection Regulation (GDPR), which entered into force in May 2016 and repealed Directive 95/46/EC, with the ultimate goal of enhancing the effectiveness and harmonization of personal data protection in the EU. This paper explores the major provisions of the new Regulation with regard to processing genetic data, and assesses the influence of such provisions on reinforcing the legal safeguards when sharing genetic data for research purposes. The new Regulation attempts to elucidate the scope of personal data by recognizing pseudonymized data as personal (identifiable) data and by including genetic data in the catalog of special categories of data (sensitive data). Moreover, a set of new rules is laid out in the Regulation for processing personal data under the scientific research exemption. For instance, further use of genetic data for scientific research purposes without obtaining additional consent will be allowed if specific conditions are met. The new Regulation has already fueled concerns among various stakeholders, owing to the challenges that may emerge when implementing the Regulation across countries. Notably, the provided definition of pseudonymized data has been criticized because it leaves too much room for interpretation and might undermine the harmonization of data protection across countries.

  20. Visualizing the quality of partially accruing data for use in decision making.

    PubMed

    Eaton, Julia; Painter, Ian; Olson, Don; Lober, William B

    2015-01-01

    Secondary use of clinical health data for near real-time public health surveillance presents challenges surrounding its utility due to data quality issues. Data used for real-time surveillance must be timely, accurate and complete if it is to be useful; if incomplete data are used for surveillance, understanding the structure of the incompleteness is necessary. Such data are commonly aggregated due to privacy concerns. The Distribute project was a near real-time influenza-like-illness (ILI) surveillance system that relied on aggregated secondary clinical health data. The goal of this work is to disseminate the data quality tools developed to gain insight into the data quality problems associated with these data. These tools apply in general to any system where aggregate data are accrued over time and were created through the end-user-as-developer paradigm. Each tool was developed during the exploratory analysis to gain insight into structural aspects of data quality. Our key finding is that data quality of partially accruing data must be studied in the context of accrual lag-the difference between the time an event occurs and the time data for that event are received, i.e. the time at which data become available to the surveillance system. Our visualization methods therefore revolve around visualizing dimensions of data quality affected by accrual lag, in particular the tradeoff between timeliness and completion, and the effects of accrual lag on accuracy. Accounting for accrual lag in partially accruing data is necessary to avoid misleading or biased conclusions about trends in indicator values and data quality.

  1. Agile Data Curation: A conceptual framework and approach for practitioner data management

    NASA Astrophysics Data System (ADS)

    Young, J. W.; Benedict, K. K.; Lenhardt, W. C.

    2015-12-01

    Data management occurs across a range of science and related activities such as decision support. Exemplars within the science community operate data management systems that are extensively planned before implementation, staffed with robust data management expertise, equipped with appropriate services and technologies, and often highly structured. However, this is not the only approach to data management and almost certainly not the typical experience. The other end of the spectrum is often an ad hoc practitioner team, with changing requirements, limited training in data management, and constrained resources for both equipment and personnel. Much of the existing data management literature serves the exemplar community and ignores the ad hoc practitioners. Somewhere in the middle are examples where data are repurposed for new uses, thereby generating new data management challenges. This submission presents a conceptualization of an Agile Data Curation approach that provides foundational principles for data management efforts operating across the spectrum of data generation and use, from large science systems to efforts with constrained resources, limited expertise, and evolving requirements. The underlying principles of Agile Data Curation are a reapplication of agile software development principles to data management. The historical reality for many data management efforts is operation in a practitioner environment, so Agile Data Curation uses historical and current case studies to validate the foundational principles and, through comparison, to draw lessons for future application. This submission will provide an overview of Agile Data Curation, cover the foundational principles of the approach, and introduce a framework for gathering, classifying, and applying lessons from case studies of practitioner data management.

  2. The combined EarthScope data set at the IRIS DMC

    NASA Astrophysics Data System (ADS)

    Trabant, C.; Sharer, G.; Benson, R.; Ahern, T.

    2007-12-01

    The IRIS Data Management Center (DMC) is the perpetual archive and access point for an ever-increasing variety of geophysical data in terms of volume, geographic distribution, and scientific value. A particular highlight is the combined data set produced by the EarthScope project. The DMC archives data from each of the primary components: USArray, the Plate Boundary Observatory (PBO), and the San Andreas Fault Observatory at Depth (SAFOD). Growing at over 4.6 gigabytes per day, the USArray data set currently totals approximately 5 terabytes. Composed of four separate sub-components, the Permanent, Transportable, Flexible, and Magnetotelluric Arrays, the USArray data set provides a multi-scale view of the western United States at present and of the conterminous United States when it is completed. The primary data from USArray are broadband and short-period seismic recordings and magnetotelluric measurements. Complementing the data from USArray are the short-period borehole seismic data and the borehole and laser strain data from PBO. The DMC also archives the high-resolution seismic data from instruments in the SAFOD main and pilot drill holes. The SAFOD seismic data are available in two forms: lower-rate monitoring channels sampled at 250 hertz and full-resolution channels varying between 1 and 4 kilohertz. Beyond data collection and archive management, the DMC performs value-added functions. All data arriving at the DMC as real-time data streams are processed by QUACK, an automated Quality Control (QC) system. All the measurements made by this system are stored in a database and made available to data contributors and users via a web interface, including customized report generation. In addition to the automated QC measurements, quality control is performed on USArray data at the DMC by a team of analysts. The primary functions of the analysts are to routinely report data quality assessments to the respective network operators and to log serious, unfixable data issues for reference by data users. All of these data are managed in a unified SEED format archive and are seamlessly available to data users via the DMC's standard data access methods, along with all the other data managed by the DMC. The only exception is the high-resolution, special-case SAFOD seismic data, which are retained in their original SEG-2 format as an assembled data set. A data user can choose between a handful of data access methods ranging from simple email requests to technologically advanced CORBA-based access, streamlining the "information into application" philosophy. Currently totaling over 8.5 terabytes and growing, the combined EarthScope data at the DMC provide an unparalleled, multi-measurement record of geophysical information ideal for determining Earth structure and processes in the United States and beyond. A website is maintained to provide current information regarding EarthScope data at the DMC: http://www.iris.edu/earthscope/.

  3. The 1993 Space and Earth Science Data Compression Workshop

    NASA Technical Reports Server (NTRS)

    Tilton, James C. (Editor)

    1993-01-01

    The Earth Observing System Data and Information System (EOSDIS) is described in terms of its data volume, data rate, and data distribution requirements. Opportunities for data compression in EOSDIS are discussed.

  4. Unlocking the Power of Big Data at the National Institutes of Health.

    PubMed

    Coakley, Meghan F; Leerkes, Maarten R; Barnett, Jason; Gabrielian, Andrei E; Noble, Karlynn; Weber, M Nick; Huyen, Yentram

    2013-09-01

    The era of "big data" presents immense opportunities for scientific discovery and technological progress, with the potential to have enormous impact on research and development in the public sector. In order to capitalize on these benefits, there are significant challenges to overcome in data analytics. The National Institute of Allergy and Infectious Diseases held a symposium entitled "Data Science: Unlocking the Power of Big Data" to create a forum for big data experts to present and share some of the creative and innovative methods for gleaning valuable knowledge from an overwhelming flood of biological data. A significant investment in infrastructure and tool development, along with more and better-trained data scientists, may facilitate methods for assimilation of data and machine learning, and help overcome obstacles such as data security, data cleaning, and data integration.

  5. Data Cubes Integration in Spatial OLAP for Agricultural Commodities

    NASA Astrophysics Data System (ADS)

    Putri, A. I.; Sitanggang, I. S.

    2017-03-01

    The Ministry of Agriculture of Indonesia collects data on agricultural commodities in Indonesia on an annual basis. Agricultural commodities data include food crops, horticulture, plantations, and livestock. The data are available in spreadsheet format. This study developed data cubes for food crops, plantations, and livestock using a galaxy schema for the data warehouse and integrated the data cubes into the SOLAP Horticulture system using SpagoBI. SOLAP is useful for data analysis and data visualization. The application displays agricultural commodities data in the form of crosstabs and charts. This study also developed a location intelligence module that visualizes agricultural commodities data on a map. The system was tested using the black-box approach. The results showed that the main functions, including roll-up, drill-down, slice, dice, and pivot, work properly. This application is expected to enable users to easily obtain data summaries of agricultural commodities.
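
    To make the OLAP operations named above concrete, the sketch below mimics a roll-up and a slice on a small made-up fact table with pandas. The column names and values are illustrative and do not reflect the study's actual galaxy schema or data.

```python
# Sketch: roll-up and slice on a toy agricultural-commodity fact table.
# Column names and values are illustrative, not the study's schema.
import pandas as pd

fact = pd.DataFrame({
    "year":      [2014, 2014, 2015, 2015, 2015],
    "province":  ["Jawa Barat", "Jawa Timur", "Jawa Barat", "Jawa Timur", "Bali"],
    "commodity": ["rice", "rice", "rice", "maize", "rice"],
    "production_tons": [11500, 12300, 11800, 6200, 870],
})

# Roll-up: aggregate from the province level up to the national (year) level.
rollup = fact.groupby(["year", "commodity"])["production_tons"].sum()
print(rollup)

# Slice: fix one dimension member (commodity == "rice") and keep the rest.
rice_slice = fact[fact["commodity"] == "rice"]
print(rice_slice)
```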

  6. Consumables data base workbook: Formulation of consumables management models

    NASA Technical Reports Server (NTRS)

    Zamora, M. A.

    1977-01-01

    Activity consumables data specifications and data applications are presented. The data are structured in a series of "Consumable Data Worksheets" for each activity, including a profile of its operations and the rate of each consumable required to support the given activity. The data worksheets provide for the uniform specification of consumables data, allow for the ready identification of the consumables affected by a given activity, and facilitate the updating process. An activity is defined and the data that must be included in the data worksheets are specified. An example of its use and application is given, i.e., the consumables data requirements for performing an EVA. The consumables data for the activities currently identified for the shuttle spacecraft are included. The consumables data sources are identified and information to facilitate the maintenance process is detailed.

  7. Towards Data Value-Level Metadata for Clinical Studies.

    PubMed

    Zozus, Meredith Nahm; Bonner, Joseph

    2017-01-01

    While several standards for metadata describing clinical studies exist, comprehensive metadata to support traceability of data from clinical studies has not been articulated. We examine uses of metadata in clinical studies. We examine and enumerate seven sources of data value-level metadata in clinical studies inclusive of research designs across the spectrum of the National Institutes of Health definition of clinical research. The sources of metadata inform categorization in terms of metadata describing the origin of a data value, the definition of a data value, and operations to which the data value was subjected. The latter is further categorized into information about changes to a data value, movement of a data value, retrieval of a data value, and data quality checks, constraints or assessments to which the data value was subjected. The implications of tracking and managing data value-level metadata are explored.

  8. Facilitating Cohort Discovery by Enhancing Ontology Exploration, Query Management and Query Sharing for Large Clinical Data Repositories.

    PubMed

    Tao, Shiqiang; Cui, Licong; Wu, Xi; Zhang, Guo-Qiang

    2017-01-01

    To help researchers better access clinical data, we developed a prototype query engine called DataSphere for exploring large-scale integrated clinical data repositories. DataSphere expedites data importing using a NoSQL data management system and dynamically renders its user interface for concept-based querying tasks. DataSphere provides an interactive query-building interface together with query translation and optimization strategies, which enable users to build and execute queries effectively and efficiently. We successfully loaded a dataset of one million patients for University of Kentucky (UK) Healthcare into DataSphere with more than 300 million clinical data records. We evaluated DataSphere by comparing it with an instance of i2b2 deployed at UK Healthcare, demonstrating that DataSphere provides enhanced user experience for both query building and execution.

  9. Facilitating Cohort Discovery by Enhancing Ontology Exploration, Query Management and Query Sharing for Large Clinical Data Repositories

    PubMed Central

    Tao, Shiqiang; Cui, Licong; Wu, Xi; Zhang, Guo-Qiang

    2017-01-01

    To help researchers better access clinical data, we developed a prototype query engine called DataSphere for exploring large-scale integrated clinical data repositories. DataSphere expedites data importing using a NoSQL data management system and dynamically renders its user interface for concept-based querying tasks. DataSphere provides an interactive query-building interface together with query translation and optimization strategies, which enable users to build and execute queries effectively and efficiently. We successfully loaded a dataset of one million patients for University of Kentucky (UK) Healthcare into DataSphere with more than 300 million clinical data records. We evaluated DataSphere by comparing it with an instance of i2b2 deployed at UK Healthcare, demonstrating that DataSphere provides enhanced user experience for both query building and execution. PMID:29854239

  10. Visual Data Analysis for Satellites

    NASA Technical Reports Server (NTRS)

    Lau, Yee; Bhate, Sachin; Fitzpatrick, Patrick

    2008-01-01

    The Visual Data Analysis Package is a collection of programs and scripts that facilitate visual analysis of data available from NASA and NOAA satellites, as well as dropsonde, buoy, and conventional in-situ observations. The package features utilities for data extraction, data quality control, statistical analysis, and data visualization. The Hierarchical Data Format (HDF) satellite data extraction routines from NASA's Jet Propulsion Laboratory were customized for specific spatial coverage and file input/output. Statistical analysis includes the calculation of the relative error, the absolute error, and the root mean square error. Other capabilities include curve fitting through the data points to fill in missing values between satellite passes or where clouds obscure satellite data. For data visualization, the software provides customizable Generic Mapping Tools (GMT) scripts to generate difference maps, scatter plots, line plots, vector plots, histograms, time series, and color fill images.
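
    The comparison statistics mentioned (relative error, absolute error, root mean square error) have standard definitions; a minimal numpy sketch of how such metrics might be computed when comparing satellite retrievals against in-situ observations is shown below. The array contents are made up and only illustrate the formulas.

```python
# Sketch: standard comparison statistics between satellite and in-situ values.
import numpy as np

satellite = np.array([25.1, 26.4, 24.8, 27.0])   # e.g. satellite SST (made-up values)
in_situ   = np.array([25.0, 26.0, 25.2, 26.5])   # matching buoy observations (made-up)

error = satellite - in_situ
absolute_error = np.abs(error)
relative_error = absolute_error / np.abs(in_situ)
rmse = np.sqrt(np.mean(error ** 2))

print(f"mean absolute error: {absolute_error.mean():.3f}")
print(f"mean relative error: {relative_error.mean():.3%}")
print(f"RMSE: {rmse:.3f}")
```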

  11. AOIPS data base management systems support for GARP data sets

    NASA Technical Reports Server (NTRS)

    Gary, J. P.

    1977-01-01

    A data base management system is presented that was developed to provide flexible access to data sets produced by GARP during its data systems tests. The content and coverage of the data base are defined, and a computer-aided, interactive information storage and retrieval system, implemented to facilitate access to user-specified data subsets, is described. The computer programs developed to provide this capability were implemented on the highly interactive, minicomputer-based AOIPS and are referred to as the data retrieval system (DRS). Implemented as a user-interactive, menu-guided system, the DRS permits users to inventory the data tape library and create duplicate or subset data sets based on a user-selected window defined by time and latitude/longitude boundaries. The DRS permits users to select, display, or produce formatted hard copy of individual data items contained within the data records.

  12. Aviation Data Integration System

    NASA Technical Reports Server (NTRS)

    Kulkarni, Deepak; Wang, Yao; Windrem, May; Patel, Hemil; Keller, Richard

    2003-01-01

    During the analysis of flight data and safety reports done in ASAP and FOQA programs, airline personnel are not able to access relevant aviation data for a variety of reasons. We have developed the Aviation Data Integration System (ADIS), a software system that provides integrated heterogeneous data to support safety analysis. Types of data available in ADIS include weather, D-ATIS, RVR, radar data, Jeppesen charts, and flight data. We developed three versions of ADIS to support airlines. The first version supports ASAP teams. The second version supports FOQA teams, integrating aviation data with flight data while keeping identification information inaccessible. Finally, we developed a prototype that demonstrates the integration of aviation data into flight data analysis programs. The initial feedback from airlines is that ADIS is very useful in FOQA and ASAP analysis.

  13. A data management system for engineering and scientific computing

    NASA Technical Reports Server (NTRS)

    Elliot, L.; Kunii, H. S.; Browne, J. C.

    1978-01-01

    Data elements and relationship definition capabilities for this data management system are explicitly tailored to the needs of engineering and scientific computing. System design was based upon studies of data management problems currently being handled through explicit programming. The system-defined data element types include real scalar numbers, vectors, arrays and special classes of arrays such as sparse arrays and triangular arrays. The data model is hierarchical (tree structured). Multiple views of data are provided at two levels. Subschemas provide multiple structural views of the total data base and multiple mappings for individual record types are supported through the use of a REDEFINES capability. The data definition language and the data manipulation language are designed as extensions to FORTRAN. Examples of the coding of real problems taken from existing practice in the data definition language and the data manipulation language are given.

  14. Data Protection Compliance in the Age of Digital Health.

    PubMed

    Hordern, Victoria

    2016-06-01

    Advances in technology are transforming the way that health data is collected and used. This includes improvements in existing technology as well as innovations in mobile technology such as smartphone apps and wearables. Health data is strictly regulated under the EU Data Protection Directive 95/46/EC. Under current data protection rules, health data is broadly interpreted and will, in most circumstances not connected to the provision of healthcare, require organisations to obtain explicit consent from individuals for its collection and use. Further data protection compliance issues arise such as identifying who is a controller, ensuring transparency, using health data for research purposes and keeping health data secure. As the EU data protection landscape is due to change in the next few years and will affect the collection and use of health data, the forthcoming Data Protection Regulation also deserves attention.

  15. Combined Industry, Space and Earth Science Data Compression Workshop

    NASA Technical Reports Server (NTRS)

    Kiely, Aaron B. (Editor); Renner, Robert L. (Editor)

    1996-01-01

    The sixth annual Space and Earth Science Data Compression Workshop and the third annual Data Compression Industry Workshop were held as a single combined workshop. The workshop was held April 4, 1996 in Snowbird, Utah in conjunction with the 1996 IEEE Data Compression Conference, which was held at the same location March 31 - April 3, 1996. The Space and Earth Science Data Compression sessions seek to explore opportunities for data compression to enhance the collection, analysis, and retrieval of space and earth science data. Of particular interest is data compression research that is integrated into, or has the potential to be integrated into, a particular space or earth science data information system. Preference is given to data compression research that takes into account the scientist's data requirements, and the constraints imposed by the data collection, transmission, distribution and archival systems.

  16. Bi-level Multi-Source Learning for Heterogeneous Block-wise Missing Data

    PubMed Central

    Xiang, Shuo; Yuan, Lei; Fan, Wei; Wang, Yalin; Thompson, Paul M.; Ye, Jieping

    2013-01-01

    Bio-imaging technologies allow scientists to collect large amounts of high-dimensional data from multiple heterogeneous sources for many biomedical applications. In the study of Alzheimer's Disease (AD), neuroimaging data, gene/protein expression data, etc., are often analyzed together to improve predictive power. Joint learning from multiple complementary data sources is advantageous, but feature-pruning and data source selection are critical to learn interpretable models from high-dimensional data. Often, the data collected has block-wise missing entries. In the Alzheimer’s Disease Neuroimaging Initiative (ADNI), most subjects have MRI and genetic information, but only half have cerebrospinal fluid (CSF) measures, a different half has FDG-PET; only some have proteomic data. Here we propose how to effectively integrate information from multiple heterogeneous data sources when data is block-wise missing. We present a unified “bi-level” learning model for complete multi-source data, and extend it to incomplete data. Our major contributions are: (1) our proposed models unify feature-level and source-level analysis, including several existing feature learning approaches as special cases; (2) the model for incomplete data avoids imputing missing data and offers superior performance; it generalizes to other applications with block-wise missing data sources; (3) we present efficient optimization algorithms for modeling complete and incomplete data. We comprehensively evaluate the proposed models including all ADNI subjects with at least one of four data types at baseline: MRI, FDG-PET, CSF and proteomics. Our proposed models compare favorably with existing approaches. PMID:23988272

  17. Bi-level multi-source learning for heterogeneous block-wise missing data.

    PubMed

    Xiang, Shuo; Yuan, Lei; Fan, Wei; Wang, Yalin; Thompson, Paul M; Ye, Jieping

    2014-11-15

    Bio-imaging technologies allow scientists to collect large amounts of high-dimensional data from multiple heterogeneous sources for many biomedical applications. In the study of Alzheimer's Disease (AD), neuroimaging data, gene/protein expression data, etc., are often analyzed together to improve predictive power. Joint learning from multiple complementary data sources is advantageous, but feature-pruning and data source selection are critical to learn interpretable models from high-dimensional data. Often, the data collected has block-wise missing entries. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), most subjects have MRI and genetic information, but only half have cerebrospinal fluid (CSF) measures, a different half has FDG-PET; only some have proteomic data. Here we propose how to effectively integrate information from multiple heterogeneous data sources when data is block-wise missing. We present a unified "bi-level" learning model for complete multi-source data, and extend it to incomplete data. Our major contributions are: (1) our proposed models unify feature-level and source-level analysis, including several existing feature learning approaches as special cases; (2) the model for incomplete data avoids imputing missing data and offers superior performance; it generalizes to other applications with block-wise missing data sources; (3) we present efficient optimization algorithms for modeling complete and incomplete data. We comprehensively evaluate the proposed models including all ADNI subjects with at least one of four data types at baseline: MRI, FDG-PET, CSF and proteomics. Our proposed models compare favorably with existing approaches. © 2013 Elsevier Inc. All rights reserved.

  18. USGS Science Data Catalog - Open Data Advances or Declines

    NASA Astrophysics Data System (ADS)

    Frame, M. T.; Hutchison, V.; Zolly, L.; Wheeler, B.; Latysh, N.; Devarakonda, R.; Palanisamy, G.; Shrestha, B.

    2014-12-01

    The recent White House Office of Science and Technology Policy (OSTP) Open Data Policies (2013) require Federal agencies to establish formal catalogues of their science data holdings and make these data easily available on Web sites, portals, and applications. As an organization, the USGS has historically excelled at making its data holdings freely available on its various Web sites (i.e., National, Scientific Programs, or local Science Center). In response to these requirements, the USGS Core Science Analytics, Synthesis, and Libraries program, in collaboration with DOE's Oak Ridge National Laboratory (ORNL) Mercury Consortium (funded by NASA, USGS, and DOE) and a number of other USGS organizations, established the Science Data Catalog (http://data.usgs.gov) cyberinfrastructure, content management processes/tools, and supporting policies. The USGS Science Data Catalog led the charge at USGS to improve the robustness of existing and future metadata collections; streamline and develop sustainable publishing to external aggregators (i.e., data.gov); and provide leadership to the U.S. Department of the Interior in emerging Open Data policies, techniques, and systems. The session will discuss the current successes, challenges, and progress toward meeting these Open Data policies for USGS scientific data holdings. It will take a retrospective look at the last year of implementation within USGS to determine whether these Open Data policies are improving data access or limiting data availability. To learn more about the USGS Science Data Catalog, visit us at http://data.usgs.gov/info/about.html

  19. Satellite data-relay activities in Arizona

    USGS Publications Warehouse

    Boner, F.C.; Blee, J.W.; Shope, W.G.

    1985-01-01

    The U.S. Geological Survey (USGS) Arizona District collects data from automated streamflow stations for a wide variety of uses. Data from these stations are provided to Federal, State, and local agencies that have a responsibility to issue flood warnings; to generate forecasts of water availability; to monitor flow to ensure compliance with treaties and other legal mandates; and to manage reservoirs for hydropower, flood abatement, and municipal and irrigation water supply. In the mid-1970's, the escalation of data collection costs and a need for more timely data led the Arizona District to examine alternatives for remote data acquisition. On the basis of successful data communications experiments with NASA's Landsat satellite, an operational system for satellite-data relay was developed in 1976 using the National Oceanic and Atmospheric Administration's (NOAA) Geostationary Operational Environmental Satellite (GOES). A total of 62 data collection platforms (DCPs) was operated in 1983. Satellite telemetry operations are controlled at the remote data-collection stations by small battery-operated data collection platforms. The DCPs periodically collect data from the sensors, store the data in computer memory, and at preset times transmit the data to the GOES satellite. The satellite retransmits the data to Earth, where a ground-receive station transmits or transfers the data by land communications to the USGS computer in Reston, Virginia, for processing. The satellite relay transfers the data from sensor to computer in minutes; therefore, the data are available to users on a near real-time basis. (Author's abstract)

  20. Seismological Data Stewardship at the IRIS DMC: The Role of a Dedicated Data Management System for Seismology

    NASA Astrophysics Data System (ADS)

    Benson, R. B.; Ahern, T. K.; Trabant, C.; Casey, R.

    2011-12-01

    Since the founding of the Incorporated Research Institutions for Seismology (IRIS) in 1984, there has been a core program for data management, quite unique at the time, dedicated solely to ensuring that data recorded by IRIS and its partners have a perpetual data management framework that ensures data will be searchable, well-documented, and preserved so that future generations can, at its core, have an accurate history of ground motion recordings. This goal is manifest in the IRIS Data Management System, or DMS. The mission of this NSF-EAR facility is "To provide reliable and efficient access to high quality seismological and related geophysical data, generated by IRIS and its domestic and international partners, and to enable all parties interested in using these data to do so in a straightforward and efficient manner". This presentation will focus on the data management business rules that capture the data life cycle of three different segments of seismological and related geophysical data managed by IRIS: images and parametric information of historical analog data; non-real-time quality-controlled digital data; and real-time data that stream into the DMC through a number of different protocols. We will describe how data collection, curation, and distribution to users are catalogued to provide an accurate provenance log of contributed data, which is passed along to both the consumer and the network data provider. In addition, we will discuss the need for, and business rules that apply to, metadata and how it is managed.

  1. Characterization of Emergent Data Networks Among Long-Tail Data

    NASA Astrophysics Data System (ADS)

    Elag, Mostafa; Kumar, Praveen; Hedstrom, Margaret; Myers, James; Plale, Beth; Marini, Luigi; McDonald, Robert

    2014-05-01

    Data curation underpins data-driven scientific advancements. It manages the information flux across multiple users throughout the data life cycle and increases data sustainability and reusability. The exponential growth in data production across the Earth sciences by individuals and small research groups, termed long-tail data, increases the data-to-knowledge latency among related domains. It has become clear that advanced, framework-agnostic metadata and ontologies for long-tail data are required to increase the visibility of such data sets to each other and to provide concise, meaningful descriptions that reveal their connectivity. Despite the advancements achieved by various sophisticated data management models in different Earth science disciplines, it is not always straightforward to derive relationships among long-tail data. Semantic data clustering algorithms and pre-defined logic rules oriented toward predicting possible data relationships are one way to address these challenges. Our work advances the connectivity of related long-tail data by introducing the design for an ontology-based knowledge management system. In this work, we present the system architecture and its components, and illustrate how the system can be used to scrutinize the connectivity among datasets. To demonstrate the capabilities of this "data network" prototype, we implemented the approach within the Sustainable Environment Actionable Data (SEAD) environment, an open-source semantic content repository that provides an RDF database for long-tail data, and we show how emergent relationships among datasets can be identified.
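
    As a rough sketch of how an RDF store could surface emergent relationships between long-tail datasets (for instance, two datasets sharing a keyword), the code below builds a tiny graph with rdflib and queries it with SPARQL. The namespace, dataset URIs, and predicate are invented for illustration and are not SEAD's actual vocabulary.

```python
# Sketch: find dataset pairs connected by a shared keyword in a toy RDF graph.
# Namespace, URIs, and the predicate are invented for illustration.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/longtail/")
g = Graph()

g.add((EX.soil_moisture_2013, EX.hasKeyword, Literal("evapotranspiration")))
g.add((EX.flux_tower_site7,   EX.hasKeyword, Literal("evapotranspiration")))
g.add((EX.streamflow_gauge42, EX.hasKeyword, Literal("discharge")))

query = """
SELECT ?a ?b ?kw WHERE {
    ?a <http://example.org/longtail/hasKeyword> ?kw .
    ?b <http://example.org/longtail/hasKeyword> ?kw .
    FILTER (STR(?a) < STR(?b))
}
"""
for a, b, kw in g.query(query):
    print(f"{a} <-> {b} share keyword '{kw}'")
```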

  2. The NOWESP research data base

    NASA Astrophysics Data System (ADS)

    Radach, G.; Gekeler, J.; Becker, G.; Bot, P.; Castaing, P.; Colijn, F.; Damm, P.; Danielssen, D.; Føyn, L.; Gamble, J.; Laane, R.; Mommaerts, J. P.; Nehring, D.; Pegler, K.; van Raaphorst, W.; Wilson, J.

    1996-09-01

    In the NOWESP project historical data from the Northwest European Shelf were compiled and evaluated to estimate the variability and trends in water movements, concentrations of dissolved and particulate constituents, and fluxes of the relevant substances across the shelf. As an integral part of the project, the NOWESP Research Data Base was created as a research tool to provide the data and data products needed for the analyses within the project. The tasks of the NOWESP Research Data Base group were the acquisition of the relevant data sets, with the intensive support of all partners, organization of the data sets in the NOWESP Research Data Base, merging of the specific data sets for the ten main state variables used in NOWESP, and the provision of data products for analysis within NOWESP. The data compiled during NOWESP represent a unique data set for the Northwest European Shelf. The data set is sufficiently comprehensive to allow the definition of long time series at about 14 sites in eight areas. It further enables the derivation of mean annual cycles of horizontal distributions of nine main state variables. NOWESP thus has provided valuable data sets for estimating budgets and fluxes across the shelf and, in addition, important data sets for the forcing and validation of ecological shelf sea models. An overview of the NOWESP data set is given. The organization of the data base is described in some detail, and examples of the products obtained for NOWESP are displayed.

  3. Data Analysis and Data Mining: Current Issues in Biomedical Informatics

    PubMed Central

    Bellazzi, Riccardo; Diomidous, Marianna; Sarkar, Indra Neil; Takabayashi, Katsuhiko; Ziegler, Andreas; McCray, Alexa T.

    2011-01-01

    Background: Medicine and biomedical sciences have become data-intensive fields, which, at the same time, enable the application of data-driven approaches and require sophisticated data analysis and data mining methods. Biomedical informatics provides a proper interdisciplinary context to integrate data and knowledge when processing available information, with the aim of giving effective decision-making support in clinics and translational research. Objectives: To reflect on different perspectives related to the role of data analysis and data mining in biomedical informatics. Methods: On the occasion of the 50th year of Methods of Information in Medicine, a symposium was organized that reflected on the opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care. The contributions of experts with a variety of backgrounds in the area of biomedical data analysis have been collected as one outcome of this symposium, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field. Results: The paper presents sections on data accumulation and data-driven approaches in medical informatics, data and knowledge integration, statistical issues for the evaluation of data mining models, translational bioinformatics and bioinformatics aspects of genetic epidemiology. Conclusions: Biomedical informatics represents a natural framework to properly and effectively apply data analysis and data mining methods in a decision-making context. In the future, it will be necessary to preserve the inclusive nature of the field and to foster an increasing sharing of data and methods between researchers. PMID:22146916

  4. Policies and Procedures for Accessing Archived NASA Lunar Data via the Web

    NASA Technical Reports Server (NTRS)

    James, Nathan L.; Williams, David R.

    2011-01-01

    The National Space Science Data Center (NSSDC) was established by NASA to provide for the preservation and dissemination of scientific data from NASA missions. This paper describes the policies specifically related to lunar science data. NSSDC presently archives 660 lunar data collections. Most of these data (423 units) are stored offline in analog format. The remainder of this collection consists of magnetic tapes and discs containing approximately 1.7 TB of digital lunar data. The active archive for NASA lunar data is the Planetary Data System (PDS). NSSDC has an agreement with the PDS Lunar Data Node to assist in the restoration and preparation of NSSDC-resident lunar data upon request for access and distribution via the PDS archival system. Though much of NSSDC's digital store also resides in PDS, NSSDC has many analog data collections and some digital lunar data sets that are not in PDS. NSSDC stands ready to make these archived lunar data accessible to both the research community and the general public upon request as resources allow. Newly requested offline lunar data are digitized and moved to near-line storage devices called digital linear tape jukeboxes. The data are then packaged and made network-accessible via FTP for the convenience of a growing segment of the user community. This publication will 1) discuss the NSSDC processes and policies that govern how NASA lunar data is preserved, restored, and made accessible via the web and 2) highlight examples of special lunar data requests.

  5. NASA Earth Sciences Data Support System and Services for the Northern Eurasia Earth Science Partnership Initiative

    NASA Technical Reports Server (NTRS)

    Leptoukh, Gregory

    2006-01-01

    The presentation describes the recently awarded ACCESS project to provide data management of NASA remote sensing data for the Northern Eurasia Earth Science Partnership Initiative (NEESPI). The project targets integration of remote sensing data from MODIS, and other NASA instruments on board US-satellites (with potential expansion to data from non-US satellites), customized data products from climatology data sets (e.g., ISCCP, ISLSCP) and model data (e.g., NCEP/NCAR) into a single, well-architected data management system. It will utilize two existing components developed by the Goddard Earth Sciences Data & Information Services Center (GES DISC) at the NASA Goddard Space Flight Center: (1) online archiving and distribution system, that allows collection, processing and ingest of data from various sources into the online archive, and (2) user-friendly intelligent web-based online visualization and analysis system, also known as Giovanni. The former includes various kinds of data preparation for seamless interoperability between measurements by different instruments. The latter provides convenient access to various geophysical parameters measured in the Northern Eurasia region without any need to learn complicated remote sensing data formats, or retrieve and process large volumes of NASA data. Initial implementation of this data management system will concentrate on atmospheric data and surface data aggregated to coarse resolution to support collaborative environment and climate change studies and modeling, while at later stages, data from NASA and non-NASA satellites at higher resolution will be integrated into the system.

  6. Update on Apollo Data Restoration by the NSSDC and the PDS Lunar Data Node

    NASA Technical Reports Server (NTRS)

    Williams, David R.; Hills, K. Kent; Taylor, Patrick T.; McBride, Marie J.; Guinness, Edward A.

    2012-01-01

    The Lunar Data Node (LDN), under the auspices of the Geosciences Node of the Planetary Data System (PDS) and the National Space Science Data Center (NSSDC), is continuing its efforts to recover and restore Apollo science data. The data being restored are in large part archived with NSSDC on older media, but unarchived data are also being recovered from other sources. They are typically on 7- or 9-track magnetic tapes, often in obsolete formats, or held on microfilm, microfiche, or paper documents. The goal of the LDN is to restore these data from their current form, which is difficult for most researchers to access, into common digital formats with all necessary supporting data (metadata) and archive the data sets with PDS. Restoration involves reading the data from the original media, deciphering the data formats to produce readable digital data, and converting the data into usable tabular formats. Each set of values in the table must then be understood in terms of the quantity measured and the units used. Information on instrument properties, operational history, and calibrations is gathered and added to the data set, along with pertinent references, contacts, and other ancillary documentation. The data set then undergoes a peer review and the final validated product is archived with PDS. Although much of this effort has concentrated on data archived at NSSDC in the 1970's, we have also recovered data and information that were never sent to NSSDC. These data, retrieved from various outside sources, include raw and reduced Gamma-Ray Spectrometer data from Apollos 15 and 16, information on the Apollo 17 Lunar Ejecta And Meteorites experiment, Dust Detector data from Apollos 11, 12, 14, and 15, raw telemetry tapes from the Apollo ALSEPs, and Weekly Status Reports for all the Apollo missions. These data are currently being read or organized, and supporting data are being gathered. We are still looking for the calibrated heat flow data from Apollos 15 and 17 for the period 1975-1977; any assistance or information on these data would be welcome. NSSDC has recently been tasked to release its hard-copy archive, comprising photography, microfilm, and microfiche. The details are still being discussed, but we are concentrating on recovering the valuable lunar data from these materials while they are still readily accessible. We have identified the most critical of these data and written a LASER proposal to fund their restoration. Included in this effort are data from the Apollo 15 and 16 Mass Spectrometers and the Apollo 17 Far-UV Spectrometer and ancillary information on the Apollo 17 Surface Electrical Properties Experiment.

  7. [Nonparametric method of estimating survival functions containing right-censored and interval-censored data].

    PubMed

    Xu, Yonghong; Gao, Xiaohuan; Wang, Zhengxi

    2014-04-01

    Missing data represent a general problem in many scientific fields, especially in medical survival analysis. For censored data, interpolation is one of the important approaches. However, most interpolation methods replace the censored data with exact data, which distorts the real distribution of the censored data and reduces the probability that the real data fall within the interpolated data. To solve this problem, we propose in this paper a nonparametric method of estimating the survival function of right-censored and interval-censored data and compare its performance to the SC (self-consistent) algorithm. Compared with the average interpolation and nearest-neighbor interpolation methods, the proposed method replaces the right-censored data with interval-censored data and greatly improves the probability that the real data fall within the imputation interval. It then uses empirical distribution theory to estimate the survival function of the right-censored and interval-censored data. The results of numerical examples and a real breast cancer data set demonstrated that the proposed method had higher accuracy and better robustness across different proportions of censored data. This paper provides a good method for comparing the performance of clinical treatments by estimating the survival of patients, and offers some help to medical survival data analysis.
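
    As background for the self-consistent (SC) algorithm used for comparison, the sketch below implements a minimal Turnbull-style self-consistency iteration for interval-censored data; right-censored observations are represented as intervals extending to infinity. This is an illustrative simplification under an assumed closed-interval convention, not the authors' proposed estimator.

```python
import numpy as np

def sc_survival(intervals, n_iter=200, tol=1e-8):
    """Self-consistent (Turnbull-style) estimate of S(t) from
    interval-censored observations [L, R]; exact events are [t, t],
    right-censored events are [t, inf]."""
    L = np.array([a for a, _ in intervals], float)
    R = np.array([b for _, b in intervals], float)
    # candidate support points: all finite interval endpoints
    support = np.unique(np.concatenate([L[np.isfinite(L)],
                                        R[np.isfinite(R)]]))
    # A[i, j] = 1 if observation i allows mass at support point j
    A = ((support[None, :] >= L[:, None]) &
         (support[None, :] <= R[:, None])).astype(float)
    p = np.full(len(support), 1.0 / len(support))   # initial mass
    for _ in range(n_iter):
        denom = A @ p                               # P(interval i | p)
        weights = (A * p) / denom[:, None]          # E-step
        p_new = weights.mean(axis=0)                # M-step
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    survival = 1.0 - np.cumsum(p)                   # S(t_j) = P(T > t_j)
    return support, survival

# Example: two exact events, two interval-censored, one right-censored
obs = [(2.0, 2.0), (1.0, 3.0), (3.0, 5.0), (4.0, np.inf), (6.0, 6.0)]
times, surv = sc_survival(obs)
```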

  8. Data Streaming for Metabolomics: Accelerating Data Processing and Analysis from Days to Minutes

    PubMed Central

    2016-01-01

    The speed and throughput of analytical platforms has been a driving force in recent years in the “omics” technologies and while great strides have been accomplished in both chromatography and mass spectrometry, data analysis times have not benefited at the same pace. Even though personal computers have become more powerful, data transfer times still represent a bottleneck in data processing because of the increasingly complex data files and studies with a greater number of samples. To meet the demand of analyzing hundreds to thousands of samples within a given experiment, we have developed a data streaming platform, XCMS Stream, which capitalizes on the acquisition time to compress and stream recently acquired data files to data processing servers, mimicking just-in-time production strategies from the manufacturing industry. The utility of this XCMS Online-based technology is demonstrated here in the analysis of T cell metabolism and other large-scale metabolomic studies. A large scale example on a 1000 sample data set demonstrated a 10 000-fold time savings, reducing data analysis time from days to minutes. Further, XCMS Stream has the capability to increase the efficiency of downstream biochemical dependent data acquisition (BDDA) analysis by initiating data conversion and data processing on subsets of data acquired, expanding its application beyond data transfer to smart preliminary data decision-making prior to full acquisition. PMID:27983788
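
    The just-in-time streaming idea, compressing files as they finish acquisition and handing them off for processing, can be sketched roughly as below. This is not XCMS Stream itself; the directory names, polling strategy, and .mzML file pattern are assumptions made for illustration.

```python
import gzip, shutil, time
from pathlib import Path

ACQ_DIR = Path("acquisition")   # hypothetical: instrument writes files here
OUT_DIR = Path("outbox")        # hypothetical: synced to the processing server

def stream_new_files(poll_seconds=5):
    """Poll the acquisition directory and compress each newly finished
    file into the outbox while the run is still acquiring (sketch only)."""
    OUT_DIR.mkdir(exist_ok=True)
    seen = set()
    while True:
        for raw in sorted(ACQ_DIR.glob("*.mzML")):
            if raw.name in seen:
                continue
            target = OUT_DIR / (raw.name + ".gz")
            with raw.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)   # compress during acquisition time
            seen.add(raw.name)
        time.sleep(poll_seconds)

# stream_new_files()   # would run until interrupted
```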

  9. Exploring and Analyzing Climate Variations Online by Using MERRA-2 data at GES DISC

    NASA Astrophysics Data System (ADS)

    Shen, S.; Ostrenga, D.; Vollmer, B.; Kempler, S.

    2016-12-01

    NASA Giovanni (Geospatial Interactive Online Visualization ANd aNalysis Infrastructure) (http://giovanni.sci.gsfc.nasa.gov/giovanni/) is a web-based data visualization and analysis system developed by the Goddard Earth Sciences Data and Information Services Center (GES DISC). Current data analysis functions include lat-lon maps, time series, scatter plots, correlation maps, differences, cross-sections, vertical profiles, and animations. The system enables basic statistical analysis and comparisons of multiple variables. This web-based tool facilitates data discovery, exploration, and analysis of large amounts of global and regional remote sensing and model data sets from a number of NASA data centers. Recently, long-term global assimilated atmospheric, land, and ocean data have been integrated into the system, enabling quick exploration and analysis of climate data without downloading and preprocessing the data. Example data include climate reanalysis from the NASA Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), which provides data from 1980 to the present; land data from the NASA Global Land Data Assimilation System (GLDAS), which assimilates data from 1948 to 2012; and ocean biological data from the NASA Ocean Biogeochemical Model (NOBM), which assimilates data from 1998 to 2012. This presentation, using surface air temperature, precipitation, ozone, and aerosol variables from MERRA-2, demonstrates climate variation analysis with Giovanni at selected regions.
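
    For readers who prefer to work outside the browser, the kind of area-averaged time series that Giovanni produces can be reproduced locally with xarray, as in the hedged sketch below. The file name is a placeholder for a downloaded MERRA-2 granule, and the variable and dimension names (T2M, lat, lon) follow typical MERRA-2 conventions but should be verified for the specific product.

```python
import xarray as xr
import matplotlib.pyplot as plt

# Hypothetical local MERRA-2 monthly file
ds = xr.open_dataset("MERRA2_400.tavgM_2d_slv_Nx.nc4")

# Area-average 2-m air temperature over a Northern Eurasia box
t2m = (ds["T2M"]
       .sel(lat=slice(45, 70), lon=slice(30, 120))
       .mean(dim=["lat", "lon"]))

t2m.plot()                         # time series, analogous to a Giovanni plot
plt.title("Area-averaged 2-m temperature (MERRA-2)")
plt.show()
```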

  10. Data Specifications for the Landsat Data Continuity Mission

    USGS Publications Warehouse

    Irons, J.R.; Speciale, N.J.; Douglas, McCuistion J.; Masek, J.G.; Markham, B.L.; Storey, James C.; Lencioni, D.E.; Ryan, R.E.

    2003-01-01

    The National Aeronautics and Space Administration (NASA) plans to procure data from a privately owned and commercially operated remote sensing system for the next Landsat mission, the Landsat Data Continuity Mission (LDCM). Data requirements are documented in an LDCM Data Specification. The specification requires delivery of data covering 250 Landsat scenes on a daily basis. The data are to be acquired in a manner that affords seasonal coverage of the global land mass. Data are required for the heritage reflective Thematic Mapper (TM) spectral bands plus two new bands: a blue band for coastal zone observations and a shortwave infrared band for cirrus cloud detection. The specification does not require thermal data, representing a departure from the TM heritage. The specification also requires data providing a 30 m ground sample distance for each of the spectral bands, with the exception of the new cirrus cloud band at 120 m. An absolute uncertainty of 5 percent or less is required for radiometrically corrected LDCM data, and the commercial operator is required to deliver data that can be registered to a cartographic projection with an uncertainty of 65 m or less. Procuring data from a commercial operator represents a new approach for the 30-year-old Landsat Program. The LDCM Data Specification will ensure that the procured data provide continuity of the Landsat data stream and advance the mission.

  11. Data streaming for metabolomics: Accelerating data processing and analysis from days to minutes

    DOE PAGES

    Montenegro-Burke, J. Rafael; Aisporna, Aries E.; Benton, H. Paul; ...

    2016-12-16

    The speed and throughput of analytical platforms has been a driving force in recent years in the “omics” technologies and while great strides have been accomplished in both chromatography and mass spectrometry, data analysis times have not benefited at the same pace. Even though personal computers have become more powerful, data transfer times still represent a bottleneck in data processing because of the increasingly complex data files and studies with a greater number of samples. To meet the demand of analyzing hundreds to thousands of samples within a given experiment, we have developed a data streaming platform, XCMS Stream, which capitalizes on the acquisition time to compress and stream recently acquired data files to data processing servers, mimicking just-in-time production strategies from the manufacturing industry. The utility of this XCMS Online-based technology is demonstrated here in the analysis of T cell metabolism and other large-scale metabolomic studies. A large scale example on a 1000 sample data set demonstrated a 10 000-fold time savings, reducing data analysis time from days to minutes. Here, XCMS Stream has the capability to increase the efficiency of downstream biochemical dependent data acquisition (BDDA) analysis by initiating data conversion and data processing on subsets of data acquired, expanding its application beyond data transfer to smart preliminary data decision-making prior to full acquisition.

  12. Usefulness of syndromic data sources for investigating morbidity resulting from a severe weather event.

    PubMed

    Baer, Atar; Elbert, Yevgeniy; Burkom, Howard S; Holtry, Rekha; Lombardo, Joseph S; Duchin, Jeffrey S

    2011-03-01

    We evaluated emergency department (ED) data, emergency medical services (EMS) data, and public utilities data for describing an outbreak of carbon monoxide (CO) poisoning following a windstorm. Syndromic ED data were matched against previously collected chart abstraction data. We ran detection algorithms on selected time series derived from all 3 data sources to identify health events associated with the CO poisoning outbreak. We used spatial and spatiotemporal scan statistics to identify geographic areas that were most heavily affected by the CO poisoning event. Of the 241 CO cases confirmed by chart review, 190 (78.8%) were identified in the syndromic surveillance data as exact matches. Records from the ED and EMS data detected an increase in CO-consistent syndromes after the storm. The ED data identified significant clusters of CO-consistent syndromes, including zip codes that had widespread power outages. Weak temporal gastrointestinal (GI) signals, possibly resulting from ingestion of food spoiled by lack of refrigeration, were detected in the ED data but not in the EMS data. Spatial clustering of GI-based groupings in the ED data was not detected. Data from this evaluation support the value of ED data for surveillance after natural disasters. Enhanced EMS data may be useful for monitoring a CO poisoning event, if these data are available to the health department promptly. ©2011 American Medical Association. All rights reserved.
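
    The abstract does not name the specific detection algorithms that were run; as a generic illustration of flagging an increase in a daily syndromic time series, the sketch below applies a one-sided CUSUM to standardized daily ED counts. The baseline window, allowance k, and threshold h are arbitrary choices, not those used in the study.

```python
import numpy as np

def cusum_alerts(counts, baseline_days=28, k=0.5, h=4.0):
    """Flag days where a one-sided CUSUM of standardized daily counts
    exceeds threshold h (generic illustration, not the study's algorithm)."""
    counts = np.asarray(counts, float)
    alerts, s = [], 0.0
    for day in range(baseline_days, len(counts)):
        base = counts[day - baseline_days:day]
        mu, sigma = base.mean(), base.std(ddof=1) or 1.0
        z = (counts[day] - mu) / sigma
        s = max(0.0, s + z - k)      # accumulate excess above allowance k
        if s > h:
            alerts.append(day)
            s = 0.0                  # reset after signaling
    return alerts
```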

  13. JiTTree: A Just-in-Time Compiled Sparse GPU Volume Data Structure.

    PubMed

    Labschütz, Matthias; Bruckner, Stefan; Gröller, M Eduard; Hadwiger, Markus; Rautek, Peter

    2016-01-01

    Sparse volume data structures enable the efficient representation of large but sparse volumes in GPU memory for computation and visualization. However, the choice of a specific data structure for a given data set depends on several factors, such as the memory budget, the sparsity of the data, and data access patterns. In general, there is no single optimal sparse data structure, but rather a set of candidates with individual strengths and drawbacks. One solution to this problem is hybrid data structures, which locally adapt themselves to the sparsity. However, they typically suffer from increased traversal overhead, which limits their utility in many applications. This paper presents JiTTree, a novel sparse hybrid volume data structure that uses just-in-time compilation to overcome these problems. By combining multiple sparse data structures and reducing traversal overhead we leverage their individual advantages. We demonstrate that hybrid data structures adapt well to a large range of data sets. They are especially superior to other sparse data structures for data sets that locally vary in sparsity. Possible optimization criteria are memory, performance, or a combination thereof. Through just-in-time (JIT) compilation, JiTTree reduces the traversal overhead of the resulting optimal data structure. As a result, our hybrid volume data structure enables efficient computations on the GPU, while being superior in terms of memory usage when compared to non-hybrid data structures.
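
    The hybrid idea, choosing a representation per brick according to local sparsity, can be illustrated with the short sketch below. It selects between a dense array and a coordinate list for each brick of a 3-D volume; JiTTree's GPU traversal and just-in-time compilation are not reproduced here.

```python
import numpy as np

def build_hybrid_bricks(volume, brick=8, dense_threshold=0.5):
    """Split a 3-D volume into bricks and pick a per-brick representation:
    a dense array if mostly occupied, a coordinate list if sparse.
    (Illustrates the hybrid idea only.)"""
    bricks = {}
    nz, ny, nx = volume.shape
    for z in range(0, nz, brick):
        for y in range(0, ny, brick):
            for x in range(0, nx, brick):
                block = volume[z:z+brick, y:y+brick, x:x+brick]
                occupancy = np.count_nonzero(block) / block.size
                if occupancy == 0.0:
                    continue                         # empty bricks cost nothing
                if occupancy >= dense_threshold:
                    bricks[(z, y, x)] = ("dense", block.copy())
                else:
                    coords = np.argwhere(block)      # indices of nonzero voxels
                    values = block[block != 0]
                    bricks[(z, y, x)] = ("coords", coords, values)
    return bricks
```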

  14. Apparatus and Method for Effecting Data Transfer Between Data Systems

    NASA Technical Reports Server (NTRS)

    Kirkpatrick, Joey V. (Inventor); Grosz, Francis B., Jr. (Inventor); Lannes, Kenny (Inventor); Maniscalco, David G. (Inventor)

    2001-01-01

    An apparatus for effecting data transfer between data systems comprises a first transceiver and a second transceiver. The first transceiver has an input for receiving digital data from one of the data systems, an output for serially outputting digital data to one of the data systems, at least one transmitter for converting digital data received at the input into optical signals, and at least one receiver for receiving optical signals and serially converting the received optical signals to digital data for output to the data output. The second transceiver has an input for receiving digital data from another one of the data systems, an output for serially outputting digital data to the another one of the data systems, at least one transmitter for serially converting digital data received at the input of the second transceiver into optical signals, and at least one receiver for receiving optical signals and serially converting the received optical signals to digital data for output to the output of the second transceiver. The apparatus further comprises an optical link connecting the first and second transceivers. The optical link comprises a pair of optical fibers. One of the optical fibers optically links the transmitter of the first transceiver to the receiver of the second transceiver. The other optical fiber optically links the receiver of the first transceiver to the transmitter of the second transceiver.

  15. [Big data in medicine and healthcare].

    PubMed

    Rüping, Stefan

    2015-08-01

    Healthcare is one of the business fields with the highest Big Data potential. According to the prevailing definition, Big Data refers to data that are too large and heterogeneous, and that change too quickly, to be stored, processed, and transformed into value by previous technologies. Technological trends drive Big Data: business processes are increasingly executed electronically, consumers produce more and more data themselves (e.g., in social networks), and digitalization continues to advance. Currently, several new trends toward new data sources and innovative data analysis are appearing in medicine and healthcare. From the research perspective, omics research is one clear Big Data topic. In practice, electronic health records, free open data, and the "quantified self" offer new perspectives for data analytics. Regarding analytics, significant advances have been made in information extraction from text data, which unlocks much of the clinical documentation for analytic purposes. At the same time, medicine and healthcare are lagging behind in the adoption of Big Data approaches. This can be traced to particular problems regarding data complexity and organizational, legal, and ethical challenges. The growing uptake of Big Data in general, and first best-practice examples in medicine and healthcare in particular, indicate that innovative solutions are coming. This paper gives an overview of the potential of Big Data in medicine and healthcare.

  16. Data Streaming for Metabolomics: Accelerating Data Processing and Analysis from Days to Minutes.

    PubMed

    Montenegro-Burke, J Rafael; Aisporna, Aries E; Benton, H Paul; Rinehart, Duane; Fang, Mingliang; Huan, Tao; Warth, Benedikt; Forsberg, Erica; Abe, Brian T; Ivanisevic, Julijana; Wolan, Dennis W; Teyton, Luc; Lairson, Luke; Siuzdak, Gary

    2017-01-17

    The speed and throughput of analytical platforms has been a driving force in recent years in the "omics" technologies and while great strides have been accomplished in both chromatography and mass spectrometry, data analysis times have not benefited at the same pace. Even though personal computers have become more powerful, data transfer times still represent a bottleneck in data processing because of the increasingly complex data files and studies with a greater number of samples. To meet the demand of analyzing hundreds to thousands of samples within a given experiment, we have developed a data streaming platform, XCMS Stream, which capitalizes on the acquisition time to compress and stream recently acquired data files to data processing servers, mimicking just-in-time production strategies from the manufacturing industry. The utility of this XCMS Online-based technology is demonstrated here in the analysis of T cell metabolism and other large-scale metabolomic studies. A large scale example on a 1000 sample data set demonstrated a 10 000-fold time savings, reducing data analysis time from days to minutes. Further, XCMS Stream has the capability to increase the efficiency of downstream biochemical dependent data acquisition (BDDA) analysis by initiating data conversion and data processing on subsets of data acquired, expanding its application beyond data transfer to smart preliminary data decision-making prior to full acquisition.

  17. Improvements in Space Geodesy Data Discovery at the CDDIS

    NASA Technical Reports Server (NTRS)

    Noll, C.; Pollack, N.; Michael, P.

    2011-01-01

    The Crustal Dynamics Data Information System (CDDIS) supports data archiving and distribution activities for the space geodesy and geodynamics community. The main objectives of the system are to store space geodesy and geodynamics related data products in a central data bank, to maintain information about the archival of these data, and to disseminate these data and information in a timely manner to a global scientific research community. The archive consists of GNSS, laser ranging, VLBI, and DORIS data sets and products derived from these data. The CDDIS is one of NASA's Earth Observing System Data and Information System (EOSDIS) distributed data centers; EOSDIS data centers serve a diverse user community and are tasked to provide facilities to search and access science data and products. Several activities are currently under development at the CDDIS to aid users in data discovery, both within the current community and beyond. The CDDIS is cooperating in the development of Geodetic Seamless Archive Centers (GSAC) with colleagues at UNAVCO and SIO. This activity will provide web services to facilitate data discovery within and across participating archives. In addition, the CDDIS is currently implementing modifications to the metadata extracted from incoming data and product files pushed to its archive. These enhancements will permit information about CDDIS archive holdings to be made available through other data portals such as the Earth Observing System (EOS) Clearinghouse (ECHO) and integrated into the Global Geodetic Observing System (GGOS) portal.

  18. Bayesian anomaly detection in monitoring data applying relevance vector machine

    NASA Astrophysics Data System (ADS)

    Saito, Tomoo

    2011-04-01

    A method is developed for automatically classifying monitoring data into two categories, normal and anomalous, in order to remove anomalous data from the enormous amount of monitoring data. The relevance vector machine (RVM) is applied to a probabilistic discriminative model with basis functions and weight parameters whose posterior PDF (probability density function), conditional on the learning data set, is given by Bayes' theorem. The proposed framework is applied to actual monitoring data sets containing some anomalous data collected at two buildings in Tokyo, Japan. The trained models discriminate anomalous data from normal data very clearly, giving high probabilities of being normal to normal data and low probabilities of being normal to anomalous data.
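
    A full sparse Bayesian RVM is beyond a short example, but the flavor of the approach, a probabilistic discriminative model over basis functions that returns the probability of a record being normal, can be sketched with L2-regularized logistic regression as a simplified MAP stand-in. The kernel width and regularization strength below are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

def fit_normal_probability_model(X_train, y_train, gamma=0.5, C=1.0):
    """Probabilistic discriminative classifier with Gaussian basis functions
    centred on the training points; L2-regularized logistic regression is a
    simplified MAP stand-in for the sparse Bayesian RVM."""
    Phi = rbf_kernel(X_train, X_train, gamma=gamma)   # basis-function design matrix
    clf = LogisticRegression(C=C, max_iter=1000).fit(Phi, y_train)

    def prob_normal(X_new):
        Phi_new = rbf_kernel(X_new, X_train, gamma=gamma)
        return clf.predict_proba(Phi_new)[:, 1]       # P(label == 1, i.e. "normal")

    return prob_normal

# Usage sketch: y_train is 1 for normal records and 0 for anomalous ones.
# p_normal = fit_normal_probability_model(X_train, y_train)(X_new)
```

    Records whose predicted probability of being normal falls below a chosen cutoff would then be screened out as anomalies, mirroring the two-category classification described above.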

  19. Aggregation of Minnesota water-use data and transfer of data to the National Water-Use Data System; Procedures and programs

    USGS Publications Warehouse

    Trotta, L.C.

    1988-01-01

    The Minnesota Water-Use Data System stores data on the quantity of withdrawals and discharge in Minnesota. To transfer these data into the U.S. Geological Survey's National Water-Use Data System properly, certain procedures must be followed. Uniform data categorization and entry allow comparison of water use from state to state. The data in the National Water-Use Data System are aggregated by county and by watershed (hydrologic unit). This report documents the data aggregation and transfer process as developed by the Minnesota Department of Natural Resources, the Minnesota State Planning Agency/Planning Information Center, and the U.S. Geological Survey as part of the National Water-Use Information Program.
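
    The county and hydrologic-unit aggregation can be illustrated with a small, hypothetical pandas example; the column names and figures below are invented for illustration and do not reflect the actual Minnesota data.

```python
import pandas as pd

# Hypothetical site-level withdrawal records (million gallons per day)
records = pd.DataFrame({
    "county":         ["Hennepin", "Hennepin", "Ramsey", "Ramsey"],
    "huc":            ["07010206", "07010206", "07010206", "07010208"],
    "category":       ["public supply", "industrial", "public supply", "irrigation"],
    "withdrawal_mgd": [35.2, 12.8, 22.4, 3.1],
})

# Aggregate water use by county and by watershed (hydrologic unit)
by_county = records.groupby(["county", "category"])["withdrawal_mgd"].sum()
by_huc    = records.groupby(["huc", "category"])["withdrawal_mgd"].sum()
```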

  20. Improvements to the National Transport Code Collaboration Data Server

    NASA Astrophysics Data System (ADS)

    Alexander, David A.

    2001-10-01

    The data server of the National Transport Code Collaboration Project provides a universal network interface to interpolated or raw transport data accessible by a universal set of names. Data can be acquired from a local copy of the International Multi-Tokamak (ITER) profile database as well as from TRANSP trees of MDSplus data systems on the net. Data are provided to the user's network client via a CORBA interface, thus providing stateful data server instances, which have the advantage of remembering the desired interpolation, data set, etc. This paper will review the status and discuss recent improvements made to the data server, such as the modularization of the data server and the addition of HDF5 and MDSplus data file writing capability.
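
    The interpolation service that the data server exposes can be suggested, in spirit only, by the short sketch below: a transport profile given on an irregular radial grid is resampled onto a uniform normalized-radius grid. The profile values are invented, and the CORBA interface itself is not reproduced.

```python
import numpy as np

def resample_profile(rho, values, n_points=101):
    """Interpolate a transport profile from an irregular radial grid onto a
    uniform normalized-radius grid (illustrative sketch only)."""
    rho_uniform = np.linspace(0.0, 1.0, n_points)
    return rho_uniform, np.interp(rho_uniform, rho, values)

# Example: electron temperature profile (keV) on an irregular grid
rho = np.array([0.0, 0.1, 0.3, 0.6, 0.85, 1.0])
te  = np.array([4.2, 4.0, 3.1, 1.8, 0.9, 0.2])
grid, te_uniform = resample_profile(rho, te)
```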
