Sample records for data

  1. Data, data everywhere ...

    NASA Astrophysics Data System (ADS)

    Chandler, C. L.

    2016-12-01

    The scientific research endeavor requires data, and in some cases massive amounts of complex and highly diverse data. From experimental design, through data acquisition and analysis, hypothesis testing, and finally drawing conclusions, data collection and proper stewardship are critical to science. Even a single experiment conducted by a single researcher will produce data to test the working hypothesis. The types of complex science questions being tackled today often require large, diverse, multi-disciplinary teams of researchers who must be prepared to exchange their data. This 2016 AGU Leptoukh Lecture comprises a series of vignettes that illustrate a brief history of data stewardship: where we have come from, how and why we have arrived where we are today, and where we are headed with respect to data management. The specific focus will be on management of marine ecosystem research data and will include observations on the drivers, challenges, strategies, and solutions that have evolved over time. The lessons learned should be applicable to other disciplines and the hope is that many will recognize parallels in their chosen domain. From historical shipboard logbooks to the high-volume, digital, quality-controlled ocean science data sets created by today's researchers, there have been enormous changes in the way ocean data are collected and reported. Rapid change in data management practices is being driven by new data exchange requirements, by modern expectations for machine-interoperable exchange, and by the desire to achieve research transparency. Advances in technology and cultural shifts contribute to the changing conditions through which data managers and informatics specialists must navigate. The unique challenges associated with collecting and managing environmental data, complicated by the onset of the big data era, make this a fascinating time to be responsible for data. It seems there are data everywhere, being collected by everyone, for all sorts of

  2. DataHub: Knowledge-based data management for data discovery

    NASA Astrophysics Data System (ADS)

    Handley, Thomas H.; Li, Y. Philip

    1993-08-01

    Currently available database technology is largely designed for business data-processing applications, and seems inadequate for scientific applications. The research described in this paper, the DataHub, will address the issues associated with this shortfall in technology utilization and development. The DataHub development is addressing the key issues in scientific data management of scientific database models and resource sharing in a geographically distributed, multi-disciplinary, science research environment. Thus, the DataHub will be a server between the data suppliers and data consumers to facilitate data exchanges, to assist science data analysis, and to provide a systematic approach for science data management. More specifically, the DataHub's objectives are to provide support for (1) exploratory data analysis (i.e., data-driven analysis); (2) data transformations; (3) data semantics capture and usage; (4) analysis-related knowledge capture and usage; and (5) data discovery, ingestion, and extraction. Applying technologies that range from deductive databases, semantic data models, data discovery, knowledge representation and inferencing, and exploratory data analysis techniques to modern man-machine interfaces, DataHub will provide a prototype, integrated environment to support research scientists' needs in multiple disciplines (i.e., oceanography, geology, and atmospheric science) while addressing the more general science data management issues. Additionally, the DataHub will provide data management services to exploratory data analysis applications such as LinkWinds and NCSA's XIMAGE.

  3. Managing troubled data: Coastal data partnerships smooth data integration

    USGS Publications Warehouse

    Hale, S.S.; Hale, Miglarese A.; Bradley, M.P.; Belton, T.J.; Cooper, L.D.; Frame, M.T.; Friel, C.A.; Harwell, L.M.; King, R.E.; Michener, W.K.; Nicolson, D.T.; Peterjohn, B.G.

    2003-01-01

    Understanding the ecology, condition, and changes of coastal areas requires data from many sources. Broad-scale and long-term ecological questions, such as global climate change, biodiversity, and cumulative impacts of human activities, must be addressed with databases that integrate data from several different research and monitoring programs. Various barriers, including widely differing data formats, codes, directories, systems, and metadata used by individual programs, make such integration troublesome. Coastal data partnerships, by helping overcome technical, social, and organizational barriers, can lead to a better understanding of environmental issues, and may enable better management decisions. Characteristics of successful data partnerships include a common need for shared data, strong collaborative leadership, committed partners willing to invest in the partnership, and clear agreements on data standards and data policy. Emerging data and metadata standards that become widely accepted are crucial. New information technology is making it easier to exchange and integrate data. Data partnerships allow us to create broader databases than would be possible for any one organization to create by itself.

  4. Managing troubled data: coastal data partnerships smooth data integration.

    PubMed

    Hale, Stephen S; Miglarese, Anne Hale; Bradley, M Patricia; Belton, Thomas J; Cooper, Larry D; Frame, Michael T; Friel, Christopher A; Harwell, Linda M; King, Robert E; Michener, William K; Nicolson, David T; Peterjohn, Bruce G

    2003-01-01

    Understanding the ecology, condition, and changes of coastal areas requires data from many sources. Broad-scale and long-term ecological questions, such as global climate change, biodiversity, and cumulative impacts of human activities, must be addressed with databases that integrate data from several different research and monitoring programs. Various barriers, including widely differing data formats, codes, directories, systems, and metadata used by individual programs, make such integration troublesome. Coastal data partnerships, by helping overcome technical, social, and organizational barriers, can lead to a better understanding of environmental issues, and may enable better management decisions. Characteristics of successful data partnerships include a common need for shared data, strong collaborative leadership, committed partners willing to invest in the partnership, and clear agreements on data standards and data policy. Emerging data and metadata standards that become widely accepted are crucial. New information technology is making it easier to exchange and integrate data. Data partnerships allow us to create broader databases than would be possible for any one organization to create by itself.

  5. Common Data Model for Neuroscience Data and Data Model Exchange

    PubMed Central

    Gardner, Daniel; Knuth, Kevin H.; Abato, Michael; Erde, Steven M.; White, Thomas; DeBellis, Robert; Gardner, Esther P.

    2001-01-01

    Objective: Generalizing the data models underlying two prototype neurophysiology databases, the authors describe and propose the Common Data Model (CDM) as a framework for federating a broad spectrum of disparate neuroscience information resources. Design: Each component of the CDM derives from one of five superclasses—data, site, method, model, and reference—or from relations defined between them. A hierarchic attribute-value scheme for metadata enables interoperability with variable tree depth to serve specific intra- or broad inter-domain queries. To mediate data exchange between disparate systems, the authors propose a set of XML-derived schema for describing not only data sets but data models. These include biophysical description markup language (BDML), which mediates interoperability between data resources by providing a meta-description for the CDM. Results: The set of superclasses potentially spans data needs of contemporary neuroscience. Data elements abstracted from neurophysiology time series and histogram data represent data sets that differ in dimension and concordance. Site elements transcend neurons to describe subcellular compartments, circuits, regions, or slices; non-neuroanatomic sites include sequences to patients. Methods and models are highly domain-dependent. Conclusions: True federation of data resources requires explicit public description, in a metalanguage, of the contents, query methods, data formats, and data models of each data resource. Any data model that can be derived from the defined superclasses is potentially conformant and interoperability can be enabled by recognition of BDML-described compatibilities. Such metadescriptions can buffer technologic changes. PMID:11141510
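
    To make the superclass structure concrete, the following minimal Python sketch models three of the five superclasses and the hierarchic attribute-value metadata described above; all class and field names are illustrative assumptions, not the published CDM or BDML definitions.

        # Illustrative only: names are assumptions, not the published CDM/BDML.
        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class MetadataNode:
            """Hierarchic attribute-value metadata with variable tree depth."""
            attribute: str
            value: str
            children: List["MetadataNode"] = field(default_factory=list)

        @dataclass
        class Site:
            """A recording site: subcellular compartment, circuit, region, slice, ..."""
            name: str
            metadata: List[MetadataNode] = field(default_factory=list)

        @dataclass
        class DataSet:
            """Time series or histogram data acquired at a site by some method."""
            site: Site
            method: str
            dimensions: int
            samples: List[float] = field(default_factory=list)
            metadata: List[MetadataNode] = field(default_factory=list)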

  6. Data Foundry: Data Warehousing and Integration for Scientific Data Management

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Musick, R.; Critchlow, T.; Ganesh, M.

    2000-02-29

    Data warehousing is an approach for managing data from multiple sources by representing them with a single, coherent point of view. Commercial data warehousing products have been produced by companies such as Red Brick, IBM, Brio, Andyne, Ardent, NCR, Information Advantage, Informatica, and others. Other companies have chosen to develop their own in-house data warehousing solution using relational databases, such as those sold by Oracle, IBM, Informix and Sybase. The typical approaches include federated systems and mediated data warehouses, each of which, to some extent, makes use of a series of source-specific wrapper and mediator layers to integrate the data into a consistent format which is then presented to users as a single virtual data store. These approaches are successful when applied to traditional business data because the data format used by the individual data sources tends to be rather static. Therefore, once a data source has been integrated into a data warehouse, there is relatively little work required to maintain that connection. However, that is not the case for all data sources. Data sources from scientific domains tend to regularly change their data model, format and interface. This is problematic because each change requires the warehouse administrator to update the wrapper, mediator, and warehouse interfaces to properly read, interpret, and represent the modified data source. Furthermore, the data that scientists require to carry out research is continuously changing as their understanding of a research question develops, or as their research objectives evolve. The difficulty and cost of these updates effectively limits the number of sources that can be integrated into a single data warehouse, or makes an approach based on warehousing too expensive to consider.
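
    The wrapper/mediator layering mentioned above can be sketched as follows; this is a generic illustration of the pattern with hypothetical names, not Data Foundry code.

        # Generic wrapper/mediator pattern; class and method names are hypothetical.
        from abc import ABC, abstractmethod
        from typing import Dict, List

        class SourceWrapper(ABC):
            """Translates one source's native format into a common record layout."""
            @abstractmethod
            def fetch(self, query: str) -> List[Dict]:
                ...

        class FlatFileWrapper(SourceWrapper):
            def fetch(self, query: str) -> List[Dict]:
                # Parse the source-specific format here and emit common records.
                return [{"source": "flatfile", "query": query, "fields": {}}]

        class Mediator:
            """Presents the wrapped sources as a single virtual data store."""
            def __init__(self, wrappers: List[SourceWrapper]):
                self.wrappers = wrappers

            def query(self, q: str) -> List[Dict]:
                results: List[Dict] = []
                for w in self.wrappers:
                    # A change to one source only requires updating its wrapper.
                    results.extend(w.fetch(q))
                return results

        print(Mediator([FlatFileWrapper()]).query("protein structures"))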

  7. Simulation Data as Data Streams

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Abdulla, G; Arrighi, W; Critchlow, T

    2003-11-18

    Computational or scientific simulations are increasingly being applied to solve a variety of scientific problems. Domains such as astrophysics, engineering, chemistry, biology, and environmental studies are benefiting from this important capability. Simulations, however, produce enormous amounts of data that need to be analyzed and understood. In this overview paper, we describe scientific simulation data, its characteristics, and the way scientists generate and use the data. We then compare and contrast simulation data to data streams. Finally, we describe our approach to analyzing simulation data, present the AQSim (Ad-hoc Queries for Simulation data) system, and discuss some of the challenges that result from handling this kind of data.

  8. Alternative Fuels Data Center: Data Downloads

    Science.gov Websites

  9. Health data and data governance.

    PubMed

    Hovenga, Evelyn J S; Grain, Heather

    2013-01-01

    Health is a knowledge industry, based on data collected to support care, service planning, financing and knowledge advancement. Increasingly there is a need to collect, retrieve and use health record information in an electronic format to provide greater flexibility, as this enables retrieval and display of data in multiple locations and formats irrespective of where the data were collected. Electronically maintained records require greater structure and consistency to achieve this. The use of data held in records generated in real time in clinical systems also has the potential to reduce the time it takes to gain knowledge, as there is less need to collect research-specific information; however, this is only possible if data governance principles are applied. Connected devices and information systems are now generating huge amounts of data on a scale never seen before. An ability to analyse and mine very large amounts of data, "Big Data", provides policy and decision makers with new insights into varied aspects of work and information flow and operational business patterns and trends, and drives greater efficiencies, and safer and more effective health care. This enables decision makers to apply rules and guidance that have been developed based upon knowledge from many individual patient records through recognition of triggers based upon that knowledge. In clinical decision support systems information about the individual is compared to rules based upon knowledge gained from accumulated information of many to provide guidance at appropriate times in the clinical process. To achieve this, the data in the individual system and the knowledge rules must be represented in a compatible and consistent manner. This chapter describes data attributes; explains the difference between data and information; outlines the requirements for quality data; shows the relevance of health data standards; and describes how data governance impacts representation of content in systems and the use of that

  10. Data Model Performance in Data Warehousing

    NASA Astrophysics Data System (ADS)

    Rorimpandey, G. C.; Sangkop, F. I.; Rantung, V. P.; Zwart, J. P.; Liando, O. E. S.; Mewengkang, A.

    2018-02-01

    Data warehouses have become increasingly important in organizations that hold large amounts of data. A data warehouse is not a product but part of a solution for the decision support system in those organizations. The data model is the starting point for designing and developing data warehouse architectures, so it needs stable interfaces and must remain consistent over a long period of time. The aim of this research is to determine which data model in data warehousing has the best performance. The research method is descriptive analysis, with three main tasks: data collection and organization, data analysis, and data interpretation. The results, evaluated with statistical analysis, show no statistically significant difference among the data models used in data warehousing. Organizations can therefore utilize any of the four proposed data models when designing and developing a data warehouse.

  11. Data Resources | Geospatial Data Science | NREL

    Science.gov Websites

    Datasets covering a variety of renewable energy technologies, designed to be used in GIS software applications: Biomass Data, Geothermal Data, Hydrogen Data, Marine and Hydrokinetic Data, Solar Data, and Wind Data.

  12. Transforming Research Data into Resource Data

    NASA Astrophysics Data System (ADS)

    Chandler, C. L.; Shepherd, A.; Groman, R. C.; Kinkade, D.; Rauch, S.; Allison, M. D.; Copley, N. J.; Ake, H.; York, A.; Wiebe, P. H.; Glover, D. M.

    2016-12-01

    Many of the Grand Challenge science questions are of interest to the marine science research community funded by the United States National Science Foundation (NSF). The highly diverse range of environmental data from the oceans, coastal regions, and Great Lakes is collected using a variety of platforms, instrument systems and sensors, and is complemented by experimental results, including sequence data, and model results. The data are often collected with a particular research purpose in mind. Such data are costly to acquire, and environmental data, being temporally and geographically unique, cannot be acquired again. The NSF-funded research community comprising funded investigators and their research teams, operators of the US academic research fleet, data managers, marine librarians, and NSF program managers are working together to transform "research data" into "resource data". The objective is to ensure that the original research data become available to a much wider community and have the potential to be used as "resource data" for new and different types of research well beyond the initial focus of the NSF research grant. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) manages a community-driven data repository that serves some of these data: the data and results from research funded by NSF Ocean Sciences and Polar Programs. Individually such data sets are typically small in size, but when integrated these data become a valuable resource for the global research effort. The data are analyzed, quality controlled, finalized by the original investigators and their research teams, and then contributed to BCO-DMO. The BCO-DMO data managers reformat the data if they were submitted in proprietary formats, perform quality assessment review, augment the data sets with additional documentation, and create structured, machine-actionable metadata. The BCO-DMO data system allows researchers to make connections between related data sets within the BCO

  13. Northern California Earthquake Data Center: Data Sets and Data Services

    NASA Astrophysics Data System (ADS)

    Neuhauser, D. S.; Allen, R. M.; Zuzlewski, S.

    2015-12-01

    The Northern California Earthquake Data Center (NCEDC) provides a permanent archive and real-time data distribution services for a unique and comprehensive data set of seismological and geophysical data sets encompassing northern and central California. We provide access to over 85 terabytes of continuous and event-based time series data from broadband, short-period, strong motion, and strain sensors as well as continuous and campaign GPS data at both standard and high sample rates. The Northern California Seismic System (NCSS), operated by UC Berkeley and USGS Menlo Park, has recorded over 900,000 events from 1984 to the present, and the NCEDC serves catalog, parametric information, moment tensors and first motion mechanisms, and time series data for these events. We also serve event catalogs, parametric information, and event waveforms for DOE enhanced geothermal system monitoring in northern California and Nevada. The NCEDC provides several ways for users to access these data. The most recent development is web services, which provide interactive, command-line, or program-based workflow access to data. Web services use well-established server and client protocols and RESTful software architecture that allow users to easily submit queries and receive the requested data in real time rather than through batch or email-based requests. Data are returned to the user in the appropriate format such as XML, RESP, simple text, or MiniSEED depending on the service and selected output format. The NCEDC supports all FDSN-defined web services as well as a number of IRIS-defined and NCEDC-defined services. We also continue to support older email-based and browser-based access to data. NCEDC data and web services can be found at http://www.ncedc.org and http://service.ncedc.org.
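
    As a concrete illustration of the web services described above, the sketch below requests a short MiniSEED waveform segment through the standard FDSN dataselect interface; the station and time window are example values, and the current endpoint details should be checked at http://service.ncedc.org.

        # Example FDSN dataselect request (parameters follow the FDSN web service
        # specification; network/station/times below are example values only).
        import requests

        params = {
            "net": "BK",                          # Berkeley Digital Seismic Network
            "sta": "CMB",                         # example station code
            "cha": "BHZ",                         # vertical broadband channel
            "starttime": "2015-01-01T00:00:00",
            "endtime": "2015-01-01T00:10:00",
        }
        resp = requests.get("http://service.ncedc.org/fdsnws/dataselect/1/query",
                            params=params, timeout=60)
        resp.raise_for_status()
        with open("ncedc_waveform.mseed", "wb") as f:
            f.write(resp.content)                 # MiniSEED returned by the service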

  14. The Open Data Repositorys Data Publisher

    NASA Technical Reports Server (NTRS)

    Stone, N.; Lafuente, B.; Downs, R. T.; Blake, D.; Bristow, T.; Fonda, M.; Pires, A.

    2015-01-01

    Data management and data publication are becoming increasingly important components of researchers' workflows. The complexity of managing data, publishing data online, and archiving data has not decreased significantly even as computing access and power has greatly increased. The Open Data Repository's Data Publisher software strives to make data archiving, management, and publication a standard part of a researcher's workflow using simple, web-based tools and commodity server hardware. The publication engine allows for uploading, searching, and display of data with graphing capabilities and downloadable files. Access is controlled through a robust permissions system that can control publication at the field level and can be granted to the general public or protected so that only registered users at various permission levels receive access. Data Publisher also allows researchers to subscribe to meta-data standards through a plugin system, embargo data publication at their discretion, and collaborate with other researchers through various levels of data sharing. As the software matures, semantic data standards will be implemented to facilitate machine reading of data and each database will provide a REST application programming interface for programmatic access. Additionally, a citation system will allow snapshots of any data set to be archived and cited for publication while the data itself can remain living and continuously evolve beyond the snapshot date. The software runs on a traditional LAMP (Linux, Apache, MySQL, PHP) server and is available on GitHub (http://github.com/opendatarepository) under a GPLv2 open source license. The goal of the Open Data Repository is to lower the cost and training barrier to entry so that any researcher can easily publish their data and ensure it is archived for posterity.
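
    Since the abstract describes the REST interface as a planned feature, the sketch below is purely hypothetical: the instance URL, endpoint path, and JSON layout are invented for illustration and do not document an actual Data Publisher API.

        # Hypothetical sketch only; endpoint and response layout are assumptions.
        import requests

        base = "https://example-odr-instance.org/api"         # hypothetical instance
        record = requests.get(f"{base}/database/42/record/7", timeout=30).json()
        for field_name, value in record.get("fields", {}).items():
            print(field_name, value)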

  15. Fundamental Data Standards for Science Data System Interoperability and Data Correlation

    NASA Astrophysics Data System (ADS)

    Hughes, J. Steven; Gopala Krishna, Barla; Rye, Elizabeth; Crichton, Daniel

    The advent of the Web and languages such as XML have brought an explosion of online science data repositories and the promises of correlated data and interoperable systems. However there have been relatively few successes in meeting the expectations of science users in the internet age. For example a Google-like search for images of Mars will return many highly-derived and appropriately tagged images but largely ignore the majority of images in most online image repositories. Once retrieved, users are further frustrated by poor data descriptions, arcane formats, and badly organized ancillary information. A wealth of research indicates that shared information models are needed to enable system interoperability and data correlation. However, at a more fundamental level, data correlation and system interoperability are dependent on a relatively few shared data standards. A common data dictionary standard, for example, allows the controlled vocabulary used in a science repository to be shared with potential collaborators. Common data registry and product identification standards enable systems to efficiently find, locate, and retrieve data products and their metadata from remote repositories. Information content standards define categories of descriptive data that help make the data products scientifically useful to users who were not part of the original team that produced the data. The Planetary Data System (PDS) has a plan to move the PDS to a fully online, federated system. This plan addresses new demands on the system including increasing data volume, numbers of missions, and complexity of missions. A key component of this plan is the upgrade of the PDS Data Standards. The adoption of the core PDS data standards by the International Planetary Data Alliance (IPDA) adds the element of international cooperation to the plan. This presentation will provide an overview of the fundamental data standards being adopted by the PDS that transcend science domains and that

  16. Data, Data Everywhere!

    ERIC Educational Resources Information Center

    Busby, Joe R.; Ernst, Jeremy V.; Varnado, Terri E.

    2009-01-01

    Data acquisition and analysis are cornerstones for informed decision making about the environment. Facts and figures are vital to the practical application of systems and products. These data are collected through various means and organized into useful forms (i.e., visual imagery, datasets, and sounds). Whether collected by persons or obtained…

  17. The Open Data Repository's Data Publisher

    NASA Astrophysics Data System (ADS)

    Stone, N.; Lafuente, B.; Downs, R. T.; Bristow, T.; Blake, D. F.; Fonda, M.; Pires, A.

    2015-12-01

    Data management and data publication are becoming increasingly important components of research workflows. The complexity of managing data, publishing data online, and archiving data has not decreased significantly even as computing access and power has greatly increased. The Open Data Repository's Data Publisher software (http://www.opendatarepository.org) strives to make data archiving, management, and publication a standard part of a researcher's workflow using simple, web-based tools and commodity server hardware. The publication engine allows for uploading, searching, and display of data with graphing capabilities and downloadable files. Access is controlled through a robust permissions system that can control publication at the field level and can be granted to the general public or protected so that only registered users at various permission levels receive access. Data Publisher also allows researchers to subscribe to meta-data standards through a plugin system, embargo data publication at their discretion, and collaborate with other researchers through various levels of data sharing. As the software matures, semantic data standards will be implemented to facilitate machine reading of data and each database will provide a REST application programming interface for programmatic access. Additionally, a citation system will allow snapshots of any data set to be archived and cited for publication while the data itself can remain living and continuously evolve beyond the snapshot date. The software runs on a traditional LAMP (Linux, Apache, MySQL, PHP) server and is available on GitHub (http://github.com/opendatarepository) under a GPLv2 open source license. The goal of the Open Data Repository is to lower the cost and training barrier to entry so that any researcher can easily publish their data and ensure it is archived for posterity. We gratefully acknowledge the support for this study by the Science-Enabling Research Activity (SERA), and NASA NNX11AP82A

  18. Automated Data Submission for the Data Center

    NASA Astrophysics Data System (ADS)

    Wright, D.; Beaty, T.; Wei, Y.; Shanafield, H.; Santhana Vannan, S. K.

    2014-12-01

    Data centers struggle with difficulties related to data submission. Data are acquired through many avenues by many people. Many data submission activities involve intensive manual processes. During the submission process, data end up on varied storage devices. The situation can easily become chaotic. Collecting information on the status of pending data sets is arduous. For data providers, the submission process can be inconsistent and confusing. Scientists generally provide data from previous projects, and archival can be a low priority. Incomplete or poor documentation accompanies many data sets. However, complicated questionnaires deter busy data providers. At the ORNL DAAC, we have semi-automated the data set submission process to create a uniform data product and provide a consistent data provider experience. The formalized workflow makes archival faster for the data center and data set submission easier for data providers. Software modules create a flexible, reusable submission package. Formalized data set submission provides several benefits to the data center. A single data upload area provides one point of entry and ensures data are stored in a consistent location. A central dashboard records pending data set submissions in a single table and simplifies reporting. Flexible role management allows team members to readily coordinate and increases efficiency. Data products and metadata become uniform and easily maintained. As data and metadata standards change, modules can be modified or re-written without affecting workflow. While each data center has unique challenges, the data ingestion process is generally the same: get data from the provider, scientist, or project and capture metadata pertinent to that data. The ORNL DAAC data set submission workflow and software modules can be reused entirely or in part by other data centers looking for a data set submission solution. These data set submission modules will be available on NASA's Earthdata Code
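
    A minimal sketch of the kind of modular, reusable submission workflow described above is given below; the module names, package fields, and dashboard behavior are assumptions for illustration, not the ORNL DAAC software itself.

        # Illustrative modular submission workflow; names and fields are hypothetical.
        from typing import Callable, Dict, List

        def receive_upload(pkg: Dict) -> Dict:
            pkg["status"] = "received"            # single data upload area
            return pkg

        def collect_metadata(pkg: Dict) -> Dict:
            pkg["metadata_complete"] = bool(pkg.get("metadata"))
            return pkg

        def register_in_dashboard(pkg: Dict) -> Dict:
            pkg["status"] = "pending review"      # one central table of pending sets
            return pkg

        MODULES: List[Callable[[Dict], Dict]] = [
            receive_upload, collect_metadata, register_in_dashboard,
        ]

        def run_submission(pkg: Dict) -> Dict:
            for step in MODULES:                  # modules can be swapped as standards change
                pkg = step(pkg)
            return pkg

        print(run_submission({"provider": "example", "metadata": {"title": "Test"}}))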

  19. Towards Data Repository Interoperability: The Data Conservancy Data Packaging Specification

    NASA Astrophysics Data System (ADS)

    DiLauro, T.; Duerr, R.; Thessen, A. E.; Rippin, M.; Pralle, B.; Choudhury, G. S.

    2013-12-01

    A modern data archive must support a variety of functions and services for a broad set of stakeholders over a variety of content. Data producers need to deposit this content; data consumers need to find and access it; journal publishers need to link to and from it; funders need to ensure that it is protected and its value enhanced; research institutions need to track it; and the archive itself needs to manage and preserve it. But there is not an optimal information model that supports all of these tasks. The attributes needed to manage format transformations for long-term preservation are different from, for example, those needed to understand provenance relationships among the various entities modeled in the archive. Exposing all possible properties to every function burdens users and makes it difficult to maintain a separation of concerns among the functional components. The Data Conservancy Software (DCS) manages these overlapping information needs by defining strict interfaces between components and providing mappers between the layers of the architecture. Still, work remains to make deposit more intuitive. Currently, depositing content into a DCS instance requires either very simple objects (e.g., one file equals one data item), significant manual effort, or detailed knowledge of DCS-internal data model serializations. And if one were to deposit that content into another type of archive, it would be necessary to repeat this effort. To allow data producers and consumers to interact with data in a more natural manner, the Data Conservancy[1] is developing a packaging approach that eases this burden and allows a semantic overlay atop the more familiar directory/folder and file metaphor. The standards-based packaging scheme augments the payload and validation capabilities of BagIt[2] with the relationship and resource description capabilities of the Open Archives Initiative (OAI) Object Reuse and Exchange (ORE)[3] model. In the absence of the ORE resource
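
    The sketch below shows the general idea of a BagIt payload augmented with a resource-map tag file, using the bagit-python library; the directory layout and the placeholder ORE map are illustrative assumptions and do not reproduce the Data Conservancy packaging specification.

        # Illustrative BagIt packaging (pip install bagit); layout is an assumption.
        import os
        import bagit

        os.makedirs("my_dataset", exist_ok=True)
        with open("my_dataset/observations.csv", "w") as f:
            f.write("site,value\nA,1.0\n")        # example payload file

        # Create the bag: payload moves into data/ and checksum manifests are written.
        bag = bagit.make_bag("my_dataset", {"Contact-Name": "Example Depositor"})

        # Add a placeholder ORE resource map as a tag file, then refresh manifests.
        os.makedirs("my_dataset/metadata", exist_ok=True)
        with open("my_dataset/metadata/oremap.xml", "w") as f:
            f.write("<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>")
        bag.save(manifests=True)
        print(bag.is_valid())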

  20. Data Integration Tool: Permafrost Data Debugging

    NASA Astrophysics Data System (ADS)

    Wilcox, H.; Schaefer, K. M.; Jafarov, E. E.; Pulsifer, P. L.; Strawhacker, C.; Yarmey, L.; Basak, R.

    2017-12-01

    We developed a Data Integration Tool (DIT) to significantly speed up the time of manual processing needed to translate inconsistent, scattered historical permafrost data into files ready to ingest directly into the Global Terrestrial Network-Permafrost (GTN-P). The United States National Science Foundation funded this project through the National Snow and Ice Data Center (NSIDC) with the GTN-P to improve permafrost data access and discovery. We leverage these data to support science research and policy decisions. DIT is a workflow manager that divides data preparation and analysis into a series of steps or operations called widgets (https://github.com/PermaData/DIT). Each widget does a specific operation, such as read, multiply by a constant, sort, plot, and write data. DIT allows the user to select and order the widgets as desired to meet their specific needs, incrementally interact with and evolve the widget workflows, and save those workflows for reproducibility. Taking ideas from visual programming found in the art and design domain, debugging and iterative design principles from software engineering, and the scientific data processing and analysis power of Fortran and Python, it was written for interactive, iterative data manipulation, quality control, processing, and analysis of inconsistent data in an easily installable application. DIT was used to completely translate one dataset (133 sites) that was successfully added to GTN-P, nearly translate three datasets (270 sites), and is scheduled to translate 10 more datasets (about 1000 sites) from the legacy inactive site data holdings of the Frozen Ground Data Center (FGDC). Iterative development has provided the permafrost and wider scientific community with an extendable tool designed specifically for the iterative process of translating unruly data.
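
    The widget idea can be illustrated with the minimal sketch below, in which small operations are selected and ordered into a reproducible workflow; the widget names and the CSV columns are assumptions, not the actual DIT implementation (see https://github.com/PermaData/DIT).

        # Illustrative widget-style workflow; names and columns are assumptions.
        import csv
        from typing import Callable, Dict, List

        Row = Dict[str, object]
        Widget = Callable[[List[Row]], List[Row]]

        def read_csv(path: str) -> List[Row]:
            with open(path, newline="") as f:
                return [dict(r) for r in csv.DictReader(f)]

        def multiply(column: str, constant: float) -> Widget:
            def apply(rows: List[Row]) -> List[Row]:
                for r in rows:
                    r[column] = float(r[column]) * constant   # e.g. unit conversion
                return rows
            return apply

        def sort_by(column: str) -> Widget:
            return lambda rows: sorted(rows, key=lambda r: r[column])

        # Users choose and order widgets, then save the workflow for reproducibility.
        workflow: List[Widget] = [multiply("depth_cm", 0.01), sort_by("site_id")]
        rows = read_csv("legacy_permafrost.csv")
        for widget in workflow:
            rows = widget(rows)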

  1. Data Governance and Stewardship: Designing Data Stewardship Entities and Advancing Data Access

    PubMed Central

    Rosenbaum, Sara

    2010-01-01

    U.S. health policy is engaged in a struggle over access to health information, in particular, the conditions under which information should be accessible for research when appropriate privacy protections and security safeguards are in place. The expanded use of health information—an inevitable step in an information age—is widely considered to be essential to health system reform. Models exist for the creation of data-sharing arrangements that promote proper use of information in a safe and secure environment and with attention to ethical standards. Data stewardship is a concept with deep roots in the science and practice of data collection, sharing, and analysis. Reflecting the values of fair information practice, data stewardship denotes an approach to the management of data, particularly data that can identify individuals. The concept of a data steward is intended to convey a fiduciary (or trust) level of responsibility toward the data. Data governance is the process by which responsibilities of stewardship are conceptualized and carried out. As the concept of health information data stewardship advances in a technology-enabled environment, the question is whether legal barriers to data access and use will begin to give way. One possible answer may lie in defining the public interest in certain data uses, tying provider participation in federal health programs to the release of all-payer data to recognized data stewardship entities for aggregation and management, and enabling such entities to foster and enable the creation of knowledge through research. PMID:21054365

  2. Data hosting infrastructure for primary biodiversity data

    PubMed Central

    2011-01-01

    Background Today, an unprecedented volume of primary biodiversity data are being generated worldwide, yet significant amounts of these data have been and will continue to be lost after the conclusion of the projects tasked with collecting them. To get the most value out of these data it is imperative to seek a solution whereby these data are rescued, archived and made available to the biodiversity community. To this end, the biodiversity informatics community requires investment in processes and infrastructure to mitigate data loss and provide solutions for long-term hosting and sharing of biodiversity data. Discussion We review the current state of biodiversity data hosting and investigate the technological and sociological barriers to proper data management. We further explore the rescuing and re-hosting of legacy data, the state of existing toolsets and propose a future direction for the development of new discovery tools. We also explore the role of data standards and licensing in the context of data hosting and preservation. We provide five recommendations for the biodiversity community that will foster better data preservation and access: (1) encourage the community's use of data standards, (2) promote the public domain licensing of data, (3) establish a community of those involved in data hosting and archival, (4) establish hosting centers for biodiversity data, and (5) develop tools for data discovery. Conclusion The community's adoption of standards and development of tools to enable data discovery is essential to sustainable data preservation. Furthermore, the increased adoption of open content licensing, the establishment of data hosting infrastructure and the creation of a data hosting and archiving community are all necessary steps towards the community ensuring that data archival policies become standardized. PMID:22373257

  3. Circular Data Images for Directional Data

    NASA Technical Reports Server (NTRS)

    Morpet, William J.

    2004-01-01

    Directional data includes vectors, points on a unit sphere, axis orientation, angular direction, and circular or periodic data. The theoretical statistics for circular data (random points on a unit circle) or spherical data (random points on a unit sphere) are a recent development. An overview of existing graphical methods for the display of directional data is given. Cross-over occurs when periodic data are measured on a scale for the measurement of linear variables. For example, if angle is represented by a linear color gradient changing uniformly from dark blue at -180 degrees to bright red at +180 degrees, the color image will be discontinuous at +180 degrees and -180 degrees, which are the same location. The resultant color would depend on the direction of approach to the cross-over point. A new graphical method for imaging directional data is described, which affords high resolution without color discontinuity from "cross-over". It is called the circular data image. The circular data image uses a circular color scale in which colors repeat periodically. Some examples of the circular data image include direction of earth winds on a global scale, rocket motor internal flow, earth global magnetic field direction, and rocket motor nozzle vector direction vs. time.
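
    The cross-over problem and its fix can be reproduced with a few lines of matplotlib: mapping angles through a cyclic colormap gives the same color at -180 and +180 degrees, so no seam appears. The direction field below is synthetic example data, not one of the applications listed above.

        # Synthetic angular field shown with a linear vs. a cyclic (circular) color scale.
        import numpy as np
        import matplotlib.pyplot as plt

        x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
        direction = np.degrees(np.arctan2(y, x))          # angles in -180..+180

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
        ax1.imshow(direction, cmap="viridis", vmin=-180, vmax=180)
        ax1.set_title("Linear color scale: seam at +/-180")
        ax2.imshow(direction, cmap="hsv", vmin=-180, vmax=180)   # periodic colors
        ax2.set_title("Circular color scale: no seam")
        plt.show()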

  4. Data governance and stewardship: designing data stewardship entities and advancing data access.

    PubMed

    Rosenbaum, Sara

    2010-10-01

    U.S. health policy is engaged in a struggle over access to health information, in particular, the conditions under which information should be accessible for research when appropriate privacy protections and security safeguards are in place. The expanded use of health information-an inevitable step in an information age-is widely considered to be essential to health system reform. Models exist for the creation of data-sharing arrangements that promote proper use of information in a safe and secure environment and with attention to ethical standards. Data stewardship is a concept with deep roots in the science and practice of data collection, sharing, and analysis. Reflecting the values of fair information practice, data stewardship denotes an approach to the management of data, particularly data that can identify individuals. The concept of a data steward is intended to convey a fiduciary (or trust) level of responsibility toward the data. Data governance is the process by which responsibilities of stewardship are conceptualized and carried out. As the concept of health information data stewardship advances in a technology-enabled environment, the question is whether legal barriers to data access and use will begin to give way. One possible answer may lie in defining the public interest in certain data uses, tying provider participation in federal health programs to the release of all-payer data to recognized data stewardship entities for aggregation and management, and enabling such entities to foster and enable the creation of knowledge through research. © Health Research and Educational Trust.

  5. Collaborative Data Publication Utilizing the Open Data Repository's Data Publisher

    NASA Technical Reports Server (NTRS)

    Stone, N.; Lafuente, B.; Bristow, T.; Keller, R. M.; Downs, R. T.; Blake, D.; Fonda, M.; Dateo, C.; Pires, A.

    2017-01-01

    For small communities in multidisciplinary fields such as astrobiology, publishing and sharing data can be challenging. While large, homogenous fields often have repositories and existing data standards, small groups of independent researchers have few options for publishing data that can be utilized within their community. In conjunction with teams at NASA Ames and the University of Arizona, a number of pilot studies are being conducted to assess the needs of these research groups and to guide the software development so that it allows them to publish and share their data collaboratively.

  6. Classification and data acquisition with incomplete data

    NASA Astrophysics Data System (ADS)

    Williams, David P.

    In remote-sensing applications, incomplete data can result when only a subset of sensors (e.g., radar, infrared, acoustic) are deployed at certain regions. The limitations of single sensor systems have spurred interest in employing multiple sensor modalities simultaneously. For example, in land mine detection tasks, different sensor modalities are better-suited to capture different aspects of the underlying physics of the mines. Synthetic aperture radar sensors may be better at detecting surface mines, while infrared sensors may be better at detecting buried mines. By employing multiple sensor modalities to address the detection task, the strengths of the disparate sensors can be exploited in a synergistic manner to improve performance beyond that which would be achievable with either single sensor alone. When multi-sensor approaches are employed, however, incomplete data can be manifested. If each sensor is located on a separate platform (e.g., aircraft), each sensor may interrogate, and hence collect data over, only partially overlapping areas of land. As a result, some data points may be characterized by data (i.e., features) from only a subset of the possible sensors employed in the task. Equivalently, this scenario implies that some data points will be missing features. Increasing focus in the future on using, and fusing data from, multiple sensors will make such incomplete-data problems commonplace. In many applications involving incomplete data, it is possible to acquire the missing data at a cost. In multi-sensor remote-sensing applications, data is acquired by deploying sensors to data points. Acquiring data is usually an expensive, time-consuming task, a fact that necessitates an intelligent data acquisition process. Incomplete data is not limited to remote-sensing applications, but rather, can arise in virtually any data set. In this dissertation, we address the general problem of classification when faced with incomplete data. We also address the

  7. Authoring Data-Driven Videos with DataClips.

    PubMed

    Amini, Fereshteh; Riche, Nathalie Henry; Lee, Bongshin; Monroy-Hernandez, Andres; Irani, Pourang

    2017-01-01

    Data videos, or short data-driven motion graphics, are an increasingly popular medium for storytelling. However, creating data videos is difficult as it involves pulling together a unique combination of skills. We introduce DataClips, an authoring tool aimed at lowering the barriers to crafting data videos. DataClips allows non-experts to assemble data-driven "clips" together to form longer sequences. We constructed the library of data clips by analyzing the composition of over 70 data videos produced by reputable sources such as The New York Times and The Guardian. We demonstrate that DataClips can reproduce over 90% of our data videos corpus. We also report on a qualitative study comparing the authoring process and outcome achieved by (1) non-experts using DataClips, and (2) experts using Adobe Illustrator and After Effects to create data-driven clips. Results indicated that non-experts are able to learn and use DataClips with a short training period. In the span of one hour, they were able to produce more videos than experts using a professional editing tool, and their clips were rated similarly by an independent audience.

  8. Drowning in Data: Going Beyond Traditional Data Archival to Educate Data Users

    NASA Astrophysics Data System (ADS)

    Weigel, A. M.; Smith, T.; Smith, D. K.; Bugbee, K.; Sinclair, L.

    2017-12-01

    Increasing quantities of Earth science data and information prove overwhelming to new and unfamiliar users. Data discovery and use challenges faced by these users are compounded with atmospheric science field campaign data collected by a variety of instruments and stored, visualized, processed and analyzed in different ways. To address data and user needs assessed through annual surveys and user questions, the NASA Global Hydrology Resource Center Distributed Active Archive Center (GHRC DAAC), in collaboration with a graphic designer, has developed a series of resources to help users learn about GHRC science focus areas, field campaigns, instruments, data, and data processing techniques. In this talk, GHRC data recipes, micro articles, interactive data visualization techniques, and artistic science outreach and education efforts, such as ESRI story maps and research as art, will be overviewed. The objective of this talk is to stress the importance artistic information visualization has in communicating with and educating Earth science data users.

  9. Data Confidentiality Challenges in Big Data Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yin, Jian; Zhao, Dongfang

    In this paper, we address the problem of data confidentiality in big data analytics. In many fields, many useful patterns can be extracted by applying machine learning techniques to big data. However, data confidentiality must be protected. In many scenarios, data confidentiality could well be a prerequisite for data to be shared. We present a scheme to provide provably secure data confidentiality and discuss various techniques to optimize performance of such a system.

  10. Reporting Data with "Over-the-Counter" Data Analysis Supports Improves Educators' Data Analyses

    ERIC Educational Resources Information Center

    Rankin, Jenny Grant

    2014-01-01

    The benefits of making data-informed decisions to improve learning rely on educators correctly interpreting given data. Many educators routinely misinterpret data, even at districts with proactive support for data use. The tool most educators use for data analyses, which is an information technology data system or its reports, typically reports…

  11. Data Standards for Omics Data: The Basis of Data Sharing and Reuse

    PubMed Central

    Chervitz, Stephen A.; Deutsch, Eric W.; Field, Dawn; Parkinson, Helen; Quackenbush, John; Rocca-Serra, Phillipe; Sansone, Susanna-Assunta; Stoeckert, Christian J.; Taylor, Chris F.; Taylor, Ronald; Ball, Catherine A.

    2014-01-01

    To facilitate sharing of Omics data, many groups of scientists have been working to establish the relevant data standards. The main components of data sharing standards are experiment description standards, data exchange standards, terminology standards, and experiment execution standards. Here we provide a survey of existing and emerging standards that are intended to assist the free and open exchange of large-format data. PMID:21370078

  12. Data compression for sequencing data

    PubMed Central

    2013-01-01

    Post-Sanger sequencing methods produce tons of data, and there is a general agreement that the challenge to store and process them must be addressed with data compression. In this review we first answer the question “why compression” in a quantitative manner. Then we also answer the questions “what” and “how”, by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we go back to the question “why compression” and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology. PMID:24252160
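
    A toy example makes the "why compression" argument quantitative: the four DNA bases fit in two bits each, a fourfold reduction relative to one byte per character even before any entropy coding or specialized modeling. The packer below is a minimal illustration, not one of the specialized tools surveyed in the review.

        # Toy 2-bit packing of DNA bases (illustration only, not a surveyed tool).
        CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
        BASES = "ACGT"

        def pack(seq: str) -> bytes:
            out = bytearray()
            buf = 0
            for i, b in enumerate(seq):
                buf = (buf << 2) | CODE[b]
                if i % 4 == 3:                    # four bases fill one byte
                    out.append(buf)
                    buf = 0
            if len(seq) % 4:
                out.append(buf << (2 * (4 - len(seq) % 4)))
            return bytes(out)

        def unpack(data: bytes, length: int) -> str:
            seq = [BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)]
            return "".join(seq[:length])

        s = "ACGTACGTGG"
        packed = pack(s)
        assert unpack(packed, len(s)) == s
        print(len(s), "bases ->", len(packed), "bytes")   # 10 bases -> 3 bytes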

  13. DataCite - Making data sets citable

    NASA Astrophysics Data System (ADS)

    Brase, J.

    2013-12-01

    The scientific and information communities have largely mastered the presentation of, and linkages between, text-based electronic information by assigning persistent identifiers to give scientific literature unique identities and accessibility. Knowledge, as published through scientific literature, is however often the last step in a process originating from scientific research data. Today scientists are using simulation, observational, and experimentation techniques that yield massive quantities of research data. These data are analysed, synthesised, interpreted, and the outcome of this process is generally published as a scientific article. Access to the original data as the foundation of knowledge has become an important issue throughout the world and different projects have started to find solutions. Global collaboration and scientific advances could be accelerated through broader access to scientific research data. In other words, data access could be revolutionized through the same technologies used to make textual literature accessible. The most obvious opportunity to broaden visibility of and access to research data is to integrate its access into the medium where it is most often cited: electronic textual information. Besides this opportunity, it is important, irrespective of where they are cited, for research data to have an internet identity. Since 2005, the German National Library of Science and Technology (TIB) has offered a successful Digital Object Identifier (DOI) registration service for persistent identification of research data. Since 2010 these services are offered by the global consortium DataCite, carried by 17 member organisations from 12 different countries: The German National Library of Science and Technology (TIB), the German National Library of Medicine (ZB MED), the German National Library of Economics (ZBW) and the German GESIS - Leibniz Institute for the Social Sciences. Additional European members are: The Library of the ETH Z
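
    For example, once a data set has a DOI, a formatted citation can be retrieved through DOI content negotiation at doi.org, a mechanism supported for DataCite DOIs; the identifier below uses the 10.5072 test prefix and is only a placeholder, not a real data set.

        # Resolve a placeholder dataset DOI to a formatted citation via content negotiation.
        import requests

        doi = "10.5072/example-full"              # test-prefix placeholder
        resp = requests.get(
            f"https://doi.org/{doi}",
            headers={"Accept": "text/x-bibliography; style=apa"},
            timeout=30,
        )
        print(resp.status_code)
        print(resp.text)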

  14. International Data | Geospatial Data Science | NREL

    Science.gov Websites

    These datasets detail solar and wind resources for select international regions, including India 10-km monthly direct normal and global horizontal solar resource data, and 50-m hub-height wind datasets that have been validated by NREL and wind energy

  15. Uniform-penalty inversion of multiexponential decay data. II. Data spacing, T(2) data, systematic data errors, and diagnostics.

    PubMed

    Borgia, G C; Brown, R J; Fantazzini, P

    2000-12-01

    The basic method of UPEN (uniform penalty inversion of multiexponential decay data) is given in an earlier publication (Borgia et al., J. Magn. Reson. 132, 65-77 (1998)), which also discusses the effects of noise, constraints, and smoothing on the resolution or apparent resolution of features of a computed distribution of relaxation times. UPEN applies negative feedback to a regularization penalty, allowing stronger smoothing for a broad feature than for a sharp line. This avoids unnecessarily broadening the sharp line and/or breaking the wide peak or tail into several peaks that the relaxation data do not demand to be separate. The experimental and artificial data presented earlier were T(1) data, and all had fixed data spacings, uniform in log-time. However, for T(2) data, usually spaced uniformly in linear time, or for data spaced in any manner, we have found that the data spacing does not enter explicitly into the computation. The present work shows the extension of UPEN to T(2) data, including the averaging of data in windows and the use of the corresponding weighting factors in the computation. Measures are implemented to control portions of computed distributions extending beyond the data range. The input smoothing parameters in UPEN are normally fixed, rather than data dependent. A major problem arises, especially at high signal-to-noise ratios, when UPEN is applied to data sets with systematic errors due to instrumental nonidealities or adjustment problems. For instance, a relaxation curve for a wide line can be narrowed by an artificial downward bending of the relaxation curve. Diagnostic parameters are generated to help identify data problems, and the diagnostics are applied in several examples, with particular attention to the meaningful resolution of two closely spaced peaks in a distribution of relaxation times. Where feasible, processing with UPEN in nearly real time should help identify data problems while further instrument adjustments can still be
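
    In generic form, the regularized inversion that UPEN performs can be written as below; this paraphrases the description above, and the exact weighting and negative-feedback formulas are those given in the cited UPEN papers.

        % Generic uniform-penalty inversion objective (paraphrase, not the exact UPEN formulas)
        \min_{g_i \ge 0} \;
          \sum_{k} w_k \Bigl( s(t_k) - \sum_{i} g_i \, e^{-t_k/T_i} \Bigr)^{2}
          \; + \; \sum_{i} \lambda_i \bigl( g_{i-1} - 2 g_i + g_{i+1} \bigr)^{2}

    Here s(t_k) are the (window-averaged) decay data with weights w_k, g_i is the amplitude at relaxation time T_i, and the negative feedback adjusts the local penalty lambda_i so that broad features are smoothed more strongly than sharp lines.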

  16. ACTRIS Data Centre: An atmospheric data portal

    NASA Astrophysics Data System (ADS)

    Myhre, C. Lund; Fahre Vik, A.; Logna, R.; Torseth, K.; Linné, H.; O'Connor, E.

    2012-04-01

    ACTRIS (Aerosols, Clouds, and Trace gases Research InfraStructure Network) is a European Project aiming at integrating European ground-based stations equipped with advanced instrumentation for studying aerosols, clouds, and short-lived gas-phase species. The ACTRIS activities result in improved atmospheric measurement data from more than 60 European sites and numerous instruments, including variables measured by ground-based in situ and remote sensing technologies. Core variables are in situ aerosol optical, physical and chemical properties, short-lived trace gases (volatile organic carbon and nitrogen oxides), aerosol scattering and extinction profiles, and cloud properties. The ACTRIS data centre (ACTRIS DC) gives free and open access to all data resulting from the activities of the infrastructure network, complemented with data from other relevant networks and databases. The overall goal is to facilitate scientists' and other user groups' access to atmospheric observational data, and to provide mature products for analysis and interpretation of atmospheric composition change. The ACTRIS DC aims at substantially increasing the number of high-quality data by providing long-term observational data relevant to climate and air quality research produced with standardized or comparable procedures throughout the network. The backbone of the ACTRIS DC is the three core databases: the EARLINET Data Base hosting aerosol lidar data from more than 30 European sites, EBAS hosting ground-based atmospheric in situ data from more than 1000 sites globally, and Cloudnet hosting remote sensing cloud data and products from 5 European sites. Furthermore, a joint portal is developed combining information from various data sources to gain new information not presently available from standalone databases or networks. The data centre will provide tools and services to facilitate the use of measurements for broad user communities. Higher level and integrated products will be

  17. National Space Science Data Center (NSSDC) Data Listing

    NASA Technical Reports Server (NTRS)

    1980-01-01

    Satellite and nonsatellite data available from the National Space Science Data Center are listed. The Satellite Data listing includes the spacecraft name, launch date, and an alphabetical list of experiments. The Non-Satellite Data listing contains ground based data, models, computer routines, and composite spacecraft data. The data set name, data form code, quantity of data, and the time span covered are included in the data sets of both listings where appropriate. Geodetic tracking data sets are also included.

  18. The Ocean Observatories Initiative: Data, Data and More Data

    NASA Astrophysics Data System (ADS)

    Crowley, M. F.; Vardaro, M.; Belabbassi, L.; Smith, M. J.; Garzio, L. M.; Knuth, F.; Glenn, S. M.; Schofield, O.; Lichtenwalner, C. S.; Kerfoot, J.

    2016-02-01

    The Ocean Observatories Initiative (OOI), a project funded by the National Science Foundation (NSF) and managed by the Consortium for Ocean Leadership, is a networked infrastructure of science-driven sensor systems that measure the physical, chemical, geological, and biological variables in the ocean and seafloor on coastal, regional, and global scales. OOI long term research arrays have been installed off the Washington coast (Cabled), Massachusetts and Oregon coasts (Coastal) and off Alaska, Greenland, Chile and Argentina (Global). Woods Hole Oceanographic Institution and Oregon State University are responsible for the coastal and global moorings and their autonomous vehicles. The University of Washington is responsible for cabled seafloor systems and moorings. Rutgers University operates the Cyberinfrastructure (CI) portion of the OOI, which acquires, processes and distributes data to the scientists, researchers, educators and the public. It also provides observatory mission command and control, data assessment and distribution, and long-term data management. This talk will present an overview of the OOI infrastructure and its three primary websites which include: 1) An OOI overview website offering technical information on the infrastructure ranging from instruments to science goals, news, deployment updates, and information on the proposal process, 2) The Education and Public Engagement website where students can view and analyze exactly the same data that scientists have access to at exactly the same time, but with simple visualization tools and compartmentalized lessons that lead them through complex science questions, and 3) The primary data access website and machine to machine interface where anyone can plot or download data from the over 700 instruments within the OOI Network.

  19. [Big Data Revolution or Data Hubris? : On the Data Positivism of Molecular Biology].

    PubMed

    Gramelsberger, Gabriele

    2017-12-01

    Genome data, the core of the big data revolution in biology proclaimed in 2008, are automatically generated and analyzed. The transition from the manual laboratory practice of electrophoresis sequencing to automated DNA-sequencing machines and software-based analysis programs was completed between 1982 and 1992. This transition facilitated the first data deluge, which was considerably amplified by the second and third generations of DNA sequencers during the 2000s. However, the strategies for evaluating sequence data were also transformed along with this transition. The paper explores both the computational strategies of automation and the data evaluation culture connected with them, in order to provide a complete picture of the complexity of today's data generation and its intrinsic data positivism. The paper is thereby guided by the question of whether this data positivism is the basis of the big data revolution in molecular biology announced today, or whether it marks the beginning of its data hubris.

  20. "Small" data in a big data world: archiving terrestrial ecology data at ORNL DAAC

    NASA Astrophysics Data System (ADS)

    Santhana Vannan, S. K.; Beaty, T.; Boyer, A.; Deb, D.; Hook, L.; Shrestha, R.; Thornton, M.; Virdi, M.; Wei, Y.; Wright, D.

    2016-12-01

    The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC http://daac.ornl.gov), a NASA-funded data center, archives a diverse collection of terrestrial biogeochemistry and ecological dynamics observations and models in support of NASA's Earth Science program. The ORNL DAAC has been addressing the increasing challenge of publishing diverse small data products into an online archive while dealing with the enhanced need for integration and availability of these data to address big science questions. This paper will show examples of "small" diverse data holdings - ranging from the Daymet model output data to site-based soil moisture observation data. We define "small" by the data volume of these data products compared to petabyte scale observations. We will highlight the use of tools and services for visualizing diverse data holdings and subsetting services such as the MODIS land products subsets tool (at ORNL DAAC) that provides big MODIS data in small chunks. Digital Object Identifiers (DOI) and data citations have enhanced the availability of data. The challenge faced by data publishers now is to deal with the increased number of publishable data products and most importantly the difficulties of publishing small diverse data products into an online archive. This paper will also present our experiences designing a data curation system for these types of data. The characteristics of these data will be examined and their scientific value will be demonstrated via data citation metrics. We will present case studies of leveraging specialized tools and services that have enabled small data sets to realize their "big" scientific potential. Overall, we will provide a holistic view of the challenges and potential of small diverse terrestrial ecology data sets from data curation to distribution.

  1. Collaborative Data Publication Utilizing the Open Data Repository's (ODR) Data Publisher

    NASA Technical Reports Server (NTRS)

    Stone, N.; Lafuente, B.; Bristow, T.; Keller, R. M.; Downs, R. T.; Blake, D.; Fonda, M.; Dateo, C.; Pires, A.

    2017-01-01

    Introduction: For small communities in diverse fields such as astrobiology, publishing and sharing data can be a difficult challenge. While large, homogenous fields often have repositories and existing data standards, small groups of independent researchers have few options for publishing standards and data that can be utilized within their community. In conjunction with teams at NASA Ames and the University of Arizona, the Open Data Repository's (ODR) Data Publisher has been conducting ongoing pilots to assess the needs of diverse research groups and to develop software to allow them to publish and share their data collaboratively. Objectives: The ODR's Data Publisher aims to provide an easy-to-use and implement software tool that will allow researchers to create and publish database templates and related data. The end product will facilitate both human-readable interfaces (web-based with embedded images, files, and charts) and machine-readable interfaces utilizing semantic standards. Characteristics: The Data Publisher software runs on the standard LAMP (Linux, Apache, MySQL, PHP) stack to provide the widest server base available. The software is based on Symfony (www.symfony.com) which provides a robust framework for creating extensible, object-oriented software in PHP. The software interface consists of a template designer where individual or master database templates can be created. A master database template can be shared by many researchers to provide a common metadata standard that will set a compatibility standard for all derivative databases. Individual researchers can then extend their instance of the template with custom fields, file storage, or visualizations that may be unique to their studies. This allows groups to create compatible databases for data discovery and sharing purposes while still providing the flexibility needed to meet the needs of scientists in rapidly evolving areas of research. Research: As part of this effort, a number of ongoing

  2. Data Sets and Data Services at the Northern California Earthquake Data Center

    NASA Astrophysics Data System (ADS)

    Neuhauser, D. S.; Zuzlewski, S.; Allen, R. M.

    2014-12-01

    The Northern California Earthquake Data Center (NCEDC) houses a unique and comprehensive data archive and provides real-time services for a variety of seismological and geophysical data sets that encompass northern and central California. We have over 80 terabytes of continuous and event-based time series data from broadband, short-period, strong-motion, and strain sensors, as well as continuous and campaign GPS data at both standard and high sample rates in both raw and RINEX formats. The Northern California Seismic System (NCSS), operated by UC Berkeley and USGS Menlo Park, has recorded over 890,000 events from 1984 to the present, and the NCEDC provides catalog and parametric information, moment tensors and first-motion mechanisms, and time series data for these events. We also host and provide event catalogs, parametric information, and event waveforms for DOE enhanced geothermal system monitoring in northern California and Nevada. The NCEDC provides a variety of ways for users to access these data. The most recent development is web services, which provide interactive, command-line, or program-based workflow access to data. Web services use well-established server and client protocols and a RESTful software architecture that allow users to easily submit queries and receive the requested data in real time rather than through batch or email-based requests. Data are returned to the user in the appropriate format, such as XML, RESP, simple text, or MiniSEED, depending on the service and selected output format. The NCEDC supports all FDSN-defined web services as well as a number of IRIS-defined and NCEDC-defined services. We also continue to support older email-based and browser-based access to data. NCEDC data and web services can be found at http://www.ncedc.org and http://service.ncedc.org.
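
    The FDSN-style web services mentioned above can be exercised with a few lines of code. The sketch below is a minimal illustration, assuming the dataselect endpoint at service.ncedc.org and using example network/station/channel codes; consult the NCEDC service documentation for the exact parameters and availability.

    ```python
    # Minimal sketch: request miniSEED waveform data from an FDSN-style
    # dataselect web service (network/station/channel codes are examples only).
    import requests

    BASE = "http://service.ncedc.org/fdsnws/dataselect/1/query"  # assumed endpoint
    params = {
        "net": "BK",                        # network code (example)
        "sta": "CMB",                       # station code (example)
        "cha": "BHZ",                       # channel code (example)
        "starttime": "2014-01-01T00:00:00",
        "endtime": "2014-01-01T01:00:00",
    }

    resp = requests.get(BASE, params=params, timeout=60)
    resp.raise_for_status()

    with open("waveform.mseed", "wb") as f:
        f.write(resp.content)               # raw miniSEED bytes (readable with e.g. ObsPy)
    ```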

  3. Data Science and its Relationship to Big Data and Data-Driven Decision Making.

    PubMed

    Provost, Foster; Fawcett, Tom

    2013-03-01

    Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data-science programs, and publications are touting data science as a hot-even "sexy"-career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this article, we argue that there are good reasons why it has been hard to pin down exactly what is data science. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner's field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of data science precisely is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii), we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this article, we present a perspective that addresses all these concepts. We close by offering, as examples, a partial list of fundamental principles underlying data science.

  4. Addressing Data Veracity in Big Data Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aman, Saima; Chelmis, Charalampos; Prasanna, Viktor

    Big data applications such as those in smart electric grids, transportation, and remote environment monitoring involve geographically dispersed sensors that periodically send back information to central nodes. In many cases, data from sensors are not available at central nodes at the frequency required for real-time modeling and decision-making. This may be due to physical limitations of the transmission networks, or to consumers limiting frequent transmission of data from sensors located at their premises because of security and privacy concerns. Such scenarios lead to a partial data problem and raise the issue of data veracity in big data applications. We describe a novel solution to the problem of making short-term predictions (up to a few hours ahead) in the absence of real-time data from sensors in the Smart Grid. A key implication of our work is that by using real-time data from only a small subset of influential sensors, we are able to make predictions for all sensors. We thus reduce the communication complexity involved in transmitting sensory data in Smart Grids. We use real-world electricity consumption data from smart meters to empirically demonstrate the usefulness of our method. Our dataset consists of data collected at 15-min intervals from 170 smart meters in the USC Microgrid for 7 years, totaling 41,697,600 data points.
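
    The abstract does not give the prediction model itself; the following sketch only illustrates the general idea of predicting readings for every meter from a small subset of always-reporting meters, using an ordinary multi-output linear regression and synthetic data in place of the USC Microgrid readings.

    ```python
    # Illustrative sketch (not the authors' method): predict readings of all
    # meters from a small subset of "influential" meters via linear regression.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n_obs, n_meters, n_influential = 1000, 170, 10

    # Synthetic stand-in for 15-minute smart-meter readings.
    readings = rng.gamma(shape=2.0, scale=1.5, size=(n_obs, n_meters))

    influential = rng.choice(n_meters, size=n_influential, replace=False)
    X = readings[:, influential]          # always-available subset of meters
    y = readings                          # targets: every meter

    model = LinearRegression().fit(X[:800], y[:800])   # train on history
    pred = model.predict(X[800:])                      # predict when only the subset reports
    rmse = np.sqrt(np.mean((pred - y[800:]) ** 2))
    print(f"RMSE over held-out interval: {rmse:.3f}")
    ```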

  5. Data Service: Distributed Data Capture and Replication

    NASA Astrophysics Data System (ADS)

    Warner, P. B.; Pietrowicz, S. R.

    2007-10-01

    Data Service is a critical component of the NOAO Data Management and Science Support (DMaSS) Solutions Platform, which is based on a service-oriented architecture, and is to replace the current NOAO Data Transport System. Its responsibilities include capturing data from NOAO and partner telescopes and instruments and replicating the data across multiple (currently six) storage sites. Java 5 was chosen as the implementation language, and Java EE as the underlying enterprise framework. Application metadata persistence is performed using EJB and Hibernate on the JBoss Application Server, with PostgreSQL as the persistence back-end. Although potentially any underlying mass storage system may be used as the Data Service file persistence technology, DTS deployments and Data Service test deployments currently use the Storage Resource Broker from SDSC. This paper presents an overview and high-level design of the Data Service, including aspects of deployment, e.g., for the LSST Data Challenge at the NCSA computing facilities.

  6. Tsunami Data and Scientific Data Diplomacy

    NASA Astrophysics Data System (ADS)

    Arcos, N. P.; Dunbar, P. K.; Gusiakov, V. K.; Kong, L. S. L.; Aliaga, B.; Yamamoto, M.; Stroker, K. J.

    2016-12-01

    Free and open access to data and information fosters scientific progress and can build bridges between nations even when political relationships are strained. Data and information held by one stakeholder may be vital for promoting the research of another. As an emerging field of inquiry, data diplomacy explores how data-sharing helps create and support positive relationships between countries to enable the use of data for societal and humanitarian benefit. The tsunami is arguably the only natural hazard that has been addressed so effectively at an international scale, and it illustrates the success of scientific data diplomacy. Tsunami mitigation requires international scientific cooperation in both tsunami science and technology development. This requires not only international agreements, but also working-level relationships between scientists from countries that may have different political and economic policies. For example, following the Pacific-wide tsunami of 1960 that killed two thousand people in Chile and then, up to a day later, hundreds in Hawaii, Japan, and the Philippines, delegates from twelve countries met to discuss and draft the requirements for an international tsunami warning system. The Pacific Tsunami Warning System led to the development of local, regional, and global tsunami databases and catalogs. For example, scientists at NOAA/NCEI and the Tsunami Laboratory/Russian Academy of Sciences have collaborated on their tsunami catalogs, which are now routinely accessed by scientists and the public around the world. These data support decision-making during tsunami events and are used in developing inundation and evacuation maps and hazard assessments. This presentation will include additional examples of agreements for data-sharing between countries, as well as challenges in standardization and consistency within the tsunami research community. Tsunami data and scientific data diplomacy have ultimately improved the understanding of tsunamis and their associated impacts.

  7. Oceanids command and control (C2) data system - Marine autonomous systems data for vehicle piloting, scientific data users, operational data assimilation, and big data

    NASA Astrophysics Data System (ADS)

    Buck, J. J. H.; Phillips, A.; Lorenzo, A.; Kokkinaki, A.; Hearn, M.; Gardner, T.; Thorne, K.

    2017-12-01

    The National Oceanography Centre (NOC) operates a fleet of approximately 36 autonomous marine platforms including submarine gliders, autonomous underwater vehicles, and autonomous surface vehicles. Each platform is effectively capable of observing the ocean and collecting data akin to a small research vessel. This is creating a growth in data volume and complexity while the amount of resource available to manage data remains static. The Oceanids Command and Control (C2) project aims to solve these issues by fully automating data archival, processing, and dissemination. The data architecture being implemented jointly by NOC and the Scottish Association for Marine Science (SAMS) includes a single Application Programming Interface (API) gateway to handle authentication, forwarding, and delivery of both metadata and data. Technicians and principal investigators will enter expedition data prior to vehicle deployment, enabling automated data processing once vehicles are deployed. The system will support automated metadata acquisition from platforms as this technology moves towards operational implementation. The metadata exposure to the web builds on a prototype developed by the European Commission-supported SenseOCEAN project and uses open standards including World Wide Web Consortium (W3C) RDF/XML, the Semantic Sensor Network ontology, and the Open Geospatial Consortium (OGC) SensorML standard. Data will be delivered in the marine-domain Everyone's Glider Observatory (EGO) format and OGC Observations and Measurements. Additional formats will be served by the implementation of endpoints such as the NOAA ERDDAP tool. This standardised data delivery via the API gateway enables timely near-real-time data to be served to Oceanids users, BODC users, operational users, and big data systems. The use of open standards will also enable web interfaces to be rapidly built on the API gateway and delivery to European research infrastructures that include aligned
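
    As a rough illustration of the standards-based delivery described above, the sketch below pulls a small CSV subset from an ERDDAP tabledap endpoint; the server URL, dataset identifier, and variable names are placeholders, not the actual Oceanids/BODC services.

    ```python
    # Sketch of near-real-time access through an ERDDAP endpoint (server URL,
    # dataset ID, and variable names are hypothetical placeholders).
    import io
    import pandas as pd
    import requests

    server = "https://example-erddap.org/erddap/tabledap"   # placeholder server
    dataset = "glider_mission_001"                           # placeholder dataset
    query = (
        f"{server}/{dataset}.csv"
        "?time,latitude,longitude,temperature"
        "&time>=2017-06-01T00:00:00Z&time<=2017-06-02T00:00:00Z"
    )

    resp = requests.get(query, timeout=60)
    resp.raise_for_status()
    # ERDDAP CSV responses carry a units row after the header; skip it.
    df = pd.read_csv(io.StringIO(resp.text), skiprows=[1])
    print(df.head())
    ```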

  8. Data warehouse implementation with clinical pharmacokinetic/pharmacodynamic data.

    PubMed

    Koprowski, S P; Barrett, J S

    2002-03-01

    We have created a data warehouse for human pharmacokinetic (PK) and pharmacodynamic (PD) data generated primarily within the Clinical PK Group of the Drug Metabolism and Pharmacokinetics (DM&PK) Department of DuPont Pharmaceuticals. Data that enter an Oracle-based LIMS directly from chromatography systems or through files from contract research organizations are accessed via SAS/PH.Kinetics, GLP-compliant data analysis software residing on individual users' workstations. Upon completion of the final PK or PD analysis, data are pushed to a predefined location. Data analyzed or created with other software (e.g., WinNonlin, NONMEM, Adapt) are added to this file repository as well. The warehouse creates views of these data and accumulates metadata on all data sources defined in the warehouse. The warehouse is managed via the SAS/Warehouse Administrator product, which defines the environment, creates summarized data structures, and schedules data refreshes. The clinical PK/PD warehouse encompasses laboratory, biometric, PK, and PD data streams. Detailed logical tables for each compound are created and updated as the clinical PK/PD data warehouse is populated. The data model defined for the warehouse is based on a star schema. Summarized data structures such as multidimensional databases (MDDB), infomarts, and datamarts are created from detail tables. Data mining and querying of highly summarized data, as well as drill-down to detail data, are possible via exploitation tools that front-end the warehouse data. Based on periodic refreshing of the warehouse data, these applications are able to access the most current data available and do not require a manual interface to update or populate the data store. Prototype applications have been web-enabled to facilitate their use by varied data customers across platforms and locations. The warehouse also contains automated mechanisms for the construction of study data listings and SAS transport files for eventual

  9. Expediting Scientific Data Analysis with Reorganization of Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Byna, Surendra; Wu, Kesheng

    2013-08-19

    Data producers typically optimize the layout of data files to minimize the write time. In most cases, data analysis tasks read these files with access patterns different from the write patterns, causing poor read performance. In this paper, we introduce Scientific Data Services (SDS), a framework for bridging the performance gap between writing and reading scientific data. SDS reorganizes data to match the read patterns of analysis tasks and enables transparent data reads from the reorganized data. We implemented an HDF5 Virtual Object Layer (VOL) plugin to redirect HDF5 dataset read calls to the reorganized data. To demonstrate the effectiveness of SDS, we applied two parallel data organization techniques: a sort-based organization on plasma physics data and a transpose-based organization on mass spectrometry imaging data. We also extended the HDF5 data access API to allow selection of data based on their values through a query interface, called SDS Query. We evaluated the execution time of accessing various subsets of data through the existing HDF5 read API and SDS Query. We showed that reading the reorganized data using SDS is up to 55X faster than reading the original data.
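
    The sketch below is not the SDS code, but it illustrates the core idea of a sort-based reorganization with plain h5py: write a sorted copy of a dataset so that a value-range query becomes a binary search plus one contiguous read.

    ```python
    # Conceptual sketch of a sort-based reorganization (not the SDS implementation):
    # rewrite a 1-D HDF5 dataset in sorted order so that value-range queries
    # touch a contiguous slab instead of scanning the whole array.
    import h5py
    import numpy as np

    with h5py.File("particles.h5", "w") as f:              # toy input file
        f.create_dataset("energy", data=np.random.default_rng(0).random(1_000_000))

    with h5py.File("particles.h5", "r") as src, \
         h5py.File("particles_sorted.h5", "w") as dst:
        energy = src["energy"][...]
        order = np.argsort(energy)
        dst.create_dataset("energy", data=energy[order])    # reorganized copy
        dst.create_dataset("orig_index", data=order)        # to recover original ordering

    # A value-based query on the sorted copy reduces to a binary search.
    with h5py.File("particles_sorted.h5", "r") as f:
        e = f["energy"]
        lo, hi = np.searchsorted(e[...], [0.9, 0.95])
        selection = e[lo:hi]                                # contiguous read
        print(len(selection), "values in [0.9, 0.95)")
    ```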

  10. Ocean data management in OMP Data Service

    NASA Astrophysics Data System (ADS)

    Fleury, Laurence; André, François; Belmahfoud, Nizar; Boichard, Jean-Luc; Brissebrat, Guillaume; Ferré, Hélène; Mière, Arnaud

    2014-05-01

    The Observatoire Midi-Pyrénées Data Service (SEDOO) is a development team dedicated to environmental data management and to setting up dissemination applications, in the framework of intensive field campaigns and long-term observation networks. SEDOO has developed some applications dealing with ocean data only, as well as generic databases that make it possible to store and distribute multidisciplinary datasets. SEDOO is in charge of in situ data management and the data portal for international and multidisciplinary programmes as large as African Monsoon Multidisciplinary Analyses (AMMA) and Mediterranean Integrated STudies at Regional And Local Scales (MISTRALS). The AMMA and MISTRALS databases are distributed, and the data portals provide access to datasets managed by other data centres (IPSL, CORIOLIS...) through interoperability protocols (OPeNDAP, XML requests...). AMMA and MISTRALS metadata (data descriptions) are standardized and comply with international standards (ISO 19115-19139; INSPIRE European Directive; Global Change Master Directory Thesaurus). Most of the AMMA and MISTRALS in situ ocean data sets are homogenized and inserted in a relational database, in order to enable accurate data selection and the download of different data sets in a shared format. Data selection criteria are location, period, physical property name, physical property range... The data extraction procedure includes output format selection among CSV, NetCDF, NASA Ames... The AMMA database - http://database.amma-international.org/ - contains field campaign observations in the Gulf of Guinea (EGEE 2005-2007) and the tropical Atlantic Ocean (AEROSE-II 2006...), as well as long-term monitoring data (PIRATA, ARGO...). Operational analyses (MERCATOR) and satellite products (TMI, SSMI...) are managed by the IPSL data centre and can be accessed too. They have been projected onto regular latitude-longitude grids and converted into the NetCDF format. The MISTRALS data portal - http://mistrals.sedoo.fr/ - provides access to

  11. Data management support for selected climate data sets using the climate data access system

    NASA Technical Reports Server (NTRS)

    Reph, M. G.

    1983-01-01

    The functional capabilities of the Goddard Space Flight Center (GSFC) Climate Data Access System (CDAS), an interactive data storage and retrieval system, and the archival data sets which this system manages are discussed. The CDAS manages several climate-related data sets, such as the First Global Atmospheric Research Program (GARP) Global Experiment (FGGE) Level 2-b and Level 3-a data tapes. CDAS data management support consists of three basic functions: (1) an inventory capability which allows users to search or update a disk-resident inventory describing the contents of each tape in a data set, (2) a capability to depict graphically the spatial coverage of a tape in a data set, and (3) a data set selection capability which allows users to extract portions of a data set using criteria such as time, location, and data source/parameter and output the data to tape, user terminal, or system printer. This report includes figures that illustrate menu displays and output listings for each CDAS function.

  12. Biological data integration: wrapping data and tools.

    PubMed

    Lacroix, Zoé

    2002-06-01

    Nowadays scientific data is inevitably digital and stored in a wide variety of formats in heterogeneous systems. Scientists need to access an integrated view of remote or local heterogeneous data sources with advanced data accessing, analyzing, and visualization tools. Building a digital library for scientific data requires accessing and manipulating data extracted from flat files or databases, documents retrieved from the Web as well as data generated by software. We present an approach to wrapping web data sources, databases, flat files, or data generated by tools through a database view mechanism. Generally, a wrapper has two tasks: it first sends a query to the source to retrieve data and, second builds the expected output with respect to the virtual structure. Our wrappers are composed of a retrieval component based on an intermediate object view mechanism called search views mapping the source capabilities to attributes, and an eXtensible Markup Language (XML) engine, respectively, to perform these two tasks. The originality of the approach consists of: 1) a generic view mechanism to access seamlessly data sources with limited capabilities and 2) the ability to wrap data sources as well as the useful specific tools they may provide. Our approach has been developed and demonstrated as part of the multidatabase system supporting queries via uniform object protocol model (OPM) interfaces.

  13. Linked Data: Forming Partnerships at the Data Layer

    NASA Astrophysics Data System (ADS)

    Shepherd, A.; Chandler, C. L.; Arko, R. A.; Jones, M. B.; Hitzler, P.; Janowicz, K.; Krisnadhi, A.; Schildhauer, M.; Fils, D.; Narock, T.; Groman, R. C.; O'Brien, M.; Patton, E. W.; Kinkade, D.; Rauch, S.

    2015-12-01

    The challenges presented by big data are straining the data management software architectures of the past. For smaller existing data facilities, the technical refactoring of software layers becomes costly to scale across the big data landscape. In response to these challenges, data facilities will need partnerships with external entities for improved solutions to perform tasks such as data cataloging, discovery and reuse, and data integration and processing with provenance. At its surface, the concept of linked open data suggests an uncalculated altruism. Yet, in his concept of five-star open data, Tim Berners-Lee explains the strategic costs and benefits of deploying linked open data from the perspective of its consumer and producer - a data partnership. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) addresses some of the emerging needs of its research community by partnering with groups doing complementary work and linking their respective data layers using linked open data principles. Examples will show how these links, explicit manifestations of partnerships, reduce technical debt and provide swift flexibility for future considerations.
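
    As a toy illustration of linking data layers with linked open data principles, the sketch below publishes a dataset description as RDF triples and points it at a partner repository record; all IRIs and terms are placeholders rather than the actual BCO-DMO vocabulary.

    ```python
    # Illustrative sketch: describe a dataset as RDF triples and link it to an
    # external partner resource (all IRIs are invented placeholders).
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    SCHEMA = Namespace("http://schema.org/")

    g = Graph()
    dataset = URIRef("http://example.org/bco-dmo/dataset/12345")
    partner = URIRef("http://example.org/partner-repo/record/67890")

    g.add((dataset, RDF.type, SCHEMA.Dataset))
    g.add((dataset, DCTERMS.title, Literal("CTD casts, example cruise")))
    g.add((dataset, DCTERMS.isReferencedBy, partner))     # the cross-facility link

    print(g.serialize(format="turtle"))
    ```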

  14. Good Data Can Be Better Data - How Data Management Maturity Can Help Repositories Improve Operations, Data Quality, And Usability, Helping Researchers

    NASA Astrophysics Data System (ADS)

    Stall, S.

    2015-12-01

    Much Earth and space science data and metadata are managed and supported by an infrastructure of repositories, ranging from large agency or instrument facilities, to institutions, to smaller repositories including labs. Scientists face many challenges in this ecosystem, both in storing their data and in accessing data from others for new research. Critical for all uses is ensuring the credibility and integrity of the data and conveying that and provenance information now and in the future. Accurate information is essential for future researchers to find (or discover) the data, evaluate the data for use (content, temporal, geolocation, precision), and finally select (or discard) that data as meeting "fit-for-purpose" criteria. We also need to optimize the effort it takes to describe the data for these determinations, which means making it efficient for the researchers who collect the data. At AGU we are developing a program aimed at helping repositories, and thereby researchers, improve data quality and data usability toward these goals. AGU has partnered with the CMMI Institute to develop their Data Management Maturity (DMM) framework within the Earth and space sciences. The CMMI DMM framework guides best practices in a range of data operations, and the application of the DMM, through an assessment, reveals how repositories and institutions can best optimize efforts to improve operations and functionality throughout the data lifecycle and elevate best practices across a variety of data management operations. Supporting processes like data operations, data governance, and data architecture are included. An assessment involves identifying accomplishments and weaknesses compared to leading practices for data management. Broad application of the DMM can help improve quality in data and operations, and consistency across the community that will facilitate interoperability, discovery, preservation, and reuse. Good data can be better data. Consistency results in

  15. P19-S Managing Proteomics Data from Data Generation and Data Warehousing to Central Data Repository and Journal Reviewing Processes

    PubMed Central

    Thiele, H.; Glandorf, J.; Koerting, G.; Reidegeld, K.; Blüggel, M.; Meyer, H.; Stephan, C.

    2007-01-01

    In today's proteomics research, various techniques, instrumentation, and bioinformatics tools are necessary to manage the large amount of heterogeneous data with automatic quality control so as to produce reliable and comparable results. A data-processing pipeline is therefore mandatory for data validation and comparison in a data-warehousing system. The proteome bioinformatics platform ProteinScape has been proven to cover these needs. The reprocessing of HUPO BPP participants' MS data was done within ProteinScape, and the reprocessed information was transferred into the global data repository PRIDE. ProteinScape as a data-warehousing system covers two main aspects: archiving relevant data of the proteomics workflow and providing information extraction functionality (protein identification, quantification, and generation of biological knowledge). As a strategy for automatic data validation, different protein search engines are integrated. Result analysis is performed using a decoy database search strategy, which allows the false-positive identification rate to be measured. Peptide identifications across different workflows, different MS techniques, and different search engines are merged to obtain a quality-controlled protein list. The proteomics identifications database (PRIDE), as a public data repository, is an archiving system where data are finally stored and no longer changed by further processing steps. Data submission to PRIDE is open to proteomics laboratories generating protein and peptide identifications. An export tool has been developed for transferring all relevant HUPO BPP data from ProteinScape into PRIDE using the PRIDE.xml format. The EU-funded ProDac project will coordinate the development of software tools covering international standards for the representation of proteomics data. The implementation of data submission pipelines and systematic data collection in public standards-compliant repositories will cover all aspects, from the generation of MS data
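
    The decoy database strategy mentioned above is often summarized by a simple estimate: the false-positive identification rate at a score threshold is approximated by the ratio of decoy hits to target hits above that threshold. The sketch below uses that common convention; the exact formula used by ProteinScape is not stated in the abstract.

    ```python
    # Sketch of the target-decoy idea: estimate the false-positive
    # identification rate at a score threshold as decoy hits / target hits
    # (one common convention, not necessarily the platform's exact formula).
    def decoy_fdr(target_scores, decoy_scores, threshold):
        """Estimate FDR for identifications scoring at or above `threshold`."""
        n_target = sum(s >= threshold for s in target_scores)
        n_decoy = sum(s >= threshold for s in decoy_scores)
        return n_decoy / n_target if n_target else 0.0

    targets = [55.2, 48.1, 60.4, 33.0, 71.9, 45.5]   # toy peptide scores
    decoys = [30.1, 28.7, 46.0, 25.3, 31.2, 29.9]

    print(f"FDR at score >= 45: {decoy_fdr(targets, decoys, 45):.2%}")
    ```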

  16. Panel data analysis of cardiotocograph (CTG) data.

    PubMed

    Horio, Hiroyuki; Kikuchi, Hitomi; Ikeda, Tomoaki

    2013-01-01

    Panel data analysis is a statistical method, widely used in econometrics, that deals with two-dimensional panel data collected over time and over individuals. The cardiotocograph (CTG), which monitors fetal heart rate (FHR) using Doppler ultrasound and uterine contraction using a strain gauge, is commonly used in the intrapartum treatment of pregnant women. Although the relationship between the FHR waveform pattern and outcomes such as umbilical blood gas data at delivery has long been analyzed, there exists no accumulation of FHR patterns from a large number of cases. As time-series economic fluctuations, such as consumption trends, have been studied in econometrics using panel data consisting of time-series and cross-sectional data, we tried to apply this method to CTG data. Panel data composed of symbolized segments of the FHR pattern can be easily handled, and a perinatologist can get a view of the whole FHR pattern from the microscopic level of time-series FHR data.
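
    A minimal sketch of the data arrangement described above, assuming hypothetical symbol codes for FHR segments: symbolized segments are pivoted into a panel with cases as rows and time segments as columns, so that both time-series and cross-sectional views are available.

    ```python
    # Illustrative sketch: arranging symbolized FHR segments as panel data
    # (cases x time segments). Symbols and cases are synthetic placeholders.
    import pandas as pd

    records = [
        # (case_id, segment_index, symbol)  e.g. symbol = coded FHR pattern class
        ("case01", 0, "A"), ("case01", 1, "B"), ("case01", 2, "B"),
        ("case02", 0, "A"), ("case02", 1, "A"), ("case02", 2, "C"),
        ("case03", 0, "B"), ("case03", 1, "C"), ("case03", 2, "C"),
    ]
    df = pd.DataFrame(records, columns=["case", "segment", "symbol"])

    # Long format -> panel (rows = cases, columns = time segments).
    panel = df.pivot(index="case", columns="segment", values="symbol")
    print(panel)

    # Cross-sectional view of one time segment across all cases.
    print(panel[2].value_counts())
    ```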

  17. Striped Data Server for Scalable Parallel Data Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chang, Jin; Gutsche, Oliver; Mandrichenko, Igor

    A columnar data representation is known to be an efficient way to store data, specifically in cases where the analysis is often done based only on a small fragment of the available data structures. A data representation like Apache Parquet is a step forward from a columnar representation: it splits data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, working on the [LDRD Project], we have developed a striped data representation which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed NoSQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple, so that it allows for a common data model and APIs for a wide range of underlying storage mechanisms such as distributed NoSQL databases and local file systems. Striped storage adopts NumPy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service that hides the server implementation details from the end user, easily exposes data to WAN users, and allows well-known, mature data caching solutions to be used to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing. We have been testing this architecture with a 2TB dataset from a CMS dark matter search and plan to
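
    A toy sketch of the striping idea follows: each column is held as NumPy arrays cut into fixed-size horizontal stripes that independent workers can process in parallel. This is an illustration of the concept, not the Striped Data Server implementation.

    ```python
    # Toy illustration of striping: store each column as NumPy arrays split
    # into fixed-size horizontal stripes so workers can process them in parallel.
    import numpy as np

    def stripe_columns(table, stripe_size):
        """table: dict of column name -> 1-D array. Returns a list of stripes."""
        n = len(next(iter(table.values())))
        stripes = []
        for start in range(0, n, stripe_size):
            stripes.append({name: col[start:start + stripe_size]
                            for name, col in table.items()})
        return stripes

    rng = np.random.default_rng(1)
    events = {"pt": rng.exponential(30.0, 10_000), "eta": rng.normal(0, 2, 10_000)}

    # Each stripe can be analysed independently (e.g. by separate processes).
    partial_sums = [s["pt"][s["eta"] > 0].sum() for s in stripe_columns(events, 1000)]
    print("total pT for eta > 0:", sum(partial_sums))
    ```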

  18. The SeaDataNet data products: regional temperature and salinity historical data collections

    NASA Astrophysics Data System (ADS)

    Simoncelli, Simona; Coatanoan, Christine; Bäck, Orjan; Sagen, Helge; Scoy, Serge; Myroshnychenko, Volodymyr; Schaap, Dick; Schlitzer, Reiner; Iona, Sissy; Fichaut, Michele

    2016-04-01

    Temperature and Salinity (TS) historical data collections covering the time period 1900-2013 were created for each European marginal sea (Arctic Sea, Baltic Sea, Black Sea, North Sea, North Atlantic Ocean, and Mediterranean Sea) within the framework of the SeaDataNet2 (SDN) EU project, and they are now available as ODV collections through the SeaDataNet web catalog at http://sextant.ifremer.fr/en/web/seadatanet/. Two versions have been published, representing snapshots of the SDN database content at two different times: V1.1 (January 2014) and V2 (March 2015). A Quality Control Strategy (QCS) has been developed and continuously refined in order to improve the quality of the SDN database content and to create the best product deriving from SDN data. The QCS was originally implemented in collaboration with the MyOcean2 and MyOcean Follow On projects in order to develop a true synergy at the regional level to serve the operational oceanography and climate change communities. The QCS involved the Regional Coordinators, responsible for the scientific assessment, the National Oceanographic Data Centers (NODC), and the data providers that, on the basis of the data quality assessment outcome, checked and, where necessary, corrected anomalies in the original data. The QCS consists of four main phases: 1) data harvesting from the central CDI; 2) file and parameter aggregation; 3) quality check analysis at the regional level; 4) analysis and correction of data anomalies. The approach is iterative to facilitate the upgrade of the SDN database content, and it also allows the versioning of data products with the release of new regional data collections at the end of each QCS loop. The SDN data collections and the QCS will be presented and the results summarized.

  19. Democratizing data science through data science training

    PubMed Central

    Van Horn, John Darrell; Fierro, Lily; Kamdar, Jeana; Gordon, Jonathan; Stewart, Crystal; Bhattrai, Avnish; Abe, Sumiko; Lei, Xiaoxiao; O’Driscoll, Caroline; Sinha, Aakanchha; Jain, Priyambada; Burns, Gully; Lerman, Kristina; Ambite, José Luis

    2017-01-01

    The biomedical sciences have experienced an explosion of data that promises to overwhelm many current practitioners. Without easy access to data science training resources, biomedical researchers may find themselves unable to wrangle their own datasets. In 2014, to address the challenges posed by such a data onslaught, the National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative. To this end, the BD2K Training Coordinating Center (TCC; bigdatau.org) was funded to facilitate both in-person and online learning and to open up the concepts of data science to the widest possible audience. Here, we describe the activities of the BD2K TCC and its focus on the construction of the Educational Resource Discovery Index (ERuDIte), which identifies, collects, describes, and organizes online data science materials from BD2K awardees, open online courses, and videos from scientific lectures and tutorials. ERuDIte now indexes over 9,500 resources. Given the richness of online training materials and the constant evolution of biomedical data science, computational methods applying information retrieval, natural language processing, and machine learning techniques are required - in effect, using data science to inform training in data science. In so doing, the TCC seeks to democratize novel insights and discoveries brought forth via large-scale data science training. PMID:29218890

  20. Democratizing data science through data science training.

    PubMed

    Van Horn, John Darrell; Fierro, Lily; Kamdar, Jeana; Gordon, Jonathan; Stewart, Crystal; Bhattrai, Avnish; Abe, Sumiko; Lei, Xiaoxiao; O'Driscoll, Caroline; Sinha, Aakanchha; Jain, Priyambada; Burns, Gully; Lerman, Kristina; Ambite, José Luis

    2018-01-01

    The biomedical sciences have experienced an explosion of data that promises to overwhelm many current practitioners. Without easy access to data science training resources, biomedical researchers may find themselves unable to wrangle their own datasets. In 2014, to address the challenges posed by such a data onslaught, the National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative. To this end, the BD2K Training Coordinating Center (TCC; bigdatau.org) was funded to facilitate both in-person and online learning and to open up the concepts of data science to the widest possible audience. Here, we describe the activities of the BD2K TCC and its focus on the construction of the Educational Resource Discovery Index (ERuDIte), which identifies, collects, describes, and organizes online data science materials from BD2K awardees, open online courses, and videos from scientific lectures and tutorials. ERuDIte now indexes over 9,500 resources. Given the richness of online training materials and the constant evolution of biomedical data science, computational methods applying information retrieval, natural language processing, and machine learning techniques are required - in effect, using data science to inform training in data science. In so doing, the TCC seeks to democratize novel insights and discoveries brought forth via large-scale data science training.

  1. Hydrogen Data | Geospatial Data Science | NREL

    Science.gov Websites

    This hydrogen data set estimates the potential for producing hydrogen annually from wind, solar, and biomass resources in the United States. Coordinate system name: GCS_North_American_1983. Coverage, metadata, and KMZ/Zip files are provided.

  2. Connecting geoscience systems and data using Linked Open Data in the Web of Data

    NASA Astrophysics Data System (ADS)

    Ritschel, Bernd; Neher, Günther; Iyemori, Toshihiko; Koyama, Yukinobu; Yatagai, Akiyo; Murayama, Yasuhiro; Galkin, Ivan; King, Todd; Fung, Shing F.; Hughes, Steve; Habermann, Ted; Hapgood, Mike; Belehaki, Anna

    2014-05-01

    Linked Data or Linked Open Data (LOD), in the realm of free and publicly accessible data, is one of the most promising and most used semantic Web frameworks connecting various types of data and vocabularies, including those of geoscience and related domains. The semantic Web extension to the commonly existing and used World Wide Web is based on the meaning of entities and relationships, or, in other words, classes and properties used for data in a global data and information space, the Web of Data. LOD data are referenced and mashed up via URIs and are retrievable using simple parameter-controlled HTTP requests, leading to results that are human-understandable or machine-readable. Furthermore, the publishing and mash-up of data in the semantic Web realm is realized by specific Web standards defined for the Web of Data, such as RDF, RDFS, OWL, and SPARQL. Semantic Web based mash-up is the Web method to aggregate and reuse various contents from different sources, for example using FOAF as a model and vocabulary for the description of persons and organizations (in our case related to geoscience projects, instruments, observations, data, and so on). Using the example of three different geoscience data and information management systems, namely ESPAS, IUGONET, and GFZ ISDC, and the associated science data and related metadata, or better, context data, this publication describes the concept of the mash-up of systems and data using the semantic Web approach and the Linked Open Data framework. Because the three systems are based on different data models, data storage structures, and technical implementations, an extra semantic Web layer on top of the existing interfaces is used for mash-up solutions. In order to satisfy the semantic Web standards, data transition processes, such as the transfer of content stored in relational databases or mapped in XML documents into SPARQL-capable databases or endpoints using D2R or XSLT, are necessary. In addition, the use of mapped and/or merged domain
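
    The sketch below illustrates the kind of mash-up layer described above in miniature: a few RDF triples describing an instrument are loaded into an rdflib graph and queried with SPARQL. The vocabulary and IRIs are invented placeholders, not the ESPAS, IUGONET, or GFZ ISDC data models.

    ```python
    # Minimal sketch of a semantic-web layer: describe an instrument as RDF
    # triples and retrieve it with a SPARQL query (placeholder vocabulary).
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/geo#")

    g = Graph()
    magnetometer = URIRef("http://example.org/instruments/mag-01")
    g.add((magnetometer, RDF.type, EX.Instrument))
    g.add((magnetometer, RDFS.label, Literal("Fluxgate magnetometer")))
    g.add((magnetometer, EX.observes, Literal("geomagnetic field")))

    results = g.query("""
        PREFIX ex: <http://example.org/geo#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?instrument ?label WHERE {
            ?instrument a ex:Instrument ;
                        rdfs:label ?label .
        }
    """)
    for row in results:
        print(row.instrument, row.label)
    ```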

  3. Data Specifications for the Landsat Data Continuity Mission

    USGS Publications Warehouse

    Irons, J.R.; Speciale, N.J.; Douglas, McCuistion J.; Masek, J.G.; Markham, B.L.; Storey, James C.; Lencioni, D.E.; Ryan, R.E.

    2003-01-01

    The National Aeronautics and Space Administration (NASA) plans to procure data from a privately-owned and commercially-operated remote sensing system for the next Landsat mission, the Landsat Data Continuity Mission (LDCM). Data requirements are documented in an LDCM Data Specification. The specifications require delivery of data covering 250 Landsat scenes on a daily basis. The data are to be acquired in a manner that affords seasonal coverage of the global land mass. Data are required for the heritage reflective Thematic Mapper (TM) spectral bands plus two new bands, a blue band for coastal zone observations and a short wave infrared band for cirrus cloud detection. The specifications do not require thermal data, representing a departure from the TM heritage. The specification also requires data providing a 30 m ground sample distance for each of the spectral bands with the exception of the new cirrus cloud band at 120 m. An absolute uncertainty of 5 percent or less is required for radiometrically corrected LDCM data and the commercial operator is required to deliver data that can be registered to a cartographic projection with an uncertainty of 65 m or less. Procuring data from a commercial operator represents a new approach for the 30-year-old Landsat Program. The LDCM Data Specification will ensure that the procured data provides continuity of the Landsat data stream and advances the mission.

  4. Automatic meta-data collection of STP observation data

    NASA Astrophysics Data System (ADS)

    Ishikura, S.; Kimura, E.; Murata, K.; Kubo, T.; Shinohara, I.

    2006-12-01

    For geoscience and STP (Solar-Terrestrial Physics) studies, various observations have been made by satellites and ground-based observatories. These data are saved and managed at many organizations, but there is no common procedure or rule for providing and/or sharing these data files. Researchers have had difficulty searching for and analyzing such different types of data distributed over the Internet. To support such cross-over analyses of observation data, we have developed the STARS (Solar-Terrestrial data Analysis and Reference System). The STARS consists of a client application (STARS-app), the meta-database (STARS-DB), the portal Web service (STARS-WS), and the download agent Web service (STARS DLAgent-WS). The STARS-DB includes directory information, access permissions, protocol information to retrieve data files, hierarchy information of mission/team/data, and user information. Users of the STARS are able to download observation data files without knowing the locations of the files by using the STARS-DB. We have implemented the Portal-WS to retrieve meta-data from the meta-database. One reason we use a Web service is to overcome the variety of firewall restrictions, which have become stricter in recent years; it is now difficult for the STARS client application to access the STARS-DB directly by sending SQL queries to obtain meta-data. Using the Web service, we succeeded in placing the STARS-DB behind the Portal-WS and preventing it from being exposed on the Internet. The STARS accesses the Portal-WS by sending a SOAP (Simple Object Access Protocol) request over HTTP, and meta-data are received as a SOAP response. The STARS DLAgent-WS provides clients with data files downloaded from data sites. The data files are provided with a variety of protocols (e.g., FTP, HTTP, FTPS, and SFTP), individually selected at each site. The clients send a SOAP request with download request messages and receive observation data files as a SOAP response with
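
    A generic sketch of a SOAP-over-HTTP call of the kind described above is shown below; the endpoint URL, XML namespace, operation, and element names are hypothetical placeholders, not the actual STARS Portal-WS interface.

    ```python
    # Generic sketch of a SOAP-over-HTTP request/response exchange.
    # Endpoint, namespace, and operation names are hypothetical placeholders.
    import requests

    ENDPOINT = "https://example.org/stars/portal-ws"      # placeholder endpoint
    envelope = """<?xml version="1.0" encoding="UTF-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
                   xmlns:st="http://example.org/stars/schema">
      <soap:Body>
        <st:SearchMetadata>
          <st:mission>GEOTAIL</st:mission>
          <st:startTime>2005-01-01T00:00:00Z</st:startTime>
          <st:endTime>2005-01-02T00:00:00Z</st:endTime>
        </st:SearchMetadata>
      </soap:Body>
    </soap:Envelope>"""

    resp = requests.post(
        ENDPOINT,
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        timeout=60,
    )
    print(resp.status_code)
    print(resp.text[:500])   # SOAP response carrying the requested meta-data
    ```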

  5. Big Data and Data Science in Critical Care.

    PubMed

    Sanchez-Pinto, L Nelson; Luo, Yuan; Churpek, Matthew M

    2018-05-09

    The digitalization of the health-care system has resulted in a deluge of clinical Big Data and has prompted the rapid growth of data science in medicine. Data science, which is the field of study dedicated to the principled extraction of knowledge from complex data, is particularly relevant in the critical care setting. The availability of large amounts of data in the ICU, the need for better evidence-based care, and the complexity of critical illness makes the use of data science techniques and data-driven research particularly appealing to intensivists. Despite the increasing number of studies and publications in the field, thus far there have been few examples of data science projects that have resulted in successful implementations of data-driven systems in the ICU. However, given the expected growth in the field, intensivists should be familiar with the opportunities and challenges of Big Data and data science. The present article reviews the definitions, types of algorithms, applications, challenges, and future of Big Data and data science in critical care. Copyright © 2018 American College of Chest Physicians. Published by Elsevier Inc. All rights reserved.

  6. Big data from small data: data-sharing in the ‘long tail’ of neuroscience

    PubMed Central

    Ferguson, Adam R; Nielson, Jessica L; Cragin, Melissa H; Bandrowski, Anita E; Martone, Maryann E

    2016-01-01

    The launch of the US BRAIN and European Human Brain Projects coincides with growing international efforts toward transparency and increased access to publicly funded research in the neurosciences. The need for data-sharing standards and neuroinformatics infrastructure is more pressing than ever. However, ‘big science’ efforts are not the only drivers of data-sharing needs, as neuroscientists across the full spectrum of research grapple with the overwhelming volume of data being generated daily and a scientific environment that is increasingly focused on collaboration. In this commentary, we consider the issue of sharing of the richly diverse and heterogeneous small data sets produced by individual neuroscientists, so-called long-tail data. We consider the utility of these data, the diversity of repositories and options available for sharing such data, and emerging best practices. We provide use cases in which aggregating and mining diverse long-tail data convert numerous small data sources into big data for improved knowledge about neuroscience-related disorders. PMID:25349910

  7. NATIONAL GEOTHERMAL DATA SYSTEM (NGDS) GEOTHERMAL DATA DOMAIN: ASSESSMENT OF GEOTHERMAL COMMUNITY DATA NEEDS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Anderson, Arlene; Blackwell, David; Chickering, Cathy

    2013-01-01

    To satisfy the critical need for geothermal data to advance geothermal energy as a viable renewable energy contender, the U.S. Department of Energy is investing in the development of the National Geothermal Data System (NGDS). This paper outlines efforts among geothermal data providers nationwide to supply cutting-edge geo-informatics. NGDS geothermal data acquisition, delivery, and methodology are discussed. In particular, this paper addresses the various types of data required to effectively assess geothermal energy potential and why simple links to existing data are insufficient. To create a platform for ready access by all geothermal stakeholders, the NGDS includes a work plan that addresses data assets and resources of interest to users, a survey of data providers, data content models, and how data will be exchanged and promoted, as well as lessons learned within the geothermal community.

  8. Advancements in Data Access at the IRIS Data Management Center to Broaden Data Use

    NASA Astrophysics Data System (ADS)

    Benson, R. B.; Trabant, C. M.; Ahern, T. K.

    2013-12-01

    The IRIS Data Management Center (DMC) has been serving digital seismic data for more than 20 years and has offered a variety of access mechanisms that have stood the test of time. However, beginning in 2010, and in response to multiple requests made of the IRIS DMC, we have developed web service interfaces to access our primary data repository. These new interfaces have rapidly grown in popularity. In 2013, the third full year of their operation, these services were responsible for half of all the data shipped from the DMC. In the same time period, the amount of data shipped via the other data access mechanisms has also increased. This non-linear growth of data shipments reflects increased data usage by the research community. We believe that our new web service interfaces are well suited to future data access needs and signify a significant evolution in integrating different scientific data sets. Because they are based on standardized web technologies, support for writing access software is ubiquitous. As fundamentally programmatic interfaces, the services are well suited for integration into data processing systems, in particular large-scale data processing systems. Their programmatic nature also makes them well suited for use with brokering systems where, for example, data from multiple disciplines can be integrated. In addition to providing access to raw data, the DMC created web services that apply simple, on-the-fly processing and format conversion. Processing the data (e.g., converting to Earth units) and formatting the result into something generally usable (e.g., ASCII) removes important barriers for users working in other disciplines. The end result is that we are shipping a much larger amount of data in a manner more directly usable by users. Many of these principles will be applied to the DMC's future work in the NSF's EarthCube Web Service Building Blocks project.

  9. Data Collection, Collaboration, Analysis, and Publication Using the Open Data Repository's (ODR) Data Publisher

    NASA Astrophysics Data System (ADS)

    Lafuente, B.; Stone, N.; Bristow, T.; Keller, R. M.; Blake, D. F.; Downs, R. T.; Pires, A.; Dateo, C. E.; Fonda, M.

    2017-12-01

    In development for nearly four years, the Open Data Repository's (ODR) Data Publisher software has become a useful tool for researchers' data needs. Data Publisher facilitates the creation of customized databases with flexible permission sets that allow researchers to share data collaboratively while improving data discovery and maintaining ownership rights. The open source software provides an end-to-end solution from collection to final repository publication. A web-based interface allows researchers to enter data, view data, and conduct analysis using any programming language supported by JupyterHub (http://www.jupyterhub.org). This toolset makes it possible for a researcher to store and manipulate their data in the cloud from any internet capable device. Data can be embargoed in the system until a date selected by the researcher. For instance, open publication can be set to a date that coincides with publication of data analysis in a third party journal. In conjunction with teams at NASA Ames and the University of Arizona, a number of pilot studies are being conducted to guide the software development so that it allows them to publish and share their data. These pilots include (1) the Astrobiology Habitable Environments Database (AHED), a central searchable repository designed to promote and facilitate the integration and sharing of all the data generated by the diverse disciplines in astrobiology; (2) a database containing the raw and derived data products from the CheMin instrument on the MSL rover Curiosity (http://odr.io/CheMin), featuring a versatile graphing system, instructions and analytical tools to process the data, and a capability to download data in different formats; and (3) the Mineral Evolution project, which by correlating the diversity of mineral species with their ages, localities, and other measurable properties aims to understand how the episodes of planetary accretion and differentiation, plate tectonics, and origin of life lead to a

  10. KNMI DataLab experiences in serving data-driven innovations

    NASA Astrophysics Data System (ADS)

    Noteboom, Jan Willem; Sluiter, Raymond

    2016-04-01

    Climate change research and innovations in weather forecasting rely more and more on (Big) data. Besides increasing data from traditional sources (such as observation networks, radars and satellites), the use of open data, crowd sourced data and the Internet of Things (IoT) is emerging. To deploy these sources of data optimally in our services and products, KNMI has established a DataLab to serve data-driven innovations in collaboration with public and private sector partners. Big data management, data integration, data analytics including machine learning and data visualization techniques are playing an important role in the DataLab. Cross-domain data-driven innovations that arise from public-private collaborative projects and research programmes can be explored, experimented and/or piloted by the KNMI DataLab. Furthermore, advice can be requested on (Big) data techniques and data sources. In support of collaborative (Big) data science activities, scalable environments are offered with facilities for data integration, data analysis and visualization. In addition, Data Science expertise is provided directly or from a pool of internal and external experts. At the EGU conference, gained experiences and best practices are presented in operating the KNMI DataLab to serve data-driven innovations for weather and climate applications optimally.

  11. Arithmetic Data Cube as a Data Intensive Benchmark

    NASA Technical Reports Server (NTRS)

    Frumkin, Michael A.; Shabano, Leonid

    2003-01-01

    Data movement across computational grids and across the memory hierarchy of individual grid machines is known to be a limiting factor for applications involving large data sets. In this paper we introduce the Data Cube Operator on an Arithmetic Data Set, which we call the Arithmetic Data Cube (ADC). We propose to use the ADC to benchmark grid capabilities to handle large distributed data sets. The ADC stresses all levels of grid memory by producing the 2^d views of an Arithmetic Data Set of d-tuples described by a small number of parameters. We control the data intensity of the ADC by controlling the sizes of the views through the choice of tuple parameters.
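
    The 2^d figure comes from the data cube operator: for d dimension attributes there are 2^d possible group-by views (including the grand total). The sketch below enumerates them with pandas over synthetic tuples standing in for the arithmetically generated data set.

    ```python
    # Sketch of the data cube operator: for d dimension attributes there are
    # 2^d group-by views. Synthetic tuples stand in for the ADC's data set.
    from itertools import chain, combinations
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    d_dims = ["d1", "d2", "d3"]                       # d = 3 dimensions -> 8 views
    df = pd.DataFrame({
        "d1": rng.integers(0, 4, 1000),
        "d2": rng.integers(0, 3, 1000),
        "d3": rng.integers(0, 5, 1000),
        "measure": rng.random(1000),
    })

    def all_subsets(dims):
        return chain.from_iterable(combinations(dims, k) for k in range(len(dims) + 1))

    views = {}
    for subset in all_subsets(d_dims):
        if subset:
            views[subset] = df.groupby(list(subset))["measure"].sum()
        else:
            views[()] = df["measure"].sum()           # the grand-total view

    print(f"computed {len(views)} of 2^{len(d_dims)} = {2**len(d_dims)} views")
    ```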

  12. DataSHIELD: taking the analysis to the data, not the data to the analysis.

    PubMed

    Gaye, Amadou; Marcon, Yannick; Isaeva, Julia; LaFlamme, Philippe; Turner, Andrew; Jones, Elinor M; Minion, Joel; Boyd, Andrew W; Newby, Christopher J; Nuotio, Marja-Liisa; Wilson, Rebecca; Butters, Oliver; Murtagh, Barnaby; Demir, Ipek; Doiron, Dany; Giepmans, Lisette; Wallace, Susan E; Budin-Ljøsne, Isabelle; Oliver Schmidt, Carsten; Boffetta, Paolo; Boniol, Mathieu; Bota, Maria; Carter, Kim W; deKlerk, Nick; Dibben, Chris; Francis, Richard W; Hiekkalinna, Tero; Hveem, Kristian; Kvaløy, Kirsti; Millar, Sean; Perry, Ivan J; Peters, Annette; Phillips, Catherine M; Popham, Frank; Raab, Gillian; Reischl, Eva; Sheehan, Nuala; Waldenberger, Melanie; Perola, Markus; van den Heuvel, Edwin; Macleod, John; Knoppers, Bartha M; Stolk, Ronald P; Fortier, Isabel; Harris, Jennifer R; Woffenbuttel, Bruce H R; Murtagh, Madeleine J; Ferretti, Vincent; Burton, Paul R

    2014-12-01

    Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate and controversy relating to the UK's proposed 'care.data' initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data. Commands are sent from a central analysis computer (AC) to several data computers (DCs) storing the data to be co-analysed. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands transmitted back and forth between the DCs and the AC. This paper describes the technical implementation of DataSHIELD using a modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is controlled through a standard R environment at the AC. Based on this Opal/R implementation, DataSHIELD is currently used by the Healthy Obese Project and the Environmental Core Project (BioSHaRE-EU) for the federated analysis of 10 data sets across eight European countries, and this illustrates the opportunities and challenges presented by the DataSHIELD approach. DataSHIELD facilitates important research in settings where: (i) a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prohibit the release or sharing of some of the required data, and/or render data access unacceptably slow; (ii) a
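      
    The following Python sketch illustrates, under simplifying assumptions, the general pattern the abstract describes: each data computer returns only non-disclosive summary statistics (here, per-site sums and counts), and the analysis computer combines them into a pooled estimate without individual-level data ever leaving the sites. This is only a minimal illustration of the idea, not the DataSHIELD/Opal/R API.

      # Minimal sketch of federated analysis via non-disclosive summaries.
      # Each "data computer" exposes only aggregate statistics to the analysis computer.

      site_data = {                      # individual-level data stays local to each site
          "site_A": [5.1, 4.8, 6.0, 5.5],
          "site_B": [4.9, 5.2, 5.0],
          "site_C": [6.1, 5.9, 5.7, 6.3, 5.8],
      }

      def local_summary(values):
          """Runs at a data computer; returns only a count and a sum."""
          return {"n": len(values), "sum": sum(values)}

      # The analysis computer receives the summaries and combines them.
      summaries = [local_summary(v) for v in site_data.values()]
      total_n = sum(s["n"] for s in summaries)
      pooled_mean = sum(s["sum"] for s in summaries) / total_n
      print(f"pooled mean over {total_n} individuals: {pooled_mean:.3f}")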

  13. NREL: Renewable Resource Data Center - Geothermal Resource Data

    Science.gov Websites

    Data related to geothermal technology and energy. The following geothermal resource data collections can be found in the Renewable Resource Data Center (RReDC). The datasets on this page offer a qualitative

  14. Applying Data Mining Principles to Library Data Collection.

    ERIC Educational Resources Information Center

    Guenther, Kim

    2000-01-01

    Explains how libraries can use data mining techniques for more effective data collection. Highlights include three phases: data selection and acquisition; data preparation and processing, including a discussion of the use of XML (extensible markup language); and data interpretation and integration, including database management systems. (LRW)

  15. Oceans of Data : the Australian Ocean Data Network

    NASA Astrophysics Data System (ADS)

    Proctor, R.; Blain, P.; Mancini, S.

    2012-04-01

    The Australian Integrated Marine Observing System (IMOS, www.imos.org.au) is a research infrastructure project to establish an enduring marine observing system for Australian oceanic waters and shelf seas (in total, 4% of the world's oceans). Marine data and information are the main products and data management is therefore a central element to the project's success. A single integrative framework for data and information management has been developed which allows discovery and access of the data by scientists, managers and the public, based on standards and interoperability. All data is freely available. This information infrastructure has been further developed to form the Australian Ocean Data Network (AODN, www.aodn.org.au) which is rapidly becoming the 'one-stop-shop' for marine data in Australia. In response to requests from users, new features have recently been added to data discovery, visualization, and data access which move the AODN closer towards providing full integration of multi-disciplinary data.

  16. Open Core Data: Connecting scientific drilling data to scientists and community data resources

    NASA Astrophysics Data System (ADS)

    Fils, D.; Noren, A. J.; Lehnert, K.; Diver, P.

    2016-12-01

    Open Core Data (OCD) is an innovative, efficient, and scalable infrastructure for data generated by scientific drilling and coring to improve discoverability, accessibility, citability, and preservation of data from the oceans and continents. OCD is building on existing community data resources that manage, store, publish, and preserve scientific drilling data, filling a critical void that currently prevents linkages between these and other data systems and tools to realize the full potential of data generated through drilling and coring. We are developing this functionality through Linked Open Data (LOD) and semantic patterns that enable data access through the use of community ontologies such as GeoLink (geolink.org, an EarthCube Building Block), a collection of protocols, formats and vocabularies from a set of participating geoscience repositories. Common shared concepts of classes such as cruise, dataset, person and others allow easier resolution of common references through shared resource IDs. These graphs are then made available via SPARQL as well as incorporated into web pages following schema.org approaches. Additionally the W3C PROV vocabulary is under evaluation for use for documentation of provenance. Further, the application of persistent identifiers for samples (IGSNs); datasets, expeditions, and projects (DOIs); and people (ORCIDs), combined with LOD approaches, provides methods to resolve and incorporate metadata and datasets. Application Program Interfaces (APIs) complement these semantic approaches to the OCD data holdings. APIs are exposed following the Swagger guidelines (swagger.io) and will be evolved into the OpenAPI (openapis.org) approach. Currently APIs are in development for the NSF funded Flyover Country mobile geoscience app (fc.umn.edu), the Neotoma Paleoecology Database (neotomadb.org), Magnetics Information Consortium (MagIC; earthref.org/MagIC), and other community tools and data systems, as well as for internal OCD use.
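
    As a hedged illustration of the schema.org and persistent-identifier approach described above, the Python sketch below assembles a JSON-LD description of a hypothetical drilling dataset carrying a DOI, an IGSN for the sampled core, and an ORCID for the creator. The identifiers and field choices are invented placeholders for illustration and do not come from the OCD system itself.

      import json

      # Hypothetical JSON-LD record for a cored-sample dataset, using schema.org types.
      dataset = {
          "@context": "https://schema.org/",
          "@type": "Dataset",
          "name": "Example borehole core description",            # invented example
          "identifier": "https://doi.org/10.xxxx/example",        # placeholder DOI
          "creator": {
              "@type": "Person",
              "name": "A. Researcher",
              "identifier": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
          },
          "about": {
              "@type": "Thing",
              "name": "Core sample",
              "identifier": "https://igsn.org/XXXXXXXXX",         # placeholder IGSN
          },
      }

      # Such a record can be embedded in a landing page as application/ld+json.
      print(json.dumps(dataset, indent=2))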

  17. 14 CFR 217.5 - Data collected (data elements).

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 14 Aeronautics and Space 4 2014-01-01 2014-01-01 false Data collected (data elements). 217.5..., CHARTER, AND NONSCHEDULED SERVICES § 217.5 Data collected (data elements). (a) Within each of the service classifications prescribed in § 217.4, data shall be reported in applicable traffic elements. (b) The statistical...

  18. 14 CFR 217.5 - Data collected (data elements).

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... 14 Aeronautics and Space 4 2013-01-01 2013-01-01 false Data collected (data elements). 217.5..., CHARTER, AND NONSCHEDULED SERVICES § 217.5 Data collected (data elements). (a) Within each of the service classifications prescribed in § 217.4, data shall be reported in applicable traffic elements. (b) The statistical...

  19. 14 CFR 217.5 - Data collected (data elements).

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... 14 Aeronautics and Space 4 2012-01-01 2012-01-01 false Data collected (data elements). 217.5..., CHARTER, AND NONSCHEDULED SERVICES § 217.5 Data collected (data elements). (a) Within each of the service classifications prescribed in § 217.4, data shall be reported in applicable traffic elements. (b) The statistical...

  20. NREL: Renewable Resource Data Center - Solar Resource Data

    Science.gov Websites

    The following solar resource data collections can be found in the Renewable Resource Data Center (RReDC). Cooperative Networks for Renewable Resource Measurements (CONFRRM) Solar Energy Resource Data provides solar radiation and wind measurement data for select U.S. locations.

  1. 14 CFR 217.5 - Data collected (data elements).

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 14 Aeronautics and Space 4 2010-01-01 2010-01-01 false Data collected (data elements). 217.5..., CHARTER, AND NONSCHEDULED SERVICES § 217.5 Data collected (data elements). (a) Within each of the service classifications prescribed in § 217.4, data shall be reported in applicable traffic elements. (b) The statistical...

  2. 14 CFR 217.5 - Data collected (data elements).

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 14 Aeronautics and Space 4 2011-01-01 2011-01-01 false Data collected (data elements). 217.5..., CHARTER, AND NONSCHEDULED SERVICES § 217.5 Data collected (data elements). (a) Within each of the service classifications prescribed in § 217.4, data shall be reported in applicable traffic elements. (b) The statistical...

  3. Xenbase: Core features, data acquisition, and data processing.

    PubMed

    James-Zorn, Christina; Ponferrada, Virgillio G; Burns, Kevin A; Fortriede, Joshua D; Lotay, Vaneet S; Liu, Yu; Brad Karpinka, J; Karimi, Kamran; Zorn, Aaron M; Vize, Peter D

    2015-08-01

    Xenbase, the Xenopus model organism database (www.xenbase.org), is a cloud-based, web-accessible resource that integrates the diverse genomic and biological data from Xenopus research. Xenopus frogs are one of the major vertebrate animal models used for biomedical research, and Xenbase is the central repository for the enormous amount of data generated using this model tetrapod. The goal of Xenbase is to accelerate discovery by enabling investigators to make novel connections between molecular pathways in Xenopus and human disease. Our relational database and user-friendly interface make these data easy to query and allow investigators to quickly interrogate and link different data types in ways that would otherwise be difficult, time consuming, or impossible. Xenbase also enhances the value of these data through high-quality gene expression curation and data integration, by providing bioinformatics tools optimized for Xenopus experiments, and by linking Xenopus data to other model organisms and to human data. Xenbase draws in data via pipelines that download data, parse the content, and save them into appropriate files and database tables. Furthermore, Xenbase makes these data accessible to the broader biomedical community by continually providing annotated data updates to organizations such as NCBI, UniProtKB, and Ensembl. Here, we describe our bioinformatics and genome-browsing tools, data acquisition and sharing, our community-submitted and literature curation pipelines, text-mining support, gene page features, and the curation of gene nomenclature and gene models. © 2015 Wiley Periodicals, Inc.

  4. DataONE: A Distributed Environmental and Earth Science Data Network Supporting the Full Data Life Cycle

    NASA Astrophysics Data System (ADS)

    Cook, R.; Michener, W.; Vieglais, D.; Budden, A.; Koskela, R.

    2012-04-01

    Addressing grand environmental science challenges requires unprecedented access to easily understood data that cross the breadth of temporal, spatial, and thematic scales. Tools are needed to plan management of the data, discover the relevant data, integrate heterogeneous and diverse data, and convert the data to information and knowledge. Addressing these challenges requires new approaches for the full data life cycle of managing, preserving, sharing, and analyzing data. DataONE (Data Observation Network for Earth) represents a virtual organization that enables new science and knowledge creation through preservation and access to data about life on Earth and the environment that sustains it. The DataONE approach is to improve data collection and management techniques; facilitate easy, secure, and persistent storage of data; continue to increase access to data and tools that improve data interoperability; disseminate integrated and user-friendly tools for data discovery and novel analyses; work with researchers to build intuitive data exploration and visualization tools; and support communities of practice via education, outreach, and stakeholder engagement.

  5. Scenario driven data modelling: a method for integrating diverse sources of data and data streams

    PubMed Central

    2011-01-01

    Background Biology is rapidly becoming a data-intensive, data-driven science. It is essential that data are represented and connected in ways that best represent their full conceptual content and allow both automated integration and data-driven decision-making. Recent advancements in distributed multi-relational directed graphs, implemented in the form of the Semantic Web, make it possible to deal with complicated heterogeneous data in new and interesting ways. Results This paper presents a new approach, scenario driven data modelling (SDDM), that integrates multi-relational directed graphs with data streams. SDDM can be applied to virtually any data integration challenge with widely divergent types of data and data streams. In this work, we explored integrating genetics data with reports from traditional media. SDDM was applied to the New Delhi metallo-beta-lactamase gene (NDM-1), an emerging global health threat. The SDDM process constructed a scenario, created an RDF multi-relational directed graph that linked diverse types of data to the Semantic Web, implemented RDF conversion tools (RDFizers) to bring content into the Semantic Web, identified data streams and analytical routines to analyse those streams, and identified user requirements and graph traversals to meet end-user requirements. Conclusions We provided an example where SDDM was applied to a complex data integration challenge. The process created a model of the emerging NDM-1 health threat, identified and filled gaps in that model, and constructed reliable software that monitored data streams based on the scenario-derived multi-relational directed graph. The SDDM process significantly reduced the software requirements phase by letting the scenario and resulting multi-relational directed graph define what is possible and then set the scope of the user requirements. Approaches like SDDM will be critical to the future of data intensive, data-driven science because they automate the process of converting
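
    A minimal sketch of the kind of RDF multi-relational directed graph the SDDM process builds, assuming the rdflib Python library; the namespace, resources, and relations below are invented stand-ins for the gene/report linkages mentioned in the abstract, not the actual NDM-1 model.

      from rdflib import Graph, Namespace, Literal, RDF

      EX = Namespace("http://example.org/sddm/")   # hypothetical namespace for the scenario

      g = Graph()
      g.bind("ex", EX)

      # Link a gene, a resistance phenotype, and a media report as nodes in a directed graph.
      g.add((EX.NDM1, RDF.type, EX.Gene))
      g.add((EX.NDM1, EX.confersResistanceTo, EX.Carbapenems))
      g.add((EX.report42, RDF.type, EX.MediaReport))
      g.add((EX.report42, EX.mentions, EX.NDM1))
      g.add((EX.report42, EX.publishedOn, Literal("2010-08-11")))

      # Serialize the graph in Turtle; in SDDM such graphs feed downstream traversals.
      print(g.serialize(format="turtle"))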

  6. Restoration of Apollo Data by the Lunar Data Project/PDS Lunar Data Node: An Update

    NASA Technical Reports Server (NTRS)

    Williams, David R.; Hills, H. Kent; Taylor, Patrick T.; Grayzeck, Edwin J.; Guinness, Edward A.

    2016-01-01

    The Apollo 11, 12, and 14 through 17 missions orbited and landed on the Moon, carrying scientific instruments that returned data from all phases of the missions, including long-lived Apollo Lunar Surface Experiments Packages (ALSEPs) deployed by the astronauts on the lunar surface. Much of these data were never archived, and some of the archived data were on media and in formats that are outmoded, or were deposited with little or no useful documentation to aid outside users. This is particularly true of the ALSEP data returned autonomously for many years after the Apollo missions ended. The purpose of the Lunar Data Project and the Planetary Data System (PDS) Lunar Data Node is to take data collections already archived at the NASA Space Science Data Coordinated Archive (NSSDCA) and prepare them for archiving through PDS, and to locate lunar data that were never archived, bring them into NSSDCA, and then archive them through PDS. Preparing these data for archiving involves reading the data from the original media, be it magnetic tape, microfilm, microfiche, or hard-copy document, converting the outmoded, often binary, formats when necessary, putting them into a standard digital form accepted by PDS, collecting the necessary ancillary data and documentation (metadata) to ensure that the data are usable and well-described, summarizing the metadata in documentation to be included in the data set, adding other information such as references, mission and instrument descriptions, contact information, and related documentation, and packaging the results in a PDS-compliant data set. The data set is then validated and reviewed by a group of external scientists as part of the PDS final archive process. We present a status report on some of the data sets that we are processing.

  7. Data Elevator

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    BYNA, SUNRENDRA; DONG, BIN; WU, KESHENG

    Data Elevator: Efficient Asynchronous Data Movement in Hierarchical Storage Systems. Multi-layer storage subsystems, including SSD-based burst buffers and disk-based parallel file systems (PFS), are becoming part of HPC systems. However, software for this storage hierarchy is still in its infancy. Applications may have to explicitly move data among the storage layers. We propose Data Elevator for transparently and efficiently moving data between a burst buffer and a PFS. Users specify the final destination for their data, typically on the PFS; Data Elevator intercepts the I/O calls, stages data on the burst buffer, and then asynchronously transfers the data to their final destination in the background. This system allows extensive optimizations, such as overlapping read and write operations, choosing I/O modes, and aligning buffer boundaries. In tests with large-scale scientific applications, Data Elevator is as much as 4.2X faster than Cray DataWarp, the state-of-the-art software for burst buffers, and 4X faster than directly writing to the PFS. The Data Elevator library uses HDF5's Virtual Object Layer (VOL) for intercepting parallel I/O calls that write data to the PFS. The intercepted calls are redirected to the Data Elevator, which provides a handle to write the file in a faster, intermediate burst buffer system. Once the application finishes writing the data to the burst buffer, the Data Elevator job uses HDF5 to move the data to the final destination in an asynchronous manner. Hence, using the Data Elevator library is currently useful for applications that call HDF5 for writing data files. Also, the Data Elevator depends on the HDF5 VOL functionality.
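
    The sketch below is not the Data Elevator or HDF5 VOL API; it only illustrates the general staging pattern the description refers to, assuming a fast "burst buffer" directory and a slower "PFS" directory: writes land on the fast tier, and a background thread later moves them to the final destination.

      import shutil, threading, queue, pathlib

      BURST_BUFFER = pathlib.Path("/tmp/bb")      # hypothetical fast tier
      PFS = pathlib.Path("/tmp/pfs")              # hypothetical final destination
      BURST_BUFFER.mkdir(parents=True, exist_ok=True)
      PFS.mkdir(parents=True, exist_ok=True)

      pending = queue.Queue()

      def drain():
          """Background mover: asynchronously migrates staged files to the final tier."""
          while True:
              name = pending.get()
              if name is None:
                  break
              shutil.move(str(BURST_BUFFER / name), str(PFS / name))

      mover = threading.Thread(target=drain, daemon=True)
      mover.start()

      def staged_write(name, data):
          """Application-facing write: lands on the burst buffer, then queues the move."""
          (BURST_BUFFER / name).write_bytes(data)
          pending.put(name)

      staged_write("output.dat", b"simulation results")
      pending.put(None)       # signal shutdown after the queued move completes
      mover.join()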

  8. Data near processing support for climate data analysis

    NASA Astrophysics Data System (ADS)

    Kindermann, Stephan; Ehbrecht, Carsten; Hempelmann, Nils

    2016-04-01

    Climate data repositories grow in size exponentially. Scalable data-near processing capabilities are required to meet future data analysis requirements and to replace current "data download and process at home" workflows and approaches. On the one hand, these processing capabilities should be accessible via standardized interfaces (e.g. OGC WPS); on the other hand, a large variety of processing tools, toolboxes and deployment alternatives have to be supported and maintained at the data/processing center. We present a community approach of a modular and flexible system supporting the development, deployment and maintenance of OGC WPS-based web processing services. This approach is organized in an open source GitHub project (called "bird-house") supporting individual processing services ("birds", e.g. climate index calculations, model data ensemble calculations), which rely on basic common infrastructural components (e.g. installation and deployment recipes, analysis code dependency management). To support easy deployment at data centers as well as home institutes (e.g. for testing and development), the system supports the management of the often very complex package dependency chains of climate data analysis packages as well as Docker-based packaging and installation. We present a concrete deployment scenario at the German Climate Computing Center (DKRZ). DKRZ on the one hand hosts a multi-petabyte climate archive which is integrated, e.g., into the European ENES and worldwide ESGF data infrastructures, and on the other hand hosts an HPC center supporting (model) data production and data analysis. The deployment scenario also includes OpenStack-based data cloud services to support data import and data distribution for bird-house based WPS web processing services. Current challenges for inter-institutional deployments of web processing services supporting the European and international climate modeling community as well as the climate impact community are highlighted
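
    As a small, hedged illustration of what "standardized interfaces (e.g. OGC WPS)" means in practice, the Python sketch below issues the standard key-value-pair GetCapabilities and DescribeProcess requests against a hypothetical bird-house-style WPS endpoint; the service URL and process identifier are placeholders, not an actual DKRZ deployment.

      import requests

      WPS_URL = "https://example.org/wps"          # placeholder endpoint

      # Ask the service which processes it offers (OGC WPS 1.0.0 key-value-pair encoding).
      caps = requests.get(WPS_URL, params={
          "service": "WPS",
          "request": "GetCapabilities",
          "version": "1.0.0",
      })
      print(caps.status_code, caps.headers.get("Content-Type"))

      # Ask for the description (inputs/outputs) of one hypothetical climate-index process.
      desc = requests.get(WPS_URL, params={
          "service": "WPS",
          "request": "DescribeProcess",
          "version": "1.0.0",
          "identifier": "climate_index_SU",        # hypothetical process identifier
      })
      print(desc.text[:300])                       # the service answers with an XML process description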

  9. Transparent Reporting of Data Quality in Distributed Data Networks

    PubMed Central

    Kahn, Michael G.; Brown, Jeffrey S.; Chun, Alein T.; Davidson, Bruce N.; Meeker, Daniella; Ryan, Patrick B.; Schilling, Lisa M.; Weiskopf, Nicole G.; Williams, Andrew E.; Zozus, Meredith Nahm

    2015-01-01

    Introduction: Poor data quality can be a serious threat to the validity and generalizability of clinical research findings. The growing availability of electronic administrative and clinical data is accompanied by a growing concern about the quality of these data for observational research and other analytic purposes. Currently, there are no widely accepted guidelines for reporting quality results that would enable investigators and consumers to independently determine if a data source is fit for use to support analytic inferences and reliable evidence generation. Model and Methods: We developed a conceptual model that captures the flow of data from the data originator across successive data stewards and finally to the data consumer. This “data lifecycle” model illustrates how data quality issues can result in data being returned back to previous data custodians. We highlight the potential risks of poor data quality on clinical practice and research results. Because of the need to ensure transparent reporting of data quality issues, we created a unifying data-quality reporting framework and a complementary set of 20 data-quality reporting recommendations for studies that use observational clinical and administrative data for secondary data analysis. We obtained stakeholder input on the perceived value of each recommendation by soliciting public comments via two face-to-face meetings of informatics and comparative-effectiveness investigators, through multiple public webinars targeted to the health services research community, and with an open access online wiki. Recommendations: Our recommendations propose reporting on both general and analysis-specific data quality features. The goals of these recommendations are to improve the reporting of data quality measures for studies that use observational clinical and administrative data, to ensure transparency and consistency in computing data quality measures, and to facilitate best practices and trust in the new

  10. Profiling Oman education data using data visualization technique

    NASA Astrophysics Data System (ADS)

    Alalawi, Sultan Juma Sultan; Shaharanee, Izwan Nizal Mohd; Jamil, Jastini Mohd

    2016-10-01

    This research work presents an innovative data visualization technique to understand and visualize the information in Oman's education data generated from the Ministry of Education Oman "Educational Portal". The Ministry of Education in the Sultanate of Oman has huge databases containing massive amounts of information. The volume of data in the database increases yearly as many students, teachers and employees are entered into the database. The task of discovering and analyzing these vast volumes of data becomes increasingly difficult. Information visualization and data mining offer better ways of dealing with large volumes of information. In this paper, an innovative information visualization technique is developed to visualize the complex multidimensional educational data. Microsoft Excel Dashboard, Visual Basic for Applications (VBA) and Pivot Tables are utilized to visualize the data. Findings from the summarization of the data are presented, and it is argued that information visualization can help related stakeholders become aware of hidden and interesting information in the large amounts of data in their educational portal.
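
    The dashboard and pivot-table approach in the abstract is built in Excel and VBA; as a rough analogue, the pandas sketch below builds the same kind of summary pivot from a hypothetical student-enrolment table, only to show the sort of aggregation behind such a dashboard. The column names and numbers are invented.

      import pandas as pd

      # Hypothetical slice of an educational-portal extract.
      records = pd.DataFrame({
          "region":   ["Muscat", "Muscat", "Dhofar", "Dhofar", "Batinah"],
          "year":     [2014, 2015, 2014, 2015, 2015],
          "gender":   ["F", "M", "F", "M", "F"],
          "students": [1200, 1150, 800, 820, 950],
      })

      # Pivot: total students by region and year, the kind of view a dashboard would chart.
      summary = pd.pivot_table(records, values="students",
                               index="region", columns="year",
                               aggfunc="sum", fill_value=0)
      print(summary)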

  11. Limiting Data Friction by Reducing Data Download Using Spatiotemporally Aligned Data Organization Through STARE

    NASA Technical Reports Server (NTRS)

    Kuo, Kwo-Sen; Rilee, Michael Lee

    2017-01-01

    Current data processing practice limits the volume and variety of relevant geoscience data that can practically be applied to important problems. File archives in centralized data centers are the principal means by which Earth Science data are accessed. This approach, however, requires laborious search, retrieval, and eventual customization/adaptation for the data to be used. Such fractionation makes it even more difficult to share outcomes, i.e. research artifacts and data products, hampering reusability and repeatability, since end users generally have their own research agenda and preferences as well as scarce resources. Thus, while finding and downloading data files from central data centers are already costly for end users working in their own field, using data products from other disciplines rapidly becomes prohibitive. This curtails scientific productivity, limits avenues of study, and endangers quality and reproducibility. The Spatio-Temporal Adaptive Resolution Encoding (STARE) is a unifying scheme that facilitates the indexing, access, and fusion of diverse Earth Science data. STARE implements an innovative encoding of geo-spatiotemporal information, originally developed for aligning datasets with diverse spatiotemporal characteristics in an array database. The spatial component of STARE recursively quadfurcates a root polyhedron, producing a hierarchical scheme for addressing geographic locations and regions. The temporal component of STARE uses conventional date-time units as an indexing hierarchy. The additional encoding of spatial and temporal resolution information in STARE enables comparisons and conditional selections across diverse datasets. Moreover, spatiotemporal set-operations, e.g. union and intersection, are mapped to efficient integer operations with STARE. Applied to existing data models (point, grid, spacecraft swath) and corresponding granules, STARE indexes provide a streamlined description usable as geo-spatiotemporal metadata. When
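
    The following Python sketch is not the STARE encoding itself; it only illustrates, for a flat 2-D analogue, how recursively subdividing a region yields hierarchical integer index keys whose shared prefixes make containment tests and set operations cheap integer comparisons.

      def quadkey(lon, lat, level):
          """Toy hierarchical index: recursively quarter a lon/lat box, packing
          two bits per level into an integer.  Shared high-order bits mean
          'same ancestor cell', so prefix comparison stands in for containment."""
          west, east, south, north = -180.0, 180.0, -90.0, 90.0
          key = 0
          for _ in range(level):
              mid_lon, mid_lat = (west + east) / 2, (south + north) / 2
              bits = (lon >= mid_lon) << 1 | (lat >= mid_lat)
              key = (key << 2) | bits
              west, east = (mid_lon, east) if lon >= mid_lon else (west, mid_lon)
              south, north = (mid_lat, north) if lat >= mid_lat else (south, mid_lat)
          return key

      a = quadkey(-76.6, 39.3, 8)     # two nearby points...
      b = quadkey(-76.5, 39.2, 8)
      # ...share the same level-5 ancestor cell if their keys agree after truncation.
      print(a >> 6 == b >> 6)         # True: integer prefix test replaces geometry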

  12. Limiting Data Friction by Reducing Data Download Using Spatiotemporally Aligned Data Organization Through STARE

    NASA Astrophysics Data System (ADS)

    Kuo, K. S.; Rilee, M. L.

    2017-12-01

    Current data processing practice limits the volume and variety of relevant geoscience data that can practically be applied to important problems. File archives in centralized data centers are the principal means by which Earth Science data are accessed. This approach, however, requires laborious search, retrieval, and eventual customization/adaptation for the data to be used. Such fractionation makes it even more difficult to share outcomes, i.e. research artifacts and data products, hampering reusability and repeatability, since end users generally have their own research agenda and preferences as well as scarce resources. Thus, while finding and downloading data files from central data centers are already costly for end users working in their own field, using data products from other disciplines rapidly becomes prohibitive. This curtails scientific productivity, limits avenues of study, and endangers quality and reproducibility. The Spatio-Temporal Adaptive Resolution Encoding (STARE) is a unifying scheme that facilitates the indexing, access, and fusion of diverse Earth Science data. STARE implements an innovative encoding of geo-spatiotemporal information, originally developed for aligning datasets with diverse spatiotemporal characteristics in an array database. The spatial component of STARE recursively quadfurcates a root polyhedron, producing a hierarchical scheme for addressing geographic locations and regions. The temporal component of STARE uses conventional date-time units as an indexing hierarchy. The additional encoding of spatial and temporal resolution information in STARE enables comparisons and conditional selections across diverse datasets. Moreover, spatiotemporal set-operations, e.g. union and intersection, are mapped to efficient integer operations with STARE. Applied to existing data models (point, grid, spacecraft swath) and corresponding granules, STARE indexes provide a streamlined description usable as geo-spatiotemporal metadata. When

  13. Data Curation: Improving Environmental Health Data Quality.

    PubMed

    Yang, Lin; Li, Jiao; Hou, Li; Qian, Qing

    2015-01-01

    With the growing recognition of the influence of climate change on human health, scientists are paying increasing attention to analyzing the relationship between meteorological factors and adverse health effects. However, the paucity of high-quality integrated data is one of the great challenges, especially when scientific studies rely on data-intensive computing. This paper aims to design an appropriate curation process to address this problem. We present a data curation workflow that: (i) follows the guidance of the DCC Curation Lifecycle Model; (ii) combines manual curation with automatic curation; and (iii) solves the environmental health data curation problem. The workflow was applied to a medical knowledge service system and showed that it was capable of improving work efficiency and data quality.

  14. Metadata for data rescue and data at risk

    USGS Publications Warehouse

    Anderson, William L.; Faundeen, John L.; Greenberg, Jane; Taylor, Fraser

    2011-01-01

    Scientific data age, become stale, fall into disuse and run tremendous risks of being forgotten and lost. These problems can be addressed by archiving and managing scientific data over time, and establishing practices that facilitate data discovery and reuse. Metadata documentation is integral to this work and essential for measuring and assessing high priority data preservation cases. The International Council for Science: Committee on Data for Science and Technology (CODATA) has a newly appointed Data-at-Risk Task Group (DARTG), participating in the general arena of rescuing data. The DARTG's primary objective is building an inventory of scientific data that are at risk of being lost forever. As part of this effort, the DARTG is testing an approach for documenting endangered datasets. The DARTG is developing a minimal and easy-to-use set of metadata properties for sufficiently describing endangered data, which will aid global data rescue missions. The DARTG metadata framework supports rapid capture and easy documentation across an array of scientific domains. This paper reports on the goals and principles supporting the DARTG metadata schema, and provides a description of the preliminary implementation.

  15. DataSync - sharing data via filesystem

    NASA Astrophysics Data System (ADS)

    Ulbricht, Damian; Klump, Jens

    2014-05-01

    Research work is usually a cycle of hypothesizing, collecting data, corroborating the hypothesis, and finally publishing the results. In this sequence there are opportunities to base one's own work on the work of others. Maybe there are candidate physical samples listed in the IGSN registry and there is no need to go on an excursion to acquire physical samples. Hopefully the DataCite catalogue already lists metadata of datasets that meet the constraints of the hypothesis and that are now open for reappraisal. After all, working with the measured data to corroborate the hypothesis involves new methods, and proven methods, as well as different software tools. A cohort of intermediate data is created that can be shared with colleagues to discuss the research progress and receive a first evaluation. In consequence, the intermediate data should be versioned to make it easy to get back to valid intermediate data when you notice you are on the wrong track. Things are different for project managers. They want to know what is currently being done, what has been done, and what the last valid data is, if somebody has to continue the work. To make the lives of members of small science projects easier we developed DataSync [1] as software for sharing and versioning data. DataSync is designed to synchronize directory trees between different computers of a research team over the internet. The software is developed as a Java application and watches a local directory tree for changes, which are replicated as eSciDoc objects into an eSciDoc infrastructure [2] using the eSciDoc REST API. Modifications to the local filesystem automatically create a new version of an eSciDoc object inside the eSciDoc infrastructure. This way individual folders can be shared between team members while project managers can get a general idea of the current status by synchronizing whole project inventories. Additionally, XML metadata from separate files can be managed together with data files inside the eSciDoc objects. While
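
    DataSync itself is a Java application built on the eSciDoc REST API; as a language-neutral illustration of the underlying "watch a directory tree and version each change" idea, the Python sketch below polls a tree for new or modified files using only the standard library. The push_new_version call is a placeholder for whatever repository API would receive the change, not an eSciDoc call.

      import os, time

      def snapshot(root):
          """Map every file under root to its last-modification time."""
          state = {}
          for dirpath, _, files in os.walk(root):
              for name in files:
                  path = os.path.join(dirpath, name)
                  state[path] = os.path.getmtime(path)
          return state

      def push_new_version(path):
          # Placeholder: here a tool like DataSync would create a new version of the
          # corresponding repository object via a REST call.
          print("new version needed for", path)

      def watch(root, interval=5.0):
          previous = snapshot(root)
          while True:
              time.sleep(interval)
              current = snapshot(root)
              for path, mtime in current.items():
                  if previous.get(path) != mtime:       # new or modified file
                      push_new_version(path)
              previous = current

      # watch("/path/to/shared/project")   # runs until interrupted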

  16. Distributed data discovery, access and visualization services to Improve Data Interoperability across different data holdings

    NASA Astrophysics Data System (ADS)

    Palanisamy, G.; Krassovski, M.; Devarakonda, R.; Santhana Vannan, S.

    2012-12-01

    The current climate debate is highlighting the importance of free, open, and authoritative sources of high quality climate data that are available for peer review and for collaborative purposes. It is increasingly important to allow various organizations around the world to share climate data in an open manner, and to enable them to perform dynamic processing of climate data. This advanced access to data can be enabled via Web-based services, using common "community agreed" standards, without having to change their internal structure used to describe the data. The modern scientific community has become diverse and increasingly complex in nature. To meet the demands of such a diverse user community, the modern data supplier has to provide data and other related information through searchable, data- and process-oriented tools. This can be accomplished by setting up an on-line, Web-based system with a relational database as a back end. The following common features of web data access/search systems will be outlined in the proposed presentation: - Flexible data discovery - Data in commonly used formats (e.g., CSV, NetCDF) - Preparing metadata in standard formats (FGDC, ISO 19115, EML, DIF, etc.) - Data subsetting capabilities and the ability to narrow down to individual data elements - Standards-based data access protocols and mechanisms (SOAP, REST, OPeNDAP, OGC, etc.) - Integration of services across different data systems (discovery to access, visualization and subsetting) This presentation will also include specific examples of integration of various data systems that are developed by Oak Ridge National Laboratory's Climate Change Science Institute, and their ability to communicate with each other to enable better data interoperability and data integration. References: [1] Devarakonda, Ranjeet, and Harold Shanafield. "Drupal: Collaborative framework for science research." Collaboration Technologies and Systems (CTS), 2011 International Conference on. IEEE, 2011. [2
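
    As a minimal, hedged example of the standards-based access and subsetting features listed above, the Python sketch below requests a CSV subset from a hypothetical REST endpoint by passing variable, bounding-box, and time-range parameters. The URL and parameter names are invented, since each data system defines its own service interface.

      import csv, io, requests

      SERVICE = "https://example.org/api/subset"     # hypothetical subsetting endpoint

      resp = requests.get(SERVICE, params={
          "variable": "soil_moisture",               # invented parameter names
          "bbox": "-90,35,-80,40",                   # lon/lat bounding box
          "start": "2011-06-01",
          "end": "2011-06-30",
          "format": "csv",
      })
      resp.raise_for_status()

      # Parse the returned CSV subset into records.
      reader = csv.DictReader(io.StringIO(resp.text))
      rows = list(reader)
      print(f"received {len(rows)} records; columns: {reader.fieldnames}")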

  17. DataFed: A Federated Data System for Visualization and Analysis of Spatio-Temporal Air Quality Data

    NASA Astrophysics Data System (ADS)

    Husar, R. B.; Hoijarvi, K.

    2017-12-01

    DataFed is a distributed web-services-based computing environment for accessing, processing, and visualizing atmospheric data in support of air quality science and management. The flexible, adaptive environment facilitates the access and flow of atmospheric data from provider to users by enabling the creation of user-driven data processing/visualization applications. DataFed 'wrapper' components non-intrusively wrap heterogeneous, distributed datasets for access by standards-based GIS web services. The mediator components (also web services) map the heterogeneous data into a spatio-temporal data model. Chained web services provide homogeneous data views (e.g., geospatial, time views) using a global multi-dimensional data model. In addition to data access and rendering, the data processing component services can be programmed for filtering, aggregation, and fusion of multidimensional data. A complete application is written in a custom-made data flow language. Currently, the federated data pool consists of over 50 datasets originating from globally distributed data providers delivering surface-based air quality measurements, satellite observations, and emissions data as well as regional and global-scale air quality models. The web browser-based user interface allows point-and-click navigation and browsing of the XYZT multi-dimensional data space. The key applications of DataFed are exploring spatial patterns of pollutants; seasonal, weekly, and diurnal cycles; and frequency distributions for exploratory air quality research. Since 2008, DataFed has been used to support EPA in the implementation of the Exceptional Event Rule. The data system is also used at universities in the US, Europe and Asia.

  18. Legacy data center integration into distributed data federations: The World Data Center for Climate (WDCC) experience

    NASA Astrophysics Data System (ADS)

    Kindermann, Stephan; Berger, Katharina; Toussaint, Frank

    2014-05-01

    The integration of well-established legacy data centers into newly developed data federation infrastructures is a key requirement to enhance climate data access based on widely agreed interfaces. We present the approach taken to integrate the ICSU World Data Center for Climate (WDCC), located in Hamburg, Germany, into the European ENES climate data federation, which is part of the international ESGF data federation. The ENES / ESGF data federation hosts petabytes of climate model data and provides scalable data search and access services across the worldwide distributed data centers. Part of the data provided by the ENES / ESGF data federation is also long-term archived and curated at the WDCC data archive, allowing, e.g., for DOI-based data citation. An integration of the WDCC into the ENES / ESGF federation allows end users to search and access WDCC data using consistent interfaces worldwide. We will summarize the integration approach we have taken for WDCC legacy system and ESGF infrastructure integration. On the technical side we describe the provisioning of ESGF-consistent metadata and data interfaces as well as the adoption of the security infrastructure. On the non-technical side we describe our experiences in integrating a long-term archival center, with costly quality assurance procedures, into a distributed data federation, putting emphasis on providing early and consistent data search and access services to scientists. These experiences were gained in the process of curating ESGF-hosted CMIP5 data at the WDCC. Approximately one petabyte of CMIP5 data, which was used for the IPCC climate report, is being replicated and archived at the WDCC.

  19. Nursing Needs Big Data and Big Data Needs Nursing.

    PubMed

    Brennan, Patricia Flatley; Bakken, Suzanne

    2015-09-01

    Contemporary big data initiatives in health care will benefit from greater integration with nursing science and nursing practice; in turn, nursing science and nursing practice have much to gain from the data science initiatives. Big data arises secondary to scholarly inquiry (e.g., -omics) and everyday observations like cardiac flow sensors or Twitter feeds. Emerging data science methods ensure that these data can be leveraged to improve patient care. Big data encompasses data that exceed human comprehension, that exist at a volume unmanageable by standard computer systems, that arrive at a velocity not under the control of the investigator, and that possess a level of imprecision not found in traditional inquiry. Data science methods are emerging to manage and gain insights from big data. The primary methods included investigation of emerging federal big data initiatives, and exploration of exemplars from nursing informatics research to benchmark where nursing is already poised to participate in the big data revolution. We provide observations and reflections on experiences in the emerging big data initiatives. Existing approaches to large data set analysis provide a necessary but not sufficient foundation for nursing to participate in the big data revolution. Nursing's Social Policy Statement guides a principled, ethical perspective on big data and data science. There are implications for basic and advanced practice clinical nurses in practice, for the nurse scientist who collaborates with data scientists, and for the nurse data scientist. Big data and data science have the potential to provide greater richness in understanding patient phenomena and in tailoring interventional strategies that are personalized to the patient. © 2015 Sigma Theta Tau International.

  20. DataONE: Gateway to Earth and Environmental Data Repositories

    NASA Astrophysics Data System (ADS)

    Koskela, R.; Michener, W. K.; Vieglais, D.; Budden, A. E.

    2017-12-01

    DataONE (Data Observation Network for Earth) is a National Science Foundation DataNet project that enables universal access to data and also facilitates researchers in fulfilling their need for data management and in providing secure and permanent access to their data. DataONE offers the scientific community a suite of tools and training materials that cover all aspects of the data life cycle from data collection, to management, analysis and publication. Data repositories affiliated with DataONE are referred to as Member Nodes and represent large regional, national and international research networks, agencies, and other institutions. As part of the DataONE Federation, the repositories gain access to a range of value-added services to support their users. These services include usage tracking and reporting, content replication, and the ability to register the services created by the repository. In addition, DataONE and the California Digital Library manage ONEShare, a repository that accepts content submitted through Dash, a platform allowing researchers to easily describe, deposit and share their research data.

  1. Lunar Data Node: Apollo Data Restoration and Archiving Update

    NASA Technical Reports Server (NTRS)

    Williams, David R.; Hills, Howard K.; Guinness, Edward A.; Taylor, Patrick T.; McBride, Marie Julia

    2013-01-01

    The Lunar Data Node (LDN) of the Planetary Data System (PDS) is responsible for the restoration and archiving of Apollo data. The LDN is located at the National Space Science Data Center (NSSDC), which holds much of the extant Apollo data on microfilm, microfiche, hard-copy documents, and magnetic tapes in older formats. The goal of the restoration effort is to convert the data into user-accessible PDS formats, create a full set of explanatory supporting data (metadata), archive the full data sets through PDS, and post the data online at the PDS Geosciences Node. This will both enable easy use of the data by current researchers and ensure that the data and metadata are securely preserved for future use. We are also attempting to locate and preserve Apollo data which were never archived at NSSDC. We will give a progress report on the data sets we have been restoring and future work.

  2. Data-adaptive test statistics for microarray data.

    PubMed

    Mukherjee, Sach; Roberts, Stephen J; van der Laan, Mark J

    2005-09-01

    An important task in microarray data analysis is the selection of genes that are differentially expressed between different tissue samples, such as healthy and diseased. However, microarray data contain an enormous number of dimensions (genes) and very few samples (arrays), a mismatch which poses fundamental statistical problems for the selection process that have defied easy resolution. In this paper, we present a novel approach to the selection of differentially expressed genes in which test statistics are learned from data using a simple notion of reproducibility in selection results as the learning criterion. Reproducibility, as we define it, can be computed without any knowledge of the 'ground-truth', but takes advantage of certain properties of microarray data to provide an asymptotically valid guide to expected loss under the true data-generating distribution. We are therefore able to indirectly minimize expected loss, and obtain results substantially more robust than conventional methods. We apply our method to simulated and oligonucleotide array data. By request to the corresponding author.
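
    The paper's data-adaptive statistics are learned from a formal reproducibility criterion; the sketch below shows only the simplest version of that intuition, assuming a toy expression matrix: split the arrays into two halves, rank genes by a candidate statistic in each half, and score the statistic by the overlap of its top-k lists. The statistic, labels, and matrix are invented for illustration.

      import numpy as np

      rng = np.random.default_rng(0)
      n_genes, n_arrays = 500, 20
      expr = rng.normal(size=(n_genes, n_arrays))
      labels = np.array([0] * 10 + [1] * 10)        # toy healthy/diseased labels
      expr[:25, labels == 1] += 1.0                 # 25 truly differential genes

      def mean_diff(x, y):
          """Candidate test statistic: absolute difference of group means."""
          return np.abs(x.mean(axis=1) - y.mean(axis=1))

      def reproducibility(stat, k=25):
          """Overlap of top-k gene lists computed on two random halves of the arrays."""
          perm = rng.permutation(n_arrays)
          half1, half2 = perm[: n_arrays // 2], perm[n_arrays // 2:]
          tops = []
          for half in (half1, half2):
              sub, lab = expr[:, half], labels[half]
              s = stat(sub[:, lab == 0], sub[:, lab == 1])
              tops.append(set(np.argsort(s)[-k:]))
          return len(tops[0] & tops[1]) / k

      print("reproducibility of mean-difference statistic:", reproducibility(mean_diff))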

  3. Space data management at the NSSDC (National Space Sciences Data Center): Applications for data compression

    NASA Technical Reports Server (NTRS)

    Green, James L.

    1989-01-01

    The National Space Science Data Center (NSSDC), established in 1966, is the largest archive for processed data from NASA's space and Earth science missions. The NSSDC manages over 120,000 data tapes with over 4,000 data sets. The size of the digital archive is approximately 6,000 gigabytes, with all of this data in its original uncompressed form. By 1995 the NSSDC digital archive is expected to more than quadruple in size, reaching over 28,000 gigabytes. The NSSDC is beginning several thrusts allowing it to better serve the scientific community and keep up with managing the ever increasing volumes of data. These thrusts involve managing larger and larger amounts of information and data online, employing mass storage techniques, and using low rate communications networks to move requested data to remote sites in the United States, Europe and Canada. The success of these thrusts, combined with the tremendous volume of data expected to be archived at the NSSDC, clearly indicates that innovative storage and data management solutions must be sought and implemented. Although not presently used, data compression techniques may be a very important tool for managing a large fraction or all of the NSSDC archive in the future. Some future applications would consist of compressing online data in order to have more data readily available, compressing requested data that must be moved over low rate ground networks, and compressing all the digital data in the NSSDC archive for a cost-effective backup that would be used only in the event of a disaster.

  4. A Constrained and Versioned Data Model for TEAM Data

    NASA Astrophysics Data System (ADS)

    Andelman, S.; Baru, C.; Chandra, S.; Fegraus, E.; Lin, K.

    2009-04-01

    The objective of the Tropical Ecology Assessment and Monitoring Network (www.teamnetwork.org) is "To generate real time data for monitoring long-term trends in tropical biodiversity through a global network of TEAM sites (i.e. field stations in tropical forests), providing an early warning system on the status of biodiversity to effectively guide conservation action". To achieve this, the TEAM Network operates by collecting data via standardized protocols at TEAM Sites. The standardized TEAM protocols include the Climate, Vegetation and Terrestrial Vertebrate Protocols. Some sites also implement additional protocols. There are currently 7 TEAM Sites with plans to grow the network to 15 by June 30, 2009 and 50 TEAM Sites by the end of 2010. At each TEAM Site, data is gathered as defined by the protocols and according to a predefined sampling schedule. The TEAM data is organized and stored in a database based on the TEAM spatio-temporal data model. This data model is at the core of the TEAM Information System - it consumes and executes spatio-temporal queries, and analytical functions that are performed on TEAM data, and defines the object data types, relationships and operations that maintain database integrity. The TEAM data model contains object types including types for observation objects (e.g. bird, butterfly and trees), sampling unit, person, role, protocol, site and the relationship of these object types. Each observation data record is a set of attribute values of an observation object and is always associated with a sampling unit, an observation timestamp or time interval, a versioned protocol and data collectors. The operations on the TEAM data model can be classified as read operations, insert operations and update operations. Following are some typical operations: The operation get(site, protocol, [sampling unit block, sampling unit,] start time, end time) returns all data records using the specified protocol and collected at the specified site, block
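
    To make the object types and the get(...) read operation described above concrete, the following Python sketch models a stripped-down version of such a data model with dataclasses and a simple in-memory query. The field names are reduced to a few of the attributes the abstract mentions, so this is an illustration of the structure rather than the actual TEAM schema.

      from dataclasses import dataclass
      from datetime import datetime
      from typing import List

      @dataclass
      class ObservationRecord:
          site: str
          protocol: str            # e.g. "Terrestrial Vertebrate"; versioned in the real model
          sampling_unit: str
          observed_at: datetime
          taxon: str
          value: float

      RECORDS: List[ObservationRecord] = [
          ObservationRecord("Manaus", "Terrestrial Vertebrate", "CT-01",
                            datetime(2009, 1, 15), "Panthera onca", 1),
          ObservationRecord("Manaus", "Climate", "WS-01",
                            datetime(2009, 1, 15), "air_temperature", 27.4),
      ]

      def get(site, protocol, start, end):
          """Read operation analogous to get(site, protocol, start time, end time)."""
          return [r for r in RECORDS
                  if r.site == site and r.protocol == protocol
                  and start <= r.observed_at <= end]

      print(get("Manaus", "Climate", datetime(2009, 1, 1), datetime(2009, 12, 31)))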

  5. From data point timelines to a well curated data set, data mining of experimental data and chemical structure data from scientific articles, problems and possible solutions.

    PubMed

    Ruusmann, Villu; Maran, Uko

    2013-07-01

    The scientific literature is an important source of experimental and chemical structure data. Very often these data have been harvested into smaller or larger data collections, leaving the data quality and curation issues on the shoulders of users. The current research presents a systematic and reproducible workflow for collecting series of data points from the scientific literature and assembling a database that is suitable for the purposes of high quality modelling and decision support. The quality assurance aspect of the workflow is concerned with the curation of both chemical structures and associated toxicity values at (1) the single-data-point level and (2) the collection-of-data-points level. The assembly of a database employs a novel "timeline" approach. The workflow is implemented as a software solution and its applicability is demonstrated on the example of the Tetrahymena pyriformis acute aquatic toxicity endpoint. A literature collection of 86 primary publications for T. pyriformis was found to contain 2,072 chemical compounds and 2,498 unique toxicity values, which divide into 2,440 numerical and 58 textual values. Every chemical compound was assigned a preferred toxicity value. Examples of the most common chemical and toxicological data curation scenarios are discussed.
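
    A hedged pandas sketch of the "assign each compound a preferred value" step: given several literature-reported toxicity values per structure, keep the numeric ones and take, for example, their median as the preferred value. The column names, example values, and the median rule are illustrative choices, not the selection logic the authors actually used.

      import pandas as pd

      # Hypothetical harvested data points: one row per (publication, compound, value).
      points = pd.DataFrame({
          "compound": ["phenol", "phenol", "phenol", "aniline", "aniline"],
          "source":   ["paper_A", "paper_B", "paper_C", "paper_A", "paper_D"],
          "value":    ["-0.21", "-0.25", "not determined", "0.03", "0.05"],
      })

      # Keep numeric values only, then collapse each compound's timeline to one preferred value.
      points["value_num"] = pd.to_numeric(points["value"], errors="coerce")
      numeric = points.dropna(subset=["value_num"])
      preferred = numeric.groupby("compound")["value_num"].median().rename("preferred_value")
      print(preferred)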

  6. Data Mining Web Services for Science Data Repositories

    NASA Astrophysics Data System (ADS)

    Graves, S.; Ramachandran, R.; Keiser, K.; Maskey, M.; Lynnes, C.; Pham, L.

    2006-12-01

    The maturation of web services standards and technologies sets the stage for a distributed "Service-Oriented Architecture" (SOA) for NASA's next generation science data processing. This architecture will allow members of the scientific community to create and combine persistent distributed data processing services and make them available to other users over the Internet. NASA has initiated a project to create a suite of specialized data mining web services designed specifically for science data. The project leverages the Algorithm Development and Mining (ADaM) toolkit as its basis. The ADaM toolkit is a robust, mature and freely available science data mining toolkit that is being used by several research organizations and educational institutions worldwide. These mining services will give the scientific community a powerful and versatile data mining capability that can be used to create higher order products such as thematic maps from current and future NASA satellite data records with methods that are not currently available. The package of mining and related services is being developed using Web Services standards so that community-based measurement processing systems can access and interoperate with them. These standards-based services allow users different options for utilizing them, from direct remote invocation by a client application to deployment of a Business Process Execution Language (BPEL) solutions package where a complex data mining workflow is exposed to others as a single service. The ability to deploy and operate these services at a data archive allows the data mining algorithms to be run where the data are stored, a more efficient scenario than moving large amounts of data over the network. This will be demonstrated in a scenario in which a user uses a remote Web-Service-enabled clustering algorithm to create cloud masks from satellite imagery at the Goddard Earth Sciences Data and Information Services Center (GES DISC).

  7. Who Owns the Data? Open Data for Healthcare.

    PubMed

    Kostkova, Patty; Brewer, Helen; de Lusignan, Simon; Fottrell, Edward; Goldacre, Ben; Hart, Graham; Koczan, Phil; Knight, Peter; Marsolier, Corinne; McKendry, Rachel A; Ross, Emma; Sasse, Angela; Sullivan, Ralph; Chaytor, Sarah; Stevenson, Olivia; Velho, Raquel; Tooke, John

    2016-01-01

    Research on large shared medical datasets and data-driven research are gaining fast momentum and provide major opportunities for improving health systems as well as individual care. Such open data can shed light on the causes of disease and the effects of treatment, including adverse reactions and side-effects of treatments, while also facilitating analyses tailored to an individual's characteristics, known as personalized or "stratified medicine." Developments, such as crowdsourcing, participatory surveillance, and individuals pledging to become "data donors" and the "quantified self" movement (where citizens share data through mobile device-connected technologies), have great potential to contribute to our knowledge of disease, improving diagnostics, and delivery of healthcare and treatment. There is not only great potential but also major concerns over privacy, confidentiality, and control of data about individuals once it is shared. Issues, such as user trust, data privacy, transparency over the control of data ownership, and the implications of data analytics for personal privacy with potentially intrusive inferences, are becoming increasingly scrutinized at national and international levels. This can be seen in the recent backlash over the proposed implementation of care.data, which enables individuals' NHS data to be linked, retained, and shared for other uses, such as research and, more controversially, with businesses for commercial exploitation. By way of contrast, through the increasing popularity of social media, GPS-enabled mobile apps and tracking/wearable devices, the IT industry and MedTech giants are pursuing new projects without clear public and policy discussion about ownership and responsibility for user-generated data. In the absence of transparent regulation, this paper addresses the opportunities of Big Data in healthcare together with issues of responsibility and accountability. It also aims to pave the way for public policy to support a balanced

  8. Data Quality- and Master Data Management - A Hospital Case.

    PubMed

    Arthofer, Klaus; Girardi, Dominic

    2017-01-01

    Poor data quality prevents the analysis of data for decisions which are critical for business. It also has a negative impact on business processes. Nevertheless the maturity level of data quality- and master data management is still insufficient in many organizations nowadays. This article discusses the corresponding maturity of companies and a management cycle integrating data quality- and master data management in a case dealing with benchmarking in hospitals. In conclusion if data quality and master data are not properly managed, structured data should not be acquired in the first place due to the added expense and complexity.

  9. Automated Data Cleansing in Data Harvesting and Data Migration

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Martin, Mark; Vowell, Lance; King, Ian

    2011-03-16

    In the proposal for this project, we noted how the explosion of digitized information available through corporate databases, data stores and online search systems has resulted in the knowledge worker being bombarded by information. Knowledge workers typically spend more than 20-30% of their time seeking and sorting information, only finding the information 50-60% of the time. This information exists as unstructured, semi-structured and structured data. The problem of information overload is compounded by the production of duplicate or near-duplicate information. In addition, near-duplicate items frequently have different origins, creating a situation in which each item may have unique information of value, but their differences are not significant enough to justify maintaining them as separate entities. Effective tools can be provided to eliminate duplicate and near-duplicate information. The proposed approach was to extract unique information from data sets and consolidate that information into a single comprehensive file.
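
    One common way to detect the near-duplicate documents described above is shingling plus Jaccard similarity; the Python sketch below applies that standard technique to two toy strings. It is an illustration of the general approach, not the consolidation method the project actually implemented.

      def shingles(text, k=5):
          """Set of overlapping k-character substrings of a normalized document."""
          text = " ".join(text.lower().split())
          return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

      def jaccard(a, b):
          """Similarity of two shingle sets; 1.0 means identical, 0.0 means disjoint."""
          return len(a & b) / len(a | b)

      doc1 = "Knowledge workers spend 20-30% of their time seeking information."
      doc2 = "Knowledge workers spend 20 to 30% of their time seeking information."

      sim = jaccard(shingles(doc1), shingles(doc2))
      print(f"similarity = {sim:.2f}")          # near-duplicates score close to 1.0
      if sim > 0.8:
          print("flag as near-duplicate candidates for consolidation")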

  10. Visual data mining for quantized spatial data

    NASA Technical Reports Server (NTRS)

    Braverman, Amy; Kahn, Brian

    2004-01-01

    In previous papers we've shown how a well-known data compression algorithm called Entropy-Constrained Vector Quantization (ECVQ) can be modified to reduce the size and complexity of very large satellite data sets. In this paper, we discuss how to visualize and understand the content of such reduced data sets.
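
    As a rough illustration of the idea (not the authors' implementation), the sketch below quantizes toy 2-D data against a small codebook, assigning each point to the codeword that minimizes distortion plus a rate penalty proportional to the codeword's code length, which is the essential trade-off in entropy-constrained vector quantization. The codebook, data, and Lagrange multiplier are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))          # toy stand-in for satellite measurements
codebook = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
probs = np.full(len(codebook), 1.0 / len(codebook))   # current codeword usage
lam = 0.1                                  # Lagrange multiplier weighting the rate term

for _ in range(10):                        # a few Lloyd-style iterations
    # distortion term: squared distance from every point to every codeword
    dist = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # rate term: code length -log2(p) of each codeword
    rate = -np.log2(probs)
    assign = np.argmin(dist + lam * rate, axis=1)
    # update codewords and their empirical probabilities
    for j in range(len(codebook)):
        members = data[assign == j]
        if len(members):
            codebook[j] = members.mean(axis=0)
    counts = np.bincount(assign, minlength=len(codebook))
    probs = np.clip(counts / len(data), 1e-12, None)

print("codeword usage:", counts)
print("points reduced to", len(codebook), "codewords")
```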

  11. Data flow machine for data driven computing

    DOEpatents

    Davidson, George S.; Grafe, Victor G.

    1995-01-01

    A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
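
    The firing behaviour described above can be paraphrased in software. The sketch below is an illustrative model only (the class, method names, and tiny instruction set are invented here; the real design is hardware): a node stores A and B parameters, an opcode, and a target address, with tag bits marking which parameters have arrived, and it "fires" as soon as both tags are set.

```python
from collections import deque

class DataFlowNode:
    """Toy model of one data-driven processor node (illustrative only)."""
    OPS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}

    def __init__(self, opcode, target):
        self.opcode = opcode                    # opcode memory
        self.target = target                    # target memory: output address
        self.params = {"A": None, "B": None}    # parameter memory
        self.tags = {"A": False, "B": False}    # tag memory: presence bits
        self.output_fifo = deque()              # stands in for the output FIFO register

    def store(self, slot, value):
        """Write a parameter; fire as soon as both parameters are present."""
        self.params[slot] = value
        self.tags[slot] = True
        if all(self.tags.values()):             # the "fire" (R VALID) condition
            result = self.OPS[self.opcode](self.params["A"], self.params["B"])
            self.output_fifo.append((self.target, result))
            self.tags = {"A": False, "B": False}   # consume the parameters

node = DataFlowNode("ADD", target=0x42)
node.store("A", 3)        # nothing fires yet
node.store("B", 4)        # both operands present -> node fires
print(node.output_fifo)   # deque([(66, 7)])
```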

  12. Data Auditor: Analyzing Data Quality Using Pattern Tableaux

    NASA Astrophysics Data System (ADS)

    Srivastava, Divesh

    Monitoring databases maintain configuration and measurement tables about computer systems, such as networks and computing clusters, and serve important business functions, such as troubleshooting customer problems, analyzing equipment failures, planning system upgrades, etc. These databases are prone to many data quality issues: configuration tables may be incorrect due to data entry errors, while measurement tables may be affected by incorrect, missing, duplicate and delayed polls. We describe Data Auditor, a tool for analyzing data quality and exploring data semantics of monitoring databases. Given a user-supplied constraint, such as a boolean predicate expected to be satisfied by every tuple, a functional dependency, or an inclusion dependency, Data Auditor computes "pattern tableaux", which are concise summaries of subsets of the data that satisfy or fail the constraint. We discuss the architecture of Data Auditor, including the supported types of constraints and the tableau generation mechanism. We also show the utility of our approach on an operational network monitoring database.
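
    To make the idea of a pattern tableau concrete, here is a small, hypothetical sketch (not the Data Auditor code): given a table and a boolean predicate expected to hold on every tuple, it summarizes, for each value of a chosen attribute, what fraction of tuples satisfy the predicate and keeps the patterns whose confidence falls below a threshold. The table, attribute, and threshold are invented for the example.

```python
from collections import defaultdict

def pattern_tableau(rows, pattern_attr, predicate, max_confidence=0.9):
    """Return patterns (attribute values) whose tuples often violate the predicate."""
    totals, satisfied = defaultdict(int), defaultdict(int)
    for row in rows:
        key = row[pattern_attr]
        totals[key] += 1
        satisfied[key] += bool(predicate(row))
    # keep concise summaries of subsets that fail the constraint too often
    return {key: satisfied[key] / totals[key]
            for key in totals if satisfied[key] / totals[key] <= max_confidence}

# hypothetical monitoring table; constraint: polled uptime must be recorded
rows = [
    {"region": "east", "uptime": 99.2}, {"region": "east", "uptime": None},
    {"region": "west", "uptime": 98.7}, {"region": "west", "uptime": 99.9},
]
print(pattern_tableau(rows, "region", lambda r: r["uptime"] is not None))
# {'east': 0.5}  -> half of the 'east' tuples violate the constraint
```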

  13. Chemical Data Reporting - Previously Collected Data

    EPA Pesticide Factsheets

    EPA now refers to the Inventory Update Reporting (IUR) rule as the Chemical Data Reporting (CDR) Rule. This change was effective with the publication of the Inventory Update Reporting Modifications; Chemical Data Reporting Final Rule in August 2011.

  14. Comparison of radar data versus rainfall data

    PubMed Central

    Espinosa, B.; Hromadka, T.V.; Perez, R.

    2015-01-01

    Doppler radar data are increasingly used in rainfall-runoff synthesis studies, perhaps due to radar data availability, among other factors. However, the veracity of the radar data is often a topic of concern. In this paper, three Doppler radar outcomes developed by the United States National Weather Service at three radar sites are examined and compared to actual rain gage data for two separate severe storm events in order to assess accuracy in the published radar estimates of rainfall. Because the subject storms were very intense rainfall events lasting approximately one hour, direct comparisons can be made between the three radar outcomes themselves, as well as against data at a rain gage location subjected to the same storm cells. It is shown that topographic interference with the radar outcomes can be a significant factor leading to differences between radar and rain gage readings, and that care is needed in calibrating radar outcomes using available rain gage data in order to interpolate rainfall estimates between rain gages using the spatial variation observed in the radar readings. The paper establishes and describes the need for "ground-truthing" of radar data, and possible errors due to topographic interference. PMID:26649276

  15. Building Scientific Data's list of recommended data repositories

    NASA Astrophysics Data System (ADS)

    Hufton, A. L.; Khodiyar, V.; Hrynaszkiewicz, I.

    2016-12-01

    When Scientific Data launched in 2014 we provided our authors with a list of recommended data repositories to help them identify data hosting options that were likely to meet the journal's requirements. This list has grown in size and scope, and is now a central resource for authors across the Nature-titled journals. It has also been used in the development of data deposition policies and recommended repository lists across Springer Nature and at other publishers. Each new addition to the list is assessed according to a series of criteria that emphasize the stability of the resource, its commitment to principles of open science and its implementation of relevant community standards and reporting guidelines. A preference is expressed for repositories that issue digital object identifiers (DOIs) through the DataCite system and that share data under the Creative Commons CC0 waiver. Scientific Data currently lists fourteen repositories that focus on specific areas within the Earth and environmental sciences, as well as the broad-scope repositories Dryad and figshare. Readers can browse and filter datasets published at the journal by the host repository using ISA-explorer, a demo tool built by the ISA-tools team at Oxford University. We believe that well-maintained lists like this one help publishers build a network of trust with community data repositories and provide an important complement to more comprehensive data repository indices and more formal certification efforts. In parallel, Scientific Data has also improved its policies to better support submissions from authors using institutional and project-specific repositories, without requiring each to apply for listing individually. Online resources: Journal homepage: http://www.nature.com/scientificdata Data repository criteria: http://www.nature.com/sdata/policies/data-policies#repo-criteria Recommended data repositories: http://www.nature.com/sdata/policies/repositories Archived copies of the list: https

  16. Procedure of Partitioning Data Into Number of Data Sets or Data Group - A Review

    NASA Astrophysics Data System (ADS)

    Kim, Tai-Hoon

    The goal of clustering is to decompose a dataset into similar groups based on an objective function. Several well-established clustering algorithms already exist for data clustering. The objective of these data clustering algorithms is to divide the data points of the feature space into a number of groups (or classes) so that a predefined set of criteria is satisfied. The article presents a comparative study of the effectiveness and efficiency of traditional data clustering algorithms. For evaluating the performance of the clustering algorithms, the Minkowski score is used here on different data sets.
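
    For reference, one common pair-counting definition of the Minkowski score compares a computed clustering C against a reference clustering T: with n11 the point pairs grouped together in both, n10 the pairs together only in T, and n01 the pairs together only in C, the score is sqrt((n01 + n10) / (n11 + n10)), where lower is better. The snippet below is a small illustrative implementation under that assumed definition (the article does not give its exact formula), with toy label vectors.

```python
from itertools import combinations
from math import sqrt

def minkowski_score(reference, computed):
    """Pair-counting Minkowski score; 0 means the clusterings agree perfectly."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_cmp = computed[i] == computed[j]
        n11 += same_ref and same_cmp      # together in both clusterings
        n10 += same_ref and not same_cmp  # together only in the reference
        n01 += same_cmp and not same_ref  # together only in the computed result
    return sqrt((n01 + n10) / (n11 + n10))

reference = [0, 0, 0, 1, 1, 1]   # ground-truth cluster labels
computed  = [0, 0, 1, 1, 1, 1]   # labels produced by some algorithm
print(round(minkowski_score(reference, computed), 3))
```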

  17. Japan Data Exchange Network JDXnet and Cloud-type Data Relay Server for Earthquake Observation Data

    NASA Astrophysics Data System (ADS)

    Takano, K.; Urabe, T.; Tsuruoka, H.; Nakagawa, S.

    2015-12-01

    In Japan, high-sensitivity seismic observation and broad-band seismic observation are carried out by several organizations, such as the Japan Meteorological Agency (JMA), the National Research Institute for Earth Science and Disaster Prevention (NIED), nine national universities, and the Japan Agency for Marine-Earth Science and Technology (JAMSTEC). The total number of observation stations is about 1400. The total volume of seismic waveform data collected from all these observation stations is about 1 MByte per second (about 8 to 10 Mbps) using the WIN system (Urabe 1991). JDXnet is the Japan Data eXchange network for earthquake observation data. JDXnet was started in 2007 through the cooperation of researchers at each organization. All the seismic waveform data are available to all organizations in real time. The core of JDXnet is broadcast-type real-time data exchange using the nationwide L2-VPN service offered in JGN-X of NICT and SINET4 of NII. Before the Tohoku earthquake, the nine national universities collected seismic data at their own data centers and then exchanged it with other universities and institutions via JDXnet. In this arrangement, however, if a university's data center stopped, none of that university's data could be used even though some of its observation stations were still alive. Because of this problem, we have prepared a data relay server in the data center of SINET4, i.e. the cloud center. This data relay server collects data directly from the observation stations of the universities and delivers the data to all universities and institutions via JDXnet. By using the relay server in the cloud center, even if some universities are affected by a large disaster, data from the surviving stations are no longer lost. If researchers set up seismometers and send data to the relay server, the data become available to all researchers. This mechanism promotes joint use of the seismometers and joint research activities among researchers nationwide.

  18. Data Redistribution through MY NASA DATA: Striving to bring authentic NASA data into education

    NASA Astrophysics Data System (ADS)

    Lewis, P. M.; Oostra, D.; Oots, P.; Chambers, L. H.; Moore, S.; Crecelius, S.; Taylor, J.

    2012-12-01

    The Mentoring and inquirY using NASA Data on Atmospheric and Earth science for Teachers and Amateurs (MY NASA DATA or MND) project was launched in 2004 to bring authentic data into K-12 education. The MND website features a Live Access Server (LAS), an open source tool which allows users to customize data sets to suit their individual needs, choosing from among 200 global Level 3 data sets. Approximately 120 lesson plans that utilize the available parameters are offered to help teachers and students get started with data exploration. Grade appropriate data documentation is also provided (with continual efforts to improve it to better meet the needs of this target audience). Through inquiry and lesson utilization, educators have several connection points to the data. As classrooms shift to problem-based and inquiry learning, the need for a data visualizer/server increases. Through numerous and growing connections to NASA satellite missions, and with access to data as a built-in feature, MND effectively fills this niche to provide a first level of data re-use that is friendly to the K-12 community. Offering a wide variety of data sets allows MND to support many science topics within the K-12 curriculum while extending the use of scientific data from NASA Earth science satellites. Lessons, created by educators across the country, allow MND to connect with the classroom teacher and to meet their data needs. As technology continues to evolve, a second level of data re-use becomes both interesting and possible. Thus, the MND team is now exploring new web and mobile platforms that can be built and distributed on an accelerated time cycle to keep up with information technology developments. With implementation of these new platforms come challenges in promoting new items to the education community, the public, and other potential users. Included in the list of challenges are: ever-evolving technology, prediction of the market, web/mobile platforms, and time-to-market for

  19. Data embedding

    DOEpatents

    Sandford, II, Maxwell T.; Handel, Theodore G.

    1997-01-01

    A method of embedding auxiliary information into a set of host data, such as a photograph, television signal, facsimile transmission, or identification card. All such host data contain intrinsic noise, allowing pixels in the host data which are nearly identical and which have values differing by less than the noise value to be manipulated and replaced with auxiliary data. As the embedding method does not change the elemental values of the host data, the auxiliary data do not noticeably affect the appearance or interpretation of the host data. By a substantially reverse process, the embedded auxiliary data can be retrieved easily by an authorized user.

  20. Data embedding

    DOEpatents

    Sandford, M.T. II; Handel, T.G.

    1997-08-19

    A method is disclosed for embedding auxiliary information into a set of host data, such as a photograph, television signal, facsimile transmission, or identification card. All such host data contain intrinsic noise, allowing pixels in the host data which are nearly identical and which have values differing by less than the noise value to be manipulated and replaced with auxiliary data. As the embedding method does not change the elemental values of the host data, the auxiliary data do not noticeably affect the appearance or interpretation of the host data. By a substantially reverse process, the embedded auxiliary data can be retrieved easily by an authorized user. 19 figs.
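
    The patented method is described above only at a high level. As a loose illustration of the general principle of hiding auxiliary bits in the noise floor of host data, the sketch below uses a simplified least-significant-bit scheme (not the patented algorithm): bits are written only at positions where neighbouring values differ by less than an assumed noise level, so the change stays within the intrinsic noise. For simplicity the extractor is handed the original host to locate those positions; a real scheme would derive them from the stego data itself.

```python
def embed(host, bits, noise=2):
    """Hide one bit in the LSB of each host value whose preceding neighbour
    is within the noise level (positions judged on the unmodified host)."""
    out, i = list(host), 0
    for k in range(1, len(host)):
        if i < len(bits) and abs(host[k] - host[k - 1]) < noise:
            out[k] = (out[k] & ~1) | bits[i]   # overwrite the least significant bit
            i += 1
    return out

def extract(stego, host, noise=2):
    """Recover the embedded bits by revisiting the same 'noisy' positions."""
    return [stego[k] & 1 for k in range(1, len(host))
            if abs(host[k] - host[k - 1]) < noise]

pixels = [118, 119, 119, 200, 201, 201, 90]   # toy host data
message = [1, 0, 1, 1]
stego = embed(pixels, message)
print(stego, extract(stego, pixels)[:len(message)])
```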

  1. 14 CFR 125.228 - Flight data recorders: filtered data.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 14 Aeronautics and Space 3 2011-01-01 2011-01-01 false Flight data recorders: filtered data. 125... Equipment Requirements § 125.228 Flight data recorders: filtered data. (a) A flight data signal is filtered... original sensor signal value can be reconstructed from the recorded data. This demonstration requires that...

  2. 14 CFR 121.346 - Flight data recorders: filtered data.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 14 Aeronautics and Space 3 2011-01-01 2011-01-01 false Flight data recorders: filtered data. 121... § 121.346 Flight data recorders: filtered data. (a) A flight data signal is filtered when an original... sensor signal value can be reconstructed from the recorded data. This demonstration requires that: (i...

  3. 14 CFR 135.156 - Flight data recorders: filtered data.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 14 Aeronautics and Space 3 2011-01-01 2011-01-01 false Flight data recorders: filtered data. 135... Aircraft and Equipment § 135.156 Flight data recorders: filtered data. (a) A flight data signal is filtered... original sensor signal value can be reconstructed from the recorded data. This demonstration requires that...

  4. USGS Science Data Catalog - Open Data Advances or Declines

    NASA Astrophysics Data System (ADS)

    Frame, M. T.; Hutchison, V.; Zolly, L.; Wheeler, B.; Latysh, N.; Devarakonda, R.; Palanisamy, G.; Shrestha, B.

    2014-12-01

    The recent Office of Science and Technology Policy (OSTP) White House Open Data Policies (2013) have required Federal agencies to establish formal catalogues of their science data holdings and make these data easily available on Web sites, portals, and applications. As an organization, the USGS has historically excelled at making its data holdings freely available on its various Web sites (i.e., National, Scientific Programs, or local Science Center). In response to these requirements, the USGS Core Science Analytics, Synthesis, and Libraries program, in collaboration with DOE's Oak Ridge National Laboratory (ORNL) Mercury Consortium (funded by NASA, USGS, and DOE) and a number of other USGS organizations, established the Science Data Catalog (http://data.usgs.gov) cyberinfrastructure, content management processes/tools, and supporting policies. The USGS Science Data Catalog led the charge at USGS to improve the robustness of existing and future metadata collections; streamline and develop sustainable publishing to external aggregators (i.e., data.gov); and provide leadership to the U.S. Department of the Interior in emerging Open Data policies, techniques, and systems. The session will discuss the current successes, challenges, and movement toward meeting these Open Data policies for USGS scientific data holdings. It will take a retrospective look at the last year of implementation of these efforts within USGS to determine whether the Open Data Policies are improving data access or limiting data availability. To learn more about the USGS Science Data Catalog, visit us at http://data.usgs.gov/info/about.html

  5. Characterization of Emergent Data Networks Among Long-Tail Data

    NASA Astrophysics Data System (ADS)

    Elag, Mostafa; Kumar, Praveen; Hedstrom, Margaret; Myers, James; Plale, Beth; Marini, Luigi; McDonald, Robert

    2014-05-01

    Data curation underpins data-driven scientific advancements. It manages the information flux across multiple users throughout the data life cycle and increases data sustainability and reusability. The exponential growth in data production across the Earth sciences by individuals and small research groups, termed long-tail data, increases the data-knowledge latency among related domains. It has become clear that advanced, framework-agnostic metadata and ontologies for long-tail data are required to increase their visibility to each other and to provide concise and meaningful descriptions that reveal their connectivity. Despite the advances achieved by various sophisticated data management models in different Earth Science disciplines, it is not always straightforward to derive relationships among long-tail data. Semantic data clustering algorithms and pre-defined logic rules oriented toward predicting possible data relationships are one method to address these challenges. Our work advances the connectivity of related long-tail data by introducing the design for an ontology-based knowledge management system. In this work, we present the system architecture and its components, and illustrate how it can be used to scrutinize the connectivity among datasets. To demonstrate the capabilities of this "data network" prototype, we implemented this approach within the Sustainable Environment Actionable Data (SEAD) environment, an open-source semantic content repository that provides an RDF database for long-tail data, and show how emergent relationships among datasets can be identified.

  6. Environmental Data Store: A Web-Based System Providing Management and Exploitation for Multi-Data-Type Environmental Data

    NASA Astrophysics Data System (ADS)

    Ji, P.; Piasecki, M.

    2012-12-01

    With the rapid growth in data volumes, data diversity, and data demands from multi-disciplinary research efforts, data management and exploitation increasingly pose significant challenges for the environmental science community. We describe the Environmental Data Store (EDS), a web-based, open-source system we are developing to manage and exploit multi-data-type environmental data. EDS provides repository services for six fundamental data types, which meet the demands of multi-disciplinary environmental research. These data types are: a) Time Series Data, b) GeoSpatial Data, c) Digital Data, d) Ex-Situ Sampling Data, e) Modeling Data, and f) Raster Data. Through its data portal, EDS allows these six types of data to be consumed efficiently from a data pool made up of different data nodes corresponding to the different data types, including iRODS, ODM, THREDDS, ESSDB, GeoServer, etc. The EDS data portal offers a unified submission interface for the above data types; provides fully integrated, scalable search across content from the above data systems; and also features mapping, analysis, exporting, and visualization through integration with other software. EDS builds on a number of existing systems, follows widely used data standards, and emphasizes thematic, semantic, and syntactic support for submission and search, in order to advance multi-disciplinary environmental research. This system will be installed and developed as part of the CrossRoads initiative at the City College of New York.

  7. ["Big data" - large data, a lot of knowledge?].

    PubMed

    Hothorn, Torsten

    2015-01-28

    For several years now, the term Big Data has described technologies for extracting knowledge from data. Applications of Big Data and their consequences are also increasingly discussed in the mass media. Because medicine is an empirical science, we discuss the meaning of Big Data and its potential for future medical research.

  8. Data reuse and the open data citation advantage

    PubMed Central

    Vision, Todd J.

    2013-01-01

    Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2
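
    As a schematic of the kind of multivariate model described (not the authors' actual specification), a count model of citations with a data-availability indicator and covariates might be set up as below; the data are synthetic and the column names are hypothetical, with a roughly 10% effect built in so the recovered coefficient has something to estimate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
papers = pd.DataFrame({
    "data_available": rng.integers(0, 2, n),     # 1 if the dataset was deposited
    "impact_factor": rng.gamma(2.0, 2.0, n),     # stand-ins for known citation predictors
    "n_authors": rng.integers(1, 12, n),
})
# synthetic citation counts with a built-in ~10% boost for data availability
rate = np.exp(1.0 + 0.10 * papers.data_available
              + 0.15 * papers.impact_factor + 0.05 * papers.n_authors)
papers["citations"] = rng.poisson(rate)

model = smf.poisson("citations ~ data_available + impact_factor + n_authors",
                    data=papers).fit(disp=False)
boost = np.exp(model.params["data_available"]) - 1   # multiplicative effect -> percent
print(f"estimated citation benefit of data availability: {boost:.1%}")
```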

  9. Data, Data Everywhere--Not a Report in Sight!

    ERIC Educational Resources Information Center

    Norman, Wendy

    2003-01-01

    Presents six steps of data warehouse development that result in valuable, long-term reporting solutions, discussing how to choose the right reporting vehicle. The six steps are: defining one's needs; mapping the source for each element; extracting the data; cleaning and verifying the data; moving the data into a relational database; and developing…

  10. Spatial data standards meet meteorological data - pushing the boundaries

    NASA Astrophysics Data System (ADS)

    Wagemann, Julia; Siemen, Stephan; Lamy-Thepaut, Sylvie

    2017-04-01

    The data archive of the European Centre for Medium-Range Weather Forecasts (ECMWF) holds around 120 PB of data and is the world's largest archive of meteorological data. This information is of great value for many Earth Science disciplines, but the complexity of the data (up to five dimensions and different time axis domains) and its native data format GRIB, while being an efficient archive format, limit the overall data uptake, especially by users outside the MetOcean domain. ECMWF's MARS WebAPI is a very efficient and flexible system for expert users to access and retrieve meteorological data, though it is challenging for users outside the MetOcean domain. With the help of web-based standards for data access and processing, ECMWF wants to make more than 1 PB of meteorological and climate data more easily accessible to users across different Earth Science disciplines. As climate data provider for the H2020 project EarthServer-2, ECMWF explores the feasibility of giving on-demand access to its MARS archive via the OGC standard interface Web Coverage Service (WCS). Despite the potential a WCS for climate and meteorological data offers, the standards-based modelling of meteorological and climate data entails many challenges and reveals the boundaries of the current Web Coverage Service 2.0 standard. Challenges range from valid semantic data models for meteorological data to optimal and efficient data structures for a scalable web service. The presentation reviews the applicability of the current Web Coverage Service 2.0 standard to meteorological and climate data and discusses challenges that must be overcome in order to achieve real interoperability and to ensure the conformant sharing and exchange of meteorological data.
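
    For orientation, a WCS 2.0 GetCoverage request in key-value-pair form looks roughly like the following. The endpoint, coverage identifier, and axis names are placeholders invented for this sketch, not ECMWF's actual service; the subset parameters trim the coverage spatially and to a single time slice before download.

```python
import requests

endpoint = "https://example.org/wcs"   # hypothetical WCS 2.0 endpoint
params = {
    "service": "WCS",
    "version": "2.0.1",
    "request": "GetCoverage",
    "coverageId": "temperature_2m",                 # hypothetical coverage identifier
    "subset": ["Lat(40,60)", "Long(-10,30)",        # spatial window
               'ansi("2015-01-01T00:00:00Z")'],     # one time slice
    "format": "application/netcdf",
}
response = requests.get(endpoint, params=params, timeout=60)
response.raise_for_status()
with open("subset.nc", "wb") as f:                  # save the trimmed coverage locally
    f.write(response.content)
```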

  11. Data flow machine for data driven computing

    DOEpatents

    Davidson, G.S.; Grafe, V.G.

    1988-07-22

    A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.

  12. Biomass Data | Geospatial Data Science | NREL

    Science.gov Websites

    Biomass Data. These datasets detail the biomass resources available in the United States. Download table (Coverage / File / Last Updated / Metadata): Biomethane: Zip, 72.2 MB, 10/30/2014, Biomethane.xml; Solid Biomass: Zip, 69.5

  13. Spectroscopic Data for an Astronomy Data Base

    NASA Technical Reports Server (NTRS)

    Parkinson, W. H.; Smith, Peter L.

    1997-01-01

    When we began this work, very few of the atomic and molecular data used by astronomers in the analysis of astronomical spectra were available in on-line searchable databases. Our principal goals were to make the most useful of the atomic data files of R.L. Kurucz (1995a,b) available on the WWW, and to make the atomic data of R.L. Kelly for ultraviolet lines (i.e., essentially the same as the data in Kelly (1979) and Kelly (1987)) similarly available. In addition, we proposed to improve access to parameters for simple molecules of interest to astronomers.

  14. Ancillary Data Services of NASA's Planetary Data System

    NASA Technical Reports Server (NTRS)

    Acton, C.

    1994-01-01

    JPL's Navigation and Ancillary Information Facility (NAIF) has primary responsibility for design and implementation of the SPICE ancillary information system, supporting a wide range of space science mission design, observation planning and data analysis functions/activities. NAIF also serves as the geometry and ancillary data node of the Planetary Data System (PDS). As part of the PDS, NAIF archives SPICE and other ancillary data produced by flight projects. NAIF then distributes these data, and associated data access software and high-level tools, to researchers funded by NASA's Office of Space Science. Support for a broader user community is also offered to the extent resources permit. This paper describes the SPICE system and customer support offered by NAIF.
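
    As an illustration of how researchers typically consume SPICE data with the NAIF-distributed access software, the snippet below uses the community spiceypy wrapper to load a meta-kernel and compute an Earth-to-Mars position vector. The kernel file path and epoch are placeholders chosen for the example.

```python
import spiceypy as spice

# Load a meta-kernel listing the SPICE kernels for the analysis (path is hypothetical)
spice.furnsh("./kernels/analysis.tm")

et = spice.str2et("2004 JUN 11 12:00:00")             # UTC string -> ephemeris time
pos, light_time = spice.spkpos("MARS", et, "J2000",   # target, epoch, reference frame
                               "LT+S", "EARTH")       # aberration correction, observer
print("Earth->Mars vector (km):", pos)
print("one-way light time (s):", light_time)

spice.kclear()   # unload kernels when done
```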

  15. Who Owns the Data? Open Data for Healthcare

    PubMed Central

    Kostkova, Patty; Brewer, Helen; de Lusignan, Simon; Fottrell, Edward; Goldacre, Ben; Hart, Graham; Koczan, Phil; Knight, Peter; Marsolier, Corinne; McKendry, Rachel A.; Ross, Emma; Sasse, Angela; Sullivan, Ralph; Chaytor, Sarah; Stevenson, Olivia; Velho, Raquel; Tooke, John

    2016-01-01

    Research on large shared medical datasets and data-driven research are gaining momentum fast and provide major opportunities for improving health systems as well as individual care. Such open data can shed light on the causes of disease and the effects of treatment, including adverse reactions and side-effects of treatments, while also facilitating analyses tailored to an individual’s characteristics, known as personalized or “stratified medicine.” Developments such as crowdsourcing, participatory surveillance, individuals pledging to become “data donors,” and the “quantified self” movement (where citizens share data through mobile device-connected technologies) have great potential to contribute to our knowledge of disease and to improve diagnostics and the delivery of healthcare and treatment. There is not only great potential but also major concern over privacy, confidentiality, and control of data about individuals once they are shared. Issues such as user trust, data privacy, transparency over the control of data ownership, and the implications of data analytics for personal privacy, with potentially intrusive inferences, are coming under increasing scrutiny at national and international levels. This can be seen in the recent backlash over the proposed implementation of care.data, which enables individuals’ NHS data to be linked, retained, and shared for other uses, such as research and, more controversially, with businesses for commercial exploitation. By way of contrast, through the increasing popularity of social media, GPS-enabled mobile apps, and tracking/wearable devices, the IT industry and MedTech giants are pursuing new projects without clear public and policy discussion about ownership of and responsibility for user-generated data. In the absence of transparent regulation, this paper addresses the opportunities of Big Data in healthcare together with issues of responsibility and accountability. It also aims to pave the way for public policy to

  16. DATA MANAGEMENT SYSTEM FOR MOBILE SATELLITE PROPAGATION DATA

    NASA Technical Reports Server (NTRS)

    Kantak, A. V.

    1994-01-01

    The "Data Management System for Mobile Satellite Propogation" package is a collection of FORTRAN programs and UNIX shell scripts designed to handle the huge amounts of data resulting from Mobile Satellite propogation experiments. These experiments are designed to assist in defining channels for mobile satellite systems. By understanding multipath fading characteristics of the channel, doppler effects, and blockage due to manmade objects as well as natural surroundings, characterization of the channel can be realized. Propogation experiments, then, are performed using a prototype of the system simulating the ultimate product environment. After the data from these experiments is generated, the researcher must access this data with a minimum of effort and to derive some standard results. The programs included in this package manipulate the data files generated by the NASA/JPL Mobile Satellite propogation experiment on an interactive basis. In the experiment, a transmitter operating at 869 MHz was carried to an altitude of 32Km by a stratospheric balloon. A vehicle within the line-of-sight of the transmitter was then driven around, splitting the incoming signal into I and Q channels, and sampling the resulting signal strength at 1000 samples per second. The data was collected at various antenna elavation angles and different times of day generating the ancillary data for the experiment. This package contains a program to convert the binary format of the data generated into standard ASCII format suitable for use with a wide variety of machine architectures. Also included is a UNIX shell-script designed to parse this ASCII file into those records of data that match the researcher's desired values for the ancillary data parameters. In addition, four FORTRAN programs are included to obtain standard quantities from the data. Quantities such as probability of signal level greater than or equal to a specified signal level, probability density of the signal levels, frequency

  17. Maximizing data holdings and data documentation with a hierarchical system for sample-based geochemical data

    NASA Astrophysics Data System (ADS)

    Hsu, L.; Lehnert, K. A.; Walker, J. D.; Chan, C.; Ash, J.; Johansson, A. K.; Rivera, T. A.

    2011-12-01

    Sample-based measurements in geochemistry are highly diverse, due to the large variety of sample types, measured properties, and idiosyncratic analytical procedures. In order to ensure the utility of sample-based data for re-use in research or education, they must be associated with a high quality and quantity of descriptive, discipline-specific metadata. Without an adequate level of documentation, it is not possible to reproduce scientific results or have confidence in using the data for new research inquiries. The required detail in data documentation makes it challenging to aggregate large sets of data from different investigators and disciplines. One solution to this challenge is to build data systems with several tiers of intricacy, where the less detailed tiers are geared toward discovery and interoperability, and the more detailed tiers have higher value for data analysis. The Geoinformatics for Geochemistry (GfG) group, which is part of the Integrated Earth Data Applications facility (http://www.iedadata.org), has taken this approach to provide services for the discovery, access, and analysis of sample-based geochemical data for a diverse user community, ranging from the highly informed geochemist to non-domain scientists and undergraduate students. GfG builds and maintains three tiers in its sample-based data systems, from a simple data catalog (Geochemical Resource Library), to a substantially richer data model for the EarthChem Portal (EarthChem XML), and finally to detailed discipline-specific data models for petrologic (PetDB), sedimentary (SedDB), hydrothermal spring (VentDB), and geochronological (GeoChron) samples. The data catalog, the lowest level in the hierarchy, contains the sample data values plus metadata only about the dataset itself (Dublin Core metadata such as dataset title and author), and therefore can accommodate the widest diversity of data holdings. The second level includes measured data values from the sample, basic information
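
    The tiering can be pictured with two hypothetical records for the same sample: a catalog-level entry carrying only Dublin Core-style dataset metadata, and a discipline-level entry adding the sample-specific fields a petrologist would need. The field names below are illustrative only, not the actual EarthChem or PetDB schemas.

```python
# Tier 1: catalog entry -- dataset-level (Dublin Core-style) metadata only
catalog_record = {
    "dc:title": "Basalt glass major-element analyses, East Pacific Rise",
    "dc:creator": "A. Researcher",
    "dc:date": "2011",
}

# Tier 3: discipline-specific entry -- adds the detail needed for re-analysis
petdb_record = {
    **catalog_record,
    "sample_id": "EPR-001",          # hypothetical sample identifier
    "latitude": -9.83, "longitude": -104.29,
    "material": "volcanic glass",
    "method": "electron microprobe",
    "measurements": {"SiO2_wt_pct": 50.4, "MgO_wt_pct": 7.8},
}

# fields present only at the discipline-specific tier
print(sorted(set(petdb_record) - set(catalog_record)))
```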

  18. Freight data architecture business process, logical data model, and physical data model.

    DOT National Transportation Integrated Search

    2014-09-01

    This document summarizes the study team's efforts to establish data-sharing partnerships and relay the lessons learned. In addition, it provides information on a prototype freight data architecture and supporting description and specifications ...

  19. Examining Data Repository Guidelines for Qualitative Data Sharing.

    PubMed

    Antes, Alison L; Walsh, Heidi A; Strait, Michelle; Hudson-Vitale, Cynthia R; DuBois, James M

    2018-02-01

    Qualitative data provide rich information on research questions in diverse fields. Recent calls for increased transparency and openness in research emphasize data sharing. However, qualitative data sharing has yet to become the norm internationally and is particularly uncommon in the United States. Guidance for archiving and secondary use of qualitative data is required for progress in this regard. In this study, we review the benefits and concerns associated with qualitative data sharing and then describe the results of a content analysis of guidelines from international repositories that archive qualitative data. A minority of repositories provide qualitative data sharing guidelines. Of the guidelines available, there is substantial variation in whether specific topics are addressed. Some topics, such as removing direct identifiers, are consistently addressed, while others, such as providing an anonymization log, are not. We discuss the implications of our study for education, best practices, and future research.

  20. Balancing data sharing requirements for analyses with data sensitivity

    USGS Publications Warehouse

    Jarnevich, C.S.; Graham, J.J.; Newman, G.J.; Crall, A.W.; Stohlgren, T.J.

    2007-01-01

    Data sensitivity can pose a formidable barrier to data sharing. Knowledge of species current distributions from data sharing is critical for the creation of watch lists and an early warning/rapid response system and for model generation for the spread of invasive species. We have created an on-line system to synthesize disparate datasets of non-native species locations that includes a mechanism to account for data sensitivity. Data contributors are able to mark their data as sensitive. This data is then 'fuzzed' in mapping applications and downloaded files to quarter-quadrangle grid cells, but the actual locations are available for analyses. We propose that this system overcomes the hurdles to data sharing posed by sensitive data. ?? 2006 Springer Science+Business Media B.V.
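
    A minimal sketch of the "fuzzing" idea: sensitive coordinates are snapped to the corner of the grid cell that contains them before mapping or download, while the precise values are retained internally for analyses. Treating a quarter-quadrangle as a 3.75 arc-minute cell is an assumption about the grid size, and the function name and sample coordinates are invented for the example.

```python
import math

CELL_DEG = 3.75 / 60.0   # assumed quarter-quadrangle cell size in degrees

def fuzz(lat, lon, cell=CELL_DEG):
    """Snap a sensitive location to the lower-left corner of its grid cell."""
    return (math.floor(lat / cell) * cell, math.floor(lon / cell) * cell)

precise = (40.4236, -104.7091)   # actual location, kept internally for analyses
public = fuzz(*precise)          # what mapping applications and downloads show
print(precise, "->", tuple(round(v, 4) for v in public))
```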

  1. Driving Ms. Data: Creating Data-Driven Possibilities

    ERIC Educational Resources Information Center

    Hoffman, Richard

    2005-01-01

    This article describes how data-driven Web sites help schools and districts maximize their IT resources by making online content more "self-service" for users. It shows how to set up the capacity to create data-driven sites. By definition, a data-driven Web site is one in which the content comes from some back-end data source, such as a…

  2. Social Water Science Data: Dimensions, Data Management, and Visualization

    NASA Astrophysics Data System (ADS)

    Jones, A. S.; Horsburgh, J. S.; Flint, C.; Jackson-Smith, D.

    2016-12-01

    Water systems are increasingly conceptualized as coupled human-natural systems, with growing emphasis on representing the human element in hydrology. However, social science data and associated considerations may be unfamiliar and intimidating to many hydrologic researchers. Monitoring social aspects of water systems involves expanding the range of data types typically used in hydrology and appreciating nuances in datasets that are well known to social scientists, but less understood by hydrologists. We define social water science data as any information representing the human aspects of a water system. We present a scheme for classifying these data, highlight an array of data types, and illustrate data management considerations and challenges unique to social science data. This classification scheme was applied to datasets generated as part of iUTAH (innovative Urban Transitions and Arid region Hydro-sustainability), an interdisciplinary water research project based in Utah, USA that seeks to integrate and share social and biophysical water science data. As the project deployed cyberinfrastructure for baseline biophysical data, cyberinfrastructure for analogous social science data was necessary. As a particular case of social water science data, we focus in this presentation on social science survey data. These data are often interpreted through the lens of the original researcher and are typically presented to interested parties in static figures or reports. To provide more exploratory and dynamic communication of these data beyond the individual or team who collected the data, we developed a web-based, interactive viewer to visualize social science survey responses. This interface is applicable for examining survey results that show human motivations and actions related to environmental systems and as a useful tool for participatory decision-making. It also serves as an example of how new data sharing and visualization tools can be developed once the

  3. Curated eutherian third party data gene data sets.

    PubMed

    Premzl, Marko

    2016-03-01

    The freely available eutherian genomic sequence data sets have advanced the scientific field of genomics. Of note, future revisions of gene data sets were expected, due to the incompleteness of public eutherian genomic sequence assemblies and potential genomic sequence errors. The eutherian comparative genomic analysis protocol was proposed as guidance in protecting against potential genomic sequence errors in public eutherian genomic sequences. The protocol was applicable in updates of 7 major eutherian gene data sets, including 812 complete coding sequences deposited in the European Nucleotide Archive as curated third party data gene data sets.

  4. Data Stewardship throughout the Ocean Research Data Life Cycle

    NASA Astrophysics Data System (ADS)

    Chandler, Cynthia; Groman, Robert; Allison, Molly; Wiebe, Peter; Glover, David

    2013-04-01

    The Biological and Chemical Oceanography Data Management Office (BCO-DMO) works in partnership with ocean science investigators to publish data from research projects funded by the Biological and Chemical Oceanography Sections and the Office of Polar Programs Antarctic Organisms & Ecosystems Program (OPP ANT) at the U.S. National Science Foundation. Since 2006, researchers have been contributing data to the BCO-DMO data system, and it has developed into a rich repository of data from ocean, coastal and Great Lakes research programs. The end goals of the BCO-DMO are to ensure preservation of NSF funded project data and to provide open access to those data; achievement of those goals is attained through successful completion of a series of related phases. BCO-DMO has developed an end-to-end data stewardship process that includes all phases of the data life cycle: (1) providing data management advice to investigators during the proposal writing stage; (2) registering their funded project at BCO-DMO; (3) adding data and supporting documentation to the BCO-DMO data repository; (4) providing geospatial and text-based data access systems that support data discovery, access, display, assessment, integration, and export of data resources; (5) exploring mechanisms for exchange of data with complementary repositories; (6) publication of data sets to provide publishers of the peer-reviewed literature with citable references (Digital Object Identifiers) and to encourage proper citation and attribution of data sets in the future and (7) submission of final data sets for preservation in the appropriate long-term data archive. Strategic development of collaborative partnerships with complementary data management organizations is essential to sustainable coverage of the full data life cycle from research proposal through preservation of the final data products. Development and incorporation of controlled vocabularies, domain-specific ontologies and globally unique, persistent

  5. Linked data and provenance in biological data webs.

    PubMed

    Zhao, Jun; Miles, Alistair; Klyne, Graham; Shotton, David

    2009-03-01

    The Web is now being used as a platform for publishing and linking life science data. The Web's linking architecture can be exploited to join heterogeneous data from multiple sources. However, as data are frequently being updated in a decentralized environment, provenance information becomes critical to providing reliable and trustworthy services to scientists. This article presents design patterns for representing and querying provenance information relating to mapping links between heterogeneous data from sources in the domain of functional genomics. We illustrate the use of named resource description framework (RDF) graphs at different levels of granularity to make provenance assertions about linked data, and demonstrate that these assertions are sufficient to support requirements including data currency, integrity, evidential support and historical queries.
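
    A small rdflib sketch of the pattern described: a mapping link between records in two data sources is asserted inside a named graph, and provenance statements about that graph (who created it, when) are recorded separately so they can be queried alongside the link. All URIs and identifiers below are invented for the example, not the article's actual vocabulary.

```python
from rdflib import Dataset, URIRef, Literal, Namespace
from rdflib.namespace import DCTERMS, XSD

EX = Namespace("http://example.org/")
ds = Dataset()

# Named graph holding one mapping link between records in two data sources
link_graph = ds.graph(URIRef("http://example.org/graphs/mapping-001"))
link_graph.add((EX["flybase/FBgn0000490"], EX.mapsTo, EX["uniprot/P08151"]))

# Provenance assertions about the named graph itself, kept in the default graph
ds.add((link_graph.identifier, DCTERMS.creator, Literal("curation-pipeline-v2")))
ds.add((link_graph.identifier, DCTERMS.created,
        Literal("2009-01-15", datatype=XSD.date)))

# TriG keeps the quad structure, so historical/currency queries can filter by graph
print(ds.serialize(format="trig"))
```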

  6. Big Data, Small Data: Accessing and Manipulating Geoscience Data Ranging From Repositories to Student-Collected Data Sets Using GeoMapApp

    NASA Astrophysics Data System (ADS)

    Goodwillie, A. M.

    2015-12-01

    We often demand information and data to be accessible over the web at no cost, and no longer do we expect to spend time laboriously compiling data from myriad sources with frustratingly different formats. Instead, we increasingly expect convenience and consolidation. Recent advances in web-enabled technologies and cyberinfrastructure are answering those calls by providing data tools and resources that can transform undergraduate education. By freeing up valuable classroom time, students can focus upon gaining deeper insights and understanding from real-world data. GeoMapApp (http://www.geomapapp.org) is a map-based data discovery and visualisation tool developed at Lamont-Doherty Earth Observatory. GeoMapApp promotes U-Learning by working across all major computer platforms and functioning anywhere with internet connectivity, by lowering socio-economic barriers (it is free), by seamlessly integrating thousands of built-in research-grade data sets under intuitive menus, and by being adaptable to a range of learning environments - from lab sessions, group projects, and homework assignments to in-class pop-ups. GeoMapApp caters to casual and specialist users alike. Contours, artificial illumination, 3-D displays, data point manipulations, cross-sectional profiles, and other display techniques help students better grasp the content and geospatial context of data. Layering capabilities allow easy data set comparisons. The core functionality also applies to imported data sets: Student-collected data can thus be imported and analysed using the same techniques. A new Save Session function allows educators to preserve a pre-loaded state of GeoMapApp. When shared with a class, the saved file allows every student to open GeoMapApp at exactly the same starting point from which to begin their data explorations. Examples of built-in data sets include seafloor crustal age, earthquake locations and focal mechanisms, analytical geochemistry, ocean water physical properties, US and

  7. PANGAEA® - Data Publisher for Earth & Environmental Science - Research data enters scholarly communication and big data analysis

    NASA Astrophysics Data System (ADS)

    Diepenbroek, Michael; Schindler, Uwe; Riedel, Morris; Huber, Robert

    2014-05-01

    The ICSU World Data Center PANGAEA is an information system for the acquisition, processing, long-term storage, and publication of geo-referenced data related to earth science fields. Storing more than 350,000 data sets from all fields of geosciences, it is among the largest archives for observational earth science data. Standard-conformant interfaces (ISO, OGC, W3C, OAI) enable access from a variety of data and information portals, among them the search engine of PANGAEA itself (www.pangaea.de) and e.g. GBIF. All data sets in PANGAEA are citable, fully documented, and can be referenced via persistent identifiers (Digital Object Identifiers - DOIs) - a premise for data publication. Together with other ICSU World Data Centers (www.icsu-wds.org) and the Technical Information Library in Germany (TIB), PANGAEA had a share in the implementation of a DOI-based registry for scientific data, which by now is supported by a worldwide consortium of libraries (www.datacite.org). A further milestone was building up strong co-operations with science publishers such as Elsevier, Springer, Wiley, AGU, Nature, and others. A common web service allows supplementary data in PANGAEA to be referenced directly from an article's abstract page (e.g. Science Direct). The next step with science publishers is to further integrate the editorial process for the publication of supplementary data with the publication procedures on the journal side. Data-centric research efforts such as environmental modelling or big-data analysis approaches represent new challenges for PANGAEA. Integrated data warehouse technologies are used for highly efficient retrievals and compilations of time slices or surface data matrices of any measurement parameters out of the whole data continuum. Further, new and emerging big data approaches are currently being investigated within PANGAEA, e.g. to evaluate their usability for quality control or data clustering. PANGAEA is operated as a joint long-term facility by MARUM at the University of Bremen

  8. 17 CFR 45.3 - Swap data reporting: creation data.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 17 Commodity and Securities Exchanges 2 2014-04-01 2014-04-01 false Swap data reporting: creation... (CONTINUED) SWAP DATA RECORDKEEPING AND REPORTING REQUIREMENTS § 45.3 Swap data reporting: creation data. Registered entities and swap counterparties must report required swap creation data electronically to a swap...

  9. Smoothing Data Friction through building Service Oriented Data Platforms

    NASA Astrophysics Data System (ADS)

    Wyborn, L. A.; Richards, C. J.; Evans, B. J. K.; Wang, J.; Druken, K. A.

    2017-12-01

    Data Friction has been commonly defined as the costs in time, energy and attention required to simply collect, check, store, move, receive, and access data. On average, researchers spend a significant fraction of their time finding the data for their research project and then reformatting it so that it can be used by the software application of their choice. There is an increasing role for both data repositories and software to be modernised to help reduce data friction in ways that support the better use of the data. Many generic data repositories simply accept data in the format as supplied: the key check is that the data have sufficient metadata to enable discovery and download. Few generic repositories have both the expertise and infrastructure to support the multiple domain specific requirements that facilitate the increasing need for integration and reusability. In contrast, major science domain-focused repositories are increasingly able to implement and enforce community endorsed best practices and guidelines that ensure reusability and harmonization of data for use within the community by offering semi-automated QC workflows to improve quality of submitted data. The most advanced of these science repositories now operate as service-oriented data platforms that extend the use of data across domain silos and increasingly provide server-side programmatically-enabled access to data via network protocols and community standard APIs. To provide this, more rigorous QA/QC procedures are needed to validate data against standards and community software and tools. This ensures that the data can be accessed in expected ways and also demonstrates that the data works across different (non-domain specific) packages, tools and programming languages deployed by the various user communities. In Australia, the National Computational Infrastructure (NCI) has created such a service-oriented data platform which is demonstrating how this approach can reduce data friction
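
    Server-side, programmatic access of the kind described usually means a user never downloads whole files before analysis. As a sketch only, with an OPeNDAP-style endpoint whose URL is a placeholder rather than an actual NCI service, a subset can be pulled straight into an analysis session:

```python
import xarray as xr

# Hypothetical OPeNDAP endpoint exposed by a service-oriented data platform
url = "https://dap.example.org/thredds/dodsC/climate/tasmax_day.nc"

ds = xr.open_dataset(url)                        # lazy: only metadata is read here
subset = ds["tasmax"].sel(lat=slice(-44, -10),   # trim to an Australian window
                          lon=slice(112, 154)).isel(time=0)
print(subset.mean().values)                      # data are fetched only for this slice
```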

  10. Data System Architectures: Recent Experiences from Data Intensive Projects

    NASA Astrophysics Data System (ADS)

    Palanisamy, G.; Frame, M. T.; Boden, T.; Devarakonda, R.; Zolly, L.; Hutchison, V.; Latysh, N.; Krassovski, M.; Killeffer, T.; Hook, L.

    2014-12-01

    U.S. Federal agencies are frequently trying to address new data-intensive projects that require a next generation of data system architectures. This presentation will focus on two such new architectures: USGS's Science Data Catalog (SDC) and DOE's Next Generation Ecological Experiments - Arctic data system. The U.S. Geological Survey (USGS) developed the Science Data Catalog (data.usgs.gov) to include records describing datasets, data collections, and observational or remotely sensed data. The system was built using a service-oriented architecture and allows USGS scientists and data providers to create and register their data using either a standards-based metadata creation form or simply by registering their already-created metadata records with the USGS SDC Dashboard. This dashboard then compiles the harvested metadata records and sends them to the post-processing and indexing service in JSON format. The post-processing service, with the help of various ontologies and other geospatial validation services, auto-enhances these harvested metadata records and creates a Lucene index using the Solr enterprise search platform. Ultimately, metadata are made available via the SDC search interface. DOE's Next Generation Ecological Experiments (NGEE) Arctic project deployed a data system that allows scientists to prepare, publish, archive, and distribute data from field collections, lab experiments, sensors, and simulated model outputs. This architecture includes a metadata registration form, a data uploading and sharing tool, a Digital Object Identifier (DOI) tool, a Drupal-based content management tool (http://ngee-arctic.ornl.gov), and a data search and access tool based on ORNL's Mercury software (http://mercury.ornl.gov). The team also developed web-metric tools and a data ingest service to visualize geospatial and temporal observations.
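
    The harvest-then-index flow can be pictured with a small sketch: a metadata record, already post-processed into JSON, is pushed to a Solr core over HTTP and committed so it becomes searchable. The host, core name, and field names below are placeholders, not the actual SDC configuration.

```python
import requests

# Hypothetical Solr core used by a metadata catalog
solr_update = "http://localhost:8983/solr/sciencedata/update/json/docs"

record = {
    "id": "doi:10.5066/EXAMPLE",                 # placeholder identifier
    "title": "Streamflow observations, example basin",
    "keywords": ["hydrology", "streamflow"],
    "bbox_west": -105.3, "bbox_east": -104.6,    # simple geospatial extent fields
}

resp = requests.post(solr_update, json=record,
                     params={"commit": "true"}, timeout=30)
resp.raise_for_status()
print("indexed:", resp.json().get("responseHeader", {}).get("status") == 0)
```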

  11. DataRocket: Interactive Visualisation of Data Structures

    NASA Astrophysics Data System (ADS)

    Parkes, Steve; Ramsay, Craig

    2010-08-01

    CodeRocket is a software engineering tool that provides cognitive support to the software engineer for reasoning about a method or procedure and for documenting the resulting code [1]. DataRocket is a software engineering tool designed to support visualisation and reasoning about program data structures. DataRocket is part of the CodeRocket family of software tools developed by Rapid Quality Systems [2] a spin-out company from the Space Technology Centre at the University of Dundee. CodeRocket and DataRocket integrate seamlessly with existing architectural design and coding tools and provide extensive documentation with little or no effort on behalf of the software engineer. Comprehensive, abstract, detailed design documentation is available early on in a project so that it can be used for design reviews with project managers and non expert stakeholders. Code and documentation remain fully synchronised even when changes are implemented in the code without reference to the existing documentation. At the end of a project the press of a button suffices to produce the detailed design document. Existing legacy code can be easily imported into CodeRocket and DataRocket to reverse engineer detailed design documentation making legacy code more manageable and adding substantially to its value. This paper introduces CodeRocket. It then explains the rationale for DataRocket and describes the key features of this new tool. Finally the major benefits of DataRocket for different stakeholders are considered.

  12. Warehousing Structured and Unstructured Data for Data Mining.

    ERIC Educational Resources Information Center

    Miller, L. L.; Honavar, Vasant; Barta, Tom

    1997-01-01

    Describes an extensible object-oriented view system that supports the integration of both structured and unstructured data sources in either the multidatabase or data warehouse environment. Discusses related work and data mining issues. (AEF)

  13. Connecting the Astrophysics Data System and Planetary Data System

    NASA Astrophysics Data System (ADS)

    Eichhorn, G.; Kurtz, M. J.; Accomazzi, A.; Grant, C. S.; Murray, S. S.; Hughes, J. S.; Mortellaro, J.; McMahon, S. K.

    1997-07-01

    The Astrophysics Data System (ADS) provides access to astronomical literature through a sophisticated search engine. Over 10,000 users retrieve almost 5 million references and read more than 25,000 full text articles per month. ADS cooperates closely with all the main astronomical journals and data centers to create and maintain a state-of-the-art digital library. The Planetary Data System (PDS) publishes high quality peer reviewed planetary science data products, defines planetary archiving standards to make products usable, and provides science expertise to users in data product preparation and use. Data products are available to users on CD media, with more than 600 CD-ROM titles in the inventory from past missions as well as the recent releases from active planetary missions and observations. The ADS and PDS serve overlapping communities and offer complementary functions. The ADS and PDS are both part of the NASA Space Science Data System, sponsored by the Office of Space Science, which curates science data products for researchers and the general public. We are in the process of connecting these two data systems. As a first step we have included entries for PDS data sets in the ADS abstract service. This allows ADS users to find PDS data sets by searching for their descriptions through the ADS search system. The information returned from the ADS links directly to the data set's entry in the PDS data set catalog. After linking to this catalog, the user will have access to more comprehensive data set information, related ancillary information, and on-line data products. The PDS on the other hand will use the ADS to provide access to bibliographic information. This includes links from PDS data set catalog bibliographic citations to ADS abstracts and on-line articles. The cross-linking between these data systems allows each system to concentrate on its main objectives and utilize the other system to provide more and improved services to the users of both systems.

  14. Alternative Fuels Data Center: New Hampshire Transportation Data for

    Science.gov Websites

    Alternative Fuels Data Center: New Hampshire Transportation Data for Alternative Fuels and Vehicles.

  15. Alternative Fuels Data Center: New Mexico Transportation Data for

    Science.gov Websites

    Alternative Fuels Data Center: New Mexico Transportation Data for Alternative Fuels and Vehicles.

  16. Alternative Fuels Data Center: New Jersey Transportation Data for

    Science.gov Websites

    Alternative Fuels Data Center: New Jersey Transportation Data for Alternative Fuels and Vehicles.

  17. Alternative Fuels Data Center: North Dakota Transportation Data for

    Science.gov Websites

    Alternative Fuels Data Center: North Dakota Transportation Data for Alternative Fuels and Vehicles.

  18. "Big Data" in Rheumatology: Intelligent Data Modeling Improves the Quality of Imaging Data.

    PubMed

    Landewé, Robert B M; van der Heijde, Désirée

    2018-05-01

    Analysis of imaging data in rheumatology is a challenge. Reliability of scores is an issue for several reasons. Signal-to-noise ratio of most imaging techniques is rather unfavorable (too little signal in relation to too much noise). Optimal use of all available data may help to increase credibility of imaging data, but knowledge of complicated statistical methodology and the help of skilled statisticians are required. Clinicians should appreciate the merits of sophisticated data modeling and liaise with statisticians to increase the quality of imaging results, as proper imaging studies in rheumatology imply more than a supersensitive imaging technique alone. Copyright © 2018 Elsevier Inc. All rights reserved.

  19. Alternative Fuels Data Center: Alaska Transportation Data for Alternative

    Science.gov Websites

    Alternative Fuels Data Center: Alaska Transportation Data for Alternative Fuels and Vehicles.

  20. Exploring the Integration of Data Mining and Data Visualization

    ERIC Educational Resources Information Center

    Zhang, Yi

    2011-01-01

    Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be…

  1. Data for Improvement, Data for Accountability

    ERIC Educational Resources Information Center

    Weiss, Janet A.

    2012-01-01

    This commentary on the special issue on data use highlights the distinctions between data systems intended to improve the performance of school staff and those intended to hold schools and districts accountable for outcomes. It advises researchers to be alert to the differences in the policy logics connected with each approach.

  2. Enhancing vector shoreline data using a data fusion approach

    NASA Astrophysics Data System (ADS)

    Carlotto, Mark; Nebrich, Mark; DeMichele, David

    2017-05-01

    Vector shoreline (VSL) data is potentially useful in ATR systems that distinguish between objects on land or water. Unfortunately, available data such as the NOAA 1:250,000 World Vector Shoreline and NGA Prototype Global Shoreline data cannot be used by themselves to make a land/water determination because of the manner in which the data are compiled. We describe a data fusion approach for creating labeled VSL data using test points from Global 30 Arc-Second Elevation (GTOPO30) data to determine the direction of vector segments, i.e., whether they are in clockwise or counterclockwise order. We show that consistently labeled VSL data can be used to easily determine whether a point is on land or water using a vector cross-product test.
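
    A minimal sketch of such a cross-product test is shown below. The convention that land lies to the left of each directed segment, and the nearest-segment heuristic, are assumptions made for illustration rather than a description of the authors' algorithm.

      def side_of_segment(p, a, b):
          """Sign of the 2-D cross product (b - a) x (p - a);
          positive means p lies to the left of the directed segment a->b."""
          return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

      def is_on_land(point, shoreline):
          """Classify a point against a consistently ordered shoreline polyline,
          assuming land lies to the left of each directed segment."""
          segments = list(zip(shoreline[:-1], shoreline[1:]))
          # Use the nearest segment so only the local shoreline orientation matters.
          def dist2(seg):
              (a, b) = seg
              mx, my = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
              return (point[0] - mx) ** 2 + (point[1] - my) ** 2
          a, b = min(segments, key=dist2)
          return side_of_segment(point, a, b) > 0

      # Toy shoreline running west to east with land to the north (left of travel).
      coast = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
      print(is_on_land((1.0, 0.5), coast))   # True  (north of the line)
      print(is_on_land((1.0, -0.5), coast))  # False (south of the line)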

  3. Quality Control and Peer Review of Data Sets: Mapping Data Archiving Processes to Data Publication Requirements

    NASA Astrophysics Data System (ADS)

    Mayernik, M. S.; Daniels, M.; Eaker, C.; Strand, G.; Williams, S. F.; Worley, S. J.

    2012-12-01

    Data sets exist within scientific research and knowledge networks as both technical and non-technical entities. Establishing the quality of data sets is a multi-faceted task that encompasses many automated and manual processes. Data sets have always been essential for science research, but now need to be more visible as first-class scholarly objects at national, international, and local levels. Many initiatives are establishing procedures to publish and curate data sets, as well as to promote professional rewards for researchers who collect, create, manage, and preserve data sets. Traditionally, research quality has been assessed by peer review of textual publications, e.g. journal articles, conference proceedings, and books. Citation indices then provide standard measures of productivity used to reward individuals for their peer-reviewed work. Whether a similar peer review process is appropriate for assessing and ensuring the quality of data sets remains an open question. How does the traditional process of peer review apply to data sets? This presentation will describe current work being done at the National Center for Atmospheric Research (NCAR) in the context of the Peer REview for Publication & Accreditation of Research Data in the Earth sciences (PREPARDE) project. PREPARDE is assessing practices and processes for data peer review, with the goal of developing recommendations. NCAR data management teams perform various kinds of quality assessment and review of data sets prior to making them publicly available. This presentation will investigate how notions of peer review relate to the types of data review already in place at NCAR. We highlight the data set characteristics and management/archiving processes that challenge the traditional peer review processes by using a number of questions as probes, including: Who is qualified to review data sets? What formal and informal documentation is necessary to allow someone outside of a research team to review a data set

  4. About the Transportation Secure Data Center | Transportation Secure Data

    Science.gov Websites

    The Transportation Secure Data Center (TSDC) at NREL makes vital transportation data broadly available. NREL data experts and engineers analyze large sets of complex data.

  5. Translating Big Data into Smart Data for Veterinary Epidemiology.

    PubMed

    VanderWaal, Kimberly; Morrison, Robert B; Neuhauser, Claudia; Vilalta, Carles; Perez, Andres M

    2017-01-01

    The increasing availability and complexity of data has led to new opportunities and challenges in veterinary epidemiology around how to translate abundant, diverse, and rapidly growing "big" data into meaningful insights for animal health. Big data analytics are used to understand health risks and minimize the impact of adverse animal health issues through identifying high-risk populations, combining data or processes acting at multiple scales through epidemiological modeling approaches, and harnessing high velocity data to monitor animal health trends and detect emerging health threats. The advent of big data requires the incorporation of new skills into veterinary epidemiology training, including, for example, machine learning and coding, to prepare a new generation of scientists and practitioners to engage with big data. Establishing pipelines to analyze big data in near real-time is the next step for progressing from simply having "big data" to create "smart data," with the objective of improving understanding of health risks, effectiveness of management and policy decisions, and ultimately preventing or at least minimizing the impact of adverse animal health issues.

  6. The Environmental Data Initiative: A broad-use data repository for environmental and ecological data that strives to balance data quality and ease of submission

    NASA Astrophysics Data System (ADS)

    Servilla, M. S.; Brunt, J.; Costa, D.; Gries, C.; Grossman-Clarke, S.; Hanson, P. C.; O'Brien, M.; Smith, C.; Vanderbilt, K.; Waide, R.

    2017-12-01

    In the world of data repositories, there seems to be a never-ending struggle between the generation of high-quality data documentation and the ease of archiving a data product in a repository - the higher the documentation standards, the greater the effort required of the scientist, and the less likely the data will be archived. The Environmental Data Initiative (EDI) attempts to balance the rigor of data documentation against the amount of effort required of a scientist to upload and archive data. As an outgrowth of the LTER Network Information System, the EDI is funded by the US NSF Division of Environmental Biology to support the LTER, LTREB, OBFS, and MSB programs, in addition to providing an open data archive for environmental scientists without a viable archive. EDI uses the PASTA repository software, developed originally by the LTER. PASTA is metadata driven and documents data with the Ecological Metadata Language (EML), a high-fidelity standard that can describe all types of data in great detail. PASTA incorporates a series of data quality tests to ensure that data are correctly documented with EML, in a process that is termed "metadata and data congruence"; incongruent data packages are not accepted into the repository. EDI reduces the burden of data documentation on scientists in two ways: first, EDI provides hands-on assistance and best-practice tools for generating EML, written in R and currently being developed in Python. These tools hide the details of EML generation and syntax by providing a more natural and contextual setting for describing data. Second, EDI works closely with community information managers in defining the rules used in PASTA quality tests. Rules deemed too strict can be turned off completely or set to issue only a warning while the community learns how best to handle the situation and improves its documentation practices. Rules can also be added or refined over time to improve the overall quality of archived data. The outcomes of quality tests are stored as
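
    The notion of metadata and data congruence can be illustrated with a toy check like the one below: the columns declared in the metadata are compared with the CSV header and row widths. The actual PASTA quality tests are far more extensive, and the file and column names here are hypothetical.

      import csv

      def check_congruence(csv_path, declared_columns):
          """Minimal metadata/data congruence check: does the CSV header match the
          declared attribute list, and is every row the declared width?
          Returns a list of human-readable problems (an empty list means congruent)."""
          problems = []
          with open(csv_path, newline="") as fh:
              reader = csv.reader(fh)
              header = next(reader, [])
              if header != declared_columns:
                  problems.append(f"header {header} != declared {declared_columns}")
              for lineno, row in enumerate(reader, start=2):
                  if len(row) != len(declared_columns):
                      problems.append(f"line {lineno}: expected {len(declared_columns)} fields, got {len(row)}")
          return problems

      # Usage with a hypothetical data file and its declared attribute list:
      # issues = check_congruence("site_temperature.csv", ["site", "date", "temp_c"])
      # if issues: print("\n".join(issues))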

  7. Information Models, Data Requirements, and Agile Data Curation

    NASA Astrophysics Data System (ADS)

    Hughes, John S.; Crichton, Dan; Ritschel, Bernd; Hardman, Sean; Joyner, Ron

    2015-04-01

    The Planetary Data System's next generation system, PDS4, is an example of the successful use of an ontology-based Information Model (IM) to drive the development and operations of a data system. In traditional systems engineering, requirements or statements about what is necessary for the system are collected and analyzed for input into the design stage of systems development. With the advent of big data the requirements associated with data have begun to dominate and an ontology-based information model can be used to provide a formalized and rigorous set of data requirements. These requirements address not only the usual issues of data quantity, quality, and disposition but also data representation, integrity, provenance, context, and semantics. In addition the use of these data requirements during system's development has many characteristics of Agile Curation as proposed by Young et al. [Taking Another Look at the Data Management Life Cycle: Deconstruction, Agile, and Community, AGU 2014], namely adaptive planning, evolutionary development, early delivery, continuous improvement, and rapid and flexible response to change. For example customers can be satisfied through early and continuous delivery of system software and services that are configured directly from the information model. This presentation will describe the PDS4 architecture and its three principle parts: the ontology-based Information Model (IM), the federated registries and repositories, and the REST-based service layer for search, retrieval, and distribution. The development of the IM will be highlighted with special emphasis on knowledge acquisition, the impact of the IM on development and operations, and the use of shared ontologies at multiple governance levels to promote system interoperability and data correlation.
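
    A REST-based search layer of this kind is typically exercised with plain HTTP queries. In the sketch below, the endpoint URL, the parameter names, and the response fields are placeholders rather than the actual PDS4 service interface.

      import requests

      # Placeholder search endpoint; the real registry/search URLs and query syntax differ.
      SEARCH_URL = "https://example.org/registry/search"

      def search_products(keyword, limit=10):
          """Query a REST-style search service and return matching product summaries."""
          resp = requests.get(SEARCH_URL, params={"q": keyword, "rows": limit}, timeout=30)
          resp.raise_for_status()
          return resp.json().get("results", [])

      # for hit in search_products("reconnaissance orbiter image"):
      #     print(hit.get("identifier"), hit.get("title"))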

  8. Longitudinal data analysis with non-ignorable missing data.

    PubMed

    Tseng, Chi-hong; Elashoff, Robert; Li, Ning; Li, Gang

    2016-02-01

    A common problem in longitudinal data analysis is missing data. Two types of missing patterns are generally considered in the statistical literature: monotone and non-monotone missing data. Non-monotone missing data occur when study participants intermittently miss scheduled visits, while monotone missing data can result from discontinued participation, loss to follow-up, and mortality. Although many novel statistical approaches have been developed to handle missing data in recent years, few methods are available that provide inferences handling both types of missing data simultaneously. In this article, a latent random effects model is proposed to analyze longitudinal outcomes with both monotone and non-monotone missingness in the context of missing not at random. Another significant contribution of this article is to propose a new computational algorithm for latent random effects models. To reduce the computational burden of the high-dimensional integration problem in latent random effects models, we develop a new computational algorithm that uses a new adaptive quadrature approach in conjunction with a Taylor series approximation of the likelihood function to simplify the E-step computation in the expectation-maximization algorithm. A simulation study is performed, and data from the scleroderma lung study are used to demonstrate the effectiveness of this method. © The Author(s) 2012.
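
    The role of quadrature in such models can be illustrated with a much simpler, non-adaptive sketch: approximating one subject's marginal likelihood under a logistic model with a normal random intercept via Gauss-Hermite quadrature. This is only a toy stand-in for the adaptive quadrature and Taylor-series approximation proposed in the article; the model, data, and parameter values are invented.

      import numpy as np

      def marginal_loglik(y, x, beta, sigma_b, n_nodes=15):
          """Gauss-Hermite approximation of one subject's marginal log-likelihood:
          L = integral over b of prod_j p(y_j | x_j, b) * N(b; 0, sigma_b^2) db,
          with a logistic link and a single normal random intercept b."""
          nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
          b = np.sqrt(2.0) * sigma_b * nodes            # change of variables
          eta = beta * x[:, None] + b[None, :]          # linear predictor at each node
          p = 1.0 / (1.0 + np.exp(-eta))
          lik_given_b = np.prod(np.where(y[:, None] == 1, p, 1.0 - p), axis=0)
          return np.log(np.sum(weights * lik_given_b) / np.sqrt(np.pi))

      # Toy subject with three visits and binary outcomes at each visit.
      y = np.array([1, 0, 1])
      x = np.array([0.0, 1.0, 2.0])
      print(marginal_loglik(y, x, beta=0.5, sigma_b=1.0))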

  9. Geographic Data as Personal Data in Four EU Member States

    NASA Astrophysics Data System (ADS)

    de Jong, A. J.; van Loenen, B.; Zevenbergen, J. A.

    2016-06-01

    The EU Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data aims at harmonising data protection legislation in the European Union. This should promote the free flow of products and services within the EU. This research found a wide variety of interpretations of the application of data protection legislation to geographic data. The variety was found among the different EU Member States, the different stakeholders, and the different types of geographic data. In the Netherlands, the Data Protection Authority (DPA) states that panoramic images of streets are considered personal data, while Dutch case law judges that the data protection legislation does not apply if certain features are blurred and no link to an address is provided. The topographic datasets studied in the case studies do not contain personal data, according to the Dutch DPA, while the German DPA and the Belgian DPA judge that large-scale topographic maps can contain personal data, and impose conditions on the processing of topographic maps. The UK DPA considers these data outside the scope of the legal definition of personal data. The patchwork of differences in data protection legislation can be harmonised by using a traffic light model. This model focuses on the context in which the processing of the data takes place and has four categories of data: (1) sensitive personal data, (2) personal data, (3) data that can possibly lead to identification, and (4) non-personal data. Some geographic data, for example factual data that do not reveal sensitive information about a person, can be placed in the third category, giving room to opening up data under the INSPIRE Directive.

  10. Scidac-Data: Enabling Data Driven Modeling of Exascale Computing

    DOE PAGES

    Mubarak, Misbah; Ding, Pengfei; Aliaga, Leo; ...

    2017-11-23

    Here, the SciDAC-Data project is a DOE-funded initiative to analyze and exploit two decades of information and analytics that have been collected by the Fermilab data center on the organization, movement, and consumption of high energy physics (HEP) data. The project analyzes the analysis patterns and data organization that have been used by NOvA, MicroBooNE, MINERvA, CDF, D0, and other experiments to develop realistic models of HEP analysis workflows and data processing. The SciDAC-Data project aims to provide both realistic input vectors and corresponding output data that can be used to optimize and validate simulations of HEP analysis. These simulations are designed to address questions of data handling, cache optimization, and workflow structures that are the prerequisites for modern HEP analysis chains to be mapped and optimized to run on the next generation of leadership-class exascale computing facilities. We present the use of a subset of the SciDAC-Data distributions, acquired from analysis of approximately 71,000 HEP workflows run on the Fermilab data center and corresponding to over 9 million individual analysis jobs, as the input to detailed queuing simulations that model the expected data consumption and caching behaviors of the work running in high performance computing (HPC) and high throughput computing (HTC) environments. In particular we describe how the Sequential Access via Metadata (SAM) data-handling system in combination with the dCache/Enstore-based data archive facilities has been used to develop radically different models for analyzing the HEP data. We also show how the simulations may be used to assess the impact of design choices in archive facilities.
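
    As a toy illustration of the caching behavior such simulations explore, the sketch below replays a synthetic, Zipf-like file-request trace through an LRU cache and reports hit rates for several cache sizes. It is not the SciDAC-Data model, and all numbers are invented.

      import random
      from collections import OrderedDict

      def simulate_lru(trace, cache_size):
          """Replay a sequence of file requests through an LRU cache and return the hit rate."""
          cache, hits = OrderedDict(), 0
          for f in trace:
              if f in cache:
                  hits += 1
                  cache.move_to_end(f)
              else:
                  cache[f] = True
                  if len(cache) > cache_size:
                      cache.popitem(last=False)   # evict the least recently used entry
          return hits / len(trace)

      # Synthetic trace: a skewed (Zipf-like) popularity distribution over ~1000 files.
      random.seed(0)
      trace = [f"file_{min(int(random.paretovariate(1.2)), 1000)}" for _ in range(50_000)]
      for size in (10, 100, 500):
          print(size, round(simulate_lru(trace, size), 3))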

  11. Approaches to Linked Open Data at data.oceandrilling.org

    NASA Astrophysics Data System (ADS)

    Fils, D.

    2012-12-01

    The data.oceandrilling.org web application applies Linked Open Data (LOD) patterns to expose Deep Sea Drilling Project (DSDP), Ocean Drilling Program (ODP) and Integrated Ocean Drilling Program (IODP) data. Ocean drilling data is represented in a rich range of data formats: high resolution images, file-based data sets, and sample-based data. This richness of data types has been well met by semantic approaches, as will be demonstrated. Data have been extracted from CSV, HTML, and RDBMS sources through custom software and existing packages for loading into a SPARQL 1.1 compliant triple store. Practices have been developed to streamline the maintenance of the RDF graphs and properly expose them using LOD approaches such as VoID and HTML-embedded structured data. Custom and existing vocabularies are used to express semantic relations between resources. Use of the W3C draft RDF Data Cube Vocabulary and other approaches for encoding time scales, taxonomic fossil data, and other graphs will be shown. A software layer written in Google Go mediates the RDF-to-web pipeline. The approach used is general and can be applied to other similar environments such as node.js or Python Twisted. To facilitate communication, user interface software libraries such as D3 and packages such as S2S and LodLive have been used. Additionally, OpenSearch APIs, structured data in HTML, and SPARQL endpoints provide various access methods for applications. data.oceandrilling.org is not viewed as a web site but as an application that communicates with a range of clients. This approach guides the development more along software engineering practices than along web site authoring approaches.
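
    A SPARQL endpoint of this kind is normally queried over the standard SPARQL protocol, as in the sketch below; the endpoint URL and the predicate used are placeholders, not the actual data.oceandrilling.org graphs or vocabularies.

      import requests

      ENDPOINT = "http://example.org/sparql"   # placeholder endpoint URL

      QUERY = """
      SELECT ?dataset ?title WHERE {
        ?dataset <http://purl.org/dc/terms/title> ?title .
      } LIMIT 10
      """

      resp = requests.get(ENDPOINT, params={"query": QUERY},
                          headers={"Accept": "application/sparql-results+json"},
                          timeout=30)
      resp.raise_for_status()
      for row in resp.json()["results"]["bindings"]:
          print(row["dataset"]["value"], "-", row["title"]["value"])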

  12. Scidac-Data: Enabling Data Driven Modeling of Exascale Computing

    NASA Astrophysics Data System (ADS)

    Mubarak, Misbah; Ding, Pengfei; Aliaga, Leo; Tsaris, Aristeidis; Norman, Andrew; Lyon, Adam; Ross, Robert

    2017-10-01

    The SciDAC-Data project is a DOE-funded initiative to analyze and exploit two decades of information and analytics that have been collected by the Fermilab data center on the organization, movement, and consumption of high energy physics (HEP) data. The project analyzes the analysis patterns and data organization that have been used by NOvA, MicroBooNE, MINERvA, CDF, D0, and other experiments to develop realistic models of HEP analysis workflows and data processing. The SciDAC-Data project aims to provide both realistic input vectors and corresponding output data that can be used to optimize and validate simulations of HEP analysis. These simulations are designed to address questions of data handling, cache optimization, and workflow structures that are the prerequisites for modern HEP analysis chains to be mapped and optimized to run on the next generation of leadership-class exascale computing facilities. We present the use of a subset of the SciDAC-Data distributions, acquired from analysis of approximately 71,000 HEP workflows run on the Fermilab data center and corresponding to over 9 million individual analysis jobs, as the input to detailed queuing simulations that model the expected data consumption and caching behaviors of the work running in high performance computing (HPC) and high throughput computing (HTC) environments. In particular we describe how the Sequential Access via Metadata (SAM) data-handling system in combination with the dCache/Enstore-based data archive facilities has been used to develop radically different models for analyzing the HEP data. We also show how the simulations may be used to assess the impact of design choices in archive facilities.

  13. Scidac-Data: Enabling Data Driven Modeling of Exascale Computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mubarak, Misbah; Ding, Pengfei; Aliaga, Leo

    Here, the SciDAC-Data project is a DOE-funded initiative to analyze and exploit two decades of information and analytics that have been collected by the Fermilab data center on the organization, movement, and consumption of high energy physics (HEP) data. The project analyzes the analysis patterns and data organization that have been used by NOvA, MicroBooNE, MINERvA, CDF, D0, and other experiments to develop realistic models of HEP analysis workflows and data processing. The SciDAC-Data project aims to provide both realistic input vectors and corresponding output data that can be used to optimize and validate simulations of HEP analysis. These simulations are designed to address questions of data handling, cache optimization, and workflow structures that are the prerequisites for modern HEP analysis chains to be mapped and optimized to run on the next generation of leadership-class exascale computing facilities. We present the use of a subset of the SciDAC-Data distributions, acquired from analysis of approximately 71,000 HEP workflows run on the Fermilab data center and corresponding to over 9 million individual analysis jobs, as the input to detailed queuing simulations that model the expected data consumption and caching behaviors of the work running in high performance computing (HPC) and high throughput computing (HTC) environments. In particular we describe how the Sequential Access via Metadata (SAM) data-handling system in combination with the dCache/Enstore-based data archive facilities has been used to develop radically different models for analyzing the HEP data. We also show how the simulations may be used to assess the impact of design choices in archive facilities.

  14. Scenario driven data modelling: a method for integrating diverse sources of data and data streams

    DOEpatents

    Brettin, Thomas S.; Cottingham, Robert W.; Griffith, Shelton D.; Quest, Daniel J.

    2015-09-08

    A system and method of integrating diverse sources of data and data streams is presented. The method can include selecting a scenario based on a topic, creating a multi-relational directed graph based on the scenario, identifying and converting resources in accordance with the scenario and updating the multi-directed graph based on the resources, identifying data feeds in accordance with the scenario and updating the multi-directed graph based on the data feeds, identifying analytical routines in accordance with the scenario and updating the multi-directed graph using the analytical routines and identifying data outputs in accordance with the scenario and defining queries to produce the data outputs from the multi-directed graph.
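
    The multi-relational directed graph at the heart of the method can be sketched with an off-the-shelf graph library; the scenario, node names, and relation labels below are invented for illustration and are not drawn from the patent.

      import networkx as nx

      g = nx.MultiDiGraph()

      # Edges carry a relation label, so multiple kinds of relationship can coexist.
      g.add_edge("outbreak_scenario", "pathogen_genome_db", relation="uses_resource")
      g.add_edge("outbreak_scenario", "hospital_admissions_feed", relation="uses_feed")
      g.add_edge("outbreak_scenario", "sequence_alignment", relation="runs_routine")
      g.add_edge("sequence_alignment", "pathogen_genome_db", relation="analyzes")

      # Query: which resources and feeds does the scenario depend on?
      for _, target, attrs in g.out_edges("outbreak_scenario", data=True):
          if attrs["relation"].startswith("uses"):
              print(attrs["relation"], "->", target)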

  15. Beyond Open Data: the importance of data standards and interoperability - Experiences from ECMWF's Open Data Week

    NASA Astrophysics Data System (ADS)

    Wagemann, Julia; Siemen, Stephan

    2017-04-01

    The European Centre for Medium-Range Weather Forecasts (ECMWF) has been providing an increasing amount of data to the public. The most widely used datasets include the global climate reanalyses (e.g. ERA-Interim) and atmospheric composition data, which are available to the public free of charge. The centre further operates, on behalf of the European Commission, two Copernicus Services, the Copernicus Atmosphere Monitoring Service (CAMS) and the Copernicus Climate Change Service (C3S), which make up-to-date environmental information freely available to scientists, policy makers and businesses. However, to fully benefit from open data, large environmental datasets also have to be easily accessible in a standardised, machine-readable format. Traditional data centres, such as ECMWF, currently face challenges in providing interoperable, standardised access to increasingly large and complex datasets for scientists and industry. ECMWF therefore put open data in the spotlight during a week of events in March 2017, exploring the potential of freely available weather- and climate-related data and reviewing technological solutions for serving these data. Key events included a Workshop on Meteorological Operational Systems (MOS) and a two-day hackathon. The MOS workshop aimed at reviewing technologies and practices to ensure efficient (open) data processing and provision. The hackathon focused on exploring creative uses of open environmental data and on seeing how open data can benefit various industries. The presentation reviews the outcomes and conclusions of the Open Data Week at ECMWF, with a specific focus on the importance of data standards and web services in making open environmental data a success. Overall, the presentation examines the opportunities and challenges of open environmental data from a data provider's perspective.

  16. Meta Data Mining in Earth Remote Sensing Data Archives

    NASA Astrophysics Data System (ADS)

    Davis, B.; Steinwand, D.

    2014-12-01

    Modern search and discovery tools for satellite-based remote sensing data are often catalog based and rely on query systems which use scene- (or granule-) based metadata for those queries. While these traditional catalog systems are often robust, very little has been done in the way of metadata mining to aid in the search and discovery process. The recently coined term "Big Data" can be applied to the remote sensing world's efforts to derive information from the vast data holdings of satellite-based land remote sensing data. Large catalog-based search and discovery systems such as the United States Geological Survey's Earth Explorer system and the NASA Earth Observing System Data and Information System's Reverb-ECHO system provide comprehensive access to these data holdings, but do little to expose the underlying scene-based metadata. These catalog-based systems are extremely flexible, but are manually intensive and often require a high level of user expertise. Exposing scene-based metadata to external, web-based services can enable machine-driven queries to aid in the search and discovery process. Furthermore, services which expose additional scene-based content data (such as product quality information) are now available and can provide a "deeper look" into remote sensing data archives too large for efficient manual search methods. This presentation shows examples of the mining of Landsat and ASTER scene-based metadata, and an experimental service using OPeNDAP to extract information from the quality band of multiple granules in the MODIS archive.

  17. The Frictionless Data Package: Data Containerization for Automated Scientific Workflows

    NASA Astrophysics Data System (ADS)

    Shepherd, A.; Fils, D.; Kinkade, D.; Saito, M. A.

    2017-12-01

    As cross-disciplinary geoscience research increasingly relies on machines to discover and access data, one of the critical questions facing data repositories is how data and supporting materials should be packaged for consumption. Traditionally, data repositories have relied on a human's involvement throughout discovery and access workflows. This human could assess fitness for purpose by reading loosely coupled, unstructured information from web pages and documentation. In attempts to shorten the time to science and to access data resources across many disciplines, expectations for machines to mediate the process of discovery and access are challenging data repository infrastructure. The challenge is to find ways to deliver data and information that enable machines to make better decisions, by enabling them to understand the data and metadata of many data types. Additionally, once machines have recommended a data resource as relevant to an investigator's needs, the data resource should be easy to integrate into that investigator's toolkits for analysis and visualization. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) supports NSF-funded OCE and PLR investigators with their projects' data management needs. These needs involve a number of varying data types, some of which require multiple files with differing formats. Presently, BCO-DMO has described these data types, and the important relationships between each type's data files, through human-readable documentation on web pages. For machines directly accessing data files from BCO-DMO, this documentation could be overlooked and lead to misinterpreting the data. Instead, BCO-DMO is exploring the idea of data containerization, or packaging data and related information for easier transport, interpretation, and use. In researching the landscape of data containerization, the Frictionless Data Package (http://frictionlessdata.io/) provides a number of valuable advantages over similar
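
    A Data Package is described by a datapackage.json descriptor. The minimal sketch below writes one for a hypothetical two-file data type; the resource names, paths, and fields are invented, and richer descriptor properties (licenses, sources, contributors) are omitted.

      import json

      descriptor = {
          "name": "example-ctd-cast",           # hypothetical package name
          "resources": [
              {
                  "name": "cast-events",
                  "path": "data/cast_events.csv",
                  "schema": {"fields": [{"name": "cast_id", "type": "string"},
                                        {"name": "start_time", "type": "datetime"}]},
              },
              {
                  "name": "bottle-samples",
                  "path": "data/bottle_samples.csv",
                  "schema": {"fields": [{"name": "cast_id", "type": "string"},
                                        {"name": "depth_m", "type": "number"},
                                        {"name": "salinity", "type": "number"}]},
              },
          ],
      }

      with open("datapackage.json", "w") as fh:
          json.dump(descriptor, fh, indent=2)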

  18. Data and Data Products for Climate Research: Web Services at the Asia-Pacific Data-Research Center (APDRC)

    NASA Astrophysics Data System (ADS)

    DeCarlo, S.; Potemra, J. T.; Wang, K.

    2012-12-01

    The International Pacific Research Center (IPRC) at the University of Hawaii maintains a data center for climate studies called the Asia-Pacific Data-Research Center (APDRC). This data center was designed within a center of excellence in climate research with the intention of serving the needs of research scientists. The APDRC provides easy access to a broad collection of climate data and data products for a wide variety of users. The data center maintains an archive of approximately 100 data sets including in-situ and remote data, as well as a range of model-based output. All data are available via on-line browsing tools such as a Live Access Server (LAS) and DChart, and direct binary access is available through OPeNDAP services. On-line tutorials on how to use these services are now available. Users can keep up to date with new data and product announcements via the APDRC facebook page. The main focus of the APDRC has been climate scientists, and the services are therefore streamlined for such users, both in the number and types of data served and in the way data are served. In addition, due to the integration of the APDRC within the IPRC, several value-added data products (see figure for an example using Argo floats) have been developed via a variety of research activities. The APDRC, therefore, has three main foci: 1. acquisition of climate-related data, 2. maintenance of integrated data servers, and 3. development and distribution of data products. The APDRC can be found at http://apdrc.soest.hawaii.edu. The presentation will provide an overview along with specific examples of the data, data products, and data services available at the APDRC. (APDRC product example: gridded field from Argo profiling floats.)
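
    OPeNDAP access from an analysis environment typically looks like the sketch below; the dataset URL, variable name, and coordinate names are placeholders, and the actual APDRC catalogue paths and variables differ by product.

      import xarray as xr

      # Hypothetical OPeNDAP dataset URL on an APDRC-style server.
      url = "http://apdrc.soest.hawaii.edu/dods/public_data/example_dataset"

      ds = xr.open_dataset(url)                                     # lazy open over OPeNDAP
      sst = ds["sst"].sel(lat=21.3, lon=202.2, method="nearest")    # assumed variable/coords
      print(sst.isel(time=slice(0, 12)).values)                     # first twelve time steps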

  19. Geothermal Data | Geospatial Data Science | NREL

    Science.gov Websites

    These datasets detail the available geothermal resource. The dataset Identified Onshore Geopressured Geothermal Energy in Texas and Louisiana provides additional information (metadata: geothermal.xml, zip, 5.4 MB, 03/05/2009).

  20. The Planetary Data System Web Catalog Interface--Another Use of the Planetary Data System Data Model

    NASA Technical Reports Server (NTRS)

    Hughes, S.; Bernath, A.

    1995-01-01

    The Planetary Data System Data Model consists of a set of standardized descriptions of entities within the Planetary Science Community. These can be real entities in the space exploration domain such as spacecraft, instruments, and targets; conceptual entities such as data sets, archive volumes, and data dictionaries; or the archive data products themselves, such as individual images, spectra, series, and qubes.

  1. Modern data science for analytical chemical data - A comprehensive review.

    PubMed

    Szymańska, Ewa

    2018-10-22

    Efficient and reliable analysis of chemical analytical data is a great challenge due to the increase in data size, variety and velocity. New methodologies, approaches and methods are being proposed not only by chemometrics but also by other data science communities to extract relevant information from big datasets and deliver their value to different applications. Beyond the common goal of big data analysis, different perspectives on and terms for big data are being discussed in the scientific literature and public media. The aim of this comprehensive review is to present common trends in the analysis of chemical analytical data across different data science fields, together with their data-type-specific and generic challenges. Firstly, common data science terms used in different data science fields are summarized and discussed. Secondly, systematic methodologies to plan and run big data analysis projects are presented together with their steps. Moreover, different analysis aspects such as assessing data quality, selecting data pre-processing strategies, data visualization and model validation are considered in more detail. Finally, an overview of standard and new data analysis methods is provided and their suitability for big analytical chemical datasets briefly discussed. Copyright © 2018 Elsevier B.V. All rights reserved.

  2. DataForge: Modular platform for data storage and analysis

    NASA Astrophysics Data System (ADS)

    Nozik, Alexander

    2018-04-01

    DataForge is a framework for automated data acquisition, storage and analysis built on modern practices of applied programming. The aim of DataForge is to automate standard tasks such as parallel data processing, logging, output sorting and distributed computing. The framework also makes extensive use of declarative programming principles via a metadata concept, which allows a certain degree of meta-programming and improves the reproducibility of results.

  3. SPP: A data base processor data communications protocol

    NASA Technical Reports Server (NTRS)

    Fishwick, P. A.

    1983-01-01

    The design and implementation of a data communications protocol for the Intel Data Base Processor (DBP) are defined. The protocol is termed SPP (Service Port Protocol) since it enables data transfer between the host computer and the DBP service port. The protocol implementation is extensible in that it is explicitly layered and the protocol functionality is hierarchically organized. Extensive trace and performance capabilities have been supplied with the protocol software to permit optional, efficient monitoring of data transfer between the host and the Intel data base processor. Machine independence was considered to be an important attribute during the design and implementation of SPP. The protocol source is fully commented and is included in Appendix A of this report.

  4. Publish (Your Data) or (Let the Data) Perish! Why Not Publish Your Data Too?

    ERIC Educational Resources Information Center

    Wicherts, Jelte M.; Bakker, Marjan

    2012-01-01

    The authors argue that upon publication of a paper, the data should be made available through online archives or repositories. Reasons for not sharing data are discussed and contrasted with advantages of sharing, which include abiding by the scientific principle of openness, keeping the data for posterity, increasing one's impact, facilitation of…

  5. A data-management system for detailed areal interpretive data

    USGS Publications Warehouse

    Ferrigno, C.F.

    1986-01-01

    A data storage and retrieval system has been developed to organize and preserve areal interpretive data. This system can be used by any study where there is a need to store areal interpretive data that generally is presented in map form. This system provides the capability to grid areal interpretive data for input to groundwater flow models at any spacing and orientation. The data storage and retrieval system is designed to be used for studies that cover small areas such as counties. The system is built around a hierarchically structured data base consisting of related latitude-longitude blocks. The information in the data base can be stored at different levels of detail, with the finest detail being a block of 6 sec of latitude by 6 sec of longitude (approximately 0.01 sq mi). This system was implemented on a mainframe computer using a hierarchical data base management system. The computer programs are written in Fortran IV and PL/1. The design and capabilities of the data storage and retrieval system, and the computer programs that are used to implement the system are described. Supplemental sections contain the data dictionary, user documentation of the data-system software, changes that would need to be made to use this system for other studies, and information on the computer software tape. (Lantz-PTT)
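
    Indexing a point into a 6-arc-second block grid of this kind can be sketched as below; the row/column numbering scheme is invented for illustration and is not the hierarchical scheme used by the system described above.

      def block_id(lat_deg, lon_deg, block_arcsec=6):
          """Return the integer (row, col) of the block containing a point,
          on a global grid of block_arcsec-by-block_arcsec cells."""
          cells_per_degree = 3600 // block_arcsec          # 600 cells per degree at 6 arc-seconds
          row = int((lat_deg + 90.0) * cells_per_degree)
          col = int((lon_deg + 180.0) * cells_per_degree)
          return row, col

      # Example: a point in central Iowa and the ~0.0017-degree cell that contains it.
      print(block_id(41.5912, -93.6038))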

  6. Translating Big Data into Smart Data for Veterinary Epidemiology

    PubMed Central

    VanderWaal, Kimberly; Morrison, Robert B.; Neuhauser, Claudia; Vilalta, Carles; Perez, Andres M.

    2017-01-01

    The increasing availability and complexity of data has led to new opportunities and challenges in veterinary epidemiology around how to translate abundant, diverse, and rapidly growing “big” data into meaningful insights for animal health. Big data analytics are used to understand health risks and minimize the impact of adverse animal health issues through identifying high-risk populations, combining data or processes acting at multiple scales through epidemiological modeling approaches, and harnessing high velocity data to monitor animal health trends and detect emerging health threats. The advent of big data requires the incorporation of new skills into veterinary epidemiology training, including, for example, machine learning and coding, to prepare a new generation of scientists and practitioners to engage with big data. Establishing pipelines to analyze big data in near real-time is the next step for progressing from simply having “big data” to create “smart data,” with the objective of improving understanding of health risks, effectiveness of management and policy decisions, and ultimately preventing or at least minimizing the impact of adverse animal health issues. PMID:28770216

  7. Space-Data Routers: Advanced data routing protocols for enhancing data exploitation for space weather applications

    NASA Astrophysics Data System (ADS)

    Anastasiadis, Anastasios; Daglis, Ioannis A.; Balasis, George; Papadimitriou, Constantinos; Tsaoussidis, Vassilios; Diamantopoulos, Sotirios

    2014-05-01

    Data sharing and access are major issues in space sciences, as they influence the degree of data exploitation. The availability of multi-spacecraft distributed observation methods and adaptive mission architectures require computationally intensive analysis methods. Moreover, accurate space weather forecasting and future space exploration far from Earth will be in need of real-time data distribution and assimilation technologies. The FP7-Space collaborative research project "Space-Data Routers" (SDR) relies on space internetworking and in particular on Delay Tolerant Networking (DTN), which marks the new era in space communications. SDR unifies space and earth communication infrastructures and delivers a set of tools and protocols for space-data exploitation. The main goal is to allow space agencies, academic institutes and research centers to share space-data generated by single or multiple missions, in an efficient, secure and automated manner. Here we are presenting the architecture and basic functionality of a DTN-based application specifically designed in the framework of the SDR project, for data query, retrieval and administration that will enable addressing outstanding science questions related to space weather, through the provision of simultaneous real-time data sampling at multiple points in space. The work leading to this paper has received funding from the European Union's Seventh Framework Programme (FP7-SPACE-2010-1) under grant agreement no. 263330 for the SDR (Space-Data Routers for Exploiting Space Data) collaborative research project. This paper reflects only the authors' views and the Union is not liable for any use that may be made of the information contained therein.

  8. NSSDC Data listing

    NASA Technical Reports Server (NTRS)

    1981-01-01

    A convenient reference to space science and supportive data available from the National Space Science Data Center (NSSDC) is provided. Satellite data are organized by NSSDC spacecraft common name. The launch date and NSSDC ID are given. Experiments are listed alphabetically by the principal investigator or team leader. The experiment name and NSSDC ID, data set ID, data set name, data form code, quantity of data, and the time span of the data as verified by NSSDC are shown. Ground-based data, models, computer routines, and composite spacecraft data that are available from NSSDC are listed alphabetically by discipline, source, data type, data content, and data set. The data set name, data form code, quantity of data, and the time span covered where appropriate are included.

  9. Enabling Data-Driven Methodologies Across the Data Lifecycle and Ecosystem

    NASA Astrophysics Data System (ADS)

    Doyle, R. J.; Crichton, D.

    2017-12-01

    NASA has unlocked unprecedented scientific knowledge through exploration of the Earth, our solar system, and the larger universe. NASA is generating enormous amounts of data that are challenging traditional approaches to capturing, managing, analyzing and ultimately gaining scientific understanding from science data. New architectures, capabilities and methodologies are needed to span the entire observing system, from spacecraft to archive, while integrating data-driven discovery and analytic capabilities. NASA data have a definable lifecycle, from remote collection point to validated accessibility in multiple archives. Data challenges must be addressed across this lifecycle, to capture opportunities and avoid decisions that may limit or compromise what is achievable once data arrives at the archive. Data triage may be necessary when the collection capacity of the sensor or instrument overwhelms data transport or storage capacity. By migrating computational and analytic capability to the point of data collection, informed decisions can be made about which data to keep; in some cases, to close observational decision loops onboard, to enable attending to unexpected or transient phenomena. Along a different dimension than the data lifecycle, scientists and other end-users must work across an increasingly complex data ecosystem, where the range of relevant data is rarely owned by a single institution. To operate effectively, scalable data architectures and community-owned information models become essential. NASA's Planetary Data System is having success with this approach. Finally, there is the difficult challenge of reproducibility and trust. While data provenance techniques will be part of the solution, future interactive analytics environments must support an ability to provide a basis for a result: relevant data source and algorithms, uncertainty tracking, etc., to assure scientific integrity and to enable confident decision making. Advances in data science offer

  10. Microprogrammable Integrated Data Acquisition System-Fatigue Life Data Application

    DTIC Science & Technology

    1976-03-01

    The "Microprogrammable Integrated Data Acquisition System - Fatigue Life Data Application" (Midas FLD) is a microprocessor-based data acquisition system; it incorporates a Pro-Log... Lt. James W. Sturges successfully applied the Midas general system [Sturges, 1975] to the fatigue life data monitoring problem and proved its... The Midas FLD system computer program generates the required signals in the proper sequence for effectively sampling the 8-channel...

  11. [Human genetic data from a data protection law perspective].

    PubMed

    Schulte In den Bäumen, Tobias

    2007-02-01

    The collection and use of genetic data have caused much concern in the German population. Data protection is widely seen as the tool to address these fears. The term genetic data is not self-explanatory, as its meaning depends on the different types of genetic diseases. The protection of genetic data, as defined with regard to the different sets of diseases, needs to fit into the preexisting data protection legislation. Still, the particularities of genetic data, such as their multipersonal impact, need to be considered. A balance between the information needs of society and the right to privacy requires medically driven criteria. The medical term of indication, which corresponds with the data protection term of purpose, should serve as a tool to balance the rights of patients and their relatives, or of clients and the third persons involved. Some countries have set up new legislative acts to address the challenges of human genetics. The current state of German data protection law leaves citizens rather unprotected as long as the data are used for medical purposes in a wider sense. A special law on the collection of genetic data has been discussed for several years, but it should be questioned whether the scope of a sector-specific law would serve citizens better. It seems preferable to adjust the existing Data Protection Act rather than drafting a specific law covering the field of human genetics. This adaptation should reflect the different technical ways in which genetic data are collected and used.

  12. Environmental Data Store (EDS): A multi-node Data Storage Facility for diverse sets of Geoscience Data

    NASA Astrophysics Data System (ADS)

    Piasecki, M.; Ji, P.

    2014-12-01

    Geoscience data come in many flavors, determined by the type of data: continuous data on a grid or mesh, discrete data collected at a point (either as one-time samples or as streams coming off sensors), and digital files of any type, such as text files, WORD or EXCEL documents, or audio and video files. We present a storage facility that is comprised of six nodes, each specialized to host a certain data type: grid-based data (netCDF on a THREDDS server), GIS data (shapefiles using GeoServer), point time-series data (CUAHSI ODM), sample data (EDBS), and any digital data (RAMADDA), plus a server for remote sensing data and its products. While there is overlap in data-type storage capabilities (rasters can go into several of these nodes), we prefer to use dedicated storage facilities that (a) are freeware, (b) have a good degree of maturity, and (c) have shown their utility for storing a certain type. In addition, this arrangement allows us to place these commonly used software stacks and storage solutions side by side to develop interoperability strategies. We have used a Drupal-based system to handle user registration and authentication, and also use that system for data submission and data search. In support of this system we developed an extensive controlled-vocabulary system that is an amalgamation of various CVs used in the geoscience community, in order to achieve as high a degree of recognition as possible; it draws on the CF conventions, CUAHSI CVs, NASA (GCMD), EPA and USGS taxonomies, and GEMET, in addition to ontological representations such as SWEET.

  13. Semantic Data Access Services at NASA's Atmospheric Science Data Center

    NASA Astrophysics Data System (ADS)

    Huffer, E.; Hertz, J.; Kusterer, J.

    2012-12-01

    The corpus of Earth Science data products at the Atmospheric Science Data Center at NASA's Langley Research Center comprises a widely heterogeneous set of products, even among those whose subject matter is very similar. Two distinct data products may both contain data on the same parameter, for instance, solar irradiance; but the instruments used, and the circumstances under which the data were collected and processed, may differ significantly. Understanding the differences is critical to using the data effectively. Data distribution services must be able to provide prospective users with enough information to allow them to meaningfully compare and evaluate the data products offered. Semantic technologies - ontologies, triple stores, reasoners, linked data - offer functionality for addressing this issue. Ontologies can provide robust, high-fidelity domain models that serve as common schema for discovering, evaluating, comparing and integrating data from disparate products. Reasoning engines and triple stores can leverage ontologies to support intelligent search applications that allow users to discover, query, retrieve, and easily reformat data from a broad spectrum of sources. We argue that because of the extremely complex nature of scientific data, data distribution systems should wholeheartedly embrace semantic technologies in order to make their data accessible to a broad array of prospective end users, and to ensure that the data they provide will be clearly understood and used appropriately by consumers. Toward this end, we propose a distribution system in which formal ontological models that accurately and comprehensively represent the ASDC's data domain, and fully leverage the expressivity and inferential capabilities of first order logic, are used to generate graph-based representations of the relevant relationships among data sets, observational systems, metadata files, and geospatial, temporal and scientific parameters to help prospective data consumers

  14. The data life cycle applied to our own data.

    PubMed

    Goben, Abigail; Raszewski, Rebecca

    2015-01-01

    Increased demand for data-driven decision making is driving the need for librarians to be facile with the data life cycle. This case study follows the migration of reference desk statistics from handwritten to digital format. This shift presented two opportunities: first, the availability of a nonsensitive data set to improve the librarians' understanding of data-management and statistical analysis skills, and second, the use of analytics to directly inform staffing decisions and departmental strategic goals. By working through each step of the data life cycle, library faculty explored data gathering, storage, sharing, and analysis questions.

  15. Data Management Rubric for Video Data in Organismal Biology.

    PubMed

    Brainerd, Elizabeth L; Blob, Richard W; Hedrick, Tyson L; Creamer, Andrew T; Müller, Ulrike K

    2017-07-01

    Standards-based data management facilitates data preservation, discoverability, and access for effective data reuse within research groups and across communities of researchers. Data sharing requires community consensus on standards for data management, such as storage and formats for digital data preservation, metadata (i.e., contextual data about the data) that should be recorded and stored, and data access. Video imaging is a valuable tool for measuring time-varying phenotypes in organismal biology, with particular application for research in functional morphology, comparative biomechanics, and animal behavior. The raw data are the videos, but videos alone are not sufficient for scientific analysis. Nearly endless videos of animals can be found on YouTube and elsewhere on the web, but these videos have little value for scientific analysis because essential metadata such as true frame rate, spatial calibration, genus and species, weight, age, etc. of organisms, are generally unknown. We have embarked on a project to build community consensus on video data management and metadata standards for organismal biology research. We collected input from colleagues at early stages, organized an open workshop, "Establishing Standards for Video Data Management," at the Society for Integrative and Comparative Biology meeting in January 2017, and then collected two more rounds of input on revised versions of the standards. The result we present here is a rubric consisting of nine standards for video data management, with three levels within each standard: good, better, and best practices. The nine standards are: (1) data storage; (2) video file formats; (3) metadata linkage; (4) video data and metadata access; (5) contact information and acceptable use; (6) camera settings; (7) organism(s); (8) recording conditions; and (9) subject matter/topic. The first four standards address data preservation and interoperability for sharing, whereas standards 5-9 establish minimum metadata

  16. Data Management Rubric for Video Data in Organismal Biology

    PubMed Central

    Brainerd, Elizabeth L.; Blob, Richard W.; Hedrick, Tyson L.; Creamer, Andrew T.; Müller, Ulrike K.

    2017-01-01

    Synopsis Standards-based data management facilitates data preservation, discoverability, and access for effective data reuse within research groups and across communities of researchers. Data sharing requires community consensus on standards for data management, such as storage and formats for digital data preservation, metadata (i.e., contextual data about the data) that should be recorded and stored, and data access. Video imaging is a valuable tool for measuring time-varying phenotypes in organismal biology, with particular application for research in functional morphology, comparative biomechanics, and animal behavior. The raw data are the videos, but videos alone are not sufficient for scientific analysis. Nearly endless videos of animals can be found on YouTube and elsewhere on the web, but these videos have little value for scientific analysis because essential metadata such as true frame rate, spatial calibration, genus and species, weight, age, etc. of organisms, are generally unknown. We have embarked on a project to build community consensus on video data management and metadata standards for organismal biology research. We collected input from colleagues at early stages, organized an open workshop, “Establishing Standards for Video Data Management,” at the Society for Integrative and Comparative Biology meeting in January 2017, and then collected two more rounds of input on revised versions of the standards. The result we present here is a rubric consisting of nine standards for video data management, with three levels within each standard: good, better, and best practices. The nine standards are: (1) data storage; (2) video file formats; (3) metadata linkage; (4) video data and metadata access; (5) contact information and acceptable use; (6) camera settings; (7) organism(s); (8) recording conditions; and (9) subject matter/topic. The first four standards address data preservation and interoperability for sharing, whereas standards 5–9 establish

  17. [Health data].

    PubMed

    Polton, Dominique

    2018-05-01

    Healthcare is considered one of the most promising areas where big data can be applied to foster innovation for the benefit of patients and of the whole system. Healthcare analytics have the potential to accelerate R&D, increase knowledge on diseases and risk factors, improve treatments, develop personalised medicine, help physicians with decision support systems… The access to data is also a driving force for patients' empowerment and for the democratic debate. However, there are also concerns about the societal, economic and ethical impacts of this wave of digitization and of the growing use of data, algorithms and artificial intelligence. Given the issues at stake, collecting and analysing data generated by health care systems is a strategic challenge in all countries; in that respect the French National System of Health Data (a national data warehouse linking data from several sources and giving a vision of the care pathways for the entire population, with a ten-year history) is an asset, but it has to be completed and enriched with data from electronic health records. © 2018 médecine/sciences – Inserm.

  18. NOAA's Data Catalog and the Federal Open Data Policy

    NASA Astrophysics Data System (ADS)

    Wengren, M. J.; de la Beaujardiere, J.

    2014-12-01

    The 2013 Open Data Policy Presidential Directive requires Federal agencies to create and maintain a 'public data listing' that includes all agency data that is currently or will be made publicly-available in the future. The directive requires the use of machine-readable and open formats that make use of 'common core' and extensible metadata formats according to the best practices published in an online repository called 'Project Open Data', to use open licenses where possible, and to adhere to existing metadata and other technology standards to promote interoperability. In order to meet the requirements of the Open Data Policy, the National Oceanic and Atmospheric Administration (NOAA) has implemented an online data catalog that combines metadata from all subsidiary NOAA metadata catalogs into a single master inventory. The NOAA Data Catalog is available to the public for search and discovery, providing access to the NOAA master data inventory through multiple means, including web-based text search, an OGC CS-W endpoint, as well as a native Application Programming Interface (API) for programmatic query. It generates on a daily basis the Project Open Data JavaScript Object Notation (JSON) file required for compliance with the Presidential directive. The Data Catalog is based on the open source Comprehensive Knowledge Archive Network (CKAN) software and runs on the Amazon Federal GeoCloud. This presentation will cover topics including mappings of existing metadata in standard formats (FGDC-CSDGM and ISO 19115 XML) to the Project Open Data JSON metadata schema, representation of metadata elements within the catalog, and compatible metadata sources used to feed the catalog, including Web Accessible Folder (WAF), Catalog Services for the Web (CS-W), and Esri ArcGIS.com. It will also discuss related open source technologies that can be used together to build a spatial data infrastructure compliant with the Open Data Policy.
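
    A CKAN catalog of this kind can be searched programmatically through CKAN's standard action API. The sketch below is a minimal example using the package_search action; the endpoint URL is an assumption about where the NOAA catalog is exposed and should be verified before use.

      # Minimal sketch of querying a CKAN-based catalog through its action API.
      # package_search is a standard CKAN action; the endpoint URL is assumed.
      import requests

      CKAN = "https://data.noaa.gov/api/3/action/package_search"   # assumed endpoint
      resp = requests.get(CKAN, params={"q": "sea surface temperature", "rows": 5}, timeout=30)
      resp.raise_for_status()

      for pkg in resp.json()["result"]["results"]:
          print(pkg["title"])
          for res in pkg.get("resources", []):
              print("   ", res.get("format"), res.get("url"))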

  19. Data communication between data terminal equipment and the JPL administrative data base management system

    NASA Technical Reports Server (NTRS)

    Iverson, R. W.

    1984-01-01

    Approaches to enabling an installed base of mixed data terminal equipment to access a data base management system designed to work with a specific terminal are discussed. The approach taken by the Jet Propulsion Laboratory is described. Background information on the Jet Propulsion Laboratory (JPL), its organization and a description of the Administrative Data Base Management System is included.

  20. NSSDC data listing

    NASA Technical Reports Server (NTRS)

    Horowitz, Richard

    1991-01-01

    The purpose here is to identify, in a highly summarized way, data available from the National Space Science Data Center (NSSDC). Most data are maintained as offline data sets gathered from individual instruments carried on spacecraft; these comprise the Satellite Data Listing. Descriptive names, time spans, data form, and quality of these data sets are identified in the listing, which is sorted alphabetically, first by spacecraft name and then by the principal investigator's or team leader's last name. Several data sets not associated with individual spaceflight instruments are identified in separate listings following the Satellite Data Listing. These include composite spacecraft data sets, ground based data, models, and computer routines. NSSDC also offers data via special services and systems in a number of areas, including the Astronomical Data Center, Coordinated Data Analysis Workshops, NASA Climate Data System, Pilot Land Data System, and Crustal Dynamics Data Information System.

  1. Integration of land-use data and soil survey data

    NASA Technical Reports Server (NTRS)

    Cox, T. L.

    1977-01-01

    Approaches are discussed for increasing the utility of remotely sensed interpretations through the use of a computer-assisted process which provides capabilities for merging several types of data of varying formats. The resulting maps and summary data are used for planning and zoning in a rapidly developing area (34,000 ha) adjacent to the Black Hills in South Dakota. Attention is given to the data source, data digitization, and aspects of data handling and analysis.

  2. National Irrigation Water Quality Program data-synthesis data base

    USGS Publications Warehouse

    Seiler, Ralph L.; Skorupa, Joseph P.

    2001-01-01

    Under the National Irrigation Water Quality Program (NIWQP) of the U.S. Department of the Interior, researchers investigated contamination caused by irrigation drainage in 26 areas in the Western United States from 1986 to 1993. From 1992 to 1995, a comprehensive relational data base was built to organize data collected during the 26-area investigations. The data base provided the basis for analysis and synthesis of these data to identify common features of contaminated areas and hence dominant biologic, geologic, climatic, chemical, and physiographic factors that have resulted in contamination of water and biota in irrigated areas in the Western United States. Included in the data base are geologic, hydrologic, climatological, chemical, and cultural data that describe the 26 study areas in 14 Western States. The data base contains information on 1,264 sites from which water and bottom sediment were collected. It also contains chemical data from 6,903 analyses of surface water, 914 analyses of ground water, 707 analyses of inorganic constituents in bottom sediments, 223 analyses of organochlorine pesticides in bottom sediments, 8,217 analyses of inorganic constituents in biota, and 1,088 analyses for organic constituents in biota. The data base is available to the public and can be obtained at the NIWQP homepage http://www.usbr.gov/niwqp as dBase III tables for personal-computer systems or as American Standard Code for Information Exchange structured query language (SQL) command and data files for SQL data bases.
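
    To make the relational organization concrete, the sketch below sets up a miniature site/analysis layout and joins analyses back to their sites with SQL (run here through Python's sqlite3 for self-containment). The table and column names, and the sample values, are illustrative assumptions, not the actual NIWQP schema.

      # Illustrative relational layout for site and analysis records; table, column,
      # and sample values are hypothetical, not the NIWQP data base schema.
      import sqlite3

      con = sqlite3.connect(":memory:")
      con.executescript("""
      CREATE TABLE sites (site_id TEXT PRIMARY KEY, study_area TEXT, latitude REAL, longitude REAL);
      CREATE TABLE water_analyses (site_id TEXT, constituent TEXT, value REAL, units TEXT,
                                   FOREIGN KEY (site_id) REFERENCES sites (site_id));
      """)
      con.execute("INSERT INTO sites VALUES ('SITE-01', 'Example area', 36.98, -120.85)")
      con.execute("INSERT INTO water_analyses VALUES ('SITE-01', 'selenium', 42.0, 'ug/L')")

      # Join analyses to their sites so contamination can be compared across study areas.
      for row in con.execute("""
          SELECT s.study_area, s.site_id, a.constituent, a.value, a.units
          FROM water_analyses a JOIN sites s ON s.site_id = a.site_id
          WHERE a.constituent = 'selenium'
      """):
          print(row)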

  3. Benefits of a clinical data warehouse with data mining tools to collect data for a radiotherapy trial

    PubMed Central

    Roelofs, Erik; Persoon, Lucas; Nijsten, Sebastiaan; Wiessler, Wolfgang; Dekker, André; Lambin, Philippe

    2016-01-01

    Introduction: Collecting trial data in a medical environment is at present mostly performed manually and therefore time-consuming, prone to errors and often incomplete with the complex data considered. Faster and more accurate methods are needed to improve the data quality and to shorten data collection times where information is often scattered over multiple data sources. The purpose of this study is to investigate the possible benefit of modern data warehouse technology in the radiation oncology field. Material and methods: In this study, a Computer Aided Theragnostics (CAT) data warehouse combined with automated tools for feature extraction was benchmarked against the regular manual data-collection processes. Two sets of clinical parameters were compiled for non-small cell lung cancer (NSCLC) and rectal cancer, using 27 patients per disease. Data collection times and inconsistencies were compared between the manual and the automated extraction method. Results: The average time per case to collect the NSCLC data manually was 10.4 ± 2.1 min and 4.3 ± 1.1 min when using the automated method (p < 0.001). For rectal cancer, these times were 13.5 ± 4.1 and 6.8 ± 2.4 min, respectively (p < 0.001). In 3.2% of the data collected for NSCLC and 5.3% for rectal cancer, there was a discrepancy between the manual and automated method. Conclusions: Aggregating multiple data sources in a data warehouse combined with tools for extraction of relevant parameters is beneficial for data collection times and offers the ability to improve data quality. The initial investments in digitizing the data are expected to be compensated due to the flexibility of the data analysis. Furthermore, successive investigations can easily select trial candidates and extract new parameters from the existing databases. PMID:23394741

  4. Benefits of a clinical data warehouse with data mining tools to collect data for a radiotherapy trial.

    PubMed

    Roelofs, Erik; Persoon, Lucas; Nijsten, Sebastiaan; Wiessler, Wolfgang; Dekker, André; Lambin, Philippe

    2013-07-01

    Collecting trial data in a medical environment is at present mostly performed manually and therefore time-consuming, prone to errors and often incomplete with the complex data considered. Faster and more accurate methods are needed to improve the data quality and to shorten data collection times where information is often scattered over multiple data sources. The purpose of this study is to investigate the possible benefit of modern data warehouse technology in the radiation oncology field. In this study, a Computer Aided Theragnostics (CAT) data warehouse combined with automated tools for feature extraction was benchmarked against the regular manual data-collection processes. Two sets of clinical parameters were compiled for non-small cell lung cancer (NSCLC) and rectal cancer, using 27 patients per disease. Data collection times and inconsistencies were compared between the manual and the automated extraction method. The average time per case to collect the NSCLC data manually was 10.4 ± 2.1 min and 4.3 ± 1.1 min when using the automated method (p<0.001). For rectal cancer, these times were 13.5 ± 4.1 and 6.8 ± 2.4 min, respectively (p<0.001). In 3.2% of the data collected for NSCLC and 5.3% for rectal cancer, there was a discrepancy between the manual and automated method. Aggregating multiple data sources in a data warehouse combined with tools for extraction of relevant parameters is beneficial for data collection times and offers the ability to improve data quality. The initial investments in digitizing the data are expected to be compensated due to the flexibility of the data analysis. Furthermore, successive investigations can easily select trial candidates and extract new parameters from the existing databases. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
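
    The comparison reported here is essentially a paired comparison of per-case collection times. The sketch below shows how such a comparison might be computed; the numbers are synthetic draws that only mimic the scale of the reported values, not the study's data.

      # Synthetic illustration of comparing manual vs. automated collection times per case;
      # the values are made up and only mimic the magnitudes quoted in the abstract.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      manual = rng.normal(10.4, 2.1, size=27)      # minutes per case, synthetic
      automated = rng.normal(4.3, 1.1, size=27)

      t, p = stats.ttest_rel(manual, automated)    # paired test: same cases, two methods
      print(f"manual {manual.mean():.1f} ± {manual.std(ddof=1):.1f} min, "
            f"automated {automated.mean():.1f} ± {automated.std(ddof=1):.1f} min, p = {p:.2g}")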

  5. Geospatial Data Science Modeling | Geospatial Data Science | NREL

    Science.gov Websites

    NREL uses geospatial data science modeling to develop innovative models and tools for energy professionals, project developers, and consumers.

  6. Ghana Open Data Initiative | Ghana Open Data Initiative

    Science.gov Websites

    Web portal of the Ghana Open Data Initiative, maintained by the National Information Technology Agency (NITA) for the government of Ghana, providing open government datasets in categories such as finance, health, agriculture, energy, education, environment, local government, city data, extractives, statistics, business, and elections.

  7. Apparatus and Method for Effecting Data Transfer Between Data Systems

    NASA Technical Reports Server (NTRS)

    Kirkpatrick, Joey V. (Inventor); Grosz, Francis B., Jr. (Inventor); Lannes, Kenny (Inventor); Maniscalco, David G. (Inventor)

    2001-01-01

    An apparatus for effecting data transfer between data systems comprising a first transceiver and a second transceiver. The first transceiver has an input for receiving digital data from one of the data systems, an output for serially outputting digital data to one of the data systems, at least one transmitter for converting digital data received at the input into optical signals, and at least one receiver for receiving optical signals and serially converting the received optical signals to digital data for output to the data output. The second transceiver has an input for receiving digital data from another one of the data systems, an output for serially outputting digital data to the another one of the data systems, at least one transmitter for serially converting digital data received at the input of the second transceiver into optical signals, and at least one receiver for receiving optical signals and serially converting the received optical signals to digital data for output to the output of the second transceiver. The apparatus further comprises an optical link connecting the first and second transceivers. The optical link comprising a pair of optical fibers. One of the optical fibers optically links the transmitter of the first transceiver to the receiver of the second transceiver. The other optical fiber optically links the receiver of the first transceiver to the transmitter of the second transceiver.

  8. Improving sensor data analysis through diverse data source integration

    NASA Astrophysics Data System (ADS)

    Casper, Jennifer; Albuquerque, Ronald; Hyland, Jeremy; Leveille, Peter; Hu, Jing; Cheung, Eddy; Mauer, Dan; Couture, Ronald; Lai, Barry

    2009-05-01

    Daily sensor data volumes are increasing from gigabytes to multiple terabytes. The manpower and resources needed to analyze the increasing amount of data are not growing at the same rate. Current volumes of diverse data, both live streaming and historical, are not fully analyzed. Analysts are left mostly to analyze the individual data sources manually. This is both time-consuming and mentally exhausting. Expanding data collections only exacerbate this problem. Improved data management techniques and analysis methods are required to process the increasing volumes of historical and live streaming data sources simultaneously. Improved techniques are needed to reduce an analyst's decision response time and to enable more intelligent and immediate situation awareness. This paper describes the Sensor Data and Analysis Framework (SDAF) system built to provide analysts with the ability to pose integrated queries on diverse live and historical data sources, and plug in needed algorithms for upstream processing and filtering. The SDAF system was inspired by input and feedback from field analysts and experts. This paper presents SDAF's capabilities, implementation, and reasoning behind implementation decisions. Finally, lessons learned from preliminary tests and deployments are captured for future work.
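
    The general idea of an integrated query with plug-in filters can be pictured as a small generator pipeline over mixed historical and live record streams. The sketch below is only an illustration of that pattern; the record fields, sources, and filter are invented and are not SDAF's actual interfaces.

      # Sketch of a pluggable processing chain over mixed historical and live records.
      # Sources, fields, and the example filter are illustrative placeholders.
      from typing import Callable, Iterable, Iterator

      Record = dict
      Filter = Callable[[Record], bool]

      def historical_source() -> Iterator[Record]:
          yield {"sensor": "radar-1", "t": 100, "value": 0.8}
          yield {"sensor": "eo-2", "t": 101, "value": 0.1}

      def live_source() -> Iterator[Record]:
          yield {"sensor": "radar-1", "t": 102, "value": 0.9}

      def query(sources: Iterable[Iterator[Record]], filters: Iterable[Filter]) -> Iterator[Record]:
          for source in sources:                     # integrated query across diverse sources
              for rec in source:
                  if all(f(rec) for f in filters):   # plug-in upstream filtering
                      yield rec

      high_confidence = lambda r: r["value"] > 0.5   # example plug-in algorithm
      for rec in query([historical_source(), live_source()], [high_confidence]):
          print(rec)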

  9. Toolsets for Airborne Data (TAD): Customized Data Merging Function

    NASA Astrophysics Data System (ADS)

    Benson, A.; Peeters, M. C.; Perez, J.; Parker, L.; Chen, G.

    2013-12-01

    NASA has conducted airborne tropospheric chemistry studies for about three decades. These field campaigns have generated a great wealth of observations, including a wide range of trace gases and aerosol properties. The ASDC Toolset for Airborne Data (TAD) is being designed to meet the user community's needs for manipulating aircraft data for scientific research on climate change and air quality issues. Prior to the actual toolset development, a comprehensive metadata database was created to compensate for the absence of standardization of the ICARTT data format in which the data is stored. This database tracks the Principal Investigator-provided metadata, and links the measurement variables to a common naming system that was developed as a part of this project. This database is used by the data merging module. Most aircraft data reported during a single flight is not on a consistent time base and is difficult to intercompare. This module provides the user with the ability to merge original data measurements from multiple data providers into a specified time interval or common time base. The database development, common naming scheme and data merge module development will be presented.
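
    The core of such a merge is averaging each provider's measurements into fixed time intervals and joining them on that common base. The sketch below shows the pattern with pandas; the variable names and values are illustrative, not an actual ICARTT file or the TAD module's code.

      # Sketch of merging two providers' measurements onto a common 60-second time base;
      # variable names and values are invented for illustration.
      import pandas as pd

      ozone = pd.DataFrame({"time_s": [0, 7, 65, 118], "O3_ppbv": [31.0, 32.5, 40.1, 38.7]})
      co = pd.DataFrame({"time_s": [3, 59, 121], "CO_ppbv": [95.0, 99.2, 101.4]})

      def to_interval_means(df, interval_s=60):
          # Average each variable within fixed time intervals from the start of flight.
          df = df.copy()
          df["bin"] = (df["time_s"] // interval_s) * interval_s
          return df.drop(columns="time_s").groupby("bin").mean()

      merged = to_interval_means(ozone).join(to_interval_means(co), how="outer")
      print(merged)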

  10. Opportunistic data locality for end user data analysis

    NASA Astrophysics Data System (ADS)

    Fischer, M.; Heidecker, C.; Kuehn, E.; Quast, G.; Giffels, M.; Schnepf, M.; Heiss, A.; Petzold, A.

    2017-10-01

    With the increasing data volume of LHC Run2, user analyses are evolving towards increasing data throughput. This evolution translates to higher requirements for efficiency and scalability of the underlying analysis infrastructure. We approach this issue with a new middleware to optimise data access: a layer of coordinated caches transparently provides data locality for high-throughput analyses. We demonstrated the feasibility of this approach with a prototype used for analyses of the CMS working groups at KIT. In this paper, we present our experience both with the approach in general and with our prototype in particular.
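
    The transparent-cache idea can be summarized as a read-through layer: an analysis asks for a file by logical name, and the layer serves it from local storage, fetching it once from remote storage on a miss. The sketch below illustrates the pattern only; the paths and the copy-based "remote fetch" are placeholders, not the KIT middleware's API.

      # Sketch of a transparent read-through cache in front of remote storage; paths and
      # the local-copy fetch are placeholders (a real system would use e.g. XRootD).
      import shutil
      from pathlib import Path

      CACHE_DIR = Path("cache")                       # assumed coordinated cache location
      REMOTE_DIR = Path("remote_storage")             # stand-in for a remote storage element

      def open_input(relative_path: str):
          cached = CACHE_DIR / relative_path
          if not cached.exists():                     # cache miss: pull once, reuse many times
              cached.parent.mkdir(parents=True, exist_ok=True)
              shutil.copy(REMOTE_DIR / relative_path, cached)
          return open(cached, "rb")                   # subsequent jobs read locally

      # Example (requires REMOTE_DIR/dataset/events_001.root to exist):
      # fh = open_input("dataset/events_001.root")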

  11. Oceanographic Data in Europe: Minimal Effort for Data Providers, Maximal Ease of Use and Access for Data Users

    NASA Astrophysics Data System (ADS)

    De Bruin, T.

    2017-12-01

    SeaDataCloud/SeaDataNet (SDC/SDN) is both a consortium and a data infrastructure as well as a (series of) European oceanographic data management project(s), allowing data providers to store data at a data centre of their choice (usually a type of National Oceanographic Data Center), while exposing and making the data available for download via a chain of interconnected data portals at local, regional, pan-European and global levels. SDC/SDN as an infrastructure connects over 100 data centers from 35 countries in and around Europe. The infrastructure has been operational since early 2009 and provides the user with an overview of all available data as well as the possibility to download the data in a uniform format. This presentation will give a short introduction to the SDC/SDN infrastructure and describe how its development was based on sound data management principles. The emphasis will be on how the system is interconnected with other, non-discipline-specific (metadata) portals such as the Group of Earth Observations System of Systems (GEOSS), allowing oceanographic data stored at a local level in a data centre to be exposed at a global level to a wide audience from various disciplines.

  12. Drive Cycle Data | Transportation Secure Data Center | NREL

    Science.gov Websites

    The Transportation Secure Data Center provides drive cycle data download files for individual surveys and studies, for example the Greater Fairbanks, Alaska, Transportation Survey drive cycle data by vehicle (24-hour period of operation).

  13. NREL: Renewable Resource Data Center - Biomass Resource Data

    Science.gov Websites

    The following biomass resource data collections can be found in the Renewable Resource Data Center (RReDC): Current Biomass Resource Supply, an estimate of biomass resources currently available by county, and Projected Biomass Resource Supply, an estimate of biomass resources potentially available.

  14. Data Science for Imbalanced Data: Methods and Applications

    ERIC Educational Resources Information Center

    Johnson, Reid A.

    2016-01-01

    Data science is a broad, interdisciplinary field concerned with the extraction of knowledge or insights from data, with the classification of data as a core, fundamental task. One of the most persistent challenges faced when performing classification is the class imbalance problem. Class imbalance refers to when the frequency with which each class…

  15. Alternative Fuels Data Center: Truck Stop Electrification Site Data

    Science.gov Websites

    Alternative Fuels Data Center page describing the data collection methods used for truck stop electrification site data.

  16. Alternative Fuels Data Center: South Dakota Transportation Data for Alternative Fuels and Vehicles

    Science.gov Websites

    Alternative Fuels Data Center page presenting South Dakota transportation data for alternative fuels and vehicles.

  17. Data Publication and Interoperability for Long Tail Researchers via the Open Data Repository's (ODR) Data Publisher.

    NASA Astrophysics Data System (ADS)

    Stone, N.; Lafuente, B.; Bristow, T.; Keller, R.; Downs, R. T.; Blake, D. F.; Fonda, M.; Pires, A.

    2016-12-01

    Working primarily with astrobiology researchers at NASA Ames, the Open Data Repository (ODR) has been conducting a software pilot to meet the varying needs of this multidisciplinary community. Astrobiology researchers often have small communities or operate individually with unique data sets that don't easily fit into existing database structures. The ODR constructed its Data Publisher software to allow researchers to create databases with common metadata structures and subsequently extend them to meet their individual needs and data requirements. The software accomplishes these tasks through a web-based interface that allows collaborative creation and revision of common metadata templates and individual extensions to these templates for custom data sets. This allows researchers to search disparate datasets based on common metadata established through the metadata tools, but still facilitates distinct analyses and data that may be stored alongside the required common metadata. The software produces web pages that can be made publicly available at the researcher's discretion so that users may search and browse the data in an effort to make interoperability and data discovery a human-friendly task while also providing semantic data for machine-based discovery. Once relevant data has been identified, researchers can utilize the built-in application programming interface (API) that exposes the data for machine-based consumption and integration with existing data analysis tools (e.g. R, MATLAB, Project Jupyter - http://jupyter.org). The current evolution of the project has created the Astrobiology Habitable Environments Database (AHED)[1] which provides an interface to databases connected through a common metadata core. In the next project phase, the goal is for small research teams and groups to be self-sufficient in publishing their research data to meet funding mandates and academic requirements as well as fostering increased data discovery and interoperability
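
    Machine-based consumption of such a published dataset would typically mean pulling records from a JSON API and handing them to an analysis tool. The sketch below is purely hypothetical: the endpoint URL, query parameter, and field layout are placeholders, not the ODR Data Publisher's actual API.

      # Hypothetical sketch of machine-based access to a published dataset via a JSON API;
      # the URL, parameters, and record structure are placeholders, not ODR's real API.
      import requests
      import pandas as pd

      API = "https://odr.example.org/api/datasets/ahed-demo/records"   # placeholder URL
      records = requests.get(API, params={"mineral": "jarosite"}, timeout=30).json()

      df = pd.DataFrame(records)        # hand off to pandas / Jupyter / R-style analysis
      print(df.head())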

  18. ERDDAP: Reducing Data Friction with an Open Source Data Platform

    NASA Astrophysics Data System (ADS)

    O'Brien, K.

    2017-12-01

    Data friction is not just an issue facing interdisciplinary research. Oftentimes, even within disciplines, significant data friction can exist. Issues of differing formats, limited metadata and non-existent machine-to-machine data access are all issues that exist within disciplines and make it that much harder for successful interdisciplinary cooperation. Therefore, reducing data friction within disciplines is a crucial first step in providing better overall collaboration. ERDDAP, an open source data platform developed at NOAA's Southwest Fisheries Science Center, is well poised to improve data usability and understanding and reduce data friction, both in single and multi-disciplinary research. By virtue of its ability to integrate data of varying formats and provide RESTful-based user access to data and metadata, use of ERDDAP has grown substantially throughout the ocean data community. ERDDAP also supports standards such as the DAP data protocol, the Climate and Forecast (CF) metadata conventions and the BagIt document standard for data archival. In this presentation, we will discuss the advantages of using ERDDAP as a data platform. We will also show specific use cases where utilizing ERDDAP has reduced friction within a single discipline (physical oceanography) and improved interdisciplinary collaboration as well.
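
    ERDDAP's RESTful access generally takes the form of a URL that names a dataset, the variables wanted, and constraints, and returns the subset in a requested format such as CSV. The sketch below follows that pattern; the server is a commonly used public ERDDAP instance, but the dataset ID is a placeholder that must be replaced with a real one.

      # Sketch of a RESTful ERDDAP tabledap request returning CSV; the dataset ID is a
      # placeholder and the variable names assume the dataset carries time/lat/lon.
      import io
      import pandas as pd
      import requests

      base = "https://coastwatch.pfeg.noaa.gov/erddap/tabledap"
      dataset = "exampleDatasetID"           # replace with a real tabledap dataset ID
      query = "time,latitude,longitude&time>=2010-01-01&time<=2010-12-31"

      resp = requests.get(f"{base}/{dataset}.csv?{query}", timeout=60)
      resp.raise_for_status()
      df = pd.read_csv(io.StringIO(resp.text), skiprows=[1])   # second row holds units
      print(df.head())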

  19. MitoMiner: a data warehouse for mitochondrial proteomics data

    PubMed Central

    Smith, Anthony C.; Blackshaw, James A.; Robinson, Alan J.

    2012-01-01

    MitoMiner (http://mitominer.mrc-mbu.cam.ac.uk/) is a data warehouse for the storage and analysis of mitochondrial proteomics data gathered from publications of mass spectrometry and green fluorescent protein tagging studies. In MitoMiner, these data are integrated with data from UniProt, Gene Ontology, Online Mendelian Inheritance in Man, HomoloGene, Kyoto Encyclopaedia of Genes and Genomes and PubMed. The latest release of MitoMiner stores proteomics data sets from 46 studies covering 11 different species from eumetazoa, viridiplantae, fungi and protista. MitoMiner is implemented by using the open source InterMine data warehouse system, which provides a user interface allowing users to upload data for analysis, personal accounts to store queries and results and enables queries of any data in the data model. MitoMiner also provides lists of proteins for use in analyses, including the new MitoMiner mitochondrial proteome reference sets that specify proteins with substantial experimental evidence for mitochondrial localization. As further mitochondrial proteomics data sets from normal and diseased tissue are published, MitoMiner can be used to characterize the variability of the mitochondrial proteome between tissues and investigate how changes in the proteome may contribute to mitochondrial dysfunction and mitochondrial-associated diseases such as cancer, neurodegenerative diseases, obesity, diabetes, heart failure and the ageing process. PMID:22121219

  20. Wind Data | Geospatial Data Science | NREL

    Science.gov Websites

    These datasets detail the wind resource available in the United States at 50-m height. Areas of wind power Class 3 or greater are suitable for most utility-scale wind turbine applications, whereas areas with adequate wind resource for wind turbine applications may also exist in some Class 1 and Class 2 areas.

  1. Data Analysis and Data Mining: Current Issues in Biomedical Informatics

    PubMed Central

    Bellazzi, Riccardo; Diomidous, Marianna; Sarkar, Indra Neil; Takabayashi, Katsuhiko; Ziegler, Andreas; McCray, Alexa T.

    2011-01-01

    Background: Medicine and biomedical sciences have become data-intensive fields, which, at the same time, enable the application of data-driven approaches and require sophisticated data analysis and data mining methods. Biomedical informatics provides a proper interdisciplinary context to integrate data and knowledge when processing available information, with the aim of giving effective decision-making support in clinics and translational research. Objectives: To reflect on different perspectives related to the role of data analysis and data mining in biomedical informatics. Methods: On the occasion of the 50th year of Methods of Information in Medicine a symposium was organized that reflected on opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care. The contributions of experts with a variety of backgrounds in the area of biomedical data analysis have been collected as one outcome of this symposium, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field. Results: The paper presents sections on data accumulation and data-driven approaches in medical informatics, data and knowledge integration, statistical issues for the evaluation of data mining models, translational bioinformatics and bioinformatics aspects of genetic epidemiology. Conclusions: Biomedical informatics represents a natural framework to properly and effectively apply data analysis and data mining methods in a decision-making context. In the future, it will be necessary to preserve the inclusive nature of the field and to foster an increasing sharing of data and methods between researchers. PMID:22146916

  2. NSSDC data listing

    NASA Technical Reports Server (NTRS)

    Horowitz, Richard; King, Joseph H.

    1990-01-01

    In a highly summarized way, data available from the National Space Science Data Center (NSSDC) is identified. Most data are offline data sets (on magnetic tape or as film/print products of various sizes) from individual instruments carried on spacecraft; these compose the Satellite Data Listing. Descriptive names, time spans, data form, and quantity of these data sets are identified in the listing, which is sorted alphabetically-first by spacecraft name and then by the principal investigator's or team leader's last name. Several data sets held at NSSDC, not associated with individual spaceflight instruments, are identified in separate listings following the Satellite Data Listing. These data sets make up the Supplementary Data Listings and include composite spacecraft data sets, ground-based data, models, and computer routines. The identifiers used in the Supplementary Data Listings were created by NSSDC and are explained in the pages preceding the listings. Data set form codes are listed. NSSDC offers primarily archival, retrieval, replication, and dissemination services associated with the data sets discussed in the two major listings identified above. NSSDC also provides documentation which enables the data recipient to use the data received. NSSDC is working toward expanding presently limited capabilities for data subsetting and for promotion of data files to online residence for user downloading. NSSDC data holdings span the range of scientific disciplines in which NASA is involved, and include astrophysics, lunar and planetary science, solar physics, space plasma physics, and Earth science. In addition to the functions mentioned above, NSSDC offers data via special services and systems in a number of areas, including Astronomical Data Center (ADC), Coordinated Data Analysis Workshops (CDAWs), NASA Climate Data System (NCDS), Pilot Land Data System (PLDS), and Crustal Dynamics Data Information System (CDDIS). Furthermore, NSSDC has a no-password account on its

  3. The World Data System - Your partner in data collaboration

    NASA Astrophysics Data System (ADS)

    Gärtner-Roer, Isabelle; Harrison, Sandy; Sorvari, Sanna

    2017-04-01

    The World Data System (ICSU-WDS) is an interdisciplinary body of the International Council for Science (ICSU) with a mission to promote international collaborations for long-term preservation and provision of quality-assessed research data and data services. WDS is a membership organization federating scientific data centers, data services and data networks across all disciplines in the natural and social sciences as well as humanities. The main goals of WDS are to promote documentation and access of data, as well as to strengthen data dissemination and its proper citation. Through its certification scheme, WDS promotes the development of trusted data repositories and the continual improvement of such facilities through maturity self-assessment and information exchange. Thus, WDS is responsible for creating a globally interoperable distributed data system that incorporates emerging technologies and multidisciplinary scientific data activities. As of October 2016, WDS has 76 Regular and Network Members and 24 Partner and Associate Members. The community is actively engaged through a number of activities, such as working groups, webinars and the bi-annual Members Forum. Current efforts aim to promote activities in the African and Asia-Oceanian regions in order to expand the WDS community by recruiting new members in these regions. In order to introduce the role of WDS, we will present the WDS certification scheme, introduce some selected partner services, and detail their collaboration with WDS, including commitments, advantages and challenges. If you want to know more about WDS or want to join, have a look at www.icsu-wds.org!

  4. Data free inference with processed data products

    DOE PAGES

    Chowdhary, K.; Najm, H. N.

    2014-07-12

    Here, we consider the context of probabilistic inference of model parameters given error bars or confidence intervals on model output values, when the data is unavailable. We introduce a class of algorithms in a Bayesian framework, relying on maximum entropy arguments and approximate Bayesian computation methods, to generate consistent data with the given summary statistics. Once we obtain consistent data sets, we pool the respective posteriors, to arrive at a single, averaged density on the parameters. This approach allows us to perform accurate forward uncertainty propagation consistent with the reported statistics.
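
    A toy version of the pooling idea can be sketched as follows: given only a reported mean and error bar on a model output, generate several synthetic data sets consistent with those statistics, infer the parameter from each with a crude rejection-ABC step, and pool the resulting posterior samples. This is a simplified illustration under assumed Gaussian consistency, not the authors' algorithm.

      # Toy sketch: pooled inference from summary statistics only (assumptions: Gaussian
      # consistency of synthetic data, a crude rejection-ABC step, a uniform prior).
      import numpy as np

      rng = np.random.default_rng(1)
      reported_mean, reported_err, n_obs = 3.0, 0.2, 10      # summary statistics only

      def model(theta, size):                                 # forward model: y ~ N(theta, 0.5)
          return rng.normal(theta, 0.5, size)

      pooled = []
      for _ in range(20):                                     # consistent synthetic data sets
          synthetic = rng.normal(reported_mean, reported_err * np.sqrt(n_obs), n_obs)
          prior = rng.uniform(0, 6, 5000)
          sims = np.array([model(t, n_obs).mean() for t in prior])
          # keep prior draws whose simulated mean is close to the synthetic data mean
          pooled.append(prior[np.abs(sims - synthetic.mean()) < 0.05])

      posterior = np.concatenate(pooled)                      # pooled, averaged posterior
      print(posterior.mean(), posterior.std())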

  5. ACTS data center

    NASA Technical Reports Server (NTRS)

    Syed, Ali; Vogel, Wolfhard J.

    1993-01-01

    Viewgraphs on ACTS Data Center status report are included. Topics covered include: ACTS Data Center Functions; data flow overview; PPD flow; RAW data flow; data compression; PPD distribution; RAW Data Archival; PPD Audit; and data analysis.

  6. Big data mining: In-database Oracle data mining over hadoop

    NASA Astrophysics Data System (ADS)

    Kovacheva, Zlatinka; Naydenova, Ina; Kaloyanova, Kalinka; Markov, Krasimir

    2017-07-01

    Big data challenges different aspects of storing, processing and managing data, as well as analyzing and using data for business purposes. Applying Data Mining methods over Big Data is another challenge because of huge data volumes, variety of information, and the dynamics of the sources. Various applications have been developed in this area, but their successful usage depends on understanding many specific parameters. In this paper we present several opportunities for using Data Mining techniques provided by the analytical engine of RDBMS Oracle over data stored in the Hadoop Distributed File System (HDFS). Some experimental results are given and discussed.

  7. SeaDataCloud - further developing the pan-European SeaDataNet infrastructure for marine and ocean data management

    NASA Astrophysics Data System (ADS)

    Schaap, Dick M. A.; Fichaut, Michele

    2017-04-01

    SeaDataCloud marks the third phase of developing the pan-European SeaDataNet infrastructure for marine and ocean data management. The SeaDataCloud project is funded by EU and runs for 4 years from 1st November 2016. It succeeds the successful SeaDataNet II (2011 - 2015) and SeaDataNet (2006 - 2011) projects. SeaDataNet has set up and operates a pan-European infrastructure for managing marine and ocean data and is undertaken by National Oceanographic Data Centres (NODC's) and oceanographic data focal points from 34 coastal states in Europe. The infrastructure comprises a network of interconnected data centres and central SeaDataNet portal. The portal provides users a harmonised set of metadata directories and controlled access to the large collections of datasets, managed by the interconnected data centres. The population of directories has increased considerably in cooperation with and involvement in many associated EU projects and initiatives such as EMODnet. SeaDataNet at present gives overview and access to more than 1.9 million data sets for physical oceanography, chemistry, geology, geophysics, bathymetry and biology from more than 100 connected data centres from 34 countries riparian to European seas. SeaDataNet is also active in setting and governing marine data standards, and exploring and establishing interoperability solutions to connect to other e-infrastructures on the basis of standards of ISO (19115, 19139), and OGC (WMS, WFS, CS-W and SWE). Standards and associated SeaDataNet tools are made available at the SeaDataNet portal for wide uptake by data handling and managing organisations. SeaDataCloud aims at further developing standards, innovating services & products, adopting new technologies, and giving more attention to users. Moreover, it is about implementing a cooperation between the SeaDataNet consortium of marine data centres and the EUDAT consortium of e-infrastructure service providers. SeaDataCloud aims at considerably advancing services and

  8. AOIPS data base management systems support for GARP data sets

    NASA Technical Reports Server (NTRS)

    Gary, J. P.

    1977-01-01

    A data base management system is identified that was developed to provide flexible access to data sets produced by GARP during its data systems tests. The content and coverage of the data base are defined, and a computer-aided, interactive information storage and retrieval system, implemented to facilitate access to user-specified data subsets, is described. The computer programs developed to provide the capability were implemented on the highly interactive, minicomputer-based AOIPS and are referred to as the data retrieval system (DRS). Implemented as a user-interactive, menu-guided system, the DRS permits users to inventory the data tape library and create duplicate or subset data sets based on a user-selected window defined by time and latitude/longitude boundaries. The DRS permits users to select, display, or produce formatted hard copy of individual data items contained within the data records.
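
    The windowed subsetting the DRS provides amounts to keeping records whose time and position fall inside a user-selected box. The sketch below illustrates that selection; the record fields and values are invented for illustration.

      # Sketch of subsetting records by a user-selected time and latitude/longitude window;
      # field names and sample records are illustrative only.
      from datetime import datetime

      records = [
          {"time": datetime(1977, 6, 1, 12), "lat": 38.2, "lon": -76.5, "value": 21.3},
          {"time": datetime(1977, 6, 2, 6),  "lat": 45.0, "lon": -70.1, "value": 18.9},
      ]

      def subset(records, t0, t1, lat_min, lat_max, lon_min, lon_max):
          return [r for r in records
                  if t0 <= r["time"] <= t1
                  and lat_min <= r["lat"] <= lat_max
                  and lon_min <= r["lon"] <= lon_max]

      window = subset(records, datetime(1977, 6, 1), datetime(1977, 6, 2),
                      35.0, 40.0, -80.0, -75.0)
      print(window)   # only the first record falls inside the window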

  9. The IRIS Federator: Accessing Seismological Data Across Data Centers

    NASA Astrophysics Data System (ADS)

    Trabant, C. M.; Van Fossen, M.; Ahern, T. K.; Weekly, R. T.

    2015-12-01

    In 2013 the International Federation of Digital Seismograph Networks (FDSN) approved a specification for web service interfaces for accessing seismological station metadata, time series and event parameters. Since then, a number of seismological data centers have implemented FDSN service interfaces, with more implementations in development. We have developed a new system called the IRIS Federator which leverages this standardization and provides the scientific community with a service for easy discovery and access of seismological data across FDSN data centers. These centers are located throughout the world and this work represents one model of a system for data collection across geographic and political boundaries. The main components of the IRIS Federator are a catalog of time series metadata holdings at each data center and a web service interface for searching the catalog. The service interface is designed to support client-side federated data access, a model in which the client (software run by the user) queries the catalog and then collects the data from each identified center. By default the results are returned in a format suitable for direct submission to those web services, but could also be formatted in a simple text format for general data discovery purposes. The interface will remove any duplication of time series channels between data centers according to a set of business rules by default; however, a user may request results with all duplicate time series entries included. We will demonstrate how client-side federation is being incorporated into some of the DMC's data access tools. We anticipate further enhancement of the IRIS Federator to improve data discovery in various scenarios and to improve usefulness to communities beyond seismology. Data centers with FDSN web services: http://www.fdsn.org/webservices/ The IRIS Federator query interface: http://service.iris.edu/irisws/fedcatalog/1/
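
    In client-side federation, the user's software first asks the fedcatalog service which data centers hold matching channels, then fetches the data from each center. The sketch below queries the fedcatalog endpoint named in the abstract; the parameter names follow FDSN conventions but should be checked against the service documentation.

      # Sketch of the catalog step of client-side federation against the IRIS fedcatalog
      # service (URL from the abstract); parameter names assumed to follow FDSN conventions.
      import requests

      resp = requests.get(
          "http://service.iris.edu/irisws/fedcatalog/1/query",
          params={"net": "IU", "cha": "BHZ",
                  "starttime": "2015-01-01T00:00:00", "endtime": "2015-01-02T00:00:00"},
          timeout=60,
      )
      resp.raise_for_status()

      # The response groups matching channels by data center together with the web service
      # URLs to fetch them from; a client then requests the data from each center in turn.
      print(resp.text[:1000])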

  10. Enhancing data exploitation through DTN-based data transmission protocols

    NASA Astrophysics Data System (ADS)

    Daglis, Ioannis A.; Tsaoussidis, Vassilis; Rontogiannis, Athanasios; Balasis, Georgios; Keramitsoglou, Iphigenia; Paronis, Dimitrios; Sykioti, Olga; Tsinganos, Antonios

    2014-05-01

    Data distribution and data access are major issues in space sciences and geosciences as they strongly influence the degree of data exploitation. Processing and analysis of large volumes of Earth observation and space/planetary data face two major impediments: limited access capabilities due to narrow connectivity windows between spacecraft and ground receiving stations and lack of sufficient communication and dissemination mechanisms between space data receiving centres and the end-user community. Real-time data assimilation that would be critical in a number of forecasting capabilities is particularly affected by such limitations. The FP7-Space project "Space-Data Routers" (SDR) has the aim of allowing space agencies, academic institutes and research centres to disseminate/share space data generated by single or multiple missions, in an efficient, secure and automated manner. The approach of SDR relies on space internetworking - and in particular on Delay-Tolerant Networking (DTN), which marks the new era in space communications, unifies space and earth communication infrastructures and delivers a set of tools and protocols for space-data exploitation. The project includes the definition of limitations imposed by typical space mission scenarios in which the National Observatory of Athens is currently involved, including space and planetary exploration, as well as satellite-supported geoscience applications. In this paper, we present the mission scenarios, the SDR-application and the evaluation of the associated impact from the space-data router enhancements. The work leading to this paper has received funding from the European Union's Seventh Framework Programme (FP7-SPACE-2010-1) under grant agreement no. 263330 for the SDR (Space-Data Routers for Exploiting Space Data) collaborative research project. This paper reflects only the authors' views and the Union is not liable for any use that may be made of the information contained therein.

  11. SeaDataNet - Pan-European infrastructure for marine and ocean data management: Unified access to distributed data sets

    NASA Astrophysics Data System (ADS)

    Schaap, D. M. A.; Maudire, G.

    2009-04-01

    SeaDataNet is an Integrated research Infrastructure Initiative (I3) in EU FP6 (2006 - 2011) to provide the data management system adapted both to the fragmented observation system and the users' need for integrated access to data, meta-data, products and services. Therefore SeaDataNet ensures the long-term archiving of the large number of multidisciplinary data (i.e. temperature, salinity, current, sea level, chemical, physical and biological properties) collected by many different sensors installed on board research vessels, satellites and the various platforms of the marine observing system. The SeaDataNet project started in 2006, but builds upon earlier data management infrastructure projects, undertaken over a period of 20 years by an expanding network of oceanographic data centres from the countries around all European seas. Its predecessor project Sea-Search had a strict focus on metadata. SeaDataNet maintains significant interest in the further development of the metadata infrastructure, but its primary objective is the provision of easy data access and generic data products. SeaDataNet is a distributed infrastructure that provides transnational access to marine data, meta-data, products and services through 40 interconnected Trans National Data Access Platforms (TAP) from 35 countries around the Black Sea, Mediterranean, North East Atlantic, North Sea, Baltic and Arctic regions. These include National Oceanographic Data Centres (NODCs) and Satellite Data Centres. Furthermore the SeaDataNet consortium comprises a number of expert modelling centres, SMEs expert in IT, and 3 international bodies (ICES, IOC and JRC). Planning: The SeaDataNet project is delivering and operating the infrastructure in 3 versions: Version 0: maintenance and further development of the metadata systems developed by the Sea-Search project plus the development of a new metadata system for indexing and accessing individual data objects managed by the SeaDataNet data centres. This

  12. ClinData Express – A Metadata Driven Clinical Research Data Management System for Secondary Use of Clinical Data

    PubMed Central

    Li, Zuofeng; Wen, Jingran; Zhang, Xiaoyan; Wu, Chunxiao; Li, Zuogao; Liu, Lei

    2012-01-01

    Aim: To ease the secondary use of clinical data in clinical research, we introduce a metadata-driven web-based clinical data management system named ClinData Express. ClinData Express is made up of two parts: 1) m-designer, standalone software for metadata definition; and 2) a web-based data warehouse system for data management. With ClinData Express, all the researchers need to do is define the metadata and data model in the m-designer. The web interface for data collection and the specific database for data storage are then automatically generated. The standards used in the system and the data export module ensure data reuse. The system has been tested on seven disease data collections in Chinese and one form from dbGaP. The flexibility of the system gives it great potential for use in clinical research. The system is available at http://code.google.com/p/clindataexpress. PMID:23304327
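
    The metadata-driven idea is that a researcher-defined model of a form drives the generation of the data-entry interface and storage layout. The sketch below illustrates that pattern with an invented model format and a plain-text renderer; it is not ClinData Express's actual m-designer schema.

      # Sketch of metadata-driven form generation; the model format is invented for
      # illustration (a web system would emit HTML inputs and a database schema instead).
      form_model = {
          "form": "colorectal_followup",
          "fields": [
              {"name": "patient_id", "type": "string", "required": True},
              {"name": "age", "type": "integer", "unit": "years"},
              {"name": "tnm_stage", "type": "enum", "values": ["I", "II", "III", "IV"]},
          ],
      }

      def render_form(model):
          # Generate a plain-text entry form from the metadata definition.
          lines = [f"== {model['form']} =="]
          for f in model["fields"]:
              flag = " *" if f.get("required") else ""
              choices = f" ({'/'.join(f['values'])})" if f["type"] == "enum" else ""
              lines.append(f"{f['name']}{flag}: ____{choices}")
          return "\n".join(lines)

      print(render_form(form_model))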

  13. AQUIRE: Aquatic Toxicity Information Retrieval data base. Data file

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Anderson, E.; Pilli, A.

    The purpose of the Aquatic Toxicity Information Retrieval (AQUIRE) data base is to provide scientists and managers quick access to a comprehensive, systematic, computerized compilation of aquatic toxicity data. Scientific papers published both nationally and internationally on the toxicity of chemicals to aquatic organisms and plants are collected and reviewed for AQUIRE. Independently compiled data files that meet AQUIRE parameter and quality assurance criteria are also included. Selected toxicity-test results and related testing information for any individual chemical from laboratory and field aquatic toxicity tests are extracted and added to AQUIRE. Acute, sublethal, and bioconcentration effects are included for tests with freshwater and marine organisms. The total number of data records in AQUIRE now equals 104,500. This includes data from 6000 references, for 5200 chemicals and 2400 test species. A major data file, Acute Toxicity of Organic Chemicals (ATOC), has been incorporated into AQUIRE. The ATOC file contains laboratory acute test data on 525 organic chemicals using juvenile fathead minnows. The complete data file can be accessed by requesting review code 5 as a search parameter.

  14. VISUAL DATA MINING IN ATMOSPHERIC SCIENCE DATA

    EPA Science Inventory

    This paper discusses the use of simple visual tools to explore multivariate spatially-referenced data. It describes interactive approaches such as linked brushing, and dynamic methods such as the grand tour, applied to studying the Comprehensive Ocean-Atmosphere Data Set (COADS).

  15. Simulation Experiments: Better Data, Not Just Big Data

    DTIC Science & Technology

    2014-12-01


  16. Data reduction and analysis of HELIOS plasma wave data

    NASA Technical Reports Server (NTRS)

    Anderson, Roger R.

    1988-01-01

    Reduction of data acquired from the HELIOS Solar Wind Plasma Wave Experiments on HELIOS 1 and 2 was continued. Production of 24-hour survey plots of the HELIOS 1 plasma wave data was continued and microfilm copies were submitted to the National Space Science Data Center. Much of the effort involved the shock memory data from both HELIOS 1 and 2. This data had to be deconvoluted and time-ordered before it could be displayed and plotted in an organized form. The UNIVAC 418-III computer was replaced by a DEC VAX 11/780 computer. In order to continue the reduction and analysis of the data set, all data reduction and analysis computer programs had to be rewritten.

  17. Closing the data gap: Creating an open data environment

    NASA Astrophysics Data System (ADS)

    Hester, J. R.

    2014-02-01

    Poor data management brought on by increasing volumes of complex data undermines both the integrity of the scientific process and the usefulness of datasets. Researchers should endeavour both to make their data citeable and to cite data whenever possible. The reusability of datasets is improved by community adoption of comprehensive metadata standards and public availability of reversibly reduced data. Where standards are not yet defined, as much information as possible about the experiment and samples should be preserved in datafiles written in a standard format.

  18. Improving data quality in the linked open data: a survey

    NASA Astrophysics Data System (ADS)

    Hadhiatma, A.

    2018-03-01

    The Linked Open Data (LOD) is a "web of data", a different paradigm from the "web of documents" commonly used today. However, the huge LOD still suffers from data quality problems such as incompleteness, inconsistency, and inaccuracy. Data quality problems relate to designing effective methods both to manage and to retrieve information at various data quality levels. Based on a review of papers and journals, addressing data quality requires standards for (1) identification of data quality problems, (2) assessment of data quality for a given context, and (3) correction of data quality problems. However, most methods and strategies dealing with LOD data quality have not taken an integrative approach. Hence, based on those standards and an integrative approach, there are opportunities to improve LOD data quality in terms of incompleteness, inaccuracy and inconsistency with respect to its schema and ontology, namely through ontology refinement. Ontology refinement here means not only improving data quality but also enriching the LOD. This requires (1) a standard for data quality assessment and evaluation that is appropriate to the LOD, and (2) a framework of methods based on statistical relational learning that can improve the correction of data quality problems as well as enrich the LOD.
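
    One concrete example of a completeness assessment is counting entities of a class that lack an expected property. The sketch below runs such a check against a public SPARQL endpoint; the endpoint and the class/property choices are illustrative, not part of the surveyed methods.

      # Sketch of one completeness check a LOD quality assessment might run: count
      # entities of a class missing an expected property. Endpoint and terms are examples.
      from SPARQLWrapper import SPARQLWrapper, JSON

      sparql = SPARQLWrapper("https://dbpedia.org/sparql")     # example public endpoint
      sparql.setQuery("""
      PREFIX dbo: <http://dbpedia.org/ontology/>
      SELECT (COUNT(?city) AS ?missing) WHERE {
        ?city a dbo:City .
        FILTER NOT EXISTS { ?city dbo:populationTotal ?pop }
      }
      """)
      sparql.setReturnFormat(JSON)
      result = sparql.query().convert()
      print(result["results"]["bindings"][0]["missing"]["value"])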

  19. Data needs and data bases for climate studies

    NASA Technical Reports Server (NTRS)

    Matthews, Elaine

    1986-01-01

    Two complementary global digital data bases of vegetation and land use, compiled at 1 deg resolution from published sources for use in climate studies, are discussed. The data bases were implemented, in several individually tailored formulations, in a series of climate related applications including: land-surface prescriptions in three-dimensional general circulation models, global biogeochemical cycles (CO2, methane), critical-area mapping for satellite monitoring of land-cover change, and large-scale remote sensing of surface reflectance. The climate applications are discussed with reference to data needs, and data availability from traditional and remote sensing sources.

  20. [Traditional Chinese Medicine data management policy in big data environment].

    PubMed

    Liang, Yang; Ding, Chang-Song; Huang, Xin-di; Deng, Le

    2018-02-01

    As traditional data management models cannot effectively manage the massive data in traditional Chinese medicine (TCM), owing to the uncertainty of data object attributes as well as the diversity and abstraction of data representations, a management strategy for TCM data based on big data technology is proposed. Based on the true characteristics of TCM data, this strategy addresses the uncertainty of data object attributes in TCM information and the non-uniformity of data representation by using the modeless (schema-free) properties of stored objects in big data technology. A hybrid indexing mode is also used to resolve the conflicts brought by different storage modes in the indexing process, with powerful capabilities for query processing of massive data through efficient parallel MapReduce processing. The theoretical analysis provides the management framework and its key technology, while its performance was tested on Hadoop using several common traditional Chinese medicines and prescriptions from a practical TCM data source. Results showed that this strategy can effectively solve the storage problem of TCM information, with good performance in query efficiency, completeness and robustness. Copyright© by the Chinese Pharmaceutical Association.

  1. ALCF Data Science Program: Productive Data-centric Supercomputing

    NASA Astrophysics Data System (ADS)

    Romero, Nichols; Vishwanath, Venkatram

    The ALCF Data Science Program (ADSP) is targeted at big data science problems that require leadership computing resources. The goal of the program is to explore and improve a variety of computational methods that will enable data-driven discoveries across all scientific disciplines. The projects will focus on data science techniques covering a wide area of discovery including but not limited to uncertainty quantification, statistics, machine learning, deep learning, databases, pattern recognition, image processing, graph analytics, data mining, real-time data analysis, and complex and interactive workflows. Project teams will be among the first to access Theta, ALCF's forthcoming 8.5 petaflops Intel/Cray system. The program will transition to the 200 petaflop/s Aurora supercomputing system when it becomes available. In 2016, four projects were selected to kick off the ADSP. The selected projects span experimental and computational sciences and range from modeling the brain to discovering new materials for solar-powered windows to simulating collision events at the Large Hadron Collider (LHC). The program will have a regular call for proposals with the next call expected in Spring 2017. http://www.alcf.anl.gov/alcf-data-science-program This research used resources of the ALCF, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

  2. Efficient data management tools for the heterogeneous big data warehouse

    NASA Astrophysics Data System (ADS)

    Alekseev, A. A.; Osipova, V. V.; Ivanov, M. A.; Klimentov, A.; Grigorieva, N. V.; Nalamwar, H. S.

    2016-09-01

    The traditional RDBMS is built around normalized data structures and has served well for decades, but the technology is not optimal for data processing and analysis in data-intensive fields like social networks, the oil and gas industry, experiments at the Large Hadron Collider, etc. Several challenges have been raised recently on the scalability of data-warehouse-like workloads against the transactional schema, in particular for the analysis of archived data or the aggregation of data for summary and accounting purposes. The paper evaluates new database technologies like HBase, Cassandra, and MongoDB, commonly referred to as NoSQL databases, for handling messy, varied and large amounts of data. The evaluation considers the performance, throughput and scalability of the above technologies for several scientific and industrial use cases. This paper outlines the technologies and architectures needed for processing Big Data, as well as the description of the back-end application that implements data migration from an RDBMS to a NoSQL data warehouse, the NoSQL database organization, and how it could be useful for further data analytics.
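
    As a concrete illustration of the migration step mentioned above, the sketch below copies rows from a relational table into MongoDB documents. The database file, table name, and collection name are hypothetical, and a locally running MongoDB server plus the pymongo driver are assumed.

```python
# Minimal RDBMS-to-NoSQL migration sketch: read rows from a relational table
# and bulk-insert them as documents into MongoDB (table and collection names
# are hypothetical).
import sqlite3
from pymongo import MongoClient

src = sqlite3.connect("warehouse.db")                # hypothetical source database
src.row_factory = sqlite3.Row                        # rows behave like mappings
rows = src.execute("SELECT * FROM measurements")     # hypothetical table

client = MongoClient("mongodb://localhost:27017")    # assumes a local MongoDB server
collection = client["warehouse"]["measurements"]

batch = [dict(row) for row in rows]                  # one schema-free document per row
if batch:
    collection.insert_many(batch)                    # bulk load into the NoSQL store
print(f"migrated {len(batch)} rows")
```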

  3. A multi-dataset data-collection strategy produces better diffraction data

    PubMed Central

    Liu, Zhi-Jie; Chen, Lirong; Wu, Dong; Ding, Wei; Zhang, Hua; Zhou, Weihong; Fu, Zheng-Qing; Wang, Bi-Cheng

    2011-01-01

    A multi-dataset (MDS) data-collection strategy is proposed and analyzed for macromolecular crystal diffraction data acquisition. The theoretical analysis indicated that the MDS strategy can reduce the standard deviation (background noise) of diffraction data compared with the commonly used single-dataset strategy for a fixed X-ray dose. In order to validate the hypothesis experimentally, a data-quality evaluation process, termed a readiness test of the X-ray data-collection system, was developed. The anomalous signals of sulfur atoms in zinc-free insulin crystals were used as the probe to differentiate the quality of data collected using different data-collection strategies. The data-collection results using home-laboratory-based rotating-anode X-ray and synchrotron X-ray systems indicate that the diffraction data collected with the MDS strategy contain more accurate anomalous signals from sulfur atoms than the data collected with a regular data-collection strategy. In addition, the MDS strategy offered more advantages with respect to radiation-damage-sensitive crystals and better usage of rotating-anode as well as synchrotron X-rays. PMID:22011470

  4. Individual Data Linkage of Survey Data with Claims Data in Germany—An Overview Based on a Cohort Study

    PubMed Central

    March, Stefanie

    2017-01-01

    Research based on health insurance data has a long tradition in Germany. By contrast, data linkage of survey data with such claims data is a relatively new field of research with high potential. Data linkage opens up new opportunities for analyses in the field of health services research and public health. Germany has comprehensive rules and regulations of data protection that have to be followed. Therefore, written informed consent is needed for individual data linkage. Additionally, the health system is characterized by heterogeneity of health insurance. The lidA (living at work) study is a cohort study on work, age and health, which linked survey data with claims data from a large number of statutory health insurance funds. All health insurance funds for which participants had given written consent were contacted. This paper will give an overview of individual data linkage of survey data with German claims data using the example of the lidA study results. The challenges and limitations of data linkage will be presented. Despite this heterogeneity, such studies are possible with a negligibly small influence of bias. The experience gained in lidA will be shown and provides important insights for other studies focusing on data linkage. PMID:29232834

  5. Individual Data Linkage of Survey Data with Claims Data in Germany-An Overview Based on a Cohort Study.

    PubMed

    March, Stefanie

    2017-12-09

    Research based on health insurance data has a long tradition in Germany. By contrast, data linkage of survey data with such claims data is a relatively new field of research with high potential. Data linkage opens up new opportunities for analyses in the field of health services research and public health. Germany has comprehensive rules and regulations of data protection that have to be followed. Therefore, written informed consent is needed for individual data linkage. Additionally, the health system is characterized by heterogeneity of health insurance. The lidA (living at work) study is a cohort study on work, age and health, which linked survey data with claims data from a large number of statutory health insurance funds. All health insurance funds for which participants had given written consent were contacted. This paper will give an overview of individual data linkage of survey data with German claims data using the example of the lidA study results. The challenges and limitations of data linkage will be presented. Despite this heterogeneity, such studies are possible with a negligibly small influence of bias. The experience gained in lidA will be shown and provides important insights for other studies focusing on data linkage.

  6. Early Citability of Data vs Peer-Review like Data Publishing Procedures

    NASA Astrophysics Data System (ADS)

    Stockhause, Martina; Höck, Heinke; Toussaint, Frank; Lautenschlager, Michael

    2014-05-01

    The World Data Center for Climate (WDCC), hosted at the German Climate Computing Center (DKRZ), was one of the first data centers to establish a peer-review-like data publication procedure resulting in DataCite DOIs. Data in the long-term archive (LTA) are diligently reviewed by data managers and data authors to ensure high quality and wide reusability of the published data. This traditional data publication procedure for LTA data bearing DOIs is very time-consuming, especially for WDCC's high volumes of climate model data on the order of multiple TBytes. Data are shared with project members and selected scientists months before the data are long-term archived. The scientific community analyses and thus reviews the data, leading to data quality improvements. Scientists wish to cite these unstable data in scientific publications before the long-term archiving and the thorough data review process are finalized. A concept for early preprint DOIs for shared but not yet long-term archived data is presented. Requirements on data documentation, persistence and quality, and use cases for preprint DOIs within the data life-cycle are discussed, as well as questions of how to document the differences between the two DOI types and how to relate them to each other, with the recommendation to use LTA DOIs in citations. WDCC wants to offer an additional user service for early citation of data of basic quality without compromising the LTA DOIs, i.e. WDCC's standard DOIs, as a trustworthy indicator for high-quality data. Referencing Links: World Data Center for Climate (WDCC): http://www.wdc-climate.de German Climate Computing Center (DKRZ): http://www.dkrz.de DataCite: http://datacite.org

  7. A VGI data integration framework based on linked data model

    NASA Astrophysics Data System (ADS)

    Wan, Lin; Ren, Rongrong

    2015-12-01

    This paper addresses a geographic data integration and sharing method for multiple online VGI data sets. We propose a semantic-enabled framework for an online VGI sources cooperative application environment to solve a target class of geospatial problems. Based on linked data technologies, one of the core components of the semantic web, we construct relationship links among geographic features distributed across diverse VGI platforms by using linked data modeling methods, then deploy these semantic-enabled entities on the web, and eventually form an interconnected geographic data network to support geospatial information cooperative application across multiple VGI data sources. The mapping and transformation from VGI sources to the RDF linked data model is presented to guarantee a uniform data representation model among different online social geographic data sources. We propose a mixed strategy, which combines spatial distance similarity and feature name attribute similarity as the measure, to compare and match geographic features in different VGI data sets. Our work also focuses on applying Markov logic networks to interlink the same entities across different VGI-based linked data sets. The automatic generation of the co-reference object identification model from geographic linked data is discussed in more detail. The result is a large geographic linked data network built across loosely coupled VGI web sites. The results of the experiment built on our framework and the evaluation of our method show that the framework is reasonable and practicable.
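
    The mixed matching strategy (spatial distance similarity combined with feature-name similarity) can be sketched as follows; the weights, the 1 km distance scale, and the 0.8 threshold are illustrative assumptions rather than values from the paper.

```python
# Combine a spatial-distance similarity with a name similarity to decide
# whether two VGI features likely describe the same real-world object.
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometres.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def combined_similarity(f1, f2, w_spatial=0.5, scale_km=1.0):
    # Spatial similarity decays exponentially with distance; name similarity
    # is a normalised string ratio. Both lie in [0, 1].
    d = haversine_km(f1["lat"], f1["lon"], f2["lat"], f2["lon"])
    s_spatial = math.exp(-d / scale_km)
    s_name = SequenceMatcher(None, f1["name"].lower(), f2["name"].lower()).ratio()
    return w_spatial * s_spatial + (1 - w_spatial) * s_name

a = {"name": "Central Park", "lat": 40.7829, "lon": -73.9654}   # feature from one VGI source
b = {"name": "Central Pk",   "lat": 40.7825, "lon": -73.9660}   # candidate from another source
print(combined_similarity(a, b) > 0.8)   # True: likely the same feature
```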

  8. Technical solution for data collection, data safety and data privacy legislation: experiences from the SWEET study.

    PubMed

    Forsander, Gun; Pellinat, Martin; Volk, Michael; Muller, Markus; Pinelli, Leonardo; Magnen, Agnes; Danne, Thomas; Aschemeier, Bärbel; de Beaufort, Carine

    2012-09-01

    One of the most important tasks of the SWEET study is benchmarking the data collected. Information on the occurrence of diabetes, its treatment, and the outcomes in children from the different member states of the European Union (EU) is crucial. How the collection of data is realized is essential, concerning both the technical issues and the results. The creation of SWEET Centers of Reference (CoR) all over Europe will be facilitated by access to safe data collection, where legal aspects and privacy are ascertained. To describe the rationale for, and the technical procedure of, the data collection implementation in the SWEET study. Selected data on all patients treated at SWEET CoR are collected. The SWEET project data collection and management system consists of modular components for data collection, online data interchange, and a database for statistical analysis. The SWEET study and the organization of CoR aim to offer an updated, secure, and continuous evaluation of diabetes treatment regimens for all children with diabetes in Europe. To support this goal, an appropriate and secure data management system as described in this paper has been created. © 2012 John Wiley & Sons A/S.

  9. Ethical and Appropriate Data Use Requires Data Literacy

    ERIC Educational Resources Information Center

    Mandinach, Ellen B.; Parton, Brennan M.; Gummer, Edith S.; Anderson, Rachel

    2015-01-01

    Data use should be a continuous, integrated part of practice, a tool that is used all the time. Good teachers have been doing data-driven decision making all along; it just has not been recognized by that term. But there is more work to be done to ensure that educators know how to continuously, effectively, and ethically use data; that is, to help…

  10. Simulation of meteorological satellite (METSAT) data using LANDSAT data

    NASA Technical Reports Server (NTRS)

    Austin, W. W.; Ryland, W. E.

    1983-01-01

    The information content which can be expected from the advanced very high resolution radiometer system, AVHRR, on the NOAA-6 satellite was assessed, and systematic techniques of data interpretation for use with meteorological satellite data were defined. In-house data from LANDSAT 2 and 3 were used to simulate the spatial, spectral, and sampling methods of the NOAA-6 satellite data.

  11. The MMS Science Data Center. Operations, Capabilities, and Data Availability.

    NASA Astrophysics Data System (ADS)

    Larsen, Kristopher; Pankratz, Chris; Giles, Barbara; Kokkonen, Kim; Putnam, Brian; Schafer, Corey; Baker, Dan; Burch, Jim

    2016-04-01

    On September 1, 2015 the Magnetospheric MultiScale (MMS) constellation of satellites completed their six-month commissioning period and began collecting data under nominal conditions. Science operations for the mission are conducted at the Science Operations Center (SOC) at the Laboratory for Atmospheric and Space Physics, University of Colorado, Boulder, Colorado, USA. The Science Data Center (SDC) is a component of the SOC responsible for the data production, management, distribution, archiving, and visualization of the data from the Solving Magnetospheric Acceleration, Reconnection, and Turbulence (SMART) instrument package on board the spacecraft. The mission collects several gigabytes of particle and field data per day, but the constraints on download volumes require efficient tools to manage the selection, transmission, and analysis of data to determine the highest value science data to downlink. This is the Scientist-in-the-Loop (SITL) program and is a critical piece of the MMS science data operations. As of March 2016, MMS science data is available to the entire science community. This includes both the survey data as well as the ultra-high resolution burst data downlinked through the SITL process. This presentation will explain the data and demonstrate the tools available to the community via the SDC so as to encourage as many scientists as possible to look at the wealth of magnetospheric data being produced and made available from MMS.

  12. The SPASE Data Model for Heliophysics Data: Is it Working?

    NASA Technical Reports Server (NTRS)

    Thieman, James; King, Todd; Roberts, Aaron

    2011-01-01

    The Space Physics Archive Search and Extract (SPASE) Data Model was developed to provide a metadata standard for describing Heliophysics (Space and Solar Physics) data within that science discipline. The SPASE Data Model has matured over the many years of its creation and is presently represented by Version 2.2.1. Information about SPASE can be obtained from the website spase-group.org. The Data Model defines terms and values as well as the relationships between them in order to describe the data resources in the Heliophysics data environment. This data environment is quite complex, consisting of Virtual Observatories, Resident Archives, Data Providers, Partnering Data Centers, Services, Final Archives, and a Deep Archive. SPASE is the metadata language standard intended to permeate the complexity and provide a common method of obtaining and understanding data. Is it working in this capacity? SPASE has been used to describe a wide range of data. Examples range from ground-based magnetometer data to interplanetary satellite measurements to space weather model results. Has it achieved the goal of making the data easier to find and use? To find data of interest it is necessary that all the data of importance be described using the SPASE Data Model. Within the part of the data community associated with NASA (supported through NASA funding) there are obligations to use SPASE and to describe the old and new data using the SPASE XML schema. Although this part of the community is not near 100% compliance with the mandate, there is good progress being made and the goal should be reachable in the future. Outside of the NASA data community there is still work to be done to convince the international community that SPASE descriptions are worth the cost of their generation. Some of these groups such as Cluster, HELIO, GAIA, NOAA/NGDC, CSSDP, VSTO, SuperMAG, and IUGONET have agreed to use SPASE, but there are still other groups of importance that need to be reached. It is also assumed

  13. MTI science, data products, and ground-data processing overview

    NASA Astrophysics Data System (ADS)

    Szymanski, John J.; Atkins, William H.; Balick, Lee K.; Borel, Christoph C.; Clodius, William B.; Christensen, R. Wynn; Davis, Anthony B.; Echohawk, J. C.; Galbraith, Amy E.; Hirsch, Karen L.; Krone, James B.; Little, Cynthia K.; McLachlan, Peter M.; Morrison, Aaron; Pollock, Kimberly A.; Pope, Paul A.; Novak, Curtis; Ramsey, Keri A.; Riddle, Emily E.; Rohde, Charles A.; Roussel-Dupre, Diane C.; Smith, Barham W.; Smith, Kathy; Starkovich, Kim; Theiler, James P.; Weber, Paul G.

    2001-08-01

    The mission of the Multispectral Thermal Imager (MTI) satellite is to demonstrate the efficacy of highly accurate multispectral imaging for passive characterization of urban and industrial areas, as well as sites of environmental interest. The satellite makes top-of-atmosphere radiance measurements that are subsequently processed into estimates of surface properties such as vegetation health, temperatures, material composition and others. The MTI satellite also provides simultaneous data for atmospheric characterization at high spatial resolution. To utilize these data the MTI science program has several coordinated components, including modeling, comprehensive ground-truth measurements, image acquisition planning, data processing and data interpretation and analysis. Algorithms have been developed to retrieve a multitude of physical quantities and these algorithms are integrated in a processing pipeline architecture that emphasizes automation, flexibility and programmability. In addition, the MTI science team has produced detailed site, system and atmospheric models to aid in system design and data analysis. This paper provides an overview of the MTI research objectives, data products and ground data processing.

  14. Data sharing platforms for de-identified data from human clinical trials.

    PubMed

    Huser, Vojtech; Shmueli-Blumberg, Dikla

    2018-04-01

    Data sharing of de-identified individual participant data is being adopted by an increasing number of sponsors of human clinical trials. In addition to standardizing data syntax for shared trial data, semantic integration of various data elements is the focus of several initiatives that define research common data elements. This perspective article, in the first part, compares several data sharing platforms for de-identified clinical research data in terms of their size, policies and supported features. In the second part, we use a case study approach to describe in greater detail one data sharing platform (Data Share from the National Institute on Drug Abuse). We present data on the past use of the platform, data formats offered, data de-identification approaches and its use of research common data elements. We conclude with a summary of current and expected future trends that facilitate secondary research use of data from completed human clinical trials.

  15. System and method for transferring data on a data link

    NASA Technical Reports Server (NTRS)

    Cole, Robert M. (Inventor); Bishop, James E. (Inventor)

    2007-01-01

    A system and method are provided for transferring a packet across a data link. The packet may include a stream of data symbols which is delimited by one or more framing symbols. Corruptions of the framing symbol which result in valid data symbols may be mapped to invalid symbols. If it is desired to transfer one of the valid data symbols that has been mapped to an invalid symbol, the data symbol may be replaced with an unused symbol. At the receiving end, these unused symbols are replaced with the corresponding valid data symbols. The data stream of the packet may be encoded with forward error correction information to detect and correct errors in the data stream.
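
    The substitution idea in the abstract can be illustrated with a byte-oriented toy codec: symbols that a single-bit corruption of the framing symbol could produce are never sent as-is, but swapped for unused symbols and swapped back on receipt. All symbol values here are arbitrary choices for the sketch, and a real codec would also escape the framing symbol itself and reserve the substitute range.

```python
# Toy symbol-substitution framing, using 8-bit symbols.
FRAME = 0x7E
RISKY = {FRAME ^ (1 << b) for b in range(8)}                # one-bit corruptions of FRAME
SUBSTITUTES = dict(zip(sorted(RISKY), range(0xF0, 0xF8)))   # map risky symbols to unused ones
REVERSE = {v: k for k, v in SUBSTITUTES.items()}

def encode(payload: bytes) -> bytes:
    # Replace risky data symbols, then delimit the packet with framing symbols.
    body = bytes(SUBSTITUTES.get(b, b) for b in payload)
    return bytes([FRAME]) + body + bytes([FRAME])

def decode(packet: bytes) -> bytes:
    # Drop the delimiters and map substitute symbols back to the original values.
    body = packet.strip(bytes([FRAME]))
    return bytes(REVERSE.get(b, b) for b in body)

msg = bytes([0x7F, 0x10, 0x7A, 0x42])    # 0x7F and 0x7A differ from FRAME by one bit
assert decode(encode(msg)) == msg        # round-trips; the wire never carries risky symbols
```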

  16. Data Systems and Reports as Active Participants in Data Interpretation

    ERIC Educational Resources Information Center

    Rankin, Jenny Grant

    2016-01-01

    Most data-informed decision-making in education is undermined by flawed interpretations. Educator-driven interventions to improve data use are beneficial but not omnipotent, as data misunderstandings persist at schools and school districts commended for ideal data use support. Meanwhile, most data systems and reports display figures without…

  17. Data visualization methods, data visualization devices, data visualization apparatuses, and articles of manufacture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Turner, Alan E.; Crow, Vernon L.; Payne, Deborah A.

    Data visualization methods, data visualization devices, data visualization apparatuses, and articles of manufacture are described according to some aspects. In one aspect, a data visualization method includes accessing a plurality of initial documents at a first moment in time, first processing the initial documents providing processed initial documents, first identifying a plurality of first associations of the initial documents using the processed initial documents, generating a first visualization depicting the first associations, accessing a plurality of additional documents at a second moment in time after the first moment in time, second processing the additional documents providing processed additional documents, second identifying a plurality of second associations of the additional documents and at least some of the initial documents, wherein the second identifying comprises identifying using the processed initial documents and the processed additional documents, and generating a second visualization depicting the second associations.

  18. Engineering graphics data entry for space station data base

    NASA Technical Reports Server (NTRS)

    Lacovara, R. C.

    1986-01-01

    The entry of graphical engineering data into the Space Station Data Base was examined. Discussed were: representation of graphics objects; representation of connectivity data; graphics capture hardware; graphics display hardware; site-wide distribution of graphics, and consolidation of tools and hardware. A fundamental assumption was that existing equipment such as IBM based graphics capture software and VAX networked facilities would be exploited. Defensible conclusions reached after study and simulations of use of these systems at the engineering level are: (1) existing IBM based graphics capture software is an adequate and economical means of entry of schematic and block diagram data for present and anticipated electronic systems for Space Station; (2) connectivity data from the aforementioned system may be incorporated into the envisioned Space Station Data Base with modest effort; (3) graphics and connectivity data captured on the IBM based system may be exported to the VAX network in a simple and direct fashion; (4) graphics data may be displayed site-wide on VT-125 terminals and lookalikes; (5) graphics hard-copy may be produced site-wide on various dot-matrix printers; and (6) the system may provide integrated engineering services at both the engineering and engineering management level.

  19. Exploring hyperspectral imaging data sets with topological data analysis.

    PubMed

    Duponchel, Ludovic

    2018-02-13

    Analytical chemistry is rapidly changing. Indeed, we acquire ever more data in order to go ever further in the exploration of complex samples. Hyperspectral imaging has not escaped this trend. It quickly became a tool of choice for molecular characterisation of complex samples in many scientific domains. The main reason is that it simultaneously provides spectral and spatial information. As a result, chemometrics provided many exploration tools (PCA, clustering, MCR-ALS …) well suited for such data structures at an early stage. However, we are today facing a new challenge considering the always increasing number of pixels in the data cubes we have to manage. The idea is therefore to introduce the new paradigm of Topological Data Analysis in order to explore hyperspectral imaging data sets, highlighting its nice properties and specific features. With this paper, we shall also point out that conventional chemometric methods are often based on variance analysis or simply impose a data model which implicitly defines the geometry of the data set. Thus we will show that this is not always appropriate for the exploration of hyperspectral imaging data sets. Copyright © 2017 Elsevier B.V. All rights reserved.
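
    For contrast with the topological approach, the conventional variance-based starting point looks like the sketch below: unfold the hyperspectral cube into a pixel-by-band matrix and project it with PCA. The cube here is random stand-in data, not a real image.

```python
# Unfold a hyperspectral cube (rows x cols x bands) and run a variance-based
# projection (PCA) on the pixel spectra.
import numpy as np
from sklearn.decomposition import PCA

rows, cols, bands = 64, 64, 128
cube = np.random.rand(rows, cols, bands)            # stand-in data cube

pixels = cube.reshape(-1, bands)                    # one spectrum per pixel
scores = PCA(n_components=3).fit_transform(pixels)  # variance-driven projection
score_maps = scores.reshape(rows, cols, 3)          # back to image space
print(score_maps.shape)                             # (64, 64, 3)
```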

  20. Data integration: Combined Imaging and Electrophysiology data in the cloud

    PubMed Central

    Kini, Lohith G.; Davis, Kathryn A.; Wagenaar, Joost B.

    2015-01-01

    There has been an increasing effort to correlate electrophysiology data with imaging in patients with refractory epilepsy over recent years. IEEG.org provides a free-access, rapidly growing archive of imaging data combined with electrophysiology data and patient metadata. It currently contains over 1200 human and animal datasets, with multiple data modalities associated with each dataset (neuroimaging, EEG, EKG, de-identified clinical and experimental data, etc.). The platform is developed around the concept that scientific data sharing requires a flexible platform that allows sharing of data from multiple file-formats. IEEG.org provides high and low-level access to the data in addition to providing an environment in which domain experts can find, visualize, and analyze data in an intuitive manner. Here, we present a summary of the current infrastructure of the platform, available datasets and goals for the near future. PMID:26044858

  1. Alternative Fuels Data Center: New York Transportation Data for Alternative Fuels and Vehicles

    Science.gov Websites

  2. Replacing missing values using trustworthy data values from web data sources

    NASA Astrophysics Data System (ADS)

    Izham Jaya, M.; Sidi, Fatimah; Mat Yusof, Sharmila; Suriani Affendey, Lilly; Ishak, Iskandar; Jabar, Marzanah A.

    2017-09-01

    In practice, collected data are usually incomplete and contain missing values. Existing approaches to managing missing values overlook the importance of trustworthy data values in replacing missing values. Given that trusted complete data are very important in data analysis, we propose a framework for missing value replacement using trustworthy data values from web data sources. The proposed framework adopts an ontology to map data values from web data sources to the incomplete dataset. As data from the web conflict with each other, we propose a trust score measurement based on data accuracy and data reliability. The trust score is then used to select trustworthy data values from web data sources for missing value replacement. We implemented the proposed framework using a financial dataset and present the findings in this paper. Our experiment shows that replacing missing values with trustworthy data values is important, especially in cases of conflicting data, for solving the missing value problem.
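
    A minimal sketch of the trust-score selection follows, assuming hypothetical field names and weights: each candidate value harvested from a web source carries an accuracy score and a source-reliability score, and the candidate with the highest weighted score fills the missing cell.

```python
# Pick the most trustworthy web-sourced candidate to replace a missing value.
def trust_score(candidate, w_accuracy=0.6, w_reliability=0.4):
    # Weighted combination of data accuracy and source reliability (illustrative weights).
    return w_accuracy * candidate["accuracy"] + w_reliability * candidate["reliability"]

def fill_missing(record, field, candidates):
    if record.get(field) is not None:
        return record                                # nothing to replace
    best = max(candidates, key=trust_score)
    record[field] = best["value"]
    return record

record = {"ticker": "XYZ", "market_cap": None}       # hypothetical financial record
candidates = [                                       # conflicting values from web sources
    {"value": 1.20e9, "accuracy": 0.70, "reliability": 0.55},
    {"value": 1.35e9, "accuracy": 0.85, "reliability": 0.90},
]
print(fill_missing(record, "market_cap", candidates))   # fills in 1.35e9
```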

  3. Alternative Fuels Data Center: Maps and Data

    Science.gov Websites

    Data consist of registered vehicles in operation in the United States, including fuel types with small populations, such as methanol and hydrogen vehicles; "undefined" values are vehicles with unidentified fuel types.

  4. Data Publishing and Sharing Via the THREDDS Data Repository

    NASA Astrophysics Data System (ADS)

    Wilson, A.; Caron, J.; Davis, E.; Baltzer, T.

    2007-12-01

    The terms "Team Science" and "Networked Science" have been coined to describe a virtual organization of researchers tied via some intellectual challenge, but often located in different organizations and locations. A critical component to these endeavors is publishing and sharing of content, including scientific data. Imagine pointing your web browser to a web page that interactively lets you upload data and metadata to a repository residing on a remote server, which can then be accessed by others in a secure fasion via the web. While any content can be added to this repository, it is designed particularly for storing and sharing scientific data and metadata. Server support includes uploading of data files that can subsequently be subsetted, aggregrated, and served in NetCDF or other scientific data formats. Metadata can be associated with the data and interactively edited. The THREDDS Data Repository (TDR) is a server that provides client initiated, on demand, location transparent storage for data of any type that can then be served by the THREDDS Data Server (TDS). The TDR provides functionality to: * securely store and "own" data files and associated metadata * upload files via HTTP and gridftp * upload a collection of data as single file * modify and restructure repository contents * incorporate metadata provided by the user * generate additional metadata programmatically * edit individual metadata elements The TDR can exist separately from a TDS, serving content via HTTP. Also, it can work in conjunction with the TDS, which includes functionality to provide: * access to data in a variety of formats via -- OPeNDAP -- OGC Web Coverage Service (for gridded datasets) -- bulk HTTP file transfer * a NetCDF view of datasets in NetCDF, OPeNDAP, HDF-5, GRIB, and NEXRAD formats * serving of very large volume datasets, such as NEXRAD radar * aggregation into virtual datasets * subsetting via OPeNDAP and NetCDF Subsetting services This talk will discuss TDR

  5. Reducing data friction through site-based data curation

    NASA Astrophysics Data System (ADS)

    Thomer, A.; Palmer, C. L.

    2017-12-01

    Much of geoscience research takes place at "scientifically significant sites": localities which have attracted a critical mass of scientific interest, and thereby merit protection by government bodies, as well as the preservation of specimen and data collections and the development of site-specific permitting requirements for access to the site and its associated collections. However, many data standards and knowledge organization schemas do not adequately describe key characteristics of the sites, despite their centrality to research projects. Through work conducted as part of the IMLS-funded Site-Based Data Curation (SBDC) project, we developed a Minimum Information Framework (MIF) for site-based science, in which "information about a site's structure" is considered a core class of information. Here we present our empirically-derived information framework, as well as the methods used to create it. We believe these approaches will lead to the development of more effective data repositories and tools, and thereby will reduce "data friction" in interdisciplinary, yet site-based, geoscience workflows. The Minimum Information Framework for Site-based Research was developed through work at two scientifically significant sites: the hot springs at Yellowstone National Park, which are key to geobiology research; and the La Brea Tar Pits, an important paleontology locality in Southern California. We employed diverse methods of participatory engagement, in which key stakeholders at our sites (e.g. curators, collections managers, researchers, permit officers) were consulted through workshops, focus groups, interviews, action research methods, and collaborative information modeling and systems analysis. These participatory approaches were highly effective in fostering on-going partnership among a diverse team of domain scientists, information scientists, and software developers. The MIF developed in this work may be viewed as a "proto-standard" that can inform future

  6. Minimally buffered data transfers between nodes in a data communications network

    DOEpatents

    Miller, Douglas R.

    2015-06-23

    Methods, apparatus, and products for minimally buffered data transfers between nodes in a data communications network are disclosed that include: receiving, by a messaging module on an origin node, a storage identifier, an origin data type, and a target data type, the storage identifier specifying application storage containing data, the origin data type describing a data subset contained in the origin application storage, the target data type describing an arrangement of the data subset in application storage on a target node; creating, by the messaging module, origin metadata describing the origin data type; selecting, by the messaging module from the origin application storage in dependence upon the origin metadata and the storage identifier, the data subset; and transmitting, by the messaging module to the target node, the selected data subset for storing in the target application storage in dependence upon the target data type without temporarily buffering the data subset.
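
    A toy model of this flow is sketched below: the "origin data type" is reduced to metadata (offset, stride, count) describing a subset of origin application storage, the subset is selected as a strided view with no intermediate copy, and only the selected elements are handed to the transport. The names and the descriptor format are illustrative, not the patented implementation.

```python
import numpy as np

origin_storage = np.arange(20, dtype=np.int32)       # application storage on the origin node

def make_metadata(offset, stride, count):
    # Stand-in for "origin metadata describing the origin data type".
    return {"offset": offset, "stride": stride, "count": count}

def select_subset(storage, meta):
    # Strided view into application storage: no temporary buffer is created here.
    stop = meta["offset"] + meta["stride"] * meta["count"]
    return storage[meta["offset"]:stop:meta["stride"]]

def transmit(subset):
    # Stand-in for the network send to the target node.
    return subset.tobytes()

meta = make_metadata(offset=2, stride=3, count=5)
wire = transmit(select_subset(origin_storage, meta))
print(np.frombuffer(wire, dtype=np.int32))            # [ 2  5  8 11 14]
```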

  7. Data integration: Combined imaging and electrophysiology data in the cloud.

    PubMed

    Kini, Lohith G; Davis, Kathryn A; Wagenaar, Joost B

    2016-01-01

    There has been an increasing effort to correlate electrophysiology data with imaging in patients with refractory epilepsy over recent years. IEEG.org provides a free-access, rapidly growing archive of imaging data combined with electrophysiology data and patient metadata. It currently contains over 1200 human and animal datasets, with multiple data modalities associated with each dataset (neuroimaging, EEG, EKG, de-identified clinical and experimental data, etc.). The platform is developed around the concept that scientific data sharing requires a flexible platform that allows sharing of data from multiple file formats. IEEG.org provides high- and low-level access to the data in addition to providing an environment in which domain experts can find, visualize, and analyze data in an intuitive manner. Here, we present a summary of the current infrastructure of the platform, available datasets and goals for the near future. Copyright © 2015 Elsevier Inc. All rights reserved.

  8. Encouraging data citation and discovery with the Data Citation Index.

    PubMed

    Force, Megan M; Robinson, Nigel J

    2014-10-01

    An overview of the Data Citation Index is provided. Thomson Reuters developed this resource in response to a stated desire among members of the research community for increased attribution of non-traditional scholarly output. Launched in October of 2012 on the Web of Science research platform, its aims include linking published research articles to their underlying data sets and tracking the citation of the data, as well as encouraging bibliographic citation of data. Cross-disciplinary search capabilities in the Index enable new possibilities for data discovery and synthesis. Data repositories are evaluated with respect to various selection criteria, with particular attention to their relevance to scientific and scholarly research. Index content reflects current data deposition practices. As data citation standards and practices continue to move toward widespread formalization and adoption, the initiative seeks to address issues of data citation, reuse, and author credit in a developing climate.

  9. Encouraging data citation and discovery with the Data Citation Index

    NASA Astrophysics Data System (ADS)

    Force, Megan M.; Robinson, Nigel J.

    2014-10-01

    An overview of the Data Citation Index is provided. Thomson Reuters developed this resource in response to a stated desire among members of the research community for increased attribution of non-traditional scholarly output. Launched in October of 2012 on the Web of Science research platform, its aims include linking published research articles to their underlying data sets and tracking the citation of the data, as well as encouraging bibliographic citation of data. Cross-disciplinary search capabilities in the Index enable new possibilities for data discovery and synthesis. Data repositories are evaluated with respect to various selection criteria, with particular attention to their relevance to scientific and scholarly research. Index content reflects current data deposition practices. As data citation standards and practices continue to move toward widespread formalization and adoption, the initiative seeks to address issues of data citation, reuse, and author credit in a developing climate.

  10. Towards data warehousing and mining of protein unfolding simulation data.

    PubMed

    Berrar, Daniel; Stahl, Frederic; Silva, Candida; Rodrigues, J Rui; Brito, Rui M M; Dubitzky, Werner

    2005-10-01

    The prediction of protein structure and the precise understanding of protein folding and unfolding processes remains one of the greatest challenges in structural biology and bioinformatics. Computer simulations based on molecular dynamics (MD) are at the forefront of the effort to gain a deeper understanding of these complex processes. Currently, these MD simulations are usually on the order of tens of nanoseconds, generate a large amount of conformational data and are computationally expensive. More and more groups run such simulations and generate a myriad of data, which raises new challenges in managing and analyzing these data. Because of the vast range of proteins researchers want to study and simulate, the computational effort needed to generate data, the large data volumes involved, and the different types of analyses scientists need to perform, it is desirable to provide a public repository allowing researchers to pool and share protein unfolding data. To adequately organize, manage, and analyze the data generated by unfolding simulation studies, we designed a data warehouse system that is embedded in a grid environment to facilitate the seamless sharing of available computer resources and thus enable many groups to share complex molecular dynamics simulations on a more regular basis. To gain insight into the conformational fluctuations and stability of the monomeric forms of the amyloidogenic protein transthyretin (TTR), molecular dynamics unfolding simulations of the monomer of human TTR have been conducted. Trajectory data and meta-data of the wild-type (WT) protein and the highly amyloidogenic variant L55P-TTR represent the test case for the data warehouse. Web and grid services, especially pre-defined data mining services that can run on or 'near' the data repository of the data warehouse, are likely to play a pivotal role in the analysis of molecular dynamics unfolding data.

  11. Data Democratization - Promoting Real-Time Data Sharing and Use Worldwide

    NASA Astrophysics Data System (ADS)

    Yoksas, T. C.; Almeida, W. G.; Leon, V. C.

    2007-05-01

    The Unidata Program Center (Unidata) of the University Corporation for Atmospheric Research (UCAR) is actively involved in international collaborations whose goals are the free and open sharing of hydro-meteorological data; the distribution of analysis and visualization tools for those data; the establishment of server technologies that provide easy-to-use, programmatic remote access to a wide variety of datasets; and the building of a community where data, tools, and best practices in education and research are shared. The tools and services provided by Unidata are available to the research and education community free of charge. Data sharing capabilities are being provided by Unidata's Internet Data Distribution (IDD) system, a community-based effort that has been the primary source of real-time meteorological data in the US university community for over a decade. A collaboration among Unidata, Brazil's Centro de Previsão de Tempo e Estudos Climáticos (CPTEC), the Universidade Federal do Rio de Janeiro (UFRJ), and the Universidade de São Paulo (USP) has resulted in the creation of a Brazilian peer of the North American IDD, the IDD-Brasil. Collaboration between Unidata and the Universidad de Costa Rica (UCR) seeks to extend IDD data sharing throughout Central America and the Caribbean in an IDD-Caribe. Efforts aimed at creating a data sharing network for researchers on the Antarctic continent have resulted in the establishment of the Antarctic-IDD. Most recently, explorations of data sharing between UCAR and select countries in Africa have begun. Data analysis and visualization capabilities are available through Unidata in a suite of freely available applications: the National Centers for Environmental Prediction (NCEP) GEneral Meteorology PAcKage (GEMPAK); the Unidata Integrated Data Viewer (IDV); and the University of Wisconsin Space Science and Engineering Center (SSEC) Man-computer Interactive Data Access System (McIDAS). Remote data access capabilities are

  12. Data Mining of NASA Boeing 737 Flight Data: Frequency Analysis of In-Flight Recorded Data

    NASA Technical Reports Server (NTRS)

    Butterfield, Ansel J.

    2001-01-01

    Data recorded during flights of the NASA Trailblazer Boeing 737 have been analyzed to ascertain the presence of aircraft structural responses from various excitations such as the engine, aerodynamic effects, wind gusts, and control system operations. The NASA Trailblazer Boeing 737 was chosen as a focus of the study because of a large quantity of its flight data records. The goal of this study was to determine if any aircraft structural characteristics could be identified from flight data collected for measuring non-structural phenomena. A number of such data were examined for spatial and frequency correlation as a means of discovering hidden knowledge of the dynamic behavior of the aircraft. Data recorded from on-board dynamic sensors over a range of flight conditions showed consistently appearing frequencies. Those frequencies were attributed to aircraft structural vibrations.

  13. The environmental-data automated track annotation (Env-DATA) system: linking animal tracks with environmental data

    USGS Publications Warehouse

    Dodge, Somayeh; Bohrer, Gil; Weinzierl, Rolf P.; Davidson, Sarah C.; Kays, Roland; Douglas, David C.; Cruz, Sebastian; Han, J.; Brandes, David; Wikelski, Martin

    2013-01-01

    The movement of animals is strongly influenced by external factors in their surrounding environment such as weather, habitat types, and human land use. With advances in positioning and sensor technologies, it is now possible to capture animal locations at high spatial and temporal granularities. Likewise, scientists have an increasing access to large volumes of environmental data. Environmental data are heterogeneous in source and format, and are usually obtained at different spatiotemporal scales than movement data. Indeed, there remain scientific and technical challenges in developing linkages between the growing collections of animal movement data and the large repositories of heterogeneous remote sensing observations, as well as in the developments of new statistical and computational methods for the analysis of movement in its environmental context. These challenges include retrieval, indexing, efficient storage, data integration, and analytical techniques.

  14. Alternative Fuels Data Center: About the Alternative Fuels Data Center

    Science.gov Websites

  15. DataUp: Helping manage and archive data within the researcher's workflow

    NASA Astrophysics Data System (ADS)

    Strasser, C.

    2012-12-01

    There are many barriers to data management and sharing among earth and environmental scientists; among the most significant is a lack of knowledge about best practices for data management, metadata standards, and appropriate data repositories for archiving and sharing data. We have developed an open-source add-in for Excel and an open-source web application intended to help researchers overcome these barriers. DataUp helps scientists to (1) determine whether their file is CSV compatible, (2) generate metadata in a standard format, (3) retrieve an identifier to facilitate data citation, and (4) deposit their data into a repository. The researcher does not need a prior relationship with a data repository to use DataUp; the newly implemented ONEShare repository, a DataONE member node, is available for any researcher to archive and share their data. By meeting researchers where they already work, in spreadsheets, DataUp becomes part of the researcher's workflow, and data management and sharing become easier. Future enhancement of DataUp will rely on members of the community adopting and adapting the DataUp tools to meet their unique needs, including connecting to analytical tools, adding new metadata schema, and expanding the list of connected data repositories. DataUp is a collaborative project between Microsoft Research Connections, the University of California's California Digital Library, the Gordon and Betty Moore Foundation, and DataONE.
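
    Two of the steps DataUp automates can be roughly sketched as follows: check that a file parses as CSV with a consistent number of columns, and emit a minimal metadata stub for it. The file name and metadata fields are illustrative, not a specific standard.

```python
# Check CSV compatibility and generate a minimal metadata stub for the file.
import csv
import json
import pathlib

def check_csv(path):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    widths = {len(r) for r in rows if r}
    return len(widths) == 1, (rows[0] if rows else [])

def make_metadata(path, header):
    return {"title": pathlib.Path(path).stem, "variables": header, "format": "text/csv"}

ok, header = check_csv("observations.csv")            # hypothetical file
if ok:
    print(json.dumps(make_metadata("observations.csv", header), indent=2))
else:
    print("file is not CSV-compatible: inconsistent column counts")
```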

  16. The Materials Data Facility: Data Services to Advance Materials Science Research

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blaiszik, B.; Chard, K.; Pruyne, J.

    2016-07-06

    With increasingly strict data management requirements from funding agencies and institutions, expanding focus on the challenges of research replicability, and growing data sizes and heterogeneity, new data needs are emerging in the materials community. The materials data facility (MDF) operates two cloud-hosted services, data publication and data discovery, with features to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific) and automatically-extracted metadata in a registry while the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. The MDF services empower individual researchers, research projects, and institutions to (I) publish research datasets, regardless of size, from local storage, institutional data stores, or cloud storage, without involvement of third-party publishers; (II) build, share, and enforce extensible domain-specific custom metadata schemas; (III) interact with published data and metadata via representational state transfer (REST) application program interfaces (APIs) to facilitate automation, analysis, and feedback; and (IV) access a data discovery model that allows researchers to search, interrogate, and eventually build on existing published data. We describe MDF's design, current status, and future plans.

  17. The Materials Data Facility: Data Services to Advance Materials Science Research

    NASA Astrophysics Data System (ADS)

    Blaiszik, B.; Chard, K.; Pruyne, J.; Ananthakrishnan, R.; Tuecke, S.; Foster, I.

    2016-08-01

    With increasingly strict data management requirements from funding agencies and institutions, expanding focus on the challenges of research replicability, and growing data sizes and heterogeneity, new data needs are emerging in the materials community. The materials data facility (MDF) operates two cloud-hosted services, data publication and data discovery, with features to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific) and automatically-extracted metadata in a registry while the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. The MDF services empower individual researchers, research projects, and institutions to (I) publish research datasets, regardless of size, from local storage, institutional data stores, or cloud storage, without involvement of third-party publishers; (II) build, share, and enforce extensible domain-specific custom metadata schemas; (III) interact with published data and metadata via representational state transfer (REST) application program interfaces (APIs) to facilitate automation, analysis, and feedback; and (IV) access a data discovery model that allows researchers to search, interrogate, and eventually build on existing published data. We describe MDF's design, current status, and future plans.
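
    Purely as an illustration of the REST-style interaction pattern described above, the sketch below posts a metadata record to a publication endpoint and queries a discovery endpoint. The endpoints, payload fields, and token handling are hypothetical placeholders, not the actual MDF API.

```python
import requests

BASE = "https://example.org/mdf/api"                  # hypothetical service root
headers = {"Authorization": "Bearer <token>"}         # placeholder credential

# Publish: register a dataset's metadata and receive a persistent identifier.
record = {"title": "Alloy hardness measurements", "source": "globus://endpoint/path"}
resp = requests.post(f"{BASE}/publish", json=record, headers=headers)
print(resp.json().get("identifier"))

# Discover: free-text plus faceted search against the registered metadata.
resp = requests.get(f"{BASE}/search",
                    params={"q": "hardness", "facet": "material:steel"},
                    headers=headers)
for hit in resp.json().get("results", []):
    print(hit["title"], hit["identifier"])
```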

  18. Data-driven medicinal chemistry in the era of big data.

    PubMed

    Lusher, Scott J; McGuire, Ross; van Schaik, René C; Nicholson, C David; de Vlieg, Jacob

    2014-07-01

    Science, and the way we undertake research, is changing. The increasing rate of data generation across all scientific disciplines is providing incredible opportunities for data-driven research, with the potential to transform our current practices. The exploitation of so-called 'big data' will enable us to undertake research projects never previously possible but should also stimulate a re-evaluation of all our data practices. Data-driven medicinal chemistry approaches have the potential to improve decision making in drug discovery projects, providing that all researchers embrace the role of 'data scientist' and uncover the meaningful relationships and patterns in available data. Copyright © 2013 Elsevier Ltd. All rights reserved.

  19. Puget Sound sediment-trap data: 1980-1985. Data report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Paulson, A.J.; Baker, E.T.; Feely, R.A.

    1991-12-01

    In 1979, scientists at the Pacific Marine Environmental Laboratory began investigating the sources, transformation, transport and fate of pollutants in Puget Sound and its watershed under Sec. 202 of the Marine Protection, Research and Sanctuaries Act of 1971 (P.L. 92-532), which called in part for '...a comprehensive and continuing program of research with respect to the possible long range effects of pollution, overfishing, and man-induced changes of ocean ecosystems...' The effort was called the Long-Range Effects Research Program (L-RERP) after language in the Act and was later called the PMEL Marine Environmental Quality Program. The Long-Range Effects Research Program consisted of (1) sampling dissolved and particulate constituents in the water column by bottle sampling, (2) sampling settling particles by sediment trap and (3) sampling sediments by grab, box, gravity and Kasten corers. In this data report, a variety of data from particles collected in 104 traps deployed on 34 moorings in open waters between 1980 and 1985 are presented. The text of the data report begins with the sampling and analytical methods and the accompanying quality control/quality assurance data. The data sections are a summary of the available data and published literature in which the data are interpreted, along with a catalogue of the data available in the Appendix (on microfiche located in the back pocket of the data report).

  20. Remote Data Exploration with the Interactive Data Language (IDL)

    NASA Technical Reports Server (NTRS)

    Galloy, Michael

    2013-01-01

    A difficulty for many NASA researchers is that often the data to analyze are located remotely from the scientist and the data are too large to transfer for local analysis. Researchers have developed the Data Access Protocol (DAP) for accessing remote data. Presently one can use DAP from within IDL, but the IDL-DAP interface is both limited and cumbersome. A more powerful and user-friendly interface to DAP for IDL has been developed. Users are able to browse remote data sets graphically, select partial data to retrieve, import that data and make customized plots, and have an interactive IDL command line session simultaneous with the remote visualization. All of these IDL-DAP tools are easily and seamlessly usable by any IDL user. IDL and DAP are both widely used in science, but were not easily used together. The IDL DAP bindings were incomplete and had numerous bugs that prevented their serious use. For example, the existing bindings did not read DAP Grid data, which is the organization of nearly all NASA datasets currently served via DAP. This project uniquely provides a fully featured, user-friendly interface to DAP from IDL, both from the command line and a GUI application. The DAP Explorer GUI application makes browsing a dataset more user-friendly, while also providing the capability to run user-defined functions on specified data. Methods for running remote functions on the DAP server were investigated, and a technique for accomplishing this task was decided upon.

  1. The SOOS Data Portal, providing access to Southern Oceans data

    NASA Astrophysics Data System (ADS)

    Proctor, Roger; Finney, Kim; Blain, Peter; Taylor, Fiona; Newman, Louise; Meredith, Mike; Schofield, Oscar

    2013-04-01

    The Southern Ocean Observing System (SOOS) is an international initiative to enhance, coordinate and expand the strategic observations of the Southern Oceans that are required to address key scientific and societal challenges. A key component of SOOS will be the creation and maintenance of a Southern Ocean Data Portal to provide improved access to historical and ongoing data (Schofield et al., 2012, Eos, Vol. 93, No. 26, pp 241-243). The scale of this effort will require strong leveraging of existing data centres, new cyberinfrastructure development efforts, and defined data collection, quality control, and archiving procedures across the international community. The task of assembling the SOOS data portal is assigned to the SOOS Data Management Sub-Committee. The information infrastructure chosen for the SOOS data portal is based on the Australian Ocean Data Network (AODN, http://portal.aodn.org.au). The AODN infrastructure is built on open-source tools and the use of international standards ensures efficiency of data exchange and interoperability between contributing systems. OGC standard web services protocols are used for serving of data via the internet. These include Web Map Service (WMS) for visualisation, Web Feature Service (WFS) for data download, and Catalogue Service for Web (CSW) for catalogue exchange. The portal offers a number of tools to access and visualize data: - a Search link to the metadata catalogue enables search and discovery by simple text search, by geographic area, temporal extent, keyword, parameter, organisation, or by any combination of these, allowing users to gain access to further information and/or the data for download. Also, searches can be restricted to items which have either data to download, or attached map layers, or both - a Map interface for discovery and display of data, with the ability to change the style and opacity of layers, add additional data layers via OGC Web Map Services, view animated timeseries datastreams
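
    A client-side sketch of using the portal's OGC services with OWSLib is shown below; the service URLs, layer name, and feature type are placeholders, and the bounding box is an illustrative Southern Ocean extent.

```python
# Request a rendered map via WMS and features via WFS from hypothetical endpoints.
from owslib.wms import WebMapService
from owslib.wfs import WebFeatureService

wms = WebMapService("https://example.org/geoserver/wms", version="1.1.1")
img = wms.getmap(layers=["soos:sea_surface_temperature"],    # hypothetical layer
                 styles=[""],
                 srs="EPSG:4326",
                 bbox=(-180, -80, 180, -40),
                 size=(800, 400),
                 format="image/png")
with open("sst.png", "wb") as f:
    f.write(img.read())

wfs = WebFeatureService("https://example.org/geoserver/wfs", version="1.1.0")
resp = wfs.getfeature(typename=["soos:moorings"],            # hypothetical feature type
                      bbox=(-180, -80, 180, -40))
print(resp.read()[:200])                                     # first bytes of the GML response
```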

  2. Methods for examining data quality in healthcare integrated data repositories.

    PubMed

    Huser, Vojtech; Kahn, Michael G; Brown, Jeffrey S; Gouripeddi, Ramkiran

    2018-01-01

    This paper summarizes the content of a workshop focused on data quality. The first speaker (VH) described the data quality infrastructure and data quality evaluation methods currently in place within the Observational Health Data Sciences and Informatics (OHDSI) consortium. The speaker described in detail a data quality tool called Achilles Heel and the latest developments for extending this tool. Interim results of an ongoing data quality study within the OHDSI consortium were also presented. The second speaker (MK) described lessons learned and new data quality checks developed by the PEDSnet pediatric research network. The last two speakers (JB, RG) described tools developed by the Sentinel Initiative and the University of Utah's service-oriented framework. Throughout and in a closing discussion, the workshop considered how data quality assessment can be advanced by combining the best features of each network.
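
    As a generic illustration of the kind of checks such tools run (not Achilles Heel itself), the sketch below computes null rates and flags implausible values in a small table; column names and thresholds are hypothetical.

```python
import pandas as pd

persons = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "year_of_birth": [1950, 2035, None, 1988],    # one future year, one missing value
    "gender": ["F", "M", "M", None],
})

issues = []
null_rates = persons.isna().mean()                 # fraction missing per column
issues += [f"{col}: {rate:.0%} missing" for col, rate in null_rates.items() if rate > 0.1]
bad_years = persons.query("year_of_birth > 2025 or year_of_birth < 1900")
if not bad_years.empty:
    issues.append(f"{len(bad_years)} implausible year_of_birth value(s)")
print(issues)
```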

  3. Advancing Long Tail Data Capture and Access Through Trusted, Community-Driven Data Services at the IEDA Data Facility

    NASA Astrophysics Data System (ADS)

    Lehnert, K. A.; Carbotte, S. M.; Ferrini, V.; Hsu, L.; Arko, R. A.; Walker, J. D.; O'hara, S. H.

    2012-12-01

    Substantial volumes of data in the Earth Sciences are collected in small- to medium-size projects by individual investigators or small research teams, known as the 'Long Tail' of science. Traditionally, these data have largely stayed 'in the dark', i.e. they have not been properly archived, and have therefore been inaccessible and underutilized. The primary reason has been the lack of appropriate infrastructure, from adequate repositories, to resources and support for investigators to properly manage their data, to community standards and best practices. Lack of credit for data management and for the data themselves has contributed to the reluctance of investigators to share their data. IEDA (Integrated Earth Data Applications), an NSF-funded data facility for solid earth geoscience data, has developed a comprehensive suite of data services that are designed to address the concerns and needs of investigators. IEDA's data publication service registers datasets with DOIs and ensures their proper citation and attribution. IEDA is working with publishers on advanced linkages between datasets in the IEDA repository and scientific online articles to facilitate access to the data, enhance their visibility, and augment their use and citation. IEDA's investigator support ranges from individual support for data management to tools, tutorials, and virtual or face-to-face workshops that guide and assist investigators with data management planning, data submission, and data documentation. A critical aspect of IEDA's concept has been the disciplinary expertise within the team and its strong liaison with the science community, as well as community-based governance. These have been fundamental in gaining the trust and support of the community, which has led to significantly improved data preservation and access in the communities served by IEDA.

  4. EARS : Repositioning data management near data acquisition.

    NASA Astrophysics Data System (ADS)

    Sinquin, Jean-Marc; Sorribas, Jordi; Diviacco, Paolo; Vandenberghe, Thomas; Munoz, Raquel; Garcia, Oscar

    2016-04-01

    The EU FP7 projects Eurofleets and Eurofleets2 are a Europe-wide alliance of marine research centres that aim to share their research vessels, to improve information sharing on planned, current and completed cruises and on details of ocean-going research vessels and specialized equipment, and to durably improve the cost-effectiveness of cruises. Within this context, logging information on how, when and where anything happens on board the vessel is crucial for data users at a later stage. This is an essential step in the process of data quality control, as it can assist in understanding anomalies and unexpected trends recorded in the acquired data sets. In this way the completeness of the metadata is improved, as it is recorded accurately at the origin of the measurement. The collection of this crucial information has been done in very different ways, using different procedures, formats and pieces of software, across the European research fleet. At the time the Eurofleets project started, every institution and country had adopted different strategies and approaches, which complicated the task of users who need to log general-purpose information and events on board whenever they access a different platform, losing the opportunity to produce this valuable metadata on board. Among the many goals of the Eurofleets project, a very important task is the development of an "event log software" called EARS (Eurofleets Automatic Reporting System) that enables scientists and operators to record what happens during a survey. EARS will allow users to fill, in a standardized way, the current gap in metadata description, which only very seldom links data with their history. Events generated automatically by acquisition instruments will also be handled, enhancing the granularity and precision of the event annotation. The adoption of a common procedure to log survey events and a common terminology to describe them is crucial to provide

  5. Vector data structure conversion at the EROS Data Center

    USGS Publications Warehouse

    van Roessel, Jan W.; Doescher, S.W.

    1986-01-01

    With the increasing prevalence of GIS systems and the processing of spatial data, conversion of data from one system to another has become a more serious problem. This report describes the approach taken to arrive at a solution at the EROS Data Center. The report consists of a main section and a number of appendices. The methodology is described in the main section, while the appendices have system specific descriptions. The overall approach is based on a central conversion hub consisting of a relational database manager and associated tools, with a standard data structure for the transfer of spatial data. This approach is the best compromise between the two goals of reducing the overall interfacing effort and producing efficient system interfaces, while the tools can be used to arrive at a progression of interface sophistication ranging from toolbench to smooth flow. The appendices provide detailed information on a number of spatial data handling systems and data structures and existing interfaces as well as interfaces developed with the described methodology.

  6. Big data, big knowledge: big data for personalized healthcare.

    PubMed

    Viceconti, Marco; Hunter, Peter; Hose, Rod

    2015-07-01

    The idea that the purely phenomenological knowledge that we can extract by analyzing large amounts of data can be useful in healthcare seems to contradict the desire of VPH researchers to build detailed mechanistic models for individual patients. But in practice no model is ever entirely phenomenological or entirely mechanistic. We propose in this position paper that big data analytics can be successfully combined with VPH technologies to produce robust and effective in silico medicine solutions. In order to do this, big data technologies must be further developed to cope with some specific requirements that emerge from this application. Such requirements are: working with sensitive data; analytics of complex and heterogeneous data spaces, including nontextual information; distributed data management under security and performance constraints; specialized analytics to integrate bioinformatics and systems biology information with clinical observations at tissue, organ and organism scales; and specialized analytics to define the "physiological envelope" during the daily life of each patient. These domain-specific requirements suggest a need for targeted funding, in which big data technologies for in silico medicine become the research priority.

  7. Data management and data enrichment for systems biology projects.

    PubMed

    Wittig, Ulrike; Rey, Maja; Weidemann, Andreas; Müller, Wolfgang

    2017-11-10

    Collecting, curating, interlinking, and sharing high-quality data are central to de.NBI-SysBio, the systems biology data management service center within the de.NBI network (German Network for Bioinformatics Infrastructure). The work of the center is guided by the FAIR principles for scientific data management and stewardship. FAIR stands for the four foundational principles Findability, Accessibility, Interoperability, and Reusability, which were established to enhance the ability of machines to automatically find, access, exchange and use data. Within this overview paper we describe three tools (SABIO-RK, Excemplify, SEEK) that exemplify the contribution of de.NBI-SysBio services to FAIR data, models, and experimental methods storage and exchange. The interconnectivity of the tools and the data workflow within systems biology projects will be explained. For many years we have been the German partner in the FAIRDOM initiative (http://fair-dom.org) to establish a European data and model management service facility for systems biology. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.

  8. The Challenges of Data Quality Evaluation in a Joint Data Warehouse

    PubMed Central

    Bae, Charles J.; Griffith, Sandra; Fan, Youran; Dunphy, Cheryl; Thompson, Nicolas; Urchek, John; Parchman, Alandra; Katzan, Irene L.

    2015-01-01

    Introduction: The use of clinically derived data from electronic health records (EHRs) and other electronic clinical systems can greatly facilitate clinical research as well as operational and quality initiatives. One approach for making these data available is to incorporate data from different sources into a joint data warehouse. When using such a data warehouse, it is important to understand the quality of the data. The primary objective of this study was to determine the completeness and concordance of common types of clinical data available in the Knowledge Program (KP) joint data warehouse, which contains feeds from several electronic systems including the EHR. Methods: A manual review was performed of specific data elements for 250 patients from an EHR, and these were compared with corresponding elements in the KP data warehouse. Completeness and concordance were calculated for five categories of data including demographics, vital signs, laboratory results, diagnoses, and medications. Results: In general, data elements for demographics, vital signs, diagnoses, and laboratory results were present in more cases in the source EHR compared to the KP. When data elements were available in both sources, there was a high concordance. In contrast, the KP data warehouse documented a higher prevalence of deaths and medications compared to the EHR. Discussion: Several factors contributed to the discrepancies between data in the KP and the EHR, including the start date and frequency of data feed updates into the KP, the inability to transfer data stored in unstructured formats (e.g., free text or scanned documents), and incomplete and missing data variables in the source EHR. Conclusion: When evaluating the quality of a data warehouse with multiple data sources, assessing completeness and concordance between the data set and the source data may be better than designating one as a gold standard. This will allow the user to optimize the method and timing of data transfer
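    To make the completeness and concordance measures concrete, the following sketch (Python/pandas, with hypothetical column names) compares a single data element between an EHR extract and a warehouse extract keyed by patient identifier.

    ```python
    import pandas as pd

    # Hypothetical extracts of one element (systolic blood pressure) from two sources.
    ehr = pd.DataFrame({"patient_id": [1, 2, 3, 4], "sbp": [120, None, 135, 142]})
    warehouse = pd.DataFrame({"patient_id": [1, 2, 3, 4], "sbp": [120, 118, None, 142]})

    merged = ehr.merge(warehouse, on="patient_id", suffixes=("_ehr", "_dw"))

    # Completeness: fraction of patients with a value present in each source.
    completeness_ehr = merged["sbp_ehr"].notna().mean()
    completeness_dw = merged["sbp_dw"].notna().mean()

    # Concordance: agreement among patients where both sources have a value.
    both = merged.dropna(subset=["sbp_ehr", "sbp_dw"])
    concordance = (both["sbp_ehr"] == both["sbp_dw"]).mean()

    print(completeness_ehr, completeness_dw, concordance)
    ```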

  9. Data management by using R: big data clinical research series.

    PubMed

    Zhang, Zhongheng

    2015-11-01

    The electronic medical record (EMR) system has been widely adopted in clinical practice. In contrast to traditional handwritten records, the EMR makes big data clinical research feasible. The most important feature of big data research is its real-world setting. Furthermore, big data research can provide all aspects of information related to healthcare. However, big data research requires skills in data management, which are often lacking in the curriculum of medical education. This greatly hinders doctors from testing their clinical hypotheses by using the EMR. To bridge this gap, a series of articles introducing data management techniques has been put forward to guide clinicians towards big data clinical research. The present educational article first introduces some basic knowledge of the R language, followed by data management skills for creating new variables, recoding variables and renaming variables. These are very basic skills that may be used in every big data research project.
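    The article itself works in R; as an analogous illustration of the same basic operations (creating, recoding and renaming variables), here is a short Python/pandas sketch with hypothetical variables.

    ```python
    import pandas as pd

    # Hypothetical patient table; the article performs these steps in R.
    df = pd.DataFrame({"age": [34, 71, 55], "sbp": [118, 160, 140], "dbp": [76, 95, 88]})

    # Create a new variable (mean arterial pressure, a standard derived value).
    df["map"] = df["dbp"] + (df["sbp"] - df["dbp"]) / 3

    # Recode a continuous variable into categories.
    df["age_group"] = pd.cut(df["age"], bins=[0, 45, 65, 120],
                             labels=["young", "middle", "older"])

    # Rename variables.
    df = df.rename(columns={"sbp": "systolic_bp", "dbp": "diastolic_bp"})
    print(df)
    ```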

  10. AirData

    EPA Pesticide Factsheets

    The AirData site provides access to yearly summaries of United States air pollution data, taken from EPA's air pollution databases. AirData has information about where air pollution comes from (emissions) and how much pollution is in the air outside our homes and work places (monitoring).

  11. Data disclosure and data sharing in scientific research.

    PubMed

    Allison, J R; Cooper, W W

    1992-01-01

    Data sharing is examined for its bearing on (i) quality assurance and (ii) extensions of results in scientific research as well as (iii) part of a tradition of openness in science. It is suggested that sharing can be accomplished in a simple manner that is also sufficiently flexible to fit varying individual situations by asking authors of data dependent articles and grant proposals to footnote (a) whether they are willing to make their data available to others and, if so, (b) how the data may be accessed. Appendices report results from a survey of current policies and practices in professional societies and in Federal government fund granting agencies. Emphasis is on the social and management sciences.

  12. High quality data: An evaluation of AIM data quality and data quality procedures

    USDA-ARS?s Scientific Manuscript database

    The goal of every monitoring program is to collect high-quality data which can then be used to provide information to decision makers. The Bureau of Land Management (BLM) Assessment, Inventory, and Monitoring (AIM) program is one such data set which provides rangeland status, condition, and trend in...

  13. Virtual data

    NASA Astrophysics Data System (ADS)

    Bjorklund, E.

    1994-12-01

    In the 1970s, when computers were memory limited, operating system designers created the concept of "virtual memory", which gave users the ability to address more memory than physically existed. In the 1990s, many large control systems have the potential of becoming data limited. We propose that many of the principles behind virtual memory systems (working sets, locality, caching and clustering) can also be applied to data-limited systems, creating, in effect, "virtual data systems". At the Los Alamos National Laboratory's Clinton P. Anderson Meson Physics Facility (LAMPF), we have applied these principles to a moderately sized (10 000 data points) data acquisition and control system. To test the principles, we measured the system's performance during tune-up, production, and maintenance periods. In this paper, we present a general discussion of the principles of a virtual data system along with some discussion of our own implementation and the results of our performance measurements.
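    The caching and working-set principle behind such a virtual data layer can be sketched as a least-recently-used cache; the example below is a generic Python illustration (not the LAMPF implementation), and fetch_from_io is a hypothetical stand-in for a slow control-system read.

    ```python
    from collections import OrderedDict

    class VirtualDataCache:
        """Keep only the most recently used data points in memory; fetch the rest on demand."""

        def __init__(self, capacity=1000):
            self.capacity = capacity
            self._cache = OrderedDict()

        def fetch_from_io(self, point_name):
            # Placeholder for a slow hardware or database read.
            return {"name": point_name, "value": 0.0}

        def get(self, point_name):
            if point_name in self._cache:
                self._cache.move_to_end(point_name)   # mark as recently used
                return self._cache[point_name]
            value = self.fetch_from_io(point_name)
            self._cache[point_name] = value
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)       # evict the least recently used point
            return value
    ```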

  14. Consolidating drug data on a global scale using Linked Data.

    PubMed

    Jovanovik, Milos; Trajanov, Dimitar

    2017-01-21

    Drug product data is available on the Web in a distributed fashion. The reasons lie within the regulatory domains, which exist on a national level. As a consequence, the drug data available on the Web are independently curated by national institutions from each country, leaving the data in varying languages, with a varying structure, granularity level and format, on different locations on the Web. Therefore, one of the main challenges in the realm of drug data is the consolidation and integration of large amounts of heterogeneous data into a comprehensive dataspace, for the purpose of developing data-driven applications. In recent years, the adoption of the Linked Data principles has enabled data publishers to provide structured data on the Web and contextually interlink them with other public datasets, effectively de-siloing them. Defining methodological guidelines and specialized tools for generating Linked Data in the drug domain, applicable on a global scale, is a crucial step to achieving the necessary levels of data consolidation and alignment needed for the development of a global dataset of drug product data. This dataset would then enable a myriad of new usage scenarios, which can, for instance, provide insight into the global availability of different drug categories in different parts of the world. We developed a methodology and a set of tools which support the process of generating Linked Data in the drug domain. Using them, we generated the LinkedDrugs dataset by seamlessly transforming, consolidating and publishing high-quality, 5-star Linked Drug Data from twenty-three countries, containing over 248,000 drug products, over 99,000,000 RDF triples and over 278,000 links to generic drugs from the LOD Cloud. Using the linked nature of the dataset, we demonstrate its ability to support advanced usage scenarios in the drug domain. The process of generating the LinkedDrugs dataset demonstrates the applicability of the methodological guidelines and the
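    To illustrate the kind of output such a methodology produces, the sketch below uses Python's rdflib to describe one drug product as RDF and interlink it with an external resource; the URIs and the schema.org-style vocabulary are placeholders for illustration, not the actual LinkedDrugs schema.

    ```python
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    SCHEMA = Namespace("http://schema.org/")
    g = Graph()

    drug = URIRef("http://example.org/drugs/mk/12345")           # hypothetical national drug product
    generic = URIRef("http://dbpedia.org/resource/Paracetamol")  # external Linked Open Data resource

    g.add((drug, RDF.type, SCHEMA.Drug))
    g.add((drug, RDFS.label, Literal("Paracetamol 500 mg tablets", lang="en")))
    g.add((drug, SCHEMA.activeIngredient, Literal("paracetamol")))
    g.add((drug, RDFS.seeAlso, generic))  # the cross-dataset link that "de-silos" the record

    print(g.serialize(format="turtle"))
    ```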

  15. Advancements in Large-Scale Data/Metadata Management for Scientific Data.

    NASA Astrophysics Data System (ADS)

    Guntupally, K.; Devarakonda, R.; Palanisamy, G.; Frame, M. T.

    2017-12-01

    Scientific data often come with complex and diverse metadata, which are critical for data discovery and for users. The Online Metadata Editor (OME) tool, which was developed by an Oak Ridge National Laboratory team, effectively manages diverse scientific datasets across several federal data centers, such as DOE's Atmospheric Radiation Measurement (ARM) Data Center and USGS's Core Science Analytics, Synthesis, and Libraries (CSAS&L) project. This presentation will focus mainly on recent developments and future strategies for refining the OME tool within these centers. The ARM OME is a standards-based tool (https://www.archive.arm.gov/armome) that allows scientists to create and maintain metadata about their data products. The tool has been improved with new workflows that help metadata coordinators and submitting investigators to submit and review their data more efficiently. The ARM Data Center's newly upgraded Data Discovery Tool (http://www.archive.arm.gov/discovery) uses rich metadata generated by the OME to enable search and discovery of thousands of datasets, while also providing a citation generator and modern order-delivery techniques like Globus (using GridFTP), Dropbox and THREDDS. The Data Discovery Tool also supports incremental indexing, which allows users to find new data as and when they are added. The USGS CSAS&L search catalog employs a custom version of the OME (https://www1.usgs.gov/csas/ome), which has been upgraded with high-level Federal Geographic Data Committee (FGDC) validations and the ability to reserve and mint Digital Object Identifiers (DOIs). The USGS's Science Data Catalog (SDC) (https://data.usgs.gov/datacatalog) allows users to discover a myriad of science data holdings through a web portal. Recent major upgrades to the SDC and ARM Data Discovery Tool include improved harvesting performance and migration using new search software, such as Apache Solr 6.0 for serving up data/metadata to scientific communities. Our presentation will highlight
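    As an illustration of how a Solr-backed discovery tool is typically queried, the sketch below issues a standard Solr select request; the host, core, and field names are hypothetical placeholders, not the actual ARM or USGS endpoints.

    ```python
    import requests

    # Hypothetical Solr host and core; 'q' uses standard Lucene field:value syntax.
    resp = requests.get(
        "https://solr.example.org/solr/metadata/select",
        params={"q": "title:aerosol", "rows": 10, "wt": "json"},
        timeout=30,
    )
    for doc in resp.json()["response"]["docs"]:
        print(doc.get("id"), doc.get("title"))
    ```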

  16. Complexity Science Framework for Big Data: Data-enabled Science

    NASA Astrophysics Data System (ADS)

    Surjalal Sharma, A.

    2016-07-01

    The ubiquity of Big Data has stimulated the development of analytic tools to harness the potential for timely and improved modeling and prediction. While much of the data is available near-real time and can be compiled to specify the current state of the system, the capability to make predictions is lacking. The main reason is the basic nature of Big Data - the traditional techniques are challenged in their ability to cope with its velocity, volume and variability to make optimum use of the available information. Another aspect is the absence of an effective description of the time evolution or dynamics of the specific system, derived from the data. Once such dynamical models are developed, predictions can be made readily. This approach of "letting the data speak for itself" is distinct from the first-principles models based on the understanding of the fundamentals of the system. The predictive capability comes from the data-derived dynamical model, with no modeling assumptions, and can address many issues such as causality and correlation. This approach provides a framework for addressing the challenges in Big Data, especially in the case of spatio-temporal time series data. The reconstruction of dynamics from time series data is based on the recognition that in most systems the different variables or degrees of freedom are coupled nonlinearly, and in the presence of dissipation the state space contracts, effectively reducing the number of variables, thus enabling a description of the system's dynamical evolution and consequently prediction of future states. The predictability is analysed from the intrinsic characteristics of the distribution functions, such as Hurst exponents and Hill estimators. In most systems the distributions have heavy tails, which imply a higher likelihood of extreme events. The characterization of the probabilities of extreme events is critical in many cases, e.g., natural hazards, for proper assessment of risk and mitigation strategies. Big Data with
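    As a concrete example of characterizing a time series from the data themselves, the sketch below estimates a Hurst exponent by rescaled-range (R/S) analysis; it is a rough, generic illustration of the technique, not the authors' specific analysis.

    ```python
    import numpy as np

    def hurst_rs(series, window_sizes=(8, 16, 32, 64, 128)):
        """Estimate the Hurst exponent via rescaled-range (R/S) analysis.
        H ~ 0.5 for uncorrelated noise; H > 0.5 indicates persistence."""
        series = np.asarray(series, dtype=float)
        log_n, log_rs = [], []
        for n in window_sizes:
            rs_values = []
            for start in range(0, len(series) - n + 1, n):
                window = series[start:start + n]
                deviations = np.cumsum(window - window.mean())
                r = deviations.max() - deviations.min()   # range of cumulative deviations
                s = window.std()                          # standard deviation of the window
                if s > 0:
                    rs_values.append(r / s)
            if rs_values:
                log_n.append(np.log(n))
                log_rs.append(np.log(np.mean(rs_values)))
        slope, _ = np.polyfit(log_n, log_rs, 1)           # slope of log(R/S) vs log(n) estimates H
        return slope

    print(hurst_rs(np.random.default_rng(0).normal(size=4096)))  # expect a value near 0.5
    ```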

  17. re3data.org - a global registry of research data repositories

    NASA Astrophysics Data System (ADS)

    Pampel, Heinz; Vierkant, Paul; Elger, Kirsten; Bertelmann, Roland; Witt, Michael; Schirmbacher, Peter; Rücknagel, Jessika; Kindling, Maxi; Scholze, Frank; Ulrich, Robert

    2016-04-01

    re3data.org - the registry of research data repositories lists over 1,400 research data repositories from all over the world making it the largest and most comprehensive online catalog of research data repositories on the web. The registry is a valuable tool for researchers, funding organizations, publishers and libraries. re3data.org provides detailed information about research data repositories, and its distinctive icons help researchers to easily identify relevant repositories for accessing and depositing data sets [1]. Funding agencies, like the European Commission [2] and research institutions like the University of Bielefeld [3] already recommend the use of re3data.org in their guidelines and policies. Several publishers and journals like Copernicus Publications, PeerJ, and Nature's Scientific Data recommend re3data.org in their editorial policies as a tool for the easy identification of appropriate data repositories to store research data. Project partners in re3data.org are the Library and Information Services department (LIS) of the GFZ German Research Centre for Geosciences, the Computer and Media Service at the Humboldt-Universität zu Berlin, the Purdue University Libraries and the KIT Library at the Karlsruhe Institute of Technology (KIT). After its fusion with the U.S. American DataBib in 2014, re3data.org continues as a service of DataCite from 2016 on. DataCite is the international organization for the registration of Digital Object Identifiers (DOI) for research data and aims to improve their citation. The poster describes the current status and the future plans of re3data.org. [1] Pampel H, et al. (2013) Making Research Data Repositories Visible: The re3data.org Registry. PLoS ONE 8(11): e78080. doi:10.1371/journal.pone.0078080. [2] European Commission (2015): Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020. Available: http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi

  18. Streamflow data

    USGS Publications Warehouse

    Holmes, Robert R.; Singh, Vijay P.

    2016-01-01

    The importance of streamflow data to the world’s economy, environmental health, and public safety continues to grow as the population increases. The collection of streamflow data is often an involved and complicated process. The quality of streamflow data hinges on such things as site selection, instrumentation selection, streamgage maintenance and quality assurance, proper discharge measurement techniques, and the development and continued verification of the streamflow rating. This chapter serves only as an overview of the streamflow data collection process as proper treatment of considerations, techniques, and quality assurance cannot be addressed adequately in the space limitations of this chapter. Readers with the need for the detailed information on the streamflow data collection process are referred to the many references noted in this chapter. 

  19. SeaDataNet Pan-European infrastructure for Ocean & Marine Data Management

    NASA Astrophysics Data System (ADS)

    Manzella, G. M.; Maillard, C.; Maudire, G.; Schaap, D.; Rickards, L.; Nast, F.; Balopoulos, E.; Mikhailov, N.; Vladymyrov, V.; Pissierssens, P.; Schlitzer, R.; Beckers, J. M.; Barale, V.

    2007-12-01

    SeaDataNet is developing a pan-European data management infrastructure to ensure access to a large number of marine environmental data (e.g., temperature, salinity, current, sea level, and chemical, physical and biological properties), as well as their safeguarding and long-term archiving. Data are derived from many different sensors installed on board research vessels, satellites and the various platforms of the marine observing system. SeaDataNet provides information on real-time and archived marine environmental data collected at a pan-European level, through directories of marine environmental data and projects. SeaDataNet provides access to the most comprehensive multidisciplinary sets of marine in-situ and remote sensing data, from about 40 laboratories, through user-friendly tools. Data selection and access are operated through the Common Data Index (CDI), XML files compliant with ISO standards and unified dictionaries. Technical developments carried out by SeaDataNet include: a library of standards - metadata standards, compliant with ISO 19115, for communication and interoperability between the data platforms; software for an interoperable online system - interconnection of distributed data centres by interfacing adapted communication technology tools; off-line data management software - Ocean Data View (ODV), developed by AWI, representing the minimum equipment of all the data centres; and training, education and capacity building - training 'on the job' is carried out by IOC-Unesco in Ostende, and the SeaDataNet Virtual Educational Centre internet portal provides basic tools for informal education.

  20. Investigation into interactive graphics data base exchange via Gerber data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Parks, R.E.

    1980-03-01

    Data representing the same interactive graphic design vary greatly from one graphics system manufacturer to another. Therefore, translating the data into a common form to effect data base exchange is a difficult problem. This study examines the use of the Gerber language as a common data form through which design data could be exchanged between unlike systems. For this study the Applicon Graphic System (AGS) was used cyclically to check retention or degeneration of data integrity when the original design was extracted/defined in the Gerber language and reentered into the AGS utilizing various Gerber interface programs. The various parts of this study include the transfer of data not only in the 2D environment, but also 2D to 3D and 3D to 2D. Even though plots of the files appear very similar, the individual data bases are very dissimilar. Programs, both present and future, that might supply needed information or design aids and characteristics would find it virtually impossible to do so from a data base lacking the sophistication and completeness of the original AGS data base. Man-machine hours required to bring the data base back to original quality would be extensive. The loss of data base integrity shown by this study was restricted to an AGS to AGS transfer. The loss could very easily be magnified if the transfer were between unlike systems. 8 figures. (RWR)

  1. Data publication with the structural biology data grid supports live analysis

    DOE PAGES

    Meyer, Peter A.; Socias, Stephanie; Key, Jason; ...

    2016-03-07

    Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. In conclusion, it is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

  2. Data publication with the structural biology data grid supports live analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meyer, Peter A.; Socias, Stephanie; Key, Jason

    Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. In conclusion, it is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

  3. Data publication with the structural biology data grid supports live analysis.

    PubMed

    Meyer, Peter A; Socias, Stephanie; Key, Jason; Ransey, Elizabeth; Tjon, Emily C; Buschiazzo, Alejandro; Lei, Ming; Botka, Chris; Withrow, James; Neau, David; Rajashankar, Kanagalaghatta; Anderson, Karen S; Baxter, Richard H; Blacklow, Stephen C; Boggon, Titus J; Bonvin, Alexandre M J J; Borek, Dominika; Brett, Tom J; Caflisch, Amedeo; Chang, Chung-I; Chazin, Walter J; Corbett, Kevin D; Cosgrove, Michael S; Crosson, Sean; Dhe-Paganon, Sirano; Di Cera, Enrico; Drennan, Catherine L; Eck, Michael J; Eichman, Brandt F; Fan, Qing R; Ferré-D'Amaré, Adrian R; Fromme, J Christopher; Garcia, K Christopher; Gaudet, Rachelle; Gong, Peng; Harrison, Stephen C; Heldwein, Ekaterina E; Jia, Zongchao; Keenan, Robert J; Kruse, Andrew C; Kvansakul, Marc; McLellan, Jason S; Modis, Yorgo; Nam, Yunsun; Otwinowski, Zbyszek; Pai, Emil F; Pereira, Pedro José Barbosa; Petosa, Carlo; Raman, C S; Rapoport, Tom A; Roll-Mecak, Antonina; Rosen, Michael K; Rudenko, Gabby; Schlessinger, Joseph; Schwartz, Thomas U; Shamoo, Yousif; Sondermann, Holger; Tao, Yizhi J; Tolia, Niraj H; Tsodikov, Oleg V; Westover, Kenneth D; Wu, Hao; Foster, Ian; Fraser, James S; Maia, Filipe R N C; Gonen, Tamir; Kirchhausen, Tom; Diederichs, Kay; Crosas, Mercè; Sliz, Piotr

    2016-03-07

    Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

  4. Data publication with the structural biology data grid supports live analysis

    PubMed Central

    Meyer, Peter A.; Socias, Stephanie; Key, Jason; Ransey, Elizabeth; Tjon, Emily C.; Buschiazzo, Alejandro; Lei, Ming; Botka, Chris; Withrow, James; Neau, David; Rajashankar, Kanagalaghatta; Anderson, Karen S.; Baxter, Richard H.; Blacklow, Stephen C.; Boggon, Titus J.; Bonvin, Alexandre M. J. J.; Borek, Dominika; Brett, Tom J.; Caflisch, Amedeo; Chang, Chung-I; Chazin, Walter J.; Corbett, Kevin D.; Cosgrove, Michael S.; Crosson, Sean; Dhe-Paganon, Sirano; Di Cera, Enrico; Drennan, Catherine L.; Eck, Michael J.; Eichman, Brandt F.; Fan, Qing R.; Ferré-D'Amaré, Adrian R.; Christopher Fromme, J.; Garcia, K. Christopher; Gaudet, Rachelle; Gong, Peng; Harrison, Stephen C.; Heldwein, Ekaterina E.; Jia, Zongchao; Keenan, Robert J.; Kruse, Andrew C.; Kvansakul, Marc; McLellan, Jason S.; Modis, Yorgo; Nam, Yunsun; Otwinowski, Zbyszek; Pai, Emil F.; Pereira, Pedro José Barbosa; Petosa, Carlo; Raman, C. S.; Rapoport, Tom A.; Roll-Mecak, Antonina; Rosen, Michael K.; Rudenko, Gabby; Schlessinger, Joseph; Schwartz, Thomas U.; Shamoo, Yousif; Sondermann, Holger; Tao, Yizhi J.; Tolia, Niraj H.; Tsodikov, Oleg V.; Westover, Kenneth D.; Wu, Hao; Foster, Ian; Fraser, James S.; Maia, Filipe R. N C.; Gonen, Tamir; Kirchhausen, Tom; Diederichs, Kay; Crosas, Mercè; Sliz, Piotr

    2016-01-01

    Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis. PMID:26947396

  5. Data Relationships: Towards a Conceptual Model of Scientific Data Catalogs

    NASA Astrophysics Data System (ADS)

    Hourcle, J. A.

    2008-12-01

    As the amount of data, types of processing and storage formats increase, the total number of record permutations increases dramatically. The result is an overwhelming number of records that make identifying the best data object to answer a user's needs more difficult. The issue is further complicated as each archive's data catalog may be designed around different concepts: anything from individual files to be served, series of similarly generated and processed data, or something entirely different. Catalogs may not only be flat tables, but may be structured as multiple tables with each table being a different data series, or a normalized structure of the individual data files. Merging federated search results from archives with different catalog designs can create situations where the data object of interest is difficult to find due to an overwhelming number of seemingly similar or entirely unwanted records. We present a reference model for discussing data catalogs and the complex relationships between similar data objects. We show how the model can be used to improve scientists' ability to quickly identify the best data object for their purposes and discuss technical issues required to use this model in a federated system.

  6. Data management in the mission data system

    NASA Technical Reports Server (NTRS)

    Wagner, David A.

    2005-01-01

    As spacecraft evolve from simple embedded devices into more sophisticated computing platforms with complex behaviors, it is increasingly necessary to model and manage the flow of data, and to provide uniform models for managing data that promote adaptability yet pay heed to the physical limitations of the embedded and space environments.

  7. Technologies for Online Data Management of Oceanographic Data

    NASA Astrophysics Data System (ADS)

    Zodiatis, G.; Hayes, D.; Karaolia, A.; Stylianou, S.; Nikolaidis, A.; Constantinou, I.; Michael, S.; Galanis, G.; Georgiou, G.

    2012-04-01

    The need for efficient and effective online data management is widely recognized today by the marine research community. The Cyprus Oceanography Center at the University of Cyprus, realizing this need, is continuously working in this area and has developed a variety of data management and visualization tools which are currently utilized for both the Mediterranean and the Black Sea. Bythos, CYCOFOS and the LAS server are three different systems employed by the Oceanography Center, each one dealing with different data sets and processes. Bythos is a rich internet application that combines the latest technologies and enables scientists to search, visualize and download climatological oceanographic data, with capabilities of being applied worldwide. CYCOFOS is an operational coastal ocean forecasting and observing system which provides near-real-time predictions for sea currents, hydrological characteristics, waves, swells and tides, as well as remote sensing and in-situ data from various remote observing platforms in the Mediterranean Sea, the EEZ and the coastal areas of Cyprus. LAS (Live Access Server) is deployed to present various types of distributed data sets as a unified virtual database through the use of OPeNDAP networking. It was first applied to provide an integrated, high-resolution system for monitoring the energy potential from sea waves in the Exclusive Economic Zone of Cyprus and the Eastern Mediterranean Levantine Basin. This paper presents the aforementioned technologies as currently adopted by the Cyprus Oceanography Center and describes their utilization in support of both research and operational activities in the Mediterranean.

  8. Data Analytics of Hydraulic Fracturing Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Jovan Yang; Viswanathan, Hari; Hyman, Jeffery

    This is a set of slides on data analytics of hydraulic fracturing data. The conclusions from this research are the following: the authors proposed permeability evolution as a new mechanism to explain hydraulic fracturing trends; they created a model that includes this mechanism, which showed promising results; the paper from this research is ready for submission; and they devised a way to identify and sort refractures in order to study their effects, with that paper currently being written.

  9. Embedded Data Representations.

    PubMed

    Willett, Wesley; Jansen, Yvonne; Dragicevic, Pierre

    2017-01-01

    We introduce embedded data representations, the use of visual and physical representations of data that are deeply integrated with the physical spaces, objects, and entities to which the data refers. Technologies like lightweight wireless displays, mixed reality hardware, and autonomous vehicles are making it increasingly easier to display data in-context. While researchers and artists have already begun to create embedded data representations, the benefits, trade-offs, and even the language necessary to describe and compare these approaches remain unexplored. In this paper, we formalize the notion of physical data referents - the real-world entities and spaces to which data corresponds - and examine the relationship between referents and the visual and physical representations of their data. We differentiate situated representations, which display data in proximity to data referents, and embedded representations, which display data so that it spatially coincides with data referents. Drawing on examples from visualization, ubiquitous computing, and art, we explore the role of spatial indirection, scale, and interaction for embedded representations. We also examine the tradeoffs between non-situated, situated, and embedded data displays, including both visualizations and physicalizations. Based on our observations, we identify a variety of design challenges for embedded data representation, and suggest opportunities for future research and applications.

  10. Classifying Floating Potential Measurement Unit Data Products as Science Data

    NASA Technical Reports Server (NTRS)

    Coffey, Victoria; Minow, Joseph

    2015-01-01

    We are Co-Investigators for the Floating Potential Measurement Unit (FPMU) on the International Space Station (ISS) and members of the FPMU operations and data analysis team. We are providing this memo for the purpose of classifying raw and processed FPMU data products and ancillary data as NASA science data with unrestricted, public availability in order to best support science uses of the data.

  11. Aquatic Toxicity Information Retrieval Data Base (ACQUIRE). Data file

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Not Available

    The purpose of Acquire is to provide scientists and managers quick access to a comprehensive, systematic, computerized compilation of aquatic toxicity data. Scientific papers published both nationally and internationally on the toxicity of chemicals to aquatic organisms and plants are collected and reviewed for ACQUIRE. Independently compiled data files that meet ACQUIRE parameter and quality assurance criteria are also included. Selected toxicity test results and related testing information for any individual chemical from laboratory and field aquatic toxicity effects are included for tests with freshwater and marine organisms. The total number of data records in ACQUIRE is now over 105,300. This includes data from 6000 references, for 5200 chemicals and 2400 test species. A major data file, Acute Toxicity of Organic Chemicals (ATOC), has been incorporated into ACQUIRE. The ATOC file contains laboratory acute test data on 525 organic chemicals using juvenile fathead minnows.

  12. The Environmental Data Initiative data repository: Trustworthy practices that foster preservation, fitness, and reuse for environmental and ecological data

    NASA Astrophysics Data System (ADS)

    Servilla, M. S.; Brunt, J.; Costa, D.; Gries, C.; Grossman-Clarke, S.; Hanson, P. C.; O'Brien, M.; Smith, C.; Vanderbilt, K.; Waide, R.

    2017-12-01

    The Environmental Data Initiative (EDI) is an outgrowth of more than 30 years of information management experience and technology from LTER Network data practitioners. EDI builds upon the PASTA data repository software used by the LTER Network Information System and manages more than 42,000 data packages, containing tabular data, imagery, and other formats. Development of the repository was a community process beginning in 2009 that included numerous working groups for generating use cases, system requirements, and testing of completed software, thereby creating a vested interest in its success and transparency in design. All software is available for review on GitHub, and refinements and new features are ongoing. Documentation is also available on Read-the-docs, including a comprehensive description of all web-service API methods. PASTA is metadata driven and uses the Ecological Metadata Language (EML) standard for describing environmental and ecological data; a simplified Dublin Core document is also available for each data package. Data are aggregated into packages consisting of metadata and other related content described by an OAI-ORE document. Once archived, each data package becomes immutable and permanent; updates are possible through the addition of new revisions. Components of each data package are accessible through a unique identifier, while the entire data package receives a DOI that is registered in DataCite. Preservation occurs through a combination of DataONE synchronization/replication and by a series of local and remote backup strategies, including daily uploads to AWS Glacier storage. Checksums are computed for all data at initial upload, with random verification occurring on a continuous basis, thus ensuring the integrity of data. PASTA incorporates a series of data quality tests to ensure that data are correctly documented with EML before data are archived; data packages that fail any test are forbidden in the repository. These tests are a
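    The fixity-checking idea mentioned above (compute a checksum at ingest, store it, and re-verify on a rolling basis) can be sketched as follows; SHA-256 is assumed here for illustration and may differ from the repository's actual algorithm.

    ```python
    import hashlib

    def file_checksum(path, chunk_size=1 << 20):
        """Compute a SHA-256 checksum of a file, streaming it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify(path, recorded_checksum):
        """Re-verify a previously recorded checksum (random spot checks in a repository)."""
        return file_checksum(path) == recorded_checksum

    # Usage: recorded = file_checksum("dataset.csv"); later, verify("dataset.csv", recorded)
    ```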

  13. Data embedding method

    NASA Astrophysics Data System (ADS)

    Sandford, Maxwell T., II; Bradley, Jonathan N.; Handel, Theodore G.

    1996-01-01

    Data embedding is a new steganographic method for combining digital information sets. This paper describes the data embedding method and gives examples of its application using software written in the C programming language. Sandford and Handel produced a computer program (BMPEMBED, Ver. 1.51 written for IBM PC/AT or compatible, MS/DOS Ver. 3.3 or later) that implements data embedding in an application for digital imagery. Information is embedded into, and extracted from, Truecolor or color-palette images in Microsoft bitmap (BMP) format. Hiding data in the noise component of a host, by means of an algorithm that modifies or replaces the noise bits, is termed "steganography." Data embedding differs markedly from conventional steganography, because it uses the noise component of the host to insert information with few or no modifications to the host data values or their statistical properties. Consequently, the entropy of the host data is affected little by using data embedding to add information. The data embedding method applies to host data compressed with transform, or 'lossy', compression algorithms, as for example ones based on discrete cosine transform and wavelet functions. Analysis of the host noise generates a key required for embedding and extracting the auxiliary data from the combined data. The key is stored easily in the combined data. Images without the key cannot be processed to extract the embedded information. To provide security for the embedded data, one can remove the key from the combined data and manage it separately. The image key can be encrypted and stored in the combined data or transmitted separately as a ciphertext much smaller in size than the embedded data. The key size is typically ten to one-hundred bytes, and it is derived from the original host data by an analysis algorithm.
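    For orientation, the sketch below shows conventional least-significant-bit (LSB) steganography, i.e. replacing the low-order bits of a host array with message bits. This is the conventional technique the paper contrasts itself with, not the paper's data embedding method, which analyzes the host noise and derives a key; that algorithm is not reproduced here.

    ```python
    import numpy as np

    def embed_bits(pixels, bits):
        """Replace the least significant bit of the first len(bits) pixel values with message bits."""
        flat = pixels.flatten()  # flatten() returns a copy, so the host array is untouched
        if len(bits) > flat.size:
            raise ValueError("message too long for host image")
        bits = np.asarray(bits, dtype=flat.dtype)
        flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | bits
        return flat.reshape(pixels.shape)

    def extract_bits(pixels, n_bits):
        """Read back the least significant bits of the first n_bits pixel values."""
        return (pixels.flatten()[:n_bits] & 1).astype(np.uint8)

    host = np.random.default_rng(1).integers(0, 256, size=(8, 8), dtype=np.uint8)
    message = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
    stego = embed_bits(host, message)
    assert np.array_equal(extract_bits(stego, len(message)), message)
    ```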

  14. Data Type Registry - Cross Road Between Catalogs, Data And Semantics

    NASA Astrophysics Data System (ADS)

    Richard, S. M.; Zaslavsky, I.; Bristol, S.

    2017-12-01

    As more data become accessible online, the opportunity is increasing to improve search for information within datasets and for automating some levels of data integration. A prerequisite for these advances is indexing the kinds of information that are present in datasets and providing machine actionable descriptions of data structures. We are exploring approaches to enabling these capabilities in the EarthCube DigitalCrust and Data Discovery Hub Building Block projects, building on the Data type registry (DTR) workgroup activity in the Research Data Alliance. We are prototyping a registry implementation using the CNRI Cordra platform and API to enable 'deep registration' of datasets for building hydrogeologic models of the Earth's Crust, and executing complex science scenarios for river chemistry and coral bleaching data. These use cases require the ability to respond to queries such as: What are properties of Entity X; What entities include property Y (or L, M, N…), and What DataTypes are about Entity X and include property Y. Development of the registry to enable these capabilities requires more in-depth metadata than is commonly available, so we are also exploring approaches to analyzing simple tabular data to automate recognition of entities and properties, and assist users with establishing semantic mappings to data integration vocabularies. This poster will review the current capabilities and implementation of a data type registry.

  15. Alternative Fuels Data Center: Maps and Data

    Science.gov Websites

    Interactive graphs and downloadable data on alternative fuels and advanced vehicles, including AFV and HEV model offerings by manufacturer, light-duty AFV, HEV, and diesel model offerings, and a tool that calculates biofuels potential from traditional and bioenergy power, fuels, and resources.

  16. Data-Driven Decision Making: The "Other" Data

    ERIC Educational Resources Information Center

    Villano, Matt

    2007-01-01

    Data is a daily reality for school systems. Between standardized tests and tools from companies that offer data warehousing services, educators and district superintendents alike are up to their eyeballs in facts and figures about student performance that they can use as the basis for curricular decisions. Still, there is more to assessment than…

  17. CPC - Monitoring & Data: Pacific Island Climate Data

    Science.gov Websites

    Pacific Islands climate data and maps from island stations, provided by the NOAA National Weather Service Climate Prediction Center (Monitoring and Data section).

  18. How New National Air Data System Affects ECHO Data ...

    EPA Pesticide Factsheets

    The ECHO website is displaying Clean Air Act stationary source data from the modernized national data management system, ICIS-Air. The old system, AFS, was retired in October 2014. Answers to frequently asked questions about the data system transition are presented on this page.

  19. IGS Data Flow

    NASA Technical Reports Server (NTRS)

    Noll, Carey

    2006-01-01

    The IGS analysis centers and user community in general need to be assured that the data centers archive a consistent set of files. Changes to the archives can occur because of the re-publishing of data, the transmission of historic data, and the resulting re-distribution (or lack thereof) of these data from data center to data center. To ensure the quality of the archives, a defined data flow and method of archive population needs to be established. This poster will diagram and review the current IGS data flow, discuss problems that have occurred, and provide recommendations for improvement.

  20. An Analysis of the Climate Data Initiative's Data Collection

    NASA Astrophysics Data System (ADS)

    Ramachandran, R.; Bugbee, K.

    2015-12-01

    The Climate Data Initiative (CDI) is a broad multi-agency effort of the U.S. government that seeks to leverage the extensive existing federal climate-relevant data to stimulate innovation and private-sector entrepreneurship to support national climate-change preparedness. The CDI project is a systematic effort to manually curate and share openly available climate data from various federal agencies. To date, the CDI has curated seven themes, or topics, relevant to climate change resiliency. These themes include Coastal Flooding, Food Resilience, Water, Ecosystem Vulnerability, Human Health, Energy Infrastructure, and Transportation. Each theme was curated by subject matter experts who selected datasets relevant to the topic at hand. An analysis of the entire Climate Data Initiative data collection and the data curated for each theme offers insights into which datasets are considered most relevant in addressing climate resiliency. Other aspects of the data collection will be examined including which datasets were the most visited or popular and which datasets were the most sought after for curation by the theme teams. Results from the analysis of the CDI collection will be presented in this talk.

  1. OpenFlyData: an exemplar data web integrating gene expression data on the fruit fly Drosophila melanogaster.

    PubMed

    Miles, Alistair; Zhao, Jun; Klyne, Graham; White-Cooper, Helen; Shotton, David

    2010-10-01

    Integrating heterogeneous data across distributed sources is a major requirement for in silico bioinformatics supporting translational research. For example, genome-scale data on patterns of gene expression in the fruit fly Drosophila melanogaster are widely used in functional genomic studies in many organisms to inform candidate gene selection and validate experimental results. However, current data integration solutions tend to be heavyweight, and require significant initial and ongoing investment of effort. Development of a common Web-based data integration infrastructure (a.k.a. data web), using Semantic Web standards, promises to alleviate these difficulties, but little is known about the feasibility, costs, risks or practical means of migrating to such an infrastructure. We describe the development of OpenFlyData, a proof-of-concept system integrating gene expression data on D. melanogaster, combining Semantic Web standards with lightweight approaches to Web programming based on Web 2.0 design patterns. To support researchers designing and validating functional genomic studies, OpenFlyData includes user-facing search applications providing intuitive access to and comparison of gene expression data from FlyAtlas, the BDGP in situ database, and FlyTED, using data from FlyBase to expand and disambiguate gene names. OpenFlyData's services are also openly accessible, and are available for reuse by other bioinformaticians and application developers. Semi-automated methods and tools were developed to support labour- and knowledge-intensive tasks involved in deploying SPARQL services. These include methods for generating ontologies and relational-to-RDF mappings for relational databases, which we illustrate using the FlyBase Chado database schema; and methods for mapping gene identifiers between databases. The advantages of using Semantic Web standards for biomedical data integration are discussed, as are open issues. In particular, although the performance of open
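    To show what consuming such openly accessible SPARQL services looks like, the sketch below issues a simple query with the Python SPARQLWrapper package; the endpoint URL and query pattern are hypothetical placeholders rather than the actual OpenFlyData services.

    ```python
    from SPARQLWrapper import JSON, SPARQLWrapper

    endpoint = SPARQLWrapper("http://openflydata.example.org/sparql")  # hypothetical endpoint
    endpoint.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?gene ?label WHERE {
            ?gene rdfs:label ?label .
            FILTER regex(?label, "wing", "i")
        } LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()
    for row in results["results"]["bindings"]:
        print(row["gene"]["value"], row["label"]["value"])
    ```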

  2. Aerothermodynamic data base. Data file contents report, phase C

    NASA Technical Reports Server (NTRS)

    Lutz, G. R.

    1983-01-01

    Space shuttle aerothermodynamic data, collected from a continuing series of wind tunnel tests, are permanently stored with the Data Management Services (DMS) system. Information pertaining to current baseline configuration definition is also stored. Documentation of DMS processed data arranged sequentially and by space shuttle configuration is listed to provide an up-to-date record of all applicable aerothermodynamic data collected, processed, or summarized during the space shuttle program. Tables provide survey information to the various space shuttle managerial and technical levels.

  3. Data Reorganization for Optimal Time Series Data Access, Analysis, and Visualization

    NASA Astrophysics Data System (ADS)

    Rui, H.; Teng, W. L.; Strub, R.; Vollmer, B.

    2012-12-01

    The way data are archived is often not optimal for their access by many user communities (e.g., hydrological), particularly if the data volumes and/or number of data files are large. The number of data records of a non-static data set generally increases with time. Therefore, most data sets are commonly archived by time steps, one step per file, often containing multiple variables. However, many research and application efforts need time series data for a given geographical location or area, i.e., a data organization that is orthogonal to the way the data are archived. The retrieval of a time series of the entire temporal coverage of a data set for a single variable at a single data point, in an optimal way, is an important and longstanding challenge, especially for large science data sets (i.e., with volumes greater than 100 GB). Two examples of such large data sets are the North American Land Data Assimilation System (NLDAS) and Global Land Data Assimilation System (GLDAS), archived at the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC; Hydrology Data Holdings Portal, http://disc.sci.gsfc.nasa.gov/hydrology/data-holdings). To date, the NLDAS data set, hourly 0.125x0.125° from Jan. 1, 1979 to present, has a total volume greater than 3 TB (compressed). The GLDAS data set, 3-hourly and monthly 0.25x0.25° and 1.0x1.0° Jan. 1948 to present, has a total volume greater than 1 TB (compressed). Both data sets are accessible, in the archived time step format, via several convenient methods, including Mirador search and download (http://mirador.gsfc.nasa.gov/), GrADS Data Server (GDS; http://hydro1.sci.gsfc.nasa.gov/dods/), direct FTP (ftp://hydro1.sci.gsfc.nasa.gov/data/s4pa/), and Giovanni Online Visualization and Analysis (http://disc.sci.gsfc.nasa.gov/giovanni). However, users who need long time series currently have no efficient way to retrieve them. Continuing a longstanding tradition of facilitating data access, analysis, and
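    A minimal sketch of the access pattern users actually need is shown below: assembling a single-point time series from an archive stored as one file per time step, using xarray. The file pattern and variable name are hypothetical placeholders, not the actual NLDAS/GLDAS file layout.

    ```python
    import xarray as xr

    # Open many per-time-step files as one dataset along the time dimension.
    ds = xr.open_mfdataset("forcing_hourly_*.nc", combine="by_coords")

    # Extract the full time series of one variable at the grid cell nearest a point of interest.
    point_series = ds["soil_moisture"].sel(lat=35.0, lon=-97.0, method="nearest")
    point_series.to_dataframe().to_csv("soil_moisture_timeseries.csv")
    ```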

  4. Making Research Data Repositories Visible: The re3data.org Registry

    PubMed Central

    Pampel, Heinz; Vierkant, Paul; Scholze, Frank; Bertelmann, Roland; Kindling, Maxi; Klump, Jens; Goebelbecker, Hans-Jürgen; Gundlach, Jens; Schirmbacher, Peter; Dierolf, Uwe

    2013-01-01

    Researchers require infrastructures that ensure a maximum of accessibility, stability and reliability to facilitate working with and sharing of research data. Such infrastructures are being increasingly summarized under the term Research Data Repositories (RDR). The project re3data.org–Registry of Research Data Repositories–began indexing research data repositories in 2012 and offers researchers, funding organizations, libraries and publishers an overview of the heterogeneous research data repository landscape. As of July 2013, re3data.org lists 400 research data repositories and counting. 288 of these are described in detail using the re3data.org vocabulary. Information icons help researchers to easily identify an adequate repository for the storage and reuse of their data. This article describes the heterogeneous RDR landscape and presents a typology of institutional, disciplinary, multidisciplinary and project-specific RDR. Further, the article outlines the features of re3data.org and shows how this registry helps to identify appropriate repositories for storage and search of research data. PMID:24223762

  5. Making research data repositories visible: the re3data.org Registry.

    PubMed

    Pampel, Heinz; Vierkant, Paul; Scholze, Frank; Bertelmann, Roland; Kindling, Maxi; Klump, Jens; Goebelbecker, Hans-Jürgen; Gundlach, Jens; Schirmbacher, Peter; Dierolf, Uwe

    2013-01-01

    Researchers require infrastructures that ensure a maximum of accessibility, stability and reliability to facilitate working with and sharing of research data. Such infrastructures are being increasingly summarized under the term Research Data Repositories (RDR). The project re3data.org-Registry of Research Data Repositories-began indexing research data repositories in 2012 and offers researchers, funding organizations, libraries and publishers an overview of the heterogeneous research data repository landscape. As of July 2013, re3data.org lists 400 research data repositories and counting. 288 of these are described in detail using the re3data.org vocabulary. Information icons help researchers to easily identify an adequate repository for the storage and reuse of their data. This article describes the heterogeneous RDR landscape and presents a typology of institutional, disciplinary, multidisciplinary and project-specific RDR. Further, the article outlines the features of re3data.org and shows how this registry helps to identify appropriate repositories for storage and search of research data.

  6. Profiling Oman education data using data mining approach

    NASA Astrophysics Data System (ADS)

    Alawi, Sultan Juma Sultan; Shaharanee, Izwan Nizal Mohd; Jamil, Jastini Mohd

    2017-10-01

    Nowadays, the large amount of data generated by application services in different learning fields has led to new challenges in the education field. An education portal is an important system that can lead to better development of the education sector. This research paper presents data mining techniques to understand and summarize information from Oman's education data generated by the Ministry of Education Oman "Educational Portal". The research embarks on profiling students in the Oman student database, using the k-means clustering technique to determine students' profiles. A total of 42,484 student records from the Sultanate of Oman were extracted for this study. The findings show the practicality of clustering techniques for investigating students' profiles, allowing for a better understanding of students' behavior and academic performance. The Oman Education Portal contains a large amount of user activity and interaction data; analysis of these data can help educators improve student performance and recognize students who need additional attention.
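    A hedged sketch of the profiling step (the features below are hypothetical, not the study's actual portal variables): student records are standardized and grouped into profiles with k-means.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        # Hypothetical features per student: average grade, attendance rate, portal logins.
        students = rng.normal(loc=[70.0, 0.9, 25.0], scale=[10.0, 0.05, 8.0], size=(500, 3))

        X = StandardScaler().fit_transform(students)           # put features on a common scale
        kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

        profiles = kmeans.labels_                              # cluster (profile) id per student
        print(np.bincount(profiles))                           # number of students in each profile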

  7. Deploying Object Oriented Data Technology to the Planetary Data System

    NASA Technical Reports Server (NTRS)

    Kelly, S.; Crichton, D.; Hughes, J. S.

    2003-01-01

    How do you provide more than 350 scientists and researchers access to data from every instrument in Odyssey when the data is curated across half a dozen institutions and in different formats and is too big to mail on a CD-ROM anymore? The Planetary Data System (PDS) faced this exact question. The solution was to use a metadata-based middleware framework developed by the Object Oriented Data Technology task at NASA's Jet Propulsion Laboratory. Using OODT, PDS provided - for the first time ever - data from all mission instruments through a single system immediately upon data delivery.

  8. International seismological data center: Preparation of an experimental data base

    NASA Astrophysics Data System (ADS)

    Israelson, H.; Jeppsson, I.; Barkeby, G.

    1980-11-01

    An experimental data base compiled for a temporary international seismological data center is presented. Data include recordings and measurements from 60 globally distributed seismological stations for a one-week period. Data for the definition, location and magnitude estimation of seismic events are examined. Original digital records from 11 seismological research observatories around the world are also analyzed to provide additional identification data. It is shown that the routine measurement and reporting of data at seismological stations, as proposed by the Seismic Experts Group at the UN Committee of Disarmament, is an onerous task that goes far beyond current seismological practices.

  9. Alternative Fuels Data Center: Maps and Data

    Science.gov Websites

    Interactive listing of maps and data downloads: trend of S&FP AFV acquisitions by fleet type, 1992-2014; trend of S&FP AFV acquisitions by fuel type, 1992-2015; transactions from 1997-2014; and the BioFuels Atlas, each with graph view and data download (last updated August 2016).

  10. Alternative Fuels Data Center: Maps and Data

    Science.gov Websites

    Interactive listing of maps and data downloads for State & Alternative Fuel Providers, including the trend of S&FP AFV acquisitions by fleet type, 1992-2014, each with graph view and data download (last updated August 2014-2016).

  11. Alternative Fuels Data Center: Maps and Data

    Science.gov Websites

    Interactive listing of maps and data downloads, including fuel standard volumes by year, data by product or destination, and U.S. ethanol data from 1866-2014, each with graph view and data download (last updated August 2012-2016).

  12. N(h) profile data at World Data Centers

    NASA Technical Reports Server (NTRS)

    Allen, J. H.; Conkright, R.; Bilitza, D.; Feldstein, A. IA.; Willis, D. M.

    1990-01-01

    The data base for ionospheric electron density profiles at the World Data Centers is assessed. Past and future N(h) profiling from ionograms and profiles from incoherent scatter radar measurements are considered.

  13. MODIS land data at the EROS data center DAAC

    USGS Publications Warehouse

    Jenkerson, Calli B.; Reed, B.C.

    2001-01-01

    The US Geological Survey's (USGS) Earth Resources Observation Systems (EROS) Data Center (EDC) in Sioux Falls, SD, USA, is the primary national archive for land processes data and one of the National Aeronautics and Space Administration's (NASA) Distributed Active Archive Centers (DAAC) for the Earth Observing System (EOS). One of EDC's functions as a DAAC is the archival and distribution of Moderate Resolution Imaging Spectroradiometer (MODIS) Land Data collected from the Earth Observing System (EOS) satellite Terra. More than 500,000 publicly available MODIS land data granules totaling 25 terabytes (TB) are currently stored in the EDC archive. This collection is managed, archived, and distributed by the EOS Data and Information System (EOSDIS) Core System (ECS) at EDC. EDC User Services support the use of MODIS Land data, which include land surface reflectance/albedo, temperature/emissivity, vegetation characteristics, and land cover, by responding to user inquiries, constructing user information sites on the EDC web page, and presenting MODIS materials worldwide.

  14. The Data Conservancy

    NASA Astrophysics Data System (ADS)

    Choudhury, S.; Duerr, R. E.

    2009-12-01

    NSF's Sustainable Digital Data Preservation and Access Network Partners program is an ambitious attempt to integrate a wide variety of expertise and infrastructure into a network for providing "reliable digital preservation, access, integration, and analysis capabilities for science." One of the first two DataNet award recipients, the Data Conservancy, is itself a network of widely diverse partners led by the libraries at the Johns Hopkins University. The Data Conservancy is built on existing exemplar scientific projects, communities, and virtual organizations that have deep engagement with their user communities, and extensive experience with large-scale distributed system development. Data Conservancy members embrace a shared vision that data curation is not an end, but rather a means to collect, organize, validate, and preserve data needed to address the grand research challenges that face society. Data Conservancy members' holdings encompass the entire range of earth, life, and space science data. New to the Data Conservancy is the concept that university libraries will be part of the distributed network of data centers and that data science will become a path in the library and information science curricula. As noted by Winston Tabb (JHU Dean of Libraries), "Data Centers are the new library stacks."

  15. The Topological Field Theory of Data: a program towards a novel strategy for data mining through data language

    NASA Astrophysics Data System (ADS)

    Rasetti, M.; Merelli, E.

    2015-07-01

    This paper aims to challenge the current thinking in IT for the 'Big Data' question, proposing - almost verbatim, with no formulas - a program aiming to construct an innovative methodology to perform data analytics in a way that returns an automaton as a recognizer of the data language: a Field Theory of Data. We suggest building, directly out of probing the data space, a theoretical framework enabling us to extract the manifold hidden relations (patterns) that exist among data, as correlations depending on the semantics generated by the mining context. The program, which is grounded in the recent innovative ways of integrating data into a topological setting, proposes the realization of a Topological Field Theory of Data, transferring and generalizing to the space of data notions inspired by physical (topological) field theories, and harnesses the theory of formal languages to define the potential semantics necessary to understand the emerging patterns.

  16. MANAGING TROUBLED DATA: COASTAL DATA PARTNERSHIPS SMOOTH DATA INTEGRATION

    EPA Science Inventory

    Understanding the ecology, condition, and changes of coastal areas requires data from many sources. Broad-scale and long-term ecological questions, such as global climate change, biodiversity, and cumulative impacts of human activities, must be addressed with databases that integ...

  17. The development of health care data warehouses to support data mining.

    PubMed

    Lyman, Jason A; Scully, Kenneth; Harrison, James H

    2008-03-01

    Clinical data warehouses offer tremendous benefits as a foundation for data mining. By serving as a source for comprehensive clinical and demographic information on large patient populations, they streamline knowledge discovery efforts by providing standard and efficient mechanisms to replace time-consuming and expensive original data collection, organization, and processing. Building effective data warehouses requires knowledge of and attention to key issues in database design, data acquisition and processing, and data access and security. In this article, the authors provide an operational and technical definition of data warehouses, present examples of data mining projects enabled by existing data warehouses, and describe key issues and challenges related to warehouse development and implementation.

  18. Landsat 8 Data Modeled as DGGS Data Cubes

    NASA Astrophysics Data System (ADS)

    Sherlock, M. J.; Tripathi, G.; Samavati, F.

    2016-12-01

    In the context of tracking recent global changes in the Earth's landscape, Landsat 8 provides high-resolution multi-wavelength data with a temporal resolution of sixteen days. Such a live dataset can benefit novel applications in environmental monitoring. However, a temporal analysis of this dataset in its native format is a challenging task, mostly due to the huge volume of geospatial images and the imperfect overlay of Landsat 8 images from different days. We propose the creation of data cubes derived from Landsat 8 data, through the use of a Discrete Global Grid System (DGGS). DGGS referencing of Landsat 8 data provides a cell-based representation of the pixel values for a fixed area on Earth, indexed by keys. Having the calibrated cell-based Landsat 8 images can speed up temporal analysis and facilitate parallel processing using distributed systems. In our method, the Landsat 8 dataset hosted on Amazon Web Services (AWS) is downloaded using a web crawler and stored on a filesystem. We apply the cell-based DGGS referencing (using the Pyxis SDK) to Landsat 8 images, which provides a rhombus-based tessellation of equal-area cells for our use case. After this step, the cell images, which overlay perfectly across different days, are stacked in the temporal dimension and stored into data cube units. The depth of the cube represents the number of temporal images of the same cell and can be updated when new images are received each day. Harnessing the regular spatio-temporal structure of data cubes, we want to compress, query, transmit and visualize big Landsat 8 data in an efficient way for temporal analysis.
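    A toy sketch of the data cube idea only (cell keys and image sizes are synthetic, and this does not use the Pyxis SDK): per-cell images from successive acquisition dates, already aligned to the same DGGS cell, are stacked along a temporal axis so the cube depth grows as new images arrive.

        import numpy as np

        cell_size = 64                                   # pixels per DGGS cell image (synthetic)
        cube = {}                                        # cell key -> (time, y, x) array

        def ingest(cell_key, cell_image):
            """Append one acquisition date's cell image to that cell's temporal cube."""
            stack = cube.get(cell_key)
            cell_image = cell_image[np.newaxis, ...]     # add a leading time axis
            cube[cell_key] = cell_image if stack is None else np.concatenate([stack, cell_image])

        for day in range(5):                             # pretend five acquisition dates
            ingest("cell-00042", np.random.rand(cell_size, cell_size).astype(np.float32))

        print(cube["cell-00042"].shape)                  # (5, 64, 64): depth grows with each new image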

  19. The AmeriFlux data activity and data system: an evolving collection of data management techniques, tools, products and services

    NASA Astrophysics Data System (ADS)

    Boden, T. A.; Krassovski, M.; Yang, B.

    2013-06-01

    The Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory (ORNL), USA, has provided scientific data management support for the US Department of Energy and international climate change science since 1982. Among the many data archived and available from CDIAC are collections from long-term measurement projects. One current example is the AmeriFlux measurement network. AmeriFlux provides continuous measurements from forests, grasslands, wetlands, and croplands in North, Central, and South America and offers important insight about carbon cycling in terrestrial ecosystems. To successfully manage AmeriFlux data and support climate change research, CDIAC has designed flexible data systems using proven technologies and standards blended with new, evolving technologies and standards. The AmeriFlux data system, comprising primarily a relational database, a PHP-based data interface, and an FTP server, offers a broad suite of AmeriFlux data. The data interface allows users to query the AmeriFlux collection in a variety of ways and then subset, visualize and download the data. From the perspective of data stewardship, on the other hand, this system is designed for CDIAC to easily control database content, automate data movement, track data provenance, manage metadata content, and handle frequent additions and corrections. CDIAC and researchers in the flux community developed data submission guidelines to enhance the AmeriFlux data collection, enable automated data processing, and promote standardization across regional networks. Both continuous flux and meteorological data and irregular biological data collected at AmeriFlux sites are carefully scrutinized by CDIAC using established quality-control algorithms before the data are ingested into the AmeriFlux data system. Other tasks at CDIAC include reformatting and standardizing the diverse and heterogeneous datasets received from individual sites into a uniform and consistent network database

  20. The AmeriFlux data activity and data system: an evolving collection of data management techniques, tools, products and services

    NASA Astrophysics Data System (ADS)

    Boden, T. A.; Krassovski, M.; Yang, B.

    2013-02-01

    The Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory (ORNL), USA, has provided scientific data management support for the US Department of Energy and international climate change science since 1982. Among the many data archived and available from CDIAC are collections from long-term measurement projects. One current example is the AmeriFlux measurement network. AmeriFlux provides continuous measurements from forests, grasslands, wetlands, and croplands in North, Central, and South America and offers important insight about carbon cycling in terrestrial ecosystems. To successfully manage AmeriFlux data and support climate change research, CDIAC has designed flexible data systems using proven technologies and standards blended with new, evolving technologies and standards. The AmeriFlux data system, comprising primarily a relational database, a PHP-based data interface, and an FTP server, offers a broad suite of AmeriFlux data. The data interface allows users to query the AmeriFlux collection in a variety of ways and then subset, visualize and download the data. From the perspective of data stewardship, on the other hand, this system is designed for CDIAC to easily control database content, automate data movement, track data provenance, manage metadata content, and handle frequent additions and corrections. CDIAC and researchers in the flux community developed data submission guidelines to enhance the AmeriFlux data collection, enable automated data processing, and promote standardization across regional networks. Both continuous flux and meteorological data and irregular biological data collected at AmeriFlux sites are carefully scrutinized by CDIAC using established quality-control algorithms before the data are ingested into the AmeriFlux data system. Other tasks at CDIAC include reformatting and standardizing the diverse and heterogeneous datasets received from individual sites into a uniform and consistent network database

  1. A Spatial Data Infrastructure for Environmental Noise Data in Europe.

    PubMed

    Abramic, Andrej; Kotsev, Alexander; Cetl, Vlado; Kephalopoulos, Stylianos; Paviotti, Marco

    2017-07-06

    Access to high quality data is essential in order to better understand the environmental and health impact of noise in an increasingly urbanised world. This paper analyses how recent developments of spatial data infrastructures in Europe can significantly improve the utilization of data and streamline reporting on a pan-European scale. The Infrastructure for Spatial Information in the European Community (INSPIRE), and Environmental Noise Directive (END) described in this manuscript provide principles for data management that, once applied, would lead to a better understanding of the state of environmental noise. Furthermore, shared, harmonised and easily discoverable environmental spatial data, required by the INSPIRE, would also support the data collection needed for the assessment and development of strategic noise maps. Action plans designed by the EU Member States to reduce noise and mitigate related effects can be shared to the public through already established nodes of the European spatial data infrastructure. Finally, data flows regarding reporting on the state of environment and END implementation to the European level can benefit by applying a decentralised e-reporting service oriented infrastructure. This would allow reported data to be maintained, frequently updated and enable pooling of information from/to other relevant and interrelated domains such as air quality, transportation, human health, population, marine environment or biodiversity. We describe those processes and provide a use case in which noise data from two neighbouring European countries are mapped to common data specifications, defined by INSPIRE, thus ensuring interoperability and harmonisation.

  2. Wavelet data compression for archiving high-resolution icosahedral model data

    NASA Astrophysics Data System (ADS)

    Wang, N.; Bao, J.; Lee, J.

    2011-12-01

    With the increase in the resolution of global circulation models, it becomes ever more important to develop highly effective solutions to archive the huge datasets produced by those models. While lossless data compression guarantees the accuracy of the restored data, it can achieve only a limited reduction in data size. Wavelet transform based data compression offers significant potential for data size reduction, and it has been shown to be very effective in transmitting data for remote visualization. However, for data archive purposes, a detailed study has to be conducted to evaluate its impact on the datasets that will be used in further numerical computations. In this study, we carried out two sets of experiments, for both summer and winter seasons. An icosahedral grid weather model and highly efficient wavelet data compression software were used for this study. Initial conditions were compressed and input to the model, which was run out to 10 days. The forecast results were then compared to those from the model run with the original, uncompressed initial conditions. Several visual comparisons, as well as statistics from the numerical comparisons, are presented. These results indicate that, with specified minimum accuracy losses, wavelet data compression achieves significant data size reduction while maintaining minimal numerical impact on the datasets. In addition, some issues are discussed to increase archive efficiency while retaining a complete set of metadata for each archived file.
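    An illustrative sketch of the general technique, not the study's software: a signal is wavelet-decomposed, small coefficients are discarded (the lossy step), and the field is reconstructed to check the error introduced. The wavelet choice and threshold are arbitrary here.

        import numpy as np
        import pywt

        signal = np.cos(np.linspace(0, 20 * np.pi, 4096)) + 0.01 * np.random.randn(4096)

        coeffs = pywt.wavedec(signal, "db4", level=6)             # wavelet decomposition
        threshold = 0.05
        compressed = [pywt.threshold(c, threshold, mode="hard") for c in coeffs]

        kept = sum(np.count_nonzero(c) for c in compressed)        # nonzero coefficients retained
        total = sum(c.size for c in coeffs)
        restored = pywt.waverec(compressed, "db4")[: signal.size]  # reconstruction from kept coefficients

        print(f"coefficients kept: {kept}/{total}")
        print(f"max abs error: {np.max(np.abs(restored - signal)):.4f}")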

  3. Data Curation

    ERIC Educational Resources Information Center

    Mallon, Melissa, Ed.

    2012-01-01

    In their Top Trends of 2012, the Association of College and Research Libraries (ACRL) named data curation as one of the issues to watch in academic libraries in the near future (ACRL, 2012, p. 312). Data curation can be summarized as "the active and ongoing management of data through its life cycle of interest and usefulness to scholarship,…

  4. Cafe Data

    ERIC Educational Resources Information Center

    DePaolo, Concetta A.; Robinson, David F.

    2011-01-01

    In this paper we present time series data collected from a cafe run by business students at a Midwestern public university. The data were collected over a ten-week period during the spring semester of 2010. These data can be used in introductory courses to illustrate basic concepts of time series and forecasting, including trend, seasonality, and…

  5. Crystal Data

    National Institute of Standards and Technology Data Gateway

    SRD 3 NIST Crystal Data (PC database for purchase)   NIST Crystal Data contains chemical, physical, and crystallographic information useful to characterize more than 237,671 inorganic and organic crystalline materials. The data include the standard cell parameters, cell volume, space group number and symbol, calculated density, chemical formula, chemical name, and classification by chemical type.

  6. Analysis of crystallization data in the Protein Data Bank

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kirkwood, Jobie; Hargreaves, David; O’Keefe, Simon

    In a large-scale study using data from the Protein Data Bank, some of the many reported findings regarding the crystallization of proteins were investigated. The Protein Data Bank (PDB) is the largest available repository of solved protein structures and contains a wealth of information on successful crystallization. Many centres have used their own experimental data to draw conclusions about proteins and the conditions in which they crystallize. Here, data from the PDB were used to reanalyse some of these results. The most successful crystallization reagents were identified, the link between solution pH and the isoelectric point of the protein was investigated and the possibility of predicting whether a protein will crystallize was explored.

  7. NSSDC data listing

    NASA Technical Reports Server (NTRS)

    1982-01-01

    The first part of this listing, Satellite Data, is in an abbreviated form compared to the data catalogs published by NSSDC. It is organized by NSSDC spacecraft common name. The launch date and NSSDC ID are printed for each spacecraft. The experiments are listed alphabetically by the principal investigator's or team leader's last name following the spacecraft name. The experiment name and NSSDC ID are printed for each experiment. The data sets are listed by NSSDC ID following the experiment name. The data set name, data form code, quantity of data, and the time span of the data as verified by NSSDC are printed for each data set.

  8. Data Product Citation

    Atmospheric Science Data Center

    2017-07-31

    Citing Data from the NASA Langley Research Center's Atmospheric Science Data Center (ASDC) Distributed Active ... data products and services are provided by the NASA Langley Research Center's (LaRC) ASDC DAAC and are managed by the NASA Earth Science ... for all data sets that are recommended for reference in research. The DOI for each data set can be found on the data set landing page ...

  9. Protecting personal data in epidemiological research: DataSHIELD and UK law.

    PubMed

    Wallace, Susan E; Gaye, Amadou; Shoush, Osama; Burton, Paul R

    2014-01-01

    Data from individual collections, such as biobanks and cohort studies, are now being shared in order to create combined datasets which can be queried to ask complex scientific questions. But this sharing must be done with due regard for data protection principles. DataSHIELD is a new technology that queries nonaggregated, individual-level data in situ but returns query data in an anonymous format. This raises questions of the ability of DataSHIELD to adequately protect participant confidentiality. An ethico-legal analysis was conducted that examined each step of the DataSHIELD process from the perspective of UK case law, regulations, and guidance. DataSHIELD reaches agreed UK standards of protection for the sharing of biomedical data. All direct processing of personal data is conducted within the protected environment of the contributing study; participating studies have scientific, ethics, and data access approvals in place prior to the analysis; studies are clear that their consents conform with this use of data, and participants are informed that anonymisation for further disclosure will take place. DataSHIELD can provide a flexible means of interrogating data while protecting the participants' confidentiality in accordance with applicable legislation and guidance. © 2014 S. Karger AG, Basel.

  10. Integration of upper air data in the MeteoSwiss Data Warehouse

    NASA Astrophysics Data System (ADS)

    Musa, M.; Haeberli, Ch.; Ruffieux, D.

    2010-09-01

    Over the last 10 years, MeteoSwiss has established a Data Warehouse in order to provide one single, integrated data platform for all kinds of meteorological and climatological data. In the MeteoSwiss Data Warehouse, data and metadata are held in a metadata-driven relational database. To reach this goal, we started in a first step with the integration of the current and historical data from our surface stations, including routines for aggregation and calculation and the implementation of enhanced Quality Control tools. In 2008 we started with the integration into the Data Warehouse of current and historical upper air data such as soundings (PTU, wind and ozone), various kinds of profilers such as wind profilers or radiometers, profiles calculated from numerical weather models, and AMDAR data. The dataset also includes high-resolution sounding data from the station Payerne and TEMP data from 20 European stations since 1942. A critical point was to work out a concept for the general architecture that could deal with all the different types of data. While integrating the data itself, all metadata of the aerological station Payerne were transferred and imported into the central metadata repository of the Data Warehouse. The implementation of the real-time and daily QC tools, as well as the routines for aggregation and calculation, was realized in a manner analogous to the surface data. The Quality Control tools include plausibility tests such as limit tests, consistency tests at the same level, and vertical consistency tests. From the beginning, the aim was to support the MeteoSwiss integration strategy, which deals with all aspects of integration, such as various observing technologies and platforms, observing systems outside MeteoSwiss, and the data and metadata itself. This kind of integration comprises all aspects of "Enterprise Data Integration". After the integration, the historical as well as the current upper air data are now available for climatologists and meteorologists with standardized access for
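    A minimal sketch of the kinds of plausibility checks mentioned (the thresholds are illustrative, not MeteoSwiss's): a limit test on a sounding temperature profile and a vertical-consistency test on the lapse rate between successive levels.

        def limit_test(values, lower, upper):
            """Flag values outside physically plausible limits."""
            return [not (lower <= v <= upper) for v in values]

        def vertical_consistency_test(heights_m, temps_c, max_lapse_c_per_km=15.0):
            """Flag level pairs whose temperature change with height is implausibly steep."""
            flags = [False]
            for (z0, t0), (z1, t1) in zip(zip(heights_m, temps_c), zip(heights_m[1:], temps_c[1:])):
                lapse = abs(t1 - t0) / max((z1 - z0) / 1000.0, 1e-6)
                flags.append(lapse > max_lapse_c_per_km)
            return flags

        heights = [0, 500, 1000, 1500]
        temps = [15.0, 12.0, 40.0, 5.0]                  # the jump to 40.0 is caught by the consistency test
        print(limit_test(temps, -90.0, 60.0))            # [False, False, False, False]
        print(vertical_consistency_test(heights, temps)) # [False, False, True, True]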

  11. Big Data and Deep data in scanning and electron microscopies: functionality from multidimensional data sets

    DOE PAGES

    Belianinov, Alex; Vasudevan, Rama K; Strelcov, Evgheni; ...

    2015-05-13

    The development of electron and scanning probe microscopies in the second half of the twentieth century has produced spectacular images of the internal structure and composition of matter at nanometer, molecular, and atomic resolution. Largely, this progress was enabled by computer-assisted methods of microscope operation, data acquisition and analysis. The progress in imaging technologies at the beginning of the twenty-first century has opened the proverbial floodgates of high-veracity information on structure and functionality. High-resolution imaging now allows information on atomic positions with picometer precision, allowing for quantitative measurements of individual bond lengths and angles. Functional imaging often leads to multidimensional data sets containing partial or full information on properties of interest, acquired as a function of multiple parameters (time, temperature, or other external stimuli). Here, we review several recent applications of big and deep data analysis methods to visualize, compress, and translate imaging data into physically and chemically relevant information.

  12. Big Data and Deep data in scanning and electron microscopies: functionality from multidimensional data sets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Belianinov, Alex; Vasudevan, Rama K; Strelcov, Evgheni

    The development of electron and scanning probe microscopies in the second half of the twentieth century has produced spectacular images of the internal structure and composition of matter at nanometer, molecular, and atomic resolution. Largely, this progress was enabled by computer-assisted methods of microscope operation, data acquisition and analysis. The progress in imaging technologies at the beginning of the twenty-first century has opened the proverbial floodgates of high-veracity information on structure and functionality. High-resolution imaging now allows information on atomic positions with picometer precision, allowing for quantitative measurements of individual bond lengths and angles. Functional imaging often leads to multidimensional data sets containing partial or full information on properties of interest, acquired as a function of multiple parameters (time, temperature, or other external stimuli). Here, we review several recent applications of big and deep data analysis methods to visualize, compress, and translate imaging data into physically and chemically relevant information.

  13. 17 CFR 45.4 - Swap data reporting: continuation data.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... swap data repository as set forth in this section. This obligation commences on the applicable... swap data set forth in part 43 of this chapter; and, where applicable, swap dealers, major swap... traders set forth in parts 17 and 18 of this chapter. (a) Continuation data reporting method. For each...

  14. 17 CFR 45.4 - Swap data reporting: continuation data.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... swap data repository as set forth in this section. This obligation commences on the applicable... swap data set forth in part 43 of this chapter; and, where applicable, swap dealers, major swap... traders set forth in parts 17 and 18 of this chapter. (a) Continuation data reporting method. For each...

  15. 17 CFR 45.4 - Swap data reporting: continuation data.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... swap data repository as set forth in this section. This obligation commences on the applicable... swap data set forth in part 43 of this chapter; and, where applicable, swap dealers, major swap... traders set forth in parts 17 and 18 of this chapter. (a) Continuation data reporting method. For each...

  16. YummyData: providing high-quality open life science data

    PubMed Central

    Yamaguchi, Atsuko; Splendiani, Andrea

    2018-01-01

    Many life science datasets are now available via Linked Data technologies, meaning that they are represented in a common format (the Resource Description Framework), and are accessible via standard APIs (SPARQL endpoints). While this is an important step toward developing an interoperable bioinformatics data landscape, it also creates a new set of obstacles, as it is often difficult for researchers to find the datasets they need. Different providers frequently offer the same datasets, with different levels of support: as well as having more or less up-to-date data, some providers add metadata to describe the content, structures, and ontologies of the stored datasets while others do not. We currently lack a place where researchers can go to easily assess datasets from different providers in terms of metrics such as service stability or metadata richness. We also lack a space for collecting feedback and improving data providers’ awareness of user needs. To address this issue, we have developed YummyData, which consists of two components. One periodically polls a curated list of SPARQL endpoints, monitoring the states of their Linked Data implementations and content. The other presents the information measured for the endpoints and provides a forum for discussion and feedback. YummyData is designed to improve the findability and reusability of life science datasets provided as Linked Data and to foster its adoption. It is freely accessible at http://yummydata.org/. Database URL: http://yummydata.org/ PMID:29688370
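    A minimal sketch of the monitoring idea, under the assumption that a trivial ASK query is a reasonable liveness probe (the endpoint URLs are placeholders, not YummyData's curated list): each endpoint is polled and its response status recorded.

        import requests

        ENDPOINTS = ["http://example.org/sparql", "http://example.net/sparql"]   # hypothetical list

        def is_alive(endpoint):
            """Issue a trivial SPARQL ASK query and report whether the endpoint responds."""
            try:
                r = requests.get(endpoint,
                                 params={"query": "ASK {}"},
                                 headers={"Accept": "application/sparql-results+json"},
                                 timeout=10)
                return r.ok
            except requests.RequestException:
                return False

        def poll_once(endpoints):
            return {ep: is_alive(ep) for ep in endpoints}

        # Run on a schedule (e.g. daily) and append to a history to build uptime statistics.
        print(poll_once(ENDPOINTS))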

  17. Direct match data flow memory for data driven computing

    DOEpatents

    Davidson, George S.; Grafe, Victor Gerald

    1997-01-01

    A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory, and another indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
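    A toy software sketch of the firing rule described (not the patented hardware): an instruction fires to the processor only once both of its parameters have arrived, with presence tracked per parameter much as the tag memory's status bits do.

        from dataclasses import dataclass

        @dataclass
        class Node:
            opcode: str
            target: int                 # output address for the result
            a: float = None
            b: float = None

            def store(self, which, value):
                setattr(self, which, value)
                return self.ready()     # analogous to asserting the "fire" signal

            def ready(self):
                return self.a is not None and self.b is not None

            def fire(self):
                assert self.ready()
                result = self.a + self.b if self.opcode == "add" else self.a * self.b
                return self.target, result

        node = Node(opcode="add", target=0x2A)
        print(node.store("a", 2.0))     # False: still waiting on parameter B
        print(node.store("b", 3.0))     # True: both parameters present, ready to fire
        print(node.fire())              # (42, 5.0)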

  18. A Comparison of Key Concepts in Data Analytics and Data Science

    ERIC Educational Resources Information Center

    McMaster, Kirby; Rague, Brian; Wolthuis, Stuart L.; Sambasivam, Samuel

    2018-01-01

    This research study provides an examination of the relatively new fields of Data Analytics and Data Science. We compare word rates in Data Analytics and Data Science documents to determine which concepts are mentioned most often. The most frequent concept in both fields is "data." The word rate for "data" is more than twice the…

  19. NSSDC Data Listing

    NASA Technical Reports Server (NTRS)

    1979-01-01

    Data available from the National Space Science Data Center (NSSDC) are listed. The spacecraft, principal investigator, the experiment, and time span of the data are given. A listing is also included of ground-based data, models, computer routines and composite spacecraft data that are available from NSSDC.

  20. SSE Global Data

    Atmospheric Science Data Center

    2018-04-12

    SSE Global Data Text files of monthly averaged data for the entire ... Version:  V6 Location:  Global Spatial Coverage:  (90N, 90S)(180W,180E) ... File Format:  ASCII Order Data:  SSE Global Data: Order Data SCAR-B Block:  ...

  1. Data Access Services that Make Remote Sensing Data Easier to Use

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher

    2010-01-01

    This slide presentation reviews some of the processes that NASA uses to make the remote sensing data easy to use over the World Wide Web. This work involves much research into data formats, geolocation structures and quality indicators, often to be followed by coding a preprocessing program. Only then are the data usable within the analysis tool of choice. The Goddard Earth Sciences Data and Information Services Center is deploying a variety of data access services that are designed to dramatically shorten the time consumed in the data preparation step. On-the-fly conversion to the standard network Common Data Form (netCDF) format with Climate-Forecast (CF) conventions imposes a standard coordinate system framework that makes data instantly readable through several tools, such as the Integrated Data Viewer, Gridded Analysis and Display System, Panoply and Ferret. A similar benefit is achieved by serving data through the Open Source Project for a Network Data Access Protocol (OPeNDAP), which also provides subsetting. The Data Quality Screening Service goes a step further in filtering out data points based on quality control flags, based on science team recommendations or user-specified criteria. Further still is the Giovanni online analysis system which goes beyond handling formatting and quality to provide visualization and basic statistics of the data. This general approach of automating the preparation steps has the important added benefit of enabling use of the data by non-human users (i.e., computer programs), which often make sub-optimal use of the available data due to the need to hard-code data preparation on the client side.
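    A hedged sketch of why CF-compliant netCDF served via OPeNDAP shortens the preparation step (the URL and the variable and coordinate names below are placeholders, not a real GES DISC endpoint): a generic client can open, subset, and reduce the remote dataset without a custom preprocessing program.

        import xarray as xr

        OPENDAP_URL = "http://example.org/opendap/sample_dataset.nc"   # hypothetical endpoint

        # Lazily open the remote dataset; CF conventions give named, unit-tagged coordinates.
        ds = xr.open_dataset(OPENDAP_URL)

        # Subset by coordinate labels and reduce over time; only the needed slices are fetched.
        subset = ds["precipitation"].sel(lat=slice(30, 50), lon=slice(-110, -90)).mean(dim="time")
        print(subset)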

  2. Alternative Fuels Data Center: Maps and Data

    Science.gov Websites

    Interactive listing of maps and data downloads: trend of S&FP AFV acquisitions by fleet type, 1992-2014; trend of S&FP AFV acquisitions by fuel type, 1992-2015; and transactions from 1997-2014, each with graph view and data download (last updated August 2016).

  3. Digital data storage systems, computers, and data verification methods

    DOEpatents

    Groeneveld, Bennett J.; Austad, Wayne E.; Walsh, Stuart C.; Herring, Catherine A.

    2005-12-27

    Digital data storage systems, computers, and data verification methods are provided. According to a first aspect of the invention, a computer includes an interface adapted to couple with a dynamic database; and processing circuitry configured to provide a first hash from digital data stored within a portion of the dynamic database at an initial moment in time, to provide a second hash from digital data stored within the portion of the dynamic database at a subsequent moment in time, and to compare the first hash and the second hash.
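    A minimal sketch of the verification idea described (the "database portion" is just a byte string here): hash a portion of the data at an initial moment, hash it again at a later moment, and compare the two digests to detect any change.

        import hashlib

        def digest(portion: bytes) -> str:
            return hashlib.sha256(portion).hexdigest()

        snapshot_t0 = b"row1|row2|row3"                  # portion of the dynamic database at time 0
        first_hash = digest(snapshot_t0)

        snapshot_t1 = b"row1|row2|row3-modified"         # same portion at a subsequent time
        second_hash = digest(snapshot_t1)

        print("unchanged" if first_hash == second_hash else "data changed since first hash")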

  4. Web Based Data Access to the World Data Center for Climate

    NASA Astrophysics Data System (ADS)

    Toussaint, F.; Lautenschlager, M.

    2006-12-01

    The World Data Center for Climate (WDC-Climate, www.wdc-climate.de) is hosted by the Model & Data Group (M&D) of the Max Planck Institute for Meteorology. The M&D department is financed by the German government and uses the computers and mass storage facilities of the German Climate Computing Centre (Deutsches Klimarechenzentrum, DKRZ). The WDC-Climate provides web access to 200 terabytes of climate data; the total mass storage archive contains nearly 4 petabytes. Although the majority of the datasets concern model output data, some satellite and observational data are accessible as well. The underlying relational database is distributed on five servers. The CERA relational data model is used to integrate catalogue data and mass data. The flexibility of the model allows very different types of data and metadata to be stored and accessed. The CERA metadata catalogue provides easy access to the content of the CERA database as well as to other data on the web. Visit ceramodel.wdc-climate.de for additional information on the CERA data model. The majority of the users access data via the CERA metadata catalogue, which is open without registration. However, prior to retrieving data, users are required to check in and apply for a userid and password. The CERA metadata catalogue is servlet-based, so it is accessible worldwide through any web browser at cera.wdc-climate.de. In addition to data and metadata access by the web catalogue, WDC-Climate offers a number of other forms of web-based data access. All metadata are available via HTTP request as XML files in various metadata formats (ISO, DC, etc.; see wini.wdc-climate.de), which allows for easy data interchange with other catalogues. Model data can be retrieved in GRIB, ASCII, NetCDF, and binary (IEEE) format. WDC-Climate serves as the data centre for various projects. Since the XML files are accessible by HTTP, the integration of data into applications of different projects is very easy. Projects supported by WDC-Climate are, e.g., CEOP

  5. An Automated Data Analysis Tool for Livestock Market Data

    ERIC Educational Resources Information Center

    Williams, Galen S.; Raper, Kellie Curry

    2011-01-01

    This article describes an automated data analysis tool that allows Oklahoma Cooperative Extension Service educators to disseminate results in a timely manner. Primary data collected at Oklahoma Quality Beef Network (OQBN) certified calf auctions across the state results in a large amount of data per sale site. Sale summaries for an individual sale…

  6. 17 CFR 45.3 - Swap data reporting: creation data.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... to the reporting obligations with respect to real time reporting of swap data set forth in part 43 of... economic terms data for the swap, within the applicable reporting deadline set forth in paragraph (b)(1)(i..., must report all primary economic terms data for the swap, within the applicable reporting deadline set...

  7. 17 CFR 45.3 - Swap data reporting: creation data.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... to the reporting obligations with respect to real time reporting of swap data set forth in part 43 of... economic terms data for the swap, within the applicable reporting deadline set forth in paragraph (b)(1)(i..., must report all primary economic terms data for the swap, within the applicable reporting deadline set...

  8. Data Services

    Science.gov Websites

    Navigation listing of astronomical data services: complete Sun and Moon data for one day; rise/set/transit/twilight tables for an entire year; what the Moon looks like now; table of Solar System objects and bright stars; duration of daylight/darkness table for one year; and phases of the Moon.

  9. Big Data Tools as Applied to ATLAS Event Data

    NASA Astrophysics Data System (ADS)

    Vukotic, I.; Gardner, R. W.; Bryant, L. A.

    2017-10-01

    Big Data technologies have proven to be very useful for storage, processing and visualization of derived metrics associated with ATLAS distributed computing (ADC) services. Logfiles, database records, and metadata from a diversity of systems have been aggregated and indexed to create an analytics platform for ATLAS ADC operations analysis. Dashboards, wide area data access cost metrics, user analysis patterns, and resource utilization efficiency charts are produced flexibly through queries against a powerful analytics cluster. Here we explore whether these techniques and the associated analytics ecosystem can be applied to add new modes of open, quick, and pervasive access to ATLAS event data. Such modes would simplify access and broaden the reach of ATLAS public data to new communities of users. An ability to efficiently store, filter, search and deliver ATLAS data at the event and/or sub-event level in a widely supported format would enable or significantly simplify usage of machine learning environments and tools like Spark, Jupyter, R, SciPy, Caffe, TensorFlow, etc. Machine learning challenges such as the Higgs Boson Machine Learning Challenge, the Tracking challenge, event viewers (VP1, ATLANTIS, ATLASrift), and still-to-be-developed educational and outreach tools would be able to access the data through a simple REST API. In this preliminary investigation we focus on derived xAOD data sets. These are much smaller than the primary xAODs, having only the containers, variables, and events of interest to a particular analysis. Encouraged by the performance of Elasticsearch for the ADC analytics platform, we developed an algorithm for indexing derived xAOD event data. We have made an appropriate document mapping and have imported a full set of standard model W/Z datasets. We compare the disk space efficiency of this approach to that of standard ROOT files, the performance in a simple cut-flow type of data analysis, and will present preliminary results on its scaling
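    A hedged sketch only (the index name, field names, and local endpoint are assumptions, and the calls follow the elasticsearch-py 8.x client rather than the paper's actual document mapping): flattened event-level quantities are indexed as documents, and a cut-flow-style selection becomes a filter query.

        from elasticsearch import Elasticsearch

        es = Elasticsearch("http://localhost:9200")      # assumed local cluster

        event = {"run": 358031, "event": 1234567, "n_jets": 3, "lead_lep_pt_gev": 47.2}
        es.index(index="xaod-derived", document=event)   # one document per event

        # Cut-flow style selection: events with a hard leading lepton and at least 2 jets.
        hits = es.search(index="xaod-derived", query={
            "bool": {"filter": [
                {"range": {"lead_lep_pt_gev": {"gte": 25.0}}},
                {"range": {"n_jets": {"gte": 2}}},
            ]}
        })
        print(hits["hits"]["total"])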

  10. Causes of deaths data, linkages and big data perspectives.

    PubMed

    Rey, Grégoire; Bounebache, Karim; Rondet, Claire

    2018-07-01

    The study of cause-specific mortality data is one of the main sources of information for public health monitoring. In most industrialized countries, when a death occurs, it is a legal requirement that a medical certificate based on the international form recommended by the World Health Organization (WHO) is filled in by a physician. The physician reports the causes of death that directly led or contributed to the death on the death certificate. The death certificate is then forwarded to a coding office, where each cause is coded, and one underlying cause is defined, using the rules of the International Classification of Diseases and Related Health Problems, now in its 10th Revision (ICD-10). Recently, a growing number of countries have adopted, or have decided to adopt, the coding software Iris, developed and maintained by an international consortium. This whole standardized production process results in a high and constantly increasing international comparability of cause-specific mortality data. While these data could be used for international comparisons and benchmarking of the global burden of diseases, quality of care and prevention policies, there are also many other ways and methods to explore their richness, especially when they are linked with other data sources. Some of these methods potentially fall within the so-called "big data" field. These methods could be applied to the production of the data, to the statistical processing of the data, and even more to the processing of these data linked to other databases. In the present note, we depict the main domains in which this new field of methods could be applied. We focus specifically on the context of France, a country of 65 million inhabitants with a centralized health data system. Finally, we stress the importance of data quality and the specific problems related to death certification in the forensic medicine domain.

  11. Speaker verification system using acoustic data and non-acoustic data

    DOEpatents

    Gable, Todd J [Walnut Creek, CA; Ng, Lawrence C [Danville, CA; Holzrichter, John F [Berkeley, CA; Burnett, Greg C [Livermore, CA

    2006-03-21

    A method and system for speech characterization. One embodiment includes a method for speaker verification which includes collecting data from a speaker, wherein the data comprises acoustic data and non-acoustic data. The data is used to generate a template that includes a first set of "template" parameters. The method further includes receiving a real-time identity claim from a claimant, and using acoustic data and non-acoustic data from the identity claim to generate a second set of parameters. The method further includes comparing the first set of parameters to the second set of parameters to determine whether the claimant is the speaker. The first set of parameters and the second set of parameters include at least one purely non-acoustic parameter, including a non-acoustic glottal shape parameter derived from averaging multiple glottal cycle waveforms.

  12. XAFS Data Interchange: A single spectrum XAFS data file format.

    PubMed

    Ravel, B; Newville, M

    We propose a standard data format for the interchange of XAFS data. The XAFS Data Interchange (XDI) standard is meant to encapsulate a single spectrum of XAFS along with relevant metadata. XDI is a text-based format with a simple syntax which clearly delineates metadata from the data table in a way that is easily interpreted both by a computer and by a human. The metadata header is inspired by the format of an electronic mail header, representing metadata names and values as an associative array. The data table is represented as columns of numbers. This format can be imported as is into most existing XAFS data analysis, spreadsheet, or data visualization programs. Along with a specification and a dictionary of metadata types, we provide an application-programming interface written in C and bindings for programming dynamic languages.

  13. XAFS Data Interchange: A single spectrum XAFS data file format

    NASA Astrophysics Data System (ADS)

    Ravel, B.; Newville, M.

    2016-05-01

    We propose a standard data format for the interchange of XAFS data. The XAFS Data Interchange (XDI) standard is meant to encapsulate a single spectrum of XAFS along with relevant metadata. XDI is a text-based format with a simple syntax which clearly delineates metadata from the data table in a way that is easily interpreted both by a computer and by a human. The metadata header is inspired by the format of an electronic mail header, representing metadata names and values as an associative array. The data table is represented as columns of numbers. This format can be imported as is into most existing XAFS data analysis, spreadsheet, or data visualization programs. Along with a specification and a dictionary of metadata types, we provide an application-programming interface written in C and bindings for programming dynamic languages.
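    A sketch of reading a file in the spirit described (the field and column names below are made up, and this is not an official XDI parser): "Name: value" header lines are collected into an associative array, and the remaining lines are read as numeric columns.

        import io
        import numpy as np

        sample = io.StringIO(
            "# Element.symbol: Cu\n"
            "# Mono.d_spacing: 3.13551\n"
            "# -------------\n"
            "# energy  mutrans\n"
            "8979.0  -0.61\n"
            "8984.0  -0.58\n"
        )

        metadata, rows = {}, []
        for line in sample:
            if line.startswith("#"):
                body = line.lstrip("#").strip()
                if ":" in body:                                   # header: associative metadata
                    key, value = body.split(":", 1)
                    metadata[key.strip()] = value.strip()
            elif line.strip():
                rows.append([float(x) for x in line.split()])     # data table: numeric columns

        data = np.array(rows)
        print(metadata["Element.symbol"], data.shape)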

  14. PDB data curation.

    PubMed

    Wang, Yanchao; Sunderraman, Rajshekhar

    2006-01-01

    In this paper, we propose two architectures for curating PDB data to improve its quality. The first one, the PDB Data Curation System, is developed by adding two parts, a Checking Filter and a Curation Engine, between the User Interface and the Database. This architecture supports basic PDB data curation. The other one, the PDB Data Curation System with XCML, is designed for further curation and adds four more parts, PDB-XML, PDB, OODB, and Protin-OODB, to the previous one. This architecture uses the XCML language to automatically check errors in PDB data, which makes PDB data more consistent and accurate. These two tools can be used for cleaning existing PDB files and creating new PDB files. We also show some ideas on how to add constraints and assertions with XCML to obtain better data. In addition, we discuss data provenance, which may affect data accuracy and consistency.

  15. Smart Data Infrastructure: The Sixth Generation of Mediation for Data Science

    NASA Astrophysics Data System (ADS)

    Fox, P. A.

    2014-12-01

    In the emergent "fourth paradigm" (data-driven) science, the scientific method is enhanced by the integration of significant data sources into the practice of scientific research. To address Big Science, there are challenges in understanding the role of data in enabling researchers to attack not just disciplinary issues, but also the system-level, large-scale, and transdisciplinary global scientific challenges facing society. Recognizing that the volume of data is only one of many dimensions to be considered, there is a clear need for improved data infrastructures to mediate data and information exchange, which we contend will need to be powered by semantic technologies. One clear need is to provide computational approaches for researchers to discover appropriate data resources, rapidly integrate data collections from heterogeneous resources or multiple data sets, and inter-compare results to allow generation and validation of hypotheses. Another trend is toward automated tools that allow researchers to better find and reuse data that they currently don't know they need, let alone know how to find. Again, semantic technologies will be required. Finally, to turn data analytics from "art to science", technical solutions are needed for cross-dataset validation, reproducibility studies on data-driven results, and the concomitant citation of data products allowing recognition for those who curate and share important data resources.

  16. Data Stream Mining

    NASA Astrophysics Data System (ADS)

    Gaber, Mohamed Medhat; Zaslavsky, Arkady; Krishnaswamy, Shonali

    Data mining is concerned with the process of computationally extracting hidden knowledge structures represented in models and patterns from large data repositories. It is an interdisciplinary field of study that has its roots in databases, statistics, machine learning, and data visualization. Data mining has emerged as a direct outcome of the data explosion that resulted from the success in database and data warehousing technologies over the past two decades (Fayyad, 1997; Fayyad, 1998; Kantardzic, 2003).

  17. Keeping up with Big Data--Designing an Introductory Data Analytics Class

    ERIC Educational Resources Information Center

    Hijazi, Sam

    2016-01-01

    Universities need to keep up with the demand of the business world when it comes to Big Data. The exponential increase in data has put additional demands on academia to meet the big gap in education. Business demand for Big Data has surpassed 1.9 million positions in 2015. Big Data, Business Intelligence, Data Analytics, and Data Mining are the…

  18. Chapter 35: Describing Data and Data Collections in the VO

    NASA Astrophysics Data System (ADS)

    Kent, B. R.; Hanisch, R. J.; Williams, R. D.

    The list of numbers: 19.22, 17.23, 18.11, 16.98, and 15.11, is of little intrinsic interest without information about the context in which they appear. For instance, are these daily closing stock prices for your favorite investment, or are they hourly photometric measurements of an increasingly bright quasar? The information needed to define this context is called metadata. Metadata are data about data. Astronomers are familiar with metadata through the headers of FITS files and the names and units associated with columns in a table or database. In the VO, metadata describe the contents of tables, images, and spectra, as well as aggregate collections of data (archives, surveys) and computational services. Moreover, VO metadata are constructed according to rules that avoid ambiguity and make it clear whether, in the example above, the stock prices are in dollars or euros, or the photometry is Johnson V or Sloan g. Organization of data is important in any scientific discipline. Equally crucial are the descriptions of that data: the organization publishing the data, its creator or the person making it available, what instruments were used, units assigned to measurement, calibration status, and data quality assessment. The Virtual Observatory metadata scheme not only applies to datasets, but to resources as well, including data archive facilities, searchable web forms, and online analysis and display tools. Since the scientific output flowing from large datasets depends greatly on how well the data are described, it is important for users to understand the basics of the metadata scheme in order to locate the data that they want and use it correctly. Metadata are the key to data discovery and data and service interoperability in the Virtual Observatory.

  19. A New Paradigm to Analyze Data Completeness of Patient Data.

    PubMed

    Nasir, Ayan; Gurupur, Varadraj; Liu, Xinliang

    2016-08-03

    There is a need to develop a tool that will measure data completeness of patient records using sophisticated statistical metrics. Patient data integrity is important in providing timely and appropriate care. Completeness is an important step, with an emphasis on understanding the complex relationships between data fields and their relative importance in delivering care. This tool will not only help understand where data problems are but also help uncover the underlying issues behind them. Develop a tool that can be used alongside a variety of health care database software packages to determine the completeness of individual patient records as well as aggregate patient records across health care centers and subpopulations. The methodology of this project is encapsulated within the Data Completeness Analysis Package (DCAP) tool, with the major components including concept mapping, CSV parsing, and statistical analysis. The results from testing DCAP with Healthcare Cost and Utilization Project (HCUP) State Inpatient Database (SID) data show that this tool is successful in identifying relative data completeness at the patient, subpopulation, and database levels. These results also solidify a need for further analysis and call for hypothesis driven research to find underlying causes for data incompleteness. DCAP examines patient records and generates statistics that can be used to determine the completeness of individual patient data as well as the general thoroughness of record keeping in a medical database. DCAP uses a component that is customized to the settings of the software package used for storing patient data as well as a Comma Separated Values (CSV) file parser to determine the appropriate measurements. DCAP itself is assessed through a proof of concept exercise using hypothetical data as well as available HCUP SID patient data.
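
    The DCAP internals are not reproduced here, but the core computation it describes, scoring how complete each patient record is given the relative importance of its fields, can be sketched in a few lines of Python; the field weights and the file name below are assumptions for illustration only.

      import csv

      # Hypothetical relative importance of fields for delivering care.
      WEIGHTS = {"patient_id": 1.0, "diagnosis": 0.9, "age": 0.5, "phone": 0.2}

      def record_completeness(row):
          """Weighted fraction of non-empty fields in one patient record."""
          total = sum(WEIGHTS.values())
          present = sum(w for field, w in WEIGHTS.items() if row.get(field, "").strip())
          return present / total

      with open("patients.csv", newline="") as fh:      # assumed CSV export
          scores = [record_completeness(row) for row in csv.DictReader(fh)]

      print("patients:", len(scores))
      print("mean completeness:", sum(scores) / len(scores))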

  20. A New Paradigm to Analyze Data Completeness of Patient Data

    PubMed Central

    Nasir, Ayan; Liu, Xinliang

    2016-01-01

    Summary Background There is a need to develop a tool that will measure data completeness of patient records using sophisticated statistical metrics. Patient data integrity is important in providing timely and appropriate care. Completeness is an important step, with an emphasis on understanding the complex relationships between data fields and their relative importance in delivering care. This tool will not only help understand where data problems are but also help uncover the underlying issues behind them. Objectives Develop a tool that can be used alongside a variety of health care database software packages to determine the completeness of individual patient records as well as aggregate patient records across health care centers and subpopulations. Methods The methodology of this project is encapsulated within the Data Completeness Analysis Package (DCAP) tool, with the major components including concept mapping, CSV parsing, and statistical analysis. Results The results from testing DCAP with Healthcare Cost and Utilization Project (HCUP) State Inpatient Database (SID) data show that this tool is successful in identifying relative data completeness at the patient, subpopulation, and database levels. These results also solidify a need for further analysis and call for hypothesis driven research to find underlying causes for data incompleteness. Conclusion DCAP examines patient records and generates statistics that can be used to determine the completeness of individual patient data as well as the general thoroughness of record keeping in a medical database. DCAP uses a component that is customized to the settings of the software package used for storing patient data as well as a Comma Separated Values (CSV) file parser to determine the appropriate measurements. DCAP itself is assessed through a proof of concept exercise using hypothetical data as well as available HCUP SID patient data. PMID:27484918

  1. Hydratools, a MATLAB® based data processing package for Sontek Hydra data

    USGS Publications Warehouse

    Martini, M.; Lightsom, F.L.; Sherwood, C.R.; Xu, Jie; Lacy, J.R.; Ramsey, A.; Horwitz, R.

    2005-01-01

    The U.S. Geological Survey (USGS) has developed a set of MATLAB tools to process and convert data collected by Sontek Hydra instruments to netCDF, which is a format used by the USGS to process and archive oceanographic time-series data. The USGS makes high-resolution current measurements within 1.5 meters of the bottom. These data are used in combination with other instrument data from sediment transport studies to develop sediment transport models. Instrument manufacturers provide software that outputs unique binary data formats, and working with multiple formats is cumbersome. The USGS solution is to translate data streams into a common data format: netCDF. The Hydratools toolbox is written to create netCDF format files following EPIC conventions, complete with embedded metadata. Data are accepted from both the ADV and the PCADP. The toolbox will detect and remove bad data, substitute other sources of heading and tilt measurements if necessary, apply ambiguity corrections, calculate statistics, return information about data quality, and organize metadata. Standardized processing and archiving make these data more easily and routinely accessible locally and over the Internet. In addition, documentation of the techniques used in the toolbox provides a baseline reference for others utilizing the data.
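
    Hydratools itself is a MATLAB toolbox, but the target format is open; the Python sketch below illustrates the general pattern of writing a time series with embedded metadata to netCDF (the variable names and attributes are illustrative and are not the EPIC conventions themselves).

      import numpy as np
      from netCDF4 import Dataset

      with Dataset("adv_burst.nc", "w") as nc:           # illustrative file name
          nc.title = "Near-bottom current time series"    # embedded global metadata
          nc.instrument = "Sontek Hydra ADV"

          nc.createDimension("time", None)                # unlimited record dimension
          t = nc.createVariable("time", "f8", ("time",))
          t.units = "seconds since 2005-01-01 00:00:00"
          u = nc.createVariable("u", "f4", ("time",))
          u.units = "m s-1"
          u.long_name = "eastward current velocity"

          t[:] = np.arange(0, 600, 1.0)                   # ten minutes at 1 Hz
          u[:] = 0.05 * np.random.randn(600)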

  2. Axiope tools for data management and data sharing.

    PubMed

    Goddard, Nigel H; Cannon, Robert C; Howell, Fred W

    2003-01-01

    Many areas of biological research generate large volumes of very diverse data. Managing this data can be a difficult and time-consuming process, particularly in an academic environment where there are very limited resources for IT support staff such as database administrators. The most economical and efficient solutions are those that enable scientists with minimal IT expertise to control and operate their own desktop systems. Axiope provides one such solution, Catalyzer, which acts as a flexible cataloging system for creating structured records describing digital resources. The user is able to specify both the content and structure of the information included in the catalog. Information and resources can be shared by a variety of means, including automatically generated sets of web pages. Federation and integration of this information, where needed, is handled by Axiope's Mercat server. Where there is a need for standardization or compatibility of the structures used by different researchers, this can be achieved later by applying user-defined mappings in Mercat. In this way, large-scale data sharing can be achieved without imposing unnecessary constraints or interfering with the way in which individual scientists choose to record and catalog their work. We summarize the key technical issues involved in scientific data management and data sharing, describe the main features and functionality of Axiope Catalyzer and Axiope Mercat, and discuss future directions and requirements for an information infrastructure to support large-scale data sharing and scientific collaboration.

  3. NSSDC data listing

    NASA Technical Reports Server (NTRS)

    Horowitz, Richard; King, Joseph H.

    1994-01-01

    This document identifies, in a highly summarized way, all the data held at the National Space Science Data Center (NSSDC). These data cover astrophysics and astronomy, solar and space physics, planetary and lunar, and Earth science disciplines. They are primarily but not exclusively from past and on-going NASA spaceflight missions. We first identify all the data electronically available through NSSDC's principal on-line (magnetic disk-based) and near-line (robotics jukebox-based) systems and then those data available on CD-ROM's. Finally, we identify all NSSDC-held data, the majority of which are still off line on magnetic tape, film, etc., but include the electronically accessible and CD-ROM-resident data of earlier sections. These comprehensive identifications are in the form of two listings, one for the majority of NSSDC-held data sets resulting from individual instruments flown on individual spacecraft and the other for the remainder of NSSDC-held data sets that do not adhere to this spacecraft/experiment/data set hierarchy. The latter listing is presented in two parts, one for the numerous source catalogs of the NSSDC-operated Astronomical Data Center and the other for the remainder.

  4. A Data Services Upgrade for Advanced Composition Explorer (ACE) Data

    NASA Astrophysics Data System (ADS)

    Davis, A. J.; Hamell, G.

    2008-12-01

    Since early in 1998, NASA's Advanced Composition Explorer (ACE) spacecraft has provided continuous measurements of solar wind, interplanetary magnetic field, and energetic particle activity from L1, located approximately 0.01 AU sunward of Earth. The spacecraft has enough fuel to stay in orbit about L1 until ~2024. The ACE Science Center (ASC) provides access to ACE data, and performs level 1 and browse data processing for the science instruments. Thanks to a NASA Data Services Upgrade grant, we have recently retooled our legacy web interface to ACE data, enhancing data subsetting capabilities and improving online plotting options. We have also integrated a new application programming interface (API) and we are working to ensure that it will be compatible with emerging Virtual Observatory (VO) data services standards. The new API makes extensive use of metadata created using the Space Physics Archive Search and Extract (SPASE) data model. We describe these recent improvements to the ACE Science Center data services, and our plans for integrating these services into the VO system.

  5. Linked Ocean Data

    NASA Astrophysics Data System (ADS)

    Leadbetter, Adam; Arko, Robert; Chandler, Cynthia; Shepherd, Adam

    2014-05-01

    "Linked Data" is a term used in Computer Science to encapsulate a methodology for publishing data and metadata in a structured format so that links may be created and exploited between objects. Berners-Lee (2006) outlines the following four design principles of a Linked Data system: Use Uniform Resource Identifiers (URIs) as names for things. Use HyperText Transfer Protocol (HTTP) URIs so that people can look up those names. When someone looks up a URI, provide useful information, using the standards (Resource Description Framework [RDF] and the RDF query language [SPARQL]). Include links to other URIs so that they can discover more things. In 2010, Berners-Lee revisited his original design plan for Linked Data to encourage data owners along a path to "good Linked Data". This revision involved the creation of a five star rating system for Linked Data outlined below. One star: Available on the web (in any format). Two stars: Available as machine-readable structured data (e.g. An Excel spreadsheet instead of an image scan of a table). Three stars: As two stars plus the use of a non-proprietary format (e.g. Comma Separated Values instead of Excel). Four stars: As three stars plus the use of open standards from the World Wide Web Commission (W3C) (i.e. RDF and SPARQL) to identify things, so that people can point to your data and metadata. Five stars: All the above plus link your data to other people's data to provide context Here we present work building on the SeaDataNet common vocabularies served by the NERC Vocabulary Server, connecting projects such as the Rolling Deck to Repository (R2R) and the Biological and Chemical Oceanography Data Management Office (BCO-DMO) and other vocabularies such as the Marine Metadata Interoperability Ontology Register and Repository and the NASA Global Change Master Directory to create a Linked Ocean Data cloud. Publishing the vocabularies and metadata in standard RDF XML and exposing SPARQL endpoints renders them five-star Linked

  6. TCGA Expedition: A Data Acquisition and Management System for TCGA Data

    PubMed Central

    Chandran, Uma R.; Medvedeva, Olga P.; Barmada, M. Michael; Blood, Philip D.; Chakka, Anish; Luthra, Soumya; Ferreira, Antonio; Wong, Kim F.; Lee, Adrian V.; Zhang, Zhihui; Budden, Robert; Scott, J. Ray; Berndt, Annerose; Berg, Jeremy M.; Jacobson, Rebecca S.

    2016-01-01

    Background The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 Petabyte in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but many challenges exist in navigating and using data obtained from these sites. We developed TCGA Expedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGA Expedition supports command line access at high-performance computing facilities as well as some functionality with third party tools. For a subset of TCGA data collected at University of Pittsburgh, we also re-associate TCGA data with de-identified data from the electronic health records. Here we describe the software as well as the architecture of our repository, methods for loading of TCGA data to multiple platforms, and security and regulatory controls that conform to federal best practices. Results TCGA Expedition software consists of a set of scripts written in Bash, Python and Java that download, extract, harmonize, version and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools. Using this software, we created a collaborative repository, the Pittsburgh Genome Resource Repository (PGRR) that enabled investigators at our institution to work with all TCGA data formats, and to interrogate these data with analysis pipelines, and

  7. Medical data mining: knowledge discovery in a clinical data warehouse.

    PubMed Central

    Prather, J. C.; Lobach, D. F.; Goodwin, L. K.; Hales, J. W.; Hage, M. L.; Hammond, W. E.

    1997-01-01

    Clinical databases have accumulated large quantities of information about patients and their medical conditions. Relationships and patterns within this data could provide new medical knowledge. Unfortunately, few methodologies have been developed and applied to discover this hidden knowledge. In this study, the techniques of data mining (also known as Knowledge Discovery in Databases) were used to search for relationships in a large clinical database. Specifically, data accumulated on 3,902 obstetrical patients were evaluated for factors potentially contributing to preterm birth using exploratory factor analysis. Three factors were identified by the investigators for further exploration. This paper describes the processes involved in mining a clinical database including data warehousing, data query and cleaning, and data analysis. PMID:9357597
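
    The study relied on exploratory factor analysis; a minimal modern equivalent (on synthetic data, not the obstetrical records described above) could be sketched with scikit-learn as follows.

      import numpy as np
      from sklearn.decomposition import FactorAnalysis

      rng = np.random.default_rng(0)

      # Synthetic stand-in for a de-identified clinical extract:
      # 500 patients, 8 numeric variables driven by 3 hidden factors.
      latent = rng.normal(size=(500, 3))
      loadings = rng.normal(size=(3, 8))
      X = latent @ loadings + 0.1 * rng.normal(size=(500, 8))

      fa = FactorAnalysis(n_components=3, random_state=0)
      fa.fit(X)

      # Loadings show which observed variables move together --
      # candidate factors to hand to investigators for further exploration.
      print(fa.components_.round(2))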

  8. Use of data description languages in the interchange of data

    NASA Technical Reports Server (NTRS)

    Pignede, M.; Real-Planells, B.; Smith, S. R.

    1994-01-01

    The Consultative Committee for Space Data Systems (CCSDS) is developing Standards for the interchange of information between systems, including those operating under different environments. The objective is to perform the interchange automatically, i.e. in a computer interpretable manner. One aspect of the concept developed by CCSDS is the use of a separate data description to specify the data being transferred. Using the description, data can then be automatically parsed by the receiving computer. With a suitably expressive Data Description Language (DDL), data formats of arbitrary complexity can be handled. The advantages of this approach are: (1) that the description need only be written and distributed once to all users, and (2) new software does not need to be written for each new format, provided generic tools are available to support writing and interpretation of descriptions and the associated data instances. Consequently, the effort of 'hard coding' each new format is avoided and problems of integrating multiple implementations of a given format by different users are avoided. The approach is applicable in any context where computer parsable description of data could enhance efficiency (e.g. within a spacecraft control system, a data delivery system or an archive). The CCSDS have identified several candidate DDL's: EAST (Extended Ada Subset), TSDN (Transfer Syntax Data Notation) and MADEL (Modified ASN.1 as a Data Description Language -- a DDL based on the Abstract Syntax Notation One - ASN.1 - specified in the ISO/IEC 8824). This paper concentrates on ESA's development of MADEL. ESA have also developed a 'proof of concept' prototype of the required support tools, implemented on a PC under MS-DOS, which has successfully demonstrated the feasibility of the approach, including the capability within an application of retrieving and displaying particular data elements, given its MADEL description (i.e. a data description written in MADEL). This paper outlines
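
    The DDLs discussed in the paper (EAST, TSDN, MADEL) are far more expressive than this, but the underlying idea, keeping the data description separate from generic parsing code so that no new software is written per format, can be sketched in Python with a simple field list standing in for a real description language; the field names and sample record are invented.

      import struct

      # A toy "data description": field names plus binary format codes.
      # In the CCSDS approach this description would be written once and distributed.
      DESCRIPTION = [("sc_id", "H"), ("sc_time", "I"), ("temperature", "f")]

      def parse(record_bytes, description):
          """Generic parser driven entirely by the description, not hard-coded per format."""
          fmt = ">" + "".join(code for _, code in description)    # big-endian fields
          values = struct.unpack(fmt, record_bytes)
          return dict(zip((name for name, _ in description), values))

      sample = struct.pack(">HIf", 42, 123456789, 21.5)           # fabricated telemetry record
      print(parse(sample, DESCRIPTION))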

  9. Prototype Local Data Integration System and Central Florida Data Deficiency

    NASA Technical Reports Server (NTRS)

    Manobianco, John; Case, Jonathan

    1998-01-01

    This report describes the Applied Meteorology Unit's (AMU) task on the Local Data Integration System (LDIS) and central Florida data deficiency. The objectives of the task are to identify all existing meteorological data sources within 250 km of the Kennedy Space Center (KSC) and the Eastern Range at Cape Canaveral Air Station (CCAS), identify and configure an appropriate LDIS to integrate these data, and implement a working prototype to be used for limited case studies and data non-incorporation (DNI) experiments. The ultimate goal for running LDIS is to generate products that may enhance weather nowcasts and short-range (less than 6 h) forecasts issued in support of the 45th Weather Squadron (45 WS), Spaceflight Meteorology Group (SMG), and the Melbourne National Weather Service (NWS MLB) operational requirements. The LDIS has the potential to provide added value for nowcasts and short term forecasts for two reasons. First, it incorporates all data operationally available in east central Florida. Second, it is run at finer spatial and temporal resolutions than current national-scale operational models. In combination with a suitable visualization tool, LDIS may provide users with a more complete and comprehensive understanding of evolving fine-scale weather features than could be developed by individually examining the disparate data sets over the same area and time. The utility of LDIS depends largely on the reliability and availability of observational data. Therefore, it is important to document all existing meteorological data sources around central Florida that can be incorporated by it. Several factors contribute to the data density and coverage over east central Florida including the level in the atmosphere, distance from KSC/CCAS, time, and prevailing weather. The central Florida mesonet consists of existing surface meteorological and hydrological data available from the Tampa NWS and data servers at Miami and Jacksonville. However the utility of these

  10. Implementing a Data Quality Strategy to Simplify Access to Data

    NASA Astrophysics Data System (ADS)

    Druken, K. A.; Trenham, C. E.; Evans, B. J. K.; Richards, C. J.; Wang, J.; Wyborn, L. A.

    2016-12-01

    To ensure seamless programmatic access for data analysis (including machine learning), standardization of both data and services is vital. At the Australian National Computational Infrastructure (NCI) we have developed a Data Quality Strategy (DQS) that currently provides processes for: (1) the consistency of data structures in the underlying High Performance Data (HPD) platform; (2) quality control through compliance with recognized community standards; and (3) data quality assurance through demonstrated functionality across common platforms, tools and services. NCI hosts one of Australia's largest repositories (10+ PBytes) of research data collections spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences. A key challenge is the application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used. Within these disciplines, data span a wide range of gridded, ungridded (i.e., line surveys, point clouds), and raster image types, as well as diverse coordinate reference projections and resolutions. By implementing our DQS we have seen progressive improvement in the quality of the datasets across the different subject domains, and through this, the ease by which the users can programmatically access the data, either in situ or via web services. As part of its quality control procedures, NCI has developed a compliance checker based upon existing domain standards. The DQS also includes extensive Functionality Testing, which includes readability by commonly used libraries (e.g., netCDF, HDF, GDAL, etc.); accessibility by data servers (e.g., THREDDS, Hyrax, GeoServer); validation against scientific analysis and programming platforms (e.g., Python, Matlab, QGIS); and visualization tools (e.g., ParaView, NASA Web World Wind). These tests ensure smooth interoperability between products and services as well as exposing unforeseen requirements and
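
    The essence of such a compliance checker is small: open a file and verify that community-agreed attributes are present. The Python sketch below (plain netCDF4, with an assumed file name and a deliberately minimal attribute list, not NCI's actual checker) illustrates the pattern.

      from netCDF4 import Dataset

      REQUIRED_GLOBAL = ["title", "institution", "source", "Conventions"]   # illustrative subset
      REQUIRED_VARIABLE = ["units", "long_name"]

      def check(path):
          problems = []
          with Dataset(path) as nc:
              for attr in REQUIRED_GLOBAL:
                  if attr not in nc.ncattrs():
                      problems.append(f"missing global attribute: {attr}")
              for name, var in nc.variables.items():
                  for attr in REQUIRED_VARIABLE:
                      if attr not in var.ncattrs():
                          problems.append(f"{name}: missing attribute {attr}")
          return problems

      for issue in check("example_dataset.nc"):           # assumed input file
          print(issue)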

  11. Data deluxe

    NASA Astrophysics Data System (ADS)

    Carlowicz, Michael

    Several new web pages from the National Oceanic and Atmospheric Administration (NOAA) will allow scientists and nonscientists alike to view graphic displays of weather and space weather data from around the world. Users can select a region of the Earth and a time period to see plots and data sets of everything from severe storms to aurorae. The National Climatic Data Center has made available weather data from 8000 stations around the world, 160 satellite images of hurricanes from the GOES satellites, and technical reports about weather events such as the East Coast blizzard of 1996. The web address is http://www.ncdc.noaa.gov.

  12. Electricity Data Browser

    EIA Publications

    The Electricity Data Browser shows generation, consumption, fossil fuel receipts, stockpiles, retail sales, and electricity prices. The data appear on an interactive web page and are updated each month. The Electricity Data Browser includes all the datasets collected and published in EIA's Electric Power Monthly and allows users to perform dynamic charting of data sets as well as map the data by state. The data browser includes a series of reports that appear in the Electric Power Monthly and allows readers to drill down to plant level statistics, where available. All images and datasets are available for download. Users can also link to the data series in EIA's Application Programming Interface (API). An API makes our data machine-readable and more accessible to users.

  13. Data selection techniques in the interpretation of MAGSAT data over Australia

    NASA Technical Reports Server (NTRS)

    Johnson, B. D.; Dampney, C. N. G.

    1983-01-01

    The MAGSAT data require critical selection in order to produce a self-consistent data set suitable for map construction and subsequent interpretation. Interactive data selection techniques are described which involve the use of a special-purpose profile-oriented data base and a colour graphics display. The careful application of these data selection techniques permits validation of every data value and ensures that the best possible self-consistent data set is being used to construct the maps of the magnetic field measured at satellite altitudes over Australia.

  14. Trends in Planetary Data Analysis. Executive summary of the Planetary Data Workshop

    NASA Technical Reports Server (NTRS)

    Evans, N.

    1984-01-01

    Planetary data include non-imaging remote sensing data, comprising spectrometric, radiometric, and polarimetric remote sensing observations. Also included are in-situ measurements, radio/radar data, and Earth-based observations. Also discussed is the development of a planetary data system. A catalog to identify observations will be the initial entry point for all levels of users into the data system. There are seven distinct data support services: encyclopedia, data index, data inventory, browse, search, sample, and acquire. Data systems for planetary science users must provide access to data and must process, store, and display those data. Two standards will be incorporated into the planetary data system: a standard communications protocol and a standard format data unit. The data system configuration must combine features of a distributed system with those of a centralized system. Fiscal constraints have made prioritization important. Activities include saving previous mission data, planning/cost analysis, and publishing of proceedings.

  15. Challenges in sharing of geospatial data by data custodians in South Africa

    NASA Astrophysics Data System (ADS)

    Kay, Sissiel E.

    2018-05-01

    As most development planning and rendering of public services happens at a place or in a space, geospatial data is required. This geospatial data is best managed through a spatial data infrastructure, which has as a key objective to share geospatial data. The collection and maintenance of geospatial data is expensive and time consuming and so the principle of "collect once - use many times" should apply. It is best to obtain the geospatial data from the authoritative source - the appointed data custodian. In South Africa the South African Spatial Data Infrastructure (SASDI) is the means to achieve the requirement for geospatial data sharing. This requires geospatial data sharing to take place between the data custodian and the user. All data custodians are expected to comply with the Spatial Data Infrastructure Act (SDI Act) in terms of geo-spatial data sharing. Currently data custodians are experiencing challenges with regard to the sharing of geospatial data. This research is based on the current ten data themes selected by the Committee for Spatial Information and the organisations identified as the data custodians for these ten data themes. The objectives are to determine whether the identified data custodians comply with the SDI Act with respect to geospatial data sharing, and if not what are the reasons for this. Through an international comparative assessment it then determines if the compliance with the SDI Act is not too onerous on the data custodians. The research concludes that there are challenges with geospatial data sharing in South Africa and that the data custodians only partially comply with the SDI Act in terms of geospatial data sharing. However, it is shown that the South African legislation is not too onerous on the data custodians.

  16. Direct match data flow memory for data driven computing

    DOEpatents

    Davidson, G.S.; Grafe, V.G.

    1997-10-07

    A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory, and another indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
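
    The firing rule at the heart of the patent, an instruction becomes ready only once all of its operand slots are filled, can be mimicked in software. The Python sketch below is a behavioral illustration of that rule only, not the patented hardware design.

      class DataflowNode:
          """Toy node: stores A and B operands and 'fires' once both are present."""

          def __init__(self, opcode, target):
              self.opcode = opcode                 # analogue of the opcode memory entry
              self.target = target                 # analogue of the target (output address) memory
              self.params = {}                     # analogue of the parameter memory
              self.tags = {"A": False, "B": False} # tag bits: which operands have arrived

          def receive(self, slot, value):
              self.params[slot] = value
              self.tags[slot] = True
              if all(self.tags.values()):          # "fire": all operands stored
                  result = self.opcode(self.params["A"], self.params["B"])
                  print(f"fired -> send {result} to {self.target}")
                  self.tags = {"A": False, "B": False}   # consume operands (reuse bit ignored)

      node = DataflowNode(opcode=lambda a, b: a + b, target="node_7.A")
      node.receive("A", 3)   # nothing happens yet
      node.receive("B", 4)   # both operands present: fires and forwards 7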

  17. SkData: data sets and algorithm evaluation protocols in Python

    NASA Astrophysics Data System (ADS)

    Bergstra, James; Pinto, Nicolas; Cox, David D.

    2015-01-01

    Machine learning benchmark data sets come in all shapes and sizes, whereas classification algorithms assume sanitized input, such as (x, y) pairs with vector-valued input x and integer class label y. Researchers and practitioners know all too well how tedious it can be to get from the URL of a new data set to a NumPy ndarray suitable for e.g. pandas or sklearn. The SkData library handles that work for a growing number of benchmark data sets (small and large) so that one-off in-house scripts for downloading and parsing data sets can be replaced with library code that is reliable, community-tested, and documented. The SkData library also introduces an open-ended formalization of training and testing protocols that facilitates direct comparison with published research. This paper describes the usage and architecture of the SkData library.

  18. Data quality assessment for comparative effectiveness research in distributed data networks

    PubMed Central

    Brown, Jeffrey; Kahn, Michael; Toh, Sengwee

    2015-01-01

    Background Electronic health information routinely collected during healthcare delivery and reimbursement can help address the need for evidence about the real-world effectiveness, safety, and quality of medical care. Often, distributed networks that combine information from multiple sources are needed to generate this real-world evidence. Objective We provide a set of field-tested best practices and a set of recommendations for data quality checking for comparative effectiveness research (CER) in distributed data networks. Methods Explore the requirements for data quality checking and describe data quality approaches undertaken by several existing multi-site networks. Results There are no established standards regarding how to evaluate the quality of electronic health data for CER within distributed networks. Data checks of increasing complexity are often employed, ranging from consistency with syntactic rules to evaluation of semantics and consistency within and across sites. Temporal trends within and across sites are widely used, as are checks of each data refresh or update. Rates of specific events and exposures by age group, sex, and month are also common. Discussion Secondary use of electronic health data for CER holds promise but is complex, especially in distributed data networks that incorporate periodic data refreshes. The viability of a learning health system is dependent on a robust understanding of the quality, validity, and optimal secondary uses of routinely collected electronic health data within distributed health data networks. Robust data quality checking can strengthen confidence in findings based on distributed data networks. PMID:23793049
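
    One of the widely used checks mentioned above, event counts by subgroup and month, is easy to sketch with pandas; the column names and the tiny extract below are assumptions for illustration, not part of the cited recommendations.

      import pandas as pd

      # Hypothetical extract pooled from two sites in a distributed network.
      df = pd.DataFrame({
          "site":  ["A", "A", "A", "B", "B", "B"],
          "date":  pd.to_datetime(["2013-01-15", "2013-02-10", "2013-02-11",
                                   "2013-01-20", "2013-02-05", "2013-03-30"]),
          "sex":   ["F", "M", "F", "F", "M", "M"],
          "event": [1, 1, 1, 1, 1, 1],
      })

      # Event counts by site, month, and sex -- the kind of table that is
      # compared across sites and across data refreshes to spot drift.
      rates = (df.groupby(["site", df["date"].dt.to_period("M"), "sex"])["event"]
                 .sum()
                 .reset_index())
      print(rates)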

  19. Data You May Like: A Recommender System for Research Data Discovery

    NASA Astrophysics Data System (ADS)

    Devaraju, A.; Davy, R.; Hogan, D.

    2016-12-01

    Various data portals have been developed to facilitate access to research datasets from different sources. For example, the Data Publisher for Earth & Environmental Science (PANGAEA), the Registry of Research Data Repositories (re3data.org), and the National Geoscience Data Centre (NGDC). Due to data quantity and heterogeneity, finding relevant datasets on these portals may be difficult and tedious. Keyword searches based on specific metadata elements or multi-key indexes may return irrelevant results. Faceted searches may be unsatisfactory and time consuming, especially when facet values are exhaustive. We need a much more intelligent way to complement existing searching mechanisms in order to enhance user experiences of the data portals. We developed a recommender system that helps users to find the most relevant research datasets on CSIRO's Data Access Portal (DAP). The system is based on content-based filtering. We computed the similarity of datasets based on data attributes (e.g., descriptions, fields of research, location, contributors, and provenance) and inference from transaction logs (e.g., the relations among datasets and between queries and datasets). We improved the recommendation quality by assigning weights to data similarities. The weight values are drawn from a survey involving data users. The recommender results for a given dataset are accessible programmatically via a web service. Taking both data attributes and user actions into account, the recommender system will make it easier for researchers to find and reuse data offered through the data portal.
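
    Content-based filtering over dataset descriptions can be sketched in a few lines with scikit-learn; the toy catalogue below is invented, and the weighting of attribute similarities described in the abstract is omitted for brevity.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # Invented dataset descriptions standing in for portal metadata.
      descriptions = [
          "monthly sea surface temperature grids for the Coral Sea",
          "ocean temperature and salinity profiles from research voyages",
          "soil moisture observations from an agricultural sensor network",
      ]

      tfidf = TfidfVectorizer(stop_words="english")
      matrix = tfidf.fit_transform(descriptions)
      similarity = cosine_similarity(matrix)

      # Recommend the dataset most similar to dataset 0 (excluding itself).
      query = 0
      scores = similarity[query].copy()
      scores[query] = -1
      print("recommend dataset:", scores.argmax())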

  20. An online analytical processing multi-dimensional data warehouse for malaria data

    PubMed Central

    Madey, Gregory R; Vyushkov, Alexander; Raybaud, Benoit; Burkot, Thomas R; Collins, Frank H

    2017-01-01

    Abstract Malaria is a vector-borne disease that contributes substantially to the global burden of morbidity and mortality. The management of malaria-related data from heterogeneous, autonomous, and distributed data sources poses unique challenges and requirements. Although online data storage systems exist that address specific malaria-related issues, a globally integrated online resource to address different aspects of the disease does not exist. In this article, we describe the design, implementation, and applications of a multi-dimensional, online analytical processing data warehouse, named the VecNet Data Warehouse (VecNet-DW). It is the first online, globally-integrated platform that provides efficient search, retrieval and visualization of historical, predictive, and static malaria-related data, organized in data marts. Historical and static data are modelled using star schemas, while predictive data are modelled using a snowflake schema. The major goals, characteristics, and components of the DW are described along with its data taxonomy and ontology, the external data storage systems and the logical modelling and physical design phases. Results are presented as screenshots of a Dimensional Data browser, a Lookup Tables browser, and a Results Viewer interface. The power of the DW emerges from integrated querying of the different data marts and structuring those queries to the desired dimensions, enabling users to search, view, analyse, and store large volumes of aggregated data, and responding better to the increasing demands of users. Database URL https://dw.vecnet.org/datawarehouse/ PMID:29220463
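
    A star-schema roll-up of the kind such a warehouse serves, a fact table joined to a dimension table and aggregated along chosen dimensions, can be sketched with pandas; the tables and column names below are invented and are not VecNet-DW's actual schema.

      import pandas as pd

      # Fact table: one row per reported case count (invented values).
      fact = pd.DataFrame({
          "location_id": [1, 1, 2, 2],
          "year":        [2014, 2015, 2014, 2015],
          "cases":       [120, 95, 300, 240],
      })

      # Dimension table describing locations.
      dim_location = pd.DataFrame({
          "location_id": [1, 2],
          "country":     ["Country X", "Country Y"],
          "region":      ["Region 1", "Region 2"],
      })

      # Join fact to dimension, then roll the measure up by region and year.
      cube = (fact.merge(dim_location, on="location_id")
                  .groupby(["region", "year"])["cases"]
                  .sum())
      print(cube)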

  1. Big Data: More than Just Big and More than Just Data.

    PubMed

    Spencer, Gregory A

    2017-01-01

    According to a report, 90 percent of the data in the world today were created in the past two years. This statistic is not surprising given the explosion of mobile phones and other devices that generate data, the Internet of Things (e.g., smart refrigerators), and metadata (data about data). While it might be a stretch to figure out how a healthcare organization can use data generated from an ice maker, data from a plethora of rich and useful sources, when combined with an organization's own data, can produce improved results. How can healthcare organizations leverage these rich and diverse data sources to improve patients' health and make their businesses more competitive? The authors of the two feature articles in this issue of Frontiers provide tangible examples of how their organizations are using big data to meaningfully improve healthcare. Sentara Healthcare and Carolinas HealthCare System both use big data in creative ways that differ because of different business situations, yet are also similar in certain respects.

  2. The Kiel data management infrastructure - arising from a generic data model

    NASA Astrophysics Data System (ADS)

    Fleischer, D.; Mehrtens, H.; Schirnick, C.; Springer, P.

    2010-12-01

    The Kiel Data Management Infrastructure (KDMI) started from a cooperation of three large-scale projects (SFB574, SFB754 and Cluster of Excellence The Future Ocean) and the Leibniz Institute of Marine Sciences (IFM-GEOMAR). The common strategy for project data management is a single person collecting and transforming data according to the requirements of the targeted data center(s). The intention of the KDMI cooperation is to avoid redundant and potentially incompatible data management efforts for scientists and data managers and to create a single sustainable infrastructure. An increased level of complexity in the conceptual planning arose from the diversity of marine disciplines and the approximately 1000 scientists involved. KDMI key features focus on data provenance, which we consider to comprise the entire workflow from field sampling through lab work to data calculation and evaluation. Managing the data of each individual project participant in this way yields the data management for the entire project and warrants the reusability of (meta)data. Accordingly, scientists provide a workflow definition of their data creation procedures resulting in their target variables. The central idea in the development of the KDMI presented here is based on the object-oriented programming concept, which allows one object definition (workflow) to have an infinite number of object instances (data). Each definition is created by a graphical user interface and produces XML output stored in a database using a generic data model. On creation of a data instance, the KDMI translates the definition into web forms for the scientist; the generic data model then accepts all information input following the given data provenance definition. An important aspect of the implementation phase is the possibility of a successive transition from daily measurement routines resulting in single spreadsheet files with well-known points of failure and limited reusability to a central infrastructure as a

  3. Tools and strategies for instrument monitoring, data mining and data access

    NASA Astrophysics Data System (ADS)

    van Hees, R. M., ,, Dr

    2009-04-01

    The ever-growing size of data sets produced by various satellite instruments creates a challenge in data management. Three main tasks were identified: instrument performance monitoring, data mining by users, and data deployment. In this presentation, I will discuss the three tasks and our solution. As a practical example to illustrate the problem and make the discussion less abstract, I will use Sciamachy on board the ESA satellite Envisat. Since the launch of Envisat, in March 2002, Sciamachy has performed nearly a billion science measurements and daily calibration measurements. The total size of the data set (not including reprocessed data) is over 30 TB, distributed over 150,000 files. [Instrument Monitoring] Most instruments produce house-keeping data, which may include time, geo-location, temperature of different parts of the instrument, and instrument settings and configuration. In addition, many instruments perform calibration measurements. Instrument performance monitoring requires automated analysis of critical parameters for events, and the option to inspect off-line the behavior of various parameters in time. We chose to extract the necessary information from the SCIAMACHY data products, and store everything in one file, where we separated house-keeping data from calibration measurements. Due to the large volume and the need for quick random access, the Hierarchical Data Format (HDF5) was our obvious choice. The HDF5 format is self-describing and designed to organize different types of data in one file. For example, one data set may contain the meta data of the calibration measurements: time, geo-location, instrument settings, quality parameters (temperature of the instrument), while a second large data set contains the actual measurements. The HDF5 high-level packet table API is ideal for tables that only grow (by appending rows), while the HDF5 table API is better suited for tables where rows need to be updated, inserted or replaced. In
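
    The append-only pattern described above, growing tables of house-keeping records, maps naturally onto HDF5 datasets with an unlimited dimension. The sketch below uses h5py rather than the HDF5 packet table C API mentioned in the abstract, and the record layout and values are invented.

      import numpy as np
      import h5py

      row = np.dtype([("time", "f8"), ("detector_temp", "f4"), ("quality", "i2")])

      with h5py.File("monitoring.h5", "w") as f:          # illustrative file name
          # Growable one-dimensional table of house-keeping records.
          hk = f.create_dataset("housekeeping", shape=(0,), maxshape=(None,), dtype=row)

          def append(dset, records):
              records = np.asarray(records, dtype=dset.dtype)
              n = dset.shape[0]
              dset.resize((n + len(records),))            # grow, then write the new rows
              dset[n:] = records

          append(hk, [(0.0, 221.4, 0), (1.0, 221.5, 0)])
          append(hk, [(2.0, 221.7, 1)])
          print(hk[...])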

  4. LANDSAT data preprocessing

    NASA Technical Reports Server (NTRS)

    Austin, W. W.

    1983-01-01

    The effect on LANDSAT data of a Sun angle correction, an intersatellite LANDSAT-2 and LANDSAT-3 data range adjustment, and the atmospheric correction algorithm was evaluated. Fourteen 1978 crop year LACIE sites were used as the site data set. The preprocessing techniques were applied to multispectral scanner channel data, and the transformed data were plotted and used to analyze the effectiveness of the preprocessing techniques. Ratio transformations effectively reduce the need for preprocessing techniques to be applied directly to the data. Subtractive transformations are more sensitive to Sun angle and atmospheric corrections than ratios. Preprocessing techniques, other than those applied at the Goddard Space Flight Center, should only be applied as an option of the user. While performed on LANDSAT data, the study results are also applicable to meteorological satellite data.
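
    The reduced Sun-angle sensitivity of ratio transformations can be seen in a toy calculation: scaling both bands by the same illumination factor cancels in the ratio but not in the difference. The band values below are fabricated for illustration.

      import numpy as np

      red = np.array([40.0, 55.0, 70.0])        # fabricated MSS band values
      nir = np.array([80.0, 60.0, 95.0])
      illumination = 0.8                         # a lower Sun angle dims both bands equally

      ratio_raw    = nir / red
      ratio_dimmed = (illumination * nir) / (illumination * red)   # factor cancels exactly
      diff_raw     = nir - red
      diff_dimmed  = illumination * nir - illumination * red       # still Sun-angle dependent

      print(np.allclose(ratio_raw, ratio_dimmed))   # True
      print(np.allclose(diff_raw, diff_dimmed))     # False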

  5. LADS: Optimizing Data Transfers using Layout-Aware Data Scheduling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kim, Youngjae; Atchley, Scott; Vallee, Geoffroy R

    While future terabit networks hold the promise of significantly improving big-data motion among geographically distributed data centers, significant challenges must be overcome even on today's 100 gigabit networks to realize end-to-end performance. Multiple bottlenecks exist along the end-to-end path from source to sink. Data storage infrastructure at both the source and sink and its interplay with the wide-area network are increasingly the bottleneck to achieving high performance. In this paper, we identify the issues that lead to congestion on the path of an end-to-end data transfer in the terabit network environment, and we present a new bulk data movement framework called LADS for terabit networks. LADS exploits the underlying storage layout at each endpoint to maximize throughput without negatively impacting the performance of shared storage resources for other users. LADS also uses the Common Communication Interface (CCI) in lieu of the sockets interface to use zero-copy, OS-bypass hardware when available. It can further improve data transfer performance under congestion on the end systems by buffering at the source using flash storage. With our evaluations, we show that LADS can avoid congested storage elements within the shared storage resource, improving I/O bandwidth and data transfer rates across the high speed networks.

  6. Approach to data exchange: the spatial data transfer standard

    USGS Publications Warehouse

    Rossmeissl, Hedy J.; Rugg, Robert D.

    1992-01-01

    Significant developments have taken place in the disciplines of cartography and geography in recent years with the advent of computer hardware and software that manipulate and process digital cartographic and geographic data more efficiently. The availability of inexpensive and powerful hardware and software systems offers the capability of displaying and analyzing spatial data to a growing number of users. As a result, developing and using existing digital cartographic databases are becoming very popular. However, the absence of uniform standards for the transfer of digital spatial data is hindering the exchange of data and increasing costs. Several agencies of the U.S. government and the academic community have been working hard over the last few years to develop a spatial data transfer standard that includes definitions of standard terminology, a spatial data transfer specification, recommendations on reporting digital cartographic data quality, and standard topographic and hydrographic entity terms and definitions. This proposed standard was published in the January 1988 issue of The American Cartographer. Efforts to test and promote this standard were coordinated by the U.S. Geological Survey. A Technical Review Board was appointed with representatives from the U.S. government, the private sector, and the academic community to complete the standard for submittal to the National Institute of Standards and Technology for approval as a Federal Information Processing Standard. The proposed standard was submitted in February 1992 for final approval.

  7. Big(ger) Data as Better Data in Open Distance Learning

    ERIC Educational Resources Information Center

    Prinsloo, Paul; Archer, Elizabeth; Barnes, Glen; Chetty, Yuraisha; van Zyl, Dion

    2015-01-01

    In the context of the hype, promise and perils of Big Data and the currently dominant paradigm of data-driven decision-making, it is important to critically engage with the potential of Big Data for higher education. We do not question the potential of Big Data, but we do raise a number of issues, and present a number of theses to be seriously…

  8. SWADE Data Guide

    NASA Technical Reports Server (NTRS)

    Oberholtzer, David; Donelan, Mark

    1996-01-01

    This is a guide to the data that were collected during the Surface Wave Dynamics Experiment (SWADE) carried out from October 1990 through March 1991. The area studied was the Mid-Atlantic Bight. Buoys collected data continuously during this time. Included are three 3-meter discus buoys that the National Data Buoy Center (NDBC) set in the area for this project; NDBC later added a fourth 3-meter discus buoy and an experimental coastal buoy. Data from all of the NDBC buoys and stations from New England to Florida have been included in the data set; in addition, several other buoys monitored meteorological and ocean conditions for parts of the experiment. Several times during the project, there were intensive data collection periods. During the third of these, there were 7 aircraft and a ship collecting data. Maps of each mission from the platforms that have shared their data are included. The anonymous FTP site that has been established to make these data public is described.

  9. Geospatial-enabled Data Exploration and Computation through Data Infrastructure Building Blocks

    NASA Astrophysics Data System (ADS)

    Song, C. X.; Biehl, L. L.; Merwade, V.; Villoria, N.

    2015-12-01

    Geospatial data are present everywhere today with the proliferation of location-aware computing devices and sensors. This is especially true in the scientific community, where large amounts of data are driving research and education activities in many domains. Collaboration over geospatial data, for example in modeling, data analysis, and visualization, must still overcome the barriers of specialized software and expertise, among other challenges. The GABBs project aims at enabling broader access to geospatial data exploration and computation by developing spatial data infrastructure building blocks that leverage capabilities of the end-to-end application service and virtualized computing framework in HUBzero. Funded by the NSF Data Infrastructure Building Blocks (DIBBS) initiative, GABBs provides a geospatial data architecture that integrates spatial data management, mapping and visualization and will make it available as open source. The outcome of the project will enable users to rapidly create tools and share geospatial data and tools on the web for interactive exploration of data without requiring significant software development skills, GIS expertise or IT administrative privileges. This presentation will describe the development of geospatial data infrastructure building blocks and the scientific use cases that help drive the software development, as well as seek feedback from the user communities.

  10. Fiber optic data link for data acquisition and analysis

    NASA Astrophysics Data System (ADS)

    Saulsberry, Garen

    A data link has been designed and developed for use with fiber optics as a transmission medium, though coaxial and twisted pair cable might also be used. Multiple data types may be transferred at various rates up to 100 Mbits per second and data word width may be programmed to obtain the highest level of efficiency from the bit rate.

  11. A scope classification of data quality requirements for food composition data.

    PubMed

    Presser, Karl; Hinterberger, Hans; Weber, David; Norrie, Moira

    2016-02-15

    Data quality is an important issue when managing food composition data since the usage of the data can have a significant influence on policy making and further research. Although several frameworks for data quality have been proposed, general tools and measures are still lacking. As a first step in this direction, we investigated data quality requirements for an information system to manage food composition data, called FoodCASE. The objective of our investigation was to find out if different requirements have different impacts on the intrinsic data quality that must be regarded during data quality assessment and how these impacts can be described. We refer to the resulting classification with its categories as the scope classification of data quality requirements. As proof of feasibility, the scope classification has been implemented in the FoodCASE system. Copyright © 2015 Elsevier Ltd. All rights reserved.

  12. Data Point Averaging for Computational Fluid Dynamics Data

    NASA Technical Reports Server (NTRS)

    Norman, Jr., David (Inventor)

    2016-01-01

    A system and method for generating fluid flow parameter data for use in aerodynamic heating analysis. Computational fluid dynamics data is generated for a number of points in an area on a surface to be analyzed. Sub-areas corresponding to areas of the surface for which an aerodynamic heating analysis is to be performed are identified. A computer system automatically determines a sub-set of the number of points corresponding to each of the number of sub-areas and determines a value for each of the number of sub-areas using the data for the sub-set of points corresponding to each of the number of sub-areas. The value is determined as an average of the data for the sub-set of points corresponding to each of the number of sub-areas. The resulting parameter values then may be used to perform an aerodynamic heating analysis.
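
    The averaging step described here reduces to grouping the computational fluid dynamics points by the sub-area that contains them and taking the mean of each group; the sketch below illustrates only that arithmetic, with made-up point values and sub-area assignments.

      import numpy as np

      # Made-up CFD surface points: a heating-related value per point and the
      # index of the sub-area each point falls in.
      values   = np.array([1.2, 1.4, 0.9, 2.1, 2.0, 1.8])
      sub_area = np.array([0,   0,   0,   1,   1,   2])

      n_areas  = sub_area.max() + 1
      sums     = np.bincount(sub_area, weights=values, minlength=n_areas)
      counts   = np.bincount(sub_area, minlength=n_areas)
      averages = sums / counts                  # one representative value per sub-area

      print(averages)   # inputs to the downstream aerodynamic heating analysis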

  13. Data Point Averaging for Computational Fluid Dynamics Data

    NASA Technical Reports Server (NTRS)

    Norman, David, Jr. (Inventor)

    2014-01-01

    A system and method for generating fluid flow parameter data for use in aerodynamic heating analysis. Computational fluid dynamics data is generated for a number of points in an area on a surface to be analyzed. Sub-areas corresponding to areas of the surface for which an aerodynamic heating analysis is to be performed are identified. A computer system automatically determines a sub-set of the number of points corresponding to each of the number of sub-areas and determines a value for each of the number of sub-areas using the data for the sub-set of points corresponding to each of the number of sub-areas. The value is determined as an average of the data for the sub-set of points corresponding to each of the number of sub-areas. The resulting parameter values then may be used to perform an aerodynamic heating analysis.

  14. Data Entry: Towards the Critical Study of Digital Data and Education

    ERIC Educational Resources Information Center

    Selwyn, Neil

    2015-01-01

    The generation and processing of data through digital technologies is an integral element of contemporary society, as reflected in recent debates over online data privacy, "Big Data" and the rise of data mining and analytics in business, science and government. This paper outlines the significance of digital data within education,…

  15. Data-centric Science: New challenges for long-term archives and data publishers

    NASA Astrophysics Data System (ADS)

    Stockhause, Martina; Lautenschlager, Michael

    2016-04-01

    In recent years the publication of data has become increasingly common. Data and metadata for a single project are often disseminated by multiple data centers in federated data infrastructures. At the same time, data are shared earlier to enable collaboration within research projects. The research data environment has become more heterogeneous and the data more dynamic. Only a few data or metadata repositories are long-term archives (LTAs) with WDS/DSA certification, complying with Force 11's 'Joint Declaration of Data Citation Principles'. Therefore, for the long-term usage of these data and information, a small number of LTAs have the task of preserving them. They replicate, connect, quality assure, harmonize, archive, and curate these different types of data from multiple data centers with different operation procedures and data standards. Consortia or federations of certified LTAs are needed to meet the challenges of big data storage and citations. Data publishers play a central role in storing, preserving, and disseminating scientific information. Portals of these federations of LTAs or data registration agencies like DataCite might even become the portals of the future for scientific knowledge discovery. The example of CMIP6 is used to illustrate this future perspective on the role of LTAs/data publishers.

  16. Data Recipes: Toward Creating How-To Knowledge Base for Earth Science Data

    NASA Technical Reports Server (NTRS)

    Shen, Suhung; Lynnes, Chris; Acker, James G.; Beaty, Tammy

    2015-01-01

    Both the diversity and volume of Earth science data from satellites and numerical models are growing dramatically, due to an increasing population of measured physical parameters and an increasing variety of spatial and temporal resolutions for many data products. To further complicate matters, Earth science data delivered to data archive centers are commonly found in different formats and structures. NASA data centers, managed by the Earth Observing System Data and Information System (EOSDIS), have developed a rich and diverse set of data services and tools with features intended to simplify finding, downloading, and working with these data. Although most data services and tools have user guides, many users still experience difficulties with accessing or reading data due to varying levels of familiarity with data services, tools, or formats. The data recipe project at the Goddard Earth Science Data and Information Services Center (GES DISC) was initiated in late 2012 to enhance user support. A data recipe is a how-to online explanatory document, with step-by-step instructions and examples of accessing and working with real data (http://disc.sci.gsfc.nasa.gov/recipes). The current suite of recipes has been found to be very helpful, especially to first-time users of particular data services, tools, or data products. Online traffic to the data recipe pages is significant, even though the data recipe topics are still limited. An Earth Science Data System Working Group (ESDSWG) for data recipes was established in the spring of 2014, aimed at initiating an EOSDIS-wide campaign for leveraging the distributed knowledge within EOSDIS and its user communities regarding their respective services and tools. The ESDSWG data recipe group is working on an inventory and analysis of existing data recipes and tutorials, and will provide guidelines and recommendations for writing and grouping data recipes, and for cross-linking recipes to data products. This presentation gives an

  17. TRMM Data Improvement as Part of the GPM Data Processing

    NASA Technical Reports Server (NTRS)

    Stocker, Erich F.; Ji, Y.; Kwiatkowski, J.; Kelley, O.; Stout, J.; Woltz, L.

    2016-01-01

    NASA has a long-standing commitment to the improvement of its mission datasets. Indeed, data reprocessing is always built into the plans, schedule, and budget for the mission data processing system. However, in addition to this ongoing mission reprocessing, NASA also supports a final reprocessing of all the data for a mission upon its completion (known as Phase F). TRMM Phase F started with the end of the TRMM mission in June of 2015. This last reprocessing has two overall goals: improvement of the TRMM mission data products, and incorporation of the 17+ years of TRMM data into the ongoing NASA/JAXA GPM data processing. The first goal guarantees that the latest algorithms used for precipitation retrievals will also be used in reprocessing the TRMM data. The second goal ensures that as GPM algorithms are improved, the entire TRMM record will always be reprocessed with each GPM reprocessing. In essence, TRMM becomes another of the GPM constellation satellites. This paper will concentrate on presenting the improvements to TMI level 1 data, including calibration, geolocation, and emissive antenna corrections. It will describe the format changes that will occur and how the TMI level 1C product will be intercalibrated using GMI as the reference calibration. It will also provide an overview of changes in the precipitation radar products as well as the combined TMI/PR product.

  18. Data Stewardship and Long-Term Archive of ICESat Data at the National Snow and Ice Data Center (NSIDC)

    NASA Astrophysics Data System (ADS)

    Fowler, D. K.; Moses, J. F.; Duerr, R. E.; Webster, D.; Korn, D.

    2010-12-01

    Data Stewardship is becoming a principal part of a data manager’s work at NSIDC. It is vitally important that our organization makes a commitment to both current and long-term goals of data management and the preservation of our scientific data. Data must be available to researchers not only during active missions, but long after missions end. This includes maintaining accurate documentation, data tools, and a knowledgeable user support staff. NSIDC is preparing for long-term support of the ICESat mission data. Though ICESat has seen its last operational day, the data is still being improved and NSIDC is scheduled to archive the final release, Release 33, starting late in 2010. This release will include the final adjustments to the processing algorithms and will produce the best possible products to date. Along with the higher-level data sets, all supporting documentation will be archived at NSIDC. For the long-term archive, it is imperative that there is sufficient information about how products were prepared in order to convince future researchers that the scientific results are reproducible. The processing algorithms along with the Level 0 and ancillary products used to create the higher-level products will be archived and made available to users. This can enable users to examine production history, to derive revised products and to create their own products. Also contained in the long-term archive will be pre-launch, calibration/validation, and test data. These data are an important part of the provenance which must be preserved. For longevity, we’ll need to archive the data and documentation in formats that will be supported in the years to come.

  19. NSSDC data listing

    NASA Technical Reports Server (NTRS)

    Horowitz, Richard; King, Joseph H.

    1993-01-01

    This document identifies, in a highly summarized way, all the data held at the NSSDC. These data cover astrophysics and astronomy, solar and space physics, planetary and lunar, and Earth science disciplines. They are primarily, but not exclusively, from past and ongoing NASA spaceflight missions. We first identify all the data electronically available through NSSDC's principal online (magnetic disk-based) and nearline (robotic jukebox-based) systems, and then those data available on CD-ROMs. Finally, we identify all NSSDC-held data, the majority of which are still offline on magnetic tape, film, etc., but including the electronically accessible and CD-ROM-resident data of earlier sections. These comprehensive identifications are in the form of two listings, one for the majority of NSSDC-held data sets resulting from individual instruments flown on individual spacecraft, and the other listing for the remainder of NSSDC-held data sets which do not adhere to this spacecraft/experiment/dataset hierarchy. The latter listing is presented in two parts, one for the numerous source catalogs of the NSSDC-operated Astronomical Data Center, and the other for the remainder. Access paths to all these data, and to further information about each, are also given in the related sections of this Data Listing. Note that this document is a companion to the electronically accessible information files (in particular, the NASA Master Directory) at NSSDC which also identify NSSDC-resident (and other) data.

  20. A highlight of data products from IRIS Data Services

    NASA Astrophysics Data System (ADS)

    Hutko, A. R.; Bahavar, M.; Trabant, C. M.; Van Fossen, M.; Weekly, R. T.

    2014-12-01

    Since 2009 the IRIS Data Management Center has served the seismology community in a variety of ways by offering higher-order data products, generated internally or by the research community, in addition to the raw time series data traditionally managed at the DMC. These products are intended to facilitate research as baseline standards, tools for data visualization or characterization, and teaching & outreach material. We currently serve 25 data products, 7 of which are event-based and provide quick looks at many aspects of interest to researchers, often within a few hours of real time. Among our new offerings is an expansion of the visualization capabilities of the Earth Model Collaboration, a repository of author-contributed tomography and other Earth models. Currently EMC tools allow users to make 2D plots slicing through models. New 3D visualization tools being developed will bridge the gap between 2D slices and advanced, sometimes complicated, 3D visualization packages by offering common 3D capabilities that can be set up and learned within minutes. The newly released Global Stacks project stacks up to a million seismograms to illuminate the global seismic wavefield. Seismograms are processed and stacked for three-component data across many frequency bands. The resulting stacks lead to high-fidelity wavefield images that clearly highlight characteristics such as dispersion in surface waves and many phases not commonly observed, such as P'P'P'P'. Another recent addition is the Automated Surface Wave Phase Velocity Measuring System, an automated do-it-yourself surface wave tomography package that requires minimal user input and produces research-quality tomography results. To further enhance our effort to support the research community, we invite proposals for collaborative data product development. This is an excellent opportunity for researchers to put forward unique and useful data product ideas and collaborate with the DMC in the development of the

  1. TERRA/MODIS Data Products and Data Management at the GES-DAAC

    NASA Astrophysics Data System (ADS)

    Sharma, A. K.; Ahmad, S.; Eaton, P.; Koziana, J.; Leptoukh, G.; Ouzounov, D.; Savtchenko, A.; Serafino, G.; Sikder, M.; Zhou, B.

    2001-05-01

    Since February 2000, the Earth Sciences Distributed Active Archive Center (GES-DAAC) at the NASA/Goddard Space Flight Center has been successfully ingesting, processing, archiving, and distributing Moderate Resolution Imaging Spectroradiometer (MODIS) data. MODIS is the key instrument aboard the Terra satellite, viewing the entire Earth's surface every 1 to 2 days and acquiring data in 36 channels in the visible and infrared spectral bands (0.4 to 14.4 microns). Higher resolution (250 m, 500 m, and 1 km pixel) data are improving our understanding of global dynamics and processes occurring on the land, in the oceans, and in the lower atmosphere, and will play a vital role in the future development of validated, global, interactive Earth-system models. MODIS calibrated and uncalibrated radiances and geolocation products were released to the public in April 2000, and a suite of ocean products and an entire suite of atmospheric products were released by early January 2001. The suite of ocean products is grouped into three categories: Ocean Color, SST, and Primary Productivity. The suite of atmospheric products includes Aerosol, Total Precipitable Water, Cloud Optical and Physical Properties, Atmospheric Profiles, and Cloud Mask. The MODIS Data Support Team (MDST) at the GES-DAAC has been providing support for enabling basic scientific research and assistance in accessing scientific data and information for the Earth Science user community. Support is also provided for data formats (HDF-EOS), information on visualization tools, documentation for data products, information on the scientific content of products, and metadata. Visit the MDST website at http://daac.gsfc.nasa.gov/CAMPAIGN_DOCS/MODIS/index.html The task of processing, archiving, and distributing enormous volumes of MODIS data to users (more than 0.5 TB a day) has led to the development of a unique web-based GES DAAC Search and Order system (http://acdisx.gsfc.nasa.gov/data/), data handling software, and tools

  2. Taming the data wilderness with the VHO: Integrating heliospheric data sets

    NASA Astrophysics Data System (ADS)

    Schroeder, P.; Szabo, A.; Narock, T.

    Currently, space physicists are faced with a bewildering array of heliospheric missions, experiments, and data sets available at archives distributed around the world. Daunting even for those most familiar with the field, locating the heliospheric data they need is extremely challenging, if not impossible, for physicists in other concentrations (solar physics, magnetospheric physics, etc.). The Virtual Heliospheric Observatory (VHO) will help to solve this problem by creating an Application Programming Interface (API) and web portal that integrates these data sets to find the highest quality data for a given task. The VHO will locate the best available data, often found only at PI institutions rather than at national archives like the NSSDC. The VHO will therefore facilitate a dynamic data environment where improved data products are made available immediately. In order to accomplish this, the VHO will enforce a metadata standard on participating data providers with sufficient depth to allow for meaningful scientific evaluation of similar data products. The VHO will provide an automated way for secondary sites to keep mirrors of data archives up to date, encouraging the generation of secondary or added-value data products. The VHO will interact seamlessly with the Virtual Solar Observatory (VSO) and other Virtual Observatories (VxOs) to allow for interdisciplinary data searching. Software tools for these data sets will also be available through the VHO. Finally, the VHO will provide linkages to the modeling community and will develop metadata standards for the

  3. From data to function: functional modeling of poultry genomics data.

    PubMed

    McCarthy, F M; Lyons, E

    2013-09-01

    One of the challenges of functional genomics is to create a better understanding of the biological system being studied so that the data produced are leveraged to provide gains for agriculture, human health, and the environment. Functional modeling enables researchers to make sense of these data as it reframes a long list of genes or gene products (mRNA, ncRNA, and proteins) by grouping based upon function, be it individual molecular functions or interactions between these molecules or broader biological processes, including metabolic and signaling pathways. However, poultry researchers have been hampered by a lack of functional annotation data, tools, and training to use these data and tools. Moreover, this lack is becoming more critical as new sequencing technologies enable us to generate data not only for an increasingly diverse range of species but also individual genomes and populations of individuals. We discuss the impact of these new sequencing technologies on poultry research, with a specific focus on what functional modeling resources are available for poultry researchers. We also describe key strategies for researchers who wish to functionally model their own data, providing background information about functional modeling approaches, the data and tools to support these approaches, and the strengths and limitations of each. Specifically, we describe methods for functional analysis using Gene Ontology (GO) functional summaries, functional enrichment analysis, and pathways and network modeling. As annotation efforts begin to provide the fundamental data that underpin poultry functional modeling (such as improved gene identification, standardized gene nomenclature, temporal and spatial expression data and gene product function), tool developers are incorporating these data into new and existing tools that are used for functional modeling, and cyberinfrastructure is being developed to provide the necessary extendibility and scalability for storing and

  4. Data Publication: A Partnership between Scientists, Data Managers and Librarians

    NASA Astrophysics Data System (ADS)

    Raymond, L.; Chandler, C.; Lowry, R.; Urban, E.; Moncoiffe, G.; Pissierssens, P.; Norton, C.; Miller, H.

    2012-04-01

    Current literature on the topic of data publication suggests that success is best achieved when there is a partnership between scientists, data managers, and librarians. The Marine Biological Laboratory/Woods Hole Oceanographic Institution (MBLWHOI) Library and the Biological and Chemical Oceanography Data Management Office (BCO-DMO) have developed tools and processes to automate the ingestion of metadata from BCO-DMO for deposit with datasets into the Institutional Repository (IR) Woods Hole Open Access Server (WHOAS). The system also incorporates functionality for BCO-DMO to request a Digital Object Identifier (DOI) from the Library. This partnership allows the Library to work with a trusted data repository to ensure high quality data while the data repository utilizes library services and is assured of a permanent archive of the copy of the data extracted from the repository database. The assignment of persistent identifiers enables accurate data citation. The Library can assign a DOI to appropriate datasets deposited in WHOAS. A primary activity is working with authors to deposit datasets associated with published articles. The DOI would ideally be assigned before submission and be included in the published paper so readers can link directly to the dataset, but DOIs are also being assigned to datasets related to articles after publication. WHOAS metadata records link the article to the datasets and the datasets to the article. The assignment of DOIs has enabled another important collaboration with Elsevier, publisher of educational and professional science journals. Elsevier can now link from articles in the Science Direct database to the datasets available from WHOAS that are related to that article. The data associated with the article are freely available from WHOAS and accompanied by a Dublin Core metadata record. In addition, the Library has worked with researchers to deposit datasets in WHOAS that are not appropriate for national, international, or domain

  5. Enabling Geotechnical Data for Broader Use by the Spatial Data Infrastructures

    ERIC Educational Resources Information Center

    Zand, Amir Ghasem

    2011-01-01

    Geotechnical data are one of the most prevalent data types in civil engineering projects. The majority of the civil engineering projects that are in use today are designed using site-specific geotechnical data. The usage of geotechnical data is not limited to construction projects. This data is used in a wide range of applications, including…

  6. Analysis of selected data from the triservice missile data base

    NASA Technical Reports Server (NTRS)

    Allen, Jerry M.; Shaw, David S.; Sawyer, Wallace C.

    1989-01-01

    An extremely large, systematic, axisymmetric-body/tail-fin data base has been gathered through tests of an innovative missile model design which is described herein. These data were originally obtained for incorporation into a missile aerodynamics code based on engineering methods (Program MISSILE3), but these data are also valuable as diagnostic test cases for developing computational methods because of the individual-fin data included in the data base. Detailed analyses of four sample cases from these data are presented to illustrate interesting individual-fin force and moment trends. These samples quantitatively show how bow shock, fin orientation, fin deflection, and body vortices can produce strong, unusual, and computationally challenging effects on individual fin loads. Flow-visualization photographs are examined to provide physical insight into the cause of these effects.

  7. Developing Healthcare Data Analytics APPs with Open Data Science Tools.

    PubMed

    Hao, Bibo; Sun, Wen; Yu, Yiqin; Xie, Guotong

    2017-01-01

    Recent advances in big data analytics provide more flexible, efficient, and open tools for researchers to gain insight from healthcare data. However, many of these tools require researchers to develop programs in languages such as Python or R, a skill set that many researchers in the healthcare data analytics area have not mastered. To make data science more approachable, we explored existing tools and developed a practice that can help data scientists convert existing analytics pipelines to user-friendly analytics APPs with rich interactions and features of real-time analysis. With this practice, data scientists can develop customized analytics pipelines as APPs in Jupyter Notebook and disseminate them to other researchers easily, and researchers can benefit from the shared notebook to perform analysis tasks or reproduce research results much more easily.
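    The abstract does not name the specific libraries used; as a rough sketch of the general practice it describes (wrapping an existing analysis step so it can be rerun interactively inside a Jupyter Notebook), the example below uses the ipywidgets package with an invented, synthetic patient dataset and column names.

    ```python
    # Minimal sketch: turning an analysis function into an interactive Jupyter "APP".
    # The dataset, column names, and thresholds are invented for illustration.
    import numpy as np
    import pandas as pd
    import ipywidgets as widgets

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "age": rng.integers(18, 90, 200),
        "diagnosis": rng.choice(["A", "B", "C"], 200),
        "length_of_stay": rng.integers(1, 30, 200),
    })

    def summarize(age_min=18, age_max=65):
        """Existing analytics step: summarize length of stay per diagnosis for an age band."""
        subset = df[(df["age"] >= age_min) & (df["age"] <= age_max)]
        return subset.groupby("diagnosis")["length_of_stay"].describe()

    # interact() adds sliders so non-programmers can rerun the analysis without editing code.
    widgets.interact(summarize,
                     age_min=widgets.IntSlider(min=0, max=100, value=18),
                     age_max=widgets.IntSlider(min=0, max=100, value=65))
    ```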

  8. Improved Data Access From the Northern California Earthquake Data Center

    NASA Astrophysics Data System (ADS)

    Neuhauser, D.; Oppenheimer, D.; Zuzlewski, S.; Klein, F.; Jensen, E.; Gee, L.; Murray, M.; Romanowicz, B.

    2002-12-01

    The NCEDC is a joint project of the UC Berkeley Seismological Laboratory and the USGS Menlo Park to provide a long-term archive and distribution center for geophysical data for northern California. Most data are available via the Web at http://quake.geo.berkeley.edu and research accounts are available for access to specialized datasets. Current efforts continue to expand the available datasets, enhance distribution methods, and to provide rapid access to all datasets. The NCEDC archives continuous and event-based seismic and geophysical time-series data from the BDSN, the USGS NCSN, the UNR Seismic Network, the Parkfield HRSN, and the Calpine/Unocal Geysers network. In collaboration with the USGS, the NCEDC has archived a total of 887 channels from 139 sites of the "USGS low-frequency" geophysical network (UL), including data from strainmeters, creep meters, magnetometers, water well levels, and tiltmeters. There are 336 active continuous data channels that are updated at the NCEDC on a daily basis. Geodetic data from the BARD network of over 40 continuously recording GPS sites are archived at the NCEDC in both raw and RINEX format. The NCEDC is the primary archive for survey-mode GPS and other geodetic data collected in northern California by the USGS, universities, and other agencies. All of the BARD data and GPS data archived from USGS Menlo Park surveys are now available through the GPS Seamless Archive Centers (GSAC), and by FTP directly from the NCEDC. Virtually all time-series data at the NCEDC are now available in SEED with complete instrument responses. Assembling, verifying, and maintaining the response information for these networks is a huge task, and is accomplished through the collaborative efforts of the NCEDC and the contributing agencies. Until recently, the NCSN waveform data were available only through research accounts and special request methods due to incomplete instrument responses. In the last year, the USGS compiled the necessary

  9. Supporting diverse data providers in the open water data initiative: Communicating water data quality and fitness of use

    USGS Publications Warehouse

    Larsen, Sara; Hamilton, Stuart; Lucido, Jessica M.; Garner, Bradley D.; Young, Dwane

    2016-01-01

    Shared, trusted, timely data are essential elements for the cooperation needed to optimize economic, ecologic, and public safety concerns related to water. The Open Water Data Initiative (OWDI) will provide a fully scalable platform that can support a wide variety of data from many diverse providers. Many of these will be larger, well-established, and trusted agencies with a history of providing well-documented, standardized, and archive-ready products. However, some potential partners may be smaller, distributed, and relatively unknown or untested as data providers. The data these partners will provide are valuable and can be used to fill in many data gaps, but can also be variable in quality or supplied in nonstandardized formats. They may also reflect the smaller partners' variable budgets and missions, be intermittent, or of unknown provenance. A challenge for the OWDI will be to convey the quality and the contextual “fitness” of data from providers other than the most trusted brands. This article reviews past and current methods for documenting data quality. Three case studies are provided that describe processes and pathways for effective data-sharing and publication initiatives. They also illustrate how partners may work together to find a metadata reporting threshold that encourages participation while maintaining high data integrity. And lastly, potential governance is proposed that may assist smaller partners with short- and long-term participation in the OWDI.

  10. An efficient reversible privacy-preserving data mining technology over data streams.

    PubMed

    Lin, Chen-Yi; Kao, Yuan-Hung; Lee, Wei-Bin; Chen, Rong-Chang

    2016-01-01

    With the popularity of smart handheld devices and the emergence of cloud computing, users and companies can save various data, which may contain private data, to the cloud. Topics relating to data security have therefore received much attention. This study focuses on data stream environments and uses the concept of a sliding window to design a reversible privacy-preserving technology to process continuous data in real time, known as a continuous reversible privacy-preserving (CRP) algorithm. Data with CRP algorithm protection can be accurately recovered through a data recovery process. In addition, by using an embedded watermark, the integrity of the data can be verified. The results from the experiments show that, compared to existing algorithms, CRP is better at preserving knowledge and is more effective in terms of reducing information loss and privacy disclosure risk. In addition, it takes far less time for CRP to process continuous data than existing algorithms. As a result, CRP is confirmed as suitable for data stream environments and fulfills the requirements of being lightweight and energy-efficient for smart handheld devices.
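    The abstract gives no implementation details for the CRP algorithm itself; the sketch below only illustrates the underlying sliding-window concept for processing a continuous data stream. The window size and the placeholder per-window transformation are arbitrary and are not the authors' method.

    ```python
    # Generic fixed-size sliding window over a data stream (illustration only;
    # this is NOT the CRP algorithm described in the paper).
    from collections import deque

    def stream_windows(stream, window_size=4):
        """Yield successive fixed-size windows as values arrive from the stream."""
        window = deque(maxlen=window_size)
        for value in stream:
            window.append(value)
            if len(window) == window_size:
                yield list(window)

    for w in stream_windows(range(10)):
        perturbed = [v + 0.5 for v in w]   # placeholder transformation per window
        print(w, "->", perturbed)
    ```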

  11. Open Research Challenges with Big Data - A Data Scientist's Perspective

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sukumar, Sreenivas R

    In this paper, we discuss data-driven discovery challenges of the Big Data era. We observe that recent innovations in being able to collect, access, organize, integrate, and query massive amounts of data from a wide variety of data sources have brought statistical data mining and machine learning under more scrutiny and evaluation for gleaning insights from the data than ever before. In that context, we pose and debate the question - Are data mining algorithms scaling with the ability to store and compute? If yes, how? If not, why not? We survey recent developments in the state-of-the-art to discuss emerging and outstanding challenges in the design and implementation of machine learning algorithms at scale. We leverage experience from real-world Big Data knowledge discovery projects across domains of national security, healthcare and manufacturing to suggest our efforts be focused along the following axes: (i) the data science challenge - designing scalable and flexible computational architectures for machine learning (beyond just data-retrieval); (ii) the science of data challenge - the ability to understand characteristics of data before applying machine learning algorithms and tools; and (iii) the scalable predictive functions challenge - the ability to construct, learn and infer with increasing sample size, dimensionality, and categories of labels. We conclude with a discussion of opportunities and directions for future research.

  12. New Data Services for Polar Investigators from Integrated Earth Data Applications (IEDA)

    NASA Astrophysics Data System (ADS)

    Nitsche, F. O.; Ferrini, V.; Morton, J. J.; Arko, R. A.; McLain, K.; O'hara, S. H.; Carbotte, S. M.; Lehnert, K. A.; IEDA Team, I.

    2013-12-01

    Accessibility and preservation of data are needed to support multi-disciplinary research in the key environmentally sensitive Polar Regions. IEDA (Integrated Earth Data Applications) is a community-based data facility funded by the US National Science Foundation (NSF) to support, sustain, and advance the geosciences by providing data services for observational solid earth data from the Ocean, Earth, and Polar Sciences. IEDA tools and services relevant to the Polar Research Community include the Antarctic and Southern Ocean Data System (ASODS), the U.S. Antarctic Program Data Coordination Center (USAP-DCC), and GeoMapApp, as well as a number of services for sample-based data (SESAR and EarthChem). In addition to existing tools, which assist Polar investigators in archiving their data and creating DIF records for global searches in AMD, IEDA recently added several new tools and services that provide further support to investigators across the data life cycle. These include a data management plan tool (http://www.iedadata.org/compliance/plan) and a data compliance reporting tool (http://www.iedadata.org/compliance/report) that help investigators comply with the requirements of funding agencies such as the National Science Foundation (NSF). Data, especially from challenging Polar Regions, are likely to be used by other scientists for future studies. Therefore, data acknowledgment is an important concern of many investigators. To encourage data acknowledgments by data users, we link references of publications (when known) to datasets and cruises registered within the ASODS system as part of our data curation services (http://www.marine-geo.org/portals/antarctic/references.php). In addition, IEDA offers a data publication service to register scientific data with DOIs, making data sets citable as publications with attribution to investigators as authors. IEDA is a publication agent of the DataCite consortium. Offering such services provides additional incentives

  13. Applying open source data visualization tools to standard based medical data.

    PubMed

    Kopanitsa, Georgy; Taranik, Maxim

    2014-01-01

    Presentation of medical data in personal health records (PHRs) requires flexible, platform-independent tools to ensure easy access to the information. The varied backgrounds of patients, especially elderly people, call for simple graphical presentation of the data. Data in PHRs can be collected from heterogeneous sources. The use of standard-based medical data allows the development of generic visualization methods. Focusing on the deployment of open source tools, in this paper we applied JavaScript libraries to create data presentations for standard-based medical data.

  14. Building Bridges Between Geoscience and Data Science through Benchmark Data Sets

    NASA Astrophysics Data System (ADS)

    Thompson, D. R.; Ebert-Uphoff, I.; Demir, I.; Gel, Y.; Hill, M. C.; Karpatne, A.; Güereque, M.; Kumar, V.; Cabral, E.; Smyth, P.

    2017-12-01

    The changing nature of observational field data demands richer and more meaningful collaboration between data scientists and geoscientists. Thus, among other efforts, the Working Group on Case Studies of the NSF-funded RCN on Intelligent Systems Research To Support Geosciences (IS-GEO) is developing a framework to strengthen such collaborations through the creation of benchmark datasets. Benchmark datasets provide an interface between disciplines without requiring extensive background knowledge. The goals are to create (1) a means for two-way communication between geoscience and data science researchers; (2) new collaborations, which may lead to new approaches for data analysis in the geosciences; and (3) a public, permanent repository of complex data sets, representative of geoscience problems, useful to coordinate efforts in research and education. The group identified 10 key elements and characteristics for ideal benchmarks. High impact: A problem with high potential impact. Active research area: A group of geoscientists should be eager to continue working on the topic. Challenge: The problem should be challenging for data scientists. Data science generality and versatility: It should stimulate development of new general and versatile data science methods. Rich information content: Ideally the data set provides stimulus for analysis at many different levels. Hierarchical problem statement: A hierarchy of suggested analysis tasks, from relatively straightforward to open-ended tasks. Means for evaluating success: Data scientists and geoscientists need means to evaluate whether the algorithms are successful and achieve intended purpose. Quick start guide: Introduction for data scientists on how to easily read the data to enable rapid initial data exploration. Geoscience context: Summary for data scientists of the specific data collection process, instruments used, any pre-processing and the science questions to be answered. Citability: A suitable identifier to

  15. Data processing 1: Advancements in machine analysis of multispectral data

    NASA Technical Reports Server (NTRS)

    Swain, P. H.

    1972-01-01

    Multispectral data processing procedures are outlined, beginning with the data display process used to accomplish data editing and proceeding through clustering, feature selection criteria for error probability estimation, and sample clustering and classification. The effective utilization of large quantities of remote sensing data by formulating a three-stage sampling model for evaluating crop acreage estimates represents an improvement in determining the cost-benefit relationship associated with remote sensing technology.
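    As a rough, modern illustration of the clustering step mentioned above (not the original 1972 procedure), the sketch below clusters synthetic multispectral pixel vectors with k-means; the number of bands, pixels, and clusters are arbitrary.

    ```python
    # Illustrative clustering of multispectral pixels: each row is a pixel,
    # each column a spectral band. Synthetic data, arbitrary cluster count.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    pixels = rng.random((1000, 4))                 # 1000 pixels x 4 bands

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pixels)
    labels = kmeans.labels_                        # cluster assignment per pixel

    # Per-cluster mean spectra can then seed sample classification.
    for c in range(3):
        print("cluster", c, "mean spectrum:", pixels[labels == c].mean(axis=0).round(3))
    ```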

  16. Writing through Big Data: New Challenges and Possibilities for Data-Driven Arguments

    ERIC Educational Resources Information Center

    Beveridge, Aaron

    2017-01-01

    As multimodal writing continues to shift and expand in the era of Big Data, writing studies must confront the new challenges and possibilities emerging from data mining, data visualization, and data-driven arguments. Often collected under the broad banner of "data literacy," students' experiences of data visualization and data-driven…

  17. Structured data quality reports to improve EHR data quality.

    PubMed

    Taggart, Jane; Liaw, Siaw-Teng; Yu, Hairong

    2015-12-01

    The aim was to examine whether a structured data quality report (SDQR) and feedback sessions with practice principals and managers improve the quality of routinely collected data in EHRs. The intervention was conducted in four general practices participating in the Fairfield neighborhood electronic Practice Based Research Network (ePBRN). Data were extracted from their clinical information systems and summarised as an SDQR to guide feedback to practice principals and managers at 0, 4, 8 and 12 months. Data quality (DQ) metrics included completeness, correctness, consistency and duplication of patient records. Information on data recording practices, data quality improvement, and utility of SDQRs was collected at the feedback sessions at the practices. The main outcome measure was change in the recording of clinical information and level of meeting Royal Australian College of General Practice (RACGP) targets. Birth date was 100% and gender 99% complete at baseline and maintained. DQ of all variables measured improved significantly (p<0.01) over 12 months, but was not sufficient to comply with RACGP standards. Improvement was greatest with allergies. There was no significant change in duplicate records. SDQRs and feedback sessions support general practitioners and practice managers to focus on improving the recording of patient information. However, improved practice DQ was not sufficient to meet RACGP targets. Randomised controlled studies are required to evaluate strategies to improve data quality and any associated improved safety and quality of care. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
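    As a toy illustration of the kinds of data-quality metrics the abstract mentions (completeness, duplication), the pandas sketch below computes per-field completeness and a duplicate-record count; the column names and records are invented.

    ```python
    # Toy data-quality metrics in the spirit of an SDQR: completeness per field
    # and duplicate-record count. Field names and values are invented examples.
    import pandas as pd

    records = pd.DataFrame({
        "patient_id": [1, 2, 2, 3, 4],
        "birth_date": ["1980-01-01", "1975-05-20", "1975-05-20", None, "1990-09-09"],
        "gender":     ["F", "M", "M", "F", None],
        "allergies":  [None, "penicillin", "penicillin", None, "nil known"],
    })

    completeness = records.notna().mean() * 100   # % of non-missing values per field
    duplicates = records.duplicated().sum()       # count of exact duplicate rows

    print(completeness.round(1))
    print("duplicate records:", duplicates)
    ```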

  18. Preprocessing Raw Data in Clinical Medicine for a Data Mining Purpose

    NASA Astrophysics Data System (ADS)

    Peterková, Andrea; Michaľčonok, German

    2016-12-01

    Dealing with data from the field of medicine is both highly topical and difficult. On a global scale, a large amount of medical data is produced on an everyday basis. For the purposes of our research, we understand medical data as data about patients, such as results from laboratory analyses, results from screening examinations (CT, ECHO), and clinical parameters. These data are usually in a raw format, difficult to understand, non-standard, and not suitable for further processing or analysis. This paper aims to describe a possible method of preparing and preprocessing such raw medical data into a form where further analysis algorithms can be applied.
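    The abstract describes preparing raw clinical measurements for mining without giving the method; the sketch below shows two typical preprocessing steps (coercing messy laboratory strings to numbers, then imputing and scaling) under invented column names and values.

    ```python
    # Illustrative preprocessing of raw clinical values before data mining:
    # coerce messy lab strings to numbers, impute missing values, scale features.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    raw = pd.DataFrame({
        "glucose":  ["5.4", "6,1", "n/a", "7.2"],   # mixed decimal marks, missing token
        "systolic": [120, None, 140, 135],
    })

    raw["glucose"] = pd.to_numeric(raw["glucose"].str.replace(",", "."), errors="coerce")
    clean = raw.fillna(raw.mean(numeric_only=True))   # simple mean imputation

    scaled = StandardScaler().fit_transform(clean)    # standardize for analysis algorithms
    print(scaled)
    ```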

  19. Enhancing the Impact of Science Data: Toward Data Discovery and Reuse

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chappell, Alan R.; Weaver, Jesse R.; Purohit, Sumit

    The amount of data produced in support of scientific research continues to grow rapidly. Despite the accumulation and demand for scientific data, relatively little data are actually made available for the broader scientific community. We surmise that one root of this problem is the perceived difficulty of electronically publishing scientific data and associated metadata in a way that makes it discoverable. We propose exploiting Semantic Web technologies and best practices to make metadata both discoverable and easy to publish. We share experiences in curating metadata to illustrate the cumbersome nature of data reuse in the current research environment. We also make recommendations with a real-world example of how data publishers can provide their metadata by adding limited additional markup to HTML pages on the Web. With little additional effort from data publishers, the difficulty of data discovery, access, and sharing can be greatly reduced and the impact of research data greatly enhanced.

  20. The AmeriFlux Data Activity and Data System: An Evolving Collection of Data Management Techniques, Tools, Products and Services

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Boden, Thomas A; Krassovski, Misha B; Yang, Bai

    2013-01-01

    The Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory (ORNL), USA has provided scientific data management support for the U.S. Department of Energy and international climate change science since 1982. Over this period, climate change science has expanded from research focusing on basic understanding of geochemical cycles, particularly the carbon cycle, to integrated research addressing climate change impacts, vulnerability, adaptation, and mitigation. Interest in climate change data and information worldwide has grown remarkably and, as a result, so have demands and expectations for CDIAC's data systems. To meet the growing demands, CDIAC's strategy has been to design flexible data systems using proven technologies blended with new, evolving technologies and standards. CDIAC development teams are multidisciplinary and include computer science and information technology expertise, but also scientific expertise necessary to address data quality and documentation issues and to identify data products and system capabilities needed by climate change scientists. CDIAC has learned there is rarely a single commercial tool or product readily available to satisfy long-term scientific data system requirements (i.e., one size does not fit all, and the breadth and diversity of environmental data are often too complex for easy use with commercial products) and typically deploys a variety of tools and data products in an effort to provide credible data freely to users worldwide. Like many scientific data management applications, CDIAC's data systems are highly customized to satisfy specific scientific usage requirements (e.g., developing data products specific for model use) but are also designed to be flexible and interoperable to take advantage of new software engineering techniques, standards (e.g., metadata standards) and tools, and to support future Earth system data efforts (e.g., ocean acidification). CDIAC has provided data

  1. A study of dynamic data placement for ATLAS distributed data management

    NASA Astrophysics Data System (ADS)

    Beermann, T.; Stewart, G. A.; Maettig, P.

    2015-12-01

    This contribution presents a study on the applicability and usefulness of dynamic data placement methods for data-intensive systems such as ATLAS distributed data management (DDM). In this system the jobs are sent to the data, so a good distribution of data is important. Ways of forecasting workload patterns are examined, which are then used to redistribute data to achieve better overall utilisation of computing resources and to reduce the waiting time for jobs before they can run on the grid. The method is based on a tracer infrastructure that is able to monitor and store historical data accesses and which is used to create popularity reports. These reports provide detailed summaries about data accesses in the past, including information about the accessed files, the involved users, and the sites. From this past data it is possible to make near-term forecasts of data popularity in the future. This study evaluates simple prediction methods as well as more complex methods such as neural networks. Based on the outcome of the predictions, a redistribution algorithm deletes unused replicas and adds new replicas for potentially popular datasets. Finally, a grid simulator is used to examine the effects of the redistribution. The simulator replays workload on different data distributions while measuring the job waiting time and site usage. The study examines how the average waiting time is affected by the amount of data that is moved, how it differs for the various forecasting methods, and how that compares to the optimal data distribution.
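    As a much-simplified illustration of the idea of forecasting dataset popularity from past accesses and adjusting replicas accordingly (not the actual ATLAS DDM algorithms), the sketch below applies a naive moving-average forecast to invented access counts.

    ```python
    # Simplified popularity-based replication decision (illustration only, not the
    # ATLAS DDM implementation): forecast next-period accesses from recent history.
    weekly_accesses = {                      # invented per-dataset access counts
        "dataset_A": [120, 150, 170, 200],
        "dataset_B": [40, 10, 5, 0],
    }

    def forecast(history, window=3):
        """Naive moving-average forecast of next-period accesses."""
        recent = history[-window:]
        return sum(recent) / len(recent)

    for name, history in weekly_accesses.items():
        predicted = forecast(history)
        if predicted > 100:
            print(f"{name}: predicted {predicted:.0f} accesses -> add replica")
        elif predicted < 1:
            print(f"{name}: predicted {predicted:.0f} accesses -> delete unused replica")
        else:
            print(f"{name}: predicted {predicted:.0f} accesses -> keep as is")
    ```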

  2. WIFIRE Data Model and Catalog for Wildfire Data and Tools

    NASA Astrophysics Data System (ADS)

    Altintas, I.; Crawl, D.; Cowart, C.; Gupta, A.; Block, J.; de Callafon, R.

    2014-12-01

    The WIFIRE project (wifire.ucsd.edu) is building an end-to-end cyberinfrastructure for real-time and data-driven simulation, prediction and visualization of wildfire behavior. WIFIRE may be used by wildfire management authorities in the future to predict wildfire rate of spread and direction, and assess the effectiveness of high-density sensor networks in improving fire and weather predictions. WIFIRE has created a data model for wildfire resources including sensed and archived data, sensors, satellites, cameras, modeling tools, workflows and social information including Twitter feeds. This data model and associated wildfire resource catalog includes a detailed description of the HPWREN sensor network, SDG&E's Mesonet, and NASA MODIS. In addition, the WIFIRE data-model describes how to integrate the data from multiple heterogeneous sources to provide detailed fire-related information. The data catalog describes 'Observables' captured by each instrument using multiple ontologies including OGC SensorML and NASA SWEET. Observables include measurements such as wind speed, air temperature, and relative humidity, as well as their accuracy and resolution. We have implemented a REST service for publishing to and querying from the catalog using Web Application Description Language (WADL). We are creating web-based user interfaces and mobile device Apps that use the REST interface for dissemination to wildfire modeling community and project partners covering academic, private, and government laboratories while generating value to emergency officials and the general public. Additionally, the Kepler scientific workflow system is instrumented to interact with this data catalog to access real-time streaming and archived wildfire data and stream it into dynamic data-driven wildfire models at scale.

  3. Gulf of Mexico Data Atlas: Digital Data Discovery and Access

    NASA Astrophysics Data System (ADS)

    Rose, K.

    2014-12-01

    The Gulf of Mexico Data Atlas is an online data discovery and access tool that allows users to browse a growing collection of ecosystem-related datasets visualized as map plates. Thematically, the Atlas includes updated long-term assessments of the physical, biological, environmental, economic and living marine resource characteristics that indicate baseline conditions of the Gulf of Mexico ecosystems. These data are crucial components of integrated ecosystem assessments and modeling and support restoration and monitoring efforts in the Gulf. A multi-agency executive steering committee including members from international, federal, state, and non-governmental organizations was established to guide Atlas development and to contribute data and expertise. The Atlas currently contains over 235 maps in 70 subject areas. Each map plate is accompanied by a descriptive summary authored by a subject matter expert and each data set is fully documented by metadata in Federal Geographic Data Committee (FGDC)-compliant standards. Source data are available in native formats and as web mapping services (WMS). Datasets are also searchable through an accompanying Map Catalog and RSS feed. The Gulf of Mexico Data Atlas is an operational example of the philosophy of leveraging resources among agencies and activities involved in geospatial data as outlined in the US Department of Interior and FGDC "Geospatial Platform Modernization Roadmap v4 - March 2011". We continue to update and add datasets through existing and new partnerships to ensure that the Atlas becomes a truly ecosystem-wide resource.

  4. Comparison of electronic data capture (EDC) with the standard data capture method for clinical trial data.

    PubMed

    Walther, Brigitte; Hossin, Safayet; Townend, John; Abernethy, Neil; Parker, David; Jeffries, David

    2011-01-01

    Traditionally, clinical research studies rely on collecting data with case report forms, which are subsequently entered into a database to create electronic records. Although well established, this method is time-consuming and error-prone. This study compares four electronic data capture (EDC) methods with the conventional approach with respect to duration of data capture and accuracy. It was performed in a West African setting, where clinical trials involve data collection from urban, rural and often remote locations. Three types of commonly available EDC tools were assessed in face-to-face interviews: netbook, PDA, and tablet PC. EDC performance during telephone interviews via mobile phone was evaluated as a fourth method. The Graeco Latin square study design allowed comparison of all four methods to standard paper-based recording followed by data double entry while controlling simultaneously for possible confounding factors such as interview order, interviewer and interviewee. Over a study period of three weeks the error rates decreased considerably for all EDC methods. In the last week of the study the data accuracy for the netbook (5.1%, CI95%: 3.5-7.2%) and the tablet PC (5.2%, CI95%: 3.7-7.4%) was not significantly different from the accuracy of the conventional paper-based method (3.6%, CI95%: 2.2-5.5%), but error rates for the PDA (7.9%, CI95%: 6.0-10.5%) and telephone (6.3%, CI95%: 4.6-8.6%) remained significantly higher. While EDC-interviews take slightly longer, data become readily available after download, making EDC more time effective. Free text and date fields were associated with higher error rates than numerical, single select and skip fields. EDC solutions have the potential to produce similar data accuracy compared to paper-based methods. Given the considerable reduction in the time from data collection to database lock, EDC holds the promise to reduce research-associated costs. However, the successful implementation of EDC requires adjustment of

  5. Comparison of Electronic Data Capture (EDC) with the Standard Data Capture Method for Clinical Trial Data

    PubMed Central

    Walther, Brigitte; Hossin, Safayet; Townend, John; Abernethy, Neil; Parker, David; Jeffries, David

    2011-01-01

    Background Traditionally, clinical research studies rely on collecting data with case report forms, which are subsequently entered into a database to create electronic records. Although well established, this method is time-consuming and error-prone. This study compares four electronic data capture (EDC) methods with the conventional approach with respect to duration of data capture and accuracy. It was performed in a West African setting, where clinical trials involve data collection from urban, rural and often remote locations. Methodology/Principal Findings Three types of commonly available EDC tools were assessed in face-to-face interviews: netbook, PDA, and tablet PC. EDC performance during telephone interviews via mobile phone was evaluated as a fourth method. The Graeco Latin square study design allowed comparison of all four methods to standard paper-based recording followed by data double entry while controlling simultaneously for possible confounding factors such as interview order, interviewer and interviewee. Over a study period of three weeks the error rates decreased considerably for all EDC methods. In the last week of the study the data accuracy for the netbook (5.1%, CI95%: 3.5–7.2%) and the tablet PC (5.2%, CI95%: 3.7–7.4%) was not significantly different from the accuracy of the conventional paper-based method (3.6%, CI95%: 2.2–5.5%), but error rates for the PDA (7.9%, CI95%: 6.0–10.5%) and telephone (6.3%, CI95%: 4.6–8.6%) remained significantly higher. While EDC-interviews take slightly longer, data become readily available after download, making EDC more time effective. Free text and date fields were associated with higher error rates than numerical, single select and skip fields. Conclusions EDC solutions have the potential to produce similar data accuracy compared to paper-based methods. Given the considerable reduction in the time from data collection to database lock, EDC holds the promise to reduce research-associated costs

  6. Common Data Format (CDF) and Coordinated Data Analysis Web (CDAWeb)

    NASA Technical Reports Server (NTRS)

    Candey, Robert M.

    2010-01-01

    The Coordinated Data Analysis Web (CDAWeb) data browsing system provides plotting, listing and open access via FTP, HTTP, and web services (REST, SOAP, OPeNDAP) for data from most NASA Heliophysics missions and is heavily used by the community. Combining data from many instruments and missions enables broad research analysis and correlation and coordination with other experiments and missions. Crucial to its effectiveness is the use of a standard self-describing data format, in this case the Common Data Format (CDF), also developed at the Space Physics Data Facility, and the use of metadata standards (easily edited with SKTeditor). CDAWeb is based on a set of IDL routines, CDAWlib. The CDF project also maintains software and services for translating between many standard formats (CDF, netCDF, HDF, FITS, XML).
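    As an aside, a minimal sketch of reading a CDF file in Python is shown below, assuming the community-maintained cdflib package and a hypothetical file and variable name; the operational CDAWeb system itself is based on IDL (CDAWlib), as noted above.

    ```python
    # Minimal sketch of reading a Common Data Format (CDF) file with the
    # community cdflib package. File path and variable name are hypothetical.
    import cdflib

    cdf = cdflib.CDF("example_mission_data.cdf")
    info = cdf.cdf_info()              # lists variables, global attributes, etc.
    print(info)

    epoch = cdf.varget("Epoch")        # read one variable into a NumPy array
    print(epoch[:5])
    ```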

  7. Remote sensing data from CLARET: A prototype CART data set

    NASA Technical Reports Server (NTRS)

    Eberhard, Wynn L.; Uttal, Taneil; Clark, Kurt A.; Cupp, Richard E.; Dutton, Ellsworth G.; Fedor, Leonard, S.; Intrieri, Janet M.; Matrosov, Sergey Y.; Snider, Jack B.; Willis, Ron J.

    1992-01-01

    The data set containing radiation, meteorological, and cloud sensor observations is documented. It was prepared for use by the Department of Energy's Atmospheric Radiation Measurement (ARM) Program and other interested scientists. These data are a precursor of the types of data that ARM Cloud And Radiation Testbed (CART) sites will provide. The data are from the Cloud Lidar And Radar Exploratory Test (CLARET) conducted by the Wave Propagation Laboratory during autumn 1989 in the Denver-Boulder area of Colorado, primarily for the purpose of developing new cloud-sensing techniques on cirrus. After becoming aware of the experiment, ARM scientists requested archival of subsets of the data to assist the developing ARM program. Five CLARET cases were selected: two with cirrus, one with stratus, one with mixed-phase clouds, and one with clear skies. Satellite data from the stratus case and one cirrus case were analyzed for statistics on cloud cover and top height. The main body of the selected data is available on diskette from the Wave Propagation Laboratory or Los Alamos National Laboratory.

  8. Format conversion between CAD data and GIS data based on ArcGIS

    NASA Astrophysics Data System (ADS)

    Xie, Qingqing; Wei, Bo; Zhang, Kailin; Wang, Zhichao

    2015-12-01

    To make full use of data resources and enable sharing of different types of data across industries, a method of format conversion between CAD data and GIS data based on ArcGIS was proposed. To preserve the integrity of the converted data, some key CAD data processing steps were carried out in AutoCAD before conversion. For example, deleting unnecessary elements such as the title, border, and legend avoided the appearance of unnecessary elements after conversion, and re-layering the data according to a national standard prevented different types of elements from appearing in the same layer after conversion. In ArcGIS, the conversion of CAD data to GIS data was carried out according to the correspondence of graphic element classes between AutoCAD and ArcGIS. In addition, an empty geographic database and feature dataset had to be created in ArcGIS to store the text data from the CAD data. The experimental results show that the proposed method avoids a large amount of editing work during data conversion and maintains the integrity of spatial and attribute data before and after conversion.

  9. Cambridge Crystallographic Data Centre. II. Structural Data File

    ERIC Educational Resources Information Center

    Allen, F. H.; And Others

    1973-01-01

    The Cambridge Crystallographic Data Centre is concerned with the retrieval, evaluation, synthesis, and dissemination of structural data obtained by diffraction methods. This article (Part I is EJ053033) describes the work of the center and deals with the organization and maintenance of a computerized file of numeric crystallographic structural…

  10. Integration of remote sensing and GIS: Data and data access

    USGS Publications Warehouse

    Ehlers, M.; Greenlee, D.D.; Smith, T.; Star, J.

    1991-01-01

    The integration of remote sensing tools and technology with the spatial analysis orientation of geographic information systems is a complex task. In this paper, we focus on the issues of making data available and useful to the user. In part, this involves a set of problems which reflect on the physical and logical structures used to encode the data. At the same time, however, the mechanisms and protocols which provide information about the data, and which maintain the data through time, have become increasingly important. We discuss these latter issues from the viewpoint of the functions which must be provided by archives of spatial data.

  11. A Distributed Data Architecture for 2001 Mars Odyssey Data Distribution

    NASA Technical Reports Server (NTRS)

    Crichton, Daniel J.; Hughes, J. Steven; Kelly, Sean

    2003-01-01

    Newer instruments and communications techniques have given scientists unprecedented amounts of data, more than can be feasibly distributed through traditional methods such as mailed CD-ROM's. Leveraging the web makes sense since it enables scientists to request specific data and retrieve products as soon as they're available. Yet defining the middleware system to support such an application has remained just out of reach, until Odyssey. For the first time ever, data from all Odyssey mission instruments were made available through a single system immediately upon delivery to the Planetary Data System (PDS). The Object Oriented Data Technology (OODT) software made such an application possible.

  12. Comparative analysis of data mining techniques for business data

    NASA Astrophysics Data System (ADS)

    Jamil, Jastini Mohd; Shaharanee, Izwan Nizal Mohd

    2014-12-01

    Data mining is the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database. Companies are using this tool to further understand their customers, to design targeted sales and marketing campaigns, to predict what products customers will buy and the frequency of purchase, and to spot trends in customer preferences that can lead to new product development. In this paper, we take a systematic approach to exploring several data mining techniques in business applications. The experimental results reveal that all of the data mining techniques accomplish their goals, but each technique has its own characteristics and specifications that determine its accuracy, proficiency, and suitability.

  13. Improving data sharing in research with context-free encoded missing data.

    PubMed

    Hoevenaar-Blom, Marieke P; Guillemont, Juliette; Ngandu, Tiia; Beishuizen, Cathrien R L; Coley, Nicola; Moll van Charante, Eric P; Andrieu, Sandrine; Kivipelto, Miia; Soininen, Hilkka; Brayne, Carol; Meiller, Yannick; Richard, Edo

    2017-01-01

    Lack of attention to missing data in research may result in biased results, loss of power, and reduced generalizability. Registering reasons for missing values at the time of data collection, or, in the case of sharing existing data, before making data available to other teams, can save time and effort, improve scientific value, and help to prevent erroneous assumptions and biased results. To ensure that the encoding of missing data is sufficient to understand the reason why data are missing, it should ideally be context-free. Therefore, 11 context-free codes for missing data were carefully designed based on three completed randomized controlled clinical trials and tested in a new randomized controlled clinical trial by an international team consisting of clinical researchers and epidemiologists with extensive experience in designing and conducting trials and an Information Systems expert. These codes can be divided into missing due to participant and/or participation characteristics (n = 6), missing by design (n = 4), and missing due to a procedural error (n = 1). Broad implementation of context-free missing data encoding may enhance the possibilities for data sharing and pooling, thus allowing more powerful analyses using existing data.
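    As an illustration of what context-free missing-data codes could look like in practice, the short pandas sketch below records a reason code alongside each missing value; the example codes are invented placeholders, not the 11 codes designed by the authors.

    ```python
    # Illustrative encoding of reasons for missingness alongside the data.
    # The reason codes here are invented placeholders, not the paper's 11 codes.
    import pandas as pd

    data = pd.DataFrame({
        "participant": [101, 102, 103],
        "blood_pressure": [132.0, None, None],
    })

    # A parallel column records WHY a value is missing, independent of study context.
    data["blood_pressure_missing_reason"] = [
        None,                    # value present
        "PARTICIPANT_REFUSED",   # missing due to participant characteristics
        "MISSING_BY_DESIGN",     # measurement not scheduled for this participant
    ]
    print(data)
    ```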

  14. Data registration for automated non-destructive inspection with multiple data sets

    NASA Astrophysics Data System (ADS)

    Tippetts, T.; Brierley, N.; Cawley, P.

    2013-01-01

    In many NDE applications, multiple sources of data are available covering the same region of a part under inspection. These overlapping data can come from intersecting scan patterns, sensors in an array configuration, or repeated inspections at different times. In many cases these data sets are analysed independently, with separate assessments for each channel or data file. It should be possible to improve the overall reliability of the inspection by combining multiple sources of information, simultaneously increasing the Probability of Detection (POD) and decreasing the Probability of False Alarm (PFA). Data registration, i.e. mapping the data to matching coordinates in space, is both an essential prerequisite and a challenging obstacle to this type of data fusion. This paper describes optimization techniques for matching and aligning features in NDE data. Examples from automated ultrasound inspection of aircraft engine discs illustrate the approach.
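    As a much-simplified illustration of registering overlapping scan data (the paper itself optimizes matches between features in the NDE data), the sketch below estimates the offset between two synthetic 1-D signals by cross-correlation with NumPy.

    ```python
    # Toy 1-D registration by cross-correlation (a simplification of the data
    # registration problem; the paper uses feature-matching optimization).
    import numpy as np

    rng = np.random.default_rng(1)
    base = rng.random(200)                                  # "scan 1" signal
    shifted = np.roll(base, 7) + 0.01 * rng.random(200)     # "scan 2", offset by 7 samples

    corr = np.correlate(shifted - shifted.mean(), base - base.mean(), mode="full")
    lag = corr.argmax() - (len(base) - 1)   # estimated offset between the two scans
    print("estimated shift:", lag)          # expected: 7
    ```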

  15. The Emirates Space Data Center, a PDS4-Compliant Data Archive

    NASA Astrophysics Data System (ADS)

    DeWolfe, A. W.; Al Hammadi, O.; Amiri, S.

    2017-12-01

    As part of the UAE's Emirates Mars Mission (EMM), we are constructing a data archive to preserve and distribute science data from this and future missions. The archive will be publicly accessible and will provide access to Level 2 and 3 science data products from EMM, as well as ancillary data such as SPICE kernels and mission event timelines. As a member of the International Planetary Data Alliance (IPDA), the UAE has committed to making its archive PDS4-compatible, and maintaining the archive beyond the end of the mission. EMM is scheduled to begin collecting science data in spring 2021, and the archive is expected to begin releasing data in September 2021.

  16. Comparison of current map data with data from STS-99 SRTM

    NASA Image and Video Library

    2000-02-04

    JSC2000E01552 (January 2000) --- This chart compares currently available global map data with the data which will be provided by SRTM during STS-99. The area depicted is the California coast. The SRTM mission will have approximately 1,000 scheduled data takes (every time Endeavour is over land). Data acquisition will be conducted in excess of 80 hours. The recording rate for data will be 180 Mbits/sec for C-band, 90 Mbits/sec for X-band. Total raw radar data will be approximately 9.8 terabytes (15,000 CDs). The mission will utilize some 300 high-density tapes (each tape records 30 min. of C-band, or 60 min. of X-band data).
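
    As a back-of-envelope consistency check (my own arithmetic, not part of the original caption), the quoted recording rates and roughly 80 hours of acquisition do reproduce the stated raw data volume:

        # Combined C-band and X-band recording over ~80 hours of data takes
        c_band_rate = 180e6          # bits per second
        x_band_rate = 90e6           # bits per second
        hours = 80

        total_bits = (c_band_rate + x_band_rate) * hours * 3600
        print(f"approx. raw volume: {total_bits / 8 / 1e12:.1f} TB")  # ~9.7 TB, close to the quoted 9.8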

  17. Enabling the Use of Authentic Scientific Data in the Classroom--Lessons Learned from the AccessData and Data Services Workshops

    NASA Astrophysics Data System (ADS)

    Lynds, S. E.; Buhr, S. M.; Ledley, T. S.

    2007-12-01

    Since 2004, the annual AccessData and DLESE Data Services workshops have gathered scientists, data managers, technology specialists, teachers, and curriculum developers to work together creating classroom-ready scientific data modules. Teams of five (one participant from each of the five professions) develop topic-specific online educational units of the Earth Exploration Toolbook (serc.carleton.edu/eet/). Educators from middle schools through undergraduate colleges have been represented, as have scientific data professionals from many organizations across the United States. Extensive evaluation has been included in the design of each workshop. The evaluation results have been used each year to improve subsequent workshops. In addition to refining the format and process of the workshop itself, the evaluation data collected reveal attendees' experiences using scientific data for educational purposes. Workshop attendees greatly value the opportunity to network with those in other professional roles while developing a real-world education project using scientific data. Educators appreciate the opportunity to work directly with scientists and technology specialists, while researchers and those in technical fields value the classroom expertise of the educators. Attendees' data use experiences are explored every year. Although bandwidth and connectivity were problems for data use in 2004, that has become much less common over time. The most common barriers to data use cited now are discoverability, data format problems, incomplete data sets, and poor documentation. Most attendees agree that the most useful types of online documentation and user support for scientific data are step-by-step instructions, examples, tutorials, and reference manuals. Satellite imagery and weather data were the most commonly used types of data, and these were often

  18. Proteomics data repositories

    PubMed Central

    Riffle, Michael; Eng, Jimmy K.

    2010-01-01

    The field of proteomics, particularly the application of mass spectrometry analysis to protein samples, is well-established and growing rapidly. Proteomics studies generate large volumes of raw experimental data and inferred biological results. To facilitate the dissemination of these data, centralized data repositories have been developed that make the data and results accessible to proteomics researchers and biologists alike. This review of proteomics data repositories focuses exclusively on freely-available, centralized data resources that disseminate or store experimental mass spectrometry data and results. The resources chosen reflect a current “snapshot” of the state of resources available with an emphasis placed on resources that may be of particular interest to yeast researchers. Resources are described in terms of their intended purpose and the features and functionality provided to users. PMID:19795424

  19. Geoscience Data Puzzles: Developing Students' Ability to Make Meaning from Data

    NASA Astrophysics Data System (ADS)

    Kastens, K. A.; Turrin, M.

    2010-12-01

    One of the most fundamental aspects of geoscience expertise is the ability to extract insights from observational earth data. Where an expert might see trends, patterns, processes, and candidate causal relationships, a novice could look at the same data representation and see dots, wiggles and blotches of color. The problem is compounded when the student was not personally involved in collecting the data or samples and thus has no experiential knowledge of the Earth setting that the data represent. In other words, the problem is especially severe when students tap into the vast archives of professionally-collected data that the geoscience community has worked so hard to make available for instructional use over the internet. Moreover, most high school and middle school teachers did not themselves learn Earth Science through analyzing data, and they may lack skills and/or confidence needed to scaffold students through the process of learning to interpret realistically-complex data sets. We have developed “Geoscience Data Puzzles” with the paired goals of (a) helping students learn about the earth from data, and (b) helping teachers learn to teach with data. Geoscience Data Puzzles are data-using activities that purposefully present a low barrier-to-entry for teachers and a high ratio of insight-to-effort for students. Each Puzzle uses authentic geoscience data, but the data are carefully pre-selected in order to illuminate a fundamental Earth process within tractable snippets of data. Every Puzzle offers "Aha" moments, when the connection between data and process comes clear in a rewarding burst of insight. Every Puzzle is accompanied by a Pedagogical Content Knowledge (PCK) guide, which explicates the chain of reasoning by which the puzzle-solver can use the evidence provided by the data to construct scientific claims. Four types of reasoning are stressed: spatial reasoning, in which students make inferences from observations about location, orientation, shape

  20. Stepping Into Science Data: Data Visualization in Virtual Reality

    NASA Astrophysics Data System (ADS)

    Skolnik, S.

    2017-12-01

    Have you ever seen people get really excited about science data? Navteca, along with the Earth Science Technology Office (ESTO) within the Earth Science Division of NASA's Science Mission Directorate, has been exploring virtual reality (VR) technology for the next generation of Earth science technology information systems. One of their first joint experiments was visualizing climate data from the Goddard Earth Observing System Model (GEOS) in VR, and the resulting visualizations greatly excited the scientific community. This presentation will share the value of VR for science, such as the capability of permitting the observer to interact with data rendered in real time, make selections, and view volumetric data in an innovative way. Using interactive VR hardware (headset and controllers), the viewer steps into the data visualizations, physically moving through three-dimensional structures that are traditionally displayed as layers or slices, such as cloud and storm systems from NASA's Global Precipitation Measurement (GPM). Results from displaying this precipitation and cloud data show that there is interesting potential for scientific visualization, 3D/4D visualizations, and inter-disciplinary studies using VR. Additionally, VR visualizations can be leveraged as 360 content for scientific communication and outreach, and VR can be used as a tool to engage policy and decision makers, as well as the public.

  1. A Custom Data Logger for Real-Time Remote Field Data Collections

    DTIC Science & Technology

    2017-03-01

    ERDC/CHL CHETN-VI-46 March 2017 Approved for public release; distribution is unlimited. A Custom Data Logger for Real-Time Remote Field Data... Field Research Facility (FRF), for remote real-time data collections. This custom data logger is compact and energy efficient but has the same... INTRODUCTION: Real-time data collections offer many advantages: 1. Instrument failures can be rapidly detected and repaired, thereby minimizing

  2. National Geothermal Data System: an Exemplar of Open Access to Data

    NASA Astrophysics Data System (ADS)

    Allison, M. L.; Richard, S. M.; Blackman, H.; Anderson, A.

    2013-12-01

    The National Geothermal Data System's (NGDS - www.geothermaldata.org) formal launch in 2014 will provide open access to millions of datasets, sharing technical geothermal-relevant data across the geosciences to propel geothermal development and production. With information from all of the Department of Energy's sponsored development and research projects and geologic data from all 50 states, this free, interactive tool is opening new exploration opportunities and shortening project development by making data easily discoverable and accessible. We continue to populate our prototype functional data system with multiple data nodes and nationwide data online and available to the public. Data from state geological surveys and partners include more than 5 million records online, including 1.48 million well headers (oil and gas, water, geothermal), 732,000 well logs, and 314,000 borehole temperatures, and the collection is growing rapidly. There are over 250 Web services and another 138 WMS (Web Map Services) registered in the system as of August 2013. Companion projects run by Boise State University, Southern Methodist University, and USGS are adding millions of additional data records. The National Renewable Energy Laboratory is managing the Geothermal Data Repository, which will serve as a system node and clearinghouse for data from hundreds of DOE-funded geothermal projects. NGDS is built on the US Geoscience Information Network data integration framework, which is a joint undertaking of the USGS and the Association of American State Geologists (AASG). NGDS is fully compliant with the White House Executive Order of May 2013, requiring all federal agencies to make their data holdings publicly accessible online in open source, interoperable formats with common core and extensible metadata. The National Geothermal Data System is being designed, built, deployed, and populated primarily with grants from the US Department of Energy, Geothermal Technologies Office. To keep this operational

  3. Increasing value and reducing waste in data extraction for systematic reviews: tracking data in data extraction forms.

    PubMed

    Shokraneh, Farhad; Adams, Clive E

    2017-08-04

    Data extraction is one of the most time-consuming tasks in performing a systematic review. Extraction is often onto some sort of form. Sharing completed forms can be used to check the quality and accuracy of extraction or to recycle data to other researchers for updating. However, validating each piece of extracted data is time-consuming, and linking it to the source is problematic. In this methodology paper, we summarize three methods for reporting the location of data in original full-text reports, comparing their advantages and disadvantages.

  4. SeaDataNet: Pan-European infrastructure for ocean and marine data management

    NASA Astrophysics Data System (ADS)

    Fichaut, M.; Schaap, D.; Maudire, G.; Manzella, G. M. R.

    2012-04-01

    The overall objective of the SeaDataNet project is to upgrade the present SeaDataNet infrastructure into an operationally robust and state-of-the-art Pan-European infrastructure for providing up-to-date and high quality access to ocean and marine metadata, data and data products originating from data acquisition activities by all engaged coastal states, by setting, adopting and promoting common data management standards and by realising technical and semantic interoperability with other relevant data management systems and initiatives on behalf of science, environmental management, policy making, and economy. SeaDataNet is undertaken by the National Oceanographic Data Centres (NODCs), and marine information services of major research institutes, from 31 coastal states bordering the European seas, and also includes Satellite Data Centres, expert modelling centres and the international organisations IOC, ICES and EU-JRC in its network. Its 40 data centres are highly skilled and have been actively engaged in data management for many years and have the essential capabilities and facilities for data quality control, long term stewardship, retrieval and distribution. SeaDataNet undertakes activities to achieve data access and data products services that meet requirements of end-users and intermediate user communities, such as GMES Marine Core Services (e.g. MyOcean), establishing SeaDataNet as the core data management component of the EMODNet infrastructure and contributing on behalf of Europe to global portal initiatives, such as the IOC/IODE - Ocean Data Portal (ODP), and GEOSS. Moreover it aims to achieve INSPIRE compliance and to contribute to the INSPIRE process for developing implementing rules for oceanography. As part of the SeaDataNet upgrading and capacity building, training courses will be organised for data managers and technicians at the data centres. For the data managers it is important that they learn to work with the upgraded common SeaData

  5. Data Curation Education Grounded in Earth Sciences and the Science of Data

    NASA Astrophysics Data System (ADS)

    Palmer, C. L.

    2015-12-01

    This presentation looks back over ten years of experience advancing data curation education at two Information Schools, highlighting the vital role of earth science case studies, expertise, and collaborations in the development of curriculum and internships. We also consider current data curation practices and workforce demand in data centers in the geosciences, drawing on studies conducted in the Data Curation Education in Research Centers (DCERC) initiative and the Site-Based Data Curation project. Outcomes from this decade of data curation research and education have reinforced the importance of key areas of information science in preparing data professionals to respond to the needs of user communities, provide services across disciplines, invest in standards and interoperability, and promote open data practices. However, a serious void remains in principles to guide education and practice that are distinct to the development of data systems and services that meet both local and global aims. We identify principles emerging from recent empirical studies on the reuse value of data in the earth sciences and propose an approach for advancing data curation education that depends on systematic coordination with data-intensive research and propagation of current best practices from data centers into curriculum. This collaborative model can increase both domain-based and cross-disciplinary expertise among data professionals, ultimately improving data systems and services in our universities and data centers while building the new base of knowledge needed for a foundational science of data.

  6. Integrating sea floor observatory data: the EMSO data infrastructure

    NASA Astrophysics Data System (ADS)

    Huber, Robert; Azzarone, Adriano; Carval, Thierry; Doumaz, Fawzi; Giovanetti, Gabriele; Marinaro, Giuditta; Rolin, Jean-Francois; Beranzoli, Laura; Waldmann, Christoph

    2013-04-01

    The European research infrastructure EMSO is a European network of fixed-point, deep-seafloor and water column observatories deployed in key sites of the European Continental margin and Arctic. It aims to provide the technological and scientific framework for the investigation of the environmental processes related to the interaction between the geosphere, biosphere, and hydrosphere, and for sustainable management through long-term monitoring, also with real-time data transmission. EMSO has been on the ESFRI (European Strategy Forum on Research Infrastructures) roadmap since 2006 and entered its construction phase in 2012. Within this framework, EMSO is contributing to large infrastructure integration projects such as ENVRI and COOPEUS. The EMSO infrastructure is geographically distributed in key sites of European waters, spanning from the Arctic, through the Atlantic and Mediterranean Sea, to the Black Sea. It presently consists of thirteen sites identified by the scientific community according to their importance with respect to Marine Ecosystems, Climate Change and Marine GeoHazards. The data infrastructure for EMSO is being designed as a distributed system. Presently, EMSO data collected during experiments at each EMSO site are locally stored and organized in catalogues or relational databases run by the responsible regional EMSO nodes. Three major institutions and their data centers are currently offering access to EMSO data: PANGAEA, INGV and IFREMER. In continuation of the IT activities performed during EMSO's twin project ESONET, EMSO is now implementing the ESONET data architecture within an operational EMSO data infrastructure. EMSO aims to be compliant with relevant marine initiatives such as MyOceans, EUROSITES, EuroARGO, SEADATANET and EMODNET, as well as to meet the requirements of international and interdisciplinary projects such as COOPEUS and ENVRI, EUDAT and iCORDI. A major focus is therefore set on standardization and

  7. Data Qualification Report For: Thermodynamic Data File, DATA0.YMP.R0 For Geochemical Code, EQ3/6 

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    P.L. Cloke

    The objective of this work is to evaluate the adequacy of chemical thermodynamic data provided by Lawrence Livermore National Laboratory (LLNL) as Data0.ymp.R0A in response to an input request submitted under AP-3.14Q. This request specified that chemical thermodynamic data available in the file Data0.com.R2 be updated, improved, and augmented for use in geochemical modeling used in Process Model Reports (PMRs) for Engineered Barrier Systems, Waste Form, Waste Package, Unsaturated Zone, and Near Field Environment, as well as for Performance Assessment. The data are qualified in the temperature range 0 to 100 °C. Several Data Tracking Numbers (DTNs) are associated with Analysis/Model Reports (AMRs) that address various aspects of the post-closure chemical behavior of the waste package and the Engineered Barrier System and that rely on EQ3/6 outputs for which these data are used as input; the data are a principal factor affecting those analyses. This qualification activity was accomplished in accordance with AP-SIII.2Q using the Technical Assessment method. A development plan, TDP-EBS-MD-000044, was prepared in accordance with AP-2.13Q and approved by the Responsible Manager. In addition, a Process Control Evaluation was performed in accordance with AP-SV.1Q. The qualification method, selected in accordance with AP-SIII.2Q, was Technical Assessment. The rationale for this approach is that the data in file Data0.com.R2 are considered Handbook data and therefore do not themselves require qualification. Only changes to Data0.com.R2 required qualification. A new file has been produced which contains the database Data0.ymp.R0, which is recommended for qualification as a result of this action. Data0.ymp.R0 will supersede Data0.com.R2 for all Yucca Mountain Project (YMP) activities.

  8. Enabling Data Fusion via a Common Data Model and Programming Interface

    NASA Astrophysics Data System (ADS)

    Lindholm, D. M.; Wilson, A.

    2011-12-01

    Much progress has been made in scientific data interoperability, especially in the areas of metadata and discovery. However, while a data user may have improved techniques for finding data, there is often a large chasm to span when it comes to acquiring the desired subsets of various datasets and integrating them into a data processing environment. Some tools such as OPeNDAP servers and the Unidata Common Data Model (CDM) have introduced improved abstractions for accessing data via a common interface, but they alone do not go far enough to enable fusion of data from multidisciplinary sources. Although data from various scientific disciplines may represent semantically similar concepts (e.g. time series), the user may face widely varying structural representations of the data (e.g. row versus column oriented), not to mention radically different storage formats. It is not enough to convert data to a common format. The key to fusing scientific data is to represent each dataset with consistent sampling. This can best be done by using a data model that expresses the functional relationship that each dataset represents. The domain of those functions determines how the data can be combined. The Visualization for Algorithm Development (VisAD) Java API has provided a sophisticated data model for representing the functional nature of scientific datasets for well over a decade. Because VisAD is largely designed for its visualization capabilities, the data model can be cumbersome to use for numerical computation, especially for those not comfortable with Java. Although both VisAD and the implementation of the CDM are written in Java, neither defines a pure Java interface that others could implement and program to, further limiting potential for interoperability. In this talk, we will present a solution for data integration based on a simple discipline-agnostic scientific data model and programming interface that enables a dataset to be defined in terms of three variable types
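
    As a hedged illustration of the idea (not the authors' data model or API; the class and method names below are invented), a dataset can be treated as a sampled function from a domain variable to a range variable, and two datasets can only be fused once they share a common domain:

        # Illustrative sketch: datasets as sampled functions domain -> range.
        import numpy as np

        class FunctionalDataset:
            """A dataset modelled as a sampled function from a domain to a range."""
            def __init__(self, domain, values):
                self.domain = np.asarray(domain, dtype=float)
                self.values = np.asarray(values, dtype=float)

            def resample(self, new_domain):
                """Evaluate the function on a new domain by linear interpolation."""
                return FunctionalDataset(new_domain,
                                         np.interp(new_domain, self.domain, self.values))

        # Two time series with different, irregular sampling
        a = FunctionalDataset([0, 2, 5, 9], [1.0, 1.5, 2.0, 2.5])
        b = FunctionalDataset([1, 4, 8],    [10.0, 12.0, 16.0])

        # Fuse by resampling both onto a common time axis
        t = np.arange(0, 9)
        fused = np.column_stack([a.resample(t).values, b.resample(t).values])
        print(fused.shape)   # (9, 2): one row per common time step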

  9. Landsat Data

    USGS Publications Warehouse

    ,

    1997-01-01

    In the mid-1960s, the National Aeronautics and Space Administration (NASA) embarked on an initiative to develop and launch the first Earth monitoring satellite to meet the needs of resource managers and earth scientists. The U.S. Geological Survey (USGS) entered into a partnership with NASA in the early 1970s to assume responsibility for archiving data and distributing data products. On July 23, 1972, NASA launched the first in a series of satellites designed to provide repetitive global coverage of the Earth's land masses. Designated initially as the "Earth Resources Technology Satellite-A" ("ERTS-A"), it used a Nimbus-type platform that was modified to carry sensor systems and data relay equipment. When operational orbit was achieved, it was designated "ERTS-1." The satellite continued to function beyond its designed life expectancy of 1 year and finally ceased to operate on January 6, 1978, more than 5 years after its launch date. The second in this series of Earth resources satellites (designated "ERTS-B") was launched January 22, 1975. It was renamed "Landsat 2" by NASA, which also renamed "ERTS-1" as "Landsat 1." Three additional Landsats were launched in 1978, 1982, and 1984 (Landsats 3, 4, and 5). (See table 1). NASA was responsible for operating the program through the early 1980s. In January 1983, operation of the Landsat system was transferred to the National Oceanic and Atmospheric Administration (NOAA). In October 1985, the Landsat system was commercialized and the Earth Observation Satellite Company, now Space Imaging EOSAT, assumed responsibility for its operation under contract to NOAA. Throughout these changes, the USGS EROS Data Center (EDC) retained primary responsibility as the Government archive of Landsat data. The Land Remote Sensing Policy Act of 1992 (Public Law 102-555) officially authorized the National Satellite Land Remote Sensing Data Archive and assigned responsibility to the Department of the Interior. In addition to its Landsat

  10. Nitrogen Source and Loading Data for EPA Estuary Data Mapper

    EPA Science Inventory

    Nitrogen source and loading data have been compiled and aggregated at the scale of estuaries and associated watersheds of the conterminous United States, using the spatial framework in EPA's Estuary Data Mapper (EDM) to provide system boundaries. Original sources of data include...

  11. Data Visualization in Information Retrieval and Data Mining (SIG VIS).

    ERIC Educational Resources Information Center

    Efthimiadis, Efthimis

    2000-01-01

    Presents abstracts that discuss using data visualization for information retrieval and data mining, including immersive information space and spatial metaphors; spatial data using multi-dimensional matrices with maps; TREC (Text Retrieval Conference) experiments; users' information needs in cartographic information retrieval; and users' relevance…

  12. Data Integration Framework Data Management Plan Remote Sensing Dataset

    DTIC Science & Technology

    2016-07-01

    Performed by the Coastal Observations and Analysis Branch (CEERD-HFA) of the Flood and Storm Protection Division (CEERD-HF), U.S. Army Engineer Research... Protection Division, Coastal Observations and Analysis Branch; CESAM, U.S. Army Corps of Engineers, Mobile District; CESAM-OP-J, U.S. Army Corps of Engineers... ERDC/CHL SR-16-2, Coastal Ocean Data Systems Program, Data Integration Framework Data Management Plan, Remote Sensing Dataset

  13. The project data sphere initiative: accelerating cancer research by sharing data.

    PubMed

    Green, Angela K; Reeder-Hayes, Katherine E; Corty, Robert W; Basch, Ethan; Milowsky, Mathew I; Dusetzina, Stacie B; Bennett, Antonia V; Wood, William A

    2015-05-01

    In this paper, we provide background and context regarding the potential for a new data-sharing platform, the Project Data Sphere (PDS) initiative, funded by financial and in-kind contributions from the CEO Roundtable on Cancer, to transform cancer research and improve patient outcomes. Given the relatively modest decline in cancer death rates over the past several years, a new research paradigm is needed to accelerate therapeutic approaches for oncologic diseases. Phase III clinical trials generate large volumes of potentially usable information, often on hundreds of patients, including patients treated with standard of care therapies (i.e., controls). Both nationally and internationally, a variety of stakeholders have pursued data-sharing efforts to make individual patient-level clinical trial data available to the scientific research community. For researchers, shared data have the potential to foster a more collaborative environment, to answer research questions in a shorter time frame than traditional randomized control trials, to reduce duplication of effort, and to improve efficiency. For industry participants, use of trial data to answer additional clinical questions could increase research and development efficiency and guide future projects through validation of surrogate end points, development of prognostic or predictive models, selection of patients for phase II trials, stratification in phase III studies, and identification of patient subgroups for development of novel therapies. Data transparency also helps promote a public image of collaboration and altruism among industry participants. For patient participants, data sharing maximizes their contribution to public health and increases access to information that may be used to develop better treatments. Concerns about data-sharing efforts include protection of patient privacy and confidentiality. To alleviate these concerns, data sets are deidentified to maintain anonymity. To address industry

  14. Relating Satellite Gravimetry Data To Global Snow Water Equivalent Data

    NASA Astrophysics Data System (ADS)

    Baumann, Sabine

    2017-04-01

    In April 2002, the gravimetric GRACE satellites were launched. They measure Earth's gravity via a precise microwave system and thereby assess changes in Earth's mass. The main contributions to these changes originate from hydrological compartments such as surface water, groundwater, soil moisture, or snow water equivalent (SWE). The benefit of GRACE data is that they are a directly measured signal: the data are not calibrated against other data (as is done, for example, in models), nor are they rendered unusable by particular surface conditions (e.g., AMSR-E over thick and wet snow). GRACE data show changes in total water storage (TWS) but cannot distinguish between different sources. Therefore, other data, models, and methods are necessary to extract the different compartments. Because of the spatial resolution of 200,000 km2 and an accuracy of 2.5 cm w.e., GRACE is mostly compared with other global products. In this study, the hydrological model WGHM (TWS and SWE), the land surface model GLDAS (TWS and SWE), and the passive microwave sensor AMSR-E (SWE) are compared with the GRACE data. All data have to be pre-processed in the same way as the GRACE data to be comparable. A correlation analysis was performed between the different products, assuming that changes in TWS can be linked to changes in SWE if either SWE is the dominant compartment of TWS or SWE changes proportionally with TWS. To focus on the SWE product, a second correlation was performed for the winter season only. The spatial extent was focused on the large permafrost areas in North America and Russia. By this method, those areas were detected in which GRACE data can be integrated for SWE assessment to, for example, improve the models.
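
    A conceptual sketch of the correlation step described above follows; the arrays are random placeholders rather than GRACE, WGHM, GLDAS, or AMSR-E products, and the grid, season definition, and variable names are assumptions for illustration only.

        # Placeholder data: per-grid-cell correlation of TWS anomalies with SWE,
        # computed over all months and over winter months only.
        import numpy as np

        rng = np.random.default_rng(0)
        n_months, n_cells = 120, 50                     # 10 years, 50 grid cells
        months = np.tile(np.arange(1, 13), n_months // 12)
        tws = rng.normal(size=(n_months, n_cells))      # placeholder TWS anomalies
        swe = 0.6 * tws + rng.normal(scale=0.8, size=(n_months, n_cells))  # placeholder SWE

        def cellwise_corr(a, b):
            a = a - a.mean(axis=0)
            b = b - b.mean(axis=0)
            return (a * b).sum(axis=0) / np.sqrt((a**2).sum(axis=0) * (b**2).sum(axis=0))

        r_all = cellwise_corr(tws, swe)
        winter = np.isin(months, [12, 1, 2])            # DJF season
        r_winter = cellwise_corr(tws[winter], swe[winter])
        print(f"mean r (all months): {r_all.mean():.2f}, winter only: {r_winter.mean():.2f}")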

  15. Refinement of ground reference data with segmented image data

    NASA Technical Reports Server (NTRS)

    Robinson, Jon W.; Tilton, James C.

    1991-01-01

    One of the ways to determine ground reference data (GRD) for satellite remote sensing data is to photo-interpret low-altitude aerial photographs, digitize the cover types on a digitizing tablet, and register them to 7.5-minute U.S.G.S. maps (that were themselves digitized). The resulting GRD can be registered to the satellite image, or vice versa. Unfortunately, there are many opportunities for error when using a digitizing tablet, and the resolution of the edges in the GRD depends on the spacing of the points selected on the tablet. One consequence is that when the GRD is overlaid on the image, errors and missed detail become evident. An approach is discussed for correcting these errors and adding detail to the GRD through the use of a highly interactive, visually oriented process. This process involves the use of overlaid visual displays of the satellite image data, the GRD, and a segmentation of the satellite image data. Several prototype programs were implemented which provide a means of taking a segmented image and using the edges from the reference data to mask out those segment edges that are beyond a certain distance from the reference data edges. Then, using the reference data edges as a guide, those segment edges that remain and that are judged not to be image versions of the reference edges are manually marked and removed. The prototype programs that were developed and the algorithmic refinements that facilitate execution of this task are described.
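
    The masking step lends itself to a short sketch; the one below is a hedged approximation using a Euclidean distance transform on assumed boolean edge images, not the prototype programs described in the paper.

        # Keep only segment-edge pixels within a distance threshold of a reference edge.
        import numpy as np
        from scipy.ndimage import distance_transform_edt

        rng = np.random.default_rng(0)
        segment_edges = rng.random((100, 100)) > 0.95       # placeholder segmentation edges
        reference_edges = np.zeros((100, 100), dtype=bool)   # placeholder GRD edges
        reference_edges[50, :] = True                         # e.g. one horizontal boundary

        # Distance from every pixel to the nearest reference-edge pixel
        dist_to_reference = distance_transform_edt(~reference_edges)

        max_dist = 5.0                                        # pixels; tunable threshold
        kept_edges = segment_edges & (dist_to_reference <= max_dist)
        print(segment_edges.sum(), "segment-edge pixels ->", kept_edges.sum(), "kept")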

  16. Study of data collection platform concepts: Data collection system user requirements

    NASA Technical Reports Server (NTRS)

    1973-01-01

    The overall purpose of the survey was to provide real world data on user requirements. The intent was to assess data collection system user requirements by questioning actual potential users rather than speculating on requirements. The end results of the survey are baseline requirements models for both a data collection platform and a data collection system. These models were derived from the survey results. The real value of these models lies in the fact that they are based on actual user requirements as delineated in the survey questionnaires. Some users desire data collection platforms of small size and light weight. These sizes and weights are beyond the present state of the art. Also, the survey provided a wealth of information on the nature and constituency of the data collection user community as well as information on user applications for data collection systems. Finally, the data sheds light on the generalized platform concept. That is, the diversity of user requirements shown in the data indicates the difficulty that can be anticipated in attempting to implement such a concept.

  17. Rethinking big data: A review on the data quality and usage issues

    NASA Astrophysics Data System (ADS)

    Liu, Jianzheng; Li, Jie; Li, Weifeng; Wu, Jiansheng

    2016-05-01

    The recent explosion of big data studies has well documented the rise of big data and its ongoing prevalence. Different types of 'big data' have emerged and have greatly enriched spatial information sciences and related fields in terms of breadth and granularity. Studies that were difficult to conduct in the past due to limited data availability can now be carried out. However, big data brings many 'big errors' in data quality and data usage, and it cannot be used as a substitute for sound research design and solid theories. We identify and summarize the problems faced by current big data studies with regard to data collection, processing and analysis: inauthentic data collection, information incompleteness and noise of big data, unrepresentativeness, consistency and reliability, and ethical issues. Cases of empirical studies are provided as evidence for each problem. We propose that big data research should closely follow good scientific practice to provide reliable and scientific 'stories', as well as explore and develop techniques and methods to mitigate or rectify those 'big errors' brought by big data.

  18. Data Basin: Expanding Access to Conservation Data, Tools, and People

    NASA Astrophysics Data System (ADS)

    Comendant, T.; Strittholt, J.; Frost, P.; Ward, B. C.; Bachelet, D. M.; Osborne-Gowey, J.

    2009-12-01

    Mapping and spatial analysis are a fundamental part of problem solving in conservation science, yet spatial data are widely scattered, difficult to locate, and often unavailable. Valuable time and resources are wasted locating and gaining access to important biological, cultural, and economic datasets, scientific analysis, and experts. As conservation problems become more serious and the demand to solve them grows more urgent, a new way to connect science and practice is needed. To meet this need, an open-access web tool called Data Basin (www.databasin.org) has been created by the Conservation Biology Institute in partnership with ESRI and the Wilburforce Foundation. Users of Data Basin can gain quick access to datasets, experts, groups, and tools to help solve real-world problems. Individuals and organizations can perform essential tasks such as exploring and downloading from a vast library of conservation datasets, uploading existing datasets, connecting to other external data sources, creating groups, and producing customized maps that can be easily shared. Data Basin encourages sharing and publishing, but also provides privacy and security for sensitive information when needed. Users can publish projects within Data Basin to tell more complete and rich stories of discovery and solutions. Projects are an ideal way to publish collections of datasets, maps and other information on the internet to reach wider audiences. Data Basin also houses individual centers that provide direct access to data, maps, and experts focused on specific geographic areas or conservation topics. Current centers being developed include the Boreal Information Centre, the Data Basin Climate Center, and proposed Aquatic and Forest Conservation Centers.

  19. Data System Implications Derived from User Application Requirements for Satellite Data

    NASA Technical Reports Server (NTRS)

    Neiers, J.

    1979-01-01

    An investigation of the data system needs driven by users of space-acquired Earth observation data is documented. Two major categories of users, operational and research, are identified. Limiting data acquisition alleviates some of the delays in processing, thus improving the timeliness of the delivered product. Trade-offs occur between timeliness and data distribution costs, and between data storage and reprocessing. The complexity of the data system requirements to apply space data to users' needs is such that no single analysis suffices to design and implement the optimum system. A series of iterations is required, with analyses of the salient problems in a general way, followed by a limited implementation of benefit to some users and a continual upgrade in system capacity, functions, and applications served. The most important resulting requirement for the data system is flexibility to accommodate changing requirements as the system is implemented.

  20. Evolution of data stewardship over two decades at a NASA data center

    NASA Astrophysics Data System (ADS)

    Armstrong, E. M.; Moroni, D. F.; Hausman, J.; Tsontos, V. M.

    2013-12-01

    Whether referred to as data science or data engineering, the technical nature and practice of data curation has seen a noticeable shift in the last two decades. The majority of this shift has been driven by increasing data volumes and complexity, new data structures, and data virtualization through internet access, which have themselves spawned new fields or advances in semantic ontologies, metadata, advanced distributed computing, and new file formats. As a result of this shifting landscape, the role of the data scientist/engineer has also evolved. We will discuss the key elements of this evolutionary shift from the perspective of data curation at the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC), which is one of 12 NASA Earth Science data centers responsible for archiving and distributing oceanographic satellite data since 1993. Earlier responsibilities of data curation in the history of the PO.DAAC focused strictly on data archiving, low-level data quality assessments, understanding and building read software for terse binary data, or limited applications of self-describing file formats and metadata. Data discovery was often word of mouth or based on perusing simple web pages built for specific products. At that time the PO.DAAC served only a few tens of datasets. A single data engineer focused on a specific mission or suite of datasets from a specific physical parameter (e.g., ocean topography measurements). Since that time the number of datasets in the PO.DAAC has grown to approach one thousand, with increasing complexity of data and metadata structures in self-describing formats. Advances in ontologies, metadata, applications of MapReduce distributed computing and "big data", improvements in data discovery, data mining, and tools for visualization and analysis have all required new and evolving skill sets. The community began requiring more rigorous assessments of data quality and uncertainty. Although the expert knowledge of the

  1. Big Data in the Earth Observing System Data and Information System

    NASA Technical Reports Server (NTRS)

    Lynnes, Chris; Baynes, Katie; McInerney, Mark

    2016-01-01

    Approaches that are being pursued for the Earth Observing System Data and Information System (EOSDIS) data system to address the challenges of Big Data were presented to the NASA Big Data Task Force. Cloud prototypes are underway to tackle the volume challenge of Big Data. However, advances in computer hardware or cloud won't help (much) with variety. Rather, interoperability standards, conventions, and community engagement are the key to addressing variety.

  2. The Project Data Sphere Initiative: Accelerating Cancer Research by Sharing Data

    PubMed Central

    Reeder-Hayes, Katherine E.; Corty, Robert W.; Basch, Ethan; Milowsky, Mathew I.; Dusetzina, Stacie B.; Bennett, Antonia V.; Wood, William A.

    2015-01-01

    Background. In this paper, we provide background and context regarding the potential for a new data-sharing platform, the Project Data Sphere (PDS) initiative, funded by financial and in-kind contributions from the CEO Roundtable on Cancer, to transform cancer research and improve patient outcomes. Given the relatively modest decline in cancer death rates over the past several years, a new research paradigm is needed to accelerate therapeutic approaches for oncologic diseases. Phase III clinical trials generate large volumes of potentially usable information, often on hundreds of patients, including patients treated with standard of care therapies (i.e., controls). Both nationally and internationally, a variety of stakeholders have pursued data-sharing efforts to make individual patient-level clinical trial data available to the scientific research community. Potential Benefits and Risks of Data Sharing. For researchers, shared data have the potential to foster a more collaborative environment, to answer research questions in a shorter time frame than traditional randomized control trials, to reduce duplication of effort, and to improve efficiency. For industry participants, use of trial data to answer additional clinical questions could increase research and development efficiency and guide future projects through validation of surrogate end points, development of prognostic or predictive models, selection of patients for phase II trials, stratification in phase III studies, and identification of patient subgroups for development of novel therapies. Data transparency also helps promote a public image of collaboration and altruism among industry participants. For patient participants, data sharing maximizes their contribution to public health and increases access to information that may be used to develop better treatments. Concerns about data-sharing efforts include protection of patient privacy and confidentiality. To alleviate these concerns, data sets are

  3. [Data sources, the data used, and the modality for collection].

    PubMed

    Mercier, G; Costa, N; Dutot, C; Riche, V-P

    2018-03-01

    The hospital costing process implies access to various sources of data. Whether a micro-costing or a gross-costing approach is used, the choice of the methodology is based on a compromise between the cost of data collection, data accuracy, and data transferability. This work describes the data sources available in France and the access modalities that are used, as well as the main advantages and shortcomings of: (1) the local unit costs, (2) the hospital analytical accounting, (3) the Angers database, (4) the National Health Cost Studies, (5) the INTER CHR/U databases, (6) the Program for Medicalizing Information Systems, and (7) the public health insurance databases.

  4. Data management, archiving, visualization and analysis of space physics data

    NASA Technical Reports Server (NTRS)

    Russell, C. T.

    1995-01-01

    A series of programs for the visualization and analysis of space physics data has been developed at UCLA. In the course of those developments, a number of lessons have been learned regarding data management and data archiving, as well as data analysis. The issues now facing those wishing to develop such software, as well as the lessons learned, are reviewed. Modern media have eased many of the earlier problems of the physical volume required to store data, the speed of access, and the permanence of the records. However, the ultimate longevity of these media is still a question of debate. Finally, while software development has become easier, cost is still a limiting factor in developing visualization and analysis software.

  5. Breaking BAD: A Data Serving Vision for Big Active Data

    PubMed Central

    Carey, Michael J.; Jacobs, Steven; Tsotras, Vassilis J.

    2017-01-01

    Virtually all of today’s Big Data systems are passive in nature. Here we describe a project to shift Big Data platforms from passive to active. We detail a vision for a scalable system that can continuously and reliably capture Big Data to enable timely and automatic delivery of new information to a large pool of interested users as well as supporting analyses of historical information. We are currently building a Big Active Data (BAD) system by extending an existing scalable open-source BDMS (AsterixDB) in this active direction. This first paper zooms in on the Data Serving piece of the BAD puzzle, including its key concepts and user model. PMID:29034377

  6. Evaluating lossy data compression on climate simulation data within a large ensemble

    DOE PAGES

    Baker, Allison H.; Hammerling, Dorit M.; Mickelson, Sheri A.; ...

    2016-12-07

    High-resolution Earth system model simulations generate enormous data volumes, and retaining the data from these simulations often strains institutional storage resources. Further, these exceedingly large storage requirements negatively impact science objectives, for example, by forcing reductions in data output frequency, simulation length, or ensemble size. To lessen data volumes from the Community Earth System Model (CESM), we advocate the use of lossy data compression techniques. While lossy data compression does not exactly preserve the original data (as lossless compression does), lossy techniques have an advantage in terms of smaller storage requirements. To preserve the integrity of the scientific simulation data, the effects of lossy data compression on the original data should, at a minimum, not be statistically distinguishable from the natural variability of the climate system, and previous preliminary work with data from CESM has shown this goal to be attainable. However, to ultimately convince climate scientists that it is acceptable to use lossy data compression, we provide climate scientists with access to publicly available climate data that have undergone lossy data compression. In particular, we report on the results of a lossy data compression experiment with output from the CESM Large Ensemble (CESM-LE) Community Project, in which we challenge climate scientists to examine features of the data relevant to their interests, and attempt to identify which of the ensemble members have been compressed and reconstructed. We find that while detecting distinguishing features is certainly possible, the compression effects noticeable in these features are often unimportant or disappear in post-processing analyses. In addition, we perform several analyses that directly compare the original data to the reconstructed data to investigate the preservation, or lack thereof, of specific features critical to climate science. Overall, we conclude that
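
    A much-simplified sketch of the kind of check described above (placeholder arrays and a toy quantisation stand in for CESM-LE output and real lossy compression) is:

        # Compare a "reconstructed" field to the original and ask whether the
        # compression error is small relative to the ensemble's natural variability.
        import numpy as np

        rng = np.random.default_rng(0)
        ensemble = rng.normal(loc=288.0, scale=0.5, size=(30, 64, 128))  # 30 members
        original = ensemble[0]

        step = 0.01                                        # placeholder quantisation interval
        reconstructed = np.round(original / step) * step   # stand-in for lossy compression

        compression_error = np.abs(reconstructed - original)
        natural_variability = ensemble.std(axis=0)

        ratio = (compression_error / natural_variability).max()
        print(f"max |error| / ensemble std: {ratio:.3f}")  # well below 1 -> indistinguishable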

  7. Evaluating lossy data compression on climate simulation data within a large ensemble

    NASA Astrophysics Data System (ADS)

    Baker, Allison H.; Hammerling, Dorit M.; Mickelson, Sheri A.; Xu, Haiying; Stolpe, Martin B.; Naveau, Phillipe; Sanderson, Ben; Ebert-Uphoff, Imme; Samarasinghe, Savini; De Simone, Francesco; Carbone, Francesco; Gencarelli, Christian N.; Dennis, John M.; Kay, Jennifer E.; Lindstrom, Peter

    2016-12-01

    High-resolution Earth system model simulations generate enormous data volumes, and retaining the data from these simulations often strains institutional storage resources. Further, these exceedingly large storage requirements negatively impact science objectives, for example, by forcing reductions in data output frequency, simulation length, or ensemble size. To lessen data volumes from the Community Earth System Model (CESM), we advocate the use of lossy data compression techniques. While lossy data compression does not exactly preserve the original data (as lossless compression does), lossy techniques have an advantage in terms of smaller storage requirements. To preserve the integrity of the scientific simulation data, the effects of lossy data compression on the original data should, at a minimum, not be statistically distinguishable from the natural variability of the climate system, and previous preliminary work with data from CESM has shown this goal to be attainable. However, to ultimately convince climate scientists that it is acceptable to use lossy data compression, we provide climate scientists with access to publicly available climate data that have undergone lossy data compression. In particular, we report on the results of a lossy data compression experiment with output from the CESM Large Ensemble (CESM-LE) Community Project, in which we challenge climate scientists to examine features of the data relevant to their interests, and attempt to identify which of the ensemble members have been compressed and reconstructed. We find that while detecting distinguishing features is certainly possible, the compression effects noticeable in these features are often unimportant or disappear in post-processing analyses. In addition, we perform several analyses that directly compare the original data to the reconstructed data to investigate the preservation, or lack thereof, of specific features critical to climate science. Overall, we conclude that applying

  8. Evaluating lossy data compression on climate simulation data within a large ensemble

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Baker, Allison H.; Hammerling, Dorit M.; Mickelson, Sheri A.

    High-resolution Earth system model simulations generate enormous data volumes, and retaining the data from these simulations often strains institutional storage resources. Further, these exceedingly large storage requirements negatively impact science objectives, for example, by forcing reductions in data output frequency, simulation length, or ensemble size. To lessen data volumes from the Community Earth System Model (CESM), we advocate the use of lossy data compression techniques. While lossy data compression does not exactly preserve the original data (as lossless compression does), lossy techniques have an advantage in terms of smaller storage requirements. To preserve the integrity of the scientific simulation data, the effects of lossy data compression on the original data should, at a minimum, not be statistically distinguishable from the natural variability of the climate system, and previous preliminary work with data from CESM has shown this goal to be attainable. However, to ultimately convince climate scientists that it is acceptable to use lossy data compression, we provide climate scientists with access to publicly available climate data that have undergone lossy data compression. In particular, we report on the results of a lossy data compression experiment with output from the CESM Large Ensemble (CESM-LE) Community Project, in which we challenge climate scientists to examine features of the data relevant to their interests, and attempt to identify which of the ensemble members have been compressed and reconstructed. We find that while detecting distinguishing features is certainly possible, the compression effects noticeable in these features are often unimportant or disappear in post-processing analyses. In addition, we perform several analyses that directly compare the original data to the reconstructed data to investigate the preservation, or lack thereof, of specific features critical to climate science. Overall, we conclude that

  9. Antarctic Glaciological Data at NSIDC: field data, temperature, and ice velocity

    NASA Astrophysics Data System (ADS)

    Bauer, R.; Bohlander, J.; Scambos, T.; Berthier, E.; Raup, B.; Scharfen, G.

    2003-12-01

    An extensive collection of many Antarctic glaciological parameters is available for the polar science community upon request. The National Science Foundation's Office of Polar Programs funds the Antarctic Glaciological Data Center (AGDC) at the National Snow and Ice Data Center (NSIDC) to archive and distribute Antarctic glaciological and cryospheric system data collected by the U.S. Antarctic Program. AGDC facilitates data exchange among Principal Investigators, preserves recently collected data useful to future research, gathers data sets from past research, and compiles continent-wide information useful for modeling and field work planning. Data sets are available via our web site, http://nsidc.org/agdc/. From here, users can access extensive documentation, citation information, locator maps, derived images and references, and the numerical data. More than 50 Antarctic scientists have contributed data to the archive. Among the compiled products distributed by AGDC are VELMAP and THERMAP. THERMAP is a compilation of over 600 shallow firn temperature measurements ('10-meter temperatures') collected since 1950. These data provide a record of mean annual temperature, and potentially hold a record of climate change on the continent. The data are represented with maps showing the traverse route, and include data sources, measurement technique, and additional measurements made at each site, i.e., snow density and accumulation. VELMAP is an archive of surface ice velocity measurements for the Antarctic Ice Sheet. The primary objective of VELMAP is to assemble a historic record of outlet glaciers and ice shelf ice motion over the Antarctic. The collection includes both PI-contributed measurements and data generated at NSIDC using Landsat and SPOT satellite imagery. Tabular data contain position, speed, bearing, and data quality information, and related references. Two new VELMAP data sets are highlighted: the Mertz Glacier and the Institute Ice Stream. Mertz Glacier ice

  10. Analyzing Likert Data

    ERIC Educational Resources Information Center

    Boone, Harry N., Jr.; Boone, Deborah A.

    2012-01-01

    This article provides information for Extension professionals on the correct analysis of Likert data. The analyses of Likert-type and Likert scale data require unique data analysis procedures, and as a result, misuses and/or mistakes often occur. This article discusses the differences between Likert-type and Likert scale data and provides…

  11. A Disciplined Architectural Approach to Scaling Data Analysis for Massive, Scientific Data

    NASA Astrophysics Data System (ADS)

    Crichton, D. J.; Braverman, A. J.; Cinquini, L.; Turmon, M.; Lee, H.; Law, E.

    2014-12-01

    Data collections across remote sensing and ground-based instruments in astronomy, Earth science, and planetary science are outpacing scientists' ability to analyze them. Furthermore, the distribution, structure, and heterogeneity of the measurements themselves pose challenges that limit the scalability of data analysis using traditional approaches. Methods for developing science data processing pipelines, distribution of scientific datasets, and performing analysis will require innovative approaches that integrate cyber-infrastructure, algorithms, and data into more systematic approaches that can more efficiently compute and reduce data, particularly distributed data. This requires the integration of computer science, machine learning, statistics and domain expertise to identify scalable architectures for data analysis. The size of data returned from Earth science observing satellites and the magnitude of data from climate model output are predicted to grow into the tens of petabytes, challenging current data analysis paradigms. This same kind of growth is present in astronomy and planetary science data. One of the major challenges in data science and related disciplines is defining new approaches to scaling systems and analysis in order to increase scientific productivity and yield. Specific needs include: 1) identification of optimized system architectures for analyzing massive, distributed data sets; 2) algorithms for systematic analysis of massive data sets in distributed environments; and 3) the development of software infrastructures that are capable of performing massive, distributed data analysis across a comprehensive data science framework. NASA/JPL has begun an initiative in data science to address these challenges. Our goal is to evaluate how scientific productivity can be improved through optimized architectural topologies that identify how to deploy and manage the access, distribution, computation, and reduction of massive, distributed data, while

  12. US GeoData

    USGS Publications Warehouse

    ,

    1992-01-01

    US GeoData tapes are computer tapes that contain cartographic data in digital form. The 1:2,000,000-scale data are available in two forms. The graphic form can be used to generate computer-plotted maps. The content and scale of the maps can be varied to meet your needs. The topologically structured form of US GeoData is suitable for input to geographic information systems for use in spatial analysis and geographic studies. Both forms must be used in conjunction with appropriate software. US GeoData tapes offer convenience, accuracy, flexibility, and cost effectiveness to many map users. Business, industry, and government users who are involved in network planning and analysis, transportation, demography, land use, or any activity where data can be related to, or plotted on, a map will find US GeoData a valuable resource.

  13. Organizing research data.

    PubMed

    Sestoft, Peter

    2011-01-01

    Research relies on ever larger amounts of data from experiments, automated production equipment, questionnaires, time series such as weather records, and so on. A major task in science is to combine, process, and analyse such data to obtain evidence of patterns and correlations. Most research data are in digital form, which in principle ensures easy processing and analysis, easy long-term preservation, and easy reuse in future research, perhaps in entirely unanticipated ways. However, in practice, obstacles such as incompatible or undocumented data formats, poor data quality, and lack of familiarity with current technology prevent researchers from making full use of available data. This paper argues that relational databases are excellent tools for veterinary research and animal production; provides a small example to introduce basic database concepts; and points out some concerns that must be addressed when organizing data for research purposes.
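
    As a rough flavour of the kind of small example the paper describes, the snippet below builds a toy relational schema in SQLite and answers a simple research question with a join. The table names, columns, and values are hypothetical, not taken from the paper.

    ```python
    # Hypothetical sketch: organising research observations in a relational database.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE animal (
            animal_id INTEGER PRIMARY KEY,
            herd      TEXT NOT NULL,
            birthdate TEXT NOT NULL
        );
        CREATE TABLE weighing (
            animal_id  INTEGER REFERENCES animal(animal_id),
            weigh_date TEXT NOT NULL,
            weight_kg  REAL NOT NULL
        );
    """)
    con.executemany("INSERT INTO animal VALUES (?, ?, ?)",
                    [(1, "A", "2020-03-01"), (2, "A", "2020-04-15"), (3, "B", "2020-03-20")])
    con.executemany("INSERT INTO weighing VALUES (?, ?, ?)",
                    [(1, "2021-01-10", 412.0), (2, "2021-01-10", 398.5), (3, "2021-01-11", 405.2)])

    # A join answers a typical research question: mean weight per herd.
    for herd, mean_kg in con.execute("""
            SELECT a.herd, AVG(w.weight_kg)
            FROM animal a JOIN weighing w ON w.animal_id = a.animal_id
            GROUP BY a.herd"""):
        print(herd, round(mean_kg, 1))
    ```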

  14. Agile Data Curation: A conceptual framework and approach for practitioner data management

    NASA Astrophysics Data System (ADS)

    Young, J. W.; Benedict, K. K.; Lenhardt, W. C.

    2015-12-01

    Data management occurs across a range of science and related activities such as decision support. Exemplars within the science community operate data management systems that are extensively planned before implementation, staffed with robust data management expertise, equipped with appropriate services and technologies, and often highly structured. However, this is not the only approach to data management, and almost certainly not the typical experience. The other end of the spectrum is often an ad hoc practitioner team with changing requirements, limited training in data management, and constrained equipment and human resources. Much of the existing data management literature serves the exemplar community and ignores the ad hoc practitioners. Somewhere in the middle are examples where data are repurposed for new uses, thereby generating new data management challenges. This submission presents a conceptualization of an Agile Data Curation approach that provides foundational principles for data management efforts operating across the spectrum of data generation and use, from large science systems to efforts with constrained resources, limited expertise, and evolving requirements. The underlying principles of Agile Data Curation are a reapplication of agile software development principles to data management. The historical reality for many data management efforts is operation in a practitioner environment, so Agile Data Curation uses historical and current case studies to validate the foundational principles and, through comparison, to learn lessons for future application. This submission will provide an overview of Agile Data Curation, cover the foundational principles of the approach, and introduce a framework for gathering, classifying, and applying lessons from case studies of practitioner data management.

  15. Big Bicycle Data Processing: from Personal Data to Urban Applications

    NASA Astrophysics Data System (ADS)

    Pettit, C. J.; Lieske, S. N.; Leao, S. Z.

    2016-06-01

    Understanding the flows of people moving through the built environment is a vital source of information for the planners and policy makers who shape our cities. Smart phone applications enable people to trace themselves through the city, and these data can then potentially be aggregated and visualised to show hot spots and trajectories of macro urban movement. In this paper our aim is to develop procedures for cleaning, aggregating, and visualising human movement data and translating these into policy-relevant information. In conducting this research we explore bicycle data collected from a smart phone application known as RiderLog. We focus initially on the RiderLog application in the context of Sydney, Australia, and discuss the procedures and challenges in processing and cleaning these data before any analysis can be made. We then present some preliminary map results using the CartoDB online mapping platform, where data are aggregated and visualised to show hot spots and trajectories of macro urban movement. We conclude the paper by highlighting some of the key challenges in working with such data and outline some next steps in processing the data and conducting higher-volume and more extensive analysis.
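
    A generic version of the cleaning-and-aggregation step described above can be sketched as follows. This is not the authors' RiderLog pipeline; the bounding box, grid cell size, and coordinates are hypothetical.

    ```python
    # Hypothetical sketch: filtering GPS fixes and binning them onto a coarse
    # grid to expose hot spots of cycling activity.
    from collections import Counter

    # (lat, lon) fixes from many rides; some are obviously bad and must be dropped.
    fixes = [(-33.870, 151.209), (-33.871, 151.210), (-33.872, 151.209),
             (0.0, 0.0),                      # null-island artefact -> discard
             (-33.865, 151.215), (-33.871, 151.210)]

    # Keep only points inside a plausible Sydney bounding box.
    def in_sydney(lat, lon):
        return -34.2 <= lat <= -33.5 and 150.5 <= lon <= 151.5

    CELL = 0.005   # roughly 500 m grid cells
    counts = Counter(
        (round(lat / CELL) * CELL, round(lon / CELL) * CELL)
        for lat, lon in fixes if in_sydney(lat, lon)
    )

    # Cells with the most fixes are candidate hot spots for mapping (e.g., in CartoDB).
    for cell, n in counts.most_common(3):
        print(cell, n)
    ```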

  16. Earth Observing System Data and Information System (EOSDIS) Overview

    NASA Technical Reports Server (NTRS)

    Klene, Stephan

    2016-01-01

    The National Aeronautics and Space Administration (NASA) acquires and distributes an abundance of Earth science data on a daily basis to a diverse user community worldwide. The NASA Big Earth Data Initiative (BEDI) is an effort to make the acquired science data more discoverable, accessible, and usable. This presentation will provide a brief introduction to the Earth Observing System Data and Information System (EOSDIS) project and the nature of advances that have been made by BEDI to other Federal Users.

  17. Data dictionary and formatting standard for dissemination of geotechnical data

    USGS Publications Warehouse

    Benoit, J.; Bobbitt, J.I.; Ponti, D.J.; Shimel, S.A.; ,

    2004-01-01

    A pilot system for archiving and web dissemination of geotechnical data collected and stored by various agencies is currently under development. Part of the scope of this project, sponsored by the Consortium of Organizations for Strong-Motion Observation Systems (COSMOS) and by the Pacific Earthquake Engineering Research Center (PEER) Lifelines Program, is the development of a data dictionary and formatting standard. This paper presents the data model along with the basic structure of the data dictionary tables for this pilot system.
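
    As background, a data dictionary enumerates every field in an archive together with its type, units, and meaning. The rows below are a purely hypothetical illustration of that structure, not the COSMOS/PEER standard itself.

    ```python
    # Hypothetical sketch of a data-dictionary table for geotechnical records;
    # the field names, types, and units are illustrative only.
    data_dictionary = [
        # (field_name,     data_type, units,  description)
        ("borehole_id",    "TEXT",    None,   "Unique identifier assigned by the contributing agency"),
        ("latitude",       "REAL",    "deg",  "WGS84 latitude of the borehole collar"),
        ("longitude",      "REAL",    "deg",  "WGS84 longitude of the borehole collar"),
        ("depth_top",      "REAL",    "m",    "Top of the sampled interval below ground surface"),
        ("shear_wave_vel", "REAL",    "m/s",  "Measured shear-wave velocity for the interval"),
    ]

    for name, dtype, units, desc in data_dictionary:
        print(f"{name:15s} {dtype:5s} {units or '-':5s} {desc}")
    ```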

  18. Big data and data repurposing - using existing data to answer new questions in vascular dementia research.

    PubMed

    Doubal, Fergus N; Ali, Myzoon; Batty, G David; Charidimou, Andreas; Eriksdotter, Maria; Hofmann-Apitius, Martin; Kim, Yun-Hee; Levine, Deborah A; Mead, Gillian; Mucke, Hermann A M; Ritchie, Craig W; Roberts, Charlotte J; Russ, Tom C; Stewart, Robert; Whiteley, William; Quinn, Terence J

    2017-04-17

    Traditional approaches to clinical research have, as yet, failed to provide effective treatments for vascular dementia (VaD). Novel approaches to collation and synthesis of data may allow for time- and cost-efficient hypothesis generation and testing. These approaches may have particular utility in helping us understand and treat a complex condition such as VaD. We present an overview of new uses for existing data to progress VaD research. The overview is the result of consultation with various stakeholders, focused literature review, and learning from the group's experience of successful approaches to data repurposing. In particular, we benefitted from the expert discussion and input of delegates at the 9th International Congress on Vascular Dementia (Ljubljana, 16-18th October 2015). We agreed on key areas that could be of relevance to VaD research: systematic review of existing studies; individual patient-level analyses of existing trials and cohorts; and linking electronic health record data to other datasets. We illustrated each theme with a case study of an existing project that has utilised this approach. There are many opportunities for the VaD research community to make better use of existing data. The volume of potentially available data is increasing, and the opportunities for using these resources to progress the VaD research agenda are exciting. Of course, these approaches come with inherent limitations and biases, as bigger datasets are not necessarily better datasets, and maintaining rigour and critical analysis will be key to optimising data use.

  19. Delving into Data

    ERIC Educational Resources Information Center

    Cullen, Kevin

    2005-01-01

    Corporations employ data mining to analyze operations, find trends in recorded information, and look for new opportunities. Libraries are no different. Librarians manage large stores of data--about collections and usage, for example--and they also want to analyze this data to serve their users better. Analysts use data mining to query a data…

  20. Data Governance and Data Quality: Is It on Your Agenda?

    ERIC Educational Resources Information Center

    Young, Anne; McConkey, Kevin

    2012-01-01

    Data governance is a relatively new and evolving discipline. It encompasses the people who are responsible for data quality (the stewards); the policies and processes associated with collecting, managing, storing and reporting data; and the information technology systems and support that provide efficient infrastructure. Higher education…