Toxics Release Inventory Chemical Hazard Information Profiles (TRI-CHIP) Dataset
The Toxics Release Inventory (TRI) Chemical Hazard Information Profiles (TRI-CHIP) dataset contains hazard information about the chemicals reported in TRI. The XML-format dataset allows users to create their own databases and hazard analyses of TRI chemicals. The hazard information is compiled from a series of authoritative sources, including the Integrated Risk Information System (IRIS). The dataset is provided as a downloadable .zip file that, when extracted, provides the XML files and schemas for the hazard information tables.
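As a rough illustration of the intended use, the extracted XML tables can be loaded into a local database. The element and attribute names below are hypothetical stand-ins, not the actual TRI-CHIP schema:

```python
# Sketch: loading a TRI-CHIP-style XML hazard table into a local SQLite
# database. The <Chemical> element and its attributes are illustrative
# assumptions, not the real TRI-CHIP schema.
import sqlite3
import xml.etree.ElementTree as ET

sample_xml = """<HazardTable>
  <Chemical casrn="71-43-2" name="Benzene" source="IRIS"/>
  <Chemical casrn="50-00-0" name="Formaldehyde" source="IRIS"/>
</HazardTable>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hazard (casrn TEXT, name TEXT, source TEXT)")
root = ET.fromstring(sample_xml)
for chem in root.iter("Chemical"):
    conn.execute("INSERT INTO hazard VALUES (?, ?, ?)",
                 (chem.get("casrn"), chem.get("name"), chem.get("source")))
rows = conn.execute("SELECT name FROM hazard ORDER BY name").fetchall()
```

For the real files, `ET.parse()` on each extracted XML file would replace the in-memory string, with the table layout taken from the accompanying schemas.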
NASA Astrophysics Data System (ADS)
Gross, M. B.; Mayernik, M. S.; Rowan, L. R.; Khan, H.; Boler, F. M.; Maull, K. E.; Stott, D.; Williams, S.; Corson-Rikert, J.; Johns, E. M.; Daniels, M. D.; Krafft, D. B.
2015-12-01
UNAVCO, UCAR, and Cornell University are working together to leverage semantic web technologies to enable discovery of people, datasets, publications and other research products, as well as the connections between them. The EarthCollab project, an EarthCube Building Block, is enhancing an existing open-source semantic web application, VIVO, to address connectivity gaps across distributed networks of researchers and resources related to the following two geoscience-based communities: (1) the Bering Sea Project, an interdisciplinary field program whose data archive is hosted by NCAR's Earth Observing Laboratory (EOL), and (2) UNAVCO, a geodetic facility and consortium that supports diverse research projects informed by geodesy. People, publications, datasets and grant information have been mapped to an extended version of the VIVO-ISF ontology and ingested into VIVO's database. Data is ingested using a custom set of scripts that include the ability to perform basic automated and curated disambiguation. VIVO can display a page for every object ingested, including connections to other objects in the VIVO database. A dataset page, for example, includes the dataset type, time interval, DOI, related publications, and authors. The dataset type field provides a connection to all other datasets of the same type. The author's page will show, among other information, related datasets and co-authors. Information previously spread across several unconnected databases is now stored in a single location. In addition to VIVO's default display, the new database can also be queried using SPARQL, a query language for semantic data. EarthCollab will also extend the VIVO web application. One such extension is the ability to cross-link separate VIVO instances across institutions, allowing local display of externally curated information. 
For example, Cornell's VIVO faculty pages will display UNAVCO's dataset information and UNAVCO's VIVO will display Cornell faculty member contact and position information. Additional extensions, including enhanced geospatial capabilities, will be developed following task-centered usability testing.
Improving data discovery and usability through commentary and user feedback: the CHARMe project
NASA Astrophysics Data System (ADS)
Alegre, R.; Blower, J. D.
2014-12-01
Earth science datasets are highly diverse. Users of these datasets are similarly varied, ranging from research scientists through industrial users to government decision- and policy-makers. It is very important for these users to understand the applicability of any dataset to their particular problem so that they can select the most appropriate data sources for their needs. Although data providers often supply rich supporting information in the form of metadata, typically this information does not include community usage information that can help other users judge fitness-for-purpose. The CHARMe project (http://www.charme.org.uk) is filling this gap by developing a system for sharing "commentary metadata". These are annotations that are generated and shared by the user community and include:
- Links between publications and datasets. The CHARMe system can record information about why a particular dataset was used (e.g. the paper may describe the dataset, it may use the dataset as a source, or it may be publishing results of a dataset assessment). These publications may appear in the peer-reviewed literature, or may be technical reports, websites or blog posts.
- Free-text comments supplied by the user.
- Provenance information, including links between datasets and descriptions of processing algorithms and sensors.
- External events that may affect data quality (e.g. large volcanic eruptions or El Niño events); we call these "significant events".
- Data quality information, e.g. system maturity indices.
Commentary information can be linked to anything that can be uniquely identified (e.g. a dataset with a DOI or a persistent web address). It is also possible to associate commentary with particular subsets of datasets, for example to highlight an issue that is confined to a particular geographic region. We will demonstrate tools that show these capabilities in action, showing how users can apply commentary information during data discovery, visualization and analysis.
The CHARMe project has implemented a set of open-source tools to create, store and explore commentary information, using open Web standards. In this presentation we will describe the application of the CHARMe system to the particular case of the climate data community; however the techniques and technologies are generic and can be applied in many fields.
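A commentary record of the "link between publication and dataset" kind might be represented roughly as below. The vocabulary terms are assumptions in the spirit of the W3C Open Annotation and CiTO vocabularies, and both DOIs are hypothetical, not identifiers from the CHARMe system:

```python
import json

# Illustrative only: a minimal Open-Annotation-style record stating that a
# (hypothetical) paper uses data from a (hypothetical) dataset. The exact
# CHARMe payload and vocabulary may differ.
annotation = {
    "@type": "oa:Annotation",
    "oa:hasTarget": "https://doi.org/10.0000/example-dataset",   # hypothetical DOI
    "oa:hasBody": {
        "@type": "cito:CitationAct",
        "cito:hasCitingEntity": "https://doi.org/10.0000/example-paper",
        "cito:hasCitationCharacterization": "cito:usesDataFrom",
    },
    "oa:motivatedBy": "oa:linking",
}
payload = json.dumps(annotation)
```

Because the target is just a URI, the same structure can point at anything uniquely identifiable, which is what makes the annotation model a good fit for commentary metadata.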
Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets
NASA Astrophysics Data System (ADS)
Day-Lewis, F. D.; Slater, L. D.; Johnson, T.
2012-12-01
Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
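As a minimal sketch of the cross-correlation approach mentioned above, the lag between a hydrologic forcing and a delayed geophysical response can be estimated as follows. This is pure Python on synthetic data; real analyses would work on detrended field series and handle non-stationarity:

```python
import math

def xcorr_lag(x, y, max_lag):
    """Return the lag (in samples) at which the normalized cross-correlation
    of x and y is largest; positive lag means y trails x."""
    def unit(v):
        m = sum(v) / len(v)
        s = math.sqrt(sum((a - m) ** 2 for a in v))
        return [(a - m) / s for a in v]
    x, y = unit(x), unit(y)
    best_lag, best_r = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        r = sum(x[i] * y[i + lag] for i in range(len(x))
                if 0 <= i + lag < len(y))
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag

# Synthetic example: the "response" y is the "forcing" x delayed by 3 samples.
x = [math.sin(0.3 * i) for i in range(100)]
y = [0.0, 0.0, 0.0] + x[:-3]
lag = xcorr_lag(x, y, max_lag=10)
```

The same idea, applied per pixel or per sensor, is one way to condense a large monitoring dataset into a single diagnostic map of response lags.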
Interactive visualization and analysis of multimodal datasets for surgical applications.
Kirmizibayrak, Can; Yim, Yeny; Wakid, Mike; Hahn, James
2012-12-01
Surgeons use information from multiple sources when making surgical decisions. These include volumetric datasets (such as CT, PET, MRI, and their variants), 2D datasets (such as endoscopic videos), and vector-valued datasets (such as computer simulations). Presenting all the information to the user in an effective manner is a challenging problem. In this paper, we present a visualization approach that displays the information from various sources in a single coherent view. The system allows the user to explore and manipulate volumetric datasets, display analysis of dataset values in local regions, combine 2D and 3D imaging modalities and display results of vector-based computer simulations. Several interaction methods are discussed: in addition to traditional interfaces including mouse and trackers, gesture-based natural interaction methods are shown to control these visualizations with real-time performance. An example of a medical application (medialization laryngoplasty) is presented to demonstrate how the combination of different modalities can be used in a surgical setting with our approach.
A global distributed basin morphometric dataset
NASA Astrophysics Data System (ADS)
Shen, Xinyi; Anagnostou, Emmanouil N.; Mei, Yiwen; Hong, Yang
2017-01-01
Basin morphometry is vital information for relating storms to hydrologic hazards, such as landslides and floods. In this paper we present the first comprehensive global dataset of distributed basin morphometry at 30 arc seconds resolution. The dataset includes nine prime morphometric variables; in addition, we present formulas for generating twenty-one additional morphometric variables based on combinations of the prime variables. The dataset can aid different applications, including studies of land-atmosphere interaction and modelling of floods and droughts for sustainable water management. The validity of the dataset has been confirmed by successfully reproducing Hack's law.
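Hack's law relates mainstream channel length L to drainage area A as L ≈ c·A^h, with the exponent h empirically near 0.6. A log-log least-squares fit of the kind used to check such a relation can be sketched as follows (synthetic basins, not values from the dataset):

```python
import math

# Fit L = c * A^h by ordinary least squares in log-log space.
def fit_hacks_law(areas, lengths):
    xs = [math.log(a) for a in areas]
    ys = [math.log(l) for l in lengths]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    h = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    c = math.exp(my - h * mx)
    return c, h

areas = [10.0, 100.0, 1000.0, 10000.0]        # drainage areas, km^2 (synthetic)
lengths = [1.4 * a ** 0.6 for a in areas]     # lengths following an exact Hack relation
c, h = fit_hacks_law(areas, lengths)
```

With real basins the points scatter around the line, and the recovered exponent close to 0.6 is what "successfully reproducing Hack's law" amounts to.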
Enhancing Geoscience Research Discovery Through the Semantic Web
NASA Astrophysics Data System (ADS)
Rowan, Linda R.; Gross, M. Benjamin; Mayernik, Matthew; Khan, Huda; Boler, Frances; Maull, Keith; Stott, Don; Williams, Steve; Corson-Rikert, Jon; Johns, Erica M.; Daniels, Michael; Krafft, Dean B.; Meertens, Charles
2016-04-01
UNAVCO, UCAR, and Cornell University are working together to leverage semantic web technologies to enable discovery of people, datasets, publications and other research products, as well as the connections between them. The EarthCollab project, a U.S. National Science Foundation EarthCube Building Block, is enhancing an existing open-source semantic web application, VIVO, to improve connectivity across distributed networks of researchers and resources related to the following two geoscience-based communities: (1) the Bering Sea Project, an interdisciplinary field program whose data archive is hosted by NCAR's Earth Observing Laboratory (EOL), and (2) UNAVCO, a geodetic facility and consortium that supports diverse research projects informed by geodesy. People, publications, datasets and grant information have been mapped to an extended version of the VIVO-ISF ontology and ingested into VIVO's database. Much of the VIVO ontology was built for the life sciences, so we have added some components of existing geoscience-based ontologies and a few terms from a local ontology that we created. The UNAVCO VIVO instance, connect.unavco.org, utilizes persistent identifiers whenever possible; for example using ORCIDs for people, publication DOIs, data DOIs and unique NSF grant numbers. Data is ingested using a custom set of scripts that include the ability to perform basic automated and curated disambiguation. VIVO can display a page for every object ingested, including connections to other objects in the VIVO database. A dataset page, for example, includes the dataset type, time interval, DOI, related publications, and authors. The dataset type field provides a connection to all other datasets of the same type. The author's page shows, among other information, related datasets and co-authors. Information previously spread across several unconnected databases is now stored in a single location. 
In addition to VIVO's default display, the new database can be queried using SPARQL, a query language for semantic data. EarthCollab is extending the VIVO web application. One such extension is the ability to cross-link separate VIVO instances across institutions, allowing local display of externally curated information. For example, Cornell's VIVO faculty pages will display UNAVCO's dataset information and UNAVCO's VIVO will display Cornell faculty member contact and position information. About half of UNAVCO's membership is international and we hope to connect our data to institutions in other countries with a similar approach. Additional extensions, including enhanced geospatial capabilities, will be developed based on task-centered usability testing.
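A SPARQL query over such a VIVO store might look roughly like the following. The class and property names are assumptions in the spirit of the VIVO-ISF ontology and have not been checked against the live connect.unavco.org instance:

```python
from urllib.parse import urlencode

# Hypothetical query: list datasets with their titles and related authors.
# vivo:Dataset, vivo:relatedBy and vivo:relates are assumed names from the
# VIVO core namespace, used here for illustration only.
query = """
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?dataset ?title ?author
WHERE {
  ?dataset a vivo:Dataset ;
           rdfs:label ?title ;
           vivo:relatedBy ?authorship .
  ?authorship vivo:relates ?author .
}
LIMIT 25
"""

# Encoded as it would be sent to a SPARQL endpoint's HTTP query interface.
params = urlencode({"query": query,
                    "format": "application/sparql-results+json"})
```

The point of exposing SPARQL alongside the default pages is exactly this kind of ad hoc traversal: one query can follow links from datasets through authorships to people without any custom application code.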
Wind Integration National Dataset Toolkit | Grid Modernization | NREL
The Wind Integration National Dataset (WIND) Toolkit includes meteorological conditions and turbine power data. It is an update and expansion of the Eastern Wind Integration Data Set.
A web Accessible Framework for Discovery, Visualization and Dissemination of Polar Data
NASA Astrophysics Data System (ADS)
Kirsch, P. J.; Breen, P.; Barnes, T. D.
2007-12-01
A web accessible information framework, currently under development within the Physical Sciences Division of the British Antarctic Survey, is described. The datasets accessed are generally heterogeneous in nature, from fields including space physics, meteorology, atmospheric chemistry, ice physics, and oceanography. Many of these are returned in near real time over a 24/7 limited bandwidth link from remote Antarctic stations and ships. The requirement is to provide various user groups, each with disparate interests and demands, with a system incorporating a browsable and searchable catalogue, bespoke data summary visualization, metadata access facilities and download utilities. The system allows timely access to raw and processed datasets through an easily navigable discovery interface. Once discovered, a summary of the dataset can be visualized in a manner prescribed by the particular projects and user communities, or the dataset may be downloaded, subject to any accessibility restrictions that exist. In addition, access to related ancillary information, including software, documentation, related URLs and information concerning non-electronic media (of particular relevance to some legacy datasets), is made directly available, having automatically been associated with a dataset during the discovery phase. Major components of the framework include the relational database containing the catalogue; the organizational structure of the systems holding the data, enabling automatic updates of the system catalogue and real-time access to data; the user interface design; and administrative and data management scripts allowing straightforward incorporation of utilities, datasets and system maintenance.
Schedl, Markus
2017-01-01
Recently, the LFM-1b dataset has been proposed to foster research and evaluation in music retrieval and music recommender systems, Schedl (Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR). New York, 2016). It contains more than one billion music listening events created by more than 120,000 users of Last.fm. Each listening event is characterized by artist, album, and track name, and further includes a timestamp. Basic demographic information and a selection of more elaborate listener-specific descriptors are included as well, for anonymized users. In this article, we reveal information about LFM-1b's acquisition and content and we compare it to existing datasets. We furthermore provide an extensive statistical analysis of the dataset, including basic properties of the item sets, demographic coverage, distribution of listening events (e.g., over artists and users), and aspects related to music preference and consumption behavior (e.g., temporal features and mainstreaminess of listeners). Exploiting country information of users and genre tags of artists, we also create taste profiles for populations and determine similar and dissimilar countries in terms of their populations' music preferences. Finally, we illustrate the dataset's usage in a simple artist recommendation task, whose results are intended to serve as baseline against which more elaborate techniques can be assessed.
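A back-of-the-envelope sketch of aggregating LFM-1b-style listening events follows. The tuples are toy data, not records from the dataset, and the real release uses numeric user/artist/album/track identifiers rather than names:

```python
from collections import Counter

# Each listening event: (user_id, artist, album, track, unix_timestamp).
# All values below are invented for illustration.
events = [
    (1, "Artist A", "Album X", "Track 1", 1388530000),
    (1, "Artist A", "Album X", "Track 2", 1388530200),
    (1, "Artist B", "Album Y", "Track 1", 1388530400),
    (2, "Artist A", "Album X", "Track 1", 1388530600),
]

# Distribution of listening events over artists, as in the paper's statistics.
plays_per_artist = Counter(artist for _, artist, _, _, _ in events)
top_artist, top_count = plays_per_artist.most_common(1)[0]
```

The same grouping, keyed on user or country instead of artist, underlies the demographic coverage and taste-profile analyses described above.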
Rail Trails and Property Values: Is There an Association?
ERIC Educational Resources Information Center
Hartenian, Ella; Horton, Nicholas J.
2015-01-01
The Rail Trail and Property Values dataset includes information on a set of n = 104 homes which sold in Northampton, Massachusetts in 2007. The dataset provides house information (square footage, acreage, number of bedrooms, etc.), price estimates (from Zillow.com) at four time points, location, distance from a rail trail in the community, biking…
Hedefalk, Finn; Svensson, Patrick; Harrie, Lars
2017-01-01
This paper presents datasets that enable historical longitudinal studies of micro-level geographic factors in a rural setting. These types of datasets are new, as historical demography studies have generally failed to properly include the micro-level geographic factors. Our datasets describe the geography over five Swedish rural parishes, and by linking them to a longitudinal demographic database, we obtain a geocoded population (at the property unit level) for this area for the period 1813–1914. The population is a subset of the Scanian Economic Demographic Database (SEDD). The geographic information includes the following feature types: property units, wetlands, buildings, roads and railroads. The property units and wetlands are stored in object-lifeline time representations (information about creation, changes and ends of objects are recorded in time), whereas the other feature types are stored as snapshots in time. Thus, the datasets present one of the first opportunities to study historical spatio-temporal patterns at the micro-level. PMID:28398288
EEG datasets for motor imagery brain-computer interface.
Cho, Hohyun; Ahn, Minkyu; Ahn, Sangtae; Kwon, Moonyoung; Jun, Sung Chan
2017-07-01
Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not by evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in that the user can generate induced activity by imagining motor movements. However, variations in performance over sessions and subjects are too severe to overcome easily; therefore, a basic understanding and investigation of BCI performance variation is necessary to identify its causes. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. We validated our EEG datasets by using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we showed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of datasets (38 subjects) included reasonably discriminative information. Our EEG datasets included the information necessary to determine statistical significance; they consisted of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also achieve subject-to-subject transfer by using metadata, including a questionnaire, EEG coordinates, and EEGs for non-task-related states. © The Authors 2017. Published by Oxford University Press.
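The ERD/ERS quantity used in the validation above is, in its classic form, the percentage change of band power during the imagery epoch relative to a baseline epoch. A toy sketch (synthetic band-pass-filtered samples, not real EEG):

```python
# ERD% = 100 * (P_task - P_baseline) / P_baseline, where P is mean band power
# of a band-pass-filtered epoch. Negative values indicate desynchronization
# (ERD), positive values synchronization (ERS). Samples below are synthetic.
def band_power(samples):
    return sum(s * s for s in samples) / len(samples)

def erd_percent(baseline, task):
    p_base, p_task = band_power(baseline), band_power(task)
    return 100.0 * (p_task - p_base) / p_base

baseline = [1.0, -1.0] * 50     # baseline epoch
task = [0.5, -0.5] * 50         # imagery epoch: amplitude halves, power drops
erd = erd_percent(baseline, task)
```

Contralateral ERD then shows up as negative values over the sensorimotor electrodes opposite the imagined hand, which is the pattern the authors report.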
Research on Zheng Classification Fusing Pulse Parameters in Coronary Heart Disease
Guo, Rui; Wang, Yi-Qin; Xu, Jin; Yan, Hai-Xia; Yan, Jian-Jun; Li, Fu-Feng; Xu, Zhao-Xia; Xu, Wen-Jie
2013-01-01
This study was conducted to illustrate that nonlinear dynamic variables of Traditional Chinese Medicine (TCM) pulse can improve the performances of TCM Zheng classification models. Pulse recordings of 334 coronary heart disease (CHD) patients and 117 normal subjects were collected in this study. Recurrence quantification analysis (RQA) was employed to acquire nonlinear dynamic variables of pulse. TCM Zheng models in CHD were constructed, and predictions using a novel multilabel learning algorithm based on different datasets were carried out. Datasets were designed as follows: dataset1, TCM inquiry information including inspection information; dataset2, time-domain variables of pulse and dataset1; dataset3, RQA variables of pulse and dataset1; and dataset4, major principal components of RQA variables and dataset1. The performances of the different models for Zheng differentiation were compared. The model for Zheng differentiation based on RQA variables integrated with inquiry information had the best performance, whereas that based only on inquiry had the worst performance. Meanwhile, the model based on time-domain variables of pulse integrated with inquiry fell between the above two. This result showed that RQA variables of pulse can be used to construct models of TCM Zheng and improve the performance of Zheng differentiation models. PMID:23737839
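As a minimal illustration of the RQA variables mentioned above, the recurrence rate is the fraction of sample pairs that fall within a distance threshold. A real analysis would first embed the pulse waveform in a higher-dimensional phase space; the values below are toy numbers, not clinical data:

```python
# Recurrence rate, the simplest RQA variable: the fraction of (i, j) pairs in
# a series whose values lie within eps of each other. Shown here on a raw 1-D
# toy series; proper RQA uses a time-delay phase-space embedding first.
def recurrence_rate(series, eps):
    n = len(series)
    recurrent = sum(1 for i in range(n) for j in range(n)
                    if abs(series[i] - series[j]) <= eps)
    return recurrent / (n * n)

pulse = [0.0, 0.1, 0.0, 0.9, 0.1, 0.0]   # toy pulse samples
rr = recurrence_rate(pulse, eps=0.15)
```

Variables like this, computed per recording, are the nonlinear dynamic features that the study feeds into its Zheng classification models alongside the inquiry information.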
Bayesian correlated clustering to integrate multiple datasets
Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.
2012-01-01
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558
KCMP Minnesota Tall Tower Nitrous Oxide Inverse Modeling Dataset 2010-2015
Griffis, Timothy J. [University of Minnesota; Baker, John; Millet, Dylan; Chen, Zichong; Wood, Jeff; Erickson, Matt; Lee, Xuhui
2017-01-01
This dataset contains nitrous oxide mixing ratios and supporting information measured at a tall tower (KCMP, 244 m) site near St. Paul, Minnesota, USA. The data include nitrous oxide and carbon dioxide mixing ratios measured at the 100 m level. Turbulence and wind data were measured using a sonic anemometer at the 185 m level. Also included in this dataset are estimates of the "background" nitrous oxide mixing ratios and monthly concentration source footprints derived from WRF-STILT modeling.
NASA Astrophysics Data System (ADS)
Agapiou, Athos; Lysandrou, Vasiliki; Themistocleous, Kyriakos; Nisantzi, Argyro; Lasaponara, Rosa; Masini, Nicola; Krauss, Thomas; Cerra, Daniele; Gessner, Ursula; Schreier, Gunter; Hadjimitsis, Diofantos
2016-08-01
The landscape of Cyprus is characterized by transformations that occurred during the 20th century, many of which are still active today. These changes are due to a variety of causes, including war conflicts, environmental conditions and modern development, and have often altered or even totally destroyed information that could have helped archaeologists comprehend the archaeo-landscape. The present work aims to provide detailed information, from a remote sensing perspective, on the existing datasets that can support archaeologists in understanding the transformations the landscape of Cyprus has undergone. Such datasets may help archaeologists visualize a lost landscape and retrieve valuable information from it, while also supporting future investigations. They can further be used in a predictive manner to highlight, and consequently assess, the impacts of landscape transformation, whether of natural or anthropogenic cause, on cultural heritage. Three main datasets are presented here: aerial images, satellite datasets (including spy satellite imagery acquired during the Cold War), and cadastral maps. The data are presented in chronological order (e.g. year of acquisition), and other important parameters, such as cost and accuracy, are also determined. Examples of individual archaeological sites in Cyprus are provided for each dataset to underline both its importance and its performance, and some pre- and post-processing remote sensing methodologies that enhance the final results are briefly described. This paper, prepared within the framework of the ATHENA project dedicated to remote sensing for archaeology and cultural heritage, aims to fill a significant gap in the recent literature on remote sensing archaeology of the island and to assist current and future archaeologists in their quest for remote sensing information to support their research.
Toward a complete dataset of drug-drug interaction information from publicly available sources.
Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D
2015-06-01
Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources, including five clinically-oriented information sources, four Natural Language Processing (NLP) corpora, and five bioinformatics/pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and by specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. 
Future work will focus on improvement of the methods for mapping between PDDI information sources, identifying methods to improve the use of the merged dataset in PDDI NLP algorithms, integrating high-quality PDDI information from the merged dataset into Wikidata, and making the combined dataset accessible as Semantic Web Linked Data. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
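The overlap analysis described above reduces, at its core, to set operations over unordered drug pairs. A toy sketch (invented pairs, not entries from the actual merged dataset):

```python
# Each source is modelled as a set of unordered drug pairs; frozenset makes
# (warfarin, aspirin) and (aspirin, warfarin) the same pair. The pairs below
# are illustrative, not real PDDI list contents.
def pair(a, b):
    return frozenset((a, b))

sources = {
    "SourceA": {pair("warfarin", "aspirin"), pair("simvastatin", "amiodarone")},
    "SourceB": {pair("aspirin", "warfarin"), pair("digoxin", "quinidine")},
}

# Overlap measured against the smaller source, one common way to compare
# lists of very different sizes.
def overlap(s1, s2):
    return len(s1 & s2) / min(len(s1), len(s2))

o = overlap(sources["SourceA"], sources["SourceB"])
```

In practice the hard part is the preceding normalization step, mapping each source's drug identifiers onto a common terminology so that pairs are comparable at all.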
Spooner, Amy J; Aitken, Leanne M; Chaboyer, Wendy
2017-11-15
There is widespread use of clinical information systems in intensive care units; however, the evidence to support electronic handover is limited. The study aim was to assess the barriers and facilitators to use of an electronic minimum dataset for nursing team leader shift-to-shift handover in the intensive care unit prior to its implementation. The study was conducted in a 21-bed medical/surgical intensive care unit, specialising in cardiothoracic surgery, at a tertiary referral hospital in Queensland, Australia. An established tool was modified to the intensive care nursing handover context and a survey of all 63 nursing team leaders was undertaken. Survey statements were rated using a 6-point Likert scale, with selections from 'strongly disagree' to 'strongly agree', complemented by open-ended questions. Descriptive statistics were used to summarise results. A total of 39 team leaders responded to the survey (62%). Team leaders used general intensive care work unit guidelines to inform practice; however, they were less familiar with the intensive care handover work unit guideline. Barriers to minimum dataset uptake included a tool that was not user friendly, was time consuming and contained too much information. Facilitators to minimum dataset adoption included a tool that was user friendly, saved time and contained relevant information. Identifying the complexities of a healthcare setting prior to the implementation of an intervention assists researchers and clinicians to integrate new knowledge into healthcare settings. Barriers and facilitators to knowledge use focused on usability, content and efficiency of the electronic minimum dataset and can be used to inform tailored strategies to optimise team leaders' adoption of a minimum dataset for handover. Copyright © 2017 Australian College of Critical Care Nurses Ltd. Published by Elsevier Ltd. All rights reserved.
Dataset of all Indian Reservations in US EPA Region 9 (California, Arizona and Nevada) with some reservation border areas of adjacent states included (adjacent areas of Colorado, New Mexico and Utah). Reservation boundaries are compiled from multiple sources and are derived from several different source scales. Information such as reservation type, primary tribe name are included with the feature dataset. Public Domain Allotments are not included in this data set.
McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R
2018-01-01
Objectives This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. Methods A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. Results A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Conclusions Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. PMID:29084729
Pantheon 1.0, a manually verified dataset of globally famous biographies.
Yu, Amy Zhao; Ronen, Shahar; Hu, Kevin; Lu, Tiffany; Hidalgo, César A
2016-01-05
We present the Pantheon 1.0 dataset: a manually verified dataset of individuals who have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually verified demographic information (place and date of birth, gender); (ii) a taxonomy of occupations classifying each biography at three levels of aggregation; and (iii) two measures of global popularity, namely the number of languages in which a biography is present in Wikipedia (L) and the Historical Popularity Index (HPI), a metric that combines information on L, time since birth, and page-views (2008-2013). We compare the Pantheon 1.0 dataset to data from the 2003 book Human Accomplishment, and also to external measures of accomplishment in individual games and sports: tennis, swimming, car racing, and chess. In all of these cases we find that measures of popularity (L and HPI) correlate highly with individual accomplishment, suggesting that measures of global popularity proxy the historical impact of individuals. PMID:26731133
Hofman, Abe D.; Visser, Ingmar; Jansen, Brenda R. J.; van der Maas, Han L. J.
2015-01-01
We propose and test three statistical models for the analysis of children's responses to the balance scale task, a seminal task in the study of proportional reasoning. We use a latent class modelling approach to formulate a rule-based latent class model (RB LCM), following from a rule-based perspective on proportional reasoning, and a new statistical model, the Weighted Sum Model, following from an information-integration approach. Moreover, a hybrid LCM using item covariates is proposed, combining aspects of both the rule-based and information-integration perspectives. These models are applied to two different datasets: a standard paper-and-pencil test dataset (N = 779), and a dataset collected within an online learning environment that included direct feedback, time pressure, and a reward system (N = 808). For the paper-and-pencil dataset the RB LCM resulted in the best fit, whereas for the online dataset the hybrid LCM provided the best fit. The standard paper-and-pencil dataset yielded more evidence for distinct solution rules than the online dataset, in which quantitative item characteristics are more prominent in determining responses. These results shed new light on the debate between sequential rule-based and information-integration perspectives on cognitive development. PMID:26505905
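The contrast between the rule-based and information-integration perspectives can be made concrete with two toy response rules for a balance scale item (a weight placed at some distance on each arm). This is an illustrative sketch under assumed names and an assumed additive weighting, not the statistical models fitted in the study:

```python
def torque_rule(w_left, d_left, w_right, d_right):
    """Normative end point of the rule-based progression: compare the
    products weight x distance (torque) on each arm."""
    t_left, t_right = w_left * d_left, w_right * d_right
    if t_left > t_right:
        return "left"
    if t_right > t_left:
        return "right"
    return "balance"


def weighted_sum_rule(w_left, d_left, w_right, d_right, a=1.0, b=1.0):
    """Information-integration flavour: combine the weight and distance
    cues additively, with cue weights a and b (assumed values)."""
    score = a * (w_left - w_right) + b * (d_left - d_right)
    if score > 0:
        return "left"
    if score < 0:
        return "right"
    return "balance"
```

The two rules diverge on conflict items, e.g. more weight on one side but more distance on the other, which is exactly where the competing models make different predictions.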
NASA Astrophysics Data System (ADS)
Dogon-yaro, M. A.; Kumar, P.; Rahman, A. Abdul; Buyuksalih, G.
2016-10-01
Timely and accurate acquisition of information on the condition and structural changes of urban trees serves as a tool for decision makers to better appreciate urban ecosystems and their numerous values, which are critical to building up strategies for sustainable development. The conventional techniques used for extracting tree features include ground surveying and interpretation of aerial photography. However, these techniques are constrained by labour-intensive fieldwork, high cost, and the influence of weather conditions and topographic cover, constraints that can be overcome by means of integrated airborne LiDAR and very-high-resolution digital image datasets. This study presents a semi-automated approach for extracting urban trees from integrated airborne LiDAR and multispectral digital image datasets over the city of Istanbul, Turkey. The scheme includes detection and extraction of shadow-free vegetation features based on the spectral properties of digital images, using shadow-index and NDVI techniques, and automated extraction of 3D information about vegetation features from the integrated processing of the shadow-free vegetation image and LiDAR point cloud datasets. The developed algorithms show promising results as an automated and cost-effective approach to estimating and delineating 3D information on urban trees. The research also demonstrates that the integrated datasets are a suitable technology and a viable source of information for city managers to use in urban tree management.
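The shadow-free vegetation screening step can be sketched as a per-pixel mask combining NDVI with a brightness-based shadow proxy. The band names, the mean-brightness shadow index, and both thresholds below are illustrative assumptions, not the paper's exact formulation:

```python
def vegetation_mask(nir, red, blue, ndvi_thresh=0.3, bright_thresh=0.25):
    """Flag pixels that look like sunlit vegetation: NDVI above a
    threshold (vegetation) and mean brightness above a threshold
    (not in shadow). Inputs are per-pixel reflectance sequences in [0, 1]."""
    mask = []
    for n, r, b in zip(nir, red, blue):
        ndvi = (n - r) / (n + r + 1e-9)   # standard NDVI definition
        brightness = (n + r + b) / 3.0    # crude shadow proxy (assumption)
        mask.append(ndvi > ndvi_thresh and brightness > bright_thresh)
    return mask
```

For example, a bright pixel with high NIR passes both tests, while a dark pixel with the same NDVI is rejected as shadowed and excluded from the subsequent 3D extraction.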
Applications of the LBA-ECO Metadata Warehouse
NASA Astrophysics Data System (ADS)
Wilcox, L.; Morrell, A.; Griffith, P. C.
2006-05-01
The LBA-ECO Project Office has developed a system to harvest and warehouse metadata resulting from the Large-Scale Biosphere Atmosphere Experiment in Amazonia. The harvested metadata is used to create dynamically generated reports, available at www.lbaeco.org, which facilitate access to LBA-ECO datasets. The reports are generated for specific controlled vocabulary terms (such as an investigation team or a geospatial region), and are cross-linked with one another via these terms. This approach creates a rich contextual framework enabling researchers to find datasets relevant to their research. It maximizes data discovery by association and provides a greater understanding of the scientific and social context of each dataset. For example, our website provides a profile (e.g. participants, abstract(s), study sites, and publications) for each LBA-ECO investigation. Linked from each profile is a list of associated registered dataset titles, each of which link to a dataset profile that describes the metadata in a user-friendly way. The dataset profiles are generated from the harvested metadata, and are cross-linked with associated reports via controlled vocabulary terms such as geospatial region. The region name appears on the dataset profile as a hyperlinked term. When researchers click on this link, they find a list of reports relevant to that region, including a list of dataset titles associated with that region. Each dataset title in this list is hyperlinked to its corresponding dataset profile. Moreover, each dataset profile contains hyperlinks to each associated data file at its home data repository and to publications that have used the dataset. We also use the harvested metadata in administrative applications to assist quality assurance efforts. 
These include processes to check for broken hyperlinks to data files, automated emails that inform our administrators when critical metadata fields are updated, dynamically generated reports of metadata records that link to datasets with questionable file formats, and dynamically generated region/site coordinate quality assurance reports. These applications are as important as those that facilitate access to information because they help ensure a high standard of quality for the information. This presentation will discuss reports currently in use, provide a technical overview of the system, and discuss plans to extend this system to harvest metadata resulting from the North American Carbon Program by drawing on datasets in many different formats, residing in many thematic data centers and also distributed among hundreds of investigators.
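A minimal sketch of the kind of broken-hyperlink check described above (the actual QA processes are not detailed in the abstract; error handling and the return format here are simplifications):

```python
import urllib.request
from urllib.parse import urlparse


def classify_links(urls, timeout=10):
    """Return (url, reason) pairs for dataset-file hyperlinks that are
    malformed or fail an HTTP HEAD request."""
    broken = []
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            broken.append((url, "malformed"))
            continue
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if resp.status >= 400:
                    broken.append((url, f"HTTP {resp.status}"))
        except OSError as exc:  # covers URLError/HTTPError and timeouts
            broken.append((url, str(exc)))
    return broken
```

A periodic run over the harvested metadata's data-file URLs would feed the same kind of administrator report as the automated emails mentioned above.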
McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R; Beresford, Michael W
2018-02-01
This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Using Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM, identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. The resulting dataset was tested in routine clinical practice before review and finalisation. A dataset containing 123 items was formulated, with an accompanying glossary. Demographic and diagnostic data are contained within form A, collected at the baseline visit only; disease activity measures are included within form B, collected at every visit; and disease damage items within form C, collected at baseline and annual visits thereafter. Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases.
Plant databases and data analysis tools
USDA-ARS?s Scientific Manuscript database
It is anticipated that the coming years will see the generation of large datasets including diagnostic markers in several plant species with emphasis on crop plants. To use these datasets effectively in any plant breeding program, it is essential to have the information available via public database...
DE Knegt, L V; Pires, S M; Hald, T
2015-04-01
Microbial subtyping approaches are commonly used for source attribution of human salmonellosis. Such methods require data on Salmonella in animals and humans, outbreaks, infection abroad and amounts of food available for consumption. A source attribution model was applied to 24 European countries, requiring special data management to produce a standardized dataset. Salmonellosis data on animals and humans were obtained from datasets provided by the European Food Safety Authority. The amount of food available for consumption was calculated based on production and trade data. Limitations included different types of underreporting, non-participation in prevalence studies, and non-availability of trade data. Cases without travel information were assumed to be domestic; non-subtyped human or animal records were re-identified according to proportions observed in reference datasets; missing trade information was estimated based on previous years. The resulting dataset included data on 24 serovars in humans, broilers, laying hens, pigs and turkeys in 24 countries.
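The re-identification of non-subtyped records according to proportions observed in a reference dataset can be sketched as a largest-remainder reallocation. The function below is an illustrative simplification (names and rounding scheme are assumptions), not the model's actual implementation:

```python
def reallocate_unsubtyped(counts, n_unsubtyped):
    """Distribute n_unsubtyped records over serovars in proportion to the
    subtyped counts already observed (the reference distribution).
    Integer rounding is resolved by largest fractional remainder."""
    total = sum(counts.values())
    raw = {s: n_unsubtyped * c / total for s, c in counts.items()}
    alloc = {s: int(v) for s, v in raw.items()}
    # hand out the remaining records to the largest fractional remainders
    leftover = n_unsubtyped - sum(alloc.values())
    for s in sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)[:leftover]:
        alloc[s] += 1
    return {s: counts[s] + alloc[s] for s in counts}
```

For example, 10 non-subtyped isolates distributed over a 60/40 reference split are added as 6 and 4 to the respective serovar totals.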
Pérez-Luque, Antonio Jesús; Zamora, Regino; Bonet, Francisco Javier; Pérez-Pérez, Ramón
2015-01-01
In this data paper, we describe the dataset of the Global Change, Altitudinal Range Shift and Colonization of Degraded Habitats in Mediterranean Mountains (MIGRAME) project, which aims to assess the capacity of altitudinal migration and colonization of marginal habitats by Quercus pyrenaica Willd. forests in Sierra Nevada (southern Spain) considering two global-change drivers: temperature increase and land-use changes. The dataset includes information on the forest structure (diameter size, tree height, and abundance) of the Quercus pyrenaica ecosystem in Sierra Nevada obtained from 199 transects sampled at the treeline ecotone, mature forest, and marginal habitats (abandoned cropland and pine plantations). A total of 3839 occurrence records were collected and 5751 measurements recorded. The dataset is included in the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this mountain range. PMID:26491387
Gururaj, Anupama E.; Chen, Xiaoling; Pournejati, Saeid; Alter, George; Hersh, William R.; Demner-Fushman, Dina; Ohno-Machado, Lucila
2017-01-01
The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval. Database URL: https://biocaddie.org/benchmark-data PMID:29220453
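Relevance judgments of the Cranfield kind support standard rank-based scoring of a retrieval run. A minimal sketch of average precision for one query follows; this is the textbook measure, not the challenge's official evaluation script:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: the mean of precision@k taken at
    each rank k where a relevant dataset is retrieved, divided over the
    total number of relevant datasets in the judgments."""
    hits = 0
    precisions = []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0
```

Averaging this quantity over all benchmark queries gives mean average precision (MAP), a common headline number for comparing retrieval systems on such a benchmark.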
McCann, Liza J; Arnold, Katie; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Beard, Laura; Beresford, Michael W; Wedderburn, Lucy R
2014-01-01
Juvenile dermatomyositis (JDM) is a rare but severe autoimmune inflammatory myositis of childhood. International collaboration is essential in order to undertake clinical trials, understand the disease and improve long-term outcome. The aim of this study was to propose, from existing collaborative initiatives, a preliminary minimal dataset for JDM. This will form the basis of the future development of an international consensus-approved minimum core dataset to be used both in clinical care and to inform research, allowing integration of data between centres. A working group of internationally representative JDM experts was formed to develop a provisional minimal dataset. Clinical and laboratory variables contained within current national and international collaborative databases of patients with idiopathic inflammatory myopathies were scrutinised. Judgements were informed by published literature and a more detailed analysis of the Juvenile Dermatomyositis Cohort Biomarker Study and Repository, UK and Ireland. A provisional minimal JDM dataset has been produced, with an associated glossary of definitions. The provisional minimal dataset will request information at the time of patient diagnosis and during ongoing prospective follow-up. At the time of patient diagnosis, information will be requested on patient demographics, diagnostic criteria and treatments given prior to diagnosis. During ongoing prospective follow-up, variables will include the presence of active muscle or skin disease, major organ involvement or constitutional symptoms, investigations, treatment, physician global assessments and patient-reported outcome measures. An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients to provide a greater understanding of this disease.
This preliminary dataset can now be developed into a consensus-approved minimum core dataset and tested in a wider setting with the aim of achieving international agreement. PMID:25075205
Enrichment of Data Publications in Earth Sciences - Data Reports as a Missing Link
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Bertelmann, Roland; Haberland, Christian; Evans, Peter L.
2015-04-01
During the past decade, the relevance of research data stewardship has risen significantly. Preservation and publication of scientific data for long-term use, including storage in adequate repositories, have been identified as key issues by the scientific community as well as by bodies such as research agencies. A proper description of a dataset is essential for any kind of re-use. As a result of this increasing interest, data repositories have been developed, and the research data they hold are accompanied by at least a minimum set of metadata. This metadata is useful for data discovery and gives a first insight into the content of a dataset, but re-use often requires more extensive information. Many datasets are accompanied by a small 'readme' file with basic information on the data structure, or by other accompanying documents. A further source of information could be an article published in one of the newly emerging data journals (e.g. Copernicus's Earth System Science Data or Nature's Scientific Data). There is thus an information gap between a 'readme' file, which is only accessible after data download (and which often leads to less usage of published datasets than if the information were available beforehand), and the much larger effort of preparing an article for a peer-reviewed data journal. For many years, the GFZ German Research Centre for Geosciences has published 'Scientific Technical Reports (STR)', a report series that is persistently available electronically and citable with assigned DOIs. This series was opened for the description of parallel published datasets as 'STR Data'. These reports are internally reviewed and offer a flexible publication format for describing published data in depth, suitable for datasets ranging from long-term monitoring time series of observatories to field data, (meta-)databases, and software publications.
STR Data reports offer a full and consistent description of all relevant parameters of a linked published dataset. They are readable and citable on their own but are, of course, closely connected to the respective datasets, and therefore give full insight into the framework of the data before download. This is especially relevant for large and often heterogeneous datasets, such as controlled-source seismic data gathered with instruments of the 'Geophysical Instrument Pool Potsdam (GIPP)'. Here, details of the instrumentation, data organization, data format, accuracy, geographical coordinates, timing, data completeness, etc. need to be documented. STR Data reports are also attractive for the publication of historic datasets, e.g. seismic experiments from 30-40 years ago. A single report can also describe several datasets, e.g. from multiple instrument types or distinct regions of interest. The publication of DOI-assigned data reports is a helpful tool for filling the gap between basic metadata and restricted 'readme' information on the one hand and extended journal articles on the other. They open the way for informed re-use and, with their comprehensive data description, may act as an 'appetizer' for the re-use of published datasets.
A resource for assessing information processing in the developing brain using EEG and eye tracking
Langer, Nicolas; Ho, Erica J.; Alexander, Lindsay M.; Xu, Helen Y.; Jozanovic, Renee K.; Henin, Simon; Petroni, Agustin; Cohen, Samantha; Marcelle, Enitan T.; Parra, Lucas C.; Milham, Michael P.; Kelly, Simon P.
2017-01-01
We present a dataset combining electrophysiology and eye tracking intended as a resource for the investigation of information processing in the developing brain. The dataset includes high-density task-based and task-free EEG, eye tracking, and cognitive and behavioral data collected from 126 individuals (ages: 6–44). The task battery spans both the simple/complex and passive/active dimensions to cover a range of approaches prevalent in modern cognitive neuroscience. The active task paradigms facilitate principled deconstruction of core components of task performance in the developing brain, whereas the passive paradigms permit the examination of intrinsic functional network activity during varying amounts of external stimulation. Alongside these neurophysiological data, we include an abbreviated cognitive test battery and questionnaire-based measures of psychiatric functioning. We hope that this dataset will lead to the development of novel assays of neural processes fundamental to information processing, which can be used to index healthy brain development as well as detect pathologic processes. PMID:28398357
Chirico, P.G.; Moran, T.W.
2011-01-01
This dataset contains a collection of 24 folders, each representing a specific U.S. Geological Survey area of interest (AOI; fig. 1), as well as datasets for AOI subsets. Each folder includes the extent, contours, digital elevation model (DEM), and hydrography of the corresponding AOI, organized into vector feature and raster datasets. The dataset comprises a geographic information system (GIS), available upon request from the USGS Afghanistan programs website (http://afghanistan.cr.usgs.gov/minerals.php), together with maps of the 24 USGS areas of interest.
NASA Astrophysics Data System (ADS)
Karlsson, K.
2010-12-01
The EUMETSAT CMSAF project (www.cmsaf.eu) compiles climatological datasets from various satellite sources, with emphasis on the use of EUMETSAT-operated satellites. However, since climate monitoring primarily has a global scope, datasets merging data from various satellites and satellite operators are also prepared. One such dataset is the CMSAF historic GAC (Global Area Coverage) dataset, which is based on AVHRR data from the full historic series of NOAA satellites and the European METOP satellite in mid-morning orbit launched in October 2006. The CMSAF GAC dataset consists of three groups of products: macroscopic cloud products (cloud amount, cloud type and cloud top), cloud physical products (cloud phase, cloud optical thickness and cloud liquid water path), and surface radiation products (including surface albedo). Results will be presented and discussed for all product groups, including some preliminary inter-comparisons with other datasets (e.g., the PATMOS-x, MODIS and CloudSat/CALIPSO datasets). A background will also be given describing the basic methodology behind the derivation of all products, including a short historical review of AVHRR cloud processing and the resulting AVHRR applications at SMHI. Historic GAC processing is one of five pilot projects selected by the SCOPE-CM (Sustained Co-Ordinated Processing of Environmental Satellite data for Climate Monitoring) project organised by the WMO Space Programme. The pilot project is carried out jointly by CMSAF and NOAA with the purpose of finding an optimal GAC processing approach. The initial activity is to inter-compare results of the CMSAF GAC dataset and the NOAA PATMOS-x dataset for the case in which both datasets have been derived from the same inter-calibrated AVHRR radiance dataset. The aim is to gain further knowledge of, for example, the most useful multispectral methods and the impact of ancillary datasets (such as meteorological reanalysis datasets from NCEP and ECMWF). 
The CMSAF project is currently defining plans for another five years (2012-2017) of operations and development. New GAC reprocessing efforts are planned and new methodologies will be tested. Central questions here will be how to increase the quantitative use of the products through improved error and uncertainty estimates, and how to compile the information in a way that allows meaningful and efficient use of the data for, e.g., validation of climate model information.
A polymer dataset for accelerated property prediction and design.
Huan, Tran Doan; Mannodi-Kanakkithodi, Arun; Kim, Chiho; Sharma, Vinit; Pilania, Ghanshyam; Ramprasad, Rampi
2016-03-01
Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a sufficiently large dataset of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations, with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high-dielectric-constant polymers, the dataset initially includes the optimized structures, atomization energies, band gaps, and dielectric constants. It will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.
Dataset of Passerine bird communities in a Mediterranean high mountain (Sierra Nevada, Spain).
Pérez-Luque, Antonio Jesús; Barea-Azcón, José Miguel; Álvarez-Ruiz, Lola; Bonet-García, Francisco Javier; Zamora, Regino
2016-01-01
In this data paper, a dataset of passerine bird communities in Sierra Nevada, a Mediterranean high mountain range in southern Spain, is described. The dataset includes occurrence data from bird surveys conducted in four representative ecosystem types of Sierra Nevada from 2008 to 2015. For each visit, bird species counts as well as distances to the transect line were recorded. A total of 27847 occurrence records were compiled, with accompanying measurements of distance to the transect and animal counts. All records are of species in the order Passeriformes; records of 16 different families and 44 genera were collected. Some of the taxa in the dataset are included in the European Red List. This dataset belongs to the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area. PMID:26865820
Just, Anaïs; Gourvil, Johan; Millet, Jérôme; Boullet, Vincent; Milon, Thomas; Mandon, Isabelle; Dutrève, Bruno
2015-01-01
More than 20 years ago, the French Muséum National d'Histoire Naturelle (MNHN, Secretariat of the Fauna and Flora) published the first part of an atlas of the flora of France at a 20 km spatial resolution, accounting for 645 taxa (Dupont 1990). Since then, at the national level, there has not been any work on this scale relating to flora distribution, despite the obvious need for a better understanding. In 2011, in response to this need, the Fédération des Conservatoires Botaniques Nationaux (FCBN, http://www.fcbn.fr) launched an ambitious collaborative project involving the eleven national botanical conservatories of France. The project aims to establish a formal procedure and standardized system for data hosting, aggregation and publication for four areas: flora, fungi, vegetation and habitats. In 2014, the first phase of the project led to the development of the national flora dataset: SIFlore. As it includes about 21 million records of flora occurrences, this is currently the most comprehensive dataset on the distribution of vascular plants (Tracheophyta) in the French territory. SIFlore contains occurrence information for about 15,454 plant taxa (indigenous and alien) in metropolitan France and Réunion Island, from 1545 until 2014. The data records were originally collated from inventories, checklists, literature and herbarium records. SIFlore was developed by assembling flora datasets from the regional to the national level. At the regional level, source records are managed by the national botanical conservatories, which are responsible for flora data collection and validation. In order to present our results, a geoportal was developed by the Fédération des Conservatoires Botaniques Nationaux that allows the SIFlore dataset to be publicly viewed. This portal is available at: http://siflore.fcbn.fr. 
As the FCBN belongs to the Information System for Nature and Landscapes (SINP), a governmental program, the dataset is also accessible through the websites of the National Inventory of Natural Heritage (http://www.inpn.fr) and the Global Biodiversity Information Facility (http://www.gbif.fr). SIFlore is regularly updated with additional data records. It is also planned to expand the scope of the dataset to include information about taxon biology, phenology, ecology, chorology, frequency, conservation status and seed banks. A map showing an estimate of the dataset's completeness (based on the Jackknife 1 estimator) is presented and included as a numerical appendix. SIFlore aims to make the data of the flora of France available at the national level for conservation, policy management and scientific research. Such a dataset will provide enough information to allow for macro-ecological reviews of species distribution patterns and, coupled with climatic or topographic datasets, the identification of the determinants of these patterns. This dataset can be considered the primary indicator of the current state of knowledge of flora distribution across France. At a policy level, and in the context of global warming, this should promote the adoption of new measures aiming to improve and intensify flora conservation and surveys.
PMID:26491386
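The Jackknife 1 completeness estimate mentioned above is simple enough to sketch: observed richness plus the number of species seen in exactly one sample, scaled by (n-1)/n. The toy cells below are illustrative, not SIFlore data.

```python
def jackknife1(sample_species):
    """First-order jackknife richness estimate from per-sample species sets.

    S_jack1 = S_obs + f1 * (n - 1) / n, where f1 is the number of species
    found in exactly one sample ("uniques") and n is the number of samples.
    """
    n = len(sample_species)
    all_species = set().union(*sample_species)
    f1 = sum(
        1 for sp in all_species
        if sum(sp in s for s in sample_species) == 1
    )
    return len(all_species) + f1 * (n - 1) / n

# Three hypothetical grid cells with observed taxa
cells = [{"a", "b", "c"}, {"b", "c"}, {"c", "d"}]
print(jackknife1(cells))  # S_obs = 4, f1 = 2 ("a" and "d") -> 4 + 2*(2/3) ≈ 5.33
```

Comparing the estimate to observed richness per grid cell is one way to map where a floristic atlas is still under-sampled.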
Datasets on hub-height wind speed comparisons for wind farms in California.
Wang, Meina; Ullrich, Paul; Millstein, Dev
2018-08-01
This article describes the data related to the research article entitled "The future of wind energy in California: Future projections with the Variable-Resolution CESM" [1], with reference number RENE_RENE-D-17-03392. Datasets from the Variable-Resolution CESM, Det Norske Veritas Germanischer Lloyd Virtual Met, MERRA-2, CFSR, NARR, ISD surface observations, and upper-air sounding observations were used for calculating and comparing hub-height wind speed at multiple major wind farms across California. Information on hub-height wind speed interpolation and power curves at each wind farm site is also presented. All datasets, except Det Norske Veritas Germanischer Lloyd Virtual Met, are publicly available for future analysis.
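Hub-height interpolation of the kind mentioned above is often approximated with a power-law wind profile; the sketch below shows that common approximation, not the paper's actual method, and the shear exponent is a generic default that varies with terrain and stability.

```python
def extrapolate_to_hub(v_ref, z_ref, z_hub, alpha=0.143):
    """Power-law vertical extrapolation of wind speed:
    v(z_hub) = v(z_ref) * (z_hub / z_ref) ** alpha.
    alpha ≈ 1/7 is a common neutral-stability default, not a universal value."""
    return v_ref * (z_hub / z_ref) ** alpha

# A 10 m observation extrapolated to an 80 m hub
v80 = extrapolate_to_hub(v_ref=6.0, z_ref=10.0, z_hub=80.0)
print(round(v80, 2))
```

The extrapolated speed would then be passed through a turbine's power curve to estimate generation at each site.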
Nicholson, Andrew G; Detterbeck, Frank; Marx, Alexander; Roden, Anja C; Marchevsky, Alberto M; Mukai, Kiyoshi; Chen, Gang; Marino, Mirella; den Bakker, Michael A; Yang, Woo-Ick; Judge, Meagan; Hirschowitz, Lynn
2017-03-01
The International Collaboration on Cancer Reporting (ICCR) is a not-for-profit organization formed by the Royal Colleges of Pathologists of Australasia and the United Kingdom, the College of American Pathologists, the Canadian Association of Pathologists-Association Canadienne des Pathologistes in association with the Canadian Partnership Against Cancer, and the European Society of Pathology. Its goal is to produce standardized, internationally agreed, evidence-based datasets for use throughout the world. This article describes the development of a cancer dataset by the multidisciplinary ICCR expert panel for the reporting of thymic epithelial tumours. The dataset includes 'required' (mandatory) and 'recommended' (non-mandatory) elements, which are validated by a review of current evidence and supported by explanatory text. Seven required elements and 12 recommended elements were agreed by the international dataset authoring committee to represent the essential information for the reporting of thymic epithelial tumours. The use of an internationally agreed, structured pathology dataset for reporting thymic tumours provides all of the necessary information for optimal patient management, facilitates consistent and accurate data collection, and provides valuable data for research and international benchmarking. The dataset also provides a valuable resource for those countries and institutions that are not in a position to develop their own datasets. © 2016 John Wiley & Sons Ltd.
NASA Technical Reports Server (NTRS)
Stefanov, William L.
2017-01-01
The NASA Earth observations dataset obtained by humans in orbit using handheld film and digital cameras is freely accessible to the global community through the online searchable database at https://eol.jsc.nasa.gov, and offers a useful complement to traditional ground-commanded sensor data. The dataset includes imagery from the NASA Mercury (1961) through present-day International Space Station (ISS) programs, and currently totals over 2.6 million individual frames. Geographic coverage of the dataset includes land and ocean areas between approximately 52 degrees North and South latitude, but is spatially and temporally discontinuous. The photographic dataset includes some significant impediments to immediate research, applied, and educational use: commercial RGB films and camera systems with overlapping bandpasses; use of different focal length lenses, unconstrained look angles, and variable spacecraft altitudes; and no native geolocation information. Such factors led to this dataset being underutilized by the community, but recent advances in automated and semi-automated image geolocation, image feature classification, and web-based services are adding new value to the astronaut-acquired imagery. A coupled ground software and on-orbit hardware system for the ISS is in development for planned deployment in mid-2017; this system will capture camera pose information for each astronaut photograph to allow automated, full georegistration of the data. The ground component of the system is currently in use to fully georeference imagery collected in response to International Disaster Charter activations, and the auto-registration procedures are being applied to the extensive historical database of imagery to add value for research and educational purposes. 
In parallel, machine learning techniques are being applied to automate feature identification and classification throughout the dataset, in order to build descriptive metadata that will improve search capabilities. It is expected that these value additions will increase interest and use of the dataset by the global community.
CrossCheck: an open-source web tool for high-throughput screen data analysis.
Najafov, Jamil; Najafov, Ayaz
2017-07-19
Modern high-throughput screening methods allow researchers to generate large datasets that potentially contain important biological information. However, oftentimes, picking relevant hits from such screens and generating testable hypotheses requires training in bioinformatics and the skills to efficiently perform database mining. There are currently no tools available to the general public that allow users to cross-reference their screen datasets with published screen datasets. To this end, we developed CrossCheck, an online platform for high-throughput screen data analysis. CrossCheck is a centralized database that allows effortless comparison of the user-entered list of gene symbols with 16,231 published datasets. These datasets include published data from genome-wide RNAi and CRISPR screens, interactome proteomics and phosphoproteomics screens, cancer mutation databases, low-throughput studies of major cell signaling mediators, such as kinases, E3 ubiquitin ligases and phosphatases, and gene ontological information. Moreover, CrossCheck includes a novel database of predicted protein kinase substrates, which was developed using proteome-wide consensus motif searches. CrossCheck dramatically simplifies high-throughput screen data analysis and enables researchers to dig deep into the published literature and streamline data-driven hypothesis generation. CrossCheck is freely accessible as a web-based application at http://proteinguru.com/crosscheck.
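At its core, cross-referencing a user's gene list against published datasets is set intersection; the sketch below shows that idea only, with made-up dataset names and gene sets rather than CrossCheck's actual database of 16,231 datasets.

```python
# Hypothetical published dataset gene sets (illustrative names, not CrossCheck's)
published = {
    "kinase_screen_A": {"TP53", "AKT1", "MTOR"},
    "crispr_screen_B": {"AKT1", "BRCA1"},
}

def cross_reference(user_genes, datasets):
    """Return, per dataset, which of the user's gene symbols it contains."""
    hits = {name: sorted(set(user_genes) & genes) for name, genes in datasets.items()}
    return {name: h for name, h in hits.items() if h}  # drop datasets with no overlap

print(cross_reference(["AKT1", "EGFR"], published))
# {'kinase_screen_A': ['AKT1'], 'crispr_screen_B': ['AKT1']}
```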
A Semantically Enabled Metadata Repository for Solar Irradiance Data Products
NASA Astrophysics Data System (ADS)
Wilson, A.; Cox, M.; Lindholm, D. M.; Nadiadi, I.; Traver, T.
2014-12-01
The Laboratory for Atmospheric and Space Physics, LASP, has been conducting research in atmospheric and space science for over 60 years, and providing the associated data products to the public. LASP has a long history, in particular, of making space-based measurements of solar irradiance, which serve as crucial input to several areas of scientific research, including solar-terrestrial interactions and atmospheric and climate science. LISIRD, the LASP Interactive Solar Irradiance Data Center, serves these datasets to the public, including solar spectral irradiance (SSI) and total solar irradiance (TSI) data. The LASP extended metadata repository, LEMR, is a database of information about the datasets served by LASP, such as parameters, uncertainties, temporal and spectral ranges, current version, alerts, etc. It serves as the definitive, single source of truth for that information. The database is populated with information gathered via web forms and automated processes. Dataset owners keep the information current and verified for datasets under their purview. This information can be pulled dynamically for many purposes. Web sites such as LISIRD can include this information in web page content as it is rendered, ensuring users get current, accurate information. It can also be pulled to create metadata records in various metadata formats, such as SPASE (for heliophysics) and ISO 19115. Once these records are made available to the appropriate registries, our data will be discoverable by users coming in via those organizations. The database is implemented as an RDF triplestore, a collection of subject-predicate-object data entities identifiable with URIs. This capability, coupled with SPARQL over HTTP read access, enables semantic queries over the repository contents. To create the repository we leveraged VIVO, an open source semantic web application, to manage and create new ontologies and populate repository content. 
A variety of ontologies were used in creating the triplestore, including ontologies that came with VIVO such as FOAF. Also, the W3C DCAT ontology was integrated and extended to describe properties of our data products that we needed to capture, such as spectral range. The presentation will describe the architecture, ontology issues, and tools used to create LEMR and plans for its evolution.
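The triplestore-plus-SPARQL pattern described above can be miniaturized to show what pattern matching over subject-predicate-object triples looks like; this toy in-memory store and its invented `lemr:`/`dcat:` identifiers are illustrative only, not LEMR's actual contents or a real SPARQL engine.

```python
# A toy subject-predicate-object store; LEMR itself is a real RDF triplestore
# queried via SPARQL over HTTP. All URIs and values below are invented.
triples = [
    ("lemr:SORCE_TSI", "dcat:theme", "Total Solar Irradiance"),
    ("lemr:SORCE_TSI", "lemr:spectralRange", "total"),
    ("lemr:SORCE_SSI", "dcat:theme", "Solar Spectral Irradiance"),
]

def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    playing the role of a SPARQL variable."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Analogous to: SELECT ?o WHERE { lemr:SORCE_TSI dcat:theme ?o }
print([o for _, _, o in match(triples, s="lemr:SORCE_TSI", p="dcat:theme")])
# ['Total Solar Irradiance']
```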
De-identification of health records using Anonym: effectiveness and robustness across datasets.
Zuccon, Guido; Kotzur, Daniel; Nguyen, Anthony; Bergheim, Anton
2014-07-01
We evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random field classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness. The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in type of health records, source of data, and their quality, with one of the datasets containing optical character recognition errors. Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training. The findings show that Anonym is comparable to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations in training size, data type and quality in the presence of sufficient training data. Crown Copyright © 2014. Published by Elsevier B.V. All rights reserved.
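Pattern matching is only one of the feature sources Anonym's CRF classifiers use; the sketch below illustrates just that pattern-matching layer with a few invented regexes, and is nowhere near a production de-identifier.

```python
import re

# Illustrative patterns only; real de-identification (as in Anonym) layers a
# trained CRF classifier on top of lexical, linguistic and pattern features.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[- ]\d{3}[- ]\d{4}\b"),
    "MRN": re.compile(r"\bMRN\s*\d+\b"),
}

def scrub(text):
    """Replace pattern-matched identifiers with category placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Seen 03/14/2014, MRN 12345, call 555-867-5309."))
# Seen [DATE], [MRN], call [PHONE].
```

Purely regex-based scrubbing misses context-dependent identifiers (names, addresses), which is exactly why learned classifiers are used.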
FLUXNET2015 Dataset: Batteries included
NASA Astrophysics Data System (ADS)
Pastorello, G.; Papale, D.; Agarwal, D.; Trotta, C.; Chu, H.; Canfora, E.; Torn, M. S.; Baldocchi, D. D.
2016-12-01
The synthesis datasets have become one of the signature products of the FLUXNET global network. They are composed of contributions from individual site teams to regional networks, which are then compiled into uniform data products - now used in a wide variety of research efforts: from plant-scale microbiology to global-scale climate change. The FLUXNET Marconi Dataset in 2000 was the first in the series, followed by the FLUXNET LaThuile Dataset in 2007, with significant additions of data products and coverage, solidifying the adoption of the datasets as a research tool. The FLUXNET2015 Dataset brings another round of substantial improvements, including extended quality control processes and checks, use of downscaled reanalysis data for filling long gaps in micrometeorological variables, multiple methods for USTAR threshold estimation and flux partitioning, and uncertainty estimates - all of which are accompanied by auxiliary flags. This "batteries included" approach provides a wealth of information for anyone who wants to explore the data (and the processing methods) in detail. This inevitably leads to a large number of data variables. Although dealing with all these variables might seem overwhelming at first, especially to someone looking at eddy covariance data for the first time, there is method to our madness. In this work we describe the data products and variables that are part of the FLUXNET2015 Dataset, and the rationale behind the organization of the dataset, covering the simplified version (labeled SUBSET), the complete version (labeled FULLSET), and the auxiliary products in the dataset.
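The idea of gap-filled variables carrying companion quality flags can be sketched with simple linear interpolation; note that FLUXNET2015 fills long meteorological gaps from downscaled reanalysis, not interpolation, so this is only an illustration of the flagging concept.

```python
def fill_gaps(series):
    """Linearly interpolate interior None gaps in a regularly sampled series,
    returning (filled values, flags) where flag 1 marks a filled point.
    Assumes gaps are interior (bounded by observed values on both sides)."""
    filled, flags = list(series), []
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1
            left, right = filled[i - 1], filled[j]
            for k in range(i, j):
                frac = (k - i + 1) / (j - i + 1)
                filled[k] = left + frac * (right - left)
            flags.extend([1] * (j - i))
            i = j
        else:
            flags.append(0)
            i += 1
    return filled, flags

print(fill_gaps([10.0, None, 14.0]))  # ([10.0, 12.0, 14.0], [0, 1, 0])
```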
Ye, Huimin; Chen, Elizabeth S.
2011-01-01
In order to support the increasing need to share electronic health data for research purposes, various methods have been proposed for privacy preservation, including k-anonymity. Many k-anonymity models provide the same level of anonymization regardless of practical need, which may decrease the utility of the dataset for a particular research study. In this study, we explore extensions to the k-anonymity algorithm that aim to satisfy the heterogeneous needs of different researchers while preserving both privacy and the utility of the dataset. The proposed algorithm, Attribute Utility Motivated k-anonymization (AUM), involves analyzing the characteristics of attributes and utilizing them to minimize information loss during the anonymization process. In preliminary comparisons with two existing algorithms, Mondrian and Incognito, results indicate that AUM may preserve more information from the original datasets, thus providing higher quality results with lower distortion. PMID:22195223
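The k-anonymity property itself is easy to state in code: every combination of quasi-identifier values must occur in at least k records. The check below is a generic sketch of that definition (with invented example rows), not the AUM algorithm.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class over the quasi-identifiers
    contains at least k records."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in classes.values())

rows = [
    {"age": "30-39", "zip": "021**", "dx": "flu"},
    {"age": "30-39", "zip": "021**", "dx": "asthma"},
    {"age": "40-49", "zip": "021**", "dx": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], 2))  # False: the 40-49 class has only one record
```

Algorithms such as Mondrian, Incognito and AUM differ in *how* they generalize values (age ranges, zip masking) to make this check pass while losing as little information as possible.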
A global experimental dataset for assessing grain legume production
Cernay, Charles; Pelzer, Elise; Makowski, David
2016-01-01
Grain legume crops are a significant component of the human diet and animal feed and have an important role in the environment, but the global diversity of agricultural legume species is currently underexploited. Experimental assessments of grain legume performances are required, to identify potential species with high yields. Here, we introduce a dataset including results of field experiments published in 173 articles. The selected experiments were carried out over five continents on 39 grain legume species. The dataset includes measurements of grain yield, aerial biomass, crop nitrogen content, residual soil nitrogen content and water use. When available, yields for cereals and oilseeds grown after grain legumes in the crop sequence are also included. The dataset is arranged into a relational database with nine structured tables and 198 standardized attributes. Tillage, fertilization, pest and irrigation management are systematically recorded for each of the 8,581 crop × field site × growing season × treatment combinations. The dataset is freely reusable and easy to update. We anticipate that it will provide valuable information for assessing grain legume production worldwide. PMID:27676125
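A relational layout of this kind is naturally queried with SQL; the two-table toy below mirrors the idea with `sqlite3`, but the table names, columns and yield values are invented, not the published nine-table schema.

```python
import sqlite3

# A toy two-table layout; real schema has nine tables and 198 attributes.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE trials  (species_id INTEGER, site TEXT, grain_yield_t_ha REAL);
""")
con.executemany("INSERT INTO species VALUES (?, ?)",
                [(1, "Cicer arietinum"), (2, "Lens culinaris")])
con.executemany("INSERT INTO trials VALUES (?, ?, ?)",
                [(1, "A", 1.5), (1, "B", 2.5), (2, "A", 1.25)])

# Mean grain yield per species across field sites
rows = con.execute("""
    SELECT s.name, AVG(t.grain_yield_t_ha)
    FROM trials t JOIN species s ON s.id = t.species_id
    GROUP BY s.name ORDER BY s.name
""").fetchall()
print(rows)  # [('Cicer arietinum', 2.0), ('Lens culinaris', 1.25)]
```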
Enhancing studies of the connectome in autism using the autism brain imaging data exchange II
Di Martino, Adriana; O’Connor, David; Chen, Bosi; Alaerts, Kaat; Anderson, Jeffrey S.; Assaf, Michal; Balsters, Joshua H.; Baxter, Leslie; Beggiato, Anita; Bernaerts, Sylvie; Blanken, Laura M. E.; Bookheimer, Susan Y.; Braden, B. Blair; Byrge, Lisa; Castellanos, F. Xavier; Dapretto, Mirella; Delorme, Richard; Fair, Damien A.; Fishman, Inna; Fitzgerald, Jacqueline; Gallagher, Louise; Keehn, R. Joanne Jao; Kennedy, Daniel P.; Lainhart, Janet E.; Luna, Beatriz; Mostofsky, Stewart H.; Müller, Ralph-Axel; Nebel, Mary Beth; Nigg, Joel T.; O’Hearn, Kirsten; Solomon, Marjorie; Toro, Roberto; Vaidya, Chandan J.; Wenderoth, Nicole; White, Tonya; Craddock, R. Cameron; Lord, Catherine; Leventhal, Bennett; Milham, Michael P.
2017-01-01
The second iteration of the Autism Brain Imaging Data Exchange (ABIDE II) aims to enhance the scope of brain connectomics research in Autism Spectrum Disorder (ASD). Consistent with the initial ABIDE effort (ABIDE I), which released 1112 datasets in 2012, this new multisite open-data resource is an aggregate of resting state functional magnetic resonance imaging (MRI) and corresponding structural MRI and phenotypic datasets. ABIDE II includes datasets from an additional 487 individuals with ASD and 557 controls previously collected across 16 international institutions. The combination of ABIDE I and ABIDE II provides investigators with 2156 unique cross-sectional datasets allowing selection of samples for discovery and/or replication. This sample size can also facilitate the identification of neurobiological subgroups, as well as preliminary examinations of sex differences in ASD. Additionally, ABIDE II includes a range of psychiatric variables to inform our understanding of the neural correlates of co-occurring psychopathology; 284 diffusion imaging datasets are also included. It is anticipated that these enhancements will contribute to unraveling key sources of ASD heterogeneity. PMID:28291247
Unclassified Data Export from the Containment Database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gaylord, Jessie M.; Myers, K. B. L.
2016-08-03
The attached dataset is an unclassified subset of data copied from the CDB for release to a wider audience. It includes only the clearly unclassified columns from each of the tables in the CDB, and only the rows related to announced tests. Also included is a glossary that lists each of the tables and columns included in the export with its definition, historical labels, unit of measure, and reference information. Our objective is to make this unclassified dataset available for querying on an unclassified server through a repeatable process.
Soltaninejad, Mohammadreza; Yang, Guang; Lambrou, Tryphon; Allinson, Nigel; Jones, Timothy L; Barrick, Thomas R; Howe, Franklyn A; Ye, Xujiong
2018-04-01
Accurate segmentation of brain tumour in magnetic resonance images (MRI) is a difficult task due to various tumour types. Using information and features from multimodal MRI including structural MRI and isotropic (p) and anisotropic (q) components derived from the diffusion tensor imaging (DTI) may result in a more accurate analysis of brain images. We propose a novel 3D supervoxel based learning method for segmentation of tumour in multimodal MRI brain images (conventional MRI and DTI). Supervoxels are generated using the information across the multimodal MRI dataset. For each supervoxel, a variety of features including histograms of texton descriptor, calculated using a set of Gabor filters with different sizes and orientations, and first order intensity statistical features are extracted. Those features are fed into a random forests (RF) classifier to classify each supervoxel into tumour core, oedema or healthy brain tissue. The method is evaluated on two datasets: 1) Our clinical dataset: 11 multimodal images of patients and 2) BRATS 2013 clinical dataset: 30 multimodal images. For our clinical dataset, the average detection sensitivity of tumour (including tumour core and oedema) using multimodal MRI is 86% with balanced error rate (BER) 7%; while the Dice score for automatic tumour segmentation against ground truth is 0.84. The corresponding results of the BRATS 2013 dataset are 96%, 2% and 0.89, respectively. The method demonstrates promising results in the segmentation of brain tumour. Adding features from multimodal MRI images can largely increase the segmentation accuracy. The method provides a close match to expert delineation across all tumour grades, leading to a faster and more reproducible method of brain tumour detection and delineation to aid patient management. Copyright © 2018 Elsevier B.V. All rights reserved.
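Among the features the paper describes, the first-order intensity statistics are the simplest to sketch; the function below computes a small, stdlib-only slice of such a feature vector for one supervoxel's intensities (the full method also uses Gabor-filter texton histograms and a random forest, which are not shown).

```python
from statistics import mean, pstdev

def first_order_features(intensities):
    """First-order intensity statistics for one supervoxel's voxels -
    a small subset of the feature set fed to a classifier."""
    return {
        "mean": mean(intensities),
        "std": pstdev(intensities),  # population standard deviation
        "min": min(intensities),
        "max": max(intensities),
    }

print(first_order_features([10, 12, 14, 16, 18]))
```

In the full pipeline, one such vector per supervoxel (augmented with texture features across MRI modalities) is what the random forest classifies as tumour core, oedema or healthy tissue.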
The LANDFIRE Refresh strategy: updating the national dataset
Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley
2013-01-01
The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote sensing based disturbance detection methods, field collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.
Process mining in oncology using the MIMIC-III dataset
NASA Astrophysics Data System (ADS)
Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen
2018-03-01
Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option and this paper describes the potential to use MIMIC-III for process mining in oncology. MIMIC-III is a large open access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
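A basic building block of process discovery over event tables like MIMIC-III's is the directly-follows relation; the sketch below computes it from a toy event log with invented case IDs and activities (real pipelines use dedicated tooling and timestamp ordering).

```python
from collections import Counter, defaultdict

# Toy event log: (case id, activity), assumed already ordered by time within case.
log = [
    ("p1", "admission"), ("p1", "chemo"), ("p1", "discharge"),
    ("p2", "admission"), ("p2", "surgery"), ("p2", "chemo"), ("p2", "discharge"),
]

def directly_follows(events):
    """Count how often activity b directly follows activity a within a case -
    the relation underlying many process-discovery algorithms."""
    traces = defaultdict(list)
    for case, activity in events:
        traces[case].append(activity)
    return Counter(
        (a, b)
        for trace in traces.values()
        for a, b in zip(trace, trace[1:])
    )

df = directly_follows(log)
print(df[("admission", "chemo")], df[("chemo", "discharge")])  # 1 2
```

From these pair counts a discovery algorithm draws the process graph and weights its edges, which is how care pathways emerge from raw event tables.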
A longitudinal dataset of five years of public activity in the Scratch online community.
Hill, Benjamin Mako; Monroy-Hernández, Andrés
2017-01-31
Scratch is a programming environment and an online community where young people can create, share, learn, and communicate. In collaboration with the Scratch Team at MIT, we created a longitudinal dataset of public activity in the Scratch online community during its first five years (2007-2012). The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more. To help researchers understand this dataset, and to establish the validity of the data, we also include the source code of every version of the software that operated the website, as well as the software used to generate this dataset. We believe this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.
A polymer dataset for accelerated property prediction and design
Huan, Tran Doan; Mannodi-Kanakkithodi, Arun; Kim, Chiho; ...
2016-03-01
Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a dataset of a sufficiently large number of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high dielectric constant polymers, it is initially designed to include the optimized structures, atomization energies, band gaps, and dielectric constants. Going forward, it will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.
Carlson, Colin J.; Bond, Alexander L.
2018-01-01
Abstract Background Despite much present-day attention on recently extinct North American bird species, little contemporary research has focused on the Carolina parakeet (Conuropsis carolinensis). The last captive Carolina parakeet died 100 years ago this year, and the species was officially declared extinct in 1920, but small, isolated populations likely persisted until at least the 1930s, and perhaps longer. How this once wide-ranging and plentiful species went extinct remains a mystery. Here, we present a georeferenced dataset of Carolina parakeet sightings spanning nearly 400 years, combining both written observations and specimen data. New information Because we include both observations and specimen data, the Carolina parakeet occurrence dataset presented here is the most comprehensive and rigorous dataset on this species available. The dataset includes 861 sightings from 1564 to 1944. Each datapoint includes geographic coordinates, a measurement of uncertainty, detailed information about each sighting, and an assessment of the sighting's validity. Given that this species is so poorly understood, we make these data freely available to facilitate more research on this colorful and charismatic species.
Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.
Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi
2015-09-01
Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. A total of 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world.
It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the conduct of more complex research. Copyright © 2015 Elsevier Ltd and National Safety Council. All rights reserved.
Wootten, Adrienne; Smith, Kara; Boyles, Ryan; Terando, Adam; Stefanova, Lydia; Misra, Vasu; Smith, Tom; Blodgett, David L.; Semazzi, Fredrick
2014-01-01
Climate change is likely to have many effects on natural ecosystems in the Southeast U.S. The National Climate Assessment Southeast Technical Report (SETR) indicates that natural ecosystems in the Southeast are likely to be affected by warming temperatures, ocean acidification, sea-level rise, and changes in rainfall and evapotranspiration. To better assess how these climate changes could affect multiple sectors, including ecosystems, climatologists have created several downscaled climate projections (or downscaled datasets) that contain information from the global climate models (GCMs) translated to regional or local scales. The process of creating these downscaled datasets, known as downscaling, can be carried out using a broad range of statistical or numerical modeling techniques. The rapid proliferation of techniques that can be used for downscaling and the number of downscaled datasets produced in recent years present many challenges for scientists and decision-makers in assessing the impact or vulnerability of a given species or ecosystem to climate change. Given the number of available downscaled datasets, how do these model outputs compare to each other? Which variables are available, and are certain downscaled datasets more appropriate for assessing the vulnerability of a particular species? Given the desire to use these datasets for impact and vulnerability assessments, and the lack of comparison between them, the goal of this report is to synthesize the information available in these downscaled datasets and provide guidance to scientists and natural resource managers with specific interests in ecological modeling and conservation planning related to climate change in the Southeast U.S.
This report enables the Southeast Climate Science Center (SECSC) to address an important strategic goal of providing scientific information and guidance that will enable resource managers and other participants in Landscape Conservation Cooperatives to make science-based climate change adaptation decisions.
Dataset of Phenology of Mediterranean high-mountain meadows flora (Sierra Nevada, Spain).
Pérez-Luque, Antonio Jesús; Sánchez-Rojas, Cristina Patricia; Zamora, Regino; Pérez-Pérez, Ramón; Bonet, Francisco Javier
2015-01-01
Sierra Nevada mountain range (southern Spain) hosts a high number of endemic plant species, being one of the most important biodiversity hotspots in the Mediterranean basin. The high-mountain meadow ecosystems (borreguiles) harbour a large number of endemic and threatened plant species. In this data paper, we describe a dataset of the flora inhabiting this threatened ecosystem in this Mediterranean mountain. The dataset includes occurrence data for flora collected in those ecosystems in two periods: 1988-1990 and 2009-2013. A total of 11002 records of occurrences belonging to 19 orders, 28 families, and 52 genera were collected. A total of 73 taxa were recorded, 29 of which are threatened. We also included data on cover-abundance and phenology attributes for the records. The dataset is included in the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area.
Murphy, David J; Rubinson, Lewis; Blum, James; Isakov, Alexander; Bhagwanjee, Satish; Cairns, Charles B; Cobb, J Perren; Sevransky, Jonathan E
2015-11-01
In developed countries, public health systems have become adept at rapidly identifying the etiology and impact of public health emergencies. However, within the time course of clinical responses, shortfalls in readily analyzable patient-level data limit capabilities to understand clinical course, predict outcomes, ensure resource availability, and evaluate the effectiveness of diagnostic and therapeutic strategies for seriously ill and injured patients. To be useful in the timeline of a public health emergency, multi-institutional clinical investigation systems must be in place to rapidly collect, analyze, and disseminate detailed clinical information regarding patients across prehospital, emergency department, and acute care hospital settings, including ICUs. As an initial step to near real-time clinical learning during public health emergencies, we sought to develop an "all-hazards" core dataset to characterize serious illness and injuries and the resource requirements for acute medical response across the care continuum. A multidisciplinary panel of clinicians, public health professionals, and researchers with expertise in public health emergencies developed the dataset through a group consensus process. The consensus process included regularly scheduled conference calls, electronic communications, and an in-person meeting to generate candidate variables. Candidate variables were then reviewed by the group to meet the competing criteria of utility and feasibility, resulting in the core dataset. The 40-member panel generated 215 candidate variables for potential dataset inclusion. The final dataset includes 140 patient-level variables in the domains of demographics and anthropometrics (7), prehospital (11), emergency department (13), diagnosis (8), severity of illness (54), medications and interventions (38), and outcomes (9).
The resulting all-hazard core dataset for seriously ill and injured persons provides a foundation to facilitate rapid collection, analyses, and dissemination of information necessary for clinicians, public health officials, and policymakers to optimize public health emergency response. Further work is needed to validate the effectiveness of the dataset in a variety of emergency settings.
Excel Spreadsheet Tools for Analyzing Groundwater Level Records and Displaying Information in ArcMap
Tillman, Fred D.
2009-01-01
When beginning hydrologic investigations, a first action is often to gather existing sources of well information, compile this information into a single dataset, and visualize this information in a geographic information system (GIS) environment. This report presents tools (macros) developed using Visual Basic for Applications (VBA) for Microsoft Excel 2007 to assist in these tasks. One tool combines multiple datasets into a single worksheet and formats the resulting data for use by the other tools. A second tool produces summary information about the dataset, such as a list of unique site identification numbers, the number of water-level observations for each, and a table of the number of sites with a listed number of water-level observations. A third tool creates subsets of the original dataset based on user-specified options and produces a worksheet with water-level information for each well in the subset, including the average and standard deviation of water-level observations and maximum decline and rise in water levels between any two observations, among other information. This water-level information worksheet can be imported directly into ESRI ArcMap as an 'XY Data' file, and each of the fields of summary well information can be used for custom display. A separate set of VBA tools distributed in an additional Excel workbook creates hydrograph charts of each of the wells in the data subset produced by the aforementioned tools and produces portable document format (PDF) versions of the hydrograph charts. These PDF hydrographs can be hyperlinked to well locations in ArcMap or other GIS applications.
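As a rough illustration of what the report's per-well summary tool computes, here is a pandas sketch. The column names are hypothetical (not those of the actual Excel workbook), and the maximum decline/rise between any two observations is simplified to the max-minus-min range:

```python
import pandas as pd

# Hypothetical water-level records for two wells; column names are
# illustrative stand-ins for the report's worksheet fields.
obs = pd.DataFrame({
    "site_id": ["W1", "W1", "W1", "W2", "W2"],
    "water_level": [10.0, 12.5, 11.0, 7.0, 6.5],
})

# Per-well summary comparable to the third tool: observation count, mean,
# standard deviation, and overall water-level range (a simplification of the
# max decline/rise between any two observations).
summary = obs.groupby("site_id")["water_level"].agg(
    n_obs="count",
    mean="mean",
    std="std",
    max_change=lambda s: s.max() - s.min(),
)
```

A table like `summary`, joined with well coordinates, is the kind of flat worksheet that can be imported into ArcMap as an 'XY Data' file.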
Struwe, Weston B; Agravat, Sanjay; Aoki-Kinoshita, Kiyoko F; Campbell, Matthew P; Costello, Catherine E; Dell, Anne; Feizi, Ten; Haslam, Stuart M; Karlsson, Niclas G; Khoo, Kay-Hooi; Kolarich, Daniel; Liu, Yan; McBride, Ryan; Novotny, Milos V; Packer, Nicolle H; Paulson, James C; Rapp, Erdmann; Ranzinger, Rene; Rudd, Pauline M; Smith, David F; Tiemeyer, Michael; Wells, Lance; York, William S; Zaia, Joseph; Kettner, Carsten
2016-09-01
The minimum information required for a glycomics experiment (MIRAGE) project was established in 2011 to provide guidelines to aid in data reporting from all types of experiments in glycomics research including mass spectrometry (MS), liquid chromatography, glycan arrays, data handling and sample preparation. MIRAGE is a concerted effort of the wider glycomics community that considers the adaptation of reporting guidelines as an important step towards critical evaluation and dissemination of datasets as well as broadening of experimental techniques worldwide. The MIRAGE Commission published reporting guidelines for MS data and here we outline guidelines for sample preparation. The sample preparation guidelines include all aspects of sample generation, purification and modification from biological and/or synthetic carbohydrate material. The application of MIRAGE sample preparation guidelines will lead to improved recording of experimental protocols and reporting of understandable and reproducible glycomics datasets. © The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Columbia River Coordinated Information System (CIS); Data Catalog, 1992 Technical Report.
DOE Office of Scientific and Technical Information (OSTI.GOV)
O'Connor, Dick; Allen, Stan; Reece, Doug
1993-05-01
The Columbia River Coordinated Information System (CIS) Project started in 1989 to address regional data sharing. Coordinated exchange and dissemination of any data must begin with dissemination of information about those data, such as: what is available; where the data are stored; what form they exist in; who to contact for further information or access to these data. In Phase II of this Project (1991), a Data Catalog describing the contents of regional datasets and less formal data collections useful for system monitoring and evaluation projects was built to improve awareness of their existence. Formal datasets are described in a `Dataset Directory`, while collections of data are linked to those that collect such information in the `Data Item Directory`. The Data Catalog will serve regional workers as a useful reference which centralizes the institutional knowledge of many data contacts into a single source. Recommendations for improvement of the Catalog during Phase III of this Project include addressing gaps in coverage, establishing an annual maintenance schedule, and loading the contents into a PC-based electronic database for easier searching and cross-referencing.
Information Retrieval in Biomedical Research: From Articles to Datasets
ERIC Educational Resources Information Center
Wei, Wei
2017-01-01
Information retrieval techniques have been applied to biomedical research for a variety of purposes, such as textual document retrieval and molecular data retrieval. As biomedical research evolves over time, information retrieval is also constantly facing new challenges, including the growing number of available data, the emerging new data types,…
A multimodal dataset for authoring and editing multimedia content: The MAMEM project.
Nikolopoulos, Spiros; Petrantonakis, Panagiotis C; Georgiadis, Kostas; Kalaganis, Fotis; Liaros, Georgios; Lazarou, Ioulietta; Adam, Katerina; Papazoglou-Chalikias, Anastasios; Chatzilari, Elisavet; Oikonomou, Vangelis P; Kumar, Chandan; Menges, Raphael; Staab, Steffen; Müller, Daniel; Sengupta, Korok; Bostantjopoulou, Sevasti; Katsarou, Zoe; Zeilig, Gabi; Plotnik, Meir; Gotlieb, Amihai; Kizoni, Racheli; Fountoukidou, Sofia; Ham, Jaap; Athanasiou, Dimitrios; Mariakaki, Agnes; Comanducci, Dario; Sabatini, Edoardo; Nistico, Walter; Plank, Markus; Kompatsiaris, Ioannis
2017-12-01
We present a dataset that combines multimodal biosignals and eye tracking information gathered under a human-computer interaction framework. The dataset was developed within the MAMEM project, which aims to endow people with motor disabilities with the ability to edit and author multimedia content through mental commands and gaze activity. The dataset includes EEG, eye-tracking, and physiological (GSR and heart rate) signals collected from 34 individuals (18 able-bodied and 16 motor-impaired). Data were collected during interaction with a specifically designed interface for web browsing and multimedia content manipulation, and during imaginary movement tasks. The presented dataset will contribute towards the development and evaluation of modern human-computer interaction systems that would foster the integration of people with severe motor impairments back into society.
NASA Astrophysics Data System (ADS)
Versteeg, R. J.; Johnson, T.; Henrie, A.; Johnson, D.
2013-12-01
The Hanford 300 Area, located adjacent to the Columbia River in south-central Washington, USA, is the site of former research and uranium fuel rod fabrication facilities. Waste disposal practices at the site included discharging between 33 and 59 metric tons of uranium over a 40-year period into shallow infiltration galleries, resulting in persistent uranium contamination within the vadose and saturated zones. Uranium transport from the vadose zone to the saturated zone is intimately linked with water table fluctuations and river water driven by upstream dam operations. Different remedial efforts have occurred at the site to address uranium contamination. Numerous investigations are occurring at the site, both to investigate remedial performance and to increase the understanding of uranium dynamics. Several of these studies include acquisition of large hydrological and time-lapse electrical geophysical data sets. Such datasets contain large amounts of information on hydrological processes, but there are substantial challenges in how to manage their data volumes, how to process them, and how to provide users with the ability to effectively access and synthesize the hydrological information contained in raw and processed data. These challenges motivated the development of a cloud-based cyberinfrastructure for dealing with large electrical hydrogeophysical datasets. This cyberinfrastructure is modular and extensible and includes data management, data processing, visualization, and result mining capabilities. Specifically, it provides for data transmission to a central server, data parsing into a relational database, and processing of the data using a PNNL-developed parallel inversion code on either dedicated or commodity compute clusters.
Access to results is done through a browser with interactive tools allowing for generation of on-demand visualization of the inversion results as well as interactive data mining and statistical calculation. This infrastructure was used for the acquisition and processing of an electrical geophysical time-lapse survey collected over a highly instrumented field site in the Hanford 300 Area. Over a 13-month period between November 2011 and December 2012, 1,823 time-lapse datasets were collected (roughly 5 datasets a day, for a total of 23 million individual measurements) on three parallel resistivity lines of 30 m each with 0.5-meter electrode spacing. In addition, hydrological and environmental data were collected from dedicated and general-purpose sensors. This dataset contains rich information on near-surface processes on a range of spatial and temporal scales (from hourly to seasonal). We will show how this cyberinfrastructure was used to manage and process this dataset and how it can be used to access, mine, and visualize the resulting data and information.
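The parse-into-a-relational-store step described in this abstract can be sketched at toy scale with SQLite; the schema, column names, and values below are assumptions for illustration, not the project's actual database design:

```python
import sqlite3

# Minimal stand-in for parsing time-lapse survey records into a relational
# store; a production system would use a server database and a richer schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE measurement (
    survey_id   INTEGER,
    electrode_a INTEGER,
    electrode_b INTEGER,
    resistance  REAL,
    acquired_at TEXT)""")

rows = [
    (1, 1, 2, 104.2, "2011-11-01T00:10:00"),
    (1, 2, 3, 98.7,  "2011-11-01T00:10:05"),
    (2, 1, 2, 103.9, "2011-11-01T05:10:00"),
]
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?, ?)", rows)

# The kind of aggregate query a browser front end might issue:
# per-survey measurement counts.
counts = dict(conn.execute(
    "SELECT survey_id, COUNT(*) FROM measurement GROUP BY survey_id"))
```

Once the raw records are relational, on-demand visualization and statistics reduce to queries like the one above.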
Gary, Robin H.; Wilson, Zachary D.; Archuleta, Christy-Ann M.; Thompson, Florence E.; Vrabel, Joseph
2009-01-01
During 2006-09, the U.S. Geological Survey, in cooperation with the National Atlas of the United States, produced a 1:1,000,000-scale (1:1M) hydrography dataset comprising streams and waterbodies for the entire United States, including Puerto Rico and the U.S. Virgin Islands, for inclusion in the recompiled National Atlas. This report documents the methods used to select, simplify, and refine features in the 1:100,000-scale (1:100K) (1:63,360-scale in Alaska) National Hydrography Dataset to create the national 1:1M hydrography dataset. Custom tools and semi-automated processes were created to facilitate generalization of the 1:100K National Hydrography Dataset (1:63,360-scale in Alaska) to 1:1M on the basis of existing small-scale hydrography datasets. The first step in creating the new 1:1M dataset was to address feature selection and optimal data density in the streams network. Several existing methods were evaluated. The production method that was established for selecting features for inclusion in the 1:1M dataset uses a combination of the existing attributes and network in the National Hydrography Dataset and several of the concepts from the methods evaluated. The process for creating the 1:1M waterbodies dataset required a similar approach to that used for the streams dataset. Geometric simplification of features was the next step. Stream reaches and waterbodies indicated in the feature selection process were exported as new feature classes and then simplified using a geographic information system tool. The final step was refinement of the 1:1M streams and waterbodies. Refinement was done through the use of additional geographic information system tools.
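The report's geometric simplification step uses a GIS tool; as a hedged illustration of the kind of algorithm such tools commonly apply (not the USGS implementation), here is a minimal Ramer-Douglas-Peucker line simplification:

```python
def simplify(points, tol):
    """Ramer-Douglas-Peucker simplification: keep an interior point only if it
    deviates from the chord between the endpoints by more than tol.
    Illustrative stand-in for a GIS simplification tool."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0
    # Perpendicular distance of each interior point from the chord.
    dists = [abs(dy * (x - x1) - dx * (y - y1)) / norm
             for x, y in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[i - 1] <= tol:
        return [points[0], points[-1]]
    left = simplify(points[: i + 1], tol)
    return left[:-1] + simplify(points[i:], tol)

# A nearly straight stream reach collapses to its endpoints at a coarse
# tolerance, reducing data density for the smaller-scale map.
line = [(0, 0), (1, 0.05), (2, -0.04), (3, 0.02), (4, 0)]
coarse = simplify(line, tol=0.1)
```

Production generalization adds topology preservation (shared reach endpoints, waterbody boundaries), which this sketch ignores.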
USGS Mineral Resources Program; national maps and datasets for research and land planning
Nicholson, S.W.; Stoeser, D.B.; Ludington, S.D.; Wilson, Frederic H.
2001-01-01
The U.S. Geological Survey, the Nation’s leader in producing and maintaining earth science data, serves as an advisor to Congress, the Department of the Interior, and many other Federal and State agencies. Nationwide datasets that are easily available and of high quality are critical for addressing a wide range of land-planning, resource, and environmental issues. Four types of digital databases (geological, geophysical, geochemical, and mineral occurrence) are being compiled and upgraded by the Mineral Resources Program on regional and national scales to meet these needs. Where existing data are incomplete, new data are being collected to ensure national coverage. Maps and analyses produced from these databases provide basic information essential for mineral resource assessments and environmental studies, as well as fundamental information for regional and national land-use studies. Maps and analyses produced from the databases are instrumental to ongoing basic research, such as the identification of mineral deposit origins, determination of regional background values of chemical elements with known environmental impact, and study of the relationships of toxic elements and mining practices to human health. As datasets are completed or revised, the information is made available through a variety of media, including the Internet. Much of the available information is the result of cooperative activities with State and other Federal agencies. The upgraded Mineral Resources Program datasets make geologic, geophysical, geochemical, and mineral occurrence information at the state, regional, and national scales available to members of Congress, State and Federal government agencies, researchers in academia, and the general public. The status of the Mineral Resources Program datasets is outlined below.
Grantz, Erin; Haggard, Brian; Scott, J Thad
2018-06-12
We calculated four median datasets for three water-quality parameters (chlorophyll a, Chl a; total phosphorus, TP; and transparency) using multiple approaches to handling censored observations, including substituting fractions of the quantification limit (QL; dataset 1 = 1QL, dataset 2 = 0.5QL) and statistical methods for censored datasets (datasets 3-4), for approximately 100 Texas, USA reservoirs. Trend analyses of differences between dataset 1 and 3 medians indicated that percent difference increased linearly above thresholds in percent censored data (%Cen). This relationship was extrapolated to estimate medians for site-parameter combinations with %Cen > 80%, which were combined with dataset 3 as dataset 4. Changepoint analysis of Chl a- and transparency-TP relationships indicated threshold differences up to 50% between datasets. Recursive analysis identified secondary thresholds in dataset 4. Threshold differences show that information introduced via substitution, or missing due to limitations of statistical methods, biased values, underestimated error, and inflated the strength of TP thresholds identified in datasets 1-3. Analysis of covariance identified differences in linear regression models relating transparency to TP between datasets 1, 2, and the more statistically robust datasets 3-4. Study findings identify high-risk scenarios for biased analytical outcomes when using substitution. These include a high probability of median overestimation when %Cen > 50-60% for a single QL, or when %Cen is as low as 16% for multiple QLs. Changepoint analysis was uniquely vulnerable to substitution effects when using medians from sites with %Cen > 50%. Linear regression analysis was less sensitive to substitution and missing-data effects, but differences in model parameters for transparency cannot be discounted and could be magnified by log-transformation of the variables.
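The two substitution approaches (dataset 1 = 1QL, dataset 2 = 0.5QL) can be sketched as below. The observations are invented, with 3 of 5 censored to echo the paper's point that heavily censored records make the resulting median strongly dependent on the substituted fraction:

```python
import statistics

# Hypothetical site record: ("<", QL) marks an observation censored below the
# quantification limit; (None, value) is a quantified observation.
QL = 4.0
observations = [("<", QL), ("<", QL), ("<", QL), (None, 6.0), (None, 9.0)]

def substituted_median(obs, ql_fraction):
    """Median after replacing censored observations with ql_fraction * QL,
    as in the substitution datasets described in the abstract."""
    values = [ql_fraction * v if flag == "<" else v for flag, v in obs]
    return statistics.median(values)

median_1ql = substituted_median(observations, 1.0)    # dataset 1 approach
median_05ql = substituted_median(observations, 0.5)   # dataset 2 approach
```

With 60% of the record censored, the two substitution rules yield medians that differ by a factor of two, which is exactly the kind of bias the statistical methods of datasets 3-4 are meant to avoid.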
Systems and methods for displaying data in split dimension levels
Stolte, Chris; Hanrahan, Patrick
2015-07-28
Systems and methods for displaying data in split dimension levels are disclosed. In some implementations, a method includes: at a computer, obtaining a dimensional hierarchy associated with a dataset, wherein the dimensional hierarchy includes at least one dimension and a sub-dimension of the at least one dimension; and populating information representing data included in the dataset into a visual table having a first axis and a second axis, wherein the first axis corresponds to the at least one dimension and the second axis corresponds to the sub-dimension of the at least one dimension.
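A minimal sketch of the idea in the claim, with hypothetical dimension names: one axis of the visual table is keyed by the (dimension, sub-dimension) pair, i.e., the dimension level is split into nested levels:

```python
# Hypothetical dataset with a dimension ("region") and its sub-dimension
# ("state"); names and values are illustrative, not from the patent.
dataset = [
    {"region": "East", "state": "NY", "sales": 120},
    {"region": "East", "state": "MA", "sales": 80},
    {"region": "West", "state": "CA", "sales": 200},
]

# Populate a visual table whose first axis is keyed by the nested
# (dimension, sub-dimension) level; the measure fills the cells.
table = {}
for row in dataset:
    table[(row["region"], row["state"])] = row["sales"]
```

A rendering layer would then draw the outer keys ("East", "West") as the top level of the axis and the inner keys as the split sub-level beneath them.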
Modal Analysis Using the Singular Value Decomposition and Rational Fraction Polynomials
2017-04-06
The programs are designed for experimental datasets with multiple drive and response points and have proven effective even for systems with numerous closely-spaced modes.
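As a hedged illustration of the SVD half of the title (a sketch, not the report's programs): at a frequency line dominated by a single mode, the multi-drive/multi-response FRF matrix is near rank-1, and its singular values expose how many modes contribute:

```python
import numpy as np

# Synthetic rank-1 "FRF matrix": one mode shape over 4 response points,
# one participation vector over 3 drive points, plus tiny noise.
rng = np.random.default_rng(1)
shape = rng.standard_normal(4)      # mode shape (response side)
forcing = rng.standard_normal(3)    # modal participation (drive side)
frf = np.outer(shape, forcing) + 1e-6 * rng.standard_normal((4, 3))

# The singular-value spectrum reveals one dominant contribution.
singular_values = np.linalg.svd(frf, compute_uv=False)
n_dominant = int(np.sum(singular_values > 0.01 * singular_values[0]))
```

With closely-spaced modes the matrix gains additional significant singular values, which is what makes the SVD useful as a mode-count indicator before rational fraction polynomial fitting.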
Data Citation Concept for CMIP6
NASA Astrophysics Data System (ADS)
Stockhause, M.; Toussaint, F.; Lautenschlager, M.; Lawrence, B.
2015-12-01
There is a broad consensus among data centers and scientific publishers on Force 11's 'Joint Declaration of Data Citation Principles'. Putting these principles into operation, however, is not always straightforward. The focus for CMIP6 data citations lies on the citation of data created by others and used in an analysis underlying the article. For such source data, usually no article by the data creators is available ('stand-alone data publication'). The planned data citation granularities are model data (data collections containing all datasets provided for the project by a single model) and experiment data (data collections containing all datasets for a scientific experiment run by a single model). In case of large international projects or activities like CMIP, the data is commonly stored and disseminated by multiple repositories in a federated data infrastructure such as the Earth System Grid Federation (ESGF). The individual repositories are subject to different institutional and national policies. A Data Management Plan (DMP) will define a certain standard for the repositories, including data handling procedures. Another aspect of CMIP data, relevant for data citations, is its dynamic nature. For such large data collections, datasets are added, revised, and retracted for years before the data collection becomes stable enough for a data citation entity including all model or simulation data. Thus, a critical issue for ESGF is data consistency, requiring thorough dataset versioning to enable the identification of the data collection in the cited version. Currently, the ESGF is designed for accessing the latest dataset versions. Data citation introduces the necessity to support older and retracted dataset versions by storing metadata even beyond data availability (data unpublished in ESGF). Apart from ESGF, other infrastructure components exist for CMIP which provide information that has to be connected to the CMIP6 data, e.g. 
ES-DOC providing information on models and simulations and the IPCC Data Distribution Centre (DDC) storing a subset of data together with available metadata (ES-DOC) for the long-term reuse of the interdisciplinary community. Other connections exist to standard project vocabularies, to personal identifiers (e.g. ORCID), or to data products (including provenance information).
Mansouri, K; Grulke, C M; Richard, A M; Judson, R S; Williams, A J
2016-11-01
The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and the associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats and identifiers, as well as various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process: the performance of QSAR models built on only the highest-quality subset of the original dataset was compared with that of models built on the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets and is being made publicly available for further usage and integration by the scientific community.
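One concrete identifier check a curation workflow of this kind can apply, independent of KNIME, is the published CAS Registry Number check-digit rule: the final digit equals the positionally weighted sum of the preceding digits, modulo 10. A minimal Python version (this is a generic validity check, not part of the described workflow):

```python
def valid_casrn(casrn: str) -> bool:
    """Validate a CAS Registry Number via its check-digit rule: weight the
    digits before the check digit 1, 2, 3, ... from the right, sum, and
    compare the sum modulo 10 to the check digit."""
    parts = casrn.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    body, check = parts[0] + parts[1], int(parts[2])
    weighted = sum(int(d) * i for i, d in enumerate(reversed(body), start=1))
    return weighted % 10 == check

assert valid_casrn("7732-18-5")       # water
assert not valid_casrn("7732-18-4")   # corrupted check digit
```

Failing this check flags a transcription error in the identifier before any structure-level validation (hypervalency, stereochemistry) is attempted.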
NASA Astrophysics Data System (ADS)
Blower, Jon; Lawrence, Bryan; Kershaw, Philip; Nagni, Maurizio
2014-05-01
The research process can be thought of as an iterative activity, initiated based on prior domain knowledge as well as on a number of external inputs, and producing a range of outputs including datasets, studies, and peer-reviewed publications. These outputs may describe the problem under study, the methodology used, the results obtained, etc. In any new publication, the author may cite or comment on other papers or datasets in order to support their research hypothesis. However, as their work progresses, the researcher may draw from many other latent channels of information. These could include, for example, a private conversation following a lecture or during a social dinner, or an opinion expressed concerning some significant event such as an earthquake or a satellite failure. In addition, other public sources of grey literature are important, such as informal papers (e.g., arXiv deposits), reports, and studies. The climate science community is not an exception to this pattern; the CHARMe project, funded under the European FP7 framework, is developing an online system for collecting and sharing user feedback on climate datasets. This is to help users judge how suitable such climate data are for an intended application. The user feedback could be comments about assessments, citations, or provenance of the dataset, or other information such as descriptions of uncertainty or data quality. We define this as a distinct category of metadata called Commentary or C-metadata. We link C-metadata with target climate datasets using a Linked Data approach via the Open Annotation data model. In the context of Linked Data, C-metadata plays the role of a resource which, depending on its nature, may be accessed as simple text or as more structured content. The project is implementing a range of software tools to create, search, or visualize C-metadata, including a JavaScript plugin enabling this functionality to be integrated in situ with data provider portals. 
Since commentary metadata may originate from a range of sources, moderation of this information will become a crucial issue. If the project is successful, expert human moderation (analogous to peer review) will become impracticable as annotation numbers increase, and some combination of algorithmic and crowd-sourced evaluation of commentary metadata will be necessary. To that end, future work will need to extend the system under development with access control and input checking in order to deal with scale.
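The C-metadata linkage described above can be sketched in the W3C Web Annotation JSON-LD vocabulary, which grew out of the Open Annotation data model the project uses. Everything concrete below is an invented illustration, not an actual CHARMe record: the DOI uses the 10.5072 test prefix, and the comment text and field choices are assumptions.

```python
import json

# Hypothetical C-metadata record expressed in the Web Annotation JSON-LD
# vocabulary. The target DOI (10.5072 test prefix) and the comment body
# are invented for illustration.
c_metadata = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "commenting",
    "body": {
        "type": "TextualBody",
        "format": "text/plain",
        "value": "Known cold bias over sea ice before 1995; see validation report.",
    },
    "target": {
        "id": "https://doi.org/10.5072/example-climate-dataset",
        "type": "Dataset",
    },
}

serialized = json.dumps(c_metadata, indent=2)
print(serialized)
```

Because the annotation is itself a Linked Data resource, a portal plugin can fetch it by URI and render the body either as plain text or as structured content, exactly as the abstract describes.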
Simões, Nuno; Pech, Daniel
2018-01-01
Abstract Background Alacranes Reef was declared a National Marine Park in 1994. Since then, many efforts have been made to inventory its biodiversity. However, groups such as amphipods have been underestimated or not considered when benthic invertebrates were inventoried. Here we present a dataset that contributes to the knowledge of benthic amphipods (Crustacea, Peracarida) of the inner lagoon habitats of the Alacranes Reef National Park, the largest coral reef ecosystem in the Gulf of Mexico. The dataset contains information on records collected from 2009 to 2011. Data are available through the Global Biodiversity Information Facility (GBIF). New information A total of 110 amphipod species, comprising 93 nominal species and 17 species identified to genus level, belonging to 71 genera, 33 families and three suborders, is presented here. This information represents the first online dataset of amphipods from the Alacranes Reef National Park. The biological material is currently deposited in the crustacean collection of the regional unit of the National Autonomous University of Mexico located at Sisal, Yucatan, Mexico (UAS-Sisal). The biological material comprises 588 data records with a total abundance of 6,551 organisms. The species inventory represents, to date, the richest fauna of benthic amphipods registered from any discrete coral reef ecosystem in Mexico. PMID:29416428
Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.
Ritz, Cecilia; Edén, Patrik
2008-01-19
For 2-dye microarray platforms, some missing values may arise from unmeasurably low RNA expression in one channel only. Information on such "one-channel depletion" has so far not been included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias in the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates was reduced by up to 51%, depending on the dataset. By including more information in the imputation step, we estimate missing expression values more accurately.
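The baseline being criticized, KNN imputation, fills a missing spot with an average over the most similar rows. A minimal NumPy sketch of that neighbour-averaging step follows; the function name and the k=2 default are ours, and the paper's one-channel-depletion correction itself is not reproduced.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in each row with the column-wise mean of the k most
    similar rows (root-mean-square distance over mutually observed
    columns, so rows with different overlaps stay comparable)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        candidates = []
        for j, other in enumerate(X):
            if j == i:
                continue
            shared = ~np.isnan(row) & ~np.isnan(other)
            # A neighbour must overlap this row somewhere and must itself
            # be observed at the columns we are trying to fill.
            if shared.any() and not np.isnan(other[miss]).any():
                d = np.sqrt(np.mean((row[shared] - other[shared]) ** 2))
                candidates.append((d, j))
        candidates.sort()
        neighbours = [j for _, j in candidates[:k]]
        out[i, miss] = X[neighbours][:, miss].mean(axis=0)
    return out
```

With expression-matrix rows as spots, this is exactly the averaging whose systematic bias on one-channel-depleted spots the paper quantifies.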
Drilling informatics: data-driven challenges of scientific drilling
NASA Astrophysics Data System (ADS)
Yamada, Yasuhiro; Kyaw, Moe; Saito, Sanny
2017-04-01
The primary aim of scientific drilling is to precisely understand the dynamic nature of the Earth. This is why we investigate subsurface materials (rock and fluid, including microbial communities) existing under particular environmental conditions. This requires sample collection, analytical data production from the samples, and in-situ measurement in boreholes. Currently available data come from cores, cuttings, mud logging, geophysical logging, and exploration geophysics, but these datasets are difficult to integrate because of their differing types and scales. We are now producing more useful datasets to fill the gaps between the existing data, extracting more information from these datasets, and finally integrating that information. In particular, drilling parameters are very useful datasets for deriving geomechanical properties. We believe such an approach, 'drilling informatics', would be the most appropriate way to obtain a comprehensive and dynamic picture of our scientific targets, such as seismogenic fault zones and the Moho discontinuity. This presentation introduces our initiative and the current achievements of drilling informatics.
NASA Technical Reports Server (NTRS)
Liu, Z.; Acker, J.; Kempler, S.
2016-01-01
The NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) is one of twelve NASA Science Mission Directorate (SMD) data centers that provide Earth science data, information, and services to users around the world, including research and application scientists, students, citizen scientists, etc. The GES DISC is the home (archive) of remote sensing datasets for NASA Precipitation and Hydrology, Atmospheric Composition and Dynamics, etc. To facilitate Earth science data access, the GES DISC has been developing user-friendly data services for users at different levels in different countries. Among them, the Geospatial Interactive Online Visualization ANd aNalysis Infrastructure (Giovanni, https://giovanni.gsfc.nasa.gov) allows users to explore satellite-based datasets using sophisticated analyses and visualization without downloading data and software, which makes it particularly suitable for novices (such as students) using NASA datasets in STEM (science, technology, engineering and mathematics) activities. In this presentation, we will briefly introduce Giovanni along with examples of STEM activities.
A Semantically-enabled Community Health Portal for Cancer Prevention and Control
2011-06-01
original team identified relevant datasets including data from the National Health Interview Survey (NHIS) and the Health Information National Trends...
GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.
Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung
2015-07-02
A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and to generate a unified dataset with a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach that can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To demonstrate the significance of the unified dataset, we adopted a well-known rough-set-theory-based rule creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that it reduces the time effort of experts and knowledge engineers in creating unified datasets by, on average, 94.1%.
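The priority-based resolution of overlapping attributes can be sketched with pandas; `combine_first` here stands in for GUDM's own resolution logic, and every column name and value below is invented for illustration.

```python
import pandas as pd

# Toy stand-ins for two diverse sources about the same patients.
clinical = pd.DataFrame({"patient_id": [1, 2],
                         "glucose": [110.0, None],  # overlapping attribute
                         "age": [54, 61]})
sensor = pd.DataFrame({"patient_id": [1, 2],
                       "glucose": [115.0, 98.0],    # overlapping attribute
                       "steps": [4200, 7100]})

# Priority-based unification: values from the higher-priority source
# (clinical) win wherever both report the same attribute; the sensor
# data fills the remaining gaps and contributes its own columns.
unified = (clinical.set_index("patient_id")
           .combine_first(sensor.set_index("patient_id"))
           .reset_index())
print(unified)
```

The unified frame keeps the clinical glucose reading for patient 1, falls back to the sensor reading for patient 2, and carries the non-overlapping `age` and `steps` attributes side by side.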
HyRA: A Hybrid Recommendation Algorithm Focused on Smart POI. Ceutí as a Study Scenario.
Alvarado-Uribe, Joanna; Gómez-Oliva, Andrea; Barrera-Animas, Ari Yair; Molina, Germán; Gonzalez-Mendoza, Miguel; Parra-Meroño, María Concepción; Jara, Antonio J
2018-03-17
Nowadays, the Physical Web, together with the increased use of mobile devices, the Global Positioning System (GPS), and Social Networking Sites (SNS), has led users to share enriched information on the Web, such as their tourist experiences. Tourism is therefore an area that has been significantly improved by the contextual information these technologies provide. Accordingly, the main goals of this work are to propose and develop an algorithm for recommending Smart Points of Interaction (Smart POIs) to a specific user according to his/her preferences and the Smart POIs' context. Hence, a novel Hybrid Recommendation Algorithm (HyRA) is presented that incorporates an aggregation operator into the user-based Collaborative Filtering (CF) algorithm and includes the Smart POIs' categories and geographical information. For the experimental phase, two real-world datasets were collected and preprocessed, and a Smart POIs' categories dataset was built. As a result, a dataset composed of 16 Smart POIs, another comprising the explicit preferences of 200 respondents, and a third containing 13 Smart POIs' categories are provided. The experimental results show that the recommendations suggested by HyRA are promising.
Organic Carbon Transformation and Mercury Methylation in Tundra Soils from Barrow Alaska
Liang, L.; Wullschleger, Stan; Graham, David; Gu, B.; Yang, Ziming
2016-04-20
This dataset includes information on labile soil organic carbon transformation and mercury methylation in tundra soils from Barrow, Alaska. The soil cores were collected from a high-centered polygon trough at the Barrow Environmental Observatory (BEO) and were incubated under anaerobic laboratory conditions at both freezing and warming temperatures for up to 8 months. Soil organic carbon compounds, including reducing sugars, alcohols, and organic acids, were analyzed, and CH4 and CO2 emissions were quantified. Net production of methylmercury and the Fe(II)/Fe(total) ratio were also measured and are provided in this dataset.
Towards automatic lithological classification from remote sensing data using support vector machines
NASA Astrophysics Data System (ADS)
Yu, Le; Porwal, Alok; Holden, Eun-Jung; Dentith, Michael
2010-05-01
Remote sensing data can be used effectively as a means to build geological knowledge for poorly mapped terrains. Spectral remote sensing data from space- and air-borne sensors have been widely used for geological mapping, especially in areas of high outcrop density in arid regions. However, spectral remote sensing information by itself cannot be used efficiently for a comprehensive lithological classification of an area because (1) the diagnostic spectral response of a rock within an image pixel is conditioned by several factors, including atmospheric effects, the spectral and spatial resolution of the image, sub-pixel heterogeneity in the chemical and mineralogical composition of the rock, and the presence of soil and vegetation cover; and (2) it provides only surface information and is therefore highly sensitive to noise due to weathering, soil cover, and vegetation. Consequently, for efficient lithological classification, spectral remote sensing data need to be supplemented with other remote sensing datasets that provide geomorphological and subsurface geological information, such as a digital elevation model (DEM) and aeromagnetic data. Each of these datasets contains significant information about geology that, in conjunction, can potentially be used for automated lithological classification using supervised machine learning algorithms. In this study, a support vector machine (SVM), a kernel-based supervised learning method, was applied to automated lithological classification of a study area in northwestern India using remote sensing data, namely ASTER, DEM and aeromagnetic data. Several digital image processing techniques were used to produce derivative datasets containing enhanced information relevant to lithological discrimination.
A series of SVMs (trained using k-fold cross-validation with grid search) were tested using various combinations of input datasets selected from among 50 datasets, including the original 14 ASTER bands and 36 derivative datasets (14 principal component bands, 14 independent component bands, 3 band ratios, 3 DEM derivatives: slope, curvature and roughness, and 2 aeromagnetic derivatives: mean and variance of susceptibility) extracted from the ASTER, DEM and aeromagnetic data, in order to determine the optimal inputs providing the highest classification accuracy. A combination of ASTER-derived independent components, principal components and band ratios, DEM-derived slope, curvature and roughness, and aeromagnetic-derived mean and variance of magnetic susceptibility provided the highest classification accuracy of 93.4% on independent test samples. A comparison of the classification results of the SVM with those of maximum likelihood (84.9%) and minimum distance (38.4%) classifiers clearly shows that the SVM algorithm returns much higher classification accuracy. The SVM method can therefore be used to produce quick and reliable geological maps from scarce geological information, which is still the situation in many under-developed frontier regions of the world.
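The model-selection loop described (k-fold cross-validation with a grid search over candidate settings) can be sketched generically. To keep the example dependency-free, a simple k-NN classifier stands in for the SVM, so only the selection machinery, not the paper's classifier, is illustrated.

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(ytr[row]).argmax() for row in nearest])

def grid_search_cv(X, y, param_grid, folds=5, seed=0):
    """Return the parameter with the best mean k-fold CV accuracy."""
    rng = np.random.default_rng(seed)
    splits = np.array_split(rng.permutation(len(X)), folds)
    best_param, best_score = None, -1.0
    for k in param_grid:
        accs = []
        for f in range(folds):
            test = splits[f]
            train = np.concatenate([s for g, s in enumerate(splits) if g != f])
            pred = knn_predict(X[train], y[train], X[test], k)
            accs.append(float((pred == y[test]).mean()))
        score = float(np.mean(accs))
        if score > best_score:
            best_param, best_score = k, score
    return best_param, best_score
```

For an SVM as in the paper, the inner classifier would be swapped for a kernel SVM and the grid would range over its hyperparameters (e.g. the regularization and kernel parameters); the fold-and-grid structure is unchanged.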
Mendonça, André F; Percequillo, Alexandre R; de Camargo, Nicholas F; Ribeiro, Juliana F; Palma, Alexandre R T; Oliveira, Leonardo C; Câmara, Edeltrudes M V C; Vieira, Emerson M
2018-04-27
Patterns in the distribution and local abundance of species within a biome are central concerns in ecology: they allow understanding of the effects of habitat loss on rates of species extinction, provide support for the creation and management of reserves, and contribute to the identification and quantification of the processes that allow niche partitioning by species. However, despite its importance to the conservation and management of ecosystems, most systematized information on the abundance and distribution of small mammals is restricted to the northern hemisphere or to forest ecosystems. For tropical biomes, an important part of this information remains dispersed and difficult to access in the form of theses, technical reports or unpublished datasets. Here we present a comprehensive dataset of abundance and richness of small mammals in the Cerrado, the largest Neotropical savanna. This dataset includes 2,599 records from 446 sites across 96 studies. Although more than 50% of the references in this dataset are peer-reviewed journal articles, 45.78% of the communities were compiled from theses. The dataset comprises 24,283 individuals of 55 genera and at least 118 species of small mammals, including 29 marsupials, two lagomorphs (one exotic) and 87 rodents (three exotic). Local species richness ranged from one to 26 species (5.82 ± 3.55, mean ± SD). We observed hyper-dominance of a few species; the 10 most abundant species in this dataset represented 60.19% of all recorded individuals. The hairy-tailed bolo mouse (Necromys lasiurus) represented more than 20% of all individuals and occurred at more than 50% of sites. Furthermore, we identified 18 environments, comprising 16 native vegetation types and two anthropic environments. Typical savanna and gallery forest were the most frequently sampled vegetation types (comprising 46.94% of all sampled sites) and the most speciose ones (57 species for typical savanna and 53 species for gallery forest).
The information contained in this dataset can be used to analyze ecological questions such as the relationship between local abundance and regional distribution, the relevance of local and regional factors in community structuring, and the role of phylogenetic mechanisms in community assembly. It can also be useful in conservation efforts in this biodiversity hotspot. No copyright, proprietary, or cost restrictions apply. Please cite this paper when the data are used in publications. We also request that researchers and teachers inform us of how they are using the data. This article is protected by copyright. All rights reserved.
Multimedia Information Networks in Social Media
NASA Astrophysics Data System (ADS)
Cao, Liangliang; Qi, Guojun; Tsai, Shen-Fu; Tsai, Min-Hsuan; Pozo, Andrey Del; Huang, Thomas S.; Zhang, Xuemei; Lim, Suk Hwan
The popularity of personal digital cameras and online photo/video sharing communities has led to an explosion of multimedia information. Unlike traditional multimedia data, many new multimedia datasets are organized in a structured way, incorporating rich information such as semantic ontologies, social interactions, community media, and geographical maps, in addition to the multimedia content itself. Studies of such structured multimedia data have resulted in a new research area, referred to as Multimedia Information Networks. Multimedia information networks are closely related to social networks, but focus especially on understanding the topics and semantics of multimedia files in the context of network structure. This chapter reviews different categories of recent systems related to multimedia information networks, summarizes the popular inference methods used in recent work, and discusses the applications related to multimedia information networks. We also discuss a wide range of topics including public datasets, related industrial systems, and potential future research directions in this field.
Griswold, Terry; Gonzalez, Victor H; Ikerd, Harold
2014-01-01
This paper describes AnthWest, a large dataset that represents one of the outcomes of a comprehensive, broadly comparative study of the diversity, biology, biogeography, and evolution of Anthidium Fabricius in the Western Hemisphere. In this dataset a total of 22,648 adult occurrence records comprising 9657 unique events are documented for 92 species of Anthidium, including the invasive range of two species introduced from Eurasia, A. oblongatum (Illiger) and A. manicatum (Linnaeus). The geospatial coverage of the dataset extends from northern Canada and Alaska to southern Argentina, and from below sea level in Death Valley, California, USA, to 4700 m a.s.l. in Tucumán, Argentina. The majority of records in the dataset correspond to information recorded from individual specimens examined by the authors during this project and deposited in 60 biodiversity collections located in Africa, Europe, and North and South America. A fraction (4.8%) of the occurrence records were taken from the literature, largely California records from a taxonomic treatment, with some additional records for the two introduced species. The temporal scale of the dataset represents collection events recorded between 1886 and 2012. The dataset was developed using SQL Server 2008 R2. For each specimen, the following information is generally provided: scientific name including an identification qualifier when species status is uncertain (e.g. "Questionable Determination" for 0.4% of the specimens), sex, temporal and geospatial details, coordinates, data collector, host plants, associated organisms, name of identifier, historic identification, historic identifier, taxonomic value (i.e., type specimen, voucher, etc.), and repository.
For a small portion of the database records, bees associated with threatened or endangered plants (~ 0.08% of total records) as well as specimens collected as part of unpublished biological inventories (~17%), georeferencing is presented only to nearest degree and the information on floral host, locality, elevation, month, and day has been withheld. This database can potentially be used in species distribution and niche modeling studies, as well as in assessments of pollinator status and pollination services. For native pollinators, this large dataset of occurrence records is the first to be simultaneously developed during a species-level systematic study.
EnviroAtlas - Portland, ME - Land Cover by Block Group
This EnviroAtlas dataset describes the percentage of each block group that is classified as impervious, forest, green space, wetland, and agriculture. Impervious is a combination of dark and light impervious. Forest is a combination of trees and forest and woody wetlands. Green space is a combination of trees and forest, grass and herbaceous, agriculture, woody wetlands, and emergent wetlands. Wetlands includes both woody and emergent wetlands. This dataset also includes the area per capita for each block group for impervious, forest, and green space land cover. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Spooner, Amy J; Aitken, Leanne M; Corley, Amanda; Chaboyer, Wendy
2018-01-01
Despite increasing demand for structured processes to guide clinical handover, nursing handover tools are limited in the intensive care unit. The study aim was to identify key items to include in a minimum dataset for intensive care nursing team leader shift-to-shift handover. This focus group study was conducted in a 21-bed medical/surgical intensive care unit in Australia. Senior registered nurses involved in team leader handovers were recruited. Focus groups were conducted using a nominal group technique to generate and prioritise minimum dataset items. Nurses were presented with content from previous team leader handovers and asked to select which content items to include in a minimum dataset. Participant responses were summarised as frequencies and percentages. Seventeen senior nurses participated in three focus groups. Participants agreed that ISBAR (Identify-Situation-Background-Assessment-Recommendations) was a useful tool to guide clinical handover. Items recommended to be included in the minimum dataset (≥65% agreement) included Identify (name, age, days in intensive care), Situation (diagnosis, surgical procedure), Background (significant event(s), management of significant event(s)) and Recommendations (patient plan for next shift, tasks to follow up for next shift). Overall, 30 of the 67 (45%) items in the Assessment category were considered important to include in the minimum dataset and focused on relevant observations and treatment within each body system. Other non-ISBAR items considered important to include related to the ICU (admissions to ICU, staffing/skill mix, theatre cases) and patients (infectious status, site of infection, end of life plan). Items were further categorised into those to include in all handovers and those to discuss only when relevant to the patient. 
The findings suggest a minimum dataset for intensive care nursing team leader shift-to-shift handover should contain items within ISBAR along with unit and patient specific information to maintain continuity of care and patient safety across shift changes. Copyright © 2017 Australian College of Critical Care Nurses Ltd. All rights reserved.
Rachel Riemann; Ty Wilson; Andrew Lister
2012-01-01
We recently developed an assessment protocol that provides information on the magnitude, location, frequency and type of error in geospatial datasets of continuous variables (Riemann et al. 2010). The protocol consists of a suite of assessment metrics which include an examination of data distributions and areas estimates, at several scales, examining each in the form...
Damage and protection cost curves for coastal floods within the 600 largest European cities
NASA Astrophysics Data System (ADS)
Prahl, Boris F.; Boettle, Markus; Costa, Luís; Kropp, Jürgen P.; Rybski, Diego
2018-03-01
The economic assessment of the impacts of storm surges and sea-level rise in coastal cities requires high-level information on the damage and protection costs associated with varying flood heights. We provide a systematically and consistently calculated dataset of macroscale damage and protection cost curves for the 600 largest European coastal cities, opening the perspective for a wide range of applications. Offering the first comprehensive dataset to include the costs of dike protection, we provide the underpinning information to run comparative assessments of costs and benefits of coastal adaptation. Aggregate cost curves for coastal flooding at the city level are commonly regarded as by-products of impact assessments and are generally not published as a standalone dataset. Hence, our work also aims at initiating a more critical discussion on the availability and derivation of cost curves.
DOT National Transportation Integrated Search
2013-11-30
Travel time reliability information includes static data about traffic speeds or trip times that capture historic variations from day to day, and it can help individuals understand the level of variation in traffic. Unlike real-time travel time infor...
Comparing Goldstone Solar System Radar Earth-based Observations of Mars with Orbital Datasets
NASA Technical Reports Server (NTRS)
Haldemann, A. F. C.; Larsen, K. W.; Jurgens, R. F.; Slade, M. A.
2005-01-01
The Goldstone Solar System Radar (GSSR) has collected a self-consistent set of delay-Doppler near-nadir radar echo data from Mars since 1988. Prior to the Mars Global Surveyor (MGS) Mars Orbiter Laser Altimeter (MOLA) global topography for Mars, these radar data provided local elevation information, along with radar scattering information with global coverage. Two kinds of GSSR Mars delay-Doppler data exist: low (5 km x 150 km) resolution and, more recently, high (5 to 10 km) spatial resolution. Radar data, and non-imaging delay-Doppler data in particular, require significant processing to extract the elevation, reflectivity and roughness of the reflecting surface. Interpretation of these parameters, while limited by the complexities of electromagnetic scattering, provides information directly relevant to geophysical and geomorphic analyses of Mars. In this presentation we demonstrate how to compare GSSR delay-Doppler data to other Mars datasets, including some idiosyncrasies of the radar data. Additional information is included in the original extended abstract.
Integrative Analysis of “-Omics” Data Using Penalty Functions
Zhao, Qing; Shi, Xingjie; Huang, Jian; Liu, Jin; Li, Yang; Ma, Shuangge
2014-01-01
In the analysis of omics data, integrative analysis provides an effective way of pooling information across multiple datasets or multiple correlated responses, and can be more effective than single-dataset (response) analysis. Multiple families of integrative analysis methods have been proposed in the literature. The current review focuses on the penalization methods. Special attention is paid to sparse meta-analysis methods that pool summary statistics across datasets, and integrative analysis methods that pool raw data across datasets. We discuss their formulation and rationale. Beyond “standard” penalized selection, we also review contrasted penalization and Laplacian penalization which accommodate finer data structures. The computational aspects, including computational algorithms and tuning parameter selection, are examined. This review concludes with possible limitations and extensions. PMID:25691921
Synthetic ALSPAC longitudinal datasets for the Big Data VR project.
Avraam, Demetris; Wilson, Rebecca C; Burton, Paul
2017-01-01
Three synthetic datasets - of 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSPAC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (covariance matrices, as well as the mean and variance values of the variables) without including the original data itself or disclosing participant information. In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have broader use in method and software development projects where sensitive data cannot be freely shared.
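One common way to build synthetic data that preserves means and covariances is to draw from a multivariate normal fitted to the original summary statistics. The sketch below shows that general idea only; it is not necessarily the exact procedure used for the ALSPAC datasets, and the function name is ours.

```python
import numpy as np

def synthesize(real, n_out, seed=0):
    """Draw a synthetic dataset whose mean vector and covariance matrix
    approximate those of `real`, without reusing any original record."""
    rng = np.random.default_rng(seed)
    mu = real.mean(axis=0)                # per-variable means
    cov = np.cov(real, rowvar=False)      # full covariance matrix
    return rng.multivariate_normal(mu, cov, size=n_out)
```

Because only the fitted mean and covariance enter the draw, no participant record is reproduced, which is the disclosure-avoidance property the abstract describes; `n_out` can also exceed the original sample size, as in the 1,555,000-row dataset.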
National Transportation Atlas Databases : 2002
DOT National Transportation Integrated Search
2002-01-01
The National Transportation Atlas Databases 2002 (NTAD2002) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2010
DOT National Transportation Integrated Search
2010-01-01
The National Transportation Atlas Databases 2010 (NTAD2010) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2006
DOT National Transportation Integrated Search
2006-01-01
The National Transportation Atlas Databases 2006 (NTAD2006) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2005
DOT National Transportation Integrated Search
2005-01-01
The National Transportation Atlas Databases 2005 (NTAD2005) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2008
DOT National Transportation Integrated Search
2008-01-01
The National Transportation Atlas Databases 2008 (NTAD2008) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2003
DOT National Transportation Integrated Search
2003-01-01
The National Transportation Atlas Databases 2003 (NTAD2003) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2004
DOT National Transportation Integrated Search
2004-01-01
The National Transportation Atlas Databases 2004 (NTAD2004) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2009
DOT National Transportation Integrated Search
2009-01-01
The National Transportation Atlas Databases 2009 (NTAD2009) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2007
DOT National Transportation Integrated Search
2007-01-01
The National Transportation Atlas Databases 2007 (NTAD2007) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2012
DOT National Transportation Integrated Search
2012-01-01
The National Transportation Atlas Databases 2012 (NTAD2012) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2011
DOT National Transportation Integrated Search
2011-01-01
The National Transportation Atlas Databases 2011 (NTAD2011) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
SHARE: system design and case studies for statistical health information release
Gardner, James; Xiong, Li; Xiao, Yonghui; Gao, Jingjing; Post, Andrew R; Jiang, Xiaoqian; Ohno-Machado, Lucila
2013-01-01
Objectives We present SHARE, a new system for statistical health information release with differential privacy. We present two case studies that evaluate the software on real medical datasets and demonstrate the feasibility and utility of applying the differential privacy framework on biomedical data. Materials and Methods SHARE releases statistical information in electronic health records with differential privacy, a strong privacy framework for statistical data release. It includes a number of state-of-the-art methods for releasing multidimensional histograms and longitudinal patterns. We performed a variety of experiments on two real datasets, the surveillance, epidemiology and end results (SEER) breast cancer dataset and the Emory electronic medical record (EeMR) dataset, to demonstrate the feasibility and utility of SHARE. Results Experimental results indicate that SHARE can deal with heterogeneous data present in medical data, and that the released statistics are useful. The Kullback–Leibler divergence between the released multidimensional histograms and the original data distribution is below 0.5 and 0.01 for seven-dimensional and three-dimensional data cubes generated from the SEER dataset, respectively. The relative error for longitudinal pattern queries on the EeMR dataset varies between 0 and 0.3. While the results are promising, they also suggest that challenges remain in applying statistical data release using the differential privacy framework for higher dimensional data. Conclusions SHARE is one of the first systems to provide a mechanism for custodians to release differentially private aggregate statistics for a variety of use cases in the medical domain. This proof-of-concept system is intended to be applied to large-scale medical data warehouses. PMID:23059729
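The basic building block of such a release — adding calibrated noise to each histogram cell — is the Laplace mechanism: noise with scale sensitivity/epsilon masks any single record's contribution. A minimal sketch of the mechanism (not SHARE's actual implementation):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) by inverse transform sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_histogram(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace(sensitivity/epsilon) noise to each histogram cell.
    Adding or removing one record changes one cell by at most
    `sensitivity`, so the released counts satisfy epsilon-DP."""
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

rng = random.Random(0)
noisy = private_histogram([120, 45, 8], epsilon=0.5, rng=rng)
```

Smaller epsilon gives stronger privacy but noisier counts, which is the utility trade-off the SEER and EeMR experiments quantify.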
High-resolution grids of hourly meteorological variables for Germany
NASA Astrophysics Data System (ADS)
Krähenmann, S.; Walter, A.; Brienen, S.; Imbery, F.; Matzarakis, A.
2018-02-01
We present a 1-km2 gridded German dataset of hourly surface climate variables covering the period 1995 to 2012. The dataset comprises 12 variables including temperature, dew point, cloud cover, wind speed and direction, global and direct shortwave radiation, down- and up-welling longwave radiation, sea level pressure, relative humidity and vapour pressure. This dataset was constructed statistically from station data, satellite observations and model data. It is outstanding in terms of spatial and temporal resolution and in the number of climate variables. For each variable, we employed the most suitable gridding method and combined the best of several information sources, including station records, satellite-derived data and data from a regional climate model. A module to estimate urban heat island intensity was integrated for air and dew point temperature. Owing to the low density of available synop stations, the gridded dataset does not capture all variations that may occur at a resolution of 1 km2. This applies to areas of complex terrain (all the variables), and in particular to wind speed and the radiation parameters. To achieve maximum precision, we used all observational information when it was available. This, however, leads to inhomogeneities in station network density and affects the long-term consistency of the dataset. A first climate analysis for Germany was conducted. The Rhine River Valley, for example, exhibited more than 100 summer days in 2003, whereas in 1996, the number was low everywhere in Germany. The dataset is useful for applications in various climate-related studies, hazard management and for solar or wind energy applications and it is available via doi: 10.5676/DWD_CDC/TRY_Basis_v001.
NASA Astrophysics Data System (ADS)
Chow, L.; Fai, S.
2017-08-01
The digitization and abstraction of existing buildings into building information models requires the translation of heterogeneous datasets that may include CAD, technical reports, historic texts, archival drawings, terrestrial laser scanning, and photogrammetry into model elements. In this paper, we discuss a project undertaken by the Carleton Immersive Media Studio (CIMS) that explored the synthesis of heterogeneous datasets for the development of a building information model (BIM) for one of Canada's most significant heritage assets - the Centre Block of the Parliament Hill National Historic Site. The scope of the project included the development of an as-found model of the century-old, six-story building in anticipation of specific model uses for an extensive rehabilitation program. The as-found Centre Block model was developed in Revit using primarily point cloud data from terrestrial laser scanning. The data was captured by CIMS in partnership with Heritage Conservation Services (HCS), Public Services and Procurement Canada (PSPC), using a Leica C10 and P40 (exterior and large interior spaces) and a Faro Focus (small to mid-sized interior spaces). Secondary sources such as archival drawings, photographs, and technical reports were referenced in cases where point cloud data was not available. As a result of working with heterogeneous data sets, a verification system was introduced in order to communicate to model users/viewers the source of information for each building element within the model.
Development of a Watershed Boundary Dataset for Mississippi
Van Wilson, K.; Clair, Michael G.; Turnipseed, D. Phil; Rebich, Richard A.
2009-01-01
The U.S. Geological Survey, in cooperation with the Mississippi Department of Environmental Quality, U.S. Department of Agriculture-Natural Resources Conservation Service, Mississippi Department of Transportation, U.S. Department of Agriculture-Forest Service, and the Mississippi Automated Resource Information System, developed a 1:24,000-scale Watershed Boundary Dataset for Mississippi including watershed and subwatershed boundaries, codes, names, and drainage areas. The Watershed Boundary Dataset for Mississippi provides a standard geographical framework for water-resources and selected land-resources planning. The original 8-digit subbasins (hydrologic unit codes) were further subdivided into 10-digit watersheds and 12-digit subwatersheds - the exceptions are the Lower Mississippi River Alluvial Plain (known locally as the Delta) and the Mississippi River inside levees, which were only subdivided into 10-digit watersheds. Also, large water bodies in the Mississippi Sound along the coast were not delineated as small as a typical 12-digit subwatershed. All of the data - including watershed and subwatershed boundaries, hydrologic unit codes and names, and drainage-area data - are stored in a Geographic Information System database.
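The hydrologic unit codes described above are hierarchical by prefix: a 12-digit subwatershed nests inside the 10-digit watershed given by its first 10 digits, which nests inside the 8-digit subbasin, and so on. A small sketch of that prefix logic (the example code is a hypothetical HUC, not a specific Mississippi unit):

```python
def parent_huc(huc):
    """Return the parent hydrologic unit code, or None at the top level.
    Levels step by two digits: 12 -> 10 -> 8 -> 6 -> 4 -> 2."""
    if len(huc) <= 2:
        return None
    return huc[:-2]

def huc_lineage(huc):
    """All ancestors from the unit itself up to the 2-digit region."""
    lineage = [huc]
    while parent_huc(lineage[-1]):
        lineage.append(parent_huc(lineage[-1]))
    return lineage
```

This nesting is what lets drainage areas be aggregated consistently from subwatersheds up to subbasins in a GIS database.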
EnviroAtlas - NHDPlus V2 WBD Snapshot, EnviroAtlas version - Conterminous United States
This EnviroAtlas dataset is a digital hydrologic unit boundary layer to the Subwatershed (12-digit) 6th level for the conterminous United States, based on the January 6, 2015 NHDPlus V2 WBD (Watershed Boundary Dataset) Snapshot (NHDPlusV21_NationalData_WBDSnapshot_FileGDB_05). The feature class has been edited for use in EPA ORD's EnviroAtlas. Features in Canada and Mexico have been removed, the boundaries of three 12-digit HUCs have been edited to eliminate gaps and overlaps, the dataset has been dissolved on HUC_12 to create multipart polygons, and information on the percent land area has been added. Hawaii, Puerto Rico, and the U.S. Virgin Islands have been removed, and can be downloaded separately. Other than these modifications, the dataset is the same as the WBD Snapshot included in NHDPlus V2. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Medical and Transmission Vector Vocabulary Alignment with Schema.org
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, William P.; Chappell, Alan R.; Corley, Courtney D.
Available biomedical ontologies and knowledge bases currently lack formal and standards-based interconnections between disease, disease vector, and drug treatment vocabularies. The PNNL Medical Linked Dataset (PNNL-MLD) addresses this gap. This paper describes the PNNL-MLD, which provides a unified vocabulary and dataset of drug, disease, side effect, and vector transmission background information. Currently, the PNNL-MLD combines and curates data from the following research projects: DrugBank, DailyMed, Diseasome, DisGeNet, Wikipedia Infobox, Sider, and PharmGKB. The main outcomes of this effort are a dataset aligned to Schema.org, including a parsing framework, and extensible hooks ready for integration with selected medical ontologies. The PNNL-MLD enables researchers to query distinct datasets more quickly and easily. Future extensions to the PNNL-MLD will include Traditional Chinese Medicine, broader interlinks across genetic structures, a larger thesaurus of synonyms and hypernyms, explicit coding of diseases and drugs across research systems, and incorporating vector-borne transmission vocabularies.
Tools for proactive collection and use of quality metadata in GEOSS
NASA Astrophysics Data System (ADS)
Bastin, L.; Thum, S.; Maso, J.; Yang, K. X.; Nüst, D.; Van den Broek, M.; Lush, V.; Papeschi, F.; Riverola, A.
2012-12-01
The GEOSS Common Infrastructure allows interactive evaluation and selection of Earth Observation datasets by the scientific community and decision makers, but the data quality information needed to assess fitness for use is often patchy and hard to visualise when comparing candidate datasets. In a number of studies over the past decade, users repeatedly identified the same types of gaps in quality metadata, specifying the need for enhancements such as peer and expert review, better traceability and provenance information, information on citations and usage of a dataset, warning about problems identified with a dataset and potential workarounds, and 'soft knowledge' from data producers (e.g. recommendations for use which are not easily encoded using the existing standards). Despite clear identification of these issues in a number of recommendations, the gaps persist in practice and are highlighted once more in our own, more recent, surveys. This continuing deficit may well be the result of a historic paucity of tools to support the easy documentation and continual review of dataset quality. However, more recent developments in tools and standards, as well as more general technological advances, present the opportunity for a community of scientific users to adopt a more proactive attitude by commenting on their uses of data, and for that feedback to be federated with more traditional and static forms of metadata, allowing a user to more accurately assess the suitability of a dataset for their own specific context and reliability thresholds. The EU FP7 GeoViQua project aims to develop this opportunity by adding data quality representations to the existing search and visualisation functionalities of the Geo Portal. Subsequently we will help to close the gap by providing tools to easily create quality information, and to permit user-friendly exploration of that information as the ultimate incentive for improved data quality documentation. 
Quality information is derived from producer metadata, from the data themselves, from validation of in-situ sensor data, from provenance information and from user feedback, and will be aggregated to produce clear and useful summaries of quality, including a GEO Label. GeoViQua's conceptual quality information models for users and producers are specifically described and illustrated in this presentation. These models (which have been encoded as XML schemas and can be accessed at http://schemas.geoviqua.org/) are designed to satisfy the identified user needs while remaining consistent with current standards such as ISO 19115 and advanced drafts such as ISO 19157. The resulting components being developed for the GEO Portal are designed to lower the entry barrier to users who wish to help to generate and explore rich and useful metadata. This metadata will include reviews, comments and ratings, reports of usage in specific domains and specification of datasets used for benchmarking, as well as rich quantitative information encoded in more traditional data quality elements such as thematic correctness and positional accuracy. The value of the enriched metadata will also be enhanced by graphical tools for visualizing spatially distributed uncertainties. We demonstrate practical example applications in selected environmental application domains.
EnviroAtlas - Austin, TX - Land Cover by Block Group
This EnviroAtlas dataset describes the percentage of each block group that is classified as impervious, forest, green space, and agriculture. Forest is defined as Trees & Forest. Green space is defined as Trees & Forest, Grass & Herbaceous, and Agriculture. This dataset also includes the area per capita for each block group for some land cover types. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Common Ground: An Interactive Visual Exploration and Discovery for Complex Health Data
2015-04-01
working with Intermountain Healthcare on a new rich dataset extracted directly from medical notes using natural language processing (NLP) algorithms...probabilities based on state-of-the-art NLP classifiers. At that stage the data did not include geographic information or temporal information but we
Damage and protection cost curves for coastal floods within the 600 largest European cities
Prahl, Boris F.; Boettle, Markus; Costa, Luís; Kropp, Jürgen P.; Rybski, Diego
2018-01-01
The economic assessment of the impacts of storm surges and sea-level rise in coastal cities requires high-level information on the damage and protection costs associated with varying flood heights. We provide a systematically and consistently calculated dataset of macroscale damage and protection cost curves for the 600 largest European coastal cities opening the perspective for a wide range of applications. Offering the first comprehensive dataset to include the costs of dike protection, we provide the underpinning information to run comparative assessments of costs and benefits of coastal adaptation. Aggregate cost curves for coastal flooding at the city-level are commonly regarded as by-products of impact assessments and are generally not published as a standalone dataset. Hence, our work also aims at initiating a more critical discussion on the availability and derivation of cost curves. PMID:29557944
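A macroscale damage curve of this kind maps flood height to monetary damage for a city; intermediate heights are typically evaluated by interpolating between tabulated points. A hedged sketch with invented numbers (not values from the published dataset):

```python
def interpolate_damage(curve, height):
    """Piecewise-linear damage estimate from a sorted (height, damage) table,
    clamped to the endpoints outside the tabulated range."""
    if height <= curve[0][0]:
        return curve[0][1]
    if height >= curve[-1][0]:
        return curve[-1][1]
    for (h0, d0), (h1, d1) in zip(curve, curve[1:]):
        if h0 <= height <= h1:
            t = (height - h0) / (h1 - h0)
            return d0 + t * (d1 - d0)

# Illustrative city-level curve: flood height (m) -> damage (million EUR)
curve = [(0.0, 0.0), (1.0, 10.0), (2.0, 40.0), (3.0, 90.0)]
```

Pairing such a damage curve with a protection cost curve is what enables the cost-benefit comparisons of coastal adaptation the authors describe.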
EnviroAtlas - Austin, TX - Atlas Area Boundary
This EnviroAtlas dataset shows the boundary of the Austin, TX Atlas Area. It represents the outside edge of all the block groups included in the EnviroAtlas Area.This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Becnel, Lauren B; Darlington, Yolanda F; Ochsner, Scott A; Easton-Marks, Jeremy R; Watkins, Christopher M; McOwiti, Apollo; Kankanamge, Wasula H; Wise, Michael W; DeHart, Michael; Margolis, Ronald N; McKenna, Neil J
2015-01-01
Signaling pathways involving nuclear receptors (NRs), their ligands and coregulators, regulate tissue-specific transcriptomes in diverse processes, including development, metabolism, reproduction, the immune response and neuronal function, as well as in their associated pathologies. The Nuclear Receptor Signaling Atlas (NURSA) is a Consortium focused around a Hub website (www.nursa.org) that annotates and integrates diverse 'omics datasets originating from the published literature and NURSA-funded Data Source Projects (NDSPs). These datasets are then exposed to the scientific community on an Open Access basis through user-friendly data browsing and search interfaces. Here, we describe the redesign of the Hub, version 3.0, to deploy "Web 2.0" technologies and add richer, more diverse content. The Molecule Pages, which aggregate information relevant to NR signaling pathways from myriad external databases, have been enhanced to include resources for basic scientists, such as post-translational modification sites and targeting miRNAs, and for clinicians, such as clinical trials. A portal to NURSA's Open Access, PubMed-indexed journal Nuclear Receptor Signaling has been added to facilitate manuscript submissions. Datasets and information on reagents generated by NDSPs are available, as is information concerning periodic new NDSP funding solicitations. Finally, the new website integrates the Transcriptomine analysis tool, which allows for mining of millions of richly annotated public transcriptomic data points in the field, providing an environment for dataset re-use and citation, bench data validation and hypothesis generation. We anticipate that this new release of the NURSA database will have tangible, long term benefits for both basic and clinical research in this field.
Longitudinal data for interdisciplinary ageing research. Design of the Linnaeus Database.
Malmberg, Gunnar; Nilsson, Lars-Göran; Weinehall, Lars
2010-11-01
To allow for interdisciplinary research on the relations between socioeconomic conditions and health in the ageing population, a new anonymized longitudinal database - the Linnaeus Database - has been developed at the Centre for Population Studies at Umeå University. This paper presents the database and its research potential. Using Swedish personal numbers, the researchers have, in collaboration with Statistics Sweden and the National Board for Health and Welfare, linked individual records from Swedish register data on death causes, hospitalization and various socioeconomic conditions with two databases - Betula and VIP (Västerbottens Intervention Programme) - previously developed by researchers at Umeå University. Whereas Betula includes rich information about e.g. cognitive functions, VIP contains information about e.g. lifestyle and health indicators. The Linnaeus Database includes annually updated socioeconomic information from Statistics Sweden registers for all registered residents of Sweden for the period 1990 to 2006, 12,066,478 individuals in total. The information from Betula includes 4,500 participants from the city of Umeå, and VIP includes data for almost 90,000 participants. Both datasets include cross-sectional as well as longitudinal information. Owing to its coverage and rich information, the Linnaeus Database allows for a variety of longitudinal studies on the relations between, for instance, socioeconomic conditions, health, lifestyle, cognition, family networks, migration and working conditions in ageing cohorts. By joining various datasets developed in different disciplinary traditions, new possibilities for interdisciplinary research on ageing emerge.
On the visualization of water-related big data: extracting insights from drought proxies' datasets
NASA Astrophysics Data System (ADS)
Diaz, Vitali; Corzo, Gerald; van Lanen, Henny A. J.; Solomatine, Dimitri
2017-04-01
Big data is a growing area of science from which hydroinformatics can benefit greatly. There have been a number of important developments in the area of data science aimed at the analysis of large datasets. Such datasets related to water include measurements, simulations, reanalyses, scenario analyses and proxies. By convention, information contained in these databases is referenced to a specific time and space (i.e., longitude/latitude). This work is motivated by the need to extract insights from large water-related datasets, i.e., transforming large amounts of data into useful information that helps us better understand water-related phenomena, particularly drought. In this context, data visualization, part of data science, involves techniques to create and communicate data by encoding it as visual graphical objects. These may help to better understand data and detect trends. Based on existing methods of data analysis and visualization, this work aims to develop tools for visualizing large water-related datasets. These tools build on existing data visualization libraries to produce a group of graphs which include both polar area diagrams (PADs) and radar charts (RDs). In both graphs, time steps are represented by the polar angles and the percentages of area in drought by the radii. For illustration, three large datasets of drought proxies are chosen to identify trends, prone areas and the spatio-temporal variability of drought in a set of case studies. The datasets are (1) SPI-TS2p1 (1901-2002, 11.7 GB), (2) SPI-PRECL0p5 (1948-2016, 7.91 GB) and (3) SPEI-baseV2.3 (1901-2013, 15.3 GB). All of them are on a monthly basis and with a spatial resolution of 0.5 degrees. The first two were retrieved from the repository of the International Research Institute for Climate and Society (IRI). They are included in the Analyses Standardized Precipitation Index (SPI) project (iridl.ldeo.columbia.edu/SOURCES/.IRI/.Analyses/.SPI/).
The third dataset was recovered from the Standardized Precipitation Evaporation Index (SPEI) Monitor (digital.csic.es/handle/10261/128892). PADs were found suitable to identify the spatio-temporal variability and prone areas of drought. Drought trends were visually detected by using both PADs and RDs. A similar approach can be followed to include other types of graphs to deal with the analysis of water-related big data. Key words: Big data, data visualization, drought, SPI, SPEI
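The radial quantity plotted in both chart types — the percentage of area in drought at each time step — can be computed from a gridded SPI/SPEI field by thresholding; a common convention flags drought where the index falls below -1. A simplified sketch with a toy grid (not the authors' code):

```python
def drought_area_percent(grid, threshold=-1.0):
    """Percentage of valid grid cells whose index is below the drought
    threshold. `grid` is a list of rows; None marks ocean/missing cells."""
    cells = [v for row in grid for v in row if v is not None]
    in_drought = sum(1 for v in cells if v < threshold)
    return 100.0 * in_drought / len(cells)

# Toy monthly SPI field (5 valid cells, 1 missing)
spi = [[-1.5, 0.2, None],
       [-0.8, -2.1, 1.0]]
```

Computing this percentage for every month yields the time series that the polar angle/radius encoding of the PADs and RDs then displays.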
Paz-Ríos, Carlos E; Simões, Nuno; Pech, Daniel
2018-01-01
Alacranes Reef was declared as a National Marine Park in 1994. Since then, many efforts have been made to inventory its biodiversity. However, groups such as amphipods have been underestimated or not considered when benthic invertebrates were inventoried. Here we present a dataset that contributes to the knowledge of benthic amphipods (Crustacea, Peracarida) from the inner lagoon habitats from the Alacranes Reef National Park, the largest coral reef ecosystem in the Gulf of Mexico. The dataset contains information on records collected from 2009 to 2011. Data are available through Global Biodiversity Information Facility (GBIF). A total of 110 amphipod species distributed in 93 nominal species and 17 generic species, belonging to 71 genera, 33 families and three suborders are presented here. This information represents the first online dataset of amphipods from the Alacranes Reef National Park. The biological material is currently deposited in the crustacean collection from the regional unit of the National Autonomous University of Mexico located at Sisal, Yucatan, Mexico (UAS-Sisal). The biological material includes 588 data records with a total abundance of 6,551 organisms. The species inventory represents, until now, the richest fauna of benthic amphipods registered from any discrete coral reef ecosystem in Mexico.
Mutual-information-based registration for ultrasound and CT datasets
NASA Astrophysics Data System (ADS)
Firle, Evelyn A.; Wesarg, Stefan; Dold, Christian
2004-05-01
In many applications for minimally invasive surgery the acquisition of intra-operative medical images is helpful if not absolutely necessary. For brachytherapy in particular, imaging is critically important to the safe delivery of the therapy. Modern computed tomography (CT) and magnetic resonance (MR) scanners allow minimally invasive procedures to be performed under direct imaging guidance. However, conventional scanners do not have real-time imaging capability and are expensive technologies requiring a special facility. Ultrasound (U/S) is a much cheaper and one of the most flexible imaging modalities. It can be moved to the application room as required and the physician sees what is happening as it occurs. Nevertheless, these 3D intra-operative U/S images may be easier to interpret if they are used in combination with less noisy pre-operative data such as CT. The purpose of our current investigation is to develop a registration tool for automatically combining pre-operative CT volumes with intra-operatively acquired 3D U/S datasets. The applied alignment procedure is based on the information-theoretic approach of maximizing the mutual information of two arbitrary datasets from different modalities. Since the CT datasets include a much larger field of view, we introduced a bounding box to narrow down the region of interest within the CT dataset. We conducted a phantom experiment using a CIRS Model 53 U/S Prostate Training Phantom to evaluate the feasibility and accuracy of the proposed method.
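The similarity measure underlying this registration, mutual information, can be computed from the joint intensity histogram of the two images: MI = sum over (a, b) of p(a,b) log(p(a,b) / (p(a) p(b))). A simplified sketch for small discretised images (an illustration of the measure, not the authors' implementation):

```python
import math
from collections import Counter

def mutual_information(img_a, img_b):
    """Mutual information (in nats) between two equal-length sequences
    of intensity bins, estimated from their joint histogram."""
    n = len(img_a)
    joint = Counter(zip(img_a, img_b))   # joint intensity counts
    pa = Counter(img_a)                  # marginal counts, image A
    pb = Counter(img_b)                  # marginal counts, image B
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) simplifies to c*n / (pa[a]*pb[b])
        mi += p_ab * math.log(c * n / (pa[a] * pb[b]))
    return mi

# Perfectly aligned identical images: MI equals the image entropy, log(2) here.
ct = [0, 0, 1, 1]
us = [0, 0, 1, 1]
```

A registration loop would transform one volume, rebin it, and accept the pose that maximises this quantity.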
Potential for using regional and global datasets for national scale ecosystem service modelling
NASA Astrophysics Data System (ADS)
Maxwell, Deborah; Jackson, Bethanna
2016-04-01
Ecosystem service models are increasingly being used by planners and policy makers to inform policy development and decisions about national-level resource management. Such models allow ecosystem services to be mapped and quantified, and subsequent changes to these services to be identified and monitored. In some cases, the impact of small-scale changes can be modelled at a national scale, providing more detailed information to decision makers about where to best focus investment and management interventions that could address these issues, while moving toward national goals and/or targets. National-scale modelling often uses national (or local) data (for example, soils, landcover and topographical information) as input. However, there are some places where fine-resolution and/or high-quality national datasets cannot be easily obtained, or do not even exist. In the absence of such detailed information, regional or global datasets could be used as input to such models. There are questions, however, about the usefulness of these coarser-resolution datasets and the extent to which inaccuracies in this data may degrade predictions of existing and potential ecosystem service provision and subsequent decision making. Using LUCI (the Land Utilisation and Capability Indicator) as an example predictive model, we examine how the reliability of predictions changes when national datasets of soil, landcover and topography are substituted with coarser-scale regional and global datasets. We specifically look at how LUCI's predictions of water services, such as flood risk, flood mitigation, erosion and water quality, change when national data inputs are replaced by regional and global datasets. Using the Conwy catchment, Wales, as a case study, the land cover products compared are the UK's Land Cover Map (2007), the European CORINE land cover map and the ESA global land cover map.
Soils products include the National Soil Map of England and Wales (NatMap) and the European Soils Database. NEXTMap elevation data, which cover the UK and parts of continental Europe, are compared to the global ASTER GDEM and SRTM30 topographical products. While the regional and global datasets can be used to fill gaps in data requirements, their coarser resolution means that information is aggregated over larger areas. This loss of detail affects the reliability of model output, particularly where significant discrepancies between datasets exist. The implications of this loss of detail for spatial planning and decision making are discussed. Finally, in the context of broader development, the need for better nationally and globally available data, which would allow LUCI and other ecosystem models to become more globally applicable, is highlighted.
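The core resolution trade-off described above can be sketched numerically: block-averaging a fine grid to a coarser one, as a regional or global product effectively does, and measuring how much local variability survives. This is an illustrative sketch on synthetic data only, not LUCI inputs; the function name is invented.

```python
import numpy as np

def coarsen(grid, factor):
    """Block-average a fine raster into coarser cells, mimicking the
    aggregation implicit in regional/global products."""
    h, w = grid.shape
    trimmed = grid[:h - h % factor, :w - w % factor]
    return trimmed.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# Synthetic "national" raster: a broad spatial trend plus fine-scale detail.
rng = np.random.default_rng(3)
fine = rng.normal(size=(60, 60)) + np.linspace(0.0, 5.0, 60)
coarse = coarsen(fine, 10)
detail_lost = 1.0 - coarse.var() / fine.var()   # fraction of variance aggregated away
```

The broad trend survives the aggregation, while the fine-scale variance (here, roughly a third of the total) is smoothed away; it is exactly this lost detail that can degrade predictions of locally driven services such as erosion or flood risk.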
Advanced Multidimensional Separations in Mass Spectrometry: Navigating the Big Data Deluge
May, Jody C.; McLean, John A.
2017-01-01
Hybrid analytical instruments constructed around mass spectrometry (MS) are becoming the preferred techniques for addressing many grand challenges in science and medicine. From the omics sciences to drug discovery and synthetic biology, multidimensional separations based on MS provide the high peak capacity and high measurement throughput necessary to obtain large-scale measurements from which systems-level information is inferred. In this review, we describe multidimensional MS configurations as technologies that drive big data and discuss new and emerging strategies for mining information from large-scale datasets. We include a discussion of the information content that can be obtained from individual dimensions, as well as the unique information that can be derived by comparing different levels of data. Finally, we discuss emerging data visualization strategies that seek to make highly dimensional datasets both accessible and comprehensible. PMID:27306312
Ferreira, Antonio; Daraktchieva, Zornitza; Beamish, David; Kirkwood, Charles; Lister, T Robert; Cave, Mark; Wragg, Joanna; Lee, Kathryn
2018-01-01
Predictive mapping of indoor radon potential often requires the use of additional datasets. A range of geological, geochemical and geophysical data may be considered, either individually or in combination. The present work is an evaluation of how much of the indoor radon variation in south west England can be explained by four different datasets: a) the geology (G), b) the airborne gamma-ray spectroscopy (AGR), c) the geochemistry of topsoil (TSG) and d) the geochemistry of stream sediments (SSG). The study area was chosen since it provides a large (197,464) indoor radon dataset in association with the above information. Geology provides information on the distribution of the materials that may contribute to radon release while the latter three items provide more direct observations on the distributions of the radionuclide elements uranium (U), thorium (Th) and potassium (K). In addition, (c) and (d) provide multi-element assessments of geochemistry which are also included in this study. The effectiveness of datasets for predicting the existing indoor radon data is assessed through the level (the higher the better) of explained variation (% of variance or ANOVA) obtained from the tested models. A multiple linear regression using a compositional data (CODA) approach is carried out to obtain the required measure of determination for each analysis. Results show that, amongst the four tested datasets, the soil geochemistry (TSG, i.e. including all the available 41 elements, 10 major - Al, Ca, Fe, K, Mg, Mn, Na, P, Si, Ti - plus 31 trace) provides the highest explained variation of indoor radon (about 40%); more than double the value provided by U alone (ca. 15%), or the sub composition U, Th, K (ca. 16%) from the same TSG data. The remaining three datasets provide values ranging from about 27% to 32.5%. The enhanced prediction of the AGR model relative to the U, Th, K in soils suggests that the AGR signal captures more than just the U, Th and K content in the soil. 
The best result is obtained by including the soil geochemistry with geology and AGR (TSG + G + AGR, ca. 47%). However, adding G and AGR to the TSG model only slightly improves the prediction (ca. +7%), suggesting that the geochemistry of soils already contains most of the information given by the geology and airborne datasets together, at least with regard to explaining indoor radon. From the present analysis performed in the SW of England, it may be concluded that each of the four datasets is likely to be useful for radon mapping purposes, whether alone or in combination with others. The present work also suggests that the complete soil geochemistry dataset (TSG) is more effective for indoor radon modelling than using just the U (+Th, K) concentration in soil. Copyright © 2016 Natural Environment Research Council. Published by Elsevier Ltd. All rights reserved.
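A minimal sketch of the CoDA-style regression described above: a centred log-ratio (CLR) transform followed by ordinary least squares, with R^2 as the measure of explained variation. The data are synthetic and the helper names are invented; this is not the authors' code.

```python
import numpy as np

def clr(parts):
    """Centred log-ratio transform: the usual way to bring compositional
    rows (parts summing to 1) into ordinary Euclidean space before regression."""
    logp = np.log(parts)
    return logp - logp.mean(axis=1, keepdims=True)

def explained_variation(parts, y):
    """Least-squares fit of y on CLR-transformed parts; returns R^2,
    the 'explained variation' used to rank candidate datasets."""
    Z = clr(parts)
    A = np.column_stack([np.ones(len(Z)), Z])   # intercept + CLR predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

# Synthetic stand-in for a geochemistry table: 200 samples, 3 parts,
# response driven by the CLR of the first part plus noise.
rng = np.random.default_rng(0)
raw = rng.lognormal(size=(200, 3))
parts = raw / raw.sum(axis=1, keepdims=True)
signal = np.log(parts[:, 0]) - np.log(parts).mean(axis=1)
y = 2.0 * signal + rng.normal(scale=0.1, size=200)
r2 = explained_variation(parts, y)
```

In the study, this R^2 is the quantity compared across the G, AGR, TSG and SSG models; the CLR step matters because raw element concentrations are constrained compositions and should not be regressed directly.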
NASA Astrophysics Data System (ADS)
Forkert, Nils Daniel; Siemonsen, Susanne; Dalski, Michael; Verleger, Tobias; Kemmling, Andre; Fiehler, Jens
2014-03-01
Acute ischemic stroke is a leading cause of death and disability in industrialized nations. For a patient presenting with an acute ischemic stroke, prediction of future tissue outcome is of high interest to clinicians, as it can support therapy decision making. Within this context, it has already been shown that voxel-wise multi-parametric tissue outcome prediction leads to more promising results than single-channel perfusion map thresholding. Most previously published multi-parametric predictions employ information from perfusion maps derived from perfusion-weighted MRI (PWI) together with other image sequences such as diffusion-weighted MRI. However, it remains unclear whether the perfusion maps typically calculated for this purpose really capture all the information in the PWI dataset that is valuable for an optimal tissue outcome prediction. To investigate this problem in more detail, two different methods to predict tissue outcome using a k-nearest-neighbor approach were developed in this work and evaluated on 18 datasets of acute stroke patients with known tissue outcome. The first method integrates apparent diffusion coefficient and perfusion parameter (Tmax, MTT, CBV, CBF) information for the voxel-wise prediction, while the second method also employs apparent diffusion coefficient information but uses the complete perfusion information, in terms of the voxel-wise residue functions, instead of the perfusion parameter maps. Overall, the comparison of the results of the two prediction methods for the 18 patients using a leave-one-out cross-validation revealed no considerable differences. Quantitatively, the parameter-based prediction of tissue outcome led to a mean Dice coefficient of 0.474, while the prediction using the residue functions led to a mean Dice coefficient of 0.461. 
Thus, it may be concluded from the results of this study that the perfusion parameter maps typically derived from PWI datasets include all the perfusion information valuable for a voxel-based tissue outcome prediction, while analyzing the complete residue functions adds no further benefit to the voxel-wise prediction and is also computationally more expensive.
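The Dice coefficient used to score the two predictors measures the overlap between the predicted lesion mask and the known tissue outcome. A generic sketch on toy masks, not the study's evaluation code:

```python
import numpy as np

def dice(pred, truth):
    """Dice similarity coefficient between two binary lesion masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, truth).sum() / denom

# Toy 2D "final infarct" vs. a prediction shifted by one voxel.
truth = np.zeros((8, 8), dtype=int)
truth[2:6, 2:6] = 1
pred = np.zeros((8, 8), dtype=int)
pred[3:7, 3:7] = 1
score = dice(pred, truth)   # overlap 9 voxels, masks 16 + 16, so 18/32
```

Values near 0.47, as reported above, indicate moderate overlap; the metric penalizes both over- and under-segmentation symmetrically.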
NASA Astrophysics Data System (ADS)
Pedretti, Daniele; Beckie, Roger Daniel
2014-05-01
Missing data are ubiquitous in hydrological time-series databases, yet complete time series are of fundamental importance for making educated decisions in problems that require exhaustive time-series knowledge. This includes precipitation datasets, since recording or human failures can produce gaps in these time series. For applications directly involving the ratio between precipitation and some other quantity, the lack of complete information can result in poor understanding of basic physical and chemical dynamics involving precipitated water. For instance, the ratio between precipitation (recharge) and outflow rates at a discharge point of an aquifer (e.g. rivers, pumping wells, lysimeters) can be used to obtain aquifer parameters and thus to constrain model-based predictions. We tested a suite of methodologies to reconstruct missing information in rainfall datasets. The goal was to obtain a suitable and versatile method to reduce the errors caused by the lack of data in specific time windows. Our analyses included both a classical chronological pairing approach between rainfall stations and a probability-based approach, which accounts for the probability of exceedance of rain depths measured at two or multiple stations. Our analyses showed that it is not clear a priori which method performs best; rather, the selection should be based on the specific statistical properties of the rainfall dataset. In this presentation, our emphasis is on the effects of a few typical parametric distributions used to model the behavior of rainfall. Specifically, we analyzed the role of distributional "tails", which exert an important control on the occurrence of extreme rainfall events. The latter strongly affect several hydrological applications, including recharge-discharge relationships. The heavy-tailed distributions we considered were the parametric Log-Normal, Generalized Pareto, Generalized Extreme Value and Gamma distributions. 
The methods were first tested on synthetic examples, to gain complete control over several variables, such as the minimum amount of data required to obtain reliable statistical distributions from the selected parametric functions. The methodology was then applied to precipitation datasets collected in the Vancouver area and at a mining site in Peru.
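One way to sketch the probability-based pairing described above is empirical quantile mapping between a gappy target station and a complete donor station: each donor value is converted to its non-exceedance probability and mapped onto the target's own distribution. The function and synthetic data below are illustrative assumptions, not the methodology actually tested.

```python
import numpy as np

def fill_by_quantile_mapping(target, donor):
    """Fill NaN gaps in target using a synchronous donor record: map each
    donor value to its empirical non-exceedance probability, then to the
    value with the same probability in the target's own distribution."""
    both = ~np.isnan(target) & ~np.isnan(donor)
    t_sorted = np.sort(target[both])
    d_sorted = np.sort(donor[both])
    p = (np.arange(1, both.sum() + 1) - 0.5) / both.sum()  # plotting positions
    filled = target.copy()
    gaps = np.isnan(target) & ~np.isnan(donor)
    prob = np.interp(donor[gaps], d_sorted, p)    # donor value -> probability
    filled[gaps] = np.interp(prob, p, t_sorted)   # probability -> target value
    return filled

# Synthetic skewed "rainfall": a donor station and a correlated target
# series with a 20-step gap.
rng = np.random.default_rng(1)
donor = rng.gamma(shape=0.7, scale=10.0, size=500)
target = np.clip(1.5 * donor + rng.normal(scale=0.5, size=500), 0.0, None)
target_gappy = target.copy()
target_gappy[100:120] = np.nan
recon = fill_by_quantile_mapping(target_gappy, donor)
```

Because the mapping passes through empirical distributions rather than paired values, its skill in the tails depends directly on the distributional assumptions discussed above: a fitted heavy-tailed parametric form would replace the empirical step here.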
Griswold, Terry; Gonzalez, Victor H.; Ikerd, Harold
2014-01-01
Abstract This paper describes AnthWest, a large dataset that represents one of the outcomes of a comprehensive, broadly comparative study on the diversity, biology, biogeography, and evolution of Anthidium Fabricius in the Western Hemisphere. In this dataset a total of 22,648 adult occurrence records comprising 9657 unique events are documented for 92 species of Anthidium, including the invasive range of two species introduced from Eurasia, A. oblongatum (Illiger) and A. manicatum (Linnaeus). The geospatial coverage of the dataset extends from northern Canada and Alaska to southern Argentina, and from below sea level in Death Valley, California, USA, to 4700 m a.s.l. in Tucumán, Argentina. The majority of records in the dataset correspond to information recorded from individual specimens examined by the authors during this project and deposited in 60 biodiversity collections located in Africa, Europe, North and South America. A fraction (4.8%) of the occurrence records were taken from the literature, largely California records from a taxonomic treatment, with some additional records for the two introduced species. The temporal scale of the dataset represents collection events recorded between 1886 and 2012. The dataset was developed employing SQL Server 2008 R2. For each specimen, the following information is generally provided: scientific name including identification qualifier when species status is uncertain (e.g. "Questionable Determination" for 0.4% of the specimens), sex, temporal and geospatial details, coordinates, data collector, host plants, associated organisms, name of identifier, historic identification, historic identifier, taxonomic value (i.e., type specimen, voucher, etc.), and repository. 
For a small portion of the database records, namely bees associated with threatened or endangered plants (~0.08% of total records) and specimens collected as part of unpublished biological inventories (~17%), georeferencing is presented only to the nearest degree, and the information on floral host, locality, elevation, month, and day has been withheld. This database can potentially be used in species distribution and niche modeling studies, as well as in assessments of pollinator status and pollination services. For native pollinators, this large dataset of occurrence records is the first to be developed simultaneously with a species-level systematic study. PMID:24899835
EnviroAtlas - Industrial Water Demand by 12-Digit HUC for the Conterminous United States
This EnviroAtlas dataset includes industrial water demand attributes which provide insight into the amount of water currently used for manufacturing and production of commodities in the contiguous United States. The values are based on 2005 water demand and Dun and Bradstreet's 2009/2010 source data, and have been summarized by watershed, or 12-digit hydrologic unit code (HUC). For the purposes of this metric, industrial water use includes chemical, food, paper, wood, and metal production. Only self-supplied industrial water, such as from private wells or reservoirs, is included; sources may be either surface water or groundwater. This dataset was produced by the US EPA to support research and online mapping activities related to the EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Canessa, Andrea; Gibaldi, Agostino; Chessa, Manuela; Fato, Marco; Solari, Fabio; Sabatini, Silvio P.
2017-01-01
Binocular stereopsis is the ability of a visual system, belonging to a living being or a machine, to interpret the different visual information derived from two eyes/cameras for depth perception. From this perspective, ground-truth information about the three-dimensional visual space, which is rarely available, is an ideal tool both for evaluating human performance and for benchmarking machine vision algorithms. In the present work, we implemented a rendering methodology in which the camera pose mimics the realistic eye pose of a fixating observer, thus including convergent eye geometry and cyclotorsion. The virtual environment we developed relies on highly accurate 3D virtual models, and its full controllability allows us to obtain the stereoscopic pairs together with the ground-truth depth and camera pose information. We thus created a stereoscopic dataset: GENUA PESTO—GENoa hUman Active fixation database: PEripersonal space STereoscopic images and grOund truth disparity. The dataset aims to provide a unified framework useful for a number of problems relevant to human and computer vision, from scene exploration and eye movement studies to 3D scene reconstruction. PMID:28350382
U.S.-Mexico Border Geographic Information System
Parcher, Jean W.
2008-01-01
Geographic Information Systems (GIS) and the development of extensive geodatabases have become invaluable tools for addressing a variety of contemporary societal issues and for making predictions about the future. The United States-Mexico Geographic Information System (USMX-GIS) is based on fundamental datasets that are produced and/or approved by the national geography agencies of each country, the U.S. Geological Survey (USGS) and the Instituto Nacional de Estadística y Geografía (INEGI) of Mexico, and the International Boundary and Water Commission (IBWC). The data are available at various scales to allow both regional and local analysis. The USGS and the INEGI have an extensive history of collaboration on transboundary mapping, including exchanging digital technology and developing methods for harmonizing seamless national-level geospatial datasets for binational environmental monitoring, urban growth analysis, and other scientific applications.
Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation
NASA Astrophysics Data System (ADS)
Tobon-Gomez, Catalina; Sukno, Federico M.; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F.
2012-07-01
Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be specific to the imaging modality and protocol. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from simulated MRI datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18%; LV mass errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. For both real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.
Developing a new global network of river reaches from merged satellite-derived datasets
NASA Astrophysics Data System (ADS)
Lion, C.; Allen, G. H.; Beighley, E.; Pavelsky, T.
2015-12-01
In 2020, the Surface Water and Ocean Topography (SWOT) satellite, a joint mission of NASA, CNES, CSA, and the UK, will be launched. One of its major products will be measurements of continental water extent, including the width, height, and slope of rivers and the surface area and elevations of lakes. The mission will improve the monitoring of continental water and also our understanding of the interactions between different hydrologic reservoirs. For rivers, SWOT measurements of slope must be carried out over predefined river reaches. As such, an a priori dataset for rivers is needed in order to facilitate analysis of the raw SWOT data. The information required to produce this dataset includes measurements of river width, elevation, slope, planform, river network topology, and flow accumulation. To produce this product, we have linked two existing global datasets: the Global River Widths from Landsat (GRWL) database, which contains river centerline locations, widths, and a braiding index derived from Landsat imagery, and a modified version of the HydroSHEDS hydrologically corrected digital elevation product, which contains heights and flow accumulation measurements for streams at 3 arcsecond spatial resolution. Merging these two datasets requires considerable care. The difficulties, among others, lie in the difference of resolution (30 m versus 3 arcseconds) and the age of the datasets (2000 versus ~2010; some rivers have moved, and the braided sections are different). As such, we have developed custom software to merge the two datasets, taking into account the spatial proximity of river channels in the two datasets and ensuring that flow accumulation in the final dataset always increases downstream. Here, we present our preliminary results for a portion of South America and demonstrate the strengths and weaknesses of the method.
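The constraint that flow accumulation must always increase downstream can be enforced with a single pass over the network in topological order, pushing a running maximum toward the outlet. The network layout and names below are invented for illustration and are not drawn from GRWL or HydroSHEDS.

```python
def enforce_monotonic_accumulation(flow_acc, downstream):
    """Push a running maximum down the network so that flow accumulation
    never decreases downstream. downstream[i] is the index of the reach
    below reach i, or -1 at an outlet."""
    n = len(flow_acc)
    indeg = [0] * n
    for d in downstream:
        if d >= 0:
            indeg[d] += 1
    queue = [i for i in range(n) if indeg[i] == 0]   # head-water reaches
    fixed = list(flow_acc)
    while queue:
        i = queue.pop()
        d = downstream[i]
        if d >= 0:
            fixed[d] = max(fixed[d], fixed[i])   # accumulation may only grow
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return fixed

# Toy network: reaches 0 and 1 join at reach 2, which drains to outlet 3.
acc = [5.0, 7.0, 6.0, 6.5]          # reaches 2 and 3 are inconsistently low
down = [2, 2, 3, -1]
fixed = enforce_monotonic_accumulation(acc, down)
```

Taking the maximum is only the minimal repair for monotonicity; a full merge would also reconcile the conflicting accumulation values against channel positions in both source datasets.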
EnviroAtlas - Austin, TX - Block Groups
This EnviroAtlas dataset is the base layer for the Austin, TX EnviroAtlas area. The block groups are from the US Census Bureau and are included/excluded based on EnviroAtlas criteria described in the procedure log. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hodge, Bri-Mathias
2016-04-08
The primary objective of this work was to create a state-of-the-art national wind resource data set and to provide detailed wind plant output data for specific sites based on that data set. Corresponding retrospective wind forecasts were also included at all selected locations. The combined information from these activities was used to create the Wind Integration National Dataset (WIND), and an extraction tool was developed to allow web-based data access.
Data for Sochi 2014 Olympics discussion on social media.
Kirilenko, Andrei P
2017-08-01
The presented data relate to the research article "Sochi 2014 Olympics on Twitter: Perspectives of Hosts and Guests" [2]. The data were collected through regular Twitter API searches over a five-month window around the 2014 Sochi Olympic Games and were used for cluster analysis and analysis of sentiment about the Games. The main dataset contains 616 thousand tweets, rigorously cleaned and filtered to remove irrelevant content. To comply with the Twitter API user agreement, the dataset presented in this article includes only generalized daily data, with all information contained in individual tweets removed. The proposed use of the dataset is academic research on the changing discussion of topics related to mega-events in conjunction with political events.
Discovery and Analysis of Intersecting Datasets: JMARS as a Comparative Science Platform
NASA Astrophysics Data System (ADS)
Carter, S.; Christensen, P. R.; Dickenshied, S.; Anwar, S.; Noss, D.
2014-12-01
A great deal can be discovered by comparing and studying a chosen region or area on a planetary body. In this age, science has an enormous number of instruments and datasets to draw from; often the first obstacle is finding the right information. Developed at Arizona State University, Java Mission-planning and Analysis for Remote Sensing (JMARS) enables users to easily find and study related datasets. JMARS supports a long list of planetary bodies in our solar system, including Earth, the Moon, Mars, and other planets, satellites, and asteroids. Within JMARS a user can start with a particular area and search for all datasets that have images/information intersecting that region of interest. Once users have found data they are interested in comparing, they can view the images at once and see the numeric information at each location. This information can be analyzed in a few powerful ways. If the dataset of interest varies with time but the location stays constant, the user may want to compare specific locations through time. This can be done with the Investigate Tool in JMARS: users can create a Data Spike, and the information at that point will be plotted through time. If the region does not have a temporal dataset, a different method involving a profile line is more suitable. Also using the Investigate Tool, a user can create a Data Profile (a line which can contain as many vertices as necessary), and all numeric data underneath the line will be plotted on one graph for easy comparison. This can be used to compare differences between similar datasets (perhaps the same measurement from different instruments) or to find correlations from one dataset to another. A third form of analysis, involving entire areas (polygons), is planned for future development. Sampling the different data sources beneath an area can reveal statistics such as maximum, minimum, and average values, and standard deviation. 
These values can be compared to other data sources under the given area. JMARS has the ability to geographically locate and display a vast array of remote sensing data for a user. In addition to its powerful searching ability, it also enables users to compare datasets using the Data Spike and Data Profile techniques. Plots and tables from this data can be exported and used in presentations, papers, or external software for further study.
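The Data Profile and area-sampling ideas generalize to any raster: bilinear sampling along a line yields the values to plot, and a mask yields the polygon statistics. This is not JMARS code; the function names, the ramp dataset, and the coordinates are all hypothetical.

```python
import numpy as np

def sample_profile(grid, p0, p1, n=50):
    """Bilinear samples of grid values along the segment p0 -> p1
    (row, col coordinates): the 'values under a line' of a Data Profile."""
    rows = np.linspace(p0[0], p1[0], n)
    cols = np.linspace(p0[1], p1[1], n)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.clip(r0 + 1, 0, grid.shape[0] - 1)
    c1 = np.clip(c0 + 1, 0, grid.shape[1] - 1)
    fr = rows - r0
    fc = cols - c0
    top = grid[r0, c0] * (1 - fc) + grid[r0, c1] * fc
    bot = grid[r1, c0] * (1 - fc) + grid[r1, c1] * fc
    return top * (1 - fr) + bot * fr

def area_stats(grid, mask):
    """Summary statistics of the cells inside a polygon mask
    (the planned area-sampling analysis)."""
    vals = grid[mask]
    return {"min": vals.min(), "max": vals.max(),
            "mean": vals.mean(), "std": vals.std()}

rr, cc = np.mgrid[0:32, 0:32]
grid = (rr + cc).astype(float)          # a simple ramp standing in for a dataset
profile = sample_profile(grid, (0, 0), (31, 31), n=10)
stats = area_stats(grid, (rr < 4) & (cc < 4))
```

Overlaying profiles from two co-registered grids sampled along the same segment is the essence of the cross-dataset comparison described above.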
Gandy, Lisa M; Gumm, Jordan; Fertig, Benjamin; Thessen, Anne; Kennish, Michael J; Chavan, Sameer; Marchionni, Luigi; Xia, Xiaoxin; Shankrit, Shambhavi; Fertig, Elana J
2017-01-01
Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential for performing synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information-retrieval-inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Application of the Synthesize algorithm to cancer and ecological datasets had high accuracy (on the order of 85-100%). We further implement Synthesize in an open-source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma-separated value (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy-to-use graphical user interface, which enables the user to finish combining the data and bring the merge to full accuracy. Future work will allow detection of units to automatically merge continuous data, and application of the algorithm to other data formats, including databases.
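The label-matching half of such a merge can be sketched with standard string similarity. This toy stand-in ignores the value-based matching that Synthesize also performs, and the headers and cutoff are invented for illustration.

```python
import csv
import difflib
import io

def match_columns(cols_a, cols_b, cutoff=0.5):
    """Pair each column label in sheet A with its most similar label in
    sheet B (label matching only; a full merge would also compare values)."""
    lowered = [b.lower() for b in cols_b]
    mapping = {}
    for a in cols_a:
        best = difflib.get_close_matches(a.lower(), lowered, n=1, cutoff=cutoff)
        if best:
            mapping[a] = cols_b[lowered.index(best[0])]
    return mapping

# Two tiny "spreadsheets" with differing header conventions.
sheet_a = "Species,Lat,Long\nA. manicatum,40.1,-75.2\n"
sheet_b = "species name,latitude,longitude\nA. oblongatum,41.0,-74.0\n"
cols_a = next(csv.reader(io.StringIO(sheet_a)))
cols_b = next(csv.reader(io.StringIO(sheet_b)))
mapping = match_columns(cols_a, cols_b)
```

Pure label similarity is brittle (abbreviations, synonyms), which is exactly why combining it with value distributions, as the algorithm above does, raises accuracy.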
Gumm, Jordan; Fertig, Benjamin; Thessen, Anne; Kennish, Michael J.; Chavan, Sameer; Marchionni, Luigi; Xia, Xiaoxin; Shankrit, Shambhavi; Fertig, Elana J.
2017-01-01
Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information retrieval inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Application of the Synthesize algorithm to cancer and ecological datasets had high accuracy (on the order of 85–100%). We further implement Synthesize in an open source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma separated value (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy to use graphical user interface, which enables the user to finish combining data and obtain perfect accuracy. Future work will allow detection of units to automatically merge continuous data and application of the algorithm to other data formats, including databases. PMID:28437440
Yeh, Ching-Hua; Hartmann, Monika; Hirsch, Stefan
2018-06-01
The presentation of credence attributes such as a product's origin or the production method has a significant influence on consumers' food purchase decisions. The dataset includes survey responses from a discrete choice experiment with 1309 food shoppers in Taiwan, using the example of sweet pepper. The survey was carried out in 2014 in the three largest Taiwanese cities. It evaluates the impact of providing information on the equality of organic standards on consumers' preferences. Equality of organic standards implies that, regardless of a product's country-of-origin (COO), organic certifications are based on the same production regulations and managerial processes. Respondents were randomly allocated to the information treatment and the control group. The dataset contains the product choices of participants in both groups, as well as their sociodemographic information.
Nawyn, John P.; Sargent, B. Pierre; Hoopes, Barbara; Augenstein, Todd; Rowland, Kathleen M.; Barber, Nancy L.
2017-10-06
The Aggregate Water-Use Data System (AWUDS) is the database management system used to enter, store, and analyze state aggregate water-use data. It is part of the U.S. Geological Survey National Water Information System. AWUDS has a graphical user interface that facilitates data entry, revision, review, and approval. This document provides information on the basic functions of AWUDS and the steps for carrying out common tasks that are a part of compiling an aggregated dataset. Also included are explanations of terminology and descriptions of user-interface structure, procedures for using the AWUDS operations, and dataset-naming conventions. Information on water-use category definitions, data-collection methods, and data sources are found in the report “Guidelines for preparation of State water-use estimates,” available at https://pubs.er.usgs.gov/publication/ofr20171029.
Wind and wave dataset for Matara, Sri Lanka
NASA Astrophysics Data System (ADS)
Luo, Yao; Wang, Dongxiao; Priyadarshana Gamage, Tilak; Zhou, Fenghua; Madusanka Widanage, Charith; Liu, Taiwei
2018-01-01
We present a continuous in situ hydro-meteorological observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse because of the difficulty of deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset as comprehensive information as possible. This dataset advances our understanding of nearshore hydrodynamic processes and the wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m, comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, swells are the main component of the waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).
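A buoy-versus-reanalysis comparison of the kind mentioned above typically reduces to bias, RMSE, and correlation of the paired series. The series below are synthetic stand-ins, not the archived Matara data.

```python
import numpy as np

def evaluate(obs, model):
    """Bias, RMSE and linear correlation of a model/reanalysis series
    against observations."""
    obs = np.asarray(obs)
    model = np.asarray(model)
    err = model - obs
    return {"bias": err.mean(),
            "rmse": np.sqrt((err ** 2).mean()),
            "corr": np.corrcoef(obs, model)[0, 1]}

# Synthetic significant-wave-height series (metres): a cyclic signal
# observed by the "buoy", reproduced by the "reanalysis" with a small
# positive bias and random error.
rng = np.random.default_rng(2)
hs_buoy = 1.5 + 0.5 * np.sin(np.linspace(0.0, 6.0 * np.pi, 240))
hs_model = hs_buoy + 0.1 + rng.normal(scale=0.05, size=240)
scores = evaluate(hs_buoy, hs_model)
```

Splitting the same comparison by swell and wind-sea partitions would mirror the monsoon-dependent behaviour reported above.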
Gesch, Dean B.; Oimoen, Michael J.; Evans, Gayla A.
2014-01-01
The National Elevation Dataset (NED) is the primary elevation data product produced and distributed by the U.S. Geological Survey. The NED provides seamless raster elevation data of the conterminous United States, Alaska, Hawaii, U.S. island territories, Mexico, and Canada. The NED is derived from diverse source datasets that are processed to a specification with consistent resolutions, coordinate system, elevation units, and horizontal and vertical datums. The NED serves as the elevation layer of The National Map, and it provides basic elevation information for earth science studies and mapping applications in the United States and most of North America. An important part of supporting scientific and operational use of the NED is provision of thorough dataset documentation including data quality and accuracy metrics. The focus of this report is on the vertical accuracy of the NED and on comparison of the NED with other similar large-area elevation datasets, namely data from the Shuttle Radar Topography Mission (SRTM) and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER).
Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L
2015-02-01
Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used as both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
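The cohort-matching queries described above can be sketched against a hypothetical, much-simplified schema; the real NLST tables, column names, and web stack differ, and this uses Python's built-in sqlite3 purely for illustration:

```python
import sqlite3

# Hypothetical, simplified schema; the actual NLST tables are far richer.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE participants (
    id INTEGER PRIMARY KEY, age INTEGER, pack_years REAL,
    nodule_size_mm REAL, cancer INTEGER)""")
conn.executemany(
    "INSERT INTO participants VALUES (?, ?, ?, ?, ?)",
    [(1, 62, 35.0, 6.5, 0), (2, 66, 48.0, 9.0, 1),
     (3, 58, 30.0, 4.0, 0), (4, 64, 40.0, 8.0, 1),
     (5, 61, 33.0, 7.0, 0)])

def cohort_cancer_rate(conn, age_lo, age_hi, min_nodule_mm):
    """Return (cohort size, cancer rate) for participants matching the query."""
    row = conn.execute(
        """SELECT COUNT(*), AVG(cancer) FROM participants
           WHERE age BETWEEN ? AND ? AND nodule_size_mm >= ?""",
        (age_lo, age_hi, min_nodule_mm)).fetchone()
    return row[0], row[1]

# Match a cohort similar to a hypothetical 63-year-old with a 7 mm nodule
n, rate = cohort_cancer_rate(conn, 60, 70, 6.0)
```

Indexing the matching columns would keep such queries fast enough for the real-time use the abstract describes.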
EnviroAtlas - Cleveland, OH - Estimated Percent Green Space Along Walkable Roads
This EnviroAtlas dataset estimates green space along walkable roads. Green space within 25 meters of the road centerline is included and the percentage is based on the total area between street intersections. In this community, green space is defined as Trees & Forest, Grass & Herbaceous, Woody Wetlands, and Emergent Wetlands. In this metric, water is also included in green space. Green space provides valuable benefits to neighborhood residents and walkers by providing shade, improved aesthetics, and outdoor gathering spaces. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
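The buffer-based metric described above (percent green within 25 meters of a road centerline) can be illustrated with a toy raster; EnviroAtlas derives the actual values from high-resolution land-cover data and street-intersection segments, so the grid setup and function name here are illustrative only:

```python
import math

def percent_green_near_road(grid, cell_size, p0, p1, buffer_m=25.0):
    """Percent of cells within `buffer_m` of segment p0-p1 that are green.

    `grid[r][c]` is True for green land cover; cell centres sit at
    ((c + 0.5) * cell_size, (r + 0.5) * cell_size) in metres.
    """
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    seg2 = dx * dx + dy * dy
    inside = green = 0
    for r, row in enumerate(grid):
        for c, is_green in enumerate(row):
            x = (c + 0.5) * cell_size
            y = (r + 0.5) * cell_size
            # project the cell centre onto the segment, clamped to its ends
            t = 0.0 if seg2 == 0 else max(0.0, min(1.0, ((x - x0) * dx + (y - y0) * dy) / seg2))
            d = math.hypot(x - (x0 + t * dx), y - (y0 + t * dy))
            if d <= buffer_m:
                inside += 1
                green += is_green
    return 100.0 * green / inside if inside else 0.0

# A fully green 100 m x 100 m block (10 m cells) with an east-west road
grid = [[True] * 10 for _ in range(10)]
pct = percent_green_near_road(grid, 10.0, (0.0, 50.0), (100.0, 50.0))
```

Restricting the tally to the area between street intersections, as the dataset does, would amount to running this per road segment.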
Bradbury, Kyle; Saboo, Raghav; L. Johnson, Timothy; Malof, Jordan M.; Devarajan, Arjun; Zhang, Wuming; M. Collins, Leslie; G. Newell, Richard
2016-01-01
Earth-observing remote sensing data, including aerial photography and satellite imagery, offer a snapshot of the world from which we can learn about the state of natural resources and the built environment. The components of energy systems that are visible from above can be automatically assessed with these remote sensing data when processed with machine learning methods. Here, we focus on the information gap for distributed solar photovoltaic (PV) arrays, for which limited public data on deployments are available at small geographic scales. We created a dataset of solar PV arrays to initiate and develop the process of automatically identifying solar PV locations using remote sensing imagery. This dataset contains the geospatial coordinates and border vertices for over 19,000 solar panels across 601 high-resolution images from four cities in California. Dataset applications include training object detection and other machine learning algorithms that use remote sensing imagery, developing specific algorithms for predictive detection of distributed PV systems, estimating installed PV capacity, and analysis of the socioeconomic correlates of PV deployment. PMID:27922592
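Two of the listed applications, deriving panel area from the border vertices and estimating installed capacity, can be sketched with the shoelace formula. The dataset stores vertices in geographic coordinates, so real use would first project to a metric CRS, and the watts-per-square-metre conversion factor below is an assumed typical value, not taken from the dataset:

```python
def polygon_area_m2(vertices):
    """Planar polygon area via the shoelace formula; vertices as (x, y) in metres."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0

def estimated_capacity_kw(area_m2, watts_per_m2=150.0):
    """Rough capacity estimate; 150 W/m^2 is an assumed module efficiency figure."""
    return area_m2 * watts_per_m2 / 1000.0

# Hypothetical 4 m x 2 m panel footprint
area = polygon_area_m2([(0, 0), (4, 0), (4, 2), (0, 2)])
capacity = estimated_capacity_kw(area)
```

Summing such estimates over all annotated panels in a region would give the small-geographic-scale capacity figures the abstract mentions.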
NASA Astrophysics Data System (ADS)
Bradbury, Kyle; Saboo, Raghav; L. Johnson, Timothy; Malof, Jordan M.; Devarajan, Arjun; Zhang, Wuming; M. Collins, Leslie; G. Newell, Richard
2016-12-01
Earth-observing remote sensing data, including aerial photography and satellite imagery, offer a snapshot of the world from which we can learn about the state of natural resources and the built environment. The components of energy systems that are visible from above can be automatically assessed with these remote sensing data when processed with machine learning methods. Here, we focus on the information gap for distributed solar photovoltaic (PV) arrays, for which limited public data on deployments are available at small geographic scales. We created a dataset of solar PV arrays to initiate and develop the process of automatically identifying solar PV locations using remote sensing imagery. This dataset contains the geospatial coordinates and border vertices for over 19,000 solar panels across 601 high-resolution images from four cities in California. Dataset applications include training object detection and other machine learning algorithms that use remote sensing imagery, developing specific algorithms for predictive detection of distributed PV systems, estimating installed PV capacity, and analysis of the socioeconomic correlates of PV deployment.
Bradbury, Kyle; Saboo, Raghav; L Johnson, Timothy; Malof, Jordan M; Devarajan, Arjun; Zhang, Wuming; M Collins, Leslie; G Newell, Richard
2016-12-06
Earth-observing remote sensing data, including aerial photography and satellite imagery, offer a snapshot of the world from which we can learn about the state of natural resources and the built environment. The components of energy systems that are visible from above can be automatically assessed with these remote sensing data when processed with machine learning methods. Here, we focus on the information gap for distributed solar photovoltaic (PV) arrays, for which limited public data on deployments are available at small geographic scales. We created a dataset of solar PV arrays to initiate and develop the process of automatically identifying solar PV locations using remote sensing imagery. This dataset contains the geospatial coordinates and border vertices for over 19,000 solar panels across 601 high-resolution images from four cities in California. Dataset applications include training object detection and other machine learning algorithms that use remote sensing imagery, developing specific algorithms for predictive detection of distributed PV systems, estimating installed PV capacity, and analysis of the socioeconomic correlates of PV deployment.
EnviroAtlas - Minneapolis/St. Paul, MN - Estimated Percent Green Space Along Walkable Roads
This EnviroAtlas dataset estimates green space along walkable roads. Green space within 25 meters of the road centerline is included and the percentage is based on the total area between street intersections. In this community, green space is defined as Trees and Forest, Grass and Herbaceous, Agriculture, Woody Wetlands, and Emergent Wetlands. In this metric, water is also included in green space. Green space provides valuable benefits to neighborhood residents and walkers by providing shade, improved aesthetics, and outdoor gathering spaces. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas/EnviroAtlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
2014-01-01
BACKGROUND Average real variability (ARV) is a recently proposed index for short-term blood pressure (BP) variability. We aimed to determine the minimum number of BP readings required to compute ARV without loss of prognostic information. METHODS ARV was calculated from a discovery dataset that included 24-hour ambulatory BP measurements for 1,254 residents (mean age = 56.6 years; 43.5% women) of Copenhagen, Denmark. Concordance between ARV from full (≥80 BP readings) and randomly reduced 24-hour BP recordings was examined, as was prognostic accuracy. A test dataset that included 5,353 subjects (mean age = 54.0 years; 45.6% women) with at least 48 BP measurements from 11 randomly recruited population cohorts was used to validate the results. RESULTS In the discovery dataset, a minimum of 48 BP readings allowed an accurate assessment of the association between cardiovascular risk and ARV. In the test dataset, over 10.2 years (median), 806 participants died (335 cardiovascular deaths, 206 cardiac deaths) and 696 experienced a major fatal or nonfatal cardiovascular event. Standardized multivariable-adjusted hazard ratios (HRs) were computed for associations between outcome and BP variability. Higher diastolic ARV in 24-hour ambulatory BP recordings predicted (P < 0.01) total (HR = 1.12), cardiovascular (HR = 1.19), and cardiac (HR = 1.19) mortality and fatal combined with nonfatal cerebrovascular events (HR = 1.16). Higher systolic ARV in 24-hour ambulatory BP recordings predicted (P < 0.01) total (HR = 1.12), cardiovascular (HR = 1.17), and cardiac (HR = 1.24) mortality. CONCLUSIONS Forty-eight BP readings over 24 hours were observed to be adequate to compute ARV without meaningful loss of prognostic information. PMID:23955605
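The ARV index itself is simple to compute: the mean absolute difference between successive blood-pressure readings. A minimal, unweighted sketch on hypothetical readings (the literature also uses time-interval-weighted variants):

```python
def average_real_variability(readings):
    """Unweighted ARV: mean absolute difference between successive BP readings."""
    if len(readings) < 2:
        raise ValueError("ARV needs at least two readings")
    return sum(abs(b - a) for a, b in zip(readings, readings[1:])) / (len(readings) - 1)

# Hypothetical systolic readings (mmHg): differences are 6, 8, and 6
arv = average_real_variability([120, 126, 118, 124])
```

Unlike a standard deviation, ARV is sensitive to the order of readings, which is why it captures short-term reading-to-reading variability.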
Becnel, Lauren B.; Darlington, Yolanda F.; Ochsner, Scott A.; Easton-Marks, Jeremy R.; Watkins, Christopher M.; McOwiti, Apollo; Kankanamge, Wasula H.; Wise, Michael W.; DeHart, Michael; Margolis, Ronald N.; McKenna, Neil J.
2015-01-01
Signaling pathways involving nuclear receptors (NRs), their ligands and coregulators, regulate tissue-specific transcriptomes in diverse processes, including development, metabolism, reproduction, the immune response and neuronal function, as well as in their associated pathologies. The Nuclear Receptor Signaling Atlas (NURSA) is a Consortium focused around a Hub website (www.nursa.org) that annotates and integrates diverse ‘omics datasets originating from the published literature and NURSA-funded Data Source Projects (NDSPs). These datasets are then exposed to the scientific community on an Open Access basis through user-friendly data browsing and search interfaces. Here, we describe the redesign of the Hub, version 3.0, to deploy “Web 2.0” technologies and add richer, more diverse content. The Molecule Pages, which aggregate information relevant to NR signaling pathways from myriad external databases, have been enhanced to include resources for basic scientists, such as post-translational modification sites and targeting miRNAs, and for clinicians, such as clinical trials. A portal to NURSA’s Open Access, PubMed-indexed journal Nuclear Receptor Signaling has been added to facilitate manuscript submissions. Datasets and information on reagents generated by NDSPs are available, as is information concerning periodic new NDSP funding solicitations. Finally, the new website integrates the Transcriptomine analysis tool, which allows for mining of millions of richly annotated public transcriptomic data points in the field, providing an environment for dataset re-use and citation, bench data validation and hypothesis generation. We anticipate that this new release of the NURSA database will have tangible, long term benefits for both basic and clinical research in this field. PMID:26325041
A curated database of cyanobacterial strains relevant for modern taxonomy and phylogenetic studies.
Ramos, Vitor; Morais, João; Vasconcelos, Vitor M
2017-04-25
The dataset herein described lays the groundwork for an online database of relevant cyanobacterial strains, named CyanoType (http://lege.ciimar.up.pt/cyanotype). It is a database that includes categorized cyanobacterial strains useful for taxonomic, phylogenetic or genomic purposes, with associated information obtained by means of a literature-based curation. The dataset lists 371 strains and represents the first version of the database (CyanoType v.1). Information for each strain includes strain synonymy and/or co-identity, strain categorization, habitat, accession numbers for molecular data, taxonomy and nomenclature notes according to three different classification schemes, hierarchical automatic classification, phylogenetic placement according to a selection of relevant studies (including this), and important bibliographic references. The database will be updated periodically, namely by adding new strains meeting the criteria for inclusion and by revising and adding up-to-date metadata for strains already listed. A global 16S rDNA-based phylogeny is provided in order to assist users when choosing the appropriate strains for their studies.
A curated database of cyanobacterial strains relevant for modern taxonomy and phylogenetic studies
Ramos, Vitor; Morais, João; Vasconcelos, Vitor M.
2017-01-01
The dataset herein described lays the groundwork for an online database of relevant cyanobacterial strains, named CyanoType (http://lege.ciimar.up.pt/cyanotype). It is a database that includes categorized cyanobacterial strains useful for taxonomic, phylogenetic or genomic purposes, with associated information obtained by means of a literature-based curation. The dataset lists 371 strains and represents the first version of the database (CyanoType v.1). Information for each strain includes strain synonymy and/or co-identity, strain categorization, habitat, accession numbers for molecular data, taxonomy and nomenclature notes according to three different classification schemes, hierarchical automatic classification, phylogenetic placement according to a selection of relevant studies (including this), and important bibliographic references. The database will be updated periodically, namely by adding new strains meeting the criteria for inclusion and by revising and adding up-to-date metadata for strains already listed. A global 16S rDNA-based phylogeny is provided in order to assist users when choosing the appropriate strains for their studies. PMID:28440791
Data Basin: Expanding Access to Conservation Data, Tools, and People
NASA Astrophysics Data System (ADS)
Comendant, T.; Strittholt, J.; Frost, P.; Ward, B. C.; Bachelet, D. M.; Osborne-Gowey, J.
2009-12-01
Mapping and spatial analysis are a fundamental part of problem solving in conservation science, yet spatial data are widely scattered, difficult to locate, and often unavailable. Valuable time and resources are wasted locating and gaining access to important biological, cultural, and economic datasets, scientific analysis, and experts. As conservation problems become more serious and the demand to solve them grows more urgent, a new way to connect science and practice is needed. To meet this need, an open-access, web tool called Data Basin (www.databasin.org) has been created by the Conservation Biology Institute in partnership with ESRI and the Wilburforce Foundation. Users of Data Basin can gain quick access to datasets, experts, groups, and tools to help solve real-world problems. Individuals and organizations can perform essential tasks such as exploring and downloading from a vast library of conservation datasets, uploading existing datasets, connecting to other external data sources, creating groups, and producing customized maps that can be easily shared. Data Basin encourages sharing and publishing, but also provides privacy and security for sensitive information when needed. Users can publish projects within Data Basin to tell more complete and rich stories of discovery and solutions. Projects are an ideal way to publish collections of datasets, maps and other information on the internet to reach wider audiences. Data Basin also houses individual centers that provide direct access to data, maps, and experts focused on specific geographic areas or conservation topics. Current centers being developed include the Boreal Information Centre, the Data Basin Climate Center, and proposed Aquatic and Forest Conservation Centers.
One tree to link them all: a phylogenetic dataset for the European tetrapoda.
Roquet, Cristina; Lavergne, Sébastien; Thuiller, Wilfried
2014-08-08
With the ever-increasing availability of phylogenetically informative data, the last decade has seen an upsurge of ecological studies incorporating information on evolutionary relationships among species. However, detailed species-level phylogenies are still lacking for many large groups and regions, which are necessary for comprehensive large-scale eco-phylogenetic analyses. Here, we provide a dataset of 100 dated phylogenetic trees for all European tetrapods based on a mixture of supermatrix and supertree approaches. Phylogenetic inference was performed separately for each of the main Tetrapoda groups of Europe except mammals (i.e. amphibians, birds, squamates and turtles) by means of maximum likelihood (ML) analyses of supermatrices, applying a tree constraint at the family (amphibians and squamates) or order (birds and turtles) levels based on consensus knowledge. For each group, we inferred 100 ML trees to be able to provide a phylogenetic dataset that accounts for phylogenetic uncertainty, and assessed node support with bootstrap analyses. Each tree was dated using penalized-likelihood and fossil calibration. The trees obtained were well-supported by existing knowledge and previous phylogenetic studies. For mammals, we modified the most complete supertree dataset available in the literature to include a recent update of the Carnivora clade. As a final step, we merged the phylogenetic trees of all groups to obtain a set of 100 phylogenetic trees for all European Tetrapoda species for which data were available (91%). We provide this phylogenetic dataset (100 chronograms) for the purpose of comparative analyses, macro-ecological or community ecology studies aiming to incorporate phylogenetic information while accounting for phylogenetic uncertainty.
EnviroAtlas - Durham, NC - Land Cover Summaries by Block Group
This EnviroAtlas dataset describes the percentage of each block group that is classified as impervious, forest, green space, wetland, and agriculture. Impervious is a combination of dark and light impervious. Green space is a combination of trees and forest and grass and herbaceous. This dataset also includes the area per capita for each block group for impervious, forest, and green space land cover. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas ) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets ).
MOLA-Based Landing Site Characterization
NASA Technical Reports Server (NTRS)
Duxbury, T. C.; Ivanov, A. B.
2001-01-01
The Mars Global Surveyor (MGS) Mars Orbiter Laser Altimeter (MOLA) data provide the basis for site characterization and selection never before possible. The basic MOLA information includes absolute radii, elevation, and 1-micrometer albedo, with derived datasets including digital image models (DIMs; illuminated elevation data), slope maps and slope statistics, and small-scale surface roughness maps and statistics. These quantities are useful in downsizing potential sites from descent engineering constraints and landing/roving hazard and mobility assessments. Slope baselines at the few hundred meter level and surface roughness at the 10 meter level are possible. Additionally, the MOLA-derived Mars surface offers the possibility to precisely register and map project other instrument datasets (images, ultraviolet, infrared, radar, etc.) taken at different resolution, viewing and lighting geometry, building multiple layers of an information cube for site characterization and selection. Examples of direct MOLA data, data derived from MOLA, and other instruments' data registered to MOLA are given for the Hematite area.
The road to NHDPlus — Advancements in digital stream networks and associated catchments
Moore, Richard B.; Dewald, Thomas A.
2016-01-01
A progression of advancements in Geographic Information Systems techniques for hydrologic network and associated catchment delineation has led to the production of the National Hydrography Dataset Plus (NHDPlus). NHDPlus is a digital stream network for hydrologic modeling with catchments and a suite of related geospatial data. Digital stream networks with associated catchments provide a geospatial framework for linking and integrating water-related data. Advancements in the development of NHDPlus are expected to continue to improve the capabilities of this national geospatial hydrologic framework. NHDPlus is built upon the medium-resolution NHD and, like NHD, was developed by the U.S. Environmental Protection Agency and U.S. Geological Survey to support the estimation of streamflow and stream velocity used in fate-and-transport modeling. Catchments included with NHDPlus were created by integrating vector information from the NHD and from the Watershed Boundary Dataset with the gridded land surface elevation as represented by the National Elevation Dataset. NHDPlus is an actively used and continually improved dataset. Users recognize the importance of a reliable stream network and associated catchments. The NHDPlus spatial features and associated data tables will continue to be improved to support regional water quality and streamflow models and other user-defined applications.
NASA Astrophysics Data System (ADS)
Conrads, P. A.; Tufford, D. L.; Darby, L. S.
2015-12-01
The phenomenon of coastal drought has a different dynamic from upland droughts that are typically characterized by agricultural, hydrologic, meteorological, and(or) socio-economic impacts. Because of the uniqueness of drought impacts on coastal ecosystems, a coastal drought index (CDI) that uses existing salinity datasets for sites in South Carolina, Georgia, and Florida was developed using an approach similar to the Standardized Precipitation Index (SPI). CDIs characterizing the 1- to 24-month salinity conditions were developed and the evaluation of the CDI indicates that the index can be used for different estuary types (for example, brackish, oligohaline, or mesohaline), for regional comparison between estuaries, and as an index for wet conditions (high freshwater inflow) in addition to drought conditions. Unlike the SPI, where long-term precipitation datasets of 50 to 100 years are available for computing the index, there are a limited number of salinity datasets of greater than 10 or 15 years for computing the CDI. To evaluate the length of salinity record necessary to compute the CDI, a 29-year dataset was resampled into 5-, 10-, 15-, and 20-year interval datasets. Comparison of the CDI for the different periods of record shows that the range of salinity conditions in the 10-, 15-, and 20-year datasets was similar and results were a close approximation to the CDI computed by using the full period of record. The CDI computed with the 5-year dataset had the largest differences with the CDI computed with the 29-year dataset but did provide useful information on coastal drought and freshwater conditions. An ongoing National Integrated Drought Information System (NIDIS) drought early warning project in the Carolinas is developing ecological linkages to the CDI and evaluating the effectiveness of the CDI as a prediction tool for adaptation planning for future droughts. However, identifying potential coastal drought response datasets is a challenge.
Coastal drought is a relatively new concept and existing datasets may not have been collected or understood as "drought response" datasets. We have considered drought response datasets including tree growth and litter fall, harmful algal bloom occurrence, Vibrio infection occurrence, shellfish harvesting data, and shark attacks.
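An SPI-style index maps an observed value through a fitted distribution onto a standard-normal score. A simplified sketch, substituting an empirical CDF for the distribution fitting that SPI-type indices actually use (the CDI methodology may differ in detail):

```python
from statistics import NormalDist

def standardized_index(reference_values, current):
    """Standard-normal score of `current` relative to a reference record.

    Uses a plotting-position empirical probability in place of a fitted
    distribution; positive scores mean saltier (drier) than typical.
    """
    n = len(reference_values)
    rank = sum(v <= current for v in reference_values)
    p = rank / (n + 1)  # keeps the probability strictly inside (0, 1)
    return NormalDist().inv_cdf(p)

# Hypothetical monthly-mean salinity record (practical salinity units)
record = list(range(1, 100))
score = standardized_index(record, 50)   # near the median -> score near 0
high = standardized_index(record, 90)    # unusually salty -> positive score
```

This construction also explains the record-length sensitivity discussed above: with only a few years of data, the empirical CDF poorly resolves the distribution tails that define drought severity.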
Map_plot and bgg_plot: software for integration of geoscience datasets
NASA Astrophysics Data System (ADS)
Gaillot, Philippe; Punongbayan, Jane T.; Rea, Brice
2004-02-01
Since 1985, the Ocean Drilling Program (ODP) has been supporting multidisciplinary research in exploring the structure and history of Earth beneath the oceans. After more than 200 Legs, complementary datasets covering different geological environments, periods and space scales have been obtained and distributed world-wide using the ODP-Janus and Lamont Doherty Earth Observatory-Borehole Research Group (LDEO-BRG) database servers. In Earth Sciences, more than in any other science, the ensemble of these data is characterized by heterogeneous formats and graphical representation modes. In order to fully and quickly assess this information, a set of Unix/Linux and Generic Mapping Tool-based C programs has been designed to convert and integrate datasets acquired during the present ODP and the future Integrated ODP (IODP) Legs. Using ODP Leg 199 datasets, we show examples of the capabilities of the proposed programs. The program map_plot is used to easily display datasets onto 2-D maps. The program bgg_plot (borehole geology and geophysics plot) displays data with respect to depth and/or time. The latter program includes depth shifting, filtering and plotting of core summary information, continuous and discrete-sample core measurements (e.g. physical properties, geochemistry, etc.), in situ continuous logs, magneto- and bio-stratigraphies, specific sedimentological analyses (lithology, grain size, texture, porosity, etc.), as well as core and borehole wall images. Outputs from both programs are initially produced in PostScript format that can be easily converted to Portable Document Format (PDF) or standard image formats (GIF, JPEG, etc.) using widely distributed conversion programs. Based on command line operations and customization of parameter files, these programs can be included in other shell- or database-scripts, automating plotting procedures of data requests. 
As open-source software, these programs can be customized and interfaced to fulfill any specific plotting need of geoscientists using ODP-like datasets.
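The depth shifting and alignment that bgg_plot performs before plotting can be illustrated in a minimal form; the function names here are illustrative Python stand-ins, not taken from the actual C programs:

```python
def depth_shift(depths, offset):
    """Apply a constant core-to-log depth offset (metres)."""
    return [d + offset for d in depths]

def interp_to_grid(depths, values, grid):
    """Linearly interpolate a depth-indexed series onto a common depth grid.

    `depths` must be sorted ascending; grid points outside the sampled
    interval yield None so gaps stay visible in a plot.
    """
    out = []
    for g in grid:
        if g < depths[0] or g > depths[-1]:
            out.append(None)
            continue
        for (d0, v0), (d1, v1) in zip(zip(depths, values),
                                      zip(depths[1:], values[1:])):
            if d0 <= g <= d1:
                t = 0.0 if d1 == d0 else (g - d0) / (d1 - d0)
                out.append(v0 + t * (v1 - v0))
                break
    return out

# Hypothetical porosity measurements aligned onto a 5 m grid
aligned = interp_to_grid([0.0, 10.0, 20.0], [1.0, 2.0, 3.0], [5.0, 10.0, 15.0])
shifted = depth_shift([0.0, 10.0], 2.5)
```

Resampling every series onto one depth grid is what lets heterogeneous core, log, and image data share a common depth axis in a single plot.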
Wang, Yaping; Nie, Jingxin; Yap, Pew-Thian; Li, Gang; Shi, Feng; Geng, Xiujuan; Guo, Lei; Shen, Dinggang
2014-01-01
Accurate and robust brain extraction is a critical step in most neuroimaging analysis pipelines. In particular, for the large-scale multi-site neuroimaging studies involving a significant number of subjects with diverse age and diagnostic groups, accurate and robust extraction of the brain automatically and consistently is highly desirable. In this paper, we introduce population-specific probability maps to guide the brain extraction of diverse subject groups, including both healthy and diseased adult human populations, both developing and aging human populations, as well as non-human primates. Specifically, the proposed method combines an atlas-based approach, for coarse skull-stripping, with a deformable-surface-based approach that is guided by local intensity information and population-specific prior information learned from a set of real brain images for more localized refinement. Comprehensive quantitative evaluations were performed on the diverse large-scale populations of ADNI dataset with over 800 subjects (55∼90 years of age, multi-site, various diagnosis groups), OASIS dataset with over 400 subjects (18∼96 years of age, wide age range, various diagnosis groups), and NIH pediatrics dataset with 150 subjects (5∼18 years of age, multi-site, wide age range as a complementary age group to the adult dataset). The results demonstrate that our method consistently yields the best overall results across almost the entire human life span, with only a single set of parameters. To demonstrate its capability to work on non-human primates, the proposed method is further evaluated using a rhesus macaque dataset with 20 subjects. Quantitative comparisons with popularly used state-of-the-art methods, including BET, Two-pass BET, BET-B, BSE, HWA, ROBEX and AFNI, demonstrate that the proposed method performs favorably with superior performance on all testing datasets, indicating its robustness and effectiveness. PMID:24489639
[Current problems in the data acquisition of digitized virtual human and the countermeasures].
Zhong, Shi-zhen; Yuan, Lin
2003-06-01
As a relatively new field of medical research that has attracted attention from researchers worldwide, the study of the digitized virtual human still awaits long-term dedicated effort for its full development. In the full array of research projects of the integrated Virtual Chinese Human project, the virtual visible human, virtual physical human, virtual physiome, and intellectualized virtual human must be included as the four essential constituent components. Primary importance should be given to solving the problems concerning data acquisition for the dataset of this immense project. Currently, 9 virtual human datasets have been established worldwide; these are critically analyzed in this paper, with special attention to problems in data storage and in the techniques employed for these datasets. On the basis of the current research status of the Virtual Chinese Human project, the authors propose some countermeasures for solving the problems in data acquisition for the dataset, which include (1) giving priority to quality control instead of merely racing for quantity and speed, (2) improving the placement of markers specific to the tissues and organs to meet the requirements of information technology, and (3) attending to the development potential of the dataset, which should have explicit relevance to specific practical applications.
Technical note: Space-time analysis of rainfall extremes in Italy: clues from a reconciled dataset
NASA Astrophysics Data System (ADS)
Libertino, Andrea; Ganora, Daniele; Claps, Pierluigi
2018-05-01
Like other Mediterranean areas, Italy is prone to the development of events with significant rainfall intensity, lasting for several hours. The main triggering mechanisms of these events are quite well known, but the aim of developing rainstorm hazard maps compatible with their actual probability of occurrence is still far from being reached. A systematic frequency analysis of these occasional highly intense events would require a complete countrywide dataset of sub-daily rainfall records, but this kind of information was still lacking for the Italian territory. In this work several sources of data are gathered, for assembling the first comprehensive and updated dataset of extreme rainfall of short duration in Italy. The resulting dataset, referred to as the Italian Rainfall Extreme Dataset (I-RED), includes the annual maximum rainfalls recorded in 1 to 24 consecutive hours from more than 4500 stations across the country, spanning the period between 1916 and 2014. A detailed description of the spatial and temporal coverage of the I-RED is presented, together with an exploratory statistical analysis aimed at providing preliminary information on the climatology of extreme rainfall at the national scale. Due to some legal restrictions, the database can be provided only under certain conditions. Taking into account the potentialities emerging from the analysis, a description of the ongoing and planned future work activities on the database is provided.
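The core quantity in I-RED, the annual maximum rainfall accumulated over 1 to 24 consecutive hours, can be computed with a simple sliding-window sum; the hourly values below are hypothetical:

```python
def annual_max_depth(hourly, duration_h):
    """Annual maximum rainfall (mm) accumulated over `duration_h` consecutive
    hours, from one calendar year of hourly rainfall depths."""
    if duration_h < 1 or duration_h > len(hourly):
        raise ValueError("invalid duration")
    window = sum(hourly[:duration_h])
    best = window
    for i in range(duration_h, len(hourly)):
        # slide the window one hour: add the new value, drop the oldest
        window += hourly[i] - hourly[i - duration_h]
        best = max(best, window)
    return best

series = [0, 5, 12, 30, 8, 0, 0, 2]  # hypothetical hourly rainfall (mm)
m1 = annual_max_depth(series, 1)     # peak single hour
m3 = annual_max_depth(series, 3)     # peak 3-hour accumulation
```

Running this for each station-year and each duration from 1 to 24 hours yields exactly the kind of annual-maximum table the dataset archives.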
76 FR 4904 - Agency Information Collection Request; 30-Day Public Comment Request
Federal Register 2010, 2011, 2012, 2013, 2014
2011-01-27
... datasets that are not specific to individuals' personal health information to improve decision making by... making health indicator datasets (data that is not associated with any individuals) and tools available.../health . These datasets and tools are anticipated to benefit development of applications, web-based tools...
EnviroAtlas - Austin, TX - Estimated Percent Green Space Along Walkable Roads
This EnviroAtlas dataset estimates green space along walkable roads. Green space within 25 meters of the road centerline is included and the percentage is based on the total area between street intersections. Green space provides valuable benefits to neighborhood residents and walkers by providing shade, improved aesthetics, and outdoor gathering spaces. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Austin, TX - Park Access by Block Group
This EnviroAtlas dataset shows the block group population that is within and beyond an easy walking distance (500m) of a park entrance. Park entrances were included in this analysis if they were within 5km of the EnviroAtlas community boundary. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Developing a Global Network of River Reaches in Preparation of SWOT
NASA Astrophysics Data System (ADS)
Lion, C.; Pavelsky, T.; Allen, G. H.; Beighley, E.; Schumann, G.; Durand, M. T.
2016-12-01
In 2020, the Surface Water and Ocean Topography satellite (SWOT), a joint mission of NASA, CNES, CSA, and the UK, will be launched. One of its major products will be measurements of continental water surfaces, including the width, height, and slope of rivers and the surface area and elevations of lakes. The mission will improve the monitoring of continental water and also our understanding of the interactions between different hydrologic reservoirs. For rivers, SWOT measurements of slope will be carried out over predefined river reaches. As such, an a priori dataset for rivers is needed in order to facilitate analysis of the raw SWOT data. The information required to produce this dataset includes measurements of river width, elevation, slope, planform, river network topology, and flow accumulation. To produce this product, we have linked two existing global datasets: the Global River Widths from Landsat (GRWL) database, which contains river centerline locations, widths, and a braiding index derived from Landsat imagery, and a modified version of the HydroSHEDS hydrologically corrected digital elevation product, which contains heights and flow accumulation measurements for streams at 3 arcseconds spatial resolution. Merging these two datasets requires considerable care. The difficulties lie, among others, in the difference in resolution (30 m versus 3 arcseconds) and in the age of the datasets (2000 versus 2010; some rivers have moved, and the braided sections differ). As such, we have developed custom software to merge the two datasets, taking into account the spatial proximity of river channels in the two datasets and ensuring that flow accumulation in the final dataset always increases downstream. Here, we present our results for the globe.
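The constraint that flow accumulation must never decrease downstream can be enforced with a single running-maximum pass over each river's nodes ordered from upstream to downstream. This is a simplified sketch of the idea, not the authors' software; the profile values are hypothetical.

```python
def enforce_downstream_monotonicity(flow_acc):
    """Given flow-accumulation values ordered upstream-to-downstream
    along one river, raise any value that dips below its upstream
    neighbour (a running maximum), so accumulation never decreases."""
    fixed = []
    running_max = float("-inf")
    for value in flow_acc:
        running_max = max(running_max, value)
        fixed.append(running_max)
    return fixed

# Hypothetical profile with an artefact at index 2 (e.g. a mismatched channel)
profile = [10.0, 55.0, 40.0, 60.0, 58.0, 90.0]
print(enforce_downstream_monotonicity(profile))
# [10.0, 55.0, 55.0, 60.0, 60.0, 90.0]
```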
DSCOVR EPIC L2 Atmospheric Correction (MAIAC) Data Release Announcement
Atmospheric Science Data Center
2018-06-22
... several atmospheric quantities including cloud mask and aerosol optical depth (AOD) required for atmospheric correction. The parameters ... is a useful complementary dataset to MODIS and VIIRS global aerosol products. Information about the DSCOVR EPIC Atmospheric ...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Magome, T; Haga, A; Igaki, H
Purpose: Although many outcome prediction models based on dose-volume information have been proposed, it is well known that the prognosis may also be affected by multiple clinical factors. The purpose of this study is to predict the survival time after radiotherapy for high-grade glioma patients based on features including clinical and dose-volume histogram (DVH) information. Methods: A total of 35 patients with high-grade glioma (oligodendroglioma: 2, anaplastic astrocytoma: 3, glioblastoma: 30) were selected for this study. All patients were treated with a prescribed dose of 30–80 Gy after surgical resection or biopsy from 2006 to 2013 at The University of Tokyo Hospital. All cases were randomly separated into a training dataset (30 cases) and a test dataset (5 cases). The survival time after radiotherapy was predicted based on a multiple linear regression analysis and an artificial neural network (ANN) using 204 candidate features. The candidate features included 12 clinical features (tumor location, extent of surgical resection, treatment duration of radiotherapy, etc.) and 192 DVH features (maximum dose, minimum dose, D95, V60, etc.). The effective features for the prediction were selected by a step-wise method using the 30 training cases. The prediction accuracy was evaluated by the coefficient of determination (R²) between the predicted and actual survival times for the training and test datasets. Results: In the multiple regression analysis, R² between the predicted and actual survival time was 0.460 for the training dataset and 0.375 for the test dataset. In the ANN analysis, R² was 0.806 for the training dataset and 0.811 for the test dataset. Conclusion: Although a larger number of patients would be needed for more accurate and robust prediction, our preliminary result shows the potential to predict the outcome in patients with high-grade glioma.
This work was partly supported by the JSPS Core-to-Core Program (No. 23003) and a Grant-in-Aid for JSPS Fellows.
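The multiple-linear-regression arm of such a study, with its 30/5 split and R² evaluation, can be sketched with ordinary least squares. This is a toy stand-in, not the study's pipeline: the feature matrix, coefficients, and noise level below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the study's data: 35 patients, 5 selected
# clinical/DVH features (all values invented)
X = rng.normal(size=(35, 5))
true_w = np.array([12.0, -8.0, 3.0, 0.0, 5.0])
y = X @ true_w + 24.0 + rng.normal(scale=2.0, size=35)  # survival time (months)

# Random 30/5 split, mirroring the training/test partition
idx = rng.permutation(35)
train, test = idx[:30], idx[30:]

# Multiple linear regression via least squares (intercept column appended)
A = np.column_stack([X[train], np.ones(len(train))])
coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)

def r_squared(X_part, y_part):
    """Coefficient of determination between predicted and actual values."""
    pred = np.column_stack([X_part, np.ones(len(X_part))]) @ coef
    ss_res = np.sum((y_part - pred) ** 2)
    ss_tot = np.sum((y_part - y_part.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(f"R^2 (train): {r_squared(X[train], y[train]):.3f}")
print(f"R^2 (test):  {r_squared(X[test], y[test]):.3f}")
```

A real replication would add the step-wise feature selection over 204 candidates and an ANN for comparison.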
Boiret, Mathieu; de Juan, Anna; Gorretta, Nathalie; Ginot, Yves-Michel; Roger, Jean-Michel
2015-01-25
In this work, Raman hyperspectral images and multivariate curve resolution-alternating least squares (MCR-ALS) are used to study the distribution of actives and excipients within a pharmaceutical drug product. This article is mainly focused on the distribution of a low-dose constituent. Different approaches are compared, using initially filtered or non-filtered data, or using a column-wise augmented dataset before starting the MCR-ALS iterative process, including appended information on the low-dose component. In the studied formulation, magnesium stearate is used as a lubricant to improve powder flowability. With a theoretical concentration of 0.5% (w/w) in the drug product, the spectral variance it contributes to the data is weak. By using a principal component analysis (PCA) filtered dataset as a first step of the MCR-ALS approach, the lubricant information is lost in the non-explained variance and its associated distribution in the tablet cannot be highlighted. A sufficient number of components has to be used to generate the PCA noise-filtered matrix in order to keep the lubricant variability within the analyzed data set; otherwise, the raw non-filtered data must be used. Different models are built using an increasing number of components to perform the PCA reduction. It is shown that the magnesium stearate information can be extracted from a PCA model using a minimum of 20 components. In the last part, a column-wise augmented matrix, including a reference spectrum of the lubricant, is used before starting the MCR-ALS process. PCA reduction is performed on the augmented matrix, so the magnesium stearate contribution is included within the MCR-ALS calculations. By using an appropriate PCA reduction, with a sufficient number of components, or by using an augmented dataset including appended information on the low-dose component, the distributions of the two actives, the two main excipients, and the low-dose lubricant are correctly recovered. Copyright © 2014 Elsevier B.V.
All rights reserved.
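The central effect described above, a minor constituent vanishing from a PCA noise-filtered matrix when too few components are kept, can be reproduced on synthetic data. The spectra and concentrations below are invented stand-ins, not Raman measurements; in this toy case two components suffice to recover the minor species, whereas the real data in the study needed about 20.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented 'pure spectra' on a 200-channel axis: a major constituent and
# a minor one at ~0.5% (stand-in for magnesium stearate)
axis = np.arange(200)
major = np.exp(-((axis - 60) / 15.0) ** 2)
minor = np.exp(-((axis - 150) / 10.0) ** 2)

# 500 'pixels' with slightly varying concentrations, plus measurement noise
c_major = 1.0 + 0.1 * rng.normal(size=500)
c_minor = 0.005 * (1.0 + 0.1 * rng.normal(size=500))
D = np.outer(c_major, major) + np.outer(c_minor, minor)
D += 1e-4 * rng.normal(size=D.shape)

def pca_filter(data, k):
    """Reconstruct mean-centred data from its first k principal components."""
    mean = data.mean(axis=0)
    U, s, Vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean + (U[:, :k] * s[:k]) @ Vt[:k]

def minor_recovery(k):
    """Correlation between the filtered minor-band signal and the true
    minor-constituent concentration: ~0 if the component was filtered out."""
    band = pca_filter(D, k) @ minor
    return np.corrcoef(band, c_minor)[0, 1]

print("k=1:", round(minor_recovery(1), 3))
print("k=2:", round(minor_recovery(2), 3))
```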
Advancing Collaboration through Hydrologic Data and Model Sharing
NASA Astrophysics Data System (ADS)
Tarboton, D. G.; Idaszak, R.; Horsburgh, J. S.; Ames, D. P.; Goodall, J. L.; Band, L. E.; Merwade, V.; Couch, A.; Hooper, R. P.; Maidment, D. R.; Dash, P. K.; Stealey, M.; Yi, H.; Gan, T.; Castronova, A. M.; Miles, B.; Li, Z.; Morsy, M. M.
2015-12-01
HydroShare is an online, collaborative system for open sharing of hydrologic data, analytical tools, and models. It supports the sharing of and collaboration around "resources" which are defined primarily by standardized metadata, content data models for each resource type, and an overarching resource data model based on the Open Archives Initiative's Object Reuse and Exchange (OAI-ORE) standard and a hierarchical file packaging system called "BagIt". HydroShare expands the data sharing capability of the CUAHSI Hydrologic Information System by broadening the classes of data accommodated to include geospatial and multidimensional space-time datasets commonly used in hydrology. HydroShare also includes new capability for sharing models, model components, and analytical tools and will take advantage of emerging social media functionality to enhance information about and collaboration around hydrologic data and models. It also supports web services and server/cloud based computation operating on resources for the execution of hydrologic models and analysis and visualization of hydrologic data. HydroShare uses iRODS as a network file system for underlying storage of datasets and models. Collaboration is enabled by casting datasets and models as "social objects". Social functions include both private and public sharing, formation of collaborative groups of users, and value-added annotation of shared datasets and models. The HydroShare web interface and social media functions were developed using the Django web application framework coupled to iRODS. Data visualization and analysis is supported through the Tethys Platform web GIS software stack. Links to external systems are supported by RESTful web service interfaces to HydroShare's content. This presentation will introduce the HydroShare functionality developed to date and describe ongoing development of functionality to support collaboration and integration of data and models.
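HydroShare's reliance on the BagIt file-packaging convention can be illustrated with a minimal in-memory bag: a payload under `data/` plus an MD5 manifest, following the BagIt specification. This is a sketch of the packaging idea only, not HydroShare's implementation, and the file names and contents are hypothetical.

```python
import hashlib

def make_bag(payload):
    """Build a minimal BagIt bag (as a dict of path -> bytes) from a
    mapping of payload file names to their contents."""
    bag = {"bagit.txt": b"BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"}
    manifest_lines = []
    for name, content in sorted(payload.items()):
        path = f"data/{name}"
        bag[path] = content
        digest = hashlib.md5(content).hexdigest()  # payload checksum
        manifest_lines.append(f"{digest}  {path}")
    bag["manifest-md5.txt"] = ("\n".join(manifest_lines) + "\n").encode()
    return bag

# Hypothetical resource payload: metadata plus an observations file
bag = make_bag({
    "resourcemetadata.xml": b"<metadata/>",
    "observations.csv": b"time,stage_m\n2015-01-01,1.42\n",
})
for path in sorted(bag):
    print(path)
```

The manifest is what makes a bag self-validating: a consumer can rehash every payload file and compare against `manifest-md5.txt`.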
Husain, Syed S; Kalinin, Alexandr; Truong, Anh; Dinov, Ivo D
Intuitive formulation of informative and computationally efficient queries on big and complex datasets presents a number of challenges. As data collection becomes increasingly streamlined and ubiquitous, data exploration, discovery, and analytics get considerably harder. Exploratory querying of heterogeneous and multi-source information is both difficult and necessary to advance our knowledge about the world around us. We developed a mechanism to integrate dispersed multi-source data and serve the mashed information via human and machine interfaces in a secure, scalable manner. This process facilitates the exploration of subtle associations between variables, population strata, or clusters of data elements, which may be opaque to standard independent inspection of the individual sources. This new platform includes a device-agnostic tool (Dashboard webapp, http://socr.umich.edu/HTML5/Dashboard/) for graphically querying, navigating, and exploring the multivariate associations in complex heterogeneous datasets. The paper illustrates this core functionality and service-oriented infrastructure using healthcare data (e.g., US data from the 2010 Census, Demographic and Economic surveys, the Bureau of Labor Statistics, and the Center for Medicare Services) as well as Parkinson's disease neuroimaging data. Both the back-end data archive and the front-end dashboard interfaces are continuously expanded to include additional data elements and new ways to customize the human and machine interactions. A client-side data import utility allows for easy and intuitive integration of user-supplied datasets. This completely open-science framework may be used for exploratory analytics, confirmatory analyses, meta-analyses, and education and training purposes in a wide variety of fields.
Global Precipitation Measurement: Methods, Datasets and Applications
NASA Technical Reports Server (NTRS)
Tapiador, Francisco; Turk, Francis J.; Petersen, Walt; Hou, Arthur Y.; Garcia-Ortega, Eduardo; Machado, Luiz, A. T.; Angelis, Carlos F.; Salio, Paola; Kidd, Chris; Huffman, George J.;
2011-01-01
This paper reviews the many aspects of precipitation measurement that are relevant to providing an accurate global assessment of this important environmental parameter. Methods discussed include ground data, satellite estimates and numerical models. First, the methods for measuring, estimating, and modeling precipitation are discussed. Then, the most relevant datasets gathering precipitation information from those three sources are presented. The third part of the paper illustrates a number of the many applications of those measurements and databases. The aim of the paper is to organize the many links and feedbacks between precipitation measurement, estimation and modeling, indicating the uncertainties and limitations of each technique in order to identify areas requiring further attention, and to show the limits within which datasets can be used.
Connecting the Public to Scientific Research Data - Science On a Sphere®
NASA Astrophysics Data System (ADS)
Henderson, M. A.; Russell, E. L.; Science on a Sphere Datasets
2011-12-01
Science On a Sphere® is a six-foot animated globe developed by the National Oceanic and Atmospheric Administration (NOAA) as a means to display global scientific research data in an intuitive, engaging format in public forums. With over 70 permanent installations of SOS around the world in science museums, visitors' centers, and universities, the audience that enjoys SOS yearly is substantial, wide-ranging, and diverse. Through partnerships with the National Aeronautics and Space Administration (NASA), the SOS Data Catalog (http://sos.noaa.gov/datasets/) has grown to a collection of over 350 datasets from NOAA, NASA, and many others. Using an external projection system, these datasets are displayed onto the sphere, creating a seamless global image. In a cross-site evaluation of Science On a Sphere®, 82% of participants said that seeing information displayed on a sphere changed their understanding of the information. This unique technology captivates viewers and exposes them to scientific research data in a way that is accessible, presentable, and understandable. The datasets that comprise the SOS Data Catalog are scientific research data that have been formatted for display on SOS. By formatting research data into visualizations that can be used on SOS, NOAA and NASA are able to turn research data into educational materials that are easily accessible for users. In many cases, visualizations do not need to be modified because SOS uses a common map projection. The SOS Data Catalog has become a "one-stop shop" for a broad range of global datasets from across NOAA and NASA, and as a result, the traffic on the site is more than just SOS users.
While the target audience for this site is SOS users, many inquiries come from teachers, book editors, film producers and students interested in using the available datasets. The SOS Data Catalog online includes a written description of each dataset, rendered images of the data, animated movies of the data, links to more information, details on the data source and creator, and a link to an FTP server where each dataset can be downloaded. Many of the datasets are also displayed on the SOS YouTube Channel and Facebook page. In addition, NASA has developed NASA Earth Observations (NEO), which is a collection of global satellite datasets. The NEO website allows users to layer multiple datasets and perform basic analysis. Through a new iPad application, the NASA Earth Observations datasets can be exported to SOS and analyzed on the sphere. This new capability greatly expands the number of datasets that can be shown on SOS and adds a new element of interactivity with the datasets.
Vispubdata.org: A Metadata Collection About IEEE Visualization (VIS) Publications.
Isenberg, Petra; Heimerl, Florian; Koch, Steffen; Isenberg, Tobias; Xu, Panpan; Stolper, Charles D; Sedlmair, Michael; Chen, Jian; Moller, Torsten; Stasko, John
2017-09-01
We have created and made available to all a dataset with information about every paper that has appeared at the IEEE Visualization (VIS) set of conferences: InfoVis, SciVis, VAST, and Vis. The information about each paper includes its title, abstract, authors, and citations to other papers in the conference series, among many other attributes. This article describes the motivation for creating the dataset, as well as our process of coalescing and cleaning the data, and a set of three visualizations we created to facilitate exploration of the data. This data is meant to be useful to the broad data visualization community to help understand the evolution of the field and as an example document collection for text data visualization research.
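Working with a metadata collection like this typically starts with simple aggregate queries, for example counting how often each paper is cited within the conference series. The records below are invented stand-ins for the real Vispubdata entries, kept to a minimal schema.

```python
# Invented stand-ins for Vispubdata-style records: DOI, title, conference,
# and the DOIs of within-series papers each one cites.
papers = [
    {"doi": "10.0/a", "title": "Paper A", "conf": "InfoVis", "cites": []},
    {"doi": "10.0/b", "title": "Paper B", "conf": "SciVis",  "cites": ["10.0/a"]},
    {"doi": "10.0/c", "title": "Paper C", "conf": "VAST",    "cites": ["10.0/a", "10.0/b"]},
]

def within_series_citation_counts(records):
    """Count how many times each paper is cited by other papers in the set."""
    counts = {p["doi"]: 0 for p in records}
    for p in records:
        for cited in p["cites"]:
            if cited in counts:  # ignore citations to papers outside the series
                counts[cited] += 1
    return counts

print(within_series_citation_counts(papers))
# {'10.0/a': 2, '10.0/b': 1, '10.0/c': 0}
```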
Ratz, Joan M.; Conk, Shannon J.
2014-01-01
The Gap Analysis Program (GAP) of the U.S. Geological Survey (USGS) produces geospatial datasets providing information on land cover, predicted species distributions, stewardship (ownership and conservation status), and an analysis dataset which synthesizes the other three datasets. The intent in providing these datasets is to support the conservation of biodiversity. The datasets are made available at no cost. The initial datasets were created at the state level. More recent datasets have been assembled at regional and national levels. GAP entered an agreement with the Policy Analysis and Science Assistance branch of the USGS to conduct an evaluation to describe the effect that using GAP data has on those who utilize the datasets (GAP users). The evaluation project included multiple components: a discussion regarding use of GAP data conducted with participants at a GAP conference, a literature review of publications that cited use of GAP data, and a survey of GAP users. The findings of the published literature search were used to identify topics to include on the survey. This report summarizes the literature search, the characteristics of the resulting set of publications, the emergent themes from statements made regarding GAP data, and a bibliometric analysis of the publications. We cannot claim that this list includes all publications that have used GAP data. Given the time lapse that is common in the publishing process, more recent datasets may be cited less frequently in this list of publications. Reports or products that used GAP data may be produced but never published in print or released online. In that case, our search strategies would not have located those reports. Authors may have used GAP data but failed to cite it in such a way that the search strategies we used would have located those publications. These are common issues when using a literature search as part of an evaluation project. 
Although the final list of publications we identified is not comprehensive, this set of publications can be considered a sufficient sample of those citing GAP data and suitable for the descriptive analyses we conducted.
Heterogeneous data fusion for brain tumor classification.
Metsis, Vangelis; Huang, Heng; Andronesi, Ovidiu C; Makedon, Fillia; Tzika, Aria
2012-10-01
Current research in biomedical informatics involves analysis of multiple heterogeneous data sets. This includes patient demographics, clinical and pathology data, treatment history, patient outcomes as well as gene expression, DNA sequences and other information sources such as gene ontology. Analysis of these data sets could lead to better disease diagnosis, prognosis, treatment and drug discovery. In this report, we present a novel machine learning framework for brain tumor classification based on heterogeneous data fusion of metabolic and molecular datasets, including state-of-the-art high-resolution magic angle spinning (HRMAS) proton (1H) magnetic resonance spectroscopy and gene transcriptome profiling, obtained from intact brain tumor biopsies. Our experimental results show that our novel framework outperforms analyses based on any individual dataset.
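A common baseline for this kind of heterogeneous fusion is feature-level concatenation: standardise each modality separately so neither dominates by scale, then stack the feature blocks. The sketch below uses invented toy data and a nearest-centroid classifier as a stand-in for the paper's framework.

```python
import numpy as np

rng = np.random.default_rng(2)

def zscore(block):
    """Standardise one modality's feature block column-wise."""
    return (block - block.mean(axis=0)) / block.std(axis=0)

def fuse(metabolic, genomic):
    """Feature-level fusion: per-modality standardisation, then concatenation."""
    return np.hstack([zscore(metabolic), zscore(genomic)])

# Toy two-class data: 20 'biopsies', two modalities of very different scales
labels = np.array([0] * 10 + [1] * 10)
metabolic = rng.normal(size=(20, 6)) + labels[:, None] * 2.0          # ~unit scale
genomic = 1000 * rng.normal(size=(20, 40)) - labels[:, None] * 1500   # large scale

fused = fuse(metabolic, genomic)

# Nearest-centroid classification on the fused features
centroids = np.array([fused[labels == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((fused[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)
print("training accuracy:", (pred == labels).mean())
```

Without the per-block standardisation, the large-scale genomic block would swamp the metabolic features, which is why fusion schemes normalise modalities before combining them.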
NASA Astrophysics Data System (ADS)
Hoyos, Isabel; Baquero-Bernal, Astrid; Hagemann, Stefan
2013-09-01
In Colombia, access to climate-related observational data is restricted and the available data are limited. But information about the current climate is fundamental for studies on present and future climate changes and their impacts. This information is especially important for the Colombian Caribbean Catchment Basin (CCCB), which comprises over 80 % of the population of Colombia and produces about 85 % of its GDP. Consequently, an ensemble of several datasets has been evaluated and compared with respect to their capability to represent the climate over the CCCB. The comparison includes observations, reconstructed data (CPC, Delaware), reanalyses (ERA-40, NCEP/NCAR), and simulated data produced with the regional climate model REMO. The capabilities to represent the average annual state, the seasonal cycle, and the interannual variability are investigated. The analyses focus on surface air temperature and precipitation as well as on surface water and energy balances. On the one hand, the characteristics of the CCCB pose some difficulties for the datasets, as the basin includes a mountainous region with three mountain ranges, where the dynamical core of models and model parameterizations can fail. On the other hand, it has the densest network of stations, with the longest records, in the country. The results can be summarised as follows: all of the datasets show a cold bias in the average temperature of the CCCB. However, the variability of the average temperature of the CCCB is most poorly represented by the NCEP/NCAR dataset. The average precipitation in the CCCB is overestimated by all datasets. For the ERA-40, NCEP/NCAR, and REMO datasets, the amplitude of the annual cycle is extremely high. The variability of the average precipitation in the CCCB is better represented by the reconstructed data of CPC and Delaware, as well as by NCEP/NCAR.
Regarding the capability to represent the spatial behaviour of the CCCB, temperature is better represented by Delaware and REMO, while precipitation is better represented by Delaware. Among the three datasets that permit an analysis of surface water and energy balances (REMO, ERA-40, and NCEP/NCAR), REMO best demonstrates the closure property of the surface water balance within the basin, while NCEP/NCAR does not demonstrate this property well. The three datasets represent the energy balance fairly well, although some inconsistencies were found in the individual balance components for NCEP/NCAR.
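Two of the comparison metrics used above, mean bias and annual-cycle amplitude, are straightforward to compute from monthly climatologies. The numbers below are invented for illustration, not taken from the study.

```python
def mean_bias(dataset, reference):
    """Mean difference between a dataset's monthly climatology and a reference
    (negative values indicate a cold bias for temperature)."""
    return sum(d - r for d, r in zip(dataset, reference)) / len(reference)

def annual_cycle_amplitude(monthly):
    """Amplitude of the mean annual cycle: max minus min monthly value."""
    return max(monthly) - min(monthly)

# Invented monthly temperature climatologies (deg C) for a CCCB-like basin
observed   = [26.1, 26.3, 26.6, 26.8, 26.7, 26.4, 26.2, 26.3, 26.1, 25.9, 25.9, 26.0]
reanalysis = [24.8, 25.0, 25.4, 25.7, 25.6, 25.2, 24.9, 25.0, 24.8, 24.6, 24.6, 24.7]

print(f"bias: {mean_bias(reanalysis, observed):+.2f} deg C")
print(f"amplitude (obs):  {annual_cycle_amplitude(observed):.2f} deg C")
print(f"amplitude (rean): {annual_cycle_amplitude(reanalysis):.2f} deg C")
```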
Automatic Residential/Commercial Classification of Parcels with Solar Panel Detections
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morton, April M; Omitaomu, Olufemi A; Kotikot, Susan
A computational method to automatically detect solar panels on rooftops to aid policy and financial assessment of solar distributed generation. The code automatically classifies parcels containing solar panels in the U.S. as residential or commercial. The code allows the user to specify an input dataset containing parcels and detected solar panels, and then uses information about the parcels and solar panels to automatically classify the rooftops as residential or commercial using machine learning techniques. The zip file containing the code includes sample input and output datasets for the Boston and DC areas.
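The residential/commercial classification step can be sketched with a plain logistic regression on simple parcel and panel features. This is a hedged stand-in for the tool's machine-learning component: the feature set, class structure, and data below are all invented.

```python
import numpy as np

rng = np.random.default_rng(3)

def train_logistic(X, y, lr=0.1, epochs=500):
    """Plain gradient-descent logistic regression (bias column appended)."""
    A = np.column_stack([X, np.ones(len(X))])
    w = np.zeros(A.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-A @ w))
        w -= lr * A.T @ (p - y) / len(y)
    return w

def predict(w, X):
    A = np.column_stack([X, np.ones(len(X))])
    return (A @ w > 0).astype(int)

# Invented parcel features: [log parcel area, panel count, roof share covered]
# Label 1 = commercial (larger parcels, more panels), 0 = residential
n = 200
y = rng.integers(0, 2, n)
X = np.column_stack([
    rng.normal(6.5 + 1.5 * y, 0.5),    # log parcel area
    rng.poisson(2 + 10 * y),           # detected solar panels
    rng.normal(0.05 + 0.10 * y, 0.03)  # fraction of roof covered
]).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise features

w = train_logistic(X, y)
print("training accuracy:", (predict(w, X) == y).mean())
```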
Lukeš, Tomáš; Pospíšil, Jakub; Fliegel, Karel; Lasser, Theo; Hagen, Guy M
2018-03-01
Super-resolution single molecule localization microscopy (SMLM) is a method for achieving resolution beyond the classical limit in optical microscopes (approx. 200 nm laterally). Yellow fluorescent protein (YFP) has been used for super-resolution single molecule localization microscopy, but less frequently than other fluorescent probes. Working with YFP in SMLM is a challenge because a lower number of photons are emitted per molecule compared with organic dyes, which are more commonly used. Publicly available experimental data can facilitate development of new data analysis algorithms. Four complete, freely available single molecule super-resolution microscopy datasets on YFP-tagged growth factor receptors expressed in a human cell line are presented, including both raw and analyzed data. We report methods for sample preparation, for data acquisition, and for data analysis, as well as examples of the acquired images. We also analyzed the SMLM datasets using a different method: super-resolution optical fluctuation imaging (SOFI). The two modes of analysis offer complementary information about the sample. A fifth single molecule super-resolution microscopy dataset acquired with the dye Alexa 532 is included for comparison purposes. This dataset has potential for extensive reuse. Complete raw data from SMLM experiments have typically not been published. The YFP data exhibit low signal-to-noise ratios, making data analysis a challenge. These datasets will be useful to investigators developing their own algorithms for SMLM, SOFI, and related methods. The data will also be useful for researchers investigating growth factor receptors such as ErbB3.
Antarctic Starfish (Echinodermata, Asteroidea) from the ANDEEP3 expedition.
Danis, Bruno; Jangoux, Michel; Wilmes, Jennifer
2012-01-01
This dataset includes information on sea stars collected during the ANDEEP3 expedition, which took place in 2005. The expedition focused on deep-sea stations in the Powell Basin and Weddell Sea. Sea stars were collected using an Agassiz trawl (3 m, mesh size 500 µm) deployed at 16 stations during the ANTXXII/3 (ANDEEP3, PS72) expedition of the RV Polarstern. Sampling depth ranged from 1047 to 4931 m. Trawling distance ranged from 731 to 3841 m. The sampling area ranges from 41°S to 71°S (latitude) and from 0° to 65°W (longitude). A complete list of stations is available from the PANGAEA data system (http://www.pangaea.de/PHP/CruiseReports.php?b=Polarstern), including a cruise report (http://epic-reports.awi.de/3694/1/PE_72.pdf). The dataset includes 50 records, with individual counts ranging from 1 to 10, reaching a total of 132 specimens. The andeep3-Asteroidea is a unique dataset in that it covers an under-explored region of the Southern Ocean, for which very little information on deep-sea starfish was available. Before this study, most of the available information concerned starfish from depths shallower than 1000 m. This dataset allowed us to make unique observations, such as the fact that some species were present only at very great depths (Hymenaster crucifer, Hymenaster pellucidus, Hymenaster praecoquis, Psilaster charcoti, Freyella attenuata, Freyastera tuberculata, Styrachaster chuni and Vemaster sudatlanticus were all found below 3770 m), while others displayed remarkable eurybathy, with very large depth amplitudes (Bathybiaster loripes (4842 m), Lysasterias adeliae (4832 m), Lophaster stellans (4752 m), Cheiraster planeta (4708 m), Eremicaster crassus (4626 m), Lophaster gaini (4560 m) and Ctenodiscus australis (4489 m)). Even if the number of records is relatively small, the data bring many new insights into the taxonomic, bathymetric and geographic distributions of Southern Ocean starfish, covering a very large sampling zone.
The dataset also brings to light six species newly reported in the Southern Ocean. The quality of the data was controlled very thoroughly, by means of on-board Polarstern GPS systems, verification of identifications by a renowned specialist (Prof. Michel Jangoux, Université Libre de Bruxelles), and matching to the Register of Antarctic Marine Species (RAMS) and the World Register of Marine Species (WoRMS). The data are therefore fit for completing checklists, for inclusion in biodiversity pattern analyses, or for niche modeling. The dataset also fills an information gap regarding deep-sea starfish from the Southern Ocean, for which data are very scarce at this time. The authors may be contacted if any additional information is needed before carrying out detailed biodiversity or biogeographic studies.
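The bathymetric observations reported above (minimum depths, eurybathy) are the kind of summary that falls out of a per-species aggregation of occurrence records. The records below are invented examples in the dataset's spirit, not actual ANDEEP3 rows, and the 3000 m eurybathy cut-off is an arbitrary illustration.

```python
# Invented occurrence records: (species, depth in metres)
records = [
    ("Hymenaster crucifer", 3890), ("Hymenaster crucifer", 4510),
    ("Bathybiaster loripes", 1110), ("Bathybiaster loripes", 4820),
    ("Psilaster charcoti", 3800), ("Psilaster charcoti", 4100),
]

def depth_ranges(recs):
    """Per-species (min depth, max depth, amplitude) from occurrence records."""
    ranges = {}
    for species, depth in recs:
        lo, hi = ranges.get(species, (depth, depth))
        ranges[species] = (min(lo, depth), max(hi, depth))
    return {sp: (lo, hi, hi - lo) for sp, (lo, hi) in ranges.items()}

for sp, (lo, hi, amp) in depth_ranges(records).items():
    flag = " (eurybathic)" if amp > 3000 else ""
    print(f"{sp}: {lo}-{hi} m, amplitude {amp} m{flag}")
```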
Earth History databases and visualization - the TimeScale Creator system
NASA Astrophysics Data System (ADS)
Ogg, James; Lugowski, Adam; Gradstein, Felix
2010-05-01
The "TimeScale Creator" team (www.tscreator.org) and the Subcommission on Stratigraphic Information (stratigraphy.science.purdue.edu) of the International Commission on Stratigraphy (www.stratigraphy.org) have worked with numerous geoscientists and geological surveys to prepare reference datasets for global and regional stratigraphy. All events are currently calibrated to Geologic Time Scale 2004 (Gradstein et al., 2004, Cambridge Univ. Press) and the Concise Geologic Time Scale (Ogg et al., 2008, Cambridge Univ. Press); but the array of intercalibrations enables dynamic adjustment to future numerical age scales and interpolation methods. The main "global" database contains over 25,000 events/zones from paleontology, geomagnetics, sea-level and sequence stratigraphy, igneous provinces, and bolide impacts, plus several stable isotope curves and image sets. Several regional datasets are provided in conjunction with geological surveys, with numerical ages interpolated using a similar flexible inter-calibration procedure. For example, a joint program with Geoscience Australia has compiled an extensive Australian regional biostratigraphy and a full array of basin lithologic columns with each formation linked to public lexicons of all Proterozoic through Phanerozoic basins - nearly 500 columns of over 9,000 data lines plus hot-cursor links to oil-gas reference wells. Other datapacks include New Zealand biostratigraphy and basin transects (ca. 200 columns), Russian biostratigraphy, British Isles regional stratigraphy, Gulf of Mexico biostratigraphy and lithostratigraphy, high-resolution Neogene stable isotope curves and ice-core data, human cultural episodes, and Circum-Arctic stratigraphy sets. The growing library of datasets is designed for viewing and chart-making in the free "TimeScale Creator" JAVA package. This visualization system produces a screen display of the user-selected time-span and the selected columns of geologic time scale information.
The user can change the vertical scale, column widths, fonts, colors, titles, ordering, range-chart options and many other features. Mouse-activated pop-ups provide additional information on columns and events, including links to external Internet sites. The graphics can be saved as SVG (scalable vector graphics) or PDF files for direct import into Adobe Illustrator or other common drafting software. Users can load additional regional datapacks, and create and upload their own datasets. The "Pro" version has additional dataset-creation tools, output options and the ability to edit and re-save merged datasets. The databases and visualization package are envisioned as a convenient reference tool, chart-production assistant, and a window into the geologic history of our planet.
Aerosol Climate Time Series in ESA Aerosol_cci
NASA Astrophysics Data System (ADS)
Popp, Thomas; de Leeuw, Gerrit; Pinnock, Simon
2016-04-01
Within the ESA Climate Change Initiative (CCI), the Aerosol_cci project (2010 - 2017) conducts intensive work to improve algorithms for the retrieval of aerosol information from European sensors. Meanwhile, full-mission time series of two GCOS-required aerosol parameters are completely validated and released: Aerosol Optical Depth (AOD) from the dual-view ATSR-2 / AATSR radiometers (3 algorithms, 1995 - 2012), and stratospheric extinction profiles from the star-occultation GOMOS spectrometer (2002 - 2012). Additionally, a 35-year multi-sensor time series of the qualitative Absorbing Aerosol Index (AAI) together with sensitivity information and an AAI model simulator is available. Complementary aerosol properties requested by GCOS are in a "round robin" phase, where various algorithms are inter-compared: fine-mode AOD, mineral dust AOD (from the thermal IASI spectrometer, but also from the ATSR instruments and the POLDER sensor), absorption information and aerosol layer height. As a quasi-reference for validation in a few selected regions with sparse ground-based observations, the multi-pixel GRASP algorithm for the POLDER instrument is used. Validation of the first dataset versions (vs. AERONET, MAN) and inter-comparison to other satellite datasets (MODIS, MISR, SeaWiFS) showed that the quality of the available datasets is comparable to other satellite retrievals, and revealed needs for algorithm improvement (for example for higher AOD values) which were taken into account for a reprocessing. The datasets contain pixel-level uncertainty estimates which were also validated and improved in the reprocessing. For the three ATSR algorithms the use of an ensemble method was tested. The paper will summarize and discuss the status of dataset reprocessing and validation. The focus will be on the ATSR, GOMOS and IASI datasets. Validation of pixel-level uncertainties will be summarized and discussed, including unknown components and their potential usefulness and limitations. 
Opportunities for time series extension with successor instruments of the Sentinel family will be described and the complementarity of the different satellite aerosol products (e.g. dust vs. total AOD, ensembles from different algorithms for the same sensor) will be discussed.
Matott, L Shawn; Jiang, Zhengzheng; Rabideau, Alan J; Allen-King, Richelle M
2015-01-01
Numerous isotherm expressions have been developed for describing sorption of hydrophobic organic compounds (HOCs), including "dual-mode" approaches that combine nonlinear behavior with a linear partitioning component. Choosing among these alternative expressions for describing a given dataset is an important task that can significantly influence subsequent transport modeling and/or mechanistic interpretation. In this study, a series of numerical experiments were undertaken to identify "best-in-class" isotherms by refitting 10 alternative models to a suite of 13 previously published literature datasets. The corrected Akaike Information Criterion (AICc) was used for ranking these alternative fits and distinguishing between plausible and implausible isotherms for each dataset. The occurrence of multiple plausible isotherms was inversely correlated with dataset "richness", such that datasets with fewer observations and/or a narrow range of aqueous concentrations resulted in a greater number of plausible isotherms. Overall, only the Polanyi-partition dual-mode isotherm was classified as "plausible" across all 13 of the considered datasets, indicating substantial statistical support consistent with current advances in sorption theory. However, these findings are predicated on the use of the AICc measure as an unbiased ranking metric and the adoption of a subjective, but defensible, threshold for separating plausible and implausible isotherms. Copyright © 2015 Elsevier B.V. All rights reserved.
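The AICc ranking step described above can be sketched in a few lines of Python. The formula is the standard least-squares form of the corrected criterion; the isotherm names, residual sums of squares, sample size, and plausibility cutoff below are illustrative assumptions, not values from the study.

```python
import math

def aicc(rss, n, k):
    """Corrected Akaike Information Criterion for a least-squares fit
    with n observations, k fitted parameters, and residual sum of squares rss."""
    return n * math.log(rss / n) + 2 * k + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical fits of three isotherm models to one 20-point dataset:
# (residual sum of squares, number of fitted parameters)
fits = {
    "linear":            (4.10, 1),
    "Freundlich":        (1.25, 2),
    "Polanyi-partition": (0.95, 4),
}

n = 20
scores = {name: aicc(rss, n, k) for name, (rss, k) in fits.items()}
best = min(scores.values())

# Treat a model as "plausible" if its AICc is within a chosen delta of the best.
DELTA = 2.0
plausible = sorted(name for name, s in scores.items() if s - best <= DELTA)
print(plausible)  # ['Freundlich', 'Polanyi-partition']
```

Smaller AICc is better; note how the correction term penalizes parameter-rich dual-mode models more heavily when the dataset has few observations, which matches the paper's link between dataset "richness" and the number of plausible isotherms.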
NHDPlusHR: A national geospatial framework for surface-water information
Viger, Roland; Rea, Alan H.; Simley, Jeffrey D.; Hanson, Karen M.
2016-01-01
The U.S. Geological Survey is developing a new geospatial hydrographic framework for the United States, called the National Hydrography Dataset Plus High Resolution (NHDPlusHR), that integrates a diversity of the best-available information, robustly supports ongoing dataset improvements, enables hydrographic generalization to derive alternate representations of the network while maintaining feature identity, and supports modern scientific computing and Internet accessibility needs. This framework is based on the High Resolution National Hydrography Dataset, the Watershed Boundaries Dataset, and elevation from the 3-D Elevation Program, and will provide an authoritative, high-precision, and attribute-rich geospatial framework for surface-water information for the United States. Using this common geospatial framework will provide a consistent basis for indexing water information in the United States, eliminate redundancy, and harmonize access to, and exchange of, water information.
Hayat, Maqsood; Khan, Asifullah
2013-05-01
Membrane proteins are prime constituents of a cell and act as mediators between intra- and extracellular processes. The prediction of transmembrane (TM) helices and their topology provides essential information regarding the function and structure of membrane proteins. However, predicting TM helices and their topology is a challenging issue in bioinformatics and computational biology due to experimental complexities and the lack of established structures. Therefore, the location and orientation of TM helix segments are predicted from topogenic sequences. In this regard, we propose the WRF-TMH model for effectively predicting TM helix segments. In this model, information is extracted from membrane protein sequences using a compositional index and physicochemical properties. Redundant and irrelevant features are eliminated through singular value decomposition. The selected features provided by these feature extraction strategies are then fused to develop a hybrid model. Weighted random forest is adopted as the classification approach. We have used two benchmark datasets, including low- and high-resolution datasets. Tenfold cross-validation is employed to assess the performance of the WRF-TMH model at different levels, including per protein, per segment, and per residue. The success rates of the WRF-TMH model are quite promising and are the best reported so far on the same datasets. The WRF-TMH model might play a substantial role, and provide essential information, in further structural and functional studies on membrane proteins. The accompanying web predictor is accessible at http://111.68.99.218/WRF-TMH/.
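As a rough sketch of the feature-extraction and SVD-reduction steps, the fragment below computes simple amino-acid composition features from sequence windows and keeps only the top singular directions. The toy sequences and retained rank are invented; the physicochemical features and the weighted random forest classifier of the actual model are not reproduced here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Amino-acid composition: fraction of each residue type in the sequence."""
    seq = seq.upper()
    return np.array([seq.count(aa) / len(seq) for aa in AMINO_ACIDS])

# Toy sequence windows (hydrophobic ~ TM-helix-like, charged ~ loop-like).
windows = ["LLLIVVAFLGLLL", "AVILLFVVGLIAV", "DESKRQNDEKSTR", "KRDEQNSTDEKRN"]
X = np.vstack([composition(w) for w in windows])   # (4 samples, 20 features)

# Centre the feature matrix and project onto the top-r singular directions,
# discarding low-variance (redundant/irrelevant) components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
r = 2
X_reduced = Xc @ Vt[:r].T                          # (4 samples, r features)
print(X_reduced.shape)
```

The reduced matrix would then be handed to a classifier; here that step is deliberately omitted.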
Condensing Massive Satellite Datasets For Rapid Interactive Analysis
NASA Astrophysics Data System (ADS)
Grant, G.; Gallaher, D. W.; Lv, Q.; Campbell, G. G.; Fowler, C.; LIU, Q.; Chen, C.; Klucik, R.; McAllister, R. A.
2015-12-01
Our goal is to enable users to interactively analyze massive satellite datasets, identifying anomalous data or values that fall outside of thresholds. To achieve this, the project seeks to create a derived database containing only the most relevant information, accelerating the analysis process. The database is designed to be an ancillary tool for the researcher, not an archival database to replace the original data. This approach improves performance by condensing the data to reduce its overall size. The primary challenges of the project include: - The nature of the research question(s) may not be known ahead of time. - The thresholds for determining anomalies may be uncertain. - Problems associated with processing cloudy, missing, or noisy satellite imagery. - The contents and method of creation of the condensed dataset must be easily explainable to users. The architecture of the database will reorganize spatially-oriented satellite imagery into temporally-oriented columns of data (a.k.a. "data rods") to facilitate time-series analysis. The database itself is an open-source parallel database, designed to make full use of clustered server technologies. A demonstration of the system capabilities will be shown. Applications for this technology include quick-look views of the data, as well as the potential for on-board satellite processing of essential information, with the goal of reducing data latency.
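The "data rods" reorganization can be illustrated with a small NumPy sketch: a time-stack of images becomes one row per pixel, one column per time step, so threshold checks run along each pixel's time series. The array sizes, values, and threshold band are invented for the example.

```python
import numpy as np

# Synthetic "satellite archive": 12 time steps of a 4x5 image grid.
rng = np.random.default_rng(0)
stack = rng.normal(loc=100.0, scale=5.0, size=(12, 4, 5))
stack[7, 2, 3] = 160.0                      # plant one anomalous observation

t, h, w = stack.shape
# Reorganise spatially-oriented imagery into temporally-oriented "data rods".
rods = stack.reshape(t, h * w).T            # shape (20 pixels, 12 time steps)

# Flag pixels whose series ever leaves a fixed threshold band.
lo_th, hi_th = 80.0, 120.0
anomalous_pixels = np.where((rods < lo_th) | (rods > hi_th))[0]
print(np.unique(anomalous_pixels))          # rod 13 corresponds to pixel (2, 3)
```

In the real system each rod would live as a column in a parallel database rather than in memory, but the access pattern is the same.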
Data Discovery, Exploration, Integration and Delivery - a practical experience
NASA Astrophysics Data System (ADS)
Kirsch, Peter; Barnes, Tim; Breen, Paul
2010-05-01
To fully address the questions and issues arising within Earth Systems Science, the discovery, exploration, integration, delivery and sharing of data, metadata and services across potentially many disciplines and areas of expertise is fundamental. The British Antarctic Survey (BAS) collects, manages and curates data across many fields of the geophysical and biological sciences (including upper atmospheric physics, atmospheric chemistry, meteorology, glaciology, oceanography, and polar ecology and biology). BAS, through its Polar Data Centre, aims to construct and deliver a user-friendly, informative, and low-overhead interface onto these data holdings. Designing effective interfaces and frameworks onto such heterogeneous datasets is non-trivial. We will discuss some of our approaches and implementations, particularly those addressing the following issues: How can we aid and guide the user to accurate discovery of data? Many portals do not inform users clearly enough about the datasets they actually hold. As a result, the search interface by which a user is meant to discover information is often inadequate and assumes prior knowledge (for example, that the dataset you are looking for actually exists; that a particular event, campaign, or research cruise took place; and that you have specialist knowledge of the terminology in a particular field) - assumptions that cannot be made in multi-disciplinary topic areas. How easily are provenance, quality, and metadata information displayed and accessed? Once informed through the portal that data is available, it is often extremely difficult to assess its provenance and quality information and broader documentation (including field reports, notebooks and software repositories). We shall demonstrate some simple methodologies. Can the user access summary data or visualizations of the dataset? 
It may be that the user is interested in some event, feature or threshold within the dataset; mechanisms need to be provided to allow a user to browse the data (or at least a summary of the data in the most appropriate form, be it a plot, table, or video) prior to making the decision to download or request data. A framework should be flexible enough to allow several methods of visualization. Can datasets be compared and/or integrated? By allowing the inclusion of open, third-party, standards-compliant utilities (e.g. Open Geospatial Consortium WxS clients) into the framework, a data access system can be made more valuable. Is accessing the actual data straightforward? The process of accessing the data should follow naturally from the data discovery and browsing stages. The user should be made aware of the terms and conditions of access. Access restrictions (if applicable) and security should be made as unobtrusive as possible. How is user feedback and comment monitored and acted upon? In general these systems exist to serve science communities; appropriate notice and acknowledgement of their needs and requirements must be taken into account when designing and developing these systems if they are to be of continued use in the future.
NASA Astrophysics Data System (ADS)
Lehmann, Jan Rudolf Karl; Zvara, Ondrej; Prinz, Torsten
2015-04-01
The biological invasion of Australian Acacia species in natural ecosystems outside Australia often has a negative impact on native and endemic plant species and the related biodiversity. In Brazil, the Atlantic rainforest of Bahia and Espirito Santo forms an associated type of ecosystem, the Mussununga. Today this biologically diverse ecosystem is negatively affected by the invasion of Acacia mangium and Acacia auriculiformis, both introduced to Brazil by the agroforestry industry to increase the production of pulp and high-grade woods. In order to detect the distribution of Acacia species and to monitor the expansion of this invasion, the use of high-resolution imagery data acquired with an autonomous Unmanned Aerial System (UAS) proved to be a very promising approach. In this study, two types of datasets - CIR and RGB - were collected, since each type provides different information. In the case of CIR imagery, attention was paid to spectral signatures related to plants, whereas in the case of RGB imagery the focus was on surface characteristics. Orthophoto-mosaics and DSM/DTM for both datasets were extracted. RGB/IHS transformations of the imagery's colour space were utilized, as well as the NDVIblue index in the case of CIR imagery, to discriminate plant associations. Next, two test areas were defined in order to validate OBIA rule sets using eCognition software. In the case of the RGB dataset, a rule set based on elevation distinction between high vegetation (including Acacia) and low vegetation (including soils) was developed. High vegetation was classified using the Nearest Neighbour algorithm while working with the CIR dataset. The IHS information was used to mask shadows, soils and low vegetation. A further Nearest Neighbour classification was used to distinguish between Acacia and other high vegetation types. Finally, an accuracy assessment was performed using a confusion matrix. 
The IHS information proved helpful for Acacia detection, while the surface elevation information from the RGB dataset helped to distinguish between low and high vegetation types. The successful use of a fixed-wing UAS proved to be a reliable and flexible technique to acquire ecologically sensitive data over wide areas through extended UAS flight missions.
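A blue-band NDVI such as the NDVIblue used above is commonly computed as (NIR - Blue) / (NIR + Blue); the sketch below assumes that formulation and uses invented digital numbers and an arbitrary vegetation threshold, so it illustrates the index rather than the study's exact rule set.

```python
import numpy as np

def ndvi_blue(nir, blue):
    """Blue-band NDVI: (NIR - Blue) / (NIR + Blue), a common formulation
    for CIR imagery; the epsilon avoids division by zero."""
    nir = nir.astype(float)
    blue = blue.astype(float)
    return (nir - blue) / (nir + blue + 1e-9)

# Toy 2x2 CIR patch: digital numbers for the NIR and blue bands.
nir  = np.array([[200,  60], [180,  50]])
blue = np.array([[ 40,  50], [ 60,  45]])

index = ndvi_blue(nir, blue)
vegetation_mask = index > 0.3   # illustrative threshold separating vegetation
print(vegetation_mask)
```

In an OBIA workflow such a mask would feed into segment-level rules rather than being used per pixel.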
Myneni, Sahiti; Patel, Vimla L.
2010-01-01
Biomedical researchers often work with massive, detailed and heterogeneous datasets. These datasets raise new challenges of information organization and management for scientific interpretation, as they demand much of the researchers’ time and attention. The current study investigated the nature of the problems that researchers face when dealing with such data. Four major problems identified with existing biomedical scientific information management methods were related to data organization, data sharing, collaboration, and publications. Therefore, there is a compelling need to develop an efficient and user-friendly information management system to handle the biomedical research data. This study evaluated the implementation of an information management system, which was introduced as part of the collaborative research to increase scientific productivity in a research laboratory. Laboratory members seemed to exhibit frustration during the implementation process. However, empirical findings revealed that they gained new knowledge and completed specified tasks while working together with the new system. Hence, researchers are urged to persist and persevere when dealing with any new technology, including an information management system in a research laboratory environment. PMID:20543892
An ISA-TAB-Nano based data collection framework to support data-driven modelling of nanotoxicology.
Marchese Robinson, Richard L; Cronin, Mark T D; Richarz, Andrea-Nicole; Rallo, Robert
2015-01-01
Analysis of trends in nanotoxicology data and the development of data-driven models for nanotoxicity are facilitated by the reporting of data using a standardised electronic format. ISA-TAB-Nano has been proposed as such a format. However, in order to build useful datasets according to this format, a variety of issues have to be addressed. These issues include questions regarding exactly which (meta)data to report and how to report them. The current article discusses some of the challenges associated with the use of ISA-TAB-Nano and presents a set of resources designed to facilitate the manual creation of ISA-TAB-Nano datasets from the nanotoxicology literature. These resources were developed within the context of the NanoPUZZLES EU project and include data collection templates, corresponding business rules that extend the generic ISA-TAB-Nano specification, as well as Python code to facilitate parsing and integration of these datasets within other nanoinformatics resources. The use of these resources is illustrated by a "Toy Dataset" presented in the Supporting Information. The strengths and weaknesses of the resources are discussed along with possible future developments.
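Although the project's own Python parsing code is not reproduced here, ISA-TAB-Nano files are tab-delimited, so a minimal reader can be sketched with the standard library alone. The file content and column names below are purely illustrative; real ISA-TAB-Nano material files have many more required columns defined by the specification.

```python
import csv
import io

# A minimal, purely illustrative ISA-TAB-style tab-delimited material file.
material_file = io.StringIO(
    "Material Name\tCharacteristics[Core Chemistry]\tCharacteristics[Diameter]\tUnit\n"
    "TiO2-NP-1\ttitanium dioxide\t21\tnm\n"
    "Ag-NP-1\tsilver\t15\tnm\n"
)

# DictReader keys each row by the header line, so business rules can be
# checked per named column before integrating the record elsewhere.
reader = csv.DictReader(material_file, delimiter="\t")
materials = {row["Material Name"]: row for row in reader}

print(materials["Ag-NP-1"]["Characteristics[Diameter]"])   # "15"
```

A real pipeline would validate each row against the extended business rules before ingestion; this sketch only shows the parsing step.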
EnviroAtlas - Big Game Hunting Recreation Demand by 12-Digit HUC in the Conterminous United States
This EnviroAtlas dataset includes the total number of recreational days per year demanded by people ages 18 and over for big game hunting by location in the contiguous United States. Big game includes deer, elk, bear, and wild turkey. These values are based on 2010 population distribution, 2011 U.S. Fish and Wildlife Service (FWS) Fish, Hunting, and Wildlife-Associated Recreation (FHWAR) survey data, and 2011 U.S. Department of Agriculture (USDA) Forest Service National Visitor Use Monitoring program data, and have been summarized by 12-digit hydrologic unit code (HUC). This dataset was produced by the US EPA to support research and online mapping activities related to the EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Soil Bulk Density by Soil Type, Land Use and Data Source: Putting the Error in SOC Estimates
NASA Astrophysics Data System (ADS)
Wills, S. A.; Rossi, A.; Loecke, T.; Ramcharan, A. M.; Roecker, S.; Mishra, U.; Waltman, S.; Nave, L. E.; Williams, C. O.; Beaudette, D.; Libohova, Z.; Vasilas, L.
2017-12-01
An important part of SOC stock and pool assessment is the estimation and application of bulk density. While the concept of bulk density is relatively simple (the mass of soil in a given volume), bulk density can be difficult to measure in soils due to logistical and methodological constraints. While many estimates of SOC pools use legacy data, few concerted efforts have been made to assess the process used to convert laboratory carbon concentration measurements and bulk density collection into volumetrically based SOC estimates. The methodologies used are particularly sensitive in wetlands and organic soils with high amounts of carbon and very low bulk densities. We will present an analysis across four databases: NCSS - the National Cooperative Soil Survey Characterization dataset, RaCA - the Rapid Carbon Assessment sample dataset, NWCA - the National Wetland Condition Assessment, and ISCN - the International Soil Carbon Network. The relationship between bulk density and soil organic carbon will be evaluated by dataset and land use/land cover information. Prediction methods (both regression and machine learning) will be compared and contrasted across datasets and available input information. The assessment and application of bulk density, including modeling, aggregation and error propagation, will be evaluated. Finally, recommendations will be made about both the use of new data in soil survey products (such as SSURGO) and the use of that information as legacy data in SOC pool estimates.
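The conversion the abstract refers to, from a laboratory carbon concentration and a bulk density to a volumetric SOC stock, can be sketched as follows. The layer values and error magnitudes are invented, and the quadrature rule is only a first-order propagation assumption for a product of independent factors.

```python
import math

def soc_stock(bd, depth_cm, c_frac):
    """Soil organic carbon stock in kg C per m^2 for one soil layer:
    bulk density (g/cm^3) x thickness (cm) x carbon mass fraction x 10
    (the factor 10 converts g/cm^2 to kg/m^2)."""
    return bd * depth_cm * c_frac * 10.0

def rel_error_product(rel_bd, rel_c):
    """First-order error propagation for a product: relative errors of the
    factors add in quadrature (layer thickness assumed exact)."""
    return math.sqrt(rel_bd**2 + rel_c**2)

# Illustrative mineral-soil layer: BD 1.3 g/cm^3, 30 cm thick, 2% organic C.
stock = soc_stock(1.3, 30.0, 0.02)              # 7.8 kg C / m^2
err = stock * rel_error_product(0.10, 0.05)     # 10% BD error, 5% C error
print(round(stock, 2), round(err, 2))
```

The same arithmetic shows why wetlands are sensitive: at very low bulk density, a fixed absolute BD error becomes a large relative error that dominates the propagated stock uncertainty.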
Establishing a threshold for the number of missing days using 7 d pedometer data.
Kang, Minsoo; Hart, Peter D; Kim, Youngdeok
2012-11-01
The purpose of this study was to examine the threshold for the number of missing days that can be recovered using the individual information (II)-centered approach. Data for this study came from 86 participants, aged 17 to 79 years, who had 7 consecutive days of complete pedometer (Yamax SW-200) wear. Missing datasets (1 d through 5 d missing) were created by a SAS random process 10,000 times each. All missing values were replaced using the II-centered approach. A 7 d average was calculated for each dataset, including the complete dataset. Repeated measures ANOVA was used to determine the differences between the 1 d through 5 d missing datasets and the complete dataset. Mean absolute percentage error (MAPE) was also computed. Mean (SD) daily step count for the complete 7 d dataset was 7979 (3084). Mean (SD) values for the 1 d through 5 d missing datasets were 8072 (3218), 8066 (3109), 7968 (3273), 7741 (3050) and 8314 (3529), respectively (p > 0.05). The lowest MAPEs were estimated for 1 d missing (5.2%, 95% confidence interval (CI) 4.4-6.0) and 2 d missing (8.4%, 95% CI 7.0-9.8), while all others were greater than 10%. The results of this study show that the 1 d through 5 d missing datasets, with replaced values, were not significantly different from the complete dataset. Based on the MAPE results, it is not recommended to replace more than two days of missing step counts.
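The simulation design can be sketched in plain Python. Here missing days are refilled with the person's own mean of observed days as a simplified stand-in for the II-centered approach, and the step counts and replicate count are invented, so the resulting error is illustrative only.

```python
import random

def replace_missing(days):
    """Fill missing daily step counts (None) with the person's own mean of
    the observed days - a simplified stand-in for the II-centered approach."""
    observed = [d for d in days if d is not None]
    fill = sum(observed) / len(observed)
    return [fill if d is None else d for d in days]

def mape(true_val, estimates):
    """Mean absolute percentage error of estimates against one true value."""
    return 100.0 * sum(abs(true_val - e) / true_val for e in estimates) / len(estimates)

random.seed(1)
complete = [8200, 7900, 10100, 6500, 7800, 9400, 5900]   # one person's 7 days
true_mean = sum(complete) / 7

# Simulate 10,000 replicates with two randomly chosen missing days each.
estimates = []
for _ in range(10000):
    days = complete[:]
    for i in random.sample(range(7), 2):
        days[i] = None
    estimates.append(sum(replace_missing(days)) / 7)

print(round(mape(true_mean, estimates), 2))
```

Repeating the loop for 1 d through 5 d missing would reproduce the study's design of comparing MAPE against the number of replaced days.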
VIS – A database on the distribution of fishes in inland and estuarine waters in Flanders, Belgium
Brosens, Dimitri; Breine, Jan; Van Thuyne, Gerlinde; Belpaire, Claude; Desmet, Peter; Verreycken, Hugo
2015-01-01
The Research Institute for Nature and Forest (INBO) has been performing standardized fish stock assessments in Flanders, Belgium. This Flemish Fish Monitoring Network aims to assess fish populations in public waters at regular time intervals in both inland waters and estuaries. This monitoring was set up in support of the Water Framework Directive, the Habitat Directive, the Eel Regulation, the Red List of fishes, fish stock management, and biodiversity research, and to assess the colonization and spread of non-native fish species. The collected data are consolidated in the Fish Information System, or VIS. From VIS, the occurrence data are now published at the INBO IPT as two datasets: ‘VIS - Fishes in inland waters in Flanders, Belgium’ and ‘VIS - Fishes in estuarine waters in Flanders, Belgium’. Together these datasets represent a complete overview of the distribution and abundance of fish species occurring in Flanders from late 1992 to the end of 2012. This data paper discusses both datasets together, as both have a similar methodology and structure. The inland waters dataset contains over 350,000 fish observations, sampled between 1992 and 2012 from over 2,000 locations in inland rivers, streams, canals, and enclosed waters in Flanders. The dataset includes 64 fish species, as well as a number of non-target species (mainly crustaceans). The estuarine waters dataset contains over 44,000 fish observations, sampled between 1995 and 2012 from almost 50 locations in the estuaries of the rivers Yser and Scheldt (“Zeeschelde”), including two sampling sites in the Netherlands. The dataset includes 69 fish species and a number of non-target crustacean species. To foster broad and collaborative use, the data are dedicated to the public domain under a Creative Commons Zero waiver and reference the INBO norms for data use. PMID:25685001
Demons registration for in vivo and deformable laser scanning confocal endomicroscopy.
Chiew, Wei-Ming; Lin, Feng; Seah, Hock Soon
2017-09-01
A critical effect found in noninvasive in vivo endomicroscopic imaging modalities is image distortions due to sporadic movement exhibited by living organisms. In three-dimensional confocal imaging, this effect results in a dataset that is tilted across deeper slices. Apart from that, the sequential flow of the imaging-processing pipeline restricts real-time adjustments due to the unavailability of information obtainable only from subsequent stages. To solve these problems, we propose an approach to render Demons-registered datasets as they are being captured, focusing on the coupling between registration and visualization. To improve the acquisition process, we also propose a real-time visual analytics tool, which complements the imaging pipeline and the Demons registration pipeline with useful visual indicators to provide real-time feedback for immediate adjustments. We highlight the problem of deformation within the visualization pipeline for object-ordered and image-ordered rendering. Visualizations of critical information including registration forces and partial renderings of the captured data are also presented in the analytics system. We demonstrate the advantages of the algorithmic design through experimental results with both synthetically deformed datasets and actual in vivo, time-lapse tissue datasets expressing natural deformations. Remarkably, this algorithm design is for embedded implementation in intelligent biomedical imaging instrumentation with customizable circuitry. (2017) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
PERSIANN-CDR Daily Precipitation Dataset for Hydrologic Applications and Climate Studies.
NASA Astrophysics Data System (ADS)
Sorooshian, S.; Hsu, K. L.; Ashouri, H.; Braithwaite, D.; Nguyen, P.; Thorstensen, A. R.
2015-12-01
Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks - Climate Data Record (PERSIANN-CDR) is a newly developed and released dataset that covers more than three decades (01/01/1983 - 03/31/2015 to date) of daily precipitation estimates at 0.25° resolution for the 60°S-60°N latitude band. PERSIANN-CDR is processed using the archive of the Gridded Satellite IRWIN CDR (GridSat-B1) from the International Satellite Cloud Climatology Project (ISCCP), and the Global Precipitation Climatology Project (GPCP) 2.5° monthly product for bias correction. The dataset has been released and made available for public access through NOAA's National Centers for Environmental Information (NCEI) (http://www1.ncdc.noaa.gov/pub/data/sds/cdr/CDRs/PERSIANN/Overview.pdf). PERSIANN-CDR has already shown its usefulness for a wide range of applications, including climate variability and change monitoring, hydrologic applications, and water resources system planning and management. This precipitation CDR has also been used in studying the behavior of historical extreme precipitation events. A demonstration of PERSIANN-CDR data in detecting trends and variability of precipitation over the past 30 years, as well as the potential usefulness of the dataset for evaluating climate model performance relevant to precipitation in retrospective mode, will be presented.
Basin Characteristics for Selected Streamflow-Gaging Stations In and Near West Virginia
Paybins, Katherine S.
2008-01-01
Basin characteristics have long been used to develop equations describing streamflow. In the past, flow equations used in West Virginia were based on a few hand-calculated basin characteristics. More recently, the use of a Geographic Information System (GIS) to generate basin characteristics from existing datasets has refined the process for developing equations to describe flow values in the Mountain State. These basin characteristics are described in this document for streamflow-gaging stations in and near West Virginia. The GIS program, developed in ArcGIS Workstation by the Environmental Systems Research Institute (ESRI), used data that included the National Elevation Dataset (NED) at 1:24,000 scale, climate data from the National Oceanic and Atmospheric Administration (NOAA), streamlines from the National Hydrography Dataset (NHD), and Landsat-based land-cover data (NLCD) for the period 1999-2003. Full automation of data generation was not achieved due to some inaccuracies in the elevation dataset, as well as inaccuracies in the streamflow-gage locations retrieved from the National Water Information System (NWIS). A Pearson's correlation examination of the data indicates that several of the basin characteristics are correlated with drainage area. However, the GIS-generated data provide a consistent and documented set of basin characteristics for resource managers and researchers to use.
Blood vessel-based liver segmentation through the portal phase of a CT dataset
NASA Astrophysics Data System (ADS)
Maklad, Ahmed S.; Matsuhiro, Mikio; Suzuki, Hidenobu; Kawata, Yoshiki; Niki, Noboru; Moriyama, Noriyuki; Utsunomiya, Toru; Shimada, Mitsuo
2013-02-01
Blood vessels are dispersed throughout the organs of the human body and carry information unique to each person. This information can be used to delineate organ boundaries. The proposed method relies on abdominal blood vessels (ABV) to segment the liver, accounting for the potential presence of tumors, through the portal phase of a CT dataset. ABV are extracted and classified into hepatic (HBV) and nonhepatic (non-HBV) with a small amount of user interaction. HBV and non-HBV then guide an automatic segmentation of the liver. HBV are used to segment the core region of the liver. This region and the non-HBV are used to construct a boundary surface that separates the liver from other organs. The core region is classified, based on posterior distributions extracted from its histogram, into low-intensity-tumor (LIT) and non-LIT core regions. The non-LIT class includes the normal liver parenchyma, HBV, and high-intensity tumors if present. Each core region is extended based on its corresponding posterior distribution. Extension is complete when it reaches either a variation in intensity or the constructed boundary surface. The method was applied to 80 datasets (30 Medical Image Computing and Computer Assisted Intervention (MICCAI) and 50 non-MICCAI datasets), including 60 datasets with tumors. Our results for the MICCAI test data were evaluated by SLIVER07 [1] with an overall score of 79.7, which ranked seventh best on the site (December 2013). This approach appears promising for extracting liver volumetry across various shapes and sizes and for low-intensity hepatic tumors.
DOE Office of Scientific and Technical Information (OSTI.GOV)
de Boer, Gijs; Lawrence, Dale; Palo, Scott
2017-03-29
This final technical report details activities undertaken as part of the referenced project. Included is information on the preparation of aircraft for deployment to Alaska, summaries of the three deployments covered under this project, and a brief description of the dataset and science directions pursued. Additionally, we provide information on lessons learned, publications, and presentations resulting from this work.
Janjua, Naveed Zafar; Islam, Nazrul; Kuo, Margot; Yu, Amanda; Wong, Stanley; Butt, Zahid A; Gilbert, Mark; Buxton, Jane; Chapinal, Nuria; Samji, Hasina; Chong, Mei; Alvarez, Maria; Wong, Jason; Tyndall, Mark W; Krajden, Mel
2018-05-01
Large linked healthcare administrative datasets could be used to monitor programs providing prevention and treatment services to people who inject drugs (PWID). However, diagnostic codes in administrative datasets do not differentiate non-injection from injection drug use (IDU). We validated algorithms based on diagnostic codes and prescription records representing IDU in administrative datasets against interview-based IDU data. The British Columbia Hepatitis Testers Cohort (BC-HTC) includes ∼1.7 million individuals tested for HCV/HIV or reported as HBV/HCV/HIV/tuberculosis cases in BC from 1990 to 2015, linked to administrative datasets including physician-visit, hospitalization, and prescription drug records. IDU, assessed through interviews as part of enhanced surveillance at the time of HIV or HCV/HBV diagnosis for a subset of cases included in the BC-HTC (n = 6559), was used as the gold standard. ICD-9/ICD-10 codes for IDU and injecting-related infections (IRI) were grouped with records of opioid substitution therapy (OST) into multiple IDU algorithms in the administrative datasets. We assessed the performance of the IDU algorithms by calculating sensitivity, specificity, and positive and negative predictive values. Sensitivity was highest (90-94%) and specificity lowest (42-73%) for algorithms based either on IDU or on IRI and drug misuse codes. Algorithms requiring both drug misuse and IRI had lower sensitivity (57-60%) and higher specificity (90-92%). An optimal sensitivity/specificity combination was found with two medical visits or a single hospitalization for injectable drugs, with OST (83%/82%) and without OST (78%/83%), respectively. Based on algorithms that included two medical visits, a single hospitalization, or OST records, there were 41,358 recent PWID in BC (1.2% of individuals aged 11-65 years) identified from health encounters during the 3-year period 2013-2015.
Algorithms for identifying PWID using diagnostic codes in linked administrative data could be used to track the progress of programming aimed at PWID. With population-based datasets, this tool can be used to inform much-needed estimates of PWID population size. Copyright © 2018 Elsevier B.V. All rights reserved.
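The validation metrics reported above follow directly from a 2x2 confusion matrix of algorithm flags against the interview-based gold standard; a minimal sketch (the toy labels below are illustrative, not BC-HTC data):

```python
def diagnostic_metrics(algorithm_flags, gold_standard):
    """Sensitivity, specificity, PPV and NPV of a case-finding algorithm
    (1 = flagged as IDU) against gold-standard labels (1 = true IDU)."""
    pairs = list(zip(algorithm_flags, gold_standard))
    tp = sum(1 for a, g in pairs if a and g)
    fp = sum(1 for a, g in pairs if a and not g)
    fn = sum(1 for a, g in pairs if not a and g)
    tn = sum(1 for a, g in pairs if not a and not g)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy example: 8 individuals, algorithm flag vs. interview result.
flags = [1, 1, 1, 0, 0, 1, 0, 0]
gold  = [1, 1, 0, 0, 1, 1, 0, 0]
metrics = diagnostic_metrics(flags, gold)
```

Each candidate algorithm (codes alone, codes plus IRI, visit thresholds with or without OST) would be scored this way against the n = 6559 interview subset.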
Sinfonevada: Dataset of Floristic diversity in Sierra Nevada forests (SE Spain)
Pérez-Luque, Antonio Jesús; Bonet, Francisco Javier; Pérez-Pérez, Ramón; Rut Aspizua; Lorite, Juan; Zamora, Regino
2014-01-01
The Sinfonevada database is a forest inventory containing information on the forest ecosystems of the Sierra Nevada mountains (SE Spain). The Sinfonevada dataset contains more than 7,500 occurrence records belonging to 270 taxa (24 of them threatened) from the floristic inventories of the Sinfonevada forest inventory. The information was collected by expert field workers, and the whole dataset underwent quality control by botanists with broad expertise in Sierra Nevada flora. The floristic inventory was created to gather information useful for the proper management of Pinus plantations in Sierra Nevada, and it is the only dataset offering a comprehensive view of the forest flora of the range. For this reason it is being used to assess biodiversity in the very dense pine plantations of this massif. With this dataset, managers have improved their ability to decide where to apply forest treatments in order to avoid biodiversity loss. The dataset forms part of the Sierra Nevada Global Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area. PMID:24843285
Assessment of Homomorphic Analysis for Human Activity Recognition from Acceleration Signals.
Vanrell, Sebastian Rodrigo; Milone, Diego Humberto; Rufiner, Hugo Leonardo
2017-07-03
Unobtrusive activity monitoring can provide valuable information for medical and sports applications. In recent years, human activity recognition has moved to wearable sensors to deal with unconstrained scenarios, and accelerometers are the preferred sensors because of their simplicity and availability. Previous studies have examined several classic techniques for extracting features from acceleration signals, including time-domain, time-frequency, frequency-domain, and other heuristic features. Spectral and temporal features are the preferred ones and are generally computed from the individual acceleration components, leaving the potential of the acceleration magnitude unexplored. In this study, a new feature extraction stage based on homomorphic analysis is proposed to exploit the discriminative activity information present in acceleration signals. Homomorphic analysis can isolate the information about whole-body dynamics and translate it into a compact representation called cepstral coefficients. Experiments explored several configurations of the proposed features, including the size of the representation, the signals to be used, and fusion with other features. Cepstral features computed from the acceleration magnitude obtained one of the highest recognition rates. In addition, a beneficial contribution was found when time-domain and moving-pace information was included in the feature vector. Overall, the proposed system achieved a recognition rate of 91.21% on the publicly available SCUT-NAA dataset; to the best of our knowledge, this is the highest recognition rate reported on this dataset.
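As a rough illustration of the homomorphic chain described above (DFT, log-magnitude, inverse DFT), the real cepstrum of an acceleration-magnitude window can be sketched as follows; the signal below is synthetic, and the windowing, sampling rate, and coefficient count used in the study may differ:

```python
import cmath
import math

def real_cepstrum(signal, n_coeffs):
    """Real cepstrum via the homomorphic chain DFT -> log|.| -> inverse DFT.
    The low-order coefficients compactly summarise the spectral envelope
    of the windowed signal."""
    n = len(signal)
    spectrum = [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal)) for k in range(n)]
    log_mag = [math.log(abs(s) + 1e-12) for s in spectrum]  # guard log(0)
    cepstrum = [sum(l * cmath.exp(2j * math.pi * k * t / n)
                    for k, l in enumerate(log_mag)).real / n
                for t in range(n)]
    return cepstrum[:n_coeffs]

# Synthetic acceleration-magnitude window: gravity plus a 2 Hz gait-like
# oscillation sampled at 50 Hz (purely illustrative values).
acc_magnitude = [9.81 + 0.5 * math.sin(2 * math.pi * 2 * t / 50)
                 for t in range(64)]
cepstral_features = real_cepstrum(acc_magnitude, 8)
```

In practice an FFT (e.g. a radix-2 implementation) would replace the naive DFT loops for realistic window sizes.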
Preprocessed Consortium for Neuropsychiatric Phenomics dataset.
Gorgolewski, Krzysztof J; Durnez, Joke; Poldrack, Russell A
2017-01-01
Here we present preprocessed MRI data of 265 participants from the Consortium for Neuropsychiatric Phenomics (CNP) dataset. The preprocessed dataset includes minimally preprocessed data in native, MNI, and surface spaces, accompanied by potential confound regressors, tissue probability masks, brain masks, and transformations. In addition, it includes unthresholded group-level and single-subject statistical maps from all tasks included in the original dataset. We hope that the availability of this dataset will greatly accelerate research.
Mazzarelli, Joan M; Brestelli, John; Gorski, Regina K; Liu, Junmin; Manduchi, Elisabetta; Pinney, Deborah F; Schug, Jonathan; White, Peter; Kaestner, Klaus H; Stoeckert, Christian J
2007-01-01
EPConDB (http://www.cbil.upenn.edu/EPConDB) is a public web site that supports research in diabetes, pancreatic development and beta-cell function by providing information about genes expressed in cells of the pancreas. EPConDB displays expression profiles for individual genes and information about transcripts, promoter elements and transcription factor binding sites. Gene expression results are obtained from studies examining tissue expression, pancreatic development and growth, differentiation of insulin-producing cells, islet or beta-cell injury, and genetic models of impaired beta-cell function. The expression datasets are derived using different microarray platforms, including the BCBC PancChips and Affymetrix gene expression arrays. Other datasets include semi-quantitative RT-PCR and MPSS expression studies. For selected microarray studies, lists of differentially expressed genes, derived from PaGE analysis, are displayed on the site. EPConDB provides database queries and tools to examine the relationship between a gene, its transcriptional regulation, protein function and expression in pancreatic tissues.
NASA Astrophysics Data System (ADS)
Dogon-Yaro, M. A.; Kumar, P.; Rahman, A. Abdul; Buyuksalih, G.
2016-09-01
Mapping of trees plays an important role in modern urban spatial data management, as many benefits and applications derive from these detailed, up-to-date data sources. Timely and accurate information on the condition of urban trees helps decision makers better appreciate urban ecosystems and their numerous values, which is critical to building strategies for sustainable development. Conventional techniques for extracting trees, such as ground surveying and interpretation of aerial photography, are labour-intensive and costly; these constraints can be overcome by means of integrated LiDAR and digital image datasets. In contrast to most studies on tree extraction, which focus on purely forested areas, this study concentrates on urban areas, which have high structural complexity and a multitude of different objects. This paper presents a semi-automated workflow for extracting urban trees from the integrated processing of airborne LiDAR point clouds and multispectral digital image datasets over the city of Istanbul, Turkey. The paper shows that the integrated datasets are a suitable and viable source of information for urban tree management. In conclusion, the extracted information provides a snapshot of the location, composition, and extent of trees in the study area, useful to city planners and other decision makers for understanding how much canopy cover exists, identifying planting, removal, or reforestation opportunities, and determining which locations have the greatest need or potential to maximize return on investment. It can also help track trends or changes in the urban trees over time and inform future management decisions.
Flocks, James
2006-01-01
Scientific knowledge from the past century is commonly represented by two-dimensional figures and graphs, as presented in manuscripts and maps. Using today's computer technology, this information can be extracted and projected into three- and four-dimensional perspectives. Computer models can be applied to datasets to provide additional insight into complex spatial and temporal systems. This process can be demonstrated by applying digitizing and modeling techniques to valuable information within widely used publications. The seminal paper by D. Frazier, published in 1967, identified 16 separate delta lobes formed by the Mississippi River during the past 6,000 yrs. The paper includes stratigraphic descriptions through geologic cross-sections, and provides distribution and chronologies of the delta lobes. The data from Frazier's publication are extensively referenced in the literature. Additional information can be extracted from the data through computer modeling. Digitizing and geo-rectifying Frazier's geologic cross-sections produce a three-dimensional perspective of the delta lobes. Adding the chronological data included in the report provides the fourth-dimension of the delta cycles, which can be visualized through computer-generated animation. Supplemental information can be added to the model, such as post-abandonment subsidence of the delta-lobe surface. Analyzing the regional, net surface-elevation balance between delta progradations and land subsidence is computationally intensive. By visualizing this process during the past 4,500 yrs through multi-dimensional animation, the importance of sediment compaction in influencing both the shape and direction of subsequent delta progradations becomes apparent. Visualization enhances a classic dataset, and can be further refined using additional data, as well as provide a guide for identifying future areas of study.
NASA's Earth Observing Data and Information System - Near-Term Challenges
NASA Technical Reports Server (NTRS)
Behnke, Jeanne; Mitchell, Andrew; Ramapriyan, Hampapuram
2018-01-01
NASA's Earth Observing System Data and Information System (EOSDIS) has been a central component of the NASA Earth observation program since the 1990s. EOSDIS manages data covering a wide range of Earth science disciplines, including the cryosphere, land cover change, polar processes, field campaigns, ocean surface, digital elevation, atmosphere dynamics and composition, and interdisciplinary research, among others. One of the key components of EOSDIS is a set of twelve discipline-based Distributed Active Archive Centers (DAACs) distributed across the United States. Managed by NASA's Earth Science Data and Information System (ESDIS) Project at Goddard Space Flight Center, these DAACs serve over 3 million users globally. The ESDIS Project provides the infrastructure support for EOSDIS, which includes other components such as the Science Investigator-led Processing Systems (SIPS), common metadata and metrics management systems, specialized network systems, standards management, and centralized support for the use of commercial cloud capabilities. Given the long-term requirements, the rapid pace of information technology, and the changing expectations of the user community, EOSDIS has evolved continually over the past three decades. However, many challenges remain. Challenges addressed in this paper include: growing data volume and variety; achieving consistency across a diverse set of data producers; managing information about a large number of datasets; migration to a cloud computing environment; optimizing data discovery and access; incorporating user feedback from a diverse community; keeping metadata updated as data collections grow and age; and ensuring that all the content future users will need to understand the datasets is identified and preserved.
Semi-supervised tracking of extreme weather events in global spatio-temporal climate datasets
NASA Astrophysics Data System (ADS)
Kim, S. K.; Prabhat, M.; Williams, D. N.
2017-12-01
Deep neural networks have been successfully applied to detect extreme weather events in large-scale climate datasets, attaining performance that surpasses all previous hand-crafted methods. Recent work has shown that a multichannel spatiotemporal encoder-decoder CNN architecture can localize events with semi-supervised bounding boxes. Motivated by this work, we propose a new learning method based on Variational Auto-Encoders (VAE) and Long Short-Term Memory (LSTM) networks to track extreme weather events in spatio-temporal datasets. We treat spatio-temporal object tracking as learning the probabilistic distribution of continuous latent features of an auto-encoder using stochastic variational inference. For this, we assume that our datasets are i.i.d. and that the latent features can be modeled by a Gaussian distribution. In the proposed method, we first train the VAE to generate an approximate posterior given multichannel climate input containing an extreme climate event at a fixed time. We then predict the bounding box, location, and class of extreme climate events using convolutional layers whose input concatenates three features: the embedding, the sampled mean, and the standard deviation. Lastly, we train the LSTM on this concatenated input to learn the temporal structure of the dataset by recurrently feeding the output back into the next time step's VAE input. Our contribution is two-fold. First, we present the first semi-supervised end-to-end architecture based on a VAE for tracking extreme weather events, applicable to massive unlabeled climate datasets. Second, the temporal movement of events is incorporated into bounding-box prediction via the LSTM, which can improve localization accuracy. To our knowledge, this technique has not been explored in either the climate or the machine learning community.
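Two building blocks assumed by the stochastic variational inference described above are the reparameterisation trick and the closed-form Gaussian KL term; a minimal sketch (not the authors' implementation, which additionally involves convolutional encoders and an LSTM):

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, 1),
    so the sample stays differentiable w.r.t. the encoder outputs."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def gaussian_kl(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) term of the VAE objective."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

# One latent sample and its KL penalty for a 3-dimensional latent space.
rng = random.Random(42)
mu, log_var = [0.2, -0.1, 0.4], [-1.0, -0.5, -2.0]
z = reparameterize(mu, log_var, rng)
kl = gaussian_kl(mu, log_var)
```

In the full model, the sampled mean and standard deviation from this step are what gets concatenated with the embedding and fed to the bounding-box predictor and the LSTM.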
This EnviroAtlas dataset contains polygons depicting the geographic areas of market-based programs, referred to herein as markets, and projects addressing ecosystem services protection in the United States. Depending upon the type of market or project and data availability, polygons reflect market coverage areas, project footprints, or project primary impact areas in which ecosystem service markets and projects operate. The data were collected via surveys and desk research conducted by Forest Trends' Ecosystem Marketplace from 2008 to 2016 on biodiversity (i.e., imperiled species/habitats; wetlands and streams), carbon, and water markets. Additional biodiversity data were obtained from the Regulatory In-lieu Fee and Bank Information Tracking System (RIBITS) database in 2015. Attribute data include information regarding the methodology, design, and development of biodiversity, carbon, and water markets and projects. This dataset was produced by Forest Trends' Ecosystem Marketplace for EnviroAtlas in order to support public access to and use of information related to environmental markets. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
This EnviroAtlas web service contains layers depicting market-based programs and projects addressing ecosystem services protection in the United States. Layers include data collected via surveys and desk research conducted by Forest Trends' Ecosystem Marketplace from 2008 to 2016 on biodiversity (i.e., imperiled species/habitats; wetlands and streams), carbon, and water markets and enabling conditions that facilitate, directly or indirectly, market-based approaches to protecting and investing in those ecosystem services. This dataset was produced by Forest Trends' Ecosystem Marketplace for EnviroAtlas in order to support public access to and use of information related to environmental markets. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Scaled Tank Test Design and Results for the Aquantis 2.5 MW Ocean Current Generation Device
Swales, Henry; Kils, Ole; Coakley, David B.; Sites, Eric; Mayer, Tyler
2015-06-03
Aquantis 2.5 MW Ocean Current Generation Device, Tow Tank Dynamic Rig Structural Analysis Results. This is the detailed documentation for scaled device testing in a tow tank, including models, drawings, presentations, cost of energy analysis, and structural analysis. This dataset also includes specific information on drivetrain, roller bearing, blade fabrication, mooring, and rotor characteristics.
EnviroAtlas - Domestic Water Demand by 12-Digit HUC for the Conterminous United States
This EnviroAtlas dataset includes domestic water demand attributes which provide insight into the amount of water currently used for indoor and outdoor residential purposes in the contiguous United States. The values are based on 2010 water demand and 2010 population distribution, and have been summarized by subwatershed, or 12-digit hydrologic unit code (HUC12). For the purposes of this metric, domestic water use includes residential uses, such as for drinking, bathing, cleaning, landscaping, and pools. Depending on the location, domestic water can be self-supplied, such as by private wells, or publicly-supplied, such as by municipalities. Sources include surface water and groundwater. Estimates are for primary residences only (i.e., excluding second homes and tourism rentals). This dataset was produced by the US EPA to support research and online mapping activities related to the EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
An, Ji‐Yong; Meng, Fan‐Rong; Chen, Xing; Yan, Gui‐Ying; Hu, Ji‐Pu
2016-01-01
Predicting protein–protein interactions (PPIs) is a challenging task and essential for constructing protein interaction networks, which are important for understanding the mechanisms of biological systems. Although a number of high-throughput technologies have been proposed for detecting PPIs, they have unavoidable shortcomings, including high cost, long experiment times, and inherently high false positive rates. For these reasons, many computational methods have been proposed for predicting PPIs. However, the problem is still far from solved. In this article, we propose a novel computational method called RVM-BiGP that combines the relevance vector machine (RVM) model and Bi-gram Probabilities (BiGP) for PPI detection from protein sequences. The major contributions are: (1) protein sequences are represented using the Bi-gram Probabilities (BiGP) feature representation on a Position Specific Scoring Matrix (PSSM), which captures protein evolutionary information; (2) to reduce the influence of noise, Principal Component Analysis (PCA) is used to reduce the dimension of the BiGP vector; (3) the powerful and robust Relevance Vector Machine (RVM) algorithm is used for classification. Five-fold cross-validation experiments executed on the yeast and Helicobacter pylori datasets achieved very high accuracies of 94.57% and 90.57%, respectively. These experimental results are significantly better than those of previous methods. To further evaluate the proposed method, we compared it with the state-of-the-art support vector machine (SVM) classifier on the yeast dataset. The results demonstrate that our RVM-BiGP method is significantly better than the SVM-based method. In addition, we achieved 97.15% accuracy on the imbalanced yeast dataset, which is higher than on the balanced yeast dataset.
The promising experimental results show the efficiency and robustness of the proposed method, which can serve as an automated decision support tool for future proteomics research. To facilitate extensive future studies, we developed a freely available web server, RVM-BiGP-PPIs, written in Hypertext Preprocessor (PHP), for predicting PPIs. The web server, including source code and the datasets, is available at http://219.219.62.123:8888/BiGP/. PMID:27452983
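The BiGP representation in step (1) is commonly computed from products of consecutive PSSM rows, giving a 20 x 20 (400-dimensional) descriptor per sequence; a minimal sketch under that assumption (the paper's exact PSSM normalisation may differ):

```python
def bigram_features(pssm):
    """Bi-gram probability (BiGP) descriptor from an L x 20 PSSM:
    B[i][j] = sum over consecutive positions t of
    pssm[t][i] * pssm[t+1][j], flattened into a 400-element vector."""
    n_aa = 20
    feats = [0.0] * (n_aa * n_aa)
    for t in range(len(pssm) - 1):
        row, nxt = pssm[t], pssm[t + 1]
        for i in range(n_aa):
            if row[i] == 0.0:
                continue  # skip zero entries for speed
            for j in range(n_aa):
                feats[i * n_aa + j] += row[i] * nxt[j]
    return feats

# Toy 3-residue "PSSM" with one-hot rows (real PSSMs hold per-position
# substitution probabilities from PSI-BLAST profiles).
def one_hot(i):
    row = [0.0] * 20
    row[i] = 1.0
    return row

features = bigram_features([one_hot(0), one_hot(1), one_hot(1)])
```

In the pipeline described above, this 400-dimensional vector would then be reduced by PCA before classification with the RVM.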
NASA Astrophysics Data System (ADS)
Martinez, Santa; Besse, Sebastien; Heather, Dave; Barbarisi, Isa; Arviset, Christophe; De Marchi, Guido; Barthelemy, Maud; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; Macfarlane, Alan; Rios, Carlos; Vallejo, Fran; Saiz, Jaime; ESDC (European Space Data Centre) Team
2016-10-01
The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://archives.esac.esa.int/psa. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA is currently implementing a number of significant improvements, mostly driven by the evolution of the PDS standard and the growing need for better interfaces and advanced applications to support science exploitation. The newly designed PSA will enhance the user experience and significantly reduce the complexity of finding data, promoting one-click access to the scientific datasets, with more specialised views when needed. This includes better integration with planetary GIS analysis tools and planetary interoperability services (to search and retrieve data, supporting e.g. PDAP and EPN-TAP). It will also be up to date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's ExoMars and the upcoming BepiColombo missions. Users will have direct access to documentation, information, and tools that are relevant to the scientific use of the datasets, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. A login mechanism will provide additional functionalities to aid users in their searches (e.g. saving queries, managing default views). This contribution will introduce the new PSA, its key features, and its access interfaces.
US Geoscience Information Network, Web Services for Geoscience Information Discovery and Access
NASA Astrophysics Data System (ADS)
Richard, S.; Allison, L.; Clark, R.; Coleman, C.; Chen, G.
2012-04-01
The US Geoscience Information Network has developed metadata profiles for interoperable catalog services based on ISO 19139 and OGC CSW 2.0.2. Data services are currently being deployed for the US Department of Energy-funded National Geothermal Data System. These services use OGC Web Map Services, Web Feature Services, and THREDDS-served NetCDF for gridded datasets. Services and underlying datasets (along with a wide variety of other information and non-information resources) are registered in the catalog system. Metadata for registration is produced by various workflows, including harvesting from OGC capabilities documents, Drupal-based web applications, and transformation from tabular compilations. Catalog search is implemented using the open-source ESRI Geoportal server. We are pursuing various client applications to demonstrate discovery and utilization of the data services. Currently operational applications include an ESRI ArcMap extension for catalog search and data acquisition from map services, and a catalog browse-and-search application built on OpenLayers and Django. We are developing use cases and requirements for other applications that will use geothermal data services for resource exploration and evaluation.
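A CSW 2.0.2 catalog like the one described above is typically queried with a GetRecords POST request; a minimal sketch of such a request body, assuming a full-text filter on csw:AnyText (endpoint URLs and profile details are deployment-specific):

```python
def getrecords_request(anytext, max_records=10):
    """Build a minimal OGC CSW 2.0.2 GetRecords POST body with a
    PropertyIsLike filter on csw:AnyText (element and namespace names
    per the CSW 2.0.2 specification)."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                xmlns:ogc="http://www.opengis.net/ogc"
                service="CSW" version="2.0.2"
                resultType="results" maxRecords="{max_records}">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>summary</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <ogc:PropertyIsLike wildCard="%" singleChar="_" escapeChar="\\">
          <ogc:PropertyName>csw:AnyText</ogc:PropertyName>
          <ogc:Literal>%{anytext}%</ogc:Literal>
        </ogc:PropertyIsLike>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>"""

body = getrecords_request("geothermal")
```

The body would be POSTed to the catalog's CSW endpoint with Content-Type `application/xml`; the response is a csw:GetRecordsResponse containing matching summary records.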
RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system
Jensen, Tue V.; Pinson, Pierre
2017-01-01
Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling of such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations of renewable energy generation over a period of 3 years. These may be scaled according to the envisioned degree of renewable penetration in a future European energy system. The spatial coverage, completeness, and resolution of this dataset open the door to the evaluation, scaling analysis, and replicability checking of a wealth of proposals in, e.g., market design, network actor coordination, and forecasting of renewable power generation. PMID:29182600
RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system.
Jensen, Tue V; Pinson, Pierre
2017-11-28
Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling of such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations of renewable energy generation over a period of 3 years. These may be scaled according to the envisioned degree of renewable penetration in a future European energy system. The spatial coverage, completeness, and resolution of this dataset open the door to the evaluation, scaling analysis, and replicability checking of a wealth of proposals in, e.g., market design, network actor coordination, and forecasting of renewable power generation.
EnviroAtlas - Tampa, FL - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Austin, TX - Proximity to Parks
This EnviroAtlas dataset shows the approximate walking distance from a park entrance at any given location within the EnviroAtlas community boundary. The zones are estimated in 1/4 km intervals up to 1km then in 1km intervals up to 5km. Park entrances were included in this analysis if they were within 5km of the community boundary. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
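The zoning scheme described above (quarter-kilometer bands out to 1 km, then one-kilometer bands out to 5 km) can be sketched as a simple lookup. This is an illustrative reconstruction of the banding, not EnviroAtlas code; the function name and return convention are assumptions.

```python
# Hypothetical sketch of the distance bands described in the abstract:
# 1/4 km intervals up to 1 km, then 1 km intervals up to 5 km.

def walking_zone(distance_km):
    """Return the (lower, upper) band in km containing distance_km,
    or None beyond the 5 km analysis limit."""
    bounds = [0, 0.25, 0.5, 0.75, 1, 2, 3, 4, 5]
    for lo, hi in zip(bounds, bounds[1:]):
        if lo <= distance_km < hi:
            return (lo, hi)
    return (4, 5) if distance_km == 5 else None
```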
RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system
NASA Astrophysics Data System (ADS)
Jensen, Tue V.; Pinson, Pierre
2017-11-01
Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation over a period of 3 years. These may be scaled according to the envisioned degree of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability checking of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.
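The abstract notes that the renewable time series "may be scaled according to the envisioned degrees of renewable penetration." One plausible reading of that operation, sketched here as an assumption rather than the dataset's documented procedure, is to rescale a generation series so its total energy covers a target share of total demand energy:

```python
# Illustrative sketch (not part of the RE-Europe dataset): scale a
# renewable generation series so its total energy equals a target
# share of total demand energy. Hourly MW series assumed.

def scale_to_penetration(renewable_mw, demand_mw, target_share):
    """Return a scaled copy of renewable_mw whose summed energy is
    target_share times the summed demand energy."""
    factor = target_share * sum(demand_mw) / sum(renewable_mw)
    return [g * factor for g in renewable_mw]
```

A uniform scaling like this preserves the weather-driven shape of the series while letting analysts dial the system between low and very high renewable shares.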
EnviroAtlas - Portland, OR - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Woodbine, IA - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Milwaukee, WI - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Fresno, CA - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Pittsburgh, PA - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Portland, OR - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Tampa, FL - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - New Bedford, MA - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Green Bay, WI - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Durham, NC - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Phoenix, AZ - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Green Bay, WI - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - New Bedford, MA - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Woodbine, IA - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Fresno, CA - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Phoenix, AZ - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Portland, ME - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Portland, ME - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Pittsburgh, PA - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Durham, NC - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Milwaukee, WI - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Tampa, FL - Land Cover by Block Group
This EnviroAtlas dataset describes the percentage of each block group that is classified as impervious, forest, green space, wetland, and agriculture. Impervious is a combination of dark and light impervious. Forest is a combination of trees and forest and woody wetlands. Green space is a combination of trees and forest, grass and herbaceous, agriculture, woody wetlands, and emergent wetlands. Wetlands includes both Woody and Emergent Wetlands. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
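The class roll-up described above maps directly to a small aggregation. The sketch below illustrates those combination rules; the dictionary keys are illustrative stand-ins, not EnviroAtlas field names.

```python
# Sketch of the composite-class roll-up described in the abstract.
# Base-class keys are hypothetical, not EnviroAtlas attribute names.

def composite_cover(pct):
    """Combine base land-cover percentages (dict, percent of block
    group) into the composite classes the description defines."""
    return {
        # Impervious = dark + light impervious
        "impervious": pct["dark_impervious"] + pct["light_impervious"],
        # Forest = trees and forest + woody wetlands
        "forest": pct["trees_forest"] + pct["woody_wetlands"],
        # Green space = trees and forest + grass/herbaceous + agriculture
        #               + woody wetlands + emergent wetlands
        "green_space": pct["trees_forest"] + pct["grass_herbaceous"]
                       + pct["agriculture"] + pct["woody_wetlands"]
                       + pct["emergent_wetlands"],
        # Wetland = woody + emergent wetlands
        "wetland": pct["woody_wetlands"] + pct["emergent_wetlands"],
    }
```

Note that the composite classes overlap by construction (woody wetlands contribute to forest, green space, and wetland), so the composite percentages need not sum to 100.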
Predicting Virtual World User Population Fluctuations with Deep Learning
Kim, Young Bin; Park, Nuri; Zhang, Qimeng; Kim, Jun Gi; Kang, Shin Jin; Kim, Chang Hun
2016-01-01
This paper proposes a system for predicting increases in virtual world user actions. The virtual world user population is a very important aspect of these worlds; however, methods for predicting fluctuations in these populations have not been well documented. Therefore, we attempt to predict changes in virtual world user populations with deep learning, using easily accessible online data, including formal datasets from Google Trends, Wikipedia, and online communities, as well as informal datasets collected from online forums. We use the proposed system to analyze the user population of EVE Online, one of the largest virtual worlds. PMID:27936009
Interoperable Solar Data and Metadata via LISIRD 3
NASA Astrophysics Data System (ADS)
Wilson, A.; Lindholm, D. M.; Pankratz, C. K.; Snow, M. A.; Woods, T. N.
2015-12-01
LISIRD 3 is a major upgrade of the LASP Interactive Solar Irradiance Data Center (LISIRD), which serves several dozen space based solar irradiance and related data products to the public. Through interactive plots, LISIRD 3 provides data browsing supported by data subsetting and aggregation. Incorporating a semantically enabled metadata repository, LISIRD 3 users see current, vetted, consistent information about the datasets offered. Users can now also search for datasets based on metadata fields such as dataset type and/or spectral or temporal range. This semantic database enables metadata browsing, so users can discover the relationships between datasets, instruments, spacecraft, mission and PI. The database also enables creation and publication of metadata records in a variety of formats, such as SPASE or ISO, making these datasets more discoverable. The database also enables the possibility of a public SPARQL endpoint, making the metadata browsable in an automated fashion. LISIRD 3's data access middleware, LaTiS, provides dynamic, on demand reformatting of data and timestamps, subsetting and aggregation, and other server side functionality via a RESTful OPeNDAP compliant API, enabling interoperability between LASP datasets and many common tools. LISIRD 3's templated front end design, coupled with the uniform data interface offered by LaTiS, allows easy integration of new datasets. Consequently the number and variety of datasets offered by LISIRD has grown to encompass several dozen, with many more to come. This poster will discuss design and implementation of LISIRD 3, including tools used, capabilities enabled, and issues encountered.
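The search capability described above, finding datasets by type and by overlapping spectral or temporal range, can be illustrated with a small in-memory filter. This is a conceptual sketch only; the record fields and function are assumptions, not the LISIRD 3 schema or API.

```python
# Conceptual sketch of range-based metadata search like that described
# for LISIRD 3. Record fields are hypothetical, not the LISIRD schema.

def search(records, dataset_type=None, t_range=None):
    """Filter metadata records by optional dataset type and an optional
    overlapping (start, end) temporal range."""
    def overlaps(a, b):
        # Two closed intervals overlap iff each starts before the other ends.
        return a[0] <= b[1] and b[0] <= a[1]

    out = []
    for rec in records:
        if dataset_type is not None and rec["type"] != dataset_type:
            continue
        if t_range is not None and not overlaps(rec["t_range"], t_range):
            continue
        out.append(rec)
    return out
```

In the real system this query runs against a semantic metadata store (e.g. via SPARQL) rather than a Python list, but the overlap test at its core is the same.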
Scaling up: What coupled land-atmosphere models can tell us about critical zone processes
NASA Astrophysics Data System (ADS)
FitzGerald, K. A.; Masarik, M. T.; Rudisill, W. J.; Gelb, L.; Flores, A. N.
2017-12-01
A significant limitation to extending our knowledge of critical zone (CZ) evolution and function is a lack of hydrometeorological information at sufficiently fine spatial and temporal resolutions to resolve topo-climatic gradients and adequate spatial and temporal extent to capture a range of climatic conditions across ecoregions. Research at critical zone observatories (CZOs) suggests hydrometeorological stores and fluxes exert key controls on processes such as hydrologic partitioning and runoff generation, landscape evolution, soil formation, biogeochemical cycling, and vegetation dynamics. However, advancing fundamental understanding of CZ processes necessitates understanding how hydrometeorological drivers vary across space and time. As a result of recent advances in computational capabilities it has become possible, although still computationally expensive, to simulate hydrometeorological conditions via high resolution coupled land-atmosphere models. Using the Weather Research and Forecasting (WRF) model, we developed a high spatiotemporal resolution dataset extending from water year 1987 to present for the Snake River Basin in the northwestern USA including the Reynolds Creek and Dry Creek Experimental Watersheds, both part of the Reynolds Creek CZO, as well as a range of other ecosystems including shrubland desert, montane forests, and alpine tundra. Drawing from hypotheses generated by work at these sites and across the CZO network, we use the resulting dataset in combination with CZO observations and publicly available datasets to provide insights regarding hydrologic partitioning, vegetation distribution, and erosional processes. This dataset provides key context in interpreting and reconciling what observations obtained at particular sites reveal about underlying CZ structure and function.
While this dataset does not extend to future climates, the same modeling framework can be used to dynamically downscale coarse global climate model output to scales relevant to CZ processes. This presents an opportunity to better characterize the impact of climate change on the CZ. We also argue that opportunities exist beyond the one way flow of information and that what we learn at CZOs has the potential to contribute significantly to improved Earth system models.
Heuristics for Relevancy Ranking of Earth Dataset Search Results
NASA Astrophysics Data System (ADS)
Lynnes, C.; Quinn, P.; Norton, J.
2016-12-01
As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.
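The overlap heuristics this abstract describes (matching query time ranges and spatial areas against dataset coverage, plus a novelty boost for newer versions) can be sketched as a simple scoring function. The record structure, weights, and scoring formula below are illustrative assumptions, not the actual Common Metadata Repository algorithm:

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    name: str
    start_year: int   # temporal coverage start
    end_year: int     # temporal coverage end
    bbox: tuple       # (min_lon, min_lat, max_lon, max_lat)
    version: int      # later versions score higher (novelty heuristic)

def temporal_overlap(rec, q_start, q_end):
    """Fraction of the query interval covered by the dataset."""
    overlap = max(0, min(rec.end_year, q_end) - max(rec.start_year, q_start))
    span = max(1, q_end - q_start)
    return min(1.0, overlap / span)

def spatial_overlap(rec, q_bbox):
    """Fraction of the query bounding box covered by the dataset bounding box."""
    ax0, ay0, ax1, ay1 = rec.bbox
    bx0, by0, bx1, by1 = q_bbox
    w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    h = max(0.0, min(ay1, by1) - max(ay0, by0))
    q_area = max(1e-9, (bx1 - bx0) * (by1 - by0))
    return min(1.0, (w * h) / q_area)

def relevance(rec, q_start, q_end, q_bbox):
    # Weighted blend; weights are invented for illustration, not CMR's tuning.
    return (0.5 * temporal_overlap(rec, q_start, q_end)
            + 0.4 * spatial_overlap(rec, q_bbox)
            + 0.1 * (rec.version / (rec.version + 1)))  # novelty boost saturates
```

A dataset that covers the query's time range and area will outrank one with no temporal overlap and partial spatial overlap, which is the intended behavior of the heuristics described above.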
Heuristics for Relevancy Ranking of Earth Dataset Search Results
NASA Technical Reports Server (NTRS)
Lynnes, Christopher; Quinn, Patrick; Norton, James
2016-01-01
As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.
Relevancy Ranking of Satellite Dataset Search Results
NASA Technical Reports Server (NTRS)
Lynnes, Christopher; Quinn, Patrick; Norton, James
2017-01-01
As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.
Microwave Radiometer - UND Radiometrics MWR, Rufus - Reviewed Data
Leo, Laura
2018-01-09
Reviewed dataset that also includes post-reprocessed level 1 and level 2 data files from November 2015 to May 2016 (refer to "Additional Information"). Monitors real-time profiles of temperature (K), water vapor (g m⁻³), relative humidity (%), and liquid water (g m⁻³) up to 10 km.
The NAS Computational Aerosciences Archive
NASA Technical Reports Server (NTRS)
Miceli, Kristina D.; Globus, Al; Lasinski, T. A. (Technical Monitor)
1995-01-01
In order to further the state-of-the-art in computational aerosciences (CAS) technology, researchers must be able to gather and understand existing work in the field. One aspect of this information gathering is studying published work available in scientific journals and conference proceedings. However, current scientific publications are very limited in the type and amount of information that they can disseminate. Information is typically restricted to text, a few images, and a bibliography list. Additional information that might be useful to the researcher, such as further visual results, referenced papers, and datasets, is not available. New forms of electronic publication, such as the World Wide Web (WWW), limit publication size only by available disk space and data transmission bandwidth, both of which are improving rapidly. The Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center is in the process of creating an archive of CAS information on the WWW. This archive will be based on the large amount of information produced by researchers associated with the NAS facility. The archive will contain technical summaries and reports of research performed on NAS supercomputers, visual results (images, animations, visualization system scripts), datasets, and any other supporting meta-information. This information will be available via the WWW through the NAS homepage, located at http://www.nas.nasa.gov/, fully indexed for searching. The main components of the archive are technical summaries and reports, visual results, and datasets. Technical summaries are gathered every year by researchers who have been allotted resources on NAS supercomputers. These summaries, together with supporting visual results and references, are browsable by interested researchers. Referenced papers made available by researchers can be accessed through hypertext links. 
Technical reports are in-depth accounts of tools and applications research projects performed by NAS staff members and collaborators. Visual results, which may be available in the form of images, animations, and/or visualization scripts, are generated by researchers with respect to a certain research project, depicting dataset features that were determined important by the investigating researcher. For example, script files for visualization systems (e.g. FAST, PLOT3D, AVS) are provided to create visualizations on the user's local workstation to elucidate the key points of the numerical study. Users can then interact with the data starting where the investigator left off. Datasets are intended to give researchers an opportunity to understand previous work, 'mine' solutions for new information (for example, have you ever read a paper thinking "I wonder what the helicity density looks like?"), compare new techniques with older results, collaborate with remote colleagues, and perform validation. Supporting meta-information associated with the research projects is also important to provide additional context for research projects. This may include information such as the software used in the simulation (e.g. grid generators, flow solvers, visualization). In addition to serving the CAS research community, the information archive will also be helpful to students, visualization system developers and researchers, and management. Students (of any age) can use the data to study fluid dynamics, compare results from different flow solvers, learn about meshing techniques, etc., leading to better informed individuals. For these users it is particularly important that visualization be integrated into dataset archives. Visualization researchers can use dataset archives to test algorithms and techniques, leading to better visualization systems. Management can use the data to figure out what is really going on behind the viewgraphs. 
All users will benefit from fast, easy, and convenient access to CFD datasets. The CAS information archive hopes to serve as a useful resource to those interested in computational sciences. At present, only information that may be distributed internationally is made available via the archive. Studies are underway to determine security requirements and solutions to make additional information available. By providing access to the archive via the WWW, the process of information gathering can be more productive and fruitful due to ease of access and ability to manage many different types of information. As the archive grows, additional resources from outside NAS will be added, providing a dynamic source of research results.
NASA Astrophysics Data System (ADS)
Lloyd, S. A.; Acker, J. G.; Prados, A. I.; Leptoukh, G. G.
2008-12-01
One of the biggest obstacles for the average Earth science student today is locating and obtaining satellite-based remote sensing datasets in a format that is accessible and optimal for their data analysis needs. At the Goddard Earth Sciences Data and Information Services Center (GES-DISC) alone, on the order of hundreds of Terabytes of data are available for distribution to scientists, students and the general public. The single biggest and time-consuming hurdle for most students when they begin their study of the various datasets is how to slog through this mountain of data to arrive at a properly sub-setted and manageable dataset to answer their science question(s). The GES DISC provides a number of tools for data access and visualization, including the Google-like Mirador search engine and the powerful GES-DISC Interactive Online Visualization ANd aNalysis Infrastructure (Giovanni) web interface. Giovanni provides a simple way to visualize, analyze and access vast amounts of satellite-based Earth science data. Giovanni's features and practical examples of its use will be demonstrated, with an emphasis on how satellite remote sensing can help students understand recent events in the atmosphere and biosphere. Giovanni is actually a series of sixteen similar web-based data interfaces, each of which covers a single satellite dataset (such as TRMM, TOMS, OMI, AIRS, MLS, HALOE, etc.) or a group of related datasets (such as MODIS and MISR for aerosols, SeaWIFS and MODIS for ocean color, and the suite of A-Train observations co-located along the CloudSat orbital path). Recently, ground-based datasets have been included in Giovanni, including the Northern Eurasian Earth Science Partnership Initiative (NEESPI), and EPA fine particulate matter (PM2.5) for air quality. Model data such as the Goddard GOCART model and MERRA meteorological reanalyses (in process) are being increasingly incorporated into Giovanni to facilitate model-data intercomparison. 
A full suite of data analysis and visualization tools is also available within Giovanni. The GES DISC is currently developing a systematic series of training modules for Earth science satellite data, associated with our development of additional datasets and data visualization tools for Giovanni. Training sessions will include an overview of the Earth science datasets archived at Goddard, an overview of terms and techniques associated with satellite remote sensing, dataset-specific issues, an overview of Giovanni functionality, and a series of examples of how data can be readily accessed and visualized.
NASA SPoRT Initialization Datasets for Local Model Runs in the Environmental Modeling System
NASA Technical Reports Server (NTRS)
Case, Jonathan L.; LaFontaine, Frank J.; Molthan, Andrew L.; Carcione, Brian; Wood, Lance; Maloney, Joseph; Estupinan, Jeral; Medlin, Jeffrey M.; Blottman, Peter; Rozumalski, Robert A.
2011-01-01
The NASA Short-term Prediction Research and Transition (SPoRT) Center has developed several products for its National Weather Service (NWS) partners that can be used to initialize local model runs within the Weather Research and Forecasting (WRF) Environmental Modeling System (EMS). These real-time datasets consist of surface-based information updated at least once per day, and produced in a composite or gridded product that is easily incorporated into the WRF EMS. The primary goal for making these NASA datasets available to the WRF EMS community is to provide timely and high-quality information at a spatial resolution comparable to that used in the local model configurations (i.e., convection-allowing scales). The current suite of SPoRT products supported in the WRF EMS include a Sea Surface Temperature (SST) composite, a Great Lakes sea-ice extent, a Greenness Vegetation Fraction (GVF) composite, and Land Information System (LIS) gridded output. The SPoRT SST composite is a blend of primarily the Moderate Resolution Imaging Spectroradiometer (MODIS) infrared and Advanced Microwave Scanning Radiometer for Earth Observing System data for non-precipitation coverage over the oceans at 2-km resolution. The composite includes a special lake surface temperature analysis over the Great Lakes using contributions from the Remote Sensing Systems temperature data. The Great Lakes Environmental Research Laboratory Ice Percentage product is used to create a sea-ice mask in the SPoRT SST composite. The sea-ice mask is produced daily (in-season) at 1.8-km resolution and identifies ice percentage from 0 to 100% in 10% increments, with values above 90% flagged as ice.
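The ice-mask binning described at the end of this abstract (10% increments, with cells above 90% flagged as ice) can be sketched in a few lines; the array layout and function name are assumptions for illustration, not the SPoRT production code:

```python
import numpy as np

def sea_ice_mask(ice_pct):
    """Bin ice percentage into 10% increments and flag cells above 90% as ice.

    `ice_pct` is a 2-D array of ice percentages (0-100), standing in for a
    gridded product like the GLERL Ice Percentage field described above.
    """
    binned = (np.clip(ice_pct, 0, 100) // 10) * 10  # 0, 10, ..., 100
    is_ice = ice_pct > 90                           # flagged as ice
    return binned.astype(int), is_ice
```

Floor division by the increment width is the usual way to quantize a percentage field; the threshold comparison then yields a boolean mask suitable for compositing.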
NASA Astrophysics Data System (ADS)
Poyatos, Rafael; Sus, Oliver; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi
2018-05-01
The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km2). We simulated gaps at different missingness levels (10-80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. 
MICE informed by relevant ecological variables allowed us to fill the gaps in the IEFC incomplete dataset (5495 plots) and quantify imputation uncertainty. Resulting spatial patterns of the studied traits in Catalan forests were broadly similar when using species means, regression kriging or the best-performing MICE application, but some important discrepancies were observed at the local level. Our results highlight the need to assess imputation quality beyond just imputation accuracy and show that including environmental information in statistical imputation approaches yields more plausible imputations in spatially explicit plant trait datasets.
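The gap-simulation experiment described in this abstract can be illustrated with the two simplest baselines from the comparison (overall trait means and species means). The matrix construction and error metric below are a toy reconstruction, not the authors' actual IEFC workflow or their MICE setup:

```python
import numpy as np

def simulate_gaps(X, frac, rng):
    """Blank out a random fraction of entries in a complete trait matrix."""
    mask = rng.random(X.shape) < frac
    Xg = X.copy()
    Xg[mask] = np.nan
    return Xg, mask

def impute_overall_mean(Xg):
    """Fill every gap with its trait's (column's) overall mean."""
    col_means = np.nanmean(Xg, axis=0)
    out = Xg.copy()
    idx = np.where(np.isnan(out))
    out[idx] = np.take(col_means, idx[1])
    return out

def impute_species_mean(Xg, species):
    """Fill gaps with per-species trait means, falling back to overall means."""
    out = impute_overall_mean(Xg)
    for sp in np.unique(species):
        rows = species == sp
        sp_means = np.nanmean(Xg[rows], axis=0)
        for j in range(Xg.shape[1]):
            gaps = rows & np.isnan(Xg[:, j])
            if not np.isnan(sp_means[j]):
                out[gaps, j] = sp_means[j]
    return out

def rmse(X, Ximp, mask):
    """Imputation accuracy over the artificially blanked entries only."""
    return float(np.sqrt(np.mean((X[mask] - Ximp[mask]) ** 2)))
```

When species differ strongly in their trait values, species-mean imputation recovers the blanked entries much better than the overall mean, matching the accuracy ordering reported above; the paper's point is that accuracy alone can still mask a poor representation of trait distributions.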
NASA Astrophysics Data System (ADS)
Ward, Dennis W.; Bennett, Kelly W.
2017-05-01
The Sensor Information Testbed COllaborative Research Environment (SITCORE) and the Automated Online Data Repository (AODR) are significant enablers of the U.S. Army Research Laboratory (ARL)'s Open Campus Initiative and together create a highly collaborative research laboratory and testbed environment focused on sensor data and information fusion. SITCORE creates a virtual research development environment allowing collaboration from other locations, including DoD, industry, academia, and coalition facilities. SITCORE combined with AODR provides end-to-end algorithm development, experimentation, demonstration, and validation. The AODR enterprise allows ARL, as well as other government organizations, industry, and academia, to store and disseminate multiple intelligence (Multi-INT) datasets collected at field exercises and demonstrations, and to facilitate research and development (R&D) and advancement of analytical tools and algorithms supporting the Intelligence, Surveillance, and Reconnaissance (ISR) community. The AODR provides a potential central repository for standards-compliant datasets to serve as the "go-to" location for lessons learned and reference products. Many of the AODR datasets have associated ground truth and other metadata, which provides a rich and robust data suite for researchers to develop, test, and refine their algorithms. Researchers download the test data to their own environments using a sophisticated web interface. The AODR allows researchers to request copies of stored datasets and the government to process the requests and approvals in an automated fashion. Access to the AODR requires two-factor authentication in the form of a Common Access Card (CAC) or External Certificate Authority (ECA) certificate.
Open University Learning Analytics dataset.
Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek
2017-11-28
Learning Analytics focuses on the collection and analysis of learners' data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students' interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.
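The daily click summaries described in this abstract lend themselves to simple aggregation. The sketch below assumes rows shaped like the dataset's published studentVle table (code_module, id_student, date, sum_click) and uses toy data rather than the real files:

```python
from collections import defaultdict

def total_clicks_per_student(vle_rows):
    """Aggregate OULAD-style studentVle rows into per-(module, student) totals.

    Each row is a dict modeled on the studentVle table: `code_module`,
    `id_student`, `date` (day relative to course start), and `sum_click`
    (clicks on that day). The rows here are illustrative, not real data.
    """
    totals = defaultdict(int)
    for row in vle_rows:
        totals[(row["code_module"], row["id_student"])] += row["sum_click"]
    return dict(totals)
```

Totals like these are a typical first step before relating VLE activity to the assessment results and demographics also included in the dataset.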
Open University Learning Analytics dataset
Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek
2017-01-01
Learning Analytics focuses on the collection and analysis of learners’ data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students’ interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license. PMID:29182599
Data Type Registry - Cross Road Between Catalogs, Data And Semantics
NASA Astrophysics Data System (ADS)
Richard, S. M.; Zaslavsky, I.; Bristol, S.
2017-12-01
As more data become accessible online, opportunities are increasing to improve search for information within datasets and to automate some levels of data integration. A prerequisite for these advances is indexing the kinds of information that are present in datasets and providing machine-actionable descriptions of data structures. We are exploring approaches to enabling these capabilities in the EarthCube DigitalCrust and Data Discovery Hub Building Block projects, building on the Data Type Registry (DTR) working group activity in the Research Data Alliance. We are prototyping a registry implementation using the CNRI Cordra platform and API to enable 'deep registration' of datasets for building hydrogeologic models of the Earth's crust, and executing complex science scenarios for river chemistry and coral bleaching data. These use cases require the ability to respond to queries such as: What are the properties of Entity X? What entities include property Y (or L, M, N…)? What DataTypes are about Entity X and include property Y? Development of the registry to enable these capabilities requires more in-depth metadata than is commonly available, so we are also exploring approaches to analyzing simple tabular data to automate recognition of entities and properties, and to assist users with establishing semantic mappings to data integration vocabularies. This poster will review the current capabilities and implementation of a data type registry.
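The query patterns this abstract lists can be illustrated with a toy in-memory registry; the type names, entities, and properties below are invented for illustration and bear no relation to the actual Cordra registry contents:

```python
# Toy data type registry: each registered type declares the entity it
# describes and the properties it carries. All names here are hypothetical.
REGISTRY = {
    "WellLog": {"entity": "Borehole", "properties": {"depth", "temperature"}},
    "RiverChem": {"entity": "RiverReach", "properties": {"discharge", "nitrate"}},
    "CoralSurvey": {"entity": "Reef", "properties": {"temperature", "bleaching"}},
}

def properties_of(entity):
    """What are the properties of Entity X?"""
    props = set()
    for t in REGISTRY.values():
        if t["entity"] == entity:
            props |= t["properties"]
    return props

def types_with_property(prop, entity=None):
    """What DataTypes include property Y (optionally restricted to Entity X)?"""
    return {name for name, t in REGISTRY.items()
            if prop in t["properties"] and (entity is None or t["entity"] == entity)}
```

A real DTR would answer these queries over registered schema records rather than a Python dict, but the shape of the lookups is the same: index types by entity and by property, then intersect.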
A comparison of NLCD 2011 and LANDFIRE EVT 2010: Regional and national summaries.
McKerrow, Alexa; Dewitz, Jon; Long, Donald G.; Nelson, Kurtis; Connot, Joel A.; Smith, Jim
2016-01-01
In order to provide the land cover user community with a summary of the similarities and differences between the 2011 National Land Cover Database (NLCD) and the Landscape Fire and Resource Management Planning Tools Program Existing Vegetation 2010 Data (LANDFIRE EVT), the two datasets were compared at national (conterminous U.S.) and regional (Eastern, Midwestern, and Western) extents (Figure 1). The comparisons were done by generalizing the LANDFIRE data to be consistent with mapped land cover classes in the NLCD (i.e., crosswalked). Summaries of the comparisons were based on areal extent, including 1) the total extent of each land cover class, and 2) land cover classes in corresponding 900-m² areas. The results from the comparisons provide the user community information regarding the utility of both datasets relative to their intended uses.
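The comparison procedure this abstract describes (crosswalking one legend to the other, then summarizing by total class extent and by agreement in corresponding 900-m² cells) can be sketched with small arrays. The class codes and the mapping here are invented, not the actual NLCD or LANDFIRE legends:

```python
import numpy as np

def crosswalk(raster, mapping):
    """Recode one legend into another (e.g. LANDFIRE EVT codes -> NLCD classes)."""
    out = raster.copy()
    for src, dst in mapping.items():
        out[raster == src] = dst
    return out

def per_class_area(raster, cell_area_m2=900):
    """Total area of each class, assuming 30 m (900 m²) cells as in NLCD."""
    classes, counts = np.unique(raster, return_counts=True)
    return dict(zip(classes.tolist(), (counts * cell_area_m2).tolist()))

def cellwise_agreement(a, b):
    """Fraction of corresponding cells assigned the same class."""
    return float(np.mean(a == b))
```

The two summary functions correspond to the two comparison modes named in the abstract: aggregate areal extent per class, and cell-by-cell agreement after the crosswalk.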
EnviroAtlas National Layers Master Web Service
This EnviroAtlas web service supports research and online mapping activities related to EnviroAtlas (https://www.epa.gov/enviroatlas). This web service includes layers depicting EnviroAtlas national metrics mapped at the 12-digit HUC within the conterminous United States. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Dataset on spatial distribution and location of universities in Nigeria.
Adeyemi, G A; Edeki, S O
2018-06-01
Access to a quality educational system and the location of educational institutions are of great importance for the future prospects of youth in any nation. These, in turn, have great effects on the economic growth and development of any country. Thus, the dataset contained in this article examines and explains the spatial distribution of universities in the Nigerian system of education. Data from the university commission, Nigeria, as at December 2017 are used. These include all 40 federal universities, 44 state universities, and 69 private universities, making a total of 153 universities in the Nigerian system of education. The data analysis is via the Geographic Information System (GIS) software. The dataset contained in this article will be of immense assistance to national educational policy makers, parents, and potential students as regards reliable academic decision making.
NASA Astrophysics Data System (ADS)
Schepaschenko, D.; McCallum, I.; Shvidenko, A.; Kraxner, F.; Fritz, S.
2009-04-01
There is a critical need for accurate land cover information for resource assessment, biophysical modeling, greenhouse gas studies, and for estimating possible terrestrial responses and feedbacks to climate change. However, practically all existing land cover datasets have quite a high level of uncertainty and suffer from a lack of important details that does not allow for relevant parameterization, e.g., data derived from different forest inventories. The objective of this study is to develop a methodology in order to create a hybrid land cover dataset at a level that would satisfy the requirements of the verified terrestrial biota full greenhouse gas account (Shvidenko et al., 2008) for large regions, i.e., Russia. Such requirements necessitate a detailed quantification of land classes (e.g., for forests - dominant species, age, growing stock, net primary production, etc.) with additional information on uncertainties of the major biometric and ecological parameters in the range of 10-20% and a confidence interval of around 0.9. The approach taken here allows the integration of different datasets to explore synergies and in particular the merging and harmonization of land and forest inventories, ecological monitoring, remote sensing data and in-situ information. The following datasets have been integrated: Remote sensing: Global Land Cover 2000 (Fritz et al., 2003), Vegetation Continuous Fields (Hansen et al., 2002), Vegetation Fire (Sukhinin, 2007), Regional land cover (Schmullius et al., 2005); GIS: Soil 1:2.5 Mio (Dokuchaev Soil Science Institute, 1996), Administrative Regions 1:2.5 Mio, Vegetation 1:4 Mio, Bioclimatic Zones 1:4 Mio (Stolbovoi & McCallum, 2002), Forest Enterprises 1:2.5 Mio, Rivers/Lakes and Roads/Railways 1:1 Mio (IIASA's data base); Inventories and statistics: State Land Account (FARSC RF, 2006), State Forest Account - SFA (FFS RF, 2003), Disturbances in forests (FFS RF, 2006). 
The resulting hybrid land cover dataset at 1-km resolution comprises the following classes: Forest (each grid cell links to the SFA database, which contains 86,613 records); Agriculture (5 classes, parameterized by 89 administrative units); Wetlands (8 classes, parameterized by 83 zone/region units); Open Woodland; Burnt area; Shrub/grassland (50 classes, parameterized by 300 zone/region units); Water; Unproductive area. This study has demonstrated the ability to produce a highly detailed (both spatially and thematically) land cover dataset over Russia. Future efforts include further validation of the hybrid land cover dataset for Russia, and its use for assessment of the terrestrial biota full greenhouse gas budget across Russia. The methodology proposed in this study could be applied at the global level. Results of such an undertaking would, however, be highly dependent upon the quality of the available ground data. The implementation of the hybrid land cover dataset was undertaken in a way that it can be regularly updated based on new ground data and remote sensing products (e.g., MODIS).
The international primary ciliary dyskinesia cohort (iPCD Cohort): methods and first results.
Goutaki, Myrofora; Maurer, Elisabeth; Halbeisen, Florian S; Amirav, Israel; Barbato, Angelo; Behan, Laura; Boon, Mieke; Casaulta, Carmen; Clement, Annick; Crowley, Suzanne; Haarman, Eric; Hogg, Claire; Karadag, Bulent; Koerner-Rettberg, Cordula; Leigh, Margaret W; Loebinger, Michael R; Mazurek, Henryk; Morgan, Lucy; Nielsen, Kim G; Omran, Heymut; Schwerk, Nicolaus; Scigliano, Sergio; Werner, Claudius; Yiallouros, Panayiotis; Zivkovic, Zorica; Lucas, Jane S; Kuehni, Claudia E
2017-01-01
Data on primary ciliary dyskinesia (PCD) epidemiology is scarce and published studies are characterised by low numbers. In the framework of the European Union project BESTCILIA we aimed to combine all available datasets in a retrospective international PCD cohort (iPCD Cohort). We identified eligible datasets by performing a systematic review of published studies containing clinical information on PCD, and by contacting members of past and current European Respiratory Society Task Forces on PCD. We compared the contents of the datasets, clarified definitions and pooled them in a standardised format. As of April 2016 the iPCD Cohort includes data on 3013 patients from 18 countries. It includes data on diagnostic evaluations, symptoms, lung function, growth and treatments. Longitudinal data are currently available for 542 patients. The extent of clinical details per patient varies between centres. More than 50% of patients have a definite PCD diagnosis based on recent guidelines. Children aged 10-19 years are the largest age group, followed by younger children (≤9 years) and young adults (20-29 years). This is the largest observational PCD dataset available to date. It will allow us to answer pertinent questions on clinical phenotype, disease severity, prognosis and effect of treatments, and to investigate genotype-phenotype correlations. Copyright ©ERS 2017.
A benchmark for vehicle detection on wide area motion imagery
NASA Astrophysics Data System (ADS)
Catrambone, Joseph; Amzovski, Ismail; Liang, Pengpeng; Blasch, Erik; Sheaff, Carolyn; Wang, Zhonghai; Chen, Genshe; Ling, Haibin
2015-05-01
Wide area motion imagery (WAMI) has been attracting increased research attention due to its large spatial and temporal coverage. An important application is moving target analysis, where vehicle detection is often one of the first steps before advanced activity analysis. While many vehicle detection algorithms exist, a thorough evaluation of them on WAMI data remains a challenge, mainly due to the lack of an appropriate benchmark dataset. In this paper, we address this need by presenting a new benchmark for vehicle detection in wide area motion imagery. The WAMI benchmark is based on the recently available Wright-Patterson Air Force Base (WPAFB09) dataset and the associated Temple Resolved Uncertainty Target History (TRUTH) target annotation. Trajectory annotations were provided in the original release of the WPAFB09 dataset, but detailed vehicle annotations were not, and static vehicles, e.g., in parking lots, were not identified. Addressing these issues, we re-annotated the whole dataset with detailed information for each vehicle, including not only a target's location but also its pose and size. The annotated WAMI dataset should be useful to the community as a common benchmark for comparing WAMI detection, tracking, and identification methods.
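The per-vehicle annotation scheme described above (location, pose, size, static flag) can be sketched as a small record type plus a loader; the CSV layout and field names here are assumptions for illustration, not the actual WPAFB09 re-annotation format.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical per-vehicle annotation record. The re-annotation described
# above stores location, pose, and size per vehicle, but its actual file
# layout is not specified here, so this CSV schema is an assumption.
@dataclass
class VehicleAnnotation:
    frame: int          # frame index in the WAMI sequence
    vehicle_id: int     # persistent vehicle/track identifier
    x: float            # target centroid, image x (pixels)
    y: float            # target centroid, image y (pixels)
    pose_deg: float     # heading angle in degrees
    length_px: float    # vehicle length (pixels)
    width_px: float     # vehicle width (pixels)
    static: bool        # True for parked vehicles (e.g., parking lots)

def load_annotations(text):
    """Parse annotations from CSV text into a list of records."""
    rows = csv.DictReader(io.StringIO(text))
    return [VehicleAnnotation(int(r["frame"]), int(r["vehicle_id"]),
                              float(r["x"]), float(r["y"]),
                              float(r["pose_deg"]),
                              float(r["length_px"]), float(r["width_px"]),
                              r["static"] == "1")
            for r in rows]

sample = """frame,vehicle_id,x,y,pose_deg,length_px,width_px,static
0,17,1042.5,3310.0,92.0,14.0,6.5,0
0,18,998.0,3302.5,0.0,13.0,6.0,1
"""
annos = load_annotations(sample)
```

Keeping the static flag explicit makes it easy to score parked-vehicle detection separately from moving-target detection.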
NASA Earth Observations (NEO): Data Imagery for Education and Visualization
NASA Astrophysics Data System (ADS)
Ward, K.
2008-12-01
NASA Earth Observations (NEO) has dramatically simplified public access to georeferenced imagery of NASA remote sensing data. NEO targets the non-traditional data users who are currently underserved by functionality and formats available from the existing data ordering systems. These users include formal and informal educators, museum and science center personnel, professional communicators, and citizen scientists. NEO currently serves imagery from 45 different datasets with daily, weekly, and/or monthly temporal resolutions, with more datasets currently under development. The imagery from these datasets is produced in coordination with several data partners who are affiliated either with the instrument science teams or with the respective data processing center. NEO is a system of three components -- website, WMS (Web Mapping Service), and ftp archive -- which together are able to meet the wide-ranging needs of our users. Some of these needs include the ability to: view and manipulate imagery using the NEO website -- e.g., applying color palettes, resizing, exporting to a variety of formats including PNG, JPEG, KMZ (Google Earth), GeoTIFF; access the NEO collection via a standards-based API (WMS); and create customized exports for select users (ftp archive) such as Science on a Sphere, NASA's Earth Observatory, and others.
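The standards-based API mentioned above is an OGC Web Map Service; a GetMap request can be assembled as below. The base URL and layer name are placeholders (NEO's real endpoint and layer identifiers are not given here); the query parameters themselves follow the WMS 1.3.0 standard.

```python
from urllib.parse import urlencode

def wms_getmap_url(base, layer, bbox, width, height, fmt="image/png"):
    """Build a standard WMS 1.3.0 GetMap request URL.

    The endpoint and layer name passed in by the caller are placeholders:
    NEO exposes a WMS, but its exact base URL and layer identifiers are
    assumptions here."""
    params = {
        "service": "WMS", "version": "1.3.0", "request": "GetMap",
        "layers": layer, "crs": "CRS:84",
        "bbox": ",".join(str(v) for v in bbox),  # min lon, min lat, max lon, max lat
        "width": width, "height": height, "format": fmt,
    }
    return base + "?" + urlencode(params)

# Hypothetical global monthly layer rendered as a 720x360 PNG
url = wms_getmap_url("https://example.gov/neo/wms", "SST_MONTHLY",
                     (-180, -90, 180, 90), 720, 360)
```

Any WMS-capable client (GIS packages, web mapping libraries) can consume such a URL directly, which is what makes the API route useful for educators who do not use data-ordering systems.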
Arnold, L. Rick
2010-01-01
These datasets were compiled in support of U.S. Geological Survey Scientific-Investigations Report 2010-5082-Hydrogeology and Steady-State Numerical Simulation of Groundwater Flow in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. The datasets were developed by the U.S. Geological Survey in cooperation with the Lost Creek Ground Water Management District and the Colorado Geological Survey. The four datasets are described as follows and methods used to develop the datasets are further described in Scientific-Investigations Report 2010-5082: (1) ds507_regolith_data: This point dataset contains geologic information concerning regolith (unconsolidated sediment) thickness and top-of-bedrock altitude at selected well and test-hole locations in and near the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Data were compiled from published reports, consultant reports, and from lithologic logs of wells and test holes on file with the U.S. Geological Survey Colorado Water Science Center and the Colorado Division of Water Resources. (2) ds507_regthick_contours: This dataset consists of contours showing generalized lines of equal regolith thickness overlying bedrock in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Regolith thickness was contoured manually on the basis of information provided in the dataset ds507_regolith_data. (3) ds507_regthick_grid: This dataset consists of raster-based generalized thickness of regolith overlying bedrock in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Regolith thickness in this dataset was derived from contours presented in the dataset ds507_regthick_contours. 
(4) ds507_welltest_data: This point dataset contains estimates of aquifer transmissivity and hydraulic conductivity at selected well locations in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. This dataset also contains hydrologic information used to estimate transmissivity from specific capacity at selected well locations. Data were compiled from published reports, consultant reports, and from well-test records on file with the U.S. Geological Survey Colorado Water Science Center and the Colorado Division of Water Resources.
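Estimating transmissivity from specific capacity, as mentioned for the ds507_welltest_data points, is commonly done with a screening-level rule of thumb; the sketch below uses the Driscoll (1986) factors as an illustrative assumption, since the report's actual estimation method is described in Scientific-Investigations Report 2010-5082, not here.

```python
def transmissivity_from_specific_capacity(q_gpm, drawdown_ft, confined=True):
    """Estimate transmissivity (gpd/ft) from specific capacity (gpm/ft).

    Uses the Driscoll (1986) rule of thumb, T ~ 2000 * Sc for confined
    aquifers and T ~ 1500 * Sc for unconfined aquifers. This is only a
    screening-level estimate; the USGS report's actual method may differ."""
    sc = q_gpm / drawdown_ft            # specific capacity, gpm/ft
    factor = 2000.0 if confined else 1500.0
    return factor * sc

# Example: a well pumping 100 gpm with 20 ft of drawdown (confined case)
t_est = transmissivity_from_specific_capacity(100.0, 20.0, confined=True)
# Hydraulic conductivity = T / saturated thickness (50 ft assumed here)
k_est = t_est / 50.0
```

Dividing the transmissivity estimate by saturated regolith thickness (available from the ds507_regthick datasets) yields the hydraulic-conductivity values the dataset also carries.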
Improving stability of prediction models based on correlated omics data by using network approaches.
Tissier, Renaud; Houwing-Duistermaat, Jeanine; Rodríguez-Girondo, Mar
2018-01-01
Building prediction models based on complex omics datasets such as transcriptomics, proteomics, metabolomics remains a challenge in bioinformatics and biostatistics. Regularized regression techniques are typically used to deal with the high dimensionality of these datasets. However, due to the presence of correlation in the datasets, it is difficult to select the best model and application of these methods yields unstable results. We propose a novel strategy for model selection where the obtained models also perform well in terms of overall predictability. Several three step approaches are considered, where the steps are 1) network construction, 2) clustering to empirically derive modules or pathways, and 3) building a prediction model incorporating the information on the modules. For the first step, we use weighted correlation networks and Gaussian graphical modelling. Identification of groups of features is performed by hierarchical clustering. The grouping information is included in the prediction model by using group-based variable selection or group-specific penalization. We compare the performance of our new approaches with standard regularized regression via simulations. Based on these results we provide recommendations for selecting a strategy for building a prediction model given the specific goal of the analysis and the sizes of the datasets. Finally we illustrate the advantages of our approach by application of the methodology to two problems, namely prediction of body mass index in the DIetary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome study (DILGOM) and prediction of response of each breast cancer cell line to treatment with specific drugs using a breast cancer cell lines pharmacogenomics dataset.
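The three-step strategy above (network construction, module derivation, module-aware prediction) can be sketched end to end. This is a deliberately simplified stand-in: greedy correlation grouping replaces weighted correlation networks plus hierarchical clustering, and ridge regression on module means replaces the group-based selection/penalization methods the paper compares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: derive feature modules from the correlation structure.
# A greedy single-pass grouping stands in for weighted correlation
# networks / Gaussian graphical models followed by hierarchical clustering.
def correlation_modules(X, tau=0.7):
    p = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)
    labels = -np.ones(p, dtype=int)
    k = 0
    for j in range(p):
        if labels[j] >= 0:
            continue
        labels[j] = k
        for m in range(j + 1, p):
            if labels[m] < 0 and abs(corr[j, m]) > tau:
                labels[m] = k
        k += 1
    return labels

# Step 3: summarize each module by its mean profile and fit a ridge
# regression, a simple proxy for group-specific penalization.
def module_features(X, labels):
    return np.column_stack([X[:, labels == k].mean(axis=1)
                            for k in range(labels.max() + 1)])

def ridge_fit(Z, y, lam=1.0):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

# Synthetic correlated omics-like data: two latent factors, 8 features
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1 + 0.1 * rng.normal(size=n) for _ in range(4)] +
                    [f2 + 0.1 * rng.normal(size=n) for _ in range(4)])
y = 2 * f1 - f2
labels = correlation_modules(X)
beta = ridge_fit(module_features(X, labels), y, lam=0.1)
```

Because correlated features collapse into one module coordinate, the fitted coefficients are stable under resampling, which is exactly the instability problem the strategy targets.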
NASA Astrophysics Data System (ADS)
Dafflon, B.; Hubbard, S. S.; Ulrich, C.; Peterson, J. E.; Wu, Y.; Wainwright, H. M.; Gangodagamage, C.; Kholodov, A. L.; Kneafsey, T. J.
2013-12-01
Improvement in parameterizing Arctic process-rich terrestrial models to simulate feedbacks to a changing climate requires advances in estimating the spatiotemporal variations in active layer and permafrost properties in sufficiently high resolution yet over modeling-relevant scales. As part of the DOE Next-Generation Ecosystem Experiments (NGEE-Arctic), we are developing advanced strategies for imaging the subsurface and for investigating land and subsurface co-variability and dynamics. Our studies include acquisition and integration of various measurements, including point-based, surface-based geophysical, and remote sensing datasets. These data have been collected during a series of campaigns at the NGEE Barrow, AK site along transects that traverse a range of hydrological and geomorphological conditions, including low- to high-centered polygons and drained thaw lake basins. In this study, we describe the use of galvanic-coupled electrical resistance tomography (ERT), capacitively-coupled resistivity (CCR), permafrost cores, above-ground orthophotography, and a digital elevation model (DEM) to (1) explore the complementary nature of, and trade-offs between, the characterization resolution, spatial extent and accuracy of different datasets; (2) develop inversion approaches to quantify permafrost characteristics (such as ice content, ice wedge frequency, and presence of an unfrozen deep layer); and (3) identify correspondences between permafrost and land surface properties (such as water inundation, topography, and vegetation). In terms of methods, we developed a 1D-based direct search approach to estimate the electrical conductivity distribution while allowing exploration of multiple solutions and prior information in a flexible way.
Application of the method to the Barrow datasets reveals the relative information content of each dataset for characterizing permafrost properties, which show variability from length scales below one meter to large trends extending over more than a kilometer. Further, we used pole- and kite-based low-altitude aerial photography with an inferred DEM, as well as a DEM from a LiDAR dataset, to quantify land-surface properties and their co-variability with the subsurface properties. Comparison of the above- and below-ground characterization information indicates that while some permafrost characteristics correspond with changes in hydrogeomorphological expressions, other features show more complex linkages with landscape properties. Overall, our results indicate that remote sensing data, point-scale measurements and surface geophysical measurements enable the identification of regional zones having similar relations between subsurface and land surface properties. Identification of such zonation and associated permafrost-land surface properties can be used to guide investigations of carbon cycling processes and for model parameterization.
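A 1D direct-search inversion of the kind described can be illustrated with a toy grid search: try every candidate layer-conductivity pair, keep the one whose forward response best matches the observations. The exponential depth-weighting forward model below is a deliberate simplification for illustration, not the NGEE-Arctic team's actual ERT forward operator.

```python
import numpy as np

# Toy two-layer forward model: apparent conductivity at electrode
# spacing a mixes a shallow layer (sigma1) and a deep layer (sigma2)
# with an assumed exponential sensitivity weight. This is a didactic
# stand-in for a real ERT forward solver.
def forward(sigma1, sigma2, spacings, h=1.0):
    w = np.exp(-spacings / h)           # shallow-layer sensitivity weight
    return w * sigma1 + (1 - w) * sigma2

spacings = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
sigma_true = (0.02, 0.005)              # S/m: wet active layer over permafrost
obs = forward(*sigma_true, spacings)    # synthetic "observed" data

# Direct search: exhaustively evaluate the misfit on a parameter grid,
# which also exposes the full misfit surface (multiple-solution analysis).
grid = np.linspace(0.001, 0.05, 50)
best, best_misfit = None, np.inf
for s1 in grid:
    for s2 in grid:
        misfit = np.sum((forward(s1, s2, spacings) - obs) ** 2)
        if misfit < best_misfit:
            best, best_misfit = (s1, s2), misfit
```

Unlike gradient-based inversion, the exhaustive search makes it trivial to retain all low-misfit models and to impose prior bounds simply by restricting the grid.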
The French Muséum national d'histoire naturelle vascular plant herbarium collection dataset
NASA Astrophysics Data System (ADS)
Le Bras, Gwenaël; Pignal, Marc; Jeanson, Marc L.; Muller, Serge; Aupic, Cécile; Carré, Benoît; Flament, Grégoire; Gaudeul, Myriam; Gonçalves, Claudia; Invernón, Vanessa R.; Jabbour, Florian; Lerat, Elodie; Lowry, Porter P.; Offroy, Bérangère; Pimparé, Eva Pérez; Poncy, Odile; Rouhan, Germinal; Haevermans, Thomas
2017-02-01
We provide a quantitative description of the French national herbarium vascular plants collection dataset. Held at the Muséum national d'histoire naturelle, Paris, it currently comprises records for 5,400,000 specimens, representing 90% of the estimated total of specimens. Ninety-nine percent of the specimen entries are linked to one or more images and 16% have field-collecting information available. This major botanical collection represents the results of over three centuries of exploration and study. The sources of the collection are global, with a strong representation for France, including overseas territories, and former French colonies. The compilation of this dataset was made possible through numerous national and international projects, the most important of which was linked to the renovation of the herbarium building. The vascular plant collection is actively expanding today, hence the continuous growth exhibited by the dataset, which can be fully accessed through the GBIF portal or the MNHN database portal (available at: https://science.mnhn.fr/institution/mnhn/collection/p/item/search/form). This dataset is a major source of data for systematics, global plant macroecological studies, or conservation assessments.
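Since the dataset is exposed through GBIF, occurrence records can be retrieved with the public GBIF occurrence API; a request URL can be assembled as below. The datasetKey value is a placeholder UUID, not the herbarium's real GBIF identifier.

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_API = "https://api.gbif.org/v1/occurrence/search"

def gbif_search_url(dataset_key, scientific_name=None, limit=20):
    """Build a GBIF occurrence-search URL filtered to one dataset.

    dataset_key below is a placeholder; look up the collection's actual
    GBIF datasetKey before querying."""
    params = {"datasetKey": dataset_key, "limit": limit}
    if scientific_name:
        params["scientificName"] = scientific_name
    return GBIF_OCCURRENCE_API + "?" + urlencode(params)

url = gbif_search_url("00000000-0000-0000-0000-000000000000",
                      scientific_name="Dracaena draco", limit=5)
```

The JSON response pages through matching specimen records, which is how macroecological or conservation studies typically harvest such collections in bulk.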
Collaboration-Centred Cities through Urban Apps Based on Open and User-Generated Data
Aguilera, Unai; López-de-Ipiña, Diego; Pérez, Jorge
2016-01-01
This paper describes the IES Cities platform conceived to streamline the development of urban apps that combine heterogeneous datasets provided by diverse entities, namely, government, citizens, sensor infrastructure and other information data sources. This work pursues the challenge of achieving effective citizen collaboration by empowering them to prosume urban data across time. Particularly, this paper focuses on the query mapper; a key component of the IES Cities platform devised to democratize the development of open data-based mobile urban apps. This component allows developers not only to use available data, but also to contribute to existing datasets with the execution of SQL sentences. In addition, the component allows developers to create ad hoc storages for their applications, publishable as new datasets accessible by other consumers. As multiple users could be contributing and using a dataset, our solution also provides a data level permission mechanism to control how the platform manages the access to its datasets. We have evaluated the advantages brought forward by IES Cities from the developers’ perspective by describing an exemplary urban app created on top of it. In addition, we include an evaluation of the main functionalities of the query mapper. PMID:27376300
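The data-level permission mechanism described above, sitting in front of SQL execution, can be sketched with SQLite; the permission table and API are a minimal illustration in the spirit of the query mapper, not the IES Cities platform's actual implementation.

```python
import sqlite3

# (user, dataset) -> allowed SQL operations. A real platform would store
# these grants in the database itself; a dict keeps the sketch small.
PERMISSIONS = {
    ("alice", "air_quality"): {"SELECT", "INSERT"},
    ("bob", "air_quality"): {"SELECT"},
}

def run_query(conn, user, dataset, sql, params=()):
    """Execute SQL only if the user holds the operation on the dataset."""
    op = sql.strip().split()[0].upper()
    if op not in PERMISSIONS.get((user, dataset), set()):
        raise PermissionError(f"{user} may not {op} on {dataset}")
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE air_quality (station TEXT, no2 REAL)")

# alice may contribute ("prosume") data; bob may only read it
run_query(conn, "alice", "air_quality",
          "INSERT INTO air_quality VALUES (?, ?)", ("centre", 41.5))
rows = run_query(conn, "bob", "air_quality", "SELECT * FROM air_quality")
```

Checking the operation keyword per (user, dataset) pair is what lets one shared dataset serve many apps with different read/write rights.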
NASA Astrophysics Data System (ADS)
Minnett, R.; Koppers, A.; Jarboe, N.; Tauxe, L.; Constable, C.; Jonestrask, L.
2017-12-01
Challenges are faced by both new and experienced users interested in contributing their data to community repositories, in data discovery, or engaged in potentially transformative science. The Magnetics Information Consortium (https://earthref.org/MagIC) has recently simplified its data model and developed a new containerized web application to reduce the friction in contributing, exploring, and combining valuable and complex datasets for the paleo-, geo-, and rock magnetic scientific community. The new data model more closely reflects the hierarchical workflow in paleomagnetic experiments to enable adequate annotation of scientific results and ensure reproducibility. The new open-source (https://github.com/earthref/MagIC) application includes an upload tool that is integrated with the data model to provide early data validation feedback and ease the friction of contributing and updating datasets. The search interface provides a powerful full text search of contributions indexed by ElasticSearch and a wide array of filters, including specific geographic and geological timescale filtering, to support both novice users exploring the database and experts interested in compiling new datasets with specific criteria across thousands of studies and millions of measurements. The datasets are not large, but they are complex, with many results from evolving experimental and analytical approaches. These data are also extremely valuable due to the cost in collecting or creating physical samples and the, often, destructive nature of the experiments. MagIC is heavily invested in encouraging young scientists as well as established labs to cultivate workflows that facilitate contributing their data in a consistent format. This eLightning presentation includes a live demonstration of the MagIC web application, developed as a configurable container hosting an isomorphic Meteor JavaScript application, MongoDB database, and ElasticSearch search engine. 
Visitors can explore the MagIC Database through maps and image or plot galleries or search and filter the raw measurements and their derived hierarchy of analytical interpretations.
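A search combining full text with a geographic filter, of the kind the MagIC interface sends to ElasticSearch, can be sketched as a query body. The index field name ("summary.geo") is an assumption, not MagIC's actual schema; the query_string and geo_bounding_box constructs are standard ElasticSearch Query DSL.

```python
import json

def build_search(text, top_left, bottom_right, size=10):
    """Assemble an ElasticSearch body: full-text match + geo bounding box.

    The geo field name used here is hypothetical; substitute the real
    mapping's geo_point field."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {"query_string": {"query": text}},
                "filter": {
                    "geo_bounding_box": {
                        "summary.geo": {
                            "top_left": {"lat": top_left[0],
                                         "lon": top_left[1]},
                            "bottom_right": {"lat": bottom_right[0],
                                             "lon": bottom_right[1]},
                        }
                    }
                },
            }
        },
    }

# Full-text search restricted to a North Atlantic bounding box
body = build_search("paleointensity Cretaceous", (72.0, -45.0), (55.0, 10.0))
payload = json.dumps(body)
```

Putting the geographic constraint in `filter` rather than `must` keeps it out of relevance scoring, which suits map-driven browsing of contributions.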
Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G; Treviño, Victor
2013-01-01
Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature along datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed in http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R.
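Survival analyses like SurvExpress's typically split samples into risk groups and compare survival curves. The from-scratch Kaplan-Meier estimator below shows the core computation on toy data; it is not SurvExpress's actual Cox-model implementation, and it assumes distinct event times.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival probability), ...] for
    right-censored data. events[i] is 1 if the i-th time is an observed
    event, 0 if censored. Assumes distinct event times for simplicity."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    for i in order:
        if events[i]:
            surv *= 1.0 - 1.0 / at_risk   # step down at each observed event
            curve.append((times[i], surv))
        at_risk -= 1                      # events and censorings leave risk set
    return curve

# Toy risk group: events at t = 2, 3, 5; one censored sample at t = 4
curve = kaplan_meier([2, 3, 4, 5], [1, 1, 0, 1])
```

In a biomarker validation, one curve per risk group (e.g., split on the median multivariate risk score) is then compared, usually with a log-rank test.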
Hrynaszkiewicz, Iain; Khodiyar, Varsha; Hufton, Andrew L; Sansone, Susanna-Assunta
2016-01-01
Sharing of experimental clinical research data usually happens between individuals or research groups rather than via public repositories, in part due to the need to protect research participant privacy. This approach to data sharing makes it difficult to connect journal articles with their underlying datasets and is often insufficient for ensuring access to data in the long term. Voluntary data sharing services such as the Yale Open Data Access (YODA) and Clinical Study Data Request (CSDR) projects have increased accessibility to clinical datasets for secondary uses while protecting patient privacy and the legitimacy of secondary analyses, but these resources are generally disconnected from journal articles, where researchers typically search for reliable information to inform future research. New scholarly journal and article types dedicated to increasing accessibility of research data have emerged in recent years and, in general, journals are developing stronger links with data repositories. There is a need for increased collaboration between journals, data repositories, researchers, funders, and voluntary data sharing services to increase the visibility and reliability of clinical research. Using the journal Scientific Data as a case study, we propose and show examples of changes to the format and peer-review process for journal articles to more robustly link them to data that are only available on request. We also propose additional features for data repositories to better accommodate non-public clinical datasets, including Data Use Agreements (DUAs).
75 FR 42680 - Proposed Information Collection; Topographic and Bathymetric Data Survey
Federal Register 2010, 2011, 2012, 2013, 2014
2010-07-22
.... Twenty-one pieces of information about each dataset will be collected to give an accurate picture of data quality and give users of the Topographic and Bathymetric Data Inventory access to each dataset. The end...
Trippi, Michael H.; Kinney, Scott A.; Gunther, Gregory; Ryder, Robert T.; Ruppert, Leslie F.; Ruppert, Leslie F.; Ryder, Robert T.
2014-01-01
Metadata for these datasets are available in HTML and XML formats. Metadata files contain information about the sources of data used to create the dataset, the creation process steps, the data quality, the geographic coordinate system and horizontal datum used for the dataset, the values of attributes used in the dataset table, information about the publication and the publishing organization, and other information that may be useful to the reader. All links in the metadata were valid at the time of compilation. Some of these links may no longer be valid. No attempt has been made to determine the new online location (if one exists) for the data.
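XML metadata of this kind follows the FGDC CSDGM layout and can be read with the standard library; the fragment below is a hand-made sample with that structure, not one of the actual published metadata files.

```python
import xml.etree.ElementTree as ET

# Hand-made FGDC CSDGM-style fragment for illustration; real metadata
# files carry many more sections (data quality, spatial reference, etc.).
SAMPLE = """<metadata>
  <idinfo>
    <citation><citeinfo>
      <title>ds507_regolith_data</title>
      <pubdate>2010</pubdate>
    </citeinfo></citation>
    <descript><abstract>Regolith thickness and top-of-bedrock altitude
    at selected well and test-hole locations.</abstract></descript>
  </idinfo>
</metadata>"""

root = ET.fromstring(SAMPLE)
title = root.findtext("idinfo/citation/citeinfo/title")
pubdate = root.findtext("idinfo/citation/citeinfo/pubdate")
abstract = root.findtext("idinfo/descript/abstract")
```

Pulling titles, dates, and abstracts out programmatically is how a catalog can index many such metadata files, including flagging the stale links the paragraph above warns about.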
SpectralNET – an application for spectral graph analysis and visualization
Forman, Joshua J; Clemons, Paul A; Schreiber, Stuart L; Haggarty, Stephen J
2005-01-01
Background Graph theory provides a computational framework for modeling a variety of datasets including those emerging from genomics, proteomics, and chemical genetics. Networks of genes, proteins, small molecules, or other objects of study can be represented as graphs of nodes (vertices) and interactions (edges) that can carry different weights. SpectralNET is a flexible application for analyzing and visualizing these biological and chemical networks. Results Available both as a standalone .NET executable and as an ASP.NET web application, SpectralNET was designed specifically with the analysis of graph-theoretic metrics in mind, a computational task not easily accessible using currently available applications. Users can choose either to upload a network for analysis using a variety of input formats, or to have SpectralNET generate an idealized random network for comparison to a real-world dataset. Whichever graph-generation method is used, SpectralNET displays detailed information about each connected component of the graph, including graphs of degree distribution, clustering coefficient by degree, and average distance by degree. In addition, extensive information about the selected vertex is shown, including degree, clustering coefficient, various distance metrics, and the corresponding components of the adjacency, Laplacian, and normalized Laplacian eigenvectors. SpectralNET also displays several graph visualizations, including a linear dimensionality reduction for uploaded datasets (Principal Components Analysis) and a non-linear dimensionality reduction that provides an elegant view of global graph structure (Laplacian eigenvectors). Conclusion SpectralNET provides an easily accessible means of analyzing graph-theoretic metrics for data modeling and dimensionality reduction. SpectralNET is publicly available as both a .NET application and an ASP.NET web application from http://chembank.broad.harvard.edu/resources/. Source code is available upon request. PMID:16236170
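The Laplacian-eigenvector visualization the abstract describes reduces to a few lines of linear algebra: build L = D - A and use the second and third eigenvectors as 2-D coordinates. This is a from-scratch sketch on a 6-node ring graph, not SpectralNET's own code.

```python
import numpy as np

# Build the adjacency matrix of a 6-node ring graph as a stand-in for an
# uploaded biological or chemical network.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0   # ring edges

D = np.diag(A.sum(axis=1))      # degree matrix
L = D - A                       # (combinatorial) graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
# Skip the constant eigenvector (eigenvalue 0); the next two give the
# non-linear 2-D embedding of global graph structure.
coords = eigvecs[:, 1:3]
```

For a connected graph the smallest Laplacian eigenvalue is exactly zero, and the embedding places strongly connected vertices near each other, which is what gives the "elegant view of global graph structure."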
Systematic Applications of Metabolomics in Metabolic Engineering
Dromms, Robert A.; Styczynski, Mark P.
2012-01-01
The goals of metabolic engineering are well-served by the biological information provided by metabolomics: information on how the cell is currently using its biochemical resources is perhaps one of the best ways to inform strategies to engineer a cell to produce a target compound. Using the analysis of extracellular or intracellular levels of the target compound (or a few closely related molecules) to drive metabolic engineering is quite common. However, there is surprisingly little systematic use of metabolomics datasets, which simultaneously measure hundreds of metabolites rather than just a few, for that same purpose. Here, we review the most common systematic approaches to integrating metabolite data with metabolic engineering, with emphasis on existing efforts to use whole-metabolome datasets. We then review some of the most common approaches for computational modeling of cell-wide metabolism, including constraint-based models, and discuss current computational approaches that explicitly use metabolomics data. We conclude with discussion of the broader potential of computational approaches that systematically use metabolomics data to drive metabolic engineering. PMID:24957776
Digital geologic map database of the Nevada Test Site area, Nevada
Wahl, R.R.; Sawyer, D.A.; Minor, S.A.; Carr, M.D.; Cole, J.C.; Swadley, W.C.; Laczniak, R.J.; Warren, R.G.; Green, K.S.; Engle, C.M.
1997-01-01
Forty years of geologic investigations at the Nevada Test Site (NTS) have been digitized. These data include all geologic information that (1) has been collected and (2) can be represented on a map within the map borders at the map scale. The following coverages are included with this dataset:
geolpoly (polygon): geologic outcrops
geolflts (line): fault traces
geolatts (point): bedding attitudes, etc.
geolcald (line): caldera boundaries
geollins (line): interpreted lineaments
geolmeta (line): metamorphic gradients
The above coverages are attributed with numeric values and interpreted information. The entity files documented below show the data associated with each coverage.
NASA Technical Reports Server (NTRS)
Liu, Zhong; Ostrenga, D.; Teng, W. L.; Trivedi, Bhagirath; Kempler, S.
2012-01-01
The NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) is the home of global precipitation product archives, in particular the Tropical Rainfall Measuring Mission (TRMM) products. TRMM is a joint U.S.-Japan satellite mission to monitor tropical and subtropical (40°S-40°N) precipitation and to estimate its associated latent heating. The TRMM satellite provides the first detailed and comprehensive dataset on the four-dimensional distribution of rainfall and latent heating over vastly undersampled tropical and subtropical oceans and continents. The TRMM satellite was launched on November 27, 1997. TRMM data products are archived at and distributed by GES DISC. The newly released TRMM Version 7 includes several changes: new parameters, new products, new metadata, and new data structures. For example, hydrometeor profiles in 2A12 now have 28 layers (14 in V6). New parameters have been added to several popular Level-3 products, such as 3B42 and 3B43. Version 2.2 of the Global Precipitation Climatology Project (GPCP) dataset has been added to the TRMM Online Visualization and Analysis System (TOVAS; URL: http://disc2.nascom.nasa.gov/Giovanni/tovas/), allowing online analysis and visualization without downloading data and software. The GPCP dataset extends back to 1979. Version 3 of the Global Precipitation Climatology Centre (GPCC) monitoring product has been updated in TOVAS as well. The product provides global gauge-based monthly rainfall along with the number of gauges per grid cell. The dataset begins in January 1986. To facilitate data and information access and support precipitation research and applications, we have developed a Precipitation Data and Information Services Center (PDISC; URL: http://disc.gsfc.nasa.gov/precipitation). In addition to TRMM, PDISC provides current and past observational precipitation data. Users can access precipitation data archives consisting of both remote sensing and in-situ observations. 
Users can use these data products to conduct a wide variety of activities, including case studies, model evaluation, uncertainty investigation, etc. To support Earth science applications, PDISC provides users with near-real-time precipitation products over the Internet. At PDISC, users can access tools and software; documentation, FAQs and assistance are also available. Other capabilities include: 1) Mirador (http://mirador.gsfc.nasa.gov/), a simplified interface for searching, browsing, and ordering Earth science data at the GES DISC, designed to be fast and easy to learn; 2) TOVAS; 3) NetCDF data download for the GIS community; 4) data via OPeNDAP (http://disc.sci.gsfc.nasa.gov/services/opendap/), which provides remote access to individual variables within datasets in a form usable by many tools, such as IDV, McIDAS-V, Panoply, Ferret and GrADS; 5) the Open Geospatial Consortium (OGC) Web Map Service (WMS) (http://disc.sci.gsfc.nasa.gov/services/wxs_ogc.shtml), an interface that enables clients to build customized maps with data coming from different networks.
Improving Risk Adjustment for Mortality After Pediatric Cardiac Surgery: The UK PRAiS2 Model.
Rogers, Libby; Brown, Katherine L; Franklin, Rodney C; Ambler, Gareth; Anderson, David; Barron, David J; Crowe, Sonya; English, Kate; Stickley, John; Tibby, Shane; Tsang, Victor; Utley, Martin; Witter, Thomas; Pagel, Christina
2017-07-01
Partial Risk Adjustment in Surgery (PRAiS), a risk model for 30-day mortality after children's heart surgery, has been used by the UK National Congenital Heart Disease Audit to report expected risk-adjusted survival since 2013. This study aimed to improve the model by incorporating additional comorbidity and diagnostic information. The model development dataset was all procedures performed between 2009 and 2014 in all UK and Ireland congenital cardiac centers. The outcome measure was death within each 30-day surgical episode. Model development followed an iterative process of clinical discussion and development and assessment of models using logistic regression under 25 × 5 cross-validation. Performance was measured using Akaike information criterion, the area under the receiver-operating characteristic curve (AUC), and calibration. The final model was assessed in an external 2014 to 2015 validation dataset. The development dataset comprised 21,838 30-day surgical episodes, with 539 deaths (mortality, 2.5%). The validation dataset comprised 4,207 episodes, with 97 deaths (mortality, 2.3%). The updated risk model included 15 procedural, 11 diagnostic, and 4 comorbidity groupings, and nonlinear functions of age and weight. Performance under cross-validation was: median AUC of 0.83 (range, 0.82 to 0.83), median calibration slope and intercept of 0.92 (range, 0.64 to 1.25) and -0.23 (range, -1.08 to 0.85) respectively. In the validation dataset, the AUC was 0.86 (95% confidence interval [CI], 0.82 to 0.89), and the calibration slope and intercept were 1.01 (95% CI, 0.83 to 1.18) and 0.11 (95% CI, -0.45 to 0.67), respectively, showing excellent performance. A more sophisticated PRAiS2 risk model for UK use was developed with additional comorbidity and diagnostic information, alongside age and weight as nonlinear variables. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
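The area under the ROC curve reported for PRAiS2 can be computed from predicted risks and observed outcomes via the rank-based (Mann-Whitney) formulation. A minimal sketch with made-up toy data, not the study's code:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a randomly chosen positive case scores higher than
    a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: higher predicted risk should accompany death (label 1).
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.8, 0.3, 0.6]
```

On this toy data every death out-scores every survivor, so the AUC is 1.0; an uninformative model would score 0.5, and PRAiS2's reported 0.86 sits between the two.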
Aerosol Climate Time Series Evaluation In ESA Aerosol_cci
NASA Astrophysics Data System (ADS)
Popp, T.; de Leeuw, G.; Pinnock, S.
2015-12-01
Within the ESA Climate Change Initiative (CCI), Aerosol_cci (2010-2017) conducts intensive work to improve algorithms for the retrieval of aerosol information from European sensors. By the end of 2015, full-mission time series of two GCOS-required aerosol parameters will have been completely validated and released: Aerosol Optical Depth (AOD) from the dual-view ATSR-2 / AATSR radiometers (3 algorithms, 1995-2012), and stratospheric extinction profiles from the star-occultation GOMOS spectrometer (2002-2012). Additionally, a 35-year multi-sensor time series of the qualitative Absorbing Aerosol Index (AAI), together with sensitivity information and an AAI model simulator, is available. Complementary aerosol properties requested by GCOS are in a "round robin" phase, where various algorithms are inter-compared: fine-mode AOD, mineral dust AOD (from the thermal IASI spectrometer), absorption information and aerosol layer height. As a quasi-reference for validation in a few selected regions with sparse ground-based observations, the multi-pixel GRASP algorithm for the POLDER instrument is used. Validation of the first dataset versions (vs. AERONET, MAN) and inter-comparison to other satellite datasets (MODIS, MISR, SeaWiFS) proved the quality of the available datasets to be comparable to other satellite retrievals, and revealed needs for algorithm improvement (for example for higher AOD values) which were taken into account for a reprocessing. The datasets contain pixel-level uncertainty estimates, which are also validated. The paper will summarize and discuss the results of the major reprocessing and validation conducted in 2015. The focus will be on the ATSR, GOMOS and IASI datasets. Pixel-level uncertainty validation will be summarized and discussed, including unknown components and their potential usefulness and limitations. 
Opportunities for time series extension with successor instruments of the Sentinel family will be described and the complementarity of the different satellite aerosol products (e.g. dust vs. total AOD, ensembles from different algorithms for the same sensor) will be discussed.
GeoPAT: A toolbox for pattern-based information retrieval from large geospatial databases
NASA Astrophysics Data System (ADS)
Jasiewicz, Jarosław; Netzel, Paweł; Stepinski, Tomasz
2015-07-01
Geospatial Pattern Analysis Toolbox (GeoPAT) is a collection of GRASS GIS modules for carrying out pattern-based geospatial analysis of images and other spatial datasets. The need for pattern-based analysis arises when images/rasters contain rich spatial information either because of their very high resolution or their very large spatial extent. Elementary units of pattern-based analysis are scenes - patches of surface consisting of a complex arrangement of individual pixels (patterns). GeoPAT modules implement popular GIS algorithms, such as query, overlay, and segmentation, to operate on the grid of scenes. To achieve these capabilities GeoPAT includes a library of scene signatures - compact numerical descriptors of patterns, and a library of distance functions - providing numerical means of assessing dissimilarity between scenes. Ancillary GeoPAT modules use these functions to construct a grid of scenes or to assign signatures to individual scenes having regular or irregular geometries. Thus GeoPAT combines knowledge retrieval from patterns with mapping tasks within a single integrated GIS environment. GeoPAT is designed to identify and analyze complex, highly generalized classes in spatial datasets. Examples include distinguishing between different styles of urban settlements using VHR images, delineating different landscape types in land cover maps, and mapping physiographic units from DEM. The concept of pattern-based spatial analysis is explained and the roles of all modules and functions are described. A case study example pertaining to delineation of landscape types in a subregion of NLCD is given. Performance evaluation is included to highlight GeoPAT's applicability to very large datasets. The GeoPAT toolbox is available for download from
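The core GeoPAT idea of comparing scenes through signatures and distance functions can be illustrated with one plausible pairing: a normalized category histogram as the signature and the Jensen-Shannon distance as the dissimilarity. This is a toy Python sketch in the spirit of GeoPAT's libraries, not GeoPAT's own GRASS modules; the land-cover scenes below are invented:

```python
import math
from collections import Counter

def signature(scene, bins):
    """Normalized category histogram of a scene's cell values --
    one simple choice of pattern signature."""
    counts = Counter(scene)
    n = len(scene)
    return [counts.get(b, 0) / n for b in bins]

def jensen_shannon(p, q):
    """Square root of the Jensen-Shannon divergence (base 2): a bounded,
    symmetric dissimilarity between two signatures (0 = identical)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

bins = ["water", "urban", "forest"]
s1 = signature(["water"] * 6 + ["forest"] * 2, bins)
s2 = signature(["water"] * 5 + ["forest"] * 3, bins)
s3 = signature(["urban"] * 8, bins)
```

Scenes 1 and 2 (similar water/forest mosaics) end up close, while the purely urban scene 3 is at the maximum distance of 1 — exactly the behaviour a query-by-example over a grid of scenes relies on.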
Identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network.
Chen, Xi; Wang, Qiao-Ling; Zhang, Meng-Hui
2017-10-01
The current study aimed to identify key genes in glaucoma based on a benchmarked dataset and a gene regulatory network (GRN). Local and global noise was added to the gene expression dataset to produce a benchmarked dataset. Differentially-expressed genes (DEGs) between patients with glaucoma and normal controls were identified utilizing the Linear Models for Microarray Data (Limma) package based on the benchmarked dataset. A total of 5 GRN inference methods, including Zscore, GeneNet, the context likelihood of relatedness (CLR) algorithm, Partial Correlation coefficient with Information Theory (PCIT) and GEne Network Inference with Ensemble of Trees (Genie3), were evaluated using receiver operating characteristic (ROC) and precision and recall (PR) curves. The inference method with the best performance was selected to construct the GRN. Subsequently, topological centrality analysis (degree, closeness and betweenness) was conducted to identify key genes in the GRN of glaucoma. Finally, the key genes were validated by performing reverse transcription-quantitative polymerase chain reaction (RT-qPCR). A total of 176 DEGs were detected from the benchmarked dataset. The ROC and PR curves of the 5 methods were analyzed and it was determined that Genie3 had a clear advantage over the other methods; thus, Genie3 was used to construct the GRN. Following topological centrality analysis, 14 key genes for glaucoma were identified, including IL6, EPHA2 and GSTT1; 5 of these 14 key genes were validated by RT-qPCR. Therefore, the current study identified 14 key genes in glaucoma, which may be potential biomarkers for use in the diagnosis of glaucoma and aid in identifying the molecular mechanism of this disease.
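Two of the three centrality measures used to rank genes, degree and closeness, reduce to simple graph computations (betweenness, the third, needs Brandes' algorithm and is omitted for brevity). A minimal sketch on an invented toy network, not the paper's GRN:

```python
from collections import deque

def bfs_distances(adj, src):
    """Shortest-path lengths (hop counts) from src via breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def centralities(adj):
    """Normalized degree and closeness centrality for a connected graph."""
    n = len(adj)
    result = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        result[v] = {
            "degree": len(adj[v]) / (n - 1),
            "closeness": (n - 1) / sum(dist.values()),
        }
    return result

# Toy regulatory network: hub gene g0 linked to g1..g3, with a chain to g4.
adj = {
    "g0": {"g1", "g2", "g3"},
    "g1": {"g0"},
    "g2": {"g0"},
    "g3": {"g0", "g4"},
    "g4": {"g3"},
}
c = centralities(adj)
```

The hub `g0` scores highest on both measures, mirroring how key genes such as IL6 stand out in the inferred GRN.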
Discovering the influential users oriented to viral marketing based on online social networks
NASA Astrophysics Data System (ADS)
Zhu, Zhiguo
2013-08-01
The target of viral marketing on popular online social network platforms is to propagate marketing information rapidly at low cost and increase sales; a key problem is how to precisely discover the most influential users in the process of information diffusion. A novel method is proposed in this paper for helping companies identify such users as seeds to maximize information diffusion in viral marketing. Firstly, the user trust network oriented to viral marketing and the combined interest degree of users in the network (including isolated users) are formally defined. Next, we construct a model that considers the time factor to simulate the process of information diffusion in viral marketing, and propose a dynamic algorithm description. Finally, experiments are conducted with a real dataset extracted from the famous SNS website Epinions. The experimental results indicate that the proposed algorithm has better scalability and is less time-consuming: across our four sub-datasets, it outperformed the classical method on both network coverage rate and time consumption.
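The paper's own time-aware diffusion model is not reproduced here, but a generic independent-cascade-style simulation over a toy trust network illustrates the kind of process being modeled and the coverage-rate metric used to compare algorithms (the network, seed set, and propagation probability `p` are all assumptions for illustration):

```python
import random

def simulate_diffusion(trust, seeds, p=0.3, max_steps=5, rng=None):
    """Independent-cascade-style spread: at each time step, every newly
    activated user tries once to activate each trusted neighbour with
    probability p. Returns the set of activated users."""
    rng = rng or random.Random(0)
    active = set(seeds)
    frontier = set(seeds)
    for _ in range(max_steps):
        nxt = set()
        for u in frontier:
            for v in trust.get(u, ()):
                if v not in active and rng.random() < p:
                    nxt.add(v)
        if not nxt:
            break
        active |= nxt
        frontier = nxt
    return active

def coverage_rate(trust, active):
    """Share of all users in the network reached by the campaign."""
    users = set(trust) | {v for vs in trust.values() for v in vs}
    return len(active) / len(users)

trust = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"], "e": ["f"]}
reached = simulate_diffusion(trust, {"a"}, p=1.0)
```

With `p=1.0` the cascade from seed `a` reaches every user in three steps; influence-maximization methods compare candidate seed sets by exactly this kind of expected coverage.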
Benchmarking Deep Learning Models on Large Healthcare Datasets.
Purushotham, Sanjay; Meng, Chuizheng; Che, Zhengping; Liu, Yan
2018-06-04
Deep learning models (aka Deep Neural Networks) have revolutionized many fields, including computer vision, natural language processing and speech recognition, and are increasingly used in clinical healthcare applications. However, few works have benchmarked the performance of deep learning models against state-of-the-art machine learning models and prognostic scoring systems on publicly available healthcare datasets. In this paper, we present benchmarking results for several clinical prediction tasks, such as mortality prediction, length-of-stay prediction, and ICD-9 code group prediction, using deep learning models, an ensemble of machine learning models (the Super Learner algorithm), and the SAPS II and SOFA scores. We used the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) (v1.4) dataset, which includes all patients admitted to an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012, for the benchmarking tasks. Our results show that deep learning models consistently outperform all the other approaches, especially when the 'raw' clinical time series data is used as input features to the models. Copyright © 2018 Elsevier Inc. All rights reserved.
Holmes, Avram J; Hollinshead, Marisa O; O'Keefe, Timothy M; Petrov, Victor I; Fariello, Gabriele R; Wald, Lawrence L; Fischl, Bruce; Rosen, Bruce R; Mair, Ross W; Roffman, Joshua L; Smoller, Jordan W; Buckner, Randy L
2015-01-01
The goal of the Brain Genomics Superstruct Project (GSP) is to enable large-scale exploration of the links between brain function, behavior, and ultimately genetic variation. To provide the broader scientific community data to probe these associations, a repository of structural and functional magnetic resonance imaging (MRI) scans linked to genetic information was constructed from a sample of healthy individuals. The initial release, detailed in the present manuscript, encompasses quality screened cross-sectional data from 1,570 participants ages 18 to 35 years who were scanned with MRI and completed demographic and health questionnaires. Personality and cognitive measures were obtained on a subset of participants. Each dataset contains a T1-weighted structural MRI scan and either one (n=1,570) or two (n=1,139) resting state functional MRI scans. Test-retest reliability datasets are included from 69 participants scanned within six months of their initial visit. For the majority of participants self-report behavioral and cognitive measures are included (n=926 and n=892 respectively). Analyses of data quality, structure, function, personality, and cognition are presented to demonstrate the dataset's utility.
Fish and fishery historical data since the 19th century in the Adriatic Sea, Mediterranean
NASA Astrophysics Data System (ADS)
Fortibuoni, Tomaso; Libralato, Simone; Arneri, Enrico; Giovanardi, Otello; Solidoro, Cosimo; Raicevich, Saša
2017-09-01
Historic data on biodiversity provide the context for present observations and allow studying long-term changes in marine populations. Here we present multiple datasets on fish and fisheries of the Adriatic Sea covering the last two centuries, ranging from qualitative observations to standardised scientific monitoring. The datasets consist of three groups: (1) early naturalists' descriptions of fish fauna, including information (e.g., presence, perceived abundance, size) on 255 fish species for the period 1818-1936; (2) historical landings from major Northern Adriatic fish markets (Venice, Trieste, Rijeka) for the period 1902-1968, Italian official landings for the Northern and Central Adriatic (1953-2012) and landings from the Lagoon of Venice (1945-2001); (3) trawl-survey data from seven surveys spanning the period 1948-1991 and including Catch per Unit of Effort data (kg h-1 and/or n h-1) for 956 hauls performed at 301 stations. The integration of these datasets has already proven useful for analysing historical changes in marine communities over time, and their availability through an open-source data portal will facilitate analyses in the framework of marine historical ecology.
A synthetic dataset for evaluating soft and hard fusion algorithms
NASA Astrophysics Data System (ADS)
Graham, Jacob L.; Hall, David L.; Rimland, Jeffrey
2011-06-01
There is an emerging demand for the development of data fusion techniques and algorithms that are capable of combining conventional "hard" sensor inputs such as video, radar, and multispectral sensor data with "soft" data including textual situation reports, open-source web information, and "hard/soft" data such as image or video data that includes human-generated annotations. New techniques that assist in sense-making over a wide range of vastly heterogeneous sources are critical to improving tactical situational awareness in counterinsurgency (COIN) and other asymmetric warfare situations. A major challenge in this area is the lack of realistic datasets available for test and evaluation of such algorithms. While "soft" message sets exist, they tend to be of limited use for data fusion applications due to the lack of critical message pedigree and other metadata. They also lack corresponding hard sensor data that presents reasonable "fusion opportunities" to evaluate the ability to make connections and inferences that span the soft and hard data sets. This paper outlines the design methodologies, content, and some potential use cases of a COIN-based synthetic soft and hard dataset created under a United States Multi-disciplinary University Research Initiative (MURI) program funded by the U.S. Army Research Office (ARO). The dataset includes realistic synthetic reports from a variety of sources, corresponding synthetic hard data, and an extensive supporting database that maintains "ground truth" through logical grouping of related data into "vignettes." The supporting database also maintains the pedigree of messages and other critical metadata.
Bennett, Derek S.; Lyons, John B.; Wittkop, Chad A.; Dicken, Connie L.
2006-01-01
The New Hampshire Geological Survey collects data and performs research on the land, mineral, and water resources of the State, and disseminates the findings of such research to the public through maps, reports, and other publications. The Bedrock Geologic Map of New Hampshire, by John B. Lyons, Wallace A. Bothner, Robert H. Moench, and James B. Thompson, was published in paper format by the U.S. Geological Survey (USGS) in 1997. The online version of this CD contains digital datasets of the State map that are intended to assist professional geologists, land-use planners, water resource professionals, and engineers, and to inform the interested layperson. In addition to the bedrock geology, the datasets include geopolitical and hydrologic information, such as political boundaries, quadrangle boundaries, hydrologic units, and water-well data. A more thorough explanation for each of these datasets may be found in the accompanying metadata files. The data are spatially referenced and may be used in a geographic information system (GIS). ArcExplorer, the Environmental Systems Research Institute's (ESRI) free GIS data viewer, is available at http://www.esri.com/software/arcexplorer. ArcExplorer provides basic functions that are needed to harness the power and versatility of the spatial datasets. Additional information on the viewer and other ESRI products may be found on the ArcExplorer website. Although extensive review and revisions of the data have been performed by the USGS and the New Hampshire Geological Survey, these data represent interpretations made by professional geologists using the best available data, and are intended to provide general geologic information. Use of these data at scales larger than 1:250,000 will not provide greater accuracy. The data are not intended to replace site-specific or specific-use investigations. The U.S. 
Geological Survey, New Hampshire Geological Survey, and State of New Hampshire make no representation or warranty, expressed or implied, regarding the use, accuracy, or completeness of the data presented herein, or from a map printed from these data; nor shall the act of distribution constitute any such warranty. The New Hampshire Geological Survey disclaims any legal responsibility or liability for interpretations made from the map, or decisions based thereon. For more information on New Hampshire Geological Survey programs please visit the State's website at http://des.nh.gov/Geology/. New Hampshire Geographically Referenced Analysis and Information Transfer System (NH GRANIT) provides access to statewide GIS (http://www.granit.unh.edu/). Questions about this CD or about other datasets should be directed to the New Hampshire Department of Environmental Services.
LEAP: biomarker inference through learning and evaluating association patterns.
Jiang, Xia; Neapolitan, Richard E
2015-03-01
Single nucleotide polymorphism (SNP) high-dimensional datasets are available from Genome Wide Association Studies (GWAS). Such data provide researchers opportunities to investigate the complex genetic basis of diseases. Much genetic risk might be due to undiscovered epistatic interactions, in which combinations of several genes affect disease. Research aimed at discovering interacting SNPs from GWAS datasets has proceeded in two directions: first, tools were developed to evaluate candidate interactions; second, algorithms were developed to search over the space of candidate interactions. Another problem when learning interacting SNPs, which has not received much attention, is evaluating how likely it is that the learned SNPs are associated with the disease; a complete system should provide this information as well. We develop such a system. Our system, called LEAP, includes a new heuristic search algorithm for learning interacting SNPs, and a Bayesian network based algorithm for computing the probability of their association. We evaluated the performance of LEAP using 100 1,000-SNP simulated datasets, each of which contains 15 SNPs involved in interactions. When learning interacting SNPs from these datasets, LEAP outperformed seven other methods. Furthermore, only SNPs involved in interactions were found to be probable. We also used LEAP to analyze real Alzheimer's disease and breast cancer GWAS datasets. We obtained interesting and new results from the Alzheimer's dataset, but limited results from the breast cancer dataset. We conclude that our results support LEAP as a useful tool for extracting candidate interacting SNPs from high-dimensional datasets and determining their probability. © 2015 The Authors. Genetic Epidemiology published by Wiley Periodicals, Inc.
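The kind of interaction evaluation such search algorithms build on can be sketched as an exhaustive pairwise scan that scores each joint genotype against the phenotype; the mutual-information score below is a generic choice for illustration, not LEAP's Bayesian-network scoring, and the toy genotypes are invented:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """I(X; Y) in bits between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def scan_pairs(genotypes, phenotype):
    """Score every SNP pair by the mutual information between the joint
    genotype and the phenotype; return pairs sorted best-first."""
    scores = []
    for i, j in combinations(range(len(genotypes)), 2):
        joint = list(zip(genotypes[i], genotypes[j]))
        scores.append(((i, j), mutual_information(joint, phenotype)))
    return sorted(scores, key=lambda t: t[1], reverse=True)

# Toy data: disease is the XOR of SNP 0 and SNP 1; SNP 2 is noise.
# Neither SNP 0 nor SNP 1 is informative alone -- a purely epistatic effect.
g0 = [0, 0, 1, 1, 0, 0, 1, 1]
g1 = [0, 1, 0, 1, 0, 1, 0, 1]
g2 = [0, 0, 0, 0, 1, 1, 1, 1]
phen = [a ^ b for a, b in zip(g0, g1)]
best = scan_pairs([g0, g1, g2], phen)[0][0]
```

The XOR pattern means no single SNP carries any signal, yet the pair (0, 1) scores a full bit of information — exactly the sort of interaction that marginal-effect methods miss and pair-level search is designed to find.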
SUMER-IRIS Observations of AR11875
NASA Astrophysics Data System (ADS)
Schmit, Donald; Innes, Davina
2014-05-01
We present results of the first joint observing campaign of IRIS and SOHO/SUMER. While the IRIS datasets provide information on the chromosphere and transition region, SUMER provides complementary diagnostics on the corona. On 2013-10-24, we observed an active region, AR11875, and the surrounding plage for approximately 4 hours using rapid-cadence observing programs. These datasets include spectra from a small C-class flare which occurs in conjunction with an Ellerman-bomb-type event. Our analysis focusses on how the high spatial resolution and slit-jaw imaging capabilities of IRIS shed light on the unresolved structure of transient events in the SUMER catalog.
Hypergraph Based Feature Selection Technique for Medical Diagnosis.
Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar
2016-11-01
The spread of the internet and information systems across various domains has resulted in the substantial generation of multidimensional datasets. Data mining and knowledge discovery techniques that extract the information contained in these multidimensional datasets play a significant role in exploiting their full benefit. The presence of a large number of features in high-dimensional datasets incurs high computational cost in terms of computing power and time. Hence, feature selection techniques are commonly used to build robust machine learning models by selecting a subset of relevant features that captures the maximal information content of the original dataset. In this paper, a novel Rough Set based K-Helly feature selection technique (RSKHT), which hybridizes Rough Set Theory (RST) with the K-Helly property of hypergraph representation, is designed to identify the optimal feature subset, or reduct, for medical diagnostic applications. Experiments carried out using medical datasets from the UCI repository demonstrate the dominance of RSKHT over other feature selection techniques with respect to reduct size, classification accuracy and time complexity. The performance of RSKHT was validated using the WEKA tool, showing that RSKHT is computationally attractive and flexible over massive datasets.
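The rough-set notion of a reduct rests on the dependency degree: the fraction of objects whose equivalence class under a candidate attribute subset is consistent with the decision attribute. The sketch below shows that degree driving a generic greedy forward selection — not the K-Helly hypergraph procedure of RSKHT, and the toy decision table is made up:

```python
from collections import defaultdict

def dependency(rows, decisions, attrs):
    """Rough-set dependency degree: the fraction of objects whose
    equivalence class under `attrs` is pure in the decision attribute."""
    groups = defaultdict(set)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].add(i)
    pos = sum(len(g) for g in groups.values()
              if len({decisions[i] for i in g}) == 1)
    return pos / len(rows)

def greedy_reduct(rows, decisions):
    """Greedy approximation to a reduct: repeatedly add the attribute
    that raises the dependency degree most, until it matches the
    dependency of the full attribute set."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, decisions, all_attrs)
    chosen = []
    while dependency(rows, decisions, chosen) < target:
        best = max((a for a in all_attrs if a not in chosen),
                   key=lambda a: dependency(rows, decisions, chosen + [a]))
        chosen.append(best)
    return chosen

# Toy medical table: three symptom attributes; attribute 1 alone
# determines the diagnosis, so the reduct should be just [1].
rows = [(0, 0, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0)]
diag = [0, 0, 1, 1]
```

Greedy selection is only an approximation (a true reduct search is combinatorial, which is what motivates smarter formulations such as the hypergraph one), but it conveys how the dependency degree rewards attribute subsets that preserve the decision structure.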
DOIs for Data: Progress in Data Citation and Publication in the Geosciences
NASA Astrophysics Data System (ADS)
Callaghan, S.; Murphy, F.; Tedds, J.; Allan, R.
2012-12-01
Identifiers for data are the bedrock on which data citation and publication rest. These, in turn, are widely proposed as methods for encouraging researchers to share their datasets while receiving academic credit for their efforts in producing them. However, neither data citation nor publication can be properly achieved without a method of identifying clearly what is, and what isn't, part of the dataset. Once a dataset becomes part of the scientific record (either through formal data publication or through being cited), issues such as dataset stability and permanence become vital to address. In the geosciences, several projects in the UK are concentrating on issues of dataset identification, citation and publication. The UK Natural Environment Research Council's (NERC) Science Information Strategy data citation and publication project is addressing the issues of identifiers for data, stability, transparency, and credit for data producers through data citation. At the data publication level, 2012 saw the launch of the new Wiley title Geoscience Data Journal and the PREPARDE (Peer Review for Publication & Accreditation of Research Data in the Earth sciences) project, both aiming to encourage data publication by addressing issues such as data paper submission workflows and the scientific peer review of data. All of these initiatives work with a range of partners including academic institutions, learned societies, data centers and commercial publishers, both nationally and internationally, with a cross-project aim of developing the mechanisms so data can be identified, cited and published with confidence. 
This involves investigating barriers and drivers to data publishing and sharing, peer review, and re-use of geoscientific datasets, and specifically such topics as dataset requirements for citation, workflows for dataset ingestion into data centers and publishers, procedures and policies for editors, reviewers and authors of data publication, and assessing the trustworthiness of data archives. A key goal is to ensure that these projects reach out to, and are informed by, other related initiatives on a global basis, in particular anyone interested in developing long-term sustainable policies, processes, incentives and business models for managing and publishing research data. This presentation will give an overview of progress in the projects mentioned above, specifically focussing on the use of DOIs for datasets hosted in the NERC environmental data centers, and how DOIs are enabling formal data citation and publication in the geosciences.
Spatial datasets of radionuclide contamination in the Ukrainian Chernobyl Exclusion Zone
NASA Astrophysics Data System (ADS)
Kashparov, Valery; Levchuk, Sviatoslav; Zhurba, Marina; Protsak, Valentyn; Khomutinin, Yuri; Beresford, Nicholas A.; Chaplow, Jacqueline S.
2018-02-01
The dataset "Spatial datasets of radionuclide contamination in the Ukrainian Chernobyl Exclusion Zone" was developed to make publicly available data collected between May 1986 (immediately after Chernobyl) and 2014 by the Ukrainian Institute of Agricultural Radiology (UIAR) after the Chernobyl accident. The dataset includes results from comprehensive soil sampling across the Chernobyl Exclusion Zone (CEZ). Analyses include radiocaesium (134Cs and 137Cs), 90Sr and 154Eu activity concentrations and soil property data; plutonium isotope activity concentrations in soil (including distribution in the soil profile); analyses of hot
(or fuel) particles from the CEZ (data from Poland and across Europe are also included); and results of monitoring in the Ivankov district, a region adjacent to the exclusion zone. The purpose of this paper is to describe the available data and the methodology used to obtain them. The data will be valuable to those conducting studies within the CEZ in a number of ways, for instance (i) for performing robust exposure estimates for wildlife, (ii) for predicting comparative activity concentrations of different key radionuclides, (iii) for providing a baseline against which future surveys in the CEZ can be compared, (iv) as a source of information on the behaviour of fuel particles (FPs), (v) for performing retrospective dose assessments and (vi) for assessing natural background dose rates in the CEZ. The CEZ has been proposed as a radioecological observatory
(i.e. a radioactively contaminated site that will provide a focus for long-term, radioecological collaborative international research). Key to the future success of this concept is open access to data for the CEZ. The data presented here are a first step in this process. The data and supporting documentation are freely available from the Environmental Information Data Centre (EIDC) under the terms and conditions of the Open Government Licence: https://doi.org/10.5285/782ec845-2135-4698-8881-b38823e533bf.
Concept for Future Data Services at the Long-Term Archive of WDCC combining DOIs with common PIDs
NASA Astrophysics Data System (ADS)
Stockhause, Martina; Weigel, Tobias; Toussaint, Frank; Höck, Heinke; Thiemann, Hannes; Lautenschlager, Michael
2013-04-01
The World Data Center for Climate (WDCC), hosted at the German Climate Computing Center (DKRZ), maintains a long-term archive (LTA) of climate model data as well as observational data. WDCC distinguishes between two types of LTA data. Structured data: the output of an instrument or of a climate model run consists of numerous, highly structured individual datasets in a uniform format; part of these data is also published on an ESGF (Earth System Grid Federation) data node, and detailed metadata are available, allowing for fine-grained, user-defined data access. Unstructured data: LTA data of finished scientific projects are in general unstructured and consist of datasets of different formats, sizes, and contents; for these data, compact metadata are available as content information. The structured data are suitable for WDCC's DataCite DOI process; the unstructured project data are only in exceptional cases. The DOI process includes thorough quality control of technical as well as scientific aspects by the publication agent and the data creator. DOIs are assigned to data collections appropriate for citation in scientific publications, such as a simulation run; the data collection is defined in agreement with the data creator. At the moment there is no way to identify and cite individual datasets within such a DOI data collection, analogous to the citation of chapters in a book. Also missing is a compact citation convention for a user-specified collection of data. WDCC therefore complements its existing LTA/DOI concept with Persistent Identifier (PID) assignment to datasets using Handles. In addition to identifying data for internal and external use, the PID concept allows relations among PIDs to be defined. Such structural information is stored as key-value pairs directly in the Handles. Thus, relations provide basic provenance or lineage information, even if part of the data, such as intermediate results, is lost. 
WDCC intends to use additional PIDs on metadata entities with a relation to the data PID(s). These add background information on the data creation process (e.g. descriptions of the experiment, model, model set-up, and platform for the model run) to the data, significantly increasing the re-usability of the archived model data. Other valuable information for scientific collaboration, such as quality information and annotations, could be added by the same mechanism. Apart from relations among data and metadata entities, PIDs on collections are advantageous for model data: collections allow persistent references to single datasets or to subsets of data assigned a DOI, and data objects and additional information objects can be consistently connected via relations (provenance, creation, and quality information for data).
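The idea of storing relations as key-value pairs in PID records, and walking them to recover lineage, can be sketched with a minimal in-memory registry. This is an illustration only: the relation keys (IS_DERIVED_FROM, HAS_METADATA), the handle prefixes, and the function names are invented, not the Handle System API or WDCC's actual schema.

```python
# Minimal sketch: relations stored as key -> target-PID pairs in PID records
# (illustrative; real Handle records use typed index/value entries).

registry = {}

def register(pid, **relations):
    # store each relation as a key-value pair in the PID's record
    registry[pid] = dict(relations)

def provenance(pid):
    # follow IS_DERIVED_FROM links to reconstruct basic lineage,
    # even when intermediate data objects themselves no longer exist
    chain = [pid]
    while "IS_DERIVED_FROM" in registry.get(chain[-1], {}):
        chain.append(registry[chain[-1]]["IS_DERIVED_FROM"])
    return chain

register("hdl:21.14100/raw-run-42")
register("hdl:21.14100/intermediate-42",
         IS_DERIVED_FROM="hdl:21.14100/raw-run-42")
register("hdl:21.14100/final-42",
         IS_DERIVED_FROM="hdl:21.14100/intermediate-42",
         HAS_METADATA="hdl:21.14100/experiment-desc-42")

print(provenance("hdl:21.14100/final-42"))  # final -> intermediate -> raw
```

The same key-value mechanism carries the metadata relation (HAS_METADATA), linking the data PID to a record describing the experiment or model set-up.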
Computational approaches for predicting biomedical research collaborations.
Zhang, Qing; Yu, Hong
2014-01-01
Biomedical research is increasingly collaborative, and successful collaborations often produce high-impact work. Computational approaches can be developed for automatically predicting biomedical research collaborations. Previous work on collaboration prediction has mainly explored the topological structure of research collaboration networks, leaving out the rich semantic information in the publications themselves. In this paper, we propose supervised machine learning approaches to predict research collaborations in the biomedical field. We explored both semantic features extracted from author research-interest profiles and author network topological features. We found that the most informative semantic features for predicting author collaborations relate to research interest, including the similarity of out-citing citations and the similarity of abstracts. Of the four supervised machine learning models (naïve Bayes, naïve Bayes multinomial, SVMs, and logistic regression), the best-performing model is logistic regression, with ROC values ranging from 0.766 to 0.980 on different datasets. To our knowledge, we are the first to study in depth how research interests and productivity can be used for collaboration prediction. Our approach is computationally efficient, scalable and simple to implement. The datasets of this study are available at https://github.com/qingzhanggithub/medline-collaboration-datasets.
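The general approach can be sketched as follows: one semantic feature (cosine similarity of abstract bags-of-words) and one topological feature (shared co-author count) feed a logistic regression classifier. The toy abstracts, feature choices, and hand-rolled trainer below are illustrative assumptions, not the paper's actual feature set or pipeline.

```python
import math

def cosine(a, b):
    # cosine similarity between two bag-of-words term-frequency dicts
    num = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def train_logreg(X, y, lr=0.5, epochs=2000):
    # plain stochastic gradient descent on the logistic loss
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            g = 1 / (1 + math.exp(-z)) - yi  # prediction error
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

# invented author abstracts as term-frequency dicts
abstracts = {
    "a1": {"cancer": 3, "genomics": 2},
    "a2": {"cancer": 2, "genomics": 1},
    "a3": {"robotics": 4, "control": 2},
    "a4": {"robotics": 1, "control": 3},
}

# author pairs: [abstract similarity, shared co-author count]; 1 = collaborated
X = [[cosine(abstracts["a1"], abstracts["a2"]), 2],
     [cosine(abstracts["a3"], abstracts["a4"]), 1],
     [cosine(abstracts["a1"], abstracts["a3"]), 0],
     [cosine(abstracts["a2"], abstracts["a4"]), 0]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
print(predict(w, b, [0.9, 2]))  # high predicted probability of collaboration
```

On this separable toy data the model cleanly ranks semantically similar, well-connected pairs above dissimilar ones; the real system adds many more semantic and topological features.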
Enrichment of OpenStreetMap Data Completeness with Sidewalk Geometries Using Data Mining Techniques.
Mobasheri, Amin; Huang, Haosheng; Degrossi, Lívia Castro; Zipf, Alexander
2018-02-08
Tailored routing and navigation services for wheelchair users require certain information about sidewalk geometries and their attributes to work effectively. Except for a few regions/cities, such detailed information is not present in current crowdsourced mapping databases, including OpenStreetMap. The CAP4Access European project aimed to use (and enrich) OpenStreetMap to make it fit for the purpose of wheelchair routing. In this respect, this study presents a modified methodology based on data mining techniques for constructing sidewalk geometries from multiple GPS traces collected by wheelchair users during an urban travel experiment. The derived sidewalk geometries can be used to enrich OpenStreetMap to support wheelchair routing. The proposed method was applied to a case study in Heidelberg, Germany. The constructed sidewalk geometries were compared to an official reference dataset ("ground truth dataset"). The case study shows that the constructed sidewalk network overlays 96% of the official reference dataset. Furthermore, in terms of positional accuracy, a low root mean square error (RMSE) value (0.93 m) is achieved. The article closes with a discussion of the results, conclusions, and future research directions.
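The positional-accuracy figure above is a root mean square error over offsets between the constructed geometry and matched reference points. A minimal sketch of how such a value is computed (the coordinates below are made-up, not the study's data):

```python
import math

def rmse(constructed, reference):
    # root mean square of Euclidean offsets between matched point pairs (metres)
    sq = [(cx - rx) ** 2 + (cy - ry) ** 2
          for (cx, cy), (rx, ry) in zip(constructed, reference)]
    return math.sqrt(sum(sq) / len(sq))

# hypothetical matched points along a sidewalk centreline (x, y in metres)
constructed = [(0.0, 0.0), (10.0, 0.5), (20.0, -0.5)]
reference   = [(0.0, 0.5), (10.0, 0.0), (20.0, 0.0)]
print(round(rmse(constructed, reference), 2))  # → 0.5
```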
Analysis of Specular Reflections Off Geostationary Satellites
NASA Astrophysics Data System (ADS)
Jolley, A.
2016-09-01
Many photometric studies of artificial satellites have attempted to define procedures that minimise the size of datasets required to infer information about satellites. However, it is unclear whether deliberately limiting the size of datasets significantly reduces the potential for information to be derived from them. In 2013 an experiment was conducted using a 14 inch Celestron CG-14 telescope to gain multiple night-long, high temporal resolution datasets of six geostationary satellites [1]. This experiment produced evidence of complex variations in the spectral energy distribution (SED) of reflections off satellite surface materials, particularly during specular reflections. Importantly, specific features relating to the SED variations could only be detected with high temporal resolution data. An update is provided regarding the nature of SED and colour variations during specular reflections, including how some of the variables involved contribute to these variations. Results show that care must be taken when comparing observed spectra to a spectral library for the purpose of material identification; a spectral library that uses wavelength as the only variable will be unable to capture changes that occur to a material's reflected spectra with changing illumination and observation geometry. Conversely, colour variations with changing illumination and observation geometry might provide an alternative means of determining material types.
TriageTools: tools for partitioning and prioritizing analysis of high-throughput sequencing data.
Fimereli, Danai; Detours, Vincent; Konopka, Tomasz
2013-04-01
High-throughput sequencing is becoming a popular research tool but carries with it considerable costs in terms of computation time, data storage and bandwidth. Meanwhile, some research applications focusing on individual genes or pathways do not necessitate processing of a full sequencing dataset. Thus, it is desirable to partition a large dataset into smaller, manageable, but relevant pieces. We present a toolkit for partitioning raw sequencing data that includes a method for extracting reads that are likely to map onto pre-defined regions of interest. We show the method can be used to extract information about genes of interest from DNA or RNA sequencing samples in a fraction of the time and disk space required to process and store a full dataset. We report speedup factors between 2.6 and 96, depending on settings and samples used. The software is available at http://www.sourceforge.net/projects/triagetools/.
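The read-extraction idea can be sketched as a k-mer filter: index the k-mers of a region of interest, then keep only reads sharing at least a threshold number of them. This is a toy illustration of the general technique, with invented sequences and parameters, not TriageTools' actual algorithm.

```python
def kmers(seq, k):
    # all substrings of length k in a sequence
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def triage(reads, region, k=5, min_hits=2):
    # keep reads sharing at least min_hits k-mers with the region of interest
    index = kmers(region, k)
    return [r for r in reads if len(kmers(r, k) & index) >= min_hits]

region = "ACGTACGTGGTACCA"        # hypothetical region of interest
reads = ["TACGTACGTG",            # overlaps the region -> kept
         "GGGGGGCCCC",            # unrelated -> discarded
         "GTGGTACCAT"]            # overlaps the region -> kept
print(triage(reads, region))      # → ['TACGTACGTG', 'GTGGTACCAT']
```

Because the filter touches each read once against a precomputed index, a full dataset can be reduced to region-relevant reads without aligning everything first, which is where the reported speedups come from.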
Correlation of Gear Surface Fatigue Lives to Lambda Ratio (Specific Film Thickness)
NASA Technical Reports Server (NTRS)
Krantz, Timothy Lewis
2013-01-01
The effect of the lubrication regime on gear performance has been recognized, qualitatively, for decades. The lubrication regime is often characterized by the specific film thickness, the ratio of lubricant film thickness to the composite surface roughness. Three studies done at NASA to investigate gear pitting life are revisited in this work. All tests were done at a common load. In one study, ground gears were tested using a variety of lubricants spanning a range of viscosities, so the gears operated with differing film thicknesses. In the second and third studies, the performance of gears with ground teeth and with superfinished teeth was assessed. Thicker oil films provided longer lives, as did improved surface finish. These datasets were combined into a common dataset using the concept of specific film thickness. This unique dataset of more than 258 tests provides gear designers with some qualitative information to support gear design decisions.
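The specific film thickness (lambda ratio) defined above divides the lubricant film thickness by the composite surface roughness, commonly taken as the root sum of squares of the two mating surfaces' RMS roughness values. A small sketch under that common convention (the numbers are illustrative, not from the tests):

```python
import math

def specific_film_thickness(h_min, rq_pinion, rq_gear):
    # lambda = film thickness / composite RMS roughness of the two surfaces
    composite = math.sqrt(rq_pinion ** 2 + rq_gear ** 2)
    return h_min / composite

# e.g. a 0.5 um film over surfaces with 0.3 um and 0.4 um RMS roughness
print(round(specific_film_thickness(0.5, 0.3, 0.4), 2))  # → 1.0
```

Lambda near or below 1 indicates boundary/mixed lubrication with asperity contact; larger values indicate full-film operation, which is why thicker films and smoother finishes both extended life in the combined dataset.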
Service Delivery Experiences and Intervention Needs of Military Families with Children with ASD
ERIC Educational Resources Information Center
Davis, Jennifer M.; Finke, Erinn; Hickerson, Benjamin
2016-01-01
The purpose of this study was to describe the experiences of military families with children with autism spectrum disorder (ASD) specifically as it relates to relocation. Online survey methodology was used to gather information from military spouses with children with ASD. The finalized dataset included 189 cases. Descriptive statistics and…
Van Gosen, Bradley S.
2008-01-01
This map and its accompanying dataset provide information for 113 natural asbestos occurrences in the Southwestern United States (U.S.), using descriptions found in the geologic literature. Data on location, mineralogy, geology, and relevant literature for each asbestos site are provided. Using the map and digital data in this report, the user can examine the distribution of previously reported asbestos occurrences and their geological characteristics in the Southwestern U.S., which includes sites in Arizona, Nevada, and Utah. This report is part of an ongoing study by the U.S. Geological Survey to identify and map reported natural asbestos occurrences in the U.S., which thus far includes similar maps and datasets of natural asbestos occurrences within the Eastern U.S. (http://pubs.usgs.gov/of/2005/1189/), the Central U.S. (http://pubs.usgs.gov/of/2006/1211/), and the Rocky Mountain States (http://pubs.usgs.gov/of/2007/1182/). These reports are intended to provide State and local government agencies and other stakeholders with geologic information on natural occurrences of asbestos in the U.S.
McCann, Liza J; Kirkham, Jamie J; Wedderburn, Lucy R; Pilkington, Clarissa; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Williamson, Paula R; Beresford, Michael W
2015-06-12
Juvenile dermatomyositis (JDM) is a rare autoimmune inflammatory disorder associated with significant morbidity and mortality. International collaboration is necessary to better understand the pathogenesis of the disease, response to treatment and long-term outcome. To aid international collaboration, it is essential to have a core set of data that all researchers and clinicians collect in a standardised way for clinical purposes and for research. This should include demographic details, diagnostic data and measures of disease activity, investigations and treatment. Variables in existing clinical registries have been compared to produce a provisional dataset for JDM. We now aim to develop this into a consensus-approved minimum core dataset, tested in a wider setting, with the objective of achieving international agreement. A two-stage, bespoke Delphi process will engage the opinions of a large number of key stakeholders through email distribution via established international paediatric rheumatology and myositis organisations. This, together with a formalised patient/parent participation process, will help inform a consensus meeting of international experts that will utilise a nominal group technique (NGT). The resulting proposed minimal dataset will be tested for feasibility within existing database infrastructures. The developed minimal dataset will be sent to all internationally representative collaborators for final comment. The participants of the expert consensus group will be asked to draw together these comments, ratify and 'sign off' the final minimal dataset. An internationally agreed minimal dataset has the potential to significantly enhance collaboration, allow effective communication between groups, provide a minimal standard of care and enable analysis of the largest possible number of JDM patients, providing a greater understanding of this disease. 
The final approved minimum core dataset could be rapidly incorporated into national and international collaborative efforts, including existing prospective databases, and be available for use in randomised controlled trials and for treatment/protocol comparisons in cohort studies.
A benchmark for comparison of cell tracking algorithms
Maška, Martin; Ulman, Vladimír; Svoboda, David; Matula, Pavel; Matula, Petr; Ederra, Cristina; Urbiola, Ainhoa; España, Tomás; Venkatesan, Subramanian; Balak, Deepak M.W.; Karas, Pavel; Bolcková, Tereza; Štreitová, Markéta; Carthel, Craig; Coraluppi, Stefano; Harder, Nathalie; Rohr, Karl; Magnusson, Klas E. G.; Jaldén, Joakim; Blau, Helen M.; Dzyubachyk, Oleh; Křížek, Pavel; Hagen, Guy M.; Pastor-Escuredo, David; Jimenez-Carretero, Daniel; Ledesma-Carbayo, Maria J.; Muñoz-Barrutia, Arrate; Meijering, Erik; Kozubek, Michal; Ortiz-de-Solorzano, Carlos
2014-01-01
Motivation: Automatic tracking of cells in multidimensional time-lapse fluorescence microscopy is an important task in many biomedical applications. A novel framework for objective evaluation of cell tracking algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2013 Cell Tracking Challenge. In this article, we present the logistics, datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. Results: The main contributions of the challenge include the creation of a comprehensive video dataset repository and the definition of objective measures for comparison and ranking of the algorithms. With this benchmark, six algorithms covering a variety of segmentation and tracking paradigms have been compared and ranked based on their performance on both synthetic and real datasets. Given the diversity of the datasets, we do not declare a single winner of the challenge. Instead, we present and discuss the results for each individual dataset separately. Availability and implementation: The challenge Web site (http://www.codesolorzano.com/celltrackingchallenge) provides access to the training and competition datasets, along with the ground truth of the training videos. It also provides access to Windows and Linux executable files of the evaluation software and most of the algorithms that competed in the challenge. Contact: codesolorzano@unav.es Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24526711
EnviroAtlas - Austin, TX - 15m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 15-m riparian buffer that is vegetated. Vegetated cover is defined as Trees & Forest and Grass & Herbaceous. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Austin, TX - 15m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 15-m riparian buffer that is forested. Forest is defined as Trees & Forest. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Data discovery with DATS: exemplar adoptions and lessons learned.
Gonzalez-Beltran, Alejandra N; Campbell, John; Dunn, Patrick; Guijarro, Diana; Ionescu, Sanda; Kim, Hyeoneui; Lyle, Jared; Wiser, Jeffrey; Sansone, Susanna-Assunta; Rocca-Serra, Philippe
2018-01-01
The DAta Tag Suite (DATS) is a model supporting dataset description, indexing, and discovery. It is available as an annotated serialization with schema.org, a vocabulary used by major search engines, thus making the datasets discoverable on the web. DATS underlies DataMed, the National Institutes of Health Big Data to Knowledge Data Discovery Index prototype, which aims to provide a "PubMed for datasets." The experience gained while indexing a heterogeneous range of more than 60 repositories in DataMed helped in evaluating DATS's entities, attributes, and scope. In this work, three additional, diverse exemplar data sources were mapped to DATS by their representatives or experts, offering a deep assessment of DATS's fitness against a new set of existing data. The procedure, including feedback from users and implementers, resulted in DATS implementation guidelines and best practices, and in the identification of a path for evolving and optimizing the model. Finally, the work exposed additional needs when defining datasets for indexing, especially in the context of clinical and observational information. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association.
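A dataset description serialized with schema.org, the vocabulary DATS annotates against, looks roughly like the JSON-LD below. The field values are invented for illustration, and this is a generic schema.org Dataset record, not an actual DATS serialization.

```python
import json

# Hypothetical schema.org JSON-LD description of a dataset (values invented);
# search engines index such records when embedded in a landing page.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example soil radionuclide survey",
    "description": "Illustrative dataset description for web indexing.",
    "identifier": "https://doi.org/10.5285/example",
    "keywords": ["radionuclide", "soil", "survey"],
}
print(json.dumps(record, indent=2))
```

Embedding such a record in a `<script type="application/ld+json">` block on a dataset landing page is the mechanism by which the datasets become "discoverable on the web."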
Experimental formation enthalpies for intermetallic phases and other inorganic compounds
Kim, George; Meschel, S. V.; Nash, Philip; Chen, Wei
2017-01-01
The standard enthalpy of formation of a compound is the energy associated with the reaction to form the compound from its component elements. It is a fundamental thermodynamic property that determines a compound's phase stability and can be coupled with other thermodynamic data to calculate phase diagrams. Calorimetry provides the only direct method by which the standard enthalpy of formation is experimentally measured; however, the measurement is often a time- and energy-intensive process. We present a dataset of enthalpies of formation measured by high-temperature calorimetry. The phases measured in this dataset include intermetallic compounds with transition metal and rare-earth elements, metal borides, metal carbides, and metal silicides. These measurements were collected from over 50 years of calorimetric experiments. The dataset contains 1,276 entries on experimental enthalpy of formation values and structural information. Most of the entries are for binary compounds, but ternary and quaternary compounds are being added as they become available. The dataset also contains predictions of enthalpy of formation from first-principles calculations for comparison. PMID:29064466
EnviroAtlas - Des Moines, IA - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. Vegetated cover is defined as Trees & Forest and Grass & Herbaceous. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://enviroatlas.epa.gov/EnviroAtlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets)
EnviroAtlas - New York, NY - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. In this community, vegetated cover is defined as Trees & Forest and Grass & Herbaceous. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets)
EnviroAtlas - Austin, TX - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. Vegetated cover is defined as Trees & Forest and Grass & Herbaceous. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Paterson, NJ - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. EnviroAtlas defines tree buffer for this community as only trees and forest. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Minneapolis/St. Paul, MN - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. In this community, forest is defined as Trees and Forest and Woody Wetlands. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Cleveland, OH - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. In this community, forest is defined as Trees & Forest and Woody Wetlands. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - New York, NY - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. In this community, forest is defined as Trees & Forest. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets)
EnviroAtlas - Memphis, TN - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. Vegetated cover is defined as Trees & Forest, Grass & Herbaceous, Woody Wetlands, and Emergent Wetlands. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Cleveland, OH - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. In this community, vegetated cover is defined as Trees & Forest, Grass & Herbaceous, Woody Wetlands, and Emergent Wetlands. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Austin, TX - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. Forest is defined as Trees & Forest. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Memphis, TN - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. Forest is defined as Trees & Forest and Woody Wetlands. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Des Moines, IA - 51m Riparian Buffer Forest Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is forested. Forest is defined as Trees & Forest. There is a potential for decreased water quality in areas where the riparian buffer is less forested. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://enviroatlas.epa.gov/EnviroAtlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
EnviroAtlas - Paterson, NJ - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the Atlas Area. EnviroAtlas defines vegetated buffer for this community as Trees & Forest and Grass & Herbaceous. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
a Comparative Analysis of Five Cropland Datasets in Africa
NASA Astrophysics Data System (ADS)
Wei, Y.; Lu, M.; Wu, W.
2018-04-01
Food security, particularly in Africa, is a challenge yet to be resolved, and the cropland area and spatial distribution derived from remote sensing imagery are vital information for addressing it. In this paper, we compare five global cropland datasets covering Africa circa 2010 (CCI Land Cover, GlobCover, MODIS Collection 5, GlobeLand30, and Unified Cropland) in terms of cropland area and spatial location. The accuracy of the cropland area calculated from each dataset was assessed against statistical data, and the spatial-location accuracy of each product was assessed with an error matrix based on validation samples. The results show that GlobeLand30 fits the statistics best, followed by MODIS Collection 5 and Unified Cropland, while GlobCover and CCI Land Cover have lower accuracies. For spatial location, GlobeLand30 again reaches the highest accuracy, followed by Unified Cropland, MODIS Collection 5, and GlobCover; CCI Land Cover has the lowest accuracy. The spatial-location accuracy of all five datasets is generally higher in the Csa climate zone, with its suitable farming conditions, than in the Bsk zone.
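The per-class assessment described above relies on a standard error (confusion) matrix. A minimal sketch, with hypothetical validation counts, of how overall, producer's, and user's accuracies fall out of such a matrix:

```python
# Hedged sketch: accuracy measures from an error (confusion) matrix, the
# standard tool for assessing the spatial-location accuracy of land-cover
# products. All counts below are illustrative, not from the paper.

def accuracy_from_error_matrix(matrix):
    """matrix[i][j] = validation samples of true class i mapped as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    overall = correct / total
    # Producer's accuracy: correct / all samples truly in class i (omission).
    producers = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    # User's accuracy: correct / all samples mapped as class j (commission).
    users = [matrix[j][j] / sum(matrix[i][j] for i in range(n))
             for j in range(n)]
    return overall, producers, users

# Two classes: cropland vs non-cropland (hypothetical counts).
m = [[80, 20],   # 80 cropland samples mapped correctly, 20 missed
     [10, 90]]   # 10 non-cropland samples mapped as cropland
overall, prod, user = accuracy_from_error_matrix(m)
print(overall)   # 0.85
```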
Data Publication: A Partnership between Scientists, Data Managers and Librarians
NASA Astrophysics Data System (ADS)
Raymond, L.; Chandler, C.; Lowry, R.; Urban, E.; Moncoiffe, G.; Pissierssens, P.; Norton, C.; Miller, H.
2012-04-01
Current literature on the topic of data publication suggests that success is best achieved when there is a partnership between scientists, data managers, and librarians. The Marine Biological Laboratory/Woods Hole Oceanographic Institution (MBLWHOI) Library and the Biological and Chemical Oceanography Data Management Office (BCO-DMO) have developed tools and processes to automate the ingestion of metadata from BCO-DMO for deposit with datasets into the Institutional Repository (IR) Woods Hole Open Access Server (WHOAS). The system also incorporates functionality for BCO-DMO to request a Digital Object Identifier (DOI) from the Library. This partnership allows the Library to work with a trusted data repository to ensure high-quality data, while the data repository utilizes library services and is assured of a permanently archived copy of the data extracted from the repository database. The assignment of persistent identifiers enables accurate data citation. The Library can assign a DOI to appropriate datasets deposited in WHOAS. A primary activity is working with authors to deposit datasets associated with published articles. The DOI would ideally be assigned before submission and be included in the published paper so readers can link directly to the dataset, but DOIs are also being assigned to datasets related to articles after publication. WHOAS metadata records link the article to the datasets and the datasets to the article. The assignment of DOIs has enabled another important collaboration with Elsevier, publisher of educational and professional science journals. Elsevier can now link from articles in the Science Direct database to the datasets available from WHOAS that are related to that article. The data associated with the article are freely available from WHOAS and accompanied by a Dublin Core metadata record.
In addition, the Library has worked with researchers to deposit datasets in WHOAS that are not appropriate for national, international, or domain-specific data repositories. These datasets currently include audio, text, and image files. This research is being conducted by a team of librarians, data managers, and scientists who are collaborating with representatives from the Scientific Committee on Oceanic Research (SCOR) and the International Oceanographic Data and Information Exchange (IODE) of the Intergovernmental Oceanographic Commission (IOC). The goal is to identify best practices for tracking data provenance and clearly attributing credit to data collectors/providers.
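As a rough illustration of the dataset-to-article linking described above, the sketch below builds a minimal Dublin Core record that ties a dataset DOI to a related article DOI using only the Python standard library. The element choices and both DOIs are invented for illustration; this is not the actual WHOAS schema.

```python
# Hypothetical sketch: a minimal Dublin Core record for a deposited dataset,
# with dc:identifier holding the dataset DOI and dc:relation pointing at the
# related article. All identifiers below are fake placeholders.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
for name, value in [
    ("title", "CTD profiles, hypothetical cruise AT-00"),
    ("creator", "Example Investigator"),
    ("identifier", "doi:10.0000/whoas.example.dataset"),   # dataset DOI (fake)
    ("relation", "doi:10.0000/journal.example.article"),   # related article (fake)
    ("type", "Dataset"),
]:
    el = ET.SubElement(record, f"{{{DC}}}{name}")
    el.text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```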
This EnviroAtlas dataset contains points depicting the location of market-based programs, referred to herein as markets, and projects addressing ecosystem services protection in the United States. The data were collected via surveys and desk research conducted by Forest Trends' Ecosystem Marketplace from 2008 to 2016 on biodiversity (i.e., imperiled species/habitats; wetlands and streams), carbon, and water markets. Additional biodiversity data were obtained from the Regulatory In-lieu Fee and Bank Information Tracking System (RIBITS) database in 2015. Points represent the centroids (i.e., center points) of market coverage areas, project footprints, or project primary impact areas in which ecosystem service markets or projects operate. National-level markets are an exception to this norm, with points representing administrative headquarters locations. Attribute data include information regarding the methodology, design, and development of biodiversity, carbon, and water markets and projects. This dataset was produced by Forest Trends' Ecosystem Marketplace for EnviroAtlas in order to support public access to and use of information related to environmental markets. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service.
NASA Astrophysics Data System (ADS)
Strawhacker, C.
2017-12-01
As a result of the 'open data' movement, how data should be attributed and cited has become increasingly important. As data are reused in analyses not performed by the original data creator, efforts have turned to crediting that creator through data citation and metrics of reuse that ensure appropriate attribution. This increased focus on metrics and citation, however, needs to be considered carefully when it comes to social science data, local observations, and Indigenous Knowledge held by Indigenous communities. These diverse and sometimes sensitive data/information/knowledge sets often require deep nuance, thought, and compromise within the 'open data' framework, in order to address the confidentiality of research subjects and the ownership of data and information, often in a colonial context. Furthermore, these datasets are often highly valuable to one or two villages, saving lives and retaining culture within them; in such cases quantitative metrics of "data reuse" and citation do not adequately measure a dataset's value. On this panel, I will provide examples of datasets that are highly valuable to small communities, drawn from my research in the Arctic and US Southwest. These datasets are not highly cited and do not have impressive quantitative metrics (e.g., number of downloads), but they have been incredibly valuable to the communities where the data/information/Knowledge are held. The cases include atlases of placenames held by elders in small Arctic communities, as well as databases of local observations of wildlife and sea ice in Alaska that are essential for sharing knowledge across multiple villages. These examples suggest that a more nuanced approach to understanding how data should be credited would be useful when working with social science data and Indigenous Knowledge.
2012-01-01
Background: Patients’ experiences have become central to assessing the performance of healthcare systems worldwide and are increasingly being used to inform quality improvement processes. This paper explores the relative value of surveys and detailed patient narratives in identifying priorities for improving breast cancer services as part of a quality improvement process. Methods: One dataset was collected using a narrative interview approach (n = 13) and the other using a postal survey (n = 82). Datasets were analyzed separately and then compared to determine whether similar priorities for improving patient experiences were identified. Results: There were both similarities and differences in the improvement priorities arising from each approach. Day surgery was specifically identified as a priority in the narrative dataset but included in the survey recommendations only as part of a broader priority around improving inpatient experience. Both datasets identified appointment systems, patients spending enough time with staff, information about treatment and side effects, and more information at the end of treatment as priorities. The specific priorities identified by the narrative interviews commonly related to ‘relational’ aspects of patient experience; those identified by the survey typically related to more ‘functional’ aspects and were not always sufficiently detailed to identify specific improvement actions. Conclusions: Our analysis suggests that whilst local survey data may act as a screening tool to identify potential problems within the breast cancer service, they do not always provide sufficient detail of what to do to improve that service. These findings may have wider applicability in other services. We recommend using an initial preliminary survey, with better use of survey open comments, followed by an in-depth qualitative analysis to help deliver improvements to relational and functional aspects of patient experience. PMID:22913525
Watershed Boundary Dataset for Mississippi
Wilson, K. Van; Clair, Michael G.; Turnipseed, D. Phil; Rebich, Richard A.
2009-01-01
The U.S. Geological Survey, in cooperation with the Mississippi Department of Environmental Quality, U.S. Department of Agriculture-Natural Resources Conservation Service, Mississippi Department of Transportation, U.S. Department of Agriculture-Forest Service, and the Mississippi Automated Resource Information System developed a 1:24,000-scale Watershed Boundary Dataset for Mississippi including watershed and subwatershed boundaries, codes, names, and areas. The Watershed Boundary Dataset for Mississippi provides a standard geographical framework for water-resources and selected land-resources planning. The original 8-digit subbasins (Hydrologic Unit Codes) were further subdivided into 10-digit watersheds (62.5 to 391 square miles (mi2)) and 12-digit subwatersheds (15.6 to 62.5 mi2) - the exceptions being the Delta part of Mississippi and the Mississippi River inside levees, which were subdivided into 10-digit watersheds only. Also, large water bodies in the Mississippi Sound along the coast were not delineated as small as a typical 12-digit subwatershed. All of the data - including watershed and subwatershed boundaries, subdivision codes and names, and drainage-area data - are stored in a Geographic Information System database, which is available at: http://ms.water.usgs.gov/. This map shows information on drainage and hydrography in the form of U.S. Geological Survey hydrologic unit boundaries for water-resource 2-digit regions, 4-digit subregions, 6-digit basins (formerly called accounting units), 8-digit subbasins (formerly called cataloging units), 10-digit watersheds, and 12-digit subwatersheds in Mississippi. A description of the project study area, methods used in the development of watershed and subwatershed boundaries for Mississippi, and results are presented in Wilson and others (2008). The data presented in this map and by Wilson and others (2008) supersede the data presented for Mississippi by Seaber and others (1987) and U.S. Geological Survey (1977).
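The nesting of hydrologic unit codes described above (2-digit regions down to 12-digit subwatersheds) is a simple prefix hierarchy: each level's code is a prefix of the levels below it. A minimal sketch, with a made-up example code:

```python
# Sketch of the hydrologic unit code (HUC) hierarchy: a 12-digit
# subwatershed code contains its parent units as digit prefixes.
# The example code below is invented for illustration.

LEVELS = [
    (2, "region"),
    (4, "subregion"),
    (6, "basin"),         # formerly called accounting units
    (8, "subbasin"),      # formerly called cataloging units
    (10, "watershed"),
    (12, "subwatershed"),
]

def huc_hierarchy(huc12):
    """Return the chain of parent hydrologic unit codes for a 12-digit code."""
    if len(huc12) != 12 or not huc12.isdigit():
        raise ValueError("expected a 12-digit hydrologic unit code")
    return {name: huc12[:n] for n, name in LEVELS}

print(huc_hierarchy("031800020504"))
```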
Web mapping system for complex processing and visualization of environmental geospatial datasets
NASA Astrophysics Data System (ADS)
Titov, Alexander; Gordov, Evgeny; Okladnikov, Igor
2016-04-01
Environmental geospatial datasets (meteorological observations, modeling and reanalysis results, etc.) are used in numerous research applications. Due to a number of objective reasons, such as the inherent heterogeneity of environmental datasets, large dataset volumes, the complexity of the data models used, and syntactic and semantic differences that complicate the creation and use of unified terminology, the development of environmental geodata access, processing, and visualization services, as well as client applications, turns out to be quite a sophisticated task. According to general INSPIRE requirements for data visualization, geoportal web applications have to provide such standard functionality as data overview, image navigation, scrolling, scaling, and graphical overlay, along with displaying map legends and corresponding metadata information. It should be noted that modern web mapping systems, as integrated geoportal applications, are developed based on a service-oriented architecture (SOA) and might be considered as complexes of interconnected software tools for working with geospatial data. In this report a complex web mapping system is presented, including a GIS web client and corresponding OGC services for working with a geospatial (NetCDF, PostGIS) dataset archive. The GIS web client comprises three basic tiers:
1. A metadata tier of geospatial metadata retrieved from a central MySQL repository and represented in JSON format.
2. A tier of JavaScript objects implementing methods for handling NetCDF metadata, the task XML object (for configuring user calculations and input and output formats), and OGC WMS/WFS cartographical services.
3. A graphical user interface (GUI) tier of JavaScript objects realizing the web application business logic.
The metadata tier consists of a number of JSON objects containing technical information describing the geospatial datasets (such as spatio-temporal resolution, meteorological parameters, valid processing methods, etc.).
The middleware tier of JavaScript objects, implementing methods for handling geospatial metadata, the task XML object, and WMS/WFS cartographical services, interconnects the metadata and GUI tiers. The methods include such procedures as downloading and updating JSON metadata, launching and tracking calculation tasks running on remote servers, and working with WMS/WFS cartographical services: obtaining the list of available layers, visualizing layers on the map, and exporting layers in graphical (PNG, JPG, GeoTIFF), vector (KML, GML, Shape), and digital (NetCDF) formats. The graphical user interface tier is based on a bundle of JavaScript libraries (OpenLayers, GeoExt and ExtJS) and represents a set of software components implementing the web mapping application business logic (complex menus, toolbars, wizards, event handlers, etc.). The GUI provides two basic capabilities for the end user: configuring the task XML object and visualizing cartographical information. The web interface developed is similar to the interfaces of popular desktop GIS applications such as uDig and QuantumGIS. The web mapping system developed has shown its effectiveness in solving real climate change research problems and disseminating investigation results in cartographical form. The work is supported by SB RAS Basic Program Projects VIII.80.2.1 and IV.38.1.7.
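As a loose sketch of the metadata tier described above, the JSON record and helper below show how the valid processing methods for a chosen parameter might be looked up. All field names and values are invented for illustration; the abstract does not publish the actual repository schema.

```python
# Hypothetical sketch of a metadata-tier JSON record describing one
# geospatial dataset (spatio-temporal resolution, parameters, and the
# processing methods valid for each parameter). Schema is invented.
import json

metadata_json = """
{
  "dataset": "reanalysis_subset_example",
  "spatial_resolution_deg": 0.75,
  "temporal_resolution": "6h",
  "parameters": {
    "t2m": {"label": "2 m air temperature", "methods": ["mean", "trend", "anomaly"]},
    "tp":  {"label": "total precipitation", "methods": ["sum", "anomaly"]}
  }
}
"""

def valid_methods(meta, parameter):
    """List the processing methods allowed for one meteorological parameter."""
    return meta["parameters"][parameter]["methods"]

meta = json.loads(metadata_json)
print(valid_methods(meta, "tp"))   # ['sum', 'anomaly']
```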
Revisiting Hansen Solubility Parameters by Including Thermodynamics.
Louwerse, Manuel J; Maldonado, Ana; Rousseau, Simon; Moreau-Masselon, Chloe; Roux, Bernard; Rothenberg, Gadi
2017-11-03
The Hansen solubility parameter approach is revisited by implementing the thermodynamics of dissolution and mixing. Hansen's pragmatic approach has earned its spurs in predicting solvents for polymer solutions, but for molecular solutes improvements are needed. By going into the details of entropy and enthalpy, several corrections are suggested that make the methodology thermodynamically sound without losing its ease of use. The most important corrections include accounting for the size of the solvent molecules, the destruction of the solid's crystal structure, and the specificity of hydrogen-bonding interactions, as well as opportunities to predict solubility at extrapolated temperatures. When the original and improved methods were tested on a large industrial dataset including solvent blends, the fit quality improved from 0.89 to 0.97 and the percentage of correct predictions rose from 54% to 78%. Full Matlab scripts are included in the Supporting Information, allowing readers to apply these improvements to their own datasets. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
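For context, the classical Hansen scheme that the paper refines scores a solvent-solute pair by the distance Ra between their dispersion, polar, and hydrogen-bonding parameters, and by the relative energy difference RED = Ra/R0, where R0 is the solute's interaction radius. A minimal sketch with illustrative parameter values:

```python
# Classical Hansen solubility distance (the baseline the paper improves on):
# Ra^2 = 4(dD1-dD2)^2 + (dP1-dP2)^2 + (dH1-dH2)^2, all in MPa^0.5.
# The parameter values and radius below are illustrative, not from the paper.
import math

def hansen_distance(d1, p1, h1, d2, p2, h2):
    """Distance Ra between two sets of Hansen parameters (D, P, H)."""
    return math.sqrt(4 * (d1 - d2) ** 2 + (p1 - p2) ** 2 + (h1 - h2) ** 2)

def red(ra, r0):
    """Relative energy difference; RED < 1 predicts a good solvent."""
    return ra / r0

ra = hansen_distance(18.0, 3.0, 4.0, 16.0, 2.0, 3.0)  # hypothetical pair
print(round(ra, 3), red(ra, 8.0) < 1.0)
```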
User Guidelines for the Brassica Database: BRAD.
Wang, Xiaobo; Cheng, Feng; Wang, Xiaowu
2016-01-01
The genome sequence of Brassica rapa was first released in 2011. Since then, further Brassica genomes have been sequenced or are undergoing sequencing. It is therefore necessary to develop tools that help users mine information from genomic data efficiently. This will greatly aid scientific exploration and breeding applications, especially for those with little bioinformatic training. The Brassica database (BRAD) was therefore built to collect, integrate, illustrate, and visualize Brassica genomic datasets. BRAD provides useful searching and data mining tools, and facilitates the search of gene annotation datasets, syntenic or non-syntenic orthologs, and flanking regions of functional genomic elements. It also includes genome-analysis tools such as BLAST and GBrowse. One of the important aims of BRAD is to build a bridge between Brassica crop genomes and the genome of the model species Arabidopsis thaliana, thus transferring the bulk of A. thaliana gene-study information for use with newly sequenced Brassica crops.
Kernel-aligned multi-view canonical correlation analysis for image recognition
NASA Astrophysics Data System (ADS)
Su, Shuzhi; Ge, Hongwei; Yuan, Yun-Hao
2016-09-01
Existing kernel-based correlation analysis methods mainly adopt a single kernel in each view. However, a single kernel is usually insufficient to characterize the nonlinear distribution information of a view. To solve this problem, we transform each original feature vector into a 2-dimensional feature matrix by means of kernel alignment, and then propose a novel kernel-aligned multi-view canonical correlation analysis (KAMCCA) method on the basis of the feature matrices. Our proposed method can simultaneously employ multiple kernels to better capture the nonlinear distribution information of each view, so that the correlation features learned by KAMCCA have good discriminating power in real-world image recognition. Extensive experiments are designed on five real-world image datasets, including NIR face images, thermal face images, visible face images, handwritten digit images, and object images. Promising experimental results on these datasets demonstrate the effectiveness of our proposed method.
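As a loose illustration of the idea (not the paper's exact construction), a sample can be represented by its evaluations under several kernels against a set of anchor points, yielding a 2-D feature matrix (kernels x anchors) rather than a single kernel row. All data and kernel choices below are made up:

```python
# Illustrative sketch only: stacking multiple kernel evaluations of one
# sample into a 2-D feature matrix, loosely mirroring the multi-kernel
# motivation behind KAMCCA. Anchors, kernels, and data are invented.
import math

def linear_k(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf_k(x, y, gamma=0.5):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def kernel_feature_matrix(x, anchors, kernels):
    """Rows index kernels; columns index anchor points."""
    return [[k(x, a) for a in anchors] for k in kernels]

anchors = [(0.0, 0.0), (1.0, 1.0)]
M = kernel_feature_matrix((1.0, 0.0), anchors, [linear_k, rbf_k])
print(M)
```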
Persistent identifiers for CMIP6 data in the Earth System Grid Federation
NASA Astrophysics Data System (ADS)
Buurman, Merret; Weigel, Tobias; Juckes, Martin; Lautenschlager, Michael; Kindermann, Stephan
2016-04-01
The Earth System Grid Federation (ESGF) is a distributed data infrastructure that will provide access to the CMIP6 experiment data. The data consist of thousands of datasets composed of millions of files. Over the course of the CMIP6 operational phase, datasets may be retracted and replaced by newer versions that consist of completely or partly new files. Each dataset is hosted at a single data centre, but can have one or several backups (replicas) at other data centres. To keep track of the different data entities and relationships between them, to ensure their consistency and improve exchange of information about them, Persistent Identifiers (PIDs) are used. These are unique identifiers that are registered at a globally accessible server, along with some metadata (the PID record). While usually providing access to the data object they refer to, as long as it exists, the metadata record will remain available even beyond the object's lifetime. Besides providing access to data and metadata, PIDs will allow scientists to communicate effectively and at a fine granularity about CMIP6 data. The initiative to introduce PIDs in the ESGF infrastructure has been described and agreed upon through a series of white papers governed by the WGCM Infrastructure Panel (WIP). In CMIP6, each dataset and each file is assigned a PID that keeps track of the data object's physical copies throughout the object lifetime. In addition to this, its relationship with other data objects is stored in the PID record. A human-readable version of this information is available on an information page that is also linked in the PID record. A possible application that exploits the information available from the PID records is a smart information tool, which a scientific user can call to find out whether his/her version has been replaced by a new one, to view and browse the related datasets and files, and to get access to the various copies or to additional metadata on a dedicated website.
The PID registration process is embedded in the ESGF data publication process. During their first publication, the PID records are populated with metadata including the parent dataset(s), other existing versions, and physical location. Every subsequent publication, un-publication, or replica publication of a dataset or file then updates the PID records to keep track of changing physical locations of the data (or lack thereof) and of reported errors in the data. Assembling the metadata records and registering the PIDs on a central server is a potential performance bottleneck, as millions of data objects may be published in a short timeframe when the CMIP6 experiment phase begins. For this reason, the PID registration and metadata update tasks are pushed to a message queueing system that facilitates high availability and scalability, and are then processed asynchronously. This will lead to a slight delay in PID registration but will avoid blocking resources at the data centres and slowing down the publication of the data so eagerly awaited by the scientists.
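The queue-based registration flow described above might be sketched as follows. The record fields and handle prefix are invented for illustration, and a real deployment would use the Handle infrastructure and a proper message broker rather than an in-process queue:

```python
# Hedged sketch: publication pushes PID registration tasks onto a queue;
# a consumer drains the queue asynchronously and updates a registry, so
# publication itself never blocks on the central PID server. The handle
# prefix "21.TEST" and the record fields are made-up placeholders.
import queue

pid_registry = {}          # stands in for the central PID server
tasks = queue.Queue()      # stands in for the message queueing system

def publish(dataset_id, version, location):
    """Queue a registration/update task instead of calling the server inline."""
    tasks.put({"handle": f"hdl:21.TEST/{dataset_id}.v{version}",
               "locations": [location]})

def worker():
    # In production this runs on separate consumers; here we drain inline.
    while not tasks.empty():
        rec = tasks.get()
        existing = pid_registry.get(rec["handle"])
        if existing:                       # replica publication: add location
            existing["locations"].extend(rec["locations"])
        else:                              # first publication: create record
            pid_registry[rec["handle"]] = rec

publish("cmip6.model-x.tas", 1, "esgf-node-a.example.org")
publish("cmip6.model-x.tas", 1, "esgf-node-b.example.org")  # replica copy
worker()
print(pid_registry["hdl:21.TEST/cmip6.model-x.tas.v1"]["locations"])
```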
Santa Margarita Estuary Water Quality Monitoring Data
2018-02-01
ADMINISTRATIVE INFORMATION The work described in this report was performed for the Water Quality Section of the Environmental Security Marine Corps Base...water quality model calibration given interest and the necessary resources. The dataset should also inform the stakeholders and Regional Board on...period. Several additional ancillary datasets were collected during the monitoring timeframe that provide key information though they were not collected
Liang, Li-Jung; Weiss, Robert E; Redelings, Benjamin; Suchard, Marc A
2009-10-01
Statistical analyses of phylogenetic data culminate in uncertain estimates of underlying model parameters. Lack of additional data hinders the ability to reduce this uncertainty, as the original phylogenetic dataset is often complete, containing the entire gene or genome information available for the given set of taxa. Informative priors in a Bayesian analysis can reduce posterior uncertainty; however, publicly available phylogenetic software specifies vague priors for model parameters by default. We build objective and informative priors using hierarchical random effect models that combine additional datasets whose parameters are not of direct interest but are similar to the analysis of interest. We propose principled statistical methods that permit more precise parameter estimates in phylogenetic analyses by creating informative priors for parameters of interest. Using additional sequence datasets from our lab or public databases, we construct a fully Bayesian semiparametric hierarchical model to combine datasets. A dynamic iteratively reweighted Markov chain Monte Carlo algorithm conveniently recycles posterior samples from the individual analyses. We demonstrate the value of our approach by examining the insertion-deletion (indel) process in the enolase gene across the Tree of Life using the phylogenetic software BALI-PHY; we incorporate prior information about indels from 82 curated alignments downloaded from the BAliBASE database.
GIS Representation of Coal-Bearing Areas in Africa
Merrill, Matthew D.; Tewalt, Susan J.
2008-01-01
The African continent contains approximately 5 percent of the world's proven recoverable reserves of coal (World Energy Council, 2007). Energy consumption in Africa is projected to grow at an annual rate of 2.3 percent from 2004 through 2030, while average consumption in first-world nations is expected to rise at 1.4 percent annually (Energy Information Administration, 2007). Coal reserves will undoubtedly continue to be part of Africa's energy portfolio as it grows in the future. A review of academic and industrial literature indicates that 27 nations in Africa contain coal-bearing rock. South Africa accounts for 96 percent of Africa's total proven recoverable coal reserves, ranking it sixth in the world. This report is a digital compilation of information on Africa's coal-bearing geology found in the literature and is intended to be used in small-scale spatial investigations in a Geographic Information System (GIS) and as a visual aid for the discussion of Africa's coal resources. Many maps of African coal resources include points for mine locations or regional-scale polygons with generalized borders depicting basin edges. Point locations are detailed but provide no information regarding extent, and generalized polygons do not have sufficient detail. In this dataset, the polygons are representative of the actual coal-bearing lithology both in location and regional extent. Existing U.S. Geological Survey (USGS) digital geology datasets provide the majority of the base geologic polygons. Polygons for the coal-bearing localities were clipped from the base geology that represented the age and extent of the coal deposit as indicated in the literature. Where the 1:5,000,000-scale geology base layer's ages conflicted with those in the publications, polygons were generated directly from the regional African coal maps (approximately 1:500,000 scale) in the published material.
In these cases, coal-bearing polygons were clipped to the literature's indicated coal extent, without regard to the underlying geology base or topographic constraints. Indication of the presence of African coal is based on multiple sources. However, the quality of the sources varies and there is often disagreement in the literature. This dataset includes the rank, age, and location of coal in Africa as well as the detailed source information responsible for each coal-bearing polygon. The dataset is not appropriate for use in resource assessments of any kind. Attributes necessary for such tasks (number of coal seams, seam thickness, and depth to coal) are rarely provided in the literature and accordingly are not represented in this dataset. Small-scale investigations, representations, and display uses are most appropriate for this product. This product is the first to show coal distribution as bounded by actual geologic contacts for the entire African continent. In addition to the spatial component of this dataset, complete references to source material are provided for each polygon, making this product a useful first-step resource in African coal research. Greater detail regarding the creation of this dataset as well as the sources used is provided in the metadata file for the Africa_coal.shp file.
2012-01-01
Background: ChIP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, detecting allelic imbalance from a single ChIP-seq dataset often has low statistical power since only sequence reads mapped to heterozygote SNPs are informative for discriminating two alleles. Results: We develop a new method, iASeq, to address this issue by jointly analyzing multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve detection of allelic imbalance. Application of iASeq to 77 ChIP-seq samples from 40 ENCODE datasets and 1 genomic DNA sample in GM12878 cells reveals that the allele-specificities of multiple proteins are highly correlated, and demonstrates the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately. Conclusions: iASeq illustrates the value of integrating multiple datasets in allele-specificity inference and offers a new tool to better analyze ASB. PMID:23194258
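As a point of reference for the single-dataset analyses that iASeq improves upon, allelic imbalance at one heterozygous SNP is often screened with an exact binomial test on the reads supporting each allele. A minimal sketch, illustrative only and not the iASeq hierarchical model itself:

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def allelic_imbalance_pvalue(ref_reads, alt_reads):
    """Two-sided exact binomial test for allele-specific binding at one
    heterozygous SNP: under the null, reads come 50/50 from each allele."""
    n = ref_reads + alt_reads
    p_obs = binom_pmf(ref_reads, n)
    # Sum probabilities of all outcomes at least as extreme as the observed one.
    return sum(binom_pmf(k, n) for k in range(n + 1) if binom_pmf(k, n) <= p_obs)

# Example: 18 reads support the reference allele vs 2 for the alternative.
p = allelic_imbalance_pvalue(18, 2)
```

With few informative reads this test has little power, which is precisely the motivation for borrowing strength across datasets as iASeq does.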
2016-01-01
Background: Metabarcoding is becoming a common tool used to assess and compare diversity of organisms in environmental samples. Identification of operational taxonomic units (OTUs) is one of the critical steps in the process, and several taxonomy assignment methods have been proposed to accomplish this task. This publication evaluates the quality of reference datasets, along with several alignment and phylogeny inference methods used in one such taxonomy assignment method, the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. New information: In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. It therefore requires a high-quality reference dataset. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. First, curated collections of genetic information do include erroneous sequences, which have a detrimental effect on the resolution of the cladograms used in the tree-based approach; they must be identified and excluded from the reference dataset beforehand. Second, various combinations of multiple sequence alignment and phylogeny inference methods produce cladograms with different topology and bootstrap support; these combinations need to be tested to determine the one that gives the highest resolution for the particular reference dataset. Completing these preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach. PMID:27932919
The observed clustering of damaging extra-tropical cyclones in Europe
NASA Astrophysics Data System (ADS)
Cusack, S.
2015-12-01
The clustering of severe European windstorms on annual timescales has substantial impacts on the re/insurance industry. Management of the risk is impaired by large uncertainties in estimates of clustering from historical storm datasets typically covering the past few decades. The uncertainties are unusually large because clustering depends on the variance of storm counts. Eight storm datasets are gathered for analysis in this study in order to reduce these uncertainties. Six of the datasets contain more than 100 years of severe storm information to reduce sampling errors, and the diversity of information sources and analysis methods between datasets samples observational errors. All storm severity measures used in this study reflect damage, to suit re/insurance applications. It is found that the shortest storm dataset, 42 years in length, provides estimates of clustering with very large sampling and observational errors. The dataset does provide some useful information: indications of stronger clustering for more severe storms, particularly for southern countries off the main storm track. However, substantially different results are produced by removal of one stormy season, 1989/1990, which illustrates the large uncertainties from a 42-year dataset. The extended storm records place 1989/1990 into a much longer historical context to produce more robust estimates of clustering. All the extended storm datasets show a greater degree of clustering with increasing storm severity and suggest that clustering of severe storms is much more material than that of weaker storms. Further, they contain signs of stronger clustering in areas off the main storm track, and weaker clustering for smaller-sized areas, though these signals are smaller than the uncertainties in the actual values. Both the improvement of existing storm records and the development of new historical storm datasets would help to improve management of this risk.
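The dependence of clustering on the variance of storm counts can be made concrete with the variance-to-mean (dispersion) ratio of annual counts, which equals 1 for a Poisson process and exceeds 1 when storms cluster. A minimal sketch with invented counts; the study's severity-weighted measures are more elaborate:

```python
def dispersion_ratio(annual_counts):
    """Variance-to-mean ratio of annual storm counts.
    Equals 1.0 for a Poisson process; values above 1 indicate clustering
    (overdispersion), values below 1 indicate regularity."""
    n = len(annual_counts)
    mean = sum(annual_counts) / n
    var = sum((c - mean) ** 2 for c in annual_counts) / (n - 1)  # sample variance
    return var / mean

# Clustered record: quiet years punctuated by very stormy seasons.
clustered = [0, 0, 1, 0, 5, 0, 1, 0, 6, 0]
# Regular record with the same total number of storms.
regular = [1, 1, 2, 1, 1, 2, 1, 2, 1, 1]
```

Removing one extreme season from a short record changes the variance, and hence this ratio, drastically, which is the sampling-error problem the longer records address.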
An, Ji-Yong; Meng, Fan-Rong; You, Zhu-Hong; Chen, Xing; Yan, Gui-Ying; Hu, Ji-Pu
2016-10-01
Predicting protein-protein interactions (PPIs) is a challenging task and essential for constructing protein interaction networks, which are important for facilitating our understanding of the mechanisms of biological systems. Although a number of high-throughput technologies have been proposed to predict PPIs, they have unavoidable shortcomings, including high cost, time intensity, and inherently high false positive rates. For these reasons, many computational methods have been proposed for predicting PPIs. However, the problem is still far from being solved. In this article, we propose a novel computational method called RVM-BiGP that combines the relevance vector machine (RVM) model and Bi-gram Probabilities (BiGP) for PPIs detection from protein sequences. The major improvements include: (1) protein sequences are represented using the Bi-gram Probabilities (BiGP) feature representation on a Position Specific Scoring Matrix (PSSM), in which the protein evolutionary information is contained; (2) to reduce the influence of noise, the Principal Component Analysis (PCA) method is used to reduce the dimension of the BiGP vector; (3) the powerful and robust Relevance Vector Machine (RVM) algorithm is used for classification. Five-fold cross-validation experiments were executed on yeast and Helicobacter pylori datasets, achieving very high accuracies of 94.57% and 90.57%, respectively. These results are significantly better than those of previous methods. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the yeast dataset. The experimental results demonstrate that our RVM-BiGP method is significantly better than the SVM-based method. In addition, we achieved 97.15% accuracy on an imbalanced yeast dataset, which is higher than on the balanced yeast dataset.
The promising experimental results show the efficiency and robustness of the proposed method, which can serve as an automatic decision support tool for future proteomics research. To facilitate extensive future proteomics studies, we developed a freely available web server called RVM-BiGP-PPIs in Hypertext Preprocessor (PHP) for predicting PPIs. The web server, including source code and the datasets, is available at http://219.219.62.123:8888/BiGP/. © 2016 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
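The bi-gram idea can be illustrated as follows: given a PSSM of per-position amino-acid probabilities, the feature for an ordered pair (a, b) accumulates the probability of a at one position times the probability of b at the next, giving a fixed 400-dimensional vector regardless of sequence length. A hedged sketch; the function name and toy PSSM are illustrative, and the published pipeline additionally applies PCA and an RVM classifier:

```python
def bigram_features(pssm):
    """Bi-gram probability features from a PSSM.
    pssm: list of rows, one per sequence position, each a length-20 list of
    amino-acid probabilities. Returns a flat 400-dimensional vector where
    entry 20*a + b accumulates P(residue a at p) * P(residue b at p+1)."""
    n_aa = 20
    feats = [0.0] * (n_aa * n_aa)
    for p in range(len(pssm) - 1):
        row, nxt = pssm[p], pssm[p + 1]
        for a in range(n_aa):
            if row[a] == 0.0:
                continue  # skip zero entries for speed
            for b in range(n_aa):
                feats[a * n_aa + b] += row[a] * nxt[b]
    return feats

# Toy 3-residue "protein": each row is a probability distribution over 20 AAs.
toy_pssm = [
    [1.0] + [0.0] * 19,        # position 1: certainly amino acid 0
    [0.0, 1.0] + [0.0] * 18,   # position 2: certainly amino acid 1
    [0.5, 0.5] + [0.0] * 18,   # position 3: split between amino acids 0 and 1
]
v = bigram_features(toy_pssm)
```

Because the vector length is fixed at 400, proteins of any length map to the same feature space, which is what makes downstream PCA and classification straightforward.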
Agriculture-driven deforestation in the tropics from 1990-2015: emissions, trends and uncertainties
NASA Astrophysics Data System (ADS)
Carter, Sarah; Herold, Martin; Avitabile, Valerio; de Bruin, Sytze; De Sy, Veronique; Kooistra, Lammert; Rufino, Mariana C.
2018-01-01
Limited data exist on emissions from agriculture-driven deforestation, and available data are typically uncertain. In this paper, we provide comparable estimates of emissions from both all deforestation and agriculture-driven deforestation, with uncertainties, for 91 countries across the tropics between 1990 and 2015. Uncertainties associated with input datasets (activity data and emissions factors) were used to combine the datasets, such that the most certain datasets contribute the most. This method utilizes all the input data while minimizing the uncertainty of the emissions estimate. The uncertainty of input datasets was influenced by the quality of the data, the sample size (for sample-based datasets), and the extent to which the timeframe of the data matches the period of interest. The area of deforestation and the agriculture-driver factor (the extent to which agriculture drives deforestation) were the most uncertain components of the emissions estimates; improvement in the uncertainties related to these estimates will therefore provide the greatest reductions in the uncertainties of emissions estimates. Over the period of the study, Latin America had the highest proportion of deforestation driven by agriculture (78%), and Africa had the lowest (62%). Latin America had the highest emissions from agriculture-driven deforestation, and these peaked at 974 ± 148 Mt CO2 yr-1 in 2000-2005. Africa saw a continuous increase in emissions between 1990 and 2015 (from 154 ± 21 to 412 ± 75 Mt CO2 yr-1), so mitigation initiatives could be prioritized there. Uncertainties for emissions from agriculture-driven deforestation are ± 62.4% (average over 1990-2015); uncertainties were highest in Asia and lowest in Latin America. Uncertainty information is crucial for transparency when reporting, and gives credibility to related mitigation initiatives.
We demonstrate that uncertainty data can also be useful when combining multiple open datasets, so we recommend that new data providers include this information.
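The weighting principle described above, where the most certain datasets contribute the most, can be sketched with standard inverse-variance weighting. The numbers below are hypothetical and not taken from the study:

```python
def combine_estimates(estimates):
    """Inverse-variance weighted mean of independent estimates.
    estimates: list of (value, standard_error) pairs.
    Returns (combined_value, combined_standard_error); inputs with smaller
    standard errors receive proportionally larger weights."""
    weights = [1.0 / se**2 for _, se in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, (1.0 / total) ** 0.5

# Hypothetical emission estimates (Mt CO2/yr) from two input datasets.
combined, se = combine_estimates([(400.0, 50.0), (500.0, 100.0)])
```

The combined value lands closer to the more certain input, and the combined standard error is smaller than either input's, which is why using all the data minimizes the uncertainty of the final estimate.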
NASA Technical Reports Server (NTRS)
Case, Jonathan L.; Kumar, Sujay V.; Kuligowski, Robert J.; Langston, Carrie
2013-01-01
The NASA Short-term Prediction Research and Transition (SPoRT) Center in Huntsville, AL is running a real-time configuration of the NASA Land Information System (LIS) with the Noah land surface model (LSM). Output from the SPoRT-LIS run is used to initialize land surface variables for local modeling applications at select National Weather Service (NWS) partner offices, and can be displayed in decision support systems for situational awareness and drought monitoring. The SPoRT-LIS is run over a domain covering the southern and eastern United States, fully nested within the National Centers for Environmental Prediction Stage IV precipitation analysis grid, which provides precipitation forcing to the offline LIS-Noah runs. The SPoRT Center seeks to expand the real-time LIS domain to the entire Continental U.S. (CONUS); however, geographical limitations with the Stage IV analysis product have inhibited this expansion. Therefore, a goal of this study is to test alternative precipitation forcing datasets that can enable the LIS expansion by improving upon the current geographical limitations of the Stage IV product. The four precipitation forcing datasets that are inter-compared on a 4-km resolution CONUS domain include the Stage IV, an experimental GOES quantitative precipitation estimate (QPE) from NESDIS/STAR, the National Mosaic and QPE (NMQ) product from the National Severe Storms Laboratory, and the North American Land Data Assimilation System phase 2 (NLDAS-2) analyses. The NLDAS-2 dataset is used as the control run, with each of the other three datasets considered experimental runs compared against the control. The regional strengths, weaknesses, and biases of each precipitation analysis are identified relative to the NLDAS-2 control in terms of accumulated precipitation pattern and amount, and the impacts on the subsequent LSM spin-up simulations.
The ultimate goal is to identify an alternative precipitation forcing dataset that can best support an expansion of the real-time SPoRT-LIS to a domain covering the entire CONUS.
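At its simplest, an inter-comparison against a control analysis of the kind described above reduces to grid-wide bias and RMSE statistics over points covered by both products. A toy sketch with invented values, not SPoRT-LIS code:

```python
def bias_and_rmse(candidate, control):
    """Mean bias and RMSE of a candidate precipitation grid against a
    control analysis, skipping points missing in either grid (None)."""
    diffs = [c - r for c, r in zip(candidate, control)
             if c is not None and r is not None]
    n = len(diffs)
    bias = sum(diffs) / n
    rmse = (sum(d * d for d in diffs) / n) ** 0.5
    return bias, rmse

# Flattened toy grids of accumulated precipitation (mm); None = no coverage.
control   = [10.0, 20.0, 5.0, None, 15.0]
candidate = [12.0, 18.0, 5.0, 7.0,  None]
b, r = bias_and_rmse(candidate, control)
```

Handling the missing-coverage points explicitly matters here, since differing geographical coverage is exactly the limitation that motivates the comparison.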
Weaver, J. Curtis
2006-01-01
A study of annual maximum precipitation frequency in Mecklenburg County, North Carolina, was conducted to characterize the frequency of precipitation at sites having at least 10 years of precipitation record. Precipitation-frequency studies provide information about the occurrence of precipitation amounts for given durations (for example, 1 hour or 24 hours) that can be expected to occur within a specified recurrence interval (expressed in years). In this study, annual maximum precipitation totals were determined for durations of 15 and 30 minutes; 1, 2, 3, 6, 12, and 24 hours; and for recurrence intervals of 2, 5, 10, 25, 50, 100, and 500 years. Precipitation data collected by the U.S. Geological Survey network of raingages in the city of Charlotte and Mecklenburg County were analyzed for this study. In September 2004, more than 70 precipitation sites were in operation; 27 of these sites had at least 10 years of record, which is the minimum record typically required in frequency studies. Missing record at one site, however, resulted in its removal from the dataset. Two datasets--the Charlotte Raingage Network (CRN) initial and CRN modified datasets--were developed from the U.S. Geological Survey data, which represented relatively short periods of record (10 and 11 years). The CRN initial dataset included very high precipitation totals from two storms that caused severe flooding in areas of the city and county in August 1995 and July 1997, which could significantly influence the statistical results. The CRN modified dataset excluded the highest precipitation totals from these two storms but included the second highest totals.
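Recurrence-interval estimates of this kind are often derived by fitting an extreme-value distribution to the annual maxima. A minimal Gumbel method-of-moments sketch with hypothetical data; the study's actual fitting procedure may differ:

```python
import math

def gumbel_return_level(annual_maxima, T):
    """T-year return level from annual maxima via a method-of-moments
    Gumbel fit: x_T = mu - beta * ln(-ln(1 - 1/T))."""
    n = len(annual_maxima)
    mean = sum(annual_maxima) / n
    sd = (sum((x - mean) ** 2 for x in annual_maxima) / (n - 1)) ** 0.5
    beta = sd * math.sqrt(6) / math.pi       # scale parameter
    mu = mean - 0.5772156649 * beta          # location (Euler-Mascheroni constant)
    return mu - beta * math.log(-math.log(1 - 1.0 / T))

# Hypothetical 24-hour annual maxima (inches) for a short raingage record.
maxima = [2.1, 1.8, 3.5, 2.4, 2.0, 4.2, 1.9, 2.6, 3.1, 2.2]
x100 = gumbel_return_level(maxima, 100)
```

With only 10 years of record, the 100-year return level is an extrapolation well beyond the data, which is why one or two extreme storms (such as August 1995 and July 1997 here) can dominate the statistical results.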
PANTHER. Pattern ANalytics To support High-performance Exploitation and Reasoning.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Czuchlewski, Kristina Rodriguez; Hart, William E.
Sandia has approached the analysis of big datasets with an integrated methodology that uses computer science, image processing, and human factors to exploit critical patterns and relationships in large datasets despite the variety and rapidity of information. The work is part of a three-year LDRD Grand Challenge called PANTHER (Pattern ANalytics To support High-performance Exploitation and Reasoning). To maximize data analysis capability, Sandia pursued scientific advances across three key technical domains: (1) geospatial-temporal feature extraction via image segmentation and classification; (2) geospatial-temporal analysis capabilities tailored to identify and process new signatures more efficiently; and (3) domain-relevant models of human perception and cognition informing the design of analytic systems. Our integrated results include advances in geographical information systems (GIS) in which we discover activity patterns in noisy, spatial-temporal datasets using geospatial-temporal semantic graphs. We employed computational geometry and machine learning to allow us to extract and predict spatial-temporal patterns and outliers from large aircraft and maritime trajectory datasets. We automatically extracted static and ephemeral features from real, noisy synthetic aperture radar imagery for ingestion into a geospatial-temporal semantic graph. We worked with analysts and investigated analytic workflows to (1) determine how experiential knowledge evolves and is deployed in high-demand, high-throughput visual search workflows, and (2) better understand visual search performance and attention. Through PANTHER, Sandia's fundamental rethinking of key aspects of geospatial data analysis permits the extraction of much richer information from large amounts of data.
The project results enable analysts to examine mountains of historical and current data that would otherwise go untouched, while also gaining meaningful, measurable, and defensible insights into overlooked relationships and patterns. The capability is directly relevant to the nation's nonproliferation remote-sensing activities and has broad national security applications for military and intelligence-gathering organizations.
SPAR: small RNA-seq portal for analysis of sequencing experiments.
Kuksa, Pavel P; Amlie-Wolf, Alexandre; Katanic, Živadin; Valladares, Otto; Wang, Li-San; Leung, Yuk Yee
2018-05-04
The introduction of new high-throughput small RNA sequencing protocols that generate large-scale genomics datasets along with increasing evidence of the significant regulatory roles of small non-coding RNAs (sncRNAs) have highlighted the urgent need for tools to analyze and interpret large amounts of small RNA sequencing data. However, it remains challenging to systematically and comprehensively discover and characterize sncRNA genes and specifically-processed sncRNA products from these datasets. To fill this gap, we present Small RNA-seq Portal for Analysis of sequencing expeRiments (SPAR), a user-friendly web server for interactive processing, analysis, annotation and visualization of small RNA sequencing data. SPAR supports sequencing data generated from various experimental protocols, including smRNA-seq, short total RNA sequencing, microRNA-seq, and single-cell small RNA-seq. Additionally, SPAR includes publicly available reference sncRNA datasets from our DASHR database and from ENCODE across 185 human tissues and cell types to produce highly informative small RNA annotations across all major small RNA types and other features such as co-localization with various genomic features, precursor transcript cleavage patterns, and conservation. SPAR allows the user to compare the input experiment against reference ENCODE/DASHR datasets. SPAR currently supports analyses of human (hg19, hg38) and mouse (mm10) sequencing data. SPAR is freely available at https://www.lisanwanglab.org/SPAR.
Using statistical text classification to identify health information technology incidents
Chai, Kevin E K; Anthony, Stephen; Coiera, Enrico; Magrabi, Farah
2013-01-01
Objective: To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. Design: We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. Text classifiers using regularized logistic regression were evaluated with both ‘balanced’ (50% HIT) and ‘stratified’ (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. Measurements: κ statistic, F1 score, precision and recall. Results: Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size to 79% while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Conclusions: Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data analysis of patient safety incidents and requires further investigation. PMID:23666777
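The core of such a classifier can be sketched in a few lines: bag-of-words features with stop-word removal feeding a regularized logistic regression. The toy reports and training loop below are illustrative only; the study's pipeline added stemming, lemmatization, and careful cross-validation:

```python
import math
import re

STOP = {"the", "a", "an", "of", "to", "in", "was", "and"}

def tokens(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

def train(reports, labels, epochs=200, lr=0.5):
    """Tiny L2-regularized logistic regression over bag-of-words features,
    trained by per-sample gradient descent. Returns a predict() function."""
    vocab = sorted({w for r in reports for w in tokens(r)})
    X = [[tokens(r).count(w) for w in vocab] for r in reports]
    w, b = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for x, y in zip(X, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - y       # gradient of log-loss
            w = [wi - lr * (g * xi + 0.001 * wi) for wi, xi in zip(w, x)]
            b -= lr * g
    def predict(report):
        x = [tokens(report).count(v) for v in vocab]
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if z > 0 else 0                     # 1 = HIT incident
    return predict

# Toy training reports: label 1 marks a health IT incident.
reports = ["software froze during order entry",
           "interface dropped lab results",
           "catheter balloon ruptured",
           "infusion pump battery leaked"]
predict = train(reports, [1, 1, 0, 0])
```

On the real MAUDE data the extreme class imbalance (0.297% HIT) is the hard part, which is why the study evaluated both balanced and stratified training sets.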
ARM Research in the Equatorial Western Pacific: A Decade and Counting
NASA Technical Reports Server (NTRS)
Long, C. N.; McFarlane, S. A.; DelGenio, A.; Minnis, P.; Ackerman, T. S.; Mather, J.; Comstock, J.; Mace, G. G.; Jensen, M.; Jakob, C.
2013-01-01
The tropical western Pacific (TWP) is an important climatic region. Strong solar heating, warm sea surface temperatures, and the annual progression of the intertropical convergence zone (ITCZ) across this region generate abundant convective systems, which through their effects on the heat and water budgets have a profound impact on global climate and precipitation. In order to accurately evaluate tropical cloud systems in models, measurements of tropical clouds, the environment in which they reside, and their impact on the radiation and water budgets are needed. Because of the remote location, ground-based datasets of cloud, atmosphere, and radiation properties from the TWP region have come primarily from short-term field experiments. While providing extremely useful information on physical processes, these short-term datasets are limited in statistical and climatological information. To provide long-term measurements of the surface radiation budget in the tropics and the atmospheric properties that affect it, the Atmospheric Radiation Measurement program established a measurement site on Manus Island, Papua New Guinea, in 1996 and on the island republic of Nauru in late 1998. These sites provide unique datasets now available for more than 10 years on Manus and Nauru. This article presents examples of the scientific use of these datasets including characterization of cloud properties, analysis of cloud radiative forcing, model studies of tropical clouds and processes, and validation of satellite algorithms. New instrumentation recently installed at the Manus site will provide expanded opportunities for tropical atmospheric science.
Rapid, semi-automatic fracture and contact mapping for point clouds, images and geophysical data
NASA Astrophysics Data System (ADS)
Thiele, Samuel T.; Grose, Lachlan; Samsu, Anindita; Micklethwaite, Steven; Vollgger, Stefan A.; Cruden, Alexander R.
2017-12-01
The advent of large digital datasets from unmanned aerial vehicle (UAV) and satellite platforms now challenges our ability to extract information across multiple scales in a timely manner, often meaning that the full value of the data is not realised. Here we adapt a least-cost-path solver and specially tailored cost functions to rapidly interpolate structural features between manually defined control points in point cloud and raster datasets. We implement the method in the geographic information system QGIS and the point cloud and mesh processing software CloudCompare. Using these implementations, the method can be applied to a variety of three-dimensional (3-D) and two-dimensional (2-D) datasets, including high-resolution aerial imagery, digital outcrop models, digital elevation models (DEMs) and geophysical grids. We demonstrate the algorithm with four diverse applications in which we extract (1) joint and contact patterns in high-resolution orthophotographs, (2) fracture patterns in a dense 3-D point cloud, (3) earthquake surface ruptures of the Greendale Fault associated with the Mw7.1 Darfield earthquake (New Zealand) from high-resolution light detection and ranging (lidar) data, and (4) oceanic fracture zones from bathymetric data of the North Atlantic. The approach improves the consistency of the interpretation process while retaining expert guidance and achieves significant improvements (35-65 %) in digitisation time compared to traditional methods. Furthermore, it opens up new possibilities for data synthesis and can quantify the agreement between datasets and an interpretation.
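The least-cost-path idea at the heart of the method can be sketched as Dijkstra's algorithm over a cost raster between two user-defined control points. The authors' tailored cost functions are not reproduced here; the grid below is a toy example where low cost marks a dark lineament:

```python
import heapq

def least_cost_path(cost, start, end):
    """Dijkstra's algorithm between two control points on a 2-D cost raster.
    cost: grid of per-cell traversal costs; moves are 4-connected.
    Returns the list of (row, col) cells of the cheapest path."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == end:
            break
        if d > dist.get(cell, float("inf")):
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    path, cell = [], end
    while cell != start:   # walk predecessors back to the start
        path.append(cell)
        cell = prev[cell]
    path.append(start)
    return path[::-1]

# Toy "image": low values mark a dark feature snaking through bright cells.
grid = [[9, 1, 9, 9],
        [9, 1, 1, 9],
        [9, 9, 1, 9],
        [9, 9, 1, 9]]
trace = least_cost_path(grid, (0, 1), (3, 2))
```

In the published method the cost function is derived from the data (e.g. image brightness or curvature), so the interpolated trace snaps to the feature between the expert's control points.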
Rainfall simulation experiments in the southwestern USA using the Walnut Gulch Rainfall Simulator
NASA Astrophysics Data System (ADS)
Polyakov, Viktor; Stone, Jeffry; Holifield Collins, Chandra; Nearing, Mark A.; Paige, Ginger; Buono, Jared; Gomez-Pond, Rae-Landa
2018-01-01
This dataset contains hydrological, erosion, vegetation, ground cover, and other supplementary information from 272 rainfall simulation experiments conducted on 23 semiarid rangeland locations in Arizona and Nevada between 2002 and 2013. On 30 % of the plots, simulations were conducted up to five times during the decade of study. The rainfall was generated using the Walnut Gulch Rainfall Simulator on 2 m by 6 m plots. Simulation sites included brush and grassland areas with various degrees of disturbance by grazing, wildfire, or brush removal. This dataset advances our understanding of basic hydrological and biological processes that drive soil erosion on arid rangelands. It can be used to estimate runoff, infiltration, and erosion rates at a variety of ecological sites in the Southwestern USA. The inclusion of wildfire and brush treatment locations combined with long-term observations makes it important for studying vegetation recovery, ecological transitions, and the effect of management. It is also a valuable resource for erosion model parameterization and validation. The dataset is available from the National Agricultural Library at https://data.nal.usda.gov/search/type/dataset (DOI: https://doi.org/10.15482/USDA.ADC/1358583).
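Per-plot runoff and infiltration estimates of the kind this dataset supports follow from a simple water balance over the simulation: applied rainfall that does not run off is attributed to infiltration. A sketch with hypothetical values:

```python
def plot_water_balance(rain_mm, runoff_mm):
    """Simple plot-scale water balance for a rainfall simulation:
    applied water that does not run off is taken as infiltration.
    Returns (infiltration_mm, runoff_coefficient)."""
    infiltration = rain_mm - runoff_mm
    return infiltration, runoff_mm / rain_mm

# Hypothetical simulation: 60 mm applied, 21 mm measured as runoff.
infil, rc = plot_water_balance(60.0, 21.0)
```

Comparing runoff coefficients across repeated simulations on the same plots is one way the dataset supports studies of vegetation recovery and management effects.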
Wilson, Frederic H.; Hults, Chad P.; Mull, Charles G.; Karl, Susan M.
2015-12-31
This Alaska compilation is unique in that it is integrated with a rich database of information provided in the spatial datasets and standalone attribute databases. Within the spatial files every line and polygon is attributed to its original source; the references to these sources are contained in related tables, as well as in stand-alone tables. Additional attributes include typical lithology, geologic setting, and age range for the map units. Also included are tables of radiometric ages.
Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video (Open Access)
2013-10-03
dataset: PETS2009 dataset, Oxford Town Center dataset [3], PNNL Parking Lot datasets [25] and CAVIAR cols1 dataset [1] for human detection. Besides, we... level features from TownCenter, ParkingLot, PETS09 and CAVIAR. As we can see, the four sets of features are visually very different from each other... information is more distinctive for detecting a person in the TownCenter than in CAVIAR. Comparing figure 5(a) with 6(a), interestingly, the color
Schure, Mark R; Davis, Joe M
2017-11-10
Orthogonality metrics (OMs) for three and higher dimensional separations are proposed as extensions of previously developed OMs, which were used to evaluate the zone utilization of two-dimensional (2D) separations. These OMs include correlation coefficients, dimensionality, information theory metrics and convex-hull metrics. In a number of these cases, lower dimensional subspace metrics exist and can be readily calculated. The metrics are used to interpret previously generated experimental data. The experimental datasets are derived from Gilar's peptide data, now modified to be three dimensional (3D), and a comprehensive 3D chromatogram from Moore and Jorgenson. The Moore and Jorgenson chromatogram, which has 25 identifiable 3D volume elements or peaks, displayed good orthogonality values over all dimensions. However, OMs based on discretization of the 3D space changed substantially with changes in binning parameters. This example highlights the importance in higher dimensions of having an abundant number of retention times as data points, especially for methods that use discretization. The Gilar data, which in a previous study produced 21 2D datasets by the pairing of 7 one-dimensional separations, was reinterpreted to produce 35 3D datasets. These datasets show a number of interesting properties, one of which is that geometric and harmonic means of lower dimensional subspace (i.e., 2D) OMs correlate well with the higher dimensional (i.e., 3D) OMs. The space utilization of the Gilar 3D datasets was ranked using OMs, with the retention times of the datasets having the largest and smallest OMs presented as graphs. A discussion concerning the orthogonality of higher dimensional techniques is given with emphasis on molecular diversity in chromatographic separations. In the information theory work, an inconsistency is found in previous studies of orthogonality using the 2D metric often identified as %O. 
A new choice of metric is proposed, extended to higher dimensions, characterized by mixes of ordered and random retention times, and applied to the experimental datasets. In 2D, the new metric always equals or exceeds the original one. However, results from both the original and new methods are given. Copyright © 2017 Elsevier B.V. All rights reserved.
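One of the convex-hull metrics discussed can be sketched in 2D: normalize both retention-time axes to the unit square and report the area of the peaks' convex hull, so full use of the separation space scores near 1 and perfectly correlated retention times score 0. An illustrative sketch only; the paper extends such metrics to 3D and higher:

```python
def hull_area(points):
    """Area of the convex hull of 2-D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0
    def cross(o, a, b):
        return (a[0]-o[0]) * (b[1]-o[1]) - (a[1]-o[1]) * (b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]
    # Shoelace formula over the hull vertices.
    area = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def convex_hull_om(rt1, rt2):
    """Convex-hull orthogonality metric for a 2-D separation: the fraction
    of the normalized retention-time square covered by the peak hull."""
    lo1, hi1, lo2, hi2 = min(rt1), max(rt1), min(rt2), max(rt2)
    pts = [((x - lo1) / (hi1 - lo1), (y - lo2) / (hi2 - lo2))
           for x, y in zip(rt1, rt2)]
    return hull_area(pts)

# Peaks spread over the whole plane vs. perfectly correlated peaks.
spread = convex_hull_om([0, 0, 10, 10], [0, 8, 0, 8])
correlated = convex_hull_om([0, 1, 2, 3], [0, 1, 2, 3])
```

Unlike binning-based metrics, the hull area does not depend on discretization parameters, which is one reason such measures behave more stably in higher dimensions.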
NASA Astrophysics Data System (ADS)
Arozarena, A.; Villa, G.; Valcárcel, N.; Pérez, B.
2016-06-01
Remote sensing satellites, together with aerial and terrestrial platforms (mobile and fixed), nowadays produce huge amounts of data from a wide variety of sensors. These datasets serve as main data sources for the extraction of Geospatial Reference Information (GRI), constituting the "skeleton" of any Spatial Data Infrastructure (SDI). Since very different situations can be found around the world in terms of geographic information production and management, the generation of global GRI datasets seems extremely challenging. Remotely sensed data, due to their wide availability, can provide fundamental sources for any production or management system present in different countries. After several automatic and semiautomatic processes including ancillary data, the extracted geospatial information is ready to become part of the GRI databases. In order to optimize these data flows for the production of high-quality geospatial information and to promote its use to address global challenges, several initiatives have been put in place at national, continental and global levels, such as the European INSPIRE initiative and the Copernicus Programme, and global initiatives such as the Group on Earth Observation/Global Earth Observation System of Systems (GEO/GEOSS) and United Nations Global Geospatial Information Management (UN-GGIM). These workflows are established mainly by public organizations, with the adequate institutional arrangements at national, regional or global levels. Other initiatives, such as Volunteered Geographic Information (VGI), may on the other hand contribute to keeping the GRI databases updated. Remotely sensed data hence become one of the main pillars underpinning the establishment of a global SDI, as those datasets will be used by public agencies or institutions as well as by volunteers to extract the required spatial information that in turn will feed the GRI databases.
This paper intends to provide an example of how institutional arrangements and cooperative production systems can be set up at any territorial level in order to exploit remotely sensed data in the most intensive manner, taking advantage of all its potential.
Lutomski, Jennifer E.; Baars, Maria A. E.; Schalk, Bianca W. M.; Boter, Han; Buurman, Bianca M.; den Elzen, Wendy P. J.; Jansen, Aaltje P. D.; Kempen, Gertrudis I. J. M.; Steunenberg, Bas; Steyerberg, Ewout W.; Olde Rikkert, Marcel G. M.; Melis, René J. F.
2013-01-01
Introduction In 2008, the Ministry of Health, Welfare and Sport commissioned the National Care for the Elderly Programme. While numerous research projects in older persons' health care were to be conducted under this national agenda, the Programme further advocated the development of The Older Persons and Informal Caregivers Survey Minimum DataSet (TOPICS-MDS), which would be integrated into all funded research protocols. In this context, we describe the TOPICS-MDS data sharing initiative (www.topics-mds.eu). Materials and Methods A working group drafted the TOPICS-MDS prototype, which was subsequently approved by a multidisciplinary panel. Using instruments validated for older populations, information was collected on demographics, morbidity, quality of life, functional limitations, mental health, social functioning and health service utilisation. For informal caregivers, information was collected on demographics, hours of informal care and quality of life (including subjective care-related burden). Results Between 2010 and 2013, a total of 41 research projects contributed data to TOPICS-MDS, resulting in preliminary data available for 32,310 older persons and 3,940 informal caregivers. The majority of studies sampled were from primary care settings, and inclusion criteria differed across studies. Discussion TOPICS-MDS is a public data repository which contains essential data to better understand the health challenges experienced by older persons and informal caregivers. Such findings are relevant for countries where increasing health-related expenditure has necessitated the evaluation of contemporary health care delivery. Although open sharing of data can be difficult to achieve in practice, proactively addressing issues of data protection, conflicting data analysis requests and funding limitations during the TOPICS-MDS developmental phase has fostered a data sharing culture.
To date, TOPICS-MDS has been successfully incorporated into 41 research projects, thus supporting the feasibility of constructing a large (>30,000 observations), standardised dataset pooled from various study protocols with different sampling frameworks. This unique implementation strategy improves efficiency and facilitates individual-level data meta-analysis. PMID:24324716
Rea, Alan; Skinner, Kenneth D.
2012-01-01
The U.S. Geological Survey Hawaii StreamStats application uses an integrated suite of raster and vector geospatial datasets to delineate and characterize watersheds. This report describes the geospatial datasets used to delineate and characterize watersheds on the StreamStats website, and the methods used to develop them. The datasets for Hawaii were derived primarily from 10-meter-resolution National Elevation Dataset (NED) elevation models and the National Hydrography Dataset (NHD), using a set of procedures designed to enforce the drainage pattern from the NHD into the NED, resulting in an integrated suite of elevation-derived datasets. Additional sources of data used for computing basin characteristics include precipitation, land cover, soil permeability, and elevation-derivative datasets. The report also includes links for metadata and downloads of the geospatial datasets.
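The core preprocessing step mentioned above, enforcing the NHD drainage pattern into the NED elevation grid, is commonly known as "stream burning": DEM cells that coincide with mapped streams are lowered so that automated flow routing follows the known hydrography. The sketch below illustrates the general idea only, not the USGS procedure; the grid values and burn depth are hypothetical.

```python
def burn_streams(dem, stream_mask, burn_depth=10.0):
    """Lower DEM cells flagged as streams by a fixed depth.

    dem         -- 2-D list of elevations (metres)
    stream_mask -- 2-D list of 0/1 flags marking stream cells
    burn_depth  -- amount to lower stream cells (metres)
    """
    return [
        [z - burn_depth if flag else z for z, flag in zip(dem_row, mask_row)]
        for dem_row, mask_row in zip(dem, stream_mask)
    ]

# Hypothetical 2x3 elevation grid with a stream running down the middle column.
dem = [[105.0, 104.0, 103.0],
       [104.0, 103.0, 102.0]]
streams = [[0, 1, 0],
           [0, 1, 0]]
burned = burn_streams(dem, streams)
# Stream cells are now lower than their off-channel neighbours,
# so a flow-direction algorithm will route water along the mapped channel.
```

Production workflows (e.g., AGREE-style burning) taper the adjustment near the stream rather than applying a flat offset, but the principle is the same.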
Towards systematic evaluation of crop model outputs for global land-use models
NASA Astrophysics Data System (ADS)
Leclere, David; Azevedo, Ligia B.; Skalský, Rastislav; Balkovič, Juraj; Havlík, Petr
2016-04-01
Land provides vital socioeconomic resources to society, however at the cost of substantial environmental degradation. Global integrated models combining high-resolution global gridded crop models (GGCMs) and global economic models (GEMs) are increasingly being used to inform sustainable solutions for agricultural land use. However, little effort has yet been made to evaluate and compare the accuracy of GGCM outputs. In addition, GGCM datasets require a large number of parameters whose values, and whose variability across space, are weakly constrained: increasing the accuracy of such datasets has a very high computing cost. Innovative evaluation methods are required both to lend credibility to the global integrated models and to allow efficient parameter specification of GGCMs. We propose an evaluation strategy for GGCM datasets from the perspective of their use in GEMs, illustrated with preliminary results from a novel dataset (the Hypercube) generated by the EPIC GGCM and used in the GLOBIOM land-use GEM to inform on present-day crop yield and water and nutrient input needs for 16 crops × 15 management intensities, at a spatial resolution of 5 arc-minutes. We adopt the following principle: evaluation should provide a transparent diagnosis of model adequacy for its intended use. We briefly describe how the Hypercube data is generated and how it articulates with GLOBIOM, in order to transparently identify the performances to be evaluated as well as the main assumptions and data processing involved. Expected performances include adequately representing the sub-national heterogeneity in crop yield and input needs: i) in space, ii) across crop species, and iii) across management intensities. We will present and discuss measures of these expected performances and weigh the relative contributions of the crop model, input data and data processing steps to those performances. We will also compare the obtained yield gaps and main yield-limiting factors against the M3 dataset.
Next steps include iterative improvement of parameter assumptions and evaluation of the implications of GGCM performances for the intended use in the IIASA EPIC-GLOBIOM model cluster. Our approach helps target future efforts at improving GGCM accuracy and would achieve highest efficiency if combined with traditional field-scale evaluation and sensitivity analysis.
Fuzzy Naive Bayesian model for medical diagnostic decision support.
Wagholikar, Kavishwar B; Vijayraghavan, Sundararajan; Deshpande, Ashok W
2009-01-01
This work relates to the development of computational algorithms to provide decision support to physicians. The authors propose a Fuzzy Naive Bayesian (FNB) model for medical diagnosis, which extends the Fuzzy Bayesian approach proposed by Okuda. A physician-interview-based method is described for defining an orthogonal fuzzy symptom information system, which is required to apply the model. For the purpose of elaboration and elicitation of its characteristics, the algorithm is applied to a simple simulated dataset and compared with the conventional Naive Bayes (NB) approach. As a preliminary evaluation of FNB in a real-world scenario, the comparison is repeated on a real fuzzy dataset of 81 patients diagnosed with infectious diseases. The case study on the simulated dataset elucidates that FNB can be superior to NB for diagnosing patients with imprecise (fuzzy) information, on account of the following characteristics: 1) it can model the information that values of some attributes are semantically closer than values of other attributes, and 2) it offers a mechanism to temper exaggerations in patient information. Although the algorithm requires precise training data, its utility for fuzzy training data is argued for. This is supported by the case study on the infectious disease dataset, which indicates the superiority of FNB over NB for the infectious disease domain. Further case studies on large datasets are required to establish the utility of FNB.
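As a rough illustration of the kind of inference a fuzzy naive-Bayes classifier performs, the sketch below scores diseases against a symptom reported as fuzzy membership degrees over linguistic terms, weighting each term's class-conditional probability by its membership. This is a minimal reading of the general idea, not the authors' implementation, and all probabilities and memberships are hypothetical.

```python
def fnb_posteriors(priors, likelihoods, observation):
    """Posterior scores for a fuzzy observation under a naive-Bayes factorization.

    priors      -- {disease: P(disease)}
    likelihoods -- {disease: {symptom: {term: P(term | disease)}}}
    observation -- {symptom: {term: membership degree in [0, 1]}}
    """
    scores = {}
    for disease, prior in priors.items():
        score = prior
        for symptom, memberships in observation.items():
            # Fuzzy evidence: membership-weighted sum of crisp term probabilities.
            score *= sum(mu * likelihoods[disease][symptom][term]
                         for term, mu in memberships.items())
        scores[disease] = score
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

# Hypothetical two-disease, one-symptom example with an imprecise patient report.
priors = {"flu": 0.5, "cold": 0.5}
likelihoods = {
    "flu":  {"fever": {"high": 0.8, "mild": 0.2}},
    "cold": {"fever": {"high": 0.1, "mild": 0.9}},
}
obs = {"fever": {"high": 0.7, "mild": 0.3}}  # "mostly high, somewhat mild"
post = fnb_posteriors(priors, likelihoods, obs)
# Partial membership in "mild" tempers the evidence for "flu" relative to
# a crisp "high fever" report, which is the moderating effect noted above.
```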
Sari C. Saunders; Jiquan Chen; Thomas D. Drummer; Eric J. Gustafson; Kimberley D. Brosofske
2005-01-01
Identifying scales of pattern in ecological systems and coupling patterns to processes that create them are ongoing challenges. We examined the utility of three techniques (lacunarity, spectral, and wavelet analysis) for detecting scales of pattern of ecological data. We compared the information obtained using these methods for four datasets, including: surface...
Li, Yong-Xin; Zhong, Zheng; Hou, Peng; Zhang, Wei-Peng; Qian, Pei-Yuan
2018-03-07
In the version of this article originally published, the links and files for the Supplementary Information, including Supplementary Tables 1-5, Supplementary Figures 1-25, Supplementary Note, Supplementary Datasets 1-4 and the Life Sciences Reporting Summary, were missing in the HTML. The error has been corrected in the HTML version of this article.
The Ethics of Big Data and Nursing Science.
Milton, Constance L
2017-10-01
Big data is a scientific, social, and technological trend referring to the process and size of datasets available for analysis. Ethical implications arise as healthcare disciplines, including nursing, struggle over questions of informed consent, privacy, ownership of data, and its possible use in epistemology. The author offers straight-thinking possibilities for the use of big data in nursing science.
DNApod: DNA polymorphism annotation database from next-generation sequence read archives.
Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu
2017-01-01
With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924
On standardization of basic datasets of electronic medical records in traditional Chinese medicine.
Zhang, Hong; Ni, Wandong; Li, Jing; Jiang, Youlin; Liu, Kunjing; Ma, Zhaohui
2017-12-24
Standardization of electronic medical records, so as to enable resource sharing and information exchange among medical institutions, has become inevitable in view of ever-increasing medical information. The current research is an effort towards the standardization of the basic dataset of electronic medical records in traditional Chinese medicine. In this work, an outpatient clinical information model and an inpatient clinical information model are created to adequately depict the diagnosis processes and treatment procedures of traditional Chinese medicine. To be backward compatible with the existing dataset standard created for western medicine, the new standard shall be a superset of the existing standard. Thus, the two models are checked against the existing standard in conjunction with 170,000 medical record cases. If a case cannot be covered by the existing standard due to the particularities of Chinese medicine, then either an existing data element is expanded with Chinese medicine content or a new data element is created. Some dataset subsets are also created to group and record Chinese medicine special diagnoses and treatments such as acupuncture. The outcome of this research is a proposal for standardized traditional Chinese medicine medical record datasets. The proposal has been verified successfully in three medical institutions with hundreds of thousands of medical records. A new dataset standard for traditional Chinese medicine is proposed in this paper. The proposed standard, covering traditional Chinese medicine as well as western medicine, is expected to be approved soon by the authority. Widespread adoption of this proposal will enable traditional Chinese medicine hospitals and institutions to easily exchange information and share resources. Copyright © 2017. Published by Elsevier B.V.
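The check of the models against the existing standard over 170,000 cases can be pictured as a field-coverage audit: any record field not covered by the existing standard becomes a candidate for an expanded or new data element. The sketch below shows that idea only; the field names and records are invented for illustration and are not from the authors' tooling.

```python
def find_uncovered_fields(cases, standard_elements):
    """Return the set of record fields not covered by the standard's data elements."""
    uncovered = set()
    for case in cases:
        uncovered |= set(case) - standard_elements
    return uncovered

# Hypothetical standard (western-medicine data elements) and record cases.
standard = {"patient_id", "diagnosis", "prescription"}
cases = [
    {"patient_id": 1, "diagnosis": "common cold", "prescription": "rest"},
    {"patient_id": 2, "diagnosis": "...", "pulse_pattern": "wiry",
     "acupuncture_points": ["LI4"]},  # TCM-specific fields not in the standard
]
gaps = find_uncovered_fields(cases, standard)
# Each field in `gaps` motivates either expanding an existing data element
# or creating a new one, as described in the abstract above.
```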
Smith, Tanya; Page-Nicholson, Samantha; Gibbons, Bradley; Jones, M. Genevieve W.; van Niekerk, Mark; Botha, Bronwyn; Oliver, Kirsten; McCann, Kevin
2016-01-01
Abstract Background The International Crane Foundation (ICF) / Endangered Wildlife Trust’s (EWT) African Crane Conservation Programme has recorded 26 403 crane sightings in its database from 1978 to 2014. This sightings collection is currently ongoing and records are continuously added to the database by the EWT field staff, ICF/EWT Partnership staff, various partner organizations and private individuals. The dataset has two peak collection periods: 1994-1996 and 2008-2012. The dataset collection spans five African countries: Kenya, Rwanda, South Africa, Uganda and Zambia; 98% of the data were collected in South Africa. Georeferencing of the dataset was verified before publication of the data. The dataset contains data on three African crane species: Blue Crane Anthropoides paradiseus, Grey Crowned Crane Balearica regulorum and Wattled Crane Bugeranus carunculatus. The Blue and Wattled Cranes are classified by the IUCN Red List of Threatened Species as Vulnerable and the Grey Crowned Crane as Endangered. New information This is the single most comprehensive dataset published on African Crane species that adds new information about the distribution of these three threatened species. We hope this will further aid conservation authorities to monitor and protect these species. The dataset continues to grow and especially to expand in geographic coverage into new countries in Africa and new sites within countries. The dataset can be freely accessed through the Global Biodiversity Information Facility data portal. PMID:27956850
Providing Geographic Datasets as Linked Data in Sdi
NASA Astrophysics Data System (ADS)
Hietanen, E.; Lehto, L.; Latvala, P.
2016-06-01
In this study, a prototype service providing data from a Web Feature Service (WFS) as linked data is implemented. First, persistent and unique Uniform Resource Identifiers (URIs) are created for all spatial objects in the dataset. The objects are available from those URIs in the Resource Description Framework (RDF) data format. Next, a Web Ontology Language (OWL) ontology is created to describe the dataset's information content using the Open Geospatial Consortium's (OGC) GeoSPARQL vocabulary. The existing data model is modified to take the linked data principles into account. The implemented service produces HTTP responses dynamically: the data for a response is first fetched from the existing WFS, and the Geography Markup Language (GML) output of the WFS is then transformed on the fly into RDF. Content negotiation is used to serve the data in different RDF serialization formats. This solution facilitates the use of a dataset in different applications without replicating the whole dataset. In addition, individual spatial objects in the dataset can be referred to with URIs. Furthermore, the needed information content of the objects can be easily extracted from the RDF serializations available from those URIs. A solution for linking data objects to the dataset URI is also introduced using the Vocabulary of Interlinked Datasets (VoID). The dataset is divided into subsets and each subset is given its own persistent and unique URI. This enables the whole dataset to be explored with a web browser and all individual objects to be indexed by search engines.
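The content-negotiation step described above amounts to inspecting the HTTP Accept header and choosing an RDF serialization for the response. The sketch below shows one minimal way to do that; the supported media types and the mapping to serialization names are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical mapping from media types to RDF serialization identifiers.
SUPPORTED = {
    "text/turtle": "turtle",
    "application/rdf+xml": "rdfxml",
    "application/ld+json": "jsonld",
}

def negotiate(accept_header, default="turtle"):
    """Return the RDF serialization for the first supported media type listed.

    A real implementation would also honour q-values; here the header is
    simply scanned left to right and quality parameters are ignored.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return SUPPORTED[media_type]
    return default

fmt = negotiate("application/ld+json, text/html;q=0.9")
# A client asking for JSON-LD gets JSON-LD; a plain browser (text/html)
# falls back to the default serialization.
```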
Dupree, Jean A.; Crowfoot, Richard M.
2012-01-01
This geodatabase and its component datasets are part of U.S. Geological Survey Digital Data Series 650 and were generated to store basin boundaries for U.S. Geological Survey streamgages and other sites in Colorado. The geodatabase and its components were created by the U.S. Geological Survey, Colorado Water Science Center, and are used to derive the numeric drainage areas for Colorado that are input into the U.S. Geological Survey's National Water Information System (NWIS) database and also published in the Annual Water Data Report and on NWISWeb. The foundational dataset used to create the basin boundaries in this geodatabase was the National Watershed Boundary Dataset. This geodatabase accompanies a U.S. Geological Survey Techniques and Methods report (Book 11, Section C, Chapter 6) entitled "Digital Database Architecture and Delineation Methodology for Deriving Drainage Basins, and Comparison of Digitally and Non-Digitally Derived Numeric Drainage Areas." The Techniques and Methods report details the geodatabase architecture, describes the delineation methodology and workflows used to develop these basin boundaries, and compares digitally derived numeric drainage areas in this geodatabase to non-digitally derived areas.
1. COBasins.gdb: This geodatabase contains site locations and basin boundaries for Colorado. It includes a single feature dataset, called BasinsFD, which groups the component feature classes and topology rules.
2. BasinsFD: This feature dataset in the "COBasins.gdb" geodatabase is a digital container that holds the feature classes used to archive site locations and basin boundaries, as well as the topology rules that govern spatial relations within and among component feature classes. This feature dataset includes three feature classes: the sites for which basins have been delineated (the "Sites" feature class), basin bounding lines (the "BasinLines" feature class), and polygonal basin areas (the "BasinPolys" feature class). The feature dataset also stores the topology rules (the "BasinsFD_Topology") that constrain the relations within and among component feature classes, and it forces all feature classes inside it to share a consistent projection system, in this case an Albers Equal-Area projection.
3. BasinsFD_Topology: This topology contains four persistent topology rules that constrain the spatial relations within the "BasinLines" feature class and between the "BasinLines" feature class and the "BasinPolys" feature class.
4. Sites: This point feature class contains the digital representations of the site locations for which Colorado Water Science Center basin boundaries have been delineated. This feature class includes point locations for Colorado Water Science Center active (as of September 30, 2009) gages and for other sites.
5. BasinLines: This line feature class contains the perimeters of basins delineated for features in the "Sites" feature class, and it also contains information regarding the sources of lines used for the basin boundaries.
6. BasinPolys: This polygon feature class contains the polygonal basin areas delineated for features in the "Sites" feature class, and it is used to derive the numeric drainage areas published by the Colorado Water Science Center.
Fish and fishery historical data since the 19th century in the Adriatic Sea, Mediterranean
Fortibuoni, Tomaso; Libralato, Simone; Arneri, Enrico; Giovanardi, Otello; Solidoro, Cosimo; Raicevich, Saša
2017-01-01
Historical data on biodiversity provide context for present observations and allow the study of long-term changes in marine populations. Here we present multiple datasets on the fish and fisheries of the Adriatic Sea covering the last two centuries, ranging from qualitative observations to standardised scientific monitoring. The datasets fall into three groups: (1) early naturalists' descriptions of the fish fauna, including information (e.g., presence, perceived abundance, size) on 255 fish species for the period 1818–1936; (2) historical landings from major Northern Adriatic fish markets (Venice, Trieste, Rijeka) for the period 1902–1968, Italian official landings for the Northern and Central Adriatic (1953–2012), and landings from the Lagoon of Venice (1945–2001); and (3) trawl-survey data from seven surveys spanning the period 1948–1991, including catch-per-unit-effort data (kg h−1 and/or n h−1) for 956 hauls performed at 301 stations. The integration of these datasets has already proven useful for analysing historical changes in marine communities over time, and its availability through an open-source data portal will facilitate analyses in the framework of marine historical ecology. PMID:28895949
An ISA-TAB-Nano based data collection framework to support data-driven modelling of nanotoxicology
Marchese Robinson, Richard L; Richarz, Andrea-Nicole; Rallo, Robert
2015-01-01
Summary Analysis of trends in nanotoxicology data and the development of data-driven models for nanotoxicity are facilitated by the reporting of data in a standardised electronic format. ISA-TAB-Nano has been proposed as such a format. However, in order to build useful datasets according to this format, a variety of issues have to be addressed, including questions regarding exactly which (meta)data to report and how to report them. The current article discusses some of the challenges associated with the use of ISA-TAB-Nano and presents a set of resources designed to facilitate the manual creation of ISA-TAB-Nano datasets from the nanotoxicology literature. These resources were developed within the context of the NanoPUZZLES EU project and include data collection templates, corresponding business rules that extend the generic ISA-TAB-Nano specification, as well as Python code to facilitate parsing and integration of these datasets within other nanoinformatics resources. The use of these resources is illustrated by a "Toy Dataset" presented in the Supporting Information. The strengths and weaknesses of the resources are discussed along with possible future developments. PMID:26665069
NASA Astrophysics Data System (ADS)
Changyong, Dou; Huadong, Guo; Chunming, Han; Ming, Liu
2014-03-01
With more and more Earth observation data available to the community, how to manage and share these valuable remote sensing datasets is becoming an urgent issue. Web-based Geographical Information System (GIS) technology provides a convenient way for users in different locations to share and make use of the same dataset. In order to make efficient use of the airborne Synthetic Aperture Radar (SAR) remote sensing data acquired at the Airborne Remote Sensing Center of the Institute of Remote Sensing and Digital Earth (RADI), Chinese Academy of Sciences (CAS), a Web-GIS based platform for airborne SAR data management, distribution and sharing was designed and developed. The major features of the system include a map-based navigation search interface, full-resolution imagery displayed as overlays on the map, and the exclusive use of open-source software (OSS). The functions of the platform include browsing imagery on the map-based navigation interface, ordering and downloading data online, and image dataset and user management. At present, the system is under testing at RADI and will enter regular operation soon.
Establishing a process for conducting cross-jurisdictional record linkage in Australia.
Moore, Hannah C; Guiver, Tenniel; Woollacott, Anthony; de Klerk, Nicholas; Gidding, Heather F
2016-04-01
To describe the realities of conducting a cross-jurisdictional data linkage project involving state and Australian Government data collections, in order to inform future national data linkage programs of work. We outline the processes involved in conducting a Proof of Concept data linkage project, including the implementation of national data integration principles, data custodian and ethical approval requirements, and the establishment of data flows. The approval process involved nine approval and regulatory bodies and took more than two years. Data will be linked across 12 datasets involving three data linkage centres. A framework was established to allow data to flow between these centres while maintaining the separation principle that serves to protect the privacy of the individual. This will be the first project to link child immunisation records from an Australian Government dataset to other administrative health datasets for a population cohort covering 2 million births in two Australian states. Although the project experienced some delays, positive outcomes were realised, primarily the development of strong collaborations across key stakeholder groups and community engagement. We have identified several recommendations and enhancements to this now-established framework to further streamline the process for data linkage studies involving Australian Government data. © 2015 Public Health Association of Australia.
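The "separation principle" mentioned above is the practice of giving the linkage unit only identifying fields and giving researchers only content fields, with the two halves joined by an opaque linkage key. A minimal sketch of that split follows; the record and field names are invented for illustration.

```python
import uuid

def separate(record, id_fields):
    """Split a record into (identifiers, content) sharing only an opaque key.

    The linkage unit receives `identifiers` (names, dates of birth, etc.)
    and never sees health content; researchers receive `content` and never
    see direct identifiers. Only the random link_key joins the two halves.
    """
    key = uuid.uuid4().hex
    identifiers = {f: record[f] for f in id_fields}
    content = {f: v for f, v in record.items() if f not in id_fields}
    identifiers["link_key"] = key
    content["link_key"] = key
    return identifiers, content

# Hypothetical immunisation record.
record = {"name": "J. Smith", "dob": "1954-02-01", "immunised": True}
ids, content = separate(record, id_fields={"name", "dob"})
# `content` carries no direct identifiers; privacy is preserved while the
# shared link_key still allows linked analysis downstream.
```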
A new, long-term daily satellite-based rainfall dataset for operational monitoring in Africa
NASA Astrophysics Data System (ADS)
Maidment, Ross I.; Grimes, David; Black, Emily; Tarnavsky, Elena; Young, Matthew; Greatrex, Helen; Allan, Richard P.; Stein, Thorwald; Nkonde, Edson; Senkunda, Samuel; Alcántara, Edgar Misael Uribe
2017-05-01
Rainfall information is essential for many applications in developing countries, and yet continually updated information at fine temporal and spatial scales is lacking. In Africa, rainfall monitoring is particularly important given the close relationship between climate and livelihoods. To address this information gap, this paper describes two versions (v2.0 and v3.0) of the TAMSAT daily rainfall dataset, based on high-resolution thermal-infrared observations and available from 1983 to the present. The datasets are based on the disaggregation of 10-day (v2.0) and 5-day (v3.0) total TAMSAT rainfall estimates to a daily time step using daily cold cloud duration. This approach provides temporally consistent historic and near-real-time daily rainfall information for all of Africa. The estimates have been evaluated using ground-based observations from five countries with contrasting rainfall climates (Mozambique, Niger, Nigeria, Uganda, and Zambia) and compared to other satellite-based rainfall estimates. The results indicate that both versions of the TAMSAT daily estimates reliably detect rainy days, but have less skill in capturing rainfall amounts, results that are comparable to those of the other datasets.
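The disaggregation step described above can be sketched as splitting a 5- or 10-day rainfall total across days in proportion to each day's cold cloud duration (CCD). This is an illustration of the proportional idea only, not the TAMSAT calibration itself; the numbers are hypothetical.

```python
def disaggregate(period_total_mm, daily_ccd_hours):
    """Distribute a period rainfall total over days, weighted by daily CCD.

    period_total_mm  -- 5- or 10-day rainfall estimate (mm)
    daily_ccd_hours  -- cold cloud duration per day within the period (hours)
    """
    total_ccd = sum(daily_ccd_hours)
    if total_ccd == 0:
        return [0.0] * len(daily_ccd_hours)  # no cold cloud, no rain assigned
    return [period_total_mm * ccd / total_ccd for ccd in daily_ccd_hours]

# Hypothetical 5-day period: 50 mm total, cold cloud on three of five days.
daily = disaggregate(50.0, [0, 4, 6, 0, 10])
# The daily values sum back to the period total, and zero-CCD days stay dry,
# which keeps the daily series consistent with the period estimate.
```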
RepExplore: addressing technical replicate variance in proteomics and metabolomics data analysis.
Glaab, Enrico; Schneider, Reinhard
2015-07-01
High-throughput omics datasets often contain technical replicates, included to account for technical sources of noise in the measurement process. Although summarizing these replicate measurements using robust averages may help to reduce the influence of noise on downstream data analysis, the information on the variance across the replicate measurements is lost in the averaging process and is therefore typically disregarded in subsequent statistical analyses. We introduce RepExplore, a web service dedicated to exploiting the information captured in the technical replicate variance to provide more reliable and informative differential expression and abundance statistics for omics datasets. The software builds on previously published statistical methods, which have been applied successfully to biomedical omics data but are difficult to use without prior experience in programming or scripting. RepExplore facilitates the analysis by providing fully automated data processing and interactive ranking tables, whisker plots, heat maps and principal component analysis visualizations to interpret omics data and derived statistics. RepExplore is freely available at http://www.repexplore.tk. Contact: enrico.glaab@uni.lu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
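One simple way to use replicate variance instead of discarding it is inverse-variance weighting: each sample contributes its replicate mean, weighted by the reciprocal of its replicate variance, so noisier samples count for less. The sketch below illustrates that general idea, not RepExplore's exact statistics; the measurements are hypothetical.

```python
from statistics import mean, variance

def inverse_variance_summary(replicate_sets):
    """Inverse-variance-weighted mean across samples.

    replicate_sets -- list of technical-replicate measurement lists,
                      one list per sample.
    """
    means = [mean(reps) for reps in replicate_sets]
    weights = [1.0 / variance(reps) for reps in replicate_sets]  # precise -> heavy
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)

tight = [10.0, 10.1, 9.9]   # low replicate variance: high weight
noisy = [13.0, 7.0, 12.0]   # high replicate variance: low weight
summary = inverse_variance_summary([tight, noisy])
# The summary lies very close to 10.0, dominated by the precise sample;
# a plain average of the two sample means would sit noticeably higher.
```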
NASA Astrophysics Data System (ADS)
Poli, D.; Remondino, F.; Angiuli, E.; Agugiaro, G.
2015-02-01
Today the use of spaceborne Very High Resolution (VHR) optical sensors for automatic 3D information extraction is increasing in the scientific and civil communities. The 3D Optical Metrology (3DOM) unit of the Bruno Kessler Foundation (FBK) in Trento, Italy, has collected VHR satellite imagery, as well as aerial and terrestrial data, over Trento to create a complete testfield for investigations of image radiometry, geometric accuracy, automatic digital surface model (DSM) generation, 2D/3D feature extraction, city modelling and data fusion. This paper addresses the radiometric and geometric aspects of the VHR spaceborne imagery included in the Trento testfield and its potential for 3D information extraction. The dataset consists of two stereo pairs acquired by WorldView-2 and GeoEye-1 in panchromatic and multispectral mode, and a triplet from Pléiades-1A. For reference and validation, a DSM from an airborne LiDAR acquisition is used. The paper gives details on the project, the dataset characteristics and the achieved results.
Yin, Zheng; Zhou, Xiaobo; Bakal, Chris; Li, Fuhai; Sun, Youxian; Perrimon, Norbert; Wong, Stephen TC
2008-01-01
Background The recent emergence of high-throughput automated image acquisition technologies has forever changed how cell biologists collect and analyze data. Historically, the interpretation of cellular phenotypes in different experimental conditions has been dependent upon the expert opinions of well-trained biologists. Such qualitative analysis is particularly effective in detecting subtle, but important, deviations in phenotypes. However, while the rapid and continuing development of automated microscope-based technologies now facilitates the acquisition of images of trillions of cells in thousands of diverse experimental conditions, such as in the context of RNA interference (RNAi) or small-molecule screens, the massive size of these datasets precludes human analysis. Thus, the development of automated methods that identify novel and biologically relevant phenotypes online is one of the major challenges in high-throughput image-based screening. Ideally, phenotype discovery methods should be designed to utilize prior/existing information and tackle three challenging tasks: restoring pre-defined, biologically meaningful phenotypes, differentiating novel phenotypes from known ones, and distinguishing novel phenotypes from each other. Arbitrarily extracted information causes biased analysis, while combining the complete existing datasets with each new image is intractable in high-throughput screens. Results Here we present the design and implementation of a novel and robust online phenotype discovery method with broad applicability that can be used in diverse experimental contexts, especially high-throughput RNAi screens. This method features phenotype modelling and iterative cluster merging using improved gap statistics. A Gaussian Mixture Model (GMM) is employed to estimate the distribution of each existing phenotype, which is then used as the reference distribution in the gap statistics.
This method is broadly applicable to many types of image-based datasets derived from a wide spectrum of experimental conditions and is suitable for adaptively processing new images that are continuously added to existing datasets. Validations were carried out on different datasets, including a published RNAi screen using Drosophila embryos [Additional files 1, 2], a dataset for cell cycle phase identification using HeLa cells [Additional files 1, 3, 4] and a synthetic dataset using polygons; our method tackled the three aforementioned tasks effectively with an accuracy range of 85%–90%. When our method was implemented in the context of a Drosophila genome-scale RNAi image-based screen of cultured cells aimed at identifying the contribution of individual genes towards the regulation of cell shape, it efficiently discovered meaningful new phenotypes and provided novel biological insight. We also propose a two-step procedure to modify the novelty detection method based on one-class SVM, so that it can be used for online phenotype discovery. Under different conditions, we compared the SVM-based method with our method using various datasets, and our method consistently outperformed the SVM-based method in at least two of the three tasks by 2% to 5%. These results demonstrate that our method can better identify novel phenotypes in image-based datasets from a wide range of conditions and organisms. Conclusion We demonstrate that our method can detect various novel phenotypes effectively in complex datasets. Experimental results also validate that our method performs consistently under different orders of image input, variations in starting conditions including the number and composition of existing phenotypes, and datasets from different screens. Our findings indicate that the proposed method is suitable for online phenotype discovery in diverse high-throughput image-based genetic and chemical screens. PMID:18534020
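As a rough illustration of the phenotype-modelling step, a minimal 1-D Gaussian mixture fitted by expectation-maximization is sketched below. This is a textbook EM loop with quantile-based initialization, not the authors' implementation and not their gap-statistic cluster merging.

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=100):
    """Minimal EM fit of a 1-D Gaussian mixture (illustrative only)."""
    x = np.asarray(x, dtype=float)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over the data
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means and variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = np.maximum((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk, 1e-6)
    return mu, var, pi
```

In the paper's setting, one such fitted distribution per known phenotype would serve as the reference distribution against which candidate new clusters are evaluated.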
Minimum information required for a DMET experiment reporting.
Kumuthini, Judit; Mbiyavanga, Mamana; Chimusa, Emile R; Pathak, Jyotishman; Somervuo, Panu; Van Schaik, Ron Hn; Dolzan, Vita; Mizzi, Clint; Kalideen, Kusha; Ramesar, Raj S; Macek, Milan; Patrinos, George P; Squassina, Alessio
2016-09-01
The aim was to provide pharmacogenomics reporting guidelines, along with the information and tools required for reporting to public omics databases. For effective DMET data interpretation, sharing, interoperability, reproducibility and reporting, we propose the Minimum Information required for a DMET Experiment (MIDE) reporting guideline. MIDE provides reporting guidelines and describes the information required for reporting, data storage and data sharing in the form of XML. The MIDE guidelines will benefit the scientific community conducting pharmacogenomics experiments, including the reporting of pharmacogenomics data from other technology platforms, with tools that ease and automate the generation of such reports using the standardized MIDE XML schema, thereby facilitating the sharing, dissemination and reanalysis of datasets through accessible and transparent pharmacogenomics data reporting.
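To make the XML-based reporting idea concrete, here is a minimal sketch that serializes a DMET-style record. The element names (`DMETExperiment`, `SampleID`, `Genotype`, ...) are illustrative placeholders in the spirit of MIDE, not the official MIDE schema.

```python
import xml.etree.ElementTree as ET

def build_report(sample_id, platform, genotypes):
    """Serialize a minimal DMET-style report as XML.

    Element names are hypothetical stand-ins, not the MIDE schema.
    genotypes: dict mapping gene name -> star-allele call.
    """
    root = ET.Element("DMETExperiment")
    ET.SubElement(root, "SampleID").text = sample_id
    ET.SubElement(root, "Platform").text = platform
    block = ET.SubElement(root, "Genotypes")
    for gene, call in genotypes.items():
        ET.SubElement(block, "Genotype", gene=gene).text = call
    return ET.tostring(root, encoding="unicode")
```

A real MIDE report would of course be validated against the published schema rather than assembled ad hoc like this.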
Federal standards and procedures for the National Watershed Boundary Dataset (WBD)
2009-03-11
Terminology, definitions, and procedural information are provided to ensure uniformity in hydrologic unit boundaries, names, and numerical codes. Detailed standards and specifications for data are included. The document also includes discussion of objectives, communications required for revising the data resolution in the United States and the Caribbean, as well as final review and data-quality criteria. Instances of unusual landforms or artificial features that affect the hydrologic units are described with metadata standards. Up-to-date information and availability of the hydrologic units are listed at http://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/technical/nra/dma/?&cid=nrcs143_021630/.
Processed foods available in the Pacific Islands
2013-01-01
Background There is an increasing reliance on processed foods globally, yet food composition tables include minimal information on their nutrient content. The Pacific Islands share common trade links and are heavily reliant on imported foods. The objective was to develop a dataset for the Pacific Islands on the nutrient composition of processed foods sold and their sources. Methods Information on the food labels, including country of origin, nutrient content and promotional claims, was recorded into a standardised dataset. Data were cleaned, converted to per-100 g values as needed and then checked for anomalies and recording errors. Setting: Five representative countries were selected for data collection, based on their trading patterns: Fiji, Guam, Nauru, New Caledonia, and Samoa. Data were collected in the capitals, in larger stores which import their own foods. Subjects: Processed foods in stores. Results The data from 6041 foods and drinks were recorded. Fifty-four countries of origin were identified, with the main provider of food for each Pacific Island country being that with which it was most strongly linked politically. Nutrient data were not provided for 6% of the foods, imported from various countries. Inaccurate labels were found on 132 products. Over one-quarter of the foods included some nutrient or health-related claims. Conclusions The globalisation of the food supply is having considerable impacts on diets in the Pacific Islands. While nutrient labels can be informative for consumers looking for healthier options, poor labelling remains a problem and interpretation can be challenging. PMID:24160249
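The per-100 g conversion mentioned in the Methods is simple label arithmetic; a minimal sketch follows (the function names are ours, not from the study).

```python
def per_100g(value_per_serving, serving_size_g):
    """Convert a labelled per-serving nutrient value to a per-100 g value."""
    if serving_size_g <= 0:
        raise ValueError("serving size must be positive")
    return value_per_serving * 100.0 / serving_size_g

def label_per_100g(nutrients_per_serving, serving_size_g):
    """Convert every nutrient on a label at once, e.g. for a 60 g serving."""
    return {name: per_100g(value, serving_size_g)
            for name, value in nutrients_per_serving.items()}
```

For example, 12 g of sugar in a 60 g serving becomes 20 g per 100 g, which is the form needed for cross-product comparison.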
An interactive web application for the dissemination of human systems immunology data.
Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien
2015-06-19
Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State of the art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets were loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples are displayed dynamically; if desired the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page ( https://gxb.benaroyaresearch.org/dm3/landing.gsp )]. The source code is also available openly [Gene Expression Browser Source Code ( https://github.com/BenaroyaResearch/gxbrowser )]. We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.
AmeriFlux Network Data Activities: updates, progress and plans
NASA Astrophysics Data System (ADS)
Yang, B.; Boden, T.; Krassovski, M.; Song, X.
2013-12-01
The Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory serves as the long-term data repository for the AmeriFlux network. Datasets currently available include hourly or half-hourly meteorological and flux observations, biological measurement records, and synthesis data products. In this presentation, we provide an update on this network database, including a comprehensive review and evaluation of the biological data from about 70 sites, the development of a new product for flux uncertainty estimates, and the re-formatting of Level-2 standard files. In 2013, we also provided data support to two synthesis studies: the 2012 drought synthesis and the FACE synthesis. Issues related to data quality, and solutions adopted in compiling datasets for these synthesis studies, will be discussed. We will also present our plans for developing and producing other high-level products, such as the derivation of phenology from the available measurements at flux sites.
Visualization of conserved structures by fusing highly variable datasets.
Silverstein, Jonathan C; Chhadia, Ankur; Dech, Fred
2002-01-01
Skill, effort, and time are required to identify and visualize anatomic structures in three dimensions from radiological data. Fundamentally, automating these processes requires a technique that uses symbolic information not in the dynamic range of the voxel data. We have been developing such a technique based on mutual information for automatic multi-modality image fusion (MIAMI Fuse, University of Michigan). This system previously demonstrated facility at fusing one voxel dataset with integrated symbolic structure information to a CT dataset (different scale and resolution) from the same person. The next step in the development of our technique was aimed at accommodating the variability of anatomy from patient to patient by using warping to fuse our standard dataset to arbitrary patient CT datasets. A standard symbolic information dataset was created from the full color Visible Human Female by segmenting the liver parenchyma, portal veins, and hepatic veins and overwriting each set of voxels with a fixed color. Two arbitrarily selected patient CT scans of the abdomen were used as reference datasets. We used the warping functions in MIAMI Fuse to align the standard structure data to each patient scan. The key to successful fusion was the focused use of multiple warping control points that place themselves around the structure of interest automatically; the user assigns only a few initial control points to align the scans. Fusions 1 and 2 transformed the atlas with 27 points around the liver to CT1 and CT2, respectively. Fusion 3 transformed the atlas with 45 control points around the liver to CT1, and Fusion 4 transformed the atlas with 5 control points around the portal vein. The CT dataset is augmented with the transformed standard structure dataset, such that the warped structure masks are visualized in combination with the original patient dataset.
This combined volume visualization is then rendered interactively in stereo on the ImmersaDesk in an immersive Virtual Reality (VR) environment. The accuracy of the fusions was determined qualitatively by comparing the transformed atlas overlaid on the appropriate CT. It was examined for where the transformed structure atlas was incorrectly overlaid (false positive) and where it was incorrectly not overlaid (false negative). According to this method, fusions 1 and 2 were correct roughly 50-75% of the time, while fusions 3 and 4 were correct roughly 75-100%. The CT dataset augmented with transformed dataset was viewed arbitrarily in user-centered perspective stereo taking advantage of features such as scaling, windowing and volumetric region of interest selection. This process of auto-coloring conserved structures in variable datasets is a step toward the goal of a broader, standardized automatic structure visualization method for radiological data. If successful it would permit identification, visualization or deletion of structures in radiological data by semi-automatically applying canonical structure information to the radiological data (not just processing and visualization of the data's intrinsic dynamic range). More sophisticated selection of control points and patterns of warping may allow for more accurate transforms, and thus advances in visualization, simulation, education, diagnostics, and treatment planning.
Soul, Jamie; Hardingham, Timothy E; Boot-Handford, Raymond P; Schwartz, Jean-Marc
2015-01-29
We describe a new method, PhenomeExpress, for the analysis of transcriptomic datasets to identify pathogenic disease mechanisms. Our analysis method includes input from both protein-protein interaction and phenotype similarity networks. This introduces valuable information from disease relevant phenotypes, which aids the identification of sub-networks that are significantly enriched in differentially expressed genes and are related to the disease relevant phenotypes. This contrasts with many active sub-network detection methods, which rely solely on protein-protein interaction networks derived from compounded data of many unrelated biological conditions and which are therefore not specific to the context of the experiment. PhenomeExpress thus exploits readily available animal model and human disease phenotype information. It combines this prior evidence of disease phenotypes with the experimentally derived disease data sets to provide a more targeted analysis. Two case studies, in subchondral bone in osteoarthritis and in Pax5 in acute lymphoblastic leukaemia, demonstrate that PhenomeExpress identifies core disease pathways in both mouse and human disease expression datasets derived from different technologies. We also validate the approach by comparison to state-of-the-art active sub-network detection methods, which reveals how it may enhance the detection of molecular phenotypes and provide a more detailed context to those previously identified as possible candidates.
Zhang, Jian; Gao, Bo; Chai, Haiting; Ma, Zhiqiang; Yang, Guifu
2016-08-26
DNA-binding proteins (DBPs) play fundamental roles in many biological processes. The development of effective computational tools for identifying DBPs is therefore highly desirable. In this study, we propose an accurate method for the prediction of DBPs. First, we focused on the challenge of improving DBP prediction accuracy with information derived solely from the sequence. Second, we used multiple informative features to encode the protein, including the evolutionary conservation profile, secondary structure motifs, and physicochemical properties. Third, we introduced a novel improved Binary Firefly Algorithm (BFA) to remove redundant or noisy features and to select optimal parameters for the classifier. On two benchmark datasets, our predictor outperformed many state-of-the-art predictors, demonstrating the effectiveness of our method. The promising prediction performance on a newly compiled independent testing dataset from PDB and a large-scale dataset from UniProt indicates the good generalization ability of our method. In addition, the BFA developed in this research has great potential for practical applications in optimization, especially in feature selection problems. A highly accurate method was thus proposed for the identification of DBPs, and a user-friendly web server named iDbP (identification of DNA-binding Proteins) was constructed and is provided for academic use.
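To give a feel for how a binary firefly search over feature masks works, here is a compact generic sketch: fireflies move toward brighter (higher-fitness) ones and positions are re-binarized through a sigmoid transfer function. This is a plain binary FA for illustration, not the improved variant described in the paper, and the constants are arbitrary.

```python
import numpy as np

def binary_firefly(fitness, n_features, n_fireflies=8, iters=20, seed=0):
    """Generic Binary Firefly Algorithm sketch for feature selection.

    fitness: callable mapping a boolean mask to a score (higher is better).
    Returns the best mask found and its score.
    """
    rng = np.random.default_rng(seed)
    pos = rng.random((n_fireflies, n_features))            # continuous positions in [0, 1]
    scores = np.array([fitness(p > 0.5) for p in pos])
    best_i = int(np.argmax(scores))
    best_mask, best_score = pos[best_i] > 0.5, float(scores[best_i])
    for _ in range(iters):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if scores[j] > scores[i]:                  # j is brighter, so i moves toward j
                    r2 = float(np.sum((pos[i] - pos[j]) ** 2))
                    beta = 0.9 * np.exp(-r2)               # attractiveness decays with distance
                    pos[i] += beta * (pos[j] - pos[i]) + 0.1 * rng.normal(size=n_features)
                    prob = 1.0 / (1.0 + np.exp(-4.0 * (pos[i] - 0.5)))  # sigmoid transfer
                    mask = rng.random(n_features) < prob   # stochastic binarization
                    scores[i] = fitness(mask)
                    pos[i] = mask.astype(float)
                    if scores[i] > best_score:
                        best_mask, best_score = mask.copy(), float(scores[i])
    return best_mask, best_score
```

In a DBP-prediction pipeline the fitness would typically be cross-validated classifier accuracy on the masked feature set; here any scoring callable can be plugged in.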
NASA Astrophysics Data System (ADS)
Brekke, L. D.; Pruitt, T.; Maurer, E. P.; Duffy, P. B.
2007-12-01
Incorporating climate change information into long-term evaluations of water and energy resources requires analysts to have access to climate projection data that have been spatially downscaled to "basin-relevant" resolution. This is necessary in order to develop system-specific hydrology and demand scenarios consistent with projected climate scenarios. Analysts currently have access to "climate model" resolution data (e.g., at LLNL PCMDI), but not spatially downscaled translations of these datasets. Motivated by a common interest in supporting regional and local assessments, the U.S. Bureau of Reclamation and LLNL (through support from the DOE National Energy Technology Laboratory) have teamed to develop an archive of downscaled climate projections (temperature and precipitation) with geographic coverage consistent with the North American Land Data Assimilation System domain, encompassing the contiguous United States. A web-based information service, hosted at LLNL Green Data Oasis, has been developed to provide Reclamation, LLNL, and other interested analysts free access to archive content. A contemporary statistical method was used to bias-correct and spatially disaggregate projection datasets, and was applied to 112 projections included in the WCRP CMIP3 multi-model dataset hosted by LLNL PCMDI (i.e. 16 GCMs and their multiple simulations of SRES A2, A1b, and B1 emissions pathways).
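The bias-correction half of the statistical downscaling method mentioned above can be illustrated with empirical quantile mapping: each future model value is assigned its quantile within the model's historical distribution and replaced by the observed value at that quantile. The numpy sketch below is a simplified stand-in, not the archive's actual code.

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_future):
    """Empirical quantile-mapping bias correction (simplified sketch).

    model_hist: historical model values; obs_hist: observations over the
    same period; model_future: model values to correct. Each future value
    keeps its quantile rank but takes the observed distribution's value.
    """
    model_hist = np.sort(np.asarray(model_hist, dtype=float))
    obs_hist = np.asarray(obs_hist, dtype=float)
    q = np.searchsorted(model_hist, model_future, side="right") / len(model_hist)
    return np.quantile(obs_hist, np.clip(q, 0.0, 1.0))
```

A constant model bias, for instance, is removed entirely because the mapping is built from matched quantiles rather than raw differences; the BCSD method then spatially disaggregates the corrected fields, a step not shown here.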
ISO, FGDC, DIF and Dublin Core - Making Sense of Metadata Standards for Earth Science Data
NASA Astrophysics Data System (ADS)
Jones, P. R.; Ritchey, N. A.; Peng, G.; Toner, V. A.; Brown, H.
2014-12-01
Metadata standards provide common definitions of metadata fields for information exchange across user communities. Despite the broad adoption of metadata standards for Earth science data, there are still heterogeneous and incompatible representations of information due to differences between the many standards in use and how each standard is applied. Federal agencies are required to manage and publish metadata in different metadata standards and formats for various data catalogs. In 2014, the NOAA National Climatic Data Center (NCDC) managed metadata for its scientific datasets in ISO 19115-2 in XML, GCMD Directory Interchange Format (DIF) in XML, DataCite Schema in XML, Dublin Core in XML, and Data Catalog Vocabulary (DCAT) in JSON, with more standards and profiles of standards planned. Of these standards, the ISO 19115-series metadata is the most complete and feature-rich, and for this reason it is used by NCDC as the source for the other metadata standards. We will discuss the capabilities of metadata standards and how these standards are being implemented to document datasets. Successful implementations include developing translations and displays using XSLTs, creating links to related data and resources, documenting dataset lineage, and establishing best practices. Benefits, gaps, and challenges will be highlighted with suggestions for improved approaches to metadata storage and maintenance.
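The single-source translation pattern described above (one authoritative ISO record, translated per target standard) amounts to a field crosswalk. The sketch below uses deliberately simplified field names; the real ISO 19115 and DIF element paths are far more structured and are normally handled with XSLT rather than Python.

```python
# Illustrative crosswalk only: these keys are simplified stand-ins,
# not actual ISO 19115 or DIF element paths.
ISO_TO_DIF = {
    "title": "Entry_Title",
    "abstract": "Summary",
    "keywords": "Keywords",
    "lineage": "Quality",
}

def iso_to_dif(iso_record):
    """Translate a simplified ISO-style record into DIF-style fields,
    mirroring the one-source-many-targets translation approach."""
    return {dif: iso_record[iso]
            for iso, dif in ISO_TO_DIF.items() if iso in iso_record}
```

Keeping ISO as the single source means each downstream format (DIF, DataCite, Dublin Core, DCAT) needs only its own crosswalk, and edits are made in one place.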
This EnviroAtlas dataset includes analysis by NatureServe of species that are Imperiled (G1/G2) or Listed under the U.S. Endangered Species Act (ESA) by 12-digit Hydrologic Units (HUCs). The analysis results are for use and publication by both the LandScope America website and by the EnviroAtlas. Results are provided for the total number of Aquatic Associated G1-G2/ESA species, the total number of Wetland Associated G1-G2/ESA species, the total number of Terrestrial Associated G1-G2/ESA species, and the total number of Unknown Habitat Association G1-G2/ESA species in each HUC12. NatureServe is a non-profit organization dedicated to developing and providing information about the world's plants, animals, and ecological communities. NatureServe works in partnership with 82 independent Natural Heritage programs and Conservation Data Centers that gather scientific information on rare species and ecosystems in the United States, Latin America, and Canada (the Natural Heritage Network). NatureServe is a leading source for biodiversity information that is essential for effective conservation action. This dataset was produced by NatureServe to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data
A current landscape of provincial perinatal data collection in Canada.
Massey, Kiran A; Magee, Laura A; Dale, Sheryll; Claydon, Jennifer; Morris, Tara J; von Dadelszen, Peter; Liston, Robert M; Ansermino, J Mark
2009-03-01
The Canadian Perinatal Network (CPN) was launched in 2005 as a national perinatal database project designed to identify best practices in maternity care. The inaugural project of CPN is focused on interventions that optimize maternal and perinatal outcomes in women with threatened preterm birth at 22+0 to 28+6 weeks' gestation. Our objective was to examine existing data collection by perinatal health programs (PHPs) to inform decisions about shared data collection and CPN database construction. We reviewed the database manuals and websites of all Canadian PHPs and compiled a list of data fields and their definitions. We compared these fields and definitions with those of CPN and the Canadian Minimal Dataset, proposed as a common dataset by the Canadian Perinatal Programs Coalition of Canadian PHPs. PHPs collect information on two-thirds of deliveries in Canada. PHPs consistently collect information on maternal demographics (including both maternal and neonatal personal identifiers), past obstetrical history, maternal lifestyle, aspects of labour and delivery, and basic neonatal outcomes. However, most PHPs collect insufficient data to enable identification of obstetric (and neonatal) practices associated with improved maternal and perinatal outcomes. In addition, there is between-PHP variability in the definitions of many data fields. Construction of a separate CPN database was therefore needed, although data field definitions were harmonized with those of the proposed Canadian Minimal Dataset to plan for future shared data collection. This convergence should be the goal of researchers and clinicians alike as we construct a common language for electronic health records.
NASA Astrophysics Data System (ADS)
Baru, C.; Arrowsmith, R.; Crosby, C.; Nandigam, V.; Phan, M.; Cowart, C.
2012-04-01
OpenTopography is a cyberinfrastructure-based facility for online access to high-resolution topography and tools. The project is an outcome of the Geosciences Network (GEON) project, which was a research project funded several years ago in the US to investigate the use of cyberinfrastructure to support research and education in the geosciences. OpenTopography provides online access to large LiDAR point cloud datasets along with services for processing these data. Users are able to generate custom DEMs by invoking DEM services provided by OpenTopography with custom parameter values. Users can track the progress of their jobs, and a private myOpenTopo area retains job information and job outputs. Data available at OpenTopography are provided by a variety of data acquisition groups under joint agreements and memoranda of understanding (MoU). These include national facilities such as the National Center for Airborne Lidar Mapping, as well as local, state, and federal agencies. OpenTopography is also being designed as a hub for high-resolution topography resources. Datasets and services available at other locations can also be registered here, providing a "one-stop shop" for such information. We will describe the OpenTopography system architecture and its current set of features, including the service-oriented architecture, a job-tracking database, and social networking features. We will also describe several design and development activities underway to archive and publish datasets using digital object identifiers (DOIs); create a more flexible and scalable high-performance environment for processing of large datasets; extend support for satellite-based and terrestrial lidar as well as synthetic aperture radar (SAR) data; and create a "pluggable" infrastructure for third-party services. OpenTopography has successfully created a facility for sharing lidar data. 
In the next phase, we are developing a facility that will also enable equally easy and successful sharing of services related to these data.
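At its simplest, the custom-DEM generation described above reduces to gridding scattered lidar returns into cells. The local-binning sketch below is a toy illustration of that idea, not OpenTopography's actual processing services, which offer more sophisticated interpolation.

```python
import numpy as np

def grid_dem(x, y, z, cell=1.0):
    """Grid scattered point-cloud returns into a DEM by per-cell averaging.

    x, y: horizontal coordinates; z: elevations; cell: grid spacing in the
    same units as x and y. Cells containing no points come out as NaN.
    """
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    ix = ((x - x.min()) / cell).astype(int)
    iy = ((y - y.min()) / cell).astype(int)
    nx, ny = ix.max() + 1, iy.max() + 1
    total = np.zeros((ny, nx))
    count = np.zeros((ny, nx))
    np.add.at(total, (iy, ix), z)   # accumulate elevations per cell
    np.add.at(count, (iy, ix), 1)   # count returns per cell
    with np.errstate(invalid="ignore"):
        return total / count        # mean elevation; NaN where count == 0
```

A user-facing service would add coordinate-system handling, choice of interpolation (e.g. TIN or IDW rather than plain binning), and tiling for large point clouds.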
Biswas, Mithun; Islam, Rafiqul; Shom, Gautam Kumar; Shopon, Md; Mohammed, Nabeel; Momen, Sifat; Abedin, Anowarul
2017-06-01
BanglaLekha-Isolated, a Bangla handwritten isolated-character dataset, is presented in this article. This dataset contains 84 different characters, comprising 50 Bangla basic characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting samples for each of the 84 characters were collected, digitized and pre-processed. After discarding mistakes and scribbles, 166,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and the gender of the subjects from whom the samples were collected. This dataset could be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.
Software Framework for Development of Web-GIS Systems for Analysis of Georeferenced Geophysical Data
NASA Astrophysics Data System (ADS)
Okladnikov, I.; Gordov, E. P.; Titov, A. G.
2011-12-01
Georeferenced datasets (meteorological databases, modeling and reanalysis results, remote sensing products, etc.) are currently actively used in numerous applications, including modeling, interpretation and forecast of climatic and ecosystem changes on various spatial and temporal scales. Due to the inherent heterogeneity of environmental datasets, as well as their size, which may reach tens of terabytes for a single dataset, present-day studies of climate and environmental change require special software support. A dedicated software framework for the rapid development of information-computational systems providing such support, based on Web-GIS technologies, has been created. The software framework consists of three basic parts: a computational kernel developed using ITTVIS Interactive Data Language (IDL), a set of PHP controllers run within a specialized web portal, and a JavaScript class library for the development of typical components of web-mapping application graphical user interfaces (GUIs) based on AJAX technology. The computational kernel comprises a number of modules for dataset access, mathematical and statistical data analysis, and visualization of results. The specialized web portal consists of the Apache web server, the OGC-compliant GeoServer software, which is used as the basis for presenting cartographical information over the Web, and a set of PHP controllers implementing the web-mapping application logic and governing the computational kernel. The JavaScript library for graphical user interface development is based on the GeoExt library, combining the ExtJS framework and OpenLayers software. Based on this software framework, an information-computational system for complex analysis of large georeferenced data archives was developed.
Structured environmental datasets available for processing now include two editions of the NCEP/NCAR Reanalysis, the JMA/CRIEPI JRA-25 Reanalysis, the ECMWF ERA-40 Reanalysis, the ECMWF ERA-Interim Reanalysis, the MRI/JMA APHRODITE Water Resources Project Reanalysis, meteorological observational data for the territory of the former USSR for the 20th century, and others. The current version of the system is already being used in scientific research; in particular, it was recently applied to the analysis of climate change in Siberia and its impact on the region. The software framework presented allows rapid development of Web-GIS systems for geophysical data analysis, thus providing specialists involved in multidisciplinary research projects with reliable and practical instruments for the complex analysis of climate and ecosystem changes on global and regional scales. This work is partially supported by RFBR grants #10-07-00547, #11-05-01190, and SB RAS projects 4.31.1.5, 4.31.2.7, 4, 8, 9, 50 and 66.
So many genes, so little time: A practical approach to divergence-time estimation in the genomic era
2018-01-01
Phylogenomic datasets have been successfully used to address questions involving evolutionary relationships, patterns of genome structure, signatures of selection, and gene and genome duplications. However, despite the recent explosion in genomic and transcriptomic data, the utility of these data sources for efficient divergence-time inference remains unexamined. Phylogenomic datasets pose two distinct problems for divergence-time estimation: (i) the volume of data makes inference of the entire dataset intractable, and (ii) the extent of underlying topological and rate heterogeneity across genes makes model mis-specification a real concern. “Gene shopping”, wherein a phylogenomic dataset is winnowed to a set of genes with desirable properties, represents an alternative approach that holds promise in alleviating these issues. We implemented an approach for phylogenomic datasets (available in SortaDate) that filters genes by three criteria: (i) clock-likeness, (ii) reasonable tree length (i.e., discernible information content), and (iii) least topological conflict with a focal species tree (presumed to have already been inferred). Such a winnowing procedure ensures that errors associated with model (both clock and topology) mis-specification are minimized, therefore reducing error in divergence-time estimation. We demonstrated the efficacy of this approach through simulation and applied it to published animal (Aves, Diplopoda, and Hymenoptera) and plant (carnivorous Caryophyllales, broad Caryophyllales, and Vitales) phylogenomic datasets. By quantifying rate heterogeneity across both genes and lineages we found that every empirical dataset examined included genes with clock-like, or nearly clock-like, behavior. Moreover, many datasets had genes that were clock-like, exhibited reasonable evolutionary rates, and were mostly compatible with the species tree. 
We identified overlap in age estimates when analyzing these filtered genes under strict clock and uncorrelated lognormal (UCLN) models. However, this overlap was often due to imprecise estimates from the UCLN model. We find that “gene shopping” can be an efficient approach to divergence-time inference for phylogenomic datasets that may otherwise be characterized by extensive gene tree heterogeneity. PMID:29772020
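The three-filter winnowing described above can be sketched in a few lines. The record fields, thresholds, and gene data below are hypothetical illustrations of the idea, not SortaDate's actual interface or defaults.

```python
def shop_genes(genes, max_rate_var=0.1, min_tree_len=0.5, max_conflict=0.2):
    """Filter gene trees by the three 'gene shopping' criteria.

    Each gene is a dict with (hypothetical) fields:
      rate_var -- variance of root-to-tip path lengths (clock-likeness proxy)
      tree_len -- total tree length (information-content proxy)
      conflict -- fraction of bipartitions conflicting with the species tree
    Returns the genes passing all three filters, most clock-like first.
    """
    kept = [g for g in genes
            if g["rate_var"] <= max_rate_var      # (i) clock-like
            and g["tree_len"] >= min_tree_len     # (ii) discernible signal
            and g["conflict"] <= max_conflict]    # (iii) agrees with species tree
    return sorted(kept, key=lambda g: g["rate_var"])

genes = [
    {"name": "g1", "rate_var": 0.05, "tree_len": 1.2, "conflict": 0.10},
    {"name": "g2", "rate_var": 0.30, "tree_len": 2.0, "conflict": 0.00},  # not clock-like
    {"name": "g3", "rate_var": 0.02, "tree_len": 0.9, "conflict": 0.40},  # conflicts
    {"name": "g4", "rate_var": 0.08, "tree_len": 0.8, "conflict": 0.15},
]
picked = shop_genes(genes)
print([g["name"] for g in picked])  # ['g1', 'g4']
```

The divergence-time analysis itself would then be run on only the selected genes, under a strict or relaxed clock as described above.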
Unsupervised multiple kernel learning for heterogeneous data integration.
Mariette, Jérôme; Villa-Vialaneix, Nathalie
2018-03-15
Recent high-throughput sequencing advances have expanded the breadth of available omics datasets, and the integrated analysis of multiple datasets obtained on the same samples has yielded important insights in a wide range of applications. However, the integration of various sources of information remains a challenge for systems biology, since the datasets produced are often of heterogeneous types, requiring generic methods that take their different specificities into account. We propose a multiple kernel framework that allows the integration of multiple datasets of various types into a single exploratory analysis. Several solutions are provided to learn either a consensus meta-kernel or a meta-kernel that preserves the original topology of the datasets. We applied our framework to analyse two public multi-omics datasets. First, the multiple metagenomic datasets collected during the TARA Oceans expedition were explored to demonstrate that our method is able to retrieve previous findings in a single kernel PCA as well as to provide a new image of the sample structures when a larger number of datasets are included in the analysis. For this analysis, a generic procedure is also proposed to improve the interpretability of the kernel PCA with respect to the original data. Second, the multi-omics breast cancer datasets provided by The Cancer Genome Atlas are analysed using kernel Self-Organizing Maps with both single- and multi-omics strategies. The comparison of these two approaches demonstrates the benefit of our integration method for improving the representation of the studied biological system. The proposed methods are available in the R package mixKernel, released on CRAN. It is fully compatible with the mixOmics package, and a tutorial describing the approach can be found on the mixOmics web site http://mixomics.org/mixkernel/. jerome.mariette@inra.fr or nathalie.villa-vialaneix@inra.fr. Supplementary data are available at Bioinformatics online.
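As a rough illustration of the consensus meta-kernel idea (not mixKernel's actual algorithm, which learns the combination weights by optimization), one can average cosine-normalized kernels computed from each data type and run a kernel PCA on the result:

```python
import numpy as np

def cosine_normalize(K):
    """Normalize a kernel matrix so its diagonal entries are 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def consensus_kernel(kernels, weights=None):
    """Combine several same-size kernel matrices into one meta-kernel."""
    if weights is None:
        weights = np.ones(len(kernels)) / len(kernels)
    return sum(w * cosine_normalize(K) for w, K in zip(weights, kernels))

def kernel_pca(K, n_components=2):
    """Project samples onto the leading eigenvectors of the centered kernel."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(6, 4)), rng.normal(size=(6, 3))  # two "omics" views
K = consensus_kernel([X1 @ X1.T, X2 @ X2.T])
scores = kernel_pca(K)
print(scores.shape)  # (6, 2)
```

The cosine normalization puts kernels built from datasets of very different scales on a comparable footing before they are combined.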
Smith, Stephen A; Brown, Joseph W; Walker, Joseph F
2018-01-01
Gon, Giorgia; Restrepo-Méndez, María Clara; Campbell, Oona M R; Barros, Aluísio J D; Woodd, Susannah; Benova, Lenka; Graham, Wendy J
2016-01-01
Hygiene during childbirth is essential to the health of mothers and newborns, irrespective of where birth takes place. This paper investigates the status of water and sanitation in both the home and facility childbirth environments, and for whom and where this is a more significant problem. We used three datasets: a global dataset with information on the home environment from 58 countries, and two datasets for each of four countries in Eastern Africa: a healthcare facility dataset, and a dataset that incorporated information on facilities and the home environment to create a comprehensive description of birth environments in those countries. We constructed indices of improved water, and of improved water and sanitation combined (WATSAN), for the home and for healthcare facilities. The Joint Monitoring Program definitions were used to construct the household indices; we tailored them to the facility context, so the household and facility indices include different components. We described what proportion of women delivered in an environment with improved WATSAN. For those women who delivered at home, we calculated what proportion had improved WATSAN by socio-economic status, education and rural-urban status. Among women delivering at home (58 countries), coverage of improved WATSAN by region varied from 9% to 53%. Fewer than 15% of women who delivered at home in Sub-Saharan Africa had access to water and sanitation infrastructure (range 0.1% to 37%). Coverage was worse among the poorest, the less educated and those living in rural areas. In Eastern Africa, where we examined both the home and facility childbirth environment, a third of women delivered in an environment with improved water in Uganda and Rwanda, whereas 18% of women in Kenya and 7% in Tanzania delivered with improved water and sanitation. Across the four countries, less than half of the facility deliveries had improved water, or improved water and sanitation, in the childbirth environment.
Access to water and sanitation during childbirth is poor across low and middle-income countries. Even when women travel to health facilities for childbirth, they are not guaranteed access to basic WATSAN infrastructure. These indicators should be measured routinely in order to inform improvements.
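A minimal sketch of how such composite indices can be computed from survey records follows. The category names and the two-level classification are hypothetical, loosely inspired by the Joint Monitoring Program ladders rather than the paper's exact definitions.

```python
# Hypothetical component categories, loosely following the JMP ladders.
IMPROVED_WATER = {"piped", "borehole", "protected well", "rainwater"}
IMPROVED_SANITATION = {"flush toilet", "vip latrine", "pit latrine with slab"}

def watsan_status(water_source, sanitation_type):
    """Classify one birth environment as having improved water / improved WATSAN."""
    water_ok = water_source in IMPROVED_WATER
    san_ok = sanitation_type in IMPROVED_SANITATION
    return {"water": water_ok, "watsan": water_ok and san_ok}

def coverage(records):
    """Share of records with improved water, and improved water+sanitation combined."""
    n = len(records)
    statuses = [watsan_status(w, s) for w, s in records]
    return (sum(st["water"] for st in statuses) / n,
            sum(st["watsan"] for st in statuses) / n)

births = [("piped", "flush toilet"), ("surface water", "open defecation"),
          ("borehole", "open defecation"), ("piped", "pit latrine with slab")]
water_cov, watsan_cov = coverage(births)
print(water_cov, watsan_cov)  # 0.75 0.5
```

Stratifying the same computation by wealth quintile, education, or rural-urban status reproduces the kind of equity breakdown reported above.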
Internationally coordinated glacier monitoring: strategy and datasets
NASA Astrophysics Data System (ADS)
Hoelzle, Martin; Armstrong, Richard; Fetterer, Florence; Gärtner-Roer, Isabelle; Haeberli, Wilfried; Kääb, Andreas; Kargel, Jeff; Nussbaumer, Samuel; Paul, Frank; Raup, Bruce; Zemp, Michael
2014-05-01
Internationally coordinated monitoring of long-term glacier changes provides key indicator data about global climate change; it began in 1894 as an internationally coordinated effort to establish standardized observations. Today, world-wide monitoring of glaciers and ice caps is embedded within the Global Climate Observing System (GCOS) in support of the United Nations Framework Convention on Climate Change (UNFCCC) as an important Essential Climate Variable (ECV). The Global Terrestrial Network for Glaciers (GTN-G) was established in 1999 with the task of coordinating measurements and ensuring the continuous development and adaptation of the international strategies to the long-term needs of users in science and policy. The basic monitoring principles must be relevant, feasible, comprehensive and understandable to the wider scientific community as well as to policy makers and the general public. Data access has to be free and unrestricted, the quality of the standardized and calibrated data must be high, and a combination of detailed process studies at selected field sites with global coverage by satellite remote sensing is envisaged. Recently a GTN-G Steering Committee was established to guide and advise the operational bodies responsible for international glacier monitoring: the World Glacier Monitoring Service (WGMS), the US National Snow and Ice Data Center (NSIDC), and the Global Land Ice Measurements from Space (GLIMS) initiative. Several online databases containing a wealth of diverse data types, with different levels of detail and global coverage, provide fast access to continuously updated information on glacier fluctuation and inventory data.
For world-wide inventories, data are now available through (a) the World Glacier Inventory containing tabular information of about 130,000 glaciers covering an area of around 240,000 km2, (b) the GLIMS-database containing digital outlines of around 118,000 glaciers with different time stamps and (c) the Randolph Glacier Inventory (RGI), a new and globally complete digital dataset of outlines from about 180,000 glaciers with some meta-information, which has been used for many applications relating to the IPCC AR5 report. Concerning glacier changes, a database (Fluctuations of Glaciers) exists containing information about mass balance, front variations including past reconstructed time series, geodetic changes and special events. Annual mass balance reporting contains information for about 125 glaciers, with a subset of 37 glaciers with continuous observational series since 1980 or earlier. Front variation observations of around 1800 glaciers are available from most of the mountain ranges world-wide. This database was recently updated with 26 glaciers having an unprecedented dataset of length changes from reconstructions of well-dated historical evidence going back as far as the 16th century. Geodetic observations of about 430 glaciers are available. The database is completed by a dataset containing information on special events, including glacier surges, glacier lake outbursts, ice avalanches, eruptions of ice-clad volcanoes, etc., related to about 200 glaciers. A special database of glacier photographs contains 13,000 pictures from around 500 glaciers, some of them dating back to the 19th century. A key challenge is to combine and extend the traditional observations with fast evolving datasets from new technologies.
Scene text detection by leveraging multi-channel information and local context
NASA Astrophysics Data System (ADS)
Wang, Runmin; Qian, Shengyou; Yang, Jianfeng; Gao, Changxin
2018-03-01
As an important information carrier, text plays a significant role in many applications. However, text detection in unconstrained scenes is a challenging problem due to cluttered backgrounds, varied appearances, uneven illumination, etc. In this paper, an approach based on multi-channel information and local context is proposed to detect text in natural scenes. Because character candidate detection plays a vital role in a text detection system, Maximally Stable Extremal Regions (MSERs) and a graph-cut based method are integrated to obtain character candidates by leveraging multi-channel image information. A cascaded false-positive elimination mechanism is constructed at the character and text-line levels respectively. Since local context information is very valuable, it is utilized to retrieve missing characters and thereby boost text detection performance. Experimental results on two benchmark datasets, i.e., the ICDAR 2011 dataset and the ICDAR 2013 dataset, demonstrate that the proposed method achieves state-of-the-art performance.
Unsupervised learning on scientific ocean drilling datasets from the South China Sea
NASA Astrophysics Data System (ADS)
Tse, Kevin C.; Chiu, Hon-Chim; Tsang, Man-Yin; Li, Yiliang; Lam, Edmund Y.
2018-06-01
Unsupervised learning methods were applied to explore data patterns in multivariate geophysical datasets collected from ocean floor sediment core samples coming from scientific ocean drilling in the South China Sea. Compared to studies on similar datasets, but using supervised learning methods which are designed to make predictions based on sample training data, unsupervised learning methods require no a priori information and focus only on the input data. In this study, popular unsupervised learning methods including K-means, self-organizing maps, hierarchical clustering and random forest were coupled with different distance metrics to form exploratory data clusters. The resulting data clusters were externally validated with lithologic units and geologic time scales assigned to the datasets by conventional methods. Compact and connected data clusters displayed varying degrees of correspondence with existing classification by lithologic units and geologic time scales. K-means and self-organizing maps were observed to perform better with lithologic units while random forest corresponded best with geologic time scales. This study sets a pioneering example of how unsupervised machine learning methods can be used as an automatic processing tool for the increasingly high volume of scientific ocean drilling data.
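The workflow of clustering without a priori labels and then validating the clusters externally against an existing classification can be sketched as follows. The two-cluster K-means and the purity score are a simplified stand-in for the methods and validation metrics actually used in the study.

```python
import numpy as np

def kmeans2(X, iters=25):
    """Minimal two-cluster Lloyd's algorithm with a deterministic farthest-point init."""
    centers = np.stack([X[0], X[np.linalg.norm(X - X[0], axis=1).argmax()]])
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.stack([X[labels == j].mean(axis=0) for j in (0, 1)])
    return labels

def purity(labels, truth):
    """External validation: fraction of samples in their cluster's majority class."""
    return sum(np.bincount(truth[labels == j]).max() for j in set(labels)) / len(truth)

rng = np.random.default_rng(1)
# Two synthetic "lithologic units", well separated in measurement space
X = np.vstack([rng.normal(0, 0.3, (20, 3)), rng.normal(3, 0.3, (20, 3))])
truth = np.array([0] * 20 + [1] * 20)
labels = kmeans2(X)
print(purity(labels, truth))  # 1.0 for well-separated clusters
```

The external-validation step is the key point: the clusters are formed with no knowledge of the lithologic units, and only afterwards compared against them.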
Ingwersen, Peter; Chavan, Vishwas
2011-01-01
A professional recognition mechanism is required to encourage expedited publishing of an adequate volume of 'fit-for-use' biodiversity data. As a component of such a recognition mechanism, we propose the development of a Data Usage Index (DUI) to demonstrate to data publishers that their efforts in creating biodiversity datasets have impact, by being accessed and used by a wide spectrum of user communities. We propose, and give examples of, a range of 14 absolute and normalized biodiversity dataset usage indicators for the development of a DUI based on search events and dataset download instances. The DUI is proposed to include relative as well as species-profile-weighted comparative indicators. We believe that, in addition to providing recognition to the data publisher and all players involved in the data life cycle, a DUI will also provide much-needed and novel insight into how users use primary biodiversity data. A DUI consisting of a range of usage indicators obtained from the GBIF network and other relevant access points is within reach. The usage of biodiversity datasets leads to the development of a family of indicators in line with well-known citation-based measurements of recognition.
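A sketch of what absolute and normalized usage indicators might look like in practice. The indicator names and the normalization by record count are illustrative assumptions, not the paper's 14 proposed indicators.

```python
def usage_indicators(datasets):
    """Compute simple absolute and normalized usage indicators per dataset.

    `datasets` maps a dataset id to (search_events, downloads, n_records);
    normalizing by record count is one of several possible choices.
    """
    out = {}
    for ds, (searches, downloads, n_records) in datasets.items():
        out[ds] = {
            "searches": searches,                                      # absolute
            "downloads": downloads,                                    # absolute
            "downloads_per_1k_records": 1000 * downloads / n_records,  # normalized
            "conversion": downloads / searches if searches else 0.0,
        }
    return out

stats = usage_indicators({
    "herbarium-A": (500, 50, 10_000),
    "bird-obs-B": (200, 40, 2_000),
})
print(stats["bird-obs-B"]["downloads_per_1k_records"])  # 20.0
```

Normalized indicators of this kind let a small, heavily used dataset register comparable impact to a large one, which is the comparative spirit of the proposed DUI.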
Innovations in user-defined analysis: dynamic grouping and customized user datasets in VistaPHw.
Solet, David; Glusker, Ann; Laurent, Amy; Yu, Tianji
2006-01-01
Flexible, ready access to community health assessment data is a feature of innovative Web-based data query systems. An example is VistaPHw, which provides access to Washington state data and statistics used in community health assessment. Because of its flexible analysis options, VistaPHw customizes local, population-based results to be relevant to public health decision-making. The advantages of two innovations, dynamic grouping and the Custom Data Module, are described. Dynamic grouping permits the creation of user-defined aggregations of geographic areas, age groups, race categories, and years. Standard VistaPHw measures such as rates, confidence intervals, and other statistics may then be calculated for the new groups. Dynamic grouping has provided data for major, successful grant proposals, building partnerships with local governments and organizations, and informing program planning for community organizations. The Custom Data Module allows users to prepare virtually any dataset so it may be analyzed in VistaPHw. Uses for this module may include datasets too sensitive to be placed on a Web server or datasets that are not standardized across the state. Limitations and other system needs are also discussed.
EnviroAtlas - Minneapolis/St. Paul, MN - 51m Riparian Buffer Vegetated Cover
This EnviroAtlas dataset describes the percentage of a 51-m riparian buffer that is vegetated. In this community, vegetated cover is defined as Trees and Forest, Grass and Herbaceous, Woody Wetlands, and Emergent Wetlands. There is a potential for decreased water quality in areas where the riparian buffer is less vegetated. The displayed line represents the center of the analyzed riparian buffer. The water bodies analyzed include hydrologically connected streams, rivers, connectors, reservoirs, lakes/ponds, ice masses, washes, locks, and rapids within the EnviroAtlas community area. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
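As a toy illustration of the underlying computation (not the EPA's actual processing, which operates on full-resolution land cover data), the percent vegetated cover within a stream buffer can be computed on a land-cover grid like this; the class codes are hypothetical.

```python
import numpy as np

VEGETATED = {41, 71, 90, 95}  # hypothetical codes: forest, herbaceous, woody/emergent wetland

def dilate(mask, r):
    """Grow a boolean mask by r cells in the four cardinal directions (pure numpy)."""
    out = mask.copy()
    for _ in range(r):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]
        grown[:-1, :] |= out[1:, :]
        grown[:, 1:] |= out[:, :-1]
        grown[:, :-1] |= out[:, 1:]
        out = grown
    return out

landcover = np.array([
    [41, 41, 21, 21],
    [41, 11, 21, 21],
    [71, 11, 90, 21],
    [71, 71, 90, 21],
])
stream = landcover == 11                   # code 11 = open water (hypothetical)
buffer_zone = dilate(stream, 1) & ~stream  # cells within 1 cell of the stream
veg = np.isin(landcover, list(VEGETATED))
pct = 100 * (veg & buffer_zone).sum() / buffer_zone.sum()
print(round(pct, 1))  # 83.3
```

In the real dataset the buffer is 51 m on each side of the stream centerline, so the dilation radius would be chosen from the raster cell size rather than fixed at one cell.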
Kulla, M; Friess, M; Schellinger, P D; Harth, A; Busse, O; Walcher, F; Helm, M
2015-12-01
The dataset "Emergency Department" of the German Interdisciplinary Association of Critical Care and Emergency Medicine (DIVI) was developed during several expert meetings. Its goal is an all-encompassing documentation of the early clinical treatment of patients in emergency departments. Using the index disease acute ischemic stroke as an example, the aim was to analyze to what extent this goal has been achieved. In this study, German, European and US American guidelines were used to analyze the extent to which the dataset covers current emergency department guidelines and recommendations from professional societies. In addition, it was examined whether the dataset includes recommended quality indicators (QI) for quality management (QM), and, in a third step, to what extent national provisions for billing are included. In each case a differentiation was made as to whether the respective rationale was primary, i.e. directly apparent, or merely secondarily derived through expertise. In the evaluation, an additional differentiation was made between the level of recommendations and further quality-relevant criteria. The modular design of the emergency department dataset, comprising 676 data fields, is briefly described. A total of 401 individual fields, divided into basic documentation, monitoring and specific neurological documentation of the treatment of stroke patients, were considered. For 247 data fields a rationale was found. With partial overlap, 78.9 % of the 214 medical recommendations in 3 guidelines and 85.8 % of the 106 identified quality indicators were primarily covered. Of the 67 requirements for billing of services, 55.5 % are primarily part of the emergency department dataset. With appropriate expertise and documentation by a board-certified neurologist, these results can be improved to almost 100 %.
The index disease stroke illustrates that the DIVI emergency department dataset covers the medical guidelines, including 100 % of the German guideline recommendations that carry a grade of recommendation. All information necessary to document specialized stroke treatment in the German diagnosis-related groups (DRG) system is also covered. The dataset is also suitable as a documentation tool for quality management, for example to participate in the registry of the German Stroke Society (ADSR). Best results are obtained if the dataset is applied by a physician specialized in the treatment of stroke patients (e.g. a board-certified neurologist). Finally, the results show that changes in medical guidelines, recommendations for quality management and billing-relevant content should be incorporated in the development of documentation datasets to avoid duplicate documentation.
Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval
Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene
2018-01-01
The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user's query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie PMID:29688379
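The keyword-boosting idea can be illustrated with a toy tf-idf ranker. The scoring function, boost weights and example documents below are invented for illustration and are far simpler than the challenge systems.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, doc, doc_sets, boost=None):
    """tf-idf score of `doc` for `query`; `boost` multiplies chosen key terms."""
    boost = boost or {}
    n = len(doc_sets)
    toks = tokenize(doc)
    total = 0.0
    for t in tokenize(query):
        df = sum(t in s for s in doc_sets)          # document frequency
        idf = math.log((n + 1) / (df + 1)) + 1      # smoothed inverse doc frequency
        total += toks.count(t) * idf * boost.get(t, 1.0)
    return total

docs = ["RNA-seq dataset of mouse liver tissue",
        "survey data on hospital staffing levels",
        "liver proteomics dataset from human samples"]
doc_sets = [set(tokenize(d)) for d in docs]
query = "gene expression dataset for liver tissue in mouse"
boost = {"liver": 3.0, "mouse": 3.0}  # boosted keywords of the verbose query
ranked = sorted(docs, key=lambda d: -score(query, d, doc_sets, boost))
print(ranked[0])  # RNA-seq dataset of mouse liver tissue
```

Boosting the informative terms pulls the most specific dataset description to the top even though all query terms are weighted equally in the unboosted score.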
Ziatdinov, Maxim; Dyck, Ondrej; Maksov, Artem; ...
2017-12-07
Recent advances in scanning transmission electron and scanning probe microscopies have opened unprecedented opportunities for probing the structural parameters and functional properties of materials in real space with angstrom-level precision. This progress has been accompanied by an exponential increase in the size and quality of datasets produced by microscopic and spectroscopic experimental techniques. These developments necessitate adequate methods for extracting relevant physical and chemical information from large datasets, for which a priori information on the structures of various atomic configurations and lattice defects is limited or absent. Here we demonstrate an application of deep neural networks to extracting information from atomically resolved images, including the location of atomic species and the type of defects. We develop a "weakly supervised" approach that uses information on the coordinates of all atomic species in the image, extracted via a deep neural network, to identify a rich variety of defects that are not part of the initial training set. We further apply our approach to interpret complex atomic and defect transformations, including switching between different coordinations of silicon dopants in graphene as a function of time, the formation of a peculiar silicon dimer with mixed 3-fold and 4-fold coordination, and the motion of a molecular "rotor". In conclusion, this deep learning based approach resembles the logic of a human operator but can be scaled up, leading to a significant shift in the way information is extracted and analyzed from raw experimental data.
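The paper's atom-finding step relies on deep fully convolutional networks trained on simulated images; as a much simpler stand-in, the kind of output such a network produces can be mimicked by thresholded local-maxima detection on a synthetic micrograph:

```python
import numpy as np

def find_atoms(image, threshold=0.5):
    """Toy atom localization: pixels brighter than the threshold that are local
    maxima in their 3x3 neighborhood. This is a simplified stand-in for the
    paper's deep-network pixel-wise classification, not the actual method."""
    padded = np.pad(image, 1, constant_values=-np.inf)
    coords = []
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            patch = padded[r:r + 3, c:c + 3]
            if image[r, c] >= threshold and image[r, c] == patch.max():
                coords.append((r, c))
    return coords

# Synthetic micrograph: two Gaussian "atomic columns"
yy, xx = np.mgrid[0:32, 0:32]
img = (np.exp(-((yy - 8) ** 2 + (xx - 8) ** 2) / 4)
       + np.exp(-((yy - 20) ** 2 + (xx - 24) ** 2) / 4))
atoms = find_atoms(img)
print(atoms)  # [(8, 8), (20, 24)]
```

The deep-network approach replaces this brittle heuristic with a learned pixel classifier that remains robust to noise, overlapping columns, and varying contrast.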
DOE Office of Scientific and Technical Information (OSTI.GOV)
NASA Astrophysics Data System (ADS)
Escarzaga, S. M.; Cody, R. P.; Gaylord, A. G.; Kassin, A.; Barba, M.; Aiken, Q.; Nelson, L.; Mazza Ramsay, F. D.; Tweedie, C. E.
2016-12-01
The Barrow area of northern Alaska is one of the most intensely researched locations in the Arctic, and the Barrow Area Information Database (BAID, www.barrowmapped.org) tracks and facilitates a gamut of research, management, and educational activities in the area. BAID is a cyberinfrastructure (CI) that details much of the historic and extant research undertaken in the Barrow region in a suite of interactive web-based mapping and information portals (geobrowsers). The BAID user community and target audience are diverse and include research scientists, science logisticians, land managers, educators, students, and the general public. BAID contains information on more than 16,000 Barrow area research sites that extend back to the 1940s and more than 640 remote sensing images and geospatial datasets. In a web-based setting, users can zoom, pan, query, measure distance, save or print maps and query results, and filter or view information by space, time, and/or other tags. Recent advances include provision of differential global positioning system (dGPS) and high-resolution aerial imagery support to visiting scientists; analysis and multitemporal mapping of over 120 km of coastline for erosion monitoring; maintenance of a wireless micrometeorological sensor network; links to Barrow area datasets housed at national data archives; and substantial upgrades to the BAID website. Web mapping applications that have been launched to the public include: an Imagery Time Viewer that allows users to compare imagery of the Barrow area between 1949 and the present; a Coastal Erosion Viewer that allows users to view long-term (1955-2015) and recent (2013-2015) rates of erosion for the Barrow area; and a Community Planning Tool that allows users to view and print dynamic reports based on an array of basemaps, including a new 0.5m resolution wetlands map designed to enhance decision making for development and land management.
Wyoming Landscape Conservation Initiative data management and integration
Latysh, Natalie; Bristol, R. Sky
2011-01-01
Six Federal agencies, two State agencies, and two local entities formally support the Wyoming Landscape Conservation Initiative (WLCI) and work together on a landscape scale to manage fragile habitats and wildlife resources amidst growing energy development in southwest Wyoming. The U.S. Geological Survey (USGS) was tasked with implementing targeted research and providing scientific information about southwest Wyoming to inform the development of WLCI habitat enhancement and restoration projects conducted by land management agencies. Many WLCI researchers and decisionmakers representing the Bureau of Land Management, U.S. Fish and Wildlife Service, the State of Wyoming, and others have overwhelmingly expressed the need for a stable, robust infrastructure to promote sharing of data resources produced by multiple entities, including metadata adequately describing the datasets. Descriptive metadata facilitates use of the datasets by users unfamiliar with the data. Agency representatives advocate development of common data handling and distribution practices among WLCI partners to enhance availability of comprehensive and diverse data resources for use in scientific analyses and resource management. The USGS Core Science Informatics (CSI) team is developing and promoting data integration tools and techniques across USGS and partner entity endeavors, including a data management infrastructure to aid WLCI researchers and decisionmakers.
Hu, Weiming; Hu, Ruiguang; Xie, Nianhua; Ling, Haibin; Maybank, Stephen
2014-04-01
In this paper, we propose saliency-driven image multiscale nonlinear diffusion filtering. The resulting scale space in general preserves or even enhances semantically important structures such as edges, lines, or flow-like structures in the foreground, and inhibits and smoothes clutter in the background. The image is classified using multiscale information fusion based on the original image, the image at the final scale at which the diffusion process converges, and the image at a midscale. Our algorithm emphasizes the foreground features, which are important for image classification. The background image regions, whether considered as contexts of the foreground or noise to the foreground, can be globally handled by fusing information from different scales. Experimental tests of the effectiveness of the multiscale space for image classification are conducted on the following publicly available datasets: 1) the PASCAL 2005 dataset; 2) the Oxford 102 flowers dataset; and 3) the Oxford 17 flowers dataset, with high classification rates.
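The edge-preserving behavior the scale space relies on can be illustrated with classic Perona-Malik diffusion; the paper additionally modulates diffusion by saliency, which is omitted in this sketch.

```python
import numpy as np

def perona_malik(img, iters=20, kappa=0.2, dt=0.2):
    """Classic Perona-Malik diffusion: smooths flat regions, preserves strong edges.
    (The paper's filter additionally weights diffusion by saliency; not shown here.)"""
    u = img.astype(float).copy()
    for _ in range(iters):
        # Differences to the four neighbors (np.roll wraps at the borders)
        dn = np.roll(u, -1, 0) - u
        ds = np.roll(u, 1, 0) - u
        de = np.roll(u, -1, 1) - u
        dw = np.roll(u, 1, 1) - u
        g = lambda d: np.exp(-(d / kappa) ** 2)   # edge-stopping function
        u += dt * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u

# Step edge plus noise: diffusion should reduce the noise but keep the step
rng = np.random.default_rng(0)
img = np.hstack([np.zeros((16, 8)), np.ones((16, 8))]) + rng.normal(0, 0.05, (16, 16))
sm = perona_malik(img)
noise_before = img[:, :6].std()
noise_after = sm[:, :6].std()
edge_kept = sm[:, 9:].mean() - sm[:, :6].mean()
print(noise_after < noise_before, edge_kept > 0.8)  # True True
```

Because the edge-stopping function is near zero across the large step, the boundary survives many iterations while the small-amplitude noise in the flat regions is diffused away.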
NASA Astrophysics Data System (ADS)
Liu, Z.; Acker, J. G.; Kempler, S. J.
2016-12-01
The NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) is one of twelve NASA Science Mission Directorate (SMD) data centers that provide Earth science data, information, and services to research scientists, applications scientists, applications users, and students around the world. The GES DISC is the home (archive) of NASA Precipitation and Hydrology, as well as Atmospheric Composition and Dynamics, remote sensing data and information. To facilitate Earth science data access, the GES DISC has been developing user-friendly data services for users at different levels. Among them, the Geospatial Interactive Online Visualization ANd aNalysis Infrastructure (GIOVANNI, http://giovanni.gsfc.nasa.gov/) allows users to explore satellite-based data using sophisticated analyses and visualizations without downloading data and software, making it particularly suitable for helping novices use NASA datasets in STEM activities. In this presentation, we will briefly introduce GIOVANNI and recommend datasets for STEM. Examples of using these datasets in STEM activities will be presented as well.
A conceptual prototype for the next-generation national elevation dataset
Stoker, Jason M.; Heidemann, Hans Karl; Evans, Gayla A.; Greenlee, Susan K.
2013-01-01
In 2012 the U.S. Geological Survey's (USGS) National Geospatial Program (NGP) funded a study to develop a conceptual prototype for a new National Elevation Dataset (NED) design with expanded capabilities to generate and deliver a suite of bare earth and above ground feature information over the United States. This report details the research on identifying operational requirements based on prior research, evaluation of what is needed for the USGS to meet these requirements, and development of a possible conceptual framework that could potentially deliver the kinds of information that are needed to support NGP's partners and constituents. This report provides an initial proof-of-concept demonstration using an existing dataset, and recommendations for the future, to inform NGP's ongoing and future elevation program planning and management decisions. The demonstration shows that this type of functional process can robustly create derivatives from lidar point cloud data; however, more research needs to be done to see how well it extends to multiple datasets.
NASA Astrophysics Data System (ADS)
Willis, D. M.; Coffey, H. E.; Henwood, R.; Erwin, E. H.; Hoyt, D. V.; Wild, M. N.; Denig, W. F.
2013-11-01
The measurements of sunspot positions and areas that were published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), as the Greenwich Photo-heliographic Results (GPR), 1874-1976, exist in both printed and digital forms. These printed and digital sunspot datasets have been archived in various libraries and data centres. Unfortunately, however, typographic, systematic and isolated errors can be found in the various datasets. The purpose of the present paper is to begin the task of identifying and correcting these errors. In particular, the intention is to provide in one foundational paper all the necessary background information on the original solar observations, their various applications in scientific research, the format of the different digital datasets, the necessary definitions of the quantities measured, and the initial identification of errors in both the printed publications and the digital datasets. Two companion papers address the question of specific identifiable errors; namely, typographic errors in the printed publications, and both isolated and systematic errors in the digital datasets. The existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the programme of solar observations supported for more than a century by the Royal Observatory, Greenwich, and the Royal Greenwich Observatory. This improved dataset should be of value in many future scientific investigations.
Wang, Lei; Alpert, Kathryn I.; Calhoun, Vince D.; Cobia, Derin J.; Keator, David B.; King, Margaret D.; Kogan, Alexandr; Landis, Drew; Tallis, Marcelo; Turner, Matthew D.; Potkin, Steven G.; Turner, Jessica A.; Ambite, Jose Luis
2015-01-01
SchizConnect (www.schizconnect.org) is built to address the issue of multiple data repositories in schizophrenia neuroimaging studies. It includes a level of mediation, translating across data sources, so that the user can place one query, e.g. for diffusion images from male individuals with schizophrenia, and find out how many datasets exist across participating data sources, as well as download the imaging and related data. The current version handles the Data Usage Agreements across different studies, as well as interpreting database-specific terminologies into a common framework. New data repositories can also be mediated to bring immediate access to existing datasets. Compared with centralized, upload-based data sharing models, SchizConnect is a unique, virtual database with a focus on schizophrenia and related disorders that can mediate live data as information is updated at each data source. It is our hope that SchizConnect can facilitate testing new hypotheses through aggregated datasets, promoting discovery related to the mechanisms underlying dysfunction in schizophrenia. PMID:26142271
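The mediation step described above, in which one common query is translated into each source's own terminology before counting matching datasets, can be sketched as follows. This is a minimal illustrative sketch; the term mappings, catalogs, and function names are hypothetical and are not SchizConnect's actual API.

```python
# Hypothetical mediation layer: translate a common query vocabulary into
# each source's local terminology, then aggregate counts across sources.
COMMON_TO_SOURCE = {
    "sourceA": {"diffusion": "DTI", "schizophrenia": "SCZ", "male": "M"},
    "sourceB": {"diffusion": "dwi", "schizophrenia": "schizophrenia_strict", "male": "male"},
}

# Toy per-source catalogs: each record uses that source's local terms.
CATALOGS = {
    "sourceA": [{"modality": "DTI", "dx": "SCZ", "sex": "M"},
                {"modality": "T1", "dx": "SCZ", "sex": "M"}],
    "sourceB": [{"modality": "dwi", "dx": "schizophrenia_strict", "sex": "male"},
                {"modality": "dwi", "dx": "schizophrenia_strict", "sex": "female"}],
}

def federated_count(modality, dx, sex):
    """One common query, answered per source after term translation."""
    counts = {}
    for source, mapping in COMMON_TO_SOURCE.items():
        m, d, s = mapping[modality], mapping[dx], mapping[sex]
        counts[source] = sum(1 for rec in CATALOGS[source]
                             if rec["modality"] == m and rec["dx"] == d and rec["sex"] == s)
    return counts

print(federated_count("diffusion", "schizophrenia", "male"))  # {'sourceA': 1, 'sourceB': 1}
```

The design point is that the common vocabulary lives only in the mediator, so a new repository can be added by supplying one mapping table rather than changing every client query.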
Basak, Subhash C; Majumdar, Subhabrata
2015-01-01
Variation in high-dimensional data is often driven by a few latent factors, so dimension reduction or variable selection techniques are useful for extracting information from such data. In this paper we consider two such recent methods: interrelated two-way clustering and envelope models. We couple these methods with traditional statistical procedures such as ridge regression and linear discriminant analysis, and apply them to two datasets that have more predictors than samples (i.e. the n < p scenario) and several types of molecular descriptors. One of these datasets consists of a congeneric group of amines, while the other contains a much more diverse collection of compounds. The difference in prediction results between these two datasets for both methods supports the hypothesis that for a congeneric set of compounds, descriptors of a single type are enough to provide good QSAR models, but as the dataset grows more diverse, including a variety of descriptors can improve model quality considerably.
Pavloudi, Christina; Christodoulou, Magdalini; Mavidis, Michalis
2016-01-01
This paper describes a dataset of macrofaunal organisms associated with the sponge Sarcotragus foetidus Schmidt, 1862, collected by scuba diving from two sampling sites: one in Greece (North Aegean Sea) and one in Cyprus (Levantine Sea). This dataset includes macrofaunal taxa inhabiting the demosponge Sarcotragus foetidus and contributes to the ongoing efforts of the Ocean Biogeographic Information System (OBIS), which aims at filling the gaps in our current knowledge of the world's oceans. This is the first paper, to our knowledge, in which the macrofauna associated with S. foetidus from the Levantine Basin is recorded. In total, 90 taxa were recorded, of which 83 were identified to the species level. Eight of these species are new records for the Levantine Basin. The dataset contains 213 occurrence records, fully annotated with all required metadata. It is accessible at http://lifewww-00.her.hcmr.gr:8080/medobis/resource.do?r=organismic_assemblages_sarcotragus_foetidus_cyprus_greece.
The Lunar Source Disk: Old Lunar Datasets on a New CD-ROM
NASA Astrophysics Data System (ADS)
Hiesinger, H.
1998-01-01
A compilation of previously published datasets on CD-ROM is presented. This Lunar Source Disk is intended to be a first step in the improvement/expansion of the Lunar Consortium Disk, in order to create an "image-cube"-like data pool that can be easily accessed and might be useful for a variety of future lunar investigations. All datasets were transformed to a standard map projection that allows direct comparison of different types of information on a pixel-by-pixel basis. Lunar observations have a long history and have been important to mankind for centuries, notably since the work of Plutarch and Galileo. As a consequence of centuries of lunar investigations, knowledge of the characteristics and properties of the Moon has accumulated over time. However, a side effect of this accumulation is that it has become more and more complicated for scientists to review all the datasets obtained through different techniques, to interpret them properly, to recognize their weaknesses and strengths in detail, and to combine them synoptically in geologic interpretations. Such synoptic geologic interpretations are crucial for the study of planetary bodies through remote-sensing data in order to avoid misinterpretation. In addition, many of the modern datasets, derived from Earth-based telescopes as well as from spacecraft missions, were acquired under different geometric and radiometric conditions. These differences make it challenging to compare or combine datasets directly or to extract information from different datasets on a pixel-by-pixel basis. Also, as there is no convention for the presentation of lunar datasets, different authors choose different map projections, depending on the location of the investigated areas and their personal interests. Insufficient or incomplete information on the map parameters used by different authors further complicates the reprojection of these datasets to a standard geometry.
The goal of our efforts was to transfer previously published lunar datasets to a selected standard geometry in order to create an "image-cube"-like data pool for further interpretation. The starting point was a number of datasets on a CD-ROM published by the Lunar Consortium. The task of creating a uniform data pool was further complicated by missing or incorrect references and keys on the Lunar Consortium CD, as well as by erroneous reproduction of some datasets in the literature.
ARM Research in the Equatorial Western Pacific: A Decade and Counting
DOE Office of Scientific and Technical Information (OSTI.GOV)
Long, Charles N.; McFarlane, Sally A.; Del Genio, Anthony D.
2013-05-22
The tropical western Pacific (TWP) is an important climatic region. Strong solar heating, warm sea surface temperatures and the annual progression of the Intertropical Convergence Zone (ITCZ) across this region generate abundant convective systems, which through their effects on the heat and water budgets have a profound impact on global climate and precipitation. To accurately represent tropical cloud systems in models, measurements of tropical clouds, the environment in which they reside, and their impact on the radiation and water budgets are needed. Because of the remote location, ground-based datasets of cloud, atmosphere, and radiation properties from the TWP region have traditionally come primarily from short-term field experiments. While providing extremely useful information on physical processes, these datasets are limited in statistical and climatological information because of their short duration. To provide long-term measurements of the surface radiation budget in the tropics, and the atmospheric properties that affect it, the Atmospheric Radiation Measurement program established a measurement site on Manus Island, Papua New Guinea in 1996 and on the island republic of Nauru in late 1998. These sites provide unique datasets from more than 10 years of operation in the equatorial western Pacific on Manus and Nauru. We present examples of the scientific use of these datasets including characterization of cloud properties, analysis of cloud radiative forcing, model studies of tropical clouds and processes, and validation of satellite algorithms. We also note new instrumentation recently installed at the Manus site that will expand opportunities for tropical atmospheric science.
Scalable Visual Analytics of Massive Textual Datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krishnan, Manoj Kumar; Bohn, Shawn J.; Cowley, Wendy E.
2007-04-01
This paper describes the first scalable implementation of the text processing engine used in visual analytics tools. These tools aid information analysts in interacting with and understanding large volumes of textual content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte datasets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.
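The parallelization pattern described above, partitioning the text across workers, processing each chunk independently, then merging partial results, can be sketched as follows. This is a minimal illustrative sketch under stated assumptions: the chunking, term counting, and thread pool are placeholders, not the paper's engine, which runs on cluster architectures.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Hypothetical corpus split into chunks; a real engine would shard
# documents across cluster nodes rather than local threads.
chunks = [
    "gene expression profiles in cancer",
    "protein folding and gene regulation",
    "cancer genomics and protein networks",
]

def count_terms(chunk):
    """Map step: term frequencies for one chunk."""
    return Counter(chunk.split())

# Process chunks in parallel, then reduce the partial counts.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_terms, chunks))

total = Counter()
for p in partials:
    total.update(p)

print(total["gene"], total["protein"], total["cancer"])  # 2 2 2
```

Because each chunk is processed independently and the reduce step is cheap relative to the map step, throughput grows roughly linearly with the number of workers, which is the behavior the paper reports.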
Antarctic and Sub-Antarctic Asteroidea database.
Moreau, Camille; Mah, Christopher; Agüera, Antonio; Améziane, Nadia; Barnes, David; Crokaert, Guillaume; Eléaume, Marc; Griffiths, Huw; Guillaumot, Charlène; Hemery, Lenaïg G; Jażdżewska, Anna; Jossart, Quentin; Laptikhovsky, Vladimir; Linse, Katrin; Neill, Kate; Sands, Chester; Saucède, Thomas; Schiaparelli, Stefano; Siciński, Jacek; Vasset, Noémie; Danis, Bruno
2018-01-01
The present dataset is a compilation of georeferenced occurrences of asteroids (Echinodermata: Asteroidea) in the Southern Ocean. Occurrence data south of 45°S latitude were mined from various sources together with information regarding the taxonomy, the sampling source and sampling sites when available. Records from 1872 to 2016 were thoroughly checked to ensure the quality of a dataset that reaches a total of 13,840 occurrences from 4,580 unique sampling events. Information regarding the reproductive strategy (brooders vs. broadcasters) of 63 species is also made available. This dataset represents the most exhaustive occurrence database on Antarctic and Sub-Antarctic asteroids.
PhosphoSitePlus, 2014: mutations, PTMs and recalibrations
Hornbeck, Peter V.; Zhang, Bin; Murray, Beth; Kornhauser, Jon M.; Latham, Vaughan; Skrzypek, Elzbieta
2015-01-01
PhosphoSitePlus® (PSP, http://www.phosphosite.org/), a knowledgebase dedicated to mammalian post-translational modifications (PTMs), contains over 330 000 non-redundant PTMs, including phospho, acetyl, ubiquityl and methyl groups. Over 95% of the sites are from mass spectrometry (MS) experiments. In order to improve data reliability, early MS data have been reanalyzed, applying a common standard of analysis across over 1 000 000 spectra. Site assignments with P > 0.05 were filtered out. Two new downloads are available from PSP. The ‘Regulatory sites’ dataset includes curated information about modification sites that regulate downstream cellular processes, molecular functions and protein-protein interactions. The ‘PTMVar’ dataset, an intersect of missense mutations and PTMs from PSP, identifies over 25 000 PTMVars (PTMs Impacted by Variants) that can rewire signaling pathways. The PTMVar data include missense mutations from UniPROTKB, TCGA and other sources that cause over 2000 diseases or syndromes (MIM) and polymorphisms, or are associated with hundreds of cancers. PTMVars include 18 548 phosphorylation sites, 3412 ubiquitylation sites, 2316 acetylation sites, 685 methylation sites and 245 succinylation sites. PMID:25514926
Compilation of climate data from heterogeneous networks across the Hawaiian Islands
Longman, Ryan J.; Giambelluca, Thomas W.; Nullet, Michael A.; Frazier, Abby G.; Kodama, Kevin; Crausbay, Shelley D.; Krushelnycky, Paul D.; Cordell, Susan; Clark, Martyn P.; Newman, Andy J.; Arnold, Jeffrey R.
2018-01-01
Long-term, accurate observations of atmospheric phenomena are essential for a myriad of applications, including historic and future climate assessments, resource management, and infrastructure planning. In Hawai‘i, climate data are available from individual researchers, local, State, and Federal agencies, and from large electronic repositories such as the National Centers for Environmental Information (NCEI). Researchers attempting to make use of available data are faced with a series of challenges that include: (1) identifying potential data sources; (2) acquiring data; (3) establishing data quality assurance and quality control (QA/QC) protocols; and (4) implementing robust gap filling techniques. This paper addresses these challenges by providing: (1) a summary of the available climate data in Hawai‘i including a detailed description of the various meteorological observation networks and data accessibility, and (2) a quality controlled meteorological dataset across the Hawaiian Islands for the 25-year period 1990-2014. The dataset draws on observations from 471 climate stations and includes rainfall, maximum and minimum surface air temperature, relative humidity, wind speed, downward shortwave and longwave radiation data. PMID:29437162
Compilation of climate data from heterogeneous networks across the Hawaiian Islands
NASA Astrophysics Data System (ADS)
Longman, Ryan J.; Giambelluca, Thomas W.; Nullet, Michael A.; Frazier, Abby G.; Kodama, Kevin; Crausbay, Shelley D.; Krushelnycky, Paul D.; Cordell, Susan; Clark, Martyn P.; Newman, Andy J.; Arnold, Jeffrey R.
2018-02-01
Long-term, accurate observations of atmospheric phenomena are essential for a myriad of applications, including historic and future climate assessments, resource management, and infrastructure planning. In Hawai'i, climate data are available from individual researchers, local, State, and Federal agencies, and from large electronic repositories such as the National Centers for Environmental Information (NCEI). Researchers attempting to make use of available data are faced with a series of challenges that include: (1) identifying potential data sources; (2) acquiring data; (3) establishing data quality assurance and quality control (QA/QC) protocols; and (4) implementing robust gap filling techniques. This paper addresses these challenges by providing: (1) a summary of the available climate data in Hawai'i including a detailed description of the various meteorological observation networks and data accessibility, and (2) a quality controlled meteorological dataset across the Hawaiian Islands for the 25-year period 1990-2014. The dataset draws on observations from 471 climate stations and includes rainfall, maximum and minimum surface air temperature, relative humidity, wind speed, downward shortwave and longwave radiation data.
Lima, Fernando; Beca, Gabrielle; Muylaert, Renata L; Jenkins, Clinton N; Perilli, Miriam L L; Paschoal, Ana Maria O; Massara, Rodrigo L; Paglia, Adriano P; Chiarello, Adriano G; Graipel, Maurício E; Cherem, Jorge J; Regolin, André L; Oliveira Santos, Luiz Gustavo R; Brocardo, Carlos R; Paviolo, Agustín; Di Bitetti, Mario S; Scoss, Leandro M; Rocha, Fabiana L; Fusco-Costa, Roberto; Rosa, Clarissa A; Da Silva, Marina X; Hufnagell, Ludmila; Santos, Paloma M; Duarte, Gabriela T; Guimarães, Luiza N; Bailey, Larissa L; Rodrigues, Flávio Henrique G; Cunha, Heitor M; Fantacini, Felipe M; Batista, Graziele O; Bogoni, Juliano A; Tortato, Marco A; Luiz, Micheli R; Peroni, Nivaldo; De Castilho, Pedro V; Maccarini, Thiago B; Filho, Vilmar Picinatto; Angelo, Carlos De; Cruz, Paula; Quiroga, Verónica; Iezzi, María E; Varela, Diego; Cavalcanti, Sandra M C; Martensen, Alexandre C; Maggiorini, Erica V; Keesen, Fabíola F; Nunes, André V; Lessa, Gisele M; Cordeiro-Estrela, Pedro; Beltrão, Mayara G; De Albuquerque, Anna Carolina F; Ingberman, Bianca; Cassano, Camila R; Junior, Laury Cullen; Ribeiro, Milton C; Galetti, Mauro
2017-11-01
Our understanding of mammal ecology has always been hindered by the difficulty of observing species in closed tropical forests. Camera trapping has become a major advance for monitoring terrestrial mammals in biodiversity-rich ecosystems. Here we compiled one of the largest datasets of terrestrial mammal community inventories for the Neotropical region based on camera trapping studies. The dataset comprises 170 surveys of medium to large terrestrial mammals conducted with camera traps in 144 areas by 74 studies, covering six vegetation types of the tropical and subtropical Atlantic Forest of South America (Brazil and Argentina), and presents data on species composition and richness. The complete dataset comprises 53,438 independent records of 83 mammal species, including 10 species of marsupials, 15 rodents, 20 carnivores, eight ungulates and six armadillos. Species richness averaged 13 species (±6.07 SD) per site. Only six species occurred in more than 50% of the sites: the domestic dog Canis familiaris, crab-eating fox Cerdocyon thous, tayra Eira barbara, South American coati Nasua nasua, crab-eating raccoon Procyon cancrivorus and the nine-banded armadillo Dasypus novemcinctus. The information contained in this dataset can be used not only to understand macroecological patterns of biodiversity and community and population structure, but also to evaluate the ecological consequences of fragmentation, defaunation, and trophic interactions. © 2017 by the Ecological Society of America.
This EnviroAtlas dataset contains biodiversity metrics reflecting ecosystem services or other aspects of biodiversity for reptile species, based on the number of reptile species as measured by predicted habitat present within a pixel. These metrics were created by grouping national-level single-species habitat models created by the USGS Gap Analysis Program into smaller ecologically based, phylogeny based, or stakeholder suggested composites. The dataset includes reptile species richness metrics for all reptile species, lizards, snakes, turtles, poisonous reptiles, NatureServe-listed G1, G2, and G3 reptile species, and reptile species listed by IUCN (International Union for Conservation of Nature), PARC (Partners in Amphibian and Reptile Conservation) and SWPARC (Southwest Partners in Amphibian and Reptile Conservation). This dataset was produced by a joint effort of New Mexico State University, US EPA, and USGS to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa
Global patterns of current and future road infrastructure
NASA Astrophysics Data System (ADS)
Meijer, Johan R.; Huijbregts, Mark A. J.; Schotten, Kees C. G. J.; Schipper, Aafke M.
2018-06-01
Georeferenced information on road infrastructure is essential for spatial planning, socio-economic assessments and environmental impact analyses. Yet current global road maps are typically outdated or characterized by spatial bias in coverage. In the Global Roads Inventory Project we gathered, harmonized and integrated nearly 60 geospatial datasets on road infrastructure into a global roads dataset. The resulting dataset covers 222 countries and includes over 21 million km of roads, which is two to three times the total length in the currently best available country-based global roads datasets. We then related total road length per country to country area, population density, GDP and OECD membership, obtaining a regression model with an adjusted R² of 0.90, and found that the highest road densities are associated with densely populated and wealthier countries. Applying our regression model to future population densities and GDP estimates from the Shared Socioeconomic Pathway (SSP) scenarios, we obtained a tentative estimate of 3.0–4.7 million km of additional road length for the year 2050. Large increases in road length were projected for developing nations in some of the world's last remaining wilderness areas, such as the Amazon, the Congo basin and New Guinea. This highlights the need for accurate spatial road datasets to underpin strategic spatial planning in order to reduce the impacts of roads in remaining pristine ecosystems.
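The kind of country-level regression described above can be sketched as follows. This is a minimal sketch on synthetic data: the predictor values, coefficients, and noise level are assumptions for illustration, not the GRIP data or the authors' fitted model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 120  # hypothetical number of countries

# Synthetic country-level predictors, log-transformed where appropriate.
log_area = rng.normal(12, 2, n)         # log country area
log_pop_density = rng.normal(4, 1, n)   # log population density
log_gdp = rng.normal(9, 1.5, n)         # log GDP
oecd = rng.integers(0, 2, n).astype(float)  # OECD membership indicator

# Synthetic response: log road length as a linear combination plus noise.
log_road = (0.6 * log_area + 0.4 * log_pop_density
            + 0.3 * log_gdp + 0.5 * oecd + rng.normal(0, 0.5, n))

# Ordinary least squares via the normal equations (lstsq), with intercept.
X = np.column_stack([np.ones(n), log_area, log_pop_density, log_gdp, oecd])
beta, *_ = np.linalg.lstsq(X, log_road, rcond=None)

# Adjusted R^2, penalizing for the number of fitted parameters.
pred = X @ beta
ss_res = np.sum((log_road - pred) ** 2)
ss_tot = np.sum((log_road - log_road.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - X.shape[1])
print(round(adj_r2, 2))
```

A fitted model of this form can then be applied to projected population and GDP values, as the authors do with the SSP scenarios, to extrapolate future road length.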
Risk behaviours among internet-facilitated sex workers: evidence from two new datasets.
Cunningham, Scott; Kendall, Todd D
2010-12-01
Sex workers have historically played a central role in STI outbreaks, forming a core group for transmission owing to their higher rates of concurrency and inconsistent condom usage. Over the past 15 years, North American commercial sex markets have been radically reorganised by internet technologies that channelled a sizeable share of the marketplace online. These changes may have had a meaningful impact on the role that sex workers play in STI epidemics. In this study, two new datasets documenting the characteristics and practices of internet-facilitated sex workers are presented and analysed. The first dataset comes from a ratings website where clients share detailed information on over 94,000 sex workers in over 40 cities between 1999 and 2008. The second dataset reflects a year-long field survey of 685 sex workers who advertise online. Evidence from these datasets suggests that internet-facilitated sex workers are dissimilar from the street-based workers who largely populated the marketplace in earlier eras. Differences in characteristics and practices were found which suggest a lower potential for the spread of STIs among internet-facilitated sex workers. The internet-facilitated population appears to include a high proportion of sex workers who are well educated, hold health insurance and operate only part time. They also engage in relatively low levels of risky sexual practices.
Analyzing How We Do Analysis and Consume Data, Results from the SciDAC-Data Project
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ding, P.; Aliaga, L.; Mubarak, M.
One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data-driven models describing their consumption.
Analyzing how we do Analysis and Consume Data, Results from the SciDAC-Data Project
NASA Astrophysics Data System (ADS)
Ding, P.; Aliaga, L.; Mubarak, M.; Tsaris, A.; Norman, A.; Lyon, A.; Ross, R.
2017-10-01
One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.
Conducting high-value secondary dataset analysis: an introductory guide and resources.
Smith, Alexander K; Ayanian, John Z; Covinsky, Kenneth E; Landon, Bruce E; McCarthy, Ellen P; Wee, Christina C; Steinman, Michael A
2011-08-01
Secondary analyses of large datasets provide a mechanism for researchers to address high-impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium (www.sgim.org/go/datasets). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.
Benchmark Dataset for Whole Genome Sequence Compression.
C L, Biji; S Nair, Achuthsankar
2017-01-01
Research in DNA data compression lacks a standard dataset on which to test DNA-specific compression tools. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of a scientifically compiled whole-genome sequence dataset, and proposes a benchmark dataset constructed using a multistage sampling procedure. Taking the genome sequences of organisms available in the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. The paper reports the results of running three established tools on the newly compiled dataset and shows that their strengths and weaknesses become evident only through comparison on a scientifically compiled benchmark dataset. The sample dataset and the respective links are available at https://sourceforge.net/projects/benchmarkdnacompressiondataset/.
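The sampling idea above, drawing fixed numbers of sequences per organism category from a defined universe, can be sketched as a single stratified stage. This is a hedged simplification: the universe sizes and sequence IDs below are invented, and the paper's actual procedure involves multiple sampling stages, not this one-step version.

```python
import random

# Hypothetical universe: sequence IDs grouped by organism category.
universe = {
    "prokaryotes": [f"prok_{i}" for i in range(5000)],
    "plasmids":    [f"plas_{i}" for i in range(900)],
    "viruses":     [f"vir_{i}"  for i in range(700)],
    "eukaryotes":  [f"euk_{i}"  for i in range(300)],
}

# Target sample sizes echoing the counts reported in the abstract.
targets = {"prokaryotes": 1105, "plasmids": 200, "viruses": 164, "eukaryotes": 65}

random.seed(1)  # fixed seed so the benchmark is reproducible
benchmark = {cat: random.sample(ids, targets[cat]) for cat, ids in universe.items()}

print({cat: len(sample) for cat, sample in benchmark.items()})
```

Sampling each category separately, rather than from the pooled universe, guarantees the benchmark preserves the intended balance of prokaryotes, plasmids, viruses, and eukaryotes regardless of how skewed the universe is.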
MANOVA for distinguishing experts' perceptions about entrepreneurship using NES data from GEM
NASA Astrophysics Data System (ADS)
Correia, Aldina; Costa e Silva, Eliana; Lopes, Isabel C.; Braga, Alexandra
2016-12-01
The Global Entrepreneurship Monitor is a large-scale database for internationally comparative entrepreneurship research, with information on many aspects of entrepreneurial activity, perceptions, conditions, and national and regional policy for a large number of countries. The project has two main sources of primary data: the Adult Population Survey and the National Expert Survey. In this work the 2011 and 2012 National Expert Survey datasets are studied. Our goal is to analyze the effect of an expert's type of entrepreneurship specialization on perceptions of the Entrepreneurial Framework Conditions. For this purpose multivariate analysis of variance (MANOVA) is used. Some similarities were found between the results obtained for the 2011 and 2012 datasets; however, differences between expert types remain.
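The core MANOVA statistic for comparing expert groups on several perception variables at once can be computed directly from the within- and between-group sum-of-squares-and-cross-products (SSCP) matrices. The sketch below is illustrative only: the two expert groups and their three-variable scores are synthetic, not NES data, and it computes Wilks' lambda without the accompanying significance test.

```python
import numpy as np

def wilks_lambda(groups):
    """Wilks' lambda for one-way MANOVA: det(W) / det(W + B),
    where W and B are the within- and between-group SSCP matrices."""
    all_obs = np.vstack(groups)
    grand_mean = all_obs.mean(axis=0)
    p = all_obs.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in groups:
        mean_g = g.mean(axis=0)
        centered = g - mean_g
        W += centered.T @ centered                     # within-group scatter
        d = (mean_g - grand_mean).reshape(-1, 1)
        B += len(g) * (d @ d.T)                        # between-group scatter
    return np.linalg.det(W) / np.linalg.det(W + B)

rng = np.random.default_rng(0)
# Two hypothetical expert groups scoring three framework conditions.
g1 = rng.normal(0.0, 1.0, size=(30, 3))
g2 = rng.normal(0.8, 1.0, size=(30, 3))  # shifted mean, so a group effect exists
lam = wilks_lambda([g1, g2])
print(round(lam, 3))  # values near 0 indicate strong group differences
```

In practice one would convert lambda to an F statistic (e.g. via Rao's approximation) to test whether the group means differ significantly across the set of perception variables.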
Disaster Debris Recovery Database - Recovery
The US EPA Region 5 Disaster Debris Recovery Database includes public datasets of over 6,000 composting facilities, demolition contractors, transfer stations, landfills and recycling facilities for construction and demolition materials, electronics, household hazardous waste, metals, tires, and vehicles in the states of Illinois, Indiana, Iowa, Kentucky, Michigan, Minnesota, Missouri, North Dakota, Ohio, Pennsylvania, South Dakota, West Virginia and Wisconsin. In this update, facilities in the 7 states that border the EPA Region 5 states were added to assist interstate disaster debris management. Also, the datasets for composters, construction and demolition recyclers, demolition contractors, and metals recyclers were verified and source information added for each record using these sources: AGC, Biocycle, BMRA, CDRA, ISRI, NDA, USCC, FEMA Debris Removal Contractor Registry, EPA Facility Registry System, and State and local listings.
Disaster Debris Recovery Database - Landfills
The US EPA Region 5 Disaster Debris Recovery Database includes public datasets of over 6,000 composting facilities, demolition contractors, transfer stations, landfills and recycling facilities for construction and demolition materials, electronics, household hazardous waste, metals, tires, and vehicles in the states of Illinois, Indiana, Iowa, Kentucky, Michigan, Minnesota, Missouri, North Dakota, Ohio, Pennsylvania, South Dakota, West Virginia and Wisconsin. In this update, facilities in the 7 states that border the EPA Region 5 states were added to assist interstate disaster debris management. Also, the datasets for composters, construction and demolition recyclers, demolition contractors, and metals recyclers were verified and source information added for each record using these sources: AGC, Biocycle, BMRA, CDRA, ISRI, NDA, USCC, FEMA Debris Removal Contractor Registry, EPA Facility Registry System, and State and local listings.
Big Data Provenance: Challenges, State of the Art and Opportunities.
Wang, Jianwu; Crawl, Daniel; Purawat, Shweta; Nguyen, Mai; Altintas, Ilkay
2015-01-01
The ability to track provenance is a key feature of scientific workflows, supporting data lineage and reproducibility. The challenges introduced by the volume, variety and velocity of Big Data also pose related challenges for the provenance and quality of Big Data, defined as its veracity. The increasing size and variety of distributed Big Data provenance information bring new technical challenges and opportunities throughout the provenance lifecycle, including recording, querying, sharing and utilization. This paper discusses the challenges and opportunities of Big Data provenance related to the veracity of the datasets themselves and to the provenance of the analytical processes that analyze them. It also explains our current efforts towards tracking and utilizing Big Data provenance using workflows as a programming model to analyze Big Data.
Ecological data in support of an analysis of Guinea-Bissau's medicinal flora.
Catarino, Luís; Havik, Philip J; Indjai, Bucar; Romeiras, Maria M
2016-06-01
This dataset presents an annotated list of medicinal plants used by local communities in Guinea-Bissau (West Africa), covering a total of 218 species. Data were gathered by means of herbarium and bibliographic research, as well as fieldwork. Biological and ecological information is provided for each species, including in-country distribution, geographical range, growth form and main vegetation types. The dataset was used to prepare a paper on the medicinal plants of Guinea-Bissau, "Medicinal plants of Guinea-Bissau: therapeutic applications, ethnic diversity and knowledge transfer" (Catarino et al., 2016) [1]. The table and figures provide a unique database for Guinea-Bissau in support of ethno-medical and ethno-pharmacological research and their ecological dimensions.
This entry contains two files. The first file, Hance_WFSR Flasher locations.xlxs, contains information describing the location of installed landmark 'flashers', each consisting of 2 square aluminum metal tags. Each tag was inscribed with a number to aid field personnel in identifying landmark locations within the West Fork Smith River watershed in southern coastal Oregon. These landmarks were used to calculate stream distances between points in the watershed, including distances between tagging locations and detection events for tagged fish. A second file, named Hance_fish_detection_data1.xlxs, contains information on the detection of tagged fish within the West Fork Smith River stream network. The file includes both the location where the fish were tagged and where they were subsequently detected. Together with the information in the WFSR flasher location dataset, these data allow estimation of the minimum distances and directions moved by juvenile coho salmon during the fall transition period. A map locator is provided in Figure 1 in the accompanying manuscript: Dalton J. Hance, Lisa M. Ganio, Kelly M. Burnett & Joseph L. Ebersole (2016) Basin-Scale Variation in the Spatial Pattern of Fall Movement of Juvenile Coho Salmon in the West Fork Smith River, Oregon, Transactions of the American Fisheries Society, 145:5, 1018-1034, DOI: 10.1080/00028487.2016.1194892. This dataset is associated with the following publication: Hance, D.J., L.M. Ganio, K.M. Burnett, and J.L. Ebersole.
NASA Astrophysics Data System (ADS)
Ostrenga, D.; Liu, Z.; Teng, W. L.; Trivedi, B.; Kempler, S.
2011-12-01
The NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) is home to global precipitation product archives, in particular the Tropical Rainfall Measuring Mission (TRMM) products. TRMM is a joint U.S.-Japan satellite mission to monitor tropical and subtropical (40°S-40°N) precipitation and to estimate its associated latent heating. The TRMM satellite provides the first detailed and comprehensive dataset on the four-dimensional distribution of rainfall and latent heating over vastly undersampled tropical and subtropical oceans and continents. The TRMM satellite was launched on November 27, 1997, and its data products are archived at and distributed by GES DISC. The newly released TRMM Version 7 includes several changes, among them new parameters, new products, updated metadata, and revised data structures. For example, hydrometeor profiles in 2A12 now have 28 layers (14 in V6), and new parameters have been added to several popular Level-3 products such as 3B42 and 3B43. Version 2.2 of the Global Precipitation Climatology Project (GPCP) dataset has been added to the TRMM Online Visualization and Analysis System (TOVAS; URL: http://disc2.nascom.nasa.gov/Giovanni/tovas/), allowing online analysis and visualization without downloading data and software. The GPCP dataset extends back to 1979. Results of a basic intercomparison between the new and previous versions of both TRMM and GPCP will be presented to help understand changes in data product characteristics. To facilitate data and information access and to support precipitation research and applications, we have developed a Precipitation Data and Information Services Center (PDISC; URL: http://disc.gsfc.nasa.gov/precipitation). In addition to TRMM, PDISC provides current and past observational precipitation data. Users can access precipitation data archives consisting of both remote sensing and in-situ observations.
Users can use these data products to conduct a wide variety of activities, including case studies, model evaluation, and uncertainty investigation. To support Earth science applications, PDISC provides users with near-real-time precipitation products over the Internet. At PDISC, users can access tools and software; documentation, FAQs and assistance are also available. Other capabilities include: 1) Mirador (http://mirador.gsfc.nasa.gov/), a simplified interface for searching, browsing, and ordering Earth science data at the GES DISC, designed to be fast and easy to learn; 2) TOVAS; 3) NetCDF data download for the GIS community; 4) data via OPeNDAP (http://disc.sci.gsfc.nasa.gov/services/opendap/), which provides remote access to individual variables within datasets in a form usable by many tools, such as IDV, McIDAS-V, Panoply, Ferret and GrADS; 5) the Open Geospatial Consortium (OGC) Web Map Service (WMS) (http://disc.sci.gsfc.nasa.gov/services/wxs_ogc.shtml), an interface that enables clients to build customized maps with data coming from different networks. More details, along with examples, will be presented.
EarthServer: Visualisation and use of uncertainty as a data exploration tool
NASA Astrophysics Data System (ADS)
Walker, Peter; Clements, Oliver; Grant, Mike
2013-04-01
The Ocean Science/Earth Observation community generates huge datasets from satellite observation. Until recently it has been difficult to obtain matching uncertainty information for these datasets and to apply it to their processing. In order to make use of uncertainty information when analysing "Big Data" we need both the uncertainty itself (attached to the underlying data) and a means of working with the combined product without requiring the entire dataset to be downloaded. The European Commission FP7 project EarthServer (http://earthserver.eu) is addressing the problem of accessing and performing ad-hoc analysis of extreme-size Earth Science data using cutting-edge Array Database technology. The core software (Rasdaman) and web services wrapper (Petascope) allow huge datasets to be accessed using Open Geospatial Consortium (OGC) standard interfaces, including the well-established Web Coverage Service (WCS) and Web Map Service (WMS) standards as well as the emerging Web Coverage Processing Service (WCPS) standard. The WCPS standard allows ad-hoc queries to be run on any of the data stored within Rasdaman, creating an infrastructure where users are not restricted by bandwidth when manipulating or querying huge datasets. The ESA Ocean Colour - Climate Change Initiative (OC-CCI) project (http://www.esa-oceancolour-cci.org/) is producing high-resolution, global ocean colour datasets over the full period (1998-2012) for which high-quality observations were available. This climate data record includes per-pixel uncertainty data for each variable, based on an analytic method that classifies how much and which types of water are present in a pixel and assigns uncertainty based on robust comparisons with global in-situ validation datasets. These uncertainty values take two forms, root mean square (RMS) and bias uncertainty, representing the expected variability and the expected offset error, respectively.
By combining the data produced by the OC-CCI project with the software from the EarthServer project, we can produce a novel data offering that allows the use of traditional exploration and access mechanisms such as WMS and WCS. The real benefits, however, can be seen when utilising WCPS to explore the data. We will show two major benefits of this infrastructure. First, the visualisation of the combined chlorophyll and uncertainty datasets through a web-based GIS portal gives users the ability to instantaneously assess the quality of the data they are exploring, using traditional web-based plotting techniques as well as novel web-based three-dimensional visualisation. Second, we will showcase the benefits of combining these data with the WCPS standard. The uncertainty data can be used in queries written in the standard WCPS query language. This allows selection of data, either for download or for use within the query, based on the respective uncertainty values, as well as the possibility of incorporating both the chlorophyll and uncertainty data into complex queries to produce additional novel data products. By filtering on uncertainty at the data source rather than at the client, we minimise traffic over the network, allowing huge datasets to be worked on with a minimal time penalty.
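The server-side filtering idea can be sketched by constructing a WCPS query that masks chlorophyll values whose RMS uncertainty exceeds a threshold. The coverage names, the exact WCPS expression, and the Petascope-style request parameters below are all assumptions for illustration; real OC-CCI coverages on an EarthServer endpoint will differ.

```python
from urllib.parse import urlencode

def wcps_masked_chlorophyll(chl_cov="chlor_a", rms_cov="chlor_a_rms", max_rms=0.3):
    # Hypothetical coverage names. The query keeps chlorophyll values
    # whose RMS uncertainty is below the threshold and zeroes the rest,
    # so filtering happens server-side before any data crosses the network.
    return (
        f"for c in ({chl_cov}), u in ({rms_cov}) "
        f"return encode(c * (u < {max_rms}), \"netcdf\")"
    )

def wcps_request_url(endpoint, query):
    # Petascope-style KVP request; the parameter names are an assumption.
    return endpoint + "?" + urlencode({"service": "WCS", "version": "2.0.1",
                                       "request": "ProcessCoverages",
                                       "query": query})

q = wcps_masked_chlorophyll()
print(wcps_request_url("http://example.org/rasdaman/ows", q))
```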
Engineering Lessons Learned and Systems Engineering Applications
NASA Technical Reports Server (NTRS)
Gill, Paul S.; Garcia, Danny; Vaughan, William W.
2005-01-01
Systems Engineering is fundamental to good engineering, which in turn depends on the integration and application of engineering lessons learned. Thus, good Systems Engineering also depends on systems engineering lessons learned from within the aerospace industry being documented and applied. About ten percent of the engineering lessons learned documented in the NASA Lessons Learned Information System are directly related to Systems Engineering. A key issue associated with lessons learned datasets is the communication and incorporation of this information into engineering processes. As part of the NASA Technical Standards Program activities, engineering lessons learned datasets have been identified from a number of sources. These are being searched and screened for those having a relation to Technical Standards. This paper will address some of these Systems Engineering Lessons Learned and how they are being related to Technical Standards within the NASA Technical Standards Program, including linking to the Agency's Interactive Engineering Discipline Training Courses and the life cycle for a flight vehicle development program.
A trust-based recommendation method using network diffusion processes
NASA Astrophysics Data System (ADS)
Chen, Ling-Jiao; Gao, Jian
2018-09-01
A variety of rating-based recommendation methods have been extensively studied, including the well-known collaborative filtering approaches and some network diffusion-based methods; however, social trust relations are not sufficiently considered when making recommendations. In this paper, we contribute to the literature by proposing a trust-based recommendation method, named CosRA+T, which integrates the information of trust relations into the resource-redistribution process. Specifically, a tunable parameter is used to scale the resources received by trusted users before the redistribution back to the objects. Interestingly, we find an optimal scaling parameter for the proposed CosRA+T method to achieve its best recommendation accuracy, and the optimal value seems to be universal under several evaluation metrics across different datasets. Moreover, results of extensive experiments on two real-world rating datasets with trust relations, Epinions and FriendFeed, suggest that CosRA+T yields a remarkable improvement in overall accuracy, diversity and novelty. Our work takes a step towards designing better recommendation algorithms by employing multiple resources of social network information.
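The resource-redistribution step with a trust-scaling parameter can be sketched as follows. This is plain two-step mass diffusion with the trust scaling added, not the CosRA similarity the paper actually uses; the data and names are illustrative.

```python
def diffusion_scores(ratings, trust, target, lam=1.5):
    """Two-step resource redistribution on a user-object bipartite graph.

    ratings: dict user -> set of collected objects
    trust:   set of users trusted by `target`
    lam:     scaling applied to the resource held by trusted users before
             redistribution back to objects (lam=1 recovers plain diffusion).
    A simplified mass-diffusion sketch, not the CosRA+T method itself.
    """
    # Step 1: each object collected by `target` spreads one unit of
    # resource equally to the users who collected it.
    user_res = {u: 0.0 for u in ratings}
    for obj in ratings[target]:
        holders = [u for u, objs in ratings.items() if obj in objs]
        for u in holders:
            user_res[u] += 1.0 / len(holders)
    # Trusted users' resource is scaled by lam before redistribution.
    for u in user_res:
        if u in trust:
            user_res[u] *= lam
    # Step 2: users spread their resource equally over their objects.
    scores = {}
    for u, res in user_res.items():
        if not ratings[u]:
            continue
        share = res / len(ratings[u])
        for obj in ratings[u]:
            scores[obj] = scores.get(obj, 0.0) + share
    # Recommend only objects the target has not collected yet.
    return {o: s for o, s in scores.items() if o not in ratings[target]}

ratings = {"alice": {"a", "b"}, "bob": {"b", "c"}, "carol": {"a", "c", "d"}}
scores = diffusion_scores(ratings, trust={"carol"}, target="alice")
print(sorted(scores, key=scores.get, reverse=True))
```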
Progress and Challenges in Assessing NOAA Data Management
NASA Astrophysics Data System (ADS)
de la Beaujardiere, J.
2016-12-01
The US National Oceanic and Atmospheric Administration (NOAA) produces large volumes of environmental data from a great variety of observing systems, including satellites, radars, aircraft, ships, buoys, and other platforms. These data are irreplaceable assets that must be properly managed to ensure they are discoverable, accessible, usable, and preserved. A policy framework has been established which informs data producers of their responsibilities and which supports White House-level mandates such as the Executive Order on Open Data and the OSTP Memorandum on Increasing Access to the Results of Federally Funded Scientific Research. However, assessing the current state of, and progress toward completion for, the many NOAA datasets is a challenge. This presentation will discuss work toward establishing assessment methodologies and dashboard-style displays. Ideally, metrics would be gathered through software and automatically updated whenever an individual improvement was made. In practice, however, some level of manual information collection is required. Differing approaches to dataset granularity in different branches of NOAA add further complexity.
CORUM: the comprehensive resource of mammalian protein complexes
Ruepp, Andreas; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Stransky, Michael; Waegele, Brigitte; Schmidt, Thorsten; Doudieu, Octave Noubibou; Stümpflen, Volker; Mewes, H. Werner
2008-01-01
Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The CORUM (http://mips.gsf.de/genre/proj/corum/index.html) database is a collection of experimentally verified mammalian protein complexes. Information is manually derived by expert annotators through critical reading of the scientific literature. Information about protein complexes includes protein complex names, subunits, and literature references, as well as the function of the complexes. For functional annotation, we use the FunCat catalogue, which makes it possible to organize the protein complex space into biologically meaningful subsets. The database contains more than 1750 protein complexes built from 2400 different genes, thus representing 12% of the protein-coding genes in human. A web-based system is available to query, view and download the data. CORUM provides a comprehensive dataset of protein complexes for discoveries in systems biology, analyses of protein networks and protein complex-associated diseases. Comparable to the MIPS reference dataset of protein complexes from yeast, CORUM intends to serve as a reference for mammalian protein complexes. PMID:17965090
Passing messages between biological networks to refine predicted interactions.
Glass, Kimberly; Huttenhower, Curtis; Quackenbush, John; Yuan, Guo-Cheng
2013-01-01
Regulatory network reconstruction is a fundamental problem in computational biology. There are significant limitations to such reconstruction using individual datasets, and increasingly people attempt to construct networks using multiple, independent datasets obtained from complementary sources, but methods for this integration are lacking. We developed PANDA (Passing Attributes between Networks for Data Assimilation), a message-passing model using multiple sources of information to predict regulatory relationships, and used it to integrate protein-protein interaction, gene expression, and sequence motif data to reconstruct genome-wide, condition-specific regulatory networks in yeast as a model. The resulting networks were not only more accurate than those produced using individual data sets and other existing methods, but they also captured information regarding specific biological mechanisms and pathways that were missed using other methodologies. PANDA is scalable to higher eukaryotes, applicable to specific tissue or cell type data and conceptually generalizable to include a variety of regulatory, interaction, expression, and other genome-scale data. An implementation of the PANDA algorithm is available at www.sourceforge.net/projects/panda-net.
Updated Value of Service Reliability Estimates for Electric Utility Customers in the United States
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sullivan, Michael; Schellenberg, Josh; Blundell, Marshall
2015-01-01
This report updates the 2009 meta-analysis that provides estimates of the value of service reliability for electricity customers in the United States (U.S.). The meta-dataset now includes 34 different datasets from surveys fielded by 10 different utility companies between 1989 and 2012. Because these studies used nearly identical interruption cost estimation or willingness-to-pay/accept methods, it was possible to integrate their results into a single meta-dataset describing the value of electric service reliability observed in all of them. Once the datasets from the various studies were combined, a two-part regression model was used to estimate customer damage functions that can be generally applied to calculate customer interruption costs per event by season, time of day, day of week, and geographical region within the U.S. for industrial, commercial, and residential customers. This report focuses on the backwards stepwise selection process that was used to develop the final revised model for all customer classes. Across customer classes, the revised customer interruption cost model has improved significantly because it incorporates more data and does not include the many extraneous variables that were in the original specification from the 2009 meta-analysis. The backwards stepwise selection process led to a more parsimonious model that includes only key variables, while still achieving comparable out-of-sample predictive performance. In turn, users of interruption cost estimation tools such as the Interruption Cost Estimate (ICE) Calculator will have less customer-characteristics information to provide, and the associated inputs page will be far less cumbersome. The new version of the ICE Calculator is anticipated to be released in 2015.
Comparison of alternate scoring of variables on the performance of the frailty index
2014-01-01
Background The frailty index (FI) is used to measure the health status of ageing individuals. An FI is constructed as the proportion of deficits present in an individual out of the total number of age-related health variables considered. The purpose of this study was to systematically assess whether dichotomizing deficits included in an FI affects the information value of the whole index. Methods Secondary analysis of three population-based longitudinal studies of community-dwelling individuals: Nova Scotia Health Survey (NSHS, n = 3227, aged 18+), Survey of Health, Ageing and Retirement in Europe (SHARE, n = 37546, aged 50+), and Yale Precipitating Events Project (Yale-PEP, n = 754, aged 70+). For each dataset, we constructed two FIs from baseline data using the deficit accumulation approach. In each dataset, both FIs included the same variables (23 in NSHS, 70 in SHARE, 33 in Yale-PEP). One FI was constructed with only dichotomous values (marking presence or absence of a deficit); in the other FI, as many variables as possible were coded as ordinal (graded severity of a deficit). Participants in each study were followed for different durations (NSHS: 10 years, SHARE: 5 years, Yale-PEP: 12 years). Results Within each dataset, the difference in mean scores between the ordinal and dichotomous-only FIs ranged from 0 to 1.5 deficits. Their ability to predict mortality was identical; their absolute difference in area under the ROC curve ranged from 0.00 to 0.02, and the absolute difference between their Cox hazard ratios ranged from 0.001 to 0.009. Conclusions Analyses from three diverse datasets suggest that variables included in an FI can be coded either as dichotomous or ordinal, with negligible impact on the performance of the index in predicting mortality. PMID:24559204
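The FI construction described above is a simple ratio, which makes the dichotomous-vs-ordinal comparison easy to illustrate. The variable coding below is hypothetical, not the coding used in the three studies.

```python
def frailty_index(deficits):
    """FI = sum of deficit scores / number of deficits considered.

    Each deficit is scored in [0, 1]: 0/1 for dichotomous coding, or
    graded values (e.g. 0, 0.25, 0.5, 1) for ordinal severity coding.
    The grading here is illustrative, not the studies' actual coding.
    """
    if not deficits:
        raise ValueError("no deficits provided")
    if any(not 0.0 <= d <= 1.0 for d in deficits):
        raise ValueError("deficit scores must lie in [0, 1]")
    return sum(deficits) / len(deficits)

# The same 5 health variables, coded two ways for one hypothetical person
dichotomous = [1, 0, 1, 0, 0]         # deficit present / absent
ordinal     = [1, 0, 0.5, 0.25, 0]    # graded severity
print(frailty_index(dichotomous), frailty_index(ordinal))
```

The two codings give close but not identical scores, which is exactly the gap the study measured across whole cohorts.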
Yang, Jie; McArdle, Conor; Daniels, Stephen
2014-01-01
A new data dimension-reduction method, called Internal Information Redundancy Reduction (IIRR), is proposed for application to Optical Emission Spectroscopy (OES) datasets obtained from industrial plasma processes. In a semiconductor manufacturing environment, for example, real-time spectral emission data is potentially very useful for inferring information about critical process parameters such as wafer etch rates. However, the relationship between the spectral sensor data gathered over the duration of an etching process step and the target process output parameters is complex. OES sensor data has high dimensionality (fine wavelength resolution is required in spectral emission measurements in order to capture data on all chemical species involved in plasma reactions) and full-spectrum samples are taken at frequent time points so that dynamic process changes can be captured. To maximise the utility of the gathered dataset, it is essential that information redundancy is minimised, but with the important requirement that the resulting reduced dataset remain in a form amenable to direct interpretation of the physical process. To meet this requirement and to achieve a high reduction in dimension with little information loss, the IIRR method proposed in this paper operates directly in the original variable space, identifying peak wavelength emissions and the correlative relationships between them. A new statistic, the Mean Determination Ratio (MDR), is proposed to quantify the information loss after dimension reduction, and the effectiveness of IIRR is demonstrated using an actual semiconductor manufacturing dataset. As an example of the application of IIRR in process monitoring/control, we also show how etch rates can be accurately predicted from IIRR dimension-reduced spectral data. PMID:24451453
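The key property claimed for IIRR, reducing redundancy while staying in the original wavelength space, can be illustrated with a generic greedy correlation filter. This sketch is not the published IIRR algorithm or its MDR statistic; the threshold and toy data are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def reduce_redundancy(series_by_wavelength, threshold=0.95):
    """Greedily keep wavelength channels whose time series are not
    strongly correlated with any already-kept channel.

    Because it works directly in the original variable space, each
    retained channel is still a physical emission wavelength. A generic
    correlation-filter sketch, not the published IIRR procedure."""
    kept = []
    for wl, series in sorted(series_by_wavelength.items()):
        if all(abs(pearson(series, series_by_wavelength[k])) < threshold
               for k in kept):
            kept.append(wl)
    return kept

# Toy OES-like data: channel 520.1 nearly duplicates 520.0
data = {
    520.0: [1.0, 2.0, 3.0, 4.0],
    520.1: [1.1, 2.1, 3.1, 4.1],   # redundant with 520.0
    656.3: [4.0, 1.0, 3.0, 2.0],   # independent behaviour
}
print(reduce_redundancy(data))
```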
Topaz, Maxim; Lai, Kenneth; Dowding, Dawn; Lei, Victor J; Zisberg, Anna; Bowles, Kathryn H; Zhou, Li
2016-12-01
Electronic health records are being increasingly used by nurses, with up to 80% of the health data recorded as free text. However, only a few studies have developed nursing-relevant tools that help busy clinicians identify the information they need at the point of care. This study developed and validated one of the first automated natural language processing applications to extract wound information (wound type, pressure ulcer stage, wound size, anatomic location, and wound treatment) from free-text clinical notes. First, two human annotators manually reviewed a purposeful training sample (n=360) and a random test sample (n=1100) of clinical notes (including 50% discharge summaries and 50% outpatient notes), identified wound cases, and created a gold standard dataset. We then trained and tested our natural language processing system (known as MTERMS) to process the wound information. Finally, we assessed our automated approach by comparing system-generated findings against the gold standard. We also compared the prevalence of wound cases identified from free-text data with coded diagnoses in the structured data. The testing dataset included 101 notes (9.2%) with wound information. The overall system performance was good (F-measure, a composite measure of the system's accuracy, of 92.7%), with best results for wound treatment (F-measure = 95.7%) and poorest results for wound size (F-measure = 81.9%). Only 46.5% of wound notes had a structured code for a wound diagnosis. The natural language processing system achieved good performance on a subset of randomly selected discharge summaries and outpatient notes. In more than half of the wound notes there were no coded wound diagnoses, which highlights the significance of using natural language processing to enrich clinical decision making. Our future steps will include expansion of the application's information coverage to other relevant wound factors and validation of the model with external data. Copyright © 2016 Elsevier Ltd. All rights reserved.
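The F-measure used to score each extraction category is the harmonic mean of precision and recall. The counts below are toy values for illustration, not figures from the study.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, the standard score for
    information-extraction evaluation against a gold standard."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy counts; the reported F-measures (92.7% overall, 95.7% treatment,
# 81.9% size) come from the study's gold-standard comparison, not these.
print(round(f_measure(tp=90, fp=10, fn=5), 3))
```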
Exploiting Amino Acid Composition for Predicting Protein-Protein Interactions
Roy, Sushmita; Martinez, Diego; Platero, Harriett; Lane, Terran; Werner-Washburne, Margaret
2009-01-01
Background Computational predictions of protein interactions typically use protein domains as classifier features because they capture conserved information about interaction surfaces. However, approaches relying on domains as features cannot be applied to proteins without any domain information. In this paper, we explore the contribution of pure amino acid composition (AAC) to protein interaction prediction. This simple feature, which is based on normalized counts of single amino acids or pairs of amino acids, is applicable to proteins from any sequenced organism and can be used to compensate for the lack of domain information. Results AAC performed on par with protein interaction prediction based on domains on three yeast protein interaction datasets. Similar behavior was obtained using different classifiers, indicating that our results are a function of the features and not of the classifiers. In addition to the yeast datasets, AAC performed comparably on worm and fly datasets. Prediction of interactions for the entire yeast proteome identified a large number of novel interactions, the majority of which co-localized or participated in the same processes. Our high-confidence interaction network included both well-studied and uncharacterized proteins. Proteins with known function were involved in actin assembly and cell budding. Uncharacterized proteins interacted with proteins involved in reproduction and cell budding, thus providing putative biological roles for the uncharacterized proteins. Conclusion AAC is a simple yet powerful feature for predicting protein interactions, and can be used alone or in conjunction with protein domains to predict new interactions and validate existing ones. More importantly, AAC alone performs on par with existing, but more complex, features, indicating the presence of sequence-level information that is predictive of interaction but not necessarily restricted to domains. PMID:19936254
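The AAC feature described above is easy to compute: normalized counts of single residues (20 features) or of residue pairs (400 features). This is a straightforward reading of the feature description, not the paper's exact implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(seq, pairs=False):
    """Normalized amino acid composition of a protein sequence.

    Single-residue mode yields 20 features; pair mode yields 400
    (overlapping dipeptide counts, normalized by the number of pairs).
    Assumes the sequence uses only the 20 standard residue letters."""
    seq = seq.upper()
    if not pairs:
        n = len(seq)
        return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}
    n = max(len(seq) - 1, 1)
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    return {k: v / n for k, v in counts.items()}

feats = aac_features("MKVLA")  # toy sequence
print(feats["A"], feats["K"])
```

Either feature vector can then be fed to any standard classifier, which is why the paper could show the result holds across classifiers.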
Pathway-based personalized analysis of cancer
Drier, Yotam; Sheffer, Michal; Domany, Eytan
2013-01-01
We introduce Pathifier, an algorithm that infers pathway deregulation scores for each tumor sample on the basis of expression data. This score is determined, in a context-specific manner, for every particular dataset and type of cancer that is being investigated. The algorithm transforms gene-level information into pathway-level information, generating a compact and biologically relevant representation of each sample. We demonstrate the algorithm’s performance on three colorectal cancer datasets and two glioblastoma multiforme datasets and show that our multipathway-based representation is reproducible, preserves much of the original information, and allows inference of complex biologically significant information. We discovered several pathways that were significantly associated with survival of glioblastoma patients and two whose scores are predictive of survival in colorectal cancer: CXCR3-mediated signaling and oxidative phosphorylation. We also identified a subclass of proneural and neural glioblastoma with significantly better survival, and an EGF receptor-deregulated subclass of colon cancers. PMID:23547110
A new, long-term daily satellite-based rainfall dataset for operational monitoring in Africa
Maidment, Ross I.; Grimes, David; Black, Emily; Tarnavsky, Elena; Young, Matthew; Greatrex, Helen; Allan, Richard P.; Stein, Thorwald; Nkonde, Edson; Senkunda, Samuel; Alcántara, Edgar Misael Uribe
2017-01-01
Rainfall information is essential for many applications in developing countries, and yet continually updated information at fine temporal and spatial scales is lacking. In Africa, rainfall monitoring is particularly important given the close relationship between climate and livelihoods. To address this information gap, this paper describes two versions (v2.0 and v3.0) of the TAMSAT daily rainfall dataset based on high-resolution thermal-infrared observations, available from 1983 to the present. The datasets are based on the disaggregation of 10-day (v2.0) and 5-day (v3.0) total TAMSAT rainfall estimates to a daily time step using daily cold cloud duration. This approach provides temporally consistent historic and near-real-time daily rainfall information for all of Africa. The estimates have been evaluated using ground-based observations from five countries with contrasting rainfall climates (Mozambique, Niger, Nigeria, Uganda, and Zambia) and compared to other satellite-based rainfall estimates. The results indicate that both versions of the TAMSAT daily estimates reliably detect rainy days but have less skill in capturing rainfall amounts, results that are comparable to the other datasets. PMID:28534868
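The disaggregation step can be sketched by splitting a multi-day total across days in proportion to each day's cold cloud duration (CCD). Proportional weighting is an assumption made here for illustration; the operational TAMSAT disaggregation is a calibrated procedure, not a bare ratio.

```python
def disaggregate_pentad(pentad_total_mm, daily_ccd_hours):
    """Split a 5-day rainfall total across days in proportion to each
    day's cold cloud duration.

    Proportional weighting is an illustrative assumption; the actual
    TAMSAT algorithm is calibrated against ground observations."""
    total_ccd = sum(daily_ccd_hours)
    if total_ccd == 0:
        return [0.0] * len(daily_ccd_hours)   # no cold cloud, no rain
    return [pentad_total_mm * c / total_ccd for c in daily_ccd_hours]

# 25 mm over a pentad, with cold cloud observed on three of five days
daily = disaggregate_pentad(25.0, [0, 4, 6, 0, 2])
print(daily)
```

Note that days with zero CCD receive zero rainfall, which is why the daily series stays consistent with the rain/no-rain detection the evaluation reports.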
Publishing NASA Metadata as Linked Open Data for Semantic Mashups
NASA Astrophysics Data System (ADS)
Wilson, Brian; Manipon, Gerald; Hua, Hook
2014-05-01
Data providers are now publishing more metadata in more interoperable forms, e.g. Atom or RSS 'casts', as Linked Open Data (LOD), or as ISO Metadata records. A major effort on the part of NASA's Earth Science Data and Information System (ESDIS) project is the aggregation of metadata that enables greater data interoperability among scientific data sets regardless of source or application. Both the Earth Observing System (EOS) ClearingHOuse (ECHO) and the Global Change Master Directory (GCMD) repositories contain metadata records for NASA (and other) datasets and provided services. These records contain typical fields for each dataset (or software service) such as the source, creation date, cognizant institution, related access URLs, and domain and variable keywords to enable discovery. Under a NASA ACCESS grant, we demonstrated how to publish the ECHO and GCMD dataset and services metadata as LOD in the RDF format. Both sets of metadata are now queryable at SPARQL endpoints and available for integration into "semantic mashups" in the browser. It is straightforward to reformat sets of XML metadata, including ISO, into simple RDF and then later refine and improve the RDF predicates by reusing known namespaces such as Dublin Core, GeoRSS, etc. All scientific metadata should be part of the LOD world. In addition, we developed an "instant" drill-down and browse interface that provides faceted navigation so that the user can discover and explore the 25,000 datasets and 3000 services. The available facets and the free-text search box appear in the left panel, and the instantly updated results for the dataset search appear in the right panel. The user can constrain the value of a metadata facet simply by clicking on a word (or phrase) in the "word cloud" of values for each facet.
The display section for each dataset includes the important metadata fields, a full description of the dataset, potentially some related URLs, and a "search" button that points to an OpenSearch GUI that is pre-configured to search for granules within the dataset. We will present our experiences with converting NASA metadata into LOD, discuss the challenges, illustrate some of the enabled mashups, and demonstrate the latest version of the "instant browse" interface for navigating multiple metadata collections.
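As a minimal sketch of the XML-to-RDF conversion step this abstract describes, the following Python code maps a hypothetical metadata record onto Dublin Core predicates and emits N-Triples. The record structure, subject URI, and field-to-predicate mapping are illustrative assumptions, not the actual ECHO/GCMD schema.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, one of the reusable vocabularies the abstract mentions.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical minimal metadata record; real ECHO/GCMD records are far richer.
record_xml = """
<dataset id="C12345">
  <title>Sea Surface Temperature L3</title>
  <institution>NASA/JPL</institution>
  <created>2013-04-01</created>
</dataset>
"""

def record_to_ntriples(xml_text):
    """Map a simple XML metadata record onto Dublin Core N-Triples."""
    root = ET.fromstring(xml_text)
    subj = "<http://example.org/dataset/%s>" % root.attrib["id"]
    # Field-to-predicate mapping (assumed for illustration).
    mapping = {"title": "title", "institution": "publisher", "created": "date"}
    triples = []
    for tag, term in mapping.items():
        value = root.findtext(tag)
        if value:
            triples.append('%s <%s%s> "%s" .' % (subj, DC, term, value))
    return triples

for triple in record_to_ntriples(record_xml):
    print(triple)
```

Once such triples are loaded into a triple store, they become queryable at a SPARQL endpoint, which is what enables the "semantic mashups" described above.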
Pairwise gene GO-based measures for biclustering of high-dimensional expression data.
Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S
2018-01-01
Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of groups of functionally coherent genes. In particular, a distance between genes can be defined according to the information stored about them in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function integrating GO information is studied in this paper. This merit function uses a term that addresses the information through a GO measure. The effect of two possible different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast datasets. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.
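The paper's actual GO semantic similarity measures operate on the ontology graph and term information content; as a much simpler stand-in, the sketch below scores gene pairs by Jaccard overlap of their GO annotation sets and averages that score over a candidate bicluster, illustrating how a pairwise GO term can feed a biclustering merit function. The gene names and GO IDs are hypothetical.

```python
from itertools import combinations

def go_jaccard(terms_a, terms_b):
    """Jaccard similarity between two genes' GO annotation sets (0..1)."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_pairwise_similarity(genes, annotations):
    """Average GO similarity over all gene pairs in a candidate bicluster.
    A merit function could reward biclusters that score highly here."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    return sum(go_jaccard(annotations[g1], annotations[g2])
               for g1, g2 in pairs) / len(pairs)

# Toy annotations (gene names and GO term IDs are illustrative only).
annotations = {
    "YFL039C": {"GO:0005856", "GO:0003779"},
    "YOR326W": {"GO:0005856", "GO:0003774"},
    "YBR118W": {"GO:0006414"},
}
print(mean_pairwise_similarity(list(annotations), annotations))
```

A scatter search would then combine this functional-coherence term with an expression-coherence term when evaluating candidate biclusters.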
A data-driven model for constraint of present-day glacial isostatic adjustment in North America
NASA Astrophysics Data System (ADS)
Simon, K. M.; Riva, R. E. M.; Kleinherenbrink, M.; Tangdamrongsub, N.
2017-09-01
Geodetic measurements of vertical land motion and gravity change are incorporated into an a priori model of present-day glacial isostatic adjustment (GIA) in North America via least-squares adjustment. The result is an updated GIA model wherein the final predicted signal is informed by both observational data, and prior knowledge (or intuition) of GIA inferred from models. The data-driven method allows calculation of the uncertainties of predicted GIA fields, and thus offers a significant advantage over predictions from purely forward GIA models. In order to assess the influence each dataset has on the final GIA prediction, the vertical land motion and GRACE-measured gravity data are incorporated into the model first independently (i.e., one dataset only), then simultaneously. The relative weighting of the datasets and the prior input is iteratively determined by variance component estimation in order to achieve the most statistically appropriate fit to the data. The best-fit model is obtained when both datasets are inverted and gives respective RMS misfits to the GPS and GRACE data of 1.3 mm/yr and 0.8 mm/yr equivalent water layer change. Non-GIA signals (e.g., hydrology) are removed from the datasets prior to inversion. The post-fit residuals between the model predictions and the vertical motion and gravity datasets, however, suggest particular regions where significant non-GIA signals may still be present in the data, including unmodeled hydrological changes in the central Prairies west of Lake Winnipeg. Outside of these regions of misfit, the posterior uncertainty of the predicted model provides a measure of the formal uncertainty associated with the GIA process; results indicate that this quantity is sensitive to the uncertainty and spatial distribution of the input data as well as that of the prior model information. 
In the study area, the predicted uncertainty of the present-day GIA signal ranges from ∼0.2-1.2 mm/yr for rates of vertical land motion, and from ∼3-4 mm/yr of equivalent water layer change for gravity variations.
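The least-squares adjustment described above can be illustrated at its smallest scale: for a single grid point, a prior model value and independent observations are fused by inverse-variance weighting, and the posterior variance gives the formal uncertainty. The numbers below are hypothetical and assume independent Gaussian errors; the real model fuses full spatial fields with iteratively estimated variance components.

```python
def fuse(prior, observations):
    """Least-squares (inverse-variance) fusion of a prior estimate with
    independent observations of the same quantity.

    prior and each observation are (value, variance) pairs.
    Returns (posterior value, posterior variance); the posterior variance
    shrinks as data are added, which is the formal-uncertainty advantage
    over a purely forward model."""
    weight = 1.0 / prior[1]
    weighted_sum = prior[0] / prior[1]
    for value, variance in observations:
        weight += 1.0 / variance
        weighted_sum += value / variance
    return weighted_sum / weight, 1.0 / weight

# Hypothetical uplift rates (mm/yr) at one grid point:
# prior model 8.0 +/- 2.0, GPS 9.5 +/- 1.0, GRACE-derived 9.0 +/- 1.5.
rate, var = fuse((8.0, 2.0 ** 2), [(9.5, 1.0 ** 2), (9.0, 1.5 ** 2)])
print(round(rate, 2), round(var ** 0.5, 2))
```

The posterior leans toward the best-constrained observation, and its uncertainty is smaller than any single input's, mirroring how the combined GPS + GRACE inversion gives the best-fit model.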
A reference human genome dataset of the BGISEQ-500 sequencer.
Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian
2017-05-01
BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared it to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). But for insertions and deletions (indels), we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50, respectively; sensitivity = 88.52% and 70.93%) than for the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as a reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.
ConTour: Data-Driven Exploration of Multi-Relational Datasets for Drug Discovery.
Partl, Christian; Lex, Alexander; Streit, Marc; Strobelt, Hendrik; Wassermann, Anne-Mai; Pfister, Hanspeter; Schmalstieg, Dieter
2014-12-01
Large-scale data analysis is nowadays a crucial part of drug discovery. Biologists and chemists need to quickly explore and evaluate potentially effective yet safe compounds based on many datasets that are in relationship with each other. However, there is a lack of tools that support them in these processes. To remedy this, we developed ConTour, an interactive visual analytics technique that enables the exploration of these complex, multi-relational datasets. At its core, ConTour lists all items of each dataset in a column. Relationships between the columns are revealed through interaction: selecting one or multiple items in one column highlights and re-sorts the items in other columns. Filters based on relationships enable drilling down into the large data space. To identify interesting items in the first place, ConTour employs advanced sorting strategies, including strategies based on connectivity strength and uniqueness, as well as sorting based on item attributes. ConTour also introduces interactive nesting of columns, a powerful method to show the related items of a child column for each item in the parent column. Within the columns, ConTour shows rich attribute data about the items as well as information about the connection strengths to other datasets. Finally, ConTour provides a number of detail views, which can show items from multiple datasets and their associated data at the same time. We demonstrate the utility of our system in case studies conducted with a team of chemical biologists, who investigate the effects of chemical compounds on cells and need to understand the underlying mechanisms.
GRIIDC: A Data Repository for Gulf of Mexico Science
NASA Astrophysics Data System (ADS)
Ellis, S.; Gibeaut, J. C.
2017-12-01
The Gulf of Mexico Research Initiative Information & Data Cooperative (GRIIDC) system is a data management solution appropriate for any researcher sharing Gulf of Mexico and oil spill science data. Our mission is to ensure a data and information legacy that promotes continual scientific discovery and public awareness of the Gulf of Mexico ecosystem. GRIIDC developed an open-source software solution to manage data from the Gulf of Mexico Research Initiative (GoMRI). The GoMRI program has over 2500 researchers from diverse fields of study with a variety of attitudes, experiences, and capacities for data sharing. The success of this solution is apparent through new partnerships to share data generated by RESTORE Act Centers of Excellence Programs, the National Academies of Science, and others. The GRIIDC data management system integrates dataset management planning, metadata creation, persistent identification, and data discoverability into an easy-to-use web application. No specialized software or program installations are required to support dataset submission or discovery. Furthermore, no data transformations are needed to submit data to GRIIDC; common file formats such as Excel, CSV, and text are all acceptable for submissions. To ensure data are properly documented using the GRIIDC implementation of the ISO 19115-2 metadata standard, researchers submit detailed descriptive information through a series of interactive forms, and no knowledge of metadata or XML formats is required. Once a dataset is documented and submitted, the GRIIDC team performs a review of the dataset package. This review ensures that files can be opened and contain data, and that data are completely and accurately described. This review does not include performing quality assurance or control of data points, as GRIIDC expects scientists to perform these steps during the course of their work. 
Once approved, data are made public and searchable through the GRIIDC data discovery portal and the DataONE network.
Development of National Map ontologies for organization and orchestration of hydrologic observations
NASA Astrophysics Data System (ADS)
Lieberman, J. E.
2014-12-01
Feature layers in the National Map program (TNM) are a fundamental context for much of the data collection and analysis conducted by the USGS and other governmental and nongovernmental organizations. Their computational usefulness, though, has been constrained by the lack of formal relationships besides superposition between TNM layers, as well as limited means of representing how TNM datasets relate to additional attributes, datasets, and activities. In the field of Geospatial Information Science, there has been a growing recognition of the value of semantic representation and technology for addressing these limitations, particularly in the face of burgeoning information volume and heterogeneity. Fundamental to this approach is the development of formal ontologies for concepts related to that information that can be processed computationally to enhance creation and discovery of new geospatial knowledge. They offer a means of making much of the presently innate knowledge about relationships in and between TNM features accessible for machine processing and distributed computation. A full and comprehensive ontology of all knowledge represented by TNM features is still impractical. The work reported here involves elaboration and integration of a number of small ontology design patterns (ODPs) that represent limited, discrete, but commonly accepted and broadly applicable physical theories for the behavior of TNM features representing surface water bodies and landscape surfaces and the connections between them. These ontology components are validated through use in applications for discovery and aggregation of water science observational data associated with National Hydrography Dataset features, features from the National Elevation Dataset (NED) and Watershed Boundary Dataset (WBD) that constrain water occurrence in the continental US. These applications emphasize workflows which are difficult or impossible to automate using existing data structures. 
Evaluation of the usefulness of the developed ontology components includes both solicitation of feedback on prototype applications, and provision of a query / mediation service for feature-linked data to facilitate development of additional third-party applications.
A Bayesian trans-dimensional approach for the fusion of multiple geophysical datasets
NASA Astrophysics Data System (ADS)
JafarGandomi, Arash; Binley, Andrew
2013-09-01
We propose a Bayesian fusion approach to integrate multiple geophysical datasets with different coverage and sensitivity. The fusion strategy is based on the capability of various geophysical methods to provide enough resolution to identify either subsurface material parameters or subsurface structure, or both. We focus on electrical resistivity as the target material parameter and electrical resistivity tomography (ERT), electromagnetic induction (EMI), and ground penetrating radar (GPR) as the set of geophysical methods. However, extending the approach to different sets of geophysical parameters and methods is straightforward. Different geophysical datasets are entered into a trans-dimensional Markov chain Monte Carlo (McMC) search-based joint inversion algorithm. The trans-dimensional property of the McMC algorithm allows dynamic parameterisation of the model space, which in turn helps to avoid bias of the post-inversion results towards a particular model. Given that we are attempting to develop an approach that has practical potential, we discretize the subsurface into an array of one-dimensional earth-models. Accordingly, the ERT data that are collected by using two-dimensional acquisition geometry are recast as a set of equivalent vertical electric soundings. Different data are inverted either individually or jointly to estimate one-dimensional subsurface models at discrete locations. We use Shannon's information measure to quantify the information obtained from the inversion of different combinations of geophysical datasets. Information from multiple methods is brought together by introducing a joint likelihood function and/or constraining the prior information. A Bayesian maximum entropy approach is used for spatial fusion of spatially dispersed estimated one-dimensional models and mapping of the target parameter. We illustrate the approach with a synthetic dataset and then apply it to a field dataset. 
We show that the proposed fusion strategy is successful not only in enhancing the subsurface information but also as a survey design tool to identify the appropriate combination of the geophysical tools and show whether application of an individual method for further investigation of a specific site is beneficial.
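Shannon's information measure, used above to compare dataset combinations, can be sketched concretely: discretize the posterior over resistivity into bins and compute the entropy reduction relative to the prior. The bin probabilities below are hypothetical stand-ins for marginal posteriors from the McMC inversion.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete probability distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)

def information_gain(prior, posterior):
    """Information delivered by an inversion: H(prior) - H(posterior)."""
    return entropy(prior) - entropy(posterior)

# Hypothetical marginal posteriors over four resistivity bins at one cell:
# a single-method inversion sharpens the uniform prior a little, the joint
# inversion sharpens it much more.
uniform_prior = [0.25, 0.25, 0.25, 0.25]
posterior_single = [0.40, 0.30, 0.20, 0.10]
posterior_joint = [0.70, 0.20, 0.05, 0.05]

print(information_gain(uniform_prior, posterior_single))
print(information_gain(uniform_prior, posterior_joint))
```

Comparing these gains across method combinations is exactly the survey-design use case: a combination that adds little information over a cheaper subset need not be deployed.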
The National Center for Atmospheric Research (NCAR) Research Data Archive: a Data Education Center
NASA Astrophysics Data System (ADS)
Peng, G. S.; Schuster, D.
2015-12-01
The National Center for Atmospheric Research (NCAR) Research Data Archive (RDA), rda.ucar.edu, is not just another data center or data archive. It is a data education center. We not only serve data, we TEACH data. Weather and climate data is the original "Big Data" dataset, and lessons learned while playing with weather data are applicable to a wide range of data investigations. Erroneous data assumptions are the Achilles heel of Big Data. It doesn't matter how much data you crunch if the data is not what you think it is. Each dataset archived at the RDA is assigned to a data specialist (DS) who curates the data. If a user has a question not answered in the dataset information web pages, they can call or email a skilled DS for further clarification. The RDA's diverse staff—with academic training in meteorology, oceanography, engineering (electrical, civil, ocean and database), mathematics, physics, chemistry and information science—means we likely have someone who "speaks your language." Data discovery is another difficult Big Data problem; one can only solve problems with data if one can find the right data. Metadata, both machine and human-generated, underpin the RDA data search tools. Users can quickly find datasets by name or dataset ID number. They can also perform a faceted search that successively narrows the options by user requirements or simply kick off an indexed search with a few words. Weather data formats can be difficult for non-expert users to read; data are usually packed in binary formats requiring specialized software, and parameter names use specialized vocabularies. DSs create detailed information pages for each dataset and maintain lists of helpful software, documentation and links of information around the web. We further grow the level of sophistication of the users with tips, tutorials and data stories on the RDA Blog, http://ncarrda.blogspot.com/. 
How-to video tutorials are also posted on the NCAR Computational and Information Systems Laboratory (CISL) YouTube channel.
Phylogeny of Kinorhyncha Based on Morphology and Two Molecular Loci
Sørensen, Martin V.; Dal Zotto, Matteo; Rho, Hyun Soo; Herranz, Maria; Sánchez, Nuria; Pardos, Fernando; Yamasaki, Hiroshi
2015-01-01
The phylogeny of Kinorhyncha was analyzed using morphology and the molecular loci 18S rRNA and 28S rRNA. The different datasets were analyzed separately and in combination, using maximum likelihood and Bayesian inference. Bayesian inference of molecular sequence data in combination with morphology supported the division of Kinorhyncha into two major clades: Cyclorhagida comb. nov. and Allomalorhagida nom. nov. The latter clade represents a new kinorhynch class, and accommodates Dracoderes, Franciscideres, a yet undescribed genus closely related to Franciscideres, and the traditional homalorhagid genera. Homalorhagid monophyly was not supported by any analyses with molecular sequence data included. Analysis of the combined molecular and morphological data furthermore supported a cyclorhagid clade which included all traditional cyclorhagid taxa, except Dracoderes, which should no longer be considered a cyclorhagid genus. Accordingly, Cyclorhagida is divided into three main lineages: Echinoderidae, Campyloderidae, and a large clade, ‘Kentrorhagata’, which, except for species of Campyloderes, includes all species with a midterminal spine present in adult individuals. Maximum likelihood analysis of the combined datasets produced a rather unresolved tree that was not regarded in the following discussion. Results of the analyses with only molecular sequence data included were incongruent at different points. However, common for all analyses was the support of several major clades, i.e., Campyloderidae, Kentrorhagata, Echinoderidae, Dracoderidae, Pycnophyidae, and a clade with Paracentrophyes + New Genus and Franciscideres (in those analyses where the latter was included). All molecular analyses including 18S rRNA sequence data furthermore supported monophyly of Allomalorhagida. 
Cyclorhagid monophyly was only supported in analyses of combined 18S rRNA and 28S rRNA (both ML and BI), and only in a restricted dataset where taxa with incomplete information from 28S rRNA had been omitted. Analysis of the morphological data produced results that were similar to those from the combined molecular and morphological analysis. For example, the morphological data also supported exclusion of Dracoderes from Cyclorhagida. The main differences between the morphological analysis and analyses based on the combined datasets include: 1) Homalorhagida appears as monophyletic in the morphological tree only, 2) the morphological analyses position Franciscideres and the new genus within Cyclorhagida near Zelinkaderidae and Cateriidae, whereas analyses including molecular data place the two genera inside Allomalorhagida, and 3) species of Campyloderes appear in a basal trichotomy within Kentrorhagata in the morphological tree, whereas analysis of the combined datasets places species of Campyloderes as a sister clade to Echinoderidae and Kentrorhagata. PMID:26200115
Utility and Limitations of Using Gene Expression Data to Identify Functional Associations
Peng, Cheng; Shiu, Shin-Han
2016-01-01
Gene co-expression has been widely used to hypothesize gene function through guilt-by-association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using co-expression as a criterion to recover functional associations between genes. By determining the percentage of gene pairs in a metabolic pathway with significant expression correlation, we found that many genes in the same pathway do not have similar transcript profiles, and that the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impacts the ability to recover functional associations between genes, using Arabidopsis thaliana as an example. Some datasets are more informative in capturing coordinated expression profiles, and larger datasets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore multiple dataset combinations, similarity measures, clustering algorithms, and parameters rather exhaustively. Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets. PMID:27935950
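The core metric of this study, the percentage of gene pairs in a pathway with significant expression correlation, can be sketched with plain Pearson correlation. The expression profiles, gene names, and correlation threshold below are hypothetical; the paper's analysis additionally involves significance testing and multiple similarity measures.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def coexpressed_fraction(profiles, threshold=0.8):
    """Fraction of gene pairs in a pathway whose |r| meets the threshold."""
    pairs = list(combinations(sorted(profiles), 2))
    hits = sum(1 for g1, g2 in pairs
               if abs(pearson(profiles[g1], profiles[g2])) >= threshold)
    return hits / len(pairs)

# Toy "pathway": two co-regulated genes plus one with a flat profile,
# showing how same-pathway genes can still fail the co-expression criterion.
profiles = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.1, 3.9, 6.2, 8.0],
    "G3": [5.0, 5.1, 4.9, 5.0],
}
print(coexpressed_fraction(profiles))
```

Only one of the three pairs passes here, illustrating the paper's finding that many same-pathway genes lack similar transcript profiles.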
Rendering an archive in three dimensions
NASA Astrophysics Data System (ADS)
Leiman, David A.; Twose, Claire; Lee, Teresa Y. H.; Fletcher, Alex; Yoo, Terry S.
2003-05-01
We examine the requirements for a publicly accessible, online collection of three-dimensional biomedical image data, including those yielded by radiological processes such as MRI, ultrasound and others. Intended as a repository and distribution mechanism for such medical data, we created the National Online Volumetric Archive (NOVA) as a case study aimed at identifying the multiple issues involved in realizing a large-scale digital archive. In this paper, we discuss factors such as current legal and health-information privacy policy affecting the collection of human medical images, information retrieval and management, and technical implementation. This project culminated in the launching of a website that includes downloadable datasets and a prototype data submission system.
NASA Astrophysics Data System (ADS)
Heather, David; Besse, Sebastien; Vallat, Claire; Barbarisi, Isa; Arviset, Christophe; De Marchi, Guido; Barthelemy, Maud; Coia, Daniela; Costa, Marc; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; MacFarlane, Alan; Martinez, Santa; Rios, Carlos; Vallejo, Fran; Saiz, Jaime
2017-04-01
The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://psa.esa.int. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA is currently implementing a number of significant improvements, mostly driven by the evolution of the PDS standard, and the growing need for better interfaces and advanced applications to support science exploitation. As of the end of 2016, the PSA is hosting data from all of ESA's planetary missions. This includes ESA's first planetary mission, Giotto, which encountered comet 1P/Halley in 1986 with a flyby at 800 km. Science data from Venus Express, Mars Express, Huygens and the SMART-1 mission are also all available at the PSA. The PSA also contains all science data from Rosetta, which explored comet 67P/Churyumov-Gerasimenko and asteroids Steins and Lutetia. The year 2016 has seen the arrival of the ExoMars 2016 data in the archive. In the upcoming years, at least three new projects are foreseen to be fully archived at the PSA. The BepiColombo mission is scheduled for launch in 2018. Following that, the ExoMars Rover Surface Platform (RSP) in 2020, and then the JUpiter ICy moon Explorer (JUICE). All of these will archive their data in the PSA. In addition, a few ground-based support programmes are also available, especially for the Venus Express and Rosetta missions. The newly designed PSA will enhance the user experience and will significantly reduce the complexity for users to find their data, promoting one-click access to the scientific datasets with more customized views when needed. This includes a better integration with Planetary GIS analysis tools and Planetary interoperability services (search and retrieve data, supporting e.g. PDAP, EPN-TAP). 
It will also be up-to-date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's ExoMars and upcoming BepiColombo missions. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. The new PSA interface was released in January 2017. The home page provides direct and simple access to the scientific data, aiming to help scientists discover and explore its content. The archive can be explored through a set of parameters that allow the selection of products through space and time. Quick views provide the information needed for the selection of appropriate scientific products. During 2017, the PSA team will focus their efforts on developing a map search interface using GIS technologies to display ESA planetary datasets, an image gallery providing navigation through images to explore the datasets, and interoperability with international partners. This will be done in parallel with additional metadata searchable through the interface (i.e., geometry), and with continued effort to improve the content of 20 years of space exploration.
Rachel Riemann; Barry Tyler Wilson; Andrew Lister; Sarah Parks
2010-01-01
Geospatial datasets of forest characteristics are modeled representations of real populations on the ground. The continuous spatial character of such datasets provides an incredible source of information at the landscape level for ecosystem research, policy analysis, and planning applications, all of which are critical for addressing current challenges related to...
Exploring Sedimentary Basins with High Frequency Receiver Function: the Dublin Basin Case Study
NASA Astrophysics Data System (ADS)
Licciardi, A.; Piana Agostinetti, N.
2015-12-01
The Receiver Function (RF) method is a widely applied seismological tool for the imaging of crustal and lithospheric structures beneath a single seismic station with one to tens of kilometers of vertical resolution. However, detailed information about the upper crust (0-10 km depth) can also be retrieved by increasing the frequency content of the analyzed RF data-set (with a vertical resolution better than 0.5 km). This information includes the depth of velocity contrasts, S-wave velocities within layers, as well as the presence and location of seismic anisotropy or dipping interfaces (e.g., induced by faulting) at depth. These observables provide valuable constraints on the structural settings and properties of sedimentary basins for both scientific and industrial applications. To test the RF capabilities for this high resolution application, six broadband seismic stations have been deployed across the southwestern margin of the Dublin Basin (DB), Ireland, whose geothermal potential has been investigated in the last few years. With an inter-station distance of about 1 km, this closely spaced array has been designed to provide a clear picture of the structural transition between the margin and the inner portion of the basin. In this study, a Bayesian approach is used to retrieve the posterior probability distributions of S-wave velocity at depth beneath each seismic station. A multi-frequency RF data-set is analyzed, and RF and curves of apparent velocity are jointly inverted to better constrain absolute velocity variations. A pseudo 2D section is built to observe the lateral changes in elastic properties across the margin of the basin, with a focus on the shallow portion of the crust. Moreover, by means of the harmonic decomposition technique, the azimuthal variations in the RF data-set are isolated and interpreted in terms of anisotropy and dipping interfaces associated with the major fault system in the area. 
These results are compared with the available information from previous seismic active surveys in the area, including boreholes data.
VMSbase: an R-package for VMS and logbook data management and analysis in fisheries ecology.
Russo, Tommaso; D'Andrea, Lorenzo; Parisi, Antonio; Cataudella, Stefano
2014-01-01
VMSbase is an R package devised to manage, process and visualize information about fishing vessels activity (provided by the vessel monitoring system--VMS) and catches/landings (as reported in the logbooks). VMSbase is primarily conceived to be user-friendly; to this end, a suite of state-of-the-art analyses is accessible via a graphical interface. In addition, the package uses a database platform allowing large datasets to be stored, managed and processed very efficiently. Methodologies include data cleaning, that is, removal of redundant or evidently erroneous records, and data enhancing, that is, interpolation and merging with external data sources. In particular, VMSbase is able to estimate sea bottom depth for single VMS pings using an on-line connection to the National Oceanic and Atmospheric Administration (NOAA) database. It also allows VMS pings to be assigned to whatever geographic partitioning has been selected by users. Standard analyses comprise: 1) métier identification (using a modified CLARA clustering approach on Logbook data or Artificial Neural Networks on VMS data); 2) linkage between VMS and Logbook records, with the former organized into fishing trips; 3) discrimination between steaming and fishing points; 4) computation of spatial effort with respect to user-selected grids; 5) calculation of standard fishing effort indicators within the Data Collection Framework; 6) a variety of mapping tools, including an interface for Google viewer; 7) estimation of trawled area. Here we report a sample workflow for the accessory sample datasets (available with the package) in order to explore the potentialities of VMSbase. In addition, the results of some performance tests on two large datasets (1×10(5) and 1×10(6) VMS signals, respectively) are reported to inform about the time required for the analyses. 
The results, although merely illustrative, indicate that VMSbase can represent a step forward in extracting and enhancing information from VMS/logbook data for fisheries studies.
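VMSbase itself is an R package; as a language-neutral illustration of step 3 above (discriminating steaming from fishing points), the Python sketch below flags pings whose speed falls inside a gear-specific fishing range. The speed thresholds, ping fields, and coordinates are hypothetical placeholders, not VMSbase's actual classifier, which can also use trained neural networks.

```python
def classify_pings(pings, fishing_speed=(2.0, 4.5)):
    """Label each VMS ping 'fishing' when its speed (knots) lies within the
    gear-specific fishing range, else 'steaming'. A trawler typically tows
    at a few knots, so the default range is only a plausible placeholder."""
    low, high = fishing_speed
    return [
        dict(ping, state="fishing" if low <= ping["speed"] <= high else "steaming")
        for ping in pings
    ]

# Hypothetical pings (decimal degrees, speed in knots).
pings = [
    {"lon": 12.50, "lat": 41.90, "speed": 10.2},  # transiting to the grounds
    {"lon": 12.55, "lat": 41.80, "speed": 3.1},   # likely towing gear
    {"lon": 12.56, "lat": 41.79, "speed": 0.2},   # drifting or in port
]
for p in classify_pings(pings):
    print(p["state"])
```

Once pings are labeled, spatial effort (step 4) follows by counting fishing-state pings per grid cell.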
Mashburn, Shana L.; Winton, Kimberly T.
2010-01-01
This CD-ROM contains spatial datasets that describe natural and anthropogenic features, county-level estimates of agricultural pesticide use, and pesticide data for surface-water, groundwater, and biological specimens in the state of Oklahoma. County-level estimates of pesticide use were compiled from the Pesticide National Synthesis Project of the U.S. Geological Survey, National Water-Quality Assessment Program. Pesticide data for surface water, groundwater, and biological specimens were compiled from the U.S. Geological Survey National Water Information System database. The spatial datasets that describe natural and manmade features were compiled from several agencies and contain information collected by the U.S. Geological Survey. The U.S. Geological Survey datasets were not collected specifically for this compilation, but were previously collected for projects with various objectives. The spatial datasets were created by different agencies from sources of varied quality. As a result, features common to multiple layers may not overlay exactly. Users should check the metadata to determine proper use of these spatial datasets. These data were not checked for accuracy or completeness. If a question of accuracy or completeness arises, the user should contact the originator cited in the metadata.
NASA Technical Reports Server (NTRS)
Goseva-Popstojanova, Katerina; Tyo, Jacob P.; Sizemore, Brian
2017-01-01
NASA develops, runs, and maintains software systems for which security is of vital importance. It is therefore imperative to develop secure systems and extend current software assurance capabilities to cover the information assurance and cybersecurity concerns of NASA missions. The results presented in this report are based on the information provided in the issue tracking systems of one ground mission and one flight mission. The extracted data were used to create three datasets: Ground mission IVV issues, Flight mission IVV issues, and Flight mission Developers issues. In each dataset, we identified the software bugs that are security related and classified them into specific security classes. This information was then used to create the security vulnerability profiles (i.e., to determine how, why, where, and when the security vulnerabilities were introduced) and explore the existence of common trends. The main findings of our work include: - Code-related security issues dominated both the Ground and Flight mission IVV security issues, at 95% and 92%, respectively. Therefore, enforcing secure coding practices, and verification and validation focused on coding errors, would be cost-effective ways to improve mission security. (The Flight mission Developers issues dataset did not contain data in the Issue Category.) - In both the Ground and Flight mission IVV issues datasets, the majority of security issues (i.e., 91% and 85%, respectively) were introduced in the Implementation phase. In most cases, the phase in which the issues were found was the same as the phase in which they were introduced.
Most of the security related issues in the Flight mission Developers issues dataset were found during Code Implementation, Build Integration, and Build Verification; data on the phase in which these issues were introduced were not available for this dataset. - The location of security related issues, like the location of software issues in general, followed the Pareto principle. Specifically, for all three datasets, from 86% to 88% of the security related issues were located in two to four subsystems. - The severity levels of most security issues were moderate in all three datasets. - Out of 21 primary security classes, five dominated: Exception Management, Memory Access, Other, Risky Values, and Unused Entities. Together, these classes contributed from around 80% to 90% of all security issues in each dataset. This again demonstrates the Pareto principle of uneven distribution of security issues, in this case across CWE classes, and supports the fact that addressing these dominant security classes provides the most cost-efficient way to improve missions' security. The findings presented in this report uncovered the security vulnerability profiles and identified the common trends and dominant classes of security issues, which in turn can be used to select the most efficient secure design and coding best practices compiled by the part of the SARP project team associated with NASA's Johnson Space Center. In addition, these findings provide valuable input to the NASA IVV initiative aimed at identifying the top 25 CWEs of ground and flight missions.
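The Pareto-style concentration reported above (most security issues sitting in two to four subsystems) can be checked on any issue-tracker export with a few lines; the subsystem names below are hypothetical.

```python
# Fraction of issues concentrated in the most-affected subsystems,
# the measure behind the Pareto-principle finding above.
from collections import Counter

def share_in_top(locations, top_n):
    """Fraction of issues located in the top_n most affected subsystems."""
    counts = Counter(locations)
    return sum(c for _, c in counts.most_common(top_n)) / len(locations)

# Hypothetical per-issue subsystem labels from a tracker export:
issues = ["gnc", "gnc", "gnc", "comms", "comms", "payload", "thermal"]
print(round(share_in_top(issues, 2), 2))  # 0.71: five of seven issues in two subsystems
```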
Bovendorp, Ricardo S; Villar, Nacho; de Abreu-Junior, Edson F; Bello, Carolina; Regolin, André L; Percequillo, Alexandre R; Galetti, Mauro
2017-08-01
The contribution of small mammal ecology to the understanding of macroecological patterns of biodiversity, population dynamics, and community assembly has been hindered by the absence of large datasets of small mammal communities from tropical regions. Here we compile the largest dataset of inventories of small mammal communities for the Neotropical region. The dataset reviews small mammal communities from the Atlantic forest of South America, one of the regions with the highest diversity of small mammals and a global biodiversity hotspot, though currently covering less than 12% of its original area due to anthropogenic pressures. The dataset comprises 136 references from 300 locations covering seven vegetation types of tropical and subtropical Atlantic forests of South America, and presents data on species composition, richness, and relative abundance (captures/trap-nights). One paper was published more than 70 years ago, but 80% of the references were published after 2000. The dataset comprises 53,518 individuals of 124 species of small mammals, including 30 species of marsupials and 94 species of rodents. Species richness averaged 8.2 species (range 1-21) per site. Only two species occurred in more than 50% of the sites (the common opossum, Didelphis aurita, and the black-footed pygmy rice rat, Oligoryzomys nigripes). Mean species abundance varied 430-fold, from 4.3 to 0.01 individuals/trap-night. The dataset also revealed a hyper-dominance of 22 species that comprised 78.29% of all individuals captured, with only seven species representing 44% of all captures. The information contained in this dataset can be applied to the study of macroecological patterns of biodiversity, communities, and populations, but also to evaluate the ecological consequences of fragmentation and defaunation, and to predict disease outbreaks, trophic interactions and community dynamics in this biodiversity hotspot. © 2017 by the Ecological Society of America.
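The relative abundance measure used throughout the dataset, captures per trap-night, is a simple ratio; a minimal sketch with illustrative numbers (not values from the compilation):

```python
# Relative abundance as used in the dataset: individuals captured divided
# by sampling effort in trap-nights. Inputs below are illustrative.
def relative_abundance(captures, trap_nights):
    return captures / trap_nights

print(relative_abundance(86, 2000))  # 0.043 individuals/trap-night
```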
Phase 1 Free Air CO2 Enrichment Model-Data Synthesis (FACE-MDS): Model Output Data (2015)
Walker, A. P.; De Kauwe, M. G.; Medlyn, B. E.; Zaehle, S.; Asao, S.; Dietze, M.; El-Masri, B.; Hanson, P. J.; Hickler, T.; Jain, A.; Luo, Y.; Parton, W. J.; Prentice, I. C.; Ricciuto, D. M.; Thornton, P. E.; Wang, S.; Wang, Y -P; Warlind, D.; Weng, E.; Oren, R.; Norby, R. J.
2015-01-01
These datasets comprise the model output from Phase 1 of the FACE-MDS. They include simulations of the Duke and Oak Ridge experiments and also idealised long-term (300 year) simulations at both sites (please see the modelling protocol for details). Modelling and output protocols are included as part of this dataset, and the model datasets are formatted according to the output protocols. Phase 1 datasets are reproduced here for posterity and reproducibility, although the model output for the experimental period has been somewhat superseded by the Phase 2 datasets.
NASA Astrophysics Data System (ADS)
Car, Nicholas; Cox, Simon; Fitch, Peter
2015-04-01
With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty of multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects. We present a method to address both these issues by combining UncertML with PROV-O, and delivering the resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, adds further elements to support uncertainty representation without a conceptual model, and integrates UncertML through links to documents. The Linked Data API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again. The Linked Data API's 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follows the UncertML model it can be automatically interpreted and may also support automatic uncertainty propagation.
Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements that are associated with uncertainty (PROV-O Entity and Activity classes) have UncertML elements recorded. This methodology is intentionally flexible to allow uncertainty metadata in many forms, not limited to UncertML. While a more formal representation of uncertainty metadata is desirable (using UncertProv elements to implement the UncertML conceptual model), this will not always be possible, and any uncertainty data stored will be better than none. Since the UncertProv ontology contains a superset of UncertML elements to facilitate the representation of non-UncertML uncertainty data, it could easily be extended to include other formal uncertainty conceptual models, thus allowing non-UncertML propagation calculations.
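The navigation pattern described above, from a dataset object to its attached uncertainty metadata via linked triples, can be illustrated with a toy, stdlib-only triple store; the URIs and predicate names are hypothetical stand-ins for the actual PROV-O and UncertML terms.

```python
# Toy subject-predicate-object triple store mimicking the linkage pattern:
# a dataset entity points to an uncertainty node, which carries the
# statistical descriptors. Names are illustrative, not the real vocabulary.
triples = [
    ("ex:dataset1", "rdf:type",          "prov:Entity"),
    ("ex:dataset1", "un:hasUncertainty", "ex:unc1"),
    ("ex:unc1",     "rdf:type",          "un:NormalDistribution"),
    ("ex:unc1",     "un:variance",       "0.25"),
]

def objects(subject, predicate):
    """All objects of triples matching (subject, predicate, ?)."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Navigate: dataset -> uncertainty node -> its variance descriptor.
unc = objects("ex:dataset1", "un:hasUncertainty")[0]
print(objects(unc, "un:variance"))  # ['0.25']
```

A SPARQL query against a real UncertProv graph would follow the same two-hop pattern.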
Frömer, Romy; Maier, Martin; Abdel Rahman, Rasha
2018-01-01
Here we present an application of an EEG processing pipeline customizing EEGLAB and FieldTrip functions, specifically optimized to flexibly analyze EEG data based on single-trial information. The key component of our approach is to create a comprehensive 3-D EEG data structure including all trials and all participants, maintaining the original order of recording. This allows straightforward access to subsets of the data based on any information available in a behavioral data structure matched with the EEG data (experimental conditions, but also performance indicators, such as accuracy or RTs of single trials). In the present study we exploit this structure to compute linear mixed models (LMMs, using lmer in R) including random intercepts and slopes for items. This information can easily be read out from the matched behavioral data, whereas it might not be accessible in traditional ERP approaches without substantial effort. We further provide easily adaptable scripts for performing cluster-based permutation tests (as implemented in FieldTrip), as a more robust alternative to traditional omnibus ANOVAs. Our approach is particularly advantageous for data with parametric within-subject covariates (e.g., performance) and/or multiple complex stimuli (such as words, faces or objects) that vary in features affecting cognitive processes and ERPs (such as word frequency, salience or familiarity), which are sometimes hard to control experimentally or might themselves constitute variables of interest. The present dataset was recorded from 40 participants who performed a visual search task on previously unfamiliar objects, presented either visually intact or blurred. MATLAB as well as R scripts are provided that can be adapted to different datasets.
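The core idea of the trials x channels x samples structure matched to a behavioral table can be sketched as follows; field names, dimensions and the selection criterion are hypothetical, and the actual pipeline is implemented in MATLAB/R rather than Python.

```python
# Sketch of the "comprehensive 3-D EEG data structure": epochs stored as
# trials x channels x samples in original recording order, alongside a
# matched behavioral table, so any behavioral field can index the EEG trials.
import random

random.seed(0)
n_trials, n_channels, n_samples = 6, 2, 4
eeg = [[[random.gauss(0.0, 1.0) for _ in range(n_samples)]
        for _ in range(n_channels)] for _ in range(n_trials)]
behavior = [{"trial": i, "rt": 300 + 40 * i, "correct": i % 2 == 0}
            for i in range(n_trials)]

# Select single-trial EEG by any behavioral criterion, preserving order:
fast_correct = [eeg[b["trial"]] for b in behavior
                if b["correct"] and b["rt"] < 470]
print(len(fast_correct))  # 3 trials satisfy the criterion
```

Because the EEG and behavioral structures share trial order, any covariate (accuracy, RT, stimulus feature) becomes a valid index into the single-trial data, which is what makes the LMM analyses straightforward.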
NASA Technical Reports Server (NTRS)
Liu, Zhong; Ostrenga, Dana; Leptoukh, Gregory
2011-01-01
In order to facilitate Earth science data access, the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) has developed a web prototype, the Hurricane Data Analysis Tool (HDAT; URL: http://disc.gsfc.nasa.gov/HDAT), to allow users to conduct online visualization and analysis of several remote sensing and model datasets for educational activities and studies of tropical cyclones and other weather phenomena. With a web browser and a few mouse clicks, users have full access to terabytes of data and can generate 2-D or time-series plots and animations without downloading any software or data. HDAT includes data from the NASA Tropical Rainfall Measuring Mission (TRMM), the NASA Quick Scatterometer (QuikSCAT), the NCEP Reanalysis, and the NCEP/CPC half-hourly, 4-km Global (60° N - 60° S) IR Dataset. The GES DISC archives TRMM data. The daily global rainfall product derived from the 3-hourly multi-satellite precipitation product (3B42 V6) is available in HDAT. The TRMM Microwave Imager (TMI) sea surface temperature from Remote Sensing Systems is in HDAT as well. The NASA QuikSCAT ocean surface wind and the NCEP Reanalysis provide ocean surface and atmospheric conditions, respectively. The global merged IR product, also known as the NCEP/CPC half-hourly, 4-km Global (60° N - 60° S) IR Dataset, is one of the TRMM ancillary datasets. It consists of globally merged, pixel-resolution IR brightness temperature data (equivalent blackbody temperatures), merged from all available geostationary satellites (GOES-8/10, METEOSAT-7/5 & GMS). The GES DISC has collected over 10 years of these data, beginning in February 2000. This high temporal resolution (every 30 minutes) dataset not only provides additional background information for TRMM and other satellite missions, but also allows a wide range of meteorological phenomena to be observed from space, such as hurricanes, typhoons, tropical cyclones, and mesoscale convective systems.
Basic functions include selection of the area of interest and time, single imagery, overlay of two different products, animation, a time-skip capability, and different image size outputs. Users can save an animation as a file (animated GIF) and import it into other presentation software, such as Microsoft PowerPoint. Since the tool directly accesses the real data, more features and functionality can be added in the future.
Belger, Mark; Haro, Josep Maria; Reed, Catherine; Happich, Michael; Kahle-Wrobleski, Kristin; Argimon, Josep Maria; Bruno, Giuseppe; Dodel, Richard; Jones, Roy W; Vellas, Bruno; Wimo, Anders
2016-07-18
Missing data are a common problem in prospective studies with a long follow-up, and the volume, pattern and reasons for missing data may be relevant when estimating the cost of illness. We aimed to evaluate the effects of different methods for dealing with missing longitudinal cost data and for costing caregiver time on total societal costs in Alzheimer's disease (AD). GERAS is an 18-month observational study of costs associated with AD. Total societal costs included patient health and social care costs, and caregiver health and informal care costs. Missing data were classified as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). Simulation datasets were generated from baseline data with 10-40% missing total cost data for each missing data mechanism. Datasets were also simulated to reflect the missing cost data pattern at 18 months using MAR and MNAR assumptions. Naïve and multiple imputation (MI) methods were applied to each dataset and results compared with complete GERAS 18-month cost data. Opportunity and replacement cost approaches were used for caregiver time, which was costed with and without supervision included and with time for working caregivers only being costed. Total costs were available for 99.4% of 1497 patients at baseline. For MCAR datasets, naïve methods performed as well as MI methods. For MAR, MI methods performed better than naïve methods. All imputation approaches were poor for MNAR data. For all approaches, percentage bias increased with missing data volume. For datasets reflecting 18-month patterns, a combination of imputation methods provided more accurate cost estimates (e.g. bias: -1% vs -6% for single MI method), although different approaches to costing caregiver time had a greater impact on estimated costs (29-43% increase over base case estimate).
Methods used to impute missing cost data in AD will impact on accuracy of cost estimates although varying approaches to costing informal caregiver time has the greatest impact on total costs. Tailoring imputation methods to the reason for missing data will further our understanding of the best analytical approach for studies involving cost outcomes.
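A small simulation illustrates why estimates are biased under MNAR, the mechanism for which all imputation approaches performed poorly above: when the probability of missingness depends on the (unobserved) cost itself, the observed records are no longer representative. The cost distribution and dropout mechanism below are illustrative, not the GERAS data.

```python
# MNAR bias sketch: high-cost patients are more likely to have missing
# data, so the complete-case ("naive") mean underestimates the true mean.
import random

random.seed(1)
true_costs = [random.lognormvariate(8, 0.5) for _ in range(5000)]
# Probability of being missing grows with cost, capped at 0.8 (MNAR):
observed = [c for c in true_costs if random.random() > min(0.8, c / 10000)]

true_mean = sum(true_costs) / len(true_costs)
naive_mean = sum(observed) / len(observed)
print(naive_mean < true_mean)  # True: costly patients drop out more often
```

No imputation method that relies only on observed data can fully remove this bias, which is consistent with the poor MNAR performance reported above.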
Tyrer, Jonathan; Fasching, Peter A.; Beckmann, Matthias W.; Ekici, Arif B.; Schulz-Wendtland, Rüdiger; Bojesen, Stig E.; Nordestgaard, Børge G.; Flyger, Henrik; Milne, Roger L.; Arias, José Ignacio; Menéndez, Primitiva; Benítez, Javier; Chang-Claude, Jenny; Hein, Rebecca; Wang-Gohrke, Shan; Nevanlinna, Heli; Heikkinen, Tuomas; Aittomäki, Kristiina; Blomqvist, Carl; Margolin, Sara; Mannermaa, Arto; Kosma, Veli-Matti; Kataja, Vesa; Beesley, Jonathan; Chen, Xiaoqing; Chenevix-Trench, Georgia; Couch, Fergus J.; Olson, Janet E.; Fredericksen, Zachary S.; Wang, Xianshu; Giles, Graham G.; Severi, Gianluca; Baglietto, Laura; Southey, Melissa C.; Devilee, Peter; Tollenaar, Rob A. E. M.; Seynaeve, Caroline; García-Closas, Montserrat; Lissowska, Jolanta; Sherman, Mark E.; Bolton, Kelly L.; Hall, Per; Czene, Kamila; Cox, Angela; Brock, Ian W.; Elliott, Graeme C.; Reed, Malcolm W. R.; Greenberg, David; Anton-Culver, Hoda; Ziogas, Argyrios; Humphreys, Manjeet; Easton, Douglas F.; Caporaso, Neil E.; Pharoah, Paul D. P.
2010-01-01
Background Traditional prognostic factors for survival and treatment response of patients with breast cancer do not fully account for observed survival variation. We used available genotype data from a previously conducted two-stage, breast cancer susceptibility genome-wide association study (ie, Studies of Epidemiology and Risk factors in Cancer Heredity [SEARCH]) to investigate associations between variation in germline DNA and overall survival. Methods We evaluated possible associations between overall survival after a breast cancer diagnosis and 10 621 germline single-nucleotide polymorphisms (SNPs) from up to 3761 patients with invasive breast cancer (including 647 deaths and 26 978 person-years at risk) that were genotyped previously in the SEARCH study with high-density oligonucleotide microarrays (ie, hypothesis-generating set). Associations with all-cause mortality were assessed for each SNP by use of Cox regression analysis, generating a per rare allele hazard ratio (HR). To validate putative associations, we used patient genotype information that had been obtained with 5′ nuclease assay or mass spectrometry and overall survival information for up to 14 096 patients with invasive breast cancer (including 2303 deaths and 70 019 person-years at risk) from 15 international case–control studies (ie, validation set). Fixed-effects meta-analysis was used to generate an overall effect estimate in the validation dataset and in combined SEARCH and validation datasets. All statistical tests were two-sided. Results In the hypothesis-generating dataset, SNP rs4778137 (C>G) of the OCA2 gene at 15q13.1 was statistically significantly associated with overall survival among patients with estrogen receptor–negative tumors, with the rare G allele being associated with increased overall survival (HR of death per rare allele carried = 0.56, 95% confidence interval [CI] = 0.41 to 0.75, P = 9.2 × 10^-5).
This association was also observed in the validation dataset (HR of death per rare allele carried = 0.88, 95% CI = 0.78 to 0.99, P = .03) and in the combined dataset (HR of death per rare allele carried = 0.82, 95% CI = 0.73 to 0.92, P = 5 × 10^-4). Conclusion The rare G allele of the OCA2 polymorphism, rs4778137, may be associated with improved overall survival among patients with estrogen receptor–negative breast cancer. PMID:20308648
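The fixed-effects meta-analysis used to combine study estimates is inverse-variance weighting of log hazard ratios; a minimal sketch with illustrative standard errors (the study's actual per-study SEs are not reported here):

```python
# Fixed-effects (inverse-variance) meta-analysis: each study's log(HR)
# is weighted by 1/SE^2, so more precise studies dominate the pooled value.
# Inputs are illustrative, not the study's estimates.
import math

def fixed_effects(hrs, ses):
    """Pool per-study hazard ratios; returns (pooled HR, pooled SE of log HR)."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled_log = sum(w * math.log(hr) for w, hr in zip(weights, hrs)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return math.exp(pooled_log), pooled_se

pooled_hr, pooled_se = fixed_effects([0.56, 0.88], [0.15, 0.06])
print(round(pooled_hr, 2))  # 0.83: dominated by the more precise estimate
```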
Kim, Seokyeon; Jeong, Seongmin; Woo, Insoo; Jang, Yun; Maciejewski, Ross; Ebert, David S
2018-03-01
Geographic visualization research has focused on a variety of techniques to represent and explore spatiotemporal data. The goal of those techniques is to enable users to explore events and interactions over space and time in order to facilitate the discovery of patterns, anomalies and relationships within the data. However, it is difficult to extract and visualize data flow patterns over time for non-directional statistical data without trajectory information. In this work, we develop a novel flow analysis technique to extract, represent, and analyze flow maps of non-directional spatiotemporal data unaccompanied by trajectory information. We estimate a continuous distribution of these events over space and time, and extract flow fields for spatial and temporal changes utilizing a gravity model. Then, we visualize the spatiotemporal patterns in the data by employing flow visualization techniques. The user is presented with temporal trends of geo-referenced discrete events on a map. As such, overall spatiotemporal data flow patterns help users analyze geo-referenced temporal events, such as disease outbreaks and crime patterns. To validate our model, we discard the trajectory information in an origin-destination dataset, apply our technique to the data, and compare the derived trajectories with the originals. Finally, we present spatiotemporal trend analysis for statistical datasets including Twitter data, maritime search and rescue events, and syndromic surveillance.
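A gravity model of the kind described can be sketched minimally: the flow at a grid point is a vector sum of attractions toward event locations, each proportional to the event's mass and attenuated by distance. This simplification skips the paper's continuous density estimation and temporal differencing.

```python
# Minimal gravity-model flow field over discrete events.
# events: iterable of (x, y, mass) tuples; masses and positions illustrative.
import math

def flow_vector(point, events):
    """Net attraction vector at `point` from all events (inverse-square law)."""
    px, py = point
    fx = fy = 0.0
    for ex, ey, mass in events:
        dx, dy = ex - px, ey - py
        d2 = dx * dx + dy * dy
        if d2 == 0.0:
            continue  # skip events coincident with the grid point
        d = math.sqrt(d2)
        fx += mass * dx / (d2 * d)  # unit direction (dx/d) times mass/d^2
        fy += mass * dy / (d2 * d)
    return fx, fy

fx, fy = flow_vector((0.0, 0.0), [(1.0, 0.0, 5.0), (-2.0, 0.0, 5.0)])
print(fx > 0.0)  # True: net pull toward the nearer, equally massive cluster
```

Evaluating such vectors on a grid and rendering them with standard flow visualization gives a directional picture of otherwise non-directional event data.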
Wide-Field Megahertz OCT Imaging of Patients with Diabetic Retinopathy
Reznicek, Lukas; Kolb, Jan P.; Klein, Thomas; Mohler, Kathrin J.; Huber, Robert; Kernt, Marcus; Märtz, Josef; Neubauer, Aljoscha S.
2015-01-01
Purpose. To evaluate the feasibility of wide-field Megahertz (MHz) OCT imaging in patients with diabetic retinopathy. Methods. A consecutive series of 15 eyes of 15 patients with diagnosed diabetic retinopathy were included. All patients underwent Megahertz OCT imaging, a close clinical examination, slit lamp biomicroscopy, and funduscopic evaluation. To acquire densely sampled, wide-field volumetric datasets, an ophthalmic 1050 nm OCT prototype system based on a Fourier-domain mode-locked (FDML) laser source with a 1.68 MHz A-scan rate was employed. Results. We were able to obtain OCT volume scans from all 15 included patients. Acquisition time was 1.8 seconds. Obtained volume datasets consisted of 2088 × 1044 A-scans covering a 60° field of view. Thus, reconstructed en face images had a resolution of 34.8 pixels per degree in the x-axis and 17.4 pixels per degree in the y-axis. Due to the densely sampled OCT volume dataset, postprocessed customized cross-sectional B-frames through pathologic changes such as an individual microaneurysm or a retinal neovascularization could be imaged. Conclusions. Wide-field Megahertz OCT is feasible for successfully imaging patients with diabetic retinopathy at high scanning rates and a wide angle of view, providing information in all three axes. The Megahertz OCT is a useful tool to screen diabetic patients for diabetic retinopathy. PMID:26273665
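The quoted sampling densities follow directly from the scan geometry:

```python
# Sanity check of the quoted en face sampling densities:
# 2088 x 1044 A-scans over a 60 degree field of view.
a_scans_x, a_scans_y, fov_deg = 2088, 1044, 60
print(a_scans_x / fov_deg, a_scans_y / fov_deg)  # 34.8 17.4 pixels per degree
```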
A Bayesian method for detecting pairwise associations in compositional data
Ventz, Steffen; Huttenhower, Curtis
2017-01-01
Compositional data consist of vectors of proportions normalized to a constant sum from a basis of unobserved counts. The sum constraint makes inference on correlations between unconstrained features challenging due to the information loss from normalization. However, such correlations are of long-standing interest in fields including ecology. We propose a novel Bayesian framework (BAnOCC: Bayesian Analysis of Compositional Covariance) to estimate a sparse precision matrix through a LASSO prior. The resulting posterior, generated by MCMC sampling, allows uncertainty quantification of any function of the precision matrix, including the correlation matrix. We also use a first-order Taylor expansion to approximate the transformation from the unobserved counts to the composition in order to investigate what characteristics of the unobserved counts can make the correlations more or less difficult to infer. On simulated datasets, we show that BAnOCC infers the true network as well as previous methods while offering the advantage of posterior inference. Larger and more realistic simulated datasets further showed that BAnOCC performs well as measured by type I and type II error rates. Finally, we apply BAnOCC to a microbial ecology dataset from the Human Microbiome Project, which in addition to reproducing established ecological results revealed unique, competition-based roles for Proteobacteria in multiple distinct habitats. PMID:29140991
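The closure problem motivating BAnOCC can be demonstrated in a few lines: normalizing independent counts to a constant sum induces spurious negative correlation. This is a stdlib-only simulation of the problem, not the BAnOCC model itself (which places a LASSO prior on the precision matrix and samples the posterior by MCMC).

```python
# Closure effect: three independent positive counts, once normalized to
# proportions, acquire a spurious negative pairwise correlation.
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(2)
counts = [[random.expovariate(1.0) for _ in range(3)] for _ in range(500)]
props = [[c / sum(row) for c in row] for row in counts]

r_counts = pearson([r[0] for r in counts], [r[1] for r in counts])  # near 0
r_props = pearson([r[0] for r in props], [r[1] for r in props])     # near -0.5
print(r_props < r_counts)  # True: normalization pushes the correlation down
```

Recovering the correlations of the unobserved counts from the proportions alone is exactly the inference problem BAnOCC addresses.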
NASA Astrophysics Data System (ADS)
Alvarez-Garreton, C. D.; Mendoza, P. A.; Zambrano-Bigiarini, M.; Galleguillos, M. H.; Boisier, J. P.; Lara, A.; Cortés, G.; Garreaud, R.; McPhee, J. P.; Addor, N.; Puelma, C.
2017-12-01
We provide the first catchment-based hydrometeorological, vegetation and physical data set over 531 catchments in Chile (17.8° S to 55.0° S). We compiled publicly available streamflow records at daily time steps for the period 1980-2015, and generated basin-averaged time series of the following hydrometeorological variables: 1) daily precipitation from three different gridded sources (re-analysis and satellite-based); 2) daily maximum and minimum temperature; 3) 8-day potential evapotranspiration (PET) based on MODIS imagery and daily PET based on the Hargreaves formula; and 4) daily snow water equivalent. Additionally, catchments are characterized by their main physical (area, mean elevation, mean slope) and land cover characteristics. We synthesized these datasets with several indices characterizing the spatial distribution of climatic, hydrological, topographic and vegetation attributes. The new catchment-based dataset is unprecedented in the region and provides information that can be used in a myriad of applications, including catchment classification and regionalization studies, impacts of different land cover types on catchment response, characterization of drought history and projections, and climate change impacts on hydrological processes. Derived practical applications include water management and allocation strategies, decision making, and adaptation planning to climate change. This data set will be publicly available and we encourage the community to use it.
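The Hargreaves PET mentioned above has the standard form PET = 0.0023 * Ra * (Tmean + 17.8) * sqrt(Tmax - Tmin), where Ra is the extraterrestrial radiation expressed in mm/day of evaporation equivalent; a sketch with illustrative input values:

```python
# Hargreaves reference evapotranspiration (standard formulation).
# Temperatures in degrees C; ra in mm/day evaporation equivalent.
# Input values below are illustrative, not from the Chilean dataset.
import math

def hargreaves_pet(t_mean, t_max, t_min, ra):
    """Daily potential evapotranspiration (mm/day)."""
    return 0.0023 * ra * (t_mean + 17.8) * math.sqrt(t_max - t_min)

print(round(hargreaves_pet(15.0, 22.0, 8.0, 12.0), 2))  # 3.39 mm/day
```

Because it needs only temperature extremes and latitude-dependent Ra, Hargreaves is well suited to catchments where radiation and humidity observations are unavailable.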
Web-GIS platform for monitoring and forecasting of regional climate and ecological changes
NASA Astrophysics Data System (ADS)
Gordov, E. P.; Krupchatnikov, V. N.; Lykosov, V. N.; Okladnikov, I.; Titov, A. G.; Shulgina, T. M.
2012-12-01
The growing volume of environmental data from sensors and model outputs makes the development of a software infrastructure, based on modern information and telecommunication technologies, to support integrated research in the Earth sciences an urgent and important task (Gordov et al., 2012; van der Wel, 2005). The inherent heterogeneity of datasets obtained from different sources and institutions not only hampers the interchange of data and analysis results but also complicates their intercomparison, decreasing the reliability of analysis results. However, modern geophysical data processing techniques allow different technological solutions to be combined when organizing such information resources. It is now generally accepted that an information-computational infrastructure should rely on the combined use of web and GIS technologies for creating applied information-computational web systems (Titov et al., 2009; Gordov et al., 2010; Gordov, Okladnikov and Titov, 2011). Using these approaches for the development of internet-accessible thematic information-computational systems, and arranging data and knowledge interchange between them, is a promising way to create a distributed information-computation environment supporting multidisciplinary regional and global research in the Earth sciences, including analysis of climate changes and their impact on the spatial-temporal distribution and state of vegetation.
We present an experimental software and hardware platform supporting the operation of a web-oriented production and research center for regional climate change investigations. The platform combines a modern Web 2.0 approach, GIS functionality, and capabilities for running climate and meteorological models, processing large geophysical datasets, visualization, joint software development by distributed research groups, scientific analysis, and the education of students and post-graduate students. The platform software (Shulgina et al., 2012; Okladnikov et al., 2012) includes dedicated modules for numerical processing of regional and global modeling results for subsequent analysis and visualization. Data preprocessing, model runs, and visualization of results from the WRF and Planet Simulator models integrated into the platform are also provided. All functions of the center are accessible through a web portal using a common graphical web browser, via an interactive graphical user interface that provides, in particular, visualization of processing results, selection of a geographic region of interest (pan and zoom), and manipulation of data layers (ordering, enabling/disabling, feature extraction). The platform provides users with capabilities for heterogeneous geophysical data analysis, including high-resolution data, and for discovering tendencies in climatic and ecosystem changes within different multidisciplinary studies (Shulgina et al., 2011). Even an unskilled user without specific knowledge can perform computational processing and visualization of large meteorological, climatological and satellite monitoring datasets through the unified graphical web interface.
Development of a global historic monthly mean precipitation dataset
NASA Astrophysics Data System (ADS)
Yang, Su; Xu, Wenhui; Xu, Yan; Li, Qingxiang
2016-04-01
A global historic precipitation dataset is the basis for climate and water cycle research. Several global historic land-surface precipitation datasets have been developed by international data centers such as the US National Climatic Data Center (NCDC), the European Climate Assessment & Dataset project team, and the Met Office, but so far no such dataset has been developed by a research institute in China. In addition, each dataset has its own regional focus, and the existing global precipitation datasets contain only sparse observational stations over China, which may introduce uncertainties into East Asian precipitation studies. To take comprehensive historic information into account, users might need to employ two or more datasets; however, the non-uniform data formats, data units, station IDs, and so on add extra difficulty in exploiting them. For this reason, a complete historic precipitation dataset that takes advantage of the various existing datasets has been developed and produced at the National Meteorological Information Center of China. Precipitation observations from 12 sources are aggregated, and the data formats, data units, and station IDs are unified. Duplicated stations with the same ID are identified, and duplicated observations are removed. A consistency test, a correlation coefficient test, a significance t-test at the 95% confidence level, and a significance F-test at the 95% confidence level are conducted first to ensure data reliability. Only those data that satisfy all four criteria are integrated to produce the China Meteorological Administration global precipitation (CGP) historic precipitation dataset version 1.0. It contains observations at 31,000 stations with 1.87 × 10⁷ data records, among which 4152 precipitation time series are longer than 100 years. 
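The aggregation step described above can be sketched in a few lines. This is a hypothetical illustration, not the NMIC pipeline: the record fields, station IDs, and the inch-to-millimeter conversion are all assumptions made for the example. It shows the two unification ideas from the abstract: mapping source-specific records to a common schema (station ID, month, precipitation in mm) and dropping duplicated observations for the same station and month.

```python
# Illustrative sketch of multi-source aggregation with unified units and IDs.
# Field names, station IDs, and units below are invented for the example.

def unify(record):
    """Map a source record to (station_id, yyyymm, precip_mm)."""
    mm = record["value"] * (25.4 if record.get("unit") == "inch" else 1.0)
    return record["id"], record["yyyymm"], round(mm, 1)

def aggregate(sources):
    """Merge records from all sources, keeping one observation per
    (station, month) key so duplicated observations are removed."""
    merged = {}
    for records in sources.values():
        for rec in records:
            sid, ym, mm = unify(rec)
            merged.setdefault((sid, ym), mm)
    return merged

sources = {
    "NCDC": [{"id": "50136", "yyyymm": "190101", "value": 1.2, "unit": "inch"}],
    "ECA":  [{"id": "50136", "yyyymm": "190101", "value": 30.5}],  # duplicate
}
table = aggregate(sources)
print(len(table), table[("50136", "190101")])
```

In a real pipeline the four reliability tests (consistency, correlation, t-test, F-test) would then be applied before a source's records are accepted into the merge.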
This dataset plays a critical role in climate research owing to its large data volume and high station-network density compared with other datasets. Using the penalized maximal t-test, significant inhomogeneities have been detected in the historic precipitation series at 340 stations. The ratio method is then employed to remove these marked change points. Global precipitation analysis based on CGP v1.0 shows that rainfall increased during 1901-2013 at a rate of 3.52 ± 0.5 mm (10 yr)⁻¹, slightly higher than in the NCDC data. The analysis also reveals distinct long-term trends in different latitude zones.
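The quoted rate of 3.52 mm per decade is the kind of figure produced by fitting a least-squares slope to an annual precipitation series and scaling it from per-year to per-decade units. A minimal sketch, using a synthetic series rather than CGP data:

```python
# Ordinary least-squares trend, reported per decade as in the abstract.
# The input series here is synthetic and purely illustrative.

def trend_per_decade(years, values):
    n = len(years)
    mx = sum(years) / n
    my = sum(values) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(years, values))
             / sum((x - mx) ** 2 for x in years))
    return slope * 10.0  # per-year slope -> per-decade

years = list(range(1901, 2014))                 # 1901-2013, as in the study
values = [0.352 * (y - 1901) for y in years]    # exact 3.52 mm/decade rise
print(round(trend_per_decade(years, values), 2))  # → 3.52
```

The published analysis would additionally attach an uncertainty (the ± 0.5 mm (10 yr)⁻¹) from the regression residuals, which this sketch omits.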
Explain the CERES file naming convention
Atmospheric Science Data Center
2014-12-08
... using the dataset name, configuration code and date information which make each file name unique. A Dataset name consists ...
A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes
2011-01-01
Background: Knowing the phase of marker genotype data can be useful in genome-wide association studies, because it makes it possible to use analysis frameworks that account for identity by descent or parent of origin of alleles and it can lead to a large increase in data quantities via genotype or sequence imputation. Long-range phasing and haplotype library imputation constitute a fast and accurate method to impute phase for SNP data. Methods: A long-range phasing and haplotype library imputation algorithm was developed. It combines information from surrogate parents and long haplotypes to resolve phase in a manner that is not dependent on the family structure of a dataset or on the presence of pedigree information. Results: The algorithm performed well in both simulated and real livestock and human datasets in terms of both phasing accuracy and computation efficiency. The percentage of alleles that could be phased in both simulated and real datasets of varying size generally exceeded 98%, while the percentage of alleles incorrectly phased in simulated data was generally less than 0.5%. The accuracy of phasing was affected by dataset size, with lower accuracy for dataset sizes less than 1000, but was not affected by effective population size, family data structure, presence or absence of pedigree information, or SNP density. The method was computationally fast. In comparison to a commonly used statistical method (fastPHASE), the current method made about 8% fewer phasing mistakes and ran about 26 times faster for a small dataset. For larger datasets, the differences in computational time are expected to be even greater. A computer program implementing these methods has been made available. Conclusions: The algorithm and software developed in this study make feasible the routine phasing of high-density SNP chips in large datasets. PMID:21388557
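The haplotype-library idea in the abstract can be illustrated with a toy example. An unphased genotype (the count of the alternate allele at each SNP, 0/1/2) is resolved by searching a library of candidate haplotypes for a pair whose allele sums reproduce the genotype. The library and genotype below are invented; the actual algorithm also exploits surrogate parents and long shared haplotypes, and scales far beyond this brute-force search.

```python
# Toy haplotype-library phasing: find two library haplotypes whose
# per-SNP allele counts sum to the observed genotype. Illustrative only.
from itertools import combinations_with_replacement

def phase(genotype, library):
    for h1, h2 in combinations_with_replacement(library, 2):
        if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
            return h1, h2
    return None  # phase not resolvable from this library

library = [(0, 1, 1, 0), (1, 1, 0, 0), (0, 0, 1, 1)]
genotype = (1, 2, 1, 0)  # heterozygous at the first and third SNPs
print(phase(genotype, library))
```

Because heterozygous sites are the only ambiguous ones, any haplotype pair consistent with every SNP simultaneously fixes the phase at all of them; the real method's contribution is building that library accurately without pedigree information.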
Securely Measuring the Overlap between Private Datasets with Cryptosets
Swamidass, S. Joshua; Matlock, Matthew; Rozenblit, Leon
2015-01-01
Many scientific questions are best approached by sharing data—collected by different groups or across large collaborative networks—into a combined analysis. Unfortunately, some of the most interesting and powerful datasets—like health records, genetic data, and drug discovery data—cannot be freely shared because they contain sensitive information. In many situations, knowing whether private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset’s contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach “information-theoretic” security, the strongest type of security possible in cryptography, which cannot be cracked even with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure. PMID:25714898
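The core idea of a publicly shareable summary can be sketched as follows. Each dataset is condensed into a fixed-length histogram of item hashes, and the overlap between two datasets is estimated from how much the inner product of their histograms exceeds what independent datasets would produce. This is a conceptual illustration only: the histogram length, hash choice, and simple estimator below are assumptions, and the published method's estimator and security analysis are considerably more involved.

```python
# Conceptual cryptoset sketch: fixed-length hash histograms as shareable
# summaries, with overlap estimated from their inner product. Parameters
# and identifiers are illustrative assumptions.
import hashlib

L = 512  # summary length; governs the accuracy/privacy trade-off

def cryptoset(items):
    counts = [0] * L
    for item in items:
        h = int(hashlib.sha256(item.encode()).hexdigest(), 16) % L
        counts[h] += 1
    return counts

def overlap_estimate(c1, c2):
    # Independent datasets give an expected inner product of n1*n2/L;
    # the excess above that expectation estimates the shared-item count.
    n1, n2 = sum(c1), sum(c2)
    dot = sum(a * b for a, b in zip(c1, c2))
    return dot - n1 * n2 / L

a = [f"patient-{i}" for i in range(300)]
b = [f"patient-{i}" for i in range(200, 500)]  # 100 shared identifiers
est = overlap_estimate(cryptoset(a), cryptoset(b))
print(round(est))  # noisy estimate near the true overlap of 100
```

Because only the histograms are exchanged, neither party learns which items the other holds; the estimate is intentionally noisy, which is the source of the privacy guarantee.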