Sample records for include datasets models

  1. Phase 1 Free Air CO2 Enrichment Model-Data Synthesis (FACE-MDS): Model Output Data (2015)

    DOE Data Explorer

    Walker, A. P.; De Kauwe, M. G.; Medlyn, B. E.; Zaehle, S.; Asao, S.; Dietze, M.; El-Masri, B.; Hanson, P. J.; Hickler, T.; Jain, A.; Luo, Y.; Parton, W. J.; Prentice, I. C.; Ricciuto, D. M.; Thornton, P. E.; Wang, S.; Wang, Y -P; Warlind, D.; Weng, E.; Oren, R.; Norby, R. J.

    2015-01-01

    These datasets comprise the model output from phase 1 of the FACE-MDS. They include simulations of the Duke and Oak Ridge experiments as well as idealised long-term (300 year) simulations at both sites (see the modelling protocol for details). The modelling and output protocols are included as part of this dataset, and the model datasets are formatted according to the output protocols. Phase 1 datasets are reproduced here for posterity and reproducibility, although the model output for the experimental period has been somewhat superseded by the Phase 2 datasets.

  2. Tularosa Basin Play Fairway Analysis Data and Models

    DOE Data Explorer

    Nash, Greg

    2017-07-11

    This submission includes raster datasets for each layer of evidence used in the weights-of-evidence analysis as well as the deterministic play fairway analysis (PFA). Some of the raster datasets represent heat, permeability, and groundwater. The final deterministic PFA model is also provided, along with a certainty model. All of these datasets are best used with ArcGIS software, specifically the Spatial Data Modeler.

  3. Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  4. LANL - Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  5. LANL - Neutral - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  6. KCMP Minnesota Tall Tower Nitrous Oxide Inverse Modeling Dataset 2010-2015

    DOE Data Explorer

    Griffis, Timothy J. [University of Minnesota; Baker, John; Millet, Dylan; Chen, Zichong; Wood, Jeff; Erickson, Matt; Lee, Xuhui

    2017-01-01

    This dataset contains nitrous oxide mixing ratios and supporting information measured at a tall tower (KCMP, 244 m) site near St. Paul, Minnesota, USA. The data include nitrous oxide and carbon dioxide mixing ratios measured at the 100 m level. Turbulence and wind data were measured using a sonic anemometer at the 185 m level. Also included in this dataset are estimates of the "background" nitrous oxide mixing ratios and monthly concentration source footprints derived from WRF-STILT modeling.

  7. NREL - SOWFA - Neutral - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  8. PNNL - WRF-LES - Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  9. ANL - WRF-LES - Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  10. LLNL - WRF-LES - Neutral - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  11. ANL - WRF-LES - Neutral - TTU

    DOE Data Explorer

    Kosovic, Branko

    2018-06-20

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  12. LANL - WRF-LES - Neutral - TTU

    DOE Data Explorer

    Kosovic, Branko

    2018-06-20

    This dataset includes large-eddy simulation (LES) output from a neutrally stratified atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on Aug. 17, 2012. The dataset was used to assess LES models for simulation of canonical neutral ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  13. LANL - WRF-LES - Convective - TTU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kosovic, Branko

    This dataset includes large-eddy simulation (LES) output from a convective atmospheric boundary layer (ABL) simulation of observations at the SWIFT tower near Lubbock, Texas on July 4, 2012. The dataset was used to assess the LES models for simulation of canonical convective ABL. The dataset can be used for comparison with other LES and computational fluid dynamics model outputs.

  14. Oregon Cascades Play Fairway Analysis: Raster Datasets and Models

    DOE Data Explorer

    Adam Brandt

    2015-11-15

    This submission includes maps of the spatial distribution of basaltic and felsic rocks in the Oregon Cascades. It also includes a final Play Fairway Analysis (PFA) model, with the heat and permeability composite risk segments (CRS) supplied separately. Metadata for each raster dataset can be found within the zip files, in the TIF images.

  15. A global distributed basin morphometric dataset

    NASA Astrophysics Data System (ADS)

    Shen, Xinyi; Anagnostou, Emmanouil N.; Mei, Yiwen; Hong, Yang

    2017-01-01

    Basin morphometry is vital information for relating storms to hydrologic hazards, such as landslides and floods. In this paper we present the first comprehensive global dataset of distributed basin morphometry at 30 arc seconds resolution. The dataset includes nine prime morphometric variables; in addition we present formulas for generating twenty-one additional morphometric variables based on combinations of the prime variables. The dataset can aid different applications, including studies of land-atmosphere interaction and modelling of floods and droughts for sustainable water management. The validity of the dataset has been confirmed by successfully reproducing Hack's law.
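
    As an aside for readers unfamiliar with it, Hack's law is the empirical power-law relation L = c * A^h between main-stream length and drainage area. The short Python sketch below, with purely illustrative numbers, shows the usual log-log fit used to check such a relation; it is not taken from the dataset's own validation code.

```python
# Minimal sketch: fitting Hack's law (L = c * A^h) to basin length/area pairs
# with an ordinary log-log regression. The sample arrays are illustrative only.
import numpy as np

area_km2 = np.array([12.0, 85.0, 430.0, 2100.0, 9800.0])   # hypothetical basin areas
length_km = np.array([6.1, 18.0, 44.0, 105.0, 240.0])       # hypothetical main-stream lengths

# log L = log c + h * log A  ->  linear fit in log space
h, log_c = np.polyfit(np.log(area_km2), np.log(length_km), 1)
print(f"Hack exponent h = {h:.2f}, coefficient c = {np.exp(log_c):.2f}")
# Values of h near 0.5-0.6 are typically cited for Hack's law.
```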

  16. Research on Zheng Classification Fusing Pulse Parameters in Coronary Heart Disease

    PubMed Central

    Guo, Rui; Wang, Yi-Qin; Xu, Jin; Yan, Hai-Xia; Yan, Jian-Jun; Li, Fu-Feng; Xu, Zhao-Xia; Xu, Wen-Jie

    2013-01-01

    This study was conducted to illustrate that nonlinear dynamic variables of Traditional Chinese Medicine (TCM) pulse can improve the performances of TCM Zheng classification models. Pulse recordings of 334 coronary heart disease (CHD) patients and 117 normal subjects were collected in this study. Recurrence quantification analysis (RQA) was employed to acquire nonlinear dynamic variables of pulse. TCM Zheng models in CHD were constructed, and predictions using a novel multilabel learning algorithm based on different datasets were carried out. Datasets were designed as follows: dataset1, TCM inquiry information including inspection information; dataset2, time-domain variables of pulse and dataset1; dataset3, RQA variables of pulse and dataset1; and dataset4, major principal components of RQA variables and dataset1. The performances of the different models for Zheng differentiation were compared. The model for Zheng differentiation based on RQA variables integrated with inquiry information had the best performance, whereas that based only on inquiry had the worst performance. Meanwhile, the model based on time-domain variables of pulse integrated with inquiry fell between the above two. This result showed that RQA variables of pulse can be used to construct models of TCM Zheng and improve the performance of Zheng differentiation models. PMID:23737839
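
    For readers unfamiliar with recurrence quantification analysis, the sketch below shows the basic construction that RQA variables are built from: a delay embedding, a thresholded distance matrix, and the recurrence rate. The embedding dimension, delay, threshold, and test signal are illustrative assumptions, not the values or data used in the study.

```python
# Minimal sketch of recurrence quantification analysis (RQA) on a 1-D signal:
# delay embedding, thresholded distance matrix, and the recurrence rate.
import numpy as np

def embed(x, dim=3, tau=2):
    """Time-delay embedding of a 1-D array."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 20 * np.pi, 600)) + 0.1 * rng.standard_normal(600)

emb = embed(signal, dim=3, tau=2)
# Pairwise Euclidean distances between embedded state vectors
dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
recurrence = dists < 0.2 * dists.std()          # recurrence matrix (boolean)
recurrence_rate = recurrence.mean()             # simplest RQA variable
print(f"recurrence rate = {recurrence_rate:.3f}")
# Other RQA variables (determinism, laminarity, entropy) are built from the
# diagonal/vertical line structure of this matrix.
```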

  17. Geospatial datasets for watershed delineation and characterization used in the Hawaii StreamStats web application

    USGS Publications Warehouse

    Rea, Alan; Skinner, Kenneth D.

    2012-01-01

    The U.S. Geological Survey Hawaii StreamStats application uses an integrated suite of raster and vector geospatial datasets to delineate and characterize watersheds. The geospatial datasets used to delineate and characterize watersheds on the StreamStats website, and the methods used to develop the datasets are described in this report. The datasets for Hawaii were derived primarily from 10 meter resolution National Elevation Dataset (NED) elevation models, and the National Hydrography Dataset (NHD), using a set of procedures designed to enforce the drainage pattern from the NHD into the NED, resulting in an integrated suite of elevation-derived datasets. Additional sources of data used for computing basin characteristics include precipitation, land cover, soil permeability, and elevation-derivative datasets. The report also includes links for metadata and downloads of the geospatial datasets.

  18. Dataset for Testing Contamination Source Identification Methods for Water Distribution Networks

    EPA Pesticide Factsheets

    This dataset includes the results of a simulation study using the source inversion techniques available in the Water Security Toolkit. The data were created to test the different techniques for accuracy, specificity, false positive rate, and false negative rate. The tests examined different parameters including measurement error, modeling error, injection characteristics, time horizon, network size, and sensor placement. The water distribution system network models that were used in the study are also included in the dataset. This dataset is associated with the following publication: Seth, A., K. Klise, J. Siirola, T. Haxton, and C. Laird. Testing Contamination Source Identification Methods for Water Distribution Networks. Journal of Environmental Division, Proceedings of the American Society of Civil Engineers, ASCE, Reston, VA, USA (2016).

  19. VEMAP Phase 2 bioclimatic database. I. Gridded historical (20th century) climate for modeling ecosystem dynamics across the conterminous USA

    USGS Publications Warehouse

    Kittel, T.G.F.; Rosenbloom, N.A.; Royle, J. Andrew; Daly, Christopher; Gibson, W.P.; Fisher, H.H.; Thornton, P.; Yates, D.N.; Aulenbach, S.; Kaufman, C.; McKeown, R.; Bachelet, D.; Schimel, D.S.; Neilson, R.; Lenihan, J.; Drapek, R.; Ojima, D.S.; Parton, W.J.; Melillo, J.M.; Kicklighter, D.W.; Tian, H.; McGuire, A.D.; Sykes, M.T.; Smith, B.; Cowling, S.; Hickler, T.; Prentice, I.C.; Running, S.; Hibbard, K.A.; Post, W.M.; King, A.W.; Smith, T.; Rizzo, B.; Woodward, F.I.

    2004-01-01

    Analysis and simulation of biospheric responses to historical forcing require surface climate data that capture those aspects of climate that control ecological processes, including key spatial gradients and modes of temporal variability. We developed a multivariate, gridded historical climate dataset for the conterminous USA as a common input database for the Vegetation/Ecosystem Modeling and Analysis Project (VEMAP), a biogeochemical and dynamic vegetation model intercomparison. The dataset covers the period 1895-1993 on a 0.5° latitude/longitude grid. Climate is represented at both monthly and daily timesteps. Variables are: precipitation, minimum and maximum temperature, total incident solar radiation, daylight-period irradiance, vapor pressure, and daylight-period relative humidity. The dataset was derived from US Historical Climate Network (HCN), cooperative network, and snowpack telemetry (SNOTEL) monthly precipitation and mean minimum and maximum temperature station data. We employed techniques that rely on geostatistical and physical relationships to create the temporally and spatially complete dataset. We developed a local kriging prediction model to infill discontinuous and limited-length station records based on the spatial autocorrelation structure of climate anomalies. A spatial interpolation model (PRISM) that accounts for physiographic controls was used to grid the infilled monthly station data. We implemented a stochastic weather generator (modified WGEN) to disaggregate the gridded monthly series to dailies. Radiation and humidity variables were estimated from the dailies using a physically-based empirical surface climate model (MTCLIM3). Derived datasets include a 100 yr model spin-up climate and a historical Palmer Drought Severity Index (PDSI) dataset. The VEMAP dataset exhibits statistically significant trends in temperature, precipitation, solar radiation, vapor pressure, and PDSI for US National Assessment regions. The historical climate and companion datasets are available online at data archive centers. © Inter-Research 2004.

  20. A unified high-resolution wind and solar dataset from a rapidly updating numerical weather prediction model

    DOE PAGES

    James, Eric P.; Benjamin, Stanley G.; Marquis, Melinda

    2016-10-28

    A new gridded dataset for wind and solar resource estimation over the contiguous United States has been derived from hourly updated 1-h forecasts from the National Oceanic and Atmospheric Administration High-Resolution Rapid Refresh (HRRR) 3-km model composited over a three-year period (approximately 22,000 forecast model runs). The unique dataset features hourly data assimilation, and provides physically consistent wind and solar estimates for the renewable energy industry. The wind resource dataset shows strong similarity to that previously provided by a Department of Energy-funded study, and it includes estimates in southern Canada and northern Mexico. The solar resource dataset represents an initial step towards application-specific fields such as global horizontal and direct normal irradiance. This combined dataset will continue to be augmented with new forecast data from the advanced HRRR atmospheric/land-surface model.

  1. Development of an Uncertainty Quantification Predictive Chemical Reaction Model for Syngas Combustion

    DOE PAGES

    Slavinskaya, N. A.; Abbasi, M.; Starcke, J. H.; ...

    2017-01-24

    An automated data-centric infrastructure, Process Informatics Model (PrIMe), was applied to validation and optimization of a syngas combustion model. The Bound-to-Bound Data Collaboration (B2BDC) module of PrIMe was employed to discover the limits of parameter modifications based on uncertainty quantification (UQ) and consistency analysis of the model–data system and experimental data, including shock-tube ignition delay times and laminar flame speeds. Existing syngas reaction models are reviewed, and the selected kinetic data are described in detail. Empirical rules were developed and applied to evaluate the uncertainty bounds of the literature experimental data. Here, the initial H2/CO reaction model, assembled from 73 reactions and 17 species, was subjected to a B2BDC analysis. For this purpose, a dataset was constructed that included a total of 167 experimental targets and 55 active model parameters. Consistency analysis of the composed dataset revealed disagreement between models and data. Further analysis suggested that removing 45 experimental targets, 8 of which were self-inconsistent, would lead to a consistent dataset. This dataset was subjected to a correlation analysis, which highlights possible directions for parameter modification and model improvement. Additionally, several methods of parameter optimization were applied, some of them unique to the B2BDC framework. The optimized models demonstrated improved agreement with experiments compared to the initially assembled model, and their predictions for experiments not included in the initial dataset (i.e., a blind prediction) were investigated. The results demonstrate benefits of applying the B2BDC methodology for developing predictive kinetic models.

  2. Development of an Uncertainty Quantification Predictive Chemical Reaction Model for Syngas Combustion

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Slavinskaya, N. A.; Abbasi, M.; Starcke, J. H.

    An automated data-centric infrastructure, Process Informatics Model (PrIMe), was applied to validation and optimization of a syngas combustion model. The Bound-to-Bound Data Collaboration (B2BDC) module of PrIMe was employed to discover the limits of parameter modifications based on uncertainty quantification (UQ) and consistency analysis of the model–data system and experimental data, including shock-tube ignition delay times and laminar flame speeds. Existing syngas reaction models are reviewed, and the selected kinetic data are described in detail. Empirical rules were developed and applied to evaluate the uncertainty bounds of the literature experimental data. Here, the initial H2/CO reaction model, assembled from 73 reactions and 17 species, was subjected to a B2BDC analysis. For this purpose, a dataset was constructed that included a total of 167 experimental targets and 55 active model parameters. Consistency analysis of the composed dataset revealed disagreement between models and data. Further analysis suggested that removing 45 experimental targets, 8 of which were self-inconsistent, would lead to a consistent dataset. This dataset was subjected to a correlation analysis, which highlights possible directions for parameter modification and model improvement. Additionally, several methods of parameter optimization were applied, some of them unique to the B2BDC framework. The optimized models demonstrated improved agreement with experiments compared to the initially assembled model, and their predictions for experiments not included in the initial dataset (i.e., a blind prediction) were investigated. The results demonstrate benefits of applying the B2BDC methodology for developing predictive kinetic models.

  3. Parameter-expanded data augmentation for Bayesian analysis of capture-recapture models

    USGS Publications Warehouse

    Royle, J. Andrew; Dorazio, Robert M.

    2012-01-01

    Data augmentation (DA) is a flexible tool for analyzing closed and open population models of capture-recapture data, especially models which include sources of heterogeneity among individuals. The essential concept underlying DA, as we use the term, is based on adding "observations" to create a dataset composed of a known number of individuals. This new (augmented) dataset, which includes the unknown number of individuals N in the population, is then analyzed using a new model that includes a reformulation of the parameter N in the conventional model of the observed (unaugmented) data. In the context of capture-recapture models, we add a set of "all zero" encounter histories which are not, in practice, observable. The model of the augmented dataset is a zero-inflated version of either a binomial or a multinomial base model. Thus, our use of DA provides a general approach for analyzing both closed and open population models of all types. In doing so, this approach provides a unified framework for the analysis of a huge range of models that are treated as unrelated "black boxes" and named procedures in the classical literature. As a practical matter, analysis of the augmented dataset by MCMC is greatly simplified compared to other methods that require specialized algorithms. For example, complex capture-recapture models of an augmented dataset can be fitted with popular MCMC software packages (WinBUGS or JAGS) by providing a concise statement of the model's assumptions that usually involves only a few lines of pseudocode. In this paper, we review the basic technical concepts of data augmentation, and we provide examples of analyses of closed-population models (M0, Mh, distance sampling, and spatial capture-recapture models) and open-population models (Jolly-Seber) with individual effects.
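
    The zero-inflated formulation described above is compact enough to sketch directly. The Python example below augments a toy set of capture counts with all-zero histories and fits model M0 by maximizing the zero-inflated binomial likelihood; the paper itself uses Bayesian MCMC (WinBUGS/JAGS), so this maximum-likelihood version is only an illustration of the same construction, with made-up data.

```python
# Minimal sketch of parameter-expanded data augmentation for a closed-population
# model M0: observed capture counts are padded with "all-zero" histories and a
# zero-inflated binomial likelihood is maximized. Data values are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

J = 5                                    # capture occasions
y_obs = np.array([1, 2, 1, 3, 1, 2, 4])  # capture counts of observed individuals
n_aug = 100                              # number of augmented all-zero histories
y = np.concatenate([y_obs, np.zeros(n_aug, dtype=int)])

def neg_log_lik(theta):
    psi, p = 1 / (1 + np.exp(-theta))    # logit-scale parameters -> (0, 1)
    lik = psi * binom.pmf(y, J, p)       # individual is real and captured y times
    lik[y == 0] += (1 - psi)             # or is a structural (augmented) zero
    return -np.sum(np.log(lik))

fit = minimize(neg_log_lik, x0=np.zeros(2), method="Nelder-Mead")
psi_hat, p_hat = 1 / (1 + np.exp(-fit.x))
N_hat = psi_hat * len(y)                 # expected population size
print(f"psi = {psi_hat:.2f}, p = {p_hat:.2f}, N_hat = {N_hat:.1f}")
```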

  4. Bayesian Network Webserver: a comprehensive tool for biological network modeling.

    PubMed

    Ziebarth, Jesse D; Bhattacharya, Anindya; Cui, Yan

    2013-11-01

    The Bayesian Network Webserver (BNW) is a platform for comprehensive network modeling of systems genetics and other biological datasets. It allows users to quickly and seamlessly upload a dataset, learn the structure of the network model that best explains the data and use the model to understand relationships between network variables. Many datasets, including those used to create genetic network models, contain both discrete (e.g. genotype) and continuous (e.g. gene expression traits) variables, and BNW allows for modeling hybrid datasets. Users of BNW can incorporate prior knowledge during structure learning through an easy-to-use structural constraint interface. After structure learning, users are immediately presented with an interactive network model, which can be used to make testable hypotheses about network relationships. BNW, including a downloadable structure learning package, is available at http://compbio.uthsc.edu/BNW. (The BNW interface for adding structural constraints uses HTML5 features that are not supported by the current version of Internet Explorer. We suggest using other browsers (e.g. Google Chrome or Mozilla Firefox) when accessing BNW.) Contact: ycui2@uthsc.edu. Supplementary data are available at Bioinformatics online.

  5. Evaluation of precipitation extremes over the Asian domain: observation and modelling studies

    NASA Astrophysics Data System (ADS)

    Kim, In-Won; Oh, Jaiho; Woo, Sumin; Kripalani, R. H.

    2018-04-01

    In this study, a comparison of the precipitation extremes exhibited by seven reference datasets is made to ascertain whether inferences based on these datasets agree or differ. These seven datasets, roughly grouped into three categories, i.e. rain-gauge based (APHRODITE, CPC-UNI), satellite-based (TRMM, GPCP1DD) and reanalysis-based (ERA-Interim, MERRA, and JRA55), and having a common data period of 1998-2007, are considered. The focus is on precipitation extremes in the summer monsoon rainfall over South Asia, East Asia and Southeast Asia. Measures of extreme precipitation include percentile thresholds, the frequency of extreme precipitation events and other quantities. Results reveal that the differences in extremes among the datasets are small over South Asia and East Asia, but large differences among the datasets are displayed over the Southeast Asian region including the maritime continent. Furthermore, precipitation data appear to be more consistent over East Asia among the seven datasets. Decadal trends in extreme precipitation are consistent with known results over South and East Asia. No trends in extreme precipitation events are exhibited over Southeast Asia. Outputs of the Coupled Model Intercomparison Project Phase 5 (CMIP5) simulations are categorized as high-, medium- and low-resolution models. The regions displaying the maximum intensity of extreme precipitation appear to depend on model resolution. High-resolution models simulate maximum intensity of extreme precipitation over the Indian sub-continent, medium-resolution models over northeast India and South China, and low-resolution models over Bangladesh, Myanmar and Thailand. In summary, there are differences in extreme precipitation statistics among the seven datasets considered here and among the 29 CMIP5 model outputs.
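
    As a concrete illustration of the percentile-threshold measures mentioned above, the sketch below computes a wet-day 95th-percentile threshold and the resulting count of extreme days from a synthetic daily series; the wet-day cutoff and the synthetic data are assumptions, not values from the seven datasets.

```python
# Minimal sketch of two common extreme-precipitation measures:
# a wet-day percentile threshold and the count of days exceeding it.
import numpy as np

rng = np.random.default_rng(1)
daily_precip = rng.gamma(shape=0.4, scale=8.0, size=3650)   # mm/day, ~10 synthetic years

wet_days = daily_precip[daily_precip >= 1.0]                # common wet-day cutoff
p95 = np.percentile(wet_days, 95)                           # extreme threshold
extreme_days_per_year = (daily_precip > p95).sum() / 10.0

print(f"95th-percentile wet-day threshold: {p95:.1f} mm/day")
print(f"extreme-precipitation days per year: {extreme_days_per_year:.1f}")
```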

  6. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling.

    PubMed

    Mansouri, K; Grulke, C M; Richard, A M; Judson, R S; Williams, A J

    2016-11-01

    The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
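
    One elementary step of the kind of structure curation described above can be sketched with RDKit: parse each SMILES, flag records that fail to parse, and store a canonical form for later identity checks. The example records are hypothetical, and the published workflow (KNIME-based, with identifier cross-checks and stereochemistry validation) does far more than this.

```python
# Minimal sketch of one structure-curation step: validate SMILES strings and
# record canonical forms so identity mismatches can be flagged downstream.
# Requires RDKit; the example records are hypothetical.
from rdkit import Chem

records = [
    {"name": "benzene", "smiles": "c1ccccc1"},
    {"name": "bad entry", "smiles": "C1CC"},        # unclosed ring -> invalid
]

for rec in records:
    mol = Chem.MolFromSmiles(rec["smiles"])
    if mol is None:
        print(f"{rec['name']}: structure failed to parse, flag for manual review")
    else:
        rec["canonical_smiles"] = Chem.MolToSmiles(mol)
        print(f"{rec['name']}: canonical SMILES {rec['canonical_smiles']}")
```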

  7. Automated Detection of Obstructive Sleep Apnea Events from a Single-Lead Electrocardiogram Using a Convolutional Neural Network.

    PubMed

    Urtnasan, Erdenebayar; Park, Jong-Uk; Joo, Eun-Yeon; Lee, Kyoung-Joung

    2018-04-23

    In this study, we propose a method for the automated detection of obstructive sleep apnea (OSA) from a single-lead electrocardiogram (ECG) using a convolutional neural network (CNN). A CNN model was designed with six optimized convolution layers including activation, pooling, and dropout layers. One-dimensional (1D) convolution, rectified linear units (ReLU), and max pooling were applied to the convolution, activation, and pooling layers, respectively. For training and evaluation of the CNN model, a single-lead ECG dataset was collected from 82 subjects with OSA and was divided into training (including data from 63 patients with 34,281 events) and testing (including data from 19 patients with 8,571 events) datasets. Using this CNN model, a precision of 0.99%, a recall of 0.99%, and an F1-score of 0.99% were attained with the training dataset; these values were all 0.96% when the CNN was applied to the testing dataset. These results show that the proposed CNN model can be used to detect OSA accurately on the basis of a single-lead ECG. Ultimately, this CNN model may be used as a screening tool for those suspected to suffer from OSA.
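
    A network of the general shape described above can be sketched in a few lines of Keras. The filter counts, kernel size, dropout rate, and input segment length below are assumptions for illustration, not the authors' published configuration.

```python
# Minimal sketch of a 1-D CNN with six convolution blocks (ReLU, max pooling,
# dropout) for binary classification of ECG segments. Hyperparameters are
# illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

n_samples = 6000          # e.g., a 1-minute single-lead ECG segment at 100 Hz
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(n_samples, 1)))
for filters in (16, 32, 64, 64, 128, 128):
    model.add(layers.Conv1D(filters, kernel_size=5, activation="relu", padding="same"))
    model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Dropout(0.2))
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dense(1, activation="sigmoid"))   # apnea event vs. normal

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```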

  8. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.

    PubMed

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-07-02

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset using a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces the time and effort of experts and knowledge engineers by 94.1% when creating unified datasets.

  9. Toward Computational Cumulative Biology by Combining Models of Biological Datasets

    PubMed Central

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176
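
    The decomposition idea can be illustrated with a much simpler stand-in: express a new dataset's profile as a non-negative combination of profiles from previously modeled datasets. The paper's actual combination model is probabilistic; the non-negative least-squares sketch below, with synthetic vectors, only conveys the retrieval-by-decomposition intuition.

```python
# Minimal sketch: decompose a new dataset's profile into non-negative
# contributions from earlier "models" (here, synthetic profile vectors).
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)
earlier_models = rng.random((50, 6))          # 6 earlier datasets, 50-dim profiles
true_weights = np.array([0.7, 0.0, 0.3, 0.0, 0.0, 0.0])
new_dataset = earlier_models @ true_weights + 0.01 * rng.standard_normal(50)

weights, residual = nnls(earlier_models, new_dataset)
print("estimated contributions:", np.round(weights, 2))
# Large weights identify the earlier datasets most related to the new one,
# which is the basis for model-driven retrieval.
```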

  10. Toward computational cumulative biology by combining models of biological datasets.

    PubMed

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations-for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.

  11. Deep learning-based fine-grained car make/model classification for visual surveillance

    NASA Astrophysics Data System (ADS)

    Gundogdu, Erhan; Parıldı, Enes Sinan; Solmaz, Berkan; Yücesoy, Veysel; Koç, Aykut

    2017-10-01

    Fine-grained object recognition is a challenging computer vision problem that has recently been addressed by utilizing deep Convolutional Neural Networks (CNNs). Nevertheless, the main disadvantage of classification methods relying on deep CNN models is the need for a considerably large amount of data. In addition, relatively little annotated data exists for a real-world application such as the recognition of car models in a traffic surveillance system. To this end, we concentrate on the classification of fine-grained car makes and/or models for visual scenarios with the help of two different domains. First, a large-scale dataset including approximately 900K images is constructed from a website which includes fine-grained car models. A state-of-the-art CNN model is trained on the constructed dataset according to its labels. The second domain is a set of images collected from a camera integrated into a traffic surveillance system. These images, numbering over 260K, were gathered by a special license plate detection method on top of a motion detection algorithm. An appropriately sized image region is cropped from the region of interest provided by the detected license plate location. These sets of images and their labels for more than 30 classes are employed to fine-tune the CNN model already trained on the large-scale dataset described above. To fine-tune the network, the last two fully-connected layers are randomly initialized and the remaining layers are fine-tuned on the second dataset. In this work, the transfer of a model learned on a large dataset to a smaller one has been successfully performed by utilizing both the limited annotated data of the traffic field and a large-scale dataset with available annotations. Our experimental results on both the validation dataset and the real field show that the proposed methodology performs favorably against training the CNN model from scratch.
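
    The fine-tuning recipe described above (reuse the convolutional layers, re-initialize the final fully-connected layers, continue training on the smaller target set) is a standard transfer-learning pattern. The Keras sketch below illustrates it with an ImageNet-pretrained ResNet50 standing in for the authors' web-trained model; all layer sizes and hyperparameters are assumptions.

```python
# Minimal sketch of fine-tuning: reuse a network trained on a large source
# dataset, replace the final fully-connected layers with freshly initialized
# ones, and continue training on the smaller target dataset.
import tensorflow as tf
from tensorflow.keras import layers

n_classes = 30                                   # surveillance-domain car models
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3), pooling="avg")

# Freshly initialized head (analogous to re-initializing the last FC layers)
outputs = layers.Dense(256, activation="relu")(base.output)
outputs = layers.Dense(n_classes, activation="softmax")(outputs)
model = tf.keras.Model(base.input, outputs)

# Fine-tune the whole network with a small learning rate on the target data
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(target_images, target_labels, epochs=5)   # target data not shown here
```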

  12. Bayesian correlated clustering to integrate multiple datasets

    PubMed Central

    Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.

    2012-01-01

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558

  13. The LANDFIRE Refresh strategy: updating the national dataset

    USGS Publications Warehouse

    Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley

    2013-01-01

    The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote sensing based disturbance detection methods, field collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.

  14. Use of Maple Seedling Canopy Reflectance Dataset for Validation of SART/LEAFMOD Radiative Transfer Model

    NASA Technical Reports Server (NTRS)

    Bond, Barbara J.; Peterson, David L.

    1999-01-01

    This project was a collaborative effort by researchers at ARC, OSU and the University of Arizona. The goal was to use a dataset obtained from a previous study to "empirically validate a new canopy radiative-transfer model (SART) which incorporates a recently-developed leaf-level model (LEAFMOD)". The document includes a short research summary.

  15. TLEM 2.0 - a comprehensive musculoskeletal geometry dataset for subject-specific modeling of lower extremity.

    PubMed

    Carbone, V; Fluit, R; Pellikaan, P; van der Krogt, M M; Janssen, D; Damsgaard, M; Vigneron, L; Feilkas, T; Koopman, H F J M; Verdonschot, N

    2015-03-18

    When analyzing complex biomechanical problems such as predicting the effects of orthopedic surgery, subject-specific musculoskeletal models are essential to achieve reliable predictions. The aim of this paper is to present the Twente Lower Extremity Model 2.0, a new comprehensive dataset of the musculoskeletal geometry of the lower extremity, which is based on medical imaging data and dissection performed on the right lower extremity of a fresh male cadaver. Bone, muscle and subcutaneous fat (including skin) volumes were segmented from computed tomography and magnetic resonance imaging scans. Inertial parameters were estimated from the image-based segmented volumes. A complete cadaver dissection was performed, in which bony landmarks, attachment sites and lines-of-action of 55 muscle actuators and 12 ligaments, bony wrapping surfaces, and joint geometry were measured. The obtained musculoskeletal geometry dataset was finally implemented in the AnyBody Modeling System (AnyBody Technology A/S, Aalborg, Denmark), resulting in a model consisting of 12 segments, 11 joints and 21 degrees of freedom, and including 166 muscle-tendon elements for each leg. The new TLEM 2.0 dataset was purposely built to be easily combined with novel image-based scaling techniques, such as bone surface morphing, muscle volume registration and muscle-tendon path identification, in order to obtain subject-specific musculoskeletal models in a quick and accurate way. The complete dataset, including CT and MRI scans and segmented volumes and surfaces, is made available at http://www.utwente.nl/ctw/bw/research/projects/TLEMsafe for the biomechanical community, in order to accelerate the development and adoption of subject-specific models on a large scale. TLEM 2.0 is freely shared for non-commercial use only, under acceptance of the TLEMsafe Research License Agreement. Copyright © 2014 Elsevier Ltd. All rights reserved.

  16. Substituting values for censored data from Texas, USA, reservoirs inflated and obscured trends in analyses commonly used for water quality target development.

    PubMed

    Grantz, Erin; Haggard, Brian; Scott, J Thad

    2018-06-12

    We calculated four median datasets (chlorophyll a, Chl a; total phosphorus, TP; and transparency) using multiple approaches to handling censored observations, including substituting fractions of the quantification limit (QL; dataset 1 = 1QL, dataset 2 = 0.5QL) and statistical methods for censored datasets (datasets 3-4) for approximately 100 Texas, USA reservoirs. Trend analyses of differences between dataset 1 and 3 medians indicated that percent difference increased linearly above thresholds in percent censored data (%Cen). This relationship was extrapolated to estimate medians for site-parameter combinations with %Cen > 80%, which were combined with dataset 3 as dataset 4. Changepoint analysis of Chl a- and transparency-TP relationships indicated threshold differences of up to 50% between datasets. Recursive analysis identified secondary thresholds in dataset 4. Threshold differences show that information introduced via substitution or missing due to limitations of statistical methods biased values, underestimated error, and inflated the strength of TP thresholds identified in datasets 1-3. Analysis of covariance identified differences in the linear regression models relating transparency to TP between datasets 1, 2, and the more statistically robust datasets 3-4. Study findings identify high-risk scenarios for biased analytical outcomes when using substitution. These include a high probability of median overestimation when %Cen > 50-60% for a single QL, or when %Cen is as low as 16% for multiple QLs. Changepoint analysis was uniquely vulnerable to substitution effects when using medians from sites with %Cen > 50%. Linear regression analysis was less sensitive to substitution and missing-data effects, but differences in model parameters for transparency cannot be discounted and could be magnified by log-transformation of the variables.
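
    The substitution schemes compared above are easy to reproduce in miniature. The sketch below contrasts medians computed with 1.0·QL and 0.5·QL substitution for a small synthetic record with just over 50% censoring; the concentrations and quantification limit are invented for illustration.

```python
# Minimal sketch contrasting the two substitution choices examined above
# (dataset 1 = 1*QL, dataset 2 = 0.5*QL) for a series with censored values.
import numpy as np

QL = 0.05                                              # quantification limit, mg/L
detects = np.array([0.12, 0.30, 0.08, 0.22, 0.15])     # quantified TP observations
n_censored = 6                                         # observations reported as <QL (~55%)

def median_with_substitution(fraction):
    substituted = np.full(n_censored, fraction * QL)
    return np.median(np.concatenate([detects, substituted]))

print("median, censored set to 1.0*QL:", median_with_substitution(1.0))
print("median, censored set to 0.5*QL:", median_with_substitution(0.5))
# With >50% censoring the median equals the substituted value itself, which is
# why the study favors statistical censored-data methods at high %Cen.
```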

  17. Parameter estimation of the Farquhar-von Caemmerer-Berry biochemical model from photosynthetic carbon dioxide response curves

    USDA-ARS?s Scientific Manuscript database

    The methods of Sharkey and Gu for estimating the eight parameters of the Farquhar-von Caemmerer-Berry (FvBC) model were examined using generated photosynthesis versus intercellular carbon dioxide concentration (A/Ci) datasets. The generated datasets included data with (A) high accuracy, (B) normal ...

  18. The Balance-Scale Task Revisited: A Comparison of Statistical Models for Rule-Based and Information-Integration Theories of Proportional Reasoning

    PubMed Central

    Hofman, Abe D.; Visser, Ingmar; Jansen, Brenda R. J.; van der Maas, Han L. J.

    2015-01-01

    We propose and test three statistical models for the analysis of children’s responses to the balance scale task, a seminal task to study proportional reasoning. We use a latent class modelling approach to formulate a rule-based latent class model (RB LCM) following from a rule-based perspective on proportional reasoning and a new statistical model, the Weighted Sum Model, following from an information-integration approach. Moreover, a hybrid LCM using item covariates is proposed, combining aspects of both a rule-based and information-integration perspective. These models are applied to two different datasets, a standard paper-and-pencil test dataset (N = 779), and a dataset collected within an online learning environment that included direct feedback, time-pressure, and a reward system (N = 808). For the paper-and-pencil dataset the RB LCM resulted in the best fit, whereas for the online dataset the hybrid LCM provided the best fit. The standard paper-and-pencil dataset yielded more evidence for distinct solution rules than the online data set in which quantitative item characteristics are more prominent in determining responses. These results shed new light on the discussion on sequential rule-based and information-integration perspectives of cognitive development. PMID:26505905

  19. Low-Income Students and School Meal Programs in California. Technical Appendices

    ERIC Educational Resources Information Center

    Danielson, Caroline

    2015-01-01

    These technical appendices are intended to accompany the study, "Low-Income Students and School Meal Programs in California." Two appendices are included. Appendix A provides tables detailing: (1) the variables included in the main models and the dataset(s) used to construct each; (2) observations in each dataset and categorizes them…

  20. Convergence and Divergence in a Multi-Model Ensemble of Terrestrial Ecosystem Models in North America

    NASA Astrophysics Data System (ADS)

    Dungan, J. L.; Wang, W.; Hashimoto, H.; Michaelis, A.; Milesi, C.; Ichii, K.; Nemani, R. R.

    2009-12-01

    In support of NACP, we are conducting an ensemble modeling exercise using the Terrestrial Observation and Prediction System (TOPS) to evaluate uncertainties among ecosystem models, satellite datasets, and in-situ measurements. The models used in the experiment include public-domain versions of Biome-BGC, LPJ, TOPS-BGC, and CASA, driven by a consistent set of climate fields for North America at 8km resolution and daily/monthly time steps over the period of 1982-2006. The reference datasets include MODIS Gross Primary Production (GPP) and Net Primary Production (NPP) products, Fluxnet measurements, and other observational data. The simulation results and the reference datasets are consistently processed and systematically compared in the climate (temperature-precipitation) space; in particular, an alternative to the Taylor diagram is developed to facilitate model-data intercomparisons in multi-dimensional space. The key findings of this study indicate that: the simulated GPP/NPP fluxes are in general agreement with observations over forests, but are biased low (underestimated) over non-forest types; large uncertainties of biomass and soil carbon stocks are found among the models (and reference datasets), often induced by seemingly “small” differences in model parameters and implementation details; the simulated Net Ecosystem Production (NEP) mainly responds to non-respiratory disturbances (e.g. fire) in the models and therefore is difficult to compare with flux data; and the seasonality and interannual variability of NEP varies significantly among models and reference datasets. These findings highlight the problem inherent in relying on only one modeling approach to map surface carbon fluxes and emphasize the pressing necessity of expanded and enhanced monitoring systems to narrow critical structural and parametrical uncertainties among ecosystem models.
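
    Model-observation comparisons of this kind are usually summarized with Taylor-diagram statistics (correlation, standard-deviation ratio, centered RMS difference); the abstract notes the authors developed an alternative display, which is not reproduced here. The sketch below computes the standard statistics for a synthetic model/reference pair.

```python
# Minimal sketch of standard Taylor-diagram statistics for a model-data
# intercomparison. The series are synthetic placeholders for, e.g., simulated
# and reference GPP anomalies.
import numpy as np

rng = np.random.default_rng(3)
reference = rng.normal(size=500)                       # reference (observed) series
model = 0.8 * reference + 0.5 * rng.normal(size=500)   # a biased, noisier model

corr = np.corrcoef(model, reference)[0, 1]
sigma_ratio = model.std() / reference.std()
crmsd = np.sqrt(np.mean(((model - model.mean()) - (reference - reference.mean())) ** 2))

print(f"correlation = {corr:.2f}, sigma_model/sigma_ref = {sigma_ratio:.2f}, "
      f"centered RMSD = {crmsd:.2f}")
```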

  1. Development of Gridded Ensemble Precipitation and Temperature Datasets for the Contiguous United States Plus Hawai'i and Alaska

    NASA Astrophysics Data System (ADS)

    Newman, A. J.; Clark, M. P.; Nijssen, B.; Wood, A.; Gutmann, E. D.; Mizukami, N.; Longman, R. J.; Giambelluca, T. W.; Cherry, J.; Nowak, K.; Arnold, J.; Prein, A. F.

    2016-12-01

    Gridded precipitation and temperature products are inherently uncertain due to myriad factors. These include interpolation from a sparse observation network, measurement representativeness, and measurement errors. Despite this inherent uncertainty, uncertainty estimates are typically not included, or are specific to each dataset without much general applicability across different datasets. A lack of quantitative uncertainty estimates for hydrometeorological forcing fields limits their utility to support land surface and hydrologic modeling techniques such as data assimilation, probabilistic forecasting and verification. To address this gap, we have developed a first-of-its-kind gridded, observation-based ensemble of precipitation and temperature at a daily increment for the period 1980-2012 over the United States (including Alaska and Hawaii). A longer, higher-resolution version (1970-present, 1/16th degree) has also been implemented to support real-time hydrologic monitoring and prediction in several regional US domains. We will present the development and evaluation of the dataset, along with initial applications of the dataset for ensemble data assimilation and probabilistic evaluation of high-resolution regional climate model simulations. We will also present results on the new high-resolution products for Alaska and Hawaii (2 km and 250 m, respectively), to complete the first ensemble observation-based product suite for the entire 50 states. Finally, we will present plans to improve the ensemble dataset, focusing on efforts to improve the methods used for station interpolation and ensemble generation, as well as methods to fuse station data with numerical weather prediction model output.

  2. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare

    PubMed Central

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-01-01

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset using a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To realize the significance of the unified dataset, we adopted a well-known rough set theory based rules creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool, on average, reduces the time and effort of experts and knowledge engineers by 94.1% when creating unified datasets. PMID:26147731

  3. Evaluation of experimental design and computational parameter choices affecting analyses of ChIP-seq and RNA-seq data in undomesticated poplar trees.

    Treesearch

    Lijun Liu; V. Missirian; Matthew S. Zinkgraf; Andrew Groover; V. Filkov

    2014-01-01

    Background: One of the great advantages of next generation sequencing is the ability to generate large genomic datasets for virtually all species, including non-model organisms. It should be possible, in turn, to apply advanced computational approaches to these datasets to develop models of biological processes. In a practical sense, working with non-model organisms...

  4. Advancing Collaboration through Hydrologic Data and Model Sharing

    NASA Astrophysics Data System (ADS)

    Tarboton, D. G.; Idaszak, R.; Horsburgh, J. S.; Ames, D. P.; Goodall, J. L.; Band, L. E.; Merwade, V.; Couch, A.; Hooper, R. P.; Maidment, D. R.; Dash, P. K.; Stealey, M.; Yi, H.; Gan, T.; Castronova, A. M.; Miles, B.; Li, Z.; Morsy, M. M.

    2015-12-01

    HydroShare is an online, collaborative system for open sharing of hydrologic data, analytical tools, and models. It supports the sharing of and collaboration around "resources" which are defined primarily by standardized metadata, content data models for each resource type, and an overarching resource data model based on the Open Archives Initiative's Object Reuse and Exchange (OAI-ORE) standard and a hierarchical file packaging system called "BagIt". HydroShare expands the data sharing capability of the CUAHSI Hydrologic Information System by broadening the classes of data accommodated to include geospatial and multidimensional space-time datasets commonly used in hydrology. HydroShare also includes new capability for sharing models, model components, and analytical tools and will take advantage of emerging social media functionality to enhance information about and collaboration around hydrologic data and models. It also supports web services and server/cloud based computation operating on resources for the execution of hydrologic models and analysis and visualization of hydrologic data. HydroShare uses iRODS as a network file system for underlying storage of datasets and models. Collaboration is enabled by casting datasets and models as "social objects". Social functions include both private and public sharing, formation of collaborative groups of users, and value-added annotation of shared datasets and models. The HydroShare web interface and social media functions were developed using the Django web application framework coupled to iRODS. Data visualization and analysis is supported through the Tethys Platform web GIS software stack. Links to external systems are supported by RESTful web service interfaces to HydroShare's content. This presentation will introduce the HydroShare functionality developed to date and describe ongoing development of functionality to support collaboration and integration of data and models.

  5. A single factor underlies the metabolic syndrome: a confirmatory factor analysis.

    PubMed

    Pladevall, Manel; Singal, Bonita; Williams, L Keoki; Brotons, Carlos; Guyer, Heidi; Sadurni, Josep; Falces, Carles; Serrano-Rios, Manuel; Gabriel, Rafael; Shaw, Jonathan E; Zimmet, Paul Z; Haffner, Steven

    2006-01-01

    Confirmatory factor analysis (CFA) was used to test the hypothesis that the components of the metabolic syndrome are manifestations of a single common factor. Three different datasets were used to test and validate the model. The Spanish and Mauritian studies included 207 men and 203 women and 1,411 men and 1,650 women, respectively. A third analytical dataset including 847 men was obtained from a previously published CFA of a U.S. population. The one-factor model included the metabolic syndrome core components (central obesity, insulin resistance, blood pressure, and lipid measurements). We also tested an expanded one-factor model that included uric acid and leptin levels. Finally, we used CFA to compare the goodness of fit of one-factor models with the fit of two previously published four-factor models. The simplest one-factor model showed the best goodness-of-fit indexes (comparative fit index 1, root mean-square error of approximation 0.00). Comparisons of one-factor with four-factor models in the three datasets favored the one-factor model structure. The selection of variables to represent the different metabolic syndrome components and model specification explained why previous exploratory and confirmatory factor analysis, respectively, failed to identify a single factor for the metabolic syndrome. These analyses support the current clinical definition of the metabolic syndrome, as well as the existence of a single factor that links all of the core components.

  6. Benchmarking Deep Learning Models on Large Healthcare Datasets.

    PubMed

    Purushotham, Sanjay; Meng, Chuizheng; Che, Zhengping; Liu, Yan

    2018-06-04

    Deep learning models (aka Deep Neural Networks) have revolutionized many fields including computer vision, natural language processing, and speech recognition, and are being increasingly used in clinical healthcare applications. However, few works exist that have benchmarked the performance of deep learning models against state-of-the-art machine learning models and prognostic scoring systems on publicly available healthcare datasets. In this paper, we present the benchmarking results for several clinical prediction tasks such as mortality prediction, length of stay prediction, and ICD-9 code group prediction using deep learning models, an ensemble of machine learning models (the Super Learner algorithm), and the SAPS II and SOFA scores. We used the publicly available Medical Information Mart for Intensive Care III (MIMIC-III, v1.4) dataset, which includes all patients admitted to an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012, for the benchmarking tasks. Our results show that deep learning models consistently outperform all the other approaches, especially when the 'raw' clinical time series data are used as input features to the models. Copyright © 2018 Elsevier Inc. All rights reserved.
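
    A minimal sketch of the kind of recurrent classifier benchmarked in such studies, written in PyTorch; the feature count, layer sizes, and random inputs are illustrative assumptions, not the architecture or data used by the authors.

        import torch
        import torch.nn as nn

        class MortalityLSTM(nn.Module):
            def __init__(self, n_features=76, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, x):             # x: (batch, time, features)
                _, (h_n, _) = self.lstm(x)    # h_n: (1, batch, hidden)
                return self.head(h_n[-1])     # logits for in-hospital mortality

        model = MortalityLSTM()
        x = torch.randn(8, 48, 76)            # e.g. 48 hourly steps of vitals/labs
        target = torch.randint(0, 2, (8,)).float()
        loss = nn.BCEWithLogitsLoss()(model(x).squeeze(1), target)
        loss.backward()                       # one training step's gradient pass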

  7. Being an honest broker of hydrology: Uncovering, communicating and addressing model error in a climate change streamflow dataset

    NASA Astrophysics Data System (ADS)

    Chegwidden, O.; Nijssen, B.; Pytlak, E.

    2017-12-01

    Any model simulation has errors, including errors in meteorological data, process understanding, model structure, and model parameters. These errors may express themselves as bias, timing lags, and differences in sensitivity between the model and the physical world. The evaluation and handling of these errors can greatly affect the legitimacy, validity and usefulness of the resulting scientific product. In this presentation we will discuss a case study of handling and communicating model errors during the development of a hydrologic climate change dataset for the Pacific Northwestern United States. The dataset was the result of a four-year collaboration between the University of Washington, Oregon State University, the Bonneville Power Administration, the United States Army Corps of Engineers and the Bureau of Reclamation. Along the way, the partnership facilitated the discovery of multiple systematic errors in the streamflow dataset. Through an iterative review process, some of those errors could be resolved. For the errors that remained, honest communication of the shortcomings promoted the dataset's legitimacy. Thoroughly explaining errors also improved ways in which the dataset would be used in follow-on impact studies. Finally, we will discuss the development of the "streamflow bias-correction" step often applied to climate change datasets that will be used in impact modeling contexts. We will describe the development of a series of bias-correction techniques through close collaboration among universities and stakeholders. Through that process, both universities and stakeholders learned about each other's expectations and workflows. This mutual learning process allowed for the development of methods that accommodated the stakeholders' specific engineering requirements. The iterative revision process also produced a functional and actionable dataset while preserving its scientific merit. We will describe how encountering earlier techniques' pitfalls allowed us to develop improved methods for scientists and practitioners alike.
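
    A minimal sketch of empirical quantile mapping, one common form of the streamflow bias correction discussed above; it is illustrative only and not the specific technique adopted by the collaboration.

        import numpy as np

        def quantile_map(sim_hist, obs_hist, sim_future):
            """Map simulated values onto the observed distribution via CDF matching."""
            quantiles = np.linspace(0.0, 1.0, 101)
            sim_q = np.quantile(sim_hist, quantiles)
            obs_q = np.quantile(obs_hist, quantiles)
            # Locate each simulated value on the historical simulated CDF, then read
            # off the observed value at the same non-exceedance probability.
            probs = np.interp(sim_future, sim_q, quantiles)
            return np.interp(probs, quantiles, obs_q)

        rng = np.random.default_rng(0)
        obs = rng.gamma(2.0, 50.0, size=1000)   # observed daily flows
        sim = rng.gamma(2.0, 65.0, size=1000)   # biased simulated flows
        corrected = quantile_map(sim, obs, sim)
        print(round(obs.mean(), 1), round(corrected.mean(), 1))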

  8. A decision tree model for predicting mediastinal lymph node metastasis in non-small cell lung cancer with F-18 FDG PET/CT.

    PubMed

    Pak, Kyoungjune; Kim, Keunyoung; Kim, Mi-Hyun; Eom, Jung Seop; Lee, Min Ki; Cho, Jeong Su; Kim, Yun Seong; Kim, Bum Soo; Kim, Seong Jang; Kim, In Joo

    2018-01-01

    We aimed to develop a decision tree model to improve the diagnostic performance of positron emission tomography/computed tomography (PET/CT) in detecting metastatic lymph nodes (LNs) in non-small cell lung cancer (NSCLC). 115 patients with NSCLC were included in this study. The training dataset included 66 patients, and the model was validated with 49 patients. The decision tree model was developed with 9 variables: short and long diameters of LNs, the ratio of short to long diameter, maximum standardized uptake value (SUVmax) of the LN, mean Hounsfield unit, the ratio of LN SUVmax to ascending aorta SUVmax (LN/AA), and the ratio of LN SUVmax to superior vena cava SUVmax. A total of 301 LNs from 115 patients were evaluated. Nodular calcification was applied as the initial imaging parameter, and LN SUVmax (≥3.95) was assessed as the second. LN/AA (≥2.92) was required in addition to high LN SUVmax. Sensitivity was 50% for the training dataset and 40% for the validation dataset. However, specificity was 99.28% for the training dataset and 96.23% for the validation dataset. In conclusion, we have developed a new decision tree model for interpreting mediastinal LNs. All LNs with nodular calcification were benign, and LNs with high LN SUVmax and high LN/AA were metastatic. Further studies are needed to incorporate subjective parameters and pathologic evaluations into the decision tree model to improve the test performance of PET/CT.
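
    An illustrative sketch of fitting a shallow decision tree on PET/CT lymph-node features with scikit-learn; the synthetic data and the rule used to label it are placeholders echoing the published thresholds, not the authors' model or data.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier, export_text

        rng = np.random.default_rng(1)
        n = 200
        X = np.column_stack([
            rng.integers(0, 2, n),        # nodular calcification (0/1)
            rng.gamma(2.0, 2.0, n),       # LN SUVmax
            rng.gamma(1.5, 1.5, n),       # LN SUVmax / ascending aorta SUVmax
        ])
        y = ((X[:, 0] == 0) & (X[:, 1] > 3.95) & (X[:, 2] > 2.92)).astype(int)

        tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
        print(export_text(tree, feature_names=["calcification", "SUVmax", "LN_AA"]))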

  9. Long-term records of global radiation, carbon and water fluxes derived from multi-satellite data and a process-based model

    NASA Astrophysics Data System (ADS)

    Ryu, Youngryel; Jiang, Chongya

    2016-04-01

    To gain insights into the underlying impacts of global climate change on terrestrial ecosystem fluxes, we present long-term (1982-2015) global radiation, carbon and water flux products generated by integrating multi-satellite data with a process-based model, the Breathing Earth System Simulator (BESS). BESS is a coupled process model that integrates radiative transfer in the atmosphere and canopy, photosynthesis (GPP), and evapotranspiration (ET). BESS was designed to be most sensitive to the variables that can be quantified reliably, taking full advantage of remote sensing atmospheric and land products. Originally, BESS relied entirely on MODIS data for its input variables to produce global GPP and ET during the MODIS era. This study extends that work to provide a series of long-term products from 1982 to 2015 by incorporating AVHRR data. In addition to GPP and ET, more datasets related to land surface processes are mapped to facilitate the discovery of ecological variations and changes. The CLARA-A1 cloud property datasets and the TOMS aerosol datasets, along with the GLASS land surface albedo datasets, were input to a look-up table derived from an atmospheric radiative transfer model to produce the direct and diffuse components of visible and near-infrared radiation. These radiation components, together with the LAI3g datasets and the GLASS land surface albedo datasets, were used to calculate absorbed radiation through a clumping-corrected two-stream canopy radiative transfer model. ECMWF ERA-Interim air temperature data were downscaled using the ALP-II land surface temperature dataset and a region-dependent regression model. The spatial and seasonal variations of CO2 concentration were accounted for by OCO-2 datasets, whereas NOAA's global CO2 growth rate data were used to describe interannual variations. All these remote-sensing-based datasets were used to run BESS. Daily fluxes at 1/12 degree were computed and then aggregated to half-month intervals to match the spatio-temporal resolution of the LAI3g dataset. The BESS GPP and ET products were compared to other independent datasets, including MPI-BGC and CLM. Overall, the BESS products show good agreement with the other two datasets, indicating a compelling potential for bridging remote sensing and land surface models.

  10. School Attendance Problems and Youth Psychopathology: Structural Cross-Lagged Regression Models in Three Longitudinal Datasets

    PubMed Central

    Wood, Jeffrey J.; Lynne, Sarah D.; Langer, David A.; Wood, Patricia A.; Clark, Shaunna L.; Eddy, J. Mark; Ialongo, Nicholas

    2011-01-01

    This study tests a model of reciprocal influences between absenteeism and youth psychopathology using three longitudinal datasets (Ns = 20,745, 2,311, and 671). Participants in 1st through 12th grades were interviewed annually or bi-annually. Measures of psychopathology include self-, parent-, and teacher-report questionnaires. Structural cross-lagged regression models were tested. In a nationally representative dataset (Add Health), middle school students with relatively greater absenteeism at study year 1 tended towards increased depression and conduct problems in study year 2, over and above the effects of autoregressive associations and demographic covariates. The opposite direction of effects was found for both middle and high school students. Analyses with two regionally representative datasets were also partially supportive. Longitudinal links were more evident in adolescence than in childhood. PMID:22188462

  11. Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

    PubMed Central

    Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.

    2016-01-01

    An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777

  12. The Development of Novel Chemical Fragment-Based Descriptors Using Frequent Common Subgraph Mining Approach and Their Application in QSAR Modeling.

    PubMed

    Khashan, Raed; Zheng, Weifan; Tropsha, Alexander

    2014-03-01

    We present a novel approach to generating fragment-based molecular descriptors. The molecules are represented by labeled undirected chemical graphs. Fast Frequent Subgraph Mining (FFSM) is used to find chemical fragments (subgraphs) that occur in at least a subset of all molecules in a dataset. The collection of frequent subgraphs (FSG) forms a set of dataset-specific descriptors whose values for each molecule are defined by the number of times each frequent fragment occurs in that molecule. We have employed the FSG descriptors to develop variable-selection k-nearest neighbor (kNN) QSAR models of several datasets with binary target properties, including Maximum Recommended Therapeutic Dose (MRTD), Salmonella Mutagenicity (Ames Genotoxicity), and P-Glycoprotein (PGP) data. Each dataset was divided into training, test, and validation sets to establish the statistical figures of merit reflecting the models' validated predictive power. The classification accuracies of the models for both training and test sets for all datasets exceeded 75 %, and the accuracy for the external validation sets exceeded 72 %. The model accuracies were comparable to or better than those reported earlier in the literature for the same datasets. Furthermore, the use of fragment-based descriptors affords mechanistic interpretation of validated QSAR models in terms of essential chemical fragments responsible for the compounds' target property. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
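
    A brief sketch of a kNN classifier on fragment-count descriptors using scikit-learn; the random fragment-occurrence matrix stands in for FFSM-derived counts and the labeling rule is purely illustrative.

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        rng = np.random.default_rng(42)
        X = rng.poisson(1.0, size=(300, 50))        # molecules x frequent-fragment counts
        y = (X[:, :5].sum(axis=1) > 4).astype(int)  # toy binary target property

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
        knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
        print("test accuracy:", knn.score(X_te, y_te))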

  13. Subsampling for dataset optimisation

    NASA Astrophysics Data System (ADS)

    Ließ, Mareike

    2017-04-01

    Soil-landscapes have formed by the interaction of soil-forming factors and pedogenic processes. In modelling these landscapes in their pedodiversity and the underlying processes, a representative unbiased dataset is required. This concerns model input as well as output data. However, very often big datasets are available which are highly heterogeneous and were gathered for various purposes, but not to model a particular process or data space. As a first step, the overall data space and/or landscape section to be modelled needs to be identified, including considerations regarding scale and resolution. Then the available dataset needs to be optimised via subsampling to represent this n-dimensional data space well. A couple of well-known sampling designs may be adapted to suit this purpose. The overall approach follows three main strategies: (1) The data space may be condensed and de-correlated by a factor analysis to facilitate the subsampling process. (2) Different methods of pattern recognition serve to structure the n-dimensional data space to be modelled into units which then form the basis for the optimisation of an existing dataset through a sensible selection of samples. Along the way, data units for which there is currently insufficient soil data available may be identified. (3) Random samples from the n-dimensional data space may be replaced by similar samples from the available dataset. While a representative dataset is a precondition for developing data-driven statistical models, this approach may also help to develop universal process models and identify limitations in existing models.
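
    A minimal sketch of strategy (2) above: structure the covariate space with a clustering step, then keep the sample nearest each cluster centre so the subsample covers the n-dimensional data space; sizes and variables are illustrative assumptions.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(7)
        X = rng.normal(size=(5000, 6))           # heterogeneous soil covariates
        Xs = StandardScaler().fit_transform(X)

        k = 200                                  # desired subsample size
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xs)
        # index of the observation closest to each cluster centre
        dists = np.linalg.norm(Xs[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        subsample_idx = dists.argmin(axis=0)
        print(subsample_idx.shape)               # (200,) representative rows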

  14. Obs4MIPS: Satellite Observations for Model Evaluation

    NASA Astrophysics Data System (ADS)

    Ferraro, R.; Waliser, D. E.; Gleckler, P. J.

    2017-12-01

    This poster will review the current status of the obs4MIPs project, whose purpose is to provide a limited collection of well-established and documented datasets for comparison with Earth system models (https://www.earthsystemcog.org/projects/obs4mips/). These datasets have been reformatted to correspond with the CMIP5 model output requirements, and include technical documentation specifically targeted for their use in model output evaluation. The project holdings now exceed 120 datasets with observations that directly correspond to CMIP5 model output variables, with new additions in response to the CMIP6 experiments. With the growth in climate model output data volume, it is increasingly difficult to bring the model output and the observations together to do evaluations. The positioning of the obs4MIPs datasets within the Earth System Grid Federation (ESGF) allows for the use of currently available and planned online tools within the ESGF to perform analysis using model output and observational datasets without necessarily downloading everything to a local workstation. This past year, obs4MIPs has updated its submission guidelines to closely align with changes in the CMIP6 experiments, and is implementing additional indicators and ancillary data to allow users to more easily determine the efficacy of an obs4MIPs dataset for specific evaluation purposes. This poster will present the new guidelines and indicators, and update the list of current obs4MIPs holdings and their connection to the ESGF evaluation and analysis tools currently available and being developed for the CMIP6 experiments.

  15. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    PubMed Central

    Jensen, Tue V.; Pinson, Pierre

    2017-01-01

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation. PMID:29182600

  16. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system.

    PubMed

    Jensen, Tue V; Pinson, Pierre

    2017-11-28

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  17. RE-Europe, a large-scale dataset for modeling a highly renewable European electricity system

    NASA Astrophysics Data System (ADS)

    Jensen, Tue V.; Pinson, Pierre

    2017-11-01

    Future highly renewable energy systems will couple to complex weather and climate dynamics. This coupling is generally not captured in detail by the open models developed in the power and energy system communities, where such open models exist. To enable modeling such a future energy system, we describe a dedicated large-scale dataset for a renewable electric power system. The dataset combines a transmission network model with information on generation and demand. Generation includes conventional generators with their technical and economic characteristics, as well as weather-driven forecasts and corresponding realizations for renewable energy generation for a period of 3 years. These may be scaled according to the envisioned degrees of renewable penetration in a future European energy system. The spatial coverage, completeness and resolution of this dataset open the door to the evaluation, scaling analysis and replicability check of a wealth of proposals in, e.g., market design, network actor coordination and forecasting of renewable power generation.

  18. The Mpi-M Aerosol Climatology (MAC)

    NASA Astrophysics Data System (ADS)

    Kinne, S.

    2014-12-01

    Monthly gridded global data-sets for aerosol optical properties (AOD, SSA and g) and for aerosol microphysical properties (CCN and IN) offer a (less complex) alternate path to include aerosol radiative effects and aerosol impacts on cloud microphysics in global simulations. Based on merging AERONET sun-/sky-photometer data onto background maps provided by AeroCom phase 1 modeling output, the MPI-M Aerosol Climatology (MAC) version 1 was developed and applied in IPCC simulations with ECHAM and as an ancillary data-set in satellite-based global data-sets. An updated version 2 of this climatology will be presented, now applying central values from the more recent AeroCom phase 2 modeling and utilizing the better global coverage of trusted sun-photometer data, including statistics from the Marine Aerosol Network (MAN). Applications include spatial distributions of estimates for aerosol direct and aerosol indirect radiative effects.

  19. Analysis models for the estimation of oceanic fields

    NASA Technical Reports Server (NTRS)

    Carter, E. F.; Robinson, A. R.

    1987-01-01

    A general model for statistically optimal estimates is presented for dealing with scalar, vector and multivariate datasets. The method deals with anisotropic fields and treats space and time dependence equivalently. Problems addressed include the analysis, or the production of synoptic time series of regularly gridded fields from irregular and gappy datasets, and the estimation of fields by compositing observations from several different instruments and sampling schemes. Technical issues are discussed, including the convergence of statistical estimates, the choice of representation of the correlations, the influential domain of an observation, and the efficiency of numerical computations.
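
    A minimal sketch of such a statistically optimal (objective-analysis) estimate for a scalar field observed at scattered points; the Gaussian covariance, its length scale, and the observation error are illustrative assumptions.

        import numpy as np

        def objective_analysis(x_grid, x_obs, y_obs, length=1.0, noise=0.1):
            """Optimal estimate of a zero-mean field on x_grid from gappy observations."""
            def cov(a, b):
                d = a[:, None] - b[None, :]
                return np.exp(-0.5 * (d / length) ** 2)
            B_oo = cov(x_obs, x_obs) + noise**2 * np.eye(len(x_obs))  # obs-obs covariance + error
            B_go = cov(x_grid, x_obs)                                 # grid-obs covariance
            weights = np.linalg.solve(B_oo, y_obs)
            return B_go @ weights

        x_obs = np.array([0.1, 0.4, 0.45, 0.9, 1.7])
        y_obs = np.sin(2 * np.pi * x_obs)
        x_grid = np.linspace(0, 2, 21)
        print(np.round(objective_analysis(x_grid, x_obs, y_obs, length=0.3), 2))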

  20. Using surveillance and monitoring data of different origins in a Salmonella source attribution model: a European Union example with challenges and proposed solutions.

    PubMed

    DE Knegt, L V; Pires, S M; Hald, T

    2015-04-01

    Microbial subtyping approaches are commonly used for source attribution of human salmonellosis. Such methods require data on Salmonella in animals and humans, outbreaks, infection abroad and amounts of food available for consumption. A source attribution model was applied to 24 European countries, requiring special data management to produce a standardized dataset. Salmonellosis data on animals and humans were obtained from datasets provided by the European Food Safety Authority. The amount of food available for consumption was calculated based on production and trade data. Limitations included different types of underreporting, non-participation in prevalence studies, and non-availability of trade data. Cases without travel information were assumed to be domestic; non-subtyped human or animal records were re-identified according to proportions observed in reference datasets; missing trade information was estimated based on previous years. The resulting dataset included data on 24 serovars in humans, broilers, laying hens, pigs and turkeys in 24 countries.

  1. Data assimilation and model evaluation experiment datasets

    NASA Technical Reports Server (NTRS)

    Lai, Chung-Cheng A.; Qian, Wen; Glenn, Scott M.

    1994-01-01

    The Institute for Naval Oceanography, in cooperation with Naval Research Laboratories and universities, executed the Data Assimilation and Model Evaluation Experiment (DAMEE) for the Gulf Stream region during fiscal years 1991-1993. Enormous effort has gone into the preparation of several high-quality and consistent datasets for model initialization and verification. This paper describes the preparation process, the temporal and spatial scopes, the contents, the structure, etc., of these datasets. The goal of DAMEE and the need for data for the four phases of the experiment are briefly stated. The preparation of DAMEE datasets consisted of a series of processes: (1) collection of observational data; (2) analysis and interpretation; (3) interpolation using the Optimum Thermal Interpolation System package; (4) quality control and re-analysis; and (5) data archiving and software documentation. The data products from these processes included a time series of 3D fields of temperature and salinity, 2D fields of surface dynamic height and mixed-layer depth, analysis of the Gulf Stream and rings system, and bathythermograph profiles. To date, these are the most detailed and high-quality data for mesoscale ocean modeling, data assimilation, and forecasting research. Feedback from ocean modeling groups who tested this data was incorporated into its refinement. Suggestions for DAMEE data uses include (1) ocean modeling and data assimilation studies, (2) diagnosis and theoretical studies, and (3) comparisons with locally detailed observations.

  2. Fort Bliss Geothermal Area Data: Temperature profile, logs, schematic model and cross section

    DOE Data Explorer

    Adam Brandt

    2015-11-15

    This dataset contains a variety of data about the Fort Bliss geothermal area, part of the southern portion of the Tularosa Basin, New Mexico. It includes schematic models of the McGregor Geothermal System and a shallow (2 meter depth) temperature survey of the Fort Bliss geothermal area with 63 data points. The dataset also contains Century OH logs, a full temperature profile, and complete logs from well RMI 56-5, including resistivity and porosity data; drill logs with drill rate, depth, lithology, mineralogy, fractures, temperature, pit total, gases, and descriptions among other measurements; as well as CDL, CNL, DIL, GR Caliper and Temperature files. Two cross sections through the Fort Bliss area, also included, show well position and depth. The included surface map shows faults and well spatial distribution. Inferred and observed fault distributions from gravity surveys around the Fort Bliss geothermal area are also provided.

  3. Topographic and hydrographic GIS dataset for the Afghanistan Geological Survey and U.S. Geological Survey 2010 Minerals Project

    USGS Publications Warehouse

    Chirico, P.G.; Moran, T.W.

    2011-01-01

    This dataset contains a collection of 24 folders, each representing a specific U.S. Geological Survey area of interest (AOI; fig. 1), as well as datasets for AOI subsets. Each folder includes the extent, contours, digital elevation model (DEM), and hydrography of the corresponding AOI, organized into vector feature and raster datasets. The dataset comprises a geographic information system (GIS), which is available upon request from the USGS Afghanistan programs Web site (http://afghanistan.cr.usgs.gov/minerals.php), along with the maps of the 24 USGS AOIs.

  4. MODBASE, a database of annotated comparative protein structure models

    PubMed Central

    Pieper, Ursula; Eswar, Narayanan; Stuart, Ashley C.; Ilyin, Valentin A.; Sali, Andrej

    2002-01-01

    MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on PSI-BLAST, IMPALA and MODELLER. MODBASE uses the MySQL relational database management system for flexible and efficient querying, and the MODVIEW Netscape plugin for viewing and manipulating multiple sequences and structures. It is updated regularly to reflect the growth of the protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different datasets. The largest dataset contains models for domains in 304 517 out of 539 171 unique protein sequences in the complete TrEMBL database (23 March 2001); only models based on significant alignments (PSI-BLAST E-value < 10^-4) and models assessed to have the correct fold are included. Other datasets include models for target selection and structure-based annotation by the New York Structural Genomics Research Consortium, models for prediction of genes in the Drosophila melanogaster genome, models for structure determination of several ribosomal particles and models calculated by the MODWEB comparative modeling web server. PMID:11752309

  5. QSAR-Based Models for Designing Quinazoline/Imidazothiazoles/Pyrazolopyrimidines Based Inhibitors against Wild and Mutant EGFR

    PubMed Central

    Chauhan, Jagat Singh; Dhanda, Sandeep Kumar; Singla, Deepak; Agarwal, Subhash M.; Raghava, Gajendra P. S.

    2014-01-01

    Overexpression of EGFR is responsible for causing a number of cancers, including lung cancer, as it activates various downstream signaling pathways. Thus, it is important to control EGFR function in order to treat cancer patients. It is well established that inhibiting ATP binding within the EGFR kinase domain regulates its function. The existing quinazoline-derivative-based drugs used for treating lung cancer inhibit the wild type of EGFR. In this study, we have made a systematic attempt to develop QSAR models for designing quinazoline derivatives that could inhibit wild-type EGFR and imidazothiazole/pyrazolopyrimidine derivatives against mutant EGFR. In this study, three types of prediction methods have been developed to design inhibitors against EGFR (wild, mutant and both). First, we developed models for predicting inhibitors against wild-type EGFR by training and testing on a dataset containing 128 quinazoline-based inhibitors. This dataset was divided into two subsets, wild_train and wild_valid, containing 103 and 25 inhibitors, respectively. The models were trained and tested on the wild_train dataset, while performance was evaluated on wild_valid, the validation dataset. We achieved a maximum correlation between predicted and experimentally determined inhibition (IC50) of 0.90 on the validation dataset. Secondly, we developed models for predicting inhibitors against mutant EGFR (L858R) on the mutant_train and mutant_valid datasets and achieved maximum correlations between 0.834 and 0.850 on these datasets. Finally, an integrated hybrid model was developed on a dataset containing both wild and mutant inhibitors and achieved maximum correlations between 0.761 and 0.850 on the different datasets. In order to promote open source drug discovery, we developed a webserver for designing inhibitors against wild and mutant EGFR, along with standalone (http://osddlinux.osdd.net/) and Galaxy (http://osddlinux.osdd.net:8001) versions of the software. We hope our webserver (http://crdd.osdd.net/oscadd/ntegfr/) will play a vital role in designing new anticancer drugs. PMID:24992720

  6. Predictive modeling of treatment resistant depression using data from STAR*D and an independent clinical study.

    PubMed

    Nie, Zhi; Vairavan, Srinivasan; Narayan, Vaibhav A; Ye, Jieping; Li, Qingqin S

    2018-01-01

    Identification of risk factors of treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work in predictive modeling of treatment resistant depression (TRD) via partition of the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort, RIS-INT-93, as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model TRD outcome at level 2. For TRD defined using QIDS-C16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and the independent RIS-INT-93 dataset with an area under the receiver operating characteristic curve (AUC) of 0.70-0.78 and 0.72-0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. The model developed using the top 30 features identified by a feature selection technique (k-means clustering followed by a χ2 test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using overlapping features between STAR*D and RIS-INT-93 achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in the STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response or symptom severity at week 2. These results indicate that prediction of TRD prior to undergoing a second round of antidepressant treatment could be feasible even in the absence of biomarker data.
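
    A generic sketch of the "select top-k features, then validate by AUC" workflow described above, using scikit-learn; the synthetic non-negative features stand in for clinical variables and the pipeline is not the authors' exact model.

        import numpy as np
        from sklearn.feature_selection import SelectKBest, chi2
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(3)
        X = rng.integers(0, 5, size=(1000, 120)).astype(float)  # e.g. item-level scores
        y = (X[:, 0] + X[:, 1] + rng.normal(0, 2, 1000) > 5).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        selector = SelectKBest(chi2, k=30).fit(X_tr, y_tr)      # chi-squared screening
        clf = LogisticRegression(max_iter=1000).fit(selector.transform(X_tr), y_tr)
        probs = clf.predict_proba(selector.transform(X_te))[:, 1]
        print("AUC:", round(roc_auc_score(y_te, probs), 3))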

  7. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures.

    PubMed

    Urbanowicz, Ryan J; Kiralis, Jeff; Sinnott-Armstrong, Nicholas A; Heberling, Tamra; Fisher, Jonathan M; Moore, Jason H

    2012-10-01

    Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. However, such methods are computationally expensive, difficult to adapt to multiple objectives, and unlikely to yield models with a precise form of epistasis which we refer to as pure and strict. Purely and strictly epistatic models constitute the worst-case in terms of detecting disease associations, since such associations may only be observed if all n-loci are included in the disease model. This makes them an attractive gold standard for simulation studies considering complex multi-locus effects. We introduce GAMETES, a user-friendly software package and algorithm which generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. GAMETES rapidly and precisely generates random, pure, strict n-locus models with specified genetic constraints. These constraints include heritability, minor allele frequencies of the SNPs, and population prevalence. GAMETES also includes a simple dataset simulation strategy which may be utilized to rapidly generate an archive of simulated datasets for given genetic models. We highlight the utility and limitations of GAMETES with an example simulation study using MDR, an algorithm designed to detect epistasis. GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures. While GAMETES has a limited ability to generate models with higher heritabilities, it is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms. In addition, the GAMETES modeling strategy may be flexibly combined with any dataset simulation strategy. Beyond dataset simulation, GAMETES could be employed to pursue theoretical characterization of genetic models and epistasis.
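
    A generic sketch of simulating a case/control dataset from a two-locus penetrance table, the kind of output GAMETES automates; the table values, allele frequencies, and sample size are arbitrary, and this is not the GAMETES algorithm itself.

        import numpy as np

        rng = np.random.default_rng(11)
        maf = [0.3, 0.3]                                  # minor allele frequencies
        n = 2000
        genos = np.column_stack([rng.binomial(2, m, n) for m in maf])  # 0/1/2 coding

        # penetrance[g1, g2] = P(case | genotype pair); an interaction-driven pattern
        penetrance = np.array([[0.05, 0.10, 0.05],
                               [0.10, 0.02, 0.10],
                               [0.05, 0.10, 0.05]])
        p_case = penetrance[genos[:, 0], genos[:, 1]]
        status = rng.binomial(1, p_case)                  # 1 = case, 0 = control
        print("prevalence:", round(status.mean(), 3))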

  8. Developing a regional retrospective ensemble precipitation dataset for watershed hydrology modeling, Idaho, USA

    NASA Astrophysics Data System (ADS)

    Flores, A. N.; Smith, K.; LaPorte, P.

    2011-12-01

    Applications like flood forecasting, military trafficability assessment, and slope stability analysis necessitate the use of models capable of resolving hydrologic states and fluxes at spatial scales of hillslopes (e.g., 10s to 100s m). These models typically require precipitation forcings at spatial scales of kilometers or better and time intervals of hours. Yet in especially rugged terrain that typifies much of the Western US and throughout much of the developing world, precipitation data at these spatiotemporal resolutions is difficult to come by. Ground-based weather radars have significant problems in high-relief settings and are sparsely located, leaving significant gaps in coverage and high uncertainties. Precipitation gages provide accurate data at points but are very sparsely located and their placement is often not representative, yielding significant coverage gaps in a spatial and physiographic sense. Numerical weather prediction efforts have made precipitation data, including critically important information on precipitation phase, available globally and in near real-time. However, these datasets present watershed modelers with two problems: (1) spatial scales of many of these datasets are tens of kilometers or coarser, (2) numerical weather models used to generate these datasets include a land surface parameterization that in some circumstances can significantly affect precipitation predictions. We report on the development of a regional precipitation dataset for Idaho that leverages: (1) a dataset derived from a numerical weather prediction model, (2) gages within Idaho that report hourly precipitation data, and (3) a long-term precipitation climatology dataset. Hourly precipitation estimates from the Modern Era Retrospective-analysis for Research and Applications (MERRA) are stochastically downscaled using a hybrid orographic and statistical model from their native resolution (1/2 x 2/3 degrees) to a resolution of approximately 1 km. Downscaled precipitation realizations are conditioned on hourly observations from reporting gages and then conditioned again on the Parameter-elevation Regressions on Independent Slopes Model (PRISM) at the monthly timescale to reflect orographic precipitation trends common to watersheds of the Western US. While this methodology potentially introduces cross-pollination of errors due to the re-use of precipitation gage data, it nevertheless achieves an ensemble-based precipitation estimate and appropriate measures of uncertainty at a spatiotemporal resolution appropriate for watershed modeling.

  9. A photogrammetric technique for generation of an accurate multispectral optical flow dataset

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.

    2017-06-01

    The presence of an accurate dataset is the key requirement for the successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets have been developed in recent years and gave rise to many powerful algorithms. However, most of these datasets include only images captured in the visible spectrum. This paper is focused on the creation of a multispectral optical flow dataset with an accurate ground truth. The generation of an accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement has been developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling or laser scanning. Such techniques either work only with synthetic optical flow or provide a sparse ground truth. In this paper a new photogrammetric method for generating an accurate ground truth optical flow is proposed. The method combines the accuracy and density of synthetic optical flow datasets with the flexibility of laser-scanning-based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.
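
    A short sketch of how a ground-truth flow field from such a dataset can be used to score an estimator with the average endpoint error (AEE); the random frames and zero ground-truth flow are placeholders for real dataset sequences.

        import numpy as np
        import cv2

        prev = np.random.randint(0, 255, (240, 320), dtype=np.uint8)
        curr = np.random.randint(0, 255, (240, 320), dtype=np.uint8)
        gt_flow = np.zeros((240, 320, 2), dtype=np.float32)        # ground-truth (u, v)

        est_flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
        aee = np.mean(np.linalg.norm(est_flow - gt_flow, axis=2))  # endpoint error per pixel
        print("average endpoint error (px):", round(float(aee), 3))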

  10. Predictive Models of Cognitive Outcomes of Developmental Insults

    NASA Astrophysics Data System (ADS)

    Chan, Yupo; Bouaynaya, Nidhal; Chowdhury, Parimal; Leszczynska, Danuta; Patterson, Tucker A.; Tarasenko, Olga

    2010-04-01

    Representatives of Arkansas medical, research and educational institutions have gathered over the past four years to discuss the relationship between functional developmental perturbations and their neurological consequences. We wish to track the effects of developmental perturbations on the nervous system over time and across species. Apart from perturbations, the sequence of events that occurs during neural development was found to be remarkably conserved across mammalian species. The tracking includes consequences for anatomical regions and behavioral changes. The ultimate goal is to develop a predictive model of long-term genotypic and phenotypic outcomes that includes developmental insults. Such a model can subsequently be fostered into an educated intervention for therapeutic purposes. Several datasets were identified to test plausible hypotheses, ranging from evoked potential datasets to sleep-disorder datasets. An initial model may be mathematical and conceptual. However, we expect to see rapid progress as large-scale gene expression studies in the mammalian brain permit genome-wide searches to discover genes that are uniquely expressed in brain circuits and regions. These genes ultimately control behavior. By using a validated model we endeavor to make useful predictions.

  11. Multi-scale mantle structure underneath the Americas from a new tomographic model of seismic shear velocity

    NASA Astrophysics Data System (ADS)

    Porritt, R. W.; Becker, T. W.; Auer, L.; Boschi, L.

    2017-12-01

    We present a whole-mantle, variable resolution, shear-wave tomography model based on newly available and existing seismological datasets including regional body-wave delay times and multi-mode Rayleigh and Love wave phase delays. Our body wave dataset includes 160,000 S wave delays used in the DNA13 regional tomographic model focused on the western and central US, 86,000 S and SKS delays measured on stations in western South America (Porritt et al., in prep), and 3,900,000 S+ phases measured by correlation between data observed at stations in the IRIS global networks (IU, II) and stations in the contiguous US, against synthetic data generated with IRIS Syngine. The surface wave dataset includes fundamental mode and overtone Rayleigh wave data from Schaeffer and Lebedev (2014), ambient noise derived Rayleigh wave and Love wave measurements from Ekstrom (2013), newly computed fundamental mode ambient noise Rayleigh wave phase delays for the contiguous US up to July 2017, and other, previously published, measurements. These datasets, along with a data-adaptive parameterization utilized for the SAVANI model (Auer et al., 2014), should allow significantly finer-scale imaging than previous global models, rivaling that of regional-scale approaches, under the USArray footprint in the contiguous US, while seamlessly integrating into a global model. We parameterize the model for both vertically (vSV) and horizontally (vSH) polarized shear velocities by accounting for the different sensitivities of the various phases and wave types. The resulting radially anisotropic model should allow for a range of new geodynamic analyses, including estimates of mantle flow induced topography or seismic anisotropy, without generating artifacts due to edge effects, or requiring assumptions about the structure of the region outside the well resolved model space. Our model shows a number of features, including indications of the effects of edge-driven convection in the Cordillera and along the eastern margin and larger-scale convection due to the subduction of the Farallon slab and along the edge of the Laurentia cratonic margin.

  12. Population Pharmacokinetics of Intravenous Paracetamol (Acetaminophen) in Preterm and Term Neonates: Model Development and External Evaluation.

    PubMed

    Cook, Sarah F; Roberts, Jessica K; Samiee-Zafarghandy, Samira; Stockmann, Chris; King, Amber D; Deutsch, Nina; Williams, Elaine F; Allegaert, Karel; Wilkins, Diana G; Sherwin, Catherine M T; van den Anker, John N

    2016-01-01

    The aims of this study were to develop a population pharmacokinetic model for intravenous paracetamol in preterm and term neonates and to assess the generalizability of the model by testing its predictive performance in an external dataset. Nonlinear mixed-effects models were constructed from paracetamol concentration-time data in NONMEM 7.2. Potential covariates included body weight, gestational age, postnatal age, postmenstrual age, sex, race, total bilirubin, and estimated glomerular filtration rate. An external dataset was used to test the predictive performance of the model through calculation of bias, precision, and normalized prediction distribution errors. The model-building dataset included 260 observations from 35 neonates with a mean gestational age of 33.6 weeks [standard deviation (SD) 6.6]. Data were well-described by a one-compartment model with first-order elimination. Weight predicted paracetamol clearance and volume of distribution, which were estimated as 0.348 L/h (5.5 % relative standard error; 30.8 % coefficient of variation) and 2.46 L (3.5 % relative standard error; 14.3 % coefficient of variation), respectively, at the mean subject weight of 2.30 kg. An external evaluation was performed on an independent dataset that included 436 observations from 60 neonates with a mean gestational age of 35.6 weeks (SD 4.3). The median prediction error was 10.1 % [95 % confidence interval (CI) 6.1-14.3] and the median absolute prediction error was 25.3 % (95 % CI 23.1-28.1). Weight predicted intravenous paracetamol pharmacokinetics in neonates ranging from extreme preterm to full-term gestational status. External evaluation suggested that these findings should be generalizable to other similar patient populations.
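
    A worked sketch of the reported one-compartment, first-order model at the mean subject weight (CL = 0.348 L/h, V = 2.46 L); the 10 mg/kg dose, the idealized IV bolus, and the sampling times are illustrative assumptions rather than study design details.

        import numpy as np

        CL, V = 0.348, 2.46            # clearance (L/h) and volume of distribution (L)
        dose = 10 * 2.30               # mg, for a 2.30 kg neonate
        k_e = CL / V                   # first-order elimination rate constant (1/h)

        t = np.arange(0, 25, 1.0)                 # hours after the dose
        conc = (dose / V) * np.exp(-k_e * t)      # plasma concentration (mg/L)
        print("t1/2 (h):", round(np.log(2) / k_e, 1))
        print("C at 0, 6, 12 h (mg/L):", np.round(conc[[0, 6, 12]], 2))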

  13. Population Pharmacokinetics of Intravenous Paracetamol (Acetaminophen) in Preterm and Term Neonates: Model Development and External Evaluation

    PubMed Central

    Cook, Sarah F.; Roberts, Jessica K.; Samiee-Zafarghandy, Samira; Stockmann, Chris; King, Amber D.; Deutsch, Nina; Williams, Elaine F.; Allegaert, Karel; Sherwin, Catherine M. T.; van den Anker, John N.

    2017-01-01

    Objectives The aims of this study were to develop a population pharmacokinetic model for intravenous paracetamol in preterm and term neonates and to assess the generalizability of the model by testing its predictive performance in an external dataset. Methods Nonlinear mixed-effects models were constructed from paracetamol concentration–time data in NONMEM 7.2. Potential covariates included body weight, gestational age, postnatal age, postmenstrual age, sex, race, total bilirubin, and estimated glomerular filtration rate. An external dataset was used to test the predictive performance of the model through calculation of bias, precision, and normalized prediction distribution errors. Results The model-building dataset included 260 observations from 35 neonates with a mean gestational age of 33.6 weeks [standard deviation (SD) 6.6]. Data were well-described by a one-compartment model with first-order elimination. Weight predicted paracetamol clearance and volume of distribution, which were estimated as 0.348 L/h (5.5 % relative standard error; 30.8 % coefficient of variation) and 2.46 L (3.5 % relative standard error; 14.3 % coefficient of variation), respectively, at the mean subject weight of 2.30 kg. An external evaluation was performed on an independent dataset that included 436 observations from 60 neonates with a mean gestational age of 35.6 weeks (SD 4.3). The median prediction error was 10.1 % [95 % confidence interval (CI) 6.1–14.3] and the median absolute prediction error was 25.3 % (95 % CI 23.1–28.1). Conclusions Weight predicted intravenous paracetamol pharmacokinetics in neonates ranging from extreme preterm to full-term gestational status. External evaluation suggested that these findings should be generalizable to other similar patient populations. PMID:26201306

  14. National Hydrography Dataset Plus (NHDPlus)

    EPA Pesticide Factsheets

    The NHDPlus Version 1.0 is an integrated suite of application-ready geospatial data sets that incorporate many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD), improved networking, naming, and value-added attributes (VAA's). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage-enforcement technique first broadly applied in New England, and thus dubbed The New-England Method. This technique involves burning-in the 1:100,000-scale NHD and when available building walls using the national Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. An interdisciplinary team from the U.S. Geological Survey (USGS), U.S. Environmental Protection Agency (USEPA), and contractors, over the last two years has found this method to produce the best quality NHD catchments using an automated process. The VAAs include greatly enhanced capabilities for upstream and downstream navigation, analysis and modeling. Examples include: retrieve all flowlines (predominantly confluence-to-confluence stream segments) and catchments upstream of a given flowline using queries rather than by slower flowline-by-flowline navigation; retrieve flowlines by stream order; subset a stream level path sorted in hydrologic order for st
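
    A generic sketch of the upstream-navigation queries mentioned above, using a breadth-first walk over a flowline table; the (comid, from_node, to_node) tuples are a hypothetical simplification of NHDPlus attributes, not the actual VAA schema.

        from collections import defaultdict, deque

        flowlines = [            # (comid, from_node, to_node)
            (1, "A", "B"), (2, "B", "C"), (3, "D", "B"), (4, "C", "E"),
        ]

        upstream_of_node = defaultdict(list)
        for comid, f, t in flowlines:
            upstream_of_node[t].append((comid, f))   # flowlines draining into node t

        def upstream_flowlines(start_comid):
            """Return every comid upstream of the given flowline (breadth-first)."""
            start_from = {c: f for c, f, _ in flowlines}[start_comid]
            found, queue = set(), deque([start_from])
            while queue:
                node = queue.popleft()
                for comid, f in upstream_of_node[node]:
                    if comid not in found:
                        found.add(comid)
                        queue.append(f)
            return sorted(found)

        print(upstream_flowlines(4))    # -> [1, 2, 3]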

  15. Data Citation Concept for CMIP6

    NASA Astrophysics Data System (ADS)

    Stockhause, M.; Toussaint, F.; Lautenschlager, M.; Lawrence, B.

    2015-12-01

    There is a broad consensus among data centers and scientific publishers on Force 11's 'Joint Declaration of Data Citation Principles'. Putting these principles into operation is not always straightforward. The focus for CMIP6 data citations lies on the citation of data created by others and used in an analysis underlying an article; for such source data, an article by the data creators is usually not available ('stand-alone data publication'). The planned data citation granularities are model data (data collections containing all datasets provided for the project by a single model) and experiment data (data collections containing all datasets for a scientific experiment run by a single model). In the case of large international projects or activities like CMIP, the data are commonly stored and disseminated by multiple repositories in a federated data infrastructure such as the Earth System Grid Federation (ESGF). The individual repositories are subject to different institutional and national policies. A Data Management Plan (DMP) will define a certain standard for the repositories, including data handling procedures. Another aspect of CMIP data, relevant for data citations, is its dynamic nature. For such large data collections, datasets are added, revised and retracted for years, before the data collection becomes stable for a data citation entity including all model or simulation data. Thus, a critical issue for ESGF is data consistency, requiring thorough dataset versioning to enable the identification of the data collection in the cited version. Currently, the ESGF is designed for accessing the latest dataset versions. Data citation introduces the necessity to support older and retracted dataset versions by storing metadata even beyond data availability (data unpublished in ESGF). Apart from ESGF, other infrastructure components exist for CMIP which provide information that has to be connected to the CMIP6 data, e.g. ES-DOC providing information on models and simulations, and the IPCC Data Distribution Centre (DDC) storing a subset of data together with available metadata (ES-DOC) for the long-term reuse of the interdisciplinary community. Other connections exist to standard project vocabularies, to personal identifiers (e.g. ORCID), or to data products (including provenance information).

  16. Using NASA's Giovanni Web Portal to Access and Visualize Satellite-Based Earth Science Data in the Classroom

    NASA Astrophysics Data System (ADS)

    Lloyd, S. A.; Acker, J. G.; Prados, A. I.; Leptoukh, G. G.

    2008-12-01

    One of the biggest obstacles for the average Earth science student today is locating and obtaining satellite-based remote sensing datasets in a format that is accessible and optimal for their data analysis needs. At the Goddard Earth Sciences Data and Information Services Center (GES-DISC) alone, on the order of hundreds of Terabytes of data are available for distribution to scientists, students and the general public. The single biggest and time-consuming hurdle for most students when they begin their study of the various datasets is how to slog through this mountain of data to arrive at a properly sub-setted and manageable dataset to answer their science question(s). The GES DISC provides a number of tools for data access and visualization, including the Google-like Mirador search engine and the powerful GES-DISC Interactive Online Visualization ANd aNalysis Infrastructure (Giovanni) web interface. Giovanni provides a simple way to visualize, analyze and access vast amounts of satellite-based Earth science data. Giovanni's features and practical examples of its use will be demonstrated, with an emphasis on how satellite remote sensing can help students understand recent events in the atmosphere and biosphere. Giovanni is actually a series of sixteen similar web-based data interfaces, each of which covers a single satellite dataset (such as TRMM, TOMS, OMI, AIRS, MLS, HALOE, etc.) or a group of related datasets (such as MODIS and MISR for aerosols, SeaWIFS and MODIS for ocean color, and the suite of A-Train observations co-located along the CloudSat orbital path). Recently, ground-based datasets have been included in Giovanni, including the Northern Eurasian Earth Science Partnership Initiative (NEESPI), and EPA fine particulate matter (PM2.5) for air quality. Model data such as the Goddard GOCART model and MERRA meteorological reanalyses (in process) are being increasingly incorporated into Giovanni to facilitate model-data intercomparison. A full suite of data analysis and visualization tools is also available within Giovanni. The GES DISC is currently developing a systematic series of training modules for Earth science satellite data, associated with our development of additional datasets and data visualization tools for Giovanni. Training sessions will include an overview of the Earth science datasets archived at Goddard, an overview of terms and techniques associated with satellite remote sensing, dataset-specific issues, an overview of Giovanni functionality, and a series of examples of how data can be readily accessed and visualized.

  17. Global Precipitation Measurement: Methods, Datasets and Applications

    NASA Technical Reports Server (NTRS)

    Tapiador, Francisco; Turk, Francis J.; Petersen, Walt; Hou, Arthur Y.; Garcia-Ortega, Eduardo; Machado, Luiz A. T.; Angelis, Carlos F.; Salio, Paola; Kidd, Chris; Huffman, George J.

    2011-01-01

    This paper reviews the many aspects of precipitation measurement that are relevant to providing an accurate global assessment of this important environmental parameter. Methods discussed include ground data, satellite estimates and numerical models. First, the methods for measuring, estimating, and modeling precipitation are discussed. Then, the most relevant datasets gathering precipitation information from those three sources are presented. The third part of the paper illustrates a number of the many applications of those measurements and databases. The aim of the paper is to organize the many links and feedbacks between precipitation measurement, estimation and modeling, indicating the uncertainties and limitations of each technique in order to identify areas requiring further attention, and to show the limits within which datasets can be used.

  18. Associating uncertainty with datasets using Linked Data and allowing propagation via provenance chains

    NASA Astrophysics Data System (ADS)

    Car, Nicholas; Cox, Simon; Fitch, Peter

    2015-04-01

    With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty of multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects. We present a method to address both these issues by combining UncertML with PROV-O, and delivering the resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, adds further elements to support uncertainty representation without a conceptual model, and enables the integration of UncertML through links to documents. The Linked Data API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again. The Linked Data API's 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follows the UncertML model it can be automatically interpreted and may also support automatic uncertainty propagation. Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements that are associated with uncertainty (PROV-O Entity and Activity classes) have UncertML elements recorded. This methodology is intentionally flexible to allow uncertainty metadata in many forms, not limited to UncertML. While the more formal representation of uncertainty metadata is desirable (using UncertProv elements to implement the UncertML conceptual model), this will not always be possible, and any uncertainty data stored will be better than none. Since the UncertProv ontology contains a superset of UncertML elements to facilitate the representation of non-UncertML uncertainty data, it could easily be extended to include other formal uncertainty conceptual models, thus allowing non-UncertML propagation calculations.
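
    A small sketch of the linking pattern described above, attaching an uncertainty description to a PROV-O Entity with rdflib; the "uncert" namespace and its property names are hypothetical stand-ins for UncertProv/UncertML terms, not the published vocabulary.

        from rdflib import BNode, Graph, Literal, Namespace
        from rdflib.namespace import PROV, RDF, XSD

        UNCERT = Namespace("http://example.org/uncertprov#")   # hypothetical namespace
        EX = Namespace("http://example.org/data/")

        g = Graph()
        g.bind("prov", PROV)
        g.bind("uncert", UNCERT)

        dataset = EX["streamflow-2015"]
        g.add((dataset, RDF.type, PROV.Entity))
        g.add((dataset, PROV.wasGeneratedBy, EX["gauging-activity"]))

        u = BNode()                                   # uncertainty description node
        g.add((dataset, UNCERT.hasUncertainty, u))
        g.add((u, RDF.type, UNCERT.NormalDistribution))
        g.add((u, UNCERT.standardDeviation, Literal(0.12, datatype=XSD.double)))

        print(g.serialize(format="turtle"))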

  19. New Inversion and Interpretation of Public-Domain Electromagnetic Survey Data from Selected Areas in Alaska

    NASA Astrophysics Data System (ADS)

    Smith, B. D.; Kass, A.; Saltus, R. W.; Minsley, B. J.; Deszcz-Pan, M.; Bloss, B. R.; Burns, L. E.

    2013-12-01

    Public-domain airborne geophysical surveys (combined electromagnetics and magnetics), mostly collected for and released by the State of Alaska, Division of Geological and Geophysical Surveys (DGGS), are a unique and valuable resource for both geologic interpretation and geophysical methods development. A new joint effort by the US Geological Survey (USGS) and the DGGS aims to add value to these data through the application of novel advanced inversion methods and through innovative and intuitive display of data: maps, profiles, voxel-based models, and displays of estimated inversion quality and confidence. Our goal is to make these data even more valuable for interpretation of geologic frameworks, geotechnical studies, and cryosphere studies, by producing robust estimates of subsurface resistivity that can be used by non-geophysicists. The datasets, all in the public domain, include 39 frequency-domain electromagnetic datasets collected since 1993, and continue to grow, with 5 more data releases pending in 2013. The majority of these datasets were flown for mineral resource purposes, with one survey designed for infrastructure analysis. In addition, several USGS datasets are included in this study. The USGS has recently developed new inversion methodologies for airborne EM data and has begun to apply these and other new techniques to the available datasets. These include a trans-dimensional Markov Chain Monte Carlo technique, laterally-constrained regularized inversions, and deterministic inversions that include calibration factors as a free parameter. Incorporation of the magnetic data as an additional constraining dataset has also improved the inversion results. Processing has been completed in several areas, including the Fortymile and Alaska Highway surveys, and continues in others such as the Styx River and Nome surveys. Utilizing these new techniques, we provide models beyond the apparent resistivity maps supplied by the original contractors, allowing us to produce a variety of products, such as maps of resistivity as a function of depth or elevation, cross section maps, and 3D voxel models, which have been treated consistently both in terms of processing and error analysis throughout the state. These products facilitate a more fruitful exchange between geologists and geophysicists and a better understanding of uncertainty, and the process results in iterative development and improvement of geologic models, both on small and large scales.

  20. Solar forcing for CMIP6 (v3.2)

    NASA Astrophysics Data System (ADS)

    Matthes, Katja; Funke, Bernd; Andersson, Monika E.; Barnard, Luke; Beer, Jürg; Charbonneau, Paul; Clilverd, Mark A.; Dudok de Wit, Thierry; Haberreiter, Margit; Hendry, Aaron; Jackman, Charles H.; Kretzschmar, Matthieu; Kruschke, Tim; Kunze, Markus; Langematz, Ulrike; Marsh, Daniel R.; Maycock, Amanda C.; Misios, Stergios; Rodger, Craig J.; Scaife, Adam A.; Seppälä, Annika; Shangguan, Ming; Sinnhuber, Miriam; Tourpali, Kleareti; Usoskin, Ilya; van de Kamp, Max; Verronen, Pekka T.; Versick, Stefan

    2017-06-01

    This paper describes the recommended solar forcing dataset for CMIP6 and highlights changes with respect to CMIP5. The solar forcing is provided for radiative properties, namely total solar irradiance (TSI), solar spectral irradiance (SSI), and the F10.7 index as well as particle forcing, including geomagnetic indices Ap and Kp, and ionization rates to account for effects of solar protons, electrons, and galactic cosmic rays. This is the first time that a recommendation for solar-driven particle forcing has been provided for a CMIP exercise. The solar forcing datasets are provided at daily and monthly resolution separately for the CMIP6 preindustrial control, historical (1850-2014), and future (2015-2300) simulations. For the preindustrial control simulation, both constant and time-varying solar forcing components are provided, with the latter including variability on 11-year and shorter timescales but no long-term changes. For the future, we provide a realistic scenario of what solar behavior could be, as well as an additional extreme Maunder-minimum-like sensitivity scenario. This paper describes the forcing datasets and also provides detailed recommendations as to their implementation in current climate models. For the historical simulations, the TSI and SSI time series are defined as the average of two solar irradiance models that are adapted to CMIP6 needs: an empirical one (NRLTSI2-NRLSSI2) and a semi-empirical one (SATIRE). A new and lower TSI value is recommended: the contemporary solar-cycle average is now 1361.0 W m-2. The slight negative trend in TSI over the three most recent solar cycles in the CMIP6 dataset leads to only a small global radiative forcing of -0.04 W m-2. In the 200-400 nm wavelength range, which is important for ozone photochemistry, the CMIP6 solar forcing dataset shows a larger solar-cycle variability contribution to TSI than in CMIP5 (50 % compared to 35 %). We compare the climatic effects of the CMIP6 solar forcing dataset to its CMIP5 predecessor by using time-slice experiments of two chemistry-climate models and a reference radiative transfer model. The differences in the long-term mean SSI in the CMIP6 dataset, compared to CMIP5, impact on climatological stratospheric conditions (lower shortwave heating rates of -0.35 K day-1 at the stratopause), cooler stratospheric temperatures (-1.5 K in the upper stratosphere), lower ozone abundances in the lower stratosphere (-3 %), and higher ozone abundances (+1.5 % in the upper stratosphere and lower mesosphere). Between the maximum and minimum phases of the 11-year solar cycle, there is an increase in shortwave heating rates (+0.2 K day-1 at the stratopause), temperatures (~1 K at the stratopause), and ozone (+2.5 % in the upper stratosphere) in the tropical upper stratosphere using the CMIP6 forcing dataset. This solar-cycle response is slightly larger, but not statistically significantly different from that for the CMIP5 forcing dataset. CMIP6 models with a well-resolved shortwave radiation scheme are encouraged to prescribe SSI changes and include solar-induced stratospheric ozone variations, in order to better represent solar climate variability compared to models that only prescribe TSI and/or exclude the solar-ozone response. We show that monthly-mean solar-induced ozone variations are implicitly included in the SPARC/CCMI CMIP6 Ozone Database for historical simulations, which is derived from transient chemistry-climate model simulations and has been developed for climate models that do not calculate ozone interactively. CMIP6 models without chemistry that perform a preindustrial control simulation with time-varying solar forcing will need to use a modified version of the SPARC/CCMI Ozone Database that includes solar variability. CMIP6 models with interactive chemistry are also encouraged to use the particle forcing datasets, which will allow the potential long-term effects of particles to be addressed for the first time. The consideration of particle forcing has been shown to significantly improve the representation of reactive nitrogen and ozone variability in the polar middle atmosphere, eventually resulting in further improvements in the representation of solar climate variability in global models.
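
    For orientation only, the quoted global radiative forcing can be related to a TSI change with the standard geometric conversion RF = dTSI x (1 - albedo) / 4; the dTSI value in the sketch below is illustrative and is not taken from the CMIP6 dataset.

        # Convert a change in total solar irradiance (TSI) to a global-mean radiative
        # forcing: RF = dTSI * (1 - albedo) / 4, where the factor 4 spreads the
        # intercepted beam over the sphere and the albedo removes reflected shortwave.

        PLANETARY_ALBEDO = 0.3      # typical value used for quick estimates
        delta_tsi = -0.23           # W m-2, illustrative trend only (not from the dataset)

        radiative_forcing = delta_tsi * (1.0 - PLANETARY_ALBEDO) / 4.0
        print(f"approximate global radiative forcing: {radiative_forcing:+.3f} W m-2")
        # -> roughly -0.04 W m-2, the order of magnitude quoted in the abstract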

  1. Plant traits, productivity, biomass and soil properties from forest sites in the Pacific Northwest, 1999-2014.

    PubMed

    Berner, Logan T; Law, Beverly E

    2016-01-19

    Plant trait measurements are needed for evaluating ecological responses to environmental conditions and for ecosystem process model development, parameterization, and testing. We present a standardized dataset integrating measurements from projects conducted by the Terrestrial Ecosystem Research and Regional Analysis- Pacific Northwest (TERRA-PNW) research group between 1999 and 2014 across Oregon and Northern California, where measurements were collected for scaling and modeling regional terrestrial carbon processes with models such as Biome-BGC and the Community Land Model. The dataset contains measurements of specific leaf area, leaf longevity, leaf carbon and nitrogen for 35 tree and shrub species derived from more than 1,200 branch samples collected from over 200 forest plots, including several AmeriFlux sites. The dataset also contains plot-level measurements of forest composition, structure (e.g., tree biomass), and productivity, as well as measurements of soil structure (e.g., bulk density) and chemistry (e.g., carbon). Publically-archiving regional datasets of standardized, co-located, and geo-referenced plant trait measurements will advance the ability of earth system models to capture species-level climate sensitivity at regional to global scales.

  2. Plant traits, productivity, biomass and soil properties from forest sites in the Pacific Northwest, 1999-2014

    NASA Astrophysics Data System (ADS)

    Berner, Logan T.; Law, Beverly E.

    2016-01-01

    Plant trait measurements are needed for evaluating ecological responses to environmental conditions and for ecosystem process model development, parameterization, and testing. We present a standardized dataset integrating measurements from projects conducted by the Terrestrial Ecosystem Research and Regional Analysis- Pacific Northwest (TERRA-PNW) research group between 1999 and 2014 across Oregon and Northern California, where measurements were collected for scaling and modeling regional terrestrial carbon processes with models such as Biome-BGC and the Community Land Model. The dataset contains measurements of specific leaf area, leaf longevity, leaf carbon and nitrogen for 35 tree and shrub species derived from more than 1,200 branch samples collected from over 200 forest plots, including several AmeriFlux sites. The dataset also contains plot-level measurements of forest composition, structure (e.g., tree biomass), and productivity, as well as measurements of soil structure (e.g., bulk density) and chemistry (e.g., carbon). Publically-archiving regional datasets of standardized, co-located, and geo-referenced plant trait measurements will advance the ability of earth system models to capture species-level climate sensitivity at regional to global scales.

  3. Evaluation, Calibration and Comparison of the Precipitation-Runoff Modeling System (PRMS) National Hydrologic Model (NHM) Using Moderate Resolution Imaging Spectroradiometer (MODIS) and Snow Data Assimilation System (SNODAS) Gridded Datasets

    NASA Astrophysics Data System (ADS)

    Norton, P. A., II; Haj, A. E., Jr.

    2014-12-01

    The United States Geological Survey is currently developing a National Hydrologic Model (NHM) to support and facilitate coordinated and consistent hydrologic modeling efforts at the scale of the continental United States. As part of this effort, the Geospatial Fabric (GF) for the NHM was created. The GF is a database that contains parameters derived from datasets that characterize the physical features of watersheds. The GF was used to aggregate catchments and flowlines defined in the National Hydrography Dataset Plus dataset for more than 100,000 hydrologic response units (HRUs), and to establish initial parameter values for input to the Precipitation-Runoff Modeling System (PRMS). Many parameter values are adjusted in PRMS using an automated calibration process. Using these adjusted parameter values, the PRMS model estimated variables such as evapotranspiration (ET), potential evapotranspiration (PET), snow-covered area (SCA), and snow water equivalent (SWE). In order to evaluate the effectiveness of parameter calibration, and model performance in general, several satellite-based Moderate Resolution Imaging Spectroradiometer (MODIS) and Snow Data Assimilation System (SNODAS) gridded datasets including ET, PET, SCA, and SWE were compared to PRMS-simulated values. The MODIS and SNODAS data were spatially averaged for each HRU, and compared to PRMS-simulated ET, PET, SCA, and SWE values for each HRU in the Upper Missouri River watershed. Default initial GF parameter values and PRMS calibration ranges were evaluated. Evaluation results, and the use of MODIS and SNODAS datasets to update GF parameter values and PRMS calibration ranges, are presented and discussed.
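
    The comparison step described above relies on spatially averaging gridded MODIS and SNODAS values over each HRU. A minimal numpy sketch of such a zonal mean, with made-up co-registered grids (the array names and values are hypothetical), might look like this:

        import numpy as np

        # Hypothetical co-registered grids: a MODIS-like field (e.g., ET, mm) and an
        # integer grid assigning each cell to a hydrologic response unit (HRU id).
        et_grid  = np.array([[1.2, 1.4, 2.0],
                             [0.9, 1.1, 2.2],
                             [0.8, 1.0, 2.4]])
        hru_grid = np.array([[1, 1, 2],
                             [1, 1, 2],
                             [3, 3, 2]])

        def zonal_mean(values, zones):
            """Mean of `values` within each zone id, ignoring NaN cells."""
            out = {}
            for zone in np.unique(zones):
                cells = values[zones == zone]
                out[int(zone)] = float(np.nanmean(cells))
            return out

        print(zonal_mean(et_grid, hru_grid))   # e.g. {1: 1.15, 2: 2.2, 3: 0.9}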

  4. Global relationships in river hydromorphology

    NASA Astrophysics Data System (ADS)

    Pavelsky, T.; Lion, C.; Allen, G. H.; Durand, M. T.; Schumann, G.; Beighley, E.; Yang, X.

    2017-12-01

    Since the widespread adoption of digital elevation models (DEMs) in the 1980s, most global and continental-scale analysis of river flow characteristics has been focused on measurements derived from DEMs such as drainage area, elevation, and slope. These variables (especially drainage area) have been related to other quantities of interest such as river width, depth, and velocity via empirical relationships that often take the form of power laws. More recently, a number of groups have developed more direct measurements of river location and some aspects of planform geometry from optical satellite imagery on regional, continental, and global scales. However, these satellite-derived datasets often lack many of the qualities that make DEM-derived datasets attractive, including robust network topology. Here, we present analysis of a dataset that combines the Global River Widths from Landsat (GRWL) database of river location, width, and braiding index with a river database extracted from the Shuttle Radar Topography Mission DEM and the HydroSHEDS dataset. Using these combined tools, we present a dataset that includes measurements of river width, slope, braiding index, upstream drainage area, and other variables. The dataset is available everywhere that both datasets are available, which includes all continental areas south of 60°N with rivers sufficiently large to be observed with Landsat imagery. We use the dataset to examine patterns and frequencies of river form across continental and global scales as well as global relationships among variables including width, slope, and drainage area. The results demonstrate the complex relationships among different dimensions of river hydromorphology at the global scale.
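
    The empirical power-law relationships mentioned above (e.g., width as a function of drainage area) are typically estimated by ordinary least squares in log-log space. A small, self-contained sketch on synthetic data follows; the coefficients are illustrative, not results from the GRWL/HydroSHEDS dataset.

        import numpy as np

        # Synthetic example only: fit a power law  width = a * area**b  by ordinary
        # least squares in log-log space, the usual way such empirical relations are
        # estimated from paired width / drainage-area observations.
        rng = np.random.default_rng(0)
        area_km2 = np.logspace(1, 5, 200)                        # drainage area
        width_m  = 2.5 * area_km2**0.45 * rng.lognormal(0, 0.2, area_km2.size)

        b, log_a = np.polyfit(np.log(area_km2), np.log(width_m), 1)
        print(f"width ~ {np.exp(log_a):.2f} * area^{b:.2f}")     # close to 2.5, 0.45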

  5. Construction and evaluation of FiND, a fall risk prediction model of inpatients from nursing data.

    PubMed

    Yokota, Shinichiroh; Ohe, Kazuhiko

    2016-04-01

    The aim was to construct and evaluate an easy-to-use fall risk prediction model, based on the daily condition of inpatients, from secondary use of electronic medical record system data. The present authors scrutinized electronic medical record system data and created a dataset for analysis by including inpatient fall report data and Intensity of Nursing Care Needs data. The authors divided the analysis dataset into training data and testing data, then constructed the fall risk prediction model FiND from the training data, and tested the model using the testing data. The dataset for analysis contained 1,230,604 records from 46,241 patients. The sensitivity of the model constructed from the training data was 71.3% and the specificity was 66.0%. The verification result on the testing dataset was almost equivalent to the theoretical value. Although the model's accuracy did not surpass that of models developed in previous research, the authors believe FiND will be useful in medical institutions all over Japan because it is composed of few variables (only age, sex, and the Intensity of Nursing Care Needs items) and its accuracy on previously unseen data was clear. © 2016 Japan Academy of Nursing Science.

  6. A tool for the estimation of the distribution of landslide area in R

    NASA Astrophysics Data System (ADS)

    Rossi, M.; Cardinali, M.; Fiorucci, F.; Marchesini, I.; Mondini, A. C.; Santangelo, M.; Ghosh, S.; Riguer, D. E. L.; Lahousse, T.; Chang, K. T.; Guzzetti, F.

    2012-04-01

    We have developed a tool in R (the free software environment for statistical computing, http://www.r-project.org/) to estimate the probability density and the frequency density of landslide area. The tool implements parametric and non-parametric approaches to the estimation of the probability density and the frequency density of landslide area, including: (i) Histogram Density Estimation (HDE), (ii) Kernel Density Estimation (KDE), and (iii) Maximum Likelihood Estimation (MLE). The tool is available as a standard Open Geospatial Consortium (OGC) Web Processing Service (WPS), and is accessible through the web using different GIS software clients. We tested the tool to compare Double Pareto and Inverse Gamma models for the probability density of landslide area in different geological, morphological and climatological settings, and to compare landslides shown in inventory maps prepared using different mapping techniques, including (i) field mapping, (ii) visual interpretation of monoscopic and stereoscopic aerial photographs, (iii) visual interpretation of monoscopic and stereoscopic VHR satellite images and (iv) semi-automatic detection and mapping from VHR satellite images. Results show that both models are applicable in different geomorphological settings. In most cases the two models provided very similar results. Non-parametric estimation methods (i.e., HDE and KDE) provided reasonable results for all the tested landslide datasets. For some of the datasets, MLE failed to provide a result, for convergence problems. The two tested models (Double Pareto and Inverse Gamma) resulted in very similar results for large and very large datasets (> 150 samples). Differences in the modeling results were observed for small datasets affected by systematic biases. A distinct rollover was observed in all analyzed landslide datasets, except for a few datasets obtained from landslide inventories prepared through field mapping or by semi-automatic mapping from VHR satellite imagery. The tool can also be used to evaluate the probability density and the frequency density of landslide volume.
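
    The tool itself is implemented in R; as a language-neutral illustration of the kernel density estimation step applied to landslide areas (synthetic values, Gaussian kernel, density computed on a log scale), a Python equivalent could look like the following.

        import numpy as np
        from scipy.stats import gaussian_kde

        # Illustrative Python analogue of the kernel density estimation (KDE) step:
        # estimate the probability density of landslide area on a log scale, where
        # the characteristic rollover of area distributions is easiest to see.
        rng = np.random.default_rng(1)
        areas_m2 = rng.lognormal(mean=7.0, sigma=1.2, size=500)   # synthetic inventory

        log_area = np.log10(areas_m2)
        kde = gaussian_kde(log_area)                 # Gaussian kernel, Scott bandwidth

        grid = np.linspace(log_area.min(), log_area.max(), 100)
        density = kde(grid)                          # density per unit log10(area)
        print("modal log10(area):", grid[np.argmax(density)])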

  7. A nonparametric statistical technique for combining global precipitation datasets: development and hydrological evaluation over the Iberian Peninsula

    NASA Astrophysics Data System (ADS)

    Abul Ehsan Bhuiyan, Md; Nikolopoulos, Efthymios I.; Anagnostou, Emmanouil N.; Quintana-Seguí, Pere; Barella-Ortiz, Anaïs

    2018-02-01

    This study investigates the use of a nonparametric, tree-based model, quantile regression forests (QRF), for combining multiple global precipitation datasets and characterizing the uncertainty of the combined product. We used the Iberian Peninsula as the study area, with a study period spanning 11 years (2000-2010). Inputs to the QRF model included three satellite precipitation products, CMORPH, PERSIANN, and 3B42 (V7); an atmospheric reanalysis precipitation and air temperature dataset; satellite-derived near-surface daily soil moisture data; and a terrain elevation dataset. We calibrated the QRF model for two seasons and two terrain elevation categories and used it to generate precipitation ensembles for these conditions. Evaluation of the combined product was based on a high-resolution, ground-reference precipitation dataset (SAFRAN) available at 5 km / 1 h resolution. Furthermore, to evaluate relative improvements and the overall impact of the combined product on hydrological response, we used the generated ensemble to force a distributed hydrological model (the SURFEX land surface model and the RAPID river routing scheme) and compared its streamflow simulation results with the corresponding simulations from the individual global precipitation and reference datasets. We concluded that the proposed technique could generate realizations that successfully encapsulate the reference precipitation and provide significant improvement in streamflow simulations, with reductions in systematic and random error on the order of 20-99 % and 44-88 %, respectively, when considering the ensemble mean.
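
    Quantile regression forests are not part of scikit-learn, so the sketch below uses gradient-boosting quantile regression as a stand-in to illustrate the general idea of predicting conditional quantiles of a reference precipitation value from several input datasets; the predictors and data are synthetic and do not correspond to the products used in the paper.

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        # Stand-in for quantile regression forests (QRF): scikit-learn has no QRF, so
        # gradient boosting with a quantile loss is used to illustrate predicting
        # conditional quantiles of "reference" precipitation from several inputs
        # (e.g. satellite products, reanalysis, soil moisture, elevation). Synthetic data.
        rng = np.random.default_rng(0)
        n = 1000
        X = rng.random((n, 4))                     # hypothetical predictor datasets
        y = 10 * X[:, 0] + 5 * X[:, 1] + rng.gamma(2.0, 1.0, n)   # "reference" rain

        quantile_models = {
            q: GradientBoostingRegressor(loss="quantile", alpha=q, n_estimators=200).fit(X, y)
            for q in (0.1, 0.5, 0.9)
        }
        x_new = rng.random((1, 4))
        ensemble = {q: float(m.predict(x_new)[0]) for q, m in quantile_models.items()}
        print(ensemble)       # lower, median and upper precipitation estimates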

  8. A Benchmark Dataset and Saliency-guided Stacked Autoencoders for Video-based Salient Object Detection.

    PubMed

    Li, Jia; Xia, Changqun; Chen, Xiaowu

    2017-10-12

    Image-based salient object detection (SOD) has been extensively studied in past decades. However, video-based SOD is much less explored due to the lack of large-scale video datasets within which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos. In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in a video can be defined as objects that consistently pop out throughout the video, and objects with such attributes can be unambiguously annotated by combining manually annotated object/region masks with eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD by using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at the pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner that automatically infers a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. In experiments, the proposed unsupervised approach is compared with 31 state-of-the-art models on the proposed dataset and outperforms 30 of them, including 19 image-based classic (unsupervised or non-deep learning) models, six image-based deep learning models, and five video-based unsupervised models. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.
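
    As a greatly simplified illustration of unsupervised autoencoding of saliency-cue vectors (a single MLP autoencoder rather than the paper's saliency-guided stacked architecture, with random placeholder data and a crude score rule that is not the paper's inference procedure):

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        # Greatly simplified stand-in for saliency-guided stacked autoencoders:
        # a single autoencoder (an MLP trained to reproduce its input) that
        # compresses per-pixel saliency-cue vectors through a low-dimensional code.
        rng = np.random.default_rng(0)
        cues = rng.random((2000, 12))             # placeholder spatiotemporal saliency cues

        autoencoder = MLPRegressor(hidden_layer_sizes=(32, 4, 32),   # 4-D bottleneck
                                   activation="relu", max_iter=500, random_state=0)
        autoencoder.fit(cues, cues)               # reconstruction target = input

        # Crude per-pixel score from reconstruction error, for illustration only.
        reconstruction = autoencoder.predict(cues)
        recon_error = np.mean((cues - reconstruction) ** 2, axis=1)
        saliency_score = (recon_error - recon_error.min()) / np.ptp(recon_error)
        print("example saliency scores:", saliency_score[:5])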

  9. Downscaled climate projections for the Southeast United States: evaluation and use for ecological applications

    USGS Publications Warehouse

    Wootten, Adrienne; Smith, Kara; Boyles, Ryan; Terando, Adam; Stefanova, Lydia; Misra, Vasru; Smith, Tom; Blodgett, David L.; Semazzi, Fredrick

    2014-01-01

    Climate change is likely to have many effects on natural ecosystems in the Southeast U.S. The National Climate Assessment Southeast Technical Report (SETR) indicates that natural ecosystems in the Southeast are likely to be affected by warming temperatures, ocean acidification, sea-level rise, and changes in rainfall and evapotranspiration. To better assess how these climate changes could affect multiple sectors, including ecosystems, climatologists have created several downscaled climate projections (or downscaled datasets) that contain information from the global climate models (GCMs) translated to regional or local scales. The process of creating these downscaled datasets, known as downscaling, can be carried out using a broad range of statistical or numerical modeling techniques. The rapid proliferation of techniques that can be used for downscaling and the number of downscaled datasets produced in recent years present many challenges for scientists and decision-makers in assessing the impact or vulnerability of a given species or ecosystem to climate change. Given the number of available downscaled datasets, how do these model outputs compare to each other? Which variables are available, and are certain downscaled datasets more appropriate for assessing vulnerability of a particular species? Given the desire to use these datasets for impact and vulnerability assessments and the lack of comparison between these datasets, the goal of this report is to synthesize the information available in these downscaled datasets and provide guidance to scientists and natural resource managers with specific interests in ecological modeling and conservation planning related to climate change in the Southeast U.S. This report enables the Southeast Climate Science Center (SECSC) to address an important strategic goal of providing scientific information and guidance that will enable resource managers and other participants in Landscape Conservation Cooperatives to make science-based climate change adaptation decisions.

  10. A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis

    PubMed Central

    Lee, Kyubum; Kim, Byounggun; Jeon, Minji; Kim, Jihye; Tan, Aik Choon

    2018-01-01

    Background With the development of artificial intelligence (AI) technology centered on deep-learning, the computer has evolved to a point where it can read a given text and answer a question based on the context of the text. Such a specific task is known as the task of machine comprehension. Existing machine comprehension tasks mostly use datasets of general texts, such as news articles or elementary school-level storybooks. However, no attempt has been made to determine whether an up-to-date deep learning-based machine comprehension model can also process scientific literature containing expert-level knowledge, especially in the biomedical domain. Objective This study aims to investigate whether a machine comprehension model can process biomedical articles as well as general texts. Since there is no dataset for the biomedical literature comprehension task, our work includes generating a large-scale question answering dataset using PubMed and manually evaluating the generated dataset. Methods We present an attention-based deep neural model tailored to the biomedical domain. To further enhance the performance of our model, we used a pretrained word vector and biomedical entity type embedding. We also developed an ensemble method of combining the results of several independent models to reduce the variance of the answers from the models. Results The experimental results showed that our proposed deep neural network model outperformed the baseline model by more than 7% on the new dataset. We also evaluated human performance on the new dataset. The human evaluation result showed that our deep neural model outperformed humans in comprehension by 22% on average. Conclusions In this work, we introduced a new task of machine comprehension in the biomedical domain using a deep neural model. Since there was no large-scale dataset for training deep neural models in the biomedical domain, we created the new cloze-style datasets Biomedical Knowledge Comprehension Title (BMKC_T) and Biomedical Knowledge Comprehension Last Sentence (BMKC_LS) (together referred to as BioMedical Knowledge Comprehension) using the PubMed corpus. The experimental results showed that the performance of our model is much higher than that of humans. We observed that our model performed consistently better regardless of the degree of difficulty of a text, whereas humans have difficulty when performing biomedical literature comprehension tasks that require expert level knowledge. PMID:29305341

  11. FURTHER ANALYSIS OF SUBTYPES OF AUTOMATICALLY REINFORCED SIB: A REPLICATION AND QUANTITATIVE ANALYSIS OF PUBLISHED DATASETS

    PubMed Central

    Hagopian, Louis P.; Rooker, Griffin W.; Zarcone, Jennifer R.; Bonner, Andrew C.; Arevalo, Alexander R.

    2017-01-01

    Hagopian, Rooker, and Zarcone (2015) evaluated a model for subtyping automatically reinforced self-injurious behavior (SIB) based on its sensitivity to changes in functional analysis conditions and the presence of self-restraint. The current study tested the generality of the model by applying it to all datasets of automatically reinforced SIB published from 1982 to 2015. We identified 49 datasets that included sufficient data to permit subtyping. Similar to the original study, Subtype-1 SIB was generally amenable to treatment using reinforcement alone, whereas Subtype-2 SIB was not. Conclusions could not be drawn about Subtype-3 SIB due to the small number of datasets. Nevertheless, the findings support the generality of the model and suggest that sensitivity of SIB to disruption by alternative reinforcement is an important dimension of automatically reinforced SIB. Findings also suggest that automatically reinforced SIB should no longer be considered a single category and that additional research is needed to better understand and treat Subtype-2 SIB. PMID:28032344

  12. Housing land transaction data and structural econometric estimation of preference parameters for urban economic simulation models

    PubMed Central

    Caruso, Geoffrey; Cavailhès, Jean; Peeters, Dominique; Thomas, Isabelle; Frankhauser, Pierre; Vuidel, Gilles

    2015-01-01

    This paper describes a dataset of 6284 land transaction prices and plot surfaces in 3 medium-sized cities in France (Besançon, Dijon and Brest). The dataset includes road accessibility as obtained from a minimization algorithm, and the amount of green space available to households in the neighborhood of the transactions, as evaluated from a land cover dataset. Further to the data presentation, the paper describes how these variables can be used to estimate the non-observable parameters of a residential choice function explicitly derived from a microeconomic model. The estimates are used by Caruso et al. (2015) to run a calibrated microeconomic urban growth simulation model where households are assumed to trade off accessibility and local green space amenities. PMID:26958606

  13. Connectome-based predictive modeling of attention: Comparing different functional connectivity features and prediction methods across datasets.

    PubMed

    Yoo, Kwangsun; Rosenberg, Monica D; Hsu, Wei-Ting; Zhang, Sheng; Li, Chiang-Shan R; Scheinost, Dustin; Constable, R Todd; Chun, Marvin M

    2018-02-15

    Connectome-based predictive modeling (CPM; Finn et al., 2015; Shen et al., 2017) was recently developed to predict individual differences in traits and behaviors, including fluid intelligence (Finn et al., 2015) and sustained attention (Rosenberg et al., 2016a), from functional brain connectivity (FC) measured with fMRI. Here, using the CPM framework, we compared the predictive power of three different measures of FC (Pearson's correlation, accordance, and discordance) and two different prediction algorithms (linear and partial least square [PLS] regression) for attention function. Accordance and discordance are recently proposed FC measures that respectively track in-phase synchronization and out-of-phase anti-correlation (Meskaldji et al., 2015). We defined connectome-based models using task-based or resting-state FC data, and tested the effects of (1) functional connectivity measure and (2) feature-selection/prediction algorithm on individualized attention predictions. Models were internally validated in a training dataset using leave-one-subject-out cross-validation, and externally validated with three independent datasets. The training dataset included fMRI data collected while participants performed a sustained attention task and rested (N = 25; Rosenberg et al., 2016a). The validation datasets included: 1) data collected during performance of a stop-signal task and at rest (N = 83, including 19 participants who were administered methylphenidate prior to scanning; Farr et al., 2014a; Rosenberg et al., 2016b), 2) data collected during Attention Network Task performance and rest (N = 41, Rosenberg et al., in press), and 3) resting-state data and ADHD symptom severity from the ADHD-200 Consortium (N = 113; Rosenberg et al., 2016a). Models defined using all combinations of functional connectivity measure (Pearson's correlation, accordance, and discordance) and prediction algorithm (linear and PLS regression) predicted attentional abilities, with correlations between predicted and observed measures of attention as high as 0.9 for internal validation, and 0.6 for external validation (all p's < 0.05). Models trained on task data outperformed models trained on rest data. Pearson's correlation and accordance features generally showed a small numerical advantage over discordance features, while PLS regression models were usually better than linear regression models. Overall, in addition to correlation features combined with linear models (Rosenberg et al., 2016a), it is useful to consider accordance features and PLS regression for CPM. Copyright © 2017 Elsevier Inc. All rights reserved.
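
    A stripped-down sketch of the leave-one-subject-out loop used in CPM-style prediction (correlation-based feature selection inside each fold, then a linear model), with synthetic connectivity data and an illustrative p < 0.01 selection threshold:

        import numpy as np
        from scipy.stats import pearsonr
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import LeaveOneOut

        # Simplified leave-one-subject-out CPM-style prediction: within each training
        # fold, keep connectivity features correlated with behaviour (p < 0.01), fit a
        # linear model on them, and predict the held-out subject. Synthetic data only.
        rng = np.random.default_rng(0)
        n_subj, n_edges = 25, 300
        fc = rng.standard_normal((n_subj, n_edges))            # functional connectivity
        attention = 3 * fc[:, 0] + 2 * fc[:, 1] + rng.standard_normal(n_subj)

        predicted = np.empty(n_subj)
        for train, test in LeaveOneOut().split(fc):
            pvals = np.array([pearsonr(fc[train, j], attention[train])[1]
                              for j in range(n_edges)])
            selected = pvals < 0.01
            if not selected.any():
                selected[np.argsort(pvals)[:5]] = True         # fallback: keep 5 best edges
            model = LinearRegression().fit(fc[train][:, selected], attention[train])
            predicted[test] = model.predict(fc[test][:, selected])

        r, p = pearsonr(predicted, attention)
        print(f"LOO prediction r = {r:.2f} (p = {p:.3f})")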

  14. A reanalysis dataset of the South China Sea.

    PubMed

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992-2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability.

  15. A reanalysis dataset of the South China Sea

    PubMed Central

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992–2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability. PMID:25977803

  16. A Large-Scale, High-Resolution Hydrological Model Parameter Data Set for Climate Change Impact Assessment for the Conterminous US

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Oubeidillah, Abdoul A; Kao, Shih-Chieh; Ashfaq, Moetasim

    2014-01-01

    To extend geographical coverage, refine spatial resolution, and improve modeling efficiency, a computation- and data-intensive effort was conducted to organize a comprehensive hydrologic dataset with post-calibrated model parameters for hydro-climate impact assessment. Several key inputs for hydrologic simulation including meteorologic forcings, soil, land class, vegetation, and elevation were collected from multiple best-available data sources and organized for 2107 hydrologic subbasins (8-digit hydrologic units, HUC8s) in the conterminous United States at a refined 1/24-degree (~4 km) spatial resolution. Using high-performance computing for intensive model calibration, a high-resolution parameter dataset was prepared for the macro-scale Variable Infiltration Capacity (VIC) hydrologic model. The VIC simulation was driven by DAYMET daily meteorological forcing and was calibrated against USGS WaterWatch monthly runoff observations for each HUC8. The results showed that this new parameter dataset may help reasonably simulate runoff at most US HUC8 subbasins. Based on this exhaustive calibration effort, it is now possible to accurately estimate the resources required for further model improvement across the entire conterminous United States. We anticipate that through this hydrologic parameter dataset, the repeated effort of fundamental data processing can be lessened, so that research efforts can emphasize the more challenging task of assessing climate change impacts. The pre-organized model parameter dataset will be provided to interested parties to support further hydro-climate impact assessment.

  17. Tree-based approach for exploring marine spatial patterns with raster datasets.

    PubMed

    Liao, Xiaohan; Xue, Cunjin; Su, Fenzhen

    2017-01-01

    Mining spatial association patterns from multiple raster datasets involves three subtasks: raster dataset pretreatment, mining algorithm design, and exploration of spatial patterns from the mining results. While the first two subtasks have received considerable attention, the third remains largely unresolved. Confronted with interrelated marine environmental parameters, we propose a Tree-based Approach for eXploring Marine Spatial Patterns with multiple raster datasets, called TAXMarSP, which includes two models. One is the Tree-based Cascading Organization Model (TCOM), and the other is the Spatial Neighborhood-based CAlculation Model (SNCAM). TCOM designs the "Spatial node→Pattern node" structure from top to bottom layers to store the table-formatted frequent patterns. Together with TCOM, SNCAM considers the spatial neighborhood contributions to calculate the pattern-matching degree between the specified marine parameters and the table-formatted frequent patterns and then explores the marine spatial patterns. Using the prevalent quantification Apriori algorithm and a real remote sensing dataset from January 1998 to December 2014, a successful application of TAXMarSP to marine spatial patterns in the Pacific Ocean is described; the obtained marine spatial patterns include not only well-known patterns but also new ones of interest to Earth scientists.
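
    The Apriori step boils down to counting the support of co-occurring discretized parameter states. A minimal, library-free illustration with made-up marine parameter labels (not the actual discretization used in TAXMarSP):

        from itertools import combinations
        from collections import Counter

        # Each "transaction" is the set of discretized marine parameter states observed
        # in one grid cell / month. Parameter labels and threshold levels are made up.
        transactions = [
            {"SST=high", "CHL=low", "SSH=high"},
            {"SST=high", "CHL=low"},
            {"SST=low",  "CHL=high", "SSH=low"},
            {"SST=high", "CHL=low", "SSH=high"},
        ]

        min_support = 0.5
        counts = Counter()
        for t in transactions:
            for k in (1, 2):                               # 1- and 2-item patterns
                counts.update(frozenset(c) for c in combinations(sorted(t), k))

        n = len(transactions)
        frequent = {tuple(sorted(p)): c / n
                    for p, c in counts.items() if c / n >= min_support}
        print(frequent)    # e.g. ('SST=high',): 0.75, ('CHL=low', 'SST=high'): 0.75, ...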

  18. Improving stability of prediction models based on correlated omics data by using network approaches.

    PubMed

    Tissier, Renaud; Houwing-Duistermaat, Jeanine; Rodríguez-Girondo, Mar

    2018-01-01

    Building prediction models based on complex omics datasets such as transcriptomics, proteomics, metabolomics remains a challenge in bioinformatics and biostatistics. Regularized regression techniques are typically used to deal with the high dimensionality of these datasets. However, due to the presence of correlation in the datasets, it is difficult to select the best model and application of these methods yields unstable results. We propose a novel strategy for model selection where the obtained models also perform well in terms of overall predictability. Several three step approaches are considered, where the steps are 1) network construction, 2) clustering to empirically derive modules or pathways, and 3) building a prediction model incorporating the information on the modules. For the first step, we use weighted correlation networks and Gaussian graphical modelling. Identification of groups of features is performed by hierarchical clustering. The grouping information is included in the prediction model by using group-based variable selection or group-specific penalization. We compare the performance of our new approaches with standard regularized regression via simulations. Based on these results we provide recommendations for selecting a strategy for building a prediction model given the specific goal of the analysis and the sizes of the datasets. Finally we illustrate the advantages of our approach by application of the methodology to two problems, namely prediction of body mass index in the DIetary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome study (DILGOM) and prediction of response of each breast cancer cell line to treatment with specific drugs using a breast cancer cell lines pharmacogenomics dataset.
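
    A compressed, illustrative version of the three-step idea on synthetic data: build a correlation-based feature network, cut a hierarchical clustering into modules, and feed module summaries to a penalized regression. Ridge regression on cluster means is used here as a simple stand-in; true group-based selection or group-specific penalties would require a dedicated package.

        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster
        from sklearn.linear_model import RidgeCV

        # (1) correlation "network", (2) hierarchical clustering into modules,
        # (3) prediction model that uses the module structure via module means.
        rng = np.random.default_rng(0)
        n, p = 120, 60
        X = rng.standard_normal((n, p))
        X[:, :20] += rng.standard_normal((n, 1))          # one correlated block
        y = X[:, :20].mean(axis=1) + 0.5 * rng.standard_normal(n)

        corr = np.corrcoef(X, rowvar=False)
        dist = 1.0 - np.abs(corr)                         # correlation-based distance
        Z = linkage(dist[np.triu_indices(p, k=1)], method="average")
        modules = fcluster(Z, t=5, criterion="maxclust")  # cut tree into 5 modules

        module_means = np.column_stack(
            [X[:, modules == m].mean(axis=1) for m in np.unique(modules)]
        )
        model = RidgeCV().fit(module_means, y)
        print("R^2 on module summaries:", round(model.score(module_means, y), 2))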

  19. LOD 1 VS. LOD 2 - Preliminary Investigations Into Differences in Mobile Rendering Performance

    NASA Astrophysics Data System (ADS)

    Ellul, C.; Altenbuchner, J.

    2013-09-01

    The increasing availability, size and detail of 3D City Model datasets has led to a challenge when rendering such data on mobile devices. Understanding the limitations to the usability of such models on these devices is particularly important given the broadening range of applications - such as pollution or noise modelling, tourism, planning, solar potential - for which these datasets and resulting visualisations can be utilized. Much 3D City Model data is created by extrusion of 2D topographic datasets, resulting in what is known as Level of Detail (LoD) 1 buildings - with flat roofs. However, in the UK the National Mapping Agency (the Ordnance Survey, OS) is now releasing test datasets to Level of Detail (LoD) 2 - i.e. including roof structures. These datasets are designed to integrate with the LoD 1 datasets provided by the OS, and provide additional detail in particular on larger buildings and in town centres. The availability of such integrated datasets at two different Levels of Detail permits investigation into the impact of the additional roof structures (and hence the display of a more realistic 3D City Model) on rendering performance on a mobile device. This paper describes preliminary work carried out to investigate this issue, for the test area of the city of Sheffield (in the UK Midlands). The data is stored in a 3D spatial database as triangles and then extracted and served as a web-based data stream which is queried by an App developed on the mobile device (using the Android environment, Java and OpenGL for graphics). Initial tests have been carried out on two dataset sizes, for the city centre and a larger area, rendering the data onto a tablet to compare results. Results of 52 seconds for rendering LoD 1 data, and 72 seconds for LoD 1 mixed with LoD 2 data, show that the impact of LoD 2 is significant.

  20. So many genes, so little time: A practical approach to divergence-time estimation in the genomic era

    PubMed Central

    2018-01-01

    Phylogenomic datasets have been successfully used to address questions involving evolutionary relationships, patterns of genome structure, signatures of selection, and gene and genome duplications. However, despite the recent explosion in genomic and transcriptomic data, the utility of these data sources for efficient divergence-time inference remains unexamined. Phylogenomic datasets pose two distinct problems for divergence-time estimation: (i) the volume of data makes inference of the entire dataset intractable, and (ii) the extent of underlying topological and rate heterogeneity across genes makes model mis-specification a real concern. “Gene shopping”, wherein a phylogenomic dataset is winnowed to a set of genes with desirable properties, represents an alternative approach that holds promise in alleviating these issues. We implemented an approach for phylogenomic datasets (available in SortaDate) that filters genes by three criteria: (i) clock-likeness, (ii) reasonable tree length (i.e., discernible information content), and (iii) least topological conflict with a focal species tree (presumed to have already been inferred). Such a winnowing procedure ensures that errors associated with model (both clock and topology) mis-specification are minimized, therefore reducing error in divergence-time estimation. We demonstrated the efficacy of this approach through simulation and applied it to published animal (Aves, Diplopoda, and Hymenoptera) and plant (carnivorous Caryophyllales, broad Caryophyllales, and Vitales) phylogenomic datasets. By quantifying rate heterogeneity across both genes and lineages we found that every empirical dataset examined included genes with clock-like, or nearly clock-like, behavior. Moreover, many datasets had genes that were clock-like, exhibited reasonable evolutionary rates, and were mostly compatible with the species tree. We identified overlap in age estimates when analyzing these filtered genes under strict clock and uncorrelated lognormal (UCLN) models. However, this overlap was often due to imprecise estimates from the UCLN model. We find that “gene shopping” can be an efficient approach to divergence-time inference for phylogenomic datasets that may otherwise be characterized by extensive gene tree heterogeneity. PMID:29772020

  1. So many genes, so little time: A practical approach to divergence-time estimation in the genomic era.

    PubMed

    Smith, Stephen A; Brown, Joseph W; Walker, Joseph F

    2018-01-01

    Phylogenomic datasets have been successfully used to address questions involving evolutionary relationships, patterns of genome structure, signatures of selection, and gene and genome duplications. However, despite the recent explosion in genomic and transcriptomic data, the utility of these data sources for efficient divergence-time inference remains unexamined. Phylogenomic datasets pose two distinct problems for divergence-time estimation: (i) the volume of data makes inference of the entire dataset intractable, and (ii) the extent of underlying topological and rate heterogeneity across genes makes model mis-specification a real concern. "Gene shopping", wherein a phylogenomic dataset is winnowed to a set of genes with desirable properties, represents an alternative approach that holds promise in alleviating these issues. We implemented an approach for phylogenomic datasets (available in SortaDate) that filters genes by three criteria: (i) clock-likeness, (ii) reasonable tree length (i.e., discernible information content), and (iii) least topological conflict with a focal species tree (presumed to have already been inferred). Such a winnowing procedure ensures that errors associated with model (both clock and topology) mis-specification are minimized, therefore reducing error in divergence-time estimation. We demonstrated the efficacy of this approach through simulation and applied it to published animal (Aves, Diplopoda, and Hymenoptera) and plant (carnivorous Caryophyllales, broad Caryophyllales, and Vitales) phylogenomic datasets. By quantifying rate heterogeneity across both genes and lineages we found that every empirical dataset examined included genes with clock-like, or nearly clock-like, behavior. Moreover, many datasets had genes that were clock-like, exhibited reasonable evolutionary rates, and were mostly compatible with the species tree. We identified overlap in age estimates when analyzing these filtered genes under strict clock and uncorrelated lognormal (UCLN) models. However, this overlap was often due to imprecise estimates from the UCLN model. We find that "gene shopping" can be an efficient approach to divergence-time inference for phylogenomic datasets that may otherwise be characterized by extensive gene tree heterogeneity.
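
    Conceptually, the "gene shopping" step is a filter-and-rank over a per-gene summary table. The sketch below uses hypothetical column names (not SortaDate's actual output format) to show that winnowing logic:

        import pandas as pd

        # Hypothetical per-gene summary table; column names are illustrative only.
        genes = pd.DataFrame({
            "gene":                 ["g1", "g2", "g3", "g4", "g5"],
            "root_tip_variance":    [0.002, 0.030, 0.004, 0.001, 0.050],  # clock-likeness
            "tree_length":          [1.8,   0.1,   2.2,   1.5,   3.0],    # information content
            "bipartition_conflict": [2,     9,     1,     0,     12],     # vs. species tree
        })

        shortlist = (
            genes[genes["tree_length"] > 0.5]              # drop near-empty gene trees
            .sort_values(["bipartition_conflict", "root_tip_variance"])
            .head(3)                                       # keep the best few genes
        )
        print(shortlist["gene"].tolist())                  # e.g. ['g4', 'g3', 'g1']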

  2. Plant traits, productivity, biomass and soil properties from forest sites in the Pacific Northwest, 1999–2014

    PubMed Central

    Berner, Logan T.; Law, Beverly E.

    2016-01-01

    Plant trait measurements are needed for evaluating ecological responses to environmental conditions and for ecosystem process model development, parameterization, and testing. We present a standardized dataset integrating measurements from projects conducted by the Terrestrial Ecosystem Research and Regional Analysis- Pacific Northwest (TERRA-PNW) research group between 1999 and 2014 across Oregon and Northern California, where measurements were collected for scaling and modeling regional terrestrial carbon processes with models such as Biome-BGC and the Community Land Model. The dataset contains measurements of specific leaf area, leaf longevity, leaf carbon and nitrogen for 35 tree and shrub species derived from more than 1,200 branch samples collected from over 200 forest plots, including several AmeriFlux sites. The dataset also contains plot-level measurements of forest composition, structure (e.g., tree biomass), and productivity, as well as measurements of soil structure (e.g., bulk density) and chemistry (e.g., carbon). Publically-archiving regional datasets of standardized, co-located, and geo-referenced plant trait measurements will advance the ability of earth system models to capture species-level climate sensitivity at regional to global scales. PMID:26784559

  3. Plant traits, productivity, biomass and soil properties from forest sites in the Pacific Northwest, 1999–2014

    DOE PAGES

    Berner, Logan T.; Law, Beverly E.

    2016-01-19

    Plant trait measurements are needed for evaluating ecological responses to environmental conditions and for ecosystem process model development, parameterization, and testing. Here, we present a standardized dataset integrating measurements from projects conducted by the Terrestrial Ecosystem Research and Regional Analysis- Pacific Northwest (TERRA-PNW) research group between 1999 and 2014 across Oregon and Northern California, where measurements were collected for scaling and modeling regional terrestrial carbon processes with models such as Biome-BGC and the Community Land Model. The dataset contains measurements of specific leaf area, leaf longevity, leaf carbon and nitrogen for 35 tree and shrub species derived from more than 1,200 branch samples collected from over 200 forest plots, including several AmeriFlux sites. The dataset also contains plot-level measurements of forest composition, structure (e.g., tree biomass), and productivity, as well as measurements of soil structure (e.g., bulk density) and chemistry (e.g., carbon). Publically-archiving regional datasets of standardized, co-located, and geo-referenced plant trait measurements will advance the ability of earth system models to capture species-level climate sensitivity at regional to global scales.

  4. Decision tree methods: applications for classification and prediction.

    PubMed

    Song, Yan-Yan; Lu, Ying

    2015-04-25

    Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently deal with large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets: the training dataset is used to build a decision tree model, and the validation dataset is used to decide on the appropriate tree size needed to achieve the optimal final model. This paper introduces frequently used algorithms for developing decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.
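
    The paper works with SPSS and SAS; as a hedged scikit-learn analogue of the train/validation workflow it describes (grow the tree on training data, choose the tree size on validation data via cost-complexity pruning):

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        # Grow a tree on the training split, then use the validation split to choose
        # the tree size via cost-complexity pruning (the ccp_alpha parameter).
        X, y = make_classification(n_samples=600, n_features=10, random_state=0)
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

        path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
        best_alpha, best_acc = 0.0, 0.0
        for alpha in path.ccp_alphas:
            tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
            acc = tree.score(X_val, y_val)
            if acc >= best_acc:
                best_alpha, best_acc = alpha, acc

        final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
        print(f"chosen ccp_alpha={best_alpha:.4f}, validation accuracy={best_acc:.3f}, "
              f"leaves={final_tree.get_n_leaves()}")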

  5. X-ray computed tomography library of shark anatomy and lower jaw surface models.

    PubMed

    Kamminga, Pepijn; De Bruin, Paul W; Geleijns, Jacob; Brazeau, Martin D

    2017-04-11

    The cranial diversity of sharks reflects disparate biomechanical adaptations to feeding. In order to be able to investigate and better understand the ecomorphology of extant shark feeding systems, we created an x-ray computed tomography (CT) library of shark cranial anatomy with three-dimensional (3D) lower jaw reconstructions. This is used to examine and quantify lower jaw disparity in extant shark species in a separate study. The library is divided into two datasets. The first comprises medical CT scans of 122 sharks (Selachimorpha, Chondrichthyes) representing 73 extant species, including digitized morphology of entire shark specimens. This CT dataset and additional data provided by other researchers were used to reconstruct a second dataset containing 3D models of the left lower jaw for 153 individuals representing 94 extant shark species. These datasets form an extensive anatomical record of shark skeletal anatomy, necessary for comparative morphological, biomechanical, ecological and phylogenetic studies.

  6. Development of Web tools to predict axillary lymph node metastasis and pathological response to neoadjuvant chemotherapy in breast cancer patients.

    PubMed

    Sugimoto, Masahiro; Takada, Masahiro; Toi, Masakazu

    2014-12-09

    Nomograms are a standard computational tool to predict the likelihood of an outcome using multiple available patient features. We have developed a more powerful data mining methodology to predict axillary lymph node (AxLN) metastasis and response to neoadjuvant chemotherapy (NAC) in primary breast cancer patients. We developed websites to make these tools available. The tools calculate the probability of AxLN metastasis (AxLN model) and pathological complete response to NAC (NAC model). As a calculation algorithm, we employed a decision tree-based prediction model known as the alternating decision tree (ADTree), which is an analog development of if-then type decision trees. An ensemble technique was used to combine multiple ADTree predictions, resulting in higher generalization abilities and robustness against missing values. The AxLN model was developed with training datasets (n=148) and test datasets (n=143), and validated using an independent cohort (n=174), yielding an area under the receiver operating characteristic curve (AUC) of 0.768. The NAC model was developed and validated with n=150 and n=173 datasets from a randomized controlled trial, yielding an AUC of 0.787. AxLN and NAC models require users to input up to 17 and 16 variables, respectively. These include pathological features, such as human epidermal growth factor receptor 2 (HER2) status, and imaging findings. Each input variable has an option of "unknown," to facilitate prediction for cases with missing values. The websites developed facilitate the use of these tools, and serve as a database for accumulating new datasets.

  7. Isotherm ranking and selection using thirteen literature datasets involving hydrophobic organic compounds.

    PubMed

    Matott, L Shawn; Jiang, Zhengzheng; Rabideau, Alan J; Allen-King, Richelle M

    2015-01-01

    Numerous isotherm expressions have been developed for describing sorption of hydrophobic organic compounds (HOCs), including "dual-mode" approaches that combine nonlinear behavior with a linear partitioning component. Choosing among these alternative expressions for describing a given dataset is an important task that can significantly influence subsequent transport modeling and/or mechanistic interpretation. In this study, a series of numerical experiments were undertaken to identify "best-in-class" isotherms by refitting 10 alternative models to a suite of 13 previously published literature datasets. The corrected Akaike Information Criterion (AICc) was used for ranking these alternative fits and distinguishing between plausible and implausible isotherms for each dataset. The occurrence of multiple plausible isotherms was inversely correlated with dataset "richness", such that datasets with fewer observations and/or a narrow range of aqueous concentrations resulted in a greater number of plausible isotherms. Overall, only the Polanyi-partition dual-mode isotherm was classified as "plausible" across all 13 of the considered datasets, indicating substantial statistical support consistent with current advances in sorption theory. However, these findings are predicated on the use of the AICc measure as an unbiased ranking metric and the adoption of a subjective, but defensible, threshold for separating plausible and implausible isotherms. Copyright © 2015 Elsevier B.V. All rights reserved.
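
    For orientation, the AICc ranking used above can be reproduced for least-squares fits with AIC ≈ n ln(RSS/n) + 2k plus the small-sample correction 2k(k+1)/(n−k−1). The residual sums of squares and parameter counts below are invented for illustration and are not the paper's values:

        import numpy as np

        def aicc(rss, n_obs, k_params):
            """Corrected Akaike Information Criterion for a least-squares fit."""
            aic = n_obs * np.log(rss / n_obs) + 2 * k_params
            return aic + (2 * k_params * (k_params + 1)) / (n_obs - k_params - 1)

        # Illustrative residual sums of squares and parameter counts for candidate
        # isotherms fitted to one dataset (values are made up, not from the paper).
        n = 24
        candidates = {"Freundlich": (3.1, 2), "Langmuir": (2.9, 2),
                      "Polanyi-partition": (2.2, 4), "dual Langmuir": (2.1, 5)}

        scores = {name: aicc(rss, n, k) for name, (rss, k) in candidates.items()}
        best = min(scores, key=scores.get)
        delta = {name: s - scores[best] for name, s in scores.items()}
        print("AICc differences vs best model:", delta)   # small delta -> plausible fit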

  8. A comparison of different ways of including baseline counts in negative binomial models for data from falls prevention trials.

    PubMed

    Zheng, Han; Kimber, Alan; Goodwin, Victoria A; Pickering, Ruth M

    2018-01-01

    A common design for a falls prevention trial is to assess falling at baseline, randomize participants into an intervention or control group, and ask them to record the number of falls they experience during a follow-up period of time. This paper addresses how best to include the baseline count in the analysis of the follow-up count of falls in negative binomial (NB) regression. We examine the performance of various approaches in simulated datasets where both counts are generated from a mixed Poisson distribution with shared random subject effect. Including the baseline count after log-transformation as a regressor in NB regression (NB-logged) or as an offset (NB-offset) resulted in greater power than including the untransformed baseline count (NB-unlogged). Cook and Wei's conditional negative binomial (CNB) model replicates the underlying process generating the data. In our motivating dataset, a statistically significant intervention effect resulted from the NB-logged, NB-offset, and CNB models, but not from NB-unlogged, and large, outlying baseline counts were overly influential in NB-unlogged but not in NB-logged. We conclude that there is little to lose by including the log-transformed baseline count in standard NB regression compared to CNB for moderate to larger sized datasets. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
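
    A minimal sketch of the compared specifications, assuming simulated falls counts with a shared subject effect: the baseline count enters a statsmodels negative binomial GLM untransformed, log-transformed, or as an offset. This illustrates the model variants, not the paper's analysis code.

```python
# Minimal sketch (hypothetical data): the three standard NB variants compared
# above: untransformed baseline, log-transformed baseline, and offset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
subject = rng.gamma(shape=2.0, scale=0.5, size=n)           # shared subject frailty
group = rng.integers(0, 2, size=n)                           # 0 = control, 1 = intervention
baseline = rng.poisson(2.0 * subject)
followup = rng.poisson(2.0 * subject * np.exp(-0.3 * group))

log_base = np.log(baseline + 0.5)                            # +0.5 to handle zero counts

def fit_nb(exog_cols, offset=None):
    X = sm.add_constant(np.column_stack(exog_cols))
    model = sm.GLM(followup, X, family=sm.families.NegativeBinomial(), offset=offset)
    return model.fit()

nb_unlogged = fit_nb([group, baseline])
nb_logged = fit_nb([group, log_base])
nb_offset = fit_nb([group], offset=log_base)

for name, res in [("NB-unlogged", nb_unlogged), ("NB-logged", nb_logged),
                  ("NB-offset", nb_offset)]:
    print(f"{name:12s} intervention rate ratio = {np.exp(res.params[1]):.2f}")
```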

  9. Incorporating High-Frequency Physiologic Data Using Computational Dictionary Learning Improves Prediction of Delayed Cerebral Ischemia Compared to Existing Methods.

    PubMed

    Megjhani, Murad; Terilli, Kalijah; Frey, Hans-Peter; Velazquez, Angela G; Doyle, Kevin William; Connolly, Edward Sander; Roh, David Jinou; Agarwal, Sachin; Claassen, Jan; Elhadad, Noemie; Park, Soojin

    2018-01-01

    Accurate prediction of delayed cerebral ischemia (DCI) after subarachnoid hemorrhage (SAH) can be critical for planning interventions to prevent poor neurological outcome. This paper presents a model using convolution dictionary learning to extract features from physiological data available from bedside monitors. We develop and validate a prediction model for DCI after SAH, demonstrating improved precision over standard methods alone. 488 consecutive SAH admissions from 2006 to 2014 to a tertiary care hospital were included. Models were trained on 80%, while 20% were set aside for validation testing. Modified Fisher Scale was considered the standard grading scale in clinical use; baseline features also analyzed included age, sex, Hunt-Hess, and Glasgow Coma Scales. An unsupervised approach using convolution dictionary learning was used to extract features from physiological time series (systolic blood pressure and diastolic blood pressure, heart rate, respiratory rate, and oxygen saturation). Classifiers (partial least squares and linear and kernel support vector machines) were trained on feature subsets of the derivation dataset. Models were applied to the validation dataset. The performances of the best classifiers on the validation dataset are reported by feature subset. Standard grading scale (mFS): AUC 0.54. Combined demographics and grading scales (baseline features): AUC 0.63. Kernel derived physiologic features: AUC 0.66. Combined baseline and physiologic features with redundant feature reduction: AUC 0.71 on derivation dataset and 0.78 on validation dataset. Current DCI prediction tools rely on admission imaging and are advantageously simple to employ. However, using an agnostic and computationally inexpensive learning approach for high-frequency physiologic time series data, we demonstrated that we could incorporate individual physiologic data to achieve higher classification accuracy.
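
    The feature-extraction-plus-classifier pipeline can be sketched as follows. scikit-learn does not ship a convolutional dictionary learner, so this stand-in uses MiniBatchDictionaryLearning over short signal windows with simulated physiologic series and outcomes; it illustrates the workflow, not the authors' implementation.

```python
# Rough sketch (not the authors' pipeline): learn a dictionary over short
# physiologic windows, summarize each patient by sparse-code features,
# then score a kernel SVM by AUC. All data below are simulated.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_patients, series_len, win = 200, 600, 30
signals = rng.normal(size=(n_patients, series_len))     # e.g. heart-rate segments
labels = rng.integers(0, 2, size=n_patients)            # hypothetical DCI outcome

# Slice each series into non-overlapping windows and pool them for dictionary learning.
windows = signals.reshape(n_patients, -1, win)
dico = MiniBatchDictionaryLearning(n_components=16, alpha=1.0, random_state=0)
dico.fit(windows.reshape(-1, win))

# Patient-level features: mean absolute sparse code across that patient's windows.
codes = dico.transform(windows.reshape(-1, win)).reshape(n_patients, -1, 16)
features = np.abs(codes).mean(axis=1)

clf = SVC(kernel="rbf", probability=True).fit(features[:160], labels[:160])
auc = roc_auc_score(labels[160:], clf.predict_proba(features[160:])[:, 1])
print(f"validation AUC: {auc:.3f}")
```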

  10. Developing Verification Systems for Building Information Models of Heritage Buildings with Heterogeneous Datasets

    NASA Astrophysics Data System (ADS)

    Chow, L.; Fai, S.

    2017-08-01

    The digitization and abstraction of existing buildings into building information models requires the translation of heterogeneous datasets that may include CAD, technical reports, historic texts, archival drawings, terrestrial laser scanning, and photogrammetry into model elements. In this paper, we discuss a project undertaken by the Carleton Immersive Media Studio (CIMS) that explored the synthesis of heterogeneous datasets for the development of a building information model (BIM) for one of Canada's most significant heritage assets - the Centre Block of the Parliament Hill National Historic Site. The scope of the project included the development of an as-found model of the century-old, six-story building in anticipation of specific model uses for an extensive rehabilitation program. The as-found Centre Block model was developed in Revit using primarily point cloud data from terrestrial laser scanning. The data was captured by CIMS in partnership with Heritage Conservation Services (HCS), Public Services and Procurement Canada (PSPC), using a Leica C10 and P40 (exterior and large interior spaces) and a Faro Focus (small to mid-sized interior spaces). Secondary sources such as archival drawings, photographs, and technical reports were referenced in cases where point cloud data was not available. As a result of working with heterogeneous data sets, a verification system was introduced in order to communicate to model users/viewers the source of information for each building element within the model.

  11. Applicability of AgMERRA Forcing Dataset to Fill Gaps in Historical in-situ Meteorological Data

    NASA Astrophysics Data System (ADS)

    Bannayan, M.; Lashkari, A.; Zare, H.; Asadi, S.; Salehnia, N.

    2015-12-01

    Integrated assessment studies of food production systems use crop models to simulate the effects of climate and socio-economic changes on food security. Climate forcing data are one of the key inputs of crop models. This study evaluated the performance of the AgMERRA climate forcing dataset for filling gaps in historical in-situ meteorological data for different climatic regions of Iran. The AgMERRA dataset was compared with in-situ observations of daily maximum and minimum temperature and precipitation over the 1980-2010 period using the root mean square error (RMSE), mean absolute error (MAE) and mean bias error (MBE) at 17 stations in four climatic regions: humid and moderate, cold, dry and arid, and hot and humid. Moreover, probability distribution functions and cumulative distribution functions were compared between the model and observed data. The agreement measures demonstrated small errors in the model data for all stations. Except at stations located in cold regions, the model data under-predicted daily maximum temperature and precipitation, although the bias was not significant. In addition, the probability distribution functions and cumulative distribution functions showed the same trend for all stations between model and observed data. Therefore, the AgMERRA dataset is reliable for filling gaps in historical observations across the different climatic regions of Iran, and it could also be applied as a basis for future climate scenarios.
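
    The agreement measures used in this comparison are simple to compute; a minimal sketch with hypothetical station values follows (the convention here is model minus observation, so a negative MBE indicates under-prediction).

```python
# Minimal sketch of the agreement measures used above (hypothetical station data).
import numpy as np

def rmse(model, obs):
    return np.sqrt(np.mean((model - obs) ** 2))

def mae(model, obs):
    return np.mean(np.abs(model - obs))

def mbe(model, obs):
    # negative values indicate under-prediction by the model
    return np.mean(model - obs)

obs_tmax = np.array([31.2, 33.5, 29.8, 35.1, 30.4])      # observed daily Tmax (deg C)
mod_tmax = np.array([30.8, 33.9, 29.1, 34.6, 30.0])      # AgMERRA-style model values

print(f"RMSE = {rmse(mod_tmax, obs_tmax):.2f}  "
      f"MAE = {mae(mod_tmax, obs_tmax):.2f}  "
      f"MBE = {mbe(mod_tmax, obs_tmax):.2f}")
```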

  12. Scaling up: What coupled land-atmosphere models can tell us about critical zone processes

    NASA Astrophysics Data System (ADS)

    FitzGerald, K. A.; Masarik, M. T.; Rudisill, W. J.; Gelb, L.; Flores, A. N.

    2017-12-01

    A significant limitation to extending our knowledge of critical zone (CZ) evolution and function is a lack of hydrometeorological information at sufficiently fine spatial and temporal resolutions to resolve topo-climatic gradients and with adequate spatial and temporal extent to capture a range of climatic conditions across ecoregions. Research at critical zone observatories (CZOs) suggests hydrometeorological stores and fluxes exert key controls on processes such as hydrologic partitioning and runoff generation, landscape evolution, soil formation, biogeochemical cycling, and vegetation dynamics. However, advancing fundamental understanding of CZ processes necessitates understanding how hydrometeorological drivers vary across space and time. As a result of recent advances in computational capabilities, it has become possible, although still computationally expensive, to simulate hydrometeorological conditions via high resolution coupled land-atmosphere models. Using the Weather Research and Forecasting (WRF) model, we developed a high spatiotemporal resolution dataset extending from water year 1987 to present for the Snake River Basin in the northwestern USA including the Reynolds Creek and Dry Creek Experimental Watersheds, both part of the Reynolds Creek CZO, as well as a range of other ecosystems including shrubland desert, montane forests, and alpine tundra. Drawing from hypotheses generated by work at these sites and across the CZO network, we use the resulting dataset in combination with CZO observations and publicly available datasets to provide insights regarding hydrologic partitioning, vegetation distribution, and erosional processes. This dataset provides key context in interpreting and reconciling what observations obtained at particular sites reveal about underlying CZ structure and function. While this dataset does not extend to future climates, the same modeling framework can be used to dynamically downscale coarse global climate model output to scales relevant to CZ processes. This presents an opportunity to better characterize the impact of climate change on the CZ. We also argue that opportunities exist beyond the one-way flow of information and that what we learn at CZOs has the potential to contribute significantly to improved Earth system models.

  13. A semiparametric graphical modelling approach for large-scale equity selection.

    PubMed

    Liu, Han; Mulvey, John; Zhao, Tianqi

    2016-01-01

    We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using the regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large data-sets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock data-set and a large 34-year stock data-set. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption.

  14. The HyMeX database

    NASA Astrophysics Data System (ADS)

    Brissebrat, Guillaume; Mastrorillo, Laurence; Ramage, Karim; Boichard, Jean-Luc; Cloché, Sophie; Fleury, Laurence; Klenov, Ludmila; Labatut, Laurent; Mière, Arnaud

    2013-04-01

    The international HyMeX (HYdrological cycle in the Mediterranean EXperiment) project aims at a better understanding and quantification of the hydrological cycle and related processes in the Mediterranean, with emphasis on high-impact weather events, inter-annual to decadal variability of the Mediterranean coupled system, and associated trends in the context of global change. The project includes long-term monitoring of environmental parameters, intensive field campaigns, use of satellite data, modelling studies, as well as post-event field surveys and value-added products processing. Therefore, the HyMeX database incorporates various dataset types from different disciplines, either operational or research. The database relies on a strong collaboration between the OMP and IPSL data centres. Field data, which are 1D time series, maps or pictures, are managed by the OMP team while gridded data (satellite products, model outputs, radar data...) are managed by the IPSL team. At present, the HyMeX database contains about 150 datasets, including 80 hydrological, meteorological, ocean and soil in situ datasets, 30 radar datasets, 15 satellite products, 15 atmosphere, ocean and land surface model outputs from operational (re-)analysis or forecasts and from research simulations, and 5 post-event survey datasets. The data catalogue complies with international standards (ISO 19115; INSPIRE; Directory Interchange Format; Global Change Master Directory Thesaurus). It includes all the datasets stored in the HyMeX database, as well as external datasets relevant for the project. All the data, whatever their type, are accessible through a single gateway. The database website http://mistrals.sedoo.fr/HyMeX offers different tools: - A registration procedure which enables any scientist to accept the data policy and apply for a user database account. - A search tool to browse the catalogue using thematic, geographic and/or temporal criteria. - Sorted lists of the datasets by thematic keywords, by measured parameters, by instruments or by platform type. - Forms to document observations or products that will be provided to the database. - A shopping-cart web interface to order in situ data files. - FTP facilities to access gridded data. The website will soon propose new facilities. Many in situ datasets have already been homogenized and inserted in a relational database, in order to enable more accurate data selection and the download of different datasets in a shared format. Interoperability between the two data centres will be enhanced by the OpenDAP communication protocol associated with the Thredds catalogue software, which may also be implemented in other data centres that manage data of interest for the HyMeX project. In order to meet the operational needs of the HyMeX 2012 campaigns, a day-to-day quick-look and report display website has also been developed: http://sop.hymex.org. It offers a convenient way to browse meteorological conditions and data during the campaign periods.

  15. Computational Environment for Modeling and Analysing Network Traffic Behaviour Using the Divide and Recombine Framework

    ERIC Educational Resources Information Center

    Barthur, Ashrith

    2016-01-01

    There are two essential goals of this research. The first goal is to design and construct a computational environment that is used for studying large and complex datasets in the cybersecurity domain. The second goal is to analyse the Spamhaus blacklist query dataset which includes uncovering the properties of blacklisted hosts and understanding…

  16. An ISA-TAB-Nano based data collection framework to support data-driven modelling of nanotoxicology.

    PubMed

    Marchese Robinson, Richard L; Cronin, Mark T D; Richarz, Andrea-Nicole; Rallo, Robert

    2015-01-01

    Analysis of trends in nanotoxicology data and the development of data-driven models for nanotoxicity are facilitated by the reporting of data using a standardised electronic format. ISA-TAB-Nano has been proposed as such a format. However, in order to build useful datasets according to this format, a variety of issues have to be addressed. These issues include questions regarding exactly which (meta)data to report and how to report them. The current article discusses some of the challenges associated with the use of ISA-TAB-Nano and presents a set of resources designed to facilitate the manual creation of ISA-TAB-Nano datasets from the nanotoxicology literature. These resources were developed within the context of the NanoPUZZLES EU project and include data collection templates, corresponding business rules that extend the generic ISA-TAB-Nano specification, as well as Python code to facilitate parsing and integration of these datasets within other nanoinformatics resources. The use of these resources is illustrated by a "Toy Dataset" presented in the Supporting Information. The strengths and weaknesses of the resources are discussed along with possible future developments.
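
    The abstract mentions Python code for parsing ISA-TAB-Nano files; that code is not reproduced here, but a generic sketch of reading a tab-delimited ISA-TAB-style study file with pandas is shown below. The file name and column headers are hypothetical placeholders, not the NanoPUZZLES templates.

```python
# Generic sketch (not the NanoPUZZLES code): read a tab-delimited ISA-TAB-style
# study file and pull out a few columns of interest.
import pandas as pd

study = pd.read_csv("s_nanotox_study.txt", sep="\t", dtype=str)   # hypothetical file name

# ISA-TAB study files typically pair descriptive columns with term-source/accession
# columns; here we keep only the sample names and one hypothetical characteristic.
columns_of_interest = ["Sample Name", "Characteristics[material]"]
available = [c for c in columns_of_interest if c in study.columns]
subset = study[available].dropna(how="all")

print(subset.head())
```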

  17. Towards systematic evaluation of crop model outputs for global land-use models

    NASA Astrophysics Data System (ADS)

    Leclere, David; Azevedo, Ligia B.; Skalský, Rastislav; Balkovič, Juraj; Havlík, Petr

    2016-04-01

    Land provides vital socioeconomic resources to society, however at the cost of large environmental degradation. Global integrated models combining high resolution global gridded crop models (GGCMs) and global economic models (GEMs) are increasingly being used to inform sustainable solutions for agricultural land use. However, little effort has yet been made to evaluate and compare the accuracy of GGCM outputs. In addition, GGCM datasets require a large number of parameters whose values and variability across space are weakly constrained: increasing the accuracy of such datasets has a very high computing cost. Innovative evaluation methods are required both to lend credibility to the global integrated models and to allow efficient parameter specification of GGCMs. We propose an evaluation strategy for GGCM datasets in the perspective of use in GEMs, illustrated with preliminary results from a novel dataset (the Hypercube) generated by the EPIC GGCM and used in the GLOBIOM land use GEM to inform on present-day crop yield, water and nutrient input needs for 16 crops x 15 management intensities, at a spatial resolution of 5 arc-minutes. We adopt the following principle: evaluation should provide a transparent diagnosis of model adequacy for its intended use. We briefly describe how the Hypercube data is generated and how it articulates with GLOBIOM in order to transparently identify the performances to be evaluated, as well as the main assumptions and data processing involved. Expected performances include adequately representing the sub-national heterogeneity in crop yield and input needs: i) in space, ii) across crop species, and iii) across management intensities. We will present and discuss measures of these expected performances and weigh the relative contributions of the crop model, input data and data processing steps to these performances. We will also compare obtained yield gaps and main yield-limiting factors against the M3 dataset. Next steps include iterative improvement of parameter assumptions and evaluation of implications of GGCM performances for intended use in the IIASA EPIC-GLOBIOM model cluster. Our approach helps target future efforts at improving GGCM accuracy and would achieve the highest efficiency if combined with traditional field-scale evaluation and sensitivity analysis.

  18. Utah FORGE Site Location, Datasets, and Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joe Moore

    This submission includes the geographic extent shapefile of the Milford FORGE site located in Utah, along with a shapefile of seismometer positions throughout the area, and models of basin depth and potentiometric contours.

  19. Study on homogenization of synthetic GNSS-retrieved IWV time series and its impact on trend estimates with autoregressive noise

    NASA Astrophysics Data System (ADS)

    Klos, Anna; Pottiaux, Eric; Van Malderen, Roeland; Bock, Olivier; Bogusz, Janusz

    2017-04-01

    A synthetic benchmark dataset of Integrated Water Vapour (IWV) was created within the "Data homogenisation" activity of sub-working group WG3 of the COST ES1206 Action. The benchmark dataset was created based on the analysis of IWV differences between retrievals at Global Positioning System (GPS) International GNSS Service (IGS) stations and European Centre for Medium-Range Weather Forecasts (ECMWF) reanalysis data (ERA-Interim). Having analysed a set of 120 series of IWV differences (ERAI-GPS) derived for IGS stations, we characterised the gaps and breaks for each station. Moreover, we estimated the values of trends, the significant seasonal signals and the character of the residuals once the deterministic model was removed. We tested five different noise models and found that a combination of white noise and a first-order autoregressive process describes the stochastic part with good accuracy. Based on this analysis, we performed Monte Carlo simulations of 25-year-long series with two different types of noise: white noise alone, and a combination of white and autoregressive noise. We also added a few strictly defined offsets, creating three variants of the synthetic dataset: easy, less-complicated and fully-complicated. The 'Easy' dataset included seasonal signals (annual, semi-annual, 3 and 4 months if present for a particular station), offsets and white noise. The 'Less-complicated' dataset additionally included the combination of white and first-order autoregressive noise (AR(1)+WH). The 'Fully-complicated' dataset included, beyond the above, a trend and gaps. In this research, we show the impact of manual homogenisation on the estimates of the trend and its error. We also cross-compare the results for the three above-mentioned datasets, as the synthesised noise type might have a significant influence on manual homogenisation and might therefore affect the trend values and their uncertainties when handled inappropriately. In the future, the synthetic dataset we present will be used as a benchmark to test various statistical tools for the homogenisation task.
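
    A minimal sketch of how one such synthetic series can be generated (illustrative amplitudes and break epochs only, not the benchmark's actual parameters): seasonal terms plus strictly defined offsets plus AR(1) and white noise.

```python
# Minimal sketch of a 'less-complicated'-style synthetic series:
# seasonal signals + offsets + AR(1) plus white noise. All values are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n_days = 25 * 365
t = np.arange(n_days) / 365.25                       # time in years

seasonal = (1.5 * np.sin(2 * np.pi * t) +            # annual signal
            0.5 * np.sin(4 * np.pi * t))             # semi-annual signal

offsets = np.zeros(n_days)
offsets[n_days // 3:] += 0.8                         # strictly defined breaks
offsets[2 * n_days // 3:] -= 0.5

phi, sigma_ar, sigma_wh = 0.6, 0.3, 0.5
ar1 = np.zeros(n_days)
for i in range(1, n_days):                           # first-order autoregressive process
    ar1[i] = phi * ar1[i - 1] + rng.normal(scale=sigma_ar)

series = seasonal + offsets + ar1 + rng.normal(scale=sigma_wh, size=n_days)
print(series[:5])
```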

  20. Fast randomization of large genomic datasets while preserving alteration counts.

    PubMed

    Gobbi, Andrea; Iorio, Francesco; Dawson, Kevin J; Wedge, David C; Tamborero, David; Alexandrov, Ludmil B; Lopez-Bigas, Nuria; Garnett, Mathew J; Jurman, Giuseppe; Saez-Rodriguez, Julio

    2014-09-01

    Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a 'mutually exclusive' manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive. We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in the time requirement with respect to existing implementations/bounds, with equivalent P-value computations. Thus, we propose BiRewire to study statistical properties in genomic datasets, and other data that can be modeled as bipartite networks. BiRewire is available on BioConductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
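
    BiRewire itself is an R/Bioconductor package; as an illustration of the underlying switching-step idea only, the plain-Python sketch below applies degree-preserving swaps to a small simulated binary patient-by-gene matrix and checks that patient- and gene-wise alteration counts are preserved.

```python
# Plain-Python sketch of the "switching step" (not the BiRewire implementation):
# pick two alterations (a,b) and (c,d) and swap them to (a,d) and (c,b) when this
# preserves the 0/1 structure, leaving all row and column sums unchanged.
import numpy as np

def switching_step(m, rng):
    rows, cols = np.nonzero(m)
    i, j = rng.choice(len(rows), size=2, replace=False)
    a, b = rows[i], cols[i]
    c, d = rows[j], cols[j]
    # the swap is valid only if the target cells are currently empty
    if a != c and b != d and m[a, d] == 0 and m[c, b] == 0:
        m[a, b] = m[c, d] = 0
        m[a, d] = m[c, b] = 1
    return m

rng = np.random.default_rng(4)
genomic = (rng.random((20, 50)) < 0.1).astype(int)     # hypothetical patient-by-gene matrix
before = (genomic.sum(axis=0).copy(), genomic.sum(axis=1).copy())
for _ in range(10 * genomic.sum()):                    # many switching steps
    genomic = switching_step(genomic, rng)
after = (genomic.sum(axis=0), genomic.sum(axis=1))
assert all(np.array_equal(x, y) for x, y in zip(before, after))
print("patient- and gene-wise alteration counts preserved")
```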

  1. A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis.

    PubMed

    Kim, Seongsoon; Park, Donghyeon; Choi, Yonghwa; Lee, Kyubum; Kim, Byounggun; Jeon, Minji; Kim, Jihye; Tan, Aik Choon; Kang, Jaewoo

    2018-01-05

    With the development of artificial intelligence (AI) technology centered on deep-learning, the computer has evolved to a point where it can read a given text and answer a question based on the context of the text. Such a specific task is known as the task of machine comprehension. Existing machine comprehension tasks mostly use datasets of general texts, such as news articles or elementary school-level storybooks. However, no attempt has been made to determine whether an up-to-date deep learning-based machine comprehension model can also process scientific literature containing expert-level knowledge, especially in the biomedical domain. This study aims to investigate whether a machine comprehension model can process biomedical articles as well as general texts. Since there is no dataset for the biomedical literature comprehension task, our work includes generating a large-scale question answering dataset using PubMed and manually evaluating the generated dataset. We present an attention-based deep neural model tailored to the biomedical domain. To further enhance the performance of our model, we used a pretrained word vector and biomedical entity type embedding. We also developed an ensemble method of combining the results of several independent models to reduce the variance of the answers from the models. The experimental results showed that our proposed deep neural network model outperformed the baseline model by more than 7% on the new dataset. We also evaluated human performance on the new dataset. The human evaluation result showed that our deep neural model outperformed humans in comprehension by 22% on average. In this work, we introduced a new task of machine comprehension in the biomedical domain using a deep neural model. Since there was no large-scale dataset for training deep neural models in the biomedical domain, we created the new cloze-style datasets Biomedical Knowledge Comprehension Title (BMKC_T) and Biomedical Knowledge Comprehension Last Sentence (BMKC_LS) (together referred to as BioMedical Knowledge Comprehension) using the PubMed corpus. The experimental results showed that the performance of our model is much higher than that of humans. We observed that our model performed consistently better regardless of the degree of difficulty of a text, whereas humans have difficulty when performing biomedical literature comprehension tasks that require expert level knowledge. ©Seongsoon Kim, Donghyeon Park, Yonghwa Choi, Kyubum Lee, Byounggun Kim, Minji Jeon, Jihye Kim, Aik Choon Tan, Jaewoo Kang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 05.01.2018.

  2. Downscaling global precipitation for local applications - a case for the Rhine basin

    NASA Astrophysics Data System (ADS)

    Sperna Weiland, Frederiek; van Verseveld, Willem; Schellekens, Jaap

    2017-04-01

    Within the EU FP7 project eartH2Observe a global Water Resources Re-analysis (WRR) is being developed. This re-analysis consists of meteorological and hydrological water balance variables with global coverage, spanning the period 1979-2014 at 0.25 degrees resolution (Schellekens et al., 2016). The dataset can be of special interest in regions with limited in-situ data availability, yet for local-scale analysis, particularly in mountainous regions, a resolution of 0.25 degrees may be too coarse and downscaling the data to a higher resolution may be required. A downscaling toolbox has been developed that includes spatial downscaling of precipitation based on the global WorldClim dataset, which is available at 1 km resolution as a monthly climatology (Hijmans et al., 2005). The inputs of the downscaling tool are either the global eartH2Observe WRR1 and WRR2 datasets based on the WFDEI correction methodology (Weedon et al., 2014) or the global Multi-Source Weighted-Ensemble Precipitation (MSWEP) dataset (Beck et al., 2016). Here we present a validation of the datasets over the Rhine catchment by means of a distributed hydrological model (wflow, Schellekens et al., 2014) using a number of precipitation scenarios. (1) We start by running the model using the local reference dataset derived by spatial interpolation of gauge observations. Furthermore we use (2) the MSWEP dataset at the native 0.25-degree resolution, followed by (3) MSWEP downscaled with the WorldClim dataset and finally (4) MSWEP downscaled with the local reference dataset. The validation will be based on comparison of the modeled river discharges as well as rainfall statistics. We expect that downscaling the MSWEP dataset with the WorldClim data to higher resolution will increase its performance. To test the performance of the downscaling routine we have added a run with MSWEP data downscaled with the local dataset and compare this with the run based on the local dataset itself. - Beck, H. E. et al., 2016. MSWEP: 3-hourly 0.25° global gridded precipitation (1979-2015) by merging gauge, satellite, and reanalysis data, Hydrol. Earth Syst. Sci. Discuss., doi:10.5194/hess-2016-236, accepted for final publication. - Hijmans, R.J. et al., 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978. - Schellekens, J. et al., 2016. A global water resources ensemble of hydrological models: the eartH2Observe Tier-1 dataset, Earth Syst. Sci. Data Discuss., doi:10.5194/essd-2016-55, under review. - Schellekens, J. et al., 2014. Rapid setup of hydrological and hydraulic models using OpenStreetMap and the SRTM derived digital elevation model. Environmental Modelling & Software. - Weedon, G.P. et al., 2014. The WFDEI meteorological forcing data set: WATCH Forcing Data methodology applied to ERA-Interim reanalysis data. Water Resources Research, 50, doi:10.1002/2014WR015638.
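
    The climatology-based downscaling idea can be illustrated with a toy sketch (not the eartH2Observe toolbox): a coarse-cell precipitation value is redistributed over finer cells in proportion to a high-resolution monthly climatology, so the coarse-cell mean is preserved.

```python
# Illustrative sketch of climatology-based precipitation downscaling
# (toy numbers, not the WRR or WorldClim data).
import numpy as np

coarse_precip = 12.0                               # mm/day for one 0.25-degree cell
fine_climatology = np.array([[80.0, 120.0],        # WorldClim-style monthly normals
                             [100.0, 140.0]])      # on the finer grid (mm)

weights = fine_climatology / fine_climatology.mean()
fine_precip = coarse_precip * weights              # redistribute, preserving the cell mean

print(fine_precip)
print("coarse value preserved:", np.isclose(fine_precip.mean(), coarse_precip))
```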

  3. Updated Value of Service Reliability Estimates for Electric Utility Customers in the United States

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sullivan, Michael; Schellenberg, Josh; Blundell, Marshall

    2015-01-01

    This report updates the 2009 meta-analysis that provides estimates of the value of service reliability for electricity customers in the United States (U.S.). The meta-dataset now includes 34 different datasets from surveys fielded by 10 different utility companies between 1989 and 2012. Because these studies used nearly identical interruption cost estimation or willingness-to-pay/accept methods, it was possible to integrate their results into a single meta-dataset describing the value of electric service reliability observed in all of them. Once the datasets from the various studies were combined, a two-part regression model was used to estimate customer damage functions that can be generally applied to calculate customer interruption costs per event by season, time of day, day of week, and geographical regions within the U.S. for industrial, commercial, and residential customers. This report focuses on the backwards stepwise selection process that was used to develop the final revised model for all customer classes. Across customer classes, the revised customer interruption cost model has improved significantly because it incorporates more data and does not include the many extraneous variables that were in the original specification from the 2009 meta-analysis. The backwards stepwise selection process led to a more parsimonious model that only included key variables, while still achieving comparable out-of-sample predictive performance. In turn, users of interruption cost estimation tools such as the Interruption Cost Estimate (ICE) Calculator will have less customer characteristics information to provide and the associated inputs page will be far less cumbersome. The upcoming new version of the ICE Calculator is anticipated to be released in 2015.

  4. Potential for using regional and global datasets for national scale ecosystem service modelling

    NASA Astrophysics Data System (ADS)

    Maxwell, Deborah; Jackson, Bethanna

    2016-04-01

    Ecosystem service models are increasingly being used by planners and policy makers to inform policy development and decisions about national-level resource management. Such models allow ecosystem services to be mapped and quantified, and subsequent changes to these services to be identified and monitored. In some cases, the impact of small scale changes can be modelled at a national scale, providing more detailed information to decision makers about where to best focus investment and management interventions that could address these issues, while moving toward national goals and/or targets. National scale modelling often uses national (or local) data (for example, soils, landcover and topographical information) as input. However, there are some places where fine resolution and/or high quality national datasets cannot be easily obtained, or do not even exist. In the absence of such detailed information, regional or global datasets could be used as input to such models. There are questions, however, about the usefulness of these coarser resolution datasets and the extent to which inaccuracies in these data may degrade predictions of existing and potential ecosystem service provision and subsequent decision making. Using LUCI (the Land Utilisation and Capability Indicator) as an example predictive model, we examine how the reliability of predictions changes when national datasets of soil, landcover and topography are substituted with coarser scale regional and global datasets. We specifically look at how LUCI's predictions of water services, such as flood risk, flood mitigation, erosion and water quality, change when national data inputs are replaced by regional and global datasets. Using the Conwy catchment, Wales, as a case study, the land cover products compared are the UK's Land Cover Map (2007), the European CORINE land cover map and the ESA global land cover map. Soils products include the National Soil Map of England and Wales (NatMap) and the European Soils Database. NEXTmap elevation data, which covers the UK and parts of continental Europe, are compared to global AsterDEM and SRTM30 topographical products. While the regional and global datasets can be used to fill gaps in data requirements, the coarser resolution of these datasets means that there is greater aggregation of information over larger areas. This loss of detail impacts on the reliability of model output, particularly where significant discrepancies between datasets exist. The implications of this loss of detail in terms of spatial planning and decision making are discussed. Finally, in the context of broader development, the need for better nationally and globally available data to allow LUCI and other ecosystem models to become more globally applicable is highlighted.

  5. Open and scalable analytics of large Earth observation datasets: From scenes to multidimensional arrays using SciDB and GDAL

    NASA Astrophysics Data System (ADS)

    Appel, Marius; Lahn, Florian; Buytaert, Wouter; Pebesma, Edzer

    2018-04-01

    Earth observation (EO) datasets are commonly provided as collection of scenes, where individual scenes represent a temporal snapshot and cover a particular region on the Earth's surface. Using these data in complex spatiotemporal modeling becomes difficult as soon as data volumes exceed a certain capacity or analyses include many scenes, which may spatially overlap and may have been recorded at different dates. In order to facilitate analytics on large EO datasets, we combine and extend the geospatial data abstraction library (GDAL) and the array-based data management and analytics system SciDB. We present an approach to automatically convert collections of scenes to multidimensional arrays and use SciDB to scale computationally intensive analytics. We evaluate the approach in three study cases on national scale land use change monitoring with Landsat imagery, global empirical orthogonal function analysis of daily precipitation, and combining historical climate model projections with satellite-based observations. Results indicate that the approach can be used to represent various EO datasets and that analyses in SciDB scale well with available computational resources. To simplify analyses of higher-dimensional datasets as from climate model output, however, a generalization of the GDAL data model might be needed. All parts of this work have been implemented as open-source software and we discuss how this may facilitate open and reproducible EO analyses.
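
    The scenes-to-array step (before any SciDB ingestion, which is not shown) can be sketched with GDAL's Python bindings; the file names below are hypothetical placeholders and the scenes are assumed to share the same grid.

```python
# Minimal sketch of the "scenes to array" step only (SciDB loading not shown):
# read a few co-registered scenes with GDAL and stack them into a
# (time, band, y, x) NumPy array.
import numpy as np
from osgeo import gdal

scene_files = ["scene_2017-01-01.tif", "scene_2017-01-17.tif", "scene_2017-02-02.tif"]

arrays = []
for path in scene_files:
    ds = gdal.Open(path)                  # GDAL handles many EO raster formats
    arrays.append(ds.ReadAsArray())       # (band, y, x) for multiband rasters
    ds = None                             # close the dataset

cube = np.stack(arrays, axis=0)           # assumes identical grids across scenes
print(cube.shape)                         # (time, band, y, x)
```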

  6. Setting up a hydrological model based on global data for the Ayeyarwady basin in Myanmar

    NASA Astrophysics Data System (ADS)

    ten Velden, Corine; Sloff, Kees; Nauta, Tjitte

    2017-04-01

    The use of global datasets in local hydrological modelling can be of great value. It opens up the possibility to include data for areas where local data are unavailable or only sparsely available. In hydrological modelling the existence of both static physical data such as elevation and land use, and dynamic meteorological data such as precipitation and temperature, is essential for setting up a hydrological model, but often such data are difficult to obtain at the local level. For the Ayeyarwady catchment in Myanmar a distributed hydrological model (Wflow: https://github.com/openstreams/wflow) was set up with only global datasets, as part of a water resources study. Myanmar is an emerging economy, which has only recently become more receptive to foreign influences. It has a very limited hydrometeorological measurement network, with large spatial and temporal gaps, and data that are of uncertain quality and difficult to obtain. The hydrological model was thus set up based on resampled versions of the SRTM digital elevation model, the GlobCover land cover dataset and the HWSD soil dataset. Three global meteorological datasets were assessed and compared for use in the hydrological model: TRMM, WFDEI and MSWEP. The meteorological datasets were assessed based on their conformity with several precipitation station measurements, and the overall model performance was assessed by calculating the NSE and RVE based on discharge measurements of several gauging stations. The model was run for the period 1979-2012 on a daily time step, and the results show acceptable applicability of the global datasets used in the hydrological model. The WFDEI forcing dataset gave the best results, with an NSE of 0.55 at the outlet of the model and an RVE of 8.5%, calculated over the calibration period 2006-2012. As a general trend the modelled discharge at the upstream stations tends to be underestimated, and at the downstream stations slightly overestimated. The quality of the discharge measurements that form the basis for the performance calculations is uncertain; data analysis suggests that rating curves are not frequently updated. The modelling results are not perfect and there is ample room for improvement, but the results are reasonable given that setting up a hydrological model for this area would not have been possible without the use of global datasets, due to the lack of available local data. The resulting hydrological model then enabled the set-up of the RIBASIM water allocation model for the Ayeyarwady basin in order to assess its water resources. The study discussed here is a first step; ideally this is followed up by a more thorough calibration and validation with the limited local measurements available, e.g. a precipitation correction based on the available rainfall measurements, to ensure the integration of global and local data.
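
    The two skill scores reported above are straightforward to compute; a minimal sketch with hypothetical discharge values follows.

```python
# Minimal sketch of the two skill scores quoted above (hypothetical discharge data).
import numpy as np

def nse(sim, obs):
    # Nash-Sutcliffe efficiency: 1 is perfect, 0 means no better than the observed mean
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rve(sim, obs):
    # relative volume error, as a percentage of the observed volume
    return 100.0 * (np.sum(sim) - np.sum(obs)) / np.sum(obs)

obs_q = np.array([1200.0, 1350.0, 1800.0, 2400.0, 2100.0, 1600.0])   # m3/s
sim_q = np.array([1100.0, 1400.0, 1700.0, 2600.0, 2000.0, 1750.0])

print(f"NSE = {nse(sim_q, obs_q):.2f}, RVE = {rve(sim_q, obs_q):.1f}%")
```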

  7. Predicting gestational age using neonatal metabolic markers

    PubMed Central

    Ryckman, Kelli K.; Berberich, Stanton L.; Dagle, John M.

    2016-01-01

    Background: Accurate gestational age estimation is extremely important for clinical care decisions of the newborn as well as for perinatal health research. Although prenatal ultrasound dating is one of the most accurate methods for estimating gestational age, it is not feasible in all settings. Identifying novel and accurate methods for gestational age estimation at birth is important, particularly for surveillance of preterm birth rates in areas without routine ultrasound dating. Objective: We hypothesized that metabolic and endocrine markers captured by routine newborn screening could improve gestational age estimation in the absence of prenatal ultrasound technology. Study Design: This is a retrospective analysis of 230,013 newborn metabolic screening records collected by the Iowa Newborn Screening Program between 2004 and 2009. The data were randomly split into a model-building dataset (n = 153,342) and a model-testing dataset (n = 76,671). We performed multiple linear regression modeling with gestational age, in weeks, as the outcome measure. We examined 44 metabolites, including biomarkers of amino acid and fatty acid metabolism, thyroid-stimulating hormone, and 17-hydroxyprogesterone. The coefficient of determination (R2) and the root-mean-square error were used to evaluate models in the model-building dataset that were then tested in the model-testing dataset. Results: The newborn metabolic regression model consisted of 88 parameters, including the intercept, 37 metabolite measures, 29 squared metabolite measures, and 21 cubed metabolite measures. This model explained 52.8% of the variation in gestational age in the model-testing dataset. Gestational age was predicted within 1 week for 78% of the individuals and within 2 weeks of gestation for 95% of the individuals. This model yielded an area under the curve of 0.899 (95% confidence interval 0.895−0.903) in differentiating those born preterm (<37 weeks) from those born term (≥37 weeks). In the subset of infants born small-for-gestational age, the average difference between gestational ages predicted by the newborn metabolic model and the recorded gestational age was 1.5 weeks. In contrast, the average difference between gestational ages predicted by the model including only newborn weight and the recorded gestational age was 1.9 weeks. The estimated prevalence of preterm birth <37 weeks’ gestation in the subset of infants that were small for gestational age was 18.79% when the model including only newborn weight was used, over twice that of the actual prevalence of 9.20%. The newborn metabolic model underestimated the preterm birth prevalence at 6.94% but was closer to the prevalence based on the recorded gestational age than the model including only newborn weight. Conclusions: The newborn metabolic profile, as derived from routine newborn screening markers, is an accurate method for estimating gestational age. In small-for-gestational age neonates, the newborn metabolic model predicts gestational age to a better degree than newborn weight alone. Newborn metabolic screening is a potentially effective method for population surveillance of preterm birth in the absence of prenatal ultrasound measurements or newborn weight. PMID:26645954
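
    The modelling strategy (each metabolite entered with its square and cube in a linear regression, evaluated by R2 and RMSE on a held-out split) can be sketched as below with simulated data; the sample size, coefficients and signal are placeholders, not the Iowa screening records.

```python
# Minimal sketch of the modelling strategy (simulated data, far fewer records):
# linear, squared and cubed metabolite terms in an ordinary least-squares model,
# evaluated by R^2 and RMSE on a held-out split.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n, n_metab = 5000, 44
metabolites = rng.normal(size=(n, n_metab))
gest_age = 39 + metabolites[:, 0] - 0.5 * metabolites[:, 1] ** 2 + rng.normal(scale=1.0, size=n)

# build the design matrix: each metabolite plus its square and cube
X = np.hstack([metabolites, metabolites ** 2, metabolites ** 3])
X_train, X_test, y_train, y_test = train_test_split(X, gest_age, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print(f"R^2 = {r2_score(y_test, pred):.3f}, "
      f"RMSE = {np.sqrt(mean_squared_error(y_test, pred)):.2f} weeks")
```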

  8. Predicting gestational age using neonatal metabolic markers.

    PubMed

    Ryckman, Kelli K; Berberich, Stanton L; Dagle, John M

    2016-04-01

    Accurate gestational age estimation is extremely important for clinical care decisions of the newborn as well as for perinatal health research. Although prenatal ultrasound dating is one of the most accurate methods for estimating gestational age, it is not feasible in all settings. Identifying novel and accurate methods for gestational age estimation at birth is important, particularly for surveillance of preterm birth rates in areas without routine ultrasound dating. We hypothesized that metabolic and endocrine markers captured by routine newborn screening could improve gestational age estimation in the absence of prenatal ultrasound technology. This is a retrospective analysis of 230,013 newborn metabolic screening records collected by the Iowa Newborn Screening Program between 2004 and 2009. The data were randomly split into a model-building dataset (n = 153,342) and a model-testing dataset (n = 76,671). We performed multiple linear regression modeling with gestational age, in weeks, as the outcome measure. We examined 44 metabolites, including biomarkers of amino acid and fatty acid metabolism, thyroid-stimulating hormone, and 17-hydroxyprogesterone. The coefficient of determination (R(2)) and the root-mean-square error were used to evaluate models in the model-building dataset that were then tested in the model-testing dataset. The newborn metabolic regression model consisted of 88 parameters, including the intercept, 37 metabolite measures, 29 squared metabolite measures, and 21 cubed metabolite measures. This model explained 52.8% of the variation in gestational age in the model-testing dataset. Gestational age was predicted within 1 week for 78% of the individuals and within 2 weeks of gestation for 95% of the individuals. This model yielded an area under the curve of 0.899 (95% confidence interval 0.895-0.903) in differentiating those born preterm (<37 weeks) from those born term (≥37 weeks). In the subset of infants born small-for-gestational age, the average difference between gestational ages predicted by the newborn metabolic model and the recorded gestational age was 1.5 weeks. In contrast, the average difference between gestational ages predicted by the model including only newborn weight and the recorded gestational age was 1.9 weeks. The estimated prevalence of preterm birth <37 weeks' gestation in the subset of infants that were small for gestational age was 18.79% when the model including only newborn weight was used, over twice that of the actual prevalence of 9.20%. The newborn metabolic model underestimated the preterm birth prevalence at 6.94% but was closer to the prevalence based on the recorded gestational age than the model including only newborn weight. The newborn metabolic profile, as derived from routine newborn screening markers, is an accurate method for estimating gestational age. In small-for-gestational age neonates, the newborn metabolic model predicts gestational age to a better degree than newborn weight alone. Newborn metabolic screening is a potentially effective method for population surveillance of preterm birth in the absence of prenatal ultrasound measurements or newborn weight. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.

  9. Object instance recognition using motion cues and instance specific appearance models

    NASA Astrophysics Data System (ADS)

    Schumann, Arne

    2014-03-01

    In this paper we present an object instance retrieval approach. The baseline approach consists of a pool of image features which are computed on the bounding boxes of a query object track and compared to a database of tracks in order to find additional appearances of the same object instance. We improve over this simple baseline approach in multiple ways: 1) we include motion cues to achieve improved robustness to viewpoint and rotation changes, 2) we include operator feedback to iteratively re-rank the resulting retrieval lists and 3) we use operator feedback and location constraints to train classifiers and learn an instance specific appearance model. We use these classifiers to further improve the retrieval results. The approach is evaluated on two popular public datasets for two different applications. We evaluate person re-identification on the CAVIAR shopping mall surveillance dataset and vehicle instance recognition on the VIVID aerial dataset and achieve significant improvements over our baseline results.

  10. Network analysis of mesoscale optical recordings to assess regional, functional connectivity.

    PubMed

    Lim, Diana H; LeDue, Jeffrey M; Murphy, Timothy H

    2015-10-01

    With modern optical imaging methods, it is possible to map structural and functional connectivity. Optical imaging studies that aim to describe large-scale neural connectivity often need to handle large and complex datasets. In order to interpret these datasets, new methods for analyzing structural and functional connectivity are being developed. Recently, network analysis, based on graph theory, has been used to describe and quantify brain connectivity in both experimental and clinical studies. We outline how to apply regional, functional network analysis to mesoscale optical imaging using voltage-sensitive-dye imaging and channelrhodopsin-2 stimulation in a mouse model. We include links to sample datasets and an analysis script. The analyses we employ can be applied to other types of fluorescence wide-field imaging, including genetically encoded calcium indicators, to assess network properties. We discuss the benefits and limitations of using network analysis for interpreting optical imaging data and define network properties that may be used to compare across preparations or other manipulations such as animal models of disease.
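
    As an illustration of the kind of regional, graph-theoretic summary described above (not the authors' analysis script), the sketch below builds a thresholded correlation graph from simulated regional time courses and computes a few standard network measures with networkx.

```python
# Minimal sketch of regional functional network analysis (simulated data):
# build a correlation-based graph across cortical regions and compute a few
# standard graph-theory measures.
import networkx as nx
import numpy as np

rng = np.random.default_rng(6)
n_regions, n_frames = 12, 500
signals = rng.normal(size=(n_regions, n_frames))      # e.g. regional imaging time courses

corr = np.corrcoef(signals)
np.fill_diagonal(corr, 0.0)

threshold = 0.1                                       # keep only stronger connections
adjacency = (np.abs(corr) > threshold).astype(int)
G = nx.from_numpy_array(adjacency)

print("node degree:", dict(G.degree()))
print("mean clustering:", nx.average_clustering(G))
print("density:", nx.density(G))
```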

  11. Experiments to Determine Whether Recursive Partitioning (CART) or an Artificial Neural Network Overcomes Theoretical Limitations of Cox Proportional Hazards Regression

    NASA Technical Reports Server (NTRS)

    Kattan, Michael W.; Hess, Kenneth R.; Kattan, Michael W.

    1998-01-01

    New computationally intensive tools for medical survival analyses include recursive partitioning (also called CART) and artificial neural networks. A challenge that remains is to better understand the behavior of these techniques in an effort to know when they will be effective tools. Theoretically, they may overcome limitations of the traditional multivariable survival technique, the Cox proportional hazards regression model. Experiments were designed to test whether the new tools would, in practice, overcome these limitations. Two datasets in which theory suggests CART and the neural network should outperform the Cox model were selected. The first was a published leukemia dataset manipulated to have a strong interaction that CART should detect. The second was a published cirrhosis dataset with pronounced nonlinear effects that a neural network should fit. Repeated sampling of 50 training and testing subsets was applied to each technique. The concordance index C was calculated as a measure of predictive accuracy by each technique on the testing dataset. In the interaction dataset, CART outperformed Cox (P less than 0.05) with a C improvement of 0.1 (95% CI, 0.08 to 0.12). In the nonlinear dataset, the neural network outperformed the Cox model (P less than 0.05), but by a very slight amount (0.015). As predicted by theory, CART and the neural network were able to overcome limitations of the Cox model. Experiments like these are important to increase our understanding of when one of these new techniques will outperform the standard Cox model. Further research is necessary to predict which technique will do best a priori and to assess the magnitude of superiority.
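
    The concordance index C used as the accuracy measure can be computed as in the sketch below (Harrell's formulation on a tiny hypothetical example); this is an illustration only, not the study's code.

```python
# Minimal sketch of the concordance index C for right-censored survival data:
# C is the fraction of usable pairs in which the model assigns higher risk to
# the subject who fails earlier (ties in risk count as half).
import numpy as np

def concordance_index(time, event, risk):
    concordant, usable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # a pair is usable if subject i failed before subject j's follow-up ended
            if event[i] == 1 and time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable

time = np.array([5.0, 8.0, 12.0, 3.0, 9.0])
event = np.array([1, 1, 0, 1, 0])            # 1 = event observed, 0 = censored
risk = np.array([0.9, 0.4, 0.2, 0.8, 0.3])   # higher = predicted higher risk

print(f"C = {concordance_index(time, event, risk):.2f}")
```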

  12. A gridded hourly rainfall dataset for the UK applied to a national physically-based modelling system

    NASA Astrophysics Data System (ADS)

    Lewis, Elizabeth; Blenkinsop, Stephen; Quinn, Niall; Freer, Jim; Coxon, Gemma; Woods, Ross; Bates, Paul; Fowler, Hayley

    2016-04-01

    An hourly gridded rainfall product has great potential for use in many hydrological applications that require high temporal resolution meteorological data. One important example of this is flood risk management, with flooding in the UK highly dependent on sub-daily rainfall intensities amongst other factors. Knowledge of sub-daily rainfall intensities is therefore critical to designing hydraulic structures or flood defences to appropriate levels of service. Sub-daily rainfall rates are also essential inputs for flood forecasting, allowing for estimates of peak flows and stage for flood warning and response. In addition, an hourly gridded rainfall dataset has significant potential for practical applications such as better representation of extremes and pluvial flash flooding, validation of high resolution climate models and improving the representation of sub-daily rainfall in weather generators. A new 1km gridded hourly rainfall dataset for the UK has been created by disaggregating the daily Gridded Estimates of Areal Rainfall (CEH-GEAR) dataset using comprehensively quality-controlled hourly rain gauge data from over 1300 observation stations across the country. Quality control measures include identification of frequent tips, daily accumulations and dry spells, comparison of daily totals against the CEH-GEAR daily dataset, and nearest neighbour checks. The quality control procedure was validated against historic extreme rainfall events and the UKCP09 5km daily rainfall dataset. General use of the dataset has been demonstrated by testing the sensitivity of a physically-based hydrological modelling system for Great Britain to the distribution and rates of rainfall and potential evapotranspiration. Of the sensitivity tests undertaken, the largest improvements in model performance were seen when an hourly gridded rainfall dataset was combined with potential evapotranspiration disaggregated to hourly intervals, with 61% of catchments showing an increase in NSE between observed and simulated streamflows as a result of more realistic sub-daily meteorological forcing.

  13. Evaluation of Global Observations-Based Evapotranspiration Datasets and IPCC AR4 Simulations

    NASA Technical Reports Server (NTRS)

    Mueller, B.; Seneviratne, S. I.; Jimenez, C.; Corti, T.; Hirschi, M.; Balsamo, G.; Ciais, P.; Dirmeyer, P.; Fisher, J. B.; Guo, Z.; hide

    2011-01-01

    Quantification of global land evapotranspiration (ET) has long been associated with large uncertainties due to the lack of reference observations. Several recently developed products now provide the capacity to estimate ET at global scales. These products, partly based on observational data, include satellite-based products, land surface model (LSM) simulations, atmospheric reanalysis output, estimates based on empirical upscaling of eddy-covariance flux measurements, and atmospheric water balance datasets. The LandFlux-EVAL project aims to evaluate and compare these newly developed datasets. Additionally, an evaluation of IPCC AR4 global climate model (GCM) simulations is presented, providing an assessment of their capacity to reproduce flux behavior relative to the observations-based products. Though differently constrained with observations, the analyzed reference datasets display similar large-scale ET patterns. ET from the IPCC AR4 simulations was significantly smaller than that from the other products for India (up to 1 mm/d) and parts of eastern South America, and larger in the western USA, Australia and China. The inter-product variance is lower across the IPCC AR4 simulations than across the reference datasets in several regions, which indicates that uncertainties may be underestimated in the IPCC AR4 models due to shared biases of these simulations.

  14. Indoor radon measurements in south west England explained by topsoil and stream sediment geochemistry, airborne gamma-ray spectroscopy and geology.

    PubMed

    Ferreira, Antonio; Daraktchieva, Zornitza; Beamish, David; Kirkwood, Charles; Lister, T Robert; Cave, Mark; Wragg, Joanna; Lee, Kathryn

    2018-01-01

    Predictive mapping of indoor radon potential often requires the use of additional datasets. A range of geological, geochemical and geophysical data may be considered, either individually or in combination. The present work is an evaluation of how much of the indoor radon variation in south west England can be explained by four different datasets: a) the geology (G), b) the airborne gamma-ray spectroscopy (AGR), c) the geochemistry of topsoil (TSG) and d) the geochemistry of stream sediments (SSG). The study area was chosen since it provides a large (197,464) indoor radon dataset in association with the above information. Geology provides information on the distribution of the materials that may contribute to radon release, while the latter three items provide more direct observations on the distributions of the radionuclide elements uranium (U), thorium (Th) and potassium (K). In addition, (c) and (d) provide multi-element assessments of geochemistry which are also included in this study. The effectiveness of datasets for predicting the existing indoor radon data is assessed through the level (the higher the better) of explained variation (% of variance or ANOVA) obtained from the tested models. A multiple linear regression using a compositional data (CODA) approach is carried out to obtain the required measure of determination for each analysis. Results show that, amongst the four tested datasets, the soil geochemistry (TSG, i.e. including all 41 available elements, 10 major - Al, Ca, Fe, K, Mg, Mn, Na, P, Si, Ti - plus 31 trace) provides the highest explained variation of indoor radon (about 40%); more than double the value provided by U alone (ca. 15%), or the sub-composition U, Th, K (ca. 16%) from the same TSG data. The remaining three datasets provide values ranging from about 27% to 32.5%. The enhanced prediction of the AGR model relative to the U, Th, K in soils suggests that the AGR signal captures more than just the U, Th and K content in the soil. The best result is obtained by including the soil geochemistry with geology and AGR (TSG + G + AGR, ca. 47%). However, adding G and AGR to the TSG model only slightly improves the prediction (ca. +7%), suggesting that the geochemistry of soils already contains most of the information given by the geology and airborne datasets together, at least with regard to the explanation of indoor radon. From the present analysis performed in the SW of England, it may be concluded that each one of the four datasets is likely to be useful for radon mapping purposes, whether alone or in combination with others. The present work also suggests that the complete soil geochemistry dataset (TSG) is more effective for indoor radon modelling than using just the U (+Th, K) concentration in soil. Copyright © 2016 Natural Environment Research Council. Published by Elsevier Ltd. All rights reserved.
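
    A much-simplified illustration of the compositional-data (CODA) regression idea, assuming simulated values rather than the south west England measurements: a centred log-ratio transform of a three-part radionuclide sub-composition followed by ordinary least squares, reporting the explained variation.

```python
# Simplified sketch of CODA-style regression (not the authors' model):
# centred log-ratio (clr) transform of a U-Th-K sub-composition, then
# ordinary least squares with explained variance (R^2) as the output.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 1000
parts = rng.lognormal(mean=0.0, sigma=0.5, size=(n, 3))          # U, Th, K proxies
composition = parts / parts.sum(axis=1, keepdims=True)           # close each row to 1

def clr(x):
    # centred log-ratio: log of each part over the geometric mean of the row
    gm = np.exp(np.log(x).mean(axis=1, keepdims=True))
    return np.log(x / gm)

X = clr(composition)
log_radon = 2.0 + 1.5 * X[:, 0] + rng.normal(scale=0.8, size=n)  # hypothetical response

model = LinearRegression().fit(X, log_radon)
print(f"explained variation (R^2): {model.score(X, log_radon):.2f}")
```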

  15. A semiparametric graphical modelling approach for large-scale equity selection

    PubMed Central

    Liu, Han; Mulvey, John; Zhao, Tianqi

    2016-01-01

    We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using the regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large data-sets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock data-set and a large 34-year stock data-set. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption. PMID:28316507
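
    The record above describes a rank-based latent Gaussian (elliptical-copula) graphical model. The sketch below illustrates only the general idea on invented data: a Gaussian-copula correlation matrix is estimated from pairwise Kendall's tau and passed to the graphical lasso, and stocks with few graph neighbours are shortlisted as approximately independent. It is not the authors' estimator or selection rule.

    ```python
    # Generic illustration (not the paper's method): rank-based correlation estimate
    # plugged into the graphical lasso to find weakly connected (near-independent) stocks.
    import numpy as np
    from scipy.stats import kendalltau
    from sklearn.covariance import graphical_lasso

    def rank_correlation(returns):
        """Gaussian-copula correlation implied by pairwise Kendall's tau."""
        n, p = returns.shape
        R = np.eye(p)
        for i in range(p):
            for j in range(i + 1, p):
                tau, _ = kendalltau(returns[:, i], returns[:, j])
                R[i, j] = R[j, i] = np.sin(np.pi * tau / 2.0)
        return R

    rng = np.random.default_rng(1)
    returns = rng.standard_normal((250, 20))               # stand-in daily returns, 20 stocks

    R = rank_correlation(returns)
    _, precision = graphical_lasso(R, alpha=0.3)           # sparse inverse correlation
    degree = (np.abs(precision) > 1e-6).sum(axis=1) - 1    # number of graph neighbours
    least_connected = np.argsort(degree)[:5]               # candidate "independent" stocks
    print(least_connected)
    ```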

  16. CO2 Push-Pull Dual (Conjugate) Faults Injection Simulations

    DOE Data Explorer

    Oldenburg, Curtis (ORCID:0000000201326016); Lee, Kyung Jae; Doughty, Christine; Jung, Yoojin; Borgia, Andrea; Pan, Lehua; Zhang, Rui; Daley, Thomas M.; Altundas, Bilgin; Chugunov, Nikita

    2017-07-20

    This submission contains datasets and a final manuscript associated with a project simulating carbon dioxide push-pull into a conjugate fault system modeled after Dixie Valley: sensitivity analysis of significant parameters and uncertainty prediction by data-worth analysis. Datasets include: (1) Forward simulation runs of standard cases (push & pull phases), (2) Local sensitivity analyses (push & pull phases), and (3) Data-worth analysis (push & pull phases).

  17. Does using different modern climate datasets impact pollen-based paleoclimate reconstructions in North America during the past 2,000 years

    NASA Astrophysics Data System (ADS)

    Ladd, Matthew; Viau, Andre

    2013-04-01

    Paleoclimate reconstructions rely on the accuracy of modern climate datasets for calibration of fossil records under the assumption of climate normality through time, which means that the modern climate operates in a similar manner as over the past 2,000 years. In this study, we show how using different modern climate datasets has an impact on a pollen-based reconstruction of mean temperature of the warmest month (MTWA) during the past 2,000 years for North America. The modern climate datasets used to explore this research question include: the Whitmore et al. (2005) modern climate dataset; the North American Regional Reanalysis (NARR); the National Centers for Environmental Prediction (NCEP); the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA-40 reanalysis; WorldClim; the Global Historical Climatology Network (GHCN); and New et al., which is derived from the CRU dataset. Results show that some caution is advised in using the reanalysis data on large-scale reconstructions. Station data appear to dampen out the variability of the reconstruction produced using station-based datasets. The reanalysis or model-based datasets are not recommended for paleoclimate large-scale North American reconstructions as they appear to lack some of the dynamics observed in station datasets (CRU), which resulted in warm-biased reconstructions as compared to the station-based reconstructions. The Whitmore et al. (2005) modern climate dataset appears to be a compromise between CRU-based datasets and model-based datasets except for the ERA-40. In addition, an ultra-high resolution gridded climate dataset such as WorldClim may only be useful if the pollen calibration sites in North America have at least the same spatial precision. We reconstruct the MTWA to within +/-0.01°C by using an average of all curves derived from the different modern climate datasets, demonstrating the robustness of the procedure used. It may be that the use of an average of different modern datasets may reduce the impact of uncertainty in paleoclimate reconstructions; however, this is yet to be determined with certainty. Future evaluations should test, for example, the newly developed Berkeley Earth surface temperature dataset against the paleoclimate record.

  18. Spatio-Temporal Data Model for Integrating Evolving Nation-Level Datasets

    NASA Astrophysics Data System (ADS)

    Sorokine, A.; Stewart, R. N.

    2017-10-01

    The ability to easily combine data from diverse sources in a single analytical workflow is one of the greatest promises of Big Data technologies. However, such integration is often challenging as datasets originate from different vendors, governments, and research communities, which results in multiple incompatibilities including data representations, formats, and semantics. Semantic differences are the hardest to handle: different communities often use different attribute definitions and associate the records with different sets of evolving geographic entities. Analysis of global socioeconomic variables across multiple datasets over prolonged periods of time is often complicated by differences in how the boundaries and histories of countries or other geographic entities are represented. Here we propose an event-based data model for depicting and tracking the histories of evolving geographic units (countries, provinces, etc.) and their representations in disparate data. The model addresses the semantic challenge of preserving the identity of geographic entities over time by defining criteria for the entity's existence, a set of events that may affect its existence, and rules for mapping between different representations (datasets). The proposed model is used to maintain an evolving compound database of global socioeconomic and environmental data harvested from multiple sources. A practical implementation of our model is demonstrated using the PostgreSQL object-relational database with temporal, geospatial, and NoSQL database extensions.

  19. Diagnostics for generalized linear hierarchical models in network meta-analysis.

    PubMed

    Zhao, Hong; Hodges, James S; Carlin, Bradley P

    2017-09-01

    Network meta-analysis (NMA) combines direct and indirect evidence comparing more than 2 treatments. Inconsistency arises when these 2 information sources differ. Previous work focuses on inconsistency detection, but little has been done on how to proceed after identifying inconsistency. The key issue is whether inconsistency changes an NMA's substantive conclusions. In this paper, we examine such discrepancies from a diagnostic point of view. Our methods seek to detect influential and outlying observations in NMA at a trial-by-arm level. These observations may have a large effect on the parameter estimates in NMA, or they may deviate markedly from other observations. We develop formal diagnostics for a Bayesian hierarchical model to check the effect of deleting any observation. Diagnostics are specified for generalized linear hierarchical NMA models and investigated for both published and simulated datasets. Results from our example dataset using either contrast- or arm-based models and from the simulated datasets indicate that the sources of inconsistency in NMA tend not to be influential, though results from the example dataset suggest that they are likely to be outliers. This mimics a familiar result from linear model theory, in which outliers with low leverage are not influential. Future extensions include incorporating baseline covariates and individual-level patient data. Copyright © 2017 John Wiley & Sons, Ltd.

  20. Phase 1 Free Air CO2 Enrichment Model-Data Synthesis (FACE-MDS): Meteorological Data

    DOE Data Explorer

    Norby, R. J.; Oren, R.; Boden, T. A. [Carbon Dioxide Information Analysis Center (CDIAC), Oak Ridge National Laboratory (ORNL); De Kauwe, M. G.; Kim, D.; Medlyn, B. E.; Riggs, J. S.; Tharp, M. L.; Walker, A. P.; Yang, B.; Zaehle, S.

    2015-01-01

    These datasets comprise the meteorological, CO2 and N deposition data used to run models for the Duke and Oak Ridge FACE experiments. Phase 1 datasets are reproduced here for posterity and reproducibility although these meteorological datasets are superseded by the Phase 2 datasets. If you would like to use the meteorological datasets to run your own model or for any other purpose please use the Phase 2 datasets.

  1. Inferring microbial interaction networks from metagenomic data using SgLV-EKF algorithm.

    PubMed

    Alshawaqfeh, Mustafa; Serpedin, Erchin; Younes, Ahmad Bani

    2017-03-27

    Inferring the microbial interaction networks (MINs) and modeling their dynamics are critical in understanding the mechanisms of the bacterial ecosystem and designing antibiotic and/or probiotic therapies. Recently, several approaches were proposed to infer MINs using the generalized Lotka-Volterra (gLV) model. The main drawbacks of these models are that they only consider measurement noise without taking into account the uncertainties in the underlying dynamics. Furthermore, inferring the MIN is complicated by the limited number of observations and the nonlinearity of the regulatory mechanisms. Therefore, novel estimation techniques are needed to address these challenges. This work proposes SgLV-EKF: a stochastic gLV model that adopts the extended Kalman filter (EKF) algorithm to model the MIN dynamics. In particular, SgLV-EKF employs a stochastic modeling of the MIN by adding a noise term to the dynamical model to compensate for modeling uncertainties. This stochastic modeling is more realistic than the conventional gLV model, which assumes that the MIN dynamics are perfectly governed by the gLV equations. After specifying the stochastic model structure, we use the EKF to estimate the MIN. SgLV-EKF was compared with two similarity-based algorithms, one algorithm from the integral-based family and two regression-based algorithms, in terms of the achieved performance on two synthetic data-sets and two real data-sets. The first data-set models the randomness in measurement data, whereas the second data-set incorporates uncertainties in the underlying dynamics. The real data-sets are provided by a recent study pertaining to an antibiotic-mediated Clostridium difficile infection. The experimental results demonstrate that SgLV-EKF outperforms the alternative methods in terms of robustness to measurement noise, modeling errors, and tracking the dynamics of the MIN. Performance analysis demonstrates that the proposed SgLV-EKF algorithm represents a powerful and reliable tool to infer MINs and track their dynamics.
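
    The following is a minimal, self-contained illustration of the two ingredients named above: generalized Lotka-Volterra (gLV) dynamics with process noise, tracked by an extended Kalman filter. The growth rates, interaction matrix and noise levels are invented, and the code is a toy version rather than the SgLV-EKF implementation.

    ```python
    # Toy illustration (not the authors' SgLV-EKF code): an extended Kalman filter
    # tracking two abundances that follow a gLV model with process noise.
    import numpy as np

    r = np.array([0.4, 0.3])                       # intrinsic growth rates (made up)
    A = np.array([[-0.5, -0.2],                    # interaction matrix (made up)
                  [-0.1, -0.4]])
    dt, Q, Rm = 0.1, 1e-3 * np.eye(2), 1e-2 * np.eye(2)

    def step(x):                                   # Euler step of gLV dynamics
        return x + dt * x * (r + A @ x)

    def jacobian(x):                               # d step / d x
        return np.eye(2) + dt * (np.diag(r + A @ x) + np.diag(x) @ A)

    rng = np.random.default_rng(2)
    x_true, x_est, P = np.array([0.5, 0.8]), np.array([0.4, 0.6]), 0.1 * np.eye(2)
    for _ in range(200):
        x_true = step(x_true) + rng.multivariate_normal(np.zeros(2), Q)
        y = x_true + rng.multivariate_normal(np.zeros(2), Rm)     # noisy observation
        F = jacobian(x_est)                                        # EKF predict
        x_est, P = step(x_est), F @ P @ F.T + Q
        K = P @ np.linalg.inv(P + Rm)                              # EKF update (H = I)
        x_est, P = x_est + K @ (y - x_est), (np.eye(2) - K) @ P
    print(np.round(x_true, 3), np.round(x_est, 3))
    ```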

  2. An empirical understanding of triple collocation evaluation measure

    NASA Astrophysics Data System (ADS)

    Scipal, Klaus; Doubkova, Marcela; Hegyova, Alena; Dorigo, Wouter; Wagner, Wolfgang

    2013-04-01

    The triple collocation method is an advanced evaluation method that has been used in the soil moisture field for only about half a decade. The method requires three datasets with independent error structures that represent an identical phenomenon. The main advantages of the method are that it a) doesn't require a reference dataset that has to be considered to represent the truth, b) limits the effect of random and systematic errors of the other two datasets, and c) simultaneously assesses the errors of all three datasets. The objective of this presentation is to assess the triple collocation error (Tc) of the ASAR Global Mode Surface Soil Moisture (GM SSM) 1 km dataset and highlight problems of the method related to its ability to cancel the effect of errors in ancillary datasets. In particular, the goal is a) to investigate trends in Tc related to the change in spatial resolution from 5 to 25 km, b) to investigate trends in Tc related to the choice of a hydrological model, and c) to study the relationship between Tc and other absolute evaluation methods (namely RMSE and Error Propagation, EP). The triple collocation method is implemented using ASAR GM, AMSR-E, and a model (either AWRA-L, GLDAS-NOAH, or ERA-Interim). First, the significance of the relationship between the three soil moisture datasets, which is a prerequisite for the triple collocation method, was tested. Second, the trends in Tc related to the choice of the third reference dataset and to scale were assessed. For this purpose, the triple collocation was repeated, replacing AWRA-L with two different globally available model reanalysis datasets operating at different spatial resolutions (ERA-Interim and GLDAS-NOAH). Finally, the retrieved results were compared to the results of the RMSE and EP evaluation measures. Our results demonstrate that the Tc method does not eliminate the random and time-variant systematic errors of the second and the third dataset used in the Tc. The possible reasons include a) that the TC method could not fully function with datasets acting at very different spatial resolutions, or b) that the errors were not fully independent as initially assumed.
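
    For reference, the sketch below computes the classical covariance-based triple collocation error variances for three collocated series under the usual assumption of mutually independent errors; the inputs are synthetic stand-ins, not the ASAR/AMSR-E/model data used in the study.

    ```python
    # Minimal sketch of covariance-based triple collocation on synthetic data.
    import numpy as np

    def triple_collocation_errors(x, y, z):
        """Error variances of x, y, z (same signal, mutually independent errors)."""
        c = np.cov(np.vstack([x, y, z]))
        var_ex = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
        var_ey = c[1, 1] - c[0, 1] * c[1, 2] / c[0, 2]
        var_ez = c[2, 2] - c[0, 2] * c[1, 2] / c[0, 1]
        return var_ex, var_ey, var_ez

    rng = np.random.default_rng(3)
    truth = rng.standard_normal(1000)                       # common soil moisture signal
    sat_a = truth + rng.normal(0, 0.3, 1000)                # e.g. an active microwave product
    sat_b = truth + rng.normal(0, 0.2, 1000)                # e.g. a passive microwave product
    model = truth + rng.normal(0, 0.4, 1000)                # e.g. a reanalysis / model product
    print(triple_collocation_errors(sat_a, sat_b, model))   # ~ (0.09, 0.04, 0.16)
    ```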

  3. Modeling Pacific Northwest carbon and water cycling using CARAIB Dynamic Vegetation Model

    NASA Astrophysics Data System (ADS)

    Dury, M.; Kim, J. B.; Still, C. J.; Francois, L. M.; Jiang, Y.

    2015-12-01

    While uncertainties remain regarding projected temperature and precipitation changes, climate warming is already affecting ecosystems in the Pacific Northwest (PNW). Decreases in ecosystem productivity, as well as increases in mortality of some plant species induced by drought and disturbance, have been reported. Here, we applied the process-based dynamic vegetation model CARAIB to the PNW to simulate the response of water and carbon cycling to current and future climate change projections. The vegetation model has already been successfully applied to Europe to simulate plant physiological responses to climate change. We calibrated CARAIB to the PNW using global Plant Functional Types. For calibration, the model is driven with the gridded surface meteorological dataset UIdaho MACA METDATA with 1/24-degree (~4-km) resolution at a daily time step for the period 1979-2014. The model's ability to reproduce the current spatial and temporal variations of carbon stocks and fluxes was evaluated using a variety of available datasets, including eddy covariance and satellite observations. We focused particularly on past severe drought and fire episodes. Then, we simulated future conditions using the UIdaho MACAv2-METDATA dataset, which includes downscaled CMIP5 projections from 28 GCMs for RCP4.5 and RCP8.5. We evaluated the future ecosystem carbon balance resulting from changes in drought frequency as well as in fire risk. We also simulated future productivity and drought-induced mortality of several key PNW tree species.

  4. Modeling and Databases for Teaching Petrology

    NASA Astrophysics Data System (ADS)

    Asher, P.; Dutrow, B.

    2003-12-01

    With the widespread availability of high-speed computers with massive storage and the ready transport capability of large amounts of data, computational and petrologic modeling and the use of databases provide new tools with which to teach petrology. Modeling can be used to gain insights into a system, predict system behavior, describe a system's processes, compare with a natural system or simply to be illustrative. These aspects result from data-driven or empirical, analytical or numerical models or the concurrent examination of multiple lines of evidence. At the same time, use of models can enhance core foundations of the geosciences by improving critical thinking skills and by reinforcing prior knowledge gained. However, the use of modeling to teach petrology is dictated by the level of expectation we have for students and their facility with modeling approaches. For example, do we expect students to push buttons and navigate a program, understand the conceptual model, and/or evaluate the results of a model? Whatever the desired level of sophistication, specific elements of design should be incorporated into a modeling exercise for effective teaching. These include, but are not limited to: use of the scientific method, use of prior knowledge, a clear statement of purpose and goals, attainable goals, a connection to the natural/actual system, a demonstration that complex heterogeneous natural systems are amenable to analyses by these techniques and, ideally, connections to other disciplines and the larger earth system. Databases offer another avenue with which to explore petrology. Large datasets are available that allow integration of multiple lines of evidence to attack a petrologic problem or understand a petrologic process. These are collected into a database that offers a tool for exploring, organizing and analyzing the data. For example, datasets may be geochemical, mineralogic, experimental and/or visual in nature, covering global, regional and local scales. These datasets provide students with access to large amounts of related data through space and time. Goals of the database working group include educating earth scientists about information systems in general, about the importance of metadata, about ways of using databases and datasets as educational tools, and about the availability of existing datasets and databases. The modeling and databases groups hope to create additional petrologic teaching tools using these aspects and invite the community to contribute to the effort.

  5. Visualized analysis of mixed numeric and categorical data via extended self-organizing map.

    PubMed

    Hsu, Chung-Chian; Lin, Shu-Han

    2012-01-01

    Many real-world datasets are of mixed types, having numeric and categorical attributes. Even though difficult, analyzing mixed-type datasets is important. In this paper, we propose an extended self-organizing map (SOM), called MixSOM, which utilizes a data structure distance hierarchy to facilitate the handling of numeric and categorical values in a direct, unified manner. Moreover, the extended model regularizes the prototype distance between neighboring neurons in proportion to their map distance so that structures of the clusters can be portrayed better on the map. Extensive experiments on several synthetic and real-world datasets are conducted to demonstrate the capability of the model and to compare MixSOM with several existing models including Kohonen's SOM, the generalized SOM and visualization-induced SOM. The results show that MixSOM is superior to the other models in reflecting the structure of the mixed-type data and facilitates further analysis of the data such as exploration at various levels of granularity.

  6. Electricity, water, and natural gas consumption of a residential house in Canada from 2012 to 2014.

    PubMed

    Makonin, Stephen; Ellert, Bradley; Bajić, Ivan V; Popowich, Fred

    2016-06-07

    With the cost of consuming resources increasing (both economically and ecologically), homeowners need to find ways to curb consumption. The Almanac of Minutely Power dataset Version 2 (AMPds2) has been released to help computational sustainability researchers, power and energy engineers, building scientists and technologists, utility companies, and eco-feedback researchers test their models, systems, algorithms, or prototypes on real house data. In the vast majority of cases, real-world datasets lead to more accurate models and algorithms. AMPds2 is the first dataset to capture all three main types of consumption (electricity, water, and natural gas) over a long period of time (2 years) and provide 11 measurement characteristics for electricity. No other such datasets from Canada exist. Each meter has 730 days of captured data. We also include environmental and utility billing data for cost analysis. AMPds2 data has been pre-cleaned to provide for consistent and comparable accuracy results amongst different researchers and machine learning algorithms.

  7. Electricity, water, and natural gas consumption of a residential house in Canada from 2012 to 2014

    PubMed Central

    Makonin, Stephen; Ellert, Bradley; Bajić, Ivan V.; Popowich, Fred

    2016-01-01

    With the cost of consuming resources increasing (both economically and ecologically), homeowners need to find ways to curb consumption. The Almanac of Minutely Power dataset Version 2 (AMPds2) has been released to help computational sustainability researchers, power and energy engineers, building scientists and technologists, utility companies, and eco-feedback researchers test their models, systems, algorithms, or prototypes on real house data. In the vast majority of cases, real-world datasets lead to more accurate models and algorithms. AMPds2 is the first dataset to capture all three main types of consumption (electricity, water, and natural gas) over a long period of time (2 years) and provide 11 measurement characteristics for electricity. No other such datasets from Canada exist. Each meter has 730 days of captured data. We also include environmental and utility billing data for cost analysis. AMPds2 data has been pre-cleaned to provide for consistent and comparable accuracy results amongst different researchers and machine learning algorithms. PMID:27271937

  8. Exposing earth surface process model simulations to a large audience

    NASA Astrophysics Data System (ADS)

    Overeem, I.; Kettner, A. J.; Borkowski, L.; Russell, E. L.; Peddicord, H.

    2015-12-01

    The Community Surface Dynamics Modeling System (CSDMS) represents a diverse group of >1300 scientists who develop and apply numerical models to better understand the Earth's surface. CSDMS has a mandate to make the public more aware of model capabilities and therefore started sharing state-of-the-art surface process modeling results with large audiences. One platform to reach audiences outside the science community is through museum displays on 'Science on a Sphere' (SOS). Developed by NOAA, SOS is a giant globe linked with computers and multiple projectors that can display data and animations on a sphere. CSDMS has developed and contributed model simulation datasets for the SOS system since 2014, including hydrological processes, coastal processes, and human interactions with the environment. Model simulations of a hydrological and sediment transport model (WBM-SED) illustrate global river discharge patterns. WAVEWATCH III simulations have been specifically processed to show the impacts of hurricanes on ocean waves, with a focus on Hurricane Katrina and Superstorm Sandy. A large world dataset of dams built over the last two centuries gives an impression of the profound influence of humans on water management. Given the exposure of SOS, CSDMS aims to contribute at least 2 model datasets a year, and will soon provide displays of global river sediment fluxes and changes of the sea-ice-free season along the Arctic coast. Over 100 facilities worldwide show these numerical model displays to an estimated 33 million people every year. Datasets, storyboards, and teacher follow-up materials associated with the simulations are developed to address common core science K-12 standards. CSDMS dataset documentation aims to make people aware of the fact that they are looking at numerical model results, that the underlying models have inherent assumptions and simplifications, and that limitations are known. CSDMS contributions aim to familiarize large audiences with the use of numerical modeling as a tool to create understanding of environmental processes.

  9. A data-driven model for constraint of present-day glacial isostatic adjustment in North America

    NASA Astrophysics Data System (ADS)

    Simon, K. M.; Riva, R. E. M.; Kleinherenbrink, M.; Tangdamrongsub, N.

    2017-09-01

    Geodetic measurements of vertical land motion and gravity change are incorporated into an a priori model of present-day glacial isostatic adjustment (GIA) in North America via least-squares adjustment. The result is an updated GIA model wherein the final predicted signal is informed by both observational data, and prior knowledge (or intuition) of GIA inferred from models. The data-driven method allows calculation of the uncertainties of predicted GIA fields, and thus offers a significant advantage over predictions from purely forward GIA models. In order to assess the influence each dataset has on the final GIA prediction, the vertical land motion and GRACE-measured gravity data are incorporated into the model first independently (i.e., one dataset only), then simultaneously. The relative weighting of the datasets and the prior input is iteratively determined by variance component estimation in order to achieve the most statistically appropriate fit to the data. The best-fit model is obtained when both datasets are inverted and gives respective RMS misfits to the GPS and GRACE data of 1.3 mm/yr and 0.8 mm/yr equivalent water layer change. Non-GIA signals (e.g., hydrology) are removed from the datasets prior to inversion. The post-fit residuals between the model predictions and the vertical motion and gravity datasets, however, suggest particular regions where significant non-GIA signals may still be present in the data, including unmodeled hydrological changes in the central Prairies west of Lake Winnipeg. Outside of these regions of misfit, the posterior uncertainty of the predicted model provides a measure of the formal uncertainty associated with the GIA process; results indicate that this quantity is sensitive to the uncertainty and spatial distribution of the input data as well as that of the prior model information. In the study area, the predicted uncertainty of the present-day GIA signal ranges from ∼0.2-1.2 mm/yr for rates of vertical land motion, and from ∼3-4 mm/yr of equivalent water layer change for gravity variations.
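
    A toy version of the data-driven update described above is sketched below: a prior GIA uplift rate at a single location is combined with a GPS rate and a GRACE-derived rate by inverse-variance weighted least squares. All numbers are invented, and the full method additionally estimates the relative weights over whole fields by variance component estimation.

    ```python
    # Toy inverse-variance fusion of a prior GIA rate with two observation types.
    import numpy as np

    prior, sigma_prior = 2.0, 1.0          # mm/yr, prior model value and its uncertainty
    gps, sigma_gps = 3.1, 0.5              # mm/yr, GPS vertical land motion rate
    grace, sigma_grace = 2.6, 0.8          # mm/yr, GRACE-derived uplift-equivalent rate

    w = np.array([1 / sigma_prior**2, 1 / sigma_gps**2, 1 / sigma_grace**2])
    obs = np.array([prior, gps, grace])

    posterior = np.sum(w * obs) / np.sum(w)          # weighted least-squares estimate
    posterior_sigma = np.sqrt(1.0 / np.sum(w))       # formal (posterior) uncertainty
    print(f"{posterior:.2f} +/- {posterior_sigma:.2f} mm/yr")
    ```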

  10. The Status of the NASA MEaSUREs Combined ASTER and MODIS Emissivity Over Land (CAMEL) Products

    NASA Astrophysics Data System (ADS)

    Borbas, E. E.; Feltz, M.; Hulley, G. C.; Knuteson, R. O.; Hook, S. J.

    2017-12-01

    As part of a NASA MEaSUREs Land Surface Temperature and Emissivity project, the University of Wisconsin Space Science and Engineering Center and NASA's Jet Propulsion Laboratory have developed a global monthly mean emissivity Earth System Data Record (ESDR). The CAMEL ESDR was produced by merging two current state-of-the-art emissivity datasets: the UW-Madison MODIS Infrared emissivity dataset (UWIREMIS) and the JPL ASTER Global Emissivity Dataset v4 (GEDv4). The dataset includes monthly global data records of emissivity, uncertainty at 13 hinge points between 3.6-14.3 µm, and Principal Components Analysis (PCA) coefficients at 5 kilometer resolution for the years 2003 to 2015. A high spectral resolution algorithm is also provided for HSR applications. The dataset is currently being tested in sounder retrieval algorithms (e.g., CrIS, IASI) and has already been implemented in RTTOV-12 for immediate use in numerical weather modeling and data assimilation. This poster will present the current status of the dataset.

  11. Historic AVHRR Processing in the Eumetsat Climate Monitoring Satellite Application Facility (cmsaf) (Invited)

    NASA Astrophysics Data System (ADS)

    Karlsson, K.

    2010-12-01

    The EUMETSAT CMSAF project (www.cmsaf.eu) compiles climatological datasets from various satellite sources with emphasis on the use of EUMETSAT-operated satellites. However, since climate monitoring primarily has a global scope, datasets merging data from various satellites and satellite operators are also prepared. One such dataset is the CMSAF historic GAC (Global Area Coverage) dataset, which is based on AVHRR data from the full historic series of NOAA satellites and the European METOP satellite in mid-morning orbit launched in October 2006. The CMSAF GAC dataset consists of three groups of products: macroscopic cloud products (cloud amount, cloud type and cloud top), cloud physical products (cloud phase, cloud optical thickness and cloud liquid water path) and surface radiation products (including surface albedo). Results will be presented and discussed for all product groups, including some preliminary inter-comparisons with other datasets (e.g., PATMOS-X, MODIS and CloudSat/CALIPSO datasets). A background will also be given describing the basic methodology behind the derivation of all products. This will include a short historical review of AVHRR cloud processing and resulting AVHRR applications at SMHI. Historic GAC processing is one of five pilot projects selected by the SCOPE-CM (Sustained Co-Ordinated Processing of Environmental Satellite data for Climate Monitoring) project organised by the WMO Space programme. The pilot project is carried out jointly between CMSAF and NOAA with the purpose of finding an optimal GAC processing approach. The initial activity is to inter-compare results of the CMSAF GAC dataset and the NOAA PATMOS-X dataset for the case when both datasets have been derived using the same inter-calibrated AVHRR radiance dataset. The aim is to gain further knowledge of, e.g., the most useful multispectral methods and the impact of ancillary datasets (for example, meteorological reanalysis datasets from NCEP and ECMWF). The CMSAF project is currently defining plans for another five years (2012-2017) of operations and development. New GAC reprocessing efforts are planned and new methodologies will be tested. Central questions here will be how to increase the quantitative use of the products through improved error and uncertainty estimates, and how to compile the information in a way that allows meaningful and efficient use of the data for, e.g., validation of climate model information.

  12. The effect of leverage and/or influential on structure-activity relationships.

    PubMed

    Bolboacă, Sorana D; Jäntschi, Lorentz

    2013-05-01

    In the spirit of reporting valid and reliable Quantitative Structure-Activity Relationship (QSAR) models, the aim of our research was to assess how the leverage (analysis with the hat matrix, h(i)) and the influence (analysis with Cook's distance, D(i)) of observations in QSAR models may reflect the models' reliability and their characteristics. The datasets included in this research were collected from previously published papers. Seven datasets that met the imposed inclusion criteria were analyzed. Three models were obtained for each dataset (full-model, h(i)-model and D(i)-model) and several statistical validation criteria were applied to the models. In 5 out of 7 sets the correlation coefficient increased when compounds with either h(i) or D(i) higher than the threshold were removed. The number of withdrawn compounds varied from 2 to 4 for h(i)-models and from 1 to 13 for D(i)-models. Validation statistics showed that D(i)-models possess systematically better agreement than both full-models and h(i)-models. Removal of influential compounds from the training set significantly improves the model and is recommended when developing quantitative structure-activity relationships. The Cook's distance approach should be combined with hat matrix analysis in order to identify the compounds that are candidates for removal.
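
    The two diagnostics discussed above are straightforward to compute for an ordinary least-squares model, as sketched below with synthetic descriptors; the leverage and Cook's distance cut-offs shown (3p/n and 4/n) are common rules of thumb, not necessarily the thresholds used in the paper.

    ```python
    # Leverages (hat matrix diagonal) and Cook's distances for a synthetic OLS model.
    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 40, 4
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])   # intercept + descriptors
    y = X @ np.array([1.0, 0.5, -0.3, 0.8]) + rng.normal(0, 0.2, n)      # activity

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    H = X @ np.linalg.inv(X.T @ X) @ X.T                                  # hat matrix
    h = np.diag(H)                                                        # leverages h_i
    s2 = resid @ resid / (n - p)                                          # residual variance
    cook = resid**2 * h / (p * s2 * (1 - h) ** 2)                         # Cook's distances D_i

    print("high leverage:", np.where(h > 3 * p / n)[0])                   # rule-of-thumb cut-offs
    print("influential:  ", np.where(cook > 4 / n)[0])
    ```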

  13. The New LASP Interactive Solar IRradiance Datacenter (LISIRD)

    NASA Astrophysics Data System (ADS)

    Baltzer, T.; Wilson, A.; Lindholm, D. M.; Snow, M. A.; Woodraska, D.; Pankratz, C. K.

    2017-12-01

    The University of Colorado at Boulder's Laboratory for Atmospheric and Space Physics (LASP) has a long history of providing state-of-the-art solar instrumentation and datasets to the community. In 2005, LASP created a web interface called LISIRD, which provided plotting of and access to a number of measured and modeled solar irradiance datasets, and it has been used extensively by members of the community both within and outside of LASP. In August 2017, LASP is set to release a new version of LISIRD for use by anyone interested in viewing and downloading the datasets it serves. This talk will describe the new LISIRD, with emphasis on the features it enables, including: new and more functional plotting interfaces; better dataset browse and search capabilities; more datasets; easier addition of datasets from a wider array of resources; a cleaner interface with better use of screen real estate; and much easier updating of the metadata describing each dataset. Much of this capability is leveraged off new infrastructure that will also be touched upon.

  14. Simulation of Smart Home Activity Datasets

    PubMed Central

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-01-01

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation. PMID:26087371

  15. Simulation of Smart Home Activity Datasets.

    PubMed

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  16. Statistical Analysis of Large Simulated Yield Datasets for Studying Climate Effects

    NASA Technical Reports Server (NTRS)

    Makowski, David; Asseng, Senthold; Ewert, Frank; Bassu, Simona; Durand, Jean-Louis; Martre, Pierre; Adam, Myriam; Aggarwal, Pramod K.; Angulo, Carlos; Baron, Chritian

    2015-01-01

    Many studies have been carried out during the last decade to study the effect of climate change on crop yields and other key crop characteristics. In these studies, one or several crop models were used to simulate crop growth and development for different climate scenarios that correspond to different projections of atmospheric CO2 concentration, temperature, and rainfall changes (Semenov et al., 1996; Tubiello and Ewert, 2002; White et al., 2011). The Agricultural Model Intercomparison and Improvement Project (AgMIP; Rosenzweig et al., 2013) builds on these studies with the goal of using an ensemble of multiple crop models in order to assess effects of climate change scenarios for several crops in contrasting environments. These studies generate large datasets, including thousands of simulated crop yield data. They include series of yield values obtained by combining several crop models with different climate scenarios that are defined by several climatic variables (temperature, CO2, rainfall, etc.). Such datasets potentially provide useful information on the possible effects of different climate change scenarios on crop yields. However, it is sometimes difficult to analyze these datasets and to summarize them in a useful way due to their structural complexity; simulated yield data can differ among contrasting climate scenarios, sites, and crop models. Another issue is that it is not straightforward to extrapolate the results obtained for the scenarios to alternative climate change scenarios not initially included in the simulation protocols. Additional dynamic crop model simulations for new climate change scenarios are an option but this approach is costly, especially when a large number of crop models are used to generate the simulated data, as in AgMIP. Statistical models have been used to analyze responses of measured yield data to climate variables in past studies (Lobell et al., 2011), but the use of a statistical model to analyze yields simulated by complex process-based crop models is a rather new idea. We demonstrate herewith that statistical methods can play an important role in analyzing simulated yield data sets obtained from the ensembles of process-based crop models. Formal statistical analysis is helpful to estimate the effects of different climatic variables on yield, and to describe the between-model variability of these effects.
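
    As a minimal illustration of using a statistical model to summarise simulated yields, the sketch below fits a linear emulator of synthetic ensemble output as a function of temperature change, CO2 concentration and rainfall change, and then predicts a scenario that was not simulated. All numbers are invented.

    ```python
    # Sketch of a statistical emulator fitted to synthetic multi-model yield output.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(5)
    n = 600
    dT = rng.uniform(0, 5, n)                       # temperature change (deg C)
    co2 = rng.uniform(360, 720, n)                  # CO2 concentration (ppm)
    rain = rng.uniform(-30, 30, n)                  # rainfall change (%)
    model_id = rng.integers(0, 10, n)               # which crop model produced the run

    yield_sim = (8.0 - 0.4 * dT + 0.004 * (co2 - 360) + 0.03 * rain
                 + 0.2 * model_id / 10 + rng.normal(0, 0.5, n))   # t/ha, synthetic

    X = np.column_stack([dT, co2, rain])
    emulator = LinearRegression().fit(X, yield_sim)
    print(dict(zip(["dT", "CO2", "rain"], emulator.coef_.round(4))))
    new_scenario = np.array([[2.5, 550.0, -10.0]])  # a scenario not in the simulated set
    print("predicted yield:", emulator.predict(new_scenario)[0].round(2), "t/ha")
    ```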

  17. Benchmarking protein classification algorithms via supervised cross-validation.

    PubMed

    Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor

    2008-04-24

    Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
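
    The supervised cross-validation idea above can be mimicked with a group-aware splitter, as sketched below on synthetic data: folds are formed so that test proteins come from subtypes not seen during training, which typically gives a lower, more conservative score than random k-fold.

    ```python
    # Random k-fold vs. subtype-held-out cross-validation on synthetic protein data.
    import numpy as np
    from sklearn.model_selection import GroupKFold, KFold, cross_val_score
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(6)
    n = 300
    subtype = rng.integers(0, 6, n)                             # known subtypes within each class
    y = (subtype >= 3).astype(int)                              # class label (two superfamilies)
    X = rng.standard_normal((n, 20)) + subtype[:, None] * 0.3   # subtype-correlated features

    clf = LogisticRegression(max_iter=1000)
    random_cv = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))
    supervised_cv = cross_val_score(clf, X, y, cv=GroupKFold(5), groups=subtype)
    print("random k-fold:   ", random_cv.mean().round(3))
    print("subtype-held-out:", supervised_cv.mean().round(3))
    ```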

  18. High-resolution grids of hourly meteorological variables for Germany

    NASA Astrophysics Data System (ADS)

    Krähenmann, S.; Walter, A.; Brienen, S.; Imbery, F.; Matzarakis, A.

    2018-02-01

    We present a 1-km2 gridded German dataset of hourly surface climate variables covering the period 1995 to 2012. The dataset comprises 12 variables including temperature, dew point, cloud cover, wind speed and direction, global and direct shortwave radiation, down- and up-welling longwave radiation, sea level pressure, relative humidity and vapour pressure. This dataset was constructed statistically from station data, satellite observations and model data. It is outstanding in terms of spatial and temporal resolution and in the number of climate variables. For each variable, we employed the most suitable gridding method and combined the best of several information sources, including station records, satellite-derived data and data from a regional climate model. A module to estimate urban heat island intensity was integrated for air and dew point temperature. Owing to the low density of available synop stations, the gridded dataset does not capture all variations that may occur at a resolution of 1 km2. This applies to areas of complex terrain (all the variables), and in particular to wind speed and the radiation parameters. To achieve maximum precision, we used all observational information when it was available. This, however, leads to inhomogeneities in station network density and affects the long-term consistency of the dataset. A first climate analysis for Germany was conducted. The Rhine River Valley, for example, exhibited more than 100 summer days in 2003, whereas in 1996, the number was low everywhere in Germany. The dataset is useful for applications in various climate-related studies, hazard management and for solar or wind energy applications and it is available via doi: 10.5676/DWD_CDC/TRY_Basis_v001.

  19. How accurately are climatological characteristics and surface water and energy balances represented for the Colombian Caribbean Catchment Basin?

    NASA Astrophysics Data System (ADS)

    Hoyos, Isabel; Baquero-Bernal, Astrid; Hagemann, Stefan

    2013-09-01

    In Colombia, access to climate-related observational data is restricted and their quantity is limited. But information about the current climate is fundamental for studies on present and future climate changes and their impacts. In this respect, this information is especially important over the Colombian Caribbean Catchment Basin (CCCB), which comprises over 80 % of the population of Colombia and produces about 85 % of its GDP. Consequently, an ensemble of several datasets has been evaluated and compared with respect to their capability to represent the climate over the CCCB. The comparison includes observations, reconstructed data (CPC, Delaware), reanalyses (ERA-40, NCEP/NCAR), and simulated data produced with the regional climate model REMO. The capabilities to represent the average annual state, the seasonal cycle, and the interannual variability are investigated. The analyses focus on surface air temperature and precipitation as well as on surface water and energy balances. On the one hand, the CCCB's characteristics pose some difficulties for the datasets, as the CCCB includes a mountainous region with three mountain ranges, where the dynamical core of models and model parameterizations can fail. On the other hand, it has the densest network of stations, with the longest records, in the country. The results can be summarised as follows: all of the datasets demonstrate a cold bias in the average temperature of CCCB. However, the variability of the average temperature of CCCB is most poorly represented by the NCEP/NCAR dataset. The average precipitation in CCCB is overestimated by all datasets. For the ERA-40, NCEP/NCAR, and REMO datasets, the amplitude of the annual cycle is extremely high. The variability of the average precipitation in CCCB is better represented by the reconstructed data of CPC and Delaware, as well as by NCEP/NCAR. Regarding the capability to represent the spatial behaviour of CCCB, temperature is better represented by Delaware and REMO, while precipitation is better represented by Delaware. Among the three datasets that permit an analysis of surface water and energy balances (REMO, ERA-40, and NCEP/NCAR), REMO best demonstrates the closure property of the surface water balance within the basin, while NCEP/NCAR does not demonstrate this property well. The three datasets represent the energy balance fairly well, although some inconsistencies were found in the individual balance components for NCEP/NCAR.

  20. GLEAM v3: updated land evaporation and root-zone soil moisture datasets

    NASA Astrophysics Data System (ADS)

    Martens, Brecht; Miralles, Diego; Lievens, Hans; van der Schalie, Robin; de Jeu, Richard; Fernández-Prieto, Diego; Verhoest, Niko

    2016-04-01

    Evaporation determines the availability of surface water resources and the requirements for irrigation. In addition, through its impacts on the water, carbon and energy budgets, evaporation influences the occurrence of rainfall and the dynamics of air temperature. Therefore, reliable estimates of this flux at regional to global scales are of major importance for water management and meteorological forecasting of extreme events. However, the global-scale magnitude and variability of the flux, and the sensitivity of the underlying physical process to changes in environmental factors, are still poorly understood due to the limited global coverage of in situ measurements. Remote sensing techniques can help to overcome the lack of ground data. However, evaporation is not directly observable from satellite systems. As a result, recent efforts have focussed on combining the observable drivers of evaporation within process-based models. The Global Land Evaporation Amsterdam Model (GLEAM, www.gleam.eu) estimates terrestrial evaporation based on daily satellite observations of meteorological drivers of terrestrial evaporation, vegetation characteristics and soil moisture. Since the publication of the first version of the model in 2011, GLEAM has been widely applied for the study of trends in the water cycle, interactions between land and atmosphere and hydrometeorological extreme events. A third version of the GLEAM global datasets will be available from the beginning of 2016 and will be distributed using www.gleam.eu as gateway. The updated datasets include separate estimates for the different components of the evaporative flux (i.e. transpiration, bare-soil evaporation, interception loss, open-water evaporation and snow sublimation), as well as variables like the evaporative stress, potential evaporation, root-zone soil moisture and surface soil moisture. A new dataset using SMOS-based input data of surface soil moisture and vegetation optical depth will also be distributed. The most important updates in GLEAM include the revision of the soil moisture data assimilation system, the evaporative stress functions and the infiltration of rainfall. In this presentation, we will highlight the changes of the methodology and present the new datasets, their validation against in situ observations and the comparisons against alternative datasets of terrestrial evaporation, such as GLDAS-Noah, ERA-Interim and previous GLEAM datasets. Preliminary results indicate that the magnitude and the spatio-temporal variability of the evaporation estimates have been slightly improved upon previous versions of the datasets.

  1. Filtering Raw Terrestrial Laser Scanning Data for Efficient and Accurate Use in Geomorphologic Modeling

    NASA Astrophysics Data System (ADS)

    Gleason, M. J.; Pitlick, J.; Buttenfield, B. P.

    2011-12-01

    Terrestrial laser scanning (TLS) represents a new and particularly effective remote sensing technique for investigating geomorphologic processes. Unfortunately, TLS data are commonly characterized by extremely large volume, heterogeneous point distribution, and erroneous measurements, raising challenges for applied researchers. To facilitate efficient and accurate use of TLS in geomorphology, and to improve accessibility for TLS processing in commercial software environments, we are developing a filtering method for raw TLS data to: eliminate data redundancy; produce a more uniformly spaced dataset; remove erroneous measurements; and maintain the ability of the TLS dataset to accurately model terrain. Our method conducts local aggregation of raw TLS data using a 3-D search algorithm based on the geometrical expression of expected random errors in the data. This approach accounts for the estimated accuracy and precision limitations of the instruments and procedures used in data collection, thereby allowing for identification and removal of potential erroneous measurements prior to data aggregation. Initial tests of the proposed technique on a sample TLS point cloud required a modest processing time of approximately 100 minutes to reduce dataset volume over 90 percent (from 12,380,074 to 1,145,705 points). Preliminary analysis of the filtered point cloud revealed substantial improvement in homogeneity of point distribution and minimal degradation of derived terrain models. We will test the method on two independent TLS datasets collected in consecutive years along a non-vegetated reach of the North Fork Toutle River in Washington. We will evaluate the tool using various quantitative, qualitative, and statistical methods. The crux of this evaluation will include a bootstrapping analysis to test the ability of the filtered datasets to model the terrain at roughly the same accuracy as the raw datasets.
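
    For comparison, the sketch below shows a generic voxel-based thinning filter: points falling in the same 3-D cell are replaced by their centroid, which removes redundancy and evens out point spacing. It is not the authors' error-geometry search algorithm, and in practice the cell size would be tied to the expected range error of the scanner.

    ```python
    # Generic voxel-grid thinning of a dense point cloud (synthetic stand-in data).
    import numpy as np

    def voxel_thin(points, cell=0.05):
        """points: (n, 3) array in metres; returns one centroid per occupied cell."""
        keys = np.floor(points / cell).astype(np.int64)
        _, inverse = np.unique(keys, axis=0, return_inverse=True)
        inverse = inverse.ravel()
        counts = np.bincount(inverse)
        sums = np.zeros((counts.size, 3))
        for dim in range(3):
            sums[:, dim] = np.bincount(inverse, weights=points[:, dim])
        return sums / counts[:, None]

    rng = np.random.default_rng(7)
    cloud = rng.uniform(0, 10, size=(500_000, 3))          # stand-in for a raw TLS scan
    thinned = voxel_thin(cloud, cell=0.2)
    print(cloud.shape[0], "->", thinned.shape[0], "points")
    ```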

  2. Defining metrics of the Quasi-Biennial Oscillation in global climate models

    NASA Astrophysics Data System (ADS)

    Schenzinger, Verena; Osprey, Scott; Gray, Lesley; Butchart, Neal

    2017-06-01

    As the dominant mode of variability in the tropical stratosphere, the Quasi-Biennial Oscillation (QBO) has been subject to extensive research. Though there is a well-developed theory of this phenomenon being forced by wave-mean flow interaction, simulating the QBO adequately in global climate models still remains difficult. This paper presents a set of metrics to characterize the morphology of the QBO using a number of different reanalysis datasets and the FU Berlin radiosonde observation dataset. The same metrics are then calculated from Coupled Model Intercomparison Project 5 and Chemistry-Climate Model Validation Activity 2 simulations which included a representation of QBO-like behaviour to evaluate which aspects of the QBO are well captured by the models and which ones remain a challenge for future model development.
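
    Two of the simplest QBO metrics of this kind, period and amplitude, can be estimated from a deseasonalised equatorial zonal-mean zonal wind series as sketched below; the input here is a synthetic 28-month oscillation rather than reanalysis or radiosonde data.

    ```python
    # Minimal QBO metrics on a synthetic equatorial zonal-mean zonal wind series.
    import numpy as np

    rng = np.random.default_rng(8)
    months = np.arange(480)                                   # 40 years of monthly values
    u_eq = 20 * np.sin(2 * np.pi * months / 28.0) + rng.normal(0, 2, months.size)

    anom = u_eq - u_eq.mean()
    smooth = np.convolve(anom, np.ones(5) / 5, mode="same")   # 5-month running mean
    crossings = np.where(np.diff(np.sign(smooth)) != 0)[0]    # easterly/westerly transitions
    period = 2 * np.diff(crossings).mean()                    # months per full cycle
    amplitude = np.sqrt(2) * anom.std()                       # amplitude of a sinusoid-like signal

    print(f"QBO period ~ {period:.1f} months, amplitude ~ {amplitude:.1f} m/s")
    ```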

  3. A Benchmark Dataset for SSVEP-Based Brain-Computer Interfaces.

    PubMed

    Wang, Yijun; Chen, Xiaogang; Gao, Xiaorong; Gao, Shangkai

    2017-10-01

    This paper presents a benchmark steady-state visual evoked potential (SSVEP) dataset acquired with a 40-target brain-computer interface (BCI) speller. The dataset consists of 64-channel Electroencephalogram (EEG) data from 35 healthy subjects (8 experienced and 27 naïve) while they performed a cue-guided target selecting task. The virtual keyboard of the speller was composed of 40 visual flickers, which were coded using a joint frequency and phase modulation (JFPM) approach. The stimulation frequencies ranged from 8 Hz to 15.8 Hz with an interval of 0.2 Hz. The phase difference between two adjacent frequencies was . For each subject, the data included six blocks of 40 trials corresponding to all 40 flickers indicated by a visual cue in a random order. The stimulation duration in each trial was five seconds. The dataset can be used as a benchmark dataset to compare the methods for stimulus coding and target identification in SSVEP-based BCIs. Through offline simulation, the dataset can be used to design new system diagrams and evaluate their BCI performance without collecting any new data. The dataset also provides high-quality data for computational modeling of SSVEPs. The dataset is freely available from http://bci.med.tsinghua.edu.cn/download.html.
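
    A common baseline for target identification on such SSVEP data is canonical correlation analysis (CCA) between each EEG epoch and sine/cosine reference signals at every candidate stimulation frequency, as sketched below on synthetic EEG; this is a generic illustration, not code distributed with the dataset.

    ```python
    # CCA-based frequency identification on a synthetic multi-channel SSVEP epoch.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    fs, dur = 250, 5.0                                       # sampling rate (Hz), epoch length (s)
    t = np.arange(int(fs * dur)) / fs

    def references(freq, n_harmonics=3):
        comps = []
        for h in range(1, n_harmonics + 1):
            comps += [np.sin(2 * np.pi * h * freq * t), np.cos(2 * np.pi * h * freq * t)]
        return np.column_stack(comps)

    def cca_score(eeg, freq):
        u, v = CCA(n_components=1).fit_transform(eeg, references(freq))
        return np.corrcoef(u[:, 0], v[:, 0])[0, 1]

    rng = np.random.default_rng(9)
    true_f = 10.0                                            # the attended flicker frequency
    eeg = 0.5 * np.sin(2 * np.pi * true_f * t)[:, None] + rng.standard_normal((t.size, 8))

    candidates = np.arange(8.0, 16.0, 0.2)                   # the speller's frequency grid
    scores = [cca_score(eeg, f) for f in candidates]
    print("identified frequency:", round(candidates[int(np.argmax(scores))], 1))
    ```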

  4. Radiation damage-He diffusivity models applied to deep-time thermochronology: Zircon and titanite (U-Th)/He datasets from cratonic settings

    NASA Astrophysics Data System (ADS)

    Guenthner, W.; DeLucia, M. S.; Marshak, S.; Reiners, P. W.; Drake, H.; Thomson, S.; Ault, A. K.; Tillberg, M.

    2017-12-01

    Advances in understanding the effects of radiation damage on He diffusion in uranium-bearing accessory minerals have shown the utility of damage-diffusivity models for interpreting datasets from geologic settings with long-term, low-temperature thermal histories. Craton interiors preserve a billion-year record of long-term, long-wavelength vertical motions of the lithosphere. Prior thermochronologic work in these settings has focused on radiation damage models used in conjunction with apatite (U-Th)/He dates to constrain Phanerozoic thermal histories. Owing to the more complex damage-diffusivity relationship in zircon, the zircon (U-Th)/He system yields both higher and, in some cases, lower temperature sensitivities than the apatite system, and this greater range in turn allows researchers to access deeper time (i.e., Proterozoic) segments of craton time-temperature histories. Here, we show two examples of this approach by focusing on zircon (U-Th)/He datasets from 1.8 Ga granitoids of the Fennoscandian Shield in southeastern Sweden, and 1.4 Ga granites and rhyolites of the Ozark Plateau in southeastern Missouri. In the Ozark dataset, the zircon (U-Th)/He data, combined with a damage-diffusivity model, predict negative correlations between date and effective uranium (eU) concentration (a measurement proportional to radiation damage) from thermal histories that include an episode of Proterozoic cooling (interpreted as exhumation) following reheating (interpreted as burial) to temperature of 260°C at 850-680 Ma. In the Fennoscandian Shield, a similar damage model-based approach yields time-temperature constraints with burial to 217°C between 944 Ma and 851 Ma, followed by exhumation from 850 to 500 Ma, and burial to 154°C between 366 Ma and 224 Ma. Our Fennoscandian Shield samples also include titanite (U-Th)/He dates that span a wide range (945-160 Ma) and are negatively correlated with eU concentration, analogous to our zircon He dataset. These results support the initial findings of Baughman et al. (2017, Tectonics), and suggest that further research into the radiation damage effect on He diffusion in titanite could yield a comprehensive damage-diffusivity model for the titanite (U-Th)/He thermochronometer.

  5. Integrative Analysis of High-throughput Cancer Studies with Contrasted Penalization

    PubMed Central

    Shi, Xingjie; Liu, Jin; Huang, Jian; Zhou, Yong; Shia, BenChang; Ma, Shuangge

    2015-01-01

    In cancer studies with high-throughput genetic and genomic measurements, integrative analysis provides a way to effectively pool and analyze heterogeneous raw data from multiple independent studies and outperforms “classic” meta-analysis and single-dataset analysis. When marker selection is of interest, the genetic basis of multiple datasets can be described using the homogeneity model or the heterogeneity model. In this study, we consider marker selection under the heterogeneity model, which includes the homogeneity model as a special case and can be more flexible. Penalization methods have been developed in the literature for marker selection. This study advances from the published ones by introducing the contrast penalties, which can accommodate the within- and across-dataset structures of covariates/regression coefficients and, by doing so, further improve marker selection performance. Specifically, we develop a penalization method that accommodates the across-dataset structures by smoothing over regression coefficients. An effective iterative algorithm, which calls an inner coordinate descent iteration, is developed. Simulation shows that the proposed method outperforms the benchmark with more accurate marker identification. The analysis of breast cancer and lung cancer prognosis studies with gene expression measurements shows that the proposed method identifies genes different from those using the benchmark and has better prediction performance. PMID:24395534

  6. Seismic Surveys Negatively Affect Humpback Whale Singing Activity off Northern Angola

    PubMed Central

    Cerchio, Salvatore; Strindberg, Samantha; Collins, Tim; Bennett, Chanda; Rosenbaum, Howard

    2014-01-01

    Passive acoustic monitoring was used to document the presence of singing humpback whales off the coast of Northern Angola, and opportunistically test for the effect of seismic survey activity in the vicinity on the number of singing whales. Two Marine Autonomous Recording Units (MARUs) were deployed between March and December 2008 in the offshore environment. Song was first heard in mid June and continued through the remaining duration of the study. Seismic survey activity was heard regularly during two separate periods, consistently throughout July and intermittently in mid-October/November. Numbers of singers were counted during the first ten minutes of every hour for the period from 24 May to 1 December, and Generalized Additive Mixed Models (GAMMs) were used to assess the effect of survey day (seasonality), hour (diel variation), moon phase and received levels of seismic survey pulses (measured from a single pulse during each ten-minute sampled period) on singer number. Application of GAMMs indicated significant seasonal variation, which was the most pronounced effect when assessing the full dataset across the entire season (p<0.001); however seasonality almost entirely dropped out of top-ranked models when applied to a reduced dataset during the July period of seismic survey activity. Diel variation was significant in both the full and reduced datasets (from p<0.01 to p<0.05) and often included in the top-ranked models. The number of singers significantly decreased with increasing received level of seismic survey pulses (from p<0.01 to p<0.05); this explanatory variable was included among the top ranked models for one MARU in the full dataset and both MARUs in the reduced dataset. This suggests that the breeding display of humpback whales is disrupted by seismic survey activity, and thus merits further attention and study, and potentially conservation action in the case of sensitive breeding populations. PMID:24618836
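
    As a hedged, simplified stand-in for the analysis described above, the sketch below fits a Poisson regression with spline terms (via statsmodels/patsy) rather than a full GAMM with random effects; all variable names and data are synthetic.

    ```python
    # Simplified GAM-like Poisson regression of singer counts on seasonality,
    # diel variation, and received level of seismic pulses. Synthetic data.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    n = 500
    df = pd.DataFrame({
        "day": rng.integers(0, 180, n),                 # survey day (seasonality)
        "hour": rng.integers(0, 24, n),                 # hour of day (diel variation)
        "received_level": rng.uniform(90, 140, n),      # seismic pulse received level (dB)
    })
    rate = np.exp(0.5 + 0.01 * df.day - 0.02 * df.received_level
                  + 0.05 * np.cos(df.hour / 24 * 2 * np.pi))
    df["singers"] = rng.poisson(rate)

    model = smf.glm("singers ~ bs(day, df=5) + bs(hour, df=4) + received_level",
                    data=df, family=sm.families.Poisson()).fit()
    # A negative coefficient would indicate fewer singers at higher received levels.
    print(model.params["received_level"])
    ```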

  8. Analysis of Multiple Precipitation Products and Preliminary Assessment of Their Impact on Global Land Data Assimilation System (GLDAS) Land Surface States

    NASA Technical Reports Server (NTRS)

    Gottschalck, Jon; Meng, Jesse; Rodell, Matt; Houser, Paul

    2005-01-01

    Land surface models (LSMs) are computer programs, similar to weather and climate prediction models, which simulate the stocks and fluxes of water (including soil moisture, snow, evaporation, and runoff) and energy (including the temperature of and sensible heat released from the soil) after they arrive on the land surface as precipitation and sunlight. It is not currently possible to measure all of the variables of interest everywhere on Earth with sufficient accuracy and space-time resolution. Hence LSMs have been developed to integrate the available observations with our understanding of the physical processes involved, using powerful computers, in order to map these stocks and fluxes as they change in time. The maps are used to improve weather forecasts, support water resources and agricultural applications, and study the Earth's water cycle and climate variability. NASA's Global Land Data Assimilation System (GLDAS) project facilitates testing of several different LSMs with a variety of input datasets (e.g., precipitation, plant type). Precipitation is arguably the most important input to LSMs. Many precipitation datasets have been produced using satellite and rain gauge observations and weather forecast models. In this study, seven different global precipitation datasets were evaluated over the United States, where dense rain gauge networks contribute to reliable precipitation maps. We then used the seven datasets as inputs to GLDAS simulations, so that we could diagnose their impacts on output stocks and fluxes of water. In terms of totals, the Climate Prediction Center (CPC) Merged Analysis of Precipitation (CMAP) had the closest agreement with the US rain gauge dataset for all seasons except winter. The CMAP precipitation was also the most closely correlated in time with the rain gauge data during spring, fall, and winter, while the satellite-based estimates performed best in summer. The GLDAS simulations revealed that modeled soil moisture is highly sensitive to precipitation, with differences in spring and summer as large as 45% depending on the choice of precipitation input.
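
    A minimal sketch of the kind of dataset inter-comparison described above, using invented product names and synthetic daily series: seasonal bias and temporal correlation of each candidate product against a gauge analysis.

    ```python
    # Illustrative sketch (invented product names, synthetic data): compare
    # candidate precipitation products against a gauge analysis by seasonal
    # bias and temporal correlation.
    import numpy as np
    import pandas as pd

    dates = pd.date_range("2001-01-01", "2003-12-31", freq="D")
    rng = np.random.default_rng(0)

    gauge = pd.Series(rng.gamma(2.0, 2.0, len(dates)), index=dates, name="gauge")
    products = pd.DataFrame({
        "CMAP_like": gauge * 1.05 + rng.normal(0, 0.5, len(dates)),
        "satellite_like": gauge * 0.9 + rng.normal(0, 1.0, len(dates)),
    }, index=dates)

    season = gauge.index.to_series().dt.month % 12 // 3  # 0=DJF, 1=MAM, 2=JJA, 3=SON
    for name, series in products.items():
        bias = (series - gauge).groupby(season).mean()
        corr = series.corr(gauge)
        print(name, "seasonal bias:", bias.round(2).to_dict(), "corr: %.2f" % corr)
    ```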

  9. USGS approach to real-time estimation of earthquake-triggered ground failure - Results of 2015 workshop

    USGS Publications Warehouse

    Allstadt, Kate E.; Thompson, Eric M.; Wald, David J.; Hamburger, Michael W.; Godt, Jonathan W.; Knudsen, Keith L.; Jibson, Randall W.; Jessee, M. Anna; Zhu, Jing; Hearne, Michael; Baise, Laurie G.; Tanyas, Hakan; Marano, Kristin D.

    2016-03-30

    The U.S. Geological Survey (USGS) Earthquake Hazards and Landslide Hazards Programs are developing plans to add quantitative hazard assessments of earthquake-triggered landsliding and liquefaction to existing real-time earthquake products (ShakeMap, ShakeCast, PAGER) using open and readily available methodologies and products. To date, prototype global statistical models have been developed and are being refined, improved, and tested. These models are a good foundation, but much work remains to achieve robust and defensible models that meet the needs of end users. In order to establish an implementation plan and identify research priorities, the USGS convened a workshop in Golden, Colorado, in October 2015. This document summarizes current (as of early 2016) capabilities, research and operational priorities, and plans for further studies that were established at this workshop. Specific priorities established during the meeting include (1) developing a suite of alternative models; (2) making use of higher resolution and higher quality data where possible; (3) incorporating newer global and regional datasets and inventories; (4) reducing barriers to accessing inventory datasets; (5) developing methods for using inconsistent or incomplete datasets in aggregate; (6) developing standardized model testing and evaluation methods; (7) improving ShakeMap shaking estimates, particularly as relevant to ground failure, such as including topographic amplification and accounting for spatial variability; and (8) developing vulnerability functions for loss estimates.

  10. High-Resolution Digital Terrain Models of the Sacramento/San Joaquin Delta Region, California

    USGS Publications Warehouse

    Coons, Tom; Soulard, Christopher E.; Knowles, Noah

    2008-01-01

    The U.S. Geological Survey (USGS) Western Region Geographic Science Center, in conjunction with the USGS Water Resources Western Branch of Regional Research, has developed a high-resolution elevation dataset covering the Sacramento/San Joaquin Delta region of California. The elevation data were compiled photogrammetrically from aerial photography (May 2002) with a scale of 1:15,000. The resulting dataset has a 10-meter horizontal resolution grid of elevation values. The vertical accuracy was determined to be 1 meter. Two versions of the elevation data are available: the first dataset has all water coded as zero, whereas the second dataset has bathymetry data merged with the elevation data. The projection of both datasets is set to UTM Zone 10, NAD 1983. The elevation data are clipped into files that spatially approximate 7.5-minute USGS quadrangles, with about 100 meters of overlap to facilitate combining the files into larger regions without data gaps. The files are named after the 7.5-minute USGS quadrangles that cover the same general spatial extent. File names that include a suffix (_b) indicate that the bathymetry data are included (for example, sac_east versus sac_east_b). These files are provided in ESRI Grid format.

  11. The road to NHDPlus — Advancements in digital stream networks and associated catchments

    USGS Publications Warehouse

    Moore, Richard B.; Dewald, Thomas A.

    2016-01-01

    A progression of advancements in Geographic Information Systems techniques for hydrologic network and associated catchment delineation has led to the production of the National Hydrography Dataset Plus (NHDPlus). NHDPlus is a digital stream network for hydrologic modeling with catchments and a suite of related geospatial data. Digital stream networks with associated catchments provide a geospatial framework for linking and integrating water-related data. Advancements in the development of NHDPlus are expected to continue to improve the capabilities of this national geospatial hydrologic framework. NHDPlus is built upon the medium-resolution NHD and, like NHD, was developed by the U.S. Environmental Protection Agency and U.S. Geological Survey to support the estimation of streamflow and stream velocity used in fate-and-transport modeling. Catchments included with NHDPlus were created by integrating vector information from the NHD and from the Watershed Boundary Dataset with the gridded land surface elevation as represented by the National Elevation Dataset. NHDPlus is an actively used and continually improved dataset. Users recognize the importance of a reliable stream network and associated catchments. The NHDPlus spatial features and associated data tables will continue to be improved to support regional water quality and streamflow models and other user-defined applications.

  12. Preprocessed Consortium for Neuropsychiatric Phenomics dataset.

    PubMed

    Gorgolewski, Krzysztof J; Durnez, Joke; Poldrack, Russell A

    2017-01-01

    Here we present preprocessed MRI data of 265 participants from the Consortium for Neuropsychiatric Phenomics (CNP) dataset. The preprocessed dataset includes minimally preprocessed data in the native, MNI and surface spaces accompanied by potential confound regressors, tissue probability masks, brain masks and transformations. In addition, the preprocessed dataset includes unthresholded group level and single subject statistical maps from all tasks included in the original dataset. We hope that availability of this dataset will greatly accelerate research.

  13. Prediction model to estimate presence of coronary artery disease: retrospective pooled analysis of existing cohorts

    PubMed Central

    Genders, Tessa S S; Steyerberg, Ewout W; Nieman, Koen; Galema, Tjebbe W; Mollet, Nico R; de Feyter, Pim J; Krestin, Gabriel P; Alkadhi, Hatem; Leschka, Sebastian; Desbiolles, Lotus; Meijs, Matthijs F L; Cramer, Maarten J; Knuuti, Juhani; Kajander, Sami; Bogaert, Jan; Goetschalckx, Kaatje; Cademartiri, Filippo; Maffei, Erica; Martini, Chiara; Seitun, Sara; Aldrovandi, Annachiara; Wildermuth, Simon; Stinn, Björn; Fornaro, Jürgen; Feuchtner, Gudrun; De Zordo, Tobias; Auer, Thomas; Plank, Fabian; Friedrich, Guy; Pugliese, Francesca; Petersen, Steffen E; Davies, L Ceri; Schoepf, U Joseph; Rowe, Garrett W; van Mieghem, Carlos A G; van Driessche, Luc; Sinitsyn, Valentin; Gopalan, Deepa; Nikolaou, Konstantin; Bamberg, Fabian; Cury, Ricardo C; Battle, Juan; Maurovich-Horvat, Pál; Bartykowszki, Andrea; Merkely, Bela; Becker, Dávid; Hadamitzky, Martin; Hausleiter, Jörg; Dewey, Marc; Zimmermann, Elke; Laule, Michael

    2012-01-01

    Objectives To develop prediction models that better estimate the pretest probability of coronary artery disease in low prevalence populations. Design Retrospective pooled analysis of individual patient data. Setting 18 hospitals in Europe and the United States. Participants Patients with stable chest pain without evidence for previous coronary artery disease, if they were referred for computed tomography (CT) based coronary angiography or catheter based coronary angiography (indicated as low and high prevalence settings, respectively). Main outcome measures Obstructive coronary artery disease (≥50% diameter stenosis in at least one vessel found on catheter based coronary angiography). Multiple imputation accounted for missing predictors and outcomes, exploiting strong correlation between the two angiography procedures. Predictive models included a basic model (age, sex, symptoms, and setting), clinical model (basic model factors and diabetes, hypertension, dyslipidaemia, and smoking), and extended model (clinical model factors and use of the CT based coronary calcium score). We assessed discrimination (c statistic), calibration, and continuous net reclassification improvement by cross validation for the four largest low prevalence datasets separately and the smaller remaining low prevalence datasets combined. Results We included 5677 patients (3283 men, 2394 women), of whom 1634 had obstructive coronary artery disease found on catheter based coronary angiography. All potential predictors were significantly associated with the presence of disease in univariable and multivariable analyses. The clinical model improved the prediction, compared with the basic model (cross validated c statistic improvement from 0.77 to 0.79, net reclassification improvement 35%); the coronary calcium score in the extended model was a major predictor (0.79 to 0.88, 102%). Calibration for low prevalence datasets was satisfactory. Conclusions Updated prediction models including age, sex, symptoms, and cardiovascular risk factors allow for accurate estimation of the pretest probability of coronary artery disease in low prevalence populations. Addition of coronary calcium scores to the prediction models improves the estimates. PMID:22692650
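
    A schematic sketch, not the authors' actual model: comparing a "basic" and a "clinical" logistic regression by cross-validated c statistic on synthetic data with invented predictor names.

    ```python
    # Synthetic comparison of a basic vs. clinical logistic model by
    # cross-validated AUC (c statistic). All variables are invented.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n = 2000
    age = rng.normal(60, 10, n)
    male = rng.integers(0, 2, n)
    typical_symptoms = rng.integers(0, 2, n)
    diabetes = rng.integers(0, 2, n)
    smoking = rng.integers(0, 2, n)

    logit = -9 + 0.08 * age + 0.8 * male + 1.2 * typical_symptoms + 0.5 * diabetes + 0.4 * smoking
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    X_basic = np.column_stack([age, male, typical_symptoms])
    X_clinical = np.column_stack([age, male, typical_symptoms, diabetes, smoking])

    for label, X in [("basic", X_basic), ("clinical", X_clinical)]:
        auc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
        print(f"{label} model: cross-validated AUC = {auc.mean():.3f}")
    ```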

  14. Assessment of NASA's Physiographic and Meteorological Datasets as Input to HSPF and SWAT Hydrological Models

    NASA Technical Reports Server (NTRS)

    Alarcon, Vladimir J.; Nigro, Joseph D.; McAnally, William H.; O'Hara, Charles G.; Engman, Edwin Ted; Toll, David

    2011-01-01

    This paper documents the use of simulated Moderate Resolution Imaging Spectroradiometer land use/land cover (MODIS-LULC), NASA-LIS generated precipitation and evapo-transpiration (ET), and Shuttle Radar Topography Mission (SRTM) datasets (in conjunction with standard land use, topographical and meteorological datasets) as input to hydrological models routinely used by the watershed hydrology modeling community. The study is focused on coastal watersheds in the Mississippi Gulf Coast, although one of the test cases focuses on an inland watershed located in the northeastern State of Mississippi, USA. The decision support tools (DSTs) into which the NASA datasets were assimilated were the Soil and Water Assessment Tool (SWAT) and the Hydrological Simulation Program - FORTRAN (HSPF). These DSTs are endorsed by several US government agencies (EPA, FEMA, USGS) for water resources management strategies. These models use physiographic and meteorological data extensively. Precipitation gages and USGS gage stations in the region were used to calibrate several HSPF and SWAT model applications. Land use and topographical datasets were swapped to assess model output sensitivities. NASA-LIS meteorological data were introduced in the calibrated model applications for simulation of watershed hydrology for a time period in which no weather data were available (1997-2006). The performance of the NASA datasets in the context of hydrological modeling was assessed through comparison of measured and model-simulated hydrographs. Overall, NASA datasets were as useful as standard land use, topographical, and meteorological datasets. Moreover, NASA datasets made possible analyses that could not be performed with the standard datasets, e.g., the introduction of land use dynamics into hydrological simulations.
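
    A minimal sketch of one common skill score for measured-versus-simulated hydrograph comparison (the Nash-Sutcliffe efficiency); the abstract does not state which statistics were used, so this is only illustrative, with invented flow values.

    ```python
    # Nash-Sutcliffe efficiency for comparing observed and simulated hydrographs.
    import numpy as np

    def nash_sutcliffe(observed, simulated):
        """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2); 1 is a perfect fit."""
        observed = np.asarray(observed, dtype=float)
        simulated = np.asarray(simulated, dtype=float)
        return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - observed.mean()) ** 2)

    # Invented example values (daily streamflow in m^3/s):
    obs = np.array([12.0, 15.0, 30.0, 22.0, 18.0, 14.0])
    sim = np.array([11.0, 16.0, 27.0, 24.0, 17.0, 15.0])
    print(f"NSE = {nash_sutcliffe(obs, sim):.2f}")
    ```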

  15. Data in support of energy performance of double-glazed windows.

    PubMed

    Shakouri, Mahmoud; Banihashemi, Saeed

    2016-06-01

    This paper provides the data used in a research project to propose a new simplified windows rating system based on saved annual energy (“Developing an empirical predictive energy-rating model for windows by using Artificial Neural Network” (Shakouri Hassanabadi and Banihashemi Namini, 2012) [1], “Climatic, parametric and non-parametric analysis of energy performance of double-glazed windows in different climates” (Banihashemi et al., 2015) [2]). A full factorial simulation study was conducted to evaluate the performance of 26 different types of windows in a four-story residential building. In order to generalize the results, the selected windows were tested in four climates of cold, tropical, temperate, and hot and arid; and four different main orientations of North, West, South and East. The accompanying datasets include the annual saved cooling and heating energy in different climates and orientations by using the selected windows. Moreover, a complete dataset is provided that includes the specifications of 26 windows, climate data, month, and orientation of the window. This dataset can be used to make predictive models for energy efficiency assessment of double glazed windows.

  16. LITHO1.0: An Updated Crust and Lithosphere Model of the Earth

    NASA Astrophysics Data System (ADS)

    Masters, G.; Ma, Z.; Laske, G.; Pasyanos, M. E.

    2011-12-01

    We are developing LITHO1.0: an updated crust and lithosphere model of the Earth. The overall plan is to take the popular CRUST2.0 model - a global model of crustal structure with a relatively poor representation of the uppermost mantle - and improve its nominal resolution to 1 degree and extend the model to include lithospheric structure. The new model, LITHO1.0, will be constrained by many different datasets including extremely large new datasets of relatively short period group velocity data. Other data sets include (but are not limited to) compilations of receiver function constraints and active source studies. To date, we have completed the compilation of extremely large global datasets of group velocity for Rayleigh and Love waves from 10mHz to 40mHz using a cluster analysis technique. We have also extended the method to measure phase velocity and are complementing the group velocity with global data sets of longer period phase data that help to constrain deep lithosphere properties. To model these data, we require a starting model for the crust at a nominal resolution of 1 degree. This has been developed by constructing a map of crustal thickness using data from receiver function and active source experiments where available, and by using CRUST2.0 where other constraints are not available. Particular care has been taken to make sure that the locations of sharp changes in crustal thickness are accurately represented. This map is then used as a template to extend CRUST2.0 to 1 degree nominal resolution and to develop starting maps of all crustal properties. We are currently modeling the data using two techniques. The first is a linearized inversion about the 3D crustal starting model. Note that it is important to use local eigenfunctions to compute Frechet derivatives due to the extreme variations in crustal structure. Another technique uses a targeted grid search method. A preliminary model for the crustal part of the model will be presented.

  17. Improving Snow Modeling by Assimilating Observational Data Collected by Citizen Scientists

    NASA Astrophysics Data System (ADS)

    Crumley, R. L.; Hill, D. F.; Arendt, A. A.; Wikstrom Jones, K.; Wolken, G. J.; Setiawan, L.

    2017-12-01

    Modeling seasonal snow pack in alpine environments includes a multiplicity of challenges caused by a lack of spatially extensive and temporally continuous observational datasets. This is partially due to the difficulty of collecting measurements in harsh, remote environments where extreme gradients in topography exist, accompanied by large model domains and inclement weather. Engaging snow enthusiasts, snow professionals, and community members to participate in the process of data collection may address some of these challenges. In this study, we use SnowModel to estimate seasonal snow water equivalence (SWE) in the Thompson Pass region of Alaska while incorporating snow depth measurements collected by citizen scientists. We develop a modeling approach to assimilate hundreds of snow depth measurements from participants in the Community Snow Observations (CSO) project (www.communitysnowobs.org). The CSO project includes a mobile application where participants record and submit geo-located snow depth measurements while working and recreating in the study area. These snow depth measurements are randomly located within the model grid at irregular time intervals over the span of four months in the 2017 water year. This snow depth observation dataset is converted into a SWE dataset by employing an empirically-based, bulk density and SWE estimation method. We then assimilate this data using SnowAssim, a sub-model within SnowModel, to constrain the SWE output by the observed data. Multiple model runs are designed to represent an array of output scenarios during the assimilation process. An effort to present model output uncertainties is included, as well as quantification of the pre- and post-assimilation divergence in modeled SWE. Early results reveal pre-assimilation SWE estimations are consistently greater than the post-assimilation estimations, and the magnitude of divergence increases throughout the snow pack evolution period. This research has implications beyond the Alaskan context because it increases our ability to constrain snow modeling outputs by making use of snow measurements collected by non-expert, citizen scientists.
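
    A hedged illustration of the depth-to-SWE conversion step mentioned above: the bulk density value and depths below are invented, and the CSO project applies its own empirically derived density model rather than a single constant.

    ```python
    # Converting citizen-science snow depths to SWE with an assumed bulk density.
    import numpy as np

    depth_cm = np.array([85.0, 120.0, 45.0, 200.0])   # submitted snow depths (invented)
    bulk_density_kg_m3 = 350.0                         # assumed seasonal bulk density

    # SWE (mm) = depth (m) * density (kg/m^3), since 1 kg/m^2 of water = 1 mm SWE
    swe_mm = (depth_cm / 100.0) * bulk_density_kg_m3
    print(swe_mm)  # [297.5 420. 157.5 700. ]
    ```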

  18. Assessment of Turbulent Shock-Boundary Layer Interaction Computations Using the OVERFLOW Code

    NASA Technical Reports Server (NTRS)

    Oliver, A. B.; Lillard, R. P.; Schwing, A. M.; Blaisdell, G. A.; Lyrintzis, A. S.

    2007-01-01

    The performance of two popular turbulence models, the Spalart-Allmaras model and Menter's SST model, and one relatively new model, Olsen & Coakley's Lag model, is evaluated using the OVERFLOW code. Turbulent shock-boundary layer interaction predictions are evaluated with three different experimental datasets: a series of 2D compression ramps at Mach 2.87, a series of 2D compression ramps at Mach 2.94, and an axisymmetric cone-flare at Mach 11. The experimental datasets include flows with no separation, moderate separation, and significant separation, and use several different experimental measurement techniques (including laser Doppler velocimetry (LDV), pitot-probe measurement, inclined hot-wire probe measurement, Preston tube skin friction measurement, and surface pressure measurement). Additionally, the OVERFLOW solutions are compared to the solutions of a second CFD code, DPLR. The predictions for weak shock-boundary layer interactions are in reasonable agreement with the experimental data. For strong shock-boundary layer interactions, all of the turbulence models overpredict the separation size and fail to predict the correct skin friction recovery distribution. In most cases, surface pressure predictions show too much upstream influence; however, including the tunnel side-wall boundary layers in the computation improves the separation predictions.

  19. An Automated Method to Identify Mesoscale Convective Complexes in the Regional Climate Model Evaluation System

    NASA Astrophysics Data System (ADS)

    Whitehall, K. D.; Jenkins, G. S.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Goodale, C. E.; Hart, A. F.; Ramirez, P.; Whittell, J.; Zimdars, P. A.

    2012-12-01

    Mesoscale convective complexes (MCCs) are large (2-3 x 10^5 km^2) nocturnal convectively-driven weather systems that are generally associated with high precipitation events of short duration (less than 12 hrs) in various locations throughout the tropics and midlatitudes (Maddox 1980). These systems are particularly important for climate in the West Sahel region, where the precipitation associated with them is a principal component of the rainfall season (Laing and Fritsch 1993). These systems occur on weather timescales and are historically identified from weather data analysis via manual and, more recently, automated processes (Miller and Fritsch 1991, Nesbett 2006, Balmey and Reason 2012). The Regional Climate Model Evaluation System (RCMES) is an open source tool designed for easy evaluation of climate and Earth system data through access to standardized datasets, and intrinsic tools that perform common analysis and visualization tasks (Hart et al. 2011). The RCMES toolkit also provides the flexibility of user-defined subroutines for further metrics, visualization and even dataset manipulation. The purpose of this study is to present a methodology for identifying MCCs in observation datasets using the RCMES framework. TRMM 3-hourly datasets will be used to demonstrate the methodology for the 2005 boreal summer. This method promotes the use of open source software for scientific data systems to address a concern shared by multiple stakeholders in the earth sciences. A historical MCC dataset provides a platform for further studies of the variability of MCC frequency on various timescales, which is important to many, including climate scientists, meteorologists, water resource managers, and agriculturalists. The methodology of using RCMES for searching and clipping datasets will engender a new realm of studies, as users of the system will no longer be restricted to solely using the datasets as they reside in their own local systems; instead, they will be afforded rapid, effective, and transparent access, processing and visualization of the wealth of remote sensing datasets and climate model outputs available.
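
    An illustrative sketch (thresholds and grid spacing invented) of the kind of automated screening such a method involves: labeling contiguous regions in a gridded field and keeping those exceeding an MCC-scale area. A full scheme would also check shape criteria and persistence across successive 3-hourly timesteps.

    ```python
    # Label contiguous "convective" regions in a gridded field and retain only
    # those whose area exceeds an MCC-scale threshold. All values are invented.
    import numpy as np
    from scipy import ndimage

    field = np.random.default_rng(8).gamma(1.0, 2.0, size=(200, 200))   # fake 2D rain-rate grid
    cell_area_km2 = 25.0 ** 2                                           # assume ~25 km grid spacing

    mask = field > 8.0                              # convective threshold (illustrative)
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1)) * cell_area_km2
    candidates = [i + 1 for i, area in enumerate(sizes) if area >= 1.0e5]  # >= 10^5 km^2
    print(f"{len(candidates)} candidate systems at this timestep")
    ```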

  20. Simulating Virtual Terminal Area Weather Data Bases for Use in the Wake Vortex Avoidance System (Wake VAS) Prediction Algorithm

    NASA Technical Reports Server (NTRS)

    Kaplan, Michael L.; Lin, Yuh-Lang

    2004-01-01

    During the research project, sounding datasets were generated for the region surrounding 9 major airports, including Dallas, TX; Boston, MA; New York, NY; Chicago, IL; St. Louis, MO; Atlanta, GA; Miami, FL; San Francisco, CA; and Los Angeles, CA. The numerical simulation of winter and summer environments during which no instrument flight rule impact was occurring at these 9 terminals was performed using the most contemporary version of the Terminal Area PBL Prediction System (TAPPS) model nested from 36 km to 6 km to 1 km horizontal resolution and very detailed vertical resolution in the planetary boundary layer. The soundings from the 1 km model were archived at 30 minute time intervals for a 24 hour period, and the vertical dependent variables as well as derived quantities, i.e., 3-dimensional wind components, temperatures, pressures, mixing ratios, turbulence kinetic energy and eddy dissipation rates, were then interpolated to 5 m vertical resolution up to 1000 m elevation above ground level. After partial validation against field experiment datasets for Dallas as well as larger scale and much coarser resolution observations at the other 8 airports, these sounding datasets were sent to NASA for use in the Virtual Air Space and Modeling program. These datasets are intended to provide representative airport weather environments for diagnosing the response of simulated wake vortices to realistic atmospheric environments. The virtual datasets are based on large scale observed atmospheric initial conditions that are dynamically interpolated in space and time. The 1 km nested-grid simulated datasets provide a very coarse and highly smoothed representation of airport environment meteorological conditions. Details concerning the airport surface forcing are virtually absent from these simulated datasets, although the observed background atmospheric processes have been compared to the simulated fields, and the fields were found to accurately replicate the flows surrounding the airport where coarse verification data were available as well as where airport scale datasets were available.
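
    A simplified sketch of the vertical-interpolation step described above, resampling a sounding onto a uniform 5 m grid up to 1000 m above ground level; the sounding values are invented.

    ```python
    # Resample a model sounding onto a uniform 5 m vertical grid.
    import numpy as np

    model_heights_m = np.array([10.0, 50.0, 120.0, 300.0, 600.0, 1000.0])
    wind_speed_ms = np.array([3.2, 4.8, 6.1, 7.5, 8.9, 9.6])

    target_heights_m = np.arange(0.0, 1000.0 + 5.0, 5.0)   # 5 m vertical resolution
    wind_interp = np.interp(target_heights_m, model_heights_m, wind_speed_ms)

    print(wind_interp.shape)  # (201,) values, one every 5 m from 0 to 1000 m
    ```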

  1. The Surface Radiation Budget over Oceans and Continents.

    NASA Astrophysics Data System (ADS)

    Garratt, J. R.; Prata, A. J.; Rotstayn, L. D.; McAvaney, B. J.; Cusack, S.

    1998-08-01

    An updated evaluation of the surface radiation budget in climate models (1994-96 versions; seven datasets available, with and without aerosols) and in two new satellite-based global datasets (with aerosols) is presented. All nine datasets capture the broad mean monthly zonal variations in the flux components and in the net radiation, with maximum differences of some 100 W m-2 occurring in the downwelling fluxes at specific latitudes. Using long-term surface observations, both from land stations and the Pacific warm pool (with typical uncertainties in the annual values varying between ±5 and 20 W m-2), excess net radiation (RN) and downwelling shortwave flux density (So) are found in all datasets, consistent with results from earlier studies [for global land, excesses of 15%-20% (12 W m-2) in RN and about 12% (20 W m-2) in So]. For the nine datasets combined, the spread in annual fluxes is significant: for RN, it is 15 (50) W m-2 over global land (Pacific warm pool) in an observed annual mean of 65 (135) W m-2; for So, it is 25 (60) W m-2 over land (warm pool) in an annual mean of 176 (197) W m-2. The effects of aerosols are included in three of the authors' datasets, based on simple aerosol climatologies and assumptions regarding aerosol optical properties. They offer guidance on the broad impact of aerosols on climate, suggesting that the inclusion of aerosols in models would reduce the annual So by 15-20 W m-2 over land and 5-10 W m-2 over the oceans. Model differences in cloud cover contribute to differences in So between datasets; for global land, this is most clearly demonstrated through the effects of cloud cover on the surface shortwave cloud forcing. The tendency for most datasets to underestimate cloudiness, particularly over global land, and possibly to underestimate atmospheric water vapor absorption, probably contributes to the excess downwelling shortwave flux at the surface.

  2. SBRML: a markup language for associating systems biology data with models.

    PubMed

    Dada, Joseph O; Spasić, Irena; Paton, Norman W; Mendes, Pedro

    2010-04-01

    Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (Systems Biology Markup Language) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities to the corresponding datasets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations. We propose the Systems Biology Results Markup Language (SBRML), an XML-based language that associates a model with several datasets. Each dataset is represented as a series of values associated with model variables, and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression and various types of simulation results. The XML Schema file for SBRML is available at http://www.comp-sys-bio.org/SBRML under the Academic Free License (AFL) v3.0.

  3. Improving Risk Adjustment for Mortality After Pediatric Cardiac Surgery: The UK PRAiS2 Model.

    PubMed

    Rogers, Libby; Brown, Katherine L; Franklin, Rodney C; Ambler, Gareth; Anderson, David; Barron, David J; Crowe, Sonya; English, Kate; Stickley, John; Tibby, Shane; Tsang, Victor; Utley, Martin; Witter, Thomas; Pagel, Christina

    2017-07-01

    Partial Risk Adjustment in Surgery (PRAiS), a risk model for 30-day mortality after children's heart surgery, has been used by the UK National Congenital Heart Disease Audit to report expected risk-adjusted survival since 2013. This study aimed to improve the model by incorporating additional comorbidity and diagnostic information. The model development dataset was all procedures performed between 2009 and 2014 in all UK and Ireland congenital cardiac centers. The outcome measure was death within each 30-day surgical episode. Model development followed an iterative process of clinical discussion and development and assessment of models using logistic regression under 25 × 5 cross-validation. Performance was measured using Akaike information criterion, the area under the receiver-operating characteristic curve (AUC), and calibration. The final model was assessed in an external 2014 to 2015 validation dataset. The development dataset comprised 21,838 30-day surgical episodes, with 539 deaths (mortality, 2.5%). The validation dataset comprised 4,207 episodes, with 97 deaths (mortality, 2.3%). The updated risk model included 15 procedural, 11 diagnostic, and 4 comorbidity groupings, and nonlinear functions of age and weight. Performance under cross-validation was: median AUC of 0.83 (range, 0.82 to 0.83), median calibration slope and intercept of 0.92 (range, 0.64 to 1.25) and -0.23 (range, -1.08 to 0.85) respectively. In the validation dataset, the AUC was 0.86 (95% confidence interval [CI], 0.82 to 0.89), and the calibration slope and intercept were 1.01 (95% CI, 0.83 to 1.18) and 0.11 (95% CI, -0.45 to 0.67), respectively, showing excellent performance. A more sophisticated PRAiS2 risk model for UK use was developed with additional comorbidity and diagnostic information, alongside age and weight as nonlinear variables. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
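
    A hedged sketch of one standard way to obtain a calibration slope and intercept (refitting the outcome on the model's linear predictor); the data and "model" below are synthetic placeholders, not PRAiS2.

    ```python
    # Calibration slope/intercept via logistic recalibration on the linear predictor.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 5000
    true_lp = rng.normal(-3.7, 1.0, n)                  # true log-odds of 30-day death
    y = rng.binomial(1, 1 / (1 + np.exp(-true_lp)))
    predicted_risk = 1 / (1 + np.exp(-(0.9 * true_lp - 0.2)))   # a slightly miscalibrated model

    lp = np.log(predicted_risk / (1 - predicted_risk))  # the model's linear predictor
    X = sm.add_constant(lp)
    fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    intercept, slope = fit.params
    print(f"calibration slope = {slope:.2f}, intercept = {intercept:.2f}")  # ideal: 1 and 0
    ```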

  4. WRF-TMH: predicting transmembrane helix by fusing composition index and physicochemical properties of amino acids.

    PubMed

    Hayat, Maqsood; Khan, Asifullah

    2013-05-01

    Membrane protein is the prime constituent of a cell, which performs a role of mediator between intra- and extracellular processes. The prediction of transmembrane (TM) helix and its topology provides essential information regarding the function and structure of membrane proteins. However, prediction of TM helix and its topology is a challenging issue in bioinformatics and computational biology due to experimental complexities and lack of its established structures. Therefore, the location and orientation of TM helix segments are predicted from topogenic sequences. In this regard, we propose the WRF-TMH model for effectively predicting TM helix segments. In this model, information is extracted from membrane protein sequences using a compositional index and physicochemical properties. The redundant and irrelevant features are eliminated through singular value decomposition. The selected features provided by these feature extraction strategies are then fused to develop a hybrid model. Weighted random forest is adopted as a classification approach. We have used two benchmark datasets including low- and high-resolution datasets. Tenfold cross-validation is employed to assess the performance of the WRF-TMH model at different levels, including per protein, per segment, and per residue. The success rates of the WRF-TMH model are quite promising and are the best reported so far on the same datasets. It is observed that the WRF-TMH model might play a substantial role, and will provide essential information for further structural and functional studies on membrane proteins. The accompanying web predictor is accessible at http://111.68.99.218/WRF-TMH/.
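
    A generic stand-in, not the authors' exact pipeline: SVD-based feature reduction followed by a class-weighted random forest under tenfold cross-validation, applied to synthetic residue-level features.

    ```python
    # Reduce a feature matrix with SVD and classify with a class-weighted
    # random forest. Features and labels are synthetic placeholders.
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 60))          # composition + physicochemical features per residue
    y = (X[:, :5].sum(axis=1) + rng.normal(0, 1, 1000) > 0).astype(int)  # 1 = TM helix residue

    model = make_pipeline(TruncatedSVD(n_components=20, random_state=0),
                          RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                                 random_state=0))
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")   # tenfold cross-validation
    print(f"mean per-residue accuracy = {scores.mean():.3f}")
    ```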

  5. Development of machine learning models for diagnosis of glaucoma.

    PubMed

    Kim, Seong Jae; Cho, Kyong Jin; Oh, Sejong

    2017-01-01

    The study aimed to develop machine learning models that have strong prediction power and interpretability for the diagnosis of glaucoma based on retinal nerve fiber layer (RNFL) thickness and visual field (VF). We collected various candidate features from the examination of retinal nerve fiber layer (RNFL) thickness and visual field (VF). We also developed synthesized features from the original features. We then selected the best features for classification (diagnosis) through feature evaluation. We used 100 cases of data as a test dataset and 399 cases of data as a training and validation dataset. To develop the glaucoma prediction model, we considered four machine learning algorithms: C5.0, random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN). We repeatedly composed a learning model using the training dataset and evaluated it by using the validation dataset. Finally, we selected the learning model that produced the highest validation accuracy. We analyzed the quality of the models using several measures. The random forest model shows the best performance, and the C5.0, SVM, and KNN models show similar accuracy. In the random forest model, the classification accuracy is 0.98, sensitivity is 0.983, specificity is 0.975, and AUC is 0.979. The developed prediction models show high accuracy, sensitivity, specificity, and AUC in classifying between glaucomatous and healthy eyes. They can be used to predict glaucoma for unseen examination records. Clinicians may reference the prediction results and be able to make better decisions. Multiple learning models may be combined to increase prediction accuracy. The C5.0 model includes decision rules for prediction, which can be used to explain the reasons for specific predictions.
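
    A brief sketch of how the reported metrics (accuracy, sensitivity, specificity, AUC) are derived from classifier predictions; the labels and probabilities below are invented.

    ```python
    # Compute accuracy, sensitivity, specificity and AUC from predictions.
    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
    y_prob = np.array([0.9, 0.8, 0.4, 0.2, 0.1, 0.3, 0.7, 0.6, 0.85, 0.05])
    y_pred = (y_prob >= 0.5).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("accuracy   :", (tp + tn) / len(y_true))
    print("sensitivity:", tp / (tp + fn))
    print("specificity:", tn / (tn + fp))
    print("AUC        :", roc_auc_score(y_true, y_prob))
    ```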

  6. Data on the natural ventilation performance of windcatcher with anti-short-circuit device (ASCD).

    PubMed

    Nejat, Payam; Calautit, John Kaiser; Majid, Muhd Zaimi Abd; Hughes, Ben Richard; Jomehzadeh, Fatemeh

    2016-12-01

    This article presents the datasets which were the results of the study explained in the research paper 'Anti-short-circuit device: a new solution for short-circuiting in windcatcher and improvement of natural ventilation performance' (P. Nejat, J.K. Calautit, M.Z. Abd. Majid, B.R. Hughes, F. Jomehzadeh, 2016) [1], which introduces a new technique to reduce or prevent short-circuiting in a two-sided windcatcher and also lowers the indoor CO2 concentration and improves the ventilation distribution. Here, we provide details of the numerical modeling set-up and data collection method to facilitate reproducibility. The datasets include indoor airflow, ventilation rates and CO2 concentration data at several points in the flow field. The CAD geometry of the windcatcher models is also included.

  7. Fleet DNA (Presentation)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Walkowicz, K.; Duran, A.

    2014-06-01

    The Fleet DNA project objectives include capturing and quantifying drive cycle and technology variation for the multitude of medium- and heavy-duty vocations; providing a common data storage warehouse for medium- and heavy-duty vehicle fleet data across DOE activities and laboratories; and integrating existing DOE tools, models, and analyses to provide data-driven decision making capabilities. Fleet DNA advantages include: for Government - providing in-use data for standard drive cycle development, R&D, tech targets, and rule making; for OEMs - real-world usage datasets provide concrete examples of customer use profiles; for fleets - vocational datasets help illustrate how to maximize return on technology investments; for Funding Agencies - ways are revealed to optimize the impact of financial incentive offers; and for researchers - a data source is provided for modeling and simulation.

  8. Weighted analysis of paired microarray experiments.

    PubMed

    Kristiansson, Erik; Sjögren, Anders; Rudemo, Mats; Nerman, Olle

    2005-01-01

    In microarray experiments quality often varies, for example between samples and between arrays. The need for quality control is therefore strong. A statistical model and a corresponding analysis method are suggested for experiments with pairing, including designs with individuals observed before and after treatment and many experiments with two-colour spotted arrays. The model is of mixed type with some parameters estimated by an empirical Bayes method. Differences in quality are modelled by individual variances and correlations between repetitions. The method is applied to three real and several simulated datasets. Two of the real datasets are of Affymetrix type with patients profiled before and after treatment, and the third dataset is of two-colour spotted cDNA type. In all cases, the patients or arrays had different estimated variances, leading to distinctly unequal weights in the analysis. We also suggest plots which illustrate the variances and correlations that affect the weights computed by our analysis method. For simulated data, the improvement relative to previously published methods without weighting is shown to be substantial.
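
    A hedged, much-simplified analogue of the weighting idea (inverse-variance weights so noisier arrays contribute less); the full method is an empirical-Bayes mixed model, and the values below are invented.

    ```python
    # Combine paired log-ratios with inverse-variance weights for one gene.
    import numpy as np

    log_ratios = np.array([1.2, 0.8, 1.5, 0.2])        # per-patient before/after log-ratios
    array_variances = np.array([0.1, 0.3, 0.2, 1.5])   # estimated quality (variance) per array

    weights = 1.0 / array_variances
    weighted_mean = np.sum(weights * log_ratios) / np.sum(weights)
    se = np.sqrt(1.0 / np.sum(weights))
    print(f"weighted effect = {weighted_mean:.2f} +/- {se:.2f}")
    ```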

  9. Prediction of Mutagenicity of Chemicals from Their Calculated Molecular Descriptors: A Case Study with Structurally Homogeneous versus Diverse Datasets.

    PubMed

    Basak, Subhash C; Majumdar, Subhabrata

    2015-01-01

    Variation in high-dimensional data is often caused by a few latent factors, and hence dimension reduction or variable selection techniques are often useful in gathering useful information from the data. In this paper we consider two such recent methods: interrelated two-way clustering and envelope models. We couple these methods with traditional statistical procedures like ridge regression and linear discriminant analysis, and apply them to two datasets which have more predictors than samples (i.e., the n < p scenario) and several types of molecular descriptors. One of these datasets consists of a congeneric group of amines, while the other has a much more diverse collection of compounds. The difference in prediction results between these two datasets for both methods supports the hypothesis that for a congeneric set of compounds, descriptors of a certain type are enough to provide good QSAR models, but as the dataset grows more diverse, including a variety of descriptors can improve model quality considerably.

  10. Non-minimal derivative coupling gravity in cosmology

    NASA Astrophysics Data System (ADS)

    Gumjudpai, Burin; Rangdee, Phongsaphat

    2015-11-01

    We give a brief review of the non-minimal derivative coupling (NMDC) scalar field theory in which there is non-minimal coupling between the scalar field derivative term and the Einstein tensor. We assume that the expansion is of power-law type or super-acceleration type for small redshift. The Lagrangian includes the NMDC term, a free kinetic term, a cosmological constant term and a barotropic matter term. For a value of the coupling constant that is compatible with inflation, we use the combined WMAP9 (WMAP9 + eCMB + BAO + H_0) dataset, the PLANCK + WP dataset, and the PLANCK TT, TE, EE + lowP + Lensing + ext datasets to find the value of the cosmological constant in the model. Modeling the expansion as a power law gives a negative cosmological constant, while the phantom power-law (super-acceleration) expansion gives a positive cosmological constant with a large error bar. The value obtained is of the same order as in the ΛCDM model, since at late times the NMDC effect is tiny due to small curvature.

  11. The SPoRT-WRF: Evaluating the Impact of NASA Datasets on Convective Forecasts

    NASA Technical Reports Server (NTRS)

    Zavodsky, Bradley; Case, Jonathan; Kozlowski, Danielle; Molthan, Andrew

    2012-01-01

    The Short-term Prediction Research and Transition Center (SPoRT) is a collaborative partnership between NASA and operational forecasting entities, including a number of National Weather Service offices. SPoRT transitions real-time NASA products and capabilities to its partners to address specific operational forecast challenges. One challenge that forecasters face is applying convection-allowing numerical models to predict mesoscale convective weather. In order to address this specific forecast challenge, SPoRT produces real-time mesoscale model forecasts using the Weather Research and Forecasting (WRF) model that includes unique NASA products and capabilities. Currently, the SPoRT configuration of the WRF model (SPoRT-WRF) incorporates the 4-km Land Information System (LIS) land surface data, 1-km SPoRT sea surface temperature analysis and 1-km Moderate Resolution Imaging Spectroradiometer (MODIS) greenness vegetation fraction (GVF) analysis, and retrieved thermodynamic profiles from the Atmospheric Infrared Sounder (AIRS). The LIS, SST, and GVF data are all integrated into the SPoRT-WRF through adjustments to the initial and boundary conditions, and the AIRS data are assimilated into a 9-hour SPoRT-WRF forecast each day at 0900 UTC. This study dissects the overall impact of the NASA datasets and the individual surface and atmospheric component datasets on daily mesoscale forecasts. A case study covering the super tornado outbreak across the Central and Southeastern United States during 25-27 April 2011 is examined. Three different forecasts are analyzed, including the SPoRT-WRF (NASA surface and atmospheric data), the SPoRT-WRF without AIRS (NASA surface data only), and the operational National Severe Storms Laboratory (NSSL) WRF (control with no NASA data). The forecasts are compared qualitatively by examining simulated versus observed radar reflectivity. Differences between the simulated reflectivity are further investigated using convective parameters along with model soundings to determine the impacts of the various NASA datasets. Additionally, quantitative evaluation of select meteorological parameters is performed using the Meteorological Evaluation Tools model verification package to compare forecasts to in situ surface and upper air observations.

  12. Extended Kalman Filter for Estimation of Parameters in Nonlinear State-Space Models of Biochemical Networks

    PubMed Central

    Sun, Xiaodian; Jin, Li; Xiong, Momiao

    2008-01-01

    It is system dynamics that determines the function of cells, tissues and organisms. Developing mathematical models and estimating their parameters are essential issues for studying the dynamic behaviors of biological systems, which include metabolic networks, genetic regulatory networks and signal transduction pathways, under perturbation of external stimuli. In general, biological dynamic systems are partially observed. Therefore, a natural way to model dynamic biological systems is to employ nonlinear state-space equations. Although statistical methods for parameter estimation of linear models in biological dynamic systems have been developed intensively in recent years, the estimation of both states and parameters of nonlinear dynamic systems remains a challenging task. In this report, we apply the extended Kalman filter (EKF) to the estimation of both states and parameters of nonlinear state-space models. To evaluate the performance of the EKF for parameter estimation, we apply the EKF to a simulation dataset and two real datasets: the JAK-STAT signal transduction pathway and the Ras/Raf/MEK/ERK signaling pathway datasets. The preliminary results show that the EKF can accurately estimate the parameters and predict states in nonlinear state-space equations for modeling dynamic biochemical networks. PMID:19018286
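
    A minimal sketch of joint state/parameter estimation with an EKF on a toy one-dimensional decay model standing in for a biochemical rate equation; the system, noise levels, and initial guesses below are invented.

    ```python
    # EKF over the augmented state z = [x, theta] for x' = -theta * x.
    import numpy as np

    dt, n_steps = 0.1, 200
    true_theta = 0.5
    rng = np.random.default_rng(4)

    # Simulate noisy observations of x
    x_true = np.empty(n_steps)
    x_true[0] = 5.0
    for k in range(1, n_steps):
        x_true[k] = x_true[k - 1] - dt * true_theta * x_true[k - 1]
    y_obs = x_true + rng.normal(0, 0.05, n_steps)

    z = np.array([4.0, 0.1])                 # initial guess (theta unknown)
    P = np.diag([1.0, 1.0])
    Q = np.diag([1e-4, 1e-5])                # process noise (small random walk on theta)
    R = 0.05 ** 2
    H = np.array([[1.0, 0.0]])

    for k in range(n_steps):
        # Predict: x <- x - dt*theta*x, theta <- theta
        x, theta = z
        z = np.array([x - dt * theta * x, theta])
        F = np.array([[1.0 - dt * theta, -dt * x],
                      [0.0, 1.0]])            # Jacobian of the transition
        P = F @ P @ F.T + Q
        # Update with the observation of x
        S = H @ P @ H.T + R
        K = P @ H.T / S
        z = z + (K * (y_obs[k] - z[0])).ravel()
        P = (np.eye(2) - K @ H) @ P

    print(f"estimated theta = {z[1]:.3f} (true value {true_theta})")
    ```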

  13. Wind and wave dataset for Matara, Sri Lanka

    NASA Astrophysics Data System (ADS)

    Luo, Yao; Wang, Dongxiao; Priyadarshana Gamage, Tilak; Zhou, Fenghua; Madusanka Widanage, Charith; Liu, Taiwei

    2018-01-01

    We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset the most comprehensive information possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).

  14. A multiscale Bayesian data integration approach for mapping air dose rates around the Fukushima Daiichi Nuclear Power Plant.

    PubMed

    Wainwright, Haruko M; Seki, Akiyuki; Chen, Jinsong; Saito, Kimiaki

    2017-02-01

    This paper presents a multiscale data integration method to estimate the spatial distribution of air dose rates in the regional scale around the Fukushima Daiichi Nuclear Power Plant. We integrate various types of datasets, such as ground-based walk and car surveys, and airborne surveys, all of which have different scales, resolutions, spatial coverage, and accuracy. This method is based on geostatistics to represent spatial heterogeneous structures, and also on Bayesian hierarchical models to integrate multiscale, multi-type datasets in a consistent manner. The Bayesian method allows us to quantify the uncertainty in the estimates, and to provide the confidence intervals that are critical for robust decision-making. Although this approach is primarily data-driven, it has great flexibility to include mechanistic models for representing radiation transport or other complex correlations. We demonstrate our approach using three types of datasets collected at the same time over Fukushima City in Japan: (1) coarse-resolution airborne surveys covering the entire area, (2) car surveys along major roads, and (3) walk surveys in multiple neighborhoods. Results show that the method can successfully integrate three types of datasets and create an integrated map (including the confidence intervals) of air dose rates over the domain in high resolution. Moreover, this study provides us with various insights into the characteristics of each dataset, as well as radiocaesium distribution. In particular, the urban areas show high heterogeneity in the contaminant distribution due to human activities as well as large discrepancy among different surveys due to such heterogeneity. Copyright © 2016 Elsevier Ltd. All rights reserved.
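
    A highly simplified, hedged stand-in for the integration idea: a single Gaussian process fit to observations from two surveys with different per-point noise levels. The full approach in the paper is a Bayesian hierarchical geostatistical model; the transect data below are synthetic.

    ```python
    # Fuse a coarse, noisy "airborne" transect with a small, accurate "walk
    # survey" cluster by giving each point its own noise variance.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    rng = np.random.default_rng(5)
    x_air = np.linspace(0, 10, 20)[:, None]       # coarse, noisy "airborne" transect
    x_walk = rng.uniform(3, 6, 8)[:, None]        # dense, accurate "walk survey" cluster
    true = lambda x: 1.0 + 0.5 * np.sin(x.ravel())

    X = np.vstack([x_air, x_walk])
    y = np.concatenate([true(x_air) + rng.normal(0, 0.3, 20),
                        true(x_walk) + rng.normal(0, 0.05, 8)])
    noise = np.concatenate([np.full(20, 0.3 ** 2), np.full(8, 0.05 ** 2)])  # per-point variance

    gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=2.0), alpha=noise)
    gp.fit(X, y)
    mean, std = gp.predict(np.linspace(0, 10, 50)[:, None], return_std=True)
    print(mean[:5].round(2), std[:5].round(2))     # estimates with uncertainty levels
    ```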

  15. Data discovery with DATS: exemplar adoptions and lessons learned.

    PubMed

    Gonzalez-Beltran, Alejandra N; Campbell, John; Dunn, Patrick; Guijarro, Diana; Ionescu, Sanda; Kim, Hyeoneui; Lyle, Jared; Wiser, Jeffrey; Sansone, Susanna-Assunta; Rocca-Serra, Philippe

    2018-01-01

    The DAta Tag Suite (DATS) is a model supporting dataset description, indexing, and discovery. It is available as an annotated serialization with schema.org, a vocabulary used by major search engines, thus making the datasets discoverable on the web. DATS underlies DataMed, the National Institutes of Health Big Data to Knowledge Data Discovery Index prototype, which aims to provide a "PubMed for datasets." The experience gained while indexing a heterogeneous range of >60 repositories in DataMed helped in evaluating DATS's entities, attributes, and scope. In this work, 3 additional exemplary and diverse data sources were mapped to DATS by their representatives or experts, offering a deep scan of DATS fitness against a new set of existing data. The procedure, including feedback from users and implementers, resulted in DATS implementation guidelines and best practices, and identification of a path for evolving and optimizing the model. Finally, the work exposed additional needs when defining datasets for indexing, especially in the context of clinical and observational information. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association.
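
    A hedged illustration of the general idea rather than the official DATS schema: a minimal schema.org-style JSON-LD record of the kind that makes a dataset discoverable by web search engines; all field values are invented.

    ```python
    # Build and print a minimal schema.org Dataset description (JSON-LD).
    import json

    dataset_record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "Example clinical imaging dataset",
        "description": "De-identified MRI scans with accompanying phenotype tables.",
        "keywords": ["MRI", "phenotype", "data discovery"],
        "identifier": "doi:10.0000/example",          # placeholder identifier
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/download",  # placeholder URL
            "encodingFormat": "application/zip",
        },
    }
    print(json.dumps(dataset_record, indent=2))
    ```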

  16. A bias-corrected CMIP5 dataset for Africa using the CDF-t method - a contribution to agricultural impact studies

    NASA Astrophysics Data System (ADS)

    Moise Famien, Adjoua; Janicot, Serge; Delfin Ochou, Abe; Vrac, Mathieu; Defrance, Dimitri; Sultan, Benjamin; Noël, Thomas

    2018-03-01

    The objective of this paper is to present a new dataset of bias-corrected CMIP5 global climate model (GCM) daily data over Africa. This dataset was obtained using the cumulative distribution function transform (CDF-t) method, a method that has been applied to several regions and contexts but never to Africa. Here CDF-t has been applied over the period 1950-2099 combining Historical runs and climate change scenarios for six variables: precipitation, mean near-surface air temperature, near-surface maximum air temperature, near-surface minimum air temperature, surface downwelling shortwave radiation, and wind speed, which are critical variables for agricultural purposes. WFDEI has been used as the reference dataset to correct the GCMs. Evaluation of the results over West Africa has been carried out on a list of priority user-based metrics that were discussed and selected with stakeholders. It includes simulated yield using a crop model simulating maize growth. These bias-corrected GCM data have been compared with another available dataset of bias-corrected GCMs using WATCH Forcing Data as the reference dataset. The impact of WFD, WFDEI, and also EWEMBI reference datasets has been also examined in detail. It is shown that CDF-t is very effective at removing the biases and reducing the high inter-GCM scattering. Differences with other bias-corrected GCM data are mainly due to the differences among the reference datasets. This is particularly true for surface downwelling shortwave radiation, which has a significant impact in terms of simulated maize yields. Projections of future yields over West Africa are quite different, depending on the bias-correction method used. However all these projections show a similar relative decreasing trend over the 21st century.
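
    A simplified empirical quantile mapping, offered as a hedged stand-in for CDF-t (which additionally accounts for the change in the model CDF between the historical and future periods); the distributions below are synthetic.

    ```python
    # Map biased GCM values onto the observed distribution via empirical quantiles.
    import numpy as np

    def quantile_map(model_hist, obs_hist, model_values):
        """Map model values onto the observed distribution via their empirical quantiles."""
        model_hist = np.sort(model_hist)
        obs_hist = np.sort(obs_hist)
        # Empirical CDF of the raw model values, then inverse CDF of the observations
        quantiles = np.searchsorted(model_hist, model_values) / len(model_hist)
        quantiles = np.clip(quantiles, 0, 1)
        return np.quantile(obs_hist, quantiles)

    rng = np.random.default_rng(6)
    obs = rng.gamma(2.0, 3.0, 5000)          # reference (e.g. WFDEI-like) daily values
    gcm = rng.gamma(2.0, 4.0, 5000)          # biased GCM values for the same period
    corrected = quantile_map(gcm, obs, gcm[:10])
    print(corrected.round(2))
    ```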

  17. A Solar Data Model for Use in Virtual Observatories

    NASA Astrophysics Data System (ADS)

    Reardon, K. P.; Bentley, R. D.; Messerotti, M.; Giordano, S.

    2004-05-01

    The creation of a virtual solar observatory relies heavily on merging the metadata describing different datasets into a common form so that they can be handled in a standard way for all associated resources. In order to bring together the varied data descriptions that already exist, it is necessary to have a common framework on which all the different datasets can be represented. The definition of this framework is done through a data model which attempts to provide a simplified but realistic description of the various entities that make up a data set or solar resource. We present the solar data model which has been developed as part of the European Grid of Solar Observations (EGSO) project. This model attempts to include many of the different elements in the field of solar physics, including data producers, data sets, event lists, and data providers. This global picture can then be used to focus on the particular elements required for a specific implementation. We present the different aspects of the model and describe some systems in which portions of this model have been implemented.

  18. A generalized framework unifying image registration and respiratory motion models and incorporating image reconstruction, for partial image data or full images

    NASA Astrophysics Data System (ADS)

    McClelland, Jamie R.; Modat, Marc; Arridge, Simon; Grimes, Helen; D'Souza, Derek; Thomas, David; O' Connell, Dylan; Low, Daniel A.; Kaza, Evangelia; Collins, David J.; Leach, Martin O.; Hawkes, David J.

    2017-06-01

    Surrogate-driven respiratory motion models relate the motion of the internal anatomy to easily acquired respiratory surrogate signals, such as the motion of the skin surface. They are usually built by first using image registration to determine the motion from a number of dynamic images, and then fitting a correspondence model relating the motion to the surrogate signals. In this paper we present a generalized framework that unifies the image registration and correspondence model fitting into a single optimization. This allows the use of ‘partial’ imaging data, such as individual slices, projections, or k-space data, where it would not be possible to determine the motion from an individual frame of data. Motion compensated image reconstruction can also be incorporated using an iterative approach, so that both the motion and a motion-free image can be estimated from the partial image data. The framework has been applied to real 4DCT, Cine CT, multi-slice CT, and multi-slice MR data, as well as simulated datasets from a computer phantom. This includes the use of a super-resolution reconstruction method for the multi-slice MR data. Good results were obtained for all datasets, including quantitative results for the 4DCT and phantom datasets where the ground truth motion was known or could be estimated.
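
    Schematically, and in generic notation rather than the paper's own, the single optimization that unifies registration and correspondence-model fitting can be written as

      \hat{\mathbf{R}} = \arg\min_{\mathbf{R}} \sum_{t=1}^{N_t} \mathcal{C}\Big( A_t\big[\, I_0 \circ \mathbf{T}\big(\mathbf{M}(\mathbf{s}_t;\mathbf{R})\big) \big],\; \mathbf{p}_t \Big) + \lambda\, \mathcal{J}(\mathbf{R})

    where $I_0$ is the motion-free reference image, $\mathbf{s}_t$ the surrogate signal values for frame $t$, $\mathbf{M}(\mathbf{s}_t;\mathbf{R})$ the correspondence model with parameters $\mathbf{R}$, $\mathbf{T}$ the resulting spatial transformation, $A_t$ the acquisition operator producing the partial data (a slice, projection, or k-space samples), $\mathbf{p}_t$ the measured partial data, $\mathcal{C}$ an image dissimilarity measure, and $\mathcal{J}$ an optional regularization term; motion-compensated reconstruction alternates updates of $I_0$ and $\mathbf{R}$.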

  19. A generalized framework unifying image registration and respiratory motion models and incorporating image reconstruction, for partial image data or full images.

    PubMed

    McClelland, Jamie R; Modat, Marc; Arridge, Simon; Grimes, Helen; D'Souza, Derek; Thomas, David; Connell, Dylan O'; Low, Daniel A; Kaza, Evangelia; Collins, David J; Leach, Martin O; Hawkes, David J

    2017-06-07

    Surrogate-driven respiratory motion models relate the motion of the internal anatomy to easily acquired respiratory surrogate signals, such as the motion of the skin surface. They are usually built by first using image registration to determine the motion from a number of dynamic images, and then fitting a correspondence model relating the motion to the surrogate signals. In this paper we present a generalized framework that unifies the image registration and correspondence model fitting into a single optimization. This allows the use of 'partial' imaging data, such as individual slices, projections, or k-space data, where it would not be possible to determine the motion from an individual frame of data. Motion compensated image reconstruction can also be incorporated using an iterative approach, so that both the motion and a motion-free image can be estimated from the partial image data. The framework has been applied to real 4DCT, Cine CT, multi-slice CT, and multi-slice MR data, as well as simulated datasets from a computer phantom. This includes the use of a super-resolution reconstruction method for the multi-slice MR data. Good results were obtained for all datasets, including quantitative results for the 4DCT and phantom datasets where the ground truth motion was known or could be estimated.

  20. A generalized framework unifying image registration and respiratory motion models and incorporating image reconstruction, for partial image data or full images

    PubMed Central

    McClelland, Jamie R; Modat, Marc; Arridge, Simon; Grimes, Helen; D’Souza, Derek; Thomas, David; Connell, Dylan O’; Low, Daniel A; Kaza, Evangelia; Collins, David J; Leach, Martin O; Hawkes, David J

    2017-01-01

    Abstract Surrogate-driven respiratory motion models relate the motion of the internal anatomy to easily acquired respiratory surrogate signals, such as the motion of the skin surface. They are usually built by first using image registration to determine the motion from a number of dynamic images, and then fitting a correspondence model relating the motion to the surrogate signals. In this paper we present a generalized framework that unifies the image registration and correspondence model fitting into a single optimization. This allows the use of ‘partial’ imaging data, such as individual slices, projections, or k-space data, where it would not be possible to determine the motion from an individual frame of data. Motion compensated image reconstruction can also be incorporated using an iterative approach, so that both the motion and a motion-free image can be estimated from the partial image data. The framework has been applied to real 4DCT, Cine CT, multi-slice CT, and multi-slice MR data, as well as simulated datasets from a computer phantom. This includes the use of a super-resolution reconstruction method for the multi-slice MR data. Good results were obtained for all datasets, including quantitative results for the 4DCT and phantom datasets where the ground truth motion was known or could be estimated. PMID:28195833

  1. Sparse models for correlative and integrative analysis of imaging and genetic data

    PubMed Central

    Lin, Dongdong; Cao, Hongbao; Calhoun, Vince D.

    2014-01-01

    The development of advanced medical imaging technologies and high-throughput genomic measurements has enhanced our ability to understand their interplay as well as their relationship with human behavior by integrating these two types of datasets. However, the high dimensionality and heterogeneity of these datasets present a challenge to conventional statistical methods; there is a high demand for the development of both correlative and integrative analysis approaches. Here, we review our recent work on developing sparse representation based approaches to address this challenge. We show how sparse models are applied to the correlation and integration of imaging and genetic data for biomarker identification. We present examples of how these approaches are used for the detection of risk genes and classification of complex diseases such as schizophrenia. Finally, we discuss future directions for the integration of multiple imaging and genomic datasets including their interactions such as epistasis. PMID:25218561
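
    As a toy illustration of one sparse correlative analysis (not the authors' specific sparse representation models), the sketch below uses an L1-penalized regression to select a handful of SNP-like predictors of a simulated imaging-derived phenotype; all variables and data are synthetic.

      import numpy as np
      from sklearn.linear_model import LassoCV

      # Toy data: 200 subjects, 500 genetic features (e.g. SNP dosages),
      # one imaging-derived phenotype driven by a handful of them.
      rng = np.random.default_rng(1)
      X = rng.binomial(2, 0.3, size=(200, 500)).astype(float)
      true_effects = np.zeros(500)
      true_effects[[10, 42, 300]] = [0.8, -0.6, 0.5]
      y = X @ true_effects + rng.normal(0, 1.0, 200)

      # The L1 penalty drives most coefficients to exactly zero, which is what
      # makes the fitted model usable as a biomarker-selection step.
      model = LassoCV(cv=5).fit(X, y)
      selected = np.flatnonzero(model.coef_)
      print("selected feature indices:", selected)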

  2. Multimodal Teaching Analytics: Automated Extraction of Orchestration Graphs from Wearable Sensor Data.

    PubMed

    Prieto, Luis P; Sharma, Kshitij; Kidzinski, Łukasz; Rodríguez-Triana, María Jesús; Dillenbourg, Pierre

    2018-04-01

    The pedagogical modelling of everyday classroom practice is an interesting kind of evidence, both for educational research and teachers' own professional development. This paper explores the usage of wearable sensors and machine learning techniques to automatically extract orchestration graphs (teaching activities and their social plane over time), on a dataset of 12 classroom sessions enacted by two different teachers in different classroom settings. The dataset included mobile eye-tracking as well as audiovisual and accelerometry data from sensors worn by the teacher. We evaluated both time-independent and time-aware models, achieving median F1 scores of about 0.7-0.8 on leave-one-session-out k-fold cross-validation. Although these results show the feasibility of this approach, they also highlight the need for larger datasets, recorded in a wider variety of classroom settings, to provide automated tagging of classroom practice that can be used in everyday practice across multiple teachers.
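
    The evaluation protocol described, leave-one-session-out cross-validation scored with F1, can be sketched as follows; the features, labels and classifier are random placeholders (the sensor feature extraction is not shown), so the printed score only illustrates the procedure, not the paper's results.

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import f1_score
      from sklearn.model_selection import LeaveOneGroupOut

      # Hypothetical per-time-window features (X), activity labels (y) and the
      # session each window came from (groups); 12 sessions as in the paper.
      rng = np.random.default_rng(2)
      X = rng.normal(size=(1200, 20))
      y = rng.integers(0, 4, size=1200)          # e.g. four orchestration activities
      groups = np.repeat(np.arange(12), 100)

      scores = []
      for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
          clf = RandomForestClassifier(n_estimators=200, random_state=0)
          clf.fit(X[train_idx], y[train_idx])
          pred = clf.predict(X[test_idx])
          scores.append(f1_score(y[test_idx], pred, average="macro"))

      print("median leave-one-session-out F1:", round(float(np.median(scores)), 3))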

  3. Estimating aboveground biomass in the boreal forests of the Yukon River Basin, Alaska

    NASA Astrophysics Data System (ADS)

    Ji, L.; Wylie, B. K.; Nossov, D.; Peterson, B.; Waldrop, M. P.; McFarland, J.; Alexander, H. D.; Mack, M. C.; Rover, J. A.; Chen, X.

    2011-12-01

    Quantification of aboveground biomass (AGB) in Alaska's boreal forests is essential to accurately evaluate terrestrial carbon stocks and dynamics in northern high-latitude ecosystems. However, regional AGB datasets with spatially detailed information (<500 m) are not available for this extensive and remote area. Our goal was to map AGB at 30-m resolution for the boreal forests in the Yukon River Basin of Alaska using recent Landsat data and ground measurements. We collected field data in the Yukon River Basin from 2008 to 2010. Ground measurements included diameter at breast height (DBH) or basal diameter (BD) for live and dead trees and shrubs (>1 m tall), which were converted to plot-level AGB using allometric equations. We acquired Landsat Enhanced Thematic Mapper Plus (ETM+) images from the Web Enabled Landsat Data (WELD) that provides multi-date composites of top-of-atmosphere reflectance and brightness temperature for Alaska. From the WELD images, we generated a three-year (2008 - 2010) image composite for the Yukon River Basin using a series of compositing criteria including non-saturation, non-cloudiness, maximal normalized difference vegetation index (NDVI), and maximal brightness temperature. Airborne lidar datasets were acquired for two sub-regions in the central basin in 2009, which were converted to vegetation height datasets using the bare-earth digital surface model (DSM) and the first-return DSM. We created a multiple regression model in which the response variable was the field-observed AGB and the predictor variables were Landsat-derived reflectance, brightness temperature, and spectral vegetation indices including NDVI, soil adjusted vegetation index (SAVI), enhanced vegetation index (EVI), normalized difference infrared index (NDII), and normalized difference water index (NDWI). Principal component analysis was incorporated in the regression model to remedy the multicollinearity problems caused by high correlations between predictor variables. The model fitted the observed data well with an R-square of 0.62, mean absolute error of 29.1 Mg/ha, and mean bias error of 3.9 Mg/ha. By applying this model to the Landsat mosaic, we generated a 30-m AGB map for the boreal forests in the Yukon River Basin. Validation of the Landsat-derived AGB using the lidar dataset indicated a significant correlation between the AGB estimates and the lidar-derived canopy height. The production of a basin-wide boreal forest AGB dataset will provide an important biophysical parameter for the modeling and investigation of Alaska's ecosystems.
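
    The regression-with-PCA step described above is essentially principal component regression; the sketch below reproduces that pattern on synthetic, deliberately collinear spectral predictors, so the variables and numbers are placeholders rather than the Yukon River Basin data.

      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.linear_model import LinearRegression
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      # Synthetic stand-ins for collinear Landsat predictors (bands + indices)
      # and plot-level aboveground biomass (AGB, Mg/ha).
      rng = np.random.default_rng(3)
      base = rng.normal(size=(150, 3))
      X = np.hstack([base, base @ rng.normal(size=(3, 7)) + 0.05 * rng.normal(size=(150, 7))])
      agb = 80 + 25 * base[:, 0] - 10 * base[:, 1] + rng.normal(0, 20, 150)

      # PCA on standardized predictors removes the multicollinearity before the
      # ordinary least-squares fit, mirroring the study's regression model.
      pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
      pcr.fit(X, agb)
      pred = pcr.predict(X)
      print("R^2:", round(pcr.score(X, agb), 2),
            "MAE (Mg/ha):", round(float(np.mean(np.abs(pred - agb))), 1))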

  4. Utah FORGE Groundwater Data

    DOE Data Explorer

    Joe Moore

    2016-07-20

    This submission includes two modelled drawdown scenarios with new supply well locations, a total dissolved solids (TDS) concentration grid (a raster dataset representing the spatial distribution of TDS), and an Excel spreadsheet containing well data.

  5. Quantification of source impact to PM using three-dimensional weighted factor model analysis on multi-site data

    NASA Astrophysics Data System (ADS)

    Shi, Guoliang; Peng, Xing; Huangfu, Yanqi; Wang, Wei; Xu, Jiao; Tian, Yingze; Feng, Yinchang; Ivey, Cesunica E.; Russell, Armistead G.

    2017-07-01

    Source apportionment technologies are used to understand the impacts of important sources on particulate matter (PM) air quality, and are widely used for both scientific studies and air quality management. Generally, receptor models apportion speciated PM data from a single sampling site. With the development of large-scale monitoring networks, PM speciation data are observed at multiple sites in an urban area. For these situations, the models should account for three factors, or dimensions, of the PM, including the chemical species concentrations, sampling periods and sampling site information, suggesting the potential power of a three-dimensional source apportionment approach. However, the ordinary three-dimensional Parallel Factor Analysis (PARAFAC) model does not always work well in real environmental situations for multi-site receptor datasets. In this work, a new three-way receptor model, called the "multi-site three-way factor analysis" (multi-site WFA3) model, is proposed to deal with multi-site receptor datasets. Synthetic datasets were developed and introduced into the new model to test its performance. Average absolute errors (AAE, between estimated and true contributions) for the extracted sources were all less than 50%. Additionally, three-dimensional ambient datasets from a Chinese mega-city, Chengdu, were analyzed using this new model to assess the application. Four factors are extracted by the multi-site WFA3 model: secondary sources have the highest contributions (64.73 and 56.24 μg/m3), followed by vehicular exhaust (30.13 and 33.60 μg/m3), crustal dust (26.12 and 29.99 μg/m3) and coal combustion (10.73 and 14.83 μg/m3). The model was also compared to PMF, with general agreement, though PMF suggested a lower crustal contribution.

  6. EpiK: A Knowledge Base for Epidemiological Modeling and Analytics of Infectious Diseases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hasan, S. M. Shamimul; Fox, Edward A.; Bisset, Keith

    Computational epidemiology seeks to develop computational methods to study the distribution and determinants of health-related states or events (including disease), and the application of this study to the control of diseases and other health problems. Recent advances in computing and data sciences have led to the development of innovative modeling environments to support this important goal. The datasets used to drive the dynamic models as well as the data produced by these models present unique challenges owing to their size, heterogeneity and diversity. These datasets form the basis of effective and easy to use decision support and analytical environments. As a result, it is important to develop scalable data management systems to store, manage and integrate these datasets. In this paper, we develop EpiK—a knowledge base that facilitates the development of decision support and analytical environments to support epidemic science. An important goal is to develop a framework that links the input as well as output datasets to facilitate effective spatio-temporal and social reasoning that is critical in planning and intervention analysis before and during an epidemic. The data management framework links modeling workflow data and its metadata using a controlled vocabulary. The metadata captures information about storage, the mapping between the linked model and the physical layout, and relationships to support services. EpiK is designed to support agent-based modeling and analytics frameworks—aggregate models can be seen as special cases and are thus supported. We use semantic web technologies to create a representation of the datasets that encapsulates both the location and the schema heterogeneity. The choice of RDF as a representation language is motivated by the diversity and growth of the datasets that need to be integrated. A query bank is developed—the queries capture a broad range of questions that can be posed and answered during a typical case study pertaining to disease outbreaks. The queries are constructed using SPARQL Protocol and RDF Query Language (SPARQL) over the EpiK. EpiK can hide schema and location heterogeneity while efficiently supporting queries that span the computational epidemiology modeling pipeline: from model construction to simulation output. As a result, we show that the performance of benchmark queries varies significantly with respect to the choice of hardware underlying the database and resource description framework (RDF) engine.
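
    EpiK's actual schema, controlled vocabulary and query bank are not reproduced here; the sketch below only illustrates, with rdflib and a made-up namespace, the general pattern of storing modeling-pipeline metadata as RDF triples and querying them with SPARQL.

      from rdflib import Graph, Literal, Namespace, RDF

      # Hypothetical vocabulary standing in for a controlled modeling vocabulary.
      EPI = Namespace("http://example.org/epik#")

      g = Graph()
      run = EPI["run42"]
      g.add((run, RDF.type, EPI.SimulationRun))
      g.add((run, EPI.usesModel, EPI.AgentBasedModel))
      g.add((run, EPI.region, Literal("Montgomery County")))
      g.add((run, EPI.peakInfections, Literal(10452)))

      # A SPARQL query over the simulation metadata.
      query = """
      PREFIX epi: <http://example.org/epik#>
      SELECT ?run ?region ?peak WHERE {
          ?run a epi:SimulationRun ;
               epi:region ?region ;
               epi:peakInfections ?peak .
      }
      """
      for row in g.query(query):
          print(row.run, row.region, row.peak)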

  7. EpiK: A Knowledge Base for Epidemiological Modeling and Analytics of Infectious Diseases

    DOE PAGES

    Hasan, S. M. Shamimul; Fox, Edward A.; Bisset, Keith; ...

    2017-11-06

    Computational epidemiology seeks to develop computational methods to study the distribution and determinants of health-related states or events (including disease), and the application of this study to the control of diseases and other health problems. Recent advances in computing and data sciences have led to the development of innovative modeling environments to support this important goal. The datasets used to drive the dynamic models as well as the data produced by these models present unique challenges owing to their size, heterogeneity and diversity. These datasets form the basis of effective and easy to use decision support and analytical environments. As a result, it is important to develop scalable data management systems to store, manage and integrate these datasets. In this paper, we develop EpiK—a knowledge base that facilitates the development of decision support and analytical environments to support epidemic science. An important goal is to develop a framework that links the input as well as output datasets to facilitate effective spatio-temporal and social reasoning that is critical in planning and intervention analysis before and during an epidemic. The data management framework links modeling workflow data and its metadata using a controlled vocabulary. The metadata captures information about storage, the mapping between the linked model and the physical layout, and relationships to support services. EpiK is designed to support agent-based modeling and analytics frameworks—aggregate models can be seen as special cases and are thus supported. We use semantic web technologies to create a representation of the datasets that encapsulates both the location and the schema heterogeneity. The choice of RDF as a representation language is motivated by the diversity and growth of the datasets that need to be integrated. A query bank is developed—the queries capture a broad range of questions that can be posed and answered during a typical case study pertaining to disease outbreaks. The queries are constructed using SPARQL Protocol and RDF Query Language (SPARQL) over the EpiK. EpiK can hide schema and location heterogeneity while efficiently supporting queries that span the computational epidemiology modeling pipeline: from model construction to simulation output. As a result, we show that the performance of benchmark queries varies significantly with respect to the choice of hardware underlying the database and resource description framework (RDF) engine.

  8. Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation

    NASA Astrophysics Data System (ADS)

    Tobon-Gomez, Catalina; Sukno, Federico M.; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F.

    2012-07-01

    Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be imaging modality and protocol specific. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from MRI simulated datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18%; LV MASS errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. Both for real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.

  9. Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation.

    PubMed

    Tobon-Gomez, Catalina; Sukno, Federico M; Butakoff, Constantine; Huguet, Marina; Frangi, Alejandro F

    2012-07-07

    Training active shape models requires collecting manual ground-truth meshes in a large image database. While shape information can be reused across multiple imaging modalities, intensity information needs to be imaging modality and protocol specific. In this context, this study has two main purposes: (1) to test the potential of using intensity models learned from MRI simulated datasets and (2) to test the potential of including a measure of reliability during the matching process to increase robustness. We used a population of 400 virtual subjects (XCAT phantom), and two clinical populations of 40 and 45 subjects. Virtual subjects were used to generate simulated datasets (MRISIM simulator). Intensity models were trained both on simulated and real datasets. The trained models were used to segment the left ventricle (LV) and right ventricle (RV) from real datasets. Segmentations were also obtained with and without reliability information. Performance was evaluated with point-to-surface and volume errors. Simulated intensity models obtained average accuracy comparable to inter-observer variability for LV segmentation. The inclusion of reliability information reduced volume errors in hypertrophic patients (EF errors from 17 ± 57% to 10 ± 18%; LV MASS errors from -27 ± 22 g to -14 ± 25 g), and in heart failure patients (EF errors from -8 ± 42% to -5 ± 14%). The RV model of the simulated images needs further improvement to better resemble image intensities around the myocardial edges. Both for real and simulated models, reliability information increased segmentation robustness without penalizing accuracy.

  10. CometQuest: A Rosetta Adventure

    NASA Technical Reports Server (NTRS)

    Leon, Nancy J.; Fisher, Diane K.; Novati, Alexander; Chmielewski, Artur B.; Fitzpatrick, Austin J.; Angrum, Andrea

    2012-01-01


  11. Tiled WMS/KML Server V2

    NASA Technical Reports Server (NTRS)

    Plesea, Lucian

    2012-01-01

    This software is a higher-performance implementation of tiled WMS, with integral support for KML and time-varying data. This software is compliant with the Open Geospatial WMS standard, and supports KML natively as a WMS return type, including support for the time attribute. Regionated KML wrappers are generated that match the existing tiled WMS dataset. PNG and JPG formats are supported, and the software is implemented as an Apache 2.0 module that supports a threading execution model that is capable of supporting very high request rates. The module intercepts and responds to WMS requests that match certain patterns and returns the existing tiles. If a KML format that matches an existing pyramid and tile dataset is requested, regionated KML is generated and returned to the requesting application. In addition, KML requests that do not match the existing tile datasets generate a KML response that includes the corresponding JPG WMS request, effectively adding KML support to a backing WMS server.
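
    The server itself is an Apache module, but the kind of request it answers can be illustrated from the client side; the sketch below builds a standard OGC WMS 1.1.1 GetMap URL with a TIME parameter, where the endpoint and layer name are placeholders.

      from urllib.parse import urlencode

      # Standard OGC WMS GetMap parameters; endpoint and layer are placeholders.
      params = {
          "SERVICE": "WMS",
          "VERSION": "1.1.1",
          "REQUEST": "GetMap",
          "LAYERS": "example_layer",
          "SRS": "EPSG:4326",
          "BBOX": "-180,-90,180,90",
          "WIDTH": 512,
          "HEIGHT": 256,
          "FORMAT": "image/jpeg",
          "TIME": "2012-01-01",       # the time attribute such a server can support
      }
      url = "https://example.org/wms?" + urlencode(params)
      print(url)
      # A tiled server recognizes requests whose BBOX and size match an existing
      # tile pyramid and returns the pre-generated tile directly.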

  12. A study on identification of bacteria in environmental samples using single-cell Raman spectroscopy: feasibility and reference libraries.

    PubMed

    Baritaux, Jean-Charles; Simon, Anne-Catherine; Schultz, Emmanuelle; Emain, C; Laurent, P; Dinten, Jean-Marc

    2016-05-01

    We report on our recent efforts towards identifying bacteria in environmental samples by means of Raman spectroscopy. We established a database of Raman spectra from bacteria submitted to various environmental conditions. This dataset was used to verify that Raman typing is possible from measurements performed in non-ideal conditions. Starting from the same dataset, we then varied the phenotype and matrix diversity content included in the reference library used to train the statistical model. The results show that it is possible to obtain models with an extended coverage of spectral variabilities, compared to environment-specific models trained on spectra from a restricted set of conditions. Broad coverage models are desirable for environmental samples since the exact conditions of the bacteria cannot be controlled.

  13. Evaluation of the Global Land Data Assimilation System (GLDAS) air temperature data products

    USGS Publications Warehouse

    Ji, Lei; Senay, Gabriel B.; Verdin, James P.

    2015-01-01

    There is a high demand for agrohydrologic models to use gridded near-surface air temperature data as the model input for estimating regional and global water budgets and cycles. The Global Land Data Assimilation System (GLDAS) developed by combining simulation models with observations provides a long-term gridded meteorological dataset at the global scale. However, the GLDAS air temperature products have not been comprehensively evaluated, although the accuracy of the products was assessed in limited areas. In this study, the daily 0.25° resolution GLDAS air temperature data are compared with two reference datasets: 1) 1-km-resolution gridded Daymet data (2002 and 2010) for the conterminous United States and 2) global meteorological observations (2000–11) archived from the Global Historical Climatology Network (GHCN). The comparison of the GLDAS datasets with the GHCN datasets, including 13 511 weather stations, indicates a fairly high accuracy of the GLDAS data for daily temperature. The quality of the GLDAS air temperature data, however, is not always consistent in different regions of the world; for example, some areas in Africa and South America show relatively low accuracy. Spatial and temporal analyses reveal a high agreement between GLDAS and Daymet daily air temperature datasets, although spatial details in high mountainous areas are not sufficiently estimated by the GLDAS data. The evaluation of the GLDAS data demonstrates that the air temperature estimates are generally accurate, but caution should be taken when the data are used in mountainous areas or places with sparse weather stations.

  14. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    PubMed Central

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restricted. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. Simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach. PMID:23938111
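
    For reference, the MCP building block of the approach is commonly written as

      \rho(t;\lambda,\gamma) = \lambda \int_0^{|t|} \Big(1 - \frac{x}{\gamma\lambda}\Big)_{+} dx = \begin{cases} \lambda|t| - \dfrac{t^{2}}{2\gamma}, & |t| \le \gamma\lambda, \\ \tfrac{1}{2}\gamma\lambda^{2}, & |t| > \gamma\lambda, \end{cases}

    and one common composite "sparse group" construction (shown only schematically here, not necessarily the paper's exact penalty) applies an outer MCP to the summed inner MCPs of the coefficients within each group, $\sum_{g} \rho\big(\sum_{j \in g} \rho(|\beta_{j}|;\lambda_2,\gamma);\lambda_1,\gamma\big)$, so that both whole groups and individual coefficients can be shrunk exactly to zero.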

  15. Next-Generation Machine Learning for Biological Networks.

    PubMed

    Camacho, Diogo M; Collins, Katherine M; Powers, Rani K; Costello, James C; Collins, James J

    2018-06-14

    Machine learning, a collection of data-analytical techniques aimed at building predictive models from multi-dimensional datasets, is becoming integral to modern biological research. By enabling one to generate models that learn from large datasets and make predictions on likely outcomes, machine learning can be used to study complex cellular systems such as biological networks. Here, we provide a primer on machine learning for life scientists, including an introduction to deep learning. We discuss opportunities and challenges at the intersection of machine learning and network biology, which could impact disease biology, drug discovery, microbiome research, and synthetic biology. Copyright © 2018 Elsevier Inc. All rights reserved.

  16. Statistical procedures for analyzing mental health services data.

    PubMed

    Elhai, Jon D; Calhoun, Patrick S; Ford, Julian D

    2008-08-15

    In mental health services research, analyzing service utilization data often poses serious problems, given the presence of substantially skewed data distributions. This article presents a non-technical introduction to statistical methods specifically designed to handle the complexly distributed datasets that represent mental health service use, including Poisson, negative binomial, zero-inflated, and zero-truncated regression models. A flowchart is provided to assist the investigator in selecting the most appropriate method. Finally, a dataset of mental health service use reported by medical patients is described, and a comparison of results across several different statistical methods is presented. Implications of matching data analytic techniques appropriately with the often complexly distributed datasets of mental health services utilization variables are discussed.
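
    As a minimal illustration of the count-regression models named above, the sketch below fits Poisson and negative binomial GLMs with statsmodels to simulated, overdispersed service-use counts; the variable names and data are invented for the example.

      import numpy as np
      import statsmodels.api as sm

      # Simulated, right-skewed counts of service visits with one predictor.
      rng = np.random.default_rng(4)
      n_obs = 500
      severity = rng.normal(size=n_obs)
      mu = np.exp(0.5 + 0.7 * severity)
      visits = rng.negative_binomial(n=2, p=2 / (2 + mu))   # overdispersed counts

      X = sm.add_constant(severity)

      poisson_fit = sm.GLM(visits, X, family=sm.families.Poisson()).fit()
      negbin_fit = sm.GLM(visits, X, family=sm.families.NegativeBinomial()).fit()

      # Overdispersion makes the negative binomial fit preferable here;
      # comparing AIC is one simple way to see it.
      print("Poisson AIC:", round(poisson_fit.aic, 1),
            "NegBin AIC:", round(negbin_fit.aic, 1))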

  17. Free internet datasets for streamflow modelling using SWAT in the Johor river basin, Malaysia

    NASA Astrophysics Data System (ADS)

    Tan, M. L.

    2014-02-01

    Streamflow modelling is a mathematical and computational approach that represents the terrestrial hydrological cycle digitally and is used for water resources assessment. However, such modelling endeavours require a large amount of data. Generally, governmental departments produce and maintain these datasets, which can make the data difficult to obtain due to bureaucratic constraints. In some countries, the availability and quality of geospatial and climate datasets remain a critical issue owing to many factors, such as a lack of ground stations, expertise, technology, or financial support, or wartime conditions. To overcome this problem, this research used public domain datasets from the Internet as "input" to a streamflow model. The intention is to simulate daily and monthly streamflow of the Johor River Basin in Malaysia. The model used is the Soil and Water Assessment Tool (SWAT). As input, freely available data including a digital elevation model (DEM), land use information, and soil and climate data were used. The model was validated with in-situ streamflow information obtained from the Rantau Panjang station for the year 2006. The coefficient of determination and Nash-Sutcliffe efficiency were 0.35/0.02 for daily simulated streamflow and 0.92/0.21 for monthly simulated streamflow, respectively. The results show that free data can provide a better simulation at a monthly scale than at a daily scale in a tropical region. A sensitivity analysis and calibration procedure should be conducted in order to maximize the "goodness-of-fit" between simulated and observed streamflow. The application of Internet datasets promises acceptable performance of streamflow modelling. This research demonstrates that public domain data are suitable for streamflow modelling in a tropical river basin within acceptable accuracy.
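
    The two skill scores quoted above can be computed as follows; the observed and simulated streamflow series are toy numbers, included only to show the definitions (assuming the usual forms of the Nash-Sutcliffe efficiency and the coefficient of determination).

      import numpy as np

      def nash_sutcliffe(obs, sim):
          """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations
          (1 is perfect, 0 means no better than the observed mean)."""
          obs, sim = np.asarray(obs, float), np.asarray(sim, float)
          return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

      def coeff_determination(obs, sim):
          """Squared Pearson correlation between observed and simulated flow."""
          return float(np.corrcoef(obs, sim)[0, 1] ** 2)

      # Toy monthly streamflow (m^3/s): the simulation follows the seasonal
      # pattern but with bias and noise, which NSE penalizes more than R^2 does.
      months = np.arange(24)
      obs = 50 + 30 * np.sin(2 * np.pi * months / 12)
      sim = 0.8 * obs + 12 + np.random.default_rng(5).normal(0, 6, 24)

      print("NSE:", round(nash_sutcliffe(obs, sim), 2),
            "R^2:", round(coeff_determination(obs, sim), 2))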

  18. Topic modeling for cluster analysis of large biological and medical datasets

    PubMed Central

    2014-01-01

    Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. PMID:25350106
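
    The simplest of the three strategies, highest probable topic assignment, can be sketched with scikit-learn's LDA implementation as below; the count matrix is a random placeholder standing in for PFGE or expression profiles, so the resulting clusters are illustrative only.

      import numpy as np
      from sklearn.decomposition import LatentDirichletAllocation

      # Placeholder count matrix: rows are samples, columns are discrete
      # features; real biological or medical data would replace this.
      rng = np.random.default_rng(6)
      counts = rng.poisson(1.5, size=(300, 80))

      n_topics = 5
      lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
      doc_topic = lda.fit_transform(counts)        # per-sample topic proportions

      # "Highest probable topic assignment": each sample's cluster is the topic
      # with the largest posterior proportion.
      clusters = doc_topic.argmax(axis=1)
      print("cluster sizes:", np.bincount(clusters, minlength=n_topics))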

  19. Topic modeling for cluster analysis of large biological and medical datasets.

    PubMed

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.

  20. Digital Elevation Models of the Pre-Eruption 2000 Crater and 2004-07 Dome-Building Eruption at Mount St. Helens, Washington, USA

    USGS Publications Warehouse

    Messerich, J.A.; Schilling, S.P.; Thompson, R.A.

    2008-01-01

    Presented in this report are 27 digital elevation model (DEM) datasets for the crater area of Mount St. Helens. These datasets include pre-eruption baseline data collected in 2000, incremental model subsets collected during the 2004-07 dome building eruption, and associated shaded-relief image datasets. Each dataset was collected photogrammetrically with digital softcopy methods employing a combination of manual collection and iterative compilation of x,y,z coordinate triplets utilizing autocorrelation techniques. DEM data points collected using autocorrelation methods were rigorously edited in stereo and manually corrected to ensure conformity with the ground surface. Data were first collected as a triangulated irregular network (TIN) then interpolated to a grid format. DEM data are based on aerotriangulated photogrammetric solutions for aerial photograph strips flown at a nominal scale of 1:12,000 using a combination of surveyed ground control and photograph-identified control points. The 2000 DEM is based on aerotriangulation of four strips totaling 31 photographs. Subsequent DEMs collected during the course of the eruption are based on aerotriangulation of single aerial photograph strips consisting of between three and seven 1:12,000-scale photographs (two to six stereo pairs). Most datasets were based on three or four stereo pairs. Photogrammetric errors associated with each dataset are presented along with ground control used in the photogrammetric aerotriangulation. The temporal increase in area of deformation in the crater as a result of dome growth, deformation, and translation of glacial ice resulted in continual adoption of new ground control points and abandonment of others during the course of the eruption. Additionally, seasonal snow cover precluded the consistent use of some ground control points.

  1. Status and Preliminary Evaluation for Chinese Re-Analysis Datasets

    NASA Astrophysics Data System (ADS)

    bin, zhao; chunxiang, shi; tianbao, zhao; dong, si; jingwei, liu

    2016-04-01

    Based on the operational T639L60 spectral model combined with a hybrid GSI assimilation system, and using meteorological observations including radiosondes, buoys, and satellites, a set of Chinese Re-Analysis (CRA) datasets is being developed by the National Meteorological Information Center (NMIC) of the China Meteorological Administration (CMA). The datasets are run at 30 km (0.28° latitude/longitude) resolution, which is higher than that of most existing reanalysis datasets. The reanalysis is intended to enhance the accuracy of historical synoptic analyses and to support detailed investigation of various weather and climate systems. The reanalysis is currently at the stage of preliminary experimental analysis. One year of forecast data, from June 2013 to May 2014, has been simulated and used for synoptic and climate evaluation. We first examine the model prediction ability with the new assimilation system and find that it shows significant improvement in the Northern and Southern Hemispheres owing to the addition of new satellite data; compared with the operational T639L60 model, upper-level prediction is clearly improved and overall prediction stability is enhanced. In the climatological analysis, compared with the ERA-40, NCEP/NCAR and NCEP/DOE reanalyses, the results show that simulated surface temperature is slightly low over land and high over the ocean, the 850-hPa specific humidity anomaly is weakened, and the zonal wind anomaly is concentrated in the equatorial tropics. Meanwhile, the reanalysis dataset reproduces various climate indices well, such as the subtropical high index and the East-Asia subtropical Summer Monsoon Index (ESMI), especially the Indian and western North Pacific monsoon indices. Later we will further improve the assimilation system and dynamical simulation performance, and produce a 40-year (1979-2018) reanalysis dataset. It will provide a more comprehensive basis for synoptic and climate diagnosis.

  2. Uncertainty Assessment of the NASA Earth Exchange Global Daily Downscaled Climate Projections (NEX-GDDP) Dataset

    NASA Technical Reports Server (NTRS)

    Wang, Weile; Nemani, Ramakrishna R.; Michaelis, Andrew; Hashimoto, Hirofumi; Dungan, Jennifer L.; Thrasher, Bridget L.; Dixon, Keith W.

    2016-01-01

    The NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset comprises downscaled climate projections that are derived from 21 General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) and across two of the four greenhouse gas emissions scenarios (RCP4.5 and RCP8.5). Each of the climate projections includes daily maximum temperature, minimum temperature, and precipitation for the periods from 1950 through 2100, and the spatial resolution is 0.25 degrees (approximately 25 km x 25 km). The GDDP dataset has been warmly received by the science community for conducting studies of climate change impacts at local to regional scales, but a comprehensive evaluation of its uncertainties is still missing. In this study, we apply the Perfect Model Experiment framework (Dixon et al. 2016) to quantify the key sources of uncertainties from the observational baseline dataset, the downscaling algorithm, and some intrinsic assumptions (e.g., the stationary assumption) inherent to the statistical downscaling techniques. We developed a set of metrics to evaluate downscaling errors resulting from bias-correction ("quantile-mapping"), spatial disaggregation, as well as the temporal-spatial non-stationarity of climate variability. Our results highlight the spatial disaggregation (or interpolation) errors, which dominate the overall uncertainties of the GDDP dataset, especially over heterogeneous and complex terrains (e.g., mountains and coastal areas). In comparison, the temporal errors in the GDDP dataset tend to be more constrained. Our results also indicate that the downscaled daily precipitation has relatively larger uncertainties than the temperature fields, reflecting the rather stochastic nature of precipitation in space. Therefore, our results provide insights into improving statistical downscaling algorithms and products in the future.

  3. SVM-PB-Pred: SVM based protein block prediction method using sequence profiles and secondary structures.

    PubMed

    Suresh, V; Parthasarathy, S

    2014-01-01

    We developed a support vector machine based web server called SVM-PB-Pred, to predict the Protein Block for any given amino acid sequence. The input features of SVM-PB-Pred include i) sequence profiles (PSSM) and ii) actual secondary structures (SS) from the DSSP method or predicted secondary structures from the NPS@ and GOR4 methods. Three combined input features, PSSM+SS(DSSP), PSSM+SS(NPS@) and PSSM+SS(GOR4), were used to train and test the SVM models. Similarly, four datasets, RS90, DB433, LI1264 and SP1577, were used to develop the SVM models. The four SVM models developed were tested using three different benchmarking tests, namely (i) self-consistency, (ii) seven-fold cross-validation and (iii) independent case tests. The maximum possible prediction accuracy of ~70% was observed in the self-consistency test for the SVM models of both the LI1264 and SP1577 datasets, where the PSSM+SS(DSSP) input features were used for testing. The prediction accuracies were reduced to ~53% for PSSM+SS(NPS@) and ~43% for PSSM+SS(GOR4) in the independent case test for the SVM models of the same two datasets. Using our method, it is possible to predict the protein block letters for any query protein sequence with ~53% accuracy, when the SP1577 dataset and predicted secondary structures from the NPS@ server are used. The SVM-PB-Pred server can be freely accessed through http://bioinfo.bdu.ac.in/~svmpbpred.
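
    The sketch below is not the SVM-PB-Pred server itself; it only illustrates the general recipe of training an SVM on concatenated profile (PSSM) and secondary-structure features with seven-fold cross-validation, using random placeholder features and block labels.

      import numpy as np
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      # Placeholder features per residue window: a 20-column PSSM slice plus a
      # 3-state secondary-structure one-hot; labels stand in for block letters.
      rng = np.random.default_rng(7)
      pssm = rng.normal(size=(2000, 20))
      ss = np.eye(3)[rng.integers(0, 3, 2000)]
      X = np.hstack([pssm, ss])                  # combined PSSM + SS input features
      y = rng.integers(0, 16, 2000)              # 16 protein block classes

      clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
      scores = cross_val_score(clf, X, y, cv=7)  # seven-fold cross-validation
      print("mean CV accuracy:", round(float(scores.mean()), 3))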

  4. Coupled Land Surface-Subsurface Hydrogeophysical Inverse Modeling to Estimate Soil Organic Carbon Content in an Arctic Tundra

    NASA Astrophysics Data System (ADS)

    Tran, A. P.; Dafflon, B.; Hubbard, S.

    2017-12-01

    Soil organic carbon (SOC) is crucial for predicting carbon-climate feedbacks in the vulnerable organic-rich Arctic region. However, it is challenging to estimate this property due to the general limitations of conventional core sampling and analysis methods. In this study, we develop an inversion scheme that uses single or multiple datasets, including soil liquid water content, temperature and ERT data, to estimate the vertical profile of SOC content. Our approach relies on the fact that SOC content strongly influences soil hydrological-thermal parameters, and therefore, indirectly controls the spatiotemporal dynamics of soil liquid water content, temperature and their correlated electrical resistivity. The scheme offers several advantages. First, this is the first time SOC content has been estimated using a coupled hydrogeophysical inversion. Second, by using the Community Land Model, we can account for the land surface dynamics (evapotranspiration, snow accumulation and melting) and ice/liquid phase transitions. Third, we combine a deterministic and an adaptive Markov chain Monte Carlo optimization algorithm to better estimate the posterior distributions of the desired model parameters. Finally, the simulated subsurface variables are explicitly linked to soil electrical resistivity via petrophysical and geophysical models. We validate the developed scheme using synthetic experiments. The results show that, compared to inversion of a single dataset, joint inversion of these datasets significantly reduces parameter uncertainty. The joint inversion approach is able to estimate SOC content within the shallow active layer with high reliability. Next, we apply the scheme to estimate SOC content along an intensive ERT transect in Barrow, Alaska using multiple datasets acquired in the 2013-2015 period. The preliminary results show a good agreement between modeled and measured soil temperature, thaw layer thickness and electrical resistivity. The accuracy of the estimated SOC content will be evaluated by comparison with measurements from soil samples along the transect. Our study presents a new surface-subsurface, deterministic-stochastic hydrogeophysical inversion approach, as well as the benefit of including multiple types of data to estimate SOC and associated hydrological-thermal dynamics.
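
    The coupled CLM/petrophysical forward model is far too heavy to reproduce here; the sketch below shows only the stochastic half of such an inversion, a generic Metropolis-Hastings sampler applied to a toy one-parameter forward model, so the "SOC" parameter and its posterior are purely illustrative.

      import numpy as np

      rng = np.random.default_rng(8)

      def forward(soc):
          """Toy forward model: maps a soil-organic-carbon-like parameter to a
          predicted observable (a stand-in for the full CLM + petrophysics chain)."""
          return 10.0 - 4.0 * soc

      # Synthetic observation generated with a "true" parameter of 0.6.
      obs = forward(0.6) + rng.normal(0, 0.2)
      sigma = 0.2

      def log_post(soc):
          if not 0.0 <= soc <= 1.0:               # uniform prior on [0, 1]
              return -np.inf
          return -0.5 * ((obs - forward(soc)) / sigma) ** 2

      # Metropolis-Hastings random walk.
      chain, current = [], 0.5
      lp = log_post(current)
      for _ in range(5000):
          proposal = current + rng.normal(0, 0.05)
          lp_new = log_post(proposal)
          if np.log(rng.uniform()) < lp_new - lp:
              current, lp = proposal, lp_new
          chain.append(current)

      samples = np.array(chain[1000:])            # discard burn-in
      print("posterior mean:", round(float(samples.mean()), 3),
            "95% interval:", np.round(np.percentile(samples, [2.5, 97.5]), 3))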

  5. Updated archaeointensity dataset from the SW Pacific

    NASA Astrophysics Data System (ADS)

    Hill, Mimi; Nilsson, Andreas; Holme, Richard; Hurst, Elliot; Turner, Gillian; Herries, Andy; Sheppard, Peter

    2016-04-01

    It is well known that there are far more archaeomagnetic data from the Northern Hemisphere than from the Southern. Here we present a compilation of archaeointensity data from the SW Pacific region covering the past 3000 years. The results have primarily been obtained from a collection of ceramics from the SW Pacific Islands including Fiji, Tonga, Papua New Guinea, New Caledonia and Vanuatu. In addition we present results obtained from heated clay balls from Australia. The microwave method has predominantly been used with a variety of experimental protocols including IZZI and Coe variants. Standard Thellier archaeointensity experiments using the IZZI protocol have also been carried out on selected samples. The dataset is compared to regional predictions from current global geomagnetic field models, and the influence of the new data on constraining the pfm9k family of global geomagnetic field models is explored.

  6. Sensitivity of a numerical wave model on wind re-analysis datasets

    NASA Astrophysics Data System (ADS)

    Lavidas, George; Venugopal, Vengatesan; Friedrich, Daniel

    2017-03-01

    Wind is the dominant process for wave generation. Detailed evaluation of metocean conditions strengthens our understanding of issues concerning potential offshore applications. However, the scarcity of buoys and the high cost of monitoring systems pose a barrier to properly defining offshore conditions. Through the use of numerical wave models, metocean conditions can be hindcast and forecast, providing reliable characterisations. This study reports the sensitivity of wind inputs on a numerical wave model for the Scottish region. Two re-analysis wind datasets with different spatio-temporal characteristics are used, the ERA-Interim Re-Analysis and the CFSR-NCEP Re-Analysis dataset. Different wind products alter the results, affecting the accuracy obtained. The scope of this study is to assess different available wind databases and provide information concerning the most appropriate wind dataset for the specific region, based on temporal, spatial and geographic terms for wave modelling and offshore applications. Both wind input datasets delivered results from the numerical wave model with good correlation. Wave results from the 1-h dataset have higher peaks and lower biases, at the expense of a higher scatter index. On the other hand, the 6-h dataset has lower scatter but higher biases. The study shows how the wind dataset affects numerical wave modelling performance, and that, depending on location and study needs, different wind inputs should be considered.
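
    The bias and scatter index referred to above can be computed with the definitions most commonly used in wave-model validation (scatter index taken here as RMSE normalized by the observed mean); the significant-wave-height series below are synthetic stand-ins for buoy observations and two hindcasts.

      import numpy as np

      def validation_stats(obs, model):
          """Bias, RMSE and scatter index (RMSE normalized by the observed mean),
          common skill measures in wave hindcast validation."""
          obs, model = np.asarray(obs, float), np.asarray(model, float)
          bias = float(np.mean(model - obs))
          rmse = float(np.sqrt(np.mean((model - obs) ** 2)))
          scatter_index = rmse / float(np.mean(obs))
          return bias, rmse, scatter_index

      # Toy significant wave heights (m): buoy observations vs. two hindcasts
      # driven by different wind inputs.
      rng = np.random.default_rng(9)
      obs = np.abs(rng.normal(2.0, 0.8, 500))
      hindcast_a = obs + rng.normal(0.05, 0.40, 500)   # low bias, more scatter
      hindcast_b = obs + rng.normal(0.20, 0.15, 500)   # higher bias, less scatter

      for name, run in [("A", hindcast_a), ("B", hindcast_b)]:
          b, r, si = validation_stats(obs, run)
          print(name, "bias:", round(b, 2), "RMSE:", round(r, 2), "SI:", round(si, 2))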

  7. Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification.

    PubMed

    Li, Jinyan; Fong, Simon; Sung, Yunsick; Cho, Kyungeun; Wong, Raymond; Wong, Kelvin K L

    2016-01-01

    An imbalanced dataset is defined as a training dataset that has imbalanced proportions of data in both interesting and uninteresting classes. Often in biomedical applications, samples from the stimulating class are rare in a population, such as medical anomalies, positive clinical tests, and particular diseases. Although the target samples in the primitive dataset are small in number, the induction of a classification model over such training data leads to poor prediction performance due to insufficient training from the minority class. In this paper, we use a novel class-balancing method named adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique (ASCB_DmSMOTE) to solve this imbalanced dataset problem, which is common in biomedical applications. The proposed method combines under-sampling and over-sampling into a swarm optimisation algorithm. It adaptively selects suitable parameters for the rebalancing algorithm to find the best solution. Compared with the other versions of the SMOTE algorithm, significant improvements, which include higher accuracy and credibility, are observed with ASCB_DmSMOTE. Our proposed method tactfully combines two rebalancing techniques together. It reasonably re-allocates the majority class in the details and dynamically optimises the two parameters of SMOTE to synthesise a reasonable scale of the minority class for each clustered sub-imbalanced dataset. The proposed method ultimately outperforms other conventional methods and attains higher credibility with even greater accuracy of the classification model.
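
    ASCB_DmSMOTE itself is not publicly packaged, so the sketch below shows only the plain SMOTE baseline it builds on, via the imbalanced-learn implementation, applied to a synthetic dataset with a 5% minority class.

      from collections import Counter

      from imblearn.over_sampling import SMOTE
      from sklearn.datasets import make_classification

      # Synthetic imbalanced binary dataset (5% minority class), standing in for
      # e.g. rare positive clinical tests.
      X, y = make_classification(n_samples=2000, n_features=10,
                                 weights=[0.95, 0.05], random_state=0)
      print("before:", Counter(y))

      # Plain SMOTE synthesizes new minority samples by interpolating between a
      # minority point and one of its minority-class nearest neighbours.
      X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
      print("after: ", Counter(y_res))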

  8. Evaluating the Contribution of NASA Remotely-Sensed Data Sets on a Convection-Allowing Forecast Model

    NASA Technical Reports Server (NTRS)

    Zavodsky, Bradley T.; Case, Jonathan L.; Molthan, Andrew L.

    2012-01-01

    The Short-term Prediction Research and Transition (SPoRT) Center is a collaborative partnership between NASA and operational forecasting partners, including a number of National Weather Service forecast offices. SPoRT provides real-time NASA products and capabilities to help its partners address specific operational forecast challenges. One challenge that forecasters face is using guidance from local and regional deterministic numerical models configured at convection-allowing resolution to help assess a variety of mesoscale/convective-scale phenomena such as sea-breezes, local wind circulations, and mesoscale convective weather potential on a given day. While guidance from convection-allowing models has proven valuable in many circumstances, the potential exists for model improvements by incorporating more representative land-water surface datasets, and by assimilating retrieved temperature and moisture profiles from hyper-spectral sounders. In order to help increase the accuracy of deterministic convection-allowing models, SPoRT produces real-time, 4-km CONUS forecasts using a configuration of the Weather Research and Forecasting (WRF) model (hereafter SPoRT-WRF) that includes unique NASA products and capabilities including 4-km resolution soil initialization data from the Land Information System (LIS), 2-km resolution SPoRT SST composites over oceans and large water bodies, high-resolution real-time Green Vegetation Fraction (GVF) composites derived from the Moderate-resolution Imaging Spectroradiometer (MODIS) instrument, and retrieved temperature and moisture profiles from the Atmospheric Infrared Sounder (AIRS) and Infrared Atmospheric Sounding Interferometer (IASI). NCAR's Model Evaluation Tools (MET) verification package is used to generate statistics of model performance compared to in situ observations and rainfall analyses for three months during the summer of 2012 (June-August). Detailed analyses of specific severe weather outbreaks during the summer will be presented to assess the potential added-value of the SPoRT datasets and data assimilation methodology compared to a WRF configuration without the unique datasets and data assimilation.

  9. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

    Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except maximum likelihood multilevel modelling (ML GH, 'xtlogit' in Stata), gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples) there appears to be less need to adjust for clustering.
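
    For readers who want to reproduce this kind of comparison outside Stata, a minimal sketch using the Python statsmodels package is given below; the variable and cluster names are hypothetical placeholders rather than the study's actual covariates, and the model choices only approximate the Stata commands quoted above.

```python
# Sketch: compare a random-intercept multilevel model with a GEE for clustered data.
# Column names are hypothetical placeholders for the paediatric datasets described above.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("infants.csv")  # one row per infant; 'pregnancy_id' identifies the cluster

# Continuous outcome: random-intercept (two-level) model fitted by maximum likelihood.
ml_fit = smf.mixedlm("birthweight ~ gestation + sex", df, groups=df["pregnancy_id"]).fit()
print(ml_fit.summary())

# Dichotomous outcome: logistic GEE with exchangeable within-cluster correlation,
# giving population-averaged odds ratios with cluster-robust standard errors.
gee_fit = smf.gee("died ~ gestation + sex", groups="pregnancy_id", data=df,
                  family=sm.families.Binomial(),
                  cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee_fit.summary())
```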

  10. Interactive Visualization and Analysis of Geospatial Data Sets - TrikeND-iGlobe

    NASA Astrophysics Data System (ADS)

    Rosebrock, Uwe; Hogan, Patrick; Chandola, Varun

    2013-04-01

    The visualization of scientific datasets is becoming an ever-increasing challenge as advances in computing technologies have enabled scientists to build high resolution climate models that have produced petabytes of climate data. To interrogate and analyze these large datasets in real-time is a task that pushes the boundaries of computing hardware and software. However, integrating climate datasets with geospatial data requires a considerable amount of effort and close familiarity with various data formats and projection systems, which has prevented widespread utilization outside of the climate community. TrikeND-iGlobe is a sophisticated software tool that bridges this gap, allowing easy integration of climate datasets with geospatial datasets and providing sophisticated visualization and analysis capabilities. The objective for TrikeND-iGlobe is the continued building of an open source 4D virtual globe application using NASA World Wind technology that integrates analysis of climate model outputs with remote sensing observations as well as demographic and environmental data sets. This will facilitate a better understanding of global and regional phenomena, and the impact analysis of climate extreme events. The critical aim is real-time interactive interrogation. At the data-centric level the primary aim is to enable the user to interact with the data in real-time for the purpose of analysis - locally or remotely. TrikeND-iGlobe provides the basis for the incorporation of modular tools that provide extended interactions with the data, including sub-setting, aggregation, re-shaping, time series analysis methods and animation to produce publication-quality imagery. TrikeND-iGlobe may be run locally or can be accessed via a web interface supported by high-performance visualization compute nodes placed close to the data. It supports visualizing heterogeneous data formats: traditional geospatial datasets along with scientific datasets with geographic coordinates (NetCDF, HDF, etc.). It also supports multiple data access mechanisms, including HTTP, FTP, WMS, WCS, and the THREDDS Data Server for NetCDF data. For scientific data, TrikeND-iGlobe supports various visualization capabilities, including animations, vector field visualization, etc. TrikeND-iGlobe is a collaborative open-source project; contributors include NASA (ARC-PX), ORNL (Oak Ridge National Laboratory), Unidata, Kansas University, CSIRO CMAR Australia and Geoscience Australia.

  11. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history. By assessing inter-dataset consistency across similar paths, we quantify travel-time measurement errors for both surface and body waves. Finally, we discuss challenges associated with combining high frequency (~1 Hz) and long period (10-20 s) body-wave measurements into the REM-3D reference dataset.

  12. General Purpose Electronegativity Relaxation Charge Models Applied to CoMFA and CoMSIA Study of GSK-3 Inhibitors.

    PubMed

    Tsareva, Daria A; Osolodkin, Dmitry I; Shulga, Dmitry A; Oliferenko, Alexander A; Pisarev, Sergey A; Palyulin, Vladimir A; Zefirov, Nikolay S

    2011-03-14

    Two fast empirical charge models, Kirchhoff Charge Model (KCM) and Dynamic Electronegativity Relaxation (DENR), had been developed in our laboratory previously for widespread use in drug design research. Both models are based on the electronegativity relaxation principle (Adv. Quantum Chem. 2006, 51, 139-156) and parameterized against ab initio dipole/quadrupole moments and molecular electrostatic potentials, respectively. As 3D QSAR studies comprise one of the most important fields of applied molecular modeling, they naturally have become the first topic to test our charges and thus, indirectly, the assumptions laid down to the charge model theories in a case study. Here these charge models are used in CoMFA and CoMSIA methods and tested on five glycogen synthase kinase 3 (GSK-3) inhibitor datasets, relevant to our current studies, and one steroid dataset. For comparison, eight other different charge models, ab initio through semiempirical and empirical, were tested on the same datasets. The complex analysis including correlation and cross-validation, charges robustness and predictability, as well as visual interpretability of 3D contour maps generated was carried out. As a result, our new electronegativity relaxation-based models both have shown stable results, which in conjunction with other benefits discussed render them suitable for building reliable 3D QSAR models. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  13. To generate a finite element model of human thorax using the VCH dataset

    NASA Astrophysics Data System (ADS)

    Shi, Hui; Liu, Qian

    2009-10-01

    Purpose: To generate a three-dimensional (3D) finite element (FE) model of the human thorax which may provide the basis of biomechanics simulation for studying the design effect and mechanism of safety belts in vehicle collisions. Methods: Using manual or semi-manual segmentation, the region of interest was segmented from the VCH (Visible Chinese Human) dataset. The 3D surface model of the thorax was visualized using VTK (Visualization Toolkit) and further translated into Stereo Lithography (STL) format, which approximates the geometry of the solid model by representing the boundaries with triangular facets. The data in STL format were then converted into NURBS surfaces and IGES format using software such as Geomagic Studio to provide an archetype for reverse engineering. The 3D FE model was established using Ansys software. Results: The generated 3D FE model was an integrated thorax model that reproduced the complicated structural morphology of the human thorax, including the clavicles, ribs, spine and sternum. It consisted of 1,044,179 elements in total. Conclusions: Compared with previous thorax models, this FE model clearly enhanced the authenticity and precision of the analysis results, and can provide a sound basis for biomechanical research on the human thorax. Furthermore, using the method above, 3D FE models of other organs and tissues can also be established from the VCH dataset.
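
    The segmentation-to-surface step of such a pipeline can be prototyped in a few lines; the sketch below assumes a binary segmentation volume stored as a NumPy array and uses scikit-image plus a hand-written ASCII STL writer instead of the VTK/Geomagic/Ansys toolchain described above, so it only illustrates the surface-extraction idea.

```python
# Sketch: extract a triangulated surface from a binary segmentation volume and write ASCII STL.
# 'thorax_mask.npy' is a hypothetical placeholder for a segmented VCH sub-volume (1 = tissue of interest).
import numpy as np
from skimage import measure

mask = np.load("thorax_mask.npy").astype(float)
verts, faces, _, _ = measure.marching_cubes(mask, level=0.5)  # isosurface of the segmentation

with open("thorax_surface.stl", "w") as f:
    f.write("solid thorax\n")
    for tri in faces:
        v0, v1, v2 = verts[tri]
        n = np.cross(v1 - v0, v2 - v0)
        n = n / (np.linalg.norm(n) + 1e-12)          # per-facet unit normal
        f.write(f"  facet normal {n[0]} {n[1]} {n[2]}\n    outer loop\n")
        for v in (v0, v1, v2):
            f.write(f"      vertex {v[0]} {v[1]} {v[2]}\n")
        f.write("    endloop\n  endfacet\n")
    f.write("endsolid thorax\n")
```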

  14. DOTAGWA: A CASE STUDY IN WEB-BASED ARCHITECTURES FOR CONNECTING SURFACE WATER MODELS TO SPATIALLY ENABLED WEB APPLICATIONS

    EPA Science Inventory

    The Automated Geospatial Watershed Assessment (AGWA) tool is a desktop application that uses widely available standardized spatial datasets to derive inputs for multi-scale hydrologic models (Miller et al., 2007). The required data sets include topography (DEM data), soils, clima...

  15. Development of cropland management dataset to support U.S. SWAT assessments

    USDA-ARS?s Scientific Manuscript database

    The Soil and Water Assessment Tool (SWAT) is a widely used hydrologic/water quality simulation model in the U.S. Process-based models like SWAT require a great deal of data to accurately represent the natural world, including topography, landuse, soils, weather, and management. With the exception ...

  16. Improving the discoverability, accessibility, and citability of omics datasets: a case report.

    PubMed

    Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J

    2017-03-01

    Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  17. A comparison of U.S. geological survey seamless elevation models with shuttle radar topography mission data

    USGS Publications Warehouse

    Gesch, D.; Williams, J.; Miller, W.

    2001-01-01

    Elevation models produced from Shuttle Radar Topography Mission (SRTM) data will be the most comprehensive, consistently processed, highest resolution topographic dataset ever produced for the Earth's land surface. Many applications that currently use elevation data will benefit from the increased availability of data with higher accuracy, quality, and resolution, especially in poorly mapped areas of the globe. SRTM data will be produced as seamless data, thereby avoiding many of the problems inherent in existing multi-source topographic databases. Serving as precursors to SRTM datasets, the U.S. Geological Survey (USGS) has produced and is distributing seamless elevation datasets that facilitate scientific use of elevation data over large areas. GTOPO30 is a global elevation model with a 30 arc-second resolution (approximately 1-kilometer). The National Elevation Dataset (NED) covers the United States at a resolution of 1 arc-second (approximately 30-meters). Due to their seamless format and broad area coverage, both GTOPO30 and NED represent an advance in the usability of elevation data, but each still includes artifacts from the highly variable source data used to produce them. The consistent source data and processing approach for SRTM data will result in elevation products that will be a significant addition to the current availability of seamless datasets, specifically for many areas outside the U.S. One application that demonstrates some advantages that may be realized with SRTM data is delineation of land surface drainage features (watersheds and stream channels). Seamless distribution of elevation data in which a user interactively specifies the area of interest and order parameters via a map server is already being successfully demonstrated with existing USGS datasets. Such an approach for distributing SRTM data is ideal for a dataset that undoubtedly will be of very high interest to the spatial data user community.

  18. Microclimate Data Improve Predictions of Insect Abundance Models Based on Calibrated Spatiotemporal Temperatures.

    PubMed

    Rebaudo, François; Faye, Emile; Dangles, Olivier

    2016-01-01

    A large body of literature has recently recognized the role of microclimates in controlling the physiology and ecology of species, yet the relevance of fine-scale climatic data for modeling species performance and distribution remains a matter of debate. Using a 6-year monitoring of three potato moth species, major crop pests in the tropical Andes, we asked whether the spatiotemporal resolution of temperature data affect the predictions of models of moth performance and distribution. For this, we used three different climatic data sets: (i) the WorldClim dataset (global dataset), (ii) air temperature recorded using data loggers (weather station dataset), and (iii) air crop canopy temperature (microclimate dataset). We developed a statistical procedure to calibrate all datasets to monthly and yearly variation in temperatures, while keeping both spatial and temporal variances (air monthly temperature at 1 km² for the WorldClim dataset, air hourly temperature for the weather station, and air minute temperature over 250 m radius disks for the microclimate dataset). Then, we computed pest performances based on these three datasets. Results for temperature ranging from 9 to 11°C revealed discrepancies in the simulation outputs in both survival and development rates depending on the spatiotemporal resolution of the temperature dataset. Temperature and simulated pest performances were then combined into multiple linear regression models to compare predicted vs. field data. We used an additional set of study sites to test the ability of the results of our model to be extrapolated over larger scales. Results showed that the model implemented with microclimatic data best predicted observed pest abundances for our study sites, but was less accurate than the global dataset model when performed at larger scales. Our simulations therefore stress the importance to consider different temperature datasets depending on the issue to be solved in order to accurately predict species abundances. In conclusion, keeping in mind that the mismatch between the size of organisms and the scale at which climate data are collected and modeled remains a key issue, temperature dataset selection should be balanced by the desired output spatiotemporal scale for better predicting pest dynamics and developing efficient pest management strategies.
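
    The final comparison step described above, in which simulated pest performance driven by each temperature dataset is regressed against observed abundance, can be sketched as follows; the file and column names and the use of ordinary least squares are illustrative assumptions rather than the authors' exact statistical procedure.

```python
# Sketch: relate simulated pest performance (driven by a given temperature dataset)
# to observed field abundances, and compare predictive skill across temperature sources.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: one row per site and month, with observed moth abundance and
# simulated performance computed from each of the three temperature datasets.
df = pd.read_csv("moth_sites.csv")

for source in ["worldclim", "weather_station", "microclimate"]:
    fit = smf.ols(f"observed_abundance ~ performance_{source}", data=df).fit()
    print(source, "R2 =", round(fit.rsquared, 3))
```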

  19. Microclimate Data Improve Predictions of Insect Abundance Models Based on Calibrated Spatiotemporal Temperatures

    PubMed Central

    Rebaudo, François; Faye, Emile; Dangles, Olivier

    2016-01-01

    A large body of literature has recently recognized the role of microclimates in controlling the physiology and ecology of species, yet the relevance of fine-scale climatic data for modeling species performance and distribution remains a matter of debate. Using a 6-year monitoring of three potato moth species, major crop pests in the tropical Andes, we asked whether the spatiotemporal resolution of temperature data affect the predictions of models of moth performance and distribution. For this, we used three different climatic data sets: (i) the WorldClim dataset (global dataset), (ii) air temperature recorded using data loggers (weather station dataset), and (iii) air crop canopy temperature (microclimate dataset). We developed a statistical procedure to calibrate all datasets to monthly and yearly variation in temperatures, while keeping both spatial and temporal variances (air monthly temperature at 1 km² for the WorldClim dataset, air hourly temperature for the weather station, and air minute temperature over 250 m radius disks for the microclimate dataset). Then, we computed pest performances based on these three datasets. Results for temperature ranging from 9 to 11°C revealed discrepancies in the simulation outputs in both survival and development rates depending on the spatiotemporal resolution of the temperature dataset. Temperature and simulated pest performances were then combined into multiple linear regression models to compare predicted vs. field data. We used an additional set of study sites to test the ability of the results of our model to be extrapolated over larger scales. Results showed that the model implemented with microclimatic data best predicted observed pest abundances for our study sites, but was less accurate than the global dataset model when performed at larger scales. Our simulations therefore stress the importance to consider different temperature datasets depending on the issue to be solved in order to accurately predict species abundances. In conclusion, keeping in mind that the mismatch between the size of organisms and the scale at which climate data are collected and modeled remains a key issue, temperature dataset selection should be balanced by the desired output spatiotemporal scale for better predicting pest dynamics and developing efficient pest management strategies. PMID:27148077

  20. Use of Anecdotal Occurrence Data in Species Distribution Models: An Example Based on the White-Nosed Coati (Nasua narica) in the American Southwest

    PubMed Central

    Frey, Jennifer K.; Lewis, Jeremy C.; Guy, Rachel K.; Stuart, James N.

    2013-01-01

    Simple Summary We evaluated the influence of occurrence records with different reliability on predicted distribution of a unique, rare mammal in the American Southwest, the white-nosed coati (Nasua narica). We concluded that occurrence datasets that include anecdotal records can be used to infer species distributions, providing such data are used only for easily-identifiable species and based on robust modeling methods such as maximum entropy. Use of a reliability rating system is critical for using anecdotal data. Abstract Species distributions are usually inferred from occurrence records. However, these records are prone to errors in spatial precision and reliability. Although influence of spatial errors has been fairly well studied, there is little information on impacts of poor reliability. Reliability of an occurrence record can be influenced by characteristics of the species, conditions during the observation, and observer’s knowledge. Some studies have advocated use of anecdotal data, while others have advocated more stringent evidentiary standards such as only accepting records verified by physical evidence, at least for rare or elusive species. Our goal was to evaluate the influence of occurrence records with different reliability on species distribution models (SDMs) of a unique mammal, the white-nosed coati (Nasua narica) in the American Southwest. We compared SDMs developed using maximum entropy analysis of combined bioclimatic and biophysical variables and based on seven subsets of occurrence records that varied in reliability and spatial precision. We found that the predicted distribution of the coati based on datasets that included anecdotal occurrence records were similar to those based on datasets that only included physical evidence. Coati distribution in the American Southwest was predicted to occur in southwestern New Mexico and southeastern Arizona and was defined primarily by evenness of climate and Madrean woodland and chaparral land-cover types. Coati distribution patterns in this region suggest a good model for understanding the biogeographic structure of range margins. We concluded that occurrence datasets that include anecdotal records can be used to infer species distributions, providing such data are used only for easily-identifiable species and based on robust modeling methods such as maximum entropy. Use of a reliability rating system is critical for using anecdotal data. PMID:26487405

  1. Rapid Global Fitting of Large Fluorescence Lifetime Imaging Microscopy Datasets

    PubMed Central

    Warren, Sean C.; Margineanu, Anca; Alibhai, Dominic; Kelly, Douglas J.; Talbot, Clifford; Alexandrov, Yuriy; Munro, Ian; Katan, Matilda

    2013-01-01

    Fluorescence lifetime imaging (FLIM) is widely applied to obtain quantitative information from fluorescence signals, particularly using Förster Resonant Energy Transfer (FRET) measurements to map, for example, protein-protein interactions. Extracting FRET efficiencies or population fractions typically entails fitting data to complex fluorescence decay models but such experiments are frequently photon constrained, particularly for live cell or in vivo imaging, and this leads to unacceptable errors when analysing data on a pixel-wise basis. Lifetimes and population fractions may, however, be more robustly extracted using global analysis to simultaneously fit the fluorescence decay data of all pixels in an image or dataset to a multi-exponential model under the assumption that the lifetime components are invariant across the image (dataset). This approach is often considered to be prohibitively slow and/or computationally expensive but we present here a computationally efficient global analysis algorithm for the analysis of time-correlated single photon counting (TCSPC) or time-gated FLIM data based on variable projection. It makes efficient use of both computer processor and memory resources, requiring less than a minute to analyse time series and multiwell plate datasets with hundreds of FLIM images on standard personal computers. This lifetime analysis takes account of repetitive excitation, including fluorescence photons excited by earlier pulses contributing to the fit, and is able to accommodate time-varying backgrounds and instrument response functions. We demonstrate that this global approach allows us to readily fit time-resolved fluorescence data to complex models including a four-exponential model of a FRET system, for which the FRET efficiencies of the two species of a bi-exponential donor are linked, and polarisation-resolved lifetime data, where a fluorescence intensity and bi-exponential anisotropy decay model is applied to the analysis of live cell homo-FRET data. A software package implementing this algorithm, FLIMfit, is available under an open source licence through the Open Microscopy Environment. PMID:23940626
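
    A drastically simplified version of the global-fitting idea (lifetimes shared across all pixels, amplitudes free per pixel) can be written with a general-purpose least-squares solver, as sketched below; this ignores the instrument response, repetitive excitation and background handling that FLIMfit implements, the array shapes are assumptions, and only the separation of linear amplitudes from nonlinear lifetimes echoes the variable projection approach.

```python
# Sketch: global bi-exponential fit of many decays with shared lifetimes.
# decays: (n_pixels, n_time) photon counts; t: (n_time,) time bins in ns. Both are assumed inputs.
import numpy as np
from scipy.optimize import minimize

def global_residual(log_taus, t, decays):
    tau1, tau2 = np.exp(log_taus)                      # positivity via log parameterisation
    basis = np.stack([np.exp(-t / tau1), np.exp(-t / tau2)], axis=1)  # (n_time, 2)
    # For fixed lifetimes the per-pixel amplitudes form a linear least-squares problem
    # (the essence of variable projection): solve them exactly, then score the residual.
    amps, *_ = np.linalg.lstsq(basis, decays.T, rcond=None)           # (2, n_pixels)
    resid = decays.T - basis @ amps
    return np.sum(resid ** 2)

def fit_global_lifetimes(t, decays, tau_init=(0.5, 3.0)):
    res = minimize(global_residual, np.log(tau_init), args=(t, decays), method="Nelder-Mead")
    return np.exp(res.x)   # shared lifetime estimates (ns)
```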

  2. Importance of Past Human and Natural Disturbance in Present-Day Net Ecosystem Productivity

    NASA Astrophysics Data System (ADS)

    Felzer, B. S.; Phelps, P.

    2014-12-01

    Gridded datasets of Net Ecosystem Exchange derived from eddy covariance and remote sensing measurements provide a means of validating Net Ecosystem Productivity (NEP, opposite of NEE) from terrestrial ecosystem models. While most forested regions in the U.S. are observed to be moderate to strong carbon sinks, models not including human or natural disturbances will tend to be more carbon neutral, which is expected of mature ecosystems. We have developed the Terrestrial Ecosystems Model Hydro version (TEM-Hydro) to include both human and natural disturbances to compare against gridded NEP datasets. Human disturbances are based on the Hurtt et al. (2006) land use transition dataset and include transient agricultural (crops and pasture) conversion and abandonment and timber harvest. We include natural disturbances of storms and fires based on stochastic return intervals. Tropical storms and hurricane return intervals are based on Zheng et al. (2009) and occur only along the U.S. Atlantic and Gulf coasts. Fire return intervals are based on LANDFIRE Rapid Assessment Vegetation Models and vegetation types from the Hurtt dataset. We are running three experiments with TEM-Hydro from 1700-2011 for the conterminous U.S.: potential vegetation (POT), human disturbance only (agriculture and timber harvest, LULC), and human plus natural disturbance (agriculture, timber harvest, storms, and fire, DISTURB). The goal is to compare our NEP values to those obtained by FLUXNET-MTE (Jung et al. 2009) from 1982-2008 and ECMOD (Xiao et al., 2008) from 2000-2006 for different plant functional types (PFTs) within the conterminous U.S. Preliminary results show that, for the entire U.S., potential vegetation yields an NEP of 10.8 gCm-2yr-1 vs 128.1 gCm-2yr-1 for LULC and 89.8 gCm-2yr-1 for DISTURB from 1982-2008. The effect of regrowth following agricultural and timber harvest disturbance therefore contributes substantially to the present-day carbon sink, while stochastic storms and fires have a negative effect on NEP. Even though the current NEP reflects the carbon uptake from regrowth, a full carbon accounting would also include the carbon released to the atmosphere during disturbance or carbon lost to decomposition of agricultural or timber products

  3. Backscatter Modeling at 2.1 Micron Wavelength for Space-Based and Airborne Lidars Using Aerosol Physico-Chemical and Lidar Datasets

    NASA Technical Reports Server (NTRS)

    Srivastava, V.; Rothermel, J.; Jarzembski, M. A.; Clarke, A. D.; Cutten, D. R.; Bowdle, D. A.; Spinhirne, J. D.; Menzies, R. T.

    1999-01-01

    Space-based and airborne coherent Doppler lidars designed for measuring global tropospheric wind profiles in cloud-free air rely on backscatter, beta, from aerosols acting as passive wind tracers. The aerosol beta distribution in the vertical can vary over as much as 5-6 orders of magnitude. Thus, the design of a wavelength-specific, space-borne or airborne lidar must account for the magnitude of beta in the region or features of interest. The SPAce Readiness Coherent Lidar Experiment under development by the National Aeronautics and Space Administration (NASA) and scheduled for launch on the Space Shuttle in 2001, will demonstrate wind measurements from space using a solid-state 2 micrometer coherent Doppler lidar. Consequently, there is a critical need to understand variability of aerosol beta at 2.1 micrometers, to evaluate signal detection under varying aerosol loading conditions. Although few direct measurements of beta at 2.1 micrometers exist, extensive datasets, including climatologies in widely-separated locations, do exist for other wavelengths based on CO2 and Nd:YAG lidars. Datasets also exist for the associated microphysical and chemical properties. An example of a multi-parametric dataset is that of the NASA GLObal Backscatter Experiment (GLOBE) in 1990 in which aerosol chemistry and size distributions were measured concurrently with multi-wavelength lidar backscatter observations. More recently, continuous-wave (CW) lidar backscatter measurements at mid-infrared wavelengths have been made during the Multicenter Airborne Coherent Atmospheric Wind Sensor (MACAWS) experiment in 1995. Using Lorenz-Mie theory, these datasets have been used to develop a method to convert lidar backscatter to the 2.1 micrometer wavelength. This paper presents a comparison of modeled backscatter at wavelengths for which backscatter measurements exist, including the converted beta at 2.1 micrometers.

  4. Parameterized Spectral Bathymetric Roughness Using the Nonequispaced Fast Fourier Transform

    NASA Astrophysics Data System (ADS)

    Fabre, David Hanks

    The ocean and acoustic modeling community has specifically asked for roughness from bathymetry. An effort has been undertaken to provide what can be thought of as the high frequency content of bathymetry. By contrast, the low frequency content of bathymetry is the set of contours. The two-dimensional amplitude spectrum calculated with the nonequispaced fast Fourier transform (Kunis, 2006) is exploited as the statistic to provide several parameters of roughness following the method of Fox (1996). When an area is uniformly rough, it is termed isotropically rough. When an area exhibits lineation effects (like in a trough or a ridge line in the bathymetry), the term anisotropically rough is used. A predominant spatial azimuth of lineation summarizes anisotropic roughness. The power law model fit produces a roll-off parameter that also provides insight into the roughness of the area. These four parameters give rise to several derived parameters. Algorithmic accomplishments include reviving Fox's method (1985, 1996) and improving the method with the possibly geophysically more appropriate nonequispaced fast Fourier transform. A new composite parameter, simply the overall integral length of the nonlinear parameterizing function, is used to make within-dataset comparisons. A synthetic dataset and six multibeam datasets covering practically all depth regimes have been analyzed with the tools that have been developed. Data specific contributions include possibly discovering an aspect ratio isotropic cutoff level (less than 1.2), showing a range of spectral fall-off values between about -0.5 for a sandybottomed Gulf of Mexico area, to about -1.8 for a coral reef area just outside of the Saipan harbor. We also rank the targeted type of dataset, the best resolution gridded datasets, from smoothest to roughest using a factor based on the kernel dimensions, a percentage from the windowing operation, all multiplied by the overall integration length.
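
    For an equispaced grid, a rough version of the same statistic can be computed with an ordinary 2-D FFT, as in the sketch below; the radial averaging and the straight-line fit in log-log space stand in for the nonequispaced transform and the full parameterisation of Fox's method, and the input file name is an assumption.

```python
# Sketch: spectral roughness of a bathymetry grid via a radially averaged amplitude spectrum
# and a power-law (roll-off) fit. Uses an equispaced FFT as a simplification of the NFFT approach.
import numpy as np

z = np.load("bathy_grid.npy")                     # hypothetical equispaced depth grid (m)
z = z - z.mean()
win = np.outer(np.hanning(z.shape[0]), np.hanning(z.shape[1]))   # taper to reduce leakage
amp = np.abs(np.fft.fftshift(np.fft.fft2(z * win)))

ky, kx = np.indices(amp.shape)
cy, cx = np.array(amp.shape) // 2
kr = np.hypot(ky - cy, kx - cx).astype(int)       # integer radial wavenumber bins
counts = np.bincount(kr.ravel())
sums = np.bincount(kr.ravel(), weights=amp.ravel())
nmax = min(cy, cx)                                # stay inside the fully sampled annuli
radial = sums[1:nmax] / counts[1:nmax]            # radially averaged amplitude, skipping k = 0

k = np.arange(1, nmax)
slope, intercept = np.polyfit(np.log10(k), np.log10(radial), 1)
print("spectral roll-off (log-log slope):", round(slope, 2))
```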

  5. An ISA-TAB-Nano based data collection framework to support data-driven modelling of nanotoxicology

    PubMed Central

    Marchese Robinson, Richard L; Richarz, Andrea-Nicole; Rallo, Robert

    2015-01-01

    Summary Analysis of trends in nanotoxicology data and the development of data driven models for nanotoxicity is facilitated by the reporting of data using a standardised electronic format. ISA-TAB-Nano has been proposed as such a format. However, in order to build useful datasets according to this format, a variety of issues has to be addressed. These issues include questions regarding exactly which (meta)data to report and how to report them. The current article discusses some of the challenges associated with the use of ISA-TAB-Nano and presents a set of resources designed to facilitate the manual creation of ISA-TAB-Nano datasets from the nanotoxicology literature. These resources were developed within the context of the NanoPUZZLES EU project and include data collection templates, corresponding business rules that extend the generic ISA-TAB-Nano specification as well as Python code to facilitate parsing and integration of these datasets within other nanoinformatics resources. The use of these resources is illustrated by a “Toy Dataset” presented in the Supporting Information. The strengths and weaknesses of the resources are discussed along with possible future developments. PMID:26665069
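
    Because ISA-TAB-Nano files are tab-delimited, a minimal parse-and-merge step of the kind the project's Python code supports can be sketched with pandas; the file and column names below are hypothetical and none of the NanoPUZZLES business rules are enforced.

```python
# Sketch: read hypothetical ISA-TAB-Nano style tab-delimited study and assay files
# and join them on a shared sample identifier for downstream modelling.
import pandas as pd

study = pd.read_csv("s_nanotox_study.txt", sep="\t")        # sample-level metadata
assay = pd.read_csv("a_cytotoxicity_assay.txt", sep="\t")   # endpoint measurements

merged = study.merge(assay, on="Sample Name", how="inner")
print(merged.shape)
missing = merged.isna().mean().sort_values(ascending=False)
print(missing.head())   # quick view of metadata gaps that would need gap filling
```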

  6. Attribute Utility Motivated k-anonymization of Datasets to Support the Heterogeneous Needs of Biomedical Researchers

    PubMed Central

    Ye, Huimin; Chen, Elizabeth S.

    2011-01-01

    In order to support the increasing need to share electronic health data for research purposes, various methods have been proposed for privacy preservation including k-anonymity. Many k-anonymity models provide the same level of anoymization regardless of practical need, which may decrease the utility of the dataset for a particular research study. In this study, we explore extensions to the k-anonymity algorithm that aim to satisfy the heterogeneous needs of different researchers while preserving privacy as well as utility of the dataset. The proposed algorithm, Attribute Utility Motivated k-anonymization (AUM), involves analyzing the characteristics of attributes and utilizing them to minimize information loss during the anonymization process. Through comparison with two existing algorithms, Mondrian and Incognito, preliminary results indicate that AUM may preserve more information from original datasets thus providing higher quality results with lower distortion. PMID:22195223
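
    As background for the extension described above, the sketch below shows the basic check that every combination of generalised quasi-identifiers occurs at least k times; the column names and the simple age-banding and ZIP-truncation generalisations are illustrative assumptions, not the AUM algorithm.

```python
# Sketch: generalise quasi-identifiers and verify k-anonymity of the result.
import pandas as pd

def generalise(df):
    out = df.copy()
    out["age"] = (out["age"] // 10 * 10).astype(str) + "s"   # e.g. 47 -> "40s"
    out["zip"] = out["zip"].str[:3] + "**"                   # truncate ZIP code
    return out

def is_k_anonymous(df, quasi_identifiers, k):
    counts = df.groupby(quasi_identifiers).size()
    return bool((counts >= k).all())

records = pd.read_csv("patients.csv", dtype={"zip": str})    # hypothetical input
anon = generalise(records)
print(is_k_anonymous(anon, ["age", "zip", "sex"], k=5))
```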

  7. A three-step approach for the derivation and validation of high-performing predictive models using an operational dataset: congestive heart failure readmission case study.

    PubMed

    AbdelRahman, Samir E; Zhang, Mingyuan; Bray, Bruce E; Kawamoto, Kensaku

    2014-05-27

    The aim of this study was to propose an analytical approach to develop high-performing predictive models for congestive heart failure (CHF) readmission using an operational dataset with incomplete records and changing data over time. Our analytical approach involves three steps: pre-processing, systematic model development, and risk factor analysis. For pre-processing, variables that were absent in >50% of records were removed. Moreover, the dataset was divided into a validation dataset and derivation datasets, which were separated into three temporal subsets based on changes to the data over time. For systematic model development, using the different temporal datasets and the remaining explanatory variables, the models were developed by combining the use of various (i) statistical analyses to explore the relationships between the validation and the derivation datasets; (ii) adjustment methods for handling missing values; (iii) classifiers; (iv) feature selection methods; and (v) discretization methods. We then selected the best derivation dataset and the models with the highest predictive performance. For risk factor analysis, factors in the highest-performing predictive models were analyzed and ranked using (i) statistical analyses of the best derivation dataset, (ii) feature rankers, and (iii) a newly developed algorithm to categorize risk factors as being strong, regular, or weak. The analysis dataset consisted of 2,787 CHF hospitalizations at University of Utah Health Care from January 2003 to June 2013. In this study, we used the complete-case analysis and mean-based imputation adjustment methods; the wrapper subset feature selection method; and four ranking strategies based on information gain, gain ratio, symmetrical uncertainty, and wrapper subset feature evaluators. The best-performing models resulted from the use of a complete-case analysis derivation dataset combined with the Class-Attribute Contingency Coefficient discretization method and a voting classifier which averaged the results of multinomial logistic regression and voting feature intervals classifiers. Of 42 final model risk factors, discharge disposition, discretized age, and indicators of anemia were the most significant. This model achieved a c-statistic of 86.8%. The proposed three-step analytical approach enhanced predictive model performance for CHF readmissions. It could potentially be leveraged to improve predictive model performance in other areas of clinical medicine.
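
    A stripped-down version of the pre-processing and model-development steps (dropping mostly missing variables, imputing the rest, selecting features and scoring a classifier) is sketched below with scikit-learn; the column names and the choice of logistic regression are assumptions and do not reproduce the paper's voting classifier or discretization method.

```python
# Sketch: drop variables missing in >50% of records, impute, select features, and
# evaluate a readmission classifier with cross-validation.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("chf_admissions.csv")                  # hypothetical operational extract
y = df.pop("readmitted_30d")
df = df.loc[:, df.isna().mean() <= 0.5]                 # pre-processing: remove sparse variables
X = pd.get_dummies(df)                                  # simple encoding of categorical variables

model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),          # mean-based imputation
    ("select", SelectKBest(mutual_info_classif, k=20)),  # crude stand-in for wrapper selection
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```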

  8. Real-world datasets for portfolio selection and solutions of some stochastic dominance portfolio models.

    PubMed

    Bruni, Renato; Cesarone, Francesco; Scozzari, Andrea; Tardella, Fabio

    2016-09-01

    A large number of portfolio selection models have appeared in the literature since the pioneering work of Markowitz. However, even when computational and empirical results are described, they are often hard to replicate and compare due to the unavailability of the datasets used in the experiments. We provide here several datasets for portfolio selection generated using real-world price values from several major stock markets. The datasets contain weekly return values, adjusted for dividends and for stock splits, which are cleaned from errors as much as possible. The datasets are available in different formats, and can be used as benchmarks for testing the performances of portfolio selection models and for comparing the efficiency of the algorithms used to solve them. We also provide, for these datasets, the portfolios obtained by several selection strategies based on Stochastic Dominance models (see "On Exact and Approximate Stochastic Dominance Strategies for Portfolio Selection" (Bruni et al. [2])). We believe that testing portfolio models on publicly available datasets greatly simplifies the comparison of the different portfolio selection strategies.

  9. The DataBridge: A System For Optimizing The Use Of Dark Data From The Long Tail Of Science

    NASA Astrophysics Data System (ADS)

    Lander, H.; Rajasekar, A.

    2015-12-01

    The DataBridge is a National Science Foundation-funded collaborative project (OCI-1247652, OCI-1247602, OCI-1247663) designed to assist in the discovery of dark data sets from the long tail of science. The DataBridge aims to build queryable communities of datasets using sociometric network analysis. This approach is being tested to evaluate the ability to leverage various forms of metadata to facilitate discovery of new knowledge. Each dataset in the DataBridge has an associated name space used as a first-level partitioning. In addition to testing known algorithms for SNA community building, the DataBridge project has built a message-based platform that allows users to provide their own algorithms for each of the stages in the community-building process. The stages are: Signature Generation (SG): An SG algorithm creates a metadata signature for a dataset. Signature algorithms might use text metadata provided by the dataset creator or derive metadata. Relevance Algorithm (RA): An RA compares a pair of datasets and produces a similarity value between 0 and 1 for the two datasets. Sociometric Network Analysis (SNA): The SNA will operate on a similarity matrix produced by an RA to partition all of the datasets in the name space into a set of clusters. These clusters represent communities of closely related datasets. The DataBridge also includes a web application that produces a visual representation of the clustering. Future work includes a more complete application that will allow different types of searching of the network of datasets. The DataBridge approach is relevant to geoscience research and informatics. In this presentation we will outline the project, illustrate the deployment of the approach, and discuss other potential applications and next steps for the research, such as applying this approach to models. In addition we will explore the relevance of DataBridge to other geoscience projects such as various EarthCube Building Blocks and DIBBS projects.
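
    The three stages described above (signature generation, pairwise relevance and sociometric clustering) can be prototyped in a few lines; the sketch below uses TF-IDF text signatures, cosine similarity and a modularity-based community detector as stand-ins for the project's pluggable algorithms, and the dataset descriptions are hypothetical.

```python
# Sketch of the DataBridge stages: SG (TF-IDF signatures), RA (cosine similarity),
# and SNA (community detection on the similarity graph).
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = {
    "ds1": "soil moisture observations watershed hydrology",
    "ds2": "streamflow gauge records watershed runoff",
    "ds3": "seismic waveform catalog earthquake locations",
}
names = list(descriptions)

signatures = TfidfVectorizer().fit_transform(descriptions.values())  # SG
sim = cosine_similarity(signatures)                                  # RA: values in [0, 1]

g = nx.Graph()
g.add_nodes_from(names)                                              # isolated datasets stay visible
for i, a in enumerate(names):
    for j in range(i + 1, len(names)):
        if sim[i, j] > 0.1:                                          # keep sufficiently related pairs
            g.add_edge(a, names[j], weight=sim[i, j])

communities = nx.algorithms.community.greedy_modularity_communities(g, weight="weight")  # SNA
print([sorted(c) for c in communities])
```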

  10. A global wind resource atlas including high-resolution terrain effects

    NASA Astrophysics Data System (ADS)

    Hahmann, Andrea; Badger, Jake; Olsen, Bjarke; Davis, Neil; Larsen, Xiaoli; Badger, Merete

    2015-04-01

    Currently no accurate global wind resource dataset is available to fill the needs of policy makers and strategic energy planners. Evaluating wind resources directly from coarse-resolution reanalysis datasets underestimates the true wind energy resource, as the small-scale spatial variability of winds is missing. This missing variability can account for a large part of the local wind resource. Crucially, it is the windiest sites that suffer the largest wind resource errors: in simple terrain the windiest sites may be underestimated by 25%, in complex terrain the underestimate can be as large as 100%. The small-scale spatial variability of winds can be modelled using novel statistical methods and by application of established microscale models within WAsP developed at DTU Wind Energy. We present the framework for a single global methodology, which is relatively fast and economical to complete. The method employs reanalysis datasets, which are downscaled to high-resolution wind resource datasets via a so-called generalization step, and microscale modelling using WAsP. This method will create the first global wind atlas (GWA) that covers all land areas (except Antarctica) and a 30 km coastal zone over water. Verification of the GWA estimates will be done at carefully selected test regions, against verified estimates from mesoscale modelling and satellite synthetic aperture radar (SAR). This verification exercise will also help in the estimation of the uncertainty of the new wind climate dataset. Uncertainty will be assessed as a function of spatial aggregation. It is expected that the uncertainty at verification sites will be larger than that of dedicated assessments, but the uncertainty will be reduced at levels of aggregation appropriate for energy planning, and importantly much improved relative to what is used today. In this presentation we discuss the methodology used, which includes the generalization of wind climatologies, and the differences in local and spatially aggregated wind resources that result from using different reanalyses in the various verification regions. A prototype web interface for the public access to the data will also be showcased.

  11. Inter-annual Tropospheric Aerosol Variability in Late Twentieth Century and its Impact on Tropical Atlantic and West African Climate by Direct and Semi-direct Effects

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Evans, Katherine J; Hack, James J; Truesdale, John

    A new high-resolution (0.9° x 1.25° in the horizontal) global tropospheric aerosol dataset with monthly resolution is generated using the finite-volume configuration of the Community Atmosphere Model (CAM4) coupled to a bulk aerosol model and forced with recent estimates of surface emissions for the latter part of the twentieth century. The surface emissions dataset is constructed from the Coupled Model Inter-comparison Project (CMIP5) decadal-resolution surface emissions dataset to include the REanalysis of TROpospheric chemical composition (RETRO) wildfire monthly emissions dataset. Experiments forced with the new tropospheric aerosol dataset and conducted using the spectral configuration of CAM4 with a T85 truncation (1.4° x 1.4°) with prescribed twentieth-century observed sea surface temperature, sea ice and greenhouse gases reveal that variations in tropospheric aerosol levels can induce significant regional climate variability on inter-annual timescales. Regression analyses over the tropical Atlantic and Africa reveal that increasing dust aerosols can cool the North African landmass and shift convection southwards from West Africa into the Gulf of Guinea in the spring season in the simulations. Further, we find that increasing carbonaceous aerosols emanating from the southwestern African savannas can cool the region significantly and increase the marine stratocumulus cloud cover over the southeast tropical Atlantic Ocean by aerosol-induced diabatic heating of the free troposphere above the low clouds. Experiments conducted with CAM4 coupled to a slab ocean model suggest that present-day aerosols can shift the ITCZ southwards over the tropical Atlantic and can reduce the ocean mixed-layer temperature beneath the increased marine stratocumulus clouds in the southeastern tropical Atlantic.

  12. MODFLOW-LGR: Practical application to a large regional dataset

    NASA Astrophysics Data System (ADS)

    Barnes, D.; Coulibaly, K. M.

    2011-12-01

    In many areas of the US, including southwest Florida, large regional-scale groundwater models have been developed to aid in decision making and water resources management. These models are subsequently used as a basis for site-specific investigations. Because the large scale of these regional models is not appropriate for local application, refinement is necessary to analyze the local effects of pumping wells and groundwater related projects at specific sites. The most commonly used approach to date is Telescopic Mesh Refinement or TMR. It allows the extraction of a subset of the large regional model with boundary conditions derived from the regional model results. The extracted model is then updated and refined for local use using a variable sized grid focused on the area of interest. MODFLOW-LGR, local grid refinement, is an alternative approach which allows model discretization at a finer resolution in areas of interest and provides coupling between the larger "parent" model and the locally refined "child." In the present work, these two approaches are tested on a mining impact assessment case in southwest Florida using a large regional dataset (The Lower West Coast Surficial Aquifer System Model). Various metrics for performance are considered. They include: computation time, water balance (as compared to the variable sized grid), calibration, implementation effort, and application advantages and limitations. The results indicate that MODFLOW-LGR is a useful tool to improve local resolution of regional scale models. While performance metrics, such as computation time, are case-dependent (model size, refinement level, stresses involved), implementation effort, particularly when regional models of suitable scale are available, can be minimized. The creation of multiple child models within a larger scale parent model makes it possible to reuse the same calibrated regional dataset with minimal modification. In cases similar to the Lower West Coast model, where a model is larger than optimal for direct application as a parent grid, a combination of TMR and LGR approaches should be used to develop a suitable parent grid.

  13. Response Surface Methodology Using a Fullest Balanced Model: A Re-Analysis of a Dataset in the Korean Journal for Food Science of Animal Resources.

    PubMed

    Rheem, Sungsue; Rheem, Insoo; Oh, Sejong

    2017-01-01

    Response surface methodology (RSM) is a useful set of statistical techniques for modeling and optimizing responses in research studies of food science. In the analysis of response surface data, a second-order polynomial regression model is usually used. However, sometimes we encounter situations where the fit of the second-order model is poor. If the model fitted to the data has a poor fit, including a lack of fit, the modeling and optimization results might not be accurate. In such a case, using a fullest balanced model, which has no lack of fit, can fix this problem, enhancing the accuracy of the response surface modeling and optimization. This article presents how to develop and use such a model for better modeling and optimization of the response through an illustrative re-analysis of a dataset in Park et al. (2014) published in the Korean Journal for Food Science of Animal Resources.
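
    For context, the usual starting point that the article improves upon, a second-order polynomial response surface fitted by ordinary least squares, looks like the sketch below; the factor and response names are placeholders, and the paper's fullest balanced model would add further balanced terms until no lack of fit remains.

```python
# Sketch: fit a standard second-order response surface (linear, interaction and
# quadratic terms) to two factors X1, X2 and a response Y.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("rsm_design.csv")   # hypothetical central-composite design data

model = smf.ols("Y ~ X1 + X2 + I(X1*X2) + I(X1**2) + I(X2**2)", data=df).fit()
print(model.summary())

# A poor fit here (low R-squared, significant lack of fit) is the situation in which
# the article's fullest balanced model is proposed as a replacement.
```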

  14. TH-E-BRF-05: Comparison of Survival-Time Prediction Models After Radiotherapy for High-Grade Glioma Patients Based On Clinical and DVH Features

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Magome, T; Haga, A; Igaki, H

    Purpose: Although many outcome prediction models based on dose-volume information have been proposed, it is well known that the prognosis may be affected also by multiple clinical factors. The purpose of this study is to predict the survival time after radiotherapy for high-grade glioma patients based on features including clinical and dose-volume histogram (DVH) information. Methods: A total of 35 patients with high-grade glioma (oligodendroglioma: 2, anaplastic astrocytoma: 3, glioblastoma: 30) were selected in this study. All patients were treated with prescribed dose of 30–80 Gy after surgical resection or biopsy from 2006 to 2013 at The University of Tokyo Hospital. All cases were randomly separated into training dataset (30 cases) and test dataset (5 cases). The survival time after radiotherapy was predicted based on a multiple linear regression analysis and artificial neural network (ANN) by using 204 candidate features. The candidate features included the 12 clinical features (tumor location, extent of surgical resection, treatment duration of radiotherapy, etc.), and the 192 DVH features (maximum dose, minimum dose, D95, V60, etc.). The effective features for the prediction were selected according to a step-wise method by using 30 training cases. The prediction accuracy was evaluated by a coefficient of determination (R²) between the predicted and actual survival time for the training and test dataset. Results: In the multiple regression analysis, the value of R² between the predicted and actual survival time was 0.460 for the training dataset and 0.375 for the test dataset. On the other hand, in the ANN analysis, the value of R² was 0.806 for the training dataset and 0.811 for the test dataset. Conclusion: Although a large number of patients would be needed for more accurate and robust prediction, our preliminary result showed the potential to predict the outcome in the patients with high-grade glioma. This work was partly supported by the JSPS Core-to-Core Program (No. 23003) and Grant-in-aid from the JSPS Fellows.
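
    A simplified version of the comparison reported above, multiple linear regression versus a small neural network scored by R² on held-out cases, can be sketched as follows; the feature table, the network size and the train/test split are assumptions and the step-wise feature selection used in the study is not reproduced.

```python
# Sketch: compare multiple linear regression and a small neural network for
# predicting survival time from clinical and DVH features.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("glioma_features.csv")              # hypothetical: candidate features + outcome
y = df.pop("survival_months")
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=5, random_state=0)

for name, model in [("linear regression", LinearRegression()),
                    ("neural network", MLPRegressor(hidden_layer_sizes=(10,),
                                                    max_iter=5000, random_state=0))]:
    model.fit(X_train, y_train)
    print(name, "test R2 =", round(r2_score(y_test, model.predict(X_test)), 3))
```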

  15. Predicting Overall Survival After Stereotactic Ablative Radiation Therapy in Early-Stage Lung Cancer: Development and External Validation of the Amsterdam Prognostic Model

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Louie, Alexander V.; Department of Radiation Oncology, London Regional Cancer Program, University of Western Ontario, London, Ontario; Department of Epidemiology, Harvard School of Public Health, Harvard University, Boston, Massachusetts

    Purpose: A prognostic model for 5-year overall survival (OS), consisting of recursive partitioning analysis (RPA) and a nomogram, was developed for patients with early-stage non-small cell lung cancer (ES-NSCLC) treated with stereotactic ablative radiation therapy (SABR). Methods and Materials: A primary dataset of 703 ES-NSCLC SABR patients was randomly divided into a training (67%) and an internal validation (33%) dataset. In the former group, 21 unique parameters consisting of patient, treatment, and tumor factors were entered into an RPA model to predict OS. Univariate and multivariate models were constructed for RPA-selected factors to evaluate their relationship with OS. A nomogram for OS was constructed based on factors significant in multivariate modeling and validated with calibration plots. Both the RPA and the nomogram were externally validated in independent surgical (n=193) and SABR (n=543) datasets. Results: RPA identified 2 distinct risk classes based on tumor diameter, age, World Health Organization performance status (PS) and Charlson comorbidity index. This RPA had moderate discrimination in SABR datasets (c-index range: 0.52-0.60) but was of limited value in the surgical validation cohort. The nomogram predicting OS included smoking history in addition to RPA-identified factors. In contrast to RPA, validation of the nomogram performed well in internal validation (r²=0.97) and external SABR (r²=0.79) and surgical cohorts (r²=0.91). Conclusions: The Amsterdam prognostic model is the first externally validated prognostication tool for OS in ES-NSCLC treated with SABR available to individualize patient decision making. The nomogram retained strong performance across surgical and SABR external validation datasets. RPA performance was poor in surgical patients, suggesting that 2 distinct patient populations are being treated with these 2 effective modalities.

  16. National Water Model: Providing the Nation with Actionable Water Intelligence

    NASA Astrophysics Data System (ADS)

    Aggett, G. R.; Bates, B.

    2017-12-01

    The National Water Model (NWM) provides national, street-level detail of water movement through time and space. Operating hourly, this flood of information offers enormous benefits in the form of water resource management, natural disaster preparedness, and the protection of life and property. The Geo-Intelligence Division at the NOAA National Water Center supplies forecasters and decision-makers with timely, actionable water intelligence through the processing of billions of NWM data points every hour. These datasets include current streamflow estimates, short and medium range streamflow forecasts, and many other ancillary datasets. The sheer amount of NWM data produced yields a dataset too large to allow for direct human comprehension. As such, it is necessary to undergo model data post-processing, filtering, and data ingestion by visualization web apps that make use of cartographic techniques to bring attention to the areas of highest urgency. This poster illustrates NWM output post-processing and cartographic visualization techniques being developed and employed by the Geo-Intelligence Division at the NOAA National Water Center to provide national actionable water intelligence.

  17. An approach for spherical harmonic analysis of non-smooth data

    NASA Astrophysics Data System (ADS)

    Wang, Hansheng; Wu, Patrick; Wang, Zhiyong

    2006-12-01

    A method is proposed to evaluate the spherical harmonic coefficients of a global or regional, non-smooth, observable dataset sampled on an equiangular grid. The method is based on an integration strategy using new recursion relations. Because a bilinear function is used to interpolate points within the grid cells, this method is suitable for non-smooth data; the slope of the data may be piecewise continuous, with extreme changes at the boundaries. In order to validate the method, the coefficients of an axisymmetric model are computed, and compared with the derived analytical expressions. Numerical results show that this method is indeed reasonable for non-smooth models, and that the maximum degree for spherical harmonic analysis should be empirically determined by several factors including the model resolution and the degree of non-smoothness in the dataset, and it can be several times larger than the total number of latitudinal grid points. It is also shown that this method is appropriate for the approximate analysis of a smooth dataset. Moreover, this paper provides the program flowchart and an internet address where the FORTRAN code with program specifications are made available.
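
    A brute-force version of the analysis step, projecting data sampled on an equiangular grid onto spherical harmonics by direct numerical quadrature, is sketched below for low degrees; it uses simple quadrature weights and SciPy's sph_harm rather than the paper's bilinear-interpolation recursions, so it is only a baseline for smooth data rather than the proposed method.

```python
# Sketch: estimate spherical harmonic coefficients of a field sampled on an
# equiangular (colatitude x longitude) grid by direct numerical quadrature.
import numpy as np
from scipy.special import sph_harm

def sh_coefficients(field, lmax):
    nlat, nlon = field.shape
    theta = np.linspace(0.0, np.pi, nlat)                       # colatitude
    phi = np.linspace(0.0, 2.0 * np.pi, nlon, endpoint=False)   # longitude
    PHI, THETA = np.meshgrid(phi, theta)                        # both (nlat, nlon)
    dphi = 2.0 * np.pi / nlon
    # Approximate colatitude quadrature weights times the sin(theta) area element.
    wtheta = np.gradient(theta) * np.sin(theta)
    coeffs = {}
    for l in range(lmax + 1):
        for m in range(-l, l + 1):
            ylm = sph_harm(m, l, PHI, THETA)     # SciPy order: sph_harm(m, n, azimuth, colatitude)
            integrand = field * np.conj(ylm) * wtheta[:, None] * dphi
            coeffs[(l, m)] = integrand.sum()
    return coeffs
```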

  18. NASA AVOSS Fast-Time Models for Aircraft Wake Prediction: User's Guide (APA3.8 and TDP2.1)

    NASA Technical Reports Server (NTRS)

    Ahmad, Nash'at N.; VanValkenburg, Randal L.; Pruis, Matthew J.; Limon Duparcmeur, Fanny M.

    2016-01-01

    NASA's current distribution of fast-time wake vortex decay and transport models includes APA (Version 3.8) and TDP (Version 2.1). This User's Guide provides detailed information on the model inputs, file formats, and model outputs. A brief description of the Memphis 1995, Dallas/Fort Worth 1997, and the Denver 2003 wake vortex datasets is given along with the evaluation of models. A detailed bibliography is provided which includes publications on model development, wake field experiment descriptions, and applications of the fast-time wake vortex models.

  19. Making the MagIC (Magnetics Information Consortium) Web Application Accessible to New Users and Useful to Experts

    NASA Astrophysics Data System (ADS)

    Minnett, R.; Koppers, A.; Jarboe, N.; Tauxe, L.; Constable, C.; Jonestrask, L.

    2017-12-01

    Both new and experienced users face challenges when contributing their data to community repositories, discovering data, or engaging in potentially transformative science. The Magnetics Information Consortium (https://earthref.org/MagIC) has recently simplified its data model and developed a new containerized web application to reduce the friction in contributing, exploring, and combining valuable and complex datasets for the paleo-, geo-, and rock magnetic scientific community. The new data model more closely reflects the hierarchical workflow in paleomagnetic experiments to enable adequate annotation of scientific results and ensure reproducibility. The new open-source (https://github.com/earthref/MagIC) application includes an upload tool that is integrated with the data model to provide early data validation feedback and ease the friction of contributing and updating datasets. The search interface provides a powerful full text search of contributions indexed by ElasticSearch and a wide array of filters, including specific geographic and geological timescale filtering, to support both novice users exploring the database and experts interested in compiling new datasets with specific criteria across thousands of studies and millions of measurements. The datasets are not large, but they are complex, with many results from evolving experimental and analytical approaches. These data are also extremely valuable due to the cost of collecting or creating physical samples and the often destructive nature of the experiments. MagIC is heavily invested in encouraging young scientists as well as established labs to cultivate workflows that facilitate contributing their data in a consistent format. This eLightning presentation includes a live demonstration of the MagIC web application, developed as a configurable container hosting an isomorphic Meteor JavaScript application, MongoDB database, and ElasticSearch search engine. Visitors can explore the MagIC Database through maps and image or plot galleries, or search and filter the raw measurements and their derived hierarchy of analytical interpretations.

  20. On the radiobiological impact of metal artifacts in head-and-neck IMRT in terms of tumor control probability (TCP) and normal tissue complication probability (NTCP).

    PubMed

    Kim, Yusung; Tomé, Wolfgang A

    2007-11-01

    To investigate the effects of distorted head-and-neck (H&N) intensity-modulated radiation therapy (IMRT) dose distributions (hot and cold spots) on normal tissue complication probability (NTCP) and tumor control probability (TCP) due to dental-metal artifacts. Five patients' IMRT treatment plans were analyzed, employing five different planning image data-sets: (a) uncorrected (UC); (b) homogeneous uncorrected (HUC); (c) sinogram completion corrected (SCC); (d) minimum-value-corrected (MVC); and (e) streak-artifact-reduction including minimum-value-correction (SAR-MVC), which was taken as the reference data-set. The effects on NTCP and TCP were evaluated using the Lyman-NTCP model and the Logistic-TCP model, respectively. When compared to the predicted NTCP obtained using the reference data-set, the treatment plan based on the original CT data-set (UC) yielded an increase in NTCP of 3.2% and 2.0% for the spared parotid gland and the spinal cord, respectively, while the treatment plans based on the MVC CT data-set increased NTCP by only 1.1% and 0.1% for the spared parotid glands and the spinal cord, respectively. In addition, the MVC correction method showed a smaller reduction in TCP for target volumes (MVC: delta TCP = -0.6% vs. UC: delta TCP = -1.9%) with respect to the reference CT data-set. Our results indicate that the presence of dental-metal artifacts in H&N planning CT data-sets has an impact on the estimates of TCP and NTCP. In particular, dental-metal artifacts lead to an increase in NTCP for the spared parotid glands and a slight decrease in TCP for target volumes.
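
    For reference, a common textbook parameterization of the two models named above is sketched below (the standard forms, not the specific parameter values used in this study): the Lyman NTCP model driven by a generalized equivalent uniform dose, and a logistic TCP model.

      \mathrm{NTCP} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^{2}/2}\,dx ,
      \qquad
      t = \frac{D_{\mathrm{eff}} - TD_{50}}{m\,TD_{50}} ,
      \qquad
      D_{\mathrm{eff}} = \Bigl( \sum_i v_i D_i^{1/n} \Bigr)^{n} ,
      \qquad
      \mathrm{TCP} = \frac{1}{1 + \bigl( D_{50} / D \bigr)^{4\gamma_{50}}} .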

  1. Analyzing How We Do Analysis and Consume Data, Results from the SciDAC-Data Project

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ding, P.; Aliaga, L.; Mubarak, M.

    One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.

  2. Analyzing how we do Analysis and Consume Data, Results from the SciDAC-Data Project

    NASA Astrophysics Data System (ADS)

    Ding, P.; Aliaga, L.; Mubarak, M.; Tsaris, A.; Norman, A.; Lyon, A.; Ross, R.

    2017-10-01

    One of the main goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics datasets that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta information for these datasets and analysis projects are being combined with knowledge of their part of the HEP analysis chains for major experiments to understand how modern computing and data delivery is being used. We present the first results of this project, which examine in detail how the CDF, D0, NOvA, MINERvA and MicroBooNE experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities. In particular we present a detailed analysis of the NOvA data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.

  3. Comparing apples and oranges: the Community Intercomparison Suite

    NASA Astrophysics Data System (ADS)

    Schutgens, Nick; Stier, Philip; Kershaw, Philip; Pascoe, Stephen

    2015-04-01

    Visual representation and comparison of geoscientific datasets present a huge challenge due to the large variety of file formats and the differing spatio-temporal sampling of data (be they observations or simulations). The Community Intercomparison Suite (CIS) attempts to greatly simplify these tasks by offering an intelligent but simple command line tool for visualisation and colocation of diverse datasets. In addition, CIS can subset and aggregate large datasets into smaller, more manageable datasets. Our philosophy is to remove, as far as possible, the need for the user to have specialist knowledge of the structure of a dataset. The colocation of observations with model data is as simple as a single "cis col" command naming the variable, the data files and the sample files, which will resample the simulation data to the spatio-temporal sampling of the observations, contingent on a few user-defined options that specify a resampling kernel. As an example, we apply CIS to a case study of biomass burning aerosol from the Congo. Remote sensing observations, in-situ observations and model data are shown in various plots, with the purpose of either comparing different datasets or integrating them into a single comprehensive picture. CIS can deal with both gridded and ungridded datasets of 2, 3 or 4 spatio-temporal dimensions. It can handle different spatial coordinates (e.g. longitude or distance, altitude or pressure level). CIS supports HDF, netCDF and ASCII file formats. The suite is written in Python with entirely publicly available open-source dependencies. Plug-ins allow a high degree of user modifiability. A web-based developer hub includes a manual and simple examples. CIS is developed as open source code by a specialist IT company under the supervision of scientists from the University of Oxford and the Centre for Environmental Data Archival, as part of the investment in the JASMIN super-data-cluster facility.

  4. Creating a non-linear total sediment load formula using polynomial best subset regression model

    NASA Astrophysics Data System (ADS)

    Okcu, Davut; Pektas, Ali Osman; Uyumaz, Ali

    2016-08-01

    The aim of this study is to derive a new total sediment load formula which is more accurate and has fewer application constraints than the well-known formulae in the literature. The five best-known stream-power-concept sediment formulae approved by ASCE are used for benchmarking on a wide range of datasets that includes both field and flume (laboratory) observations. The dimensionless parameters of these widely used formulae are used as inputs in a new regression approach, called polynomial best subset regression (PBSR) analysis. The aim of PBSR is to fit and test all possible combinations of the input variables and select the best subset. All input variables, together with their second and third powers, are included in the regression to test possible relations between the explanatory variables and the dependent variable. The best subset is selected with a multistep approach that depends on significance values and on the degree of multicollinearity among the inputs. The new formula is compared to the others on a holdout dataset, and detailed performance investigations are conducted for the field and lab subsets within this holdout data. Different goodness-of-fit statistics are used as they represent different perspectives on model accuracy. After the detailed comparisons, we identified the most accurate equation that is also applicable to both flume and river data. In particular, on the field dataset the prediction performance of the proposed formula outperformed the benchmark formulations.
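
    The PBSR idea of exhaustively testing subsets of the dimensionless inputs together with their squares and cubes can be sketched as follows. This is a minimal illustration with hypothetical column names, not the authors' code; model selection here uses adjusted R² only, whereas the paper also screens on coefficient significance and multicollinearity.

      from itertools import combinations
      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LinearRegression

      def adjusted_r2(model, X, y):
          n, p = X.shape
          r2 = model.score(X, y)
          return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

      def polynomial_best_subset(df, inputs, target, max_terms=4):
          """Expand each input with its 2nd and 3rd powers, then fit every subset of
          candidate terms and keep the one with the best adjusted R^2."""
          expanded = pd.DataFrame(index=df.index)
          for c in inputs:
              expanded[c] = df[c]
              expanded[c + "^2"] = df[c] ** 2
              expanded[c + "^3"] = df[c] ** 3
          y = df[target].to_numpy()
          best = (None, -np.inf)
          for k in range(1, max_terms + 1):
              for subset in combinations(expanded.columns, k):
                  X = expanded[list(subset)].to_numpy()
                  model = LinearRegression().fit(X, y)
                  score = adjusted_r2(model, X, y)
                  if score > best[1]:
                      best = (subset, score)
          return best

      # Hypothetical usage with dimensionless predictors x1..x3 and a target Ct:
      # subset, score = polynomial_best_subset(data, ["x1", "x2", "x3"], "Ct")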

  5. Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts.

    PubMed

    Cocos, Anne; Fiks, Alexander G; Masino, Aaron J

    2017-07-01

    Social media is an important pharmacovigilance data source for adverse drug reaction (ADR) identification. Human review of social media data is infeasible due to data quantity, thus natural language processing techniques are necessary. Social media includes informal vocabulary and irregular grammar, which challenge natural language processing methods. Our objective is to develop a scalable, deep-learning approach that exceeds state-of-the-art ADR detection performance in social media. We developed a recurrent neural network (RNN) model that labels words in an input sequence with ADR membership tags. The only input features are word-embedding vectors, which can be formed through task-independent pretraining or during ADR detection training. Our best-performing RNN model used pretrained word embeddings created from a large, non-domain-specific Twitter dataset. It achieved an approximate match F-measure of 0.755 for ADR identification on the dataset, compared to 0.631 for a baseline lexicon system and 0.65 for the state-of-the-art conditional random field model. Feature analysis indicated that semantic information in pretrained word embeddings boosted sensitivity and, combined with contextual awareness captured in the RNN, precision. Our model required no task-specific feature engineering, suggesting generalizability to additional sequence-labeling tasks. Learning curve analysis showed that our model reached optimal performance with fewer training examples than the other models. ADR detection performance in social media is significantly improved by using a contextually aware model and word embeddings formed from large, unlabeled datasets. The approach reduces manual data-labeling requirements and is scalable to large social media datasets. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com
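
    The sequence-labelling setup described above can be illustrated with a small bidirectional recurrent tagger. The snippet below is a minimal PyTorch sketch (hypothetical vocabulary and tag sizes, randomly initialized rather than pretrained embeddings), not the authors' model.

      import torch
      import torch.nn as nn

      class BiLSTMTagger(nn.Module):
          """Tags each token in a sequence with an ADR/non-ADR label."""
          def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, embed_dim)   # could be loaded from pretrained embeddings
              self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
              self.out = nn.Linear(2 * hidden_dim, num_tags)

          def forward(self, token_ids):
              emb = self.embed(token_ids)          # (batch, seq_len, embed_dim)
              ctx, _ = self.lstm(emb)              # (batch, seq_len, 2 * hidden_dim)
              return self.out(ctx)                 # per-token tag scores

      # Toy usage: 2 sequences of 10 token ids, 3 tags (e.g. O, B-ADR, I-ADR).
      model = BiLSTMTagger(vocab_size=5000, embed_dim=100, hidden_dim=64, num_tags=3)
      tokens = torch.randint(0, 5000, (2, 10))
      logits = model(tokens)                       # (2, 10, 3)
      loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3), torch.zeros(20, dtype=torch.long))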

  6. Multiresolution comparison of precipitation datasets for large-scale models

    NASA Astrophysics Data System (ADS)

    Chun, K. P.; Sapriza Azuri, G.; Davison, B.; DeBeer, C. M.; Wheater, H. S.

    2014-12-01

    Gridded precipitation datasets are crucial for driving large-scale models used in weather forecasting and climate research. However, the quality of precipitation products is usually validated individually. Comparing gridded precipitation products against one another, along with ground observations, provides another avenue for investigating how precipitation uncertainty affects the performance of large-scale models. In this study, using data from a set of precipitation gauges over British Columbia and Alberta, we evaluate several widely used North American gridded products, including the Canadian Gridded Precipitation Anomalies (CANGRD), the National Centers for Environmental Prediction (NCEP) reanalysis, the Water and Global Change (WATCH) project, the thin-plate spline smoothing algorithm (ANUSPLIN) and the Canadian Precipitation Analysis (CaPA). Based on verification criteria at various temporal and spatial scales, the results provide an assessment of possible applications for the various precipitation datasets. For long-term climate variation studies (~100 years), CANGRD, NCEP, WATCH and ANUSPLIN have different comparative advantages in terms of resolution and accuracy. For synoptic and mesoscale precipitation patterns, CaPA provides appealing spatial coherence. In addition to the product comparison, various downscaling methods are surveyed to explore new verification and bias-reduction methods for improving gridded precipitation outputs for large-scale models.

  7. Process mining in oncology using the MIMIC-III dataset

    NASA Astrophysics Data System (ADS)

    Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen

    2018-03-01

    Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option and this paper describes the potential to use MIMIC-III, for process mining in oncology. MIMIC-III is a large open access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
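
    As a flavour of how event tables can feed process mining, the sketch below builds a directly-follows count matrix from a generic event table with pandas. The column names are hypothetical stand-ins for MIMIC-III fields (subject identifier, event type, charted time); dedicated process mining tools would then discover and visualize process models from such an event log.

      import pandas as pd

      def directly_follows(events, case_col="subject_id", activity_col="event_type",
                           time_col="charttime"):
          """Count how often activity A is directly followed by activity B within the
          same case (patient), a basic building block of process discovery."""
          log = events.sort_values([case_col, time_col])
          pairs = (
              log.assign(next_activity=log.groupby(case_col)[activity_col].shift(-1))
                 .dropna(subset=["next_activity"])
          )
          return pairs.groupby([activity_col, "next_activity"]).size().unstack(fill_value=0)

      # Hypothetical usage:
      # dfg = directly_follows(pd.read_csv("events.csv"))
      # print(dfg)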

  8. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

    PubMed Central

    Wernisch, Lorenz

    2017-01-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190

  9. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

    PubMed

    Gabasova, Evelina; Reid, John; Wernisch, Lorenz

    2017-10-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.

  10. Soil moisture datasets at five sites in the central Sierra Nevada and northern Coast Ranges, California

    USGS Publications Warehouse

    Stern, Michelle A.; Anderson, Frank A.; Flint, Lorraine E.; Flint, Alan L.

    2018-05-03

    In situ soil moisture datasets are important inputs used to calibrate and validate watershed, regional, or statewide modeled and satellite-based soil moisture estimates. The soil moisture dataset presented in this report includes hourly time series of the following: soil temperature, volumetric water content, water potential, and total soil water content. Data were collected by the U.S. Geological Survey at five locations in California: three sites in the central Sierra Nevada and two sites in the northern Coast Ranges. This report provides a description of each of the study areas, procedures and equipment used, processing steps, and time series data from each site in the form of comma-separated values (.csv) tables.

  11. New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data

    PubMed Central

    Gogoshin, Grigoriy; Boerwinkle, Eric

    2017-01-01

    Bayesian network (BN) reconstruction is a prototypical systems biology data analysis approach that has been successfully used to reverse engineer and model networks reflecting different layers of biological organization (ranging from genetic to epigenetic to cellular pathway to metabolomic). It is especially relevant in the context of modern (ongoing and prospective) studies that generate heterogeneous high-throughput omics datasets. However, there are both theoretical and practical obstacles to the seamless application of BN modeling to such big data, including computational inefficiency of optimal BN structure search algorithms, ambiguity in data discretization, mixing data types, imputation and validation, and, in general, limited scalability in both reconstruction and visualization of BNs. To overcome these and other obstacles, we present BNOmics, an improved algorithm and software toolkit for inferring and analyzing BNs from omics datasets. BNOmics aims at comprehensive systems biology-type data exploration, including both generating new biological hypothesis and testing and validating the existing ones. Novel aspects of the algorithm center around increasing scalability and applicability to varying data types (with different explicit and implicit distributional assumptions) within the same analysis framework. An output and visualization interface to widely available graph-rendering software is also included. Three diverse applications are detailed. BNOmics was originally developed in the context of genetic epidemiology data and is being continuously optimized to keep pace with the ever-increasing inflow of available large-scale omics datasets. As such, the software scalability and usability on the less than exotic computer hardware are a priority, as well as the applicability of the algorithm and software to the heterogeneous datasets containing many data types: single-nucleotide polymorphisms and other genetic/epigenetic/transcriptome variables, metabolite levels, epidemiological variables, endpoints, and phenotypes, etc. PMID:27681505

  12. New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data.

    PubMed

    Gogoshin, Grigoriy; Boerwinkle, Eric; Rodin, Andrei S

    2017-04-01

    Bayesian network (BN) reconstruction is a prototypical systems biology data analysis approach that has been successfully used to reverse engineer and model networks reflecting different layers of biological organization (ranging from genetic to epigenetic to cellular pathway to metabolomic). It is especially relevant in the context of modern (ongoing and prospective) studies that generate heterogeneous high-throughput omics datasets. However, there are both theoretical and practical obstacles to the seamless application of BN modeling to such big data, including computational inefficiency of optimal BN structure search algorithms, ambiguity in data discretization, mixing data types, imputation and validation, and, in general, limited scalability in both reconstruction and visualization of BNs. To overcome these and other obstacles, we present BNOmics, an improved algorithm and software toolkit for inferring and analyzing BNs from omics datasets. BNOmics aims at comprehensive systems biology-type data exploration, including both generating new biological hypothesis and testing and validating the existing ones. Novel aspects of the algorithm center around increasing scalability and applicability to varying data types (with different explicit and implicit distributional assumptions) within the same analysis framework. An output and visualization interface to widely available graph-rendering software is also included. Three diverse applications are detailed. BNOmics was originally developed in the context of genetic epidemiology data and is being continuously optimized to keep pace with the ever-increasing inflow of available large-scale omics datasets. As such, the software scalability and usability on the less than exotic computer hardware are a priority, as well as the applicability of the algorithm and software to the heterogeneous datasets containing many data types-single-nucleotide polymorphisms and other genetic/epigenetic/transcriptome variables, metabolite levels, epidemiological variables, endpoints, and phenotypes, etc.

  13. An intercomparison of GCM and RCM dynamical downscaling for characterizing the hydroclimatology of California and Nevada

    NASA Astrophysics Data System (ADS)

    Xu, Z.; Rhoades, A.; Johansen, H.; Ullrich, P. A.; Collins, W. D.

    2017-12-01

    Dynamical downscaling is widely used to properly characterize regional surface heterogeneities that shape the local hydroclimatology. However, the factors in dynamical downscaling, including the refinement of model horizontal resolution, large-scale forcing datasets and dynamical cores, have not been fully evaluated. Two cutting-edge global-to-regional downscaling methods are used to assess these factors, specifically the variable-resolution Community Earth System Model (VR-CESM) and the Weather Research & Forecasting (WRF) regional climate model, under different horizontal resolutions (28, 14, and 7 km). Two groups of WRF simulations are driven by either the NCEP reanalysis dataset (WRF_NCEP) or VR-CESM outputs (WRF_VRCESM) to evaluate the effects of the large-scale forcing datasets. The impacts of the dynamical core are assessed by comparing the VR-CESM simulations to the coupled WRF_VRCESM simulations with the same physical parameterizations and similar grid domains. The simulated hydroclimatology (i.e., total precipitation, snow cover, snow water equivalent and surface temperature) is compared with the reference datasets. The large-scale forcing datasets are critical to the WRF simulations in more accurately simulating total precipitation, SWE and snow cover, but not surface temperature. Both the WRF and VR-CESM results highlight that no significant benefit to the simulated hydroclimatology is found from simply refining the horizontal resolution from 28 to 7 km. Simulated surface temperature is sensitive to the choice of dynamical core. WRF generally simulates higher temperatures than VR-CESM, alleviating the systematic cold bias in DJF temperatures over the California mountain region but overestimating JJA temperatures in California's Central Valley.

  14. Global patterns of current and future road infrastructure

    NASA Astrophysics Data System (ADS)

    Meijer, Johan R.; Huijbregts, Mark A. J.; Schotten, Kees C. G. J.; Schipper, Aafke M.

    2018-06-01

    Georeferenced information on road infrastructure is essential for spatial planning, socio-economic assessments and environmental impact analyses. Yet current global road maps are typically outdated or characterized by spatial bias in coverage. In the Global Roads Inventory Project we gathered, harmonized and integrated nearly 60 geospatial datasets on road infrastructure into a global roads dataset. The resulting dataset covers 222 countries and includes over 21 million km of roads, which is two to three times the total length in the currently best available country-based global roads datasets. We then related total road length per country to country area, population density, GDP and OECD membership, resulting in a regression model with an adjusted R² of 0.90, and found that the highest road densities are associated with densely populated and wealthier countries. Applying our regression model to future population densities and GDP estimates from the Shared Socioeconomic Pathway (SSP) scenarios, we obtained a tentative estimate of 3.0–4.7 million km additional road length for the year 2050. Large increases in road length were projected for developing nations in some of the world’s last remaining wilderness areas, such as the Amazon, the Congo basin and New Guinea. This highlights the need for accurate spatial road datasets to underpin strategic spatial planning in order to reduce the impacts of roads in remaining pristine ecosystems.
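
    The country-level regression described above could be sketched along the following lines (a hedged illustration with hypothetical column names, not the GRIP model itself): a log-linear ordinary least squares fit of road length against area, population density, GDP and an OECD-membership dummy.

      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LinearRegression

      def fit_road_length_model(df):
          """Log-linear OLS of total road length per country on a few national covariates.
          Column names (road_km, area_km2, pop_density, gdp_per_capita, oecd) are hypothetical."""
          X = np.column_stack([
              np.log(df["area_km2"]),
              np.log(df["pop_density"]),
              np.log(df["gdp_per_capita"]),
              df["oecd"].astype(float),            # 0/1 membership dummy
          ])
          y = np.log(df["road_km"])
          model = LinearRegression().fit(X, y)
          n, p = X.shape
          r2 = model.score(X, y)
          adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
          return model, adj_r2

      # model, adj_r2 = fit_road_length_model(pd.read_csv("country_table.csv"))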

  15. The allometric exponent for scaling clearance varies with age: a study on seven propofol datasets ranging from preterm neonates to adults.

    PubMed

    Wang, Chenguang; Allegaert, Karel; Peeters, Mariska Y M; Tibboel, Dick; Danhof, Meindert; Knibbe, Catherijne A J

    2014-01-01

    For scaling clearance between adults and children, allometric scaling with a fixed exponent of 0.75 is often applied. In this analysis, we performed a systematic study on the allometric exponent for scaling propofol clearance between two subpopulations selected from neonates, infants, toddlers, children, adolescents and adults. Seven propofol studies were included in the analysis (neonates, infants, toddlers, children, adolescents, adults1 and adults2). In a systematic manner, two out of the six study populations were selected resulting in 15 combined datasets. In addition, the data of the seven studies were regrouped into five age groups (FDA Guidance 1998), from which four combined datasets were prepared consisting of one paediatric age group and the adult group. In each of these 19 combined datasets, the allometric scaling exponent for clearance was estimated using population pharmacokinetic modelling (nonmem 7.2). The allometric exponent for propofol clearance varied between 1.11 and 2.01 in cases where the neonate dataset was included. When two paediatric datasets were analyzed, the exponent varied between 0.2 and 2.01, while it varied between 0.56 and 0.81 when the adult population and a paediatric dataset except for neonates were selected. Scaling from adults to adolescents, children, infants and neonates resulted in exponents of 0.74, 0.70, 0.60 and 1.11 respectively. For scaling clearance, ¾ allometric scaling may be of value for scaling between adults and adolescents or children, while it can neither be used for neonates nor for two paediatric populations. For scaling to neonates an exponent between 1 and 2 was identified. © 2013 The British Pharmacological Society.
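
    The allometric scaling estimated in the study takes the standard power form below (the general relation, with the exponent k treated as a free parameter rather than fixed at 0.75, and 70 kg as the conventional adult reference weight):

      CL_i = CL_{\mathrm{ref}} \cdot \left( \frac{BW_i}{BW_{\mathrm{ref}}} \right)^{k},
      \qquad BW_{\mathrm{ref}} = 70\ \mathrm{kg}.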

  16. Genomic prediction based on data from three layer lines using non-linear regression models.

    PubMed

    Huang, Heyun; Windig, Jack J; Vereijken, Addie; Calus, Mario P L

    2014-11-06

    Most studies on genomic prediction with reference populations that include multiple lines or breeds have used linear models. Data heterogeneity due to using multiple populations may conflict with model assumptions used in linear regression methods. In an attempt to alleviate potential discrepancies between assumptions of linear models and multi-population data, two types of alternative models were used: (1) a multi-trait genomic best linear unbiased prediction (GBLUP) model that modelled trait by line combinations as separate but correlated traits and (2) non-linear models based on kernel learning. These models were compared to conventional linear models for genomic prediction for two lines of brown layer hens (B1 and B2) and one line of white hens (W1). The three lines each had 1004 to 1023 training and 238 to 240 validation animals. Prediction accuracy was evaluated by estimating the correlation between observed phenotypes and predicted breeding values. When the training dataset included only data from the evaluated line, non-linear models yielded at best a similar accuracy as linear models. In some cases, when adding a distantly related line, the linear models showed a slight decrease in performance, while non-linear models generally showed no change in accuracy. When only information from a closely related line was used for training, linear models and non-linear radial basis function (RBF) kernel models performed similarly. The multi-trait GBLUP model took advantage of the estimated genetic correlations between the lines. Combining linear and non-linear models improved the accuracy of multi-line genomic prediction. Linear models and non-linear RBF models performed very similarly for genomic prediction, despite the expectation that non-linear models could deal better with the heterogeneous multi-population data. This heterogeneity of the data can be overcome by modelling trait by line combinations as separate but correlated traits, which avoids the occasional occurrence of large negative accuracies when the evaluated line was not included in the training dataset. Furthermore, when using a multi-line training dataset, non-linear models provided information on the genotype data that was complementary to the linear models, which indicates that the underlying data distributions of the three studied lines were indeed heterogeneous.
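
    A minimal sketch of the non-linear kernel approach contrasted with GBLUP above is given below. It is an illustration using scikit-learn's RBF kernel ridge regression on a synthetic genotype matrix, not the authors' implementation; the simulated data is only there to make the snippet self-contained.

      import numpy as np
      from sklearn.kernel_ridge import KernelRidge
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      n_animals, n_snps = 1000, 5000
      X = rng.integers(0, 3, size=(n_animals, n_snps)).astype(float)          # 0/1/2 genotype codes
      y = X[:, :50] @ rng.normal(size=50) + rng.normal(scale=2.0, size=n_animals)  # toy phenotype

      X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

      # RBF kernel ridge regression: a non-linear counterpart to (G)BLUP-style linear models.
      model = KernelRidge(kernel="rbf", alpha=1.0, gamma=1.0 / n_snps)
      model.fit(X_train, y_train)

      # "Accuracy" as used in genomic prediction: correlation between observed and predicted values.
      accuracy = np.corrcoef(y_val, model.predict(X_val))[0, 1]
      print(f"validation accuracy (correlation): {accuracy:.2f}")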

  17. Recent Upgrades to NASA SPoRT Initialization Datasets for the Environmental Modeling System

    NASA Technical Reports Server (NTRS)

    Case, Jonathan L.; Lafontaine, Frank J.; Molthan, Andrew L.; Zavodsky, Bradley T.; Rozumalski, Robert A.

    2012-01-01

    The NASA Short-term Prediction Research and Transition (SPoRT) Center has developed several products for its NOAA/National Weather Service (NWS) partners that can initialize specific fields for local model runs within the NOAA/NWS Science and Training Resource Center Environmental Modeling System (EMS). The suite of SPoRT products for use in the EMS consists of a Sea Surface Temperature (SST) composite that includes a Lake Surface Temperature (LST) analysis over the Great Lakes, a Great Lakes sea-ice extent within the SST composite, a real-time Green Vegetation Fraction (GVF) composite, and NASA Land Information System (LIS) gridded output. This paper and companion poster describe each dataset and provide recent upgrades made to the SST, Great Lakes LST, GVF composites, and the real-time LIS runs.

  18. Detecting and modelling structures on the micro and the macro scales: Assessing their effects on solute transport behaviour

    NASA Astrophysics Data System (ADS)

    Haslauer, C. P.; Bárdossy, A.; Sudicky, E. A.

    2017-09-01

    This paper demonstrates a quantitative approach for separating a dataset of spatially distributed variables into distinct entities and subsequently characterizing their geostatistical properties. The main contribution of the paper is a statistics-based algorithm that reproduces the results of manual delineation. The algorithm is based on measured data and is generally applicable. In this paper, it is successfully applied to two datasets of saturated hydraulic conductivity (K) measured at the Borden (Canada) and Lauswiesen (Germany) aquifers. The boundary layer was successfully delineated at Borden despite its only mild heterogeneity and the small statistical differences between the divided units. The methods are verified with the more heterogeneous Lauswiesen aquifer K dataset, where a boundary layer has previously been delineated. The effects of the macro- and the microstructure on solute transport behaviour are evaluated using numerical solute tracer experiments. Within the microscale structure, both Gaussian and non-Gaussian models of the spatial dependence of K are evaluated. The effects of heterogeneity on both the macro- and the microscale are analysed using numerical tracer experiments based on four scenarios: including or not including the macroscale structures, and optimally fitting either a Gaussian or a non-Gaussian model for the spatial dependence in the microstructure. The paper shows that both micro- and macro-scale structures are important, as solute transport behaviour differs meaningfully in each of the four geostatistical scenarios.

  19. Central focused convolutional neural networks: Developing a data-driven model for lung nodule segmentation.

    PubMed

    Wang, Shuo; Zhou, Mu; Liu, Zaiyi; Liu, Zhenyu; Gu, Dongsheng; Zang, Yali; Dong, Di; Gevaert, Olivier; Tian, Jie

    2017-08-01

    Accurate lung nodule segmentation from computed tomography (CT) images is of great importance for image-driven lung cancer analysis. However, the heterogeneity of lung nodules and the presence of similar visual characteristics between nodules and their surroundings make robust nodule segmentation difficult. In this study, we propose a data-driven model, termed the Central Focused Convolutional Neural Network (CF-CNN), to segment lung nodules from heterogeneous CT images. Our approach combines two key insights: 1) the proposed model captures a diverse set of nodule-sensitive features from both 3-D and 2-D CT images simultaneously; 2) when classifying an image voxel, the effects of its neighbor voxels can vary according to their spatial locations. We describe this phenomenon by proposing a novel central pooling layer that retains much of the information at the voxel patch center, followed by a multi-scale patch learning strategy. Moreover, we design a weighted sampling scheme to facilitate model training, where training samples are selected according to their degree of segmentation difficulty. The proposed method has been extensively evaluated on the public LIDC dataset including 893 nodules and an independent dataset with 74 nodules from Guangdong General Hospital (GDGH). We showed that CF-CNN achieved superior segmentation performance with average dice scores of 82.15% and 80.02% for the two datasets, respectively. Moreover, we compared our results with the inter-radiologist consistency on the LIDC dataset, showing a difference in average dice score of only 1.98%. Copyright © 2017. Published by Elsevier B.V.
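
    The Dice score used to report segmentation quality above is simple to compute; a small generic NumPy sketch (not tied to the CF-CNN code) is:

      import numpy as np

      def dice_score(pred_mask, true_mask, eps=1e-7):
          """Dice coefficient between two binary segmentation masks (any shape)."""
          pred = np.asarray(pred_mask, dtype=bool)
          true = np.asarray(true_mask, dtype=bool)
          intersection = np.logical_and(pred, true).sum()
          return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

      # Example on two 3-D toy masks:
      a = np.zeros((4, 4, 4), dtype=bool); a[1:3, 1:3, 1:3] = True
      b = np.zeros((4, 4, 4), dtype=bool); b[1:3, 1:3, 1:4] = True
      print(round(dice_score(a, b), 3))   # 0.8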

  20. NASA SPoRT Initialization Datasets for Local Model Runs in the Environmental Modeling System

    NASA Technical Reports Server (NTRS)

    Case, Jonathan L.; LaFontaine, Frank J.; Molthan, Andrew L.; Carcione, Brian; Wood, Lance; Maloney, Joseph; Estupinan, Jeral; Medlin, Jeffrey M.; Blottman, Peter; Rozumalski, Robert A.

    2011-01-01

    The NASA Short-term Prediction Research and Transition (SPoRT) Center has developed several products for its National Weather Service (NWS) partners that can be used to initialize local model runs within the Weather Research and Forecasting (WRF) Environmental Modeling System (EMS). These real-time datasets consist of surface-based information updated at least once per day, and produced in a composite or gridded product that is easily incorporated into the WRF EMS. The primary goal for making these NASA datasets available to the WRF EMS community is to provide timely and high-quality information at a spatial resolution comparable to that used in the local model configurations (i.e., convection-allowing scales). The current suite of SPoRT products supported in the WRF EMS includes a Sea Surface Temperature (SST) composite, a Great Lakes sea-ice extent, a Greenness Vegetation Fraction (GVF) composite, and Land Information System (LIS) gridded output. The SPoRT SST composite is a blend of primarily the Moderate Resolution Imaging Spectroradiometer (MODIS) infrared and Advanced Microwave Scanning Radiometer for Earth Observing System data for non-precipitation coverage over the oceans at 2-km resolution. The composite includes a special lake surface temperature analysis over the Great Lakes using contributions from the Remote Sensing Systems temperature data. The Great Lakes Environmental Research Laboratory Ice Percentage product is used to create a sea-ice mask in the SPoRT SST composite. The sea-ice mask is produced daily (in-season) at 1.8-km resolution and identifies ice percentage from 0 to 100% in 10% increments, with values above 90% flagged as ice.

  1. Dataset used to improve liquid water absorption models in the microwave

    DOE Data Explorer

    Turner, David

    2015-12-14

    Two datasets, one a compilation of laboratory data and one a compilation from three field sites, are provided here. These datasets provide measurements of the real and imaginary refractive indices and absorption as a function of cloud temperature. These datasets were used in the development of the new liquid water absorption model that was published in Turner et al. 2015.

  2. Forest Fire Danger Rating (FFDR) Prediction over the Korean Peninsula

    NASA Astrophysics Data System (ADS)

    Song, B.; Won, M.; Jang, K.; Yoon, S.; Lim, J.

    2016-12-01

    Approximately five hundred forest fires occur each year in Korea during the spring and autumn forest fire seasons, inflicting losses of both life and property. Thus, accurate prediction of forest fires is essential for effective forest fire prevention. Meteorology is one of the important factors for predicting and understanding fire occurrence as well as fire behaviour and spread. In this study, we present the Forest Fire Danger Rating System (FFDRS) for the Korean Peninsula based on the Daily Weather Index (DWI), which represents the meteorological characteristics related to forest fire. Thematic maps of temperature, humidity, and wind speed produced by the Korea Meteorological Administration (KMA) were applied to a logistic-regression forest fire occurrence probability model to analyse the DWI over the Korean Peninsula. The regional data assimilation and prediction system (RDAPS) and the improved digital forecast model were used to verify the sensitivity of the DWI. The verification test revealed that the improved digital forecast model dataset showed better agreement with the real-time weather data. The forest fire danger rating index (FFDRI) calculated from the improved digital forecast model dataset showed good agreement with the real-time weather dataset for the 233 administrative districts (R2=0.854). In addition, the FFDRI was compared with observation-based FFDRI at 76 national weather stations; the mean difference was 0.5 at the site level. The results produced in this study indicate that the improved digital forecast model dataset can be used to predict the FFDRI over the Korean Peninsula successfully.

  3. Multivariate prediction of odor from pig production based on in-situ measurement of odorants

    NASA Astrophysics Data System (ADS)

    Hansen, Michael J.; Jonassen, Kristoffer E. N.; Løkke, Mette Marie; Adamsen, Anders Peter S.; Feilberg, Anders

    2016-06-01

    The aim of the present study was to estimate a prediction model for odor from pig production facilities based on measurements of odorants by Proton-Transfer-Reaction Mass spectrometry (PTR-MS). Odor measurements were performed at four different pig production facilities with and without odor abatement technologies using a newly developed mobile odor laboratory equipped with a PTR-MS for measuring odorants and an olfactometer for measuring the odor concentration by human panelists. A total of 115 odor measurements were carried out in the mobile laboratory and simultaneously air samples were collected in Nalophan bags and analyzed at accredited laboratories after 24 h. The dataset was divided into a calibration dataset containing 94 samples and a validation dataset containing 21 samples. The prediction model based on the measurements in the mobile laboratory was able to explain 74% of the variation in the odor concentration based on odorants, whereas the prediction models based on odor measurements with bag samples explained only 46-57%. This study is the first application of direct field olfactometry to livestock odor and emphasizes the importance of avoiding any bias from sample storage in studies of odor-odorant relationships. Application of the model on the validation dataset gave a high correlation between predicted and measured odor concentration (R2 = 0.77). Significant odorants in the prediction models include phenols and indoles. In conclusion, measurements of odorants on-site in pig production facilities is an alternative to dynamic olfactometry that can be applied for measuring odor from pig houses and the effects of odor abatement technologies.
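
    A multivariate calibration of the kind described (predicting panel-measured odour concentration from a matrix of PTR-MS odorant concentrations, split into calibration and validation sets) could be sketched as follows. Partial least squares is used here purely as an illustrative choice of regression, and the variable names are hypothetical.

      import numpy as np
      from sklearn.cross_decomposition import PLSRegression
      from sklearn.metrics import r2_score

      def calibrate_odor_model(odorants_cal, odor_cal, odorants_val, odor_val, n_components=5):
          """Fit a PLS model on the calibration set (e.g. 94 samples) and report R^2 on the
          validation set (e.g. 21 samples). Inputs are odorant concentration matrices and
          measured odour concentrations; log transforms are commonly applied to both."""
          pls = PLSRegression(n_components=n_components)
          pls.fit(np.log(odorants_cal), np.log(odor_cal).reshape(-1, 1))
          predicted = pls.predict(np.log(odorants_val)).ravel()
          return pls, r2_score(np.log(odor_val), predicted)

      # model, r2_val = calibrate_odor_model(X_cal, y_cal, X_val, y_val)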

  4. The Isprs Benchmark on Indoor Modelling

    NASA Astrophysics Data System (ADS)

    Khoshelham, K.; Díaz Vilariño, L.; Peter, M.; Kang, Z.; Acharya, D.

    2017-09-01

    Automated generation of 3D indoor models from point cloud data has been a topic of intensive research in recent years. While results on various datasets have been reported in literature, a comparison of the performance of different methods has not been possible due to the lack of benchmark datasets and a common evaluation framework. The ISPRS benchmark on indoor modelling aims to address this issue by providing a public benchmark dataset and an evaluation framework for performance comparison of indoor modelling methods. In this paper, we present the benchmark dataset comprising several point clouds of indoor environments captured by different sensors. We also discuss the evaluation and comparison of indoor modelling methods based on manually created reference models and appropriate quality evaluation criteria. The benchmark dataset is available for download at: http://www2.isprs.org/commissions/comm4/wg5/benchmark-on-indoor-modelling.html.

  5. Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets.

    PubMed

    Clark, Alex M; Dole, Krishna; Coulon-Spektor, Anna; McNutt, Andrew; Grass, George; Freundlich, Joel S; Reynolds, Robert C; Ekins, Sean

    2015-06-22

    On the order of hundreds of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) models have been described in the literature in the past decade which are more often than not inaccessible to anyone but their authors. Public accessibility is also an issue with computational models for bioactivity, and the ability to share such models still remains a major challenge limiting drug discovery. We describe the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps. We use this implementation to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity, and other physicochemical properties. We show that these models possess cross-validation receiver operator curve values comparable to those generated previously in prior publications using alternative tools. We have now described how the implementation of Bayesian models with FCFP6 descriptors generated in the CDD Vault enables the rapid production of robust machine learning models from public data or the user's own datasets. The current study sets the stage for generating models in proprietary software (such as CDD) and exporting these models in a format that could be run in open source software using CDK components. This work also demonstrates that we can enable biocomputation across distributed private or public datasets to enhance drug discovery.
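
    The fingerprint-plus-Bayesian workflow described above can be approximated with open-source components along these lines. This is a hedged sketch: RDKit feature-based Morgan fingerprints of radius 3 serve as an FCFP6-like descriptor, and a Bernoulli naive Bayes classifier stands in for the Laplacian-corrected Bayesian model; the toy molecules and labels are placeholders, not real data.

      import numpy as np
      from rdkit import Chem
      from rdkit.Chem import AllChem
      from sklearn.naive_bayes import BernoulliNB
      from sklearn.model_selection import cross_val_score

      def fcfp6_like_bits(smiles, n_bits=2048):
          """Feature-based Morgan fingerprint of radius 3 (roughly FCFP6) as a 0/1 vector."""
          mol = Chem.MolFromSmiles(smiles)
          if mol is None:
              return None
          fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits, useFeatures=True)
          return np.array(list(fp), dtype=np.int8)

      # Placeholder dataset: (SMILES, active) pairs, e.g. parsed from a CSV export.
      data = [("CCO", 0), ("c1ccccc1O", 1), ("CC(=O)Oc1ccccc1C(=O)O", 1), ("CCN", 0)]
      X = np.array([fcfp6_like_bits(s) for s, _ in data])
      y = np.array([label for _, label in data])

      model = BernoulliNB().fit(X, y)
      # With a realistically sized dataset one would report cross-validated ROC AUC, e.g.:
      # scores = cross_val_score(BernoulliNB(), X, y, cv=5, scoring="roc_auc")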

  6. Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets

    PubMed Central

    2015-01-01

    On the order of hundreds of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) models have been described in the literature in the past decade which are more often than not inaccessible to anyone but their authors. Public accessibility is also an issue with computational models for bioactivity, and the ability to share such models still remains a major challenge limiting drug discovery. We describe the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps. We use this implementation to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity, and other physicochemical properties. We show that these models possess cross-validation receiver operator curve values comparable to those generated previously in prior publications using alternative tools. We have now described how the implementation of Bayesian models with FCFP6 descriptors generated in the CDD Vault enables the rapid production of robust machine learning models from public data or the user’s own datasets. The current study sets the stage for generating models in proprietary software (such as CDD) and exporting these models in a format that could be run in open source software using CDK components. This work also demonstrates that we can enable biocomputation across distributed private or public datasets to enhance drug discovery. PMID:25994950

  7. Historical gridded reconstruction of potential evapotranspiration for the UK

    NASA Astrophysics Data System (ADS)

    Tanguy, Maliko; Prudhomme, Christel; Smith, Katie; Hannaford, Jamie

    2018-06-01

    Potential evapotranspiration (PET) is a necessary input data for most hydrological models and is often needed at a daily time step. An accurate estimation of PET requires many input climate variables which are, in most cases, not available prior to the 1960s for the UK, nor indeed most parts of the world. Therefore, when applying hydrological models to earlier periods, modellers have to rely on PET estimations derived from simplified methods. Given that only monthly observed temperature data is readily available for the late 19th and early 20th century at a national scale for the UK, the objective of this work was to derive the best possible UK-wide gridded PET dataset from the limited data available.To that end, firstly, a combination of (i) seven temperature-based PET equations, (ii) four different calibration approaches and (iii) seven input temperature data were evaluated. For this evaluation, a gridded daily PET product based on the physically based Penman-Monteith equation (the CHESS PET dataset) was used, the rationale being that this provides a reliable ground truth PET dataset for evaluation purposes, given that no directly observed, distributed PET datasets exist. The performance of the models was also compared to a naïve method, which is defined as the simplest possible estimation of PET in the absence of any available climate data. The naïve method used in this study is the CHESS PET daily long-term average (the period from 1961 to 1990 was chosen), or CHESS-PET daily climatology.The analysis revealed that the type of calibration and the input temperature dataset had only a minor effect on the accuracy of the PET estimations at catchment scale. From the seven equations tested, only the calibrated version of the McGuinness-Bordne equation was able to outperform the naïve method and was therefore used to derive the gridded, reconstructed dataset. The equation was calibrated using 43 catchments across Great Britain.The dataset produced is a 5 km gridded PET dataset for the period 1891 to 2015, using the Met Office 5 km monthly gridded temperature data available for that time period as input data for the PET equation. The dataset includes daily and monthly PET grids and is complemented with a suite of mapped performance metrics to help users assess the quality of the data spatially.This dataset is expected to be particularly valuable as input to hydrological models for any catchment in the UK. The data can be accessed at https://doi.org/10.5285/17b9c4f7-1c30-4b6f-b2fe-f7780159939c.

  8. TH-A-9A-01: Active Optical Flow Model: Predicting Voxel-Level Dose Prediction in Spine SBRT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, J; Wu, Q.J.; Yin, F

    2014-06-15

    Purpose: To predict voxel-level dose distribution and enable effective evaluation of cord dose sparing in spine SBRT. Methods: We present an active optical flow model (AOFM) to statistically describe cord dose variations and train a predictive model to represent correlations between AOFM and PTV contours. Thirty clinically accepted spine SBRT plans are evenly divided into training and testing datasets. The development of the predictive model consists of 1) collecting a sequence of dose maps including PTV and OAR (spinal cord) as well as a set of associated PTV contours adjacent to OAR from the training dataset, 2) classifying the data into five groups based on the PTV's location relative to the OAR, two “Top”s, “Left”, “Right”, and “Bottom”, 3) randomly selecting a dose map as the reference in each group and applying rigid registration and optical flow deformation to match all other maps to the reference, 4) building the AOFM by importing optical flow vectors and dose values into the principal component analysis (PCA), 5) applying another PCA to features of PTV and OAR contours to generate an active shape model (ASM), and 6) computing a linear regression model of correlations between AOFM and ASM. When predicting the dose distribution of a new case in the testing dataset, the PTV is first assigned to a group based on its contour characteristics. Contour features are then transformed into the ASM's principal coordinates of the selected group. Finally, voxel-level dose distribution is determined by mapping from the ASM space to the AOFM space using the predictive model. Results: The DVHs predicted by the AOFM-based model and those in clinical plans are comparable in training and testing datasets. At 2% volume the dose difference between predicted and clinical plans is 4.2±4.4% and 3.3±3.5% in the training and testing datasets, respectively. Conclusion: The AOFM is effective in predicting voxel-level dose distribution for spine SBRT. Partially supported by NIH/NCI under grant #R21CA161389 and a master research grant by Varian Medical System.
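
    The core of the pipeline described (PCA on deformed dose maps, PCA on shape features, and a linear map between the two latent spaces) can be sketched as below. This is a generic scikit-learn illustration under the assumption that dose maps and contour features have already been registered and vectorized, not the AOFM implementation itself.

      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.linear_model import LinearRegression

      def train_dose_predictor(shape_features, dose_maps, n_shape_pc=5, n_dose_pc=10):
          """shape_features: (n_plans, n_contour_features); dose_maps: (n_plans, n_voxels),
          both assumed registered to a common reference. Returns a callable predictor."""
          shape_pca = PCA(n_components=n_shape_pc).fit(shape_features)
          dose_pca = PCA(n_components=n_dose_pc).fit(dose_maps)
          reg = LinearRegression().fit(shape_pca.transform(shape_features),
                                       dose_pca.transform(dose_maps))

          def predict(new_shape_features):
              dose_scores = reg.predict(shape_pca.transform(new_shape_features))
              return dose_pca.inverse_transform(dose_scores)      # voxel-level dose estimate

          return predict

      # predictor = train_dose_predictor(train_shapes, train_doses)
      # predicted_dose = predictor(test_shapes)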

  9. Investigating the Effects of Imputation Methods for Modelling Gene Networks Using a Dynamic Bayesian Network from Gene Expression Data

    PubMed Central

    CHAI, Lian En; LAW, Chow Kuan; MOHAMAD, Mohd Saberi; CHONG, Chuii Khim; CHOON, Yee Wen; DERIS, Safaai; ILLIAS, Rosli Md

    2014-01-01

    Background: Gene expression data often contain missing expression values. Therefore, several imputation methods have been applied to estimate the missing values, which include k-nearest neighbour (kNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). However, the effects of these imputation methods on the modelling of gene regulatory networks from gene expression data have rarely been investigated and analysed using a dynamic Bayesian network (DBN). Methods: In the present study, we separately imputed datasets of the Escherichia coli S.O.S. DNA repair pathway and the Saccharomyces cerevisiae cell cycle pathway with kNN, LLS, and BPCA, and subsequently used these to generate gene regulatory networks (GRNs) using a discrete DBN. We made comparisons on the basis of previous studies in order to select the gene network with the least error. Results: We found that BPCA and LLS performed better on larger networks (based on the S. cerevisiae dataset), whereas kNN performed better on smaller networks (based on the E. coli dataset). Conclusion: The results suggest that the performance of each imputation method is dependent on the size of the dataset, and this subsequently affects the modelling of the resultant GRNs using a DBN. In addition, on the basis of these results, a DBN has the capacity to discover potential edges, as well as display interactions, between genes. PMID:24876803
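
    For reference, the kNN imputation step mentioned above can be sketched with scikit-learn's KNNImputer; the expression matrix, missingness rate, and parameter choices below are illustrative, not those of the study.

    ```python
    import numpy as np
    from sklearn.impute import KNNImputer

    rng = np.random.default_rng(1)
    expr = rng.normal(size=(100, 8))                 # genes x conditions expression matrix
    expr[rng.random(expr.shape) < 0.05] = np.nan     # introduce ~5% missing values

    # k-nearest-neighbour imputation: each missing entry is filled from the
    # k most similar genes (rows) under Euclidean distance on observed entries
    imputer = KNNImputer(n_neighbors=10, weights="uniform")
    expr_imputed = imputer.fit_transform(expr)

    assert not np.isnan(expr_imputed).any()
    ```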

  10. Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization

    PubMed Central

    Liu, Jin; Huang, Jian; Ma, Shuangge

    2013-01-01

    Summary In cancer diagnosis studies, high-throughput gene profiling has been extensively conducted, searching for genes whose expressions may serve as markers. Data generated from such studies have the “large d, small n” feature, with the number of genes profiled much larger than the sample size. Penalization has been extensively adopted for simultaneous estimation and marker selection. Because of small sample sizes, markers identified from the analysis of single datasets can be unsatisfactory. A cost-effective remedy is to conduct integrative analysis of multiple heterogeneous datasets. In this article, we investigate composite penalization methods for estimation and marker selection in integrative analysis. The proposed methods use the minimax concave penalty (MCP) as the outer penalty. Under the homogeneity model, the ridge penalty is adopted as the inner penalty. Under the heterogeneity model, the Lasso penalty and MCP are adopted as the inner penalty. Effective computational algorithms based on coordinate descent are developed. Numerical studies, including simulation and analysis of practical cancer datasets, show satisfactory performance of the proposed methods. PMID:24578589
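
    For reference, the minimax concave penalty (MCP) used as the outer penalty above is commonly written as follows; this is the standard textbook form with regularisation parameter λ and concavity parameter γ > 1, not a formula reproduced from the article.

    ```latex
    % Minimax concave penalty (MCP), regularisation parameter \lambda, concavity \gamma > 1
    \rho_{\mathrm{MCP}}(t;\lambda,\gamma) =
    \begin{cases}
    \lambda\lvert t\rvert - \dfrac{t^{2}}{2\gamma}, & \lvert t\rvert \le \gamma\lambda,\\[4pt]
    \dfrac{\gamma\lambda^{2}}{2}, & \lvert t\rvert > \gamma\lambda.
    \end{cases}
    ```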

  11. A database of marine phytoplankton abundance, biomass and species composition in Australian waters

    NASA Astrophysics Data System (ADS)

    Davies, Claire H.; Coughlan, Alex; Hallegraeff, Gustaaf; Ajani, Penelope; Armbrecht, Linda; Atkins, Natalia; Bonham, Prudence; Brett, Steve; Brinkman, Richard; Burford, Michele; Clementson, Lesley; Coad, Peter; Coman, Frank; Davies, Diana; Dela-Cruz, Jocelyn; Devlin, Michelle; Edgar, Steven; Eriksen, Ruth; Furnas, Miles; Hassler, Christel; Hill, David; Holmes, Michael; Ingleton, Tim; Jameson, Ian; Leterme, Sophie C.; Lønborg, Christian; McLaughlin, James; McEnnulty, Felicity; McKinnon, A. David; Miller, Margaret; Murray, Shauna; Nayar, Sasi; Patten, Renee; Pritchard, Tim; Proctor, Roger; Purcell-Meyerink, Diane; Raes, Eric; Rissik, David; Ruszczyk, Jason; Slotwinski, Anita; Swadling, Kerrie M.; Tattersall, Katherine; Thompson, Peter; Thomson, Paul; Tonks, Mark; Trull, Thomas W.; Uribe-Palomino, Julian; Waite, Anya M.; Yauwenas, Rouna; Zammit, Anthony; Richardson, Anthony J.

    2016-06-01

    There have been many individual phytoplankton datasets collected across Australia since the mid 1900s, but most are unavailable to the research community. We have searched archives, contacted researchers, and scanned the primary and grey literature to collate 3,621,847 records of marine phytoplankton species from Australian waters from 1844 to the present. Many of these are small datasets collected for local questions, but combined they provide over 170 years of data on phytoplankton communities in Australian waters. Units and taxonomy have been standardised, obviously erroneous data removed, and all metadata included. We have lodged this dataset with the Australian Ocean Data Network (http://portal.aodn.org.au/) allowing public access. The Australian Phytoplankton Database will be invaluable for global change studies, as it allows analysis of ecological indicators of climate change and eutrophication (e.g., changes in distribution; diatom:dinoflagellate ratios). In addition, the standardised conversion of abundance records to biomass provides modellers with quantifiable data to initialise and validate ecosystem models of lower marine trophic levels.

  12. Cross-platform normalization of microarray and RNA-seq data for machine learning applications

    PubMed Central

    Thompson, Jeffrey A.; Tan, Jie

    2016-01-01

    Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language. PMID:26844019

  13. Association between rs3087243 and rs231775 polymorphism within the cytotoxic T-lymphocyte antigen 4 gene and Graves' disease: a case/control study combined with meta-analyses

    PubMed Central

    Dai, Yu; Zeng, Tianshu; Xiao, Fei; Chen, Lulu; Kong, Wen

    2017-01-01

    We conducted a case/control study to assess the impact of SNP rs3087243 and rs231775 within the CTLA4 gene, on the susceptibility to Graves' disease (GD) in a Chinese Han dataset (271 cases and 298 controls). The frequency of G allele for rs3087243 and rs231775 was observed to be significantly higher in subjects with GD than in control subjects (p = 0.005 and p = 0.000, respectively). After logistic regression analysis, a significant association was detected between SNP rs3087243 and GD in the additive and recessive models. Similarly, association for the SNP rs231775 could also be detected in the additive model, dominant model and recessive model. A meta-analysis, including 27 published datasets along with the current dataset, was performed to further confirm the association. Consistent with our case/control results, rs3087243 and rs231775 showed a significant association with GD in all genetic models. Of note, ethnic stratification revealed that these two SNPs were associated with susceptibility to GD in populations of both Asian and European descent. In conclusion, our data support that the rs3087243 and rs231775 polymorphisms within the CTLA4 gene confer genetic susceptibility to GD. PMID:29299173

  14. Genetic and Psychosocial Predictors of Aggression: Variable Selection and Model Building With Component-Wise Gradient Boosting.

    PubMed

    Suchting, Robert; Gowin, Joshua L; Green, Charles E; Walss-Bass, Consuelo; Lane, Scott D

    2018-01-01

    Rationale: Given datasets with a large or diverse set of predictors of aggression, machine learning (ML) provides efficient tools for identifying the most salient variables and building a parsimonious statistical model. ML techniques permit efficient exploration of data, have not been widely used in aggression research, and may have utility for those seeking prediction of aggressive behavior. Objectives: The present study examined predictors of aggression and constructed an optimized model using ML techniques. Predictors were derived from a dataset that included demographic, psychometric and genetic predictors, specifically FK506 binding protein 5 (FKBP5) polymorphisms, which have been shown to alter response to threatening stimuli, but have not been tested as predictors of aggressive behavior in adults. Methods: The data analysis approach utilized component-wise gradient boosting and model reduction via backward elimination to: (a) select variables from an initial set of 20 to build a model of trait aggression; and then (b) reduce that model to maximize parsimony and generalizability. Results: From a dataset of N = 47 participants, component-wise gradient boosting selected 8 of 20 possible predictors to model Buss-Perry Aggression Questionnaire (BPAQ) total score, with R² = 0.66. This model was simplified using backward elimination, retaining six predictors: smoking status, psychopathy (interpersonal manipulation and callous affect), childhood trauma (physical abuse and neglect), and the FKBP5_13 gene (rs1360780). The six-factor model approximated the initial eight-factor model at 99.4% of R². Conclusions: Using an inductive data science approach, the gradient boosting model identified predictors consistent with previous experimental work in aggression; specifically psychopathy and trauma exposure. Additionally, allelic variants in FKBP5 were identified for the first time, but the relatively small sample size limits generality of results and calls for replication. This approach provides utility for the prediction of aggression behavior, particularly in the context of large multivariate datasets.
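
    A minimal sketch of component-wise (L2) gradient boosting for variable selection is given below: at each iteration one single-predictor least-squares base learner is fitted and the best one is added with shrinkage, so only predictors that are ever selected enter the model. This is a generic illustration, not the boosting configuration used in the study.

    ```python
    import numpy as np

    def componentwise_l2_boost(X, y, n_iter=200, nu=0.1):
        """Component-wise L2 boosting: at each step fit simple least squares on each
        single (centred) predictor, pick the one reducing the residuals most, and
        update the fit with shrinkage nu. Returns per-predictor coefficients."""
        n, p = X.shape
        Xc = X - X.mean(axis=0)
        resid = y - y.mean()
        coefs = np.zeros(p)
        for _ in range(n_iter):
            slopes = Xc.T @ resid / (Xc ** 2).sum(axis=0)      # least-squares slope per predictor
            sse = ((resid[:, None] - Xc * slopes) ** 2).sum(axis=0)
            j = np.argmin(sse)                                  # best-fitting component
            coefs[j] += nu * slopes[j]
            resid = resid - nu * slopes[j] * Xc[:, j]
        return coefs

    rng = np.random.default_rng(2)
    X = rng.normal(size=(47, 20))       # 47 participants, 20 candidate predictors (illustrative)
    y = X[:, 0] * 1.5 - X[:, 3] + rng.normal(scale=0.5, size=47)
    coefs = componentwise_l2_boost(X, y)
    print("selected predictors:", np.flatnonzero(np.abs(coefs) > 1e-8))
    ```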

  15. Nowcasting Beach Advisories at Ohio Lake Erie Beaches

    USGS Publications Warehouse

    Francy, Donna S.; Darner, Robert A.

    2007-01-01

    Data were collected during the recreational season of 2007 to test and refine predictive models at three Lake Erie beaches. In addition to E. coli concentrations, field personnel collected or compiled data for environmental and water-quality variables expected to affect E. coli concentrations including turbidity, wave height, water temperature, lake level, rainfall, and antecedent dry days and wet days. At Huntington (Bay Village) and Edgewater (Cleveland) during 2007, the models provided correct responses 82.7 and 82.1 percent of the time; these percentages were greater than percentages obtained using the previous day's E. coli concentrations (current method). In contrast, at Villa Angela during 2007, the model provided correct responses only 61.3 percent of the days monitored. The data from 2007 were added to existing datasets and the larger datasets were split into two (Huntington) or three (Edgewater) segments by date based on the occurrence of false negatives and positives (named "season 1, season 2, season 3"). Models were developed for dated segments and for combined datasets. At Huntington, the summed responses for separate best models for seasons 1 and 2 provided a greater percentage of correct responses (85.6 percent) than the one combined best model (83.1 percent). Similar results were found for Edgewater. Water resource managers will determine how to apply these models to the Internet-based "nowcast" system for issuing water-quality advisories during 2008.

  16. Handling a Small Dataset Problem in Prediction Model by employ Artificial Data Generation Approach: A Review

    NASA Astrophysics Data System (ADS)

    Lateh, Masitah Abdul; Kamilah Muda, Azah; Yusof, Zeratul Izzah Mohd; Azilah Muda, Noor; Sanusi Azmi, Mohd

    2017-09-01

    The emerging era of big data over the past few years has led to large and complex datasets that demand faster and better decision making. However, small dataset problems still arise in certain areas, making analysis and decision making difficult. In order to build a prediction model, a large sample is required to train the model; a small dataset is insufficient to produce an accurate prediction model. This paper reviews artificial data generation approaches as one solution to the small dataset problem.

  17. Fuzzy neural network technique for system state forecasting.

    PubMed

    Li, Dezhi; Wang, Wilson; Ismail, Fathy

    2013-10-01

    In many system state forecasting applications, the prediction is performed based on multiple datasets, each corresponding to a distinct system condition. The traditional methods dealing with multiple datasets (e.g., vector autoregressive moving average models and neural networks) have some shortcomings, such as limited modeling capability and opaque reasoning operations. To tackle these problems, a novel fuzzy neural network (FNN) is proposed in this paper to effectively extract information from multiple datasets, so as to improve forecasting accuracy. The proposed predictor consists of both autoregressive (AR) nodes modeling and nonlinear nodes modeling; AR models/nodes are used to capture the linear correlation of the datasets, and the nonlinear correlation of the datasets are modeled with nonlinear neuron nodes. A novel particle swarm technique [i.e., Laplace particle swarm (LPS) method] is proposed to facilitate parameters estimation of the predictor and improve modeling accuracy. The effectiveness of the developed FNN predictor and the associated LPS method is verified by a series of tests related to Mackey-Glass data forecast, exchange rate data prediction, and gear system prognosis. Test results show that the developed FNN predictor and the LPS method can capture the dynamics of multiple datasets effectively and track system characteristics accurately.

  18. The advanced quality control techniques planned for the International Soil Moisture Network

    NASA Astrophysics Data System (ADS)

    Xaver, A.; Gruber, A.; Hegiova, A.; Sanchis-Dufau, A. D.; Dorigo, W. A.

    2012-04-01

    In situ soil moisture observations are essential to evaluate and calibrate modeled and remotely sensed soil moisture products. Although a number of meteorological networks and field campaigns measuring soil moisture exist on a global and long-term scale, their observations are not easily accessible and lack standardization of both technique and protocol. Thus, handling and especially comparing these datasets with satellite products or land surface models is a demanding issue. To overcome these limitations, the International Soil Moisture Network (ISMN; http://www.ipf.tuwien.ac.at/insitu/) has been initiated to act as a centralized data hosting facility. One advantage of the ISMN is that users are able to access the harmonized datasets easily through a web portal. Another advantage is the fully automated processing chain including the data harmonization in terms of units and sampling interval, but even more important is the advanced quality control system each measurement has to run through. The quality of in situ soil moisture measurements is crucial for the validation of satellite- and model-based soil moisture retrievals; therefore a sophisticated quality control system was developed. After a check for plausibility and geophysical limits, a quality flag is added to each measurement. An enhanced flagging mechanism was recently defined using a spectrum-based approach to detect spurious spikes, jumps and plateaus. The International Soil Moisture Network has already evolved into one of the most important distribution platforms for in situ soil moisture observations and is still growing. Currently, data from 27 networks in total, covering more than 800 stations in Europe, North America, Australia, Asia and Africa, are hosted by the ISMN. Available datasets also include historical datasets as well as near real-time measurements. The improved quality control system will provide important information for satellite-based as well as land surface model-based validation studies.
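
    The flagging logic described above (plausibility and geophysical limits, plus spike detection) can be sketched roughly as follows; the thresholds and flag codes are invented for illustration and do not reproduce the ISMN's actual criteria.

    ```python
    import numpy as np

    def qc_flags(soil_moisture):
        """Return a flag per observation: 'G' good, 'R' out of geophysical range,
        'S' suspected spike. Thresholds are illustrative placeholders."""
        sm = np.asarray(soil_moisture, dtype=float)
        flags = np.full(sm.shape, "G", dtype="<U1")

        # geophysical limits for volumetric soil moisture (m3/m3)
        flags[(sm < 0.0) | (sm > 0.6)] = "R"

        # simple spike test: large jump followed by a jump back in the opposite direction
        d = np.diff(sm)
        for i in range(1, len(sm) - 1):
            if flags[i] == "G" and abs(d[i - 1]) > 0.1 and abs(d[i]) > 0.1 \
                    and np.sign(d[i - 1]) != np.sign(d[i]):
                flags[i] = "S"
        return flags

    print(qc_flags([0.20, 0.21, 0.55, 0.22, 0.23, -0.1]))  # ['G' 'G' 'S' 'G' 'G' 'R']
    ```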

  19. Booly: a new data integration platform.

    PubMed

    Do, Long H; Esteves, Francisco F; Karten, Harvey J; Bier, Ethan

    2010-10-13

    Data integration is an escalating problem in bioinformatics. We have developed a web tool and warehousing system, Booly, that features a simple yet flexible data model coupled with the ability to perform powerful comparative analysis, including the use of Boolean logic to merge datasets together, and an integrated aliasing system to decipher differing names of the same gene or protein. Furthermore, Booly features a collaborative sharing system and a public repository so that users can retrieve new datasets while contributors can easily disseminate new content. We illustrate the uses of Booly with several examples including: the versatile creation of homebrew datasets, the integration of heterogeneous data to identify genes useful for comparing avian and mammalian brain architecture, and generation of a list of Food and Drug Administration (FDA) approved drugs with possible alternative disease targets. The Booly paradigm for data storage and analysis should facilitate integration between disparate biological and medical fields and result in novel discoveries that can then be validated experimentally. Booly can be accessed at http://booly.ucsd.edu.
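
    To illustrate the kind of Boolean merging and aliasing described above, here is a toy sketch; the gene names, alias table, and set operations are made up for the example and say nothing about Booly's internal data model.

    ```python
    # Toy illustration of Boolean merging of datasets with gene-name aliasing.
    aliases = {"p53": "TP53", "TRP53": "TP53", "HER2": "ERBB2"}

    def canonical(genes):
        """Map every gene name to a canonical identifier via the alias table."""
        return {aliases.get(g, g) for g in genes}

    dataset_a = canonical({"p53", "BRCA1", "HER2"})    # e.g. genes from one study
    dataset_b = canonical({"TP53", "ERBB2", "EGFR"})   # e.g. genes from another study

    print("AND:", dataset_a & dataset_b)   # genes present in both datasets
    print("OR:",  dataset_a | dataset_b)   # genes present in either dataset
    print("NOT:", dataset_a - dataset_b)   # genes in A but not in B
    ```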

  20. Booly: a new data integration platform

    PubMed Central

    2010-01-01

    Background Data integration is an escalating problem in bioinformatics. We have developed a web tool and warehousing system, Booly, that features a simple yet flexible data model coupled with the ability to perform powerful comparative analysis, including the use of Boolean logic to merge datasets together, and an integrated aliasing system to decipher differing names of the same gene or protein. Furthermore, Booly features a collaborative sharing system and a public repository so that users can retrieve new datasets while contributors can easily disseminate new content. Results We illustrate the uses of Booly with several examples including: the versatile creation of homebrew datasets, the integration of heterogeneous data to identify genes useful for comparing avian and mammalian brain architecture, and generation of a list of Food and Drug Administration (FDA) approved drugs with possible alternative disease targets. Conclusions The Booly paradigm for data storage and analysis should facilitate integration between disparate biological and medical fields and result in novel discoveries that can then be validated experimentally. Booly can be accessed at http://booly.ucsd.edu. PMID:20942966

  1. Diurnal fluctuations in brain volume: Statistical analyses of MRI from large populations.

    PubMed

    Nakamura, Kunio; Brown, Robert A; Narayanan, Sridar; Collins, D Louis; Arnold, Douglas L

    2015-09-01

    We investigated fluctuations in brain volume throughout the day using statistical modeling of magnetic resonance imaging (MRI) from large populations. We applied fully automated image analysis software to measure the brain parenchymal fraction (BPF), defined as the ratio of the brain parenchymal volume to the intracranial volume, thus accounting for variations in head size. The MRI data came from serial scans of multiple sclerosis (MS) patients in clinical trials (n=755, 3269 scans) and from subjects participating in the Alzheimer's Disease Neuroimaging Initiative (ADNI, n=834, 6114 scans). The percent change in BPF was modeled with a linear mixed-effects (LME) model, and the model was applied separately to the MS and ADNI datasets. The LME model for the MS datasets included random subject effects (intercept and slope over time) and fixed effects for the time-of-day, time from the baseline scan, and trial, which accounted for trial-related effects (for example, different inclusion criteria and imaging protocol). The model for ADNI additionally included the demographics (baseline age, sex, subject type [normal, mild cognitive impairment, or Alzheimer's disease], and interaction between subject type and time from baseline). There was a statistically significant effect of time-of-day on the BPF change in MS clinical trial datasets (-0.180 per day, that is, 0.180% of intracranial volume, p=0.019) as well as the ADNI dataset (-0.438 per day, that is, 0.438% of intracranial volume, p<0.0001), showing that the brain volume is greater in the morning. Linearly correcting the BPF values with the time-of-day reduced the required sample size to detect a 25% treatment effect (80% power and 0.05 significance level) on change in brain volume from 2 time-points over a period of 1 year by 2.6%. Our results have significant implications for future brain volumetric studies, suggesting that there is a potential acquisition time bias that should be randomized or statistically controlled to account for the day-to-day brain volume fluctuations. Copyright © 2015 Elsevier Inc. All rights reserved.
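
    A linear mixed-effects model of this general shape (random subject intercepts and slopes, fixed effects for time-of-day and time from baseline) can be sketched with statsmodels; the synthetic data and formula below are a schematic stand-in, not the actual MS/ADNI model specification.

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n_subj, n_scans = 50, 6
    df = pd.DataFrame({
        "subject": np.repeat(np.arange(n_subj), n_scans),
        "years_from_baseline": np.tile(np.arange(n_scans) * 0.5, n_subj),
        "time_of_day": rng.uniform(8, 18, n_subj * n_scans) / 24.0,  # fraction of day
    })
    subj_int = rng.normal(0, 0.3, n_subj)[df["subject"]]             # random subject intercepts
    df["bpf_change"] = (subj_int - 0.2 * df["years_from_baseline"]
                        - 0.18 * df["time_of_day"]
                        + rng.normal(0, 0.1, len(df)))

    # Random intercept and slope per subject; fixed effects for time-of-day and time
    model = smf.mixedlm("bpf_change ~ time_of_day + years_from_baseline",
                        data=df, groups=df["subject"],
                        re_formula="~years_from_baseline")
    result = model.fit()
    print(result.summary())
    ```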

  2. Evolving hard problems: Generating human genetics datasets with a complex etiology.

    PubMed

    Himmelstein, Daniel S; Greene, Casey S; Moore, Jason H

    2011-07-07

    A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Here we develop and evaluate a model free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight-hundred Pareto fronts; one for each independent run of our algorithm. In each run the predictiveness of single genetic variation and pairs of genetic variants have been minimized, while the predictiveness of third, fourth, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive four or five order interactions and minimized lower-level effects. This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.

  3. Revision and proposed modification for a total maximum daily load model for Upper Klamath Lake, Oregon

    USGS Publications Warehouse

    Wherry, Susan A.; Wood, Tamara M.; Anderson, Chauncey W.

    2015-01-01

    Using the extended 1991–2010 external phosphorus loading dataset, the lake TMDL model was recalibrated following the same procedures outlined in the Phase 1 review. The version of the model selected for further development incorporated an updated sediment initial condition, a numerical solution method for the chlorophyll a model, changes to light and phosphorus factors limiting algal growth, and a new pH-model regression, which removed Julian day dependence in order to avoid discontinuities in pH at year boundaries. This updated lake TMDL model was recalibrated using the extended dataset in order to compare calibration parameters to those obtained from a calibration with the original 7.5-year dataset. The resulting algal settling velocity calibrated from the extended dataset was more than twice the value calibrated with the original dataset, and, because the calibrated values of algal settling velocity and recycle rate are related (more rapid settling required more rapid recycling), the recycling rate also was larger than that determined with the original dataset. These changes in calibration parameters highlight the uncertainty in critical rates in the Upper Klamath Lake TMDL model and argue for their direct measurement in future data collection to increase confidence in the model predictions.

  4. Downscaled and debiased climate simulations for North America from 21,000 years ago to 2100AD

    PubMed Central

    Lorenz, David J.; Nieto-Lugilde, Diego; Blois, Jessica L.; Fitzpatrick, Matthew C.; Williams, John W.

    2016-01-01

    Increasingly, ecological modellers are integrating paleodata with future projections to understand climate-driven biodiversity dynamics from the past through the current century. Climate simulations from earth system models are necessary to this effort, but must be debiased and downscaled before they can be used by ecological models. Downscaling methods and observational baselines vary among researchers, which produces confounding biases among downscaled climate simulations. We present unified datasets of debiased and downscaled climate simulations for North America from 21 ka BP to 2100AD, at 0.5° spatial resolution. Temporal resolution is decadal averages of monthly data until 1950AD, average climates for 1950–2005 AD, and monthly data from 2010 to 2100AD, with decadal averages also provided. This downscaling includes two transient paleoclimatic simulations and 12 climate models for the IPCC AR5 (CMIP5) historical (1850–2005), RCP4.5, and RCP8.5 21st-century scenarios. Climate variables include primary variables and derived bioclimatic variables. These datasets provide a common set of climate simulations suitable for seamlessly modelling the effects of past and future climate change on species distributions and diversity. PMID:27377537

  5. Downscaled and debiased climate simulations for North America from 21,000 years ago to 2100AD.

    PubMed

    Lorenz, David J; Nieto-Lugilde, Diego; Blois, Jessica L; Fitzpatrick, Matthew C; Williams, John W

    2016-07-05

    Increasingly, ecological modellers are integrating paleodata with future projections to understand climate-driven biodiversity dynamics from the past through the current century. Climate simulations from earth system models are necessary to this effort, but must be debiased and downscaled before they can be used by ecological models. Downscaling methods and observational baselines vary among researchers, which produces confounding biases among downscaled climate simulations. We present unified datasets of debiased and downscaled climate simulations for North America from 21 ka BP to 2100AD, at 0.5° spatial resolution. Temporal resolution is decadal averages of monthly data until 1950AD, average climates for 1950-2005 AD, and monthly data from 2010 to 2100AD, with decadal averages also provided. This downscaling includes two transient paleoclimatic simulations and 12 climate models for the IPCC AR5 (CMIP5) historical (1850-2005), RCP4.5, and RCP8.5 21st-century scenarios. Climate variables include primary variables and derived bioclimatic variables. These datasets provide a common set of climate simulations suitable for seamlessly modelling the effects of past and future climate change on species distributions and diversity.

  6. Quantifying Arctic Terrestrial Environment Behaviors Using Geophysical, Point-Scale and Remote Sensing Data

    NASA Astrophysics Data System (ADS)

    Dafflon, B.; Hubbard, S. S.; Ulrich, C.; Peterson, J. E.; Wu, Y.; Wainwright, H. M.; Gangodagamage, C.; Kholodov, A. L.; Kneafsey, T. J.

    2013-12-01

    Improvement in parameterizing Arctic process-rich terrestrial models to simulate feedbacks to a changing climate requires advances in estimating the spatiotemporal variations in active layer and permafrost properties - in sufficiently high resolution yet over modeling-relevant scales. As part of the DOE Next-Generation Ecosystem Experiments (NGEE-Arctic), we are developing advanced strategies for imaging the subsurface and for investigating land and subsurface co-variability and dynamics. Our studies include acquisition and integration of various measurements, including point-based, surface-based geophysical, and remote sensing datasets. These data have been collected during a series of campaigns at the NGEE Barrow, AK site along transects that traverse a range of hydrological and geomorphological conditions, including low- to high-centered polygons and drained thaw lake basins. In this study, we describe the use of galvanic-coupled electrical resistance tomography (ERT), capacitively-coupled resistivity (CCR), permafrost cores, above-ground orthophotography, and digital elevation models (DEM) to (1) explore the complementary nature and trade-offs between characterization resolution, spatial extent and accuracy of different datasets; (2) develop inversion approaches to quantify permafrost characteristics (such as ice content, ice wedge frequency, and presence of an unfrozen deep layer); and (3) identify correspondences between permafrost and land surface properties (such as water inundation, topography, and vegetation). In terms of methods, we developed a 1D-based direct search approach to estimate the electrical conductivity distribution while allowing exploration of multiple solutions and prior information in a flexible way. Application of the method to the Barrow datasets reveals the relative information content of each dataset for characterizing permafrost properties, which shows variability from length scales below one meter to large trends over more than a kilometer. Further, we used pole- and kite-based low-altitude aerial photography with inferred DEMs, as well as a DEM from a LiDAR dataset, to quantify land-surface properties and their co-variability with the subsurface properties. Comparison of the above- and below-ground characterization information indicates that while some permafrost characteristics correspond with changes in hydrogeomorphological expressions, other features show more complex linkages with landscape properties. Overall, our results indicate that remote sensing data, point-scale measurements and surface geophysical measurements enable the identification of regional zones having similar relations between subsurface and land surface properties. Identification of such zonation and associated permafrost-land surface properties can be used to guide investigations of carbon cycling processes and for model parameterization.

  7. A comparison of daily evaporation downscaled using WRFDA model and GLEAM dataset over the Iberian Peninsula.

    NASA Astrophysics Data System (ADS)

    José González-Rojí, Santos; Sáenz, Jon; Ibarra-Berastegi, Gabriel

    2017-04-01

    The GLEAM dataset was presented a few years ago and has since been used to validate evaporation in only a few regions of the world (Australia and Africa). The Iberian Peninsula comprises different soil types and is affected by different weather regimes and climate regions, which makes it a very interesting area for studying the meteorological cycle, including evaporation. For that purpose, a numerical downscaling exercise over the Iberian Peninsula was run nesting the WRF model inside ERA Interim. Two model configurations were tested in two experiments spanning the period 2010-2014 after a one-year spin-up (2009). In the first experiment (N), boundary conditions drive the model. The second experiment (D) is configured the same way as the N case, but 3DVAR data assimilation is run every six hours (00Z, 06Z, 12Z and 18Z) using observations obtained from the PREPBUFR dataset. The evaporation from both the N and D runs and from ERA Interim was compared to the GLEAM v3.0b and v3.0c datasets over the Iberian Peninsula, at both the daily and monthly time scales. GLEAM v3.0a was not used for validation because it uses radiation and air temperature data from ERA Interim as forcing. Results show that the experiment with data assimilation (D) improves on the results obtained for the N experiment. Moreover, correlation values are comparable to those obtained with ERA Interim. However, some negative correlation values are observed along the Portuguese and Mediterranean coasts for both WRF runs. All of these problematic points are classified as urban sites by the NOAH land surface model, so the model is unable to simulate a correct evaporation value there. Even with these discrepancies, better results than ERA Interim are observed for seasonal biases and daily RMSEs over the Iberian Peninsula, with the best values obtained inland. Minimal differences are observed between the two GLEAM datasets selected.

  8. Validating Computational Human Behavior Models: Consistency and Accuracy Issues

    DTIC Science & Technology

    2004-06-01

    includes a discussion of SME demographics, content, and organization of the datasets. This research generalizes data from two pilot studies and two base... meet requirements for validating the varied and complex behavioral models. Through a series of empirical studies, this research identifies subject...

  9. The IERS Special Bureau for Tides

    NASA Technical Reports Server (NTRS)

    Ray, Richard D.; Chao, B. F.; Desai, S. D.

    2002-01-01

    The Global Geophysical Fluids Center of the International Earth Rotation Service (IERS) comprises 8 special bureaus, one of which is the Special Bureau for Tides. Its purpose is to facilitate studies related to tidal effects in earth rotation. To that end it collects various relevant datasets and distributes them, primarily through its website at bowie.gsfc.nasa.gov/ggfc/tides. Example datasets include tabulations of tidal variations in angular momentum and in earth rotation as estimated from numerical ocean tide models and from meteorological reanalysis products. The web site also features an interactive tidal prediction "machine" which generates tidal predictions (e.g., of UT1) from lists of harmonic constants. The Special Bureau relies on the tidal and earth-rotation communities to build and enlarge its datasets; further contributions from this community are most welcome.
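
    Harmonic tidal prediction of the kind served by the interactive "machine" mentioned above reduces to summing cosine terms built from a list of harmonic constants; the constituents, amplitudes, and phases below are placeholders for illustration only.

    ```python
    import numpy as np

    # Illustrative harmonic constants: (frequency in degrees/hour, amplitude, phase lag in degrees)
    CONSTITUENTS = {
        "M2": (28.9841042, 0.50, 120.0),
        "S2": (30.0000000, 0.20,  95.0),
        "O1": (13.9430356, 0.10,  40.0),
    }

    def tidal_prediction(hours):
        """Predicted tidal signal h(t) = sum_i A_i * cos(omega_i * t - phi_i)."""
        t = np.asarray(hours, dtype=float)
        h = np.zeros_like(t)
        for omega_deg, amp, phase_deg in CONSTITUENTS.values():
            h += amp * np.cos(np.radians(omega_deg * t - phase_deg))
        return h

    print(tidal_prediction(np.arange(0, 24, 6)))  # prediction at 0, 6, 12 and 18 hours
    ```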

  10. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    NASA Astrophysics Data System (ADS)

    Lynnes, C.; Quinn, P.; Norton, J.

    2016-12-01

    As the Variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.
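
    The heuristics described here (keyword match against essential measurements, temporal overlap, and a bonus for newer versions) can be combined into a simple weighted score, sketched below. The weights, field names, and scoring form are invented for illustration and are not the Common Metadata Repository algorithm.

    ```python
    from dataclasses import dataclass

    @dataclass
    class DatasetRecord:
        keywords: set          # measurement/variable keywords from the metadata
        start: float           # temporal coverage, as decimal years
        end: float
        version: int           # higher = newer version of the same product

    def overlap_fraction(a0, a1, b0, b1):
        """Fraction of the query interval [b0, b1] covered by the dataset [a0, a1]."""
        inter = max(0.0, min(a1, b1) - max(a0, b0))
        return inter / (b1 - b0) if b1 > b0 else 0.0

    def relevance(record, query_terms, q_start, q_end,
                  w_text=2.0, w_time=1.0, w_version=0.2):
        text_score = len(record.keywords & query_terms) / max(1, len(query_terms))
        time_score = overlap_fraction(record.start, record.end, q_start, q_end)
        return w_text * text_score + w_time * time_score + w_version * record.version

    rec = DatasetRecord({"precipitation", "rain rate"}, 1998.0, 2015.0, version=7)
    print(relevance(rec, {"precipitation"}, 2000.0, 2010.0))
    ```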

  11. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2016-01-01

    As the Variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.

  12. Relevancy Ranking of Satellite Dataset Search Results

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2017-01-01

    As the Variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.

  13. School Attendance Problems and Youth Psychopathology: Structural Cross-Lagged Regression Models in Three Longitudinal Data Sets

    ERIC Educational Resources Information Center

    Wood, Jeffrey J.; Lynne-Landsman, Sarah D.; Langer, David A.; Wood, Patricia A.; Clark, Shaunna L.; Eddy, J. Mark; Ialongo, Nick

    2012-01-01

    This study tests a model of reciprocal influences between absenteeism and youth psychopathology using 3 longitudinal datasets (Ns = 20,745, 2,311, and 671). Participants in 1st through 12th grades were interviewed annually or biannually. Measures of psychopathology include self-, parent-, and teacher-report questionnaires. Structural cross-lagged…

  14. On the Value of Climate Elasticity Indices to Assess the Impact of Climate Change on Streamflow Projection using an ensemble of bias corrected CMIP5 dataset

    NASA Astrophysics Data System (ADS)

    Demirel, Mehmet; Moradkhani, Hamid

    2015-04-01

    Changes in two climate elasticity indices, i.e. the temperature and precipitation elasticity of streamflow, were investigated using an ensemble of bias-corrected CMIP5 datasets as forcing to two hydrologic models. The Variable Infiltration Capacity (VIC) and the Sacramento Soil Moisture Accounting (SAC-SMA) hydrologic models were calibrated at 1/16 degree resolution and the simulated streamflow was routed to the basin outlet of interest. We estimated the precipitation and temperature elasticity of streamflow from: (1) observed streamflow; (2) streamflow simulated by the VIC and SAC-SMA models using observed climate for the current climate (1963-2003); (3) streamflow simulated using climate from a 10-GCM CMIP5 dataset for the future climate (2010-2099), including two concentration pathways (RCP4.5 and RCP8.5) and two downscaled climate products (BCSD and MACA). The streamflow sensitivity to long-term (e.g., 30-year) average annual changes in temperature and precipitation is estimated for three periods, i.e. 2010-40, 2040-70 and 2070-99. We compared the results of the three cases to reflect on the value of precipitation and temperature indices for assessing the climate change impacts on Columbia River streamflow. Moreover, these three cases for two models are used to assess the effects of different uncertainty sources (model forcing, model structure and different pathways) on the two climate elasticity indices.
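
    For reference, the precipitation elasticity of streamflow referred to above is conventionally defined as the ratio of the fractional change in streamflow to the fractional change in precipitation, while the temperature index is often expressed per degree of warming; these are standard definitions, not formulas taken from the abstract.

    ```latex
    % Precipitation elasticity (dimensionless) and temperature sensitivity of streamflow
    \varepsilon_P = \frac{\Delta Q / Q}{\Delta P / P},
    \qquad
    \varepsilon_T = \frac{\Delta Q / Q}{\Delta T}
    ```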

  15. Comparison between calculations of shortwave radiation with different aerosol datasets and measured data at the MSU MO (Russia)

    NASA Astrophysics Data System (ADS)

    Poliukhov, Aleksei; Chubarova, Natalia; Kinne, Stephan; Rivin, Gdaliy; Shatunova, Marina; Tarasova, Tatiana

    2017-02-01

    The radiation block of the COSMO non-hydrostatic mesoscale model of the atmosphere and soil active layer was tested against a relatively new, efficient CLIRAD(FC05)-SW radiation model and radiative measurements at the Moscow State University Meteorological Observatory (MSU MO, 55.7N, 37.5E) using different aerosol datasets in cloudless conditions. We used data on the shortwave radiation components from the Kipp&Zonen CNR4 net radiometer. The model simulations were performed with the application of various aerosol climatologies, including the new MACv2 climatology and the aerosol and water vapor dataset from CIMEL (AERONET) sun photometer measurements. The application of the new MACv2 climatology in the CLIRAD(FC05)-SW radiation model yields an annual average relative error in total global radiation of -3%, varying from 0.5% in May to -7.7% in December. The uncertainty of radiative calculations in the COSMO model, according to preliminary estimates, ranges from 1.4% to 8.4% against the CLIRAD(FC05)-SW radiation model with the same parameters. We showed that in clear-sky conditions the sensitivity of air temperature at 2 meters to shortwave net radiation changes is about 0.7-0.9°C per 100 W/m2 due to the application of aerosol climatologies over Moscow.

  16. A Regional Climate Model Evaluation System based on Satellite and other Observations

    NASA Astrophysics Data System (ADS)

    Lean, P.; Kim, J.; Waliser, D. E.; Hall, A. D.; Mattmann, C. A.; Granger, S. L.; Case, K.; Goodale, C.; Hart, A.; Zimdars, P.; Guan, B.; Molotch, N. P.; Kaki, S.

    2010-12-01

    Regional climate models are a fundamental tool needed for downscaling global climate simulations and projections, such as those contributing to the Coupled Model Intercomparison Projects (CMIPs) that form the basis of the IPCC Assessment Reports. The regional modeling process provides the means to accommodate higher resolution and a greater complexity of Earth System processes. Evaluation of both the global and regional climate models against observations is essential to identify model weaknesses and to direct future model development efforts focused on reducing the uncertainty associated with climate projections. However, the lack of reliable observational data and the lack of formal tools are among the serious limitations to addressing these objectives. Recent satellite observations are particularly useful as they provide a wealth of information on many different aspects of the climate system, but due to their large volume and the difficulties associated with accessing and using the data, these datasets have been generally underutilized in model evaluation studies. Recognizing this problem, NASA JPL / UCLA is developing a model evaluation system to help make satellite observations, in conjunction with in-situ, assimilated, and reanalysis datasets, more readily accessible to the modeling community. The system includes a central database to store multiple datasets in a common format and codes for calculating predefined statistical metrics to assess model performance. This allows the time taken to compare model simulations with satellite observations to be reduced from weeks to days. Early results from the use of this new model evaluation system for evaluating regional climate simulations over California/western US regions will be presented.
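
    The "predefined statistical metrics" such a system computes are typically simple comparisons of model fields against observations; the sketch below shows bias and RMSE on co-located arrays as a generic illustration, not the actual evaluation-system code.

    ```python
    import numpy as np

    def bias(model, obs):
        """Mean difference between co-located model and observed values (NaNs ignored)."""
        model, obs = np.asarray(model, float), np.asarray(obs, float)
        return np.nanmean(model - obs)

    def rmse(model, obs):
        """Root-mean-square error between co-located model and observed values (NaNs ignored)."""
        model, obs = np.asarray(model, float), np.asarray(obs, float)
        return np.sqrt(np.nanmean((model - obs) ** 2))

    obs = np.array([1.0, 2.0, 3.0, np.nan])
    mod = np.array([1.2, 1.8, 3.5, 2.0])
    print(bias(mod, obs), rmse(mod, obs))
    ```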

  17. Big Data in Psychology: Introduction to Special Issue

    PubMed Central

    Harlow, Lisa L.; Oswald, Frederick L.

    2016-01-01

    The introduction to this special issue on psychological research involving big data summarizes the highlights of 10 articles that address a number of important and inspiring perspectives, issues, and applications. Four common themes that emerge in the articles with respect to psychological research conducted in the area of big data are mentioned, including: 1. The benefits of collaboration across disciplines, such as those in the social sciences, applied statistics, and computer science. Doing so assists in grounding big data research in sound theory and practice, as well as in affording effective data retrieval and analysis. 2. Availability of large datasets on Facebook, Twitter, and other social media sites that provide a psychological window into the attitudes and behaviors of a broad spectrum of the population. 3. Identifying, addressing, and being sensitive to ethical considerations when analyzing large datasets gained from public or private sources. 4. The unavoidable necessity of validating predictive models in big data by applying a model developed on one dataset to a separate set of data or hold-out sample. Translational abstracts that summarize the articles in very clear and understandable terms are included in Appendix A, and a glossary of terms relevant to big data research discussed in the articles is presented in Appendix B. PMID:27918177

  18. Ancestral haplotype-based association mapping with generalized linear mixed models accounting for stratification.

    PubMed

    Zhang, Z; Guillaume, F; Sartelet, A; Charlier, C; Georges, M; Farnir, F; Druet, T

    2012-10-01

    In many situations, genome-wide association studies are performed in populations presenting stratification. Mixed models including a kinship matrix accounting for genetic relatedness among individuals have been shown to correct for population and/or family structure. Here we extend this methodology to generalized linear mixed models which properly model data under various distributions. In addition we perform association with ancestral haplotypes inferred using a hidden Markov model. The method was shown to properly account for stratification under various simulated scenarios presenting population and/or family structure. Use of ancestral haplotypes resulted in higher power than SNPs on simulated datasets. Application to real data demonstrates the usefulness of the developed model. Full analysis of a dataset with 4600 individuals and 500 000 SNPs was performed in 2 h 36 min and required 2.28 Gb of RAM. The software GLASCOW can be freely downloaded from www.giga.ulg.ac.be/jcms/prod_381171/software. Contact: francois.guillaume@jouy.inra.fr. Supplementary data are available at Bioinformatics online.

  19. The derivation and validation of a simple model for predicting in-hospital mortality of acutely admitted patients to internal medicine wards.

    PubMed

    Sakhnini, Ali; Saliba, Walid; Schwartz, Naama; Bisharat, Naiel

    2017-06-01

    Limited information is available about clinical predictors of in-hospital mortality in acute unselected medical admissions. Such information could assist medical decision-making. The aims were to develop a clinical model for predicting in-hospital mortality in unselected acute medical admissions and to test the impact of secondary conditions on hospital mortality. This is an analysis of the medical records of patients admitted to internal medicine wards at one university-affiliated hospital. Data obtained from the years 2013 to 2014 were used as a derivation dataset for creating a prediction model, while data from 2015 were used as a validation dataset to test the performance of the model. For each admission, a set of clinical and epidemiological variables was obtained. The main diagnosis at hospitalization was recorded, and all additional or secondary conditions that coexisted at hospital admission or that developed during hospital stay were considered secondary conditions. The derivation and validation datasets included 7268 and 7843 patients, respectively. The in-hospital mortality rate averaged 7.2%. The following variables entered the final model: age, body mass index, mean arterial pressure on admission, prior admission within 3 months, background morbidity of heart failure and active malignancy, and chronic use of statins and antiplatelet agents. The c-statistic (ROC-AUC) of the prediction model was 80.5% without adjustment for main or secondary conditions, 84.5% with adjustment for the main diagnosis, and 89.5% with adjustment for the main diagnosis and secondary conditions. The accuracy of the predictive model reached 81% on the validation dataset. A prediction model based on clinical data with adjustment for secondary conditions exhibited a high degree of prediction accuracy. We provide a proof of concept that there is added value in incorporating secondary conditions while predicting probabilities of in-hospital mortality. Further improvement of the model performance and validation in other cohorts are needed to aid hospitalists in predicting health outcomes.
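
    The reported workflow (fit a model on a derivation cohort, then check discrimination via the ROC-AUC on a later validation cohort) can be sketched generically as below; the covariates and synthetic data are placeholders, not the study's dataset.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(4)

    def make_cohort(n):
        """Synthetic stand-in for an admission cohort: age, BMI, MAP, prior admission."""
        X = np.column_stack([
            rng.normal(70, 15, n),    # age
            rng.normal(26, 5, n),     # body mass index
            rng.normal(85, 15, n),    # mean arterial pressure on admission
            rng.integers(0, 2, n),    # prior admission within 3 months
        ])
        logit = -4 + 0.04 * (X[:, 0] - 70) - 0.03 * (X[:, 2] - 85) + 0.8 * X[:, 3]
        y = rng.random(n) < 1 / (1 + np.exp(-logit))
        return X, y.astype(int)

    X_dev, y_dev = make_cohort(7000)    # derivation cohort
    X_val, y_val = make_cohort(7800)    # validation cohort

    model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"validation c-statistic (ROC-AUC): {auc:.3f}")
    ```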

  20. On the uncertainties associated with using gridded rainfall data as a proxy for observed

    NASA Astrophysics Data System (ADS)

    Tozer, C. R.; Kiem, A. S.; Verdon-Kidd, D. C.

    2011-09-01

    Gridded rainfall datasets are used in many hydrological and climatological studies, in Australia and elsewhere, including for hydroclimatic forecasting, climate attribution studies and climate model performance assessments. The attraction of the spatial coverage provided by gridded data is clear, particularly in Australia where the spatial and temporal resolution of the rainfall gauge network is sparse. However, the question that must be asked is whether it is suitable to use gridded data as a proxy for observed point data, given that gridded data is inherently "smoothed" and may not necessarily capture the temporal and spatial variability of Australian rainfall which leads to hydroclimatic extremes (i.e. droughts, floods)? This study investigates this question through a statistical analysis of three monthly gridded Australian rainfall datasets - the Bureau of Meteorology (BOM) dataset, the Australian Water Availability Project (AWAP) and the SILO dataset. To demonstrate the hydrological implications of using gridded data as a proxy for gauged data, a rainfall-runoff model is applied to one catchment in South Australia (SA) initially using gridded data as the source of rainfall input and then gauged rainfall data. The results indicate a markedly different runoff response associated with each of the different sources of rainfall data. It should be noted that this study does not seek to identify which gridded dataset is the "best" for Australia, as each gridded data source has its pros and cons, as does gauged or point data. Rather the intention is to quantify differences between various gridded data sources and how they compare with gauged data so that these differences can be considered and accounted for in studies that utilise these gridded datasets. Ultimately, if key decisions are going to be based on the outputs of models that use gridded data, an estimate (or at least an understanding) of the uncertainties relating to the assumptions made in the development of gridded data and how that gridded data compares with reality should be made.

  1. An Intercomparison of Large-Extent Tree Canopy Cover Geospatial Datasets

    NASA Astrophysics Data System (ADS)

    Bender, S.; Liknes, G.; Ruefenacht, B.; Reynolds, J.; Miller, W. P.

    2017-12-01

    As a member of the Multi-Resolution Land Characteristics Consortium (MRLC), the U.S. Forest Service (USFS) is responsible for producing and maintaining the tree canopy cover (TCC) component of the National Land Cover Database (NLCD). The NLCD-TCC data are available for the conterminous United States (CONUS), coastal Alaska, Hawai'i, Puerto Rico, and the U.S. Virgin Islands. The most recent official version of the NLCD-TCC data is based primarily on reference data from 2010-2011 and is part of the multi-component 2011 version of the NLCD. NLCD data are updated on a five-year cycle. The USFS is currently producing the next official version (2016) of the NLCD-TCC data for the United States, and it will be made publicly-available in early 2018. In this presentation, we describe the model inputs, modeling methods, and tools used to produce the 30-m NLCD-TCC data. Several tree cover datasets at 30-m, as well as datasets at finer resolution, have become available in recent years due to advancements in earth observation data and their availability, computing, and sensors. We compare multiple tree cover datasets that have similar resolution to the NLCD-TCC data. We also aggregate the tree class from fine-resolution land cover datasets to a percent canopy value on a 30-m pixel, in order to compare the fine-resolution datasets to the datasets created directly from 30-m Landsat data. The extent of the tree canopy cover datasets included in the study ranges from global and national to the state level. Preliminary investigation of multiple tree cover datasets over the CONUS indicates a high amount of spatial variability. For example, in a comparison of the NLCD-TCC and the Global Land Cover Facility's Landsat Tree Cover Continuous Fields (2010) data by MRLC mapping zones, the zone-level root mean-square deviation ranges from 2% to 39% (mean=17%, median=15%). The analysis outcomes are expected to inform USFS decisions with regard to the next cycle (2021) of NLCD-TCC production.

  2. [Parallel virtual reality visualization of extreme large medical datasets].

    PubMed

    Tang, Min

    2010-04-01

    On the basis of a brief description of grid computing, the essence and critical techniques of parallel visualization of extremely large medical datasets are discussed in connection with the Intranet and commonly configured computers of hospitals. Several kernel techniques are introduced, including the hardware structure, software framework, load balancing and virtual reality visualization. The Maximum Intensity Projection algorithm is realized in parallel using a common PC cluster. In the virtual reality world, three-dimensional models can be rotated, zoomed, translated and cut interactively and conveniently through a control panel built on the Virtual Reality Modeling Language (VRML). Experimental results demonstrate that this method provides promising, real-time results, playing the role of a good assistant in making clinical diagnoses.
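
    Maximum Intensity Projection itself is a simple reduction along the viewing axis, which is what makes it easy to parallelise across slabs of the volume on a PC cluster; the sketch below shows the serial core of the operation on a synthetic volume and is an illustration, not the paper's parallel implementation.

    ```python
    import numpy as np

    def max_intensity_projection(volume, axis=0):
        """Project a 3-D intensity volume to 2-D by taking the maximum along one axis."""
        return np.max(volume, axis=axis)

    # Synthetic 64x64x64 volume with a bright sphere in the centre
    z, y, x = np.mgrid[0:64, 0:64, 0:64]
    volume = np.exp(-((x - 32) ** 2 + (y - 32) ** 2 + (z - 32) ** 2) / 200.0)

    mip = max_intensity_projection(volume, axis=0)   # axial projection
    print(mip.shape, mip.max())                      # (64, 64), ~1.0

    # In a parallel setting, each node would compute the MIP of its slab of slices
    # and the partial projections would be combined with another element-wise max.
    ```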

  3. Development and validation of a prognostic nomogram for colorectal cancer after radical resection based on individual patient data from three large-scale phase III trials

    PubMed Central

    Akiyoshi, Takashi; Maeda, Hiromichi; Kashiwabara, Kosuke; Kanda, Mitsuro; Mayanagi, Shuhei; Aoyama, Toru; Hamada, Chikuma; Sadahiro, Sotaro; Fukunaga, Yosuke; Ueno, Masashi; Sakamoto, Junichi; Saji, Shigetoyo; Yoshikawa, Takaki

    2017-01-01

    Background Few prediction models have so far been developed and assessed for the prognosis of patients who undergo curative resection for colorectal cancer (CRC). Materials and Methods We prepared a clinical dataset including 5,530 patients who participated in three major randomized controlled trials as a training dataset, and 2,263 consecutive patients treated at a cancer-specialized hospital as a validation dataset. All subjects underwent radical resection for CRC that was histologically diagnosed as adenocarcinoma. The main outcomes predicted were overall survival (OS) and disease-free survival (DFS). The variables in the nomogram were identified by Cox regression analysis, and model performance was evaluated by Harrell's c-index. The calibration plot and its slope were also studied. For the external validation assessment, risk group stratification was employed. Results The multivariate Cox model identified the following variables: sex, age, pathological T and N factors, tumor location, tumor size, lymph node dissection, postoperative complications and adjuvant chemotherapy. The c-index was 0.72 (95% confidence interval [CI] 0.66-0.77) for OS and 0.74 (95% CI 0.69-0.78) for DFS. The proposed stratification into risk groups demonstrated a significant distinction between the Kaplan–Meier curves for OS and DFS in the external validation dataset. Conclusions We established a clinically reliable nomogram to predict OS and DFS in patients with CRC using large-scale and reliable independent patient data from phase III randomized controlled trials. The external validity was also confirmed on the practical dataset. PMID:29228760
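    Since the abstract evaluates the model with Harrell's c-index, a small self-contained illustration of that statistic is given below, computed on synthetic survival data; it is not the authors' code or data.

    ```python
    import numpy as np

    def harrell_c_index(time, event, risk):
        """Harrell's concordance index: among usable pairs, the fraction in
        which the subject with the shorter observed event time also has the
        higher predicted risk (ties in risk count as 0.5)."""
        concordant, usable = 0.0, 0
        n = len(time)
        for i in range(n):
            for j in range(n):
                # a pair is usable only if the earlier time is an observed event
                if event[i] and time[i] < time[j]:
                    usable += 1
                    if risk[i] > risk[j]:
                        concordant += 1.0
                    elif risk[i] == risk[j]:
                        concordant += 0.5
        return concordant / usable

    # synthetic example: higher risk scores tend to fail earlier
    rng = np.random.default_rng(42)
    risk = rng.normal(size=200)
    time = rng.exponential(scale=np.exp(-risk))     # shorter times for higher risk
    event = rng.random(200) < 0.7                   # ~30% censoring
    print(round(harrell_c_index(time, event, risk), 3))
    ```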

  4. A reassessment of North American river basin water balances in light of new estimates of mountain snow accumulation

    NASA Astrophysics Data System (ADS)

    Wrzesien, M.; Durand, M. T.; Pavelsky, T.

    2017-12-01

    The hydrologic cycle is a key component of many aspects of daily life, yet not all water cycle processes are fully understood. In particular, water storage in mountain snowpacks remains largely unknown. Previous work with a high resolution regional climate model suggests that global and continental models underestimate mountain snow accumulation, perhaps by as much as 50%. Therefore, we hypothesize that since snow water equivalent (one aspect of the water balance) is underestimated, accepted water balances for major river basins are likely wrong, particularly for mountainous river basins. Here we examine water balances for four major high latitude North American watersheds - the Columbia, Mackenzie, Nelson, and Yukon. The mountainous percentage of each basin varies, which allows us to consider whether a bias in the water balance is related to the mountain area percentage within the watershed. For our water balance evaluation, we especially consider precipitation estimates from a variety of datasets, including models, such as WRF and MERRA, and observation-based products, such as CRU and GPCP. We ask whether the precipitation datasets provide enough moisture for seasonal snow to accumulate within the basin and whether we see differences in the variability of annual and seasonal precipitation from each dataset. From our reassessment of high-latitude water balances, we aim to determine whether the current understanding is sufficient to describe all processes within the hydrologic cycle or whether datasets appear to be biased, particularly in high-elevation precipitation. Should currently-available datasets appear to be similarly biased in precipitation, as we have seen in mountain snow accumulation, we discuss the implications for the continental water budget.

  5. Brady's Geothermal Field - Analysis of Pressure Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lim, David

    *This submission provides corrections to GDR Submissions 844 and 845* Poroelastic Tomography (PoroTomo) by Adjoint Inverse Modeling of Data from Hydrology. The three *.csv files containing pressure data are the corrected versions of the pressure dataset found in Submission 844. The dataset has been corrected in the sense that the atmospheric pressure has been subtracted from the total pressure measured in the well. Also, the transducers used at wells 56A-1 and SP-2 are sensitive to surface temperature fluctuations; these temperature effects have been removed from the corrected datasets. The fourth *.csv file contains the corrected version of the pumping data found in Submission 845. The data have been corrected in the sense that data from several wells used during the PoroTomo deployment pumping tests, which were not included in the original dataset, have been added. In addition, several other minor changes have been made to the pumping records due to flow rate instrument calibration issues that were discovered.
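    The two corrections described for the pressure files can be sketched as follows; the column names and units are hypothetical placeholders, not the names used in the GDR submissions.

    ```python
    import numpy as np
    import pandas as pd

    def correct_well_pressure(df):
        """Apply the two corrections described above to a well-pressure series:
        (1) subtract atmospheric (barometric) pressure from total pressure, and
        (2) remove a linear surface-temperature effect estimated by regression.
        Column names ('p_total_kpa', 'p_atm_kpa', 'surface_temp_c') are
        hypothetical placeholders, not those in the published files."""
        gauge = df["p_total_kpa"] - df["p_atm_kpa"]            # step 1
        t = df["surface_temp_c"] - df["surface_temp_c"].mean()
        slope = np.polyfit(t, gauge, 1)[0]                     # step 2: fit T effect
        return gauge - slope * t                               # keep the mean level

    # toy example with synthetic data
    n = 1000
    rng = np.random.default_rng(7)
    temp = 20 + 10 * np.sin(np.linspace(0, 20, n))
    df = pd.DataFrame({
        "p_atm_kpa": 86 + rng.normal(0, 0.2, n),
        "surface_temp_c": temp,
    })
    df["p_total_kpa"] = 350 + df["p_atm_kpa"] + 0.05 * (temp - 20) + rng.normal(0, 0.05, n)
    print(correct_well_pressure(df).describe())
    ```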

  6. Integrative prescreening in analysis of multiple cancer genomic studies

    PubMed Central

    2012-01-01

    Background In high throughput cancer genomic studies, results from the analysis of single datasets often suffer from a lack of reproducibility because of small sample sizes. Integrative analysis can effectively pool and analyze multiple datasets and provides a cost effective way to improve reproducibility. In integrative analysis, simultaneously analyzing all genes profiled may incur high computational cost. A computationally affordable remedy is prescreening, which fits marginal models, can be conducted in a parallel manner, and has low computational cost. Results An integrative prescreening approach is developed for the analysis of multiple cancer genomic datasets. Simulation shows that the proposed integrative prescreening has better performance than alternatives, particularly including prescreening with individual datasets, an intensity approach and meta-analysis. We also analyze multiple microarray gene profiling studies on liver and pancreatic cancers using the proposed approach. Conclusions The proposed integrative prescreening provides an effective way to reduce the dimensionality in cancer genomic studies. It can be coupled with existing analysis methods to identify cancer markers. PMID:22799431
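    A simplified stand-in for integrative prescreening is sketched below: a marginal per-gene statistic is computed in each study and the squared statistics are pooled across studies to rank genes. This only illustrates the idea of combining marginal models across datasets; it is not the paper's exact estimator.

    ```python
    import numpy as np

    def marginal_stats(X, y):
        """Per-gene two-sample t-like statistic for one dataset:
        X is (samples x genes), y is a binary outcome (0/1)."""
        g0, g1 = X[y == 0], X[y == 1]
        diff = g1.mean(axis=0) - g0.mean(axis=0)
        se = np.sqrt(g1.var(axis=0, ddof=1) / len(g1) + g0.var(axis=0, ddof=1) / len(g0))
        return diff / se

    def integrative_prescreen(datasets, top_k=50):
        """Rank genes by pooling marginal evidence across several studies
        (here: the sum of squared per-dataset statistics), keeping the top_k
        genes for downstream analysis."""
        pooled = sum(marginal_stats(X, y) ** 2 for X, y in datasets)
        return np.argsort(pooled)[::-1][:top_k]

    # synthetic example: 3 small studies, 500 genes, first 10 genes informative
    rng = np.random.default_rng(3)
    studies = []
    for n in (40, 60, 50):
        y = rng.integers(0, 2, n)
        X = rng.normal(size=(n, 500))
        X[:, :10] += 0.8 * y[:, None]          # shared signal in the first 10 genes
        studies.append((X, y))
    print(integrative_prescreen(studies, top_k=10))
    ```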

  7. BanglaLekha-Isolated: A multi-purpose comprehensive dataset of Handwritten Bangla Isolated characters.

    PubMed

    Biswas, Mithun; Islam, Rafiqul; Shom, Gautam Kumar; Shopon, Md; Mohammed, Nabeel; Momen, Sifat; Abedin, Anowarul

    2017-06-01

    BanglaLekha-Isolated, a Bangla handwritten isolated character dataset, is presented in this article. The dataset contains 84 different characters, comprising 50 Bangla basic characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting samples were collected for each of the 84 characters, digitized and pre-processed. After discarding mistakes and scribbles, 166,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and the gender of the subjects from whom the samples were collected. This dataset could be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.

  8. A machine learning pipeline for automated registration and classification of 3D lidar data

    NASA Astrophysics Data System (ADS)

    Rajagopal, Abhejit; Chellappan, Karthik; Chandrasekaran, Shivkumar; Brown, Andrew P.

    2017-05-01

    Despite the large availability of geospatial data, registration and exploitation of these datasets remains a persistent challenge in geoinformatics. Popular signal processing and machine learning algorithms, such as non-linear SVMs and neural networks, rely on well-formatted input models as well as reliable output labels, which are not always immediately available. In this paper we outline a pipeline for gathering, registering, and classifying initially unlabeled wide-area geospatial data. As an illustrative example, we demonstrate the training and testing of a convolutional neural network to recognize 3D models in the OGRIP 2007 LiDAR dataset using fuzzy labels derived from OpenStreetMap as well as other datasets available on OpenTopography.org. When auxiliary label information is required, various text and natural language processing filters are used to extract and cluster keywords useful for identifying potential target classes. A subset of these keywords are subsequently used to form multi-class labels, with no assumption of independence. Finally, we employ class-dependent geometry extraction routines to identify candidates from both training and testing datasets. Our regression networks are able to identify the presence of 6 structural classes, including roads, walls, and buildings, in volumes as big as 8000 m³ in as little as 1.2 seconds on a commodity 4-core Intel CPU. The presented framework is neither dataset nor sensor-modality limited due to the registration process, and is capable of multi-sensor data-fusion.

  9. Deep Learning for Low-Textured Image Matching

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.; Fedorenko, V. V.; Fomin, N. A.

    2018-05-01

    Low-textured objects pose challenges for automatic 3D model reconstruction. Such objects are common in archeological applications of photogrammetry. Most common feature point descriptors fail to match local patches in featureless regions of an object. Hence, automatic documentation of the archeological process using Structure from Motion (SfM) methods is challenging. Nevertheless, such documentation is possible with the aid of a human operator. Deep learning-based descriptors have recently outperformed most common feature point descriptors. This paper is focused on the development of a new Wide Image Zone Adaptive Robust feature Descriptor (WIZARD) based on deep learning. We use a convolutional auto-encoder to compress discriminative features of a local patch into a descriptor code. We build a codebook to perform point matching on multiple images. The matching is performed using a nearest neighbor search and a modified voting algorithm. We present a new "Multi-view Amphora" (Amphora) dataset for the evaluation of point matching algorithms. The dataset includes images of an Ancient Greek vase found at the Taman Peninsula in Southern Russia. The dataset provides color images, a ground truth 3D model, and a ground truth optical flow. We evaluated the WIZARD descriptor on the "Amphora" dataset to show that it outperforms the SIFT and SURF descriptors on the complex patch pairs.
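    The matching stage can be sketched independently of the auto-encoder: assuming each local patch has already been compressed to a fixed-length descriptor code, codes from two images are matched with a nearest-neighbour search plus a ratio test. This is a generic stand-in, not the WIZARD codebook or its modified voting algorithm.

    ```python
    import numpy as np

    def match_descriptors(codes_a, codes_b, ratio=0.8):
        """Nearest-neighbour matching of descriptor codes between two images
        with a ratio test: accept a match only if the best neighbour is
        clearly better than the second best. codes_* are (n_points, dim)."""
        # pairwise Euclidean distances
        d = np.linalg.norm(codes_a[:, None, :] - codes_b[None, :, :], axis=2)
        matches = []
        for i in range(d.shape[0]):
            order = np.argsort(d[i])
            best, second = order[0], order[1]
            if d[i, best] < ratio * d[i, second]:
                matches.append((i, int(best)))
        return matches

    # toy example: descriptors of image B are noisy copies of (shuffled) image A
    rng = np.random.default_rng(5)
    codes_a = rng.normal(size=(50, 32))
    perm = rng.permutation(50)
    codes_b = codes_a[perm] + rng.normal(0, 0.05, size=(50, 32))
    matches = match_descriptors(codes_a, codes_b)
    correct = sum(1 for i, j in matches if perm[j] == i)
    print(f"{len(matches)} matches, {correct} correct")
    ```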

  10. The Landcover Impact on the Aspect/Slope Accuracy Dependence of the SRTM-1 Elevation Data for the Humboldt Range.

    PubMed

    Miliaresis, George C

    2008-05-15

    The U.S. National Landcover Dataset (NLCD) and the U.S. National Elevation Dataset (NED) (bare earth elevations) were used in an attempt to assess to what extent the directional and slope dependency of the Shuttle Radar Topography Mission (SRTM) finished digital elevation model is affected by landcover. Four landcover classes - forest, shrubs, grass and snow cover - were included in the study area (Humboldt Range in the NW portion of Nevada, USA). Statistics, rose diagrams, and frequency distributions of the elevation differences (NED-SRTM) per landcover class per geographic direction were used. The decomposition of elevation differences on the basis of aspect and slope terrain classes identifies a) overestimation of elevation by the SRTM instrument along the E, NE and N directions (negative elevation differences that decrease linearly with slope), while b) underestimation is evident towards the W, SW and S directions (positive elevation differences that increase with slope). The aspect/slope/landcover modelling of elevation differences overcame the systematic errors evident in the SRTM dataset and revealed vegetation height information and the snow penetration capability of the SRTM instrument. The linear regression lines per landcover class might provide a means of correcting the systematic error (aspect/slope dependency) evident in the SRTM dataset.
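    The per-class correction suggested in the last sentence can be sketched as follows: the elevation difference (NED − SRTM) is regressed against slope separately for each landcover class, and the fitted lines are used to remove the slope-dependent error. The class codes and all values are synthetic.

    ```python
    import numpy as np

    def per_class_slope_regression(ned, srtm, slope, landcover):
        """Fit, for each landcover class, a linear model of the elevation
        difference (NED - SRTM) against terrain slope. Inputs are flattened
        rasters; the returned dict maps class code -> (intercept, slope)."""
        diff = ned - srtm
        models = {}
        for cls in np.unique(landcover):
            m = landcover == cls
            b, a = np.polyfit(slope[m], diff[m], 1)        # diff ≈ a + b * slope
            models[int(cls)] = (a, b)
        return models

    def correct_srtm(srtm, slope, landcover, models):
        """Add back the modelled slope-dependent systematic error per class."""
        corr = np.zeros_like(srtm, dtype=float)
        for cls, (a, b) in models.items():
            m = landcover == cls
            corr[m] = a + b * slope[m]
        return srtm + corr

    # synthetic demonstration (1 = forest, 2 = shrub as hypothetical class codes)
    rng = np.random.default_rng(11)
    n = 5000
    slope = rng.uniform(0, 40, n)
    landcover = rng.choice([1, 2], n)
    ned = 1500 + rng.normal(0, 5, n)
    bias = np.where(landcover == 1, -0.3, 0.1) * slope     # class-dependent slope bias
    srtm = ned - bias + rng.normal(0, 2, n)
    models = per_class_slope_regression(ned, srtm, slope, landcover)
    print({k: (round(a, 2), round(b, 2)) for k, (a, b) in models.items()})
    print("mean abs error after correction:",
          round(np.abs(correct_srtm(srtm, slope, landcover, models) - ned).mean(), 2))
    ```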

  11. The Landcover Impact on the Aspect/Slope Accuracy Dependence of the SRTM-1 Elevation Data for the Humboldt Range

    PubMed Central

    Miliaresis, George C.

    2008-01-01

    The U.S. National Landcover Dataset (NLCD) and the U.S. National Elevation Dataset (NED) (bare earth elevations) were used in an attempt to assess to what extent the directional and slope dependency of the Shuttle Radar Topography Mission (SRTM) finished digital elevation model is affected by landcover. Four landcover classes - forest, shrubs, grass and snow cover - were included in the study area (Humboldt Range in the NW portion of Nevada, USA). Statistics, rose diagrams, and frequency distributions of the elevation differences (NED-SRTM) per landcover class per geographic direction were used. The decomposition of elevation differences on the basis of aspect and slope terrain classes identifies a) overestimation of elevation by the SRTM instrument along the E, NE and N directions (negative elevation differences that decrease linearly with slope), while b) underestimation is evident towards the W, SW and S directions (positive elevation differences that increase with slope). The aspect/slope/landcover modelling of elevation differences overcame the systematic errors evident in the SRTM dataset and revealed vegetation height information and the snow penetration capability of the SRTM instrument. The linear regression lines per landcover class might provide a means of correcting the systematic error (aspect/slope dependency) evident in the SRTM dataset. PMID:27879870

  12. Developing a Resource for Implementing ArcSWAT Using Global Datasets

    NASA Astrophysics Data System (ADS)

    Taggart, M.; Caraballo Álvarez, I. O.; Mueller, C.; Palacios, S. L.; Schmidt, C.; Milesi, C.; Palmer-Moloney, L. J.

    2015-12-01

    This project developed a comprehensive user manual outlining methods for adapting and implementing global datasets for use within ArcSWAT for international and worldwide applications. The Soil and Water Assessment Tool (SWAT) is a hydrologic model that simulates a number of hydrologic variables, including runoff and the chemical makeup of water at a given location on the Earth's surface, using Digital Elevation Model (DEM), land cover, soil, and weather data. However, the application of ArcSWAT for projects outside of the United States is challenging, as there is no standard framework for inputting global datasets into ArcSWAT. This project aims to remove this obstacle by outlining methods for adapting and implementing these global datasets via the user manual. The manual takes the user through the processes of data conditioning while providing solutions and suggestions for common errors. The efficacy of the manual was explored using examples from watersheds located in Puerto Rico, Mexico and Western Africa. Each run explored the various options for setting up an ArcSWAT project as well as a range of satellite data products and soil databases. Future work will incorporate in-situ data for validation and calibration of the model and outline additional resources to assist future users in efficiently implementing the model for worldwide applications. The capacity to manage and monitor freshwater availability is of critical importance in both developed and developing countries. As populations grow and climate changes, both the quality and quantity of freshwater are affected, resulting in negative impacts on the health of the surrounding population. The use of hydrologic models such as ArcSWAT can help stakeholders and decision makers understand the future impacts of these changes, enabling informed and substantiated decisions.

  13. Advancing land surface model development with satellite-based Earth observations

    NASA Astrophysics Data System (ADS)

    Orth, Rene; Dutra, Emanuel; Trigo, Isabel F.; Balsamo, Gianpaolo

    2017-04-01

    The land surface forms an essential part of the climate system. It interacts with the atmosphere through the exchange of water and energy and hence influences weather and climate, as well as their predictability. Correspondingly, the land surface model (LSM) is an essential part of any weather forecasting system. LSMs rely on partly poorly constrained parameters, due to sparse land surface observations. With the use of newly available land surface temperature observations, we show in this study that novel satellite-derived datasets help to improve LSM configuration, and hence can contribute to improved weather predictability. We use the Hydrology Tiled ECMWF Scheme of Surface Exchanges over Land (HTESSEL) and validate it comprehensively against an array of Earth observation reference datasets, including the new land surface temperature product. This reveals satisfactory model performance in terms of hydrology, but poor performance in terms of land surface temperature. This is due to inconsistencies of process representations in the model as identified from an analysis of perturbed parameter simulations. We show that HTESSEL can be more robustly calibrated with multiple instead of single reference datasets as this mitigates the impact of the structural inconsistencies. Finally, performing coupled global weather forecasts we find that a more robust calibration of HTESSEL also contributes to improved weather forecast skills. In summary, new satellite-based Earth observations are shown to enhance the multi-dataset calibration of LSMs, thereby improving the representation of insufficiently captured processes, advancing weather predictability and understanding of climate system feedbacks. Orth, R., E. Dutra, I. F. Trigo, and G. Balsamo (2016): Advancing land surface model development with satellite-based Earth observations. Hydrol. Earth Syst. Sci. Discuss., doi:10.5194/hess-2016-628

  14. Quality Controlling CMIP datasets at GFDL

    NASA Astrophysics Data System (ADS)

    Horowitz, L. W.; Radhakrishnan, A.; Balaji, V.; Adcroft, A.; Krasting, J. P.; Nikonov, S.; Mason, E. E.; Schweitzer, R.; Nadeau, D.

    2017-12-01

    As GFDL makes the switch from model development to production in light of the Coupled Model Intercomparison Project (CMIP), GFDL's efforts have shifted to testing and, more importantly, establishing guidelines and protocols for quality control and semi-automated data publishing. Every CMIP cycle introduces key challenges, and the upcoming CMIP6 is no exception. The new CMIP experimental design comprises multiple MIPs facilitating research in different focus areas. This paradigm has implications not only for the groups that develop the models and conduct the runs, but also for the groups that monitor, analyze and quality control the datasets before they are published and before they make their way into reports like the IPCC (Intergovernmental Panel on Climate Change) Assessment Reports. In this talk, we discuss some of the paths taken at GFDL to quality control the CMIP-ready datasets, including: Jupyter notebooks, PrePARE, and a LAMP (Linux, Apache, MySQL, PHP/Python/Perl) technology-driven tracker system to monitor the status of experiments qualitatively and quantitatively and to provide additional metadata and analysis services, along with some built-in controlled-vocabulary validations in the workflow. In addition, we also discuss the integration of community-based model evaluation software (ESMValTool, PCMDI Metrics Package, and ILAMB) as part of our CMIP6 workflow.

  15. Learning maximum entropy models from finite-size data sets: A fast data-driven algorithm allows sampling from the posterior distribution.

    PubMed

    Ferrari, Ulisse

    2016-08-01

    Maximum entropy models provide the least constrained probability distributions that reproduce statistical properties of experimental datasets. In this work we characterize the learning dynamics that maximizes the log-likelihood in the case of large but finite datasets. We first show how the steepest descent dynamics is not optimal as it is slowed down by the inhomogeneous curvature of the model parameters' space. We then provide a way for rectifying this space which relies only on dataset properties and does not require large computational efforts. We conclude by solving the long-time limit of the parameters' dynamics including the randomness generated by the systematic use of Gibbs sampling. In this stochastic framework, rather than converging to a fixed point, the dynamics reaches a stationary distribution, which for the rectified dynamics reproduces the posterior distribution of the parameters. We sum up all these insights in a "rectified" data-driven algorithm that is fast and by sampling from the parameters' posterior avoids both under- and overfitting along all the directions of the parameters' space. Through the learning of pairwise Ising models from the recording of a large population of retina neurons, we show how our algorithm outperforms the steepest descent method.
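    For reference, the baseline dynamics that the paper improves upon - plain gradient ascent on the log-likelihood of a pairwise maximum entropy (Ising) model - can be sketched as below. Model expectations are computed by exact enumeration over a handful of units; the paper's setting would replace this with Gibbs sampling, and its rectified, posterior-sampling algorithm is not reproduced here.

    ```python
    import numpy as np
    from itertools import product

    def fit_pairwise_maxent(data, lr=0.1, steps=2000):
        """Fit fields h and couplings J of a pairwise maximum entropy model
        to binary data (n_samples x n_units) by plain gradient ascent on the
        log-likelihood, i.e. by matching first and second moments."""
        n = data.shape[1]
        states = np.array(list(product([0, 1], repeat=n)), dtype=float)
        emp_m = data.mean(axis=0)                     # empirical means
        emp_c = data.T @ data / len(data)             # empirical pairwise moments
        h, J = np.zeros(n), np.zeros((n, n))
        for _ in range(steps):
            energies = states @ h + np.einsum("si,ij,sj->s", states, J, states) / 2
            p = np.exp(energies - energies.max())
            p /= p.sum()
            mod_m = p @ states                        # model means
            mod_c = states.T @ (states * p[:, None])  # model pairwise moments
            h += lr * (emp_m - mod_m)                 # moment-matching updates
            J += lr * (emp_c - mod_c)
            np.fill_diagonal(J, 0.0)
        return h, J

    # synthetic binary "spike" data for 5 units
    rng = np.random.default_rng(0)
    data = (rng.random((2000, 5)) < 0.3).astype(float)
    h, J = fit_pairwise_maxent(data)
    print(np.round(h, 2), np.round(J, 2), sep="\n")
    ```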

  16. Improving protein–protein interactions prediction accuracy using protein evolutionary information and relevance vector machine model

    PubMed Central

    An, Ji‐Yong; Meng, Fan‐Rong; Chen, Xing; Yan, Gui‐Ying; Hu, Ji‐Pu

    2016-01-01

    Abstract Predicting protein–protein interactions (PPIs) is a challenging task and essential for constructing protein interaction networks, which is important for facilitating our understanding of the mechanisms of biological systems. Although a number of high‐throughput technologies have been proposed to predict PPIs, they have unavoidable shortcomings, including high cost, time intensity, and inherently high false positive rates. For these reasons, many computational methods have been proposed for predicting PPIs. However, the problem is still far from being solved. In this article, we propose a novel computational method called RVM‐BiGP that combines the relevance vector machine (RVM) model and Bi‐gram Probabilities (BiGP) for PPI detection from protein sequences. The major improvements include the following: (1) protein sequences are represented using the Bi‐gram probabilities (BiGP) feature representation on a Position Specific Scoring Matrix (PSSM), in which the protein evolutionary information is contained; (2) to reduce the influence of noise, the Principal Component Analysis (PCA) method is used to reduce the dimension of the BiGP vector; (3) the powerful and robust Relevance Vector Machine (RVM) algorithm is used for classification. Five‐fold cross‐validation experiments executed on the yeast and Helicobacter pylori datasets achieved very high accuracies of 94.57 and 90.57%, respectively. These results are significantly better than those of previous methods. To further evaluate the proposed method, we compare it with the state‐of‐the‐art support vector machine (SVM) classifier on the yeast dataset. The experimental results demonstrate that our RVM‐BiGP method is significantly better than the SVM‐based method. In addition, we achieved 97.15% accuracy on the imbalanced yeast dataset, which is higher than that on the balanced yeast dataset. The promising experimental results show the efficiency and robustness of the proposed method, which can serve as an automatic decision support tool for future proteomics research. To facilitate extensive studies in future proteomics research, we developed a freely available web server called RVM‐BiGP‐PPIs in Hypertext Preprocessor (PHP) for predicting PPIs. The web server, including the source code and the datasets, is available at http://219.219.62.123:8888/BiGP/. PMID:27452983

  17. Reconstructing missing information on precipitation datasets: impact of tails on adopted statistical distributions.

    NASA Astrophysics Data System (ADS)

    Pedretti, Daniele; Beckie, Roger Daniel

    2014-05-01

    Missing data in hydrological time-series databases are ubiquitous in practical applications, yet it is of fundamental importance to make educated decisions in problems requiring exhaustive time-series knowledge. This includes precipitation datasets, since recording or human failures can produce gaps in these time series. For some applications, directly involving the ratio between precipitation and some other quantity, lack of complete information can result in poor understanding of basic physical and chemical dynamics involving precipitated water. For instance, the ratio between precipitation (recharge) and outflow rates at a discharge point of an aquifer (e.g. rivers, pumping wells, lysimeters) can be used to obtain aquifer parameters and thus to constrain model-based predictions. We tested a suite of methodologies to reconstruct missing information in rainfall datasets. The goal was to obtain a suitable and versatile method to reduce the errors caused by the lack of data in specific time windows. Our analyses included both a classical chronological pairing approach between rainfall stations and a probability-based approach, which accounted for the probability of exceedance of rain depths measured at two or multiple stations. Our analyses showed that it is not clear a priori which method performs best; rather, the selection should be based on the specific statistical properties of the rainfall dataset. In this presentation, our emphasis is on discussing the effects of a few typical parametric distributions used to model the behavior of rainfall. Specifically, we analyzed the role of distributional "tails", which have an important control on the occurrence of extreme rainfall events. The latter strongly affect several hydrological applications, including recharge-discharge relationships. The heavy-tailed distributions we considered were the parametric Log-Normal, Generalized Pareto, Generalized Extreme Value and Gamma distributions. The methods were first tested on synthetic examples, to maintain complete control over the impact of several variables, such as the minimum amount of data required to obtain reliable statistical distributions from the selected parametric functions. We then applied the methodology to precipitation datasets collected in the Vancouver area and at a mining site in Peru.
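    One of the reconstruction strategies discussed above - mapping rain depths between stations through fitted parametric distributions - can be sketched as follows; the Gamma distribution is used here as one of the candidate tails, and the station records and gap pattern are synthetic.

    ```python
    import numpy as np
    from scipy import stats

    def fill_gaps_quantile_mapping(target, neighbor, dist=stats.gamma):
        """Fill missing (NaN) daily amounts at a target station by quantile
        mapping from a neighbouring station: fit the chosen parametric
        distribution to the wet days of both records, then map the neighbour's
        non-exceedance probability into the target's distribution. The choice
        of `dist` (Gamma here; Log-Normal, Generalized Pareto or GEV are
        alternatives) controls the tail behaviour discussed above."""
        wet_t = target[(~np.isnan(target)) & (target > 0)]
        wet_n = neighbor[neighbor > 0]
        par_t = dist.fit(wet_t, floc=0)
        par_n = dist.fit(wet_n, floc=0)
        filled = target.copy()
        gaps = np.isnan(target)
        p = dist.cdf(neighbor[gaps], *par_n)             # neighbour's probability
        filled[gaps] = np.where(neighbor[gaps] > 0,
                                dist.ppf(p, *par_t), 0)  # map into target's distribution
        return filled

    # synthetic daily records: correlated gamma rain with ~70% dry days
    rng = np.random.default_rng(2)
    n = 3000
    wet = rng.random(n) < 0.3
    neighbor = np.where(wet, rng.gamma(0.8, 12.0, n), 0.0)
    target = np.where(wet, neighbor * rng.lognormal(0, 0.3, n), 0.0)
    target[rng.random(n) < 0.05] = np.nan                # 5% missing
    print(np.nanmean(target), fill_gaps_quantile_mapping(target, neighbor).mean())
    ```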

  18. Assessing models of arsenic occurrence in drinking water from bedrock aquifers in New Hampshire

    USGS Publications Warehouse

    Andy, Caroline; Fahnestock, Maria Florencia; Lombard, Melissa; Hayes, Laura; Bryce, Julie; Ayotte, Joseph

    2017-01-01

    Three existing multivariate logistic regression models were assessed using new data to evaluate the capacity of the models to correctly predict the probability of groundwater arsenic concentrations exceeding the threshold values of 1, 5, and 10 micrograms per liter (µg/L) in New Hampshire, USA. A recently released testing dataset includes arsenic concentrations from groundwater samples collected in 2004–2005 from a mix of 367 public-supply and private domestic wells. The use of this dataset to test three existing logistic regression models demonstrated enhanced overall predictive accuracy for the 5 and 10 μg/L models. Overall accuracies of 54.8, 76.3, and 86.4 percent were reported for the 1, 5, and 10 μg/L models, respectively. The state was divided by counties into northwest and southeast regions. Regional differences in accuracy were identified; models had an average accuracy of 83.1 percent for the counties in the northwest and 63.7 percent in the southeast. This is most likely due to high model specificity in the northwest and regional differences in arsenic occurrence. Though these models have limitations, they allow for arsenic hazard assessment across the region. The introduction of well-type (public or private), well depth, and casing length as explanatory variables may be appropriate measures to improve model performance. Our findings indicate that the original models generalize to the testing dataset, and should continue to serve as an important vehicle of preventative public health that may be applied to other groundwater contaminants in New Hampshire.

  19. Assessment of municipal solid waste settlement models based on field-scale data analysis.

    PubMed

    Bareither, Christopher A; Kwak, Seungbok

    2015-08-01

    An evaluation of municipal solid waste (MSW) settlement model performance and applicability was conducted based on analysis of two field-scale datasets: (1) Yolo and (2) the Deer Track Bioreactor Experiment (DTBE). Twelve MSW settlement models were considered, covering a range of compression behavior (i.e., immediate compression, mechanical creep, and biocompression) and a range of total (2-22) and optimized (2-7) model parameters. A multi-layer immediate settlement analysis developed for Yolo provides a framework to estimate initial waste thickness and waste thickness at the end of immediate compression. Model application to the Yolo test cells (conventional and bioreactor landfills) via least-squares optimization yielded high coefficients of determination for all settlement models (R² > 0.83). However, empirical models (i.e., power creep, logarithmic, and hyperbolic models) are not recommended for use in MSW settlement modeling due to potentially non-representative long-term MSW behavior, the limited physical significance of model parameters, and the settlement data required for model parameterization. Settlement models that combine mechanical creep and biocompression into a single mathematical function constrain time-dependent settlement to a single process with finite magnitude, which limits model applicability. Overall, all evaluated models that couple multiple compression processes (immediate, creep, and biocompression) provided accurate representations of both the Yolo and DTBE datasets. A model presented in Gourc et al. (2010) included the lowest number of total and optimized model parameters and yielded high statistical performance for all model applications (R² ⩾ 0.97).
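    The least-squares model application described above can be illustrated with a generic strain function combining a logarithmic creep term and a first-order biocompression term; the functional form and parameter values are illustrative, not the Gourc et al. (2010) model or the field data.

    ```python
    import numpy as np
    from scipy.optimize import curve_fit

    def strain_model(t, c_creep, c_bio, k):
        """Generic time-dependent MSW strain: a logarithmic mechanical-creep
        term plus a first-order biocompression term (t in days)."""
        return c_creep * np.log10(1.0 + t) + c_bio * (1.0 - np.exp(-k * t))

    # synthetic settlement record (strain vs. time)
    rng = np.random.default_rng(4)
    t = np.linspace(1, 2000, 120)
    obs = strain_model(t, 0.03, 0.12, 0.004) + rng.normal(0, 0.004, t.size)

    # least-squares optimization of the model parameters, as in the model
    # applications described above, plus the coefficient of determination
    popt, _ = curve_fit(strain_model, t, obs, p0=[0.01, 0.1, 0.001])
    pred = strain_model(t, *popt)
    r2 = 1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    print(np.round(popt, 4), round(r2, 3))
    ```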

  20. The Interaction Network Ontology-supported modeling and mining of complex interactions represented with multiple keywords in biomedical literature.

    PubMed

    Özgür, Arzucan; Hur, Junguk; He, Yongqun

    2016-01-01

    The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination. This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords co-existing in one sentence to represent specific INO interaction classes. Such keyword combinations and related INO interaction type information could be automatically obtained via SPARQL queries, formatted in Excel format, and used in an INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used benchmark Learning Logic in Language (LLL) dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type. The INO ontology currently has 575 terms including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations: 'has literature mining keywords' and 'has keyword dependency pattern'. The keyword dependency patterns were generated via running the Stanford Parser to obtain dependency relation types. Out of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 were identified by using the direct dependency relations. The LLL dataset contained 34 gene regulation interaction types, each of which associated with multiple keywords. A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. The phenomenon of having multi-keyword interaction types was also frequently observed in the vaccine dataset. By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords.

  1. Brokering technologies to realize the hydrology scenario in NSF BCube

    NASA Astrophysics Data System (ADS)

    Boldrini, Enrico; Easton, Zachary; Fuka, Daniel; Pearlman, Jay; Nativi, Stefano

    2015-04-01

    In the National Science Foundation (NSF) BCube project, an international team composed of cyberinfrastructure experts, geoscientists, social scientists and educators is working together to explore the use of brokering technologies, initially focusing on four domains: hydrology, oceans, polar, and weather. In the hydrology domain, environmental models are fundamental to understanding the behaviour of hydrological systems. A specific model usually requires datasets coming from different disciplines for its initialization (e.g. elevation models from Earth observation, weather data from atmospheric sciences, etc.). Scientific datasets are usually available on heterogeneous publishing services, such as inventory and access services (e.g. OGC Web Coverage Service, THREDDS Data Server, etc.). Indeed, datasets are published according to different protocols, and they usually come in different formats, resolutions and Coordinate Reference Systems (CRSs): in short, different grid environments depending on the original data and the publishing service processing capabilities. Scientists can thus be impeded by the burden of discovering, accessing and normalizing the desired datasets into the grid environment required by the model. These technological tasks of course divert scientists from their main, scientific goals. The GI-axe brokering framework was tested in a hydrology scenario where scientists needed to compare a particular hydrological model with two different input datasets (digital elevation models): the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) dataset, v.2, and the Shuttle Radar Topography Mission (SRTM) dataset, v.3. These datasets were published by means of Hyrax Server technology, which can provide NetCDF files at their original resolution and CRS. Scientists had their model running on ArcGIS, so the main goal was to import the datasets using the available ArcPy library, with EPSG:4326 as the reference system and a common resolution grid, so that model outputs could be compared. ArcPy, however, is able to access only GeoTIFF datasets that are published by an OGC Web Coverage Service (WCS). The GI-axe broker was therefore deployed between the client application and the data providers. It was configured to broker the two different Hyrax service endpoints and republish the data content through a WCS interface for use by the ArcPy library. Finally, scientists were able to easily run the model and to concentrate on comparing the different results obtained with each input dataset. The use of a third-party broker to perform such technological tasks has also shown the potential advantage of increasing the repeatability of a study among different researchers.

  2. On the mutual relationship between conceptual models and datasets in geophysical monitoring of volcanic systems

    NASA Astrophysics Data System (ADS)

    Neuberg, J. W.; Thomas, M.; Pascal, K.; Karl, S.

    2012-04-01

    Geophysical datasets are essential to guide, in particular, short-term forecasting of volcanic activity. Key parameters are derived from these datasets and interpreted in different ways; however, the biggest impact on the interpretation is not determined by the range of parameters but is controlled by the parameterisation and the underlying conceptual model of the volcanic process. On the other hand, the increasing number of sophisticated geophysical models needs to be constrained by monitoring data, to transform a merely numerical exercise into a useful forecasting tool. We utilise datasets from the "big three" - seismology, deformation and gas emissions - to gain insight into the mutual relationship between conceptual models and constraining data. We show that, for example, the same seismic dataset can be interpreted with respect to a wide variety of different models with very different implications for forecasting. In turn, different data processing procedures lead to different outcomes even though they are based on the same conceptual model. Unsurprisingly, the most reliable interpretation will be achieved by employing multi-disciplinary models with overlapping constraints.

  3. The influence of data characteristics on detecting wetland/stream surface-water connections in the Delmarva Peninsula, Maryland and Delaware

    USGS Publications Warehouse

    Vanderhoof, Melanie; Distler, Hayley; Lang, Megan W.; Alexander, Laurie C.

    2018-01-01

    The dependence of downstream waters on upstream ecosystems necessitates an improved understanding of watershed-scale hydrological interactions including connections between wetlands and streams. An evaluation of such connections is challenging when, (1) accurate and complete datasets of wetland and stream locations are often not available and (2) natural variability in surface-water extent influences the frequency and duration of wetland/stream connectivity. The Upper Choptank River watershed on the Delmarva Peninsula in eastern Maryland and Delaware is dominated by a high density of small, forested wetlands. In this analysis, wetland/stream surface water connections were quantified using multiple wetland and stream datasets, including headwater streams and depressions mapped from a lidar-derived digital elevation model. Surface-water extent was mapped across the watershed for spring 2015 using Landsat-8, Radarsat-2 and Worldview-3 imagery. The frequency of wetland/stream connections increased as a more complete and accurate stream dataset was used and surface-water extent was included, in particular when the spatial resolution of the imagery was finer (i.e., <10 m). Depending on the datasets used, 12–60% of wetlands by count (21–93% of wetlands by area) experienced surface-water interactions with streams during spring 2015. This translated into a range of 50–94% of the watershed contributing direct surface water runoff to streamflow. This finding suggests that our interpretation of the frequency and duration of wetland/stream connections will be influenced not only by the spatial and temporal characteristics of wetlands, streams and potential flowpaths, but also by the completeness, accuracy and resolution of input datasets.

  4. DataMed - an open source discovery index for finding biomedical datasets.

    PubMed

    Chen, Xiaoling; Gururaj, Anupama E; Ozyurt, Burak; Liu, Ruiling; Soysal, Ergin; Cohen, Trevor; Tiryaki, Firat; Li, Yueling; Zong, Nansu; Jiang, Min; Rogith, Deevakar; Salimi, Mandana; Kim, Hyeon-Eui; Rocca-Serra, Philippe; Gonzalez-Beltran, Alejandra; Farcas, Claudiu; Johnson, Todd; Margolis, Ron; Alter, George; Sansone, Susanna-Assunta; Fore, Ian M; Ohno-Machado, Lucila; Grethe, Jeffrey S; Xu, Hua

    2018-01-13

    Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain. DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine. Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and that core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the number of relevant results in the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. We have made the DataMed system publicly available as an open source package for the biomedical community.
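    The retrieval metrics quoted above can be reproduced for a toy query as follows; plain average precision is shown as a simpler stand-in for the inferred average precision used in the bioCADDIE evaluation.

    ```python
    def precision_at_k(ranked_ids, relevant_ids, k=10):
        """Fraction of the top-k retrieved datasets that are relevant
        (the P@10 measure quoted above)."""
        top = ranked_ids[:k]
        return sum(1 for d in top if d in relevant_ids) / k

    def average_precision(ranked_ids, relevant_ids):
        """Average precision for one query: mean of the precision values
        taken at the rank of each relevant dataset that is retrieved."""
        hits, precisions = 0, []
        for rank, d in enumerate(ranked_ids, start=1):
            if d in relevant_ids:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / max(len(relevant_ids), 1)

    # toy query: 20 ranked dataset IDs, 5 of which are judged relevant
    ranked = [f"ds{i:02d}" for i in range(20)]
    relevant = {"ds00", "ds02", "ds07", "ds11", "ds15"}
    print(precision_at_k(ranked, relevant), round(average_precision(ranked, relevant), 3))
    ```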

  5. Management and assimilation of diverse, distributed watershed datasets

    NASA Astrophysics Data System (ADS)

    Varadharajan, C.; Faybishenko, B.; Versteeg, R.; Agarwal, D.; Hubbard, S. S.; Hendrix, V.

    2016-12-01

    The U.S. Department of Energy's (DOE) Watershed Function Scientific Focus Area (SFA) seeks to determine how perturbations to mountainous watersheds (e.g., floods, drought, early snowmelt) impact the downstream delivery of water, nutrients, carbon, and metals over seasonal to decadal timescales. We are building a software platform that enables integration of diverse and disparate field, laboratory, and simulation datasets, of various types including hydrological, geological, meteorological, geophysical, geochemical, ecological and genomic datasets, across a range of spatial and temporal scales within the Rifle floodplain and the East River watershed, Colorado. We are using agile data management and assimilation approaches to enable web-based integration of heterogeneous, multi-scale data. Sensor-based observations of water level, vadose zone and groundwater temperature, water quality, and meteorology, as well as biogeochemical analyses of soil and groundwater samples, have been curated and archived in federated databases. Quality assurance and quality control (QA/QC) are performed on priority datasets needed for ongoing scientific analyses and for hydrological and geochemical modeling. Automated QA/QC methods are used to identify and flag issues in the datasets. Data integration is achieved via a brokering service that dynamically integrates data from distributed databases via web services, based on user queries. The integrated results are presented to users in a portal that enables intuitive search, interactive visualization and download of integrated datasets. The concepts, approaches and codes being used are shared across the data science components of various large DOE-funded projects such as the Watershed Function SFA, Next Generation Ecosystem Experiment (NGEE) Tropics, Ameriflux/FLUXNET, and Advanced Simulation Capability for Environmental Management (ASCEM), and together contribute towards DOE's cyberinfrastructure for data management and model-data integration.
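    The automated QA/QC flagging mentioned above can be sketched with two simple checks (a physical-range check and a spike check) on a synthetic sensor record; the thresholds and variable names are illustrative, not the SFA's actual rules.

    ```python
    import numpy as np
    import pandas as pd

    def qaqc_flags(series, valid_range, max_step):
        """Two simple automated QA/QC checks of the kind used to flag sensor
        records: a physical-range check and a spike (step-change) check.
        Thresholds would normally be set per variable and per site."""
        range_flag = (series < valid_range[0]) | (series > valid_range[1])
        spike_flag = series.diff().abs() > max_step
        return pd.DataFrame({"range_fail": range_flag, "spike_fail": spike_flag})

    # synthetic groundwater-temperature record with one bad value and one spike
    idx = pd.date_range("2016-06-01", periods=96, freq="h")
    temp = pd.Series(9.0 + 0.5 * np.sin(np.linspace(0, 6, 96)), index=idx)
    temp.iloc[40] = -999.0       # sensor dropout code
    temp.iloc[70] += 4.0         # unrealistic jump
    flags = qaqc_flags(temp, valid_range=(0.0, 30.0), max_step=1.0)
    print(flags.sum())
    ```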

  6. Architecture of the local spatial data infrastructure for regional climate change research

    NASA Astrophysics Data System (ADS)

    Titov, Alexander; Gordov, Evgeny

    2013-04-01

    Georeferenced datasets (meteorological databases, modeling and reanalysis results, etc.) are actively used in modeling and analysis of climate change on various spatial and temporal scales. Due to the inherent heterogeneity of environmental datasets, as well as their size, which might reach tens of terabytes for a single dataset, studies of climate and environmental change require special software support based on the SDI approach. A dedicated architecture of a local spatial data infrastructure aimed at regional climate change analysis using modern web mapping technologies is presented. A geoportal is a key element of any SDI, allowing searches for geoinformation resources (datasets and services) using metadata catalogs, producing geospatial data selections by their parameters (data access functionality), and managing services and applications for cartographical visualization. It should be noted that, for objective reasons such as large dataset volumes, the complexity of the data models used, and syntactic and semantic differences between datasets, the development of environmental geodata access, processing and visualization services turns out to be quite a complex task. These circumstances were taken into account while developing the architecture of the local spatial data infrastructure as a universal framework providing geodata services. The architecture presented therefore includes: 1. A storage model for large sets of regional georeferenced data that is efficient in terms of search, access, retrieval and subsequent statistical processing, allowing in particular the storage of frequently used values (monthly and annual climate change indices, etc.) and thus providing different temporal views of the datasets. 2. The general architecture of the corresponding software components handling geospatial datasets within the storage model. 3. A metadata catalog describing in detail, using the ISO 19115 and CF-convention standards, the datasets used in climate research as a basic element of the spatial data infrastructure, published according to the OGC CSW (Catalog Service for the Web) specification. 4. Computational and mapping web services for working with geospatial datasets based on the OWS (OGC Web Services) standards: WMS, WFS, WPS. 5. A geoportal as the key element of the thematic regional spatial data infrastructure, also providing a software framework for dedicated web application development. To realize the web mapping services, GeoServer software is used, since it provides a native WPS implementation as a separate software module. To provide geospatial metadata services, the GeoNetwork Opensource (http://geonetwork-opensource.org) product is planned to be used, as it supports the ISO 19115/ISO 19119/ISO 19139 metadata standards as well as the ISO CSW 2.0 profile for both client and server. To implement thematic applications based on geospatial web services within the framework of the local SDI geoportal, the following open source software has been selected: 1. The OpenLayers JavaScript library, providing basic web mapping functionality for a thin client such as a web browser. 2. The GeoExt/ExtJS JavaScript libraries for building client-side web applications working with geodata services. The web interface developed will be similar to the interface of popular desktop GIS applications such as uDig, QuantumGIS, etc. The work is partially supported by RF Ministry of Education and Science grant 8345, SB RAS Program VIII.80.2.1 and IP 131.

  7. How do you assign persistent identifiers to extracts from large, complex, dynamic data sets that underpin scholarly publications?

    NASA Astrophysics Data System (ADS)

    Wyborn, Lesley; Car, Nicholas; Evans, Benjamin; Klump, Jens

    2016-04-01

    Persistent identifiers in the form of a Digital Object Identifier (DOI) are becoming more mainstream, assigned at both the collection and dataset level. For static datasets, this is a relatively straightforward matter. However, many new data collections are dynamic, with new data being appended, models and derivative products being revised with new data, or the data itself being revised as processing methods improve. Further, because data collections are becoming accessible as services, researchers can log in and dynamically create user-defined subsets for specific research projects; they can also easily mix and match data from multiple collections, each of which can have a complex history. Inevitably, extracts from such dynamic datasets underpin scholarly publications, and this presents new challenges. The National Computational Infrastructure (NCI) has been encountering these issues and making progress towards addressing them. The NCI is a large node of the Research Data Services initiative (RDS) of the Australian Government's research infrastructure, which currently makes available over 10 PBytes of priority research collections, ranging from geosciences, geophysics, environment, and climate, through to astronomy, bioinformatics, and social sciences. Data are replicated to, or are produced at, NCI and then processed there into higher-level data products or directly analysed. Individual datasets range from multi-petabyte computational models and large-volume raster arrays, down to gigabyte-size, ultra-high-resolution datasets. To facilitate access, maximise reuse and enable integration across the disciplines, datasets have been organized on a platform called the National Environmental Research Data Interoperability Platform (NERDIP). Combined, the NERDIP data collections form a rich and diverse asset for researchers: their co-location and standardization optimises the value of existing data and forms a new resource to underpin data-intensive science. New publication procedures require that a persistent identifier (DOI) be provided for the dataset that underpins the publication. Producing these for data extracts from the NCI data node using only DOIs is proving difficult: preserving a copy of each data extract is not possible due to data scale. One proposal is for researchers to use workflows that capture the provenance of each data extraction, including metadata (e.g., the version of the dataset used, the query and the time of extraction). In parallel, NCI is now working with the NERDIP dataset providers to ensure that the provenance of data publication is also captured in provenance systems, including references to previous versions and a history of data appended or modified. The proposed solution would require an enhancement to scholarly publication procedures whereby the reference to the dataset underlying a scholarly publication would be the persistent identifier of the provenance workflow that created the data extract. In turn, the provenance workflow would itself link to a series of persistent identifiers that, at a minimum, provide complete dataset production transparency and, if required, would facilitate reconstruction of the dataset. Such a solution will require strict adherence to design patterns for provenance representation, to ensure that the provenance representation of the workflow does indeed contain the information required to deliver dataset generation transparency and a pathway to reconstruction.
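    A minimal sketch of the kind of provenance record proposed above - capturing dataset version, query, extraction time and a checksum of the extract - is shown below; the field names and the placeholder DOI are illustrative, and a production system would use a standard such as W3C PROV rather than an ad hoc dictionary.

    ```python
    import hashlib
    import json
    from datetime import datetime, timezone

    def provenance_record(dataset_id, dataset_version, query, extract_path):
        """Minimal provenance record for one data extraction: the dataset
        identifier and version, the exact query, the extraction time, and a
        checksum of the extracted file."""
        with open(extract_path, "rb") as f:
            checksum = hashlib.sha256(f.read()).hexdigest()
        return {
            "dataset_id": dataset_id,
            "dataset_version": dataset_version,
            "query": query,
            "extracted_at": datetime.now(timezone.utc).isoformat(),
            "extract_sha256": checksum,
        }

    # hypothetical usage: record the extraction of a subset before citing it
    record = provenance_record(
        dataset_id="doi:10.xxxx/example-collection",      # placeholder identifier
        dataset_version="v3.1",
        query="time=2015-01..2015-12 & var=precip & bbox=110,-45,155,-10",
        extract_path=__file__,                            # stand-in for the extract file
    )
    print(json.dumps(record, indent=2))
    ```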

  8. Uncertainty in structural interpretation: Lessons to be learnt

    NASA Astrophysics Data System (ADS)

    Bond, Clare E.

    2015-05-01

    Uncertainty in the interpretation of geological data is an inherent element of geology. Datasets from different sources: remotely sensed seismic imagery, field data and borehole data, are often combined and interpreted to create a geological model of the sub-surface. The data have limited resolution and spatial distribution that results in uncertainty in the interpretation of the data and in the subsequent geological model(s) created. Methods to determine the extent of interpretational uncertainty of a dataset, how to capture and express that uncertainty, and consideration of uncertainties in terms of risk have been investigated. Here I review the work that has taken place and discuss best practice in accounting for uncertainties in structural interpretation workflows. Barriers to best practice are reflected on, including the use of software packages for interpretation. Experimental evidence suggests that minimising interpretation error through the use of geological reasoning and rules can help decrease interpretation uncertainty; through identification of inadmissible interpretations and in highlighting areas of uncertainty. Understanding expert thought processes and reasoning, including the use of visuospatial skills, during interpretation may aid in the identification of uncertainties, and in the education of new geoscientists.

  9. Evaluating uncertainties in modelling the snow hydrology of the Fraser River Basin, British Columbia, Canada

    NASA Astrophysics Data System (ADS)

    Islam, Siraj Ul; Déry, Stephen J.

    2017-03-01

    This study evaluates predictive uncertainties in the snow hydrology of the Fraser River Basin (FRB) of British Columbia (BC), Canada, using the Variable Infiltration Capacity (VIC) model forced with several high-resolution gridded climate datasets. These datasets include the Canadian Precipitation Analysis and the thin-plate smoothing splines (ANUSPLIN), North American Regional Reanalysis (NARR), University of Washington (UW) and Pacific Climate Impacts Consortium (PCIC) gridded products. Uncertainties are evaluated at different stages of the VIC implementation, starting with the driving datasets, optimization of model parameters, and model calibration during cool and warm phases of the Pacific Decadal Oscillation (PDO). The inter-comparison of the forcing datasets (precipitation and air temperature) and their VIC simulations (snow water equivalent - SWE - and runoff) reveals widespread differences over the FRB, especially in mountainous regions. The ANUSPLIN precipitation shows a considerable dry bias in the Rocky Mountains, whereas the NARR winter air temperature is 2 °C warmer than the other datasets over most of the FRB. In the VIC simulations, the elevation-dependent changes in the maximum SWE (maxSWE) are more prominent at higher elevations of the Rocky Mountains, where the PCIC-VIC simulation accumulates too much SWE and ANUSPLIN-VIC yields an underestimation. Additionally, at each elevation range, the day of maxSWE varies from 10 to 20 days between the VIC simulations. The snow melting season begins early in the NARR-VIC simulation, whereas the PCIC-VIC simulation delays the melting, indicating seasonal uncertainty in SWE simulations. When compared with the observed runoff for the Fraser River main stem at Hope, BC, the ANUSPLIN-VIC simulation shows considerable underestimation of runoff throughout the water year owing to reduced precipitation in the ANUSPLIN forcing dataset. The NARR-VIC simulation yields more winter and spring runoff and earlier decline of flows in summer due to a nearly 15-day earlier onset of the FRB springtime snowmelt. Analysis of the parametric uncertainty in the VIC calibration process shows that the choice of the initial parameter range plays a crucial role in defining the model hydrological response for the FRB. Furthermore, the VIC calibration process is biased toward cool and warm phases of the PDO and the choice of proper calibration and validation time periods is important for the experimental setup. Overall the VIC hydrological response is prominently influenced by the uncertainties involved in the forcing datasets rather than those in its parameter optimization and experimental setups.

  10. Solar Irradiance Data Products at the LASP Interactive Solar IRradiance Datacenter (LISIRD)

    NASA Astrophysics Data System (ADS)

    Lindholm, D. M.; Ware DeWolfe, A.; Wilson, A.; Pankratz, C. K.; Snow, M. A.; Woods, T. N.

    2011-12-01

    The Laboratory for Atmospheric and Space Physics (LASP) has developed the LASP Interactive Solar IRradiance Datacenter (LISIRD, http://lasp.colorado.edu/lisird/) web site to provide access to a comprehensive set of solar irradiance measurements and related datasets. Current data holdings include products from NASA missions SORCE, UARS, SME, and TIMED-SEE. The data provided covers a wavelength range from soft X-ray (XUV) at 0.1 nm up to the near infrared (NIR) at 2400 nm, as well as Total Solar Irradiance (TSI). Other datasets include solar indices, spectral and flare models, solar images, and more. The LISIRD web site features updated plotting, browsing, and download capabilities enabled by dygraphs, JavaScript, and Ajax calls to the LASP Time Series Server (LaTiS). In addition to the web browser interface, most of the LISIRD datasets can be accessed via the LaTiS web service interface that supports the OPeNDAP standard. OPeNDAP clients and other programming APIs are available for making requests that subset, aggregate, or filter data on the server before it is transported to the user. This poster provides an overview of the LISIRD system, summarizes the datasets currently available, and provides details on how to access solar irradiance data products through LISIRD's interfaces.

  11. New models and online calculator for predicting non-sentinel lymph node status in sentinel lymph node positive breast cancer patients.

    PubMed

    Kohrt, Holbrook E; Olshen, Richard A; Bermas, Honnie R; Goodson, William H; Wood, Douglas J; Henry, Solomon; Rouse, Robert V; Bailey, Lisa; Philben, Vicki J; Dirbas, Frederick M; Dunn, Jocelyn J; Johnson, Denise L; Wapnir, Irene L; Carlson, Robert W; Stockdale, Frank E; Hansen, Nora M; Jeffrey, Stefanie S

    2008-03-04

    Current practice is to perform a completion axillary lymph node dissection (ALND) for breast cancer patients with tumor-involved sentinel lymph nodes (SLNs), although fewer than half will have non-sentinel node (NSLN) metastasis. Our goal was to develop new models to quantify the risk of NSLN metastasis in SLN-positive patients and to compare predictive capabilities to another widely used model. We constructed three models to predict NSLN status: recursive partitioning with receiver operating characteristic curves (RP-ROC), boosted Classification and Regression Trees (CART), and multivariate logistic regression (MLR) informed by CART. Data were compiled from a multicenter Northern California and Oregon database of 784 patients who prospectively underwent SLN biopsy and completion ALND. We compared the predictive abilities of our best model and the Memorial Sloan-Kettering Breast Cancer Nomogram (Nomogram) in our dataset and an independent dataset from Northwestern University. 285 patients had positive SLNs, of which 213 had known angiolymphatic invasion status and 171 had complete pathologic data including hormone receptor status. 264 (93%) patients had limited SLN disease (micrometastasis, 70%, or isolated tumor cells, 23%). 101 (35%) of all SLN-positive patients had tumor-involved NSLNs. Three variables (tumor size, angiolymphatic invasion, and SLN metastasis size) predicted risk in all our models. RP-ROC and boosted CART stratified patients into four risk levels. MLR informed by CART was most accurate. Using two composite predictors calculated from three variables, MLR informed by CART was more accurate than the Nomogram computed using eight predictors. In our dataset, area under ROC curve (AUC) was 0.83/0.85 for MLR (n = 213/n = 171) and 0.77 for Nomogram (n = 171). When applied to an independent dataset (n = 77), AUC was 0.74 for our model and 0.62 for Nomogram. The composite predictors in our model were the product of angiolymphatic invasion and size of SLN metastasis, and the product of tumor size and square of SLN metastasis size. We present a new model developed from a community-based SLN database that uses only three rather than eight variables to achieve higher accuracy than the Nomogram for predicting NSLN status in two different datasets.
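
    The composite-predictor idea described above can be illustrated with a hedged sketch: a logistic regression built on the two products of the three variables. This is not the authors' fitted model; the data, value ranges, and variable names below are hypothetical.

        # Hedged sketch: logistic regression on two composite predictors, loosely
        # mirroring the MLR-informed-by-CART idea above. Data and names are hypothetical.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n = 200
        tumor_size = rng.uniform(5, 50, n)                 # mm
        angiolymphatic_invasion = rng.integers(0, 2, n)    # 0/1
        sln_met_size = rng.uniform(0.1, 10, n)             # mm
        y = rng.integers(0, 2, n)                          # NSLN status (placeholder labels)

        # Composite predictors as described in the abstract
        x1 = angiolymphatic_invasion * sln_met_size
        x2 = tumor_size * sln_met_size ** 2
        X = np.column_stack([x1, x2])

        model = LogisticRegression().fit(X, y)
        print("Predicted NSLN-positive probability:", model.predict_proba(X[:1])[0, 1])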

  12. Assessment of average of normals (AON) procedure for outlier-free datasets including qualitative values below limit of detection (LoD): an application within tumor markers such as CA 15-3, CA 125, and CA 19-9.

    PubMed

    Usta, Murat; Aral, Hale; Mete Çilingirtürk, Ahmet; Kural, Alev; Topaç, Ibrahim; Semerci, Tuna; Hicri Köseoğlu, Mehmet

    2016-11-01

    Average of normals (AON) is a quality control procedure that is sensitive only to systematic errors that can occur in an analytical process in which patient test results are used. The aim of this study was to develop an alternative model in order to apply the AON quality control procedure to datasets that include qualitative values below limit of detection (LoD). The reported patient test results for tumor markers, such as CA 15-3, CA 125, and CA 19-9, analyzed by two instruments, were retrieved from the information system over a period of 5 months, using the calibrator and control materials with the same lot numbers. The median as a measure of central tendency and the median absolute deviation (MAD) as a measure of dispersion were used for the complementary model of AON quality control procedure. The u bias values, which were determined for the bias component of the measurement uncertainty, were partially linked to the percentages of the daily median values of the test results that fall within the control limits. The results for these tumor markers, in which lower limits of reference intervals are not medically important for clinical diagnosis and management, showed that the AON quality control procedure, using the MAD around the median, can be applied for datasets including qualitative values below LoD.
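
    A minimal sketch of the median/MAD control rule described above, using synthetic daily results; treating qualitative "<LoD" values as the LoD itself is an assumption made here purely for illustration, as are the control-limit multiplier and the analyte values.

        # Hedged sketch: the daily median of patient results is checked against control
        # limits built from the median and MAD of a baseline period. Data are synthetic.
        import numpy as np

        def mad(x):
            m = np.median(x)
            return np.median(np.abs(x - m))

        lod = 1.5  # hypothetical limit of detection (U/mL)
        baseline_daily_medians = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3])

        center = np.median(baseline_daily_medians)
        spread = mad(baseline_daily_medians)
        lower, upper = center - 3 * spread, center + 3 * spread  # illustrative +/-3*MAD limits

        todays_results = np.array([0.9, 14.2, 11.7, 25.3, 12.8, 1.2])
        todays_results = np.maximum(todays_results, lod)  # qualitative "<LoD" values set to LoD
        todays_median = np.median(todays_results)

        in_control = lower <= todays_median <= upper
        print(f"Daily median {todays_median:.1f}, limits [{lower:.1f}, {upper:.1f}], in control: {in_control}")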

  13. In silico modeling to predict drug-induced phospholipidosis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Choi, Sydney S.; Kim, Jae S.; Valerio, Luis G., E-mail: luis.valerio@fda.hhs.gov

    2013-06-01

    Drug-induced phospholipidosis (DIPL) is a preclinical finding during pharmaceutical drug development that has implications for the course of drug development and regulatory safety review. A principal characteristic of drugs inducing DIPL is known to be a cationic amphiphilic structure. This provides evidence for a structure-based explanation and an opportunity to analyze properties and structures of drugs with the histopathologic findings for DIPL. In previous work from the FDA, in silico quantitative structure–activity relationship (QSAR) modeling using machine learning approaches has shown promise with a large dataset of drugs but included unconfirmed data as well. In this study, we report the construction and validation of a battery of complementary in silico QSAR models using the FDA's updated database on phospholipidosis, new algorithms and predictive technologies, and in particular, we address high performance with a high-confidence dataset. The results of our modeling for DIPL include rigorous external validation tests showing 80–81% concordance. Furthermore, the predictive performance characteristics include models with high sensitivity and specificity, in most cases at or above 80%, leading to the desired high negative and positive predictivity. These models are intended to be utilized for regulatory toxicology applied science needs in screening new drugs for DIPL. - Highlights: • New in silico models for predicting drug-induced phospholipidosis (DIPL) are described. • The training set data in the models are derived from the FDA's phospholipidosis database. • We find excellent predictivity values of the models based on external validation. • The models can support drug screening and regulatory decision-making on DIPL.
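
    As a hedged illustration of the external-validation metrics quoted above (concordance, sensitivity, and specificity), the sketch below computes them from a confusion matrix; the labels and predictions are placeholders, not the FDA dataset.

        # Hedged sketch: concordance, sensitivity and specificity from a confusion matrix.
        import numpy as np

        y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])   # placeholder DIPL labels
        y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1, 0, 1])   # placeholder model calls

        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))

        concordance = (tp + tn) / len(y_true)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        print(f"concordance {concordance:.2f}, sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")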

  14. The Model Analyst’s Toolkit: Scientific Model Development, Analysis, and Validation

    DTIC Science & Technology

    2015-02-20

    … being integrated within MAT, including Granger causality. Granger causality tests whether a data series helps when predicting future values of another … relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 424-438. Granger, C. W. (1980). Testing … testing dataset. This effort is described in Section 3.2. … 3.1. Improvements in Granger Causality User Interface: Various metrics of causality are …

  15. Respiration climacteric in tomato fruits elucidated by constraint-based modelling.

    PubMed

    Colombié, Sophie; Beauvoit, Bertrand; Nazaret, Christine; Bénard, Camille; Vercambre, Gilles; Le Gall, Sophie; Biais, Benoit; Cabasson, Cécile; Maucourt, Mickaël; Bernillon, Stéphane; Moing, Annick; Dieuaide-Noubhani, Martine; Mazat, Jean-Pierre; Gibon, Yves

    2017-03-01

    Tomato is a model organism to study the development of fleshy fruit including ripening initiation. Unfortunately, few studies deal with the brief phase of accelerated ripening associated with the respiration climacteric because of practical problems involved in measuring fruit respiration. Because constraint-based modelling allows predicting accurate metabolic fluxes, we investigated the respiration and energy dissipation of fruit pericarp at the breaker stage using a detailed stoichiometric model of the respiratory pathway, including alternative oxidase and uncoupling proteins. Assuming steady-state, a metabolic dataset was transformed into constraints to solve the model on a daily basis throughout tomato fruit development. We detected a peak of CO 2 released and an excess of energy dissipated at 40 d post anthesis (DPA) just before the onset of ripening coinciding with the respiration climacteric. We demonstrated the unbalanced carbon allocation with the sharp slowdown of accumulation (for syntheses and storage) and the beginning of the degradation of starch and cell wall polysaccharides. Experiments with fruits harvested from plants cultivated under stress conditions confirmed the concept. We conclude that modelling with an accurate metabolic dataset is an efficient tool to bypass the difficulty of measuring fruit respiration and to elucidate the underlying mechanisms of ripening. © 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.

  16. Metric Evaluation Pipeline for 3d Modeling of Urban Scenes

    NASA Astrophysics Data System (ADS)

    Bosch, M.; Leichtman, A.; Chilcott, D.; Goldberg, H.; Brown, M.

    2017-05-01

    Publicly available benchmark data and metric evaluation approaches have been instrumental in enabling research to advance state of the art methods for remote sensing applications in urban 3D modeling. Most publicly available benchmark datasets have consisted of high resolution airborne imagery and lidar suitable for 3D modeling on a relatively modest scale. To enable research in larger scale 3D mapping, we have recently released a public benchmark dataset with multi-view commercial satellite imagery and metrics to compare 3D point clouds with lidar ground truth. We now define a more complete metric evaluation pipeline developed as publicly available open source software to assess semantically labeled 3D models of complex urban scenes derived from multi-view commercial satellite imagery. Evaluation metrics in our pipeline include horizontal and vertical accuracy and completeness, volumetric completeness and correctness, perceptual quality, and model simplicity. Sources of ground truth include airborne lidar and overhead imagery, and we demonstrate a semi-automated process for producing accurate ground truth shape files to characterize building footprints. We validate our current metric evaluation pipeline using 3D models produced using open source multi-view stereo methods. Data and software is made publicly available to enable further research and planned benchmarking activities.

  17. Using Graph Indices for the Analysis and Comparison of Chemical Datasets.

    PubMed

    Fourches, Denis; Tropsha, Alexander

    2013-10-01

    In cheminformatics, compounds are represented as points in multidimensional space of chemical descriptors. When all pairs of points found within certain distance threshold in the original high dimensional chemistry space are connected by distance-labeled edges, the resulting data structure can be defined as Dataset Graph (DG). We show that, similarly to the conventional description of organic molecules, many graph indices can be computed for DGs as well. We demonstrate that chemical datasets can be effectively characterized and compared by computing simple graph indices such as the average vertex degree or Randic connectivity index. This approach is used to characterize and quantify the similarity between different datasets or subsets of the same dataset (e.g., training, test, and external validation sets used in QSAR modeling). The freely available ADDAGRA program has been implemented to build and visualize DGs. The approach proposed and discussed in this report could be further explored and utilized for different cheminformatics applications such as dataset diversification by acquiring external compounds, dataset processing prior to QSAR modeling, or (dis)similarity modeling of multiple datasets studied in chemical genomics applications. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
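
    A hedged sketch of the Dataset Graph idea: compounds as points in descriptor space, edges between pairs closer than a distance threshold, and a simple graph index (the average vertex degree). The descriptors, threshold, and library choice (networkx) are illustrative; the ADDAGRA program itself is not used here.

        # Hedged sketch of a Dataset Graph (DG) and one simple graph index.
        import numpy as np
        import networkx as nx
        from scipy.spatial.distance import pdist, squareform

        rng = np.random.default_rng(1)
        descriptors = rng.normal(size=(50, 8))        # 50 compounds, 8 descriptors (synthetic)
        dist = squareform(pdist(descriptors, metric="euclidean"))

        threshold = 3.0                               # hypothetical distance cutoff
        G = nx.Graph()
        G.add_nodes_from(range(len(descriptors)))
        for i in range(len(descriptors)):
            for j in range(i + 1, len(descriptors)):
                if dist[i, j] <= threshold:
                    G.add_edge(i, j, weight=dist[i, j])

        avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
        print("Average vertex degree of the Dataset Graph:", avg_degree)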

  18. Do pre-trained deep learning models improve computer-aided classification of digital mammograms?

    NASA Astrophysics Data System (ADS)

    Aboutalib, Sarah S.; Mohamed, Aly A.; Zuley, Margarita L.; Berg, Wendie A.; Luo, Yahong; Wu, Shandong

    2018-02-01

    Digital mammography screening is an important exam for the early detection of breast cancer and reduction in mortality. False positives leading to high recall rates, however, result in unnecessary negative consequences for patients and health care systems. In order to better aid radiologists, computer-aided tools can be utilized to improve distinction between image classifications and thus potentially reduce false recalls. The emergence of deep learning has shown promising results in the area of biomedical imaging data analysis. This study aimed to investigate deep learning and transfer learning methods that can improve digital mammography classification performance. In particular, we evaluated the effect of pre-training deep learning models with other imaging datasets in order to boost classification performance on a digital mammography dataset. Two types of datasets were used for pre-training: (1) a digitized film mammography dataset, and (2) a very large non-medical imaging dataset. By using either of these datasets to pre-train the network initially, and then fine-tuning with the digital mammography dataset, we found an increase in overall classification performance in comparison to a model without pre-training, with the very large non-medical dataset performing the best in improving the classification accuracy.
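
    A minimal sketch of the pre-train/fine-tune strategy described above, assuming a PyTorch/torchvision setup with ImageNet weights standing in for the "very large non-medical imaging dataset"; the dummy batch stands in for mammography images, and dataset loading, preprocessing, and the full training loop are omitted.

        # Hedged sketch: start from a backbone pre-trained on a large non-medical dataset,
        # replace the classifier head, and fine-tune on mammography data (dummy batch here).
        import torch
        import torch.nn as nn
        from torchvision import models

        model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained backbone
        model.fc = nn.Linear(model.fc.in_features, 2)  # e.g. recalled vs. not recalled

        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        criterion = nn.CrossEntropyLoss()

        # One illustrative fine-tuning step on a dummy batch standing in for mammograms
        images = torch.randn(4, 3, 224, 224)
        labels = torch.randint(0, 2, (4,))
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()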

  19. Trust, but verify: On the importance of chemical structure curation in cheminformatics and QSAR modeling research

    PubMed Central

    Fourches, Denis; Muratov, Eugene; Tropsha, Alexander

    2010-01-01

    Molecular modelers and cheminformaticians typically analyze experimental data generated by other scientists. Consequently, when it comes to data accuracy, cheminformaticians are always at the mercy of data providers who may inadvertently publish (partially) erroneous data. Thus, dataset curation is crucial for any cheminformatics analysis such as similarity searching, clustering, QSAR modeling, virtual screening, etc., especially now that the availability of chemical datasets in the public domain has skyrocketed in recent years. Despite the obvious importance of this preliminary step in the computational analysis of any dataset, there appears to be no commonly accepted guidance or set of procedures for chemical data curation. The main objective of this paper is to emphasize the need for a standardized chemical data curation strategy that should be followed at the onset of any molecular modeling investigation. Herein, we discuss several simple but important steps for cleaning chemical records in a database including the removal of a fraction of the data that cannot be appropriately handled by conventional cheminformatics techniques. Such steps include the removal of inorganic and organometallic compounds, counterions, salts and mixtures; structure validation; ring aromatization; normalization of specific chemotypes; curation of tautomeric forms; and the deletion of duplicates. To emphasize the importance of data curation as a mandatory step in data analysis, we discuss several case studies where chemical curation of the original “raw” database enabled a successful modeling study (specifically, QSAR analysis) or resulted in a significant improvement of the model's prediction accuracy. We also demonstrate that in some cases rigorously developed QSAR models could even be used to correct erroneous biological data associated with chemical compounds. We believe that good practices for curation of chemical records outlined in this paper will be of value to all scientists working in the fields of molecular modeling, cheminformatics, and QSAR studies. PMID:20572635
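
    A hedged sketch of two of the curation steps listed above (salt/counterion stripping and duplicate removal via canonical SMILES) using RDKit; the input list is illustrative, and the full workflow in the paper covers many more steps (structure validation, ring aromatization, chemotype normalization, tautomers, etc.).

        # Hedged sketch: drop unparsable records, strip common salts, deduplicate by canonical SMILES.
        from rdkit import Chem
        from rdkit.Chem.SaltRemover import SaltRemover

        raw_smiles = ["CCO", "CCO.Cl", "c1ccccc1C(=O)O", "OC(=O)c1ccccc1", "not_a_smiles"]

        remover = SaltRemover()
        seen, curated = set(), []
        for smi in raw_smiles:
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                continue                      # drop records that fail to parse
            mol = remover.StripMol(mol)       # remove common salts/counterions (default definitions)
            canonical = Chem.MolToSmiles(mol)
            if canonical not in seen:         # drop duplicate structures
                seen.add(canonical)
                curated.append(canonical)

        print(curated)   # e.g. ['CCO', 'O=C(O)c1ccccc1']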

  20. Behavior Correlates of Post-Stroke Disability Using Data Mining and Infographics.

    PubMed

    Yoon, Sunmoo; Gutierrez, Jose

    Disability is a potential risk for stroke survivors. This study aims to identify disability risk factors associated with stroke and their relative importance and relationships from a national behavioral risk factor dataset. Data of post-stroke individuals in the U.S. (n=19,603) including 397 variables were extracted from a publicly available national dataset and analyzed. Data mining algorithms including C4.5 and linear regression with M5s methods were applied to build association models for post-stroke disability using Weka software. The relative importance and relationship of 70 variables associated with disability were presented in infographics for clinicians to understand easily. Fifty-five percent of post-stroke patients experience disability. Exercise, employment and satisfaction of life were relatively important factors associated with disability among stroke patients. Modifiable behavior factors strongly associated with disability include exercise (OR: 0.46, P<0.01) and good rest (OR 0.37, P<0.01). Data mining is promising for discovering factors associated with post-stroke disability from a large population dataset. The findings can be potentially valuable for establishing the priorities for clinicians and researchers and for stroke patient education. The methods may generalize to other health conditions.

  1. Behavior Correlates of Post-Stroke Disability Using Data Mining and Infographics

    PubMed Central

    Yoon, Sunmoo; Gutierrez, Jose

    2015-01-01

    Purpose Disability is a potential risk for stroke survivors. This study aims to identify disability risk factors associated with stroke and their relative importance and relationships from a national behavioral risk factor dataset. Methods Data of post-stroke individuals in the U.S. (n=19,603) including 397 variables were extracted from a publicly available national dataset and analyzed. Data mining algorithms including C4.5 and linear regression with M5s methods were applied to build association models for post-stroke disability using Weka software. The relative importance and relationship of 70 variables associated with disability were presented in infographics for clinicians to understand easily. Results Fifty-five percent of post-stroke patients experience disability. Exercise, employment and satisfaction of life were relatively important factors associated with disability among stroke patients. Modifiable behavior factors strongly associated with disability include exercise (OR: 0.46, P<0.01) and good rest (OR 0.37, P<0.01). Conclusions Data mining is promising for discovering factors associated with post-stroke disability from a large population dataset. The findings can be potentially valuable for establishing the priorities for clinicians and researchers and for stroke patient education. The methods may generalize to other health conditions. PMID:26835413

  2. A fragmentation model of earthquake-like behavior in internet access activity

    NASA Astrophysics Data System (ADS)

    Paguirigan, Antonino A.; Angco, Marc Jordan G.; Bantang, Johnrob Y.

    We present a fragmentation model that generates almost any inverse power-law size distribution, including dual-scaled versions, consistent with the underlying dynamics of systems with earthquake-like behavior. We apply the model to explain the dual-scaled power-law statistics observed in an Internet access dataset that covers more than 32 million requests. The non-Poissonian statistics of the requested data sizes m and the amount of time τ needed for complete processing are consistent with the Gutenberg-Richter law. Inter-event times δt between subsequent requests are also shown to exhibit power-law distributions consistent with the generalized Omori law. Thus, the dataset is similar to the earthquake data except that two power-law regimes are observed. Using the proposed model, we are able to identify the underlying dynamics responsible for generating the observed dual power-law distributions. The model is universal enough to be applicable to any physical or human dynamics limited by finite resources such as space, energy, time or opportunity.
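
    One simple, hedged way to characterise such inverse power-law size distributions is a maximum-likelihood fit of the exponent above a lower cutoff (the standard continuous power-law estimator); the synthetic "request sizes" below stand in for the Internet access data and are not drawn from the study's dataset.

        # Hedged sketch: maximum-likelihood estimate of a power-law exponent above x_min.
        import numpy as np

        rng = np.random.default_rng(2)
        x_min = 1.0
        alpha_true = 2.5
        # Draw synthetic power-law distributed "request sizes" via inverse-transform sampling
        u = rng.random(100_000)
        sizes = x_min * (1 - u) ** (-1.0 / (alpha_true - 1.0))

        # MLE for a continuous power law: alpha_hat = 1 + n / sum(ln(x_i / x_min))
        alpha_hat = 1.0 + len(sizes) / np.sum(np.log(sizes / x_min))
        print(f"Estimated exponent: {alpha_hat:.3f} (true value {alpha_true})")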

  3. Measurement Properties of the NIH-Minimal Dataset Dutch Language Version in Patients With Chronic Low Back Pain.

    PubMed

    Boer, Annemarie; Dutmer, Alisa L; Schiphorst Preuper, Henrica R; van der Woude, Lucas H V; Stewart, Roy E; Deyo, Richard A; Reneman, Michiel F; Soer, Remko

    2017-10-01

    Validation study with cross-sectional and longitudinal measurements. To translate the US National Institutes of Health (NIH)-minimal dataset for clinical research on chronic low back pain into the Dutch language and to test its validity and reliability among people with chronic low back pain. The NIH developed a minimal dataset to encourage more complete and consistent reporting of clinical research and to be able to compare studies across countries in patients with low back pain. In the Netherlands, the NIH-minimal dataset has not been translated before and its measurement properties are unknown. Cross-cultural validity was tested by a formal forward-backward translation. Structural validity was tested with exploratory factor analyses (comparative fit index, Tucker-Lewis index, and root mean square error of approximation). Hypothesis testing was performed to compare subscales of the NIH dataset with the Pain Disability Index and the EuroQol-5D (Pearson correlation coefficients). Internal consistency was tested with Cronbach α, and test-retest reliability at 2 weeks was calculated in a subsample of patients with Intraclass Correlation Coefficients and weighted Kappa (κω). In total, 452 patients were included, of whom 52 were included in the test-retest study. Factor analysis for structural validity pointed in the direction of a seven-factor model (Cronbach α = 0.78). Factors and total score of the NIH-minimal dataset showed fair to good correlations with the Pain Disability Index (r = 0.43-0.70) and EuroQol-5D (r = -0.41 to -0.64). Test-retest reliability per item showed substantial agreement (κω = 0.65). Test-retest reliability per factor was moderate to good (Intraclass Correlation Coefficient = 0.71). The measurement properties of the Dutch language version of the NIH-minimal dataset were satisfactory.

  4. Identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network.

    PubMed

    Chen, Xi; Wang, Qiao-Ling; Zhang, Meng-Hui

    2017-10-01

    The current study aimed to identify key genes in glaucoma based on a benchmarked dataset and gene regulatory network (GRN). Local and global noise was added to the gene expression dataset to produce a benchmarked dataset. Differentially-expressed genes (DEGs) between patients with glaucoma and normal controls were identified utilizing the Linear Models for Microarray Data (Limma) package based on the benchmarked dataset. A total of 5 GRN inference methods, including Zscore, GeneNet, context likelihood of relatedness (CLR) algorithm, Partial Correlation coefficient with Information Theory (PCIT) and GEne Network Inference with Ensemble of Trees (Genie3) were evaluated using receiver operating characteristic (ROC) and precision and recall (PR) curves. The inference method with the best performance was selected to construct the GRN. Subsequently, topological centrality analysis (degree, closeness and betweenness) was conducted to identify key genes in the GRN of glaucoma. Finally, the key genes were validated by performing reverse transcription-quantitative polymerase chain reaction (RT-qPCR). A total of 176 DEGs were detected from the benchmarked dataset. The ROC and PR curves of the 5 methods were analyzed and it was determined that Genie3 had a clear advantage over the other methods; thus, Genie3 was used to construct the GRN. Following topological centrality analysis, 14 key genes for glaucoma were identified, including IL6, EPHA2 and GSTT1; 5 of these 14 key genes were validated by RT-qPCR. Therefore, the current study identified 14 key genes in glaucoma, which may be potential biomarkers to use in the diagnosis of glaucoma and aid in identifying the molecular mechanism of this disease.
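
    A hedged sketch of the topological-centrality step: given an inferred gene regulatory network (toy edges below, not the study's network), rank genes by degree, closeness, and betweenness centrality; combining the rankings by average rank is one simple choice, not necessarily the paper's.

        # Hedged sketch: rank nodes of a toy GRN by three centrality measures.
        import networkx as nx

        edges = [("IL6", "EPHA2"), ("IL6", "GSTT1"), ("EPHA2", "GSTT1"),
                 ("IL6", "GENE4"), ("GENE4", "GENE5")]   # illustrative network only
        G = nx.Graph(edges)

        degree = nx.degree_centrality(G)
        closeness = nx.closeness_centrality(G)
        betweenness = nx.betweenness_centrality(G)

        genes = list(G.nodes)
        ranks = [{g: r for r, g in enumerate(sorted(genes, key=c.get, reverse=True))}
                 for c in (degree, closeness, betweenness)]
        avg_rank = {g: sum(r[g] for r in ranks) / 3 for g in genes}
        print(sorted(avg_rank, key=avg_rank.get)[:3])   # top candidate key genes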

  5. Cloud Computing: A model Construct of Real-Time Monitoring for Big Dataset Analytics Using Apache Spark

    NASA Astrophysics Data System (ADS)

    Alkasem, Ameen; Liu, Hongwei; Zuo, Decheng; Algarash, Basheer

    2018-01-01

    The volume of data being collected, analyzed, and stored has exploded in recent years, in particular in relation to activity on cloud computing platforms. While large-scale data processing, analysis, and storage platforms such as cloud computing continue to grow, the major challenge today is how to monitor and control these massive amounts of data and perform analysis in real time at scale. Traditional methods and model systems are unable to cope with these quantities of data in real time. Here we present a new methodology for constructing a model that optimizes the performance of real-time monitoring of big datasets, combining machine learning algorithms with Apache Spark Streaming to accomplish fine-grained fault diagnosis and repair of big datasets. As a case study, we use the failure of Virtual Machines (VMs) to start up. The proposed methodology ensures that the most sensible action is carried out during the fine-grained monitoring procedure and generates the highest efficacy and cost savings in fault repair through three construction control steps: (I) data collection; (II) analysis engine; and (III) decision engine. We found that running this novel methodology can save a considerable amount of time compared to the Hadoop model, without sacrificing classification accuracy or performance. The accuracy of the proposed method (92.13%) is an improvement on traditional approaches.
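
    A minimal, hedged sketch of the real-time monitoring idea using Spark Structured Streaming (a newer API than the DStream-based Spark Streaming named above): read a socket stream of "vm_id,cpu_load" lines and maintain running per-VM averages. The host, port, and schema are hypothetical, and the fault-diagnosis model itself is omitted.

        # Hedged sketch: streaming per-VM aggregation with PySpark Structured Streaming.
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import split, col, avg

        spark = SparkSession.builder.appName("vm-monitoring-sketch").getOrCreate()

        lines = (spark.readStream.format("socket")
                 .option("host", "localhost").option("port", 9999).load())

        metrics = lines.select(
            split(col("value"), ",").getItem(0).alias("vm_id"),
            split(col("value"), ",").getItem(1).cast("double").alias("cpu_load"),
        )

        query = (metrics.groupBy("vm_id").agg(avg("cpu_load").alias("avg_cpu"))
                 .writeStream.outputMode("complete").format("console").start())
        query.awaitTermination()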

  6. Conducting high-value secondary dataset analysis: an introductory guide and resources.

    PubMed

    Smith, Alexander K; Ayanian, John Z; Covinsky, Kenneth E; Landon, Bruce E; McCarthy, Ellen P; Wee, Christina C; Steinman, Michael A

    2011-08-01

    Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium ( www.sgim.org/go/datasets ). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.

  7. Otolith reading and multi-model inference for improved estimation of age and growth in the gilthead seabream Sparus aurata (L.)

    NASA Astrophysics Data System (ADS)

    Mercier, Lény; Panfili, Jacques; Paillon, Christelle; N'diaye, Awa; Mouillot, David; Darnaude, Audrey M.

    2011-05-01

    Accurate knowledge of fish age and growth is crucial for species conservation and management of exploited marine stocks. In exploited species, age estimation based on otolith reading is routinely used for building growth curves that are used to implement fishery management models. However, the universal fit of the von Bertalanffy growth function (VBGF) on data from commercial landings can lead to uncertainty in growth parameter inference, preventing accurate comparison of growth-based history traits between fish populations. In the present paper, we used a comprehensive annual sample of wild gilthead seabream (Sparus aurata L.) in the Gulf of Lions (France, NW Mediterranean) to test a methodology improving growth modelling for exploited fish populations. After validating the timing for otolith annual increment formation for all life stages, a comprehensive set of growth models (including VBGF) was fitted to the obtained age-length data, used as a whole or sub-divided between group 0 individuals and those coming from commercial landings (ages 1-6). Comparisons in growth model accuracy based on the Akaike Information Criterion allowed assessment of the best model for each dataset and, when no model correctly fitted the data, a multi-model inference (MMI) based on model averaging was carried out. The results provided evidence that growth parameters inferred with the VBGF must be used with great caution. Hence, the VBGF turned out to be among the least accurate models for growth prediction irrespective of the dataset, and its fit to the whole population, the juvenile, or the adult datasets provided different growth parameters. The best models for growth prediction were the Tanaka model, for group 0 juveniles, and the MMI, for the older fish, confirming that growth differs substantially between juveniles and adults. All asymptotic models failed to correctly describe the growth of adult S. aurata, probably because of the poor representation of old individuals in the dataset. Multi-model inference associated with separate analysis of juveniles and adult fish is therefore advised to obtain objective estimations of growth parameters when sampling cannot be corrected towards older fish.
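
    A hedged sketch of AIC-based multi-model inference for length-at-age data: fit two candidate growth models (von Bertalanffy and Gompertz), compute AIC and Akaike weights, and form a model-averaged prediction. The data are synthetic, and the candidate set is far smaller than the comprehensive set used in the study.

        # Hedged sketch: AIC comparison and model averaging of two growth models.
        import numpy as np
        from scipy.optimize import curve_fit

        def vbgf(t, Linf, K, t0):
            return Linf * (1 - np.exp(-K * (t - t0)))

        def gompertz(t, Linf, K, t0):
            return Linf * np.exp(-np.exp(-K * (t - t0)))

        rng = np.random.default_rng(3)
        age = np.repeat(np.arange(0, 7), 10).astype(float)
        length = vbgf(age, 35.0, 0.4, -0.5) + rng.normal(0, 1.5, age.size)  # synthetic lengths

        def fit_aic(model, p0):
            params, _ = curve_fit(model, age, length, p0=p0, maxfev=10000)
            resid = length - model(age, *params)
            n, k = age.size, len(params) + 1               # +1 for the error variance
            aic = n * np.log(np.sum(resid**2) / n) + 2 * k  # least-squares AIC
            return params, aic

        (p_vb, aic_vb), (p_go, aic_go) = fit_aic(vbgf, [35, 0.3, 0]), fit_aic(gompertz, [35, 0.3, 1])

        d = np.array([aic_vb, aic_go]) - min(aic_vb, aic_go)
        w = np.exp(-0.5 * d) / np.exp(-0.5 * d).sum()       # Akaike weights
        pred_avg = w[0] * vbgf(5.0, *p_vb) + w[1] * gompertz(5.0, *p_go)
        print("Akaike weights:", w, "model-averaged length at age 5:", round(pred_avg, 2))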

  8. Boosted regression tree, table, and figure data

    EPA Pesticide Factsheets

    Spreadsheets are included here to support the manuscript Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. This dataset is associated with the following publication:Golden , H., C. Lane , A. Prues, and E. D'Amico. Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. JAWRA. American Water Resources Association, Middleburg, VA, USA, 52(5): 1251-1274, (2016).
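
    A hedged sketch of a boosted regression tree relating watershed predictors to a nutrient concentration, using scikit-learn's gradient boosting as a stand-in for the BRT implementation used in the publication; the predictors and data below are synthetic, not the study's variables.

        # Hedged sketch: gradient-boosted regression trees with relative predictor influence.
        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        rng = np.random.default_rng(4)
        X = rng.normal(size=(300, 5))                                  # e.g. land cover, soils, climate metrics
        y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(0, 0.5, 300)    # synthetic nutrient concentration

        brt = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01, max_depth=3)
        brt.fit(X, y)
        print("Relative influence of predictors:", np.round(brt.feature_importances_, 3))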

  9. Volcano Modelling and Simulation gateway (VMSg): A new web-based framework for collaborative research in physical modelling and simulation of volcanic phenomena

    NASA Astrophysics Data System (ADS)

    Esposti Ongaro, T.; Barsotti, S.; de'Michieli Vitturi, M.; Favalli, M.; Longo, A.; Nannipieri, L.; Neri, A.; Papale, P.; Saccorotti, G.

    2009-12-01

    Physical and numerical modelling is becoming of increasing importance in volcanology and volcanic hazard assessment. However, new interdisciplinary problems arise when dealing with complex mathematical formulations, numerical algorithms and their implementations on modern computer architectures. Therefore new frameworks are needed for sharing knowledge, software codes, and datasets among scientists. Here we present the Volcano Modelling and Simulation gateway (VMSg, accessible at http://vmsg.pi.ingv.it), a new electronic infrastructure for promoting knowledge growth and transfer in the field of volcanological modelling and numerical simulation. The new web portal, developed in the framework of former and ongoing national and European projects, is based on a dynamic Content Manager System (CMS) and was developed to host and present numerical models of the main volcanic processes and relationships including magma properties, magma chamber dynamics, conduit flow, plume dynamics, pyroclastic flows, lava flows, etc. Model applications, numerical code documentation, simulation datasets as well as model validation and calibration test-cases are also part of the gateway material.

  10. Validation of Short-Term Noise Assessment Procedures: FY16 Summary of Procedures, Progress, and Preliminary Results

    DTIC Science & Technology

    … Validation project. This report describes the procedure used to generate the noise models' output dataset, and then it compares that dataset to the … benchmark, the Engineer Research and Development Center's Long-Range Sound Propagation dataset. It was found that the models consistently underpredict the …

  11. Improving Satellite Observation Utilization for Model Initialization with Machine Learning: An Introduction and Tackling the "Labeled Dataset" Challenge for Cyclones Around the World

    NASA Astrophysics Data System (ADS)

    Bonfanti, C. E.; Stewart, J.; Lee, Y. J.; Govett, M.; Trailovic, L.; Etherton, B.

    2017-12-01

    One of the National Oceanic and Atmospheric Administration (NOAA) goals is to provide timely and reliable weather forecasts to support important decisions when and where people need them, for safety, emergencies, and planning of day-to-day activities. Satellite data are essential for areas lacking in-situ observations for use as initial conditions in Numerical Weather Prediction (NWP) models, such as spans of the ocean or remote areas of land. Currently, only about 7% of the received satellite data is selected for use and, of that, an even smaller percentage is ever assimilated into NWP models. With machine learning, the computational and time costs of satellite data selection can be greatly reduced. We study various machine learning approaches to process orders of magnitude more satellite data in significantly less time, allowing for a greater quantity and more intelligent selection of data to be used for assimilation purposes. Given the satellite launches planned for the upcoming years, machine learning can be applied to better select Regions of Interest (ROI) from the orders of magnitude more satellite data that will be received. This paper discusses the background of machine learning methods as applied to weather forecasting and the challenges of creating a "labeled dataset" for training and testing purposes. In the training stage of supervised machine learning, labeled data are important to identify an ROI as either true or false so that the model knows what signatures in satellite data to identify. The authors selected cyclones, including tropical cyclones and mid-latitude lows, as ROI for their machine learning purposes and created a labeled true/false ROI dataset from Global Forecast System (GFS) reanalysis data. A dataset like this did not yet exist and, given the need for a large number of samples, it was decided that this was best done with automation. This was accomplished by developing a program, similar to the National Centers for Environmental Prediction (NCEP) tropical cyclone tracker by Marchok, that identifies cyclones based on their physical characteristics. We will discuss the methods and challenges of creating this dataset and the dataset's use for our current supervised machine learning model, as well as its use for future work on events such as convection initiation.

  12. Characterization of Transport Errors in Chemical Forecasts from a Global Tropospheric Chemical Transport Model

    NASA Technical Reports Server (NTRS)

    Bey, I.; Jacob, D. J.; Liu, H.; Yantosca, R. M.; Sachse, G. W.

    2004-01-01

    We propose a new methodology to characterize errors in the representation of transport processes in chemical transport models. We constrain the evaluation of a global three-dimensional chemical transport model (GEOS-CHEM) with an extended dataset of carbon monoxide (CO) concentrations obtained during the Transport and Chemical Evolution over the Pacific (TRACE-P) aircraft campaign. The TRACE-P mission took place over the western Pacific, a region frequently impacted by continental outflow associated with different synoptic-scale weather systems (such as cold fronts) and deep convection, and thus provides a valuable dataset for our analysis. Model simulations using both forecast and assimilated meteorology are examined. Background CO concentrations are computed as a function of latitude and altitude and subsequently subtracted from both the observed and the model datasets to focus on the ability of the model to simulate variability on a synoptic scale. Different sampling strategies (i.e., spatial displacement and smoothing) are applied along the flight tracks to search for systematic model biases. Statistical quantities such as the correlation coefficient and the centered root-mean-square difference are computed between the simulated and the observed fields and are further inter-compared using Taylor diagrams. We find no systematic bias in the model for the TRACE-P region when we consider the entire dataset (i.e., from the surface to 12 km). This result indicates that the transport error in our model is globally unbiased, which has important implications for using the model to conduct inverse modeling studies. Using the First-Look assimilated meteorology provides only a small improvement in the correlation compared with the forecast meteorology. These general statements can be refined when the entire dataset is divided into different vertical domains, i.e., the lower troposphere (less than 2 km), the middle troposphere (2-6 km), and the upper troposphere (greater than 6 km). The best agreement between the observations and the model is found in the lower and middle troposphere. Downward displacements in the lower troposphere provide a better fit with the observed values, which could indicate a problem in the representation of boundary layer height in the model. Significant improvement is also found for downward and southward displacements in the upper troposphere. There are several potential sources of errors in our simulation of the continental outflow in the upper troposphere which could lead to such biases, including the location and/or the strength of deep convective cells as well as that of wildfires in Southeast Asia.
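
    A hedged sketch of the statistics behind the Taylor-diagram comparison described above: after removing a background value, compute the correlation coefficient and the centered root-mean-square difference between observed and simulated CO. The data and the single background value below are synthetic placeholders; the study computes its background as a function of latitude and altitude.

        # Hedged sketch: correlation and centered RMS difference for a Taylor diagram.
        import numpy as np

        rng = np.random.default_rng(5)
        background = 80.0                          # ppbv, placeholder background value
        obs = background + rng.normal(20, 10, 500)
        mod = background + 0.8 * (obs - background) + rng.normal(0, 8, 500)

        obs_anom = obs - background
        mod_anom = mod - background

        r = np.corrcoef(obs_anom, mod_anom)[0, 1]
        centered_rmsd = np.sqrt(np.mean(((mod_anom - mod_anom.mean())
                                         - (obs_anom - obs_anom.mean()))**2))
        print(f"correlation r = {r:.2f}, centered RMS difference = {centered_rmsd:.1f} ppbv")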

  13. Scaled Tank Test Design and Results for the Aquantis 2.5 MW Ocean Current Generation Device

    DOE Data Explorer

    Swales, Henry; Kils, Ole; Coakley, David B.; Sites, Eric; Mayer, Tyler

    2015-06-03

    Aquantis 2.5 MW Ocean Current Generation Device, Tow Tank Dynamic Rig Structural Analysis Results. This is the detailed documentation for scaled device testing in a tow tank, including models, drawings, presentations, cost of energy analysis, and structural analysis. This dataset also includes specific information on drivetrain, roller bearing, blade fabrication, mooring, and rotor characteristics.

  14. Dose calculation for photon-emitting brachytherapy sources with average energy higher than 50 keV: report of the AAPM and ESTRO.

    PubMed

    Perez-Calatayud, Jose; Ballester, Facundo; Das, Rupak K; Dewerd, Larry A; Ibbott, Geoffrey S; Meigooni, Ali S; Ouhib, Zoubir; Rivard, Mark J; Sloboda, Ron S; Williamson, Jeffrey F

    2012-05-01

    Recommendations of the American Association of Physicists in Medicine (AAPM) and the European Society for Radiotherapy and Oncology (ESTRO) on dose calculations for high-energy (average energy higher than 50 keV) photon-emitting brachytherapy sources are presented, including the physical characteristics of specific (192)Ir, (137)Cs, and (60)Co source models. This report has been prepared by the High Energy Brachytherapy Source Dosimetry (HEBD) Working Group. This report includes considerations in the application of the TG-43U1 formalism to high-energy photon-emitting sources with particular attention to phantom size effects, interpolation accuracy dependence on dose calculation grid size, and dosimetry parameter dependence on source active length. Consensus datasets for commercially available high-energy photon sources are provided, along with recommended methods for evaluating these datasets. Recommendations on dosimetry characterization methods, mainly using experimental procedures and Monte Carlo, are established and discussed. Also included are methodological recommendations on detector choice, detector energy response characterization and phantom materials, and measurement specification methodology. Uncertainty analyses are discussed and recommendations for high-energy sources without consensus datasets are given. Recommended consensus datasets for high-energy sources have been derived for sources that were commercially available as of January 2010. Data are presented according to the AAPM TG-43U1 formalism, with modified interpolation and extrapolation techniques of the AAPM TG-43U1S1 report for the 2D anisotropy function and radial dose function.
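
    For reference, the 2D dose-rate equation of the AAPM TG-43 formalism that these consensus datasets feed into has the standard published form (symbols as defined in TG-43U1):

        \dot{D}(r,\theta) = S_K \, \Lambda \, \frac{G_L(r,\theta)}{G_L(r_0,\theta_0)} \, g_L(r) \, F(r,\theta), \qquad r_0 = 1~\text{cm}, \; \theta_0 = 90^\circ

    where S_K is the air-kerma strength, Λ the dose-rate constant, G_L the line-source geometry function, g_L(r) the radial dose function, and F(r,θ) the 2D anisotropy function.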

  15. A global gridded dataset of daily precipitation going back to 1950, ideal for analysing precipitation extremes

    NASA Astrophysics Data System (ADS)

    Contractor, S.; Donat, M.; Alexander, L. V.

    2017-12-01

    Reliable observations of precipitation are necessary to determine past changes in precipitation and validate models, allowing for reliable future projections. Existing gauge-based gridded datasets of daily precipitation and satellite-based observations contain artefacts and have a short length of record, making them unsuitable for analysing precipitation extremes. The largest limiting factor for the gauge-based datasets is the lack of a dense and reliable station network. Currently, there are two major data archives of global in situ daily rainfall data: the first is the Global Historical Climatology Network (GHCN-Daily), hosted by the National Oceanic and Atmospheric Administration (NOAA), and the other is maintained by the Global Precipitation Climatology Centre (GPCC), part of the Deutsche Wetterdienst (DWD). We combine the two data archives and use automated quality control techniques to create a reliable long-term network of raw station data, which we then interpolate using block kriging to create a global gridded dataset of daily precipitation going back to 1950. We compare our interpolated dataset with existing global gridded data of daily precipitation: NOAA Climate Prediction Center (CPC) Global V1.0 and GPCC Full Data Daily Version 1.0, as well as various regional datasets. We find that our raw station density is much higher than that of other datasets. To avoid artefacts due to station network variability, we provide multiple versions of our dataset based on various completeness criteria, as well as provide the standard deviation, kriging error and number of stations for each grid cell and timestep to encourage responsible use of our dataset. Despite our efforts to increase the raw data density, the in situ station network remains sparse in India after the 1960s and in Africa throughout the timespan of the dataset. Our dataset would allow for more reliable global analyses of rainfall including its extremes and pave the way for better global precipitation observations with lower and more transparent uncertainties.
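
    As a hedged sketch of the gridding step, the snippet below runs ordinary kriging with pykrige as a simpler stand-in for the block kriging used for this dataset; the station locations, values, grid, and variogram model are synthetic and illustrative only.

        # Hedged sketch: interpolate daily station precipitation onto a regular grid.
        import numpy as np
        from pykrige.ok import OrdinaryKriging

        rng = np.random.default_rng(6)
        lon = rng.uniform(140, 155, 60)
        lat = rng.uniform(-40, -25, 60)
        precip = np.maximum(0, 5 + 2 * np.sin(lon / 3) + rng.normal(0, 1, 60))  # mm/day, synthetic

        grid_lon = np.arange(140, 155.25, 0.25)
        grid_lat = np.arange(-40, -24.75, 0.25)

        ok = OrdinaryKriging(lon, lat, precip, variogram_model="spherical")
        field, variance = ok.execute("grid", grid_lon, grid_lat)   # gridded estimate + kriging variance
        print(field.shape, variance.shape)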

  16. Compensatory neurofuzzy model for discrete data classification in biomedical

    NASA Astrophysics Data System (ADS)

    Ceylan, Rahime

    2015-03-01

    Biomedical data fall into two main categories: signals and discrete data. Studies in this area therefore concern either biomedical signal classification or biomedical discrete data classification. Artificial intelligence models exist for the classification of ECG, EMG or EEG signals. In the same way, many models in the literature classify discrete data taken as sample values, such as the results of blood analysis or biopsy in the medical process. Not every algorithm achieves a high accuracy rate in classifying both signals and discrete data. In this study, a compensatory neurofuzzy network model is presented for the classification of discrete data in the biomedical pattern recognition area. The compensatory neurofuzzy network is a hybrid, binary classifier. In this system, the parameters of the fuzzy systems are updated by the backpropagation algorithm. The realized classifier model is applied to two benchmark datasets (the Wisconsin Breast Cancer dataset and the Pima Indian Diabetes dataset). Experimental studies show that the compensatory neurofuzzy network model achieved a 96.11% accuracy rate in classification of the breast cancer dataset, and a 69.08% accuracy rate was obtained on the diabetes dataset with only 10 iterations.

  17. Raster Files for Utah Play Fairway Analysis

    DOE Data Explorer

    Wannamaker, Phil

    2017-06-16

    This submission contains raster files associated with several datasets that include earthquake density, Na/K geothermometers, fault density, heat flow, and gravity. Integrated together using spatial modeler tools in ArcGIS, these files can be used for play fairway analysis in regard to geothermal exploration.

  18. Comparison of multi-proxy data with past1000 model output over the Terminal Classic Period (800-1000 A.D.) on the Yucatan Peninsula.

    NASA Astrophysics Data System (ADS)

    Van Pelt, S.; Kohfeld, K. E.; Allen, D. M.

    2015-12-01

    The decline of the Mayan Civilization is thought to have been caused by a series of droughts that affected the Yucatan Peninsula during the Terminal Classic Period (T.C.P.) 800-1000 AD. The goals of this study are two-fold: (a) to compare paleo-model simulations of the past 1000 years with a compilation of multiple proxies of changes in moisture conditions for the Yucatan Peninsula during the T.C.P., and (b) to use this comparison to inform the modeling of groundwater recharge in this region, with a focus on generating the daily climate data series needed as input to a groundwater recharge model. To achieve the first objective, we compiled a dataset of 5 proxies from seven locations across the Yucatan Peninsula, to be compared with temperature and precipitation output from the Community Climate System Model Version 4 (CCSM4), which is part of the Coupled Model Intercomparison Project Phase 5 (CMIP5) past1000 experiment. The proxy dataset includes oxygen isotopes from speleothems and gastropod/ostracod shells (11 records); and sediment density, mineralogy, and magnetic susceptibility records from lake sediment cores (3 records). The proxy dataset is supplemented by a compilation of reconstructed temperatures using pollen and tree ring records for North America (archived in the PAGES2k global network data). Our preliminary analysis suggests that many of these datasets show evidence of a drier and warmer climate on the Yucatan Peninsula around the T.C.P. when compared to modern conditions, although the amplitude and timing of individual warming and drying events vary between sites. This comparison with modeled output will ultimately be used to inform backward shift factors that will be input to a stochastic weather generator. These shift factors will be based on monthly changes in temperature and precipitation and applied to a modern daily climate time series for the Yucatan Peninsula to produce a daily climate time series for the T.C.P.

  19. A Review on Human Activity Recognition Using Vision-Based Method.

    PubMed

    Zhang, Shugang; Wei, Zhiqiang; Nie, Jie; Huang, Lei; Wang, Shuang; Li, Zhen

    2017-01-01

    Human activity recognition (HAR) aims to recognize activities from a series of observations on the actions of subjects and the environmental conditions. The vision-based HAR research is the basis of many applications including video surveillance, health care, and human-computer interaction (HCI). This review highlights the advances of state-of-the-art activity recognition approaches, especially for the activity representation and classification methods. For the representation methods, we sort out a chronological research trajectory from global representations to local representations, and recent depth-based representations. For the classification methods, we conform to the categorization of template-based methods, discriminative models, and generative models and review several prevalent methods. Next, representative and available datasets are introduced. Aiming to provide an overview of those methods and a convenient way of comparing them, we classify existing literatures with a detailed taxonomy including representation and classification methods, as well as the datasets they used. Finally, we investigate the directions for future research.

  20. AstroBlend: An astrophysical visualization package for Blender

    NASA Astrophysics Data System (ADS)

    Naiman, J. P.

    2016-04-01

    The rapid growth in scale and complexity of both computational and observational astrophysics over the past decade necessitates efficient and intuitive methods for examining and visualizing large datasets. Here, I present AstroBlend, an open-source Python library for use within the three dimensional modeling software, Blender. While Blender has been a popular open-source software among animators and visual effects artists, in recent years it has also become a tool for visualizing astrophysical datasets. AstroBlend combines the three dimensional capabilities of Blender with the analysis tools of the widely used astrophysical toolset, yt, to afford both computational and observational astrophysicists the ability to simultaneously analyze their data and create informative and appealing visualizations. The introduction of this package includes a description of features, work flow, and various example visualizations. A website - www.astroblend.com - has been developed which includes tutorials, and a gallery of example images and movies, along with links to downloadable data, three dimensional artistic models, and various other resources.

  1. A Review on Human Activity Recognition Using Vision-Based Method

    PubMed Central

    Nie, Jie

    2017-01-01

    Human activity recognition (HAR) aims to recognize activities from a series of observations on the actions of subjects and the environmental conditions. The vision-based HAR research is the basis of many applications including video surveillance, health care, and human-computer interaction (HCI). This review highlights the advances of state-of-the-art activity recognition approaches, especially for the activity representation and classification methods. For the representation methods, we sort out a chronological research trajectory from global representations to local representations, and recent depth-based representations. For the classification methods, we conform to the categorization of template-based methods, discriminative models, and generative models and review several prevalent methods. Next, representative and available datasets are introduced. Aiming to provide an overview of those methods and a convenient way of comparing them, we classify existing literatures with a detailed taxonomy including representation and classification methods, as well as the datasets they used. Finally, we investigate the directions for future research. PMID:29065585

  2. Measuring spatially varying, multispectral, ultraviolet bidirectional reflectance distribution function with an imaging spectrometer

    NASA Astrophysics Data System (ADS)

    Li, Hongsong; Lyu, Hang; Liao, Ningfang; Wu, Wenmin

    2016-12-01

    The bidirectional reflectance distribution function (BRDF) data in the ultraviolet (UV) band are valuable for many applications including cultural heritage, material analysis, surface characterization, and trace detection. We present a BRDF measurement instrument working in the near- and middle-UV spectral range. The instrument includes a collimated UV light source, a rotation stage, a UV imaging spectrometer, and a control computer. The data captured by the proposed instrument describe spatial, spectral, and angular variations of the light scattering from a sample surface. Such a multidimensional dataset of an example sample is captured by the proposed instrument and analyzed by a k-means clustering algorithm to separate surface regions with the same material but different surface roughnesses. The clustering results show that the angular dimension of the dataset can be exploited for surface roughness characterization. The two clustered BRDFs are fitted to a theoretical BRDF model. The fitting results show good agreement between the measurement data and the theoretical model.
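
    A hedged sketch of the clustering step: k-means applied to per-pixel feature vectors built from the angular/spectral BRDF samples to separate regions of different roughness; the feature values below are synthetic stand-ins for the measured dataset, and the number of angle/wavelength bins is arbitrary.

        # Hedged sketch: k-means clustering of per-pixel BRDF feature vectors.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(7)
        # 1000 surface points, each described by BRDF samples at 12 angle/wavelength bins
        smooth = rng.normal(1.0, 0.05, size=(500, 12))
        rough = rng.normal(0.6, 0.05, size=(500, 12))
        features = np.vstack([smooth, rough])

        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
        print("cluster sizes:", np.bincount(labels))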

  3. A database of marine phytoplankton abundance, biomass and species composition in Australian waters

    PubMed Central

    Davies, Claire H.; Coughlan, Alex; Hallegraeff, Gustaaf; Ajani, Penelope; Armbrecht, Linda; Atkins, Natalia; Bonham, Prudence; Brett, Steve; Brinkman, Richard; Burford, Michele; Clementson, Lesley; Coad, Peter; Coman, Frank; Davies, Diana; Dela-Cruz, Jocelyn; Devlin, Michelle; Edgar, Steven; Eriksen, Ruth; Furnas, Miles; Hassler, Christel; Hill, David; Holmes, Michael; Ingleton, Tim; Jameson, Ian; Leterme, Sophie C.; Lønborg, Christian; McLaughlin, James; McEnnulty, Felicity; McKinnon, A. David; Miller, Margaret; Murray, Shauna; Nayar, Sasi; Patten, Renee; Pritchard, Tim; Proctor, Roger; Purcell-Meyerink, Diane; Raes, Eric; Rissik, David; Ruszczyk, Jason; Slotwinski, Anita; Swadling, Kerrie M.; Tattersall, Katherine; Thompson, Peter; Thomson, Paul; Tonks, Mark; Trull, Thomas W.; Uribe-Palomino, Julian; Waite, Anya M.; Yauwenas, Rouna; Zammit, Anthony; Richardson, Anthony J.

    2016-01-01

    There have been many individual phytoplankton datasets collected across Australia since the mid 1900s, but most are unavailable to the research community. We have searched archives, contacted researchers, and scanned the primary and grey literature to collate 3,621,847 records of marine phytoplankton species from Australian waters from 1844 to the present. Many of these are small datasets collected for local questions, but combined they provide over 170 years of data on phytoplankton communities in Australian waters. Units and taxonomy have been standardised, obviously erroneous data removed, and all metadata included. We have lodged this dataset with the Australian Ocean Data Network (http://portal.aodn.org.au/) allowing public access. The Australian Phytoplankton Database will be invaluable for global change studies, as it allows analysis of ecological indicators of climate change and eutrophication (e.g., changes in distribution; diatom:dinoflagellate ratios). In addition, the standardised conversion of abundance records to biomass provides modellers with quantifiable data to initialise and validate ecosystem models of lower marine trophic levels. PMID:27328409

  4. WE-G-207-06: 3D Fluoroscopic Image Generation From Patient-Specific 4DCBCT-Based Motion Models Derived From Physical Phantom and Clinical Patient Images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dhou, S; Cai, W; Hurwitz, M

    2015-06-15

    Purpose: Respiratory-correlated cone-beam CT (4DCBCT) images acquired immediately prior to treatment have the potential to represent patient motion patterns and anatomy during treatment, including both intra- and inter-fractional changes. We develop a method to generate patient-specific motion models based on 4DCBCT images acquired with existing clinical equipment and used to generate time varying volumetric images (3D fluoroscopic images) representing motion during treatment delivery. Methods: Motion models are derived by deformably registering each 4DCBCT phase to a reference phase, and performing principal component analysis (PCA) on the resulting displacement vector fields. 3D fluoroscopic images are estimated by optimizing the resulting PCA coefficients iteratively through comparison of the cone-beam projections simulating kV treatment imaging and digitally reconstructed radiographs generated from the motion model. Patient and physical phantom datasets are used to evaluate the method in terms of tumor localization error compared to manually defined ground truth positions. Results: 4DCBCT-based motion models were derived and used to generate 3D fluoroscopic images at treatment time. For the patient datasets, the average tumor localization error and the 95th percentile were 1.57 and 3.13 respectively in subsets of four patient datasets. For the physical phantom datasets, the average tumor localization error and the 95th percentile were 1.14 and 2.78 respectively in two datasets. 4DCBCT motion models are shown to perform well in the context of generating 3D fluoroscopic images due to their ability to reproduce anatomical changes at treatment time. Conclusion: This study showed the feasibility of deriving 4DCBCT-based motion models and using them to generate 3D fluoroscopic images at treatment time in real clinical settings. 4DCBCT-based motion models were found to account for the 3D non-rigid motion of the patient anatomy during treatment and have the potential to localize tumor and other patient anatomical structures at treatment time even when inter-fractional changes occur. This project was supported, in part, through a Master Research Agreement with Varian Medical Systems, Inc., Palo Alto, CA. The project was also supported, in part, by Award Number R21CA156068 from the National Cancer Institute.
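
    A hedged sketch of the motion-model construction: principal component analysis of the displacement vector fields (DVFs) obtained from deformable registration, after which a candidate DVF is the mean field plus a few weighted modes (the weights being what the study optimises against kV projections). The DVFs below are random stand-ins for registration output, and the grid size and number of modes are arbitrary.

        # Hedged sketch: PCA motion model from flattened displacement vector fields.
        import numpy as np
        from sklearn.decomposition import PCA

        n_phases, nx, ny, nz = 10, 32, 32, 16
        dvfs = np.random.default_rng(8).normal(size=(n_phases, nx * ny * nz * 3))  # flattened DVFs

        pca = PCA(n_components=3)
        coeffs = pca.fit_transform(dvfs)           # per-phase PCA coefficients
        mean_dvf = pca.mean_

        # A candidate DVF for some coefficient vector w (optimised against projections in the paper)
        w = np.array([1.0, -0.5, 0.2])
        candidate_dvf = mean_dvf + w @ pca.components_
        print(candidate_dvf.reshape(nx, ny, nz, 3).shape)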

  5. Comprehensive preclinical evaluation of a multi-physics model of liver tumor radiofrequency ablation.

    PubMed

    Audigier, Chloé; Mansi, Tommaso; Delingette, Hervé; Rapaka, Saikiran; Passerini, Tiziano; Mihalef, Viorel; Jolly, Marie-Pierre; Pop, Raoul; Diana, Michele; Soler, Luc; Kamen, Ali; Comaniciu, Dorin; Ayache, Nicholas

    2017-09-01

    We aim at developing a framework for the validation of a subject-specific multi-physics model of liver tumor radiofrequency ablation (RFA). The RFA computation becomes subject specific after several levels of personalization: geometrical and biophysical (hemodynamics, heat transfer and an extended cellular necrosis model). We present a comprehensive experimental setup combining multimodal, pre- and postoperative anatomical and functional images, as well as the interventional monitoring of intra-operative signals: the temperature and delivered power. To exploit this dataset, an efficient processing pipeline is introduced, which copes with image noise, variable resolution and anisotropy. The validation study includes twelve ablations from five healthy pig livers: a mean point-to-mesh error between predicted and actual ablation extent of 5.3 ± 3.6 mm is achieved. This enables an end-to-end preclinical validation framework that considers the available dataset.

  6. Tooth enamel oxygen "isoscapes" show a high degree of human mobility in prehistoric Britain.

    PubMed

    Pellegrini, Maura; Pouncett, John; Jay, Mandy; Pearson, Mike Parker; Richards, Michael P

    2016-10-07

    A geostatistical model to predict human skeletal oxygen isotope values (δ18Op) in Britain is presented here based on a new dataset of Chalcolithic and Early Bronze Age human teeth. The spatial statistics which underpin this model allow the identification of individuals interpreted as 'non-local' to the areas where they were buried (spatial outliers). A marked variation in δ18Op is observed in several areas, including the Stonehenge region, the Peak District, and the Yorkshire Wolds, suggesting a high degree of human mobility. These areas, rich in funerary and ceremonial monuments, may have formed focal points for people, some of whom would have travelled long distances, ultimately being buried there. The dataset and model represent a baseline for future archaeological studies, avoiding the complex conversions from skeletal to water δ18O values, a process known to be problematic.

  7. Tooth enamel oxygen “isoscapes” show a high degree of human mobility in prehistoric Britain

    PubMed Central

    Pellegrini, Maura; Pouncett, John; Jay, Mandy; Pearson, Mike Parker; Richards, Michael P.

    2016-01-01

    A geostatistical model to predict human skeletal oxygen isotope values (δ18Op) in Britain is presented here based on a new dataset of Chalcolithic and Early Bronze Age human teeth. The spatial statistics which underpin this model allow the identification of individuals interpreted as ‘non-local’ to the areas where they were buried (spatial outliers). A marked variation in δ18Op is observed in several areas, including the Stonehenge region, the Peak District, and the Yorkshire Wolds, suggesting a high degree of human mobility. These areas, rich in funerary and ceremonial monuments, may have formed focal points for people, some of whom would have travelled long distances, ultimately being buried there. The dataset and model represent a baseline for future archaeological studies, avoiding the complex conversions from skeletal to water δ18O values–a process known to be problematic. PMID:27713538

  8. Tooth enamel oxygen “isoscapes” show a high degree of human mobility in prehistoric Britain

    NASA Astrophysics Data System (ADS)

    Pellegrini, Maura; Pouncett, John; Jay, Mandy; Pearson, Mike Parker; Richards, Michael P.

    2016-10-01

    A geostatistical model to predict human skeletal oxygen isotope values (δ18Op) in Britain is presented here based on a new dataset of Chalcolithic and Early Bronze Age human teeth. The spatial statistics which underpin this model allow the identification of individuals interpreted as ‘non-local’ to the areas where they were buried (spatial outliers). A marked variation in δ18Op is observed in several areas, including the Stonehenge region, the Peak District, and the Yorkshire Wolds, suggesting a high degree of human mobility. These areas, rich in funerary and ceremonial monuments, may have formed focal points for people, some of whom would have travelled long distances, ultimately being buried there. The dataset and model represent a baseline for future archaeological studies, avoiding the complex conversions from skeletal to water δ18O values-a process known to be problematic.

  9. Development and Validation of Triarchic Psychopathy Scales from the Multidimensional Personality Questionnaire

    PubMed Central

    Brislin, Sarah J.; Drislane, Laura E.; Smith, Shannon Toney; Edens, John F.; Patrick, Christopher J.

    2015-01-01

    Psychopathy is conceptualized by the triarchic model as encompassing three distinct phenotypic constructs: boldness, meanness, and disinhibition. In the current study, the Multidimensional Personality Questionnaire (MPQ), a normal-range personality measure, was evaluated for representation of these three constructs. Consensus ratings were used to identify MPQ items most related to each triarchic (Tri) construct. Scale measures were developed from items indicative of each construct, and scores for these scales were evaluated for convergent and discriminant validity in community (N = 176) and incarcerated samples (N = 240). Across the two samples, MPQ-Tri scale scores demonstrated good internal consistencies and relationships with criterion measures of various types consistent with predictions based on the triarchic model. Findings are discussed in terms of their implications for further investigation of the triarchic model constructs in preexisting datasets that include the MPQ, in particular longitudinal and genetically informative datasets. PMID:25642934

  10. Antibody-protein interactions: benchmark datasets and prediction tools evaluation

    PubMed Central

    Ponomarenko, Julia V; Bourne, Philip E

    2007-01-01

    Background The ability to predict antibody binding sites (aka antigenic determinants or B-cell epitopes) for a given protein is a precursor to new vaccine design and diagnostics. Among the various methods of B-cell epitope identification, X-ray crystallography is one of the most reliable. Computational methods for B-cell epitope prediction build on these experimental data. As the number of structures of antibody-protein complexes grows, further interest in prediction methods using 3D structure is anticipated. This work aims to establish a benchmark for 3D structure-based epitope prediction methods. Results Two B-cell epitope benchmark datasets inferred from the 3D structures of antibody-protein complexes were defined. The first is a dataset of 62 representative 3D structures of protein antigens with inferred structural epitopes. The second is a dataset of 82 structures of antibody-protein complexes containing different structural epitopes. Using these datasets, eight web servers developed for antibody and protein binding site prediction were evaluated. No method exceeded 40% precision and 46% recall. The values of the area under the receiver operating characteristic curve for the evaluated methods were about 0.6 for ConSurf, DiscoTope, and PPI-PRED, and above 0.65 but not exceeding 0.70 for protein-protein docking methods when the best of the top ten models for the bound docking were considered; the remaining methods performed close to random. The benchmark datasets are included as a supplement to this paper. Conclusion It may be possible to improve epitope prediction methods through training on datasets which include only immune epitopes and through utilizing more features characterizing epitopes, for example, the evolutionary conservation score. Notwithstanding, the overall poor performance may reflect the generality of antigenicity and hence the inability to decipher B-cell epitopes as an intrinsic feature of the protein. It is an open question as to whether ultimately discriminatory features can be found. PMID:17910770
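
    The evaluation metrics named above (precision, recall, ROC AUC against structurally inferred epitope labels) can be computed with a few scikit-learn calls. The labels and prediction scores below are synthetic placeholders, not data from the benchmark; this only illustrates the scoring step.

    ```python
    import numpy as np
    from sklearn.metrics import precision_score, recall_score, roc_auc_score

    rng = np.random.default_rng(1)
    # Synthetic stand-ins: per-residue epitope labels (1 = epitope) and one server's scores.
    y_true = rng.integers(0, 2, size=200)
    scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, size=200), 0, 1)
    y_pred = (scores >= 0.5).astype(int)          # binarize at an assumed threshold

    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("ROC AUC:  ", roc_auc_score(y_true, scores))
    ```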

  11. QSAR Modeling of Rat Acute Toxicity by Oral Exposure

    PubMed Central

    Zhu, Hao; Martin, Todd M.; Ye, Lin; Sedykh, Alexander; Young, Douglas M.; Tropsha, Alexander

    2009-01-01

    Few Quantitative Structure-Activity Relationship (QSAR) studies have successfully modeled large, diverse rodent toxicity endpoints. In this study, a comprehensive dataset of 7,385 compounds with their most conservative lethal dose (LD50) values has been compiled. A combinatorial QSAR approach has been employed to develop robust and predictive models of acute toxicity in rats caused by oral exposure to chemicals. To enable fair comparison between the predictive power of models generated in this study versus a commercial toxicity predictor, TOPKAT (Toxicity Prediction by Komputer Assisted Technology), a modeling subset of the entire dataset was selected that included all 3,472 compounds used in the TOPKAT’s training set. The remaining 3,913 compounds, which were not present in the TOPKAT training set, were used as the external validation set. QSAR models of five different types were developed for the modeling set. The prediction accuracy for the external validation set was estimated by determination coefficient R2 of linear regression between actual and predicted LD50 values. The use of the applicability domain threshold implemented in most models generally improved the external prediction accuracy but expectedly led to the decrease in chemical space coverage; depending on the applicability domain threshold, R2 ranged from 0.24 to 0.70. Ultimately, several consensus models were developed by averaging the predicted LD50 for every compound using all 5 models. The consensus models afforded higher prediction accuracy for the external validation dataset with the higher coverage as compared to individual constituent models. The validated consensus LD50 models developed in this study can be used as reliable computational predictors of in vivo acute toxicity. PMID:19845371
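
    The consensus step described above — averaging per-compound predictions from several QSAR models and summarizing external accuracy by the determination coefficient R2 between observed and predicted LD50 — can be sketched in a few lines. The arrays below are placeholders, not the study's 7,385-compound dataset, and the averaging shown is the simplest possible scheme.

    ```python
    import numpy as np

    def consensus_r2(predictions, observed):
        """predictions: (n_models, n_compounds) predicted LD50-derived values (toy stand-ins);
        observed: (n_compounds,) experimental values. Returns consensus prediction and R2."""
        consensus = predictions.mean(axis=0)            # simple average over constituent models
        r = np.corrcoef(consensus, observed)[0, 1]      # correlation of the linear fit
        return consensus, r ** 2

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        observed = rng.normal(2.5, 0.8, size=500)
        predictions = observed + rng.normal(0, 0.5, size=(5, 500))   # five imperfect models
        _, r2 = consensus_r2(predictions, observed)
        print(f"external R2 of consensus model: {r2:.2f}")
    ```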

  12. Managed Development Environment Successes for MSFC's VIPA Team

    NASA Technical Reports Server (NTRS)

    Finckenor, Jeff; Corder, Gary; Owens, James; Meehan, Jim; Tidwell, Paul H.

    2005-01-01

    This paper outlines the best practices of the Vehicle Design Team for VIPA. The functions of the VIPA Vehicle Design (VVD) discipline team are to maintain the controlled reference geometry and provide linked, simplified geometry for each of the other discipline analyses. The core of the VVD work, and the approach for VVD's first task of controlling the reference geometry, involves systems engineering, top-down, layout-based CAD modeling within a Product Data Manager (PDM) development environment. The top-down approach allows for simple control of very large, integrated assemblies and greatly enhances the ability to generate trade configurations and reuse data. The second VVD task, model simplification for analysis, is handled within the managed environment through application of the master model concept. In this approach, there is a single controlling, or master, product definition dataset. Connected to this master model are reference datasets with live geometric and expression links. The referenced models can be for drawings, manufacturing, visualization, embedded analysis, or analysis simplification. A discussion of web-based interaction, including visualization, between the design and other disciplines is included. Demonstrated examples are cited, including the Space Launch Initiative development cycle, the Saturn V systems integration and verification cycle, an Orbital Space Plane study, and NASA Exploration Office studies of Shuttle-derived and clean-sheet launch vehicles. The VIPA Team has brought an immense amount of detailed data to bear on program issues. A central piece of that success has been the Managed Development Environment and the VVD Team approach to modeling.

  13. Observational uncertainty and regional climate model evaluation: A pan-European perspective

    NASA Astrophysics Data System (ADS)

    Kotlarski, Sven; Szabó, Péter; Herrera, Sixto; Räty, Olle; Keuler, Klaus; Soares, Pedro M.; Cardoso, Rita M.; Bosshard, Thomas; Pagé, Christian; Boberg, Fredrik; Gutiérrez, José M.; Jaczewski, Adam; Kreienkamp, Frank; Liniger, Mark A.; Lussana, Cristian; Szepszo, Gabriella

    2017-04-01

    Local and regional climate change assessments based on downscaling methods crucially depend on the existence of accurate and reliable observational reference data. In dynamical downscaling via regional climate models (RCMs) observational data can influence model development itself and, later on, model evaluation, parameter calibration and added value assessment. In empirical-statistical downscaling, observations serve as predictand data and directly influence model calibration with corresponding effects on downscaled climate change projections. Focusing on the evaluation of RCMs, we here analyze the influence of uncertainties in observational reference data on evaluation results in a well-defined performance assessment framework and on a European scale. For this purpose we employ three different gridded observational reference grids, namely (1) the well-established EOBS dataset (2) the recently developed EURO4M-MESAN regional re-analysis, and (3) several national high-resolution and quality-controlled gridded datasets that recently became available. In terms of climate models five reanalysis-driven experiments carried out by five different RCMs within the EURO-CORDEX framework are used. Two variables (temperature and precipitation) and a range of evaluation metrics that reflect different aspects of RCM performance are considered. We furthermore include an illustrative model ranking exercise and relate observational spread to RCM spread. The results obtained indicate a varying influence of observational uncertainty on model evaluation depending on the variable, the season, the region and the specific performance metric considered. Over most parts of the continent, the influence of the choice of the reference dataset for temperature is rather small for seasonal mean values and inter-annual variability. Here, model uncertainty (as measured by the spread between the five RCM simulations considered) is typically much larger than reference data uncertainty. For parameters of the daily temperature distribution and for the spatial pattern correlation, however, important dependencies on the reference dataset can arise. The related evaluation uncertainties can be as large or even larger than model uncertainty. For precipitation the influence of observational uncertainty is, in general, larger than for temperature. It often dominates model uncertainty especially for the evaluation of the wet day frequency, the spatial correlation and the shape and location of the distribution of daily values. But even the evaluation of large-scale seasonal mean values can be considerably affected by the choice of the reference. When employing a simple and illustrative model ranking scheme on these results it is found that RCM ranking in many cases depends on the reference dataset employed.

  14. How many sightings to model rare marine species distributions

    PubMed Central

    Authier, Matthieu; Monestiez, Pascal; Ridoux, Vincent

    2018-01-01

    Despite large efforts, datasets with few sightings are often available for rare species of marine megafauna that typically live at low densities. This paucity of data makes modelling the habitat of these taxa particularly challenging. We tested the predictive performance of different types of species distribution models fitted to decreasing numbers of sightings. Generalised additive models (GAMs) with three different residual distributions and the presence-only model MaxEnt were tested on two megafauna case studies differing in both the number of sightings and ecological niches. From dolphin (277 sightings) and auk (1,455 sightings) datasets, we simulated rarity with a sighting thinning protocol by random sampling (without replacement) of a decreasing fraction of sightings. Better prediction of the distribution of a rarely sighted species occupying a narrow habitat (auk dataset) was expected compared to the distribution of a rarely sighted species occupying a broad habitat (dolphin dataset). We used the original datasets to set up a baseline model and fitted additional models on fewer sightings while keeping effort constant. Model predictive performance was assessed with mean squared error and area under the curve. Predictions provided by the models fitted to the thinned-out datasets were better than a homogeneous spatial distribution down to a threshold of approximately 30 sightings for a GAM with a Tweedie distribution and approximately 130 sightings for the other models. Thinning the sighting data for the taxon with narrower habitats seemed to be less detrimental to model predictive performance than for the broader habitat taxon. To generate reliable habitat modelling predictions for rarely sighted marine predators, our results suggest (1) using GAMs with a Tweedie distribution with presence-absence data and (2) implementing, as a conservative empirical measure, at least 50 sightings in the models. PMID:29529097
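
    The thinning protocol itself — subsampling sightings without replacement at decreasing fractions while leaving the effort data untouched — is straightforward to express; a minimal pandas sketch follows. The columns and toy data are assumptions, and the GAM refitting and scoring (MSE, AUC) are only indicated in a comment.

    ```python
    import numpy as np
    import pandas as pd

    def thin_sightings(sightings: pd.DataFrame, fraction: float, seed: int = 0) -> pd.DataFrame:
        """Randomly keep `fraction` of the sighting rows (without replacement) to simulate rarity."""
        return sightings.sample(frac=fraction, replace=False, random_state=seed)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        sightings = pd.DataFrame({
            "lon": rng.uniform(-6, 0, 277),
            "lat": rng.uniform(43, 49, 277),
            "count": rng.poisson(2, 277) + 1,
        })
        for frac in (0.75, 0.5, 0.25, 0.1):
            subset = thin_sightings(sightings, frac, seed=1)
            # A GAM (e.g. with a Tweedie family) would be refitted to `subset` plus the
            # unchanged effort data here, then scored against the baseline model.
            print(frac, len(subset))
    ```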

  15. Linguistic Extensions of Topic Models

    ERIC Educational Resources Information Center

    Boyd-Graber, Jordan

    2010-01-01

    Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling has been fruitfully applied to problems in social science, biology, and computer vision, it has been most widely used to model datasets where documents are modeled as exchangeable…
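
    As a concrete point of reference for the baseline model named above, here is a compact scikit-learn sketch of LDA fitted to a tiny bag-of-words corpus; the documents and parameter choices are illustrative only and unrelated to the dissertation's data.

    ```python
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "gene expression measured across tissue samples",
        "party platforms and legislative voting records",
        "object recognition from labelled image patches",
        "regulatory genes and protein interaction networks",
    ]

    # Documents are treated as exchangeable bags of words, each a mixture of latent topics.
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    print(lda.transform(counts))   # per-document topic proportions
    ```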

  16. Assessing the Application of a Geographic Presence-Only Model for Land Suitability Mapping

    PubMed Central

    Heumann, Benjamin W.; Walsh, Stephen J.; McDaniel, Phillip M.

    2011-01-01

    Recent advances in ecological modeling have focused on novel methods for characterizing the environment that use presence-only data and machine-learning algorithms to predict the likelihood of species occurrence. These novel methods may have great potential for land suitability applications in the developing world where detailed land cover information is often unavailable or incomplete. This paper assesses the adaptation and application of the presence-only geographic species distribution model, MaxEnt, for agricultural crop suitability mapping in rural Thailand, where lowland paddy rice and upland field crops predominate. To assess this modeling approach, three independent crop presence datasets were used, including a social-demographic survey of farm households, a remote sensing classification of land use/land cover, and ground control points used for geodetic and thematic reference, which vary in their geographic distribution and sample size. Disparate environmental data were integrated to characterize environmental settings across Nang Rong District, a region of approximately 1,300 sq. km in size. Results indicate that the MaxEnt model is capable of modeling crop suitability for upland and lowland crops, including rice varieties, although model results varied between datasets due to the high sensitivity of the model to the distribution of observed crop locations in geographic and environmental space. Accuracy assessments indicate that model outcomes were influenced by the sample size and the distribution of sample points in geographic and environmental space. The need for further research into accuracy assessments of presence-only models lacking true absence data is discussed. We conclude that the MaxEnt model can provide good estimates of crop suitability, but many areas need to be carefully scrutinized, including the geographic distribution of input data and assessment methods, to ensure realistic modeling results. PMID:21860606

  17. Topographic and hydrographic GIS datasets for the Afghan Geological Survey and U.S. Geological Survey 2013 mineral areas of interest

    USGS Publications Warehouse

    Casey, Brittany N.; Chirico, Peter G.

    2013-01-01

    Afghanistan is endowed with a vast amount of mineral resources, and it is believed that the current economic state of the country could be greatly improved through investment in the extraction and production of these resources. In 2007, the “Preliminary Non-Fuel Resource Assessment of Afghanistan 2007” was completed by members of the U.S. Geological Survey and Afghan Geological Survey (Peters and others, 2007). The assessment delineated 20 mineralized areas for further study using a geologic-based methodology. In 2011, a follow-on data product, “Summaries and Data Packages of Important Areas for Mineral Investment and Production Opportunities of Nonfuel Minerals in Afghanistan,” was released (Peters and others, 2011). As part of this more recent work, geologic, geohydrologic, and hyperspectral studies were carried out in the areas of interest (AOIs) to assess the location and characteristics of the mineral resources. The 2011 publication included a dataset of 24 identified AOIs containing subareas, a corresponding digital elevation model (DEM), elevation contours, areal extent, and hydrography for each AOI. In 2012, project scientists identified five new AOIs and two subareas in Afghanistan. These new areas are Ahankashan, Kandahar, Parwan, North Bamyan, and South Bamyan. The two identified subareas include Obatu-Shela and Sekhab-ZamtoKalay, both located within the larger Kandahar AOI. In addition, an extended Kandahar AOI is included in the project for water resource modeling purposes. The dataset presented in this publication consists of the areal extent of the five new AOIs, two subareas, and the extended Kandahar AOI, elevation contours at 100-, 50-, and 25-meter intervals, an enhanced DEM, and a hydrographic dataset covering the extent of the new study area. The resulting raster and vector layers are intended for use by government agencies, developmental organizations, and private companies in Afghanistan to assist with mineral assessments, monitoring, management, and investment.

  18. MEBS, a software platform to evaluate large (meta)genomic collections according to their metabolic machinery: unraveling the sulfur cycle

    PubMed Central

    Zapata-Peñasco, Icoquih; Poot-Hernandez, Augusto Cesar; Eguiarte, Luis E

    2017-01-01

    Abstract The increasing number of metagenomic and genomic sequences has dramatically improved our understanding of microbial diversity, yet our ability to infer metabolic capabilities in such datasets remains challenging. We describe the Multigenomic Entropy Based Score pipeline (MEBS), a software platform designed to evaluate, compare, and infer complex metabolic pathways in large “omic” datasets, including entire biogeochemical cycles. MEBS is open source and available through https://github.com/eead-csic-compbio/metagenome_Pfam_score. To demonstrate its use, we modeled the sulfur cycle by exhaustively curating the molecular and ecological elements involved (compounds, genes, metabolic pathways, and microbial taxa). This information was reduced to a collection of 112 characteristic Pfam protein domains and a list of complete-sequenced sulfur genomes. Using the mathematical framework of relative entropy (H΄), we quantitatively measured the enrichment of these domains among sulfur genomes. The entropy of each domain was used both to build up a final score that indicates whether a (meta)genomic sample contains the metabolic machinery of interest and to propose marker domains in metagenomic sequences such as DsrC (PF04358). MEBS was benchmarked with a dataset of 2107 non-redundant microbial genomes from RefSeq and 935 metagenomes from MG-RAST. Its performance, reproducibility, and robustness were evaluated using several approaches, including random sampling, linear regression models, receiver operator characteristic plots, and the area under the curve metric (AUC). Our results support the broad applicability of this algorithm to accurately classify (AUC = 0.985) hard-to-culture genomes (e.g., Candidatus Desulforudis audaxviator), previously characterized ones, and metagenomic environments such as hydrothermal vents, or deep-sea sediment. Our benchmark indicates that an entropy-based score can capture the metabolic machinery of interest and can be used to efficiently classify large genomic and metagenomic datasets, including uncultivated/unexplored taxa. PMID:29069412

  19. MEBS, a software platform to evaluate large (meta)genomic collections according to their metabolic machinery: unraveling the sulfur cycle.

    PubMed

    De Anda, Valerie; Zapata-Peñasco, Icoquih; Poot-Hernandez, Augusto Cesar; Eguiarte, Luis E; Contreras-Moreira, Bruno; Souza, Valeria

    2017-11-01

    The increasing number of metagenomic and genomic sequences has dramatically improved our understanding of microbial diversity, yet our ability to infer metabolic capabilities in such datasets remains challenging. We describe the Multigenomic Entropy Based Score pipeline (MEBS), a software platform designed to evaluate, compare, and infer complex metabolic pathways in large "omic" datasets, including entire biogeochemical cycles. MEBS is open source and available through https://github.com/eead-csic-compbio/metagenome_Pfam_score. To demonstrate its use, we modeled the sulfur cycle by exhaustively curating the molecular and ecological elements involved (compounds, genes, metabolic pathways, and microbial taxa). This information was reduced to a collection of 112 characteristic Pfam protein domains and a list of complete-sequenced sulfur genomes. Using the mathematical framework of relative entropy (H΄), we quantitatively measured the enrichment of these domains among sulfur genomes. The entropy of each domain was used both to build up a final score that indicates whether a (meta)genomic sample contains the metabolic machinery of interest and to propose marker domains in metagenomic sequences such as DsrC (PF04358). MEBS was benchmarked with a dataset of 2107 non-redundant microbial genomes from RefSeq and 935 metagenomes from MG-RAST. Its performance, reproducibility, and robustness were evaluated using several approaches, including random sampling, linear regression models, receiver operator characteristic plots, and the area under the curve metric (AUC). Our results support the broad applicability of this algorithm to accurately classify (AUC = 0.985) hard-to-culture genomes (e.g., Candidatus Desulforudis audaxviator), previously characterized ones, and metagenomic environments such as hydrothermal vents, or deep-sea sediment. Our benchmark indicates that an entropy-based score can capture the metabolic machinery of interest and can be used to efficiently classify large genomic and metagenomic datasets, including uncultivated/unexplored taxa. © The Author 2017. Published by Oxford University Press.
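
    The entropy-based scoring idea behind MEBS (records 18 and 19 above) can be sketched as follows: each Pfam domain's enrichment among curated sulfur genomes relative to a background collection is expressed as a relative-entropy term, and a (meta)genome is scored by summing the terms for the domains it contains. This is a simplified illustration under assumed frequencies; the exact weighting used by MEBS may differ, and the frequencies below are made up.

    ```python
    import numpy as np

    def domain_relative_entropy(p_sulfur, p_background, eps=1e-9):
        """Relative-entropy-style enrichment of one Pfam domain.
        p_sulfur: fraction of curated sulfur genomes containing the domain;
        p_background: fraction of all genomes containing it."""
        p, q = max(p_sulfur, eps), max(p_background, eps)
        return p * np.log2(p / q)

    def score_genome(domains_present, enrichment):
        """Sum the per-domain enrichment terms for the domains detected in a (meta)genome."""
        return sum(enrichment.get(d, 0.0) for d in domains_present)

    if __name__ == "__main__":
        enrichment = {
            "PF04358": domain_relative_entropy(0.95, 0.10),   # DsrC-like: strongly enriched (toy values)
            "PF00005": domain_relative_entropy(0.99, 0.98),   # ABC transporter: uninformative
        }
        print(score_genome({"PF04358", "PF00005"}, enrichment))
    ```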

  20. Pooled solifenacin overactive bladder trial data: Creation, validation and analysis of an integrated database.

    PubMed

    Chapple, Christopher R; Cardozo, Linda; Snijder, Robert; Siddiqui, Emad; Herschorn, Sender

    2016-12-15

    Patient-level data are available for 11 randomized, controlled, Phase III/Phase IV solifenacin clinical trials. Meta-analyses were conducted to interrogate the data, to broaden knowledge about solifenacin and overactive bladder (OAB) in general. Before integrating data, datasets from individual studies were mapped to a single format using methodology developed by the Clinical Data Interchange Standards Consortium (CDISC). Initially, the data structure was harmonized, to ensure identical categorization, using the CDISC Study Data Tabulation Model (SDTM). To allow for patient-level meta-analysis, data were integrated and mapped to analysis datasets. Mapping included adding derived and categorical variables and followed standards described as the Analysis Data Model (ADaM). Mapping to both SDTM and ADaM was performed twice by two independent programming teams, results compared, and inconsistencies corrected in the final output. ADaM analysis sets included assignments of patients to the Safety Analysis Set and the Full Analysis Set. There were three analysis groupings: Analysis group 1 (placebo-controlled, monotherapy, fixed-dose studies, n = 3011); Analysis group 2 (placebo-controlled, monotherapy, pooled, fixed- and flexible-dose, n = 5379); Analysis group 3 (all solifenacin monotherapy-treated patients, n = 6539). Treatment groups were: solifenacin 5 mg fixed dose, solifenacin 5/10 mg flexible dose, solifenacin 10 mg fixed dose and overall solifenacin. Patients were similar enough for data pooling to be acceptable. Creating ADaM datasets provided significant information about individual studies and the derivation decisions made in each study; validated ADaM datasets now exist for medical history, efficacy and AEs. Results from these meta-analyses were similar over time.

  1. A modified active appearance model based on an adaptive artificial bee colony.

    PubMed

    Abdulameer, Mohammed Hasan; Sheikh Abdullah, Siti Norul Huda; Othman, Zulaiha Ali

    2014-01-01

    The active appearance model (AAM) is one of the most popular model-based approaches and has been extensively used to extract features by highly accurate modeling of human faces under various physical and environmental circumstances. However, in such an active appearance model, fitting the model to the original image is a challenging task. The state of the art shows that optimization methods can resolve this problem, but applying them efficiently remains a difficulty in itself. Hence, in this paper we propose an AAM-based face recognition technique that resolves the fitting problem of the AAM by introducing a new adaptive artificial bee colony (ABC) algorithm. The adaptation increases the efficiency of fitting compared with the conventional ABC algorithm. We used three datasets in our experiments: the CASIA dataset, a property 2.5D face dataset, and the UBIRIS v1 image dataset. The results reveal that the proposed face recognition technique performs effectively in terms of recognition accuracy.

  2. Species distribution modelling for Rhipicephalus microplus (Acari: Ixodidae) in Benin, West Africa: comparing datasets and modelling algorithms.

    PubMed

    De Clercq, E M; Leta, S; Estrada-Peña, A; Madder, M; Adehan, S; Vanwambeke, S O

    2015-01-01

    Rhipicephalus microplus is one of the most widely distributed and economically important ticks, transmitting Babesia bigemina, B. bovis and Anaplasma marginale. It was recently introduced to West Africa on live animals originating from Brazil. Knowing the precise environmental suitability for the tick would allow veterinary health officials to draft vector control strategies for different regions of the country. To test the performance of modelling algorithms and different sets of environmental explanatory variables, species distribution models for this tick species in Benin were developed using generalized linear models, linear discriminant analysis and random forests. The training data for these models were a dataset containing reported absence or presence in 104 farms, randomly selected across Benin. These farms were sampled at the end of the rainy season, which corresponds with an annual peak in tick abundance. Two environmental datasets for the country of Benin were compared: one based on interpolated climate data (WorldClim) and one based on remotely sensed images (MODIS). The pixel size for both environmental datasets was 1 km. Highly suitable areas occurred mainly along the warmer and humid coast extending northwards to central Benin. The northern hot and drier areas were found to be unsuitable. The models developed and tested on data from the entire country were generally found to perform well, having an AUC value greater than 0.92. Although statistically significant, only small differences in accuracy measures were found between the modelling algorithms, or between the environmental datasets. The resulting risk maps differed nonetheless. Models based on interpolated climate suggested gradual variations in habitat suitability, while those based on remotely sensed data indicated a sharper contrast between suitable and unsuitable areas, and a patchy distribution of the suitable areas. Remotely sensed data yielded more spatial detail in the predictions. When computing accuracy measures on a subset of data along the invasion front, the modelling technique Random Forest outperformed the other modelling approaches, and results with MODIS-derived variables were better than those using WorldClim data. The high environmental suitability for R. microplus in the southern half of Benin raises concern at the regional level for animal health, including its potential to substantially alter transmission risk of Babesia bovis. The northern part of Benin appeared overall of low environmental suitability. Continuous surveillance in the transition zone however remains relevant, in relation to important cattle movements in the region, and to the invasive character of R. microplus. Copyright © 2014 The Authors. Published by Elsevier B.V. All rights reserved.
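
    The model-comparison framework described above (GLM-style logistic regression, linear discriminant analysis, and random forest, scored by AUC on held-out farms) can be sketched with scikit-learn. The feature table below is synthetic, standing in for the WorldClim or MODIS covariates at the 104 sampled farms.

    ```python
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(104, 6))     # 104 farms x 6 environmental covariates (synthetic)
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=104) > 0).astype(int)  # presence/absence

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
    models = {
        "GLM (logistic)": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "Random forest": RandomForestClassifier(n_estimators=500, random_state=0),
    }
    for name, model in models.items():
        auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        print(f"{name}: AUC = {auc:.2f}")
    ```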

  3. Realistic computer network simulation for network intrusion detection dataset generation

    NASA Astrophysics Data System (ADS)

    Payer, Garrett

    2015-05-01

    The KDD-99 Cup dataset is dead. While it can continue to be used as a toy example, the age of this dataset makes it all but useless for intrusion detection research and data mining. Many of the attacks used within the dataset are obsolete and do not reflect the features important for intrusion detection in today's networks. Creating a new dataset encompassing a large cross section of the attacks found on the Internet today could be useful, but would eventually fall to the same problem as the KDD-99 Cup; its usefulness would diminish after a period of time. To continue research into intrusion detection, the generation of new datasets needs to be as dynamic and as quick as the attacker. Simply examining existing network traffic and using domain experts such as intrusion analysts to label traffic is inefficient, expensive, and not scalable. The only viable methodology is simulation using technologies including virtualization, attack-toolsets such as Metasploit and Armitage, and sophisticated emulation of threat and user behavior. Simulating actual user behavior and network intrusion events dynamically not only allows researchers to vary scenarios quickly, but enables online testing of intrusion detection mechanisms by interacting with data as it is generated. As new threat behaviors are identified, they can be added to the simulation to make quicker determinations as to the effectiveness of existing and ongoing network intrusion technology, methodology and models.

  4. Exploring Normalization and Network Reconstruction Methods using In Silico and In Vivo Models

    EPA Science Inventory

    Abstract: Lessons learned from the recent DREAM competitions include: The search for the best network reconstruction method continues, and we need more complete datasets with ground truth from more complex organisms. It has become obvious that the network reconstruction methods t...

  5. Extending TOPS: Ontology-driven Anomaly Detection and Analysis System

    NASA Astrophysics Data System (ADS)

    Votava, P.; Nemani, R. R.; Michaelis, A.

    2010-12-01

    Terrestrial Observation and Prediction System (TOPS) is a flexible modeling software system that integrates ecosystem models with frequent satellite and surface weather observations to produce ecosystem nowcasts (assessments of current conditions) and forecasts useful in natural resources management, public health and disaster management. We have been extending the Terrestrial Observation and Prediction System (TOPS) to include a capability for automated anomaly detection and analysis of both on-line (streaming) and off-line data. In order to best capture the knowledge about data hierarchies, Earth science models and implied dependencies between anomalies and occurrences of observable events such as urbanization, deforestation, or fires, we have developed an ontology to serve as a knowledge base. We can query the knowledge base and answer questions about dataset compatibilities, similarities and dependencies so that we can, for example, automatically analyze similar datasets in order to verify a given anomaly occurrence in multiple data sources. We are further extending the system to go beyond anomaly detection towards reasoning about possible causes of anomalies that are also encoded in the knowledge base as either learned or implied knowledge. This enables us to scale up the analysis by eliminating a large number of anomalies early on during the processing by either failure to verify them from other sources, or matching them directly with other observable events without having to perform an extensive and time-consuming exploration and analysis. The knowledge is captured using OWL ontology language, where connections are defined in a schema that is later extended by including specific instances of datasets and models. The information is stored using Sesame server and is accessible through both Java API and web services using SeRQL and SPARQL query languages. Inference is provided using OWLIM component integrated with Sesame.
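
    The kind of knowledge-base query described above — "which datasets are compatible with this one, so an anomaly can be cross-verified?" — can be illustrated with a small SPARQL example. TOPS itself exposes its OWL ontology through a Sesame server with SeRQL/SPARQL and a Java API; the rdflib library and the example.org namespace below are stand-ins used purely for illustration.

    ```python
    from rdflib import Graph, Literal, Namespace, RDF

    TOPS = Namespace("http://example.org/tops#")   # hypothetical namespace for this sketch

    g = Graph()
    g.add((TOPS.ndvi_modis, RDF.type, TOPS.Dataset))
    g.add((TOPS.ndvi_avhrr, RDF.type, TOPS.Dataset))
    g.add((TOPS.ndvi_modis, TOPS.compatibleWith, TOPS.ndvi_avhrr))
    g.add((TOPS.ndvi_modis, TOPS.variable, Literal("NDVI")))

    # Which datasets could be used to cross-verify an anomaly seen in the MODIS NDVI stream?
    query = """
    PREFIX tops: <http://example.org/tops#>
    SELECT ?other WHERE {
        tops:ndvi_modis tops:compatibleWith ?other .
        ?other a tops:Dataset .
    }
    """
    for row in g.query(query):
        print(row.other)
    ```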

  6. Revisiting Frazier's subdeltas: enhancing datasets with dimensionality, better to understand geologic systems

    USGS Publications Warehouse

    Flocks, James

    2006-01-01

    Scientific knowledge from the past century is commonly represented by two-dimensional figures and graphs, as presented in manuscripts and maps. Using today's computer technology, this information can be extracted and projected into three- and four-dimensional perspectives. Computer models can be applied to datasets to provide additional insight into complex spatial and temporal systems. This process can be demonstrated by applying digitizing and modeling techniques to valuable information within widely used publications. The seminal paper by D. Frazier, published in 1967, identified 16 separate delta lobes formed by the Mississippi River during the past 6,000 yrs. The paper includes stratigraphic descriptions through geologic cross-sections, and provides distribution and chronologies of the delta lobes. The data from Frazier's publication are extensively referenced in the literature. Additional information can be extracted from the data through computer modeling. Digitizing and geo-rectifying Frazier's geologic cross-sections produce a three-dimensional perspective of the delta lobes. Adding the chronological data included in the report provides the fourth-dimension of the delta cycles, which can be visualized through computer-generated animation. Supplemental information can be added to the model, such as post-abandonment subsidence of the delta-lobe surface. Analyzing the regional, net surface-elevation balance between delta progradations and land subsidence is computationally intensive. By visualizing this process during the past 4,500 yrs through multi-dimensional animation, the importance of sediment compaction in influencing both the shape and direction of subsequent delta progradations becomes apparent. Visualization enhances a classic dataset, and can be further refined using additional data, as well as provide a guide for identifying future areas of study.

  7. Modeling extreme sea levels due to tropical and extra-tropical cyclones at the global-scale

    NASA Astrophysics Data System (ADS)

    Muis, S.; Lin, N.; Verlaan, M.; Winsemius, H.; Ward, P.; Aerts, J.

    2017-12-01

    Extreme sea levels, a combination of storm surges and astronomical tides, can cause catastrophic floods. Due to their intense wind speeds and low pressure, tropical cyclones (TCs) typically cause higher storm surges than extra-tropical cyclones (ETCs), but ETCs may still contribute significantly to the overall flood risk. In this contribution, we show a novel approach to model extreme sea levels due to both tropical and extra-tropical cyclones at the global scale. Using a global hydrodynamic model we have developed the Global Tide and Surge Reanalysis (GTSR) dataset (Muis et al., 2016), which provides daily maximum time series of storm tide from 1979 to 2014. GTSR is based on wind and pressure fields from the ERA-Interim climate reanalysis (Dee et al., 2011). A severe limitation of the GTSR dataset is the underrepresentation of TCs. This is due to the relatively coarse grid resolution of ERA-Interim, which means that the strong intensities of TCs are not fully included. Furthermore, the length of ERA-Interim is too short to estimate the probabilities of extreme TCs in a reliable way. We will discuss potential ways to address this limitation, and demonstrate how to improve the global GTSR framework. We will apply the improved framework to the east coast of the United States. First, we improve our meteorological forcing by applying a parametric hurricane model (Holland, 1980), and we improve the tide and surge reanalysis dataset (Muis et al., 2016) by explicitly modeling the historical TCs in the Extended Best Track dataset (Demuth et al., 2006). Second, we improve our sampling by statistically extending the observed TC record to many thousands of years (Emanuel et al., 2006). The improved framework allows for the mapping of probabilities of extreme sea levels, including extreme TC events, for the east coast of the United States. References: Dee et al. (2011). The ERA-Interim reanalysis: configuration and performance of the data assimilation system. Q. J. R. Meteorol. Soc. 137, 553-597. Emanuel et al. (2006). A Statistical Deterministic Approach to Hurricane Risk Assessment. Bull. Am. Meteorol. Soc. 87, 299-314. Holland (1980). An analytic model of the wind and pressure profiles in hurricanes. Mon. Weather Rev. 108, 1212-1218. Muis et al. (2016). A global reanalysis of storm surge and extreme sea levels. Nat. Commun. 7, 1-11.
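
    For reference, the Holland (1980) parametric profile cited above gives the gradient wind speed as a function of radius from the central pressure deficit, the radius of maximum wind, and a shape parameter B. The sketch below uses illustrative default values; operational implementations add asymmetry corrections and boundary-layer adjustments that are not shown here.

    ```python
    import numpy as np

    def holland_gradient_wind(r, p_env=101000.0, p_c=95000.0, r_max=40e3, B=1.5,
                              rho=1.15, lat=30.0):
        """Gradient wind speed (m/s) at radius r (m), following Holland (1980).
        p_env, p_c: environmental and central pressure (Pa); B: shape parameter;
        rho: air density (kg/m^3); lat: latitude (deg) for the Coriolis term.
        All default values are illustrative, not tuned to any storm."""
        f = 2 * 7.292e-5 * np.sin(np.deg2rad(lat))
        dp = p_env - p_c
        a = (r_max / r) ** B
        return np.sqrt(B * dp / rho * a * np.exp(-a) + (r * f / 2) ** 2) - r * f / 2

    if __name__ == "__main__":
        radii = np.array([20e3, 40e3, 80e3, 200e3])
        print(np.round(holland_gradient_wind(radii), 1))   # peaks near r_max, decays outward
    ```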

  8. Site-conditions map for Portugal based on VS measurements: methodology and final model

    NASA Astrophysics Data System (ADS)

    Vilanova, Susana; Narciso, João; Carvalho, João; Lopes, Isabel; Quinta Ferreira, Mario; Moura, Rui; Borges, José; Nemser, Eliza; Pinto, Carlos

    2017-04-01

    In this paper we present a statistically significant site-condition model for Portugal based on shear-wave velocity (VS) data and surface geology. We also evaluate the performance of commonly used Vs30 proxies based on exogenous data and analyze the implications of using those proxies for calculating site amplification in seismic hazard assessment. The dataset contains 161 Vs profiles acquired in Portugal in the context of research projects, technical reports, academic theses and academic papers. The methodologies involved in characterizing the Vs structure at the sites in the database include seismic refraction, multichannel analysis of seismic waves and refraction microtremor. Invasive measurements were performed in selected locations in order to compare the Vs profiles obtained from both invasive and non-invasive techniques. In general there was good agreement in the subsurface structure of Vs30 obtained from the different methodologies. The database flat-file includes information on Vs30, surface geology at 1:50.000 and 1:500.000 scales, and elevation and topographic slope based on the SRTM30 topographic dataset. The procedure used to develop the site-conditions map is based on a three-step process that includes defining a preliminary set of geological units based on the literature, performing statistical tests to assess whether or not the differences in the distributions of Vs30 are statistically significant, and merging the geological units accordingly. The dataset was, to some extent, affected by clustering and/or preferential sampling and therefore a declustering algorithm was applied. The final model includes three geological units: 1) Igneous, metamorphic and old (Paleogene and Mesozoic) sedimentary rocks; 2) Neogene and Pleistocene formations, and 3) Holocene formations. The evaluation of proxies indicates that although geological analogues and topographic slope are in general unbiased, the latter shows significant bias for particular geological units and subsequently for some geographical regions.
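
    The unit-merging step described above rests on a standard comparison of Vs30 distributions between candidate geological units; a scipy sketch follows, using a non-parametric test and synthetic Vs30 samples (the paper does not state which specific test was used, so this is an assumed choice for illustration).

    ```python
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    # Synthetic Vs30 samples (m/s) for two candidate geological units.
    vs30_unit_a = rng.lognormal(mean=np.log(400), sigma=0.25, size=35)
    vs30_unit_b = rng.lognormal(mean=np.log(220), sigma=0.25, size=28)

    stat, p_value = mannwhitneyu(vs30_unit_a, vs30_unit_b, alternative="two-sided")
    if p_value < 0.05:
        print(f"distributions differ significantly: keep units separate (p = {p_value:.3g})")
    else:
        print(f"no significant difference: merge units (p = {p_value:.3g})")
    ```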

  9. DMirNet: Inferring direct microRNA-mRNA association networks.

    PubMed

    Lee, Minsu; Lee, HyungJune

    2016-12-05

    MicroRNAs (miRNAs) play important regulatory roles in the wide range of biological processes by inducing target mRNA degradation or translational repression. Based on the correlation between expression profiles of a miRNA and its target mRNA, various computational methods have previously been proposed to identify miRNA-mRNA association networks by incorporating the matched miRNA and mRNA expression profiles. However, there remain three major issues to be resolved in the conventional computation approaches for inferring miRNA-mRNA association networks from expression profiles. 1) Inferred correlations from the observed expression profiles using conventional correlation-based methods include numerous erroneous links or over-estimated edge weight due to the transitive information flow among direct associations. 2) Due to the high-dimension-low-sample-size problem on the microarray dataset, it is difficult to obtain an accurate and reliable estimate of the empirical correlations between all pairs of expression profiles. 3) Because the previously proposed computational methods usually suffer from varying performance across different datasets, a more reliable model that guarantees optimal or suboptimal performance across different datasets is highly needed. In this paper, we present DMirNet, a new framework for identifying direct miRNA-mRNA association networks. To tackle the aforementioned issues, DMirNet incorporates 1) three direct correlation estimation methods (namely Corpcor, SPACE, Network deconvolution) to infer direct miRNA-mRNA association networks, 2) the bootstrapping method to fully utilize insufficient training expression profiles, and 3) a rank-based Ensemble aggregation to build a reliable and robust model across different datasets. Our empirical experiments on three datasets demonstrate the combinatorial effects of necessary components in DMirNet. Additional performance comparison experiments show that DMirNet outperforms the state-of-the-art Ensemble-based model [1] which has shown the best performance across the same three datasets, with a factor of up to 1.29. Further, we identify 43 putative novel multi-cancer-related miRNA-mRNA association relationships from an inferred Top 1000 direct miRNA-mRNA association network. We believe that DMirNet is a promising method to identify novel direct miRNA-mRNA relations and to elucidate the direct miRNA-mRNA association networks. Since DMirNet infers direct relationships from the observed data, DMirNet can contribute to reconstructing various direct regulatory pathways, including, but not limited to, the direct miRNA-mRNA association networks.
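
    The rank-based aggregation idea mentioned above can be sketched as follows: edge scores from several direct-correlation estimators are converted to ranks and averaged, so that no single estimator's score scale dominates. The estimators themselves (Corpcor, SPACE, network deconvolution) are represented here only by placeholder score vectors; DMirNet's actual aggregation rule may differ in detail.

    ```python
    import numpy as np
    from scipy.stats import rankdata

    def rank_aggregate(score_matrix):
        """score_matrix: (n_methods, n_edges) association scores for candidate miRNA-mRNA edges.
        Returns an aggregated score where higher means more consistently supported across methods."""
        ranks = np.vstack([rankdata(s) for s in score_matrix])   # rank edges within each method
        return ranks.mean(axis=0)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        scores = np.abs(rng.normal(size=(3, 10)))    # stand-ins for three estimators, ten edges
        agg = rank_aggregate(scores)
        top = np.argsort(agg)[::-1][:3]
        print("top-ranked candidate edges:", top)
    ```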

  10. Twente spine model: A complete and coherent dataset for musculo-skeletal modeling of the thoracic and cervical regions of the human spine.

    PubMed

    Bayoglu, Riza; Geeraedts, Leo; Groenen, Karlijn H J; Verdonschot, Nico; Koopman, Bart; Homminga, Jasper

    2017-06-14

    Musculo-skeletal modeling could play a key role in advancing our understanding of the healthy and pathological spine, but the credibility of such models is strictly dependent on the accuracy of the anatomical data incorporated. In this study, we present a complete and coherent musculo-skeletal dataset for the thoracic and cervical regions of the human spine, obtained through detailed dissection of an embalmed male cadaver. We divided the muscles into a number of muscle-tendon elements, digitized their attachments at the bones, and measured morphological muscle parameters. In total, 225 muscle elements were measured over 39 muscles. For every muscle element, we provide the coordinates of its attachments, fiber length, tendon length, sarcomere length, optimal fiber length, pennation angle, mass, and physiological cross-sectional area together with the skeletal geometry of the cadaver. Results were consistent with similar anatomical studies. Furthermore, we report new data for several muscles such as the rotatores, multifidus, levatores costarum, spinalis, semispinalis, subcostales, transversus thoracis, and intercostales. This dataset complements our previous study where we presented a consistent dataset for the lumbar region of the spine (Bayoglu et al., 2017). Therefore, when used together, these datasets form a complete and coherent dataset for the entire spine. The complete dataset will be used to develop a musculo-skeletal model for the entire human spine to study clinical and ergonomic applications. Copyright © 2017 Elsevier Ltd. All rights reserved.

  11. NV PFA - Steptoe Valley

    DOE Data Explorer

    Jim Faulds

    2015-10-29

    All datasets and products specific to the Steptoe Valley model area. Includes a packed ArcMap project (.mpk), individually zipped shapefiles, and a file geodatabase for the northern Steptoe Valley area; a GeoSoft Oasis montaj project containing GM-SYS 2D gravity profiles along the trace of our seismic reflection lines; a 3D model in EarthVision; spreadsheet of links to published maps; and spreadsheets of well data.

  12. 3D printing from MRI Data: Harnessing strengths and minimizing weaknesses.

    PubMed

    Ripley, Beth; Levin, Dmitry; Kelil, Tatiana; Hermsen, Joshua L; Kim, Sooah; Maki, Jeffrey H; Wilson, Gregory J

    2017-03-01

    3D printing facilitates the creation of accurate physical models of patient-specific anatomy from medical imaging datasets. While the majority of models to date are created from computed tomography (CT) data, there is increasing interest in creating models from other datasets, such as ultrasound and magnetic resonance imaging (MRI). MRI, in particular, holds great potential for 3D printing, given its excellent tissue characterization and lack of ionizing radiation. There are, however, challenges to 3D printing from MRI data as well. Here we review the basics of 3D printing, explore the current strengths and weaknesses of printing from MRI data as they pertain to model accuracy, and discuss considerations in the design of MRI sequences for 3D printing. Finally, we explore the future of 3D printing and MRI, including creative applications and new materials. J. Magn. Reson. Imaging 2017;45:635-645. © 2016 International Society for Magnetic Resonance in Medicine.

  13. Limited Sampling Strategy for Accurate Prediction of Pharmacokinetics of Saroglitazar: A 3-point Linear Regression Model Development and Successful Prediction of Human Exposure.

    PubMed

    Joshi, Shuchi N; Srinivas, Nuggehally R; Parmar, Deven V

    2018-03-01

    Our aim was to develop and validate the extrapolative performance of a regression model using a limited sampling strategy for accurate estimation of the area under the plasma concentration versus time curve for saroglitazar. Healthy subject pharmacokinetic data from a well-powered food-effect study (fasted vs fed treatments; n = 50) was used in this work. The first 25 subjects' serial plasma concentration data up to 72 hours and corresponding AUC0-t (ie, 72 hours) from the fasting group comprised a training dataset to develop the limited sampling model. The internal datasets for prediction included the remaining 25 subjects from the fasting group and all 50 subjects from the fed condition of the same study. The external datasets included pharmacokinetic data for saroglitazar from previous single-dose clinical studies. Limited sampling models were composed of 1-, 2-, and 3-concentration-time points' correlation with AUC0-t of saroglitazar. Only models with regression coefficients (R2) >0.90 were screened for further evaluation. The best R2 model was validated for its utility based on mean prediction error, mean absolute prediction error, and root mean square error. Both correlations between predicted and observed AUC0-t of saroglitazar and verification of precision and bias using a Bland-Altman plot were carried out. None of the evaluated 1- and 2-concentration-time points models achieved R2 > 0.90. Among the various 3-concentration-time points models, only 4 equations passed the predefined criterion of R2 > 0.90. Limited sampling models with time points 0.5, 2, and 8 hours (R2 = 0.9323) and 0.75, 2, and 8 hours (R2 = 0.9375) were validated. Mean prediction error, mean absolute prediction error, and root mean square error were <30% (predefined criterion) and correlation (r) was at least 0.7950 for the consolidated internal and external datasets of 102 healthy subjects for the AUC0-t prediction of saroglitazar. The same models, when applied to the AUC0-t prediction of saroglitazar sulfoxide, showed mean prediction error, mean absolute prediction error, and root mean square error <30% and correlation (r) of at least 0.9339 in the same pool of healthy subjects. A 3-concentration-time points limited sampling model predicts the exposure of saroglitazar (ie, AUC0-t) within the predefined acceptable bias and imprecision limits. The same model was also used to predict AUC0-∞. The same limited sampling model was found to predict the exposure of saroglitazar sulfoxide within the predefined criteria. This model can find utility during late-phase clinical development of saroglitazar in the patient population. Copyright © 2018 Elsevier HS Journals, Inc. All rights reserved.
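
    A limited sampling model of this kind is, at its core, a linear regression of AUC0-t on concentrations at three sampling times, screened by R2 and validated with mean prediction error and RMSE. The sketch below simulates concentrations and AUC values purely for illustration; it is not the trial data or the published model coefficients.

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 50
    # Simulated concentrations at three assumed time points (e.g. 0.75, 2, 8 h) and an AUC0-t
    # loosely tied to them, standing in for real pharmacokinetic data.
    C = rng.lognormal(mean=0.0, sigma=0.3, size=(n, 3))
    auc = 2.0 * C[:, 0] + 5.0 * C[:, 1] + 12.0 * C[:, 2] + rng.normal(0, 0.5, n)

    train, test = slice(0, 25), slice(25, None)
    model = LinearRegression().fit(C[train], auc[train])
    pred = model.predict(C[test])

    r2 = model.score(C[train], auc[train])                      # screening criterion: R2 > 0.90
    mpe = np.mean((pred - auc[test]) / auc[test]) * 100         # mean prediction error, %
    mape = np.mean(np.abs(pred - auc[test]) / auc[test]) * 100  # mean absolute prediction error, %
    rmse = np.sqrt(np.mean((pred - auc[test]) ** 2))
    print(f"R2={r2:.3f}  MPE={mpe:.1f}%  MAPE={mape:.1f}%  RMSE={rmse:.2f}")
    ```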

  14. Towards improved parameterization of a macroscale hydrologic model in a discontinuous permafrost boreal forest ecosystem

    DOE PAGES

    Endalamaw, Abraham; Bolton, W. Robert; Young-Robertson, Jessica M.; ...

    2017-09-14

    Modeling hydrological processes in the Alaskan sub-arctic is challenging because of the extreme spatial heterogeneity in soil properties and vegetation communities. Nevertheless, modeling and predicting hydrological processes is critical in this region due to its vulnerability to the effects of climate change. Coarse-spatial-resolution datasets used in land surface modeling pose a new challenge in simulating the spatially distributed and basin-integrated processes since these datasets do not adequately represent the small-scale hydrological, thermal, and ecological heterogeneity. The goal of this study is to improve the prediction capacity of mesoscale to large-scale hydrological models by introducing a small-scale parameterization scheme, which better represents the spatial heterogeneity of soil properties and vegetation cover in the Alaskan sub-arctic. The small-scale parameterization schemes are derived from observations and a sub-grid parameterization method in the two contrasting sub-basins of the Caribou Poker Creek Research Watershed (CPCRW) in Interior Alaska: one nearly permafrost-free (LowP) sub-basin and one permafrost-dominated (HighP) sub-basin. The sub-grid parameterization method used in the small-scale parameterization scheme is derived from the watershed topography. We found that observed soil thermal and hydraulic properties – including the distribution of permafrost and vegetation cover heterogeneity – are better represented in the sub-grid parameterization method than the coarse-resolution datasets. Parameters derived from the coarse-resolution datasets and from the sub-grid parameterization method are implemented into the variable infiltration capacity (VIC) mesoscale hydrological model to simulate runoff, evapotranspiration (ET), and soil moisture in the two sub-basins of the CPCRW. Simulated hydrographs based on the small-scale parameterization capture most of the peak and low flows, with similar accuracy in both sub-basins, compared to simulated hydrographs based on the coarse-resolution datasets. On average, the small-scale parameterization scheme improves the total runoff simulation by up to 50 % in the LowP sub-basin and by up to 10 % in the HighP sub-basin from the large-scale parameterization. This study shows that the proposed sub-grid parameterization method can be used to improve the performance of mesoscale hydrological models in the Alaskan sub-arctic watersheds.

  15. Towards improved parameterization of a macroscale hydrologic model in a discontinuous permafrost boreal forest ecosystem

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Endalamaw, Abraham; Bolton, W. Robert; Young-Robertson, Jessica M.

    Modeling hydrological processes in the Alaskan sub-arctic is challenging because of the extreme spatial heterogeneity in soil properties and vegetation communities. Nevertheless, modeling and predicting hydrological processes is critical in this region due to its vulnerability to the effects of climate change. Coarse-spatial-resolution datasets used in land surface modeling pose a new challenge in simulating the spatially distributed and basin-integrated processes since these datasets do not adequately represent the small-scale hydrological, thermal, and ecological heterogeneity. The goal of this study is to improve the prediction capacity of mesoscale to large-scale hydrological models by introducing a small-scale parameterization scheme, which better represents the spatial heterogeneity of soil properties and vegetation cover in the Alaskan sub-arctic. The small-scale parameterization schemes are derived from observations and a sub-grid parameterization method in the two contrasting sub-basins of the Caribou Poker Creek Research Watershed (CPCRW) in Interior Alaska: one nearly permafrost-free (LowP) sub-basin and one permafrost-dominated (HighP) sub-basin. The sub-grid parameterization method used in the small-scale parameterization scheme is derived from the watershed topography. We found that observed soil thermal and hydraulic properties – including the distribution of permafrost and vegetation cover heterogeneity – are better represented in the sub-grid parameterization method than the coarse-resolution datasets. Parameters derived from the coarse-resolution datasets and from the sub-grid parameterization method are implemented into the variable infiltration capacity (VIC) mesoscale hydrological model to simulate runoff, evapotranspiration (ET), and soil moisture in the two sub-basins of the CPCRW. Simulated hydrographs based on the small-scale parameterization capture most of the peak and low flows, with similar accuracy in both sub-basins, compared to simulated hydrographs based on the coarse-resolution datasets. On average, the small-scale parameterization scheme improves the total runoff simulation by up to 50 % in the LowP sub-basin and by up to 10 % in the HighP sub-basin from the large-scale parameterization. This study shows that the proposed sub-grid parameterization method can be used to improve the performance of mesoscale hydrological models in the Alaskan sub-arctic watersheds.

  16. SpectralNET – an application for spectral graph analysis and visualization

    PubMed Central

    Forman, Joshua J; Clemons, Paul A; Schreiber, Stuart L; Haggarty, Stephen J

    2005-01-01

    Background Graph theory provides a computational framework for modeling a variety of datasets including those emerging from genomics, proteomics, and chemical genetics. Networks of genes, proteins, small molecules, or other objects of study can be represented as graphs of nodes (vertices) and interactions (edges) that can carry different weights. SpectralNET is a flexible application for analyzing and visualizing these biological and chemical networks. Results Available both as a standalone .NET executable and as an ASP.NET web application, SpectralNET was designed specifically with the analysis of graph-theoretic metrics in mind, a computational task not easily accessible using currently available applications. Users can choose either to upload a network for analysis using a variety of input formats, or to have SpectralNET generate an idealized random network for comparison to a real-world dataset. Whichever graph-generation method is used, SpectralNET displays detailed information about each connected component of the graph, including graphs of degree distribution, clustering coefficient by degree, and average distance by degree. In addition, extensive information about the selected vertex is shown, including degree, clustering coefficient, various distance metrics, and the corresponding components of the adjacency, Laplacian, and normalized Laplacian eigenvectors. SpectralNET also displays several graph visualizations, including a linear dimensionality reduction for uploaded datasets (Principal Components Analysis) and a non-linear dimensionality reduction that provides an elegant view of global graph structure (Laplacian eigenvectors). Conclusion SpectralNET provides an easily accessible means of analyzing graph-theoretic metrics for data modeling and dimensionality reduction. SpectralNET is publicly available as both a .NET application and an ASP.NET web application from http://chembank.broad.harvard.edu/resources/. Source code is available upon request. PMID:16236170

  17. SpectralNET--an application for spectral graph analysis and visualization.

    PubMed

    Forman, Joshua J; Clemons, Paul A; Schreiber, Stuart L; Haggarty, Stephen J

    2005-10-19

    Graph theory provides a computational framework for modeling a variety of datasets including those emerging from genomics, proteomics, and chemical genetics. Networks of genes, proteins, small molecules, or other objects of study can be represented as graphs of nodes (vertices) and interactions (edges) that can carry different weights. SpectralNET is a flexible application for analyzing and visualizing these biological and chemical networks. Available both as a standalone .NET executable and as an ASP.NET web application, SpectralNET was designed specifically with the analysis of graph-theoretic metrics in mind, a computational task not easily accessible using currently available applications. Users can choose either to upload a network for analysis using a variety of input formats, or to have SpectralNET generate an idealized random network for comparison to a real-world dataset. Whichever graph-generation method is used, SpectralNET displays detailed information about each connected component of the graph, including graphs of degree distribution, clustering coefficient by degree, and average distance by degree. In addition, extensive information about the selected vertex is shown, including degree, clustering coefficient, various distance metrics, and the corresponding components of the adjacency, Laplacian, and normalized Laplacian eigenvectors. SpectralNET also displays several graph visualizations, including a linear dimensionality reduction for uploaded datasets (Principal Components Analysis) and a non-linear dimensionality reduction that provides an elegant view of global graph structure (Laplacian eigenvectors). SpectralNET provides an easily accessible means of analyzing graph-theoretic metrics for data modeling and dimensionality reduction. SpectralNET is publicly available as both a .NET application and an ASP.NET web application from http://chembank.broad.harvard.edu/resources/. Source code is available upon request.
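    The two SpectralNET records above revolve around standard graph-theoretic quantities: degree, clustering coefficient, and Laplacian eigenvectors used for a spectral embedding. The sketch below reproduces those quantities with networkx/numpy on a random graph; it is an illustration of the metrics, not SpectralNET code.

```python
# Illustrative computation of degree, clustering coefficient and Laplacian eigenvectors.
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(n=50, p=0.1, seed=1)     # idealized random network for comparison

degrees = dict(G.degree())
clustering = nx.clustering(G)
print("mean degree:", np.mean(list(degrees.values())))
print("mean clustering coefficient:", np.mean(list(clustering.values())))

# Laplacian and normalized Laplacian eigenvectors (basis of the spectral layout).
L = nx.laplacian_matrix(G).toarray().astype(float)
Ln = nx.normalized_laplacian_matrix(G).toarray()
w, v = np.linalg.eigh(L)
wn, vn = np.linalg.eigh(Ln)

# Eigenvectors of the two smallest non-zero eigenvalues give a 2-D spectral embedding.
embedding = v[:, 1:3]
print("embedding shape:", embedding.shape)
```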

  18. Linking Field and Satellite Observations to Reveal Differences in Single vs. Double-Cropped Soybean Yields in Central Brazil

    NASA Astrophysics Data System (ADS)

    Jeffries, G. R.; Cohn, A.

    2016-12-01

    Soy-corn double cropping (DC) has been widely adopted in Central Brazil alongside single cropped (SC) soybean production. DC involves different cropping calendars and soy varieties, and may be associated with different crop yield patterns and volatility than SC. Study of the performance of the region's agriculture in a changing climate depends on tracking differences in the productivity of SC vs. DC, but has been limited by crop yield data that conflate the two systems. We predicted SC and DC yields across Central Brazil, drawing on field observations and remotely sensed data. We first modeled field yield estimates as a function of remotely sensed DC status and vegetation index (VI) metrics, and other management and biophysical factors. We then used the estimated statistical model to predict SC and DC soybean yields at each 500 m grid cell of Central Brazil for harvest years 2001 - 2015. The yield estimation model was constructed using 1) a repeated cross-sectional survey of soybean yields and management factors for years 2007-2015, 2) a custom agricultural land cover classification dataset which assimilates earlier datasets for the region, and 3) 500m 8-day MODIS image composites used to calculate the wide dynamic range vegetation index (WDRVI) and derivative metrics such as area under the curve for WDRVI values in critical crop development periods. A statistical yield estimation model which primarily entails WDRVI metrics, DC status, and spatial fixed effects was developed on a subset of the yield dataset. Model validation was conducted by predicting previously withheld yield records, and then assessing error and goodness-of-fit for predicted values with metrics including root mean squared error (RMSE), mean squared error (MSE), and R2. We found a statistical yield estimation model which incorporates WDRVI and DC status to be an effective way to estimate crop yields over the region. Statistical properties of the resulting gridded yield dataset may be valuable for understanding linkages between crop yields, farm management factors, and climate.
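    For reference, a common formulation of the WDRVI (with a weighting coefficient alpha typically around 0.1-0.2) and the kind of area-under-the-curve metric mentioned above are sketched below; band values, the alpha value, and the time series are illustrative assumptions, not data from the study.

```python
# Sketch of the wide dynamic range vegetation index and a seasonal AUC metric.
import numpy as np

def wdrvi(nir, red, alpha=0.2):
    """WDRVI = (alpha*NIR - red) / (alpha*NIR + red)."""
    nir, red = np.asarray(nir, float), np.asarray(red, float)
    return (alpha * nir - red) / (alpha * nir + red)

# Hypothetical 8-day composite time series for one pixel over a growing season.
days = np.arange(0, 120, 8)
nir = 0.35 + 0.25 * np.sin(np.pi * days / 120)
red = 0.12 - 0.06 * np.sin(np.pi * days / 120)

index = wdrvi(nir, red)
auc = np.trapz(index, days)          # area under the WDRVI curve over the window
print("season AUC of WDRVI:", round(auc, 2))
```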

  19. Clearing your Desk! Software and Data Services for Collaborative Web Based GIS Analysis

    NASA Astrophysics Data System (ADS)

    Tarboton, D. G.; Idaszak, R.; Horsburgh, J. S.; Ames, D. P.; Goodall, J. L.; Band, L. E.; Merwade, V.; Couch, A.; Hooper, R. P.; Maidment, D. R.; Dash, P. K.; Stealey, M.; Yi, H.; Gan, T.; Gichamo, T.; Yildirim, A. A.; Liu, Y.

    2015-12-01

    Can your desktop computer crunch the large GIS datasets that are becoming increasingly common across the geosciences? Do you have access to or the know-how to take advantage of advanced high performance computing (HPC) capability? Web based cyberinfrastructure takes work off your desk or laptop computer and onto infrastructure or "cloud" based data and processing servers. This talk will describe the HydroShare collaborative environment and web based services being developed to support the sharing and processing of hydrologic data and models. HydroShare supports the upload, storage, and sharing of a broad class of hydrologic data including time series, geographic features and raster datasets, multidimensional space-time data, and other structured collections of data. Web service tools and a Python client library provide researchers with access to HPC resources without requiring them to become HPC experts. This reduces the time and effort spent in finding and organizing the data required to prepare the inputs for hydrologic models and facilitates the management of online data and execution of models on HPC systems. This presentation will illustrate the use of web based data and computation services from both the browser and desktop client software. These web-based services implement the Terrain Analysis Using Digital Elevation Model (TauDEM) tools for watershed delineation, generation of hydrology-based terrain information, and preparation of hydrologic model inputs. They allow users to develop scripts on their desktop computer that call analytical functions that are executed completely in the cloud, on HPC resources using input datasets stored in the cloud, without installing specialized software, learning how to use HPC, or transferring large datasets back to the user's desktop. These cases serve as examples for how this approach can be extended to other models to enhance the use of web and data services in the geosciences.

  20. Twin Data That Made a Big Difference, and That Deserve to Be Better-Known and Used in Teaching

    ERIC Educational Resources Information Center

    Campbell, Harlan; Hanley, James A.

    2017-01-01

    Because of their efficiency and ability to keep many other factors constant, twin studies have a special appeal for investigators. Just as with any teaching dataset, a "matched-sets" dataset used to illustrate a statistical model should be compelling, still relevant, and valid. Indeed, such a "model dataset" should meet the…

  1. Gene-Expression Signature Predicts Postoperative Recurrence in Stage I Non-Small Cell Lung Cancer Patients

    PubMed Central

    Lu, Yan; Wang, Liang; Liu, Pengyuan; Yang, Ping; You, Ming

    2012-01-01

    About 30% of stage I non-small cell lung cancer (NSCLC) patients undergoing resection will recur. Robust prognostic markers are required to better manage therapy options. The purpose of this study is to develop and validate a novel gene-expression signature that can predict tumor recurrence of stage I NSCLC patients. Cox proportional hazards regression analysis was performed to identify recurrence-related genes and a partial Cox regression model was used to generate a gene signature of recurrence in the training dataset (142 stage I lung adenocarcinomas without adjunctive therapy from the Director's Challenge Consortium). Four independent validation datasets, including GSE5843, GSE8894, and two other datasets provided by Mayo Clinic and Washington University, were used to assess the prediction accuracy by calculating the correlation between risk score estimated from gene expression and real recurrence-free survival time and the AUC of time-dependent ROC analysis. Pathway-based survival analyses were also performed. 104 probesets correlated with recurrence in the training dataset. They are enriched in cell adhesion, apoptosis and regulation of cell proliferation. A 51-gene expression signature was identified to distinguish patients likely to develop tumor recurrence (Dxy = −0.83, P<1e-16) and this signature was validated in four independent datasets with AUC >85%. Multiple pathways including leukocyte transendothelial migration and cell adhesion were highly correlated with recurrence-free survival. The gene signature is highly predictive of recurrence in stage I NSCLC patients, which has important prognostic and therapeutic implications for the future management of these patients. PMID:22292069
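    A gene-by-gene Cox screen of the kind described above can be sketched with the lifelines package; the simulated expression matrix, gene names, effect sizes, and p-value cutoff below are all hypothetical and only illustrate the screening step, not the published signature.

```python
# Toy Cox proportional hazards screen for recurrence-related genes (simulated data).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(42)
n = 142
expr = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"gene_{i}" for i in range(5)])

# Simulated recurrence-free survival times and event indicators.
risk = 0.8 * expr["gene_0"] - 0.5 * expr["gene_3"]
time = rng.exponential(scale=np.exp(-risk))
event = rng.integers(0, 2, size=n)

pvalues = {}
for g in expr.columns:
    df = pd.DataFrame({"time": time, "event": event, g: expr[g]})
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    pvalues[g] = float(cph.summary.loc[g, "p"])

# Keep genes passing an (illustrative) cutoff as candidate signature genes.
signature = [g for g, p in pvalues.items() if p < 0.05]
print("candidate recurrence-related genes:", signature)
```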

  2. Peeking Beneath the Caldera: Communicating Subsurface Knowledge of Newberry Volcano

    NASA Astrophysics Data System (ADS)

    Mark-Moser, M.; Rose, K.; Schultz, J.; Cameron, E.

    2016-12-01

    "Imaging the Subsurface: Enhanced Geothermal Systems and Exploring Beneath Newberry Volcano" is an interactive website that presents a three-dimensional subsurface model of Newberry Volcano developed at National Energy Technology Laboratory (NETL). Created using the Story Maps application by ArcGIS Online, this format's dynamic capabilities provide the user the opportunity for multimedia engagement with the datasets and information used to build the subsurface model. This website allows for an interactive experience that the user dictates, including interactive maps, instructive videos and video capture of the subsurface model, and linked information throughout the text. This Story Map offers a general background on the technology of enhanced geothermal systems and the geologic and development history of Newberry Volcano before presenting NETL's modeling efforts that support the installation of enhanced geothermal systems. The model is driven by multiple geologic and geophysical datasets to compare and contrast results which allow for the targeting of potential EGS sites and the reduction of subsurface uncertainty. This Story Map aims to communicate to a broad audience, and provides a platform to effectively introduce the model to researchers and stakeholders.

  3. A Human Activity Recognition System Based on Dynamic Clustering of Skeleton Data.

    PubMed

    Manzi, Alessandro; Dario, Paolo; Cavallo, Filippo

    2017-05-11

    Human activity recognition is an important area in computer vision, with a wide range of applications including ambient assisted living. In this paper, an activity recognition system based on skeleton data extracted from a depth camera is presented. The system makes use of machine learning techniques to classify the actions that are described with a set of a few basic postures. The training phase creates several models related to the number of clustered postures by means of a multiclass Support Vector Machine (SVM), trained with Sequential Minimal Optimization (SMO). The classification phase adopts the X-means algorithm to find the optimal number of clusters dynamically. The contribution of the paper is twofold. The first aim is to perform activity recognition employing features based on a small number of informative postures, extracted independently from each activity instance; secondly, it aims to assess the minimum number of frames needed for an adequate classification. The system is evaluated on two publicly available datasets, the Cornell Activity Dataset (CAD-60) and the Telecommunication Systems Team (TST) Fall detection dataset. The number of clusters needed to model each instance ranges from two to four elements. The proposed approach reaches excellent performances using only about 4 s of input data (~100 frames) and outperforms the state of the art when it uses approximately 500 frames on the CAD-60 dataset. The results are promising for tests in a real-world context.
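    The two-stage idea above (cluster each instance into a few postures, then classify with a multiclass SVM) can be sketched with scikit-learn; KMeans stands in for the X-means step, and the skeleton feature layout, class count, and labels are simulated assumptions, not the paper's pipeline.

```python
# Rough sketch: posture clustering followed by multiclass SVM classification.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical skeleton features: 100 frames x 45 values (15 joints x 3 coordinates).
frames = rng.normal(size=(100, 45))

# Stage 1: summarize one activity instance by a few posture centroids.
k = 4                                    # X-means would choose k automatically (2-4 in the paper)
postures = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frames).cluster_centers_
feature_vector = postures.ravel()        # fixed-length descriptor of the instance

# Stage 2: multiclass SVM over many such instances (labels simulated here).
X = rng.normal(size=(60, k * 45))
y = rng.integers(0, 5, size=60)          # 5 hypothetical activity classes
clf = SVC(kernel="rbf").fit(X, y)
print("predicted class for one instance:", clf.predict(feature_vector.reshape(1, -1))[0])
```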

  4. QSAR studies of the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors by multiple linear regression (MLR) and support vector machine (SVM).

    PubMed

    Qin, Zijian; Wang, Maolin; Yan, Aixia

    2017-07-01

    In this study, quantitative structure-activity relationship (QSAR) models using various descriptor sets and training/test set selection methods were explored to predict the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors by using a multiple linear regression (MLR) and a support vector machine (SVM) method. 512 HCV NS3/4A protease inhibitors and their IC50 values, which were determined by the same FRET assay, were collected from the reported literature to build a dataset. All the inhibitors were represented with nine selected global descriptors and 12 2D property-weighted autocorrelation descriptors calculated from the program CORINA Symphony. The dataset was divided into a training set and a test set by a random and a Kohonen's self-organizing map (SOM) method. The correlation coefficients (r2) of training sets and test sets were 0.75 and 0.72 for the best MLR model, 0.87 and 0.85 for the best SVM model, respectively. In addition, a series of sub-dataset models were also developed. The performances of all the best sub-dataset models were better than those of the whole dataset models. We believe that the combination of the best sub- and whole dataset SVM models can be used as reliable lead designing tools for new NS3/4A protease inhibitor scaffolds in a drug discovery pipeline. Copyright © 2017 Elsevier Ltd. All rights reserved.
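    A hedged sketch of the MLR/SVM comparison described above is given below, using scikit-learn on random descriptor data; the descriptor matrix and activity values are simulated, and a random split stands in for the Kohonen SOM-based selection.

```python
# Toy QSAR regression with MLR and SVM on simulated descriptors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
X = rng.normal(size=(512, 21))                            # 9 global + 12 autocorrelation descriptors
y = X @ rng.normal(size=21) + rng.normal(0, 0.5, 512)     # simulated activity values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)

mlr = LinearRegression().fit(X_tr, y_tr)
svm = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)

for name, model in [("MLR", mlr), ("SVM", svm)]:
    print(name,
          "train r2 =", round(r2_score(y_tr, model.predict(X_tr)), 2),
          "test r2 =", round(r2_score(y_te, model.predict(X_te)), 2))
```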

  5. Developing and testing temperature models for regulated systems: a case study on the Upper Delaware River

    USGS Publications Warehouse

    Cole, Jeffrey C.; Maloney, Kelly O.; Schmid, Matthias; McKenna, James E.

    2014-01-01

    Water temperature is an important driver of many processes in riverine ecosystems. If reservoirs are present, their releases can greatly influence downstream water temperatures. Models are important tools in understanding the influence these releases may have on the thermal regimes of downstream rivers. In this study, we developed and tested a suite of models to predict river temperature at a location downstream of two reservoirs in the Upper Delaware River (USA), a section of river that is managed to support a world-class coldwater fishery. Three empirical models were tested, including a Generalized Least Squares Model with a cosine trend (GLScos), AutoRegressive Integrated Moving Average (ARIMA), and Artificial Neural Network (ANN). We also tested one mechanistic Heat Flux Model (HFM) that was based on energy gain and loss. Predictor variables used in model development included climate data (e.g., solar radiation, wind speed, etc.) collected from a nearby weather station and temperature and hydrologic data from upstream U.S. Geological Survey gages. Models were developed with a training dataset that consisted of data from 2008 to 2011; they were then independently validated with a test dataset from 2012. Model accuracy was evaluated using root mean square error (RMSE), Nash Sutcliffe efficiency (NSE), percent bias (PBIAS), and index of agreement (d) statistics. Model forecast success was evaluated using baseline-modified prime index of agreement (md) at the one, three, and five day predictions. All five models accurately predicted daily mean river temperature across the entire training dataset (RMSE = 0.58–1.311, NSE = 0.99–0.97, d = 0.98–0.99); ARIMA was most accurate (RMSE = 0.57, NSE = 0.99), but each model, other than ARIMA, showed short periods of under- or over-predicting observed warmer temperatures. For the training dataset, all models besides ARIMA had overestimation bias (PBIAS = −0.10 to −1.30). Validation analyses showed all models performed well; the HFM model was the most accurate compared to other models (RMSE = 0.92, both NSE = 0.98, d = 0.99) and the ARIMA model was least accurate (RMSE = 2.06, NSE = 0.92, d = 0.98); however, all models had an overestimation bias (PBIAS = −4.1 to −10.20). Aside from the one day forecast ARIMA model (md = 0.53), all models forecasted fairly well at the one, three, and five day forecasts (md = 0.77–0.96). Overall, we were successful in developing models predicting daily mean temperature across a broad range of temperatures. These models, specifically the GLScos, ANN, and HFM, may serve as important tools for predicting conditions and managing thermal releases in regulated river systems such as the Delaware River. Further model development may be important in customizing predictions for particular biological or ecological needs, or for particular temporal or spatial scales.

  6. Developing and testing temperature models for regulated systems: A case study on the Upper Delaware River

    NASA Astrophysics Data System (ADS)

    Cole, Jeffrey C.; Maloney, Kelly O.; Schmid, Matthias; McKenna, James E.

    2014-11-01

    Water temperature is an important driver of many processes in riverine ecosystems. If reservoirs are present, their releases can greatly influence downstream water temperatures. Models are important tools in understanding the influence these releases may have on the thermal regimes of downstream rivers. In this study, we developed and tested a suite of models to predict river temperature at a location downstream of two reservoirs in the Upper Delaware River (USA), a section of river that is managed to support a world-class coldwater fishery. Three empirical models were tested, including a Generalized Least Squares Model with a cosine trend (GLScos), AutoRegressive Integrated Moving Average (ARIMA), and Artificial Neural Network (ANN). We also tested one mechanistic Heat Flux Model (HFM) that was based on energy gain and loss. Predictor variables used in model development included climate data (e.g., solar radiation, wind speed, etc.) collected from a nearby weather station and temperature and hydrologic data from upstream U.S. Geological Survey gages. Models were developed with a training dataset that consisted of data from 2008 to 2011; they were then independently validated with a test dataset from 2012. Model accuracy was evaluated using root mean square error (RMSE), Nash Sutcliffe efficiency (NSE), percent bias (PBIAS), and index of agreement (d) statistics. Model forecast success was evaluated using baseline-modified prime index of agreement (md) at the one, three, and five day predictions. All five models accurately predicted daily mean river temperature across the entire training dataset (RMSE = 0.58-1.311, NSE = 0.99-0.97, d = 0.98-0.99); ARIMA was most accurate (RMSE = 0.57, NSE = 0.99), but each model, other than ARIMA, showed short periods of under- or over-predicting observed warmer temperatures. For the training dataset, all models besides ARIMA had overestimation bias (PBIAS = -0.10 to -1.30). Validation analyses showed all models performed well; the HFM model was the most accurate compared to other models (RMSE = 0.92, both NSE = 0.98, d = 0.99) and the ARIMA model was least accurate (RMSE = 2.06, NSE = 0.92, d = 0.98); however, all models had an overestimation bias (PBIAS = -4.1 to -10.20). Aside from the one day forecast ARIMA model (md = 0.53), all models forecasted fairly well at the one, three, and five day forecasts (md = 0.77-0.96). Overall, we were successful in developing models predicting daily mean temperature across a broad range of temperatures. These models, specifically the GLScos, ANN, and HFM, may serve as important tools for predicting conditions and managing thermal releases in regulated river systems such as the Delaware River. Further model development may be important in customizing predictions for particular biological or ecological needs, or for particular temporal or spatial scales.
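    The evaluation statistics named in the two records above (RMSE, NSE, PBIAS, and the index of agreement d) are shown below in their standard forms; this is a reference sketch with made-up example values, not code from the study.

```python
# Standard goodness-of-fit statistics for observed vs simulated series.
import numpy as np

def evaluate(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    resid = obs - sim
    rmse = np.sqrt(np.mean(resid ** 2))
    nse = 1.0 - np.sum(resid ** 2) / np.sum((obs - obs.mean()) ** 2)
    pbias = 100.0 * np.sum(resid) / np.sum(obs)            # negative => overestimation
    d = 1.0 - np.sum(resid ** 2) / np.sum(
        (np.abs(sim - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    return {"RMSE": rmse, "NSE": nse, "PBIAS": pbias, "d": d}

# Example with made-up daily mean temperatures (deg C).
obs = np.array([10.2, 11.0, 12.5, 13.1, 12.0, 11.4])
sim = np.array([10.5, 11.2, 12.9, 13.4, 12.3, 11.8])
print(evaluate(obs, sim))
```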

  7. A Reliable Methodology for Determining Seed Viability by Using Hyperspectral Data from Two Sides of Wheat Seeds.

    PubMed

    Zhang, Tingting; Wei, Wensong; Zhao, Bin; Wang, Ranran; Li, Mingliu; Yang, Liming; Wang, Jianhua; Sun, Qun

    2018-03-08

    This study investigated the possibility of using visible and near-infrared (VIS/NIR) hyperspectral imaging techniques to discriminate viable and non-viable wheat seeds. Both sides of individual seeds were subjected to hyperspectral imaging (400-1000 nm) to acquire reflectance spectral data. Four spectral datasets, including the ventral groove side, reverse side, mean (the mean of two sides' spectra of every seed), and mixture datasets (two sides' spectra of every seed), were used to construct the models. Classification models, partial least squares discriminant analysis (PLS-DA), and support vector machines (SVM), coupled with some pre-processing methods and successive projections algorithm (SPA), were built for the identification of viable and non-viable seeds. Our results showed that the standard normal variate (SNV)-SPA-PLS-DA model had high classification accuracy for whole seeds (>85.2%) and for viable seeds (>89.5%), and that the prediction set was based on a mixed spectral dataset by only using 16 wavebands. After screening with this model, the final germination of the seed lot could be higher than 89.5%. Here, we develop a reliable methodology for predicting the viability of wheat seeds, showing that the VIS/NIR hyperspectral imaging is an accurate technique for the classification of viable and non-viable wheat seeds in a non-destructive manner.

  8. A Reliable Methodology for Determining Seed Viability by Using Hyperspectral Data from Two Sides of Wheat Seeds

    PubMed Central

    Zhang, Tingting; Wei, Wensong; Zhao, Bin; Wang, Ranran; Li, Mingliu; Yang, Liming; Wang, Jianhua; Sun, Qun

    2018-01-01

    This study investigated the possibility of using visible and near-infrared (VIS/NIR) hyperspectral imaging techniques to discriminate viable and non-viable wheat seeds. Both sides of individual seeds were subjected to hyperspectral imaging (400–1000 nm) to acquire reflectance spectral data. Four spectral datasets, including the ventral groove side, reverse side, mean (the mean of two sides’ spectra of every seed), and mixture datasets (two sides’ spectra of every seed), were used to construct the models. Classification models, partial least squares discriminant analysis (PLS-DA), and support vector machines (SVM), coupled with some pre-processing methods and successive projections algorithm (SPA), were built for the identification of viable and non-viable seeds. Our results showed that the standard normal variate (SNV)-SPA-PLS-DA model had high classification accuracy for whole seeds (>85.2%) and for viable seeds (>89.5%), and that the prediction set was based on a mixed spectral dataset by only using 16 wavebands. After screening with this model, the final germination of the seed lot could be higher than 89.5%. Here, we develop a reliable methodology for predicting the viability of wheat seeds, showing that the VIS/NIR hyperspectral imaging is an accurate technique for the classification of viable and non-viable wheat seeds in a non-destructive manner. PMID:29517991
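    The SNV pre-processing and PLS-DA classification described in the two wheat-seed records above can be approximated with scikit-learn components; the spectra, class labels, number of latent variables, and 0.5 decision threshold below are illustrative assumptions (scikit-learn has no built-in PLS-DA, so a PLSRegression on a 0/1 target is used instead).

```python
# Sketch: standard normal variate (SNV) pre-processing followed by a PLS-DA-style classifier.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum individually."""
    spectra = np.asarray(spectra, float)
    return (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 16))          # e.g. reflectance at 16 selected wavebands
y = rng.integers(0, 2, size=200)        # 1 = viable, 0 = non-viable (simulated)

X_tr, X_te, y_tr, y_te = train_test_split(snv(X), y, test_size=0.3, random_state=3)

plsda = PLSRegression(n_components=5).fit(X_tr, y_tr)
y_pred = (plsda.predict(X_te).ravel() > 0.5).astype(int)
print("classification accuracy:", (y_pred == y_te).mean())
```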

  9. Learning Maximal Entropy Models from finite size datasets: a fast Data-Driven algorithm allows to sample from the posterior distribution

    NASA Astrophysics Data System (ADS)

    Ferrari, Ulisse

    A maximal entropy model provides the least constrained probability distribution that reproduces experimental averages of an observables set. In this work we characterize the learning dynamics that maximizes the log-likelihood in the case of large but finite datasets. We first show how the steepest descent dynamics is not optimal as it is slowed down by the inhomogeneous curvature of the model parameter space. We then provide a way for rectifying this space which relies only on dataset properties and does not require large computational efforts. We conclude by solving the long-time limit of the parameter dynamics including the randomness generated by the systematic use of Gibbs sampling. In this stochastic framework, rather than converging to a fixed point, the dynamics reaches a stationary distribution, which for the rectified dynamics reproduces the posterior distribution of the parameters. We sum up all these insights in a "rectified" Data-Driven algorithm that is fast and by sampling from the parameter posterior avoids both under- and over-fitting along all the directions of the parameter space. Through the learning of pairwise Ising models from the recording of a large population of retina neurons, we show how our algorithm outperforms the steepest descent method. This research was supported by a Grant from the Human Brain Project (HBP CLAP).
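    The core of the log-likelihood ascent described above is moment matching: parameters move along the difference between empirical and model averages. The toy sketch below implements that gradient update for a tiny pairwise Ising model, computing model averages by exact enumeration rather than the Gibbs sampling used in the work; all data, learning rates, and sizes are illustrative.

```python
# Toy maximum-entropy (pairwise Ising) learning by gradient ascent on the log-likelihood.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4
data = rng.integers(0, 2, size=(500, n)) * 2 - 1            # spins in {-1, +1}
emp_mean = data.mean(axis=0)
emp_corr = (data.T @ data) / len(data)

states = np.array(list(itertools.product([-1, 1], repeat=n)))
h = np.zeros(n)
J = np.zeros((n, n))

for step in range(2000):
    energies = states @ h + 0.5 * np.einsum("si,ij,sj->s", states, J, states)
    p = np.exp(energies)
    p /= p.sum()
    model_mean = p @ states
    model_corr = states.T @ (p[:, None] * states)
    # Gradient of the log-likelihood: empirical averages minus model averages.
    h += 0.05 * (emp_mean - model_mean)
    J += 0.05 * (emp_corr - model_corr)
    np.fill_diagonal(J, 0.0)

print("max error on means:", np.abs(emp_mean - model_mean).max())
```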

  10. An electrical load measurements dataset of United Kingdom households from a two-year longitudinal study

    PubMed Central

    Murray, David; Stankovic, Lina; Stankovic, Vladimir

    2017-01-01

    Smart meter roll-outs provide easy access to granular meter measurements, enabling advanced energy services, ranging from demand response measures to tailored energy feedback and smart home/building automation. To design such services, and to train and validate models, access to data that resembles what is expected of smart meters, collected in a real-world setting, is necessary. The REFIT electrical load measurements dataset described in this paper includes whole house aggregate loads and nine individual appliance measurements at 8-second intervals per house, collected continuously over a period of two years from 20 houses. During monitoring, the occupants were conducting their usual routines. At the time of publishing, the dataset has the largest number of houses monitored in the United Kingdom at less than 1-minute intervals over a period greater than one year. The dataset comprises 1,194,958,790 readings that represent over 250,000 monitored appliance uses. The data is accessible in an easy-to-use comma-separated format, is time-stamped and cleaned to remove invalid measurements, correctly label appliance data and fill in small gaps of missing data. PMID:28055033

  11. An electrical load measurements dataset of United Kingdom households from a two-year longitudinal study

    NASA Astrophysics Data System (ADS)

    Murray, David; Stankovic, Lina; Stankovic, Vladimir

    2017-01-01

    Smart meter roll-outs provide easy access to granular meter measurements, enabling advanced energy services, ranging from demand response measures to tailored energy feedback and smart home/building automation. To design such services, and to train and validate models, access to data that resembles what is expected of smart meters, collected in a real-world setting, is necessary. The REFIT electrical load measurements dataset described in this paper includes whole house aggregate loads and nine individual appliance measurements at 8-second intervals per house, collected continuously over a period of two years from 20 houses. During monitoring, the occupants were conducting their usual routines. At the time of publishing, the dataset has the largest number of houses monitored in the United Kingdom at less than 1-minute intervals over a period greater than one year. The dataset comprises 1,194,958,790 readings that represent over 250,000 monitored appliance uses. The data is accessible in an easy-to-use comma-separated format, is time-stamped and cleaned to remove invalid measurements, correctly label appliance data and fill in small gaps of missing data.

  12. An electrical load measurements dataset of United Kingdom households from a two-year longitudinal study.

    PubMed

    Murray, David; Stankovic, Lina; Stankovic, Vladimir

    2017-01-05

    Smart meter roll-outs provide easy access to granular meter measurements, enabling advanced energy services, ranging from demand response measures to tailored energy feedback and smart home/building automation. To design such services, and to train and validate models, access to data that resembles what is expected of smart meters, collected in a real-world setting, is necessary. The REFIT electrical load measurements dataset described in this paper includes whole house aggregate loads and nine individual appliance measurements at 8-second intervals per house, collected continuously over a period of two years from 20 houses. During monitoring, the occupants were conducting their usual routines. At the time of publishing, the dataset has the largest number of houses monitored in the United Kingdom at less than 1-minute intervals over a period greater than one year. The dataset comprises 1,194,958,790 readings that represent over 250,000 monitored appliance uses. The data is accessible in an easy-to-use comma-separated format, is time-stamped and cleaned to remove invalid measurements, correctly label appliance data and fill in small gaps of missing data.
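    A per-house CSV from a dataset of this kind might be loaded and resampled with pandas as sketched below; the file name and column names ("Time", "Aggregate", "Appliance1"...) are assumptions for illustration and should be checked against the published data documentation before use.

```python
# Sketch of loading and summarizing one house's load measurements (column names assumed).
import pandas as pd

df = pd.read_csv("CLEAN_House1.csv", parse_dates=["Time"], index_col="Time")

# 8-second readings -> hourly mean load for the whole-house aggregate channel.
hourly = df["Aggregate"].resample("1h").mean()

# Rough per-appliance energy share over the monitored period.
appliance_cols = [c for c in df.columns if c.startswith("Appliance")]
share = df[appliance_cols].sum() / df[appliance_cols].sum().sum()
print(hourly.head())
print(share.round(3))
```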

  13. A high-resolution European dataset for hydrologic modeling

    NASA Astrophysics Data System (ADS)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large scale hydrological models not only in the field of modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large scale models need to be calibrated and verified against large amounts of observations in order to judge their capabilities to predict the future. However, the creation of large scale datasets is challenging, for it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo) which was designed with the aim of driving a large scale hydrological model. Similar European and global gridded datasets already exist, such as the HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of those provide similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the time period 1 January 1990 - 31 December 2011. It furthermore contains radiation, calculated using a staggered approach depending on the availability of sunshine duration, cloud cover, and minimum and maximum temperature, as well as evapotranspiration (potential evapotranspiration, bare soil and open water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated throughout the last years. The dataset variables are used as inputs to the hydrological calibration and validation of EFAS as well as for establishing long-term discharge "proxy" climatologies which can in turn be used for statistical analysis to derive return periods or other time series derivatives. In addition, this dataset will be used to assess climatological trends in Europe. Unfortunately, to date no baseline dataset at the European scale exists to test the quality of the herein presented data. Hence, a comparison against other existing datasets can only be an indication of data quality. Due to availability, a comparison was made for precipitation and temperature only, arguably the most important meteorological drivers for hydrologic models. A variety of analyses was undertaken at country scale against data reported to EUROSTAT and E-OBS datasets. The comparison revealed that while the datasets showed overall similar temporal and spatial patterns, there were some differences in magnitudes especially for precipitation. It is not straightforward to define the specific cause for these differences. However, in most cases the comparatively low observation station density appears to be the principal reason for the differences in magnitude.
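    For reference, one widely used form of the Penman-Monteith calculation mentioned above is the FAO-56 reference evapotranspiration equation, sketched below; whether EFAS-Meteo uses exactly this form is not stated in the abstract, so treat it as a standard illustration rather than the dataset's exact formula.

```latex
% FAO-56 reference evapotranspiration (mm day^-1), a common Penman-Monteith form.
\[
ET_0 \;=\; \frac{0.408\,\Delta\,(R_n - G) \;+\; \gamma\,\dfrac{900}{T + 273}\,u_2\,(e_s - e_a)}
               {\Delta \;+\; \gamma\,\bigl(1 + 0.34\,u_2\bigr)}
\]
% \Delta: slope of the vapour-pressure curve, R_n: net radiation, G: soil heat flux,
% \gamma: psychrometric constant, T: mean air temperature, u_2: wind speed at 2 m,
% e_s - e_a: vapour-pressure deficit.
```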

  14. The 3D Reference Earth Model: Status and Preliminary Results

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    In the 20th century, seismologists constructed models of how average physical properties (e.g. density, rigidity, compressibility, anisotropy) vary with depth in the Earth's interior. These one-dimensional (1D) reference Earth models (e.g. PREM) have proven indispensable in earthquake location, imaging of interior structure, understanding material properties under extreme conditions, and as a reference in other fields, such as particle physics and astronomy. Over the past three decades, new datasets motivated more sophisticated efforts that yielded models of how properties vary both laterally and with depth in the Earth's interior. Though these three-dimensional (3D) models exhibit compelling similarities at large scales, differences in the methodology, representation of structure, and dataset upon which they are based have prevented the creation of 3D community reference models. As part of the REM-3D project, we are compiling and reconciling reference seismic datasets of body wave travel-time measurements, fundamental mode and overtone surface wave dispersion measurements, and normal mode frequencies and splitting functions. These reference datasets are being inverted for a long-wavelength, 3D reference Earth model that describes the robust long-wavelength features of mantle heterogeneity. As a community reference model with fully quantified uncertainties and tradeoffs and an associated publicly available dataset, REM-3D will facilitate Earth imaging studies, earthquake characterization, inferences on temperature and composition in the deep interior, and be of improved utility to emerging scientific endeavors, such as neutrino geoscience. Here, we summarize progress made in the construction of the long-period reference dataset and present a preliminary version of REM-3D in the upper mantle. In order to determine the level of detail warranted for inclusion in REM-3D, we analyze the spectrum of discrepancies between models inverted with different subsets of the reference dataset. This procedure allows us to evaluate the extent of consistency in imaging heterogeneity at various depths and between spatial scales.

  15. Context-Aware Generative Adversarial Privacy

    NASA Astrophysics Data System (ADS)

    Huang, Chong; Kairouz, Peter; Chen, Xiao; Sankar, Lalitha; Rajagopal, Ram

    2017-12-01

    Preserving the utility of published datasets while simultaneously providing provable privacy guarantees is a well-known challenge. On the one hand, context-free privacy solutions, such as differential privacy, provide strong privacy guarantees, but often lead to a significant reduction in utility. On the other hand, context-aware privacy solutions, such as information theoretic privacy, achieve an improved privacy-utility tradeoff, but assume that the data holder has access to dataset statistics. We circumvent these limitations by introducing a novel context-aware privacy framework called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to allow the data holder to learn privatization schemes from the dataset itself. Under GAP, learning the privacy mechanism is formulated as a constrained minimax game between two players: a privatizer that sanitizes the dataset in a way that limits the risk of inference attacks on the individuals' private variables, and an adversary that tries to infer the private variables from the sanitized dataset. To evaluate GAP's performance, we investigate two simple (yet canonical) statistical dataset models: (a) the binary data model, and (b) the binary Gaussian mixture model. For both models, we derive game-theoretically optimal minimax privacy mechanisms, and show that the privacy mechanisms learned from data (in a generative adversarial fashion) match the theoretically optimal ones. This demonstrates that our framework can be easily applied in practice, even in the absence of dataset statistics.
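    The constrained minimax game described above can be written schematically as below; the notation is generic (X for the data, Y for the private variable, g for the privatizer, a for the adversary, d for a distortion measure) and is an illustrative restatement, not necessarily the paper's exact formulation.

```latex
% Schematic GAP objective: the adversary minimizes its inference loss, the privatizer
% maximizes that loss subject to a distortion (utility) constraint on the released data.
\[
\max_{g}\;\min_{a}\;\mathbb{E}\!\left[\ell\bigl(a(g(X)),\,Y\bigr)\right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[d\bigl(g(X),\,X\bigr)\right] \le D .
\]
```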

  16. [Theory, method and application of method R on estimation of (co)variance components].

    PubMed

    Liu, Wen-Zhong

    2004-07-01

    Theory, method and application of Method R on estimation of (co)variance components were reviewed so that the method can be used appropriately. Estimation requires R values, which are regressions of predicted random effects that are calculated using the complete dataset on predicted random effects that are calculated using random subsets of the same data. By using a multivariate iteration algorithm based on a transformation matrix, combined with the preconditioned conjugate gradient method to solve the mixed model equations, the computation efficiency of Method R is much improved. Method R is computationally inexpensive, and the sampling errors and approximate credible intervals of estimates can be obtained. Disadvantages of Method R include a larger sampling variance than other methods for the same data, and biased estimates in small datasets. As an alternative method, Method R can be used in larger datasets. It is necessary to study its theoretical properties and broaden its application range further.

  17. Comparative description of ten transcriptomes of newly sequenced invertebrates and efficiency estimation of genomic sampling in non-model taxa

    PubMed Central

    2012-01-01

    Introduction Traditionally, genomic or transcriptomic data have been restricted to a few model or emerging model organisms, and to a handful of species of medical and/or environmental importance. Next-generation sequencing techniques have the capability of yielding massive amounts of gene sequence data for virtually any species at a modest cost. Here we provide a comparative analysis of de novo assembled transcriptomic data for ten non-model species of previously understudied animal taxa. Results cDNA libraries of ten species belonging to five animal phyla (2 Annelida [including Sipuncula], 2 Arthropoda, 2 Mollusca, 2 Nemertea, and 2 Porifera) were sequenced in different batches with an Illumina Genome Analyzer II (read length 100 or 150 bp), rendering between ca. 25 and 52 million reads per species. Read thinning, trimming, and de novo assembly were performed under different parameters to optimize output. Between 67,423 and 207,559 contigs were obtained across the ten species, post-optimization. Of those, 9,069 to 25,681 contigs retrieved blast hits against the NCBI non-redundant database, and approximately 50% of these were assigned with Gene Ontology terms, covering all major categories, and with similar percentages in all species. Local blasts against our datasets, using selected genes from major signaling pathways and housekeeping genes, revealed high efficiency in gene recovery compared to available genomes of closely related species. Intriguingly, our transcriptomic datasets detected multiple paralogues in all phyla and in nearly all gene pathways, including housekeeping genes that are traditionally used in phylogenetic applications for their purported single-copy nature. Conclusions We generated the first study of comparative transcriptomics across multiple animal phyla (comparing two species per phylum in most cases), established the first Illumina-based transcriptomic datasets for sponge, nemertean, and sipunculan species, and generated a tractable catalogue of annotated genes (or gene fragments) and protein families for ten newly sequenced non-model organisms, some of commercial importance (i.e., Octopus vulgaris). These comprehensive sets of genes can be readily used for phylogenetic analysis, gene expression profiling, developmental analysis, and can also be a powerful resource for gene discovery. The characterization of the transcriptomes of such a diverse array of animal species permitted the comparison of sequencing depth, functional annotation, and efficiency of genomic sampling using the same pipelines, which proved to be similar for all considered species. In addition, the datasets revealed their potential as a resource for paralogue detection, a recurrent concern in various aspects of biological inquiry, including phylogenetics, molecular evolution, development, and cellular biochemistry. PMID:23190771

  18. SAMPL5: 3D-RISM partition coefficient calculations with partial molar volume corrections and solute conformational sampling.

    PubMed

    Luchko, Tyler; Blinov, Nikolay; Limon, Garrett C; Joyce, Kevin P; Kovalenko, Andriy

    2016-11-01

    Implicit solvent methods for classical molecular modeling are frequently used to provide fast, physics-based hydration free energies of macromolecules. Less commonly considered is the transferability of these methods to other solvents. The Statistical Assessment of Modeling of Proteins and Ligands 5 (SAMPL5) distribution coefficient dataset and the accompanying explicit solvent partition coefficient reference calculations provide a direct test of solvent model transferability. Here we use the 3D reference interaction site model (3D-RISM) statistical-mechanical solvation theory, with a well tested water model and a new united atom cyclohexane model, to calculate partition coefficients for the SAMPL5 dataset. The cyclohexane model performed well in training and testing (R = 0.98 for amino acid neutral side chain analogues) but only if a parameterized solvation free energy correction was used. In contrast, the same protocol, using single solute conformations, performed poorly on the SAMPL5 dataset, obtaining R = 0.73 compared to the reference partition coefficients, likely due to the much larger solute sizes. Including solute conformational sampling through molecular dynamics coupled with 3D-RISM (MD/3D-RISM) improved agreement with the reference calculation to R = 0.93. Since our initial calculations only considered partition coefficients and not distribution coefficients, solute sampling provided little benefit comparing against experiment, where ionized and tautomer states are more important. Applying a simple pKa correction improved agreement with experiment from R = 0.54 to R = 0.66, despite a small number of outliers. Better agreement is possible by accounting for tautomers and improving the ionization correction.

  19. SAMPL5: 3D-RISM partition coefficient calculations with partial molar volume corrections and solute conformational sampling

    NASA Astrophysics Data System (ADS)

    Luchko, Tyler; Blinov, Nikolay; Limon, Garrett C.; Joyce, Kevin P.; Kovalenko, Andriy

    2016-11-01

    Implicit solvent methods for classical molecular modeling are frequently used to provide fast, physics-based hydration free energies of macromolecules. Less commonly considered is the transferability of these methods to other solvents. The Statistical Assessment of Modeling of Proteins and Ligands 5 (SAMPL5) distribution coefficient dataset and the accompanying explicit solvent partition coefficient reference calculations provide a direct test of solvent model transferability. Here we use the 3D reference interaction site model (3D-RISM) statistical-mechanical solvation theory, with a well tested water model and a new united atom cyclohexane model, to calculate partition coefficients for the SAMPL5 dataset. The cyclohexane model performed well in training and testing (R=0.98 for amino acid neutral side chain analogues) but only if a parameterized solvation free energy correction was used. In contrast, the same protocol, using single solute conformations, performed poorly on the SAMPL5 dataset, obtaining R=0.73 compared to the reference partition coefficients, likely due to the much larger solute sizes. Including solute conformational sampling through molecular dynamics coupled with 3D-RISM (MD/3D-RISM) improved agreement with the reference calculation to R=0.93. Since our initial calculations only considered partition coefficients and not distribution coefficients, solute sampling provided little benefit comparing against experiment, where ionized and tautomer states are more important. Applying a simple pKa correction improved agreement with experiment from R=0.54 to R=0.66, despite a small number of outliers. Better agreement is possible by accounting for tautomers and improving the ionization correction.
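    The pKa correction referred to in the two SAMPL5 records above is commonly taken to be the standard relation between the partition coefficient of the neutral species and the pH-dependent distribution coefficient, shown below for monoprotic solutes; whether the authors used exactly this form is an assumption.

```latex
% Standard ionization correction from log P to log D at a given pH (monoprotic solutes).
\[
\log D_{\mathrm{pH}} \;=\; \log P \;-\; \log_{10}\!\bigl(1 + 10^{\,\mathrm{pH}-\mathrm{p}K_a}\bigr)
\quad (\text{acid}),
\qquad
\log D_{\mathrm{pH}} \;=\; \log P \;-\; \log_{10}\!\bigl(1 + 10^{\,\mathrm{p}K_a-\mathrm{pH}}\bigr)
\quad (\text{base}).
\]
```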

  20. QUANTIFYING THE CLIMATE, AIR QUALITY AND HEALTH BENEFITS OF IMPROVED COOKSTOVES: AN INTEGRATED LABORATORY, FIELD AND MODELING STUDY

    EPA Science Inventory

    Expected results and outputs include: extensive dataset of in-field and laboratory emissions data for traditional and improved cookstoves; parameterization to predict cookstove emissions from drive cycle data; indoor and personal exposure data for traditional and improved cook...

  1. Inquiry in the Physical Geology Classroom: Supporting Students' Conceptual Model Development

    ERIC Educational Resources Information Center

    Miller, Heather R.; McNeal, Karen S.; Herbert, Bruce E.

    2010-01-01

    This study characterizes the impact of an inquiry-based learning (IBL) module versus a traditionally structured laboratory exercise. Laboratory sections were randomized into experimental and control groups. The experimental group was taught using IBL pedagogical techniques and included manipulation of large-scale data-sets, use of multiple…

  2. A Community Terrain-Following Ocean Modeling System (ROMS/TOMS)

    DTIC Science & Technology

    2013-09-30

    workshop at the Windsor Atlântica Hotel, Rio de Janeiro, Brazil, October 22-25, 2012. As in the past, several tutorials were offered on basic and...from the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA-Interim, 3-hour dataset. River runoff is included along the Alabama

  3. On the uncertainties associated with using gridded rainfall data as a proxy for observed

    NASA Astrophysics Data System (ADS)

    Tozer, C. R.; Kiem, A. S.; Verdon-Kidd, D. C.

    2012-05-01

    Gridded rainfall datasets are used in many hydrological and climatological studies, in Australia and elsewhere, including for hydroclimatic forecasting, climate attribution studies and climate model performance assessments. The attraction of the spatial coverage provided by gridded data is clear, particularly in Australia where the spatial and temporal resolution of the rainfall gauge network is sparse. However, the question that must be asked is whether it is suitable to use gridded data as a proxy for observed point data, given that gridded data is inherently "smoothed" and may not necessarily capture the temporal and spatial variability of Australian rainfall which leads to hydroclimatic extremes (i.e. droughts, floods). This study investigates this question through a statistical analysis of three monthly gridded Australian rainfall datasets - the Bureau of Meteorology (BOM) dataset, the Australian Water Availability Project (AWAP) and the SILO dataset. The results of the monthly, seasonal and annual comparisons show that not only are the three gridded datasets different relative to each other, there are also marked differences between the gridded rainfall data and the rainfall observed at gauges within the corresponding grids - particularly for extremely wet or extremely dry conditions. Also important is that the differences observed appear to be non-systematic. To demonstrate the hydrological implications of using gridded data as a proxy for gauged data, a rainfall-runoff model is applied to one catchment in South Australia initially using gauged data as the source of rainfall input and then gridded rainfall data. The results indicate a markedly different runoff response associated with each of the different sources of rainfall data. It should be noted that this study does not seek to identify which gridded dataset is the "best" for Australia, as each gridded data source has its pros and cons, as does gauged data. Rather, the intention is to quantify differences between various gridded data sources and how they compare with gauged data so that these differences can be considered and accounted for in studies that utilise these gridded datasets. Ultimately, if key decisions are going to be based on the outputs of models that use gridded data, an estimate (or at least an understanding) of the uncertainties relating to the assumptions made in the development of gridded data and how that gridded data compares with reality should be made.

  4. Spatial Variability in Column CO2 Inferred from High Resolution GEOS-5 Global Model Simulations: Implications for Remote Sensing and Inversions

    NASA Technical Reports Server (NTRS)

    Ott, L.; Putman, B.; Collatz, J.; Gregg, W.

    2012-01-01

    Column CO2 observations from current and future remote sensing missions represent a major advancement in our understanding of the carbon cycle and are expected to help constrain source and sink distributions. However, data assimilation and inversion methods are challenged by the difference in scale of models and observations. OCO-2 footprints represent an area of several square kilometers while NASA's future ASCENDS lidar mission is likely to have an even smaller footprint. In contrast, the resolution of models used in global inversions is typically hundreds of kilometers, and grid cells often cover areas that include combinations of land, ocean and coastal areas and areas of significant topographic, land cover, and population density variations. To improve understanding of scales of atmospheric CO2 variability and representativeness of satellite observations, we will present results from a global, 10-km simulation of meteorology and atmospheric CO2 distributions performed using NASA's GEOS-5 general circulation model. This resolution, typical of mesoscale atmospheric models, represents an order of magnitude increase in resolution over typical global simulations of atmospheric composition allowing new insight into small scale CO2 variations across a wide range of surface flux and meteorological conditions. The simulation includes high resolution flux datasets provided by NASA's Carbon Monitoring System Flux Pilot Project at half degree resolution that have been down-scaled to 10-km using remote sensing datasets. Probability distribution functions are calculated over larger areas more typical of global models (100-400 km) to characterize subgrid-scale variability in these models. Particular emphasis is placed on coastal regions and regions containing megacities and fires to evaluate the ability of coarse resolution models to represent these small scale features. Additionally, model output is sampled using averaging kernels characteristic of OCO-2 and ASCENDS measurement concepts to create realistic pseudo-datasets. Pseudo-data are averaged over coarse model grid cell areas to better understand the ability of measurements to characterize CO2 distributions and spatial gradients on both short (daily to weekly) and long (monthly to seasonal) time scales.

  5. Sensitivity of quantitative groundwater recharge estimates to volumetric and distribution uncertainty in rainfall forcing products

    NASA Astrophysics Data System (ADS)

    Werner, Micha; Westerhoff, Rogier; Moore, Catherine

    2017-04-01

    Quantitative estimates of recharge due to precipitation excess are an important input to determining sustainable abstraction of groundwater resources, as well as providing one of the boundary conditions required for numerical groundwater modelling. Simple water balance models are widely applied for calculating recharge. In these models, precipitation is partitioned between different processes and stores, including surface runoff and infiltration, storage in the unsaturated zone, evaporation, capillary processes, and recharge to groundwater. Clearly, the estimation of recharge amounts will depend on the estimation of precipitation volumes, which may vary depending on the source of precipitation data used. However, the partitioning between the different processes is in many cases governed by (variable) intensity thresholds. This means that estimates of recharge will be sensitive not only to input parameters such as soil type, texture, land use and potential evaporation, but mainly to the precipitation volume and intensity distribution. In this paper we explore the sensitivity of recharge estimates to differences in precipitation volume and intensity distribution in the rainfall forcing over the Canterbury region in New Zealand. We compare recharge rates and volumes using a simple water balance model that is forced using rainfall and evaporation data from: the NIWA Virtual Climate Station Network (VCSN) data (which is considered the reference dataset); the ERA-Interim/WATCH dataset at 0.25 degree and 0.5 degree resolution; the TRMM-3B42 dataset; the CHIRPS dataset; and the recently released MSWEP dataset. Recharge rates are calculated at a daily time step over the 14-year period from 2000 to 2013 for the full Canterbury region, as well as at eight selected points distributed over the region. Lysimeter data with observed estimates of recharge are available at four of these points, as well as recharge estimates from the NGRM model, an independent model constructed using the same base data and forced with the VCSN precipitation dataset. The comparison of the rainfall products shows that there are significant differences in precipitation volume between the forcing products, in the order of 20% at most points. Even more significant differences can be seen, however, in the distribution of precipitation. For the VCSN data, wet days (defined as >0.1 mm precipitation) occur on some 20-30% of days (depending on location). This is reasonably well reflected in the TRMM and CHIRPS data, while for the re-analysis based products some 60% to 80% of days are wet, albeit at lower intensities. These differences are amplified in the recharge estimates. At most points, volumetric differences are in the order of 40-60%, though differences may range over several orders of magnitude. The frequency distributions of recharge also differ significantly, with recharge over 0.1 mm occurring on 4-6% of days for the VCSN, CHIRPS, and TRMM datasets, but on up to the order of 12% of days for the re-analysis data. Comparison against the lysimeter data shows the estimates to be reasonable, in particular for the reference datasets. Surprisingly, some estimates from the lower-resolution re-analysis datasets are also reasonable, though this does seem to be due to lower recharge amounts being compensated by recharge occurring more frequently. These results underline the importance of a correct representation of rainfall volumes, as well as of their distribution, particularly when evaluating possible changes in, for example, precipitation intensity and volume. This holds for precipitation data derived from satellite-based and re-analysis products, but also for interpolated data from gauges, where the distribution of intensities is strongly influenced by the interpolation process.
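    A minimal sketch of a simple daily water-balance (bucket) model illustrates why recharge is sensitive to how precipitation is distributed in time and not only to its annual volume. The bucket structure, thresholds and rainfall regimes below are illustrative assumptions, not the model or forcing data used in the study.

```python
# Minimal sketch of a simple daily water-balance model, illustrating why recharge
# depends on the temporal distribution of precipitation, not only its total volume.
# All parameter values are illustrative only.
import numpy as np

def recharge(precip_mm, pet_mm, soil_capacity_mm=100.0, runoff_threshold_mm=20.0):
    """Daily bucket model: runoff above a threshold, then ET, then drainage."""
    store, total_recharge = 0.5 * soil_capacity_mm, 0.0
    for p, pet in zip(precip_mm, pet_mm):
        runoff = max(0.0, p - runoff_threshold_mm)      # intensity-dependent partitioning
        store += p - runoff
        store -= min(store, pet)                        # actual ET limited by storage
        drainage = max(0.0, store - soil_capacity_mm)   # excess drains to groundwater
        total_recharge += drainage
        store -= drainage
    return total_recharge

rng = np.random.default_rng(2)
pet = np.full(365, 3.0)                                 # mm/day

# Roughly the same annual volume delivered two ways: few intense days vs many drizzle days
intense = np.where(rng.random(365) < 0.10, rng.gamma(4.0, 6.0, 365), 0.0)
drizzle = np.where(rng.random(365) < 0.70, rng.gamma(1.2, 3.0, 365), 0.0)
drizzle *= intense.sum() / drizzle.sum()                # force equal annual totals

print("annual precip (mm):", round(intense.sum(), 1))
print("recharge, intense regime (mm):", round(recharge(intense, pet), 1))
print("recharge, drizzle regime (mm):", round(recharge(drizzle, pet), 1))
```

    Because runoff is triggered by an intensity threshold and drainage only occurs once the store fills, the two regimes yield different recharge totals despite identical annual precipitation, which is the mechanism the comparison above highlights.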

  6. DigOut: viewing differential expression genes as outliers.

    PubMed

    Yu, Hui; Tu, Kang; Xie, Lu; Li, Yuan-Yuan

    2010-12-01

    With regard to well-replicated two-conditional microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-conditional microarray datasets with limited or no replication, the same task is not properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-conditional microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-conditional expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm, like DigOut, is particularly useful for selecting DE genes from non-replicated multi-conditional gene expression datasets.
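    The general idea, treating differential expression across many conditions as multivariate outlyingness, can be sketched as follows. This is a generic Mahalanobis-distance treatment on synthetic data, not the DigOut algorithm itself, and the limit-fold-change criterion shown for comparison is likewise simplified.

```python
# Minimal sketch (not the DigOut implementation): flag differentially expressed
# genes in a non-replicated multi-conditional matrix as multivariate outliers,
# using the Mahalanobis distance across conditions, and compare with a simple
# limit-fold-change style criterion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_genes, n_cond = 2000, 8
expr = rng.normal(0.0, 1.0, size=(n_genes, n_cond))         # log-expression, one array per condition
expr[:40] += rng.normal(3.0, 0.5, size=(40, n_cond)) * rng.choice([0, 1], size=(40, n_cond))

mean = expr.mean(axis=0)
cov = np.cov(expr, rowvar=False)
inv_cov = np.linalg.inv(cov)
centered = expr - mean
d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)  # squared Mahalanobis distance

# Outliers at the 99th percentile of the chi-square reference distribution
outliers = d2 > stats.chi2.ppf(0.99, df=n_cond)

# Limit-fold-change style call: max |log-fold change| between any two conditions
lfc = expr.max(axis=1) - expr.min(axis=1)
lfc_calls = lfc > 2.0

print("multivariate outlier calls:", int(outliers.sum()))
print("LFC-style calls:           ", int(lfc_calls.sum()))
```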

  7. Survey-based socio-economic data from slums in Bangalore, India

    NASA Astrophysics Data System (ADS)

    Roy, Debraj; Palavalli, Bharath; Menon, Niveditha; King, Robin; Pfeffer, Karin; Lees, Michael; Sloot, Peter M. A.

    2018-01-01

    In 2010, an estimated 860 million people were living in slums worldwide, with around 60 million added to the slum population between 2000 and 2010. In 2011, 200 million people in urban Indian households were considered to live in slums. In order to design slum development programmes and poverty alleviation methods, it is necessary to understand the needs of these communities. Therefore, we require data with high granularity in the Indian context. Unfortunately, there is a paucity of highly granular data at the level of individual slums. We collected the data presented in this paper in partnership with the slum dwellers in order to overcome challenges such as the validity and efficacy of self-reported data. Our survey of Bangalore covered 36 slums across the city. The slums were chosen based on stratification criteria, which included the geographical location of the slum, whether the slum was resettled or rehabilitated, the notification status of the slum, the size of the slum and the religious profile. This paper describes the relational model of the slum dataset, the variables in the dataset, the variables constructed for analysis and the issues identified with the dataset. The data collected include around 267,894 data points spread over 242 questions for 1,107 households. The dataset can facilitate interdisciplinary research on spatial and temporal dynamics of urban poverty and well-being in the context of rapid urbanization of cities in developing countries.

  8. Towards Improved Satellite-In Situ Oceanographic Data Interoperability and Associated Value Added Services at the Podaac

    NASA Astrophysics Data System (ADS)

    Tsontos, V. M.; Huang, T.; Holt, B.

    2015-12-01

    The earth science enterprise increasingly relies on the integration and synthesis of multivariate datasets from diverse observational platforms. NASA's ocean salinity missions, which include Aquarius/SAC-D and the SPURS (Salinity Processes in the Upper Ocean Regional Study) field campaign, illustrate the value of integrated observations in support of studies on ocean circulation, the water cycle, and climate. However, the inherent heterogeneity of the resulting data and the disparate, distributed systems that serve them complicate their effective utilization for both earth science research and applications. Key technical interoperability challenges include adherence to metadata and data format standards, which is particularly problematic for in-situ data, and the lack of a unified metadata model facilitating archival and integration of both satellite and oceanographic field datasets. Here we report on efforts at the PO.DAAC, NASA's physical oceanographic data center, to extend our data management and distribution support capabilities for field campaign datasets such as those from SPURS. We also discuss value-added services, based on the integration of satellite and in-situ datasets, which are under development with a particular focus on DOMS. The distributed oceanographic matchup service (DOMS) implements a portable technical infrastructure and associated web services that will be broadly accessible via the PO.DAAC for the dynamic collocation of satellite and in-situ data, hosted by distributed data providers, in support of mission cal/val, science and operational applications.

  9. Wide-Area Cooperative Biometric Tagging, Tracking and Locating in a Multimodal Sensor Network

    DTIC Science & Technology

    2014-12-04

    Table III of the report compares tracking results on the CAVIAR dataset. The approach is evaluated on two widely used public single-camera pedestrian tracking datasets: the CAVIAR dataset and the TownCentre dataset. The software is also being provided to ONR along with the datasets on which it has been tested.

  10. NASA AVOSS Fast-Time Wake Prediction Models: User's Guide

    NASA Technical Reports Server (NTRS)

    Ahmad, Nash'at N.; VanValkenburg, Randal L.; Pruis, Matthew

    2014-01-01

    The National Aeronautics and Space Administration (NASA) is developing and testing fast-time wake transport and decay models to safely enhance the capacity of the National Airspace System (NAS). The fast-time wake models are empirical algorithms used for real-time predictions of wake transport and decay based on aircraft parameters and ambient weather conditions. The aircraft-dependent parameters include the initial vortex descent velocity and the vortex pair separation distance. The atmospheric initial conditions include vertical profiles of temperature or potential temperature, eddy dissipation rate, and crosswind. The current distribution includes the latest versions of the APA (3.4) and the TDP (2.1) models. This User's Guide provides detailed information on the model inputs, file formats, and the model output. An example of a model run and a brief description of the Memphis 1995 Wake Vortex Dataset are also provided.

  11. Modelling road accident blackspots data with the discrete generalized Pareto distribution.

    PubMed

    Prieto, Faustino; Gómez-Déniz, Emilio; Sarabia, José María

    2014-10-01

    This study shows how road traffic network events, in particular road accidents on blackspots, can be modelled with simple probabilistic distributions. We considered the number of crashes and the number of fatalities on Spanish blackspots in the period 2003-2007, from the Spanish General Directorate of Traffic (DGT). We modelled those datasets, respectively, with the discrete generalized Pareto distribution (a discrete parametric model with three parameters) and with the discrete Lomax distribution (a discrete parametric model with two parameters, and a particular case of the previous model). For that, we analyzed the basic properties of both parametric models: cumulative distribution, survival, probability mass, quantile and hazard functions, genesis and rth-order moments; applied two estimation methods for their parameters: the μ and (μ+1) frequency method and the maximum likelihood method; used two goodness-of-fit tests: the Chi-square test and the discrete Kolmogorov-Smirnov test based on bootstrap resampling; and compared them with the classical negative binomial distribution in terms of absolute probabilities and in models including covariates. We found that these probabilistic models can be useful to describe the road accident blackspot datasets analyzed.
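    As an illustration of one of the estimation methods mentioned (maximum likelihood) for the two-parameter discrete Lomax model, the sketch below defines the probability mass function by discretizing the continuous Lomax distribution and fits it to simulated counts; it does not use the DGT blackspot data.

```python
# Minimal sketch of fitting a discrete Lomax distribution (a two-parameter special
# case of the discrete generalized Pareto) to count data by maximum likelihood.
import numpy as np
from scipy.optimize import minimize

def dlomax_logpmf(k, alpha, sigma):
    # P(X = k) = (1 + k/sigma)^(-alpha) - (1 + (k+1)/sigma)^(-alpha), k = 0, 1, 2, ...
    pmf = (1.0 + k / sigma) ** (-alpha) - (1.0 + (k + 1.0) / sigma) ** (-alpha)
    return np.log(np.maximum(pmf, 1e-300))     # guard against underflow during optimisation

def negloglik(log_params, data):
    alpha, sigma = np.exp(log_params)          # work on the log scale to keep both positive
    return -np.sum(dlomax_logpmf(data, alpha, sigma))

rng = np.random.default_rng(4)
# Simulate by discretising continuous Lomax draws (inverse-CDF sampling), alpha=2.5, sigma=3
u = rng.random(5000)
counts = np.floor(3.0 * ((1.0 - u) ** (-1.0 / 2.5) - 1.0)).astype(int)

fit = minimize(negloglik, x0=np.log([1.0, 1.0]), args=(counts,), method="Nelder-Mead")
alpha_hat, sigma_hat = np.exp(fit.x)
print(f"alpha_hat = {alpha_hat:.2f}, sigma_hat = {sigma_hat:.2f}")
```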

  12. To develop a regional ICU mortality prediction model during the first 24 h of ICU admission utilizing MODS and NEMS with six other independent variables from the Critical Care Information System (CCIS) Ontario, Canada.

    PubMed

    Kao, Raymond; Priestap, Fran; Donner, Allan

    2016-01-01

    Intensive care unit (ICU) scoring systems or prediction models evolved to meet the desire of clinical and administrative leaders to assess the quality of care provided by their ICUs. The Critical Care Information System (CCIS) is a province-wide data system for all Ontario, Canada level 3 and level 2 ICUs collected for this purpose. With this dataset, we developed a multivariable logistic regression ICU mortality prediction model for the first 24 h of ICU admission utilizing explanatory variables that include the two validated scores, the Multiple Organ Dysfunction Score (MODS) and the Nine Equivalents of Nursing Manpower Use Score (NEMS), followed by the variables age, sex, readmission to the ICU during the same hospital stay, admission diagnosis, source of admission, and the modified Charlson Co-morbidity Index (CCI) collected through the hospital health records. This study is a single-center retrospective cohort review of 8822 records from the Critical Care Trauma Centre (CCTC) and Medical-Surgical Intensive Care Unit (MSICU) of London Health Sciences Centre (LHSC), Ontario, Canada, from 1 Jan 2009 to 30 Nov 2012. Multivariable logistic regression on the training dataset (n = 4321) was used to develop the model, which was validated by a bootstrapping method on the testing dataset (n = 4501). Discrimination, calibration, and overall model performance were also assessed. The predictors significantly associated with ICU mortality included age (p < 0.001), source of admission (p < 0.0001), ICU admitting diagnosis (p < 0.0001), MODS (p < 0.0001), and NEMS (p < 0.0001). The variables sex and modified CCI were not significantly associated with ICU mortality. On the training dataset, the developed model has good ability to discriminate between patients at high risk and those at low risk of mortality (c-statistic 0.787). The Hosmer-Lemeshow goodness-of-fit test indicates good agreement between the observed and expected ICU mortality (χ² = 5.48; p > 0.31). The overall optimism of the estimation between the training and testing datasets is ΔAUC = 0.003, indicating a stable prediction model. This study demonstrates that CCIS data available after the first 24 h of ICU admission at LHSC can be used to create a robust mortality prediction model with acceptable fit statistics and internal validity for valid benchmarking and monitoring of ICU performance.
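    The modelling recipe, multivariable logistic regression, discrimination via the c-statistic, and internal validation by bootstrap optimism, can be sketched on synthetic records. The variables below are loose stand-ins for age, severity and workload scores, and a readmission flag; none of this reproduces the CCIS data or the published coefficients.

```python
# Minimal sketch (synthetic data, not CCIS records) of the modelling approach:
# a multivariable logistic regression for ICU mortality, discrimination measured
# by the c-statistic (AUC), and optimism estimated by bootstrap resampling.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n = 4000
X = np.column_stack([
    rng.normal(65, 15, n),        # age
    rng.poisson(6, n),            # MODS-like severity score
    rng.poisson(25, n),           # NEMS-like workload score
    rng.integers(0, 2, n),        # readmission flag
])
logit = -8.0 + 0.03 * X[:, 0] + 0.25 * X[:, 1] + 0.05 * X[:, 2] + 0.3 * X[:, 3]
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Bootstrap optimism: train on a resample, compare AUC on the resample vs the original data
optimism = []
for _ in range(100):
    idx = rng.integers(0, n, n)
    boot = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

print(f"apparent c-statistic: {apparent_auc:.3f}")
print(f"optimism-corrected:   {apparent_auc - np.mean(optimism):.3f}")
```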

  13. The Dynamic General Vegetation Model MC1 over the United States and Canada at a 5-arcminute resolution: model inputs and outputs

    Treesearch

    Ray Drapek; John B. Kim; Ronald P. Neilson

    2015-01-01

    Land managers need to include climate change in their decisionmaking, but the climate models that project future climates operate at spatial scales that are too coarse to be of direct use. To create a dataset more useful to managers, soil and historical climate were assembled for the United States and Canada at a 5-arcminute grid resolution. Nine CMIP3 future climate...

  14. Evaluation of the Bitterness of Traditional Chinese Medicines using an E-Tongue Coupled with a Robust Partial Least Squares Regression Method.

    PubMed

    Lin, Zhaozhou; Zhang, Qiao; Liu, Ruixin; Gao, Xiaojie; Zhang, Lu; Kang, Bingya; Shi, Junhan; Wu, Zidan; Gui, Xinjing; Li, Xuelin

    2016-01-25

    To accurately, safely, and efficiently evaluate the bitterness of Traditional Chinese Medicines (TCMs), a robust predictor was developed using a robust partial least squares (RPLS) regression method based on data obtained from an electronic tongue (e-tongue) system. The data quality was verified by Grubbs' test. Moreover, potential outliers were detected based on both the standardized residual and the score distance calculated for each sample. The performance of RPLS on the dataset before and after outlier detection was compared to that of other state-of-the-art methods, including multivariate linear regression, least squares support vector machine, and plain partial least squares regression. Both R² and the root-mean-square error (RMSE) of cross-validation (CV) were recorded for each model. With four latent variables, a robust RMSECV value of 0.3916, with bitterness values ranging from 0.63 to 4.78, was obtained for the RPLS model constructed on the dataset including outliers. Meanwhile, the RMSECV calculated using the models constructed by the other methods was larger than that of the RPLS model. After six outliers were excluded, the performance of all benchmark methods markedly improved, but the difference between the RPLS models constructed before and after outlier exclusion was negligible. In conclusion, the bitterness of TCM decoctions can be accurately evaluated with the RPLS model constructed using e-tongue data.
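    A minimal sketch of the calibration workflow is given below, using plain partial least squares from scikit-learn as a stand-in for the robust RPLS of the paper (which is not available there), synthetic e-tongue sensor responses, and outlier flags based on standardized residuals and score distance as described.

```python
# Minimal sketch (plain PLS as a stand-in for the robust RPLS of the paper):
# cross-validated RMSE for an e-tongue style calibration, with simple outlier
# flags based on standardized residuals and score distance. Data are synthetic.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(6)
n_samples, n_sensors = 60, 7
X = rng.normal(size=(n_samples, n_sensors))
bitterness = 2.5 + X @ rng.normal(0.4, 0.1, n_sensors) + rng.normal(0, 0.3, n_samples)
bitterness[:3] += 2.0                                  # plant a few gross outliers

pls = PLSRegression(n_components=4)
y_cv = cross_val_predict(pls, X, bitterness, cv=10).ravel()
rmsecv = np.sqrt(np.mean((bitterness - y_cv) ** 2))
print(f"RMSECV (4 latent variables): {rmsecv:.3f}")

# Outlier diagnostics on the fitted model
pls.fit(X, bitterness)
resid = bitterness - pls.predict(X).ravel()
std_resid = resid / resid.std()
scores = pls.x_scores_
score_dist = np.sqrt(((scores / scores.std(axis=0)) ** 2).sum(axis=1))
flagged = (np.abs(std_resid) > 2.5) | (score_dist > np.quantile(score_dist, 0.975))
print("flagged samples:", np.where(flagged)[0])
```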

  15. How Human and Natural Disturbance Affects the U.S. Carbon Sink

    NASA Astrophysics Data System (ADS)

    Felzer, B. S.

    2015-12-01

    Gridded datasets of Net Ecosystem Exchange derived from eddy covariance and remote sensing measurements (EC-MOD and FLUXNET-MTE) provide a means of validating Net Ecosystem Productivity (NEP, the opposite of NEE) from terrestrial ecosystem models. While most forested regions in the U.S. are observed to be moderate to strong carbon sinks, models not including human or natural disturbances will tend to be more carbon neutral, which is expected of mature ecosystems. I have developed the Terrestrial Ecosystems Model Hydro version (TEM-Hydro) to include both human and natural disturbances to compare against gridded NEP datasets. Human disturbances are based on the Hurtt et al. land use transition dataset and include transient agricultural (crop and pasture) conversion and abandonment and timber harvest. Natural disturbances include tropical storms, hurricanes, and fires based on stochastic return intervals. Model results indicate that forests are the largest carbon sink, followed by croplands and pastures, if not accounting for decomposition of agricultural products and animal respiration. Grasslands and shrublands are both small sinks or carbon neutral. The NEP of forests in EC-MOD from 2001-2006 is 240 g C m-2 yr-1 and for FLUXNET-MTE from 1982-2007 is 375 g C m-2 yr-1. With potential vegetation, the respective forest sinks for those two time periods are 54 and 62 g C m-2 yr-1. Including the effects of human disturbance increases the sinks to 154 and 147 g C m-2 yr-1. The effect of stochastic fire and storms is to reduce the NEP to 114 and 108 g C m-2 yr-1. While the positive carbon sink today is the result of past land use disturbance, net carbon sequestration, including product decomposition, conversion fluxes, and animal respiration, has not yet returned to predisturbance levels as seen in the potential vegetation. Differences in response to disturbance have to do with the type, frequency, and intensity of disturbance. Fire, in particular, is seen to have a net negative effect on carbon storage in forests due to decomposition of coarse woody debris and the fact that some nitrogen is lost during volatilization. Croplands become a carbon source if product decomposition is assumed to occur where the crops are grown, and pasturelands become carbon neutral if accounting for animal respiration.

  16. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments.

    PubMed

    Ionescu, Catalin; Papava, Dragos; Olaru, Vlad; Sminchisescu, Cristian

    2014-07-01

    We introduce a new dataset, Human3.6M, of 3.6 million accurate 3D human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset is substantially greater and should stimulate future research. The dataset, together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m.

  17. Topographical effects of climate dataset and their impacts on the estimation of regional net primary productivity

    NASA Astrophysics Data System (ADS)

    Sun, L. Qing; Feng, Feng X.

    2014-11-01

    In this study, we first built and compared two different climate datasets for the Wuling mountainous area in 2010, one of which considered topographical effects during the ANUSPLIN interpolation and is referred to as the terrain-based climate dataset, while the other did not and is called the ordinary climate dataset. Then, we quantified the topographical effects of climatic inputs on NPP estimation by inputting the two different climate datasets into the same ecosystem model, the Boreal Ecosystem Productivity Simulator (BEPS), to evaluate the importance of considering relief when estimating NPP. Finally, we identified the primary variables contributing to the topographical effects through a series of experiments, given an overall accuracy assessment of the model output for NPP. The results showed that: (1) The terrain-based climate dataset presented more reliable topographic information and agreed more closely with the station dataset than the ordinary climate dataset over the successive 365-day series in terms of daily mean values. (2) On average, the ordinary climate dataset underestimated NPP by 12.5% compared with the terrain-based climate dataset over the whole study area. (3) The primary climate variables contributing to the topographical effects of climatic inputs for the Wuling mountainous area were temperatures, which suggests that it is necessary to correct temperature differences in order to estimate NPP accurately in such complex terrain.

  18. Prediction of drug indications based on chemical interactions and chemical similarities.

    PubMed

    Huang, Guohua; Lu, Yin; Lu, Changhong; Zheng, Mingyue; Cai, Yu-Dong

    2015-01-01

    Discovering potential indications of novel or approved drugs is a key step in drug development. Previous computational approaches can be categorized into disease-centric and drug-centric according to the starting point of the problem, or into small-scale and large-scale applications according to the diversity of the datasets. Here, a classifier has been constructed to predict the indications of a drug, based on the assumption that interactive/associated drugs or drugs with similar structures are more likely to target the same diseases, using a large drug indication dataset. To examine the classifier, it was run five times on a dataset of 1,573 drugs retrieved from the Comprehensive Medicinal Chemistry database, evaluated each time by 5-fold cross-validation, yielding five 1st-order prediction accuracies that were all approximately 51.48%. Meanwhile, the model yielded an accuracy of 50.00% for the 1st-order prediction in an independent test on a dataset of 32 other drugs for which drug repositioning has been confirmed. Interestingly, some clinically repurposed drug indications that were not included in the datasets were successfully identified by our method. These results suggest that our method may become a useful tool to associate novel molecules with new indications or alternative indications with existing drugs.
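    The guilt-by-association assumption, that structurally similar drugs tend to share indications, can be sketched with a toy nearest-neighbour scheme on random binary fingerprints. This is not the paper's classifier; all fingerprints, indication labels and parameter choices below are hypothetical.

```python
# Minimal sketch (hypothetical data) of the guilt-by-association idea: predict
# candidate indications for a query drug from the indications of its most
# chemically similar neighbours (Tanimoto similarity on binary fingerprints).
import numpy as np

rng = np.random.default_rng(7)
n_drugs, n_bits, n_indications = 200, 256, 20
fingerprints = rng.random((n_drugs, n_bits)) < 0.1                 # binary structural fingerprints
indications = rng.random((n_drugs, n_indications)) < 0.05          # known drug-indication links

def tanimoto(matrix, query):
    inter = np.logical_and(matrix, query).sum(axis=1)
    union = np.logical_or(matrix, query).sum(axis=1)
    return np.where(union > 0, inter / union, 0.0)

def predict_indications(drug_idx, k=10):
    sims = tanimoto(fingerprints, fingerprints[drug_idx])
    sims[drug_idx] = -1.0                                  # exclude the query drug itself
    neighbours = np.argsort(sims)[::-1][:k]
    # Score each indication by similarity-weighted votes of the neighbours
    scores = sims[neighbours] @ indications[neighbours].astype(float)
    return np.argsort(scores)[::-1][:3]                    # top-ranked ("1st order") candidates

print("top candidate indications for drug 0:", predict_indications(0))
```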

  19. Prediction of Drug Indications Based on Chemical Interactions and Chemical Similarities

    PubMed Central

    Huang, Guohua; Lu, Yin; Lu, Changhong; Cai, Yu-Dong

    2015-01-01

    Discovering potential indications of novel or approved drugs is a key step in drug development. Previous computational approaches can be categorized into disease-centric and drug-centric according to the starting point of the problem, or into small-scale and large-scale applications according to the diversity of the datasets. Here, a classifier has been constructed to predict the indications of a drug, based on the assumption that interactive/associated drugs or drugs with similar structures are more likely to target the same diseases, using a large drug indication dataset. To examine the classifier, it was run five times on a dataset of 1,573 drugs retrieved from the Comprehensive Medicinal Chemistry database, evaluated each time by 5-fold cross-validation, yielding five 1st-order prediction accuracies that were all approximately 51.48%. Meanwhile, the model yielded an accuracy of 50.00% for the 1st-order prediction in an independent test on a dataset of 32 other drugs for which drug repositioning has been confirmed. Interestingly, some clinically repurposed drug indications that were not included in the datasets were successfully identified by our method. These results suggest that our method may become a useful tool to associate novel molecules with new indications or alternative indications with existing drugs. PMID:25821813

  20. Rainfall simulation experiments in the southwestern USA using the Walnut Gulch Rainfall Simulator

    NASA Astrophysics Data System (ADS)

    Polyakov, Viktor; Stone, Jeffry; Holifield Collins, Chandra; Nearing, Mark A.; Paige, Ginger; Buono, Jared; Gomez-Pond, Rae-Landa

    2018-01-01

    This dataset contains hydrological, erosion, vegetation, ground cover, and other supplementary information from 272 rainfall simulation experiments conducted on 23 semiarid rangeland locations in Arizona and Nevada between 2002 and 2013. On 30 % of the plots, simulations were conducted up to five times during the decade of study. The rainfall was generated using the Walnut Gulch Rainfall Simulator on 2 m by 6 m plots. Simulation sites included brush and grassland areas with various degrees of disturbance by grazing, wildfire, or brush removal. This dataset advances our understanding of basic hydrological and biological processes that drive soil erosion on arid rangelands. It can be used to estimate runoff, infiltration, and erosion rates at a variety of ecological sites in the Southwestern USA. The inclusion of wildfire and brush treatment locations combined with long-term observations makes it important for studying vegetation recovery, ecological transitions, and the effect of management. It is also a valuable resource for erosion model parameterization and validation. The dataset is available from the National Agricultural Library at https://data.nal.usda.gov/search/type/dataset (DOI: https://doi.org/10.15482/USDA.ADC/1358583).

  1. Dataset for petroleum based stock markets and GAUSS codes for SAMEM.

    PubMed

    Khalifa, Ahmed A A; Bertuccelli, Pietro; Otranto, Edoardo

    2017-02-01

    This article includes a unique balanced daily dataset (Monday, Tuesday and Wednesday) of oil and natural gas volatility and of the stock markets of the oil-rich economies Saudi Arabia, Qatar, Kuwait, Abu Dhabi, Dubai, Bahrain and Oman, using daily data over the period spanning Oct. 18, 2006-July 30, 2015. Additionally, we have included unique GAUSS codes for estimating the spillover asymmetric multiplicative error model (SAMEM), with an application to petroleum-based stock markets. The data, the model and the codes have many applications in business and social science.

  2. Application of the Conway-Maxwell-Poisson generalized linear model for analyzing motor vehicle crashes.

    PubMed

    Lord, Dominique; Guikema, Seth D; Geedipally, Srinivas Reddy

    2008-05-01

    This paper documents the application of the Conway-Maxwell-Poisson (COM-Poisson) generalized linear model (GLM) for modeling motor vehicle crashes. The COM-Poisson distribution, originally developed in 1962, has recently been re-introduced by statisticians for analyzing count data subject to over- and under-dispersion. This innovative distribution is an extension of the Poisson distribution. The objectives of this study were to evaluate the application of the COM-Poisson GLM for analyzing motor vehicle crashes and to compare the results with the traditional negative binomial (NB) model. The comparison analysis was carried out using the most common functional forms employed by transportation safety analysts, which link crashes to the entering flows at intersections or on segments. To accomplish the objectives of the study, several NB and COM-Poisson GLMs were developed and compared using two datasets. The first dataset contained crash data collected at signalized four-legged intersections in Toronto, Ont. The second dataset included data collected for rural four-lane divided and undivided highways in Texas. Several methods were used to assess the statistical fit and predictive performance of the models. The results of this study show that COM-Poisson GLMs perform as well as NB models in terms of GOF statistics and predictive performance. Given that the COM-Poisson distribution can also handle under-dispersed data (while the NB distribution cannot or has difficulties converging), which have sometimes been observed in crash databases, the COM-Poisson GLM offers a better alternative to the NB model for modeling motor vehicle crashes, especially given the important limitations recently documented in the safety literature about the latter type of model.
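    For readers unfamiliar with the COM-Poisson distribution, the sketch below writes out its likelihood and fits an intercept-only model by maximum likelihood on simulated Poisson counts (so the dispersion parameter nu should come out near 1). It is not the GLM used in the paper, which links the mean to entering flows.

```python
# Minimal sketch of the COM-Poisson likelihood (intercept-only, not the full GLM),
# fitted by maximum likelihood. The counts are simulated Poisson data, so the
# fitted dispersion parameter nu should be near 1.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def com_poisson_negloglik(params, y, max_count=200):
    log_lam, log_nu = params
    lam, nu = np.exp(log_lam), np.exp(log_nu)
    ks = np.arange(max_count + 1)
    # log of the normalising constant Z(lambda, nu) = sum_k lambda^k / (k!)^nu
    log_terms = ks * np.log(lam) - nu * gammaln(ks + 1.0)
    log_Z = np.logaddexp.reduce(log_terms)
    loglik = np.sum(y * np.log(lam) - nu * gammaln(y + 1.0) - log_Z)
    return -loglik

rng = np.random.default_rng(8)
y = rng.poisson(3.0, size=1000)

fit = minimize(com_poisson_negloglik, x0=[0.0, 0.0], args=(y,), method="Nelder-Mead")
lam_hat, nu_hat = np.exp(fit.x)
print(f"lambda_hat = {lam_hat:.2f}, nu_hat = {nu_hat:.2f}   (nu = 1 recovers the Poisson)")
```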

  3. Climate Forcing Datasets for Agricultural Modeling: Merged Products for Gap-Filling and Historical Climate Series Estimation

    NASA Technical Reports Server (NTRS)

    Ruane, Alex C.; Goldberg, Richard; Chryssanthacopoulos, James

    2014-01-01

    The AgMERRA and AgCFSR climate forcing datasets provide daily, high-resolution, continuous, meteorological series over the 1980-2010 period designed for applications examining the agricultural impacts of climate variability and climate change. These datasets combine daily resolution data from retrospective analyses (the Modern-Era Retrospective Analysis for Research and Applications, MERRA, and the Climate Forecast System Reanalysis, CFSR) with in situ and remotely-sensed observational datasets for temperature, precipitation, and solar radiation, leading to substantial reductions in bias in comparison to a network of 2324 agricultural-region stations from the Hadley Integrated Surface Dataset (HadISD). Results compare favorably against the original reanalyses as well as the leading climate forcing datasets (Princeton, WFD, WFD-EI, and GRASP), and AgMERRA distinguishes itself with substantially improved representation of daily precipitation distributions and extreme events owing to its use of the MERRA-Land dataset. These datasets also peg relative humidity to the maximum temperature time of day, allowing for more accurate representation of the diurnal cycle of near-surface moisture in agricultural models. AgMERRA and AgCFSR enable a number of ongoing investigations in the Agricultural Model Intercomparison and Improvement Project (AgMIP) and related research networks, and may be used to fill gaps in historical observations as well as a basis for the generation of future climate scenarios.

  4. Time Series Expression Analyses Using RNA-seq: A Statistical Approach

    PubMed Central

    Oh, Sunghee; Song, Seongho; Grabowski, Gregory; Zhao, Hongyu; Noonan, James P.

    2013-01-01

    RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis. PMID:23586021

  5. Time series expression analyses using RNA-seq: a statistical approach.

    PubMed

    Oh, Sunghee; Song, Seongho; Grabowski, Gregory; Zhao, Hongyu; Noonan, James P

    2013-01-01

    RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis.

  6. 76 FR 35856 - Fisheries of the South Atlantic; Southeast Data, Assessment, and Review (SEDAR); Public Meeting

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-06-20

    ... evaluates potential datasets and recommends which datasets are appropriate for assessment analyses. The... Series Using datasets recommended from the Data Workshop, Panelists will employ assessment models to...

  7. 76 FR 47564 - Fisheries of the South Atlantic; Southeast Data, Assessment, and Review (SEDAR); Assessment...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-08-05

    ... compiles and evaluates potential datasets and recommends which datasets are appropriate for assessment... Series Using datasets recommended from the Data Workshop, Panelists will employ assessment models to...

  8. Geometrical features assessment of liver's tumor with application of artificial neural network evolved by imperialist competitive algorithm.

    PubMed

    Keshavarz, M; Mojra, A

    2015-05-01

    Geometrical features of a cancerous tumor embedded in biological soft tissue, including tumor size and depth, are a necessity in the follow-up procedure and in making suitable therapeutic decisions. In this paper, a new socio-politically motivated global search strategy, called the imperialist competitive algorithm (ICA), is implemented to train a feed-forward neural network (FFNN) to estimate the tumor's geometrical characteristics (FFNNICA). First, a viscoelastic model of liver tissue is constructed by using a series of in vitro uniaxial and relaxation test data. Then, 163 samples of the tissue including a tumor with different depths and diameters are generated by using Python programming to link ABAQUS and MATLAB together. Next, the samples are divided into 123 samples as the training dataset and 40 samples as the testing dataset. Training inputs of the network are mechanical parameters extracted from palpation of the tissue through a developing noninvasive technology called artificial tactile sensing (ATS). Last, to evaluate the FFNNICA performance, outputs of the network, including tumor depth and diameter, are compared with desired values for both training and testing datasets. Deviations of the outputs from the desired values are calculated by a regression analysis. Statistical analysis is also performed by measuring the Root Mean Square Error (RMSE) and Efficiency (E). The RMSEs of the diameter and depth estimations are 0.50 mm and 1.49, respectively, for the testing dataset. The results affirm that the proposed optimization algorithm for training the neural network can be useful for characterizing soft tissue tumors accurately by employing an artificial palpation approach.

  9. Integration of Neuroimaging and Microarray Datasets through Mapping and Model-Theoretic Semantic Decomposition of Unstructured Phenotypes

    PubMed Central

    Pantazatos, Spiro P.; Li, Jianrong; Pavlidis, Paul; Lussier, Yves A.

    2009-01-01

    An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n = 50), and precision of the semantic mapping between these terms across datasets was 98% (n = 100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets. PMID:20495688

  10. Optimization of DRASTIC method by supervised committee machine artificial intelligence to assess groundwater vulnerability for Maragheh-Bonab plain aquifer, Iran

    NASA Astrophysics Data System (ADS)

    Fijani, Elham; Nadiri, Ata Allah; Asghari Moghaddam, Asghar; Tsai, Frank T.-C.; Dixon, Barnali

    2013-10-01

    Contamination of wells with nitrate-N (NO3-N) poses various threats to human health. Contamination of groundwater is a complex process and is full of uncertainty at the regional scale. Development of an integrative vulnerability assessment methodology can be useful to effectively manage (including prioritization of limited resource allocation to monitor high-risk areas) and protect this valuable freshwater source. This study introduces a supervised committee machine with artificial intelligence (SCMAI) model to improve the DRASTIC method for groundwater vulnerability assessment for the Maragheh-Bonab plain aquifer in Iran. Four different AI models are considered in the SCMAI model, whose inputs are the DRASTIC parameters. The SCMAI model improves the committee machine artificial intelligence (CMAI) model by replacing the linear combination in the CMAI with a nonlinear supervised ANN framework. To calibrate the AI models, NO3-N concentration data are divided into two datasets for training and validation purposes. The target value of the AI models in the training step is the corrected vulnerability indices that relate to the first NO3-N concentration dataset. After model training, the AI models are verified by the second NO3-N concentration dataset. The results show that the four AI models are able to improve the DRASTIC method. Since no single AI model performs dominantly best, the SCMAI model is considered to combine the advantages of the individual AI models to achieve the optimal performance. The SCMAI method re-predicts the groundwater vulnerability based on the different AI model prediction values. The results show that the SCMAI model outperforms the individual AI models and the CMAI model. The SCMAI model ensures that no water well with high NO3-N levels would be classified as low risk and vice versa. The study concludes that the SCMAI model is an effective model to improve the DRASTIC model and provides a confident estimate of the pollution risk.
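    The supervised-committee idea can be sketched as a two-stage model: several base learners predict a vulnerability index from DRASTIC-style inputs, and a small neural network learns a nonlinear combination of their predictions. The data, the choice of base learners and the network size below are all stand-ins for the four AI models of the study; out-of-fold base predictions would be preferable to avoid leakage.

```python
# Minimal sketch of a supervised committee: base learners predict a target from
# DRASTIC-style inputs, and a small neural network combines their predictions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 7))                         # seven DRASTIC-style rasters sampled at wells
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + X @ rng.normal(0.2, 0.05, 7) + rng.normal(0, 0.2, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base_models = [
    RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr),
    LinearRegression().fit(X_tr, y_tr),
    KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr),
]
P_tr = np.column_stack([m.predict(X_tr) for m in base_models])
P_te = np.column_stack([m.predict(X_te) for m in base_models])

# Nonlinear supervised combiner (in practice, train it on out-of-fold base predictions)
committee = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(P_tr, y_tr)

for name, pred in [("committee", committee.predict(P_te))] + [
        (type(m).__name__, P_te[:, i]) for i, m in enumerate(base_models)]:
    rmse = np.sqrt(np.mean((y_te - pred) ** 2))
    print(f"{name:22s} test RMSE = {rmse:.3f}")
```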

  11. A New Global Group Velocity Dataset for Constraining Crust and Upper Mantle Properties

    NASA Astrophysics Data System (ADS)

    Ma, Z.; Masters, G.; Laske, G.; Pasyanos, M. E.

    2010-12-01

    We are improving our CRUST2.0 to a new LITHO1.0 model, refining the nominal resolution to 1 degree and including lithospheric structure. The new model is constrained by many datasets, including very large datasets of surface wave group velocity built using a new, efficient measurement technique. This technique starts in a similar fashion to the traditional frequency-time analysis, but instead of making measurements for all frequencies for a single source-station pair, we apply cluster analysis to make measurements for all recordings for a single event at a single target frequency. By changing the nominal frequencies of the bandpass filter, we filter each trace until the centroid frequency of the band-passed spectrum matches the target frequency. We have processed all the LH data from IRIS (and some of the BH data from PASSCAL experiments and the POLARIS network) from 1976 to 2007. The Rayleigh wave group velocity data set is complete from 10mHz to 40mHz at increments of 2.5mHz. The data set has about 330000 measurements for 10 and 20mHz, 200000 for 30mHz and 110000 for 40mHz. We are also building a similar dataset for Love waves, though its size will be about half that of the Rayleigh wave dataset. The SMAD of the group arrival time difference between our global dataset and other more regional datasets is about 12 seconds for 20mHz, 9 seconds for 30mHz, and 7 seconds for 40mHz. Though the discrepancies are about twice as big as our measurement precision (estimated by looking at group arrival time differences between closely-spaced stations), it is still much smaller than the signal in the data (e.g., the group arrival time for 20mHz can differ from the prediction of a 1D Earth by over 250 seconds). The fact that there is no systematic bias between the datasets encourages us to combine them to improve coverage of some key areas. Group velocity maps inverted from the combined datasets show many interesting signals though the dominant signal is related to variations in crustal thickness. For 20mHz, group velocity perturbations from the global mean range from -25% to 11%, with a standard deviation of 4%. We adjust the smoothing of lateral structure in the inversion so that the error of the inferred group velocity is nearly uniform globally. For 20mHz, a 0.1% error in group velocity leads to resolution of features of dimension 9 degrees or less everywhere. The resolution in Eurasia is 5.5 degrees and in N. America is 4.5-5 degrees. For 30mHz, for the same 0.1% error, we can resolve structure of 10 degrees globally, 6.5 degrees for Eurasia and 5.5 degree for N. America.

  12. A Spatially Distinct History of the Development of California Groundfish Fisheries

    PubMed Central

    Miller, Rebecca R.; Field, John C.; Santora, Jarrod A.; Schroeder, Isaac D.; Huff, David D.; Key, Meisha; Pearson, Don E.; MacCall, Alec D.

    2014-01-01

    During the past century, commercial fisheries have expanded from small vessels fishing in shallow, coastal habitats to a broad suite of vessels and gears that fish virtually every marine habitat on the globe. Understanding how fisheries have developed in space and time is critical for interpreting and managing the response of ecosystems to the effects of fishing; however, time series of spatially explicit data are typically rare. Recently, the 1933–1968 portion of the commercial catch dataset from the California Department of Fish and Wildlife was recovered and digitized, completing the full historical series for both the commercial and recreational datasets from 1933–2010. These unique datasets include landing estimates at a coarse 10 by 10 minute “grid-block” spatial resolution and extend the entire length of coastal California up to 180 kilometers from shore. In this study, we focus on the catch history of groundfish, which was mapped for each grid-block using the year at 50% cumulative catch and the total historical catch per habitat area. We then constructed generalized linear models to quantify the relationship between spatiotemporal trends in groundfish catches, distance from ports, depth, percentage of days with wind speed over 15 knots, SST and ocean productivity. Our results indicate that over the history of these fisheries, catches have taken place in increasingly deeper habitat, at a greater distance from ports, and in increasingly inclement weather conditions. Understanding the spatial development of groundfish fisheries and catches in California is critical for improving population models and for evaluating whether implicit stock assessment model assumptions of relative homogeneity of fisheries removals over time and space are reasonable. This newly reconstructed catch dataset and analysis provides a comprehensive appreciation for the development of groundfish fisheries with respect to commonly assumed trends of global fisheries patterns that are typically constrained by a lack of long-term spatial datasets. PMID:24967973
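    The spatial development metric used above, the year by which half of a grid block's total historical catch had been landed, is simple to compute; the sketch below applies it to two hypothetical catch series representing an early-developed nearshore block and a late-developed offshore block.

```python
# Minimal sketch (hypothetical catch series) of the spatial development metric:
# for each grid block, the year by which 50% of the total historical catch had
# been landed.
import numpy as np

years = np.arange(1933, 2011)

def year_at_half_cumulative_catch(annual_catch):
    cum = np.cumsum(annual_catch)
    return years[np.searchsorted(cum, 0.5 * cum[-1])]

rng = np.random.default_rng(10)
# An "early developed" nearshore block and a "late developed" deep offshore block
early_block = rng.gamma(2.0, 10.0, size=years.size) * np.exp(-0.03 * (years - 1933))
late_block = rng.gamma(2.0, 10.0, size=years.size) * np.exp(0.03 * (years - 1933))

print("nearshore block, year at 50% catch:", year_at_half_cumulative_catch(early_block))
print("offshore block,  year at 50% catch:", year_at_half_cumulative_catch(late_block))
```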

  13. Improving protein-protein interactions prediction accuracy using protein evolutionary information and relevance vector machine model.

    PubMed

    An, Ji-Yong; Meng, Fan-Rong; You, Zhu-Hong; Chen, Xing; Yan, Gui-Ying; Hu, Ji-Pu

    2016-10-01

    Predicting protein-protein interactions (PPIs) is a challenging task and is essential for constructing protein interaction networks, which is important for facilitating our understanding of the mechanisms of biological systems. Although a number of high-throughput technologies have been proposed to predict PPIs, there are unavoidable shortcomings, including high cost, time intensity, and inherently high false positive rates. For these reasons, many computational methods have been proposed for predicting PPIs. However, the problem is still far from being solved. In this article, we propose a novel computational method called RVM-BiGP that combines the relevance vector machine (RVM) model and Bi-gram Probabilities (BiGP) for PPI detection from protein sequences. The major improvements include: (1) Protein sequences are represented using the Bi-gram Probabilities (BiGP) feature representation on a Position Specific Scoring Matrix (PSSM), in which the protein evolutionary information is contained; (2) For reducing the influence of noise, the Principal Component Analysis (PCA) method is used to reduce the dimension of the BiGP vector; (3) The powerful and robust Relevance Vector Machine (RVM) algorithm is used for classification. Five-fold cross-validation experiments executed on the yeast and Helicobacter pylori datasets achieved very high accuracies of 94.57% and 90.57%, respectively. Experimental results are significantly better than those of previous methods. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the yeast dataset. The experimental results demonstrate that our RVM-BiGP method is significantly better than the SVM-based method. In addition, we achieved 97.15% accuracy on the imbalanced yeast dataset, which is higher than that on the balanced yeast dataset. The promising experimental results show the efficiency and robustness of the proposed method, which can be an automatic decision support tool for future proteomics research. To facilitate extensive studies for future proteomics research, we developed a freely available web server called RVM-BiGP-PPIs in Hypertext Preprocessor (PHP) for predicting PPIs. The web server, including the source code and the datasets, is available at http://219.219.62.123:8888/BiGP/.
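    The bi-gram probability feature can be written down compactly: from an L x 20 PSSM of per-position residue probabilities, B[a, b] is the sum over positions of P[i, a] * P[i+1, b], giving a 400-dimensional vector. The sketch below builds this feature on random PSSMs and uses PCA plus an SVM as a stand-in classifier, since scikit-learn ships no relevance vector machine; with random labels the held-out accuracy should sit near chance.

```python
# Minimal sketch of the bi-gram probability feature on a PSSM, followed by PCA
# and a plain SVM as a stand-in for the RVM classifier. All data are random.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def bigram_features(pssm_probs):
    """pssm_probs: (L, 20) array of per-position residue probabilities."""
    return (pssm_probs[:-1].T @ pssm_probs[1:]).ravel()      # 20 x 20 -> 400-dim vector

rng = np.random.default_rng(11)

def random_pssm(length):
    raw = rng.random((length, 20))
    return raw / raw.sum(axis=1, keepdims=True)

# Hypothetical interacting (1) / non-interacting (0) protein pairs, features concatenated
X = np.array([np.concatenate([bigram_features(random_pssm(150)),
                              bigram_features(random_pssm(150))]) for _ in range(200)])
y = rng.integers(0, 2, size=200)

model = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel="rbf"))
model.fit(X[:150], y[:150])
print("held-out accuracy on random data (chance level expected):",
      round(model.score(X[150:], y[150:]), 2))
```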

  14. The Cryosphere Model Comparison Tool (CmCt): Ice Sheet Model Validation and Comparison Tool for Greenland and Antarctica

    NASA Astrophysics Data System (ADS)

    Simon, E.; Nowicki, S.; Neumann, T.; Tyahla, L.; Saba, J. L.; Guerber, J. R.; Bonin, J. A.; DiMarzio, J. P.

    2017-12-01

    The Cryosphere model Comparison tool (CmCt) is a web-based ice sheet model validation tool that is being developed by NASA to facilitate direct comparison between observational data and various ice sheet models. The CmCt allows the user to take advantage of several decades' worth of observations from Greenland and Antarctica. Currently, the CmCt can be used to compare ice sheet models provided by the user with remotely sensed satellite data from ICESat (Ice, Cloud, and land Elevation Satellite) laser altimetry, the GRACE (Gravity Recovery and Climate Experiment) satellite, and radar altimetry (ERS-1, ERS-2, and Envisat). One or more models can be uploaded through the CmCt website and compared with observational data, or compared to each other or to other models. The CmCt calculates statistics on the differences between the model and observations, and other quantitative and qualitative metrics, which can be used to evaluate the different model simulations against the observations. The qualitative metrics consist of a range of visual outputs and the quantitative metrics consist of several whole-ice-sheet scalar values that can be used to assign an overall score to a particular simulation. The comparison results from CmCt are useful in quantifying improvements within a specific model (or within a class of models) as a result of differences in model dynamics (e.g., shallow vs. higher-order dynamics approximations), model physics (e.g., representations of ice sheet rheological or basal processes), or model resolution (mesh resolution and/or changes in the spatial resolution of input datasets). The framework and metrics could also be used as a model-to-model intercomparison tool, simply by substituting outputs from another model for the observational datasets. Future versions of the tool will include comparisons with other datasets that are of interest to the modeling community, such as ice velocity, ice thickness, and surface mass balance.

  15. Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays

    NASA Astrophysics Data System (ADS)

    Guha, Rajarshi; Schürer, Stephan C.

    2008-06-01

    Computational toxicology is emerging as an encouraging alternative to experimental testing. The Molecular Libraries Screening Center Network (MLSCN) as part of the NIH Molecular Libraries Roadmap has recently started generating large and diverse screening datasets, which are publicly available in PubChem. In this report, we investigate various aspects of developing computational models to predict cell toxicity based on cell proliferation screening data generated in the MLSCN. By capturing feature-based information in those datasets, such predictive models would be useful in evaluating cell-based screening results in general (for example from reporter assays) and could be used as an aid to identify and eliminate potentially undesired compounds. Specifically we present the results of random forest ensemble models developed using different cell proliferation datasets and highlight protocols to take into account their extremely imbalanced nature. Depending on the nature of the datasets and the descriptors employed we were able to achieve percentage correct classification rates between 70% and 85% on the prediction set, though the accuracy rate dropped significantly when the models were applied to in vivo data. In this context we also compare the MLSCN cell proliferation results with animal acute toxicity data to investigate to what extent animal toxicity can be correlated and potentially predicted by proliferation results. Finally, we present a visualization technique that allows one to compare a new dataset to the training set of the models to decide whether the new dataset may be reliably predicted.
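    The main practical point, handling the extreme class imbalance of proliferation screens, can be sketched with synthetic descriptors and a class-weighted random forest; the descriptor set, imbalance ratio and weighting scheme below are illustrative assumptions rather than the MLSCN protocol.

```python
# Minimal sketch (synthetic descriptors and labels) of a random forest on an
# extremely imbalanced cell-proliferation screen, using class weighting so the
# rare "toxic" class is not ignored.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

rng = np.random.default_rng(12)
n_compounds, n_descriptors = 20000, 50
X = rng.normal(size=(n_compounds, n_descriptors))
toxic = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1.0, n_compounds)) > 2.8   # ~2-3% actives

X_tr, X_te, y_tr, y_te = train_test_split(X, toxic, test_size=0.3, stratify=toxic, random_state=0)

clf = RandomForestClassifier(n_estimators=300, class_weight="balanced_subsample",
                             random_state=0, n_jobs=-1).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("confusion matrix:\n", confusion_matrix(y_te, pred))
print("balanced accuracy:", round(balanced_accuracy_score(y_te, pred), 3))
```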

  16. A Modified Active Appearance Model Based on an Adaptive Artificial Bee Colony

    PubMed Central

    Othman, Zulaiha Ali

    2014-01-01

    The active appearance model (AAM) is one of the most popular model-based approaches and has been extensively used to extract features by highly accurate modeling of human faces under various physical and environmental circumstances. However, in such an active appearance model, fitting the model to the original image is a challenging task. The state of the art shows that optimization methods are applicable to resolve this problem; however, applying such optimization effectively is itself a common problem. Hence, in this paper we propose an AAM-based face recognition technique, which is capable of resolving the fitting problem of the AAM by introducing a new adaptive artificial bee colony (ABC) algorithm. The adaptation increases the efficiency of fitting compared with the conventional ABC algorithm. We have used three datasets in our experiments: the CASIA dataset, the property 2.5D face dataset, and the UBIRIS v1 images dataset. The results have revealed that the proposed face recognition technique performs effectively in terms of face recognition accuracy. PMID:25165748

  17. A Unified Framework for Complex Networks with Degree Trichotomy Based on Markov Chains.

    PubMed

    Hui, David Shui Wing; Chen, Yi-Chao; Zhang, Gong; Wu, Weijie; Chen, Guanrong; Lui, John C S; Li, Yingtao

    2017-06-16

    This paper establishes a Markov chain model as a unified framework for describing the evolution processes in complex networks. The unique feature of the proposed model is its capability in addressing the formation mechanism that can reflect the "trichotomy" observed in degree distributions, based on which closed-form solutions can be derived. Important special cases of the proposed unified framework are the classical models, including Poisson, exponential, and power-law distributed networks. Both simulation and experimental results demonstrate a good match of the proposed model with real datasets, showing its superiority over the classical models. Implications of the model for various applications, including citation analysis, online social networks, and vehicular network design, are also discussed in the paper.

  18. 75 FR 78679 - Fisheries of the Gulf of Mexico and South Atlantic; Southeast Data, Assessment, and Review (SEDAR...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-12-16

    ... evaluates potential datasets and recommends which datasets are appropriate for assessment analyses. The...: Using datasets recommended from the Data Workshop, participants will employ assessment models to...

  19. Analysis of mass spectrometry data from the secretome of an explant model of articular cartilage exposed to pro-inflammatory and anti-inflammatory stimuli using machine learning

    PubMed Central

    2013-01-01

    Background: Osteoarthritis (OA) is an inflammatory disease of synovial joints involving the loss and degeneration of articular cartilage. The gold standard for evaluating cartilage loss in OA is the measurement of joint space width on standard radiographs. However, in most cases the diagnosis is made well after the onset of the disease, when the symptoms are well established. Identification of early biomarkers of OA can facilitate earlier diagnosis, improve disease monitoring and predict responses to therapeutic interventions. Methods: This study describes the bioinformatic analysis of data generated from high throughput proteomics for identification of potential biomarkers of OA. The mass spectrometry data was generated using a canine explant model of articular cartilage treated with the pro-inflammatory cytokine interleukin 1 β (IL-1β). The bioinformatics analysis involved the application of machine learning and network analysis to the proteomic mass spectrometry data. A rule-based machine learning technique, BioHEL, was used to create a model that classified the samples into their relevant treatment groups by identifying those proteins that separated samples into their respective groups. The proteins identified were considered to be potential biomarkers. Protein networks were also generated; from these networks, proteins pivotal to the classification were identified. Results: BioHEL correctly classified eighteen out of twenty-three samples, giving a classification accuracy of 78.3% for the dataset. The dataset included the four classes of control, IL-1β, carprofen, and IL-1β and carprofen together. This exceeded the other machine learners that were used for a comparison, on the same dataset, with the exception of another rule-based method, JRip, which performed equally well. The proteins that were most frequently used in rules generated by BioHEL were found to include a number of relevant proteins including matrix metalloproteinase 3, interleukin 8 and matrix gla protein. Conclusions: Using this protocol, combining an in vitro model of OA with bioinformatics analysis, a number of relevant extracellular matrix proteins were identified, thereby supporting the application of these bioinformatics tools for analysis of proteomic data from in vitro models of cartilage degradation. PMID:24330474

  20. Relations of water-quality constituent concentrations to surrogate measurements in the lower Platte River corridor, Nebraska, 2007 through 2011

    USGS Publications Warehouse

    Schaepe, Nathaniel J.; Soenksen, Philip J.; Rus, David L.

    2014-01-01

    The lower Platte River, Nebraska, provides drinking water, irrigation water, and in-stream flows for recreation, wildlife habitat, and vital habitats for several threatened and endangered species. The U.S. Geological Survey (USGS), in cooperation with the Lower Platte River Corridor Alliance (LPRCA) developed site-specific regression models for water-quality constituents at four sites (Shell Creek near Columbus, Nebraska [USGS site 06795500]; Elkhorn River at Waterloo, Nebr. [USGS site 06800500]; Salt Creek near Ashland, Nebr. [USGS site 06805000]; and Platte River at Louisville, Nebr. [USGS site 06805500]) in the lower Platte River corridor. The models were developed by relating continuously monitored water-quality properties (surrogate measurements) to discrete water-quality samples. These models enable existing web-based software to provide near-real-time estimates of stream-specific constituent concentrations to support natural resources management decisions. Since 2007, USGS, in cooperation with the LPRCA, has continuously monitored four water-quality properties seasonally within the lower Platte River corridor: specific conductance, water temperature, dissolved oxygen, and turbidity. During 2007 through 2011, the USGS and the Nebraska Department of Environmental Quality collected and analyzed discrete water-quality samples for nutrients, major ions, pesticides, suspended sediment, and bacteria. These datasets were used to develop the regression models. This report documents the collection of these various water-quality datasets and the development of the site-specific regression models. Regression models were developed for all four monitored sites. Constituent models for Shell Creek included nitrate plus nitrite, total phosphorus, orthophosphate, atrazine, acetochlor, suspended sediment, and Escherichia coli (E. coli) bacteria. Regression models that were developed for the Elkhorn River included nitrate plus nitrite, total Kjeldahl nitrogen, total phosphorus, orthophosphate, chloride, atrazine, acetochlor, suspended sediment, and E. coli. Models developed for Salt Creek included nitrate plus nitrite, total Kjeldahl nitrogen, suspended sediment, and E. coli. Lastly, models developed for the Platte River site included total Kjeldahl nitrogen, total phosphorus, sodium, metolachlor, atrazine, acetochlor, suspended sediment, and E. coli.
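
    A hedged sketch of the surrogate-regression idea, not the published site-specific models: a constituent measured in discrete samples (here suspended sediment, with invented values) is related to a continuously monitored surrogate (turbidity) through a log-log fit that can then return near-real-time estimates from the continuous record.

        # Illustrative surrogate regression; numbers are hypothetical.
        import numpy as np

        turbidity = np.array([12., 35., 60., 110., 240., 400.])   # NTU, continuous monitor at sample times
        sediment = np.array([20., 55., 90., 180., 420., 700.])    # mg/L, discrete samples

        slope, intercept = np.polyfit(np.log10(turbidity), np.log10(sediment), deg=1)

        def estimate_sediment(turb_ntu):
            """Near-real-time concentration estimate from a new turbidity reading."""
            return 10 ** (intercept + slope * np.log10(turb_ntu))

        print(round(estimate_sediment(150.0), 1))   # estimated mg/L at 150 NTU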

  1. A comparative study of family-specific protein-ligand complex affinity prediction based on random forest approach

    NASA Astrophysics Data System (ADS)

    Wang, Yu; Guo, Yanzhi; Kuang, Qifan; Pu, Xuemei; Ji, Yue; Zhang, Zhihang; Li, Menglong

    2015-04-01

    The assessment of binding affinity between ligands and target proteins plays an essential role in the drug discovery and design process. As an alternative to widely used scoring approaches, machine learning methods have also been proposed for fast prediction of binding affinity with promising results, but most of them were developed as all-purpose models, even though proteins from different functional families have different structures and physicochemical features. In this study, we proposed a random forest method to predict the protein-ligand binding affinity based on a comprehensive feature set covering protein sequence, binding pocket, ligand structure, and intermolecular interaction. Feature processing and compression were implemented separately for each protein family dataset, which indicates that different features contribute to different models, so an individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families: HIV-1 protease, trypsin, and carbonic anhydrase. As a comparison, two generic models including diverse protein families were also built. The evaluation results show that models trained on family-specific datasets outperform those trained on the generic datasets; the Pearson and Spearman correlation coefficients (Rp and Rs) on the test sets are 0.740, 0.874, 0.735 and 0.697, 0.853, 0.723 for HIV-1 protease, trypsin, and carbonic anhydrase, respectively. Comparisons with other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way to predict the affinity of a particular protein family.
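
    A minimal sketch, with entirely simulated features and affinities, of the family-specific workflow described above: a random forest regressor is trained on one family's feature matrix and evaluated with Pearson and Spearman correlations on a held-out split.

        import numpy as np
        from scipy.stats import pearsonr, spearmanr
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(42)
        X = rng.normal(size=(400, 120))    # hypothetical protein/pocket/ligand/interaction features
        y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=400)   # hypothetical binding affinities

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
        model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

        pred = model.predict(X_te)
        print("Rp:", round(pearsonr(y_te, pred)[0], 3))    # Pearson correlation on the test set
        print("Rs:", round(spearmanr(y_te, pred)[0], 3))   # Spearman correlation on the test set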

  2. Modelling land cover change in the Ganga basin

    NASA Astrophysics Data System (ADS)

    Moulds, S.; Tsarouchi, G.; Mijic, A.; Buytaert, W.

    2013-12-01

    Over recent decades the green revolution in India has driven substantial environmental change. Modelling experiments have identified northern India as a 'hot spot' of land-atmosphere coupling strength during the boreal summer. However, there is a wide range of sensitivity of atmospheric variables to soil moisture between individual climate models. The lack of a comprehensive land cover change dataset to force climate models has been identified as a major contributor to model uncertainty. In this work a time series dataset of land cover change between 1970 and 2010 is constructed for northern India to improve the quantification of regional hydrometeorological feedbacks. The MODIS instrument on board the Aqua and Terra satellites provides near-continuous remotely sensed datasets from 2000 to the present day. However, the quality of satellite products before 2000 is poor. To complete the dataset MODIS images are extrapolated back in time using the Conversion of Land Use and its Effects at small regional extent (CLUE-s) modelling framework. Non-spatial estimates of land cover area from national agriculture and forest statistics, available on a state-wise, annual basis, are used as a direct model input. Land cover change is allocated spatially as a function of biophysical and socioeconomic drivers identified using logistic regression. This dataset will provide an essential input to a high resolution, physically based land surface model to generate the lower boundary condition to assess the impact of land cover change on regional climate.

  3. Characterization and visualization of the accuracy of FIA's CONUS-wide tree species datasets

    Treesearch

    Rachel Riemann; Barry T. Wilson

    2014-01-01

    Modeled geospatial datasets have been created for 325 tree species across the contiguous United States (CONUS). Effective application of any geospatial dataset depends on its accuracy. Dataset errors can be systematic (bias) or unsystematic (scatter), and their magnitude can vary by region and scale. Each of these characteristics affects the locations, scales, uses,...

  4. Gsflow-py: An integrated hydrologic model development tool

    NASA Astrophysics Data System (ADS)

    Gardner, M.; Niswonger, R. G.; Morton, C.; Henson, W.; Huntington, J. L.

    2017-12-01

    Integrated hydrologic modeling encompasses a vast number of processes and specifications, variable in time and space, and development of model datasets can be arduous. Model input construction techniques have not been formalized or made easily reproducible. Creating the input files for integrated hydrologic models (IHM) requires complex GIS processing of raster and vector datasets from various sources. Developing stream network topology that is consistent with the model resolution digital elevation model is important for robust simulation of surface water and groundwater exchanges. Distribution of meteorologic parameters over the model domain is difficult in complex terrain at the model resolution scale, but is necessary to drive realistic simulations. Historically, development of input data for IHM models has required extensive GIS and computer programming expertise which has restricted the use of IHMs to research groups with available financial, human, and technical resources. Here we present a series of Python scripts that provide a formalized technique for the parameterization and development of integrated hydrologic model inputs for GSFLOW. With some modifications, this process could be applied to any regular grid hydrologic model. This Python toolkit automates many of the necessary and laborious processes of parameterization, including stream network development and cascade routing, land coverages, and meteorological distribution over the model domain.
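
    One of the laborious steps mentioned above, distributing meteorological forcing over the model grid, is illustrated below with a simple inverse-distance-weighting interpolation; the stations, values, and grid are invented, and this is not the toolkit's actual algorithm.

        # Hedged example: spread station precipitation onto a regular model grid.
        import numpy as np

        stations = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])   # station x, y (km), hypothetical
        values = np.array([2.4, 3.1, 1.8])                           # daily precipitation (mm), hypothetical

        xg, yg = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
        grid = np.column_stack([xg.ravel(), yg.ravel()])

        d = np.linalg.norm(grid[:, None, :] - stations[None, :, :], axis=2)
        w = 1.0 / np.maximum(d, 1e-6) ** 2            # inverse-distance-squared weights
        field = ((w * values).sum(axis=1) / w.sum(axis=1)).reshape(xg.shape)
        print(field.shape, field.min().round(2), field.max().round(2))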

  5. Gridded global surface ozone metrics for atmospheric chemistry model evaluation

    NASA Astrophysics Data System (ADS)

    Sofen, E. D.; Bowdalo, D.; Evans, M. J.; Apadula, F.; Bonasoni, P.; Cupeiro, M.; Ellul, R.; Galbally, I. E.; Girgzdiene, R.; Luppo, S.; Mimouni, M.; Nahas, A. C.; Saliba, M.; Tørseth, K.; and all other contributors to the WMO GAW, EPA AQS, EPA CASTNET, CAPMoN, NAPS, AirBase, EMEP, and EANET ozone datasets

    2015-07-01

    The concentration of ozone at the Earth's surface is measured at many locations across the globe for the purposes of air quality monitoring and atmospheric chemistry research. We have brought together all publicly available surface ozone observations from online databases from the modern era to build a consistent dataset for the evaluation of chemical transport and chemistry-climate (Earth System) models for projects such as the Chemistry-Climate Model Initiative and AerChemMIP. From a total dataset of approximately 6600 sites and 500 million hourly observations from 1971 to 2015, approximately 2200 sites and 200 million hourly observations pass screening as high-quality sites in regional background locations that are appropriate for use in global model evaluation. Data volume is generally good from the start of widespread air quality monitoring networks in 1990 through 2013. Ozone observations are biased heavily toward North America and Europe, with sparse coverage over the rest of the globe. This dataset is made available for the purposes of model evaluation as a set of gridded metrics intended to describe the distribution of ozone concentrations on monthly and annual timescales. Metrics include the moments of the distribution, percentiles, the maximum daily eight-hour average (MDA8), SOMO35, AOT40, and metrics related to air quality regulatory thresholds. Gridded datasets are stored as netCDF-4 files and are available to download from the British Atmospheric Data Centre (doi:10.5285/08fbe63d-fa6d-4a7a-b952-5932e3ab0452). We provide recommendations to the ozone measurement community regarding improved metadata reporting to simplify ongoing and future efforts to work with ozone data from disparate networks in a consistent manner.
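
    A hedged sketch of one of the listed metrics, the maximum daily eight-hour average (MDA8), computed from an hourly ozone series with pandas; the series below is synthetic and the data-completeness screening applied to the real dataset is omitted.

        import numpy as np
        import pandas as pd

        idx = pd.date_range("2013-06-01", periods=31 * 24, freq="h")
        ozone = pd.Series(30 + 20 * np.sin(np.arange(len(idx)) * 2 * np.pi / 24), index=idx)  # ppb, synthetic

        rolling8 = ozone.rolling(window=8, min_periods=6).mean()   # running 8-hour means
        mda8 = rolling8.resample("D").max()                        # daily maximum of the 8-hour means
        print(mda8.head())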

  6. A Web Resource for Standardized Benchmark Datasets, Metrics, and Rosetta Protocols for Macromolecular Modeling and Design.

    PubMed

    Ó Conchúir, Shane; Barlow, Kyle A; Pache, Roland A; Ollikainen, Noah; Kundert, Kale; O'Meara, Matthew J; Smith, Colin A; Kortemme, Tanja

    2015-01-01

    The development and validation of computational macromolecular modeling and design methods depend on suitable benchmark datasets and informative metrics for comparing protocols. In addition, if a method is intended to be adopted broadly in diverse biological applications, there needs to be information on appropriate parameters for each protocol, as well as metrics describing the expected accuracy compared to experimental data. In certain disciplines, there exist established benchmarks and public resources where experts in a particular methodology are encouraged to supply their most efficient implementation of each particular benchmark. We aim to provide such a resource for protocols in macromolecular modeling and design. We present a freely accessible web resource (https://kortemmelab.ucsf.edu/benchmarks) to guide the development of protocols for protein modeling and design. The site provides benchmark datasets and metrics to compare the performance of a variety of modeling protocols using different computational sampling methods and energy functions, providing a "best practice" set of parameters for each method. Each benchmark has an associated downloadable benchmark capture archive containing the input files, analysis scripts, and tutorials for running the benchmark. The captures may be run with any suitable modeling method; we supply command lines for running the benchmarks using the Rosetta software suite. We have compiled initial benchmarks for the resource spanning three key areas: prediction of energetic effects of mutations, protein design, and protein structure prediction, each with associated state-of-the-art modeling protocols. With the help of the wider macromolecular modeling community, we hope to expand the variety of benchmarks included on the website and continue to evaluate new iterations of current methods as they become available.

  7. Large scale validation of the M5L lung CAD on heterogeneous CT datasets.

    PubMed

    Torres, E Lopez; Fiorina, E; Pennazio, F; Peroni, C; Saletta, M; Camarlinghi, N; Fantacci, M E; Cerello, P

    2015-04-01

    M5L, a fully automated computer-aided detection (CAD) system for the detection and segmentation of lung nodules in thoracic computed tomography (CT), is presented and validated on several image datasets. M5L is the combination of two independent subsystems, based on the Channeler Ant Model as a segmentation tool [lung channeler ant model (lungCAM)] and on the voxel-based neural approach. The lungCAM was upgraded with a scan equalization module and a new procedure to recover nodules connected to other lung structures; its classification module, which makes use of a feed-forward neural network, is based on a small number of features (13), so as to minimize the risk of poor generalization given the large difference in size between the training and testing datasets, which contain 94 and 1019 CTs, respectively. The lungCAM (standalone) and M5L (combined) performance was extensively tested on 1043 CT scans from three independent datasets, including a detailed analysis of the full Lung Image Database Consortium/Image Database Resource Initiative database, which has not previously been reported in the literature. The lungCAM and M5L performance is consistent across the databases, with a sensitivity of about 70% and 80%, respectively, at eight false positive findings per scan, despite the variable annotation criteria and acquisition and reconstruction conditions. A reduced sensitivity is found for subtle nodules and ground glass opacity (GGO) structures. A comparison with other CAD systems is also presented. The M5L performance on a large and heterogeneous dataset is stable and satisfactory, although the development of a dedicated module for GGO detection, as well as iterative optimization of the training procedure, could further improve it. The main aim of the present study was accomplished: M5L results do not deteriorate when increasing the dataset size, making it a candidate for supporting radiologists in large-scale screenings and clinical programs.

  8. Comparison of Four Precipitation Forcing Datasets in Land Information System Simulations over the Continental U.S.

    NASA Technical Reports Server (NTRS)

    Case, Jonathan L.; Kumar, Sujay V.; Kuligowski, Robert J.; Langston, Carrie

    2013-01-01

    The NASA Short-term Prediction Research and Transition (SPoRT) Center in Huntsville, AL is running a real-time configuration of the NASA Land Information System (LIS) with the Noah land surface model (LSM). Output from the SPoRT-LIS run is used to initialize land surface variables for local modeling applications at select National Weather Service (NWS) partner offices, and can be displayed in decision support systems for situational awareness and drought monitoring. The SPoRT-LIS is run over a domain covering the southern and eastern United States, fully nested within the National Centers for Environmental Prediction Stage IV precipitation analysis grid, which provides precipitation forcing to the offline LIS-Noah runs. The SPoRT Center seeks to expand the real-time LIS domain to the entire Continental U.S. (CONUS); however, geographical limitations with the Stage IV analysis product have inhibited this expansion. Therefore, a goal of this study is to test alternative precipitation forcing datasets that can enable the LIS expansion by improving upon the current geographical limitations of the Stage IV product. The four precipitation forcing datasets that are inter-compared on a 4-km resolution CONUS domain include the Stage IV, an experimental GOES quantitative precipitation estimate (QPE) from NESDIS/STAR, the National Mosaic and QPE (NMQ) product from the National Severe Storms Laboratory, and the North American Land Data Assimilation System phase 2 (NLDAS-2) analyses. The NLDAS-2 dataset is used as the control run, with each of the other three datasets considered experimental runs compared against the control. The regional strengths, weaknesses, and biases of each precipitation analysis are identified relative to the NLDAS-2 control in terms of accumulated precipitation pattern and amount, and the impacts on the subsequent LSM spin-up simulations. The ultimate goal is to identify an alternative precipitation forcing dataset that can best support an expansion of the real-time SPoRT-LIS to a domain covering the entire CONUS.

  9. Creating a standardized watersheds database for the Lower Rio Grande/Río Bravo, Texas

    USGS Publications Warehouse

    Brown, J.R.; Ulery, Randy L.; Parcher, Jean W.

    2000-01-01

    This report describes the creation of a large-scale watershed database for the lower Rio Grande/Río Bravo Basin in Texas. The watershed database includes watersheds delineated to all 1:24,000-scale mapped stream confluences and other hydrologically significant points, selected watershed characteristics, and hydrologic derivative datasets. Computer technology allows generation of preliminary watershed boundaries in a fraction of the time needed for manual methods. This automated process reduces development time and results in quality improvements in watershed boundaries and characteristics. These data can then be compiled in a permanent database, eliminating the time-consuming step of data creation at the beginning of a project and providing a stable base dataset that can give users greater confidence when further subdividing watersheds. A standardized dataset of watershed characteristics is a valuable contribution to the understanding and management of natural resources. Vertical integration of the input datasets used to automatically generate watershed boundaries is crucial to the success of such an effort. The optimum situation would be to use the digital orthophoto quadrangles as the source of all the input datasets. While the hydrographic data from the digital line graphs can be revised to match the digital orthophoto quadrangles, hypsography data cannot be revised to match the digital orthophoto quadrangles. Revised hydrography from the digital orthophoto quadrangle should be used to create an updated digital elevation model that incorporates the stream channels as revised from the digital orthophoto quadrangle. Computer-generated, standardized watersheds that are vertically integrated with existing digital line graph hydrographic data will continue to be difficult to create until revisions can be made to existing source datasets. Until such time, manual editing will be necessary to make adjustments for man-made features and changes in the natural landscape that are not reflected in the digital elevation model data.

  10. Creating a standardized watersheds database for the lower Rio Grande/Rio Bravo, Texas

    USGS Publications Warehouse

    Brown, Julie R.; Ulery, Randy L.; Parcher, Jean W.

    2000-01-01

    This report describes the creation of a large-scale watershed database for the lower Rio Grande/Rio Bravo Basin in Texas. The watershed database includes watersheds delineated to all 1:24,000-scale mapped stream confluences and other hydrologically significant points, selected watershed characteristics, and hydrologic derivative datasets. Computer technology allows generation of preliminary watershed boundaries in a fraction of the time needed for manual methods. This automated process reduces development time and results in quality improvements in watershed boundaries and characteristics. These data can then be compiled in a permanent database, eliminating the time-consuming step of data creation at the beginning of a project and providing a stable base dataset that can give users greater confidence when further subdividing watersheds. A standardized dataset of watershed characteristics is a valuable contribution to the understanding and management of natural resources. Vertical integration of the input datasets used to automatically generate watershed boundaries is crucial to the success of such an effort. The optimum situation would be to use the digital orthophoto quadrangles as the source of all the input datasets. While the hydrographic data from the digital line graphs can be revised to match the digital orthophoto quadrangles, hypsography data cannot be revised to match the digital orthophoto quadrangles. Revised hydrography from the digital orthophoto quadrangle should be used to create an updated digital elevation model that incorporates the stream channels as revised from the digital orthophoto quadrangle. Computer-generated, standardized watersheds that are vertically integrated with existing digital line graph hydrographic data will continue to be difficult to create until revisions can be made to existing source datasets. Until such time, manual editing will be necessary to make adjustments for man-made features and changes in the natural landscape that are not reflected in the digital elevation model data.

  11. Comparing apples and oranges: the Community Intercomparison Suite

    NASA Astrophysics Data System (ADS)

    Schutgens, Nick; Stier, Philip; Pascoe, Stephen

    2014-05-01

    Visual representation and comparison of geoscientific datasets present a huge challenge due to the large variety of file formats and the spatio-temporal sampling of data (be they observations or simulations). The Community Intercomparison Suite (CIS) attempts to greatly simplify these tasks by offering an intelligent but simple command line tool for visualisation and colocation of diverse datasets. In addition, CIS can subset and aggregate large datasets into smaller, more manageable datasets. Our philosophy is to remove, as far as possible, the need for specialist knowledge of a dataset's structure. The colocation of observations with model data is performed with a single "cis col" command, which resamples the simulation data to the spatio-temporal sampling of the observations, contingent on a few user-defined options that specify a resampling kernel. CIS can deal with both gridded and ungridded datasets of 2, 3 or 4 spatio-temporal dimensions. It can handle different spatial coordinates (e.g. longitude or distance, altitude or pressure level). CIS supports HDF, netCDF and ASCII file formats. The suite is written in Python with entirely publicly available open source dependencies. Plug-ins allow a high degree of user modifiability. A web-based developer hub includes a manual and simple examples. CIS is developed as open source code by a specialist IT company under the supervision of scientists from the University of Oxford as part of investment in the JASMIN super-data-cluster facility at the Centre for Environmental Data Archival.
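
    The snippet below is not CIS internals; it is a hedged illustration of the colocation concept, sampling a gridded model field at the space-time locations of point observations with a nearest-neighbour kernel (xarray is used for convenience, and all coordinates and values are invented).

        import numpy as np
        import xarray as xr

        model = xr.DataArray(
            np.random.rand(10, 20, 30),
            dims=("time", "lat", "lon"),
            coords={"time": np.arange(10),
                    "lat": np.linspace(-45, 45, 20),
                    "lon": np.linspace(0, 358, 30)},
            name="aod",
        )

        # Hypothetical observation sampling: time index, latitude, longitude of each point measurement.
        obs_time = xr.DataArray([2, 5, 7], dims="obs")
        obs_lat = xr.DataArray([-10.3, 4.7, 33.1], dims="obs")
        obs_lon = xr.DataArray([12.0, 200.5, 310.2], dims="obs")

        colocated = model.sel(time=obs_time, lat=obs_lat, lon=obs_lon, method="nearest")
        print(colocated.values)   # model values resampled onto the observation sampling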

  12. Resolution testing and limitations of geodetic and tsunami datasets for finite fault inversions along subduction zones

    NASA Astrophysics Data System (ADS)

    Williamson, A.; Newman, A. V.

    2017-12-01

    Finite fault inversions utilizing multiple datasets have become commonplace for large earthquakes when the data are available. The mixture of geodetic datasets such as Global Navigational Satellite Systems (GNSS) and InSAR, seismic waveforms, and, when applicable, tsunami waveforms from Deep-Ocean Assessment and Reporting of Tsunami (DART) gauges provides slightly different observations that, when incorporated together, lead to a more robust model of the fault slip distribution. The merging of different datasets is of particular importance along subduction zones, where direct observations of seafloor deformation over the rupture area are extremely limited. Instead, instrumentation measures related ground motion from tens to hundreds of kilometers away. The distance from the event and the dataset type can lead to a variable degree of resolution, affecting the ability to accurately model the spatial distribution of slip. This study analyzes the spatial resolution attained from geodetic and tsunami datasets individually as well as in a combined dataset. We constrain the importance of the distance between estimated parameters and observed data and how it varies between land-based and open-ocean datasets. The analysis focuses on accurately scaled subduction zone synthetic models as well as the relationship between slip and data in recent large subduction zone earthquakes. This study shows that seafloor-deformation-sensitive datasets, such as open-ocean tsunami waveforms or seafloor geodetic instrumentation, can provide unique offshore resolution for understanding most large and particularly tsunamigenic megathrust earthquake activity. In most environments, we simply lack the capability to resolve static displacements using land-based geodetic observations.

  13. Quality-control of an hourly rainfall dataset and climatology of extremes for the UK.

    PubMed

    Blenkinsop, Stephen; Lewis, Elizabeth; Chan, Steven C; Fowler, Hayley J

    2017-02-01

    Sub-daily rainfall extremes may be associated with flash flooding, particularly in urban areas but, compared with extremes on daily timescales, have been relatively little studied in many regions. This paper describes a new, hourly rainfall dataset for the UK based on ∼1600 rain gauges from three different data sources. This includes tipping bucket rain gauge data from the UK Environment Agency (EA), which has been collected for operational purposes, principally flood forecasting. Significant problems in the use of such data for the analysis of extreme events include the recording of accumulated totals, high frequency bucket tips, rain gauge recording errors and the non-operation of gauges. Given the prospect of an intensification of short-duration rainfall in a warming climate, the identification of such errors is essential if sub-daily datasets are to be used to better understand extreme events. We therefore first describe a series of procedures developed to quality control this new dataset. We then analyse ∼380 gauges with near-complete hourly records for 1992-2011 and map the seasonal climatology of intense rainfall based on UK hourly extremes using annual maxima, n-largest events and fixed threshold approaches. We find that the highest frequencies and intensities of hourly extreme rainfall occur during summer when the usual orographically defined pattern of extreme rainfall is replaced by a weaker, north-south pattern. A strong diurnal cycle in hourly extremes, peaking in late afternoon to early evening, is also identified in summer and, for some areas, in spring. This likely reflects the different mechanisms that generate sub-daily rainfall, with convection dominating during summer. The resulting quality-controlled hourly rainfall dataset will provide considerable value in several contexts, including the development of standard, globally applicable quality-control procedures for sub-daily data, the validation of the new generation of very high-resolution climate models and improved understanding of the drivers of extreme rainfall.
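
    A hedged sketch of the extreme-value summaries mentioned above (annual maxima and counts above a fixed hourly threshold), computed from an hourly gauge series with pandas; the series is synthetic and the 10 mm threshold is purely illustrative.

        import numpy as np
        import pandas as pd

        idx = pd.date_range("1992-01-01", "2011-12-31 23:00", freq="h")
        rain = pd.Series(np.random.default_rng(1).gamma(0.1, 2.0, len(idx)), index=idx)  # mm per hour, synthetic

        annual_max = rain.resample("YS").max()                # annual maximum hourly totals
        hours_over_10mm = (rain > 10.0).resample("YS").sum()  # fixed-threshold exceedance counts per year
        print(annual_max.head())
        print(hours_over_10mm.head())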

  14. Quality‐control of an hourly rainfall dataset and climatology of extremes for the UK

    PubMed Central

    Lewis, Elizabeth; Chan, Steven C.; Fowler, Hayley J.

    2016-01-01

    ABSTRACT Sub‐daily rainfall extremes may be associated with flash flooding, particularly in urban areas but, compared with extremes on daily timescales, have been relatively little studied in many regions. This paper describes a new, hourly rainfall dataset for the UK based on ∼1600 rain gauges from three different data sources. This includes tipping bucket rain gauge data from the UK Environment Agency (EA), which has been collected for operational purposes, principally flood forecasting. Significant problems in the use of such data for the analysis of extreme events include the recording of accumulated totals, high frequency bucket tips, rain gauge recording errors and the non‐operation of gauges. Given the prospect of an intensification of short‐duration rainfall in a warming climate, the identification of such errors is essential if sub‐daily datasets are to be used to better understand extreme events. We therefore first describe a series of procedures developed to quality control this new dataset. We then analyse ∼380 gauges with near‐complete hourly records for 1992–2011 and map the seasonal climatology of intense rainfall based on UK hourly extremes using annual maxima, n‐largest events and fixed threshold approaches. We find that the highest frequencies and intensities of hourly extreme rainfall occur during summer when the usual orographically defined pattern of extreme rainfall is replaced by a weaker, north–south pattern. A strong diurnal cycle in hourly extremes, peaking in late afternoon to early evening, is also identified in summer and, for some areas, in spring. This likely reflects the different mechanisms that generate sub‐daily rainfall, with convection dominating during summer. The resulting quality‐controlled hourly rainfall dataset will provide considerable value in several contexts, including the development of standard, globally applicable quality‐control procedures for sub‐daily data, the validation of the new generation of very high‐resolution climate models and improved understanding of the drivers of extreme rainfall. PMID:28239235

  15. Vision-based Detection of Acoustic Timed Events: a Case Study on Clarinet Note Onsets

    NASA Astrophysics Data System (ADS)

    Bazzica, A.; van Gemert, J. C.; Liem, C. C. S.; Hanjalic, A.

    2017-05-01

    Acoustic events often have a visual counterpart. Knowledge of visual information can aid the understanding of complex auditory scenes, even when only a stereo mixdown is available in the audio domain, e.g. identifying which musicians are playing in large musical ensembles. In this paper, we consider a vision-based approach to note onset detection. As a case study we focus on challenging, real-world clarinetist videos and carry out preliminary experiments on a 3D convolutional neural network based on multiple streams and purposely avoiding temporal pooling. We release an audiovisual dataset with 4.5 hours of clarinetist videos together with cleaned annotations which include about 36,000 onsets and the coordinates for a number of salient points and regions of interest. By performing several training trials on our dataset, we learned that the problem is challenging. We found that the CNN model is highly sensitive to the optimization algorithm and hyper-parameters, and that treating the problem as binary classification may prevent the joint optimization of precision and recall. To encourage further research, we publicly share our dataset, annotations and all models and detail which issues we came across during our preliminary experiments.

  16. Computational approaches for predicting biomedical research collaborations.

    PubMed

    Zhang, Qing; Yu, Hong

    2014-01-01

    Biomedical research is increasingly collaborative, and successful collaborations often produce high-impact work. Computational approaches can be developed for automatically predicting biomedical research collaborations. Previous work on collaboration prediction mainly explored the topological structure of research collaboration networks, leaving out the rich semantic information in the publications themselves. In this paper, we propose supervised machine learning approaches to predict research collaborations in the biomedical field. We explored both semantic features extracted from author research interest profiles and author network topological features. We found that the most informative semantic features for author collaborations are related to research interest, including similarity of out-citing citations and similarity of abstracts. Of the four supervised machine learning models (naïve Bayes, naïve Bayes multinomial, SVMs, and logistic regression), the best performing model is logistic regression, with an ROC ranging from 0.766 to 0.980 on different datasets. To our knowledge, we are the first to study in depth how research interests and productivity can be used for collaboration prediction. Our approach is computationally efficient, scalable and yet simple to implement. The datasets of this study are available at https://github.com/qingzhanggithub/medline-collaboration-datasets.
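
    A hedged sketch of the supervised setup described above: predicting whether an author pair will collaborate from pairwise features (for example abstract similarity, out-citing citation similarity, and shared neighbours) with logistic regression; the feature names and data here are invented.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(7)
        X = rng.random((2000, 3))   # columns: abstract_sim, out_citing_sim, shared_neighbours (hypothetical)
        y = (X @ np.array([2.0, 1.5, 1.0]) + rng.normal(0, 0.5, 2000) > 2.5).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        clf = LogisticRegression().fit(X_tr, y_tr)
        print("ROC AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))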

  17. Comparing NetCDF and SciDB on managing and querying 5D hydrologic dataset

    NASA Astrophysics Data System (ADS)

    Liu, Haicheng; Xiao, Xiao

    2016-11-01

    Efficiently extracting information from high dimensional hydro-meteorological modelling datasets requires smart solutions. Traditional methods are mostly file based, which makes the data easy to edit and access, but their contiguous storage structure limits efficiency. Databases have been proposed as an alternative, offering advantages such as native functionality for manipulating multidimensional (MD) arrays, smart caching strategies and scalability. In this research, NetCDF file-based solutions and the multidimensional array database management system (DBMS) SciDB, which applies a chunked storage structure, are benchmarked to determine the best solution for storing and querying a large 5D hydrologic modelling dataset. The effect of data storage configurations, including chunk size, dimension order and compression, on query performance is explored. Results indicate that the dimension order used to organize storage of the 5D data has a significant influence on query performance if the chunk size is very large, but the effect becomes insignificant when the chunk size is properly set. Compression in SciDB mostly has a negative influence on query performance. Caching is an advantage, but it may be affected by the execution of other query processes. On the whole, the NetCDF solution without compression is in general more efficient than the SciDB DBMS.
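
    As a hedged illustration of the kind of storage decision being benchmarked (the file, variable, dimension names, and chunk shapes below are invented), this sketch writes a 5D variable to a netCDF-4 file with explicit chunk sizes and compression disabled.

        import numpy as np
        from netCDF4 import Dataset

        with Dataset("hydro_5d.nc", "w", format="NETCDF4") as nc:
            for name, size in [("run", 4), ("time", 48), ("level", 10), ("y", 50), ("x", 50)]:
                nc.createDimension(name, size)
            var = nc.createVariable(
                "soil_moisture", "f4", ("run", "time", "level", "y", "x"),
                chunksizes=(1, 24, 10, 25, 25),   # chunk shape chosen to match typical queries
                zlib=False,                       # compression left off, in line with the paper's finding
            )
            var[:] = np.random.rand(4, 48, 10, 50, 50).astype("f4")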

  18. A multivariate model for predicting segmental body composition.

    PubMed

    Tian, Simiao; Mioche, Laurence; Denis, Jean-Baptiste; Morio, Béatrice

    2013-12-01

    The aims of the present study were to propose a multivariate model for simultaneously predicting body, trunk and appendicular fat and lean masses from easily measured variables and to compare its predictive capacity with that of the available univariate models that predict body fat percentage (BF%). The dual-energy X-ray absorptiometry (DXA) dataset (52% men and 48% women) with White, Black and Hispanic ethnicities (1999-2004, National Health and Nutrition Examination Survey) was randomly divided into three sub-datasets: a training dataset (TRD), a test dataset (TED) and a validation dataset (VAD), comprising 3835, 1917 and 1917 subjects, respectively. For each sex, several multivariate prediction models were fitted from the TRD using age, weight, height and possibly waist circumference. The most accurate model was selected using the TED and then applied to the VAD and a French DXA dataset (French DB) (526 men and 529 women) to assess the prediction accuracy in comparison with that of five published univariate models, for which adjusted formulas were re-estimated using the TRD. Waist circumference was found to improve the prediction accuracy, especially in men. For BF%, the standard error of prediction (SEP) values were 3.26 (3.75)% for men and 3.47 (3.95)% for women in the VAD (French DB), as good as those of the adjusted univariate models. Moreover, the SEP values for the prediction of body and appendicular lean masses ranged from 1.39 to 2.75 kg for both sexes. The prediction accuracy was best for age < 65 years, BMI < 30 kg/m² and the Hispanic ethnicity. The application of our multivariate model to large populations could be useful for addressing various public health issues.
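
    A hedged sketch of the multivariate idea only (not the published model or its coefficients): several composition outcomes are predicted jointly from age, weight, height, and waist circumference, and a per-outcome standard error of prediction (SEP) is computed on a held-out split; all data below are simulated.

        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(3)
        n = 2000
        age = rng.uniform(20, 80, n)
        weight = rng.normal(75, 12, n)
        height = rng.normal(170, 9, n)
        waist = rng.normal(90, 11, n)
        X = np.column_stack([age, weight, height, waist])

        # Simulated targets: total, trunk, and appendicular fat mass (kg).
        Y = np.column_stack([
            0.40 * weight + 0.20 * waist - 0.10 * height + rng.normal(0, 2.0, n),
            0.20 * weight + 0.15 * waist - 0.05 * height + rng.normal(0, 1.5, n),
            0.15 * weight + 0.05 * waist - 0.03 * height + rng.normal(0, 1.5, n),
        ])

        X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)
        model = LinearRegression().fit(X_tr, Y_tr)          # one joint fit for all outcomes
        sep = np.sqrt(((model.predict(X_te) - Y_te) ** 2).mean(axis=0))
        print("SEP per outcome (kg):", np.round(sep, 2))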

  19. Comparison of Radiative Energy Flows in Observational Datasets and Climate Modeling

    NASA Technical Reports Server (NTRS)

    Raschke, Ehrhard; Kinne, Stefan; Rossow, William B.; Stackhouse, Paul W. Jr.; Wild, Martin

    2016-01-01

    This study examines radiative flux distributions and the local spread of values from three major observational datasets (CERES, ISCCP, and SRB) and compares them with results from climate modeling (CMIP3). Examinations of the spread and differences also differentiate among contributions from cloudy and clear-sky conditions. The spread among observational datasets is in large part caused by noncloud ancillary data. Average differences of at least 10 W m^-2 each for clear-sky downward solar, upward solar, and upward infrared fluxes at the surface demonstrate, via spatial difference patterns, major differences in assumptions for atmospheric aerosol, solar surface albedo and surface temperature, and/or emittance in observational datasets. At the top of the atmosphere (TOA), observational datasets are less influenced by the ancillary data errors than at the surface. Comparisons of spatial radiative flux distributions at the TOA between observations and climate modeling indicate large deficiencies in the strength and distribution of model-simulated cloud radiative effects. Differences are largest for lower-altitude clouds over low-latitude oceans. Global modeling simulates cloud radiative effects (CRE) that are stronger by about +30 W m^-2 over trade wind cumulus regions, yet weaker by about -30 W m^-2 over the (smaller in area) stratocumulus regions. At the surface, climate modeling simulates on average about 15 W m^-2 smaller radiative net flux imbalances, as if climate modeling underestimates latent heat release (and precipitation). Relative to observational datasets, simulated surface net fluxes are particularly lower over oceanic trade wind regions (where global modeling tends to overestimate the radiative impact of clouds). Still, with the uncertainty in noncloud ancillary data, observational data do not establish a reliable reference.

  20. Using iterative cluster merging with improved gap statistics to perform online phenotype discovery in the context of high-throughput RNAi screens

    PubMed Central

    Yin, Zheng; Zhou, Xiaobo; Bakal, Chris; Li, Fuhai; Sun, Youxian; Perrimon, Norbert; Wong, Stephen TC

    2008-01-01

    Background The recent emergence of high-throughput automated image acquisition technologies has forever changed how cell biologists collect and analyze data. Historically, the interpretation of cellular phenotypes in different experimental conditions has been dependent upon the expert opinions of well-trained biologists. Such qualitative analysis is particularly effective in detecting subtle, but important, deviations in phenotypes. However, while the rapid and continuing development of automated microscope-based technologies now facilitates the acquisition of trillions of cells in thousands of diverse experimental conditions, such as in the context of RNA interference (RNAi) or small-molecule screens, the massive size of these datasets precludes human analysis. Thus, the development of automated methods that aim to identify novel and biologically relevant phenotypes online is one of the major challenges in high-throughput image-based screening. Ideally, phenotype discovery methods should be designed to utilize prior/existing information and tackle three challenging tasks, i.e. restoring pre-defined, biologically meaningful phenotypes, differentiating novel phenotypes from known ones, and distinguishing novel phenotypes from each other. Arbitrarily extracted information causes biased analysis, while combining the complete existing datasets with each new image is intractable in high-throughput screens. Results Here we present the design and implementation of a novel and robust online phenotype discovery method with broad applicability that can be used in diverse experimental contexts, especially high-throughput RNAi screens. This method features phenotype modelling and iterative cluster merging using improved gap statistics. A Gaussian Mixture Model (GMM) is employed to estimate the distribution of each existing phenotype and is then used as the reference distribution in the gap statistics. The method is broadly applicable to a number of different types of image-based datasets derived from a wide spectrum of experimental conditions and is suitable for adaptively processing new images that are continuously added to existing datasets. Validations were carried out on different datasets, including a published RNAi screen using Drosophila embryos [Additional files 1, 2], a dataset for cell cycle phase identification using HeLa cells [Additional files 1, 3, 4] and a synthetic dataset using polygons; our method tackled the three aforementioned tasks effectively with an accuracy range of 85%–90%. When our method is implemented in the context of a Drosophila genome-scale RNAi image-based screen of cultured cells aimed at identifying the contribution of individual genes towards the regulation of cell shape, it efficiently discovers meaningful new phenotypes and provides novel biological insight. We also propose a two-step procedure to modify the novelty detection method based on one-class SVM, so that it can be used for online phenotype discovery. In different conditions, we compared the SVM-based method with our method using various datasets, and our method consistently outperformed the SVM-based method in at least two of the three tasks by 2% to 5%. These results demonstrate that our method can be used to better identify novel phenotypes in image-based datasets from a wide range of conditions and organisms. Conclusion We demonstrate that our method can detect various novel phenotypes effectively in complex datasets. Experimental results also validate that our method performs consistently under different orders of image input, variations in starting conditions including the number and composition of existing phenotypes, and datasets from different screens. Based on these findings, the proposed method is suitable for online phenotype discovery in diverse high-throughput image-based genetic and chemical screens. PMID:18534020
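
    The sketch below illustrates only the phenotype-modelling step of such an approach (a Gaussian mixture fitted to known phenotypes and used as a reference distribution to flag low-likelihood cells); the iterative cluster merging with gap statistics is not reproduced, and all data are synthetic.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        known = np.vstack([rng.normal(0, 1, (300, 5)),
                           rng.normal(4, 1, (300, 5))])      # cells from two known phenotypes (synthetic features)
        gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(known)

        new_cells = np.vstack([rng.normal(0, 1, (50, 5)),
                               rng.normal(10, 1, (50, 5))])  # a batch from a newly added image
        loglik = gmm.score_samples(new_cells)
        threshold = np.percentile(gmm.score_samples(known), 1)   # cut-off from the reference distribution
        print("candidate novel-phenotype cells:", int((loglik < threshold).sum()))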

Top